The Kubernetes controller-runtime library provides a Prometheus metrics
endpoint by default. The Upjet-based providers, including
upbound/provider-aws, upbound/provider-azure, upbound/provider-azuread and
upbound/provider-gcp, expose various metrics from the controller-runtime to
help monitor the health of the runtime components, such as the
controller-runtime client, the leader election client and the controller
workqueues. In addition to these metrics, each controller also exposes various
metrics related to the reconciliation of the custom resources and the active
reconciliation worker goroutines.

In addition to the metrics exposed by the controller-runtime, the Upjet-based
providers also expose metrics specific to the Upjet runtime. The Upjet runtime
registers some custom metrics using the available extension mechanism, and
these metrics are available from the default `/metrics` endpoint of the
provider pod. The custom metrics exposed by the Upjet runtime are:

- `upjet_terraform_cli_duration`: This is a histogram metric and reports statistics, in seconds, on how long it takes a Terraform CLI invocation to complete.
- `upjet_terraform_active_cli_invocations`: This is a gauge metric and it's the number of active (running) Terraform CLI invocations.
- `upjet_terraform_running_processes`: This is a gauge metric and it's the number of running Terraform CLI and Terraform provider processes.
- `upjet_resource_ttr`: This is a histogram metric and it measures, in seconds, the time-to-readiness for managed resources.

Prometheus metrics can have labels associated with them to distinguish between the characteristics of the measurements being made, for example to tell the CLI processes apart from the Terraform provider processes when counting the active Terraform processes. The labels associated with each of the custom Upjet metrics above are:
- Labels associated with the `upjet_terraform_cli_duration` metric:
  - `subcommand`: The `terraform` subcommand that's run, e.g., `init`, `apply`, `plan`, `destroy`, etc.
  - `mode`: The execution mode of the Terraform CLI, one of `sync` (the CLI was invoked synchronously as part of a reconcile loop) or `async` (the CLI was invoked asynchronously, and the reconciler goroutine will poll and collect the results in the future).
- Labels associated with the `upjet_terraform_active_cli_invocations` metric:
  - `subcommand`: The `terraform` subcommand that's run, e.g., `init`, `apply`, `plan`, `destroy`, etc.
  - `mode`: The execution mode of the Terraform CLI, one of `sync` (the CLI was invoked synchronously as part of a reconcile loop) or `async` (the CLI was invoked asynchronously, and the reconciler goroutine will poll and collect the results in the future).
- Labels associated with the `upjet_terraform_running_processes` metric:
  - `type`: Either `cli` for the Terraform CLI processes (the `terraform` process) or `provider` for the Terraform provider processes. Please note that this is a best-effort metric that may not be able to precisely catch and report all relevant processes. We may, in the future, improve this if needed by, for example, watching the `fork` system calls. But currently, it may prove to be useful for watching rogue Terraform provider processes.
- Labels associated with the `upjet_resource_ttr` metric:
  - `group`, `version`, `kind`: These labels record the API group, version and kind of the managed resource whose time-to-readiness measurement is captured.
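
To make the label semantics more concrete, the following Python sketch parses a few lines in the Prometheus text exposition format and sums gauge values per label value. It is only an illustration: the sample text and its values are invented rather than captured from a real provider, and the helper names `parse_samples` and `sum_by` are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical excerpt of a provider's /metrics output; all values are invented.
SAMPLE = '''\
# TYPE upjet_terraform_active_cli_invocations gauge
upjet_terraform_active_cli_invocations{mode="sync",subcommand="plan"} 2
upjet_terraform_active_cli_invocations{mode="async",subcommand="apply"} 3
upjet_terraform_active_cli_invocations{mode="async",subcommand="destroy"} 1
# TYPE upjet_terraform_running_processes gauge
upjet_terraform_running_processes{type="cli"} 6
upjet_terraform_running_processes{type="provider"} 4
'''

# metric_name{label="value",...} value
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Return (name, labels, value) tuples, skipping comments and blank lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def sum_by(samples, metric, label):
    """Sum the values of one metric, grouped by the value of one label."""
    totals = defaultdict(float)
    for name, labels, value in samples:
        if name == metric and label in labels:
            totals[labels[label]] += value
    return dict(totals)


samples = parse_samples(SAMPLE)
by_mode = sum_by(samples, "upjet_terraform_active_cli_invocations", "mode")
by_type = sum_by(samples, "upjet_terraform_running_processes", "type")
print(by_mode)  # {'sync': 2.0, 'async': 4.0}
print(by_type)  # {'cli': 6.0, 'provider': 4.0}
```

Grouping by `mode` collapses the `subcommand` dimension, which is the same kind of aggregation a Prometheus query performs when it sums a metric by a single label.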

You can export all these custom metrics and the controller-runtime metrics
from the provider pod for Prometheus. Here are some examples showing the
custom metrics in action from the Prometheus console:
- `upjet_terraform_active_cli_invocations` gauge metric showing the sync & async `terraform init/apply/plan/destroy` invocations: [figure]
- `upjet_terraform_running_processes` gauge metric showing both `cli` and `provider` labels: [figure]
- `upjet_terraform_cli_duration` histogram metric, showing average Terraform CLI running times for the last 5m: [figure]
- The medians (0.5-quantiles) for these observations aggregated by the mode and Terraform subcommand being invoked: [figure]
- `upjet_resource_ttr` histogram metric, showing average resource TTR for the last 10m: [figure]

These samples were collected by provisioning 10 upbound/provider-aws
cognitoidp.UserPool resources while running the provider with a poll interval
of 1m. In these examples, one can observe that the resources were polled
(reconciled) twice after they acquired the Ready=True condition and that they
were destroyed afterwards.
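
The averages and medians in the examples above are derived from the cumulative `_bucket`, `_sum` and `_count` series of a histogram. As a rough illustration of that arithmetic, the Python sketch below computes an average and estimates a quantile by linear interpolation inside a bucket, which is the general approach taken by Prometheus's `histogram_quantile` function. The bucket boundaries and counts are invented and stand in for the increase observed within one query window; the function names are hypothetical.

```python
import math

# Hypothetical cumulative buckets: (upper bound in seconds, observations <= bound).
BUCKETS = [(1.0, 2), (5.0, 6), (10.0, 9), (math.inf, 10)]
TOTAL_SECONDS = 48.0  # hypothetical value of the _sum series
TOTAL_COUNT = 10      # hypothetical value of the _count series


def average(total_sum, total_count):
    """Mean observation: sum of observed values divided by their count."""
    return total_sum / total_count if total_count else math.nan


def estimate_quantile(q, buckets):
    """Estimate the q-quantile, assuming observations are spread evenly in a bucket."""
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            in_bucket = count - lower_count
            # The +Inf bucket has no finite upper bound to interpolate towards,
            # so fall back to the highest finite bound seen so far.
            if math.isinf(upper_bound) or in_bucket == 0:
                return lower_bound
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


print(average(TOTAL_SECONDS, TOTAL_COUNT))  # 4.8
print(estimate_quantile(0.5, BUCKETS))      # 4.0
```

Because the estimate is interpolated within a bucket, its precision depends on how the bucket boundaries are chosen relative to the observed durations.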
You can find a full reference of the exposed metrics from the Upjet-based providers here.
