Most useful Prometheus queries – Oleg Elantsev

Topk – largest k elements by sample value

topk(k, <metric_expression>)

Examples

Top 5 CPU usage metrics

topk(5, sum(rate(container_cpu_usage_seconds_total[5m])) by (container_name))

This query calculates the rate of CPU usage for each container over the last 5 minutes, sums it up by container_name, and then returns the top 5 containers with the highest CPU usage.

Top 10 HTTP Request Rates

topk(10, sum(rate(http_requests_total[5m])) by (instance))

This query calculates the rate of HTTP requests for each instance over the last 5 minutes and then returns the top 10 instances with the highest request rates.
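To make the two steps in these queries concrete (sum by a label, then keep the k largest), here is a minimal Python sketch of the same idea. It is not how Prometheus implements `topk`; the function name `topk_by_label` and the sample data are hypothetical and only illustrate the semantics.

```python
from collections import defaultdict
import heapq


def topk_by_label(k, series, label):
    """Sum per-series rates by one label, then keep the k largest sums."""
    totals = defaultdict(float)
    for labels, rate in series:
        # Series without the label fall into a single empty-label group
        totals[labels.get(label, "")] += rate
    # nlargest returns (label_value, total) pairs, highest total first
    return heapq.nlargest(k, totals.items(), key=lambda item: item[1])


# Hypothetical per-series request rates in requests per second
series = [
    ({"instance": "a:8080", "code": "200"}, 12.0),
    ({"instance": "a:8080", "code": "500"}, 0.5),
    ({"instance": "b:8080", "code": "200"}, 30.0),
    ({"instance": "c:8080", "code": "200"}, 4.0),
]

top2 = topk_by_label(2, series, "instance")
print(top2)  # [('b:8080', 30.0), ('a:8080', 12.5)]
```

Note that in a range query Prometheus evaluates topk separately at every step, so a graph can show more than k series over the whole time range.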

Request rate

Throughput

Note: the examples below apply to span metrics generated by the OpenTelemetry (OTel) span metrics connector.

The rate of processed requests can be queried as follows:

sum(rate(duration_seconds_count{job="<service_name>"}[5m]))

The resulting time series is measured in requests per second (RPS).

If you prefer to see requests per minute (RPM) then multiply it by 60:

sum(rate(duration_seconds_count{job="<service_name>"}[5m])) * 60
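As a rough illustration of what the rate over a 5-minute window means and why multiplying by 60 gives RPM, here is a simplified Python sketch. It handles counter resets but, unlike Prometheus, does not extrapolate to the window edges; the function name `simple_rate` and the samples are hypothetical.

```python
def simple_rate(samples):
    """Approximate per-second increase of a counter over a window.

    samples: list of (unix_timestamp_seconds, counter_value), oldest first.
    A drop in value is treated as a counter reset.
    Simplification: no extrapolation to the window boundaries.
    """
    if len(samples) < 2:
        return None
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # After a reset the counter restarts from zero,
        # so the new value itself is the increase
        increase += cur - prev if cur >= prev else cur
    return increase / elapsed


# Hypothetical scrapes of a request counter, one per minute, with one reset
samples = [(0, 10), (60, 40), (120, 70), (180, 100), (240, 30), (300, 60)]

rps = simple_rate(samples)
rpm = rps * 60
print(rps, rpm)  # 0.5 30.0
```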

Hit rate

It might be interesting to look at the rate of hits on the service. Ideally it should be close to the throughput; the throughput cannot be higher than the hit rate. As before, the result is measured in hits per second.

sum(rate(calls_total{}[5m]))

P95 response duration

histogram_quantile(0.95, sum(rate(duration_seconds_bucket{span_kind=~"SPAN_KIND_SERVER|SPAN_KIND_CONSUMER", job="<service_name>"}[5m])) by (le))
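The P95 value is an estimate: histogram_quantile finds the bucket that contains the requested rank and interpolates linearly inside it. The Python sketch below approximates that logic for classic histograms under simplifying assumptions (positive bucket bounds, a +Inf bucket present). The function mirrors the PromQL name only for readability, and the bucket data is hypothetical.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), including math.inf.
    Assumes all finite upper bounds are positive (e.g. durations).
    """
    buckets = sorted(buckets)
    if not 0 <= q <= 1 or len(buckets) < 2 or buckets[-1][0] != math.inf:
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the rank
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[idx]
    if upper == math.inf:
        # Rank falls into +Inf: report the highest finite bound
        return buckets[-2][0]
    lower, prev = (0.0, 0.0) if idx == 0 else buckets[idx - 1]
    if count == prev:
        return lower
    # Linear interpolation inside the selected bucket
    return lower + (upper - lower) * (rank - prev) / (count - prev)


# Hypothetical per-second bucket rates keyed by the "le" upper bound
buckets = [(0.1, 50.0), (0.5, 90.0), (1.0, 98.0), (math.inf, 100.0)]

p95 = histogram_quantile(0.95, buckets)
print(p95)  # about 0.8125 seconds
```

Because of the interpolation, the accuracy of the result depends on how well the bucket boundaries match the real latency distribution.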