Prometheus

Features

  • Multi-dimensional data model (time series identified by metric name and key/value labels).
  • Simple text-based metrics exposition format.
  • Pulls (scrapes) metrics over HTTP(S).
  • Ships with a local time series database (TSDB).
  • Supports service discovery (DNS, Consul, files, etc.) and static configuration (hardcoded targets).
  • Has client libraries for many languages (e.g. Python, Node.js, Go).
  • Primary data source for Grafana.

#Reading

#Local Testing Stack

//  - docker-compose.yaml - 
version: "3"


x-healthchecks: &healthchecks
  start_period: 20s
  interval: 10s
  timeout: 1s
  retries: 60

services:
  prom:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - data_prom:/prometheus
    command:
      - --config.file=/etc/prometheus/prometheus.yml
    healthcheck:
      <<: *healthchecks
      test:
        - "CMD-SHELL"
        - "wget --no-verbose --tries=1 --spider '0.0.0.0:9090/-/healthy' || exit 1"

volumes:
  data_prom: ~
//  - prometheus.yml - 
global:
  scrape_interval: 15s
  evaluation_interval: 15s
rule_files:
scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 5s
    static_configs:
    - targets:
      - 'localhost:9090'

#Metrics Format

# Prometheus metrics format
<metric name>{<label name>=<label value>, ...} <value>
# Example
http_requests_total{job="apiserver", handler="/api/comments"} 10
  • Metric - http_requests_total
  • Meta / Labels - job="apiserver", handler="/api/comments"
  • Value - 10

#Types

The metric type is declared in the metric's # TYPE annotation.

  • counter represents a value that only goes up (http_requests_total, etc.). Counters are useful for monitoring the rate of an event, like requests per second.
  • histogram counts observations that fall into predefined buckets, like request durations by latency bucket (request_duration_seconds_bucket{le="0.5"} 32); see the exposition sketch after this list.
  • summary is similar to a histogram, but doesn’t require pre-defined buckets: quantiles are computed on the client side, which is useful when you want quantiles but aren’t sure which bucket ranges to use.
  • gauge represents a single value that can go both up and down, like memory allocations (go_memstats_alloc_bytes).
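
As an illustration, a counter and a gauge in the text exposition format (the sample values are hypothetical):

# HELP http_requests_total Total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{handler="/api/comments"} 10
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 2.780744e+06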

#Labels

Label names starting with __ (such as __name__) are reserved for internal usage.

Labels in Prometheus are divided into two types:

  • Instrumentation labels come from an instrumented application (type of HTTP request, etc).

  • Target labels identify a specific monitoring target and relate more to architecture and infrastructure. Different teams may have different ideas of what a “team”, “region”, or “service” is, so the instrumented app shouldn’t expose these labels itself; instead, leave that to the relabeling feature (see the sketch below). Labels most likely used as target labels: env, cluster, service, team, zone, and region.
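
A minimal sketch of attaching target labels at scrape time; the job name, targets, and label values are hypothetical, and the relabel rule derives a service label from the target address:

scrape_configs:
  - job_name: 'api'
    static_configs:
      - targets: ['api-1:8080', 'api-2:8080']
        # static target labels attached to every series scraped from these targets
        labels:
          env: testing
          team: backend
    relabel_configs:
      # extract the part before "-<number>:<port>" from the address and use it as the service label
      - source_labels: [__address__]
        regex: '(.*)-\d+:\d+'
        target_label: service
        replacement: '$1'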

#Annotations

# HELP latency_seconds Latency in seconds.
# TYPE latency_seconds summary
latency_seconds_sum{path="/foo"} 1.0

#Naming

#Suffixes

  • _total is a counter
  • _count is the number of observations (from a summary or histogram)
  • _sum is the sum of all observed values (from a summary or histogram)
  • _bucket is a cumulative histogram bucket (with an le label); see the sample below
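
A single histogram typically exposes all three observation suffixes; a hypothetical latency histogram might look like this:

# HELP request_duration_seconds Request latency in seconds.
# TYPE request_duration_seconds histogram
request_duration_seconds_bucket{le="0.1"} 20
request_duration_seconds_bucket{le="0.5"} 32
request_duration_seconds_bucket{le="+Inf"} 40
request_duration_seconds_sum 9.2
request_duration_seconds_count 40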

#PromQL

# selecting by metric name
> nifi_stor_used_bytes
nifi_stor_used_bytes{env="testing", instance="nifi-export:9103", job="nifi-export", location="default", node_id="aggregate", type="content"}
	37968252928
nifi_stor_used_bytes{env="testing", instance="nifi-export:9103", job="nifi-export", location="default", node_id="aggregate", type="flow_file"}
	37968252928
nifi_stor_used_bytes{env="testing", instance="nifi-export:9103", job="nifi-export", location="default", node_id="aggregate", type="provenance"}

# aggregate, ignoring the type label
> sum without(type)(nifi_stor_used_bytes)
{env="testing", instance="nifi-export:9103", job="nifi-export", location="default", node_id="aggregate"} 113900236800
> avg without(type)(nifi_stor_used_bytes)
{env="testing", instance="nifi-export:9103", job="nifi-export", location="default", node_id="aggregate"} 	37967511552


# per-second average rate over the last 5 minutes
> rate(nifi_amount_bytes_transferred[5m])
{component_id="1e8d6df2-d134-37e2-817a-88dc311642d3", component_name="default", component_type="ProcessGroup", exported_instance="nifi-node1", instance="nifi-node1:9092", job="nifi-analitics", parent_id="457d5c1e-018d-1000-1c9f-1eac4f5e00d1"}
	17180.74216684518
{component_id="329b6cc7-bdcf-3041-ba4b-795d6edda52e", component_name="blackbox", component_type="ProcessGroup", exported_instance="nifi-node1", instance="nifi-node1:9092", job="nifi-analitics", parent_id="1e8d6df2-d134-37e2-817a-88dc311642d3"}
	6872.296866738071
{component_id="457d5c1e-018d-1000-1c9f-1eac4f5e00d1", component_name="NiFi Flow", component_type="RootProcessGroup", exported_instance="nifi-node1", instance="nifi-node1:9092", job="nifi-analitics"}
	17180.74216684518

#PromQL Matcher Operators

  • = - equality matcher
  • != - negative equality matcher
  • =~ - regexp matcher
  • !~ - negative regexp matcher

Notes:

# Return all TS with the metric http_requests_total:
http_requests_total
# Return all TS with the metric http_requests_total and the given job and labels:
http_requests_total{job="apiserver", handler="/api/comments"}
# 5 minutes vector
http_requests_total{job="apiserver", handler="/api/comments"}[5m]
# Using regular expressions, you could select time series only for
# jobs whose name match a certain pattern, in this case, all jobs that end with server:
http_requests_total{job=~".*server"}
# To select all HTTP status codes except 4xx ones, you could run:
http_requests_total{status!~"4.."}
#

#Subqueries

# Return the 5-minute rate of the http_requests_total metric for the past 30 minutes,
# with a resolution of 1 minute.
rate(http_requests_total[5m])[30m:1m]
# The subquery for the deriv function uses the default resolution.
# Note that using subqueries unnecessarily is unwise.
max_over_time(deriv(rate(distance_covered_total[5s])[30s:5s])[10m:])

#Using functions, operators, etc.

# Return the per-second rate for all time series with the http_requests_total
# metric name, as measured over the last 5 minutes:
rate(http_requests_total[5m])

Assuming that the http_requests_total time series all have the labels job (fanout by job name) and instance (fanout by instance of the job), we might want to sum over the rate of all instances, so we get fewer output time series, but still preserve the job dimension:

sum by (job) (
  rate(http_requests_total[5m])
)

#PromQL Aggregation Operators

  • sum(metric) returns the sum of the matching series.
  • max(metric) returns the maximum value of a gauge.
  • avg(metric) returns the average value of a metric.
  • / is the division operator; it matches time series with the same labels (see the ratio example after this list).

There is a without clause in case you need to omit specific labels from an aggregation (sum without(label1, label2)(metric)).
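
For example, division combined with aggregation gives a ratio; this sketch assumes http_requests_total carries a status label, as in the matcher examples above:

# fraction of requests per job that returned a 5xx status, over the last 5 minutes
  sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
/
  sum by (job) (rate(http_requests_total[5m]))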

#Push vs Pull

#pushgateway

The Pushgateway is intended for cases where the pull model doesn’t fit, e.g. short-lived batch jobs that exit before they can be scraped. Keep its caveats in mind:

  • When monitoring multiple instances through a single Pushgateway, the Pushgateway becomes both a single point of failure and a potential bottleneck.
  • You lose Prometheus’s automatic instance health monitoring via the up metric (generated on every scrape).
  • The Pushgateway never forgets series pushed to it and will expose them to Prometheus forever unless those series are manually deleted via the Pushgateway’s API.
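
A sketch of the corresponding scrape config (the hostname is hypothetical); honor_labels: true keeps the job and instance labels pushed by clients instead of overwriting them with the Pushgateway’s own:

scrape_configs:
  - job_name: 'pushgateway'
    honor_labels: true
    static_configs:
      - targets: ['pushgateway:9091']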

#Exporters
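
Exporters expose metrics on behalf of systems that don’t speak the Prometheus format natively. As an example, node_exporter serves host metrics on port 9100 by default; a minimal scrape config (the target address is hypothetical):

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']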

#Service Discovery

#Consul

global:
  scrape_interval: 15s
  scrape_timeout: 15s
scrape_configs:
- job_name: 'consul-prometheus'
  consul_sd_configs:
    - server: 'consul:8500'
      services: ['foobar-service']

#DNS

global:
  scrape_interval:     5s

scrape_configs:
  - job_name: 'promtail'
    dns_sd_configs:
      - names: [ promtail ]
        type: A
        port: 9080

#Using Files

file_sd_config

global:
  evaluation_interval: 1m
  scrape_interval: 30s
  scrape_timeout: 10s
remote_write:
  - url: http://localhost:8080/workspaces/WORKSPACE/api/v1/remote_write
scrape_configs:
  - job_name: ecs_services
    file_sd_configs:
      - files:
          - /etc/config/ecs-services.json
        refresh_interval: 30s
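
The referenced file uses the standard file_sd target format: a JSON (or YAML) list of target groups, each with targets and optional labels. A sketch of what /etc/config/ecs-services.json might contain (addresses and labels are hypothetical):

[
  {
    "targets": ["10.0.1.17:9090", "10.0.1.18:9090"],
    "labels": {
      "env": "testing",
      "service": "foobar-service"
    }
  }
]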