May 16, 2020

Prometheus 四大度量指标的了解和应用

在上一个章节中我们完成了 Prometheus 的基本概念了解和安装，由于考虑到看我博客的估计是开发向的小伙伴居多，因此没有再更深入。而今天本章节将介绍我们开发用的最多的度量指标，并结合实战对 Metrics 进行使用和细节分析。

什么是度量指标

来自维基百科

度量是指对于一个物体或是事件的某个性质给予一个数字，使其可以和其他物体或是事件的相同性质比较。度量可以是对一物理量（如长度、尺寸或容量等）的估计或测定，也可以是其他较抽象的特质。

简单来讲，也就是数据的量化，形成对应的数据指标。

Prometheus 的指标格式

在 Prometheus 中，我们的指标表示格式如下：

<metric name>{<label name>=<label value>, ...}

主体为指标名称和标签组成：

api_http_requests_total{method="POST", handler="/eddycjy"}

对外提供 metrics 服务

首先创建一个示例项目：

func main() {
    engine := gin.New()
    engine.GET("/hello", func(c *gin.Context) {
        c.String(http.StatusOK, "煎鱼")
    })
    engine.Run(":10001")
}

接下我们需要安装 Prometheus Client SDK，在 Go 语言中对应 prometheus/client_golang 库：

$ go get github.com/prometheus/client_golang

然后调用 promhttp.Handler 方法创建对应的 metrics：

func main() {
    ...
    engine.GET("/metrics", gin.WrapH(promhttp.Handler()))
    engine.Run(":10001")
}

重新启动程序，并访问 http://127.0.0.1:10001/metrics：

# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 0
go_gc_duration_seconds{quantile="0.25"} 0
go_gc_duration_seconds{quantile="0.5"} 0
go_gc_duration_seconds{quantile="0.75"} 0
go_gc_duration_seconds{quantile="1"} 0
go_gc_duration_seconds_sum 0
go_gc_duration_seconds_count 0
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 8
# HELP go_info Information about the Go environment.
# TYPE go_info gauge
go_info{version="go1.14.2"} 1
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 2.563056e+06
...

我们可以聚焦其中一个指标：

# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 8

你会发现其具有固定的表示格式，分别是指标的含义、指标的类型、指标的具体字段和数值。而在 promhttp.Handler 方法所暴露出来的 metrics 数值，虽然看似很多，但你认真看一下，可以主体为两块：

go_memstats 开头的指标都是 runtime.MemStats 的格式化数值。
promhttp_metric 开头的指标是 HTTP 服务的状态码统计。

Prometheus 四大度量指标的了解和应用

Counter（计数器）

Counter 类型代表一个累积的指标数据，其单调递增，只增不减。在应用场景中，像是请求次数、错误数量等等，就非常适合用 Counter 来做指标类型，另外 Counter 类型，只有在被采集端重新启动时才会归零。

Counter 类型一共包含两个常规方法，如下：

方法名	作用
Inc	将计数器递增 1。
Add(float64)	将给定值添加到计数器，如果设置的值 < 0，则发生错误。

实战演练

Counter 类型是单纯的累积类计数，最基础的就是在访问请求的时候进行分类统计，在上文的示例项目中继续添加代码：

var AccessCounter = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "api_requests_total",
    },
    []string{"method", "path"},
)

func init() {
    prometheus.MustRegister(AccessCounter)
}

func main() {
    ...
    engine.GET("/counter", func(c *gin.Context) {
        purl, _ := url.Parse(c.Request.RequestURI)
        AccessCounter.With(prometheus.Labels{
            "method": c.Request.Method,
            "path":   purl.Path,
        }).Add(1)
    })
    engine.GET("/metrics", gin.WrapH(promhttp.Handler()))
    engine.Run(":10001")
}

这时候我们访问 http://127.0.0.1:10001/counter，就可以发现 metrics +1：

# HELP api_requests_total 
# TYPE api_requests_total counter
api_requests_total{method="GET",path="/counter"} 1

如果希望对全部请求进行记录和统计，我们可以利用拦截器来实现，但是在添加 Labels 时需要注意一点，就是你所定义的指标 Labels 和实际写入时的 Labels 要对应，否则会造成 panic：

2020/05/17 11:01:06 http: panic serving 127.0.0.1:53393: inconsistent label cardinality: expected 3 label values but got 2 in prometheus.Labels{"method":"GET", "path":"/hello"}
goroutine 51 [running]:
net/http.(*conn).serve.func1(0xc0000ee000)
        /usr/local/Cellar/go/1.14.2_1/libexec/src/net/http/server.go:1772 +0x139
panic(0x16272a0, 0xc00009c130)
        /usr/local/Cellar/go/1.14.2_1/libexec/src/runtime/panic.go:975 +0x3e3
github.com/prometheus/client_golang/prometheus.(*CounterVec).With(0xc0001347e0, 0xc00009a4e0, 0x16ea903, 0x4)
        /Users/eddycjy/go/pkg/mod/github.com/prometheus/client_golang@v1.6.0/prometheus/counter.go:259 +0xc2

Gauge（仪表盘）

Gauge 类型代表一个可以任意变化的指标数据，其可增可减。在应用场景中，像是 Go 应用程序运行时的 Goroutine 的数量就可以用该类型来表示，因为其是浮动的数值，并非固定的，侧重于反馈当前的情况。

Gauge 类型一共包含六个常规方法，如下：

方法名	作用
Set(float64)	将仪表设置为任意值。
Inc()	将仪表增加 1。
Dec()	将仪表减少 1。
Add(float64)	将给定值添加到仪表，该值如果为负数，那么将导致仪表值减少。
Sub(float64)	从仪表中减去给定值，该值如果为负数，那么将导致仪表值增加。
SetToCurrentTime()	将仪表设置为当前Unix时间（以秒为单位）。

实战演练

Gauge 类型是每次都重新设置的统计类型，在系统中统计 CPU、Memory 等等时很常见，而在业务场景中，业务队列的数量也可以用 Gauge 来统计，实时观察队列数量，及时发现堆积情况：

var QueueGauge = prometheus.NewGaugeVec(
    prometheus.GaugeOpts{
        Name: "queue_num_total",
    },
	[]string{"name"},
)

func init() {
    prometheus.MustRegister(AccessCounter)
}

func main() {
    ...
    engine.GET("/queue", func(c *gin.Context) {
        num := c.Query("num")
        fnum, _ := strconv.ParseFloat(num, 32)
        QueueGauge.With(prometheus.Labels{"name": "queue_eddycjy"}).Set(fnum)
    })
    engine.GET("/metrics", gin.WrapH(promhttp.Handler()))
    engine.Run(":10001")
}

访问 http://127.0.0.1:10001/queue?num=5 后，再查看 metrics 结果：

# HELP queue_num_total 
# TYPE queue_num_total gauge
queue_num_total{name="queue_eddycjy"} 5

另外 Gauge 类型也支持各种增减方法，大家根据实际情况调用即可。

Histogram（累积直方图）

Histogram 类型将会在一段时间范围内对数据进行采样（通常是请求持续时间或响应大小等等），并将其计入可配置的存储桶（bucket）中，后续可通过指定区间筛选样本，也可以统计样本总数。

简单来讲，也就是在配置 Histogram 类型时，我们会设置分组区间，例如要分析请求的响应时间，我们可以分为 0-100ms，100-500ms，500-1000ms 等等区间段，那么在 metrics 的上报接口中，将会分为多个维度显示统计情况。

Histogram 类型一共包含一个常规方法，如下：

方法名	作用
Observe(float64)	将一个观察值添加到直方图。

实战演练

Histogram 类型在应用场景中非常的常用，因为其代表的就是分组区间的统计，而在分布式场景盛行的现在，链路追踪系统是必不可少的，那么针对不同的链路的分析统计就非常的有必要，例如像是对 RPC、SQL、HTTP、Redis 的 P90、P95、P99 进行计算统计，并且更进一步的做告警，就能够及时的发现应用链路缓慢，进而发现和减少第三方系统的影响。

我们模仿记录 HTTP 调用响应时间的应用场景：

var HttpDurationsHistogram = prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
        Name:    "http_durations_histogram_seconds",
        Buckets: []float64{0.2, 0.5, 1, 2, 5, 10, 30},
    },
    []string{"path"},
)

func init() {
    prometheus.MustRegister(HttpDurationsHistogram)
}

func main() {
	...
    engine.GET("/histogram", func(c *gin.Context) {
        purl, _ := url.Parse(c.Request.RequestURI)
        HttpDurationsHistogram.With(prometheus.Labels{"path": purl.Path}).Observe(float64(rand.Intn(30)))
    })
    engine.GET("/metrics", gin.WrapH(promhttp.Handler()))
    engine.Run(":10001")
}

多次调用 http://127.0.0.1:10001/histogram，查看 metrics：

# HELP http_durations_histogram_seconds 
# TYPE http_durations_histogram_seconds histogram
http_durations_histogram_seconds_bucket{path="/histogram",le="0.2"} 1
http_durations_histogram_seconds_bucket{path="/histogram",le="0.5"} 1
http_durations_histogram_seconds_bucket{path="/histogram",le="1"} 3
http_durations_histogram_seconds_bucket{path="/histogram",le="2"} 3
http_durations_histogram_seconds_bucket{path="/histogram",le="5"} 3
http_durations_histogram_seconds_bucket{path="/histogram",le="10"} 3
http_durations_histogram_seconds_bucket{path="/histogram",le="30"} 13
http_durations_histogram_seconds_bucket{path="/histogram",le="+Inf"} 13
http_durations_histogram_seconds_sum{path="/histogram"} 191
http_durations_histogram_seconds_count{path="/histogram"} 13

我们结合 histogram metrics 的结果来看，可以发现其分为了三个部分：

http_durations_histogram_seconds_bucket：在 Buckets 中你可以发现一共包含 8 个值，分别代表：0-0.2s、0.2-0.5s、0.5-1s、1-2s、2-5s、5-10s、10-30s 以及大于 30s（+Inf），这是我们在 HistogramOpts.Buckets 中所定义的区间值。
http_durations_histogram_seconds_sum：调用的总耗时。
http_durations_histogram_seconds_count：调用总次数。

Histogram 是一个比较精巧类型，首先 Buckets 的分布区间要根据你的实际应用情况，合理的设置，否则就会出现不均，自然而然 PXX（P95、P99 等）计算也就会有问题，同时在 Grafana 上的绘图也会出现偏差，因此需要在理论上多多理解，然后再进行具体的设置，否则后期改来改去会比较麻烦

同时我们也可以利用 http_durations_histogram_seconds_sum 和 http_durations_histogram_seconds_count 相除得出平均耗时，一举多得。

Summary（摘要）

Summary 类型将会在一段时间范围内对数据进行采样，但是与 Histogram 类型不同的是 Summary 类型将会存储分位数（在客户端进行计算），而不像 Histogram 类型，根据所设置的区间情况统计存储。

Summary 类型在采样计算后，一共提供三种摘要指标，如下：

样本值的分位数分布情况。
所有样本值的大小总和。
样本总数。

Summary 类型一共包含一个常规方法，如下：

方法名	作用
Observe(float64)	将一个观察值添加到摘要。

实战演练

Summary 类型主要是

var HttpDurations = prometheus.NewSummaryVec(
    prometheus.SummaryOpts{
        Name:       "http_durations_seconds",
        Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
    },
    []string{"path"},
)

func init() {
    prometheus.MustRegister(HttpDurations)
}

func main() {
    ...
    engine.GET("/summary", func(c *gin.Context) {
        purl, _ := url.Parse(c.Request.RequestURI)
        HttpDurations.With(prometheus.Labels{"path": purl.Path}).Observe(float64(rand.Intn(30)))
    })
    engine.GET("/metrics", gin.WrapH(promhttp.Handler()))
    engine.Run(":10001")
}

多次调用 http://127.0.0.1:10001/summary，查看 metrics：

# HELP http_durations_seconds 
# TYPE http_durations_seconds summary
http_durations_seconds{path="/summary",quantile="0.5"} 17
http_durations_seconds{path="/summary",quantile="0.9"} 29
http_durations_seconds{path="/summary",quantile="0.99"} 29
http_durations_seconds_sum{path="/summary"} 85
http_durations_seconds_count{path="/summary"} 5

结合 summary metrics 来看，同样分为了三个部分：

http_durations_seconds：分别是中位数（0.5），9 分位数（0.9）以及 99 分位数（0.99），对应 SummaryOpts.Objectives 中我们所定义的中位数，而各自的意义代表着中位数（0.5）的耗时为 17s，9 分位数为 29s，99 分位数为 29s。
http_durations_seconds_sum：调用总耗时。
http_durations_seconds_count：调用总次数。

小结

在本章节中我们介绍并实操了 Prometheus 的四种度量指标类型 Counter、Gauge、Histogram、Summary，这四种度量类型都极具代表性：Counter 是单调递增的计数器，Gauge 是可任意调整数值的仪表盘，Histogram 是分组区间统计，Summary 是中位数统计。

其中 Histogram 和 Summary 具有一定的 “相似” 度，因为在 Histogram 指标中我们可以通过 histogram_quantile 函数计算出分位值，而 Summary 也可以计算分位值，两者区别就在于 Histogram 是在服务端计算的，而 Summary 是在客户端就进行了计算，其一个计算好了再推上去，一个直接推上去，数据维度不一样，可以做的事情也不一样，有利有弊，具体可以根据指标的实际情况做衡量。

另外针对度量指标的命名，这是一个非常多人问的问题，因为命名是一个难题，在这里大家可以参照官方的文档建议去针对指标命名就可以了。