Golang Learning Notes: Troubleshooting a Memory Leak in a k8s Operator
Recently, an operator in one of our projects started leaking memory after a scheduled-reconcile feature was added, although the overall growth was slow.
I tracked the leak down with pprof and go tool. This was my first time debugging a Go memory leak, so I am recording the process and the results here.
1. Background
The operator was built on the sigs.k8s.io/controller-runtime libraries, but SyncPeriod (periodic reconciliation) had never been configured.
A while ago, a new requirement called for periodic reconciliation, so the logic was added with a one-minute interval. After the change, memory started climbing slowly: environment A grew by 100 MB over 60 hours, and environment B by 800 MB over the same period. The only difference between the two environments is the number of instances the operator watches.
2. Troubleshooting Process
- The investigation was done with pprof. First, wire pprof into the source:

```go
import (
    ...
    "net/http"
    _ "net/http/pprof"
    ...
)
...
func main() {
    ...
    go func() {
        klog.Info("pprof start")
        http.ListenAndServe("0.0.0.0:6060", nil) // must be 0.0.0.0; with 127.0.0.1 the endpoint is unreachable from outside the Pod
    }()
    ...
}
```

- Since the leak appeared right after the scheduled-reconcile logic was added, my first guess was that periodic reconciliation was the cause. To amplify the effect, I raised the reconcile frequency from once a minute to once every 10 seconds, then rebuilt the image and deployed it to the environment.
```go
syncPeriod := time.Second * 10 // 10 s
mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
    ...
    SyncPeriod: &syncPeriod,
})
```

- Once the image had been running for a while and the pod's memory usage had clearly risen, I collected the memory-allocation profile recorded by pprof. The host had no Go toolchain, so I fetched the data with curl first and copied it to my local machine for analysis.

```shell
# pod IP
curl http://xxx.xxx.xxx.xxx:6060/debug/pprof/heap > heap.out
```

- Analyze the pprof file with go tool. The browser view renders graphs and is more intuitive, but the command line is handier for tracing through code.
```shell
# browser view (local)
go tool pprof -http=:8080 heap.out
# command-line view
go tool pprof /root/code/pprof/heap.out
```

- The pprof (inuse_space) results are as follows; only the core parts are shown:
```
(pprof) top
Showing nodes accounting for 58.50MB, 87.89% of 66.56MB total
Showing top 10 nodes out of 160
flat flat% sum% cum cum%
20MB 30.05% 30.05% 40.44MB 60.76% sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
16.44MB 24.69% 54.74% 16.44MB 24.69% sigs.k8s.io/controller-runtime/pkg/log.(*loggerPromise).WithValues
6.05MB 9.09% 63.83% 6.05MB 9.09% k8s.io/apimachinery/pkg/runtime.(*RawExtension).UnmarshalJSON
5MB 7.52% 71.35% 5MB 7.52% runtime.allocm
3MB 4.51% 75.86% 19.44MB 29.20% sigs.k8s.io/controller-runtime/pkg/log.(*DelegatingLogger).WithValues
...
(pprof)
(pprof) list reconcileHandler
Total: 66.56MB
ROUTINE ======================== sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler in /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go
20MB 40.44MB (flat, cum) 60.76% of Total
...
20MB 39.44MB 240: log := c.Log.WithValues("name", req.Name, "namespace", req.Namespace) # note
...
(pprof)
(pprof) list .WithValues
Total: 66.56MB
ROUTINE ======================== sigs.k8s.io/controller-runtime/pkg/log.(*DelegatingLogger).WithValues in /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/log/deleg.go
3MB 19.44MB (flat, cum) 29.20% of Total
...
3MB 3MB 112: res := &DelegatingLogger{Logger: l.Logger}
. 16.44MB 113: promise := l.promise.WithValues(res, tags...)
...
ROUTINE ======================== sigs.k8s.io/controller-runtime/pkg/log.(*loggerPromise).WithValues in /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/log/deleg.go
16.44MB 16.44MB (flat, cum) 24.69% of Total
...
. . 51:func (p *loggerPromise) WithValues(l *DelegatingLogger, tags ...interface{}) *loggerPromise {
15MB 15MB 52: res := &loggerPromise{
...
1.44MB 1.44MB 60: p.childPromises = append(p.childPromises, res)
...
```
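As an aside, the payload that `curl .../debug/pprof/heap` downloads can also be captured in-process with the standard runtime/pprof package, which is handy when the pprof HTTP port is not exposed. A minimal sketch (the function name `captureHeapProfile` is mine, not from the operator):

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"runtime/pprof"
)

// captureHeapProfile returns the same binary payload that the
// /debug/pprof/heap endpoint serves.
func captureHeapProfile() ([]byte, error) {
	runtime.GC() // refresh the heap statistics, like the endpoint's gc query parameter does
	var buf bytes.Buffer
	// debug=0 emits the compressed binary format that `go tool pprof` expects.
	if err := pprof.Lookup("heap").WriteTo(&buf, 0); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	b, err := captureHeapProfile()
	if err != nil {
		panic(err)
	}
	fmt.Println(len(b) > 0) // a non-empty profile was captured
}
```

The resulting bytes can be written to a file and fed to `go tool pprof` exactly like the curl-fetched heap.out.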
3. Findings
As shown above, most of the memory is held inside reconcileHandler() and WithValues(). Tracing through the code shows that this is all logging-related, and since it lives entirely in a third-party library, a library bug was the prime suspect.
After searching issues and posts online, it did turn out to be a known bug in the third-party library sigs.k8s.io/controller-runtime. Roughly: the library's deferred (delegating) logger expects the user to supply a concrete log implementation; until one is provided, log records pile up in memory. The higher the reconcile frequency, the faster records are produced and accumulate, so memory keeps growing and eventually the process OOMs.
In other words, the problem had existed all along; it simply went unnoticed because, without periodic reconciliation, the reconcile frequency was too low.
Related issues & fix:
issue: https://github.com/kubernetes-sigs/controller-runtime/issues/1444
issue: https://github.com/kubernetes-sigs/controller-runtime/issues/1122
fix: https://github.com/kubernetes-sigs/controller-runtime/pull/1309
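To make the failure mode concrete, here is a minimal stand-in (my own simplified model, not the real controller-runtime code) for the promise tree behind DelegatingLogger: every WithValues call retains a child promise, and nothing is released until the promise is fulfilled with a real logger.

```go
package main

import "fmt"

// promise is a toy stand-in for controller-runtime's loggerPromise: it buffers
// the key/value tags of a WithValues call until a real logger is supplied.
type promise struct {
	tags     []interface{}
	children []*promise
}

// WithValues mirrors loggerPromise.WithValues: each call appends a child
// promise, so without Fulfill the slice grows on every reconcile.
func (p *promise) WithValues(tags ...interface{}) *promise {
	child := &promise{tags: tags}
	p.children = append(p.children, child)
	return child
}

// Fulfill models handing the buffered tags over to a concrete logger, which
// lets the retained promises be garbage-collected.
func (p *promise) Fulfill() {
	p.children = nil
}

func main() {
	root := &promise{}
	for i := 0; i < 1000; i++ { // one WithValues per reconcile, as in reconcileHandler
		root.WithValues("name", "foo", "namespace", "bar")
	}
	fmt.Println(len(root.children)) // prints 1000: every reconcile retains one promise
}
```

With a 10-second SyncPeriod this append runs six times a minute per object, which matches the steady growth of `loggerPromise.WithValues` in the profile above.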
4. Fix and Verification
- Research turned up two main ways to fix the problem:
  - Upgrade controller-runtime to 0.8.x (fixed in 2021)
  - Provide an implementation for the deferred logger so records cannot accumulate
- I chose to plug in an empty implementation that simply discards the logs. The code change:
```go
...
func main() {
    ...
    ctrl.SetLogger(logf.NullLogger{})
    ...
}
```

- After rebuilding the image, deploying it, and letting it run for a while, memory no longer rose noticeably. Problem solved.
Appendix
- Full pprof output
```
(pprof) top
Showing nodes accounting for 58.50MB, 87.89% of 66.56MB total
Showing top 10 nodes out of 160
flat flat% sum% cum cum%
20MB 30.05% 30.05% 40.44MB 60.76% sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
16.44MB 24.69% 54.74% 16.44MB 24.69% sigs.k8s.io/controller-runtime/pkg/log.(*loggerPromise).WithValues
6.05MB 9.09% 63.83% 6.05MB 9.09% k8s.io/apimachinery/pkg/runtime.(*RawExtension).UnmarshalJSON
5MB 7.52% 71.35% 5MB 7.52% runtime.allocm
3MB 4.51% 75.86% 19.44MB 29.20% sigs.k8s.io/controller-runtime/pkg/log.(*DelegatingLogger).WithValues
2.51MB 3.76% 79.62% 2.51MB 3.76% k8s.io/apimachinery/pkg/apis/meta/v1.(*FieldsV1).UnmarshalJSON
1.50MB 2.26% 81.88% 1.50MB 2.26% reflect.mapassign
1.50MB 2.25% 84.13% 1.50MB 2.25% reflect.New
1.50MB 2.25% 86.39% 2.50MB 3.76% github.com/json-iterator/go.(*Iterator).ReadString
1MB 1.50% 87.89% 1MB 1.50% github.com/json-iterator/go.(*Iterator).readStringSlowPath
(pprof)
(pprof) list reconcileHandler
Total: 66.56MB
ROUTINE ======================== sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler in /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go
20MB 40.44MB (flat, cum) 60.76% of Total
. . 235: c.Log.Error(nil, "Queue item was not a Request", "type", fmt.Sprintf("%T", obj), "value", obj)
. . 236: // Return true, don't take a break
. . 237: return true
. . 238: }
. . 239:
20MB 39.44MB 240: log := c.Log.WithValues("name", req.Name, "namespace", req.Namespace)
. . 241:
. . 242: // RunInformersAndControllers the syncHandler, passing it the namespace/Name string of the
. . 243: // resource to be synced.
. 1.01MB 244: if result, err := c.Do.Reconcile(req); err != nil {
. . 245: c.Queue.AddRateLimited(req)
. . 246: log.Error(err, "Reconciler error")
. . 247: ctrlmetrics.ReconcileErrors.WithLabelValues(c.Name).Inc()
. . 248: ctrlmetrics.ReconcileTotal.WithLabelValues(c.Name, "error").Inc()
. . 249: return false
(pprof)
(pprof) list .WithValues
Total: 66.56MB
ROUTINE ======================== sigs.k8s.io/controller-runtime/pkg/log.(*DelegatingLogger).WithValues in /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/log/deleg.go
3MB 19.44MB (flat, cum) 29.20% of Total
. . 107:func (l *DelegatingLogger) WithValues(tags ...interface{}) logr.Logger {
. . 108: if l.promise == nil {
. . 109: return l.Logger.WithValues(tags...)
. . 110: }
. . 111:
3MB 3MB 112: res := &DelegatingLogger{Logger: l.Logger}
. 16.44MB 113: promise := l.promise.WithValues(res, tags...)
. . 114: res.promise = promise
. . 115:
. . 116: return res
. . 117:}
. . 118:
ROUTINE ======================== sigs.k8s.io/controller-runtime/pkg/log.(*loggerPromise).WithValues in /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/log/deleg.go
16.44MB 16.44MB (flat, cum) 24.69% of Total
. . 47: return res
. . 48:}
. . 49:
. . 50:// WithValues provides a new Logger with the tags appended
. . 51:func (p *loggerPromise) WithValues(l *DelegatingLogger, tags ...interface{}) *loggerPromise {
15MB 15MB 52: res := &loggerPromise{
. . 53: logger: l,
. . 54: tags: tags,
. . 55: promisesLock: sync.Mutex{},
. . 56: }
. . 57:
. . 58: p.promisesLock.Lock()
. . 59: defer p.promisesLock.Unlock()
1.44MB 1.44MB 60: p.childPromises = append(p.childPromises, res)
. . 61: return res
. . 62:}
. . 63:
. . 64:// Fulfill instantiates the Logger with the provided logger
. . 65:func (p *loggerPromise) Fulfill(parentLogger logr.Logger) {
```