golang学习记录:k8s operator内存泄露排查

  • 最近项目上某个operator在新增定时调谐逻辑后,出现了内存泄露的问题,但整体内存增长速度不是很快。

  • 本次内存泄露使用pprofgo tool排查,这也是我第一次排查go的内存泄露问题,所以记录一下排查流程和结果

一、背景

  某个operator之前使用sigs.k8s.io/controller-runtime相关的库来实现,但是未配置SyncPeriod定时调谐;

  前段时间,有需求要求新增定时调谐逻辑,遂新增此函数,每分钟定时调谐一次;但是新增此函数后,发现出现内存缓慢升高的现象:环境A内存60小时提升了100M,环境B内存60小时提升了800M,两环境唯一的差别为operator watch的实例数不同。

二、排查过程

  • 本次内存泄露排查,使用工具pprof进行。首先为源码引入pprof:
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    import (
    ...
    "net/http"
    _ "net/http/pprof"
    ...
    )

    ...

    func main() {
    ...
    go func() {
    klog.Info("pprof start")
    http.ListenAndServe("0.0.0.0:6060", nil) // 此处应为0.0.0.0 使用127.0.0.1会导致Pod外无法直接访问
    }()
    ...
    }
  • 由于本次内存泄露是新增定时调谐逻辑后出现,初步猜测是定时调谐导致内存泄露,故将调谐频率由1分钟1次调整至10秒一次,随后编译镜像上传环境。
    1
    2
    3
    4
    5
    6
    syncPeriod := time.Second * 10 // 10 s
    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
    ...
    SyncPeriod: &syncPeriod,
    })

  • 待镜像运行一段时间,pod内存占用明显升高时,采集pprof统计的内存分配信息。由于宿主机没有go环境,故使用curl先采集数据,拷到本地分析。
    1
    2
    # podip
    curl http://xxx.xxx.xxx.xxx:6060/debug/pprof/heap > heap.out
  • 使用go tool分析pprof文件,其中浏览器展示可以看图,会更直观一点,但是代码追踪没命令行分析方便。
    1
    2
    3
    4
    # 本地浏览器展示
    go tool pprof -http=:8080 heap.out
    # 命令行展示
    go tool pprof /root/code/pprof/heap.out
  • pprof(inuse_space)分析结果如下,仅截取核心部分:
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    (pprof) top
    Showing nodes accounting for 58.50MB, 87.89% of 66.56MB total
    Showing top 10 nodes out of 160
    flat flat% sum% cum cum%
    20MB 30.05% 30.05% 40.44MB 60.76% sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    16.44MB 24.69% 54.74% 16.44MB 24.69% sigs.k8s.io/controller-runtime/pkg/log.(*loggerPromise).WithValues
    6.05MB 9.09% 63.83% 6.05MB 9.09% k8s.io/apimachinery/pkg/runtime.(*RawExtension).UnmarshalJSON
    5MB 7.52% 71.35% 5MB 7.52% runtime.allocm
    3MB 4.51% 75.86% 19.44MB 29.20% sigs.k8s.io/controller-runtime/pkg/log.(*DelegatingLogger).WithValues
    ...
    (pprof)
    (pprof) list reconcileHandler
    Total: 66.56MB
    ROUTINE ======================== sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler in /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go
    20MB 40.44MB (flat, cum) 60.76% of Total
    ...
    20MB 39.44MB 240: log := c.Log.WithValues("name", req.Name, "namespace", req.Namespace) # 注意
    ...
    (pprof)
    (pprof) list .WithValues
    Total: 66.56MB
    ROUTINE ======================== sigs.k8s.io/controller-runtime/pkg/log.(*DelegatingLogger).WithValues in /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/log/deleg.go
    3MB 19.44MB (flat, cum) 29.20% of Total
    ...
    3MB 3MB 112: res := &DelegatingLogger{Logger: l.Logger}
    . 16.44MB 113: promise := l.promise.WithValues(res, tags...)
    ...
    ROUTINE ======================== sigs.k8s.io/controller-runtime/pkg/log.(*loggerPromise).WithValues in /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/log/deleg.go
    16.44MB 16.44MB (flat, cum) 24.69% of Total
    ...
    . . 51:func (p *loggerPromise) WithValues(l *DelegatingLogger, tags ...interface{}) *loggerPromise {
    15MB 15MB 52: res := &loggerPromise{
    ...
    1.44MB 1.44MB 60: p.childPromises = append(p.childPromises, res)
    ...

三、排查结果

  如上所示,可以发现内存占用大部分都是reconcileHandler()WithValues()内占用的,通过代码追踪可以看到这块代码是log相关的内容,但是都是第三方的库,所以优先考虑是第三方库bug。

  经过去网上查issue和帖子,发现确实是第三方库sigs.k8s.io/controller-runtime的bug,问题大概如下:

  • sigs.k8s.io/controller-runtime内延迟log需要用户自己提供日志实现,假如用户不自己实现,则日志会堆积,直至延迟log被实现。

  • 调谐频率越高,则日志生成越快、堆积越多,日志堆积会导致内存逐渐升高,最终导致OOM。

  所以实际上这个问题一直存在,只是之前没有定时调谐,调谐频率不高导致未被发现。

相关issue & fix如下:

四、修复与测试

  • 经过调研,可以得出此问题主要有两种解决方案
  1. 升级controller-runtime版本至0.8.x (2021年修复)
  2. 为延迟log定义实现,避免日志堆积
  • 最终选择在代码内新增空实现延迟log,丢弃日志。代码改动如下:
    1
    2
    3
    4
    5
    6
    ...
    func main() {
    ...
    ctrl.SetLogger(logf.NullLogger{})
    ...
    }
  • 编译镜像上传环境,待镜像运行一段时间,内存未明显升高,此问题解决。

附录

  • pprof完整内容
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    42
    43
    44
    45
    46
    47
    48
    49
    50
    51
    52
    53
    54
    55
    56
    57
    58
    59
    60
    61
    62
    63
    64
    65
    66
    67
    68
    69
    70
    71
    72
    (pprof) top
    Showing nodes accounting for 58.50MB, 87.89% of 66.56MB total
    Showing top 10 nodes out of 160
    flat flat% sum% cum cum%
    20MB 30.05% 30.05% 40.44MB 60.76% sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    16.44MB 24.69% 54.74% 16.44MB 24.69% sigs.k8s.io/controller-runtime/pkg/log.(*loggerPromise).WithValues
    6.05MB 9.09% 63.83% 6.05MB 9.09% k8s.io/apimachinery/pkg/runtime.(*RawExtension).UnmarshalJSON
    5MB 7.52% 71.35% 5MB 7.52% runtime.allocm
    3MB 4.51% 75.86% 19.44MB 29.20% sigs.k8s.io/controller-runtime/pkg/log.(*DelegatingLogger).WithValues
    2.51MB 3.76% 79.62% 2.51MB 3.76% k8s.io/apimachinery/pkg/apis/meta/v1.(*FieldsV1).UnmarshalJSON
    1.50MB 2.26% 81.88% 1.50MB 2.26% reflect.mapassign
    1.50MB 2.25% 84.13% 1.50MB 2.25% reflect.New
    1.50MB 2.25% 86.39% 2.50MB 3.76% github.com/json-iterator/go.(*Iterator).ReadString
    1MB 1.50% 87.89% 1MB 1.50% github.com/json-iterator/go.(*Iterator).readStringSlowPath
    (pprof)
    (pprof) list reconcileHandler
    Total: 66.56MB
    ROUTINE ======================== sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler in /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go
    20MB 40.44MB (flat, cum) 60.76% of Total
    . . 235: c.Log.Error(nil, "Queue item was not a Request", "type", fmt.Sprintf("%T", obj), "value", obj)
    . . 236: // Return true, don't take a break
    . . 237: return true
    . . 238: }
    . . 239:
    20MB 39.44MB 240: log := c.Log.WithValues("name", req.Name, "namespace", req.Namespace)
    . . 241:
    . . 242: // RunInformersAndControllers the syncHandler, passing it the namespace/Name string of the
    . . 243: // resource to be synced.
    . 1.01MB 244: if result, err := c.Do.Reconcile(req); err != nil {
    . . 245: c.Queue.AddRateLimited(req)
    . . 246: log.Error(err, "Reconciler error")
    . . 247: ctrlmetrics.ReconcileErrors.WithLabelValues(c.Name).Inc()
    . . 248: ctrlmetrics.ReconcileTotal.WithLabelValues(c.Name, "error").Inc()
    . . 249: return false
    (pprof)
    (pprof) list .WithValues
    Total: 66.56MB
    ROUTINE ======================== sigs.k8s.io/controller-runtime/pkg/log.(*DelegatingLogger).WithValues in /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/log/deleg.go
    3MB 19.44MB (flat, cum) 29.20% of Total
    . . 107:func (l *DelegatingLogger) WithValues(tags ...interface{}) logr.Logger {
    . . 108: if l.promise == nil {
    . . 109: return l.Logger.WithValues(tags...)
    . . 110: }
    . . 111:
    3MB 3MB 112: res := &DelegatingLogger{Logger: l.Logger}
    . 16.44MB 113: promise := l.promise.WithValues(res, tags...)
    . . 114: res.promise = promise
    . . 115:
    . . 116: return res
    . . 117:}
    . . 118:
    ROUTINE ======================== sigs.k8s.io/controller-runtime/pkg/log.(*loggerPromise).WithValues in /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/log/deleg.go
    16.44MB 16.44MB (flat, cum) 24.69% of Total
    . . 47: return res
    . . 48:}
    . . 49:
    . . 50:// WithValues provides a new Logger with the tags appended
    . . 51:func (p *loggerPromise) WithValues(l *DelegatingLogger, tags ...interface{}) *loggerPromise {
    15MB 15MB 52: res := &loggerPromise{
    . . 53: logger: l,
    . . 54: tags: tags,
    . . 55: promisesLock: sync.Mutex{},
    . . 56: }
    . . 57:
    . . 58: p.promisesLock.Lock()
    . . 59: defer p.promisesLock.Unlock()
    1.44MB 1.44MB 60: p.childPromises = append(p.childPromises, res)
    . . 61: return res
    . . 62:}
    . . 63:
    . . 64:// Fulfill instantiates the Logger with the provided logger
    . . 65:func (p *loggerPromise) Fulfill(parentLogger logr.Logger) {