Tracking Down the Strange Logs That Keep Flooding dmesg

One of our internal systems runs on top of K8S, and the dmesg output on that batch of machines keeps getting flooded with lines like these:

03:52:42 localhost kernel: nf_conntrack: falling back to vmalloc.
03:52:42 localhost kernel: IPVS: Creating netns size=2048 id=97719
03:54:37 localhost kernel: nf_conntrack: falling back to vmalloc.
03:54:37 localhost kernel: nf_conntrack: falling back to vmalloc.
03:54:37 localhost kernel: IPVS: Creating netns size=2048 id=97720
03:56:48 localhost kernel: nf_conntrack: falling back to vmalloc.
03:56:48 localhost kernel: nf_conntrack: falling back to vmalloc.
03:56:48 localhost kernel: IPVS: Creating netns size=2048 id=97721
03:58:20 localhost kernel: IPVS: Creating netns size=2048 id=97722

On their own, these logs wouldn't be much of a problem, but over time logrotate pushes the older entries out, so when you actually want to look back at earlier dmesg output it's already gone, which is rather painful. After playing ostrich for a good long while, I finally decided to track down the cause and fix this properly.

If you search Google for the keywords IPVS: Creating netns size=XXX id=XXX or nf_conntrack: falling back to vmalloc, the proposed solutions are all rather underwhelming. For nf_conntrack, most answers say that vm.min_free_kbytes is too small and should be raised, or that the net.netfilter.nf_conntrack_* parameters are too large and should be lowered. For the IPVS message there are essentially no relevant results at all. So we had to work it out ourselves.

At first, my hunch was that this was most likely a K8S problem: for one thing, these logs only show up on machines running K8S; for another, nf_conntrack and IPVS are tightly coupled to networking, and kube-proxy, one of the two components running on every K8S Node, periodically syncs network configuration, which made kube-proxy the obvious prime suspect. So I cranked kube-proxy's log level up to the maximum so that all of its runtime tracing output would be printed as well.

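For reference, here is a hedged sketch of how that verbosity bump can be done; the config path and unit name below are assumptions (they vary by deployment), but --v itself is kube-proxy's standard klog verbosity flag:

# Add the verbosity flag to kube-proxy's startup arguments, e.g.:
kube-proxy --config=/etc/kubernetes/kube-proxy.yaml --v=5

# When kube-proxy runs as a systemd service, restart it and follow the output:
systemctl restart kube-proxy
journalctl -u kube-proxy -f

With the verbosity maxed out, the logs looked like this:
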
22:10:23 localhost kube-proxy[20761]: I0611 22:10:23.021677   20761 proxier.go:708] Syncing iptables rules
22:10:23 localhost kube-proxy[20761]: I0611 22:10:23.021703   20761 iptables.go:437] running iptables -N [KUBE-EXTERNAL-SERVICES -t filter]
...
22:10:23 localhost kube-proxy[20761]: I0611 22:10:23.038491   20761 iptables.go:397] running iptables-restore [-w --noflush --counters]
22:10:23 localhost kube-proxy[20761]: I0611 22:10:23.040739   20761 proxier.go:687] syncProxyRules took 19.090909ms
22:10:23 localhost kube-proxy[20761]: I0611 22:10:23.040756   20761 bounded_frequency_runner.go:221] sync-runner: ran, next possible in 0s, periodic in 30s
22:10:23 localhost kube-proxy[20761]: I0611 22:10:23.109303   20761 config.go:167] Calling handler.OnEndpointsUpdate
22:10:23 localhost kube-proxy[20761]: I0611 22:10:23.842263   20761 config.go:167] Calling handler.OnEndpointsUpdate
22:10:25 localhost kube-proxy[20761]: I0611 22:10:25.120048   20761 config.go:167] Calling handler.OnEndpointsUpdate
...
22:10:51 localhost kube-proxy[20761]: I0611 22:10:51.268046   20761 config.go:167] Calling handler.OnEndpointsUpdate
22:10:52 localhost kube-proxy[20761]: I0611 22:10:52.018512   20761 config.go:167] Calling handler.OnEndpointsUpdate
22:10:53 localhost kube-proxy[20761]: I0611 22:10:53.040907   20761 proxier.go:708] Syncing iptables rules
22:10:53 localhost kube-proxy[20761]: I0611 22:10:53.040933   20761 iptables.go:437] running iptables -N [KUBE-EXTERNAL-SERVICES -t filter]
...
22:10:53 localhost kube-proxy[20761]: I0611 22:10:53.056819   20761 iptables.go:397] running iptables-restore [-w --noflush --counters]
22:10:53 localhost kube-proxy[20761]: I0611 22:10:53.058812   20761 proxier.go:687] syncProxyRules took 17.935822ms
22:10:53 localhost kube-proxy[20761]: I0611 22:10:53.058827   20761 bounded_frequency_runner.go:221] sync-runner: ran, next possible in 0s, periodic in 30s
22:10:53 localhost kube-proxy[20761]: I0611 22:10:53.278609   20761 config.go:167] Calling handler.OnEndpointsUpdate
22:10:54 localhost kube-proxy[20761]: I0611 22:10:54.026900   20761 config.go:167] Calling handler.OnEndpointsUpdate
22:10:55 localhost kube-proxy[20761]: I0611 22:10:55.287980   20761 config.go:167] Calling handler.OnEndpointsUpdate

From these logs, kube-proxy clearly does sync the iptables rules on a timer, but the periods don't match: the dmesg lines appear roughly every two minutes, while the sync here runs every 30s, and the sync timestamps don't line up with the dmesg timestamps either. So kube-proxy can basically be ruled out as the cause. That said, the logs did surface something unexpected: Calling handler.OnEndpointsUpdate fires at a surprisingly high frequency, which didn't match my expectations at all. I'll set that aside for now and dig into it in the next post.

Since the direct cause wasn't kube-proxy, the hunt had to continue.

So I went back and carefully picked through the journalctl output again. With all the various K8S components chiming in, the logs were a complete mess, but persistence pays off.

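One trick that helped (a hedged sketch; the timestamps come from the dmesg excerpt above, and the exact --since/--until formats accepted depend on the journalctl version) is to query a narrow window around each dmesg timestamp, so that only the simultaneous events remain:

# Show everything journald recorded around one of the dmesg timestamps:
journalctl --since "03:52:40" --until "03:52:45" --no-pager

Doing this for a few of the timestamps finally turned up entries that line up exactly with the dmesg output:
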
03:52:42 localhost dbus[11450]: [system] Activating via systemd: service name='org.freedesktop.hostname1' unit='dbus-org.freedesktop.hostname1.service'
03:52:42 localhost dbus[11450]: [system] Successfully activated service 'org.freedesktop.hostname1'
03:54:37 localhost dbus[11450]: [system] Activating via systemd: service name='org.freedesktop.hostname1' unit='dbus-org.freedesktop.hostname1.service'
03:54:37 localhost dbus[11450]: [system] Successfully activated service 'org.freedesktop.hostname1'
03:56:48 localhost dbus[11450]: [system] Activating via systemd: service name='org.freedesktop.hostname1' unit='dbus-org.freedesktop.hostname1.service'
03:56:48 localhost dbus[11450]: [system] Successfully activated service 'org.freedesktop.hostname1'
03:58:20 localhost dbus[11450]: [system] Activating via systemd: service name='org.freedesktop.hostname1' unit='dbus-org.freedesktop.hostname1.service'
03:58:20 localhost dbus[11450]: [system] Successfully activated service 'org.freedesktop.hostname1'

Could it be systemd-hostnamed? Let's look at hostnamed's own logs via journalctl -u systemd-hostnamed.service:

03:52:42 localhost systemd[1]: Starting Hostname Service...
03:52:42 localhost systemd[1]: Started Hostname Service.
03:54:37 localhost systemd[1]: Starting Hostname Service...
03:54:37 localhost systemd[1]: Started Hostname Service.
03:56:48 localhost systemd[1]: Starting Hostname Service...
03:56:48 localhost systemd[1]: Started Hostname Service.
03:58:20 localhost systemd[1]: Starting Hostname Service...
03:58:20 localhost systemd[1]: Started Hostname Service.

It really is. Which raises two questions: why does starting hostnamed print those two dmesg lines, and why does hostnamed keep getting started on a schedule like this?

Let me take the second question first, since that part was already known. All the configuration on our systems is pushed out by puppet, our automation tool, and to guard against configs being changed by hand, we added a cron job on top that re-runs it every 2 minutes plus a small random delay, so that at any given moment the machines' configuration matches what we expect.
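
A purely hypothetical illustration of that schedule, in /etc/crontab form (the real entry differs, and $RANDOM assumes cron's SHELL is bash):

# Re-apply the configuration every 2 minutes, plus up to ~30s of random splay:
*/2 * * * * root sleep $((RANDOM % 30)); puppet agent --onetime --no-daemonize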

And among those configs is the machine's hostname. The definition looks something like this:

exec { 'set hostname':
    command => "hostnamectl set-hostname ${local_hostname}",
    unless  => "test `hostnamectl --static` == '${local_hostname}'",
    path    => ['/usr/bin', '/bin']
}

To explain briefly: every run executes hostnamectl --static to read the current hostname, and if it doesn't match the expected value, hostnamectl set-hostname ${local_hostname} is invoked to change the local hostname to what we want.
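
The subtle part, which the dbus logs above already hint at, is that even this read-only query goes through D-Bus and activates systemd-hostnamed on demand. A quick hedged sketch to confirm that on a test machine (assuming hostnamed is currently idle):

# hostnamed is not running while idle (more on that below):
systemctl is-active systemd-hostnamed   # prints "inactive"

# Even a read-only query talks to org.freedesktop.hostname1 over D-Bus...
hostnamectl --static

# ...which D-Bus-activates the service:
systemctl is-active systemd-hostnamed   # now prints "active"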

Now for the first question: why does hostnamed produce those dmesg lines when it starts? One look at the service's unit file made it clear:

[root@]# cat /usr/lib/systemd/system/systemd-hostnamed.service
#  This file is part of systemd.
#
#  systemd is free software; you can redistribute it and/or modify it
#  under the terms of the GNU Lesser General Public License as published by
#  the Free Software Foundation; either version 2.1 of the License, or
#  (at your option) any later version.

[Unit]
Description=Hostname Service
Documentation=man:systemd-hostnamed.service(8) man:hostname(5) man:machine-info(5)
Documentation=http://www.freedesktop.org/wiki/Software/systemd/hostnamed

[Service]
ExecStart=/usr/lib/systemd/systemd-hostnamed
BusName=org.freedesktop.hostname1
CapabilityBoundingSet=CAP_SYS_ADMIN
WatchdogSec=3min
PrivateTmp=yes
PrivateDevices=yes
PrivateNetwork=yes
ProtectSystem=yes
ProtectHome=yes

Note the PrivateNetwork=yes option in the [Service] section. What does that option do? The corresponding explanation is in man systemd.exec:

PrivateNetwork=
Takes a boolean argument. If true, sets up a new network namespace for the executed processes and configures only the loopback network device “lo” inside it. No other network devices will be available to the executed process. This is useful to securely turn off network access by the executed process. Defaults to false. It is possible to run two or more units within the same private network namespace by using the JoinsNamespaceOf= directive, see systemd.unit(5) for details. Note that this option will disconnect all socket families from the host, this includes AF_NETLINK and AF_UNIX. The latter has the effect that AF_UNIX sockets in the abstract socket namespace will become unavailable to the processes (however, those located in the file system will continue to be accessible).

In other words, with PrivateNetwork enabled, systemd creates a new network namespace when it starts the service, isolating the process's network from the host's. Since hostnamed has no need to touch the network, the unit ships with this restriction by default. That explains why the IPVS module logs Creating netns: a new netns really is being created each time. Likewise, the nf_conntrack output comes from the per-namespace initialization that module has to perform.
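
This is easy to verify without hostnamed at all: any new network namespace should trigger the same dmesg lines, as long as the conntrack and ipvs modules are loaded the way they are on these K8S nodes. A hedged sketch (run as root; the ids in the output will differ):

# Create a throwaway network namespace, run a no-op inside it, tear it down:
unshare --net true

# The per-netns module initialization should have logged again:
dmesg | tail -n 3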

The hostnamed service is also a bit special. Reading or setting the hostname is not exactly a frequent operation, and since D-Bus can activate the service whenever it's needed, the process simply exits on its own after it has been idle for a while. You can see this in its source:

int main(int argc, char *argv[]) {
        Context context = {};
        ...
        r = bus_event_loop_with_idle(event, bus, "org.freedesktop.hostname1", DEFAULT_EXIT_USEC, NULL, NULL);
        if (r < 0) {
                log_error_errno(r, "Failed to run event loop: %m");
                goto finish;
        }

finish:
        context_free(&context);

        return r < 0 ? EXIT_FAILURE : EXIT_SUCCESS;
}

DEFAULT_EXIT_USEC is defined in another file; its definition is #define DEFAULT_EXIT_USEC (30*USEC_PER_SEC), i.e. 30 seconds.
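
That idle exit is easy to watch happen, assuming nothing else queries org.freedesktop.hostname1 in the meantime:

hostnamectl --static > /dev/null        # activates the service
systemctl is-active systemd-hostnamed   # active
sleep 31                                # DEFAULT_EXIT_USEC is 30s
systemctl is-active systemd-hostnamed   # inactive again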

At this point, the whole chain is clear:

1. The automation script checks the hostname on a schedule, which pulls hostnamed up each time
2. Because the hostnamed service description contains PrivateNetwork=yes, systemd creates a network namespace for it
3. Because the conntrack and ipvs modules are loaded on the host (kube-proxy needs them), they initialize themselves inside the new netns, and due to certain parameter settings they print those log lines while doing so
4. hostnamed exits on its own after 30s, so the next run creates yet another network namespace, and the cycle repeats

Given all of the above, the fix is fairly obvious:

1. Adjust the automation script to stop using hostnamectl to read the current hostname (a sketch follows this list)
2. Tune the conntrack-related parameters sensibly, to avoid exhausting kernel memory as far as possible
3. Our workload happens to be special in that it doesn't need kube-proxy, so we decided to stop the kube-proxy service altogether; that removes the dependency on the ipvs module, so we can unload it as well
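
For the first item, here is a minimal sketch of the adjusted check, assuming it's acceptable to read the hostname from the kernel or from /etc/hostname directly; neither command touches D-Bus, so hostnamed stays asleep:

# Compare against the live kernel hostname (what `hostname` prints):
test "$(cat /proc/sys/kernel/hostname)" = "${local_hostname}"

# Or against the static hostname, the same file hostnamectl --static reads:
test "$(cat /etc/hostname)" = "${local_hostname}"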

Honestly, these PrivateXXXX options have bitten us more than once; a while back, PrivateTmp=yes caused a service to hang on startup. Sigh, it's a long story.