

03:52:42 localhost kernel: nf_conntrack: falling back to vmalloc.
03:52:42 localhost kernel: IPVS: Creating netns size=2048 id=97719
03:54:37 localhost kernel: nf_conntrack: falling back to vmalloc.
03:54:37 localhost kernel: nf_conntrack: falling back to vmalloc.
03:54:37 localhost kernel: IPVS: Creating netns size=2048 id=97720
03:56:48 localhost kernel: nf_conntrack: falling back to vmalloc.
03:56:48 localhost kernel: nf_conntrack: falling back to vmalloc.
03:56:48 localhost kernel: IPVS: Creating netns size=2048 id=97721
03:58:20 localhost kernel: IPVS: Creating netns size=2048 id=97722


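Searching Google directly for keywords such as IPVS: Creating netns size=XXX id=XXX or nf_conntrack: falling back to vmalloc turns up nothing very useful. For nf_conntrack, most answers say either that vm.min_free_kbytes is too small and should be raised, or that some of the net.netfilter.nf_conntrack_* parameters are too large and should be lowered. For the IPVS message there are essentially no relevant results. So we have to work through the problem ourselves.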
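Before changing any of these parameters it helps to see their current values. The following is a minimal sketch that reads them from /proc/sys. The list of parameter names is only an illustrative selection, and the `root` argument is a hypothetical hook added here so the function can be pointed at another directory.

```python
from pathlib import Path

# Illustrative selection; the search results only mention
# vm.min_free_kbytes and the net.netfilter.nf_conntrack_* family.
SYSCTLS = [
    "vm.min_free_kbytes",
    "net.netfilter.nf_conntrack_max",
    "net.netfilter.nf_conntrack_buckets",
]


def read_sysctl(name, root="/proc/sys"):
    """Return the value of a sysctl as a string, or None if it cannot be read."""
    # A dotted sysctl name maps to a path below /proc/sys.
    path = Path(root, *name.split("."))
    try:
        return path.read_text().strip()
    except OSError:
        # Missing file, e.g. the nf_conntrack module is not loaded.
        return None


if __name__ == "__main__":
    for name in SYSCTLS:
        print(name, "=", read_sysctl(name))
```

A value of None for the nf_conntrack entries usually just means the module is not loaded on that host.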


22:10:23 localhost kube-proxy[20761]: I0611 22:10:23.021677   20761 proxier.go:708] Syncing iptables rules
22:10:23 localhost kube-proxy[20761]: I0611 22:10:23.021703   20761 iptables.go:437] running iptables -N [KUBE-EXTERNAL-SERVICES -t filter]
22:10:23 localhost kube-proxy[20761]: I0611 22:10:23.038491   20761 iptables.go:397] running iptables-restore [-w --noflush --counters]
22:10:23 localhost kube-proxy[20761]: I0611 22:10:23.040739   20761 proxier.go:687] syncProxyRules took 19.090909ms
22:10:23 localhost kube-proxy[20761]: I0611 22:10:23.040756   20761 bounded_frequency_runner.go:221] sync-runner: ran, next possible in 0s, periodic in 30s
22:10:23 localhost kube-proxy[20761]: I0611 22:10:23.109303   20761 config.go:167] Calling handler.OnEndpointsUpdate
22:10:23 localhost kube-proxy[20761]: I0611 22:10:23.842263   20761 config.go:167] Calling handler.OnEndpointsUpdate
22:10:25 localhost kube-proxy[20761]: I0611 22:10:25.120048   20761 config.go:167] Calling handler.OnEndpointsUpdate
22:10:51 localhost kube-proxy[20761]: I0611 22:10:51.268046   20761 config.go:167] Calling handler.OnEndpointsUpdate
22:10:52 localhost kube-proxy[20761]: I0611 22:10:52.018512   20761 config.go:167] Calling handler.OnEndpointsUpdate
22:10:53 localhost kube-proxy[20761]: I0611 22:10:53.040907   20761 proxier.go:708] Syncing iptables rules
22:10:53 localhost kube-proxy[20761]: I0611 22:10:53.040933   20761 iptables.go:437] running iptables -N [KUBE-EXTERNAL-SERVICES -t filter]
22:10:53 localhost kube-proxy[20761]: I0611 22:10:53.056819   20761 iptables.go:397] running iptables-restore [-w --noflush --counters]
22:10:53 localhost kube-proxy[20761]: I0611 22:10:53.058812   20761 proxier.go:687] syncProxyRules took 17.935822ms
22:10:53 localhost kube-proxy[20761]: I0611 22:10:53.058827   20761 bounded_frequency_runner.go:221] sync-runner: ran, next possible in 0s, periodic in 30s
22:10:53 localhost kube-proxy[20761]: I0611 22:10:53.278609   20761 config.go:167] Calling handler.OnEndpointsUpdate
22:10:54 localhost kube-proxy[20761]: I0611 22:10:54.026900   20761 config.go:167] Calling handler.OnEndpointsUpdate
22:10:55 localhost kube-proxy[20761]: I0611 22:10:55.287980   20761 config.go:167] Calling handler.OnEndpointsUpdate

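The logs confirm that kube-proxy does sync the iptables rules periodically, but the period does not match: the messages in dmesg appear roughly every two minutes, whereas kube-proxy syncs every 30s, and the sync times do not line up with the kernel messages either. That essentially rules out kube-proxy as the cause. The logs do show something else unusual, though: Calling handler.OnEndpointsUpdate is printed quite frequently, which is not what we expected. We set that aside for now and will analyze it in the next post.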
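The period comparison can be checked with a few lines of Python. This is a small sketch that uses only timestamps copied from the two logs above; the helper name `gaps_seconds` is made up for this example.

```python
from datetime import datetime


def gaps_seconds(timestamps):
    """Return the gaps, in seconds, between consecutive HH:MM:SS timestamps."""
    times = [datetime.strptime(t, "%H:%M:%S") for t in timestamps]
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


# Times of the "IPVS: Creating netns" kernel messages.
kernel_times = ["03:52:42", "03:54:37", "03:56:48", "03:58:20"]
# Times of the "Syncing iptables rules" kube-proxy messages.
sync_times = ["22:10:23", "22:10:53"]

kernel_gaps = gaps_seconds(kernel_times)
sync_gaps = gaps_seconds(sync_times)

print("kernel message gaps:", kernel_gaps)
print("kube-proxy sync gaps:", sync_gaps)
```

The kernel messages are spaced irregularly at around two minutes, while the kube-proxy sync runs on a fixed 30s period, so the two do not share a trigger.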



03:52:42 localhost dbus[11450]: [system] Activating via systemd: service name='org.freedesktop.hostname1' unit='dbus-org.freedesktop.hostname1.service'
03:52:42 localhost dbus[11450]: [system] Successfully activated service 'org.freedesktop.hostname1'
03:54:37 localhost dbus[11450]: [system] Activating via systemd: service name='org.freedesktop.hostname1' unit='dbus-org.freedesktop.hostname1.service'
03:54:37 localhost dbus[11450]: [system] Successfully activated service 'org.freedesktop.hostname1'
03:56:48 localhost dbus[11450]: [system] Activating via systemd: service name='org.freedesktop.hostname1' unit='dbus-org.freedesktop.hostname1.service'
03:56:48 localhost dbus[11450]: [system] Successfully activated service 'org.freedesktop.hostname1'
03:58:20 localhost dbus[11450]: [system] Activating via systemd: service name='org.freedesktop.hostname1' unit='dbus-org.freedesktop.hostname1.service'
03:58:20 localhost dbus[11450]: [system] Successfully activated service 'org.freedesktop.hostname1'

This looks like systemd-hostnamed. Let's check the hostnamed log with journalctl -u systemd-hostnamed.service:

03:52:42 localhost systemd[1]: Starting Hostname Service...
03:52:42 localhost systemd[1]: Started Hostname Service.
03:54:37 localhost systemd[1]: Starting Hostname Service...
03:54:37 localhost systemd[1]: Started Hostname Service.
03:56:48 localhost systemd[1]: Starting Hostname Service...
03:56:48 localhost systemd[1]: Started Hostname Service.
03:58:20 localhost systemd[1]: Starting Hostname Service...
03:58:20 localhost systemd[1]: Started Hostname Service.
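The match between the hostnamed starts and the kernel messages can be verified mechanically. The sketch below embeds lines copied from the logs in this post and compares the timestamps; `times_matching` is a helper invented for this example.

```python
KERNEL_LOG = """\
03:52:42 localhost kernel: IPVS: Creating netns size=2048 id=97719
03:54:37 localhost kernel: IPVS: Creating netns size=2048 id=97720
03:56:48 localhost kernel: IPVS: Creating netns size=2048 id=97721
03:58:20 localhost kernel: IPVS: Creating netns size=2048 id=97722
"""

HOSTNAMED_LOG = """\
03:52:42 localhost systemd[1]: Starting Hostname Service...
03:54:37 localhost systemd[1]: Starting Hostname Service...
03:56:48 localhost systemd[1]: Starting Hostname Service...
03:58:20 localhost systemd[1]: Starting Hostname Service...
"""


def times_matching(log, pattern):
    """Collect the leading HH:MM:SS field of every line that contains pattern."""
    return [line.split()[0] for line in log.splitlines() if pattern in line]


netns_times = times_matching(KERNEL_LOG, "IPVS: Creating netns")
start_times = times_matching(HOSTNAMED_LOG, "Starting Hostname Service")
all_match = netns_times == start_times

print("netns created at:  ", netns_times)
print("hostnamed started: ", start_times)
print("timestamps match:  ", all_match)
```

Every hostnamed start falls in the same second as a Creating netns message, which makes hostnamed the likely trigger.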




exec { 'set hostname':
    command => "hostnamectl set-hostname ${local_hostname}",
    unless  => "test `hostnamectl --static` == '${local_hostname}'",
    path    => ['/usr/bin', '/bin'],
}

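In short, every run executes hostnamectl --static to get the current hostname (this is the unless check). If the result differs from the expected value, hostnamectl set-hostname ${local_hostname} is run to change the local hostname to the one we expect.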


[root@]# cat /usr/lib/systemd/system/systemd-hostnamed.service
#  This file is part of systemd.
#  systemd is free software; you can redistribute it and/or modify it
#  under the terms of the GNU Lesser General Public License as published by
#  the Free Software Foundation; either version 2.1 of the License, or
#  (at your option) any later version.

Description=Hostname Service
Documentation=man:systemd-hostnamed.service(8) man:hostname(5) man:machine-info(5)


The Service section of the unit file sets the PrivateNetwork=yes option. What does this option do? man systemd.exec describes it as follows:

Takes a boolean argument. If true, sets up a new network namespace for the executed processes and configures only the loopback network device “lo” inside it. No other network devices will be available to the executed process. This is
useful to securely turn off network access by the executed process. Defaults to false. It is possible to run two or more units within the same private network namespace by using the JoinsNamespaceOf= directive, see systemd.unit(5) for
details. Note that this option will disconnect all socket families from the host, this includes AF_NETLINK and AF_UNIX. The latter has the effect that AF_UNIX sockets in the abstract socket namespace will become unavailable to the
processes (however, those located in the file system will continue to be accessible).

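In other words, with PrivateNetwork enabled, systemd creates a new network namespace when it starts the service, isolating the process from the host network. hostnamed does not need network access, so this restriction is applied by default. That explains why the IPVS (LVS) module logs Creating netns: a new namespace really is being created. Likewise, the nf_conntrack output appears because the module has to run its initialization for the new namespace.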


int main(int argc, char *argv[]) {
        Context context = {};
        /* ... */
        r = bus_event_loop_with_idle(event, bus, "org.freedesktop.hostname1", DEFAULT_EXIT_USEC, NULL, NULL);
        if (r < 0) {
                log_error_errno(r, "Failed to run event loop: %m");
                goto finish;
        }

finish:
        /* ... */
        return r < 0 ? EXIT_FAILURE : EXIT_SUCCESS;
}

DEFAULT_EXIT_USEC is defined in another file as #define DEFAULT_EXIT_USEC (30*USEC_PER_SEC), i.e. 30s, so the event loop exits once hostnamed has been idle for 30s.


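To summarize the chain of events:

1. The automated operations script periodically checks the hostname, which causes hostnamed to be started.
2. Because the hostnamed service unit sets PrivateNetwork=yes, systemd creates a network namespace for it.
3. Because the conntrack and ipvs modules are loaded on the host (kube-proxy needs them), they initialize themselves in the new netns and, given some of the current parameter values, print the log messages seen above.
4. hostnamed exits automatically after 30s, so the next check creates yet another network namespace, and the cycle repeats.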
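The interaction between the check interval and the 30s idle exit can be illustrated with a simplified model. This is only a sketch: it assumes the daemon exits exactly once the idle period has elapsed and ignores startup latency, and `count_fresh_starts` is a name made up for this example.

```python
IDLE_EXIT_SECONDS = 30  # value of DEFAULT_EXIT_USEC, expressed in seconds


def count_fresh_starts(check_times, idle_exit=IDLE_EXIT_SECONDS):
    """Count the checks that find no running hostnamed and trigger a new start.

    check_times: ascending times, in seconds, at which hostnamectl is invoked.
    Each fresh start implies a new network namespace under PrivateNetwork=yes.
    """
    starts = 0
    last_request = None
    for t in check_times:
        # The daemon is gone if it has been idle for at least idle_exit seconds.
        if last_request is None or t - last_request >= idle_exit:
            starts += 1
        last_request = t
    return starts


# Offsets derived from the kernel log timestamps (gaps of 115s, 131s, 92s).
observed_checks = [0, 115, 246, 338]
# Hypothetical schedule with a check every 10s, for comparison.
frequent_checks = list(range(0, 120, 10))

print(count_fresh_starts(observed_checks))
print(count_fresh_starts(frequent_checks))
```

With checks roughly two minutes apart, every check starts a new hostnamed and therefore a new namespace, whereas in this model checks closer together than the idle timeout would keep reusing the same process.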


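Possible fixes:

1. Change the automated operations script so that it does not use hostnamectl to read the current hostname.
2. Tune the conntrack parameters sensibly to avoid exhausting kernel memory as far as possible.
3. Our workload is somewhat special and does not need kube-proxy, so we decided to stop the kube-proxy service. Nothing then depends on the ipvs module, so we removed that module as well.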