A Tragedy Caused by a UFO

First, the title needs explaining; forgive me for going clickbait for once. This UFO is not an Unidentified Flying Object, but a network offload technique: UDP Fragmentation Offload. The story begins with us recently switching our production VMs from an SR-IOV NIC passthrough setup to a DPDK + vhost-user setup, in exchange for far better live-migration efficiency.

Judging from the earlier simulated stress tests and the first canary batches in production, the new DPDK setup's performance and stability were both at a very good level and fit our scenario well.
Then the rollout reached one particular service, things went wrong, and VM networking was cut off.

We switch the network over via hot-plug: first the passthrough NIC is hot-unplugged from the VM, then a vhost-user NIC is hot-plugged in, completing the swap. The swap causes roughly 3-5 s of network interruption, but after checking with the service owners, an interruption of that length is acceptable as long as we operate on one VM at a time, and does not affect the business.
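
For context, the swap is driven through libvirt hot-(un)plug, roughly like the following sketch (the device XML file names here are hypothetical):

# virsh detach-device <domain> sriov-vf.xml --live
# virsh attach-device <domain> vhostuser-nic.xml --live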

To keep the business safe, after each NIC swap we ping the VM continuously for 10 s and only move on to the next machine once the network is confirmed healthy.

Then the problem hit. On certain VMs, the network stayed up for about five minutes after the swap, then suddenly went dead. Which VMs were affected looked random, and on a broken VM, restarting the DPDK process brought the network back for a few more minutes before it died again. None of this had shown up in any earlier testing. In the logs, a few DPDK processes had printed these two lines:

VHOST_DATA: (/tmp/ens8f0-2.sock) failed to allocate memory for mbuf.
VHOST_DATA: (/tmp/ens8f0-2.sock) failed to copy desc to mbuf.

The logs suggest that the memory given to DPDK ran out, leaving it unable to allocate an mbuf from its mempool. But the memory DPDK uses was carefully calculated! Here is the code that sizes the allocation:

#define DEFAULT_IP_MTU	  (1500)
#define L2_OVERHEAD		  (14 + 4 + 4)
#define VF_RX_OFFSET	  (32)
#define DEFAULT_MBUF_SIZE (DEFAULT_IP_MTU + L2_OVERHEAD + RTE_PKTMBUF_HEADROOM + VF_RX_OFFSET)

#define MAX_VHOST_QUEUE_PAIRS 16

/* rxq/txq descriptor counts */
#define RXQ_TXQ_DESC_1K 1024
#define RXQ_TXQ_DESC_8K 8192

/* relay mempool config */
#define DEFAULT_NR_RX_QUEUE MAX_VHOST_QUEUE_PAIRS
#define DEFAULT_NR_TX_QUEUE MAX_VHOST_QUEUE_PAIRS
#define DEFAULT_NR_RX_DESC	RXQ_TXQ_DESC_8K
#define DEFAULT_NR_TX_DESC	RXQ_TXQ_DESC_1K
#define NUM_PKTMBUF_POOL	(DEFAULT_NR_RX_DESC * DEFAULT_NR_RX_QUEUE + DEFAULT_NR_TX_DESC * DEFAULT_NR_TX_QUEUE + 4096)

// ...
mpool = rte_pktmbuf_pool_create(mp_name, n_mbufs, RTE_MEMPOOL_CACHE_MAX_SIZE, 0, DEFAULT_MBUF_SIZE, request_socket_id);
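
Expanding the macros gives a quick sanity check of the sizing (RTE_PKTMBUF_HEADROOM defaults to 128 bytes in DPDK):

NUM_PKTMBUF_POOL  = 8192 * 16 + 1024 * 16 + 4096 = 151552 mbufs
DEFAULT_MBUF_SIZE = 1500 + 22 + 128 + 32         = 1682 bytes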

We give each VM at most 16 queues with an MTU of 1500, and the NIC side likewise supports 16 TX + 16 RX queues, with each RX ring holding 8192 descriptors and each TX ring 1024. Working through the arithmetic, NUM_PKTMBUF_POOL should satisfy every scenario, so why did we run out of memory? Let's take a deeper look at the relevant DPDK code.

First, the log lines. DPDK prints them in two code paths: one in virtio_dev_tx_split, the other in vhost_dequeue_single_packed. The two functions do the same job; one handles the legacy virtio split ring, the other the packed ring. Since we still run the old split ring, let's focus on that path:

__rte_always_inline
static uint16_t
virtio_dev_tx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
	struct rte_mempool *mbuf_pool, struct rte_mbuf **pkts, uint16_t count,
	bool legacy_ol_flags)
{
    // ... omitted

    // Dequeue at most 32 packets (MAX_PKT_BURST) per call
	count = RTE_MIN(count, MAX_PKT_BURST);
	count = RTE_MIN(count, avail_entries);
	VHOST_LOG_DATA(dev->ifname, DEBUG, "about to dequeue %u buffers\n", count);

	if (rte_pktmbuf_alloc_bulk(mbuf_pool, pkts, count))
		return 0;

	for (i = 0; i < count; i++) {
        // ... omitted

        // Copy the guest's packet data into pkts[i] (an rte_mbuf)
		err = desc_to_mbuf(dev, vq, buf_vec, nr_vec, pkts[i],
				   mbuf_pool, legacy_ol_flags, 0, false);
		if (unlikely(err)) {
			if (!allocerr_warned) {
                // The copy failed: print the second log line (the first one was printed inside desc_to_mbuf)
				VHOST_LOG_DATA(dev->ifname, ERR, "failed to copy desc to mbuf.\n");
				allocerr_warned = true;
			}
			dropped += 1;
			i++;
			break;
		}

	}
    // ... omitted

	return (i - dropped);
}

static __rte_always_inline int
desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
		  struct buf_vector *buf_vec, uint16_t nr_vec,
		  struct rte_mbuf *m, struct rte_mempool *mbuf_pool,
		  bool legacy_ol_flags, uint16_t slot_idx, bool is_async)
{

    // ... omitted

	buf_addr = buf_vec[vec_idx].buf_addr;
	buf_iova = buf_vec[vec_idx].buf_iova;
	buf_len = buf_vec[vec_idx].buf_len;
	buf_offset = hdr_remain;
	buf_avail = buf_vec[vec_idx].buf_len - hdr_remain;

	PRINT_PACKET(dev,
			(uintptr_t)(buf_addr + buf_offset),
			(uint32_t)buf_avail, 0);

	mbuf_offset = 0;
	mbuf_avail  = m->buf_len - RTE_PKTMBUF_HEADROOM;

	if (is_async) {
		pkts_info = async->pkts_info;
		if (async_iter_initialize(dev, async))
			return -1;
	}

	while (1) {
        // buf_avail is what's left of the guest's buffer; mbuf_avail is the room left in the current mbuf.
        // Each iteration copies the smaller of the two.
		cpy_len = RTE_MIN(buf_avail, mbuf_avail);

        // Perform the copy
		if (is_async) {
			if (async_fill_seg(dev, vq, cur, mbuf_offset,
					   buf_iova + buf_offset, cpy_len, false) < 0)
				goto error;
		} else if (likely(hdr && cur == m)) {
			rte_memcpy(rte_pktmbuf_mtod_offset(cur, void *, mbuf_offset),
				(void *)((uintptr_t)(buf_addr + buf_offset)),
				cpy_len);
		} else {
			sync_fill_seg(dev, vq, cur, mbuf_offset,
				      buf_addr + buf_offset,
				      buf_iova + buf_offset, cpy_len, false);
		}

        // Update offsets and remaining counts after the copy
		mbuf_avail  -= cpy_len;
		mbuf_offset += cpy_len;
		buf_avail -= cpy_len;
		buf_offset += cpy_len;

		/* This buf reaches to its end, get the next one */
		if (buf_avail == 0) {
            // If all of the guest's buffers have been consumed, break out of the loop
			if (++vec_idx >= nr_vec)
				break;

			buf_addr = buf_vec[vec_idx].buf_addr;
			buf_iova = buf_vec[vec_idx].buf_iova;
			buf_len = buf_vec[vec_idx].buf_len;

			buf_offset = 0;
			buf_avail  = buf_len;

			PRINT_PACKET(dev, (uintptr_t)buf_addr,
					(uint32_t)buf_avail, 0);
		}

		/*
		 * This mbuf reaches to its end, get a new one
		 * to hold more data.
		 */
        // The copy isn't finished but this mbuf is full: allocate a fresh mbuf and chain it
		if (mbuf_avail == 0) {
			cur = rte_pktmbuf_alloc(mbuf_pool);
            // Allocation failed: print the first log line and bail out via error
			if (unlikely(cur == NULL)) {
				VHOST_LOG_DATA(dev->ifname, ERR,
					"failed to allocate memory for mbuf.\n");
				goto error;
			}

			prev->next = cur;
			prev->data_len = mbuf_offset;
			m->nb_segs += 1;
			m->pkt_len += mbuf_offset;
			prev = cur;

			mbuf_offset = 0;
			mbuf_avail  = cur->buf_len - RTE_PKTMBUF_HEADROOM;
		}
	}
    // ... omitted
error:
	if (is_async)
		async_iter_cancel(async);

	return -1;
}
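
One detail in desc_to_mbuf is worth spelling out: a single mbuf only holds buf_len - RTE_PKTMBUF_HEADROOM bytes of payload, which under our pool configuration works out to:

payload per mbuf = DEFAULT_MBUF_SIZE - RTE_PKTMBUF_HEADROOM = 1682 - 128 = 1554 bytes

So any guest packet longer than ~1554 bytes gets spread across a chain of mbufs, allocating one more for every additional 1554 bytes, while the pool sizing above budgeted roughly one mbuf per ring descriptor.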

The code confirms it: the mempool really can run dry, but only when a packet coming from the guest exceeds our assumed MTU. When would a guest push packets bigger than the MTU down to the NIC? The obvious suspects are the various offloads, especially the NIC-related segmentation offloads. So we went to an affected business machine and checked which NIC features were enabled:

# ethtool -k eth0
Features for eth0:
rx-checksumming: on [fixed]
tx-checksumming: on
        tx-checksum-ipv4: off [fixed]
        tx-checksum-ip-generic: on
        tx-checksum-ipv6: off [fixed]
        tx-checksum-fcoe-crc: off [fixed]
        tx-checksum-sctp: off [fixed]
scatter-gather: on
        tx-scatter-gather: on
        tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: off
        tx-tcp-segmentation: off [fixed]
        tx-tcp-ecn-segmentation: off [fixed]
        tx-tcp6-segmentation: off [fixed]
udp-fragmentation-offload: on
generic-segmentation-offload: on
generic-receive-offload: on
# ... omitted

Then the same check on a machine with no problems:

# ethtool -k eth0
Features for eth0:
rx-checksumming: on [fixed]
tx-checksumming: on
        tx-checksum-ipv4: off [fixed]
        tx-checksum-ip-generic: on
        tx-checksum-ipv6: off [fixed]
        tx-checksum-fcoe-crc: off [fixed]
        tx-checksum-sctp: off [fixed]
scatter-gather: on
        tx-scatter-gather: on
        tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: off
        tx-tcp-segmentation: off [fixed]
        tx-tcp-ecn-segmentation: off [requested on]
        tx-tcp-mangleid-segmentation: off [fixed]
        tx-tcp6-segmentation: off [fixed]
generic-segmentation-offload: on
# ... omitted

Common offloads such as TSO are off on both machines, but the problematic one has one extra feature enabled: udp-fragmentation-offload: on. That is UFO. It is the UDP analogue of TSO: where TSO offloads TCP segmentation, UFO lets the stack hand down one oversized UDP datagram and leaves fragmentation to a lower layer. Since this is the one difference, could some application on the affected machines be sending large UDP packets? Time to capture some traffic:

# tcpdump -i eth0 udp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
17:55:43.880517 IP localhost.29784 > 192.168.100.111.8333: UDP, length 9082
17:55:43.880526 IP localhost > 192.168.100.111: udp
17:55:43.880528 IP localhost > 192.168.100.111: udp
17:55:43.880539 IP localhost > 192.168.100.111: udp
17:55:43.880541 IP localhost > 192.168.100.111: udp
17:55:43.880542 IP localhost > 192.168.100.111: udp
17:55:43.880543 IP localhost > 192.168.100.111: udp
17:55:43.881681 IP localhost.29784 > 192.168.100.111.8333: UDP, length 9076
17:55:43.881684 IP localhost > 192.168.100.111: udp
17:55:43.881686 IP localhost > 192.168.100.111: udp
17:55:43.881687 IP localhost > 192.168.100.111: udp
17:55:43.881689 IP localhost > 192.168.100.111: udp
17:55:43.881690 IP localhost > 192.168.100.111: udp
17:55:43.881692 IP localhost > 192.168.100.111: udp
17:55:43.882214 IP localhost.17670 > 192.168.100.1.domain: 4442+ PTR? 192.168.100.1.in-addr.arpa. (42)
17:55:43.882341 IP 192.168.100.1.domain > localhost.17670: 4442 NXDomain 0/1/0 (119)
17:55:43.882828 IP localhost.29784 > 192.168.100.111.8333: UDP, length 9046
17:55:43.882835 IP localhost > 192.168.100.111: udp
17:55:43.882837 IP localhost > 192.168.100.111: udp
17:55:43.882850 IP localhost > 192.168.100.111: udp
17:55:43.882852 IP localhost > 192.168.100.111: udp
17:55:43.882853 IP localhost > 192.168.100.111: udp
17:55:43.882854 IP localhost > 192.168.100.111: udp

Sure enough: besides DNS queries, the service sends UDP datagrams over 9000 bytes long. At roughly 1554 bytes of payload per mbuf, each of those datagrams chains ceil(9082 / 1554) = 6 mbufs, draining the pool far faster than the sizing assumed. With the cause identified, let's try turning the UFO feature off:

# ethtool -K eth0 ufo off

With UFO disabled, we watched the VM for a while and its network stayed healthy. We then reproduced the problem in the test environment with iperf, making later verification straightforward.
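
For reference, the gist of the reproduction is simply pushing UDP datagrams far larger than the MTU from inside a guest that still negotiates UFO; something along these lines (the exact flags here are illustrative, using iperf's UDP mode with a 9000-byte datagram size):

# iperf -u -c <peer_ip> -l 9000 -b 1G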

Besides the stop-gap of disabling UFO, this raised new questions: why did the machines in the earlier canary batches never hit the same problem? And why has the udp-fragmentation-offload line disappeared entirely from the healthy machine's feature list?

A bit of searching turned up the answer in the kernel's official documentation, Segmentation Offloads (https://docs.kernel.org/networking/segmentation-offloads.html):

UDP Fragmentation Offload
UDP fragmentation offload allows a device to fragment an oversized UDP datagram into multiple IPv4 fragments. Many of the requirements for UDP fragmentation offload are the same as TSO. However the IPv4 ID for fragments should not increment as a single IPv4 datagram is fragmented.

UFO is deprecated: modern kernels will no longer generate UFO skbs, but can still receive them from tuntap and similar devices. Offload of UDP-based tunnel protocols is still supported.

So UFO has been deprecated: modern kernels no longer generate large UFO skbs, though they can still receive them from tuntap and similar devices. A look at the two machines' kernel versions:

# uname -r
3.10.0-514.el7.x86_64
# uname -r
5.14.0-284.25.1.el9_2.x86_64

Quite a gap indeed. UFO was removed from mainline around kernel 4.14, so the 3.10 guest still advertises and uses it, while the 5.14 guest no longer even lists the feature.

In the end, the fix was to adjust the VM's NIC configuration to the following:

<interface type='vhostuser'>
  <mac address='{{.MACAddress}}'/>
  <source type='unix' path='{{.VhostPath}}' mode='server'/>
  <model type='virtio'/>
  <driver queues='16' rx_queue_size='1024' tx_queue_size='1024'>
    <host tso4='off' tso6='off' ufo='off' ecn='off' mrg_rxbuf='off'/>
    <guest tso4='off' tso6='off' ufo='off' ecn='off'/>
  </driver>
</interface>
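
After this change the host no longer offers the UFO feature bit to the guest, so the same ethtool query inside the guest should report udp-fragmentation-offload as off (or not list it at all on kernels that have dropped UFO):

# ethtool -k eth0 | grep udp-fragmentation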

As it happens, TSO was already disabled on the DPDK side. But with plenty of VMs already in production, disabling UFO inside DPDK would change the virtio feature set offered to guests and break live migration for the older VMs, so we chose to turn it off on the QEMU side instead.
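
For completeness, turning it off on the DPDK side would only be a couple of feature bits cleared before the vhost-user socket starts; a minimal sketch (sock_path stands for our per-NIC socket path):

/* Stop offering UFO in the virtio feature negotiation. Guests that
 * already negotiated these bits would see a different feature set
 * after migration -- which is exactly the compatibility problem. */
uint64_t ufo_bits = (1ULL << VIRTIO_NET_F_HOST_UFO) |
                    (1ULL << VIRTIO_NET_F_GUEST_UFO);
rte_vhost_driver_disable_features(sock_path, ufo_bits);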