Enabling TSC as the Clocksource in KVM Virtual Machines
The previous post, TSC (Time-Stamp Counter) on the x86 Platform, took a broad look at TSC's characteristics and the basic conditions for TSC to serve as the system clocksource. So how do we let a Guest use TSC in a virtualized environment? This post discusses how TSC is used under KVM virtualization.
Basic Analysis
By default, a KVM virtual machine's preferred clocksource is kvm-clock; even setting the VM's CPU model to host-passthrough does not make it use TSC as the clocksource.
# lscpu|grep Flags
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq dtes64 vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves wbnoinvd arat vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid fsrm md_clear flush_l1d arch_capabilities
# cat /sys/devices/system/clocksource/clocksource0/available_clocksource
kvm-clock acpi_pm
# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
kvm-clock
As you can see, even though the CPU exposes most of the TSC-related flags, tsc is absent from available_clocksource and current_clocksource is kvm-clock. The reason shows up in dmesg:
# dmesg |grep -i tsc
[ 0.000000] tsc: Detected 2199.998 MHz processor
[ 0.001000] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x1fb63109b96, max_idle_ns: 440795265316 ns
[ 0.001000] TSC deadline timer enabled
[ 0.577230] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x1fb63109b96, max_idle_ns: 440795265316 ns
[ 0.692265] tsc: Marking TSC unstable due to TSC halts in idle states deeper than C2
As the log shows, the TSC was detected at boot, but because it halts in idle states deeper than C2, it was marked unstable.
There is also a second possible situation:
# cat /sys/devices/system/clocksource/clocksource0/available_clocksource
kvm-clock tsc acpi_pm
# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
kvm-clock
Here TSC is available, yet it is still not the preferred clocksource. Although the two situations look different, the root cause is the same: the Guest CPU is missing one key feature, the Invariant TSC discussed in the previous post.
# cpuid -1 -l 0x80000007
CPU:
   RAS Capability (0x80000007/ebx):
      MCA overflow recovery support = false
      SUCCOR support = false
      HWA: hardware assert support = false
      scalable MCA support = false
   Advanced Power Management Features (0x80000007/ecx):
      CmpUnitPwrSampleTimeRatio = 0x0 (0)
   Advanced Power Management Features (0x80000007/edx):
      TS: temperature sensing diode = false
      FID: frequency ID control = false
      VID: voltage ID control = false
      TTP: thermal trip = false
      TM: thermal monitor = false
      STC: software thermal control = false
      100 MHz multiplier control = false
      hardware P-State control = false
      TscInvariant = false
      CPB: core performance boost = false
      read-only effective frequency interface = false
      processor feedback interface = false
      APM power reporting = false
      connected standby = false
      RAPL: running average power limit = false
The key bit, TscInvariant, is false. In the first situation, the intel_idle driver loads normally, and in the driver code:
static bool __init intel_idle_verify_cstate(unsigned int mwait_hint)
{
	unsigned int mwait_cstate = (MWAIT_HINT2CSTATE(mwait_hint) + 1) &
					MWAIT_CSTATE_MASK;
	unsigned int num_substates = (mwait_substates >> mwait_cstate * 4) &
					MWAIT_SUBSTATE_MASK;

	/* Ignore the C-state if there are NO sub-states in CPUID for it. */
	if (num_substates == 0)
		return false;

	if (mwait_cstate > 2 && !boot_cpu_has(X86_FEATURE_NONSTOP_TSC))
		mark_tsc_unstable("TSC halts in idle states deeper than C2");

	return true;
}
the driver checks whether the CPU has X86_FEATURE_NONSTOP_TSC, i.e. TscInvariant, and if not, it marks the TSC as unstable. In that case, because the TSC has been marked unstable, tsc never shows up in available_clocksource.
What about the second situation, where TSC is not marked unstable and does appear in available_clocksource, but is still not preferred? This is because, by default, kvm-clock carries a higher rating than TSC, as the kernel code shows:
static struct clocksource kvm_clock = {
	.name	= "kvm-clock",
	.read	= kvm_clock_get_cycles,
	// By default kvm-clock's rating is 400, higher than TSC's 300, so
	// when both are present the system prefers kvm-clock as clocksource
	.rating	= 400,
	.mask	= CLOCKSOURCE_MASK(64),
	.flags	= CLOCK_SOURCE_IS_CONTINUOUS,
	.id	= CSID_X86_KVM_CLK,
	.enable	= kvm_cs_enable,
};
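The selection rule amounts to "highest rating wins". As a minimal, purely illustrative sketch of the idea (the kernel's real selection logic lives in kernel/time/clocksource.c; the struct and function below are hypothetical, not kernel code):

/* A minimal sketch, not kernel code: among the registered clocksources,
 * the highest-rated one becomes current_clocksource. */
struct cs_entry {
	const char *name;
	int rating;		/* e.g. kvm-clock: 400, tsc: 300 */
};

static const struct cs_entry *pick_best(const struct cs_entry *cs, int n)
{
	const struct cs_entry *best = &cs[0];

	for (int i = 1; i < n; i++)
		if (cs[i].rating > best->rating)
			best = &cs[i];
	return best;
}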
However, during kvm-clock initialization, if the TSC is found to meet the right conditions, kvm-clock voluntarily lowers its own rating:
void __init kvmclock_init(void)
{
	// ...

	/*
	 * X86_FEATURE_NONSTOP_TSC is TSC runs at constant rate
	 * with P/T states and does not stop in deep C-states.
	 *
	 * Invariant TSC exposed by host means kvmclock is not necessary:
	 * can use TSC as clocksource.
	 *
	 */
	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
	    boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
	    !check_tsc_unstable())
		kvm_clock.rating = 299;

	clocksource_register_hz(&kvm_clock, NSEC_PER_SEC);
	pv_info.name = "KVM";
}
So as long as the CPU supports TscInvariant, kvm-clock drops its rating to 299, TSC becomes the higher-rated clocksource, and it gets used by preference. But because the Guest CPU here does not support TscInvariant, TSC was not preferred.
At this point it is clear that the TscInvariant feature is the key to making a Guest support, and default to, TSC as its clocksource.
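If the cpuid tool is not installed in the Guest, the same bit can also be checked directly. Here is a minimal sketch, assuming GCC/Clang's <cpuid.h> helper; the bit itself, CPUID.80000007H:EDX[8], is exactly what cpuid prints as TscInvariant:

#include <stdio.h>
#include <cpuid.h>

int main(void)
{
	unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

	/* Leaf 0x80000007, EDX bit 8: Invariant TSC ("TscInvariant") */
	if (!__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx))
		return 1;

	printf("TscInvariant = %s\n", (edx & (1u << 8)) ? "true" : "false");
	return 0;
}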
Enabling the TscInvariant Feature
QEMU has supported TscInvariant since version 2.1, as that release's changelog notes:
New “invtsc” (Invariant TSC) CPU feature. When enabled, this will block migration and savevm, so it is not enabled by default on any CPU model. To enable invtsc, the migratable=no flag (supported only by -cpu host, by now) is required. So, invtsc is available only if using: -cpu host,migratable=no,+invtsc.
Enabling it is simple: add -cpu host,migratable=no,+invtsc to the QEMU command line, or equivalently, in the libvirt XML:
<cpu mode='host-passthrough' migratable='off'>
  <feature policy='require' name='invtsc'/>
</cpu>
Start a VM configured this way and check the result:
# lscpu |grep Fla
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves wbnoinvd arat vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid fsrm md_clear arch_capabilities
# dmesg |grep tsc
[ 0.000005] tsc: Detected 2199.998 MHz processor
[ 0.112544] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x1fb63109b96, max_idle_ns: 440795265316 ns
[ 0.310799] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x1fb63109b96, max_idle_ns: 440795265316 ns
[ 0.310905] clocksource: Switched to clocksource tsc
# cpuid -1 -l 0x80000007|grep TscInvariant
TscInvariant = true
# cat /sys/devices/system/clocksource/clocksource0/available_clocksource
tsc kvm-clock acpi_pm
# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc
TSC is now both available and the default clocksource.
VM Live Migration
The Guest now defaults to TSC as its clocksource, but one problem remains, and the changelog above already hints at it: with this configuration the VM cannot be migrated. Attempting a migration fails with:
Requested operation is not valid: cannot migrate domain: State blocked by non-migratable CPU device (invtsc flag)
Why does TscInvariant block migration? Picture a VM running on Host1 with TSC as its clocksource; the TSC frequency seen inside the VM matches Host1's. If the VM is then migrated to Host2, and Host2's TSC frequency differs from Host1's, the TSC frequency the VM reads suddenly changes; a Guest that calibrated its timekeeping against, say, a 2.2 GHz TSC would see time run noticeably fast on a 2.6 GHz host. That is clearly not what we want.
However, KVM does let the user pin the VM's TSC frequency. If we set a TSC frequency manually, so that the Guest sees the same frequency before and after migration, the problem goes away. QEMU 2.9 added support for exactly this case: when the user specifies a TSC frequency, live migration is allowed even with invtsc present (see this commit for the details). All we need to do is add -cpu host,migratable=on,+invtsc,tsc-freq=XXX to the command line, or equivalently, use the libvirt XML:
<cpu mode='host-passthrough' migratable='on'>
  <feature policy='require' name='invtsc'/>
</cpu>
<clock offset='utc'>
  <timer name='tsc' frequency='2200000000'/>
</clock>
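Here frequency='2200000000' (2.2 GHz) matches the TSC frequency the Guest detected earlier (tsc: Detected 2199.998 MHz processor); the point is to pick one value and configure it identically on every host the VM may run on, so the Guest-visible frequency never changes.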
Hardware Acceleration for TSC Virtualization
A few final questions remain: how does KVM pin the Guest's TSC frequency efficiently? When the Guest's TSC frequency differs from the Host's, how is the conversion performed? And how is the TSC kept from jumping during a migration?
This is where two CPU features, TSC scaling and TSC offsetting, become essential. With TSC offsetting enabled, whenever the Guest reads the TSC, the hardware adds a configured offset to the raw TSC value. So when the source and destination hosts have different TSC base values during a migration, only this offset needs to be adjusted; and because the offset is applied only on Guest reads, the Host's own use of the TSC is unaffected. TSC scaling is a similar mechanism: a frequency ratio is configured, and when the Guest reads the TSC, the CPU's current TSC value is multiplied by that ratio before being returned to the Guest, which resolves the mismatch between the user-specified TSC frequency and the CPU's native one.
For the details, see Intel's Software Developer's Manual:
26.6.5 Time-Stamp Counter Offset and Multiplier
The VM-execution control fields include a 64-bit TSC-offset field. If the “RDTSC exiting” control is 0 and the “use TSC offsetting” control is 1, this field controls executions of the RDTSC and RDTSCP instructions. It also controls executions of the RDMSR instruction that read from the IA32_TIME_STAMP_COUNTER MSR. For all of these, the value of the TSC offset is added to the value of the time-stamp counter, and the sum is returned to guest software in EDX:EAX.
Processors that support the 1-setting of the “use TSC scaling” control also support a 64-bit TSC-multiplier field. If this control is 1 (and the “RDTSC exiting” control is 0 and the “use TSC offsetting” control is 1), this field also affects the executions of the RDTSC, RDTSCP, and RDMSR instructions identified above. Specifically, the contents of the time-stamp counter is first multiplied by the TSC multiplier before adding the TSC offset.
See Chapter 26 for a detailed treatment of the behavior of RDTSC, RDTSCP, and RDMSR in VMX non-root operation.
27.3 CHANGES TO INSTRUCTION BEHAVIOR IN VMX NON-ROOT OPERATION
RDTSC. Behavior of the RDTSC instruction is determined by the settings of the “RDTSC exiting” and “use TSC offsetting” VM-execution controls:
- If both controls are 0, RDTSC operates normally.
- If the “RDTSC exiting” VM-execution control is 0 and the “use TSC offsetting” VM-execution control is 1, the value returned is determined by the setting of the “use TSC scaling” VM-execution control:
  - If the control is 0, RDTSC loads EAX:EDX with the sum of the value of the IA32_TIME_STAMP_COUNTER MSR and the value of the TSC offset.
  - If the control is 1, RDTSC first computes the product of the value of the IA32_TIME_STAMP_COUNTER MSR and the value of the TSC multiplier. It then shifts the value of the product right 48 bits and loads EAX:EDX with the sum of that shifted value and the value of the TSC offset.
- If the “RDTSC exiting” VM-execution control is 1, RDTSC causes a VM exit.
So on Intel platforms, the TSC-offset and TSC multiplier are two fields in the VMCS; by programming these two fields along with the RDTSC exiting control, the hypervisor has fine-grained control over the TSC behavior the Guest observes.
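Putting the SDM's description into code, here is a minimal sketch of the computation (the function and parameter names are illustrative, not KVM's):

#include <stdint.h>

/* Guest-visible TSC per the SDM: multiply the raw TSC by the 64-bit
 * multiplier (a fixed-point value with 48 fraction bits), shift the
 * 128-bit product right by 48, then add the TSC offset. A multiplier
 * of 1ull << 48 means a 1:1 ratio. */
static uint64_t guest_rdtsc(uint64_t host_tsc, uint64_t tsc_multiplier,
			    uint64_t tsc_offset)
{
	unsigned __int128 product = (unsigned __int128)host_tsc * tsc_multiplier;

	return (uint64_t)(product >> 48) + tsc_offset;
}

For example, presenting a 2.6 GHz host TSC to the Guest as 2.2 GHz means programming a multiplier of (2.2 / 2.6) * 2^48.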
AMD's implementation differs from Intel's in some details; see AMD's manual:
15.30.5 TSC Ratio MSR (C000_0104h)
Writing to the TSC Ratio MSR allows the hypervisor to control the guest’s view of the Time Stamp Counter. The contents of TSC Ratio MSR sets the value of the TSCRatio. This constant scales the timestamp value returned when the TSC is read by a guest via the RDTSC or RDTSCP instructions or when the TSC, MPERF, or MPerfReadOnly MSRs are read via the RDMSR instruction by a guest running under virtualization.
This facility allows the hypervisor to provide a consistent TSC, MPERF, and MPerfReadOnly rate for a guest process when moving that process between cores that have a differing P0 rate. The TSCRatio does not affect the value read from the TSC, MPERF, and MPerfReadOnly MSRs when in host mode or when virtualization is disabled. System Management Mode (SMM) code sees unscaled TSC, MPERF and MPerfReadOnly values unless the SMM code is executed within a guest container. The TSCRatio value does not affect the rate of the underlying TSC, MPERF, and MPerfReadOnly counters, nor the value that gets written to the TSC, MPERF, and MPerfReadOnly MSRs counters on a write by either the host or the guest.
The TSC Ratio MSR specifies the TSCRatio value as a fixed-point binary number in 8.32 format, which is composed of 8 bits of integer and 32 bits of fraction. This number is the ratio of the desired P0 frequency to be presented to the guest relative to the P0 frequency of the core (See Section 17.1, “PState Control,” on page 657). The reset value of the TSCRatio is 1.0, which sets the guest P0 frequency to match the core P0 frequency.
Note that:
TSCFreq = Core P0 frequency * TSCRatio, so TSCRatio = (Desired TSCFreq) / Core P0 frequency.
The TSC value read by the guest is computed using the TSC Ratio MSR along with the TSC_OFFSET field from the VMCB so that the actual value returned is:
TSC Value (in guest) = (P0 frequency * TSCRatio * t) + VMCB.TSC_OFFSET + (Last Value Written to TSC) * TSCRatio
Where t is time since the TSC was last written via the TSC MSR (or since reset if not written)
Compared with Intel, AMD keeps the TSC offset in the VMCB while implementing the TSC scaling ratio through an MSR. The difference in mechanism hardly matters, since KVM hides the platform-specific details. What matters is that this hardware/software cooperation lets the TSC serve as an efficient clocksource for VMs.
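As a minimal sketch of the 8.32 fixed-point ratio described above (again illustrative, not KVM code):

#include <stdint.h>

/* AMD TSC Ratio MSR value: 8 integer bits, 32 fraction bits; the ratio
 * of the desired Guest TSC frequency to the core's P0 frequency.
 * A ratio of 1.0 (Guest frequency == core P0 frequency) is 1ull << 32. */
static uint64_t amd_tsc_ratio(uint64_t guest_tsc_hz, uint64_t core_p0_hz)
{
	return (uint64_t)(((unsigned __int128)guest_tsc_hz << 32) / core_p0_hz);
}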
Performance Test
Finally, how much performance does the tsc clocksource buy compared with kvm-clock? I found a clock-performance test example from Red Hat:
#include <time.h>

int main(void)
{
	long i;
	struct timespec ts;

	/* Read the clock 500 million times; the run time is dominated
	 * by the cost of the clocksource behind clock_gettime(). */
	for (i = 0; i < 500000000; i++)
		clock_gettime(CLOCK_MONOTONIC, &ts);

	return 0;
}
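Assuming the source is saved as clock_timing.c (to match the binary name used below), a plain gcc -o clock_timing clock_timing.c build is enough.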
Compile and run:
# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc
# time taskset -c 6 ./clock_timing
real 0m10.858s
user 0m10.821s
sys 0m0.000s
# echo kvm-clock |sudo tee /sys/devices/system/clocksource/clocksource0/current_clocksource
kvm-clock
# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
kvm-clock
# time taskset -c 6 ./clock_timing
real 0m13.530s
user 0m13.482s
sys 0m0.002s
For the same 500000000 clock reads, tsc takes 10.821s while kvm-clock takes 13.482s, an improvement of roughly 20%, which is a substantial gain.