
Why does Mono run a simple method slower, while RyuJIT runs it significantly faster?

Out of curiosity I created a simple benchmark, but I cannot explain the results.


As benchmark data, I prepared an array of structs filled with random values. The preparation phase is not part of the measurement:


struct Val
{
    public float val;
    public float min;
    public float max;
    public float padding;
}

const int iterations = 1000;
Val[] values = new Val[iterations];
// fill the array with randoms
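The fill step is elided in the post. A minimal sketch of what it might look like is below; the helper name `CreateValues`, the fixed seed, and the value ranges are assumptions for illustration, not taken from the original benchmark.

```csharp
using System;

struct Val
{
    public float val;
    public float min;
    public float max;
    public float padding;
}

static class Program
{
    const int iterations = 1000;

    // Hypothetical helper (not from the original post): builds the test data.
    public static Val[] CreateValues(int seed)
    {
        var rng = new Random(seed);          // fixed seed so runs are repeatable
        var values = new Val[iterations];
        for (int i = 0; i < values.Length; ++i)
        {
            float a = (float)rng.NextDouble();
            float b = (float)rng.NextDouble();
            values[i].min = Math.Min(a, b);  // guarantee min <= max
            values[i].max = Math.Max(a, b);
            // Assumed range of roughly -1 to 2, so that val can fall below min,
            // inside the range, or above max and all three branches are exercised.
            values[i].val = (float)rng.NextDouble() * 3f - 1f;
        }
        return values;
    }

    static void Main()
    {
        Val[] values = CreateValues(seed: 42);
        Console.WriteLine(values.Length);
    }
}
```

With BenchmarkDotNet, code like this would normally run in a setup method rather than inside a `[Benchmark]` method, so that it stays out of the measurement.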

Basically, I want to compare these two clamp implementations:


static class Clamps
{
    public static float ClampSimple(float val, float min, float max)
    {
        if (val < min) return min;
        if (val > max) return max;
        return val;
    }

    public static T ClampExt<T>(this T val, T min, T max) where T : IComparable<T>
    {
        if (val.CompareTo(min) < 0) return min;
        if (val.CompareTo(max) > 0) return max;
        return val;
    }
}
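As a quick sanity check, the two implementations should agree for ordinary (non-NaN) inputs, so any timing difference comes from how each is compiled rather than from what it computes. The sketch below repeats the `Clamps` class from the post so that it compiles on its own; the `ClampDemo` class and the sample values are illustrative assumptions.

```csharp
using System;

static class Clamps
{
    public static float ClampSimple(float val, float min, float max)
    {
        if (val < min) return min;
        if (val > max) return max;
        return val;
    }

    public static T ClampExt<T>(this T val, T min, T max) where T : IComparable<T>
    {
        if (val.CompareTo(min) < 0) return min;
        if (val.CompareTo(max) > 0) return max;
        return val;
    }
}

static class ClampDemo
{
    static void Main()
    {
        // One sample below the range, one inside it, one above it.
        float[] samples = { -0.5f, 0.25f, 1.5f };
        foreach (float s in samples)
        {
            float direct = Clamps.ClampSimple(s, 0f, 1f);
            float viaExtension = s.ClampExt(0f, 1f);   // T is inferred as float
            Console.WriteLine($"{s}: direct={direct}, extension={viaExtension}");
        }
    }
}
```

For these samples both calls clamp to 0, 0.25 and 1 respectively; the exact number formatting depends on the current culture.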

Here are my benchmark methods:


[Benchmark]
public float Extension()
{
    float result = 0;
    for (int i = 0; i < iterations; ++i)
    {
        ref Val v = ref values[i];
        result += v.val.ClampExt(v.min, v.max);
    }

    return result;
}

[Benchmark]
public float Direct()
{
    float result = 0;
    for (int i = 0; i < iterations; ++i)
    {
        ref Val v = ref values[i];
        result += Clamps.ClampSimple(v.val, v.min, v.max);
    }

    return result;
}

I used BenchmarkDotNet version 0.10.12 with two jobs:

[MonoJob]
[RyuJitX64Job]

These are the results I got:


BenchmarkDotNet=v0.10.12, OS=Windows 7 SP1 (6.1.7601.0)
Intel Core i7-6920HQ CPU 2.90GHz (Skylake), 1 CPU, 8 logical cores and 4 physical cores
Frequency=2836123 Hz, Resolution=352.5940 ns, Timer=TSC
  [Host]    : .NET Framework 4.7 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.3062.0
  Mono      : Mono 5.12.0 (Visual Studio), 64bit
  RyuJitX64 : .NET Framework 4.7 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.3062.0


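I can accept that Mono is generally a bit slower here. What I don't understand is this:

Why does Mono run the Direct method slower than Extension, given that Direct uses a very simple comparison method whereas Extension uses a method with additional method calls?

RyuJIT shows a 4x advantage for the simple method here.

Can anyone explain this?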


慕村225694
1 Answer

三国纷争

Since nobody wanted to do the disassembly, I am answering my own question.

The reason seems to be the native code generated by the JITs, not the array bounds checks or caching issues mentioned in the comments.

RyuJIT generates very efficient code for the ClampSimple method:

    vucomiss xmm1,xmm0
    jbe     M01_L00
    vmovaps xmm0,xmm1
    ret
M01_L00:
    vucomiss xmm0,xmm2
    jbe     M01_L01
    vmovaps xmm0,xmm2
    ret
M01_L01:
    ret

It uses the CPU's native ucomiss operation to compare the floats and the fast movaps operation to move them between CPU registers.

The extension method is slower because it makes a couple of function calls to System.Single.CompareTo(System.Single). Here is the first branch:

    lea     rcx,[rsp+30h]
    vmovss  dword ptr [rsp+38h],xmm1
    call    mscorlib_ni+0xda98f0
    test    eax,eax
    jge     M01_L00
    vmovss  xmm0,dword ptr [rsp+38h]
    add     rsp,28h
    ret

Now let's look at the native code Mono generates for the ClampSimple method:

    cvtss2sd    xmm0,xmm0
    movss       xmm1,dword ptr [rsp+8]
    cvtss2sd    xmm1,xmm1
    comisd      xmm1,xmm0
    jbe         M01_L00
    movss       xmm0,dword ptr [rsp+8]
    cvtss2sd    xmm0,xmm0
    cvtsd2ss    xmm0,xmm0
    jmp         M01_L01
M01_L00:
    movss       xmm0,dword ptr [rsp]
    cvtss2sd    xmm0,xmm0
    movss       xmm1,dword ptr [rsp+10h]
    cvtss2sd    xmm1,xmm1
    comisd      xmm1,xmm0
    jp          M01_L02
    jae         M01_L02
    movss       xmm0,dword ptr [rsp+10h]
    cvtss2sd    xmm0,xmm0
    cvtsd2ss    xmm0,xmm0
    jmp         M01_L01
M01_L02:
    movss       xmm0,dword ptr [rsp]
    cvtss2sd    xmm0,xmm0
    cvtsd2ss    xmm0,xmm0
M01_L01:
    add         rsp,18h
    ret

Mono's code converts the floats to doubles and compares them with comisd. On top of that, there are strange "conversion flips" float ➞ double ➞ float when preparing the return value, and there is a lot more moving of values between memory and registers. This explains why Mono's code for the simple method is slower than RyuJIT's.

Mono's code for the Extension method is very similar to RyuJIT's, but again with the strange conversion flips float ➞ double ➞ float:

    movss       xmm0,dword ptr [rbp-10h]
    cvtss2sd    xmm0,xmm0
    movsd       xmm1,xmm0
    cvtsd2ss    xmm1,xmm1
    lea         rbp,[rbp]
    mov         r11,2061520h
    call        r11
    test        eax,eax
    jge         M0_L0
    movss       xmm0,dword ptr [rbp-10h]
    cvtss2sd    xmm0,xmm0
    cvtsd2ss    xmm0,xmm0
    ret

It seems that RyuJIT can generate more efficient code for handling floats. Mono treats floats as doubles and converts the values every time, which also causes additional transfers of values between CPU registers and memory.

Note that all of this applies to Windows x64 only. I don't know how this benchmark would perform on Linux or Mac.