Program Tip

Huge performance difference (26x faster) when compiling for 32 and 64 bits

programtip 2020. 10. 12. 08:05


I was trying to measure the difference between using a for and a foreach when accessing lists of value types and reference types.

I used the following class to do the profiling:

using System;
using System.Diagnostics;

public static class Benchmarker
{
    public static void Profile(string description, int iterations, Action func)
    {
        Console.Write(description);

        // Warm up
        func();

        Stopwatch watch = new Stopwatch();

        // Clean up
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();

        watch.Start();
        for (int i = 0; i < iterations; i++)
        {
            func();
        }
        watch.Stop();

        Console.WriteLine(" average time: {0} ms", watch.Elapsed.TotalMilliseconds / iterations);
    }
}

I used a double for my value type, and I created this 'fake class' to test reference types:

class DoubleWrapper
{
    public double Value { get; set; }

    public DoubleWrapper(double value)
    {
        Value = value;
    }
}

Finally, I ran this code and compared the time differences:

static void Main(string[] args)
{
    int size = 1000000;
    int iterationCount = 100;

    var valueList = new List<double>(size);
    for (int i = 0; i < size; i++) 
        valueList.Add(i);

    var refList = new List<DoubleWrapper>(size);
    for (int i = 0; i < size; i++) 
        refList.Add(new DoubleWrapper(i));

    double dummy;

    Benchmarker.Profile("valueList for: ", iterationCount, () =>
    {
        double result = 0;
        for (int i = 0; i < valueList.Count; i++)
        {
             unchecked
             {
                 var temp = valueList[i];
                 result *= temp;
                 result += temp;
                 result /= temp;
                 result -= temp;
             }
        }
        dummy = result;
    });

    Benchmarker.Profile("valueList foreach: ", iterationCount, () =>
    {
        double result = 0;
        foreach (var v in valueList)
        {
            var temp = v;
            result *= temp;
            result += temp;
            result /= temp;
            result -= temp;
        }
        dummy = result;
    });

    Benchmarker.Profile("refList for: ", iterationCount, () =>
    {
        double result = 0;
        for (int i = 0; i < refList.Count; i++)
        {
            unchecked
            {
                var temp = refList[i].Value;
                result *= temp;
                result += temp;
                result /= temp;
                result -= temp;
            }
        }
        dummy = result;
    });

    Benchmarker.Profile("refList foreach: ", iterationCount, () =>
    {
        double result = 0;
        foreach (var v in refList)
        {
            unchecked
            {
                var temp = v.Value;
                result *= temp;
                result += temp;
                result /= temp;
                result -= temp;
            }
        }

        dummy = result;
    });

    SafeExit();
}

I selected the Release and Any CPU options, ran the program, and got the following times:

valueList for:  average time: 483,967938 ms
valueList foreach:  average time: 477,873079 ms
refList for:  average time: 490,524197 ms
refList foreach:  average time: 485,659557 ms
Done!

Then I selected the Release and x64 options, ran the program, and got the following times:

valueList for:  average time: 16,720209 ms
valueList foreach:  average time: 15,953483 ms
refList for:  average time: 19,381077 ms
refList foreach:  average time: 18,636781 ms
Done!

Why is the x64 version so much faster? I expected some difference, but not something this big.

I don't have access to other computers. Could you please run this on your machines and tell me the results? I'm using Visual Studio 2015 and I have an Intel Core i7 930.

Here's the SafeExit() method, so you can compile/run it yourself:

private static void SafeExit()
{
    Console.WriteLine("Done!");
    Console.ReadLine();
    System.Environment.Exit(1);
}

As requested, here are the results using double? instead of my DoubleWrapper:

Any CPU

valueList for:  average time: 482,98116 ms
valueList foreach:  average time: 478,837701 ms
refList for:  average time: 491,075915 ms
refList foreach:  average time: 483,206072 ms
Done!

x64

valueList for:  average time: 16,393947 ms
valueList foreach:  average time: 15,87007 ms
refList for:  average time: 18,267736 ms
refList foreach:  average time: 16,496038 ms
Done!

Last but not least: creating an x86 profile gives me almost the same results as Any CPU.


I can reproduce this on 4.5.2. No RyuJIT here. Both the x86 and x64 disassemblies look reasonable. Range checks and so on are the same. The same basic structure. No loop unrolling.

x86 uses a different set of float instructions. The performance of these instructions seems comparable with the x64 instructions, except for the division:

  1. The 32-bit x87 float instructions use 10-byte precision internally.
  2. Extended-precision division is terribly slow.

The division operation makes the 32-bit version extremely slow. Commenting out the division equalizes performance to a large degree (32-bit down from 430 ms to 3.25 ms).

Peter Cordes points out that the instruction latencies of the two floating-point units are not that dissimilar. Maybe some of the intermediate results are denormalized numbers or NaN. These might trigger a slow path in one of the units. Or, maybe the values diverge between the two implementations because of the 10-byte vs. 8-byte float precision.

Peter Cordes also points out that all intermediate results are NaN. Removing that problem (valueList.Add(i + 1) so that no divisor is ever zero) mostly equalizes the results. Apparently, the 32-bit code does not like NaN operands at all. Let's print some intermediate values: if (i % 1000 == 0) Console.WriteLine(result);. This confirms that the data is now sane.
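That fix can be sketched as a standalone snippet (names here are illustrative, not part of the original benchmark):

```csharp
using System;
using System.Collections.Generic;

class NaNFreeSetup
{
    static void Main()
    {
        int size = 1000000;

        // Seed with i + 1 instead of i: the first element is 1.0, not 0.0,
        // so the benchmark's first division is 0.0 / 1.0 = 0.0 rather than
        // the NaN-producing 0.0 / 0.0, and NaN never enters the loop.
        var valueList = new List<double>(size);
        for (int i = 0; i < size; i++)
            valueList.Add(i + 1);

        Console.WriteLine(valueList[0]);  // 1
    }
}
```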

When benchmarking, you need to benchmark a realistic workload. But who would have thought that an innocent division could spoil your benchmark?!

Try simply summing the numbers to get a better benchmark.
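A sum-only benchmark along those lines might look like this (a minimal standalone sketch, not the original Benchmarker harness):

```csharp
using System;
using System.Diagnostics;

class SumBenchmark
{
    static void Main()
    {
        int size = 1000000;
        var values = new double[size];
        for (int i = 0; i < size; i++)
            values[i] = i;

        // Addition only: no division, so no NaNs and no x87 slow path.
        var watch = Stopwatch.StartNew();
        double result = 0;
        for (int i = 0; i < size; i++)
            result += values[i];
        watch.Stop();

        Console.WriteLine("sum: {0}, time: {1} ms", result, watch.Elapsed.TotalMilliseconds);
    }
}
```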

Division and modulo are always very slow. If you modify the BCL Dictionary code to simply not use the modulo operator when computing the bucket index, you get measurable performance improvements. This is how slow division is.

Here's the 32-bit code:

(image: 32-bit x87 disassembly)

64 bit code (same structure, fast division):

(image: 64-bit SSE disassembly)

This is not vectorized despite SSE instructions being used.


valueList[i] = i, starting from i=0, so the first loop iteration does 0.0 / 0.0. So every operation in your entire benchmark is done with NaNs.
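This is easy to verify directly: in IEEE 754 floating point, which C# doubles use, 0.0 / 0.0 silently yields NaN instead of throwing, and the NaN then propagates through every subsequent operation:

```csharp
using System;

class NaNDemo
{
    static void Main()
    {
        double zero = 0.0;
        double result = zero / zero;               // IEEE 754: 0.0 / 0.0 == NaN, no exception
        Console.WriteLine(double.IsNaN(result));   // True

        // NaN is sticky: once it appears, every later operation stays NaN.
        result = result * 2.0 + 1.0;
        Console.WriteLine(double.IsNaN(result));   // True
    }
}
```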

As @usr showed in disassembly output, the 32bit version used x87 floating point, while 64bit used SSE floating point.

I'm not an expert on performance with NaNs, or the difference between x87 and SSE for this, but I think this explains the 26x perf difference. I bet your results will be a lot closer between 32 and 64bit if you initialize valueList[i] = i+1. (update: usr confirmed that this made 32 and 64bit performance fairly close.)

Division is very slow compared to other operations. See my comments on @usr's answer. Also see http://agner.org/optimize/ for tons of great stuff about hardware, and optimizing asm and C/C++, some of it relevant to C#. He has instruction tables of latency and throughput for most instructions for all recent x86 CPUs.

However, 10B x87 fdiv isn't much slower than SSE2's 8B double precision divsd, for normal values. IDK about perf differences with NaNs, infinities, or denormals.

They have different controls for what happens with NaNs and other FPU exceptions, though. The x87 FPU control word is separate from the SSE rounding / exception control register (MXCSR). If x87 is getting a CPU exception for every division, but SSE isn't, that easily explains the factor of 26. Or maybe there's just a performance difference that big when handling NaNs. The hardware is not optimized for churning through NaN after NaN.

IDK if the SSE controls for avoiding slowdowns with denormals will come into play here, since I believe result will be NaN all the time. IDK if C# sets the denormals-are-zero flag in the MXCSR, or the flush-to-zero-flag (which writes zeroes in the first place, instead of treating denormals as zero when read back).

I found an Intel article about SSE floating point controls, contrasting it with the x87 FPU control word. It doesn't have much to say about NaN, though. It ends with this:

Conclusion

To avoid serialization and performance issues due to denormals and underflow numbers, use the SSE and SSE2 instructions to set Flush-to-Zero and Denormals-Are-Zero modes within the hardware to enable highest performance for floating-point applications.

IDK if this helps any with divide-by-zero.

for vs. foreach

It might be interesting to test a loop body that is throughput-limited, rather than just being one single loop-carried dependency chain. As it is, all of the work depends on previous results; there's nothing for the CPU to do in parallel (other than bounds-check the next array load while the mul/div chain is running).

You might see more difference between the methods if the "real work" occupied more of the CPU's execution resources. Also, on pre-Sandybridge Intel, there's a big difference between a loop fitting in the 28uop loop buffer or not. You get instruction decode bottlenecks if not, esp. when the average instruction length is longer (which happens with SSE). Instructions that decode to more than one uop will also limit decoder throughput, unless they come in a pattern that's nice for the decoders (e.g. 2-1-1). So a loop with more instructions of loop overhead can make the difference between a loop fitting in the 28-entry uop cache or not, which is a big deal on Nehalem, and sometimes helpful on Sandybridge and later.
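One way to get a throughput-limited body instead of a single serial chain — a sketch, not part of the original benchmark — is to split the work across independent accumulators, giving the CPU two dependency chains it can execute in parallel:

```csharp
using System;

class MultipleAccumulators
{
    static void Main()
    {
        var values = new double[1000000];
        for (int i = 0; i < values.Length; i++)
            values[i] = i;

        // Two independent accumulators: each addition depends only on its
        // own chain, so the adds can overlap instead of serializing.
        double a = 0, b = 0;
        for (int i = 0; i + 1 < values.Length; i += 2)
        {
            a += values[i];
            b += values[i + 1];
        }

        Console.WriteLine(a + b);   // 499999500000
    }
}
```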


We have the observation that 99.9% of all the floating point operations will involve NaN's, which is at least highly unusual (found by Peter Cordes first). We have another experiment by usr, which found that removing the division instructions makes the time difference almost completely go away.

The fact however is that the NaN's are only generated because the very first division calculates 0.0 / 0.0 which gives the initial NaN. If the divisions are not performed, result will always be 0.0, and we will always calculate 0.0 * temp -> 0.0, 0.0 + temp -> temp, temp - temp = 0.0. So removing the division did not only remove the divisions, but also removed the NaNs. I would expect that the NaN's are actually the problem, and that one implementation handles NaN's very slowly, while the other one doesn't have the problem.

It would be worthwhile starting the loop at i = 1 and measuring again. The four operations result * temp, + temp, / temp, - temp effectively add (1 - temp) so we wouldn't have any unusual numbers (0, infinity, NaN) for most of the operations.
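The algebra behind that claim: ((result * temp + temp) / temp) - temp simplifies to result + (1 - temp) whenever temp is nonzero and the intermediates stay exact, which a quick check confirms (the values below are arbitrary illustrations):

```csharp
using System;

class LoopBodyAlgebra
{
    static void Main()
    {
        double result = 5.0;
        double temp = 3.0;

        // The benchmark's four operations, applied in order:
        double viaOps = ((result * temp + temp) / temp) - temp;

        // The claimed closed form: result + (1 - temp).
        double closedForm = result + 1 - temp;

        Console.WriteLine(viaOps);      // 3
        Console.WriteLine(closedForm);  // 3
    }
}
```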

The only problem could be that the division always gives an integer result, and some division implementations have shortcuts when the correct result doesn't use many bits. For example, dividing 310.0 / 31.0 gives 10.0 as the first four bits with a remainder of 0.0, and some implementations can stop evaluating the remaining 50 or so bits while others can't. If there is a significant difference, then starting the loop with result = 1.0 / 3.0 would make a difference.
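The example is easy to check: 310.0 / 31.0 is exactly 10.0 in double arithmetic, with a remainder of exactly 0.0, so a divider with an early-out on a zero remainder could stop after the leading bits:

```csharp
using System;

class ExactDivision
{
    static void Main()
    {
        // Both operands and the quotient are exactly representable doubles,
        // so the correctly rounded division produces exactly 10.0.
        double q = 310.0 / 31.0;
        Console.WriteLine(q == 10.0);         // True

        // The remainder is exactly zero.
        Console.WriteLine(310.0 - q * 31.0);  // 0
    }
}
```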


There may be several reasons why this is executing faster in 64bit on your machine. The reason I asked which CPU you were using was because when 64bit CPUs first made their appearance, AMD and Intel had different mechanisms to handle 64bit code.

Processor architecture:

Intel's CPU architecture was purely 64bit. In order to execute 32bit code, the 32bit instructions needed to be converted (inside the CPU) to 64bit instructions before execution.

AMD's CPU architecture was to build 64bit right on top of their 32bit architecture; that is, it was essentially a 32bit architecture with 64bit extensions - there was no code conversion process.

This was obviously a few years ago now, so I've no idea if/how the technology has changed, but essentially, you would expect 64bit code to perform better on a 64bit machine since the CPU is able to work with double the amount of bits per instruction.

.NET JIT

It's argued that .NET (and other managed languages like Java) are capable of outperforming languages like C++ because of the way the JIT compiler is able to optimize your code according to your processor architecture. In this respect, you might find that the JIT compiler is utilizing something in 64bit architecture that possibly wasn't available or required a workaround when executed in 32bit.

Note:

Rather than using DoubleWrapper, have you considered using Nullable<double> or shorthand syntax: double? - I'd be interested to see if that has any impact on your tests.
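For reference, the double? variant would only change the list construction. Note that Nullable&lt;double&gt; is still a value type (a struct wrapping the double plus a HasValue flag), so unlike DoubleWrapper it involves no per-element heap allocation (the snippet below is an illustrative sketch):

```csharp
using System;
using System.Collections.Generic;

class NullableVariant
{
    static void Main()
    {
        int size = 1000000;

        // double? (Nullable<double>) stores the double and its HasValue flag
        // inline in the list's backing array: no heap allocation per element,
        // unlike the DoubleWrapper class.
        var nullableList = new List<double?>(size);
        for (int i = 0; i < size; i++)
            nullableList.Add(i);

        Console.WriteLine(nullableList[10].Value);  // 10
    }
}
```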

Note 2: Some people seem to be conflating my comments about 64-bit architecture with IA-64. Just to clarify, in my answer, 64-bit refers to x86-64 and 32-bit refers to x86-32. Nothing here references IA-64!

Reference URL: https://stackoverflow.com/questions/31875464/huge-performance-difference-26x-faster-when-compiling-for-32-and-64-bits
