Erasure Coder

Multithreaded XOR Benchmarks

I recently ran an experiment to see how fast multiple CPU threads can XOR two sets of non-overlapping buffers in memory. A worker pool with as many threads as the CPU count competes for the work items.
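To make the setup concrete, here is a minimal sketch of that kind of benchmark kernel. It is not the code that produced the numbers below: the helper name `xor_all`, its parameters, and the atomic-counter scheme for claiming work items are assumptions for illustration. For brevity it also spawns its threads on every call, where a real benchmark would keep a persistent pool.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// XOR `bytes` bytes of src into dst. The buffers must not overlap.
// A tuned version would likely use SIMD; this plain loop is only a sketch.
void mem_xor(uint8_t* dst, const uint8_t* src, size_t bytes) {
    for (size_t i = 0; i < bytes; ++i) dst[i] ^= src[i];
}

// Hypothetical helper: performs `ops` XOR operations of `chunk_bytes` each,
// grouped into `items` work items that `workers` threads compete for.
void xor_all(std::vector<uint8_t>& dst, const std::vector<uint8_t>& src,
             size_t chunk_bytes, size_t ops, size_t items, unsigned workers) {
    assert(dst.size() >= chunk_bytes * ops);
    assert(src.size() >= chunk_bytes * ops);
    assert(items > 0 && workers > 0);

    std::atomic<size_t> next_item{0};  // shared counter the workers fight over

    auto worker = [&] {
        for (;;) {
            // Claim the next unprocessed work item, or exit if none remain.
            const size_t item = next_item.fetch_add(1);
            if (item >= items) return;
            // Each item covers a contiguous range of XOR operations.
            const size_t begin = ops * item / items;
            const size_t end = ops * (item + 1) / items;
            for (size_t op = begin; op < end; ++op) {
                mem_xor(dst.data() + op * chunk_bytes,
                        src.data() + op * chunk_bytes, chunk_bytes);
            }
        }
    };

    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) pool.emplace_back(worker);
    for (auto& t : pool) t.join();
}
```

Calling `xor_all(dst, src, 100, 10000, 12, 12)` would correspond roughly to the 100-byte case with 12 work items, assuming a machine with 12 logical cores.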

Spoiler: for any memory-bound workload over large data, the right approach is to create as many work items as there are CPU cores (counting hyperthreads as cores).
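As a rough sketch of that rule, the work item count can be taken from `std::thread::hardware_concurrency()`, which on most platforms reports logical cores (hyperthreads included) but is only a hint and may return 0. The helper names `work_item_count` and `split_ops` are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

// One work item per logical core. hardware_concurrency() is allowed to
// return 0 when the count is unknown, so fall back to a single item.
size_t work_item_count() {
    const unsigned n = std::thread::hardware_concurrency();
    return n == 0 ? 1 : n;
}

// Split `ops` operations into `items` contiguous [begin, end) ranges
// whose lengths differ by at most one.
std::vector<std::pair<size_t, size_t>> split_ops(size_t ops, size_t items) {
    assert(items > 0);
    std::vector<std::pair<size_t, size_t>> ranges;
    ranges.reserve(items);
    for (size_t i = 0; i < items; ++i) {
        ranges.emplace_back(ops * i / items, ops * (i + 1) / items);
    }
    return ranges;
}
```

For example, `split_ops(10000, work_item_count())` gives each core one contiguous slice of the 10,000 XOR operations.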

Here are some results on a fast PC:

For 100-byte data chunks and 10,000 XOR operations (1 MB total):

mem_xor(1 MB) across 1 work-thread items: 14285.7 MB/s
mem_xor(1 MB) across 2 work-thread items: 22727.3 MB/s
mem_xor(1 MB) across 3 work-thread items: 21739.1 MB/s
mem_xor(1 MB) across 4 work-thread items: 23255.8 MB/s
mem_xor(1 MB) across 5 work-thread items: 27777.8 MB/s
mem_xor(1 MB) across 6 work-thread items: 33333.3 MB/s
mem_xor(1 MB) across 7 work-thread items: 32258.1 MB/s
mem_xor(1 MB) across 8 work-thread items: 34482.8 MB/s
mem_xor(1 MB) across 9 work-thread items: 35714.3 MB/s
mem_xor(1 MB) across 10 work-thread items: 35714.3 MB/s
mem_xor(1 MB) across 11 work-thread items: 40000 MB/s
mem_xor(1 MB) across 12 work-thread items: 40000 MB/s
mem_xor(1 MB) across 13 work-thread items: 37037 MB/s
mem_xor(1 MB) across 14 work-thread items: 37037 MB/s
mem_xor(1 MB) across 15 work-thread items: 35714.3 MB/s
mem_xor(1 MB) across 16 work-thread items: 37037 MB/s
mem_xor(1 MB) across 32 work-thread items: 38461.5 MB/s
mem_xor(1 MB) across 64 work-thread items: 35714.3 MB/s
mem_xor(1 MB) across 128 work-thread items: 16393.4 MB/s
mem_xor(1 MB) across 256 work-thread items: 8403.36 MB/s
mem_xor(1 MB) across 512 work-thread items: 4219.41 MB/s
mem_xor(1 MB) across 1024 work-thread items: 2145.92 MB/s
mem_xor(1 MB) across 2048 work-thread items: 1605.14 MB/s
mem_xor(1 MB) across 4096 work-thread items: 841.043 MB/s
mem_xor(1 MB) single-threaded: 14492.8 MB/s

In this case, multi-threading is up to 2.76x faster than a single thread.

For 1 KB data chunks and 10,000 XOR operations (10 MB total):

mem_xor(10 MB) across 1 work-thread items: 6988.12 MB/s
mem_xor(10 MB) across 2 work-thread items: 13315.6 MB/s
mem_xor(10 MB) across 3 work-thread items: 11049.7 MB/s
mem_xor(10 MB) across 4 work-thread items: 15479.9 MB/s
mem_xor(10 MB) across 5 work-thread items: 17889.1 MB/s
mem_xor(10 MB) across 6 work-thread items: 18484.3 MB/s
mem_xor(10 MB) across 7 work-thread items: 20964.4 MB/s
mem_xor(10 MB) across 8 work-thread items: 21598.3 MB/s
mem_xor(10 MB) across 9 work-thread items: 22222.2 MB/s
mem_xor(10 MB) across 10 work-thread items: 19920.3 MB/s
mem_xor(10 MB) across 11 work-thread items: 19607.8 MB/s
mem_xor(10 MB) across 12 work-thread items: 20661.2 MB/s
mem_xor(10 MB) across 13 work-thread items: 18248.2 MB/s
mem_xor(10 MB) across 14 work-thread items: 19646.4 MB/s
mem_xor(10 MB) across 15 work-thread items: 20366.6 MB/s
mem_xor(10 MB) across 16 work-thread items: 20618.6 MB/s
mem_xor(10 MB) across 32 work-thread items: 21097 MB/s
mem_xor(10 MB) across 64 work-thread items: 20284 MB/s
mem_xor(10 MB) across 128 work-thread items: 19607.8 MB/s
mem_xor(10 MB) across 256 work-thread items: 18867.9 MB/s
mem_xor(10 MB) across 512 work-thread items: 18050.5 MB/s
mem_xor(10 MB) across 1024 work-thread items: 16638.9 MB/s
mem_xor(10 MB) across 2048 work-thread items: 9017.13 MB/s
mem_xor(10 MB) across 4096 work-thread items: 4426.74 MB/s
mem_xor(10 MB) single-threaded: 8635.58 MB/s

In this case, multi-threading is up to 2.57x faster than a single thread.

For 4 KB data chunks and 10,000 XOR operations (40 MB total):

mem_xor(40 MB) across 1 work-thread items: 6767.04 MB/s
mem_xor(40 MB) across 2 work-thread items: 12694.4 MB/s
mem_xor(40 MB) across 3 work-thread items: 15625 MB/s
mem_xor(40 MB) across 4 work-thread items: 13110.5 MB/s
mem_xor(40 MB) across 5 work-thread items: 14798.4 MB/s
mem_xor(40 MB) across 6 work-thread items: 15655.6 MB/s
mem_xor(40 MB) across 7 work-thread items: 15898.3 MB/s
mem_xor(40 MB) across 8 work-thread items: 15606.7 MB/s
mem_xor(40 MB) across 9 work-thread items: 15829 MB/s
mem_xor(40 MB) across 10 work-thread items: 15717.1 MB/s
mem_xor(40 MB) across 11 work-thread items: 16051.4 MB/s
mem_xor(40 MB) across 12 work-thread items: 15558.1 MB/s
mem_xor(40 MB) across 13 work-thread items: 14908.7 MB/s
mem_xor(40 MB) across 14 work-thread items: 15319.8 MB/s
mem_xor(40 MB) across 15 work-thread items: 15612.8 MB/s
mem_xor(40 MB) across 16 work-thread items: 15290.5 MB/s
mem_xor(40 MB) across 32 work-thread items: 15564.2 MB/s
mem_xor(40 MB) across 64 work-thread items: 15366.9 MB/s
mem_xor(40 MB) across 128 work-thread items: 15378.7 MB/s
mem_xor(40 MB) across 256 work-thread items: 15612.8 MB/s
mem_xor(40 MB) across 512 work-thread items: 15521.9 MB/s
mem_xor(40 MB) across 1024 work-thread items: 15331.5 MB/s
mem_xor(40 MB) across 2048 work-thread items: 14853.3 MB/s
mem_xor(40 MB) across 4096 work-thread items: 14154.3 MB/s
mem_xor(40 MB) single-threaded: 7142.86 MB/s

In this case, multi-threading is up to 2.25x faster than a single thread.

For 64 KB data chunks and 10,000 XOR operations (640 MB total):

mem_xor(640 MB) across 1 work-thread items: 7160.04 MB/s
mem_xor(640 MB) across 2 work-thread items: 13449.3 MB/s
mem_xor(640 MB) across 3 work-thread items: 16602.2 MB/s
mem_xor(640 MB) across 4 work-thread items: 13701 MB/s
mem_xor(640 MB) across 5 work-thread items: 15355.1 MB/s
mem_xor(640 MB) across 6 work-thread items: 15681.7 MB/s
mem_xor(640 MB) across 7 work-thread items: 16312 MB/s
mem_xor(640 MB) across 8 work-thread items: 16332.8 MB/s
mem_xor(640 MB) across 9 work-thread items: 16275 MB/s
mem_xor(640 MB) across 10 work-thread items: 16244.9 MB/s
mem_xor(640 MB) across 11 work-thread items: 16197.2 MB/s
mem_xor(640 MB) across 12 work-thread items: 16078.8 MB/s
mem_xor(640 MB) across 13 work-thread items: 15832.6 MB/s
mem_xor(640 MB) across 14 work-thread items: 15799.7 MB/s
mem_xor(640 MB) across 15 work-thread items: 16243.7 MB/s
mem_xor(640 MB) across 16 work-thread items: 16321.5 MB/s
mem_xor(640 MB) across 32 work-thread items: 16173.1 MB/s
mem_xor(640 MB) across 64 work-thread items: 16193.1 MB/s
mem_xor(640 MB) across 128 work-thread items: 16147.8 MB/s
mem_xor(640 MB) across 256 work-thread items: 16136.4 MB/s
mem_xor(640 MB) across 512 work-thread items: 16150.6 MB/s
mem_xor(640 MB) across 1024 work-thread items: 16098.2 MB/s
mem_xor(640 MB) across 2048 work-thread items: 16117.7 MB/s
mem_xor(640 MB) across 4096 work-thread items: 16006.4 MB/s
mem_xor(640 MB) single-threaded: 7181.01 MB/s

In this case, multi-threading is up to 2.31x faster than a single thread.
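The speedup figures quoted after each run are the best multi-threaded throughput divided by the single-threaded throughput from the same run. A small sketch of that arithmetic follows, using the numbers from the four runs above; the `Run` struct and function names are invented for the example.

```cpp
#include <cassert>
#include <cstdio>

// Best multi-threaded result and single-threaded result for one run, in MB/s.
struct Run {
    const char* chunk;
    double best_mbps;
    double single_mbps;
};

// Values copied from the benchmark output above.
const Run kRuns[] = {
    {"100 B", 40000.0, 14492.8},
    {"1 KB", 22222.2, 8635.58},
    {"4 KB", 16051.4, 7142.86},
    {"64 KB", 16602.2, 7181.01},
};

// Speedup of the best multi-threaded configuration over one thread.
double speedup(const Run& run) {
    assert(run.single_mbps > 0.0);
    return run.best_mbps / run.single_mbps;
}

// Print one line per run, with the speedup rounded to two decimals.
void print_speedups() {
    for (const Run& run : kRuns) {
        std::printf("%-6s %.2fx\n", run.chunk, speedup(run));
    }
}
```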

Posted June 5, 2017

Author: Christopher A Taylor (catid)
Development blog for Christopher A Taylor (catid), systems software engineer at Oculus/Facebook. Focus on erasure correction coding (ECC/FEC), cryptography, networking, lossless image compression.
