Monthly Archives: June 2018

Litmus Tests for Multithreaded Behavior

Litmus Tests for Multithreaded Behavior

Or How Processors Don’t Do What You Think

Modern multicore processors are entirely weirder than almost anyone thinks possible. They are somewhat weirder than chip makers were willing to admit until fairly recently. They are sufficiently weird enough that almost all multi-threaded programs, and many lock-free algorithms, had bugs because the hardware just does not work the way anyone would reasonably expect. And, in addition, optimizing compilers did not actually know how to not break your code. [boehm2005threads]

I’m in the (slow) process of writing some test cases for multi-threaded code. I eventually want to have some confidence that some code is executed once, and only once, in an efficient manner. It’s the ‘efficient’ part that worries me, because for efficient, it also has a tendency to be clever, and I’m learning that clever MT code is often subtly broken. [bacon2000double] So if smarter people than me make mistakes about MT code, I need tests to compensate. And ones that will cover occasional allowed but unexpected behavior. Which means the test framework should be able to detect them.

Also, fortunately, the RPi is a computer that exhibits some odd behavior, as it is an ARM system. X86 has a much stronger model. However, even the x86 model is allowed to perform in odd ways.

Starting in 2007, Intel has started publishing short snippets of assembly and documenting what are the allowed and disallowed results running them in parallel. [IWPAug2007] These snippets have come to be called litmus tests, and are used to confirm the behavior of hardware, and confirm models of the hardware behavior. A particularly important model for C++ programmers is the x86 Total Store Order model [owens2009better] which provides a way of mapping the C++11 memory model to X86 hardware. X86 hardware provides a strongly consistent memory model. Power and ARM provide fewer guarantees, and mapping the C++ memory model to these architectures is more challenging. [maranget2012tutorial]

Message Passing

The tests outlined in the Intel paper are short pieces of assembly to be exercised on different processors, with guarantees about behavior that will not happen. The first one essentially promises that message passing will work, and is now known as the MP test.

Processor 0 Processor 1
mov [ _x], 1 // M1 mov r1, [ _y] // M3
mov [ _y], 1 // M2 mov r2, [ _x] // M4

Initially x = y = 0

r1 = 1 and r2 = 0 is not allowed

That says that we can’t read the writes of x and y out of order, which ensures that if we’re waiting to see the write to the flag y, we can be guaranteed to see the payload in x. If we change it slightly to wait for the write to _y to be visible, we can pass a message from one thread to anothher in _x. This is also known as Dekards Algorithm.

ARM and Power do not provide that guarantee without additional synchronization instructions.

In C++ that looks something like the following, using the test framework I’m writing.

MP::MP() : x_(0), y_(0) {}
void MP::t1() {
    x_.store(1, std::memory_order_relaxed);
    y_.store(1, std::memory_order_relaxed);
}
void MP::t2(Result& read) {
    while (!y_.load(std::memory_order_relaxed)){}
    std::get<0>(read) = x_.load(std::memory_order_relaxed);
}

Here, x_ and y_ are atomic<int>s, and we’re using the lowest possible atomic guarantee in C++, relaxed. Relaxed guarantees that the operation happens atomically. but there are no synchronization properties with anything else. This usally corresponds to the basic int type. Unless you’re using a really insane processor that might let an int be partially written and observable. Like you might get the top half of the int, or the middle byte. The commercial processors that allowed this have pretty much died out.

The test spins on seeing the load of y to be complete. It loads the value of x_ into a result tuple. The tuple is used as the key to a map which accumulates how many times each result has been seen.
Running the above on my x86 laptop:

[ RUN      ] ExperimentTest.MPTest1
(1) : 2000000

on my Raspberry Pi 3:

[ RUN      ] ExperimentTest.MPTest1
(0) : 483
(1) : 1999517

Using objdump to check the generated assembly

00000088 <litmus::MP::MP()>:
  88:   mov     r2, #0
  8c:   str     r2, [r0]
  90:   str     r2, [r0, #64]   ; 0x40
  94:   bx      lr

00000098 <litmus::MP::t1()>:
  98:   mov     r3, #1
  9c:   str     r3, [r0]
  a0:   str     r3, [r0, #64]   ; 0x40
  a4:   bx      lr

000000a8 <litmus::MP::t2(std::tuple<int>&)>:
  a8:   ldr     r3, [r0, #64]   ; 0x40
  ac:   cmp     r3, #0
  b0:   beq     a8 <litmus::MP::t2(std::tuple<int>&)>
  b4:   ldr     r3, [r0]
  b8:   str     r3, [r1]
  bc:   bx      lr

So, out of the 2,000,000 times that I ran the experiment, there were 483 times that reading x_ resulted in 0, even though y_ was 1. ARM has a weaker memory model than x86. This has some advantages in processor implementation. It has distinct disadvantages in how our brains work. X86 tries to preserve the model that there is shared memory that everyone sees and works with. That’s not strictly true, even for X86, but ARM and Power don’t even come close. On the other hand, it’s also why it’s easier to add more cores to Power and ARM chips and systems. I routinely work with Power systems with 512 physical cores.

Store Buffering

Store buffering is the odd case that is allowed in the Intel memory model. When assigning locations in two threads, and then reading them on opposite threads, both threads are allowed to read the older state. The stores get buffered.
From the Intel White Paper:

Processor 0 Processor 1
mov [ _x], 1 // M1 mov [ _y], 1 // M3
mov r1, [ _y] // M2 mov r2, [ _x] // M4

Initially x = y = 0

r1 = 0 and r2 ==0 is allowed

Note, in particular, there is no interleaving of M1 – 4 that could result in r1 and r2 being 0. Not without interupting an instruction in the middle. But the instructions themselves are atomic, and indivisible. If they were actually operating on shared memory, this would not be possible. However, it does happen.

SB::SB() : x_(0), y_(0) {}
void SB::t1(Result& read) {
    y_.store(1, std::memory_order_relaxed);
    std::get<0>(read) = x_.load(std::memory_order_relaxed);
}
void SB::t2(Result& read) {
    x_.store(1, std::memory_order_relaxed);
    std::get<1>(read) = y_.load(std::memory_order_relaxed);
}

That generates the x86 code

00000000000000f0 <litmus::SB::t1(std::__1::tuple<int, int>&)>:
  f0:   mov    DWORD PTR [rdi+0x40],0x1
  f7:   mov    eax,DWORD PTR [rdi]
  f9:   mov    DWORD PTR [rsi],eax
  fb:   ret

0000000000000100 <litmus::SB::t2(std::__1::tuple<int, int>&)>:
 100:   mov    DWORD PTR [rdi],0x1
 106:   mov    eax,DWORD PTR [rdi+0x40]
 109:   mov    DWORD PTR [rsi+0x4],eax
 10c:   ret

And on my x86 machine:

[ RUN      ] ExperimentTest.SBTest1
(0, 0) : 559
(0, 1) : 999858
(1, 0) : 999576
(1, 1) : 7

So 559 times neither core saw the other core’s store.

Load Buffering

Load Buffering is the dual of store buffering. Loads into registers might be delayed, or buffered, and actually performed after following instructions. It’s not allowed in the Intel architecture.

From the Intel White Paper

Processor 0 Processor 1
mov r1, [ _x] // M1 mov r2, [ _y] // M3
mov [ _y], 1 // M2 mov [ _x], 1 // M4

Initially x = y = 0

r1 = 1 and r2 = 1 is not allowed

LB::LB() : x_(0), y_(0) {}
void LB::t1(Result& read) {
    std::get<0>(read) = x_.load(std::memory_order_relaxed);
    y_.store(1, std::memory_order_relaxed);
}
void LB::t2(Result& read) {
    std::get<1>(read) = y_.load(std::memory_order_relaxed);
    x_.store(1, std::memory_order_relaxed);
}

This is the x86 asm code

00000000000000c0 <litmus::LB::t1(std::__1::tuple<int, int>&)>:
  c0:   mov    eax,DWORD PTR [rdi]
  c2:   mov    DWORD PTR [rsi],eax
  c4:   mov    DWORD PTR [rdi+0x40],0x1
  cb:   ret
  cc:   nop    DWORD PTR [rax+0x0]

00000000000000d0 <litmus::LB::t2(std::__1::tuple<int, int>&)>:
  d0:   mov    eax,DWORD PTR [rdi+0x40]
  d3:   mov    DWORD PTR [rsi+0x4],eax
  d6:   mov    DWORD PTR [rdi],0x1
  dc:   ret
  dd:   nop    DWORD PTR [rax]

And the ARM code, at -O1

000000d0 <litmus::LB::t1(std::tuple<int, int>&)>:
  d0:   ldr     r3, [r0]
  d4:   str     r3, [r1, #4]
  d8:   mov     r3, #1
  dc:   str     r3, [r0, #64]   ; 0x40
  e0:   bx      lr

000000e4 <litmus::LB::t2(std::tuple<int, int>&)>:
  e4:   ldr     r3, [r0, #64]   ; 0x40
  e8:   str     r3, [r1]
  ec:   mov     r3, #1
  f0:   str     r3, [r0]
  f4:   bx      lr

ARM generally allows it, but per [maranget2012tutorial] it’s very sensitive, and dependencies will make it not appear. In my tests, I did not observe an instance of a buffering, but it may be due to the first store the compiler introduces, in order to actually get the data into the tuple. That it’s documented as possible is still exceedingly strange.

Independent Reads of Independent Writes

IRIW is a generalization of store buffering, where two reader threads each read different apparent orderings of writes from two distinct writer threads.

T1 T2 T3 T4
X = 1 Y = 1 R1 = X R3 = y
R2 = Y R4 = X

Initially X=Y=0
Allowed in ARM, not in x86 r1=1, r2=0, r3=1, r4=0 [maranget2012tutorial,owens2009better]

This is not observed in x86 processors, but is in some ARM and POWER, more often in POWER. X86 hardware has a consistent view of memory where other hardware can see memory writes in different orders on different threads. On my rPi, I didn’t observe any incidents of X and Y being read out of order, over 40 million runs.

IRIW::IRIW() : x_(0), y_(0) {}
void IRIW::t1() {
    x_.store(1, std::memory_order_relaxed);
}

void IRIW::t2() {
    y_.store(1, std::memory_order_relaxed);
}

void IRIW::t3(Result& read) {
    std::get<0>(read) = x_.load(std::memory_order_relaxed);
    std::get<1>(read) = y_.load(std::memory_order_relaxed);
}

void IRIW::t4(Result& read) {
    std::get<2>(read) = y_.load(std::memory_order_relaxed);
    std::get<3>(read) = x_.load(std::memory_order_relaxed);
}

Summary

The allowed behavior of modern processors is very different than our mental model of a Von Neumann architecture computer. Each core can have a different view of memory, and without additional controls, writes and reads can break the illusion of a single unified memory. The C++ memory model gives the controls and guarantees about what happens when different threads read and write memory, and here I’ve deliberately used the weakest version available, relaxed, in order to allow the processors the wideest latitude in behavior. Relaxed is, for processors that have it, often just an unconstrained int, which means that you will get odd behavior if you are running shared state multithreaded code that uses plain native types. It is a particular problem with code that was originally written and tested on a x86 architecture because the native model is fairly strong. This frequently causes problems when porting to a mobile platform, where ARM is a very popular hardware choice.

Org-mode source and git repo

Exported from an org-mode doc. All of the source is available on github at SpinGate

References

Bibliography

  • [boehm2005threads] Boehm, Threads cannot be implemented as a library, 261-268, in in: ACM Sigplan Notices, edited by (2005)
  • [bacon2000double] @miscbacon2000double,
    title=The “double-checked locking is broken” declaration,
    author=Bacon, David and Bloch, Joshua and Bogda, Jeff and Click, Cliff and Haahr, Paul and Lea, Doug and May, Tom and Maessen, Jan-Willem and Mitchell, JD and Nilsen, Kelvin and others,
    year=2000
  • [IWPAug2007] @miscIWPAug2007,
    howpublished =
    \urlhttp://www.cs.cmu.edu/~410-f10/doc/Intel_Reordering_318147.pdf,
    note = Accessed: 2017-04-30,
    title = Intel® 64 Architecture Memory Ordering
    White Paper,
  • [owens2009better] Owens, Sarkar & Sewell, A better x86 memory model: x86-TSO, 391-407, in in: International Conference on Theorem Proving in Higher Order Logics, edited by (2009)
  • [maranget2012tutorial] Maranget, Sarkar & Sewell, A tutorial introduction to the ARM and POWER relaxed memory models, Draft available from http://www. cl. cam. ac. uk/\~ pes20/ppc-supplemental/test7. pdf, (2012).

Date: 2017-04-30 Sun 00:00

Author: Steve Downey

Created: 2018-06-17 Sun 14:07

Validate

 

Multithread Experiments

An Experiment Collects Samples

I’m modelling this in order to run bits of code like the various litmus tests used to describe multi-core architectures. A set of functions to be run in parallel that may or may not write to a result, which type is a property of the Test being run. The Experiment will run the Test collecting Samples. The Test type will provide a tuple of functions to run. They will be run under a spingate in all permutations in order to remove scheduling bias.

What a Test looks like

class MP { // Message Passing
    int x_;
    int y_;

  public:
    typedef std::tuple<int> Result;
    MP();
    void t1();
    void t2(Result& read);

    auto actions() {
        return std::make_tuple([this]() { t1(); },
                               [this](Result& result) { t2(result); });
    }
};

The Test interface must provide a Result type, and an actions() member that will produce a tuple of functions to run which either take no arguments or a reference to a result.

The test being defined here is the basic Message Passing litmus test.

MP::MP() : x_(0), y_(0) {}

void MP::t1() {
    x_ = 1;
    y_ = 1;
}

void MP::t2(Result& read) {
    while (!y) {
    }
    std::get<0>(read) = x_;
}

Two variables are initialized to 0. One thread stores 1 to x first, then to 1 to y. The other thread loops until it reads a non-zero in y, and then reads x. The value in x is the message being passed between threads.

In an actual test, the variables would be atomics, specifiying load and store strength, and the variables might have constraints on layout to help sharing cache line updates.

An Experiment

An Experiment samples a test a number of times. It takes the result of each sample, and puts in a map of the results to count, incrementing the count for each distinct result. The actions to run are permuted each time, to help remove bias about which action is loaded behind the spingate first.

void Experiment::run(size_t count) {
    using Actions = decltype(std::declval<Test>().actions());
    auto getters = tupleutil::tuple_getters<Actions>();
    for (size_t i = 0; i < count; ++i) {
        Sample<Test> sample;
        sample.run(getters);
        resultMap_[sample.result_]++;
        std::next_permutation(getters.begin(), getters.end());
    }
}

tupleutil::tuple_getters returns an array of getters each of which returns a std::variant<Types…> with the same parameter pack as the tuple.

Sample runs all of the actions in a batch that locks them behind a spingate, and collects the results for each action.

template <class Test> class Sample {
  public:
    Batch                 batch_;
    Test                  test_;
    typename Test::Result result_;

    template <typename V, size_t I> void run(std::array<V, I> const& getters) {
        auto const& actions = test_.actions();
        add(actions, getters);
        batch_.run();
    }
};

Add is a templated member function that loops over the array, uses the getter to pull a function out of the tuple of actions and visits that with a lambda that will add either the function with no arguments, or that function with a reference to the results, to the batch.

    template <typename Tuple, typename Variant, size_t I>
    void add(Tuple const& actions, std::array<Variant, I> const& getters) {
        auto adder = [this](auto&& f) {
            using F = std::remove_cv_t<std::remove_reference_t<decltype(f)>>;
            if constexpr (std::is_invocable_v<F>) {
                batch_.add(f);
            } else {
                batch_.add(f, std::ref(result_));
            }
        };
        for (auto&& get_n : getters) {
            std::visit(adder, get_n(actions));
        }
        return;
    }

I am a bit dissatisfied with the else case not being constexpr if followed by a static assert, but getting the condition right didn’t work the obvious way, so I punted. There will be a compiler error if f(result_) can’t actually be called by the batch.

Batch recapped:

The key bit of code is

template <class Function, class... Args>
void Batch::add(Function&& f, Args&&... args) {
    workers_.emplace_back([ this, f = std::forward<Function>(f), args... ]() {
            gate_.wait();
            f(args...);
        });
}

Batch has a spingate and runs all of the functions that are added sitting behind it. The run() function opens the gate and joins all the worker threads.

void Batch::run() {
    gate_.open();
    for (auto& thr : workers_) {
        thr.join();
    }
}

Summary

With all the machinery in place, the test infrascructure can aggressively run multi-threaded tests, giving the thread scheduler the best opportunity to run all of the actions in any order. This allows multi thread bugs to be shaken out by looking for surprising results from the experiment.

Source Code

Exported from an org-mode doc, experiment.org, which is available, with all of the source on github at SpinGate.