Monthly Archives: October 2016

spingate

1 Building a simple spin gate in C++

This is a very simplified latch which allows several threads to block and busywait until they are released to begin work in parallel. I’m using this to do some multi-thread stress testing, where I want to maximize the overlapping work in order to check for suprising non-deterministic behavior. There’s no count associated with the gate, nor is it reuseable. All we need is the one bit of information, weather the gate is open or closed.

The simplest atomic boolean is the std::atomic_flag. It’s guaranteed by standard to be lock free. Unfortunately, to provide that guarantee, it provides no way to do an atomic read or write to the flag, just a clear, and a set and return prior value. This isn’t rich enough for multiple threads to wait, although it is enough to implement a spinlock.

std::atomic<bool> isn’t guaranteed to be lock free, but in practice, any architecture I’m interested has at least 32bit aligned atomics that are lock free. Older hardware, such as ARMv5, SparcV8, and 80386 are missing cmpxchg, so loads are generally implemented with a lock in order to maintain the guarantees if there were a simultaneous load and exchange. See, for example, LLVM Atomic Instructions and Concurrency Guide. Modern ARM, x86, Sparc, and Power chips are fine.

When the spin gate is constructed, we’ll mark the gate as closed. Threads will then wait on the flag, spinning until the gate is opened. For this we use Release-Acquire ordering between the open and wait. This will ensure any stores done before the gate is opened will be visible to the thread waiting.

// spingate.h                                                       -*-C++-*-
#ifndef INCLUDED_SPINGATE
#define INCLUDED_SPINGATE

#include <atomic>
#include <thread>

class SpinGate
{
    std::atomic_bool flag_;

  public:
    SpinGate();
    void wait();
    void open();
};

inline
SpinGate::SpinGate() {
    // Close the gate
    flag_.store(true, std::memory_order_release);
}

inline
void SpinGate::wait() {
    while (flag_.load(std::memory_order_acquire))
        ; // spin
}

inline
void SpinGate::open() {
    flag_.store(false, std::memory_order_release);
    std::this_thread::yield();
}


#endif
#include "spingate.h"
// Test that header is complete by including

Using a SpinGate is fairly straightfoward. Create an instance of SpinGate and wait() on it in each of the worker threads. Once all of the threads are created, open the gate to let them run. In this example, I sleep for one second in order to check that none of the worker threads get past the gate before it is opened.

The synchronization is on the SpingGate’s std::atomic_bool, flag_. The flag_ is set to true in the constructor, with release memory ordering. The function wait() spins on loading the flag_ with acquire memory ordering, until open() is called, which sets the flag_ to false with release semantics. The other threads that were spinning may now proceed. The release-acquires ordering ensures that happens-before writes by the thread setting up the work and calling open will be read by the threads that were spin waiting.

1.1 Update:

I’ve added a yield() call after the open, hinting that the thread opening the gate may be rescheduled and allow the other threads to run. On the system I’m testing on, it doesn’t seem to make a difference, but it does seem like the right thing to do, and it does not seem to hurt.

#include "spingate.h"

#include <vector>
#include <chrono>
#include <thread>
#include <iostream>


int main()
{
    std::vector<std::thread> workers;
    SpinGate gate;
    using time_point = std::chrono::time_point<std::chrono::high_resolution_clock>;
    time_point t1;
    auto threadCount = std::thread::hardware_concurrency();
    std::vector<time_point> times;
    times.resize(threadCount);

    for (size_t n = 0; n < threadCount; ++n) {
        workers.emplace_back([&gate, t1, &times, n]{
                gate.wait();
                time_point t2 = std::chrono::high_resolution_clock::now();
                times[n] = t2;
            });
    }

    std::cout << "Open the gate in 1 second: " << std::endl;
    using namespace std::chrono_literals;
    std::this_thread::sleep_for(1s);
    t1 = std::chrono::high_resolution_clock::now();
    gate.open();

    for (auto& thr : workers) {
        thr.join();
    }

    int threadNum = 0;
    for (auto& time: times) {
        auto diff = std::chrono::duration_cast<std::chrono::nanoseconds>(time - t1);
        std::cout << "Thread " << threadNum++ << " waited " << diff.count() << "ns\n";
    }
}

I’d originally had the body of the threads just spitting out that they were running on std::cout, and the lack of execution before the gate, plus the overlapping output, being evidence of the gate working. That looked like:

for (std::size_t n = 0; n < std::thread::hardware_concurrency(); ++n) {
    workers.emplace_back([&gate, n]{
            gate.wait();
            std::cout << "Output from gated thread " << n << std::endl;
        });
}

The gate is captured in the thread lambda by reference, the thread number by value, and when run, overlapping gibberish is printed to the console as soon as open() is called.

But then I became curious about how long the spin actually lasted. Particularly since the guarantees for atomics with release-acquire semantics, or really even sequentially consistent, are about once a change is visible, that changes before are also visible. It’s really a function of the underlying hardware how fast the change is visible, and what are the costs of making the happened-before writes available. I’d already observed better overlapping execution using the gate, as opposed to just letting the threads run, so for my initial purposes of making contention more likely, I was satisfied. Visibility, on my lightly loaded system, seems to be in the range of a few hundred to a couple thousand nanoseconds, which is fairly good.

Checking how long it took to start let me do two thing. First, play with the new-ish chrono library. Second, check that the release-acquire sync is working the way I expect. The lambdas that the threads are running capture the start time value by reference. The start time is set just before the gate is opened, and well after the threads have started running. The spin gate’s synchronization ensures that if the state change caused by open is visible, the setting of the start time is also visible.

Here are one set of results from running a spingate:

Open the gate in 1 second: 
Thread 0 waited 821ns
Thread 1 waited 14490ns
Thread 2 waited 521ns
Thread 3 waited 817ns

2 Comparison with Condition Variable gate

The normal way of implementing a gate like this is with a condition variable, associated mutex, and a plain bool. The mutex guarantees synchronization between the wait and open, rather than the atomic variable in the SpinGate. The unlock/lock pair in mutex have release-acquire semantics. The actual lock and unlock are done by the unique_lock guard.

// cvgate.h                                                           -*-C++-*-
#ifndef INCLUDED_CVGATE
#define INCLUDED_CVGATE

#include <mutex>
#include <condition_variable>
#include <atomic>
#include <thread>

class CVGate
{
    std::mutex lock_;
    std::condition_variable cv_;
    bool flag_;

  public:
    CVGate();

    void wait();
    void open();
};

inline
CVGate::CVGate()
: lock_(),
  cv_(),
  flag_(true)
{}

inline
void CVGate::wait() {
    std::unique_lock<std::mutex> lk(lock_);
    cv_.wait(lk, [&](){return !flag_;});
}

inline
void CVGate::open() {
    std::unique_lock<std::mutex> lk(lock_);
    flag_ = false;
    cv_.notify_all();
    std::this_thread::yield();
}
#endif
#include "cvgate.h"
// Test that header is complete by including

This has the same interface as SpinGate, and is used exactly the same way.

Running it shows:

Open the gate in 1 second: 
Thread 0 waited 54845ns
Thread 1 waited 76125ns
Thread 2 waited 91977ns
Thread 3 waited 128816ns

That the overhead of the mutex and condition variable is significant. On the other hand, the system load while it’s waiting is much lower. Spingate will use all available CPU, while CVGate yields, so useful work can be done byu the rest of the system.

However, for the use I was originally looking at, releasing threads for maximal overlap, spinning is clearly better. There is much less overlap as the cv blocked threads are woken up.

3 Building and Running

This is a minimal CMake file for building with the system compiler.

cmake_minimum_required(VERSION 3.5)
set(CMAKE_LEGACY_CYGWIN_WIN32 0)

project(SpinGate C CXX)

set(THREADS_PREFER_PTHREAD_FLAG ON)
find_package(Threads REQUIRED)

set(CMAKE_EXPORT_COMPILE_COMMANDS ON)

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++14 -ftemplate-backtrace-limit=0")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wall -Wextra -march=native")
set(CMAKE_CXX_FLAGS_DEBUG "-O0 -fno-inline -g3")
set(CMAKE_CXX_FLAGS_RELEASE "-Ofast -g0 -DNDEBUG")

include_directories(${CMAKE_CURRENT_SOURCE_DIR})

add_executable(spingate main.cpp spingate.cpp)
add_executable(cvgate cvmain.cpp cvgate.cpp)
target_link_libraries(spingate Threads::Threads)
target_link_libraries(cvgate Threads::Threads)

And here we build a release version of the test executable:

mkdir -p build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ../
make
-- Configuring done
-- Generating done
-- Build files have been written to: /home/sdowney/src/spingate/build
[ 50%] Built target cvgate
[100%] Built target spingate

4 Org-mode source and git repo

Exported from an org-mode doc. All of the source is available on github at SpinGate

C++ code in Org-mode

1 C++

This is a simple example with one c++ file, all in one src code block.

#include <iostream>

int main()
{
    std::cout << "Hello World++! 2.9";
}

The src block looks like:

    #+HEADERS: :tangle hello.c++ :exports code :eval never
    #+BEGIN_SRC C++
    // source code
    #+END_SRC

The HEADERS block is on a separate line because when the buffer is evaluated, code will get run, and the SRC blocks will get rewritten, as well as the RESULTS blocks. Since we want the headers to be preserved, we can’t make them part of the SRC block.

You can also tangle a Makefile.

clean:
    -rm hello

hello:  hello.c++
    clang++ hello.c++ -o hello

With the org mode headers exporting the code, and not evaluating this block. Just like the C++ code. We’ll evaluate the makefile, and run the program, a bit further down.

    #+HEADERS: :tangle Makefile :exports code :eval never
    #+BEGIN_SRC makefile
    # Makefile
    #+END_SRC

Now, we tangle the code out to the files.

    #+NAME: tangle-buffer
    #+HEADERS: :exports both :results value
    #+BEGIN_SRC emacs-lisp
    (org-babel-tangle)
    #+END_SRC
(org-babel-tangle)

That will write out two files when the buffer is evaluated using org-babel-execute-buffer, bound to \C-c \C-v b.

Give this block a name, so that where the results go can be controlled. Do that by giving the RESULTS block the name of the SRC block. Org will then produce a table of the results of executing the elisp, which is the two files produced.

And put the results here:

hello.c++ Makefile

Next, we run make with the target to compile the code. You could also simply write the compiler command here.

    #+NAME: make-clean-hello
    #+BEGIN_SRC sh :exports both :results output
    make clean
    make hello
    #+END_SRC
make clean
make hello

And make will run our compilation as spec’d in the Makefile we just tangled out.

rm hello
Makefile:2: recipe for target 'clean' failed
clang++ hello.c++ -o hello

And now get the output by running the program.

    #+NAME: run-hello
    #+BEGIN_SRC sh :exports results
    ./hello
    #+END_SRC

Which prints out our hello, world text. Which is has version number to convince myself it gets updated.

Hello World++! 2.9

2 Raw Document

Org mode in org is a little tricky, since to show the examples, you have to quote them.

Original document is available on GitHub

Building Emacs 25.1 on Ubuntu 16.10

1 Why notes

Making notes so I don’t forget, although the key problem is fixed upstream.

Ubuntu 16.10 (Yakkety Yak) has made a critical change to the system compiler, and everything is by default built with position independent executable support (PIE), in order to get better support for address space layout randomization. Here are the security notes for PIE. Emacs does not dump successfully with this. The compiler option -no-pie needs to be added into CLFAGS.

The option also means that static libraries you’ve built before will probably need to be rebuilt. See the link above for typical errors.

2 Getting Ready

First get dpkg-dev, g++, gcc, libc, make:

sudo apt-get install build-essentials

Then get the full set of build dependencies for last emacs, emacs24:

sudo apt-get build-dep emacs24

Decide if you want to build just this version, or track emacs. I track from git, because. So I have a directory for emacs where I have master, emacs25, and build directories. I try to avoid building in src dirs. It makes it easier to try out different options without polluting the src.

mkdir -p ~/bld/emacs
cd ~/bld/emacs
git clone git://git.savannah.gnu.org/emacs.git
cd emacs.git
git worktree add ../emacs-25.1 emacs-25.1
cd ..
mkdir bld-25.1

3 Configure and build with magic option

Now configure in the build directory:

cd bld-25.1
../emacs-25.1/configure \
  --prefix=~/install/emacs-25.1 \
  --with-x-toolkit=gtk3 \
  --with-xwidgets \
  CFLAGS=-no-pie

I built with xwidget support to play with the embedded webkit widget. It’s not really useable as a browser, but has uses for rendering. I also install into a local program directory, under my homedir.

Build and install:

make
make install

I have a bin directory early in $PATH so that I can select versions of local software ahead of system software.

cd ~/bin
ln -s ~/install/emacs-25.1/bin/emacs
ln -s ~/install/emacs-25.1/bin/emacsclient

Now you should have a working emacs 25.1 available.