Cross Compiling

1 Setting up Cross Compiling

In order to test out some of these multi-threaded tool properly, I really need to run them on a less strict platform than x86_64. X86_64 provides a lot of guarantees about sequential consistency and atomicity that hides problems that will happen on architectures that are not as strong, like power, sparc, and arm. Fortunately, one of the toys I have is a recent Raspberry Pi 3, which is based on a recent arm chip. Unfortunately, Raspbian, the normal linux distro for the Raspberry Pi is also based on a fairly old debian distro, with a fairly old compiler. Linaro is back porting their arm code genaration fixes to the old releases, but I’m more interested in the recent C++ language features. So I could attempt to compile GCC 6 on the RPi, or I can cross compile from my normal machine. I decided to cross compile, since if that worked, it would be considerably easier. It turnd out to be pretty straightfoward.

sudo apt-get install g++-6-arm-linux-gnueabihf

This is mostly because I’m already doing software development on the box, so I didn’t need any of the other parts of the compiler ecosystem, just the right c++ toolchain. The hardest part is determining the right one. There are a few flavors for arm development. The RPi is the gnu extended abi, with hardware float. The Ubuntu repositories only supply linux variants, which is sensible. Since that top level package ends up installing not just the compilers, but a libstdc++ and libc for arm-linux-gnueabihf, which need to know much more about the OS in order to interface with it.

This does lead to one snag, though. The versions of the libraries are not the ones available on the RPi. Which is a problem, since I want to use modern, or maybe even post-modern C++. There are two ways of dealing with this, and I’ve ended up using both.

2 Sysroot

When cross compiling, a sysroot is a system that looks just like the root file system of the target platform. It will have /lib, /usr/lib, etc, with the versions of the libraries that you want. You can either use a disk image, mounted somewhere convienent, or you can just mount the target computer’s root filesystem somewhere convienent. If you do that, you’ll have access to all of the libraries available, not just the minimal set typically available on a prepackaged sysroot. So that’s what I did.

sshfs sdowney@cobweb.local:/ /home/sdowney/mnt/rpi/ -o transform_symlinks -o allow_other

Cobweb is my Raspberry Pi box, and zeroconf makes the current ip address available as cobweb.local. I’m mounting that into ~/mnt/rpi, transforming symlinks so that they actually work, and allowing others to access the mounted fs.

With that I can specify the sysroot, and have the compiler look there for libraries:

arm-linux-gnueabihf-g++-6 -v --sysroot ~/mnt/rpi/ -o hello hw.cpp

That spits out all of what the compiler driver invokes, and as a byproduct, a bunch of what is needed to set up cross compiling with other compilers, like clang. The key things to look for are the include directories called out by “#include <…> search starts here”, and the LIBRARY_PATH variable that helps define what the linker does. I’ll be pulling those out for the clang cross compile cmake toolchain file.

Using built-in specs.
COLLECT_GCC=arm-linux-gnueabihf-g++-6
COLLECT_LTO_WRAPPER=/usr/lib/gcc-cross/arm-linux-gnueabihf/6/lto-wrapper
Target: arm-linux-gnueabihf
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 6.2.0-5ubuntu12' --with-bugurl=file:///usr/share/doc/gcc-6/README.Bugs --enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-6 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-libitm --disable-libquadmath --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-6-armhf-cross/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-6-armhf-cross --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-6-armhf-cross --with-arch-directory=arm --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --disable-libgcj --enable-objc-gc --enable-multiarch --enable-multilib --disable-sjlj-exceptions --with-arch=armv7-a --with-fpu=vfpv3-d16 --with-float=hard --with-mode=thumb --disable-werror --enable-multilib --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=arm-linux-gnueabihf --program-prefix=arm-linux-gnueabihf- --includedir=/usr/arm-linux-gnueabihf/include
Thread model: posix
gcc version 6.2.0 20161005 (Ubuntu 6.2.0-5ubuntu12)
COLLECT_GCC_OPTIONS='-v' '-o' 'hello' '-shared-libgcc' '-march=armv7-a' '-mfloat-abi=hard' '-mfpu=vfpv3-d16' '-mthumb' '-mtls-dialect=gnu'
 /usr/lib/gcc-cross/arm-linux-gnueabihf/6/cc1plus -quiet -v -imultiarch arm-linux-gnueabihf -isysroot /home/sdowney/mnt/rpi/ -D_GNU_SOURCE hw.cpp -quiet -dumpbase hw.cpp -march=armv7-a -mfloat-abi=hard -mfpu=vfpv3-d16 -mthumb -mtls-dialect=gnu -auxbase hw -version -fstack-protector-strong -Wformat -Wformat-security -o /tmp/ccUwr5Jd.s
GNU C++14 (Ubuntu 6.2.0-5ubuntu12) version 6.2.0 20161005 (arm-linux-gnueabihf)
    compiled by GNU C version 6.2.0 20161005, GMP version 6.1.1, MPFR version 3.1.5, MPC version 1.0.3, isl version 0.15
GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
ignoring nonexistent directory "/home/sdowney/mnt/rpi/usr/local/include/arm-linux-gnueabihf"
#include "..." search starts here:
#include <...> search starts here:
 /usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/include/c++/6
 /usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/include/c++/6/arm-linux-gnueabihf
 /usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/include/c++/6/backward
 /usr/lib/gcc-cross/arm-linux-gnueabihf/6/include
 /usr/lib/gcc-cross/arm-linux-gnueabihf/6/include-fixed
 /usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/include
 /home/sdowney/mnt/rpi/usr/include/arm-linux-gnueabihf
 /home/sdowney/mnt/rpi/usr/include
End of search list.
GNU C++14 (Ubuntu 6.2.0-5ubuntu12) version 6.2.0 20161005 (arm-linux-gnueabihf)
    compiled by GNU C version 6.2.0 20161005, GMP version 6.1.1, MPFR version 3.1.5, MPC version 1.0.3, isl version 0.15
GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
Compiler executable checksum: 8867fa57a9cbba18ebd7880e42ca78ba
COLLECT_GCC_OPTIONS='-v' '-o' 'hello' '-shared-libgcc' '-march=armv7-a' '-mfloat-abi=hard' '-mfpu=vfpv3-d16' '-mthumb' '-mtls-dialect=gnu'
 /usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/bin/as -v -march=armv7-a -mfloat-abi=hard -mfpu=vfpv3-d16 -meabi=5 -o /tmp/ccJH2IA5.o /tmp/ccUwr5Jd.s
GNU assembler version 2.27 (arm-linux-gnueabihf) using BFD version (GNU Binutils for Ubuntu) 2.27
COMPILER_PATH=/usr/lib/gcc-cross/arm-linux-gnueabihf/6/:/usr/lib/gcc-cross/arm-linux-gnueabihf/6/:/usr/lib/gcc-cross/arm-linux-gnueabihf/:/usr/lib/gcc-cross/arm-linux-gnueabihf/6/:/usr/lib/gcc-cross/arm-linux-gnueabihf/:/usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/bin/
LIBRARY_PATH=/usr/lib/gcc-cross/arm-linux-gnueabihf/6/:/usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/lib/../lib/:/home/sdowney/mnt/rpi/lib/arm-linux-gnueabihf/:/home/sdowney/mnt/rpi/lib/../lib/:/home/sdowney/mnt/rpi/usr/lib/arm-linux-gnueabihf/:/home/sdowney/mnt/rpi/usr/lib/../lib/:/usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/lib/:/home/sdowney/mnt/rpi/lib/:/home/sdowney/mnt/rpi/usr/lib/
COLLECT_GCC_OPTIONS='-v' '-o' 'hello' '-shared-libgcc' '-march=armv7-a' '-mfloat-abi=hard' '-mfpu=vfpv3-d16' '-mthumb' '-mtls-dialect=gnu'
 /usr/lib/gcc-cross/arm-linux-gnueabihf/6/collect2 -plugin /usr/lib/gcc-cross/arm-linux-gnueabihf/6/liblto_plugin.so -plugin-opt=/usr/lib/gcc-cross/arm-linux-gnueabihf/6/lto-wrapper -plugin-opt=-fresolution=/tmp/cctgBCzX.res -plugin-opt=-pass-through=-lgcc_s -plugin-opt=-pass-through=-lgcc -plugin-opt=-pass-through=-lc -plugin-opt=-pass-through=-lgcc_s -plugin-opt=-pass-through=-lgcc --sysroot=/home/sdowney/mnt/rpi/ --build-id --eh-frame-hdr -dynamic-linker /lib/ld-linux-armhf.so.3 -X --hash-style=gnu --as-needed -m armelf_linux_eabi -z relro -o hello /usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/lib/../lib/crt1.o /usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/lib/../lib/crti.o /usr/lib/gcc-cross/arm-linux-gnueabihf/6/crtbegin.o -L/usr/lib/gcc-cross/arm-linux-gnueabihf/6 -L/usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/lib/../lib -L/home/sdowney/mnt/rpi/lib/arm-linux-gnueabihf -L/home/sdowney/mnt/rpi/lib/../lib -L/home/sdowney/mnt/rpi/usr/lib/arm-linux-gnueabihf -L/home/sdowney/mnt/rpi/usr/lib/../lib -L/usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/lib -L/home/sdowney/mnt/rpi/lib -L/home/sdowney/mnt/rpi/usr/lib /tmp/ccJH2IA5.o -lstdc++ -lm -lgcc_s -lgcc -lc -lgcc_s -lgcc /usr/lib/gcc-cross/arm-linux-gnueabihf/6/crtend.o /usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/lib/../lib/crtn.o
COLLECT_GCC_OPTIONS='-v' '-o' 'hello' '-shared-libgcc' '-march=armv7-a' '-mfloat-abi=hard' '-mfpu=vfpv3-d16' '-mthumb' '-mtls-dialect=gnu'

Now, note that the compiler will prefer the locally installed versions before using the ones in the sysroot. This is fine, until I need to install something. Then I’ll get an error because the library on the RPi is too old. Particularly libstdc++. This works well for the non-core language libraries, though. Or at least ones that don’t have C++ in their interface. Mixing C++ versions is a horrible minefield. The easiest way to deal with it is to avoid it.

3 Static linking

Recent versions of gcc allow libstdc++ to be linked statically. It increases the size of the resulting executable, but with less worries about deployment issues.

-static-libstdc++

That will cause the compiler driver to direct the linker to prefer the static version of libstdc++, rather than the shared version. And I don’t have to worry about deploying or upgrading the system libraries on the target box.

Note, this isn’t really a supported deployment configuration. So any bugs are going to be my problem.

4 CMake

I’ve been using CMake to generate the build system, so I need to explain to it how to use the cross compiler instead of one for the host system. CMake has support for supplying definitions for these in Toolchain files. This is what I have so far

SET(CMAKE_SYSTEM_NAME Linux)
SET(CMAKE_SYSTEM_VERSION 1)
SET(CMAKE_SYSROOT $ENV{HOME}/mnt/rpi)

SET(CMAKE_C_COMPILER arm-linux-gnueabihf-gcc)
SET(CMAKE_CXX_COMPILER arm-linux-gnueabihf-g++)

SET(CMAKE_FIND_ROOT_PATH $ENV{HOME}/mnt/rpi)
SET(CMAKE_FIND_ROOT_PATH_MODE_PROGRAM NEVER)
SET(CMAKE_FIND_ROOT_PATH_MODE_LIBRARY ONLY)
SET(CMAKE_FIND_ROOT_PATH_MODE_INCLUDE ONLY)

SET(CMAKE_CXX_FLAGS "-static-libgcc -static-libstdc++" CACHE STRING "CXX_FLAGS" FORCE)

SET( THREADS_PTHREAD_ARG
     "0"
     CACHE STRING "Result from TRY_RUN" FORCE)

That, in addition to setting the compiler to use, forces a few CMake options that are otherwise problems. The first is setting the static link flag for libstdc++. The second is overriding the search for pthreads, because trying to run programs built with a cross compiler doesn’t work very well. This lies and forces the option.

Used like so

cmake  -D CMAKE_TOOLCHAIN_FILE=~/src/toolchain/pi.cmake -DCMAKE_BUILD_TYPE=Release ..

A toolchain file for clang is a little more complicated, because it doesn’t really understand the gcc multilib layout, so it needs to be told where all the include and lib directories are for the target system, for both the C and C++ compiler.

SET(CMAKE_SYSTEM_NAME Linux)
SET(CMAKE_SYSTEM_VERSION 1)
SET(CMAKE_SYSROOT $ENV{HOME}/mnt/rpi)

set(triple arm-linux-gnueabihf)

set(CMAKE_C_COMPILER clang)
set(CMAKE_C_COMPILER_TARGET ${triple})
set(CMAKE_CXX_COMPILER clang++)
set(CMAKE_CXX_COMPILER_TARGET ${triple})

SET(CMAKE_FIND_ROOT_PATH $ENV{HOME}/mnt/rpi)
SET(CMAKE_FIND_ROOT_PATH_MODE_PROGRAM NEVER)
SET(CMAKE_FIND_ROOT_PATH_MODE_LIBRARY ONLY)
SET(CMAKE_FIND_ROOT_PATH_MODE_INCLUDE ONLY)

SET(CMAKE_CXX_FLAGS "\
 -isystem /usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/include/c++/6 \
 -isystem /usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/include/c++/6/arm-linux-gnueabihf \
 -isystem /usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/include/c++/6/backward \
 -isystem /usr/lib/gcc-cross/arm-linux-gnueabihf/6/include \
 -isystem /usr/lib/gcc-cross/arm-linux-gnueabihf/6/include-fixed \
 -isystem /usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/include"
  CACHE STRING "CXX_FLAGS" FORCE)

SET(CMAKE_C_FLAGS "\
 -isystem /usr/lib/gcc-cross/arm-linux-gnueabihf/6/include \
 -isystem /usr/lib/gcc-cross/arm-linux-gnueabihf/6/include-fixed \
 -isystem /usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/include"
  CACHE STRING "C_FLAGS" FORCE)

SET(CMAKE_EXE_LINKER_FLAGS "\
 -L /usr/lib/gcc-cross/arm-linux-gnueabihf/6 \
 -L /usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/lib/../lib \
 -L /usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/lib \
 -static-libgcc -static-libstdc++"
  CACHE STRING "LINKER FLAGS" FORCE)

SET( THREADS_PTHREAD_ARG
     "0"
     CACHE STRING "Result from TRY_RUN" FORCE)

5 Sources

Toolchain files are on Github next to the spingate sources, that now includes the org file that is the source for this entry, crosscompile.org.

batch: running functions under a spingate

1 A batch of tasks to run

This adds a rather simple component to spingate orchestrating a batch of tasks to be run, gated by the spingate. The tasks are added one at a time, a thread is created for the task, and the thread waits on the spingate to open before calling the task.

Or at least that’s how it started. Task was originally a std::function<void()>, which is essentially the interface of the thread pool I use. I realized, however, that I don’t actually need to restrict the interface quite that much. Thread takes a much wider range of things to run, and I can do the same thing. I have to forward the supplied callable and arguments into the lambda that the thread is running.

The key bit of code is

template <class Function, class... Args>
void Batch::add(Function&& f, Args&&... args) {
    workers_.emplace_back([ this, f = std::forward<Function>(f), args... ]() {
            gate_.wait();
            f(args...);
        });
}

There’s a lot of line noise in there, and it really looked simpler when it was just taking a std::function<void()>, but it’s not terrible. We take an object of type Function and a parameter pack of type Args by forwarding reference. That gets captured by the lambda, where we forward the function to the lambda, and capture the parameter pack. Inside the lambda we call the function with the pack, f(args). It’s probable that I should have used std::invoke there, which handles some of the more interesting cases of calling a thing with arguments. But this was sufficient unto the day. The captured this allows access to the gate_ variable the we’re waiting on. The workers_ are a vector of threads that we’ll later run run through and join() on, after open()ing the gate_.

void Batch::run() {
    gate_.open();
    for (auto& thr : workers_) {
        thr.join();
    }
}

That’s really all there is to Batch. It’s a middle connective glue component. Does one thing, and tries to do it obviously well. That is important since I’m trying to build up test infrastructure, and testing the test infrastrucure is a hard problem.

I have reorganized the code repo in order to do some light testing, though.

2 GTest

I’ve pushed things about in the source repo, moving the code into a library directory, which means I can link it into the existing mains, as well as into new gtests. In the CMake system, I’ve conditioned building tests on the existence of the googletest project being available as a subdirectory. I use enough different compilers and build options that trying to use a system build of gtest just doesn’t work. The best, and recommended, choice, is to build googletest as part of your project. That way any ABI impacting subtlety, like using a different C++ standard library, is take care of automatically. The bit of cmake magic is in the top level CMakeLists.txt :

# A directory to find Google Test sources.
if (EXISTS "${CMAKE_CURRENT_SOURCE_DIR}/googletest/CMakeLists.txt")
  add_subdirectory(googletest EXCLUDE_FROM_ALL)
  add_subdirectory(tests)
else()
  message("GTEST Not Found at ${CMAKE_CURRENT_SOURCE_DIR}/googletest/CMakeLists.txt")
endif()

This looks for googletest to be available, and if it is, add it to the project, and my tests subdirectory, otherwise issue a message. I prefer this to attempting to fix up the missing gtest automatically. That always seems to cause me problems, such as when I’m operating disconnected, on a train, like right now.

The tests I have are pretty simple, not much more than primitive breathing tests.

TEST_F(BatchTest, run1Test)
{
    Batch batch;
    batch.add([this](){incrementCalled();});

    EXPECT_EQ(0u, called);

    batch.run();

    EXPECT_EQ(1u, called);
}

or, to make sure that passing arguments worked

TEST_F(BatchTest, runArgTest)
{
    Batch batch;
    int i = 0;
    batch.add([&i](int k){ i = k;}, 1);

    EXPECT_EQ(0, i);

    batch.run();

    EXPECT_EQ(1, i);
}

I don’t actually expect to find runtime errors with these tests. They exercise ths component just enough that I’m not generating compile errors in expected use cases. Template code can be tricky that way. Templates that aren’t instantiated can have horrible errors, but the compiler is willing to let them pass, if they mostly parse.

SFINAE may not be your friend.

3 Clang builds with current libc++

Building clang and libc++ locally is getting easier and easier. Using that is still a bit difficult. But there are some reasons to do so. One is just being able to cross check your code for sanity. I won’t reproduce building clang and libc++ here. It’s really at this point just checking out the repos in the right places and running cmake with something like:

cmake  -DCMAKE_INSTALL_PREFIX=~/install/llvm-master/ -DLLVM_ENABLE_LIBCXX=yes  -DCMAKE_BUILD_TYPE=Release   ../llvm/

Using that, at least from within cmake, is more complicated. Cmake has a strong bias towards using the system compiler. It also has a distinct problem with repeating builds.

NEVER edit your CMakeCache.txt. You can’t do anything with it. All the paths are hard coded. Always start over. Either keep the command line around, or create a cmake initial cache file, which isn’t the same thing at all as the CMakeCache.txt file.

Right now, I’m cargo-culting around code in my cmake files that checks if I’ve defined an LLVM_ROOT, and if I have supply the flags to ignore all the system files, and use the ones from the installed LLVM_ROOT, including some rpath fixup. There might be some way to convince cmake to do it, but there’s also only so much I will fight my metabuild system.

  if(LLVM_ROOT)
    message(STATUS "LLVM Root: ${LLVM_ROOT}")
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -nostdinc++ -isystem ${LLVM_ROOT}/include/c++/v1")
    set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -L ${LLVM_ROOT}/lib -l c++ -l c++abi")
    set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -Wl,-rpath,${LLVM_ROOT}/lib")
  else()

I only check for that if the compiler I’ve chosen is a clang compiler, and it’s not normally part of my environment.

4 Direction

Overall, what I want out of this library is to be able to stress test some nominally mt-safe code, and check that the conditions that I think hold are true. It’s heavily influenced by jcstress, but, because this is C+++, it will be rendered quite differently.

For what I’m considering, look at Close Encounters of The Java Memory Model Kind

I want to be able to specify a state, with operations that mutate and observe the state. I want to be able to collect those observations in a deterministic way, which may require cooperation from the observers. I want to be able to collect the observations and report how many times each set of observations was obtained.

Something like:

class State {
    int x_;
    int y_;

  public:
    typedef std::tuple<int, int, int, int> Result;
    State() : x_(0), y_(0) {}
    void writer1() {
        y_ = 1;
        x_ = 1;
    }
    void reader1(Result& read) {
        std::get<0>(read) = x_;
        std::get<1>(read) = y_;
    }
    void reader2(Result& read) {
        std::get<2>(read) = x_;
        std::get<3>(read) = y_;
    }
};

Running the writers and readers over different threads and observing the possible results. On some architectures, reader1 and reader2 can see entirely different orders, even though y_ will happen before x_, you might see x_ written and not y_.

What I’d eventually like to be able to do is say things like, “This function will only be evaluated once”, and have some evidence to back that up.

So the next step is something that will take a State and schedule all of the actions with appropriate parameters in a Batch, and produce the overall Result. Then something that will do that many many times, accumulating all of the results. And since this isn’t java, so we don’t have reflection techniques, the State class is going to have to cooperate a bit. The Result typedef is one way. It will also have to produce all of the actions that need to be batched, in some heterogenous form that I can then run.

5 Source Code

Exported from an org-mode doc, batch.org, which is available, with all of the source on github at SpinGate.

spingate

1 Building a simple spin gate in C++

This is a very simplified latch which allows several threads to block and busywait until they are released to begin work in parallel. I’m using this to do some multi-thread stress testing, where I want to maximize the overlapping work in order to check for suprising non-deterministic behavior. There’s no count associated with the gate, nor is it reuseable. All we need is the one bit of information, weather the gate is open or closed.

The simplest atomic boolean is the std::atomic_flag. It’s guaranteed by standard to be lock free. Unfortunately, to provide that guarantee, it provides no way to do an atomic read or write to the flag, just a clear, and a set and return prior value. This isn’t rich enough for multiple threads to wait, although it is enough to implement a spinlock.

std::atomic<bool> isn’t guaranteed to be lock free, but in practice, any architecture I’m interested has at least 32bit aligned atomics that are lock free. Older hardware, such as ARMv5, SparcV8, and 80386 are missing cmpxchg, so loads are generally implemented with a lock in order to maintain the guarantees if there were a simultaneous load and exchange. See, for example, LLVM Atomic Instructions and Concurrency Guide. Modern ARM, x86, Sparc, and Power chips are fine.

When the spin gate is constructed, we’ll mark the gate as closed. Threads will then wait on the flag, spinning until the gate is opened. For this we use Release-Acquire ordering between the open and wait. This will ensure any stores done before the gate is opened will be visible to the thread waiting.

// spingate.h                                                       -*-C++-*-
#ifndef INCLUDED_SPINGATE
#define INCLUDED_SPINGATE

#include <atomic>
#include <thread>

class SpinGate
{
    std::atomic_bool flag_;

  public:
    SpinGate();
    void wait();
    void open();
};

inline
SpinGate::SpinGate() {
    // Close the gate
    flag_.store(true, std::memory_order_release);
}

inline
void SpinGate::wait() {
    while (flag_.load(std::memory_order_acquire))
        ; // spin
}

inline
void SpinGate::open() {
    flag_.store(false, std::memory_order_release);
    std::this_thread::yield();
}


#endif
#include "spingate.h"
// Test that header is complete by including

Using a SpinGate is fairly straightfoward. Create an instance of SpinGate and wait() on it in each of the worker threads. Once all of the threads are created, open the gate to let them run. In this example, I sleep for one second in order to check that none of the worker threads get past the gate before it is opened.

The synchronization is on the SpingGate’s std::atomic_bool, flag_. The flag_ is set to true in the constructor, with release memory ordering. The function wait() spins on loading the flag_ with acquire memory ordering, until open() is called, which sets the flag_ to false with release semantics. The other threads that were spinning may now proceed. The release-acquires ordering ensures that happens-before writes by the thread setting up the work and calling open will be read by the threads that were spin waiting.

1.1 Update:

I’ve added a yield() call after the open, hinting that the thread opening the gate may be rescheduled and allow the other threads to run. On the system I’m testing on, it doesn’t seem to make a difference, but it does seem like the right thing to do, and it does not seem to hurt.

#include "spingate.h"

#include <vector>
#include <chrono>
#include <thread>
#include <iostream>


int main()
{
    std::vector<std::thread> workers;
    SpinGate gate;
    using time_point = std::chrono::time_point<std::chrono::high_resolution_clock>;
    time_point t1;
    auto threadCount = std::thread::hardware_concurrency();
    std::vector<time_point> times;
    times.resize(threadCount);

    for (size_t n = 0; n < threadCount; ++n) {
        workers.emplace_back([&gate, t1, &times, n]{
                gate.wait();
                time_point t2 = std::chrono::high_resolution_clock::now();
                times[n] = t2;
            });
    }

    std::cout << "Open the gate in 1 second: " << std::endl;
    using namespace std::chrono_literals;
    std::this_thread::sleep_for(1s);
    t1 = std::chrono::high_resolution_clock::now();
    gate.open();

    for (auto& thr : workers) {
        thr.join();
    }

    int threadNum = 0;
    for (auto& time: times) {
        auto diff = std::chrono::duration_cast<std::chrono::nanoseconds>(time - t1);
        std::cout << "Thread " << threadNum++ << " waited " << diff.count() << "ns\n";
    }
}

I’d originally had the body of the threads just spitting out that they were running on std::cout, and the lack of execution before the gate, plus the overlapping output, being evidence of the gate working. That looked like:

for (std::size_t n = 0; n < std::thread::hardware_concurrency(); ++n) {
    workers.emplace_back([&gate, n]{
            gate.wait();
            std::cout << "Output from gated thread " << n << std::endl;
        });
}

The gate is captured in the thread lambda by reference, the thread number by value, and when run, overlapping gibberish is printed to the console as soon as open() is called.

But then I became curious about how long the spin actually lasted. Particularly since the guarantees for atomics with release-acquire semantics, or really even sequentially consistent, are about once a change is visible, that changes before are also visible. It’s really a function of the underlying hardware how fast the change is visible, and what are the costs of making the happened-before writes available. I’d already observed better overlapping execution using the gate, as opposed to just letting the threads run, so for my initial purposes of making contention more likely, I was satisfied. Visibility, on my lightly loaded system, seems to be in the range of a few hundred to a couple thousand nanoseconds, which is fairly good.

Checking how long it took to start let me do two thing. First, play with the new-ish chrono library. Second, check that the release-acquire sync is working the way I expect. The lambdas that the threads are running capture the start time value by reference. The start time is set just before the gate is opened, and well after the threads have started running. The spin gate’s synchronization ensures that if the state change caused by open is visible, the setting of the start time is also visible.

Here are one set of results from running a spingate:

Open the gate in 1 second: 
Thread 0 waited 821ns
Thread 1 waited 14490ns
Thread 2 waited 521ns
Thread 3 waited 817ns

2 Comparison with Condition Variable gate

The normal way of implementing a gate like this is with a condition variable, associated mutex, and a plain bool. The mutex guarantees synchronization between the wait and open, rather than the atomic variable in the SpinGate. The unlock/lock pair in mutex have release-acquire semantics. The actual lock and unlock are done by the unique_lock guard.

// cvgate.h                                                           -*-C++-*-
#ifndef INCLUDED_CVGATE
#define INCLUDED_CVGATE

#include <mutex>
#include <condition_variable>
#include <atomic>
#include <thread>

class CVGate
{
    std::mutex lock_;
    std::condition_variable cv_;
    bool flag_;

  public:
    CVGate();

    void wait();
    void open();
};

inline
CVGate::CVGate()
: lock_(),
  cv_(),
  flag_(true)
{}

inline
void CVGate::wait() {
    std::unique_lock<std::mutex> lk(lock_);
    cv_.wait(lk, [&](){return !flag_;});
}

inline
void CVGate::open() {
    std::unique_lock<std::mutex> lk(lock_);
    flag_ = false;
    cv_.notify_all();
    std::this_thread::yield();
}
#endif
#include "cvgate.h"
// Test that header is complete by including

This has the same interface as SpinGate, and is used exactly the same way.

Running it shows:

Open the gate in 1 second: 
Thread 0 waited 54845ns
Thread 1 waited 76125ns
Thread 2 waited 91977ns
Thread 3 waited 128816ns

That the overhead of the mutex and condition variable is significant. On the other hand, the system load while it’s waiting is much lower. Spingate will use all available CPU, while CVGate yields, so useful work can be done byu the rest of the system.

However, for the use I was originally looking at, releasing threads for maximal overlap, spinning is clearly better. There is much less overlap as the cv blocked threads are woken up.

3 Building and Running

This is a minimal CMake file for building with the system compiler.

cmake_minimum_required(VERSION 3.5)
set(CMAKE_LEGACY_CYGWIN_WIN32 0)

project(SpinGate C CXX)

set(THREADS_PREFER_PTHREAD_FLAG ON)
find_package(Threads REQUIRED)

set(CMAKE_EXPORT_COMPILE_COMMANDS ON)

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++14 -ftemplate-backtrace-limit=0")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wall -Wextra -march=native")
set(CMAKE_CXX_FLAGS_DEBUG "-O0 -fno-inline -g3")
set(CMAKE_CXX_FLAGS_RELEASE "-Ofast -g0 -DNDEBUG")

include_directories(${CMAKE_CURRENT_SOURCE_DIR})

add_executable(spingate main.cpp spingate.cpp)
add_executable(cvgate cvmain.cpp cvgate.cpp)
target_link_libraries(spingate Threads::Threads)
target_link_libraries(cvgate Threads::Threads)

And here we build a release version of the test executable:

mkdir -p build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ../
make
-- Configuring done
-- Generating done
-- Build files have been written to: /home/sdowney/src/spingate/build
[ 50%] Built target cvgate
[100%] Built target spingate

4 Org-mode source and git repo

Exported from an org-mode doc. All of the source is available on github at SpinGate

C++ code in Org-mode

1 C++

This is a simple example with one c++ file, all in one src code block.

#include <iostream>

int main()
{
    std::cout << "Hello World++! 2.9";
}

The src block looks like:

    #+HEADERS: :tangle hello.c++ :exports code :eval never
    #+BEGIN_SRC C++
    // source code
    #+END_SRC

The HEADERS block is on a separate line because when the buffer is evaluated, code will get run, and the SRC blocks will get rewritten, as well as the RESULTS blocks. Since we want the headers to be preserved, we can’t make them part of the SRC block.

You can also tangle a Makefile.

clean:
    -rm hello

hello:  hello.c++
    clang++ hello.c++ -o hello

With the org mode headers exporting the code, and not evaluating this block. Just like the C++ code. We’ll evaluate the makefile, and run the program, a bit further down.

    #+HEADERS: :tangle Makefile :exports code :eval never
    #+BEGIN_SRC makefile
    # Makefile
    #+END_SRC

Now, we tangle the code out to the files.

    #+NAME: tangle-buffer
    #+HEADERS: :exports both :results value
    #+BEGIN_SRC emacs-lisp
    (org-babel-tangle)
    #+END_SRC
(org-babel-tangle)

That will write out two files when the buffer is evaluated using org-babel-execute-buffer, bound to \C-c \C-v b.

Give this block a name, so that where the results go can be controlled. Do that by giving the RESULTS block the name of the SRC block. Org will then produce a table of the results of executing the elisp, which is the two files produced.

And put the results here:

hello.c++ Makefile

Next, we run make with the target to compile the code. You could also simply write the compiler command here.

    #+NAME: make-clean-hello
    #+BEGIN_SRC sh :exports both :results output
    make clean
    make hello
    #+END_SRC
make clean
make hello

And make will run our compilation as spec’d in the Makefile we just tangled out.

rm hello
Makefile:2: recipe for target 'clean' failed
clang++ hello.c++ -o hello

And now get the output by running the program.

    #+NAME: run-hello
    #+BEGIN_SRC sh :exports results
    ./hello
    #+END_SRC

Which prints out our hello, world text. Which is has version number to convince myself it gets updated.

Hello World++! 2.9

2 Raw Document

Org mode in org is a little tricky, since to show the examples, you have to quote them.

Original document is available on GitHub

Building Emacs 25.1 on Ubuntu 16.10

1 Why notes

Making notes so I don’t forget, although the key problem is fixed upstream.

Ubuntu 16.10 (Yakkety Yak) has made a critical change to the system compiler, and everything is by default built with position independent executable support (PIE), in order to get better support for address space layout randomization. Here are the security notes for PIE. Emacs does not dump successfully with this. The compiler option -no-pie needs to be added into CLFAGS.

The option also means that static libraries you’ve built before will probably need to be rebuilt. See the link above for typical errors.

2 Getting Ready

First get dpkg-dev, g++, gcc, libc, make:

sudo apt-get install build-essentials

Then get the full set of build dependencies for last emacs, emacs24:

sudo apt-get build-dep emacs24

Decide if you want to build just this version, or track emacs. I track from git, because. So I have a directory for emacs where I have master, emacs25, and build directories. I try to avoid building in src dirs. It makes it easier to try out different options without polluting the src.

mkdir -p ~/bld/emacs
cd ~/bld/emacs
git clone git://git.savannah.gnu.org/emacs.git
cd emacs.git
git worktree add ../emacs-25.1 emacs-25.1
cd ..
mkdir bld-25.1

3 Configure and build with magic option

Now configure in the build directory:

cd bld-25.1
../emacs-25.1/configure \
  --prefix=~/install/emacs-25.1 \
  --with-x-toolkit=gtk3 \
  --with-xwidgets \
  CFLAGS=-no-pie

I built with xwidget support to play with the embedded webkit widget. It’s not really useable as a browser, but has uses for rendering. I also install into a local program directory, under my homedir.

Build and install:

make
make install

I have a bin directory early in $PATH so that I can select versions of local software ahead of system software.

cd ~/bin
ln -s ~/install/emacs-25.1/bin/emacs
ln -s ~/install/emacs-25.1/bin/emacsclient

Now you should have a working emacs 25.1 available.

Real World Haskell – Chapter 3

These are the exercises from chapter 3 of
Real World Haskell
by Bryan O’Sullivan, Don Stewart, and John Goerzen

> module RWHChapter3 where

{-# OPTIONS_GHC -XMagicHash #-}

Some useful things to check my work:

> import Test.QuickCheck
> import Data.List
> import GHC.Prim
> import GHC.Base

1) Write a function that computes the number of elements in a list. To test it, ensure that it gives the same answers as the standard length function.

> lengthList (x:xs) = 1 + lengthList xs
> lengthList [] = 0

check that it works by running

quickCheck (\s -> length s == lengthList s)

at the interactive prompt

> ghclength l = len l 0#
> where
> len :: [a] -> Int# -> Int
> len [] a# = I# a#
> len (_:xs) a# = len xs (a# +# 1#)

is the definition used by GHC.
It doesn’t stack overflow on ghclength [1..1000000]
lengthList [1..1000000] does, at least in ghci

2) Add a type signature for your function to your source file.

> lengthList :: [a] -> Integer

3. write a function that computes the mean of a list, i.e. the sum of all elements in the list divided by its length. (you may need to use the fromintegral function to convert the length of the list from an integer into a floating point number.)

> listMean x = sum x / fromIntegral (length x)

4. Turn a list into a palindrome, i.e. it should read the same both backwards and forwards. For example, given the list [1,2,3], your function should return [1,2,3,3,2,1].

> makePalindrome x = x ++ reverse x

5. Write a function that determines whether its input list is a palindrome.

> testPalindrome x = x == reverse x

> testPalindrome2 x = (take (halfLength x) x)
> == (take (halfLength x) (reverse x))
> where halfLength x = ((length x) `quot` 2)

reversing the whole list isn’t exactly right

> testPalindrome3 x = (take (halfLength x) x)
> == reverse (drop (halfLength x) x)
> where halfLength x = ((length x) `quot` 2)

except now it thinks that odd length lists aren’t palindromes.

6. Create a function that sorts a list of lists based on the length of each sublist. (You may want to look at the sortBy function from the Data.List module.)

> sortByLength = sortBy (\a b -> compare (length a) (length b))

7 An intersperse with type intersperse :: a -> [[a]] -> [a], such that
ghci> myintersperse ‘,’ []
“”
ghci> myintersperse ‘,’ [“foo”]
“foo”
ghci> myintersperse ‘,’ [“foo”,”bar”,”baz”,”quux”]
“foo,bar,baz,quux”

> myintersperse :: a -> [[a]] -> [a]
> myintersperse _ [] = []
> myintersperse _ [x] = x
> myintersperse c (x:xs) = x ++ [c] ++ (myintersperse c xs)

8. Height of a tree

> data Tree a = Node a (Tree a) (Tree a)
> | Empty
> deriving (Show)

depth’ :: Tree a -> Int

> depth’ Empty = 0
> depth’ (Node _ b c) = 1 + (max (depth’ b) (depth’ c))

9. Consider three two-dimensional points a, b, and c. If we look at the angle formed by the line segment from a to b and the line segment from b to c, it either turns left, turns right, or forms a straight line. Define a Direction data type that lets you represent these possibilities.

> data Direction = DLeft
> | DRight
> | DStraight
> deriving (Show)

> data Point = Point {
> x::Double,
> y::Double
> } deriving (Show)

10. Write a function that takes three 2D Points and returns a direction.

> direction p1 p2 p3
> | signCross p1 p2 p3 > | signCross p1 p2 p3 > 0 = DLeft
> | signCross p1 p2 p3 == 0 = DStraight
> where signCross (Point x1 y1) (Point x2 y2) (Point x3 y3) =
> (x2 – x1) * (y3 – y1) – (y2 – y1) * (x3 – x1)

11. Define a function that takes a list of 2D points and computes the direction of each successive triple. Given a list of points [a,b,c,d,e], it should begin by computing the turn made by [a,b,c], then the turn made by [b,c,d], then [c,d,e]. Your function should return a list of Direction.

> directionList :: [Point] -> [Direction]
> directionList ([]) = []
> directionList (_:[]) = []
> directionList (_:_:[]) = []
> directionList (x:x’:x”:[]) = [(direction x x’ x”)]
> directionList (x:x’:x”:xs) = (direction x x’ x”) : directionList (x’:x”:xs)

I don’t particularly like the pattern match on the three x’s. How
about a more general function to do sliding windows of data across the
list?

> window :: Int -> [a] -> [[a]]
> window2 :: Int -> [a] -> [[a]]

> window n xs = if (length xs) > then []
> else (take n xs) : window n (tail xs)

> window2 n xs | (length xs) > window2 n x@(_:xs) = (take n x) : window2 n xs

> direction2 (x1:x2:x3:[]) = direction x1 x2 x3

> directionList2′ (x:xs) = direction2 x : directionList2′ xs
> directionList2′ [] = []

> directionList2 x = directionList2′ (window2 3 x)

I’m still working on the convex hull problem