AnyDuck : A Value Type Erased Type

A Constrained Duck Typed Value Type

For yak shaving reasons, I need a type able to hold any type conforming to a particular interface. I’d like this to act as a (Semi)Regular value type. That is, I’d like it to be copyable, assignable, and so forth, and not be sliced or otherwise mangled in the process. I want it to be reasonably efficient, not significantly worse than traditional virtual function overhead. I also don’t want to be terribly creative in implementing it, so I’m using existing std library types.

The overall pattern I’ve settled on for now is to hold the type in a std::any and dispatch to the held type through function pointers referring to lambdas. The lambda allows me to capture the type being stored into the AnyDuck and safely recover it. There’s some boilerplate to write to dispatch to the lambda. Perhaps one day, when we have reflection, that can be automated.

For purposes of this paper, I’ll assume I have an interface Duck that I want to model:

class Duck {
  public:
    void quack(int length) const;
};

Ducks are defined as things that quack, and quack is a const function. I want to be able to put any type that models Duck into an AnyDuck, and pass AnyDuck into any generic function expecting a Duck. I also want to be able to extend AnyDuck to unrelated types, as long as they model Duck. Mallard, for example:

class Mallard {
  public:
    void quack(int length) const;
};

The core of the idea is to capture the Duck type in a templated constructor, where I know the exact type, and create the appropriate lambda:

auto quack_ = [](std::any const& d, int i) {
    return std::any_cast<std::remove_reference_t<Duck>>(&d)->quack(i);
};

And then wrap the call in a public-facing member function, so that the lambda can be stored as a plain function pointer of type quackfn:

void AnyDuck::quack(int length) const { return quack_(this->duck_, length); }

Here’s the whole thing:

class AnyDuck {
    std::any duck_;
    using quackfn = void (*)(std::any const&, int);
    quackfn quack_;

  public:
    AnyDuck(AnyDuck const&) = default;
    AnyDuck(AnyDuck&)       = default;

    template <typename Duck>
    AnyDuck(Duck&& duck)
        : duck_(std::forward<Duck>(duck)),
          quack_([](std::any const& d, int i) {
              return std::any_cast<std::remove_reference_t<Duck>>(&d)->quack(
                  i);
          }) {}

    void quack(int length) const { return quack_(this->duck_, length); }
};

The copy constructors are there to be a better match than the templated constructor for copy by value. Codegen is surprisingly good. If the types are all present, the functions are inlined well, except for the overhead of storage into the any. For any unknown AnyDuck, there’s a dispatch via pointer indirection:

void test(AnyDuck a) {
    a.quack(1);
}

results in something like

0000000000000050 <test(scratch::AnyDuck)>:
  50:   48 8b 47 10             mov    0x10(%rdi),%rax
  54:   be 01 00 00 00          mov    $0x1,%esi
  59:   ff e0                   jmpq   *%rax
  5b:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)

and the any_cast<> from the address of the passed-in std::any is noexcept, but does in general have to check whether the any has a value. Not as cheap as a pure interface type, but not terribly more expensive.
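The check described above is easy to see in isolation. A minimal sketch, assuming a hypothetical helper named peek (it is not part of AnyDuck): the pointer overload of std::any_cast reports an empty or mismatched any by returning a null pointer rather than throwing.

```cpp
#include <any>
#include <cassert>
#include <string>

// Hypothetical helper: returns a pointer to the held T, or nullptr if the
// any is empty or holds some other type. The pointer overload of
// std::any_cast is noexcept, so this never throws.
template <typename T>
const T* peek(std::any const& a) noexcept {
    return std::any_cast<T>(&a);
}

inline void peek_demo() {
    std::any number = 42;
    std::any text   = std::string("quack");
    std::any empty;

    assert(peek<int>(number) != nullptr); // right type: pointer to the value
    assert(peek<int>(text) == nullptr);   // wrong type: null, no exception
    assert(peek<int>(empty) == nullptr);  // empty any: null, no exception
}
```

AnyDuck skips the null test because the lambda is created alongside the value it will be asked to recover, so the types always agree.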

For the case where the quack is known, codegen is something like

scratch::AnyDuck::AnyDuck<Duck&>(Duck&)::{lambda(std::any const&, int)#1}::__invoke(std::any const&, int): # @scratch::AnyDuck::AnyDuck<Duck&>(Duck&)::{lambda(std::any const&, int)#1}::__invoke(std::any const&, int)
        movl    %esi, %edi
        jmp     bell(int)                # TAILCALL

If the implementation of the underlying quack is not available there’s a little more work

scratch::AnyDuck::AnyDuck<Mallard&>(Mallard&)::{lambda(std::any const&, int)#1}::__invoke(std::any const&, int): # @scratch::AnyDuck::AnyDuck<Mallard&>(Mallard&)::{lambda(std::any const&, int)#1}::__invoke(std::any const&, int)
        movl    $_ZNSt3any17_Manager_internalI7MallardE9_S_manageENS_3_OpEPKS_PNS_4_ArgE, %ecx
        xorl    %eax, %eax
        cmpq    %rcx, (%rdi)
        leaq    8(%rdi), %rcx
        cmoveq  %rcx, %rax
        movq    %rax, %rdi
        jmp     Mallard::quack(int) const    # TAILCALL

But far less than I would have expected. std::any is less awful than I thought.
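To see the value semantics in action, here is a self-contained sketch that repeats the AnyDuck class from above and exercises it. The two duck types, the quack_log function, quack_twice and any_duck_demo are all hypothetical names made up for the illustration.

```cpp
#include <any>
#include <cassert>
#include <string>
#include <type_traits>
#include <utility>
#include <vector>

// Shared log so the calls can be inspected (illustration only).
inline std::vector<std::string>& quack_log() {
    static std::vector<std::string> log;
    return log;
}

// Two unrelated, hypothetical types that both model Duck.
struct Mallard {
    void quack(int length) const {
        quack_log().push_back("Mallard " + std::to_string(length));
    }
};
struct RubberDuck {
    std::string name;
    void quack(int length) const {
        quack_log().push_back(name + " " + std::to_string(length));
    }
};

class AnyDuck {
    std::any duck_;
    using quackfn = void (*)(std::any const&, int);
    quackfn quack_;

  public:
    AnyDuck(AnyDuck const&) = default;
    AnyDuck(AnyDuck&)       = default;

    template <typename Duck>
    AnyDuck(Duck&& duck)
        : duck_(std::forward<Duck>(duck)),
          quack_([](std::any const& d, int i) {
              return std::any_cast<std::remove_reference_t<Duck>>(&d)->quack(i);
          }) {}

    void quack(int length) const { return quack_(this->duck_, length); }
};

// Generic code written against the Duck interface also accepts an AnyDuck.
template <typename Duck>
void quack_twice(Duck const& d) {
    d.quack(1);
    d.quack(2);
}

inline void any_duck_demo() {
    quack_log().clear();
    Mallard m;
    AnyDuck a(m);                     // stores a copy of m
    AnyDuck b(RubberDuck{"Squeaky"}); // stores the temporary
    AnyDuck c = a;                    // copies the held Mallard, no slicing
    a.quack(1);
    b.quack(3);
    b = c;                            // assignment swaps in a Mallard
    quack_twice(b);
    assert(quack_log().size() == 4);
}
```

After the assignment, b dispatches to Mallard::quack, because the function pointer is copied together with the std::any it knows how to unpack.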

You can take a look at the code so far on Compiler Explorer to see the results.

I’ll clean up my scratch project and push it at some point.

Steve Downey’s Birthday (Observed)


Blooms Tavern
208 East 58th Street
Between 2nd & 3rd Ave
New York, NY 10022


6:00 PM ->


This is the sound
I bought a ticket to the world
But now I’ve come back again
Why do I find it hard to write the next line?
Oh, I want the truth to be said
I know this much is true


From: 110 W 14th St

Walk

About 2 min , 335 ft

6:03 PM

14 Street Station

L Canarsie – Rockaway Pkwy

2 min (non-stop)

6:05 PM

14 Street – Union Sq Station

Walk

About 1 min

6:09 PM

14 Street – Union Sq Station

5X Eastchester – Dyre Av

7 min (2 stops)

6:16 PM

59 St-Lexington Av Station

Walk

About 4 min , 0.2 mi

6:20 PM

Blooms Tavern

208 E 58th St, New York, NY 10022

Walk

About 2 min , 0.1 mi

6:03 PM

14 Street Station

M Forest Hills – 71 Av

13 min (6 stops)

6:16 PM

Lexington Av-53 St

Walk

About 6 min , 0.3 mi

6:22 PM

Blooms Tavern

208 E 58th St, New York, NY 10022

From: 731 Lexington Avenue

Head southwest on Beacon Ct toward E 58th St
Restricted usage road
184 ft

Turn left onto E 58th St
Pass by Wells Fargo Bank (on the left in 115 ft)
Destination will be on the right
420 ft

Bloom’s Tavern

208 E 58th St, New York, NY 10022

Building Saar Raz’s clang concepts branch

A Recipe for building Saar Raz’s clang concepts branch

Saar Raz has been working on a Concepts implementation for clang. It’s not much harder to build than clang usually is; it’s just a matter of getting things checked out into the right places before configuring the build, just like LLVM and clang normally.

In order to double check how, I peeked at the shell script used by the Compiler Explorer image to build the clang-concepts compiler.

The really important bit is getting exactly the right commit from LLVM, 893a41656b527af1b00a1f9e5c8fcecfff62e4b6.

To get a working directory, do something like the following, starting from where you want your working tree and build tree, e.g. ~/bld/llvm-concepts:

git clone

pushd llvm
git reset --hard 893a41656b527af1b00a1f9e5c8fcecfff62e4b6

pushd llvm/tools
git clone clang

pushd llvm/projects
git clone
git clone
# The sanitizers: this is optional but you want them
git clone

Then to build and install

mkdir build && cd build

cmake \
    -DCMAKE_INSTALL_PREFIX=~/install/llvm-concepts/ \
    -DCMAKE_BUILD_TYPE=Release  \
    -G Ninja  \


ninja check

ninja install

Note that I install into a separate prefix, keeping it isolated from everything else. The compiler can be invoked as ~/install/llvm-concepts/bin/clang++. There’s no particular reason to put the compiler on your PATH.

Should Unicode literals be guaranteed to be well-formed?

TL;DR Betteridge’s law applies: No.

Are you still here?

Unicode Literals

In C++20 there are 2 kinds and 6 forms of Unicode literals: character literals and string literals, in UTF-8, UTF-16, and UTF-32 encodings. Each of them uses a distinct char type to signal in the type system what the encoding is for the data. For plain char literals, the encoding is the execution character set, which is implementation defined. It might be something useful, like UTF-8, or old, like UCS-2, or something distinctly unhelpful, like EBCDIC. Whatever it is, it was fixed at compilation time, and is not affected by things like LOCALE settings. The source character set is the encoding the compiler believes your source code is written in, including the octets that happen to be between " and ' characters.

char32_t s1[] = U"\u0073tring";
char16_t s2[] = u"\u0073tring";
char8_t s3[] = u8"\u0073tring";

char32_t c1 = U'\u0073';
char16_t c2 = u'\u0073';
char8_t c3 = u8'\u0073';

Unicode codepoint U+0073 is ‘s’. So all of the strings are “string”, and all of the characters are ‘s’, however the rendering of each in memory is different. For the string types, each unit in the string is 32, 16 or 8 bits respectively, and for the character literals, each is one code unit, again of 32, 16, or 8 bits.
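The difference in code units can be checked at compile time. A minimal sketch, assuming a hypothetical helper named code_units (not a standard library function) that counts the code units in a string literal, excluding the terminating null:

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal, not
// counting the terminating null. Works for char, char8_t, char16_t, char32_t.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// U+0073 's' is a single code unit in all three encodings.
static_assert(code_units(u8"\u0073") == 1);
static_assert(code_units(u"\u0073") == 1);
static_assert(code_units(U"\u0073") == 1);

// U+03C2 GREEK SMALL LETTER FINAL SIGMA: two UTF-8 code units, one otherwise.
static_assert(code_units(u8"\u03C2") == 2);
static_assert(code_units(u"\u03C2") == 1);
static_assert(code_units(U"\u03C2") == 1);

// U+1F9FF is outside the basic multilingual plane: four UTF-8 code units,
// a surrogate pair in UTF-16, and still one code unit in UTF-32.
static_assert(code_units(u8"\U0001F9FF") == 4);
static_assert(code_units(u"\U0001F9FF") == 2);
static_assert(code_units(U"\U0001F9FF") == 1);
```

Universal character names are used throughout so that the result does not depend on what the compiler thinks the source encoding is.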


This is due to Zach Laine, an actual expert, and sorting out what happened took several other experts, so the rest of us have little hope.

char8_t surprise[] = u8"ς";
assert(std::char_traits<char8_t>::length(surprise) == 5);

This comes down to your editor and compiler having a minor disagreement about encoding. The compiler was under the impression that the encoding was cp-1252, where the editor believed it to be UTF-8. The underlying octets for ς are 0xCF 0x82, each of which is a character in cp-1252. All octets are valid characters in cp-1252. So each was converted to UTF-8, resulting in 0xC3 0x8F and 0xE2 0x80 0x9A.

Of course.

The contents of the string are still in the source character set, not in any of the Unicode forms. Unless that happens to be the source character set.

But at least it’s well formed.


The \u escapes, which identify Unicode codepoints by number, will produce well formed Unicode encoding in strings, and in character literals if they fit. That is, in a char32_t, you can put any code point. In a char16_t you can put any character from the basic multilingual plane. In a char8_t you can put 7 bit ASCII characters.

Or you can use hex or octal escapes, which will be widened to code unit size and the value placed into the resultant character or string. And no current compiler checks if that makes sense, although they will warn you if, for example you try to write something like:

char16_t w4 = u'\x0001f9ff'; // NAZAR AMULET - Unicode 11.0 (June 2018)
char16_t sw4[] = u"\x0001f9ff";

where you’re trying to put a value that won’t fit into a code unit into the result.

warning: hex escape sequence out of range

The \xnn and \nnn hex and octal escapes are currently a hole that lets you construct ill-formed string literals. For example

char8_t oops[] = u8"\xfe\xed";

But there are lots of ways of constructing ill-formed arrays of char8_t or char16_t. Just spell them out as arrays:

char8_t ill[] = {0xfe, 0xed};

The type system doesn’t guarantee that char8_t means well formed UTF-8. All it does is tell you that the intended encoding is UTF-8. Which is a huge improvement over char.

But it does not provide any guarantee to an API taking a char8_t*.
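An API that needs well formed input therefore has to check for itself. The following is a minimal sketch of such a check, not a production validator; the name is_well_formed_utf8 is hypothetical, and it takes unsigned char rather than char8_t only so that it also compiles before C++20.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical check: true if the n bytes starting at p are well-formed UTF-8.
inline bool is_well_formed_utf8(const unsigned char* p, std::size_t n) {
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = p[i];
        std::size_t   len = 0;
        std::uint32_t cp  = 0;
        std::uint32_t min = 0; // smallest code point allowed at this length
        if (lead < 0x80) {
            ++i; // single-byte (ASCII) sequence
            continue;
        } else if ((lead & 0xE0) == 0xC0) {
            len = 2; cp = lead & 0x1F; min = 0x80;
        } else if ((lead & 0xF0) == 0xE0) {
            len = 3; cp = lead & 0x0F; min = 0x800;
        } else if ((lead & 0xF8) == 0xF0) {
            len = 4; cp = lead & 0x07; min = 0x10000;
        } else {
            return false; // stray continuation byte, or 0xF8..0xFF
        }
        if (n - i < len) return false; // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = p[i + k];
            if ((c & 0xC0) != 0x80) return false; // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min) return false;                     // overlong encoding
        if (cp > 0x10FFFF) return false;                // beyond Unicode
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // surrogate
        i += len;
    }
    return true;
}

inline void utf8_check_demo() {
    const unsigned char ill[]   = {0xfe, 0xed}; // the array from above
    const unsigned char sigma[] = {0xCF, 0x82}; // U+03C2
    assert(!is_well_formed_utf8(ill, sizeof ill));
    assert(is_well_formed_utf8(sigma, sizeof sigma));
}
```

The cost is a pass over the data at every boundary that cares, which is part of why a type-level guarantee is tempting even though the language cannot actually deliver one.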

\0 is an octal escape sequence

You can spell it \u0000. But that’s weird, since we spell it as \0 everywhere, and that it’s an octal escape is a C++ trivia question.

I want to be told if I’m forming ill-formed Unicode.

Then don’t use escape sequences. If you use either well encoded source code encodings or universal character names you will get well formed Unicode.

The primary reason for wanting ill-formed Unicode is for testing. It’s a convenience. And there are straightforward workarounds.

But disallowing hex and octal escapes in Unicode strings makes the language less regular while preventing an error that you had to go out of your way to create, and does not actually produce more runtime safety.

Litmus Tests for Multithreaded Behavior

Or How Processors Don’t Do What You Think

Modern multicore processors are entirely weirder than almost anyone thinks possible. They are somewhat weirder than chip makers were willing to admit until fairly recently. They are weird enough that almost all multi-threaded programs, and many lock-free algorithms, had bugs because the hardware just does not work the way anyone would reasonably expect. And, in addition, optimizing compilers did not actually know how to not break your code. [boehm2005threads]

I’m in the (slow) process of writing some test cases for multi-threaded code. I eventually want to have some confidence that some code is executed once, and only once, in an efficient manner. It’s the ‘efficient’ part that worries me, because efficient code tends to be clever, and I’m learning that clever MT code is often subtly broken. [bacon2000double] So if people smarter than me make mistakes about MT code, I need tests to compensate, and ones that will cover occasional allowed but unexpected behavior. That means the test framework should be able to detect such behavior.

Also, fortunately, the RPi is a computer that exhibits some odd behavior, as it is an ARM system. X86 has a much stronger model. However, even the x86 model is allowed to perform in odd ways.

In 2007, Intel started publishing short snippets of assembly, documenting the allowed and disallowed results of running them in parallel. [IWPAug2007] These snippets have come to be called litmus tests, and are used to confirm the behavior of hardware, and to confirm models of the hardware behavior. A particularly important model for C++ programmers is the x86 Total Store Order model [owens2009better], which provides a way of mapping the C++11 memory model to X86 hardware. X86 hardware provides a strongly consistent memory model. Power and ARM provide fewer guarantees, and mapping the C++ memory model to these architectures is more challenging. [maranget2012tutorial]

Message Passing

The tests outlined in the Intel paper are short pieces of assembly to be exercised on different processors, with guarantees about behavior that will not happen. The first one essentially promises that message passing will work, and is now known as the MP test.

Processor 0              Processor 1
mov [ _x], 1 // M1       mov r1, [ _y] // M3
mov [ _y], 1 // M2       mov r2, [ _x] // M4

Initially x = y = 0

r1 = 1 and r2 = 0 is not allowed

That says that we can’t read the writes of x and y out of order, which ensures that if we’re waiting to see the write to the flag y, we are guaranteed to see the payload in x. If we change it slightly to wait for the write to _y to become visible, we can pass a message from one thread to another in _x. This is also known as Dekker’s algorithm.

ARM and Power do not provide that guarantee without additional synchronization instructions.

In C++ that looks something like the following, using the test framework I’m writing.

MP::MP() : x_(0), y_(0) {}
void MP::t1() {
    x_.store(1, std::memory_order_relaxed);
    y_.store(1, std::memory_order_relaxed);
}
void MP::t2(Result& read) {
    while (!y_.load(std::memory_order_relaxed)) {
    }
    std::get<0>(read) = x_.load(std::memory_order_relaxed);
}

Here, x_ and y_ are atomic<int>s, and we’re using the lowest possible atomic guarantee in C++, relaxed. Relaxed guarantees that the operation happens atomically, but there are no synchronization properties with anything else. This usually corresponds to the behavior of the basic int type, unless you’re using a really insane processor that might let an int be partially written and observable: you might get the top half of the int, or the middle byte. The commercial processors that allowed this have pretty much died out.
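What relaxed does promise, atomicity of each individual operation, can be demonstrated on its own. A small sketch, with the hypothetical function count_with_relaxed standing in for any shared counter: several threads increment one atomic with relaxed ordering, and no increment is lost.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical illustration: every fetch_add is indivisible even with
// memory_order_relaxed, so the final count is exact. What relaxed does not
// give is any ordering relative to other memory locations.
inline int count_with_relaxed(int threads, int increments_per_thread) {
    std::atomic<int>         counter{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&counter, increments_per_thread] {
            for (int i = 0; i < increments_per_thread; ++i) {
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) {
        w.join(); // join() synchronizes, so the final load sees every add
    }
    return counter.load(std::memory_order_relaxed);
}

inline void relaxed_counter_demo() {
    assert(count_with_relaxed(4, 10000) == 40000);
}
```

The litmus tests, by contrast, involve two locations, which is exactly where relaxed stops making promises.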

The test spins on seeing the load of y to be complete. It loads the value of x_ into a result tuple. The tuple is used as the key to a map which accumulates how many times each result has been seen.
Running the above on my x86 laptop:

[ RUN      ] ExperimentTest.MPTest1
(1) : 2000000

on my Raspberry Pi 3:

[ RUN      ] ExperimentTest.MPTest1
(0) : 483
(1) : 1999517

Using objdump to check the generated assembly

00000088 <litmus::MP::MP()>:
  88:   mov     r2, #0
  8c:   str     r2, [r0]
  90:   str     r2, [r0, #64]   ; 0x40
  94:   bx      lr

00000098 <litmus::MP::t1()>:
  98:   mov     r3, #1
  9c:   str     r3, [r0]
  a0:   str     r3, [r0, #64]   ; 0x40
  a4:   bx      lr

000000a8 <litmus::MP::t2(std::tuple<int>&)>:
  a8:   ldr     r3, [r0, #64]   ; 0x40
  ac:   cmp     r3, #0
  b0:   beq     a8 <litmus::MP::t2(std::tuple<int>&)>
  b4:   ldr     r3, [r0]
  b8:   str     r3, [r1]
  bc:   bx      lr

So, out of the 2,000,000 times that I ran the experiment, there were 483 times that reading x_ resulted in 0, even though y_ was 1. ARM has a weaker memory model than x86. This has some advantages in processor implementation. It has distinct disadvantages in how our brains work. X86 tries to preserve the model that there is shared memory that everyone sees and works with. That’s not strictly true, even for X86, but ARM and Power don’t even come close. On the other hand, it’s also why it’s easier to add more cores to Power and ARM chips and systems. I routinely work with Power systems with 512 physical cores.
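For contrast, here is what asking for the guarantee looks like in C++. This is a sketch under assumptions, not part of the test framework: MPReleaseAcquire and run_mp_once are hypothetical names, and the only change from the relaxed test is a release store to the flag paired with an acquire load of it.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical variant of the MP test. The release store to y_ and the
// acquire load that reads it form a synchronizes-with edge, so the earlier
// store to x_ must be visible once the flag has been seen.
struct MPReleaseAcquire {
    std::atomic<int> x_{0};
    std::atomic<int> y_{0};

    void t1() {
        x_.store(1, std::memory_order_relaxed); // payload
        y_.store(1, std::memory_order_release); // flag
    }
    int t2() {
        while (!y_.load(std::memory_order_acquire)) {
        }
        return x_.load(std::memory_order_relaxed);
    }
};

// Runs both actions once on separate threads; returns what t2 read from x_.
inline int run_mp_once() {
    MPReleaseAcquire mp;
    int              seen = -1;
    std::thread      reader([&] { seen = mp.t2(); });
    std::thread      writer([&] { mp.t1(); });
    writer.join();
    reader.join();
    return seen;
}

inline void mp_release_acquire_demo() {
    for (int i = 0; i < 100; ++i) {
        assert(run_mp_once() == 1); // the (0) outcome is now forbidden
    }
}
```

On x86 this should compile to the same plain loads and stores as the relaxed version; on ARM the compiler has to emit barriers or acquire/release instructions to keep the promise.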

Store Buffering

Store buffering is the odd case that is allowed in the Intel memory model. When assigning locations in two threads, and then reading them on opposite threads, both threads are allowed to read the older state. The stores get buffered.
From the Intel White Paper:

Processor 0              Processor 1
mov [ _x], 1 // M1       mov [ _y], 1 // M3
mov r1, [ _y] // M2      mov r2, [ _x] // M4

Initially x = y = 0

r1 = 0 and r2 = 0 is allowed

Note, in particular, that there is no interleaving of M1 – M4 that could result in both r1 and r2 being 0, not without interrupting an instruction in the middle. But the instructions themselves are atomic, and indivisible. If they were actually operating on shared memory, this would not be possible. However, it does happen.

SB::SB() : x_(0), y_(0) {}
void SB::t1(Result& read) {
    y_.store(1, std::memory_order_relaxed);
    std::get<0>(read) = x_.load(std::memory_order_relaxed);
}
void SB::t2(Result& read) {
    x_.store(1, std::memory_order_relaxed);
    std::get<1>(read) = y_.load(std::memory_order_relaxed);
}

That generates the x86 code

00000000000000f0 <litmus::SB::t1(std::__1::tuple<int, int>&)>:
  f0:   mov    DWORD PTR [rdi+0x40],0x1
  f7:   mov    eax,DWORD PTR [rdi]
  f9:   mov    DWORD PTR [rsi],eax
  fb:   ret

0000000000000100 <litmus::SB::t2(std::__1::tuple<int, int>&)>:
 100:   mov    DWORD PTR [rdi],0x1
 106:   mov    eax,DWORD PTR [rdi+0x40]
 109:   mov    DWORD PTR [rsi+0x4],eax
 10c:   ret

And on my x86 machine:

[ RUN      ] ExperimentTest.SBTest1
(0, 0) : 559
(0, 1) : 999858
(1, 0) : 999576
(1, 1) : 7

So 559 times neither core saw the other core’s store.
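The (0, 0) outcome can be ruled out in C++ by asking for sequential consistency. A minimal sketch, with SBSeqCst and run_sb_once as hypothetical names outside the test framework: with memory_order_seq_cst on all four operations there is a single total order over them, so at least one thread must see the other’s store.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <utility>

// Hypothetical variant of the SB test using sequentially consistent
// operations, which forbid both loads returning 0.
struct SBSeqCst {
    std::atomic<int> x_{0};
    std::atomic<int> y_{0};

    int t1() {
        y_.store(1, std::memory_order_seq_cst);
        return x_.load(std::memory_order_seq_cst);
    }
    int t2() {
        x_.store(1, std::memory_order_seq_cst);
        return y_.load(std::memory_order_seq_cst);
    }
};

// Runs both actions once on separate threads and returns (r from t1, r from t2).
inline std::pair<int, int> run_sb_once() {
    SBSeqCst    sb;
    int         r0 = -1;
    int         r1 = -1;
    std::thread a([&] { r0 = sb.t1(); });
    std::thread b([&] { r1 = sb.t2(); });
    a.join();
    b.join();
    return {r0, r1};
}

inline void sb_seq_cst_demo() {
    for (int i = 0; i < 100; ++i) {
        const std::pair<int, int> r = run_sb_once();
        assert(!(r.first == 0 && r.second == 0));
    }
}
```

On x86 the price is typically a fence or a locked instruction on the store side, which is what drains the store buffer before the following load.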

Load Buffering

Load Buffering is the dual of store buffering. Loads into registers might be delayed, or buffered, and actually performed after following instructions. It’s not allowed in the Intel architecture.

From the Intel White Paper

Processor 0              Processor 1
mov r1, [ _x] // M1      mov r2, [ _y] // M3
mov [ _y], 1 // M2       mov [ _x], 1 // M4

Initially x = y = 0

r1 = 1 and r2 = 1 is not allowed

LB::LB() : x_(0), y_(0) {}
void LB::t1(Result& read) {
    std::get<0>(read) = x_.load(std::memory_order_relaxed);
    y_.store(1, std::memory_order_relaxed);
}
void LB::t2(Result& read) {
    std::get<1>(read) = y_.load(std::memory_order_relaxed);
    x_.store(1, std::memory_order_relaxed);
}

This is the x86 asm code

00000000000000c0 <litmus::LB::t1(std::__1::tuple<int, int>&)>:
  c0:   mov    eax,DWORD PTR [rdi]
  c2:   mov    DWORD PTR [rsi],eax
  c4:   mov    DWORD PTR [rdi+0x40],0x1
  cb:   ret
  cc:   nop    DWORD PTR [rax+0x0]

00000000000000d0 <litmus::LB::t2(std::__1::tuple<int, int>&)>:
  d0:   mov    eax,DWORD PTR [rdi+0x40]
  d3:   mov    DWORD PTR [rsi+0x4],eax
  d6:   mov    DWORD PTR [rdi],0x1
  dc:   ret
  dd:   nop    DWORD PTR [rax]

And the ARM code, at -O1

000000d0 <litmus::LB::t1(std::tuple<int, int>&)>:
  d0:   ldr     r3, [r0]
  d4:   str     r3, [r1, #4]
  d8:   mov     r3, #1
  dc:   str     r3, [r0, #64]   ; 0x40
  e0:   bx      lr

000000e4 <litmus::LB::t2(std::tuple<int, int>&)>:
  e4:   ldr     r3, [r0, #64]   ; 0x40
  e8:   str     r3, [r1]
  ec:   mov     r3, #1
  f0:   str     r3, [r0]
  f4:   bx      lr

ARM generally allows it, but per [maranget2012tutorial] it’s very sensitive, and dependencies will make it not appear. In my tests, I did not observe an instance of load buffering, but that may be due to the first store the compiler introduces in order to actually get the data into the tuple. That it’s documented as possible is still exceedingly strange.

Independent Reads of Independent Writes

IRIW is a generalization of store buffering, where two reader threads each read different apparent orderings of writes from two distinct writer threads.

T1       T2       T3        T4
X = 1    Y = 1    R1 = X    R3 = Y
                  R2 = Y    R4 = X

Initially X = Y = 0
r1 = 1, r2 = 0, r3 = 1, r4 = 0 is allowed on ARM, but not on x86 [maranget2012tutorial,owens2009better]

This is not observed in x86 processors, but is in some ARM and POWER, more often in POWER. X86 hardware has a consistent view of memory where other hardware can see memory writes in different orders on different threads. On my rPi, I didn’t observe any incidents of X and Y being read out of order, over 40 million runs.

IRIW::IRIW() : x_(0), y_(0) {}
void IRIW::t1() {
    x_.store(1, std::memory_order_relaxed);
}

void IRIW::t2() {
    y_.store(1, std::memory_order_relaxed);
}

void IRIW::t3(Result& read) {
    std::get<0>(read) = x_.load(std::memory_order_relaxed);
    std::get<1>(read) = y_.load(std::memory_order_relaxed);
}

void IRIW::t4(Result& read) {
    std::get<2>(read) = y_.load(std::memory_order_relaxed);
    std::get<3>(read) = x_.load(std::memory_order_relaxed);
}


The allowed behavior of modern processors is very different than our mental model of a Von Neumann architecture computer. Each core can have a different view of memory, and without additional controls, writes and reads can break the illusion of a single unified memory. The C++ memory model gives the controls and guarantees about what happens when different threads read and write memory, and here I’ve deliberately used the weakest version available, relaxed, in order to allow the processors the widest latitude in behavior. Relaxed is, for processors that have it, often just an unconstrained int, which means that you will get odd behavior if you are running shared state multithreaded code that uses plain native types. It is a particular problem with code that was originally written and tested on an x86 architecture, because the native model is fairly strong. This frequently causes problems when porting to a mobile platform, where ARM is a very popular hardware choice.

Org-mode source and git repo

Exported from an org-mode doc. All of the source is available on github at SpinGate



  • [boehm2005threads] Boehm, Threads cannot be implemented as a library, 261-268, in: ACM Sigplan Notices (2005)
  • [bacon2000double] Bacon, Bloch, Bogda, Click, Haahr, Lea, May, Maessen, Mitchell, Nilsen and others, The “double-checked locking is broken” declaration
  • [IWPAug2007] Intel® 64 Architecture Memory Ordering White Paper, Accessed: 2017-04-30
  • [owens2009better] Owens, Sarkar & Sewell, A better x86 memory model: x86-TSO, 391-407, in: International Conference on Theorem Proving in Higher Order Logics (2009)
  • [maranget2012tutorial] Maranget, Sarkar & Sewell, A tutorial introduction to the ARM and POWER relaxed memory models, Draft available from http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf (2012)

Date: 2017-04-30 Sun 00:00

Author: Steve Downey

Created: 2018-06-17 Sun 14:07



Multithread Experiments

An Experiment Collects Samples

I’m modelling this in order to run bits of code like the various litmus tests used to describe multi-core architectures. A set of functions to be run in parallel that may or may not write to a result, which type is a property of the Test being run. The Experiment will run the Test collecting Samples. The Test type will provide a tuple of functions to run. They will be run under a spingate in all permutations in order to remove scheduling bias.

What a Test looks like

class MP { // Message Passing
    int x_;
    int y_;

  public:
    typedef std::tuple<int> Result;
    MP();
    void t1();
    void t2(Result& read);

    auto actions() {
        return std::make_tuple([this]() { t1(); },
                               [this](Result& result) { t2(result); });
    }
};

The Test interface must provide a Result type, and an actions() member that will produce a tuple of functions to run which either take no arguments or a reference to a result.

The test being defined here is the basic Message Passing litmus test.

MP::MP() : x_(0), y_(0) {}

void MP::t1() {
    x_ = 1;
    y_ = 1;
}

void MP::t2(Result& read) {
    while (!y_) {
    }
    std::get<0>(read) = x_;
}

Two variables are initialized to 0. One thread stores 1 to x first, then 1 to y. The other thread loops until it reads a non-zero value in y, and then reads x. The value in x is the message being passed between threads.

In an actual test, the variables would be atomics, specifying load and store strength, and the variables might have constraints on layout to help sharing cache line updates.

An Experiment

An Experiment samples a test a number of times. It takes the result of each sample, and puts in a map of the results to count, incrementing the count for each distinct result. The actions to run are permuted each time, to help remove bias about which action is loaded behind the spingate first.

void Experiment::run(size_t count) {
    using Actions = decltype(std::declval<Test>().actions());
    auto getters = tupleutil::tuple_getters<Actions>();
    for (size_t i = 0; i < count; ++i) {
        Sample<Test> sample;
        sample.run(getters);
        // tally the sample's result in the map of result counts here
        std::next_permutation(getters.begin(), getters.end());
    }
}

tupleutil::tuple_getters returns an array of getters each of which returns a std::variant<Types…> with the same parameter pack as the tuple.
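To make that description concrete, here is a rough sketch of how such an array of getters could be built. It is not the tupleutil implementation: the names get_as_variant, getters_impl and tuple_getters_sketch are hypothetical, and this version copies each element into the variant, which may differ from what the real library does.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <tuple>
#include <utility>
#include <variant>

// Returns a copy of element I of the tuple, wrapped in a variant over all of
// the tuple's element types. in_place_index keeps duplicate types unambiguous.
template <std::size_t I, typename... Ts>
std::variant<Ts...> get_as_variant(std::tuple<Ts...> const& t) {
    return std::variant<Ts...>(std::in_place_index<I>, std::get<I>(t));
}

// Builds one function pointer per tuple index; all share a single signature,
// so they can live in an array and be picked (or permuted) at run time.
template <typename... Ts, std::size_t... Is>
auto getters_impl(std::index_sequence<Is...>) {
    using Getter = std::variant<Ts...> (*)(std::tuple<Ts...> const&);
    return std::array<Getter, sizeof...(Ts)>{{&get_as_variant<Is, Ts...>...}};
}

template <typename... Ts>
auto tuple_getters_sketch(std::tuple<Ts...> const&) {
    return getters_impl<Ts...>(std::index_sequence_for<Ts...>{});
}

inline void tuple_getters_demo() {
    std::tuple<int, std::string, double> t{1, "two", 3.0};
    auto        getters = tuple_getters_sketch(t);
    std::size_t n       = 1; // an index known only at run time
    auto        v       = getters[n](t);
    assert(v.index() == 1);
    assert(std::get<1>(v) == "two");
}
```

Because every getter has the same type, the array can be reordered freely, and std::visit recovers the static type of whichever element comes out.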

Sample runs all of the actions in a batch that locks them behind a spingate, and collects the results for each action.

template <class Test> class Sample {
    Batch                 batch_;
    Test                  test_;
    typename Test::Result result_;

  public:
    template <typename V, size_t I> void run(std::array<V, I> const& getters) {
        auto const& actions = test_.actions();
        add(actions, getters);
        batch_.run();
    }

Add is a templated member function that loops over the array, uses the getter to pull a function out of the tuple of actions and visits that with a lambda that will add either the function with no arguments, or that function with a reference to the results, to the batch.

    template <typename Tuple, typename Variant, size_t I>
    void add(Tuple const& actions, std::array<Variant, I> const& getters) {
        auto adder = [this](auto&& f) {
            using F = std::remove_cv_t<std::remove_reference_t<decltype(f)>>;
            if constexpr (std::is_invocable_v<F>) {
                batch_.add(f);
            } else {
                batch_.add(f, std::ref(result_));
            }
        };
        for (auto&& get_n : getters) {
            std::visit(adder, get_n(actions));
        }
    }

I am a bit dissatisfied with the else case not being constexpr if followed by a static assert, but getting the condition right didn’t work the obvious way, so I punted. There will be a compiler error if f(result_) can’t actually be called by the batch.

Batch recapped:

The key bit of code is

template <class Function, class... Args>
void Batch::add(Function&& f, Args&&... args) {
    workers_.emplace_back([ this, f = std::forward<Function>(f), args... ]() {
        // wait behind the spingate until run() opens it, then call f
        f(args...);
    });
}

Batch has a spingate and runs all of the functions that are added sitting behind it. The run() function opens the gate and joins all the worker threads.

void Batch::run() {
    // open the spingate, releasing all of the waiting workers
    for (auto& thr : workers_) {
        thr.join();
    }
}
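For readers who have not seen the earlier spingate post, the following self-contained sketch shows the shape of the mechanism. SpinGateSketch and BatchSketch are hypothetical stand-ins, not the classes from the repository, and the member names gate_, wait() and open() are assumptions.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical gate: threads spin in wait() until another thread calls open().
class SpinGateSketch {
    std::atomic<bool> open_{false};

  public:
    void wait() const {
        while (!open_.load(std::memory_order_acquire)) {
        }
    }
    void open() { open_.store(true, std::memory_order_release); }
};

// Hypothetical batch: each added function gets a thread parked at the gate.
class BatchSketch {
    SpinGateSketch           gate_;
    std::vector<std::thread> workers_;

  public:
    template <class Function, class... Args>
    void add(Function&& f, Args&&... args) {
        workers_.emplace_back([this, f = std::forward<Function>(f), args...]() {
            gate_.wait(); // every worker starts from the same line
            f(args...);
        });
    }

    void run() {
        gate_.open();
        for (auto& thr : workers_) {
            thr.join();
        }
    }
};

inline int batch_demo() {
    std::atomic<int> hits{0};
    int              result = 0;
    BatchSketch      batch;
    for (int i = 0; i < 3; ++i) {
        batch.add([&hits] { hits.fetch_add(1); });
    }
    // std::ref lets the worker write through to the caller's variable,
    // in the same way Sample passes its result_ to the batch.
    batch.add([](int& out) { out = 42; }, std::ref(result));
    batch.run();
    return hits.load() + result; // 3 + 42
}
```

Spinning rather than blocking is the point: all of the workers are already running when the gate opens, so they start their actions as close to simultaneously as the hardware allows.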


With all the machinery in place, the test infrastructure can aggressively run multi-threaded tests, giving the thread scheduler the best opportunity to run all of the actions in any order. This allows multi thread bugs to be shaken out by looking for surprising results from the experiment.

Source Code

Exported from an org-mode doc, which is available, with all of the source, on github at SpinGate.

Why std::bind can’t be (formally) deprecated

Yes: std::bind should be replaced by lambda

For almost all cases, std::bind should be replaced by a lambda expression. It’s idiomatic, and results in better code. There is almost no reason post C++11 to use std::bind.

Doing so is quite straightforward: capture each bind argument by value in the lambda capture list, provide auto parameters for each of the placeholders, then call the bound callable using std::invoke(). That will handle the cases of member function pointers, as well as regular functions. This is how to do it mechanically; if you were doing this as part of a manual refactoring, the lambda can be made even clearer.

#include <functional>
#include <iostream>

void f(int n1, int n2, int n3) {
  std::cout << n1 << ' ' << n2 << ' ' << n3 << '\n';
}

int main() {
  using namespace std::placeholders;
  int n = 5;
  auto f1 = std::bind(f, 2, n, _1);
  f1(10); // calls f(2, 5, 10);

  auto l1 = [ p1 = 2, p2 = n ](auto _1) { return std::invoke(f, p1, p2, _1); };
  l1(10);

  // idiomatically
  auto l1a = [=](auto _1){return f(2, n, _1);};
  l1a(10);

  auto f2 = std::bind(f, 2, std::cref(n), _1);
  auto l2 = [ p1 = 2, p2 = std::cref(n) ](auto _1) {
      return std::invoke(f, p1, p2, _1);
  };
  // or
  auto l2a = [ p1 = 2, &p2 = n ](auto _1) {
      return std::invoke(f, p1, p2, _1);
  };
  // more idiomatically
  auto l2b = [&](auto _1){f(2, n, _1);};

  n = 7;
  f2(10); // calls f(2, 7, 10);
  l2(10);
  l2a(10);
  l2b(10);
}


Which results in:

2 5 10
2 5 10
2 5 10
2 7 10
2 7 10
2 7 10
2 7 10
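The member function pointer case mentioned earlier translates the same way. A short sketch, using a hypothetical Accumulator type: the bound member pointer and object pointer become lambda captures, and std::invoke does the dispatch that std::bind would have done.

```cpp
#include <cassert>
#include <functional>

// Hypothetical type with a member function to bind.
struct Accumulator {
    int total = 0;
    int add(int n) {
        total += n;
        return total;
    }
};

// Calls acc.add(n) through std::bind with a placeholder.
inline int add_via_bind(Accumulator& acc, int n) {
    using namespace std::placeholders;
    auto bound = std::bind(&Accumulator::add, &acc, _1);
    return bound(n);
}

// The mechanical translation: captures for the bound arguments, an auto
// parameter for the placeholder, and std::invoke to make the call.
inline int add_via_lambda(Accumulator& acc, int n) {
    auto l = [pm = &Accumulator::add, obj = &acc](auto arg) {
        return std::invoke(pm, obj, arg);
    };
    return l(n);
}

inline void member_bind_demo() {
    Accumulator acc;
    assert(add_via_bind(acc, 5) == 5);
    assert(add_via_lambda(acc, 7) == 12);
}
```

Written by hand rather than mechanically, the lambda would just be [&acc](int n) { return acc.add(n); }.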

No: std::bind provides one thing lambda doesn’t

When a std::bind expression is evaluated, it flattens nested std::bind sub-expressions and passes the same placeholder arguments down. A nested bind is evaluated with the given parameters, and the result is passed in to the outer bind. So you can have a bind that does something like g( _1, f(_1)), and when you call it with a parameter, that same value will be passed to both g and f. The function g will receive f(_1) as its second parameter.

Now, you could rewrite the whole thing as a lambda, but auto potentially makes this a little more difficult. The result of std::bind is an unutterable type. They weren’t supposed to be naked. However, auto means the expression could be broken down into parts, meaning that the translation from a std::bind expression to a lambda expression is potentially not mechanical. Or, the bind could be part of a template, where the subexpression is a template parameter, which is likely working by accident, rather than design.

In any case, std::bind does not treat its arguments uniformly. It treats a bind expression distinctly differently. At the time, it made some sense. But it makes reasoning about bind expressions difficult.

Don’t do this. But it is why formally deprecating std::bind is difficult. They can be replaced, but not purely mechanically.

There isn’t a simple translation that works, unlike converting from std::auto_ptr to std::unique_ptr, or putting a space after a string where it now looks like a conversion. And, std::bind isn’t broken. It’s sub-optimal because of the complicated machinery to support all of the flexibility, where a lambda allows the compiler to do much better. Also, since the type isn’t utterable, it often ends up in a std::function, which erases the type, removing optimization options.

Example of fail code

#include <functional>
#include <iostream>

void f(int n1, int n2, int n3)
{
    std::cout << n1 << ' ' << n2 << ' ' << n3 << '\n';
}

int g(int n1) { return n1; }

int main()
{
    using namespace std::placeholders;

    auto g1 = std::bind(g, _1);
    auto f2 = std::bind(f, _1, g1, 4);
    f2(10); // calls f(10, g(10), 4);

    // auto l2 = [p1 = g1, p2 = 4](auto _1) {std::invoke(f, _1, p1, p2);};
    // l2(10);

    // The bind translation needs to be composed:
    auto l1 = [](auto _1){return g(_1);};
    auto l2 = [p1 = l1, p2 = 4](auto _1){f(_1, p1(_1), p2); };
    l2(10);
    // idiomatically
    auto l2a = [](auto _1) { return f(_1, g(_1), 4);};
    l2a(10);
}

10 10 4
10 10 4
10 10 4


If someone can figure out a fixit recommendation that could be safely applied, transforming the old bind to a lambda, then std::bind could be deprecated in C++Next, and removed as soon as C++(Next++). But that right now is non-trivial in some cases.


  • Fix incorrect statement about type-erasure in std::bind. I was thinking std::function
  • Add more idiomatic transliterations of the std::bind lambdas

Building and running the examples


clean:
    -rm example1
    -rm example2

example1: example1.cpp
    clang++ --std=c++1z example1.cpp -o example1

example2: example2.cpp
    clang++ --std=c++1z example2.cpp -o example2

example3: example3.cpp
    clang++ --std=c++1z example3.cpp -o example3 2>&1

all: example1 example2


rm example1
rm example2
clang++ --std=c++1z example1.cpp -o example1
clang++ --std=c++1z example2.cpp -o example2

Original source

Original document is available on Github

Accessing the elements of a tuple as variant

A further digression, because it turns out I want to be able to permute a tuple at run time. That means treating the elements of a tuple generically. And I can just barely do this, for some tuples, in C++17.

So a slight digression into ADTs. Which in this case means Algebraic Data Types, not Abstract Data Types. But just algebra. No calculus, or differentiation, of data types. Not today.

1 Tuple is Product, Variant is Sum

1.1 Products

In algebra, we usually start out with addition. It’s simpler. But for types, multiplication, or product, is in many ways much more natural. Your basic struct, record, etc is a natural product of types. A type is some kind of collection of things. And I’m being a bit vague here because this is right in the area where set seems like a good idea, and then we get into sets of sets, sets that might contain themselves, and barbers who shave all the people who don’t shave themselves. There is rigour, but I don’t really want to have to go there.

But, if we start with the idea that a type is a collection of things, and that we don’t look too closely at the infinities, we are not going to be terribly wrong. So a type is a way of describing if a thing is in or out of the collection.

Now, I could pretend we don’t know what a struct is. Start with pairs, where there are no names of the components of the struct, and build that up. But we all have a notion of struct. It’s an ordered collection of types. The instances of the struct are all of the elements of each type contained in the struct, matched up with all of the other elements of all the other types in the struct, known as the Cartesian product. So if you have a type A, and a type B, the collection of things in struct {A a; B b;} is the cross of As and Bs. That is {{a1, b1}, {a1, b2}, {a1, b3}, … , {a2, b1}, {a2, b2}, … {an, b1}, … {an, bm}} is all of the elements that are part of the type struct {A a; B b;}. The cardinality of {A, B} is the product of the cardinalities of A and B.

Structs are very natural in C++, but hard to deal with generically, so there’s a type that does it all for you, std::tuple. Getting at the parts of the tuple is a little more difficult than with a struct. You have to say std::get<0>(tuple), or std::get<int>(tuple). And the second might not even compile, if the tuple has more than one int. But you get tools for composing and decomposing tuples at compile time. And std::tuple lets you put pretty much any C++ type into the tuple, only restricting you when you try to, e.g. move a tuple that has an element that can’t be moved.

There should also be a type that acts as a unit for the product, the equivalent of 1 for multiplication. The empty tuple can work as a unit. It contains any of the list of no types. This implies that all empty tuples are equivalent, so its cardinality is 1. There can be only one. The product of a type with the empty tuple is entirely equivalent to the type itself. There are no additional elements in the type, and you can convert back and forth between them. They are isomorphic, having the same shape.

Isomorphisms are important in talking about types, because most of the time we can’t actually distinguish between isomorphic types, at least for proving things. The phrase “up to isomorphism” shows up a lot. To be isomorphic means that we can write a transformation X from type A to type B, and a reverse transformation Y from type B to type A, such that Y(X(a)) == a for all a, and that for any function from a1 to a2, there is an equivalent function from b1 to b2. We could mechanically replace instances of a with the appropriate b and add calls to X and Y without changing the behavior of a program.

1.2 Sums

The other basic algebraic type is the sum type. The corresponding primitive in C++ is a union, with one difference. In most type systems, the sum type automatically remembers which of the allowed types is in it. A union doesn’t, so the standard technique is to embed the union in a struct that carries a tag saying which type in the union was most recently written, and can be read from. I’ll be ignoring type-punning schemes allowing a read of a different type than was written.

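So a Sum type of type A and type B is the union of all of the things in A and all of the things in B. {a1, a2, a3, … , an, b1, b2, … , bm}. The union is disjoint: each element carries its tag, so a value that happens to be in both A and B is counted twice. The cardinality of the sum type is the sum of the cardinalities of A and B.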
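The two cardinality rules can be checked by brute force for small types. This is only an illustrative sketch: Light is a hypothetical three-valued enum, and the functions simply enumerate every value of the product and of the sum into a std::set and count them.

```cpp
#include <cstddef>
#include <set>
#include <tuple>
#include <variant>

// Hypothetical type with exactly three values.
enum class Light { Red, Amber, Green };

// Product of bool (2 values) and Light (3 values): every pairing.
inline std::size_t product_cardinality()
{
    std::set<std::tuple<bool, Light>> values;
    for (bool b : {false, true})
        for (Light l : {Light::Red, Light::Amber, Light::Green})
            values.emplace(b, l);
    return values.size();
}

// Sum of bool and Light: every bool, plus every Light.
inline std::size_t sum_cardinality()
{
    std::set<std::variant<bool, Light>> values;
    for (bool b : {false, true})
        values.emplace(b);
    for (Light l : {Light::Red, Light::Amber, Light::Green})
        values.emplace(l);
    return values.size();
}
```

The product has 2 * 3 = 6 distinct values and the sum has 2 + 3 = 5.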

The unit type of the sum is equivalent to zero. The empty sum type, although a valid type, has no elements in the type. It’s like the empty set. It’s often known as Void, where the unit for product is often called Unit. It may also be known as Bottom, where that is a computation that never completes. Since there are no elements of the type Void, it can’t be instantiated. And a product of Void and any other type is equivalent to Void. The C++ type void is related, but not exactly the same, because it also represents an empty argument list, a function that returns, but does not return any value (a subroutine), and also functions as the universal pointer.

C++17 recently standardized a sum type to go with the long standardized std::tuple, std::variant. A std::variant remembers which of the alternative types was last set. It is almost never empty, only so if a write into one of the alternatives threw an exception, which leaves it in the valueless_by_exception state. It is not allowed to hold void, references, arrays, or to contain no types. This is a bit unfortunate, because except for void std::tuple can do all of those things.

There were several competing models for what std::variant should be, with various tradeoffs being made. It was always clear that std::tuple had to be able to represent everything a struct can, and in fact there are now language features to destructure a struct into a tuple. There is no equivalent model for sum types. Before C++11 a union couldn’t hold anything but trivial types, and even now a union with non-trivial members leaves construction and destruction entirely to you, because there is no built-in mechanism to determine what the union was last written as.

One of the popular models for variant rises out of database-like interfaces. Even though databases are internally strongly typed, SQL queries are not. And the model of sending text over and getting some kind of response back makes it difficult to expose that to a host language. Particularly when the database schema may change, the query still be perfectly valid, but no longer return the same types. However, since we do know there is a relatively short list of permitted types in the database, a variant that allows just those types and the ability to query what type was returned can be quite useful, and not terribly hard to implement. There are JSON parsers taking similar approaches, only with the addition that a JSON type may have JSON objects contained in them recursively, and those have to be outside the object somehow, or the size of the object is unbounded.

From the implementor’s point of view, supporting pointers and arrays is a huge amount of extra work. Not allowing an array to decay to a pointer is quite difficult. References have issues when treated generically. Not to mention that references have decidedly odd semantics in the differences between construction and assignment. And the degenerate case of an empty variant was also difficult. If that needs to be represented, the type std::monostate has been introduced, which is a type designed to have exactly one item in it, so that all instances of std::monostate are identical. This is also the same as the unit type for product types. It’s not an accident that it’s represented in Haskell as (), which is the empty tuple. All empty lists are equivalent. It could have been std::tuple<>, but no one in the room happened to think of that.
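A short sketch of std::monostate standing in for the empty case. NoDefault and MaybeNoDefault are hypothetical names for this example; the point is only that putting std::monostate first gives the variant a default state even when the other alternative has no default constructor.

```cpp
#include <variant>

// Hypothetical type with no default constructor.
struct NoDefault {
    explicit NoDefault(int v) : value(v) {}
    int value;
};

// std::monostate as the first alternative makes the variant
// default constructible, and serves as its "empty" state.
using MaybeNoDefault = std::variant<std::monostate, NoDefault>;

inline bool is_empty(MaybeNoDefault const& v)
{
    return std::holds_alternative<std::monostate>(v);
}
```

A default constructed MaybeNoDefault holds std::monostate, and any two std::monostate values compare equal, which is what makes it usable as a unit type.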

2 Tuple is a Heterogeneous Container, what is the iterator?

The C++ standard says “tuples are heterogeneous, fixed-size collections of values” – [tuple.general]. Collections generally have iterator types associated with them, but that’s a bit of a challenge since the iterator model in C++ assumes that for a collection, the type of *(Collection<T>::iterator) is T. But if the collection isn’t on T, but on Types…, it doesn’t quite work to say *(Collection<typename… Types>) is of type …Types. You need something to hold that. But in many cases, std::variant can work. It doesn’t quite work, since we’d really need a variant of references to the elements of the tuple, so that they could be written to. However, for many purposes we can come close. For the case I was looking at, making copies is perfectly fine. What I’m looking for is something roughly with the signature

template <typename... Types>
auto getElement(size_t i, std::tuple<Types...> tuple) -> std::variant<Types...>;

That is, something that will get me the ith element of a tuple, as a variant with the same typelist as the tuple, with the index determined at runtime. All of the normal accessors are compile time. So I need to do something that will make the compile time information available at runtime.

Start with something I do know how to do, idiomatically printing a tuple.

template <typename Func, typename Tuple, std::size_t... I>
void tuple_for_each_impl(Tuple&& tuple, Func&& f, std::index_sequence<I...>)
{
    auto swallow = {0,
                    (std::forward<Func>(f)(
                        I, std::get<I>(std::forward<Tuple>(tuple))))...};
    (void)swallow;
}

template <typename Func, typename... Args>
void tuple_for_each(std::tuple<Args...> const& tuple, Func&& f)
{
    tuple_for_each_impl(tuple, f, std::index_sequence_for<Args...>{});
}

template <typename... Args>
void print(std::ostream& os, std::tuple<Args...> const& tuple)
{
    auto printer = [&os](auto i, auto el) {
        os << (i == 0 ? "" : ", ") << el;
        return 0;
    };
    return tuple_for_each(tuple, printer);
}

Actually, a bit more complicated than the totally standard idiom, since it factors out the printer into an application across the tuple, but it’s not much more complicated. The tuple_for_each constructs an index sequence based on the argument list, and delegates that to the impl, which uses it to apply the function to each element of the tuple. The _impl ought to be in a nested detail namespace, so as not to leak out. Swallow is the typical name for using an otherwise unnamed, and uninteresting, type to apply something to each element of the tuple for a side-effect. The void cast is to make sure the variable is used, and is evaluated.
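For comparison, here is a sketch of the same printing job done with a different technique: C++17 std::apply and a fold expression over the comma operator, instead of the index_sequence and swallow idiom. The names print_tuple and tuple_to_string are made up for this example, and it assumes every element type can be streamed with operator<<.

```cpp
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>
#include <tuple>

// Print the elements separated by ", ". std::apply unpacks the tuple
// into the lambda's parameter pack; the comma fold visits the elements
// left to right.
template <typename... Args>
void print_tuple(std::ostream& os, std::tuple<Args...> const& tuple)
{
    std::size_t i = 0;
    std::apply(
        [&os, &i](auto const&... el) {
            ((os << (i++ == 0 ? "" : ", ") << el), ...);
        },
        tuple);
}

// Convenience wrapper so the result can be inspected as a string.
template <typename... Args>
std::string tuple_to_string(std::tuple<Args...> const& tuple)
{
    std::ostringstream os;
    print_tuple(os, tuple);
    return os.str();
}
```

The fold removes the need for the swallow variable, at the cost of tracking the position in a counter rather than receiving it as the index I.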

The next step is, instead of an application of a function for its side-effect, instead a mapping of the tuple, returning the transformed tuple.

template <typename Func, typename Tuple, std::size_t... I>
auto tuple_transform_impl(Tuple&& tuple, Func&& f, std::index_sequence<I...>)
{
    return std::make_tuple(
        std::forward<Func>(f)(std::get<I>(std::forward<Tuple>(tuple)))...);
}

template <typename Func, typename... Args>
auto tuple_transform(std::tuple<Args...>&& tuple, Func&& f)
{
    return tuple_transform_impl(tuple, f, std::index_sequence_for<Args...>{});
}

template <typename Func, typename... Args>
auto tuple_transform(std::tuple<Args...> const& tuple, Func&& f)
{
    return tuple_transform_impl(tuple, f, std::index_sequence_for<Args...>{});
}

Because the std::tuple is not a template parameter, std::tuple<Args...>&& is an rvalue reference rather than a forwarding reference, so I have to supply both a const& and an rvalue-reference form to cover both cases. And I’m ignoring volatile quals. The _impl function uses forwarding-reference parameters, which will decay or forward properly using std::forward. Using it is straightforward.
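One possible alternative, sketched here under the assumption that a less constrained signature is acceptable, is to make the whole tuple a template parameter and let std::apply do the unpacking. The name transform_tuple is hypothetical. A single overload then covers lvalues and rvalues, but it also accepts anything else std::apply understands, such as std::pair and std::array, which may or may not be wanted.

```cpp
#include <tuple>
#include <utility>

// Tuple&& is a forwarding reference here because Tuple is deduced,
// so one overload handles both lvalue and rvalue tuples.
template <typename Tuple, typename Func>
auto transform_tuple(Tuple&& tuple, Func&& f)
{
    return std::apply(
        [&f](auto&&... el) {
            return std::make_tuple(f(std::forward<decltype(el)>(el))...);
        },
        std::forward<Tuple>(tuple));
}
```

The trade-off is that the parameter no longer says std::tuple in the signature, so errors from passing the wrong thing show up inside std::apply rather than at the call.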

std::tuple<int, double, long> t = std::make_tuple(1, 2.3, 1l);
auto transform = tupleutil::tuple_transform(t,
                                            [](auto i) { return i + 1; });

EXPECT_EQ(3.3, std::get<1>(transform));

auto t2 = tupleutil::tuple_transform(std::make_tuple(4, 5.0),
                                     [](auto i) { return i + 1; });
EXPECT_EQ(6, std::get<1>(t2));

So, for functions over all the types in a tuple, tuple is a Functor. That is, we can apply the function to all elements in the tuple, and it’s just like making a tuple out of applying the functions to elements before making the tuple. If this sounds like a trivial distinction, you are mostly right. Almost all container-ish things are Functors, and a few non-containerish things are also. Plus Functor sounds more impressive.

The transform also suggests a way of solving the problem I was originally looking at. An array of the elements of the tuple each as a variant will let me permute them with std tools.

template <typename... Args, std::size_t... I>
constexpr std::array<std::variant<Args...>, sizeof...(Args)>
tuple_to_array_impl(std::tuple<Args...> const& tuple,
                    std::index_sequence<I...>)
{
    using V = std::variant<Args...>;
    std::array<V, sizeof...(Args)> array = {
        {V(std::in_place_index_t<I>{}, std::get<I>(tuple))...}};
    return array;
}

template <typename... Args>
constexpr std::array<std::variant<Args...>, sizeof...(Args)>
tuple_to_array(std::tuple<Args...> const& tuple)
{
    return tuple_to_array_impl(tuple, std::index_sequence_for<Args...>{});
}

And that can be used something like:

TEST(TupleTest, to_array)
{
    constexpr std::tuple<int, double, long> t = std::make_tuple(1, 2.3, 1l);
    auto arr = tupleutil::tuple_to_array(t);
    int  i   = std::get<int>(arr[0]);
    EXPECT_EQ(1, i);
}

TEST(TupleTest, to_array_repeated)
{
    constexpr std::tuple<int, int, int> t = std::make_tuple(1, 2, 3);
    auto arr = tupleutil::tuple_to_array(t);
    int  i   = std::get<2>(arr[2]);
    EXPECT_EQ(3, i);
}

The second test is there because I was about to write, “as you can see, we can tell the difference between variants holding the same type”, except that wasn’t true. The original version of to_array didn’t use the constructor form with std::in_place_index_t. The code I ended up with did, but not at this point. There’s nothing like writing out what something is supposed to do to make you look and keep you honest.

So here, we’re constructing an array of std::variant<Args…> and constructing each member with the argument pack expansion into the std::variant constructor using the Ith index value to get that element of the tuple, and recording that we’re constructing the Ith alternative of the variant. The second test checks that. The element at index 2 of the array must hold alternative 2 of the variant, and can be retrieved only by std::get<2>().
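The effect of naming the index is easiest to see in isolation. In this sketch, V3 and make_third are hypothetical names; with three int alternatives the converting constructor is not usable at all, so the alternative has to be chosen explicitly.

```cpp
#include <variant>

// Three alternatives of the same type: only the index tells them apart.
using V3 = std::variant<int, int, int>;

// Construct alternative 2 explicitly. A plain V3{value} would not
// compile, since the converting constructor cannot pick among the ints.
inline V3 make_third(int value)
{
    return V3(std::in_place_index<2>, value);
}
```

The resulting variant reports index() == 2 and is read back with std::get<2>; std::get<int> is not available for this type because int is not unique in the list.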

This would allow me to permute the elements of a tuple, but I’m fairly close now to being able to write a version that allows choice of the element at runtime, rather than at compile time.

    constexpr std::tuple<int, double, long> t = std::make_tuple(1, 2.3, 1l);
    std::variant<int, double, long> v0{1};
    auto v = tupleutil::get(0, t);
    EXPECT_EQ(v0, v);

What I’m going to do is construct an array of the getters for the tuple, each of which will return the element wrapped in a variant. The signature of the array will be of function pointer type, because, quite conveniently, a non-capturing lambda can decay to a function pointer.

First getting the array of getters for the tuple

template <typename V, typename T, size_t I> auto get_getter()
{
    return [](T const& t) {
        return V{std::in_place_index_t<I>{}, std::get<I>(t)};
    };
}

template <typename... Args, std::size_t... I>
auto tuple_getters_impl(std::index_sequence<I...>)
{
    using V = std::variant<Args...>;
    using T = std::tuple<Args...>;
    using F = V (*)(T const&);
    std::array<F, sizeof...(Args)> array
        //        = {{[](T const& tuple){return V{std::get<I>(tuple)};}...}};
        = {{get_getter<V, T, I>()...}};
    return array;
}

template <typename... Args> auto tuple_getters(std::tuple<Args...>)
{
    return tuple_getters_impl<Args...>(std::index_sequence_for<Args...>{});
}

So first a function that returns a function that constructs a variant around the value of what’s returned from std::get<I>. Well, it could return anything that happens to have a constructor that takes an in_place_index_t, and take as the thing to be converted anything that std::get<I> can extract from. This is actually a separate function because GCC was unhappy doing the template parameter pack expansion inline in the _impl function. Clang was happy with the expansion noted in the comment. I really have no idea who is wrong here, and the workaround was straightforward. The array is one of function pointers, which the returned lambdas can decay to.

Now the only remaining trick is to use this array as a table to dispatch to the appropriate getter for the tuple.

const auto get = [](size_t i, auto t) {
    static auto tbl = tupleutil::tuple_getters(t);
    return tbl[i](t);
};

Get the array as a static, so we only need to compute it once, and simply return tbl[i](t).
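Putting the pieces together, here is a self-contained sketch of the same table-dispatch idea. It is not the tupleutil code: the sketch namespace and the names get_as_variant and runtime_get are made up, it uses plain function templates instead of lambdas for the table entries, and it assumes a non-empty tuple and an index that is in range.

```cpp
#include <array>
#include <cstddef>
#include <tuple>
#include <utility>
#include <variant>

namespace sketch {

// One getter per index: wraps element I as alternative I of V.
template <typename V, typename T, std::size_t I>
V get_as_variant(T const& t)
{
    return V(std::in_place_index<I>, std::get<I>(t));
}

template <typename... Args, std::size_t... I>
std::variant<Args...> runtime_get_impl(std::size_t i,
                                       std::tuple<Args...> const& t,
                                       std::index_sequence<I...>)
{
    using V = std::variant<Args...>;
    using T = std::tuple<Args...>;
    using F = V (*)(T const&);
    // Table of function pointers, built once per tuple type.
    static constexpr std::array<F, sizeof...(Args)> table = {
        {&get_as_variant<V, T, I>...}};
    return table[i](t); // no bounds check: i must be < sizeof...(Args)
}

// Runtime-indexed access: returns a copy of element i as a variant.
template <typename... Args>
std::variant<Args...> runtime_get(std::size_t i, std::tuple<Args...> const& t)
{
    return runtime_get_impl(i, t, std::index_sequence_for<Args...>{});
}

} // namespace sketch
```

Calling sketch::runtime_get(1, std::make_tuple(1, 2.5)) yields a std::variant<int, double> whose index() is 1. Taking the tuple by const& avoids the copy that the auto t parameter makes, at the cost of only accepting std::tuple.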

TEST(TupleTest, gettersStatic)
{
    constexpr std::tuple<int, double, long> t = std::make_tuple(1, 2.3, 1l);
    std::variant<int, double, long> v0{1};
    auto v = tupleutil::get(0, t);
    EXPECT_EQ(v0, v);

    int  i = std::get<0>(v);
    EXPECT_EQ(1, i);

    auto v2 = tupleutil::get(1, t);

    EXPECT_EQ(1ul, v2.index());
    double d = std::get<double>(v2);

    EXPECT_EQ(2.3, d);

    constexpr auto t2 = std::make_tuple(2.4, 1l);
    auto           v3 = tupleutil::get(0, t2);
    double         d2 = std::get<double>(v3);

    EXPECT_EQ(2.4, d2);
}

3 Source

All source is available at TupleUtil on GitHub, including org source for this post.

Cross Compiling

1 Setting up Cross Compiling

In order to test out some of these multi-threaded tools properly, I really need to run them on a less strict platform than x86_64. X86_64 provides a lot of guarantees about sequential consistency and atomicity that hide problems that will happen on architectures that are not as strong, like power, sparc, and arm. Fortunately, one of the toys I have is a recent Raspberry Pi 3, which is based on a recent arm chip. Unfortunately, Raspbian, the normal linux distro for the Raspberry Pi, is also based on a fairly old debian distro, with a fairly old compiler. Linaro is back porting their arm code generation fixes to the old releases, but I’m more interested in the recent C++ language features. So I could attempt to compile GCC 6 on the RPi, or I can cross compile from my normal machine. I decided to cross compile, since if that worked, it would be considerably easier. It turned out to be pretty straightforward.

sudo apt-get install g++-6-arm-linux-gnueabihf

This is mostly because I’m already doing software development on the box, so I didn’t need any of the other parts of the compiler ecosystem, just the right c++ toolchain. The hardest part is determining the right one. There are a few flavors for arm development. The RPi is the gnu extended abi, with hardware float. The Ubuntu repositories only supply linux variants, which is sensible, since that top level package ends up installing not just the compilers, but a libstdc++ and libc for arm-linux-gnueabihf, which need to know much more about the OS in order to interface with it.

This does lead to one snag, though. The versions of the libraries are not the ones available on the RPi. Which is a problem, since I want to use modern, or maybe even post-modern C++. There are two ways of dealing with this, and I’ve ended up using both.
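To make the earlier point about weaker memory models concrete, here is a minimal sketch of the kind of code that is worth running on arm. It is not taken from the spingate sources; message_passing is a hypothetical name. One thread publishes a plain int through an atomic flag, and the release/acquire pair is what makes the read of the payload safe. If both operations were memory_order_relaxed the read would be a data race, one that the stronger x86_64 ordering tends to mask.

```cpp
#include <atomic>
#include <thread>

// Publish a value from one thread to another through an atomic flag.
inline int message_passing()
{
    int               payload = 0;
    std::atomic<bool> ready{false};

    std::thread producer([&payload, &ready] {
        payload = 42;                                 // plain write
        ready.store(true, std::memory_order_release); // publish it
    });

    // Spin until the flag is set; acquire pairs with the release above,
    // so the write to payload is visible once the loop exits.
    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    int seen = payload;

    producer.join();
    return seen;
}
```

With the orderings as written the function returns 42 on any conforming platform.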

2 Sysroot

When cross compiling, a sysroot is a system that looks just like the root file system of the target platform. It will have /lib, /usr/lib, etc, with the versions of the libraries that you want. You can either use a disk image, mounted somewhere convenient, or you can just mount the target computer’s root filesystem somewhere convenient. If you do that, you’ll have access to all of the libraries available, not just the minimal set typically available on a prepackaged sysroot. So that’s what I did.

sshfs sdowney@cobweb.local:/ /home/sdowney/mnt/rpi/ -o transform_symlinks -o allow_other

Cobweb is my Raspberry Pi box, and zeroconf makes the current ip address available as cobweb.local. I’m mounting that into ~/mnt/rpi, transforming symlinks so that they actually work, and allowing others to access the mounted fs.

With that I can specify the sysroot, and have the compiler look there for libraries:

arm-linux-gnueabihf-g++-6 -v --sysroot ~/mnt/rpi/ -o hello hw.cpp

That spits out all of what the compiler driver invokes, and as a byproduct, a bunch of what is needed to set up cross compiling with other compilers, like clang. The key things to look for are the include directories called out by “#include <…> search starts here”, and the LIBRARY_PATH variable that helps define what the linker does. I’ll be pulling those out for the clang cross compile cmake toolchain file.

Using built-in specs.
Target: arm-linux-gnueabihf
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 6.2.0-5ubuntu12' --with-bugurl=file:///usr/share/doc/gcc-6/README.Bugs --enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-6 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-libitm --disable-libquadmath --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-6-armhf-cross/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-6-armhf-cross --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-6-armhf-cross --with-arch-directory=arm --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --disable-libgcj --enable-objc-gc --enable-multiarch --enable-multilib --disable-sjlj-exceptions --with-arch=armv7-a --with-fpu=vfpv3-d16 --with-float=hard --with-mode=thumb --disable-werror --enable-multilib --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=arm-linux-gnueabihf --program-prefix=arm-linux-gnueabihf- --includedir=/usr/arm-linux-gnueabihf/include
Thread model: posix
gcc version 6.2.0 20161005 (Ubuntu 6.2.0-5ubuntu12)
COLLECT_GCC_OPTIONS='-v' '-o' 'hello' '-shared-libgcc' '-march=armv7-a' '-mfloat-abi=hard' '-mfpu=vfpv3-d16' '-mthumb' '-mtls-dialect=gnu'
 /usr/lib/gcc-cross/arm-linux-gnueabihf/6/cc1plus -quiet -v -imultiarch arm-linux-gnueabihf -isysroot /home/sdowney/mnt/rpi/ -D_GNU_SOURCE hw.cpp -quiet -dumpbase hw.cpp -march=armv7-a -mfloat-abi=hard -mfpu=vfpv3-d16 -mthumb -mtls-dialect=gnu -auxbase hw -version -fstack-protector-strong -Wformat -Wformat-security -o /tmp/ccUwr5Jd.s
GNU C++14 (Ubuntu 6.2.0-5ubuntu12) version 6.2.0 20161005 (arm-linux-gnueabihf)
    compiled by GNU C version 6.2.0 20161005, GMP version 6.1.1, MPFR version 3.1.5, MPC version 1.0.3, isl version 0.15
GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
ignoring nonexistent directory "/home/sdowney/mnt/rpi/usr/local/include/arm-linux-gnueabihf"
#include "..." search starts here:
#include <...> search starts here:
End of search list.
GNU C++14 (Ubuntu 6.2.0-5ubuntu12) version 6.2.0 20161005 (arm-linux-gnueabihf)
    compiled by GNU C version 6.2.0 20161005, GMP version 6.1.1, MPFR version 3.1.5, MPC version 1.0.3, isl version 0.15
GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
Compiler executable checksum: 8867fa57a9cbba18ebd7880e42ca78ba
COLLECT_GCC_OPTIONS='-v' '-o' 'hello' '-shared-libgcc' '-march=armv7-a' '-mfloat-abi=hard' '-mfpu=vfpv3-d16' '-mthumb' '-mtls-dialect=gnu'
 /usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/bin/as -v -march=armv7-a -mfloat-abi=hard -mfpu=vfpv3-d16 -meabi=5 -o /tmp/ccJH2IA5.o /tmp/ccUwr5Jd.s
GNU assembler version 2.27 (arm-linux-gnueabihf) using BFD version (GNU Binutils for Ubuntu) 2.27
COLLECT_GCC_OPTIONS='-v' '-o' 'hello' '-shared-libgcc' '-march=armv7-a' '-mfloat-abi=hard' '-mfpu=vfpv3-d16' '-mthumb' '-mtls-dialect=gnu'
 /usr/lib/gcc-cross/arm-linux-gnueabihf/6/collect2 -plugin /usr/lib/gcc-cross/arm-linux-gnueabihf/6/ -plugin-opt=/usr/lib/gcc-cross/arm-linux-gnueabihf/6/lto-wrapper -plugin-opt=-fresolution=/tmp/cctgBCzX.res -plugin-opt=-pass-through=-lgcc_s -plugin-opt=-pass-through=-lgcc -plugin-opt=-pass-through=-lc -plugin-opt=-pass-through=-lgcc_s -plugin-opt=-pass-through=-lgcc --sysroot=/home/sdowney/mnt/rpi/ --build-id --eh-frame-hdr -dynamic-linker /lib/ -X --hash-style=gnu --as-needed -m armelf_linux_eabi -z relro -o hello /usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/lib/../lib/crt1.o /usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/lib/../lib/crti.o /usr/lib/gcc-cross/arm-linux-gnueabihf/6/crtbegin.o -L/usr/lib/gcc-cross/arm-linux-gnueabihf/6 -L/usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/lib/../lib -L/home/sdowney/mnt/rpi/lib/arm-linux-gnueabihf -L/home/sdowney/mnt/rpi/lib/../lib -L/home/sdowney/mnt/rpi/usr/lib/arm-linux-gnueabihf -L/home/sdowney/mnt/rpi/usr/lib/../lib -L/usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/lib -L/home/sdowney/mnt/rpi/lib -L/home/sdowney/mnt/rpi/usr/lib /tmp/ccJH2IA5.o -lstdc++ -lm -lgcc_s -lgcc -lc -lgcc_s -lgcc /usr/lib/gcc-cross/arm-linux-gnueabihf/6/crtend.o /usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/lib/../lib/crtn.o
COLLECT_GCC_OPTIONS='-v' '-o' 'hello' '-shared-libgcc' '-march=armv7-a' '-mfloat-abi=hard' '-mfpu=vfpv3-d16' '-mthumb' '-mtls-dialect=gnu'

Now, note that the compiler will prefer the locally installed versions before using the ones in the sysroot. This is fine, until I need to install something. Then I’ll get an error because the library on the RPi is too old. Particularly libstdc++. This works well for the non-core language libraries, though. Or at least ones that don’t have C++ in their interface. Mixing C++ versions is a horrible minefield. The easiest way to deal with it is to avoid it.

3 Static linking

Recent versions of gcc allow libstdc++ to be linked statically. It increases the size of the resulting executable, but with fewer worries about deployment issues.


That will cause the compiler driver to direct the linker to prefer the static version of libstdc++, rather than the shared version. And I don’t have to worry about deploying or upgrading the system libraries on the target box.

Note, this isn’t really a supported deployment configuration. So any bugs are going to be my problem.

4 CMake

I’ve been using CMake to generate the build system, so I need to explain to it how to use the cross compiler instead of one for the host system. CMake has support for supplying definitions for these in Toolchain files. This is what I have so far


SET(CMAKE_C_COMPILER arm-linux-gnueabihf-gcc)
SET(CMAKE_CXX_COMPILER arm-linux-gnueabihf-g++)


SET(CMAKE_CXX_FLAGS "-static-libgcc -static-libstdc++" CACHE STRING "CXX_FLAGS" FORCE)

     CACHE STRING "Result from TRY_RUN" FORCE)

That, in addition to setting the compiler to use, forces a few CMake options that are otherwise problems. The first is setting the static link flag for libstdc++. The second is overriding the search for pthreads, because trying to run programs built with a cross compiler doesn’t work very well. This lies and forces the option.

Used like so

cmake  -D CMAKE_TOOLCHAIN_FILE=~/src/toolchain/pi.cmake -DCMAKE_BUILD_TYPE=Release ..

A toolchain file for clang is a little more complicated, because it doesn’t really understand the gcc multilib layout, so it needs to be told where all the include and lib directories are for the target system, for both the C and C++ compiler.


set(triple arm-linux-gnueabihf)



 -isystem /usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/include/c++/6 \
 -isystem /usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/include/c++/6/arm-linux-gnueabihf \
 -isystem /usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/include/c++/6/backward \
 -isystem /usr/lib/gcc-cross/arm-linux-gnueabihf/6/include \
 -isystem /usr/lib/gcc-cross/arm-linux-gnueabihf/6/include-fixed \
 -isystem /usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/include"

 -isystem /usr/lib/gcc-cross/arm-linux-gnueabihf/6/include \
 -isystem /usr/lib/gcc-cross/arm-linux-gnueabihf/6/include-fixed \
 -isystem /usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/include"

 -L /usr/lib/gcc-cross/arm-linux-gnueabihf/6 \
 -L /usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/lib/../lib \
 -L /usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/lib \
 -static-libgcc -static-libstdc++"

     CACHE STRING "Result from TRY_RUN" FORCE)

5 Sources

Toolchain files are on Github next to the spingate sources, which now include the org file that is the source for this entry.

batch: running functions under a spingate

1 A batch of tasks to run

This adds a rather simple component to spingate orchestrating a batch of tasks to be run, gated by the spingate. The tasks are added one at a time, a thread is created for the task, and the thread waits on the spingate to open before calling the task.

Or at least that’s how it started. Task was originally a std::function<void()>, which is essentially the interface of the thread pool I use. I realized, however, that I don’t actually need to restrict the interface quite that much. Thread takes a much wider range of things to run, and I can do the same thing. I have to forward the supplied callable and arguments into the lambda that the thread is running.

The key bit of code is

template <class Function, class... Args>
void Batch::add(Function&& f, Args&&... args) {
    workers_.emplace_back([ this, f = std::forward<Function>(f), args... ]() {
        gate_.wait();
        f(args...);
    });
}

There’s a lot of line noise in there, and it really looked simpler when it was just taking a std::function<void()>, but it’s not terrible. We take an object of type Function and a parameter pack of type Args by forwarding reference. That gets captured by the lambda, where we forward the function to the lambda, and capture the parameter pack. Inside the lambda we call the function with the pack, f(args...). It’s probable that I should have used std::invoke there, which handles some of the more interesting cases of calling a thing with arguments. But this was sufficient unto the day. The captured this allows access to the gate_ variable that we’re waiting on. The workers_ are a vector of threads that we’ll later run through and join() on, after open()ing the gate_.

void Batch::run() {
    gate_.open();
    for (auto& thr : workers_) {
        thr.join();
    }
}

That’s really all there is to Batch. It’s a middle connective glue component. Does one thing, and tries to do it obviously well. That is important since I’m trying to build up test infrastructure, and testing the test infrastructure is a hard problem.
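For the std::invoke variation mentioned above, here is a self-contained sketch. It is not the spingate code: GateSketch is a hypothetical stand-in for the real gate, BatchSketch is a made-up name, and the arguments are stored in a tuple and called through std::apply, which is specified in terms of std::invoke.

```cpp
#include <atomic>
#include <thread>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical stand-in for the spingate: spin until opened.
class GateSketch {
    std::atomic<bool> open_{false};

  public:
    void wait() const
    {
        while (!open_.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
    }
    void open() { open_.store(true, std::memory_order_release); }
};

class BatchSketch {
    GateSketch               gate_;
    std::vector<std::thread> workers_;

  public:
    BatchSketch() = default;
    BatchSketch(BatchSketch const&) = delete;
    BatchSketch& operator=(BatchSketch const&) = delete;
    ~BatchSketch() { run(); } // never destroy joinable threads

    // Store decayed copies of f and args; call them once the gate opens.
    template <class Function, class... Args>
    void add(Function&& f, Args&&... args)
    {
        workers_.emplace_back(
            [this, f = std::forward<Function>(f),
             args = std::make_tuple(std::forward<Args>(args)...)]() mutable {
                gate_.wait();
                std::apply(std::move(f), std::move(args));
            });
    }

    // Open the gate, then wait for every task to finish.
    void run()
    {
        gate_.open();
        for (auto& thr : workers_) {
            thr.join();
        }
        workers_.clear();
    }
};
```

Going through std::apply means a pointer to member function plus an object works as a task, which the plain f(args...) call does not handle.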

I have reorganized the code repo in order to do some light testing, though.

2 GTest

I’ve pushed things about in the source repo, moving the code into a library directory, which means I can link it into the existing mains, as well as into new gtests. In the CMake system, I’ve conditioned building tests on the existence of the googletest project being available as a subdirectory. I use enough different compilers and build options that trying to use a system build of gtest just doesn’t work. The best, and recommended, choice, is to build googletest as part of your project. That way any ABI impacting subtlety, like using a different C++ standard library, is taken care of automatically. The bit of cmake magic is in the top level CMakeLists.txt :

# A directory to find Google Test sources.
if (EXISTS "${CMAKE_CURRENT_SOURCE_DIR}/googletest/CMakeLists.txt")
  add_subdirectory(googletest EXCLUDE_FROM_ALL)
  add_subdirectory(tests)
else()
  message("GTEST Not Found at ${CMAKE_CURRENT_SOURCE_DIR}/googletest/CMakeLists.txt")
endif()

This looks for googletest to be available, and if it is, adds it to the project along with my tests subdirectory; otherwise it issues a message. I prefer this to attempting to fix up the missing gtest automatically. That always seems to cause me problems, such as when I’m operating disconnected, on a train, like right now.

The tests I have are pretty simple, not much more than primitive breathing tests.

TEST_F(BatchTest, run1Test)
    Batch batch;

    EXPECT_EQ(0u, called);

    EXPECT_EQ(1u, called);

or, to make sure that passing arguments worked

TEST_F(BatchTest, runArgTest)
{
    Batch batch;
    int i = 0;
    batch.add([&i](int k){ i = k;}, 1);

    EXPECT_EQ(0, i);

    batch.run();

    EXPECT_EQ(1, i);
}

I don’t actually expect to find runtime errors with these tests. They exercise this component just enough that I’m not generating compile errors in expected use cases. Template code can be tricky that way. Templates that aren’t instantiated can have horrible errors, but the compiler is willing to let them pass, if they mostly parse.
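As a small illustration of that last point, here is a sketch that has nothing to do with Batch itself; Holder is a made-up type. The body of size() is only checked against T when something actually calls it, so Holder<int> compiles happily until a test tries to use that member.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Made-up example type, not part of the library.
template <class T>
struct Holder {
    T value;

    // value.size() depends on T, so this body is only checked for a given T
    // when size() is actually used with that T.
    std::size_t size() const { return value.size(); }
};

inline std::size_t holder_demo() {
    Holder<int> number{42};  // fine: Holder<int>::size() is never instantiated
    assert(number.value == 42);
    // number.size();        // uncommenting this line is a compile error

    Holder<std::string> word{"quack"};
    return word.size();      // instantiates size() for std::string
}
```

A breathing test that merely calls each member once with a realistic type is enough to force those instantiations.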

SFINAE may not be your friend.

3 Clang builds with current libc++

Building clang and libc++ locally is getting easier and easier. Using that is still a bit difficult. But there are some reasons to do so. One is just being able to cross check your code for sanity. I won’t reproduce building clang and libc++ here. It’s really at this point just checking out the repos in the right places and running cmake with something like:

cmake -DCMAKE_INSTALL_PREFIX=~/install/llvm-master/ -DLLVM_ENABLE_LIBCXX=yes -DCMAKE_BUILD_TYPE=Release ../llvm/

Using that, at least from within cmake, is more complicated. Cmake has a strong bias towards using the system compiler. It also has a distinct problem with repeating builds.

NEVER edit your CMakeCache.txt. You can’t do anything with it. All the paths are hard coded. Always start over. Either keep the command line around, or create a cmake initial cache file, which isn’t the same thing at all as the CMakeCache.txt file.

Right now, I’m cargo-culting around code in my cmake files that checks if I’ve defined an LLVM_ROOT, and if I have, supplies the flags to ignore all the system files and use the ones from the installed LLVM_ROOT, including some rpath fixup. There might be some way to convince cmake to do it, but there’s also only so much I will fight my metabuild system.

    message(STATUS "LLVM Root: ${LLVM_ROOT}")
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -nostdinc++ -isystem ${LLVM_ROOT}/include/c++/v1")
    set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -L ${LLVM_ROOT}/lib -l c++ -l c++abi")

I only check for that if the compiler I’ve chosen is a clang compiler, and it’s not normally part of my environment.

4 Direction

Overall, what I want out of this library is to be able to stress test some nominally mt-safe code, and check that the conditions that I think hold are true. It’s heavily influenced by jcstress, but, because this is C++, it will be rendered quite differently.

For what I’m considering, look at Close Encounters of The Java Memory Model Kind

I want to be able to specify a state, with operations that mutate and observe the state. I want to be able to collect those observations in a deterministic way, which may require cooperation from the observers. I want to be able to collect the observations and report how many times each set of observations was obtained.

Something like:

#include <tuple>

class State {
    int x_;
    int y_;

  public:
    typedef std::tuple<int, int, int, int> Result;
    State() : x_(0), y_(0) {}
    void writer1() {
        y_ = 1;
        x_ = 1;
    }
    void reader1(Result& read) {
        std::get<0>(read) = x_;
        std::get<1>(read) = y_;
    }
    void reader2(Result& read) {
        std::get<2>(read) = x_;
        std::get<3>(read) = y_;
    }
};

The idea is to run the writers and readers over different threads and observe the possible results. On some architectures, reader1 and reader2 can see entirely different orders: even though y_ is written before x_ in program order, you might see x_ written and not y_.

What I’d eventually like to be able to do is say things like, “This function will only be evaluated once”, and have some evidence to back that up.

So the next step is something that will take a State and schedule all of the actions with appropriate parameters in a Batch, and produce the overall Result. Then something that will do that many many times, accumulating all of the results. And since this isn’t Java, we don’t have reflection techniques, so the State class is going to have to cooperate a bit. The Result typedef is one way. It will also have to produce all of the actions that need to be batched, in some heterogeneous form that I can then run.
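A rough sketch of what that next step could look like, under some loud assumptions. AtomicState, run_trial, tally and tally_demo are hypothetical names, not part of the library. The fields are relaxed std::atomic<int> rather than plain int, because unsynchronized concurrent access to a plain int is a data race in C++. And it starts raw std::thread objects behind a simple flag rather than going through Batch, purely to keep the example self-contained.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <iostream>
#include <map>
#include <thread>
#include <tuple>

// Hypothetical variant of State using relaxed atomics for the fields.
class AtomicState {
    std::atomic<int> x_{0};
    std::atomic<int> y_{0};

  public:
    typedef std::tuple<int, int, int, int> Result;
    void writer1() {
        y_.store(1, std::memory_order_relaxed);
        x_.store(1, std::memory_order_relaxed);
    }
    void reader1(Result& read) {
        std::get<0>(read) = x_.load(std::memory_order_relaxed);
        std::get<1>(read) = y_.load(std::memory_order_relaxed);
    }
    void reader2(Result& read) {
        std::get<2>(read) = x_.load(std::memory_order_relaxed);
        std::get<3>(read) = y_.load(std::memory_order_relaxed);
    }
};

// One trial: a writer and two readers, released together by a shared flag.
inline AtomicState::Result run_trial() {
    AtomicState state;
    AtomicState::Result result{0, 0, 0, 0};
    std::atomic<bool> go{false};
    auto wait = [&go] {
        while (!go.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
    };
    // The two readers write to different elements of result.
    std::thread w([&] { wait(); state.writer1(); });
    std::thread r1([&] { wait(); state.reader1(result); });
    std::thread r2([&] { wait(); state.reader2(result); });
    go.store(true, std::memory_order_release);
    w.join();
    r1.join();
    r2.join();
    return result;
}

// Many trials, counting how often each distinct observation occurred.
inline std::map<AtomicState::Result, std::size_t> tally(std::size_t trials) {
    std::map<AtomicState::Result, std::size_t> counts;
    for (std::size_t n = 0; n < trials; ++n) {
        ++counts[run_trial()];
    }
    return counts;
}

// Usage: print each observed tuple with its count, return the total.
inline std::size_t tally_demo(std::size_t trials) {
    std::size_t total = 0;
    for (auto const& entry : tally(trials)) {
        AtomicState::Result const& r = entry.first;
        // Each slot can only ever hold the initial 0 or the written 1.
        assert(std::get<0>(r) == 0 || std::get<0>(r) == 1);
        std::cout << std::get<0>(r) << ' ' << std::get<1>(r) << ' '
                  << std::get<2>(r) << ' ' << std::get<3>(r) << " : "
                  << entry.second << '\n';
        total += entry.second;
    }
    return total;
}
```

Which tuples show up, and how often, depends on the hardware and the scheduler, so the only thing this sketch guarantees is that the counts add up to the number of trials.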

5 Source Code

Exported from an org-mode doc, which is available, with all of the source, on github at SpinGate.