# spingate

## 1 Building a simple spin gate in C++

This is a very simplified latch which allows several threads to block and busy-wait until they are released to begin work in parallel. I’m using this to do some multi-thread stress testing, where I want to maximize the overlapping work in order to check for surprising non-deterministic behavior. There’s no count associated with the gate, nor is it reusable. All we need is one bit of information: whether the gate is open or closed.

The simplest atomic boolean is std::atomic_flag. It’s guaranteed by the standard to be lock-free. Unfortunately, to provide that guarantee it offers no way to do a plain atomic load or store of the flag, just clear(), and test_and_set(), which sets the flag and returns the prior value. That isn’t rich enough for multiple threads to wait on, although it is enough to implement a spinlock.

std::atomic<bool> isn’t guaranteed to be lock-free, but in practice any architecture I’m interested in has lock-free atomics for at least 32-bit aligned types. Older hardware, such as ARMv5, SparcV8, and the 80386, is missing cmpxchg, so loads are generally implemented with a lock in order to maintain the guarantees in case of a simultaneous load and exchange. See, for example, the LLVM Atomic Instructions and Concurrency Guide. Modern ARM, x86, Sparc, and Power chips are fine.

When the spin gate is constructed, we’ll mark the gate as closed. Threads will then wait on the flag, spinning until the gate is opened. For this we use release-acquire ordering between the open and the wait. This ensures that any stores done before the gate is opened will be visible to the waiting threads.

```c++
// spingate.h                                                       -*-C++-*-
#ifndef INCLUDED_SPINGATE
#define INCLUDED_SPINGATE

#include <atomic>
#include <thread>

class SpinGate
{
    std::atomic_bool flag_;

  public:
    SpinGate();
    void wait();
    void open();
};

inline
SpinGate::SpinGate() {
    // Close the gate
    flag_.store(true, std::memory_order_release);
}

inline
void SpinGate::wait() {
    while (flag_.load(std::memory_order_acquire)) {
        ; // spin
    }
}

inline
void SpinGate::open() {
    flag_.store(false, std::memory_order_release);
    std::this_thread::yield();
}

#endif
```

```c++
#include "spingate.h"
// Test that the header is complete by including it
```


Using a SpinGate is fairly straightforward. Create an instance of SpinGate and wait() on it in each of the worker threads. Once all of the threads are created, open the gate to let them run. In this example, I sleep for one second in order to check that none of the worker threads gets past the gate before it is opened.

The synchronization is on the SpinGate’s std::atomic_bool, flag_. The flag_ is set to true in the constructor, with release memory ordering. The function wait() spins, loading the flag_ with acquire memory ordering, until open() is called, which sets the flag_ to false with release semantics. The threads that were spinning may then proceed. The release-acquire ordering ensures that writes made by the thread that sets up the work and calls open() happen-before, and so are visible to, the threads that were spin-waiting.

### 1.1 Update:

I’ve added a yield() call after the open, hinting that the thread opening the gate may be rescheduled, allowing the other threads to run. On the system I’m testing on it doesn’t seem to make a difference, but it does seem like the right thing to do, and it does not hurt.

```c++
#include "spingate.h"

#include <vector>
#include <thread>
#include <chrono>
#include <iostream>

int main()
{
    SpinGate gate;
    using time_point = std::chrono::time_point<std::chrono::high_resolution_clock>;
    time_point t1;
    std::size_t const threadCount = std::thread::hardware_concurrency();
    std::vector<time_point> times(threadCount);
    std::vector<std::thread> workers;

    for (std::size_t n = 0; n < threadCount; ++n) {
        workers.emplace_back([&gate, &t1, &times, n]{
            gate.wait();
            time_point t2 = std::chrono::high_resolution_clock::now();
            times[n] = t2;
        });
    }

    std::cout << "Open the gate in 1 second: " << std::endl;
    using namespace std::chrono_literals;
    std::this_thread::sleep_for(1s);
    t1 = std::chrono::high_resolution_clock::now();
    gate.open();

    for (auto& thr : workers) {
        thr.join();
    }

    int threadNum = 0;
    for (auto& time : times) {
        auto diff = std::chrono::duration_cast<std::chrono::nanoseconds>(time - t1);
        std::cout << "Thread " << threadNum++ << " waited " << diff.count() << "ns\n";
    }
}
```


I’d originally had the body of the threads just reporting on std::cout that they were running, with the lack of output before the gate opened, plus the overlapping output afterwards, as evidence of the gate working. That looked like:

```c++
for (std::size_t n = 0; n < std::thread::hardware_concurrency(); ++n) {
    workers.emplace_back([&gate, n]{
        gate.wait();
        std::cout << "Output from gated thread " << n << std::endl;
    });
}
```


The gate is captured in the thread lambda by reference, the thread number by value, and when run, overlapping gibberish is printed to the console as soon as open() is called.

But then I became curious about how long the spin actually lasts. The guarantees for atomics with release-acquire semantics, or even sequentially consistent ones, are only that once a change is visible, the changes before it are also visible. How fast the change becomes visible, and what the costs are of making the happened-before writes available, is really a function of the underlying hardware. I’d already observed better overlapping execution using the gate, as opposed to just letting the threads run, so for my initial purpose of making contention more likely, I was satisfied. Visibility, on my lightly loaded system, seems to be in the range of a few hundred to a couple of thousand nanoseconds, which is fairly good.

Checking how long it took to start let me do two things. First, play with the new-ish chrono library. Second, check that the release-acquire synchronization works the way I expect. The lambdas that the threads run capture the start time by reference. The start time is set just before the gate is opened, and well after the threads have started running. The spin gate’s synchronization ensures that if the state change caused by open() is visible, the setting of the start time is also visible.

Here is one set of results from running spingate:

```
Open the gate in 1 second:
```

## 2 Comparison with Condition Variable gate

The normal way of implementing a gate like this is with a condition variable, an associated mutex, and a plain bool. The mutex guarantees synchronization between the wait and the open, rather than the atomic variable in the SpinGate; the unlock/lock pair on the mutex has release-acquire semantics. The actual lock and unlock are done by the unique_lock guard.

```c++
// cvgate.h                                                           -*-C++-*-
#ifndef INCLUDED_CVGATE
#define INCLUDED_CVGATE

#include <mutex>
#include <condition_variable>
#include <atomic>

class CVGate
{
    std::mutex lock_;
    std::condition_variable cv_;
    bool flag_;

  public:
    CVGate();

    void wait();
    void open();
};

inline
CVGate::CVGate()
: lock_(),
  cv_(),
  flag_(true)
{}

inline
void CVGate::wait() {
    std::unique_lock<std::mutex> lk(lock_);
    cv_.wait(lk, [&](){ return !flag_; });
}

inline
void CVGate::open() {
    std::unique_lock<std::mutex> lk(lock_);
    flag_ = false;
    cv_.notify_all();
}
#endif
```

```c++
#include "cvgate.h"
// Test that the header is complete by including it
```


This has the same interface as SpinGate, and is used exactly the same way.

Running it shows:

```
Open the gate in 1 second:
```

The overhead of the mutex and condition variable is significant. On the other hand, the system load while it waits is much lower: SpinGate will use all available CPU, while CVGate yields, so useful work can be done by the rest of the system.

However, for the use I was originally looking at, releasing threads for maximal overlap, spinning is clearly better. There is much less overlap as the cv-blocked threads are woken up.

## 3 Building and Running

This is a minimal CMake file for building with the system compiler.

```cmake
cmake_minimum_required(VERSION 3.5)
set(CMAKE_LEGACY_CYGWIN_WIN32 0)

project(SpinGate C CXX)

set(CMAKE_EXPORT_COMPILE_COMMANDS ON)

find_package(Threads REQUIRED)

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++14 -ftemplate-backtrace-limit=0")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wall -Wextra -march=native")
set(CMAKE_CXX_FLAGS_DEBUG "-O0 -fno-inline -g3")
set(CMAKE_CXX_FLAGS_RELEASE "-Ofast -g0 -DNDEBUG")

include_directories(${CMAKE_CURRENT_SOURCE_DIR})
add_executable(spingate main.cpp spingate.cpp)
add_executable(cvgate cvmain.cpp cvgate.cpp)
target_link_libraries(spingate Threads::Threads)
target_link_libraries(cvgate Threads::Threads)
```

And here we build a release version of the test executables:

```sh
mkdir -p build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ../
make
```

```
-- Configuring done
-- Generating done
-- Build files have been written to: /home/sdowney/src/spingate/build
[ 50%] Built target cvgate
[100%] Built target spingate
```

## 4 Org-mode source and git repo

Exported from an org-mode doc. All of the source is available on github at SpinGate.

# C++ code in Org-mode

## 1 C++

This is a simple example with one C++ file, all in one src code block.

```c++
#include <iostream>

int main()
{
    std::cout << "Hello World++! 2.9";
}
```

The src block looks like:

```
#+HEADERS: :tangle hello.c++ :exports code :eval never
#+BEGIN_SRC C++
// source code
#+END_SRC
```

The HEADERS block is on a separate line because when the buffer is evaluated, code will get run, and the SRC blocks will get rewritten, as well as the RESULTS blocks. Since we want the headers to be preserved, we can’t make them part of the SRC block.

You can also tangle a Makefile:

```makefile
clean:
	-rm hello

hello: hello.c++
	clang++ hello.c++ -o hello
```

With the org mode headers exporting the code, and not evaluating this block, just like the C++ code. We’ll evaluate the makefile, and run the program, a bit further down.

```
#+HEADERS: :tangle Makefile :exports code :eval never
#+BEGIN_SRC makefile
# Makefile
#+END_SRC
```

Now, we tangle the code out to the files.

```
#+NAME: tangle-buffer
#+HEADERS: :exports both :results value
#+BEGIN_SRC emacs-lisp
(org-babel-tangle)
#+END_SRC
```

That will write out two files when the buffer is evaluated using org-babel-execute-buffer, bound to \C-c \C-v b. Give this block a name, so that where the results go can be controlled. Do that by giving the RESULTS block the name of the SRC block.
Org will then produce a table of the results of executing the elisp, which is the two files produced, and put the results here:

```
hello.c++
Makefile
```

Next, we run make with the target to compile the code. You could also simply write the compiler command here.

```
#+NAME: make-clean-hello
#+BEGIN_SRC sh :exports both :results output
make clean
make hello
#+END_SRC
```

```sh
make clean
make hello
```

And make will run our compilation as spec’d in the Makefile we just tangled out.

```
rm hello
Makefile:2: recipe for target 'clean' failed
clang++ hello.c++ -o hello
```

And now we get the output by running the program.

```
#+NAME: run-hello
#+BEGIN_SRC sh :exports results
./hello
#+END_SRC
```

Which prints out our hello, world text, which has a version number to convince myself it gets updated.

```
Hello World++! 2.9
```

## 2 Raw Document

Org mode in org is a little tricky, since to show the examples, you have to quote them. The original document is available on GitHub.

# Building Emacs 25.1 on Ubuntu 16.10

## 1 Why notes

Making notes so I don’t forget, although the key problem is fixed upstream. Ubuntu 16.10 (Yakkety Yak) has made a critical change to the system compiler: everything is by default built with position independent executable support (PIE), in order to get better support for address space layout randomization. Here are the security notes for PIE. Emacs does not dump successfully with this. The compiler option -no-pie needs to be added to CFLAGS. The option also means that static libraries you’ve built before will probably need to be rebuilt. See the link above for typical errors.

## 2 Getting Ready

First get dpkg-dev, g++, gcc, libc, make:

```sh
sudo apt-get install build-essential
```

Then get the full set of build dependencies for the last emacs, emacs24:

```sh
sudo apt-get build-dep emacs24
```

Decide if you want to build just this version, or track emacs. I track from git, because. So I have a directory for emacs where I have master, emacs25, and build directories.
I try to avoid building in src dirs; it makes it easier to try out different options without polluting the src.

```sh
mkdir -p ~/bld/emacs
cd ~/bld/emacs
git clone git://git.savannah.gnu.org/emacs.git
cd emacs.git
git worktree add ../emacs-25.1 emacs-25.1
cd ..
mkdir bld-25.1
```

## 3 Configure and build with magic option

Now configure in the build directory:

```sh
cd bld-25.1
../emacs-25.1/configure \
    --prefix=~/install/emacs-25.1 \
    --with-x-toolkit=gtk3 \
    --with-xwidgets \
    CFLAGS=-no-pie
```

I built with xwidget support to play with the embedded webkit widget. It’s not really usable as a browser, but it has uses for rendering. I also install into a local program directory under my homedir.

Build and install:

```sh
make
make install
```

I have a bin directory early in $PATH so that I can select versions of local software ahead of system software.

```sh
cd ~/bin
ln -s ~/install/emacs-25.1/bin/emacs
ln -s ~/install/emacs-25.1/bin/emacsclient
```


Now you should have a working emacs 25.1 available.

# Testing post via email

Yet another test post. Testing post via email

# Real World Haskell – Chapter 3

These are the exercises from chapter 3 of Real World Haskell
by Bryan O’Sullivan, Don Stewart, and John Goerzen.

> {-# OPTIONS_GHC -XMagicHash #-}

> module RWHChapter3 where

Some useful things to check my work:

> import Test.QuickCheck
> import Data.List
> import GHC.Prim
> import GHC.Base

1) Write a function that computes the number of elements in a list. To test it, ensure that it gives the same answers as the standard length function.

> lengthList (x:xs) = 1 + lengthList xs
> lengthList [] = 0

Check that it works by running

quickCheck (\s -> length s == lengthList s)

at the interactive prompt

> ghclength l = len l 0#
>     where
>         len :: [a] -> Int# -> Int
>         len [] a# = I# a#
>         len (_:xs) a# = len xs (a# +# 1#)

is the definition used by GHC.
It doesn’t stack overflow on ghclength [1..1000000], while
lengthList [1..1000000] does, at least in ghci.

> lengthList :: [a] -> Integer

3. Write a function that computes the mean of a list, i.e. the sum of all elements in the list divided by its length. (You may need to use the fromIntegral function to convert the length of the list from an integer into a floating-point number.)

> listMean x = sum x / fromIntegral (length x)

4. Turn a list into a palindrome, i.e. it should read the same both backwards and forwards. For example, given the list [1,2,3], your function should return [1,2,3,3,2,1].

> makePalindrome x = x ++ reverse x

5. Write a function that determines whether its input list is a palindrome.

> testPalindrome x = x == reverse x

> testPalindrome2 x = (take (halfLength x) x)
>     == (take (halfLength x) (reverse x))
>     where halfLength x = ((length x) `quot` 2)

reversing the whole list isn’t exactly right

> testPalindrome3 x = (take (halfLength x) x)
>     == reverse (drop (halfLength x) x)
>     where halfLength x = ((length x) `quot` 2)

except now it thinks that odd length lists aren’t palindromes.

6. Create a function that sorts a list of lists based on the length of each sublist. (You may want to look at the sortBy function from the Data.List module.)

> sortByLength = sortBy (\a b -> compare (length a) (length b))

7. An intersperse with type intersperse :: a -> [[a]] -> [a], such that
ghci> myintersperse ',' []
""
ghci> myintersperse ',' ["foo"]
"foo"
ghci> myintersperse ',' ["foo","bar","baz","quux"]
"foo,bar,baz,quux"

> myintersperse :: a -> [[a]] -> [a]
> myintersperse _ [] = []
> myintersperse _ [x] = x
> myintersperse c (x:xs) = x ++ [c] ++ (myintersperse c xs)

8. Height of a tree

> data Tree a = Node a (Tree a) (Tree a)
>             | Empty
>               deriving (Show)

> depth' :: Tree a -> Int
> depth' Empty = 0
> depth' (Node _ b c) = 1 + (max (depth' b) (depth' c))

9. Consider three two-dimensional points a, b, and c. If we look at the angle formed by the line segment from a to b and the line segment from b to c, it either turns left, turns right, or forms a straight line. Define a Direction data type that lets you represent these possibilities.

> data Direction = DLeft
>                | DRight
>                | DStraight
>                  deriving (Show)

> data Point = Point {
> x::Double,
> y::Double
> } deriving (Show)

10. Write a function that takes three 2D Points and returns a direction.

> direction p1 p2 p3
>     | signCross p1 p2 p3 <  0 = DRight
>     | signCross p1 p2 p3 >  0 = DLeft
>     | signCross p1 p2 p3 == 0 = DStraight
>     where signCross (Point x1 y1) (Point x2 y2) (Point x3 y3) =
>               (x2 - x1) * (y3 - y1) - (y2 - y1) * (x3 - x1)

11. Define a function that takes a list of 2D points and computes the direction of each successive triple. Given a list of points [a,b,c,d,e], it should begin by computing the turn made by [a,b,c], then the turn made by [b,c,d], then [c,d,e]. Your function should return a list of Direction.

> directionList :: [Point] -> [Direction]
> directionList ([]) = []
> directionList (_:[]) = []
> directionList (_:_:[]) = []
> directionList (x:x':x'':[]) = [(direction x x' x'')]
> directionList (x:x':x'':xs) = (direction x x' x'') : directionList (x':x'':xs)

I don’t particularly like the pattern match on the three x’s. How
about a more general function to do sliding windows of data across the
list?

> window :: Int -> [a] -> [[a]]
> window2 :: Int -> [a] -> [[a]]

> window n xs = if (length xs) < n
>               then []
>               else (take n xs) : window n (tail xs)

> window2 n xs | (length xs) < n = []
> window2 n x@(_:xs) = (take n x) : window2 n xs

> direction2 (x1:x2:x3:[]) = direction x1 x2 x3

> directionList2' (x:xs) = direction2 x : directionList2' xs
> directionList2' [] = []

> directionList2 x = directionList2' (window2 3 x)

I’m still working on the convex hull problem.

# Testing embedded TeX

If I set this up correctly, I can now embed mathematical formulas in my blog:
$\int_{0}^{1}\frac{x^{4}\left(1-x\right)^{4}}{1+x^{2}}dx=\frac{22}{7}-\pi$

# tabbed working notes

```
 A           D            A           E           A
------10-8---------10-8--------10-8--------10-7--------10-8------
--8------10----8------11---8------10---7------10---8------10---9-
---------10-----------9-----------9-----------9------------------
-----------------------------------------------------------------
-----------------------------------------------------------------
-----------------------------------------------------------------
```