A Lesson in Multithreaded Bugs

By John Burger, April 01, 1998

(Printed in Windows Developer’s Journal, which was subsumed into Dr. Dobb’s)

Three years after writing my first multithreaded program, I felt I knew it all — multithreading could pull no more surprises on me. I had debugged so many deadlocks, predicted and circumvented so many race-conditions, and assiduously protected all my global variables, that it was second nature for any new task that came along. Not to mention that I had developed and debugged a library of multithreaded modules that I could call on with full confidence that they would work for me.

Needless to say, I was brought up short when a program that had been in use for over a year, that had been used flawlessly every day by a variety of operators, crashed horribly and immediately when it was put on a new platform: Windows NT running on a computer with two CPUs. It should have worked — multithreading makes it transparent whether there is one or a hundred CPUs running your program — but it didn’t. This article is a public warning to all programmers out there: your multithreaded application has not been tested until it has been tested on a multiprocessor platform.

The Basics

Anyone learning to write a multithreaded program is taught early on that all the threads must cooperate with each other to avoid corrupting shared data. The typical example is a counter that two threads try to manipulate. Suppose that thread A increments a counter, while thread B decrements it. If the two threads don’t cooperate, the following sequence could occur:

1) Thread A reads the global variable that needs to be incremented.

2) Thread A’s time slice expires and the operating system lets Thread B resume execution.

3) Thread B reads the (not-yet-incremented) global variable, decrements it, and writes it back.

4) Thread B’s time slice eventually runs out, and the operating system lets thread A resume execution.

5) Thread A increments the previously read (now incorrect) value, and writes it back to the global variable.

The decrement operation is lost. To work properly, the two threads need to synchronize their access to the shared resource, so that only one thread is manipulating the global variable at a time.
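As a rough illustration, the five steps above can be replayed by hand on a single thread, which makes the outcome deterministic. This is only a sketch: the names `counter`, `reg_a`, `reg_b`, and `replay_lost_update` are invented for the example, and each "thread" is modeled as nothing more than a private copy of the value plus a write-back.

```cpp
// Shared counter, standing in for the article's global variable.
static int counter = 0;

// Replays steps 1-5 on one thread so the result is always the same.
// reg_a and reg_b model the private registers of threads A and B.
int replay_lost_update(int start)
{
    counter = start;

    int reg_a = counter;   // 1) Thread A reads the counter
                           // 2) A is preempted before it can write back
    int reg_b = counter;   // 3) Thread B reads the same, stale value...
    reg_b = reg_b - 1;     //    ...decrements it...
    counter = reg_b;       //    ...and writes it back
                           // 4) B is preempted; A resumes
    reg_a = reg_a + 1;     // 5) A increments the value it read in step 1
    counter = reg_a;       //    and overwrites B's result

    return counter;
}
```

Starting from 10, one increment and one decrement should leave 10; the replay returns 11 because B's write is overwritten.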

Win32 provides a variety of thread synchronization primitives. One of the most general is called a mutex, short for mutual exclusion. This construct prevents more than one thread from claiming the mutex at the same time — the second and subsequent threads trying to claim the mutex are blocked until the first thread releases it again. The idea is to associate a mutex with the counter, and require any thread that wants to operate on the counter to try to claim the mutex first. For example, the following code tries to create (if it doesn’t already exist) and acquire a mutex named “MyGlobalVariable”:

HANDLE Mutex = CreateMutex(NULL, FALSE, "MyGlobalVariable");
if (Mutex != NULL && WaitForSingleObject(Mutex, INFINITE) == WAIT_OBJECT_0)
    {
    // this thread now "owns" the mutex
    // OK to operate on shared data associated with mutex
    // ...
    ReleaseMutex(Mutex);
    }
else
    printf("Bad problem: could not create or acquire the mutex!\n");

If another thread had already claimed the mutex named “MyGlobalVariable”, then the call to WaitForSingleObject() would not return until that other thread called ReleaseMutex() to relinquish the mutex. (CreateMutex() itself never blocks: if the named mutex already exists, it simply returns a handle to it, and any request for initial ownership is ignored.) Obviously, you need to remember to call ReleaseMutex() whenever you’re no longer operating on the shared data. The C++ destructor mechanism is a good way of not forgetting this very important point.
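The destructor idea might look something like the sketch below. To keep it compilable outside Windows, `std::mutex` stands in for the Win32 mutex; `ScopedLock`, `counter_mutex`, `shared_counter`, `adjust_counter`, and `run_balanced_threads` are hypothetical names made up for the illustration. The standard library's `std::lock_guard` does the same job as the hand-written class.

```cpp
#include <mutex>
#include <thread>

// Claims the mutex on construction, releases it on destruction.
class ScopedLock
{
public:
    explicit ScopedLock(std::mutex& m) : mutex_(m) { mutex_.lock(); }
    ~ScopedLock() { mutex_.unlock(); }   // runs on every exit path

    ScopedLock(const ScopedLock&) = delete;
    ScopedLock& operator=(const ScopedLock&) = delete;

private:
    std::mutex& mutex_;
};

static std::mutex counter_mutex;   // protects shared_counter
static int shared_counter = 0;

// Adds delta to the shared counter, one locked step at a time.
void adjust_counter(int delta, int repetitions)
{
    for (int i = 0; i < repetitions; ++i)
    {
        ScopedLock lock(counter_mutex);   // claim the mutex
        shared_counter += delta;
    }                                     // released at the end of each pass
}

// Thread A increments while thread B decrements, as in the example above.
int run_balanced_threads(int repetitions)
{
    shared_counter = 0;
    std::thread a(adjust_counter, +1, repetitions);
    std::thread b(adjust_counter, -1, repetitions);
    a.join();
    b.join();
    return shared_counter;
}
```

Because the release lives in the destructor, an early return or an exception inside the loop body cannot leave the mutex claimed.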

A Win32 named mutex is global to all processes, so it will work correctly even if the calling threads are in separate processes. That’s important if the shared resource you are protecting might be accessed by separate processes (e.g., a block of shared memory), but most situations that require synchronization involve resources that are only visible to the current process, such as global variables or data structures. In that case, you can use a much more efficient synchronization method called a critical section.

A critical section is basically like a mutex. Like a mutex, a critical section can only be owned by one thread at a time, and any other threads attempting to acquire the critical section will be blocked until the thread that currently owns it gives it up. Unlike a mutex, a critical section works correctly only with threads within the process that created it. In return for this limitation, the CPU time required to acquire a critical section that is already free is a fraction of the time required to acquire a named mutex. (EnterCriticalSection() has about the same overhead as executing two function calls.)

critsect.hpp (Listing 1) contains my C++ example of a Win32 critical section, and test1.cpp (Listing 2) shows how to use it. Although it works fine for critical sections that are dynamically allocated, it does not attempt to solve the problems associated with declaring a critical section as a global variable. C++ does not guarantee the order that static objects in separate modules will get constructed, so watch out for situations in which a static constructor for an object in one module assumes it can lock a global critical section declared in another module (which may not have been initialized yet).
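One widely used workaround for the construction-order problem, which the listings do not attempt, is to hide the lock inside a function so that it is built the first time anyone asks for it. The sketch below assumes C++11 or later (where initialization of a function-local static is thread-safe) and uses `std::mutex` in place of a critical section; `global_lock`, `EarlyUser`, and `early_user_locked` are hypothetical names.

```cpp
#include <mutex>

// The lock lives inside the function, so it is constructed on first use
// rather than at some unspecified point during static initialization.
std::mutex& global_lock()
{
    static std::mutex instance;
    return instance;
}

// Stands in for a static object in another module whose constructor
// needs the global lock.
struct EarlyUser
{
    bool locked_ok;
    EarlyUser() : locked_ok(false)
    {
        std::lock_guard<std::mutex> guard(global_lock());
        locked_ok = true;
    }
};

static EarlyUser early_user;   // works whatever the construction order

bool early_user_locked() { return early_user.locked_ok; }
```

The same caution applies in reverse at shutdown: a static destructor that runs late should not assume the lock is still alive.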


Although it is necessary to protect global variables, every use of a synchronization mechanism effectively disables multithreading around the point of the synchronization. Injudicious use of global variables can bottleneck the program so much that it might as well have been written as a single thread!

Savvy programmers quickly learn that in many cases synchronization is only needed where the object being protected is being modified. A thread that only wants to read the status of the object doesn’t have to claim the lock — it can just “sneak a peek.” This only works if the object being examined is not a large structure with the examiner needing a coherent view of the whole thing. In the case of the counter, it’s a simple type, and if a thread wants to check if it’s zero it can do so with one test. If another thread had just incremented it, or a split second later was about to decrement it to zero, then that doesn’t matter — at the instant of the examination, the value is or is not zero.

This search for efficiency proved to be my undoing. Early in the development of my multithreaded library, I proved to myself that the compiler I was using incremented and decremented counters with a single assembler instruction. In other words, the generated code did not look like this:

mov ax, GlobalVar
inc ax
mov GlobalVar, ax

but like this:

inc GlobalVar

The 80x86 CPU will not interrupt itself in the middle of a single instruction except for special cases such as REP-prefixed instructions. That implied that my code to increment or decrement the global variable could not be interrupted, and therefore could be done safely by any thread without requiring explicit synchronization. This meant that I didn’t have to protect the global variable! And sure enough, it worked fine.

Multiple CPUs

It turns out it wasn’t fine. When there is more than one CPU, then the above observation is not true. While CPU A will not be interrupted while performing a single INC instruction, CPU B could be accessing the same memory while CPU A is in the middle of completing its INC. Both CPUs can read the value. Both CPUs perform their operation on it, and then they fight with each other when they try to write it back. One CPU must win, and the other CPU’s operation is lost. If I had synchronized the counter as I should have done, it would have worked fine.
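The effect can be modeled in portable C++ without touching assembler. In this sketch (all names are invented for the illustration), `lossy_increment` deliberately splits the update into a separate load and store, which is roughly what two CPUs do to each other with an unlocked INC, while `atomic_increment` performs the update as one indivisible step.

```cpp
#include <atomic>
#include <functional>
#include <thread>

// Separate load and store: another thread can slip in between the two.
void lossy_increment(std::atomic<int>& c, int n)
{
    for (int i = 0; i < n; ++i)
        c.store(c.load() + 1);
}

// One indivisible read-modify-write per iteration.
void atomic_increment(std::atomic<int>& c, int n)
{
    for (int i = 0; i < n; ++i)
        c.fetch_add(1);
}

// Runs the same worker on two threads and returns the final count.
int run_two_threads(void (*work)(std::atomic<int>&, int), int n)
{
    std::atomic<int> counter(0);
    std::thread a(work, std::ref(counter), n);
    std::thread b(work, std::ref(counter), n);
    a.join();
    b.join();
    return counter.load();
}
```

With `atomic_increment` the result is always 2n. With `lossy_increment` it can never exceed 2n, and on a multiprocessor machine it usually falls short; how far short varies from run to run.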

But I still didn’t like the idea of using a critical section for every counter that needed protecting. Using a critical section is still expensive, in terms of memory to store it and CPU time to claim and release it. All I wanted to do was increment a memory variable. I didn’t like having the CPU perform hundreds of instructions, not to mention two system calls, plus potentially block the thread, all for one instruction that would be over within around 100 nanoseconds!

From my assembler work, I knew that the Intel x86 range of CPUs has an instruction prefix called LOCK, intended specifically for multiprocessor work. In my ten years of assembler programming I had never had reason to use it, but I looked at it again and decided it was perfect. I would put LOCK before any single read-modify-write instruction that I didn’t want interrupted by another processor. This is a form of hardware synchronization that would only block another CPU for the duration of the access, and it added only one byte to the program!
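For readers whose compiler does accept the prefix, a locked increment might be sketched as follows. This assumes GCC or Clang with their extended inline-assembly syntax; on a non-x86 target the sketch falls back to the `__atomic_fetch_add` builtin that those same compilers provide. `locked_increment` and `count_with_two_threads` are hypothetical names.

```cpp
#include <thread>

// Increments *p as one indivisible read-modify-write.
inline void locked_increment(int* p)
{
#if defined(__i386__) || defined(__x86_64__)
    // The lock prefix makes the following incl atomic with respect
    // to other CPUs for the duration of the memory access.
    __asm__ __volatile__("lock; incl %0" : "+m"(*p) : : "memory", "cc");
#else
    // Not x86: let the compiler pick the target's atomic instruction.
    __atomic_fetch_add(p, 1, __ATOMIC_SEQ_CST);
#endif
}

// Two threads hammer the same counter; nothing should be lost.
int count_with_two_threads(int n)
{
    int counter = 0;
    auto work = [&counter, n] {
        for (int i = 0; i < n; ++i)
            locked_increment(&counter);
    };
    std::thread a(work);
    std::thread b(work);
    a.join();
    b.join();
    return counter;   // read only after both threads have finished
}
```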

Unfortunately, I couldn’t find a way of getting my compiler to use the LOCK prefix. The inline assembler didn’t understand it, and it couldn’t insert the opcode (0xF0). Then I realized that this problem was not unique — the Win32 designers would have to have considered this. And sure enough they had, under the unlikely name of interlocked variable access.

InterlockedIncrement(), InterlockedDecrement(), and InterlockedExchange() each operate on a 32-bit integer and guarantee that the operation completes without interference from other processors (so long as the integer is modified only via these functions). There are also two newer functions in this family: InterlockedCompareExchange() and InterlockedExchangeAdd(). Using a debugger, I followed a call to InterlockedIncrement() and found that it consisted of ten instructions that did the job: a bit more complicated than the one-byte prefix I was expecting, but because Win32 also runs on non-Intel platforms, the function had to adhere to a portable API.

The portable API causes an interesting side effect with InterlockedIncrement() and InterlockedDecrement(): each performs the operation, but doesn’t return the value afterwards. What it returns is whether the result of the operation is negative, zero, or positive. But this is fine, since all the counters I use are zero-relative: I increment as necessary, and when I decrement I see if the result is zero.
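A zero-relative counter that exposes only the sign of the result could be sketched like this. `std::atomic` stands in for the Win32 interlocked calls so that the example compiles anywhere, and `SignCounter`, `release`, and `only_last_release_reports_zero` are hypothetical names, not taken from the listings.

```cpp
#include <atomic>

class SignCounter
{
public:
    explicit SignCounter(int initial = 0) : value_(initial) {}

    // Each returns -1, 0, or +1: the sign of the new value, never the value.
    int increment() { return sign(value_.fetch_add(1) + 1); }
    int decrement() { return sign(value_.fetch_sub(1) - 1); }

private:
    static int sign(int v) { return (v > 0) - (v < 0); }
    std::atomic<int> value_;
};

// Zero-relative use: reports whether the caller was the last user.
bool release(SignCounter& users) { return users.decrement() == 0; }

// Two users register, then leave; only the second release sees zero.
bool only_last_release_reports_zero()
{
    SignCounter users;
    users.increment();
    users.increment();
    bool first = release(users);    // 2 -> 1, result is positive
    bool second = release(users);   // 1 -> 0, result is zero
    return !first && second;
}
```

The sign is computed from the value returned by the atomic operation itself, not from a second read of the counter, so another thread cannot change the answer in between.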

interlok.hpp (Listing 3) contains my C++ implementation of a Win32 interlocked variable, and test2.cpp (Listing 4) shows how to use it. The same problems using globally declared objects mentioned previously apply to this class as well. Beware of defining a global variable of type Interlock in one module and accessing it from the constructor of a static object in another module.

One Last Trap

I rewrote my counter code to use the interlocked Win32 functions, and most of my problems went away. But there was still a problem that would cause an occasional crash that I couldn’t hunt down. Aware of my previous problem, I looked again at all of the global variables, to see if there were any that were not protected. Then I found it.

My obsession with efficiency is a direct result of my experience with assembler programming. I believe that an operation should be performed once, and the results of the operation used from then on. I had declared all of my strings in one place, to aid in spell-checking and translation to other languages. In a true Win32 program, I would probably have put them in a string table resource, but this code had to work on Unix platforms so I couldn’t do that. Instead of declaring strings like this:

const char example[] = "Example String";

I had declared them using the standard C++ string class like this:

const string example = "Example String";

This way, whenever I passed the string to a function that expected a string object rather than a char *, the compiler wouldn’t have to go to the trouble of converting the array to a string, using it, and then destroying the temporary only to reconstruct it next time. As an aside, look at all your code that uses string. How often is the same character array converted to a string? I think you’ll be surprised.
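The cost of those repeated conversions is easy to count with an instrumented stand-in. In this sketch `Text` is a hypothetical class that wraps `std::string` only so that conversions from a character array can be tallied; `length_of` and the two counting functions are likewise invented for the illustration.

```cpp
#include <cstddef>
#include <string>

// Stand-in for a string class; counts conversions from char arrays.
struct Text
{
    static int conversions;
    std::string chars;
    Text(const char* s) : chars(s) { ++conversions; }   // implicit conversion
};
int Text::conversions = 0;

// A function that expects a string object, not a char pointer.
std::size_t length_of(const Text& t) { return t.chars.size(); }

const char example_array[] = "Example String";

// Passing the array builds and destroys a temporary Text on every call.
int conversions_with_array(int calls)
{
    Text::conversions = 0;
    for (int i = 0; i < calls; ++i)
        length_of(example_array);
    return Text::conversions;
}

// Declaring the object once pays for the conversion a single time.
int conversions_with_object(int calls)
{
    Text::conversions = 0;
    const Text example_object = "Example String";
    for (int i = 0; i < calls; ++i)
        length_of(example_object);
    return Text::conversions;
}
```

A thousand calls cost a thousand conversions with the array and exactly one with the pre-built object.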

The problem with this technique is that it doesn’t work in a multi-CPU environment. When I looked at those global strings the first time, I thought they wouldn’t be a problem. Not only were they set-and-forget, initialized once at the beginning of the program, they were const so that no one could write to them anyway! Wrong.

The string objects weren’t constant, only the contents were. Every time a string object was passed by value, or copied, the contents were not actually copied. The implementation of the standard C++ string class I was using relies on a reference-counting algorithm so that multiple string objects can point to the same characters until one of them is modified. Rather than making a fresh copy of the characters in the string, a copy operation (such as passing the string object by value) merely requires the string class to increment a reference counter that records how many string objects are currently pointing to that particular string.

Of course, the reference counters used in the standard C++ string class were not accessed via InterlockedIncrement() and company, so two different threads using the same global string would lead directly to problems on a multiprocessor machine. I didn’t write the class, so I couldn’t change the way the counters were incremented. It looked like I would have to protect each and every global string with its own critical section, even when I only wanted to look at it. Instead, I simply wrote my own String class (doesn’t everybody?), and used the appropriate synchronization code.
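What "the appropriate synchronization code" might look like is sketched below: a minimal reference-counted string whose counter is only ever touched atomically. This is not the article's String class; `SharedText` and the helper functions are hypothetical, `std::atomic` stands in for the Win32 interlocked calls, and assignment and modification are left out to keep the sketch short.

```cpp
#include <atomic>
#include <cstring>
#include <thread>
#include <vector>

class SharedText
{
    struct Rep
    {
        std::atomic<int> refs;   // how many SharedText objects share chars
        char* chars;
    };
    Rep* rep_;

public:
    explicit SharedText(const char* s) : rep_(new Rep)
    {
        rep_->refs.store(1);
        rep_->chars = new char[std::strlen(s) + 1];
        std::strcpy(rep_->chars, s);
    }
    // Copying a const object still writes to the shared counter.
    SharedText(const SharedText& other) : rep_(other.rep_)
    {
        rep_->refs.fetch_add(1);             // atomic, so safe on several CPUs
    }
    SharedText& operator=(const SharedText&) = delete;
    ~SharedText()
    {
        if (rep_->refs.fetch_sub(1) == 1)    // this was the last reference
        {
            delete[] rep_->chars;
            delete rep_;
        }
    }
    const char* c_str() const { return rep_->chars; }
    int use_count() const { return rep_->refs.load(); }
};

const SharedText example("Example String");  // the "constant" global

// Each pass bumps the shared count, then drops it again.
void copy_many(int n)
{
    for (int i = 0; i < n; ++i)
    {
        SharedText copy(example);
        (void)copy.c_str();
    }
}

int count_after_concurrent_copies(int threads, int n)
{
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back(copy_many, n);
    for (std::thread& th : pool)
        th.join();
    return example.use_count();
}
```

Once every thread has finished, the count is back to exactly one. With a plain `int` counter, lost updates could leave it too high (a leak) or too low (a premature delete and, eventually, a crash).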


While trying to coerce my C++ compiler to produce the LOCK prefix, I came across another construct that I had never had occasion to use. Declaring a variable as volatile is a signal to the compiler not to “cache” the value in a register in an effort to optimize the code. One place it is used is where the variable represents a piece of hardware that changes according to external influences, and unlike normal memory which keeps its value from second to second, this variable could change every time you looked at it.

Only system programmers or hardware programmers would need to use this keyword, but as more and more mainstream systems become multiprocessing (Unix workstations have had multiple CPUs for years) the volatile keyword could begin to take on a new meaning. If a variable was declared volatile, and then the ++ or -- operator (or other read-modify-write operations) was applied to it, maybe the compiler should protect the variable for the duration of the operation. In the case of Intel platforms, it would just be a matter of prefixing the instruction with a LOCK opcode (assuming the supporting hardware correctly responds to the LOCK signal by locking the associated memory location). On other platforms, it may require the compiler to call an operating system service to do the requested operation. And there may be restrictions that the only post-operation check that was valid was a check for negative, zero, or positive.

Adding this extension to the volatile keyword allows C++ to be more multiprocessor-friendly, and libraries are more likely to be multiprocessor-correct.
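For comparison, later C++ standards (C++11 onward) left volatile alone and added a separate `std::atomic` template, whose ++ and -- operators behave much as proposed here: each is a single indivisible read-modify-write. A minimal sketch, with hypothetical function names:

```cpp
#include <atomic>
#include <thread>

static std::atomic<int> counter(0);

// ++ and -- on std::atomic<int> are atomic read-modify-write operations.
void count_up(int n)   { for (int i = 0; i < n; ++i) ++counter; }
void count_down(int n) { for (int i = 0; i < n; ++i) --counter; }

// One thread increments while the other decrements the same counter.
int run_up_and_down(int n)
{
    counter = 0;
    std::thread a(count_up, n);
    std::thread b(count_down, n);
    a.join();
    b.join();
    return counter;
}
```

On x86 compilers a typical implementation emits a LOCK-prefixed instruction for these operators, though the exact code generated is up to the compiler.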


The final problem I had was specific to C++, but the lesson learned wasn’t. True, nowhere does it say that the standard C++ libraries are multithreaded-safe, but using someone else’s libraries means that you hope that they had already learned what you had only just discovered. If they hadn’t, their library breaks.

In any case, even if your library has been tested against multiple threads, it also needs to be tested against multiple processors. I reiterate: if you haven’t tested your multithreaded application on a multiprocessor system, you haven’t tested your application.

John Burger has a B.A. in Computer Studies and has been a systems programmer for ten years on PCs and embedded systems. He programs in C/C++, assembler, and Pascal, and is currently completing a Masters in Operating System Implementation. You can contact John at