Introduction
Most code written today is sequential. What do we mean by sequential or serialized? Simply put, code is executed one instruction after the next in a monolithic fashion, taking no advantage of the many resources potentially available to the program. Overall performance can be severely degraded if the program makes a blocking call.
Why is it that most programs are sequential? One guess is the relative dominance of uniprocessor machines available to programmers. Multithreading a program on a uniprocessor machine in most cases does not yield enough performance gain to merit days, weeks, or months of work retrofitting old code or building a new codebase from scratch with multiple threads. Another guess is that humans think in a sequential manner. Parallelizing our thoughts does not come naturally, nor is it an easy task.
However, with the increasing prevalence of Symmetric Multi-Processing (SMP) machines, thanks to today's multi-core processors, programming multithreaded code is a skill worth learning.
We will dive into the world of threads with a little bit of "theory" first. We will then examine thread synchronization primitives and walk through a tutorial on using POSIX pthreads. Finally, we will finish off with thread performance considerations and a brief overview of multiprocess programming.
What is a thread?
Part I :: Definition
Isn't that something you put through the eye of a sewing needle?
Yes.
How does it relate to programming then?
Think of sewing needles as the CPUs (or LWPs) and the threads in a program as the fiber. If you had two needles but only one thread, it would take longer to finish the job than if you split the thread in two and used both needles at the same time. Taking this analogy a little further, if one needle had to sew on a button (blocking I/O), the other needle could continue doing useful work, even if sewing the button took 4 hours. If you used only one needle, you would be ~4 hours behind!
Now that we have a real-world analogy, let's establish something more concrete. A thread is a sequence of instructions that can be executed in parallel with other threads [wikipedia.com]. Threads are not full-blown processes, but smaller portions of a process running concurrently (or in parallel); hence the term lightweight.
Part II :: Operating System Support
You cannot expect a multithreaded program to run on a kernel that does not support threads. Fortunately most modern Operating Systems support threads, either with their own thread library or through POSIX pthreads. Sun Solaris, FreeBSD, Linux, AIX, HP-UX, IRIX, and Windows NT, just to name a few, support multithreaded programs. However, each Operating System uses a different technique to support threads.
Before we can dive into the details of how threads are supported, we need to get familiarized with a few terms.
- Lightweight Process (LWP) can be thought of as a virtual CPU where the number of LWPs is usually greater than the number of CPUs in the system. Thread libraries communicate with LWPs to schedule threads. LWPs are also sometimes referred to as kernel threads.
- X-to-Y model. The mapping between LWPs and Threads.
- Contention Scope is how threads compete for system resources (e.g. scheduling)
- Bound threads have system-wide contention scope, in other words, these threads contend with other processes on the entire system (and thus are scheduled by the kernel)
- Unbound threads have process contention scope, in other words, these threads are scheduled by the library onto available LWPs
Solaris uses the many-to-many model. All CPUs are mapped to any number of LWPs which are then mapped to any number of threads. The kernel schedules the LWPs for slices of CPU time.
Linux uses the one-to-one model. Each thread is mapped to a single LWP. Why is this approach favored in Linux over the many-to-many model? Linux LWPs are genuinely lightweight, so LWP creation is not as expensive as it is in Solaris. Another bonus of making your program multithreaded rather than multiprocess on Linux is that the 2.4 scheduler gives a one-point boost to "processes" it schedules that are in the same thread family as the currently running process.
Moreover, choosing bound or unbound threads can greatly impact the performance of your multithreaded program. There is no general rule for using one or the other; each scenario demands a thorough analysis to select the right thread type for the job.
Part III :: Other Terms
- Thread-safe means that the program protects shared data, possibly through the use of mutual exclusion
- Reentrant code means code that can safely be executed by more than one thread concurrently
- Async-safe means that a function is reentrant while handling a signal (e.g. can be called from a signal handler)
- Concurrency vs. Parallelism - They are not the same! Parallelism is a subset of Concurrency. Parallelism implies simultaneous running of code (which is impossible on uniprocessor machines) while Concurrency implies that many tasks can run in any order and possibly in parallel.
Part IV :: Threads Rule!
Yes, threads are great... for the right tasks! Don't waste your time multithreading a program that isn't worth multithreading. Sometimes just plain, sequential code can do the job just right.
Thread Design Patterns
Now that we have a little basic background on threads, let's discuss the design patterns for using threads in a way that best suits the task at hand.
Pattern I :: Boss/Worker
One thread dispatches other threads to do useful work; these worker threads are usually part of a worker thread pool. The pool is usually pre-allocated before the boss begins dispatching work: although threads are lightweight, they still incur overhead when they are created. This is the classic model and one of the more popular ones, sketched below.
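A minimal sketch of the idea, assuming a hypothetical fixed-size queue of integer task IDs (the queue, NUM_WORKERS, NUM_TASKS, and the function names are all invented for illustration; the mutex and condition variable used here are covered later in this document):

#include <pthread.h>
#include <stdio.h>

#define NUM_WORKERS 4
#define NUM_TASKS   16

int task_queue[NUM_TASKS];   /* hypothetical fixed-size task queue */
int head = 0, tail = 0;
pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  queue_cond = PTHREAD_COND_INITIALIZER;

void *worker(void *arg)
{
    for (;;) {                               /* workers live forever in the pool */
        pthread_mutex_lock(&queue_lock);
        while (head == tail)                 /* wait for the boss to enqueue work */
            pthread_cond_wait(&queue_cond, &queue_lock);
        int task = task_queue[head++];
        pthread_mutex_unlock(&queue_lock);
        printf("worker %ld processing task %d\n", (long)arg, task);
    }
    return NULL;
}

int main(void)
{
    pthread_t pool[NUM_WORKERS];
    long i;
    int t;

    /* pre-allocate the worker pool before any work is dispatched */
    for (i = 0; i < NUM_WORKERS; i++)
        pthread_create(&pool[i], NULL, worker, (void *)i);

    /* the boss dispatches work by enqueueing task IDs */
    for (t = 0; t < NUM_TASKS; t++) {
        pthread_mutex_lock(&queue_lock);
        task_queue[tail++] = t;
        pthread_cond_signal(&queue_cond);
        pthread_mutex_unlock(&queue_lock);
    }
    pthread_exit(NULL);                      /* exit main but let the workers run on */
}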
Pattern II :: Peer (Workcrew)
The peer model is similar to the boss/worker model, except that once the worker pool has been created, the boss becomes another thread in the thread pool, and is thus a peer to the other threads.
Pattern III :: Pipeline
Similar to how pipelining works in a processor, each thread is part of a long chain in a processing factory. Each thread works on data processed by the previous thread and hands it off to the next thread. You must be careful to equally distribute work and take extra steps to ensure non-blocking behavior in this thread model or you could experience pipeline "stalls."
Protecting Shared Resources
Because most programs are more complex than a matrix-multiplication problem (which can be almost completely parallelized, with near-linear speedup, assuming you have enough processors), we must synchronize our threads and protect globally shared data across multiple threads.
Mutual Exclusion
Mutual exclusion is the method of serializing access to shared resources. You do not want a thread to be modifying a variable that is already in the process of being modified by another thread! Another scenario would be a dirty read where the value is in the process of being updated and another thread reads an old value.
Mutual exclusion (most often provided by a mutex) allows the programmer to "attach" locks to resources. If a thread wishes to modify or read a value from a shared resource, the thread must first gain the lock. Once it has the lock, it may do what it wants with the shared resource, because no other thread should have access to it while the lock is held. Once the thread finishes using the shared resource, it unlocks the mutex, which allows other threads to access the resource. This is referred to as serializing access to the shared resource. You can think of a mutex as a treasure chest, with the resource it protects lying within. Only one person can hold the key to the chest at any time, and that person is therefore the only one allowed to look at or modify the contents of the chest at that time.
The code between the lock and unlock calls to the mutex is referred to as the critical section. Minimizing time spent in the critical section allows for greater concurrency, because it reduces the time other threads must wait to gain the lock. Therefore, it is important for a thread programmer to minimize critical sections.
Problems with Mutexes
An important problem associated with mutexes is the possibility of deadlock: threads stop making progress because each is waiting (or spinning) forever on a lock another thread holds. The simplest deadlock situation: thread 1 locks lock A, thread 2 locks lock B, thread 1 wants lock B, and thread 2 wants lock A. Instant deadlock. You can prevent this from happening by making sure threads acquire locks in an agreed order (lock ordering). Deadlock can also happen if threads do not unlock mutexes properly.
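For example, a minimal sketch of lock ordering (the two mutexes and the function are hypothetical):

#include <pthread.h>

pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

/* Every thread agrees to take lock_a before lock_b, so the
   circular wait that causes deadlock can never form. */
void update_both(void)
{
    pthread_mutex_lock(&lock_a);     /* always first  */
    pthread_mutex_lock(&lock_b);     /* always second */
    /* ... modify data guarded by both locks ... */
    pthread_mutex_unlock(&lock_b);
    pthread_mutex_unlock(&lock_a);
}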
Race conditions occur when multiple threads share data and at least one of the threads accesses the data without going through a defined synchronization mechanism (Nichols 203). This can produce erroneous results, and inconsistently so, which makes race conditions particularly difficult to debug. Library calls outside of your program's control are common culprits. Make sure you take steps within your program to enforce serial access to shared file descriptors and other external resources. On most Solaris man pages, you can find out whether a library call is safe to use in reentrant code: towards the bottom of the man page, you will see Categories of MT Library Calls. MT Safe means that the function can be called concurrently from different threads. MT Hot functions are "fast" MT Safe functions (usually not found on man pages). MT Unsafe means that the function cannot be called concurrently. Alternative means that there are MT Safe equivalents (e.g. gethostbyname() and gethostbyname_r()).
Another problem with mutexes is that contention for a mutex can lead to priority inversion. A higher priority thread can wait behind a lower priority thread if the lower priority thread holds a lock for which the higher priority thread is waiting. This can be eliminated/reduced by limiting the number of shared mutexes between different priority threads.
Thread Synchronization Primitives
Mutexes are one method of synchronizing threads, however, there are many other ways.
Synch Technique I :: Condition Variables
A popular method of synchronizing multiple threads is through the use of condition variables. Condition variables allow threads to synchronize to a value of a shared resource. Condition variables provide a kind of notification system among threads (Nichols 79).
For example, you could have a global counter, and a thread that should activate once the counter reaches a certain limit; that thread waits on the condition variable. Other threads signal the condition variable when you want a thread waiting/sleeping on it to wake up. You can use broadcast instead if you want all threads waiting on the condition variable to wake up. This sounds a bit confusing, but the pthread example below will clarify how condition variables work.
When waiting on a condition variable, the wait should be inside a loop, not a simple if statement, because of spurious wakeups. You are not guaranteed that if a thread wakes up, it is the result of a signal or broadcast call.
Synch Technique II :: Reader/Writer Locks
It is sometimes useful to allow multiple readers (threads) to read a shared resource variable without having to wait for locks if no thread is writing to the resource. With a multiple reader lock this is possible. Readers can enter the lock and do their business while writers wait until there are no readers left in the lock. Then the writer can modify the value while blocking all new readers from entering the lock.
Writer starvation is possible with the many reader/single writer lock technique. A priority system can eliminate writer starvation. For more information on r/w locks, see Nichols pgs. 84-89.
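POSIX pthreads exposes reader/writer locks directly where available; a minimal sketch (shared_value and the two functions are hypothetical):

#include <pthread.h>

pthread_rwlock_t rw = PTHREAD_RWLOCK_INITIALIZER;
int shared_value;                      /* hypothetical shared resource */

int read_value(void)
{
    int v;
    pthread_rwlock_rdlock(&rw);        /* many readers may hold this at once */
    v = shared_value;
    pthread_rwlock_unlock(&rw);
    return v;
}

void write_value(int v)
{
    pthread_rwlock_wrlock(&rw);        /* waits until all readers have left */
    shared_value = v;
    pthread_rwlock_unlock(&rw);
}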
Synch Technique III :: Spinlocks
Spinlocks are less commonly used at user level but are used frequently inside the Linux kernel. A spinlock is essentially a busy-waiting lock: if a thread cannot obtain the lock, it keeps polling until the lock is free. The advantage is that when the lock is about to be released, the waiting thread acquires it without a context switch. The tricky part is that you don't know when that will happen, and long spin times result in poor performance.
Spinlocks should never be used on uniprocessor machines. Why is this?
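POSIX does provide user-level spinlocks; a minimal sketch (the counter and functions are hypothetical):

#include <pthread.h>

pthread_spinlock_t spin;
int counter;                           /* hypothetical shared counter */

void init_spin(void)
{
    pthread_spin_init(&spin, PTHREAD_PROCESS_PRIVATE);
}

void bump(void)
{
    pthread_spin_lock(&spin);          /* busy-waits instead of sleeping */
    counter++;                         /* keep the critical section tiny */
    pthread_spin_unlock(&spin);
}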
Synch Technique IV :: Semaphores
Semaphores are another type of synchronization primitive and come in two flavors: binary and counting. Binary semaphores act much like mutexes, while counting semaphores allow multiple threads to hold the semaphore at once. A counting semaphore can be initialized to any arbitrary value, which should depend on how many units of the particular shared resource are available. Threads can obtain the semaphore simultaneously until the limit is reached; this limit is referred to as the lock depth.
Semaphores are more common in multiprocess programming (i.e. it's usually used as a synch primitive between processes).
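A minimal sketch using POSIX semaphores (these live in semaphore.h, not the pthread API proper; the four "slots" and function names are hypothetical):

#include <semaphore.h>

sem_t slots;

void setup(void)   { sem_init(&slots, 0, 4); }  /* counting semaphore: 4 units; middle 0 = thread-shared */
void acquire(void) { sem_wait(&slots); }        /* blocks once all 4 units are taken */
void release(void) { sem_post(&slots); }        /* returns a unit to the pool */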
POSIX pthreads
Note: It is assumed that you have a good understanding of the C programming language. If you do not, or need to brush up, please review basic C (including pointers and dynamic memory allocation) before continuing.
Now that we have a good foundation of thread concepts, let's talk about a particular thread library, POSIX pthreads. The pthread library can be found on almost any modern OS.
A few preliminary steps you should take before beginning any pthread coding is to:
- Add #include <pthread.h> in your .c or .h header file(s).
- Define the _REENTRANT macro (#define _REENTRANT) somewhere in a common .h or .c file.
- In your Makefile, make sure gcc links against -lpthread.
- Optional: add -D_POSIX_PTHREAD_SEMANTICS to your Makefile (gcc flag) for certain function calls like sigwait().
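Putting those steps together, a typical compile line might look something like this (the file names are placeholders):

gcc -D_REENTRANT -D_POSIX_PTHREAD_SEMANTICS -o myprog myprog.c -lpthread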
Pthread Basics
Now let's begin our journey into pthreads...
A thread is represented by the type pthread_t. Let's begin by examining most of the pthread creation and initialization functions:
int pthread_create(pthread_t *thread, pthread_attr_t *attr,
void *(*start_routine)(void *), void *arg);
int pthread_attr_init(pthread_attr_t *attr);
int pthread_mutex_init(pthread_mutex_t *mutex,
const pthread_mutexattr_t *mutexattr);
int pthread_cond_init(pthread_cond_t *cond,
pthread_condattr_t *cond_attr);
A pthread_create() example:
pthread_create(&pt_worker, &thread_attributes,
thread_function, (void *)thread_args);
The above will create a pthread pt_worker with thread attributes defined in thread_attributes (this argument can be NULL if you want default thread attributes). The thread code is contained in the function thread_function and is passed the arguments stored in thread_args. The thread_function prototype would look like this:
void *thread_function(void *args);
Once the pthread_create() call completes successfully, the new thread is runnable and may begin executing thread_function at any time, possibly before pthread_create() even returns.
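As a minimal, complete sketch (the message string is invented for illustration), here is thread creation paired with pthread_join(), which is covered later:

#include <pthread.h>
#include <stdio.h>

void *thread_function(void *args)
{
    printf("new thread says: %s\n", (char *)args);
    return NULL;
}

int main(void)
{
    pthread_t pt_worker;
    char *thread_args = "hello";

    if (pthread_create(&pt_worker, NULL, thread_function,
                       (void *)thread_args) != 0) {
        fprintf(stderr, "pthread_create failed\n");
        return 1;
    }
    pthread_join(pt_worker, NULL);    /* wait for the thread to finish */
    return 0;
}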
The pthread_XXXX_init() functions initialize thread attributes, mutexes, and condition variables. mutexattr and cond_attr can be NULL if you are using defaults. Mutexes and condition variables can also be initialized to default values using the INITIALIZER macros. For example:
pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t count_cond = PTHREAD_COND_INITIALIZER;
Pthread Mutexes
To lock and unlock mutexes, the following functions are available:
int pthread_mutex_lock(pthread_mutex_t *mutex);
int pthread_mutex_trylock(pthread_mutex_t *mutex);
int pthread_mutex_unlock(pthread_mutex_t *mutex);
pthread_mutex_lock() is a blocking call: if the thread cannot gain the lock, it blocks until the lock is obtained. pthread_mutex_trylock(), by contrast, returns immediately (with EBUSY) if the mutex cannot be locked. To unlock a mutex, simply call pthread_mutex_unlock(). An example of using pthread mutexes:
pthread_mutex_lock(&count_lock);
count++;
pthread_mutex_unlock(&count_lock);
In the above example, we are incrementing the globally shared count variable. The code between the lock and unlock calls is the critical section. Always try to minimize this section!
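Where blocking is undesirable, pthread_mutex_trylock() lets a thread test the lock instead; a small sketch continuing the same example:

if (pthread_mutex_trylock(&count_lock) == 0) {
    count++;                           /* got the lock */
    pthread_mutex_unlock(&count_lock);
} else {
    /* lock was busy (EBUSY): do other useful work and try again later */
}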
Pthread Condition Variables
Here are the pthread condition variable function prototypes:
int pthread_cond_wait(pthread_cond_t *cond,
pthread_mutex_t *mutex);
int pthread_cond_signal(pthread_cond_t *cond);
int pthread_cond_broadcast(pthread_cond_t *cond);
pthread_cond_wait() puts the current thread to sleep; it requires the mutex protecting the shared value being waited on. pthread_cond_signal() wakes one of the possibly many threads sleeping on the condition variable, while pthread_cond_broadcast() wakes all threads waiting on the cond condition variable. Here is an example of using pthread condition variables:
pthread_mutex_lock(&count_lock);
while (count < MAX_COUNT) {
pthread_cond_wait(&count_cond, &count_lock);
}
pthread_mutex_unlock(&count_lock);
We lock the count_lock mutex so we can read the value of count without entering a potential race condition. Notice how we use a while loop instead of an if statement: this is because of the spurious wakeups problem mentioned previously. Just because a thread has been woken does not mean it was due to a pthread_cond_signal() or pthread_cond_broadcast() call. pthread_cond_wait() takes both the mutex and the condition variable. Why does it require the mutex as well? Because it must unlock the mutex when putting the thread to sleep, or you could potentially end up in a deadlock! When awoken, pthread_cond_wait() automatically tries to reacquire the mutex and blocks until it can. Using the signal and broadcast functions is self-explanatory. It is recommended that you release any locks that other threads could be waiting on before you signal or broadcast.
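For completeness, here is what the signaling side of the example above might look like, following that recommendation (count, MAX_COUNT, count_lock, and count_cond are from the example above):

pthread_mutex_lock(&count_lock);
count++;
int reached = (count >= MAX_COUNT);
pthread_mutex_unlock(&count_lock);      /* release the lock before signaling */
if (reached)
    pthread_cond_signal(&count_cond);   /* wake a thread sleeping in pthread_cond_wait() */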
Miscellaneous
Here are some suggestions and issues you should consider when using pthreads:
- One thing we have been disregarding in the above discussions is return values. You must check all return values! No exceptions!
- Sometimes it is desirable for a thread not to terminate (as in the case of the worker thread pool). This can be solved by placing the thread code in an infinite loop and using condition variables.
- Many of the pthread types can be freed using the pthread_XXXX_destroy() calls.
- pthread_join() can be used to wait for other threads to finish (if the JOINABLE attribute is set). This is useful for creating barriers or other synchronization windows/points; see the sketch after this list.
- To set thread attributes, use the pthread_attr_setXXXX() functions. scope, schedpolicy, and detachstate are only some of the useful attributes you can set on your threads.
- pthread_kill() can be used to deliver signals to specific threads.
- pthread_self() returns a handle on the calling thread.
- pthread_once() can be used to ensure that an initializing function within a thread is run only once.
- There are many, many more useful functions in the pthread library. Consult your man pages or the Nichols text (Appendix C).
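A minimal sketch of the pthread_join() barrier idea mentioned above (NUM_WORKERS and the workers array are hypothetical):

pthread_t workers[NUM_WORKERS];
int i;
/* ... pthread_create() each worker (joinable by default) ... */
for (i = 0; i < NUM_WORKERS; i++)
    pthread_join(workers[i], NULL);   /* proceed only after every worker exits */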
Example?
Try google or either of the recommended books below :) Better yet, get your feet wet! Try coding a multithreaded program.
Performance Considerations
The performance gains from using threads can be great, but can they be even better? You should consider the following when analyzing your program for potential bottlenecks:
- Lock granularity - How "big" (coarse) or "small" (fine) are your mutex locks? Do they lock your whole structure or fields of a structure? The more fine-grained you make your locks, the more concurrency you can gain, but at the cost of more overhead and potential deadlocks.
- Lock ordering - Make sure your locks are always locked in an agreed order (if they are not, make sure you take steps to rectify situations where locks are obtained in an out-of-order fashion, e.g. by using trylock/unlock calls).
- Lock frequency - Are you locking too often? Locking at unnecessary times? Reduce such occurrences to fully exploit concurrency and reduce synchronization overhead.
- Critical sections - This has been mentioned before (twice), but you should take extra steps to minimize critical sections which can be potentially large bottlenecks.
- Worker thread pool - If you are using a Boss/Worker thread model, make sure you pre-allocate your threads instead of creating them on demand. It doesn't matter to the user how long your server took to initialize; it only matters how fast it processes his or her request!
- Contention scope - Do your threads perform better when they are in contention with all of the system's processes? Or do they perform better when individually scheduled by the thread library itself? Only experimentation can give you the answers.
- Scheduling class - We have not touched on this topic, but changing the thread scheduling class from FIFO to RR can give better response times. But is this what you really want? Refer to the Nichols or Lewis books for more information on thread scheduling classes.
- Too many threads? - At what point are there too many threads? Can too many severely impact and degrade performance? Again, only experimentation will give you the real answers.
Multiprocess Programming
We have explored the very basics of multithreaded programming. What about multiprocess programming? For example, Apache for Linux is a multiprocess program (however, Apache for Windows is multithreaded). How are you supposed to synchronize between processes?
These topics are beyond the scope of this document, but to perform cross-process synchronization, one would use some form of IPC: pipes, semaphores, message queues, or shared memory. Of all of the forms of IPC, shared memory is the fastest (excluding doors). You can use either POSIX or System V semantics when dealing with cross-process resource management, IPC, and synchronization.
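Still, as a taste, here is a hedged sketch combining POSIX shared memory with a process-shared semaphore (the "/demo_shm" name and counter are placeholders; error checking is omitted, and older Linux systems need -lrt at link time):

#include <fcntl.h>
#include <semaphore.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

struct shared { sem_t lock; int counter; };

int main(void)
{
    /* create a named shared memory object, size it, and map it */
    int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, sizeof(struct shared));
    struct shared *s = mmap(NULL, sizeof(struct shared),
                            PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    sem_init(&s->lock, 1, 1);     /* middle argument 1 = shared across processes */

    if (fork() == 0) {            /* child and parent both update the counter */
        sem_wait(&s->lock);
        s->counter++;
        sem_post(&s->lock);
        _exit(0);
    }
    sem_wait(&s->lock);
    s->counter++;
    sem_post(&s->lock);

    wait(NULL);                   /* reap the child */
    shm_unlink("/demo_shm");      /* clean up the shared memory object */
    return 0;
}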
So you may be thinking: which is the better solution, multithreaded, multiprocess, or mixed multithreaded/multiprocess? Again, it all depends! Pick the right tool for the task!
Resources
It is impossible to cover more than an introduction to threads in this short tutorial and overview. For more in-depth coverage of threads (like thread scheduling classes, thread-specific data (TSD), and thread cancellation) and of pthread programming, I recommend these books:
Lewis, Bill and Daniel J. Berg. Multithreaded Programming with Pthreads. California: Prentice Hall, 1998.
Nichols, Bradford, et al. Pthreads Programming. Beijing: O'Reilly & Associates, Inc., 1998.
GNU Pth (portable threads) is the "next generation" threads library and may be the future of multithreaded event-driven programming.
--
Siva