
GIOS Lecture Notes - Part 2 Lesson 2 - Threads and Concurrency


Why threads?

A process, as described in the last lecture, has a single execution context, so it can only run on one CPU at a time. To take advantage of multiple CPUs we need multiple execution contexts; within a single process this is done with threads.

Visual Metaphor for Threads

Again the toy shop comparison; this time a thread is like a worker in the shop.

  • Active
  • Works simultaneously
  • Requires coordination

Process vs Thread

  • A single-threaded process is represented by its address space
    • and by its execution context, both captured in the PCB
  • Multiple threads share the same address space (code, data, files), but each executes different instructions, has its own stack and registers, and may operate on a different part of that address space
    • To support this, a more complex PCB structure is needed: it keeps the shared, process-wide information plus a separate execution context for each thread

Benefits of Multithreading

Why are threads useful?

  • At any given time, there may be multiple threads each running on a different processor
  • One potential application is to have many threads executing the same code over different portions of the input (see the sketch after this list)
    • This parallelization gives faster overall runtime
  • Alternatively, you may have threads execute different portions of the code that correspond to different functions
    • Specialization lets us manage each kind of thread differently (e.g. prioritize the more important tasks)
      • It also helps performance: a specialized thread’s state is more likely to stay hot in the CPU cache
  • Threading has much lighter memory requirements: threads share the address space, so only a per-thread execution context needs to be allocated, whereas multiprocessing duplicates the address-space state for every process
  • Passing data among processes requires IPC across address spaces, which is much more expensive than sharing data within a single address space as threads do
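
As a rough illustration of the parallelization point above (the array, thread count, and function names are mine, not from the lecture), here is a small pthreads sketch in which four threads run the same function over different slices of one shared array:

    /* Hypothetical sketch: same code, different portions of the input. */
    #include <pthread.h>
    #include <stdio.h>

    #define N 400
    #define NTHREADS 4

    static int data[N];               /* shared input, read-only here        */
    static long partial[NTHREADS];    /* one slot per thread, no lock needed */

    static void *sum_slice(void *arg) {
        long id = (long)arg;
        long lo = id * (N / NTHREADS), hi = lo + (N / NTHREADS);
        long s = 0;
        for (long i = lo; i < hi; i++)
            s += data[i];
        partial[id] = s;              /* each thread writes only its own slot */
        return NULL;
    }

    int main(void) {
        pthread_t tid[NTHREADS];
        for (int i = 0; i < N; i++) data[i] = i;

        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, sum_slice, (void *)i);

        long total = 0;
        for (int i = 0; i < NTHREADS; i++) {
            pthread_join(tid[i], NULL);
            total += partial[i];      /* combine per-thread results */
        }
        printf("total = %ld\n", total);
        return 0;
    }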

Benefits of Multithreading: Single CPU

  • Are threads useful when #threads > #cpus?
    • Yes, to take advantage of I/O or other idle time
    • Any time the idle period is longer than twice the context-switch time (one switch away plus one back), it pays to switch to another thread and do useful work (see the example below)
    • Notably, context switches are much cheaper for threads than for processes, since the address space stays the same
    • This is known as “hiding latency”
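
For a rough sense of scale (the numbers are made up for illustration): if a thread context switch costs about 1 microsecond, then any idle period longer than about 2 microseconds, say a disk read that takes milliseconds, is worth hiding, because one switch away plus one switch back still leaves plenty of time during which another thread did useful work.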

Basic Thread Mechanisms

  • Thread data structure
    • Identify threads, keep track of resource usage
  • Mechanisms to create and manage threads
  • Mechanisms to safely coordinate among threads running concurrently in the same address space
    • You can get data races or other issues when threads try to all access the same data concurrently
  • Concurrency control and coordination
    • Mutual exclusion
      • Exclusive access to only one thread at a time
      • mutex - a synchronization mechanism
    • Waiting on other threads
      • specific condition before proceeding
      • Condition variables - a synchronization mechanism
    • Waking up other threads from wait state

Threads and Thread Creation

THIS SECTION BASED ON BIRRELL PAPER, NOT PTHREADS LIBRARY

  • Thread type
    • thread data structure
    • contains all thread info to describe the thread
  • Fork (proc, args)
    • creates a thread that starts executing proc with args
    • not the UNIX fork()
    • we need some mechanism to determine when the forked thread is done and to retrieve its result
  • Join (thread)
    • blocks the calling (parent) thread until the forked thread terminates
    • returns to the parent the result of the child’s execution (a rough pthreads analogue is sketched after this list)
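
Birrell’s primitives map loosely onto pthreads (this mapping is an assumption for illustration, not part of the paper): pthread_create plays the role of Fork(proc, args) and pthread_join the role of Join(thread), including handing the child’s result back to the parent. A minimal sketch:

    #include <pthread.h>
    #include <stdio.h>

    static void *square(void *arg) {
        long x = (long)arg;
        return (void *)(x * x);       /* result handed back through join */
    }

    int main(void) {
        pthread_t t;
        void *result;

        pthread_create(&t, NULL, square, (void *)7);  /* "Fork": start the thread      */
        pthread_join(t, &result);                     /* "Join": wait and get result   */

        printf("child returned %ld\n", (long)result);
        return 0;
    }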

Mutexes

  • Mutual Exclusion
  • A lock that should be used whenever accessing data or state that’s shared among threads
  • To support this, we need a data structure for a mutex
    • Locked?
    • Owner
    • blocked_threads
  • The portion of code protected by the mutex is called the “critical section” (see the sketch below)
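
A minimal sketch of a critical section, assuming the shared state is just a counter (the variable and thread counts are made up): every access to the counter is wrapped in lock/unlock on the same mutex, so only one thread can increment it at a time.

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;                          /* shared state           */
    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&m);                   /* enter critical section */
            counter++;                                /* protected operation    */
            pthread_mutex_unlock(&m);                 /* leave critical section */
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter);           /* 200000 with the mutex  */
        return 0;
    }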

Producer / Consumer Example

  • What if the processing you wish to perform with mutex needs to occur only under certain conditions?

Condition Variables

  • Condition type
  • Wait(mutex, cond)
    • the mutex is automatically released on wait and re-acquired before Wait returns
  • Signal(cond)
    • notify only one thread waiting on condition
  • Broadcast(cond)
    • notify all waiting threads
    • worth noting that the mutex still has to be re-acquired, so only one woken thread gets it at a time; the use cases for Broadcast() are somewhat limited as a result (a producer/consumer sketch using these primitives follows this list)
  • Condition Variable Data Structure
    • mutex ref
    • waiting threads
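
A small producer/consumer sketch using a mutex plus two condition variables (the buffer size and item counts are arbitrary, and this uses pthreads rather than Birrell’s exact API). Note the while loop around each wait: the condition is re-checked after waking, which also guards against the spurious wake-ups mentioned later.

    #include <pthread.h>
    #include <stdio.h>

    #define CAPACITY 8

    static int buffer[CAPACITY];                          /* shared buffer          */
    static int count = 0;                                 /* items currently in it  */
    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;
    static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;

    static void *producer(void *arg) {
        (void)arg;
        for (int i = 1; i <= 20; i++) {
            pthread_mutex_lock(&m);
            while (count == CAPACITY)                     /* wait until there is room   */
                pthread_cond_wait(&not_full, &m);         /* releases m while waiting   */
            buffer[count++] = i;
            pthread_cond_signal(&not_empty);              /* wake one waiting consumer  */
            pthread_mutex_unlock(&m);
        }
        return NULL;
    }

    static void *consumer(void *arg) {
        (void)arg;
        for (int i = 0; i < 20; i++) {
            pthread_mutex_lock(&m);
            while (count == 0)                            /* condition may not hold yet */
                pthread_cond_wait(&not_empty, &m);
            int item = buffer[--count];
            pthread_cond_signal(&not_full);               /* wake producer if it waits  */
            pthread_mutex_unlock(&m);
            printf("consumed %d\n", item);
        }
        return NULL;
    }

    int main(void) {
        pthread_t p, c;
        pthread_create(&p, NULL, producer, NULL);
        pthread_create(&c, NULL, consumer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }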

Readers/Writer Problem

Common Pitfalls

Avoiding common mistakes

  • Keep track of the mutex/condition variables used with a resource
    • e.g. mutex_type m1; //mutex for file
  • Check that you are always (and correctly) using lock and unlock
    • common mistake is to forget this in one of many places where a given piece of state is accessed
    • e.g. did you forget to lock or unlock somewhere? (compiler extensions or static checkers can help catch this)
  • Use a single mutex to access a single resource
  • Check that you are signaling correct condition
  • Check that you are not using signal when broadcast is needed
    • signal: only 1 thread will proceed, remaining threads will continue to wait, possibly indefinitely
  • Ask yourself: do you need priority guarantees?
    • thread execution order is not controlled by signals to condition variables
  • Spurious wake ups
    • Doesn’t necessarily affect correctness, but will affect performance
    • When we wake threads up knowing they may not be able to proceed
    • Can we unlock the mutex before broadcast/signal?
      • Yes! Sometimes…
      • But you need to be very careful about ordering and atomicity
  • Deadlocks
    • Two or more competing threads are waiting on each other to complete, but none of them ever do
    • How to avoid?
      • Fine-grained locking, e.g. unlocking one resource before locking the next (may not work when both are needed at once)
      • Get all locks upfront, then release them at the end
      • Use one mega lock
        • These options are often too restrictive
      • Maintain lock order!
        • Hierarchical locking; generally better than the solutions above
        • In practice this can get very complicated in larger systems (see the sketch after this list)
    • In summary
      • A cycle in the wait graph is necessary and sufficient for a deadlock to occur
        • Edges go from a thread waiting on a resource to the thread owning that resource
      • What can we do about it?
        • Deadlock prevention
          • Hierarchical locking
          • EXPENSIVE
        • Deadlock detection and recovery
          • Based on graph analysis
          • ROLLBACK - still expensive because you need to maintain enough state to recover
            • Recovery may not always be possible
        • Ostrich Algorithm
          • Head in the sand yo
          • Reboot if it stops working
          • Obviously not a great idea
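
A minimal sketch of lock ordering (the two balances and the transfer function are invented for illustration): both threads need both mutexes, but because every code path acquires m_a before m_b, a wait cycle cannot form.

    #include <pthread.h>
    #include <stdio.h>

    static long balance_a = 100, balance_b = 100;
    static pthread_mutex_t m_a = PTHREAD_MUTEX_INITIALIZER;  /* always locked first  */
    static pthread_mutex_t m_b = PTHREAD_MUTEX_INITIALIZER;  /* always locked second */

    static void transfer(long *from, long *to, long amount) {
        pthread_mutex_lock(&m_a);        /* fixed order regardless of direction */
        pthread_mutex_lock(&m_b);
        *from -= amount;
        *to   += amount;
        pthread_mutex_unlock(&m_b);      /* release in reverse order */
        pthread_mutex_unlock(&m_a);
    }

    static void *t1(void *arg) { (void)arg; transfer(&balance_a, &balance_b, 10); return NULL; }
    static void *t2(void *arg) { (void)arg; transfer(&balance_b, &balance_a, 20); return NULL; }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, t1, NULL);
        pthread_create(&b, NULL, t2, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("a=%ld b=%ld\n", balance_a, balance_b);
        return 0;
    }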

Kernel vs User Level Threads

  • Kernel level threads imply that the OS itself is multi-threaded
  • The OS scheduler determines how kernel-level threads get mapped onto CPUs

One to One Model

  • Each user level thread has a kernel level thread associated with it
  • Pros
    • OS sees and understands the threads, synchronization, blocking, etc
    • User level processes directly benefit from threading in kernel
  • Cons
    • Must go to OS for all operations (may be expensive)
    • OS may have limits on policies, thread number, etc
    • Portability (you depend on the kernel-level threads of each particular OS)

Many to One Model

  • All user level threads are mapped onto single kernel level thread
  • At user level there is a thread management library that decides which user level thread will be mapped onto the kernel level thread at any given time
  • Pros
    • Totally portable
    • Doesn’t depend on OS limits and policies
    • Don’t have to make system calls or context switches, less overhead
  • Cons
    • OS has no insights into application needs
    • OS may block entire process if one user-level thread blocks on I/O

Many to Many Model

  • Allows some user-level threads to be mapped many-to-one onto a kernel-level thread, while others are mapped one-to-one
  • Pros
    • Can be best of both worlds
    • Can have bound or unbound threads
  • Cons
    • Requires coordination between user and kernel level thread managers

Scope of Multithreading

  • Different levels at which multithreading is supported
  • System scope
    • system-wide thread management by OS-level thread managers (e.g. CPU scheduler)
  • Process scope
    • User level library manages threads within a single process (a pthreads contention-scope sketch follows this list)
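
For what it’s worth, pthreads exposes this choice through the contention-scope attribute: PTHREAD_SCOPE_SYSTEM requests system scope and PTHREAD_SCOPE_PROCESS requests process scope (Linux, for instance, only supports system scope). A small sketch:

    #include <pthread.h>
    #include <stdio.h>

    static void *run(void *arg) { (void)arg; return NULL; }

    int main(void) {
        pthread_attr_t attr;
        pthread_t t;

        pthread_attr_init(&attr);
        /* Ask for system scope: the kernel scheduler arbitrates among all threads. */
        if (pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM) != 0)
            printf("system scope not supported here\n");

        pthread_create(&t, &attr, run, NULL);
        pthread_join(t, NULL);
        pthread_attr_destroy(&attr);
        return 0;
    }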

Multithreading Patterns

Boss-workers

  • Characterized by one boss thread and multiple worker threads
  • Boss assigns work to workers
  • Worker performs entire task
  • Throughput of the system limited by boss thread. Must keep boss thread efficient
    • Throughput = 1 / (boss time per order) (see the worked example after this list)
  • How does boss pass work?
    • One way is for boss to keep track of which workers are free, and directly signal specific worker
      • Pros
        • Workers don’t need to synchronize
      • Cons
        • Boss must track what each worker is doing
        • Boss must wait on targeted worker
        • Throughput will go down
    • Another option is for boss to assign work by placing work in producer/consumer queue that workers take from
      • Pros
        • Boss doesn’t need to know details about workers
        • Boss doesn’t need to wait on workers
        • Better throughput
      • Cons
        • Queue synchronization
      • This approach tends to be picked
      • If the queue is full, the boss has to wait before adding more work to it; more workers pulling from the queue will empty it faster
      • How many workers is optimal?
        • On demand, dynamically?
        • Statically?
          • More commonly, keep a pool of workers created up front
        • Static vs dynamic
          • Increase the pool size if some conditions are met
  • Pros
    • Simplicity
  • Cons
    • thread pool management
    • ignores locality
      • the boss doesn’t keep track of what each worker has been doing
      • if a worker just finished a specific task, it would likely be more efficient to give it that task again (the state is already loaded), but that potential optimization is left on the table here
  • As an alternative to all workers performing the same tasks, you can have workers specialized for certain tasks
    • The boss now needs to classify each task and assign it to the correct type of worker; this adds overhead at the boss, but may be offset by worker efficiency gains
    • better locality - because a worker does only a subset of tasks, more likely to have the needed state already loaded in cache
    • Quality of Service management - can better prioritize specific tasks
    • Downside is the overhead of more complex load balancing
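
A quick made-up illustration of the throughput bullet above: if the boss spends 1 ms accepting and enqueueing each order, throughput cannot exceed 1 / 1 ms = 1000 orders per second no matter how many workers there are; trimming the boss’s per-order work to 0.5 ms raises that ceiling to 2000 orders per second.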

Pipeline

  • Threads assigned one subtask in the system
  • Entire task == pipeline of threads
  • Multiple tasks concurrently in the system, in different pipeline stages
  • Throughput == weakest link (longest stage in the pipeline)
  • Each pipeline stage has its own thread pool
  • Shared-buffer based communication between stages
  • Pros
    • Allows for highly specialized threads, great locality
  • Cons
    • Balancing and synchronization overheads

Layered

  • Each layer is assigned a group of related subtasks
  • End to end task must pass up and down through all layers
  • Pros
    • Specialization and locality
    • Less fine-grained than pipeline
  • Cons
    • Not suitable for all applications
    • Synchronization

Throughput Quiz

Boss-Worker Formula: time_to_finish_1_order * ceiling(num_orders / num_concurrent_threads)

Pipeline Formula: time_to_finish_first_order + (remaining_orders * time_to_finish_last_stage)
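
A worked example with made-up numbers (not necessarily the quiz values): suppose a single order takes 120 ms end to end, there are 6 workers (or, equivalently, a 6-stage pipeline where each stage takes 20 ms), and 10 orders arrive at once.

  • Boss-worker: 120 ms * ceiling(10 / 6) = 120 ms * 2 = 240 ms
  • Pipeline: 120 ms + (9 * 20 ms) = 120 ms + 180 ms = 300 ms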