From: Matt Birkholz
Date: Sat, 6 Dec 2014 19:45:40 +0000 (-0700)
Subject: smp: Add a top-level README.txt, an overview of the SMP branch.
X-Git-Url: https://birchwood-abbey.net/git?a=commitdiff_plain;h=9969ff076e5341187996e509d0254b730df1e956;p=mit-scheme.git

smp: Add a top-level README.txt, an overview of the SMP branch.
---

diff --git a/README.txt b/README.txt
new file mode 100644
index 000000000..ec6c980cc
--- /dev/null
+++ b/README.txt
@@ -0,0 +1,143 @@
+-*-Outline-*-
+
+Say you have multiple SMPing cores (e.g. an Intel Core Duo from 2007 or
+better) and operating system support, e.g. POSIX threads, and you want
+to run multiple Scheme threads on your multiple cores, just like you
+have been running multiple Scheme threads on one.
+
+You might hatch a theory that you can make the Scheme machine's state
+pthread-local, e.g. stick "__thread" in front of "SCHEME_OBJECT * Free".
+Serialize use of the shared parts of the machine, and you are in
+business. What could go wrong? :-}
+
+Fluid-let was an obstacle, but it was excised from the runtime system.
+The remaining uses can be kludged in an SMP-enabled world, e.g. by
+ensuring the compiler runs in only one thread at a time.
+
+Without-interrupts has been used as a way to exclude other threads; it
+produced an atomic section. In an SMP-enabled world, running with
+timer interrupts masked guarantees only that the current thread will
+not be switched out -- it will keep running on the current processor.
+Threads on other processors will still be running, and may see
+intermediate, inconsistent states (depending on the subsystem). The
+use of each such subsystem must be serialized/atomized somehow.
+
+* Multiple Machines
+
+The current implementation assumes multiple OS threads (pthreads) run
+multiple Scheme machines in the same address space, accessing a shared
+heap. Each machine has thread-local registers and gets sole use of
+stack and heap segments in the shared address space UNTIL a garbage
+collection is signaled. During GC, one processor becomes "the GC
+processor". It interrupts the others, waits for them to stop in a "GC
+wait" state, then performs the garbage collection, tracing each
+processor's stack as well as the usual roots. It evacuates objects
+from each processor's local heap segment to the shared to_space. When
+complete, each processor's heap segment is empty and the pointers in
+its stack all point into the new shared heap.
+
+When not garbage collecting (or waiting for a GC to complete) and not
+running a thread, a processor is "idle" -- blocked waiting for an
+interrupt. Processors become idle when they find no runnable threads
+and another processor is already waiting for I/O. When runnable
+threads become available, an idle processor is woken with a timer
+interrupt. Whether waking from idle or switching threads because of a
+normal timer interrupt, each machine takes a Scheme thread off the
+runnable queue and continues it until interrupted, or until the thread
+is no longer runnable (i.e. dead or blocked).
+
+The "I/O waiter" is the "first" processor to become idle. It blocks
+in select instead of pause. If there are no idle processors, I/O is
+polled in the thread timer interrupt handler as usual. Thus, when a
+processor "finishes" a thread (i.e. the thread blocks, suspends or
+dies), it consults the runnable queue. If the queue is empty, the
+processor attempts to become the "I/O waiter". If there already is
+one, it goes idle.
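+
+In rough C, that finish-a-thread decision might look like the sketch
+below. It is illustrative only: these names are invented, and the
+real logic lives in the Scheme runtime and the microcode.
+
+    #include <pthread.h>
+    #include <stdbool.h>
+
+    static pthread_mutex_t thread_mutex = PTHREAD_MUTEX_INITIALIZER;
+    static bool have_io_waiter = false; /* someone blocked in select? */
+
+    extern bool runnable_queue_empty (void); /* assumed elsewhere */
+    extern void run_next_thread (void);
+    extern void wait_in_select (void);       /* block in select(2) */
+    extern void wait_for_interrupt (void);   /* block, e.g. sigsuspend */
+
+    /* Called when the current thread blocks, suspends or dies. */
+    void
+    finish_thread (void)
+    {
+      pthread_mutex_lock (&thread_mutex);
+      if (! runnable_queue_empty ())
+        {
+          pthread_mutex_unlock (&thread_mutex);
+          run_next_thread ();
+        }
+      else if (! have_io_waiter)
+        {
+          have_io_waiter = true; /* become the I/O waiter */
+          pthread_mutex_unlock (&thread_mutex);
+          wait_in_select ();
+        }
+      else
+        {
+          pthread_mutex_unlock (&thread_mutex);
+          wait_for_interrupt (); /* go idle */
+        }
+    }
+
+The point is the order of preference: run a thread if you can, else
+keep exactly one processor in select, else get out of the way.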
+
+When I/O arrives, the I/O waiter runs with it, but first wakes an idle
+processor (with a timer interrupt) if there is one. The awakened
+processor will likely become the next I/O waiter, though another
+processor could finish a thread or get a timer interrupt and become
+the I/O waiter first. Ensuring there is only one I/O waiter
+simplifies updates to the system's select-registry.
+
+All of the logic for scheduling threads and managing processors is
+written in Scheme. A thread-system mutex keeps all of the relevant
+data structures sane. A separate mutex, the state_mutex, is used by
+the garbage collector to make processor state changes atomic. Two
+condition variables, "ready" and "finished", are used with the
+state_mutex to cause the GC processor to wait for the others to get
+ready (export thread-local state) and for the others to wait until the
+GC processor is finished (so they can import their thread-local state
+and continue).
+
+** Creating Processors
+
+The first primitive executed by the main pthread is typically (outside
+of the initial bootstrap) Prim_band_load, which begins with
+smp_gc_start. Without a fixed objects vector, new processors cannot
+handle the usual global-gc signal, so they start out in the GC-WAIT
+state via the SMP-PAUSE primitive.
+
+To avoid signaling the new processors, the main pthread waits for them
+all to reach the GC-WAIT state and signal "ready". When ready, the
+main pthread runs the interpreter to bootstrap or load a band. The
+waiting processors are blocked until the finished condition is
+signaled by the primitive GC daemon. By that point, a bootstrap or
+band restore will have initialized the interrupt handlers in the fixed
+objects vector. The non-main processors continue from smp-pause by
+"restarting" smp-idle, where they wait for a timer interrupt
+(e.g. when the main thread creates the first non-main thread).
+
+** Synchronizing for GC
+
+A processor aborting for GC locks the state_mutex and test-and-sets
+the gc_processor variable. If successful, it proceeds to interrupt
+the other processors, whose interrupt handlers cause them to execute
+smp-gc-wait, a primitive that shifts them into the GC-WAIT state where
+they block on the finished condition. The GC processor waits for the
+others to reach the GC-WAIT state by blocking on the ready condition.
+When ready is signaled, the GC processor performs the garbage
+collection and broadcasts the finished condition.
+
+A processor aborting for GC may lock the state_mutex only to find that
+another processor has already set the gc_processor variable. In that
+case it goes straight to the GC-WAIT state and blocks on the finished
+condition, then proceeds with an empty local heap as though it had
+accomplished the garbage collection itself. If the primitive being
+executed needs exclusive access to reformat the heap (a purification
+or fasdump), it will keep trying to grab the gc_processor variable
+until it succeeds and can rearrange the heap.
+
+* Runtime Support
+
+The runtime system keeps track of the threads running on each
+processor by putting them in a "current-threads" vector. Newly
+runnable threads are put on a runnable queue (and an idle processor is
+woken, if any). The timer interrupt handler should pick up a runnable
+thread and process its continuation. When a thread is no longer
+runnable and there are no runnable threads in the queue, the processor
+executes either test-select-registry (when it is the io-waiter) or
+smp-idle, and waits.
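+
+The rendezvous described under "Synchronizing for GC" above is, at
+bottom, a standard mutex-and-condition-variable pattern. A minimal C
+sketch, with invented names, a fixed processor count, and none of the
+real microcode's details:
+
+    #include <pthread.h>
+
+    #define N_PROCESSORS 4
+
+    static pthread_mutex_t state_mutex = PTHREAD_MUTEX_INITIALIZER;
+    static pthread_cond_t ready = PTHREAD_COND_INITIALIZER;
+    static pthread_cond_t finished = PTHREAD_COND_INITIALIZER;
+    static int gc_processor = -1; /* -1 means no GC in progress */
+    static int n_waiting = 0;     /* processors now in GC-WAIT */
+
+    extern void interrupt_others (int self); /* assumed elsewhere */
+    extern void collect_garbage (void);
+
+    /* Called by a processor whose heap segment has filled up. */
+    void
+    abort_for_gc (int self)
+    {
+      pthread_mutex_lock (&state_mutex);
+      if (gc_processor == -1)
+        {
+          gc_processor = self; /* we won: we are the GC processor */
+          interrupt_others (self);
+          while (n_waiting < (N_PROCESSORS - 1))
+            pthread_cond_wait (&ready, &state_mutex);
+          collect_garbage ();
+          gc_processor = -1;
+          n_waiting = 0;
+          pthread_cond_broadcast (&finished);
+        }
+      else
+        {
+          /* Someone else won: go to GC-WAIT and block on finished. */
+          n_waiting += 1;
+          if (n_waiting == (N_PROCESSORS - 1))
+            pthread_cond_signal (&ready);
+          while (gc_processor != -1)
+            pthread_cond_wait (&finished, &state_mutex);
+        }
+      pthread_mutex_unlock (&state_mutex);
+    }
+
+The gc_processor variable acts as a test-and-set lock on the whole
+collection, while the two conditions let the winner and the losers
+wait for each other without spinning.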
+
+All of the operations that manipulate Scheme threads, including
+current-threads, the runnable queue, timers, and I/O channel waits,
+must be serialized. They must grab the thread_mutex and hold it for
+the duration of the operation. This mutex is distinct from the
+state_mutex so that it can serialize the thread operations without
+locking out GCs.
+
+An important Scheme thread operation is the timer interrupt handler.
+It must grab the thread_mutex because other threads may be spawning or
+signaling threads, and thus frobbing the runnable queue and whatnot,
+too. An interrupt handler that grabs a mutex, interrupting code that
+may hold that mutex, is courting deadlock, so care must be taken to
+mask the interrupt for as long as (longer than!) the mutex is held.
+
+The OS process receives a timer interrupt (SIGALRM) regularly while
+there are threads in the runnable queue, and whenever a thread would
+wake from sleep. The signal handler might run in any pthread(?), but
+the interrupt is forwarded to the next processor in a ring of all
+processors. Thus two processors are each interrupted half as often
+given the same thread timer interval, and four processors each run
+uninterrupted for four times the interval.
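+
+A sketch of that ring, assuming each processor's pthread is recorded
+in a table and SIGUSR1 is the (invented) forwarding signal; the
+actual microcode differs in detail:
+
+    #include <pthread.h>
+    #include <signal.h>
+
+    #define N_PROCESSORS 4
+
+    static pthread_t processors[N_PROCESSORS];
+    static volatile sig_atomic_t next_processor = 0;
+
+    /* Process-wide SIGALRM handler: hand the thread-timer interrupt
+       to one processor and advance around the ring, so that each of
+       N processors sees only every Nth tick. */
+    static void
+    sigalrm_handler (int signo)
+    {
+      int target = next_processor;
+      (void) signo;
+      next_processor = ((target + 1) % N_PROCESSORS);
+      /* pthread_kill, pthread_self and raise are async-signal-safe. */
+      if (pthread_equal (pthread_self (), processors[target]))
+        raise (SIGUSR1); /* we are the target */
+      else
+        pthread_kill (processors[target], SIGUSR1);
+    }
+
+So, e.g., with a 10ms thread timer interval and four processors, each
+processor is interrupted only every 40ms; the per-processor quantum
+grows with the number of processors unless the interval is scaled
+down.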