smp: Add a top-level README.txt, an overview of the SMP branch.

author Matt Birkholz <puck@birchwood-abbey.net>

Sat, 6 Dec 2014 19:45:40 +0000 (12:45 -0700)

committer Matt Birkholz <puck@birchwood-abbey.net>

Sun, 21 Dec 2014 19:19:09 +0000 (12:19 -0700)
diff --git a/README.txt b/README.txt
new file mode 100644
index 0000000..ec6c980
--- /dev/null
+++ b/README.txt
@@ -0,0 +1,143 @@
+-*-Outline-*-
+
+Say you have multiple SMPing cores (e.g. an Intel Core Duo from 2007 or
+better) and operating system support, e.g. POSIX threads, and you want
+to run multiple Scheme threads on your multiple cores, just like you
+have been running multiple Scheme threads on one.
+
+You might hatch a theory that you can make the Scheme machine's state
+pthread-local, e.g. stick "__thread" in front of "SCHEME_OBJECT*Free".
+Serialize use of the shared parts of the machine, and you are in
+business.  What could go wrong? :-}
+
+Fluid-let was an obstacle, but it was excised from the runtime system.
+The remaining uses can be kludged in an SMP-enabled world, e.g. by
+ensuring the compiler runs in one thread at a time.
+
+Without-interrupts has been used as a way to exclude other threads.
+It produced an atomic section.  In an SMP-enabled world, running with
+timer interrupts masked guarantees only that the current thread will
+not be switched out -- it will keep running on the current processor.
+Threads on other processors will still be running, and may see
+intermediate, inconsistent states (depending on subsystem).  The use of
+each such subsystem must be serialized/atomized somehow.
+
+* Multiple Machines
+
+The current implementation assumes multiple OS threads (pthreads) run
+multiple Scheme machines in the same address space, accessing a shared
+heap.  Each machine has thread-local registers and gets sole use of
+stack and heap segments in the shared address space UNTIL a garbage
+collection is signaled.  During GC, one processor becomes "the GC
+processor".  It interrupts the others, waits for them to stop in a "GC
+wait" state, then performs the garbage collection, tracing each
+processor's stack as well as the usual roots.  It evacuates objects
+from each processor's local heap segment to the shared to_space.  When
+complete, each processor's heap segment is empty and the pointers in
+its stack all point to the new shared heap.
+
+When not running, garbage collecting, or waiting for a GC to complete,
+a processor is "idle" -- blocked waiting for an interrupt.
+Processors become idle when they find no runnable threads and another
+processor is already waiting for I/O.  When runnable threads become
+available, an idle processor is woken with a timer interrupt.  Whether
+waking from idle or switching threads because of a normal timer
+interrupt, each machine takes a Scheme thread off the runnable queue
+and continues it until interrupted, or until the thread is no longer
+runnable (i.e. dead or blocked).
+
+The "I/O waiter" is the "first" processor to become idle.  It blocks
+in select instead of pause.  If there are no idle processors, I/O is
+polled in the thread timer interrupt handler as usual.  Thus, when a
+processor "finishes" a thread (i.e. the thread blocks, suspends or
+dies), it consults the runnable queue.  If the queue is empty, the
+processor attempts to become the "I/O waiter".  If there already is
+one, it goes to idle.
+
+When I/O arrives, the I/O waiter runs with it, but first wakes an idle
+processor (with a timer interrupt) if any.  The awoken likely will
+become the next I/O waiter, though another processor could finish a
+thread or get a timer interrupt and become I/O waiter first.  Ensuring
+there is one I/O waiter simplifies updates to the system's select-
+registry.
+
+All of the logic for scheduling threads and managing processors is
+written in Scheme.  A thread-system mutex keeps all of the relevant
+data structures sane.  A separate mutex, the state_mutex, is used by
+the garbage collector to make processor state changes atomic.  Two
+condition variables, "ready" and "finished", are used with the
+state_mutex to cause the GC processor to wait for the others to get
+ready (export thread-local state) and for the others to wait until the
+GC processor is finished (so they can import their thread-local state
+and continue).
+
+** Creating Processors
+
+The first primitive executed by the main pthread is typically (outside
+of the initial bootstrap) Prim_band_load, which begins with
+smp_gc_start.  Without a fixed objects vector, new processors cannot
+handle the usual global-gc signal, so they start out in the GC-WAIT
+state via the SMP-PAUSE primitive.
+
+To avoid signaling the new processors, the main pthread waits for them
+all to reach the GC-WAIT state and signal "ready".  When ready, the
+main pthread runs the interpreter to bootstrap or load a band.  The
+waiting processors are blocked until the finished condition is
+signaled by the primitive GC daemon.  By that point, a bootstrap or
+band restore will have initialized the interrupt handlers in the fixed
+objects vector.  The non-main processors continue from smp-pause by
+"restarting" smp-idle where they wait for a timer interrupt
+(e.g. when the main thread creates the first non-main thread).
+
+** Synchronizing for GC
+
+A processor aborting for GC locks the state_mutex and test-and-sets
+the gc_processor variable.  If successful, it proceeds to interrupt
+the other processors, whose interrupt handlers cause them to execute
+smp-gc-wait, a primitive that shifts them into GC-WAIT state where
+they block on the finished condition.  gc_processor waits for the
+others to reach the GC-WAIT state by blocking on the ready condition.
+When ready is signaled, the GC processor performs the garbage
+collection and broadcasts the finished condition.
+
+A processor aborting for GC may lock the state_mutex only to find
+another processor has already set the gc_processor variable.  If so,
+it goes straight to the GC-WAIT state and blocks on the finished
+condition, then proceeds with an empty local heap as though it had
+accomplished the garbage collection itself.  If the primitive being
+executed needs exclusive access to reformat the heap (a purification
+or fasdump) it will keep trying to grab the gc_processor variable
+until it succeeds and can re-arrange the heap.
+
+* Runtime Support
+
+The runtime system keeps track of the threads running on each
+processor by putting them in a "current-threads" vector.  Newly
+runnable threads are put on a runnable queue (and an idle processor is
+woken, if any).  The timer interrupt handler should pick up a runnable
+thread and process its continuation.  When a thread is no longer
+runnable and there are no runnable threads in the queue, the processor
+executes either test-select-registry (when it is the io-waiter) or
+smp-idle and waits.
+
+All of the operations that manipulate Scheme threads, including
+current-threads, the runnable queue, timers and i/o channel waits,
+must be serialized.  They must grab the thread_mutex and hold it for
+the duration of the operation.  This mutex is distinct from the
+state_mutex so that it can serialize the thread operations, without
+locking out GCs.
+
+An important Scheme thread operation is the timer interrupt handler.
+It must grab the thread_mutex because other threads may be spawning or
+signaling threads and thus frobbing the runnable queue and whatnot
+too.  An interrupt handler that grabs a mutex, interrupting code that
+may hold that mutex, is courting deadlock, so care must be taken to
+mask the interrupt as long as (longer than!) the mutex is held.
+
+The OS process receives a timer interrupt (SIGALRM) regularly while
+there are threads in the runnable queue, and whenever a thread would
+wake from sleep.  The signal handler might run in any pthread(?), but
+the signal is forwarded to the next processor in a ring of all processors.
+Thus each of two processors is interrupted half as often with the same
+thread timer interval, and each of four processors runs uninterrupted
+for four times the interval.