Parallel processing

 

Until recently, CPUs have been getting faster by increasing the megahertz of a CPU and using clever techniques such as pipelines, branch prediction and out-of-order execution. Recently the emphasis has changed to focus on parallelism - putting more cores into a CPU:

 

  • Both Intel and AMD are now selling two- and four-core processors with 8-core processors on the horizon.
  • The XBox 360 has three Power-PC based cores.
  • The Cell processor in the Playstation 3  has a simple Power-PC based core with 6 "SPE"s. Each SPE is a RISC core in its own right with 256KB of memory.
  • Sun's UltraSPARC T1 has 8 cores each running 4 threads each. Many of Sun's processors and computers are parallel.
  • NVIDIA's GeForce 8 Series video cards have 128 "stream processors" that may be programmable for parallel processing.
  • Intel has announced an 80-core CPU... at some stage in the future.
  • Then there's the Kilocore CPU...
  • The J-Machine: http://cva.stanford.edu/projects/j-machine/

 

What this means is that programmers will have to learn how to do parallel processing. Using multi-threading in conventional languages works if you don't mind race conditions and deadlocks. Java has good support for threading, and Squeak/Smalltalk supports it but currently doesn't take advantage of the extra hardware available.

 

Erlang apparently supports trivial multithreading. I'm not sure about Mozart/OZ.

 

Prolog, or a better logic-based language, should be parallisable. Prolog is essentially a search algorithm for a solution over logical clauses. Search algorithms are usually easily parallised.

 

In Smalltalk

 

Squeak, and Smalltalk in general, has quite good functionality in terms of multithreading, but unfortunately programmers often don't make use of it. As a result, Squeak feels very single-threaded. Running a command typically freezes up the entire UI.

 

There are a few patterns that can be used in Smalltalk to make use of parallel processing.

 

  • Any block can be forked: [ anObject doSomething ] fork. This creates a new thread.
  • Collections can be parallallised. "aCollection do: [ :each | each doSomething ]" could fork off a lot of threads. Ditto for other collection operations.
  • Futures can be used. This is similar to forking. If you don't need the result of something immediately, you could do "result := Future doing: [ anObject doSomething ].". When this code is complete, result contains the result of that block. If you try to use result before the block has finished, it will wait until the block returns.
  • Smalltalk has a notion of updating objects using Object>>addDependent:, Object>>changed and Object>>update (and friends). A small change here could make the VM use more threads; most updates don't need to be strictly in order.
  • The user interface could certainly be more multithreaded. The drawing thread should be high-priority and strictly non-blocking. There could be multiple threads updating widgets. As a result, multiple threads could be updating the screen at any one point.

 

One idea is to add a "percentageComplete" variable to a thread. When a thread is assigned to a particular task, it is assigned the name of that task and updates "percentageComplete" to reflect the progress being made. The user interface can then show a progress bar for any thread that it is waiting for.

 

VM optimisation idea: making a new process could be made effectively as expensive as a block invocation. When "[...] fork" is encountered, the VM doesn't actually create a new process but just flags the current call stack as having forked, and continues with execution in the forked block... if the forked block returns to a flagged thread, then execution continues with the parent process. When a context switch occurs, the VM will check if the current context is flagged for forking and if so, then actually makes the new process then. One possible bug is if a fork happens on an already flagged context - perhaps the flag needs to be an integer that gets incremented with each fork and decremented down to zero during the context switch? This optimisation would mean that forked processes would have very little overhead if it returns before the next context switch.

 

Parallel abstractions

 

 

Parallel programming is often considered as being difficult. One way of minimising bugs in parallel applications is to encapsulate and abstract parallel behaviour to contain bugs. A well-tested parallel abstraction will have less bugs than custom written code.

 

Trivially parallel

 

Often it is easiest just to write a parallel algorithm using forks and semaphores. I personally don't consider forking processes and synchronising on semaphores too low level to write parallel applications.

 

Semaphore can be used in many ways. A Mutex is a type of Semaphore that is already signalled; it can be used to protect critical sections of code. A Multex has been signalled n times, it can be used to make sure no more than n processes execute a critical section of code at the same time. Semaphores can be used to join processes .

 

Unrelatedly parallel / Embarrassingly parallel

 

Obviously, two unrelated sections of code that never share objects could run in parallel. A good example of this is a user interface that lets applications run in their own threads rather than block if one application blocks. Another example is initialisation of an application; two unrelated parts of an application could initialise at the same time if they share no state.

 

Parallel collections

 

Many of the methods on collections could be parallellised. For example, "someCollection do: [:each ...]" could easily fork off a thread for each element in that collection. There is a lot of scope for parallelisation here.

 

Using this abstraction, a parallel collection library could be built up. Compare Java's java.util.concurrent package.

 

Protected data types

 

Rather than include a mutex with every object you intend to protect in your code, it is less effort to have a library of thread-safe classes. A protected integer, for example, would be very useful if it had such methods as "increment" or "waitUntilGreaterThan:". A protected data type would contain its own locks so that it's user doesn't have to have a separate lock for protecting it.

 

Job management

 

A library could be written to manage "jobs" and job queues. Jobs can be submitted to a job queue, which are then processed by a limited set of threads. This could implement, e.g. scatter/gather type jobs.

 

There are different types of job queues that could be made - compare the queuing features in Java's java.util.concurrent package.

 

Futures

 

A future allows a program to continue executing while a variable's value is being calculated. They're described above somewhere.

 

Barrier

 

A barrier could be turned into a reusable object. A barrier makes sure that all threads have reached a particular rendevous point before continuing.

 

Queues

 

Various types of queues could be used. Synchronised queues are used in concurrent program for applications such as producer/consumer problems.

 

 

Data flow algorithms

 

A streaming architecture could be made, where the program represents a production line. A bunch of Stream objects are plugged into each other, plugging outputs from some Stream objects into inputs in others. Buffers may or may not be used depending on whether they improve or inhibit performance. Each Stream object does some processing on the elements it receives, and then passes the output to the next Stream object.

 

There are many ways of parallelising this:

  • Each Stream object could have a single thread or set of worker threads which read a FIFO queue (which elements are deposited into from other Streams), perform processing and send output to other Streams. This limits the parallelism to the number of worker threads, and necessitates a context switch to read elements - if a queue is used, a lot of objects can be processed in the same context switch, but latency would be higher than other implementations.
  • Semaphores could be used for synchronisation. Threads would continue executing though the entire chain of Stream objects, one thread would be forked for each element that passes through the chain. This has lower latency and is likely more efficient. An object could also pass through the entire chain in one context switch.
  • New processes could be forked to handle incoming objects at every point in the chain. With a VM optimisation (where new processes are only created when process scheduling occurs), the cost of forking could be reduced to that of a method invocation.

 

This abstraction allows for use of SIMD-like parallelism. If using a FIFO queue, objects could be queued until there are enough objects to run a parallel instruction on all of them in one go. One example would be 3-D graphics transformations, multimedia instructions, or hand-optimised primitives.

 

Objects could also be passed between streams in efficient bulk operations.

 

Parallel programming techniques

 

Parallel algorithms can get confusing. Here are some tips which can make parallel code easier to write:

  • Wherever possible, encapsulate the parallellism.
  • Keep locked regions of code as small as possible.
  • When suitable, use the fact that writing an object reference is an atomic operation. Rather than holding a lock, make a shallow copy of the object in question, make the changes, and then write the copy back to the original.
  • Learn how to use Semaphores well! They can do a whole bunch of tricks. See the Little Book of Semaphores: http://www.greenteapress.com/semaphores/
  • Keep side-effects minimal. For example, a forked block (aka Process) shouldn't change the state of unrelated objects unless it really has to.
  • Don't require that operations on your API need multiple method invocations (e.g. Canvas>>setColor:, followed by Canvas>>drawLine:).
  • When writing concurrent code, always keep in mind that multiple Processes may enter that method.
  • Aquire locks in a particular order. TODO: Is there a way we can automate this? For example, by assigning each lock a number, and locks need to be aquired from lowest to highest number or the Process will cause an error?
  • Use a code verifier to find concurrency bugs? Such a tool probably needs to first be written.

 

Performance tuning

 

Reasons for poor performance in highly parallel environments:

  • Cache misses.
  • CPU pipeline not well used.
  • Waiting on other threads.
  • Waiting on network.
  • Waiting on disk (swapping).
  • Waiting on anything else.
  • Bad algorithms.
  • Garbage collection overhead.
  • Scheduler overhead or poor scheduler choices.
  • (On architectures with shared CPU components such as the T2000) resource contention, possibly because of poor scheduler choices.
  • (On replicated architectures e.g. DPON) Poor RepAlg object distribution and load balancing.

 

You need to be able to measure these for performance tuning.

 

How do you measure cache misses?

 

A VM such as Squeak could be modified to measure the amount of time a thread spends waiting.

 

 

Scheduling

 

A process has several scheduling requirements:

  • real_time: It may be real-time or not (e.g. UI tasks, reacting to network). These are usually I/O bound and spend a lot of time waiting.
  • batch: It may be a batch task; temporary CPU starvation is no problem.
  • A process may have a minimum amount of CPU cycles per second that it requires, e.g. a sound or video player. If it cannot achieve this, it cannot run on the current computer.
  • For testing purposes, a scheduler should also support a maximum number of CPU cycles per second to simulate slower computers or high-load machines.

 

Real-time tasks have an ordering by priority - some I/O activities are more important to service than others.

 

In a dominioned system, CPU usage is scheduled per dominion which in turn shares its allocation across the processes in itself. CPU usage would also be billed/accounted according to dominion. It should be difficult for one dominion to deny resources to another, and it should be impossible for low-priority dominions to deny resources to higher-priority dominions.

 

 

Making Squeak multi-threaded

 

If I were working on a concurrent VM, I'd just barge ahead and parallelise the Squeak VM:

 

* I'd refactor the VM so that multiple copies of the interpreter state could be held in the heap. I'd design it so that each pthread would run one instance of the standard Squeak interpreter, and so that each interpreter runs multiple green theads as it does now. This makes forking a very cheap operation and allows for having as many green threads as would fit in memory.

** Each interpreter has it's own scheduler.

** When a pthread goes idle, it can grab jobs off a busy scheduler.

* I'd then see what needs fixing in the VM / plug-ins / image to make it work. I'd probably start with the simple approach: an evil but simple global VM lock for memory writing operations, and no changes to GC. Later, we can revisit this to make it scale better.

* I'd then spend the next decade finding and fixing concurrency bugs.

 

The only operations I can think of off the top of my head that would have concurrency problems are:

* Creating new objects.

* GC.

* Some plug-ins.

 

If writing an instance variable to an object can be done in a single machine code instruction, that would not need and locking or synchronisation.

 

Parallel garbage collection would take a long time to implement correctly. I'd leave that and use a global VM lock.

 

Creating new objects can be made faster by using a separate Eden space for each thread. For simplicity though, a global VM lock would do the same job with less programmer effort.

 

What I'm actually going to do is:

 

* Modify the Squeak scheduler to simulate concurrency better and bring up lots of concurrency bugs in the image.

* Spend a decade fixing concurrency bugs in the image.

* Write enough concurrent code to make it worth having a concurrent VM.

* Sell that code, make a lot of money.

* Buy the rights to Gemstone or some other high-performance, concurrent VM.

* Port my code to that instead.

* Make a lot more money.

 

Currently, there isn't really enough concurrent code to make a parallel VM worthwhile.

 

 

Links

 

http://citeseer.ist.psu.edu/182287.html

 

Jecel's wiki:

http://www.merlintec.com:8080/software

http://www.merlintec.com:8080/hardware

http://www.merlintec.com:8080/hardware/26

 

The plurion:

http://www.merlintec.com/merlin6/e_main.html

 

 

D. Lea. Concurrent Programming in Java: Design Principles and Patterns. Addison-Wesley, Reading

MA, 1997.

 

D. C. Schmidt, M. Stal, H. Rohnert, and F. Buschmann. Pattern-Oriented Software Architecture -

Patterns for Concurrent and Networked Objects. Wiley, 2000.

 

Reading material from Jason Johnson about alternative concurrency architectures:

http://research.microsoft.com/~simonpj/papers/stm/#beautiful

(Simon-Peton-Jones, Beautiful concurrency)

http://research.microsoft.com/~simonpj/papers/stm/#composable (SPJ and

others, Composable memory transactions)

http://www.drjava.de/e-presentation/html-english/img0.html (Stefan

Reich, Escape from multithread hell)

http://Erights.org  (various regarding the E language and the futures

technology)

http://armstrongonsoftware.blogspot.com/2006/09/why-i-dont-like-shared-memory.html

 (Joe Armstrong - explanations why shared memory is bad.  His blog

contains various examples of how things are done in Erlang in the

Actor model)

 

http://www.eros-os.org/pipermail/e-lang/2001-July/005410.html

 


Page Information

  • 1 month ago [history]
  • View page source
  • You're not logged in
  • No tags yet learn more

Wiki Information

Recent PBwiki Blog Posts