Until recently, CPUs have been getting faster by increasing the megahertz of a CPU and using clever techniques such as pipelines, branch prediction and out-of-order execution. Recently the emphasis has changed to focus on parallelism - putting more cores into a CPU:
What this means is that programmers will have to learn how to do parallel processing. Using multi-threading in conventional languages works if you don't mind race conditions and deadlocks. Java has good support for threading, and Squeak/Smalltalk supports it but currently doesn't take advantage of the extra hardware available.
Erlang apparently supports trivial multithreading. I'm not sure about Mozart/OZ.
Prolog, or a better logic-based language, should be parallisable. Prolog is essentially a search algorithm for a solution over logical clauses. Search algorithms are usually easily parallised.
Squeak, and Smalltalk in general, has quite good functionality in terms of multithreading, but unfortunately programmers often don't make use of it. As a result, Squeak feels very single-threaded. Running a command typically freezes up the entire UI.
There are a few patterns that can be used in Smalltalk to make use of parallel processing.
One idea is to add a "percentageComplete" variable to a thread. When a thread is assigned to a particular task, it is assigned the name of that task and updates "percentageComplete" to reflect the progress being made. The user interface can then show a progress bar for any thread that it is waiting for.
VM optimisation idea: making a new process could be made effectively as expensive as a block invocation. When "[...] fork" is encountered, the VM doesn't actually create a new process but just flags the current call stack as having forked, and continues with execution in the forked block... if the forked block returns to a flagged thread, then execution continues with the parent process. When a context switch occurs, the VM will check if the current context is flagged for forking and if so, then actually makes the new process then. One possible bug is if a fork happens on an already flagged context - perhaps the flag needs to be an integer that gets incremented with each fork and decremented down to zero during the context switch? This optimisation would mean that forked processes would have very little overhead if it returns before the next context switch.
Parallel programming is often considered as being difficult. One way of minimising bugs in parallel applications is to encapsulate and abstract parallel behaviour to contain bugs. A well-tested parallel abstraction will have less bugs than custom written code.
Often it is easiest just to write a parallel algorithm using forks and semaphores. I personally don't consider forking processes and synchronising on semaphores too low level to write parallel applications.
Semaphore can be used in many ways. A Mutex is a type of Semaphore that is already signalled; it can be used to protect critical sections of code. A Multex has been signalled n times, it can be used to make sure no more than n processes execute a critical section of code at the same time. Semaphores can be used to join processes .
Obviously, two unrelated sections of code that never share objects could run in parallel. A good example of this is a user interface that lets applications run in their own threads rather than block if one application blocks. Another example is initialisation of an application; two unrelated parts of an application could initialise at the same time if they share no state.
Many of the methods on collections could be parallellised. For example, "someCollection do: [:each ...]" could easily fork off a thread for each element in that collection. There is a lot of scope for parallelisation here.
Using this abstraction, a parallel collection library could be built up. Compare Java's java.util.concurrent package.
Rather than include a mutex with every object you intend to protect in your code, it is less effort to have a library of thread-safe classes. A protected integer, for example, would be very useful if it had such methods as "increment" or "waitUntilGreaterThan:". A protected data type would contain its own locks so that it's user doesn't have to have a separate lock for protecting it.
A library could be written to manage "jobs" and job queues. Jobs can be submitted to a job queue, which are then processed by a limited set of threads. This could implement, e.g. scatter/gather type jobs.
There are different types of job queues that could be made - compare the queuing features in Java's java.util.concurrent package.
A future allows a program to continue executing while a variable's value is being calculated. They're described above somewhere.
A barrier could be turned into a reusable object. A barrier makes sure that all threads have reached a particular rendevous point before continuing.
Various types of queues could be used. Synchronised queues are used in concurrent program for applications such as producer/consumer problems.
A streaming architecture could be made, where the program represents a production line. A bunch of Stream objects are plugged into each other, plugging outputs from some Stream objects into inputs in others. Buffers may or may not be used depending on whether they improve or inhibit performance. Each Stream object does some processing on the elements it receives, and then passes the output to the next Stream object.
There are many ways of parallelising this:
This abstraction allows for use of SIMD-like parallelism. If using a FIFO queue, objects could be queued until there are enough objects to run a parallel instruction on all of them in one go. One example would be 3-D graphics transformations, multimedia instructions, or hand-optimised primitives.
Objects could also be passed between streams in efficient bulk operations.
Parallel algorithms can get confusing. Here are some tips which can make parallel code easier to write:
Reasons for poor performance in highly parallel environments:
You need to be able to measure these for performance tuning.
How do you measure cache misses?
A VM such as Squeak could be modified to measure the amount of time a thread spends waiting.
A process has several scheduling requirements:
Real-time tasks have an ordering by priority - some I/O activities are more important to service than others.
In a dominioned system, CPU usage is scheduled per dominion which in turn shares its allocation across the processes in itself. CPU usage would also be billed/accounted according to dominion. It should be difficult for one dominion to deny resources to another, and it should be impossible for low-priority dominions to deny resources to higher-priority dominions.
If I were working on a concurrent VM, I'd just barge ahead and parallelise the Squeak VM:
* I'd refactor the VM so that multiple copies of the interpreter state could be held in the heap. I'd design it so that each pthread would run one instance of the standard Squeak interpreter, and so that each interpreter runs multiple green theads as it does now. This makes forking a very cheap operation and allows for having as many green threads as would fit in memory.
** Each interpreter has it's own scheduler.
** When a pthread goes idle, it can grab jobs off a busy scheduler.
* I'd then see what needs fixing in the VM / plug-ins / image to make it work. I'd probably start with the simple approach: an evil but simple global VM lock for memory writing operations, and no changes to GC. Later, we can revisit this to make it scale better.
* I'd then spend the next decade finding and fixing concurrency bugs.
The only operations I can think of off the top of my head that would have concurrency problems are:
* Creating new objects.
* GC.
* Some plug-ins.
If writing an instance variable to an object can be done in a single machine code instruction, that would not need and locking or synchronisation.
Parallel garbage collection would take a long time to implement correctly. I'd leave that and use a global VM lock.
Creating new objects can be made faster by using a separate Eden space for each thread. For simplicity though, a global VM lock would do the same job with less programmer effort.
What I'm actually going to do is:
* Modify the Squeak scheduler to simulate concurrency better and bring up lots of concurrency bugs in the image.
* Spend a decade fixing concurrency bugs in the image.
* Write enough concurrent code to make it worth having a concurrent VM.
* Sell that code, make a lot of money.
* Buy the rights to Gemstone or some other high-performance, concurrent VM.
* Port my code to that instead.
* Make a lot more money.
Currently, there isn't really enough concurrent code to make a parallel VM worthwhile.
http://citeseer.ist.psu.edu/182287.html
Jecel's wiki:
http://www.merlintec.com:8080/software
http://www.merlintec.com:8080/hardware
http://www.merlintec.com:8080/hardware/26
The plurion:
http://www.merlintec.com/merlin6/e_main.html
D. Lea. Concurrent Programming in Java: Design Principles and Patterns. Addison-Wesley, Reading
MA, 1997.
D. C. Schmidt, M. Stal, H. Rohnert, and F. Buschmann. Pattern-Oriented Software Architecture -
Patterns for Concurrent and Networked Objects. Wiley, 2000.
Reading material from Jason Johnson about alternative concurrency architectures:
http://research.microsoft.com/
(Simon-Peton-Jones, Beautiful concurrency)
http://research.microsoft.com/
others, Composable memory transactions)
http://www.drjava.de/e-present
Reich, Escape from multithread hell)
http://Erights.org (various regarding the E language and the futures
technology)
http://armstrongonsoftware
(Joe Armstrong - explanations why shared memory is bad. His blog
contains various examples of how things are done in Erlang in the
Actor model)
http://www.eros-os.org/piperma
Page Information
|
Wiki Information |
Recent PBwiki Blog Posts |