I've written an article for this before on people.squeakfoundation.org. TODO: find it.
This is a design for a virtual machine that:
This is achieved by:
A pointer within a block is a direct 16-bit pointer. These are fast and the intention is that these local pointers help absorb the overhead generated by the rest of this proposal.
A pointer from a block to an object is another block has three levels of indirection. Firstly, a 16-bit pointer points to a larger "OutPointer" object (e.g. 64 bits?) in the same block. That larger pointer points to an entry in a global object mapping table. That table entry then has the location of the external block and the "InPointer" index of the object in that block.
Blocks contain objects, InPointers and OutPointers. OutPointers point (indirectly) to InPointers in other blocks. InPointers in a block are referred to by index (0, 1, 2... etc) that doesn't change if the local object the InPointer refers to is moved around the block.
An InPointer consists of: [localObject, backRefs]. localObject is a 16-bit direct pointer to the object in this block. "backRefs" is a list of back references to the out pointers in other blocks which refer to this pointer. This list can be implemented in several ways. One way is to have each OutPointer have a nextPointer, so that they form a linked list of all OutPointers pointing to a particular InPointer (needs more thinking - how does the object table fit in?).
An OutPointer consists of a pointer (64 bits?) to an entry into the object mapping table. It may also contain a link to the next OutPointer in a linked list of back references if back references are implemented in this way.
* Including linked lists in the outpointers will cause a lot of CPU cache misses as it jumps around memory. Use a local array instead.
* Rather than using an object lookup table, just a block lookup table is needed. An OutPointer then is a (block#, inPtr#) tuple. An InPointer is then a (localRef, backRefs) tuple with both entries being direct 16-bit pointers. Unfortunately, this limits the number of back references to what can fit in a 64KB block.
A backtracing GC algorithm uses the back references to determine how many references there are to a particular object. When this list is empty, the object can be GCed.
The backtracing GC algorithm would mean that an image could be mostly stored on disk. The GC algorithm does not trace through all of memory, meaning that objects on the disk can remain untouched by a GC.
The backtracing GC algorithm also is very concurrent; a single object can be reclaimed by it's own private thread of execution independently of any other GCs which are also happening.
Each block can be garbage collected, compacted, etc independently of any other block.
Objects can be migrated between blocks to optimize object access. Preferably, most pointers are local 16-bit pointers which are much faster than inter-block pointers.
This type of virtual machine would make it possible to use the SSEs on the cell processor. It would make very good use of multi-cored CPUs.
Custom hardware could be made that executes Smalltalk bytecodes directly. GC and far reference management would then be implemented in bytecodes rather than by hardware. Also, a cell-processor style local memory could be used for each core with a master core doing transfers to/from main memory and disk.
Disadvantages: can't have large (>64KB) arrays.
This could also use a partition on a disk rather than files on a filesystem. Both options should be available. RAID would be used to make an image span multiple disks.
Cores must be able to be added and removed at run-time to facilitate:
Ideally, the word size for inter-block references should be customisable - 16 bit, 32 bit. This allows this VM to run efficiently on an architecture where 64kb is too fine a granularity.
This VM could be designed with an eventual hardware implementation in mind. A standard computer could have Smalltalk co-processors added to it, much like the Cell architecture has a PowerPC core with custom SSE added to it. The main CPU would handle moving data around the co-processors and doing GC and block management. Alternatively, this could also be done by a Smalltalk-based CPU which would just be one of many cores.
Each "core" would have 64KB (or maybe more if multiple blocks need to be loaded?) of local static RAM, and would execute Smalltalk bytecodes. The following would be implemented in hardware:
The following would be implemented using bytecodes because they would be too complex to be implemented directly in hardware:
Example: code in a CompiledMethod wants to write to an object referred to using a OutPointer. The hardware would detect that the object being written to is an OutPointer rather than an actual object and would jump to a special location that has bytecodes for dealing with OutPointers (i.e. cause an "interrupt"). That special code would load the block in the OutPointer into the local memory (maybe unloading other blocks in the process to make room), lock it for writing, write the value, unlock the block for writing and return from this "interrupt".
A message send would require the CompiledMethod to be locally available. This is discussed below.
Clients should be fast booting. Clients should only need minimal hardware resources; provided that they are reactive, all the grunt can sit on the server.
Anything that runs across an administered network should get it's full configuration from that network.
If an OS was going to be made out of this VM:
Booting a server:
If it is a client, then it boots the traditional way by reading a boot block from storage.
On disk, the latest snapshot contains a record of what was last in memory. This is loaded back into memory. It would be possible to boot from an earlier snapshot - on a server, this would involve using management commands and on a client using a pre-boot menu of some sort.
Debugging would be awesome if you could not only go forwards, but also backwards in time. This could be implemented by making snapshots, and recording the exact times when external resources affect the image. By recording the exact time (to the currently executing bytecode) of every external influence on the VM, the VM can be made completely deterministic allowing the exact sequence of events leading up to a bug occuring to be recreated.
This would require recording the exact time to the bytecode and the device I/O of:
Given a snapshot and the above events that occur, the VM can then recreate the exact circumstances under which an error occured. This allows the debugger to go back to a particular snapshot and recreate events up to a certain point in time with the user having full visibility over what happens.
The recorded information should be minimal. If the overhead is not to high, it may even be possible to keep, say, the last few minutes since the last snapshot of this information in memory and dump it to log files when an unhandled exception occurs (allowing the programmer full visiblility over what happened).
An added benefit is that this information acts as a log file which can be replayed when the VM is restored.
The object store sits on top of the block manager. The block manager keeps track of which blocks are stored on which physical disks (or if they are only in memory). The block table stores this information.
Would it be worth making the block manager able to keep "backup blocks" on separate devices, ala software RAID? If a disk is going to be decommissioned or replaced, it should be taken out of the pool meaning all blocks / objects need to be moved off it onto other disks. A disk can be marked as "very reliable" (i.e. RAID) to "unreliable" (i.e. IDE cables made of cellotape) and replication algorithms can be made depending on the reliability of disks. You would also need to specify that partitions are actually on the same disk, meaning that replication isn't useful.
Assuming a block lookup table is used.
The block is locked for writing by this thread.
Each collected OutPointer will mean that the backref list of its target needs to be updated. OutPointers can be collected when no local object refers to them.
If objects referenced to by an InPointer are moved around, the InPointer will need to be updated. Object movement and compaction is done following the standard GC algorithm that has been chosen for this.
InPointers are used as the root set for GC. They can sit in a special header area. OutPointers can be GCed away. When an OutPointer is GCed, the backref in it's corrosponding InPointer needs to be removed.
When an intra-block GC brings the number of back references of an InPointer to zero, that InPointer can be removed. A cycle-detecting algorithm would be needed.
One cycle-detecting algorithm would be to mark one of the backreferences as "primary", such that all primary references in all objects form a non-cyclic tree from the root set to every single object. When the primary reference is removed from an object's backreference list, that object must begin a breadth-first search through its backreferences searching for another primary backreference on another object that can be used to reconstitute the non-cyclic tree from the root set to this object. If the search fails, the object and all objects involved in the search are garbage.
This algorithm could be parallelised; multiple garbage collections could occur together, and the search could be made concurrent. Every object could possibly have a lock put on it; if an object is locked then that object has a search algorithm running on it. That lock could be waited on to retrieve the result of that search. This behaviour may be prone to stalling if a thread becomes stuck or experiences a problem waiting on I/O. Deadlock would occur if there are cycles in the search tree, or if two searches deadlock on each other. To prevent deadlock of two threads, the order of search must be well defined (i.e. the search is done using an ordered list of backreferences, and those lists must not change during the search). To prevent a thread doubling back on itself, it could mark an object with that thread's unique identifier when it searches past it, and then test for that ID on each new object it searches.
For optimization, it is sometimes preferable to move objects between blocks to minimize the number of far references. This is done as follows:
OutPointers in a block can be used by multiple objects in that block. To find if an OutPointer to an object already exists in a block, either the backref list of the target can be searched, or all OutPointers in a block can be searched.
Updating an object pointer needs to be an atomic operation. If the write can be done in a single operation, then no locking is needed. Otherwise the object may need to be locked to write to it. The whole block could be locked instead of just one object; this would serialize all writes to a block.
An optimisation is to make a thread hold a write lock on a block, releasing it only when asked to. When another thread gains the write lock on a thread's current block, that block's thread is blocked until it regains its lock. The owner thread cannot make any new objects meaning that it cannot create any MethodContexts until it regains its lock. This means that a thread might relinquish its lock on its current block at every context switch. An optimisation would be to make the owner of a block make a new block when another thread wants to access its current block.
This makes it effectively impossible for one thread to access another thread's current MethodContext, meaning the debugger would have a difficult job.
Alternatively, blocks can be copy-on-write. A thread will do all its work on a copy of the block. When the thread has finished, the block is "committed" back to be part of the main memory. TODO: think about how this affects concurrent synchronisation.
The copy-on-write method may make it possible to have a transacted VM with rollback possibilities. If blocks are copied on write, then a snapshot could be made by keeping the old versions of the blocks and keeping an old copy of the block table.
Say that to write to a block, a thread obtains a copy of that block for writing to, and when finished the block is inserted back into the image. This means:
The block will need to be locked. New space is allocated. The block can then be unlocked.
Alternatively, blocks can be copy-on-write as described above.
If a block becomes empty, it can be removed.
If a block has been near-empty for a significant amount of time, it may be worthwhile moving the objects out of it and removing the block.
The block's entry in the block table can be nullified, and the space of the entry re-used for a linked list of free table entries.
A block that is nearly full runs the risk of not having enough space for InPointers and OutPointers that need to be created. Moving objects out of the block may use even more space with the extra InPointers and OutPointers that are created.
The best approach may be to cut the block in two. This requires an efficient graph-theory algorithm to find a division between the objects with the least number of connections between the two sets (or at least a good minimal division). One approach is to do profiling to determine which instance variables of a class are used most (storing the information with the class?), and then creating a (depth-first?) tree following the most common links first from that stopping at half or one third of the way.
Another approach is to give each object a counter (in a new block?), then starting with any object, increment all objects it points to by one, and then follow the link to the object with the highest value until half the objects have been reached.
A snapshot can be taken of the running image by making a copy of the block table and making all blocks copy-on-write.
Every block has an age associated with it. One way of doing this is to have blocks in order (somehow) and the oldest blocks (that have not been read or written for a long time) are written to disk. This is only necessary when memory is full.
Occasionally (every 60 seconds?), a snapshot is made where all blocks in memory, including the most recently created ones are written to disk. This is done by marking those blocks as copy-on-write and synchronising the originals to disk in a background thread. Only dirty blocks need to be written; a block is marked as dirty when it is locked for writing.
Because the block table can have 2^64 entries in it, it too will need to be predominantly stored on the disk. The block table will be stored last, so that recovery is possible if the write to disk fails part way through. Blocks on the disk are not overwritten; the disk image is only appended to with newer versions of blocks.
One way of preventing ever increasing disk usage would be to keep only several snapshots, e.g. 1 minute ago, 1 hour, 1 day, 1 week, 1 month. These would roll over - the minutely one gets overwritten every minute. Once an hour, the hourly one is replaced by the minutely one. Every day, the daily one is replaced by the hourly one, etc. The rolling over happens from oldest to youngest so that the latest minutely snapshot doesn't overwrite everything! Snapshots refer to objects in each other, so they must all be present. Replacing one snapshot by another may just be a case of updating some pointer somewhere.
One way of thinking about snapshots is that they are like generational garbage collection. The oldest objects are in the oldest snapshot.
Writing to an object in an old snapshot would involve reading the relevant block from that snapshot into memory, writing to it, and when the snapshotting occurs it now becomes part of the latest snapshot.
Usually in Smalltalk, a message send requires looking up the object's class, search the method dictionary, and then searching method dictionaries of all superclasses. To optimise this slow process, a message dispatch table is used. This table needs to be stored somewhere.
Also, the CompiledMethods that are to be executed also need to be available locally.
One possibility is for each thread to have an auxilliary block that contains read-only replicas of CompiledMethods. This auxilliary block can also be used by the garbage collector for storing state. Potentially this block could also hold the current stack.
Another possibility is to have a special object which contains a CompiledMethod's code, but is read-only and transient, and it is this object which gets copied into a thread's current block. The garbage collector will simply remove this object when GC occurs as it is only a copy of the original CompiledMethod. This special object would contain a single link to the original CompiledMethod where the literals can be found. This copying may have significant overhead.
Another possibility is to have the interpreter be able to simply execute code which is in another block. This would make a hardware implementation need to understand OutReferences and InReferences. In this occasion, a hardware implementation with cores having their own static memory would mean that the block containing the method would need to be copied to the local memory - for multiple methods in different blocks this can be significant overhead. The overhead can be minimised by putting serially executed methods in the same block, and perhaps including a long-lived message dispatch table with that block as well.
This is a prime candidate for implementing in Slang.
A class MemoryBlock could be made with an artificial limit on the values each location can contain. In this way, both 16-bit and 32-bit NearRefs can be played with. The MemoryBlock would be an array of the size of each block that can be written to disk. By making the implementation flexibile (wrt size of NearRefs and block size), some experimentation can take place to see which gets the best performance.
These MemoryBlocks would be written to a real disk, which will show the exact same behaviour as a compiled VM.
Multi-threaded implementations can be simulated using Squeak Processes. These would become pthreads in a compiled version.
The Squeak interpreter as it stands could remain largely the same. The ObjectMemory could be replaced with a BlockMemory. Most methods would stay the same; no special block handling functionality would need to be added at the Interpreter other than "interrupts" when the block is full which causes a call to bytecodes which then handle that case. The BlockMemory objects would also be visible from within the image so that GC can happen in bytecodes (if desired - slow).
Ideally, the Slang compiler could be modified to be able to read and write real objects as well. These "real" objects would be compiled into a bunch of C functions that take a pointer to the memory location of that object as their first argument. These functions would have an implementation of the guessed class of that object (where perhaps profiling information would be useful?), and first tests to see if that class is correct. If that guessed class is wrong, the primitive would fail and the Interpreter would resume by trying to find another implementation of that method. Several tests like this could be chained together if the method is a common one. Alternatively, the compiled VM could keep a special method lookup table which referred to compiled versions of these methods. (TODO: how does Exupery do this?).
This looks eerily familiar: http://www.merlintec.com/lsi/persist.html
What every programmer should know about memory: http://people.redhat.com/drepper/cpumemory.pdf
Page Information
|
Wiki Information |
Recent PBwiki Blog Posts |