Design for a block-based virtual machine.

 

Random ideas:

 

I've written an article for this before on people.squeakfoundation.org. TODO: find it.

 

This is a design for a virtual machine that:

 

  • Scales to terabyte or larger images.
  • Has completely concurrent GC, scalable to any number of processors.
  • Stores most of the image on disk, only loads into memory what is needed. GC does not always touch the disk.
  • Potentially could have snapshots made of it which can be rolled back to.
  • Does transacted writes to disk.

 

This is achieved by:

 

  • VM is divided into blocks of e.g. 64KB.
  • Two types of pointers are used: near and far. Near pointers are e.g. 16 bits. Far pointers are bigger.
  • Simple single-threaded mark/sweep GC algorithm is used for intra-block GCs.
  • Backtracing GC algorithm is used for inter-block GCs.

 

A pointer within a block is a direct 16-bit pointer. These are fast and the intention is that these local pointers help absorb the overhead generated by the rest of this proposal.

 

A pointer from a block to an object is another block has three levels of indirection. Firstly, a 16-bit pointer points to a larger "OutPointer" object (e.g. 64 bits?) in the same block. That larger pointer points to an entry in a global object mapping table. That table entry then has the location of the external block and the "InPointer" index of the object in that block.

 

Blocks contain objects, InPointers and OutPointers. OutPointers point (indirectly) to InPointers in other blocks. InPointers in a block are referred to by index (0, 1, 2... etc) that doesn't change if the local object the InPointer refers to is moved around the block.

 

An InPointer consists of: [localObject, backRefs]. localObject is a 16-bit direct pointer to the object in this block. "backRefs" is a list of back references to the out pointers in other blocks which refer to this pointer. This list can be implemented in several ways. One way is to have each OutPointer have a nextPointer, so that they form a linked list of all OutPointers pointing to a particular InPointer (needs more thinking - how does the object table fit in?).

 

An OutPointer consists of a pointer (64 bits?) to an entry into the object mapping table. It may also contain a link to the next OutPointer in a linked list of back references if back references are implemented in this way.

 

* Including linked lists in the outpointers will cause a lot of CPU cache misses as it jumps around memory. Use a local array instead.

 

* Rather than using an object lookup table, just a block lookup table is needed. An OutPointer then is a (block#, inPtr#) tuple. An InPointer is then a (localRef, backRefs) tuple with both entries being direct 16-bit pointers.  Unfortunately, this limits the number of back references to what can fit in a 64KB block.

 

 

A backtracing GC algorithm uses the back references to determine how many references there are to a particular object. When this list is empty, the object can be GCed.

 

The backtracing GC algorithm would mean that an image could be mostly stored on disk. The GC algorithm does not trace through all of memory, meaning that objects on the disk can remain untouched by a GC.

 

 

The backtracing GC algorithm also is very concurrent; a single object can be reclaimed by it's own private thread of execution independently of any other GCs which are also happening.

 

Each block can be garbage collected, compacted, etc independently of any other block.

 

Objects can be migrated between blocks to optimize object access. Preferably, most pointers are local 16-bit pointers which are much faster than inter-block pointers.

 

This type of virtual machine would make it possible to use the SSEs on the cell processor. It would make very good use of multi-cored CPUs.

 

Custom hardware could be made that executes Smalltalk bytecodes directly. GC and far reference management would then be implemented in bytecodes rather than by hardware. Also, a cell-processor style local memory could be used for each core with a master core doing transfers to/from main memory and disk.

 

Disadvantages: can't have large (>64KB) arrays.

 

This could also use a partition on a disk rather than files on a filesystem. Both options should be available. RAID would be used to make an image span multiple disks.

 

Targetted Platforms

  • Multi-cored computers (dual-core / quad-core).
  • Very multi-cored computers, e.g. Sun T2000.
  • Cell processor (6 cores, local memory only).
  • Platforms with little memory - most data held on disk.
  • Custom hardware.

 

Cores must be able to be added and removed at run-time to facilitate:

  • Powering on and off the various hardware cores.
  • Reassigning hardware cores to other tasks (e.g. primitives, reconfiguring FPGA for other task, reconfiguring Rapport Kilocore core for another task).
  • testing, tuning.

 

Ideally, the word size for inter-block references should be customisable - 16 bit, 32 bit. This allows this VM to run efficiently on an architecture where 64kb is too fine a granularity.

 

Custom hardware implementation

 

This VM could be designed with an eventual hardware implementation in mind. A standard computer could have Smalltalk co-processors added to it, much like the Cell architecture has a PowerPC core with custom SSE added to it. The main CPU would handle moving data around the co-processors and doing GC and block management. Alternatively, this could also be done by a Smalltalk-based CPU which would just be one of many cores.

 

Each "core" would have 64KB (or maybe more if multiple blocks need to be loaded?) of local static RAM, and would execute Smalltalk bytecodes. The following would be implemented in hardware:

  • Bytecode execution (stack (push, pop), jump, send, return).
  • Primitives:
    • SmallInteger, Integer operations.
    • Floating point operations (done by shared co-processor?)
    • Object allocation (basicNew).
    • BitBlt and display operations on a shared display device?
  • Maybe intra-block garbage collection? Can this be done in hardware?
  • Block primitives - copying a block back to main memory, requesting a block from main memory, locking a block for writing. These would be callable using special bytecode primitives.
  • Message dispatch table

 

The following would be implemented using bytecodes because they would be too complex to be implemented directly in hardware:

  • Inter-block reference management.
  • Inter-block garbage collection.
  • Maybe intra-block GC if it can't be done in hardware.
  • Block management - moving blocks between disk, main memory and each core's private memory.
  • Method lookup (if it is not in the message dispatch table) and CompiledMethod loading from other blocks.

 

Example: code in a CompiledMethod wants to write to an object referred to using a OutPointer. The hardware would detect that the object being written to is an OutPointer rather than an actual object and would jump to a special location that has bytecodes for dealing with OutPointers (i.e. cause an "interrupt"). That special code would load the block in the OutPointer into the local memory (maybe unloading other blocks in the process to make room), lock it for writing, write the value, unlock the block for writing and return from this "interrupt".

 

A message send would require the CompiledMethod to be locally available. This is discussed below.

 

Block-based VM OS

 

Clients should be fast booting. Clients should only need minimal hardware resources; provided that they are reactive, all the grunt can sit on the server.

Anything that runs across an administered network should get it's full configuration from that network.

 

If an OS was going to be made out of this VM:

 

  • It should be easy to install - an installation of a server would be done using a remote management client. Clients could be either simple PC applications or thin clients.
  • It should be installable over a network connection and using nothing but a serial console or telnet session if need be.
  • Installation involves the absolute bare minimum. If anything can be configured in a running system, then that's where it will be configured rather than during installation.
  • Like ZFS, all storage vehicles (disks, CF cards, other devices) should be able to be added and removed from the "pool".
  • There should be no file system except as an optional device.
  • Object storage should sit directly on top of disk partitions.
  • Object storage can sit on a RAID array.
  • All management of a server will be remote over a network connection. It will be assumed that a server will never have a screen or keyboard.
  • Clients will boot in less than 2 seconds. Clients will shut down in less than 2 seconds. This may be simulated using a battery to keep DRAM refreshes going, or by doing rapid loading of memory directly from a reserved part of storage. Servers can take as long as they want to boot.

 

Booting a server:

  • The server boots over the network using DHCP, tftp or whatever. The network card does this, and the BIOS (or whatever) is configured to do this by default.
  • The boot server should inform the network administrator that a new server is asking for boot information and ask what to do. From this point onwards, the administrator has complete control over the new server.
  • Logging is done to a log server once the kernel has loaded (or before, if possible!). Ditto alarms, management. Infact, why not just keep a TCP connection to a management server for logging, alarms and management commands?
  • The server reads it's disk configuration from the network - i.e. which partitions on which disks it should start up.
  • It boots.

 

If it is a client, then it boots the traditional way by reading a boot block from storage.

 

On disk, the latest snapshot contains a record of what was last in memory. This is loaded back into memory. It would be possible to boot from an earlier snapshot - on a server, this would involve using management commands and on a client using a pre-boot menu of some sort.

 

Debugging made awesome

 

Debugging would be awesome if you could not only go forwards, but also backwards in time. This could be implemented by making snapshots, and recording the exact times when external resources affect the image. By recording the exact time (to the currently executing bytecode) of every external influence on the VM, the VM can be made completely deterministic allowing the exact sequence of events leading up to a bug occuring to be recreated.

 

This would require recording the exact time to the bytecode and the device I/O of:

  • Context switches (i.e. switching between processes) occur.
  • Accessed timers.
  • All network traffic coming in.
  • Keyboard, mouse events.
  • file I/O times and latency.
  • other devices.

 

Given a snapshot and the above events that occur, the VM can then recreate the exact circumstances under which an error occured. This allows the debugger to go back to a particular snapshot and recreate events up to a certain point in time with the user having full visibility over what happens.

 

The recorded information should be minimal. If the overhead is not to high, it may even be possible to keep, say, the last few minutes since the last snapshot of this information in memory and dump it to log files when an unhandled exception occurs (allowing the programmer full visiblility over what happened).

 

An added benefit is that this information acts as a log file which can be replayed when the VM is restored.

 

Implementation details

 

Pooling across partitions.

 

The object store sits on top of the block manager. The block manager keeps track of which blocks are stored on which physical disks (or if they are only in memory). The block table stores this information.

 

Would it be worth making the block manager able to keep "backup blocks" on separate devices, ala software RAID? If a disk is going to be decommissioned or replaced, it should be taken out of the pool meaning all blocks / objects need to be moved off it onto other disks. A disk can be marked as "very reliable" (i.e. RAID) to "unreliable" (i.e. IDE cables made of cellotape) and replication algorithms can be made depending on the reliability of disks. You would also need to specify that partitions are actually on the same disk, meaning that replication isn't useful.

 

 

Procedures

 

Assuming a block lookup table is used.

 

Intra-block Garbage Collection

 

The block is locked for writing by this thread.

 

Each collected OutPointer will mean that the backref list of its target needs to be updated. OutPointers can be collected when no local object refers to them.

 

If objects referenced to by an InPointer are moved around, the InPointer will need to be updated. Object movement and compaction is done following the standard GC algorithm that has been chosen for this.

 

InPointers are used as the root set for GC. They can sit in a special header area. OutPointers can be GCed away. When an OutPointer is GCed, the backref in it's corrosponding InPointer needs to be removed.

 

Inter-block Garbage Collection

 

When an intra-block GC brings the number of back references of an InPointer to zero, that InPointer can be removed. A cycle-detecting algorithm would be needed.

 

One cycle-detecting algorithm would be to mark one of the backreferences as "primary", such that all primary references in all objects form a non-cyclic tree from the root set to every single object. When the primary reference is removed from an object's backreference list, that object must begin a breadth-first search through its backreferences searching for another primary backreference on another object that can be used to reconstitute the non-cyclic tree from the root set to this object. If the search fails, the object and all objects involved in the search are garbage.

 

This algorithm could be parallelised; multiple garbage collections could occur together, and the search could be made concurrent. Every object could possibly have a lock put on it; if an object is locked then that object has a search algorithm running on it. That lock could be waited on to retrieve the result of that search. This behaviour may be prone to stalling if a thread becomes stuck or experiences a problem waiting on I/O. Deadlock would occur if there are cycles in the search tree, or if two searches deadlock on each other. To prevent deadlock of two threads, the order of search must be well defined (i.e. the search is done using an ordered list of backreferences, and those lists must not change during the search). To prevent a thread doubling back on itself, it could mark an object with that thread's unique identifier when it searches past it, and then test for that ID on each new object it searches.

 

Moving objects between blocks

 

For optimization, it is sometimes preferable to move objects between blocks to minimize the number of far references. This is done as follows:

 

  1. A GC is performed on the old block to remove unnecessary references to this object.
  2. Both blocks are locked for writing by this thread.
  3. The object is copied and a new InPointer is made for it in the new block. If the object referred to objects in the old block, OutPointers are made. The local block is checked for to see if any of these OutPointers already exist and can be re-used (maybe?).
  4. If any objects in the old block refer to this object, a new OutPointer is made in the old block referring to this object's new location. Those objects are updated to point to that OutPointer.
  5. The back references for the old InPointer are traversed and updated. If any of these back references are actually in the new block, then substitute them for a local reference.

 

OutPointers in a block can be used by multiple objects in that block. To find if an OutPointer to an object already exists in a block, either the backref list of the target can be searched, or all OutPointers in a block can be searched.

 

Writing to an object

 

Updating an object pointer needs to be an atomic operation. If the write can be done in a single operation, then no locking is needed. Otherwise the object may need to be locked to write to it. The whole block could be locked instead of just one object; this would serialize all writes to a block.

 

An optimisation is to make a thread hold a write lock on a block, releasing it only when asked to. When another thread gains the write lock on a thread's current block, that block's thread is blocked until it regains its lock. The owner thread cannot make any new objects meaning that it cannot create any MethodContexts until it regains its lock. This means that a thread might relinquish its lock on its current block at every context switch. An optimisation would be to make the owner of a block make a new block when another thread wants to access its current block.

 

This makes it effectively impossible for one thread to access another thread's current MethodContext, meaning the debugger would have a difficult job.

 

 

Alternatively, blocks can be copy-on-write. A thread will do all its work on a copy of the block. When the thread has finished, the block is "committed" back to be part of the main memory. TODO: think about how this affects concurrent synchronisation.

 

The copy-on-write method may make it possible to have a transacted VM with rollback possibilities. If blocks are copied on write, then a snapshot could be made by keeping the old versions of the blocks and keeping an old copy of the block table.

 

Say that to write to a block, a thread obtains a copy of that block for writing to, and when finished the block is inserted back into the image. This means:

  • Any other thread wanting to write to that block will need to wait for the existing block to finish. This means that the block needs to record who has it locked anyway. This only happens when writing to existing objects; new objects will be created in thread's private blocks.
  • I suspect that every time a Semaphore is signalled, the block will need to be written back before the involved thread is resumed to ensure that concurrent synchronisation still works. I'm not sure if there are other cases, such as waiting on a semaphore.
  • Other threads can still read from the old copy and so no blocking is needed. They only block on writing.
  • This approach may or may not have significant memory copying overheads. An initial write to a block means it needs to be copied.

 

Creating a new object

 

The block will need to be locked. New space is allocated. The block can then be unlocked.

 

Alternatively, blocks can be copy-on-write as described above.

 

Block is empty

 

If a block becomes empty, it can be removed.

 

If a block has been near-empty for a significant amount of time, it may be worthwhile moving the objects out of it and removing the block.

 

The block's entry in the block table can be nullified, and the space of the entry re-used for a linked list of free table entries.

 

Block is full

 

A block that is nearly full runs the risk of not having enough space for InPointers and OutPointers that need to be created. Moving objects out of the block may use even more space with the extra InPointers and OutPointers that are created.

 

The best approach may be to cut the block in two. This requires an efficient graph-theory algorithm to find a division between the objects with the least number of connections between the two sets (or at least a good minimal division). One approach is to do profiling to determine which instance variables of a class are used most (storing the information with the class?), and then creating a (depth-first?) tree following the most common links first from that stopping at half or one third of the way.

 

Another approach is to give each object a counter (in a new block?), then starting with any object, increment all objects it points to by one, and then follow the link to the object with the highest value until half the objects have been reached.

 

Snapshot

 

A snapshot can be taken of the running image by making a copy of the block table and making all blocks copy-on-write.

 

Writing a block to disk

 

Every block has an age associated with it. One way of doing this is to have blocks in order (somehow) and the oldest blocks (that have not been read or written for a long time) are written to disk. This is only necessary when memory is full.

 

Occasionally (every 60 seconds?), a snapshot is made where all blocks in memory, including the most recently created ones are written to disk. This is done by marking those blocks as copy-on-write and synchronising the originals to disk in a background thread. Only dirty blocks need to be written; a block is marked as dirty when it is locked for writing.

 

Because the block table can have 2^64 entries in it, it too will need to be predominantly stored on the disk. The block table will be stored last, so that recovery is possible if the write to disk fails part way through. Blocks on the disk are not overwritten; the disk image is only appended to with newer versions of blocks.

 

One way of preventing ever increasing disk usage would be to keep only several snapshots, e.g. 1 minute ago, 1 hour, 1 day, 1 week, 1 month. These would roll over - the minutely one gets overwritten every minute. Once an hour, the hourly one is replaced by the minutely one. Every day, the daily one is replaced by the hourly one, etc. The rolling over happens from oldest to youngest so that the latest minutely snapshot doesn't overwrite everything! Snapshots refer to objects in each other, so they must all be present. Replacing one snapshot by another may just be a case of updating some pointer somewhere.

 

One way of thinking about snapshots is that they are like generational garbage collection. The oldest objects are in the oldest snapshot.

 

Writing to an object in an old snapshot would involve reading the relevant block from that snapshot into memory, writing to it, and when the snapshotting occurs it now becomes part of the latest snapshot.

 

Sending a message

 

Usually in Smalltalk, a message send requires looking up the object's class, search the method dictionary, and then searching method dictionaries of all superclasses. To optimise this slow process, a message dispatch table is used. This table needs to be stored somewhere.

 

Also, the CompiledMethods that are to be executed also need to be available locally.

 

One possibility is for each thread to have an auxilliary block that contains read-only replicas of CompiledMethods. This auxilliary block can also be used by the garbage collector for storing state. Potentially this block could also hold the current stack.

 

Another possibility is to have a special object which contains a CompiledMethod's code, but is read-only and transient, and it is this object which gets copied into a thread's current block. The garbage collector will simply remove this object when GC occurs as it is only a copy of the original CompiledMethod. This special object would contain a single link to the original CompiledMethod where the literals can be found. This copying may have significant overhead.

 

Another possibility is to have the interpreter be able to simply execute code which is in another block. This would make a hardware implementation need to understand OutReferences and InReferences. In this occasion, a hardware implementation with cores having their own static memory would mean that the block containing the method would need to be copied to the local memory - for multiple methods in different blocks this can be significant overhead. The overhead can be minimised by putting serially executed methods in the same block, and perhaps including a long-lived message dispatch table with that block as well.

 

Implementation

 

This is a prime candidate for implementing in Slang.

 

A class MemoryBlock could be made with an artificial limit on the values each location can contain. In this way, both 16-bit and 32-bit NearRefs can be played with. The MemoryBlock would be an array of the size of each block that can be written to disk. By making the implementation flexibile (wrt size of NearRefs and block size), some experimentation can take place to see which gets the best performance.

 

These MemoryBlocks would be written to a real disk, which will show the exact same behaviour as a compiled VM.

 

Multi-threaded implementations can be simulated using Squeak Processes. These would become pthreads in a compiled version.

 

The Squeak interpreter as it stands could remain largely the same. The ObjectMemory could be replaced with a BlockMemory. Most methods would stay the same; no special block handling functionality would need to be added at the Interpreter other than "interrupts" when the block is full which causes a call to bytecodes which then handle that case. The BlockMemory objects would also be visible from within the image so that GC can happen in bytecodes (if desired - slow).

 

Ideally, the Slang compiler could be modified to be able to read and write real objects as well. These "real" objects would be compiled into a bunch of C functions that take a pointer to the memory location of that object as their first argument. These functions would have an implementation of the guessed class of that object (where perhaps profiling information would be useful?), and first tests to see if that class is correct. If that guessed class is wrong, the primitive would fail and the Interpreter would resume by trying to find another implementation of that method. Several tests like this could be chained together if the method is a common one. Alternatively, the compiled VM could keep a special method lookup table which referred to compiled versions of these methods. (TODO: how does Exupery do this?).

 

Links

 

This looks eerily familiar: http://www.merlintec.com/lsi/persist.html

What every programmer should know about memory: http://people.redhat.com/drepper/cpumemory.pdf

 


Page Information

  • 4 months ago [history]
  • View page source
  • You're not logged in
  • No tags yet learn more

Wiki Information

Recent PBwiki Blog Posts