Friday, June 21, 2013

Lisbon - Week43

My posts at the zones, after a break!
[21/06/2013] - It has been a while since I wrote a weekly blog post on my stay in Lisbon. The last post was Lisbon - Week29. Later I switched to fortnightly posts, or even one every three weeks, as I got busier with my studies. Now that the exams are over and I have some free time and material to write about, here comes this weekly update.

Campo Pequeno
This week was full of movies, eating out, frequent drinks at the Hard Rock Cafe - Restauradores, a variety of hamburgers at h3 new hamburgology at Campo Pequeno, and some parties. I particularly enjoyed the movies at Cinema City - Campo Pequeno. The City bar adjoining the cinema is very stylish and unique. It is unfortunate that I had not watched any movies here before, considering that Campo Pequeno is the metro station next to the one closest to my apartment (Saldanha), and also within walking distance of my apartment.

Llovizna
During this week, I managed to update my blog with some technical posts on Jikes RVM, something I researched for our AVExe group project. Most of these posts made it into the Architects Zone and Javalobby on DZone. I hope they will be of use to a wider audience. Without technical posts, my blog hits had recently been declining; this helped me understand the readers of my blog.

Cherries!
In Sri Lanka, I never enjoyed cherries. They always reminded me of "Lama Panadol" (the paracetamol brand made for kids). However, I have started to enjoy cherries in Lisbon. They are much fresher and sweeter here, naturally.

MMTk utility classes

Class Diagram of Counters
High-level language virtual machines are a core topic of interest for researchers working on virtual execution environments. As an open source virtual machine released to 16 universities as early as 2001, Jikes RVM has been a major driver of much research. Garbage collectors for Jikes RVM are constructed using the framework known as the Memory Management Toolkit (MMTk). The key compositional elements in MMTk are utilities, policies, and plans. Apart from these, the package org.mmtk.harness contains the MMTk harness test cases. Sanity checks are included in the package org.mmtk.harness.sanity. GcSanity is the collection sanity checker, which takes snapshots before and after a collection to ensure that the live objects after a collection are the same ones that existed before it. Further options are implemented in the classes of org.mmtk.harness.options. We have already looked at policies and plans. In this post, we will look at the MMTk utility package, which implements the MMTk mechanisms.

MMTk comes with utility classes that provide mechanisms for memory management, across the multiple policies and the plans that use those policies. An ideal example showing the usage of the utility package is the interface Constants. All the classes either implement the Constants interface, or are subclasses of classes that implement it.

1. Allocation
The package org.mmtk.utility.alloc handles allocation. Allocator is the base class providing the basis for processor-local allocation. It provides the retry mechanism that prevents a slow-path allocation from causing a garbage collection and thereby violating the assumption of uninterruptibility. Allocator also ensures that memory is aligned according to the request. This class is crucial in garbage collection as the base class for all the allocator algorithms. Improper handling here leads to hard-to-trace bugs, where an allocation may cause a garbage collection, or a garbage collection may immediately follow an allocation.

The method alignAllocation() accepts a region to be aligned, and aligns up the request according to the requested alignment and offset, optimizing with the known alignment for the allocator algorithm. It returns the aligned-up address.
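As an illustration, the align-up computation can be sketched as a stand-alone method (alignUp and its signature are illustrative, not the actual MMTk API): it returns the smallest address at or above the request such that the address minus the offset is a multiple of the alignment.

```java
public class AlignDemo {
    // Smallest a >= addr such that (a - offset) is a multiple of alignment.
    // The alignment must be a power of two, mirroring MMTk's assumption.
    public static long alignUp(long addr, long alignment, long offset) {
        long mask = alignment - 1;
        return ((addr - offset + mask) & ~mask) + offset;
    }

    public static void main(String[] args) {
        System.out.println(alignUp(13, 8, 0)); // 16
        System.out.println(alignUp(16, 8, 0)); // 16: already aligned
        System.out.println(alignUp(13, 8, 4)); // 20: 20 - 4 is a multiple of 8
    }
}
```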

Consecutive allocation failures for the threads are counted by determineCollectionAttempts(). fillAlignmentGap() takes the start and end addresses, and fills the region with the alignment value. The minimum size guaranteeing the allocation of a given number of bytes at the specified alignment is calculated by getMaximumAlignedSize().

All allocators should use the final method allocSlowInline() for the slow-path allocation. This method attempts the protected method allocSlowOnce(), defined in the subclasses, multiple times. It ensures safe execution by taking care of thread/mutator context affinity changes, whilst allowing a collection to occur.
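The retry structure of the slow path can be sketched as below (a simplification with hypothetical names; the real allocSlowInline() also re-reads the thread-local context after a collection, since the thread may have migrated):

```java
import java.util.function.LongSupplier;

public class SlowPathDemo {
    static final int MAX_ATTEMPTS = 3;

    // Repeatedly try the once-off slow path, allowing a GC between attempts.
    // Returns the allocated address, or throws when memory is exhausted.
    public static long allocSlow(LongSupplier allocSlowOnce) {
        for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
            long result = allocSlowOnce.getAsLong();
            if (result != 0) return result;   // 0 models a failed attempt
            // A real implementation would trigger/await a collection here,
            // then refresh the thread-local allocator state before retrying.
        }
        throw new OutOfMemoryError("allocation failed after retries");
    }

    public static void main(String[] args) {
        final int[] calls = {0};
        // Fails twice, succeeds on the third attempt.
        long addr = allocSlow(() -> ++calls[0] < 3 ? 0 : 0x1000);
        System.out.println(Long.toHexString(addr)); // 1000
    }
}
```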

2. Bump Pointer Allocation
This is implemented by the BumpPointer class, which extends the abstract class Allocator. A bump pointer scans through the allocated objects linearly. To achieve parallelism, this class maintains a header at a granularity of one or more blocks. The minimum region size is 32768 bytes, so the 3- or 4-word overhead is less than 0.05% of all space. A BumpPointer is initialized by providing the space to bump-point into, indicating whether linear scanning is allowed in the region. The method linearScan() performs a linear scan through the objects allocated by the bump pointer, and scanRegion() scans through a single contiguous region.

Intra-block allocation requires only a load, an addition, a comparison, and a store, and hence is fast. The allocator requests more memory when a block boundary is encountered. The scanned objects maintain affinity with the thread that allocated them in the region. This class relies on the supporting virtual machine implementing getNextObject and related operations.

Space is allocated for a new object by calling alloc(). This method is frequently executed, and is sensitive to the optimizing compiler. Whenever the bump pointer reaches the internal limit, allocSlow() is called. This method should never be inlined by the optimizing compiler, and hence is annotated with @NoInline, to force it out of line.
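The fast-path/slow-path split can be sketched as follows (illustrative only: the real BumpPointer operates on Address values, and its allocSlow() acquires a new region rather than failing):

```java
public class BumpPointerDemo {
    private long cursor;
    private long limit;

    public BumpPointerDemo(long start, long end) { cursor = start; limit = end; }

    // Fast path: one load, one addition, one comparison, one store.
    public long alloc(int bytes) {
        long result = cursor;
        long newCursor = cursor + bytes;
        if (newCursor > limit) return allocSlow(bytes); // block boundary hit
        cursor = newCursor;
        return result;
    }

    // Slow path stub: a real implementation requests a fresh region here.
    private long allocSlow(int bytes) {
        throw new OutOfMemoryError("region exhausted; would request a new block");
    }

    public static void main(String[] args) {
        BumpPointerDemo bp = new BumpPointerDemo(0x1000, 0x1040); // 64-byte region
        System.out.println(Long.toHexString(bp.alloc(16))); // 1000
        System.out.println(Long.toHexString(bp.alloc(16))); // 1010
    }
}
```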

A bump pointer can be re-associated with a different space by calling rebind(), providing the new space to which the pointer is to be associated. The bump pointer will be reset such that it uses the new space on the next call to alloc().

Address is a stub implementation of an Address type, used by the runtime system and collector to denote machine addresses. An allocation unit is denoted as a card, which is marked by an address that lies within the card. Given the address of the object creating a new card, an address that lies within the card, and the size of the pending allocation in bytes, createCardAnchor() creates a record of where the start of the card is relative to the start of the object. The start of the card corresponding to a given address can be retrieved by calling getCard(). Similarly, the address of the card metadata can be retrieved by calling getCardMetaData() with an address from the card.

The next region in the linked list of regions can be retrieved using getNextRegion(). Similarly, setNextRegion() sets the next region in the linked list, and clearNextRegion() clears the next-region pointer.

The DATA_END address from the region header can be retrieved using getDataEnd(), by providing the bump-pointer region address, and setDataEnd() sets a new DATA_END address in the header. Similarly, getRegionLimit() and setRegionLimit() return or store the end address of the given region, respectively. The lowest address where data can be stored in a given region can be retrieved with getDataStart(). updateMetaData() updates the metadata to reflect the addition of a new region. Where a bump-pointer region has been consumed but a contiguous region is available, consumeNextRegion() consumes it and returns the start of the region satisfying the outstanding allocation request.

3. Block Allocation
Blocks are a unit of storage of 2^n bytes that is non-shared (thread-local) and coarse-grained. The virtual memory resource provides the virtual memory space, and the pages consumed by blocks are accounted to the memory resource. BlockAllocator implements the Block data structure in various sizes. alloc() allocates a block and returns the first usable byte of the block. A block is freed by calling free(). If the block is completely free, it is returned to the virtual memory resource; otherwise, if it is just a sub-page block, it is added to the free list.
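Since blocks come in power-of-two sizes, mapping a request to its size class is a small log2 computation, sketched below (the minimum block size constant and the method name are illustrative, not BlockAllocator's actual API):

```java
public class BlockSizeDemo {
    static final int LOG_MIN_BLOCK = 12; // assume a 4 KB minimum block (illustrative)

    // Size class = smallest n with 2^n >= bytes, floored at the minimum block.
    public static int sizeClass(int bytes) {
        int log = 64 - Long.numberOfLeadingZeros(bytes - 1); // ceil(log2(bytes))
        return Math.max(log, LOG_MIN_BLOCK);
    }

    public static void main(String[] args) {
        System.out.println(sizeClass(4096)); // 12
        System.out.println(sizeClass(4097)); // 13: rounds up to the next power of two
        System.out.println(sizeClass(100));  // 12: below the minimum block size
    }
}
```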

4. GCspy Integration
GCspy 2.0 is a tool that helps to analyse the heap, often used to understand the memory usage and effectiveness of the garbage collectors in our project. The development of GCspy, however, lags behind that of the Jikes RVM core [10]. The data for GCspy is gathered using gcspyGatherData() in the classes. The package org.mmtk.utility.gcspy contains the classes for the GCspy integration, and org.mmtk.utility.gcspy.drivers contains the GCspy drivers for the MMTk collectors.

5. Treadmill
Each instance of the Treadmill is a doubly linked list, where each node is a piece of memory. The words of each node are laid out as:
[Forward Link | Backward Link | Treadmill | Payload ----->]
The Treadmill object must not be moved. Depending on the constructor argument, access to the instances may be synchronized.
 
This assumes that the implementation language and the language being implemented are the same. This works well with Jikes RVM, as both are Java. However, to use it with VM implementations in other languages, the doubly linked list class upon which Treadmill depends should be rewritten.
 
If the created instance is to be shared between threads, which is decided by the boolean parameter "shared" in the constructor, access will be synchronized with locks. A node is added to the treadmill during allocation, using addToTreadmill().
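The behaviour described above amounts to insertion into a doubly linked list, synchronized only when the instance is shared, as in this sketch (the names and layout are illustrative; the real Treadmill stores the links in the words of the node itself, as shown earlier):

```java
public class TreadmillDemo {
    static final class Node {
        Node prev, next;
        final long payload;
        Node(long payload) { this.payload = payload; }
    }

    private final Node head = new Node(0); // sentinel node
    private final boolean shared;
    private int count;

    public TreadmillDemo(boolean shared) {
        this.shared = shared;
        head.next = head;
        head.prev = head;
    }

    // Mirrors addToTreadmill(): prepend a node, locking only when shared.
    public void addToTreadmill(Node n) {
        if (shared) {
            synchronized (this) { link(n); }
        } else {
            link(n);
        }
    }

    private void link(Node n) {
        n.next = head.next;
        n.prev = head;
        head.next.prev = n;
        head.next = n;
        count++;
    }

    public int size() { return count; }

    public static void main(String[] args) {
        TreadmillDemo t = new TreadmillDemo(true); // shared -> synchronized access
        t.addToTreadmill(new Node(0x1000));
        t.addToTreadmill(new Node(0x2000));
        System.out.println(t.size()); // 2
    }
}
```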

6. Large Object Allocation
This is implemented in LargeObjectLocal, which extends the abstract class LargeObjectAllocator. Each instance provides fast, unsynchronized access to the treadmill, and is bound to a single CPU; hence instances should not be shared across different CPUs, as those provide truly concurrent threads.
Given c CPUs and t Treadmill spaces, there will be c * t instances of this class, one for each {CPU, Treadmill} pair.

7. Page Resource
Allocation of pages for a space is managed by the abstract class PageResource. When a space requests a page, the page budget and the use of the virtual address space are checked. If the request cannot be satisfied, garbage collection is triggered. MonotonePageResource, a subclass of this class, handles monotonic space usage, while the other subclass, FreeListPageResource, handles ad hoc usage. Copying collectors allocate space monotonically before freeing the entire space, so MonotonePageResource is useful for them. Though MonotonePageResource is more restrictive, it is easier to manage.
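The budget check and GC trigger protocol can be sketched as follows (a hypothetical simplification; the real PageResource also accounts for the virtual address space and distinguishes reserved from committed pages):

```java
public class PageResourceDemo {
    private final int budgetPages;
    private int reservedPages;

    public PageResourceDemo(int budgetPages) { this.budgetPages = budgetPages; }

    // Returns true if the pages were reserved; false signals that the caller
    // (the plan) should trigger a garbage collection and retry.
    public boolean reservePages(int pages) {
        if (reservedPages + pages > budgetPages) return false;
        reservedPages += pages;
        return true;
    }

    // A collection (or an explicit free) releases pages back to the budget.
    public void releasePages(int pages) { reservedPages -= pages; }

    public static void main(String[] args) {
        PageResourceDemo pr = new PageResourceDemo(4);
        System.out.println(pr.reservePages(3)); // true
        System.out.println(pr.reservePages(2)); // false -> would trigger GC
        pr.releasePages(3);                     // GC reclaimed the pages
        System.out.println(pr.reservePages(2)); // true
    }
}
```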

8. Heap Growth Management
HeapGrowthManager observes the heap utilization and GC load, and grows or shrinks the heap size accordingly. This class, and the other classes in the package org.mmtk.utility.heap, contain the heap-related mechanisms.

9. Sanity Checker
Sanity checks for the simple collectors are handled by the classes in the package org.mmtk.utility.sanitychecker. SanityChecker is the major class handling the sanity checking. SanityDataTable implements a simple hashtable to store and retrieve per object information on sanity checks.

10. Statistics
The package org.mmtk.utility.statistics contains a number of counters implementing the abstract class Counter, for multiple counting purposes. SizeCounter aggregates two EventCounter objects, to count both the number of events and their volume. The Xml class writes the output in XML format.
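The aggregation performed by SizeCounter can be sketched in plain Java (a minimal model, not the actual MMTk counter classes, which also track statistics per GC phase):

```java
public class StatsDemo {
    // A minimal event counter, loosely modelled on MMTk's EventCounter.
    static class EventCounter {
        private long total;
        public void inc(long value) { total += value; }
        public long getTotal() { return total; }
    }

    // SizeCounter aggregates two counters: one counts events, one counts volume.
    static class SizeCounter {
        final EventCounter events = new EventCounter();
        final EventCounter volume = new EventCounter();
        public void inc(long bytes) {
            events.inc(1);      // one more event
            volume.inc(bytes);  // total bytes seen
        }
    }

    public static void main(String[] args) {
        SizeCounter alloc = new SizeCounter();
        alloc.inc(64);
        alloc.inc(128);
        System.out.println(alloc.events.getTotal()); // 2 events
        System.out.println(alloc.volume.getTotal()); // 192 bytes
    }
}
```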

11. Deque
Classes extending Deque
The package org.mmtk.utility.deque defines the doubly-linked, double-ended queue (deque). Though the double linking slightly increases the space demand, this is a worthwhile trade-off, as it provides an efficient buffer and operations such as sorting.
 
LocalSSB implements a local, unsynchronized sequential store buffer. This is used in critical code such as the garbage collection work queue and the write buffer used by many collectors; hence it is implemented as efficiently as possible in both space and time. Each instance has a bump pointer and a pointer to the sharedDeque. This class is FIFO in nature, though it doesn't implement dequeuing. TraceBuffer supports enqueuing and dequeuing of tracing data, and bulk processing of the buffer.

12. Log
The Log class is used for trace and error logging. The message buffer size is intentionally kept large (3000 characters), as the Lock class of Jikes RVM logs a considerable amount of information during a potential GC deadlock.

Wednesday, June 19, 2013

MMTk plans

MMTk phases and the flow
Collector algorithms are implemented in policies and plans. Referring to the respective collector algorithms from the papers is the ideal way to understand each of these policies/plans.

The build/configs folder contains multiple properties files, which can be used to build Jikes RVM with different configurations and different collectors. The MMTk plan is defined in the properties files. The recommended production and prototype configurations use the GenImmix plan, as can be seen from production.properties and prototype.properties.
config.mmtk.plan=org.mmtk.plan.generational.immix.GenImmix
MMTk garbage collection is handled through phases. By going through the complex phase definition in the respective plan class such as GenImmix, you should be able to see the logic of garbage collection.

1. Plan
Plan is the abstract class that defines the core functionality of all the memory management schemes, and hence is the base class for all the collector plans.
Global and local states are separated into different classes, as in the case of policies, because plans make a clear distinction between global and thread-local activities. Global activities must be synchronized, whereas this is not required for thread-local activities. Static resources such as the VM and memory resources are defined and managed by the global instances.

ParallelCollectorGroup is a pool of collector contexts that can be triggered to perform collection. Plan contains two instances of it, named parallelWorkers and concurrentWorkers.

The respective subclass of Plan will have a single instance, and there will be a one-to-one mapping between the PlanLocals and the CPUs (kernel threads). This mapping is crucial in understanding the properties of the plans. Due to this separation, the instance methods of the Local leave functions such as allocation and collection unsynchronized and fast.

2. Boot Sequence
The boot sequence calls the methods given below, in the given order.
1. enableAllocation() - called early in the boot process to enable allocation.
2. processOptions() - called immediately by the runtime, as soon as the command line arguments are available. Calling enableAllocation() is a precondition, as the infrastructure may require allocation in order to parse the arguments. All plans operate with the default minimums until this method is called and the respective values are obtained.
3. enableCollection() - called when it is safe to spawn contexts and start garbage collection. If the numbers of threads have not been set explicitly, sensible defaults are used. This creates the parallel and concurrent workers and the control thread, and allows mutators to trigger collection.
4. fullyBooted() - called just before control is given back to the application, and hence marks the completion of the boot sequence.
Before the VM exits, notifyExit() is called for clean-up. Further, plan-specific timing information can be provided by overriding printDetailedTiming(). collectionPhase() performs a global collection phase.
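The ordering constraints of the boot sequence can be summarized in a small sketch (the class is hypothetical; the method names follow the list above, with the VM driving the calls in this fixed order):

```java
import java.util.ArrayList;
import java.util.List;

public class BootSequenceDemo {
    final List<String> trace = new ArrayList<>();

    void enableAllocation() { trace.add("enableAllocation"); }
    void processOptions()   { trace.add("processOptions"); }
    void enableCollection() { trace.add("enableCollection"); }
    void fullyBooted()      { trace.add("fullyBooted"); }

    // The VM drives the plan through these calls in this fixed order.
    void boot() {
        enableAllocation();   // early: allocation must work first
        processOptions();     // needs allocation to parse the arguments
        enableCollection();   // safe to spawn collector contexts now
        fullyBooted();        // control is about to return to the application
    }

    public static void main(String[] args) {
        BootSequenceDemo plan = new BootSequenceDemo();
        plan.boot();
        System.out.println(plan.trace);
    }
}
```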

3. Nursery
The nursery refers to the collection of objects allocated since the last GC, hence a collection of short-lived objects. Generational collectors are based on the observation that an object that has already lived longer than a certain time is likely to live on further: when objects in the nursery survive a few rounds of nursery collections, they are moved to the mature space. The nursery occupies just 10% of the heap space, yet typically holds 90% of the objects ever created, as objects in the nursery are often short-lived; conversely, the mature space occupies 90% of the heap but holds only 10% of the objects. The mature space is scanned more rarely than the nursery in generational collector algorithms: it is scanned only at specific times, or when the VM runs out of memory. isCurrentGCNursery() checks whether the current GC is only collecting the objects allocated after the last garbage collection.
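The promotion rule behind generational collection can be sketched as below (the threshold and names are illustrative; real collectors typically promote by copying survivors out of the nursery during the collection itself):

```java
public class NurseryDemo {
    static final int PROMOTION_THRESHOLD = 2; // survivals before promotion (illustrative)

    static class Obj {
        int survivedCollections;
        boolean inMatureSpace;
    }

    // After a nursery collection, survivors age; objects old enough
    // are promoted to the mature space.
    static void nurseryCollection(Obj o, boolean survived) {
        if (!survived || o.inMatureSpace) return;
        o.survivedCollections++;
        if (o.survivedCollections >= PROMOTION_THRESHOLD) {
            o.inMatureSpace = true;
        }
    }

    public static void main(String[] args) {
        Obj o = new Obj();
        nurseryCollection(o, true);
        System.out.println(o.inMatureSpace); // false: survived only once
        nurseryCollection(o, true);
        System.out.println(o.inMatureSpace); // true: promoted
    }
}
```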

After a full-heap GC, sanityLinearScan() performs a linear scan of all spaces for possible leaks. The triggering of a garbage collection is controlled by collectionRequired(), which is called periodically during allocation and returns true if the plan requests a collection. Whether an object can ever move is determined by the method willNeverMove().

4. MutatorContext
MutatorContext and its subclasses define the per-mutator-thread behaviour. MutatorContext is the base class of all the per-mutator contexts: ConcurrentMutator, StopTheWorldMutator, and all the XXXMutator GC-specific subclasses, where XXX refers to the name of a collector plan such as GenImmix or CMS.

This implements the basic unsynchronized per-mutator behaviour common to all collectors, such as support for immortal and large object space allocation, and empty stubs for write barriers.

initMutator() is called before the MutatorContext is used. It is called after the context is fully registered and visible to the collection, and notifies that the context is registered such that it will be included in iterations over the mutators. Similarly, deinitMutator() is the final method to be called, when the mutator is to be cleaned up, and hence all local data should be returned. The implementation of deinitMutator() in this class, as well as in the subclasses of MutatorContext, by default calls flush(), which flushes the mutator context, flushing the remembered sets into the global remset pool.

checkAllocator() performs a run time check for the allocator.

5. CollectorContext
CollectorContext is an abstract class that is extended by the classes XXXCollector, similar to the Mutators. These classes implement the per-collector thread behaviour and structures such as collection work queues.

MMTk assumes that, for each thread participating in the collection, the VM instantiates instances of CollectorContext and MutatorContext in thread-local storage (TLS). Hence, access to this state is assumed to be low-cost at GC time (as opposed to access to MutatorContext, which is low-cost during mutator time), essentially separating this thread-local class from the global operations in the Plan class. Operations in this class and its subclasses are local to each collector thread, whereas operations in the MutatorContext class and its subclasses are local to each mutator thread, whilst synchronized operations always go via Plan and its subclasses, which are global. Synchronization is thus localized, explicit, and minimized.

Memory is allocated for an object using alloc(). Any required post allocation operations are handled by postAlloc(). getAllocatorFromSpace() finds the allocator that is associated with the given space.

The number of active collector threads (n), which is often the number of processors, and the number of mutator (application) threads (m) are decided by the VM, not by MMTk. Collector operations are separated as below.

Per-collector-thread operations: the operations for the entire GC, performed by each collector thread, with the m per-mutator-thread operations multiplexed across the n active collector threads.

Per-mutator-thread operations: operations such as flushing and restoring per-mutator state.

The unpreemptible method run() provides an entry point for the collector context.
Space is allocated for copying an object using allocCopy(). This method just allocates the space, and doesn't actually copy the object. Actions after the copy are performed by postCopy(). A run-time check of the allocator for the copy allocation is performed by copyCheckAllocator().

We often need to find the number of parallel workers currently executing, to be able to balance the load better. This can be found by calling parallelWorkerCount(), from anywhere within the collector context.

6. NoGC
This simple No-garbage-collector class implements the global state of a simple allocator without a collector.

7. Simple
The abstract class Simple provides the core functionality of a simple collector, and should be extended by the two types of collectors, non-concurrent and concurrent. The collection phases are defined along with their base-level implementations. Spaces introduced by the subclasses should be implemented by those subclasses themselves.

8. StopTheWorld
All the non-concurrent collectors should extend the abstract class StopTheWorld.

9. Concurrent
The abstract class Concurrent implements the global state of a concurrent collector.

JikesRVM Optimizing (JIT) Compiler - II

Compiler phases of Jikes RVM


This post continues from JikesRVM Optimizing (JIT) Compiler - I 

6. OptimizationPlanCompositeElement

6.1. LinearScan – Handles the linear scan register allocation.

6.2. GCP – Handles Global Code Placement. It comes in two variants. One is Loop Invariant Code Motion (LICM), which is applied to HIR and LIR. The other is Global Common Subexpression Elimination (GCSE), which is applied only to LIR, before LICM. These run on static single assignment (SSA) form, a property of the IR that states that each variable is assigned exactly once. Utility functions for SSA form are handled by the class SSA. These GCP algorithms use the dominator tree to determine the positions for operations. However, SSA is disabled by default in the optimizing compiler, as it is currently buggy.

6.3. SSATuneup – Places the IR in SSA form, and cleans up by performing a set of simple optimizations.

6.4. RegisterAllocator – The driver routine for register allocation. The constructor initially prepares for the allocation, and then performs the allocation using the liveness information.

6.5. LoadElimination – Implements the redundant load elimination.

6.6. LiveRangeSplitting – Performed at the end of SSA in LIR. This splits live ranges where they enter and exit loop bodies by normal/unexceptional control flow, also splitting on edges to and from infrequent code.

6.7. RedundantBranchElimination – Based on the SSA form, global value numbers, and dominance relationships, the conditions below are considered sufficient for a branch to be eliminated as redundant.

* It has the same value number as another conditional branch, and hence is found to be equivalent to that branch.

* The target of the taken branch of the other conditional branch (cb2) dominates the conditional branch under consideration (cb1), and the target block has exactly one in-edge.

* The non-taken continuation of cb2 dominates cb1, and the continuation block has exactly one in-edge.

6.8. ConvertLIRtoMIR – Converts an IR object from LIR to MIR, using the Bottom-Up Rewrite System (BURS), a tree pattern matching system, for instruction selection. The constructor creates the phase element as a composite of the other elements, in the stages given below.

* Reduce the LIR operator set to a core set.

* Convert ALU operators.

* Normalize the usage of constants.

* Compute the liveness information, build DepGraph block by block and perform BURS based instruction selection.

* Handle complex operators that expand into multiple basic blocks of MIR.

* Perform Null Check Combining using validation operands. Remove all the validation operands that are not present in the MIR.

6.9. ConvertMIRtoMC – Converts an IR object from MIR to final machine code. In the constructor, it first performs the final MIR expansion, and then performs the assembly and map generation.


7. Subclasses of CompilerPhases

Different optimizations in different phases are taken care of by the subclasses of CompilerPhase, which are responsible for them. Optimization plans of different methods still share the same component element objects, as every optimization plan contains a subset of the elements from the master plan. To avoid per-compilation state leaking through these shared objects, perform() of the atomic element creates a new instance of the phase immediately before calling the phase's perform(). The method newExecution() of CompilerPhase is overridden such that, if the phase doesn't contain state, the phase itself is returned instead of a clone.
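The share-or-clone behaviour of newExecution() can be sketched as follows (a simplified model with a hypothetical hasState() flag; the actual CompilerPhase decides statefulness differently):

```java
public class PhaseDemo {
    static abstract class CompilerPhase implements Cloneable {
        abstract boolean hasState();

        // Stateless phases are shared; stateful ones are cloned per compilation,
        // so per-compilation state never leaks through the shared master plan.
        CompilerPhase newExecution() {
            if (!hasState()) return this;
            try {
                return (CompilerPhase) clone();
            } catch (CloneNotSupportedException e) {
                throw new AssertionError(e);
            }
        }
    }

    static class StatelessPhase extends CompilerPhase {
        boolean hasState() { return false; }
    }

    static class StatefulPhase extends CompilerPhase {
        int scratch; // per-compilation state
        boolean hasState() { return true; }
    }

    public static void main(String[] args) {
        CompilerPhase a = new StatelessPhase();
        CompilerPhase b = new StatefulPhase();
        System.out.println(a.newExecution() == a); // true: shared instance
        System.out.println(b.newExecution() == b); // false: fresh clone
    }
}
```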

 
8. Operators

Class hierarchy of Operands
The IR includes a list of instructions, each including one or more operands. Operators are defined by the class Operator, which is auto-generated by the driver from a template. The input files are Operators.Template and OperatorList.dat, found in rvm/src-generated/opt-ir. The machine-dependent Operator class resides in jikesrvm/generated/{ARCH}/main/java/org/jikesrvm/compilers/opt/ir, where ARCH refers to the architecture, such as ia32-32, ia32-64, ppc-32, or ppc-64.

OperatorList.dat in rvm/src-generated/opt-ir defines the HIR and LIR subsets of the operators, where the first few define HIR only whilst the remaining define both; OperatorList.dat in rvm/src-generated/opt-ir/{ARCH} (here, ARCH is, for example, ia32) defines the MIR subset. Each operator definition consists of the 5 lines given below:

SYMBOL – A static symbol identifying the operator.

INSTRUCTION_FORMAT – The class of Instruction format that accepts this operator.

TRAITS – Characteristics of the operator, composed with a bit-wise OR | operator. Valid traits are defined in the Operators class.

IMPLDEFS – Registers implicitly defined by the operator.

IMPLUSES – Registers implicitly used by the operator. The last two are usually applicable only to the MIR subset, which is machine dependent.

For example, the definition of the Floating Point Conditional Move (FCMOV) is,

IA32_FCMOV
MIR_CondMove
none
C0_C1_C2_C3
CF_PF_ZF

whereas, for the integer addition operator, it is,

INT_ADD
Binary
commutative

leaving the last two lines empty.

 
9. Instruction Formats

Instruction formats are defined in the package instructionFormats and each fixed length instruction format is defined in the InstructionFormatList.dat files in /rvm/src-generated/opt-ir and /rvm/src-generated/opt-ir/{ARCH} (Here, ARCH is such as ia32 and ppc).

Each entry in the InstructionFormatList.dat has 4 lines as below.

NAME – Name of the Instruction Format.

SIZES – Parameters such as the number of operands defined (NUMDEFS), defined and used (NUMDEFUSES), and used (NUMUSES), and additionally NUMVAR, VARDORU, and NUMALT for variable length instructions.

SIG – Describing the operands. The structure is, D/DU/U NAME TYPE [opt]. Here, D/DU/U defines whether the operand is a def, def and use (both), or use. Type defines the type of the operand, as a sub class of Operand. [opt] indicates whether the operand is optional.

VARSIG – Describing the repeating operands, used for variable length instructions. The structure is NAME TYPE [PLURAL]. Here, PLURAL is used, where NAMEs is not the plural of NAME.

For example, let's consider the definition of NEWARRAY.

NewArray
1 0 2
"D Result RegisterOperand" "U Type TypeOperand" "U Size Operand"

Here it indicates three operands: one D (def), no operands of DU (def and use) type, and two U (use). The def operand is called Result and is of type RegisterOperand; similarly, the use operands are called Type and Size, and are of types TypeOperand and Operand, respectively.

10. Instruction Selection

BURS is used for instruction selection, where the rules are defined in architecture-specific files in rvm/src-generated/opt-burs/{ARCH}, where ARCH refers to architectures such as ppc (example files: PPC_Alu32.rules, PPC_Alu64.rules, PPC_Common.rules, PPC_Mem32.rules, and PPC_Mem64.rules) or ia32 (example files: IA32.rules, IA32_SSE2.rules, and IA32_x87.rules).

Each rule is defined by an entry of four lines, as given below.

PRODUCTION – The format is “non-terminal: rule”, where this is the tree pattern to be matched. Non-terminal denotes a value, followed by a colon, and followed by a dependence tree producing that value.

COST – The additional cost due to matching the given pattern.

FLAGS – Defined in the auto-generated class BURS_TreeNode. The flags below are valid.

* NOFLAGS – No operation is performed in this production.

* EMIT_INSTRUCTION – Instructions are emitted in this production.

* LEFT_CHILD_FIRST – The left hand side of the production is visited first.

* RIGHT_CHILD_FIRST – The right hand side is visited first.

TEMPLATE – The code to be emitted.

Floating point load is defined below as an example.

Fpload: FLOAT_LOAD(riv, riv)
0
EMIT_INSTRUCTION
pushMO(MO_L(P(p), DW));

JBurg, a bottom-up rewrite machine generator for Java, is a compiler construction tool used for instruction selection, converting the tree-structured representation of the program into machine code.

Logging in Jikes RVM

Jikes RVM development often requires debugging without proper debugger support. We may use gdb for debugging. However, log output using the org.mmtk.utility.Log class is often useful for debugging purposes. The code segment below is such a log statement.
if(VERBOSE) {
Log.write("Using collector"); Log.write(head); Log.writeln();
}
Here, VERBOSE is defined as,
public static boolean VERBOSE = false;
and during the development and testing phases, we changed it to true in the respective classes, as required. Many of the debug statements included during the development stage are later removed, when they are no longer necessary. The remaining ones in MMTk are changed to use the verbose command line option of rvm, by placing the log statements inside the block below.
if (VM.VERIFY_ASSERTIONS && Options.verbose.getValue() >= 9) { … }
These logs are displayed only when the RVM is run with the respective verbosity, as given below.
./rvm -showfullversion -X:gc:verbose=9 HelloWorld
Our debug statements will be printed only if the verbose level is set to 9 or more by a command line argument; for a verbosity level of 8 or less, they won't be logged. This approach avoids logging overhead in production builds, while enabling detailed logs for development and debugging without any code changes.

Tuesday, June 18, 2013

JikesRVM Optimizing (JIT) Compiler - I


Jikes RVM includes two compilers. One is the baseline compiler, and the other is the optimizing compiler, which is commonly referred to as the JIT compiler. BaseBase builds run Jikes RVM with only the baseline compiler, whereas the other builds, such as the production (FastAdaptive), development (FullAdaptive), and ExtremeAssertion builds, have both the baseline and optimizing compilers.
Here, we will study the optimizing compiler in Jikes RVM. The package org.jikesrvm.compilers contains the source code of the compilers, and the package org.jikesrvm.compilers.opt contains the code of the optimizing compiler.

SharedBooleanOptions.dat found in the directory /rvm/src-generated/options defines the boolean options for the optimizing compiler. Similarly, SharedValueOptions.dat in the same directory defines the non-boolean options.


1. Transformations of Methods

Jikes RVM takes methods as the fundamental unit of optimization, and optimizes them by transforming the intermediate representation as below.

Byte Code → [Optimizing Compiler] → Machine Code + Mapping Information

Mapping information consists of garbage collection maps, source code maps, and exception tables. The optimizing compiler performs the following intermediate phase transitions on the intermediate representation.

High-level Intermediate Representation (HIR) → Low-level Intermediate Representation (LIR) → Machine Intermediate Representation (MIR).

The general structure of the master plan consists of the following elements:

  1. Converting byte code → HIR, HIR → LIR, LIR → MIR, and MIR → machine code.
  2. Performing optimization transformations on the HIR, LIR, and MIR.
  3. Performing optimization.
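The lowering steps above can be modeled as a toy pipeline. This is not JikesRVM code; the enum and the lower() helper are hypothetical names used only to show that each stage consumes the representation produced by the previous one.

```java
// Toy model of the optimizing compiler's lowering pipeline:
// byte code -> HIR -> LIR -> MIR -> machine code.
import java.util.List;

public class LoweringPipeline {
    enum IR { BYTECODE, HIR, LIR, MIR, MACHINE_CODE }

    // The fixed order of lowering steps in the master plan.
    static final List<IR> STAGES =
        List.of(IR.BYTECODE, IR.HIR, IR.LIR, IR.MIR, IR.MACHINE_CODE);

    // Returns the next representation in the pipeline.
    static IR lower(IR current) {
        int i = STAGES.indexOf(current);
        if (i < 0 || i == STAGES.size() - 1)
            throw new IllegalStateException("cannot lower " + current);
        return STAGES.get(i + 1);
    }

    public static void main(String[] args) {
        IR ir = IR.BYTECODE;
        while (ir != IR.MACHINE_CODE) {
            IR next = lower(ir);
            System.out.println(ir + " -> " + next);
            ir = next;
        }
    }
}
```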

2. CompilationPlan

CompilationPlan is the major class that instructs the optimizing compiler on how to optimize a given method. Its constructor constructs a compilation plan, as the name indicates. The plan includes an instance of NormalMethod, the method to be optimized, and an array of TypeReference to be used in place of the types defined in the method.

It also contains an object of InstrumentationPlan, defining how to instrument the method to gather run-time measurement information. The methods initInstrumentation() and finalizeInstrumentation() are called at the beginning of the compilation, and after the compilation just before the method is executed, respectively.

A compilation plan is executed by calling its execute() method, which runs each element in the optimization plan. The compiler queries the InlineOracle interface to decide whether to inline a call site. If there is instrumentation to be performed, initialization takes place first, and finalization cleans up after the compilation. However, finalization is not executed if the compilation fails with an exception.
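The execution order described above can be sketched as follows. This is a hedged simplification, not the actual JikesRVM API: PlanElement, execute(), and the trace field are hypothetical stand-ins that only illustrate how finalization is skipped when an element throws.

```java
// Sketch of compilation-plan execution: instrumentation is initialized
// before the plan elements run and finalized afterwards, but a thrown
// exception skips finalization. Names are simplified stand-ins.
import java.util.List;

public class PlanExecution {
    interface PlanElement { void perform(); }

    static final StringBuilder trace = new StringBuilder();

    static void execute(List<PlanElement> elements, boolean instrument) {
        if (instrument) trace.append("init;");       // initInstrumentation()
        for (PlanElement e : elements) e.perform();  // may throw
        if (instrument) trace.append("finalize;");   // finalizeInstrumentation()
    }

    public static void main(String[] args) {
        // Normal run: init, the element's work, then finalize.
        execute(List.of(() -> trace.append("opt;")), true);
        System.out.println(trace);

        // Failing run: finalize is never reached.
        try {
            execute(List.of(() -> { throw new RuntimeException("phase failed"); }), true);
        } catch (RuntimeException e) {
            System.out.println(trace); // ends with init; and no matching finalize;
        }
    }
}
```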
 

3. OptimizationPlanElement
Class diagram of OptimizationPlanElement

An array of OptimizationPlanElement defines the compilation steps; OptimizationPlanElement thus represents an element in the optimization plan. Instances of the subclasses of the abstract class OptimizationPlanElement are held in OptimizationPlanner.masterPlan, and hence they represent global state. It is incorrect to store any per-compilation state in the instance fields of these objects. OptimizationPlanner specifies the order of execution during the HIR and LIR phases for a method. The method reportStats() generates a report on the time spent performing the element.

The method shouldPerform() determines whether the element should be performed as part of the compilation plan, by consulting the passed OptOptions object. When an element is included, all the aggregate elements of which it is a component are included too. The work represented by the element is done by perform(), and is assumed to modify the Intermediate Representation (IR) in some way. The perform() of an aggregate element invokes the perform() of all its component elements.

A client is a compiler driver that constructs a specific optimization plan by including all the OptimizationPlanElements from the master plan that are appropriate for the compilation instance.


4. OptimizationPlanAtomicElement

The final class OptimizationPlanAtomicElement extends OptimizationPlanElement. This object represents a single compiler phase in the compiler plan. The main work of the class is done by its phase, an instance of the class CompilerPhase. Each phase extends this abstract class, overriding in particular the abstract methods perform(), which does the actual work, and getName(), which returns the name of the phase. A new instance of the phase is created when shouldPerform() is called, and is discarded as soon as the method finishes. Hence, per-compilation state is contained in the instances of the subclasses of CompilerPhase, as no state is stored in the element.


5. OptimizationPlanCompositeElement

Similarly, OptimizationPlanCompositeElement is the base class of all the elements in the compiler plan that aggregate other OptimizationPlanElements, as depicted in the figure. An array of OptimizationPlanElement stores the elements that compose the composite element. The constructor composes the elements passed as a method parameter into the composite element. If a phase wants the IR to be dumped before or after it runs, this can be enabled through printingEnabled(). By default, this is disabled; subclasses are free to override the method and return true.
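The atomic/composite structure described in sections 3-5 follows the classic composite pattern, which can be sketched as below. The class and method names mirror the JikesRVM concepts, but the bodies are hypothetical simplifications, not the real implementation.

```java
// Minimal sketch of the atomic/composite plan-element pattern: a composite
// element's perform() invokes the perform() of each of its components.
import java.util.List;

public class PlanElements {
    interface Element {
        boolean shouldPerform();        // consult the options
        void perform(StringBuilder ir); // assumed to modify the IR in some way
    }

    // Stands in for OptimizationPlanAtomicElement: wraps a single phase.
    static class Atomic implements Element {
        final String phaseName;
        Atomic(String phaseName) { this.phaseName = phaseName; }
        public boolean shouldPerform() { return true; }
        public void perform(StringBuilder ir) { ir.append(phaseName).append(';'); }
    }

    // Stands in for OptimizationPlanCompositeElement: aggregates elements.
    static class Composite implements Element {
        final List<Element> children;
        Composite(List<Element> children) { this.children = children; }
        // Included if any component element is included.
        public boolean shouldPerform() {
            return children.stream().anyMatch(Element::shouldPerform);
        }
        // Delegates to each child's perform() in order.
        public void perform(StringBuilder ir) {
            for (Element e : children)
                if (e.shouldPerform()) e.perform(ir);
        }
    }

    public static void main(String[] args) {
        StringBuilder ir = new StringBuilder();
        Element plan = new Composite(List.of(new Atomic("HIR"), new Atomic("LIR")));
        plan.perform(ir); // phases applied in order
        System.out.println(ir);
    }
}
```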

This post continues at JikesRVM Optimizing (JIT) Compiler - II.

Wednesday, June 12, 2013

The exam days and the return of the summer..

Bird watching at Jardim da Estrela
[12/06/2013] - Lisbon - Weeks 40, 41, & 42: Exam times. One more to go. Among the three exams that are completed, we have already received the grades for two of them. We have also received all the project grades. It is interesting to note that we got the highest grades for our Virtualization project. The project included research, a study of internal mechanisms (we studied MMTk and the optimizing compiler, and researched garbage collection), and development.

These three weeks were probably the busiest days we have had so far in Lisbon, marked by project discussions, exam preparations, and exams. The summer is here, but it doesn't prevent the cold nights. Most of the nights still remain cold, though it is gradually getting warmer. I am really happy to see the daytime getting longer. I can even see sunlight as late as 9 p.m.

The vacation days are approaching. We are awaiting some good time in Lisbon after the exams, before returning to our home countries for the summer vacation. Among all our batch mates, it seems I am the only one who is returning to IST/Lisbon for the final semester, for the thesis. Others are planning to do their theses mostly at companies or research labs in Sweden or elsewhere. I will miss them in the final semester. However, that is after the third semester, for which we all will be at KTH/Sweden.

My thesis will be at INESC-ID, supervised by Prof. Luis Antunes Veiga. I hope my second stay in Lisbon will be as interesting as the first, though I will miss most of the good friends I made during the first year, including the EMDC friends. Let me see how it goes.

端午節快乐 (Duānwǔ jié kuàilè) Happy Dragon Boat Festival.. ^_^