Performance

We have run performance tests to evaluate the impact of introducing reflective capabilities into a Java interpreter. Like the few other papers in the reflection literature that provide performance data, we have chosen to measure the overhead of reflection on each individual operation instead of running standard benchmarks. In fact, there are no standard benchmarks for evaluating the impact of reflection. Existing general-purpose benchmarks usually focus either on the optimization of complex control-flow patterns, which would not be affected by the introduction of interception for object operations, or on calculations on large arrays, which would incur a huge overhead.

**Table 1:** Description of the platforms.
Tag	Description
`i586`	100 MHz Pentium running RedHat Linux 5.1
`i686`	233 MHz Pentium Pro running RedHat Linux 5.0
`spu1`	167 MHz SPARC Ultra 1 running Solaris 2.6
`spu2`	200 MHz SPARC Ultra Enterprise 2 running Solaris 2.5

Our tests have been performed on four different platforms, listed in Table 1. On the Solaris platforms, the tests were run in real-time scheduling mode, so as to ensure that no other processes would affect the measured times. On the GNU/Linux platforms, this scheduling mechanism was not available, so we just ensured that the tested hosts were as lightly loaded as possible.

On each host, we have run the same Java program, compiled with the Java compiler from Sun's JDK without optimization, to prevent method inlining. The resulting bytecode was executed by different interpreters under different configurations.

We have used Guaraná 1.4.1 and the snapshot of Kaffe 1.0.b1 distributed with it, using the JIT compiler and the interpreter engines. Kaffe and Guaraná were compiled with EGCS 1.1b, with default optimization levels. The program used to perform the tests was the one distributed with Guaraná 1.4.1.

**Table 2:** Description of the tests.
Operation	Description
`emptyloop`	No reflective operation.
`synchronized`	Empty block `synchronized` on an arbitrary object.
`invokestatic`	Invoke an empty `static` method that takes no arguments and returns void.
`invokespecial`	Invoke a non-`static` `private` do-nothing method that returns void and takes only the implicit `this` as argument. The same bytecode is used to invoke constructors and, in some cases, `final` methods.
`invokevirtual`	Invoke an empty method that takes only the implicit `this` as argument, and returns void. Dynamic binding, performed with a dispatch table, occurs before the interception test.
`invokeinterface`	Invoke the same method, but through an object reference of `interface` type. Dynamic binding is much slower in this case.
`getstatic`	Load a `static int` field into a variable.
`putstatic`	Store a zero-valued variable in a `static int` field.
`getfield`	Load a non-`static int` field into a variable.
`putfield`	Store a zero-valued variable in a non-`static int` field.
`arraylength`	Load the length of an array of `int` into a variable.
`iaload`	Load the first element of an array of `int` into a variable.
`iastore`	Store a zero-initialized variable in the first element of an array of `int`.
`println`	Print the line "Hello world!" to `System.err`, which was redirected to `/dev/null` before starting the Virtual Machine. It is a first attempt to estimate the overall impact of introducing interception abilities.
`compile`	Compile the test program itself. Section 5.1 contains a detailed description and analysis.
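
To make the rows of Table 2 concrete, the sketch below shows Java statements that a compiler would typically translate into each tested bytecode. All class, method, and field names are hypothetical and are not taken from the test program distributed with Guaraná; the comments name the instruction each statement is expected to produce, which may differ with other compilers.

```java
// Hypothetical illustration of the operations in Table 2; not the original test program.
final class Operations {
    interface Target { void nothing(); }

    static final class Impl implements Target {
        int field;
        public void nothing() {}
        private void privateNothing() {}
        // Compilers of the paper's era emit invokespecial for this private call;
        // newer javac releases (JDK 11 and later) may emit invokevirtual instead.
        void callPrivate() { privateNothing(); }
    }

    static int staticField;
    static void staticNothing() {}

    static int exercise() {
        Impl obj = new Impl();
        Target viaInterface = obj;
        int[] array = new int[1];
        int zero = 0;
        int x;

        synchronized (obj) {}                // monitorenter / monitorexit
        staticNothing();                     // invokestatic
        obj.callPrivate();                   // wraps the private call (see above)
        obj.nothing();                       // invokevirtual
        viaInterface.nothing();              // invokeinterface
        x = staticField;                     // getstatic
        staticField = zero;                  // putstatic
        x = obj.field;                       // getfield
        obj.field = zero;                    // putfield
        x = array.length;                    // arraylength
        x = array[0];                        // iaload
        array[0] = zero;                     // iastore
        System.err.println("Hello world!");  // the println test
        return x;
    }
}
```

In a real measurement each statement would sit alone in its own timed loop, so that only one kind of operation contributes to each result.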

For each configuration, we have timed several different operations, described in Table 2. Each operation was timed by running it repeatedly inside a loop; before the timer was started, the operation was run once outside the loop. This ensures that, before the loop starts, any JIT compilation has already taken place, all the data and code have been brought into the cache and, unless the test involves object allocation, the garbage collector will not run.
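
The warm-up discipline described above can be sketched as follows. This is a minimal illustration, not the test program distributed with Guaraná: the class name `WarmedTimer`, the method `timeMillis`, and the use of `Runnable` are assumptions made for the example. The `Runnable` indirection adds an interface call to every iteration, so a real harness would more likely place the operation directly in the loop body.

```java
// Hypothetical sketch: run the operation once untimed, then time the loop.
final class WarmedTimer {
    /**
     * Runs op once before starting the timer (triggering any JIT compilation
     * and warming the caches), then times `iterations` further executions.
     * Returns the elapsed wall-clock time in milliseconds.
     */
    static long timeMillis(Runnable op, long iterations) {
        op.run();                                   // untimed warm-up run
        long start = System.currentTimeMillis();    // timer starts after warm-up
        for (long i = 0; i < iterations; i++) {
            op.run();
        }
        return System.currentTimeMillis() - start;
    }

    static volatile int sink;                       // keeps the store from being optimized away

    public static void main(String[] args) {
        long elapsed = timeMillis(() -> sink = 0, 1_000_000L);
        System.out.println("elapsed: " + elapsed + " ms");
    }
}
```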

This inner loop is run repeatedly, with the iteration count being adjusted at every outer iteration, aiming at a running time longer than 1 second. Since the operations that read the clock at the beginning and at the end of each inner loop take less than 1 microsecond to run, and the clock resolution is 1 millisecond, a total running time of 1 second is enough to eliminate any effects they might have on the outcome of the tests.

The inner-loop iteration count starts at 1, and is repeatedly multiplied by 10 until the running time of the inner loop is large enough to be measurable with the clock resolution. As soon as this happens, the elapsed time and the iteration count start to be used to estimate the running time of an iteration. If the total elapsed time of an execution of the inner loop is longer than one second, the estimate is the final result of the test. Otherwise, it is used to compute the iteration count for the next execution of the inner loop, aiming at a total execution time of 1100 milliseconds.
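
The calibration procedure just described can be sketched as below. The names `IterationCalibrator` and `millisPerIteration` are hypothetical, and treating one clock tick (1 ms) as "measurable" is an assumption, since the text does not state the exact threshold. The timed inner loop is passed in as a function from iteration count to elapsed milliseconds, which keeps the calibration logic separate from the operation under test.

```java
import java.util.function.LongUnaryOperator;

// Hypothetical sketch of the iteration-count calibration; not the original code.
final class IterationCalibrator {
    static final long MEASURABLE_MS = 1;    // assumption: one clock tick is "measurable"
    static final long ACCEPT_MS = 1000;     // runs longer than this yield the final result
    static final long TARGET_MS = 1100;     // aimed-at duration of the next run

    /** Returns the estimated running time of one iteration, in milliseconds. */
    static double millisPerIteration(LongUnaryOperator timedInnerLoop) {
        long count = 1;
        long elapsed = timedInnerLoop.applyAsLong(count);

        // Phase 1: multiply the count by 10 until the elapsed time is measurable.
        while (elapsed < MEASURABLE_MS) {
            count = Math.multiplyExact(count, 10L);
            elapsed = timedInnerLoop.applyAsLong(count);
        }

        // Phase 2: rescale the count towards TARGET_MS until a run exceeds ACCEPT_MS.
        while (elapsed <= ACCEPT_MS) {
            // ceil(TARGET_MS * count / elapsed), using the current per-iteration estimate
            count = (Math.multiplyExact(TARGET_MS, count) + elapsed - 1) / elapsed;
            elapsed = timedInnerLoop.applyAsLong(count);
        }
        return (double) elapsed / count;
    }

    static volatile int sink;

    public static void main(String[] args) {
        double ms = millisPerIteration(n -> {
            long start = System.currentTimeMillis();
            for (long i = 0; i < n; i++) sink = 0;  // operation under test
            return System.currentTimeMillis() - start;
        });
        System.out.println(ms * 1e6 + " ns per iteration");
    }
}
```

Because every rescaling step aims at 1100 ms while the acceptance threshold is 1000 ms, the iteration count strictly increases between runs, so the second phase terminates as long as elapsed time grows with the count.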

With the exception of the tests `println` and `compile`, this mechanism selected an iteration count between 50,000 and 100,000,000 for the final execution of the inner loop of each test. In the case of `println`, the iteration count was never smaller than 500. The `compile` test was run stand-alone, not within this framework.

Each test case was run 50 times on each configuration and platform, and the average times of the runs were used to compute the relative overheads presented in Table 3 and Table 4. Although we have introduced the ability to intercept operations, no actual interception took place during those tests.
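
As an illustration of how a relative overhead can be derived from averaged run times, the sketch below computes the percentage difference between the mean time of a modified configuration and the mean time of a baseline. The formula, the class name `Overhead`, and the sample values are assumptions made for the example; the text states only that average times were used.

```java
import java.util.Arrays;

// Hypothetical sketch: relative overhead from the mean times of repeated runs.
final class Overhead {
    static double mean(double[] times) {
        return Arrays.stream(times).average()
                .orElseThrow(() -> new IllegalArgumentException("no measurements"));
    }

    /** Overhead of `modified` over `baseline`, in percent; negative means faster. */
    static double relativePercent(double[] baseline, double[] modified) {
        return (mean(modified) / mean(baseline) - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        double[] baseline = {100.0, 102.0, 98.0};   // made-up per-iteration times
        double[] modified = {131.0, 129.0, 130.0};  // made-up per-iteration times
        System.out.printf("%+.0f%%%n", relativePercent(baseline, modified));
    }
}
```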

**Table 3:** Overhead on interpreter.
Operation	`i586`	`i686`	`spu1`	`spu2`
emptyloop	$-41\%$	$-15\%$	$-0\%$	$-0\%$
synchronized	$-0\%$	$+1\%$	$+0\%$	$+4\%$
invokestatic	$+13\%$	$+0\%$	$+4\%$	$-8\%$
invokespecial	$+30\%$	$+8\%$	$+38\%$	$-10\%$
invokevirtual	$+17\%$	$-0\%$	$+7\%$	$-9\%$
invokeinterface	$-3\%$	$-7\%$	$+20\%$	$-10\%$
getstatic	$-3\%$	$-2\%$	$+20\%$	$-0\%$
putstatic	$-23\%$	$-3\%$	$+24\%$	$+4\%$
getfield	$-22\%$	$-2\%$	$+19\%$	$-0\%$
putfield	$-26\%$	$-2\%$	$+25\%$	$+6\%$
arraylength	$-18\%$	$-9\%$	$+2\%$	$+12\%$
iaload	$-64\%$	$-6\%$	$+1\%$	$-0\%$
iastore	$-14\%$	$-3\%$	$+1\%$	$+1\%$
println	$+6\%$	$+4\%$	$+3\%$	$-2\%$
compile	$+5\%$	$+2\%$	$-2\%$	$-3\%$

**Table 4:** Overhead on JIT compiler.
Operation	`i586`	`i686`	`spu1`	`spu2`
emptyloop	$+0\%$	$+1\%$	$+0\%$	$+0\%$
synchronized	$+12\%$	$+10\%$	$+27\%$	$+3\%$
invokestatic	$+91\%$	$+20\%$	$+23\%$	$+34\%$
invokespecial	$+119\%$	$+8\%$	$+19\%$	$+28\%$
invokevirtual	$+30\%$	$+158\%$	$-6\%$	$+0\%$
invokeinterface	$+7\%$	$+2\%$	$+3\%$	$+2\%$
getstatic	$+68\%$	$+148\%$	$+163\%$	$+163\%$
putstatic	$+180\%$	$+97\%$	$+90\%$	$+90\%$
getfield	$+293\%$	$+86\%$	$+149\%$	$+149\%$
putfield	$+103\%$	$+96\%$	$+66\%$	$+66\%$
arraylength	$+258\%$	$+86\%$	$+140\%$	$+150\%$
iaload	$+191\%$	$+98\%$	$+55\%$	$+95\%$
iastore	$+236\%$	$+55\%$	$+41\%$	$+45\%$
println	$+45\%$	$+6\%$	$+5\%$	$+12\%$
compile	$+36\%$	$+42\%$	$+32\%$	$+29\%$
compile-JIT	$+105\%$	$+112\%$	$+81\%$	$+54\%$
compile-diff	$+16\%$	$+17\%$	$+20\%$	$+20\%$