Just-in-Time Compilation-Inspired Methodology for Parallelization of Compute Intensive Java Code

Compute intensive programs generally consume a significant fraction of their execution time in a small amount of repetitive code. Such repetitive code is commonly known as hotspot code. We observed that compute intensive hotspots often possess exploitable loop level parallelism. A JIT (Just-in-Time) compiler profiles a running program to identify its hotspots. Hotspots are then translated into native code for efficient execution. Using a similar approach, we propose a methodology to identify hotspots and exploit their parallelization potential on multicore systems. The proposed methodology selects and parallelizes each DOALL loop that is either contained in a hotspot method or calls a hotspot method. The methodology could be integrated in the front-end of a JIT compiler to parallelize sequential code just before native translation. However, compilation to native code is out of scope of this work. As a case study, we analyze eighteen JGF (Java Grande Forum) benchmarks to determine the parallelization potential of hotspots. Eight benchmarks demonstrate a speedup of up to 7.6x on an 8-core system.

application, a convenient approach is to design a parallel algorithm explicitly [1][2][3]. However, algorithmic restructuring of existing sequential applications is a non-trivial manual effort. Automated parallelization techniques often rely on parallelizing compilers and runtime information. For example, the auto-parallelizing compiler Parafrase-2 [4] detects and exploits implicit parallelism using a symbolic analysis framework [5]. Auto-parallelizing compilers typically use heuristics [6] and profiler feedback to analyze and parallelize code by the resolution of dependences by squashing and rerunning some of the parallel tasks. This is a best effort approach that exploits parallelism if possible; otherwise the code is run sequentially. In non-speculative parallelization paradigms, dependences are analyzed first and code is usually transformed to expose hidden parallelism. Parallel tasks are synchronized properly to preserve sequential semantics and to avoid deadlocks, livelocks and data races.
However, both cases have their own challenges.
JIT systems are typically used to facilitate dynamic compilation of binary code during execution [19][20][21]. In case of Java, inefficiency of interpreted Java code stimulated the renaissance of JIT technologies [19]. Java
Typically, the majority of computer applications spend a large amount of their runtime in the hotspots [22][23]. We observed that compute intensive hotspots have huge parallelization potential [22]. This work focuses on a single goal: achieve whatever parallelism can be realized from sequential code without any effort on the part of exploring hidden parallelism. Being a best effort approach, it may improve scalability where it can exploit parallelization potential, but in other cases it may not modify even a single loop. Using profiler feedback, compute intensive DOALL loops are selected from Java bytecode just as a JIT compiler selects frequently executing code for native translation. We have two reasons for considering loop level parallelization in this context. First, we observed that by setting a threshold on the application's execution time, we are left with only a few most time consuming methods [22]. For example, setting a 90% threshold in the JGF Crypt benchmark revealed that a single method consumed 90% of the application's time [24]. Such cases are not suitable for method level parallelization even on a dual core system. Similarly, JIT compilation infrastructure selects only a few methods as hotspots. Method level parallelization determines potential parallelism by doing inter-procedural analysis of the complete application. During inter-procedural analysis, if some non-hotspot method is found to be a caller of hotspot(s), modifications will also be needed in the non-hotspot method. Eventually, we will be dealing with the entire application and taking almost no advantage of JIT compilation infrastructure. In contrast, modifications applied at loop level remain local to the hotspot only. A JIT compiler could then produce parallel native code transparently.
The paper is organized in the following sections: Section 2 presents related work. The problem statement is formulated in Section 3 along with qualitative and quantitative features. The overall methodology is proposed in Section 4. Parallelization steps and implementation details are given in Section 5. Case studies and results are discussed in Section 6. The paper is concluded in Section 7.

RELATED WORK
Bytecode level parallelization has been tried since the inception of the Java language [18]. However, due to the lack of instrumentation and on-the-fly class modification APIs, the effort relied on static modifications of a single class at a time without considering profiler feedback. Nowadays, JIT parallelization is being revisited, thanks to the proliferation of multicore/manycore systems and advancements in virtualization technologies [25][26][27][28][30].
Österlund and Löwe exploit the JVM's garbage collector to support JIT parallelization [26][27][28]. A merger of DBP (Dynamic Binary Parallelization) and TLS is presented to emphasize the limitations of DBP and the difficulties involved in JIT parallelization [29]. Leung et al. proposed auto-parallelizing extensions for a Java JIT compiler so that the compiler could find potentially parallelizable code and compile it for parallel execution on multicore CPU and GPGPU (General Purpose Graphic Processing Unit) [30]. However, code generation depends on RapidMind and GPU hardware [31]. The majority of other efforts on runtime parallelization focus on speculative execution and/or exploit method level parallelism [32][33][34][35][36][37][38].

Percentage Contribution Threshold
T PC (Percentage contribution threshold) is the part of application run time (< 100%) that we want to be parallelized [22]. For example, setting T PC = 80% for an application means that we are interested in parallelizing only the most time consuming methods (i.e. hotspots) that collectively consume 80% of the application's time. Fig. 1 shows the effect of setting T PC = 90% for eighteen JGF application benchmarks [24], where N h is the number of hotspots. Fig. 1 shows that the majority of methods are filtered out because they collectively consume <10% of the application's time. Analyzing and modifying these methods is likely to increase runtime overhead and may result in performance degradation compared to sequential code.
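The threshold-based selection described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class `HotspotSelector`, the `ProfileEntry` type and the PC values in `main` are hypothetical, and the sketch assumes the flat profile is available as a list of (method, PC) pairs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: picks the top methods whose cumulative PC reaches T PC.
final class HotspotSelector {
    /** One row of a flat profile: method name and its percentage contribution (PC). */
    static final class ProfileEntry {
        final String method;
        final double pc;
        ProfileEntry(String method, double pc) {
            this.method = method;
            this.pc = pc;
        }
    }

    /** Sorts by PC in descending order and keeps methods until the threshold is reached. */
    static List<ProfileEntry> select(List<ProfileEntry> flatProfile, double tPc) {
        List<ProfileEntry> sorted = new ArrayList<>(flatProfile);
        sorted.sort(Comparator.comparingDouble((ProfileEntry e) -> e.pc).reversed());
        List<ProfileEntry> hotspots = new ArrayList<>();
        double cumulative = 0.0;
        for (ProfileEntry e : sorted) {
            if (cumulative >= tPc) {
                break; // the selected methods already cover T PC of the run time
            }
            hotspots.add(e);
            cumulative += e.pc;
        }
        return hotspots;
    }

    public static void main(String[] args) {
        // Made-up profile values, for illustration only.
        List<ProfileEntry> profile = Arrays.asList(
                new ProfileEntry("a", 50.0), new ProfileEntry("b", 30.0),
                new ProfileEntry("c", 15.0), new ProfileEntry("d", 5.0));
        for (ProfileEntry e : select(profile, 90.0)) {
            System.out.println(e.method + " " + e.pc);
        }
    }
}
```

With the made-up profile in `main`, three methods are needed to cover 90% of the run time, so N h would be 3 for that input.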

FIG. 1. SELECTION OF HOTSPOTS USING T PC = 90%
T PC facilitates the selection of hotspot methods. Next, we need to determine various characteristics of hotspot methods. We enumerate these characteristics in catalogs of qualitative and quantitative features of methods, as shown in Tables 1-2, respectively.

Qualitative Features of Methods
Qualitative features are binary variables that represent different characteristics of a method. Each qualitative feature indicates the presence (or absence) of a specific characteristic of a method, as described in Table 1. For example, LOOPY=0 means that the method does not contain loops. The idea of qualitative features is inspired by Nano-patterns, which were proposed to characterize and classify Java methods [39]. The catalog of qualitative features is constructed by extending the catalog of Nano-patterns from 17 to 32 entries and giving them compact and descriptive names.
Previously, we used qualitative features to analyze thread level speculative parallelization potential at runtime [22]. We showed that binary features are very important decisive factors for runtime qualitative analysis of the parallelization potential of methods. Qualitative features are generic in nature and could be used in any software reverse engineering and reengineering activity. We used some relevant features in this work.

Quantitative Features of Methods
Presence of a particular characteristic of a method potentially necessitates the quantification of that characteristic. For example, if a method contains loops (i.e. LOOPY=1), we need to determine the number of single and nested loops. For this, we will observe the quantitative features f 37 and f 38 in Table 2.

PROPOSED METHODOLOGY
Proposed methodology transforms Java classes at load time and works in three phases. Overall work flow is shown in Fig. 2. In profiling phase, an application is test-run to get profiling data. Profiler output is fed back to JVM during actual run. Using a value of T PC (i.e. 90% in this paper), top N h hotspot methods are selected from the flat profile F, which is sorted by PC in descending order. JVM class loader is hooked so that classes could be parsed and transformed at load time [22]. Each class is parsed and modified just before it is loaded by the JVM. In parsing phase, the list of methods L m of a class i is acquired to determine if it contains a hotspot. If a method m ij is a hotspot, it is parsed to generate (1) list of qualitative features, (2) list of quantitative features, (3) list of backward jumps L SL, (4) IR tuples, and (5) instruction patterns. A list of nested loops L NL is then generated using the loops of L SL. In modification phase, a heuristic on call count (CC, i.e. feature f 46) of m ij is used to determine the potential location of parallelizable loop(s). If CC<N m and m ij is LOOPY then the potentially parallelizable loop(s) lie within m ij; otherwise they lie within some caller of m ij. This heuristic implies that if CC is significantly large, the time consumption of m ij is not due to the loops in it but (potentially) because it has been called within a loop of its parent method. In the latter case, the parent of m ij becomes a hotspot provided that it is LOOPY. In any case, we get a loop l ijk. If l ijk is DOALL, it is marked to be modified for parallel execution using the operations mentioned in modification phase of Fig. 2 and the threading framework of section 4.4.

Parallelization Criteria
There are two criteria for best effort parallelization of a loop.
Criterion-1: Hotspot Selection: Set T PC = 90% and select most time consuming methods that collectively consume 90% time of application, as hotspots.

Criterion-2: Loop Selection:
(i) If a hotspot has a significantly high CC value (e.g. > N m ), then go to its calling method(s). In (any of) the calling methods, if the hotspot is called in a loop and the loop is DOALL, transform it for parallel execution.
(ii) Otherwise, if the hotspot itself contains DOALL loop(s), transform it (them) for parallel execution.
In case of invalidation of (i) and (ii), run the unmodified sequential application.
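The loop selection criterion can be read as a small decision procedure. The sketch below is one possible reading, assuming that case (ii) is also tried when the caller's loop turns out not to be DOALL; the class `LoopSelector`, the enum constants and the boolean inputs are hypothetical names introduced only for illustration.

```java
// Hypothetical decision helper for Criterion-2 (loop selection).
final class LoopSelector {
    enum Decision { PARALLELIZE_CALLER_LOOP, PARALLELIZE_OWN_LOOP, RUN_SEQUENTIAL }

    /**
     * @param callCount               CC of the hotspot method
     * @param nm                      N m, number of methods called during execution
     * @param calledInDoallCallerLoop true if some caller invokes the hotspot inside a DOALL loop
     * @param containsDoallLoop       true if the hotspot itself contains a DOALL loop
     */
    static Decision decide(long callCount, long nm,
                           boolean calledInDoallCallerLoop,
                           boolean containsDoallLoop) {
        // Case (i): high call count, so look for a DOALL loop in a calling method.
        if (callCount > nm && calledInDoallCallerLoop) {
            return Decision.PARALLELIZE_CALLER_LOOP;
        }
        // Case (ii): otherwise fall back to DOALL loops inside the hotspot itself.
        if (containsDoallLoop) {
            return Decision.PARALLELIZE_OWN_LOOP;
        }
        // Neither case holds: keep the unmodified sequential code.
        return Decision.RUN_SEQUENTIAL;
    }

    public static void main(String[] args) {
        System.out.println(decide(1000, 28, true, false));
        System.out.println(decide(2, 28, false, true));
        System.out.println(decide(2, 28, false, false));
    }
}
```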

Loop Profiling
Loop profiling is used to determine the features like
Nested loops are determined by observing the organization of simple loops. If a loop lies exactly within another loop then we come up with a loop nest. For two simple loops l i and l j , if Offset i > Offset j AND Target i < Target j then l j lies within l i . So, there exists a 2-level nested loop instead of two single loops. In real world code, inner loops in a loop nest may appear in a variety of ways, as shown in Fig. 3. A loop nest could be represented as a loop tree to accommodate all possible organizations of inner loops.
The root of the tree represents the outermost loop and other nodes represent inner loops of the root. The data associated with each node is the loop quadruple, a reference to its parent node and a list of references to its children nodes.
Traversing nodes of a loop tree, we can represent nested loops as a 5-tuple <Offset, Target, Nest-Level, {Index-Vector}, {Stride-Vector}> where Offset is the offset of the outermost loop, Target is the offset of the target label of the outermost loop, Nest-Level is the height of the loop tree, Index-Vector is a list of indices of all loops in the loop nest and Stride-Vector is a list of step sizes of all loops in the loop nest.
Generally, a loop forest containing single and/or multi-node tree(s) is constructed against each hotspot.

FIG. 3. LOOP TREES IN (a) TRANSFORM_INTERNAL() METHOD OF JGF FFT BENCHMARK (b) RUNITERS() METHOD OF JGF MOLDYN BENCHMARK. A LOOP FOREST IS IN (c) MATGEN() METHOD OF LUFACT BENCHMARK

Algorithm for Identification of Single Loops
Single loop detection algorithm is shown in Table 3.

Algorithm for Loop Forest Construction
Once we get a list of single loops L loop using the algorithm shown in Table 3, we can determine nested loops by using the algorithm shown in Table 4. Considering each single loop l s ∈ L loop as a node, a loop tree T l is constructed against each nested loop and added to a loop forest F l . Depending upon the availability of loops, F l could possibly be (1) empty, (2) containing single-node tree(s) only, (3) containing multi-node tree(s), or (4) containing a mixture of single-node and multi-node trees. At start the loop forest F l is empty and a tree T l is constructed using the first loop of L loop as root node. Subsequent loops from L loop are either added to an existing tree or cause the generation of new tree(s). An existing tree is re-adjusted if an outer loop comes after some inner loop(s), so that the outermost loop is always the root node.
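The forest construction described above can be sketched in plain Java. This is not the algorithm of Table 4 itself, only a minimal version consistent with the prose: the class `LoopForest`, its `Node` type and the offsets used in `main` are hypothetical, and only the Offset and Target fields of the loop quadruple are modeled. Containment uses the condition from the loop profiling discussion (Offset i > Offset j AND Target i < Target j).

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical loop forest built from single loops (backward jumps).
final class LoopForest {
    static final class Node {
        final int offset;   // offset of the backward jump
        final int target;   // offset of the jump's target label
        Node parent;
        final List<Node> children = new ArrayList<>();

        Node(int offset, int target) {
            this.offset = offset;
            this.target = target;
        }

        /** True if the other loop lies strictly within this loop. */
        boolean encloses(Node other) {
            return offset > other.offset && target < other.target;
        }

        /** Height of the subtree, counted in nodes; corresponds to Nest-Level at a root. */
        int height() {
            int h = 1;
            for (Node c : children) {
                h = Math.max(h, 1 + c.height());
            }
            return h;
        }
    }

    final List<Node> roots = new ArrayList<>();

    void add(Node loop) {
        insert(loop, null, roots);
    }

    private static void insert(Node loop, Node parent, List<Node> siblings) {
        // If an existing loop encloses the new one, descend into that tree.
        for (Node s : siblings) {
            if (s.encloses(loop)) {
                insert(loop, s, s.children);
                return;
            }
        }
        // Re-adjust: loops enclosed by the new loop become its children.
        Iterator<Node> it = siblings.iterator();
        while (it.hasNext()) {
            Node s = it.next();
            if (loop.encloses(s)) {
                it.remove();
                s.parent = loop;
                loop.children.add(s);
            }
        }
        loop.parent = parent;
        siblings.add(loop);
    }

    public static void main(String[] args) {
        LoopForest forest = new LoopForest();
        forest.add(new Node(40, 20)); // inner loop arrives first
        forest.add(new Node(60, 10)); // outer loop arrives later and becomes the root
        forest.add(new Node(90, 70)); // unrelated single loop
        System.out.println(forest.roots.size() + " trees, first height "
                + forest.roots.get(0).height());
    }
}
```

The example in `main` exercises the re-adjustment case: the outer loop is added after its inner loop and still ends up as the root of a two-level tree.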

Loop Classification
Using feature f 37 and f

Recognition of Instructions Patterns
Compilers typically generate an instruction pattern against each source code statement. The Java source compiler generates a stream of bytecode instructions which is interpreted by the JVM. We recognize bytecode instruction patterns to distinguish memory accesses. The idea starts with the preparation of a catalog of ISA-specific fundamental instruction patterns. Each fundamental pattern consists of at least two instructions in a specific order and performs a smallest indivisible source level task, e.g. "variable initialization". Some instructions like INC or LV (Table 5) could independently perform an indivisible source level task, e.g. "j++;". We enumerate such instructions as independent instructions. A pattern is an arrangement of two or more independent instructions. Fig. 4 shows an inner loop from the SORrun(…) method of the JGF SOR benchmark [24], to elaborate instruction pattern recognition.
Source code and bytecode of the loop are shown in Fig. 4. Each leaf is either an ID of a fundamental pattern or pattern component, or an independent instruction, as shown in Fig.

TABLE 5. INTERMEDIATE REPRESENTATION OF BYTECODE INSTRUCTIONS
fundamental patterns and its children composite patterns (Table 6). The root of the tree represents the top level composite pattern, that is, the entire bytecode region shown in Fig. 4(b).

Inter-Iteration Data Dependence
DOALL loops could be identified by making sure that loop iterations either do not contain any instruction patterns corresponding to memory access or access independent memory locations. We need to identify instruction patterns that are used to read/write local variables, array elements and class members (i.e. fields) of both primitive and user-defined types. If a loop does not contain any instruction pattern corresponding to inter-iteration data dependences, it is a DOALL loop because its iterations are independent. Let us analyze the loop given in Fig. 4 to determine if it is DOALL or not. Source code and bytecode of the loop (Fig. 4) reveal that the only variables involved are local, because the compiled code does not contain any bytecode instruction related to class members (Fig. 4(b)). Table 7 shows the types and compiler-assigned indices of variables used by bytecode instructions. For example, loop index j is indexed at 17 and could be determined from the IINC instruction. In Table 6, we can see that only one write operation, represented by P 60 , is performed in each iteration. This pattern has sixth level composition and its first component C 00 contains information about the variable involved. The IR tuples of C 00 are <LBL, L19>, <LR, 25, 14>, <LV, 21, 17> at lines 6-8. They show that the variable is indexed at 14, which is "double[] Gi". Hence, we are concerned about the read/write patterns of array elements. The write operation on Gi depends on three read operations on Gi, one of which reads the element written in the same iteration and is harmless. The other two reads in an iteration j access the elements written in the immediately previous iteration j-1 and the next iteration j+1, which causes inter-iteration data dependences. Hence, this loop is not DOALL.
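The reasoning above can be illustrated with a deliberately simplified check. The sketch assumes every array access in the loop body has the form a[j + c] for the loop index j and a constant c, and it conservatively reports a dependence whenever a written array is also accessed at a different constant offset. It does not model scalar locals or fields, and it is not the pattern-based analysis of the paper; the class `DoallCheck`, the `Access` type and the variable indices 15 and 16 are hypothetical (only index 14 for Gi comes from the text).

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, conservative inter-iteration dependence check for a[j + c] accesses.
final class DoallCheck {
    static final class Access {
        final int arrayVar;   // compiler-assigned index of the array variable
        final int offset;     // constant c in a[j + c]
        final boolean write;  // true for a store, false for a load

        Access(int arrayVar, int offset, boolean write) {
            this.arrayVar = arrayVar;
            this.offset = offset;
            this.write = write;
        }
    }

    /** Returns false if any written array is accessed at a different offset in the loop body. */
    static boolean isDoall(List<Access> accesses) {
        for (Access w : accesses) {
            if (!w.write) {
                continue;
            }
            for (Access a : accesses) {
                if (a.arrayVar == w.arrayVar && a.offset != w.offset) {
                    return false; // another iteration touches the element written here
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Accesses modeled on the example loop: Gi (index 14) is written at j
        // and read at j, j-1 and j+1; indices 15 and 16 are made up.
        List<Access> sorLoop = Arrays.asList(
                new Access(14, 0, true),
                new Access(14, 0, false),
                new Access(14, -1, false),
                new Access(14, 1, false),
                new Access(15, 0, false),
                new Access(16, 0, false));
        System.out.println("DOALL: " + isDoall(sorLoop));
    }
}
```

For the modeled example loop the check reports a dependence, in line with the conclusion drawn from the reads at j-1 and j+1.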

Threading Framework
A threading mechanism is required by the JIT compiler to modify selected loops for parallel execution. We

FIG. 5. PARSE TREE OF EXAMPLE LOOP'S BYTECODE IN TERMS OF INSTRUCTION PATTERN IDS, PATTERN COMPONENT IDS AND INDEPENDENT INSTRUCTIONS
workload. We adapted the idea of the source code level JAVAR framework [40]. Our framework consists of only two classes, Worker ijk and Manager ijk , that are dynamically generated for each candidate loop l ijk . We used ASM [41] for generation of framework classes (in bytecode), as the dynamic part of the classes would not be available at compile time [41]. Worker ijk encapsulates the entire implementation of the parallel task whereas Manager ijk is responsible for creation and orchestration of workers. Manager ijk contains only one static method work(…), and each candidate loop l ijk is replaced with just a single call to Manager ijk .work(…). Fig. 6 shows the interaction of the threading framework with Class i , which contains loop l ijk in its method m ij . For a loop l ijk , a single Manager ijk manages the life cycle of n Worker ijk threads. Each Worker ijk calls the run ijk () method that is defined in Class i . Class i makes j×k calls for k DOALL loops in j methods of this class. Fig. 6 shows a cyclic dependency that could be removed by declaring run ijk () before generating Worker ijk and providing its definition after the generation of Manager ijk . Actual usage of the framework is elaborated in Section 5 using the code in Fig. 8.
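The division of labour between the two framework classes can be sketched at source level. The paper generates a Worker ijk and Manager ijk pair per loop in bytecode using ASM; the sketch below instead uses one generic pair in plain Java, with a hypothetical `LoopBody` interface standing in for the run ijk () method. The cyclic distribution of iterations and the assumption of a positive step are choices made for this illustration only.

```java
// Hypothetical source-level analogue of the Worker/Manager threading framework.
final class ThreadingSketch {
    /** Stand-in for run ijk (): executes iterations start, start+step, ... below end. */
    interface LoopBody {
        void run(int start, int end, int step);
    }

    /** Each worker thread executes the loop body on its own <start, end, step> tuple. */
    static final class Worker extends Thread {
        private final LoopBody body;
        private final int lo;
        private final int hi;
        private final int stride;

        Worker(LoopBody body, int lo, int hi, int stride) {
            this.body = body;
            this.lo = lo;
            this.hi = hi;
            this.stride = stride;
        }

        @Override
        public void run() {
            body.run(lo, hi, stride);
        }
    }

    /** Creates the workers, starts them and waits for all of them to finish. */
    static final class Manager {
        static void work(LoopBody body, int start, int end, int step, int nWorkers) {
            Worker[] workers = new Worker[nWorkers];
            for (int w = 0; w < nWorkers; w++) {
                // Cyclic distribution (assumes step > 0): worker w takes every n-th iteration.
                workers[w] = new Worker(body, start + w * step, end, step * nWorkers);
                workers[w].start();
            }
            for (Worker w : workers) {
                try {
                    w.join();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting for workers", e);
                }
            }
        }
    }

    public static void main(String[] args) {
        final double[] out = new double[1000];
        // The original loop "for (i = 0; i < 1000; i++) out[i] = Math.sqrt(i);"
        // is replaced with a single call to Manager.work(...).
        Manager.work((s, e, st) -> {
            for (int i = s; i < e; i += st) {
                out[i] = Math.sqrt(i);
            }
        }, 0, out.length, 1, 4);
        System.out.println(out[999]);
    }
}
```

Joining every worker before `work` returns keeps the call site equivalent to the sequential loop for DOALL bodies, since all results are visible to the caller once the call completes.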

Motivational Example
To demonstrate the step-by-step working of the proposed methodology, we identify and parallelize the most suitable loop of the JGF Series benchmark [24]. This benchmark manipulates various transcendental and trigonometric functions to calculate Fourier coefficients of the function f(x) = (x+1)^x. About 10,000 coefficients are computed with an interval of 0.

FIG. 6. CLASS DIAGRAM SHOWING THE ASSOCIATION OF THREADING FRAMEWORK CLASSES WITH THE CLASS CONTAINING HOTSPOT METHOD
relevant portion of application call graph shown in Fig. 7.

IMPLEMENTATION DETAILS
Implementation details include the steps taken to parallelize a candidate loop and a short note on proof of concept. All modifications are done on bytecode, as elaborated in section 4.

Parallelization Steps
Modification steps are explained here in terms of Java source code. Bytecode level implementation details are given in section 5.2.

Loop Extraction:
The loop is shown at line 7-10 of Fig. 8

Declaration of Run ijk () Method:
A method run ijk is declared in the class of the Do() method, as shown in Fig. 8(b), where a, b, c are the <start, end, step> tuple for a worker thread. We cannot define run ijk yet because <start, end, step> is calculated in the dynamically generated partitionLoop() method of the Worker ijk class. We just declare run ijk here so that a call to it in Worker ijk does not raise an error.

Generation of Worker ijk and Manager ijk Classes:
Next step is to generate and load the Worker ijk and Manager ijk classes. We observed that all classes have to be loaded by the same class loader as that of the application. Against the source code shown in Fig. 8(c-d), bytecode is generated using ASM [41].

Proof of Concept
As a proof of concept, we implemented a research prototype by extending SeekBin [22]

CASE STUDIES
Data is collected by profiling and parsing eighteen benchmark applications [24] to analyze their parallelization potential. Data is analyzed for code comprehension regarding exploitable parallelism.

Code Comprehension
The purpose of code comprehension is twofold. First, we want to explore the parallelization potential of the application at hand. To avoid additional runtime overhead, it is crucial to estimate the feasibility of applying the proposed methodology. Second, we need to decide the locality and extent of the transformations needed, as we want to transform the bare minimum amount of the most promising code. Table 10 represents an estimate of the parallelization potential of 18 benchmarks in terms of method level features.
Parallelization potential of an application depends on the number of methods called during execution (N m ), frequency of method calls, number of loops, number of instructions in loop bodies, and dependencies among loop iterations. However, not all methods and loops are potentially feasible for parallelization and we need to filter them out by setting a suitable T PC value, i.e. T PC = 90% in this case. As a result, we converge to only a few methods as potential hotspots.

Parallelization of JGF Benchmarks
Thirteen benchmark applications are explicitly transformed and eight benchmarks showed a reasonable speedup, as shown in Fig. 9. Instead of exposing hidden parallelism in other benchmarks, the proposed best effort approach prefers to restore sequential versions of applications that do not show speedup. To demonstrate the scalability of transformed applications, we passed "number of workers" as a command line argument, instead of getting it from the target system as mentioned in Fig. 8
and Method is not quite significant on an 8-core system, the point is that changes are not permanent. In case of unsatisfactory speedup, we can revert to sequential execution anytime because transformations are applied at runtime and the code on disk is intact.

Short Running JGF Benchmarks
Short running benchmarks that showed speedup are Crypt, LUFact, SparseMatMult and Cast, as shown in Fig. 9(b).In Crypt, out of 30 methods, only one method cipher_idea() consumes 90% time when called twice in the application, as shown in Table 10.In cipher_idea(), there is no single loop and one 2-level nested loop.Nested loop is DOALL and its outer loop is parallelized.
Crypt demonstrated a speedup of 5.8x and scaled perfectly with the increasing number of threads, as shown in
(3) threading framework; and (4) a set of algorithms to profile and parallelize DOALL loops.
With increasing number of cores per chip, it is now possible to use at least part of this compute power to analyze the runtime characteristics of an application with minimal impact on expected performance.Such information can be exploited to improve the application performance.
Such approaches are particularly beneficial for complex long-running applications, which may not be simple to analyze manually.Loops are one of the simplest constructs that can be extracted from any type of code.Our work is an effort to demonstrate the feasibility of this approach.
In past efforts, the success criterion of an automated or semi-automated parallelization approach has been based on achievable speedup. When compared with manually parallelized applications, these approaches do not fare well because one parallelization technique may work for a few parts of the code but degrade others. Restricting to hotspots and the ability to reverse parallelization transforms at runtime enhance the possibilities of parallelizing long running compute-intensive applications. By relaxing the speedup requirements, it is possible to try multiple techniques for different parts of application code at runtime to achieve optimal performance with no user input.

FUTURE WORK
This work proposes a best effort parallelization methodology that could be used within the front end of a JIT (i.e. dynamic) compiler. Integration of this methodology in an actual dynamic compiler is the obvious next step. We have designed a development project to integrate this methodology in an open source JIT compiler.
INTRODUCTION
Mehran University Research Journal of Engineering & Technology, Volume 36, No. 1, January, 2017 [p-ISSN: 0254-7821, e-ISSN: 2413-7219]
Multiple cores are typically exploited by parallelizing computer applications in a variety of ways. In case of writing a new

(source code) compiler converts source code into bytecode which is stored in class file format. Classes are loaded in JVM (Java Virtual Machine) on-demand and bytecode instructions are interpreted by JVM. For JIT compilation, JVM profiles running applications to select most frequently called and/or most time consuming code regions as hotspots. JIT compiler dynamically compiles hotspots to potentially optimized native code. Since JIT compilers can exploit runtime characteristics of applications, it is plausible to use JIT compilation infrastructure for parallelization.
SIMPLELOOPS and NESTEDLOOPS. In each hotspot, loops are detected by recording the backward jumps. Each backward jump is represented as a quadruple <Offset, Target, Index, Stride>, where Offset is the offset of the backward jump, Target is the offset of the target label of the backward jump, Index is the variable acting as loop index and Stride is the step size of loop iterations. All backward jumps are recorded during parsing phase. Each backward jump is a potential single loop. A backward jump is one whose target has already been visited [39], either in terms of labels or memory addresses. Labels are used in bytecode because exact memory addresses are not known in intermediate code. By constructing a basic block level CFG (Control Flow Graph), we can classify a backward jump as a loop if its Target lies in one of the dominator blocks of the block that contains its Offset. A block d dominates a block b (i.e. d DOM b) if all paths from the entry block to b include d. Also, DOM(b) denotes the set of all nodes that dominate b (including b itself).
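The dominator test described above can be sketched with the standard iterative data-flow formulation, DOM(b) = {b} ∪ ⋂ DOM(p) over all predecessors p of b. The class `Dominators`, the array-based CFG encoding and the example graph are hypothetical; the sketch assumes block 0 is the entry block and that every block is reachable from it.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical dominator computation used to confirm that a backward jump closes a loop.
final class Dominators {
    /** succ[b] lists the successor blocks of block b; block 0 is the entry block. */
    static BitSet[] compute(int[][] succ) {
        int n = succ.length;
        List<List<Integer>> pred = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            pred.add(new ArrayList<>());
        }
        for (int b = 0; b < n; b++) {
            for (int s : succ[b]) {
                pred.get(s).add(b);
            }
        }
        // Initialise: the entry is dominated only by itself, all other blocks by everything.
        BitSet[] dom = new BitSet[n];
        for (int b = 0; b < n; b++) {
            dom[b] = new BitSet(n);
            dom[b].set(0, n);
        }
        dom[0].clear();
        dom[0].set(0);
        // Iterate to a fixed point.
        boolean changed = true;
        while (changed) {
            changed = false;
            for (int b = 1; b < n; b++) {
                BitSet d = new BitSet(n);
                d.set(0, n);
                for (int p : pred.get(b)) {
                    d.and(dom[p]);
                }
                d.set(b);
                if (!d.equals(dom[b])) {
                    dom[b] = d;
                    changed = true;
                }
            }
        }
        return dom;
    }

    /** A jump from block 'from' to block 'to' closes a loop if 'to' dominates 'from'. */
    static boolean isLoopBackEdge(BitSet[] dom, int from, int to) {
        return dom[from].get(to);
    }

    public static void main(String[] args) {
        // entry(0) -> header(1) -> body(2) -> header(1); header(1) -> exit(3)
        int[][] succ = { {1}, {2, 3}, {1}, {} };
        BitSet[] dom = compute(succ);
        System.out.println("2 -> 1 is a loop: " + isLoopBackEdge(dom, 2, 1));
    }
}
```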
Fig. 4(b), all occurrences of the LR-LV-LVA pattern are encircled with dashed lines. LR-LV-LVA is a fundamental pattern because it is composed of instructions only and is indivisible into sub-patterns. All fundamental patterns and partial pattern components are recognized and assigned unique IDs P xy and C xy , respectively, as shown in Table 6, where each P xy (or C xy ) represents a pattern (or pattern component) y having x level composition. Composition level of fundamental patterns is zero. Using IDs of fundamental patterns and pattern components, and independent instructions, the parse tree of the bytecode of Fig. 4(b) is shown in Fig. 5. It is constructed in reverse direction, taking leaves at level 0. First level composite patterns do not contain any other composite pattern. Second level composition contains at least one first level composite pattern, third level contains at least one second level composite pattern, and so on.
single loop. Hence, high time consumption (i.e. 99.9% collectively) of these methods is due to high call count and not due to the loops in their own code. To determine the immediate caller methods, we have to look at the
(d). Transformed applications are run on an 8-core system comprising 2 x Quad Core Intel® Xeon® E5405, 1333 MHz FSB, CPU Speed 2.0 GHz, L1 D Cache 32 KB, L1 I Cache 32 KB, L2 Cache 2x(2x6) = 24 MB and 8 GB DRAM. In order to assess the scalability, data is organized in two sets: long running and short running applications, as shown in Fig. 9(a-b).

Four long running benchmarks that demonstrated speedup are Series, Arith, Math and Method, as shown in Fig. 9(a). Parallelization of the JGF Series benchmark has been described in section 4.5. The Series benchmark demonstrated a speedup of 6.9x, which is comparable to the HP (Hand Parallelized) version of Series, shown in Fig. 10(d). Outer loops contain instructions related to getting system time. They cannot be multithreaded without generating additional code for thread-local time management (which is out of scope of this work). Speedup observed in Arith, Math and Method benchmarks is 2.7x, 1.6x and 1.4x, respectively. Although speedup of Math

Fig. 9(b). The HP version of JGF Crypt, when run on the same system, demonstrated 7x speedup and similar scalability, as shown in Fig. 10(c). The result is quite encouraging because proposed methodology is

TABLE 3. ALGORITHM TO IDENTIFY SINGLE LOOPS

TABLE 4. ALGORITHM TO CONSTRUCT LOOP FOREST. MULTI-NODE TREES IN THE FOREST REPRESENT NESTED LOOPS

Intermediate Representation of Bytecode Instructions
38 , we can iterate on all loops to classify them. As we are only interested in parallelization of compute intensive DOALL loops (having arbitrary stride size), we select DOALL loops by observing potential inter-iteration data dependences. Data dependences are analyzed by recognizing instruction patterns corresponding to read/write of local variables, array elements, and class members of primitive and user-defined data types. In a DOALL loop, all memory access (instruction) patterns operate on independent memory locations in each iteration. As the number of instruction patterns depends on instruction set size, we define an intermediate representation to reduce the (instruction) pattern processing cost.
precision-floating-point} numbers, respectively. A high level IR symbol ADD could suffice to recognize any of these four instructions. Similarly, we can recognize the entire instruction set using a smaller set of IR symbols. By defining IR symbols, we could represent ~200 bytecode instructions (i.e. n ≈ 200) with 42 symbols (i.e. m = 42), as shown in Table 5. Labels are typically induced by the compiler to facilitate control flow and demarcation of basic blocks. We consider LBL as part of IR symbols because labels are an integral part of compiled code. As elaborated in the next sub-section, presentation of instruction patterns in terms of IR symbols increases the occurrence frequency of instruction patterns. Using IR symbols, we have about five times (i.e. ⎡n/m⎤) fewer choices at each position in an instruction pattern.
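The many-to-one mapping from bytecode instructions to IR symbols can be illustrated with a small lookup table. The sketch covers only a handful of entries and is not a reproduction of Table 5: the grouping of the four add instructions under ADD follows the text, while the entries for LR, LV and INC are inferred from the IR tuples quoted in the data dependence discussion (<LR, 25, 14>, <LV, 21, 17>, and IINC for the loop index) and should be read as assumptions. The opcode numbers are the standard JVM values; the class `IrSymbols` and the "UNKNOWN" fallback are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical partial mapping from JVM opcodes to IR symbols.
final class IrSymbols {
    private static final Map<Integer, String> SYMBOL = new HashMap<>();

    static {
        // The four typed add instructions collapse into one IR symbol.
        SYMBOL.put(96, "ADD");   // iadd
        SYMBOL.put(97, "ADD");   // ladd
        SYMBOL.put(98, "ADD");   // fadd
        SYMBOL.put(99, "ADD");   // dadd
        // Assumed from the quoted IR tuples; the full Table 5 is not reproduced here.
        SYMBOL.put(25, "LR");    // aload: load a reference
        SYMBOL.put(21, "LV");    // iload: load a local int value
        SYMBOL.put(132, "INC");  // iinc: increment a local int
    }

    /** Returns the IR symbol for an opcode, or "UNKNOWN" if this sketch does not cover it. */
    static String symbolOf(int opcode) {
        String s = SYMBOL.get(opcode);
        return s == null ? "UNKNOWN" : s;
    }

    public static void main(String[] args) {
        int[] opcodes = {25, 21, 99, 132};
        StringBuilder pattern = new StringBuilder();
        for (int op : opcodes) {
            if (pattern.length() > 0) {
                pattern.append('-');
            }
            pattern.append(symbolOf(op));
        }
        System.out.println(pattern);
    }
}
```

Because patterns are matched on symbols rather than raw opcodes, the same pattern string covers all typed variants of an operation, which is the reduction in choices per position that the paragraph above refers to.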

TABLE 7. VARIABLES USED IN EXAMPLE LOOP
2. Methodology starts with the profiling phase, in which we found that the application calls 28 methods, i.e. N m = 28. By setting T PC = 90%, we found 2 potential hotspots. For a potential hotspot, a top-ranking value of PC is either due to its high CC (Call Count) or due to having compute intensive loops indicated by the f 10 , f 11 , f 37 , f 38 features. The reason is that PC is based on the self-time consumed by a method, i.e. time consumption of its callee methods is excluded. Looking at Table 8, we come to know that the CC value of both methods is significantly high but only the TrapezoidIntegrate() method contains one

Table 9
(a) in source code of Do() method. Bytecode of this loop is extracted from the method and represented as IR tuples to recognize instruction patterns for data dependence analysis.

TABLE 9. RELEVANT FEATURES OF DO() METHOD OF JGF SERIES
before the loop body and used in loop body. Local variables omega and i are not written in the loop body, so there is no inter-iteration data dependence due to local variables. Table 9 shows that no static field is read/written
In nested loops, compute intensive code was found in inner loops that were parallelized. Outer loops contain timing routines and cannot be parallelized due to the reason mentioned in section 6.2.1. JGF Cast demonstrated the highest speedup of 7.6x. Overall, the observed speedup is in the range 1.2-7.6x.
This work emphasizes that best effort JIT compiler inspired parallelization has great potential for parallelizing executable code at runtime. Loops in compute-intensive applications exhibit greater parallelization potential, which makes it a worthwhile option. Although it may not be able to parallelize each and every application, it is plausible to exploit parallelism without programmer intervention. Best effort exploits parallelism wherever possible and there is no harm because transformations are not made permanent. In case of failure, sequential execution could be restored. However, in case of success, transformations could be made permanent at any time. The main contributions of this paper include: (1) catalogs of qualitative and

TABLE 10. PARALLELIZATION POTENTIAL OF JGF BENCHMARK APPLICATIONS
quantitative features for runtime code comprehension; (2) compact intermediate representation of ISA and instruction pattern recognition for dependence analysis;