Loop Unrolling in the JVM
What is Loop Unrolling?
Loop unrolling (or loop unwinding) is a compiler optimization technique that reduces the overhead of loop execution by decreasing the number of iterations while performing more operations in each iteration. Instead of executing one operation per iteration, the JVM creates copies of the loop body to perform multiple operations per iteration.
How Does It Work?
Consider a simple loop that processes array elements one at a time. With loop unrolling, the JVM might transform it to process multiple elements per iteration. This reduces:
- Branch predictions
- Loop condition checks
- Loop counter increments
The JVM’s JIT compiler automatically decides when to unroll loops based on:
- Loop size
- Number of iterations
- Complexity of loop body
- Available CPU architecture features
When does JVM unroll loops?
Hot Loops
void hotLoop() {
// Called frequently enough to be "hot"
for (int i = 0; i < 1000; i++) {
process(i);
}
}
Predictable Iterations
// Good for unrolling - size is known
final int SIZE = 100;
for (int i = 0; i < SIZE; i++) {
array[i] = i;
}
// Bad for unrolling - size unknown until runtime
void unpredictable(int size) {
for (int i = 0; i < size; i++) {
array[i] = i;
}
}
Simple Loop Bodies
// Good for unrolling - simple operations
for (int i = 0; i < 100; i++) {
sum += array[i];
}
// Bad for unrolling - complex control flow
for (int i = 0; i < 100; i++) {
if (condition1) {
doSomething();
} else if (condition2) {
doSomethingElse();
}
}
Simple example
Here’s a basic loop before unrolling:
// Original loop
for (int i = 0; i < array.length; i++) {
array[i] *= 2;
}
After unrolling (conceptually what the JVM might do):
// Unrolled version
for (int i = 0; i < array.length - 3; i += 4) {
array[i] *= 2;
array[i + 1] *= 2;
array[i + 2] *= 2;
array[i + 3] *= 2;
}
// Cleanup loop for remaining elements
for (; i < array.length; i++) {
array[i] *= 2;
}
Benefits
Reduced loop overhead
- Each loop iteration normally requires incrementing a counter and checking if we’re done
- With unrolling, we do these checks less frequently (every 4 iterations instead of every 1)
- Example: Instead of checking 1000 times in a 1000-iteration loop, we might check only 250 times
Better instruction pipeline utilization
- Modern CPUs work on multiple instructions at once
- Regular loops often cause pipeline stalls because we wait for the next iteration
- Unrolled loops give the CPU more independent work to do at once
- Example: When multiplying array elements by 2, the CPU can start working on the next multiplication before finishing the current one
Improved cache usage for sequential access
- When we access memory (like array elements), nearby items are loaded into cache
- Unrolled loops process multiple nearby elements at once, making better use of this cached data
- Example: When we load array[0], we might automatically get array[1–3] in cache too, so processing them all at once is efficient
Fewer branch predictions
- CPUs try to guess if a loop will continue or end (branch prediction)
- Each guess that’s wrong slows things down
- Unrolled loops have fewer branches to predict since we check the condition less often
- Example: In a 1000-iteration loop, we reduce branch predictions from 1000 to 250
public class BranchPredictionDemo {
public static void main(String[] args) {
int size = 10_000_000;
int[] orderedArray = new int[size];
int[] randomArray = new int[size];
// Fill arrays
for (int i = 0; i < size; i++) {
orderedArray[i] = i < (size/2) ? 1 : 0; // Predictable pattern: first half 1s, second half 0s
randomArray[i] = Math.random() < 0.5 ? 1 : 0; // Random pattern
}
// Test predictable branches
long start = System.nanoTime();
int sumPredictable = 0;
for (int i = 0; i < size; i++) {
if (orderedArray[i] == 1) { // Predictable pattern
sumPredictable += i;
}
}
long predictableTime = System.nanoTime() - start;
// Test random branches
start = System.nanoTime();
int sumRandom = 0;
for (int i = 0; i < size; i++) {
if (randomArray[i] == 1) { // Random pattern
sumRandom += i;
}
}
long randomTime = System.nanoTime() - start;
System.out.println("Predictable pattern time: " + predictableTime/1_000_000 + "ms");
System.out.println("Random pattern time: " + randomTime/1_000_000 + "ms");
// Print sums to prevent JVM from optimizing away our calculations
System.out.println("Sums: " + sumPredictable + ", " + sumRandom);
}
}
The predictable pattern, in this case, is 30% faster:
Predictable pattern time: 7ms
Random pattern time: 10ms
Sums: 1642668640, 108766644
Tradeoffs
Increased code size
- The actual compiled machine code gets bigger because we duplicate the loop body
- Example: Unrolling a loop 4 times might make that section of code 4 times larger
- This usually isn’t a problem unless you’re unrolling many loops
Possible instruction cache pressure
- CPUs cache instructions as well as data
- Larger code (from unrolling) might not fit in this cache as well
- Example: If your unrolled loop is too big, the CPU might have to keep reloading parts of it from slower memory
More complex cleanup code for non-perfect divisions
- If your array size isn’t divisible by the unroll factor, you need extra code
- Example: When unrolling by 4, an array of 102 elements needs special handling for the last 2 elements
- This extra code can make things slower for certain array sizes
Here is an example:
class UnrollExample {
// Process 1 million arrays of size 102 (awkward size)
static void benchmark() {
int[][] arrays = new int[1_000_000][102];
long start = System.nanoTime();
for (int[] array : arrays) {
processArray(array);
}
long end = System.nanoTime();
System.out.println("Time: " + (end - start) / 1_000_000 + "ms");
}
static void processArray(int[] array) {
int i = 0;
// Main unrolled loop
for (; i < array.length - 3; i += 4) {
array[i] *= 2;
array[i + 1] *= 2;
array[i + 2] *= 2;
array[i + 3] *= 2;
}
// Cleanup loop
for (; i < array.length; i++) {
array[i] *= 2;
}
}
}
Otherwise, modern JVMs are quite smart about this. Here’s what typically happens:
void processArray(int[] array) {
// The JVM might create two versions of the code
if (array.length >= 4) {
// Main optimized path - unrolled version
// Handles most cases very efficiently
} else {
// Simple path for tiny arrays
// Avoids cleanup overhead completely
}
}
May not help with memory-bound operations
- If your loop is slow because of memory access (not CPU calculations), unrolling might not help
- Example: If you’re reading from random locations in a very large array, the speed is limited by memory access time, not loop overhead
Monitoring Loop Unrolling
To see loop optimizations in action, use these JVM flags:
javac LoopExample.java
java -XX:+PrintCompilation \
-XX:+UnlockDiagnosticVMOptions
-XX:+PrintAssembly \
LoopExample
Note: You’ll need the
hsdis
(HotSpot disassembler) plugin for PrintAssembly to work. See How to install HotSpot Disassembler (hsdis) on macOS for instructions how to install.
Best Practices
Let the JVM handle it
- Manual loop unrolling often makes code harder to read
- The JIT compiler is usually better at deciding when to unroll
Write clean, simple loops
- Complex loop bodies may prevent unrolling
- Keep loop conditions simple
// Good - straightforward array processing
void scaleArray(int[] array, int factor) {
for (int i = 0; i < array.length; i++) {
array[i] *= factor;
}
}
// Good - simple matrix operation
void addMatrices(int[][] a, int[][] b, int[][] result) {
for (int i = 0; i < a.length; i++) {
for (int j = 0; j < a[0].length; j++) {
result[i][j] = a[i][j] + b[i][j];
}
}
}
// Bad - multiple exit conditions
void processUntilCondition(int[] array) {
for (int i = 0; i < array.length && array[i] != -1 && i < 100; i++) {
array[i] *= 2;
}
}
// Bad - each iteration depends on the previous one
void fibonacci(int[] array) {
for (int i = 2; i < array.length; i++) {
array[i] = array[i-1] + array[i-2];
}
}
// Bad - unpredictable branching in loop body
void complexProcess(int[] array) {
for (int i = 0; i < array.length; i++) {
if (array[i] > 0) {
array[i] *= 2;
} else if (array[i] < 0) {
array[i] /= 2;
} else if (i > 0) {
array[i] = array[i-1];
} else {
array[i] = 1;
}
}
}
// Bad - non-sequential array access
void scatterAccess(int[] array, int[] indices) {
for (int i = 0; i < indices.length; i++) {
array[indices[i]] *= 2; // Accessing array at random positions
}
}
Consider array access patterns
- Sequential access is more likely to benefit
- Random access patterns may not benefit
// Sequential access - Great for unrolling
void sequentialProcess(int[] array) {
// Each iteration accesses the next memory location
for (int i = 0; i < array.length; i++) {
array[i] *= 2;
}
}
// When unrolled by the JVM:
void sequentialProcessUnrolled(int[] array) {
int i = 0;
// This works well because CPU cache can prefetch next elements
for (; i < array.length - 3; i += 4) {
array[i] *= 2; // Memory location 0, 4, 8...
array[i + 1] *= 2; // Memory location 1, 5, 9...
array[i + 2] *= 2; // Memory location 2, 6, 10...
array[i + 3] *= 2; // Memory location 3, 7, 11...
}
// Cleanup remaining elements
}
// Random access - Poor for unrolling
void randomProcess(int[] array, int[] indices) {
// Each iteration jumps to a random location in memory
for (int i = 0; i < indices.length; i++) {
array[indices[i]] *= 2;
}
}
// Even if unrolled, it doesn't help much:
void randomProcessUnrolled(int[] array, int[] indices) {
int i = 0;
// Cache can't predict or prefetch effectively
for (; i < indices.length - 3; i += 4) {
array[indices[i]] *= 2; // Could be anywhere in array
array[indices[i + 1]] *= 2; // Could be anywhere in array
array[indices[i + 2]] *= 2; // Could be anywhere in array
array[indices[i + 3]] *= 2; // Could be anywhere in array
}
// Cleanup remaining elements
}
When we compare the two, we will see the difference:
Sequential: 6ms
Random: 24ms
void benchmark() {
int[] array = new int[10_000_000];
int[] indices = new int[10_000_000];
// Fill indices with random positions
for (int i = 0; i < indices.length; i++) {
indices[i] = (int)(Math.random() * array.length);
}
// Test sequential access
long start = System.nanoTime();
sequentialProcess(array);
long sequentialTime = System.nanoTime() - start;
// Test random access
start = System.nanoTime();
randomProcess(array, indices);
long randomTime = System.nanoTime() - start;
System.out.println("Sequential: " + sequentialTime/1_000_000 + "ms");
System.out.println("Random: " + randomTime/1_000_000 + "ms");
}
Common Use Cases
Loop unrolling is particularly effective for:
- Array processing
- Vector operations
- Matrix computations
- Image processing
- Signal processing
Remember
Manual loop unrolling should be a last resort after profiling shows it’s necessary. The JVM’s automatic optimizations are usually sufficient and maintain better code readability.