Loop Unrolling in the JVM

8 min readJan 10, 2025

What is Loop Unrolling?

Loop unrolling (or loop unwinding) is a compiler optimization technique that reduces the overhead of loop execution by decreasing the number of iterations while performing more operations in each iteration. Instead of executing one operation per iteration, the JVM creates copies of the loop body to perform multiple operations per iteration.

How Does It Work?

Consider a simple loop that processes array elements one at a time. With loop unrolling, the JVM might transform it to process multiple elements per iteration. This reduces:

Branch predictions
Loop condition checks
Loop counter increments

The JVM’s JIT compiler automatically decides when to unroll loops based on:

Loop size
Number of iterations
Complexity of loop body
Available CPU architecture features

When does JVM unroll loops?

Hot Loops

void hotLoop() {
    // Called frequently enough to be "hot"
    for (int i = 0; i < 1000; i++) {
        process(i);
    }
}

Predictable Iterations

// Good for unrolling - size is known
final int SIZE = 100;
for (int i = 0; i < SIZE; i++) {
    array[i] = i;
}

// Bad for unrolling - size unknown until runtime
void unpredictable(int size) {
    for (int i = 0; i < size; i++) {
        array[i] = i;
    }
}

Simple Loop Bodies

// Good for unrolling - simple operations
for (int i = 0; i < 100; i++) {
    sum += array[i];
}

// Bad for unrolling - complex control flow
for (int i = 0; i < 100; i++) {
    if (condition1) {
        doSomething();
    } else if (condition2) {
        doSomethingElse();
    }
}

Simple example

Here’s a basic loop before unrolling:

// Original loop
for (int i = 0; i < array.length; i++) {
    array[i] *= 2;
}

After unrolling (conceptually what the JVM might do):

// Unrolled version
for (int i = 0; i < array.length - 3; i += 4) {
    array[i] *= 2;
    array[i + 1] *= 2;
    array[i + 2] *= 2;
    array[i + 3] *= 2;
}
// Cleanup loop for remaining elements
for (; i < array.length; i++) {
    array[i] *= 2;
}

Benefits

Reduced loop overhead

Each loop iteration normally requires incrementing a counter and checking if we’re done
With unrolling, we do these checks less frequently (every 4 iterations instead of every 1)
Example: Instead of checking 1000 times in a 1000-iteration loop, we might check only 250 times

Better instruction pipeline utilization

Modern CPUs work on multiple instructions at once
Regular loops often cause pipeline stalls because we wait for the next iteration
Unrolled loops give the CPU more independent work to do at once
Example: When multiplying array elements by 2, the CPU can start working on the next multiplication before finishing the current one

Improved cache usage for sequential access

When we access memory (like array elements), nearby items are loaded into cache
Unrolled loops process multiple nearby elements at once, making better use of this cached data
Example: When we load array[0], we might automatically get array[1–3] in cache too, so processing them all at once is efficient

Fewer branch predictions

CPUs try to guess if a loop will continue or end (branch prediction)
Each guess that’s wrong slows things down
Unrolled loops have fewer branches to predict since we check the condition less often
Example: In a 1000-iteration loop, we reduce branch predictions from 1000 to 250

public class BranchPredictionDemo {
    public static void main(String[] args) {
        int size = 10_000_000;
        int[] orderedArray = new int[size];
        int[] randomArray = new int[size];
        
        // Fill arrays
        for (int i = 0; i < size; i++) {
            orderedArray[i] = i < (size/2) ? 1 : 0; // Predictable pattern: first half 1s, second half 0s
            randomArray[i] = Math.random() < 0.5 ? 1 : 0; // Random pattern
        }

        // Test predictable branches
        long start = System.nanoTime();
        int sumPredictable = 0;
        for (int i = 0; i < size; i++) {
            if (orderedArray[i] == 1) { // Predictable pattern
                sumPredictable += i;
            }
        }
        long predictableTime = System.nanoTime() - start;

        // Test random branches
        start = System.nanoTime();
        int sumRandom = 0;
        for (int i = 0; i < size; i++) {
            if (randomArray[i] == 1) { // Random pattern
                sumRandom += i;
            }
        }
        long randomTime = System.nanoTime() - start;

        System.out.println("Predictable pattern time: " + predictableTime/1_000_000 + "ms");
        System.out.println("Random pattern time: " + randomTime/1_000_000 + "ms");
        // Print sums to prevent JVM from optimizing away our calculations
        System.out.println("Sums: " + sumPredictable + ", " + sumRandom);
    }
}

The predictable pattern, in this case, is 30% faster:

Predictable pattern time: 7ms
Random pattern time: 10ms
Sums: 1642668640, 108766644

Tradeoffs

Increased code size

The actual compiled machine code gets bigger because we duplicate the loop body
Example: Unrolling a loop 4 times might make that section of code 4 times larger
This usually isn’t a problem unless you’re unrolling many loops

Possible instruction cache pressure

CPUs cache instructions as well as data
Larger code (from unrolling) might not fit in this cache as well
Example: If your unrolled loop is too big, the CPU might have to keep reloading parts of it from slower memory

More complex cleanup code for non-perfect divisions

If your array size isn’t divisible by the unroll factor, you need extra code
Example: When unrolling by 4, an array of 102 elements needs special handling for the last 2 elements
This extra code can make things slower for certain array sizes

Here is an example:

class UnrollExample {
    // Process 1 million arrays of size 102 (awkward size)
    static void benchmark() {
        int[][] arrays = new int[1_000_000][102];
        long start = System.nanoTime();
        
        for (int[] array : arrays) {
            processArray(array);
        }
        
        long end = System.nanoTime();
        System.out.println("Time: " + (end - start) / 1_000_000 + "ms");
    }
    
    static void processArray(int[] array) {
        int i = 0;
        // Main unrolled loop
        for (; i < array.length - 3; i += 4) {
            array[i] *= 2;
            array[i + 1] *= 2;
            array[i + 2] *= 2;
            array[i + 3] *= 2;
        }
        // Cleanup loop
        for (; i < array.length; i++) {
            array[i] *= 2;
        }
    }
}

Otherwise, modern JVMs are quite smart about this. Here’s what typically happens:

void processArray(int[] array) {
    // The JVM might create two versions of the code
    if (array.length >= 4) {
        // Main optimized path - unrolled version
        // Handles most cases very efficiently
    } else {
        // Simple path for tiny arrays
        // Avoids cleanup overhead completely
    }
}

May not help with memory-bound operations

If your loop is slow because of memory access (not CPU calculations), unrolling might not help
Example: If you’re reading from random locations in a very large array, the speed is limited by memory access time, not loop overhead

Monitoring Loop Unrolling

To see loop optimizations in action, use these JVM flags:

javac LoopExample.java

java -XX:+PrintCompilation \
     -XX:+UnlockDiagnosticVMOptions 
     -XX:+PrintAssembly \
     LoopExample

Note: You’ll need the hsdis (HotSpot disassembler) plugin for PrintAssembly to work. See How to install HotSpot Disassembler (hsdis) on macOS for instructions how to install.

Best Practices

Let the JVM handle it

Manual loop unrolling often makes code harder to read
The JIT compiler is usually better at deciding when to unroll

Write clean, simple loops

Complex loop bodies may prevent unrolling
Keep loop conditions simple

// Good - straightforward array processing
void scaleArray(int[] array, int factor) {
    for (int i = 0; i < array.length; i++) {
        array[i] *= factor;
    }
}

// Good - simple matrix operation
void addMatrices(int[][] a, int[][] b, int[][] result) {
    for (int i = 0; i < a.length; i++) {
        for (int j = 0; j < a[0].length; j++) {
            result[i][j] = a[i][j] + b[i][j];
        }
    }
}

// Bad - multiple exit conditions
void processUntilCondition(int[] array) {
    for (int i = 0; i < array.length && array[i] != -1 && i < 100; i++) {
        array[i] *= 2;
    }
}

// Bad - each iteration depends on the previous one
void fibonacci(int[] array) {
    for (int i = 2; i < array.length; i++) {
        array[i] = array[i-1] + array[i-2];
    }
}

// Bad - unpredictable branching in loop body
void complexProcess(int[] array) {
    for (int i = 0; i < array.length; i++) {
        if (array[i] > 0) {
            array[i] *= 2;
        } else if (array[i] < 0) {
            array[i] /= 2;
        } else if (i > 0) {
            array[i] = array[i-1];
        } else {
            array[i] = 1;
        }
    }
}

// Bad - non-sequential array access
void scatterAccess(int[] array, int[] indices) {
    for (int i = 0; i < indices.length; i++) {
        array[indices[i]] *= 2;  // Accessing array at random positions
    }
}

Consider array access patterns

Sequential access is more likely to benefit
Random access patterns may not benefit

// Sequential access - Great for unrolling
void sequentialProcess(int[] array) {
    // Each iteration accesses the next memory location
    for (int i = 0; i < array.length; i++) {
        array[i] *= 2;
    }
}

// When unrolled by the JVM:
void sequentialProcessUnrolled(int[] array) {
    int i = 0;
    // This works well because CPU cache can prefetch next elements
    for (; i < array.length - 3; i += 4) {
        array[i] *= 2;     // Memory location 0, 4, 8...
        array[i + 1] *= 2; // Memory location 1, 5, 9...
        array[i + 2] *= 2; // Memory location 2, 6, 10...
        array[i + 3] *= 2; // Memory location 3, 7, 11...
    }
    // Cleanup remaining elements
}

// Random access - Poor for unrolling
void randomProcess(int[] array, int[] indices) {
    // Each iteration jumps to a random location in memory
    for (int i = 0; i < indices.length; i++) {
        array[indices[i]] *= 2;
    }
}

// Even if unrolled, it doesn't help much:
void randomProcessUnrolled(int[] array, int[] indices) {
    int i = 0;
    // Cache can't predict or prefetch effectively
    for (; i < indices.length - 3; i += 4) {
        array[indices[i]] *= 2;     // Could be anywhere in array
        array[indices[i + 1]] *= 2; // Could be anywhere in array
        array[indices[i + 2]] *= 2; // Could be anywhere in array
        array[indices[i + 3]] *= 2; // Could be anywhere in array
    }
    // Cleanup remaining elements
}

When we compare the two, we will see the difference:

Sequential: 6ms
Random:    24ms


void benchmark() {
    int[] array = new int[10_000_000];
    int[] indices = new int[10_000_000];
    
    // Fill indices with random positions
    for (int i = 0; i < indices.length; i++) {
        indices[i] = (int)(Math.random() * array.length);
    }
    
    // Test sequential access
    long start = System.nanoTime();
    sequentialProcess(array);
    long sequentialTime = System.nanoTime() - start;
    
    // Test random access
    start = System.nanoTime();
    randomProcess(array, indices);
    long randomTime = System.nanoTime() - start;
    
    System.out.println("Sequential: " + sequentialTime/1_000_000 + "ms");
    System.out.println("Random: " + randomTime/1_000_000 + "ms");
}

Common Use Cases

Loop unrolling is particularly effective for:

Array processing
Vector operations
Matrix computations
Image processing
Signal processing

Remember

Manual loop unrolling should be a last resort after profiling shows it’s necessary. The JVM’s automatic optimizations are usually sufficient and maintain better code readability.