Building a Virtual Machine, JVM-inspired — Compilation (Part 7)

Jan 14, 2025

Introduction

In this part, we’ll explore how to add compilation support to our VM while maintaining compatibility with all the features implemented in previous articles. This addition brings us closer to how real-world VMs like the JVM operate.

Why compile?

The JVM compiles source code ahead of execution for many good reasons, including:

Performance (parsing strings at runtime is slow, binary format is faster to read and execute)
Validation (code is verified before execution, types are checked)
Resource planning (plan memory usage, determine local variable needs)
Optimization (remove dead code, JIT compilation hints, inline small methods, constant folding)
Security (verify byte-code validity, check access permissions, validate method calls)
Platform independence (same binary format everywhere, no need to compile for specific platforms)
Loading efficiency (smaller files to load, which makes it faster to load to memory)
Error detection (catch errors at compile time, not during runtime!)

Implementation goals

Until now, our VM has been operating differently from a traditional JVM by parsing and executing code directly from text. This approach, while simpler to implement initially, lacks many of the advantages listed above. Our goal is to create a proper compilation process that transforms our source code into an efficient bytecode format.

Let’s look at a concrete example. We will create a hello.tvm file (the .tvm extension comes from the name of our virtual machine, TinyVM):

function greet
set message 42
print message

function main
sync greet
exit

When we compile this program, the process looks like this:

$ tiny_vm_compile examples/hello.tvm
[2025-01-02 19:14:42.718010] [Compiler] Starting TinyVM Compiler...
[2025-01-02 19:14:42.718810] [File] Successfully read 6 lines from /Users/ondrej/Documents/Projects/c/tiny-vm/tiny-vm_07_compilation/examples/hello.tvm

[2025-01-02 19:14:42.718815] [Compiler] Bytecode for function 'greet':
[2025-01-02 19:14:42.718819]   0: LOAD_CONST message = 42
[2025-01-02 19:14:42.718820]   1: PRINT message

[2025-01-02 19:14:42.718822] [Compiler] Bytecode for function 'main':
[2025-01-02 19:14:42.718823]   0: INVOKE_SYNC greet
[2025-01-02 19:14:42.718824]   1: RETURN

[2025-01-02 19:14:42.718932] [Compiler] Successfully saved compiled bytecode to: /Users/ondrej/Documents/Projects/c/tiny-vm/tiny-vm_07_compilation/examples/hello.tvmc
[2025-01-02 19:14:42.718936] [Compiler] Cleaned up compilation results
[2025-01-02 19:14:42.718937] [Compiler] TinyVM Compiler finished.

The implementation

Project Structure

Our implementation now produces two separate executables:

The VM executable (tiny_vm_run) — Responsible for executing compiled bytecode
The Compiler executable (tiny_vm_compile) — Transforms source code into bytecode

We achieve this by splitting our main function into two separate files:

src/vm_main.c — Contains the VM execution logic
src/compiler_main.c — Handles the compilation process

Both executables share common code through a carefully structured set of source files. Here’s our updated CMakeLists.txt that reflects this organization:

# CMakeLists.txt
cmake_minimum_required(VERSION 3.30)
project(tiny_vm C)

set(CMAKE_C_STANDARD 11)

set(COMMON_SOURCES
        src/utils/logger.c
        src/utils/logger.h
        src/core/vm.c
        src/core/vm.h
        src/thread/thread.c
        src/thread/thread.h
        src/synchronization/synchronization.c
        src/synchronization/synchronization.h
        src/memory/memory.c
        src/memory/memory.h
        src/instruction/instruction.c
        src/instruction/instruction.h
        src/types.h
        src/function/function.c
        src/function/function.h
        src/execution/execution.c
        src/execution/execution.h
        src/compiler/compiler.h
        src/compiler/compiler.c
        src/compiler/bytecode.c
        src/compiler/source_loader.h
        src/compiler/source_loader.c
)

# Compiler executable
add_executable(tiny_vm_compile
        src/compiler_main.c
        ${COMMON_SOURCES}
)

# VM executable
add_executable(tiny_vm_run
        src/vm_main.c
        ${COMMON_SOURCES}
)

# Add include directories if needed
target_include_directories(tiny_vm_compile PRIVATE src)
target_include_directories(tiny_vm_run PRIVATE src)

add_custom_target(run_counter
        COMMAND tiny_vm_compile ${CMAKE_SOURCE_DIR}/examples/counter.tvm
        COMMAND tiny_vm_run ${CMAKE_SOURCE_DIR}/examples/counter.tvmc
)

add_custom_target(run_hello
        COMMAND tiny_vm_compile ${CMAKE_SOURCE_DIR}/examples/hello.tvm
        COMMAND tiny_vm_run ${CMAKE_SOURCE_DIR}/examples/hello.tvmc
)

The Compiler’s main function

The compiler’s main function orchestrates the compilation process. It handles:

Source file reading
Compilation
Output file generation
Resource cleanup

Here’s the implementation:

// src/compiler_main.c
#include "utils/logger.h"
#include "compiler/compiler.h"
#include "compiler/source_loader.h"
#include <stdlib.h>

int main(const int argc, char* argv[]) {
    if (argc < 2) {
        print("Usage: %s <source.tvm>", argv[0]);
        return 1;
    }

    print("[Compiler] Starting TinyVM Compiler...");

    // Read source file
    char*** source = read_tvm_source(argv[1]);
    if (!source) {
        print("[Compiler] Failed to read source file");
        return 1;
    }

    // Compile the program
    CompilationResult* compiled = compile_program((const char***)source);
    if (!compiled) {
        print("[Compiler] Compilation failed");
        free_source_code(source);
        return 1;
    }
    // print_compilation_result(compiled);

    // Get output filename
    char* output_file = get_output_filename(argv[1], ".tvmc");

    // Save the compiled bytecode
    save_compiled_bytecode(output_file, compiled);

    // Cleanup
    free(output_file);
    free_source_code(source);
    free_compilation_result(compiled);

    print("[Compiler] TinyVM Compiler finished.");
    return 0;
}

The compiler

The heart of our compilation process lies in the compiler implementation. Each instruction is assigned a specific bytecode operation code, creating a compact and efficient representation of our program.

The compiler header defines our bytecode structure:

// src/compiler/compiler.h
#ifndef TINY_VM_COMPILER_H
#define TINY_VM_COMPILER_H

#include <stdint.h>

// Bytecode operation codes
typedef enum {
    OP_NOP          = 0x00,    // No operation
    OP_PRINT        = 0x01,    // print variable
    OP_LOAD_CONST   = 0x02,    // Load constant into variable
    OP_ADD          = 0x03,    // Add two variables
    OP_SLEEP        = 0x04,    // Sleep for N milliseconds
    OP_SETSHARED    = 0x05,    // Set shared variable
    OP_MONITOR_ENTER = 0x06,   // Enter monitor (lock)
    OP_MONITOR_EXIT = 0x07,    // Exit monitor (unlock)
    OP_INVOKE_SYNC  = 0x08,    // Call function synchronously
    OP_INVOKE_ASYNC = 0x09,    // Call function asynchronously
    OP_RETURN       = 0x0A,    // Return from function
} OpCode;

// Bytecode instruction format
typedef struct {
    OpCode opcode;         // Operation code
    uint16_t var_index;    // Variable index (if needed)
    uint16_t var_index2;   // Second variable index (if needed)
    uint16_t var_index3;   // Third variable index (if needed)
    int32_t constant;      // Constant value (if needed)
    char* name;            // Name (for variables/functions)
} BytecodeInstruction;

// Compiled function
typedef struct {
    char* name;                     // Function name
    BytecodeInstruction* byte_code; // Bytecode instructions
    int code_length;                // Number of instructions
    int max_locals;                 // Maximum number of local variables
    char** constant_pool;           // Pool of constant values/names
    int constant_pool_size;         // Size of constant pool
} Function;

// Compilation result
typedef struct {
    Function** functions;  // Array of compiled functions
    int function_count;    // Number of functions
} CompilationResult;

// Compiler functions
CompilationResult* compile_program(const char*** source_functions);
void free_compilation_result(CompilationResult* result);
void print_compilation_result(const CompilationResult* result);

// Bytecode file operations
void save_compiled_bytecode(const char* filename, CompilationResult* compiled);

// Debug functions
void print_bytecode(const Function* function);

#endif
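
To make the instruction format concrete, here is a minimal sketch of how `set message 42` could be represented by hand. The types are trimmed copies of the ones above, and make_load_const is a hypothetical helper used only for illustration; it is not part of TinyVM.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// Trimmed copies of the definitions from compiler.h,
// repeated so the sketch compiles on its own
typedef enum {
    OP_NOP        = 0x00,
    OP_PRINT      = 0x01,
    OP_LOAD_CONST = 0x02
} OpCode;

typedef struct {
    OpCode opcode;
    uint16_t var_index;
    uint16_t var_index2;
    uint16_t var_index3;
    int32_t constant;
    char* name;
} BytecodeInstruction;

// Hypothetical helper: builds the instruction for `set <name> <value>`.
// Unused fields are zeroed so a later `name != NULL` check stays reliable.
BytecodeInstruction make_load_const(const char* name, uint16_t pool_index, int32_t value) {
    assert(name != NULL);
    BytecodeInstruction instr = {0};
    instr.opcode = OP_LOAD_CONST;
    instr.var_index = pool_index;
    instr.constant = value;
    instr.name = malloc(strlen(name) + 1);  // caller frees instr.name
    if (instr.name) {
        strcpy(instr.name, name);
    }
    return instr;
}
```

Calling make_load_const("message", 0, 42) yields an instruction with opcode OP_LOAD_CONST, constant 42, and the two unused indexes left at zero.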

Byte-code generation

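The compilation process transforms each instruction into bytecode while maintaining a per-function constant pool of the variable names used by print, set, and add. Names of functions, locks, and shared variables are stored directly on the instruction. This approach allows for efficient storage and quick lookup during execution: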

// src/compiler/compiler.c
#include "compiler.h"
#include "../instruction/instruction.h"
#include "../utils/logger.h"

#include <stdlib.h>
#include <string.h>

static void compile_instruction(const char* line, BytecodeInstruction* bytecode, int* constant_index, char** constant_pool) {
    const Instruction instr = parse_instruction(line);

    switch (instr.type) {
        case PRINT:
            bytecode->opcode = OP_PRINT;
            bytecode->name = strdup(instr.args[0]);
            // Store variable name in constant pool
            constant_pool[*constant_index] = strdup(instr.args[0]);
            bytecode->var_index = (*constant_index)++;
            break;

        case SET:
            bytecode->opcode = OP_LOAD_CONST;
            bytecode->name = strdup(instr.args[0]);
            // Store variable name in constant pool
            constant_pool[*constant_index] = strdup(instr.args[0]);
            bytecode->var_index = (*constant_index)++;
            bytecode->constant = atoi(instr.args[1]);
            break;

        case ADD:
            bytecode->opcode = OP_ADD;
            // Store all variable names in constant pool
            constant_pool[*constant_index] = strdup(instr.args[0]);
            bytecode->var_index = (*constant_index)++;
            constant_pool[*constant_index] = strdup(instr.args[1]);
            bytecode->var_index2 = (*constant_index)++;
            constant_pool[*constant_index] = strdup(instr.args[2]);
            bytecode->var_index3 = (*constant_index)++;
            break;

        case SLEEP:
            bytecode->opcode = OP_SLEEP;
            bytecode->constant = atoi(instr.args[0]);
            break;

        case SETSHARED:
            bytecode->opcode = OP_SETSHARED;
            bytecode->name = strdup(instr.args[0]);
            bytecode->constant = atoi(instr.args[1]);
            break;

        case LOCK:
            bytecode->opcode = OP_MONITOR_ENTER;
            bytecode->name = strdup(instr.args[0]);
            break;

        case UNLOCK:
            bytecode->opcode = OP_MONITOR_EXIT;
            bytecode->name = strdup(instr.args[0]);
            break;

        case SYNC:
            bytecode->opcode = OP_INVOKE_SYNC;
            bytecode->name = strdup(instr.args[0]);
            break;

        case ASYNC:
            bytecode->opcode = OP_INVOKE_ASYNC;
            bytecode->name = strdup(instr.args[0]);
            break;

        case EXIT:
            bytecode->opcode = OP_RETURN;
            break;

        default:
            bytecode->opcode = OP_NOP;
    }
}

static Function* compile_function(const char** source) {
    // Create function structure
    Function* function = malloc(sizeof(Function));
    if (!function) return NULL;

    // Get function name (first line must be function declaration)
    function->name = get_function_name(source);
    if (!function->name) {
        free(function);
        return NULL;
    }

    // Count instructions (excluding NULL terminator and function declaration)
    function->code_length = 0;
    for (int i = 1; source[i] != NULL; i++) {
        function->code_length++;
    }

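    // Allocate a zero-initialized bytecode array so that fields an instruction
    // does not set (for example name) are NULL/0 instead of garbage
    function->byte_code = calloc(function->code_length, sizeof(BytecodeInstruction));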
    if (!function->byte_code) {
        free(function->name);
        free(function);
        return NULL;
    }

    // Allocate constant pool (worst case: 3 constants per instruction)
    function->constant_pool = malloc(sizeof(char*) * function->code_length * 3);
    if (!function->constant_pool) {
        free(function->byte_code);
        free(function->name);
        free(function);
        return NULL;
    }

    // Initialize constant pool size and the (not yet computed) local variable count
    function->constant_pool_size = 0;
    function->max_locals = 0;

    // Compile each instruction
    for (int i = 0; i < function->code_length; i++) {
        compile_instruction(
            source[i + 1],
            &function->byte_code[i],
            &function->constant_pool_size,
            function->constant_pool
        );
    }

    return function;
}

CompilationResult* compile_program(const char*** source_functions) {
    CompilationResult* result = malloc(sizeof(CompilationResult));
    if (!result) return NULL;

    // Count functions
    result->function_count = 0;
    while (source_functions[result->function_count] != NULL) {
        result->function_count++;
    }

    result->functions = malloc(sizeof(Function*) * result->function_count);
    if (!result->functions) {
        free(result);
        return NULL;
    }

    // Compile each function
    for (int i = 0; i < result->function_count; i++) {
        result->functions[i] = compile_function(source_functions[i]);
        if (result->functions[i]) {
            print_bytecode(result->functions[i]);
        }
    }

    return result;
}

// ... print functions are here, but omitted as they are not that important
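
One detail worth noting in compile_instruction: every print, set, or add appends its variable names to the constant pool, so the greet function from hello.tvm stores `message` twice. A possible refinement is to reuse existing entries. The sketch below assumes a hypothetical helper named intern_constant and an explicit capacity parameter; neither exists in the TinyVM code shown here.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical helper: returns the index of `name` in the pool, appending a
// copy only when it is not present yet. Returns -1 if the pool is full or
// the allocation fails.
int intern_constant(char** pool, int* pool_size, int pool_capacity, const char* name) {
    assert(pool != NULL && pool_size != NULL && name != NULL);

    // Linear search is fine for the small pools a single function produces
    for (int i = 0; i < *pool_size; i++) {
        if (strcmp(pool[i], name) == 0) {
            return i;
        }
    }

    if (*pool_size >= pool_capacity) {
        return -1;
    }

    char* copy = malloc(strlen(name) + 1);
    if (!copy) {
        return -1;
    }
    strcpy(copy, name);
    pool[*pool_size] = copy;
    return (*pool_size)++;
}
```

With this in place, `set message 42` followed by `print message` would both refer to pool index 0, and the pool would hold a single entry.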

Byte-code file format

We store the compiled bytecode in a dedicated file format that efficiently represents our program structure:

// src/compiler/bytecode.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "../utils/logger.h"
#include "compiler.h"

// File operations for saving compiled bytecode
void save_compiled_bytecode(const char* filename, CompilationResult* compiled) {
    FILE* file = fopen(filename, "wb");
    if (!file) {
        print("[Compiler] Error: Could not open file for writing: %s", filename);
        return;
    }

    // Write number of functions
    fwrite(&compiled->function_count, sizeof(int), 1, file);

    // Write each function
    for (int i = 0; i < compiled->function_count; i++) {
        Function* func = compiled->functions[i];

        // Write function name length and name
        int name_len = strlen(func->name) + 1;
        fwrite(&name_len, sizeof(int), 1, file);
        fwrite(func->name, 1, name_len, file);

        // Write code length
        fwrite(&func->code_length, sizeof(int), 1, file);

        // Write each instruction
        for (int j = 0; j < func->code_length; j++) {
            const BytecodeInstruction* instr = &func->byte_code[j];

            // Write opcode and indexes
            fwrite(&instr->opcode, sizeof(OpCode), 1, file);
            fwrite(&instr->var_index, sizeof(uint16_t), 1, file);
            fwrite(&instr->var_index2, sizeof(uint16_t), 1, file);
            fwrite(&instr->var_index3, sizeof(uint16_t), 1, file);
            fwrite(&instr->constant, sizeof(int32_t), 1, file);

            // Write name if present
            int has_name = (instr->name != NULL);
            fwrite(&has_name, sizeof(int), 1, file);
            if (has_name) {
                int instr_name_len = strlen(instr->name) + 1;
                fwrite(&instr_name_len, sizeof(int), 1, file);
                fwrite(instr->name, 1, instr_name_len, file);
            }
        }

        // Write constant pool
        fwrite(&func->constant_pool_size, sizeof(int), 1, file);
        for (int j = 0; j < func->constant_pool_size; j++) {
            int const_len = strlen(func->constant_pool[j]) + 1;
            fwrite(&const_len, sizeof(int), 1, file);
            fwrite(func->constant_pool[j], 1, const_len, file);
        }
    }

    fclose(file);
    print("[Compiler] Successfully saved compiled bytecode to: %s", filename);
}
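
Every string in this format is written as an int length (including the terminating '\0') followed by the bytes. As a rough sketch of the reading side, the pair of helpers below writes and reads one such string. The names write_prefixed_string and read_prefixed_string, and the 4096-byte sanity limit, are assumptions made for this example rather than part of TinyVM.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Writes a string the way save_compiled_bytecode does: an int length
// (including the terminating '\0') followed by the bytes.
// Returns 1 on success, 0 on a short write.
int write_prefixed_string(FILE* file, const char* text) {
    assert(file != NULL && text != NULL);
    int len = (int)strlen(text) + 1;
    if (fwrite(&len, sizeof(int), 1, file) != 1) {
        return 0;
    }
    return fwrite(text, 1, (size_t)len, file) == (size_t)len;
}

// Hypothetical reader for the same layout. Returns a malloc'ed string
// (caller frees) or NULL on a short read or an implausible length.
char* read_prefixed_string(FILE* file) {
    assert(file != NULL);
    int len = 0;
    if (fread(&len, sizeof(int), 1, file) != 1 || len <= 0 || len > 4096) {
        return NULL;
    }
    char* text = malloc((size_t)len);
    if (!text) {
        return NULL;
    }
    // Reject data that is truncated or not '\0'-terminated
    if (fread(text, 1, (size_t)len, file) != (size_t)len || text[len - 1] != '\0') {
        free(text);
        return NULL;
    }
    return text;
}
```

Because the values are written as raw int with fwrite, the layout depends on the int size and byte order of the machine that produced the file. A format meant to be shared across platforms would need fixed-width fields and a defined byte order.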

Source code loading

The following functions load source files (files with the .tvm extension) and split them into per-function arrays of lines.

// src/compiler/source_loader.h
#ifndef TINY_VM_SOURCE_FILE_H
#define TINY_VM_SOURCE_FILE_H

#include <stdio.h>
#include "../compiler/compiler.h"

// Maximum line length for source files
#define MAX_LINE_LENGTH 1024
// Maximum number of functions per file
#define MAX_FUNCTIONS 100
// Maximum lines per function
#define MAX_LINES_PER_FUNCTION 100

// File related functions
char*** read_tvm_source(const char* filename);
void free_source_code(char*** source_code);

// Helper to get output filename
char* get_output_filename(const char* input_filename, const char* new_extension);

#endif

We are going to limit the compiler to only work with a single source file for now. But we can define multiple functions that can call each other!

// src/compiler/source_loader.c
#include "source_loader.h"
#include "../utils/logger.h"
#include <stdlib.h>
#include <string.h>

char*** read_tvm_source(const char* filename) {
    FILE* file = fopen(filename, "r");
    if (!file) {
        print("[File] Error: Could not open source file: %s", filename);
        return NULL;
    }

    char*** functions = malloc(sizeof(char**) * MAX_FUNCTIONS);
    int current_function = -1;
    char line[MAX_LINE_LENGTH];
    int line_count = 0;

    // Each function's line array is allocated when its declaration is found
    int current_line = 0;

    while (fgets(line, sizeof(line), file)) {
        // Remove newline
        line[strcspn(line, "\n")] = 0;

        // Skip empty lines and comments
        if (strlen(line) == 0 || line[0] == '/') {
            continue;
        }

        // Check if this is a function declaration
        if (strncmp(line, "function ", 9) == 0) {
            current_function++;
            if (current_function > 0) {
                // Null terminate previous function
                functions[current_function-1][current_line] = NULL;
            }
            functions[current_function] = malloc(sizeof(char*) * MAX_LINES_PER_FUNCTION);
            current_line = 0;
        }

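        // Skip stray lines that appear before the first function declaration
        if (current_function < 0) {
            continue;
        }

        // Store the line
        functions[current_function][current_line++] = strdup(line);
        line_count++;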
    }

    // Null terminate last function (if any function was found)
    if (current_function >= 0) {
        functions[current_function][current_line] = NULL;
    }
    // Null terminate function array
    functions[current_function + 1] = NULL;

    fclose(file);
    print("[File] Successfully read %d lines from %s", line_count, filename);
    return functions;
}

void free_source_code(char*** source_code) {
    if (!source_code) return;

    for (int i = 0; source_code[i] != NULL; i++) {
        for (int j = 0; source_code[i][j] != NULL; j++) {
            free(source_code[i][j]);
        }
        free(source_code[i]);
    }
    free(source_code);
}

char* get_output_filename(const char* input_filename, const char* new_extension) {
    char* output = malloc(strlen(input_filename) + strlen(new_extension) + 1);
    strcpy(output, input_filename);

    // Find last dot
    char* dot = strrchr(output, '.');
    if (dot) {
        *dot = '\0';  // Remove old extension
    }

    // Add new extension
    strcat(output, new_extension);
    return output;
}
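
One edge case in get_output_filename: the last dot in the path may belong to a directory rather than to the file name. For an input such as ./examples/hello, the only dot is the leading one, so the whole path would be cut off. A possible variant is sketched below; replace_extension is a hypothetical name, and the sketch assumes '/' as the only path separator.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical variant of get_output_filename: the extension is replaced only
// when the last '.' belongs to the file name, not to a directory component.
// Returns a malloc'ed string (caller frees) or NULL if allocation fails.
char* replace_extension(const char* input_filename, const char* new_extension) {
    assert(input_filename != NULL && new_extension != NULL);

    char* output = malloc(strlen(input_filename) + strlen(new_extension) + 1);
    if (!output) {
        return NULL;
    }
    strcpy(output, input_filename);

    char* dot = strrchr(output, '.');
    char* slash = strrchr(output, '/');
    // Start of the file name: just after the last slash, or the whole path
    char* base = slash ? slash + 1 : output;

    // Strip only when the dot sits inside the file name and is not its first
    // character (a leading dot marks a hidden file, not an extension)
    if (dot && dot > base) {
        *dot = '\0';
    }

    strcat(output, new_extension);
    return output;
}
```

With this variant, examples/hello.tvm becomes examples/hello.tvmc, while ./examples/hello becomes ./examples/hello.tvmc instead of losing the path.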

Final updates

We also had to make a few small changes to adopt the new Function struct, which represents a compiled function together with its bytecode instructions.

// src/types.h
#ifndef TINY_VM_TYPES_H
#define TINY_VM_TYPES_H

#include <pthread.h>
#include "compiler/compiler.h"

//...

// Thread context
typedef struct ThreadContext {
    LocalScope* local_scope;
    const Function* current_function; // <-- added the Function being executed
    int pc; // Points to current bytecode instruction
    pthread_t thread;
    int thread_id;
    int is_running;
    struct VM* vm;
    // removed the function name, as it is redundant now
} ThreadContext;

//... 

// VM state
typedef struct VM {
    //...

    // Function management
    // FYI: Function** is a pointer to an array of Function* pointers
    Function** functions; // <-- updated this
    int function_count;
    int function_capacity;
    pthread_mutex_t function_mgmt_lock;
} VM;

#endif
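
To illustrate how pc and current_function relate, here is a minimal sketch of a fetch step. Actual bytecode execution is the topic of the next part, so this is only an assumption about how the fields could be used. The structs are reduced to the members the sketch needs, and fetch_next and count_fetched are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

// Reduced copies of the types, holding only what the sketch needs
typedef enum { OP_NOP = 0x00, OP_RETURN = 0x0A } OpCode;
typedef struct { OpCode opcode; } BytecodeInstruction;
typedef struct {
    const char* name;
    const BytecodeInstruction* byte_code;
    int code_length;
} Function;
typedef struct {
    const Function* current_function;
    int pc;
} ThreadContext;

// Hypothetical fetch step: returns the instruction at pc and advances pc,
// or NULL once pc is outside the function's code.
const BytecodeInstruction* fetch_next(ThreadContext* ctx) {
    assert(ctx != NULL && ctx->current_function != NULL);
    if (ctx->pc < 0 || ctx->pc >= ctx->current_function->code_length) {
        return NULL;
    }
    return &ctx->current_function->byte_code[ctx->pc++];
}

// Steps through a function until OP_RETURN or the end of the code and
// returns how many instructions were fetched.
int count_fetched(ThreadContext* ctx) {
    int fetched = 0;
    const BytecodeInstruction* instr;
    while ((instr = fetch_next(ctx)) != NULL) {
        fetched++;
        if (instr->opcode == OP_RETURN) {
            break;
        }
    }
    return fetched;
}
```

Since pc is a plain index into byte_code, the thread context no longer needs the function name to find the next instruction.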

Testing the compiler

Let’s test our compiler with a more complex program that demonstrates multiple functions, threading, and synchronization. Here’s our test file (counter.tvm):

function createCounter
setshared counter 1000
print counter

function incrementCounter
lock counter_lock
set increment 10
add counter increment counter
print counter
unlock counter_lock
exit

function decrementCounter
lock counter_lock
set decrement -10
add counter decrement counter
print counter
unlock counter_lock
exit

function main
sync createCounter
async incrementCounter
async decrementCounter
exit

When compiled, this produces bytecode that can be “disassembled” to show:

[Compiler] Compiled 4 functions:
[Compiler] Bytecode for function 'createCounter':
  0: SETSHARED counter = 1000
  1: PRINT counter
[Compiler] Bytecode for function 'incrementCounter':
  0: MONITOR_ENTER counter_lock
  1: LOAD_CONST increment = 10
  2: ADD counter = increment + counter
  3: PRINT counter
  4: MONITOR_EXIT counter_lock
  5: RETURN
[Compiler] Bytecode for function 'decrementCounter':
  0: MONITOR_ENTER counter_lock
  1: LOAD_CONST decrement = -10
  2: ADD counter = decrement + counter
  3: PRINT counter
  4: MONITOR_EXIT counter_lock
  5: RETURN
[Compiler] Bytecode for function 'main':
  0: INVOKE_SYNC createCounter
  1: INVOKE_ASYNC incrementCounter
  2: INVOKE_ASYNC decrementCounter
  3: RETURN

Note that just like the JVM provides the javap command for bytecode disassembly, we could implement a similar tool. However, that would be primarily a user interface enhancement rather than a core VM feature.
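
As a starting point for such a tool, the sketch below formats a single instruction in the style of the listing above. format_instruction is a hypothetical function, not the omitted print_bytecode. It covers only opcodes that carry their operand in name or constant, and falls back to a hex code for the rest (including ADD, which would need the constant pool).

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

// Copies of the definitions from compiler.h so the sketch compiles on its own
typedef enum {
    OP_NOP = 0x00, OP_PRINT = 0x01, OP_LOAD_CONST = 0x02, OP_ADD = 0x03,
    OP_SLEEP = 0x04, OP_SETSHARED = 0x05, OP_MONITOR_ENTER = 0x06,
    OP_MONITOR_EXIT = 0x07, OP_INVOKE_SYNC = 0x08, OP_INVOKE_ASYNC = 0x09,
    OP_RETURN = 0x0A
} OpCode;

typedef struct {
    OpCode opcode;
    uint16_t var_index;
    uint16_t var_index2;
    uint16_t var_index3;
    int32_t constant;
    char* name;
} BytecodeInstruction;

// Hypothetical formatter: writes one disassembled instruction into `out`.
// Returns the value of snprintf (the length of the full text).
int format_instruction(const BytecodeInstruction* instr, char* out, size_t out_size) {
    assert(instr != NULL && out != NULL);
    const char* name = instr->name ? instr->name : "?";
    switch (instr->opcode) {
        case OP_PRINT:
            return snprintf(out, out_size, "PRINT %s", name);
        case OP_LOAD_CONST:
            return snprintf(out, out_size, "LOAD_CONST %s = %d", name, (int)instr->constant);
        case OP_SETSHARED:
            return snprintf(out, out_size, "SETSHARED %s = %d", name, (int)instr->constant);
        case OP_MONITOR_ENTER:
            return snprintf(out, out_size, "MONITOR_ENTER %s", name);
        case OP_MONITOR_EXIT:
            return snprintf(out, out_size, "MONITOR_EXIT %s", name);
        case OP_INVOKE_SYNC:
            return snprintf(out, out_size, "INVOKE_SYNC %s", name);
        case OP_INVOKE_ASYNC:
            return snprintf(out, out_size, "INVOKE_ASYNC %s", name);
        case OP_RETURN:
            return snprintf(out, out_size, "RETURN");
        default:
            // Opcodes not handled by this sketch are shown as a hex code
            return snprintf(out, out_size, "OP_0x%02X", (unsigned)instr->opcode);
    }
}
```

A real disassembler would read the .tvmc file first and then call a formatter like this for each instruction of each function.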

Our implementation vs JVM

Our TinyVM’s compilation process is significantly simpler than the JVM’s sophisticated multi-stage compilation pipeline. Let’s look at how the JVM compiles code to see where the two differ:

Source Code Processing

TinyVM:

Direct text-to-bytecode conversion with basic syntax parsing

JVM: Multi-step compilation through javac including:

Parsing source into Abstract Syntax Tree (AST)
Semantic analysis and type checking
Optimization passes
Generation of .class files with rich metadata

Type System

TinyVM:

Simple integer-only operations without type checking

JVM: Comprehensive type system with:

Verification of type safety at compile time
Generation of type descriptors and signatures
Support for generics and type erasure
Complex class loading and linking system

Optimization

TinyVM:

No optimization phase

JVM: Multiple optimization passes, such as dead code removal, inlining of small methods, and constant folding

Class File Format

TinyVM:

Simple binary format with basic function and instruction encoding

JVM: Sophisticated .class file format containing:

Constant pool with complex entry types
Full debugging information
Attributes for annotations and metadata
Version information
Security and verification data

The complete source code for this article is available in the tiny-vm_07_compilation directory of the TinyVM repository.

The next steps

With our compilation pipeline in place, the next crucial step is implementing bytecode execution in TinyVM. This will allow us to complete the circle from source code to efficient execution.

Introduction
Part 1 — Foundations
Part 2 — Multithreading
Part 3 — Heap
Part 4 — Synchronized
Part 5 — Refactoring
Part 6 — Functions
Part 7 — Compilation (you are here)
Part 8 — Byte-code execution
Part 9 — Function call stack (not started)
Part 10 — Garbage collector (not started)

Written by Ondrej Kvasnovsky
