Building a Virtual Machine, JVM-inspired — Compilation (Part 7)

Ondrej Kvasnovsky
11 min read · Jan 14, 2025


Introduction

In this part, we’ll explore how to add compilation support to our VM while maintaining compatibility with all the features implemented in previous articles. This addition brings us closer to how real-world VMs like the JVM operate.

Why compile?

The JVM compiles code ahead of execution for many good reasons:

  • Performance (parsing strings at runtime is slow, binary format is faster to read and execute)
  • Validation (code is verified before execution, types are checked)
  • Resource planning (plan memory usage, determine local variable needs)
  • Optimization (remove dead code, JIT compilation hints, inline small methods, constant folding)
  • Security (verify byte-code validity, check access permissions, validate method calls)
  • Platform independence (same binary format everywhere, no need to compile for specific platforms)
  • Loading efficiency (smaller files to load, which makes it faster to load to memory)
  • Error detection (catch errors at compile time, not during runtime!)
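
To make the validation and error-detection points a little more concrete, here is a minimal sketch of one check a compiler can run before any code executes: every call target must name a function that actually exists. The helper find_unresolved_call and its parameters are hypothetical and simplified for illustration; they are not part of TinyVM.

```c
#include <string.h>

// Hypothetical helper (not part of TinyVM): returns the index of the first
// call target that does not match any known function name, or -1 if every
// call resolves.
int find_unresolved_call(const char* const* function_names, int function_count,
                         const char* const* call_targets, int call_count) {
    for (int i = 0; i < call_count; i++) {
        int found = 0;
        for (int j = 0; j < function_count; j++) {
            if (strcmp(call_targets[i], function_names[j]) == 0) {
                found = 1;
                break;
            }
        }
        if (!found) {
            return i;  // This call would fail at runtime; report it now
        }
    }
    return -1;
}
```

With an interpreter that parses text at runtime, a misspelled function name is only discovered when the call is reached. A check like this one reports it as soon as the program is compiled.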

Implementation goals

Until now, our VM has been operating differently from a traditional JVM by parsing and executing code directly from text. This approach, while simpler to implement initially, lacks many of the advantages listed above. Our goal is to create a proper compilation process that transforms our source code into an efficient bytecode format.

Let’s look at a concrete example. We will create a hello.tvm file (the .tvm extension comes from the name of our virtual machine, TinyVM):

function greet
set message 42
print message

function main
sync greet
exit

When we compile this program, the process looks like this:

$ tiny_vm_compile examples/hello.tvm
[2025-01-02 19:14:42.718010] [Compiler] Starting TinyVM Compiler...
[2025-01-02 19:14:42.718810] [File] Successfully read 6 lines from /Users/ondrej/Documents/Projects/c/tiny-vm/tiny-vm_07_compilation/examples/hello.tvm

[2025-01-02 19:14:42.718815] [Compiler] Bytecode for function 'greet':
[2025-01-02 19:14:42.718819] 0: LOAD_CONST message = 42
[2025-01-02 19:14:42.718820] 1: PRINT message

[2025-01-02 19:14:42.718822] [Compiler] Bytecode for function 'main':
[2025-01-02 19:14:42.718823] 0: INVOKE_SYNC greet
[2025-01-02 19:14:42.718824] 1: RETURN

[2025-01-02 19:14:42.718932] [Compiler] Successfully saved compiled bytecode to: /Users/ondrej/Documents/Projects/c/tiny-vm/tiny-vm_07_compilation/examples/hello.tvmc
[2025-01-02 19:14:42.718936] [Compiler] Cleaned up compilation results
[2025-01-02 19:14:42.718937] [Compiler] TinyVM Compiler finished.

The implementation

Project Structure

Our implementation now produces two separate executables:

  1. The VM executable (tiny_vm_run) — Responsible for executing compiled bytecode
  2. The Compiler executable (tiny_vm_compile) — Transforms source code into bytecode

We achieve this by splitting our main function into two separate files:

  • src/vm_main.c — Contains the VM execution logic
  • src/compiler_main.c — Handles the compilation process

Both executables share common code through a carefully structured set of source files. Here’s our updated CMakeLists.txt that reflects this organization:

# CMakeLists.txt
cmake_minimum_required(VERSION 3.30)
project(tiny_vm C)

set(CMAKE_C_STANDARD 11)

set(COMMON_SOURCES
src/utils/logger.c
src/utils/logger.h
src/core/vm.c
src/core/vm.h
src/thread/thread.c
src/thread/thread.h
src/synchronization/synchronization.c
src/synchronization/synchronization.h
src/memory/memory.c
src/memory/memory.h
src/instruction/instruction.c
src/instruction/instruction.h
src/types.h
src/function/function.c
src/function/function.h
src/execution/execution.c
src/execution/execution.h
src/compiler/compiler.h
src/compiler/compiler.c
src/compiler/bytecode.c
src/compiler/source_loader.h
src/compiler/source_loader.c
)

# Compiler executable
add_executable(tiny_vm_compile
src/compiler_main.c
${COMMON_SOURCES}
)

# VM executable
add_executable(tiny_vm_run
src/vm_main.c
${COMMON_SOURCES}
)

# Add include directories if needed
target_include_directories(tiny_vm_compile PRIVATE src)
target_include_directories(tiny_vm_run PRIVATE src)

add_custom_target(run_counter
COMMAND tiny_vm_compile ${CMAKE_SOURCE_DIR}/examples/counter.tvm
COMMAND tiny_vm_run ${CMAKE_SOURCE_DIR}/examples/counter.tvmc
)

add_custom_target(run_hello
COMMAND tiny_vm_compile ${CMAKE_SOURCE_DIR}/examples/hello.tvm
COMMAND tiny_vm_run ${CMAKE_SOURCE_DIR}/examples/hello.tvmc
)

The Compiler’s main function

The compiler’s main function orchestrates the compilation process. It handles:

  • Source file reading
  • Compilation
  • Output file generation
  • Resource cleanup

Here’s the implementation:

// src/compiler_main.c
#include "utils/logger.h"
#include "compiler/compiler.h"
#include "compiler/source_loader.h"
#include <stdlib.h>

int main(const int argc, char* argv[]) {
if (argc < 2) {
print("Usage: %s <source.tvm>", argv[0]);
return 1;
}

print("[Compiler] Starting TinyVM Compiler...");

// Read source file
char*** source = read_tvm_source(argv[1]);
if (!source) {
print("[Compiler] Failed to read source file");
return 1;
}

// Compile the program
CompilationResult* compiled = compile_program((const char***)source);
if (!compiled) {
print("[Compiler] Compilation failed");
free_source_code(source);
return 1;
}
// print_compilation_result(compiled);

// Get output filename
char* output_file = get_output_filename(argv[1], ".tvmc");

// Save the compiled bytecode
save_compiled_bytecode(output_file, compiled);

// Cleanup
free(output_file);
free_source_code(source);
free_compilation_result(compiled);

print("[Compiler] TinyVM Compiler finished.");
return 0;
}

The compiler

The heart of our compilation process lies in the compiler implementation. Each instruction is assigned a specific bytecode operation code, creating a compact and efficient representation of our program.

The compiler header defines our bytecode structure:

// src/compiler/compiler.h
#ifndef TINY_VM_COMPILER_H
#define TINY_VM_COMPILER_H

#include <stdint.h>

// Bytecode operation codes
typedef enum {
OP_NOP = 0x00, // No operation
OP_PRINT = 0x01, // print variable
OP_LOAD_CONST = 0x02, // Load constant into variable
OP_ADD = 0x03, // Add two variables
OP_SLEEP = 0x04, // Sleep for N milliseconds
OP_SETSHARED = 0x05, // Set shared variable
OP_MONITOR_ENTER = 0x06, // Enter monitor (lock)
OP_MONITOR_EXIT = 0x07, // Exit monitor (unlock)
OP_INVOKE_SYNC = 0x08, // Call function synchronously
OP_INVOKE_ASYNC = 0x09, // Call function asynchronously
OP_RETURN = 0x0A, // Return from function
} OpCode;

// Bytecode instruction format
typedef struct {
OpCode opcode; // Operation code
uint16_t var_index; // Variable index (if needed)
uint16_t var_index2; // Second variable index (if needed)
uint16_t var_index3; // Third variable index (if needed)
int32_t constant; // Constant value (if needed)
char* name; // Name (for variables/functions)
} BytecodeInstruction;

// Compiled function
typedef struct {
char* name; // Function name
BytecodeInstruction* byte_code; // Bytecode instructions
int code_length; // Number of instructions
int max_locals; // Maximum number of local variables
char** constant_pool; // Pool of constant values/names
int constant_pool_size; // Size of constant pool
} Function;

// Compilation result
typedef struct {
Function** functions; // Array of compiled functions
int function_count; // Number of functions
} CompilationResult;

// Compiler functions
CompilationResult* compile_program(const char*** source_functions);
void free_compilation_result(CompilationResult* result);
void print_compilation_result(const CompilationResult* result);

// Bytecode file operations
void save_compiled_bytecode(const char* filename, CompilationResult* compiled);

// Debug functions
void print_bytecode(const Function* function);

#endif

Bytecode generation

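The compilation process transforms each instruction into bytecode while maintaining a constant pool of variable names; function and lock names are stored directly in the instruction's name field. Each instruction refers to its variables by pool index, which keeps instructions compact and avoids parsing text during execution: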

// src/compiler/compiler.c
#include "compiler.h"
#include "../instruction/instruction.h"
#include "../utils/logger.h"

#include <stdlib.h>
#include <string.h>

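static void compile_instruction(const char* line, BytecodeInstruction* bytecode, int* constant_index, char** constant_pool) {
// Zero all fields first: the bytecode array comes from malloc, and
// save_compiled_bytecode relies on name being NULL when no name is set
memset(bytecode, 0, sizeof(*bytecode));

const Instruction instr = parse_instruction(line);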

switch (instr.type) {
case PRINT:
bytecode->opcode = OP_PRINT;
bytecode->name = strdup(instr.args[0]);
// Store variable name in constant pool
constant_pool[*constant_index] = strdup(instr.args[0]);
bytecode->var_index = (*constant_index)++;
break;

case SET:
bytecode->opcode = OP_LOAD_CONST;
bytecode->name = strdup(instr.args[0]);
// Store variable name in constant pool
constant_pool[*constant_index] = strdup(instr.args[0]);
bytecode->var_index = (*constant_index)++;
bytecode->constant = atoi(instr.args[1]);
break;

case ADD:
bytecode->opcode = OP_ADD;
// Store all variable names in constant pool
constant_pool[*constant_index] = strdup(instr.args[0]);
bytecode->var_index = (*constant_index)++;
constant_pool[*constant_index] = strdup(instr.args[1]);
bytecode->var_index2 = (*constant_index)++;
constant_pool[*constant_index] = strdup(instr.args[2]);
bytecode->var_index3 = (*constant_index)++;
break;

case SLEEP:
bytecode->opcode = OP_SLEEP;
bytecode->constant = atoi(instr.args[0]);
break;

case SETSHARED:
bytecode->opcode = OP_SETSHARED;
bytecode->name = strdup(instr.args[0]);
bytecode->constant = atoi(instr.args[1]);
break;

case LOCK:
bytecode->opcode = OP_MONITOR_ENTER;
bytecode->name = strdup(instr.args[0]);
break;

case UNLOCK:
bytecode->opcode = OP_MONITOR_EXIT;
bytecode->name = strdup(instr.args[0]);
break;

case SYNC:
bytecode->opcode = OP_INVOKE_SYNC;
bytecode->name = strdup(instr.args[0]);
break;

case ASYNC:
bytecode->opcode = OP_INVOKE_ASYNC;
bytecode->name = strdup(instr.args[0]);
break;

case EXIT:
bytecode->opcode = OP_RETURN;
break;

default:
bytecode->opcode = OP_NOP;
}
}

static Function* compile_function(const char** source) {
// Create function structure
Function* function = malloc(sizeof(Function));
if (!function) return NULL;

// Get function name (first line must be function declaration)
function->name = get_function_name(source);
if (!function->name) {
free(function);
return NULL;
}

// Count instructions (excluding NULL terminator and function declaration)
function->code_length = 0;
for (int i = 1; source[i] != NULL; i++) {
function->code_length++;
}

// Allocate bytecode array
function->byte_code = malloc(sizeof(BytecodeInstruction) * function->code_length);
if (!function->byte_code) {
free(function->name);
free(function);
return NULL;
}

// Allocate constant pool (worst case: 3 constants per instruction)
function->constant_pool = malloc(sizeof(char*) * function->code_length * 3);
if (!function->constant_pool) {
free(function->byte_code);
free(function->name);
free(function);
return NULL;
}

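// Initialize constant pool size and local variable count
// (max_locals is not computed in this part, so give it a defined value)
function->constant_pool_size = 0;
function->max_locals = 0;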

// Compile each instruction
for (int i = 0; i < function->code_length; i++) {
compile_instruction(
source[i + 1],
&function->byte_code[i],
&function->constant_pool_size,
function->constant_pool
);
}

return function;
}

CompilationResult* compile_program(const char*** source_functions) {
CompilationResult* result = malloc(sizeof(CompilationResult));
if (!result) return NULL;

// Count functions
result->function_count = 0;
while (source_functions[result->function_count] != NULL) {
result->function_count++;
}

result->functions = malloc(sizeof(Function*) * result->function_count);
if (!result->functions) {
free(result);
return NULL;
}

// Compile each function
for (int i = 0; i < result->function_count; i++) {
result->functions[i] = compile_function(source_functions[i]);
if (result->functions[i]) {
print_bytecode(result->functions[i]);
}
}

return result;
}

// ... print functions are here, but omitted as they are not that important
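
Note that compile_instruction above appends a new pool entry every time a name is referenced, so a variable used three times in a function ends up with three entries. One common refinement is to intern names, so that a repeated name reuses its existing index. The following is only a sketch of that idea; pool_intern and its capacity parameter are hypothetical and do not exist in TinyVM.

```c
#include <stdlib.h>
#include <string.h>

// Hypothetical helper (not part of TinyVM): returns the index of `name` in
// the pool, appending a copy only if it is not there yet. Returns -1 if the
// pool is full or the copy cannot be allocated.
int pool_intern(char** pool, int* pool_size, int capacity, const char* name) {
    for (int i = 0; i < *pool_size; i++) {
        if (strcmp(pool[i], name) == 0) {
            return i;  // Already present: reuse the existing index
        }
    }
    if (*pool_size >= capacity) {
        return -1;
    }

    char* copy = malloc(strlen(name) + 1);
    if (!copy) {
        return -1;
    }
    strcpy(copy, name);

    pool[*pool_size] = copy;
    return (*pool_size)++;
}
```

A linear scan is enough for functions of this size. With interning, every reference to the same variable carries the same var_index, which would also let an executor use the index directly as a slot number.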

Bytecode file format

We store the compiled bytecode in a dedicated file format that efficiently represents our program structure:

// src/compiler/bytecode.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "../utils/logger.h"
#include "compiler.h"

// File operations for saving compiled bytecode
void save_compiled_bytecode(const char* filename, CompilationResult* compiled) {
FILE* file = fopen(filename, "wb");
if (!file) {
print("[Compiler] Error: Could not open file for writing: %s", filename);
return;
}

// Write number of functions
fwrite(&compiled->function_count, sizeof(int), 1, file);

// Write each function
for (int i = 0; i < compiled->function_count; i++) {
Function* func = compiled->functions[i];

// Write function name length and name
int name_len = strlen(func->name) + 1;
fwrite(&name_len, sizeof(int), 1, file);
fwrite(func->name, 1, name_len, file);

// Write code length
fwrite(&func->code_length, sizeof(int), 1, file);

// Write each instruction
for (int j = 0; j < func->code_length; j++) {
const BytecodeInstruction* instr = &func->byte_code[j];

// Write opcode and indexes
fwrite(&instr->opcode, sizeof(OpCode), 1, file);
fwrite(&instr->var_index, sizeof(uint16_t), 1, file);
fwrite(&instr->var_index2, sizeof(uint16_t), 1, file);
fwrite(&instr->var_index3, sizeof(uint16_t), 1, file);
fwrite(&instr->constant, sizeof(int32_t), 1, file);

// Write name if present
int has_name = (instr->name != NULL);
fwrite(&has_name, sizeof(int), 1, file);
if (has_name) {
int instr_name_len = strlen(instr->name) + 1;
fwrite(&instr_name_len, sizeof(int), 1, file);
fwrite(instr->name, 1, instr_name_len, file);
}
}

// Write constant pool
fwrite(&func->constant_pool_size, sizeof(int), 1, file);
for (int j = 0; j < func->constant_pool_size; j++) {
int const_len = strlen(func->constant_pool[j]) + 1;
fwrite(&const_len, sizeof(int), 1, file);
fwrite(func->constant_pool[j], 1, const_len, file);
}
}

fclose(file);
print("[Compiler] Successfully saved compiled bytecode to: %s", filename);
}
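
The writer above emits int and OpCode values with the host's own size and byte order, which works as long as the file is read on the same kind of machine that wrote it. If compiled files ever need to move between platforms, one option is to encode each numeric field with a fixed width and an explicit byte order. The sketch below shows that idea for a single instruction; InstructionFields, encode_instruction, decode_instruction, and the 11-byte little-endian layout are assumptions made for this example, not TinyVM's actual format.

```c
#include <stdint.h>
#include <string.h>

#define ENCODED_INSTRUCTION_SIZE 11

// Hypothetical, simplified mirror of the numeric fields of BytecodeInstruction
typedef struct {
    uint8_t opcode;
    uint16_t var_index;
    uint16_t var_index2;
    uint16_t var_index3;
    int32_t constant;
} InstructionFields;

static void put_u16(uint8_t* out, uint16_t v) {
    out[0] = (uint8_t)(v & 0xFF);  // Low byte first (little-endian)
    out[1] = (uint8_t)(v >> 8);
}

static uint16_t get_u16(const uint8_t* in) {
    return (uint16_t)(in[0] | (in[1] << 8));
}

// Writes the fields into `out`, which must hold ENCODED_INSTRUCTION_SIZE bytes
void encode_instruction(const InstructionFields* f, uint8_t* out) {
    uint32_t c;
    memcpy(&c, &f->constant, sizeof(c));  // Reuse the bit pattern as unsigned

    out[0] = f->opcode;
    put_u16(out + 1, f->var_index);
    put_u16(out + 3, f->var_index2);
    put_u16(out + 5, f->var_index3);
    out[7] = (uint8_t)(c & 0xFF);
    out[8] = (uint8_t)((c >> 8) & 0xFF);
    out[9] = (uint8_t)((c >> 16) & 0xFF);
    out[10] = (uint8_t)((c >> 24) & 0xFF);
}

// Reads the fields back from ENCODED_INSTRUCTION_SIZE bytes
void decode_instruction(const uint8_t* in, InstructionFields* f) {
    uint32_t c = (uint32_t)in[7] | ((uint32_t)in[8] << 8) |
                 ((uint32_t)in[9] << 16) | ((uint32_t)in[10] << 24);

    f->opcode = in[0];
    f->var_index = get_u16(in + 1);
    f->var_index2 = get_u16(in + 3);
    f->var_index3 = get_u16(in + 5);
    memcpy(&f->constant, &c, sizeof(c));
}
```

The trade-off is a little more code in exchange for a file whose bytes mean the same thing on every host, which is closer to the platform-independence goal listed at the start of the article.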

Source code loading

The following functions load source files (the ones with the .tvm extension) and split them into functions.

// src/compiler/source_loader.h
#ifndef TINY_VM_SOURCE_FILE_H
#define TINY_VM_SOURCE_FILE_H

#include <stdio.h>
#include "../compiler/compiler.h"

// Maximum line length for source files
#define MAX_LINE_LENGTH 1024
// Maximum number of functions per file
#define MAX_FUNCTIONS 100
// Maximum lines per function
#define MAX_LINES_PER_FUNCTION 100

// File related functions
char*** read_tvm_source(const char* filename);
void free_source_code(char*** source_code);

// Helper to get output filename
char* get_output_filename(const char* input_filename, const char* new_extension);

#endif

We are going to limit the compiler to only work with a single source file for now. But we can define multiple functions that can call each other!

// src/compiler/source_loader.c
#include "source_loader.h"
#include "../utils/logger.h"
#include <stdlib.h>
#include <string.h>

char*** read_tvm_source(const char* filename) {
FILE* file = fopen(filename, "r");
if (!file) {
print("[File] Error: Could not open source file: %s", filename);
return NULL;
}

// +1 leaves room for the NULL terminator of the function array
char*** functions = malloc(sizeof(char**) * (MAX_FUNCTIONS + 1));
if (!functions) {
fclose(file);
return NULL;
}
int current_function = -1;
char line[MAX_LINE_LENGTH];
int line_count = 0;
int current_line = 0;

while (fgets(line, sizeof(line), file)) {
// Remove trailing newline (and carriage return, if present)
line[strcspn(line, "\r\n")] = 0;

// Skip empty lines and comments
if (strlen(line) == 0 || line[0] == '/') {
continue;
}

// Check if this is a function declaration
if (strncmp(line, "function ", 9) == 0) {
if (current_function + 1 >= MAX_FUNCTIONS) {
print("[File] Error: Too many functions (max %d)", MAX_FUNCTIONS);
break;
}
if (current_function >= 0) {
// Null terminate previous function
functions[current_function][current_line] = NULL;
}
current_function++;
// +1 leaves room for the NULL terminator of the line array
functions[current_function] = malloc(sizeof(char*) * (MAX_LINES_PER_FUNCTION + 1));
current_line = 0;
}

// Ignore lines outside any function or beyond the per-function limit
if (current_function < 0 || current_line >= MAX_LINES_PER_FUNCTION) {
continue;
}

// Store the line
functions[current_function][current_line++] = strdup(line);
line_count++;
}

// Null terminate last function (if there is one)
if (current_function >= 0) {
functions[current_function][current_line] = NULL;
}
// Null terminate function array
functions[current_function + 1] = NULL;

fclose(file);
print("[File] Successfully read %d lines from %s", line_count, filename);
return functions;
}

void free_source_code(char*** source_code) {
if (!source_code) return;

for (int i = 0; source_code[i] != NULL; i++) {
for (int j = 0; source_code[i][j] != NULL; j++) {
free(source_code[i][j]);
}
free(source_code[i]);
}
free(source_code);
}

char* get_output_filename(const char* input_filename, const char* new_extension) {
char* output = malloc(strlen(input_filename) + strlen(new_extension) + 1);
strcpy(output, input_filename);

// Find last dot
char* dot = strrchr(output, '.');
if (dot) {
*dot = '\0'; // Remove old extension
}

// Add new extension
strcat(output, new_extension);
return output;
}
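
One edge case in get_output_filename: it cuts at the last dot anywhere in the path, so an input such as build.v2/hello (a dot in a directory name, no extension on the file) would be cut inside the directory name. Below is a sketch of a variant that only looks for a dot in the file part of the path. The name replace_extension is hypothetical, and the sketch assumes '/' as the path separator.

```c
#include <stdlib.h>
#include <string.h>

// Hypothetical variant of get_output_filename (not part of TinyVM).
// The caller owns the returned string and must free it.
char* replace_extension(const char* input_filename, const char* new_extension) {
    // Only search for the extension dot after the last path separator
    const char* slash = strrchr(input_filename, '/');
    const char* base = slash ? slash + 1 : input_filename;
    const char* dot = strrchr(base, '.');

    // Keep everything up to the extension dot, or the whole path if the file
    // part has no dot (a leading dot, as in ".profile", is not an extension)
    size_t stem_len = (dot && dot != base) ? (size_t)(dot - input_filename)
                                           : strlen(input_filename);

    char* output = malloc(stem_len + strlen(new_extension) + 1);
    if (!output) {
        return NULL;
    }

    memcpy(output, input_filename, stem_len);
    strcpy(output + stem_len, new_extension);
    return output;
}
```

For the paths used in this article, such as examples/hello.tvm, both versions produce the same result.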

Final updates

We also had to make a few small changes to adopt the new Function struct, which represents a compiled function together with its bytecode instructions.

// src/types.h
#ifndef TINY_VM_TYPES_H
#define TINY_VM_TYPES_H

#include <pthread.h>
#include "compiler/compiler.h"

//...

// Thread context
typedef struct ThreadContext {
LocalScope* local_scope;
const Function* current_function; // <-- added the Function being executed
int pc; // Points to current bytecode instruction
pthread_t thread;
int thread_id;
int is_running;
struct VM* vm;
// removed the function name, as it is redundant now
} ThreadContext;

//...

// VM state
typedef struct VM {
//...

// Function management
// FYI: Function** is a pointer to an array of Function* pointers
Function** functions; // <-- updated this
int function_count;
int function_capacity;
pthread_mutex_t function_mgmt_lock;
} VM;

#endif
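
To illustrate how current_function and pc are meant to work together, here is a minimal sketch of a fetch-and-dispatch loop. It uses small stand-in types (MiniOp, MiniInstruction, MiniFunction) and a run_function helper that are invented for this example, and it treats locals as a plain array. TinyVM's real execution code may look quite different.

```c
#include <stdint.h>

// Stand-in types for illustration only (not TinyVM's real definitions)
typedef enum { MINI_NOP, MINI_LOAD_CONST, MINI_ADD, MINI_RETURN } MiniOp;

typedef struct {
    MiniOp opcode;
    uint16_t dst;      // Index of the destination local
    uint16_t a;        // Index of the first operand (MINI_ADD)
    uint16_t b;        // Index of the second operand (MINI_ADD)
    int32_t constant;  // Value for MINI_LOAD_CONST
} MiniInstruction;

typedef struct {
    const MiniInstruction* byte_code;
    int code_length;
} MiniFunction;

// Executes `function` against `locals` and returns how many instructions ran
int run_function(const MiniFunction* function, int32_t* locals) {
    int pc = 0;  // Index of the next instruction to execute
    int executed = 0;

    while (pc < function->code_length) {
        const MiniInstruction* instr = &function->byte_code[pc++];
        executed++;

        switch (instr->opcode) {
            case MINI_LOAD_CONST:
                locals[instr->dst] = instr->constant;
                break;
            case MINI_ADD:
                locals[instr->dst] = locals[instr->a] + locals[instr->b];
                break;
            case MINI_RETURN:
                return executed;  // Stop before any remaining instructions
            case MINI_NOP:
                break;
        }
    }
    return executed;
}
```

The point of the sketch is that the loop never looks at source text: it only reads an opcode and a few integer fields, then advances pc.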

Testing the compiler

Let’s test our compiler with a more complex program that demonstrates multiple functions, threading, and synchronization. Here’s our test file (counter.tvm):

function createCounter
setshared counter 1000
print counter

function incrementCounter
lock counter_lock
set increment 10
add counter increment counter
print counter
unlock counter_lock
exit

function decrementCounter
lock counter_lock
set decrement -10
add counter decrement counter
print counter
unlock counter_lock
exit

function main
sync createCounter
async incrementCounter
async decrementCounter
exit

When compiled, this produces bytecode that can be “disassembled” to show:

[Compiler] Compiled 4 functions:
[Compiler] Bytecode for function 'createCounter':
0: SETSHARED counter = 1000
1: PRINT counter
[Compiler] Bytecode for function 'incrementCounter':
0: MONITOR_ENTER counter_lock
1: LOAD_CONST increment = 10
2: ADD counter = increment + counter
3: PRINT counter
4: MONITOR_EXIT counter_lock
5: RETURN
[Compiler] Bytecode for function 'decrementCounter':
0: MONITOR_ENTER counter_lock
1: LOAD_CONST decrement = -10
2: ADD counter = decrement + counter
3: PRINT counter
4: MONITOR_EXIT counter_lock
5: RETURN
[Compiler] Bytecode for function 'main':
0: INVOKE_SYNC createCounter
1: INVOKE_ASYNC incrementCounter
2: INVOKE_ASYNC decrementCounter
3: RETURN

Just as the JDK provides the javap command for bytecode disassembly, we could implement a similar tool for TinyVM. However, that would be primarily a user interface enhancement rather than a core VM feature.
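
As a rough idea of what the core of such a tool might need, the sketch below maps an opcode value back to its mnemonic. The numeric values follow the OpCode enum from compiler.h; the opcode_name helper itself is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper (not part of TinyVM): mnemonic for an opcode value,
// or "UNKNOWN" if the value is outside the OpCode enum (0x00 to 0x0A).
const char* opcode_name(uint8_t opcode) {
    static const char* const names[] = {
        "NOP",           // 0x00
        "PRINT",         // 0x01
        "LOAD_CONST",    // 0x02
        "ADD",           // 0x03
        "SLEEP",         // 0x04
        "SETSHARED",     // 0x05
        "MONITOR_ENTER", // 0x06
        "MONITOR_EXIT",  // 0x07
        "INVOKE_SYNC",   // 0x08
        "INVOKE_ASYNC",  // 0x09
        "RETURN",        // 0x0A
    };
    const size_t count = sizeof(names) / sizeof(names[0]);
    return opcode < count ? names[opcode] : "UNKNOWN";
}
```

A disassembler would read each instruction from the .tvmc file, look up its mnemonic this way, and print the operands next to it, much like the listing above.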

Our implementation vs JVM

Our TinyVM’s compilation process is significantly simpler than the JVM’s sophisticated multi-stage compilation pipeline. Let’s look at how the JVM compiles code and compare the two:

Source Code Processing

TinyVM:

  • Direct text-to-bytecode conversion with basic syntax parsing

JVM: Multi-step compilation through javac including:

  • Parsing source into Abstract Syntax Tree (AST)
  • Semantic analysis and type checking
  • Light optimization (for example constant folding)
  • Generation of .class files with rich metadata

Type System

TinyVM:

  • Simple integer-only operations without type checking

JVM: Comprehensive type system with:

  • Verification of type safety at compile time
  • Generation of type descriptors and signatures
  • Support for generics and type erasure
  • Complex class loading and linking system

Optimization

TinyVM:

  • No optimization phase

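JVM: Multiple optimizations, most of them applied at runtime by the JIT compiler, including:

  • Inlining of small methods
  • Dead code elimination
  • Constant folding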

Class File Format

TinyVM:

  • Simple binary format with basic function and instruction encoding

JVM: Sophisticated .class file format containing:

  • Constant pool with complex entry types
  • Full debugging information
  • Attributes for annotations and metadata
  • Version information
  • Security and verification data

The complete source code for this article is available in the tiny-vm_07_compilation directory of the TinyVM repository.

The next steps

With our compilation pipeline in place, the next crucial step is implementing bytecode execution in TinyVM. This will allow us to complete the circle from source code to efficient execution.
