Skip to main content

2 posts tagged with "compile"

View All Tags

Deep Dive into Java: The Path to Hello World - Part 2

· 9 min read

banner

Continuing from the previous post, let's explore how the code evolves to print "Hello World."

Chapter 2. Compilation and Disassembly

Programming languages have levels.

The closer a programming language is to human language, the higher-level language it is, and the closer it is to the language a computer can understand (machine language), the lower-level language it is. Writing programs in a high-level language makes it easier for humans to understand and increases productivity, but it also creates a gap with machine language, requiring a process to bridge this gap.

The process of a high-level language descending to a lower level is called compilation.

Since Java is not a low-level language, there is a compilation process. Let's take a look at how this compilation process works in Java.

Compilation

As mentioned earlier, Java code cannot be directly executed by the computer. To execute Java code, it needs to be transformed into a form that the computer can read and interpret. This transformation involves the following major steps:

The resulting .class file from compilation is in bytecode. However, it is still not machine code that the computer can execute. The Java Virtual Machine (JVM) reads this bytecode and further processes it into machine code. We will cover how the JVM handles this in the final chapter.

First, let's compile the .java file to create a .class file. You can compile it using the javac command.

// VerboseLanguage.java
public class VerboseLanguage {
public static void main(String[] args) {
System.out.println("Hello World");
}
}
javac VerboseLanguage.java

You can see that the class file has been created. You can run the class file using the java command, and this is the basic flow of running a Java program.

java VerboseLanguage
// Hello World

Are you curious about the contents of the class file? Wondering how the computer reads and executes the language? What secrets lie within this file? It feels like opening Pandora's box.

Expecting something, you open it up, and...

No way!

Only a brief binary content is displayed.

Wait, wasn't the result of compilation supposed to be bytecode...?

Yes, it is bytecode. At the same time, it is also binary code. At this point, let's briefly touch on the differences between bytecode and binary code before moving on.

Binary Code : Code composed of 0s and 1s. While machine language is made up of binary code, not all binary code is machine language.

Bytecode : Code composed of 0s and 1s. However, bytecode is not intended for the machine but for the VM. It is converted into machine code by the VM through processes like the JIT compiler.

Still, as this article claims to be a deep dive, we reluctantly tried to read through the conversion.

Fortunately, our Pandora's box contains only 0s and 1s, with no other hardships or challenges.

While we succeeded in reading it, it is quite difficult to understand the content with just 0s and 1s 🤔

Now, let's decipher this code.

Disassembly

During the compilation process, the code is transformed into bytecode composed of 0s and 1s. As seen earlier, interpreting bytecode directly is quite challenging. Fortunately, the JDK includes tools that help developers read compiled bytecode, making it useful for debugging purposes.

The process of converting bytecode into a more readable form for developers is called disassembly. Sometimes this process can be confused with decompilation, but decompilation results in a higher-level programming language, not assembly language. Also, since the javap documentation clearly uses the term disassemble, we will follow suit.

info

Decompilation refers to representing binary code in a relatively higher-level language, just like before compiling binary. On the other hand, disassembly represents binary code in a minimal human-readable form (assembler language).

Virtual Machine Assembly Language

Let's use javap to disassemble the bytecode. The output is much more readable than just 0s and 1s.

javap -c VerboseLanguage.class
Compiled from "VerboseLanguage.java"
public class VerboseLanguage {
public VerboseLanguage();
Code:
0: aload_0
1: invokespecial #1 // Method java/lang/Object."<init>":()V
4: return

public static void main(java.lang.String[]);
Code:
0: getstatic #7 // Field java/lang/System.out:Ljava/io/PrintStream;
3: ldc #13 // String Hello World
5: invokevirtual #15 // Method java/io/PrintStream.println:(Ljava/lang/String;)V
8: return
}

What can we learn from this?

Firstly, this language is called virtual machine assembly language.

The Java Virtual Machine code is written in the informal “virtual machine assembly language” output by Oracle's javap utility, distributed with the JDK release. - JVM Spec

The format is as follows:

<index> <opcode> [ <operand1> [ <operand2>... ]] [<comment>]

index : Index of the JVM code byte array. It can be thought of as the method's starting offset.

opcode : Mnemonic symbol representing the set of instructions opcode. We remember the order of the rainbow colors as 'ROYGBIV' to distinguish the instruction set. If the rainbow colors represent the instruction set, each syllable of 'ROYGBIV' can be considered as a mnemonic symbol defined to differentiate them.

operandN : Operand of the instruction. The operand of a computer instruction is the address field. It points to where the data to be processed is stored in the constant pool.

Let's take a closer look at the main method part of the disassembled result.

Code:
0: getstatic #7 // Field java/lang/System.out:Ljava/io/PrintStream;
3: ldc #13 // String Hello World
5: invokevirtual #15 // Method java/io/PrintStream.println:(Ljava/lang/String;)V
8: return
  • invokevirtual: Call an instance method
  • getstatic: Get a static field from a class
  • ldc: Load data into the run-time constant pool.

The 3: ldc #13 on the third line means to put an item at index 13, and the item being put is kindly indicated in the comment.

Hello World

Note that bytecode instructions like getstatic and invokevirtual are represented by a single-byte opcode number. For example, getstatic=0xb2, invokevirtual=0xb6, and so on. It can be understood that Java bytecode instructions also have a maximum of 256 different opcodes.

JVM Instruction Set showing the bytecode for invokevirtual

If we look at the bytecode of the main method in hex, it would be as follows:

b2 00 07 12 0d b6

It might still be a bit hard to notice the pattern. As a hint, remember that earlier we mentioned the number before the opcode is the index in the JVM array. Let's slightly change the representation.

arr = [b2, 00, 07, 12, 0d, b6]
  • arr[0] = b2 = getstatic
  • arr[3] = 12 = ldc
  • arr[5] = b6 = invokevirtual

It becomes somewhat clearer what the index meant. The reason for skipping some indices is quite simple: getstatic requires a 2-byte operand, and ldc requires a 1-byte operand. Therefore, the ldc instruction, which is the next instruction after getstatic, is recorded at index 3, skipping 1 and 2. Similarly, skipping 4, the invokevirtual instruction is recorded at index 5.

Lastly, notice the comment (Ljava/lang/String;)V on the 4th line. Through this comment, we can see that in Java bytecode, classes are represented as L;, and void is represented as V. Other types also have their unique representations, summarized as follows:

Java BytecodeTypeDescription
Bbytesigned byte
CcharUnicode character
Ddoubledouble-precision floating-point value
Ffloatsingle-precision floating-point value
Iintinteger
Jlonglong integer
L<classname>;referencean instance of class <classname>
Sshortsigned short
Zbooleantrue or false
[referenceone array dimension

Using the -verbose option, you can see a more detailed disassembly result, including the constant pool. It would be interesting to examine the operands and constant pool together.

  Compiled from "VerboseLanguage.java"
public class VerboseLanguage
minor version: 0
major version: 65
flags: (0x0021) ACC_PUBLIC, ACC_SUPER
this_class: #21 // VerboseLanguage
super_class: #2 // java/lang/Object
interfaces: 0, fields: 0, methods: 2, attributes: 1
Constant pool:
#1 = Methodref #2.#3 // java/lang/Object."<init>":()V
#2 = Class #4 // java/lang/Object
#3 = NameAndType #5:#6 // "<init>":()V
#4 = Utf8 java/lang/Object
#5 = Utf8 <init>
#6 = Utf8 ()V
#7 = Fieldref #8.#9 // java/lang/System.out:Ljava/io/PrintStream;
#8 = Class #10 // java/lang/System
#9 = NameAndType #11:#12 // out:Ljava/io/PrintStream;
#10 = Utf8 java/lang/System
#11 = Utf8 out
#12 = Utf8 Ljava/io/PrintStream;
#13 = String #14 // Hello World
#14 = Utf8 Hello World
#15 = Methodref #16.#17 // java/io/PrintStream.println:(Ljava/lang/String;)V
#16 = Class #18 // java/io/PrintStream
#17 = NameAndType #19:#20 // println:(Ljava/lang/String;)V
#18 = Utf8 java/io/PrintStream
#19 = Utf8 println
#20 = Utf8 (Ljava/lang/String;)V
#21 = Class #22 // VerboseLanguage
#22 = Utf8 VerboseLanguage
#23 = Utf8 Code
#24 = Utf8 LineNumberTable
#25 = Utf8 main
#26 = Utf8 ([Ljava/lang/String;)V
#27 = Utf8 SourceFile
#28 = Utf8 VerboseLanguage.java
{
public VerboseLanguage();
descriptor: ()V
flags: (0x0001) ACC_PUBLIC
Code:
stack=1, locals=1, args_size=1
0: aload_0
1: invokespecial #1 // Method java/lang/Object."<init>":()V
4: return
LineNumberTable:
line 1: 0

public static void main(java.lang.String[]);
descriptor: ([Ljava/lang/String;)V
flags: (0x0009) ACC_PUBLIC, ACC_STATIC
Code:
stack=2, locals=1, args_size=1
0: getstatic #7 // Field java/lang/System.out:Ljava/io/PrintStream;
3: ldc #13 // String Hello World
5: invokevirtual #15 // Method java/io/PrintStream.println:(Ljava/lang/String;)V
8: return
LineNumberTable:
line 3: 0
line 4: 8
}
SourceFile: "VerboseLanguage.java"

Conclusion

In the previous chapter, we explored why a verbose process is required to print Hello World. In this chapter, we looked at the compilation and disassembly processes before printing Hello World. Next, we will finally examine the execution flow of the Hello World printing method with the JVM.

Reference

Deep Dive into Java: The Path to Hello World - Part 1

· 9 min read

banner

In the world of programming, it always starts with printing the sentence Hello World. It's like an unwritten rule.

# hello.py
print("Hello World")
python hello.py
// Hello World

Python? Excellent.

// hello.js
console.log("Hello World");
node hello.js
// Hello World

JavaScript? Not bad.

public class VerboseLanguage {
public static void main(String[] args) {
System.out.println("Hello World");
}
}
javac VerboseLanguage.java
java VerboseLanguage
// Hello World

However, Java feels like it's from a different world. We haven't even mentioned yet that the class name must match the file name.

What is public, what is class, what is static, and going through void, main, String[], and System.out.println, we finally reach the string "Hello World". Now, let's go learn another language.1

Even for simply printing "Hello World", Java demands quite a bit of background knowledge. Why does Java require such verbose processes?

This series is divided into 3 chapters. The goal is to delve into what happens behind the scenes to print the 2 words " Hello World" in detail. The specific contents of each chapter are as follows:

  • In the first chapter, we introduce the reasons behind the Hello World as the starting point.
  • In the second chapter, we examine the compiled class files and how the computer interprets and executes Java code.
  • Finally, we explore how the JVM loads and executes public static void main and the operating principles behind it.

By combining the contents of the 3 chapters, we can finally grasp the concept of "Hello World". It's quite a long journey, so let's take a deep breath and embark on it.

Chapter 1. Why?

Before printing Hello World in Java, there are several "why moments" that need to be considered.

Why must the class name match the file name?

More precisely, it is the name of the public class that must match the file name. Why is that?

Java programs are not directly understandable by computers. A virtual machine called JVM assists the computer in executing the program. To make a Java program executable by the computer, it needs to go through several steps to convert it into machine code that the JVM can interpret. The first step is using a compiler to convert the program into bytecode that the JVM can interpret. The converted bytecode is then passed through an interpreter inside the JVM to be translated into machine code and executed.

Let's briefly look at the compilation process.

public class Outer {
public static void main(String[] args) {
System.out.println("This is Outer class");
}

private class Inner {
}
}
javac Outer.java
Permissions Size User   Date Modified Name
.rw-r--r-- 302 haril 30 Nov 16:09 Outer$Inner.class
.rw-r--r-- 503 haril 30 Nov 16:09 Outer.class
.rw-r--r-- 159 haril 30 Nov 16:09 Outer.java

Java generates a .class file for every class at compile time.

Now, the JVM needs to find the main method for program execution. How does it know where the main method is?

Why does it have to find main() specifically? Just wait a little longer.

If the Java file name does not match the public class name, the Java interpreter has to read all class files to find the main method. If the file name matches the name of the public class, the Java interpreter can better identify the file it needs to interpret.

Imagine a file named Java1000 with 1000 classes inside. To identify where main() is among the 1000 classes, the interpreter would have to examine all the class files.

However, if the file name matches the name of the public class, it can access main() more quickly (since main exists in the public class), and it can easily access other classes since all the logic starts from main().

Why must it be public?

The JVM needs to find the main method inside the class. If the JVM, which accesses the class from outside, needs to find a method inside the class, that method must be public. In fact, changing the access modifier to private will result in an error message instructing you to declare main as public.

Error: Main method not found in class VerboseLanguage, please define the main method as:
public static void main(String[] args)

Why must it be static?

The JVM has found the public main() method. However, to invoke this method, an object must first be created. Does the JVM need this object? No, it just needs to be able to call main. By declaring it as static, the JVM does not need to create an unnecessary object, saving memory.

Why must it be void?

The end of the main method signifies the end of Java's execution. The JVM cannot do anything with the return value of main, so the presence of a return value is meaningless. Therefore, it is natural to declare it as void.

Why must it be named main?

The method name main is designed for the JVM to find the entry point for running the application.

Although the term "design" sounds grand, in reality, it is hard-coded to find the method named main. If the name to be found was not main but haril, it would have searched for a method named haril. Of course, the Java creators likely had reasons for choosing main, but that's about it.

mainClassName = GetMainClassName(env, jarfile);
mainClass = LoadClass(env, classname);

// Find the main method
mainID = (*env)->GetStaticMethodID(env, mainClass, "main", "([Ljava/lang/String;)V");

jbject obj = (*env)->ToReflectedMethod(env, mainClass, mainID, JNI_TRUE);

Why args?

Until now, we omitted mentioning String[] args in main(). Why must this argument be specified, and why does an error occur if it is omitted?

As public static void main(String[] args) is the entry point of a Java application, this argument must come from outside the Java application.

All types of standard input are entered as strings.

This is why args is declared as a string array. If you think about it, it makes sense. Before the Java application even runs, can you create custom object types directly? 🤔

So why is args necessary?

By passing arguments in a simple way from outside to inside, you can change the behavior of a Java application, a mechanism widely used since the early days of C programming to control program behavior. Especially for simple applications, this method is very effective. Java simply adopted this widely used method.

The reason String[] args cannot be omitted is that Java only allows one public static void main(String[] args) as the entry point. The Java creators thought it would be less confusing to declare and not use args than to allow it to be omitted.

System.out.println

Finally, we can start talking about the method related to output.

Just to mention it again, in Python it was print("Hello World"). 2

A Java program runs not directly on the operating system but on a virtual machine called JVM. This allows Java programs to be executed anywhere regardless of the operating system, but it also makes it difficult to use specific functions provided by the operating system. This is why coding at the system level, such as creating a CLI in Java or collecting OS metrics, is challenging.

However, there is a way to leverage limited OS functionality (JNI), and System provides this functionality. Some of the key functions include:

  • Standard input
  • Standard output
  • Setting environment variables
  • Terminating the running application and returning a status code

To print Hello World, we are using the standard output function of System.

In fact, as you follow the flow of System.out.println, you will encounter a writeBytes method with the native keyword attached, which delegates the operation to C code and transfers it to standard output.

// FileOutputStream.java
private native void writeBytes(byte b[], int off, int len, boolean append)
throws IOException;

The invocation of a method with the native keyword works through the Java Native Interface (JNI). This will be covered in a later chapter.

String

Strings in Java are somewhat special. No, they seem quite special. They are allocated separate memory space, indicating they are definitely treated as special. Why is that?

It is important to note the following properties of strings:

  • They can become very large.
  • They are relatively frequently reused.

Therefore, strings are designed with a focus on how to reuse them once created. To fully understand how large string data is managed in memory, you need an understanding of the topics to be covered later. For now, let's briefly touch on the principles of memory space saving.

First, let's look at how strings are declared in Java.

String greeting = "Hello World";

Internally, it works as follows:

Strings are created in the String Constant Pool and have immutable properties. Once a string is created, it does not change, and if the same string is found in the Constant Pool when creating a new string, it is reused.

We will cover JVM Stack, Frame, Heap in the next chapter.

Another way to declare strings is by instantiation.

String greeting = new String("Hello World");

This method is rarely used because there is a difference in internal behavior, as shown below.

When a string is used directly without the new keyword, it is created in the String Constant Pool and can be reused. However, if instantiated with the new keyword, it is not created in the Constant Pool. This means the same string can be created multiple times, potentially wasting memory space.

Summary

In this chapter, we answered the following questions:

  • Why must the .java file name match the class name?
  • Why must it be public static void main(String[] args)?
  • The flow of the output operation
  • The characteristics of strings and the basic principles of their creation and use

In the next chapter, we will compile Java code ourselves and explore how bytecode is generated, its relationship with memory areas, and more.

Reference

Footnotes

  1. Life Coding Python

  2. Life Coding Python