
Deep Dive into Java: The Path to Hello World - Part 1

· 9 min read

In the world of programming, it always starts with printing the sentence Hello World. It's like an unwritten rule.

# hello.py
print("Hello World")
python hello.py
// Hello World

Python? Excellent.

// hello.js
console.log("Hello World");
node hello.js
// Hello World

JavaScript? Not bad.

public class VerboseLanguage {
    public static void main(String[] args) {
        System.out.println("Hello World");
    }
}
javac VerboseLanguage.java
java VerboseLanguage
// Hello World

However, Java feels like it's from a different world. We haven't even mentioned yet that the class name must match the file name.

What is public? What is class? What is static? Only after getting past void, main, String[], and System.out.println do we finally reach the string "Hello World". Now, let's go learn another language.[1]

Even for simply printing "Hello World", Java demands quite a bit of background knowledge. Why does Java require such verbose processes?

This series is divided into 3 chapters. The goal is to examine in detail what happens behind the scenes to print the two words "Hello World". The specific contents of each chapter are as follows:

  • In the first chapter, we look at the reasons behind each part of the Hello World program, our starting point.
  • In the second chapter, we examine the compiled class files and how the computer interprets and executes Java code.
  • Finally, we explore how the JVM loads and executes public static void main and the operating principles behind it.

By combining the contents of the 3 chapters, we can finally grasp the concept of "Hello World". It's quite a long journey, so let's take a deep breath and embark on it.

Chapter 1. Why?​

Before printing Hello World in Java, there are several "why moments" that need to be considered.

Why must the class name match the file name?​

More precisely, it is the name of the public class that must match the file name. Why is that?

Computers cannot run Java source code directly. A virtual machine called the JVM executes the program on the computer's behalf, and getting there takes several steps. First, a compiler converts the source code into bytecode that the JVM can interpret. That bytecode is then handed to the JVM, whose interpreter (and JIT compiler) translates it into machine code and executes it.

Let's briefly look at the compilation process.

public class Outer {
    public static void main(String[] args) {
        System.out.println("This is Outer class");
    }

    private class Inner {
    }
}
javac Outer.java
Permissions Size User   Date Modified Name
.rw-r--r-- 302 haril 30 Nov 16:09 Outer$Inner.class
.rw-r--r-- 503 haril 30 Nov 16:09 Outer.class
.rw-r--r-- 159 haril 30 Nov 16:09 Outer.java

Java generates a .class file for every class at compile time.

Now, the JVM needs to find the main method for program execution. How does it know where the main method is?

Why does it have to find main() specifically? Just wait a little longer.

If a file name did not have to match the name of its public class, the compiler would have to scan every source file to find where a given class is declared. Because the names match, it can locate the source for a public class directly from the class name.

Imagine a file named Java1000 with 1000 classes inside. To find out which of them declares main(), every one of the 1000 classes would have to be examined.

However, if the file name matches the name of the public class, the class that holds main() can be found more quickly (since main conventionally lives in the public class), and the other classes are reached from there, because all the logic starts from main().

Why must it be public?​

The JVM needs to find the main method inside the class. If the JVM, which accesses the class from outside, needs to find a method inside the class, that method must be public. In fact, changing the access modifier to private will result in an error message instructing you to declare main as public.

Error: Main method not found in class VerboseLanguage, please define the main method as:
public static void main(String[] args)

Why must it be static?​

The JVM has found the public main() method. Normally, an instance method can only be invoked on an object, so an object would have to be created first. Does the JVM need this object? No, it just needs to be able to call main. Because main is declared static, the JVM can call it without creating an unnecessary object, which also saves memory.

Why must it be void?​

When main returns, the program normally ends (assuming no other non-daemon threads are still running). The JVM has nothing to do with a value returned from main, so a return value would be meaningless. Therefore, it is natural to declare it as void.

Why must it be named main?​

The method name main is designed for the JVM to find the entry point for running the application.

Although the term "design" sounds grand, in reality, it is hard-coded to find the method named main. If the name to be found was not main but haril, it would have searched for a method named haril. Of course, the Java creators likely had reasons for choosing main, but that's about it.

mainClassName = GetMainClassName(env, jarfile);
mainClass = LoadClass(env, classname);

// Find the main method
mainID = (*env)->GetStaticMethodID(env, mainClass, "main", "([Ljava/lang/String;)V");

jobject obj = (*env)->ToReflectedMethod(env, mainClass, mainID, JNI_TRUE);
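To make the lookup concrete, here is a rough Java analogue written with reflection. It is only a sketch: the real launcher is C code going through JNI, and the names MainLookupSketch, Target, findEntryPoint, and launch are made up for illustration. It shows the same three requirements at work: the name main is hard-coded, only public methods are visible to the lookup, and the method is invoked without an instance.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

class MainLookupSketch {

    // Hypothetical application class, standing in for VerboseLanguage.
    static class Target {
        static String received = "";

        public static void main(String[] args) {
            received = String.join(" ", args);
        }
    }

    // Looks up "main" by its hard-coded name and signature.
    static Method findEntryPoint(Class<?> cls) {
        try {
            // getMethod only returns public methods, so a private main is not found.
            Method main = cls.getMethod("main", String[].class);
            boolean isStatic = Modifier.isStatic(main.getModifiers());
            boolean isVoid = main.getReturnType() == void.class;
            if (!isStatic || !isVoid) {
                throw new IllegalStateException("main must be public static void");
            }
            return main;
        } catch (NoSuchMethodException e) {
            throw new IllegalStateException("Main method not found in " + cls.getName(), e);
        }
    }

    // Calls main without creating an instance: the receiver argument is null.
    static void launch(Class<?> cls, String... args) {
        try {
            findEntryPoint(cls).invoke(null, (Object) args);
        } catch (IllegalAccessException | InvocationTargetException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        launch(Target.class, "Hello", "World");
        System.out.println(Target.received); // Hello World
    }
}
```

If the lookup string were "haril" instead of "main", the same code would look for a method named haril, which is all the "design" amounts to.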

Why args?​

Until now, we omitted mentioning String[] args in main(). Why must this argument be specified, and why does an error occur if it is omitted?

As public static void main(String[] args) is the entry point of a Java application, this argument must come from outside the Java application.

Command-line arguments always arrive as strings.

This is why args is declared as a string array. If you think about it, it makes sense. Before the Java application even runs, how could you create custom object types to pass in? 🤔

So why is args necessary?

By passing arguments in a simple way from outside to inside, you can change the behavior of a Java application, a mechanism widely used since the early days of C programming to control program behavior. Especially for simple applications, this method is very effective. Java simply adopted this widely used method.

The reason String[] args cannot be omitted is that Java only allows one public static void main(String[] args) as the entry point. The Java creators thought it would be less confusing to declare and not use args than to allow it to be omitted.
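As a small illustration of controlling behavior from outside, the sketch below reads a repeat count and a name from args. The class name RepeatGreeting and the argument layout (count first, then name) are assumptions made for this example. Note that even the number arrives as a string and has to be converted.

```java
class RepeatGreeting {

    // Builds the output from command-line style arguments.
    // args[0]: how many times to greet (defaults to 1)
    // args[1]: who to greet (defaults to "World")
    static String render(String[] args) {
        // Everything in args is a String, so the count must be parsed.
        // Integer.parseInt throws NumberFormatException for non-numeric input.
        int times = args.length > 0 ? Integer.parseInt(args[0]) : 1;
        String name = args.length > 1 ? args[1] : "World";

        StringBuilder out = new StringBuilder();
        for (int i = 0; i < times; i++) {
            out.append("Hello ").append(name).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(render(args));
    }
}
```

Running `java RepeatGreeting 2 Java` would print "Hello Java" twice, while `java RepeatGreeting` falls back to a single "Hello World".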

System.out.println​

Finally, we can start talking about the method related to output.

Just to mention it again, in Python it was print("Hello World").[2]

A Java program runs not directly on the operating system but on a virtual machine called JVM. This allows Java programs to be executed anywhere regardless of the operating system, but it also makes it difficult to use specific functions provided by the operating system. This is why coding at the system level, such as creating a CLI in Java or collecting OS metrics, is challenging.

However, there is a way to leverage limited OS functionality (JNI), and System provides this functionality. Some of the key functions include:

  • Standard input
  • Standard output
  • Reading environment variables
  • Terminating the running application and returning a status code

To print Hello World, we are using the standard output function of System.

In fact, as you follow the flow of System.out.println, you will encounter a writeBytes method with the native keyword attached, which delegates the operation to C code and transfers it to standard output.

// FileOutputStream.java
private native void writeBytes(byte b[], int off, int len, boolean append)
throws IOException;

The invocation of a method with the native keyword works through the Java Native Interface (JNI). This will be covered in a later chapter.

String​

Strings in Java are somewhat special. No, they seem quite special. They are allocated separate memory space, indicating they are definitely treated as special. Why is that?

It is important to note the following properties of strings:

  • They can become very large.
  • They are relatively frequently reused.

Therefore, strings are designed with a focus on how to reuse them once created. To fully understand how large string data is managed in memory, you need an understanding of the topics to be covered later. For now, let's briefly touch on the principles of memory space saving.

First, let's look at how strings are declared in Java.

String greeting = "Hello World";

Internally, it works as follows:

Strings are created in the String Constant Pool and have immutable properties. Once a string is created, it does not change, and if the same string is found in the Constant Pool when creating a new string, it is reused.

We will cover JVM Stack, Frame, Heap in the next chapter.

Another way to declare strings is by instantiation.

String greeting = new String("Hello World");

This method is rarely used because there is a difference in internal behavior, as shown below.

When a string literal is used directly without the new keyword, it is placed in the String Constant Pool and can be reused. However, new String(...) always creates a separate object on the heap outside the pool, even when an equal string already exists in it. This means the same string can be created multiple times, potentially wasting memory space.
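The difference can be observed by comparing references with ==. The following is a minimal sketch (the class name StringPoolDemo is made up for this example); intern() is the standard String method that returns the pooled instance.

```java
import java.util.Arrays;

class StringPoolDemo {

    // Returns, in order:
    //   literal == another identical literal
    //   literal == new String(...)
    //   literal == new String(...).intern()
    //   literal.equals(new String(...))
    static boolean[] compare() {
        String literal = "Hello World";
        String sameLiteral = "Hello World";          // reuses the pooled instance
        String created = new String("Hello World");  // always a new heap object

        return new boolean[] {
            literal == sameLiteral,
            literal == created,
            literal == created.intern(),             // intern() returns the pooled instance
            literal.equals(created)                  // contents are equal either way
        };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(compare())); // [true, false, true, true]
    }
}
```

This is also why strings should be compared with equals() rather than ==: reference equality depends on how the string was created.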

Summary​

In this chapter, we answered the following questions:

  • Why must the .java file name match the class name?
  • Why must it be public static void main(String[] args)?
  • The flow of the output operation
  • The characteristics of strings and the basic principles of their creation and use

In the next chapter, we will compile Java code ourselves and explore how bytecode is generated, its relationship with memory areas, and more.

Footnotes

  1. Life Coding Python

  2. Life Coding Python

How Many Concurrent Requests Can a Single Server Application Handle?

· 14 min read

Overview​

How many concurrent users can a Spring MVC web application accommodate? 🤔

To estimate roughly how many users a single server can handle while still providing stable service, this article explores how network traffic changes, focusing on the Tomcat configuration used by Spring MVC.

For the sake of convenience, the following text will be written in a conversational tone 🙏

info

If you find any technical errors, typos, or incorrect information, please let us know in the comments. Your feedback is greatly appreciated 🙇‍♂️

[System Design Interview] Implementing a URL Shortener from Scratch

· 5 min read

info

You can check the code on GitHub.

Overview​

URL shortening began as a way to keep URLs from being broken up in email or SMS messages. Nowadays, it is more actively used for sharing links on social media platforms like Twitter or Instagram. A short URL is easier to read, and the service can provide additional features such as collecting user statistics before redirecting to the original URL.

In this article, we will implement a URL shortener from scratch and explore how it works.

What is a URL Shortener?​

Let's first take a look at the result.

You can run the URL shortener we will implement in this article directly with the following command:

docker run -d -p 8080:8080 songkg7/url-shortener

Here is how to use it. Simply input the long URL you want to shorten as the value of longUrl.

curl -X POST --location "http://localhost:8080/api/v1/shorten" \
-H "Content-Type: application/json" \
-d "{
\"longUrl\": \"https://www.google.com/search?q=url+shortener&sourceid=chrome&ie=UTF-8\"
}"
# You will receive a random value like tN47tML.

Now, if you access http://localhost:8080/tN47tML in your web browser, you will see that it correctly redirects to the original URL.

Now, let's see how we can shorten URLs.

Rough Design​

Shortening URLs​

  1. Generate an ID before storing the longUrl.
  2. Encode the ID to base62 to create the shortUrl.
  3. Store the ID, shortUrl, and longUrl in the database.

Memory is finite and relatively expensive. A relational database (RDB) can be queried quickly through indexes and is cheaper than memory, so we will use an RDB to manage URLs.

To manage URLs, we first need to settle on an ID generation strategy. There are various methods for generating IDs, but covering them here would take too long, so we will skip them. I will simply use the current timestamp as the ID.

Base62 Conversion​

By using ULID, you can generate a unique ID that includes a timestamp.

val id: Long = Ulid.fast().time // e.g., 3145144998701, used as a primary key

Converting this number to base62 (here with the digits ordered 0-9, A-Z, a-z) gives the following string.

tN47tML

This string is stored in the database as the shortUrl.

id | short | long
3145144998701 | tN47tML | https://www.google.com/search?q=url+shortener&sourceid=chrome&ie=UTF-8

The retrieval process will proceed as follows:

  1. A GET request is made to localhost:8080/tN47tML.
  2. Decode tN47tML from base62.
  3. Obtain the primary key 3145144998701 and query the database.
  4. Redirect the request to the longUrl.

Now that we have briefly looked at it, let's implement it and delve into more details.

Implementation​

Just like the previous article on Consistent Hashing, we will implement it ourselves. Fortunately, implementing a URL shortener is not that difficult.

Model​

First, we implement the model to receive requests from users. We simplified the structure to only receive the URL to be shortened.

data class ShortenRequest(
    val longUrl: String
)

We implement a Controller to handle POST requests.

@PostMapping("/api/v1/shorten")
fun shorten(@RequestBody request: ShortenRequest): ResponseEntity<ShortenResponse> {
    val url = urlShortenService.shorten(request.longUrl)
    return ResponseEntity.ok(ShortenResponse(url))
}

Base62 Conversion​

Finally, the most crucial part. After generating an ID, we encode it to base62 to shorten it. This shortened string becomes the shortUrl. Conversely, we decode the shortUrl to find the ID and use it to query the database to retrieve the longUrl.

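import java.math.BigInteger

// Digit order: 0-9, then a-z, then A-Z. With this order, tN47tML decodes to 1692158228843
// (the sample ID 3145144998701 in the design section corresponds to the order 0-9, A-Z, a-z).
private const val BASE62 = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
private val RADIX = BigInteger.valueOf(62)

class Base62Conversion : Conversion {
    override fun encode(input: Long): String {
        require(input >= 0) { "ID must not be negative" }
        if (input == 0L) return BASE62[0].toString() // the loop below would otherwise return ""
        val sb = StringBuilder()
        var num = BigInteger.valueOf(input)
        while (num > BigInteger.ZERO) {
            // Least significant digit first; the result is reversed at the end.
            val remainder = num % RADIX
            sb.append(BASE62[remainder.toInt()])
            num /= RADIX
        }
        return sb.reverse().toString()
    }

    override fun decode(input: String): Long {
        var num = BigInteger.ZERO
        for (c in input) {
            val digit = BASE62.indexOf(c)
            require(digit >= 0) { "Not a base62 character: $c" }
            num = num * RADIX + BigInteger.valueOf(digit.toLong())
        }
        return num.toLong()
    }
}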

The length of the shortened URL grows with the size of the ID: each additional base62 character multiplies the range of representable IDs by 62. The smaller the generated ID, the shorter the URL can be.

If you want the shortened URL to be at most 8 characters long, the ID must stay below 62^8 (about 2.18 × 10^14). Therefore, how you generate the ID is also crucial. As mentioned earlier, to keep this article simple, we handled this part with a timestamp value.
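The relationship between ID size and URL length can be checked with a few lines of code. The sketch below is written in Java rather than Kotlin, and the class and method names (Base62Length, length, exclusiveUpperBound) are made up for illustration. It counts the base62 digits an ID needs and computes the first ID that no longer fits in a given number of characters.

```java
class Base62Length {
    private static final long BASE = 62;

    // Number of base62 characters needed to write a non-negative ID.
    static int length(long id) {
        if (id < 0) {
            throw new IllegalArgumentException("ID must not be negative");
        }
        int digits = 1;
        while (id >= BASE) {
            id /= BASE;
            digits++;
        }
        return digits;
    }

    // Smallest ID that no longer fits in maxChars characters, i.e. 62^maxChars.
    // Math.multiplyExact throws ArithmeticException if the result overflows a long.
    static long exclusiveUpperBound(int maxChars) {
        long bound = 1;
        for (int i = 0; i < maxChars; i++) {
            bound = Math.multiplyExact(bound, BASE);
        }
        return bound;
    }

    public static void main(String[] args) {
        System.out.println(length(3145144998701L));   // 7
        System.out.println(exclusiveUpperBound(8));   // 218340105584896
    }
}
```

A millisecond timestamp is currently in the range of 10^12, which is why the sample ID produces a 7-character string.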

Test​

Let's send a POST request with curl to shorten a random URL.

curl -X POST --location "http://localhost:8080/api/v1/shorten" \
-H "Content-Type: application/json" \
-d "{
\"longUrl\": \"https://www.google.com/search?q=url+shortener&sourceid=chrome&ie=UTF-8\"
}"

You can confirm that it correctly redirects by accessing http://localhost:8080/{shortUrl}.

Conclusion​

Here are some areas for improvement:

  • By controlling the ID generation strategy more precisely, you can further shorten the shortUrl.
    • If there is heavy traffic, you must consider issues related to concurrency.
    • Snowflake
  • Using a shorter domain name for the host part can further shorten the URL.
  • Applying cache to the Persistence Layer can achieve faster responses.

Exploring Docker Compose Support in Spring Boot 3.1

· 3 min read

Let's take a brief look at the Docker Compose Support introduced in Spring Boot 3.1.

info

Please provide feedback if there are any inaccuracies!

Overview​

When developing with the Spring framework, it seems that using Docker for setting up DB environments is more common than installing them directly on the local machine. Typically, the workflow involves:

  1. Using docker run before bootRun to prepare the DB in a running state
  2. Performing development and validation tasks using bootRun
  3. Stopping bootRun and using docker stop to stop the container DB

The process of running and stopping Docker before and after development tasks used to be quite cumbersome. However, starting from Spring Boot 3.1, you can use a docker-compose.yaml file to synchronize the lifecycle of Spring and Docker containers.

Contents​

First, add the dependency:

dependencies {
    // ...
    developmentOnly 'org.springframework.boot:spring-boot-docker-compose'
    // ...
}

Next, create a compose file as follows:

services:
  elasticsearch:
    image: 'docker.elastic.co/elasticsearch/elasticsearch:7.17.10'
    environment:
      - 'ELASTIC_PASSWORD=secret'
      - 'discovery.type=single-node'
      - 'xpack.security.enabled=false'
    ports:
      - '9200' # random port mapping
      - '9300'

image

During bootRun, the compose file is automatically recognized, and the docker compose up operation is executed first.

Mapping the container port to a random host port would normally mean updating application.yml with the new port every time the containers are recreated after docker compose down. Fortunately, starting from Spring Boot 3.1, once you write the compose file, Spring Boot discovers the mapped ports and configures the connection for you. It's incredibly convenient!

If you need to change the path to the compose file, simply modify the file property:

spring:
  docker:
    compose:
      file: infrastructure/compose.yaml

There are also properties related to lifecycle management, allowing you to appropriately adjust the container lifecycle. If you don't want the container to stop every time you shut down Boot, you can use the start_only option:

spring:
  docker:
    compose:
      lifecycle-management: start_and_stop # none, start_only

There are various other options available, so exploring them should help you choose what you need.

image

Conclusion​

No matter how much test code you write, verifying the interaction with the actual DB was essential during the development process. Setting up that environment felt like a tedious chore. While container technology made configuration much simpler, remembering to run docker commands before and after starting Spring Boot was definitely a hassle.

Now, starting from Spring Boot 3.1, developers no longer have to remember to start or stop containers, and forgotten containers no longer sit around consuming memory. This lets developers focus more on development. The seamless integration of Docker with Spring is both fascinating and convenient. Give it a try!

A Yearlong Blogging Journey

· 5 min read

Overview​

This post holds a significant meaning for me. It is intended to be the final entry of the blog journey I have been on since the beginning of the year. As a review, I will summarize my blogging experience up to this point.

Criteria for Choosing a Blogging Platform​

I was looking for a platform that met the following criteria to facilitate convenient posting:

  • Easy use of Markdown
  • Convenient image uploading
  • Ongoing maintenance (especially for open-source platforms)

Tistory lacked robust Markdown support and had a cumbersome image upload process. Velog, although popular among developers, seemed to have been neglected recently, so I decided against it. In the end, I found GitHub Pages + Jekyll to be the most rational choice: it fully supports Markdown, makes image uploading easy, and can be maintained for the long term. Managing Jekyll requires some knowledge of Ruby, but I had a basic understanding and was willing to learn as needed, and I have been running this setup ever since.

SEO Struggles​

Despite my efforts to get all pages indexed, things haven't gone as smoothly as I hoped. When will the crawling finally start?

However, this journey has led me to study the field more and realize the importance of patience. Even though it's taking time for the pages to get indexed, I believe that with increased traffic, indexing will happen naturally. Gradually, I have noticed an increase in the number of indexed pages. While I am publishing content faster than the indexing speed, I have to accept that I cannot control the time it takes for the pages to get indexed and appear in search results due to Google's crawling policies.

image

Evolution of Content​

Initially, when I started my blog on Tistory, I focused on algorithm problem-solving as I was diving into algorithm studies.

image

As I delved into practical work, I realized that algorithm solutions are better explained on algorithmic problem-solving platforms, and simply listing knowledge felt redundant compared to consulting official documentation. I did not want my blog to become just another mundane one.

My desire for a distinctive, personal blog that stands apart from others has driven me to keep improving the quality and uniqueness of my content. Posts I am personally satisfied with include the story of creating open-source projects and pieces where I implement concepts rather than just reading about them.

image

info

In 2024, it evolved further into a blog using Docusaurus 😄.

Open-Sourcing Obsidian Plugin​

I have developed a plugin called O2 specifically for blog posting. It bridges the workflow between Obsidian and Jekyll. Developing this plugin required me to learn TypeScript as well 😅.

Fortunately, around 400 users have joined me in using this plugin as of July 2023. Although most probably uninstalled it within 10 minutes... DAU 1...

image

Initially, there were many bugs, but now, after addressing numerous minor issues, the plugin has entered a stable phase. If you are an Obsidian user who uses Jekyll as a blogging platform, I would appreciate it if you could show some interest in this plugin!

image

I have also obtained the plugin dev role in the Obsidian Discord Community and am actively participating. Feel free to ask any Obsidian-related questions!

Growth Metrics​

To maintain consistent motivation and direction when starting my blog, I believed that using Google Analytics was essential. Seeing the graph gradually trend upwards gave me a sense of accomplishment. Some argue that having few initial blog visitors can have a negative impact, but personally, it motivated me. It sparked a desire to attract more people to my blog.

Below is the growth rate of my blog over the past year.

image

Despite the dynamic appearance of the graph, the numbers are not as high compared to many influential bloggers. That's the paradox of statistics... Nevertheless, the overall upward trend is encouraging.

Participating in the writing program has made me pay more attention to the quality of my posts, and as a result, external links have started to generate more traffic. Especially, being curated frequently on the Serfit community site has significantly boosted traffic. I am grateful to the curator who selected my mediocre posts. I will strive to write more diligently and refine my work in the future.

Future Goals​

When summarizing my goals for the second half of this year and the next year, they can be outlined as follows:

  1. Strive to publish high-quality, distinctive, and practical posts beyond simple knowledge sharing.
  2. Reach over 30,000 new users.
  3. Publish at least two posts per month.
  4. Start posting in English for language learning purposes.

I am particularly pondering the best approach and platform for English posts. In the future, I would like to post in languages other than English, so considering multilingual support will be crucial. As I progress through the writing program (please select me for the 9th cohort), I will further refine these plans.

Thank you for accompanying me on my journey so far. I look forward to your continued support 🙏.

Saving EC2 Costs with Jenkins

· 3 min read

I would like to share a very simple method for optimizing resource costs when dealing with batch applications that need to run at specific times and under specific conditions.

Problem​

  1. Batches only run at specific times, for example calculations that run at regular intervals such as daily, monthly, or yearly.
  2. Speed of response is not crucial; ensuring that the batch runs is the priority.
  3. Maintaining an EC2 instance for 24 hours just for resources needed at specific times is inefficient.
  4. Is it possible to have the EC2 instance ready only when the cloud server resources are needed?

Of course, it is possible. There are various automation solutions such as AWS ECS and AWS EKS, but here we will assume that batches and EC2 servers are managed directly with Jenkins and set up the environment accordingly.

Architecture​

With this infrastructure design, you can ensure that costs are incurred only when resources are needed for batch execution.

Jenkins​

Jenkins Node Management Policy​

image

With this policy, Jenkins brings the node online only when there are requests waiting in the queue, which minimizes unnecessary error logs. It also takes the node back to idle if there is no activity for 1 minute.

AWS CLI​

Installing AWS CLI​

With the AWS CLI, you can manage AWS resources from a terminal. Use the following command to list your instances and their current state:

aws ec2 describe-instances

Once you have checked the information for the desired resource, you can specify the target and execute a specific action. The commands are as follows:

EC2 start​

aws ec2 start-instances --instance-ids {instanceId}

EC2 stop​

aws ec2 stop-instances --instance-ids {instanceId}

Scheduling​

Scheduling is just a cron expression in the Jenkins job. The expression below runs the batch once a month: on the 1st, during the 9 o'clock hour, with H letting Jenkins pick the minute.

image

H 9 1 * *

Now, the EC2 instance will remain in a stopped state most of the time and will be activated by Jenkins once a month to process the batch.

Conclusion​

Keeping an EC2 instance in a running state when not in use is inefficient in terms of cost. This article has shown that with Jenkins and simple commands, you can use EC2 only when needed.

While higher-level cloud orchestration tools like EKS can elegantly solve such issues, sometimes a simple approach can be the most efficient. I hope you choose the method that suits your situation best as I conclude this article.

Changes in Spring Batch 5.0

· 2 min read

Here's a summary of the changes in Spring Batch 5.0.

What's new?​

@AutoConfiguration(after = { HibernateJpaAutoConfiguration.class, TransactionAutoConfiguration.class })
@ConditionalOnClass({ JobLauncher.class, DataSource.class, DatabasePopulator.class })
@ConditionalOnBean({ DataSource.class, PlatformTransactionManager.class })
@ConditionalOnMissingBean(value = DefaultBatchConfiguration.class, annotation = EnableBatchProcessing.class) // Added in 5.0.
@EnableConfigurationProperties(BatchProperties.class)
@Import(DatabaseInitializationDependencyConfigurer.class)
public class BatchAutoConfiguration {
// ...
}

In the past, you activated Spring Batch's Spring Boot auto-configuration with the @EnableBatchProcessing annotation. Now you need to remove it to use Spring Boot's auto-configuration. Specifying @EnableBatchProcessing or extending DefaultBatchConfiguration makes Spring Boot's auto-configuration back off, and is meant for customizing the application's batch settings.

Therefore, using @EnableBatchProcessing or DefaultBatchConfiguration will cause default settings like spring.batch.jdbc.initialize-schema not to work. Additionally, Jobs won't run automatically when Boot is started, so an implementation of a Runner is required.

Multiple Job Execution is no longer supported​

Previously, if there were multiple Jobs in the context, Boot could execute them all at startup. Now Boot runs a Job automatically only when it finds exactly one. If there are multiple Jobs in the context, you need to specify the Job to run with spring.batch.job.name when starting Boot.

Expanded JobParameter support​

In Spring Batch v4, Job parameters could only be of types Long, String, Date, and Double. In v5, you can now implement converters to use any type as a JobParameter. However, the default conversion service in Spring Batch still does not support LocalDate and LocalDateTime, causing exceptions. Although you can resolve this by implementing a converter for the default conversion service, it is problematic that even though JobParametersBuilder provides related methods, the conversion does not actually occur and throws an exception. An issue has been opened regarding this, and it is expected to be fixed in 5.0.1.

JobParameters jobParameters = jobLauncherTestUtils.getUniqueJobParametersBuilder()
        .addLocalDate("date", LocalDate.now()) // if you use this method, it will throw an exception even though it is provided.
        .toJobParameters();

image

The issue was resolved in the release of 5.0.1 on 2023-02-23.

initializeSchema​

spring:
  datasource:
    url: jdbc:postgresql://localhost:5432/postgres?currentSchema=mySchema
    username: postgres
    password: 1234
    driver-class-name: org.postgresql.Driver
  batch:
    jdbc:
      initialize-schema: always
      table-prefix: mySchema.BATCH_
  sql:
    init:
      mode: always

Specify the currentSchema option for proper functioning.

[System Design Interview] Chapter 5: Consistent Hashing

· 11 min read

What are the essential components needed to design a large-scale system?

In this article, we will implement Consistent Hashing, which is commonly used in routing systems, ourselves and discuss it based on measured data.

info

You can check the complete code on GitHub.

Since the article is quite long, the rest is written in a plain, concise style for convenience. 🙏

What is Hashing?​

Before delving into Consistent Hashing, let's briefly touch on hashing.

The dictionary definition of hashing is 'a mathematical function that takes an arbitrary length data string as input and generates a fixed-size output, typically a hash value or hash code consisting of numbers and strings.'

In simple terms, it means that the same input string will always return the same hash code. This property is used for various purposes such as password storage, digital signatures, and file integrity verification.

So, What is Consistent Hashing?​

Consistent Hashing is a technique used to evenly distribute data among distributed servers or services.

Data can be distributed evenly without Consistent Hashing, too. What Consistent Hashing focuses on is making horizontal scaling easier. Before exploring it, let's see why it emerged by looking at a simple hash routing method.

Node-Based Hash Routing Method​

hash(key) % n

image

This method efficiently distributes traffic while being simple.

However, it has a significant weakness in horizontal scaling. When the node list changes, n changes, so most keys are redistributed and routed to a different node than before.

If each node caches the data for the traffic it handles and a node leaves the group for some reason, this can cause massive cache misses, leading to service disruptions.

image

In an experiment with four nodes, it was observed that if only one node leaves, the cache hit rate drops drastically to 27%. We will examine the experimental method in detail in the following paragraphs.
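The order of magnitude of that drop can be reproduced with a tiny simulation. The sketch below is not the experiment from this article: it uses sequential integers as stand-ins for hash values and identifies nodes by index, both of which are simplifying assumptions, and the names (ModuloRoutingExperiment, route, hitRateAfterResize) are made up. A key counts as a cache hit if hash % n sends it to the same node index before and after the node count changes.

```java
class ModuloRoutingExperiment {

    // hash(key) % n routing; floorMod keeps the result non-negative.
    static int route(long keyHash, int nodeCount) {
        return (int) Math.floorMod(keyHash, (long) nodeCount);
    }

    // Fraction of keys that are still routed to the same node index
    // after the node count changes from nodesBefore to nodesAfter.
    static double hitRateAfterResize(int keyCount, int nodesBefore, int nodesAfter) {
        int hits = 0;
        for (long keyHash = 0; keyHash < keyCount; keyHash++) {
            if (route(keyHash, nodesBefore) == route(keyHash, nodesAfter)) {
                hits++;
            }
        }
        return (double) hits / keyCount;
    }

    public static void main(String[] args) {
        // Going from 4 nodes to 3: only hashes with h % 12 in {0, 1, 2} keep their node.
        System.out.println(hitRateAfterResize(12_000, 4, 3)); // 0.25
    }
}
```

With uniformly spread hash values, only about a quarter of the keys stay where they were, which is in line with the hit rate observed above.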

Consistent Hash Routing Method​

Consistent Hashing is a concept designed to minimize the possibility of massive cache misses.

image

The idea is simple. Create a kind of ring by connecting the start and end of the hash space, then place the nodes on the ring. Each node is allocated its own section of the hash space and waits for traffic.

info

The hash function used to place nodes does not involve a modulo operation, so a node's position does not depend on the number of nodes.

Now, let's assume a situation where traffic enters this router implemented with Consistent Hashing.

image

A key passed through the hash function lands at a point on the ring and is routed to the nearest node found by moving along the ring in one fixed direction from that point. Here, Node B caches key1 in preparation for future requests.

Even under heavy traffic, each request is routed to its node by the same principle.

Advantages of Consistent Hashing​

Low probability of cache misses even when the node list changes​

Let's consider a situation where Node E is added.

image

Previously entered keys are placed at the same points as before. Some keys that were placed between Nodes D and C now point to the new Node E, causing cache misses. However, the rest of the keys placed in other spaces do not experience cache misses.

Even if there is a network error causing Node C to disappear, the results are similar.

image

Keys that were directed to Node C now route to Node D, causing cache misses. However, the keys placed in other spaces do not experience cache misses.

In conclusion, regardless of any changes in the node list, only keys directly related to the changed nodes experience cache misses. This increases the cache hit rate compared to node-based hash routing, improving overall system performance.

Disadvantages of Consistent Hashing​

Like every design, Consistent Hashing has its drawbacks, however elegant it may seem.

Difficult to maintain uniform partitions​

image Nodes with different sizes of hash spaces are placed on the ring.

It is very difficult to predict the results of a hash function without knowing which key will be generated. Therefore, Consistent Hashing, which determines the position on the ring based on the hash result, cannot guarantee that nodes will have uniform hash spaces and be distributed evenly on the ring.

Difficult to achieve uniform distribution​

image If a node's hash space is too wide, traffic can be concentrated.

This problem arises because nodes are not evenly distributed on the hash ring. If Node D's hash space is abnormally larger than other nodes, it can lead to a hotspot issue where traffic is concentrated on a specific node, causing overall system failure.

Virtual Nodes​

The hash space is finite. Therefore, the more nodes are placed in it, the smaller the standard deviation of the partition sizes becomes, so even if one node is removed, the next node is not heavily burdened. The problem is that in the real world, the number of physical nodes translates directly into cost.

To address this, virtual nodes, which mimic physical nodes, are placed on the ring instead.

image

Virtual nodes internally point to the hash value of the physical nodes. Think of them as a kind of duplication magic. The main physical node is not placed on the hash ring, only the replicated virtual nodes wait for traffic on the hash ring. When traffic is allocated to a virtual node, it is routed based on the hash value of the actual node it represents.
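The idea can be sketched in a few lines. This is an illustrative sketch rather than the implementation discussed later in this post: `VirtualNodeRing`, the `node#i` naming of virtual nodes, and the use of the first 8 bytes of an MD5 digest as the ring position are all assumptions made for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

final class VirtualNodeRing {
    // Ring position -> name of the physical node that the virtual node stands for.
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int replicas;

    VirtualNodeRing(int replicas) {
        this.replicas = replicas;
    }

    // Ring position: the first 8 bytes of the MD5 digest, read as a signed long.
    static long hash(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            return ByteBuffer.wrap(digest).getLong();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Places `replicas` virtual nodes on the ring for one physical node.
    void addNode(String node) {
        for (int i = 0; i < replicas; i++) {
            ring.put(hash(node + "#" + i), node);
        }
    }

    // Removes only the positions that still belong to this physical node.
    void removeNode(String node) {
        for (int i = 0; i < replicas; i++) {
            ring.remove(hash(node + "#" + i), node);
        }
    }

    // Routes to the first virtual node at or after the key's position, wrapping around.
    String route(String key) {
        if (ring.isEmpty()) {
            return null;
        }
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(key));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        VirtualNodeRing ring = new VirtualNodeRing(100);
        for (String node : new String[] {"node-1", "node-2", "node-3", "node-4"}) {
            ring.addNode(node);
        }
        Map<String, Integer> load = new HashMap<>();
        for (int i = 0; i < 100_000; i++) {
            load.merge(ring.route("key-" + i), 1, Integer::sum);
        }
        System.out.println(load); // requests per physical node
    }
}
```

Running it with different `replicas` values gives a feel for how the per-node load evens out as more virtual nodes are placed on the ring.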

DIY Consistent Hashing​

DIY: Do It Yourself

So far, we have discussed the theoretical aspects. Personally, I believe that there is no better way to learn a concept than implementing it yourself. Let's implement it.

Choosing a Hash Algorithm​

It may seem obvious since the name includes hashing, but when implementing Consistent Hashing, selecting an appropriate hash algorithm is crucial. The speed of the hash function is directly related to performance. Commonly used hash algorithms are MD5 and SHA-256.

  • MD5: Suitable for applications where speed matters more than security. Its 128-bit output gives a hash space of 2^128, smaller than that of SHA-256.
  • SHA-256: Produces a longer 256-bit hash with stronger cryptographic properties, but is slower than MD5. With a hash space of 2^256, collisions are practically non-existent.

For routing, speed matters more than security and hash collisions are of little concern, so MD5 is considered sufficient for implementing the hash function.

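import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class MD5Hash implements HashAlgorithm {
    // MessageDigest is stateful and not thread-safe, so do not share this instance across threads.
    private final MessageDigest instance;

    public MD5Hash() {
        try {
            instance = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("no algorithm found", e);
        }
    }

    @Override
    public long hash(String key) {
        instance.reset();
        // Use an explicit charset so the hash does not depend on the platform default.
        instance.update(key.getBytes(StandardCharsets.UTF_8));
        byte[] digest = instance.digest();
        // Fold the first 4 bytes of the digest into an unsigned 32-bit value,
        // so the ring effectively uses a 2^32 hash space.
        long h = 0;
        for (int i = 0; i < 4; i++) {
            h <<= 8;
            h |= digest[i] & 0xFF;
        }
        return h;
    }
}
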
tip

In Java, you can conveniently implement a hash function using the MD5 algorithm through MessageDigest.

Hash Ring​

// Hash the businessKey and find the hashed value (node) placed on the ring.
public T routeNode(String businessKey) {
    if (ring.isEmpty()) { // If the ring is empty, it means there are no nodes, so return null
        return null;
    }
    Long hashOfBusinessKey = this.hashAlgorithm.hash(businessKey);
    SortedMap<Long, VirtualNode<T>> biggerTailMap = ring.tailMap(hashOfBusinessKey);
    Long nodeHash;
    if (biggerTailMap.isEmpty()) {
        nodeHash = ring.firstKey();
    } else {
        nodeHash = biggerTailMap.firstKey();
    }
    VirtualNode<T> virtualNode = ring.get(nodeHash);
    return virtualNode.getPhysicalNode();
}

The hash ring is implemented using a TreeMap. Since TreeMap keeps its keys (hash values) in ascending order, tailMap(key) returns the entries whose keys are greater than or equal to the given hash value, and the first of those is the target node. If no such key exists, the lookup wraps around to the smallest key on the ring (ring.firstKey()), which is what closes the ring.
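The lookup rule is easiest to see with small fixed numbers. The sketch below uses made-up ring positions (10, 20, 30) and node names, and a hypothetical helper `lookup` that mirrors the tailMap logic of routeNode.

```java
import java.util.SortedMap;
import java.util.TreeMap;

final class RingLookupDemo {
    // Returns the node at the first ring position at or after `hash`,
    // wrapping around to the smallest position when none exists.
    static String lookup(TreeMap<Long, String> ring, long hash) {
        SortedMap<Long, String> tail = ring.tailMap(hash);
        Long nodeHash = tail.isEmpty() ? ring.firstKey() : tail.firstKey();
        return ring.get(nodeHash);
    }

    public static void main(String[] args) {
        TreeMap<Long, String> ring = new TreeMap<>();
        ring.put(10L, "A");
        ring.put(20L, "B");
        ring.put(30L, "C");

        System.out.println(lookup(ring, 15L)); // B: first position after 15 is 20
        System.out.println(lookup(ring, 20L)); // B: tailMap includes the key itself
        System.out.println(lookup(ring, 35L)); // A: nothing at or after 35, so wrap around
    }
}
```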

info

If you are not familiar with TreeMap, please refer to this link.

Testing​

How effective is Consistent Hashing compared to the standard routing method? Now that we have implemented it ourselves, let's resolve this question. The rough test design is as follows:

  • Process 1 million requests, then introduce changes to the node list and assume the same traffic comes in again.
  • 4 physical nodes

The numerical data was collected with a simple test code [1] and, when graphed, produced six cases. Let's look at each one.

Case 1: Simple Hash, No Node Changes​

image

After sending 1 million requests and then another 1 million of the same requests, since there were no changes in the nodes, the cache hit rate was 100% from the second request onwards.

info

Cache hits appear even in the first round (gray graph), though at a low rate. This is because the test keys are generated randomly, so a small number of them happen to be duplicates.

Looking at the heights of the graphs for the nodes, we can see that the routing using hash % N is indeed distributing all traffic very evenly.

Case 2: Simple Hash, 1 Node Departure​

image

The cache hit rate, indicated by the green graph, significantly decreased. With Node 1 departing, the traffic was distributed to Nodes 2, 3, and 4. While some traffic luckily hit the cache on the same nodes as before, most of it was directed to different nodes, resulting in cache misses.

Case 3: Consistent Hash, No Node Changes, No Virtual Nodes​

image

info

Considering that physical nodes are not placed on the hash ring, using only one virtual node practically means not using virtual nodes.

Similar to Case 1, the red graph rises first as cache hits cannot occur immediately in the first request. By the second request, the cache hit rate is 100%, aligning the heights of the green and red graphs.

However, the heights of the graphs differ from node to node, which shows the drawback of Consistent Hashing: uneven traffic distribution due to non-uniform partitions.

Case 4: Consistent Hash, 1 Node Departure, No Virtual Nodes​

image

After Node 1 departs, the cache hit rate overwhelmingly improved compared to Case 2.

Upon closer inspection, it can be seen that the traffic originally directed to Node 1 then moved to Node 2 in the second traffic wave. Node 2 processed around 450,000 requests, including cache hits, which is more than twice the amount processed by Node 3 with 220,000 requests. Meanwhile, the traffic to Nodes 3 and 4 remained unchanged. This illustrates the advantage of Consistent Hashing while also highlighting a kind of hotspot phenomenon.

Case 5: Consistent Hash, 1 Node Departure, 10 Virtual Nodes​

To achieve uniform partitioning and resolve the hotspot issue, let's apply virtual nodes.

image

Overall, there is a change in the graphs. The traffic that was supposed to go to Node 1 is now divided among Nodes 2, 3, and 4. Although the partitions are not evenly distributed, the hotspot issue is gradually being resolved compared to Case 4. Since 10 virtual nodes seem insufficient, let's increase them further.

Case 6: Consistent Hash, 1 Node Departure, 100 Virtual Nodes​

image

Finally, the graphs for Nodes 2, 3, and 4 are similar. After Node 1's departure, there are 100 virtual nodes per physical node on the hash ring, totaling 300 virtual nodes. In summary:

  • Traffic is distributed evenly enough to be comparable to Case 1.
  • Even if Node 1 departs, the traffic intended for Node 1 is spread across multiple nodes, preventing the hotspot issue.
  • Apart from the traffic directed to Node 1, the cache still hits.

As observed, with a sufficient number of virtual nodes, routing with Consistent Hashing is far better suited to horizontal scaling than routing with the remainder (modulo) operation.

Conclusion​

We have examined Consistent Hashing as discussed in Chapter 5 of the fundamentals of large-scale system design. We hope this has helped you understand what Consistent Hashing is, and why it exists to solve certain problems.

Although not covered as a separate case, I was curious how many virtual nodes are needed to achieve a perfectly uniform distribution. I increased the number of virtual nodes to 10,000 and found that adding more had minimal effect. In theory, increasing the number of virtual nodes should drive the variance toward zero and produce a uniform distribution. However, more virtual nodes also means more entries on the hash ring, which adds overhead: whenever a node is added or removed, its virtual nodes have to be found and reorganized on the ring [2]. In a live environment, please choose the number of virtual nodes based on measured data.

Footnotes​

  1. SimpleHashRouterTest

  2. In particular, for a Hash Ring implemented using TreeMap, massive insertions and deletions are somewhat inefficient as the internal elements need to be rearranged each time.

Understanding Garbage Collection

· 7 min read

Overview​

Let's delve into the topic of Garbage Collection (GC) in the JVM.

What is GC?​

The JVM memory is divided into several regions.

image

The Heap region is where objects and arrays created by operations like new are stored. Objects or arrays created in the Heap region can be referenced by other objects. GC occurs precisely in this Heap region.

If a Java program continues to run without terminating, data will keep piling up in memory. GC resolves this issue.

How does it resolve it? The JVM identifies unreachable objects as targets for GC. The following code shows how an object becomes unreachable.

public class Main {
    public static void main(String[] args) {
        Person person = new Person("a", "soon to be unreferenced");
        person = new Person("b", "reference maintained.");
    }
}

The object a, created when person is first initialized, becomes unreachable as soon as person is reassigned to b on the next line. From then on, a is eligible to be released from memory during the next GC.

Stop the World​

image The World! Time, halt! - JoJo's Bizarre Adventure

"Stop the World" means stopping the application's execution in order to perform GC. When a "Stop the World" event occurs, all threads except the ones executing GC are paused. Once the GC operation is completed, the paused tasks resume. "Stop the World" events occur regardless of the GC algorithm used, and GC tuning typically aims to reduce the time spent in this paused state.

warning

In Java, program code does not explicitly deallocate memory. Occasionally setting a reference to null so that the object can be collected is not a major issue, but calling System.gc() can significantly degrade system performance and should be avoided. Furthermore, System.gc() is only a request and does not guarantee that GC will actually occur.

Two Areas Where GC Occurs​

Since developers do not explicitly deallocate memory in Java, the Garbage Collector is responsible for identifying and removing no longer needed (garbage) objects. The Garbage Collector operates under two main assumptions:

  • Most objects quickly become unreachable.
  • There are very few references from old objects to young objects.

Most objects quickly become unreachable​

for (int i = 0; i < 10000; i++) {
    NewObject obj = new NewObject();
    obj.doSomething();
}

The 10,000 NewObject instances are used within the loop and are not needed outside it. If these objects continue to occupy memory, resources for executing other code will gradually diminish.

Few references from old objects to young objects​

Consider the following code snippet for clarification.

Model model = new Model("value");
doSomething(model);

// model is no longer used

The initially created model is used within doSomething but is unlikely to be used much afterward. While there may be cases where it is reused, GC is designed with the assumption that such occurrences are rare. Looking at statistics from Oracle, most objects are cleaned up by GC shortly after being created, validating this assumption.

image

This assumption is known as the weak generational hypothesis. To maximize the benefits of this hypothesis, the HotSpot VM divides the physical space into two main areas: the Young Generation and the Old Generation.

image

  • Young Generation: This area primarily houses newly created objects. Since most objects quickly become unreachable, many objects are created and then disappear in the Young Generation. A collection that removes objects from this area is called a Minor GC.
  • Old Generation: Objects that survive in the Young Generation without becoming unreachable are moved to the Old Generation. This area is typically larger than the Young Generation, and because it is larger, GC occurs less frequently here. A collection that removes objects from this area is called a Major GC (or Full GC).

Each object in the Young Generation carries an age counter that increments each time it survives a Minor GC. When the age exceeds a setting called MaxTenuringThreshold, the object is moved to the Old Generation. However, even if the age does not exceed the setting, an object can be moved to the Old Generation if there is insufficient memory in the Survivor space.

info

The Permanent space (Permanent Generation) is used by the class loader to store meta-information about loaded classes and methods. Up to Java 7 it was a region managed by the JVM alongside the heap. In Java 8 it was removed, and class metadata moved to Metaspace in native memory.

Types of GC​

The Old Generation triggers GC when it becomes full. Understanding the different GC methods will help in comprehending the procedures involved.

Serial GC​

-XX:+UseSerialGC

To understand Serial GC, one must first grasp the Mark-Sweep-Compact algorithm. The first step of this algorithm involves identifying live objects in the Old Generation (Mark). Next, it sweeps through the heap from the front, retaining only live objects (Sweep). In the final step, it fills the heap from the front to ensure objects are stacked contiguously, dividing the heap into sections with and without objects (Compaction).

warning

Serial GC is suitable for systems with limited memory and few CPU cores. However, because it performs collection on a single thread, it can significantly impact application performance.

Parallel GC​

-XX:+UseParallelGC

  • Default GC in Java 8

While the basic algorithm is similar to Serial GC, Parallel GC performs Minor GC in the Young Generation using multiple threads.

Parallel Old GC​

-XX:+UseParallelOldGC

  • An improved version of Parallel GC

As the name suggests, this GC method is related to the Old Generation. Unlike ParallelGC, which only uses multiple threads for the Young Generation, Parallel Old GC performs GC using multiple threads in the Old Generation as well.

CMS GC (Concurrent Mark Sweep)​

This GC was designed to minimize "Stop the World" time by allowing application threads and GC threads to run concurrently. Due to the multi-step process of identifying GC targets, CPU usage is higher compared to other GC methods.

Ultimately, CMS GC was deprecated in Java 9 and removed entirely in Java 14.

G1GC (Garbage First)​

-XX:+UseG1GC

  • Released in JDK 7 to replace CMS GC
  • Default GC in Java 9+
  • Recommended for situations requiring more than 4GB of heap memory and where a "Stop the World" time of around 0.5 seconds is acceptable (For smaller heaps, other algorithms are recommended)

G1GC requires a fresh approach as it is a completely redesigned GC method.

Q. Considering G1GC is the default in later versions, what are the pros and cons compared to the previous CMS?

  • Pros
    • G1GC performs compaction while scanning, reducing "Stop the World" time.
    • Provides the ability to compress free memory space without additional "Stop the World" pauses.
    • String Deduplication Optimization
    • Tuning options for size, count, etc.
  • Cons
    • Before JDK 10, Full GC ran single-threaded (it was made parallel in JDK 10).
    • Applications with small heap sizes may experience frequent Full GC events.

Shenandoah GC​

-XX:+UseShenandoahGC

  • Introduced as an experimental feature in Java 12
  • Developed by Red Hat
  • Addresses memory fragmentation issues in CMS and pause issues in G1
  • Known for strong concurrency and lightweight GC logic, ensuring consistent pause times regardless of heap size

image

ZGC​

-XX:+UnlockExperimentalVMOptions -XX:+UseZGC

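  • Introduced as an experimental feature in Java 11 and released as production-ready in Java 15 (from Java 15 on, -XX:+UnlockExperimentalVMOptions is no longer needed)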
  • Designed for low-latency processing of large memory sizes (8MB to 16TB)
  • Utilizes ZPages similar to G1's Regions, but ZPages are dynamically managed in 2MB multiples (adjusting region sizes dynamically to accommodate large objects)
  • One of ZGC's key advantages is that "Stop the World" time is designed not to exceed 10ms regardless of heap size

image

Conclusion​

While there are various GC types available, in most cases, using the default GC provided is sufficient. Tuning GC requires significant effort, involving tasks such as analyzing GC logs and heap dumps. Analyzing GC logs will be covered in a separate article.
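To see which collector a given JVM is actually running with, the standard management API can be queried at runtime. The sketch below is a small illustration; the class name `ActiveCollectors` and the helper `collectorNames` are hypothetical, and the bean names printed depend on the JVM version and on the GC flags in use.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

final class ActiveCollectors {
    // Names of the garbage collector beans registered in the running JVM.
    static List<String> collectorNames() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        // Prints one line per collector with its collection count and accumulated time.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: collections=%d, time=%dms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```

Running the same class with a different flag, for example `-XX:+UseSerialGC`, changes the names that are listed, which makes it a quick way to confirm that a GC option took effect.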

Optimizing Images for Blog Search Exposure

· 5 min read

In the process of automating blog posting, we discuss image optimization for SEO. This is a story of failure rather than success, where we had to resort to Plan B.

info

You can check the code on GitHub.

Identifying the Problem​

For SEO, images in blog posts should be as small as possible. Smaller images improve the efficiency of search engine crawling bots, speed up page loading, and positively impact user experience.

So, which image format should we use? 🤔

Google has developed an image format called WebP to address this issue and actively recommends its use. For Google, which profits from advertising, image optimization is directly related to profitability as it allows users to quickly reach website ads.

In fact, converting a jpg file of about 2.8MB to webp reduced it to around 47KB. That is less than 1/50 of the original size! Although some quality loss occurred, it was hardly noticeable on the webpage.

image

With this level of improvement, the motivation to solve the problem was more than enough. Let's gather information to implement it.

Approach to the Solution​

Plan A. Adding to O2 as a Feature​

We have a plugin called O2 that we developed for blog posting. Since we thought that including the WebP conversion task as part of this plugin's functionality would be the most ideal way, we first attempted this approach.

While sharp is the most famous library for image processing, it is OS-dependent and cannot be used with Obsidian plugins. To confirm this, I asked about it in the Obsidian community and received a clear answer that it cannot be used.

image

image

image Related community conversation

Unable to use sharp, we decided to use imagemin as an alternative.

However, there was a critical issue: imagemin only works when esbuild's platform is set to node, whereas an Obsidian plugin requires the platform to be browser. Setting it to neutral, which should work on both platforms, worked on neither...

image

Since we couldn't find a suitable library to apply to O2 immediately, we decided to implement a simple script to handle the format conversion task.

Plan B. npm script​

Instead of adding functionality to the plugin, we can easily convert formats by scripting directly within the Jekyll project.

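// Imports assume an ESM setup (recent releases of imagemin and imagemin-webp are ESM-only).
// With older CommonJS releases, use require() instead.
import fs from 'node:fs';
import path from 'node:path';
import imagemin from 'imagemin';
import imageminWebp from 'imagemin-webp';

// Delete the original png/jpg/jpeg files in a directory.
async function deleteFilesInDirectory(dir) {
  const files = fs.readdirSync(dir);

  files.forEach(function (file) {
    const filePath = path.join(dir, file);
    const extname = path.extname(filePath);
    if (extname === '.png' || extname === '.jpg' || extname === '.jpeg') {
      fs.unlinkSync(filePath);
      console.log(`remove ${filePath}`);
    }
  });
}

// Convert the images in a directory to WebP, then recurse into its subdirectories.
async function convertImages(dir) {
  const subDirs = fs
    .readdirSync(dir)
    .filter((file) => fs.statSync(path.join(dir, file)).isDirectory());

  await imagemin([`${dir}/*.{png,jpg,jpeg}`], {
    destination: dir,
    plugins: [imageminWebp({quality: 75})]
  });
  await deleteFilesInDirectory(dir);

  for (const subDir of subDirs) {
    const subDirPath = path.join(dir, subDir);
    await convertImages(subDirPath);
  }
}

(async () => {
  await convertImages('assets/img');
})();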

While this method allows the desired functionality to be implemented quickly, it happens outside the process controlled by O2, so users have to manually relink the changed images in their markdown documents.

Since we had to use this method anyway, we decided to use a regular expression to change the image extensions linked in all files to webp, which removes the need to relink images by hand.

// omitted
async function updateMarkdownFile(dir) {
  const files = fs.readdirSync(dir);

  files.forEach(function (file) {
    const filePath = path.join(dir, file);
    const extname = path.extname(filePath);
    if (extname === '.md') {
      const data = fs.readFileSync(filePath, 'utf-8');
      // Match Markdown image links such as ![alt](path/name.png)
      // and swap the extension for webp.
      const newData = data.replace(
        /(!\[[^\]]*\]\((.*?)\.(png|jpg|jpeg)\))/g,
        (match, p1, p2, p3) => {
          return p1.replace(`${p2}.${p3}`, `${p2}.webp`);
        }
      );
      fs.writeFileSync(filePath, newData);
    }
  });
}

(async () => {
  await convertImages('assets/img');
  await updateMarkdownFile('_posts');
})();

Then, we wrote a script to run when publishing a blog post.

#!/usr/bin/env bash

echo "Image optimization...🖼️"
node tools/imagemin.js

git add .
git commit -m "post: publishing"

echo "Pushing...📦"
git push origin master

echo "Done! 🎉"

./tools/publish

Directly running sh in the terminal somehow felt inelegant. Let's add it to package.json for a cleaner usage.

{
  "scripts": {
    "publish": "./tools/publish"
  }
}

npm run publish

image It works quite well.

For now, we concluded it this way.

Conclusion​

Through this process, the blog posting pipeline has transformed as follows:

Before

After

Looking at the results alone, it doesn't seem that bad, does it...? 🤔

We wanted to add the image format conversion feature as part of the O2 plugin functionality, but for various reasons, we couldn't apply it (for now), which is somewhat disappointing. The methods using JS and sh require additional actions from the user and are not easy to maintain. We need to consistently think about how to bring this feature into O2 internally.
