
Changes in Spring Batch 5.0

· 2 min read
Haril Song
Owner, Software Engineer at 42dot

Here's a summary of the changes in Spring Batch 5.0.

What's new?

@AutoConfiguration(after = { HibernateJpaAutoConfiguration.class, TransactionAutoConfiguration.class })
@ConditionalOnClass({ JobLauncher.class, DataSource.class, DatabasePopulator.class })
@ConditionalOnBean({ DataSource.class, PlatformTransactionManager.class })
@ConditionalOnMissingBean(value = DefaultBatchConfiguration.class, annotation = EnableBatchProcessing.class) // Added in 5.0
@EnableConfigurationProperties(BatchProperties.class)
@Import(DatabaseInitializationDependencyConfigurer.class)
public class BatchAutoConfiguration {
// ...
}

In the past, you could activate Spring Batch's Spring Boot auto-configuration with the @EnableBatchProcessing annotation. Now, however, you need to remove it for Spring Boot's auto-configuration to apply: specifying @EnableBatchProcessing or extending DefaultBatchConfiguration makes Spring Boot back off its auto-configuration and is reserved for customizing the batch settings yourself.

Therefore, if you use @EnableBatchProcessing or DefaultBatchConfiguration, Boot defaults such as spring.batch.jdbc.initialize-schema no longer apply. Jobs also won't run automatically at startup, so you need to provide a Runner implementation that launches them.
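For instance, a minimal runner could launch a Job explicitly at startup. The sketch below is only an illustration of that idea; the injected Job bean and the run.id parameter are assumptions, not part of the original post.

import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.boot.ApplicationArguments;
import org.springframework.boot.ApplicationRunner;
import org.springframework.stereotype.Component;

@Component
public class BatchJobRunner implements ApplicationRunner {

    private final JobLauncher jobLauncher;
    private final Job job; // the Job bean you want to execute

    public BatchJobRunner(JobLauncher jobLauncher, Job job) {
        this.jobLauncher = jobLauncher;
        this.job = job;
    }

    @Override
    public void run(ApplicationArguments args) throws Exception {
        // Unique parameters so that every startup creates a new JobInstance
        JobParameters parameters = new JobParametersBuilder()
                .addLong("run.id", System.currentTimeMillis())
                .toJobParameters();
        jobLauncher.run(job, parameters);
    }
}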

Multiple Job Execution is no longer supported

Previously, if there were multiple Jobs in a batch application, you could execute them all at once. Now Boot runs a Job automatically only when it detects exactly one. If there are multiple Jobs in the context, you need to specify which Job to execute with spring.batch.job.name when starting Boot.

Expanded JobParameter support

In Spring Batch v4, Job parameters could only be of type Long, String, Date, or Double. In v5, you can implement a converter to use any type as a JobParameter. However, the default conversion service in Spring Batch still does not support LocalDate and LocalDateTime, so these types throw an exception. You can work around this by registering a converter with the default conversion service, but it is problematic that JobParametersBuilder provides the related methods while the conversion does not actually happen and an exception is thrown instead. An issue has been opened regarding this, and it is expected to be fixed in 5.0.1.

JobParameters jobParameters = jobLauncherTestUtils.getUniqueJobParametersBuilder()
.addLocalDate("date", LocalDate.now()) // if you use this method, it will throw an exception even though it is provided.
.toJobParameters();
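Before 5.0.1 landed, one workaround was to register an additional converter with the conversion service Spring Batch uses for job parameters. A minimal converter might look like the sketch below; how it gets wired into the batch configuration depends on your setup and is not shown here.

import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

import org.springframework.core.convert.converter.Converter;

// Illustrative workaround: turns the String form of a job parameter back into a LocalDate.
public class StringToLocalDateConverter implements Converter<String, LocalDate> {

    @Override
    public LocalDate convert(String source) {
        return LocalDate.parse(source, DateTimeFormatter.ISO_LOCAL_DATE);
    }
}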

image

The issue was resolved in the release of 5.0.1 on 2023-02-23.

initializeSchema

spring:
  datasource:
    url: jdbc:postgresql://localhost:5432/postgres?currentSchema=mySchema
    username: postgres
    password: 1234
    driver-class-name: org.postgresql.Driver
  batch:
    jdbc:
      initialize-schema: always
      table-prefix: mySchema.BATCH_
  sql:
    init:
      mode: always

Specify the currentSchema option in the JDBC URL so that the schema-prefixed batch metadata tables are initialized and resolved correctly.


[System Design Interview] Chapter 5: Consistent Hashing

· 11 min read
Haril Song
Owner, Software Engineer at 42dot

What are the essential components needed to design a large-scale system?

In this article, we will directly implement and discuss Consistent Hashing, which is commonly used in routing systems, and talk about it based on data.

info

You can check the complete code on Github.

Since the article is quite lengthy, the explanations below are written in a more casual tone for convenience. 🙏

What is Hashing?

Before delving into Consistent Hashing, let's briefly touch on hashing.

The dictionary definition of hashing is 'a mathematical function that takes an arbitrary length data string as input and generates a fixed-size output, typically a hash value or hash code consisting of numbers and strings.'

In simple terms, it means that the same input string will always return the same hash code. This characteristic of hashing is used for various purposes such as encryption and file integrity verification.
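As a quick illustration, the snippet below hashes the same string with the JDK's MessageDigest and always prints the same hash code (a minimal sketch, assuming Java 17+ for HexFormat; it is unrelated to the implementation later in this post).

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;

public class HashDemo {
    public static void main(String[] args) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] digest = md5.digest("hello".getBytes(StandardCharsets.UTF_8));
        // The same input always yields the same fixed-size hash code.
        System.out.println(HexFormat.of().formatHex(digest)); // 5d41402abc4b2a76b9719d911017c592
    }
}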

So, What is Consistent Hashing?

Consistent Hashing is a technique used to evenly distribute data among distributed servers or services.

Even without using Consistent Hashing, it is not impossible to evenly distribute data. However, Consistent Hashing is focused on making horizontal scaling easier. Before exploring Consistent Hashing, let's understand why Consistent Hashing emerged through a simple hash routing method.

Node-Based Hash Routing Method

hash(key) % n

image

This method efficiently distributes traffic while being simple.

However, it has a significant weakness in horizontal scaling. When the node list changes, there is a high probability that traffic will be redistributed, leading to routing to new nodes instead of existing nodes.

If you are absorbing traffic by caching on specific nodes and a node leaves the group for some reason, it can cause massive cache misses, leading to service disruptions.
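A minimal sketch of this modulo-based routing looks as follows; the node names and the use of hashCode are illustrative.

import java.util.List;

public class ModuloRouter {

    private final List<String> nodes;

    public ModuloRouter(List<String> nodes) {
        this.nodes = nodes;
    }

    public String route(String key) {
        // hash(key) % n: any change to n remaps most keys to different nodes
        int index = Math.floorMod(key.hashCode(), nodes.size());
        return nodes.get(index);
    }
}

With four nodes, removing one changes n from 4 to 3, so most keys suddenly map to a different node, which is exactly the mass cache miss described above.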

image

In an experiment with four nodes, it was observed that if only one node leaves, the cache hit rate drops drastically to 27%. We will examine the experimental method in detail in the following paragraphs.

Consistent Hash Routing Method

Consistent Hashing is a concept designed to minimize the possibility of massive cache misses.

image

The idea is simple. Create a kind of ring by connecting the start and end of the hash space, then place nodes on the hash space above the ring. Each node is allocated its hash space and waits for traffic.

info

The hash function used to place nodes is independent of modulo operations.

Now, let's assume a situation where traffic enters this router implemented with Consistent Hashing.

image

Traffic passed through the hash function is routed towards the nearest node on the ring. Node B caches key1 in preparation for future requests.

Even in the scenario of a high volume of traffic, traffic will be routed to their respective nodes following the same principle.

Advantages of Consistent Hashing

Low probability of cache misses even when the node list changes

Let's consider a situation where Node E is added.

image

Previously entered keys are placed at the same points as before. Some keys that were placed between Nodes D and C now point to the new Node E, causing cache misses. However, the rest of the keys placed in other spaces do not experience cache misses.

Even if there is a network error causing Node C to disappear, the results are similar.

image

Keys that were directed to Node C now route to Node D, causing cache misses. However, the keys placed in other spaces do not experience cache misses.

In conclusion, regardless of any changes in the node list, only keys directly related to the changed nodes experience cache misses. This increases the cache hit rate compared to node-based hash routing, improving overall system performance.

Disadvantages of Consistent Hashing

Like all other designs, Consistent Hashing, which may seem elegant, also has its drawbacks.

Difficult to maintain uniform partitions

Image: Nodes with different sizes of hash spaces are placed on the ring.

It is very difficult to predict the results of a hash function without knowing which key will be generated. Therefore, Consistent Hashing, which determines the position on the ring based on the hash result, cannot guarantee that nodes will have uniform hash spaces and be distributed evenly on the ring.

Difficult to achieve uniform distribution

Image: If a node's hash space is too wide, traffic can be concentrated.

This problem arises because nodes are not evenly distributed on the hash ring. If Node D's hash space is abnormally larger than other nodes, it can lead to a hotspot issue where traffic is concentrated on a specific node, causing overall system failure.

Virtual Nodes

The hash space is finite. Therefore, if there are a large number of nodes placed in the hash space, the standard deviation decreases, meaning that even if one node is removed, the next node will not be heavily burdened. The problem lies in the fact that in the real world, the number of physical nodes equates to cost.

To address this, virtual nodes, which mimic physical nodes, are implemented to solve this intelligently.

image

Virtual nodes internally point to the hash value of the physical nodes. Think of them as a kind of duplication magic. The main physical node is not placed on the hash ring, only the replicated virtual nodes wait for traffic on the hash ring. When traffic is allocated to a virtual node, it is routed based on the hash value of the actual node it represents.
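A virtual node can be modeled as a thin wrapper around its physical node, as in the sketch below; the actual class in the linked repository may look different, so treat the shape and names as assumptions.

public class VirtualNode<T> {

    private final T physicalNode;
    private final int replicaIndex;

    public VirtualNode(T physicalNode, int replicaIndex) {
        this.physicalNode = physicalNode;
        this.replicaIndex = replicaIndex;
    }

    // Traffic that lands on this virtual node is handled by the physical node it wraps.
    public T getPhysicalNode() {
        return physicalNode;
    }

    // The string that gets hashed to place this virtual node on the ring.
    public String getKey() {
        return physicalNode.toString() + "-" + replicaIndex;
    }
}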

DIY Consistent Hashing

DIY: Do It Yourself

So far, we have discussed the theoretical aspects. Personally, I believe that there is no better way to learn a concept than implementing it yourself. Let's implement it.

Choosing a Hash Algorithm

It may seem obvious since the name includes hashing, but when implementing Consistent Hashing, selecting an appropriate hash algorithm is crucial. The speed of the hash function is directly related to performance. Commonly used hash algorithms are MD5 and SHA-256.

  • MD5: Suitable for applications where speed is more important than security. Has a smaller hash space than SHA-256, at 2^128.
  • SHA-256: Has a longer hash size and stronger encryption properties. Slower than MD5. With a very large hash space of about 2^256, collisions are almost non-existent.

For routing, speed is more important than security, and since there are fewer concerns about hash collisions, MD5 is considered sufficient for implementing the hash function.

public class MD5Hash implements HashAlgorithm {
MessageDigest instance;

public MD5Hash() {
try {
instance = MessageDigest.getInstance("MD5");
} catch (NoSuchAlgorithmException e) {
throw new IllegalStateException("no algorithm found");
}
}

@Override
public long hash(String key) {
instance.reset();
instance.update(key.getBytes());
byte[] digest = instance.digest();
long h = 0;
for (int i = 0; i < 4; i++) {
h <<= 8;
h |= (digest[i]) & 0xFF;
}
return h;
}
}
tip

In Java, you can conveniently implement a hash function using the MD5 algorithm through MessageDigest.

Hash Ring

// Hash the businessKey and find the hashed value (node) placed on the ring.
public T routeNode(String businessKey) {
if (ring.isEmpty()) { // If the ring is empty, it means there are no nodes, so return null
return null;
}
Long hashOfBusinessKey = this.hashAlgorithm.hash(businessKey);
SortedMap<Long, VirtualNode<T>> biggerTailMap = ring.tailMap(hashOfBusinessKey);
Long nodeHash;
if (biggerTailMap.isEmpty()) {
nodeHash = ring.firstKey();
} else {
nodeHash = biggerTailMap.firstKey();
}
VirtualNode<T> virtualNode = ring.get(nodeHash);
return virtualNode.getPhysicalNode();
}

The hash ring is implemented using a TreeMap. Since TreeMap maintains keys (hash values) in ascending order upon storage, we can use the tailMap(key) method to find values greater than the key (hash value) and connect them to the largest key if a larger key cannot be found.

info

If you are not familiar with TreeMap, please refer to this link.
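For completeness, registering a node on the ring could look like the sketch below, consistent with the routeNode code above; the replica count and naming are assumptions.

// ring is the TreeMap<Long, VirtualNode<T>> that routeNode reads from.
public void addNode(T physicalNode, int numberOfReplicas) {
    for (int i = 0; i < numberOfReplicas; i++) {
        VirtualNode<T> virtualNode = new VirtualNode<>(physicalNode, i);
        long hashOfVirtualNode = this.hashAlgorithm.hash(virtualNode.getKey());
        ring.put(hashOfVirtualNode, virtualNode);
    }
}

Removing a node is the reverse operation: compute the hash of each of its virtual nodes and delete those keys from the TreeMap.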

Testing

How effective is Consistent Hashing compared to the standard routing method? Now that we have implemented it ourselves, let's resolve this question. The rough test design is as follows:

  • Process 1 million requests, then introduce changes to the node list and assume the same traffic comes in again.
  • 4 physical nodes

The numerical data was collected through a simple test code1 and graphed, which produced six cases. Let's look at each one.
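The measurement itself can be as simple as counting how many keys land on a node that has already cached them. The sketch below is a simplified illustration, not the actual SimpleHashRouterTest; all names are assumptions.

import java.util.List;
import java.util.Set;
import java.util.function.Function;

public class CacheHitMeter {

    // A node is represented here simply by the set of keys it has cached so far.
    public static long measureCacheHits(List<String> keys, Function<String, Set<String>> routeToNodeCache) {
        long hits = 0;
        for (String key : keys) {
            Set<String> cache = routeToNodeCache.apply(key);
            if (!cache.add(key)) { // add() returns false when the key was already cached, i.e. a cache hit
                hits++;
            }
        }
        return hits;
    }
}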

Case 1: Simple Hash, No Node Changes

image

After sending 1 million requests and then another 1 million of the same requests, since there were no changes in the nodes, the cache hit rate was 100% from the second request onwards.

info

The small number of cache hits even on the first pass (gray graph) comes from the randomly generated test keys occasionally duplicating; the probability of duplicate keys is low, hence the low hit rate.

Looking at the heights of the graphs for the nodes, we can see that the routing using hash % N is indeed distributing all traffic very evenly.

Case 2: Simple Hash, 1 Node Departure

image

The cache hit rate, indicated by the green graph, significantly decreased. With Node 1 departing, the traffic was distributed to Nodes 2, 3, and 4. While some traffic luckily hit the cache on the same nodes as before, most of it was directed to different nodes, resulting in cache misses.

Case 3: Consistent Hash, No Node Changes, No Virtual Nodes

image

info

Considering that physical nodes are not placed on the hash ring, using only one virtual node practically means not using virtual nodes.

Similar to Case 1, the red graph rises first as cache hits cannot occur immediately in the first request. By the second request, the cache hit rate is 100%, aligning the heights of the green and red graphs.

However, it can be observed that the heights of the graphs for each node are different, indicating the drawback of Consistent Hashing—uneven traffic distribution due to non-uniform partitions.

Case 4: Consistent Hash, 1 Node Departure, No Virtual Nodes

image

After Node 1 departs, the cache hit rate overwhelmingly improved compared to Case 2.

Upon closer inspection, it can be seen that the traffic originally directed to Node 1 then moved to Node 2 in the second traffic wave. Node 2 processed around 450,000 requests, including cache hits, which is more than twice the amount processed by Node 3 with 220,000 requests. Meanwhile, the traffic to Nodes 3 and 4 remained unchanged. This illustrates the advantage of Consistent Hashing while also highlighting a kind of hotspot phenomenon.

Case 5: Consistent Hash, 1 Node Departure, 10 Virtual Nodes

To achieve uniform partitioning and resolve the hotspot issue, let's apply virtual nodes.

image

Overall, there is a change in the graphs. The traffic that was supposed to go to Node 1 is now divided among Nodes 2, 3, and 4. Although the partitions are not evenly distributed, the hotspot issue is gradually being resolved compared to Case 4. Since 10 virtual nodes seem insufficient, let's increase them further.

Case 6: Consistent Hash, 1 Node Departure, 100 Virtual Nodes

image

Finally, the graphs for Nodes 2, 3, and 4 are similar. After Node 1's departure, there are 100 virtual nodes per physical node on the hash ring, totaling 300 virtual nodes. In summary:

  • It can be seen that traffic is distributed evenly enough to be comparable to Case 1.
  • Even if Node 1 departs, the traffic intended for Node 1 is spread across multiple nodes, preventing the hotspot issue.
  • Apart from the traffic directed to Node 1, the cache still hits.

By placing a sufficient number of virtual nodes, routing with Consistent Hashing becomes far more advantageous for horizontal scaling than modulo-based routing, as observed.

Conclusion

We have examined Consistent Hashing as discussed in Chapter 5 of the fundamentals of large-scale system design. We hope this has helped you understand what Consistent Hashing is, and why it exists to solve certain problems.

Although not mentioned in a separate case, I was concerned about how many virtual nodes should be added to achieve a perfectly uniform distribution. Therefore, I increased the number of virtual nodes to 10,000 and found that adding more virtual nodes had minimal effect. Theoretically, increasing virtual nodes should converge the variance to zero and achieve a uniform distribution. However, increasing virtual nodes means having many instances on the hash ring, leading to unnecessary overhead. It requires the task of finding and organizing virtual nodes on the hash ring whenever a new node is added or removed2. In a live environment, please set an appropriate number of virtual nodes based on data.


Footnotes

  1. SimpleHashRouterTest

  2. In particular, for a Hash Ring implemented using TreeMap, massive insertions and deletions are somewhat inefficient as the internal elements need to be rearranged each time.

Understanding Garbage Collection

· 7 min read
Haril Song
Owner, Software Engineer at 42dot

Overview

Let's delve into the topic of Garbage Collection (GC) in the JVM.

What is GC?

The JVM memory is divided into several regions.

image

The Heap region is where objects and arrays created by operations like new are stored. Objects or arrays created in the Heap region can be referenced by other objects. GC occurs precisely in this Heap region.

If a Java program continues to run without terminating, data will keep piling up in memory. GC resolves this issue.

How does it resolve it? The JVM identifies unreachable objects as targets for GC. Understanding which objects become unreachable can be grasped by looking at the following code.

public class Main {
public static void main(String[] args) {
Person person = new Person("a", "soon to be unreferenced");
person = new Person("b", "reference maintained.");
}
}

When person is first initialized, it references the object "a"; on the very next line the variable is reassigned to the new object "b", so "a" becomes unreachable. "a" will then be released from memory during the next GC.

Stop the World

Image: The World! Time, halt! - JoJo's Bizarre Adventure

"Stop the World" refers to pausing the application's execution so that GC can run. When a "Stop the World" event occurs, all threads except the ones performing GC are paused. Once the GC operation is completed, the paused tasks resume. Regardless of the GC algorithm used, "Stop the World" events occur, and GC tuning typically aims to reduce the time spent in this paused state.

warning

Java does not explicitly deallocate memory in program code. Occasionally setting an object to null to deallocate it is not a major issue, but calling System.gc() can significantly impact system performance and should never be used. Furthermore, System.gc() does not guarantee that GC will actually occur.

Two Areas Where GC Occurs

Since developers do not explicitly deallocate memory in Java, the Garbage Collector is responsible for identifying and removing no longer needed (garbage) objects. The Garbage Collector operates under two main assumptions:

  • Most objects quickly become unreachable.
  • There are very few references from old objects to young objects.

Most objects quickly become unreachable

for (int i = 0; i < 10000; i++) {
NewObject obj = new NewObject();
obj.doSomething();
}

The 10,000 NewObject instances are used within the loop and are not needed outside it. If these objects continue to occupy memory, resources for executing other code will gradually diminish.

Few references from old objects to young objects

Consider the following code snippet for clarification.

Model model = new Model("value");
doSomething(model);

// model is no longer used

The initially created model is used within doSomething but is unlikely to be used much afterward. While there may be cases where it is reused, GC is designed with the assumption that such occurrences are rare. Looking at statistics from Oracle, most objects are cleaned up by GC shortly after being created, validating this assumption.

image

This assumption is known as the weak generational hypothesis. To maximize the benefits of this hypothesis, the HotSpot VM divides the physical space into two main areas: the Young Generation and the Old Generation.

image

  • Young Generation: This area primarily houses newly created objects. Since most objects quickly become unreachable, many objects are created and then disappear in the Young Generation. The GC that reclaims objects in this area is called a Minor GC.
  • Old Generation: Objects that survive in the Young Generation without becoming unreachable are moved to the Old Generation. This area is typically larger than the Young Generation, and GC occurs less frequently here. The GC that reclaims objects in this area is called a Major GC (or Full GC).

Each object in the Young Generation has an age bit that increments each time it survives a Minor GC. When the age bit exceeds a setting called MaxTenuringThreshold, the object is moved to the Old Generation. However, even if the age bit does not exceed the setting, an object can be moved to the Old Generation if there is insufficient memory in the Survivor space.

info

The Permanent space is where the addresses of created objects are stored. It is used by the class loader to store meta-information about loaded classes and methods. Up to Java 7, it existed within the Heap.

Types of GC

The Old Generation triggers GC when it becomes full. Understanding the different GC methods will help in comprehending the procedures involved.

Serial GC

-XX:+UseSerialGC

To understand Serial GC, one must first grasp the Mark-Sweep-Compact algorithm. The first step of this algorithm involves identifying live objects in the Old Generation (Mark). Next, it sweeps through the heap from the front, retaining only live objects (Sweep). In the final step, it fills the heap from the front to ensure objects are stacked contiguously, dividing the heap into sections with and without objects (Compaction).
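Conceptually, the three phases can be sketched as follows. This is purely an illustration of the idea, not how HotSpot implements it; every type and name here is made up for the example.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class MarkSweepCompact {

    static class Obj {
        List<Obj> references = List.of();
    }

    // Mark: start from the GC roots and flag every object reachable from them.
    static Set<Obj> mark(List<Obj> roots) {
        Set<Obj> marked = new HashSet<>();
        Deque<Obj> stack = new ArrayDeque<>(roots);
        while (!stack.isEmpty()) {
            Obj current = stack.pop();
            if (marked.add(current)) {
                stack.addAll(current.references);
            }
        }
        return marked;
    }

    // Sweep + Compact: drop every unmarked object and pack the survivors toward the front of the heap.
    static void sweepAndCompact(List<Obj> heap, Set<Obj> marked) {
        heap.removeIf(obj -> !marked.contains(obj)); // sweep: remove dead objects
        // In a list the surviving elements shift down automatically, which stands in for compaction here.
    }
}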

warning

Serial GC is suitable for systems with limited memory and CPU cores. However, using Serial GC can significantly impact application performance.

Parallel GC

-XX:+UseParallelGC

  • Default GC in Java 8

While the basic algorithm is similar to Serial GC, Parallel GC performs Minor GC in the Young Generation using multiple threads.

Parallel Old GC

-XX:+UseParallelOldGC

  • An improved version of Parallel GC

As the name suggests, this GC method is related to the Old Generation. Unlike ParallelGC, which only uses multiple threads for the Young Generation, Parallel Old GC performs GC using multiple threads in the Old Generation as well.

CMS GC (Concurrent Mark Sweep)

This GC was designed to minimize "Stop the World" time by allowing application threads and GC threads to run concurrently. Due to the multi-step process of identifying GC targets, CPU usage is higher compared to other GC methods.

Ultimately, CMS GC was deprecated starting from Java 9 and completely discontinued in Java 14.

G1GC (Garbage First)

-XX:+UseG1GC

  • Released in JDK 7 to replace CMS GC
  • Default GC in Java 9+
  • Recommended for situations requiring more than 4GB of heap memory and where a "Stop the World" time of around 0.5 seconds is acceptable (For smaller heaps, other algorithms are recommended)

G1GC requires a fresh approach as it is a completely redesigned GC method.

Q. Considering G1GC is the default in later versions, what are the pros and cons compared to the previous CMS?

  • Pros
    • G1GC performs compaction while scanning, reducing "Stop the World" time.
    • Provides the ability to compress free memory space without additional "Stop the World" pauses.
    • String Deduplication Optimization
    • Tuning options for size, count, etc.
  • Cons
    • During Full GC, it operates single-threaded.
    • Applications with small heap sizes may experience frequent Full GC events.

Shenandoah GC

-XX:+UseShenandoahGC

  • Released in Java 12
  • Developed by Red Hat
  • Addresses memory fragmentation issues in CMS and pause issues in G1
  • Known for strong concurrency and lightweight GC logic, ensuring consistent pause times regardless of heap size

image

ZGC

-XX:+UnlockExperimentalVMOptions -XX:+UseZGC

  • Released in Java 15
  • Designed for low-latency processing of large memory sizes (8MB to 16TB)
  • Utilizes ZPages similar to G1's Regions, but ZPages are dynamically managed in 2MB multiples (adjusting region sizes dynamically to accommodate large objects)
  • One of ZGC's key advantages is that "Stop the World" time never exceeds 10ms regardless of heap size

image

Conclusion

While there are various GC types available, in most cases, using the default GC provided is sufficient. Tuning GC requires significant effort, involving tasks such as analyzing GC logs and heap dumps. Analyzing GC logs will be covered in a separate article.


Optimizing Images for Blog Search Exposure

· 5 min read
Haril Song
Owner, Software Engineer at 42dot

In the process of automating blog posting, we discuss image optimization for SEO. This is a story of failure rather than success, where we had to resort to Plan B.

info

You can check the code on GitHub.

Identifying the Problem

For SEO optimization, it is best to have images in blog posts as small as possible. This improves the efficiency of search engine crawling bots, speeds up page loading, and positively impacts user experience.

So, which image format should we use? 🤔

Google has developed an image format called WebP to address this issue and actively recommends its use. For Google, which profits from advertising, image optimization is directly related to profitability as it allows users to quickly reach website ads.

In fact, converting a jpg file of about 2.8MB to webp reduced it to around 47kb. That's more than a 1/50 reduction! Although some quality loss occurred, it was hardly noticeable on the webpage.

image

With this level of improvement, the motivation to solve the problem was more than enough. Let's gather information to implement it.

Approach to the Solution

Plan A. Adding to O2 as a Feature

We have a plugin called O2 that we developed for blog posting. Since we thought that including the WebP conversion task as part of this plugin's functionality would be the most ideal way, we first attempted this approach.

While sharp is the most famous library for image processing, it is OS-dependent and cannot be used with Obsidian plugins. To confirm this, I asked about it in the Obsidian community and received a clear answer that it cannot be used.

image

image

Image: Related community conversation

Unable to use sharp, we decided to use imagemin as an alternative.

However, there was a critical issue: imagemin requires the platform to be node for it to work when running esbuild, but the Obsidian plugin required the platform to be a browser. Setting it to neutral, which should work on both platforms, didn't work on either...

image

Since we couldn't find a suitable library to apply to O2 immediately, we decided to implement a simple script to handle the format conversion task.

Plan B. npm script

Instead of adding functionality to the plugin, we can easily convert formats by scripting directly within the Jekyll project.

async function deleteFilesInDirectory(dir) {
const files = fs.readdirSync(dir);

files.forEach(function (file) {
const filePath = path.join(dir, file);
const extname = path.extname(filePath);
if (extname === '.png' || extname === '.jpg' || extname === '.jpeg') {
fs.unlinkSync(filePath);
console.log(`remove ${filePath}`);
}
});
}

async function convertImages(dir) {
const subDirs = fs
.readdirSync(dir)
.filter((file) => fs.statSync(path.join(dir, file)).isDirectory());

await imagemin([`${dir}/*.{png,jpg,jpeg}`], {
destination: dir,
plugins: [imageminWebp({quality: 75})]
});
await deleteFilesInDirectory(dir);

for (const subDir of subDirs) {
const subDirPath = path.join(dir, subDir);
await convertImages(subDirPath);
}
}

(async () => {
await convertImages('assets/img');
})();

While this method allows for quick implementation of the desired functionality, it requires users to manually relink the changed images to the markdown document outside of the process controlled by O2.

If we must use this method, we decided to use regular expressions to change the image extensions linked in all files to webp, thereby skipping the task of relinking images in the document.

// omitted
async function updateMarkdownFile(dir) {
const files = fs.readdirSync(dir);

files.forEach(function (file) {
const filePath = path.join(dir, file);
const extname = path.extname(filePath);
if (extname === '.md') {
const data = fs.readFileSync(filePath, 'utf-8');
const newData = data.replace(
/(!\[.*?\]\((.*?)\.(png|jpg|jpeg)\))/g,
(match, p1, p2, p3) => {
return p1.replace(`${p2}.${p3}`, `${p2}.webp`);
}
);
fs.writeFileSync(filePath, newData);
}
});
}

(async () => {
await convertImages('assets/img');
await updateMarkdownFile('_posts');
})();

Then, we wrote a script to run when publishing a blog post.

#!/usr/bin/env bash

echo "Image optimization️...🖼️"
node tools/imagemin.js

git add .
git commit -m "post: publishing"

echo "Pushing...📦"
git push origin master

echo "Done! 🎉"
./tools/publish

Directly running sh in the terminal somehow felt inelegant. Let's add it to package.json for a cleaner usage.

{
"scripts": {
"publish": "./tools/publish"
}
}
npm run publish

Image: It works quite well.

For now, we concluded it this way.

Conclusion

Through this process, the blog posting pipeline has transformed as follows:

Before

After

Looking at the results alone, it doesn't seem that bad, does it...? 🤔

We wanted to add the image format conversion feature as part of the O2 plugin functionality, but for various reasons, we couldn't apply it (for now), which is somewhat disappointing. The methods using JS and sh require additional actions from the user and are not easy to maintain. We need to consistently think about how to bring this feature into O2 internally.


What Does It Mean to Write Well? - Writing Pipeline

· 7 min read
Haril Song
Owner, Software Engineer at 42dot

I mostly use the Markdown editor Obsidian for writing, and host my blog on GitHub pages. To maintain the habit of writing without interruption across these two different platforms, I'll share how I go about it.

info

This post was inspired by a presentation by Sungyun from 글또(geultto).

Gathering Material

In various situations like work, side projects, or studying, I often come across topics I don't know much about. Each time this happens, I create a new note immediately1. In this note, I write a brief summary of 1-2 lines focusing on the keywords I didn't know well.

I don't try to organize in detail from the beginning. I'm not familiar with the topic yet, so it can be tiring. Moreover, the newly learned information may not be immediately important. However, to prevent creating notes on the same topic later, I pay attention to the note title or tag for easy searching.

The key point is that this process is ongoing. If similar notes on the topic already exist, they will be enriched. Through repetition, eventually, a good post will emerge.

The initially created notes are stored in a directory named "inbox."

Learning and Organizing

The notes pile up in the inbox... I need to clear them out, right?

Once I find material that is useful and easy to organize, I study the topic and write a rough draft. Up to this point, I write not for blog purposes but for my own learning. Since it's just a simple memo, the writing style and expression are somewhat flexible. I might add some humor to make the structure of the post more interesting...

After writing this draft, I evaluate it to determine if it's suitable for posting on the blog. If the topic has been overly covered in other communities or blogs, I tend not to post it separately for the sake of differentiation.

info

However, for content related to personal experiences, such as introducing solutions to problems or sharing personal experiences, even if similar posts exist on other blogs, I try to write them as my feelings and perspectives may differ.

The organized posts are moved to the backlog directory.

Selecting Blog Posts

While not as many as in the inbox, a certain number of somewhat completed posts accumulate in the backlog. Around 10 posts linger there like a buffer. Over time, if my thoughts on the content change, requiring edits, or if incorrect information is found necessitating further study, some posts are demoted back to the inbox. This serves as a minimal verification process I can personally do to prevent spreading incorrect information. The surviving posts, after enduring all the hardships, are refined from personal learning posts to posts meant for others to see.

Once a post is satisfactory, it is moved to the directory "ready" for blog publication.

Uploading

Once all preparations for uploading are complete, I use O2 to convert the notes in "ready" to Markdown format and move them to the Jekyll project folder.

info

O2 is a community plugin for Obsidian that converts notes written in Obsidian to Markdown format.

GIF: You can see how the image links are automatically converted.

The notes in "ready" are copied to the published directory before being moved to the Jekyll project, where they are stored for backup. All Obsidian-specific syntax is converted to basic Markdown, and if there are attachments, they are copied to the Jekyll project folder along with the notes. Although the attachment file paths change, causing Markdown links that worked in Obsidian to break, there's no need to worry as all this is automated by O22. 😄 3

Now, I switch tools from Obsidian to VScode. Managing a Jekyll blog sometimes requires dealing with code. This goes beyond what a simple Markdown editor can handle, so continuing to work in Obsidian may present some challenges.

I briefly review the syntax and context, then run npm run publish to complete the blog post publication process.

info

You can learn more about publishing in this post.

Proofreading

I regularly review the posts to catch any unnoticed grammatical errors or awkward expressions and refine them gradually. This process doesn't have a set end point; just check the blog from time to time and make corrections consistently.

The blog post pipeline ends here, but I'll briefly explain how to utilize it to write better posts.

Optional. Data Analysis

Obsidian provides a graph view feature. By utilizing this feature, you can visualize how your notes are organically connected and use it for data analysis.

Image: In the graph, only the bright green nodes are posts published on the blog.

Most notes are still on topics I'm studying or posts that didn't make the cut for blog publication. From this graph, you can infer the following:

  • Nodes in the center with many edges but not published as blog posts likely cover very common topics that I chose not to publish. Or maybe I was just lazy...
  • Nodes scattered on the outer edges without edges represent fragmented knowledge that I haven't delved deeply into yet. Since they are not linked to any topic, they need further study to connect them internally 😂. These nodes need to be linked internally by learning more about related topics.
  • Posts on the outermost edges that have been published represent impulsively published posts during the process of acquiring new knowledge. Since they were impulsively published, it's important to periodically review them for any errors in the content.

Based on this objective data, I strive to expand my knowledge by checking how much I know and what I don't know regularly. 🧐

Conclusion

In this post, I introduced my writing pipeline and how I use Obsidian for data analysis to accurately understand myself. I hope writing becomes a natural part of your daily life, not just a task!

  • Quickly jot down emerging ideas to enable writing without losing the context of your work, making it possible to write consistently. Choose and utilize the appropriate tools according to the situation.
  • To make writing feel less like a chore, it's more efficient to add a few minutes consistently each day rather than holding onto writing for hours from scratch.
  • Blog publication can be a tedious task, so automate it as much as possible to keep the workflow simple. Focus on writing!
  • Assess how much you know and what you don't know (self-objectification). This helps greatly in selecting blog post topics and determining the direction of your learning.
info

All my writing, including drafts that are not posted on the blog, is managed publicly on GitHub.


Footnotes

  1. I've always kept notes nearby since I studied music in the past. It seemed like the best time to come up with something was when I was about to fall asleep. It doesn't seem much different now. Solutions to bugs always seem to come to mind just before falling asleep...

  2. O2 Plugin Development Story

  3. Although new bug issues are added with each post... 😭 ???: That's a feature

Getting the Most Out of chezmoi

· 5 min read
Haril Song
Owner, Software Engineer at 42dot

Following on from the previous post, I'll share some ways to make better use of chezmoi.

info

You can check out the settings I'm currently using here.

How to Use It

You can find the usage of chezmoi commands with chezmoi help and in the official documentation. In this post, I'll explain some advanced ways to use chezmoi more conveniently.

Settings

chezmoi uses the ~/.config/chezmoi/chezmoi.toml file for settings. If you need tool-specific settings, you can define them in this file. It supports not only toml but also yaml and json, so you can write in a format you are familiar with. Since the official documentation guides with toml, I'll also explain using toml as the default.

Setting Merge Tool and Default Editor

chezmoi uses vi as the default editor. Since I mainly use nvim, I'll show you how to modify it to use nvim as the default editor.

chezmoi edit-config
[edit]
command = "nvim"

[merge]
command = "nvim"
args = ["-d", "{{ .Destination }}", "{{ .Source }}", "{{ .Target }}"]

If you use VScode, you can set it up like this:

[edit]
command = "code"
args = ["--wait"]

Managing gitconfig Using Templates

Sometimes you may need separate configurations rather than uniform settings. For example, you might need different gitconfig settings for work and personal environments. In such cases where only specific data needs to be separated while the rest remains similar, chezmoi allows you to control this through a method called templates, which inject environment variables.

First, create a gitconfig file:

mkdir ~/.config/git
touch ~/.config/git/config

Register gitconfig as a template to enable the use of variables:

chezmoi add --template ~/.config/git/config

Write the parts where data substitution is needed:

chezmoi edit ~/.config/git/config
[user]
name = {{ .name }}
email = {{ .email }}

These curly braces will be filled with variables defined in the local environment. You can check the default variable list with chezmoi data.

Write the variables in chezmoi.toml:

# Write local settings instead of `chezmoi edit-config`.
vi ~/.config/chezmoi/chezmoi.toml
[data]
name = "privateUser"
email = "private@gmail.com"

After writing all this, try using chezmoi apply -vn or chezmoi init -vn to see the template variables filled with data values in the config file that is generated.

Auto Commit and Push

Simply editing dotfiles with chezmoi edit does not automatically reflect changes to the git in the local repository.

# You have to do it manually.
chezmoi cd
git add .
git commit -m "update something"
git push

To automate this process, you need to add settings to chezmoi.toml.

# `~/.config/chezmoi/chezmoi.toml`
[git]
# autoAdd = true
autoCommit = true # add + commit
autoPush = true

However, if you automate the push as well, sensitive files could accidentally be uploaded to the remote repository. Therefore, I personally recommend enabling the automatic options only up to commit (autoCommit) and leaving autoPush off.

Managing Brew Packages

If you find a useful tool at work, don't forget to install it in your personal environment too. Let's manage it with chezmoi.

chezmoi cd
vi run_once_before_install-packages-darwin.sh.tmpl

run_once_ is a script prefix used by chezmoi: a script named with it runs only if it has never been executed before. Adding the before_ prefix makes the script run before chezmoi creates the dotfiles. A script written with these prefixes is executed in two cases:

  • When it has never been executed before (initial setup)
  • When the script itself has been modified (update)

By scripting brew bundle using these keywords, you can have uniform brew packages across all environments. Here is the script I am using:

# Only run on MacOS
{{- if eq .chezmoi.os "darwin" -}}
#!/bin/bash

PACKAGES=(
asdf
exa
ranger
chezmoi
difftastic
gnupg
fzf
gh
glab
htop
httpie
neovim
nmap
starship
daipeihust/tap/im-select
)

CASKS=(
alt-tab
shottr
raycast
docker
hammerspoon
hiddenbar
karabiner-elements
obsidian
notion
slack
stats
visual-studio-code
warp
wireshark
google-chrome
)

# Install Homebrew if not already installed
if test ! $(which brew); then
printf '\n\n\e[33mHomebrew not found. \e[0mInstalling Homebrew...'
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install.sh)"
else
printf '\n\n\e[0mHomebrew found. Continuing...'
fi

# Update homebrew packages
printf '\nInitiating Homebrew update...\n'
brew update

printf '\nInstalling packages...\n'
brew install ${PACKAGES[@]}

printf '\n\nRemoving out of date packages...\n'
brew cleanup

printf '\n\nInstalling cask apps...\n'
brew install --cask ${CASKS[@]}

{{ end -}}

Even if you are not familiar with sh, it shouldn't be too difficult to understand. Define the PACKAGES list for packages installed with brew install and CASKS for applications installed with brew install --cask. The installation process will be carried out by the script.

Scripting is a relatively complex feature among the functionalities available in chezmoi. There are various ways to apply it, and the same function can be defined differently. For more detailed usage, refer to the official documentation.

Conclusion

In this post, I summarized useful chezmoi settings following the basic usage explained in the previous post. The usage of the script I introduced at the end may seem somewhat complex, contrary to the title of basic settings, but once applied, it can be very convenient to use.


Managing Dotfiles Conveniently with Chezmoi

· 7 min read
Haril Song
Owner, Software Engineer at 42dot

Have you ever felt overwhelmed at the thought of setting up your development environment again after getting a new MacBook? Or perhaps you found a fantastic tool during work, but felt too lazy to set it up again in your personal environment at home? Have you ever hesitated to push your configurations to GitHub due to security concerns?

If you've ever used multiple devices, you might have faced these dilemmas. How can you manage your configurations consistently across different platforms?

Problem

Configuration files like .zshrc for various software are scattered across different paths, including $HOME itself. However, initializing Git directly in $HOME to version-control these files is daunting: the sheer scope it would have to scan can actually make managing the files harder.

Maintaining a consistent development environment across three devices – a MacBook for work, an iMac at home, and a personal MacBook – seemed practically impossible.

Modifying just one Vim shortcut at work and realizing you have to do the same on the other two devices after work... 😭

With the advent of the Apple Silicon era, the significant disparity between Intel Macs and the new devices made achieving a unified environment even more challenging. I had pondered over this issue for quite some time, as I often forgot to set up aliases frequently used at work on my home machine.

Some of the methods I tried to solve this problem briefly were:

  1. Centralizing dotfiles in a specific folder and managing them as a Git project

    1. Dotfile locations vary: most tools expect their files at specific predefined paths, not necessarily under a single folder.
    2. Since the files cannot live inside the Git-managed folder, you cannot edit them there directly, and you still end up copy-pasting them into place on other devices.
  2. Symbolic link

    1. To set up on a new device, you need to recreate symbolic links for all files in the correct locations(...). If you have many files to manage, this can be a tiresome task.
    2. The usage is more complex than Git, requiring attention to various details.

In the end, I resorted to the Git method, but only for files that do not live directly in $HOME (~/.ssh/config, ~/.config/nvim, etc.), partially giving up on the files located directly in $HOME (~/.zshrc, ~/.gitconfig, etc.), until I discovered chezmoi!

Now, let me introduce you to chezmoi, which elegantly solves this challenging problem.

What is Chezmoi?

Manage your dotfiles across multiple diverse machines, securely. - chezmoi.io

Chezmoi is a tool that allows you to manage numerous dotfiles consistently across various environments and devices. As described in the official documentation, with just a few settings, you can ensure 'security'. You don't need to worry about where your dotfiles are or where they should be. You simply need to inform chezmoi of the dotfiles to manage.

Concept

How is this seemingly magical feat possible? 🤔

In essence, chezmoi stores dotfiles in ~/.local/share/chezmoi and when you run chezmoi apply, it checks the status of each dotfile, making minimal changes to ensure they match the desired state. For more detailed concepts, refer to the reference manual.

Let's now briefly explain how to use it.

Getting Started with Chezmoi

Once you have installed chezmoi (installation guide here), perform the initialization with the following command:

chezmoi init

This action creates a new Git repository in ~/.local/share/chezmoi (working directory) on your local device to store dotfiles. By default, chezmoi reflects modifications in the working directory on your local device.

If you want to manage your ~/.zshrc file through chezmoi, run the following command:

chezmoi add ~/.zshrc

You will see that the ~/.zshrc file has been copied to ~/.local/share/chezmoi/dot_zshrc.

To edit the ~/.zshrc file managed by chezmoi, use the following command:

chezmoi edit ~/.zshrc

This command opens ~/.local/share/chezmoi/dot_zshrc with $EDITOR for editing. Make some changes for testing and save.

info

If $EDITOR is not in the environment variables, it defaults to using vi.

To check what changes have been made in the working directory, use the following command:

chezmoi diff

If you want to apply the changes made by chezmoi to your local device, use the following command:

chezmoi apply -v

All chezmoi commands can use the -v (verbose) option. This option visually displays what is being applied to your local device, making it clear in the console. By using the -n (dry run) option, you can execute commands without applying them. Therefore, combining the -v and -n options allows you to preview what actions will be taken when running unfamiliar commands.

Now, let's access the source directory directly and push the contents of chezmoi to a remote repository. It is recommended to name the repository dotfiles, as I will explain later.

chezmoi cd
git add .
git commit -m "Initial commit"
git remote add origin https://github.com/$GITHUB_USERNAME/dotfiles.git
git push
tip

By writing the relevant settings in the chezmoi.toml file, you can automate the repository synchronization process for more convenient use.

To exit the chezmoi working directory, use the following command:

exit

Visualizing the process up to this point, it looks like this:

image

Using Chezmoi on Another Device

This is why we use chezmoi. Let's fetch the contents on the second device using chezmoi. I have used an SSH URL for this example. Assume that chezmoi is already installed on the second device.

chezmoi init git@github.com:$GITHUB_USERNAME/dotfiles.git

By initializing with a specific repository, chezmoi automatically checks for submodules or necessary external source files and generates the chezmoi config file based on the options.

Inspect what changes chezmoi will bring to the second device using the diff command we saw earlier.

chezmoi diff

If you are satisfied with applying all the changes, use the apply command we discussed earlier.

chezmoi apply -v

If you need to modify some files before applying locally, use edit.

chezmoi edit $FILE

Alternatively, you can use a merge tool to apply local changes as if you were using Git merge.

chezmoi merge $FILE
tip

Using chezmoi merge-all will perform a merge operation on all files that require merging.

You can perform all these steps at once with the following command:

chezmoi update -v

Visualizing this process, it looks like this:

image

You can also apply all the steps needed on the second device at initialization...! This feature can be incredibly useful if the second device is a newly purchased one.

chezmoi init --apply https://github.com/$GITHUB_USERNAME/dotfiles.git

I recommended naming the repository dotfiles earlier because if the repository is named dotfiles, you can use a shortened version of the previous command.

chezmoi init --apply $GITHUB_USERNAME

image

It's truly convenient...🥹 I believe it will be one of the best open-source tools discovered in 2023.

Conclusion

Chezmoi is impressively well-documented and actively developed. Developed in Golang, it feels quite fast 😄. With some knowledge of shell scripting, you can implement highly automated processes, creating an environment where you hardly need to intervene for settings across multiple devices.

In this article, I covered the basic usage of chezmoi. In the next article, we will delve into managing chezmoi configuration files and maintaining security.

info

If you are curious about my configurations, you can check them here.


Optimizing Pagination in Spring Batch with Composite Keys

· 7 min read
Haril Song
Owner, Software Engineer at 42dot

In this article, I will discuss the issues and solutions encountered when querying a table with millions of data using Spring Batch.

Environment

  • Spring Batch 5.0.1
  • PostgreSQL 11

Problem

While using JdbcPagingItemReader to query a large table, I noticed a significant slowdown in query performance over time and decided to investigate the code in detail.

Default Behavior

The following query is automatically generated and executed by the PagingQueryProvider:

SELECT *
FROM large_table
WHERE id > ?
ORDER BY id
LIMIT 1000;

In Spring Batch, when using JdbcPagingItemReader, instead of using an offset, it generates a where clause for pagination. This allows for fast retrieval of data even from tables with millions of records without any delays.

tip

Even with LIMIT, using OFFSET means reading all previous data again. Therefore, as the amount of data to be read increases, the performance degrades. For more information, refer to the article1.
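For reference, a JdbcPagingItemReader with a composite sort key is typically configured along the lines of the sketch below; the table, column names, page size, and row mapper are illustrative assumptions rather than the configuration used in this post.

import java.util.LinkedHashMap;
import java.util.Map;

import javax.sql.DataSource;

import org.springframework.batch.item.database.JdbcPagingItemReader;
import org.springframework.batch.item.database.Order;
import org.springframework.batch.item.database.builder.JdbcPagingItemReaderBuilder;
import org.springframework.jdbc.core.ColumnMapRowMapper;

public class LargeTableReaderFactory {

    public JdbcPagingItemReader<Map<String, Object>> reader(DataSource dataSource) {
        Map<String, Order> sortKeys = new LinkedHashMap<>();
        sortKeys.put("create_at", Order.ASCENDING);
        sortKeys.put("user_id", Order.ASCENDING);
        sortKeys.put("content_no", Order.ASCENDING);

        return new JdbcPagingItemReaderBuilder<Map<String, Object>>()
                .name("largeTableReader")
                .dataSource(dataSource)
                .selectClause("SELECT *")
                .fromClause("FROM large_table")
                .sortKeys(sortKeys)
                .pageSize(1000)
                .rowMapper(new ColumnMapRowMapper())
                .build();
    }
}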

Using Multiple Sorting Conditions

The problem arises when querying a table with composite keys. When a composite key consisting of 3 columns is used as the sort key, the generated query looks like this:

SELECT *
FROM large_table
WHERE ((create_at > ?) OR
(create_at = ? AND user_id > ?) OR
(create_at = ? AND user_id = ? AND content_no > ?))
ORDER BY create_at, user_id, content_no
LIMIT 1000;

However, queries with OR operations in the where clause do not utilize indexes effectively. OR operations require executing multiple conditions, making it difficult for the optimizer to make accurate decisions. When I examined the explain output, I found the following results:

Limit  (cost=0.56..1902.12 rows=1000 width=327) (actual time=29065.549..29070.808 rows=1000 loops=1)
-> Index Scan using th_large_table_pkey on large_table (cost=0.56..31990859.76 rows=16823528 width=327) (actual time=29065.547..29070.627 rows=1000 loops=1)
" Filter: ((""create_at"" > '2023-01-28 06:58:13'::create_at without time zone) OR ((""create_at"" = '2023-01-28 06:58:13'::create_at without time zone) AND ((user_id)::text > '441997000'::text)) OR ((""create_at"" = '2023-01-28 06:58:13'::create_at without time zone) AND ((user_id)::text = '441997000'::text) AND ((content_no)::text > '9070711'::text)))"
Rows Removed by Filter: 10000001
Planning Time: 0.152 ms
Execution Time: 29070.915 ms

With a query execution time close to 30 seconds, most of the data is discarded during filtering on the index, resulting in unnecessary time wastage.

Since PostgreSQL manages composite keys as tuples, writing queries using tuples allows for utilizing the advantages of Index scan even in complex where clauses.

SELECT *
FROM large_table
WHERE (create_at, user_id, content_no) > (?, ?, ?)
ORDER BY create_at, user_id, content_no
LIMIT 1000;
Limit  (cost=0.56..1196.69 rows=1000 width=327) (actual time=3.204..11.393 rows=1000 loops=1)
-> Index Scan using th_large_table_pkey on large_table (cost=0.56..20122898.60 rows=16823319 width=327) (actual time=3.202..11.297 rows=1000 loops=1)
" Index Cond: (ROW(""create_at"", (user_id)::text, (content_no)::text) > ROW('2023-01-28 06:58:13'::create_at without time zone, '441997000'::text, '9070711'::text))"
Planning Time: 0.276 ms
Execution Time: 11.475 ms

It can be observed that data is directly retrieved through the index without discarding any data through filtering.

Therefore, if the query executed by JdbcPagingItemReader is written with tuple comparison, pagination stays fast even when a composite key is used as the sort key.

Let's dive into the code immediately.

Modifying PagingQueryProvider

Analysis

As mentioned earlier, the responsibility of generating queries lies with the PagingQueryProvider. Since I am using PostgreSQL, the PostgresPagingQueryProvider is selected and used.

Image: The generated query differs based on whether it includes a group by clause.

By examining SqlPagingQueryUtils's buildSortConditions, we can see how the problematic query is generated.

image

Within the nested for loop, we can see how the query is generated based on the sort key.

Customizing buildSortConditions

Having directly inspected the code responsible for query generation, I decided to modify this code to achieve the desired behavior. However, direct overriding of this code is not possible, so I created a new class called PostgresOptimizingQueryProvider and re-implemented the code within this class.

private String buildSortConditions(StringBuilder sql) {
Map<String, Order> sortKeys = getSortKeys();
sql.append("(");
sortKeys.keySet().forEach(key -> sql.append(key).append(", "));
sql.delete(sql.length() - 2, sql.length());
if (is(sortKeys, order -> order == Order.ASCENDING)) {
sql.append(") > (");
} else if (is(sortKeys, order -> order == Order.DESCENDING)) {
sql.append(") < (");
} else {
throw new IllegalStateException("Cannot mix ascending and descending sort keys"); // Limitation of tuples
}
sortKeys.keySet().forEach(key -> sql.append("?, "));
sql.delete(sql.length() - 2, sql.length());
sql.append(")");
return sql.toString();
}

Test Code

To ensure that the newly implemented section works correctly, I validated it through a test code.

@Test
@DisplayName("The Where clause generated instead of Offset is (create_at, user_id, content_no) > (?, ?, ?).")
void test() {
// given
PostgresOptimizingQueryProvider queryProvider = new PostgresOptimizingQueryProvider();
queryProvider.setSelectClause("*");
queryProvider.setFromClause("large_table");

Map<String, Order> parameterMap = new LinkedHashMap<>();
parameterMap.put("create_at", Order.ASCENDING);
parameterMap.put("user_id", Order.ASCENDING);
parameterMap.put("content_no", Order.ASCENDING);
queryProvider.setSortKeys(parameterMap);

// when
String firstQuery = queryProvider.generateFirstPageQuery(10);
String secondQuery = queryProvider.generateRemainingPagesQuery(10);

// then
assertThat(firstQuery).isEqualTo("SELECT * FROM large_table ORDER BY create_at ASC, user_id ASC, content_no ASC LIMIT 10");
assertThat(secondQuery).isEqualTo("SELECT * FROM large_table WHERE (create_at, user_id, content_no) > (?, ?, ?) ORDER BY create_at ASC, user_id ASC, content_no ASC LIMIT 10");
}

image

The successful execution confirms that it is working as intended, and I proceeded to run the batch.

image

Guy: "is it over?"

Boy: "Shut up, It'll happen again!"

-- Within the Webtoon Hive

However, an out-of-range error occurred, indicating that the parameter binding did not recognize that the query had changed.

image

It seems the parameter injection logic does not automatically adapt just because the query has changed, so let's debug again and find where the parameters are injected.

JdbcOptimizedPagingItemReader

The parameters are created directly by JdbcPagingItemReader, and I found that getParameterList in JdbcPagingItemReader grows the number of parameters injected into the SQL by iterating over the sort key values, in order to fill every branch of the original OR-based where clause.

image

I thought I could just override this method, but unfortunately it is not possible because it is private. After much thought, I copied the entire JdbcPagingItemReader and modified only the getParameterList part.

The getParameterList method is overridden in JdbcOptimizedPagingItemReader as follows:

private List<Object> getParameterList(Map<String, Object> values, Map<String, Object> sortKeyValue) {
    // ...
    // Returns the parameters that need to be set in the where clause without increasing them.
    return new ArrayList<>(sortKeyValue.values());
}

Since the tuple comparison needs each sort key value only once, there is no need to expand the parameters; the values from sortKeyValue are simply returned as the parameter list.
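
For reference, here is a rough sketch of how the two custom classes might be wired into a reader bean. The row type LargeTableRow, the bean wiring, and the reader name are illustrative assumptions, and the custom reader is assumed to keep the same setters as JdbcPagingItemReader since it is a copy of that class.

@Bean
public JdbcOptimizedPagingItemReader<LargeTableRow> largeTableReader(DataSource dataSource) {
    PostgresOptimizingQueryProvider queryProvider = new PostgresOptimizingQueryProvider();
    queryProvider.setSelectClause("*");
    queryProvider.setFromClause("large_table");

    Map<String, Order> sortKeys = new LinkedHashMap<>();
    sortKeys.put("create_at", Order.ASCENDING);
    sortKeys.put("user_id", Order.ASCENDING);
    sortKeys.put("content_no", Order.ASCENDING);
    queryProvider.setSortKeys(sortKeys);

    JdbcOptimizedPagingItemReader<LargeTableRow> reader = new JdbcOptimizedPagingItemReader<>();
    reader.setName("largeTableReader");
    reader.setDataSource(dataSource);
    reader.setQueryProvider(queryProvider);
    reader.setRowMapper(new BeanPropertyRowMapper<>(LargeTableRow.class)); // hypothetical row type
    reader.setPageSize(2000); // matches the LIMIT 2000 seen in the logs below
    return reader;
}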

Now, let's run the batch again.

The first query is executed without requiring parameters,

2023-03-13T17:43:14.240+09:00 DEBUG 70125 --- [           main] o.s.jdbc.core.JdbcTemplate               : Executing SQL query [SELECT * FROM large_table ORDER BY create_at ASC, user_id ASC, content_no ASC LIMIT 2000]

Subsequent queries are executed with the sort key values from the previous page as parameters.

2023-03-13T17:43:14.253+09:00 DEBUG 70125 --- [           main] o.s.jdbc.core.JdbcTemplate               : Executing prepared SQL statement [SELECT * FROM large_table WHERE (create_at, user_id, content_no) > (?, ?, ?) ORDER BY create_at ASC, user_id ASC, content_no ASC LIMIT 2000]

The queries are executed exactly as intended! 🎉

For pagination processing with over 10 million records, queries that used to take around 30 seconds now run in the range of 0.1 seconds, representing a significant performance improvement of nearly 300 times.

image

Now, regardless of the amount of data, queries can be read within milliseconds without worrying about performance degradation. 😎

Conclusion

In this article, I introduced the method I used to optimize Spring Batch in an environment with composite keys. However, this method has a drawback: all columns that make up the composite key must share the same sort direction. If asc and desc are mixed among the keys of the composite index, a separate index is needed to work around the issue 😢

Let's summarize today's content in one line and conclude the article.

"Avoid using composite keys as much as possible and use surrogate keys unrelated to the business."

Reference


Footnotes

  1. https://jojoldu.tistory.com/528

Improving Code Productivity with Design Patterns in O2

· 7 min read
Haril Song
Owner, Software Engineer at 42dot

This article discusses the process of using design patterns to improve the structure of the O2 project for more flexible management.

Problem

While diligently working on development, a sudden Issue was raised one day.

image

Reflecting the contents of the Issue was not difficult. However, as I delved into the code, an issue that had been put off for a while started to surface.

image

Below is the implementation of the Markdown syntax conversion code that had been previously written.

warning

Due to the length of the code, a partial excerpt is provided. For the full code, please refer to O2 plugin v1.1.1 🙏

export async function convertToChirpy(plugin: O2Plugin) {
    try {
        await backupOriginalNotes(plugin);
        const markdownFiles = await renameMarkdownFile(plugin);
        for (const file of markdownFiles) {
            // remove double square brackets
            const title = file.name.replace('.md', '').replace(/\s/g, '-');
            const contents = removeSquareBrackets(await plugin.app.vault.read(file));
            // change resource link to jekyll link
            const resourceConvertedContents = convertResourceLink(plugin, title, contents);

            // callout
            const result = convertCalloutSyntaxToChirpy(resourceConvertedContents);

            await plugin.app.vault.modify(file, result);
        }

        await moveFilesToChirpy(plugin);
        new Notice('Chirpy conversion complete.');
    } catch (e) {
        console.error(e);
        new Notice('Chirpy conversion failed.');
    }
}

Being unfamiliar with TypeScript and Obsidian, I had focused more on implementing features than on the overall design. Now that I was trying to add new features, it was hard to anticipate side effects, and the code did not clearly communicate the developer's intent.

To better understand the code flow, I created a graph of the current process as follows.

Although I had separated functionalities into functions, the code was still procedurally written, where the order of code lines significantly impacted the overall operation. Adding a new feature in this state would require precise implementation to avoid breaking the entire conversion process. So, where would be the correct place to implement a new feature? Most likely, the answer would be 'I need to see the code.' Currently, with most of the code written in one large file, it was almost equivalent to needing to analyze the entire code. In object-oriented terms, one could say that the Single Responsibility Principle (SRP) was not being properly followed.

This state, no matter how positively described, did not seem easy to maintain. Since the O2 plugin was created for my personal use, I could not justify producing spaghetti code that was difficult to maintain by rationalizing, 'It's because I'm not familiar with TS.'

Before resolving the Issue, I decided to first improve the structure, putting the glory aside for a moment.

What Structure Should Be Implemented?

The O2 plugin, as a syntax conversion plugin, must be capable of converting Obsidian's Markdown syntax into various formats, which is a clear requirement.

Therefore, the design should focus primarily on scalability.

Each platform logic should be modularized, and the conversion process abstracted to implement it like a template. This way, when implementing new features for supporting different platform syntaxes, developers can focus only on the small unit of implementing syntax conversion without needing to reimplement the entire flow.

Based on this, the design requirements are as follows:

  1. Strings (content of Markdown files) should be converted in order (or not) as needed.
  2. Specific conversion logic should be skippable or dynamically controllable based on external settings.
  3. Implementing new features should be simple and should have minimal or no impact on existing code.

Since the conversions run in a sequence and new steps need to be added in between, the Chain of Responsibility pattern seemed appropriate for this purpose.

Applying Design Patterns

Process->Process->Process->Done! : Summary of Chain of Responsibility

export interface Converter {
    setNext(next: Converter): Converter;
    convert(input: string): string;
}

export abstract class AbstractConverter implements Converter {
    private next: Converter;

    setNext(next: Converter): Converter {
        this.next = next;
        return next;
    }

    convert(input: string): string {
        if (this.next) {
            return this.next.convert(input);
        }
        return input;
    }
}

The Converter interface is responsible for converting a given string through convert(input). setNext specifies the next Converter to be processed and returns it, so the calls can be chained.

With the abstraction in place, the conversion logic that was previously implemented in one file was separated into individual Converter implementations, assigning responsibility for each feature. Below is an example of the CalloutConverter that separates the Callout syntax conversion logic.

export class CalloutConverter extends AbstractConverter {
    convert(input: string): string {
        const result = convertCalloutSyntaxToChirpy(input);
        return super.convert(result);
    }
}

function convertCalloutSyntaxToChirpy(content: string) {
    function replacer(match: string, p1: string, p2: string) {
        return `${p2}\n{: .prompt-${replaceKeyword(p1)}}`;
    }

    return content.replace(ObsidianRegex.CALLOUT, replacer);
}

Now, the relationships between the classes are as follows.

Now, by combining the smallest units of functionality implemented in each Converter, a chain is created to perform operations in sequence. This is why this pattern is called Chain of Responsibility.

export async function convertToChirpy(plugin: O2Plugin) {
    // ...
    // Create conversion chain
    frontMatterConverter.setNext(bracketConverter)
        .setNext(resourceLinkConverter)
        .setNext(calloutConverter);

    // Request the frontMatterConverter at the head to perform the conversion, and the connected converters will operate sequentially.
    const result = frontMatterConverter.convert(await plugin.app.vault.read(file));
    await plugin.app.vault.modify(file, result);
    // ...
}

Now that the logic has been separated into appropriate responsibilities, reading the code has become much easier. When needing to add a new feature, only the necessary Converter needs to be implemented. Additionally, without needing to know how other Converters work, new features can be added through setNext. Each Converter operates independently, following the principle of encapsulation.

Finally, I checked if all tests passed and created a PR.

image

Next Step

Although the structure has improved significantly, one drawback remains. In a structure linked through setNext, the Converter at the very front must be the one invoked for the chain to work correctly. If a different Converter is called instead of the head, the result may differ from what was intended. For example, if a NewConverter needs to run before frontMatterConverter but the call site still invokes frontMatterConverter.convert(input), NewConverter will never be applied.

This is one of the aspects that developers need to pay attention to, as there is room for error, and it is one of the areas that needs improvement in the future. For instance, implementing a kind of Context to contain the Converters and executing the conversion process without directly calling the Converters could be a way to improve. This is something I plan to implement in the next version.


2023-03-12 Update

Thanks to a PR, the same functionality is now achieved with a more flexible structure that uses composition instead of inheritance.

Conclusion

In this article, I described the process of redistributing roles and responsibilities through design patterns from a procedurally written, monolithic file to a more object-oriented and maintainable code.

info

The complete code can be found on GitHub.

Caution when setting sortKeys in JdbcItemReader

· 4 min read
Haril Song
Owner, Software Engineer at 42dot

I would like to share an issue I encountered while retrieving large amounts of data in PostgreSQL.

Problem

While using the Spring Batch JdbcPagingItemReader, I set the sortKeys as follows:

...
.selectClause("SELECT *")
.fromClause("FROM big_partitioned_table_" + yearMonth)
.sortKeys(Map.of(
        "timestamp", Order.ASCENDING,
        "mmsi", Order.ASCENDING,
        "imo_no", Order.ASCENDING
))
...

Although the current table's index is set as a composite index with timestamp, mmsi, and imo_no, I expected an Index scan to occur during the retrieval. However, in reality, a Seq scan occurred. The target table contains around 200 million records, causing the batch process to show no signs of completion. Eventually, I had to forcibly shut down the batch. Why did a Seq scan occur even when querying with index conditions? 🤔

In PostgreSQL, Seq scans occur in the following cases:

  • When the optimizer determines that a Seq scan is faster due to the table having a small amount of data
  • When the data being queried is too large (more than 10% of the table), and the optimizer deems Index scan less efficient than Seq scan
    • In such cases, you can use limit to adjust the amount of data and execute an Index scan

In this case, since select * was used, a Seq scan was a possibility because of the large amount of data being queried. However, since the chunk size makes the query run with a limit, I expected an Index scan to be used for every page.

Debugging

To identify the exact cause, let's check the actual query being executed. By slightly modifying the YAML configuration, we can observe the queries executed by the JdbcPagingItemReader.

logging:
  level.org.springframework.jdbc.core.JdbcTemplate: DEBUG

I reran the batch process to directly observe the queries.

SELECT * FROM big_partitioned_table_202301 ORDER BY imo_no ASC, mmsi ASC, timestamp ASC LIMIT 1000

The order of the order by clause seemed odd, so I ran it again.

SELECT * FROM big_partitioned_table_202301 ORDER BY timestamp ASC, mmsi ASC, imo_no ASC LIMIT 1000

It was evident that the order by condition was changing with each execution.

For an Index scan to be used, the sort conditions must be specified in the same order as the index columns. Passing a general Map to sortKeys does not guarantee iteration order, so the SQL does not execute as intended.
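
The root cause can be seen in isolation with a minimal sketch (Order here is org.springframework.batch.item.database.Order):

// Map.of, like HashMap, does not guarantee iteration order, and the order can differ between JVM runs.
Map<String, Order> sortKeys = Map.of(
        "timestamp", Order.ASCENDING,
        "mmsi", Order.ASCENDING,
        "imo_no", Order.ASCENDING
);
// May print [mmsi, imo_no, timestamp] on one run and a different order on the next,
// which is exactly why the generated ORDER BY clause kept changing.
System.out.println(sortKeys.keySet());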

To maintain order, you can use a LinkedHashMap to create the sortKeys.

Map<String, Order> sortKeys = new LinkedHashMap<>();
sortKeys.put("timestamp", Order.ASCENDING);
sortKeys.put("mmsi", Order.ASCENDING);
sortKeys.put("imo_no", Order.ASCENDING);

After making this adjustment and rerunning the batch, we could confirm that the sorting conditions were specified in the correct order.

SELECT * FROM big_partitioned_table_202301 ORDER BY timestamp ASC, mmsi ASC, imo_no ASC LIMIT 1000

Conclusion

The issue of Seq scan occurring instead of an Index scan was not something that could be verified with the application's test code, so we were unaware of any potential bugs. It was only when we observed a significant slowdown in the batch process in the production environment that we realized something was amiss. During development, I had not anticipated that the order of sorting conditions could change due to the Map data structure.

Fortunately, if an Index scan does not occur due to the large amount of data being queried, the batch process would slow down significantly with the LIMIT query, making it easy to notice the issue. However, if the data volume was low and the execution speeds of Index scan and Seq scan were similar, it might have taken a while to notice the problem.

Further consideration is needed on how to anticipate and address this issue in advance. Since the order of the order by condition is often crucial, it is advisable to use LinkedHashMap over HashMap whenever possible.