if(kakaoAI) 2024 Participation Review
Overview
- Date attended: Thursday, October 24, 2024 (day 3)
- Venue: Kakao AI Campus
Do you use cloud storage across multiple devices? If so, you've probably noticed conflict files gradually piling up.
Conflict files that keep piling up whenever you turn around
Conflict files tend to accumulate for various reasons, such as making edits before files are synced or experiencing network delays.
Personally, I like to keep things tidy, so I regularly delete these leftover files. Today, though, the repetitive task felt tedious enough that I decided to write a shell script to automate it and show off my developer skills.
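As a sketch of the idea (the conflict-file naming pattern and the sync directory vary by provider, so treat both as placeholders):

```bash
#!/bin/bash
# Rough sketch: list, then delete, sync-conflict files.
# The naming pattern differs by provider (e.g. Dropbox appends
# "conflicted copy"), so SYNC_DIR and PATTERN are assumptions to adapt.
SYNC_DIR="$HOME/Dropbox"
PATTERN="*conflict*"

# Review the matches first...
find "$SYNC_DIR" -type f -name "$PATTERN" -print
# ...then delete them once you're confident:
# find "$SYNC_DIR" -type f -name "$PATTERN" -delete
```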
Recently, I undertook the task of moving my blog to a new platform. As I encountered various issues, I jotted down potential solutions, thinking they might be useful to others. Here’s a detailed account of the migration process.
With mise, you can use the exact version of any language or tool you need, switch between different versions, and specify versions for each project. By specifying versions in a file, you can reduce communication costs among team members about which version to use.
Until now, the most famous tool in this field was asdf[^fn-nth-1]. However, after starting to use mise recently, I found that mise offers a slightly better user experience. In this post, I will introduce some simple use cases.
I'm not sure if it's intentional, but even their web pages look similar.
mise (pronounced 'meez') is a tool for setting up development environments. The name comes from a French culinary term that roughly translates to "setting" or "putting in place": having all your tools and ingredients ready before you start cooking.
Here are some of its simple features:
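- Install and switch between multiple versions of dev tools (like asdf, nvm, or pyenv)
- Manage environment variables per project (like direnv)
- Run project tasks (like make)

As a quick taste, version pinning looks roughly like this (the version numbers are placeholders):

```bash
# Pin a tool version for the current project
# (mise records it in the project's config file)
mise use node@20
# Install everything the project's config specifies
mise install
# Show which tool versions are active in this directory
mise current
```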
Implementing a server application that can handle multiple client requests simultaneously is now very easy. Just using Spring MVC alone can get you there in no time. However, as an engineer, I am curious about the underlying principles. In this article, we will embark on a journey to reflect on the considerations that were made to implement a multi-connection server by questioning the things that may seem obvious.
You can check the example code on GitHub.
The first destination is 'Socket'. From a network programming perspective, a socket is a communication endpoint used like a file to exchange data over a network. The description 'used like a file' is important because it is accessed through a file descriptor (fd) and supports I/O operations similar to files.
While a socket could be identified by its own IP and port together with the peer's IP and port, the fd is preferred: until a connection is accepted, a socket has no peer information, and a 4-tuple requires handling more data than a simple integer like an fd.
To implement a server application using sockets, you need to go through the following steps:
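In outline, this follows the standard BSD socket sequence (the details live in the GitHub examples above):

1. Create a socket with `socket()`, receiving a file descriptor.
2. Bind it to an IP address and port with `bind()`.
3. Mark it as a passive, listening socket with `listen()`.
4. Accept an incoming connection with `accept()`, which returns a new fd for that client.
5. Exchange data with `read()`/`write()` on the client fd, then `close()` it.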
In PostgreSQL, the FOR UPDATE lock is used to explicitly lock rows in a table while performing a SELECT query within a transaction. This lock mode is typically used to ensure that the selected rows do not change until the transaction is completed, preventing other transactions from modifying or locking these rows in a conflicting manner.
For example, it can be used to prevent other customers from changing data while a specific customer is going through the ticket booking process.
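For instance, a booking flow might look like this (a minimal sketch with a hypothetical `seat` table):

```sql
BEGIN;
-- Lock the row: other transactions that try to UPDATE, DELETE,
-- or SELECT ... FOR UPDATE this row will block until we finish.
SELECT * FROM seat WHERE id = 42 FOR UPDATE;

UPDATE seat SET reserved_by = 'haril' WHERE id = 42;
COMMIT;
```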
The cases we will examine in this article are somewhat special: how does `select for update` behave when locked reads and unlocked reads are mixed? In PostgreSQL, the `select for update` clause operates differently depending on the transaction isolation level, so it is necessary to examine how it behaves at each isolation level.
Let's assume a scenario where the following data exists and is about to be modified.
| id | name |
|----|------|
| 1  | null |
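For reproducibility, the setup might look like this (the actual table name isn't given here, so `person` is a placeholder):

```sql
CREATE TABLE person (
    id   BIGINT PRIMARY KEY,
    name TEXT
);
INSERT INTO person (id, name) VALUES (1, NULL);
```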
How do we transmit data over a network? Establishing a connection with the recipient and sending the data all at once might seem like the most straightforward approach. However, this method becomes inefficient when handling multiple requests because a single connection can only maintain one data transfer at a time. If a connection is prolonged due to a large data transfer, other data will have to wait.
To efficiently handle the data transmission process, networks divide data into multiple pieces and require the receiving end to reassemble them. These fragmented data structures are called packets. Packets include additional information to allow the receiving end to reassemble the data in the correct order.
While transmitting data in multiple packets enables efficient processing of many requests through packet switching, it can also lead to various errors such as data loss or incorrect delivery order. How should we debug such issues? 🤔
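One practical starting point is capturing the packets themselves, for example with tcpdump (the port number is a placeholder):

```bash
# Capture traffic on port 8080 across all interfaces and save it
# to a file that can be inspected later, e.g. with Wireshark.
sudo tcpdump -i any port 8080 -w capture.pcap
```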
- Running `bootRun` locally is necessary.
- `.env` files are typically ignored in Git, making version tracking difficult and prone to fragmentation.
- How should `.env` files be managed?
- Managing `.env` files is convenient with AWS CLI.
- Versioning `.env` files can be done through snapshots.
If that's it, the article might seem a bit dull, right? Of course, there are still a few issues remaining.
When using S3, it's common to end up with many buckets due to file structure optimization or business-specific categorization.
aws s3 cp s3://something.service.com/environment/.env .env
If the `.env` file is missing, you'll need to download it using AWS CLI as shown above. Unless someone shares the right bucket with you in advance, you might have to search through all the buckets to find the environment variable file, which is inconvenient. The whole point was to avoid passing files around, so having to be handed something yet again feels a bit cumbersome.
Too many buckets. Where's the env...?
Automating the process of exploring S3 buckets to find and download the necessary `.env` file would make things much easier. This can be achieved by writing a script using tools like fzf or gum.
Some of you may have already noticed: Spring Boot reads system environment variables to fill in placeholders in YAML files. However, the `.env` file alone won't set system environment variables, so its values are not picked up during Spring Boot's initialization process.
Let's briefly look at how it works.
# .env
HELLO=WORLD
# application.yml
something:
hello: ${HELLO} # Retrieves the value from the HELLO environment variable on the OS.
@Slf4j
@Component
public class HelloWorld {
@Value("${something.hello}")
private String hello;
@PostConstruct
public void init() {
log.info("Hello: {}", hello);
}
}
SystemEnvironmentPropertySource.java
You can see that the placeholder in `@Value` is not resolved, causing the bean registration to fail and resulting in an error. Just having a `.env` file doesn't register its contents as system environment variables.
To apply the `.env` file, you can either run the `export` command or register the `.env` file in IntelliJ's run configuration. However, using the `export` command to register too many variables globally on your local machine can lead to unintended behavior like overwriting, so it's recommended to manage them individually through IntelliJ's GUI.
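For reference, if you do take the `export` route, one common shell idiom looks like this (it assumes simple KEY=VALUE lines with no spaces, quotes, or multiline values):

```bash
# Export every non-comment KEY=VALUE line of .env into the current shell
export $(grep -v '^#' .env | xargs)
```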
IntelliJ supports configuring `.env` files via the GUI.
The placeholder is resolved and applied correctly.
Phew, the long process of problem identification and scoping has come to an end. Let's summarize the workflow once more and introduce a script.
1. Find and download the `.env` file from S3.
2. Register the contents of the `.env` file as system environment variables.

The shell script is written to be simple yet stylized using gum.
#!/bin/bash
S3_BUCKET=$(aws s3 ls | awk '{print $3}' | gum filter --reverse --placeholder "Select...") # 1.
# Choose deployment environment
TARGET=$(gum choose --header "Select an environment" "Elastic Container Service" "EC2")
if [ "$TARGET" = "Elastic Container Service" ]; then
TARGET="ecs"
else
TARGET="ec2"
fi
S3_BUCKET_PATH=s3://$S3_BUCKET/$TARGET/
# Search for the env file
ENV_FILE=$(aws s3 ls "$S3_BUCKET_PATH" | grep env | awk '{print $4}' | gum filter --reverse --placeholder "Select...") # 2.
# Confirm
if (gum confirm "Are you sure you want to use $ENV_FILE?"); then
echo "You selected $ENV_FILE"
else
die "Aborted."
fi
ENV_FILE_NAME=$(gum input --prompt.foreground "#04B575" --prompt "Enter the name of the env file: " --value ".env" --placeholder ".env")
gum spin -s meter --title "Copying env file..." -- aws s3 cp "$S3_BUCKET_PATH$ENV_FILE" "$ENV_FILE_NAME" # 3.
echo "Done."
1. Use `gum filter` to select the desired S3 bucket.
2. Search for files whose names contain `env` and assign the selection to a variable named `ENV_FILE`.
3. Confirm the `.env` file and proceed with the download.

I've created a demo video of the execution process.
Demo
After this, you just need to apply the `.env` file copied into the current directory to IntelliJ, as mentioned earlier.
Using direnv and IntelliJ's direnv plugin can make the application even more convenient.
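With direnv, a one-line `.envrc` is enough to load the file automatically whenever you enter the directory (a minimal sketch using direnv's stdlib `dotenv` function):

```bash
# .envrc -- run `direnv allow` once to approve it
dotenv .env
```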
This article discusses the inefficient existing implementation and documents the methods attempted to improve it.
While it wasn't impossible to join tables scattered across multiple databases in a single query, it was challenging...
Given that the primary reason for not being able to use database joins had been resolved, I actively considered utilizing index scans for geometry processing.
To simulate this process, I prepared the exact same data as in the live DB and conducted experiments.
First, I created the index:
CREATE INDEX idx_port_geom ON port USING GIST (geom);
Then, I ran the PostGIS `st_contains` function:
SELECT *
FROM ais AS a
JOIN port AS p ON st_contains(p.geom, a.geom);
Awesome... execution time dropped from somewhere between 1 minute 47 seconds and 2 minutes 30 seconds down to 0.23~0.243 milliseconds.
I didn't prepare a capture, but before applying the index, queries took over 1 minute and 30 seconds.
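If you want to verify that the planner actually uses the GiST index, EXPLAIN is a quick check (the output varies by data and version):

```sql
EXPLAIN ANALYZE
SELECT *
FROM ais AS a
JOIN port AS p ON st_contains(p.geom, a.geom);
-- Look for an "Index Scan using idx_port_geom" node in the plan.
```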
Let's start with the conclusion and then delve into why these results were achieved.
This index is highly useful for querying complex geometric data; its internal structure is illustrated below.
The idea of an R-tree is to divide the plane into rectangles to encompass all indexed points. Index rows store rectangles and can be defined as follows:
"The point we are looking for is inside the given rectangle."
The root of the R-tree contains several of the largest rectangles (which may intersect). Child nodes contain smaller rectangles included in the parent node, collectively encompassing all base points.
In theory, leaf nodes should contain the indexed points themselves, but since all index rows must have the same data type, rectangles collapsed down to points are stored instead.
To visualize this structure, let's look at images for three levels of an R-tree. The points represent airport coordinates.
Level one: two large intersecting rectangles are visible.
Level two: large rectangles are split into smaller areas.
Level three: each rectangle contains only as many points as fit on one index page.
These areas are structured into a tree, which is scanned during queries. For more detailed information, it is recommended to refer to the following article.
In this article, I briefly introduced the specific conditions, the problems encountered, the efforts made to solve them, and the basic concepts needed to address these issues. To summarize:
"Write once, Test anywhere"
Fixture Monkey is a testing object creation library being developed as open source by Naver. The name seems to be inspired by Netflix's open source tool, Chaos Monkey. By generating test fixtures randomly, it allows you to experience chaos engineering in practice.
Since I first encountered it about 2 years ago, it has become one of my favorite open source libraries. I even ended up writing two articles about it.
I haven't written any additional articles as there were too many changes with each version update, but now that version 1.x has been released, I am revisiting it with a fresh perspective.
While my previous articles were based on Java, I am now writing in Kotlin to align with current trends. The content of this article is based on the official documentation with some added insights from my actual usage.
Let's examine the following code to see what issues exist with the traditional approach.
I used JUnit5, which is familiar to Java developers, for the examples. However, personally, I recommend using Kotest in a Kotlin environment.
data class Product (
val id: Long,
val productName: String,
val price: Long,
val options: List<String>,
val createdAt: Instant,
val productType: ProductType,
val merchantInfo: Map<Int, String>
)
enum class ProductType {
ELECTRONICS,
CLOTHING,
FOOD
}
@Test
fun basic() {
val actual: Product = Product(
id = 1L,
price = 1000L,
productName = "productName",
productType = ProductType.FOOD,
options = listOf(
"option1",
"option2"
),
createdAt = Instant.now(),
merchantInfo = mapOf(
1 to "merchant1",
2 to "merchant2"
)
)
// The preparation process is lengthy compared to the test purpose
actual shouldNotBe null
}
Looking at the test code, it feels like there is too much code to write just to create objects for assertion. Due to the nature of the implementation, if properties are not set, a compilation error occurs, so even meaningless properties must be written.
When the preparation required for assertion in test code becomes lengthy, the meaning of the test purpose in the code can become unclear. The person reading this code for the first time would have to examine even seemingly meaningless properties to see if there is any hidden significance. This process increases developers' fatigue.
When directly setting properties to create objects, many edge cases that could occur in various scenarios are often overlooked because the properties are fixed.
val actual: Product = Product(
id = 1L, // What if the id becomes negative?
// ...omitted
)
To find edge cases, developers have to set properties one by one and verify them, but in reality, it is often only after runtime errors occur that developers become aware of edge cases. To easily discover edge cases before errors occur, object properties need to be set with a certain degree of randomness.
One pattern for reusing test objects, the Object Mother pattern, involves creating a factory class that generates test objects and then running tests against the objects it produces. However, this method is not favored, because it requires continuous maintenance of not only the test code but also the factory, and it does not help in identifying edge cases.
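For illustration, such a factory might look like this (a hypothetical `ProductMother` for the `Product` class above; every test gets the same fixed values, which is exactly why edge cases stay hidden):

```kotlin
object ProductMother {
    // Fixed, hand-picked values: reusable, but blind to edge cases
    fun basic(): Product = Product(
        id = 1L,
        productName = "productName",
        price = 1000L,
        options = listOf("option1", "option2"),
        createdAt = Instant.now(),
        productType = ProductType.FOOD,
        merchantInfo = mapOf(1 to "merchant1")
    )
}
```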
Fixture Monkey elegantly addresses the issues of reusability and randomness as mentioned above. Let's see how it solves these problems.
First, add the dependency.
testImplementation("com.navercorp.fixturemonkey:fixture-monkey-starter-kotlin:1.0.13")
Apply `KotlinPlugin()` to ensure that Fixture Monkey works smoothly in a Kotlin environment.
@Test
fun test() {
val fixtureMonkey = FixtureMonkey.builder()
.plugin(KotlinPlugin())
.build()
}
Let's write a test again using the `Product` class we used before.
data class Product (
val id: Long,
val productName: String,
val price: Long,
val options: List<String>,
val createdAt: Instant,
val productType: ProductType,
val merchantInfo: Map<Int, String>
)
enum class ProductType {
ELECTRONICS,
CLOTHING,
FOOD
}
@Test
fun test() {
val fixtureMonkey = FixtureMonkey.builder()
.plugin(KotlinPlugin())
.build()
val actual: Product = fixtureMonkey.giveMeOne()
actual shouldNotBe null
}
You can create an instance of `Product` without setting any unnecessary properties. All property values are filled in randomly by default.
Fills in multiple properties nicely
However, in most cases specific property values are required. In the example, the `id` was generated as a negative number, but in reality `id` is usually a positive number. There might be validation logic like this:
init {
require(id > 0) { "id should be positive" }
}
If you run the test a few times and the `id` comes out negative, the test fails. The fact that all values are randomly generated is precisely what makes this approach so useful for finding unexpected edge cases.
Let's maintain the randomness but restrict the range slightly to ensure the validation logic passes.
@RepeatedTest(10)
fun postCondition() {
val fixtureMonkey = FixtureMonkey.builder()
.plugin(KotlinPlugin())
.build()
val actual = fixtureMonkey.giveMeBuilder<Product>()
.setPostCondition { it.id > 0 } // Specify property conditions for the generated object
.sample()
actual.id shouldBeGreaterThan 0
}
I used `@RepeatedTest` to run the test 10 times.
You can see that all tests pass.
When using `setPostCondition`, be cautious: if the condition is too narrow, object creation becomes costly, because generation is repeated internally until an object satisfying the condition is produced. In such cases it is much better to use `setExp` to fix specific values.
val actual = fixtureMonkey.giveMeBuilder<Product>()
.setExp(Product::id, 1L) // Only the specified value is fixed, the rest are random
.sample()
actual.id shouldBe 1L
If a property is a collection, you can use `sizeExp` to specify the size of the collection.
val actual = fixtureMonkey.giveMeBuilder<Product>()
.sizeExp(Product::options, 3)
.sample()
actual.options.size shouldBe 3
Using `maxSizeExp` and `minSizeExp`, you can easily set maximum and minimum size constraints for a collection.
val actual = fixtureMonkey.giveMeBuilder<Product>()
.maxSizeExp(Product::options, 10)
.sample()
actual.options.size shouldBeLessThan 11
There are various other property setting methods available, so I recommend exploring them when needed.
Fixture Monkey really resolves the inconveniences encountered while writing unit tests. Although not mentioned in this article, you can create conditions in the builder and reuse them, add randomness to properties, and help developers discover edge cases they may have missed. As a result, test code becomes very concise, and the need for additional code like Object Mother disappears, making maintenance easier.
Even before the release of Fixture Monkey 1.x, I found it very helpful in writing test code. Now that it has become a stable version, I hope you can introduce it without hesitation and enjoy writing test code.