
4 posts tagged with "postgresql"


How SELECT FOR UPDATE Works

· 6 min read


In PostgreSQL, the FOR UPDATE lock is used to explicitly lock rows in a table while performing a SELECT query within a transaction. This lock mode is typically used to ensure that the selected rows do not change until the transaction is completed, preventing other transactions from modifying or locking these rows in a conflicting manner.

For example, it can be used to prevent other customers from changing data while a specific customer is going through the ticket booking process.
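As a quick illustration of the basic usage, the lock is taken inside a transaction and held until commit or rollback; the table and values below are made up for this sketch:

BEGIN;

-- lock the row for the ticket being booked; other transactions that try to
-- UPDATE, DELETE, or SELECT ... FOR UPDATE this row will block until we finish
SELECT *
FROM ticket
WHERE id = 1
FOR UPDATE;

-- ... booking logic ...

COMMIT;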

The cases we will examine in this article are somewhat special:

  • How does select for update behave if there is a mix of locked reads and unlocked reads?
  • If a lock is used initially, is it possible for other transactions to read?
  • Can consistent reading of data be guaranteed even if reading methods are mixed?

In PostgreSQL, the select for update clause operates differently depending on the transaction isolation level. Therefore, it is necessary to examine how it behaves at each isolation level.

Let’s assume a scenario where the following data exists and is about to be modified.

id | name
1  | null

Optimizing Pagination in Spring Batch with Composite Keys

· 7 min read

In this article, I will discuss the issues and solutions encountered when querying a table with millions of rows using Spring Batch.

Environment

  • Spring Batch 5.0.1
  • PostgreSQL 11

Problem

While using JdbcPagingItemReader to query a large table, I noticed a significant slowdown in query performance over time and decided to investigate the code in detail.

Default Behavior

The following query is automatically generated and executed by the PagingQueryProvider:

SELECT *
FROM large_table
WHERE id > ?
ORDER BY id
LIMIT 1000;

When using JdbcPagingItemReader, Spring Batch paginates with a where clause instead of an offset. This allows data to be retrieved quickly, without slowdown, even from tables with millions of records.

tip

Even with LIMIT, using OFFSET means all of the preceding rows are read again, so performance degrades as the amount of data already read grows. For more information, refer to the article1.
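To make the difference concrete, here is a sketch of the two styles; the literal values are placeholders:

-- offset-based paging: the first 1,000,000 rows are read and discarded on every page
SELECT * FROM large_table ORDER BY id LIMIT 1000 OFFSET 1000000;

-- where-clause (keyset) paging, as generated above: the index jumps straight past the last id of the previous page
SELECT * FROM large_table WHERE id > 1000000 ORDER BY id LIMIT 1000;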

Using Multiple Sorting Conditions

The problem arises when querying a table with composite keys. When a composite key consisting of 3 columns is used as the sort key, the generated query looks like this:

SELECT *
FROM large_table
WHERE ((create_at > ?) OR
       (create_at = ? AND user_id > ?) OR
       (create_at = ? AND user_id = ? AND content_no > ?))
ORDER BY create_at, user_id, content_no
LIMIT 1000;

However, queries with OR conditions in the where clause do not use indexes effectively. Because each OR branch has to be evaluated, it is hard for the optimizer to plan an efficient index access. When I examined the explain output, I found the following:

Limit  (cost=0.56..1902.12 rows=1000 width=327) (actual time=29065.549..29070.808 rows=1000 loops=1)
  ->  Index Scan using th_large_table_pkey on large_table  (cost=0.56..31990859.76 rows=16823528 width=327) (actual time=29065.547..29070.627 rows=1000 loops=1)
        Filter: ((create_at > '2023-01-28 06:58:13'::timestamp without time zone) OR ((create_at = '2023-01-28 06:58:13'::timestamp without time zone) AND ((user_id)::text > '441997000'::text)) OR ((create_at = '2023-01-28 06:58:13'::timestamp without time zone) AND ((user_id)::text = '441997000'::text) AND ((content_no)::text > '9070711'::text)))
        Rows Removed by Filter: 10000001
Planning Time: 0.152 ms
Execution Time: 29070.915 ms

With a query execution time close to 30 seconds, most of the rows read through the index are thrown away by the filter, so almost all of that time is wasted.

Since PostgreSQL can compare composite keys as tuples (row values), rewriting the condition as a tuple comparison lets even this complex where clause be served by an index scan: (a, b) > (x, y) is true when a > x, or when a = x and b > y, which is exactly the paging condition above.

SELECT *
FROM large_table
WHERE (create_at, user_id, content_no) > (?, ?, ?)
ORDER BY create_at, user_id, content_no
LIMIT 1000;

Limit  (cost=0.56..1196.69 rows=1000 width=327) (actual time=3.204..11.393 rows=1000 loops=1)
  ->  Index Scan using th_large_table_pkey on large_table  (cost=0.56..20122898.60 rows=16823319 width=327) (actual time=3.202..11.297 rows=1000 loops=1)
        Index Cond: (ROW(create_at, (user_id)::text, (content_no)::text) > ROW('2023-01-28 06:58:13'::timestamp without time zone, '441997000'::text, '9070711'::text))
Planning Time: 0.276 ms
Execution Time: 11.475 ms

This time the rows are fetched directly through the index condition, with no rows discarded by a filter.

Therefore, if the query executed by JdbcPagingItemReader uses a tuple comparison, pagination stays fast even when a composite key is used as the sort key.

Let's dive into the code immediately.

Modifying PagingQueryProvider

Analysis

As mentioned earlier, the responsibility of generating queries lies with the PagingQueryProvider. Since I am using PostgreSQL, the PostgresPagingQueryProvider is selected and used.

The generated query differs depending on whether it includes a group by clause.

By examining SqlPagingQueryUtils's buildSortConditions, we can see how the problematic query is generated.


Within the nested for loop, we can see how the query is generated based on the sort key.

Customizing buildSortConditions

Having inspected the code responsible for query generation, I decided to modify it to get the desired behavior. However, this code cannot be overridden directly, so I created a new class called PostgresOptimizingQueryProvider and re-implemented the logic there.

private String buildSortConditions(StringBuilder sql) {
    Map<String, Order> sortKeys = getSortKeys();
    sql.append("(");
    sortKeys.keySet().forEach(key -> sql.append(key).append(", "));
    sql.delete(sql.length() - 2, sql.length());
    if (is(sortKeys, order -> order == Order.ASCENDING)) {
        sql.append(") > (");
    } else if (is(sortKeys, order -> order == Order.DESCENDING)) {
        sql.append(") < (");
    } else {
        throw new IllegalStateException("Cannot mix ascending and descending sort keys"); // limitation of tuple comparison
    }
    sortKeys.keySet().forEach(key -> sql.append("?, "));
    sql.delete(sql.length() - 2, sql.length());
    sql.append(")");
    return sql.toString();
}

Test Code

To verify that the newly implemented part works correctly, I validated it with a test.

@Test
@DisplayName("The where clause generated instead of an offset is (create_at, user_id, content_no) > (?, ?, ?).")
void test() {
    // given
    PostgresOptimizingQueryProvider queryProvider = new PostgresOptimizingQueryProvider();
    queryProvider.setSelectClause("*");
    queryProvider.setFromClause("large_table");

    Map<String, Order> parameterMap = new LinkedHashMap<>();
    parameterMap.put("create_at", Order.ASCENDING);
    parameterMap.put("user_id", Order.ASCENDING);
    parameterMap.put("content_no", Order.ASCENDING);
    queryProvider.setSortKeys(parameterMap);

    // when
    String firstQuery = queryProvider.generateFirstPageQuery(10);
    String secondQuery = queryProvider.generateRemainingPagesQuery(10);

    // then
    assertThat(firstQuery).isEqualTo("SELECT * FROM large_table ORDER BY create_at ASC, user_id ASC, content_no ASC LIMIT 10");
    assertThat(secondQuery).isEqualTo("SELECT * FROM large_table WHERE (create_at, user_id, content_no) > (?, ?, ?) ORDER BY create_at ASC, user_id ASC, content_no ASC LIMIT 10");
}


The successful execution confirms that it is working as intended, and I proceeded to run the batch.


Guy: "Is it over?"

Boy: "Shut up, it'll happen again!"

-- from the webtoon Hive

However, an out of range error occurred, indicating that the changed query was not being reflected in how the parameters were bound.


Changing the query does not automatically change how the parameters are injected, so let's debug again and find where they are set.

JdbcOptimizedPagingItemReader

The parameters are created directly by JdbcPagingItemReader: its getParameterList method iterates over the sort key values and expands them into the larger parameter list required by the original OR-based where clause.


I thought I could just override this method, but unfortunately it is not possible because it is private. After much thought, I copied the entire JdbcPagingItemReader and modified only the getParameterList part.

The getParameterList method is overridden in JdbcOptimizedPagingItemReader as follows:

private List<Object> getParameterList(Map<String, Object> values, Map<String, Object> sortKeyValue) {
    // ...
    // Return the where-clause parameters as-is, without expanding them per sort key.
    return new ArrayList<>(sortKeyValue.values());
}

Since sortKeyValue no longer needs to be expanded, its values are simply placed in the parameter list and returned.

Now, let's run the batch again.

The first query is executed without requiring parameters,

2023-03-13T17:43:14.240+09:00 DEBUG 70125 --- [           main] o.s.jdbc.core.JdbcTemplate               : Executing SQL query [SELECT * FROM large_table ORDER BY create_at ASC, user_id ASC, content_no ASC LIMIT 2000]

and subsequent query executions receive their parameters from the last row of the previous page.

2023-03-13T17:43:14.253+09:00 DEBUG 70125 --- [           main] o.s.jdbc.core.JdbcTemplate               : Executing prepared SQL statement [SELECT * FROM large_table WHERE (create_at, user_id, content_no) > (?, ?, ?) ORDER BY create_at ASC, user_id ASC, content_no ASC LIMIT 2000]

The queries are executed exactly as intended! 🎉

For pagination processing with over 10 million records, queries that used to take around 30 seconds now run in the range of 0.1 seconds, representing a significant performance improvement of nearly 300 times.


Now, regardless of the amount of data, queries can be read within milliseconds without worrying about performance degradation. 😎

Conclusion

In this article, I introduced a method for optimizing Spring Batch pagination in an environment with composite keys. However, there is a drawback: all columns that make up the composite key must share the same sort direction. If asc and desc are mixed among the sort keys, a tuple comparison cannot express the condition, and a separate index is needed to work around it 😢

Let's summarize today's content in one line and conclude the article.

"Avoid using composite keys as much as possible and use surrogate keys unrelated to the business."


Footnotes

  1. https://jojoldu.tistory.com/528

Caution when setting sortKeys in JdbcItemReader

· 4 min read

I would like to share an issue I encountered while retrieving large amounts of data in PostgreSQL.

Problem

While using the Spring Batch JdbcPagingItemReader, I set the sortKeys as follows:

...
.selectClause("SELECT *")
.fromClause("FROM big_partitioned_table_" + yearMonth)
.sortKeys(Map.of(
        "timestamp", Order.ASCENDING,
        "mmsi", Order.ASCENDING,
        "imo_no", Order.ASCENDING
))
...

The table has a composite index on timestamp, mmsi, and imo_no, so I expected an Index scan to be used during retrieval. In reality, however, a Seq scan occurred. The target table holds around 200 million rows, so the batch showed no sign of finishing and I eventually had to shut it down forcibly. Why did a Seq scan occur even though the query used the indexed columns? 🤔

In PostgreSQL, Seq scans occur in the following cases:

  • When the optimizer determines that a Seq scan is faster due to the table having a small amount of data
  • When the data being queried is too large (more than 10% of the table), and the optimizer deems Index scan less efficient than Seq scan
    • In such cases, you can use limit to adjust the amount of data and execute an Index scan

In this case, since select * was used, a Seq scan was possible because of the amount of data being read. However, the query runs with a limit equal to the chunk size, so I expected an Index scan to be chosen every time.

Debugging

To identify the exact cause, let's check the actual query being executed. By slightly modifying the YAML configuration, we can observe the queries executed by the JdbcPagingItemReader.

logging:
  level:
    org.springframework.jdbc.core.JdbcTemplate: DEBUG

I reran the batch process to directly observe the queries.

SELECT * FROM big_partitioned_table_202301 ORDER BY imo_no ASC, mmsi ASC, timestamp ASC LIMIT 1000

The order of the order by clause seemed odd, so I ran it again.

SELECT * FROM big_partitioned_table_202301 ORDER BY timestamp ASC, mmsi ASC, imo_no ASC LIMIT 1000

It was evident that the order by condition was changing with each execution.

For an Index scan to be used, the sort conditions must be specified in the index's column order. Passing a plain Map to sortKeys does not guarantee iteration order, so the SQL was not generated as intended.

To maintain order, you can use a LinkedHashMap to create the sortKeys.

Map<String, Order> sortKeys = new LinkedHashMap<>();
sortKeys.put("timestamp", Order.ASCENDING);
sortKeys.put("mmsi", Order.ASCENDING);
sortKeys.put("imo_no", Order.ASCENDING);

After making this adjustment and rerunning the batch, we could confirm that the sorting conditions were specified in the correct order.

SELECT * FROM big_partitioned_table_202301 ORDER BY timestamp ASC, mmsi ASC, imo_no ASC LIMIT 1000

Conclusion

The issue of Seq scan occurring instead of an Index scan was not something that could be verified with the application's test code, so we were unaware of any potential bugs. It was only when we observed a significant slowdown in the batch process in the production environment that we realized something was amiss. During development, I had not anticipated that the order of sorting conditions could change due to the Map data structure.

Fortunately, because a huge amount of data was being read, the missing Index scan slowed the LIMIT query down dramatically, which made the issue easy to notice. Had the data volume been small and the Index scan and Seq scan similar in speed, it might have taken a long time to spot the problem.

Further consideration is needed on how to anticipate and address this issue in advance. Since the order of the order by condition is often crucial, it is advisable to use LinkedHashMap over HashMap whenever possible.

[Spring Batch] Implementing Custom Constraint Writer

· 4 min read

Situation 🧐

Recently, I designed a batch process that relies on Upsert in PostgreSQL. During implementation, a change in business requirements meant I had to add another column to a composite unique constraint.

The problem is that a composite unique constraint does not prevent duplicates when one of its columns is null.

Let's take a look at an example of the problematic situation.

create table student
(
    id    integer not null
        constraint student_pk
            primary key,
    name  varchar,
    major varchar,
    constraint student_unique
        unique (name, major)
);
id | name | major
1  | song | korean
2  | kim  | english
3  | park | math
4  | kim  | NULL
5  | kim  | NULL
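Rows 4 and 5 can coexist because, in SQL, one NULL is never considered equal to another, so the unique constraint does not see them as duplicates. A minimal sketch of the behavior, using the table above:

INSERT INTO student (id, name, major) VALUES (4, 'kim', NULL); -- accepted
INSERT INTO student (id, name, major) VALUES (5, 'kim', NULL); -- also accepted: NULL is not equal to NULL,
                                                               -- so student_unique (name, major) is not violated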

To avoid the null duplicates, the idea of inserting dummy data instead of null naturally came to mind, but I was reluctant to store meaningless data in the database. In particular, if the nullable column holds values such as UUIDs, the meaningless dummy values would be very hard to tell apart from the real ones.

Although it takes a bit more work, a unique partial index lets us prevent these duplicates without inserting dummy data. I decided to pursue the most ideal solution, even if it is more challenging.

Solution

Partial Index

CREATE UNIQUE INDEX stu_2col_uni_idx ON student (name, major)
WHERE major IS NOT NULL;

CREATE UNIQUE INDEX stu_1col_uni_idx ON student (name)
WHERE major IS NULL;

PostgreSQL provides the functionality of partial indexes.

Partial index: an index built only over the rows that satisfy a given condition. Narrowing the scope of the index keeps it efficient to create and maintain.

When a row is inserted with only a name, stu_1col_uni_idx allows just one row with that name whose major is null. With these two complementary indexes, duplicates involving a null column are neatly prevented.

An error occurs when trying to store a second value without a major.
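A small sketch of what the two partial indexes enforce; the inserted values are made up for illustration:

INSERT INTO student (id, name, major) VALUES (6, 'lee', 'art'); -- checked against stu_2col_uni_idx (major IS NOT NULL)
INSERT INTO student (id, name)        VALUES (7, 'lee');        -- checked against stu_1col_uni_idx (major IS NULL)
INSERT INTO student (id, name)        VALUES (8, 'lee');        -- fails: duplicate key value violates stu_1col_uni_idx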

However, with two unique indexes like this, an Upsert can target only one constraint per statement, so the batch did not run as intended.

After much deliberation, I decided to check whether the value is null before executing the SQL, and then run whichever statement matches that case.

Implementing SelectConstraintWriter

public class SelectConstraintWriter extends JdbcBatchItemWriter<Student> {

    @Setter
    private String anotherSql;

    @Override
    public void write(List<? extends Student> items) {
        if (items.isEmpty()) {
            return;
        }

        List<? extends Student> existMajorStudents = items.stream()
                .filter(student -> student.getMajor() != null)
                .collect(toList());

        List<? extends Student> nullMajorStudents = items.stream()
                .filter(student -> student.getMajor() == null)
                .collect(toList());

        executeSql(existMajorStudents, sql);
        executeSql(nullMajorStudents, anotherSql);
    }

    private void executeSql(List<? extends Student> students, String sql) {
        if (logger.isDebugEnabled()) {
            logger.debug("Executing batch with " + students.size() + " items.");
        }

        int[] updateCounts;

        if (usingNamedParameters) {
            if (this.itemSqlParameterSourceProvider == null) {
                updateCounts = namedParameterJdbcTemplate.batchUpdate(sql, students.toArray(new Map[students.size()]));
            } else {
                SqlParameterSource[] batchArgs = new SqlParameterSource[students.size()];
                int i = 0;
                for (Student item : students) {
                    batchArgs[i++] = itemSqlParameterSourceProvider.createSqlParameterSource(item);
                }
                updateCounts = namedParameterJdbcTemplate.batchUpdate(sql, batchArgs);
            }
        } else {
            updateCounts = namedParameterJdbcTemplate.getJdbcOperations().execute(sql,
                    (PreparedStatementCallback<int[]>) ps -> {
                        for (Student item : students) {
                            itemPreparedStatementSetter.setValues(item, ps);
                            ps.addBatch();
                        }
                        return ps.executeBatch();
                    });
        }

        if (assertUpdates) {
            for (int i = 0; i < updateCounts.length; i++) {
                int value = updateCounts[i];
                if (value == 0) {
                    throw new EmptyResultDataAccessException("Item " + i + " of " + updateCounts.length
                            + " did not update any rows: [" + students.get(i) + "]", 1);
                }
            }
        }
    }
}

I implemented this by overriding the write method of the JdbcBatchItemWriter that was used before. By checking whether major is present and choosing the matching SQL, the Upsert statements work correctly instead of throwing a DuplicateKeyException.

Here is an example of usage:

@Bean
SelectConstraintWriter studentItemWriter() {
    String sql1 =
            "INSERT INTO student(id, name, major) "
                    + "VALUES (nextval('hibernate_sequence'), :name, :major) "
                    + "ON CONFLICT (name, major) WHERE major IS NOT NULL "
                    + "DO UPDATE "
                    + "SET name = :name, "
                    + "    major = :major";

    String sql2 =
            "INSERT INTO student(id, name, major) "
                    + "VALUES (nextval('hibernate_sequence'), :name, :major) "
                    + "ON CONFLICT (name) WHERE major IS NULL "
                    + "DO UPDATE "
                    + "SET name = :name, "
                    + "    major = :major";

    SelectConstraintWriter writer = new SelectConstraintWriter();
    writer.setSql(sql1);
    writer.setAnotherSql(sql2);
    writer.setDataSource(dataSource);
    writer.setItemSqlParameterSourceProvider(new BeanPropertyItemSqlParameterSourceProvider<>());
    writer.afterPropertiesSet();
    return writer;
}

Conclusion

It is a pity that PostgreSQL does not allow an Upsert to check more than one constraint; if it did, none of this would have been necessary. I hope a future version adds it.

