Slow performance for LoadUpdateData with postgresql

Hi all,

I face slowness on loadUpdateData with a file of 100’000 lines.
I have an index on the columns use in the where clause.

I see that liquibase is sending a request per row in the DB:

BEGIN
UPDATE INSTRUMENT_PRICE_HISTORY SET CREATED_AT = ‘2020-11-03 00:00:00’, ID = -1560221, PRICE_CURRENCY_ISO_CODE = ‘CHF’, PRICE_QUANTITY = 107.16, PUBLICATION_DATE = ‘2020-11-03 00:00:00’, UPDATED_AT = ‘2020-11-03 00:00:00’ WHERE INSTRUMENT_ID = -156 AND DATE = ‘2020-11-03’;
IF not found THEN
INSERT INTO INSTRUMENT_PRICE_HISTORY (ID, CREATED_AT, UPDATED_AT, INSTRUMENT_ID, DATE, PRICE_QUANTITY, PRICE_CURRENCY_ISO_CODE, PUBLICATION_DATE) VALUES (-1560221, ‘2020-11-03 00:00:00’, ‘2020-11-03 00:00:00’, -156, ‘2020-11-03’, 107.16, ‘CHF’, ‘2020-11-03 00:00:00’);
END IF;
END;

I try to understand why is not grouping the requests and if it’s possible to use an SQL MERGE instead of this statement.
I found this comment in the code:

              // we don't do batch updates for Postgres but we still send as a prepared statement, see LB-744
              // mysql supports batch updates, but the performance vs. the big insert is worse

But I can’t find the issue or information about ‘LB-744’.

There is also a comment that is not aligned with the code here:

        if (rows.stream().anyMatch(LoadDataRowConfig::needsPreparedStatement)) {
            // If we have only prepared statements and the database supports batching, let's roll

The comment should be If we have at least one prepared statements
But instead of going through all the rows, I don’t understand why !preparedStatements.isEmpty()is not sufficient.

I have several questions:

  1. Where can I found the ‘LB-744’ ?
  2. Why not using PreparedStatement with Liquibase ?
  3. Why Liquibase is not using a MERGE Statement ?
  4. Is there a reason to not group requests ?
    Like
BEGIN
UPDATE INSTRUMENT_PRICE_HISTORY SET CREATED_AT = '2020-11-03 00:00:00', ID = -1560221, PRICE_CURRENCY_ISO_CODE = 'CHF', PRICE_QUANTITY = 107.16, PUBLICATION_DATE = '2020-11-03 00:00:00', UPDATED_AT = '2020-11-03 00:00:00' WHERE INSTRUMENT_ID = -156 AND DATE = '2020-11-03';
IF not found THEN
INSERT INTO INSTRUMENT_PRICE_HISTORY (ID, CREATED_AT, UPDATED_AT, INSTRUMENT_ID, DATE, PRICE_QUANTITY, PRICE_CURRENCY_ISO_CODE, PUBLICATION_DATE) VALUES (-1560221, '2020-11-03 00:00:00', '2020-11-03 00:00:00', -156, '2020-11-03', 107.16, 'CHF', '2020-11-03 00:00:00');
END IF;

UPDATE INSTRUMENT_PRICE_HISTORY SET CREATED_AT = '2020-11-03 00:00:00', ID = -1560221, PRICE_CURRENCY_ISO_CODE = 'CHF', PRICE_QUANTITY = 107.16, PUBLICATION_DATE = '2020-11-03 00:00:00', UPDATED_AT = '2020-11-03 00:00:00' WHERE INSTRUMENT_ID = -157 AND DATE = '2020-11-03';
IF not found THEN
INSERT INTO INSTRUMENT_PRICE_HISTORY (ID, CREATED_AT, UPDATED_AT, INSTRUMENT_ID, DATE, PRICE_QUANTITY, PRICE_CURRENCY_ISO_CODE, PUBLICATION_DATE) VALUES (-1560221, '2020-11-03 00:00:00', '2020-11-03 00:00:00', -157, '2020-11-03', 107.16, 'CHF', '2020-11-03 00:00:00');

/* Other statements here too */

END;

Thanks for you help,
Philippe

The loadData/loadUpdateData code definitely needs a good cleaning up. It’s grown through small patches over the years and it’s ended up a lot of odd edge cases and attempts to work around database behaviors that are not well (or incorrectly) documented.

Performance especially is a big question in there. What version of liquibase are you looking at? There has been performance fixes in postgresql over the last couple months already if you are not fully up to date. In particular, I’m maybe remembering that postgreql was found to be surprisingly slow when using prepared statements and so we updated the code to try to use non-prepared statements when possible on postgresql? It was very surprising but very noticable.

To answer some of your questions, LB-744 was from an old and internal bug tracking system we had for a bit which is not available externally.

The reason we’re not grouping statements together more is because Liquibase’s code paths are designed around sending single statements so that we have no differences between update-sql output and update logic. The same reason keeps the code avoiding prepared statements as much as possible (plus maybe postgresql’s odd performance with them). That does hurt loadData’s ability to be more performant though and we are planning on looking into alternatives to that structure, but it will take a relatively large refactoring to support that and it hasn’t made it to the top of the priority list so far.

Merge would work as a statement instead. I don’t remember when merge was added in postgresql, perhaps it wasn’t widely available when the liquibase added loadUpdateData? Is there an advantage to merge vs. update if not found then ?

Nathan