Loading large CSV files is hugely inefficient and slow

Sounds like a great idea. 

Steve Donie
Principal Software Engineer
Datical, Inc. http://www.datical.com/


I’m not complaining about the checksumming part. I know what this is good for and why it is needed.

My problem is the loading of the data itself.

If I understand the code correctly, Liquibase has three stages when loading the data.

  • All rows are read and turned into SqlStatement objects
  • INSERT statements are created for all SqlStatement objects by a SqlStatementGenerator
  • The INSERT statements are executed
Additionally, if I understand it correctly, there are several optimizations built in. In Step 1, aggregated SqlStatement objects are created for some databases, and later, when the INSERT statements are generated, multi-row statements are produced where the database supports them.
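For readers unfamiliar with the multi-row optimization: it amounts to collapsing several rows into a single INSERT. A rough sketch of what such a generator does (table and column names are made up for illustration, this is not the actual Liquibase code):

```java
import java.util.List;
import java.util.stream.Collectors;

public class MultiRowInsert {
    // Builds e.g. "INSERT INTO t (a, b) VALUES (1, 'x'), (2, 'y')"
    // from a list of pre-formatted value rows.
    static String build(String table, List<String> columns, List<List<String>> rows) {
        String cols = String.join(", ", columns);
        String values = rows.stream()
                .map(r -> "(" + String.join(", ", r) + ")")
                .collect(Collectors.joining(", "));
        return "INSERT INTO " + table + " (" + cols + ") VALUES " + values;
    }
}
```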

But there are two problems:

Memory consumption:

Step 1 always loads all rows into memory. A table with 100 million rows and 3 GB on disk will need at least 12-16 GB of heap space. It would be a lot better to have a multi-stage, multi-threaded system: one thread reads the file, creates the SqlStatement objects, and pushes them onto a queue; a second thread takes the SqlStatement objects from that queue, generates the INSERT statements, and pushes them onto a second queue; a third thread takes the INSERT statements from that queue and executes them.
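The three-thread pipeline I have in mind could be sketched like this with standard java.util.concurrent. The class and method names are mine, not Liquibase's, and statements are modeled as plain strings; the point is that bounded queues cap how many rows are in flight, so heap usage stays constant regardless of file size:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class CsvPipeline {
    private static final String POISON = "\u0000EOF"; // sentinel marking end of stream

    public static List<String> run(List<String> csvRows) throws InterruptedException {
        // Bounded queues: a full queue blocks the producer, capping memory use.
        BlockingQueue<String> stmtQueue = new ArrayBlockingQueue<>(1000);
        BlockingQueue<String> sqlQueue = new ArrayBlockingQueue<>(1000);
        List<String> executed = new ArrayList<>();

        // Stage 1: read rows and create "SqlStatement" objects (plain strings here)
        Thread reader = new Thread(() -> {
            try {
                for (String row : csvRows) stmtQueue.put(row);
                stmtQueue.put(POISON);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        // Stage 2: generate INSERT statements from the queued objects
        Thread generator = new Thread(() -> {
            try {
                String row;
                while (!(row = stmtQueue.take()).equals(POISON))
                    sqlQueue.put("INSERT INTO t VALUES (" + row + ")");
                sqlQueue.put(POISON);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        // Stage 3: "execute" the statements (collected here instead of hitting a DB)
        Thread executor = new Thread(() -> {
            try {
                String sql;
                while (!(sql = sqlQueue.take()).equals(POISON)) executed.add(sql);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        reader.start(); generator.start(); executor.start();
        reader.join(); generator.join(); executor.join();
        return executed;
    }
}
```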

Batched PreparedStatements:

At the moment regular INSERT statements are used, sometimes with multiple rows per statement (please correct me if I'm wrong). It would be much more efficient to always use single-row prepared INSERT statements and batch around 10,000 rows together via the standard JDBC PreparedStatement.addBatch().
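Concretely, I mean something like the following with plain JDBC (table and column names are hypothetical, and the helper is my own sketch, not a Liquibase API):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class BatchedLoader {
    static final int BATCH_SIZE = 10_000;

    // Number of executeBatch() calls for a given row count:
    // one per full batch plus one flush for any remainder.
    static int batchCount(int rowCount) {
        return rowCount / BATCH_SIZE + (rowCount % BATCH_SIZE > 0 ? 1 : 0);
    }

    // Single-row prepared INSERT, batched via addBatch()/executeBatch().
    static void load(Connection conn, Iterable<String[]> rows) throws SQLException {
        try (PreparedStatement ps =
                 conn.prepareStatement("INSERT INTO my_table (id, name) VALUES (?, ?)")) {
            int pending = 0;
            for (String[] row : rows) {
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.addBatch();
                if (++pending == BATCH_SIZE) { ps.executeBatch(); pending = 0; }
            }
            if (pending > 0) ps.executeBatch(); // flush the remainder
        }
    }
}
```

The driver can then send each batch in one round trip, which for many databases is dramatically faster than issuing individual INSERT statements.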

Any opinions on that?

my 2 cents,