Loading large CSV files is hugely inefficient and slow

un1382561729492r88id · June 22, 2016, 5:59am

Sounds like a great idea.

Steve Donie
Principal Software Engineer
Datical, Inc. http://www.datical.com/

markus.bernhardt · June 22, 2016, 5:59am

Hi,

I’m not complaining about the checksumming part. I know what this is good for and why it is needed.

My problem is the loading of the data itself.

If I understand the code correctly, Liquibase has three stages when loading the data.

All rows are read and created as SqlStatement objects
INSERT statements are created for all SqlStatement objects by a SqlStatementGenerator
The INSERT statements gets executed

Additionally, if I understand it correctly, there are several optimizations built in. In Step 1 for some Databases aggregated SqlStatement objects are created. Later when creating the INSERT statements multi row statements are created if supported.

But there are two problems:

Memory consumption:

Step 1 always loads all rows into memory. Having a table with 100 million rows and 3 GB disk size will need at least 12-16GB heap space. It would be a lot better to have a multi staged, multi threaded system. One thread reads the file, creates the SqlStatement objects and pushes them to a queue. A second thread takes the SqlStatement objects from the queue, generates the INSERT statements and pushes them to a second queue. A third thread takes the INSERT statements from the queue and executes them.

Batched PreparedStatements:

At the moment regular INSERT statements are used. Sometimes with multiple rows. Please correct me, if I’m wrong. It would be much more efficient to use always single row prepared INSERT statements and batch some 10.000 rows together via standard JDBC PreparedStatement.addBatch().

Any opinions on that?

my 2 cent,

markus

Topic		Replies	Views
liquibase bulk upload General Discussion	5	3088	February 15, 2011
load data csv performance General Discussion	3	2202	February 21, 2011
Maximum csv file size for loadData General Discussion	2	2540	March 11, 2015
Loading data from big CSV file is very slow General Discussion	0	644	October 27, 2012
Memory consumption for large files General Discussion	5	1272	October 10, 2013

Loading large CSV files is hugely inefficient and slow

Related topics