Liquibase 4.0 Parsing Logic

The portion of the Liquibase 4.0 code currently under construction is the changelog parsing. I’ve finished a first pass at implementing and documenting it, and I’m looking for a code review and/or suggestions.

Previously, in Liquibase 4.0:

I re-implemented the Liquibase 3 Change/Statement/Generator classes, along with the snapshot logic and anything else that interacts with external systems, in a new Action/ActionLogic class layout.

The idea is that we have Action classes like CreateTableAction or DropColumnsAction which describe what you want to do, and then ActionLogic classes like CreateTableLogic and DropColumnsLogic that know how to apply that action to a given database.

These Action classes for the most part follow the old Change and Statement classes, but were changed to be more consistent in naming and layout, and to support affecting multiple objects at the same time (like creating multiple columns in one statement) for improved performance.
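
To make the Action/ActionLogic split concrete, here is a minimal sketch. The interface shapes, the `toSql` method, and the generated SQL are assumptions made for illustration and are not the actual liquibase4 signatures; only the class names DropColumnsAction and DropColumnsLogic come from the description above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative sketch only: these signatures are assumptions, not the liquibase4 API.
final class ActionSketch {

    /** Describes what should be done; carries data, no database knowledge. */
    interface Action {}

    /** Knows how to apply one kind of Action to a database. */
    interface ActionLogic<T extends Action> {
        List<String> toSql(T action);
    }

    /** One action can target several columns at once. */
    static final class DropColumnsAction implements Action {
        final String tableName;
        final List<String> columnNames;

        DropColumnsAction(String tableName, List<String> columnNames) {
            this.tableName = tableName;
            this.columnNames = columnNames;
        }
    }

    /** Emits a single ALTER TABLE statement covering every column (generic SQL assumed). */
    static final class DropColumnsLogic implements ActionLogic<DropColumnsAction> {
        @Override
        public List<String> toSql(DropColumnsAction action) {
            StringBuilder sql = new StringBuilder("ALTER TABLE ").append(action.tableName);
            String separator = " ";
            for (String column : action.columnNames) {
                sql.append(separator).append("DROP COLUMN ").append(column);
                separator = ", ";
            }
            return Collections.singletonList(sql.toString());
        }
    }

    public static void main(String[] args) {
        DropColumnsAction action =
                new DropColumnsAction("person", Arrays.asList("age", "nickname"));
        System.out.println(new DropColumnsLogic().toSql(action));
    }
}
```

The point of the split is that the Action stays a plain description, so a different database could get its own ActionLogic without the Action changing.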

Changelog Parsing Requirements

These Action classes can be created and run directly, like I do in testing, but the main use is creating them from changelog files. We need a way to take various changelog formats (xml, sql, json, yaml) and create an object structure describing the ChangeLog, its ChangeSets, and the Actions to run as part of each ChangeSet.

In 3.x, the parsing code was relatively XML-specific and there was duplication between the XML parsing and other parsers. We also ran into problems where functionality was in one parser but not another; for example, the formatted SQL parser was missing changelog parameter functionality.

The parsing logic needs to be extensible, so additional parsers can be created and standard parser logic can be adjusted.

Changelog Parsing Process

There are four stages to the parsing process:

  1. PARSE: For a given file, find the best Parser plugin implementation that supports it. The job of the Parser is to convert the text of the file into a ParsedNode object structure with no translation business logic in it: just a direct mapping from the text format to a structured/tokenized representation of it.
  2. PREPROCESS: The ParsedNode structure created by the parser is run through all the configured Preprocessor plugins. The job of the preprocessors is to massage the original ParsedNode into something closer to the final object structure (in this case, the ChangeLog and its ChangeSets and Actions).
  3. MAPPING: Once all the Preprocessors have run, the best Mapping plugin is found for the given target object and its sub-fields, and that class is used to translate the ParsedNode structure into the final object. There should be little to no business logic in this class; its job is simply to direct-map the ParsedNode structure into an object structure.
  4. POSTPROCESSING: After the object mapping is done, there is a step for any final fixup that needs to be done in the completed objects.
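
The four stages can be sketched roughly as follows. This is a simplified illustration, not the real code: the interface names mirror the stage names, but every method signature is an assumption, "best" plugin selection is reduced to "first one that supports the file", and the `demo` method uses a made-up one-id-per-line text format purely to exercise the flow.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative sketch only: all signatures here are assumptions, not the liquibase4 API.
final class ParsePipelineSketch {

    /** Format-neutral tree produced by a Parser. */
    static final class ParsedNode {
        final String name;
        final String value;
        final List<ParsedNode> children = new ArrayList<>();

        ParsedNode(String name, String value) {
            this.name = name;
            this.value = value;
        }
    }

    interface Parser {
        boolean supports(String path);
        ParsedNode parse(String text);
    }
    interface Preprocessor { void process(ParsedNode root); }
    interface Mapping<T> { T toObject(ParsedNode root); }
    interface Postprocessor<T> { void process(T object); }

    static <T> T run(String path, String text, List<Parser> parsers,
                     List<Preprocessor> preprocessors, Mapping<T> mapping,
                     List<Postprocessor<T>> postprocessors) {
        // 1. PARSE: pick a parser (simplified to the first match) and build the node tree.
        ParsedNode root = null;
        for (Parser parser : parsers) {
            if (parser.supports(path)) {
                root = parser.parse(text);
                break;
            }
        }
        if (root == null) {
            throw new IllegalArgumentException("No parser supports " + path);
        }
        // 2. PREPROCESS: every preprocessor massages the same format-neutral tree.
        for (Preprocessor preprocessor : preprocessors) {
            preprocessor.process(root);
        }
        // 3. MAPPING: straight translation of the tree into the target object.
        T result = mapping.toObject(root);
        // 4. POSTPROCESSING: final fixups on the completed object.
        for (Postprocessor<T> postprocessor : postprocessors) {
            postprocessor.process(result);
        }
        return result;
    }

    /** Toy run: one changeSet id per line, blank lines dropped, ids sorted at the end. */
    static List<String> demo() {
        Parser lineParser = new Parser() {
            @Override public boolean supports(String path) { return path.endsWith(".txt"); }
            @Override public ParsedNode parse(String text) {
                ParsedNode root = new ParsedNode("changeLog", null);
                for (String line : text.split("\n")) {
                    root.children.add(new ParsedNode("changeSet", line.trim()));
                }
                return root;
            }
        };
        Preprocessor dropBlank = root -> root.children.removeIf(n -> n.value.isEmpty());
        Mapping<List<String>> toIds = root -> {
            List<String> ids = new ArrayList<>();
            for (ParsedNode child : root.children) {
                ids.add(child.value);
            }
            return ids;
        };
        Postprocessor<List<String>> sortIds = ids -> Collections.sort(ids);
        return run("demo.txt", "b\n\na", Collections.singletonList(lineParser),
                Collections.singletonList(dropBlank), toIds,
                Collections.singletonList(sortIds));
    }
}
```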

Process Details

As mentioned, the Parser and Mapping have little to no logic in them beyond straight mapping. Instead, all the “I know what the user means” logic needs to go into the Preprocessors. Examples include:

  • if you specify catalogName, schemaName and tableName, you actually mean a “table” node with a name and schema sub-nodes. 

  • the field “quotechar” is the same as “quoteString”

  • the value of ${schemaName} needs to be expanded to MY_SCHEMA

  • etc.

Because the preprocessors are used against the ParsedNode structure regardless of the original format, these rules now apply consistently no matter how the files were originally parsed. 
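
As a rough illustration of two of the rules listed above (the quotechar alias and the ${schemaName} expansion), preprocessors might look something like this. The `Node` class, the `Preprocessor` interface, and both class names are hypothetical stand-ins rather than the real liquibase4 types, and attributes are flattened into a simple map to keep the sketch short.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only: Node and Preprocessor are hypothetical stand-ins.
final class PreprocessorSketch {

    /** Simplified parsed node: a name plus flat string attributes. */
    static final class Node {
        final String name;
        final Map<String, String> attributes = new LinkedHashMap<>();

        Node(String name) {
            this.name = name;
        }
    }

    interface Preprocessor { void process(Node node); }

    /** Treats the field "quotechar" as an alias for "quoteString". */
    static final class AliasPreprocessor implements Preprocessor {
        @Override
        public void process(Node node) {
            String value = node.attributes.remove("quotechar");
            if (value != null) {
                node.attributes.put("quoteString", value);
            }
        }
    }

    /** Expands ${name} references using the supplied changelog parameters. */
    static final class ParameterPreprocessor implements Preprocessor {
        private static final Pattern PARAMETER = Pattern.compile("\\$\\{([^}]+)\\}");
        private final Map<String, String> parameters;

        ParameterPreprocessor(Map<String, String> parameters) {
            this.parameters = parameters;
        }

        @Override
        public void process(Node node) {
            for (Map.Entry<String, String> attribute : node.attributes.entrySet()) {
                Matcher matcher = PARAMETER.matcher(attribute.getValue());
                StringBuffer expanded = new StringBuffer();
                while (matcher.find()) {
                    String replacement = parameters.get(matcher.group(1));
                    if (replacement == null) {
                        replacement = matcher.group(); // unknown parameter: leave as written
                    }
                    matcher.appendReplacement(expanded, Matcher.quoteReplacement(replacement));
                }
                matcher.appendTail(expanded);
                attribute.setValue(expanded.toString());
            }
        }
    }
}
```

Because both classes only ever see the node structure, the same rules would apply whether the node came from XML, YAML, or anything else.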

The preprocessors are normally run in no guaranteed order, but each preprocessor can specify other preprocessors that it must run before and/or after, and Liquibase will create an order that satisfies those dependencies.
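
One common way to build such an order is a topological sort. The sketch below uses Kahn's algorithm; it is my illustration of the idea, not necessarily how the liquibase4 code does it, and the `Spec` class and the preprocessor names in the usage note are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: Spec is a hypothetical description of one preprocessor.
final class PreprocessorOrderSketch {

    static final class Spec {
        final String name;
        final List<String> mustRunBefore; // names this preprocessor has to precede
        final List<String> mustRunAfter;  // names this preprocessor has to follow

        Spec(String name, List<String> mustRunBefore, List<String> mustRunAfter) {
            this.name = name;
            this.mustRunBefore = mustRunBefore;
            this.mustRunAfter = mustRunAfter;
        }
    }

    /** Returns names in an order that satisfies every before/after constraint. */
    static List<String> order(List<Spec> specs) {
        Map<String, Set<String>> successors = new LinkedHashMap<>();
        Map<String, Integer> inDegree = new LinkedHashMap<>();
        for (Spec spec : specs) {
            successors.put(spec.name, new LinkedHashSet<String>());
            inDegree.put(spec.name, 0);
        }
        for (Spec spec : specs) {
            for (String later : spec.mustRunBefore) {
                addEdge(successors, inDegree, spec.name, later);
            }
            for (String earlier : spec.mustRunAfter) {
                addEdge(successors, inDegree, earlier, spec.name);
            }
        }
        // Kahn's algorithm: repeatedly emit a preprocessor with no unmet prerequisites.
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> entry : inDegree.entrySet()) {
            if (entry.getValue() == 0) {
                ready.add(entry.getKey());
            }
        }
        List<String> ordered = new ArrayList<>();
        while (!ready.isEmpty()) {
            String next = ready.remove();
            ordered.add(next);
            for (String successor : successors.get(next)) {
                int remaining = inDegree.get(successor) - 1;
                inDegree.put(successor, remaining);
                if (remaining == 0) {
                    ready.add(successor);
                }
            }
        }
        if (ordered.size() != specs.size()) {
            throw new IllegalStateException("Cyclic preprocessor dependencies");
        }
        return ordered;
    }

    private static void addEdge(Map<String, Set<String>> successors,
                                Map<String, Integer> inDegree, String from, String to) {
        if (!successors.containsKey(from) || !successors.containsKey(to)) {
            return; // constraint refers to a preprocessor that is not installed
        }
        if (successors.get(from).add(to)) {
            inDegree.put(to, inDegree.get(to) + 1);
        }
    }
}
```

For example, if a hypothetical "nestTable" preprocessor must run after "renameAliases", which in turn must run after "expandParameters", `order` returns them as expandParameters, renameAliases, nestTable regardless of the order they were registered in. Contradictory constraints show up as a cycle and are rejected rather than silently ignored.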

These preprocessors can also be used to support multiple versions of the changelog format. For example, the 3.x changelog format will be different from the 4.x format, but we can write preprocessors to handle/translate those differences.

Looking at the Code

The best place to see the preprocessor code is in the cli branch of the liquibase4 repository: https://github.com/liquibase/liquibase4/tree/cli

Please take a look and let me know what questions or concerns you have with the new logic. 

The current code should handle most of the changes in a 3.x-format XML file. It does not handle other formats yet, nor preconditions, rollback blocks, or includes.

My next steps are cleaning up and better testing the parsing logic in the “cli” branch before merging it all back into liquibase4 master and moving on to the next portion of the 4.0 codebase.

Nathan