Saturday, February 2, 2013

Using KNIME as an ETL for my datamart

On the process of commenting this post I realized that this could give a nice topic for a post.

First step is to create a new KNIME workspace just for this project. This can be done through the menus
File | Switch Workspace | Other
and choose the destination folder


KNIME will restart with the new workspace loaded.

Open the workbench to start working. 

Next step is to register the JDBC driver for the database system that will be used. This is done through the menus
File | Preferences

And pressing New it's possible to register a JDBC driver to the KNIME workspace.

I'm gonna use SQLite because being portable between Windows and Linux is perfect for my personal needs. And adopted the following folder structure
  • source - for the original data
  • workspace - ETL and staging area (KNIME workspace)
  • database - resulting databases

The data I'm gonna work with is the structure of the Portuguese territory and is available in this  Portuguese Government website.

Personally I prefer to download the data into the source folder in my computer as a safeguard (online data as the tendency to disappear). But KNIME allows us to grab the directly from an URL.

The data dictionary is on the codigos.txt file.

Returning to KNIME, we will use the File Reader node to load the distrito.txt file just drag and drop from the Node Repository into the center window.

Right click the File Reader node to set it's configuration.

We can change the code page of the data in the Advanced button to eliminate the strange characters.

Now the data seems right.

The File Reader node doesn't handle fixed size formats (as far as I know) so, for now, we'll keep the data as one large column and finally change the node name to distrito.txt so it can be easily identified later. Right click on the node and press Execute to run it.
The output of the node can be checked right clicking and selecting File Table.

Now we have to split the data into columns, for that we drag and drop the Cell Splitter By Position node into the center window and link it to the File Reader node.

Following the instructions in data dictionary file (codigos.txt) we have to split the file after the 2nd and 32nd position and give names to the columns.

Right click on the node and press Execute to run it.
The Cell Splitter By Position node has done it's work, but it doesn't remove the original column.

For that we use the Column Filter node.

In the node configuration we remove the original column (Col0) and press OK.

Right click on the node and press Execute to run it.

The final transformation is to convert the ID column to an integer. for that we use the String To Number node.

Press OK and right click on the node and press Execute to run it.

Finally to send this table to a database we use the Database Writer node.

The configuration is made in two steps, first set the database driver & URL and the table name.

It's important to check if you want to append the data to an existing table or simply to overwrite it.
Finally, just review the data types to use. Here I reduced the size of the varchar  columns from 255 to 30 as defined in the data dictionary.

The database was automatically created by KNIME and is ready to be used (almost).

Now we can adapt the previous workflow to treat the remaining tables.

No comments:

Post a Comment