A Poor Man's ETL Tool

Some days it seems that all we do is move data around. We take it from “here”, manipulate it, reformat it and send it over “there” for display or more analysis. ETL is short for Extract, Transform and Load. There are lots of tools, especially in data warehousing, that help us with these kinds of tasks. What you can accomplish with these tools using a drag and drop interface is truly amazing.

CloverETL

One of my favorite ETL tools is CloverETL. I love this tool. It has all the bells and whistles. The tool is intuitive to use. It’s really quick to get useful work accomplished with it. Some notable CloverETL features include :

  • Fetching data over the web
  • Multitude of database connection options
  • Connectors for Excel/CSV etc.
  • Debugging
  • Data Validation
  • Logging
  • Scheduling / Jobs
  • Rapid Prototyping
  • Custom Coding for Tasks

Purchasing a commercial flavor of CloverETL is an investment. The Community Edition keeps costs low, but there are some limitations with this edition :

  • The number of available readers, writers and transformers is limited
  • This edition only supports 20 items per graph
  • You can’t save a graph with a higher number of items
  • A potential workaround is to :
    • Create multiple projects and organize major tasks around these projects
    • Create sub-graphs in these projects for sub-tasks
  • Multiple projects and sub-graphs can be difficult to maintain

CloverETL Community Edition is basically the gateway into a world of effective data manipulation. It’s hard to resist purchasing a commercial edition once you have started using it. If you have the budget, I think an ETL tool is really worth the investment. The tool will pay for itself in time saved. However, if your budget does not allow for purchasing an ETL tool with all the fixings, and the Community Edition is too restrictive, you can look at using a poor man’s solution.

A Poor Man's Solution

If the ETL tasks are relatively simple, another approach is to use the database and its functionality to achieve the desired transformations. For complicated ETL tasks this approach will probably not scale well, but for simple ad hoc data manipulations it seems to be a viable solution.

In this post, I will describe an ETL task written using Bash, PostgreSQL and PL/pgSQL functions. This example was written using Postgres.app on OS X. The ETL task is very simple. It takes an input CSV and generates an output CSV. It’s a toy example. The interesting part of the example is where we make use of the PL/pgSQL function (See : 2_process_raw_data.sql) and cycle through the raw data, making decisions on what the output should be. It illustrates the possibility of doing more complex ETL tasks with this kind of approach.
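
For reference, here is roughly what the project directory (the data_dir referenced in the Bash script below) contains. The name of the Bash script itself is my own placeholder; the other file names are the ones used throughout this post :

quick_n_dirty_etl/
    run_etl.sh                  # the Bash driver script (name assumed)
    1_create_temp_storage.sql   # creates the temp storage table
    2_process_raw_data.sql      # PL/pgSQL processing function
    3_drop_temp_items.sql       # clean up
    fruit.csv                   # raw input data
    processed_fruit.csv         # generated output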

The Bash Script

The Bash script controls the execution flow of our ETL steps. It will launch the psql client and execute SQL scripts in a certain order. Sequentially, the Bash script below does the following :

  1. Creates temporary storage (See : 1_create_temp_storage.sql)
  2. Imports the contents of fruit.csv - our raw data - into this temporary storage
  3. Processes the raw data with a call to a PL/pgSQL function (See : 2_process_raw_data.sql)
  4. Exports our processed data to a CSV file (See : processed_fruit.csv)
  5. Cleans up after itself (See : 3_drop_temp_items.sql)

#!/bin/bash
set -e
db_user=beanj
db_password=
db_host=localhost
db_name=testing
data_dir=/Users/beanj/dev/git/quick_n_dirty_etl


# 1 - Create temp storage
export PGPASSWORD="$db_password" && \
export PGOPTIONS='--client-min-messages=warning' && \
'/Applications/Postgres.app/Contents/Versions/9.4/bin'/psql -p5432 -U "$db_user" \
-h "$db_host" \
-d "$db_name" \
-v ON_ERROR_STOP=1 \
-a -f 1_create_temp_storage.sql


# 2 - Import Raw Data
export PGPASSWORD="$db_password" && \
export PGOPTIONS='--client-min-messages=warning' && \
'/Applications/Postgres.app/Contents/Versions/9.4/bin'/psql -p5432 -U "$db_user" \
-h "$db_host" \
-d "$db_name" \
-v ON_ERROR_STOP=1 \
-c "COPY tmp_fruit FROM STDIN with CSV HEADER ;" < "$data_dir/fruit.csv"


# 3 - Process Raw Data
export PGPASSWORD="$db_password" && \
export PGOPTIONS='--client-min-messages=notice' && \
'/Applications/Postgres.app/Contents/Versions/9.4/bin'/psql -p5432 -U "$db_user" \
-h "$db_host" \
-d "$db_name" \
-v ON_ERROR_STOP=1 \
-a -f 2_process_raw_data.sql


# 4 - Export Processed Data
export PGPASSWORD="$db_password" && \
export PGOPTIONS='--client-min-messages=warning' && \
'/Applications/Postgres.app/Contents/Versions/9.4/bin'/psql -p5432 -U "$db_user" \
-h "$db_host" \
-d "$db_name" \
-v ON_ERROR_STOP=1 \
-c "COPY (select * from tmp_processed) TO STDOUT WITH CSV HEADER; " > processed_fruit.csv


# 5 - Drop temporary tables and functions
export PGPASSWORD="$db_password" && \
export PGOPTIONS='--client-min-messages=notice' && \
'/Applications/Postgres.app/Contents/Versions/9.4/bin'/psql -p5432 -U "$db_user" \
-h "$db_host" \
-d "$db_name" \
-v ON_ERROR_STOP=1 \
-a -f 3_drop_temp_items.sql
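
If you save the script above to a file and make it executable, running the whole pipeline is a single command. The script name here is my own placeholder; your repository may use a different one :

chmod +x run_etl.sh
./run_etl.sh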

Step 1 - The Temp Storage

We need a place to import data into. For this, we’ll create a table that we’ll eventually drop once we’re done with it. The contents of our 1_create_temp_storage.sql SQL script looks like :

/*
 * Create temp table to store our incoming data - from the CSV import 
 */
DROP TABLE IF EXISTS tmp_fruit
;

CREATE TABLE tmp_fruit
(
	name	varchar(10),
	count	int
)
;
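
A quick note on why this is an ordinary table rather than a PostgreSQL TEMPORARY table : a temporary table only lives for the duration of a single session, and each step of the Bash script starts a fresh psql session, so the imported data would be gone before step 2 ever ran. In other words, the variant sketched below would not work with this multi-invocation setup :

/*
 * This would NOT work here : the table disappears as soon as the psql
 * session that created it exits, i.e. before the next script step runs.
 */
CREATE TEMPORARY TABLE tmp_fruit
(
	name	varchar(10),
	count	int
)
;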

Step 2 - The Input Data

The fruit.csv file gets imported in step 2 of our Bash script. The file contents look like :

name, count
Apple, 10
Banana, 21
Orange, 12
Lemon, 33

Step 3 - Processing the Raw Data

Processing the raw data is where it gets a little more interesting. This SQL script creates a function called tmp_process_fruit(varchar). The function takes a varchar as an input parameter, cycles through our temporary fruit table and builds a string out of the fruit names, which it then returns. As we cycle through the input data, if the name of the fruit we are looking at matches the input parameter, the fruit’s name is tacked onto the accum variable wrapped in special characters and a counter. Otherwise, the name of the fruit is just tacked onto the accum variable as is.

The 2_process_raw_data.sql SQL script looks like this :

/*
 * Create function to process the fruits table data.
 * Cycles through the tmp_fruit records and returns a string.
 */

CREATE OR REPLACE FUNCTION tmp_process_fruit(varchar) RETURNS VARCHAR AS $$
DECLARE
    c_tmp_fruit     refcursor;
    r_tmp_fruit     RECORD;
    counter         int;
    accum           varchar;
    query           varchar;

BEGIN
    counter := 0;
    query := 'select * from tmp_fruit order by name desc';
    accum := '';
    OPEN c_tmp_fruit FOR EXECUTE query;
    FETCH FIRST FROM c_tmp_fruit INTO r_tmp_fruit;

    WHILE FOUND LOOP
        -- If the fruit is what we passed in as a parameter the generated
        -- string element will look a little different when added to the accumulator
        IF (r_tmp_fruit.name = $1) THEN
            accum := '*_'||r_tmp_fruit.name||'_'||CAST(counter AS text)||'_'||accum||'*';
        -- Otherwise, the string element just gets added to the accumulator as is
        ELSE
            accum := r_tmp_fruit.name || ' ' || accum;
        END IF;
        RAISE NOTICE 'Processing Record : %', counter;
        counter := counter + 1;
        FETCH NEXT FROM c_tmp_fruit INTO r_tmp_fruit;
    END LOOP;

    CLOSE c_tmp_fruit;
    RETURN accum;
END;
$$ LANGUAGE PLPGSQL
;

DROP TABLE IF EXISTS tmp_processed
;

/*
 * Create a table called tmp_processed with the result of the 
 * tmp_process_fruit('Orange') function call. We will export this 
 * in a CSV file in another step.
 */
CREATE TABLE tmp_processed AS
SELECT tmp_process_fruit('Orange')
;

One thing to note in this step is that we created a table called tmp_processed in the last few lines of the SQL script above. This table’s content is the result of the call to the tmp_process_fruit('Orange') function.
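
If you want to sanity-check the function on its own, you can call it directly from psql once steps 1 and 2 have loaded the raw data. Based on the logic described above, passing a different fruit name simply moves the decorated element around; for example :

-- Assumes tmp_fruit has already been created and populated (steps 1 and 2)
SELECT tmp_process_fruit('Banana');
-- With the sample fruit.csv above this should return :
--   Apple *_Banana_2_Lemon Orange *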

Step 4 - Outputting the Processed Data

This step takes the contents of the tmp_processed table we created in Step 3 and puts it in a CSV file. The content of the output CSV file looks like the following :

tmp_process_fruit
Apple Banana Lemon *_Orange_0_*

It’s a table with one row in it. The row contains the concatenated fruit names in our original raw data. Note that in Step 3, we called the tmp_process_fruit() function with Orange as the input parameter. The result is that our output contains the *_Orange_0_* substring.
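
As an aside, if you prefer to let psql write the file itself instead of redirecting STDOUT in the shell, the \copy meta-command is an alternative. It runs the COPY on the client side and writes the file relative to psql's current working directory :

\copy (select * from tmp_processed) to 'processed_fruit.csv' with csv header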

Step 5 - Cleaning Up

The clean up script just removes any temporary tables and functions that we no longer need.

/*
 * Drop temporary tables and functions.
 */

DROP FUNCTION tmp_process_fruit(varchar)
;

DROP TABLE IF EXISTS tmp_processed
;

DROP TABLE IF EXISTS tmp_fruit
;

Some Thoughts

What I like about this approach is that for simple tasks, revisiting and maintaining a series of data manipulations is relatively easy. There is no IDE to fight with. It’s clear what gets checked into SCM. The cost of tooling is very low. There are no additional tools or database drivers to install. The setup is very simple.

Developers write a lot of code in many different languages and environments. Having the ability to keep portions of it simple and easy to jump back into is a great advantage.

Source files for examples in this post : GitHub Repository
