Data Workflow Project

Tasked with finding a better, more automated method of getting “events” data flowing to endpoint systems.

Skills & Technologies

Amazon S3
Amazon Kinesis
Amazon Redshift
Apache Airflow


The first part consisted of gathering events from an external source via their ActiveMQ system. To do this we built a Python app that used the stomp library to connect to ActiveMQ, read the events, and systematically feed them into an internal Amazon Kinesis Firehose. This app runs daemonized on a cluster of servers.
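A minimal sketch of that reader, assuming stomp.py 8.x and boto3; the broker address, queue name, stream name, and credentials below are placeholders rather than the production values:

import time

import boto3
import stomp

# Hypothetical connection details, for illustration only.
BROKER = ("activemq.example.com", 61613)
QUEUE = "/queue/events"
STREAM = "events-firehose"

firehose = boto3.client("firehose")


class EventListener(stomp.ConnectionListener):
    """Forward each ActiveMQ message into the Kinesis Firehose stream."""

    def on_message(self, frame):
        # Firehose records are raw bytes; a trailing newline keeps the
        # staged S3 objects line-delimited for the downstream COPY.
        firehose.put_record(
            DeliveryStreamName=STREAM,
            Record={"Data": (frame.body + "\n").encode("utf-8")},
        )


conn = stomp.Connection([BROKER], heartbeats=(10000, 10000))
conn.set_listener("", EventListener())
conn.connect("user", "password", wait=True)
conn.subscribe(destination=QUEUE, id=1, ack="auto")

# Keep the daemonized process alive; reconnection handling omitted here.
while True:
    time.sleep(60)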

We set up the Amazon Kinesis Firehose to feed data into a Redshift table via an S3 intermediate.
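Roughly, that wiring looks like the boto3 sketch below: Firehose stages records in an S3 bucket and issues a Redshift COPY from there. The stream name, role ARNs, cluster URL, table, bucket, and credentials are placeholders, and the real stream may well have been created through the console or infrastructure tooling rather than this exact call:

import boto3

firehose = boto3.client("firehose")

# Placeholder identifiers, for illustration only.
firehose.create_delivery_stream(
    DeliveryStreamName="events-firehose",
    RedshiftDestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery",
        "ClusterJDBCURL": "jdbc:redshift://cluster.example.us-east-1.redshift.amazonaws.com:5439/events",
        "CopyCommand": {
            "DataTableName": "raw_events",
            # Options Firehose appends to the COPY it runs against the staged S3 objects.
            "CopyOptions": "json 'auto' gzip",
        },
        "Username": "firehose_user",
        "Password": "********",
        "S3Configuration": {
            "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery",
            "BucketARN": "arn:aws:s3:::events-staging",
            "CompressionFormat": "GZIP",
        },
    },
)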

Delivery / Usage

To get this data to one of our endpoints we created an Airflow DAG, again written in Python (as you do for Airflow). Running once a day, the DAG works through the following steps (a minimal sketch of the DAG follows the list):

  • Polling Redshift until the day’s “events” have finished arriving.
  • Exporting these events via an UNLOAD query.
  • Feeding them into a “cleaning” service (see the Data Cleaning project).
  • Waiting for the results, feeding them into another external “cleaning” service via FTP, and again polling for results.
  • Placing the results onto S3 and COPYing them back into Redshift.
  • UNLOADing the data into segmented files (based on rulesets) back to S3.
  • Finally, uploading the data files to an FTP server.
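
A trimmed-down sketch of the DAG structure, assuming Airflow 1.x-style operators; the DAG id, task names, and callables are illustrative stand-ins for the steps above, not the production code:

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

# Placeholder callables standing in for the real step implementations.
def wait_for_days_events(**context): ...
def unload_events_to_s3(**context): ...
def run_cleaning_service(**context): ...
def run_external_ftp_cleaning(**context): ...
def copy_results_to_redshift(**context): ...
def unload_segmented_files(**context): ...
def upload_to_ftp(**context): ...

with DAG(
    dag_id="events_delivery",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    tasks = [
        PythonOperator(task_id=name, python_callable=fn)
        for name, fn in [
            ("wait_for_days_events", wait_for_days_events),
            ("unload_events_to_s3", unload_events_to_s3),
            ("run_cleaning_service", run_cleaning_service),
            ("run_external_ftp_cleaning", run_external_ftp_cleaning),
            ("copy_results_to_redshift", copy_results_to_redshift),
            ("unload_segmented_files", unload_segmented_files),
            ("upload_to_ftp", upload_to_ftp),
        ]
    ]

    # Run the steps strictly in sequence, one after another.
    for upstream, downstream in zip(tasks, tasks[1:]):
        upstream >> downstream

The export and re-import steps themselves are plain Redshift SQL issued from the task callables: an UNLOAD ('SELECT ...') TO 's3://bucket/prefix' with an IAM role for the export, and a COPY table FROM 's3://...' with the matching role to bring the cleaned results back in.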