Data Workflow Project

Tasked with finding a better, more automated method of getting “events” data flowing to endpoint systems.

Skills & Technologies

Python
ActiveMQ
Amazon S3
Amazon Kinesis
Amazon Redshift
Apache Airflow

Gathering

The first part consisted of gathering events from an external source via their ActiveMQ system. To do this we built a Python app that uses the stomp library to connect to ActiveMQ, read the events, and systematically feed them into an internal Amazon Kinesis Firehose. This app runs daemonized on a cluster of servers.
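A minimal sketch of that consumer, assuming the stomp.py 8.x listener API and boto3; the broker host, credentials, queue name and Firehose stream name are placeholders, and the real daemon adds batching and proper error handling:

```python
import time

import boto3
import stomp

FIREHOSE_STREAM = "events-firehose"          # hypothetical stream name
firehose = boto3.client("firehose")


class EventListener(stomp.ConnectionListener):
    """Forwards every message read from ActiveMQ into Kinesis Firehose."""

    def on_message(self, frame):
        # Firehose records must be bytes; newline-delimit them so they can
        # be split apart again downstream.
        firehose.put_record(
            DeliveryStreamName=FIREHOSE_STREAM,
            Record={"Data": (frame.body + "\n").encode("utf-8")},
        )

    def on_error(self, frame):
        print(f"broker error: {frame.body}")


conn = stomp.Connection([("activemq.example.com", 61613)])   # placeholder host
conn.set_listener("", EventListener())
conn.connect("user", "password", wait=True)                  # placeholder credentials
conn.subscribe(destination="/queue/events", id=1, ack="auto")

# Keep the daemon alive; in production this runs under a process supervisor
# on each server in the cluster.
while True:
    time.sleep(60)
```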

We set up the Amazon Kinesis Firehose to feed the data into a Redshift table via an S3 intermediate.
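For reference, a Firehose delivery stream with a Redshift destination and S3 staging can be created roughly like this with boto3 (in practice this setup is often done through the console or infrastructure-as-code; every ARN, URL, table name and credential below is a placeholder):

```python
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="events-firehose",
    RedshiftDestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-redshift",
        "ClusterJDBCURL": "jdbc:redshift://example.redshift.amazonaws.com:5439/analytics",
        "Username": "firehose_user",
        "Password": "********",
        # Firehose issues this COPY against Redshift after each S3 batch lands.
        "CopyCommand": {
            "DataTableName": "events",
            "CopyOptions": "json 'auto'",
        },
        # The S3 intermediate that batches records before the COPY.
        "S3Configuration": {
            "RoleARN": "arn:aws:iam::123456789012:role/firehose-s3",
            "BucketARN": "arn:aws:s3:::events-staging",
            "Prefix": "firehose/",
            "BufferingHints": {"SizeInMBs": 64, "IntervalInSeconds": 300},
            "CompressionFormat": "UNCOMPRESSED",
        },
    },
)
```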

Delivery / Usage

To get this data to one of our endpoints we created an Airflow DAG, again written in Python (as you do for Airflow). The DAG runs once a day and performs the following steps (a skeleton is sketched after the list):

  • Polling Redshift until the day’s “events” have ended.
  • Exporting these events via an UNLOAD query.
  • Feeding them into a “cleaning” service (see the Data Cleaning project).
  • Waiting for the results, feeding them into another external “cleaning” service via FTP, and again polling for results.
  • Placing the results onto S3 and COPYing them back into Redshift.
  • UNLOADing the data into segmented files (based on rulesets) back to S3.
  • Finally, uploading the data files to an FTP server.
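A skeleton of that DAG, assuming Airflow 2.x import paths; the DAG id and task names are illustrative, and the task bodies (Redshift connections, cleaning-service clients, FTP credentials) are left out:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def wait_for_days_events(**_):
    """Poll Redshift until the day's events have finished arriving."""


def unload_events(**_):
    """Run an UNLOAD query to export the day's events to S3."""


def clean_internal(**_):
    """Submit the exported events to the internal cleaning service and wait for results."""


def clean_external(**_):
    """Push the results to the external cleaning service over FTP and poll for results."""


def copy_results(**_):
    """Place the cleaned results on S3 and COPY them back into Redshift."""


def unload_segments(**_):
    """UNLOAD the data into ruleset-based segmented files on S3."""


def upload_to_ftp(**_):
    """Upload the segmented data files to the destination FTP server."""


default_args = {"retries": 2, "retry_delay": timedelta(minutes=15)}

with DAG(
    dag_id="events_delivery",          # hypothetical DAG id
    schedule_interval="@daily",
    start_date=datetime(2021, 1, 1),
    catchup=False,
    default_args=default_args,
) as dag:
    steps = [
        PythonOperator(task_id=fn.__name__, python_callable=fn)
        for fn in (
            wait_for_days_events,
            unload_events,
            clean_internal,
            clean_external,
            copy_results,
            unload_segments,
            upload_to_ftp,
        )
    ]
    # Chain the tasks so each daily step runs strictly after the previous one,
    # mirroring the ordered list above.
    for upstream, downstream in zip(steps, steps[1:]):
        upstream >> downstream
```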