Data Workflow Project
Tasked with finding a better, more automated method of getting “events” data flowing to endpoint systems.
Skills & Technologies
The first part consisted of gathering events from an external source via their ActiveMQ system. To do this we built a Python app that uses the stomp library to connect to ActiveMQ, read the events, and systematically feed them into an internal Amazon Kinesis Firehose. This app runs daemonized on a cluster of servers.
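A minimal sketch of that consumer, assuming stomp.py 8+ (where `on_message` receives a `Frame`) and boto3; the broker host, queue name, credentials, and delivery stream name are hypothetical placeholders.

```python
import time

import boto3
import stomp

FIREHOSE_STREAM = "events-firehose"  # hypothetical delivery stream name
firehose = boto3.client("firehose")


class EventListener(stomp.ConnectionListener):
    def on_message(self, frame):
        # Forward each ActiveMQ message body to Kinesis Firehose,
        # which buffers the records and delivers them downstream.
        firehose.put_record(
            DeliveryStreamName=FIREHOSE_STREAM,
            Record={"Data": (frame.body + "\n").encode("utf-8")},
        )


conn = stomp.Connection([("activemq.example.com", 61613)])
conn.set_listener("", EventListener())
conn.connect("user", "password", wait=True)
conn.subscribe(destination="/queue/events", id=1, ack="auto")

# The real app runs daemonized; here we simply keep the process alive.
while True:
    time.sleep(60)
```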
We set up the Amazon Kinesis Firehose to feed data into a Redshift table via an S3 intermediate.
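The setup could look roughly like the following boto3 call; the ARNs, bucket, cluster URL, credentials, and table name are hypothetical placeholders, and the COPY options depend on the record format.

```python
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="events-firehose",
    DeliveryStreamType="DirectPut",
    RedshiftDestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-redshift",
        "ClusterJDBCURL": "jdbc:redshift://cluster.example.us-east-1.redshift.amazonaws.com:5439/events",
        "Username": "firehose_user",
        "Password": "********",
        # Firehose stages records in S3, then issues a COPY into Redshift.
        "CopyCommand": {
            "DataTableName": "raw_events",
            "CopyOptions": "json 'auto' gzip",
        },
        "S3Configuration": {
            "RoleARN": "arn:aws:iam::123456789012:role/firehose-redshift",
            "BucketARN": "arn:aws:s3:::events-staging",
            "Prefix": "firehose/",
            "CompressionFormat": "GZIP",
            "BufferingHints": {"SizeInMBs": 64, "IntervalInSeconds": 300},
        },
    },
)
```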
Delivery / Usage
To get this data to one of our endpoints we created an Airflow DAG, again written in Python (as you do for Airflow). The DAG runs once a day and performs the following steps (a skeleton of the DAG is sketched after the list):
- Polling Redshift to confirm the day’s “events” have finished arriving.
- Exporting these events via an UNLOAD query.
- Feeding them into a “cleaning” service (see Data Cleaning project).
- Waiting for the results, feeding them into another external “cleaning” service via FTP, and again polling for results.
- Placing the results onto S3 and COPYing them back into Redshift.
- UNLOADing the data back to S3 as segmented files (split according to rulesets).
- Finally, uploading the data files to an FTP server.
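A minimal skeleton of how the daily DAG could be laid out, assuming Airflow 2.x with the postgres provider; the connection ID, bucket paths, IAM roles, and table names are hypothetical placeholders, and the cleaning/FTP tasks are only indicated.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook

REDSHIFT_CONN_ID = "redshift_default"  # hypothetical Airflow connection


def unload_events(**context):
    """UNLOAD the day's events from Redshift to S3."""
    ds = context["ds"]
    hook = PostgresHook(postgres_conn_id=REDSHIFT_CONN_ID)
    hook.run(f"""
        UNLOAD ('SELECT * FROM events WHERE event_date = ''{ds}''')
        TO 's3://events-export/{ds}/raw_'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-unload'
        DELIMITER '|' GZIP ALLOWOVERWRITE;
    """)


def copy_results(**context):
    """COPY the cleaned files on S3 back into Redshift."""
    ds = context["ds"]
    hook = PostgresHook(postgres_conn_id=REDSHIFT_CONN_ID)
    hook.run(f"""
        COPY events_cleaned
        FROM 's3://events-export/{ds}/cleaned/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
        DELIMITER '|' GZIP;
    """)


with DAG(
    dag_id="events_delivery",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    unload = PythonOperator(task_id="unload_events", python_callable=unload_events)
    copy = PythonOperator(task_id="copy_results", python_callable=copy_results)
    # The polling, cleaning-service, and FTP steps would sit around these
    # two tasks as further PythonOperators or sensors in the same chain.
    unload >> copy
```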