Data Cleaning/Filtering Project

We found ourselves repeatedly running the same manual tasks on data sets in order to “clean” them.

Skills & Technologies

Python
Gunicorn
Celery
Amazon S3
Amazon Redshift
Redis

Frontend

We automated this process by creating three services. The first was a frontend web server where clients could submit tasks to be performed, for instance “clean this file on S3, at this level”. Clients could also query this server for the progress of a task, and the worker services could query it to find new tasks and update the ones they were working on. This service was built in Python, served with Gunicorn, with Redis as the backing store.
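
A minimal sketch of what that task API might look like, assuming Flask as the framework behind Gunicorn and redis-py for storage (neither the framework nor the task schema is specified above, so both are illustrative):

    import json
    import uuid

    import redis
    from flask import Flask, jsonify, request

    app = Flask(__name__)
    store = redis.Redis(decode_responses=True)

    @app.route("/tasks", methods=["POST"])
    def create_task():
        # A client submits a task, e.g. {"bucket": "...", "s3_key": "...", "level": "strict"}
        task_id = str(uuid.uuid4())
        task = {"id": task_id, "params": request.get_json(), "status": "pending"}
        store.set(f"task:{task_id}", json.dumps(task))
        store.rpush("tasks:pending", task_id)  # queue the id for workers
        return jsonify({"id": task_id}), 201

    @app.route("/tasks/claim", methods=["POST"])
    def claim_task():
        # Workers poll here to pick up the next pending task.
        task_id = store.lpop("tasks:pending")
        if task_id is None:
            return "", 204
        return store.get(f"task:{task_id}"), 200, {"Content-Type": "application/json"}

    @app.route("/tasks/<task_id>", methods=["GET"])
    def get_task(task_id):
        # Clients (and workers) poll here for a task's current state.
        raw = store.get(f"task:{task_id}")
        return (raw, 200, {"Content-Type": "application/json"}) if raw else ("not found", 404)

    @app.route("/tasks/<task_id>", methods=["PATCH"])
    def update_task(task_id):
        # Workers report progress and errors here, and mark tasks completed.
        raw = store.get(f"task:{task_id}")
        if raw is None:
            return "not found", 404
        task = json.loads(raw)
        task.update(request.get_json())  # e.g. {"status": "uploading", "progress": 40}
        store.set(f"task:{task_id}", json.dumps(task))
        return jsonify(task)

This would be run under Gunicorn in the usual way, e.g. gunicorn app:app.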

Workers

The second service was the “coordinator”, a Python daemon script. It would poll the web service for tasks that needed to be done, then go through the steps necessary to perform them. An example would be:

  • Pull specified file from S3
  • Reformat the file as required by an outside cleaning service, upload to their FTP
  • Poll FTP for completed results, pull them back
  • Upload data to a second external service via their web API in batches
  • Poll their API for the batches’ completion, pull the data back
  • Reformat all results and upload back into S3

All the while it kept the web service updated on progress (and errors), marking each task as completed when done. A condensed version of that loop is sketched below.
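
This sketch of the coordinator loop assumes the hypothetical task API above and boto3 for S3 access; the two external services are reduced to a comment because their interfaces aren't described here:

    import time

    import boto3
    import requests

    API = "http://localhost:8000"  # the frontend web service (assumed address)
    s3 = boto3.client("s3")

    def report(task_id, **fields):
        # Keep the web service updated on progress, errors, and completion.
        requests.patch(f"{API}/tasks/{task_id}", json=fields)

    def run_task(task):
        params = task["params"]
        report(task["id"], status="downloading")
        s3.download_file(params["bucket"], params["s3_key"], "/tmp/input.csv")
        # ... reformat the file, upload it to the external cleaner's FTP, poll
        # the FTP for completed results, batch-upload those to the second
        # service's web API, poll that API for completion, then reformat ...
        report(task["id"], status="uploading")
        s3.upload_file("/tmp/cleaned.csv", params["bucket"], params["s3_key"] + ".cleaned")
        report(task["id"], status="completed")

    while True:
        resp = requests.post(f"{API}/tasks/claim")  # poll for new work
        if resp.status_code == 204:
            time.sleep(30)  # nothing pending; back off before the next poll
            continue
        task = resp.json()
        try:
            run_task(task)
        except Exception as exc:
            report(task["id"], status="error", error=str(exc))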

Reporting

The third was an asynchronous, Celery-based Python service, called from the web service at request time to perform certain tasks. For instance, when a cleaning process completed, it was triggered to run a post-cleaning step whereby it would take the results returned by the external cleaning services and import them into Redshift for later reporting and analysis.
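
A sketch of what such a Celery task could look like. It assumes Redis (already in the stack) doubles as the Celery broker, and uses Redshift's COPY command via psycopg2 to bulk-load the cleaned results straight from S3; the table name, connection details, and IAM role are placeholders:

    import psycopg2
    from celery import Celery

    # Assume the Redis instance already in the stack serves as the broker.
    app = Celery("reporting", broker="redis://localhost:6379/0")

    REDSHIFT_DSN = "host=<cluster-endpoint> port=5439 dbname=analytics user=etl password=<secret>"

    @app.task
    def import_cleaned_results(bucket, key):
        # Bulk-load the cleaned file into Redshift with COPY, which reads
        # directly from S3 -- the standard ingestion path for Redshift.
        conn = psycopg2.connect(REDSHIFT_DSN)
        try:
            with conn, conn.cursor() as cur:
                cur.execute(
                    f"""
                    COPY cleaned_results
                    FROM 's3://{bucket}/{key}'
                    IAM_ROLE '<redshift-copy-role-arn>'
                    FORMAT AS CSV
                    IGNOREHEADER 1;
                    """
                )
        finally:
            conn.close()

The web service would fire this without blocking the request, e.g. import_cleaned_results.delay(bucket, key).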