Talk Submission

If you are interested in attending this talk at PyCon JP 2016, please use the social media share buttons below. We will consider the popularity of the proposals when making our selection.

talk

Building a data preparation pipeline with Pandas and AWS Lambda (en)

Speakers

Fabian Dubois

Audience level:

Intermediate

Category:

Industry Uses

Description

When working on a data project, you will often face messy input files with many missing or ill-formatted values. Data providers may update their files manually, which makes the data source even more error-prone. Once you feed such data to a data visualization or a dashboard, it causes many issues. I will show how to create a data preparation pipeline using Pandas running on AWS Lambda.

Objectives

Learn strategies to deal with dirty data. Learn how to make the best of Pandas to clean a dataset. Learn how you can streamline the process with AWS Lambda.
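
To give a flavour of the kind of cleaning meant here, the following is a minimal sketch, not the code used in the talk. The sample data, the column names and the `clean` function are all invented for illustration; only the pandas calls themselves are real.

```python
import io

import pandas as pd

# Hypothetical messy export: stray whitespace, several markers for
# "missing", thousands separators, a duplicated row and a row with no key.
RAW_CSV = """Country ,Year,Population
France,2015,"66,808,000"
 Japan ,2015,n/a
Japan,2014,"127,276,000"
France,2015,"66,808,000"
,2015,-
"""


def clean(df):
    """Return a tidied copy of the raw frame (illustrative rules only)."""
    df = df.copy()
    # Normalise header names: trim whitespace and lower-case them.
    df.columns = [name.strip().lower() for name in df.columns]
    # Trim whitespace around the text key.
    df["country"] = df["country"].str.strip()
    # Remove thousands separators, then convert; bad values become NaN.
    df["population"] = pd.to_numeric(
        df["population"].str.replace(",", "", regex=False), errors="coerce"
    )
    df["year"] = pd.to_numeric(df["year"], errors="coerce")
    # Rows without a country cannot be used downstream.
    df = df.dropna(subset=["country"])
    # Manual updates often paste the same row twice.
    df = df.drop_duplicates()
    return df.reset_index(drop=True)


# Read everything as text first so that nothing is silently mis-typed;
# "-" is added to the markers pandas already treats as missing.
raw = pd.read_csv(io.StringIO(RAW_CSV), dtype=str, na_values=["-"])
tidy = clean(raw)
print(tidy)
```

Reading every column as text and converting explicitly is one possible strategy; it keeps each cleaning rule visible in one place, which helps when the source file changes shape.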

Abstract

In this talk, I will first review typical cases where a data scientist or data application developer may be faced with dirty data in impractical formats (think Excel files). In particular, I will discuss my experience building data visualizations in a data journalism environment where data is gathered and updated manually. I will present alternative tools that are available on the market (Talend Dataprep and Trifacta Wrangler, for example), and explain why you may want to roll your own solution. Then we will see how to use Python and pandas to clean the data, first by interacting with it in a Jupyter notebook, then by turning that work into a script. Finally, we will see how to streamline the preparation using AWS Lambda, with an example where we automatically run our process whenever data is updated in a Google spreadsheet and upload the clean dataset to AWS S3.
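
As a rough idea of what the last step could look like, here is a minimal sketch of a Lambda-style handler, under several assumptions. The event fields (`csv`, `bucket`, `key`), the `clean` rules and the `upload_to_s3` helper are hypothetical; how the spreadsheet content actually reaches the function is not shown. The `handler(event, context)` signature and boto3's `put_object` call are the standard ones, and pandas would have to be packaged with the function.

```python
import io

import pandas as pd


def clean(df):
    """Placeholder cleaning step; real rules depend on the dataset."""
    df = df.copy()
    df.columns = [name.strip() for name in df.columns]
    # Drop rows that are entirely empty, then exact duplicates.
    df = df.dropna(how="all").drop_duplicates()
    return df.reset_index(drop=True)


def upload_to_s3(bucket, key, body):
    """Store the cleaned file on S3 (hypothetical helper)."""
    # Imported lazily so the rest of the module can be tried locally
    # without boto3 installed or AWS credentials configured.
    import boto3

    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=body)


def handler(event, context, upload=upload_to_s3):
    """Entry point in the shape Lambda expects: handler(event, context).

    The extra `upload` parameter is only there so the function can be
    exercised locally with a stand-in for S3.
    """
    # Assumption: the raw data arrives in the event as CSV text.
    raw = pd.read_csv(io.StringIO(event["csv"]), dtype=str)
    tidy = clean(raw)
    body = tidy.to_csv(index=False).encode("utf-8")
    upload(event["bucket"], event["key"], body)
    return {"rows": len(tidy)}
```

The same `clean` function can be developed interactively in a notebook and then imported unchanged by the handler, so the notebook, the script and the Lambda function share one implementation.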