Talk Submission
talk
Improving PySpark Performance - Leveraging DataFrames & other techniques (en)
Speakers
Holden Karau
Audience level:
Intermediate
Category:
Big Data
Description
This talk covers a number of important topics for making scalable Apache Spark programs in Python.
Objectives
Understand how to effectively use Spark in Python.
Abstract
This talk covers a number of important topics for making scalable Apache Spark programs: RDD re-use, considerations for working with key/value data, why avoiding groupByKey is important, and more. It also covers Python-specific considerations, such as the difference between DataFrames/Datasets and traditional RDDs in Python. Finally, we explore some tricks for intermixing Python and JVM code in cases where the Python performance overhead is too high.
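As a rough illustration of the groupByKey point, the sketch below simulates in plain Python (not Spark itself) how many records would cross a shuffle under each approach. It assumes a toy model where an RDD is a list of partitions. The helper names `shuffle_group_by_key` and `shuffle_reduce_by_key` are hypothetical and exist only for this example. In PySpark, the corresponding calls are `RDD.groupByKey()` and `RDD.reduceByKey(func)`, where the latter combines values within each partition before shuffling.

```python
from collections import defaultdict
from operator import add

# Toy stand-in for an RDD: a list of partitions, each a list of (key, value) pairs.
partitions = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 1), ("b", 1), ("b", 1), ("b", 1)],
]


def shuffle_group_by_key(parts):
    """groupByKey-style: every record crosses the shuffle, then values are grouped."""
    shuffled = [pair for part in parts for pair in part]
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    # Summing after grouping mirrors groupByKey() followed by a per-key sum.
    return len(shuffled), {key: sum(values) for key, values in grouped.items()}


def shuffle_reduce_by_key(parts, func):
    """reduceByKey-style: combine inside each partition first, then merge the partials."""
    shuffled = []
    for part in parts:
        local = {}
        for key, value in part:
            local[key] = func(local[key], value) if key in local else value
        # Only one partial result per key per partition crosses the shuffle.
        shuffled.extend(local.items())
    merged = {}
    for key, value in shuffled:
        merged[key] = func(merged[key], value) if key in merged else value
    return len(shuffled), merged


grouped_records, grouped_totals = shuffle_group_by_key(partitions)
reduced_records, reduced_totals = shuffle_reduce_by_key(partitions, add)

print("records shuffled with grouping:", grouped_records)
print("records shuffled with map-side combine:", reduced_records)
```

Both approaches produce the same per-key totals, but the combine-first version moves at most one record per key per partition. On real data with many repeated keys, that difference is a large part of why reduceByKey is usually preferred for aggregations.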