Submitted talk
This is a submitted talk. Share the talks you would like to hear on social media; we will take this into account during selection.
talk
Improving PySpark Performance - Leveraging DataFrames & other techniques(en)
Speaker
Holden Karau
Target level:
Intermediate
Category:
Big Data
Description
This talk covers a number of important topics for making scalable Apache Spark programs in Python.
Objective
Understand how to use Spark effectively from Python.
Abstract
This talk covers a number of important topics for writing scalable Apache Spark programs, from RDD re-use to considerations for working with key/value data, why avoiding groupByKey is important, and more. It also covers Python-specific considerations, such as the difference between DataFrames/Datasets and traditional RDDs when used from Python. Finally, it explores some tricks for intermixing Python and JVM code in cases where the performance overhead is too high.
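As a rough illustration of why avoiding groupByKey matters, here is a minimal sketch in plain Python. It is not Spark code and is not taken from the talk: it only models the usual explanation that `groupByKey` sends every record through the shuffle, while `reduceByKey` combines values inside each partition first. The sample data and the helper names `group_then_sum` and `combine_then_sum` are hypothetical.

```python
from collections import defaultdict

# Hypothetical input: key/value records already split across two partitions.
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("a", 1), ("b", 1), ("c", 1)],
]


def group_then_sum(parts):
    """Model of groupByKey followed by a sum: every record crosses the shuffle."""
    shuffled = [record for part in parts for record in part]
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    totals = {key: sum(values) for key, values in grouped.items()}
    return totals, len(shuffled)


def combine_then_sum(parts):
    """Model of reduceByKey: sum inside each partition first, then shuffle."""
    shuffled = []
    for part in parts:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value
        shuffled.extend(local.items())  # one record per key per partition
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


grouped_totals, grouped_shuffled = group_then_sum(partitions)
combined_totals, combined_shuffled = combine_then_sum(partitions)
```

Both functions produce the same per-key totals, but the second one moves fewer records between partitions. In this toy model the gap grows with the number of repeated keys per partition; real Spark behaviour also depends on partitioning and serialization costs that the sketch ignores.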