A Guide to Optimising your Spark Application Performance (Part 1).
A cheat sheet to refer to when you run into performance issues with your Spark application.
👋 I am Aurimas. I write the SwirlAI Newsletter with the goal of presenting complicated Data related concepts in a simple and easy-to-digest way. My mission is to help You UpSkill and keep You updated on the latest news in Data Engineering, MLOps, Machine Learning and overall Data space.
This is a 🔒 Paid Subscriber 🔒 only issue. If you want to read the full article, consider upgrading to a paid subscription.
Apache Spark continues to be the most popular distributed compute framework out there. I remember when I first got exposure to it: neither I nor the company I worked for had any experience with, or understanding of, how much effort it takes to optimise Data Processing Jobs. The same story continued when I switched jobs - throwing more compute resources at the problem was almost always the answer.
Only when I was made responsible for the amount of money we were spending on compute resources, and the data volumes became huge, did I start tinkering with the different knobs available in the Spark framework and the systems integrated with it. The results were astounding, in some cases reaching 10x or more in performance improvement and cost reduction. Let’s be clear: back in the times I am talking about there were no books or good tutorials explaining these concepts, so a lot of trial and error was involved, and most attempts were based on intuition, mathematical calculations and deep research into Spark internals.
In this two-part Newsletter series I want to summarise, as a condensed cheat sheet, some of the learnings I wish someone had handed me back then. Topics that will be covered:
Part 1:
Spark Architecture.
Wide vs. Narrow Transformations.
Broadcast Joins
Maximising Parallelism.
Partitioning.
Bucketing.
Caching.
Part 2:
spark.default.parallelism vs. spark.sql.shuffle.partitions.
Choosing the right File Format.
Column-Based vs. Row-Based storage
Splittable vs. non-splittable files
Encoding in Parquet.
RLE (Run Length Encoding)
Dictionary Encoding
Combining RLE and Dictionary Encoding
Considering upstream data preparation for Spark Job input files.
What happens when you Sort in Spark?
Understanding and tuning Spark Executor Memory.
Maximising the number of executors for a given cluster.
Having said this, Spark has evolved over time and has natively introduced a lot of optimisations, reducing the effort needed for Job optimisation. Nevertheless, everything outlined in this Newsletter episode remains useful and will help you understand Spark deeply and make better decisions when building your applications.
Introduction to Spark Architecture.
Before we delve into specific optimisation techniques, let’s review the building blocks that make up Spark, as they are needed to understand performance bottlenecks later on.
While Spark is written in Scala, APIs exist for other languages as well, including Java, Python and R.
Most of the libraries in Spark are contained in the Spark Core layer. There are several high-level APIs built on top of Spark Core to support different use cases:
Spark SQL - Batch Processing.
Spark Streaming - Near Real-Time Processing.
Spark MLlib - Machine Learning.
GraphX - Graph Structures and Algorithms.
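To make this concrete, here is a minimal Scala sketch (the app name and input file are hypothetical) showing that these high-level APIs are all reached through a single SparkSession, which wraps the Spark Core entry point, the SparkContext:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: SparkSession is the unified entry point to the high-level APIs.
val spark = SparkSession.builder()
  .appName("api-overview")       // hypothetical app name
  .master("local[*]")            // run locally for this sketch
  .getOrCreate()

// Spark SQL (batch processing) through DataFrames and SQL
val events = spark.read.json("events.json")   // hypothetical input file
events.createOrReplaceTempView("events")
spark.sql("SELECT count(*) FROM events").show()

// Spark Core's RDD API sits underneath, reachable through the SparkContext
val rdd = spark.sparkContext.parallelize(1 to 100)
println(rdd.sum())
```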
Spark usually runs on a cluster of machines; there are multiple orchestrators and resource managers that can run Spark natively. The Cluster Manager is responsible for providing resources to Spark applications. The supported Cluster Managers are:
Standalone - simple cluster manager shipped together with Spark.
Hadoop YARN - resource manager of Hadoop ecosystem.
Apache Mesos - general cluster manager (❗️ already deprecated; kept in the list for reference only).
Kubernetes - popular open-source container orchestrator. It is becoming increasingly common for running Spark Jobs; I will cover running Spark on Kubernetes in one of my future Newsletters.
What happens when you submit an application to a Spark Cluster?
Once you submit a Spark Application, a SparkContext Object is created in the Driver Program. This Object is responsible for communicating with the Cluster Manager.
The SparkContext negotiates with the Cluster Manager for the resources required to run the Spark application. The Cluster Manager allocates the resources inside the respective Cluster and creates the requested number of Spark Executors.
After starting, Spark Executors connect to the SparkContext to announce that they have joined the Cluster. Executors send regular heartbeats to notify the Driver Program that they are healthy and do not need rescheduling.
Spark Executors are responsible for executing the Tasks of the Computation DAG (Directed Acyclic Graph). This can include reading data, writing data, or performing an operation on a partition of an RDD.
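Here is a minimal Scala sketch of an application entry point, tying the two sections above together: the master URL selects the Cluster Manager, and the executor settings describe the resources the SparkContext negotiates for. The values shown are illustrative assumptions, not recommendations:

```scala
import org.apache.spark.sql.SparkSession

// Creating the SparkSession builds the SparkContext in the Driver Program;
// the master URL selects the Cluster Manager from the list above.
val spark = SparkSession.builder()
  .appName("my-spark-app")                     // hypothetical app name
  .master("yarn")                              // or "spark://<host>:7077", "k8s://<api-server>", "local[*]"
  .config("spark.executor.instances", "4")     // executors the SparkContext asks the Cluster Manager for
  .config("spark.executor.cores", "4")         // cores per executor
  .config("spark.executor.memory", "8g")       // heap memory per executor
  .getOrCreate()

// From here on, Tasks from the Job's DAG are scheduled onto those Executors.
```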
It is important to understand the internals of a Spark Job.
The Spark Driver is responsible for constructing an optimised physical execution plan for a given application submitted for execution.
This plan materialises into a Job, which is a DAG of Stages.
Some of the Stages can be executed in parallel if they have no sequential dependencies.
Each Stage is composed of Tasks.
All Tasks of a single Stage contain the same type of work; a Task is the smallest piece of work that can be executed in parallel, and it is performed by a Spark Executor.
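As a small illustration (reusing the SparkSession `spark` from the sketch above, with a hypothetical input file), here is how a classic word count maps onto these terms: the narrow steps share a Stage, the shuffle introduces a new one, and the final action is what submits the Job:

```scala
// Word-count sketch: one action -> one Job -> Stages -> Tasks.
val lines = spark.sparkContext.textFile("input.txt")   // one Task per input partition

val counts = lines
  .flatMap(_.split("\\s+"))          // narrow - stays in the same Stage
  .map(word => (word, 1))            // narrow - same Stage
  .reduceByKey(_ + _)                // needs a shuffle - starts a new Stage

counts.collect()                     // the action that actually submits the Job
println(counts.toDebugString)        // lineage with the Stage (shuffle) boundaries visible
```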
Let’s outline some additional Definitions that will help us explain Spark behavior.
Spark Jobs are executed against Data Partitions.
A Data Partition (not to be confused with the Partitioning procedure via .partitionBy() - we will treat these as two separate things) is the smallest indivisible piece of data; it can be stored on disk or in memory.
Each Spark Task is connected to a single Partition.
Data Partitions are immutable - this helps with disaster recovery in Big Data applications.
After each Transformation, a number of Child Partitions will be created.
Each Child Partition will be derived from one or more Parent Partitions and will act as a Parent Partition for future Transformations.
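A short sketch of how Child Partitions relate to Parent Partitions (again assuming the `spark` session from the earlier sketch; the sizes are illustrative):

```scala
// Parent Partitions: parallelize the data into 8 slices.
val parent = spark.sparkContext.parallelize(1 to 1000, 8)
println(parent.getNumPartitions)        // 8

// A Narrow Transformation: each Child Partition is derived from exactly one Parent.
val filtered = parent.filter(_ % 2 == 0)
println(filtered.getNumPartitions)      // still 8

// repartition() shuffles, so each Child Partition can draw from many Parents.
val reshaped = filtered.repartition(4)
println(reshaped.getNumPartitions)      // 4
```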
Wide vs. Narrow Transformations and Shuffle.
Shuffle is the procedure in which creating Child Partitions requires moving data between Data Containers and Spark Executors over the network. A Shuffle is invoked when performing a Wide Transformation. There are two types of Transformations:
Narrow Transformations
These are simple transformations that can be applied locally, without moving data between Data Containers.
Locality is possible because the transformation logic does not need context from other records.
E.g. a simple filter only needs the value of the current record, and a map transforms a record without any context from other records (the sketch after the function list below contrasts Narrow and Wide Transformations in code).
Functions that trigger Narrow Transformations include:
map()
mapPartitions()
flatMap()
filter()
union()
sample()
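To see the difference in code, here is a short sketch (the dataset and numbers are made up, and `spark` is the session from the earlier sketches) contrasting a narrow chain with a wide transformation that forces a Shuffle:

```scala
val orders = spark.sparkContext.parallelize(Seq(
  ("alice", 10.0), ("bob", 5.0), ("alice", 7.5), ("carol", 3.0)
))

// Narrow: map() and filter() work record by record - no data leaves its partition.
val adjusted = orders
  .map { case (user, amount) => (user, amount * 1.1) }
  .filter { case (_, amount) => amount > 5.0 }

// Wide: summing per key needs all records of a key in one place, so Spark shuffles.
val totals = adjusted.reduceByKey(_ + _)

totals.collect().foreach(println)
```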