A Guide to Optimising your Spark Application Performance (Part 2)
A cheat sheet to refer to when you run into performance issues with your Spark application.
👋 I am Aurimas. I write the SwirlAI Newsletter with the goal of presenting complicated Data-related concepts in a simple and easy-to-digest way. My mission is to help You UpSkill and keep You updated on the latest news in Data Engineering, MLOps, Machine Learning and the overall Data space.
This is a 🔒 Paid Subscriber 🔒 only issue. If you want to read the full article, consider upgrading to a paid subscription.
This is the second part of my Spark Application Optimisation Guide. If you missed the first one, be sure to check it out here. There I covered:
Spark Architecture.
Wide vs. Narrow Transformations.
Broadcast Joins.
Maximising Parallelism.
Partitioning.
Bucketing.
Caching.
Today I cover optimisations related to Data Storage and Infrastructure that you should consider:
Choosing the right File Format.
Column-Based vs. Row-Based storage.
Splittable vs. non-splittable files.
Encoding in Parquet.
RLE (Run Length Encoding).
Dictionary Encoding.
Combining RLE and Dictionary Encoding.
Understanding and tuning Spark Executor Memory.
Maximising the number of executors for a given cluster.
Memory allocation.
CPU allocation.
In the third part of the Spark-related article series, coming out in a few weeks, I am planning to cover some of the following topics (I will be running a poll where paid subscribers will be able to choose a subset of the topics):
Window functions.
UDFs.
Inferring Schema.
What happens when you Sort in Spark?
Connecting to Databases.
Compression of Files.
Spark UI.
Current state of Spark Cluster deployment.
Spark 3.x improvements.
Spark Driver.
Introduction to Spark Streaming.
Deploying Spark on Kubernetes.
Upstream data preparation for Spark Job input files.
Spark for Machine Learning.
….
Let’s jump to today’s topics.
Choosing the right File Format.
When using Spark, you will most likely interact with data that resides in your Data Lake (or in a Streaming Storage system when Spark Streaming is being used). Since today we focus on Batch Processing, let’s look into how data is stored in the Data Lake.
Let’s first focus on how data can be organised when stored on disk, and then on how well the stored data can be read in parallel. The former is related to Column- and Row-Based storage, while the latter is related to splittable and non-splittable files.
Column vs. Row Based storage.
There are four main characteristics of data stored in a single file that you need to consider when choosing between Column- and Row-Based storage.
How the data is organised on disk.
How efficient writing a single record/row is.
How efficient reading partial data (a subset of columns) is.
How well the data can be compressed when written to disk.
Let’s look at how Column- and Row-Based storage match up against each other in these four areas.
Row Based:
Rows on disk are stored in sequence.
New rows are written efficiently since you can write the entire row at once.
For select statements that target a subset of columns, reading is slower since you need to scan every row in full to retrieve just a few of its columns.
Compression is not efficient if columns have different data types, since values of different types are scattered throughout the file.
Column Based:
Columns on disk are stored in sequence.
New rows are written less efficiently since the separate fields of a row have to be written into different parts of the file.
For select statements that target a subset of columns, reads are faster than with Row-Based storage since you don’t need to scan the entire file (given that metadata describing where specific columns begin is available at read time); a short sketch of this follows the format examples below.
Compression is efficient since different data types are always grouped together.
Example file formats:
Row Based: Avro.
Column Based: Parquet, ORC.
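To make the read-efficiency point concrete, here is a minimal PySpark sketch, assuming a hypothetical Parquet dataset and column names. Selecting a subset of columns from a columnar file lets Spark prune the remaining columns at scan time, which you can verify in the ReadSchema entry of the physical plan.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("column-pruning-demo").getOrCreate()

# Hypothetical Parquet dataset sitting in the Data Lake.
events = spark.read.parquet("s3://my-bucket/events/")

# Only these two columns are requested from the files; the remaining
# column chunks are skipped thanks to column pruning.
subset = events.select("user_id", "event_type")

# The ReadSchema entry in the physical plan confirms that the scan
# reads just the selected columns.
subset.explain()
```

With a Row-Based format such as Avro, the same select would still have to deserialise every full row before discarding the unused fields.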
Tips.
If the main use case for Spark is OLAP - use Parquet or ORC.
If the main use case is about frequently writing data (OLTP) - use Avro.
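As a rough sketch of the tips above: choosing the format comes down to which writer you invoke. The paths below are hypothetical, and writing Avro assumes the external spark-avro package (pinned here to an assumed version) is added to the session, since it does not ship with Spark itself.

```python
from pyspark.sql import SparkSession

# Assumption: Spark 3.x built for Scala 2.12; adjust the spark-avro version to match your cluster.
spark = (
    SparkSession.builder
    .appName("format-choice-demo")
    .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:3.5.0")
    .getOrCreate()
)

events = spark.read.parquet("s3://my-bucket/events/")  # hypothetical input

# Columnar format - a good default for analytical (OLAP) workloads.
events.write.mode("overwrite").parquet("s3://my-bucket/analytics/events_parquet/")

# Row-Based format - better suited for frequent, record-at-a-time writes (OLTP-like).
events.write.mode("overwrite").format("avro").save("s3://my-bucket/ingest/events_avro/")
```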