13 Comments
Jul 30, 2023

Hello, I have some questions:

1) Rule of thumb: set the number of cores to half the number of files being read. This way you are very likely to avoid idling cores due to file size skew. Adjust according to your situation.

So let us consider a situation in which I have to read 196 files, each an 800 MB .gz file.

I use the Spark RDD API to read all the files, on EMR Serverless.

So the total number of cores should be at most 98, right? Since these files are not splittable.

But let's say I read the files and repartition them into 2000 partitions to increase parallelism in the later stages. If my cores are limited, then the next set of actions (stages or tasks) would still be slow, right?
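For concreteness, here is a minimal PySpark sketch of the scenario I mean (the bucket path and values are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gz-read-example").getOrCreate()
sc = spark.sparkContext

# .gz is not splittable, so textFile yields one partition per file
rdd = sc.textFile("s3://my-bucket/input/*.gz")  # hypothetical path to the 196 files
print(rdd.getNumPartitions())  # expect ~196, one per file

# Repartition to 2000 to raise parallelism in downstream stages;
# the repartition itself is a full shuffle
repartitioned = rdd.repartition(2000)
print(repartitioned.getNumPartitions())  # 2000
```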

Jul 30, 2023

Hello, great post, but I have a couple of questions:

1) Why is there no reference to the number of Spark executors?

2) Certain files, such as .gz, are not splittable when we read them with the RDD API, so how do things change in that scenario?
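To make question 1 concrete, this is roughly how I think total parallelism is pinned down in terms of executors and cores per executor (the values below are only illustrative assumptions):

```python
from pyspark.sql import SparkSession

# Total concurrent tasks = executor instances x cores per executor,
# e.g. 20 executors x 5 cores = 100 task slots. A non-splittable .gz
# file still occupies exactly one task regardless of its size.
spark = (
    SparkSession.builder
    .appName("executor-sizing-example")
    .config("spark.executor.instances", "20")  # illustrative value
    .config("spark.executor.cores", "5")       # illustrative value
    .config("spark.executor.memory", "8g")     # illustrative value
    .getOrCreate()
)

rdd = spark.sparkContext.textFile("s3://my-bucket/input/*.gz")  # hypothetical path
print(rdd.getNumPartitions())  # one partition per .gz file
```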


Great post. Actually, we can also consider partitioning as a special type of bucketing, i.e. the modulo is the cardinality of the partition column. So naturally, if the cardinality is large, bucketing is the better choice, since there will be fewer small partition files.

Regarding spark.default.parallelism and spark.sql.shuffle.partitions: they are quite close to each other and a little confusing. Could you talk about them a bit more in the future? Thanks in advance.
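Here is a small sketch of the distinction as I understand it (column names, bucket counts, and paths are made up for illustration):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("partition-vs-bucket")
    # Default partition count for RDD operations such as parallelize/join
    .config("spark.default.parallelism", "200")
    # Partition count produced by DataFrame/SQL shuffles (joins, groupBy, ...)
    .config("spark.sql.shuffle.partitions", "200")
    .getOrCreate()
)

df = spark.read.parquet("s3://my-bucket/events/")  # hypothetical input

# Partitioning: one directory per distinct value,
# which becomes painful for high-cardinality columns
df.write.partitionBy("country").mode("overwrite").parquet("s3://my-bucket/by_country/")

# Bucketing: hash(column) % num_buckets, so the file count stays bounded
# even for a high-cardinality column (bucketing requires saveAsTable)
(df.write.bucketBy(64, "user_id")
   .sortBy("user_id")
   .mode("overwrite")
   .saveAsTable("events_bucketed"))
```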


Hello, regarding keeping cores per executor at 5: can you please give a use case where a higher number of cores per executor leads to better results? I have seen that, for compute-intensive operations, keeping a high core count leads to better results.

Also, please refer to https://aws.amazon.com/blogs/big-data/amazon-emr-serverless-supports-larger-worker-sizes-to-run-more-compute-and-memory-intensive-workloads/ and let me know your thoughts.
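For context, this is the kind of setting I am asking about (a minimal sketch with illustrative values, not a recommendation):

```python
from pyspark.sql import SparkSession

# Common guidance is ~5 cores per executor; for compute- and memory-intensive
# jobs, larger workers (more cores and memory per executor) may reduce
# per-executor overhead. The numbers below are placeholders for comparison.
spark = (
    SparkSession.builder
    .appName("executor-shape-example")
    .config("spark.executor.cores", "5")     # vs. e.g. "8" or "16" on larger workers
    .config("spark.executor.memory", "20g")  # scaled together with the core count
    .getOrCreate()
)
```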


Great post! However, I read on Stack Overflow that spark.sql.shuffle.partitions applies only to joins and shuffles, not when writing data to disk using the DataFrame writer. Could you please comment? Thanks.
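To make the question concrete, this is the behavior I am asking about, as I currently understand it (paths and column names are hypothetical):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("shuffle-partitions-and-writes")
    .config("spark.sql.shuffle.partitions", "50")
    .getOrCreate()
)

df = spark.read.parquet("s3://my-bucket/events/")  # hypothetical input

# No shuffle before the write: the file count follows the DataFrame's current
# partitioning, and spark.sql.shuffle.partitions plays no role here.
df.write.mode("overwrite").parquet("s3://my-bucket/out_plain/")

# A shuffle (groupBy) before the write: the result is shuffled into up to 50
# partitions (possibly fewer with AQE coalescing), so the write produces
# up to 50 files only as a side effect of the shuffle.
df.groupBy("country").count().write.mode("overwrite").parquet("s3://my-bucket/out_agg/")

# Controlling file count explicitly is done by repartition/coalesce before the write.
df.repartition(10).write.mode("overwrite").parquet("s3://my-bucket/out_10_files/")
```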

You haven't talked about spark.dynamicAllocation; I believe this is one of the key parameters for cost optimization.
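For example, something along these lines (the min/max values are only illustrative):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("dynamic-allocation-example")
    # Let Spark scale executors up and down with the pending task backlog
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "2")   # illustrative value
    .config("spark.dynamicAllocation.maxExecutors", "50")  # illustrative value
    # Needed when no external shuffle service is available
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .getOrCreate()
)
```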


Another scenario for:

1) Rule of thumb: set the number of cores to half the number of files being read. This way you are very likely to avoid idling cores due to file size skew. Adjust according to your situation. - Now the file sizes range from 200 MB to 800 MB, all 196 files are non-splittable, so in this case should the number of cores be 98?
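To check whether that file-size skew actually shows up as task skew, I would probe the per-partition record counts with something like this (path is hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("skew-check").getOrCreate()
sc = spark.sparkContext

# One partition per non-splittable .gz file, so per-partition record counts
# roughly mirror the 200 MB - 800 MB spread in file sizes.
rdd = sc.textFile("s3://my-bucket/input/*.gz")  # hypothetical path to the 196 files

counts = rdd.mapPartitions(lambda it: [sum(1 for _ in it)]).collect()
print(min(counts), max(counts))  # a large spread means cores idle near the end of the stage
```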
