Hello, I have some questions:

1) "Rule of thumb: set the number of cores to half the number of files being read. This way you are very likely to avoid idling cores due to file-size skew. Adjust according to your situation."

So let us consider a situation in which I have to read 196 files, each about 800 MB and gzip-compressed. I use a Spark RDD to read all the files, on EMR Serverless. Total cores should therefore be 98 or fewer, right, since these files are not splittable?

But let's say I read them and repartition to 2000 partitions to increase parallelism in later stages. If my cores are limited, then all subsequent stages and tasks would be slow, right?
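The arithmetic behind this scenario can be sketched in a few lines (a toy model: the 196-file and 2000-partition figures come from the question above; the halving rule is the post's rule of thumb):

```python
import math

def rule_of_thumb_cores(num_files: int) -> int:
    # Post's rule of thumb: half as many cores as input files, so one slow
    # (large) file is less likely to leave the other cores idle.
    return num_files // 2

def stage_waves(num_tasks: int, total_cores: int) -> int:
    # A stage with more tasks than cores runs in ceil(tasks / cores) "waves".
    return math.ceil(num_tasks / total_cores)

num_files = 196                        # non-splittable .gz -> 196 partitions
cores = rule_of_thumb_cores(num_files)
print(cores)                           # 98
print(stage_waves(num_files, cores))   # read stage: 2 waves
print(stage_waves(2000, cores))        # after repartition(2000): 21 waves
```

So under this model the later stages are not necessarily "slow"; they simply run in more waves, e.g. 2000 partitions on 98 cores means roughly 21 task waves per shuffle stage.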
Hello, great post, but I have a couple of questions:

1) Why is there no reference to the number of Spark executors?

2) Certain files, such as .gz, are not splittable when read through the RDD API. How do things change in that scenario?
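On question 1, a quick sketch of how executor count and cores per executor combine (the executor numbers below are illustrative assumptions, not from the post), and why .gz files cap the read parallelism:

```python
# Total parallelism is executors x cores per executor; the number of Spark
# executors matters mainly through this product (illustrative numbers).
num_executors = 20
executor_cores = 5
total_cores = num_executors * executor_cores   # 100 concurrent tasks

# A gzip file cannot be split, so reading "*.gz" with the RDD API yields
# exactly one partition (one task) per file, regardless of file size.
gz_files = 196
read_partitions = gz_files   # a splittable format could give more

print(total_cores, read_partitions)
```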
Great post. Actually, we can also consider partitioning as a special type of bucketing, i.e. one where the modulo is the cardinality of the partition column. So naturally, if the cardinality is high, bucketing is the better choice, since there will be fewer small partition files.

Regarding spark.default.parallelism and spark.sql.shuffle.partitions: they are quite close to each other and a little confusing. Could you talk about them a bit more in a future post? Thanks in advance.
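The partitioning-as-modulo point can be illustrated without Spark (a hash-based sketch; `bucket_for` stands in for Spark's bucket hash, and the 10,000-value column is an assumed example):

```python
# Bucketing: row -> hash(key) % num_buckets, so the file count is capped at
# num_buckets. Partitioning: one directory per distinct value, so the file
# count grows with the column's cardinality.
def bucket_for(key, num_buckets):
    return hash(key) % num_buckets

user_ids = [f"user_{i}" for i in range(10_000)]   # high-cardinality column

partition_dirs = len(set(user_ids))                        # 10,000 small files
buckets_used = len({bucket_for(u, 32) for u in user_ids})  # at most 32 files

print(partition_dirs)
print(buckets_used)
```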
Hello, regarding keeping cores per executor at 5: can you give a use case where a higher core count per executor leads to better results? I have seen that compute-intensive operations perform better with a higher core count.

Also, please see https://aws.amazon.com/blogs/big-data/amazon-emr-serverless-supports-larger-worker-sizes-to-run-more-compute-and-memory-intensive-workloads/ and let me know your comments.
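For context, the usual packing arithmetic behind the "5 cores per executor" guideline looks like this (the 16-core / 64 GB node is an assumed example). Fewer, fatter executors give each task more shared memory, which can favor compute-intensive jobs, while many more than ~5 concurrent tasks per executor tends to hurt HDFS/S3 I/O throughput:

```python
# Illustrative node size (assumed, not from the post): 16 cores, 64 GB RAM.
node_cores, node_mem_gb = 16, 64

cores_per_executor = 5
usable_cores = node_cores - 1                  # leave 1 core for OS/daemons
executors_per_node = usable_cores // cores_per_executor        # 3 executors
mem_per_executor_gb = (node_mem_gb - 1) // executors_per_node  # 21 GB each

print(executors_per_node, mem_per_executor_gb)
```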
Great post! However, I read on Stack Overflow that spark.sql.shuffle.partitions applies only to joins and other shuffles, not to writing data to disk with the DataFrame writer. Could you please comment? Thanks.
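A toy model of that interaction (pure Python, no Spark; adaptive query execution is assumed off): spark.sql.shuffle.partitions fixes the partition count produced by a shuffle, and the DataFrame writer then emits one file per partition, so the setting affects the write only indirectly, through whatever shuffle ran last before the write.

```python
SHUFFLE_PARTITIONS = 200   # spark.sql.shuffle.partitions (the default)

def partitions_before_write(plan, input_partitions):
    # Each shuffle (join/groupBy) resets the count to SHUFFLE_PARTITIONS;
    # narrow ops (filter/select) keep it; repartition(n) sets it to n.
    n = input_partitions
    for op in plan:
        if op == "shuffle":
            n = SHUFFLE_PARTITIONS
        elif isinstance(op, int):          # repartition(op)
            n = op
    return n

# The writer produces one file per partition of the final plan:
print(partitions_before_write(["shuffle"], 196))       # 200 output files
print(partitions_before_write([], 196))                # 196 output files
print(partitions_before_write(["shuffle", 50], 196))   # 50 output files
```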
You haven't talked about spark.dynamicAllocation. I believe this is one of the key parameters for cost optimization.
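For reference, enabling it typically looks like this in spark-defaults.conf (illustrative values; the shuffle-service line applies to YARN-based clusters such as EMR on EC2):

```
spark.dynamicAllocation.enabled              true
spark.dynamicAllocation.minExecutors         2
spark.dynamicAllocation.maxExecutors         50
spark.dynamicAllocation.executorIdleTimeout  60s
spark.shuffle.service.enabled                true
```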
Another scenario for the rule of thumb ("set the number of cores to half the number of files being read"): now the file sizes range from 200 MB to 800 MB, and all 196 files are non-splittable. In this case, should the core count still be 98?
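A toy greedy-scheduling simulation of this skewed case (pure Python; the even 200–800 MB spread is an assumption) suggests yes: with 196 cores (one per file), every core that drew a small file sits idle waiting on the 800 MB stragglers, while with 98 cores each core reads about two files and stays busy, so halving the cores only lengthens the read stage by roughly a quarter.

```python
import heapq

def makespan_mb(sizes_mb, cores):
    # Greedy scheduling: each free core takes the next (largest) pending
    # file, mimicking Spark handing tasks to idle executor slots.
    load = [0] * cores
    heapq.heapify(load)
    for s in sorted(sizes_mb, reverse=True):
        heapq.heappush(load, heapq.heappop(load) + s)
    return max(load)

# 196 non-splittable files spread evenly from 200 to 800 MB (assumed skew).
sizes = [200 + (600 * i) // 195 for i in range(196)]

print(makespan_mb(sizes, 196))  # 800: the slowest file gates the stage
print(makespan_mb(sizes, 98))   # ~1000: ~25% longer with half the cores
```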