Discussion about this post

Yash:

Hello, great post, but I have a couple of questions:

1) Why haven't we made any reference to the number of Spark executors?

2) Certain file formats, such as .gz, are not splittable when read as an RDD, so how do things change in that scenario?

Fan:

Great post. Actually, we can also consider partitioning as a special type of bucketing, where the modulo is the cardinality of the partition column. So naturally, if the cardinality is large, bucketing is the better choice, since it produces fewer small partition files.
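Fan's analogy can be sketched without Spark at all. The toy functions below (hypothetical names, plain Python, no Spark required) group rows the two ways: partitioning creates one group per distinct key value, while bucketing hashes the key modulo a fixed bucket count. Setting the bucket count equal to the key's cardinality approximates partitioning (up to hash collisions), which is the sense in which partitioning is a special case:

```python
from collections import defaultdict

# Toy dataset: (country, id) rows with 3 distinct countries.
rows = [("US", 1), ("DE", 2), ("US", 3), ("FR", 4), ("DE", 5)]

def partition_by(rows, key):
    """Partitioning: one output group per distinct key value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)

def bucket_by(rows, key, num_buckets):
    """Bucketing: route each row by hash(key) modulo a fixed bucket count."""
    groups = defaultdict(list)
    for row in rows:
        groups[hash(row[key]) % num_buckets].append(row)
    return dict(groups)

parts = partition_by(rows, key=0)          # exactly 3 groups: US, DE, FR
cardinality = len({r[0] for r in rows})    # 3 distinct countries
buckets = bucket_by(rows, key=0, num_buckets=cardinality)

# Bucketing with num_buckets == cardinality yields at most as many
# groups as partitioning; with a high-cardinality key you would pick
# a small fixed num_buckets instead, avoiding many tiny files.
print(len(parts), len(buckets))
```

With a high-cardinality key, `partition_by` would emit one file per value, while `bucket_by` with a small fixed `num_buckets` caps the file count, which is exactly the trade-off the comment describes.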

Regarding spark.default.parallelism and spark.sql.shuffle.partitions: they are quite close to each other and a little confusing. Could you talk about them a bit more in a future post? Thanks in advance.
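For readers puzzling over the same two settings, a minimal configuration sketch, assuming a `spark-submit` deployment: both are real Spark settings, but they govern different APIs, which is the usual source of confusion.

```shell
# spark.default.parallelism: default partition count for RDD operations
#   (e.g. reduceByKey, join on RDDs) when none is given explicitly.
# spark.sql.shuffle.partitions: partition count for DataFrame/SQL
#   shuffles (joins, groupBy, aggregations); defaults to 200.
spark-submit \
  --conf spark.default.parallelism=200 \
  --conf spark.sql.shuffle.partitions=200 \
  app.py
```

In short, the first applies to the low-level RDD API and the second to the DataFrame/SQL engine; tuning one does not affect shuffles issued through the other API.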
