To see this working, head to your live site.

Can you explain Spark groupByKey with example?

1 comment

1 Comment

Commenting on this post isn't available anymore. Contact the site owner for more info.

Sep 07, 2019

Spark groupByKey : As name says it groups the dataset (K, V) key-value pair based on Key and stores the value as Iterable, (K, V) => (K, Iterable(V)). It's very expensive operation and consumes lot of memory if dataset is huge.

For example,

scala> val rdd = sc.parallelize(List("Hello Hello Spark Apache Hello Dataneb Dataneb Dataneb Spark"))

rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[2] at parallelize at <console>:24

scala> rdd.collect

res3: Array[String] = Array(Hello Hello Spark Apache Hello Dataneb Dataneb Dataneb Spark)

Splitting the array and creating (K, V) pair

scala> val keyValue = rdd.flatMap(words => words.split(" ")).map(x=>(x,1))

keyValue: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[16] at map at <console>:25

Iterable[Int] Value "1" tells number of occurrences of Key

scala> keyValue.groupByKey.collect

res12: Array[(String, Iterable[Int])] = Array((Spark,CompactBuffer(1, 1)), (Dataneb,CompactBuffer(1, 1, 1)), (Hello,CompactBuffer(1, 1, 1)), (Apache,CompactBuffer(1)))

​Home • Blog • About • Terms of Use • Policy • Contact

Can you explain Spark groupByKey with example?

Home • Blog • About • Terms of Use • Policy • Contact