DATANEB

  • Home

  • Categories

  • Q&A Forum

  • Spark Scala Tutorial

  • SSB Tips

  • Members

  • Search Results

  • Email Us

  • More...

    Use tab to navigate through the menu items.
    To see this working, head to your live site.
    • Categories
    • All Posts
    • My Posts
    Hina Singh
    Sep 07, 2019

    Can you explain Spark groupByKey with example?

    in Apache Spark
    WhiteSand
    Sep 07, 2019

Spark groupByKey: As the name suggests, it groups a dataset of (K, V) key-value pairs by key and collects the values for each key into an Iterable: (K, V) => (K, Iterable(V)). It is a very expensive operation and consumes a lot of memory on large datasets, because every value is shuffled across the network and held in memory for its key.


    For example,

    scala> val rdd = sc.parallelize(List("Hello Hello Spark Apache Hello Dataneb Dataneb Dataneb Spark"))

    rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[2] at parallelize at <console>:24

    scala> rdd.collect

    res3: Array[String] = Array(Hello Hello Spark Apache Hello Dataneb Dataneb Dataneb Spark)


Splitting the strings into words and creating (K, V) pairs

    scala> val keyValue = rdd.flatMap(words => words.split(" ")).map(x=>(x,1))

    keyValue: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[16] at map at <console>:25


Grouping by key — each "1" in the resulting Iterable[Int] marks one occurrence of that key

    scala> keyValue.groupByKey.collect

    res12: Array[(String, Iterable[Int])] = Array((Spark,CompactBuffer(1, 1)), (Dataneb,CompactBuffer(1, 1, 1)), (Hello,CompactBuffer(1, 1, 1)), (Apache,CompactBuffer(1)))
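
The grouped values can be aggregated further; for instance, summing each Iterable gives a word count. A sketch, continuing the same Spark shell session (the `res` number shown is illustrative):

```scala
// Sum the Iterable of 1s for each key to get per-word counts
scala> keyValue.groupByKey.mapValues(_.sum).collect
res13: Array[(String, Int)] = Array((Spark,2), (Dataneb,3), (Hello,3), (Apache,1))
```

Note that for simple aggregations like this, reduceByKey (`keyValue.reduceByKey(_ + _)`) produces the same counts more efficiently, because it combines values within each partition before shuffling instead of moving every individual value across the network.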




    ©2020 by Data Nebulae