How to convert RDD to DataFrame?

Updated: Oct 25, 2019



There are basically three methods by which we can convert an RDD into a DataFrame. I am using the Spark shell to demonstrate these examples. Open spark-shell and import the libraries that are needed to run the code.


scala> import org.apache.spark.sql.{Row, SparkSession}
scala> import org.apache.spark.sql.types.{IntegerType, DoubleType, StringType, StructField, StructType}



Now, create a sample RDD with the parallelize method.


scala> val rdd = sc.parallelize(
         Seq(
           ("One",   Array(1, 1, 1, 1, 1, 1, 1)),
           ("Two",   Array(2, 2, 2, 2, 2, 2, 2)),
           ("Three", Array(3, 3, 3, 3, 3, 3, 3))
         )
       )
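As a quick optional sanity check before converting, you can count the RDD's elements in the shell; with the sample data above this should return 3.

scala> rdd.count()   // expected to return 3 for the sample RDD above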




Method 1


If you don't need a header (column names), you can create the DataFrame directly by passing the RDD to the createDataFrame method.

scala> val df1 = spark.createDataFrame(rdd)
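Because the RDD holds tuples, Spark assigns default column names (_1 and _2). If you want to confirm this, printSchema on df1 should produce output along these lines (exact nullability flags may vary by Spark version):

scala> df1.printSchema()
root
 |-- _1: string (nullable = true)
 |-- _2: array (nullable = true)
 |    |-- element: integer (containsNull = false)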




Method 2


If you need a header, you can add the column names explicitly by calling the toDF method.

scala> val df2 = spark.createDataFrame(rdd).toDF("Label", "Values")
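To double-check that the header was applied, you can inspect the DataFrame's column names; the call below should return Array(Label, Values).

scala> df2.columns   // should return Array("Label", "Values")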




Method 3


If you need to specify the schema structure yourself, you need an RDD of Row objects. Let's create a new rowsRDD for this scenario.

scala> val rowsRDD = sc.parallelize(
         Seq(
           Row("One", 1, 1.0),
           Row("Two", 2, 2.0),
           Row("Three", 3, 3.0),
           Row("Four", 4, 4.0),