How to convert an RDD to a DataFrame?

There are three basic ways to convert an RDD into a DataFrame. I am using spark-shell to demonstrate these examples. Open spark-shell and import the libraries needed to run the code.

Scala> import org.apache.spark.sql.{Row, SparkSession}

Scala> import org.apache.spark.sql.types.{IntegerType, DoubleType, StringType, StructField, StructType}

Now create a sample RDD with the parallelize method.

Scala> val rdd = sc.parallelize(
  Seq(
    ("One", Array(1,1,1,1,1,1,1)),
    ("Two", Array(2,2,2,2,2,2,2)),
    ("Three", Array(3,3,3,3,3,3))
  )
)

Method 1

If you don't need a header (column names), you can pass the RDD directly to the createDataFrame method. The columns then get the default names _1 and _2.

Scala> val df1 = spark.createDataFrame(rdd)
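To see what "no header" means in practice, the following sketch inspects the column names Spark assigns by default. It assumes it is run inside spark-shell, where `spark` and `sc` already exist; the names `sample`, `unnamed`, and `defaultNames` are only for illustration.

```scala
// Assumes spark-shell, where `spark` (SparkSession) and `sc` (SparkContext) are predefined.
val sample = sc.parallelize(Seq(
  ("One", Array(1, 1, 1)),
  ("Two", Array(2, 2, 2))
))

// Without explicit names, the tuple fields become columns _1 and _2.
val unnamed = spark.createDataFrame(sample)
unnamed.printSchema()

// Collect the generated column names for inspection.
val defaultNames = unnamed.columns
```

The element types (a string and an array of integers) are still inferred from the tuple; only the column names are generic.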


Method 2

If you need a header, you can add the column names explicitly by calling the toDF method.

Scala> val df2 = spark.createDataFrame(rdd).toDF("Label", "Values")
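As a possible alternative (not the approach used above), toDF can also be called directly on the RDD once the session's implicits are imported. This is a minimal sketch assuming spark-shell; the name `labelled` is hypothetical.

```scala
// Assumes spark-shell, where `spark` and `sc` are predefined.
// The implicits add toDF to RDDs of tuples and case classes.
import spark.implicits._

val labelled = sc.parallelize(Seq(
  ("One", Array(1, 1, 1)),
  ("Two", Array(2, 2, 2))
)).toDF("Label", "Values")

// Prints the schema with the column names Label and Values.
labelled.printSchema()
```

Both forms should give the same column names; which one to use is mostly a matter of taste.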

Method 3

If you need an explicit schema, you need an RDD of type Row (RDD[Row]). Let's create a new rowsRDD for this scenario.

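Scala> val rowsRDD = sc.parallelize(
  Seq(
    // Illustrative sample values; each Row must match the schema (String, Int, Double)
    Row("One", 1, 1.0),
    Row("Two", 2, 2.0),
    Row("Three", 3, 3.0)
  )
)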

Now create the schema with the field names and types you need.

Scala> val schema = new StructType().
  add(StructField("Label", StringType, true)).
  add(StructField("IntValue", IntegerType, true)).
  add(StructField("FloatValue", DoubleType, true))

Now create the DataFrame from rowsRDD and schema, then show it.

Scala> val df3 = spark.createDataFrame(rowsRDD, schema)

Scala> df3.show()
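One way to check that the schema was really applied is to compare the DataFrame's schema with the one you passed in, and to run a typed filter on a numeric column. The sketch below assumes spark-shell; the row values and the names `exampleRows`, `exampleSchema`, `typed`, and `matches` are made up for illustration.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, IntegerType, StringType, StructField, StructType}

// Assumes spark-shell, where `spark` and `sc` are predefined.
// Illustrative rows: each one matches (String, Int, Double).
val exampleRows = sc.parallelize(Seq(Row("One", 1, 1.0), Row("Two", 2, 2.0)))

val exampleSchema = StructType(Seq(
  StructField("Label", StringType, nullable = true),
  StructField("IntValue", IntegerType, nullable = true),
  StructField("FloatValue", DoubleType, nullable = true)
))

val typed = spark.createDataFrame(exampleRows, exampleSchema)
typed.printSchema()

// Because IntValue is an integer column, numeric comparisons work directly.
val matches = typed.filter("IntValue > 1").count()
```

Note that Spark does not validate the rows against the schema when the DataFrame is created; a mismatch (for example a String where an Int is declared) typically only fails when an action such as show or count runs.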

Thank you, folks! If you have any questions, please mention them in the comments section below.

