How to write a single CSV file using Spark?

Updated: Sep 12, 2020


By default, Apache Spark writes CSV output as multiple part-* files inside a directory. The reason is simple: each partition is saved individually, so one file is created per partition. Apache Spark is built for distributed processing, so multiple files are expected. However, you can work around this in several ways. In previous posts we only read data files (flat file, JSON) and created RDDs and dataframes using Spark SQL, but we haven't written a file back to disk or any other storage system. In this Apache Spark tutorial you will learn how to write files back to disk.
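To make the link between partitions and part files concrete, here is a minimal sketch of one possible workaround. It assumes Spark 2.0 or later (for SparkSession and the built-in CSV writer) and local mode. The object name SingleCsvSketch and the paths out_multi, out_single and single.csv are hypothetical names chosen for illustration, and the final rename step through the Hadoop FileSystem API is just one way to end up with a plainly named file.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

// Hypothetical demo object; all names and paths below are illustrative assumptions
object SingleCsvSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("singleCsvSketch")
      .master("local")
      .getOrCreate()

    // Toy dataframe with one column "id", spread over 4 partitions
    val df = spark.range(0, 100, 1, 4).toDF("id")

    // Writes a directory that typically holds one part-* file per partition
    df.write.mode("overwrite").option("header", "true").csv("out_multi")

    // Collapse to one partition first, so the directory holds a single part-* file
    df.coalesce(1).write.mode("overwrite").option("header", "true").csv("out_single")

    // Optionally move that part file to a plain file name
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    val partFile = fs.globStatus(new Path("out_single/part-*.csv"))(0).getPath
    val target = new Path("single.csv")
    if (fs.exists(target)) fs.delete(target, false)
    fs.rename(partFile, target)

    spark.stop()
  }
}
```

Either way, all rows pass through a single task, so this approach is only suitable when the output is small enough for one executor to handle.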



For this blog, I am creating a Scala object, textfileWriter, in the same project (txtReader folder) where we created textfileReader.




 

Source File


I am using the same source file, squid.txt (with duplicate records), which I created in the previous blog. In a practical scenario, however, the source could be anything: a relational database, an HDFS file system, a message queue, etc. In practice, you would never simply read a file and write the same data back out; this is just for demo purposes.


1286536309.586 921 192.168.0.68 TCP_MISS/200 507 POST http://rcv-srv37.inplay.tubemogul.co...eiver/services - DIRECT/174.129.41.128 application/xml

1286536309.608 829 192.168.0.68 TCP_MISS/200 507 POST http://rcv-srv37.inplay.tubemogul.co...eiver/services - DIRECT/174.129.41.128 application/xml

1286536309.660 785 192.168.0.68 TCP_MISS/200 507 POST http://rcv-srv37.inplay.tubemogul.co...eiver/services - DIRECT/174.129.41.128 application/xml

1286536309.684 808 192.168.0.68 TCP_MISS/200 507 POST http://rcv-srv37.inplay.tubemogul.co...eiver/services - DIRECT/174.129.41.128 application/xml

1286536309.775 195 192.168.0.227 TCP_MISS/200 4120 GET http://i4.ytimg.com/vi/gTHZnIAzmdY/default.jpg - DIRECT/209.85.153.118 image/jpeg

1286536309.795 215 192.168.0.227 TCP_MISS/200 5331 GET http://i2.ytimg.com/vi/-jBxVLD4fzg/default.jpg - DIRECT/209.85.153.118 image/jpeg

1286536309.815 234 192.168.0.227 TCP_MISS/200 5261 GET http://i1.ytimg.com/vi/dCjp28ps4qY/default.jpg - DIRECT/209.85.153.118 image/jpeg
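Each record above has ten space-separated fields. As a small illustration that needs no Spark, the sketch below pairs the fields of one sample line with the column names used for the dataframe header in the sample code of this post. The object name SquidLineSketch and the helper parse are hypothetical, and splitting on whitespace assumes that no field contains a space.

```scala
// Hypothetical helper for illustration; not part of the tutorial's project
object SquidLineSketch {
  // Column names for the ten space-separated fields
  val header: Array[String] =
    "time duration client_add result_code bytes req_method url user hierarchy_code type".split(" ")

  // Pairs each field with its column name; returns None when the field count is not ten
  def parse(line: String): Option[Map[String, String]] = {
    val fields = line.trim.split("\\s+")
    if (fields.length == header.length) Some(header.zip(fields).toMap) else None
  }

  def main(args: Array[String]): Unit = {
    val sample = "1286536309.775 195 192.168.0.227 TCP_MISS/200 4120 GET " +
      "http://i4.ytimg.com/vi/gTHZnIAzmdY/default.jpg - DIRECT/209.85.153.118 image/jpeg"
    // Print every column name next to its value for the sample record
    parse(sample).foreach(row => header.foreach(name => println(s"$name = ${row(name)}")))
  }
}
```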



Sample Code

  • Open textfileWriter.scala and copy-paste the code written below.

  • I have written separate blogs to explain the basic terminology used in Spark, such as RDD, SparkContext, SQLContext, and the various transformations and actions. You can go through these for a basic understanding:

  1. Spark shell, Spark context and configuration

  2. Spark RDD, Transformations and Actions


However, I have explained a little in the comments above each line of code what it actually does. For a list of Spark functions, you can refer to this.


You can make this code much simpler, but my aim is to teach as well. Hence I have intentionally introduced the header structure, SQL context, string RDD, etc. However, if you are familiar with these, you can just focus on the part that writes the dataframe (the last statement in the code).
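As a rough idea of what a simpler version could look like, here is a sketch that assumes Spark 2.0 or later and lets the built-in CSV reader split the lines. It also assumes every line has exactly ten fields separated by single spaces; toDF fails if the column count differs. The object name SimpleSquidWriter and the optional command-line argument for the input path are additions made for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical simplified variant; assumes Spark 2.0+ and ten single-space-separated fields
object SimpleSquidWriter {
  def main(args: Array[String]): Unit = {
    // Input path from this post; pass another path as the first argument if needed
    val input = args.headOption.getOrElse("/Users/Rajput/Documents/testdata/squid.txt")

    val spark = SparkSession.builder()
      .appName("simpleSquidWriter")
      .master("local")
      .getOrCreate()

    // Read the log as space-delimited text (all columns as strings) and name the columns
    val squidDF = spark.read
      .option("delimiter", " ")
      .csv(input)
      .toDF("time", "duration", "client_add", "result_code", "bytes",
        "req_method", "url", "user", "hierarchy_code", "type")

    // One partition, so the output directory holds a single part-* file
    squidDF.coalesce(1)
      .write
      .mode("overwrite")
      .option("header", "true")
      .csv("targetfile.csv")

    spark.stop()
  }
}
```

Here coalesce(1) is used instead of repartition(1) because it avoids a full shuffle; for a small file like this one the difference is negligible.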


 

package com.dataneb.spark

// Each library has its significance; the comments below explain how each one is used
import org.apache.spark._
import org.apache.spark.sql._
import org.apache.log4j._
import org.apache.spark.sql.types.{StructType, StructField, StringType}
import org.apache.spark.sql.Row

object textfileWriter {

  // Reducing the log level to just "ERROR" messages
  // It uses library org.apache.log4j._
  // You can apply other logging levels like ALL, DEBUG, ERROR, INFO, FATAL, OFF etc
  Logger.getLogger("org").setLevel(Level.ERROR)

  // Defining Spark configuration to set the application name and the local resources to use
  // It uses library org.apache.spark._
  val conf = new SparkConf().setAppName("textfileWriter")
  conf.setMaster("local")

  // Using above configuration to define our SparkContext
  val sc = new SparkContext(conf)

  // Defining SQL context to run Spark SQL
  // It uses library org.apache.spark.sql._
  val sqlContext = new SQLContext(sc)

  // Main function where all operations will occur
  def main(args: Array[String]): Unit = {

    // Reading the text file
    val squidString = sc.textFile("/Users/Rajput/Documents/testdata/squid.txt")

    // Defining the data-frame header structure
    val squidHeader = "time duration client_add result_code bytes req_method url user hierarchy_code type"

    // Defining schema from header which we defined above
    // It uses library org.apache.spark.sql.types.{StructType, StructField, StringType}
    val schema = StructType(squidHeader.split(" ").map(fieldName => StructField(fieldName, StringType, true)))

    // Converting String RDD to Row RDD for 10 attributes
    val rowRDD = squidString.map(_.split(" ")).map(x => Row(x(0), x(1), x(2), x(3), x(4), x(5), x(6), x(7), x(8), x(9)))

    // Creating dataframe based on Row RDD and schema
    val squidDF = sqlContext.createDataFrame(rowRDD, schema)

    // Writing dataframe with overwrite mode, header and a single partition.
    // Spark creates a directory named targetfile.csv that contains one part-* file.
    // On Spark 2.0+, .format("csv") selects the built-in CSV writer.
    squidDF
      .repartition(1)
      .write
      .mode("overwrite")
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .save("targetfile.csv")