Loading JSON file using Spark (Scala)

Updated: Oct 25, 2019

In this Apache Spark tutorial, we will load a simple JSON file. These days most of the files you come across are in JSON, XML, or flat-file format. The JSON format is easy to read, and it is easy to work with once you understand its structure.


JSON File Structure

Before we ingest a JSON file using Spark, it's important to understand the JSON data structure. JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write, and easy for machines to parse and generate. JSON is built on two structures:

  • A collection of name/value pairs, usually referred to as an object.

  • An ordered list of values, usually referred to as an array.


  1. An object is an unordered set of name/value pairs. An object begins with { (left brace) and ends with } (right brace). Each name is followed by : (colon) and the name/value pairs are separated by , (comma).

  2. An array is an ordered collection of values. An array begins with [ (left bracket) and ends with ] (right bracket). Values are separated by , (comma).

  3. A value can be a string in double quotes, or a number, or true or false or null, or an object or an array. These structures can be nested.
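
To make the three rules above concrete, here is a minimal sketch that models JSON values as plain Scala types and renders them back to text. The type names (JsonValue, JsonObject and so on) and the render function are my own illustration, not part of Spark or of any JSON library, and string escaping is left out to keep it short.

```scala
// Hypothetical, minimal model of JSON values (illustration only, not a library API)
sealed trait JsonValue
case class JsonString(value: String) extends JsonValue
case class JsonNumber(value: BigDecimal) extends JsonValue
case class JsonBoolean(value: Boolean) extends JsonValue
case object JsonNull extends JsonValue
// An object is a set of name/value pairs; a Seq keeps the printed order stable
case class JsonObject(fields: Seq[(String, JsonValue)]) extends JsonValue
// An array is an ordered collection of values
case class JsonArray(items: Seq[JsonValue]) extends JsonValue

object JsonModel {
  // Render a value as JSON text (string escaping is omitted for brevity)
  def render(v: JsonValue): String = v match {
    case JsonString(s)  => "\"" + s + "\""
    case JsonNumber(n)  => n.toString
    case JsonBoolean(b) => b.toString
    case JsonNull       => "null"
    case JsonObject(fs) =>
      fs.map { case (name, value) => "\"" + name + "\": " + render(value) }
        .mkString("{ ", ", ", " }")
    case JsonArray(xs)  => xs.map(render).mkString("[ ", ", ", " ]")
  }

  // One object with two name/value pairs
  val red: JsonValue =
    JsonObject(Seq("color" -> JsonString("red"), "value" -> JsonString("#f00")))

  // An array holding two nested objects
  val colors: JsonValue = JsonArray(Seq(
    red,
    JsonObject(Seq("color" -> JsonString("green"), "value" -> JsonString("#0f0")))))
}
```

Calling JsonModel.render(JsonModel.red) gives back the text form of a single object, and JsonModel.colors shows how objects nest inside an array.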

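One more fact: JSON files can exist in two formats. Most of the time you will encounter multiline JSON files, but note that Spark's JSON reader assumes the single-line format unless told otherwise.

  • Multiline JSON, where a single record, or an array of records, can span multiple lines.

  • Single-line JSON (also known as JSON Lines), where each line holds exactly one record.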


Multiline JSON would look something like this:

[
  { "color": "red", "value": "#f00" },
  { "color": "green", "value": "#0f0" },
  { "color": "blue", "value": "#00f" },
  { "color": "cyan", "value": "#0ff" },
  { "color": "magenta", "value": "#f0f" },
  { "color": "yellow", "value": "#ff0" },
  { "color": "black", "value": "#000" }
]


Single-line JSON would look something like this (try to relate it to the object, array, and value structures explained earlier):

{ "color": "red", "value": "#f00" }

{ "color": "green", "value": "#0f0" }

{ "color": "blue", "value": "#00f" }

{ "color": "cyan", "value": "#0ff" }

{ "color": "magenta", "value": "#f0f" }

{ "color": "yellow", "value": "#ff0" }

{ "color": "black", "value": "#000" }
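
The difference between the two formats matters when reading them with Spark. Below is a small sketch of how each one could be read. It assumes Spark 2.2 or later (where, as far as I know, the multiLine reader option is available) and uses SparkSession; the object name JsonFormats and the relative file paths are placeholders of mine.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object JsonFormats {
  // Default mode: every line of the file is parsed as one JSON record
  def readSingleLine(spark: SparkSession, path: String): DataFrame =
    spark.read.json(path)

  // multiLine mode: the whole file is parsed as one JSON document,
  // so a record (or an array of records) may span several lines
  def readMultiLine(spark: SparkSession, path: String): DataFrame =
    spark.read.option("multiLine", true).json(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("JsonFormats")
      .master("local[*]")
      .getOrCreate()

    // Placeholder paths; point these at your own files
    readSingleLine(spark, "singlelinecolors.json").show()
    readMultiLine(spark, "multilinecolors.json").show()

    spark.stop()
  }
}
```

If a pretty-printed multiline file is read without the multiLine option, Spark typically cannot parse the individual lines and reports them in a _corrupt_record column instead of the expected columns.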

Creating Sample JSON file

I have created two sample files, one multiline and one single-line, containing the records shown above (just copy and paste them).

  • singlelinecolors.json

  • multilinecolors.json
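
If you would rather generate the two files than copy and paste them, the sketch below writes both with plain Scala and java.nio. The file names come from the list above; the object name SampleFiles and the temporary output directory are my own assumptions, so change the directory to wherever you keep test data.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

object SampleFiles {
  // The seven colour records used throughout this post
  val colors: Seq[(String, String)] = Seq(
    "red" -> "#f00", "green" -> "#0f0", "blue" -> "#00f", "cyan" -> "#0ff",
    "magenta" -> "#f0f", "yellow" -> "#ff0", "black" -> "#000")

  private def record(color: String, value: String): String =
    s"""{ "color": "$color", "value": "$value" }"""

  // Single-line format: one record per line
  def singleLine: String =
    colors.map { case (c, v) => record(c, v) }.mkString("", "\n", "\n")

  // Multiline format: one JSON array spread over several lines
  def multiLine: String =
    colors.map { case (c, v) => "  " + record(c, v) }.mkString("[\n", ",\n", "\n]\n")

  // Writes both files into dir and returns their paths
  def write(dir: Path): (Path, Path) = {
    val single = dir.resolve("singlelinecolors.json")
    val multi = dir.resolve("multilinecolors.json")
    Files.write(single, singleLine.getBytes(StandardCharsets.UTF_8))
    Files.write(multi, multiLine.getBytes(StandardCharsets.UTF_8))
    (single, multi)
  }

  def main(args: Array[String]): Unit = {
    // Assumed location: a fresh temporary directory
    val dir = Files.createTempDirectory("jsonsamples")
    val (single, multi) = write(dir)
    println(s"Wrote $single and $multi")
  }
}
```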


Note: I assume that you have installed Scala IDE. If not, please refer to my previous blog posts for installation steps (Windows and Mac users).

1. Create a new Scala project "jsnReader"

  • Go to File → New → Project, enter jsnReader in the project name field, and click Finish.


2. Create a new Scala Package "com.dataneb.spark"

  • Right-click the jsnReader project in the Package Explorer panel → New → Package, enter the name com.dataneb.spark, and click Finish.


3. Create a Scala object "jsonfileReader"

  • Expand the jsnReader project tree and right-click the com.dataneb.spark package → New → Scala Object, enter jsonfileReader as the object name, and click Finish.


4. Add external jar files

  • Right-click the jsnReader project → Properties → Java Build Path → Add External JARs.

  • Now navigate to the path where you have installed Spark. You will find all the jar files under the /spark/jars folder.

After adding these jar files, you will find a Referenced Libraries folder in the left panel of your screen, below the Scala object. You will also find that the project has become invalid (red cross sign); we will fix that shortly.

5. Set up the Scala Compiler

  • Right-click the jsnReader project → Properties → Scala Compiler, check the box Use Project Settings, and select Fixed Scala installation: 2.11.11 (built-in) from the drop-down options.

  • After applying these changes, you will find that the project has become valid again (the red cross sign is gone).


6. Sample code

  • Open jsonfileReader.scala and copy and paste the code written below.

  • I have written a separate blog post explaining the basic terminology used in Spark, such as RDD, SparkContext, SQLContext, and the various transformations and actions. You can go through it for a basic understanding.

I have also explained briefly, in the comments above each line of code, what it actually does. For a list of Spark functions you can refer to this.


// Your package name
package com.dataneb.spark

// Each library has its significance; the comments below note where it is used
import org.apache.spark._
import org.apache.spark.sql._
import org.apache.log4j._

object jsonfileReader {

  // Reducing the log level to just "ERROR" messages
  // It uses library org.apache.log4j._
  // You can apply other logging levels like ALL, DEBUG, ERROR, INFO, FATAL, OFF etc
  Logger.getLogger("org").setLevel(Level.ERROR)

  // Defining Spark configuration to define application name and the local resources to use
  // It uses library org.apache.spark._
  // setMaster("local[*]") runs Spark locally on all available cores
  val conf = new SparkConf().setAppName("Sample App").setMaster("local[*]")

  // Using above configuration to define our SparkContext
  val sc = new SparkContext(conf)

  // Defining SQL context to run Spark SQL
  // It uses library org.apache.spark.sql._
  val sqlContext = new SQLContext(sc)

  // Main function where all operations will occur
  def main(args: Array[String]): Unit = {

    // Reading the json file
    // The multiLine option (Spark 2.2+) is needed when a record spans several lines
    val df = sqlContext.read
      .option("multiLine", true)
      .json("/Volumes/MYLAB/testdata/multilinecolors.json")

    // Printing schema
    df.printSchema()

    // Saving as temporary table (on Spark 1.x use registerTempTable instead)
    df.createOrReplaceTempView("JSONdata")

    // Retrieving all the records
    val data = sqlContext.sql("select * from JSONdata")

    // Showing all the records
    data.show()

    // Stopping Spark Context
    sc.stop()
  }
}

7. Run the code!

  • Right-click anywhere on the screen and select Run As → Scala Application.
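
On Spark 2.0 and later, SparkSession is the unified entry point and can stand in for the separate SparkContext and SQLContext used in step 6. The sketch below is one possible way to write the same program with it. The object name JsonFileReaderSession and the helper selectAll are my own, and the path is the one from the sample code, so adjust it to your own file location (or pass a path as the first program argument).

```scala
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.{DataFrame, SparkSession}

object JsonFileReaderSession {
  // Registers df as a temporary view and selects every row through Spark SQL
  def selectAll(spark: SparkSession, df: DataFrame): DataFrame = {
    df.createOrReplaceTempView("JSONdata")
    spark.sql("select * from JSONdata")
  }

  def main(args: Array[String]): Unit = {
    // Keep the console output down to ERROR messages
    Logger.getLogger("org").setLevel(Level.ERROR)

    // One session object replaces SparkConf + SparkContext + SQLContext
    val spark = SparkSession.builder()
      .appName("Sample App")
      .master("local[*]")
      .getOrCreate()

    // Path taken from the sample code; override it with the first argument
    val path = args.headOption.getOrElse("/Volumes/MYLAB/testdata/multilinecolors.json")

    // multiLine lets one record span several lines (assumes Spark 2.2+)
    val df = spark.read.option("multiLine", true).json(path)

    df.printSchema()
    selectAll(spark, df).show()

    spark.stop()
  }
}
```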