Loading JSON file using Spark (Scala)

Updated: Oct 25, 2019


In this Apache Spark tutorial, we will load a simple JSON file. These days most of the files you come across are JSON, XML, or flat files. The JSON format is easy to understand, and you will love it once you are familiar with its structure.

JSON File Structure

Before we ingest a JSON file using Spark, it's important to understand the JSON data structure. JSON (JavaScript Object Notation) is a lightweight data-interchange format that is easy for humans to read and write, and easy for machines to parse and generate. JSON is built on two structures:

  • A collection of name/value pairs, usually referred to as an object.

  • An ordered list of values, usually referred to as an array.

  1. An object is an unordered set of name/value pairs. An object begins with { (left brace) and ends with } (right brace). Each name is followed by : (colon) and the name/value pairs are separated by , (comma).

  2. An array is an ordered collection of values. An array begins with [ (left bracket) and ends with ] (right bracket). Values are separated by , (comma).

  3. A value can be a string in double quotes, or a number, or true or false or null, or an object or an array. These structures can be nested.
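The three rules above can be sketched as a small Scala data model. This is only an illustration: the type names (`JsonValue`, `JsonObject`, and so on) and the `render` helper are hypothetical and not taken from Spark or any JSON library, and string escaping is deliberately left out.

```scala
// A minimal, hypothetical model of the JSON value kinds described above.
sealed trait JsonValue
case class JsonString(value: String) extends JsonValue
case class JsonNumber(value: BigDecimal) extends JsonValue
case class JsonBoolean(value: Boolean) extends JsonValue
case object JsonNull extends JsonValue
// An array is an ordered collection of values.
case class JsonArray(items: List[JsonValue]) extends JsonValue
// An object is a set of name/value pairs; a List is used here only to keep
// the rendered output in a predictable order.
case class JsonObject(fields: List[(String, JsonValue)]) extends JsonValue

object JsonRender {
  // Renders a value as JSON text (no escaping of special characters).
  def render(v: JsonValue): String = v match {
    case JsonString(s)  => "\"" + s + "\""
    case JsonNumber(n)  => n.toString
    case JsonBoolean(b) => b.toString
    case JsonNull       => "null"
    case JsonArray(items) =>
      items.map(render).mkString("[", ", ", "]")
    case JsonObject(fields) =>
      fields
        .map { case (name, value) => "\"" + name + "\": " + render(value) }
        .mkString("{ ", ", ", " }")
  }
}
```

For example, `JsonRender.render(JsonObject(List("color" -> JsonString("red"), "value" -> JsonString("#f00"))))` produces one object with two name/value pairs, and wrapping several such objects in a `JsonArray` shows how the structures nest.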

One more fact: JSON files can exist in two formats. However, most of the time you will encounter multiline JSON files.

  • Multiline JSON, where a single record (or a whole array of records) can span multiple lines.

  • Single line JSON, where each line holds exactly one record.

Multiline JSON would look something like this:

[
  { "color": "red", "value": "#f00" },
  { "color": "green", "value": "#0f0" },
  { "color": "blue", "value": "#00f" },
  { "color": "cyan", "value": "#0ff" },
  { "color": "magenta", "value": "#f0f" },
  { "color": "yellow", "value": "#ff0" },
  { "color": "black", "value": "#000" }
]

Single line JSON would look something like this (try to relate it to the object, array and value structures explained earlier):

{ "color": "red", "value": "#f00" }
{ "color": "green", "value": "#0f0" }
{ "color": "blue", "value": "#00f" }
{ "color": "cyan", "value": "#0ff" }
{ "color": "magenta", "value": "#f0f" }
{ "color": "yellow", "value": "#ff0" }
{ "color": "black", "value": "#000" }
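The difference between the two layouts matters when reading with Spark. By default, `spark.read.json` expects one record per line, while the `multiLine` option tells the reader to parse each file as a whole. Here is a minimal sketch, assuming Spark SQL 2.2 or later is on the classpath. The object name, the `tempJson` helper and the shortened two-record temporary files are illustrative only.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, SparkSession}

object JsonLayoutsSketch {
  // Hypothetical helper: writes the content to a temporary .json file
  // and returns its path.
  private def tempJson(prefix: String, content: String): String = {
    val path = Files.createTempFile(prefix, ".json")
    Files.write(path, content.getBytes(StandardCharsets.UTF_8))
    path.toString
  }

  // Reads a single line file and a multiline file with the same records.
  def readBoth(spark: SparkSession): (DataFrame, DataFrame) = {
    val singleLinePath = tempJson("singleline",
      """{ "color": "red", "value": "#f00" }
        |{ "color": "green", "value": "#0f0" }""".stripMargin)

    val multiLinePath = tempJson("multiline",
      """[
        |  { "color": "red", "value": "#f00" },
        |  { "color": "green", "value": "#0f0" }
        |]""".stripMargin)

    // Default mode: each line is parsed as one JSON record.
    val singleDf = spark.read.json(singleLinePath)
    // multiLine mode: the whole file is parsed as one JSON document;
    // each element of a top-level array becomes a row.
    val multiDf = spark.read.option("multiLine", "true").json(multiLinePath)
    (singleDf, multiDf)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("JsonLayoutsSketch")
      .master("local[*]") // assumption: local run for illustration
      .getOrCreate()

    val (singleDf, multiDf) = readBoth(spark)
    singleDf.show()
    multiDf.show()
    spark.stop()
  }
}
```

Reading a multiline file without the `multiLine` option typically produces a `_corrupt_record` column instead of the expected fields, which is a common first stumbling block.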

Creating Sample JSON file

I have created two sample files, one multiline and one single line, containing the records shown above (just copy and paste them).

  • singlelinecolors.json

  • multilinecolors.json
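If you prefer to generate the two files rather than copy and paste, something like the following would do it. This is a sketch only: the `SampleFiles` object and its members are hypothetical names, and the files are written to whatever directory you pass in (the current directory when run through `main`).

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, Paths}

object SampleFiles {
  // The seven color records used throughout this tutorial.
  val colors: List[(String, String)] = List(
    "red" -> "#f00", "green" -> "#0f0", "blue" -> "#00f", "cyan" -> "#0ff",
    "magenta" -> "#f0f", "yellow" -> "#ff0", "black" -> "#000")

  private def record(color: String, value: String): String =
    s"""{ "color": "$color", "value": "$value" }"""

  // One record per line, no enclosing array.
  val singleLineText: String =
    colors.map { case (c, v) => record(c, v) }.mkString("\n")

  // One JSON array spread over several lines.
  val multiLineText: String =
    colors.map { case (c, v) => "  " + record(c, v) }.mkString("[\n", ",\n", "\n]")

  // Writes both sample files into the given directory.
  def write(dir: Path): Unit = {
    Files.write(dir.resolve("singlelinecolors.json"),
      singleLineText.getBytes(StandardCharsets.UTF_8))
    Files.write(dir.resolve("multilinecolors.json"),
      multiLineText.getBytes(StandardCharsets.UTF_8))
  }

  def main(args: Array[String]): Unit = write(Paths.get("."))
}
```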

Both files hold the same seven color records shown above; only the layout differs.

Note: I assume that you have installed Scala IDE. If not, please refer to my previous blogs for installation steps (Windows & Mac users).

1. Create a new Scala project "jsnReader"

  • Go to File → New → Project, enter jsnReader in the project name field and click Finish.

2. Create a new Scala Package "com.dataneb.spark"

  • Right click on the jsnReader project in the Package Explorer panel → New → Package, enter the name com.dataneb.spark and click Finish.

3. Create a Scala object "jsonfileReader"

  • Expand the jsnReader project tree and right click on the com.dataneb.spark package → New → Scala Object, enter jsonfileReader as the object name and click Finish.