
Loading JSON file using Spark (Scala)

Updated: Oct 25, 2019

Main menu: Spark Scala Tutorial

In this Apache Spark tutorial, we will load a simple JSON file. Nowadays, most of the data you encounter comes as JSON, XML, or flat files. The JSON format is easy to understand, and you will appreciate it once you are familiar with its structure.



JSON File Structure


Before we ingest a JSON file using Spark, it's important to understand the JSON data structure. Basically, JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write, and easy for machines to parse and generate. JSON is built on two structures:

  • A collection of name/value pairs, usually referred to as an object.

  • An ordered list of values. You can think of it as an array or list of values.



  1. An object is an unordered set of name/value pairs. An object begins with { (left brace) and ends with } (right brace). Each name is followed by : (colon) and the name/value pairs are separated by , (comma).

  2. An array is an ordered collection of values. An array begins with [ (left bracket) and ends with ] (right bracket). Values are separated by , (comma).

  3. A value can be a string in double quotes, or a number, or true or false or null, or an object or an array. These structures can be nested, as the small example below shows.
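
To see all three building blocks together, here is a small hypothetical snippet: an object whose "meta" value is another object, which in turn holds an array.

{
  "color": "red",
  "value": "#f00",
  "meta": {
    "primary": true,
    "tags": ["warm", "basic"]
  }
}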


One more fact: JSON files come in two common formats. However, most of the time you will encounter multiline JSON files.

  • Multiline JSON, where a single record (or an array of records) spans multiple lines.

  • Single line JSON (also called JSON Lines), where each line holds exactly one record.



Multiline JSON would look something like this:

[ { "color": "red", "value": "#f00" }, { "color": "green", "value": "#0f0" }, { "color": "blue", "value": "#00f" }, { "color": "cyan", "value": "#0ff" }, { "color": "magenta", "value": "#f0f" }, { "color": "yellow", "value": "#ff0" }, { "color": "black", "value": "#000" } ]



Single line JSON would look something like this (try to correlate it with the object, array, and value structures I explained earlier):


{ "color": "red", "value": "#f00" }

{ "color": "green", "value": "#0f0" }

{ "color": "blue", "value": "#00f" }

{ "color": "cyan", "value": "#0ff" }

{ "color": "magenta", "value": "#f0f" }

{ "color": "yellow", "value": "#ff0" }

{ "color": "black", "value": "#000" }



Creating Sample JSON file


I have created two sample files - a multiline and a single line JSON file - with the records shown above (just copy and paste them).

  • singlelinecolors.json

  • multilinecolors.json




Note: I assume that you have installed Scala IDE; if not, please refer to my previous blogs for installation steps (Windows & Mac users).


1. Create a new Scala project "jsnReader"

  • Go to File → New → Project, enter jsnReader in the project name field, and click Finish.




2. Create a new Scala Package "com.dataneb.spark"

  • Right click on the jsnReader project in the Package Explorer panel → New → Package, enter the name com.dataneb.spark, and click Finish.




3. Create a Scala object "jsonfileReader"

  • Expand the jsnReader project tree, right click on the com.dataneb.spark package → New → Scala Object, enter jsonfileReader as the object name, and click Finish.





4. Add external jar files

  • Right click on the jsnReader project → Properties → Java Build Path → Add External JARs.

  • Now navigate to the path where you installed Spark. You will find all the jar files under the /spark/jars folder.





After adding these jar files, you will find a Referenced Libraries folder created in the left panel of your screen, below the Scala object. You will also notice that the project has become invalid (red cross sign); we will fix it shortly.
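
By the way, if you prefer a build tool over manually adding jars, an sbt build file can pull in the same Spark libraries. This is just a sketch; the Spark version shown is an assumption matching the Scala 2.11 setup used in this tutorial.

// Hypothetical build.sbt - an alternative to adding external jars by hand
name := "jsnReader"
version := "0.1"
scalaVersion := "2.11.11"

// Spark core and Spark SQL; 2.3.0 is an assumption, use the version you installed
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.3.0",
  "org.apache.spark" %% "spark-sql" % "2.3.0"
)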



5. Setup Scala Compiler

  • Now right click on the jsnReader project → Properties → Scala Compiler, check the box Use Project Settings, and select Fixed Scala installation: 2.11.11 (built-in) from the drop-down options.

  • After applying these changes, you will find the project has become valid again (the red cross sign is gone).


6. Sample code

  • Open jsonfileReader.scala and copy-paste the code below.

  • I have written a separate blog explaining the basic terminology used in Spark - RDD, SparkContext, SQLContext, various transformations and actions, etc. You can go through it for a basic understanding.

However, the comments above each line of code briefly explain what it does. For a list of Spark functions, you can refer to this.


// Your package name
package com.dataneb.spark

// Each library has its significance; the comments below explain how each one is used
import org.apache.spark._
import org.apache.spark.sql._
import org.apache.log4j._

object jsonfileReader {

  // Reducing the log output to just "ERROR" messages
  // It uses the library org.apache.log4j._
  // You can apply other logging levels like ALL, DEBUG, ERROR, INFO, FATAL, OFF etc.
  Logger.getLogger("org").setLevel(Level.ERROR)

  // Defining the Spark configuration: the application name and the local resources to use
  // It uses the library org.apache.spark._
  val conf = new SparkConf().setAppName("Sample App")
  conf.setMaster("local")

  // Using the above configuration to define our SparkContext
  val sc = new SparkContext(conf)

  // Defining the SQL context to run Spark SQL
  // It uses the library org.apache.spark.sql._
  val sqlContext = new SQLContext(sc)

  // Main function where all operations will occur
  def main(args: Array[String]): Unit = {

    // Reading the JSON file
    val df = sqlContext.read.json("/Volumes/MYLAB/testdata/multilinecolors.json")

    // Printing the schema
    df.printSchema()

    // Saving the dataframe as a temporary table
    df.registerTempTable("JSONdata")

    // Retrieving all the records
    val data = sqlContext.sql("select * from JSONdata")

    // Showing all the records
    data.show()

    // Stopping the Spark context
    sc.stop()
  }
}


7. Run the code!

  • Right click anywhere in the editor and select Run As → Scala Application.



That's it! If you have followed the steps properly, you will find the result in the Console.
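
For the sample records above, the console output should look roughly like this (formatting may differ slightly depending on your Spark version):

root
 |-- color: string (nullable = true)
 |-- value: string (nullable = true)

+-------+-----+
|  color|value|
+-------+-----+
|    red| #f00|
|  green| #0f0|
|   blue| #00f|
|   cyan| #0ff|
|magenta| #f0f|
| yellow| #ff0|
|  black| #000|
+-------+-----+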



We have successfully loaded a JSON file using Spark SQL dataframes, printed the JSON schema, and displayed the data.


Try reading the single line JSON file we created earlier; Spark reads that format by default. If your JSON records span multiple lines instead, there is a multiline flag you need to set to true to read such files (see the sketch below). Also, you can save this data to HDFS, a database, or a CSV file, depending upon your need. If you have any questions, please don't forget to write them in the comments section below. Thank you.
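
As a quick sketch of both ideas, here is how the multiline option and a CSV save could look. This assumes Spark 2.2+ with a SparkSession (the code above uses the older SQLContext API instead), and the output path is hypothetical.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("Sample App").master("local").getOrCreate()

// The multiline option tells Spark that JSON records span multiple lines
val df = spark.read.option("multiline", "true").json("/Volumes/MYLAB/testdata/multilinecolors.json")

// Save the same data as CSV (hypothetical output path)
df.write.option("header", "true").csv("/Volumes/MYLAB/testdata/colors_csv")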



Next: How to convert RDD to dataframe?


