top of page
BlogPageTop

Trending

ADVERTISEMENT

What is SparkContext (Scala)?

Updated: Nov 18, 2022


In this blog you will learn,

  • How to start spark-shell?

  • Understanding Spark-shell.

  • Creating Spark context and spark configuration.

  • Importing SparkContext and SparkConf.

  • Writing simple SparkContext Scala program.

 

Starting Spark-shell


If you haven't installed Apache spark on your machine, refer this (Windows | Mac users) for installation steps. Apache Spark installation is very easy and shouldn't take long.


Open your terminal and type the command spark-shell to start the shell. Same output, like what we did during Spark installation.


$ spark-shell

19/07/27 11:30:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Setting default log level to "WARN".

To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).

Spark context Web UI available at http://19x.xxx.x.x5:4040

Spark context available as 'sc' (master = local[*], app id = local-1564252213176).

Spark session available as 'spark'.

Welcome to

____ __

/ __/__ ___ _____/ /__

_\ \/ _ \/ _ `/ __/ '_/

/___/ .__/\_,_/_/ /_/\_\ version 2.3.1

/_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_171)

Type in expressions to have them evaluated.

Type :help for more information.


 

What is Spark-shell?


Spark shell is an interactive shell through which you can access Spark APIs. Apache Spark has basically three sets of APIs (Application Program Interface) - RDDs, Datasets and DataFrames that allow developers to access the data and run various functions across four different languages - Java, Scala, Python and R. Don't worry, I will explain RDDs, Datasets and DataFrames shortly.



Easy right? But..


I need to explain few facts before we proceed further. Refer the screen shot shown below. We usually ignore the fact that there is lot of information in this output.



1. First line of the Spark output is showing us a warning that it's unable to load native-hadoop library and it will use builtin-java classes where applicable. It's because I haven't installed hadoop libraries (which is fine..), and wherever applicable Spark will use built-in java classes.


Output: 19/07/27 11:30:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Setting default log level to "WARN".


My point here is not the warning, but the WARN log level. Spark has various logging level which you can set while writing the program for example WARN, ALL, DEBUG, ERROR, INFO, FATAL, TRACE, TRACE_INT, OFF. By default Spark logging level is set to "WARN".


2. Next line is telling us how to adjust the logging level from default WARN to a newLevel. We will learn this later, how to run this piece of code sc.setLogLevel(newLevel). Its syntactically little different in various languages Scala, R, Java and Python.


Output: To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


3. Next line is telling us the link for Spark UI, sometimes called as DAG scheduler. You can copy-paste that link in your local browser to open Spark user interface. By default, it runs at port number 4040. It would look like this.


4. Next line is telling us that SparkContext is created as "sc" and by default it's going to use all the local resources in order to execute the program master = local [*] with application id as local-1564252213176.


Output: Spark context available as 'sc' (master = local[*], app id = local-1564252213176).


5. Spark session is created as 'spark'. We will see what is Spark session soon.


6. This line is telling us the Spark version, currently mine is 2.3.1.


7. We all know Java is needed to run Apache Spark, and same we did during installation. We installed Java first and then we installed Apache Spark. Here, the line is telling us the underlying Scala 2.11.8 and Java version 1.8.0_171.


Output: Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_171)


8. You can run :help command for more information. Like this,


Well, it's again a new story and I will write in detail how to use these commands soon. However, I have highlighted few common commands - like how can you see history of your commands and edit it, how you can quit spark-shell.



Initializing Spark


In last section we encountered few terms like Spark context (by default started as "sc") and Spark session (by default started as "spark"). If you run these commands one-by-one you will find the default setup and alphanumeric pointer locations (like @778c2e7c) to these Spark objects.


It will be different on various machines, yours will be different from mine. For instance,


scala> sc

res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@778c2e7c


scala> spark

res1: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@16ccd2bc


 

What is SparkContext?


The first thing you do in Spark program is that you setup Spark context object. Why the first thing? This is because you need to tell Spark engine - How to run and what to run?


It's like before ordering/ or buying a pizza, you need to tell whether you want a veg pizza or a non-veg pizza and the toppings ;).


Spark context performs two major tasks (via Spark configuration - SparkConf ). It's not like these are the only two tasks but these are basic ones.

  • First setMaster, it tells Spark engine how to run i.e. whether it should run in cluster mode (master) or local mode (local). We will see how to setup master i.e. Yarn, Mesos or Kubernetes cluster and standalone local mode shortly.

  • Second setAppName, what to run i.e. the application name.

So, basically Spark context tells Spark engine which application will run in which mode?



How to Setup SparkContext?


In order to define SparkContext, you need to configure it which is done via SparkConf. You need to tell Spark engine the application name and the run mode.


1. For this, we need to import two Spark classes, without these Spark will never understand our inputs.


scala> import org.apache.spark.SparkContext

import org.apache.spark.SparkContext


scala> import org.apache.spark.SparkConf

import org.apache.spark.SparkConf


2. Next, define configuration variable conf, first pass "Sample Application" name via setAppName method and second define the mode with setMaster method. I have setup conf to local mode with all [*] resources.


scala> val conf = new SparkConf().setAppName("Sample Application").setMaster("local[*]")

conf: org.apache.spark.SparkConf = org.apache.spark.SparkConf@c0013b8


You can see location (@c0013b8) of my configuration object. Spark engine can run either in standalone mode or cluster mode at one time, so at any given point of time you will have just one SparkContext. Confused? Wait I will explain soon.


Try to create new SparkContext with above configuration.

scala> new SparkContext(conf)


You will get the error telling - one Spark context is already running.


If you want to update SparkContext you need to stop() the default Spark context i.e. "sc" and re-define the Spark context with new configuration.


I hope you all understood what does it mean when I said one active Spark context.


Here is the complete reference from Apache documentation, what you can pass while setting up setMaster.



Well, instead of doing all of above configuration. You can also change default SparkContext "sc" which we saw earlier. For this you need to pass the inputs with spark-shell command before you start the spark shell.


$ spark-shell --master local[2]

19/07/27 14:33:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Setting default log level to "WARN".

To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).

Spark context Web UI available at http://19x.xxx.0.15:4040

Spark context available as 'sc' (master = local[2], app id = local-1564263216688).

Spark session available as 'spark'.

Welcome to

____ __

/ __/__ ___ _____/ /__

_\ \/ _ \/ _ `/ __/ '_/

/___/ .__/\_,_/_/ /_/\_\ version 2.3.1

/_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_171)

Type in expressions to have them evaluated.

Type :help for more information.


Default setup was to utilize all local[*] cores (refer the output of first spark-shell command at the start of this post), now you can see it has changed to use local[2] cores.


 

Creating SparkContext in Scala IDE example


You can write similar program in Eclipse Scala IDE and run the sample application as follows.

Copy-paste lines from here.


package com.dataneb.spark

import org.apache.spark.SparkContext

import org.apache.spark.SparkConf


object scExample {


val conf = new SparkConf().setAppName("Sample Application").setMaster("local[4]")

val sc = new SparkContext(conf)


def main (args:Array[String]): Unit = {

print("stopping sparkConext \n")

sc.stop()

}

}


Thats all guys! Please comment if you have any question regarding this post in comments section below. Thank you!





Navigation menu

1. Apache Spark and Scala Installation

2. Getting Familiar with Scala IDE

3. Spark data structure basics

4. Spark Shell

5. Reading data files in Spark

6. Writing data files in Spark

7. Spark streaming

Comments


Want to share your thoughts about this blog?

Disclaimer: Please note that the information provided on this website is for general informational purposes only and should not be taken as legal advice. Dataneb is a platform for individuals to share their personal experiences with visa and immigration processes, and their views and opinions may not necessarily reflect those of the website owners or administrators. While we strive to keep the information up-to-date and accurate, we make no representations or warranties of any kind, express or implied, about the completeness, accuracy, reliability, suitability, or availability with respect to the website or the information, products, services, or related graphics contained on the website for any purpose. Any reliance you place on such information is therefore strictly at your own risk. We strongly advise that you consult with a qualified immigration attorney or official government agencies for any specific questions or concerns related to your individual situation. We are not responsible for any losses, damages, or legal disputes arising from the use of information provided on this website. By using this website, you acknowledge and agree to the above disclaimer and Google's Terms of Use (https://policies.google.com/terms) and Privacy Policy (https://policies.google.com/privacy).

RECOMMENDED FROM DATANEB

Struggle2.png

Apache Spark Tutorial Scala: A Beginners Guide to Apach...

Learn Apache Spark: Tutorial for Beginners - This Apache Spark tutorial documentation will introduce you to Apache Spark programming..

Nov 26, 2022

Struggle2.png

SQL Server 2014 Standard Download & Installation

In this post, you will learn how to download and install SQL Server 2014 from an ISO image. I will also download the AdventureWorks...

Apr 08, 2023

Struggle2.png

How to Pull Data from Oracle IDCS (Identity Cloud Servi...

Oracle IDCS has various rest APIs that can be used to pull data and you can utilize it further for data analytics. Let's see how we can...

Mar 24, 2024

bottom of page