Updated: Oct 25, 2019
Main menu: Spark Scala Tutorial
In this blog you will learn,
How to start spark-shell?
Creating Spark context and spark configuration.
Importing SparkContext and SparkConf.
Writing simple SparkContext Scala program.
Open your terminal and type the command spark-shell to start the shell. Same output, like what we did during Spark installation.
19/07/27 11:30:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://19x.xxx.x.x5:4040
Spark context available as 'sc' (master = local[*], app id = local-1564252213176).
Spark session available as 'spark'.
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.3.1
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_171)
Type in expressions to have them evaluated.
Type :help for more information.
What is Spark-shell?
Spark shell is an interactive shell through which you can access Spark APIs. Apache Spark has basically three sets of APIs (Application Program Interface) - RDDs, Datasets and DataFrames that allow developers to access the data and run various functions across four different languages - Java, Scala, Python and R. Don't worry, I will explain RDDs, Datasets and DataFrames shortly.
Easy right? But..
I need to explain few facts before we proceed further. Refer the screen shot shown below. We usually ignore the fact that there is lot of information in this output.
1. First line of the Spark output is showing us a warning that it's unable to load native-hadoop library and it will use builtin-java classes where applicable. It's because I haven't installed hadoop libraries (which is fine..), and wherever applicable Spark will use built-in java classes.
Output: 19/07/27 11:30:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
My point here is not the warning, but the WARN log level. Spark has various logging level which you can set while writing the program for example WARN, ALL, DEBUG, ERROR, INFO, FATAL, TRACE, TRACE_INT, OFF. By default Spark logging level is set to "WARN".
2. Next line is telling us how to adjust the logging level from default WARN to a newLevel. We will learn this later, how to run this piece of code sc.setLogLevel(newLevel). Its syntactically little different in various languages Scala, R, Java and Python.
Output: To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
3. Next line is telling us the link for Spark UI, sometimes called as DAG scheduler. You can copy-paste that link in your local browser to open Spark user interface. By default, it runs at port number 4040. It would look like this.
4. Next line is telling us that SparkContext is created as "sc" and by default it's going to use all the local resources in order to execute the program master = local [*] with application id as local-1564252213176.
Output: Spark context available as 'sc' (master = local[*], app id = local-1564252213176).
5. Spark session is created as 'spark'. We will see what is Spark session soon.
6. This line is telling us the Spark version, currently mine is 2.3.1.
7. We all know Java is needed to run Apache Spark, and same we did during installation. We installed Java first and then we installed Apache Spark. Here, the line is telling us the underlying Scala 2.11.8 and Java version 1.8.0_171.
Output: Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_171)
8. You can run :help command for more information. Like this,
Well, it's again a new story and I will write in detail how to use these commands soon. However, I have highlighted few common commands - like how can you see history of your commands and edit it, how you can quit spark-shell.
In last section we encountered few terms like Spark context (by default started as "sc") and Spark session (by default started as "spark"). If you run these commands one-by-one you will find the default setup and alphanumeric pointer locations (like @778c2e7c) to these Spark objects.
It will be different on various machines, yours will be different from mine. For instance,
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@778c2e7c
res1: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@16ccd2bc
What is SparkContext?
The first thing you do in Spark program is that you setup Spark context object. Why the first thing? This is because you need to tell Spark engine - How to run and what to run?
It's like before ordering/ or buying a pizza, you need to tell whether you want a veg pizza or a non-veg pizza and the toppings ;).
Spark context performs two major tasks (via Spark configuration - SparkConf ). It's not like these are the only two tasks but these are basic ones.
First setMaster, it tells Spark engine how to run i.e. whether it should run in cluster mode (master) or local mode (local). We will see how to setup master i.e. Yarn, Mesos or Kubernetes cluster and standalone local mode shortly.
Second setAppName, what to run i.e. the application name.
So, basically Spark context tells Spark engine which application will run in which mode?
How to Setup SparkContext?
In order to define SparkContext, you need to configure it which is done via SparkConf. You need to tell Spark engine the application name and the run mode.
1. For this, we need to import two Spark classes, without these Spark will never understand our inputs.
scala> import org.apache.spark.SparkContext
scala> import org.apache.spark.SparkConf
2. Next, define configuration variable conf, first pass "Sample Application" name via setAppName method and second define the mode with setMaster method. I have setup conf to local mode with all [*] resources.
scala> val conf = new SparkConf().setAppName("Sample Application").setMaster("local[*]")
conf: org.apache.spark.SparkConf = org.apache.spark.SparkConf@c0013b8
You can see location (@c0013b8) of my configuration object. Spark engine can run either in standalone mode or cluster mode at one time, so at any given point of time you will have just one SparkContext. Confused? Wait I will explain soon.
Try to create new SparkContext with above configuration.
scala> new SparkContext(conf)
You will get the error telling - one Spark context is already running.
If you want to update SparkContext you need to stop() the default Spark context i.e. "sc" and re-define the Spark context with new configuration.
I hope you all understood what does it mean when I said one active Spark context.
Here is the complete reference from Apache documentation, what you can pass while setting up setMaster.