Spark Tutorial with Scala
Spark Scala Tutorial for beginners - This Spark tutorial will introduce you to Spark programming in Scala. You will learn about Spark Scala programming, Spark-shell, Spark dataframes, RDDs, Spark SQL and Spark Streaming with examples, and finally prepare for Spark Scala interview questions and answers.
Spark Tutorial with Scala:
Apache Spark and Scala Installation
Getting Familiar with Scala
Spark Transformations & Actions
Reading data files in Spark
Writing data files in Spark
I first heard of Apache Spark in late 2014 when I became interested in Big Data Analytics. I came from a SQL background and it was very difficult for me to change my style of thinking from “SQL programming” to Scala programming or to any other object oriented programming language.
I have written these Spark tutorials with Scala for those who have no Scala programming background. So, if you have no prior knowledge of Scala and zero understanding of Apache Spark, you are at the right place. I have written these tutorials in such a way that anyone can understand them easily.
I get so many emails from you guys that it encourages me every day to write new blogs. Thank you guys, and I am really happy that these blogs are helping you.
For new folks!
You can go through these blogs in any order, but if you are a complete beginner (with zero prior knowledge) I would encourage you to start from the top and go through each post/link all the way down to the bottom. It would take approximately a week to finish the content with “seriousness”, and after that I am confident that you will be able to write Spark programs.
Don’t just read…
Run the commands/code on your machine, don’t just read the blog. Otherwise, you will forget everything after some time.
You will not become an “expert” in a week, no one can! But you will be heading in the right direction to become an expert. Just keep reading new content.
What you need before you start..
Zero prior knowledge and a laptop, Windows or Mac.
For any question, confusion, jokes..
Just scroll down to the bottom of each blog and write down your question in the comments section. Even if it’s a silly question, no issues, I will answer. You can email me as well, just follow the link in the top menu.
Before I begin..
New blogs are often added here, so bookmark this page and keep visiting it, or sign up for email notifications. Don’t worry, I will not spam your inbox with useless emails. Also, there are a few links which are still not active because I am working on those.
Apache Spark Overview
Apache Spark is a distributed data processing engine.
In simple terms, suppose you have 100 terabytes of data (let’s say a huge collection of stories, plays, poems etc.) and you want to find the most frequent word. Obviously, it will take a good amount of time to read all the texts, split the strings, group, count and order the words. In the SQL world, think of something like:
select words, count(1)
from all_texts -- hypothetical table with one word per row
group by words
order by 2 desc; -- over 100 TB of data
I am pretty sure a query like this will either not complete at all or take a good amount of time to give you the result.
Now suppose you have divided this data into 100 equal sets so that each set has 1 TB of data, and you run the same query on each set. Obviously it will be faster, right? Assume each set of data has its own memory and processor, so the sets can be worked on independently to get partial results.
Or what about 1000 different sets with 100 GB of data each? Obviously, you will get the result even faster. Well, there are many other factors like network bandwidth, disk speed and memory which are needed to calculate execution time, but keep them aside for a moment.
Also, don’t worry about who will distribute the work (map), combine the results (reduce), manage partitions, move data around etc. All of this will be taken care of by Apache Spark. That’s why it’s called a processing engine and not a programming language. Apache Spark supports four programming languages: Java, Scala, Python and R, i.e. you can tell the Spark engine to perform a task in any of these four languages.
Also, Apache Spark has built-in libraries: Spark SQL, MLlib (machine learning), Spark Streaming and GraphX. You will understand these soon.
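To make the "divide the data, count each part, combine the results" idea from the paragraphs above a bit more concrete, here is a minimal sketch in plain Scala. It does not use Spark at all and only imitates the idea on a tiny in-memory collection. The names PartitionedWordCount, countPartition, merge and wordCount are made up for this illustration.

```scala
// Plain Scala sketch (no Spark needed): count words by splitting the input
// into partitions, counting each partition on its own, and then merging
// the partial results. All names here are hypothetical.
object PartitionedWordCount {

  // "Map" step: count the words inside a single partition.
  def countPartition(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.toLowerCase.split("\\s+"))
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }

  // "Reduce" step: merge two partial counts into one.
  def merge(a: Map[String, Int], b: Map[String, Int]): Map[String, Int] =
    b.foldLeft(a) { case (acc, (word, n)) =>
      acc.updated(word, acc.getOrElse(word, 0) + n)
    }

  // Split the lines into roughly equal partitions, count each, then combine.
  def wordCount(lines: Seq[String], numPartitions: Int): Seq[(String, Int)] = {
    require(numPartitions > 0, "numPartitions must be positive")
    val partitionSize =
      math.max(1, math.ceil(lines.size.toDouble / numPartitions).toInt)
    val partitions = lines.grouped(partitionSize).toList
    val partials = partitions.map(countPartition) // independent work per partition
    partials
      .foldLeft(Map.empty[String, Int])(merge)
      .toSeq
      .sortBy { case (word, n) => (-n, word) } // most frequent first, ties by word
  }

  def main(args: Array[String]): Unit = {
    val lines = Seq("to be or not to be", "that is the question", "to sleep")
    // Prints (to,3), (be,2), (is,1), one per line
    wordCount(lines, numPartitions = 3).take(3).foreach(println)
  }
}
```

In a real cluster each partition would live on a different machine and Spark would handle the splitting, scheduling and merging for you. The sketch runs everything in one process just to show the shape of the computation.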
What is Apache Spark used for?
Apache Spark is used for real time data processing with Spark Streaming.
Apache Spark is used to process petabytes of data (with the data distributed over a cluster with thousands of nodes).
Apache Spark is used to implement Extract, Transform, Load (ETL) processes.
Apache Spark is used to implement machine learning algorithms and create real time interactive dashboards for analytics.
Apache Spark Installation
I have written two separate blogs for Apache Spark installation, one for Mac and another for Windows. These steps might become outdated with new Apache Spark versions, so comment if you find any issue and I will help you fix it.
Spark installation cont. for Windows (pyspark)
Spark and Scala Basics
Once you have installed Spark on your machine, you are good to proceed with Scala basics. In this section you will learn the basics of Scala, which will be just enough to write Spark programs. You will also understand basic Apache Spark terminology like Spark-shell, SparkContext, SparkSession and SparkConf. I encourage you all to run the commands and examples side by side.
Spark RDD, Transformations and Actions
Once you have a basic understanding of the Spark engine and Scala, you will learn about various Spark transformations and actions, RDDs, Dataframes etc. You will understand the flow of a program in Spark.
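As a small preview of what "transformations" and "actions" mean, here is a minimal sketch, assuming Spark is on your classpath and you run it locally. The object name TransformationsAndActions, the helper countWords and the app name are made up for this illustration. In spark-shell you would not build the session yourself, because spark and sc are already created for you.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

// Hypothetical example object, not taken from the linked posts.
object TransformationsAndActions {

  def countWords(sc: SparkContext, lines: Seq[String]): Map[String, Int] = {
    // Turn a local collection into an RDD spread over partitions.
    val rdd = sc.parallelize(lines)

    // Transformations: lazy, they only describe the work to be done.
    val counts = rdd
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Action: triggers the computation and brings the result to the driver.
    counts.collect().toMap
  }

  def main(args: Array[String]): Unit = {
    // "local[*]" runs Spark inside this JVM using all available cores.
    val spark = SparkSession.builder()
      .appName("transformations-and-actions") // hypothetical app name
      .master("local[*]")
      .getOrCreate()

    val result = countWords(spark.sparkContext, Seq("to be or not to be"))
    println(result("to")) // prints 2

    spark.stop()
  }
}
```

Nothing is computed until collect() is called. Everything before it just builds up a plan, which is the "flow of a program" idea this section is about.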
Reading and Writing data files in Spark
In this section, you will start writing basic Spark programs. Again, if you find some difficulty in understanding a program, comment. However, I have tried my best to keep them easy and clear.
Spark Dataframe example
Spark SQL example
It will give you a clearer picture of what a dataframe is and how you can work with SQL in Spark.
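Here is a minimal sketch of the same word count written twice, once with the DataFrame API and once as a SQL query over a temporary view. It assumes Spark is on your classpath and runs locally. The object name DataFrameAndSql, the helpers viaApi and viaSql, the view name words and the column alias cnt are all made up for this illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical example object, not taken from the linked posts.
object DataFrameAndSql {

  // Word counts through the DataFrame API.
  def viaApi(words: DataFrame): DataFrame =
    words
      .groupBy("word")
      .count() // adds a column named "count"
      .orderBy(col("count").desc, col("word").asc)

  // The same counts through Spark SQL on a temporary view.
  def viaSql(spark: SparkSession, words: DataFrame): DataFrame = {
    words.createOrReplaceTempView("words")
    spark.sql(
      "select word, count(1) as cnt from words group by word order by cnt desc, word")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataframe-and-sql") // hypothetical app name
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._ // enables toDF on local collections

    // A one-column DataFrame with one word per row.
    val words = Seq("to", "be", "or", "not", "to", "be").toDF("word")

    viaApi(words).show()
    viaSql(spark, words).show()

    spark.stop()
  }
}
```

Both versions should list the same words with the same counts, so you can use whichever style feels closer to how you think. If you come from SQL, the second one will probably look familiar.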
Spark Streaming example
In this section you will understand how streaming works in the Spark world. You have to run these programs on your machine to understand how things are working. If you are just reading them, it’s useless.
Spark Big Data
Just a high level walk through of how Spark is used in a production environment.