Apache Spark Interview Questions

Updated: Oct 25, 2019

This post includes Big Data Spark interview questions and answers for both beginners and experienced candidates. If you are a beginner, don't worry: the answers are explained in detail. These are frequently asked Data Engineer interview questions that will help you crack a big data job interview.

What is Apache Spark?

According to the Spark documentation, Apache Spark is a fast and general-purpose in-memory cluster computing system.

  • It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs.

  • It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

In simple terms, Spark is a distributed data processing engine that supports programming languages such as Java, Scala, Python and R. On top of its core engine, Spark ships four built-in libraries: Spark SQL, MLlib (machine learning), Spark Streaming and GraphX.

What is Apache Spark used for?

  • Processing data in real time.

  • Implementing Extract, Transform, Load (ETL) processes.

  • Implementing machine learning algorithms and creating interactive dashboards for data analytics.

  • Processing petabytes of data distributed over a cluster with thousands of nodes. Note that Spark is a processing engine, not a storage system: the data itself lives in an external store such as HDFS.

How does Apache Spark work?

Spark uses a master-slave architecture to distribute work across worker nodes and process data in parallel. Much like MapReduce, Spark has a central coordinator, called the driver, while the executors run on the worker nodes. The driver splits a job into tasks and communicates with the executors, which process the data.
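The driver and executor roles described above can be sketched in plain Python. This is only a toy model under stated assumptions, not Spark code: a thread pool stands in for the executors, and the names split_into_partitions, executor_task and run_job are hypothetical helpers invented for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Driver side: cut the input into contiguous, roughly equal chunks."""
    size, remainder = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `remainder` partitions take one extra element each.
        end = start + size + (1 if i < remainder else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def executor_task(partition):
    """Executor side: process one partition independently (sum of squares)."""
    return sum(x * x for x in partition)


def run_job(data, num_executors=4):
    """Driver: submit one task per partition, then combine the partial results."""
    partitions = split_into_partitions(data, num_executors)
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        partial_results = list(pool.map(executor_task, partitions))
    return sum(partial_results)


print(run_job(list(range(1, 11))))  # sum of squares of 1..10
```

In a real cluster the executors are separate processes on other machines, and the driver also handles scheduling and failure recovery, which this sketch leaves out.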

Why is Spark faster than Hadoop MapReduce?

One drawback of Hadoop MapReduce is that it writes intermediate results to disk after each map and reduce stage, and a chain of jobs hands data from one job to the next through HDFS. This is very expensive because it consumes a lot of disk I/O and network I/O. Spark, in contrast, has two kinds of operations: transformations and actions. Transformations are lazy: Spark only records them and does not read or compute any data until an action is called, at which point it can pipeline the transformations in memory instead of writing every intermediate result to disk. This reduces disk I/O and network I/O. Another innovation is in-memory caching: you can instruct Spark to keep a dataset in memory so that the program does not have to read it from disk again, which reduces disk I/O further.
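Lazy transformations, actions and caching can be illustrated with a small plain-Python model. This is a sketch under stated assumptions, not Spark's API: LazyDataset, read_input and the reads counter are hypothetical names made up for the example, and only the method names map, filter, cache, collect and count are chosen to mirror the Spark operations they imitate.

```python
class LazyDataset:
    """Toy dataset: transformations are recorded, actions trigger the work."""

    def __init__(self, compute):
        self._compute = compute    # zero-argument callable returning an iterator
        self._use_cache = False
        self._cached = None

    def _iterate(self):
        if self._use_cache:
            if self._cached is None:
                # First use: materialize once and keep the result in memory.
                self._cached = list(self._compute())
            return iter(self._cached)
        return self._compute()

    # Transformations: build a new dataset, compute nothing yet.
    def map(self, fn):
        return LazyDataset(lambda: (fn(x) for x in self._iterate()))

    def filter(self, predicate):
        return LazyDataset(lambda: (x for x in self._iterate() if predicate(x)))

    def cache(self):
        self._use_cache = True
        return self

    # Actions: actually pull the data through the recorded steps.
    def collect(self):
        return list(self._iterate())

    def count(self):
        return sum(1 for _ in self._iterate())


reads = {"count": 0}  # how many times the "disk" was read


def read_input():
    reads["count"] += 1
    return iter(range(1, 7))


base = LazyDataset(read_input).cache()
even_squares = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
reads_before_action = reads["count"]   # still 0: nothing has run yet

result = even_squares.collect()        # action: reads the input once
total = base.count()                   # served from the cache, no second read
print(reads_before_action, result, total, reads["count"])
```

The point of the design is that the source is read only when an action runs, and only once when the dataset is cached; without the cache() call, each action would read the input again.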

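Is Hadoop required for Spark?

No, the Hadoop file system (HDFS) is not required for Spark. Spark can run in standalone mode and read from other storage systems. However, where Hadoop is available, Spark can use HDFS for storage and YARN for cluster resource management.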

Is Spark built on top of Hadoop?

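No. Spark is a separate project with its own execution engine and does not need a Hadoop cluster to run, although it can integrate with Hadoop components such as HDFS and YARN.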

What is Spark API?

Apache Spark has three main sets of APIs (Application Programming Interfaces): RDDs, DataFrames and Datasets. They allow developers to access data and run various functions on it from four languages: Java, Scala, Python and R. Note that the typed Dataset API is available only in Scala and Java.

What is Spark RDD?

A Resilient Distributed Dataset (RDD) is an immutable collection of elements that serves as the fundamental data structure in Apache Spark.

  • An RDD's data is logically partitioned across the nodes of your cluster (potentially thousands of them), so the partitions can be accessed and computed in parallel.

  • The RDD has been the primary Spark API since the project's inception.
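The two properties mentioned above, immutability and partitioning, can be shown with a small plain-Python model. This is a teaching sketch only, not Spark's RDD class: ToyRDD and its methods from_list and partitions are hypothetical names invented here, and everything runs in a single process.

```python
class ToyRDD:
    """Immutable, partitioned collection (a teaching model, not Spark's RDD)."""

    def __init__(self, partitions):
        # Tuples make the partition list and each partition read-only.
        self._partitions = tuple(tuple(p) for p in partitions)

    @classmethod
    def from_list(cls, data, num_partitions):
        data = list(data)
        n = len(data)
        # Evenly spaced cut points give exactly num_partitions contiguous chunks.
        bounds = [n * i // num_partitions for i in range(num_partitions + 1)]
        return cls(data[bounds[i]:bounds[i + 1]] for i in range(num_partitions))

    def map(self, fn):
        # A transformation returns a NEW ToyRDD; the original is left untouched.
        return ToyRDD([fn(x) for x in part] for part in self._partitions)

    def partitions(self):
        return [list(part) for part in self._partitions]

    def collect(self):
        return [x for part in self._partitions for x in part]


numbers = ToyRDD.from_list(range(1, 8), num_partitions=3)
doubled = numbers.map(lambda x: x * 2)

print(numbers.partitions())  # how the seven elements were split into three chunks
print(numbers.collect())     # the original data, unchanged by map
print(doubled.collect())     # the new collection produced by map
```

Because each partition is processed independently inside map, a real engine can hand different partitions to different nodes, which is what makes parallel computation over an RDD possible.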