Spark read Text file into Dataframe

Updated: Nov 9, 2019


In this Spark Scala tutorial you will learn how to read data from a text file and from a CSV file into a DataFrame. This blog has two sections:

  1. Spark read Text File

  2. Spark read CSV with schema/header

There are several ways to load a text file in Spark. Refer to the Spark documentation for details.
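As a rough illustration of the loading methods mentioned above, the sketch below wraps three standard entry points: `spark.read.text`, `spark.read.textFile` and `sparkContext.textFile`. The object name `TextLoadSketch` and its method names are hypothetical and only serve this example; the caller is assumed to supply a running `SparkSession` and a file path.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical helper object for illustration; not part of the tutorial's own code.
object TextLoadSketch {

  // DataFrame with a single string column named "value", one row per line.
  def asDataFrame(spark: SparkSession, path: String): DataFrame =
    spark.read.text(path)

  // Dataset[String], one element per line.
  def asDataset(spark: SparkSession, path: String): Dataset[String] =
    spark.read.textFile(path)

  // RDD[String] through the lower-level SparkContext API.
  def asRdd(spark: SparkSession, path: String): RDD[String] =
    spark.sparkContext.textFile(path)
}
```

All three return the file line by line without splitting it into fields; which one to pick mostly depends on whether the next step is written against DataFrames, typed Datasets or RDDs.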

Spark Read Text File

I am loading a text file that is space (" ") delimited. I have chosen this format because, in most practical cases, you will find delimited text files with a fixed number of fields.

Further, I will add a header to the DataFrame and transform it to some extent. I have tried to keep the code as simple as possible so that anyone can understand it. You can change the separator, the names and number of fields, and the data types according to your requirements.

I am using squid logs as sample data for this example. They have date, integer and string fields, which lets us apply data type conversions and play around with Spark SQL. You can find the complete squid log file structure in the squid documentation.

  • No. of fields = 10

  • Separator is a space character

Sample Data

1286536309.586 921 TCP_MISS/200 507 POST - DIRECT/ application/xml

1286536309.608 829 TCP_MISS/200 507 POST - DIRECT/ application/xml

1286536309.660 785 TCP_MISS/200 507 POST - DIRECT/ application/xml

1286536309.684 808 TCP_MISS/200 507 POST - DIRECT/ application/xml

1286536309.775 195 TCP_MISS/200 4120 GET - DIRECT/ image/jpeg

1286536309.795 215 TCP_MISS/200 5331 GET - DIRECT/ image/jpeg

1286536309.815 234 TCP_MISS/200 5261 GET - DIRECT/ image/jpeg
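To make the space-delimited layout concrete, here is a minimal plain-Scala sketch (no Spark needed) that splits one of the sample rows and converts the first two tokens. The object `SquidLineSketch` and its method names are hypothetical. Treating the second token as the elapsed time in milliseconds is an assumption based on the squid native log format. Each row as printed above splits into 8 tokens, fewer than the 10 fields listed earlier, so the sketch does not assume a fixed count.

```scala
// Hypothetical helper for illustration; not part of the tutorial's own code.
object SquidLineSketch {

  // Split one log line on the single-space separator.
  def fields(line: String): Array[String] = line.split(" ")

  // First token: epoch timestamp in seconds with a millisecond fraction.
  def epochSeconds(line: String): Double = fields(line)(0).toDouble

  // Second token: an integer, assumed here to be the elapsed time in milliseconds.
  def elapsedMillis(line: String): Int = fields(line)(1).toInt
}
```

For example, `SquidLineSketch.elapsedMillis("1286536309.586 921 TCP_MISS/200 507 POST - DIRECT/ application/xml")` returns `921`. The same kind of conversion is what the data type changes on the DataFrame columns do later, applied to a whole column instead of a single line.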

Creating Sample Text File

I have created a sample text file, squid.txt, containing the records above (just copy and paste them).

  • Filename: squid.txt

  • Path: /Users/Rajput/Documents/testdata
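If you would rather generate the file than copy and paste it, the following sketch writes the seven sample records to squid.txt in a directory you pass in. The object name `SampleFileSketch` and the `write` method are hypothetical; the directory is a parameter because the path above is specific to my machine.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, Paths}

// Hypothetical helper for illustration; not part of the tutorial's own code.
object SampleFileSketch {

  // The seven sample records shown in the Sample Data section.
  val records: Seq[String] = Seq(
    "1286536309.586 921 TCP_MISS/200 507 POST - DIRECT/ application/xml",
    "1286536309.608 829 TCP_MISS/200 507 POST - DIRECT/ application/xml",
    "1286536309.660 785 TCP_MISS/200 507 POST - DIRECT/ application/xml",
    "1286536309.684 808 TCP_MISS/200 507 POST - DIRECT/ application/xml",
    "1286536309.775 195 TCP_MISS/200 4120 GET - DIRECT/ image/jpeg",
    "1286536309.795 215 TCP_MISS/200 5331 GET - DIRECT/ image/jpeg",
    "1286536309.815 234 TCP_MISS/200 5261 GET - DIRECT/ image/jpeg"
  )

  // Writes squid.txt into dir (creating the directory if needed) and returns its path.
  def write(dir: String): Path = {
    val target = Paths.get(dir, "squid.txt")
    Files.createDirectories(target.getParent)
    val content = records.mkString("", "\n", "\n")
    Files.write(target, content.getBytes(StandardCharsets.UTF_8))
  }
}
```

Calling `SampleFileSketch.write("/Users/Rajput/Documents/testdata")` would reproduce the file at the location used in this post; substitute a directory that exists on your own system.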

Eclipse IDE Setup (for beginners)

Before writing the Spark program, you need to set up a Scala project in the Eclipse IDE. I assume that you have already installed Eclipse; if not, please refer to my previous blogs for installation steps (Windows | Mac users). These steps are the same for other sections such as reading CSV, JSON and JDBC.

1. Create a new Scala project "txtReader"

  • Go to File → New → Project, enter txtReader in the project name field, and click Finish.

2. Create a new Scala Package "com.dataneb.spark"

  • Right click on the txtReader project in the Package Explorer panel → New → Package, enter the name com.dataneb.spark, and click Finish.

3. Create a Scala object "textfileReader"

  • Expand the txtReader project tree, right click on the com.dataneb.spark package → New → Scala Object, enter textfileReader as the object name, and click Finish.

4. Add external jar files (if needed)

  • Right click on the txtReader project → Properties → Java Build Path → Add External JARs.

  • Navigate to the path where you have installed Spark. You will find all the jar files under the /spark/jars folder.

  • Select all the jar files and click Open, then click Apply and Close.
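Once the project, package and object exist, a quick way to confirm that the Spark jars are visible on the build path is to run a placeholder body like the one below. This is only a sketch: the `sparkVersion` helper is hypothetical, and the `local[*]` master is an assumption that suits running from inside the IDE.

```scala
package com.dataneb.spark

import org.apache.spark.sql.SparkSession

// Placeholder body to check the project setup; replace it with the real reader later.
object textfileReader {

  // Hypothetical helper: starts a local session, reads its version, then stops it.
  def sparkVersion(): String = {
    val spark = SparkSession.builder()
      .appName("textfileReader")
      .master("local[*]") // assumption: run Spark inside this JVM
      .getOrCreate()
    try spark.version
    finally spark.stop()
  }

  def main(args: Array[String]): Unit =
    println(s"Spark version: ${sparkVersion()}")
}
```

If the import of `org.apache.spark.sql.SparkSession` does not resolve in Eclipse, the jars from step 4 are most likely missing from the build path.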