Updated: Nov 9, 2019
In this Spark Scala tutorial you will learn how to read data from a text file and from a CSV file into a DataFrame. This post has two sections:
Spark read Text File
Spark read CSV with schema/header
There are various methods to load a text file in Spark. You can refer to the Spark documentation for the full list.
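As a rough illustration of those methods, the sketch below loads the same file three ways: as an RDD, as a DataFrame and as a Dataset of strings. The object name TextLoadSketch, the default path sample.txt and the local[*] master are assumptions made for this example only, so adjust them to your environment.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical helper object, not part of Spark itself.
object TextLoadSketch {

  // RDD API: one String element per line of the file.
  def asRdd(spark: SparkSession, path: String): RDD[String] =
    spark.sparkContext.textFile(path)

  // DataFrame API: one row per line, in a single string column named "value".
  def asDataFrame(spark: SparkSession, path: String): DataFrame =
    spark.read.text(path)

  // Dataset API: one String element per line.
  def asDataset(spark: SparkSession, path: String): Dataset[String] =
    spark.read.textFile(path)

  def main(args: Array[String]): Unit = {
    // Assumption: the file path is passed as the first argument.
    val path = args.headOption.getOrElse("sample.txt")

    val spark = SparkSession.builder()
      .appName("TextLoadSketch")
      .master("local[*]") // assumption: local run for experimentation
      .getOrCreate()

    try {
      println(s"Lines read via RDD: ${asRdd(spark, path).count()}")
      asDataFrame(spark, path).printSchema()
      asDataset(spark, path).show(5, truncate = false)
    } finally {
      spark.stop()
    }
  }
}
```

All three calls return the raw lines without splitting them into fields, so any delimiter handling still has to be applied afterwards.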
Spark Read Text File
I am loading a text file that is space (" ") delimited. I chose this format because, in most practical cases, you will find delimited text files with a fixed number of fields.
Further, I will add a header to the DataFrame and transform it to some extent. I have tried to keep the code as simple as possible so that anyone can understand it. You can change the separator, the names and number of fields, and the data types according to your requirements.
I am using squid logs as sample data for this example. They contain date, integer and string fields, which lets us apply data type conversions and experiment with Spark SQL. The complete squid log file structure is described in the Squid documentation.
No. of fields = 10
Separator is a space character
1286536309.586 921 192.168.0.68 TCP_MISS/200 507 POST http://rcv-srv37.inplay.tubemogul.co...eiver/services - DIRECT/220.127.116.11 application/xml
1286536309.608 829 192.168.0.68 TCP_MISS/200 507 POST http://rcv-srv37.inplay.tubemogul.co...eiver/services - DIRECT/18.104.22.168 application/xml
1286536309.660 785 192.168.0.68 TCP_MISS/200 507 POST http://rcv-srv37.inplay.tubemogul.co...eiver/services - DIRECT/22.214.171.124 application/xml
1286536309.684 808 192.168.0.68 TCP_MISS/200 507 POST http://rcv-srv37.inplay.tubemogul.co...eiver/services - DIRECT/126.96.36.199 application/xml
1286536309.775 195 192.168.0.227 TCP_MISS/200 4120 GET http://i4.ytimg.com/vi/gTHZnIAzmdY/default.jpg - DIRECT/188.8.131.52 image/jpeg
1286536309.795 215 192.168.0.227 TCP_MISS/200 5331 GET http://i2.ytimg.com/vi/-jBxVLD4fzg/default.jpg - DIRECT/184.108.40.206 image/jpeg
1286536309.815 234 192.168.0.227 TCP_MISS/200 5261 GET http://i1.ytimg.com/vi/dCjp28ps4qY/default.jpg - DIRECT/220.127.116.11 image/jpeg
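To make the record layout concrete, here is a minimal plain-Scala sketch (no Spark required) that splits one of the sample lines above on a single space, checks that it has 10 fields and converts the numeric ones. The case class SquidRecord and its field names are hypothetical labels chosen for illustration; they do not come from the sample file.

```scala
import java.time.Instant
import scala.util.Try

object SquidLineSketch {

  // Hypothetical field names; only the field count and order come from the sample data.
  final case class SquidRecord(
    time: Instant,
    elapsed: Int,
    clientIp: String,
    resultCode: String,
    bytes: Long,
    method: String,
    url: String,
    user: String,
    hierarchy: String,
    contentType: String
  )

  // Returns None when the line does not have exactly 10 fields
  // or when a numeric field cannot be converted.
  def parse(line: String): Option[SquidRecord] = {
    val f = line.trim.split(" ")
    if (f.length != 10) None
    else Try {
      SquidRecord(
        // The first field is epoch seconds with a millisecond fraction.
        Instant.ofEpochMilli((BigDecimal(f(0)) * 1000).toLong),
        f(1).toInt, f(2), f(3), f(4).toLong,
        f(5), f(6), f(7), f(8), f(9)
      )
    }.toOption
  }

  def main(args: Array[String]): Unit = {
    val sample =
      "1286536309.775 195 192.168.0.227 TCP_MISS/200 4120 GET " +
        "http://i4.ytimg.com/vi/gTHZnIAzmdY/default.jpg - DIRECT/188.8.131.52 image/jpeg"
    println(parse(sample))
  }
}
```

Splitting on a single space matches the sample shown here. If your own log file pads fields with several spaces, a whitespace regular expression such as "\\s+" is probably a safer separator.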
Creating Sample Text File
I created a sample text file, squid.txt, containing the records shown above (just copy and paste them).
Eclipse IDE Setup (for beginners)
Before writing the Spark program, you need to set up a Scala project in the Eclipse IDE. I assume you have already installed Eclipse; if not, please refer to my previous posts for installation steps (Windows | Mac users). These steps are the same for the other sections, such as reading CSV, JSON and JDBC.
1. Create a new Scala project "txtReader"
Go to File → New → Project, enter txtReader in the project name field and click Finish.
2. Create a new Scala Package "com.dataneb.spark"
Right-click the txtReader project in the Package Explorer panel → New → Package, enter the name com.dataneb.spark and click Finish.
3. Create a Scala object "textfileReader"
Expand the txtReader project tree and right-click the com.dataneb.spark package → New → Scala Object, enter textfileReader as the object name and click Finish.
4. Add external jar files (if needed)
Right-click the txtReader project → Properties → Java Build Path → Add External JARs.
Now navigate to the path where you installed Spark. You will find all the JAR files under the /spark/jars folder.
Select all the JAR files and click Open, then click Apply and Close.
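To check that the project and build path are set up correctly, you could start with a bare-bones version of the textfileReader object, sketched below. It only starts a local Spark session and prints the Spark version; the local[*] master and the application name are assumptions for running inside Eclipse, not requirements.

```scala
package com.dataneb.spark

import org.apache.spark.sql.SparkSession

object textfileReader {
  def main(args: Array[String]): Unit = {
    // Assumption: a local master is enough for a quick check inside the IDE.
    val spark = SparkSession.builder()
      .appName("textfileReader")
      .master("local[*]")
      .getOrCreate()

    try {
      // If the Spark JARs are on the build path, this compiles and prints the version.
      println(s"Spark version: ${spark.version}")
    } finally {
      spark.stop()
    }
  }
}
```

If Eclipse reports that org.apache.spark cannot be resolved, the JAR files from step 4 are most likely missing from the build path.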