Analyzing Twitter Data - Twitter sentiment analysis using Spark streaming

Updated: Oct 25, 2019




Twitter Spark Streaming - In this tutorial we will analyze Twitter data and perform Twitter sentiment analysis using Spark Streaming. You can do this in Python, Scala, Java or R.


Spark Streaming is very useful for analyzing real-time data from IoT technologies such as your smart watch, Google Home, Amazon Alexa, Fitbit, GPS, home security system, smart cameras or any other device that communicates with the internet. Social networks like Facebook, Twitter and Instagram also generate an enormous amount of data every minute.


The trend below shows interest over time for three of these smart technologies over the past 5 years.


In this example we are going to stream tweets from the Twitter API in real time using OAuth authentication and pick out the most popular hashtags among them.
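
To make the goal concrete, here is a minimal plain-Scala sketch of the hashtag step applied to an in-memory batch of tweet texts, without Spark or the Twitter API. The `HashtagCount` object, its `topHashtags` helper and the sample tweets are all made up for illustration and are not taken from the code in this post; a streaming job would apply the same idea to each incoming batch.

```scala
// Hypothetical helper for illustration only; not part of the original post.
object HashtagCount {

  // Split each tweet on whitespace, keep words that start with '#',
  // count them, and return the n most frequent hashtags (most frequent first,
  // ties broken alphabetically).
  def topHashtags(tweets: Seq[String], n: Int): Seq[(String, Int)] =
    tweets
      .flatMap(_.split("\\s+").toSeq)
      .filter(word => word.startsWith("#") && word.length > 1)
      .groupBy(identity)
      .map { case (tag, occurrences) => (tag, occurrences.size) }
      .toSeq
      .sortBy { case (tag, count) => (-count, tag) }
      .take(n)

  def main(args: Array[String]): Unit = {
    // Made-up sample data standing in for live tweets.
    val sample = Seq(
      "Learning #spark streaming today",
      "#spark and #scala work well together",
      "More #scala #spark examples"
    )
    topHashtags(sample, 2).foreach { case (tag, count) => println(s"$tag: $count") }
  }
}
```

Sorting by descending count with the tag as a tie-breaker keeps the result stable between runs, which makes it easier to compare batches.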


Prerequisites

  • Download and install Apache Spark and Scala IDE (Windows | Mac)

  • Create a Twitter sample application and obtain your consumer key, consumer secret, access token and access token secret. Refer to this to learn how to get a Twitter developer account and API access keys.


Authentication file setup


Create a text file twitter.txt with your Twitter OAuth details and place it anywhere on your local file system (remember the path).


File content should look like this.

  • The file has two fields per line, separated by a single space: the first field contains the OAuth header name and the second field contains the API key.

  • Make sure there are no extra spaces anywhere in the file; otherwise you will get authentication errors.

  • End the file with a newline character (i.e. press Enter after the 4th line). You can see the empty 5th line in the screenshot below.
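
As a rough illustration of the layout described above, the sketch below parses such a file and registers each line as a twitter4j system property. The four header names and the `twitter4j.oauth.` prefix are assumptions based on twitter4j's standard property names, the values are placeholders, and the `TwitterAuthFile` object with its `parseLine` and `load` helpers is hypothetical, not necessarily the code used later in this post.

```scala
import scala.io.Source

// Hypothetical helper for illustration only; not part of the original post.
//
// Assumed file layout (placeholder values, one "name value" pair per line):
//   consumerKey your_consumer_key
//   consumerSecret your_consumer_secret
//   accessToken your_access_token
//   accessTokenSecret your_access_token_secret
object TwitterAuthFile {

  // Split one line on single spaces. The -1 limit keeps trailing empty
  // fields, so leading, trailing or doubled spaces produce more than two
  // fields and the line is rejected with None. Blank lines also give None.
  def parseLine(line: String): Option[(String, String)] =
    line.split(" ", -1) match {
      case Array(name, value) if name.nonEmpty && value.nonEmpty => Some((name, value))
      case _ => None
    }

  // Read the file and set one twitter4j.oauth.* system property per valid line.
  def load(path: String): Unit = {
    val source = Source.fromFile(path)
    try {
      source.getLines().foreach { line =>
        parseLine(line).foreach { case (name, value) =>
          System.setProperty("twitter4j.oauth." + name, value)
        }
      }
    } finally source.close()
  }
}
```

Under these assumptions, `TwitterAuthFile.load` would be called with the path to twitter.txt before the stream is created. A line with stray spaces is silently skipped here, which would show up later as an authentication error, so a stricter version might fail fast instead.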



Write the code!


Now create a Scala project in Eclipse IDE (see how to create Scala project) and refer to the following code, which prints out live tweets as they stream in using Spark Streaming. I have written a separate blog post explaining the basic terminology used in Spark, such as RDD, SparkContext, SQLContext and the various transformations and actions. You can go through it for a basic understanding.


I have also briefly explained, in the comments above each line of code, what it does. For a list of Spark functions you can refer to this.


// Our package
package com.dataneb.spark

// Twitter libraries used to run spark streaming
import twitter4j._
import twitter4j.auth.Authorization
import twitter4j.auth.OAuthAuthorization
import twitter4j.conf.ConfigurationBuilde