Analyzing Twitter Data - Twitter sentiment analysis using Spark streaming

Updated: Oct 25, 2019

Analyzing Twitter Data - Twitter sentiment analysis using Spark streaming.

Twitter Spark Streaming - We will be analyzing Twitter data and we will be doing Twitter sentiment analysis using Spark streaming. You can do this in any programming language Python, Scala, Java or R.

Main menu: Spark Scala Tutorial

Spark streaming is very useful in analyzing real time data from IoT technologies which could be your smart watch, Google Home, Amazon Alexa, Fitbit, GPS, home security system, smart cameras or any other device which communicates with internet. Social accounts like Facebook, Twitter, Instagram etc generate enormous amount of data every minute.

Below trend shows interest over time for three of these smart technologies over past 5 years.

In this example we are going to stream Twitter API tweets in real time with OAuth authentication and filter the hashtags which are most famous among them.


  • Download and install Apache Spark and Scala IDE (Windows | Mac)

  • Create Twitter sample application and obtain your client secret, client secret key, access token and access token secret. Refer this to know how to get Twitter development account and api access keys.


Authentication file setup

Create a text file twitter.txt with Twitter OAuth details and place it anywhere on your local directory system (remember the path).

File content should look like this.

  • Basically it has two fields separated with single space - first field contains OAuth headers name and second column contains api keys.

  • Make sure there is no extra space anywhere in your file else you will get authentication errors.

  • There is a new line character at the end (i.e. hit enter after 4th line). You can see empty 5th line in below screenshot.

Write the code!

Now create a Scala project in Eclipse IDE (see how to create Scala project), refer the following code that prints out live tweets as they stream using Spark Streaming. I have written separate blog to explain what are basic terminologies used in Spark like RDD, SparkContext, SQLContext, various transformations and actions etc. You can go through this for basic understanding.

However, I have explained little bit in comments above each line of code what it actually does. For list of spark functions you can refer this.


// Our package

package com.dataneb.spark

// Twitter libraries used to run spark streaming

import twitter4j._

import twitter4j.auth.Authorization

import twitter4j.auth.OAuthAuthorization

import twitter4j.conf.ConfigurationBuilder

import org.apache.spark._

import org.apache.spark.SparkContext._

import org.apache.spark.streaming._

import org.apache.spark.streaming.twitter._

import org.apache.spark.streaming.StreamingContext._


import org.apache.spark.internal.Logging


import org.apache.spark.streaming.dstream._

import org.apache.spark.streaming.receiver.Receiver

/** Listens to a stream of Tweets and keeps track of the most popular

* hashtags over a 5 minute window.


object PopularHashtags {

/** Makes sure only ERROR messages get logged to avoid log spam. */

def setupLogging() = {

import org.apache.log4j.{Level, Logger}

val rootLogger = Logger.getRootLogger()



/** Configures Twitter service credentials using twiter.txt in the main workspace directory. Use the path where you saved the authentication file */

def setupTwitter() = {


for (line <- Source.fromFile("/Volumes/twitter.txt").getLines) {

val fields = line.split(" ")

if (fields.length == 2) {

System.setProperty("twitter4j.oauth." + fields(0), fields(1))




// Main function where the action happens

def main(args: Array[String]) {

// Configure Twitter credentials using twitter.txt


// Set up a Spark streaming context named "PopularHashtags" that runs locally using

// all CPU cores and one-second batches of data

val ssc = new StreamingContext("local[*]", "PopularHashtags", Seconds(1))

// Get rid of log spam (should be called after the context is set up)


// Create a DStream from Twitter using our streaming context

val tweets = TwitterUtils.createStream(ssc, None)

// Now extract the text of each status update into DStreams using map()

val statuses = => status.getText())

// Blow out each word into a new DStream

val tweetwords = statuses.flatMap(tweetText => tweetText.split(" "))

// Now eliminate anything that's not a hashtag

val hashtags = tweetwords.filter(word => word.startsWith("#"))

// Map each hashtag to a key/value pair of (hashtag, 1) so we can count them by adding up the values

val hashtagKeyValues = => (hashtag, 1))

// Now count them up over a 5 minute window sliding every one second

val hashtagCounts = hashtagKeyValues.reduceByKeyAndWindow( (x,y) => x + y, (x,y) => x - y, Seconds(300), Seconds(1))

// Sort the results by the count values

val sortedResults = hashtagCounts.transform(rdd => rdd.sortBy(x => x._2, false))

// Print the top 10


// Set a checkpoint directory, and kick it all off

// I can watch this all day!

ssc.checkpoint("/Volumes/Macintosh HD/Users/Rajput/Documents/checkpoint/")





Run it! See the result!


I actually went to my Twitter account to see top result tweet #MTVHottest and it was trending, see the snapshot below.


You might face error if

You are not using proper version of Scala and Spark Twitter libraries. I have listed them below for your reference. You can download these libraries from these two links Link1 & Link2.

  1. Scala version 2.11.11

  2. Spark version 2.3.2

  3. twitter4j-core-4.0.6.jar

  4. twitter4j-stream-4.0.6.jar

  5. spark-streaming-twitter_2.11-2.1.1.jar

Thank you folks. I hope you are enjoying these blogs, if you have any doubt please mention in comments section below.

#AnalyzingTwitterData #SparkStreamingExample #Scala #TwitterSparkStreaming #SparkStreaming

Next: Sample Big Data Architecture with Apache Spark

Navigation menu

1. Apache Spark and Scala Installation

1.1 Spark installation on Windows​

1.2 Spark installation on Mac