

  • StreamingContext: Spark streaming word count example Scala

Main menu: Spark Scala Tutorial

    In this tutorial you will learn how to stream data in real time using Spark Streaming. Spark Streaming is used for near real-time data processing. Why "near" real-time? Because processing the data takes some time, a few milliseconds. This lag can be reduced, but it can never be reduced to zero; it is so small that we end up calling it real-time processing.

    Streaming sounds very cool, but honestly most of you have already implemented it in the form of "batch mode"; the only difference is the time window. Everyone is familiar with batch mode, where you pull data on an hourly, daily, weekly or monthly basis and process it to fulfill your business requirements. What if you start pulling data every second and, at the same time, make your code efficient enough to process it in milliseconds? That would automatically be near real-time processing, right?

    Spark Streaming gives you the ability to define sliding time windows with a batch interval. Spark then automatically breaks the incoming data into smaller batches for real-time processing. Under the hood it uses RDDs (resilient distributed datasets) to perform the computation on unstructured data. RDDs are references to the actual data, distributed across multiple nodes with some replication factor, which reveal their values only when you perform an action (like collect) on top of them; this is called lazy evaluation. If you haven't installed Apache Spark, please refer to this (Windows | Mac users).

    Open Scala IDE, create the package com.dataneb.spark and define a Scala object SparkStreaming. Check this link to see how you can create packages and objects using Scala IDE. Copy-paste the code below into your Scala IDE and let the program run.

    package com.dataneb.spark

    /** Libraries needed for streaming the data. */
    import org.apache.spark._
    import org.apache.spark.streaming._
    import org.apache.spark.streaming.StreamingContext._

    object SparkStreaming {

      /** Reduces the logging level to print just ERROR. */
      def setupLogging() = {
        import org.apache.log4j.{Level, Logger}
        val rootLogger = Logger.getRootLogger()
        rootLogger.setLevel(Level.ERROR)
      }

      def main(args: Array[String]) {

        /** Spark configuration that utilizes all local cores and
          * sets the application name to TerminalWordCount. */
        val conf = new SparkConf().setMaster("local[*]").setAppName("TerminalWordCount")

        /** Call the logging function. */
        setupLogging()

        /** Spark streaming context with the above configuration and a batch interval of 1 second. */
        val ssc = new StreamingContext(conf, Seconds(1))

        /** Terminal port 9999 where we will be entering real-time messages. */
        val lines = ssc.socketTextStream("localhost", 9999)

        /** flatMap to split the lines into words, then reduce by key to perform the count. */
        val words = lines.flatMap(_.split(" "))
        val pairs = words.map(word => (word, 1))
        val wordCounts = pairs.reduceByKey(_ + _)

        // Print the first ten elements of each RDD
        wordCounts.print()

        ssc.start()             // Start the computation
        ssc.awaitTermination()  // Wait for the computation to terminate
      }
    }

    Now open your local terminal, type "nc -lk 9999" and hit enter. What did we just do? We opened a network listener (nc, or NetCat) on local port 9999. Now type some random input for Spark to process, as shown below.

    Go back to Scala IDE to see the processed records; you need to swap the screens quickly to see the results, as Spark will process these lines within seconds. Just kidding, you can simply pin the console and scroll back to see the results. You can see the output below. That's it, it's that simple.
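    Since the post mentions sliding time windows, here is a minimal sketch of the same terminal word count maintained over a sliding window, using the standard reduceByKeyAndWindow operation. This is my own illustration, not part of the original example: the object name WindowedWordCount and the 30-second window / 5-second slide are arbitrary values chosen to show the idea.

    package com.dataneb.spark

    import org.apache.spark._
    import org.apache.spark.streaming._

    object WindowedWordCount {
      def main(args: Array[String]) {
        val conf = new SparkConf().setMaster("local[*]").setAppName("WindowedWordCount")
        val ssc = new StreamingContext(conf, Seconds(1))   // 1-second batch interval, as above

        // Same socket source as the main example
        val lines = ssc.socketTextStream("localhost", 9999)
        val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))

        // Count words over a 30-second window that slides forward every 5 seconds
        // (the slide interval must be a multiple of the batch interval)
        val windowedCounts = pairs.reduceByKeyAndWindow(
          (a: Int, b: Int) => a + b, Seconds(30), Seconds(5))

        windowedCounts.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }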
    In a real-life scenario you could stream a Kafka producer into the local terminal for Spark to pick up and process, or configure Spark to communicate with your application directly. Thank you!! If you have any questions, write them in the comments section below.

    #spark #scala #example #StreamingContext #streaming #word #count

    Next: Analyzing Twitter text with Spark Streaming

  • Kibana dashboard example

    In this Kibana dashboard tutorial we will learn how to create a Kibana dashboard. Before we start with the sample Kibana dashboard example, I hope you have some sample data loaded into ELK Elasticsearch. If not, please refer to my previous blog: How to load sample data into ELK Elasticsearch. As discussed in my previous blog, I am using sample Squid access logs (a comma-separated CSV file). You can find the file format details at this link.

    Navigation Menu: Introduction to ELK Stack | Installation | Loading data into Elasticsearch with Logstash | Create Kibana Dashboard Example | Kibana GeoIP Dashboard Example

    For our understanding, we will create two basic Kibana visualizations:

    Top 10 requested URLs (type: pie chart): shows the top 10 URLs which are getting hits.
    Number of events per hour (type: bar chart): shows the various types of events that occurred each hour.

    Use Case 1: Top 10 Requested URLs (pie chart)

    Open the Kibana UI on your machine and go to the Visualize tab => Create a visualization. Select the type of visualization; for our first use case, select pie chart. Then select the index squidal which we created earlier.

    Now go to Options and uncheck the Donut checkbox, as we need a simple pie chart. Check Show Labels, or leave it blank if you don't want labels; it's up to you.

    Next, go to the Data tab, where you will find Metrics (or a "measure" in reporting terminology); by default it shows Count. Click the blue button right behind Split Slices to choose an aggregation type. Let's choose the Terms aggregation for our use case (in simple words, think of Terms like GROUP BY in SQL). Refer to this for more details. Then choose Field => Requested_URL.keyword, which will act as our dimension. Hit the blue arrow button next to Options to generate the chart. You can also give the chart a custom label as shown below.

    Save the chart: hit the Save button in the top right corner and name the visualization Top 10 Requested URL. Now go to the Dashboard tab => Create a dashboard => Add => select Top 10 Requested URL. Hit the Save button on top of your dashboard and give it a meaningful name, for instance Squid Access Log Dashboard.

    Use Case 2: Number of events per hour (bar chart)

    Go to the Visualize tab again (top left corner of your screen) and click the "+" sign. Choose vertical bar as the chart type and select the squidal index. In this case, instead of choosing the Terms aggregation, we will use an X-Axis bucket with a Date Histogram and an Hourly interval, as shown below. Hit the Save button and give it an appropriate name, for instance Events per Hour.

    Now go back to the Dashboard tab => Squid Access Log Dashboard => Edit => Add and select Events per Hour to add it to your dashboard. Hit Save again. At this point your dashboard should look like this.

    You can add as many visualizations as you want, depending on your business requirements. Your opinion matters a lot; please comment if you have any questions for me. Thank you!

    Next: Kibana GeoIP Dashboard Example

    #ELKStack #KibanaDashboardExample #KibanaDashboardTutorial #CreateKibanaDashboard

  • Analyzing Twitter Data - Twitter sentiment analysis using Spark streaming

    We will be analyzing Twitter data and doing Twitter sentiment analysis using Spark Streaming. You can do this in any programming language: Python, Scala, Java or R.

    Main menu: Spark Scala Tutorial

    Spark Streaming is very useful for analyzing real-time data from IoT technologies, which could be your smart watch, Google Home, Amazon Alexa, Fitbit, GPS, home security system, smart cameras or any other device that communicates with the internet. Social platforms like Facebook, Twitter and Instagram generate an enormous amount of data every minute. The trend below shows interest over time for three of these smart technologies over the past 5 years.

    In this example we are going to stream tweets from the Twitter API in real time with OAuth authentication and find which hashtags are the most popular among them.

    Prerequisites

    Download and install Apache Spark and Scala IDE (Windows | Mac).
    Create a Twitter sample application and obtain your consumer key, consumer secret, access token and access token secret. Refer to this to learn how to get a Twitter development account and API access keys.

    Authentication file setup

    Create a text file twitter.txt with your Twitter OAuth details and place it anywhere on your local file system (remember the path). The file has two fields separated by a single space: the first field contains the OAuth property name and the second contains the API key. Make sure there is no extra space anywhere in the file, otherwise you will get authentication errors. There is a newline character at the end (i.e. hit enter after the 4th line); you can see the empty 5th line in the screenshot below.
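    As a rough illustration of the layout (the property names here are an assumption derived from the twitter4j.oauth.* system properties set by the code below, and the X's are placeholders for your own keys), the file would look something like:

    consumerKey XXXXXXXXXXXXXXXXXXXXXXXXX
    consumerSecret XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    accessToken XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    accessTokenSecret XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX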
    Write the code!

    Now create a Scala project in Eclipse IDE (see how to create a Scala project) and refer to the following code, which prints out live tweets as they stream, using Spark Streaming. I have written a separate blog explaining the basic terminology used in Spark, like RDD, SparkContext, SQLContext, and the various transformations and actions; you can go through it for a basic understanding. However, I have also explained a little bit in the comments above each line of code. For a list of Spark functions you can refer to this.

    // Our package
    package com.dataneb.spark

    // Twitter libraries used to run Spark streaming
    import twitter4j._
    import twitter4j.auth.Authorization
    import twitter4j.auth.OAuthAuthorization
    import twitter4j.conf.ConfigurationBuilder

    import org.apache.spark._
    import org.apache.spark.SparkContext._
    import org.apache.spark.streaming._
    import org.apache.spark.streaming.twitter._
    import org.apache.spark.streaming.StreamingContext._

    import scala.io.Source
    import org.apache.spark.internal.Logging
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.dstream._
    import org.apache.spark.streaming.receiver.Receiver

    /** Listens to a stream of tweets and keeps track of the most popular
      * hashtags over a 5 minute window. */
    object PopularHashtags {

      /** Makes sure only ERROR messages get logged, to avoid log spam. */
      def setupLogging() = {
        import org.apache.log4j.{Level, Logger}
        val rootLogger = Logger.getRootLogger()
        rootLogger.setLevel(Level.ERROR)
      }

      /** Configures Twitter service credentials using twitter.txt.
        * Use the path where you saved the authentication file. */
      def setupTwitter() = {
        import scala.io.Source
        for (line <- Source.fromFile("/Volumes/twitter.txt").getLines) {
          val fields = line.split(" ")
          if (fields.length == 2) {
            System.setProperty("twitter4j.oauth." + fields(0), fields(1))
          }
        }
      }

      // Main function where the action happens
      def main(args: Array[String]) {

        // Configure Twitter credentials using twitter.txt
        setupTwitter()

        // Set up a Spark streaming context named "PopularHashtags" that runs locally using
        // all CPU cores and one-second batches of data
        val ssc = new StreamingContext("local[*]", "PopularHashtags", Seconds(1))

        // Get rid of log spam (should be called after the context is set up)
        setupLogging()

        // Create a DStream from Twitter using our streaming context
        val tweets = TwitterUtils.createStream(ssc, None)

        // Now extract the text of each status update into DStreams using map()
        val statuses = tweets.map(status => status.getText())

        // Blow out each word into a new DStream
        val tweetwords = statuses.flatMap(tweetText => tweetText.split(" "))

        // Now eliminate anything that's not a hashtag
        val hashtags = tweetwords.filter(word => word.startsWith("#"))

        // Map each hashtag to a key/value pair of (hashtag, 1) so we can count them by adding up the values
        val hashtagKeyValues = hashtags.map(hashtag => (hashtag, 1))

        // Now count them up over a 5 minute window sliding every one second
        val hashtagCounts = hashtagKeyValues.reduceByKeyAndWindow(
          (x, y) => x + y, (x, y) => x - y, Seconds(300), Seconds(1))

        // Sort the results by the count values
        val sortedResults = hashtagCounts.transform(rdd => rdd.sortBy(x => x._2, false))

        // Print the top 10
        sortedResults.print

        // Set a checkpoint directory, and kick it all off
        // I can watch this all day!
        ssc.checkpoint("/Volumes/Macintosh HD/Users/Rajput/Documents/checkpoint/")
        ssc.start()
        ssc.awaitTermination()
      }
    }

    Run it! See the result!

    I actually went to my Twitter account to check the top result, #MTVHottest, and it was indeed trending; see the snapshot below.

    You might face errors if you are not using the proper versions of the Scala, Spark and Twitter libraries. I have listed them below for your reference. You can download these libraries from these two links: Link1 & Link2.

    Scala version 2.11.11
    Spark version 2.3.2
    twitter4j-core-4.0.6.jar
    twitter4j-stream-4.0.6.jar
    spark-streaming-twitter_2.11-2.1.1.jar

    Thank you folks. I hope you are enjoying these blogs; if you have any doubts please mention them in the comments section below.

    #AnalyzingTwitterData #SparkStreamingExample #Scala #TwitterSparkStreaming #SparkStreaming

    Next: Sample Big Data Architecture with Apache Spark

  • Samba Installation on OEL (Oracle Enterprise Linux), configuration and file sharing

    Samba is a software package which gives users the flexibility to seamlessly share Linux/Unix files with Windows clients, and vice versa. Samba installation, configuration and file sharing are very simple on Oracle Linux.

    Why the need for Samba?

    Let's say you're managing two operating systems, Linux and Windows, on the same computer, dual booting every day and switching between the platforms depending upon your requirements. Perhaps you're planning to eventually move to Linux as your main operating system, or perhaps Windows; you might even have plans to remove Microsoft's OS from your computer at some point soon. One of the things holding you back is the ability to access data between operating systems. Or consider another scenario where you are running various components of a Big Data architecture (data collection and processing) on a Linux machine, but the visualization tool demands a Windows file system. Let's see how we can work around this problem and get your data where you want it. I am not saying Samba is the "only" solution for this problem, but it is one of the best solutions: no cost, real-time file sharing and super fast; what else do you need?

    What is Samba?

    Samba is a software package which gives users the flexibility to seamlessly share Linux/Unix files with Windows clients, and vice versa. With an appropriately configured Samba server on Linux, Windows clients can map drives to the Linux file systems. Similarly, a Samba client on UNIX can connect to Windows shares. Samba implements the network file sharing protocol using Server Message Block (SMB) and the Common Internet File System (CIFS).

    Installing and configuring Samba (v3.6) on a Linux machine (RHEL/CentOS):

    1. Execute the command below on your terminal to install the Samba packages:

    yum install samba samba-client samba-common -y

    2. Edit the configuration file /etc/samba/smb.conf:

    mv /etc/samba/smb.conf /etc/samba/smb.conf.backup
    vi /etc/samba/smb.conf

    and copy-paste the details below into the configuration file.

    [global]
    workgroup = EXAMPLE.COM
    server string = Samba Server %v
    netbios name = centos
    security = user
    map to guest = bad user
    dns proxy = no

    #======== Share Definitions ===========================
    [Anonymous]
    path = /samba/anonymous
    browsable = yes
    writable = yes
    guest ok = yes
    read only = no

    3. Now save the configuration file and create the Samba shared folder. Then restart the Samba services. Run the commands below step by step:

    mkdir -p /samba/anonymous
    cd /samba
    chmod -R 0755 anonymous/
    chown -R nobody:nobody anonymous/
    chcon -t samba_share_t anonymous/
    systemctl enable smb.service
    systemctl enable nmb.service
    systemctl restart smb.service
    systemctl restart nmb.service

    4. You can check the Samba services by running:

    ps -eaf | grep smbd; ps -eaf | grep nmbd

    5. Now go to the Windows Run prompt and type \\yourhostname\anonymous. That's it!! You will be able to access the anonymous shared drive by now.

    6. If this doesn't connect to the shared folder, please make sure your firewall services are stopped and try again. You can run the commands below to stop the services.

    service firewalld stop
    service iptables stop

    7. Now test the shared folder by creating a sample text file on the Linux machine and opening it on the Windows machine.
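    If you would rather verify the share from the Linux side first, the samba-client package installed in step 1 provides smbclient. A small sketch, assuming the hostname localhost and the Anonymous share defined in the configuration above:

    # List the shares exported by this server (-N skips the password prompt, since guest ok = yes)
    smbclient -L //localhost -N

    # Connect to the Anonymous share and upload a test file
    echo "hello from linux" > /tmp/test.txt
    smbclient //localhost/Anonymous -N -c "put /tmp/test.txt test.txt"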
    Samba Documentation

    You can find the complete Samba documentation at the links below. It's very well documented and can be easily understood.

    https://www.samba.org/samba/docs/
    https://www.samba.org/samba/docs/man/
    https://wiki.samba.org/index.php/Presentations
    https://wiki.samba.org/index.php/Main_Page
    https://wiki.samba.org/index.php/Installing_Samba

    If you enjoyed this post, please comment if you have any question regarding Samba installation (on any operating system) and I will try to respond as soon as possible. Thank you!

    Next: How to setup Oracle Linux on Virtual Box?

    #Samba #Installation #OEL #OracleLinux #Configuration #filesharing #Oracle #Linux

  • Tangy Creamy Tomato Carrot Soup Recipe

    When it's cold outside, nothing hits the spot quite like a creamy tomato carrot soup. This healthy, quick and easy recipe makes use of real cream, for a thick and satisfying soup with fabulous flavor that's ready in less than an hour. What could be better for dinner than a nice hot bowl of soup along with your favorite salad? Well, this is what I was craving one evening, and it turned out I had no vegetables in my refrigerator except for 3 carrots and a few tomatoes. So a mashup of the leftover ingredients in my fridge turned into a nice, soothing bowl of carrot tomato soup topped with croutons and dollops of cottage cheese (the current favorite at my place). Below are the details of the recipe. Do try it out; I would love to hear how it turned out. Let me know if anyone comes up with interesting variations as well.

    Preparation time: 30-40 min, Serves: 3-4

    Ingredients

    3 Carrots
    2 Tomatoes
    1/2 inch Ginger
    3-4 Garlic Cloves
    1 Onion
    2-3 Green Chilies (as per spice preference)
    1-2 tsp of cottage cheese (as per preference)
    Salt to taste
    Pepper
    Oregano/Cilantro/Parsley (as per preference)
    Heavy cream (optional)

    Preparation Steps

    Heat oil in a pan. Add chopped onions, garlic, ginger and green chilies. Saute till the onions are translucent. Next, add the chopped carrots and saute for another 3 minutes. Then add the chopped tomatoes and continue to saute till the tomatoes are mushy and the carrots are soft. I prefer to saute till the veggies are nicely roasted and the tomatoes have dissolved. Now blend this to a smooth paste. Pour the paste into the same pan and add 2 cups of hot water while stirring continuously. Cook till it boils rapidly, and adjust the consistency to your preference. Finally, add salt to taste, freshly cracked pepper and some oregano or any herb of your choice. Add 2 teaspoons of cottage cheese to each serving and top it with some croutons. Hope you all like this recipe and can enjoy it on a cozy winter evening!

  • What is "is" in Python?

    The operators is and is not test for object identity: x is y is true if and only if x and y are the same object (the object location in memory is the same). Object identity is determined using the id() function. x is not y yields the inverse truth value.

    It's easier to understand if we compare this operator with the equality operator:

    is and is not compare the object reference; they check for identity.
    == and != compare the object value; they check for equality.

    For example, if you consider integer objects (excluding integers from -5 to 256):

    >>> A=9999
    >>> B=9999
    >>> A == B, A is B
    (True, False)
    >>> A, B
    (9999, 9999)
    >>> id(A), id(B)
    (4452685328, 4452686992)

    Python stores small integers in the range -5 to 256 as single shared objects, so for those the identity is the same.

    >>> A=99
    >>> B=99
    >>> A == B, A is B
    (True, True)
    >>> A, B
    (99, 99)
    >>> id(A), id(B)
    (4404392064, 4404392064)

    Let's see the behavior of other immutable objects like int and float, strings, tuples and booleans:

    >>> 1 == 1.0, 1 is 1.0
    (True, False)
    >>> 1 == 1, 1 is 1
    (True, True)
    >>> 'A' == 'A', 'A' is 'A'
    (True, True)
    >>> (1,2) == (1,2), (1,2) is (1,2)
    (True, True)
    >>> True == True, True is True
    (True, True)

    What about mutable objects - list, set, dict? The behavior is totally different.

    >>> [1,2] == [1,2], [1,2] is [1,2]
    (True, False)
    >>> {1,2} == {1,2}, {1,2} is {1,2}
    (True, False)
    >>> {'k1':1} == {'k1':1}, {'k1':1} is {'k1':1}
    (True, False)

    The is operator is used only when you want to compare object identity; for regular comparison the equality operator is used.
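    One everyday place where identity comparison is the right tool (a small footnote of my own, not from the original note) is checking for None, since None is a singleton:

    >>> x = None
    >>> x is None        # preferred, PEP 8 style
    True
    >>> x == None        # works, but can be fooled by a custom __eq__ implementation
    True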

  • Installing Apache Spark and Scala (Mac)

    Main menu: Spark Scala Tutorial

    In this Spark Scala tutorial you will learn how to install Apache Spark on Mac OS. By the end of this tutorial you will be able to run Apache Spark with Scala on a Mac machine. You will also download Eclipse for the Scala IDE. To install Apache Spark on a Windows machine, visit this.

    Installing Homebrew

    You will be installing Apache Spark using Homebrew. So install Homebrew if you don't have it: visit https://brew.sh/, copy-paste the command on your terminal and run it.

    Installing Apache Spark

    Open a terminal, type brew install apache-spark and hit Enter.

    Create a log4j.properties file. Type cd /usr/local/Cellar/apache-spark/2.3.1/libexec/conf and hit Enter; please change the version according to your downloaded version (Spark 2.3.1 is the version installed for me). Then type cp log4j.properties.template log4j.properties and hit Enter.

    Edit the log4j.properties file and change the log level from INFO to ERROR on log4j.rootCategory. We are just changing the log level from INFO to ERROR, nothing else.

    Download Scala IDE

    Install the Scala IDE from here. Open the IDE once, just to check that it's running fine. You should see panels like this.

    Test it out!

    Open a terminal and go to the directory where apache-spark was installed (such as cd /usr/local/Cellar/apache-spark/2.3.1/libexec/), then type ls to get a directory listing. Look for a text file, like README.md or CHANGES.txt.

    Type the command spark-shell and hit Enter. At this point you should see a scala> prompt. If not, double check the steps above.

    Type val rdd = sc.textFile("README.md") (or whatever text file you've found) and hit Enter. You have just created an RDD of the README text file. Now type rdd.count() and hit Enter to count the number of lines in the text file. You should get the number of lines in that file! Congratulations, you just ran your first Spark program! Don't worry about the commands, I will explain them.

    Sample Execution

    You've got everything set up! If you have any questions please don't forget to mention them in the comments section below.

    Main Menu | Next: Just enough Scala for Spark
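    If you want to play a little more in the same spark-shell session, here is a small optional sketch of my own (assuming the README.md file from the listing above) that chains a transformation before the action:

    scala> val rdd = sc.textFile("README.md")
    scala> rdd.filter(line => line.contains("Spark")).count()    // lines that mention Spark
    scala> rdd.flatMap(_.split(" ")).distinct().count()          // rough count of distinct words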

  • How to pull data from OKTA API example

    OKTA has various REST APIs (refer to this) from where you can pull the data and play around with it according to your business requirements. As OKTA stores only 90 days of records, in many cases you might need to store the data in external databases and then perform your data analysis. In order to pull the data from OKTA I considered writing a shell script, probably because this looked very straightforward to me. There are other methods as well which you can consider if you have a wide project timeline. Let's see how this can be done with a shell script.

    Step 1: Go through the API reference documents and filters which OKTA has provided online. It's seriously very well documented, and that will help you in case you want to tweak this script.

    Step 2: Get the API access token from your OKTA admin and validate that the token is working properly with the Postman client. Refer to this (a curl alternative is sketched at the end of this post).

    Step 3: Once you have the API access token and a basic understanding of the API filters, you will be able to tweak the script according to your needs.

    Step 4: Below is the complete shell program, with a brief explanation of what each step does.

    # Define your environment variables - organization, domain and api_token. These will be used to construct the URL in further steps.
    # If you want, you can hide your API token, for example by reading the token from a parameter file instead of hard coding it.
    # Start
    ORG=company_name
    DOM=okta
    API_TOKEN=*********************

    # Initialize variables with some default values.
    # Change the destination path to wherever you want to write the data.
    # VAL is the pagination limit; PAT/REP_PAT are the pattern and replace-pattern strings used to format the JSON output correctly.
    # DATE_RANGE will be used to pull the data based on the date the user inputs.
    VAL=1000
    DEST_FILE=/var/spark/data
    i=1
    PAT=
    REP_PAT=
    DATE_RANGE=2014-02-01

    # Choose the API for which you need the data (events, logs or users); you can modify the code if you want to export any other API data.
    echo "Enter the name of API - events, logs, users. "
    read GID

    # Enter the date range to pull data
    echo "Enter the date in format yyyy-mm-dd"
    read DATE_RANGE

    date_func() {
      echo "Enter the date in format yyyy-mm-dd"
      read DATE_RANGE
    }

    # Check if the entered date is in the correct format
    if [ ${#DATE_RANGE} -ne 10 ]; then
      echo "Invalid date!! Enter date again.."; date_func
    else
      echo "Valid date!"
    fi

    # Construct the URL based on the variables defined earlier
    URL=https://$ORG.$DOM.com/api/v1/$GID?limit=$VAL

    # Case statement for the API name entered by the user; 4 to 10 are empty routes in case you want to add new APIs
    case $GID in
      events)
        echo "events API selected"
        rm -f /var/spark/data/events.json*
        URL=https://$ORG.$DOM.com/api/v1/$GID?lastUpdated%20gt%20%22"$DATE_RANGE"T00:00:00.000Z%22\&$VAL
        PAT=}]},{\"eventId\":
        REP_PAT=}]}'\n'{\"eventId\":
        sleep 1;;
      logs)
        echo "logs API selected"
        rm -f /var/spark/data/logs.json*
        URL=https://$ORG.$DOM.com/api/v1/$GID?lastUpdated%20gt%20%22"$DATE_RANGE"T00:00:00.000Z%22\&$VAL
        PAT=}]},{\"actor\":
        REP_PAT=}]}'\n'{\"actor\":
        sleep 1;;
      users)
        echo "users API selected"
        PAT=}}},{\"id\":
        REP_PAT=}}}'\n'{\"id\":
        rm -f /var/spark/data/users.json*
        URL=https://$ORG.$DOM.com/api/v1/$GID?filter=status%20eq%20%22STAGED%22%20or%20status%20eq%20%22PROVISIONED%22%20or%20status%20eq%20%22ACTIVE%22%20or%20status%20eq%20%22RECOVERY%22%20or%20status%20eq%20%22PASSWORD_EXPIRED%22%20or%20status%20eq%20%22LOCKED_OUT%22%20or%20status%20eq%20%22DEPROVISIONED%22\&$VAL
        echo $URL
        sleep 1;;
      4) echo "four" ;;
      5) echo "five" ;;
      6) echo "six" ;;
      7) echo "seven" ;;
      8) echo "eight" ;;
      9) echo "nine" ;;
      10) echo "ten" ;;
      *) echo "INVALID INPUT!" ;;
    esac

    # Delete temporary files before running the script
    rm -f itemp.txt
    rm -f temp.txt
    rm -f temp1.txt

    # Create the NEXT variable to handle pagination
    curl -i -X GET -H "Accept: application/json" -H "Content-Type: application/json" -H "Authorization: SSWS $API_TOKEN" "$URL" > itemp.txt
    NEXT=`grep -i 'rel="next"' itemp.txt | awk -F"<" '{print$2}' | awk -F">" '{print$1}'`
    tail -1 itemp.txt > temp.txt

    # Validate that the URL is correctly defined
    echo $URL

    # Iterate the pagination loop with the NEXT variable until it's null
    while [ ${#NEXT} -ne 0 ]
    do
      echo "this command is executed till NEXT is null, current value of NEXT is $NEXT"
      curl -i -X GET -H "Accept: application/json" -H "Content-Type: application/json" -H "Authorization: SSWS $API_TOKEN" "$NEXT" > itemp.txt
      tail -1 itemp.txt >> temp.txt
      NEXT=`grep -i 'rel="next"' itemp.txt | awk -F"<" '{print$2}' | awk -F">" '{print$1}'`
      echo "number of loop = $i, for NEXT reference : $NEXT"
      (( i++ ))
      cat temp.txt | cut -c 2- | rev | cut -c 2- | rev > temp1.txt
      rm -f temp.txt

      # Format the output to create single-line JSON records
      echo "PATTERN = $PAT"
      echo "REP_PATTERN = $REP_PAT"
      sed -i "s/$PAT/$REP_PAT/g" temp1.txt
      mv temp1.txt /var/spark/data/$GID.json_`date +"%Y%m%d_%H%M%S"`
      sleep 1
    done
    # END

    See also - How to setup Postman client. If you have any question please write in the comments section below. Thank you!
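    For step 2, if you prefer the command line over Postman, a hedged sketch of a token check could be a single curl call against the users endpoint (the organization/domain below are the same placeholders used in the script, and the limit parameter is just an illustration):

    # Should return HTTP 200 and a small JSON array of users if the token is valid
    curl -i -X GET \
      -H "Accept: application/json" \
      -H "Content-Type: application/json" \
      -H "Authorization: SSWS $API_TOKEN" \
      "https://company_name.okta.com/api/v1/users?limit=1"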

  • Apache Kafka and Zookeeper Installation & Sample Pub-Sub Model

    There are many technologies available today which provide real-time data ingestion (refer to my previous blog). Apache Kafka is one of my favorites because it is a distributed streaming platform. What exactly does that mean? Basically, it can act as a "publisher-subscriber" data streaming platform, it can process data streams as they occur, and it can store data streams in a fault-tolerant, durable way.

    Kafka can run as a cluster on one or more servers that can span multiple data centers. The Kafka cluster stores streams of records in categories called "topics". For instance, if I have data streams coming from Twitter (refer to this blog), I can name the topic "Tweets". We will learn how to define these topics and how you can access the pub-sub model. Each record consists of a key, a value, and a timestamp. Kafka has four core APIs:

    Producer API
    Consumer API
    Streams API
    Connector API

    What's the need for Zookeeper with Kafka?

    As each record in Kafka consists of a key, a value and a timestamp, we need someone to manage the key-value pairs and keep everything synchronized. Zookeeper is essentially a centralized service for distributed systems: a hierarchical key-value store used to provide configuration information, naming, synchronization services and group services. The Apache Kafka package installer comes with an inbuilt Zookeeper, but in a production environment with multiple nodes people usually install Zookeeper separately. I will explain both ways to run Kafka: one with the inbuilt Zookeeper and another with a separately installed Zookeeper.

    Kafka Installation

    Before we start the Kafka installation, I hope you all have Java installed on your machine; if not, please refer to my previous blog. Once you have successfully installed Java, go to this link and download the latest Kafka release. I have downloaded Scala 2.11 - kafka_2.11-1.1.0.tgz (asc, sha512) to explain this blog.

    Once you download it on your local machine, move it to the Linux environment where you want to run Kafka. I use MobaXterm (an open source tool) to transfer the file from my Windows machine to the Linux environment (a Red Hat Linux client without GUI in this case). Navigate to the directory where you transferred the .tgz file and untar the file. For instance, I have kept it in the /var folder.

    Now remove the .tgz file (not a necessary step, but it saves some space) and rename the folder for your convenience:

    cd /var
    rm kafka_2.11-1.1.0.tgz
    mv kafka_2.11-1.1.0 kafka

    Zookeeper Installation

    Follow similar steps to download the latest Zookeeper release from this link. Move the .gz file to your Linux machine (the /var folder in my case) and run the commands below to untar the file:

    tar -xvf zookeeper-3.4.11.tar.gz
    rm zookeeper-3.4.11.tar.gz
    mv zookeeper-3.4.11 zookeeper
    cd zookeeper

    Now let's set up the configuration file:

    cd conf
    cp zoo_sample.cfg zoo.cfg

    Your configuration file will look something like the screenshot below. You can change dataDir if you don't want to depend on your /tmp directory; if the server reboots due to some infrastructure issue you might lose the /tmp data, hence people usually don't rely on it. Or you can simply leave it as it is.

    At this point you are done with the Zookeeper setup. Now we will start Zookeeper, as it is needed for Kafka. There are basically two ways to do this. First, as I said, Kafka comes with an inbuilt Zookeeper program (you can find the Zookeeper scripts under the /kafka/bin directory).
    So either you can start the Zookeeper which comes with Kafka, or you can run the Zookeeper you installed separately.

    Start Zookeeper

    To use the inbuilt Zookeeper, navigate to your Kafka directory and run:

    cd /var/kafka
    bin/zookeeper-server-start.sh config/zookeeper.properties

    Or start the Zookeeper which you just installed by running the command below in the /var/zookeeper directory:

    bin/zkServer.sh start

    You will get output something like this.

    Start Kafka

    Go to your Kafka directory and execute the command below:

    cd /var/kafka
    bin/kafka-server-start.sh config/server.properties

    You will see lots of events being generated and the screen getting stuck at one point. Keep this terminal open and open a new terminal to verify that the Zookeeper and Kafka services are running fine. Type the jps command to check the active Java process status: QuorumPeerMain (12050) is our Zookeeper process and 11021 is our Kafka process; the process ids might vary for you. Once you close that terminal, the Kafka service will stop. Another way is to run these services in the background with nohup:

    nohup bin/zookeeper-server-start.sh config/zookeeper.properties &
    nohup bin/kafka-server-start.sh config/server.properties &

    Create Kafka Topic

    As discussed earlier, the Kafka cluster stores streams of records in categories called "topics". Let's create a topic called "Tweets". To do this, run the command below in your Kafka directory:

    cd /var/kafka
    bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic Tweets

    You will get a prompt saying: Created topic "Tweets". You can also see the list of topics by running:

    bin/kafka-topics.sh --list --zookeeper localhost:2181

    Running Kafka Producer and Consumer

    Run the command below in order to start the Producer API. You need to tell Kafka which topic you want to stream to and on which port; check config/server.properties in the Kafka directory for details. For example, I am running it for Tweets on port 9092:

    cd /var/kafka
    bin/kafka-console-producer.sh --broker-list localhost:9092 --topic Tweets

    Now open a new terminal and run the Consumer API. Make sure you enter the correct topic name; the port here is the one where your Zookeeper is running:

    bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic Tweets --from-beginning

    Now go back to the Producer API terminal, type anything and hit enter. The same message will appear on your Consumer API terminal, as shown in the producer and consumer screenshots below.

    Thank you!! If you have any question please write in the comments section below.

    #DataIngestion #Kafka #Zookeeper #ApacheKafkaandZookeeperInstallation #kafkapubsubmodel
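    One closing note of my own, not part of the original walkthrough: on newer Kafka releases the console consumer connects to the broker rather than to Zookeeper, so if the --zookeeper flag is rejected on your version, the roughly equivalent command is:

    bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic Tweets --from-beginning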

  • Mutable vs Immutable Objects in Python example

    Everything in Python is an object, and all objects in Python can be either mutable or immutable.

    Immutable objects: an object with a fixed value. Immutable objects include numbers, bool, strings and tuples. Such an object cannot be altered; a new object has to be created if a different value has to be stored. They play an important role in places where a constant hash value is needed, for example as a key in a dictionary.

    Consider an integer → 1 assigned to the variable "a":

    >>> a=1
    >>> b=a
    >>> id(a)
    4547990592
    >>> id(b)
    4547990592
    >>> id(1)
    4547990592

    Notice that id() is the same for all of them. id() returns an integer which corresponds to the object's memory location.

    b → a → 1 → 4547990592

    Now, let's increment the value of "a" by 1 so that we have a new integer object → 2.

    >>> a=a+1
    >>> id(a)
    4547990624
    >>> id(b)
    4547990592
    >>> id(2)
    4547990624

    Notice how the id() of variable "a" changed; "b" and "1" are still the same.

    b → 1 → 4547990592
    a → 2 → 4547990624

    The location to which "a" was pointing has changed (from 4547990592 → 4547990624). The object "1" was never modified; immutable objects don't allow modification after creation. If you create n different integer values, each will have a different constant hash value.

    Mutable objects can change their value but keep their id(), like list, dictionary and set.

    >>> x=[1,2,3]
    >>> y=x
    >>> id(x)
    4583258952
    >>> id(y)
    4583258952

    y → x → [1,2,3] → 4583258952

    Now, let's change the list:

    >>> x.pop()
    3
    >>> x
    [1, 2]
    >>> id(x)
    4583258952
    >>> id(y)
    4583258952

    y → x → [1,2] → 4583258952

    x and y are still pointing to the same memory location even after the actual list has changed.
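    To make the "constant hash value" point from the beginning concrete, here is a small sketch of my own: an immutable tuple can be used as a dictionary key, while a mutable list cannot.

    >>> d = {(1, 2): "immutable keys are fine"}
    >>> d[(1, 2)]
    'immutable keys are fine'
    >>> d[[1, 2]] = "mutable keys are not"
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: unhashable type: 'list'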
