

  • Loading JSON file using Spark (Scala)

    In this Apache Spark tutorial we will load a simple JSON file. These days most of the files you receive are either JSON, XML or flat files. The JSON format is very easy to understand, and you will like working with it once you understand its structure.

    JSON File Structure

    Before we ingest a JSON file with Spark, it's important to understand the JSON data structure. JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write, and easy for machines to parse and generate. JSON is built on two structures: a collection of name/value pairs, usually referred to as an object, and an ordered list of values, which you can think of as an array or list.

    An object is an unordered set of name/value pairs. An object begins with { (left brace) and ends with } (right brace). Each name is followed by : (colon), and the name/value pairs are separated by , (comma). An array is an ordered collection of values. An array begins with [ (left bracket) and ends with ] (right bracket), and its values are separated by , (comma). A value can be a string in double quotes, a number, true, false, null, an object or an array. These structures can be nested.

    One more fact: JSON files come in two layouts. In multiline JSON a single record can span several lines (for example a pretty-printed array of objects), and this is the layout you will encounter most of the time. In single-line JSON each line holds exactly one record.

    Multiline JSON would look something like this:

        [
          { "color": "red", "value": "#f00" },
          { "color": "green", "value": "#0f0" },
          { "color": "blue", "value": "#00f" },
          { "color": "cyan", "value": "#0ff" },
          { "color": "magenta", "value": "#f0f" },
          { "color": "yellow", "value": "#ff0" },
          { "color": "black", "value": "#000" }
        ]

    Single-line JSON would look something like this (try to correlate it with the object, array and value structures explained earlier):

        { "color": "red", "value": "#f00" }
        { "color": "green", "value": "#0f0" }
        { "color": "blue", "value": "#00f" }
        { "color": "cyan", "value": "#0ff" }
        { "color": "magenta", "value": "#f0f" }
        { "color": "yellow", "value": "#ff0" }
        { "color": "black", "value": "#000" }

    Creating Sample JSON Files

    I have created two sample files - a multiline and a single-line JSON file - with the records shown above (just copy-paste them): singlelinecolors.json and multilinecolors.json.

    Note: I assume you have Scala IDE installed; if not, please refer to my previous blogs for the installation steps (Windows & Mac users).

    1. Create a new Scala project "jsnReader". Go to File → New → Project, enter jsnReader in the project name field and click Finish.

    2. Create a new Scala package "com.dataneb.spark". Right-click the jsnReader project in the Package Explorer panel → New → Package, enter the name com.dataneb.spark and finish.

    3. Create a Scala object "jsonfileReader". Expand the jsnReader project tree, right-click the com.dataneb.spark package → New → Scala Object, enter jsonfileReader as the object name and press Finish.

    4. Add external jar files. Right-click the jsnReader project → Properties → Java Build Path → Add External JARs. Navigate to the path where you installed Spark; you will find all the jar files under the /spark/jars folder. After adding these jar files you will see a Referenced Libraries folder created in the left panel below the Scala object. You will also notice that the project has become invalid (red cross sign); we will fix it shortly.

    5. Set up the Scala compiler. Right-click the jsnReader project → Properties → Scala Compiler, check the box Use Project Settings and select Fixed Scala installation: 2.11.11 (built-in) from the drop-down options. After applying these changes the project becomes valid again (the red cross sign is gone).

    6. Sample code. Open jsonfileReader.scala and copy-paste the code below. I have written a separate blog explaining the basic terminology used in Spark - RDD, SparkContext, SQLContext, various transformations and actions etc. You can go through it for background; here I have briefly explained in a comment above each line what it does. For the list of Spark functions you can refer to this.

        // Your package name
        package com.dataneb.spark

        // Each library has its significance, I have commented in the code below how it's being used
        import org.apache.spark._
        import org.apache.spark.sql._
        import org.apache.log4j._

        object jsonfileReader {

          // Reducing the log level to just "ERROR" messages
          // It uses library org.apache.log4j._
          // You can apply other logging levels like ALL, DEBUG, ERROR, INFO, FATAL, OFF etc
          Logger.getLogger("org").setLevel(Level.ERROR)

          // Defining Spark configuration to set the application name and the local resources to use
          // It uses library org.apache.spark._
          val conf = new SparkConf().setAppName("Sample App")
          conf.setMaster("local")

          // Using the above configuration to define our SparkContext
          val sc = new SparkContext(conf)

          // Defining SQL context to run Spark SQL
          // It uses library org.apache.spark.sql._
          val sqlContext = new SQLContext(sc)

          // Main function where all operations will occur
          def main(args: Array[String]): Unit = {

            // Reading the json file
            val df = sqlContext.read.json("/Volumes/MYLAB/testdata/multilinecolors.json")

            // Printing schema
            df.printSchema()

            // Saving as temporary table
            df.registerTempTable("JSONdata")

            // Retrieving all the records
            val data = sqlContext.sql("select * from JSONdata")

            // Showing all the records
            data.show()

            // Stopping Spark Context
            sc.stop
          }
        }

    7. Run the code! Right-click anywhere on the screen and select Run As → Scala Application. That's it! If you have followed the steps properly you will find the result in the Console: we have successfully loaded a JSON file using Spark SQL dataframes, printed the JSON schema and displayed the data.

    Try reading the single-line JSON file we created earlier as well. Note that Spark's JSON reader expects one record per line by default, and there is a multiline option which you need to set to true in order to read multiline files (see the sketch at the end of this post). Also, you can save this data to HDFS, a database or a CSV file depending upon your need. If you have any question, please don't forget to write in the comments section below. Thank you.

    Next: How to convert RDD to dataframe?
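    The code above uses the older SQLContext API. Below is a minimal sketch of the same read using the newer SparkSession API (available from Spark 2.x; the multiline option from Spark 2.2 onwards), showing both layouts. The file paths and the object name are placeholders, so adjust them to wherever you saved the sample files.

        import org.apache.spark.sql.SparkSession

        object jsonReadSketch {
          def main(args: Array[String]): Unit = {

            val spark = SparkSession.builder()
              .appName("Sample App")
              .master("local[*]")
              .getOrCreate()

            // Single-line (JSON Lines) input: one record per line, the reader's default
            val singleLineDF = spark.read.json("/path/to/singlelinecolors.json")

            // Multiline input: a record can span several lines, so the multiline option is required
            val multiLineDF = spark.read
              .option("multiline", "true")
              .json("/path/to/multilinecolors.json")

            singleLineDF.printSchema()
            multiLineDF.show()

            spark.stop()
          }
        }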

  • How to write single CSV file using Spark?

    Apache Spark by default writes CSV output as multiple part-*.csv files inside a directory. The reason is simple: each partition is saved individually, so you get one file per partition. Apache Spark is built for distributed processing, and multiple output files are expected. However, you can work around this in several ways. In previous posts we only read data files (flat files, JSON), created RDDs and dataframes using Spark SQL, but we haven't written a file back to disk or any other storage system. In this Apache Spark tutorial you will learn how to write files back to disk.

    For this post I am creating a Scala object textfileWriter in the same project (txtReader) where we created textfileReader.

    Source File

    I am using the same source file, squid.txt (with duplicate records), which I created in the previous blog. In a practical scenario the source could be anything - a relational database, an HDFS file system, a message queue etc. - and you would never read and write the very same file; this is just for demo purposes.

        1286536309.586 921 192.168.0.68 TCP_MISS/200 507 POST http://rcv-srv37.inplay.tubemogul.co...eiver/services - DIRECT/174.129.41.128 application/xml
        1286536309.608 829 192.168.0.68 TCP_MISS/200 507 POST http://rcv-srv37.inplay.tubemogul.co...eiver/services - DIRECT/174.129.41.128 application/xml
        1286536309.660 785 192.168.0.68 TCP_MISS/200 507 POST http://rcv-srv37.inplay.tubemogul.co...eiver/services - DIRECT/174.129.41.128 application/xml
        1286536309.684 808 192.168.0.68 TCP_MISS/200 507 POST http://rcv-srv37.inplay.tubemogul.co...eiver/services - DIRECT/174.129.41.128 application/xml
        1286536309.775 195 192.168.0.227 TCP_MISS/200 4120 GET http://i4.ytimg.com/vi/gTHZnIAzmdY/default.jpg - DIRECT/209.85.153.118 image/jpeg
        1286536309.795 215 192.168.0.227 TCP_MISS/200 5331 GET http://i2.ytimg.com/vi/-jBxVLD4fzg/default.jpg - DIRECT/209.85.153.118 image/jpeg
        1286536309.815 234 192.168.0.227 TCP_MISS/200 5261 GET http://i1.ytimg.com/vi/dCjp28ps4qY/default.jpg - DIRECT/209.85.153.118 image/jpeg

    Sample Code

    Open textfileWriter.scala and copy-paste the code below. I have written separate blogs explaining the basic terminology used in Spark - RDD, SparkContext, SQLContext, various transformations and actions etc. You can go through these for background: Spark shell, Spark context and configuration; Spark RDD, Transformations and Actions. Here I have briefly explained in a comment above each line what it does; for the list of Spark functions you can refer to this. You could make this code much simpler, but my aim is to teach as well, so I have intentionally introduced a header structure, SQL context, a string RDD etc. If you are already familiar with these, just focus on the dataframe writing part (the squidDF.write block).
        package com.dataneb.spark

        // Each library has its significance, I have commented in the code below how it's being used
        import org.apache.spark._
        import org.apache.spark.sql._
        import org.apache.log4j._
        import org.apache.spark.sql.types.{StructType, StructField, StringType}
        import org.apache.spark.sql.Row

        object textfileWriter {

          // Reducing the log level to just "ERROR" messages
          // It uses library org.apache.log4j._
          // You can apply other logging levels like ALL, DEBUG, ERROR, INFO, FATAL, OFF etc
          Logger.getLogger("org").setLevel(Level.ERROR)

          // Defining Spark configuration to set the application name and the local resources to use
          // It uses library org.apache.spark._
          val conf = new SparkConf().setAppName("textfileWriter")
          conf.setMaster("local")

          // Using the above configuration to define our SparkContext
          val sc = new SparkContext(conf)

          // Defining SQL context to run Spark SQL
          // It uses library org.apache.spark.sql._
          val sqlContext = new SQLContext(sc)

          // Main function where all operations will occur
          def main(args: Array[String]): Unit = {

            // Reading the text file
            val squidString = sc.textFile("/Users/Rajput/Documents/testdata/squid.txt")

            // Defining the data-frame header structure
            val squidHeader = "time duration client_add result_code bytes req_method url user hierarchy_code type"

            // Defining schema from the header which we defined above
            // It uses library org.apache.spark.sql.types.{StructType, StructField, StringType}
            val schema = StructType(squidHeader.split(" ").map(fieldName => StructField(fieldName, StringType, true)))

            // Converting String RDD to Row RDD for 10 attributes
            val rowRDD = squidString.map(_.split(" ")).map(x => Row(x(0), x(1), x(2), x(3), x(4), x(5), x(6), x(7), x(8), x(9)))

            // Creating dataframe based on Row RDD and schema
            val squidDF = sqlContext.createDataFrame(rowRDD, schema)

            // Writing dataframe to a file with overwrite mode, header and single partition
            squidDF
              .repartition(1)
              .write
              .mode("overwrite")
              .format("com.databricks.spark.csv")
              .option("header", "true")
              .save("targetfile.csv")

            sc.stop()
          }
        }

    Run the code! The output lands in a directory named targetfile.csv containing a single part file.

    There are several other methods to write these files.

    Method 1

    This is what we did above. If the expected dataframe size is small you can use either repartition or coalesce to create a single-file output such as /filename.csv/part-00000.

        Scala> dataframe
                 .repartition(1)
                 .write
                 .mode("overwrite")
                 .format("com.databricks.spark.csv")
                 .option("header", "true")
                 .save("filename.csv")

    repartition(1) shuffles the data so that everything is written from one partition, so the write cost is high and it can take a long time if the file is huge.

    Method 2

    Coalesce will require a lot of memory if your file size is huge, and you may run out of memory.

        Scala> dataframe
                 .coalesce(1)
                 .write
                 .mode("overwrite")
                 .format("com.databricks.spark.csv")
                 .option("header", "true")
                 .save("filename.csv")

    Coalesce() vs repartition()

    Both coalesce and repartition change the number of partitions, but repartition performs a full shuffle and is therefore the costlier operation; coalesce simply merges existing partitions and avoids a full shuffle.
    For example,

        scala> val distData = sc.parallelize(1 to 16, 4)
        distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[128] at parallelize at <console>:24

        // current partition size
        scala> distData.partitions.size
        res63: Int = 4

        // checking data across each partition
        scala> distData.mapPartitionsWithIndex((index, iter) => if (index == 0) iter else Iterator()).collect
        res64: Array[Int] = Array(1, 2, 3, 4)

        scala> distData.mapPartitionsWithIndex((index, iter) => if (index == 1) iter else Iterator()).collect
        res65: Array[Int] = Array(5, 6, 7, 8)

        scala> distData.mapPartitionsWithIndex((index, iter) => if (index == 2) iter else Iterator()).collect
        res66: Array[Int] = Array(9, 10, 11, 12)

        scala> distData.mapPartitionsWithIndex((index, iter) => if (index == 3) iter else Iterator()).collect
        res67: Array[Int] = Array(13, 14, 15, 16)

        // decreasing partitions to 2
        scala> val coalData = distData.coalesce(2)
        coalData: org.apache.spark.rdd.RDD[Int] = CoalescedRDD[133] at coalesce at <console>:25

        // notice that coalesce did not reshuffle everything; it simply merged existing partitions
        scala> coalData.mapPartitionsWithIndex((index, iter) => if (index == 0) iter else Iterator()).collect
        res68: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8)

        scala> coalData.mapPartitionsWithIndex((index, iter) => if (index == 1) iter else Iterator()).collect
        res69: Array[Int] = Array(9, 10, 11, 12, 13, 14, 15, 16)

    repartition()

    Notice how repartition() re-shuffles everything to create new partitions, compared to the previous RDDs distData and coalData. Hence repartition is the costlier operation compared to coalesce.

        scala> val repartData = distData.repartition(2)
        repartData: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[139] at repartition at <console>:25

        // checking data across each partition
        scala> repartData.mapPartitionsWithIndex((index, iter) => if (index == 0) iter else Iterator()).collect
        res70: Array[Int] = Array(1, 3, 6, 8, 9, 11, 13, 15)

        scala> repartData.mapPartitionsWithIndex((index, iter) => if (index == 1) iter else Iterator()).collect
        res71: Array[Int] = Array(2, 4, 5, 7, 10, 12, 14, 16)

    Method 3

    Let the files be created across partitions and later merge them with a separate shell script. This method is fast, depending on your hard disk write speed.

        #!/bin/bash
        echo "ColName1, ColName2, ColName3, ... , ColNameX" > filename.csv
        for i in /spark/output/*.csv ; do
            echo "FileNumber $i"
            cat $i >> filename.csv
            rm $i
        done
        echo "Done"

    Method 4

    If you are using the Hadoop file system to store the output files, you can leverage the HDFS getmerge utility. Give it a source directory containing all the partition files and a destination output file, and it concatenates all the files in the source into the destination local file. You can also set -nl to add a newline character at the end of each file, and -skip-empty-file to avoid unwanted newline characters coming from empty files.

        Syntax: hadoop fs -getmerge [-nl] [-skip-empty-file] <src> <localdst>

        hadoop fs -getmerge -nl /spark/source /spark/filename.csv
        hadoop fs -getmerge /spark/source/file1.csv /spark/source/file2.txt filename.csv

    Method 5

    Use FileUtil.copyMerge() to merge all the files.
        import org.apache.hadoop.conf.Configuration
        import org.apache.hadoop.fs._

        def merge(srcPath: String, dstPath: String): Unit = {
          val hadoopConfig = new Configuration()
          val hdfs = FileSystem.get(hadoopConfig)
          FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), true, hadoopConfig, null)
        }

        val newData = << Your dataframe >>
        val outputfile = "/spark/outputs/subject"
        var filename = "sampleFile"
        var outputFileName = outputfile + "/temp_" + filename
        var mergedFileName = outputfile + "/merged_" + filename
        var mergeFindGlob = outputFileName

        newData.write
          .format("com.databricks.spark.csv")
          .option("header", "true")
          .mode("overwrite")
          .save(outputFileName)

        merge(mergeFindGlob, mergedFileName)

    If you have any question, please don't forget to write in the comments section below. Thank you!

    Next: Spark Streaming word count example
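    As a side note, on Spark 2.x and later the CSV writer is built in, so the separate com.databricks.spark.csv package is not needed. Here is a minimal sketch of Method 1/2 with the built-in writer, assuming the squidDF dataframe built in the sample code above and a placeholder output path:

        // coalesce(1) => a single task writes one part file inside the output directory
        squidDF
          .coalesce(1)
          .write
          .mode("overwrite")
          .option("header", "true")
          .csv("/path/to/output/squid_csv")   // still a directory, containing one part-*.csv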

  • How to make calls to Twitter APIs using Postman client?

    In this blog I am going to invoke Twitter custom APIs with the Postman client in order to pull live feeds - or you can say tweets - from Twitter. The output will be JSON text which you can format or change based on your requirements. Soon I will write another blog demonstrating how you can ingest this data in real time with Kafka and process it using Spark, or you can directly stream and process the data in real time with Spark Streaming. For now, let's connect to the Twitter API using Postman.

    Prerequisites

    - Postman client
    - Twitter developer account

    Postman Client Installation

    There are basically two ways to install Postman: either download the Postman extension for your browser (Chrome in my case) or simply install the native Postman application. I installed the Postman application to write this blog.

    Step 1. Google "Install Postman" and go to the official Postman site to download the application.

    Step 2. After opening the Postman download link, select your operating system to start the download. It's available for all platforms - Mac, Linux and Windows. The download link keeps changing, so if it doesn't work just Google it as described above.

    Step 3. Once the installer is downloaded, run it to complete the installation. It's approximately a 250 MB application (for Mac).

    Step 4. Sign up. After signing in, you can save your preferences or do it later.

    Step 5. Your workspace is ready.

    Twitter Developer Account

    I hope you all have a Twitter developer account; if not, please create one. Go to Developer Twitter and sign in with your Twitter account, then click Apps > Create an app at the top right corner of the screen. Note: earlier, developer.twitter.com was known as apps.twitter.com.

    Fill out the form to create an application: specify the Name, Description and Website details. This screen has changed slightly with the new Twitter developer interface, but the overall process is still similar. If you have any question, please feel free to ask in the comments section at the end of this post. Provide a proper website name like https://example.com, otherwise you will get an error while creating the application.

    Once you successfully create the app, make sure the access level is set to Read and Write. Now go to the Keys and Access Token tab and click Create Access Token. At this point you will see the 4 keys which will be used in the Postman client:

    - Consumer Key (API Key)
    - Consumer Secret (API Secret)
    - Access Token
    - Access Token Secret

    Calling the Twitter API with the Postman Client

    Open the Postman application and click on the Authorization tab. Select the authorization type OAuth 1.0 and add the authorization data to the Request Headers - this is a very important step, else you will get an error. After setting the authorization type and request header, fill out the form carefully with the 4 keys (just copy-paste) which we generated in the Twitter app: Consumer Key (API Key), Consumer Secret (API Secret), Access Token and Access Token Secret.

    Execute it! Let's pull Twitter statuses for the screen name snap. Copy-paste the request URL https://api.twitter.com/1.1/statuses/user_timeline.json?screen_name=snap. You can refer to the API reference index to access the various Twitter custom APIs. GET some tweets: hit the Send button and you will get the response.
    GET Examples

    Twitter has very nice API documentation covering accounts, users, tweets, media, trends, messages, geo, ads etc., and there is a huge variety of data you can pull. I am invoking a few APIs just for demonstration.

    Accounts and users. Let's say you want to search for the user name "Elon". You can do it like this:

        GET https://api.twitter.com/1.1/users/search.json?q=elon

    Now suppose you want to get the friend list of Elon Musk; you can do it like this:

        GET https://api.twitter.com/1.1/friends/list.json?user_id=44196397

    The input user_id is the same as the id field in the previous output. You can also change the display between pretty, raw and preview.

    Trending Topics. You can pull the top 50 trending global topics with id = 1, for example:

        GET https://api.twitter.com/1.1/trends/place.json?id=1

    POST Examples

    You can also POST something, just like you tweet from your Twitter web account. For example, if you want to tweet "Hello" you can do it like this:

        POST https://api.twitter.com/1.1/statuses/update.json?status=Hello

    You can verify it in your Twitter account - yeah, that's me! I rarely use Twitter.

    Cursoring

    Cursoring is used for pagination when you have a large result set. Let's say you want to pull all statuses which mention "Elon"; obviously there will be a good number of tweets, and the response can't fit in one page. To navigate through the pages, cursoring is needed. For example, to pull 5 results per page you can do this:

        GET https://api.twitter.com/1.1/search/tweets.json?q=Elon&count=5

    Now, to navigate to the next 5 records you have to use the next_results value shown in the search_metadata section of the response, like this:

        GET https://api.twitter.com/1.1/search/tweets.json?max_id=1160404261450244095&q=Elon&count=5&include_entities=1

    To get the next set of results, again use next_results from the search_metadata of this result set, and so on. Obviously you can't do this manually each time; you need to write a loop to walk the result set programmatically, for example (a Scala sketch of this loop is included at the end of this post):

        cursor = -1
        api_path = "https://api.twitter.com/1.1/endpoint.json?screen_name=targetUser"

        do {
            url_with_cursor = api_path + "&cursor=" + cursor
            response_dictionary = perform_http_get_request_for_url( url_with_cursor )
            cursor = response_dictionary[ 'next_cursor' ]
        } while ( cursor != 0 )

    In our case next_results plays the role of next_cursor - a pointer to the next page. The details differ between endpoints (tweets, users and accounts, ads etc.), but the logic of looping through each result set is the same. Refer to this for complete details.

    That's it - you have successfully pulled data from Twitter.

    Next: Analyze Twitter Tweets using Apache Spark
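    Here is a minimal Scala sketch of the cursoring loop above. The helper names (performHttpGet, extractNextCursor) are hypothetical stubs: in a real client performHttpGet would issue an OAuth 1.0a-signed GET request and extractNextCursor would parse next_cursor out of the JSON body with your preferred JSON library.

        object TwitterCursoring {

          // Stub: would issue the signed HTTP request and return the JSON response body
          def performHttpGet(url: String): String = """{"next_cursor": 0}"""

          // Stub: would extract the next_cursor field from the JSON body
          def extractNextCursor(json: String): Long = 0L

          def main(args: Array[String]): Unit = {
            val apiPath = "https://api.twitter.com/1.1/friends/list.json?screen_name=targetUser"
            var cursor = -1L
            do {
              val response = performHttpGet(s"$apiPath&cursor=$cursor")
              // ... process the current page of results here ...
              cursor = extractNextCursor(response)
            } while (cursor != 0L)
          }
        }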

  • Kibana GeoIP example: How to index geographical location of IP addresses into Elasticsearch

    The relation between an IP address and its geolocation is very simple. There are numerous websites available today - Maxmind, IP2Location, IPstack, Software77 etc. - where you can track the geolocation of an IP address. What's the benefit? It's simple: it gives you another dimension to analyze your data. Let's say my data shows that most of the user traffic is coming from 96.67.149.166. That doesn't make complete sense until I can say that most of the traffic is coming from New Jersey. Geolocation includes multiple attributes like city, state, country, continent, region, currency, country flag, country language, latitude, longitude etc.

    Most of the websites which provide geolocation are paid, but a few like IPstack give you a free access token to call their REST APIs. Still, there are limitations on how many REST API calls you can make per day and which attributes you can pull. Suppose I want to show a specific city in a report and the API only gives access to country and continent - then obviously that data is useless for me.

    The best part is that the Elastic stack provides a free plugin called GeoIP which lets you look up millions of IP addresses. Where does it get the location details from? The answer is Maxmind, which I mentioned earlier. The GeoIP plugin internally performs a lookup against a stored copy of the Maxmind database (which keeps getting updated) and creates a number of extra fields with geo coordinates (longitude & latitude). These geo coordinates can be used to plot maps in Kibana.

    ELK Stack Installation

    I am installing the ELK stack on Mac OS; for installation on a Linux machine refer to this. ELK installation is very easy on Mac with Homebrew - hardly a few minutes' task if done properly.

    1. Homebrew Installation

    Run this command in your terminal. If you have already installed Homebrew, move to the next step; if this command doesn't work, copy it from the Homebrew homepage.

        $ /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

    2. Java Installation

    Check if Java is installed on your machine.

        $ java -version
        java version "9.0.1"

    If Java is not installed, run the following steps to install it.

        $ brew tap caskroom/cask
        $ brew cask install java
        $ brew cask info java

    3. Elasticsearch Installation

        $ brew tap elastic/tap
        $ brew install elastic/tap/elasticsearch-full
        $ elasticsearch

    If you see all INFO messages without any error, the installation went fine. Let this run, don't kill the process. Now simply open localhost:9200 in your local browser and you will see the Elasticsearch version.

    [TIP] You might face a permission issue if you are not logged in as the root user. To enable the root user on Mac you can follow this; it's for security reasons that the root user is disabled by default on Mac. Another solution is to change the folder permissions themselves. Run these commands if you want to change the folder permissions:

        $ sudo chown -R $(whoami) /usr/local/include /usr/local/lib/pkgconfig
        $ chmod u+w /usr/local/include /usr/local/lib/pkgconfig

    Install Xcode if it's missing:

        $ xcode-select --install

    4. Kibana Installation

        $ brew install elastic/tap/kibana-full
        $ kibana

    Let this process run, don't kill it. Now open localhost:5601 in your local browser to check that Kibana is running properly.

    5. Logstash Installation

        $ brew install elastic/tap/logstash-full

    Configuring Logstash for GeoIP

    Let's begin with a few sample IP addresses, as listed below.
    I generated this sample data from browserling.com, so please ignore it if there happens to be a known IP address in this list. Honestly speaking, even I don't know where these IP addresses will point to when we generate the maps.

    Sample Data

    1. Copy-paste these records into a flat file with an "ipaddress" header (sampleip.csv).

        ipaddress
        0.42.56.104
        82.67.74.30
        55.159.212.43
        108.218.89.226
        189.65.42.171
        62.218.183.66
        210.116.94.157
        80.243.180.223
        169.44.232.173
        232.117.72.103
        242.14.158.127
        14.209.62.41
        4.110.11.42
        135.235.149.26
        93.60.177.34
        145.121.235.122
        170.68.154.171
        206.234.141.195
        179.22.18.176
        178.35.233.119
        145.156.239.238
        192.114.2.154
        212.36.131.210
        252.185.209.0
        238.49.69.205

    2. Make sure your Elasticsearch and Kibana services are up and running. If not, please refer to my previous blog on how to restart them.

    3. [Update 9/Aug/2019: not a mandatory step anymore] Install the GeoIP plugin for Elasticsearch. Run the command below in your Elasticsearch home directory. Once the GeoIP plugin is installed successfully, you will find the plugin details under the Elasticsearch plugins directory "/elasticsearch/plugins". You need to run the installation command on each node if you are working in a clustered environment, and then restart the services. Newer versions of Elasticsearch have a built-in GeoIP module, so you don't need to install it separately.

        /elasticsearch/bin/elasticsearch-plugin install ingest-geoip

    Configure Logstash

    Configure the Logstash config file to create a "logstash-iplocation" index. Please note that your index name should start with logstash-, otherwise the attributes will not be mapped to the geo_point datatype properly. This is because the default index pattern in the Logstash template is declared as logstash-*; you can change it if you want, but for now let's move ahead with logstash-iplocation. Below is the sample input, filter and output configuration.

        input {
          file {
            path => "/Volumes/MYLAB/testdata/sampleip.csv"
            start_position => "beginning"
            sincedb_path => "/Volumes/MYLAB/testdata/logstash.txt"
          }
        }
        filter {
          csv {
            columns => "ipaddress"
          }
          geoip {
            source => "message"
          }
        }
        output {
          elasticsearch {
            hosts => "localhost"
            index => "logstash-iplocation"
          }
          stdout {
            codec => rubydebug
          }
        }

    Important notes: your index name should be in lower case, starting with logstash- (for example logstash-abcd). The sincedb path is created once per file input, so if you want to reload the same file make sure you delete the sincedb file entry. You invoke the geoip plugin from the filter configuration; it has no relation to input/output.

    Run Logstash

    Load the data into Elasticsearch by running the command below (it's a single-line command). Now wait, it will take a few seconds to load. Change your home location accordingly; for me it's the Homebrew linked path shown below.

        /usr/local/var/homebrew/linked/logstash-full/bin/logstash -f /usr/local/var/homebrew/linked/logstash-full/libexec/config/logstash_ip.config

    Check the output: see whether the geoip filter was invoked when you loaded the data into Elasticsearch. Also, the datatype of location should be geo_point (otherwise there is some issue with your configuration), and the latitude and longitude datatypes should be float. These datatypes are the confirmation that Logstash loaded the data as expected.

    Kibana Dashboard Creation

    1. Once the data is loaded into Elasticsearch, open the Kibana UI and go to the Management tab => Kibana Index Patterns.

    2. Create a Kibana index with the "logstash-iplocation" pattern and hit Next.
    3. Select the timestamp if you want to show it with your index and hit Create index pattern.

    4. Now go to the Discover tab and select "logstash-iplocation" to see the data which we just loaded. You can expand the fields and see that geoip.location has the datatype geo_point. You can verify this by the "globe" sign just before the geoip.location field; if it's not there, something went wrong and the datatype mapping is incorrect.

    5. Now go to the Visualize tab, select Coordinate Map from the visualization types and "logstash-iplocation" as the index name.

    6. Apply the filters (Buckets: Geo coordinates, Aggregation: Geohash, Field: geoip.location) and hit the "Play" button.

    That's it!! You have located all the IP addresses. Thank you!! If you have any question, please comment.

    Next: Loading data into Elasticsearch using Apache Spark
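    Since the next post is about loading data into Elasticsearch with Apache Spark, here is a minimal sketch of what that looks like for the same sampleip.csv. It assumes the elasticsearch-spark (elasticsearch-hadoop) connector jar is on the classpath and Elasticsearch is running locally; the index name spark-iplocation is just a placeholder. Note that this only indexes the raw addresses - the GeoIP enrichment itself still happens on the Elasticsearch/Logstash side (for example via an ingest pipeline), not in this code.

        import org.apache.spark.sql.SparkSession

        object IpLoader {
          def main(args: Array[String]): Unit = {
            val spark = SparkSession.builder()
              .appName("Load IPs into Elasticsearch")
              .master("local[*]")
              .config("es.nodes", "localhost")
              .config("es.port", "9200")
              .getOrCreate()

            // Read the same CSV we fed to Logstash (header: ipaddress)
            val ips = spark.read.option("header", "true").csv("/Volumes/MYLAB/testdata/sampleip.csv")

            // Write the rows into an Elasticsearch index via the connector's data source
            ips.write
              .format("org.elasticsearch.spark.sql")
              .mode("append")
              .save("spark-iplocation")

            spark.stop()
          }
        }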

  • A Day in the Life of a Computer Programmer

    As a computer programmer, my daily life is actually kind of weird. I did my undergrad in computer science and worked with Microsoft for 4 years, and beyond that I'm largely self-taught - I've spent a far greater number of hours teaching myself how to code. I work on US-based client projects, and because of the time-zone difference my hours are not fully standard. Usually I try to work 7-9 hours a day, though sometimes it can be as much as 10-12 hours. Here is the routine I follow roughly every day:

    4:50 am : Alarm beeps, Snooze 1 ..
    5:05 am : Snooze 2 ..
    5:20 am : Snooze 3 ..
    5:35 am : Snooze n, rolling in bed ..
    6:00 am : Semi awake
    6:10 am : Check Facebook, Instagram, Whatsapp, Robinhood, 9gag
    6:15 am : Wake up, workout (pushups, gym, meditation, yoga - I am lying)
    6:30 am : Shower, dress up
    7:30 am : Leave for work, Mustang, daily traffic, Pandora (free subscription)
    8:15 am : Arrive at work, parking, swipe access card at the entrance
    8:20 am : Check email, Service Now
    8:30 am : Offshore-onshore call, status updates, discuss targets for the day
    9:00 am : Breakfast (uncertain)
    9:20 am : Code, debug, code, code, debug
    10:00 am : Error, error, error, error
    11:00 am : Coffee, "Smoking kills"
    11:30 am : Code, code, debug, code
    12:00 pm - 2:00 pm : Global variable "Lunch"
    2:00 pm : Code, debug, code, debug
    3:00 pm : Code, code, code, code
    3:30 pm : Check on the entire team
    4:30 pm : Wrap up, leave for home, drive, traffic, Pandora
    5:15 pm : Arrive home, change clothes, chillax
    5:45 pm : Jog for 4-5 miles, shower
    6:30 pm : Facebook, Youtube, news, blogs
    8:00 pm - 9:00 pm : Dinner, Eat 24, cooking (rare element)
    9:30 pm : Offshore-onshore call (sometimes free), otherwise Netflix
    10:00 pm : Netflix, Youtube, chit-chat with girlfriend
    10:30 pm - 11:00 pm : Shut down, sleep
    4:50 am : Alarm beeps, Snooze 1, 2, 3, n ..

    Thanks for reading - hit like and share if you enjoyed the post!

  • Installing Apache Spark and Scala (Windows)

    In this Spark Scala tutorial you will learn how to download and install:

    - Apache Spark (on Windows)
    - Java Development Kit (JDK)
    - Eclipse Scala IDE

    By the end of this tutorial you will be able to run Apache Spark with Scala on a Windows machine, using the Eclipse Scala IDE.

    JDK Download and Installation

    1. First download the JDK (Java Development Kit) from this link. If you have already installed Java on your machine, proceed to the Spark download and installation. I have installed Java SE 8u171/8u172 (Windows x64) on my machine; Java SE 8u171 means Java Standard Edition 8 Update 171. This version keeps changing, so just download the latest version available at the time and follow these steps.

    2. Accept the license agreement and choose the OS type. In my case it is the Windows 64-bit platform.

    3. Double-click the downloaded executable file (jdk*.exe; ~200 MB) to start the installation. Note down the destination path where the JDK is installing, then complete the installation process (for instance, in this case it says Install to: C:\Program Files\Java\jdk1.8.0_171\).

    Apache Spark Download & Installation

    1. Download a pre-built version of Apache Spark from this link. Again, don't worry about the version; it might be different for you. Choose the latest Spark release from the drop-down menu and the package type pre-built for Apache Hadoop.

    2. If necessary, download and install WinRAR so that you can extract the .tgz file you just downloaded.

    3. Create a separate directory spark in the C drive. Now extract the Spark files with WinRAR and copy the contents from the downloads folder to C:\spark. Please note you should end up with a directory structure like C:\spark\bin, C:\spark\conf, etc.

    Configuring the Windows environment for Apache Spark

    4. Make sure the "Hide file extensions" option in File Explorer (View tab) is unchecked. Now go to the C:\spark\conf folder and rename log4j.properties.template to log4j.properties. You should see the filename as log4j.properties and not just log4j.

    5. Open log4j.properties with WordPad and change the line log4j.rootCategory=INFO, console to log4j.rootCategory=ERROR, console. Save the file and exit. We make this change so that only ERROR messages are captured when we run Apache Spark, instead of all the INFO logging.

    6. Now create a C:\winutils\bin directory. Download winutils.exe from GitHub and extract all the files. You will find multiple Hadoop versions inside; just focus on the Hadoop version you selected while downloading the pre-built Hadoop 2.x/3.x package type in Step 1. Copy all the files (all .dll, .exe etc.) from that Hadoop version folder into the C:\winutils\bin folder. This step fools Windows into thinking we are running Hadoop; C:\winutils will act as the Hadoop home.

    7. Now right-click your Windows menu and select Control Panel --> System and Security --> System --> "Advanced System Settings" --> then click the "Environment Variables" button. Click the "New" button under User variables and add 3 variables:

        SPARK_HOME   c:\spark
        JAVA_HOME    (the path you noted during JDK installation Step 3, for example C:\Program Files\Java\jdk1.8.0_171)
        HADOOP_HOME  c:\winutils

    8. Add the following 2 paths to your PATH user variable. Select the "PATH" user variable and edit it; if it's not present, create it.

        %SPARK_HOME%\bin
        %JAVA_HOME%\bin

    Download and Install Scala IDE

    1. Now install the latest Scala IDE from here. I have installed Scala-SDK-4.7 on my machine. Download the zipped file and extract it.
    That's it.

    2. Under the Scala-SDK folder you will find an eclipse folder; extract it to c:\eclipse. Run eclipse.exe and it will open the IDE (we will use this later).

    Now test it out! Open a Windows command prompt in administrator mode (right-click Command Prompt in the search menu and run as administrator). Type java -version and hit Enter to check that Java is properly installed; if you see the Java version, Java is installed properly. Type cd c:\spark and hit Enter, then type dir and hit Enter to get a directory listing; look for a text file such as README.md or CHANGES.txt. Type spark-shell and hit Enter. At this point you should have a scala> prompt; if not, double-check the steps above and the environment variables, and after making a change close the command prompt and retry. Type val rdd = sc.textFile("README.md") and hit Enter, then type rdd.count() and hit Enter. You should get a count of the number of lines in the readme file! (The same verification session is collected in the short snippet at the end of this post.)

    Congratulations, you just ran your first Spark program! We created an RDD from the readme text file and ran a count action on it. Don't worry, we will go through this in detail in the next sections. Hit Ctrl-D to exit the spark shell and close the console window. You've got everything set up, hooray!

    Note for Python lovers - to install PySpark continue to this blog.

    That's all! If it's not running, don't worry: mention it in the comments section below and I will help you with the installation process. Thank you.

    Next: Just enough Scala for Spark
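    For quick reference, here is the same verification collected as a spark-shell session. It assumes you launched spark-shell from C:\spark, where README.md ships with the Spark distribution.

        scala> val rdd = sc.textFile("README.md")
        scala> rdd.count()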

  • Elasticsearch Tutorial - What is ELK stack (Elastic stack)

    In this Elasticsearch tutorial you will learn what the ELK stack (Elastic stack) is. We will go through ELK stack examples, load data into the Elasticsearch stack and create a Kibana dashboard.

    Navigation Menu:
    - Introduction to ELK Stack
    - Installation
    - Load data into Elasticsearch stack with Logstash
    - Create Kibana Dashboard Example
    - Kibana GeoIP Dashboard Example

    What is the ELK stack (now called Elastic Stack)?

    ELK stack is an acronym for three open-source products - Elasticsearch, Logstash & Kibana - all maintained by Elastic. The ELK stack started as a log-analytics solution but later evolved into an enterprise search and analytics platform.

    Elasticsearch is based on the Lucene search engine; you can think of it as a NoSQL database with the capability to index data (for text search) and store it. Logstash is basically a data pipeline tool that can connect to various sources through plugins, apply transformations and load data into various targets, including Elasticsearch. In short, Logstash collects and transforms the data and is sometimes used for data shipping as well. Kibana is a data visualization platform where you create dashboards. Another tool called Filebeat, one of the Beats family, can also perform similar data-shipping tasks to Logstash.

    ELK Stack Architecture

    Here is the basic architecture of the Elastic stack. Notice that I haven't listed specific sources: usually the data sources for an ELK stack are various log files, for example application logs, server logs, database logs, network switch logs, router logs etc. These log files are consumed using Filebeat, which acts as a data collector gathering the various types of log files (when there is more than one type). Nowadays Kafka is often used as another layer which distributes the files collected by Filebeat to various queues, from where Logstash transforms the data and stores it in Elasticsearch for visualization. So the complete flow would look like:

    [application log, server logs, database log, network switch log, router log etc.] => Filebeat => Kafka => ELK Stack

    Please note this can change based on the architecture a project needs. If there are only a few types of log files, sometimes you might not use Filebeat or Kafka at all and dump the logs directly into the ELK stack.

    Fun fact: according to Google Trends, Elasticsearch is the most famous component of the stack.

    Why is the ELK stack so popular worldwide? Basically for three major reasons. First of all, price: it is open source, easy to learn and free of cost. If you consider other visualization tools like QlikView and Tableau, Kibana provides similar capabilities without any hidden cost, and Elasticsearch is used by many big companies, for example Wikipedia & GitHub. Second, its elegant user interface: you can spend your time exploring and reviewing data, not trying to figure out how to navigate the interface. And last but not least, it's extensible: Elasticsearch is a schema-free NoSQL database, can scale horizontally and is also used for real-time analytics.

    Next: ELK Installation

  • What's Artificial Intelligence, Machine Learning, Deep Learning, Predictive Analytics, Data Science?

    I never thought I would spend so much time understanding these high-profile terms. I was very confident that I knew, theoretically, everything necessary to start writing machine learning algorithms - until a couple of days back, when I asked myself: does my use case fall under machine learning, or is it artificial intelligence? Or is it predictive analytics? I began explaining it to myself, but couldn't do it right. I spent several hours reading blogs and thinking about these topics, and ended up writing this post to answer myself. I hope you will also find it helpful. Trust me, the most famous of these terms over the past couple of years is "machine learning"; a Google Trends chart (interest over time) of these terms makes that obvious.

    First let's understand these terms individually. Keep the classic Venn diagram in mind while you read further; it will help you distinguish the terminology. You know what I did just now? I asked your brain to recognize patterns. The human brain automatically recognizes such patterns (basically "deep learning") because it was trained on Venn diagrams somewhere in the past. By looking at the diagram, your brain can infer a few facts: deep learning is a subset of machine learning, artificial intelligence is the superset, and data science spreads across all of these technologies. Right? Trust me, if you showed this diagram to a prehistoric man he would not understand anything, but your brain's "algorithms" are trained well enough on historic data to deduce and predict such facts. Isn't it?

    Artificial Intelligence (AI)

    Artificial intelligence is the broadest term. It originated in the 1950s and is the oldest of the terms we will discuss. In one line, artificial intelligence (AI) is a term for simulated intelligence in machines. The concept has always been the idea of building machines capable of thinking and mimicking humans. The simplest example of AI is the chess game where you play against the computer; a chess program was first proposed on paper in 1951. A recent AI example would be self-driving cars, which have always been a subject of controversy. Artificial intelligence can be split into two branches: one is labelled "applied AI", which uses the principles of simulating human thought to carry out one specific task; the other is known as "generalized AI", which seeks to develop machine intelligences that can turn their hands to any task, much like a person.

    Machine Learning (ML)

    Machine learning is the subset of AI which originated in 1959. It evolved from the study of pattern recognition and computational learning theory in artificial intelligence. ML gives computers the ability to "learn" (i.e. progressively improve performance on a specific task) from data, without being explicitly programmed. You encounter machine learning almost every day. Think about ride-sharing apps like Lyft & Uber: how do they determine the price of your ride? Google Maps: how does it analyze traffic movement and predict your arrival time within seconds? Spam filters: how do emails end up automatically in your spam folder? Amazon Alexa, Apple Siri, Microsoft Cortana & Google Home: how do they recognize your speech?

    Deep Learning (DL)

    Deep learning (also known as hierarchical learning, deep machine learning or deep structured learning) is a subset of machine learning where the learning method is based on data representation or feature learning: a set of methods that allows a system to automatically discover the representations needed for feature detection or classification from raw data. Examples include mobile check deposits (converting the handwriting on checks into actual text), Facebook face recognition (seen Facebook recognizing names while tagging?), colorization of black-and-white images, and object recognition.

    In short, all three terms (AI, ML & DL) nest inside one another - recall the examples above: chess, spam emails & object recognition.

    Predictive Analytics (PA)

    Under predictive analytics the goal remains very narrow: the intent is to compute the value of a particular variable at a future point in time. You could say predictive analytics is basically a sub-field of machine learning, while machine learning is more versatile and capable of solving a wider range of problems. There are some techniques where machine learning and predictive analytics overlap, like linear and logistic regression, but others like decision trees, random forests etc. are essentially machine learning techniques. Keep these regression techniques aside for now; I will write detailed blogs about them.

    How does Data Science relate to AI, ML, PA & DL?

    Data science is a fairly general term for processes and methods that analyze and manipulate data. It provides the ground to apply artificial intelligence, machine learning, predictive analytics and deep learning to find meaningful and appropriate information in large volumes of raw data with greater speed and efficiency.

    Types of Machine Learning

    How you classify machine learning depends on the type of task you expect the machine to perform (supervised, unsupervised & reinforcement learning) or on the desired output, i.e. the data. In the end, though, the algorithms - the techniques which help you get the desired result - remain the same.

    Regression: a type of problem where we need to predict a continuous response value, like the value of a stock.

    Classification: a type of problem where we predict a categorical response value, where the data can be separated into specific "classes" - for example whether an email is "spam" or "not spam". (A tiny Spark ML sketch of this case follows at the end of this post.)

    Clustering: a type of problem where we group similar things together, like grouping a set of tweets from Twitter.

    Please don't limit yourself to the types of regression, classifiers & clusters mentioned here; there are a number of other algorithms being developed and used worldwide. Ask yourself which technique fits your requirement. Thank you folks!! If you have any question, please mention it in the comments section below.

    Next: Spark Interview Questions and Answers
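    Since this blog is part of a Spark tutorial, here is a tiny, self-contained sketch of the classification example above (spam vs. not spam) using Spark ML's LogisticRegression. The data is a made-up toy set with two hand-crafted numeric features, purely to show the API shape, not a realistic spam model.

        import org.apache.spark.ml.classification.LogisticRegression
        import org.apache.spark.ml.linalg.Vectors
        import org.apache.spark.sql.SparkSession

        object SpamSketch {
          def main(args: Array[String]): Unit = {
            val spark = SparkSession.builder()
              .appName("Spam classification sketch")
              .master("local[*]")
              .getOrCreate()

            // Toy training data: label 1.0 = spam, 0.0 = not spam;
            // the two features are invented scores (e.g. "salesy words", "known sender")
            val training = spark.createDataFrame(Seq(
              (1.0, Vectors.dense(3.0, 0.0)),
              (0.0, Vectors.dense(0.0, 4.0)),
              (1.0, Vectors.dense(4.0, 1.0)),
              (0.0, Vectors.dense(1.0, 5.0))
            )).toDF("label", "features")

            // Fit the model and check its predictions on the training rows
            val model = new LogisticRegression().fit(training)
            model.transform(training).select("label", "prediction").show()

            spark.stop()
          }
        }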

  • How to convert RDD to Dataframe?

    There are basically three methods by which we can convert an RDD into a dataframe. I am using the spark shell to demonstrate these examples. Open spark-shell and import the libraries which are needed to run the code.

        Scala> import org.apache.spark.sql.{Row, SparkSession}
        Scala> import org.apache.spark.sql.types.{IntegerType, DoubleType, StringType, StructField, StructType}

    Now create a sample RDD with the parallelize method.

        Scala> val rdd = sc.parallelize(
                 Seq(
                   ("One", Array(1,1,1,1,1,1,1)),
                   ("Two", Array(2,2,2,2,2,2,2)),
                   ("Three", Array(3,3,3,3,3,3))
                 )
               )

    Method 1

    If you don't need a header, you can directly create the dataframe with the RDD as the input parameter to the createDataFrame method.

        Scala> val df1 = spark.createDataFrame(rdd)

    Method 2

    If you need a header, you can add it explicitly by calling the toDF method.

        Scala> val df2 = spark.createDataFrame(rdd).toDF("Label", "Values")

    Method 3

    If you need a schema structure, then you need an RDD of [Row] type. Let's create a new rowsRDD for this scenario.

        Scala> val rowsRDD = sc.parallelize(
                 Seq(
                   Row("One", 1, 1.0),
                   Row("Two", 2, 2.0),
                   Row("Three", 3, 3.0),
                   Row("Four", 4, 4.0),
                   Row("Five", 5, 5.0)
                 )
               )

    Now create the schema with the field names which you need.

        Scala> val schema = new StructType().
                 add(StructField("Label", StringType, true)).
                 add(StructField("IntValue", IntegerType, true)).
                 add(StructField("FloatValue", DoubleType, true))

    Now create the dataframe with rowsRDD & schema, and show the dataframe.

        Scala> val df3 = spark.createDataFrame(rowsRDD, schema)
        Scala> df3.show()

    (A quick sketch of a fourth, case-class-based approach is included below.)

    Thank you folks! If you have any question, please mention it in the comments section below.

    Next: Writing data files in Spark
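    As a quick sketch of that fourth approach: define a case class (here called Record, just for illustration) and let Spark infer the schema by reflection with toDF. In spark-shell, spark.implicits._ is already imported; in a standalone application you would import it yourself.

        Scala> case class Record(label: String, values: Array[Int])
        Scala> val df4 = rdd.map { case (label, values) => Record(label, values) }.toDF()
        Scala> df4.printSchema()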

  • Scala IDE \.metadata\.log error fix (Mac)

    Scala IDE is not compatible with Java SE 9 and higher versions. You might need to downgrade to, or install, Java SE 8 in order to fix the issue. Let's go through the steps to fix it.

    Step 1. Check which versions of Java you have installed on your machine. Run this command in your terminal:

        /usr/libexec/java_home --verbose

    As you can see, I have three different versions of Java running on my machine.

    Step 2. Install Java SE 8 (jdk1.8) if you don't find it in your list. Refer to this blog for the Java installation steps.

    Step 3. Now open your .bashrc file (run: vi ~/.bashrc) and copy-paste the line below into it.

        export JAVA_HOME=$(/usr/libexec/java_home -v 1.8)

    Step 4. Save the file (:wq!) and reload your profile (source ~/.bashrc).

    Step 5. Now you need to set an eclipse.ini argument so that the IDE uses the Java 1.8 version. On Mac OS X you can find eclipse.ini by right-clicking (or Ctrl+clicking) the Scala IDE executable in Finder, choosing Show Package Contents, and then locating eclipse.ini in the Eclipse folder under Contents. The path is often /Applications/Scala IDE.app/Contents/Eclipse/eclipse.ini.

    Step 6. Open it with a text editor and copy-paste the lines below into eclipse.ini. Change the version (if needed) according to your Java version; mine is 1.8.0_171. Note that -vm and the path go on two separate lines, and they must appear before the -vmargs line.

        -vm
        /Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Contents/Home/bin

    Step 7. Save the file and exit.

    Step 8. Run the Scala IDE application now and it should start.

    If you are still facing a problem, please mention it in the comments section below. Thank you!
