
  • SQL Server 2014 Standard Download & Installation

    In this post you will learn how to download and install SQL Server 2014 from an ISO image. I will also download the AdventureWorks sample database and load it into our SQL Server 2014 instance. Downloading the ISO image 1. You can download SQL Server from this link. In case the link changes, you can Google "download SQL server enterprise evaluation" and it will show you results like the ones below. 2. Open the link highlighted above and you will find all the latest SQL Server releases available. I am installing SQL Server 2014 for this post; as it is an older version, it is very stable. 3. Click on the little plus sign and select the ISO image to download. 4. Fill out these details and hit Continue. 5. Select your platform based on your system properties, whether it's 32-bit or 64-bit (go to Control Panel and look for System properties). The download will take approximately 10 minutes depending on your network; it's a ~3 GB ISO file, so make sure you have enough space on your computer to perform this installation. Run the ISO image 1. Once the download is complete, double-click the ISO file and it will automatically mount as a drive, as shown below. If it doesn't mount automatically, you can use WinRAR to extract the ISO file. 2. Open the drive to run the setup. SQL Server 2014 Installation 1. Setup will take some time (approximately a minute) to open. You will get this screen. 2. Go to the Installation tab as shown above and click on "New SQL Server stand-alone installation". 3. Click Next; you can select "check for updates" or leave it as default. 4. This step will take some time (approximately 5 minutes) while it installs the setup files. 5. Finally, you will get this screen. Click Next. 6. I have SQL Server Express edition running on my machine, which is why we see this option. We will just continue with the default "new installation" option. 7. Keep the edition as Evaluation and hit Next. 8. Check the box to accept the license terms and hit Next. 9. Select "SQL Server Feature Installation" and hit Next. 10. Select all the features and hit Next as shown below. 11. Enter the name of the instance (I have named it "SQL2014") and hit Next. 12. Leave it as default and hit Next as shown below. 13. Leave the default Windows authentication for simplicity and add the current user; Windows will add the current user automatically when you click the "Add Current User" button. Hit Next. Leave the data directories and FILESTREAM settings as default for simplicity. In a production environment we usually choose different drives for logs and data (for example, in case the server crashes). 14. Leave the Analysis Services mode as default and add the current user. Hit Next. 15. Leave the default options as shown below. 16. Click "Add Current User" and hit Next. 17. Give the controller a name (I have named it "Dataneb") and hit Next. 18. Hit the Install button. 19. Installation will take some time (it took about an hour for me), so sit back and relax. Completion After about an hour the installation completed for me; it might take more or less time for you depending on your system configuration. Now you can open SQL Server Configuration Manager and check whether SQL Server 2014 is running. I also have SQL Server Express edition running on this machine; ignore that. AdventureWorks Database I am loading the AdventureWorks database here as an example. 1. In order to load AdventureWorks, open SSMS (SQL Server Management Studio) and connect to the SQL2014 instance which we just installed. 2. Now download the sample AdventureWorks 2014 database from the Microsoft site. You can go to this link and choose any version; I am downloading AdventureWorks 2014. 3. Once the download completes, move the AdventureWorks2014.bak file to the SQL Server 2014 backup location, which will be C:\Program Files\Microsoft SQL Server\MSSQL12.SQL2014\MSSQL\Backup 4. Now open SSMS => right-click on Databases => Restore Database => select Device => click on the three dots => Add 5. Add the AdventureWorks2014 backup file and hit OK. 6. Once the backup is restored, you will see AdventureWorks2014 listed under your Databases panel; if not, right-click and refresh. That's all! If you have any questions please mention them in the comments section below. Next: How to create SSIS package in Visual Studio 2017

  • Evanescence | Bring me to Life | Drum Sheet Music

    Easy Drum Covers for Beginners | Evanescence Bring Me to Life drum sheet music If you've just started to play drums and you're looking for easy drum covers for beginners, keep reading this blog; here I will list my favorite simple covers for newbies in drumming. Evanescence Bring Me to Life Drum Cover Follow me on Youtube The song starts with a simple 4/4 groove spanning 2 bars, with eighth notes on the hi-hat, snare hits on 2 and 4, and ghost notes on the snare on the "a" of 2 and the "e" of 3. It ends with an open hi-hat on the "and" of 4 in the second bar. Here is what the groove looks like. This brings us to the first verse of the song. The groove here looks like this: You can play the eighth notes mentioned either on a ride cymbal or a crash cymbal; in the actual song it is played on a ride cymbal. This same grouping is played twice in the verse, which brings us to the next part: This is again a two-bar groove with eighth notes played on an open hi-hat and the snare drum on 2 and 4 of each bar. The difficult part here is getting the bass drum right. I would recommend practicing it at a very slow tempo first and then gradually working up to the tempo of the song. There are two bars of this, after which we get back to the groove played in the first verse of the song, i.e. The next part uses a combination of two grooves. The first one is a one-bar groove which looks like this: The second one is a two-bar groove; the first bar looks like this: And the second bar looks like this: This same groove gets repeated in the next measure with a small difference in the second bar of the second group. The next measure uses yet another groove which looks like the following: The next part of the song uses a variation of the groove played after the first verse, and it looks like this: The next part uses the same groove as the first verse; here is how it looks: The song ends with the following groove: I have posted a video of myself playing this song on my YouTube channel. Do subscribe to the channel as well as this blog for more drum tutorials.

  • Elasticsearch Tutorial - What is ELK stack (Elastic stack)

    In this Elasticsearch tutorial you will learn what the ELK stack (Elastic stack) is. We will go through ELK stack examples, load data into the Elastic stack and create a Kibana dashboard. Navigation Menu: Introduction to ELK Stack Installation Load data into Elasticsearch stack with Logstash Create Kibana Dashboard Example Kibana GeoIP Dashboard Example What is the ELK stack (now called the Elastic Stack)? ELK is an acronym for three open-source products: Elasticsearch, Logstash & Kibana. All three components are maintained by Elastic. The ELK stack started as a log analytics solution, but it later evolved into an enterprise search and analytics platform. Elasticsearch is based on the Lucene search engine; you can think of it as a NoSQL database with the capability to index (for text search) and store data. Logstash is basically a data pipeline tool that can connect to various sources through plugins, apply transformations and load data into various targets, including Elasticsearch. In short, Logstash collects and transforms data and is sometimes used for data shipping as well. Kibana is the data visualization platform where you will create dashboards. Another tool, Filebeat, is a member of the Beats family and can also perform some tasks similar to Logstash. ELK Stack Architecture Here is the basic architecture of the Elastic stack. Notice I haven't mentioned the source in the diagram below. Usually the data sources for the ELK stack are various log files, for example application logs, server logs, database logs, network switch logs, router logs, etc. These log files are consumed using Filebeat, which acts as a data collector gathering the various types of log files (when we have more than one type of log file). Nowadays, Kafka is often used as another layer which distributes the files collected by Filebeat to various queues, from where Logstash transforms the data and stores it in Elasticsearch for visualization. So the complete flow would look like: [application logs, server logs, database logs, network switch logs, router logs, etc.] => Filebeat => Kafka => ELK Stack. Please note this can change based on the architecture a project needs; if there are only a few types of log files, you might not even need Filebeat or Kafka and can dump logs directly into the ELK stack. Fun Fact: ELK stack Google Trend Elasticsearch is the most popular component of the stack; refer to the Google Trends chart shown below. Why is the ELK stack so popular worldwide? Basically due to three major reasons. First of all, price: it's open source, easy to learn and free of cost. If you consider other visualization tools like QlikView and Tableau, Kibana provides you similar capabilities without any hidden cost. Elasticsearch is used by many big companies, for example Wikipedia & GitHub. Second, its elegant user interface: you can spend time exploring and reviewing data, not trying to figure out how to navigate the interface. And last but not least, it's extensible: Elasticsearch is a schema-free NoSQL database and can scale horizontally. It is also used for real-time analytics. Next: ELK Installation Navigation Menu: Introduction to ELK Stack Installation Load data into Elasticsearch stack with Logstash Create Kibana Dashboard Example Kibana GeoIP Dashboard Example #ELKStack #Elasticsearch #Logstash #Kibana #ElasticsearchTutorial #ElasticStack #ELKTutorial

  • How to Install OEL (Oracle Enterprise Linux) on Virtualbox?

    This blog article focuses on configuring VirtualBox to create a new VM for Oracle Linux 7 (OEL) and installing Oracle Linux 7 from the ISO image as a guest operating system. The Oracle Linux 7 ISO image can be downloaded from edelivery.oracle.com – Oracle Software Delivery Cloud. You must have a valid Oracle account (free) to download the Linux ISO. Link: edelivery.oracle.com (I downloaded the x86 64-bit platform for this post; check your system configuration before you download) Configuring VirtualBox To create a VM, click on the New button in the top left corner and provide a descriptive name for the VM, the type of OS and the version. The name that is specified will be used to identify the VM configuration. Once the required information is provided, click on the Next button. Specify how much RAM you want to assign to your virtual machine; I have allocated 20 GB. Click on the Next button and let's keep the hard disk at the default size. Hit the Create button and keep the hard disk file type as VDI; we can change this later. Hit the Next button and keep storage as dynamically allocated. Hit the Next button and configure the file location and size; for this post let's go with 20 GB. Hit the Create button. Once the VDI disk is created with the specified size, you will be redirected back to the main screen and you will be able to see all the details as shown below: Go to Settings and select the Network tab, change the configuration to "Bridged Adapter" and Promiscuous Mode to "Allow All" as shown below: Go to the System tab and assign the number of processors; I have selected 4 processors for this example: Now go to the Storage tab and click on the disk image to choose the Virtual Optical Disk ISO image file which you downloaded initially from Oracle Software Delivery Cloud. Click on OK and now you will be able to see the Optical Drive ISO image file as highlighted below (it was empty before; refer to the previous screenshots). Ignore the name, as it depends on which ISO image you downloaded from Oracle. Installing Oracle Linux From the Oracle VM VirtualBox Manager screen, select the VM that you just created and click on the green Start icon at the top of the screen. Select the language as English (United States) and hit the Continue button. Now select the Installation Destination as Local Standard Disks. Now go back and select Network & Host Name (currently it's shown as not connected) and turn ON the connection. Now set up the root password: Create another user if you need one; for instance, I have created a hadoop user: Let the post-installation setup tasks finish. Once that's done, you can click on the Software Selection button, choose the "Server with GUI" option and choose the KDE desktop. Click on the Done button after you select all the options that you want installed. From the Configuration screen, click on the Reboot button. Once the VM reboots, you will be directed to the login prompt. Since we did a bare minimal installation, we will not enter a GUI. Log in with the root user and the password which we set up earlier. Install the net-tools package by typing: yum install net-tools Once net-tools is installed, type: ifconfig The inet value shown above tells you your IP address. That's it! To check your Linux version, type: uname -a and you will get something like below: [root@localhost anonymous]# uname -a Linux localhost.localdomain 4.1.12-112.16.4.el7uek.x86_64 #2 SMP Mon Mar 12 23:57:12 PDT 2018 x86_64 x86_64 x86_64 GNU/Linux If you have any questions please mention them in the comments section below. Thank you! #RedhatLinux #OracleLinuxInstallation #Installing #Oracle #Linux #on #Virtual #Box

  • What is "is" in Python?

    The operators is and is not test for object identity: x is y is true if and only if x and y are the same object (their location in memory is the same). Object identity is determined using the id() function. x is not y yields the inverse truth value. It's easy to understand if we compare these operators with the equality operators. is and is not compare the object reference; they check for identity. == and != compare the object value; they check for equality. For example, if you consider integer objects (excluding integers from -5 to 256), >>> A=9999 >>> B=9999 >>> A == B, A is B (True, False) >>> A, B (9999, 9999) >>> id(A), id(B) (4452685328, 4452686992) Python caches integer objects in the range -5 to 256 as single objects, so the identity is the same. >>> A=99 >>> B=99 >>> A == B, A is B (True, True) >>> A, B (99, 99) >>> id(A), id(B) (4404392064, 4404392064) Let's see the behavior of other immutable objects like int & float, strings, tuples and booleans: >>> 1 == 1.0, 1 is 1.0 (True, False) >>> 1 == 1, 1 is 1 (True, True) >>> 'A' == 'A', 'A' is 'A' (True, True) >>> (1,2) == (1,2), (1,2) is (1,2) (True, True) >>> True == True, True is True (True, True) (Note that identity results for strings and tuples like these can vary across Python versions, because they depend on interning and constant folding.) What about mutable objects - list, set, dict? The behavior is totally different. >>> [1,2] == [1,2], [1,2] is [1,2] (True, False) >>> {1,2} == {1,2}, {1,2} is {1,2} (True, False) >>> {'k1':1} == {'k1':1}, {'k1':1} is {'k1':1} (True, False) The is operator is used only when you want to compare object identity; for regular value comparison, the equality operator is used.

  • Loading CSV data into Elasticsearch with Logstash

    Refer to my previous blogs (Linux | Mac users) to install the ELK stack on your machine. Once installation is done, there are a couple of ways I am aware of to load a CSV file into Elasticsearch (one via Logstash and another with Filebeat). In this blog we will be using Logstash to load the file. I am using sample Squid access logs (a comma-separated CSV file) to explain this blog. You can find the file format details at this link. Sample Data Go ahead and copy-paste these records to create the access_log.csv file. I picked this format because CSV is a widely used format in production environments and we have various types of attributes (data types) in the file. $ more /Volumes/MYLAB/testdata/access_log.csv Time,Duration,Client_address,Result_code,Bytes,Request_method,Requested_URL,User,Hierarchy_code,Type 1121587707.473,60439,219.138.188.61,TCP_MISS/503,0,CONNECT,203.84.194.44:25,-,DIRECT/203.84.194.44,- 1121587709.448,61427,219.138.188.61,TCP_MISS/503,0,CONNECT,203.84.194.50:25,-,DIRECT/203.84.194.50,- 1121587709.448,61276,219.138.188.61,TCP_MISS/503,0,CONNECT,67.28.114.36:25,-,DIRECT/67.28.114.36,- 1121587709.449,60148,219.138.188.61,TCP_MISS/503,0,CONNECT,4.79.181.12:25,-,DIRECT/4.79.181.12,- 1121587710.889,60778,219.138.188.61,TCP_MISS/503,0,CONNECT,203.84.194.39:25,-,DIRECT/203.84.194.39,- 1121587714.803,60248,219.138.188.61,TCP_MISS/503,0,CONNECT,203.84.194.50:25,-,DIRECT/203.84.194.50,- 1121587714.803,59866,219.138.188.61,TCP_MISS/503,0,CONNECT,203.84.194.43:25,-,DIRECT/203.84.194.43,- 1121587719.834,60068,219.138.188.61,TCP_MISS/503,0,CONNECT,203.84.194.45:25,-,DIRECT/203.84.194.45,- 1121587728.564,59642,219.138.188.55,TCP_MISS/503,0,CONNECT,168.95.5.45:25,-,DIRECT/168.95.5.45,- My file looks something like this. Start Elastic Now start Elasticsearch and Kibana (if you don't remember how to start them, refer to my previous blogs: Linux | Mac users). Don't kill these processes; both are required by Logstash to load the data. $ elasticsearch $ kibana Logstash Configuration In order to read a CSV file with Logstash, you need to create a configuration file which holds all the configuration details for the access log file: input, filter & output. In short, the input section contains details like the filename, location, start position, etc. The filter section contains the file type, separator, column details, transformations, etc. The output section contains the host where the data will be written, the index name (should be in lower case), the document type, etc. These sections look like JSON, but they are not JSON; the format is specific to Logstash. I have created a config file under the config directory in Logstash as shown below. $ more /usr/local/var/homebrew/linked/logstash-full/libexec/config/logstash_accesslog.config input { file { path => "/Volumes/MYLAB/testdata/access_log.csv" start_position => "beginning" sincedb_path => "/Volumes/MYLAB/testdata/logstash.txt" } } filter { csv { separator => "," columns => [ "Time" , "Duration" , "Client_address" , "Result_code" , "Bytes" , "Request_method" , "Requested_URL" , "User" , "Hierarchy_code" , "Type" ] } date { match => [ "Time" , "UNIX" ] target => "EventTime" } mutate {convert => ["Duration", "integer"]} mutate {convert => ["Bytes", "integer"]} } output { elasticsearch { hosts => "localhost" index => "logstash-accesslog" } stdout {} } Explanation! These settings are very basic and straightforward. You use the columns setting to define the list of fields within quotes (if you face issues, use single quotes instead).
"mutate" is basically doing minor datatype conversion. Also, "match" is used to convert UNIX timestamp to human-readable time format. Further, logstash-accesslog is the index name which I am using for Squid access logs. Still if there is any question regarding any of the tags please comment (in comment section below) and I will get back to you as soon as possible. Also, you can change sincedb_path to /dev/null if you don't want to keep the track of loaded files. If you want to reload the same file again make sure you delete the entry from sincedb_path file (logstash.txt in this case). Here is my config file snapshot, zoom it little bit to see the content :) Run Logstash & Load data You are all set to start Logstash now with configuration file which we just created. Follow the below steps in order to run Logstash with config file. It will take few seconds to index all the records. Change your logstash home location accordingly, mine is homebrew as I am using Mac. $ /usr/local/var/homebrew/linked/logstash-full/bin/logstash -f /usr/local/var/homebrew/linked/logstash-full/libexec/config/logstash_accesslog.config Make sure you are not getting any Error while loading the file in logstash output, otherwise file will not load and there will be no index creation. Now open Kibana and run below command in "Dev Tools" tab in order to see how many records got loaded. I loaded 10 records just for demonstration Kibana Index Creation Now go to "Management" tab in Kibana and click on Index Patterns => Create Index Pattern. Create Index pattern with same name which we mentioned in configuration file logstash-accesslog. Hit "Next step" and select time filter => I don't want to use time filter. Now hit "Create index pattern". Now go to "Discover" tab and you will be able to view the indexed data which we just loaded. Kibana Dashboard I don't have much data to create a Kibana dashboard here, but just for demonstration purpose lets say you want to see number of events occurred/milliseconds. It's impractical example, as you will never have such use case. Go to "Visualize" tab and hit Create new. Select pie chart for example, Now select logstash-accesslog index and apply changes in buckets as highlighted below, Thats all, you can see number of events occurred each milliseconds. If you have any question please write in comments section below. Thank you. Next: Create Kibana Dashboard Example Navigation Menu: Introduction to ELK Stack Installation Loading data into Elasticsearch with Logstash Create Kibana Dashboard Example Kibana GeoIP Dashboard Example Loading data into Elasticsearch using Apache Spark

  • Spark RDD, Transformations and Actions example

    Main menu: Spark Scala Tutorial In this Apache Spark RDD tutorial you will learn about: Spark RDD with example What is RDD in Spark? Spark transformations Spark actions Spark actions and transformations example Spark RDD operations What is an RDD in Spark? According to the Apache Spark documentation - "Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat". Example (for easy understanding) - not a practical case I seriously didn't understand anything when I read the above definition for the first time, except the fact that RDD is an acronym for Resilient Distributed Dataset. Let's try to understand RDD with a simple example. Assume that you have a collection of 100 movies and you have stored it on your personal laptop. This way you have the complete data residing on a single machine (you can call it a node), i.e. your personal laptop. Now instead of having all the movies on a single machine, let's say you distributed the movies: 50 movies on laptop A and 50 movies on laptop B. This is where the term Distributed comes into the picture; 50% of your data resides on one machine and 50% on another. Now let's say you were worried that either laptop could malfunction and you would lose your movies, so you took backups: a backup of the 50 movies from laptop A stored on laptop B, and similarly a backup of the 50 movies from laptop B stored on laptop A. This is where the term Resilient or Fault-tolerant comes into the picture; the dictionary meaning of resilient is to withstand or recover quickly from difficult conditions, and the backup of your movies makes sure that you can recover the data anytime from the other machine (the other node) if a system malfunctions. The number of times you create the backup, or replicate the data onto another machine for recovery, is also called the replication factor. In the above case the replication factor was one, as you replicated the data once. In real-life scenarios you will encounter huge amounts of data (like the movie data in the above example) distributed across thousands of worker nodes (like the laptops in the above example), the combination of which is called a cluster, with higher replication factors (in the above example it was just 1) in order to maintain a fault-tolerant system. Basic facts about Spark RDDs Resilient Distributed Datasets (RDDs) are basically an immutable collection of elements which is used as the fundamental data structure in Apache Spark. You can create RDDs by two methods - parallelizing a collection & referencing external datasets. RDDs are immutable, i.e. read-only data structures, so you can't change the original RDD, but you can always create a new one. RDDs support two types of Spark operations - transformations & actions. Parallelize collection scala> sc.parallelize(1 to 10 by 2) res8: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[4] at parallelize at :25 Referencing a dataset scala> val dataFile = sc.textFile("/testdata/MountEverest.txt") dataFile: org.apache.spark.rdd.RDD[String] = /testdata/MountEverest.txt See - How to create an RDD? Spark Transformations & Actions In Spark, transformations are functions that produce a new RDD from an existing RDD. When you need actual data from an RDD, you need to apply an action.
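    To make the transformation/action split concrete before going through the full list, here is a minimal spark-shell sketch; the variable names are mine, just for illustration, and nothing is actually computed until the actions at the end are called.

// Transformations only describe a new RDD; they are evaluated lazily
val numbers = sc.parallelize(1 to 10)
val doubled = numbers.map(x => x * 2)        // not computed yet
val evens = doubled.filter(_ % 4 == 0)       // still not computed

// Actions trigger the real computation and return results to the driver
evens.collect()                              // Array(4, 8, 12, 16, 20)
evens.count()                                // 5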
Below is the list of common transformations supported by Spark. But before that, those who are new to programming.. You will be using lambda functions or sometimes called anonymous functions to pass through these Spark transformations. So you should have basic understanding of lambda functions. In short, lambda functions are convenient way to write a function when you have to use functions just in one place. For example, if you want to double the number you can simply write; x => x + x like you do in Python and other languages. Syntax in Scala would be like this, scala> val lfunc = (x:Int) => x + x lfunc: Int => Int = // This tells that function takes integer and returns integer scala> lfunc(3) res0: Int = 6 Sample Data I will be using "Where is the Mount Everest?" text data. I just picked some random data to go through these examples. Where is Mount Everest? (MountEverest.txt) Mount Everest (Nepali: Sagarmatha सगरमाथा; Tibetan: Chomolungma ཇོ་མོ་གླང་མ; Chinese Zhumulangma 珠穆朗玛) is Earth's highest mountain above sea level, located in the Mahalangur Himal sub-range of the Himalayas. The international border between Nepal (Province No. 1) and China (Tibet Autonomous Region) runs across its summit point. - Reference Wikipedia scala> val mountEverest = sc.textFile("/testdata/MountEverest.txt") mountEverest: org.apache.spark.rdd.RDD[String] = /testdata/MountEverest.txt MapPartitionsRDD[1] at textFile at :24 Spark Transformations I encourage you all to run these examples on Spark-shell side-by-side. Don't just read through them. Type them on your keyboard it will help you learn. map(func) This transformation redistributes the data after passing each element through func. 1. For example, if you want to split the Mount Everest text into individual words, you just need to pass this lambda func x => x.split(" ") and it will create a new RDD as shown below. scala> val words = mountEverest.map(x => x.split(" ")) words: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[3] at map at :25 Did you spot the difference between mountEverest and words RDD? Yeah exactly, one is String type and after applying map transformation it's now Array of String. scala> words.collect() res1: Array[Array[String]] = Array(Array(Mount, Everest, (Nepali:, Sagarmatha, सगरमाथा;, Tibetan:, Chomolungma, ཇོ་མོ་གླང་མ;, Chinese, Zhumulangma, 珠穆朗玛), is, Earth's, highest, mountain, above, sea, level,, located, in, the, Mahalangur, Himal, sub-range, of, the, Himalayas., The, international, border, between, Nepal, (Province, No., 1), and, China, (Tibet, Autonomous, Region), runs, across, its, summit, point.)) To return all the elements of words RDD we have called collect() action. It's very basic Spark action. 2. Now, suppose you want to get the word count in this text file, you can do something like this - first split the file and then get the length or size of collection. scala> mountEverest.map(x => x.split(" ").length).collect() res6: Array[Int] = Array(45) // Mount Everest file has 45 words scala> mountEverest.map(x => x.split(" ").size).collect() res7: Array[Int] = Array(45) 3. Lets say you want to get total number of characters in the file, you can do it like this. scala> mountEverest.map(x => x.length).collect() res5: Array[Int] = Array(329) // Mount Everest file has 329 characters 4. Suppose you want to make all text upper/lower case, you can do it like this. 
scala> mountEverest.map(x => x.toUpperCase()).collect() res9: Array[String] = Array(MOUNT EVEREST (NEPALI: SAGARMATHA सगरमाथा; TIBETAN: CHOMOLUNGMA ཇོ་མོ་གླང་མ; CHINESE ZHUMULANGMA 珠穆朗玛) IS EARTH'S HIGHEST MOUNTAIN ABOVE SEA LEVEL, LOCATED IN THE MAHALANGUR HIMAL SUB-RANGE OF THE HIMALAYAS. THE INTERNATIONAL BORDER BETWEEN NEPAL (PROVINCE NO. 1) AND CHINA (TIBET AUTONOMOUS REGION) RUNS ACROSS ITS SUMMIT POINT.) scala> mountEverest.map(x=>x.toLowerCase()).collect() res35: Array[String] = Array(mount everest (nepali: sagarmatha सगरमाथा; tibetan: chomolungma ཇོ་མོ་གླང་མ; chinese zhumulangma 珠穆朗玛) is earth's highest mountain above sea level, located in the mahalangur himal sub-range of the himalayas.the international border between nepal (province no. 1) and china (tibet autonomous region) runs across its summit point.) flatmap(func) As name says it's flattened map. This is also similar to map, except the fact that it gives you more flattened output. For example, scala> val rdd = sc.parallelize(Seq("Where is Mount Everest","Himalayas India")) rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[22] at parallelize at :24 scala> rdd.collect res26: Array[String] = Array(Where is Mount Everest, Himalayas India) 1. We have two items in Parallel Collection RDD - "Where is Mount Everest" and "Himalayas India". scala> rdd.map(x => x.split(" ")).collect res21: Array[Array[String]] = Array(Array(Where, is, Mount, Everest), Array(Himalayas, India)) 2. When map() transformation is applied, it results into two separate array of strings (1st element (Where, is, Mount, Everest) and 2nd element => (Himalayas, India)). scala> rdd.flatMap(x => x.split(" ")).collect res23: Array[String] = Array(Where, is, Mount, Everest, Himalayas, India) 3. For flatMap(), output is flattened to single array of string Array[String]. Thus flatMap() is similar to map, where each input item is mapped to 0 or more output items (1st item => 4 elements, 2nd item => 2 elements). This will give you clear picture, scala> rdd.map(x => x.split(" ")).count() res24: Long = 2 // as map gives one to one output hence 2=>2 scala> rdd.flatMap(x => x.split(" ")).count() res25: Long = 6 // as flatMap gives one to zero or more output hence 2=>6 map() => [Where is Mount Everest, Himalayas India] => [[Where, is, Mount, Everest],[Himalayas, India]] flatMap() => [Where is Mount Everest, Himalayas India] => [Where, is, Mount, Everest, Himalayas, India] 4. Getting back to mountEverest RDD, suppose you want to get the length of each individual word. scala> mountEverest.flatMap(x=>x.split(" ")).map(x=>(x, x.length)).collect res82: Array[(String, Int)] = Array((Mount,5), (Everest,7), ((Nepali:,8), (Sagarmatha,10), (सगरमाथा;,8), (Tibetan:,8), (Chomolungma,11), (ཇོ་མོ་གླང་མ;,12), (Chinese,7), (Zhumulangma,11), (珠穆朗玛),5), (is,2), (Earth's,7), (highest,7), (mountain,8), (above,5), (sea,3), (level,,6), (located,7), (in,2), (the,3), (Mahalangur,10), (Himal,5), (sub-range,9), (of,2), (the,3), (Himalayas.The,13), (international,13), (border,6), (between,7), (Nepal,5), ((Province,9), (No.,3), (1),2), (and,3), (China,5), ((Tibet,6), (Autonomous,10), (Region),7), (runs,4), (across,6), (its,3), (summit,6), (point.,6)) filter(func) As name tells it is used to filter elements same like where clause in SQL and it is case sensitive. 
For example, scala> rdd.collect res26: Array[String] = Array(Where is Mount Everest, Himalayas India) // Returns one match scala> rdd.filter(x=>x.contains("Himalayas")).collect res31: Array[String] = Array(Himalayas India) // Contains is case sensitive scala> rdd.filter(x=>x.contains("himalayas")).collect res33: Array[String] = Array() scala> rdd.filter(x=>x.toLowerCase.contains("himalayas")).collect res37: Array[String] = Array(Himalayas India) Filtering even numbers, scala> sc.parallelize(1 to 15).filter(x=>(x%2==0)).collect res57: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14) scala> sc.parallelize(1 to 15).filter(_%5==0).collect res59: Array[Int] = Array(5, 10, 15) mapPartitions(func type Iterator) Similar to map() transformation but in this case function runs separately on each partition (block) of RDD unlike map() where it was running on each element of partition. Hence mapPartitions are also useful when you are looking for performance gain (calls your function once/partition not once/element). Suppose you have elements from 1 to 100 distributed among 10 partitions i.e. 10 elements/partition. map() transformation will call func 100 times to process these 100 elements but in case of mapPartitions(), func will be called once/partition i.e. 10 times. Secondly, mapPartitions() holds the data in-memory i.e. it will store the result in memory until all the elements of the partition has been processed. mapPartitions() will return the result only after it finishes processing of whole partition. mapPartitions() requires an iterator input unlike map() transformation. What is an Iterator? - An iterator is a way to access collection of elements one-by-one, its similar to collection of elements like List(), Array() etc in few ways but the difference is that iterator doesn't load the whole collection of elements in memory all together. Instead iterator loads elements one after another. In Scala you access these elements with hasNext and Next operation. For example, scala> sc.parallelize(1 to 9, 3).map(x=>(x, "Hello")).collect res3: Array[(Int, String)] = Array((1,Hello), (2,Hello), (3,Hello), (4,Hello), (5,Hello), (6,Hello), (7,Hello), (8,Hello), (9,Hello)) scala> sc.parallelize(1 to 9, 3).partitions.size res95: Int = 3 scala> sc.parallelize(1 to 9, 3).mapPartitions(x=>(Array("Hello").iterator)).collect res7: Array[String] = Array(Hello, Hello, Hello) scala> sc.parallelize(1 to 9, 3).mapPartitions(x=>(List(x.next).iterator)).collect res11: Array[Int] = Array(1, 4, 7) In first example, I have applied map() transformation on dataset distributed between 3 partitions so that you can see function is called 9 times. In second example, when we applied mapPartitions(), you will notice it ran 3 times i.e. for each partition once. We had to convert string "Hello" into iterator because mapPartitions() takes iterator as input. In thirds step, I tried to get the iterator next value to show you the element. Note that next is always increasing value, so you can't step back. See this, scala> sc.parallelize(1 to 9, 3).mapPartitions(x=>(List(x.next,x.next, "|").iterator)).collect res18: Array[Any] = Array(1, 2, |, 4, 5, |, 7, 8, |) In first call next value for partition 1 changed from 1 => 2 , for partition 2 it changed from 4 => 5 and similarly for partition 3 it changed from 7 => 8. You can keep this increasing until hasNext is False (hasNext is a property of iteration which tells you whether collection has ended or not, it returns you True or False based on items left in the collection). 
For example, scala> sc.parallelize(1 to 9, 3).mapPartitions(x=>(List(x.next, x.hasNext).iterator)).collect res19: Array[AnyVal] = Array(1, true, 4, true, 7, true) You can see hasNext is true because there are elements left in each partition. Now suppose we access all three elements from each partition, then hasNext will result false. For example, scala> sc.parallelize(1 to 9, 3).mapPartitions(x=>(List(x.next, x.next, x.next, x.hasNext).iterator)).collect res20: Array[AnyVal] = Array(1, 2, 3, false, 4, 5, 6, false, 7, 8, 9, false) Just for our understanding, if you will try to access next 4th time, you will get error which is expected. scala> sc.parallelize(1 to 9, 3).mapPartitions(x=>(List(x.next, x.next, x.next, x.next,x.hasNext).iterator)).collect 19/07/31 11:14:42 ERROR Executor: Exception in task 1.0 in stage 18.0 (TID 56) java.util.NoSuchElementException: next on empty iterator Think, map() transformation as special case of mapPartitions() where you have just 1 element in each partition. Isn't it? mapPartitionsWithIndex(func) Similar to mapPartitions, but good part is that you have index to see the partition position. For example, scala> val mp = sc.parallelize(List("One","Two","Three","Four","Five","Six","Seven","Eight","Nine"), 3) mp: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[38] at parallelize at :24 scala> mp.collect res23: Array[String] = Array(One, Two, Three, Four, Five, Six, Seven, Eight, Nine) scala> mp.mapPartitionsWithIndex((index, iterator) => {iterator.toList.map(x => x + "=>" + index ).iterator} ).collect res26: Array[String] = Array(One=>0, Two=>0, Three=>0, Four=>1, Five=>1, Six=>1, Seven=>2, Eight=>2, Nine=>2) Index 0 (first partition) has three values as expected, similarly other 2 partitions. If you have any question please mention it in comments section at the end of this blog. sample(withReplacement, fraction, seed) Generates a fraction RDD from an input RDD. Note that second argument fraction doesn't represent the fraction of actual RDD. It actually tells the probability of each element in the dataset getting selected for the sample. Seed is optional. First boolean argument decides type of sampling algorithm. For example, scala> sc.parallelize(1 to 10).sample(true, .4).collect res103: Array[Int] = Array(4) scala> sc.parallelize(1 to 10).sample(true, .4).collect res104: Array[Int] = Array(1, 4, 6, 6, 6, 9) // Here you can see fraction 0.2 doesn't represent fraction of rdd, 4 elements selected out of 10. scala> sc.parallelize(1 to 10).sample(true, .2).collect res109: Array[Int] = Array(2, 4, 7, 10) // Fraction set to 1 which is the max probability (0 to 1), so each element got selected. scala> sc.parallelize(1 to 10).sample(false, 1).collect res111: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) union(otherDataset) Similar to SQL union, but it keeps duplicate data. scala> val rdd1 = sc.parallelize(List("apple","orange","grapes","mango","orange")) rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[159] at parallelize at :24 scala> val rdd2 = sc.parallelize(List("red","green","yellow")) rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[160] at parallelize at :24 scala> rdd1.union(rdd2).collect res116: Array[String] = Array(apple, orange, grapes, mango, orange, red, green, yellow) scala> rdd2.union(rdd1).collect res117: Array[String] = Array(red, green, yellow, apple, orange, grapes, mango, orange) intersection(otherDataset) Returns intersection of two datasets. 
    For example, scala> val rdd1 = sc.parallelize(-5 to 5) rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[171] at parallelize at :24 scala> val rdd2 = sc.parallelize(1 to 10) rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[172] at parallelize at :24 scala> rdd1.intersection(rdd2).collect res119: Array[Int] = Array(4, 1, 5, 2, 3) distinct() Returns a new dataset with distinct elements. For example, we don't have the duplicate orange now. scala> val rdd = sc.parallelize(List("apple","orange","grapes","mango","orange")) rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[186] at parallelize at :24 scala> rdd.distinct.collect res121: Array[String] = Array(grapes, orange, apple, mango) Due to some technical issues I had to move some content of this page to another area; please refer to it for the remaining list of transformations: groupByKey(), reduceByKey(), aggregateByKey(), sortByKey(), join(), cartesian(), coalesce(), repartition(). Sorry for the inconvenience, guys. A short sketch of these key-based transformations is also included at the end of this post. Now, as said earlier, RDDs are immutable, so you can't change the original RDD, but you can always create a new RDD with Spark transformations like map, flatMap, filter, groupByKey, reduceByKey, mapValues, sample, union, intersection, distinct, sortByKey, etc. RDD transformations are broadly classified into two categories - narrow & wide transformations. In a narrow transformation like map or filter, all the elements that are required to compute the records in a single partition live in a single partition of the parent RDD; a limited subset of partitions is used to calculate the result. In a wide transformation like groupByKey or reduceByKey, the elements that are required to compute the records in a single partition may live in many partitions of the parent RDD. Spark Actions When you want to work on the actual dataset, you need to perform Spark actions on RDDs, like count, reduce, collect, first, takeSample, saveAsTextFile, etc. Transformations are lazy in nature, i.e. nothing happens when a transformation is defined; the actual execution happens only when an action is called. RDDs are computed only when an action is applied to them; this is also called lazy evaluation. Spark evaluates an expression only when its value is needed by an action. When you call an action, it actually triggers the transformations to act upon the RDD, dataset or dataframe, and only then is the RDD, dataset or dataframe computed in memory. In short, transformations will actually occur only when you apply an action; before that it's just a line of unevaluated code :) Below is the list of Spark actions. reduce() It aggregates the elements of the dataset. For example, scala> val rdd = sc.parallelize(1 to 15).collect rdd: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15) scala> val rdd = sc.parallelize(1 to 15).reduce(_ + _) rdd: Int = 120 scala> val rdd = sc.parallelize(Array("Hello", "Dataneb", "Spark")).reduce(_ + _) rdd: String = SparkHelloDataneb scala> val rdd = sc.parallelize(Array("Hello", "Dataneb", "Spark")).map(x =>(x, x.length)).flatMap(l=> List(l._2)).collect rdd: Array[Int] = Array(5, 7, 5) scala> rdd.reduce(_ + _) res96: Int = 17 scala> rdd.reduce((x, y)=>x+y) res99: Int = 17 collect(), count(), first(), take() Collect returns all the elements of the dataset as an array.
For example scala> sc.parallelize(1 to 20, 4).collect res100: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20) Counts the number of elements scala> sc.parallelize(1 to 20, 4).count res101: Long = 20 First returns the first element scala> sc.parallelize(1 to 20, 4).first res102: Int = 1 Take returns the number of elements you pass as argument scala> sc.parallelize(1 to 20, 4).take(5) res104: Array[Int] = Array(1, 2, 3, 4, 5) takeSample() It returns the random sample of size n. Boolean input is for with or without replacement. For example, scala> sc.parallelize(1 to 20, 4).takeSample(false,4) res107: Array[Int] = Array(15, 2, 5, 17) scala> sc.parallelize(1 to 20, 4).takeSample(false,4) res108: Array[Int] = Array(12, 5, 4, 11) scala> sc.parallelize(1 to 20, 4).takeSample(true,4) res109: Array[Int] = Array(18, 4, 1, 18) takeOrdered() It returns the elements in ordered fashion. For example, scala> sc.parallelize(1 to 20, 4).takeOrdered(7) res117: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7) Just opposite to top() action scala> sc.parallelize(1 to 20, 4).top(7) res118: Array[Int] = Array(20, 19, 18, 17, 16, 15, 14) countByKey() It takes (key, value) pair and returns (key, count of key). For example, scala> sc.parallelize(Array("Apple","Banana","Grapes","Oranges","Grapes","Banana")).map(k=>(k,1)).countByKey() res121: scala.collection.Map[String,Long] = Map(Grapes -> 2, Oranges -> 1, Banana -> 2, Apple -> 1) saveAsTextFile() It saves the dataset as text files in local directory or HDFS etc. You can reduce the number of files by coalesce transformation. scala>sc.parallelize(Array("Apple","Banana","Grapes","Oranges","Grapes","Banana")).saveAsTextFile("sampleFruits.txt") // Just one partition file with coalesce scala>sc.parallelize(Array("Apple","Banana","Grapes","Oranges","Grapes","Banana")).coalesce(1).saveAsTextFile("newsampleFruits.txt") saveAsObjectFile() It writes the data into simple format using Java serialization and you can load it again using sc.objectFile() scala> sc.parallelize(List(1,2)).saveAsObjectFile("/Users/Rajput/sample") foreach() It is generally used when you want to carry out some operation on output for each element. Like loading each element into database. scala> sc.parallelize("Hello").collect res139: Array[Char] = Array(H, e, l, l, o) scala> sc.parallelize("Hello").foreach(x=>println(x)) l H e l o // Output order of elements is not same every time scala> sc.parallelize("Hello").foreach(x=>println(x)) H e l o l Spark Workflow In this section you will understand how Spark program flows, like how you create intermediate RDDs and apply transformations and actions. You first create RDDs with parallelize method or referencing external dataset. Apply Transformations to create new RDDs based on your requirement. You will have list of RDDs called Lineage. Apply Actions on RDDs. Get your Result. Transformations & Actions example Let's try to implement above facts with some basic example which will give you more clear picture. Open spark-shell with below command in your terminal (refer mac/windows if you don't have spark installed yet). ./bin/spark-shell You can see SparkContext automatically created for you with all (*) local resources and app id in above screenshot. You can also check spark context by running sc command. res0 is nothing but result set zero for command sc. We already read about SparkContext in previous blog. 1. Create RDD, let's say by parallelize method with number of partitions 2. 
    The RDD below will basically be a list of characters distributed across 2 partitions. 2. Now, you can either apply transformations to create new RDDs (the lineage) or simply apply an action to show the result. Let's first apply a few actions. res1 to res5 show the result of each action: collect, first, count, take, reduce, saveAsTextFile. Note (lazy evaluation) that when you execute an action, Spark does the actual evaluation to bring back the result. Now let's see the sample.csv file, which is the result of the last action. Remember we created 2 partitions in the first step; that's the reason we have 2 files, part-00000 & part-00001, with an equal share of the data. 3. Now let's try to apply a few transformations in order to create an RDD lineage. Refer to the image shown above. In the first step we applied a filter transformation to filter the character 'a', creating a new RDD called MapPartitionsRDD[2] from our initial RDD ParallelCollectionRDD[0]. Similarly, in the third step we filtered the letter 'x' to create another RDD, MapPartitionsRDD[3]. In the last step we used the map & reduceByKey transformations to group the characters and get their counts, generating a new RDD, ShuffleRDD[5]. As we applied 2 transformations on one RDD, i.e. map and reduceByKey, you will notice RDD[4] is missing. Spark internally saves the intermediate RDD[4] to generate the resultant ShuffleRDD[5], which is not printed in the output. ParallelCollectionRDD[0], MapPartitionsRDD[2], MapPartitionsRDD[3], RDD[4] and ShuffleRDD[5] together are called the lineage; you can think of it as the intermediate collection of RDDs which Spark needs in order to evaluate your next action. 4. Now, you can notice that res7, res8 and res9 are nothing but actions which we applied on the lineage RDDs to get the results. Thank you!! If you really liked the post and you have any questions, please don't forget to write them in the comments section below. Next: Loading data in Apache Spark Navigation menu 1. Apache Spark and Scala Installation 1.1 Spark installation on Windows 1.2 Spark installation on Mac 2. Getting Familiar with Scala IDE 2.1 Hello World with Scala IDE 3. Spark data structure basics 3.1 Spark RDD Transformations and Actions example 4. Spark Shell 4.1 Starting Spark shell with SparkContext example 5. Reading data files in Spark 5.1 SparkContext Parallelize and read textFile method 5.2 Loading JSON file using Spark Scala 5.3 Loading TEXT file using Spark Scala 5.4 How to convert RDD to dataframe? 6. Writing data files in Spark 6.1 How to write single CSV file in Spark 7. Spark streaming 7.1 Word count example Scala 7.2 Analyzing Twitter texts 8. Sample Big Data Architecture with Apache Spark 9. What's Artificial Intelligence, Machine Learning, Deep Learning, Predictive Analytics, Data Science? 10. Spark Interview Questions and Answers
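    As mentioned in the transformations section above, here is a short sketch of the key-based transformations whose detailed examples were moved elsewhere (groupByKey, reduceByKey, sortByKey, join, coalesce). It assumes a running spark-shell; the fruit data is made up purely for illustration, and the exact ordering of the collected output can vary with partitioning.

// A small pair RDD of (fruit, count); the data is made up for illustration
val pairs = sc.parallelize(Seq(("apple", 1), ("orange", 2), ("apple", 3), ("mango", 1)))

// reduceByKey: combine the values for each key (usually preferred over groupByKey for aggregations)
pairs.reduceByKey(_ + _).collect()                  // Array((apple,4), (orange,2), (mango,1))

// groupByKey: collect all values for each key
pairs.groupByKey().mapValues(_.toList).collect()    // Array((apple,List(1, 3)), (orange,List(2)), (mango,List(1)))

// sortByKey: sort the pairs by key
pairs.sortByKey().collect()                         // Array((apple,1), (apple,3), (mango,1), (orange,2))

// join: inner join two pair RDDs on the key
val colors = sc.parallelize(Seq(("apple", "red"), ("mango", "yellow")))
pairs.join(colors).collect()                        // Array((apple,(1,red)), (apple,(3,red)), (mango,(1,yellow)))

// coalesce: reduce the number of partitions, e.g. before writing a single output file
pairs.coalesce(1).getNumPartitions                  // 1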

  • Loading JSON file using Spark (Scala)

    Main menu: Spark Scala Tutorial In this Apache Spark tutorial we will be loading a simple JSON file. Nowadays, most of the time you will find data in JSON format, XML or flat files. The JSON file format is very easy to understand and you will love it once you understand the JSON file structure. JSON File Structure Before we ingest a JSON file using Spark, it's important to understand the JSON data structure. Basically, JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write, and easy for machines to parse and generate. JSON is built on two structures: a collection of name/value pairs, usually referred to as an object, and an ordered list of values, which you can think of as an array or list of values. An object is an unordered set of name/value pairs. An object begins with { (left brace) and ends with } (right brace). Each name is followed by : (colon) and the name/value pairs are separated by , (comma). An array is an ordered collection of values. An array begins with [ (left bracket) and ends with ] (right bracket). Values are separated by , (comma). A value can be a string in double quotes, or a number, or true or false or null, or an object or an array. These structures can be nested. One more fact: JSON files can exist in two formats, and most of the time you will encounter multiline JSON files. Multiline JSON, where the records are wrapped in a single JSON document (for example an array) that can span one or many lines. Single-line JSON, where each line depicts exactly one record. Multiline JSON would look something like this: [ { "color": "red", "value": "#f00" }, { "color": "green", "value": "#0f0" }, { "color": "blue", "value": "#00f" }, { "color": "cyan", "value": "#0ff" }, { "color": "magenta", "value": "#f0f" }, { "color": "yellow", "value": "#ff0" }, { "color": "black", "value": "#000" } ] Single-line JSON would look something like this (try to correlate it with the object, array and value structure which I explained earlier): { "color": "red", "value": "#f00" } { "color": "green", "value": "#0f0" } { "color": "blue", "value": "#00f" } { "color": "cyan", "value": "#0ff" } { "color": "magenta", "value": "#f0f" } { "color": "yellow", "value": "#ff0" } { "color": "black", "value": "#000" } Creating Sample JSON file I have created two different sample files, multiline and single-line JSON, with the above-mentioned records (just copy-paste). singlelinecolors.json multilinecolors.json The sample files look like this: Note: I assume that you have installed the Scala IDE; if not, please refer to my previous blogs for installation steps (Windows & Mac users). 1. Create a new Scala project "jsnReader" Go to File → New → Project, enter jsnReader in the project name field and click Finish. 2. Create a new Scala package "com.dataneb.spark" Right-click on the jsnReader project in the Package Explorer panel → New → Package, enter the name com.dataneb.spark and finish. 3. Create a Scala object "jsonfileReader" Expand the jsnReader project tree, right-click on the com.dataneb.spark package → New → Scala Object → enter jsonfileReader as the object name and press Finish. 4. Add external jar files Right-click on the jsnReader project → Properties → Java Build Path → Add External Jars. Now navigate to the path where you have installed Spark; you will find all the jar files under the /spark/jars folder. After adding these jar files you will find a Referenced Libraries folder created in the left panel of your screen below the Scala object. You will also find that the project has become invalid (red cross sign); we will fix it shortly. 5.
Setup Scala Compiler Now right click on jsnReader project → properties → Scala Compiler and check the box Use Project Settings and select Fixed Scala installation: 2.11.11 (built-in) from drop-down options. After applying these changes, you will find project has become valid again (red cross sign is gone). 6. Sample code Open jsonfileReader.scala and copy-paste the code written below. I have written separate blog to explain what are basic terminologies used in Spark like RDD, SparkContext, SQLContext, various transformations and actions etc. You can go through this for basic understanding. However, I have explained little bit in comments above each line of code what it actually does. For list of spark functions you can refer this. // Your package name package com.dataneb.spark // Each library has its significance, I have commented in below code how its being used import org.apache.spark._ import org.apache.spark.sql._ import org.apache.log4j._ object jsonfileReader { // Reducing the error level to just "ERROR" messages // It uses library org.apache.log4j._ // You can apply other logging levels like ALL, DEBUG, ERROR, INFO, FATAL, OFF etc Logger.getLogger("org").setLevel(Level.ERROR) // Defining Spark configuration to define application name and the local resources to use // It uses library org.apache.spark._ val conf = new SparkConf().setAppName("Sample App") conf.setMaster("local") // Using above configuration to define our SparkContext val sc = new SparkContext(conf) // Defining SQL context to run Spark SQL // It uses library org.apache.spark.sql._ val sqlContext = new SQLContext(sc) // Main function where all operations will occur def main (args: Array[String]): Unit = { // Reading the json file val df = sqlContext.read.json("/Volumes/MYLAB/testdata/multilinecolors.json") // Printing schema df.printSchema() // Saving as temporary table df.registerTempTable("JSONdata") // Retrieving all the records val data=sqlContext.sql("select * from JSONdata") // Showing all the records data.show() // Stopping Spark Context sc.stop } } 7. Run the code! Right click anywhere on the screen and select Run As Scala Application. That's it!! If you have followed the steps properly you will find the result in Console. We have successfully loaded JSON file using Spark SQL dataframes. Printed JSON schema and displayed the data. Try reading single line JSON file which we created earlier. There is a multiline flag which you need to make true to read such files. Also, you can save this data in HDFS, database or CSV file depending upon your need. If you have any question, please don't forget to write in comments section below. Thank you. Next: How to convert RDD to dataframe? Navigation menu ​ 1. Apache Spark and Scala Installation 1.1 Spark installation on Windows​ 1.2 Spark installation on Mac 2. Getting Familiar with Scala IDE 2.1 Hello World with Scala IDE​ 3. Spark data structure basics 3.1 Spark RDD Transformations and Actions example 4. Spark Shell 4.1 Starting Spark shell with SparkContext example​ 5. Reading data files in Spark 5.1 SparkContext Parallelize and read textFile method 5.2 Loading JSON file using Spark Scala 5.3 Loading TEXT file using Spark Scala 5.4 How to convert RDD to dataframe? 6. Writing data files in Spark ​6.1 How to write single CSV file in Spark 7. Spark streaming 7.1 Word count example Scala 7.2 Analyzing Twitter texts 8. Sample Big Data Architecture with Apache Spark 9. What's Artificial Intelligence, Machine Learning, Deep Learning, Predictive Analytics, Data Science? 10. 
Spark Interview Questions and Answers
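    Tying back to the single-line vs. multiline JSON discussion in this post, here is a minimal sketch of reading both sample files with the newer SparkSession API (available as spark in Spark 2.x shells). This is only an illustrative sketch, not the code from the post; the multiLine option (Spark 2.2+) is needed when a single JSON document spans multiple lines, while the default reader already handles one-record-per-line files.

// Single-line JSON (one record per line): the default reader handles this
val singleDF = spark.read.json("/Volumes/MYLAB/testdata/singlelinecolors.json")
singleDF.show()

// Multiline JSON (one JSON document, e.g. an array, spanning several lines): set multiLine
val multiDF = spark.read.option("multiLine", true).json("/Volumes/MYLAB/testdata/multilinecolors.json")
multiDF.printSchema()
multiDF.show()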

  • Spark read Text file into Dataframe

    Main menu: Spark Scala Tutorial In this Spark Scala tutorial you will learn how to read data from a text file & CSV to dataframe. This blog has two sections: Spark read Text File Spark read CSV with schema/header There are various methods to load a text file in Spark. You can refer Spark documentation. Spark Read Text File I am loading a text file which is space (" ") delimited. I have chosen this format because in most of the practical cases you will find delimited text files with fixed number of fields. Further, I will be adding a header to dataframe and transform it to some extent. I have tried to keep the code as simple as possible so that anyone can understand it. You can change the separator, name/number of fields, data type according to your requirement. I am using squid logs as sample data for this example. It has date, integer and string fields which will help us to apply data type conversions and play around with Spark SQL. You can find complete squid file structure details at this. No. of fields = 10 Separator is a space character Sample Data 1286536309.586 921 192.168.0.68 TCP_MISS/200 507 POST http://rcv-srv37.inplay.tubemogul.co...eiver/services - DIRECT/174.129.41.128 application/xml 1286536309.608 829 192.168.0.68 TCP_MISS/200 507 POST http://rcv-srv37.inplay.tubemogul.co...eiver/services - DIRECT/174.129.41.128 application/xml 1286536309.660 785 192.168.0.68 TCP_MISS/200 507 POST http://rcv-srv37.inplay.tubemogul.co...eiver/services - DIRECT/174.129.41.128 application/xml 1286536309.684 808 192.168.0.68 TCP_MISS/200 507 POST http://rcv-srv37.inplay.tubemogul.co...eiver/services - DIRECT/174.129.41.128 application/xml 1286536309.775 195 192.168.0.227 TCP_MISS/200 4120 GET http://i4.ytimg.com/vi/gTHZnIAzmdY/default.jpg - DIRECT/209.85.153.118 image/jpeg 1286536309.795 215 192.168.0.227 TCP_MISS/200 5331 GET http://i2.ytimg.com/vi/-jBxVLD4fzg/default.jpg - DIRECT/209.85.153.118 image/jpeg 1286536309.815 234 192.168.0.227 TCP_MISS/200 5261 GET http://i1.ytimg.com/vi/dCjp28ps4qY/default.jpg - DIRECT/209.85.153.118 image/jpeg Creating Sample Text File I have created sample text file - squid.txt with above mentioned records (just copy-paste). Filename: squid.txt Path: /Users/Rajput/Documents/testdata Eclipse IDE Setup (for beginners) Before writing the Spark program it's necessary to setup Scala project in Eclipse IDE. I assume that you have installed Eclipse, if not please refer my previous blogs for installation steps (Windows | Mac users). These steps will be same for other sections like reading CSV, JSON, JDBC. 1. Create a new Scala project "txtReader" Go to File → New → Project and enter txtReader in project name field and click finish. 2. Create a new Scala Package "com.dataneb.spark" Right click on the txtReader project in the Package Explorer panel → New → Package and enter name com.dataneb.spark and finish. 3. Create a Scala object "textfileReader" Expand the txtReader project tree and right click on the com.dataneb.spark package → New → Scala Object → enter textfileReader in the object name and press finish. 4. Add external jar files (if needed) Right click on txtReader project → properties → Java Build Path → Add External Jars Now navigate to the path where you have installed Spark. You will find all the jar files under /spark/jars folder. Now select all the jar files and click open. Apply and Close. After adding these jar files you will find Referenced Library folder created on left panel of the screen below Scala object. 5. 
Eclipse IDE Setup (for beginners)

Before writing the Spark program it's necessary to set up a Scala project in Eclipse IDE. I assume that you have installed Eclipse; if not, please refer to my previous blogs for the installation steps (Windows | Mac users). These steps will be the same for the other sections like reading CSV, JSON and JDBC.

1. Create a new Scala project "txtReader"
Go to File → New → Project, enter txtReader in the project name field and click Finish.

2. Create a new Scala package "com.dataneb.spark"
Right click on the txtReader project in the Package Explorer panel → New → Package, enter the name com.dataneb.spark and click Finish.

3. Create a Scala object "textfileReader"
Expand the txtReader project tree, right click on the com.dataneb.spark package → New → Scala Object → enter textfileReader as the object name and press Finish.

4. Add external jar files (if needed)
Right click on the txtReader project → Properties → Java Build Path → Add External JARs. Navigate to the path where you have installed Spark; you will find all the jar files under the /spark/jars folder. Select all the jar files, click Open, then Apply and Close. After adding these jar files you will find a Referenced Libraries folder created in the left panel, below the Scala object.

5. Setup Scala compiler
Right click on the txtReader project → Properties → Scala Compiler, check the box Use Project Settings and select Fixed Scala installation: 2.11.11 (built-in) from the drop-down options.

Write the code!

[For beginners] Before you write the Spark program: I have written a separate blog to explain Spark RDDs and the various transformations and actions, which you can go through for a basic understanding. Refer to these blogs for Spark-shell and SparkContext basics if you are new to Spark programming. However, I have explained a little in the comments above each line of code what it actually does. For a list of Spark functions you can refer to this link.

Now, open textfileReader.scala and copy-paste the code below.

// Your package name
package com.dataneb.spark

// Each library has its significance, I have commented where it's used
import org.apache.spark._
import org.apache.spark.sql._
import org.apache.log4j._
import org.apache.spark.sql.types.{StructType, StructField, StringType}
import org.apache.spark.sql.Row

object textfileReader {

  // Reducing the log level to just "ERROR" messages
  // It uses library org.apache.log4j._
  // You can apply other logging levels like ALL, DEBUG, ERROR, INFO, FATAL, OFF etc.
  Logger.getLogger("org").setLevel(Level.ERROR)

  // Defining Spark configuration to set application name and master
  // It uses library org.apache.spark._
  val conf = new SparkConf().setAppName("textfileReader")
  conf.setMaster("local")

  // Using above configuration to define our SparkContext
  val sc = new SparkContext(conf)

  // Defining SQL context to run Spark SQL
  // It uses library org.apache.spark.sql._
  val sqlContext = new SQLContext(sc)

  // Main function where all operations will occur
  def main(args: Array[String]): Unit = {

    // Reading the text file
    val squidString = sc.textFile("/Users/Rajput/Documents/testdata/squid.txt")

    // Defining the data-frame header structure
    val squidHeader = "time duration client_add result_code bytes req_method url user hierarchy_code type"

    // Defining schema from the header which we defined above
    // It uses library org.apache.spark.sql.types.{StructType, StructField, StringType}
    val schema = StructType(squidHeader.split(" ").map(fieldName => StructField(fieldName, StringType, true)))

    // Converting String RDD to Row RDD for 10 attributes
    val rowRDD = squidString.map(_.split(" ")).map(x => Row(x(0), x(1), x(2), x(3), x(4), x(5), x(6), x(7), x(8), x(9)))

    // Creating data-frame based on Row RDD and schema
    val squidDF = sqlContext.createDataFrame(rowRDD, schema)

    // Saving as a temporary table
    squidDF.registerTempTable("squid")
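    // Note (added for clarity, not in the original post): registerTempTable is deprecated
    // in Spark 2.x; createOrReplaceTempView is the direct replacement and behaves the same:
    // squidDF.createOrReplaceTempView("squid")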
    // Retrieving all the records
    val allrecords = sqlContext.sql("select * from squid")

    // Showing top 5 records without truncation, i.e. showing complete row values
    allrecords.show(5, false)

    /* Further, you can apply Spark transformations according to your needs */
    allrecords.write.saveAsTable("allrecords")

    // Printing schema before transformation
    allrecords.printSchema()

    // Something like this for date, integer and string conversion
    // To have multiline SQL, use triple quotes
    val transformedData = sqlContext.sql("""
      -- multiline sql
      select from_unixtime(time) as time,             -- you can apply to_date
             cast(duration as int) as duration,       -- casting to integer
             cast(req_method as string) as req_method -- casting to string just to explain
      from allrecords
      where type like '%application%'                 -- filtering
      """)

    // Print schema after transformation; you can see the new fields' data types
    transformedData.printSchema()
    transformedData.show()

    sc.stop()
  }
}

Result

Right click anywhere on the screen and select Run As → Scala Application. If you have followed the steps properly you will find the result in the Console.

Key Notes

The first output is the complete data-frame with all the fields as string type.
The second output is the schema without any transformation; you will find all the datatypes as string.
The third output is the schema after applying datatype conversions.
The fourth output is our transformed data (minor transformations).

You might face errors if:
You missed importing the required jar files.
You missed configuring the Scala compiler.
You missed importing the referenced libraries.
You defined rowRDD with the wrong number of fields, like x(0) to x(10); in that case you will see an "ArrayIndexOutOfBoundsException" error.
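If you would rather skip malformed lines than let the job fail with that ArrayIndexOutOfBoundsException, here is a minimal sketch (not part of the original example) that reuses squidString, schema and sqlContext from the code above; the names safeRowRDD and safeSquidDF are purely illustrative:

    // Sketch: drop lines that don't have all 10 fields before building Row objects
    val safeRowRDD = squidString
      .map(_.split(" "))
      .filter(_.length >= 10)   // keep only lines with at least 10 tokens
      .map(x => Row(x(0), x(1), x(2), x(3), x(4), x(5), x(6), x(7), x(8), x(9)))

    val safeSquidDF = sqlContext.createDataFrame(safeRowRDD, schema)

Splitting on "\\s+" instead of a single space is a similar small hardening step if your log lines ever contain repeated spaces.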
Spark Read CSV

To demonstrate this I am using spark-shell, but you can always follow similar steps as above to create a Scala project in Eclipse IDE. I have downloaded the sample "books" data from Kaggle. I like Kaggle for free data files, you should try it as well. The sample books.csv has 10 columns and is approximately a 1.5 MB file - yeah, I know it's very small for Apache Spark, but this is just for demonstration purposes so it should be fine.

Columns - bookID, title, authors, average_rating, isbn, isbn13, language_code, num_pages, ratings_count, text_reviews_count
Path - /Volumes/MYLAB/testdata
Files - book.csv

Start Spark-shell

I am using Spark version 2.3.1 and Scala version 2.11.8.

// Create books dataframe using the SparkSession available as spark
scala> val booksDF = spark.read.csv("/Volumes/MYLAB/testdata/")
booksDF: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 8 more fields]

// Showing top 10 records in the dataframe
scala> booksDF.show(10)

// To include the header you can set option header => true
scala> spark.read.format("csv").option("header", "true").load("/Volumes/MYLAB/testdata/").show()

// Also, if you want Spark to infer the schema of the dataframe you need to set option inferSchema => true
scala> val booksDF = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/Volumes/MYLAB/testdata/")
booksDF: org.apache.spark.sql.DataFrame = [bookID: string, title: string ... 8 more fields]

scala> booksDF.printSchema
root
 |-- bookID: string (nullable = true)
 |-- title: string (nullable = true)
 |-- authors: string (nullable = true)
 |-- average_rating: string (nullable = true)
 |-- isbn: string (nullable = true)
 |-- isbn13: string (nullable = true)
 |-- language_code: string (nullable = true)
 |-- # num_pages: string (nullable = true)
 |-- ratings_count: string (nullable = true)
 |-- text_reviews_count: string (nullable = true)

// You can save this data in a temp table and run SQL
scala> booksDF.registerTempTable("books")
scala> booksDF.sqlContext.sql("select title from books").show(false)

// You can write any SQL you want; for example, let's say you want to see books with a rating over 4.5
scala> booksDF.sqlContext.sql("select title, average_rating from books where average_rating > 4.5").show(false)

You can see all of the options you can apply on a dataframe by pressing the Tab key after the dot in spark-shell.

Thank you folks! If you have any questions please mention them in the comments section below.

Next: Loading JSON file using Spark Scala

Navigation menu
1. Apache Spark and Scala Installation
1.1 Spark installation on Windows
1.2 Spark installation on Mac
2. Getting Familiar with Scala IDE
2.1 Hello World with Scala IDE
3. Spark data structure basics
3.1 Spark RDD Transformations and Actions example
4. Spark Shell
4.1 Starting Spark shell with SparkContext example
5. Reading data files in Spark
5.1 SparkContext Parallelize and read textFile method
5.2 Loading JSON file using Spark Scala
5.3 Loading TEXT file using Spark Scala
5.4 How to convert RDD to dataframe?
6. Writing data files in Spark
6.1 How to write single CSV file in Spark
7. Spark streaming
7.1 Word count example Scala
7.2 Analyzing Twitter texts
8. Sample Big Data Architecture with Apache Spark
9. What's Artificial Intelligence, Machine Learning, Deep Learning, Predictive Analytics, Data Science?
10. Spark Interview Questions and Answers

  • Create SSIS package in Visual Studio 2017

In this tutorial, you will learn how to create an SSIS (SQL Server Integration Services) package in Visual Studio 2017 step by step. For this you need to install SQL Server Data Tools (SSDT) on your machine. SSDT (the Business Intelligence template) is used to create SSIS/SSRS/SSAS solutions.

For Visual Studio 2019 and later
Visual Studio 2019 and later include SSDT support, so you don't need to install SSDT separately. You just need to check the "Data Storage and Processing" box in the workload section while installing Visual Studio (the SSIS/SSRS/SSAS project types are added as Visual Studio extensions).

Visual Studio 2017 Installation

Download the older version of Visual Studio from this link. You need to create a Microsoft account (if you don't have one) and a free Visual Studio subscription. Below is the product description which you need to download. Download the latest "Community" version (15.9, not 15.0) in order to install SSDT BI as highlighted below. VS installation might take 20 to 45 minutes depending upon your system configuration.

If you already have VS 2017 (version 15.0) then you have to upgrade VS 2017 to the latest version. You can go to the Visual Studio installed version and check for available updates.

SSDT BI Installation

You can download SSDT (version 15.9.1) from this link. This link might change as new versions come out; in that case you can simply Google "SSDT release 15.9" and visit the Microsoft VS 2017 SSDT installation page. Check all the services (SSAS/SSIS/SSRS) as shown below and select Visual Studio Community 2017 from the drop-down list. Click Install. The download and installation process will take around 30 minutes depending upon your system configuration. Restart your computer once installation is done.

Installation Check

Once installation is done, open Visual Studio 2017 and go to menu option File --> New Project. Look for Business Intelligence; if you can find Integration Services, Analysis Services and Reporting Services on the left side of the panel, then the installation is fine.

Create SSIS package

In order to create a basic SSIS package, you need to:
Create a project for the package.
Add a control flow and data flow to the package.
Add components to the data flow.

What is SSIS?

SSIS is an ETL tool for data warehousing that comes with Microsoft SQL Server; there is no extra cost for SSIS services. It lets you set up automated data load or extract processes to and/or from your SQL Server. SSIS stands for SQL Server Integration Services and ETL stands for Extract-Transform-Load. It is comparable to other ETL tools like Informatica and IBM DataStage.

What does it do?

SSIS provides a development platform, referred to as SSDT, for building an ETL solution, which can be a combination of one or more packages. Solutions are saved with the .sln extension and packages are basically XML files saved with the .dtsx extension. Packages are deployed to a SQL Server database called SSISDB and managed through the Integration Services Catalog in SSMS (SQL Server Management Studio).

Creating Package

1. Go to File => New => Project, name your project and click OK. It will open the SSIS designer. On the left-hand panel you will see the SSIS toolbox with all the tasks, and at the center you will see various tabs to switch between control flow, data flow, parameters, event handlers and package explorer. On the right panel you will see the solution explorer where you can find the connection manager. Now, in order to create an SSIS package you need at least one control flow and one data flow task.
A data flow task is simply a task used to extract, transform and load the data, and the control flow is the logical unit which controls the execution of tasks, i.e. the order in which tasks will execute.

2. Drag and drop a Data Flow Task from the SSIS toolbox to the central panel (Control Flow tab) as shown below.

3. You can double click on the Data Flow Task to rename it; I am keeping the default name "Data Flow Task". Now right click on Data Flow Task => Edit, or simply select the "Data Flow Task" and click on the "Data Flow" tab. It will open the screen where you can edit your "Data Flow Task".

4. Now drag and drop the OLE DB Source, OLE DB Destination and Data Conversion tasks from the SSIS toolbox to the designer space as shown below.

5. Select the OLE DB Source and drag the blue/green arrow to connect it to the Data Conversion task. Similarly, drag the blue/green arrow (not the red one) from the Data Conversion task to the OLE DB Destination. Like this,

6. Now you need to create an OLE DB connection for the source and target. For this, go to the Solution Explorer panel on the right-hand side => Connection manager => New connection manager. Select OLE DB and click Add. If you have already created an OLE DB data connection earlier on your machine it will show up here; otherwise you can click New and create a new one. Just enter your database name and test the connection. I assume you have the AdventureWorks database running on your machine; if not, please refer to this post. I have already installed SQL Server 2014 and SQL Express, so you can see 2 instances of the SQL Server service running on my machine. For this example I have chosen the AdventureWorks2014 database.

7. Now go back to the Data Flow screen and right click on the OLE DB Source task => Edit. Choose a sample table from the drop-down list, [Production].[Product]. Now go to the Columns tab, remove the selected columns and select these five columns - Name, ListPrice, Size, Weight and SellStartDate - and click OK. It's just for example purposes.

8. Now go to the Data Flow and right click on the Data Conversion task => Edit. Select the SellStartDate column and change its data type from [DT_DBTIMESTAMP] to [DT_DBDATE], keep the alias name the same and click OK. Just a minor datatype conversion to showcase this example.

9. Now right click on the OLE DB Destination => Edit => New. SSIS by default creates the "create table" statement for you with the input columns:

CREATE TABLE [OLE DB Destination] (
    [Name] nvarchar(50),
    [ListPrice] money,
    [Size] nvarchar(5),
    [Weight] numeric(8,2),
    [OLE DB Source.SellStartDate] datetime,
    [Data Conversion.SellStartDate] date
)

Edit the table name, remove [OLE DB Source.SellStartDate] and hit OK:

CREATE TABLE [OLE DB Destination_Products] (
    [Name] nvarchar(50),
    [ListPrice] money,
    [Size] nvarchar(5),
    [Weight] numeric(8,2),
    [Data Conversion.SellStartDate] date
)

The mappings should look like this; just click OK.

10. Now right click on the blue/green arrow between the Data Conversion task and the OLE DB Destination task and enable the data viewer. This is not a mandatory step, just a way to see a data preview after the data conversion.

11. Now hit the START button at the top of your screen. This will start the package. You can see SellStartDate has only a date after conversion (no time field), all the tasks have green ticks, which means they ran successfully, and 1,008 rows were processed. You can stop the flow or restart it again from the buttons highlighted at the top of the screen.

That's it. This package creation example was showcased by Microsoft itself; I haven't modified anything, to keep the example simple and informative.
I hope you enjoyed the post. If you have any questions please mention them in the comments section below. Thank you. Next: SQL Server 2014 Download and Installation


©2020 by Data Nebulae