- Every Gardener Must Know These Homemade Organic Pesticide Remedies
Organic Homemade Pesticide: Pests like crickets, spiders, snails, aphids and others can cause serious damage to your garden and invite several diseases. Personally, I don't recommend chemical pesticides to get rid of them, as they can make fruits and vegetables unsafe for consumption and they are not safe for the environment. However, there are many homemade remedies you can use to stop these pests.

Using Neem: Mix 30 milliliters of neem oil with 1 teaspoon of mild soap. Stir the neem and soap into 1 liter of warm water. Pour the pesticide into a spray bottle and spray the affected areas.

Using Onion, Chilies & Garlic: Blend 100 grams of red hot peppers with 50 grams of garlic cloves and 50 grams of onions to form a thick paste. Mix the paste into 1 liter of warm water. Pour the solution into a container and leave it for 24 hours in a warm spot. Filter the solution through a strainer to remove solid particles; the filtered solution is your pesticide. Pour it into a spray bottle and spray the affected plants.

Using Tobacco: Mix half a cup of tobacco into 1 liter of water. Keep the mixture out in the sun for 24 hours, then check that its color is similar to light tea. Add 2 tablespoons of mild liquid dish soap and mix thoroughly. Pour the liquid into a spray bottle and spray the affected plants.

Using Orange Peels: Boil the peels of 2 oranges in 1 liter of water. Keep the solution in a warm spot for 24 hours. Filter out the peels, pour the liquid into a spray bottle, add a few drops of Castile soap and mix thoroughly. Spray the pesticide on the affected areas.

Using Egg Shells: Eggshells cannot be used to make a pesticide, but they can protect your plants from pests. Being composed of calcium carbonate, eggshells are also an excellent way to introduce this mineral into the soil. Microwave waste eggshells for a couple of minutes to kill bacteria; you can also dry them in a sunny spot for 3-4 days, but the microwave is faster. Put them in a plastic bag and crush them into fine particles. Spread the crushed shells around your plants; they deter pests from attacking the roots and serve as a good source of calcium. You can also blend eggshells and use them as a fertilizer. Eggshells will reduce the acidity of your soil and help to aerate it.

#Garden #Pesticides #Organic #Homemade #Remedies
- Spark read Text file into Dataframe
Main menu: Spark Scala Tutorial In this Spark Scala tutorial you will learn how to read data from a text file & CSV to dataframe. This blog has two sections: Spark read Text File Spark read CSV with schema/header There are various methods to load a text file in Spark. You can refer Spark documentation. Spark Read Text File I am loading a text file which is space (" ") delimited. I have chosen this format because in most of the practical cases you will find delimited text files with fixed number of fields. Further, I will be adding a header to dataframe and transform it to some extent. I have tried to keep the code as simple as possible so that anyone can understand it. You can change the separator, name/number of fields, data type according to your requirement. I am using squid logs as sample data for this example. It has date, integer and string fields which will help us to apply data type conversions and play around with Spark SQL. You can find complete squid file structure details at this. No. of fields = 10 Separator is a space character Sample Data 1286536309.586 921 192.168.0.68 TCP_MISS/200 507 POST http://rcv-srv37.inplay.tubemogul.co...eiver/services - DIRECT/174.129.41.128 application/xml 1286536309.608 829 192.168.0.68 TCP_MISS/200 507 POST http://rcv-srv37.inplay.tubemogul.co...eiver/services - DIRECT/174.129.41.128 application/xml 1286536309.660 785 192.168.0.68 TCP_MISS/200 507 POST http://rcv-srv37.inplay.tubemogul.co...eiver/services - DIRECT/174.129.41.128 application/xml 1286536309.684 808 192.168.0.68 TCP_MISS/200 507 POST http://rcv-srv37.inplay.tubemogul.co...eiver/services - DIRECT/174.129.41.128 application/xml 1286536309.775 195 192.168.0.227 TCP_MISS/200 4120 GET http://i4.ytimg.com/vi/gTHZnIAzmdY/default.jpg - DIRECT/209.85.153.118 image/jpeg 1286536309.795 215 192.168.0.227 TCP_MISS/200 5331 GET http://i2.ytimg.com/vi/-jBxVLD4fzg/default.jpg - DIRECT/209.85.153.118 image/jpeg 1286536309.815 234 192.168.0.227 TCP_MISS/200 5261 GET http://i1.ytimg.com/vi/dCjp28ps4qY/default.jpg - DIRECT/209.85.153.118 image/jpeg Creating Sample Text File I have created sample text file - squid.txt with above mentioned records (just copy-paste). Filename: squid.txt Path: /Users/Rajput/Documents/testdata Eclipse IDE Setup (for beginners) Before writing the Spark program it's necessary to setup Scala project in Eclipse IDE. I assume that you have installed Eclipse, if not please refer my previous blogs for installation steps (Windows | Mac users). These steps will be same for other sections like reading CSV, JSON, JDBC. 1. Create a new Scala project "txtReader" Go to File → New → Project and enter txtReader in project name field and click finish. 2. Create a new Scala Package "com.dataneb.spark" Right click on the txtReader project in the Package Explorer panel → New → Package and enter name com.dataneb.spark and finish. 3. Create a Scala object "textfileReader" Expand the txtReader project tree and right click on the com.dataneb.spark package → New → Scala Object → enter textfileReader in the object name and press finish. 4. Add external jar files (if needed) Right click on txtReader project → properties → Java Build Path → Add External Jars Now navigate to the path where you have installed Spark. You will find all the jar files under /spark/jars folder. Now select all the jar files and click open. Apply and Close. After adding these jar files you will find Referenced Library folder created on left panel of the screen below Scala object. 5. 
Setup Scala compiler
Now right click on txtReader project → properties → Scala Compiler and check the box Use Project Settings and select Fixed Scala installation: 2.11.11 (built-in) from the drop-down options.

Write the code!
[For beginners] Before you write the Spark program, note that I have written a separate blog to explain Spark RDD and the various transformations and actions. You can go through it for basic understanding. Refer to these blogs for Spark-shell and SparkContext basics if you are new to Spark programming. However, I have also explained a little bit in comments above each line of code what it actually does. For a list of Spark functions you can refer this. Now, open textfileReader.scala and copy-paste the code below.

// Your package name
package com.dataneb.spark

// Each library has its significance, I have commented when it's used
import org.apache.spark._
import org.apache.spark.sql._
import org.apache.log4j._
import org.apache.spark.sql.types.{StructType, StructField, StringType}
import org.apache.spark.sql.Row

object textfileReader {

  // Reducing the error level to just "ERROR" messages
  // It uses library org.apache.log4j._
  // You can apply other logging levels like ALL, DEBUG, ERROR, INFO, FATAL, OFF etc
  Logger.getLogger("org").setLevel(Level.ERROR)

  // Defining Spark configuration to set application name and master
  // It uses library org.apache.spark._
  val conf = new SparkConf().setAppName("textfileReader")
  conf.setMaster("local")

  // Using above configuration to define our SparkContext
  val sc = new SparkContext(conf)

  // Defining SQL context to run Spark SQL
  // It uses library org.apache.spark.sql._
  val sqlContext = new SQLContext(sc)

  // Main function where all operations will occur
  def main (args: Array[String]): Unit = {

    // Reading the text file
    val squidString = sc.textFile("/Users/Rajput/Documents/testdata/squid.txt")

    // Defining the data-frame header structure
    val squidHeader = "time duration client_add result_code bytes req_method url user hierarchy_code type"

    // Defining schema from header which we defined above
    // It uses library org.apache.spark.sql.types.{StructType, StructField, StringType}
    val schema = StructType(squidHeader.split(" ").map(fieldName => StructField(fieldName, StringType, true)))

    // Converting String RDD to Row RDD for 10 attributes
    val rowRDD = squidString.map(_.split(" ")).map(x => Row(x(0), x(1), x(2), x(3), x(4), x(5), x(6), x(7), x(8), x(9)))

    // Creating data-frame based on Row RDD and schema
    val squidDF = sqlContext.createDataFrame(rowRDD, schema)

    // Saving as temporary table
    squidDF.registerTempTable("squid")

    // Retrieving all the records
    val allrecords = sqlContext.sql("select * from squid")

    // Showing top 5 records with false truncation i.e. showing complete row value
    allrecords.show(5, false)

    /* Further you can apply Spark transformations according to your need */
    allrecords.write.saveAsTable("allrecords")

    // Printing schema before transformation
    allrecords.printSchema()

    // Something like this for date, integer and string conversion
    // To have multiline sql use triple quotes
    val transformedData = sqlContext.sql("""
      -- multiline sql
      select from_unixtime(time) as time,        -- you can apply to_date
      cast(duration as int) as duration,         -- casting to integer
      cast(req_method as string) as req_method   -- casting to string just to explain
      from allrecords
      where type like '%application%'            -- filtering
      """)

    // To print schema after transformation, you can see new fields data types
    transformedData.printSchema()
    transformedData.show()

    sc.stop()
  }
}

Result
Right click anywhere on the screen and select Run As Scala Application. If you have followed the steps properly you will find the result in the Console.

Key Notes
First output is the complete dataframe with all the fields as string type. Second output is the schema without any transformation; you will find all the datatypes as string. Third output is the schema after applying datatype conversions. Fourth output is our transformed data (minor transformations). You might face an error if you missed importing the required jar files, missed configuring the Scala compiler, missed importing the referenced libraries, or defined rowRDD with the wrong number of fields (like x(0) to x(10)), in which case you will see an "ArrayIndexOutOfBoundsException" error.

Spark Read CSV
To demonstrate this I am using Spark-shell, but you can always follow similar steps as above to create a Scala project in Eclipse IDE. I have downloaded sample "books" data from Kaggle. I like Kaggle for free data files, you should try it as well. The sample books.csv has 10 columns and is approximately a 1.5 MB file; yeah, I know it's very small for Apache Spark, but this is just for demonstration purposes so it should be fine.
Columns - bookID, title, authors, average_rating, isbn, isbn13, language_code, num_pages, ratings_count, text_reviews_count
Path - /Volumes/MYLAB/testdata
Files - books.csv

Start Spark-shell
I am using Spark version 2.3.1 and Scala version 2.11.8.

// Create books dataframe using SparkSession available as spark
scala> val booksDF = spark.read.csv("/Volumes/MYLAB/testdata/")
booksDF: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 8 more fields]

// Showing top 10 records in dataframe
scala> booksDF.show(10)

// To include header you can set option header => true
scala> spark
        .read
        .format("csv")
        .option("header", "true")
        .load("/Volumes/MYLAB/testdata/")
        .show()

// Also if you want to store Schema of dataframe you need to set option inferSchema => true
scala> val booksDF = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/Volumes/MYLAB/testdata/")
booksDF: org.apache.spark.sql.DataFrame = [bookID: string, title: string ... 8 more fields]

scala> booksDF.printSchema
root
 |-- bookID: string (nullable = true)
 |-- title: string (nullable = true)
 |-- authors: string (nullable = true)
 |-- average_rating: string (nullable = true)
 |-- isbn: string (nullable = true)
 |-- isbn13: string (nullable = true)
 |-- language_code: string (nullable = true)
 |-- # num_pages: string (nullable = true)
 |-- ratings_count: string (nullable = true)
 |-- text_reviews_count: string (nullable = true)

// You can save this data in a temp table and run SQL
scala> booksDF.registerTempTable("books")
scala> booksDF.sqlContext.sql("select title from books").show(false)

// You can write any sql you want, for example lets say you want to see books with rating over 4.5
scala> booksDF.sqlContext.sql("select title, average_rating from books where average_rating > 4.5").show(false)

You can see all the options you can apply on a dataframe by pressing Tab. Thank you folks! If you have any questions please mention them in the comments section below.
Next: Loading JSON file using Spark Scala
Navigation menu 1. Apache Spark and Scala Installation 1.1 Spark installation on Windows 1.2 Spark installation on Mac 2. Getting Familiar with Scala IDE 2.1 Hello World with Scala IDE 3. Spark data structure basics 3.1 Spark RDD Transformations and Actions example 4. Spark Shell 4.1 Starting Spark shell with SparkContext example 5. Reading data files in Spark 5.1 SparkContext Parallelize and read textFile method 5.2 Loading JSON file using Spark Scala 5.3 Loading TEXT file using Spark Scala 5.4 How to convert RDD to dataframe? 6. Writing data files in Spark 6.1 How to write single CSV file in Spark 7. Spark streaming 7.1 Word count example Scala 7.2 Analyzing Twitter texts 8. Sample Big Data Architecture with Apache Spark 9. What's Artificial Intelligence, Machine Learning, Deep Learning, Predictive Analytics, Data Science? 10. Spark Interview Questions and Answers
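Since squid.txt is just a space-delimited text file, you can also skip the manual Row RDD conversion and let the DataFrameReader split the fields for you. Below is a minimal sketch of that alternative, assuming the same file path and field names used above; the object name squidCsvReader is only for illustration, while the delimiter and inferSchema options are standard Spark 2.x CSV reader options.

// Alternative sketch: read the space-delimited squid file with the CSV reader
// (assumes the same file path and field names as in the example above)
import org.apache.spark.sql.SparkSession

object squidCsvReader {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("squidCsvReader")
      .master("local")
      .getOrCreate()

    val df = spark.read
      .option("delimiter", " ")      // fields are separated by a single space
      .option("inferSchema", "true") // let Spark guess numeric columns
      .csv("/Users/Rajput/Documents/testdata/squid.txt")
      .toDF("time", "duration", "client_add", "result_code", "bytes",
            "req_method", "url", "user", "hierarchy_code", "type")

    df.printSchema()
    df.show(5, false)

    spark.stop()
  }
}

With inferSchema enabled, columns such as duration and bytes should come back as numeric types, so some of the explicit casts shown earlier become optional.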
- ELK stack Installation on OEL (Oracle Enterprise Linux)
Refer my previous blog to install Oracle Enterprise Linux operating system on your machine. Or if you have any operating system which supports Linux kernel like CentOS, Ubuntu, RedHat Linux etc, these steps will be similar. Navigation Menu: Introduction to ELK Stack Installation Loading data into Elasticsearch with Logstash Create Kibana Dashboard Example Kibana GeoIP Dashboard Example Elasticsearch Installation Before we start Elasticsearch installation. I hope you all have Java installed on your machine, if not please refer this. Now once you have installed Java successfully, go to this link and download latest version of Elasticsearch. https://www.elastic.co/downloads/ I have downloaded TAR file (elasticsearch-6.2.4.tar.gz) to explain this blog. For machines with GUI like CentOS, Ubuntu: Once you download it on your local machine, move it to your Linux environment where you want to run Elasticsearch. I use MobaXterm (open source tool) to transfer file from my windows machine to Linux environment (Red Hat Linux client without GUI in this case). For non-GUI Linux machines: Simply run wget on your Linux machine (if you don't have wget package installed on your machine, run this command with root user to install wget: yum install wget -y). Run below commands to install Elasticsearch with any user except root. Change the version according to your requirement, like I removed 6.2.4 for simplicity. wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.2.4.tar.gz tar -xvzf elasticsearch-6.2.4.tar.gz rm -f elasticsearch-6.2.4.tar.gz mv elasticsearch-6.2.4 elasticsearch Start Elasticsearch To start Elasticsearch, navigate to Elasticsearch directory and launch elasticsearch. cd elasticsearch/ ./bin/elasticsearch Running Elasticsearch in Background You can start Elasticsearch in background as well with below commands. Run nohup and disown the process. Later you can find out the java process running on your machine or you can simply note down the PID which generates after executing nohup. Like in below case - 25605 is the PID. [hadoop@elasticsearch elasticsearch]$ nohup ./bin/elasticsearch & [1] 25605 [hadoop@elasticsearch elasticsearch]$ nohup: ignoring input and appending output to ‘nohup.out’ disown [hadoop@elasticsearch elasticsearch]$ ps -aux | grep java hadoop 25605 226 6.1 4678080 1257552 pts/0 Sl 11:54 0:31 /usr/java/java/bin/java -Xms1g -Xmx1g -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Djava.io.tmpdir=/tmp/elasticsearch.zbtKhO5i -XX:+HeapDumpOnOutOfMemoryError -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime -Xloggc:logs/gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=32 -XX:GCLogFileSize=64m -Des.path.home=/home/hadoop/apps/installers/elasticsearch -Des.path.conf=/home/hadoop/apps/installers/elasticsearch/config -cp /home/hadoop/apps/installers/elasticsearch/lib/* org.elasticsearch.bootstrap.Elasticsearch Note: If you are getting below error, please make sure you are not logged in as root. Remove the file, login with different user and redo above steps. Remember I told to install elasticsearch with any user except root. 
Error:java.nio.file.AccessDeniedException: /home/hadoop/apps/installers/elasticsearch/config/jvm.options Verify Elasticsearch installation [hadoop@localhost etc]$ curl http://localhost:9200 { "name" : "akY11V_", "cluster_name" : "elasticsearch", "cluster_uuid" : "3O3dLMIDRYmJa1zrqNZqug", "version" : { "number" : "6.2.4", "build_hash" : "ccec39f", "build_date" : "2018-04-12T20:37:28.497551Z", "build_snapshot" : false, "lucene_version" : "7.2.1", "minimum_wire_compatibility_version" : "5.6.0", "minimum_index_compatibility_version" : "5.0.0" }, "tagline" : "You Know, for Search" } Or you can simply open http://localhost:9200 in your local browser if your operating system supports any GUI. Kibana Installation Follow similar steps to download Kibana latest release from below link: https://www.elastic.co/downloads/kibana Move the TAR file to your Linux machine or simply run wget to download the file. Modify the version according to your requirement. wget https://artifacts.elastic.co/downloads/kibana/kibana-6.2.4-linux-x86_64.tar.gz tar -xvzf kibana-6.2.4-linux-x86_64.tar.gz rm -f kibana-6.2.4-linux-x86_64.tar.gz mv kibana-6.2.4-linux-x86_64 kibana Now, uncomment this line in kibana.yml file: elasticsearch.url: "http://localhost:9200" cd kibana vi /config/kibana.yml Start Kibana [hadoop@localhost kibana]$ ./bin/kibana log [21:09:12.958] [info][status][plugin:kibana@6.2.4] Status changed from uninitialized to green - Ready log [21:09:13.091] [info][status][plugin:elasticsearch@6.2.4] Status changed from uninitialized to yellow - Waiting for Elasticsearch log [21:09:13.539] [info][status][plugin:timelion@6.2.4] Status changed from uninitialized to green - Ready log [21:09:13.560] [info][status][plugin:console@6.2.4] Status changed from uninitialized to green - Ready log [21:09:13.573] [info][status][plugin:metrics@6.2.4] Status changed from uninitialized to green - Ready log [21:09:13.637] [info][listening] Server running at http://localhost:5601 log [21:09:13.758] [info][status][plugin:elasticsearch@6.2.4] Status changed from yellow to green - Ready You can start Kibana in background as well by executing below command: [hadoop@elasticsearch kibana]$ ./bin/kibana & [2] 23866 [hadoop@elasticsearch kibana]$ log [15:30:26.029] [info][status][plugin:kibana@6.2.4] Status changed from uninitialized to green - Ready log [15:30:26.164] [info][status][plugin:elasticsearch@6.2.4] Status changed from uninitialized to yellow - Waiting for Elasticsearch log [15:30:26.676] [info][status][plugin:timelion@6.2.4] Status changed from uninitialized to green - Ready log [15:30:26.701] [info][status][plugin:console@6.2.4] Status changed from uninitialized to green - Ready log [15:30:26.718] [info][status][plugin:metrics@6.2.4] Status changed from uninitialized to green - Ready log [15:30:26.781] [info][listening] Server running at http://localhost:5601 log [15:30:26.861] [info][status][plugin:elasticsearch@6.2.4] Status changed from yellow to green - Ready disown Logstash Installation Follow similar steps to download Logstash latest release from below link: https://www.elastic.co/downloads/logstash Or run the below commands with wget to download and install: wget https://artifacts.elastic.co/downloads/logstash/logstash-6.2.4.tar.gz tar -xvzf logstash-6.2.4.tar.gz rm -f logstash-6.2.4.tar.gz mv logstash-6.2.4 logstash Create config-sample file cd /logstash/config vi logstash-simple.conf input { stdin { } } output { elasticsearch { hosts => ["localhost:9200"] } stdout { codec => rubydebug } } Start Logstash 
[hadoop@localhost logstash]$ ./bin/logstash -f ./config/logstash-simple.conf Sending Logstash's logs to /home/hadoop/apps/installers/logstash/logs which is now configured via log4j2.properties [2018-05-25T17:29:34,107][INFO ][logstash.modules.scaffold] Initializing module {:module_name=>"fb_apache", :directory=>"/home/hadoop/apps/installers/logstash/modules/fb_apache/configuration"} [2018-05-25T17:29:34,150][INFO ][logstash.modules.scaffold] Initializing module {:module_name=>"netflow", :directory=>"/home/hadoop/apps/installers/logstash/modules/netflow/configuration"} [2018-05-25T17:29:34,385][INFO ][logstash.setting.writabledirectory] Creating directory {:setting=>"path.queue", :path=>"/home/hadoop/apps/installers/logstash/data/queue"} [2018-05-25T17:29:34,396][INFO ][logstash.setting.writabledirectory] Creating directory {:setting=>"path.dead_letter_queue", :path=>"/home/hadoop/apps/installers/logstash/data/dead_letter_queue"} [2018-05-25T17:29:35,467][WARN ][logstash.config.source.multilocal] Ignoring the 'pipelines.yml' file because modules or command line options are specified [2018-05-25T17:29:35,554][INFO ][logstash.agent ] No persistent UUID file found. Generating new UUID {:uuid=>"1aad4d0b-71ea-4355-8c21-9623927af557", :path=>"/home/hadoop/apps/installers/logstash/data/uuid"} [2018-05-25T17:29:37,391][INFO ][logstash.runner ] Starting Logstash {"logstash.version"=>"6.2.4"} [2018-05-25T17:29:38,775][INFO ][logstash.agent ] Successfully started Logstash API endpoint {:port=>9600} [2018-05-25T17:29:48,843][INFO ][logstash.pipeline ] Starting pipeline {:pipeline_id=>"main", "pipeline.workers"=>4, "pipeline.batch.size"=>125, "pipeline.batch.delay"=>50} [2018-05-25T17:29:50,008][INFO ][logstash.outputs.elasticsearch] Elasticsearch pool URLs updated {:changes=>{:removed=>[], :added=>[http://localhost:9200/]}} [2018-05-25T17:29:50,030][INFO ][logstash.outputs.elasticsearch] Running health check to see if an Elasticsearch connection is working {:healthcheck_url=>http://localhost:9200/, :path=>"/"} [2018-05-25T17:29:50,614][WARN ][logstash.outputs.elasticsearch] Restored connection to ES instance {:url=>"http://localhost:9200/"} [2018-05-25T17:29:50,781][INFO ][logstash.outputs.elasticsearch] ES Output version determined {:es_version=>6} [2018-05-25T17:29:50,789][WARN ][logstash.outputs.elasticsearch] Detected a 6.x and above cluster: the `type` event field won't be used to determine the document _type {:es_version=>6} [2018-05-25T17:29:50,834][INFO ][logstash.outputs.elasticsearch] Using mapping template from {:path=>nil} [2018-05-25T17:29:50,873][INFO ][logstash.outputs.elasticsearch] Attempting to install template {:manage_template=>{"template"=>"logstash-*", "version"=>60001, "settings"=>{"index.refresh_interval"=>"5s"}, "mappings"=>{"_default_"=>{"dynamic_templates"=>[{"message_field"=>{"path_match"=>"message", "match_mapping_type"=>"string", "mapping"=>{"type"=>"text", "norms"=>false}}}, {"string_fields"=>{"match"=>"*", "match_mapping_type"=>"string", "mapping"=>{"type"=>"text", "norms"=>false, "fields"=>{"keyword"=>{"type"=>"keyword", "ignore_above"=>256}}}}}], "properties"=>{"@timestamp"=>{"type"=>"date"}, "@version"=>{"type"=>"keyword"}, "geoip"=>{"dynamic"=>true, "properties"=>{"ip"=>{"type"=>"ip"}, "location"=>{"type"=>"geo_point"}, "latitude"=>{"type"=>"half_float"}, "longitude"=>{"type"=>"half_float"}}}}}}}} [2018-05-25T17:29:50,963][INFO ][logstash.outputs.elasticsearch] Installing elasticsearch template to _template/logstash [2018-05-25T17:29:51,421][INFO 
][logstash.outputs.elasticsearch] New Elasticsearch output {:class=>"LogStash::Outputs::ElasticSearch", :hosts=>["//localhost:9200"]} [2018-05-25T17:29:51,646][INFO ][logstash.pipeline ] Pipeline started successfully {:pipeline_id=>"main", :thread=>"#"} The stdin plugin is now waiting for input: [2018-05-25T17:29:51,902][INFO ][logstash.agent ] Pipelines running {:count=>1, :pipelines=>["main"]} hello world { "@version" => "1", "@timestamp" => 2018-05-25T21:34:46.148Z, "host" => "localhost.localdomain", "message" => "hello world" } Accessing Kibana dashboard In order to access Kibana dashboard remotely, configure the file kibana.yml in /kibana/config directory to server.host: "0.0.0.0" as highlighted below. vi kibana.yml Now try opening the link on your local browser. http://{your machine ip}:5601 Note: If link doesn't work, try to stop firewall services on your server. Run below commands: service firewalld stop service iptables stop Here is the sample, In my case it's http://192.16x.x.xxx:5601 If you don't have your linux machine details. You can search your ipaddress by running ifconfig on your machine (inet is your ip). I hope you enjoyed this post. Please comment below if you have any question. Thank you! Next: Loading data into Elasticsearch with Logstash Navigation Menu: Introduction to ELK Stack Installation Loading data into Elasticsearch with Logstash Create Kibana Dashboard Example Kibana GeoIP Dashboard Example #InstallElasticsearch #Installation #ELKStack #Elasticsearch #Kibana #Logstash
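To double-check that the "hello world" event from the stdin example actually reached Elasticsearch, you can query the search API in the same way we queried the cluster root with curl. The sketch below does that check from a Scala REPL or script (Scala being the language used in the Spark tutorials on this site); it assumes Logstash wrote to its default logstash-* daily index and that Elasticsearch is listening on localhost:9200.

// Quick verification sketch: search for the "hello world" event indexed by Logstash.
// Assumes the default logstash-* index naming and Elasticsearch on localhost:9200.
import scala.io.Source

val url = "http://localhost:9200/logstash-*/_search?q=message:%22hello%20world%22"
val response = Source.fromURL(url).mkString   // raw JSON search response
println(response)                             // check that hits.total is greater than 0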
- Loading JSON file using Spark (Scala)
Main menu: Spark Scala Tutorial In this Apache Spark Tutorial - We will be loading a simple JSON file. Now-a-days most of the time you will find files in either JSON format, XML or a flat file. JSON file format is very easy to understand and you will love it once you understand JSON file structure. JSON File Structure Before we ingest JSON file using spark, it's important to understand JSON data structure. Basically, JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate. JSON is built on two structures: A collection of name/value pairs, usually referred as an object and its value pair. An ordered list of values. You can think it like an array, list of values. An object is an unordered set of name/value pairs. An object begins with { (left brace) and ends with } (right brace). Each name is followed by : (colon) and the name/value pairs are separated by , (comma). An array is an ordered collection of values. An array begins with [ (left bracket) and ends with ] (right bracket). Values are separated by , (comma). A value can be a string in double quotes, or a number, or true or false or null, or an object or an array. These structures can be nested. One more fact, JSON files could exist in two formats. However, most of the time you will encounter multiline JSON files. Multiline JSON where each line could have multiple records. Single line JSON where each line depicts one record. Multiline JSON would look something like this: [ { "color": "red", "value": "#f00" }, { "color": "green", "value": "#0f0" }, { "color": "blue", "value": "#00f" }, { "color": "cyan", "value": "#0ff" }, { "color": "magenta", "value": "#f0f" }, { "color": "yellow", "value": "#ff0" }, { "color": "black", "value": "#000" } ] Single line JSON would look something like this (try to correlate object, array and value structure format which I explained earlier): { "color": "red", "value": "#f00" } { "color": "green", "value": "#0f0" } { "color": "blue", "value": "#00f" } { "color": "cyan", "value": "#0ff" } { "color": "magenta", "value": "#f0f" } { "color": "yellow", "value": "#ff0" } { "color": "black", "value": "#000" } Creating Sample JSON file I have created two different sample files - multiline and single line JSON file with above mentioned records (just copy-paste). singlelinecolors.json multilinecolors.json Sample files look like: Note: I assume that you have installed Scala IDE if not please refer my previous blogs for installation steps (Windows & Mac users). 1. Create a new Scala project "jsnReader" Go to File → New → Project and enter jsnReader in project name field and click finish. 2. Create a new Scala Package "com.dataneb.spark" Right click on the jsnReader project in the Package Explorer panel → New → Package and enter name com.dataneb.spark and finish. 3. Create a Scala object "jsonfileReader" Expand the jsnReader project tree and right click on the com.dataneb.spark package → New → Scala Object → enter jsonfileReader in the object name and press finish. 4. Add external jar files Right click on jsnReader project → properties → Java Build Path → Add External Jars Now navigate to the path where you have installed Spark. You will find all the jar files under /spark/jars folder. After adding these jar files you will find Referenced Library folder created on left panel of your screen below Scala object. You will also find that project has become invalid (red cross sign), we will fix it shortly. 5. 
Setup Scala Compiler
Now right click on jsnReader project → properties → Scala Compiler and check the box Use Project Settings and select Fixed Scala installation: 2.11.11 (built-in) from the drop-down options. After applying these changes, you will find the project has become valid again (red cross sign is gone).

6. Sample code
Open jsonfileReader.scala and copy-paste the code written below. I have written a separate blog to explain the basic terminologies used in Spark like RDD, SparkContext, SQLContext, various transformations and actions etc. You can go through it for basic understanding. However, I have explained a little bit in comments above each line of code what it actually does. For a list of Spark functions you can refer this.

// Your package name
package com.dataneb.spark

// Each library has its significance, I have commented in below code how its being used
import org.apache.spark._
import org.apache.spark.sql._
import org.apache.log4j._

object jsonfileReader {

  // Reducing the error level to just "ERROR" messages
  // It uses library org.apache.log4j._
  // You can apply other logging levels like ALL, DEBUG, ERROR, INFO, FATAL, OFF etc
  Logger.getLogger("org").setLevel(Level.ERROR)

  // Defining Spark configuration to define application name and the local resources to use
  // It uses library org.apache.spark._
  val conf = new SparkConf().setAppName("Sample App")
  conf.setMaster("local")

  // Using above configuration to define our SparkContext
  val sc = new SparkContext(conf)

  // Defining SQL context to run Spark SQL
  // It uses library org.apache.spark.sql._
  val sqlContext = new SQLContext(sc)

  // Main function where all operations will occur
  def main (args: Array[String]): Unit = {

    // Reading the json file
    val df = sqlContext.read.json("/Volumes/MYLAB/testdata/multilinecolors.json")

    // Printing schema
    df.printSchema()

    // Saving as temporary table
    df.registerTempTable("JSONdata")

    // Retrieving all the records
    val data = sqlContext.sql("select * from JSONdata")

    // Showing all the records
    data.show()

    // Stopping Spark Context
    sc.stop
  }
}

7. Run the code!
Right click anywhere on the screen and select Run As Scala Application. That's it!! If you have followed the steps properly you will find the result in the Console. We have successfully loaded the JSON file using Spark SQL dataframes, printed the JSON schema and displayed the data. Try reading the single line JSON file which we created earlier as well; note that there is a multiLine flag which you need to set to true to read files whose records span multiple lines. Also, you can save this data in HDFS, a database or a CSV file depending upon your need. If you have any question, please don't forget to write in the comments section below. Thank you.
Next: How to convert RDD to dataframe?
Navigation menu 1. Apache Spark and Scala Installation 1.1 Spark installation on Windows 1.2 Spark installation on Mac 2. Getting Familiar with Scala IDE 2.1 Hello World with Scala IDE 3. Spark data structure basics 3.1 Spark RDD Transformations and Actions example 4. Spark Shell 4.1 Starting Spark shell with SparkContext example 5. Reading data files in Spark 5.1 SparkContext Parallelize and read textFile method 5.2 Loading JSON file using Spark Scala 5.3 Loading TEXT file using Spark Scala 5.4 How to convert RDD to dataframe? 6. Writing data files in Spark 6.1 How to write single CSV file in Spark 7. Spark streaming 7.1 Word count example Scala 7.2 Analyzing Twitter texts 8. Sample Big Data Architecture with Apache Spark 9. What's Artificial Intelligence, Machine Learning, Deep Learning, Predictive Analytics, Data Science? 10.
Spark Interview Questions and Answers
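One note on the multiline flag mentioned at the end of the post: by default Spark's JSON reader expects one JSON record per line, so singlelinecolors.json loads without any extra option, while the bracketed multilinecolors.json may need the multiLine option depending on your Spark version (the option is available from Spark 2.2 onwards). A minimal sketch, assuming the sqlContext and file paths defined in the code above:

// Default mode: one JSON object per line
val singleDF = sqlContext.read.json("/Volumes/MYLAB/testdata/singlelinecolors.json")

// multiLine mode: the whole file is one JSON document (an array of objects)
val multiDF = sqlContext.read
  .option("multiLine", "true")
  .json("/Volumes/MYLAB/testdata/multilinecolors.json")

singleDF.show()
multiDF.show()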
- How to write single CSV file using spark?
Apache Spark by default writes CSV output as multiple part-* files inside a directory. The reason is simple: each partition is saved individually, and since Apache Spark is built for distributed processing, multiple files are expected. However, you can overcome this by several methods. In previous posts we have just read data files (flat file, JSON) and created RDDs and dataframes using Spark SQL, but we haven't written a file back to disk or any storage system. In this Apache Spark tutorial you will learn how to write files back to disk.
Main menu: Spark Scala Tutorial
For this blog, I am creating a Scala object - textfileWriter - in the same project (txtReader) where we created textfileReader.

Source File
I am using the same source file, squid.txt (with duplicate records), which I created in the previous blog. However, in a practical scenario the source could be anything - a relational database, HDFS file system, message queue etc. In practice you would never read and write the same file; this is just for demo purposes.

1286536309.586 921 192.168.0.68 TCP_MISS/200 507 POST http://rcv-srv37.inplay.tubemogul.co...eiver/services - DIRECT/174.129.41.128 application/xml
1286536309.608 829 192.168.0.68 TCP_MISS/200 507 POST http://rcv-srv37.inplay.tubemogul.co...eiver/services - DIRECT/174.129.41.128 application/xml
1286536309.660 785 192.168.0.68 TCP_MISS/200 507 POST http://rcv-srv37.inplay.tubemogul.co...eiver/services - DIRECT/174.129.41.128 application/xml
1286536309.684 808 192.168.0.68 TCP_MISS/200 507 POST http://rcv-srv37.inplay.tubemogul.co...eiver/services - DIRECT/174.129.41.128 application/xml
1286536309.775 195 192.168.0.227 TCP_MISS/200 4120 GET http://i4.ytimg.com/vi/gTHZnIAzmdY/default.jpg - DIRECT/209.85.153.118 image/jpeg
1286536309.795 215 192.168.0.227 TCP_MISS/200 5331 GET http://i2.ytimg.com/vi/-jBxVLD4fzg/default.jpg - DIRECT/209.85.153.118 image/jpeg
1286536309.815 234 192.168.0.227 TCP_MISS/200 5261 GET http://i1.ytimg.com/vi/dCjp28ps4qY/default.jpg - DIRECT/209.85.153.118 image/jpeg

Sample Code
Open textfileWriter.scala and copy-paste the code written below. I have written separate blogs to explain the basic terminologies used in Spark like RDD, SparkContext, SQLContext, various transformations and actions etc. You can go through these for basic understanding: Spark shell, Spark context and configuration; Spark RDD, Transformations and Actions. However, I have explained a little bit in comments above each line of code what it actually does. For a list of Spark functions you can refer this. You can make this code much simpler, but my aim is to teach as well, hence I have intentionally introduced the header structure, SQL context, string RDD etc. If you are familiar with these, you can just focus on the part that writes the dataframe.
package com.dataneb.spark

// Each library has its significance, I have commented in below code how its being used
import org.apache.spark._
import org.apache.spark.sql._
import org.apache.log4j._
import org.apache.spark.sql.types.{StructType, StructField, StringType}
import org.apache.spark.sql.Row

object textfileWriter {

  // Reducing the error level to just "ERROR" messages
  // It uses library org.apache.log4j._
  // You can apply other logging levels like ALL, DEBUG, ERROR, INFO, FATAL, OFF etc
  Logger.getLogger("org").setLevel(Level.ERROR)

  // Defining Spark configuration to define application name and the local resources to use
  // It uses library org.apache.spark._
  val conf = new SparkConf().setAppName("textfileWriter")
  conf.setMaster("local")

  // Using above configuration to define our SparkContext
  val sc = new SparkContext(conf)

  // Defining SQL context to run Spark SQL
  // It uses library org.apache.spark.sql._
  val sqlContext = new SQLContext(sc)

  // Main function where all operations will occur
  def main (args: Array[String]): Unit = {

    // Reading the text file
    val squidString = sc.textFile("/Users/Rajput/Documents/testdata/squid.txt")

    // Defining the data-frame header structure
    val squidHeader = "time duration client_add result_code bytes req_method url user hierarchy_code type"

    // Defining schema from header which we defined above
    // It uses library org.apache.spark.sql.types.{StructType, StructField, StringType}
    val schema = StructType(squidHeader.split(" ").map(fieldName => StructField(fieldName, StringType, true)))

    // Converting String RDD to Row RDD for 10 attributes
    val rowRDD = squidString.map(_.split(" ")).map(x => Row(x(0), x(1), x(2), x(3), x(4), x(5), x(6), x(7), x(8), x(9)))

    // Creating dataframe based on Row RDD and schema
    val squidDF = sqlContext.createDataFrame(rowRDD, schema)

    // Writing dataframe to a file with overwrite mode, header and single partition.
    squidDF
      .repartition(1)
      .write
      .mode("overwrite")
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .save("targetfile.csv")

    sc.stop()
  }
}

Run the code!

Output
There are several other methods to write these files.

Method 1
This is what we did above. If the expected dataframe size is small you can use either repartition or coalesce to create a single file output as /filename.csv/part-00000.

scala> dataframe
        .repartition(1)
        .write
        .mode("overwrite")
        .format("com.databricks.spark.csv")
        .option("header", "true")
        .save("filename.csv")

repartition(1) will shuffle the data to write everything into one particular partition, so the writer cost will be high and it might take a long time if the file size is huge.

Method 2
Coalesce will require a lot of memory; if your file size is huge you may run out of memory.

scala> dataframe
        .coalesce(1)
        .write
        .mode("overwrite")
        .format("com.databricks.spark.csv")
        .option("header", "true")
        .save("filename.csv")

Coalesce() vs repartition()
Coalesce and repartition both redistribute the data to increase or decrease the number of partitions, but repartition is the more costly operation as it performs a full shuffle.
For example, scala> val distData = sc.parallelize(1 to 16, 4) distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[128] at parallelize at :24 // current partition size scala> distData.partitions.size res63: Int = 4 // checking data across each partition scala> distData.mapPartitionsWithIndex((index, iter) => if (index == 0) iter else Iterator()).collect res64: Array[Int] = Array(1, 2, 3, 4) scala> distData.mapPartitionsWithIndex((index, iter) => if (index == 1) iter else Iterator()).collect res65: Array[Int] = Array(5, 6, 7, 8) scala> distData.mapPartitionsWithIndex((index, iter) => if (index == 2) iter else Iterator()).collect res66: Array[Int] = Array(9, 10, 11, 12) scala> distData.mapPartitionsWithIndex((index, iter) => if (index == 3) iter else Iterator()).collect res67: Array[Int] = Array(13, 14, 15, 16) // decreasing partitions to 2 scala> val coalData = distData.coalesce(2) coalData: org.apache.spark.rdd.RDD[Int] = CoalescedRDD[133] at coalesce at :25 // see how shuffling occurred. Instead of moving all data it just moved 2 partitions. scala> coalData.mapPartitionsWithIndex((index, iter) => if (index == 0) iter else Iterator()).collect res68: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8) scala> coalData.mapPartitionsWithIndex((index, iter) => if (index == 1) iter else Iterator()).collect res69: Array[Int] = Array(9, 10, 11, 12, 13, 14, 15, 16) repartition() Notice how repartition() will re-shuffle everything to create new partitions as compared to previous RDDs - distData and coalData. Hence repartition is more costlier operation as compared to coalesce. scala> val repartData = distData.repartition(2) repartData: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[139] at repartition at :25 // checking data across each partition scala> repartData.mapPartitionsWithIndex((index, iter) => if (index == 0) iter else Iterator()).collect res70: Array[Int] = Array(1, 3, 6, 8, 9, 11, 13, 15) scala> repartData.mapPartitionsWithIndex((index, iter) => if (index == 1) iter else Iterator()).collect res71: Array[Int] = Array(2, 4, 5, 7, 10, 12, 14, 16) Method 3 Let the file create on various partitions and later merge the files with separate Shell Script. This method will be fast depending upon your hard disk write speed. #!/bin/bash echo "ColName1, ColName2, ColName3, ... , ColNameX" > filename.csv for i in /spark/output/*.CSV ; do echo "FileNumber $i" cat $i >> filename.csv rm $i done echo "Done" Method 4 If you are using Hadoop file system to store output files. You can leverage HDFS to merge files by using getmerge utility. Input your source directory with all partition files and destination output file, it concatenates all the files in source into destination local file. You can also set -nl to add a newline character at the end of each file. Further, -skip-empty-file can be used to avoid unwanted newline characters in case of empty files. Syntax : hadoop fs -getmerge [-nl] [-skip-empty-file] hadoop fs -getmerge -nl /spark/source /spark/filename.csv hadoop fs -getmerge /spark/source/file1.csv /spark/source/file2.txt filename.csv Method 5 Use FileUtil.copyMerge() to merge all the files. 
import org.apache.hadoop.conf.Configuration import org.apache.hadoop.fs._ def merge(srcPath: String, dstPath: String): Unit = { val hadoopConfig = new Configuration() val hdfs = FileSystem.get(hadoopConfig) FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), true, hadoopConfig, null) } val newData = << Your dataframe >> val outputfile = "/spark/outputs/subject" var filename = "sampleFile" var outputFileName = outputfile + "/temp_" + filename var mergedFileName = outputfile + "/merged_" + filename var mergeFindGlob = outputFileName newData.write .format("com.databricks.spark.csv") .option("header", "true") .mode("overwrite") .save(outputFileName) merge(mergeFindGlob, mergedFileName ) If you have any question, please don't forget to write in comments section below. Thank you! Next: Spark Streaming word count example Navigation menu 1. Apache Spark and Scala Installation 1.1 Spark installation on Windows 1.2 Spark installation on Mac 2. Getting Familiar with Scala IDE 2.1 Hello World with Scala IDE 3. Spark data structure basics 3.1 Spark RDD Transformations and Actions example 4. Spark Shell 4.1 Starting Spark shell with SparkContext example 5. Reading data files in Spark 5.1 SparkContext Parallelize and read textFile method 5.2 Loading JSON file using Spark Scala 5.3 Loading TEXT file using Spark Scala 5.4 How to convert RDD to dataframe? 6. Writing data files in Spark 6.1 How to write single CSV file in Spark 7. Spark streaming 7.1 Word count example Scala 7.2 Analyzing Twitter texts 8. Sample Big Data Architecture with Apache Spark 9. What's Artificial Intelligence, Machine Learning, Deep Learning, Predictive Analytics, Data Science? 10. Spark Interview Questions and Answers
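One practical follow-up to Methods 1 and 2: even with a single partition, Spark still writes the output as a part file inside the targetfile.csv directory. If you need a standalone file with a clean name, you can promote that part file with the Hadoop FileSystem API. This is only a sketch; the output paths and the final file name squid_output.csv are illustrative.

// Sketch: after squidDF.repartition(1).write...save("targetfile.csv"), rename the
// single part file inside that directory to a standalone CSV. Paths are illustrative.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

def promotePartFile(outputDir: String, finalFile: String): Unit = {
  val fs = FileSystem.get(new Configuration())
  // the single part file written by Spark, e.g. targetfile.csv/part-00000-....csv
  val partFile = fs.globStatus(new Path(outputDir + "/part-*"))(0).getPath
  fs.rename(partFile, new Path(finalFile))   // move it to the desired name
  fs.delete(new Path(outputDir), true)       // drop the now-redundant directory
}

promotePartFile("targetfile.csv", "squid_output.csv")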
- How to make calls to Twitter APIs using Postman client?
In this blog, I am going to invoke Twitter custom APIs with Postman client in order to pull live feeds, or you can say tweets from Twitter. Output will be JSON text which you can format or change based on your requirement. Soon I will be writing another blog to demonstrate how you can ingest this data in real time with Kafka and process it using Spark. Or, you can directly stream & process the data in real time with Spark streaming. As of now, let's try to connect Twitter API using Postman. Prerequisites Postman client Twitter developer account Postman Client Installation There are basically two ways to install Postman, either you can download the Postman extension for your browser (chrome in my case) or you can simply install native Postman application. I have installed Postman application to write this blog. Step 1. Google "Install Postman" and go to the Postman official site to download the application. Step 2. After opening Postman download link, select your operating system to start Postman download. It's available for all the types of platform - Mac, Linux and Windows. The download link keeps on changing so if the download link doesn't work just Google it as shown above. Step 3. Once installer is downloaded, run the installer to complete the installation process. It's approximately 250 MB application (for Mac). Step 4. Sign up. After signing in, you can save your preferences or do it later as shown below. Step 5. Your workspace will look like below. Twitter Developer Account I hope you all have Twitter developers account, if not please create it. Then, go to Developer Twitter and sign in with your Twitter account. Click on Apps > Create an app at the top right corner of your screen. Note: Earlier, developer.twitter.com was known as apps.twitter.com. Fill out the form to create an application > specify Name, Description and Website details as shown below. This screen has slightly changed with new Twitter developer interface but overall process is still similar. If you have any question, please feel free to ask in comment section at the end of this post. Please provide a proper website name like https://example.com otherwise you will get error while creating the application. Sample has been shown above. Once you successfully create the app, you will get the below page. Make sure access level is set to Read and Write as shown above. Now go to Keys and Access Token tab > click on Create Access Token. At this point, you will be able to see 4 keys which will used in Postman client. Consumer Key (API Key) Consumer Secret (API Secret) Access Token Access Token Secret. New Interface looks like this. Calling Twitter API with Postman Client Open Postman application and click on authorization tab. Select authorization type as OAuth 1.0. Add authorization data to Request Headers. This is very important step else you will get error. After setting up authorization type and request header, fill out the form carefully with 4 keys (just copy-paste) which we generated in Twitter App - Consumer Key (API Key), Consumer Secret (API Secret), Access Token & Access Token Secret. Execute it! Now let's search for tweeter statuses which says snap. Copy-paste request URL as https://api.twitter.com/1.1/statuses/user_timeline.json?screen_name=snap as shown below. You can refer API reference index in order to access various Twitter custom API. GET some tweets, hit Send button. You will get response as shown below. 
GET Examples Twitter has very nice API documentation on accounts, users, tweets, media, trend, messages, geo, ads etc and there is huge variety of data which you can pull. I am invoking few APIs just for demonstration purpose. Accounts and users Lets say you want to search for user name "Elon". You can do it like this, GET https://api.twitter.com/1.1/users/search.json?q=elon Now suppose you want to get friend list of Elon Musk, you can do it like this, GET https://api.twitter.com/1.1/friends/list.json?user_id=44196397 Input user_id is same as id in previous output. You can also change the display => pretty, raw and preview. Trending Topics You can pull top 50 trending global topics with id = 1, for example, GET https://api.twitter.com/1.1/trends/place.json?id=1 POST Examples You can also POST something like you Tweet in your Twitter web account. For example if you want to Tweet Hello you can do it like this, POST https://api.twitter.com/1.1/statuses/update.json?status=Hello You can verify same with your Twitter account, yeah that's me! I rarely use Twitter. Cursoring Cursoring is used for pagination when you have large result set. Lets say you want to pull all statuses which says "Elon", it's obvious that there will be good number of tweets and that response can't fit in one page. To navigate through each page cursoring is needed. For example, lets say you want to pull 5 result per page you can do it like this, GET https://api.twitter.com/1.1/search/tweets.json?q=Elon&count=5 Now, to navigate to next 5 records you have to use next_results shown in search_metadata section above like this, GET https://api.twitter.com/1.1/search/tweets.json?max_id=1160404261450244095&q=Elon&count=5&include_entities=1 To get next set of results again use next_results from search_metadata of this result set and so on.. Now, obviously you can't do this manually each time. You need to write loop to get the result set programmatically, for example, cursor = -1 api_path = "https://api.twitter.com/1.1/endpoint.json?screen_name=targetUser" do { url_with_cursor = api_path + "&cursor=" + cursor response_dictionary = perform_http_get_request_for_url( url_with_cursor ) cursor = response_dictionary[ 'next_cursor' ] } while ( cursor != 0 ) In our case next_results is like next_cursor, like a pointer to next page. This might be different for different endpoints like tweets, users and accounts, ads etc. But logic will be same to loop through each result set. Refer this for complete details. That's it you have successfully pulled data from Twitter. #TwitterAPI #Postmaninstallation #Oauth #API #CustomerKey #CustomerSecret #accesstoken #Postman Next: Analyze Twitter Tweets using Apache Spark Learn Apache Spark in 7 days, start today! 1. Apache Spark and Scala Installation 1.1 Spark installation on Windows 1.2 Spark installation on Mac 2. Getting Familiar with Scala IDE 2.1 Hello World with Scala IDE 3. Spark data structure basics 3.1 Spark RDD Transformations and Actions example 4. Spark Shell 4.1 Starting Spark shell with SparkContext example 5. Reading data files in Spark 5.1 SparkContext Parallelize and read textFile method 5.2 Loading JSON file using Spark Scala 5.3 Loading TEXT file using Spark Scala 5.4 How to convert RDD to dataframe? 6. Writing data files in Spark 6.1 How to write single CSV file in Spark 7. Spark streaming 7.1 Word count example Scala 7.2 Analyzing Twitter texts 8. Sample Big Data Architecture with Apache Spark 9. 
What's Artificial Intelligence, Machine Learning, Deep Learning, Predictive Analytics, Data Science? 10. Spark Interview Questions and Answers
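To make the cursoring idea a little more concrete, here is a rough Scala sketch of paging through the search endpoint with next_results, as described above. Treat it as an outline only: the httpGet helper and the auth value are placeholders (a real request must carry a signed OAuth 1.0 Authorization header built from the same four keys used in Postman), and next_results is pulled out with a simple regex rather than a JSON library.

// Rough sketch of paging through search results using next_results.
// httpGet is a placeholder: a real call must send a signed OAuth 1.0 header.
import java.net.{HttpURLConnection, URL}
import scala.io.Source

def httpGet(url: String, authHeader: String): String = {
  val conn = new URL(url).openConnection().asInstanceOf[HttpURLConnection]
  conn.setRequestMethod("GET")
  conn.setRequestProperty("Authorization", authHeader)   // signed OAuth 1.0 header
  Source.fromInputStream(conn.getInputStream).mkString
}

val base = "https://api.twitter.com/1.1/search/tweets.json"
val auth = "OAuth ..."                                    // placeholder value
val nextResults = """next_results":"([^"]+)""".r

var page = httpGet(base + "?q=Elon&count=5", auth)
var next = nextResults.findFirstMatchIn(page).map(_.group(1))
while (next.isDefined) {          // keep following next_results until it disappears
  page = httpGet(base + next.get, auth)
  next = nextResults.findFirstMatchIn(page).map(_.group(1))
}

The loop stops when the response no longer contains next_results in its search_metadata, which is how the search API signals the last page.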
- Kibana GeoIP example: How to index geographical location of IP addresses into Elasticsearch
The relation between your IP address and geolocation is very simple. There are numerous websites available as of today like Maxmind, IP2Location, IPstack , Software77 etc where you can track the geolocation of an IP address. What's the benefit? It's very simple, it gives you another dimension to analyze your data. Let's say my data predicts that most of the users traffic is coming from 96.67.149.166. It doesn't make complete sense until I say most of the traffic is coming from New Jersey. When I say geolocation it includes multiple attributes like city, state, country, continent, region, currency, country flag, country language, latitude, longitude etc. Most of the websites which provide geolocation are paid sites. But there are few like IPstack which provides you free access token to make calls to their rest API's. Still there are limitations like how many rest API calls you can make per day and also how many types of attributes you can pull. Suppose I want to showcase specific city in the report and API provides limited access to country and continent only, then obviously that data is useless for me. Now the best part is Elastic stack provides you free plugin called "GeoIP" which grants you access to lookup millions of IP addresses. You would be thinking from where it gets the location details? The answer is Maxmind which I referred earlier. GeoIP plugin internally does a lookup from stored copy of Maxmind database which keeps on updating and creates number of extra fields with geo coordinates (longitude & latitude). These geo coordinates can be used to plot maps in Kibana. ELK Stack Installation I am installing ELK stack on Mac OS, for installation on Linux machine refer this. ELK installation is very easy on Mac with Homebrew. It's hardly few minutes task if done properly. 1. Homebrew Installation Run this command on your terminal. If you have already installed Homebrew move to the next step, or if this command doesn't work - copy it from here. $ /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)" 2. Java Installation Check if java is installed on your machine. $ java -version java version "9.0.1" If java is not installed, run following steps to install java. $ brew tap caskroom/cask $ brew cask install java $ brew cask info java 3. Elasticsearch Installation $ brew tap elastic/tap $ brew install elastic/tap/elasticsearch-full $ elasticsearch If you see all INFO without any error, that means installation went fine. Let this run, don't kill the process. Now, simply open localhost:9200 in your local browser. You will see elasticsearch version. [TIP] You might face permission issue if you are not logged in with root user. To enable root user on Mac you can follow this. It's due to security reasons that root user is disabled by default on Mac. However another solution is to change folder permission itself. Run these commands if you want to change folder permissions, $ sudo chown -R $(whoami) /usr/local/include /usr/local/lib/pkgconfig $ chmod u+w /usr/local/include /usr/local/lib/pkgconfig Install xcode if it's missing, $ xcode-select --install 4. Kibana Installation $ brew install elastic/tap/kibana-full $ kibana Let this process run, don't kill. Now, open localhost:5601 in your local browser to check if kibana is running properly, 5. Logstash Installation $ brew install elastic/tap/logstash-full Configuring Logstash for GeoIP Let's begin with few sample IP addresses as listed below. 
I generated this sample data from browserling.com so please ignore if there is some known ip address in this list. Honestly speaking even I don't know where these IP addresses will point to when we generate the maps. Sample Data 1. Copy paste these records into a flat file with "ipaddress" header (sampleip.csv). ipaddress 0.42.56.104 82.67.74.30 55.159.212.43 108.218.89.226 189.65.42.171 62.218.183.66 210.116.94.157 80.243.180.223 169.44.232.173 232.117.72.103 242.14.158.127 14.209.62.41 4.110.11.42 135.235.149.26 93.60.177.34 145.121.235.122 170.68.154.171 206.234.141.195 179.22.18.176 178.35.233.119 145.156.239.238 192.114.2.154 212.36.131.210 252.185.209.0 238.49.69.205 2. Make sure your Elasticsearch and Kibana services are up and running. If not, please refer my previous blog - how to restart them. 3. [Update 9/Aug/2019: Not mandatory step now] Install GeoIP plugin for Elasticsearch. Run the below command in your Elasticsearch home directory. Once GeoIP plugin is installed successfully, you will be able to find plugin details under elasticsearch home plugin directory "/elasticsearch/plugins". You need to run installation command on each node if you are working in a clustered environment and then restart the services. /elasticsearch/bin/elasticsearch-plugin install ingest-geoip New version of elastics has built in GeoIP module, so you don't need to install it separately. Configure Logstash Configure logstash config file to create "logstash-iplocation" index. Please note your index name should start with logstash-name otherwise your attributes will not be mapped properly as geo_points datatype. This is because the default index name in logstash template is declared as logstash-* , you can change it if you want but as of now lets move ahead with logstash-iplocation. Below is the sample input, filter and output configuration. input { file { path => "/Volumes/MYLAB/testdata/sampleip.csv" start_position => "beginning" sincedb_path => "/Volumes/MYLAB/testdata/logstash.txt" } } filter { csv { columns => "ipaddress" } geoip { source => "message" } } output { elasticsearch { hosts => "localhost" index => "logstash-iplocation" } stdout{ codec => rubydebug } } My configuration file looks something like this: Important Notes Your index name should be in lower caps, starting with logstash- for example logstash-abcd Also, sincedb path is created once per file input, so if you want to reload the same file make sure you delete the sincedb file entry. It looks like this, You invoke geoip plugin from filter configuration, it has no relation with input/output. Run Logstash Load the data into elasticsearch by running below command (it's a single line command). Now wait, it will take few seconds to load. Change your home location accordingly, for me its homebrew linked as shown below. /usr/local/var/homebrew/linked/logstash-full/bin/logstash -f /usr/local/var/homebrew/linked/logstash-full/libexec/config/logstash_ip.config Sample output Important Notes See if filters geoip is invoked when you load the data into elasticsearch. Also, the datatype of location should be geo_point, otherwise there is some issue with your configuration. Latitude and longitude datatype should be float. These datatypes are like confirmation that logstash loaded this data as expected. Kibana Dashboard Creation 1. Once data is loaded into Elasticsearch, open Kibana UI and go to Management tab => Kibana Index pattern. 2. Create Kibana index with "logstash-iplocation" pattern and hit Next. 3. 
Kibana Dashboard Creation
1. Once the data is loaded into Elasticsearch, open the Kibana UI and go to the Management tab => Kibana Index Patterns.
2. Create a Kibana index pattern with "logstash-iplocation" and hit Next.
3. Select the timestamp field if you want to show it with your index and hit Create index pattern.
4. Now go to the Discover tab and select "logstash-iplocation" to see the data we just loaded. You can expand the fields and see that geoip.location has the geo_point datatype. You can verify this by the "globe" sign that appears just before the geoip.location field. If it's not there, something went wrong and the datatype mapping is incorrect.
5. Now go to the Visualize tab, select Coordinate Map from the visualization types, and choose "logstash-iplocation" as the index.
6. Apply the bucket settings (Buckets: Geo coordinates, Aggregation: Geohash, Field: geoip.location) as shown below and hit the "Play" button.
That's it!! You have located all the IP addresses. Thank you!! If you have any questions, please comment.
Next: Loading data into Elasticsearch using Apache Spark
Navigation Menu: Introduction to ELK Stack Installation Loading data into Elasticsearch with Logstash Create Kibana Dashboard Example Kibana GeoIP Dashboard Example Loading data into Elasticsearch using Apache Spark
- A Day in the Life of a Computer Programmer
As a computer programmer, my daily life is actually kind of weird. I did my undergrad in computer science and worked with Microsoft for 4 years, but I'm largely self-taught and have spent a far greater number of hours learning how to code on my own. I work on US-based client projects, and because of the time zone difference my hours are not fully standard. Usually I try to work 7–9 hours a day, though sometimes it can be as much as 10–12 hours. Here is the routine I follow, roughly, every day:
4:50 am : Alarm beeps, Snooze 1 ..
5:05 am : Snooze 2 ..
5:20 am : Snooze 3 ..
5:35 am : Snooze n, Rolling in bed ..
6:00 am : Semi awake
6:10 am : Check Facebook, Instagram, Whatsapp, Robinhood, 9gag
6:15 am : Wake-up, Workout (pushups, gym, meditation, yoga, I am lying)
6:30 am : Shower, Dress up
7:30 am : Leave for work, Mustang, Daily traffic, Pandora (free subscription)
8:15 am : Arrive at work, Parking, Swipe access card at the entrance
8:20 am : Check email, Service Now
8:30 am : Offshore-onshore call, Status updates, Discuss targets for the day
9:00 am : Breakfast (uncertain)
9:20 am : Code, debug, code, code, debug
10:00 am : Error, error, error, error
11:00 am : Coffee, "Smoking kills"
11:30 am : Code, code, debug, code
12:00 pm - 2:00 pm : Global variable "Lunch"
2:00 pm : Code, debug, code, debug
3:00 pm : Code, code, code, code
3:30 pm : Check on the entire team
4:30 pm : Wrap up, Leave for home, Drive, Traffic, Pandora
5:15 pm : Arrive home, Change clothes, Chillax
5:45 pm : Jog for 4-5 miles, Shower
6:30 pm : Facebook, Youtube, News, Blogs
8:00 pm - 9:00 pm : Dinner, Eat 24, Cooking (rare element)
9:30 pm : Offshore-onshore call (sometimes free), otherwise Netflix
10:00 pm : Netflix, Youtube, Chit-chat with girlfriend
10:30 pm - 11:00 pm : Shut down, Sleep
4:50 am : Alarm beeps, Snooze 1, 2, 3, n ..
Thanks for reading, hit like and share if you enjoyed the post!
- Installing Apache Spark and Scala (Windows)
Main menu: Spark Scala Tutorial
In this Spark Scala tutorial you will learn how to download and install: Apache Spark (on Windows), the Java Development Kit (JDK), and the Eclipse Scala IDE. By the end of this tutorial you will be able to run Apache Spark with Scala on a Windows machine, using the Eclipse Scala IDE.
JDK Download and Installation
1. First download the JDK (Java Development Kit) from this link. If you have already installed Java on your machine, please proceed to the Spark download and installation. I have installed Java SE 8u171/8u172 (Windows x64) on my machine. Java SE 8u171 means Java Standard Edition 8 Update 171. This version keeps changing, so just download the latest version available at the time of download and follow these steps.
2. Accept the license agreement and choose the OS type. In my case it is the Windows 64-bit platform.
3. Double-click the downloaded executable file (jdk*.exe; ~200 MB) to start the installation. Note down the destination path where the JDK is being installed and then complete the installation process (for instance, in this case it says Install to: C:\Program Files\Java\jdk1.8.0_171\).
Apache Spark Download & Installation
1. Download a pre-built version of Apache Spark from this link. Again, don't worry about the version; it might be different for you. Choose the latest Spark release from the drop-down menu and the package type as pre-built for Apache Hadoop.
2. If necessary, download and install WinRAR so that you can extract the .tgz file that you just downloaded.
3. Create a separate directory spark in the C drive. Now extract the Spark files using WinRAR and copy the contents from the downloads folder to C:\spark. Please note you should end up with a directory structure like C:\spark\bin, C:\spark\conf, etc. as shown above.
Configuring the Windows environment for Apache Spark
4. Make sure "Hide extensions for known file types" in your file explorer (View tab) is unchecked. Now go to the C:\spark\conf folder and rename the log4j.properties.template file to log4j.properties. You should see the filename as log4j.properties and not just log4j.
5. Now open log4j.properties with WordPad and change the statement log4j.rootCategory=INFO, console to log4j.rootCategory=ERROR, console. Save the file and exit. We made this change so that Apache Spark captures only ERROR messages when it runs, instead of all INFO.
6. Now create a C:\winutils\bin directory. Download winutils.exe from GitHub and extract all the files. You will find multiple Hadoop versions inside; you just need to focus on the Hadoop version you selected when downloading the pre-built Hadoop 2.x/3.x package type in Step 1. Copy all the underlying files (all .dll, .exe etc.) from that Hadoop version folder and move them into the C:\winutils\bin folder. This step is needed to fool Windows into thinking we are running Hadoop. This location (C:\winutils\bin) will act as the Hadoop home.
7. Now right-click your Windows menu, select Control Panel --> System and Security --> System --> "Advanced System Settings" --> then click the "Environment Variables" button. Click the "New" button under User variables and add 3 variables:
SPARK_HOME c:\spark
JAVA_HOME (the path you noted during JDK installation Step 3, for example C:\Program Files\Java\jdk1.8.0_171)
HADOOP_HOME c:\winutils
8. Add the following 2 paths to your PATH user variable. Select the "PATH" user variable and edit it; if it's not present, create it. A quick way to verify these settings is shown right after this step.
%SPARK_HOME%\bin
%JAVA_HOME%\bin
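Before moving on, it can be worth confirming that the new variables are actually visible. They are only picked up by command prompts opened after you save them, so open a fresh Command Prompt and try something like the following (a quick informal check, not an official installation step):
echo %SPARK_HOME%
echo %JAVA_HOME%
echo %HADOOP_HOME%
dir %HADOOP_HOME%\bin\winutils.exe
Each echo should print the path you configured, and the dir command should list winutils.exe. If a variable comes back unexpanded (for example %SPARK_HOME% is printed literally), re-check Step 7.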
Download and Install Scala IDE
1. Now install the latest Scala IDE from here. I have installed Scala-SDK-4.7 on my machine. Download the zipped file and extract it. That's it.
2. Under the Scala-SDK folder you will find an eclipse folder; extract it to c:\eclipse. Run eclipse.exe and it will open the IDE (we will use this later).
Now test it out! Open a Windows command prompt in administrator mode: right-click Command Prompt in the search menu and choose Run as administrator.
Type java -version and hit Enter to check if Java is properly installed. If you see the Java version, Java is installed properly.
Type cd c:\spark and hit Enter. Then type dir and hit Enter to get a directory listing. Look for a text file such as README.md or CHANGES.txt.
Type spark-shell and hit Enter. At this point you should have a scala> prompt as shown below. If not, double-check the steps above, check the environment variables, and after making a change close the command prompt and retry.
Type val rdd = sc.textFile("README.md") and hit Enter. Now type rdd.count() and hit Enter. You should get a count of the number of lines in the readme file! Congratulations, you just ran your first Spark program! We just created an RDD from the readme text file and ran a count action on it. Don't worry, we will be going through this in detail in the next sections.
Hit Ctrl-D to exit the Spark shell and close the console window. You've got everything set up! Hooray!
Note for Python lovers - To install PySpark continue to this blog.
That's all! Guys, if it's not running, don't worry. Please mention it in the comments section below and I will help you out with the installation process. Thank you.
Next: Just enough Scala for Spark
Navigation menu 1. Apache Spark and Scala Installation 1.1 Spark installation on Windows 1.2 Spark installation on Mac 2. Getting Familiar with Scala IDE 2.1 Hello World with Scala IDE 3. Spark data structure basics 3.1 Spark RDD Transformations and Actions example 4. Spark Shell 4.1 Starting Spark shell with SparkContext example 5. Reading data files in Spark 5.1 SparkContext Parallelize and read textFile method 5.2 Loading JSON file using Spark Scala 5.3 Loading TEXT file using Spark Scala 5.4 How to convert RDD to dataframe? 6. Writing data files in Spark 6.1 How to write single CSV file in Spark 7. Spark streaming 7.1 Word count example Scala 7.2 Analyzing Twitter texts 8. Sample Big Data Architecture with Apache Spark 9. What's Artificial Intelligence, Machine Learning, Deep Learning, Predictive Analytics, Data Science? 10. Spark Interview Questions and Answers
- Elasticsearch Tutorial - What is ELK stack (Elastic stack)
In this Elasticsearch tutorial, you will learn what the ELK stack (Elastic stack) is. We will go through ELK stack examples, load data into the Elasticsearch stack and create a Kibana dashboard.
Navigation Menu: Introduction to ELK Stack Installation Load data into Elasticsearch stack with Logstash Create Kibana Dashboard Example Kibana GeoIP Dashboard Example
What is the ELK stack (now called Elastic Stack)?
ELK stack is an acronym for three open-source products - Elasticsearch, Logstash & Kibana - all of which are maintained by Elastic. The ELK stack started as a log analytics solution but later evolved into an enterprise search and analytics platform. Elasticsearch is based on the Lucene search engine; you can think of it as a NoSQL database with the capability to index (for text search) and store data. Logstash is essentially a data pipeline tool that can connect to various sources through plugins, apply transformations, and load data into various targets, including Elasticsearch. In short, Logstash collects and transforms data, and is sometimes used for data shipping as well. Kibana is a data visualization platform where you will create dashboards. Another tool, Filebeat, is one of the Beats family and can perform some tasks similar to Logstash.
ELK Stack Architecture
Here is the basic architecture of the Elastic stack. Notice I haven't mentioned the source in the diagram below. Usually the data sources for the ELK stack are various log files, for example application logs, server logs, database logs, network switch logs, router logs etc. These log files are consumed using Filebeat. Filebeat acts as a data collector which gathers various types of log files (when we have more than one type of log file). Nowadays, Kafka is often used as another layer which distributes the files collected by Filebeat to various queues, from where Logstash transforms them and stores them in Elasticsearch for visualization. So the complete flow would look like: [application log, server logs, database log, network switch log, router log etc.] => Filebeat => Kafka => ELK Stack. Please note this can change based on the architecture a project needs. If there are only a few types of log files, you might not even use Filebeat or Kafka and instead dump logs directly into the ELK stack.
Fun Fact: ELK stack Google Trend
Elasticsearch is the most famous component of the stack; refer to the Google Trend shown below. Why is the ELK stack so popular worldwide? Basically, for 3 major reasons. First of all, price - it's open source, easy to learn and free of cost. If you consider other visualization tools like QlikView and Tableau, Kibana provides you similar capabilities without any hidden cost. Elasticsearch is used by many big companies, for example Wikipedia & GitHub. Second, its elegant user interface - you can spend your time exploring and reviewing data, not trying to figure out how to navigate the interface. And last but not least, it's extensible. Elasticsearch is a schema-free NoSQL database that can scale horizontally, and it is also used for real-time analytics.
Next: ELK Installation
Navigation Menu: Introduction to ELK Stack Installation Load data into Elasticsearch stack with Logstash Create Kibana Dashboard Example Kibana GeoIP Dashboard Example
#ELKStack #Elasticsearch #Logstash #Kibana #ElasticsearchTutorial #ElasticStack #ELKTutorial
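To make the [logs] => Filebeat => Logstash => Elasticsearch part of that flow a little more concrete, here is a minimal configuration sketch. The log path, port and index name are illustrative assumptions rather than values from this tutorial, and the Kafka layer is omitted for brevity.
Filebeat side (filebeat.yml):
filebeat.inputs:
  - type: log
    paths:
      - /var/log/myapp/*.log        # hypothetical application log location
output.logstash:
  hosts: ["localhost:5044"]         # ship events to Logstash instead of directly to Elasticsearch
Logstash side (pipeline config):
input {
  beats {
    port => 5044                    # must match the Filebeat output above
  }
}
output {
  elasticsearch {
    hosts => "localhost"
    index => "app-logs"             # illustrative index name
  }
}
As noted above, with only a few types of log files you could drop Filebeat entirely and point a Logstash file input straight at the logs.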