
  • Spark Transformations example (Part 1)

Apache Spark transformations such as reduceByKey, groupByKey, mapPartitions and map are very widely used. Apart from these there are several other transformations, and I will explain each of them with examples. Before I proceed with the Spark transformation examples, if you are new to Spark and Scala I would highly encourage you to go through this post - Spark RDD, Transformation and Actions example. Main menu: Spark Scala Tutorial

We will be using lambda functions to pass through most of these Spark transformations, so those who are new to Scala should have a basic understanding of lambda functions.

Lambda Functions

In brief, lambda functions are like normal functions, except that they are more convenient when a function is used in only one place, so you don't need to define it separately. For example, if you want to double a number you can simply write x => x + x, as you would in Python and other languages. The syntax in Scala looks like this,

scala> val lfunc = (x:Int) => x + x
lfunc: Int => Int =

scala> lfunc(3)
res0: Int = 6

Sample Data

We will be using the "Where is the Mount Everest?" text data which we created in an earlier post (SparkContext and text files). I just picked some random data to go through these examples.

Where is Mount Everest? (MountEverest.txt)
Mount Everest (Nepali: Sagarmatha सगरमाथा; Tibetan: Chomolungma ཇོ་མོ་གླང་མ; Chinese Zhumulangma 珠穆朗玛) is Earth's highest mountain above sea level, located in the Mahalangur Himal sub-range of the Himalayas. The international border between Nepal (Province No. 1) and China (Tibet Autonomous Region) runs across its summit point. - Reference Wikipedia

scala> val mountEverest = sc.textFile("/Users/Rajput/Documents/testdata/MountEverest.txt")
mountEverest: org.apache.spark.rdd.RDD[String] = /Users/Rajput/Documents/testdata/MountEverest.txt MapPartitionsRDD[1] at textFile at :24

Spark Transformations

I encourage you all to run these examples on the spark-shell side-by-side.

map(func)
This transformation returns a new RDD formed by passing each element of the source through func. For example, if you want to split the Mount Everest text into individual words, you just need to pass the lambda x => x.split(" ") and it will create a new RDD as shown below. What is this func doing? It reads each element and splits it on the space character.

scala> val words = mountEverest.map(x => x.split(" "))
words: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[3] at map at :25

scala> words.collect()
res1: Array[Array[String]] = Array(Array(Mount, Everest, (Nepali:, Sagarmatha, सगरमाथा;, Tibetan:, Chomolungma, ཇོ་མོ་གླང་མ;, Chinese, Zhumulangma, 珠穆朗玛), is, Earth's, highest, mountain, above, sea, level,, located, in, the, Mahalangur, Himal, sub-range, of, the, Himalayas., The, international, border, between, Nepal, (Province, No., 1), and, China, (Tibet, Autonomous, Region), runs, across, its, summit, point.))

Don't worry about the collect() action; it's a very basic Spark action which is used to return all the elements. Now, suppose you want to get the word count of this text file. You can do something like this - first split the file and then get the length or size.

scala> mountEverest.map(x => x.split(" ").length).collect()
res6: Array[Int] = Array(45)

scala> mountEverest.map(x => x.split(" ").size).collect()
res7: Array[Int] = Array(45)

Let's say you want to get the total number of characters in the file; you can do this.
scala> mountEverest.map(x => x.length).collect()
res5: Array[Int] = Array(329)

To make all the text upper case, you can do it like this.

scala> mountEverest.map(x => x.toUpperCase()).collect()
res9: Array[String] = Array(MOUNT EVEREST (NEPALI: SAGARMATHA सगरमाथा; TIBETAN: CHOMOLUNGMA ཇོ་མོ་གླང་མ; CHINESE ZHUMULANGMA 珠穆朗玛) IS EARTH'S HIGHEST MOUNTAIN ABOVE SEA LEVEL, LOCATED IN THE MAHALANGUR HIMAL SUB-RANGE OF THE HIMALAYAS. THE INTERNATIONAL BORDER BETWEEN NEPAL (PROVINCE NO. 1) AND CHINA (TIBET AUTONOMOUS REGION) RUNS ACROSS ITS SUMMIT POINT.)

flatMap(func)
This is similar to map, except that it gives you a flattened output. For example,

scala> val rdd = sc.parallelize(Seq("Where is Mount Everest","Himalayas India"))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[22] at parallelize at :24

scala> rdd.collect
res26: Array[String] = Array(Where is Mount Everest, Himalayas India)

scala> rdd.map(x => x.split(" ")).collect
res21: Array[Array[String]] = Array(Array(Where, is, Mount, Everest), Array(Himalayas, India))

scala> rdd.flatMap(x => x.split(" ")).collect
res23: Array[String] = Array(Where, is, Mount, Everest, Himalayas, India)

In the above case we have two elements in rdd - "Where is Mount Everest" and "Himalayas India". When the map() transformation is applied, it returns an array of arrays of strings, Array[Array[String]]: basically two separate arrays of strings inside an outer array. So for each input element we got exactly one output element (1st element => (Where, is, Mount, Everest), 2nd element => (Himalayas, India)), and each of those outputs is itself a collection of words. With flatMap(), the output is flattened into a single array of strings, Array[String]. Thus flatMap() is similar to map, but each input item is mapped to 0 or more output items (1st element => 4 elements, 2nd element => 2 elements). This will give you a clearer picture,

scala> rdd.map(x => x.split(" ")).count()
res24: Long = 2

scala> rdd.flatMap(x => x.split(" ")).count()
res25: Long = 6

map()     => [Where is Mount Everest, Himalayas India] => [[Where, is, Mount, Everest],[Himalayas, India]]
flatMap() => [Where is Mount Everest, Himalayas India] => [Where, is, Mount, Everest, Himalayas, India]

filter(func)
As the name says, it is used to filter elements, much like a where clause in SQL, and it is case sensitive. For example,

// returns one element which contains a match
scala> rdd.filter(x=>x.contains("Himalayas")).collect
res31: Array[String] = Array(Himalayas India)

// No match
scala> rdd.filter(x=>x.contains("Dataneb")).collect
res32: Array[String] = Array()

// Case sensitive
scala> rdd.filter(x=>x.contains("himalayas")).collect
res33: Array[String] = Array()

scala> rdd.filter(x=>x.toLowerCase.contains("himalayas")).collect
res37: Array[String] = Array(Himalayas India)

mapPartitions(func)
Similar to the map() transformation, but here the function runs separately on each partition (block) of the RDD, unlike map() where it runs on each element of a partition. Hence mapPartitions is also useful when you are looking for a performance gain (it calls your function once per partition, not once per element). Suppose you have elements from 1 to 100 distributed among 10 partitions, i.e. 10 elements per partition. The map() transformation will call func 100 times to process these 100 elements, but in the case of mapPartitions(), func will be called once per partition, i.e. 10 times. Secondly, mapPartitions() holds the data in memory, i.e. it will store the result in memory until all the elements of the partition have been processed.
mapPartitions() will return the result only after it finishes processing the whole partition. mapPartitions() requires an iterator as input, unlike the map() transformation.

What is an iterator? (for new programmers) - An iterator is a way to access a collection of elements one by one. It's similar to collections like List(), Array(), Dict() etc. in a few ways, but the difference is that an iterator doesn't load the whole collection of elements into memory all at once. Instead, an iterator loads elements one after another. In Scala you access these elements with the hasNext and next operations. For example,

scala> sc.parallelize(1 to 9, 3).map(x=>(x, "Hello")).collect
res3: Array[(Int, String)] = Array((1,Hello), (2,Hello), (3,Hello), (4,Hello), (5,Hello), (6,Hello), (7,Hello), (8,Hello), (9,Hello))

scala> sc.parallelize(1 to 9, 3).mapPartitions(x=>(Array("Hello").iterator)).collect
res7: Array[String] = Array(Hello, Hello, Hello)

scala> sc.parallelize(1 to 9, 3).mapPartitions(x=>(List(x.next).iterator)).collect
res11: Array[Int] = Array(1, 4, 7)

In the first example, I applied the map() transformation to a dataset distributed across 3 partitions, so you can see the function is called 9 times. In the second example, when we applied mapPartitions(), you will notice it ran 3 times, i.e. once for each partition. We had to convert the string "Hello" into an iterator because mapPartitions() takes an iterator. In the third step, I fetched a value from the iterator to showcase the dataset. Note that next always moves forward, so you can't step back. See this,

scala> sc.parallelize(1 to 9, 3).mapPartitions(x=>(List(x.next,x.next, "|").iterator)).collect
res18: Array[Any] = Array(1, 2, |, 4, 5, |, 7, 8, |)

On the first call, the next value for partition 1 changed from 1 => 2, for partition 2 it changed from 4 => 5, and similarly for partition 3 it changed from 7 => 8. You can keep advancing until hasNext is false (hasNext is a property of the iterator which tells you whether the collection has ended or not; it returns true or false based on the items left in the collection). For example,

scala> sc.parallelize(1 to 9, 3).mapPartitions(x=>(List(x.next, x.hasNext).iterator)).collect
res19: Array[AnyVal] = Array(1, true, 4, true, 7, true)

You can see hasNext is true because there are elements left in each partition. Now suppose we access all three elements from each partition; then hasNext will return false. For example,

scala> sc.parallelize(1 to 9, 3).mapPartitions(x=>(List(x.next, x.next, x.next, x.hasNext).iterator)).collect
res20: Array[AnyVal] = Array(1, 2, 3, false, 4, 5, 6, false, 7, 8, 9, false)

Just for our understanding, if you try to access next a fourth time, you will get an error, which is expected -

scala> sc.parallelize(1 to 9, 3).mapPartitions(x=>(List(x.next, x.next, x.next, x.next, x.hasNext).iterator)).collect
19/07/31 11:14:42 ERROR Executor: Exception in task 1.0 in stage 18.0 (TID 56)
java.util.NoSuchElementException: next on empty iterator

Think of the map() transformation as a special case of mapPartitions() where you have just 1 element in each partition. Isn't it?

mapPartitionsWithIndex(func)
Similar to mapPartitions, but the good part is that you have the index to see the partition position.
For example,

scala> val mp = sc.parallelize(List("One","Two","Three","Four","Five","Six","Seven","Eight","Nine"), 3)
mp: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[38] at parallelize at :24

scala> mp.collect
res23: Array[String] = Array(One, Two, Three, Four, Five, Six, Seven, Eight, Nine)

scala> mp.mapPartitionsWithIndex((index, iterator) => {iterator.toList.map(x => x + "=>" + index ).iterator} ).collect
res26: Array[String] = Array(One=>0, Two=>0, Three=>0, Four=>1, Five=>1, Six=>1, Seven=>2, Eight=>2, Nine=>2)

Index 0 (the first partition) has three values as expected, and similarly for the other 2 partitions. If you have any question please mention it in the comments section at the end of this blog.

sample()
Generates a sampled RDD from an input RDD. Note that the second argument, fraction, doesn't represent the fraction of the actual RDD; it gives the probability of each element in the dataset getting selected for the sample. The seed is optional. The first boolean argument decides the type of sampling algorithm (with or without replacement). For example,

scala> sc.parallelize(1 to 10).sample(true, .4).collect
res103: Array[Int] = Array(4)

scala> sc.parallelize(1 to 10).sample(true, .4).collect
res104: Array[Int] = Array(1, 4, 6, 6, 6, 9)

// Here you can see fraction 0.2 doesn't represent the fraction of the rdd; 4 elements were selected out of 10.
scala> sc.parallelize(1 to 10).sample(true, .2).collect
res109: Array[Int] = Array(2, 4, 7, 10)

// fraction set to 1, which is the max value (probability 0 to 1), so each element got selected.
scala> sc.parallelize(1 to 10).sample(false, 1).collect
res111: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

union()
Similar to SQL union, except that it keeps duplicate data.

scala> val rdd1 = sc.parallelize(List("apple","orange","grapes","mango","orange"))
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[159] at parallelize at :24

scala> val rdd2 = sc.parallelize(List("red","green","yellow"))
rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[160] at parallelize at :24

scala> rdd1.union(rdd2).collect
res116: Array[String] = Array(apple, orange, grapes, mango, orange, red, green, yellow)

scala> rdd2.union(rdd1).collect
res117: Array[String] = Array(red, green, yellow, apple, orange, grapes, mango, orange)

intersection()
Returns the intersection of two datasets. For example,

scala> val rdd1 = sc.parallelize(-5 to 5)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[171] at parallelize at :24

scala> val rdd2 = sc.parallelize(1 to 10)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[172] at parallelize at :24

scala> rdd1.intersection(rdd2).collect
res119: Array[Int] = Array(4, 1, 5, 2, 3)

distinct()
Returns a new dataset with distinct elements. For example, we don't have the duplicate orange now.

scala> val rdd = sc.parallelize(List("apple","orange","grapes","mango","orange"))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[186] at parallelize at :24

scala> rdd.distinct.collect
res121: Array[String] = Array(grapes, orange, apple, mango)

That's all guys, please refer to the next post for the next set of transformations in Spark. Next: Spark Transformations (Part 2)
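Before moving on, here is a small wrap-up sketch that chains several of the transformations covered above. It is only an illustration and assumes the same mountEverest RDD created earlier in this post is still defined in your spark-shell:

// a minimal sketch, assuming the mountEverest RDD defined earlier in this post
val distinctWords = mountEverest
  .flatMap(line => line.split(" "))   // split each line into words (0..n outputs per line)
  .map(word => word.toLowerCase)      // normalize case
  .filter(word => word.nonEmpty)      // drop empty tokens
  .distinct()                         // remove duplicates

distinctWords.count()                 // number of unique words in the file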

  • Kibana dashboard example

In this Kibana dashboard tutorial we will learn how to create a Kibana dashboard. Before we start with the sample Kibana dashboard example, I hope you have some sample data loaded into ELK Elasticsearch. If not, please refer to my previous blog - How to load sample data into ELK Elasticsearch. As discussed in my previous blog, I am using the sample Squid access logs (comma separated CSV file). You can find the file format details at this link.

Navigation Menu: Introduction to ELK Stack | Installation | Loading data into Elasticsearch with Logstash | Create Kibana Dashboard Example | Kibana GeoIP Dashboard Example

For our understanding, we will create two basic Kibana visualizations:

Top 10 requested URLs (type: pie chart): it will show which top 10 URLs are getting hits.
Number of events per hour (type: bar chart): it will show the various types of events which occurred each hour.

Use Case 1: Top 10 Requested URLs (Pie chart)

Open the Kibana UI on your machine and go to the Visualize tab => Create a visualization. Select the type of visualization; for our first use case, select pie chart. Select the index squidal which we created earlier. Now go to Options and uncheck the Donut check box, as we need a simple pie chart. Check Show Labels, or you can leave it blank if you don't want labels; it's up to you. Next, go to the Data tab where you will find Metrics (or, in reporting terminology, the Measure); by default it will show Count. Click on the blue button right behind Split Slices in order to choose the Aggregation type. Let's choose the Terms aggregation for our use case (in simple words, think of Terms like group by in SQL). Refer this for more details. Further, choose the Field => Requested_URL.keyword, which will act as the dimension for us. Hit the blue arrow button next to Options in order to generate the chart. You can also give this chart a custom label as shown below. Save the chart => hit the Save button on the top right corner of your dashboard. You can name the visualization Top 10 Requested URL. Now go to the Dashboard tab => Create a dashboard => Add => select Top 10 Requested URL. Hit the Save button on top of your dashboard and give it a meaningful name, for instance Squid Access Log Dashboard.

Use Case 2: Number of events per hour (Bar chart)

Go to the Visualize tab again (top left corner of your screen) and click on the "+" sign. Choose the chart type as vertical bar and select the squidal index. In this case, instead of choosing the aggregation type as Terms, we will be using an X-Axis bucket with a Date Histogram and Interval set to Hourly, as shown below. Hit the Save button and give it an appropriate name, for instance Events per Hour. Now go back to the Dashboard tab => Squid Access Log Dashboard => Edit => Add and select Events per Hour to add it to your dashboard. Hit Save again. At this point your dashboard should look like this:

You can add as many visualizations as you want depending upon your business requirement. Your opinion matters a lot; please comment if you have any questions for me. Thank you!

Next: Kibana GeoIP Dashboard Example

#ELKStack #KibanaDashboardExample #KibanaDashboardTutorial #CreateKibanaDashboard

  • How to make calls to Twitter APIs using Postman client?

In this blog, I am going to invoke Twitter custom APIs with the Postman client in order to pull live feeds, or you can say tweets, from Twitter. The output will be JSON text which you can format or change based on your requirement. Soon I will be writing another blog to demonstrate how you can ingest this data in real time with Kafka and process it using Spark. Or, you can directly stream and process the data in real time with Spark Streaming. As of now, let's try to connect to the Twitter API using Postman.

Prerequisites: Postman client, Twitter developer account.

Postman Client Installation

There are basically two ways to install Postman: either you can download the Postman extension for your browser (Chrome in my case) or you can simply install the native Postman application. I have installed the Postman application to write this blog.

Step 1. Google "Install Postman" and go to the Postman official site to download the application.
Step 2. After opening the Postman download link, select your operating system to start the Postman download. It's available for all types of platforms - Mac, Linux and Windows. The download link keeps changing, so if the download link doesn't work just Google it as shown above.
Step 3. Once the installer is downloaded, run the installer to complete the installation process. It's approximately a 250 MB application (for Mac).
Step 4. Sign up. After signing in, you can save your preferences or do it later as shown below.
Step 5. Your workspace will look like below.

Twitter Developer Account

I hope you all have a Twitter developer account; if not, please create one. Then, go to Developer Twitter and sign in with your Twitter account. Click on Apps > Create an app at the top right corner of your screen. Note: Earlier, developer.twitter.com was known as apps.twitter.com. Fill out the form to create an application > specify Name, Description and Website details as shown below. This screen has changed slightly with the new Twitter developer interface, but the overall process is still similar. If you have any question, please feel free to ask in the comment section at the end of this post. Please provide a proper website name like https://example.com, otherwise you will get an error while creating the application. A sample has been shown above. Once you successfully create the app, you will get the below page. Make sure the access level is set to Read and Write as shown above. Now go to the Keys and Access Token tab > click on Create Access Token. At this point, you will be able to see 4 keys which will be used in the Postman client: Consumer Key (API Key), Consumer Secret (API Secret), Access Token, Access Token Secret. The new interface looks like this.

Calling Twitter API with Postman Client

Open the Postman application and click on the Authorization tab. Select the authorization type as OAuth 1.0. Add authorization data to Request Headers - this is a very important step, else you will get an error. After setting up the authorization type and request header, fill out the form carefully with the 4 keys (just copy-paste) which we generated in the Twitter app - Consumer Key (API Key), Consumer Secret (API Secret), Access Token and Access Token Secret. Execute it! Now let's search for Twitter statuses which say snap. Copy-paste the request URL https://api.twitter.com/1.1/statuses/user_timeline.json?screen_name=snap as shown below. You can refer to the API reference index in order to access the various Twitter custom APIs. GET some tweets, hit the Send button. You will get a response as shown below.
GET Examples

Twitter has very nice API documentation on accounts, users, tweets, media, trends, messages, geo, ads etc., and there is a huge variety of data which you can pull. I am invoking a few APIs just for demonstration purposes.

Accounts and users. Let's say you want to search for the user name "Elon". You can do it like this,

GET https://api.twitter.com/1.1/users/search.json?q=elon

Now suppose you want to get the friend list of Elon Musk; you can do it like this,

GET https://api.twitter.com/1.1/friends/list.json?user_id=44196397

The input user_id is the same as the id in the previous output. You can also change the display => pretty, raw and preview.

Trending Topics. You can pull the top 50 trending global topics with id = 1, for example,

GET https://api.twitter.com/1.1/trends/place.json?id=1

POST Examples

You can also POST something, just like you tweet in your Twitter web account. For example, if you want to tweet Hello you can do it like this,

POST https://api.twitter.com/1.1/statuses/update.json?status=Hello

You can verify the same with your Twitter account - yeah, that's me! I rarely use Twitter.

Cursoring

Cursoring is used for pagination when you have a large result set. Let's say you want to pull all statuses which say "Elon"; it's obvious that there will be a good number of tweets and that the response can't fit on one page. To navigate through each page, cursoring is needed. For example, let's say you want to pull 5 results per page; you can do it like this,

GET https://api.twitter.com/1.1/search/tweets.json?q=Elon&count=5

Now, to navigate to the next 5 records you have to use next_results shown in the search_metadata section above, like this,

GET https://api.twitter.com/1.1/search/tweets.json?max_id=1160404261450244095&q=Elon&count=5&include_entities=1

To get the next set of results, again use next_results from the search_metadata of this result set, and so on. Now, obviously you can't do this manually each time. You need to write a loop to get the result set programmatically, for example,

cursor = -1
api_path = "https://api.twitter.com/1.1/endpoint.json?screen_name=targetUser"
do {
    url_with_cursor = api_path + "&cursor=" + cursor
    response_dictionary = perform_http_get_request_for_url( url_with_cursor )
    cursor = response_dictionary[ 'next_cursor' ]
} while ( cursor != 0 )

In our case next_results works like next_cursor - a pointer to the next page. This might be different for different endpoints like tweets, users, accounts, ads etc., but the logic to loop through each result set will be the same. Refer this for complete details. That's it, you have successfully pulled data from Twitter.

#TwitterAPI #Postmaninstallation #Oauth #API #CustomerKey #CustomerSecret #accesstoken #Postman

Next: Analyze Twitter Tweets using Apache Spark
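For reference, here is a minimal Scala sketch of the cursoring loop shown above. The performHttpGet and parseNextCursor helpers are hypothetical placeholders for an OAuth-signed HTTP call and a JSON parser of your choice (they are not part of Postman or the Twitter tooling); the sketch only illustrates following next_cursor until it returns 0.

// a minimal sketch, not a working Twitter client:
// performHttpGet and parseNextCursor are hypothetical helpers
def performHttpGet(url: String): String = ???   // issue the signed GET request, return the JSON body
def parseNextCursor(json: String): Long = ???   // read the "next_cursor" field from the JSON body

val apiPath = "https://api.twitter.com/1.1/endpoint.json?screen_name=targetUser"
var cursor: Long = -1L
do {
  val urlWithCursor = s"$apiPath&cursor=$cursor"
  val responseJson = performHttpGet(urlWithCursor)  // one page of results
  cursor = parseNextCursor(responseJson)            // 0 means there are no more pages
} while (cursor != 0)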

  • $74 Billion in Lost Bitcoin - Modern Day Buried Treasure

In 2018, the digital forensics firm Chainalysis estimated that around 35% of all bitcoin is likely lost. This is approximately 7.4 million of the 21 million bitcoin that will ever exist. This month (August 2019), bitcoin has been averaging a market value of $10K per coin; extrapolating, that is over $74 billion worth of bitcoin.

Losing Bitcoin

To be more exact, the bitcoin aren't lost but frozen. Bitcoin are inaccessible to their owners without a private key, a 256-bit number, usually saved in a wallet file. It is these keys, stored on hard drives or flash drives, that people misplace, lose, throw away or erase. One of the most infamous stories involves James Howell, a British man who owned upwards of 7,500 bitcoins. Howell accidentally threw out the hard drive containing his wallet file, thus losing all access to his bitcoin. At the time he lost it, the digital currency was valued at $7.5 million. The fortune is currently buried at his local landfill in Newport, South Wales. And, despite his desire to search, his local city council officials won't allow him to due to safety concerns. These 7,500 bitcoins are now estimated to be worth $75 million. Similarly, Campbell Simpson threw away a hard drive containing 1,400 bitcoin, currently valued at $14 million. Even Elon Musk seems to have lost his bitcoin. To be clear, we aren't discussing how people lose bitcoin through careless or frivolous spending or outright theft. Those bitcoin have just been transacted and are under different ownership, but they are still "active" coins. The bitcoin we're discussing have become forsaken, abandoned and inaccessible, buried forever in an online vault. Forgotten PINs, key holders who have passed away, overwritten USB sticks; there are countless ways people have lost their bitcoin, but what they all have in common is lost passwords and keys.

Bitcoin Protection

Your bitcoin is protected with a 256-bit key which has 2^256, or 115,792,089,237,316,195,423,570,985,008,687,907,853,269,984,665,640,564,039,457,584,007,913,129,639,936, possible combinations. It's been estimated that even the fastest supercomputer in the world (Tianhe-2) would take millions of years to crack 256-bit encryption.

Bitcoin Recovery

The US Treasury Department replaces and redeems an estimated $30 million in mutilated, damaged, and burned currency every year. However, there is no one who can really help you if you lose your bitcoin password, since there's no way to reset it. Besides dumpster diving, some desperate investors have even resorted to hypnosis in an attempt to conjure up their keys. However, one company offers a glimmer of hope to those who lost their keys. Wallet Recovery Services, managed by Dave Bitcoin (an alias), claims a 30% success rate in hacking passwords by brute force decryption and charges clients 20% of the amount in the wallet. Good luck to all you investors and keep your wallets safe~

  • What is the most efficient way to find prime factors of a number (python)?

One of the best methods is to use the Sieve of Eratosthenes. Create a list of prime flags with their indices representing the corresponding numbers: 0 → False, 1 → False, 2 → True and so on. Then flag every multiple of an index as False if that index is prime. The Python program is shown below,

def primes(n):
    flag = [True] * n
    flag[0] = flag[1] = False
    for (index, prime) in enumerate(flag):
        if prime:
            yield index
            for i in range(index*index, n, index):
                flag[i] = False

Print the last item from the generator,

p = None
for p in primes(1000000):
    pass
print(p)

>>> 999983

Regular way

To make the program efficient you need to: cross out 1 as it's not prime; cross out all of the multiples of 2 and 3; and most importantly, for a number N, only test prime factors up to √N. If a number N has a prime factor larger than √N, then it surely has a prime factor smaller than √N. So it's sufficient to search for prime factors in the range [1,√N]. If no prime factors exist in the range [1,√N], then N itself is prime and there is no need to continue searching beyond that range. You can do it like this,

def is_prime(num):
    if num<2:
        return False
    else:
        return all(num%i for i in range(2, int(num ** 0.5)+1))

Checking if 1471 is prime,

is_prime(1471)
True

Let's say you want to check all the prime numbers up to 2 million,

prime_list = [i for i in range(2000000) if is_prime(i)]

I have seen similar types of problems in Project Euler, like the sum of primes or the total number of primes below (a very large number). If the number is small you can simply divide by all the numbers up to (n-1) and check if it's prime or not, but that program (shown below) performs very badly if the number is big.

def is_prime(num):
    if num < 2:
        return False
    for i in range(2, num):
        if num%i==0:
            return False
    else:
        return True

For example:

is_prime(47)
True

>>> %timeit ("is_prime(47)")
10.9 ns ± 0.0779 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)

The above program performs very badly if the number is big. A better solution is, instead of dividing by all the numbers up to (n-1), to check divisibility only up to sqrt(num).

def is_prime(x):
    '''Checking if number is prime, run until sqrt(number)'''
    sqrt = round(x ** 0.5)
    if x < 2:
        return False
    else:
        for i in range(2, sqrt + 1):
            if x%i == 0:
                return False
        else:
            return True

OR,

def is_prime(num):
    if num<2:
        return False
    else:
        return all(num%i for i in range(2, int(num ** 0.5)+1))

That's it, now use a list comprehension to get prime_list and print whatever you need.

>>> prime_list=[i for i in range(3000000) if is_prime(i)]
>>> print(f"Number of primes below 3 million: {len(prime_list)}")
>>> print(f"Sum of primes below 3 million: {sum(prime_list)}")
>>> print(prime_list)

Number of primes below 3 million: 216816
Sum of primes below 3 million: 312471072265
[2, 3, 5, 7, 11, 13, 17, 19....... ]

  • Frequently asked Informatica interview questions and answers 2019

What is Informatica and Enterprise Data Warehouse?

Informatica is an ETL tool which is used to extract, transform and load data. It plays a crucial role in building an Enterprise Data Warehouse. An EDW provides users a global view of historical data stored from various departments in an organization like finance, sales, labor, marketing etc., based on which critical business decisions can be made.

What is repository service, integration service, reporting service, domain and node?

There are basically three types of services which you will find under application services:

Repository service - It is responsible for storing Informatica metadata (source definitions, target definitions, mappings, mapplets and transformations in the form of tables) and providing access to these repositories for other services. It is also responsible for maintaining the connection between the client development tool and the Informatica repository, which keeps metadata up-to-date by updating, inserting and deleting the metadata.

Integration service - It is responsible for the execution of workflows and end-to-end data movement from source to target. It pulls the workflow information from the repository, starts the execution of tasks, gathers/combines data from different sources (for example a relational database and a flat file) and loads it into single or multiple targets.

Reporting service - It enables report generation.

Domain - It is the administrative part of Informatica which is used by admins to operate nodes and services. You can define multiple nodes in a domain with a gateway node which is responsible for receiving requests and distributing them to worker nodes. Further, nodes are responsible for running services and other Informatica processes. There are basically two types of services in a domain - the service manager and application services. The service manager is responsible for logging and login operations, like authentication, authorization, and managing people and groups. Application services are responsible for managing integration services, repository services and reporting services. Key properties of a domain are as follows:

Database properties - You can define the database instance name and port responsible for holding the domain. It consists of database type (like Oracle, SQL Server etc.), host, port, database name and user name.
General properties - You can define resilience timeout, restart period, restart attempts, dispatch mode etc. For example, if a service goes down, how many seconds application services wait before reconnecting to the respective service depends upon the resilience timeout; how long Informatica keeps trying to restart those services depends upon the restart period and attempts. How tasks are distributed from the gateway to worker nodes depends upon the dispatch mode, like round robin.

What is PowerCenter repository?

It is like a relational database which stores Informatica metadata in the form of tables (the underlying database could be an Oracle database, SQL Server or similar) and it is managed by the repository service.

What is Informatica client tool?

The Informatica client tool is basically a developer tool installed on the client machine and it consists of four parts:

Designer (source, target, mapping, mapplet and transformations designer)
Workflow manager (task, worklet and workflow designer)
Monitor
Repository manager

Basic terminology consists of:

Workflow, Worklet, Sessions & Tasks: A workflow consists of one or more sessions, worklets and tasks (including timer, decision, command, event wait, mail, link, assignment, control etc.) connected in parallel or in sequence.
You can run these sessions using the session manager or the pmcmd command. Further, you can write pre- and post-session commands.

Mappings, Mapplets & Transformations: Mappings are collections of sources, targets and transformations. Mapplets (designed in the mapplet designer) are like re-usable mappings which contain various transformations but no source/target. You can run a mapping in debug mode without creating a session. There are mapping parameters and variables as well. Mapping parameters represent constant values that are defined before running a session, while mapping variables can change values during sessions.

What could be the various states of an object in Informatica?

Valid - a fully syntactically correct object.
Invalid - where syntax and properties are invalid according to Informatica standards; Informatica marks those objects invalid. It could be a mapping, mapplet, task, session, workflow or any transformation.
Impacted - where the underlying object is invalid. For instance, in a mapping, suppose an underlying transformation has become invalid due to some change.

What happens when a user executes the workflow?

The user executes the workflow.
Informatica invokes the integration service to pull workflow details from the repository.
The integration service starts execution of the workflow after gathering workflow metadata.
The integration service runs all child tasks.
It reads and combines data from the sources and loads it into the target.
After execution, it updates the status of the task as succeeded, failed, unknown or aborted.
Workflow and session logs are generated.

What is the difference between ABORT and STOP?

Abort will kill the process after 60 seconds even if data processing hasn't finished. The session will be forcefully terminated and the DTM (Data Transformation Manager) process will be killed. Stop will end the process once DTM processing has finished, although it stops reading data from the source as soon as the stop command is received.

What are the various types of transformation available in Informatica?

There are basically two categories of transformation in Informatica.

Active - It can change the number of rows that pass through the transformation, for example the filter and sorter transformations. It can also change the row type, for example the update strategy transformation which can mark a row for update, insert, reject or delete. Further, it can also change the transaction boundary, for example the transaction control transformation which can allow commit and rollback for each row based on certain expression evaluation.

Passive - Maintains the same number of rows, with no change in row type or transaction boundary. The number of output rows will always be the same as the number of input rows.

Active transformations include:

Source Qualifier: Connected to a relational source or flat file, converts source data types to Informatica native data types. Performs joins, filter, sort, distinct, and you can write custom SQL. You can have multiple source qualifiers in a single session with multiple targets; in this case you can decide the target load order as well.
Joiner: It can join two heterogeneous sources, unlike the source qualifier which needs a common source. It performs normal join, master outer join, detail outer join and full outer join.
Filter: It has a single condition - drops records based on filter criteria like a SQL where clause; single input and single output.
Router: It has input, output and default groups - acts like a filter but you can apply multiple conditions, one per group, like a SQL case statement; single input, multiple outputs; easier and more efficient to work with than filter.
Aggregator: It performs calculations such as sums, averages etc. Unlike the expression transformation, it can do calculations in groups. It also provides extra cache files to store transformation values if required.
Rank
Sorter: Sorts the data. It has a distinct option which can filter duplicate rows.
Lookup: It has input, output, lookup and return ports. Explained in the next question.
Union: It works like SQL union all.
Stored Procedure
Update Strategy: Treats source rows as "data driven" - DD_UPDATE, DD_DELETE, DD_INSERT and DD_REJECT.
Normalizer: Takes multiple columns and returns a few based on normalization.

Passive transformations include:

Expression: It is used to calculate within a single row before writing to the target, basically non-aggregate calculations.
Sequence Generator: For generating primary keys and surrogate keys (NEXTVAL & CURRVAL).
Structure Parser

Explain Lookup transformation in detail?

The Lookup transformation is used to look up a flat file, relational database table or view. It has basically four ports - input (I), output (O), lookup (L) and return (R). Lookup transformations can be connected or unconnected and can act as an active/passive transformation.

Connected lookup: It can return multiple output values. It can have both dynamic and static cache. It can return more than one column value in the output port. It caches all lookup columns.
Unconnected lookup: It can take multiple input parameters like column1, column2, column3... for the lookup, but the output will be just one value. It has a static cache. It can return only one column value in the output port. It caches the lookup condition and lookup output port in the return port.

A lookup can be cached or uncached. A cached lookup can be static or dynamic. A dynamic cache can change during the execution of the process, for example if the lookup data itself changes during the transaction (NewLookupRow). Further, these caches can be made persistent or non-persistent, i.e. you tell Informatica to keep the lookup cache data or delete it after completion of the session. Here are the types of cache:

Static - static in nature.
Dynamic - dynamic in nature.
Persistent - keep or delete the lookup cache.
Re-cache - makes sure the cache is refreshed if the underlying table data changes.
Shared cache - can be used by other mappings.

What are the various types of files created during an Informatica session run?

Session log file - depending upon the tracing level (none, normal, terse, verbose, verbose data) it records SQL commands, reader-writer threads, errors, load summary etc. Note: You can change the number of session log files saved (default is zero) for historic runs.
Workflow log file - includes load statistics like number of rows for source/target, table names etc.
Informatica server log - created in the home directory on the Unix box with all status and error messages.
Cache file - index or data cache files; for example aggregate cache, lookup cache etc.
Output file - created if the session writes an output data file.
Bad/Reject file - contains rejected rows which were not written to the target.

Tracing levels:

None: Applicable only at session level. The Integration Service uses the tracing levels configured in the mapping.
Terse: Logs initialization information, error messages, and notification of rejected data in the session log file.
Normal: The Integration Service logs initialization and status information, errors encountered and rows skipped due to transformation row errors. Summarizes session results, but not at the level of individual rows.
Verbose Initialization: In addition to normal tracing, the Integration Service logs additional initialization details, names of index and data files used, and detailed transformation statistics.
Verbose Data: In addition to verbose initialization tracing, the Integration Service logs each row that passes into the mapping. It also notes where the Integration Service truncates string data to fit the precision of a column and provides detailed transformation statistics. When you configure the tracing level to verbose data, the Integration Service writes row data for all rows in a block when it processes a transformation.

What is SQL override in Informatica?

SQL override can be implemented at the Source Qualifier, Lookups and Target. It allows you to override the existing SQL in the mentioned transformations to:

Limit the incoming rows
Escape unnecessary sorting of data (order by multiple columns) to improve performance
Use parameters and variables
Add a WHERE clause

What is parallel processing in Informatica?

You can implement parallel processing in Informatica by partitioning sessions. Partitioning allows you to split a large amount of data into smaller sets which can be processed in parallel. It uses the hardware to its maximum efficiency. These are the types of partitioning in Informatica:

Database partitioning: The integration service checks if the source table has partitions for parallel read.
Round-robin partitioning: The integration service evenly distributes the rows into partitions for parallel processing. Performs well when you don't want to group data.
Hash auto-keys partitioning: Distributes data among partitions based on a hash function. It uses all sorted ports to create the partition key. Performs well before sorter, rank and unsorted aggregator.
Hash user-keys partitioning: Distributes data among partitions based on a hash function, but here the user can choose the port which will act as the partition key.
Key range partitioning: The user can choose multiple ports which will act as a compound partition key.
Pass-through partitioning: Rows are passed to the next partition without redistribution, unlike what we saw in round robin.

What can be done to improve performance in Informatica?

Find the bottlenecks by running different tracing levels (verbose data).
Tune the transformations like lookups (caching), source qualifiers, aggregator (use sorted input), filtering unnecessary data, degree of parallelism in the workflow etc.
Drop the index before data load and re-index (for example using a command task at session level) after the data load is complete.
Change session properties like commit interval, buffer block size, session level partitioning (explained above), reduced logging details, DTM buffer cache size and auto-memory attributes.
OS level tuning like disk I/O and CPU consumption.

  • Spark Transformation example in Scala (Part 2)

Main menu: Spark Scala Tutorial

In this post I will walk you through the groupByKey, reduceByKey, aggregateByKey, sortByKey, join, cartesian, coalesce and repartition Spark transformations. In the previous blog we covered map, flatMap, mapPartitions, mapPartitionsWithIndex, filter, distinct, union, intersection and sample. I would encourage you all to go through these posts first: Spark Transformation Examples in Scala (Part 1), Spark RDD, Transformations and Actions example.

groupByKey()
As the name says, it groups a (K, V) key-value pair dataset based on key and stores the values as an Iterable, (K, V) => (K, Iterable(V)). It's a very expensive operation and consumes a lot of memory if the dataset is huge. For example,

scala> val rdd = sc.parallelize(List("Hello Hello Spark Apache Hello Dataneb Dataneb Dataneb Spark"))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[2] at parallelize at :24

scala> rdd.collect
res3: Array[String] = Array(Hello Hello Spark Apache Hello Dataneb Dataneb Dataneb Spark)

// Splitting the array and creating (K, V) pairs
scala> val keyValue = rdd.flatMap(words => words.split(" ")).map(x=>(x,1))
keyValue: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[16] at map at :25

// Each "1" in the Iterable[Int] marks one occurrence of the key
scala> keyValue.groupByKey.collect
res12: Array[(String, Iterable[Int])] = Array((Spark,CompactBuffer(1, 1)), (Dataneb,CompactBuffer(1, 1, 1)), (Hello,CompactBuffer(1, 1, 1)), (Apache,CompactBuffer(1)))

reduceByKey()
Operates on a (K, V) pair dataset, but the reduce func must be of type (V, V) => V. For example, if you want to reduce all the values to get the total number of occurrences,

scala> rdd
.flatMap(words => words.split(" "))
.map(x=>(x,1))
.reduceByKey((x, y)=>x+y)
.collect
res14: Array[(String, Int)] = Array((Spark,2), (Dataneb,3), (Hello,3), (Apache,1))

aggregateByKey()
It's similar to reduceByKey(); I hardly use this transformation because in cases like this you can achieve the same result with the previous one (see the short sketch at the end of this post for a case where it genuinely differs). For example,

scala> rdd
.flatMap(words => words.split(" "))
.map(x=>(x,1))
.aggregateByKey(0)((x, y)=> x+y, (k, v)=> k+v )
.collect
res24: Array[(String, Int)] = Array((Spark,2), (Dataneb,3), (Hello,3), (Apache,1))

sortByKey()
Called on a key-value pair dataset, it returns the dataset sorted by keys. For example,

scala> rdd
.flatMap(words => words.split(" "))
.distinct
.map(x => (x,1))
.sortByKey()      // ascending by default
.collect
res36: Array[(String, Int)] = Array((Apache,1), (Dataneb,1), (Hello,1), (Spark,1))

scala> rdd
.flatMap(words => words.split(" "))
.distinct
.map(x => (x,1))
.sortByKey(false) // descending order (false)
.collect
res37: Array[(String, Int)] = Array((Spark,1), (Hello,1), (Dataneb,1), (Apache,1))

join()
It takes key-value pair datasets and works just like SQL joins. For keys with no match, the value will be None.
For example,

scala> val rdd1 = sc.parallelize(List("Apple","Orange", "Banana", "Grapes", "Strawberry", "Papaya")).map(words => (words,1))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[96] at map at :24

scala> val rdd2 = sc.parallelize(List("Apple", "Grapes", "Peach", "Fruits")).map(words => (words,1))
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[98] at map at :24

scala> rdd1.join(rdd2).collect
res40: Array[(String, (Int, Int))] = Array((Grapes,(1,1)), (Apple,(1,1)))

scala> rdd1.rightOuterJoin(rdd2).collect
res41: Array[(String, (Option[Int], Int))] = Array((Grapes,(Some(1),1)), (Peach,(None,1)), (Apple,(Some(1),1)), (Fruits,(None,1)))

scala> rdd1.leftOuterJoin(rdd2).collect
res43: Array[(String, (Int, Option[Int]))] = Array((Grapes,(1,Some(1))), (Banana,(1,None)), (Papaya,(1,None)), (Orange,(1,None)), (Apple,(1,Some(1))), (Strawberry,(1,None)))

scala> rdd1.fullOuterJoin(rdd2).collect
res44: Array[(String, (Option[Int], Option[Int]))] = Array((Grapes,(Some(1),Some(1))), (Peach,(None,Some(1))), (Banana,(Some(1),None)), (Papaya,(Some(1),None)), (Orange,(Some(1),None)), (Apple,(Some(1),Some(1))), (Fruits,(None,Some(1))), (Strawberry,(Some(1),None)))

cartesian()
Just like a cartesian product, it returns all possible pairs of elements from the two datasets.

scala> val rdd1 = sc.parallelize(List("Apple","Orange", "Banana", "Grapes", "Strawberry", "Papaya"))
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[111] at parallelize at :24

scala> val rdd2 = sc.parallelize(List("Apple", "Grapes", "Peach", "Fruits"))
rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[112] at parallelize at :24

scala> rdd1.cartesian(rdd2).collect
res46: Array[(String, String)] = Array((Apple,Apple), (Apple,Grapes), (Apple,Peach), (Apple,Fruits), (Orange,Apple), (Banana,Apple), (Orange,Grapes), (Banana,Grapes), (Orange,Peach), (Banana,Peach), (Orange,Fruits), (Banana,Fruits), (Grapes,Apple), (Grapes,Grapes), (Grapes,Peach), (Grapes,Fruits), (Strawberry,Apple), (Papaya,Apple), (Strawberry,Grapes), (Papaya,Grapes), (Strawberry,Peach), (Papaya,Peach), (Strawberry,Fruits), (Papaya,Fruits))

coalesce()
coalesce and repartition are both used to change the number of partitions, but repartition is the costlier operation because it re-shuffles all the data and creates new partitions, while coalesce merges existing partitions to reduce their number and avoids a full shuffle. For example,

scala> val distData = sc.parallelize(1 to 16, 4)
distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[128] at parallelize at :24

// current partition size
scala> distData.partitions.size
res63: Int = 4

// checking data across each partition
scala> distData.mapPartitionsWithIndex((index, iter) => if (index == 0) iter else Iterator()).collect
res64: Array[Int] = Array(1, 2, 3, 4)

scala> distData.mapPartitionsWithIndex((index, iter) => if (index == 1) iter else Iterator()).collect
res65: Array[Int] = Array(5, 6, 7, 8)

scala> distData.mapPartitionsWithIndex((index, iter) => if (index == 2) iter else Iterator()).collect
res66: Array[Int] = Array(9, 10, 11, 12)

scala> distData.mapPartitionsWithIndex((index, iter) => if (index == 3) iter else Iterator()).collect
res67: Array[Int] = Array(13, 14, 15, 16)

// decreasing partitions to 2
scala> val coalData = distData.coalesce(2)
coalData: org.apache.spark.rdd.RDD[Int] = CoalescedRDD[133] at coalesce at :25

// see how the data moved: instead of shuffling everything, it just merged partitions.
scala> coalData.mapPartitionsWithIndex((index, iter) => if (index == 0) iter else Iterator()).collect
res68: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8)

scala> coalData.mapPartitionsWithIndex((index, iter) => if (index == 1) iter else Iterator()).collect
res69: Array[Int] = Array(9, 10, 11, 12, 13, 14, 15, 16)

repartition()
Notice how it re-shuffled everything to create new partitions, compared to the previous RDDs distData and coalData. Hence repartition is a costlier operation than coalesce.

scala> val repartData = distData.repartition(2)
repartData: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[139] at repartition at :25

// checking data across each partition
scala> repartData.mapPartitionsWithIndex((index, iter) => if (index == 0) iter else Iterator()).collect
res70: Array[Int] = Array(1, 3, 6, 8, 9, 11, 13, 15)

scala> repartData.mapPartitionsWithIndex((index, iter) => if (index == 1) iter else Iterator()).collect
res71: Array[Int] = Array(2, 4, 5, 7, 10, 12, 14, 16)

That's all folks. If you have any question please mention it in the comments section below. Thank you.

Next: Spark RDD, Transformation and Actions
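Coming back to aggregateByKey() from earlier in this post - it becomes genuinely useful when the accumulated result has a different type than the values, which reduceByKey() cannot express directly. Here is a minimal sketch (assuming the same spark-shell session; the pairs RDD below is new and introduced just for this example) that computes the average value per key:

// a minimal sketch: the accumulator is a (sum, count) pair, a different type than the Int values
val pairs = sc.parallelize(Seq(("a", 1), ("a", 4), ("b", 2), ("b", 6), ("b", 7)))

val sumCount = pairs.aggregateByKey((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),    // fold one value into the per-partition accumulator
  (a, b)   => (a._1 + b._1, a._2 + b._2)   // merge accumulators from different partitions
)

sumCount.mapValues { case (sum, count) => sum.toDouble / count }.collect()
// Array((a,2.5), (b,5.0)) - ordering of keys may vary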

  • How to Install Apache HTTP Server on CentOS

yum install httpd

Apache is the most popular HTTP server and runs on Linux and Windows based operating systems. Let's see how to install and configure the Apache HTTP web server on CentOS.

1. First, update the yum packages

sudo yum -y update

2. Next, install the Apache HTTP server

sudo yum install httpd

3. Start and enable the HTTP server (to start automatically on server reboot)

[centos@ ~]$ sudo systemctl start httpd
[centos@ ~]$ sudo systemctl enable httpd
Created symlink from /etc/systemd/system/multi-user.target.wants/httpd.service to /usr/lib/systemd/system/httpd.service.

4. Now check the status of the Apache server

[centos@ ~]$ sudo systemctl status httpd
httpd.service - The Apache HTTP Server
Loaded: loaded (/usr/lib/systemd/system/httpd.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Thu 2018-08-02 18:32:00 UTC; 6min ago
Docs: man:httpd(8)
man:apachectl(8)
Main PID: 10235 (code=exited, status=0/SUCCESS)
Aug 02 18:32:00 dataneb.com systemd[1]: Starting The Apache HTTP Server...
Aug 02 18:32:00 dataneb.com httpd[10235]: httpd (pid 10202) already running
Aug 02 18:32:00 dataneb.com kill[10237]: kill: cannot find process ""
Aug 02 18:32:00 dataneb.com systemd[1]: httpd.service: control process exited, code=exited status=1
Aug 02 18:32:00 dataneb.com systemd[1]: Failed to start The Apache HTTP Server.
Aug 02 18:32:00 dataneb.com systemd[1]: Unit httpd.service entered failed state.
Aug 02 18:32:00 dataneb.com systemd[1]: httpd.service failed.

5. If the server does not start, disable SELinux on CentOS

[centos@]$ cd /etc/selinux
[centos@]$ sudo vi config

# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
# enforcing - SELinux security policy is enforced.
# permissive - SELinux prints warnings instead of enforcing.
# disabled - No SELinux policy is loaded.
# SELINUX=enforcing
SELINUX=disabled
# SELINUXTYPE= can take one of three values:
# targeted - Targeted processes are protected,
# minimum - Modification of targeted policy. Only selected processes are protected.
# mls - Multi Level Security protection.
SELINUXTYPE=targeted

6. Reboot the system to make the SELinux change effective

[centos@ ~]$ sudo reboot
debug1: channel 0: free: client-session, nchannels 1
Connection to xx.xxx.xxx.xx closed by remote host.
Connection to xx.xxx.xxx.xx closed.
Transferred: sent 16532, received 333904 bytes, in 1758.1 seconds
Bytes per second: sent 9.4, received 189.9
debug1: Exit status -1

7. Now check the status of the Apache server again

[centos@ ~]$ sudo systemctl status httpd
httpd.service - The Apache HTTP Server
Loaded: loaded (/usr/lib/systemd/system/httpd.service; enabled; vendor preset: disabled)
Active: active (running) since Thu 2018-08-02 18:40:18 UTC; 35s ago
Docs: man:httpd(8)
man:apachectl(8)
Main PID: 855 (httpd)
Status: "Total requests: 0; Current requests/sec: 0; Current traffic: 0 B/sec"
CGroup: /system.slice/httpd.service
├─855 /usr/sbin/httpd -DFOREGROUND
├─879 /usr/sbin/httpd -DFOREGROUND
├─880 /usr/sbin/httpd -DFOREGROUND
├─881 /usr/sbin/httpd -DFOREGROUND
├─882 /usr/sbin/httpd -DFOREGROUND
└─883 /usr/sbin/httpd -DFOREGROUND
Aug 02 18:40:17 dataneb.com systemd[1]: Starting The Apache HTTP Server...
Aug 02 18:40:18 dataneb.com systemd[1]: Started The Apache HTTP Server.

8. Configure firewalld (CentOS blocks Apache traffic by default)

[centos@ ~]$ firewall-cmd --zone=public --permanent --add-service=http
[centos@ ~]$ firewall-cmd --zone=public --permanent --add-service=https
[centos@ ~]$ firewall-cmd --reload
9. Test your URL by entering the Apache server IP address in your local browser

Thank you!! If you enjoyed this post, I'd be very grateful if you'd help it spread by emailing it to a friend, or sharing it on Google or Facebook. Refer to the links below. Also click on the "Subscribe" button on the top right corner to stay updated with the latest posts. Your opinion matters a lot; please comment if you have any suggestions for me.

#install #apache #httpserver #centos #howtoinstallapacheoncentos

  • Apache Spark Interview Questions

    This post includes Big Data Spark interview questions and answers for experienced candidates and beginners. If you are a beginner, don't worry, the answers are explained in detail. These are very frequently asked Data Engineer interview questions which will help you crack a big data job interview.

What is Apache Spark? According to the Spark documentation, Apache Spark is a fast and general-purpose in-memory cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming. In simple terms, Spark is a distributed data processing engine which supports programming languages like Java, Scala, Python and R. At its core, the Spark engine has four built-in libraries which support Spark SQL, machine learning (MLlib), Spark Streaming and GraphX.

What is Apache Spark used for? Apache Spark is used for real-time data processing, implementing Extract, Transform, Load (ETL) processes, implementing machine learning algorithms and creating interactive dashboards for data analytics. Apache Spark is also used to process petabytes of data distributed over a cluster with thousands of nodes.

How does Apache Spark work? Spark uses a master-slave architecture to distribute data across worker nodes and process it in parallel. Just like MapReduce, Spark has a central coordinator called the driver, while the rest of the worker nodes run executors. The driver communicates with the executors to process the data.

Why is Spark faster than Hadoop MapReduce? One of the drawbacks of Hadoop MapReduce is that it writes the full intermediate data back to HDFS after each mapper and reducer job. This is very expensive because it consumes a lot of disk I/O and network I/O. Spark, on the other hand, has two kinds of operations, transformations and actions; transformations are lazy, so Spark doesn't materialize or write out the data until an action is called, which reduces disk I/O and network I/O. Another innovation is in-memory caching, where you can instruct Spark to hold input data in memory so that the program doesn't have to read the data again from disk, further reducing disk I/O.

Is Hadoop required for Spark? No, the Hadoop file system is not required for Spark. However, for better performance, Spark can use HDFS and YARN if required.

Is Spark built on top of Hadoop? No. Spark is totally independent of Hadoop.

What is the Spark API? Apache Spark has basically three sets of APIs (Application Program Interfaces) - RDDs, Datasets and DataFrames - that allow developers to access the data and run various functions across four different languages - Java, Scala, Python and R.

What is a Spark RDD? Resilient Distributed Datasets (RDDs) are basically an immutable collection of elements which is used as the fundamental data structure in Apache Spark. The data is logically partitioned across the nodes of your cluster, potentially thousands of them, so that it can be accessed and computed in parallel. RDD has been the primary Spark API since Apache Spark's foundation.

Which are the methods to create an RDD in Spark? There are mainly two methods to create an RDD: parallelizing an existing collection with sc.parallelize(), and referencing an external dataset with sc.textFile(); see the short sketch below. Read - Spark context parallelize and reference external dataset example.

When would you use a Spark RDD? RDDs are used for unstructured data, like streams of media or text, when a schema and a columnar format of the data are not mandatory requirements, i.e. when you don't need to access data by column name or other tabular attributes. Secondly, RDDs are used when you want full control over the physical distribution of the data.
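Both creation methods can be tried directly in spark-shell. The following is only a minimal sketch: the collection contents and the file path are hypothetical placeholders, not part of the original post.

scala> val fromCollection = sc.parallelize(Seq(1, 2, 3, 4, 5))    // parallelize an existing collection
scala> val fromFile = sc.textFile("/path/to/MountEverest.txt")    // reference an external dataset (hypothetical path)
scala> fromCollection.count()                                     // an action such as count() triggers the computation

In spark-shell the SparkContext is already available as sc, so nothing else needs to be created for this example.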
What is SparkContext, SQLContext, SparkSession and SparkConf? SparkContext tells the Spark driver application whether to access the cluster through a resource manager or to run locally in standalone mode. The resource manager can be YARN or Spark's own cluster manager.

SparkConf stores the configuration parameters that the Spark driver application passes to SparkContext. These parameters define the properties of the driver application, which Spark uses to allocate resources on the cluster, such as the number of executors and the memory and cores used by the executors running on the worker nodes.

SQLContext is a class which is used to implement Spark SQL. You need to create SparkConf and SparkContext first in order to use SQLContext. It is basically used for structured data, when you want to apply a schema and run SQL.

All three - SparkContext, SparkConf and SQLContext - are encapsulated within SparkSession. In newer versions you can implement Spark SQL directly with SparkSession (a short sketch follows at the end of this post).

What is Spark checkpointing? Spark checkpointing is a process that saves the actual intermediate RDD data to a reliable distributed file system; in other words, it saves an intermediate stage of an RDD lineage. You do it by calling RDD.checkpoint() in the Spark driver application. You first need to set up a checkpoint directory where Spark can store these intermediate RDDs by calling SparkContext.setCheckpointDir().

What is an action in Spark and what happens when it's executed? An action triggers the execution of the RDD lineage graph: it loads the original data from disk to create the intermediate RDDs, performs all transformations and returns the final output to the Spark driver program, or writes the data to a file system (based on the type of action). See the Spark documentation for the full list of actions.

What is Spark Streaming? Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards. In fact, you can apply Spark's machine learning and graph processing algorithms on data streams. Reference: Apache Spark documentation.

More questions (login to see the answers):
What is the difference between RDDs, DataFrames and Datasets?
Why is a Spark RDD immutable?
Are Spark DataFrames immutable?
Are Spark DataFrames distributed?
What is a Spark stage?
How does Spark SQL work?
What is a Spark executor and how does it work?
How will you improve Apache Spark performance?
What is the Spark SQL warehouse dir?
What is the Spark shell? How would you open and close it?
How will you clear the screen on the Spark shell?
What is parallelize in Spark?
Does Spark SQL require Hive?
What is the Metastore in Hive?
What do repartition and coalesce do in Spark?
What is Spark mapPartitions?
What is the difference between map and flatMap in Spark?
What is Spark reduceByKey?
What is lazy evaluation in Spark?
What is an accumulator in Spark?
Can an RDD be shared between SparkContexts?
What happens if an RDD partition is lost due to a worker node failure?
Which serialization libraries are supported in Spark?

Questions? Feel free to write in the comments section below. Thank you.
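To make the SparkSession, checkpointing and Spark Streaming answers above concrete, here is a minimal Scala sketch. It is only an illustration, not part of the original post: the application name, checkpoint directory, host and port are hypothetical placeholders.

import org.apache.spark.sql.SparkSession

// Build a SparkSession; SparkConf, SparkContext and SQLContext are all encapsulated inside it
val spark = SparkSession.builder()
  .appName("interview-demo")            // hypothetical application name
  .master("local[*]")                   // run locally using all available cores
  .getOrCreate()

val sc = spark.sparkContext             // the underlying SparkContext

// Checkpointing: set a directory first, then mark an RDD and trigger it with an action
sc.setCheckpointDir("/tmp/spark-checkpoints")     // hypothetical directory
val doubled = sc.parallelize(1 to 100).map(_ * 2)
doubled.checkpoint()                    // marks the RDD for checkpointing
doubled.count()                         // the action runs the lineage and saves the checkpoint

// Spark SQL directly through SparkSession
spark.sql("SELECT 1 AS answer").show()

And in the same spirit, a Spark Streaming word count over the classic TCP socket source mentioned above:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))          // 10-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)      // hypothetical TCP source
lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
ssc.start()
ssc.awaitTermination()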

  • Mutable vs Immutable Objects in Python example

    Everything in Python is an object, and all objects in Python can be either mutable or immutable.

Immutable objects: An object with a fixed value. Immutable objects include numbers, bool, strings and tuples. Such an object cannot be altered; a new object has to be created if a different value has to be stored. They play an important role in places where a constant hash value is needed, for example as a key in a dictionary.

Consider an integer → 1 assigned to the variable "a",
>>> a=1
>>> b=a
>>> id(a)
4547990592
>>> id(b)
4547990592
>>> id(1)
4547990592
Notice that id() for all of them is the same. id() basically returns an integer which corresponds to the object's memory location.
b → a → 1 → 4547990592
Now, let's increment the value of "a" by 1 so that we have a new integer object → 2.
>>> a=a+1
>>> id(a)
4547990624
>>> id(b)
4547990592
>>> id(2)
4547990624
Notice how the id() of variable "a" changed, while "b" and "1" stayed the same.
b → 1 → 4547990592
a → 2 → 4547990624
The location to which "a" was pointing has changed (from 4547990592 → 4547990624). The object "1" was never modified. Immutable objects don't allow modification after creation. If you create n different integer values, each will have a different constant hash value.

Mutable objects, like lists, dictionaries and sets, can change their value but keep their id().
>>> x=[1,2,3]
>>> y=x
>>> id(x)
4583258952
>>> id(y)
4583258952
y → x → [1,2,3] → 4583258952
Now, let's change the list
>>> x.pop()
3
>>> x
[1, 2]
>>> id(x)
4583258952
>>> id(y)
4583258952
y → x → [1,2] → 4583258952
x and y are still pointing to the same memory location even after the actual list is changed.

