

  • What is the most efficient way to find prime factors of a number (python)?

    One of the best methods is the Sieve of Eratosthenes: mark 0 → False, 1 → False, 2 → True, and so on. A Python program is shown below.

    def primes(n):
        flag = [True] * n
        flag[0] = flag[1] = False
        for (index, prime) in enumerate(flag):
            if prime:
                yield index
                for i in range(index*index, n, index):
                    flag[i] = False

    Print the last item from the generator:

    p = None
    for p in primes(1000000):
        pass
    print(p)
    >>> 999983

    Regular way. To make the program efficient you need to: cross out 1 as it's not prime; cross out all the multiples of 2, then all the multiples of 3, and so on. Most importantly, for a number N you only need to test divisors up to √N. If N has a prime factor larger than √N, then it surely also has a prime factor smaller than √N, so it's sufficient to search for prime factors in the range [2, √N]. If no prime factor exists in that range, then N itself is prime and there is no need to search further. You can do it like this:

    def is_prime(num):
        if num < 2:
            return False
        return all(num % i for i in range(2, int(num ** 0.5) + 1))

    Checking if 1471 is prime:

    >>> is_prime(1471)
    True

    Let's say you want to check all the prime numbers up to 2 million:

    prime_list = [i for i in range(2000000) if is_prime(i)]

    I have seen similar problems in Project Euler, such as the sum of primes or the total number of primes below a very large number. If the number is small you can simply divide by all the numbers up to (n-1) and check whether it's prime, but that program (shown below) performs very badly when the number is big.

    def is_prime(num):
        if num < 2:
            return False
        for i in range(2, num):
            if num % i == 0:
                return False
        return True

    For example:

    >>> is_prime(47)
    True
    >>> %timeit ("is_prime(47)")
    10.9 ns ± 0.0779 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)

    The above program performs very badly when the number is big. A better solution: instead of dividing by all the numbers up to (n-1), you only need to check divisibility up to sqrt(num).

    def is_prime(x):
        '''Check whether a number is prime, testing divisors up to sqrt(x)'''
        if x < 2:
            return False
        sqrt = round(x ** 0.5)
        for i in range(2, sqrt + 1):
            if x % i == 0:
                return False
        return True

    OR,

    def is_prime(num):
        if num < 2:
            return False
        return all(num % i for i in range(2, int(num ** 0.5) + 1))

    That's it. Now use a list comprehension to build prime_list and print whatever you need.

    >>> prime_list = [i for i in range(3000000) if is_prime(i)]
    >>> print(f"Number of primes below 3 million: {len(prime_list)}")
    >>> print(f"Sum of primes below 3 million: {sum(prime_list)}")
    >>> print(prime_list)
    Number of primes below 3 million: 216816
    Sum of primes below 3 million: 312471072265
    [2, 3, 5, 7, 11, 13, 17, 19, ...]
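    The snippets above test primality or list primes; if you actually need the prime factors of a single number (as the question asks), a trial-division sketch like the one below follows the same √N idea: divide out each factor as you find it, and whatever is left above 1 at the end is itself prime. This is my own illustration, not part of the original answer.

    def prime_factors(n):
        """Return the prime factors of n (with multiplicity) by trial division."""
        factors = []
        d = 2
        while d * d <= n:          # only need to test divisors up to sqrt(n)
            while n % d == 0:      # divide out d as many times as it divides n
                factors.append(d)
                n //= d
            d += 1
        if n > 1:                  # whatever remains above 1 is itself prime
            factors.append(n)
        return factors

    >>> prime_factors(360)
    [2, 2, 2, 3, 3, 5]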

  • Installing Java on Oracle Linux

    Referenced from www.java.com, with a few additional steps added to make the installation process smoother.

    Java for Linux Platforms
    1. First check whether Java is already installed on your machine. Type java -version, or simply run this command in your terminal: which java
    2. If Java is not present, your terminal will not recognize the command and will report "command not found".
    3. To install Java, change to the directory in which you want to install it. Type: cd directory_path_name For example, to install the software in the /usr/java/ directory, type: cd /usr/java/
    4. Download the Java tarball from www.java.com (snippet shown above).
    5. Get the 32-bit or 64-bit tarball depending on your Linux machine configuration.
    6. Move (sftp) the .tar.gz archive binary to the current directory /usr/java/.
    7. Unpack the tarball to install Java: tar zxvf jre-8u73-linux-i586.tar.gz In this example, it is installed in the /usr/java/jre1.8.0_73 directory. You can drop the version detail and rename the directory for convenience.
    8. Delete the .tar.gz file if you want to save some disk space.
    9. Set up your .bashrc file. Type: vi ~/.bashrc and add these two lines: export JAVA_HOME=/usr/java/jre1.8.0_73 and export PATH=$PATH:$JAVA_HOME/bin
    10. Now run: source ~/.bashrc Then type java -version to verify that Java installed successfully. If it does not run, find the bin directory where you unpacked Java and run: /path_to_your_Java/bin/java -version

    Java for RPM based Linux Platforms
    1. Become root by running su and entering the super-user password.
    2. Uninstall any earlier installations of the Java packages: rpm -e package_name
    3. Change to the directory in which you want to install. Type: cd directory_path_name For example, to install the software in the /usr/java/ directory, type: cd /usr/java
    4. Install the package: rpm -ivh jre-8u73-linux-i586.rpm To upgrade a package, type: rpm -Uvh jre-8u73-linux-i586.rpm
    5. Exit the root shell. No need to reboot. Delete the .rpm file if you want to save disk space.

    If you have any questions, please write in the comments section below. Thank you! #Javainstallation #OracleLinux #OEL #OL

  • J2 Work Permit and Processing Time

    J2 work permit (EAD) application processing can take between 3 and 5 months. I am sharing our journey with the J2 visa work permit, and I think most of you will find this article helpful. My spouse worked on a J2 EAD for approximately 3 years until he got his green card. Before the J2 EAD, my husband worked on an H1B for 4 years, and it was such a nightmare that I remember we used to watch the news about H1B visas throughout the Trump administration (January 2017 to January 2021). Every other day there was some sizzling news on H1B. During that period, H1B processing times increased, extensions started getting rejected, the premium processing fee was raised, some companies stopped sponsoring H1B visa holders, and so on. There were so many negative things going on with H1B. Many of my friends had to go back to India because their visa extensions were rejected, for all kinds of reasons. This is when my husband planned to switch from H1B to J2 EAD (along with some personal reasons), and it was a wise decision for us. So, why a J2 EAD? A J1 visa is granted under a cultural exchange program, and its dependent (the J2 visa holder) can get an EAD within roughly 3-5 months and work part-time or full-time without any restriction. In our case, we were out of the danger ⚠️ zone of H1B. As you know, every year "highly skilled" people are hired on H1B via a lottery system. H1B workers constitute the biggest group among all work visa holders, so H1B is the first target when it comes to securing jobs for US citizens: kick out H1B visa holders and most US people will get jobs. That is not my opinion, but it is how it is portrayed to the public. (Google Trends) The biggest advantage: you don't need any sponsorship when you have a J2 EAD; you can work anywhere, anytime you want. You can even work extra hours and hold two jobs. You get a lot of freedom with a J2 EAD. You will have noticed that when you apply for a job in the US, there is usually a question about whether you need sponsorship. You have to choose yes if you are an H1B visa holder, and most probably for that reason your resume will be trashed. With a J2 EAD, you don't need any kind of sponsorship. J1 visa holders are eligible to file for the green card themselves (or by hiring a lawyer), unlike H1B, where companies have to file on your behalf and you can't do it yourself. Another advantage of being a J2 dependent is that you automatically get a green card when the primary visa holder (J1) gets theirs. J2 work permit (EAD): When you apply for a J2 EAD you will get an EAD for the same period as your DS-2019 validity. If your DS-2019 is valid for a year, you will get an EAD for a year; if your DS-2019 is valid for longer, your EAD will cover a longer period. In most cases, the DS-2019 duration depends on how much funding your professor has for your research work and how the university's visa and immigration department grants the DS-2019. Initially, when I came to the US, I had a straight 5 years on my DS-2019. You can even negotiate with your professor to show more funding and get the maximum period on the DS-2019 so that your dependent can get more time on the EAD. Similarly, it helps you get a longer driver's license, state ID, etc. The maximum you can work on a J2 EAD is 5 years, the same as the primary visa holder. Later, the J1 can apply for a green card. Related post: J1 visa holders can apply for green card, see how? The J2 EAD extension is a pretty simple process, and usually you can get it extended in 3 months, the same as getting a new EAD.
You can apply for the extension 6 months before your EAD expires, so you have sufficient time. We applied for a J2 EAD extension 3 times, and it was never rejected and never received an RFE. Please refer to my next blog on how to apply for a J2 EAD. Please comment if you have any questions or doubts; I will try my best to share my experience so far. Next: Filing J2 EAD Related Posts ✔ Go to Main Menu ✔ How long can I stay after my J1 visa expires ✔ J1 visa waiver application process end-to-end ✔ How to convert from a J1 visa to H1B ✔ How to apply for J2 EAD ✔ Indian Passport Renewal ✔ How to apply for an OCI card

  • Hidden iPhone Tips & Tricks That You Should Know

    Apple's iPhone has so many features that it's impossible to use all of them, but here are a few hidden iOS tips and tricks you probably didn't know existed. 1. Check Your iPhone Battery Health For iPhone 6 and later, on iOS 11.3 and later, a new feature displays battery health status and recommends whether the battery needs to be replaced. You can find it under Settings > Battery > Battery Health. Tap Battery Health to check your battery status. Maximum battery capacity measures the device's battery capacity relative to when it was new. Initially it will be 100%, and it decreases over time as battery health degrades. Check this out for complete details. 2. Apple's iPhone Battery Replacement - Just $29 Apple is offering discounted battery replacement (just $29) for all eligible iPhone 6 and later models; refer to the following chart. Visit an Apple Store near you or book an online appointment here to replace your battery if required. 3. Guided Access This keeps the iPhone in single-app mode and lets you control which features are available to a secondary user. It is very useful when someone like your little cousin borrows your iPhone just to play a game; in that case you can grant access to the game app only. Open the Settings app and tap General > Accessibility > Guided Access as shown below. Tap Guided Access so its slider turns green, and turn on Accessibility Shortcut as well. To start Guided Access, triple-click the Home button in the app you want to give access to. 4. Live iPhone Screen Recording You can start a live recording of your iPhone screen with a single tap. First enable it: open the Settings app > Control Center > Customize Controls > add Screen Recording. Then turn on screen recording by tapping the button shown below. 5. Smart Invert Colors (for using your iPhone late at night) This feature allows you to invert the colors (black to white and vice versa), mostly used while reading or writing. Open the Settings app and tap General > Accessibility > Display Accommodations > Invert Colors. Tap Smart Invert Colors so its slider turns green. It will appear as below. Before (Invert Colors: OFF) After (Invert Colors: ON) Invert Colors Shortcut: you can turn on Invert Colors simply by triple-clicking the Home button. To set this up, open the Settings app, tap General > Accessibility > Accessibility Shortcut, and check Smart Invert Colors as shown below. Now just triple-click your Home button to turn Invert Colors on or off. 6. Hide Your Photos Open the Photos app, select all the photos you want to hide, tap Share, and select Hide. 7. Emergency SOS (highly recommended) Rapidly click the sleep/wake button five times to quickly call emergency services in case of health issues, theft, or threat. Open the Settings app > Emergency SOS (turn the slider green). 8. Automatically Offload Unused Applications Apple automatically uninstalls apps you don't use while keeping the app data in case you want to reinstall them. This is very useful when you are running short on iPhone storage. Go to the Settings app > General > iPhone Storage > Offload Unused Apps (enable this by turning the slider green). 9. Automatically Turn On "Do Not Disturb" While Driving Go to the Settings app > Do Not Disturb > Do Not Disturb While Driving (enable this by turning the slider green). 10. Turn Your Live Photo Into a GIF Open the Photos app and choose any Live Photo. Now simply slide up to reveal hidden edit features: Bounce, Loop & Long Exposure. Thank you!!
If you enjoyed this post, I'd be very grateful if you'd help it spread by sharing it with a friend or sharing it on Google or Facebook.

  • Funny Short Math Jokes and Puns, Math is Fun!

    A mathematical joke is a form of humor which relies on aspects of mathematics or a stereotype of mathematicians to derive humor. The humor may come from a pun, or from a double meaning of a mathematical term, or from a lay person's misunderstanding of a mathematical concept. Instead of good-bye we say Calc-U-later Why should you not mix alcohol and calculus? Because you should never drink and derive. Write the expression for the volume of a thick crust pizza with height "a" and radius "z". The formula for volume is π·(radius)**2·(height). In this case, pi·z·z·a. How do you make seven even? Just remove the “s.” Q: What is a proof? A: One-half percent of alcohol. Q: What is gray and huge and has integer coefficients? A: An elephantine equation. Q: Why do truncated Maclaurin series fit the original function so well? A: Because they are “Taylor” made. Q: What is gray and huge and has integer coefficients? A: An elephantine equation. Q: What’s a polar bear? A: A rectangular bear after a coordinate transform. Q: What do you get if you cross a mosquito with a mountain climber? A: You can’t cross a vector with a scalar. Theorem. 3=4. Proof. Suppose a + b = c This can also be written as: 4a − 3a + 4b − 3b = 4c − 3c After reorganizing: 4a + 4b − 4c = 3a + 3b − 3c Take the constants out of the brackets: 4(a + b − c) = 3(a + b − c) Remove the same term left and right: 4=3 A mathematician and an engineer are on a desert island. They find two palm trees with one coconut each. The engineer shinnies up one tree, gets the coconut, and eats it. The mathematician shinnies up the other tree, gets the coconut, climbs the other tree and puts it there. “Now we’ve reduced it to a problem we know how to solve.” There are a mathematician and a physicist and a burning building with people inside. There are a fire hydrant and a hose on the sidewalk. The physicist has to put the fire out…so, he attaches the hose to the hydrant, puts the fire out, and saves the house and the family. Then they put the people back in the house, set it on fire, and ask the mathematician to solve the problem. So, he takes the hose off the hydrant and lays it on the sidewalk. “Now I’ve reduced it to a previously solved problem” and walks away. Three men are in a hot-air balloon. Soon, they find themselves lost in a canyon somewhere. One of the three men says, “I’ve got an idea. We can call for help in this canyon and the echo will carry our voices far.” So he leans over the basket and yells out, “Helloooooo! Where are we?” (They hear the echo several times.) Fifteen minutes later, they hear this echoing voice: “Hellooooo! You’re lost!!” One of the men says, “That must have been a mathematician.” Puzzled, one of the other men asks, “Why do you say that?” The reply: “For three reasons: (1) He took a long time to answer, (2) he was absolutely correct, and (3) his answer was absolutely useless.” Infinitely many mathematicians walk into a bar. The first says, "I'll have a beer." The second says, "I'll have half a beer." The third says, "I'll have a quarter of a beer." Before anyone else can speak, the barman fills up exactly two glasses of beer and serves them. "Come on, now,” he says to the group, “You guys have got to learn your limits.” Scientists caught a physicist and a mathematician and locked them in separate rooms so both could not interact with each other. They started studying their behavior. The two were assigned a task to remove a hammered nail from inside the wall. The only tools they had were a hammer and a nail-drawer. 
After some muscular effort, both solved the tasks similarly by using the nail-drawer. Then there was a second task, to remove the nail that was barely touching the wall with its sharp end. The physicist simply took the nail with his hand. The mathematician hammered the nail inside the wall with full force and proudly announced: the problem has been reduced to the previous one! A mathematician organizes a raffle in which the prize is an infinite amount of money paid over an infinite amount of time. Of course, with the promise of such a prize, his tickets sell like hot cake. When the winning ticket is drawn, and the jubilant winner comes to claim his prize, the mathematician explains the mode of payment: "1 dollar now, 1/2 dollar next week, 1/3 dollar the week after that..." Sherlock Holmes and Watson travel on a balloon. They were hidden in clouds, so they didn’t know which country they flew above. Finally they saw a guy below between clouds, so they asked. “Hey, you know where we are?” “Yes” “Where?” “In a balloon”. And the guy was hidden by clouds again. Watson:”Goddamn, what a stupid idiot!” Holmes:”No my friend, he’s a mathematician”. Watson:”How can you know that, Holmes?” Holmes:”Elementary, my dear Watson. He responded with an absolutely correct and absolutely useless answer”. My girlfriend is the square root of -100. She’s a perfect 10, but purely imaginary. How do mathematicians scold their children? "If I've told you n times, I've told you n+1 times..." What’s the best way to woo a math teacher? Use acute angle. What do you call a number that can't keep still? A roamin' numeral. Take a positive integer N. No wait, N is too big; take a positive integer k. A farmer counted 196 cows in 
the field. But when he rounded them up, he had 200. Why should you never argue with decimals? Because decimals always have a point. When someone once asked Professor Eilenberg if he could eat Chinese food with three chopsticks, he answered, "Of course," according to Professor Morgan. How are you going to do it? I'll take the three chopsticks, I'll put one of them aside on the table, and I'll use the other two. A statistics professor is going through security at the airport when they discover a bomb in his carry-on. The TSA officer is livid. "I don't understand why you'd want to kill so many innocent people!" The professor laughs and explains that he never wanted to blow up the plane; in fact, he was trying to save them all. "So then why did you bring a bomb?!" The professor explains that the probability of a bomb being on an airplane is 1/1000, which is quite high if you think about it, and statistically relevant enough to prevent him from being able to fly stress-free. "So what does that have to do with you packing a bomb?" the TSA officer wants to know, so the professor explains. "You see, if there's 1/1000 probability of a bomb being on my plane, the chance that there are two bombs is 1/1000000. So if I bring a bomb, the chance there is another bomb is only 1/1000000, so we are all much safer." The great probabilist Mark Kac (1914-1984) once gave a lecture at Caltech, with Feynman in the audience. When Kac finished, Feynman stood up and loudly proclaimed, "If all mathematics disappeared, it would set physics back precisely one week." To that outrageous comment, Kac shot back with that yes, he knew of that week; it was "Precisely the week in which God created the world." An experimental physicist meets a mathematician in a bar and they start talking. The physicict asks, "What kind of math do you do?" to which the mathematician replies, "Knot theory." The physicist says, "Me neither!" A poet, a priest, and a mathematician are discussing whether it's better to have a wife or a mistress. The poet argues that it's better to have a mistress because love should be free and spontaneous. The priest argues that it's better to have a wife because love should be sanctified by God. The mathematician says, "I think it's better to have both. That way, when each of them thinks you're with the other, you can do some mathematics." Three mathematicians walk into a bar. Bartender asks:”Will all of you guys have beer?” The first mathematician: “I don’t know”. The second mathematician: “I don’t know”. The third one: ”Yes”. A mathematician is attending a conference in another country and is sleeping at a hotel. Suddenly, there is a fire alarm and he rushes out in panic. He also notices some smoke coming from one end of the corridor. As he is running, he spots a fire extinguisher. “Ah!”, he exclaims, “A solution exists!” and comes back to his room and sleeps peacefully. Two statisticians go to hunt a bear. After roaming the woods for a while, they spot a lone grizzly. The first statistician takes aim and shoots, but it hits three feet in front of the bear. The second one shoots next, and it hits three feet behind the bear. They both agree that they have shot the bear and go to retrieve it.. Parallel lines have so much in common. It’s a shame they’ll never meet. I just saw my math teacher with a piece of graph paper. I think he must be plotting something. Are monsters good at math? No, unless you Count Dracula. My girlfriend is the square root of -100. She's a perfect 10, but purely imaginary. 
Q: Why is a math book depressed? A: Because it has so many problems. How do you stay warm in an empty room? Go into the corner where it is always 90 degrees. There are three kinds of people in the world: those who can count and those who can't. Q: Why did I divide sin by tan? A: Just cos. Q: Where's the only place you can buy 64 watermelons and nobody wonders why? A: In an elementary school math class. 60 out of 50 people have trouble with fractions. But why did 7 eat 9? Because you’re supposed to eat 3 squared meals a day. Q: Why is the obtuse triangle depressed? A: Because it is never right. Q: Why did the 30-60-90 degree triangle marry the 45-45-90 degree triangle? A: Because they were right for each other. Q: Why didn't the Romans find algebra very challenging? A: Because they always knew X was 10. Two statisticians went out hunting and they found a deer. The first one overshoots by 5 meters. The second one undershoots by 5 meters. They both hug each other and shout out “We Got It!” An astronomer, a physicist and a mathematician are on a train traveling from England to Scotland. It is the first time for each of them. Some time after the train crosses the border, the three of them notice a sheep in a field. “Amazing!” says the astronomer. “All the sheep in Scotland are black!”. “No, no” responds the physicist. “Some sheep in Scotland are black!” The mathematician closes his eyes pityingly, and intones: “In Scotland, there is at least one field, containing at least one sheep, at least one side of which is black.” An engineer, a physicist and a mathematician go to a hotel. The boiler malfunctions in the middle of the night and the radiators in each room set the curtains on fire. The engineer sees the fire, sees there is a bucket in the bathroom, fills the bucket with water and throws it over the fire. The physicist sees the fire, sees the bucket, fills the bucket to the top of his mentally calculated error margin and throws it over the fire. The mathematician sees the fire, sees the bucket, see the solution and goes back to sleep. #MathJokes #FunnyMath #MathPuns #ShortMathJoke

  • Loading CSV data into Elasticsearch with Logstash

    Refer to my previous blogs (Linux | Mac users) to install the ELK stack on your machine. Once installation is done, there are a couple of ways to load CSV files into Elasticsearch that I am aware of (one via Logstash and another with Filebeat). In this blog, we will be using Logstash to load the file. I am using sample Squid access logs (a comma-separated CSV file) to illustrate this blog. You can find the file format details at this link.

    Sample Data
    Copy and paste these records to create an access_log.csv file. I picked this format because it's a CSV file, a widely used format in production environments, and it contains various types of attributes (data types).

    $ more /Volumes/MYLAB/testdata/access_log.csv
    Time,Duration,Client_address,Result_code,Bytes,Request_method,Requested_URL,User,Hierarchy_code,Type
    1121587707.473,60439,219.138.188.61,TCP_MISS/503,0,CONNECT,203.84.194.44:25,-,DIRECT/203.84.194.44,-
    1121587709.448,61427,219.138.188.61,TCP_MISS/503,0,CONNECT,203.84.194.50:25,-,DIRECT/203.84.194.50,-
    1121587709.448,61276,219.138.188.61,TCP_MISS/503,0,CONNECT,67.28.114.36:25,-,DIRECT/67.28.114.36,-
    1121587709.449,60148,219.138.188.61,TCP_MISS/503,0,CONNECT,4.79.181.12:25,-,DIRECT/4.79.181.12,-
    1121587710.889,60778,219.138.188.61,TCP_MISS/503,0,CONNECT,203.84.194.39:25,-,DIRECT/203.84.194.39,-
    1121587714.803,60248,219.138.188.61,TCP_MISS/503,0,CONNECT,203.84.194.50:25,-,DIRECT/203.84.194.50,-
    1121587714.803,59866,219.138.188.61,TCP_MISS/503,0,CONNECT,203.84.194.43:25,-,DIRECT/203.84.194.43,-
    1121587719.834,60068,219.138.188.61,TCP_MISS/503,0,CONNECT,203.84.194.45:25,-,DIRECT/203.84.194.45,-
    1121587728.564,59642,219.138.188.55,TCP_MISS/503,0,CONNECT,168.95.5.45:25,-,DIRECT/168.95.5.45,-

    My file looks something like this.

    Start Elastics
    Now start Elasticsearch and Kibana (if you don't remember how to start them, refer to my previous blogs: Linux | Mac users). Don't kill these processes; both are required by Logstash to load the data.
    $ elasticsearch
    $ kibana

    Logstash Configuration
    In order to read CSV files with Logstash, you need to create a configuration file that holds all the configuration details for accessing the log file: input, filter & output. In short, the input tag contains details like the filename, location, start position, etc. The filter tag contains the file type, separator, column details, transformations, etc. The output tag contains the host where the data will be written, the index name (should be lowercase), document type, etc. These tags look like JSON, but they are not; the format is specific to Logstash. I have created a config file under the config directory in Logstash as shown below.

    $ more /usr/local/var/homebrew/linked/logstash-full/libexec/config/logstash_accesslog.config
    input {
      file {
        path => "/Volumes/MYLAB/testdata/access_log.csv"
        start_position => "beginning"
        sincedb_path => "/Volumes/MYLAB/testdata/logstash.txt"
      }
    }
    filter {
      csv {
        separator => ","
        columns => [ "Time", "Duration", "Client_address", "Result_code", "Bytes", "Request_method", "Requested_URL", "User", "Hierarchy_code", "Type" ]
      }
      date {
        match => [ "Time", "UNIX" ]
        target => "EventTime"
      }
      mutate { convert => ["Duration", "integer"] }
      mutate { convert => ["Bytes", "integer"] }
    }
    output {
      elasticsearch {
        hosts => "localhost"
        index => "logstash-accesslog"
      }
      stdout {}
    }

    Explanation! These tags are very basic and straightforward.
    You use the columns tag to define the list of fields within quotes (if you face issues, use single quotes instead). "mutate" does minor datatype conversion, and "match" is used to convert the UNIX timestamps to a human-readable time format. Further, logstash-accesslog is the index name that I am using for the Squid access logs. If there is any question regarding any of the tags, please comment (in the comments section below) and I will get back to you as soon as possible. Also, you can change sincedb_path to /dev/null if you don't want to keep track of loaded files. If you want to reload the same file, make sure you delete its entry from the sincedb_path file (logstash.txt in this case). Here is my config file snapshot; zoom it a little bit to see the content :)

    Run Logstash & Load data
    You are all set to start Logstash with the configuration file we just created. Follow the steps below to run Logstash with the config file; it will take a few seconds to index all the records. Change your Logstash home location accordingly; mine is under Homebrew as I am using a Mac.
    $ /usr/local/var/homebrew/linked/logstash-full/bin/logstash -f /usr/local/var/homebrew/linked/logstash-full/libexec/config/logstash_accesslog.config
    Make sure you are not getting any errors in the Logstash output while loading the file; otherwise the file will not load and no index will be created. Now open Kibana and run the command in the "Dev Tools" tab to see how many records got loaded. I loaded 10 records just for demonstration (a small Python alternative for checking the count appears at the end of this post).

    Kibana Index Creation
    Now go to the "Management" tab in Kibana and click Index Patterns => Create Index Pattern. Create an index pattern with the same name we mentioned in the configuration file: logstash-accesslog. Hit "Next step" and, for the time filter, select "I don't want to use the time filter". Now hit "Create index pattern". Then go to the "Discover" tab and you will be able to view the indexed data we just loaded.

    Kibana Dashboard
    I don't have much data to create a Kibana dashboard here, but just for demonstration purposes, let's say you want to see the number of events that occurred per millisecond. It's an impractical example, as you will never have such a use case. Go to the "Visualize" tab and hit Create new. Select a pie chart, for example. Now select the logstash-accesslog index and apply the changes in buckets as highlighted below. That's all; you can see the number of events that occurred each millisecond. If you have any questions please write in the comments section below. Thank you.

    Next: Create Kibana Dashboard Example
    Navigation Menu: Introduction to ELK Stack Installation Loading data into Elasticsearch with Logstash Create Kibana Dashboard Example Kibana GeoIP Dashboard Example Loading data into Elasticsearch using Apache Spark
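    If you prefer to verify the record count outside Kibana, a small Python sketch like the one below queries the same count API over HTTP. This is my own illustration (not from the original post); it assumes Elasticsearch is reachable on localhost:9200 and that the index is named logstash-accesslog as in the config above.

    import requests

    # Ask Elasticsearch how many documents the index currently holds
    resp = requests.get("http://localhost:9200/logstash-accesslog/_count")
    resp.raise_for_status()
    print("Documents indexed:", resp.json()["count"])

    The equivalent query in Kibana Dev Tools is GET logstash-accesslog/_count.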

  • How to write your first blog on Dataneb?

    01. Sign Up First sign up; after this, you should automatically receive writer's privilege within 24 hours. If you don't get writer's privilege, please email us. 02. Start Writing Go to your profile > Create a post. Before you start, please read the blogging guidelines carefully. Blogging Guidelines These rules are meant to keep a quality blogging environment at Dataneb. Blog Uniqueness Blogs should be unique. Dataneb does not accept syndicated/unoriginal posts, research papers, duplicate posts, copies of others' content/articles, etc. NOTE: Violation of these guidelines will result in direct loss of writer's privilege. Blog Length Blogs should have a minimum length of 2000 characters; there is no upper limit. You will find the total number of characters in the top left corner of the editor while drafting blogs. Blogs not fulfilling this criterion will be automatically moved to draft status. Image Requirement You can add images (but they should not be copyrighted). Or you can leave it to us; one of our moderators will handle image requirements. Back-links Backlinks are allowed (maximum 5, sometimes more) as long as the intention is clear. Make sure you are not linking to any blacklisted websites. Miscellaneous Moderators have the authority to add keywords and modify text, images, etc. so that your blog can get a higher Google ranking. This will help your blog get more organic views. You can delete your post anytime, but Dataneb retains full rights to re-publish that content. Wait! There is an Easier Way to Publish Your Blog We understand that you may not want to publish your blog without review. Don't worry! Just draft the blog and save it. Email us when your blog is ready to publish and one of our moderators will review and publish it for you. If you are just a member and don't want to become a writer, you can also write your post in a Word document and email it to us for submission. What's next? Share your blog post on Facebook, Twitter, etc. to get more views, earn badges, and invite others. Sharing blogs on social media is the easiest and fastest way to earn views. We value your words; please don't hurt others' feelings while commenting on blog posts, and help maintain a quality environment at Dataneb. Email us if you have any queries.

  • Apache Spark Tutorial Scala: A Beginners Guide to Apache Spark Programming

    Learn Apache Spark: Tutorial for Beginners - This Apache Spark tutorial will introduce you to Apache Spark programming in Scala. You will learn about Scala programming, DataFrames, RDDs, Spark SQL, and Spark Streaming with examples, and finally prepare yourself for Spark interview questions and answers. What is Apache Spark? Apache Spark is an analytics engine for big data processing. It can run up to 100 times faster than Hadoop MapReduce for in-memory workloads and gives you full freedom to process large-scale data in real time, run analytics, and apply machine learning algorithms. Navigation menu 1. Apache Spark and Scala Installation 1.1 Spark installation on Windows 1.2 Spark installation on Mac 2. Getting Familiar with Scala IDE 2.1 Hello World with Scala IDE 3. Spark data structure basics 3.1 Spark RDD Transformations and Actions example 4. Spark Shell 4.1 Starting Spark shell with SparkContext example 5. Reading data files in Spark 5.1 SparkContext Parallelize and read textFile method 5.2 Loading JSON file using Spark Scala 5.3 Loading TEXT file using Spark Scala 5.4 How to convert RDD to dataframe? 6. Writing data files in Spark 6.1 How to write single CSV file in Spark 7. Spark streaming 7.1 Word count example Scala 7.2 Analyzing Twitter texts 8. Sample Big Data Architecture with Apache Spark 9. What's Artificial Intelligence, Machine Learning, Deep Learning, Predictive Analytics, Data Science? 10. Spark Interview Questions and Answers Next: Apache Spark Installation ( Windows | Mac )

  • Spark RDD, Transformations and Actions example

    Main menu: Spark Scala Tutorial In this Apache Spark RDD tutorial you will learn about: Spark RDD with examples, what an RDD is in Spark, Spark transformations, Spark actions, Spark actions and transformations examples, and Spark RDD operations.

    What is an RDD in Spark? According to the Apache Spark documentation - "Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat".

    Example (for easy understanding) - not a practical case. I seriously didn't understand anything when I read the above definition for the first time, except the fact that RDD is an acronym for Resilient Distributed Dataset. Let's try to understand RDD with a simple example. Assume that you have a collection of 100 movies stored on your personal laptop. This way you have the complete data residing on a single machine (you could call it a node), i.e. your personal laptop. Now, instead of having all the movies on a single machine, let's say you distributed them: 50 movies on laptop A and 50 movies on laptop B. This is where the term Distributed comes into the picture: 50% of your data resides on one machine and 50% on another. Now let's say you were worried that either laptop could malfunction and you would lose your movies, so you took a backup: the 50 movies on laptop A were backed up on laptop B, and the 50 movies on laptop B were backed up on laptop A. This is where the term Resilient (or fault-tolerant) comes into the picture. The dictionary meaning of resilient is to withstand or recover quickly from difficult conditions, and the backup of your movies makes sure you can recover the data anytime from the other machine (the so-called node) if a system malfunctions. The number of times you replicate the data onto another machine for recovery is called the replication factor. In the above case the replication factor was one, as you replicated the data once. In real-life scenarios you will encounter huge amounts of data (like the movies in the example above) distributed across thousands of worker nodes (like the laptops above), the combination of which is called a cluster, with higher replication factors (in the above example it was just 1) in order to maintain a fault-tolerant system.

    Basic facts about Spark RDDs: Resilient Distributed Datasets (RDDs) are an immutable collection of elements used as the fundamental data structure in Apache Spark. You can create RDDs by two methods - parallelizing a collection & referencing external datasets. RDDs are immutable, i.e. read-only data structures, so you can't change the original RDD, but you can always create a new one. RDDs support two types of Spark operations - transformations & actions.

    Parallelize collection
    scala> sc.parallelize(1 to 10 by 2)
    res8: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[4] at parallelize at :25

    Referencing a dataset
    scala> val dataFile = sc.textFile("/testdata/MountEverest.txt")
    dataFile: org.apache.spark.rdd.RDD[String] = /testdata/MountEverest.txt

    See - How to create an RDD?

    Spark Transformations & Actions
    In Spark, transformations are functions that produce a new RDD from an existing RDD. When you need actual data from an RDD, you need to apply actions.
Below is the list of common transformations supported by Spark. But before that, those who are new to programming.. You will be using lambda functions or sometimes called anonymous functions to pass through these Spark transformations. So you should have basic understanding of lambda functions. In short, lambda functions are convenient way to write a function when you have to use functions just in one place. For example, if you want to double the number you can simply write; x => x + x like you do in Python and other languages. Syntax in Scala would be like this, scala> val lfunc = (x:Int) => x + x lfunc: Int => Int = // This tells that function takes integer and returns integer scala> lfunc(3) res0: Int = 6 Sample Data I will be using "Where is the Mount Everest?" text data. I just picked some random data to go through these examples. Where is Mount Everest? (MountEverest.txt) Mount Everest (Nepali: Sagarmatha सगरमाथा; Tibetan: Chomolungma ཇོ་མོ་གླང་མ; Chinese Zhumulangma 珠穆朗玛) is Earth's highest mountain above sea level, located in the Mahalangur Himal sub-range of the Himalayas. The international border between Nepal (Province No. 1) and China (Tibet Autonomous Region) runs across its summit point. - Reference Wikipedia scala> val mountEverest = sc.textFile("/testdata/MountEverest.txt") mountEverest: org.apache.spark.rdd.RDD[String] = /testdata/MountEverest.txt MapPartitionsRDD[1] at textFile at :24 Spark Transformations I encourage you all to run these examples on Spark-shell side-by-side. Don't just read through them. Type them on your keyboard it will help you learn. map(func) This transformation redistributes the data after passing each element through func. 1. For example, if you want to split the Mount Everest text into individual words, you just need to pass this lambda func x => x.split(" ") and it will create a new RDD as shown below. scala> val words = mountEverest.map(x => x.split(" ")) words: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[3] at map at :25 Did you spot the difference between mountEverest and words RDD? Yeah exactly, one is String type and after applying map transformation it's now Array of String. scala> words.collect() res1: Array[Array[String]] = Array(Array(Mount, Everest, (Nepali:, Sagarmatha, सगरमाथा;, Tibetan:, Chomolungma, ཇོ་མོ་གླང་མ;, Chinese, Zhumulangma, 珠穆朗玛), is, Earth's, highest, mountain, above, sea, level,, located, in, the, Mahalangur, Himal, sub-range, of, the, Himalayas., The, international, border, between, Nepal, (Province, No., 1), and, China, (Tibet, Autonomous, Region), runs, across, its, summit, point.)) To return all the elements of words RDD we have called collect() action. It's very basic Spark action. 2. Now, suppose you want to get the word count in this text file, you can do something like this - first split the file and then get the length or size of collection. scala> mountEverest.map(x => x.split(" ").length).collect() res6: Array[Int] = Array(45) // Mount Everest file has 45 words scala> mountEverest.map(x => x.split(" ").size).collect() res7: Array[Int] = Array(45) 3. Lets say you want to get total number of characters in the file, you can do it like this. scala> mountEverest.map(x => x.length).collect() res5: Array[Int] = Array(329) // Mount Everest file has 329 characters 4. Suppose you want to make all text upper/lower case, you can do it like this. 
scala> mountEverest.map(x => x.toUpperCase()).collect() res9: Array[String] = Array(MOUNT EVEREST (NEPALI: SAGARMATHA सगरमाथा; TIBETAN: CHOMOLUNGMA ཇོ་མོ་གླང་མ; CHINESE ZHUMULANGMA 珠穆朗玛) IS EARTH'S HIGHEST MOUNTAIN ABOVE SEA LEVEL, LOCATED IN THE MAHALANGUR HIMAL SUB-RANGE OF THE HIMALAYAS. THE INTERNATIONAL BORDER BETWEEN NEPAL (PROVINCE NO. 1) AND CHINA (TIBET AUTONOMOUS REGION) RUNS ACROSS ITS SUMMIT POINT.) scala> mountEverest.map(x=>x.toLowerCase()).collect() res35: Array[String] = Array(mount everest (nepali: sagarmatha सगरमाथा; tibetan: chomolungma ཇོ་མོ་གླང་མ; chinese zhumulangma 珠穆朗玛) is earth's highest mountain above sea level, located in the mahalangur himal sub-range of the himalayas.the international border between nepal (province no. 1) and china (tibet autonomous region) runs across its summit point.) flatmap(func) As name says it's flattened map. This is also similar to map, except the fact that it gives you more flattened output. For example, scala> val rdd = sc.parallelize(Seq("Where is Mount Everest","Himalayas India")) rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[22] at parallelize at :24 scala> rdd.collect res26: Array[String] = Array(Where is Mount Everest, Himalayas India) 1. We have two items in Parallel Collection RDD - "Where is Mount Everest" and "Himalayas India". scala> rdd.map(x => x.split(" ")).collect res21: Array[Array[String]] = Array(Array(Where, is, Mount, Everest), Array(Himalayas, India)) 2. When map() transformation is applied, it results into two separate array of strings (1st element (Where, is, Mount, Everest) and 2nd element => (Himalayas, India)). scala> rdd.flatMap(x => x.split(" ")).collect res23: Array[String] = Array(Where, is, Mount, Everest, Himalayas, India) 3. For flatMap(), output is flattened to single array of string Array[String]. Thus flatMap() is similar to map, where each input item is mapped to 0 or more output items (1st item => 4 elements, 2nd item => 2 elements). This will give you clear picture, scala> rdd.map(x => x.split(" ")).count() res24: Long = 2 // as map gives one to one output hence 2=>2 scala> rdd.flatMap(x => x.split(" ")).count() res25: Long = 6 // as flatMap gives one to zero or more output hence 2=>6 map() => [Where is Mount Everest, Himalayas India] => [[Where, is, Mount, Everest],[Himalayas, India]] flatMap() => [Where is Mount Everest, Himalayas India] => [Where, is, Mount, Everest, Himalayas, India] 4. Getting back to mountEverest RDD, suppose you want to get the length of each individual word. scala> mountEverest.flatMap(x=>x.split(" ")).map(x=>(x, x.length)).collect res82: Array[(String, Int)] = Array((Mount,5), (Everest,7), ((Nepali:,8), (Sagarmatha,10), (सगरमाथा;,8), (Tibetan:,8), (Chomolungma,11), (ཇོ་མོ་གླང་མ;,12), (Chinese,7), (Zhumulangma,11), (珠穆朗玛),5), (is,2), (Earth's,7), (highest,7), (mountain,8), (above,5), (sea,3), (level,,6), (located,7), (in,2), (the,3), (Mahalangur,10), (Himal,5), (sub-range,9), (of,2), (the,3), (Himalayas.The,13), (international,13), (border,6), (between,7), (Nepal,5), ((Province,9), (No.,3), (1),2), (and,3), (China,5), ((Tibet,6), (Autonomous,10), (Region),7), (runs,4), (across,6), (its,3), (summit,6), (point.,6)) filter(func) As name tells it is used to filter elements same like where clause in SQL and it is case sensitive. 
For example, scala> rdd.collect res26: Array[String] = Array(Where is Mount Everest, Himalayas India) // Returns one match scala> rdd.filter(x=>x.contains("Himalayas")).collect res31: Array[String] = Array(Himalayas India) // Contains is case sensitive scala> rdd.filter(x=>x.contains("himalayas")).collect res33: Array[String] = Array() scala> rdd.filter(x=>x.toLowerCase.contains("himalayas")).collect res37: Array[String] = Array(Himalayas India) Filtering even numbers, scala> sc.parallelize(1 to 15).filter(x=>(x%2==0)).collect res57: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14) scala> sc.parallelize(1 to 15).filter(_%5==0).collect res59: Array[Int] = Array(5, 10, 15) mapPartitions(func type Iterator) Similar to map() transformation but in this case function runs separately on each partition (block) of RDD unlike map() where it was running on each element of partition. Hence mapPartitions are also useful when you are looking for performance gain (calls your function once/partition not once/element). Suppose you have elements from 1 to 100 distributed among 10 partitions i.e. 10 elements/partition. map() transformation will call func 100 times to process these 100 elements but in case of mapPartitions(), func will be called once/partition i.e. 10 times. Secondly, mapPartitions() holds the data in-memory i.e. it will store the result in memory until all the elements of the partition has been processed. mapPartitions() will return the result only after it finishes processing of whole partition. mapPartitions() requires an iterator input unlike map() transformation. What is an Iterator? - An iterator is a way to access collection of elements one-by-one, its similar to collection of elements like List(), Array() etc in few ways but the difference is that iterator doesn't load the whole collection of elements in memory all together. Instead iterator loads elements one after another. In Scala you access these elements with hasNext and Next operation. For example, scala> sc.parallelize(1 to 9, 3).map(x=>(x, "Hello")).collect res3: Array[(Int, String)] = Array((1,Hello), (2,Hello), (3,Hello), (4,Hello), (5,Hello), (6,Hello), (7,Hello), (8,Hello), (9,Hello)) scala> sc.parallelize(1 to 9, 3).partitions.size res95: Int = 3 scala> sc.parallelize(1 to 9, 3).mapPartitions(x=>(Array("Hello").iterator)).collect res7: Array[String] = Array(Hello, Hello, Hello) scala> sc.parallelize(1 to 9, 3).mapPartitions(x=>(List(x.next).iterator)).collect res11: Array[Int] = Array(1, 4, 7) In first example, I have applied map() transformation on dataset distributed between 3 partitions so that you can see function is called 9 times. In second example, when we applied mapPartitions(), you will notice it ran 3 times i.e. for each partition once. We had to convert string "Hello" into iterator because mapPartitions() takes iterator as input. In thirds step, I tried to get the iterator next value to show you the element. Note that next is always increasing value, so you can't step back. See this, scala> sc.parallelize(1 to 9, 3).mapPartitions(x=>(List(x.next,x.next, "|").iterator)).collect res18: Array[Any] = Array(1, 2, |, 4, 5, |, 7, 8, |) In first call next value for partition 1 changed from 1 => 2 , for partition 2 it changed from 4 => 5 and similarly for partition 3 it changed from 7 => 8. You can keep this increasing until hasNext is False (hasNext is a property of iteration which tells you whether collection has ended or not, it returns you True or False based on items left in the collection). 
For example, scala> sc.parallelize(1 to 9, 3).mapPartitions(x=>(List(x.next, x.hasNext).iterator)).collect res19: Array[AnyVal] = Array(1, true, 4, true, 7, true) You can see hasNext is true because there are elements left in each partition. Now suppose we access all three elements from each partition, then hasNext will result false. For example, scala> sc.parallelize(1 to 9, 3).mapPartitions(x=>(List(x.next, x.next, x.next, x.hasNext).iterator)).collect res20: Array[AnyVal] = Array(1, 2, 3, false, 4, 5, 6, false, 7, 8, 9, false) Just for our understanding, if you will try to access next 4th time, you will get error which is expected. scala> sc.parallelize(1 to 9, 3).mapPartitions(x=>(List(x.next, x.next, x.next, x.next,x.hasNext).iterator)).collect 19/07/31 11:14:42 ERROR Executor: Exception in task 1.0 in stage 18.0 (TID 56) java.util.NoSuchElementException: next on empty iterator Think, map() transformation as special case of mapPartitions() where you have just 1 element in each partition. Isn't it? mapPartitionsWithIndex(func) Similar to mapPartitions, but good part is that you have index to see the partition position. For example, scala> val mp = sc.parallelize(List("One","Two","Three","Four","Five","Six","Seven","Eight","Nine"), 3) mp: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[38] at parallelize at :24 scala> mp.collect res23: Array[String] = Array(One, Two, Three, Four, Five, Six, Seven, Eight, Nine) scala> mp.mapPartitionsWithIndex((index, iterator) => {iterator.toList.map(x => x + "=>" + index ).iterator} ).collect res26: Array[String] = Array(One=>0, Two=>0, Three=>0, Four=>1, Five=>1, Six=>1, Seven=>2, Eight=>2, Nine=>2) Index 0 (first partition) has three values as expected, similarly other 2 partitions. If you have any question please mention it in comments section at the end of this blog. sample(withReplacement, fraction, seed) Generates a fraction RDD from an input RDD. Note that second argument fraction doesn't represent the fraction of actual RDD. It actually tells the probability of each element in the dataset getting selected for the sample. Seed is optional. First boolean argument decides type of sampling algorithm. For example, scala> sc.parallelize(1 to 10).sample(true, .4).collect res103: Array[Int] = Array(4) scala> sc.parallelize(1 to 10).sample(true, .4).collect res104: Array[Int] = Array(1, 4, 6, 6, 6, 9) // Here you can see fraction 0.2 doesn't represent fraction of rdd, 4 elements selected out of 10. scala> sc.parallelize(1 to 10).sample(true, .2).collect res109: Array[Int] = Array(2, 4, 7, 10) // Fraction set to 1 which is the max probability (0 to 1), so each element got selected. scala> sc.parallelize(1 to 10).sample(false, 1).collect res111: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) union(otherDataset) Similar to SQL union, but it keeps duplicate data. scala> val rdd1 = sc.parallelize(List("apple","orange","grapes","mango","orange")) rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[159] at parallelize at :24 scala> val rdd2 = sc.parallelize(List("red","green","yellow")) rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[160] at parallelize at :24 scala> rdd1.union(rdd2).collect res116: Array[String] = Array(apple, orange, grapes, mango, orange, red, green, yellow) scala> rdd2.union(rdd1).collect res117: Array[String] = Array(red, green, yellow, apple, orange, grapes, mango, orange) intersection(otherDataset) Returns intersection of two datasets. 
For example, scala> val rdd1 = sc.parallelize(-5 to 5) rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[171] at parallelize at :24 scala> val rdd2 = sc.parallelize(1 to 10) rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[172] at parallelize at :24 scala> rdd1.intersection(rdd2).collect res119: Array[Int] = Array(4, 1, 5, 2, 3) distinct() Returns new dataset with distinct elements. For example, we don't have duplicate orange now. scala> val rdd = sc.parallelize(List("apple","orange","grapes","mango","orange")) rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[186] at parallelize at :24 scala> rdd.distinct.collect res121: Array[String] = Array(grapes, orange, apple, mango) Due to some technical issues I had to move some content of this page to other area. Please refer this for remaining list of transformations. Sorry for the inconvenience guys. groupByKey() reduceByKey() aggregateByKey() sortByKey() join() cartesian() coalesce() repartition() Now, as said earlier, RDDs are immutable so you can't change original RDD but you can always create a new RDD with spark transformations like map, flatmap, filter, groupByKey, reduceByKey, mapValues, sample, union, intersection, distinct, sortByKey etc. RDDs transformations are broadly classified into two categories - Narrow & Wide transformation. In narrow transformation like map & filter, all the elements that are required to compute the records in single partition live in the single partition of parent RDD. A limited subset of partition is used to calculate the result. In wide transformation like groupByKey and reduceByKey, all the elements that are required to compute the records in the single partition may live in many partitions of parent RDD. The partition may live in many partitions of parent RDD. Spark Actions When you want to work on actual dataset, you need to perform spark actions on RDDs like count, reduce, collect, first, takeSample, saveAsTextFile etc. Transformations are lazy in nature i.e. nothing happens when the code is evaluated. Meaning actual execution happens only when code is executed. RDDs are computed only when an action is applied on them. Also called as lazy evaluation. Spark evaluates the expression only when its value is needed by action. When you call an action, it actually triggers transformations to act upon RDD, dataset or dataframe. After that RDD, dataset or dataframe is calculated in memory. In short, transformations will actually occur only when you apply an action. Before that it’s just line of evaluated code :) Below is the list of Spark actions. reduce() It aggregates the elements of the dataset. For example, scala> val rdd = sc.parallelize(1 to 15).collect rdd: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15) scala> val rdd = sc.parallelize(1 to 15).reduce(_ + _) rdd: Int = 120 scala> val rdd = sc.parallelize(Array("Hello", "Dataneb", "Spark")).reduce(_ + _) rdd: String = SparkHelloDataneb scala> val rdd = sc.parallelize(Array("Hello", "Dataneb", "Spark")).map(x =>(x, x.length)).flatMap(l=> List(l._2)).collect rdd: Array[Int] = Array(5, 7, 5) scala> rdd.reduce(_ + _) res96: Int = 17 scala> rdd.reduce((x, y)=>x+y) res99: Int = 17 collect(), count(), first(), take() Collect returns all the elements of the dataset as an array. 
For example scala> sc.parallelize(1 to 20, 4).collect res100: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20) Counts the number of elements scala> sc.parallelize(1 to 20, 4).count res101: Long = 20 First returns the first element scala> sc.parallelize(1 to 20, 4).first res102: Int = 1 Take returns the number of elements you pass as argument scala> sc.parallelize(1 to 20, 4).take(5) res104: Array[Int] = Array(1, 2, 3, 4, 5) takeSample() It returns the random sample of size n. Boolean input is for with or without replacement. For example, scala> sc.parallelize(1 to 20, 4).takeSample(false,4) res107: Array[Int] = Array(15, 2, 5, 17) scala> sc.parallelize(1 to 20, 4).takeSample(false,4) res108: Array[Int] = Array(12, 5, 4, 11) scala> sc.parallelize(1 to 20, 4).takeSample(true,4) res109: Array[Int] = Array(18, 4, 1, 18) takeOrdered() It returns the elements in ordered fashion. For example, scala> sc.parallelize(1 to 20, 4).takeOrdered(7) res117: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7) Just opposite to top() action scala> sc.parallelize(1 to 20, 4).top(7) res118: Array[Int] = Array(20, 19, 18, 17, 16, 15, 14) countByKey() It takes (key, value) pair and returns (key, count of key). For example, scala> sc.parallelize(Array("Apple","Banana","Grapes","Oranges","Grapes","Banana")).map(k=>(k,1)).countByKey() res121: scala.collection.Map[String,Long] = Map(Grapes -> 2, Oranges -> 1, Banana -> 2, Apple -> 1) saveAsTextFile() It saves the dataset as text files in local directory or HDFS etc. You can reduce the number of files by coalesce transformation. scala>sc.parallelize(Array("Apple","Banana","Grapes","Oranges","Grapes","Banana")).saveAsTextFile("sampleFruits.txt") // Just one partition file with coalesce scala>sc.parallelize(Array("Apple","Banana","Grapes","Oranges","Grapes","Banana")).coalesce(1).saveAsTextFile("newsampleFruits.txt") saveAsObjectFile() It writes the data into simple format using Java serialization and you can load it again using sc.objectFile() scala> sc.parallelize(List(1,2)).saveAsObjectFile("/Users/Rajput/sample") foreach() It is generally used when you want to carry out some operation on output for each element. Like loading each element into database. scala> sc.parallelize("Hello").collect res139: Array[Char] = Array(H, e, l, l, o) scala> sc.parallelize("Hello").foreach(x=>println(x)) l H e l o // Output order of elements is not same every time scala> sc.parallelize("Hello").foreach(x=>println(x)) H e l o l Spark Workflow In this section you will understand how Spark program flows, like how you create intermediate RDDs and apply transformations and actions. You first create RDDs with parallelize method or referencing external dataset. Apply Transformations to create new RDDs based on your requirement. You will have list of RDDs called Lineage. Apply Actions on RDDs. Get your Result. Transformations & Actions example Let's try to implement above facts with some basic example which will give you more clear picture. Open spark-shell with below command in your terminal (refer mac/windows if you don't have spark installed yet). ./bin/spark-shell You can see SparkContext automatically created for you with all (*) local resources and app id in above screenshot. You can also check spark context by running sc command. res0 is nothing but result set zero for command sc. We already read about SparkContext in previous blog. 1. Create RDD, let's say by parallelize method with number of partitions 2. 
This RDD is basically a list of characters distributed across 2 partitions.

2. Now, you can either apply a transformation to create a new RDD (building the lineage) or simply apply an action to show the result. Let's first apply a few actions. res1 to res5 show the result of each action - collect, first, count, take, reduce, saveAsTextFile. Note (lazy evaluation) that when you execute an action, Spark does the actual evaluation to bring the result. Now let's look at the sample.csv file, which is the result of the last action. Remember we created 2 partitions in the first step; that's the reason we have 2 files with an equal share of the data, part-00000 & part-00001.

3. Now let's try to apply a few transformations in order to build the RDD lineage. Refer to the screenshot above. In the first step we applied a filter transformation to filter the character 'a', creating a new RDD called MapPartitionsRDD[2] from our initial RDD ParallelCollectionRDD[0]. Similarly, in the third step we filtered the letter 'x' to create another RDD, MapPartitionsRDD[3]. In the last step we used the map & reduceByKey transformations to group the characters and get their counts, generating a new RDD, ShuffledRDD[5]. As we applied 2 transformations on one RDD, i.e. map and reduceByKey, you will notice RDD[4] is missing. Spark internally creates the intermediate RDD[4] in order to generate the resultant ShuffledRDD[5], and it is not printed in the output. ParallelCollectionRDD[0], MapPartitionsRDD[2], MapPartitionsRDD[3], RDD[4] and ShuffledRDD[5] together form the lineage - the chain of intermediate RDDs which Spark needs in order to evaluate your next action.

4. Now, you can notice that res7, res8 and res9 are nothing but actions which we applied on the lineage RDDs to get the results.

Thank you!! If you really liked the post, or you have any question, please don't forget to write in the comments section below.

Next: Loading data in Apache Spark

Navigation menu

1. Apache Spark and Scala Installation
1.1 Spark installation on Windows
1.2 Spark installation on Mac
2. Getting Familiar with Scala IDE
2.1 Hello World with Scala IDE
3. Spark data structure basics
3.1 Spark RDD Transformations and Actions example
4. Spark Shell
4.1 Starting Spark shell with SparkContext example
5. Reading data files in Spark
5.1 SparkContext Parallelize and read textFile method
5.2 Loading JSON file using Spark Scala
5.3 Loading TEXT file using Spark Scala
5.4 How to convert RDD to dataframe?
6. Writing data files in Spark
6.1 How to write single CSV file in Spark
7. Spark streaming
7.1 Word count example Scala
7.2 Analyzing Twitter texts
8. Sample Big Data Architecture with Apache Spark
9. What's Artificial Intelligence, Machine Learning, Deep Learning, Predictive Analytics, Data Science?
10. Spark Interview Questions and Answers

  • MuleSoft Anypoint Studio Download

In this blog, you will download the latest version of Anypoint Studio and install it. Before downloading Anypoint Studio, a few prerequisites need to be in place.

JDK Download

Before you download the JDK, understand the difference between JRE, JDK & JVM. This is necessary as you might see several versions of Java already running on your machine. JRE is the Java Runtime Environment, which fulfills the basic requirements needed to run any Java application on your machine. It contains a JVM (Java Virtual Machine) and the other classes required to support the JRE and run an application. JDK (Java Development Kit) is a full-fledged software development kit used to develop Java applications. The JDK includes the JRE, and thus a virtual machine (JVM) as well, which is needed to develop, execute and test any Java application. JDK > JRE > JVM > objects, classes & other needed files.

Now, check if Java is already installed on your machine. For Windows or Mac, run this command on the command prompt/terminal:

java -version

You will get output like this

java version "9.0.1"
Java(TM) SE Runtime Environment (build 9.0.1+11)
Java HotSpot(TM) 64-Bit Server VM (build 9.0.1+11, mixed mode)

or something like this

java version "1.8.0_341"
Java(TM) SE Runtime Environment (build 1.8.0_341-b25)
Java HotSpot(TM) 64-Bit Server VM (build 25.341-b25, mixed mode)

What's the difference between the two examples above? The first shows that JDK 9 was installed in the past, and the second shows that a JRE (Java 8) was installed in the past. Now I think you are clear on the difference. Most probably you will already have a JRE running on your machine, as it's needed to run several other Java applications (for execution, not development), but that does not mean you don't need the JDK. For the Anypoint Studio installation you need the JDK.

Download & Install JDK

Download the latest version of the JDK - https://www.oracle.com/java/technologies/downloads/

Install the JDK, let's say the stable version, which is JDK 17 at this point in time. Run the java -version command again to validate the installation; you might still see the old/existing version here. If you already had Java (JRE or JDK) installed on your machine earlier, go to your program files and check whether the JDK got installed (OSDisk (C:) > Program Files > Java - you might see multiple JDK/JRE folders there, which is expected and okay).

Configuring User Environment Variable for JDK

Once the JDK is installed, you need to add a path variable to the environment user variables so that we can tell Anypoint Studio which Java executable to use.

Search 🔎 "environment variables" on your Windows machine, which will open System Properties > Advanced > Environment Variables.
Click on Environment Variables > New (User variable, not System variable).
Define the new user variable with Variable name: JAVA_HOME & Variable value: C:\Program Files\Java\jdk-17.0.4 and press OK. Copy the path from your local machine (OSDisk (C:) > Program Files > Java); do not just copy and paste the path above, that's just an example.

If you plan to install a newer version of the JDK, you can simply update the JAVA_HOME variable.
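As an optional sanity check, any JVM REPL (for example the Scala REPL used elsewhere on this site) can print the Java version and JAVA_HOME that the current process actually sees; the version shown in the comment is just an example.

// Print the Java runtime the current JVM is using and the JAVA_HOME environment variable
println(System.getProperty("java.version"))   // e.g. 17.0.4
println(System.getProperty("java.home"))      // install directory of the running JVM
println(Option(System.getenv("JAVA_HOME")).getOrElse("JAVA_HOME is not set"))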
If there is no "path" variable then create a New > Variable name: path & Variable value: %JAVA_HOME%\bin If there is an existing "path" variable (most probably you will have this) > Select "path" variable > Edit > New > and enter value as %JAVA_HOME%\bin Reboot your system and run the path command on your command prompt and that should give you JDK 17 (or whichever version you installed) Example output C:\Users\Dataneb>path PATH=c:\Program Files\Java\jdk-17.0.4\bin; Download Anypoint Studio In order to download Anypoint Studio, go to https://www.mulesoft.com/lp/dl/studio, or if you want to download any previous versions go to https://www.mulesoft.com/lp/dl/studio/previous. Fill out the form and download it. Step 1 Sample downloaded file name: AnypointStudio-7.13.0-win64.zip Create a new folder under OSDisk(C:) > Mulesoft Extract the downloaded file. Right-click on downloaded file > Click extract all > Browse > Select the folder C:/Mulesoft > Extract Final path OSDisk(C:) > Mulesoft > AnypointStudio > configuration/ features/ license / etc.. Step 2 In order to tell AnypointStudio which JDK to use, you need to update the AnypointStudio.ini file. Navigate to path OSDisk(C:) > Mulesoft > AnypointStudio and edit AnypointStudio.ini file in notepad. Uncomment the following 2 lines to configure a specific JDK path, and save and close the file. Example -vm C:\Program Files\Java\jdk-17.0.4\bin\javaw.exe Note the difference between java.exe and javaw.exe, both are java executables on windows platform. Java.exe is the console app while Javaw.exe is console-less. Step 3 Run AnypointStudio.exe
