

  • Hidden iPhone Tips & Tricks That You Should Know

    Apple's iPhone has so many features that it's impossible to use all of them, but here are a few hidden iOS tips and tricks you probably didn't know existed.
    1. Check Your iPhone Battery Health. For iPhone 6 and later on iOS 11.3 and later, a new feature displays battery health status and recommends whether the battery needs to be replaced. You can find it under Settings > Battery > Battery Health. Tap Battery Health to check your battery status. Maximum capacity measures the device's battery capacity relative to when it was new. It starts at 100% and decreases over time as the battery degrades. Check out this for complete details.
    2. Apple's iPhone Battery Replacement - Just $29. Apple is offering discounted battery replacement (just $29) for all eligible iPhone 6 and later models. Refer to the following chart. Visit an Apple Store near you or book an appointment online here to replace your battery if required.
    3. Guided Access. It keeps the iPhone in single-app mode and lets you control which features are available to the secondary user. This feature is very useful when someone like your little cousin borrows your iPhone just to play a game; in that case you can grant access to the game app only. Open the Settings app and tap General > Accessibility > Guided Access as shown below. Tap Guided Access so its slider turns green, and turn on Accessibility Shortcut as well. To start Guided Access, triple-click the Home button in the app you want to share.
    4. Live iPhone Screen Recording. You can start a live recording of your iPhone screen with a single tap. First enable it: open the Settings app > Control Center > Customize Controls > add Screen Recording. Then turn on screen recording by tapping the button below.
    5. Smart Invert Colors (handy when using the iPhone late at night). This feature inverts the colors (black to white and vice versa) and is mostly used while reading or writing. Open the Settings app and tap General > Accessibility > Display Accommodations > Invert Colors, then tap Smart Invert Colors so its slider turns green. It will appear like below. Before (Invert Colors: OFF) After (Invert Colors: ON). Invert Colors shortcut: you can toggle Invert Colors simply by triple-clicking the Home button. To set it up, open the Settings app, tap General > Accessibility > Accessibility Shortcut, and check Smart Invert Colors as shown below. Now just triple-click your Home button to turn Invert Colors on or off.
    6. Hide Your Photos. Open the Photos app, select all the photos you want to hide, tap Share, and select Hide.
    7. Emergency SOS (highly recommended). Rapidly click the Sleep/Wake button five times to quickly call emergency services in case of health issues, theft, or threat. Open the Settings app > Emergency SOS (turn the slider green).
    8. Automatically Offload Unused Applications. Apple automatically uninstalls apps you don't use while keeping the app data in case you want to reinstall them later. This feature is very useful when you are running short on iPhone storage. Go to the Settings app > General > iPhone Storage > Offload Unused Apps (enable this by turning the slider green).
    9. Automatically Turn On "Do Not Disturb" While Driving. Go to the Settings app > Do Not Disturb > Do Not Disturb While Driving (enable this by turning the slider green).
    10. Turn Your Live Photo Into a GIF. Open the Photos app and choose any Live Photo, then simply slide up to reveal hidden edit features: Bounce, Loop & Long Exposure.
    Thank you!!
    If you enjoyed this post, I'd be very grateful if you'd help it spread by sharing it with a friend or on Google or Facebook.

  • Road trip from New York to Los Angeles: The Epic New York to LA Drive

    The New York to Los Angeles road trip has something for everyone, whether you are traveling with family, friends, or in a group. Driving from the East Coast to the West Coast (or the reverse) takes approximately 42 hours if you drive non-stop, covering a distance of 2,790 miles. The trip can span anywhere from 4 to 7 days. Thanks to the Dataneb team for publishing this article. Driving from New York to Los Angeles is one of the best cross-country road trips in the United States, passing through some of the country's most famous tourist attractions, national parks, and scenic routes. This road trip is all about fun and enjoyment, not rushing through the drive.
    Table of Contents:
    New York to LA Drive
    New York to Los Angeles Drive (Routes)
    Route 1 (via I-80 W)
    Route 2 (via I-70 W and I-40 W)
    New York to California Drive
    Day 1: New York to Chicago, IL (11 hours, 745 miles via I-80)
    Day 2: Chicago, IL to Omaha, NE (7.5 hours, 466 miles via I-80)
    Day 3: Omaha to the Rocky Mountains, CO (8 hours, 568 miles via I-76)
    Day 4: Sand Dunes, CO to Page, AZ (7 hours, 426 miles via US-160)
    Day 5: Page, AZ to Las Vegas, NV (4 hours, 272 miles via US-89 & I-15)
    Day 6: Las Vegas, NV to Los Angeles, CA (4 hours, 270 miles via I-15)
    Road Trip Preparation Checklist
    New York to Los Angeles Drive (Routes)
    There are a couple of routes for the New York to Los Angeles drive, both of which cover a distance of approximately 2,800 miles and take around 42 hours. One route for the New York to LA drive is via I-80 W (the blue line on the map above), and the second route is via I-70 W and I-40 W (the bottom grey line on the map above). The two routes pass through different states, which can help you decide which one to take.
    Route 1 (via I-80 W)
    Over 7 days, you will drive through 12 states (13 if you include optional Arizona): New York, Pennsylvania, Ohio, Indiana, Illinois, Iowa, Nebraska, Colorado, New Mexico, Utah, Nevada, Arizona (optional), and California. I extended my trip to include Arizona because I love its vivid landscape, which includes Antelope Canyon, the Grand Canyon, Monument Valley, and Horseshoe Bend. Below is a picture of Horseshoe Bend in Arizona; I truly admire this natural wonder.
    Route 2 (via I-70 W and I-40 W)
    The second option is to drive via I-70 W and I-40 W, passing through New York, Pennsylvania, Ohio, Indiana, Illinois, Missouri, Oklahoma, Texas, New Mexico, Arizona, and California. A road trip is always enjoyable, especially for those who love driving, and it becomes even more memorable when you're traveling in a group. Unfortunately, I was the sole driver for the entire trip, but I wasn't alone; my wife kept me awake with her jokes, albeit not very good ones. ;) You will cross four different time zones (EST, CST, MST, PST) during what is perhaps one of the longest road trips you can imagine in the United States, and I assure you, you'll love it. I've covered the hotels, locations, routes, and driving hours I followed. Feel free to make adjustments if you have other plans.
    New York to California Drive
    My initial plan for the New York to California drive was very flexible. Feel free to add or subtract days between destinations based on your preferences. I recommend avoiding advance hotel bookings to maintain flexibility and the option to change your plans. Who cares about the hotel; just find one on Google with good ratings, and you'll be fine. After such a long, tiring drive, all you need is a comfy bed.
    Day 1: New York to Chicago, IL
    Day 2: Explore Chicago, IL
    Day 3: Chicago, IL to Omaha, NE
    Day 4: Omaha, NE to the Rocky Mountains, CO
    Day 5: Explore the Rocky Mountains, CO
    Day 6: Rocky Mountains, CO to Page, AZ
    Day 7: Explore Page, AZ
    Day 8: Page, AZ to Las Vegas, NV
    Day 9: Explore Las Vegas, NV
    Day 10: Las Vegas, NV to Los Angeles, CA
    Days 10-11: Explore Los Angeles and San Diego, CA
    Day 1: New York to Chicago, IL (11 hours, 745 miles via I-80)
    An eleven-hour New York to Chicago drive is quite long, isn't it? Yes, but I planned to drive the maximum distance on the first day because I had plenty of energy, and stopping in Chicago was a good idea. However, if an eleven-hour drive is too much for you, you can split it into two days, with a 7-hour drive followed by a 5-hour drive, making a stop in Cleveland, OH. If you have some time in the evening, you can visit the Rock and Roll Hall of Fame. It's located on the shore of Lake Erie in downtown Cleveland, and it's a beautiful place. I had visited Cleveland a couple of times before, so there wasn't much left for me to explore there. Anyway, I stayed at Cool Springs Inn in Michigan City, approximately an hour's drive from Chicago, for a couple of reasons. The first reason was the cost; hotels in Chicago were too expensive, approximately 3-4 times what I paid in Michigan City ($45). Secondly, Chicago is only about an hour's drive from this place, so you can wake up early in the morning and explore Chicago if you want. You can visit these places in Chicago: Navy Pier, Willis Tower, Cloud Gate, the John Hancock Center, Shedd Aquarium, and the Art Institute of Chicago. There is a lot to do in Chicago; in fact, a day is not enough. Consider getting a city pass, try to explore as much as you can, and extend your stay if needed. At the end of the day, you can either drive back to Cool Springs or book another hotel nearby. The Palmer House is a good option if you want to spend more time in downtown Chicago; I have been there a couple of times. It's a bit pricier at around $150 per night, and you have to pay extra for parking, approximately $50, but the place is awesome. I paid this price in 2018, so it has probably changed since then.
    Day 2: Chicago, IL to Omaha, NE (7.5 hours, 466 miles via I-80)
    Well, I didn't stop at Omaha (Horizon Inn Motel) as per my initial plan. Instead, I drove an additional 8 hours to reach the Rocky Mountains, CO, which was originally my Day 3 stop. It might sound crazy, but I drove for over 15 hours to reach Colorado. I was particularly interested in hiking and exploring Colorado, so by adding an extra 8 hours of driving, I effectively saved a day for Colorado. Omaha offers plenty to do if you stick to the initial plan. If you have an interest in zoos and America's largest indoor rainforest, visit the Henry Doorly Zoo and Aquarium, which features an incredible indoor desert and rainforest.
    Day 3: Omaha to the Rocky Mountains, CO (8 hours, 568 miles via I-76)
    Colorado is too vast to explore in a single day, so I stayed there for a couple of days, with my first stop at Coyote Mountain Lodge. I began the day with a hike in the Rocky Mountains and then drove through the Garden of the Gods. On the second day, I stopped at Estes Park and visited the Royal Gorge Bridge (shown below). Later in the evening, we went to Great Sand Dunes National Park. I haven't seen landscapes like those at Great Sand Dunes in my entire life. You'll find snowy mountains, deserts, lakes, and lush green forests all in the same place. It's truly mesmerizing!
    Day 4: Sand Dunes, CO to Page, AZ (7 hours, 426 miles via US-160)
    Arizona is also too vast to explore in one day. However, you can visit Antelope Canyon and Horseshoe Bend, and drive through Zion National Park or the Grand Canyon. The Grand Canyon itself requires a couple of days if you want to explore all its corners. If you want to save some time, you can consider a helicopter tour from Las Vegas. Here's a random picture from Zion National Park. It's a stunning place, so don't miss the chance to drive through if you're nearby. Antelope Canyon shines bright orange in the midday sunlight, so around noon is a good time to visit. You'll need to book tickets for this.
    Day 5: Page, AZ to Las Vegas, NV (4 hours, 272 miles via US-89 & I-15)
    Las Vegas is self-explanatory; I don't think this place needs any introduction. You can try different foods and drinks, enjoy a stroll along the streets, experience the nightlife, enjoy rides, and explore the city. Saturday night is the most popular time, so plan your trip accordingly.
    Day 6: Las Vegas, NV to Los Angeles, CA (4 hours, 270 miles via I-15)
    You can cover this distance on the 5th day itself if you don't like Vegas. But I believe that's not the case; no one wants to skip Vegas. Southern California has tons of things to do, including Universal Studios, the San Diego Zoo, the Griffith Observatory, the Santa Monica Pier, and plenty of beautiful beaches. One of my favorites is Potato Chip Rock, a short mountain hike in San Diego County. Here's a picture.
    Road Trip Preparation Checklist
    Ensure your vehicle is fully serviced before planning this trip, including an engine oil change and a check of tire condition, lights, brakes, etc. I wasted many hours on an oil change in Arizona, so be cautious.
    Avoid booking a hotel in advance, but try to make a reservation before 3:00 p.m. if you decide to book on the same day to avoid higher costs. I had to cancel one booking due to a change in plans.
    Keep extra blankets and pillows in your car in case you need to rest.
    Avoid overloading your vehicle; always leave extra space for yourself.
    Take regular breaks while driving; I took a break every 3-4 hours of driving. Try not to drive more than 8 hours per day, although I didn't follow this rule myself.
    Keep warm clothes (jackets) in your car, as the weather can change significantly during such a long-distance tour. For example, when I started the trip it was 25 degrees Celsius in New York, but it dropped to 2 degrees Celsius when I reached Colorado. This may differ depending on the season you travel in.
    Keep an ample supply of water and food in your car.
    Carry a tire inflator in your car; you can find one on Amazon for around $25.
    Avoid driving between 6 pm and 7 pm because of the sunset; since you'll be driving west, you'll face the sun every evening.
    Most importantly, enjoy your trip and take your time; there's no need to rush!
    Related Posts ✔ Apply for US tourist visa ✔ Things to know about US culture

  • Adsense Alternatives for Small Websites

    Did your Google AdSense application get rejected? If the answer is yes, you're in the right place. Well, it's not the end of the world, and no doubt there are several alternatives to Google AdSense for small websites (hosted on Wix, WordPress, Blogger, etc.). But the biggest question is: which one should you go for? This is my weekly report for an AdSense account (for a one-month-old website) other than dataneb.com. Still thinking?
    Why did your Google AdSense application get rejected?
    Simply put, it's usually because of poor content on the website, low traffic, site unavailability during verification, a policy violation, and so on. It could be due to one or more of the following reasons:
    Not Enough Content
    Website Design & Navigation
    Missing About Us or Contact Page
    Low Traffic
    New Website
    Poor Quality Content
    Language Restriction
    Number of Posts
    Number of Words per Post
    Using a Free Domain
    There could be any number of factors, and even after spending several hours on research you may never get the answer. So don't get upset; as I said, it's not the end of the world, and there are several alternatives to Google AdSense. However, there is no doubt that Google AdSense provides the easiest and best monetization method for earning a steady income from your blogs. So if you have a Google AdSense account, use it wisely.
    My rejection reason (but I finally got approved) - I thought this might help others: I was providing the wrong URL. Yeah, I know it's funny and... a silly mistake. Make sure you provide the correct website address when submitting the Google AdSense request form. My initial two requests were denied because I entered the wrong website URL. The first time I entered http://dataneb.com and the second time I entered https://dataneb.com. The correct name was https://www.dataneb.com. Yeah, I understand it's a silly mistake, but this is what it is. You need to enter the URL correctly, otherwise Google's crawlers will never read your website. And the proud moment:
    Another reason which is very common but rarely mentioned is Google AdSense bots. Yes, that's true: AdSense does not review each request and each website's content manually; its advanced bots perform the hard task. The problem arises when content is generated via JavaScript/AJAX-style retrieval; since most crawlers, including AdSense's, do not execute JavaScript, the crawler never sees your website content and therefore treats your site as having no content. You will often face this issue with websites built on platforms like Wix.
    Whatever the reason, I would suggest that instead of wasting time and money on your existing traffic, you move forward with other alternatives, which have a very easy approval process and will generate a similar amount of revenue.
    Google AdSense approval time? Usually it's within 48 hours, but sometimes longer depending upon the quality of your website. If you don't get approval in the first couple of requests, trust me, you are stuck in an infinite loop of wait time. I was a little lucky in this case; it took 24 hours for me to get the final approval after a couple of issues.
    Maximum number of AdSense units you can place on each page? There is no restriction now; previously the limit was just 3.
    How to integrate ads with your website? Just add the HTML code provided by these systems to an HTML widget anywhere on the page where you want to show the ads.
    Google AdSense Alternatives
    I am not going to list the top 5, 10, or 20 and confuse you more. Instead, I am recommending just 3 based on my personal experience, ease of approval, and revenue, which helped me grow my business.
    I have used them, and I am still using them (apart from Google AdSense) for my other websites. So, let's meet our top 3 alternatives to Google AdSense. Before these 3 alternatives, I would suggest you try the Amazon Affiliates program if you don't have much traffic. However, Amazon does not pay you for clicks or impressions; it pays only when a sale happens.
    1. Media.net (BEST alternative after AdSense)
    Approval time - a few hours
    No limitation on the number of ad units per page
    No hidden fees
    It's also known as Yahoo/Bing advertising and provides contextual ads; it holds rank 2 in contextual ads
    No minimum traffic requirement
    Unlike AdSense, where you have the option to choose image ads, here you have only textual ads
    Supports mobile ads
    Further, you can change the size, color, and shape of the ad unit according to your convenience
    Monthly payment via PayPal ($100 minimum)
    2. Infolinks (It's good..)
    Approval time - a few hours
    It's very simple to integrate with your website
    It's open to any publisher - small, medium, or large scale
    No fees, no minimum requirements for traffic, page views, or visitors, and no hidden commitments
    The best part is that Infolinks doesn't require space on your blog; it simply converts keywords into advertisement links, so when users hover their mouse over specific keywords it automatically shows advertisements
    It provides in-text advertising and pays you per click on ads, not per impression
    3. Chitika (It's okay)
    Approval time - a few minutes
    Language restriction - English only
    No minimum traffic requirement
    No limitation on the number of ads per page
    Payment via PayPal ($10 minimum) or by check ($50 minimum)
    It targets ads based on visitor location, so if your posts are location specific, this is recommended for you
    Limitations on custom ad sizes
    Similar to AdSense; image quality - medium
    Conclusion
    If you have a Google AdSense account, use it wisely. If not, move ahead with Media.net. I would suggest just using Media.net; don't overcrowd your good-looking website with tons of different types of ads. Thank you. If you have any questions for me, please comment below. Good luck!

  • Enable Root User on Mac

    By default, the root user is disabled on a Mac. Follow the steps below to enable, disable, or change the password for the root user.
    1. From the top-left corner, choose Apple menu > System Preferences, then click Users & Groups (or Accounts).
    2. Click the lock icon, then enter an administrator name and password.
    3. After you unlock the lock, click Login Options, right next to the home icon.
    4. Now click Join (or Edit), right next to Network Account Server, then click Open Directory Utility.
    5. Click the lock icon in the Directory Utility window, then enter an administrator name and password.
    6. From the menu bar in Directory Utility, choose Edit > Enable Root User, then enter the password that you want to use for the root user. You can enable, disable, or change the password for the root user from here.
    7. Now go to Terminal, switch user to root, and test. A command-line alternative is also sketched at the end of this post.
    Rajas-MacBook-Pro: Rajput$ su root
    Password:
    Thank you!! If you enjoyed this post, I'd be very grateful if you'd help it spread by emailing it to a friend, or sharing it on Google or Facebook. Refer to the links below. Also click the "Subscribe" button in the top right corner to stay updated with the latest posts. Your opinion matters a lot; please comment if you have any suggestions for me. #enable #root #user #Mac
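    If you prefer the command line, the same thing can be done from Terminal. This is a minimal sketch, assuming your macOS version ships the dsenableroot utility (run it from an administrator account); the System Preferences route above works either way.
    ```bash
    # Enable the root user from Terminal (prompts for your admin password,
    # then asks you to set and verify a root password)
    dsenableroot

    # Disable the root user again later
    dsenableroot -d
    ```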

  • How to Install Apache HTTP Server on CentOS

    Apache is one of the most popular HTTP servers, running on Linux- and Windows-based operating systems. Let's see how to install and configure the Apache HTTP web server on CentOS.
    1. First, update the yum packages:
    sudo yum -y update
    2. Next, install the Apache HTTP server:
    sudo yum install httpd
    3. Start and enable the HTTP server (so it starts automatically on server reboot):
    [centos@ ~]$ sudo systemctl start httpd
    [centos@ ~]$ sudo systemctl enable httpd
    Created symlink from /etc/systemd/system/multi-user.target.wants/httpd.service to /usr/lib/systemd/system/httpd.service.
    4. Now check the status of the Apache server:
    [centos@ ~]$ sudo systemctl status httpd
    httpd.service - The Apache HTTP Server
    Loaded: loaded (/usr/lib/systemd/system/httpd.service; enabled; vendor preset: disabled)
    Active: failed (Result: exit-code) since Thu 2018-08-02 18:32:00 UTC; 6min ago
    Docs: man:httpd(8) man:apachectl(8)
    Main PID: 10235 (code=exited, status=0/SUCCESS)
    Aug 02 18:32:00 dataneb.com systemd[1]: Starting The Apache HTTP Server...
    Aug 02 18:32:00 dataneb.com httpd[10235]: httpd (pid 10202) already running
    Aug 02 18:32:00 dataneb.com kill[10237]: kill: cannot find process ""
    Aug 02 18:32:00 dataneb.com systemd[1]: httpd.service: control process exited, code=exited status=1
    Aug 02 18:32:00 dataneb.com systemd[1]: Failed to start The Apache HTTP Server.
    Aug 02 18:32:00 dataneb.com systemd[1]: Unit httpd.service entered failed state.
    Aug 02 18:32:00 dataneb.com systemd[1]: httpd.service failed.
    5. If the server does not start, disable SELinux on CentOS:
    [centos@]$ cd /etc/selinux
    [centos@]$ sudo vi config
    # This file controls the state of SELinux on the system.
    # SELINUX= can take one of these three values:
    # enforcing - SELinux security policy is enforced.
    # permissive - SELinux prints warnings instead of enforcing.
    # disabled - No SELinux policy is loaded.
    # SELINUX=enforcing
    SELINUX=disabled
    # SELINUXTYPE= can take one of three values:
    # targeted - Targeted processes are protected,
    # minimum - Modification of targeted policy. Only selected processes are protected.
    # mls - Multi Level Security protection.
    SELINUXTYPE=targeted
    6. Reboot the system to make the SELinux changes effective:
    [centos@ ~]$ sudo reboot
    debug1: channel 0: free: client-session, nchannels 1
    Connection to xx.xxx.xxx.xx closed by remote host.
    Connection to xx.xxx.xxx.xx closed.
    Transferred: sent 16532, received 333904 bytes, in 1758.1 seconds
    Bytes per second: sent 9.4, received 189.9
    debug1: Exit status -1
    7. Now check the status of the Apache server again:
    [centos@ ~]$ sudo systemctl status httpd
    httpd.service - The Apache HTTP Server
    Loaded: loaded (/usr/lib/systemd/system/httpd.service; enabled; vendor preset: disabled)
    Active: active (running) since Thu 2018-08-02 18:40:18 UTC; 35s ago
    Docs: man:httpd(8) man:apachectl(8)
    Main PID: 855 (httpd)
    Status: "Total requests: 0; Current requests/sec: 0; Current traffic: 0 B/sec"
    CGroup: /system.slice/httpd.service
    ├─855 /usr/sbin/httpd -DFOREGROUND
    ├─879 /usr/sbin/httpd -DFOREGROUND
    ├─880 /usr/sbin/httpd -DFOREGROUND
    ├─881 /usr/sbin/httpd -DFOREGROUND
    ├─882 /usr/sbin/httpd -DFOREGROUND
    └─883 /usr/sbin/httpd -DFOREGROUND
    Aug 02 18:40:17 dataneb.com systemd[1]: Starting The Apache HTTP Server...
    Aug 02 18:40:18 dataneb.com systemd[1]: Started The Apache HTTP Server.
    8. Configure firewalld (CentOS's default firewall blocks Apache traffic):
    [centos@ ~]$ firewall-cmd --zone=public --permanent --add-service=http
    [centos@ ~]$ firewall-cmd --zone=public --permanent --add-service=https
    [centos@ ~]$ firewall-cmd --reload
    9. Test your URL by entering the Apache server's IP address in your local browser. A quick command-line check is also sketched below.
    Thank you!! If you enjoyed this post, I'd be very grateful if you'd help it spread by emailing it to a friend, or sharing it on Google or Facebook. Refer to the links below. Also click the "Subscribe" button in the top right corner to stay updated with the latest posts. Your opinion matters a lot; please comment if you have any suggestions for me. #install #apache #httpserver #centos #howtoinstallapacheoncentos
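    Once the firewall rules are reloaded, a quick sanity check can also be done from the server itself before testing in a browser. This is a minimal sketch; the exact HTTP status you see may vary (a fresh CentOS install typically serves the Apache welcome page until you add your own content).
    ```bash
    # Check that httpd is answering locally (any HTTP response means it's serving)
    curl -I http://localhost/

    # Confirm the http/https services were actually added to the public zone
    sudo firewall-cmd --zone=public --list-services
    ```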

  • How to Pull Data from Oracle IDCS (Identity Cloud Services) Rest API

    Oracle IDCS has various REST APIs that can be used to pull data, which you can then use for data analytics. Let's see how we can pull data using simple shell scripts.
    Table of Contents:
    Oracle IDCS Rest API
    Parameter File
    Main Shell Script
    Check if the parameter file exists
    Create basic token
    Function to regenerate the token
    Testing token validity
    Pull totalResults count
    Loop to pull the records
    Formatting the JSON output
    Summary
    This Bash script automates the retrieval and processing of audit-events data from an Oracle IDCS REST API endpoint. It begins by checking for the existence of a parameter file, "param.txt", and sets environment variables accordingly. Using these variables, it generates a basic authentication token and validates its integrity. The script then retrieves the total number of qualified records from the API endpoint and fetches the audit-events data in paginated batches. Each batch is reformatted and saved into JSON files with timestamps. Finally, it iterates through each batch until all records are retrieved and cleans up temporary files upon completion, providing a seamless and efficient workflow for managing audit events.
    Parameter File
    Step 1: Create a parameter file "param.txt" which will contain the customer ID, customer secret, and organization URL (each on a new line). You can leave the environment name as it is. Please note that the values below are just dummy values to show how you need to create the param file. Validate with Postman that your keys work properly before running the script.
    scripts> pwd
    /home/hadoop/scripts
    scripts> more param.txt
    CID= 61rgrjk5869bjrvrb9999rbre20
    CSEC= 01rgt-atbt-4956-9e77-15rjb74756nr64
    ORG= https://idcs-9bbrtj756bjer8gbk753gbvj8f7eh3.identity.oraclecloud.com
    ENV= QUAL
    Main Script
    Step 2: At the same location, create the shell script for pulling data in JSON format. A brief description is given before each section of code. You can name your shell script anything; just make sure the permissions allow execution. I kept them as 755.
    Check if the Parameter File Exists
    This part checks for the presence of a parameter file named "param.txt" and reads specific lines from it to determine the environment. Based on the environment, it prints a corresponding welcome message or indicates an invalid selection.
    #!/bin/bash
    [ -f ./param.txt ] && echo "Parameter file is present" || echo "Parameter file not found!! Create param.txt with CID, CSEC, ORG, and ENV details."
    ENV=`head -8 ./param.txt | tail -1`
    [ -z "$ENV" ] && echo "Environment variable is empty" || echo "Environment variable looks good"
    case $ENV in
    DEV) echo "Welcome to DEV environment!" ;;
    QUAL) echo "Welcome to QUAL environment!"
    CID=`head -2 ./param.txt | tail -1`
    CSEC=`head -4 ./param.txt | tail -1`
    ORG=`head -6 ./param.txt | tail -1`
    sleep 1;;
    PL) echo "Welcome to ProdLike environment!" ;;
    PT) echo "Welcome to ProdTest environment!" ;;
    PROD) echo "Welcome to PROD environment!" ;;
    *) echo "Invalid environment selection!" ;;
    esac
    Create Basic Token
    You can also generate the base64 basic_token manually at https://www.base64encode.org/. The line of code below takes the values stored in $CID and $CSEC, combines them into a single string separated by a colon, encodes that string into Base64 format, and assigns the resulting encoded string to the variable basic_token.
    basic_token=`echo -n $CID:$CSEC | base64 -w 0`
    Function to Regenerate the Token
    This function sends a request to obtain a new access token using the OAuth 2.0 client credentials flow, processes the response to extract the access token, and stores it in a file named access_token.tmp.
    regenToken() {
    curl -X POST \
    "$ORG/oauth2/v1/token" \
    -H "Authorization: Basic $basic_token" \
    -H "Cache-Control: no-cache" \
    -H "Content-Type: application/x-www-form-urlencoded" \
    -d "grant_type=client_credentials&scope=urn%3Aopc%3Aidm%3A__myscopes__" | awk -F[":"] '{print$2}' | awk -F[","] '{print$1}' | awk '{print substr($0, 2, length($0) - 2)}' > access_token.tmp
    echo "New token is generated.. access_token refreshed!!"
    }
    Testing Token Validity
    This portion of the main shell script checks the validity of an access token. It first reads the token from a file and then sends a request to validate it. If the token is invalid, it regenerates a new one and updates the token file accordingly.
    access_token=`more access_token.tmp`
    tokenTest=`curl -X POST "$ORG/oauth2/v1/introspect" -H "Authorization: Basic $basic_token" -H "Cache-Control: no-cache" -H "Content-Type: application/x-www-form-urlencoded" -d token=$access_token | awk -F"," '{print$1}' | awk -F":" '{print$2}' | sed 's/[{}]//g'`
    if [ "$tokenTest" = "true" ]; then
    echo "Token is valid.."
    else
    echo "Invalid token! Invoking func to pull new token.."
    regenToken
    access_token=`more access_token.tmp`
    fi
    Remove all the previous files (the script can be modified later to pull delta records only):
    rm -f auditevents.idcs*
    Pull totalResults Count
    This part retrieves the total number of qualified records from the API endpoint "$ORG/admin/v1/AuditEvents" using a GET request. It includes the access token in the request headers for authorization. The response is processed with awk to extract the value associated with "totalResults". After obtaining the total number of records, it echoes this information and then waits for 5 seconds.
    totalResults=`curl -X GET "$ORG/admin/v1/AuditEvents?&count=0" -H "Authorization: Bearer $access_token" -H "Cache-Control: no-cache" | awk -F"\"totalResults\"\:" '{print$2}' | awk -F"," '{print$1}'`
    echo "Total number of qualified records: $totalResults"
    sleep 5
    Loop to Pull the Records
    This loop iterates through paginated API calls to retrieve audit-events data from "$ORG/admin/v1/AuditEvents". It sets the pagination parameters and continuously fetches data until all records are obtained. Each batch of data is processed and saved into a JSON file named "auditevents.idcs.json".
    itemsPerPage=1000
    startIndex=1
    while [ $startIndex -le $totalResults ]
    do
    echo "startIndex: $startIndex"
    curl -X GET \
    "$ORG/admin/v1/AuditEvents?&startIndex=$startIndex&count=$itemsPerPage" \
    -H "Authorization: Bearer $access_token" \
    -H "Cache-Control: no-cache" | awk -F"Resources" '{print$2}' | awk -F"startIndex" '{print$1}' | cut -c 4- | rev | cut -c 4- | rev > auditevents.idcs.json
    Formatting the JSON Output
    This step performs a search-and-replace operation on the JSON file "auditevents.idcs.json". It replaces occurrences of the string defined by the variable PAT with the string defined by REP_PAT. The modified content is then redirected to a new file with a timestamp appended to its name. Afterward, it increments startIndex by 1000 to prepare for the next batch of data retrieval in the loop. This process repeats until all records are retrieved.
    PAT=]},{\"idcsCreatedBy
    REP_PAT=]}'\n'{\"idcsCreatedBy
    sed "s/$PAT/$REP_PAT/g" auditevents.idcs.json > auditevents.idcs.json_`date +"%Y%m%d_%H%M%S%N"`
    startIndex=`expr $startIndex + 1000`
    done
    Remove the access token temp file at the end of the script:
    rm -f access_token.tmp
    Summary
    In short, the shell script performs the following operations:
    It checks for the presence of a parameter file named "param.txt" and sets environment variables accordingly.
    It generates a basic authentication token.
    It validates the token and regenerates it if invalid.
    It retrieves the total number of qualified records and waits for 5 seconds.
    It retrieves audit-events data in paginated batches, reformats the data, and saves it into JSON files with timestamps.
    It iterates through each batch until all records are retrieved.
    Finally, it cleans up temporary files.
    ```bash
    #!/bin/bash
    [ -f ./param.txt ] && echo "Parameter file is present" || echo "Parameter file not found!! Create param.txt with CID, CSEC, ORG, and ENV details."

    ENV=$(head -8 ./param.txt | tail -1)
    [ -z "$ENV" ] && echo "Environment variable is empty" || echo "Environment variable looks good"

    case $ENV in
      DEV) echo "Welcome to DEV environment!" ;;
      QUAL) echo "Welcome to QUAL environment!"
        CID=$(head -2 ./param.txt | tail -1)
        CSEC=$(head -4 ./param.txt | tail -1)
        ORG=$(head -6 ./param.txt | tail -1)
        sleep 1 ;;
      PL) echo "Welcome to ProdLike environment!" ;;
      PT) echo "Welcome to ProdTest environment!" ;;
      PROD) echo "Welcome to PROD environment!" ;;
      *) echo "Invalid environment selection!" ;;
    esac

    basic_token=$(echo -n $CID:$CSEC | base64 -w 0)

    regenToken() {
      curl -X POST \
        "$ORG/oauth2/v1/token" \
        -H "Authorization: Basic $basic_token" \
        -H "Cache-Control: no-cache" \
        -H "Content-Type: application/x-www-form-urlencoded" \
        -d "grant_type=client_credentials&scope=urn%3Aopc%3Aidm%3A__myscopes__" | awk -F[":"] '{print$2}' | awk -F[","] '{print$1}' | awk '{print substr($0, 2, length($0) - 2)}' > access_token.tmp
      echo "New token is generated.. access_token refreshed!!"
    }

    access_token=$(more access_token.tmp)

    tokenTest=$(curl -X POST "$ORG/oauth2/v1/introspect" -H "Authorization: Basic $basic_token" -H "Cache-Control: no-cache" -H "Content-Type: application/x-www-form-urlencoded" -d token=$access_token | awk -F"," '{print$1}' | awk -F":" '{print$2}' | sed 's/[{}]//g')

    if [ "$tokenTest" = "true" ]; then
      echo "Token is valid.."
    else
      echo "Invalid token! Invoking func to pull new token.."
      regenToken
      access_token=$(more access_token.tmp)
    fi

    rm -f auditevents.idcs*

    totalResults=$(curl -X GET "$ORG/admin/v1/AuditEvents?&count=0" -H "Authorization: Bearer $access_token" -H "Cache-Control: no-cache" | awk -F"\"totalResults\"\:" '{print$2}' | awk -F"," '{print$1}')
    echo "Total number of qualified records: $totalResults"
    sleep 5

    itemsPerPage=1000
    startIndex=1

    while [ $startIndex -le $totalResults ]; do
      echo "startIndex: $startIndex"
      curl -X GET \
        "$ORG/admin/v1/AuditEvents?&startIndex=$startIndex&count=$itemsPerPage" \
        -H "Authorization: Bearer $access_token" \
        -H "Cache-Control: no-cache" | awk -F"Resources" '{print$2}' | awk -F"startIndex" '{print$1}' | cut -c 4- | rev | cut -c 4- | rev > auditevents.idcs.json

      PAT=]},{\"idcsCreatedBy
      REP_PAT=]}'\n'{\"idcsCreatedBy
      sed "s/$PAT/$REP_PAT/g" auditevents.idcs.json > auditevents.idcs.json_$(date +"%Y%m%d_%H%M%S%N")

      startIndex=$(expr $startIndex + 1000)
    done

    rm -f access_token.tmp
    ```
    I've formatted the script for better readability while preserving its functionality. A jq-based variation for parsing the token and record count is sketched below.
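    As an optional variation, if the jq utility is available on your server (it is not used in the original script above, so treat this as an assumption), the token and record count can be parsed more robustly than with the awk chains:
    ```bash
    # Hedged sketch: same endpoints as the script above, parsed with jq
    access_token=$(curl -s -X POST "$ORG/oauth2/v1/token" \
      -H "Authorization: Basic $basic_token" \
      -H "Content-Type: application/x-www-form-urlencoded" \
      -d "grant_type=client_credentials&scope=urn%3Aopc%3Aidm%3A__myscopes__" | jq -r '.access_token')

    totalResults=$(curl -s "$ORG/admin/v1/AuditEvents?count=0" \
      -H "Authorization: Bearer $access_token" | jq -r '.totalResults')
    echo "Total number of qualified records: $totalResults"
    ```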
    Related Posts ✔ How to setup Postman client ✔ Calling Twitter API

  • How to convert RDD to Dataframe?

    Main menu: Spark Scala Tutorial
    There are basically three methods by which we can convert an RDD into a DataFrame. I am using spark-shell to demonstrate these examples. Open spark-shell and import the libraries needed to run the code.
    scala> import org.apache.spark.sql.{Row, SparkSession}
    scala> import org.apache.spark.sql.types.{IntegerType, DoubleType, StringType, StructField, StructType}
    Now, create a sample RDD with the parallelize method.
    scala> val rdd = sc.parallelize( Seq( ("One", Array(1,1,1,1,1,1,1)), ("Two", Array(2,2,2,2,2,2,2)), ("Three", Array(3,3,3,3,3,3)) ) )
    Method 1. If you don't need column names, you can create the DataFrame directly by passing the RDD as the input parameter to the createDataFrame method.
    scala> val df1 = spark.createDataFrame(rdd)
    Method 2. If you need column names, you can add them explicitly by calling the toDF method.
    scala> val df2 = spark.createDataFrame(rdd).toDF("Label", "Values")
    Method 3. If you need a custom schema, you need an RDD of Row type. Let's create a new rowsRDD for this scenario.
    scala> val rowsRDD = sc.parallelize( Seq( Row("One",1,1.0), Row("Two",2,2.0), Row("Three",3,3.0), Row("Four",4,4.0), Row("Five",5,5.0) ) )
    Now create the schema with the field names you need.
    scala> val schema = new StructType().add(StructField("Label", StringType, true)).add(StructField("IntValue", IntegerType, true)).add(StructField("FloatValue", DoubleType, true))
    Now create the DataFrame with rowsRDD and the schema.
    scala> val df3 = spark.createDataFrame(rowsRDD, schema)
    A fourth, implicits-based approach is sketched at the end of this post. Thank you folks! If you have any questions, please mention them in the comments section below. Next: Writing data files in Spark
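    As a fourth approach (not one of the three methods above, just a common alternative), spark-shell's SparkSession implicits let you call toDF directly on an RDD or a local Seq. This is a minimal sketch assuming the same shell session, where spark and rdd already exist.
    ```scala
    import spark.implicits._

    // Directly on the existing RDD of (String, Array[Int]) pairs
    val df4 = rdd.toDF("Label", "Values")
    df4.printSchema()

    // Or build a DataFrame from a local Seq without creating an RDD first
    val df5 = Seq(("One", 1, 1.0), ("Two", 2, 2.0)).toDF("Label", "IntValue", "FloatValue")
    df5.show()
    ```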

  • How to pull data from OKTA API example

    OKTA has various REST APIs (refer this) from which you can pull data and work with it according to your business requirements. Because OKTA stores only 90 days of records, in many cases you might need to store the data in an external database and then perform your data analysis. To pull the data from OKTA I decided to write a shell script, mainly because it looked very straightforward to me, but there are other methods you can consider if you have a wider project timeline. Let's see how this can be done with a shell script.
    Step 1: Go through the API reference documents and filters which OKTA has provided online. They are very well documented, which will help if you want to tweak this script.
    Step 2: Get an API access token from your OKTA admin and validate that the token works properly with the Postman client. Refer this. (A quick curl-based check is also sketched at the end of this post.)
    Step 3: Once you have the API access token and a basic understanding of the API filters, you will be able to tweak the script according to your needs.
    Step 4: Below is the complete shell program with a brief explanation of what each step is doing.
    # Define your environment variables - organization, domain and api_token. These will be used to construct the URL in further steps.
    # If you want, you can hide your API token, for example by reading it from a parameter file instead of hard-coding it.
    # Start
    ORG=company_name
    DOM=okta
    API_TOKEN=*********************
    # Initialize variables with some default values.
    # Change the destination path to wherever you want to write the data.
    # VAL is the pagination limit; PAT/REP_PAT are the pattern and replacement-pattern strings used to format the JSON file correctly. DATE_RANGE is used to pull the data based on the date the user inputs.
    VAL=1000
    DEST_FILE=/var/spark/data
    i=1
    PAT=
    REP_PAT=
    DATE_RANGE=2014-02-01
    # Choose the API for which you need the data (events, logs or users); you can modify the code if you want to export any other API data.
    echo "Enter the name of API - events, logs, users. "
    read GID
    # Enter the date range to pull data
    echo "Enter the date in format yyyy-mm-dd"
    read DATE_RANGE
    date_func() {
    echo "Enter the date in format yyyy-mm-dd"
    read DATE_RANGE
    }
    # Check if the entered date is in the correct format
    if [ ${#DATE_RANGE} -ne 10 ]; then
    echo "Invalid date!! Enter date again.."; date_func
    else
    echo "Valid date!"
    fi
    # Construct the URL based on the variables defined earlier
    URL=https://$ORG.$DOM.com/api/v1/$GID?limit=$VAL
    # Case statement to handle the API name entered by the user; 4 to 10 are empty routes in case you want to add new APIs
    case $GID in
    events) echo "events API selected"
    rm -f /var/spark/data/events.json*
    URL=https://$ORG.$DOM.com/api/v1/$GID?lastUpdated%20gt%20%22"$DATE_RANGE"T00:00:00.000Z%22\&$VAL
    PAT=}]},{\"eventId\":
    REP_PAT=}]}'\n'{\"eventId\":
    sleep 1;;
    logs) echo "logs API selected"
    rm -f /var/spark/data/logs.json*
    URL=https://$ORG.$DOM.com/api/v1/$GID?lastUpdated%20gt%20%22"$DATE_RANGE"T00:00:00.000Z%22\&$VAL
    PAT=}]},{\"actor\":
    REP_PAT=}]}'\n'{\"actor\":
    sleep 1;;
    users) echo "users API selected"
    PAT=}}},{\"id\":
    REP_PAT=}}}'\n'{\"id\":
    rm -f /var/spark/data/users.json*
    URL=https://$ORG.$DOM.com/api/v1/$GID?filter=status%20eq%20%22STAGED%22%20or%20status%20eq%20%22PROVISIONED%22%20or%20status%20eq%20%22ACTIVE%22%20or%20status%20eq%20%22RECOVERY%22%20or%20status%20eq%20%22PASSWORD_EXPIRED%22%20or%20status%20eq%20%22LOCKED_OUT%22%20or%20status%20eq%20%22DEPROVISIONED%22\&$VAL
    echo $URL
    sleep 1;;
    4) echo "four" ;;
    5) echo "five" ;;
    6) echo "six" ;;
    7) echo "seven" ;;
    8) echo "eight" ;;
    9) echo "nine" ;;
    10) echo "ten" ;;
    *) echo "INVALID INPUT!" ;;
    esac
    # Delete temporary files before running the rest of the script
    rm -f itemp.txt
    rm -f temp.txt
    rm -f temp1.txt
    # Create the NEXT variable to handle pagination
    curl -i -X GET -H "Accept: application/json" -H "Content-Type: application/json" -H "Authorization: SSWS $API_TOKEN" "$URL" > itemp.txt
    NEXT=`grep -i 'rel="next"' itemp.txt | awk -F"<" '{print$2}' | awk -F">" '{print$1}'`
    tail -1 itemp.txt > temp.txt
    # Validate that the URL is correctly defined
    echo $URL
    # Iterate through the pages using the NEXT variable until it is null
    while [ ${#NEXT} -ne 0 ]
    do
    echo "this command is executed till NEXT is null, current value of NEXT is $NEXT"
    curl -i -X GET -H "Accept: application/json" -H "Content-Type: application/json" -H "Authorization: SSWS $API_TOKEN" "$NEXT" > itemp.txt
    tail -1 itemp.txt >> temp.txt
    NEXT=`grep -i 'rel="next"' itemp.txt | awk -F"<" '{print$2}' | awk -F">" '{print$1}'`
    echo "number of loop = $i, for NEXT reference : $NEXT"
    (( i++ ))
    cat temp.txt | cut -c 2- | rev | cut -c 2- | rev > temp1.txt
    rm -f temp.txt
    # Format the output to create single-line JSON records
    echo "PATTERN = $PAT"
    echo "REP_PATTERN = $REP_PAT"
    sed -i "s/$PAT/$REP_PAT/g" temp1.txt
    mv temp1.txt /var/spark/data/$GID.json_`date +"%Y%m%d_%H%M%S"`
    sleep 1
    done
    # END
    See also - How to setup Postman client. If you have any questions, please write them in the comments section below. Thank you!
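    For step 2, if you'd rather not use Postman, the token can also be verified with a quick curl call before running the full script. This is a minimal sketch; the org, domain, and token values are placeholders, and the request simply asks for a single user record, so a valid token returns JSON while an invalid one returns an error.
    ```bash
    ORG=company_name
    DOM=okta
    API_TOKEN=*********************

    # OKTA uses the "SSWS" authorization scheme for API tokens
    curl -s -H "Accept: application/json" \
         -H "Authorization: SSWS $API_TOKEN" \
         "https://$ORG.$DOM.com/api/v1/users?limit=1"
    ```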

  • Spark RDD, Transformations and Actions example

    Main menu: Spark Scala Tutorial
    In this Apache Spark RDD tutorial you will learn about: Spark RDD with examples, what is an RDD in Spark, Spark transformations, Spark actions, Spark actions and transformations examples, and Spark RDD operations.
    What is an RDD in Spark?
    According to the Apache Spark documentation - "Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat".
    Example (for easy understanding) - not a practical case. I seriously didn't understand anything when I read the above definition for the first time, except the fact that RDD is an acronym for Resilient Distributed Dataset. Let's try to understand RDD with a simple example. Assume that you have a collection of 100 movies stored on your personal laptop. This way you have the complete data residing on a single machine (you can call it a node), i.e. your personal laptop. Now, instead of having all the movies on a single machine, let's say you distributed them - 50 movies on laptop A and 50 movies on laptop B. This is where the term Distributed comes into the picture: 50% of your data resides on one machine and 50% on another. Now let's say you were worried that either laptop could malfunction and you would lose your movies, so you took a backup: a backup of the 50 movies on laptop A onto laptop B, and similarly a backup of the 50 movies on laptop B onto laptop A. This is where the term Resilient, or fault-tolerant, comes into the picture. The dictionary meaning of resilient is to withstand or recover quickly from difficult conditions, and the backup of your movies ensures that you can recover the data anytime from the other machine (the so-called node) if a system malfunctions. The number of times you create a backup, or replicate the data onto another machine for recovery, is called the replication factor. In the above case the replication factor was one, as you replicated the data once. In real-life scenarios you will encounter huge amounts of data (like the movie data in the above example) distributed across thousands of worker nodes (like the laptops in the above example), the combination of which is called a cluster, with higher replication factors (in the above example it was just 1) in order to maintain a fault-tolerant system.
    Basic facts about Spark RDDs: Resilient Distributed Datasets (RDDs) are an immutable collection of elements used as the fundamental data structure in Apache Spark. You can create RDDs by two methods - parallelizing a collection and referencing external datasets. RDDs are immutable, i.e. read-only data structures, so you can't change the original RDD, but you can always create a new one. RDDs support two types of Spark operations - transformations and actions.
    Parallelize collection
    scala> sc.parallelize(1 to 10 by 2)
    res8: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[4] at parallelize at <console>:25
    Referencing a dataset
    scala> val dataFile = sc.textFile("/testdata/MountEverest.txt")
    dataFile: org.apache.spark.rdd.RDD[String] = /testdata/MountEverest.txt
    See - How to create an RDD?
    Spark Transformations & Actions
    In Spark, transformations are functions that produce a new RDD from an existing RDD. When you need actual data from an RDD, you need to apply actions.
    Below is the list of common transformations supported by Spark. But before that, a note for those who are new to programming: you will be using lambda functions (sometimes called anonymous functions) to pass through these Spark transformations, so you should have a basic understanding of them. In short, lambda functions are a convenient way to write a function when you have to use it in just one place. For example, if you want to double a number you can simply write x => x + x, like you do in Python and other languages. The syntax in Scala would be like this:
    scala> val lfunc = (x:Int) => x + x
    lfunc: Int => Int = <function1> // This tells you that the function takes an integer and returns an integer
    scala> lfunc(3)
    res0: Int = 6
    Sample Data
    I will be using the "Where is Mount Everest?" text data. I just picked some random data to go through these examples.
    Where is Mount Everest? (MountEverest.txt)
    Mount Everest (Nepali: Sagarmatha सगरमाथा; Tibetan: Chomolungma ཇོ་མོ་གླང་མ; Chinese Zhumulangma 珠穆朗玛) is Earth's highest mountain above sea level, located in the Mahalangur Himal sub-range of the Himalayas. The international border between Nepal (Province No. 1) and China (Tibet Autonomous Region) runs across its summit point. - Reference Wikipedia
    scala> val mountEverest = sc.textFile("/testdata/MountEverest.txt")
    mountEverest: org.apache.spark.rdd.RDD[String] = /testdata/MountEverest.txt MapPartitionsRDD[1] at textFile at <console>:24
    Spark Transformations
    I encourage you all to run these examples on spark-shell side-by-side. Don't just read through them; type them on your keyboard, it will help you learn.
    map(func)
    This transformation returns a new RDD formed by passing each element of the source through func.
    1. For example, if you want to split the Mount Everest text into individual words, you just need to pass the lambda func x => x.split(" ") and it will create a new RDD as shown below.
    scala> val words = mountEverest.map(x => x.split(" "))
    words: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[3] at map at <console>:25
    Did you spot the difference between the mountEverest and words RDDs? Yeah, exactly: one is of String type, and after applying the map transformation it's now an Array of String.
    scala> words.collect()
    res1: Array[Array[String]] = Array(Array(Mount, Everest, (Nepali:, Sagarmatha, सगरमाथा;, Tibetan:, Chomolungma, ཇོ་མོ་གླང་མ;, Chinese, Zhumulangma, 珠穆朗玛), is, Earth's, highest, mountain, above, sea, level,, located, in, the, Mahalangur, Himal, sub-range, of, the, Himalayas., The, international, border, between, Nepal, (Province, No., 1), and, China, (Tibet, Autonomous, Region), runs, across, its, summit, point.))
    To return all the elements of the words RDD, we have called the collect() action. It's a very basic Spark action.
    2. Now, suppose you want to get the word count of this text file. You can do something like this - first split the file and then get the length or size of the collection.
    scala> mountEverest.map(x => x.split(" ").length).collect()
    res6: Array[Int] = Array(45) // The Mount Everest file has 45 words
    scala> mountEverest.map(x => x.split(" ").size).collect()
    res7: Array[Int] = Array(45)
    3. Let's say you want to get the total number of characters in the file; you can do it like this.
    scala> mountEverest.map(x => x.length).collect()
    res5: Array[Int] = Array(329) // The Mount Everest file has 329 characters
    4. Suppose you want to make all the text upper or lower case; you can do it like this.
    scala> mountEverest.map(x => x.toUpperCase()).collect()
    res9: Array[String] = Array(MOUNT EVEREST (NEPALI: SAGARMATHA सगरमाथा; TIBETAN: CHOMOLUNGMA ཇོ་མོ་གླང་མ; CHINESE ZHUMULANGMA 珠穆朗玛) IS EARTH'S HIGHEST MOUNTAIN ABOVE SEA LEVEL, LOCATED IN THE MAHALANGUR HIMAL SUB-RANGE OF THE HIMALAYAS. THE INTERNATIONAL BORDER BETWEEN NEPAL (PROVINCE NO. 1) AND CHINA (TIBET AUTONOMOUS REGION) RUNS ACROSS ITS SUMMIT POINT.)
    scala> mountEverest.map(x=>x.toLowerCase()).collect()
    res35: Array[String] = Array(mount everest (nepali: sagarmatha सगरमाथा; tibetan: chomolungma ཇོ་མོ་གླང་མ; chinese zhumulangma 珠穆朗玛) is earth's highest mountain above sea level, located in the mahalangur himal sub-range of the himalayas.the international border between nepal (province no. 1) and china (tibet autonomous region) runs across its summit point.)
    flatMap(func)
    As the name suggests, it's a flattened map. It is similar to map, except that it gives you a more flattened output. For example,
    scala> val rdd = sc.parallelize(Seq("Where is Mount Everest","Himalayas India"))
    rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[22] at parallelize at <console>:24
    scala> rdd.collect
    res26: Array[String] = Array(Where is Mount Everest, Himalayas India)
    1. We have two items in the parallel collection RDD - "Where is Mount Everest" and "Himalayas India".
    scala> rdd.map(x => x.split(" ")).collect
    res21: Array[Array[String]] = Array(Array(Where, is, Mount, Everest), Array(Himalayas, India))
    2. When the map() transformation is applied, it results in two separate arrays of strings (1st element => (Where, is, Mount, Everest), 2nd element => (Himalayas, India)).
    scala> rdd.flatMap(x => x.split(" ")).collect
    res23: Array[String] = Array(Where, is, Mount, Everest, Himalayas, India)
    3. For flatMap(), the output is flattened to a single array of strings, Array[String]. Thus flatMap() is similar to map, but each input item is mapped to 0 or more output items (1st item => 4 elements, 2nd item => 2 elements). This will give you a clearer picture:
    scala> rdd.map(x => x.split(" ")).count()
    res24: Long = 2 // as map gives one-to-one output, hence 2 => 2
    scala> rdd.flatMap(x => x.split(" ")).count()
    res25: Long = 6 // as flatMap gives one-to-zero-or-more output, hence 2 => 6
    map() => [Where is Mount Everest, Himalayas India] => [[Where, is, Mount, Everest],[Himalayas, India]]
    flatMap() => [Where is Mount Everest, Himalayas India] => [Where, is, Mount, Everest, Himalayas, India]
    4. Getting back to the mountEverest RDD, suppose you want to get the length of each individual word.
    scala> mountEverest.flatMap(x=>x.split(" ")).map(x=>(x, x.length)).collect
    res82: Array[(String, Int)] = Array((Mount,5), (Everest,7), ((Nepali:,8), (Sagarmatha,10), (सगरमाथा;,8), (Tibetan:,8), (Chomolungma,11), (ཇོ་མོ་གླང་མ;,12), (Chinese,7), (Zhumulangma,11), (珠穆朗玛),5), (is,2), (Earth's,7), (highest,7), (mountain,8), (above,5), (sea,3), (level,,6), (located,7), (in,2), (the,3), (Mahalangur,10), (Himal,5), (sub-range,9), (of,2), (the,3), (Himalayas.The,13), (international,13), (border,6), (between,7), (Nepal,5), ((Province,9), (No.,3), (1),2), (and,3), (China,5), ((Tibet,6), (Autonomous,10), (Region),7), (runs,4), (across,6), (its,3), (summit,6), (point.,6))
    filter(func)
    As the name suggests, it is used to filter elements, much like the WHERE clause in SQL, and it is case sensitive.
    For example,
    scala> rdd.collect
    res26: Array[String] = Array(Where is Mount Everest, Himalayas India)
    // Returns one match
    scala> rdd.filter(x=>x.contains("Himalayas")).collect
    res31: Array[String] = Array(Himalayas India)
    // Contains is case sensitive
    scala> rdd.filter(x=>x.contains("himalayas")).collect
    res33: Array[String] = Array()
    scala> rdd.filter(x=>x.toLowerCase.contains("himalayas")).collect
    res37: Array[String] = Array(Himalayas India)
    Filtering even numbers,
    scala> sc.parallelize(1 to 15).filter(x=>(x%2==0)).collect
    res57: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14)
    scala> sc.parallelize(1 to 15).filter(_%5==0).collect
    res59: Array[Int] = Array(5, 10, 15)
    mapPartitions(func), where func is of type Iterator => Iterator
    Similar to the map() transformation, but in this case the function runs separately on each partition (block) of the RDD, unlike map() where it runs on each element of a partition. Hence mapPartitions is also useful when you are looking for a performance gain (it calls your function once per partition, not once per element). Suppose you have elements from 1 to 100 distributed among 10 partitions, i.e. 10 elements per partition. The map() transformation will call func 100 times to process these 100 elements, but in the case of mapPartitions(), func will be called once per partition, i.e. 10 times. Secondly, mapPartitions() holds the data in memory, i.e. it stores the result in memory until all the elements of the partition have been processed, and it returns the result only after it finishes processing the whole partition. mapPartitions() requires an iterator as input, unlike the map() transformation.
    What is an iterator? An iterator is a way to access a collection of elements one by one. It is similar in some ways to collections like List() and Array(), but the difference is that an iterator doesn't load the whole collection of elements into memory all together; instead it loads elements one after another. In Scala you access these elements with the hasNext and next operations. For example,
    scala> sc.parallelize(1 to 9, 3).map(x=>(x, "Hello")).collect
    res3: Array[(Int, String)] = Array((1,Hello), (2,Hello), (3,Hello), (4,Hello), (5,Hello), (6,Hello), (7,Hello), (8,Hello), (9,Hello))
    scala> sc.parallelize(1 to 9, 3).partitions.size
    res95: Int = 3
    scala> sc.parallelize(1 to 9, 3).mapPartitions(x=>(Array("Hello").iterator)).collect
    res7: Array[String] = Array(Hello, Hello, Hello)
    scala> sc.parallelize(1 to 9, 3).mapPartitions(x=>(List(x.next).iterator)).collect
    res11: Array[Int] = Array(1, 4, 7)
    In the first example, I applied the map() transformation on a dataset distributed between 3 partitions, so you can see the function is called 9 times. In the second example, when mapPartitions() is applied, you will notice it ran 3 times, i.e. once for each partition. We had to convert the string "Hello" into an iterator because mapPartitions() takes an iterator as input. In the third step, I used the iterator's next value to show you the first element of each partition. Note that next always advances, so you can't step back. See this:
    scala> sc.parallelize(1 to 9, 3).mapPartitions(x=>(List(x.next,x.next, "|").iterator)).collect
    res18: Array[Any] = Array(1, 2, |, 4, 5, |, 7, 8, |)
    In the first call, the next value for partition 1 changed from 1 => 2, for partition 2 it changed from 4 => 5, and similarly for partition 3 it changed from 7 => 8. You can keep increasing this until hasNext is false (hasNext is a property of the iterator which tells you whether the collection has ended or not; it returns true or false based on the items left in the collection).
You can keep calling next until hasNext is false (hasNext is a property of the iterator which tells you whether the collection has more items; it returns true or false based on the items left). For example,

scala> sc.parallelize(1 to 9, 3).mapPartitions(x=>(List(x.next, x.hasNext).iterator)).collect
res19: Array[AnyVal] = Array(1, true, 4, true, 7, true)

You can see hasNext is true because there are elements left in each partition. Now suppose we access all three elements of each partition; hasNext then returns false. For example,

scala> sc.parallelize(1 to 9, 3).mapPartitions(x=>(List(x.next, x.next, x.next, x.hasNext).iterator)).collect
res20: Array[AnyVal] = Array(1, 2, 3, false, 4, 5, 6, false, 7, 8, 9, false)

Just for understanding, if you try to call next a fourth time, you get an error, which is expected:

scala> sc.parallelize(1 to 9, 3).mapPartitions(x=>(List(x.next, x.next, x.next, x.next,x.hasNext).iterator)).collect
19/07/31 11:14:42 ERROR Executor: Exception in task 1.0 in stage 18.0 (TID 56)
java.util.NoSuchElementException: next on empty iterator

You can think of the map() transformation as a special case of mapPartitions() where each partition holds just one element.

mapPartitionsWithIndex(func)

Similar to mapPartitions(), but it also gives you the partition index so you can see which partition each element came from. For example,

scala> val mp = sc.parallelize(List("One","Two","Three","Four","Five","Six","Seven","Eight","Nine"), 3)
mp: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[38] at parallelize at :24

scala> mp.collect
res23: Array[String] = Array(One, Two, Three, Four, Five, Six, Seven, Eight, Nine)

scala> mp.mapPartitionsWithIndex((index, iterator) => {iterator.toList.map(x => x + "=>" + index ).iterator} ).collect
res26: Array[String] = Array(One=>0, Two=>0, Three=>0, Four=>1, Five=>1, Six=>1, Seven=>2, Eight=>2, Nine=>2)

Index 0 (the first partition) has three values as expected, and similarly for the other two partitions. If you have any question, please mention it in the comments section at the end of this blog.

sample(withReplacement, fraction, seed)

Generates a sampled RDD from the input RDD. Note that the second argument, fraction, doesn't represent the fraction of the actual RDD; it is the probability of each element being selected for the sample. The first argument, withReplacement, decides whether an element can be picked more than once, and seed is optional. For example,

scala> sc.parallelize(1 to 10).sample(true, .4).collect
res103: Array[Int] = Array(4)

scala> sc.parallelize(1 to 10).sample(true, .4).collect
res104: Array[Int] = Array(1, 4, 6, 6, 6, 9)

// Fraction 0.2 doesn't mean exactly 20% of the RDD - here 4 elements out of 10 were selected.
scala> sc.parallelize(1 to 10).sample(true, .2).collect
res109: Array[Int] = Array(2, 4, 7, 10)

// Fraction set to 1, the maximum probability (0 to 1), so every element got selected.
scala> sc.parallelize(1 to 10).sample(false, 1).collect
res111: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

union(otherDataset)

Similar to a SQL union, except that it keeps duplicate data (like UNION ALL).

scala> val rdd1 = sc.parallelize(List("apple","orange","grapes","mango","orange"))
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[159] at parallelize at :24

scala> val rdd2 = sc.parallelize(List("red","green","yellow"))
rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[160] at parallelize at :24

scala> rdd1.union(rdd2).collect
res116: Array[String] = Array(apple, orange, grapes, mango, orange, red, green, yellow)

scala> rdd2.union(rdd1).collect
res117: Array[String] = Array(red, green, yellow, apple, orange, grapes, mango, orange)
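If you want SQL-style UNION semantics (duplicates removed), one simple way - sketched here as my own example, not from the original post - is to chain distinct() (covered below) after union():

// union keeps duplicates; distinct removes them, so "orange" appears only once.
rdd1.union(rdd2).distinct.collect
// Expected: the 5 fruits and 3 colors, 7 distinct values in some order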
intersection(otherDataset)

Returns the intersection of two datasets. For example,

scala> val rdd1 = sc.parallelize(-5 to 5)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[171] at parallelize at :24

scala> val rdd2 = sc.parallelize(1 to 10)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[172] at parallelize at :24

scala> rdd1.intersection(rdd2).collect
res119: Array[Int] = Array(4, 1, 5, 2, 3)

distinct()

Returns a new dataset with distinct elements. For example, we no longer have a duplicate "orange":

scala> val rdd = sc.parallelize(List("apple","orange","grapes","mango","orange"))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[186] at parallelize at :24

scala> rdd.distinct.collect
res121: Array[String] = Array(grapes, orange, apple, mango)

Due to some technical issues I had to move some content of this page to another area. Please refer to this page for the remaining list of transformations - sorry for the inconvenience.

groupByKey()
reduceByKey()
aggregateByKey()
sortByKey()
join()
cartesian()
coalesce()
repartition()

Now, as said earlier, RDDs are immutable, so you can't change the original RDD, but you can always create a new RDD with Spark transformations like map, flatMap, filter, groupByKey, reduceByKey, mapValues, sample, union, intersection, distinct, sortByKey, etc.

RDD transformations are broadly classified into two categories - narrow and wide transformations. In a narrow transformation like map or filter, all the elements required to compute the records in a single partition live in a single partition of the parent RDD; only a limited subset of partitions is needed to calculate the result. In a wide transformation like groupByKey or reduceByKey, the elements required to compute the records in a single partition may live in many partitions of the parent RDD.

Spark Actions

When you want to work on the actual dataset, you perform Spark actions on RDDs, such as count, reduce, collect, first, takeSample, saveAsTextFile, etc. Transformations are lazy in nature, i.e. nothing is executed when a transformation is defined; RDDs are computed only when an action is applied to them. This is called lazy evaluation - Spark evaluates an expression only when its value is needed by an action. When you call an action, it triggers the transformations to act upon the RDD, dataset, or dataframe, which is then calculated in memory. In short, transformations actually run only when you apply an action; before that it's just a line of code that hasn't been evaluated :)

Below is the list of Spark actions.

reduce()

It aggregates the elements of the dataset. For example,

scala> val rdd = sc.parallelize(1 to 15).collect
rdd: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)

scala> val rdd = sc.parallelize(1 to 15).reduce(_ + _)
rdd: Int = 120

scala> val rdd = sc.parallelize(Array("Hello", "Dataneb", "Spark")).reduce(_ + _)
rdd: String = SparkHelloDataneb

scala> val rdd = sc.parallelize(Array("Hello", "Dataneb", "Spark")).map(x =>(x, x.length)).flatMap(l=> List(l._2)).collect
rdd: Array[Int] = Array(5, 7, 5)

scala> rdd.reduce(_ + _)
res96: Int = 17

scala> rdd.reduce((x, y)=>x+y)
res99: Int = 17
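One detail worth calling out (my own note, not from the original post): reduce() combines partial results from partitions in no guaranteed order, which is why the string example above came back as "SparkHelloDataneb" rather than "HelloDatanebSpark". For deterministic results, use functions that are associative and commutative, as in this small sketch:

val nums = sc.parallelize(1 to 15)
nums.reduce(_ + _)    // 120 every run: addition is associative and commutative
nums.reduce(_ max _)  // 15, the maximum element, also order-independent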
collect(), count(), first(), take()

collect returns all the elements of the dataset as an array. For example,

scala> sc.parallelize(1 to 20, 4).collect
res100: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)

count returns the number of elements:

scala> sc.parallelize(1 to 20, 4).count
res101: Long = 20

first returns the first element:

scala> sc.parallelize(1 to 20, 4).first
res102: Int = 1

take returns the first n elements, where n is the argument you pass:

scala> sc.parallelize(1 to 20, 4).take(5)
res104: Array[Int] = Array(1, 2, 3, 4, 5)

takeSample()

It returns a random sample of size n; the boolean argument selects sampling with or without replacement. For example,

scala> sc.parallelize(1 to 20, 4).takeSample(false,4)
res107: Array[Int] = Array(15, 2, 5, 17)

scala> sc.parallelize(1 to 20, 4).takeSample(false,4)
res108: Array[Int] = Array(12, 5, 4, 11)

scala> sc.parallelize(1 to 20, 4).takeSample(true,4)
res109: Array[Int] = Array(18, 4, 1, 18)

takeOrdered()

It returns the smallest n elements in ascending order. For example,

scala> sc.parallelize(1 to 20, 4).takeOrdered(7)
res117: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7)

It is just the opposite of the top() action:

scala> sc.parallelize(1 to 20, 4).top(7)
res118: Array[Int] = Array(20, 19, 18, 17, 16, 15, 14)

countByKey()

It takes an RDD of (key, value) pairs and returns a map of (key, count of key). For example,

scala> sc.parallelize(Array("Apple","Banana","Grapes","Oranges","Grapes","Banana")).map(k=>(k,1)).countByKey()
res121: scala.collection.Map[String,Long] = Map(Grapes -> 2, Oranges -> 1, Banana -> 2, Apple -> 1)

saveAsTextFile()

It saves the dataset as text files in a local directory, HDFS, etc. You can reduce the number of output files with the coalesce transformation.

scala> sc.parallelize(Array("Apple","Banana","Grapes","Oranges","Grapes","Banana")).saveAsTextFile("sampleFruits.txt")

// Just one partition file with coalesce
scala> sc.parallelize(Array("Apple","Banana","Grapes","Oranges","Grapes","Banana")).coalesce(1).saveAsTextFile("newsampleFruits.txt")

saveAsObjectFile()

It writes the data in a simple format using Java serialization, and you can load it again with sc.objectFile().

scala> sc.parallelize(List(1,2)).saveAsObjectFile("/Users/Rajput/sample")

foreach()

It is generally used when you want to carry out an operation on each element of the output, like loading each element into a database.

scala> sc.parallelize("Hello").collect
res139: Array[Char] = Array(H, e, l, l, o)

scala> sc.parallelize("Hello").foreach(x=>println(x))
l
H
e
l
o

// The output order of elements is not the same every time
scala> sc.parallelize("Hello").foreach(x=>println(x))
H
e
l
o
l

Spark Workflow

In this section you will see how a Spark program flows - how you create intermediate RDDs and apply transformations and actions.

You first create RDDs with the parallelize method or by referencing an external dataset.
You apply transformations to create new RDDs based on your requirement; this chain of RDDs is called the lineage.
You apply actions on the RDDs.
You get your result.

Transformations & Actions example

Let's try to put the above facts together with a basic example, which will give you a clearer picture. Open spark-shell with the command below in your terminal (refer to the Mac/Windows installation pages if you don't have Spark installed yet).

./bin/spark-shell

You can see in the screenshot above that a SparkContext is automatically created for you with all (*) local resources and an application id. You can also check the Spark context by running the sc command; res0 is simply result set zero for the command sc. We already read about SparkContext in the previous blog.

1. Create an RDD, say with the parallelize method and 2 partitions, as sketched below.
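The original post shows this step as a screenshot, and the exact input isn't reproduced in the text, so the following is only an assumed, minimal sketch of what step 1 could look like - a short string turned into a list of characters and parallelized across 2 partitions:

// Hypothetical input string; the screenshot's actual value isn't shown in this text.
val rdd = sc.parallelize("dataneb spark".toList, 2)
rdd.partitions.size   // 2 partitions, as configured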
This RDD is basically a list of characters distributed across 2 partitions.

2. Now you can either apply transformations to create new RDDs (building up the lineage) or simply apply an action to show a result. Let's first apply a few actions. res1 to res5 show you the result of each action - collect, first, count, take, reduce, saveAsTextFile. Note (lazy evaluation): only when you execute an action does Spark do the actual evaluation to produce the result. Now let's look at the sample.csv file, which is the result of the last action. Remember we created 2 partitions in the first step; that's the reason we have 2 files, part-00000 and part-00001, each holding an equal share of the data.

3. Now let's apply a few transformations in order to build the RDD lineage. Refer to the image shown above. In the first step we applied a filter transformation to filter the character 'a', creating a new RDD, MapPartitionsRDD[2], from our initial RDD ParallelCollectionRDD[0]. Similarly, in the third step we filtered the letter 'x' to create another RDD, MapPartitionsRDD[3]. In the last step we used the map and reduceByKey transformations to group the characters and get their counts, generating a new RDD, ShuffleRDD[5]. Since we applied 2 transformations (map and reduceByKey) on one RDD, you will notice RDD[4] is missing: Spark internally keeps the intermediate RDD[4] used to generate the resulting ShuffleRDD[5], so it is not printed in the output. ParallelCollectionRDD[0], MapPartitionsRDD[2], MapPartitionsRDD[3], RDD[4] and ShuffleRDD[5] together are what we call the lineage - you can think of it as the chain of intermediate RDDs that Spark needs in order to evaluate your next action.

4. Now, you can notice that res7, res8 and res9 are simply the actions we applied on the lineage RDDs to get the results.

Thank you!! If you liked the post or have any questions, please don't forget to write in the comments section below.

Next: Loading data in Apache Spark

  • What's Artificial Intelligence, Machine Learning, Deep Learning, Predictive Analytics, Data Science?

I never thought I would spend so much time understanding these high-profile terms. I was very confident that I knew, theoretically, everything necessary to start writing machine learning algorithms - until a couple of days back, when I asked myself: does my use case fall under machine learning, or is it artificial intelligence? Or is it predictive analytics? I began explaining it to myself but couldn't do it properly. I spent several hours reading about these topics, reading blogs, thinking, and ended up writing this post to answer my own question. I hope you will also find it helpful. Trust me, the most famous of these terms over the past couple of years has been "machine learning". The chart below shows the Google Trends data (interest over time) for these high-profile terms.

First, let's understand these terminologies individually. Keep the Venn diagram below in mind while you read further; it will help you distinguish the various terms. You know what I did just now? I asked your brain to recognize patterns. The human brain automatically recognizes such patterns (basically "deep learning") because your brain was trained on Venn diagrams somewhere in the past. By looking at the diagram, your brain can predict a few facts: deep learning is a subset of machine learning, artificial intelligence is the superset, and data science can spread across all of these technologies. Right? Trust me, if you showed this diagram to a prehistoric man, he would not understand anything - but your brain's "algorithms" are trained well enough with historical data to deduce and predict such facts.

Artificial Intelligence (AI)

Artificial intelligence is the broadest term. It originated in the 1950s and is the oldest of the terms we will discuss. In one line, artificial intelligence (AI) is a term for simulated intelligence in machines. The concept has always been about building machines capable of thinking like humans and mimicking human behaviour. The simplest example of AI is the chess game you play against a computer; a chess program was first proposed on paper in 1951. A more recent example would be self-driving cars, which have always been a subject of controversy. Artificial intelligence can be split into two branches: one is labelled "applied AI", which uses these principles of simulating human thought to carry out one specific task; the other is known as "generalized AI", which seeks to develop machine intelligences that can turn their hand to any task, much like a person.

Machine Learning (ML)

Machine learning is the subset of AI which originated in 1959. It evolved from the study of pattern recognition and computational learning theory in artificial intelligence. ML gives computers the ability to "learn" (i.e., progressively improve performance on a specific task) from data, without being explicitly programmed. You encounter machine learning almost every day. Think about:

Ride-sharing apps like Lyft and Uber - how do they determine the price of your ride?
Google Maps - how does it analyze traffic movement and predict your arrival time within seconds?
Spam filtering - emails going automatically to your spam folder?
Amazon Alexa, Apple Siri, Microsoft Cortana and Google Home - how do they recognize your speech?

Deep Learning (DL)

Deep learning (also known as hierarchical learning, deep machine learning or deep structured learning) is a subset of machine learning where the learning method is based on data representations, or feature learning.
It is a set of methods that allows a system to automatically discover, from raw data, the representations needed for feature detection or classification. Examples include:

Mobile check deposits - converting the handwriting on checks into actual text.
Facebook face recognition - seen Facebook suggesting names while tagging?
Colorization of black and white images.
Object recognition.

In short, the three terms (AI, ML and DL) can be related as shown below - recall the chess, spam email and object recognition examples (picture credit: blogs.nvidia).

Predictive Analytics (PA)

Under predictive analytics, the goal is much narrower: the intent is to compute the value of a particular variable at a future point in time. You can say predictive analytics is basically a sub-field of machine learning; machine learning is more versatile and capable of solving a wider range of problems. There are some techniques where machine learning and predictive analytics overlap, like linear and logistic regression, but others, like decision trees and random forests, are essentially machine learning techniques. Set these regression techniques aside for now - I will write detailed blogs about them.

How does Data Science relate to AI, ML, PA & DL?

Data science is a fairly general term for the processes and methods that analyze and manipulate data. It provides the ground on which to apply artificial intelligence, machine learning, predictive analytics and deep learning to find meaningful and appropriate information in large volumes of raw data with greater speed and efficiency.

Types of Machine Learning

Machine learning can be classified by the type of task you expect the machine to perform (supervised, unsupervised and reinforcement learning) or by the desired output, i.e. the data. In the end, the underlying algorithms - the techniques that get you the desired result - remain the same.

Regression: a type of problem where we need to predict a continuous response value, such as the value of a stock.
Classification: a type of problem where we predict a categorical response value and the data can be separated into specific "classes", such as whether an email is "spam" or "not spam".
Clustering: a type of problem where we group similar things together, such as grouping a set of tweets from Twitter.

I have tried to illustrate these types in the chart below; I hope you find it helpful. Please don't limit yourself to the regression, classification and clustering algorithms shown here - there are a number of other algorithms being developed and used worldwide. Ask yourself which technique fits your requirement.

Thank you folks!! If you have any questions, please mention them in the comments section below.

#MachineLearning #ArtificialIntelligence #DeepLearning #DataScience #PredictiveAnalytics #regression #classification #cluster

Next: Spark Interview Questions and Answers
