
What is Big Data Architecture? Ingest, Transform/Enrich and Publish

Updated: Oct 25, 2019


What is Big Data Architecture? How do we define it? Why do we need a Big Data Architecture? What are the various ways to ingest, transform, enrich and publish Big Data? These are common questions, so let's explore a few sample big data architectures.

 

With glittering new tools in the market and myriad buzzwords surrounding data operations, consumers of information often overlook the building process, believing insights gleaned from data are instantaneous and automated. We live in a “pre-AI” age where clear answers to qualitative questions derived from quantitative analysis still require human intervention. Yes, advanced data science gives us extensive means to visualize and cross-section data, but human beings are still needed to ask questions in a logical fashion and find the significance of the resulting insights.

Please note that the architecture above is just a sample; it varies depending upon the nature of the data and client requirements. We will discuss it in detail shortly.



What is the need for Big Data Architecture?

A big data architecture is designed to handle the ingestion, enrichment and processing of raw structured, semi-structured and unstructured data that is too large or complex for traditional database or data warehousing systems. The three V's - volume, velocity and variety - are the most common properties of Big Data. Whether we end up selecting the well-known Kappa or Lambda architecture, an understanding of the three V's and the nature of the data plays a crucial role in the design. For instance, if both the velocity and the volume of data are low, why not stay with a traditional database system? Instead, I have seen organizations rush to transform their traditional data warehouses into big data architectures simply because it's the shiny new thing in the market.


 

Let's categorize Big Data Architecture Workloads

  • Real-time processing, with data sources like IoT devices - I would rather call it "near" real-time processing (ingestion and enrichment take a few seconds). If the solution includes real-time sources, the architecture must include a way to capture and store real-time messages for stream processing. These streams are usually handled with an Apache Kafka and ZooKeeper pair, Amazon Simple Queue Service (SQS), JBoss A-MQ, RabbitMQ, IBM WebSphere MQ, Microsoft Message Queuing (MSMQ), etc. The Kappa architecture is well known for this type of workload (see the streaming sketch after this list).

  • Batch processing - Because the data sets are so large, a big data architecture often must process data files using long-running batch jobs to filter, aggregate, and otherwise prepare the data for analysis. Usually these jobs involve reading source files, processing them, and writing the output to new files. Options include running U-SQL jobs in Azure Data Lake Analytics, using Hive, Pig, or custom MapReduce jobs in an HDInsight Hadoop cluster, or using Java, Scala, or Python programs in an HDInsight Spark cluster. The Lambda architecture is generally used for this (see the batch sketch after this list).

  • Machine learning & predictive analytics - A common misconception is that predictive analytics and machine learning are the same thing. This is not the case. At its core, predictive analytics encompasses a variety of statistical techniques (including machine learning, predictive modeling and data mining) and uses statistics (both historical and current) to estimate, or ‘predict’, future outcomes. Machine learning, on the other hand, is a sub-field of computer science that gives ‘computers the ability to learn without being explicitly programmed’. Machine learning evolved from the study of pattern recognition and explores the notion that algorithms can learn from and make predictions on data. And, as they become more ‘intelligent’, these algorithms can go beyond their explicit program instructions to make highly accurate, data-driven decisions. R, Python and Scala are popular languages for these workloads.
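
As a rough illustration of the streaming workload above, here is a minimal sketch that consumes a Kafka topic with Spark Structured Streaming. It assumes the spark-sql-kafka connector is on the classpath; the broker address, topic name and checkpoint path are hypothetical placeholders, and the console sink is used only to keep the example self-contained.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object NearRealTimeIngest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("NearRealTimeIngest").getOrCreate()

    // Ingest raw events from a Kafka topic as an unbounded stream
    // (broker and topic names below are assumptions for the sketch)
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "iot-events")
      .load()
      .selectExpr("CAST(value AS STRING) AS payload", "timestamp")

    // Light enrichment: count events per one-minute window
    val counts = events
      .withWatermark("timestamp", "5 minutes")
      .groupBy(window(col("timestamp"), "1 minute"))
      .count()

    // Publish the running counts; a console sink keeps the example self-contained
    counts.writeStream
      .outputMode("update")
      .format("console")
      .option("checkpointLocation", "/tmp/checkpoints/iot-events")
      .start()
      .awaitTermination()
  }
}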
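
And a minimal batch-processing sketch in Spark: read source files, filter and aggregate them, and write the prepared output to new files. The file paths and column names (customer_id, amount) are assumptions made only for illustration.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object BatchPrepare {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("BatchPrepare").getOrCreate()

    // Read raw CSV files dropped by an upstream system (path is an assumption)
    val raw = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/data/landing/sales/*.csv")

    // Filter out bad records and aggregate per customer so the output is smaller
    val prepared = raw
      .filter(col("amount").isNotNull && col("amount") > 0)
      .groupBy("customer_id")
      .agg(sum("amount").as("total_amount"), count(lit(1)).as("txn_count"))

    // Write the prepared data to new files for downstream analysis
    prepared.write.mode("overwrite").parquet("/data/curated/sales_summary")
  }
}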


 

Big Data Architecture backbone: Data refinement is the key!


Serious data scientists need to make data refinement their first priority, and break down the data work into three steps:

  • Data Ingestion, or call it the Data Collection layer - People use different terminologies for this first layer, but its main focus is choosing the right technology for the Big Data architecture workload and project requirements. If the requirement demands real-time processing, we can use Kafka or any of the other real-time messaging systems mentioned earlier. If the source is just a flat file generated a few times a day, go with a simple file transfer protocol. Finally, don't forget that cost and third-party vendor support matter as well.

  • Data Enrichment, Transformation, Processing & Refinement - To be instrumentally useful, data must be converted into “answers” to questions. In other words, Big Data must get smaller after passing through this second layer. Don't pile up raw data that isn't in question, as it will dramatically slow down your process over time.

  • Data Publishing, or Delivery, the so-called Presentation Layer - Deliver the answers through optimized channels in the proper formats and frequency. This layer includes reporting, visualization, data exploration, ad-hoc querying and exporting datasets: visualization through Tableau, QlikView, etc., reporting through BOBJ, SSRS, etc., and ad-hoc querying using Hive, Impala, Spark SQL, etc. Further, the choice of technology depends upon the end users - administrators, business users, vendors, partners, etc. demand data in different formats. A compact end-to-end sketch of these three layers follows this list.
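
To make the three layers concrete, here is a compact end-to-end sketch in Spark that ingests a flat file, enriches it against a reference dataset, and publishes the refined result as a table for ad-hoc querying. The file paths, column names and the analytics database are hypothetical assumptions, not a prescribed implementation.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object IngestEnrichPublish {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("IngestEnrichPublish")
      .enableHiveSupport()   // the publish layer exposes a Hive table for ad-hoc SQL
      .getOrCreate()

    // 1. Ingestion: a flat file delivered a few times a day via simple file transfer
    val orders = spark.read.option("header", "true").csv("/data/incoming/orders.csv")

    // 2. Enrichment / refinement: join with reference data and keep only what answers the question
    val customers = spark.read.parquet("/data/reference/customers")
    val refined = orders
      .join(customers, Seq("customer_id"))
      .select("customer_id", "region", "order_total")
      .filter(col("order_total").isNotNull)

    // 3. Publish / presentation: save as a table that BI and ad-hoc query tools can read
    //    (assumes an existing "analytics" database in the metastore)
    refined.write.mode("overwrite").saveAsTable("analytics.daily_orders")
  }
}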



Data storage: Last but not least!


The Hadoop Distributed File System (HDFS) is the most commonly used storage framework in Big Data architecture; others are NoSQL data stores such as MongoDB, HBase, Cassandra, etc. One of the salient features of Hadoop storage is its ability to scale, self-manage and self-heal.

Things to consider while planning storage methodology:

  • Type of data (historical or incremental)

  • Format of data (structured, semi-structured and unstructured)

  • Analytical requirement that storage can support (synchronous & asynchronous)

  • Compression requirements

  • Frequency of incoming data

  • Query pattern on the data

  • Consumers of the data
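
As a small sketch of how some of these considerations (format, compression, frequency of incoming data and query pattern) translate into practice, the Spark job below appends incremental data to HDFS as partitioned, Snappy-compressed Parquet. The paths and the region partition column are assumptions made for illustration.

import org.apache.spark.sql.SparkSession

object StoreCurated {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("StoreCurated").getOrCreate()

    // Curated output from an earlier processing step (assumed path and schema)
    val curated = spark.read.parquet("/data/curated/sales_summary")

    curated.write
      .mode("append")                              // incremental data arriving daily
      .partitionBy("region")                       // aligned with the expected query pattern
      .option("compression", "snappy")             // compression requirement
      .parquet("hdfs:///warehouse/sales_summary")  // HDFS location (assumed)
  }
}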

Thank you! If you have any questions, please don't forget to mention them in the comments section below.




