Dear Friends,
A couple of months back, I published a post on SQL Server 2016 new features here.
Meanwhile, let me introduce you to Hadoop. We will learn this as a series of inter-related posts, so don't miss any post in between and read them in order. Let's make learning Hadoop fun and interesting.
So, let's understand the formats of data that we handle in the real world.
- Flat File
- Rows and Columns
- Images and Documents
- Audio and Video
- XML Data and many more...
Big Data is the ocean of data that an organization stores. This data comes with three V's, i.e. Volume, Velocity and Variety.
Nowadays, a huge Volume of data is generated by many sources such as Facebook, WhatsApp, e-commerce sites and so on. This huge volume of data is generated with high Velocity; you could say it multiplies every second, every minute, every hour. Along with the huge Volume and high Velocity, a wide Variety of data is generated in different forms.
This data can be in any format, i.e. structured, semi-structured or unstructured. Data stored in the form of rows and columns is well defined as structured data, whereas data in the form of documents, images, SMS, video, audio, etc. is categorized as unstructured, and data in HTML or XML format is semi-structured.
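To make these categories concrete, here is a tiny, purely illustrative Java sketch (the customer record and its values are made up) showing the same kind of information as a structured row, a semi-structured XML string and unstructured free text:

```java
// Purely illustrative: the "customer" record and values below are hypothetical.
public class DataFormatsExample {
    public static void main(String[] args) {
        // Structured: fixed rows and columns, like a row in an RDBMS table
        String[] structuredRow = { "101", "Asha", "asha@example.com" };

        // Semi-structured: self-describing tags but no rigid schema (XML, HTML, JSON)
        String semiStructured =
                "<customer><id>101</id><name>Asha</name><email>asha@example.com</email></customer>";

        // Unstructured: free-form content such as an email body, image or audio bytes
        String unstructured = "Hi team, please find the invoice attached. Thanks, Asha";

        System.out.println(structuredRow[1] + " | " + semiStructured.length()
                + " chars of XML | " + unstructured.length() + " chars of free text");
    }
}
```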
Q. I am sure you must be thinking: how does an RDBMS handle this kind of unstructured or semi-structured data in its database?
A. Well, to handle this kind of data we have special data types such as Varbinary(Max) and XML. The drawback is that when we store an image, it is kept in binary format within the database (or the actual image file is stored in Filestream or on the server itself). Hence there is a performance impact when storing and retrieving petabytes of data.
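For example, this is roughly how an image ends up as binary data inside SQL Server. Treat it as a minimal JDBC sketch under stated assumptions: the Photos table, its VARBINARY(MAX) Content column and the connection string are all hypothetical.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class StoreImageExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection string and table; adjust for your own environment.
        String url = "jdbc:sqlserver://localhost;databaseName=DemoDB;integratedSecurity=true";
        try (Connection con = DriverManager.getConnection(url);
             InputStream image = new FileInputStream("C:/temp/photo.jpg");
             PreparedStatement ps = con.prepareStatement(
                     "INSERT INTO Photos (Name, Content) VALUES (?, ?)")) {
            ps.setString(1, "photo.jpg");
            // The file is streamed in and stored as raw binary inside the database
            // (or pushed out to FILESTREAM storage if the column is configured for it).
            ps.setBinaryStream(2, image);
            ps.executeUpdate();
        }
    }
}
```

Either way, reading the image back means pulling the whole blob through the database engine, which is where the performance hit shows up at scale.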
Moreover, Big Data is not just about maintaining and growing data year on year; it is also about how you manage this data to make informed decisions. Big Data can also come in various complex formats, and to manage and process this type of data we need a large cluster of servers.
With this introduction to Big Data, now let me introduce you to Hadoop.
Hadoop is a large cluster of servers built to process large sets of data. It has two main core components, i.e. 'Hadoop MapReduce' (the processing part) and the 'Hadoop Distributed File System' (the storage part). The Hadoop project comes under Apache, which is why it is called 'Apache Hadoop'. The idea behind these two core components came into existence when Google released white papers on its 'Google File System (GFS)' and 'MapReduce' projects in 2003 and 2004.
Hadoop was created by Doug Cutting in 2005. Cutting was working at Yahoo! at the time he built the software, and he named it after his son's toy elephant.
Wikipedia defines Hadoop as "an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware".
Hadoop is an open-source framework, available for free as well as commercial use under the Apache license. It allows the distribution and processing of datasets across a large cluster. On top of this, there are many more applications built by other organisations that use Hadoop or continuously work on this product, all of which come under the 'Hadoop Ecosystem'.
Check here to find the list of projects under the Hadoop Ecosystem. As we move further, we will see posts on the important projects under the Hadoop Ecosystem.
The Apache Hadoop architecture consists of the following components:
- Hadoop Common: Contains the libraries and other utilities needed by the other Hadoop modules.
- Hadoop Distributed File System (HDFS): A cluster of servers with commodity storage, used for storing data across the cluster.
- Hadoop YARN: This component is used for job scheduling and resource management in the cluster.
- Hadoop MapReduce: The processing of the data is done by this component (a minimal example follows this list).
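To give a feel for the MapReduce processing model mentioned above, here is the classic word-count job written against the standard Hadoop MapReduce Java API. It is a minimal sketch rather than a production job: the input and output paths are placeholders passed on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // The paths below are HDFS locations: HDFS (the storage part) holds the data,
        // while MapReduce (the processing part) runs over it in parallel.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

You would typically package this into a jar and submit it to the cluster (for example, `hadoop jar wordcount.jar WordCount /input /output`); HDFS stores the input and output files, while YARN schedules the map and reduce tasks across the cluster.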
By now, I assume you have a fair idea about this technology. But how do you get into Hadoop, and who is the best person to do so?
The technical answer is that anyone interested in learning this technology can get in one of two ways:
- As a Developer
- As an Administrator
- As a Developer: Hadoop is a framework built in Java, so having a Java background makes it easier to become a Hadoop Developer. With the growing popularity of Hadoop, this is nowadays one of the most common designations you will find on job sites.
- As an Administrator: Most organisations with a Hadoop installation prefer a part-time or full-time administrator to manage their Hadoop clusters. It is not compulsory for the admin to know Java to learn this technology, though they should have some basics for troubleshooting. Candidates with a Database Admin background (SQL Server, Oracle, etc.) who already have troubleshooting, server maintenance and disaster recovery knowledge are preferred, and anyone with Network, Storage or Server Admin (Windows\Linux) skills is the other best choice. This post here explains in detail who is best suited for Hadoop.
The following questions might be on your mind if you want to get started as a Hadoop Admin:
- Do we need any DBA skills? Of course yes, if we need to administer the Hadoop cluster (maintenance, monitoring, configuration, troubleshooting, etc.).
- Do we need to learn Java? Yes; at least some basics to understand Java errors while troubleshooting an issue.
- Do we need to understand non-RDBMS databases? Yes; the Hadoop ecosystem works with both SQL and NoSQL (Not only SQL), so having knowledge of non-RDBMS products is important.
- Do we need to learn Linux too? Yes; at least the basics.
In our next post we will see the concept of HDFS (Hadoop Distributed File System).
Interested in learning SQL Server Clustering? Check here. Stay tuned for many more updates...
Keep Learning and Enjoy Learning!!!