I just watched an 18-minute video introducing Big Data & Hadoop on Udemy. Here’s a link https://www.udemy.com/hadoopstarterkit/learn/ to the course I’ve enrolled in, if you’d like to enroll too. I’d like to summarize what I learned.
What is Big Data?
Three main factors help define big data: Volume, Velocity, and Variety.
Let me take the example of an imaginary startup company that has around 1 TB of data in its initial phase. How do we classify that data? Does it qualify as big data? Well, if the amount of data stays stable throughout the lifetime of the company, is it big data? Certainly not. For a data set to be called big data, it should have a high growth rate, steadily increasing the volume of the data, and it should come in different varieties (text, pictures, PDFs, etc.).
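The growth-rate idea above can be sketched with a few lines of Python. The 1 TB starting size is taken from the startup example; the 20% monthly growth rate and the 1 PB target are made-up numbers purely for illustration.

```python
# Hypothetical illustration of the "volume" and "velocity" factors:
# project how fast a data set grows at a fixed monthly growth rate.
# Starting size (1 TB) comes from the startup example; the 20% rate
# and 1 PB target are invented for this sketch.

def months_to_reach(start_tb, monthly_growth, target_tb):
    """Return how many months until the data set reaches target_tb."""
    size, months = start_tb, 0
    while size < target_tb:
        size *= 1 + monthly_growth
        months += 1
    return months

# A 1 TB data set growing 20% per month passes 1 PB (1024 TB) in 39 months.
print(months_to_reach(1.0, 0.20, 1024.0))  # 39
```

Even a modest-looking monthly growth rate compounds quickly, which is exactly why volume and velocity matter together.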
Here are some examples of big data.
Companies like Amazon monitor not only your purchase history and wishlist but also every click, recording all these patterns and processing this huge amount of data to give us a better recommendation system.
Here’s what NASA has to say about big data.
“In the time it took you to read this sentence, NASA gathered approximately 1.73 gigabytes of data from our nearly 100 currently active missions! We do this every hour, every day, every year – and the collection rate is growing exponentially. – See more at: http://open.nasa.gov/blog/2012/10/04/what-is-nasa-doing-with-big-data-today/”
Big Data Challenges
Storage – Data should be stored as efficiently as possible, both in terms of hardware and in how the data is processed and retrieved.
Computation Efficiency – The way data is stored should also make it suitable for computation.
Data Loss – Data may be lost due to hardware failure and other causes, so data recovery strategies must be robust.
Time – Big data exists to be analyzed and processed, so the time needed to process the data set should be minimal.
Cost – The system should provide huge storage capacity while remaining cost-effective.
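The data-loss challenge above is usually addressed by replication: keeping several copies of each block on different machines (HDFS does this with a default replication factor of 3). Here is a toy Python sketch of the idea; the node names, block IDs, and cluster size are all invented.

```python
import random

# Toy sketch of replication as a data-recovery strategy.
# HDFS uses a similar idea with a default replication factor of 3.
# Node names and block IDs below are made up for illustration.

def place_replicas(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes."""
    return {block: random.sample(nodes, replication) for block in blocks}

def surviving_copies(placement, failed_node):
    """Copies of each block still available after one node fails."""
    return {block: [n for n in hosts if n != failed_node]
            for block, hosts in placement.items()}

nodes = ["node1", "node2", "node3", "node4", "node5"]
placement = place_replicas(["blk_1", "blk_2"], nodes)
after_failure = surviving_copies(placement, "node1")

# With 3 replicas, every block keeps at least 2 live copies
# after any single node failure.
print(all(len(copies) >= 2 for copies in after_failure.values()))  # True
```

The trade-off is storage overhead: three replicas triple the raw space used, in exchange for surviving node failures without losing data.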
The main issue with a traditional RDBMS is scalability. As the data grows, processing time increases and the number of tables becomes unmanageable, forcing us to denormalize. Queries may also need to be rewritten for efficiency. Furthermore, an RDBMS handles structured data only; once the data arrives in various formats, an RDBMS cannot be used.
Grid computing distributes work across nodes, so it is good for compute-intensive jobs. However, it does not perform well on large data sets, and it requires programming in a lower-level language like C.
A good solution: Hadoop
Supports huge volumes of data
Storage efficiency, both in terms of hardware and processing/retrieval
Good data recovery
Horizontal scaling – processing time stays minimal
Easy for both programmers and non-programmers
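Hadoop processes data with the MapReduce model: a map phase emits key–value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. The sketch below imitates those three phases in plain, single-process Python on the classic word-count problem; a real Hadoop job would distribute the map and reduce tasks across the cluster's nodes, and the sample lines are invented.

```python
from collections import defaultdict

# Minimal single-process sketch of Hadoop's MapReduce model
# (word count). A real job would run map and reduce tasks in
# parallel across many nodes; here the phases run sequentially.

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    """Shuffle: group emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["big data needs big storage", "hadoop handles big data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["big"])   # 3
print(counts["data"])  # 2
```

Because each map task works on its own chunk of input, adding machines (horizontal scaling) lets the same job run over far more data in roughly the same time.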
Is Hadoop replacing RDBMS?
So is Hadoop going to replace RDBMS? No. Hadoop and RDBMS are different tools, each better suited to specific purposes.
Storage: petabytes
Made of commodity computers – cost-effective, though not enterprise-level hardware
Batch processing system
Dynamic schema (supports files of different formats)
By contrast, with an RDBMS the cost may increase steeply with volume