Hadoop Starter Kit: What Is Big Data?


I just watched an 18-minute introductory video on Big Data & Hadoop on Udemy. Here's a link to the course I've enrolled in, in case you'd like to join too: https://www.udemy.com/hadoopstarterkit/learn/. I'd like to briefly share what I learned.

What is Big Data?

There are three main factors that help define big data: Volume, Velocity and Variety.

Let me take the example of an imaginary startup company that has around 1 TB of data in its initial phase. How do we classify that data? Does it qualify as big data? Well, if the amount of data is going to stay stable throughout the lifetime of the company, is it big data? Certainly not. For a data set to be called big data, it should keep arriving at a good rate (velocity), thereby growing in volume, and it should come in different varieties (text, pictures, PDFs, etc.).

Here are some examples of big data.

Companies like Amazon monitor not only your purchase history and wishlist but also every click, recording all these patterns and processing this huge amount of data to give us a better recommendation system.

Here’s what NASA has to say about big data.

In the time it took you to read this sentence, NASA gathered approximately 1.73 gigabytes of data from our nearly 100 currently active missions! We do this every hour, every day, every year – and the collection rate is growing exponentially. – See more at: http://open.nasa.gov/blog/2012/10/04/what-is-nasa-doing-with-big-data-today/

Have a look at this:

https://gigaom.com/2012/08/22/facebook-is-collecting-your-data-500-terabytes-a-day/

Big Data Challenges

Storage – Storage of data should be as efficient as possible, both in terms of hardware and of processing and retrieving the data.

Computation Efficiency – The stored data should be suitable for efficient computation.

Data Loss – Data may be lost due to hardware failure and other reasons, so data recovery strategies must be good.

Time – Big data exists for analysis and processing, so the time needed to process the data set should be minimal.

Cost – The system should provide huge storage space while remaining cost-effective.

Traditional Solutions

RDBMS

The main issue is scalability. As the data grows, processing time increases and the number of tables becomes unmanageable, forcing us to denormalize. The need may also arise to rewrite queries for efficiency. Moreover, an RDBMS works with structured data sets only; once the data comes in various formats, an RDBMS cannot be used.

GRID Computing

Grid computing distributes work across nodes and is therefore good for compute-intensive tasks. However, it does not perform well on big data sets, and it requires programming in a lower-level language like C.

A Good Solution: Hadoop

Supports huge volumes of data

Storage Efficiency, both in terms of hardware and of processing/retrieval

Good Data Recovery

Horizontal Scaling – Processing time stays minimal as machines are added

Cost Effective

Easy for both programmers and non-programmers (a sample job follows below).
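
To give a flavor of Hadoop programming, here is the classic word-count MapReduce job in Java, modeled on the standard Apache Hadoop tutorial example. Treat it as a minimal sketch: the input and output paths are placeholders passed on the command line.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: runs in parallel on each split of the input,
    // emitting (word, 1) for every token it sees.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce phase: receives all counts for one word and sums them.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation saves network traffic
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Hadoop splits the input across the cluster, runs the map tasks in parallel on each node, and merges the results in the reduce step. That parallelism is exactly where the horizontal scaling and minimal processing time above come from.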

Is Hadoop replacing RDBMS?

So is Hadoop going to replace the RDBMS? No. Hadoop is one thing and the RDBMS is another; each is better suited to specific purposes.

Hadoop

Storage: Petabytes

Horizontal Scaling

Cost Effective

Made of commodity computers. These are cost-effective, yet still enterprise-level hardware.

Batch Processing System

Dynamic Schema (Different formats of files)

RDBMS

Storage: Gigabytes

Scaling is limited

Cost may increase steeply with volume

Static Schema
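
As a small illustration of the "Dynamic Schema" and data-recovery points above, here is a sketch using the HDFS Java API. HDFS stores any file format as raw blocks and replicates each block across nodes; the NameNode URI and file paths below are hypothetical placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPutExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; replace with your cluster's URI.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        // HDFS is format-agnostic: text, pictures and PDFs are all just blocks.
        Path local = new Path("report.pdf");          // hypothetical local file
        Path remote = new Path("/data/report.pdf");   // destination in HDFS
        fs.copyFromLocalFile(local, remote);

        // Each block is kept in multiple copies (3 by default) on different
        // nodes, so a single disk or machine failure loses no data.
        fs.setReplication(remote, (short) 3);

        fs.close();
    }
}

Replication is also what makes commodity hardware viable: Hadoop assumes individual machines will fail and recovers from the surviving copies.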

