Automate the Boring Stuff with Python

- - Python, Tutorials

A while ago, I enrolled in one of the best video lectures on Udemy. I recently completed it and would like to give a brief review. The course is called Automate the Boring Stuff with Python, and it is an excellent set of video lectures by Al Sweigart. It is suitable for people of any skill level: the lectures lay down the basics in the first few videos and gradually climb the ladder from there. I would call it a great motivation for a newcomer to Python.

Jumping straight to the topics: the following is the list of topics covered in the course, which, at the time of writing this article, has 29,500 students enrolled.

The lectures are chunked into 16 sections.

Section 1(Installation and Introduction)

This section covers the installation of Python and basics such as taking input from the user. More of an intro.

Section 2 (Flow Control)

The beginning of this section introduces flowcharts, how to work with them and why they matter. It then covers basic if-else statements and looping structures (while and for loops), along with comparison operators, boolean operators and monkeying around with them.

Section 3 (Functions)

Starts with built-in functions like print(), input() and len(). Introduces built-in modules and how to import them, such as math, which contains math-related functions, and moves on to calling the methods a module offers. Towards the end of this section, Al Sweigart explains writing your own functions and talks about local and global scoping.

Section 4 (Handling Errors)

Error-catching techniques in Python using the try/except block.

Section 5 (Writing a complete program using the things learned above)

A good point to start writing a complete program, so the tutorial moves on to building the classic guess-the-number program.
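For context, here is a minimal sketch of what such a guess-the-number program can look like (my own version, not the course's exact code):

import random

def guess_the_number():
    secret = random.randint(1, 20)              # the number the player has to guess
    for attempt in range(1, 7):                 # give the player six tries
        guess = int(input("Take a guess (1-20): "))
        if guess < secret:
            print("Too low.")
        elif guess > secret:
            print("Too high.")
        else:
            print("You got it in " + str(attempt) + " guess(es)!")
            return
    print("Out of tries. The number was " + str(secret) + ".")

guess_the_number()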

Section 6 (Lists)

This section covers the definition of lists, accessing items by index, slicing, and deleting items from a list. Additionally, the lectures show a graphical representation of how items in a list are accessed. Concatenating strings and lists is also covered. Using the in operator to find content in a list or string, and passing strings to the list() function, are discussed towards the end of this section. It also covers looping over the elements of a list, various built-in list methods, and finally a comparison between lists and strings.

Section 7 (Dictionary)

Starts with an introduction to yet another powerful data type in Python, the dictionary: creating them, iterating over them and so on. Further, the lecture talks about data structures and how they can model a problem, using a tic-tac-toe example program.

Section 8 (Strings)

This section adds more knowledge about string methods and string manipulation as well as formatting strings. Great content in this section.

Section 9 (Running Programs from command line)

The shebang line is introduced in this section, which I think is one of the most important things to include in a lecture.

Section 10 (Regular Expressions)

Section 10 has 5 video lectures altogether. The lectures begin with the basics of regex and advance towards topics like greedy/non-greedy matching, the findall() method, the sub() method, verbose mode, etc. The section ends by creating an email and phone number scraper.
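As a rough illustration of the kind of scraper that section builds, here is my own simplified sketch; the patterns are deliberately loose toy regexes, not the course's exact ones:

import re

text = "Reach us at support@example.com or 415-555-0198, fax 415-555-0199."

email_regex = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")   # loose email pattern, illustration only
phone_regex = re.compile(r"\d{3}-\d{3}-\d{4}")          # loose US-style phone pattern

print(email_regex.findall(text))   # ['support@example.com']
print(phone_regex.findall(text))   # ['415-555-0198', '415-555-0199']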

Section 11 (Files)

This section of the video course is dedicated to a detailed discussion of files. I think this is some of the most fundamental knowledge to have, since it is glued into every application you build, be it a web application or a small script. (In the long run, it also helps with configuring paths in Django and understanding exactly what is happening.) This section covers essentials like absolute and relative file paths, reading and writing plain text files, copying and moving files and folders, walking a directory tree and deleting files and folders.
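For reference, a minimal sketch of the standard-library calls behind those operations (the file and folder names here are made up for illustration):

import os
import shutil

print(os.path.abspath('notes.txt'))              # absolute path from a relative one

with open('notes.txt', 'w') as f:                # write a plain text file
    f.write('hello files\n')

shutil.copy('notes.txt', 'notes_copy.txt')       # copy a file
shutil.move('notes_copy.txt', 'backup.txt')      # move/rename a file

for folder, subfolders, files in os.walk('.'):   # walk a directory tree
    print(folder, files)

os.remove('backup.txt')                          # delete a file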

Section 12 (Debugging)

Walk through debugging techniques like assert and logging, etc.

Section 13 (Web Scraping)

Intro to modules such as webbrowser, requests, BeautifulSoup and selenium. Each of the mentioned modules has a dedicated video showcasing its methods and usage: parsing HTML using BeautifulSoup, controlling the browser with selenium, downloading files using the requests module and so on.
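To give a flavour of the requests + BeautifulSoup combination covered there, here is a tiny sketch of my own (the URL is just an example):

import requests
from bs4 import BeautifulSoup

response = requests.get('https://www.thetaranights.com')   # download a page
response.raise_for_status()

soup = BeautifulSoup(response.text, 'html.parser')          # parse the HTML
for link in soup.find_all('a'):                             # print every link on the page
    print(link.get('href'))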

Section 14 (Working with Excel, Word and PDF files)

In this portion of the lectures, various libraries such as openpyxl, PyPDF2, etc. are introduced and their use cases showcased. Reading and writing Excel files, reading PDF files, merging them, etc. are explained towards the end of this section.

Section 15 (Emails)

This section covers sending emails, checking emails, creating MIME objects and iterating over various folders in the email.

Section 16 (GUI Automation)

Introduction to pyautogui. You can read more about its usage in my article here. Covers controlling the mouse and keyboard, along with a delay between each click/keystroke, etc. It shows a game player built with pyautogui and assigns the task of creating a bot to play the 2048 game. Here is my assignment that plays 2048 on its own: https://github.com/bhishan/2048autoplay/

Concluding Words

It is an excellent video course. The name of the course, however, is misleading in the sense that it provides more content than it promises. Here is a link to the course if you'd like to enroll: https://www.udemy.com/automate

Thanks for reading guys. Share your thoughts on this post below in the comments section.

Integrating Google APIs using python – Slides API is fun

- - Python, Tutorials, Web

The Google Slides API (currently at version 1) is very interesting in the sense that it provides most of the features needed for creating presentations: setting the transparency of images, creating shapes and text boxes, stretching pictures to fit the template, choosing layouts, formatting text, replacing text throughout the presentation, duplicating slides and a lot more.

Since this is not a how-to article but just a regular blog post, I am not going to go into detail on using the APIs and explaining the code. Comment below and let me know if you'd be interested in a video tutorial on this very idea. If many are interested, I will cover the entire code walkthrough along with how to enable the APIs.
In this post, I will talk about one of the smaller projects I took on at Fiverr. If you are a regular reader, you might have noticed that I had been away from writing for quite a long time. In the meantime, I started selling services on Fiverr.

GOOGLE APIs and Automation

Google APIs are always interesting and allow developers to build products and services around them. It is even better when you integrate multiple APIs into a single product/service. I had used the Google Sheets API and Drive API in the past, but I hadn't yet used the Slides API. Since presentations actually reside in Drive itself, I like to think of Slides as a subset of Drive.

The task was to read a specific spreadsheet populated with content and then push that data into slides using a template stored in Drive itself. Each row in the spreadsheet corresponded to a specific entertainment keyword, with columns defining statistics such as mobile impressions, video impressions, audience type, overall impressions, an image file name, etc.

The images, again, were hosted in Drive and were to be used as the background image of the slide corresponding to each row in the spreadsheet.


I made use of the Python client library for Google APIs to complete the task. Installation goes like this:

pip install --upgrade google-api-python-client

In order to make use of Google APIs, you need to create a project on the Google developer console and activate the required APIs (in our case, the Drive API, Sheets API and Slides API). Once the project is created, you can download the OAuth 2.0 credentials as a JSON file and take it from there.
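Just to sketch what "take it from there" looks like, here is a rough outline of authorizing and building the service objects with google-auth-oauthlib and google-api-python-client. The credentials file name, spreadsheet/presentation IDs, range and placeholder text below are made-up examples, not the actual values from the project:

from google_auth_oauthlib.flow import InstalledAppFlow
from googleapiclient.discovery import build

SCOPES = ['https://www.googleapis.com/auth/drive',
          'https://www.googleapis.com/auth/spreadsheets',
          'https://www.googleapis.com/auth/presentations']

flow = InstalledAppFlow.from_client_secrets_file('credentials.json', SCOPES)
creds = flow.run_local_server(port=0)

sheets = build('sheets', 'v4', credentials=creds)
slides = build('slides', 'v1', credentials=creds)

# Read the populated rows from the spreadsheet.
rows = sheets.spreadsheets().values().get(
    spreadsheetId='SPREADSHEET_ID', range='Sheet1!A2:F').execute().get('values', [])

# Replace a placeholder in the presentation with the first row's keyword.
slides.presentations().batchUpdate(
    presentationId='PRESENTATION_ID',
    body={'requests': [{'replaceAllText': {
        'containsText': {'text': '{{keyword}}'},
        'replaceText': rows[0][0]}}]}).execute()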

Sneak Peek

Integrating Google APIs

I am going to wrap up this post here. If you are interested in a video tutorial, comment down below. Thanks for reading; I appreciate your time. Follow me on GitHub. If you are looking for automation scripts, you can message me on Fiverr.

Implementing Stack using List in Python – Python Programming Essentials

- - Python, Tutorials, Web

Intro

A stack is a collection of objects inserted and removed in a last-in, first-out (LIFO) fashion. Objects can be pushed onto a stack at any time, but only the most recently inserted object can be accessed or removed, which is why that object is called the top of the stack.

Realization of Stack Operations using List

 

Stack method | List realization | Running time
S.push(e) | L.append(e) | O(1)*
S.pop() | L.pop() | O(1)*
S.top() | L[-1] | O(1)
S.isempty() | len(L) == 0 | O(1)
len(S) | len(L) | O(1)

What is O(1)* ?

The running time for the push and pop operations is given as O(1)* in the above table. The asterisk denotes an amortized bound. Amortization is a principle used in the complexity analysis of data structures and algorithms; it should be used carefully and for special cases only.

Why did we use amortized analysis for push/pop?

The list (our stack's underlying data structure) is a sequence of object references that is ultimately realized by an array. The references are stored in a contiguous block of memory, which is what gives lists their constant-time indexing. A list therefore cannot grow without bound in place; it is allocated with some specific capacity. When there is no more room to add objects at the end of the list, a new, larger block of memory is allocated, all the existing objects are copied over, the new object is placed after the last one, and the previously held memory is released. So resizing is not required on every append, only once in a while. Hence the running time of a list append (a push on the stack) is O(1) for most elements, and O(1)* as a whole in the amortized sense, which accounts for the occasional resizing and copying of elements.

Similarly, for pop operations, shrinking of the underlying list is only done once in a while, which again gives an amortized complexity of O(1)*. You can actually observe the occasional resizing, as shown below.
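A quick way to see this growth pattern is to watch the memory held by a list as elements are appended; sys.getsizeof reports the size of the list object, and it jumps only at the occasional resize (the exact numbers vary across Python versions):

import sys

data = []
last_size = sys.getsizeof(data)
print("length 0 ->", last_size, "bytes")

for i in range(64):
    data.append(i)
    size = sys.getsizeof(data)
    if size != last_size:          # a resize (and copy) just happened
        print("length", len(data), "->", size, "bytes")
        last_size = size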

Implementation of Stack using List

 

class ListStack:
    def __init__(self):
        self._data = []

    def __len__(self):
        return len(self._data)


    def isempty(self):
        return len(self._data) == 0

    def top(self):
        return self._data[-1]

    def push(self, e):
        self._data.append(e)

    def pop(self):
        return self._data.pop()
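A small usage example of the class above (the values in the comments are the expected output):

stack = ListStack()
stack.push(5)
stack.push(3)
print(len(stack))        # 2
print(stack.pop())       # 3
print(stack.top())       # 5
print(stack.isempty())   # False
print(stack.pop())       # 5
print(stack.isempty())   # True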

Conclusion

The stack is an important data structure for realizing solutions to various programming problems. As such, it is all the more essential to understand the running-time characteristics and working mechanism of such data structures.

Follow me on GitHub

Hire me for a project on Fiverr

Raising and Handling Exceptions in Python – Python Programming Essentials

- - Tutorials

Brief Introduction

Any unexpected event that occurs during the execution of a program is known as an exception. Like everything else in Python, exceptions are objects: each one is an instance of the Exception class or of a class derived from that base class. Exceptions may occur due to logical errors in the program, running out of memory, etc.

Common Exception Types

Class | Description
Exception | A base class for most error types
AttributeError | Raised by the syntax obj.foo if obj has no member named foo
EOFError | Raised if "end of file" is reached for console or file input
IOError | Raised upon failure of an I/O operation (e.g., opening a file)
IndexError | Raised if an index to a sequence is out of bounds
KeyError | Raised if a nonexistent key is requested for a set or dictionary
KeyboardInterrupt | Raised if the user types Ctrl-C while the program is executing
NameError | Raised if a nonexistent identifier is used
StopIteration | Raised by next(iterator) if there is no next element
TypeError | Raised when the wrong type of parameter is sent to a function
ValueError | Raised when a parameter has an invalid value (e.g., sqrt(-5))
ZeroDivisionError | Raised when a division operator is used with 0 as the divisor

For example, the following produces a TypeError exception:

abs('hello world')  # expects a numeric parameter, but a string is given

Example of ValueError

Although the type of the passed parameter is correct, the value is illegitimate.

int('hello world')
int('3.14')

Raising an Exception

An exception can be raised from anywhere within the program through the keyword raise followed by an instance of any of the exception classes.

For example, when your program expects a positive integer to process but the input stream sends a negative one, you could raise an exception like this:

raise ValueError('Expecting a positive integer, got negative')  # instance of the ValueError exception class

Handling an Exception

Now that we have talked about raising an exception, we should program so that the exception is dealt with as required, otherwise the execution of the program terminates. It is advisable to catch each exception type separately, although Python also allows a more generic handler for any type of exception that may occur.

Examples of Common Usage:

try:
    result = x / y
except ZeroDivisionError:
    pass   # handle as required

Other common exception handling:

try:
    fp = open('sample.txt')
except IOError as e:
    print('Unable to open the file:', e)

Conclusion

Exceptions are an important principle of programming in any language and should be used wisely. On a concluding note, a try-except block can have a finally block as well; a typical use of finally is to close a connection regardless of whether the transmission of messages succeeded or failed. Additionally, a single try block can be paired with multiple except blocks catching various classes of exception, as in the sketch below.
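A small sketch of both ideas together (the file name is just an example):

fp = None
try:
    fp = open('sample.txt')
    value = int(fp.readline())
except IOError as e:
    print('Unable to open the file:', e)
except ValueError as e:
    print('The first line is not an integer:', e)
finally:
    if fp is not None:
        fp.close()    # always release the file handle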

Follow me on github https://github.com/bhishan

Hire me for a project https://fiverr.com/bhishan

Automation With Python – Python Codes To Create Dropbox Apps

- - Python, Tutorials

As promised in my earlier article on automating Dropbox signups with Python, I have come up with an article, along with the code, to create an app and fetch its API keys, which then allow us to access the files in Dropbox. Once again we stick to the selenium module for ease. In the last article, I explained a Python script to automate Dropbox signups. Now that we have enough cloud space spread across different accounts, we need to access the files in those spaces so we can use them as a file server. Dropbox provides a feature to create apps and hands out API keys for accessing the files in the account. Since we have multiple Dropbox accounts, we will automate the procedure of getting an API key for each.

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time

# Log in to Dropbox.
browser = webdriver.Firefox()
browser.get("https://dropbox.com/login")
list_of_inputs = browser.find_elements_by_xpath("//div/input[starts-with(@id,'pyxl')]")
list_of_inputs[0].send_keys("email@domain.com")
list_of_inputs[1].send_keys("password")
sign_in = browser.find_elements_by_xpath("//*[contains(text(),'Sign in')]")
sign_in[len(sign_in)-1].click()
time.sleep(10)

# Open the app-creation page and choose the type of app and file access.
browser.get("https://dropbox.com/developers/apps/create")
time.sleep(3)
type_of_app = browser.find_elements_by_xpath("//*[contains(text(),'Dropbox API app')]")
type_of_app[0].click()
file_access = browser.find_elements_by_xpath("//*[contains(text(),'My app needs access to files already on Dropbox.')]")
file_access[0].click()
type_of_file_access = browser.find_elements_by_xpath("//*[contains(text(),'My app needs access to a user')]")
type_of_file_access[0].click()

# Name the app and create it.
app_name = browser.find_element_by_name("name")
app_name.send_keys("appnamewhichisuniquelolo")
create_app = browser.find_elements_by_xpath("//*[contains(text(),'Create app')]")
create_app[1].click()
time.sleep(7)

# Scrape the app key and app secret from the app's settings page.
app_key_item = browser.find_element_by_class_name("app-key")
app_key = str(app_key_item.get_attribute('innerHTML'))
app_secret_item = browser.find_element_by_class_name("app-secret")
app_secret = app_secret_item.get_attribute('data-app-secret')
print(app_key, app_secret)

General Idea of Automation

The general idea for automation is to mimic the manual workflow and put it in a loop or assign it to a cron job (kind of the same thing, but not really). For creating apps on Dropbox, I did exactly that. The code is self-explanatory. We've used the selenium and time modules throughout the program: selenium for launching and interacting with the browser, and time.sleep(time_in_seconds) from the time module to pause. Depending on the speed of the internet connection, we need to tune these delays; failing to do so will make the program misbehave, since it will start looking for an element before the page has completely loaded. We fuel the program with the variety of methods selenium provides. The code above, however, only shows the procedure to create an app for a single account and print its API keys. In real usage you should loop over a file containing email IDs and passwords and save the API keys to some file. Hint: place a loop around the code and, once done with getting the API keys, log out of the current account, as sketched below.
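Here is a rough sketch of that loop. It assumes the steps above are wrapped into a helper called create_app_and_get_keys(browser, email, password) and that the accounts live in a file named accounts.txt with one email,password pair per line; the helper name, the file names and the logout URL are my own assumptions, not part of the original script:

import csv

from selenium import webdriver

browser = webdriver.Firefox()

with open('accounts.txt') as accounts, open('api_keys.csv', 'w') as out:
    writer = csv.writer(out)
    for entry in accounts:
        email, password = entry.strip().split(',')
        # Hypothetical wrapper around the sign-in and app-creation steps shown above.
        app_key, app_secret = create_app_and_get_keys(browser, email, password)
        writer.writerow([email, app_key, app_secret])
        browser.get("https://www.dropbox.com/logout")   # assumed logout endpoint

browser.quit()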

Do comment below on how you felt about the article. If you have any queries, please mention them below.

Announcement

I’ve joined twitter @bbhishan

Google Search Using Selenium And Python – Selenium Python Basics

- - Applications, Python, Tutorials

The intention of this post is to show, through examples, some of the most common methods of selenium. Selenium is a library used for automated browser testing; in this post, however, we will discuss using the selenium module in Python to make a Google search. The post breaks down into blocks explaining how to open a URL in the browser via selenium, check for the presence of a URL in a page, and click links present in a page. These are the necessities to get started with selenium.

Prerequisites
  1. Python
  2. selenium module in Python
  3. Chrome driver (http://chromedriver.chromium.org/downloads)

Installation of selenium through pip, on both Linux and Windows:

pip install selenium

Google search using selenium python
from selenium import webdriver

search_query = input("Enter the search query")
search_query = search_query.replace(' ', '+') #structuring our search query for search url.
executable_path = "/path/to/chromedriver"
browser = webdriver.Chrome(executable_path=executable_path)


for i in range(20):
    browser.get("https://www.google.com/search?q=" + search_query + "&start=" + str(10 * i))
    matched_elements = browser.find_elements_by_xpath('//a[starts-with(@href, "https://www.thetaranights.com")]')
    if matched_elements:
        matched_elements[0].click()
        break
1. Import statements (Line 1)

This import statement is required for initiating a browser later in our program and for passing URL parameters to its address bar. webdriver can be thought of as a driver for the browser. We call various methods on a webdriver.Chrome() instance to control the interaction with the browser.

2. Get query for google search (Line 3 and 4)

Here, we take the query for the Google search via input() in Python 3 (raw_input() in Python 2). Below is an example URL for a Google search; the spaces between the words must be replaced by "+", and the additional parameter start=0 specifies the first page of results. Similarly, start=10 gives the second page of results.

https://www.google.com/search?q=bhishan+bhandari&start=0

Hence, after taking the input from the user, we replace the spaces with +.

3. Instantiate a browser (Lines 5 and 6)

The statement browser = webdriver.Chrome(executable_path=executable_path) opens up a new browser window. We can also customize browser capabilities such as the download location, as sketched below.
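For instance, a default download directory can be set through Chrome options; this is a hedged sketch of my own (the paths are placeholders, and the exact keyword arguments differ slightly between selenium versions), not part of the original script:

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_experimental_option("prefs", {
    "download.default_directory": "/path/to/downloads",   # placeholder path
})
browser = webdriver.Chrome(executable_path="/path/to/chromedriver", options=options)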

4. Opening a URL in the browser (Line 10)

For opening a URL in the browser, all you need to do is pass the URL as an argument to the browser.get method. Remember it is browser.get because we instantiated the browser earlier with browser = webdriver.Chrome(executable_path=executable_path).

5. Searching for the presence of a certain URL/text in the search results (Lines 11 to 14)

The following method returns the browser elements matching the criterion that the href attribute of an anchor element starts with https://www.thetaranights.com:

browser.find_elements_by_xpath('//a[starts-with(@href, "https://www.thetaranights.com")]')

There is also an alternative method, find_element_by_xpath, for getting just the first element that matches the given XPath construct. We then check whether at least one element was returned by the statement above, and if so, we click the first matching element using the click() method, which opens the link in the browser. Since the result we are looking for has been found and clicked, we exit the loop; otherwise, we keep searching for a link with the above criteria for up to 20 pages. You can quit the browser using the browser.quit() method.

In general, we covered how to open a browser, search for a link in the body of a page and click the link. You may also like to read my article on how to log in to a website using selenium and Python.

How To Split And Merge PDF Documents

- - Uncategorized

Not the type of post I usually produce: a promotional review of a tool.

Everyone knows that PDF files are hard to work with. Apart from figuring out how to convert PDF documents, oftentimes we’re also trying to put together the best PDF document possible from other content.

But when those content sources are already in the PDF format, it can seem like an uphill battle just to get the content separated. More often than not, we need to figure out how to manipulate PDF documents at the page level.

Sometimes we may need to rework a PDF document by adding or removing a few pages. Manipulating PDF documents like this can seem intimidating at first.

If you have legal PDF documents your concern may be preserving the integrity of the PDF pages, or if you’re working with reports, you may be worried about deleting the original PDF pages for good.

Normally, you’d have to convert the entire PDF file into a Microsoft Word document, delete or insert the pages accordingly, and then convert it back to PDF. But there’s an easier way to do it.

With a tool like Able2Extract 10 from Investintech.com, you can merge and split your PDFs as easily as you can select a page. This latest version comes with features for converting, creating and editing PDF documents.

Under the latter category, Able2Extract 10 has added the ability to merge and split PDF files. It does this by letting you extract or insert PDF pages to your currently opened PDF document.

For instance, if you have blank pages or full-page images in a PDF that you'd like to remove or collect into one file, you can extract them into a completely separate file. Or, if you'd like to add some supplementary information to complement your existing PDF content, you can add it page by page into an existing PDF document easily.

Here’s a look at how this can be done with Able2Extract 10’s latest PDF splitting and merging feature.

To Merge PDFs:

1. Open the PDF you wish to add pages to in Able2Extract 10.

2. Click on Edit from the toolbar

 

3. From the side editing panel, select Insert From PDF

4. From the dialog that appears, select your PDF file from which you want to insert pages from. Click on Open.

How To Download Udemy Videos – Script For Downloading Udemy Videos

- - Web

This short post will walk through simple steps to download Udemy videos that are not downloadable from the website. Most of the paid Udemy courses, as well as some free ones, cannot be downloaded from udemy.com. I personally have around 200 courses in my account, and most of them were not available for download. Fortunately, I found a Python script on the internet that solved my problem easily: udemy-dl.

Installation of udemy-dl
  1. Install Python for your operating system, preferably Python3
  2. Install pip (python package installer) if not already installed with Python.
  3. Install udemy-dl via
    pip install udemy-dl

The udemy-dl installed from pip doesn't work, therefore we need to download it from GitHub instead. Follow the steps below:

    • Clone the repository, or download it as a zip and extract it, from https://github.com/r0oth3x49/udemy-dl.
    • Navigate to the folder through the command line/terminal: cd path/to/udemy-dl
    • Install the dependencies of udemy-dl: pip install -r requirements.txt

The above command will install all the prerequisites for udemy-dl to run correctly.

Using udemy-dl to download courses from your Udemy account

Open the command line/terminal, navigate to the udemy-dl folder that you downloaded, and enter the command python udemy-dl.py link_to_the_course_on_website

python udemy-dl.py https://www.udemy.com/COURSE_NAME

Example

python udemy-dl.py https://www.udemy.com/learn-how-to-deploy-docker-applications-to-production

You will be asked for the username and password for your udemy account. Once credentials are entered, it begins the download.

Thanks for reading

Website Mobile Friendly Tester Automation Script – Python Codes For Mobile Friendly Test

- - Python, Tutorials

Hey guys, I am back again with another script that may prove useful to website owners, search engine optimization experts as well as normal people like me. Through the code we write and discuss in this article, you will be able to check whether a website is mobile friendly or not. Well, here I offer a bonus: with this code you will be able to submit a whole batch of websites for a mobile friendly test at once. Why is it necessary? Here's the answer: as of the latest update to Google's search algorithm, the search engine lord now considers mobile friendliness a major ranking factor for a website.

Python script to automate mobile friendly test

Before we begin

Before we begin coding, let me make a few things clear. We will be writing two files: one a simple text file and the other a Python file. In the text file (websitesformobilefriendlytest.txt), we write the names of the domains we want to submit for a mobile friendly test, one per line in the format domain.com, i.e. without www.

from json import loads
import mechanize
br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [("User-agent","Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.13) Gecko/20101206 Ubuntu/10.10 (maverick) Firefox/3.6.13")]    

with open('websitesformobilefriendlytest.txt') as f:
    for line in f:
        google_results = br.open("https://www.googleapis.com/pagespeedonline/v3beta1/mobileReady?url=http://" + line.strip()).read()
        json_obj = loads(google_results)
        if json_obj["ruleGroups"]["USABILITY"]["pass"] == True:
            print("Congrats " + line.strip() + " is mobile friendly")
        else:
            print(line.strip() + " is not mobile friendly")

1. Line 1 and 2

These are the import statements: we will use the mechanize module to query the mobile friendly test via a browser it instantiates, and since the response is JSON, we import loads from json.

2. Line 3 to 5

On line 3 we use mechanize's Browser() to instantiate a browser. Line 4 tells it to ignore the robots.txt file. On line 5, we specify a user agent.

3. Line 7 to 14

Line 7 opens the text file where we previously stored the domain names. We can now reference the contents of the file via the variable f.

Line 8 is the start of the for loop which stores the name of the domain in the variable line on each iteration.

On line 9, we query a domain name/website for a mobile friendly test (stripping the trailing newline from line before building the URL). The specified URL returns the test result as a response, which we store in the variable google_results.

On line 10, we load the response as a JSON object into the variable json_obj.

Now on line 11, we have a conditional statement to check whether the website passed the mobile friendly test. The test result is a boolean value stored under the key "pass", which sits under the key "USABILITY", which in turn sits under the key "ruleGroups" in json_obj. Below is an example of how it may look.

{"ruleGroups": {"USABILITY": {"pass": True/False}}}

If the website passed the mobile friendly test, the value will be True else False. Based on the result, we then print whether a website is mobile compatible or not.

Mobile friendly tester which writes result to google spreadsheet

Well, here is the bonus code. Let me know if you have any questions about it in the comments section below. Also, here's a similar program (a "is it a WordPress website" checker script) with an explanation of the code, which can help you understand and implement this one. Thanks for reading :)

from json import loads
import mechanize
import gdata.spreadsheet.service
import datetime
rowdict = {}
rowdict['date'] = str(datetime.date.today())
spread_sheet_id = '13mX6ALRRtGlfCzyDNCqY-G_AqYV4TpE7rq1ZNNOcD_Q'
worksheet_id = 'od6'
client = gdata.spreadsheet.service.SpreadsheetsService()
client.debug = True
client.email = 'email@domain.com'
client.password = 'password'
client.source = 'mobilefriendlytest'
client.ProgrammaticLogin()

br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [("User-agent","Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.13) Gecko/20101206 Ubuntu/10.10 (maverick) Firefox/3.6.13")]

with open('websitesformobilefriendlytest.txt') as f:
    for line in f:
        google_results = br.open("https://www.googleapis.com/pagespeedonline/v3beta1/mobileReady?url=http://" + line.strip()).read()
        json_obj = loads(google_results)
        rowdict['website'] = line.strip()
        if json_obj["ruleGroups"]["USABILITY"]["pass"] == True:
            #print("Congrats " + line.strip() + " is mobile friendly")
            rowdict['ismobilefriendly'] = "yes"
        else:
            #print(line.strip() + " is not mobile friendly")
            rowdict['ismobilefriendly'] = "no"
        client.InsertRow(rowdict, spread_sheet_id, worksheet_id)

Hadoop Starter Kit – What Is Big Data?

- - Web

I just watched an 18-minute introductory video on Big Data & Hadoop on Udemy. Here's a link to the course I've enrolled in, in case you'd like to as well: https://www.udemy.com/hadoopstarterkit/learn/ . I would like to give a brief of what I learned.

What is Big Data?

There are mainly three factors that help define big data: volume, velocity and variety.

Let me take the example of an imaginary startup that has around 1 TB of data in its initial phase. How do we characterize this data? Does it qualify as big data? Well, if the amount of data is going to stay stable throughout the lifetime of the company, is it big data? Certainly not. For a data set to be called big data, it should have a good growth rate, thereby increasing the volume of the data, and should come in different varieties (text, pictures, PDFs, etc.).

Here are some of the examples of big data.

Companies like Amazon monitor not only your purchase history and wishlist but also every click, recording all the patterns and processing this huge amount of data, thereby giving us a better recommendation system.

Here’s what NASA has to say about big data.

In the time it took you to read this sentence, NASA gathered approximately 1.73 gigabytes of data from our nearly 100 currently active missions! We do this every hour, every day, every year – and the collection rate is growing exponentially. – See more at: http://open.nasa.gov/blog/2012/10/04/what-is-nasa-doing-with-big-data-today/

Have a look at this

https://gigaom.com/2012/08/22/facebook-is-collecting-your-data-500-terabytes-a-day/

Big Data Challenges

Storage – Storage of data should be as efficient as possible, both in terms of hardware and in terms of processing and retrieving the data.

Computation Efficiency – It should be suitable for computation

Data Loss – Data may be lost due to hardware failure and other reasons. Hence data recovery strategies must be good.

Time – Big data is basically for analysis and processing, hence the amount of time for processing the data set should be minimal.

Cost – It should provide huge space and should also be cost effective.

Traditional Solutions

RDBMS

The main issue is scalability. As the data grows, data processing takes longer, and an unmanageable number of tables forces us to denormalize. Queries may need to be rewritten for efficiency. Also, an RDBMS is for structured data sets only; once the data comes in various formats, an RDBMS cannot be used.

GRID Computing

Grid computing distributes work across nodes and hence is good for compute-intensive tasks. However, it does not perform well for big sets of data, and it requires programming in a lower-level language like C.

A good solution, HADOOP

Supports huge volume

Storage efficiency, both in terms of hardware and processing/retrieval

Good Data Recovery

Horizontal Scaling – Processing time is minimal

Cost Effective

Easy for programmers and non-programmers.

Is Hadoop replacing RDBMS?

So is Hadoop going to replace the RDBMS? No. Hadoop and the RDBMS are different tools, each better suited for specific purposes.

Hadoop

Storage: Petabytes

Horizontal Scaling

Cost Effective

Made of commodity computers. These are cost effective, yet still enterprise-level hardware.

Batch Processing System

Dynamic Schema (Different formats of files)

RDBMS

Storage: Gigabytes

Scaling limited

Cost may increase steeply with volume

Static Schema