Grab Whois Information And Write To Google Spreadsheet

Hello guys, here I am with yet another program that can benefit you and many search engine optimizers. By the end of this read you will be able to write a program that extracts the whois information for a number of domains stored in a text file and writes it to a Google spreadsheet, which has become a common medium for sharing data and findings online. As a search engine optimizer, you need to keep track of a number of websites, including your competitors'. Here I offer you a simple Python program to do that. On the other hand, if you are not an SEO expert like myself, you can still use this script to track the websites you follow.

Prerequisites before beginning to code

We are going to have two files. One is a .py file where we write our program. The other is a text file with a .txt extension where we store the domain names we want to find whois information for. The text file must contain one domain name per line, in the format www.domain.com.

Next, we need to create a Google spreadsheet where we intend to write the whois information so we can share it with others. Direct your browser to https://docs.google.com/spreadsheets/ and create a new spreadsheet named “Whois Info”. Once done, create three columns by typing the headers “website”, “whoisinformation” and “date” into the first row. The domain name will go under the column website, the whois information under the column whoisinformation, and the date we queried the whois information under the column date.

Python code to extract whois information and write to google spreadsheet

from bs4 import BeautifulSoup
from urllib2 import urlopen
import gdata.spreadsheet.service
import datetime
rowdict = {}
rowdict['date'] = str(datetime.date.today())
spread_sheet_id = '1zE8Qe8wmC271hG2uW4XE68btUks79xX0OG-O4KDl_Mo'
worksheet_id = 'od6'
client = gdata.spreadsheet.service.SpreadsheetsService()
client.debug = True
client.email = "email@domain.com"
client.password = 'password'
client.source = 'whoisinfo'
client.ProgrammaticLogin()
with open('websitesforwhois.txt') as f:
    for line in f:
        soup = BeautifulSoup(urlopen("http://www.checkdomain.com/cgi-bin/checkdomain.pl?domain=" + line.strip()).read(), "html.parser")  # strip() drops the trailing newline
        for pre in soup.find_all("pre"):
            whois_info = str(pre.string)
        #print whois_info
        rowdict['website'] = line.strip()
        rowdict['whoisinformation'] = whois_info
        client.InsertRow(rowdict,spread_sheet_id, worksheet_id)

1. Line 1 to 4

These are the import statements. We use BeautifulSoup to make a soup object out of a URL response, urlopen to get the response of a URL, gdata to access the Google spreadsheet, and datetime to get the current system date.

2. Line 5 and 6

Our program needs to access the Google spreadsheet and write to it, hence we are using the gdata module. In order to write to the spreadsheet, we need to pass the data as a dictionary, that is, as key:value pairs (much like a JSON object). rowdict is the variable storing the data to pass to the Google spreadsheet. On line 6, we store the current date under the key “date”, which, if you remember, is a column in our spreadsheet.
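
As a rough sketch of the dictionary described above, the helper below builds one row per domain instead of reusing a single shared dictionary. The function name make_row and its day parameter are my own assumptions for illustration and are not part of the program above.

```python
import datetime

def make_row(website, whois_info, day=None):
    """Build one spreadsheet row as a dictionary of column name -> value."""
    # Fall back to today's date when no date is passed in.
    day = day or datetime.date.today()
    return {
        "website": website,
        "whoisinformation": whois_info,
        "date": str(day),  # str() on a date gives the YYYY-MM-DD form
    }
```

Building a fresh dictionary for each domain avoids accidentally carrying a value over from the previous iteration.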

3. Line 7 to 14

Lines 7 to 14 connect to a specific Google spreadsheet. We require spread_sheet_id and worksheet_id. Take a look at the URL of your spreadsheet. It looks something like this one:

https://docs.google.com/spreadsheets/d/1VbNph0TfFetKLU8hphrEyuNXlJ-7m628p8Sbu82o8lU/edit#gid=0

The spreadsheet id (mentioned earlier) is present in the URL. “1VbNph0TfFetKLU8hphrEyuNXlJ-7m628p8Sbu82o8lU” in the above URL is the spreadsheet id we need. By default the worksheet id is ‘od6’.
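
If you would rather not copy the id by hand, a small helper can pull it out of the URL. This is only a sketch: the function name spreadsheet_id_from_url is hypothetical, and the pattern assumes the URL has the /spreadsheets/d/<id>/ shape shown above.

```python
import re

def spreadsheet_id_from_url(url):
    """Return the id that follows /spreadsheets/d/ in the URL, or None."""
    # Assumption: the id is made of letters, digits, '-' and '_'.
    match = re.search(r"/spreadsheets/d/([A-Za-z0-9_-]+)", url)
    return match.group(1) if match else None

url = "https://docs.google.com/spreadsheets/d/1VbNph0TfFetKLU8hphrEyuNXlJ-7m628p8Sbu82o8lU/edit#gid=0"
print(spreadsheet_id_from_url(url))
```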

On line 13, client.source is assigned the string ‘whoisinfo’. This value identifies the application making the request to Google; it does not have to match the file name or the spreadsheet name. Remember we named our spreadsheet “Whois Info”, so here we simply reuse that name in lowercase with the white space removed.

4. Line 15 to 16

Line 15 opens the text file where we’ve stored the domain names. Line 16 iterates through each line in the file. At each iteration, the domain name on the current line is stored in the variable line.
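
One thing to keep in mind when iterating over a file is that each line still carries its trailing newline. Here is a minimal sketch of reading the domain list with that cleaned up; the function name read_domains is my own and is not part of the program above.

```python
def read_domains(path):
    """Return the domain names in the file, one per line, skipping blank lines."""
    with open(path) as f:
        # strip() removes the trailing newline and any stray spaces.
        return [line.strip() for line in f if line.strip()]
```

Calling read_domains('websitesforwhois.txt') would then give a plain list of domain names that is ready to loop over.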

5. Line 17

On line 17, we query the page that gives us the whois information and make a soup object out of it by passing the URL response to the BeautifulSoup constructor. The reason we are making a soup object is that it lets us access the required data via tags, and the data we need is inside a <pre></pre> tag.
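
The query URL itself can be built a little more defensively than with plain string concatenation. The sketch below assumes Python 3, where urlencode lives in urllib.parse (in Python 2 it is in urllib). The names BASE_URL and whois_query_url are hypothetical.

```python
from urllib.parse import urlencode

# The same lookup page the program queries.
BASE_URL = "http://www.checkdomain.com/cgi-bin/checkdomain.pl"

def whois_query_url(domain):
    """Build the lookup URL for one domain, escaping it for use in a query string."""
    # strip() drops the newline left over from reading the text file.
    return BASE_URL + "?" + urlencode({"domain": domain.strip()})
```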

6. Line 18 to 19

We know that there is only one “pre” tag in the soup object. We therefore iterate to find the pre tag and store the text inside it in the variable whois_info.
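
To see the idea of pulling text out of a pre tag without installing anything, here is a rough sketch that swaps BeautifulSoup for Python 3's built-in html.parser module. This is not what the program above does, and the names PreTextParser and pre_text are my own. It also returns None when the page has no pre tag, a case the loop above does not handle.

```python
from html.parser import HTMLParser

class PreTextParser(HTMLParser):
    """Collect the text found inside <pre> tags."""

    def __init__(self):
        super().__init__()
        self.in_pre = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.in_pre = True

    def handle_endtag(self, tag):
        if tag == "pre":
            self.in_pre = False

    def handle_data(self, data):
        # Keep only the text that appears while we are inside a <pre> tag.
        if self.in_pre:
            self.chunks.append(data)

def pre_text(html):
    """Return the text inside the page's <pre> tags, or None if there is none."""
    parser = PreTextParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks) or None
```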

7. Line 21 to 23

On line 21, we assign the domain name to the key “website” of the dictionary rowdict. On line 22, we assign the whois information stored in the variable whois_info to the key “whoisinformation” of rowdict. Note that the keys of the dictionary must match the column names in our spreadsheet. Line 23 pushes the dictionary to the Google spreadsheet and writes it as a new row. The iteration continues until all the domain names in the text file have been processed.
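
Since a mismatch between the dictionary keys and the column names is an easy mistake to make, a quick check before inserting the row can help. This is a sketch only; EXPECTED_COLUMNS and check_row are hypothetical names, and the column set is the one we created in the spreadsheet earlier.

```python
# The three column headers created in the spreadsheet.
EXPECTED_COLUMNS = {"website", "whoisinformation", "date"}

def check_row(row):
    """Return (missing, unexpected) key lists for a row dictionary."""
    keys = set(row)
    missing = sorted(EXPECTED_COLUMNS - keys)
    unexpected = sorted(keys - EXPECTED_COLUMNS)
    return missing, unexpected
```

If both lists come back empty, the keys line up with the columns and the row should be safe to push.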

If you have any questions or confusion regarding the article or code, please mention them below in the comments so we can discuss. Thanks for reading.