Posts Tagged: "gdata python"

Grab Whois Information And Write To Google Spreadsheet

Hello guys, here I am with yet another program that can benefit you and many search engine optimizers. By the end of this read you will be able to write a program that extracts the whois information for a number of domains stored in a text file and writes it to a Google spreadsheet, which has become a common medium for sharing data and findings online. As a search engine optimizer, you need to keep track of a number of websites, including your competitors'. Here I offer you a simple Python program to do that. On the other hand, if you are not an SEO expert (I am not one either), you can still use this script to track the websites you follow.

Prerequisites before beginning to code

We are going to have two files. One is a .py file where we code our program. The other is a text file with a .txt extension where we store the domain names we want whois information for. The text file must contain one domain name per line, in the format www.domain.com.
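To make the expected file format concrete, here is a minimal sketch of reading such a file. The helper name read_domains is only an illustration and is not part of the script below; the sketch assumes one domain per line and simply ignores blank lines.

```python
import io

def read_domains(path):
    """Return the non-empty lines of a text file, one domain per entry."""
    domains = []
    with io.open(path, encoding="utf-8") as f:
        for line in f:
            domain = line.strip()   # drop the trailing newline and stray spaces
            if domain:              # skip blank lines
                domains.append(domain)
    return domains
```

Stripping each line matters because a line read from a file still carries its newline character, which would otherwise end up in the lookup url and in the spreadsheet cell.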

Next, we need to create a Google spreadsheet where we intend to write the whois information so we can share it with others. Direct your browser to https://docs.google.com/spreadsheets/ and create a new spreadsheet named “Whois Info”. Once done, create three columns by typing the headers “website”, “whoisinformation” and “date” into the first row. The domain name will go under the column website, the whois information under the column whoisinformation, and the date we queried the whois information under the column date.

Python code to extract whois information and write to google spreadsheet

from bs4 import BeautifulSoup
from urllib2 import urlopen
import gdata.spreadsheet.service
import datetime
rowdict = {}
rowdict['date'] = str(datetime.date.today())
spread_sheet_id = '1zE8Qe8wmC271hG2uW4XE68btUks79xX0OG-O4KDl_Mo'
worksheet_id = 'od6'
client = gdata.spreadsheet.service.SpreadsheetsService()
client.debug = True
client.email = "email@domain.com"
client.password = 'password'
client.source = 'whoisinfo'
client.ProgrammaticLogin()
with open('websitesforwhois.txt') as f:
    for line in f:
        soup = BeautifulSoup(urlopen("http://www.checkdomain.com/cgi-bin/checkdomain.pl?domain=" + line.strip()).read(), "html.parser")
        for pre in soup.find_all("pre"):
            whois_info = str(pre.string)
        #print whois_info
        rowdict['website'] = line.strip()  # strip the trailing newline
        rowdict['whoisinformation'] = whois_info
        client.InsertRow(rowdict,spread_sheet_id, worksheet_id)

1. Line 1 to 4

These are the import statements. We use BeautifulSoup to make a soup object out of a url response, urlopen to get the response of a url, gdata to access the Google spreadsheet, and datetime to get the current system date.

2. Line 5 and 6

Our program needs to access the Google spreadsheet and write to it, hence the gdata module. In order to write to the spreadsheet, we pass the data as a dictionary of key:value pairs (similar in shape to a JSON object). rowdict is the variable storing the data to pass to the Google spreadsheet. On line 6, we store the current date under the key “date”, which, if you remember, is a column in our spreadsheet.

3. Line 7 to 14

Lines 7 to 14 connect to a specific Google spreadsheet. We require spread_sheet_id and worksheet_id. Take a look at the url of your spreadsheet. It looks something like this one:

https://docs.google.com/spreadsheets/d/1VbNph0TfFetKLU8hphrEyuNXlJ-7m628p8Sbu82o8lU/edit#gid=0

The spreadsheet id (mentioned earlier) is present in the url, between /d/ and /edit. “1VbNph0TfFetKLU8hphrEyuNXlJ-7m628p8Sbu82o8lU” in the above url is the spreadsheet id we need. By default the id of the first worksheet is ‘od6’.
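If you would rather not copy the id by hand, a small helper can pull it out of the url. This is just a sketch; the function name spreadsheet_id_from_url is made up for illustration, and it assumes the url has the /spreadsheets/d/<id>/ shape shown above.

```python
import re

def spreadsheet_id_from_url(url):
    """Return the id that follows /spreadsheets/d/ in the url, or None."""
    match = re.search(r"/spreadsheets/d/([A-Za-z0-9_-]+)", url)
    return match.group(1) if match else None

url = "https://docs.google.com/spreadsheets/d/1VbNph0TfFetKLU8hphrEyuNXlJ-7m628p8Sbu82o8lU/edit#gid=0"
sheet_id = spreadsheet_id_from_url(url)
```

The returned string is what the script stores in spread_sheet_id.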

On line 13, client.source is assigned the string ‘whoisinfo’. As far as I can tell, gdata treats this value as a label identifying the program that makes the request rather than as a way to find the spreadsheet; the spreadsheet itself is located through spread_sheet_id. Here we simply reuse our spreadsheet name “Whois Info”, written in lowercase without the space.

4. Line 15 to 16

Line 15 opens the text file where we’ve stored the domain names. Line 16 iterates through the lines of the file. At each iteration, the domain name on the current line is stored in the variable line.

5. Line 17

On line 17, we query the page that gives us the whois information and make a soup object out of it by passing the url response to BeautifulSoup. The reason for making a soup object is that we can then reach the data we need via tags, and the data we need is inside a <pre></pre> tag.

6. Line 18 to 19
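The url on line 17 is built by gluing the domain onto a fixed prefix. Here is a minimal sketch of that step on its own, with the newline removed and the domain escaped; the names LOOKUP_URL and whois_lookup_url are hypothetical, and the import is written so it works on either Python 2 or Python 3.

```python
try:
    from urllib.parse import quote   # Python 3
except ImportError:
    from urllib import quote         # Python 2

# Prefix taken from the script above.
LOOKUP_URL = "http://www.checkdomain.com/cgi-bin/checkdomain.pl?domain="

def whois_lookup_url(line):
    """Build the lookup url for one line of the domains file."""
    domain = line.strip()            # remove the newline read from the file
    return LOOKUP_URL + quote(domain)

example = whois_lookup_url("www.example.com\n")
```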

We know that there is only one “pre” tag in the soup object. We therefore iterate to find that pre tag and store the text inside it in the variable whois_info.

7. Line 21 to 23
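To show what “grab the text inside the pre tag” means without installing anything, here is a rough sketch that uses Python’s built-in HTMLParser in place of BeautifulSoup. It is not what the script does, only the same idea with the standard library. The names FirstPreExtractor and first_pre_text are made up for illustration, and the sketch keeps the first pre tag it meets.

```python
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2

class FirstPreExtractor(HTMLParser):
    """Collect the text found inside the first <pre>...</pre> pair."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_pre = False
        self.done = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre" and not self.done:
            self.in_pre = True

    def handle_endtag(self, tag):
        if tag == "pre" and self.in_pre:
            self.in_pre = False
            self.done = True              # ignore any later pre tags

    def handle_data(self, data):
        if self.in_pre:
            self.chunks.append(data)

def first_pre_text(html):
    """Return the text of the first closed pre tag, or None if there is none."""
    parser = FirstPreExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks) if parser.done else None
```

Returning None when no pre tag is found makes it easy to skip a domain whose lookup page came back without any whois text.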

On line 21, we assign the domain name to the key “website” of the dictionary rowdict. On line 22, we assign the whois information stored in the variable whois_info to the key “whoisinformation”. Note that each key of the dictionary must match a column header in our spreadsheet. Line 23 pushes the dictionary to the Google spreadsheet, where it is written as a new row. The iteration continues until the domain names in the text file are exhausted.
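Put together, the row that gets pushed for each domain looks like the dictionary built below. This is only a sketch; build_row is a hypothetical helper, and the keys are the three column headers we created in the spreadsheet.

```python
import datetime

def build_row(domain, whois_info, today=None):
    """Return the dictionary for one spreadsheet row."""
    if today is None:
        today = datetime.date.today()
    return {
        "website": domain.strip(),        # column "website"
        "whoisinformation": whois_info,   # column "whoisinformation"
        "date": str(today),               # column "date", e.g. 2015-01-02
    }

row = build_row("www.example.com\n", "Registrar: Example", datetime.date(2015, 1, 2))
```

Building a fresh dictionary per domain also avoids carrying a value over from the previous iteration by accident.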

If you have any questions about the article or the code, please mention them in the comments below so we can discuss. Thanks for reading.

Is It A WordPress Website Checker Script In Python


By the end of this read, you will be able to code a program that goes through a number of domain names stored in a text file and checks whether or not each website is powered by WordPress. In this program we will write the result for each website to a Google spreadsheet for later use, but you could just as well run some other code on the spot whenever a website turns out to be WordPress powered.

Python script to check if a bunch of websites are wordpress powered

Below is the program, written in Python, which reads a text file storing the domain names one per line, checks whether each one is a WordPress website, and writes the status to a Google spreadsheet.

from bs4 import BeautifulSoup
import mechanize
import gdata.spreadsheet.service
import datetime
rowdict = {}
rowdict['date'] = str(datetime.date.today())
spread_sheet_id = '1mvebf95F5c_QUCg4Oep_oRkDRe40QRDzVfAzt29Y_QA'
worksheet_id = 'od6'
client = gdata.spreadsheet.service.SpreadsheetsService()
client.debug = True
client.email = "email@domain.com"
client.password = 'password'
client.source = 'iswordpress'
client.ProgrammaticLogin()
br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [("User-agent","Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.13) Gecko/20101206 Ubuntu/10.10 (maverick) Firefox/3.6.13")]
base_url = br.open("http://www.isitwp.com/")
with open('websitesforwpcheck.txt') as f:
    for line in f:
        rowdict['website'] = line.strip()  # strip the trailing newline
        br.select_form(nr=0)
        br["q"] = line.strip()
        isitwp_response = br.submit()
        isitwp_response = isitwp_response.read()
        if "Good news everyone" in isitwp_response:
            rowdict['iswordpresswebsite'] = "yes"
        else:
            rowdict['iswordpresswebsite'] = "no"
        client.InsertRow(rowdict,spread_sheet_id, worksheet_id)

Isitwordpresswebsite code explanation

Before beginning, we need to create a spreadsheet and make three columns with the headers website, date and iswordpresswebsite.

1. Line 1 to 4

These are the import statements for the libraries we will use in our program. BeautifulSoup can turn a response into a soup object, although this script ends up checking the raw response text directly. Mechanize saves us from writing the session and cookie handling ourselves. Gdata is used to connect to our Google account in order to access the spreadsheet we want to work with. Datetime is used to get the current date at the time the script runs.

2. Line 5 and 6

In line 5, we create an empty dictionary where we later store key:value pairs for the date, the website name and the status of the website (is it WordPress or not). The spreadsheet client accepts such a dictionary and writes it to the spreadsheet as a row. In line 6 we store the current date at the time the script runs under the key “date”, which is pushed to the Google spreadsheet later in our program.

3. Line 7 to 14

Before proceeding, we need to create a Google spreadsheet. The next step is to take a look at the url of the spreadsheet we created, because we need the spreadsheet id to access the spreadsheet from our program. The spreadsheet id is the long string between /d/ and /edit in the url. The worksheet id of the first worksheet is ‘od6’ by default. Lines 9 to 14 log us in to our Google account and give us access to the spreadsheet we want to work with.

4. Line 15 to 18

In line 15, we use the mechanize module to start a browser. Line 16 tells it to ignore the robots.txt file. Line 17 adds a user agent to the browser. Line 18 opens the website “isitwp.com”, which we will use to check whether a website is WordPress powered or not.
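For readers without mechanize installed, the user agent part of this setup can be pictured with the standard library alone. The sketch below is not the script’s method; it only builds a request object carrying the same User-agent header and never sends it. The helper name build_request is hypothetical.

```python
try:
    from urllib.request import Request   # Python 3
except ImportError:
    from urllib2 import Request          # Python 2

# Same user agent string the script hands to mechanize.
USER_AGENT = ("Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.13) "
              "Gecko/20101206 Ubuntu/10.10 (maverick) Firefox/3.6.13")

def build_request(url):
    """Return a request object that identifies itself with USER_AGENT."""
    return Request(url, headers={"User-Agent": USER_AGENT})

req = build_request("http://www.isitwp.com/")
```

What this leaves out is exactly what mechanize gives us for free: cookies, form selection and form submission.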

5. Line 19

Line 19 opens the text file where we have the domain names (one per line).

6. Line 20 to 21

In line 20, we iterate through the lines of the file where we have stored the domain names. At each iteration, the domain name on the current line is stored in the variable line. In line 21, we create a key:value pair where we store the domain name of the current iteration under the key “website”.

7. Line 22 to 30

Line 22 selects the form at index 0, i.e. the first form present on the website isitwp.com. The value of the name attribute of the search box is “q”, so line 23 stores the domain name there before the form is submitted. Line 24 submits the form and stores the response in the variable isitwp_response. Line 25 reads the complete webpage out of that response and stores it back in isitwp_response. In line 26 we check whether the substring “Good news everyone” is present in the response, which means the website is powered by WordPress; otherwise it is not. Lines 27 to 29 then give the key “iswordpresswebsite” the value “yes” if the condition on line 26 passed and “no” if it failed. Remember that each key of the dictionary rowdict must be the header of a column in our spreadsheet. Line 30 pushes the dictionary to the spreadsheet.
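The yes/no decision itself is just a substring test, which can be tried out on its own without visiting the site. In this sketch the function name wordpress_status is made up, the marker phrase is the one the script looks for, and the two sample pages are invented for the demonstration.

```python
# Phrase the script looks for in the isitwp.com response.
MARKER = "Good news everyone"

def wordpress_status(page_text):
    """Return "yes" if the marker phrase appears in the page, else "no"."""
    return "yes" if MARKER in page_text else "no"

hit = wordpress_status("<h1>Good news everyone, it is WordPress</h1>")
miss = wordpress_status("<h1>Some other result page</h1>")
```

Because the check depends on the exact wording of the result page, it would need updating if the site ever changes that phrase.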

This way we can test whether a website is WordPress powered for a number of website names stored in a text file (one per line). Thanks for reading :) . If you have any questions regarding the article or the code, comment below so we can discuss.