Posts Tagged: "BeautifulSoup"

Web Scraping – BeautifulSoup Python

- - Python, Tutorials

Data collection from public sources is often beneficial to a business or an individual, so the term “web scraping” isn’t anything new. The data is usually wrangled out of HTML tags and attributes, and Python is a popular choice for the job. The intention of this post is to host example code snippets that people can borrow from to build scrapers to suit their needs, using the BeautifulSoup and urllib modules in Python. I will be using GitHub’s trending page https://github.com/trending throughout this post for the examples, mainly because it is well suited for demonstrating the various BeautifulSoup methods.

Installation:

pip install beautifulsoup4
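
The snippets in this post parse the html with the lxml parser, which is a separate package; the assumption here is that you either install it as well or swap "lxml" for Python's built-in "html.parser" in the examples below.

pip install lxml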

Get html of a page:
>>> import urllib.request
>>> resp = urllib.request.urlopen("https://github.com/trending")
>>> resp.getcode()
200
>>> resp.read() # the html
Using BeautifulSoup to get title from a page
>>> import urllib.request
>>> import bs4
>>> github_trending = urllib.request.urlopen("https://github.com/trending")
>>> trending_soup = bs4.BeautifulSoup(github_trending.read(), "lxml")
>>> trending_soup.title
<title>Trending  repositories on GitHub today · GitHub</title>
>>> trending_soup.title.string
'Trending  repositories on GitHub today · GitHub'
>>>
Find a single element by tag name, find multiple elements by tag name
>>> ordered_list = trending_soup.find('ol') #single element
>>>
>>> type(ordered_list)
<class 'bs4.element.Tag'>
>>>
>>> all_li = ordered_list.find_all('li') # multiple elements
>>>
>>> type(all_li)
<class 'bs4.element.ResultSet'>
>>>
>>> trending_repositories = [each_list.find('h3').text for each_list in all_li]
>>> for each_repository in trending_repositories:
...     print(each_repository.strip())
...
klauscfhq / taskbook
robinhood / faust
Avik-Jain / 100-Days-Of-ML-Code
jxnblk / mdx-deck
faressoft / terminalizer
trekhleb / javascript-algorithms
apexcharts / apexcharts.js
grain-lang / grain
thedaviddias / Front-End-Performance-Checklist
istio / istio
CyC2018 / Interview-Notebook
fivethirtyeight / russian-troll-tweets
boyerjohn / rapidstring
donnemartin / system-design-primer
awslabs / aws-cdk
QUANTAXIS / QUANTAXIS
crossoverJie / Java-Interview
GoogleChromeLabs / ndb
dylanbeattie / rockstar
vuejs / vue
sbussard / canvas-sketch
Microsoft / vscode
flutter / flutter
tensorflow / tensorflow
Snailclimb / Java-Guide
>>>
Getting Attributes of an element
>>> for each_list in all_li:
...     anchor_element = each_list.find('a')
...     print("https://github.com" + anchor_element['href'])
...
https://github.com/klauscfhq/taskbook
https://github.com/robinhood/faust
https://github.com/Avik-Jain/100-Days-Of-ML-Code
https://github.com/jxnblk/mdx-deck
https://github.com/faressoft/terminalizer
https://github.com/trekhleb/javascript-algorithms
https://github.com/apexcharts/apexcharts.js
https://github.com/grain-lang/grain
https://github.com/thedaviddias/Front-End-Performance-Checklist
https://github.com/istio/istio
https://github.com/CyC2018/Interview-Notebook
https://github.com/fivethirtyeight/russian-troll-tweets
https://github.com/boyerjohn/rapidstring
https://github.com/donnemartin/system-design-primer
https://github.com/awslabs/aws-cdk
https://github.com/QUANTAXIS/QUANTAXIS
https://github.com/crossoverJie/Java-Interview
https://github.com/GoogleChromeLabs/ndb
https://github.com/dylanbeattie/rockstar
https://github.com/vuejs/vue
https://github.com/sbussard/canvas-sketch
https://github.com/Microsoft/vscode
https://github.com/flutter/flutter
https://github.com/tensorflow/tensorflow
https://github.com/Snailclimb/Java-Guide
>>>
Using a class name or other attributes to get an element
>>> for each_list in all_li:
...     total_stars_today = each_list.find(attrs={'class':'float-sm-right'}).text
...     print(total_stars_today.strip())
...
1,063 stars today
846 stars today
596 stars today
484 stars today
459 stars today
429 stars today
443 stars today
366 stars today
330 stars today
282 stars today
182 stars today
190 stars today
200 stars today
190 stars today
166 stars today
164 stars today
144 stars today
158 stars today
157 stars today
144 stars today
144 stars today
142 stars today
132 stars today
101 stars today
108 stars today
>>>
Navigate children of an element
>>> for each_child in ordered_list.children:
...     if isinstance(each_child, bs4.element.Tag):  # skip the '\n' text nodes between the <li> tags
...         print(each_child.find('h3').text.strip())
...
klauscfhq / taskbook
robinhood / faust
Avik-Jain / 100-Days-Of-ML-Code
jxnblk / mdx-deck
faressoft / terminalizer
trekhleb / javascript-algorithms
apexcharts / apexcharts.js
grain-lang / grain
thedaviddias / Front-End-Performance-Checklist
istio / istio
CyC2018 / Interview-Notebook
fivethirtyeight / russian-troll-tweets
boyerjohn / rapidstring
donnemartin / system-design-primer
awslabs / aws-cdk
QUANTAXIS / QUANTAXIS
crossoverJie / Java-Interview
GoogleChromeLabs / ndb
dylanbeattie / rockstar
vuejs / vue
sbussard / canvas-sketch
Microsoft / vscode
flutter / flutter
tensorflow / tensorflow
Snailclimb / Java-Guide
>>>

The .children generator only returns the immediate children of the parent element (including the newline text nodes between the tags, which is why the snippet above skips non-Tag children). If you’d like to get all of the elements under a certain element, you should use .descendants

Navigate descendants of an element
>>> for each_descendant in ordered_list.descendants:
...     # every tag and string under the <ol> is visited, depth-first
...     pass
Navigating previous and next siblings of elements
>>> all_li = ordered_list.find_all('li')
>>> fifth_li = all_li[4]
>>> # each li element is separated by '\n' and hence to navigate to the fourth li, we should navigate previous sibling twice
...
>>>
>>> fourth_li = fifth_li.previous_sibling.previous_sibling
>>> fourth_li.find('h3').text.strip()
'jxnblk / mdx-deck'
>>>
>>> # similarly for navigating to the sixth li from fifth li, we would use next_sibling
...
>>> sixth_li = fifth_li.next_sibling.next_sibling
>>> sixth_li.find('h3').text.strip()
'trekhleb / javascript-algorithms'
>>>
Navigate to parent of an element
>>> all_li = ordered_list.find_all('li')
>>> first_li = all_li[0]
>>> li_parent = first_li.parent
>>> # the li_parent is the ordered list <ol>
...
>>>
Putting it all together (GitHub Trending Scraper)
>>> import urllib.request
>>> import bs4
>>>
>>> github_trending = urllib.request.urlopen("https://github.com/trending")
>>> trending_soup = bs4.BeautifulSoup(github_trending.read(), "lxml")
>>> ordered_list = trending_soup.find('ol')
>>> for each_list in ordered_list.find_all('li'):
...     repository_name = each_list.find('h3').text.strip()
...     repository_url = "https://github.com" + each_list.find('a')['href']
...     total_stars_today = each_list.find(attrs={'class':'float-sm-right'}).text.strip()
...     print(repository_name, repository_url, total_stars_today)
...

klauscfhq / taskbook https://github.com/klauscfhq/taskbook 1,404 stars today
robinhood / faust https://github.com/robinhood/faust 960 stars today
Avik-Jain / 100-Days-Of-ML-Code https://github.com/Avik-Jain/100-Days-Of-ML-Code 566 stars today
trekhleb / javascript-algorithms https://github.com/trekhleb/javascript-algorithms 431 stars today
jxnblk / mdx-deck https://github.com/jxnblk/mdx-deck 416 stars today
apexcharts / apexcharts.js https://github.com/apexcharts/apexcharts.js 411 stars today
faressoft / terminalizer https://github.com/faressoft/terminalizer 406 stars today
istio / istio https://github.com/istio/istio 309 stars today
thedaviddias / Front-End-Performance-Checklist https://github.com/thedaviddias/Front-End-Performance-Checklist 315 stars today
grain-lang / grain https://github.com/grain-lang/grain 301 stars today
boyerjohn / rapidstring https://github.com/boyerjohn/rapidstring 232 stars today
CyC2018 / Interview-Notebook https://github.com/CyC2018/Interview-Notebook 186 stars today
donnemartin / system-design-primer https://github.com/donnemartin/system-design-primer 189 stars today
awslabs / aws-cdk https://github.com/awslabs/aws-cdk 186 stars today
fivethirtyeight / russian-troll-tweets https://github.com/fivethirtyeight/russian-troll-tweets 159 stars today
GoogleChromeLabs / ndb https://github.com/GoogleChromeLabs/ndb 172 stars today
crossoverJie / Java-Interview https://github.com/crossoverJie/Java-Interview 148 stars today
vuejs / vue https://github.com/vuejs/vue 137 stars today
Microsoft / vscode https://github.com/Microsoft/vscode 137 stars today
flutter / flutter https://github.com/flutter/flutter 132 stars today
QUANTAXIS / QUANTAXIS https://github.com/QUANTAXIS/QUANTAXIS 132 stars today
dylanbeattie / rockstar https://github.com/dylanbeattie/rockstar 130 stars today
tensorflow / tensorflow https://github.com/tensorflow/tensorflow 106 stars today
Snailclimb / Java-Guide https://github.com/Snailclimb/Java-Guide 111 stars today
WeTransfer / WeScan https://github.com/WeTransfer/WeScan 118 stars today


Grab siteprice and write to google spreadsheet using python

- - Applications, Python, Tutorials

By the end of this read you will be able to grab a site’s price from siteprice.org and write it to a Google spreadsheet using Python. Every website has its competition. As our website evolves, we gain more competitors, and those competitors’ websites also earn good value. It is vital to know the value of our own website as well as our competition’s. Siteprice.org is one of those websites which calculate a website’s value based on different factors.

Our strategy for querying a number of websites’ prices will be to put the domain names in a text file, one domain per line. You may wish to put hundreds of websites in this txt file, whichever ones are your competition; an example file follows.
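
For illustration, the websitesforprice.txt file might look like this (the domains below are placeholders):

www.example.com
www.competitor-one.com
www.competitor-two.com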

Python code to extract the site price and write it to a Google spreadsheet (note that the snippet below is Python 2, as the urllib2 import indicates)

from bs4 import BeautifulSoup
from urllib2 import urlopen
import gdata.spreadsheet.service
import datetime
rowdict = {}
rowdict['date'] = str(datetime.date.today())
spread_sheet_id = '13mX6ALRRtGlfCzyDNCqY-G_AqYV4TpE7rq1ZNNOcD_Q'
worksheet_id = 'od6'
client = gdata.spreadsheet.service.SpreadsheetsService()
client.debug = True
client.email = 'email@domain.com'
client.password = 'password'
client.source = 'siteprice'
client.ProgrammaticLogin()
with open('websitesforprice.txt') as f:
    for line in f:
        soup = BeautifulSoup(urlopen("http://www.siteprice.org/website-worth/" + line.strip()).read(), "lxml")
        rowdict['website'] = line.strip()  # strip the trailing newline read from the file
        rowdict['price'] = soup.find(id="lblSitePrice").string
        client.InsertRow(rowdict, spread_sheet_id, worksheet_id)

1. Line 1 to 4

These lines are import statements. This program uses various Python libraries: gdata is used to access Google Spreadsheets; BeautifulSoup is used because it allows us to get data via id, which we will use to get the price of a website; datetime is used to get the current date; and urlopen is used to open the webpage which contains the data we want.

2. Line 5 to 14

In order to write the extracted price to a Google spreadsheet programmatically, we use the gdata module. To write to a spreadsheet we need the spreadsheet id, the worksheet id and a dictionary containing the values we want to write. The dictionary’s keys are the column headers and its values are the strings to be written to the spreadsheet (website, price and date for our program), as illustrated below.
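
For illustration, a fully populated row dictionary would look something like this (the values here are made up):

rowdict = {'website': 'www.example.com', 'price': '$1,250', 'date': '2018-07-28'}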

Go to docs.google.com while logged in and create a new spreadsheet. Fill the first three columns of the first row with website, price and date respectively. All the letters should be lower case, with no whitespace. Now that you have created a new spreadsheet, take a look at the url. It looks something like this one:

https://docs.google.com/spreadsheets/d/13mX6ALRRtGlfCzyDNCqY-G_AqYV4TpE7rq1ZNNOcD_Q/edit#gid=0

The spreadsheet id (mentioned earlier) is present in the url.

“13mX6ALRRtGlfCzyDNCqY-G_AqYV4TpE7rq1ZNNOcD_Q” in the above url is the spreadsheet id we need. By default the worksheet id is ‘od6’.

Basically, lines 5 to 14 are the code to access the Google spreadsheet.
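
If you’d rather not copy the id out of the url by hand, a small sketch like this can extract it (using the example url above):

>>> import re
>>> url = "https://docs.google.com/spreadsheets/d/13mX6ALRRtGlfCzyDNCqY-G_AqYV4TpE7rq1ZNNOcD_Q/edit#gid=0"
>>> re.search(r"/spreadsheets/d/([a-zA-Z0-9_-]+)", url).group(1)
'13mX6ALRRtGlfCzyDNCqY-G_AqYV4TpE7rq1ZNNOcD_Q'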

3. Line 15 to 20

Since we’re writing a program that can extract prices for hundreds of websites and append them to a Google spreadsheet, taking the urls from console input is never a good solution. Instead we write the urls of the websites we want to track in a text file, one website per line, in the format www.domain.com. Make sure there is a valid website on each line, because we will read the urls line by line.

Line 17 makes a soup object out of the page at the url which holds the information we are looking for; the soup is built from a different website on each iteration. Line 18 stores the domain in the “website” key of the rowdict dictionary. Line 19 stores the price of the website in the “price” key; you can see we use BeautifulSoup to get the data via id. Finally, line 20 pushes the entire dictionary to the Google spreadsheet. This piece of code runs once for each line in the text file.

Thanks for reading :) Enjoy!! If you have any questions regarding the post, feel free to comment below.

What are the most interesting web scraping modules for python

- - Python, Tutorials, Web
The Python programming language has been in the hype for over a decade. It is the most recommended language for beginner programmers, since its syntax is readable by almost every non-programmer too, and at the same time it is recommended for web scraping, automation and data science. However, Python comes up short in terms of speed when compared to languages such as C++ and Java. The plus for the Python programming language is the wide range of enthusiastic contributors and users around the globe. There are countless modules for doing various domain-specific tasks, which makes it even more popular today. From web scraping to GUI automation, there are modules for almost everything. Here, in this post, I will list some of the most used and interesting Python modules for web scraping that are lifesavers for a programmer.

Popular python modules for web scraping

1. Mechanize

mechanize is a popular Python module because it allows the creation of a browser instance. It also maintains sessions, which makes it a handy toolkit for tasks like login and signup automation; a sketch follows.
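
A minimal login sketch, assuming a page at https://example.com/login whose first form has "username" and "password" fields (the url and the field names are placeholders):

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)  # don't let robots.txt block the scripted browser
br.open("https://example.com/login")
br.select_form(nr=0)  # select the first form on the page
br["username"] = "myuser"      # placeholder credentials
br["password"] = "mypassword"
response = br.submit()  # the session cookie is kept for later br.open() calls
print(response.getcode())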

2. BeautifulSoup

BeautifulSoup is another beautiful Python module which aids in scraping the required data out of html/xml via tags. With BeautifulSoup you can scrape almost anything, since it provides different methods such as searching by tag, finding all links, and so on; for instance, see the sketch below.
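
A short sketch that lists every link on a page, using the built-in html.parser (the url is a placeholder):

import urllib.request
import bs4

page = urllib.request.urlopen("https://example.com")
soup = bs4.BeautifulSoup(page.read(), "html.parser")
for anchor in soup.find_all('a'):
    print(anchor.get('href'))  # .get() returns None instead of raising when the attribute is absent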

3. Selenium

Although selenium is a well-known module for automated browser testing, it can be used as a web scraping tool as well, and I promise you it pays off pretty well. With methods to find elements via id, name, class, etc., selenium lets you get at anything on a website; a sketch follows.
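
A minimal sketch, assuming a chromedriver binary on your PATH and a page element with class "headline" (both are assumptions):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # launches a Chrome instance via chromedriver
driver.get("https://example.com")
element = driver.find_element(By.CLASS_NAME, "headline")  # hypothetical class name
print(element.text)
driver.quit()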

4. lxml

lxml is another wonderful library for parsing xml/html; however, I would say BeautifulSoup beats it in terms of usability. You can opt for either of the two modules, lxml or BeautifulSoup, since they do pretty much the same things; the sketch below mirrors the BeautifulSoup example above.
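
The same link-listing idea in lxml, for comparison (the url is again a placeholder):

import urllib.request
from lxml import html

page = urllib.request.urlopen("https://example.com").read()
tree = html.fromstring(page)
for href in tree.xpath('//a/@href'):  # XPath expression selecting every anchor's href
    print(href)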

I have used all of the above modules extensively in my projects and they allowed me to move faster. I was able to do some cool things with these modules, for example: automating a conversation between two cleverbots (AI-featured bots), getting paid courses at Udemy, finding the most popular Facebook fan page among my friends, etc. Therefore I totally recommend them. Below is a link to some of the most interesting things I’ve done with these modules.

Cool stuff with Python

Tell me in the comments section below how you felt about the article, and/or add some of your favorite modules too. As always, thanks for reading!

Grab alexa rank and write to google spreadsheet using python

- - Tutorials

How to grab alexa rank using python?

By the end of this read you will be able to extract the alexa rank of a website and write it to a Google spreadsheet programmatically. This can be helpful for SEO experts, who have to constantly analyze alexa rankings and report them to their boss on a regular basis, or for personal reference. Sometimes you have a huge number of websites to take care of, especially when your main expertise is search engine optimization. Going to alexa’s website and writing down the rank of each website is a cumbersome task. The code provided below is in the Python programming language. Executing the code with a valid gmail id, password, spreadsheet id and worksheet id associated with the spreadsheet you want to work with will append the website link and current alexa rank, with the date, to the spreadsheet. Don’t worry about the spreadsheet id and worksheet id; I will show you how to get these items. Continue Reading