Beginner's Guide to Web Scraping with Python

The amount of information available on the Internet grows larger every day, far more than humans can process manually. As a result, we need a way to access and process this data automatically. This can be done with web scraping.

For example:

Suppose you're a seller on Amazon. You may need to analyse thousands of reviews (for your products or the competition) to see whether they are positive or negative. This, of course, can be done with sentiment analysis.

But first you'll need those reviews in a simple, readable format in one place. This is where web scraping helps.

Another example is an investor who acts on news. He may keep refreshing a page, waiting for an announcement he has to act on, or he can have a script do it for him: as soon as the data is available, the script analyses it and throws buy/sell signals.

Getting Started

We'll be using Python as our scripting language.

There are two libraries we are going to focus on:

BeautifulSoup.
mechanize.

BeautifulSoup is a web scraping (HTML parsing) library.

Mechanize is used to fill in and submit forms.

Selenium, a third option, is used when mechanize fails because the page depends on JavaScript (interactive webpages).

Installation.

You can install Python from https://www.python.org/downloads/
Get the latest Python 3.x version.

Now to get the libraries we need, we have to use pip, a package management tool for Python.

In the command prompt, type:

pip install beautifulsoup4
pip install mechanize
pip install requests
pip install openpyxl
pip install lxml
requests is the library we'll use to fetch web pages so we can scrape them.

openpyxl helps save the data to an Excel spreadsheet, and lxml is the parser BeautifulSoup will use to read the HTML.

After the libraries are successfully installed, open Python Shell.

In this tutorial we will scrape the stock quote and other data from MSN Money.

Coding

First, let's import the necessary libraries.

#import libraries
import requests
from bs4 import BeautifulSoup
import mechanize
from openpyxl import Workbook

Next, assign the URL to a variable:

#declare url
url = 'http://www.msn.com/en-in/money/stockdetails/fi-151.1.ULVR.LON'

Now make use of the requests library to get our HTML page.

#get page
page = requests.get(url)

Note: requests can also send data and custom headers, set user agents, store cookies and much more. For more information, see the requests documentation.

Now we parse the page and store the result in 'soup':

#parse the HTML
soup = BeautifulSoup(page.content, 'lxml')

#page.content gives the raw content of the page; passing the response object itself would throw an error
Now you have to understand that scraping is all about navigating the HTML tree and finding the best and fastest way to reach our data.

We want to find the current market price of the stock.

First, open the page in your Chrome browser, right-click the market price and choose 'Inspect element'.


You will see the source code of the page, and the tag in which the market price value is stored, which is:

<span class="current-price" data-role="currentvalue">4,413.00</span>

BeautifulSoup helps us navigate and extract the data via the find() method.

There are many ways to code this:

# By attributes
current_price = soup.find('span', attrs={'class': 'current-price'})

The code finds the 'span' tag whose class attribute is 'current-price'.
To get the text inside the tag we use .text:

current_price = current_price.text.strip()

#strip() removes leading and trailing whitespace

We can find it this way as well:

current_price = soup.find('span', attrs={'data-role': 'currentvalue'}).text.strip()

#current_price = soup.find('html').next_element.next_element… (repeated many times)

#next_element finds the next element after the tag found.

 

There are many ways to get this data; we have to find the easiest and fastest one.

The .next_element method above would require hundreds of calls to reach our data, so it is neither viable nor fast.

Sometimes, though, we have to use it when there is no dedicated classifier, i.e. when many tags share the same class name.

See this example.

We want to find out the P/E ratio.

With inspect element, we find:

 

<span class="name"><p class='truncated-string' title='P/E Ratio (EPS)'>P/E Ratio (EPS)</p></span>
<span class="value baseminus"><p class='truncated-string' title='23.92 (2.03)'>23.92 (2.03)</p></span>

How do we get the value (i.e. '23.92')?

Notice that many tags on the page share the class name 'value baseminus'.

If we use:

pe_ratio = soup.find('p', attrs={'title': 'P/E Ratio (EPS)'}).text

This will return:

P/E Ratio (EPS)

Because that is the text inside.

So how do we get the value?

pe_ratio = soup.find('p', attrs={'title': 'P/E Ratio (EPS)'}).next_element.next_element.next_element.next_element.next_element.next_element.next_element

Not so pretty, but it works.

There may be many ways to get the same data.

Now type:

print(pe_ratio)
print(current_price)

Saving to a spreadsheet:

wb = Workbook()  #create a workbook
sheet = wb.active  #select the active sheet
sheet.title = 'Data'  #set the sheet title
sheet.append([pe_ratio])  #append expects a list: each item goes into its own cell
sheet.append([current_price])
wb.save('data.xlsx')
wb.close()

 

Save this file.

Now, whenever you run it, it will fetch the CMP and P/E ratio and save them to Excel, without you having to go through the webpage.

 

Advanced

Let's see if we can get data for hundreds of stock tickers.

a = 'MMM ABT ABBV ACN ATVI AYI ADBE AAP AES AET AMG AFL A APD AKAM ALK ALB ALXN ALLE AGN ADS LNT ALL GOOGL GOOG MO AMZN AEE AAL AEP AXP AIG AMT AWK AMP ABC AME AMGN APH APC ADI ANTM AON APA AIV AAPL'

a = a.split()  #makes a list of tickers
cmp = []
pe = []  #lists to append data to later

url = 'http://www.msn.com/en-us/money/stockdetails/analysis/fi-126.1.'

wb = Workbook()
sheet = wb.active
sheet.title = 'a'

Now loop over the tickers:

for ticker in a:
    fullurl = url + ticker + '.NYS'  #build the full URL for this ticker
    page = requests.get(fullurl)
    soup = BeautifulSoup(page.content, 'lxml')  #each page must be parsed before we can search it
    pe_ratio = soup.find('p', attrs={'title': 'P/E Ratio (EPS)'}).next_element.next_element.next_element.next_element.next_element.next_element.next_element
    current_price = soup.find('span', attrs={'data-role': 'currentvalue'}).text.strip()
    cmp.append(current_price)
    pe.append(pe_ratio)  #collect every value in the lists

sheet.append(cmp)
sheet.append(pe)  #append each list to the spreadsheet as a row

wb.save('data.xlsx')
wb.close()

This script will give you the data for all the tickers in 'a'.

If you add 1,000 tickers to that list, you will get data for all of them.

Then you can do whatever analysis you want from it.

This is the power of data scraping.

The applications are many: you can similarly collect tens of thousands of valuable records from directory websites.

It can be used for lead generation, for startups or established companies.

Because the process is automated, there is no chance of manual transcription errors, though the data is only ever as accurate as the source page.

 

Conclusion

BeautifulSoup is great for small- to medium-scale web scraping.

Always throttle your requests; sending too many too quickly can overload the target site or get you blocked.

To get good at it, try automating boring tasks with it.

Try natural language processing on scraped forums.

 


