13 Aug: Beginner's Guide to Web Scraping with Python
The amount of information available on the Internet grows larger day by day. Humans cannot process such substantial amounts of data manually, so we need a way to access and process it automatically. This can be done with web scraping.
For example:
Say you're a seller on Amazon: you may need to analyse thousands of reviews (for your products or the competition) and judge whether they are good or bad. That, of course, can be done with sentiment analysis.
But first you need those reviews in a simple, readable format in one place. This is where web scraping helps.
Another example is an investor who trades on news. They could keep refreshing a page waiting for an announcement they have to act on, or they can run a script that does it for them: as soon as the data is available, the script analyses it and emits buy/sell signals.
Getting Started
We'll be using Python as our scripting language.
There are two libraries we are going to focus on.
BeautifulSoup.
mechanize.
BeautifulSoup is a library for parsing HTML and pulling data out of it.
Mechanize is used to fill in and submit web forms.
Selenium is used when mechanize fails because the page relies on JavaScript (interactive webpages).
Installation.
You can install Python from https://www.python.org/downloads/
Get the latest Python 3.x version.
Now to get the libraries we need, we have to use pip, a package management tool for Python.
In the prompt, type:
pip install beautifulsoup4
pip install mechanize
pip install requests
pip install openpyxl
pip install lxml
requests is a library for fetching webpages so we can scrape them.
openpyxl helps save the data in an Excel spreadsheet.
lxml is the fast HTML parser we will hand to BeautifulSoup.
After the libraries are successfully installed, open Python Shell.
In this tutorial we will scrape the stock quote and other data from MSN money.
Coding
First, let's import the necessary libraries.
#import libraries
import requests
from bs4 import BeautifulSoup
import mechanize
from openpyxl import Workbook
Next, assign the URL to a variable:
#declare url
url = 'https://www.msn.com/en-in/money/stockdetails/fi-151.1.ULVR.LON'
Now make use of the requests library to get our HTML page.
#get page
page = requests.get(url)
Note: requests can send data and headers, set user agents, store cookies and much more. For more information, see the official requests documentation.
Now we have to parse the website and store it in ‘soup’
#parse the HTML
soup = BeautifulSoup(page.content, 'lxml')
# page.content gives the raw bytes of the page; passing the Response object `page` itself would throw an error
Scraping is all about navigating the HTML tree and finding the best and fastest way to get to our data.
We want to find the current market price of the stock.
First, open the page in your Chrome browser, right-click on the market price and choose 'Inspect element'.
You will see the source code of the page, and the tags in which the market price value is stored.
Which is:
<span class="current-price" data-role="currentvalue">4,413.00</span>
BeautifulSoup helps us navigate and get the data by the find() method.
There are many ways to code this and get that data:
# By attributes
current_price = soup.find('span', attrs={'class': 'current-price'})
The code finds the 'span' tag whose class attribute is 'current-price'.
To get the text inside the tag we use .text:
current_price = current_price.text.strip()
# strip() removes leading and trailing whitespace
We can find it this way as well:
current_price = soup.find('span', attrs={'data-role': 'currentvalue'}).text.strip()
# current_price = soup.find('html').next_element.next_element...
# next_element gives the next node after the one found
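To see what next_element does step by step, here is a minimal sketch on a tiny hand-written snippet (not the MSN page):

```python
from bs4 import BeautifulSoup

# A tiny hand-written snippet, just to illustrate navigation.
html = '<span class="name">P/E Ratio</span><span class="value">23.92</span>'
soup = BeautifulSoup(html, 'html.parser')

name_tag = soup.find('span', attrs={'class': 'name'})
print(name_tag.next_element)               # the text node inside the first span
print(name_tag.next_element.next_element)  # the second <span> tag itself
```

Each call to next_element moves one node forward in document order, visiting text nodes as well as tags, which is why long chains are needed to cross several elements.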
There are many ways to get this data; we have to find the easiest and fastest one.
The next_element approach above can require hundreds of .next_element calls to reach our data, so it is neither practical nor fast.
Sometimes, though, we have to use it when there is no dedicated identifier,
i.e. many tags share the same class name.
See this example.
We want to find out the P/E ratio.
With inspect element, we find:
<span class="name"><p class='truncated-string' title='P/E Ratio (EPS)'>P/E Ratio (EPS)</p></span>
<span class="value baseminus"><p class='truncated-string' title='23.92 (2.03) '>23.92 (2.03) </p></span>
How do we get the value, i.e. '23.92'?
If you look, many tags share the class name 'value baseminus'.
If we use
pe_ratio = soup.find('p', attrs={'title': 'P/E Ratio (EPS)'}).text
This will return:
P/E Ratio (EPS)
Because that is the text inside.
So how do we get the value?
pe_ratio = soup.find('p', attrs={'title': 'P/E Ratio (EPS)'}).next_element.next_element.next_element.next_element.next_element.next_element.next_element
Not so pretty, but it works.
There may be many ways to get the same data.
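One cleaner alternative to a long next_element chain is to jump from the label tag to the sibling that holds the value. A minimal sketch on a hand-written snippet mirroring the name/value markup above (not fetched from the live page):

```python
from bs4 import BeautifulSoup

# Hand-written snippet mirroring the name/value structure shown above.
html = ('<span class="name">'
        '<p class="truncated-string" title="P/E Ratio (EPS)">P/E Ratio (EPS)</p></span>'
        '<span class="value baseminus">'
        '<p class="truncated-string" title="23.92 (2.03)">23.92 (2.03)</p></span>')
soup = BeautifulSoup(html, 'html.parser')

# Find the label, step up to its parent span, then read the next span's text.
label = soup.find('p', attrs={'title': 'P/E Ratio (EPS)'})
value = label.parent.find_next_sibling('span').text.strip()
print(value)
```

This only assumes the value span immediately follows the name span, which matches the markup shown above, and it does not break if extra text nodes appear inside the tags.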
Now type:
print(pe_ratio)
print(current_price)
Saving in a spreadsheet.
wb = Workbook()  # create a workbook
sheet = wb.active  # select the active sheet
sheet.title = 'Data'  # set the sheet title
sheet.append([pe_ratio])  # append() takes a row as a list, not a bare string
sheet.append([current_price])
wb.save('data.xlsx')
wb.close()
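To check that the spreadsheet was written correctly, you can read it back with openpyxl's load_workbook. A small self-contained sketch, using sample values in place of the scraped ones:

```python
from openpyxl import Workbook, load_workbook

# Write a small workbook the same way as above, with sample values.
wb = Workbook()
sheet = wb.active
sheet.title = 'Data'
sheet.append(['23.92'])     # row for the P/E ratio
sheet.append(['4,413.00'])  # row for the current price
wb.save('data.xlsx')

# Read it back and inspect cell A1.
wb2 = load_workbook('data.xlsx')
print(wb2['Data']['A1'].value)
```

load_workbook returns a workbook whose sheets can be indexed by title and whose cells can be addressed by coordinate like 'A1'.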
Save this file.
Now whenever you run it, it will fetch the CMP and P/E ratio and write them into Excel, without you having to open the webpage.
Advanced
Let's see if we can get data for hundreds of stock tickers.
a = 'MMM ABT ABBV ACN ATVI AYI ADBE AAP AES AET AMG AFL A APD AKAM ALK ALB ALXN ALLE AGN ADS LNT ALL GOOGL GOOG MO AMZN AEE AAL AEP AXP AIG AMT AWK AMP ABC AME AMGN APH APC ADI ANTM AON APA AIV AAPL'
a = a.split()  # split the string into a list of tickers
cmp = []  # current market prices
pe = []   # P/E ratios, to append data later
url = 'https://www.msn.com/en-us/money/stockdetails/analysis/fi-126.1.'
wb = Workbook()
sheet = wb.active
sheet.title = 'a'
The for loop:
for ticker in a:
    fullurl = url + ticker + '.NYS'  # build the full URL for this ticker
    page = requests.get(fullurl)
    soup = BeautifulSoup(page.content, 'lxml')  # parse each page before searching it
    pe_ratio = soup.find('p', attrs={'title': 'P/E Ratio (EPS)'}).next_element.next_element.next_element.next_element.next_element.next_element.next_element
    current_price = soup.find('span', attrs={'data-role': 'currentvalue'}).text.strip()
    cmp.append(current_price)
    pe.append(pe_ratio)  # collect every value in the lists
sheet.append(cmp)
sheet.append(pe)
# append the lists as rows in the spreadsheet
wb.save('data.xlsx')
wb.close()
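The loop above will crash on the first ticker whose page lacks either field. A more defensive pattern is to move the parsing into a helper that returns None when something is missing; `parse_quote` below is a hypothetical helper, and it assumes the same data-role/title attributes and the name/value sibling layout shown earlier:

```python
from bs4 import BeautifulSoup

def parse_quote(html):
    """Return (current_price, pe_text) from a quote page's HTML,
    or None if either field is missing (hypothetical helper)."""
    soup = BeautifulSoup(html, 'html.parser')  # built-in parser; 'lxml' also works if installed
    price_tag = soup.find('span', attrs={'data-role': 'currentvalue'})
    pe_label = soup.find('p', attrs={'title': 'P/E Ratio (EPS)'})
    if price_tag is None or pe_label is None:
        return None
    value_span = pe_label.parent.find_next_sibling('span')
    if value_span is None:
        return None
    return price_tag.text.strip(), value_span.text.strip()

# Usage against live pages (network required):
# for ticker in a:
#     page = requests.get(url + ticker + '.NYS')
#     result = parse_quote(page.content)
#     if result is None:
#         continue  # skip tickers whose page layout does not match
```

Returning None instead of raising keeps one bad ticker from killing a run over hundreds of them.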
This script will give you the data for all the tickers in 'a'.
If you add 1,000 tickers to that list, you will get data for all of them.
Then you can do whatever analysis you want from it.
This is the power of data scraping.
The applications are many: in the same way you can pull tens of thousands of valuable records from directory websites.
It can be used for lead generation, for startups or established companies.
Because the process is automated, it eliminates manual copy-paste errors; keep in mind, though, that the data is only as correct as the page it came from, and a change in the site's layout can silently break the scraper.
Conclusion
BeautifulSoup is great for small-to-medium web scraping jobs.
Always make sure you send as few requests as possible; flooding the target with too many requests can get you blocked or even take the site down.
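A simple way to keep the request rate down is to pause between fetches; `fetch_all` below is a hypothetical helper, assuming a fixed delay between requests is acceptable:

```python
import time

def fetch_all(urls, fetch, delay=1.0):
    """Call fetch(url) for each URL, pausing `delay` seconds between
    requests so we do not hammer the server (hypothetical helper)."""
    results = []
    for i, u in enumerate(urls):
        if i:
            time.sleep(delay)  # be polite: space out consecutive requests
        results.append(fetch(u))
    return results

# Usage with requests (network required):
# pages = fetch_all(full_urls, lambda u: requests.get(u).content, delay=2.0)
```

Passing the fetch function in as a parameter keeps the rate-limiting logic separate from the HTTP details.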
To get good at it, try automating the boring stuff with it.
Try natural language processing on scraped forums.