Python Web Scraping Using Selenium

If you have started web scraping with Python, you have probably used BeautifulSoup or requests.

These are great libraries, but if a website relies on JavaScript or user interaction, they will fail. This is where Selenium comes in.

 

What is Selenium?

Selenium is a browser automation tool: through a WebDriver it takes control of a real browser.

Hence it does whatever a browser would do, and the website will see it as a normal browser.

It will load and process JavaScript.

Installing.

Simply use pip:

pip install selenium

You will also need a driver which, as the name suggests, lets Selenium drive the browser. Drivers are browser-specific.

Download ChromeDriver from http://chromedriver.storage.googleapis.com/index.html

And unzip it into the Python root folder (anywhere on your PATH works).
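If you unzip it somewhere else, you can point Selenium at the driver explicitly instead. A minimal sketch using the Selenium 3 executable_path keyword; the path below is just an example:

from selenium import webdriver

# The path here is an assumption - use wherever you unzipped ChromeDriver.
driver = webdriver.Chrome(executable_path='C:/drivers/chromedriver.exe')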

Opening a Webpage.

# Import the library.
from selenium import webdriver

# We have to initialize the driver first.
driver = webdriver.Chrome()

When you run this, an instance of Chrome will open.

Now let’s open a webpage.

We will perform a simple search on a website and then collect the results.

For this example we’ll use the JohnLewis.com website.

driver.get('https://www.johnlewis.com/')

This will open the John Lewis page in our browser, which we can now control through code.
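Because parts of the page are built with JavaScript, an element may not exist the instant the page loads. An optional safeguard is Selenium’s implicit wait (a real Selenium setting; the 10 seconds is just a suggestion):

# Tell Selenium to keep retrying for up to 10 seconds
# before giving up on finding an element.
driver.implicitly_wait(10)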

Now let’s do a simple search.

First we have to find the search box.

To do that, right-click on the search box and select ‘Inspect Element’.

In the inspector you will see the element’s HTML, including its id attribute: search-keywords.

Great! Now we know the ID of the search box.

We can get it like this:

searchbox = driver.find_element_by_id('search-keywords')

This will select the search box.

To see other methods of locating elements, visit:

https://selenium-python.readthedocs.io/locating-elements.html
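For example, the same search box could be located with a CSS selector or an XPath expression instead of its ID. A short sketch, using the same Selenium 3 style methods as the rest of this post:

# Same element, two other locator strategies.
searchbox = driver.find_element_by_css_selector('#search-keywords')
searchbox = driver.find_element_by_xpath("//*[@id='search-keywords']")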

Now that the search box is selected, we can send it a search string.

searchbox.send_keys('Watch')

We must click the search button to send a query.

Using the same method as above, we see that the button has the name ‘search’.

We can use that to find it.

button = driver.find_element_by_name('search')

As before, we now have to send a click; this is easily done with .click():

button.click()

We just searched from Python!
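As an aside, you can often skip finding the button entirely and press Enter in the search box instead, using Selenium’s standard Keys helper:

from selenium.webdriver.common.keys import Keys

# Pressing Enter in the search box submits the form too.
searchbox.send_keys(Keys.RETURN)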

Now we have to collect the results. For this we simply use BeautifulSoup.

from bs4 import BeautifulSoup  # import BeautifulSoup

# Parse the rendered page source with BeautifulSoup.
soup = BeautifulSoup(driver.page_source, 'lxml')

# Find all result holders and pull out each product link.
for link in soup.find_all(attrs={'class': 'qv-image-holder'}):
    print(link.next_element.next_element['href'])

This will bring back the links for all results of the search query.

You can then scrape the data from those links.
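If you would rather not mix in BeautifulSoup, a Selenium-only sketch can pull the same links. This assumes each ‘qv-image-holder’ element contains an anchor tag, as the example above implies:

# Find every result holder, then read the href off the anchor inside it.
for holder in driver.find_elements_by_class_name('qv-image-holder'):
    link = holder.find_element_by_tag_name('a')
    print(link.get_attribute('href'))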

 

A detailed tutorial about how to use BeautifulSoup is here: http://www.wrekindata.co.uk/beginners-guide-to-webscraping-with-python/

 

Conclusion.

Selenium is a powerful tool, but it should only be used when other methods fail.

It is slow compared to requests because it has to load images and scripts.

To speed it up, use PhantomJS: it is a headless browser and a lot faster than Chrome.

Just download the driver and unpack it onto the Python path.

And use:

driver = webdriver.PhantomJS()

This should only be done after testing your script in Chrome.
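If your Chrome is reasonably recent, it can also run headless itself, which gives a similar speed-up without a separate browser. A sketch; the options keyword needs Selenium 3.8 or later:

from selenium import webdriver

# Start Chrome without a visible window.
options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)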

Selenium can also be used where speed is required:

  • Booking tickets fast
  • Web sales which sell out in seconds

If the data you need is behind a form, you can automate filling in the form and collect the data behind it.
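A minimal sketch of such a form fill (the URL and the ‘username’ and ‘password’ field names are hypothetical, purely for illustration):

driver.get('https://example.com/login')  # hypothetical login page
driver.find_element_by_name('username').send_keys('me@example.com')
password = driver.find_element_by_name('password')
password.send_keys('secret')
password.submit()  # submits the enclosing form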

We can safely say Selenium can scrape any website which can be visited by a browser.

 

 


