
Let's do some Scraping with Selenium!

Chesta Dhingra


Collecting data can sometimes become quite a task, or should I say a rudimentary, boring one. Yes, we data scientists and data analysts have all been in those situations. But how can we forget our trusty Python, which makes such things quite feasible!

Python offers some amazing libraries like Selenium, Scrapy, and BeautifulSoup that help in scraping and collecting data. These libraries automate the process of opening sites, collecting data, and storing it either in a database like MongoDB or SQLite, or in a data frame using the pandas library. In this article we will create an automated bot that scrapes a website.

But before doing that, we should always go through the website's robots.txt file, which tells us what kind of information we are and are not allowed to scrape.
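As a quick aside, Python's standard library can check robots.txt rules programmatically. Here is a minimal sketch using urllib.robotparser (the result depends on the site's live rules, so do verify them yourself):

from urllib.robotparser import RobotFileParser

## point the parser at the site's robots.txt and load it
rp = RobotFileParser("https://www.audible.in/robots.txt")
rp.read()

## check whether a generic user agent may fetch the bestsellers page
print(rp.can_fetch("*", "https://www.audible.in/adblbestsellers"))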

So in this article we are going to create a project that scrapes information about books, namely their titles and author names, from the Audible website.

We will be using a recent Selenium version, 4.7.2.
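If you want to confirm which version is installed in your environment, a quick check from Python works:

import selenium
print(selenium.__version__)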

First and foremost, create a virtual environment; before starting any project in Python it is highly recommended to create one.

# open the project folder you have created in VS Code
# then open the terminal
python -m virtualenv venv
# this creates a virtual environment where all your libraries and
# their versions are going to be saved
# to activate the venv in cmd, follow the steps below
cd venv
cd Scripts
activate

# once activated, the terminal prompt will be prefixed with (venv)
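A small note: if your terminal is PowerShell rather than cmd, the activation command differs slightly (assuming the same venv folder layout):

.\venv\Scripts\Activate.ps1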

Now, coming to the main part of the project: install the essential libraries, like Selenium, and while installing also download the browser driver. For example, I am using Chrome, so I have installed ChromeDriver. If you are also using Chrome for scraping, you can download the ChromeDriver matching your Chrome version from the ChromeDriver downloads page. Secondly, the driver should be placed in the project folder, alongside your script.
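For reference, the installs are just pip commands run inside the activated environment (pinning Selenium to the version used in this article):

pip install selenium==4.7.2
pip install pandas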

## import all the essential libraries
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import pandas as pd
import time

chr_options = Options()
chr_options.add_experimental_option("detach", True)
chr_options.add_experimental_option('excludeSwitches', ['enable-logging'])

## the "detach" option above keeps the browser window open after the script
## finishes; I have observed that if we don't mention Options() and
## add_experimental_option() the site opens and closes instantly.
## excludeSwitches with 'enable-logging' simply suppresses noisy DevTools
## log messages in the terminal.

## the next step is to specify the path of the chromedriver executable
## that we placed in the project folder

driver = webdriver.Chrome(service=Service('C:/Users/projects/selenium_driver/chromedriver.exe'), options=chr_options)

## now with the help of driver we will open up the website in the browser

driver.get("https://www.audible.in/adblbestsellers?ref=a_search_t1_navTop_pl1cg0c1r0&pf_rd_p=45682ad3-a4ca-473f-b9dc-aedd02f1e5fb&pf_rd_r=3BP1A3BP16YAWMFQ9CJ5&pageLoadId=f1oszbm5bPQ5DB2o&creativeId=bbff6a49-335a-4726-a316-1f6a3c7070bb")

## to maximize the window
driver.maximize_window()

"""
## while scraping the data Xpath plays quite an important role while
## figuring out the way or inspecting the elements.
## to inspect the elements on the website the command Ctrl+Shift+I
## in the website which we are trying to inspect first debug the Javascript
## for that use the command Ctrl+Shift+P then in the pop-up box type Javascript
## click on disable Javascript
## the arrow on the very top left near the Elements will help you to inspect the
## elements and with some basics of Xpath one can figure out the path of the particular elements
## that we want to collect

## in this we will be getting data for all the pages in the audible bestseller collection
## for that we will look at the bottom where the pages tab and icons are given
"""

pagination = driver.find_element(By.XPATH,"//ul[contains(@class,'pagingElements')]")
pages = pagination.find_elements(By.TAG_NAME,"li")

"""
the first pagination will find the elements of the page tab followed by the
list of the numbers that are there in the tab
after that we will create the integer value of the last page element
*note here we specify the value as -2 because the arrow in the last is considered to be
-1
"""
last_page = int(pages[-2].text)

current_page = 1

## create lists to store the book titles and author names

book_title = []
author_name = []

"""
Loop will help in collecting the data untill we'll scrape all the pages
"""
while current_page <= last_page:

    time.sleep(2)

    ## the container holds the product list for the current page,
    ## including all the titles, author names, ratings etc.
    container = driver.find_element(By.CLASS_NAME, 'adbl-impression-container')

    ## equivalent absolute XPath for the container's list items:
    ## //div[contains(@class,'adbl-impression-container')]//div//ul//li[contains(@class,'productListItem')]

    products = container.find_elements(By.XPATH, './/li[contains(@class, "productListItem")]')
    ## products contains all the product list items; we used XPath here, and
    ## within each product the book title, author name and reviews can also
    ## be located via XPath

    for product in products:
        # print(product.text)
        book_title.append(product.find_element(By.XPATH, './/h3//a').text)
        author_name.append(product.find_element(By.XPATH, './/li[contains(@class,"authorLabel")]//span//a').text)

    current_page += 1
    ## after storing the data for the current page we increase the page
    ## count and click on the next page using the click() function

    try:
        next_page = driver.find_element(By.XPATH, "//span[contains(@class,'nextButton')]")
        next_page.click()
    except:
        pass
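One optional improvement: the fixed time.sleep(2) above works, but Selenium also provides explicit waits. Here is a minimal sketch of waiting for the product container with WebDriverWait instead; this swaps in a different waiting technique from the one used in this article, shown only as an option:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

## wait up to 10 seconds for the product container to be present in the DOM
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'adbl-impression-container'))
)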

audible_data = pd.DataFrame({'book_title': book_title,
                             'author_name': author_name})
audible_data.to_csv("pagination_data.csv", index=False)

## the scraped lists are now stored in dataframe format and saved to a CSV file
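To sanity-check the export, you can read the CSV straight back with pandas (assuming it was written to the current working directory):

check = pd.read_csv("pagination_data.csv")
print(check.head())
print(len(check), "rows collected")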

"""
if one wants to look for a separate loop to get information about the title or
author name one can use the below code. If not this can be commented out as well
"""
name_of_book = driver.find_elements(By.XPATH, '//h3//a')
for name in name_of_book:
    print(name.text)

## note: use a fresh variable name here so we don't overwrite the
## author_name list collected above
author_elements = driver.find_elements(By.XPATH, '//li[contains(@class,"authorLabel")]//span//a')
for a_name in author_elements:
    print(a_name.text)

"""
lastly it is necessary to close the open window as well once we are done with
scraping and collecting the data
"""
driver.quit()

I hope you enjoyed this scraping with Selenium tutorial. If you liked it, follow me for more such articles, which I publish on a weekly basis.
