How to Scrape Bigbasket Data Easily Using Python
In this tutorial, we will look at web scraping methods that let us extract useful data from websites using Python's BeautifulSoup library.
What is Web Scraping?
Web scraping is the process of gathering large amounts of data from web pages and storing it in a format suited to later analysis. BeautifulSoup is a Python package for parsing HTML and XML documents, which makes extracting that data much easier.
Here are the steps we will follow to scrape data with Python:
- Get the URL of the website to scrape.
- Examine the page.
- Find the data that we wish to scrape.
- Write the Python code and run it.
- Store the information in the required format.
BeautifulSoup is a widely used Python library for parsing the HTML or XML content of web pages.
In this tutorial, we will work through the two scraping exercises below and explore more of BeautifulSoup's functionality along the way:
- Scrape product data from the Bigbasket grocery site and store it in a CSV or JSON file.
- Scrape tabular data from a website and load it into a pandas DataFrame.
Before we move ahead, let's go through the fundamentals of HTML and how to inspect a webpage yourself.
HTML is the markup language used to structure a web page. It offers tags such as <li> for list items, <div> for divisions, <p> for paragraphs, and more.
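As a small illustration of how these tags look to BeautifulSoup, the sketch below parses a made-up HTML snippet and reads each tag back. The markup and its text are assumptions written for this example only and are not taken from any real site.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, written only for this illustration.
html = """
<div id="groceries">
  <p>Today's list</p>
  <ul>
    <li>Vinegar</li>
    <li>Rice</li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

paragraph = soup.find("p").text                  # text inside the first <p>
items = [li.text for li in soup.find_all("li")]  # text of every <li>
division_id = soup.find("div")["id"]             # an attribute of the <div>

print(paragraph)
print(items)
print(division_id)
```

find() returns the first matching element, while find_all() returns every match as a list.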
Follow these steps to inspect a webpage:
- Open the website URL in the browser.
- Right-click on the page and choose 'Inspect'.
- The Chrome DevTools panel will open beside the page, where you can see the page's HTML.
In the code, we will use requests to download the HTML content of the given website and BeautifulSoup to parse it.
Let's begin!
1. Scrape Bigbasket Website:
Here, we will go through a step-by-step procedure for scraping product data such as product name, quantity, brand name, price, and description from this site with BeautifulSoup, and then store the data in a CSV file.
Step 1: Install and import the required libraries in a Jupyter notebook.
pip install beautifulsoup4 requests
from bs4 import BeautifulSoup as bs
import requests  # requests module to open a URL
Step 2: Define the list of EAN codes for which we need to scrape data and assign it to a variable named 'eanCodeLists'.
eanCodeLists = [126906,40139631,40041188,40075201,40053874,1204742,40046735,40100963,40067874,40045943]
Let's start with a single EAN code, 40053874, and get its product name, quantity, brand name, price, and description. Later, we will put the same code inside a for loop that runs over the whole list to get data for all the products.
Step 3: Next, fetch the URL with the requests.get() method, which sends an HTTP request to the web page.
urlopen = requests.get('https://www.bigbasket.com/pd/40053874').text
Step 4: Use BeautifulSoup to parse the HTML and assign the result to a variable named 'soup'.
soup = bs(urlopen,'html.parser')
Step 5: Open the URL https://www.bigbasket.com/pd/40053874 in the browser, right-click on the content we need, and note the corresponding HTML tags. We will use these tags in the code to find the required data.
For example, right-click on the field named 'Weikfield Chilli Vinegar, 200 g' to get its tag. This element contains the brand name, product name, and quantity:
<h1 class="GrE04" style="-webkit-line-clamp:initial">Weikfield Chilli Vinegar, 200 g </h1>
Now use BeautifulSoup to look up this tag and assign its text to a variable named 'ProductInfo'.
ProductInfo = soup.find("h1", {"class": "GrE04"}).text # .text will give us the text underlying that HTML element
Step 6: Use the split() method to separate the fields:
split(' ', 1)[1] gives 'Chilli Vinegar, 200 g ', and split(',')[0] then splits that on ',' and gives 'Chilli Vinegar'.
Step 7: Get the price and the product description.
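The splitting in Step 6 can be sketched as follows. This is a minimal example that assumes the brand is always the first word of the title and that a comma separates the name from the quantity; a multi-word brand would need different handling.

```python
# Text of the <h1> element shown above (note the trailing space).
ProductInfo = "Weikfield Chilli Vinegar, 200 g "

# Split once on the first space: brand first, then "name, quantity".
BrandName, rest = ProductInfo.split(" ", 1)

# Split the remainder on the comma and trim surrounding whitespace.
ProductName = rest.split(",")[0].strip()
ProductQty = rest.split(",")[1].strip()

print(BrandName)
print(ProductName)
print(ProductQty)
```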
Pricing field tag: <td data-qa="productPrice" class="IyLvo">Rs <!-- -->35</td>
Product description field tag: <div class="_26MFu "><style ...
We now have:
ProductName = Chilli Vinegar
BrandName = Weikfield
ProductQty = 200 g
ProductPrice = Rs 35
ProductDesc = The spiciness of a fresh green chilli diffusing its heat into sharp vinegar makes this spicy vinegar a unique fusion of spicy and sour notes.
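One possible way to read these two fields is sketched below. It runs against a small inline HTML sample rather than the live page. The price cell is copied from the tag above, but the description content is a placeholder, since the real markup is truncated here. The class names are site-generated and may change over time.

```python
from bs4 import BeautifulSoup

# Inline stand-in for the product page. The description text is a
# placeholder (assumption); only the tag names and classes come from above.
sample_html = """
<table><tr><td data-qa="productPrice" class="IyLvo">Rs <!-- -->35</td></tr></table>
<div class="_26MFu "><p>Placeholder description text.</p></div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# Look the price up by its data-qa attribute; .text skips the HTML comment.
price_tag = soup.find("td", {"data-qa": "productPrice"})
ProductPrice = " ".join(price_tag.text.split())  # normalize whitespace

# Look the description up by class; guard against a missing element.
desc_tag = soup.find("div", {"class": "_26MFu"})
ProductDesc = desc_tag.text.strip() if desc_tag else None

print(ProductPrice)
print(ProductDesc)
```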
Step 8: Now we can collect the data for every EAN code by running the same code inside a for loop.
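A minimal sketch of such a loop is shown below. The helper names (fetch_html, parse_product, scrape_products) are hypothetical and introduced only for this example. The tag and class names are the ones seen in the earlier steps, and the live site may change them or may not serve this markup to a plain HTTP request.

```python
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.bigbasket.com/pd/"


def fetch_html(ean_code):
    """Download the product page for one EAN code (needs network access)."""
    response = requests.get(f"{BASE_URL}{ean_code}", timeout=30)
    response.raise_for_status()
    return response.text


def parse_product(html):
    """Extract the product fields from one page, or None if the title is missing."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("h1", {"class": "GrE04"})
    price = soup.find("td", {"data-qa": "productPrice"})
    desc = soup.find("div", {"class": "_26MFu"})
    if title is None:
        return None
    # Assumes "<Brand> <Name>, <Quantity>" with a single-word brand.
    brand, _, rest = title.text.strip().partition(" ")
    name, _, qty = rest.partition(",")
    return {
        "BrandName": brand,
        "ProductName": name.strip(),
        "ProductQty": qty.strip(),
        "ProductPrice": " ".join(price.text.split()) if price else None,
        "ProductDesc": desc.text.strip() if desc else None,
    }


def scrape_products(ean_codes, fetch=fetch_html):
    """Loop over the EAN codes and collect one dict per product found."""
    products = []
    for code in ean_codes:
        product = parse_product(fetch(code))
        if product is not None:
            product["EANCode"] = code
            products.append(product)
    return products
```

Calling scrape_products(eanCodeLists) would then return a list of dictionaries, one per product. The fetch argument is there so that the parsing can be tried on saved HTML without touching the network.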
Step 9: Use pandas to store the data in a DataFrame.
Step 10: Finally, save the data as JSON and CSV files in a local directory.
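Steps 9 and 10 might look like the sketch below. It uses a single record built from the values shown earlier. The output directory and file names are assumptions chosen for this example.

```python
from pathlib import Path

import pandas as pd

# One record using the values found for EAN code 40053874 above.
products = [
    {
        "BrandName": "Weikfield",
        "ProductName": "Chilli Vinegar",
        "ProductQty": "200 g",
        "ProductPrice": "Rs 35",
    },
]

# Step 9: a list of dicts becomes a DataFrame with one column per key.
df = pd.DataFrame(products)

# Step 10: write both formats to a local directory (hypothetical names).
out_dir = Path("output")
out_dir.mkdir(exist_ok=True)
csv_path = out_dir / "bigbasket_products.csv"
json_path = out_dir / "bigbasket_products.json"

df.to_csv(csv_path, index=False)              # index=False drops the row numbers
df.to_json(json_path, orient="records", indent=2)  # one JSON object per row
```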
2. Scrape Tabular Data
Here, we will scrape the page https://www.ssa.gov/OACT/babynames/decades/names2010s.html, which lists in tabular form the 200 most popular names given to male and female babies born in the USA during 2010–2018. (The figures are sample data based on Social Security card application data as of March 2019.)
Step 1: Import the libraries and use BeautifulSoup to parse the HTML content.
import requests
from bs4 import BeautifulSoup as bs
url = requests.get('https://www.ssa.gov/OACT/babynames/decades/names2010s.html').text
soup = bs(url,'html.parser')
Step 2: Let's use <table class="t-stripe"> to scrape the table data.
table_content = soup.find('table',{'class':'t-stripe'})
Here, 'td' is the table data (cell) tag, 'th' is the table header tag, and 'tr' is the table row tag. We will take the 'tr' elements from 'table_content', each of which contains a combination of 'th' and 'td' cells.
data = table_content.find_all('tr')[0:202]  # first 202 <tr> elements: the header row(s) plus the 200 data rows
Step 3: Now loop over these rows and collect the data in a list variable named 'rows_data'.
First, check the length of 'data' with len(data), then loop over the rows.
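The loop could be sketched as follows. To keep the example runnable offline, it uses a tiny inline table in place of the real page. The five-column layout (rank, male name, male number, female name, female number) is an assumption about the page, and the names and counts are made-up placeholders, not real figures.

```python
from bs4 import BeautifulSoup

# Inline stand-in for the real page; all values are placeholders.
sample_html = """
<table class="t-stripe">
  <tr><th>Rank</th><th>Male name</th><th>Male number</th>
      <th>Female name</th><th>Female number</th></tr>
  <tr><td>1</td><td>Name A</td><td>1,000</td><td>Name B</td><td>900</td></tr>
  <tr><td>2</td><td>Samuel</td><td>800</td><td>Name C</td><td>700</td></tr>
</table>
"""
soup = BeautifulSoup(sample_html, "html.parser")
table_content = soup.find("table", {"class": "t-stripe"})
data = table_content.find_all("tr")
print(len(data))  # number of <tr> elements found

rows_data = []
for row in data:
    cells = row.find_all("td")  # header rows hold only <th>, so they give no cells
    if len(cells) == 5:         # keep only complete data rows
        rows_data.append([cell.text.strip() for cell in cells])

print(rows_data)
```

Filtering on the number of 'td' cells skips header rows without having to know how many there are.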
Step 4: Use pandas to store the data in a DataFrame.
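A possible version of this step is sketched below. The 'rows_data' values are made-up placeholders standing in for the scraped rows. The column names 'Rank', 'Male_Name', and 'Male_Number' match the query used in the next step, while 'Female_Name' and 'Female_Number' are assumed by analogy.

```python
import pandas as pd

# Placeholder rows standing in for the scraped table (not real figures).
rows_data = [
    ["1", "Name A", "1,000", "Name B", "900"],
    ["2", "Samuel", "800", "Name C", "700"],
]

columns = ["Rank", "Male_Name", "Male_Number", "Female_Name", "Female_Number"]
df = pd.DataFrame(rows_data, columns=columns)

# Scraped cells are strings; drop thousands separators and convert to integers.
for col in ["Rank", "Male_Number", "Female_Number"]:
    df[col] = df[col].str.replace(",", "", regex=False).astype(int)

print(df)
```

Converting the count columns to integers makes later sorting and arithmetic behave as expected.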
Step 5: Now we can run some operations on this data and draw a few insights.
Let's look up how many babies were given the name 'Samuel'.
df[df['Male_Name'] == 'Samuel'][['Male_Name','Male_Number','Rank']]
Conclusion
That is how we can use web scraping with Python to extract useful data from a website for analysis. Some important use cases for web scraping include:
- Businesses, Market Analysis, E-Commerce, Competition Monitoring, Price Comparison
- Collecting Data from Different Resources for Analysis
- Getting Latest News Reports
- Marketing
- Media
- Collecting Live Tracking Data for Travel Companies
- Weather Forecasting
Originally published at https://www.xbyte.io.