Why Do You Need Web Scraping?

It all begins with information, that is, a collection of facts. Businesses and organizations need data for market research. This data can be gathered through interviews, observations, surveys, and questionnaires, as well as from government archives and the Internet.

Web scraping is a technique for extracting large amounts of relevant data from websites and saving it to a file or database. The scraped data is usually stored in a tabular or spreadsheet format (e.g., a CSV file).

In this blog, we’ll scrape the value today website for its list of the top insurance companies.

Here is an overview of the steps we will follow:

  • Using requests, download the webpage.
  • Using beautifulsoup4, parse the HTML source code.
  • Extract company names, CEOs, global rankings, market capitalization, annual revenue, employee count, and company URLs.
  • Using Pandas, compile the data and generate a CSV file.

How to Perform Web Scraping?

Python is a fantastic language that provides packages such as Beautiful Soup, Requests, and Pandas that are used to extract data from HTML code and transform it into various formats (CSV, XML, JSON) depending on the application.

HTML: HTML (Hypertext Markup Language) is the code used to structure a website and its content. It includes tags that specify how a web browser should format and display information.

BeautifulSoup is a Python library that extracts data from HTML and XML files.

Requests is the de facto standard Python library for making HTTP requests.

HTTP is a protocol that is used to retrieve resources such as HTML documents.

Let us extract the web page that lists the top insurance companies by market capitalization.

At the end of the project, we will create a CSV file in the following format:

companies_name,CEOs_name,world_ranks,market_capitalizations_in_billion_dollars,annual_revenues_in_million_dollars,number_of_employees,companies_URLs
BERKSHIRE HATHAWAY,Warren Buffett,8,543.68,286260.0,391500.0,
UNITEDHEALTH GROUP,David S. Wichmann,18,332.73,255630.0,320000.0,
BANK OF AMERICA CORPORATION,Brian Moynihan,20,262.2,85530.0,208000.0,
WELLS FARGO & COMPANY,Charles W. Scharf,65,124.78,72340.0,258700.0,
AIA GROUP,Lee Yuan Siong,91,152.33,50360.0,23000.0,

Download the Webpage using requests

We’ll use the requests Python library to download the web page.

Let’s get started with installing and importing requests.

!pip install requests --upgrade --quiet
import requests

We can use requests.get to download the web page.

topics_url = ''
response = requests.get(topics_url)

requests.get returns a response object that contains the contents of the web page along with some additional information. Using response.text, we can access the contents of the page.

page_content = response.text
page_content[:1000]

'<!DOCTYPE html> \n<html lang="en" dir="ltr" prefix=" content: dc: foaf: og: rdfs: schema: sioc: sioct: skos: xsd: ">\n <head>\n <meta charset="utf-8"/> \n<script async src="//"></script> \n<script>(adsbygoogle=window.adsbygoogle||[]).push({google_ad_client:"ca-pub-2407955258669770",enable_page_level_ads:true});</script> <script>window.google_analytics_uacct="UA-121331115-1";(function(i,s,o,g,r,a,m){i["GoogleAnalyticsObject"]=r;i[r]=i[r]||function(){(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)})(window,document,"script","https://www'>

The output above shows the first 1,000 characters of the page’s HTML (page_content[:1000]), so we have successfully fetched the web page using requests. We can also save the HTML to a file and view it locally within Jupyter by selecting “File > Open.”

with open('world-insurance.html', 'w', encoding="utf-8") as file:
    file.write(page_content)

The saved page will look similar to the original page.

Parse the HTML source code using beautifulsoup4

To parse the HTML source code of the web page downloaded in the previous section, we’ll use the Beautiful Soup Python library. We’ll also add a helper function.

!pip install beautifulsoup4 --upgrade --quiet
from bs4 import BeautifulSoup
doc = BeautifulSoup(response.text, 'html.parser')

Once the page has been parsed, we can use doc to retrieve data from it.

type(doc)
bs4.BeautifulSoup

doc.find('title')
<title>World Top Insurance Companies by Market Value as on 2021</title>

We were able to retrieve the title of the web page, as shown above.
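To get a feel for what a parser does with the raw HTML, here is a minimal sketch that pulls out the same title using only Python’s built-in html.parser module (the module that the 'html.parser' option refers to). The TitleParser class and the sample_html string are made up for illustration; this is not how Beautiful Soup is implemented, just a small stand-in.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Hypothetical helper: collects the text inside the <title> tag."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        # Called for every opening tag the parser meets
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        # Called for every closing tag
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Called for the text between tags
        if self.in_title:
            self.title += data


# Illustrative HTML, shortened from the page title shown above
sample_html = (
    "<html><head>"
    "<title>World Top Insurance Companies by Market Value as on 2021</title>"
    "</head><body></body></html>"
)

parser = TitleParser()
parser.feed(sample_html)
print(parser.title)
```

Beautiful Soup saves us from writing this kind of tag-by-tag bookkeeping, which is why we use doc.find('title') instead.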

Let’s make a reusable helper function, get_page, that downloads a web page and returns a Beautiful Soup doc for any given URL.

def get_page(url):
    """Download a web page and return a beautiful soup doc"""
    # Download the webpage
    response = requests.get(url)
    # Check if the download was successful
    if response.status_code != 200:
        raise Exception('Unable to download page {}'.format(url))
    # Get the page HTML
    page_contents = response.text
    # Create a bs4 doc
    doc = BeautifulSoup(page_contents, 'html.parser')
    return doc

We can use this function to obtain a Beautiful Soup doc for the URL of any web page. response.status_code returns the HTTP status code of the response. A successful status code is in the range 200–299; get_page accepts only 200. response.text returns the response’s HTML content. The 'html.parser' argument selects Python’s built-in HTML parser, which Beautiful Soup uses to parse the HTML.
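The status code check can be sketched on its own, without downloading anything. The is_success helper below is a hypothetical name of ours, not part of requests; it simply tests whether a code falls in the 200–299 range, and http.HTTPStatus from the standard library supplies the human-readable phrase for each code.

```python
from http import HTTPStatus


def is_success(status_code):
    """Hypothetical helper: True for any status code in the 2xx range."""
    return 200 <= status_code <= 299


# A few common status codes, chosen for illustration
for code in (200, 204, 301, 404, 500):
    phrase = HTTPStatus(code).phrase
    print(code, phrase, is_success(code))
```

If we wanted get_page to accept every successful response rather than only 200, a check like this could replace the comparison with 200. requests also offers response.raise_for_status(), which raises an HTTPError for 4xx and 5xx responses.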

Extract Company Names, CEOs, World Ranks, Market Capitalization, Annual Revenue, Number of Employees, Company URLs, etc.

All of the information can be retrieved from the web page’s li tags.

Let’s create a variable that holds the li tags with the class row well clearfix.

company_block = doc.find_all('li',class_='row well clearfix')

Extracting Company Names

Let’s write a helper function that will extract company names from a web page.

def name_of_companies(company_block):
    company_names = []
    for tag in company_block:
        c_name = tag.find('div', class_='field field--name-node-title field--type-ds field--label-hidden field--item')
        company_names.append(c_name.find('a').text)
    return company_names

Let us check the function name_of_companies


Extracting CEO Names

A helper function for extracting CEO names from a web page.

def name_of_CEOs(company_block):
    CEO_names = []
    for tag in company_block:
        names = tag.find('div', class_='clearfix col-sm-12 field field--name-field-ceo field--type-entity-reference field--label-above')
        try:
            ceo = names.find('a').text
            CEO_names.append(ceo)
        except AttributeError:
            # The CEO field is missing for this company
            CEO_names.append(None)
    return CEO_names

Let Us Check the Function name_of_CEOs

# Let's call the function
name_of_CEOs(company_block)
['Warren Buffett', 'David S. Wichmann', 'Brian Moynihan', 'Charles W. Scharf', 'Lee Yuan Siong', 'David I. McKay', 'Ma Mingzhe', 'Gao Yingxin', None, 'Bharat Masrani']
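The None in the output comes from a company block with no CEO field: find returns None when nothing matches, and calling .find('a') on None raises the AttributeError that the helper catches. Here is a small sketch of the same situation on made-up HTML. The sample_html markup and the shortened class name field--name-field-ceo are simplifications of ours, and the explicit None checks are an alternative to the try/except used above, not a replacement the article requires.

```python
from bs4 import BeautifulSoup

# Made-up markup: the second block has no CEO field at all
sample_html = """
<ul>
  <li class="row well clearfix">
    <div class="field--name-field-ceo"><a href="#">Warren Buffett</a></div>
  </li>
  <li class="row well clearfix">
    <!-- no CEO field in this block -->
  </li>
</ul>
"""

doc = BeautifulSoup(sample_html, "html.parser")
blocks = doc.find_all("li", class_="row well clearfix")

ceo_names = []
for block in blocks:
    # find returns None when the div is missing
    field = block.find("div", class_="field--name-field-ceo")
    link = field.find("a") if field is not None else None
    ceo_names.append(link.get_text(strip=True) if link is not None else None)

print(ceo_names)
```

Either style keeps the lists aligned: every company block contributes exactly one entry, even when a field is missing.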

Extracting World Ranks

A helper function for obtaining World Ranks from a web page.

def ranks_of_world(company_block):
    world_ranks = []
    for tag in company_block:
        rank = tag.find('div', class_='clearfix col-sm-6 field field--name-field-world-rank-sep-01-2021- field--type-integer field--label-above')
        world_ranks.append(rank.find('div', class_='field--item').text)
    return world_ranks

Let us Check the Function ranks_of_world

ranks_of_world(company_block)
['8', '18', '20', '65', '91', '93', '94', '111', '131', '133']

Extracting Market Capitalization

A helper function for obtaining Market Capitalization from a website.

def market_caps(company_block):
    market_capitalization_in_dollars = []
    for tag in company_block:
        market_cap = tag.find('div', class_='clearfix col-sm-6 field field--name-field-market-value-jan012021 field--type-float field--label-above')
        try:
            caps = market_cap.find('div', class_='field--item').text
            replace_caps = caps.replace(' Billion USD', "")
            market_capitalization_in_dollars.append(float(replace_caps))
        except AttributeError:
            market_capitalization_in_dollars.append(None)
    return market_capitalization_in_dollars

Let us Check the Function market_caps

# Let's call the function
market_caps(company_block)
[543.68, 332.73, 262.2, 124.78, 152.33, 116.72, 233.34, 129.25, None, 102.4]

Extracting Annual Revenue

A helper function for obtaining Annual Revenue from a web page.

def annual_rev(company_block):
    annual_revenue_in_dollars = []
    for tag in company_block:
        annual_revenue = tag.find('div', class_='clearfix col-sm-12 field field--name-field-revenue-in-usd field--type-float field--label-inline')
        try:
            revenue = annual_revenue.find('div', class_='field--item').text
            replace_string = revenue.replace(',', "").replace(' Million USD', "")
            annual_revenue_in_dollars.append(int(replace_string))
        except AttributeError:
            annual_revenue_in_dollars.append(None)
    return annual_revenue_in_dollars

Let us Check the Function annual_rev

# Let's call the function
annual_rev(company_block)
[286260, 255630, 85530, 72340, 50360, 37367, 166950, 82215, 81730, 34568]
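Both market_caps and annual_rev clean a text value by removing a unit suffix (and, for revenue, thousands separators) before converting it to a number. As a sketch, that cleaning step could be shared in one function. The name parse_amount and the sample strings are assumptions of ours, inferred from the replace calls above rather than taken from the page itself.

```python
def parse_amount(text, unit_suffix):
    """Hypothetical helper: turn text such as '286,260 Million USD' into a float.

    Returns None when the text is missing or is not a number.
    """
    if text is None:
        return None
    # Drop the unit, the thousands separators, and surrounding whitespace
    cleaned = text.replace(unit_suffix, "").replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


# Sample strings in the shape the helpers above expect
print(parse_amount("543.68 Billion USD", "Billion USD"))
print(parse_amount("286,260 Million USD", "Million USD"))
print(parse_amount(None, "Million USD"))
```

Unlike the helpers above, this sketch also returns None for text that is present but not numeric, instead of raising a ValueError.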

Extracting Number of Employees

A helper function for extracting the number of employees from a web page.

def employees(company_block):
    no_of_employees = []
    for tag in company_block:
        employee = tag.find('div', class_='clearfix col-sm-12 field field--name-field-employee-count field--type-integer field--label-inline')
        try:
            n_employee = employee.find('div', class_='field--item').text
            replace_string = n_employee.replace(',', "")
            no_of_employees.append(int(replace_string))
        except AttributeError:
            no_of_employees.append(None)
    return no_of_employees

Let us Check the Function employees

# Let's call the function
employees(company_block)
[391500, 320000, 208000, 258700, 23000, 83842, 376900, 309384, 59000, 89598]

Extracting the Company URLs

A helper function for extracting the URLs of the Company from a web page.

def extract_urls(company_block):
    company_urls = []
    for tag in company_block:
        c_url = tag.find('div', class_='clearfix col-sm-12 field field--name-field-company-website field--type-link field--label-above')
        try:
            company_urls.append(c_url.find('a')['href'])
        except AttributeError:
            company_urls.append(None)
    return company_urls

Check the Function extract_urls

extract_urls(company_block)
['', '', '', '', '', '', '', '', '', '']

We now have all of the functions we need to extract the required data from a web page. It’s time to write a function that uses all of the helper functions above to build a dictionary.

Let’s define the scrape_page function, which will loop through all of the pages of the value today website (there are 53 pages, numbered 0 to 52).

def scrape_page():
    all_info_dict = {
        'companies_name': [],
        'CEOs_name': [],
        'world_ranks': [],
        'market_capitalizations_in_billion_dollars': [],
        'annual_revenues_in_million_dollars': [],
        'number_of_employees': [],
        'companies_URLs': []
    }
    # range(0, 53) already advances page, so no manual increment is needed
    for page in range(0, 53):
        url = f"{page}"
        company_block = get_page(url).find_all('li', class_='row well clearfix')
        all_info_dict['companies_name'] += name_of_companies(company_block)
        all_info_dict['CEOs_name'] += name_of_CEOs(company_block)
        all_info_dict['world_ranks'] += ranks_of_world(company_block)
        all_info_dict['market_capitalizations_in_billion_dollars'] += market_caps(company_block)
        all_info_dict['annual_revenues_in_million_dollars'] += annual_rev(company_block)
        all_info_dict['number_of_employees'] += employees(company_block)
        all_info_dict['companies_URLs'] += extract_urls(company_block)
    return all_info_dict

In the scrape_page function above, we first create a dictionary, all_info_dict (a dictionary stores data in key: value pairs), in which each of the keys ‘companies_name’, ‘CEOs_name’, ‘world_ranks’, ‘market_capitalizations_in_billion_dollars’, ‘annual_revenues_in_million_dollars’, ‘number_of_employees’, and ‘companies_URLs’ maps to an empty list. These lists will hold the values returned by the helper functions defined earlier.

The for loop goes through all of the website’s pages, finding the li tags on each page and storing them in the variable company_block.

Then, the output of each helper function is appended to the corresponding list, and finally, all of the results are stored in all_info_dict.
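The accumulation pattern is easier to see on a toy example. In this sketch, fake_helper_results is a hypothetical stand-in for the helper functions, and it returns made-up values for two of the keys instead of scraping anything.

```python
def fake_helper_results(page):
    """Hypothetical stand-in for the helper functions' output on one page."""
    return {
        'companies_name': [f'Company {page}-A', f'Company {page}-B'],
        'world_ranks': [str(page * 2 + 1), str(page * 2 + 2)],
    }


# Start with an empty list for every key
all_info = {'companies_name': [], 'world_ranks': []}

for page in range(3):
    page_results = fake_helper_results(page)
    for key in all_info:
        # += extends the existing list in place with this page's values
        all_info[key] += page_results[key]

print(len(all_info['companies_name']))
print(all_info['world_ranks'])
```

Because every helper returns one entry per company block, all of the lists grow by the same amount on each page, which is what lets us turn the dictionary into a table later.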

Compiling the Information and Creating a CSV File Using Pandas

From the dictionary, we’ll create a pandas data frame.

Pandas is a Python library that is used to work with data sets.

A DataFrame is a data structure that organizes information into a two-dimensional table of rows and columns.

# Create pandas dataframe from dictionary
import pandas as pd
scrape_page_dataframe = pd.DataFrame(scrape_page())


Saving the Extracted Data into a CSV File:
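A DataFrame can be written to a CSV file with its to_csv method. The sketch below uses a small hand-built DataFrame, with two rows copied from the sample output near the start of this article, in place of scrape_page_dataframe so that it runs without scraping. The file name insurance_companies.csv and the temporary directory are assumptions of ours.

```python
import os
import tempfile

import pandas as pd

# Two rows copied from the sample output earlier in the article,
# standing in for the full scrape_page_dataframe
sample_dataframe = pd.DataFrame({
    'companies_name': ['BERKSHIRE HATHAWAY', 'UNITEDHEALTH GROUP'],
    'CEOs_name': ['Warren Buffett', 'David S. Wichmann'],
    'world_ranks': ['8', '18'],
    'market_capitalizations_in_billion_dollars': [543.68, 332.73],
    'annual_revenues_in_million_dollars': [286260, 255630],
    'number_of_employees': [391500, 320000],
    'companies_URLs': [None, None],
})

# Hypothetical output path; a temporary directory keeps the sketch tidy
csv_path = os.path.join(tempfile.mkdtemp(), 'insurance_companies.csv')

# index=False leaves the DataFrame's row index out of the file
sample_dataframe.to_csv(csv_path, index=False)

# Read the file back to check what was written
with open(csv_path, encoding='utf-8') as file:
    csv_lines = file.read().splitlines()
print(csv_lines[0])
```

Missing values such as the None URLs are written as empty fields, which is why the rows in the sample output end with a comma.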



Here’s a brief breakdown of the steps we took to scrape top insurance companies from value today.

  • Using requests, we downloaded the webpage.
  • We used beautifulsoup4 to parse the HTML source code of the web page.
  • We extracted company names, CEOs, global rankings, market capitalization, annual revenue, employee count, and company URLs.
  • Using Pandas, we compiled the data and generated a CSV file.

The format of the CSV file we created is as follows:

Here is the complete script for the Project:

import requests
from bs4 import BeautifulSoup

def get_page(url):
    """Download a web page and return a beautiful soup doc"""
    # Download the webpage
    response = requests.get(url)
    # Check if the download was successful
    if response.status_code != 200:
        raise Exception('Unable to download page {}'.format(url))
    # Get the page HTML
    page_contents = response.text
    # Create a bs4 doc
    doc = BeautifulSoup(page_contents, 'html.parser')
    return doc

def name_of_companies(company_block):
    company_names = []
    for tag in company_block:
        c_name = tag.find('div', class_='field field--name-node-title field--type-ds field--label-hidden field--item')
        company_names.append(c_name.find('a').text)
    return company_names

def name_of_CEOs(company_block):
    CEO_names = []
    for tag in company_block:
        names = tag.find('div', class_='clearfix col-sm-12 field field--name-field-ceo field--type-entity-reference field--label-above')
        try:
            ceo = names.find('a').text
            CEO_names.append(ceo)
        except AttributeError:
            CEO_names.append(None)
    return CEO_names

def ranks_of_world(company_block):
    world_ranks = []
    for tag in company_block:
        rank = tag.find('div', class_='clearfix col-sm-6 field field--name-field-world-rank-sep-01-2021- field--type-integer field--label-above')
        world_ranks.append(rank.find('div', class_='field--item').text)
    return world_ranks

def market_caps(company_block):
    market_capitalization_in_dollars = []
    for tag in company_block:
        market_cap = tag.find('div', class_='clearfix col-sm-6 field field--name-field-market-value-jan012021 field--type-float field--label-above')
        try:
            caps = market_cap.find('div', class_='field--item').text
            replace_caps = caps.replace(' Billion USD', "")
            market_capitalization_in_dollars.append(float(replace_caps))
        except AttributeError:
            market_capitalization_in_dollars.append(None)
    return market_capitalization_in_dollars

def annual_rev(company_block):
    annual_revenue_in_dollars = []
    for tag in company_block:
        annual_revenue = tag.find('div', class_='clearfix col-sm-12 field field--name-field-revenue-in-usd field--type-float field--label-inline')
        try:
            revenue = annual_revenue.find('div', class_='field--item').text
            replace_string = revenue.replace(',', "").replace(' Million USD', "")
            annual_revenue_in_dollars.append(int(replace_string))
        except AttributeError:
            annual_revenue_in_dollars.append(None)
    return annual_revenue_in_dollars

def employees(company_block):
    no_of_employees = []
    for tag in company_block:
        employee = tag.find('div', class_='clearfix col-sm-12 field field--name-field-employee-count field--type-integer field--label-inline')
        try:
            n_employee = employee.find('div', class_='field--item').text
            replace_string = n_employee.replace(',', "")
            no_of_employees.append(int(replace_string))
        except AttributeError:
            no_of_employees.append(None)
    return no_of_employees

def extract_urls(company_block):
    company_urls = []
    for tag in company_block:
        c_url = tag.find('div', class_='clearfix col-sm-12 field field--name-field-company-website field--type-link field--label-above')
        try:
            company_urls.append(c_url.find('a')['href'])
        except AttributeError:
            company_urls.append(None)
    return company_urls

def scrape_page():
    # The keys here must match the keys used in the loop below
    all_info_dict = {
        'companies_name': [],
        'CEOs_name': [],
        'world_ranks': [],
        'market_capitalizations_in_billion_dollars': [],
        'annual_revenues_in_million_dollars': [],
        'number_of_employees': [],
        'companies_URLs': []
    }
    for page in range(0, 53):
        url = f"{page}"
        company_block = get_page(url).find_all('li', class_='row well clearfix')
        all_info_dict['companies_name'] += name_of_companies(company_block)
        all_info_dict['CEOs_name'] += name_of_CEOs(company_block)
        all_info_dict['world_ranks'] += ranks_of_world(company_block)
        all_info_dict['market_capitalizations_in_billion_dollars'] += market_caps(company_block)
        all_info_dict['annual_revenues_in_million_dollars'] += annual_rev(company_block)
        all_info_dict['number_of_employees'] += employees(company_block)
        all_info_dict['companies_URLs'] += extract_urls(company_block)
    return all_info_dict

