How to Scrape Yellow Pages using Python & LXML?

X-Byte Enterprise Crawling · Sep 23, 2020

In this scraping tutorial, we will show you how to scrape YellowPages.com using Python and LXML, extracting business details for a given category and city from the Yellow Pages.

To demonstrate this Yellow Pages scraper, we will search Yellow Pages for restaurants in a city and then extract the business details from the first page of results.

What Data Do We Extract?

Here are the data fields that we will scrape:

  • Rankings
  • Business Name
  • Business Pages
  • Phone Numbers
  • Website
  • Street Name
  • Category
  • Ratings
  • Region
  • Locality
  • URL
  • Zip Code

Here is a screenshot of the information we will scrape from Yellow Pages using this scraper.
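For reference, the scraper below collects each listing into a plain Python dictionary. Here is a minimal sketch of the record shape; the field names match the code later in this tutorial, but the values are purely illustrative:

business_details = {
    'rank': '1',                        # position of the listing on the results page
    'business_name': 'Example Bistro',  # illustrative value, not real data
    'telephone': '(555) 555-0100',
    'business_page': 'https://www.yellowpages.com/boston-ma/mip/example',
    'category': 'Restaurants,American Restaurants',
    'website': 'http://www.example.com',
    'rating': '4.5',
    'street': '123 Main St',
    'locality': 'Boston',
    'region': 'MA',
    'zipcode': '02116',
    'listing_url': 'https://www.yellowpages.com/search?search_terms=restaurant&geo_location_terms=Boston'
}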

How to Find Data?

Before we start building the Yellow Pages scraper, we first need to find where the data sits in the page's HTML tags. To do this, you should understand the HTML structure of the page's content.

If you already know Python and HTML, this will be easy. You don't need advanced programming skills for most of this tutorial.

Let's examine the HTML of the web page and find where the data is located. Here is what we are going to do:

  • Find the HTML tag that encloses the list of links we need data from
  • Get the links from that tag and scrape the data

Reviewing the HTML

Why do we need to inspect the elements? To locate any element on the web page using an XPath expression.

Open a web browser (we used Google Chrome here) and visit this link: https://www.yellowpages.com/search?search_terms=restaurant&geo_location_terms=Boston

Then right-click on a link on the page and select 'Inspect Element'. A panel will open showing the HTML content of the web page in a well-structured format.

The image here shows the data we need to scrape inside a DIV tag. If you look closely, the DIV has an attribute named 'class' with the value 'result'. This DIV contains the data fields we need to scrape.

Now let's find the HTML tag(s) that contain the links we need to scrape. Right-click on a link title in the browser and choose 'Inspect Element'. This will open the HTML content and highlight the tag that holds the data you right-clicked on. See the image below for the well-structured data fields.
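Once the packages from the next section are installed, those XPaths can be sanity-checked from Python with lxml. Here is a minimal sketch, assuming the class names described above (Yellow Pages may change its markup at any time):

import requests
from lxml import html

# fetch the Boston restaurants results page used in this tutorial
url = "https://www.yellowpages.com/search?search_terms=restaurant&geo_location_terms=Boston"
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
tree = html.fromstring(response.text)

# each business listing sits in a div of class 'v-card' inside the organic results
cards = tree.xpath("//div[@class='search-results organic']//div[@class='v-card']")
print(len(cards), "result cards found")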

How to Set Up Your Computer for Web Scraping Development?

We will use Python 3 for this web scraping tutorial; the code won't run if you use Python 2.7. To begin, your system needs Python 3 and PIP installed.

Most UNIX-based operating systems, including Mac OS and Linux, come with Python pre-installed. However, not all Linux distributions ship with Python 3 by default.

To check your Python version, open the terminal (on Mac OS and Linux) or Command Prompt (on Windows) and type:

python --version

Then press the Enter key. If the output looks like Python 3.x.x, you have Python 3 installed; if it shows Python 2.x.x, you have Python 2. If it prints an error instead, you probably don't have Python installed at all.
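You can also make a script refuse to run under Python 2. A small sketch using only the standard library:

import sys

# stop early with a clear message if this is not Python 3
if sys.version_info < (3,):
    raise SystemExit("This script requires Python 3; found %s" % sys.version.split()[0])
print("Python", sys.version.split()[0], "detected")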

Installing Python 3 with Pip

You can use this guide to install Python 3 on Linux:
http://docs.python-guide.org/en/latest/starting/install3/linux/

If you are a Mac user, you can follow the guide at http://docs.python-guide.org/en/latest/starting/install3/osx/

Package Installation

Use Python Requests to make requests and download the HTML content of the pages (installation guide: http://docs.python-requests.org/en/master/user/install/).

Use Python LXML to parse the HTML tree structure with XPaths (learn more at http://lxml.de/installation.html).
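All three third-party packages used in this tutorial (Requests, LXML, and unicodecsv, which the script below uses for CSV output) can usually be installed in one go with pip:

pip3 install requests lxml unicodecsv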

The Code

#! /usr/bin/env python
# -*- coding: utf-8 -*-
import argparse

import requests
import unicodecsv as csv
from lxml import html


def parse_listing(keyword, place):
    """
    Process a Yellow Pages listing page.
    :param keyword: search query
    :param place: place name
    """
    url = "https://www.yellowpages.com/search?search_terms={0}&geo_location_terms={1}".format(keyword, place)
    print("retrieving ", url)

    headers = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
               'Accept-Encoding': 'gzip, deflate, br',
               'Accept-Language': 'en-GB,en;q=0.9,en-US;q=0.8,ml;q=0.7',
               'Cache-Control': 'max-age=0',
               'Connection': 'keep-alive',
               'Host': 'www.yellowpages.com',
               'Upgrade-Insecure-Requests': '1',
               'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36'
               }

    # retry a few times in case of transient failures
    for retry in range(10):
        try:
            response = requests.get(url, verify=False, headers=headers)
            print("parsing page")
            if response.status_code == 200:
                parser = html.fromstring(response.text)
                # make all links absolute
                base_url = "https://www.yellowpages.com"
                parser.make_links_absolute(base_url)

                XPATH_LISTINGS = "//div[@class='search-results organic']//div[@class='v-card']"
                listings = parser.xpath(XPATH_LISTINGS)
                scraped_results = []

                for results in listings:
                    XPATH_BUSINESS_NAME = ".//a[@class='business-name']//text()"
                    XPATH_BUSINESS_PAGE = ".//a[@class='business-name']//@href"
                    XPATH_TELEPHONE = ".//div[@class='phones phone primary']//text()"
                    XPATH_ADDRESS = ".//div[@class='info']//div//p[@itemprop='address']"
                    XPATH_STREET = ".//div[@class='street-address']//text()"
                    XPATH_LOCALITY = ".//div[@class='locality']//text()"
                    XPATH_REGION = ".//div[@class='info']//div//p[@itemprop='address']//span[@itemprop='addressRegion']//text()"
                    XPATH_ZIP_CODE = ".//div[@class='info']//div//p[@itemprop='address']//span[@itemprop='postalCode']//text()"
                    XPATH_RANK = ".//div[@class='info']//h2[@class='n']/text()"
                    XPATH_CATEGORIES = ".//div[@class='info']//div[contains(@class,'info-section')]//div[@class='categories']//text()"
                    XPATH_WEBSITE = ".//div[@class='info']//div[contains(@class,'info-section')]//div[@class='links']//a[contains(@class,'website')]/@href"
                    XPATH_RATING = ".//div[@class='info']//div[contains(@class,'info-section')]//div[contains(@class,'result-rating')]//span//text()"

                    raw_business_name = results.xpath(XPATH_BUSINESS_NAME)
                    raw_business_telephone = results.xpath(XPATH_TELEPHONE)
                    raw_business_page = results.xpath(XPATH_BUSINESS_PAGE)
                    raw_categories = results.xpath(XPATH_CATEGORIES)
                    raw_website = results.xpath(XPATH_WEBSITE)
                    raw_rating = results.xpath(XPATH_RATING)
                    # address = results.xpath(XPATH_ADDRESS)
                    raw_street = results.xpath(XPATH_STREET)
                    raw_locality = results.xpath(XPATH_LOCALITY)
                    raw_region = results.xpath(XPATH_REGION)
                    raw_zip_code = results.xpath(XPATH_ZIP_CODE)
                    raw_rank = results.xpath(XPATH_RANK)

                    business_name = ''.join(raw_business_name).strip() if raw_business_name else None
                    telephone = ''.join(raw_business_telephone).strip() if raw_business_telephone else None
                    business_page = ''.join(raw_business_page).strip() if raw_business_page else None
                    rank = ''.join(raw_rank).replace('.\xa0', '') if raw_rank else None
                    category = ','.join(raw_categories).strip() if raw_categories else None
                    website = ''.join(raw_website).strip() if raw_website else None
                    rating = ''.join(raw_rating).replace("(", "").replace(")", "").strip() if raw_rating else None
                    street = ''.join(raw_street).strip() if raw_street else None
                    locality = ''.join(raw_locality).replace(',\xa0', '').strip() if raw_locality else None
                    region = ''.join(raw_region).strip() if raw_region else None
                    zipcode = ''.join(raw_zip_code).strip() if raw_zip_code else None

                    # keep only the city in 'locality'; fall back to the split
                    # pieces ("City, ST 12345") when the itemprop spans are missing
                    if locality and ',' in locality:
                        locality, remainder = locality.split(',', 1)
                        locality = locality.strip()
                        parts = remainder.strip().split(' ')
                        if len(parts) == 2:
                            region = region or parts[0]
                            zipcode = zipcode or parts[1]

                    business_details = {
                        'business_name': business_name,
                        'telephone': telephone,
                        'business_page': business_page,
                        'rank': rank,
                        'category': category,
                        'website': website,
                        'rating': rating,
                        'street': street,
                        'locality': locality,
                        'region': region,
                        'zipcode': zipcode,
                        'listing_url': response.url
                    }
                    scraped_results.append(business_details)
                return scraped_results
            elif response.status_code == 404:
                print("Could not find a location matching", place)
                # no need to retry for a non-existing page
                break
            else:
                print("Failed to process page")
                return []
        except Exception:
            print("Failed to process page")
            return []
    return []


if __name__ == "__main__":
    argparser = argparse.ArgumentParser()
    argparser.add_argument('keyword', help='Search Keyword')
    argparser.add_argument('place', help='Place Name')
    args = argparser.parse_args()
    keyword = args.keyword
    place = args.place

    scraped_data = parse_listing(keyword, place)
    if scraped_data:
        print("Writing scraped data to %s-%s-yellowpages-scraped-data.csv" % (keyword, place))
        with open('%s-%s-yellowpages-scraped-data.csv' % (keyword, place), 'wb') as csvfile:
            fieldnames = ['rank', 'business_name', 'telephone', 'business_page', 'category', 'website', 'rating',
                          'street', 'locality', 'region', 'zipcode', 'listing_url']
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames, quoting=csv.QUOTE_ALL)
            writer.writeheader()
            for data in scraped_data:
                writer.writerow(data)
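One note on the request call: because it passes verify=False, Requests will emit an InsecureRequestWarning on every fetch. If that clutters your output, the warning can be silenced near the top of the script (urllib3 ships alongside Requests):

import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)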

To see how to run the script, type the script name followed by -h in the terminal or command prompt:

usage: yellow_pages.py [-h] keyword place
positional arguments:
keyword Search Keyword
place Place Name
optional arguments:
-h, --help show this help message and exit

The positional argument keyword represents a business category, and place is the location in which to look for businesses. As an example, let's get the business details for restaurants in Boston, MA. The script would be run like this:

python3 yellow_pages.py restaurants "Boston, MA"

You should see a file named restaurants-Boston, MA-yellowpages-scraped-data.csv in the same folder as the script, containing the extracted data.
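If you want to spot-check the output from Python rather than a spreadsheet, here is a small sketch using the same unicodecsv package (the file name assumes the example run above):

import unicodecsv as csv

# unicodecsv works on byte streams, so open the file in binary mode
with open('restaurants-Boston, MA-yellowpages-scraped-data.csv', 'rb') as csvfile:
    for row in csv.DictReader(csvfile):
        print(row['rank'], row['business_name'], row['telephone'])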

Click the link below to contact us for services and a free quote.

https://www.xbyte.io/contact-us.php

Some Limitations

This code should be able to scrape business details for most locations, but note that it only collects the first page of results, and the site's markup can change at any time. If you need professional assistance with scraping websites, feel free to contact us.
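If you need more than the first page of results, the search URL also appears to accept a page parameter (an assumption based on the site's pagination links; verify it in your own browser before relying on it). A minimal sketch that walks several result pages:

import requests
from lxml import html

def scrape_result_pages(keyword, place, max_pages=3):
    """Hypothetical extension: collect listing elements from several result pages."""
    listings = []
    for page in range(1, max_pages + 1):
        # the &page= parameter is an assumption based on Yellow Pages' pagination links
        url = ("https://www.yellowpages.com/search?search_terms={0}"
               "&geo_location_terms={1}&page={2}").format(keyword, place, page)
        response = requests.get(url, verify=False, headers={'User-Agent': 'Mozilla/5.0'})
        if response.status_code != 200:
            break
        parser = html.fromstring(response.text)
        parser.make_links_absolute("https://www.yellowpages.com")
        cards = parser.xpath("//div[@class='search-results organic']//div[@class='v-card']")
        if not cards:
            # ran past the last page of results
            break
        listings.extend(cards)
    return listings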
