How to Scrape Amazon.com Data with Python?
Get past Amazon.com’s anti-bot mechanisms and use Python to crawl Amazon.com pricing data automatically.
You can’t crawl Amazon.com easily these days. Amazon blocks plain HTTP requests, making naive automation fail.
For instance, run the given code:
```python
import requests

page = requests.get('https://www.amazon.com/')
print(page.text)
```
Rather than the page’s actual HTML, your request gets a polite rejection from Amazon.com that refers you to its APIs.
So suppose you don’t want to use Amazon’s APIs (they aren’t free). Note also that in 2019, a US court ruled that scraping publicly available web data is legal.
Collecting Amazon.com public data is completely legal.
Extracting Amazon.com data is not easy, but it’s not impossible either. In this blog, we will share our practices. We hope they are useful to you, whether you are a buyer, a seller, or a data scientist who needs up-to-date raw price data.
Whether the scraping target is run by a small or a large company, we don’t send requests in multi-process, multi-threaded, or asynchronous bursts. Small servers without protection can be knocked over by waves of HTTP requests, and well-protected websites will block your IP address, if not ban it outright.
Adding a small interval between HTTP requests is a sensible precaution before you start a crawler, for instance:

```python
import time

time.sleep(0.1)  # sleep for 0.1 second
```
To prevent large-scale scraping, many websites apply strict anti-bot policies, for example asking the requesting browser to run a piece of JavaScript, among other complex checks. The easiest way to pass these checks is to use a headless browser, such as:

- pyppeteer, a Python port of puppeteer
- Selenium with Python
These headless browsers behave like Firefox or Chrome without rendering a visible window, and can be fully controlled from code.
For pages that need complex interactions, or even human eyes and hands, you might go further and build a Chrome extension that captures the page data and sends it to a locally running service.
However, headless browsers have one drawback: enormous resource use, in both RAM and CPU, since every HTTP request is made by a real browser.
For Amazon pricing data, we will use a different solution. Let’s continue.
The cloudscraper package was made to bypass Cloudflare’s anti-bot pages (known as IUAM, or “I’m Under Attack Mode”). We have found that it works on various websites, and it works well on Amazon’s pages too.
To install it:
Run the quick test:
```python
import cloudscraper

scraper = cloudscraper.create_scraper()
page = scraper.get('https://www.amazon.com/')
print(page.text)
```
Now you should see the actual page content rather than the polite API-referral welcome.
The cloudscraper module offers many features; one of them is browser-type configuration. For instance, suppose we want amazon.com to return the HTML served to Chrome on Windows (as opposed to the mobile version). We can recreate the scraper instance like this:
```python
scraper = cloudscraper.create_scraper(
    browser={
        'browser': 'chrome',
        'platform': 'windows',
        'desktop': True
    }
)
```
Many other combinations are listed in the package’s README.
Solving captchas is the worst part of data scraping, and you will face them sooner or later.
To minimize the impact of captchas, we use the following tactics:
Tactic #1. Rather than Solving It, Ignore It
Whenever our crawler sees a captcha, we put the URL back into the URL queue, then shuffle the queue to randomize the order (to avoid sending the same URLs repeatedly within a short timespan).

```python
import random

random.shuffle(url_list)
```
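The requeue-and-shuffle loop can be sketched as below; `fetch` and `is_captcha` are hypothetical stand-ins for the real scraper call and captcha detection, not names from our actual code:

```python
import random

def crawl(urls, fetch, is_captcha):
    """Skip captchas instead of solving them: put the URL back into
    the queue and shuffle, so the same URL is not retried immediately."""
    queue = list(urls)
    results = {}
    while queue:
        url = queue.pop(0)
        page = fetch(url)
        if is_captcha(page):
            queue.append(url)      # put the URL string back in the queue
            random.shuffle(queue)  # randomize the crawl order
        else:
            results[url] = page
    return results
```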
Tactic #2. Apply Random Sleep
Apply a random sleep after each page request. After many tries, we found that sleeping between 0 and 1 second works well: not too slow, while triggering the fewest captchas.

```python
import random
import time

time.sleep(random.uniform(0 + sleep_base, 1 + sleep_base))
```
In the code above, we also use a sleep_base variable. Whenever the crawler triggers a captcha, we add 2 seconds to sleep_base (sleep_base += 2).
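The random sleep and the back-off step can be wrapped in a pair of small helpers; the function names here are ours, for illustration only:

```python
import random
import time

def polite_sleep(sleep_base):
    """Sleep a random 0-1 second interval on top of the current base delay."""
    time.sleep(random.uniform(0 + sleep_base, 1 + sleep_base))

def backoff(sleep_base):
    """Add 2 seconds to the base delay after a captcha is triggered."""
    return sleep_base + 2
```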
Tactic #3. Use Multiple Scraper Instances
A single scraper instance starts triggering captchas after roughly 200 requests, and keeps triggering more no matter how many seconds we add to sleep_base. Using multiple scraper instances alleviates this effectively. Here, we use 30 instances:

```python
import cloudscraper

scrapers = []
for _ in range(30):
    scraper = cloudscraper.create_scraper(
        browser={
            'browser': 'chrome',
            'platform': 'windows',
            'desktop': True
        }
    )
    scrapers.append(scraper)
```
With this setup, take one instance from the scraper list for each request, then shuffle the list (or randomize the picked index).
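Picking an instance per request can be as simple as the sketch below; `pick_scraper` is a hypothetical helper name, and the usage comment assumes a `url` variable holding the page to fetch:

```python
import random

def pick_scraper(scrapers):
    """Choose a random scraper instance for the next request."""
    return random.choice(scrapers)

# usage: page = pick_scraper(scrapers).get(url)
```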
Many tools can help with parsing HTML text, and you could even use regular expressions to extract the key data you need. We have found BeautifulSoup a very convenient tool for navigating HTML elements.
For installing the packages:

```shell
pip install beautifulsoup4
pip install html5lib
```
With the help of html5lib, you can enable an HTML5 parser in BeautifulSoup. Let’s look at its use through two simple (and real) examples.
Use the find function to retrieve data from an element with a particular id:
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(page.text, 'html5lib')
sku_title = soup.find(id='productTitle').get_text().strip()
```
Use the select function to navigate elements with a CSS-style selector:

```python
chip = soup.select('.po-graphics_coprocessor > td')[1].get_text().strip()
```
It is not hard to pick up more usage from its documentation.
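To see both functions end to end, here is a minimal, self-contained sketch. The HTML snippet is made up to mimic a product page, and the built-in 'html.parser' is used so the example runs without html5lib; swap in 'html5lib' for real pages:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet mimicking an Amazon product page.
html = """
<html><body>
  <span id="productTitle">  Example Gaming Laptop  </span>
  <table>
    <tr class="po-graphics_coprocessor">
      <td>Graphics Coprocessor</td>
      <td>ExampleChip X1</td>
    </tr>
  </table>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')

# find(): locate the element with a particular id.
sku_title = soup.find(id='productTitle').get_text().strip()

# select(): CSS selector picking the <td> children of the matching row.
chip = soup.select('.po-graphics_coprocessor > td')[1].get_text().strip()

print(sku_title)  # Example Gaming Laptop
print(chip)       # ExampleChip X1
```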
We are not sure how long this solution will survive; maybe a month, a year, or ten years. As we write this, a crawler powered by this solution is running on our server 24/7.
If the data is fresh today, it means the crawler is still working and the solution still applies.
We will update this blog as new problems come up and are solved.
For more information about scraping Amazon.com data using Python, contact X-Byte Enterprise Crawling or ask for a free quote!
Originally published at https://www.xbyte.io.