How to Scrape Flight Data with Python
Let’s assume we need to organize a trip for next weekend, either to Madrid or to Milan, and we’ll simply go for whichever is the better deal. As a first step, we search for the flight the way we usually would; in this example, we use Kayak. Once we have entered the search criteria and set a few extra filters such as “Nonstop”, we notice that the URL in the browser changes accordingly.
We can break this URL down into several parts: origin, startdate, destination, enddate, and a suffix that tells Kayak to search for direct connections only and to sort the results by price.
origin = "ZRH" destination = "MXP" startdate = "2019-09-06" enddate = "2019-09-09" url = "https://www.kayak.com/flights/" + origin + "-" + destination + "/" + startdate + "/"
The basic idea now is to get the data we need (e.g. prices, arrival and departure times) from the underlying HTML code of the website. To do this, we rely on two packages. The first is selenium, which controls the browser and opens the website automatically. The second is Beautiful Soup, which helps us reshape the messy HTML code into a more readable and structured format. From that “soup” we can then easily pick out the tasty bits we are looking for.
Now, let’s get started. First, we need to set up selenium. For that, we download a browser driver such as ChromeDriver (make sure it matches your installed Chrome version) and place it in the same folder as the Python code. We then load a few packages, tell selenium to use ChromeDriver, and open the URL from above.
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
from bs4 import BeautifulSoup
import re
import pandas as pd
import numpy as np

chrome_options = webdriver.ChromeOptions()
driver = webdriver.Chrome("chromedriver.exe")
driver.implicitly_wait(20)
driver.get(url)
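A quick note on versions: the driver setup above follows the older Selenium 3 style, where the path to chromedriver.exe is passed directly to webdriver.Chrome. If you are on Selenium 4 or later, the path goes through a Service object instead; a minimal sketch, assuming the ChromeDriver executable sits next to the script:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Selenium 4+ style: the driver path is wrapped in a Service object
service = Service("chromedriver.exe")
driver = webdriver.Chrome(service=service)
driver.implicitly_wait(20)
driver.get(url)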
Once the site has loaded, we need to figure out how to access the data that is relevant to us. Taking the departure time as an example, we can use the browser’s inspect feature to see that the departure time of 8:55 pm is wrapped in a span with the class “depart-time base-time”.
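Before wiring this into a full loop, it can help to sanity-check the selector on a single element. A minimal sketch, assuming the page from above is still open in driver and that Kayak has not changed these class names in the meantime:

# Quick sanity check: grab the first departure time on the page
soup = BeautifulSoup(driver.page_source, 'lxml')
first_departure = soup.find('span', attrs={'class': 'depart-time base-time'})
if first_departure is not None:
    print(first_departure.getText())  # e.g. "8:55 " (the am/pm part sits in a separate span)
else:
    print("Selector not found - Kayak may have changed its markup")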
If we now pass the site’s HTML code to BeautifulSoup, we can search specifically for the classes we are interested in, and the results can be scraped with a simple loop. Since each search result contains two different departure times (outbound and return flight), we also need to reshape the results into logical departure and arrival time pairs.
soup = BeautifulSoup(driver.page_source, 'lxml')

deptimes = soup.find_all('span', attrs={'class': 'depart-time base-time'})
arrtimes = soup.find_all('span', attrs={'class': 'arrival-time base-time'})
meridies = soup.find_all('span', attrs={'class': 'time-meridiem meridiem'})

deptime = []
for div in deptimes:
    deptime.append(div.getText()[:-1])
arrtime = []
for div in arrtimes:
    arrtime.append(div.getText()[:-1])
meridiem = []
for div in meridies:
    meridiem.append(div.getText())

deptime = np.asarray(deptime)
deptime = deptime.reshape(int(len(deptime)/2), 2)
arrtime = np.asarray(arrtime)
arrtime = arrtime.reshape(int(len(arrtime)/2), 2)
meridiem = np.asarray(meridiem)
meridiem = meridiem.reshape(int(len(meridiem)/4), 4)
We use a similar approach for the prices. When inspecting the price element, however, we see that Kayak uses several different classes for its price information, so we need a regular expression to capture the different cases. The price itself is also wrapped a few layers deep, which is why we need a couple of extra steps to get to it.
regex = re.compile('Common-Booking-MultiBookProvider (.*)multi-row Theme-featured-large(.*)')
price_list = soup.find_all('div', attrs={'class': regex})

price = []
for div in price_list:
    price.append(int(div.getText().split('\n')[3][1:-1]))
Now we just put everything together into a neat dataframe and get:
df = pd.DataFrame({"origin" : origin, "destination" : destination, "startdate" : startdate, "enddate" : enddate, "price": price, "currency": "USD", "deptime_o": [m+str(n) for m,n in zip(deptime[:,0],meridiem[:,0])], "arrtime_d": [m+str(n) for m,n in zip(arrtime[:,0],meridiem[:,1])], "deptime_d": [m+str(n) for m,n in zip(deptime[:,1],meridiem[:,2])], "arrtime_o": [m+str(n) for m,n in zip(arrtime[:,1],meridiem[:,3])] })
We have now extracted and shaped all the data that was hidden in the HTML code of the first page of flights. The heavy lifting is done.
To make things more convenient, we can wrap the code from above in a function and then call that function for various destination and departure-date combinations for a three-day trip. When sending several requests, Kayak may suspect that we are a bot. The best way to deal with that is to keep changing the browser’s user agent and to wait a bit between requests. Our whole code could then look like this:
# -*- using Python 3.7 -*-
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
from bs4 import BeautifulSoup
import re
import pandas as pd
import numpy as np
from datetime import date, timedelta, datetime
import time

def scrape(origin, destination, startdate, days, requests):
    global results

    enddate = datetime.strptime(startdate, '%Y-%m-%d').date() + timedelta(days)
    enddate = enddate.strftime('%Y-%m-%d')

    url = "https://www.kayak.com/flights/" + origin + "-" + destination + "/" + startdate + "/" + enddate + "?sort=bestflight_a&fs=stops=0"
    print("\n" + url)

    chrome_options = webdriver.ChromeOptions()
    agents = ["Firefox/66.0.3", "Chrome/73.0.3683.68", "Edge/16.16299"]
    print("User agent: " + agents[(requests % len(agents))])
    chrome_options.add_argument('--user-agent="' + agents[(requests % len(agents))] + '"')
    chrome_options.add_experimental_option('useAutomationExtension', False)

    driver = webdriver.Chrome("chromedriver.exe", options=chrome_options, desired_capabilities=chrome_options.to_capabilities())
    driver.implicitly_wait(20)
    driver.get(url)

    # Check if Kayak thinks that we're a bot
    time.sleep(5)
    soup = BeautifulSoup(driver.page_source, 'lxml')

    if soup.find_all('p')[0].getText() == "Please confirm that you are a real KAYAK user.":
        print("Kayak thinks I'm a bot, which I am ... so let's wait a bit and try again")
        driver.close()
        time.sleep(20)
        return "failure"

    time.sleep(20)  # wait 20sec for the page to load

    soup = BeautifulSoup(driver.page_source, 'lxml')

    # Get the arrival and departure times
    deptimes = soup.find_all('span', attrs={'class': 'depart-time base-time'})
    arrtimes = soup.find_all('span', attrs={'class': 'arrival-time base-time'})
    meridies = soup.find_all('span', attrs={'class': 'time-meridiem meridiem'})

    deptime = []
    for div in deptimes:
        deptime.append(div.getText()[:-1])
    arrtime = []
    for div in arrtimes:
        arrtime.append(div.getText()[:-1])
    meridiem = []
    for div in meridies:
        meridiem.append(div.getText())

    deptime = np.asarray(deptime)
    deptime = deptime.reshape(int(len(deptime)/2), 2)
    arrtime = np.asarray(arrtime)
    arrtime = arrtime.reshape(int(len(arrtime)/2), 2)
    meridiem = np.asarray(meridiem)
    meridiem = meridiem.reshape(int(len(meridiem)/4), 4)

    # Get the price
    regex = re.compile('Common-Booking-MultiBookProvider (.*)multi-row Theme-featured-large(.*)')
    price_list = soup.find_all('div', attrs={'class': regex})
    price = []
    for div in price_list:
        price.append(int(div.getText().split('\n')[3][1:-1]))

    df = pd.DataFrame({"origin": origin,
                       "destination": destination,
                       "startdate": startdate,
                       "enddate": enddate,
                       "price": price,
                       "currency": "USD",
                       "deptime_o": [m+str(n) for m, n in zip(deptime[:,0], meridiem[:,0])],
                       "arrtime_d": [m+str(n) for m, n in zip(arrtime[:,0], meridiem[:,1])],
                       "deptime_d": [m+str(n) for m, n in zip(deptime[:,1], meridiem[:,2])],
                       "arrtime_o": [m+str(n) for m, n in zip(arrtime[:,1], meridiem[:,3])]
                       })

    results = pd.concat([results, df], sort=False)

    driver.close()  # close the browser
    time.sleep(15)  # wait 15sec until the next request
    return "success"

# Create an empty dataframe
results = pd.DataFrame(columns=['origin','destination','startdate','enddate','deptime_o','arrtime_d','deptime_d','arrtime_o','currency','price'])

requests = 0

destinations = ['MXP','MAD']
startdates = ['2019-09-06','2019-09-20','2019-09-27']

for destination in destinations:
    for startdate in startdates:
        requests = requests + 1
        while scrape('ZRH', destination, startdate, 3, requests) != "success":
            requests = requests + 1

# Find the minimum price for each destination-startdate combination
results_agg = results.groupby(['destination','startdate'])['price'].min().reset_index().rename(columns={'min':'price'})
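It can also be worth persisting the raw results to disk so that a failed later step does not force a full re-scrape; a minimal sketch (the file name is just an example):

# Save the raw scraping results for later reuse
results.to_csv("kayak_results.csv", index=False)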
Once we have run through all the combinations and scraped the respective data, we can nicely visualize our results using a heatmap from seaborn.
heatmap_results = pd.pivot_table(results_agg, values='price', index=['destination'], columns='startdate')

import seaborn as sns
import matplotlib.pyplot as plt

sns.set(font_scale=1.5)
plt.figure(figsize=(18, 6))
sns.heatmap(heatmap_results, annot=True, annot_kws={"size": 24}, fmt='.0f', cmap="RdYlGn_r")
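If you run this as a plain script rather than in a notebook, the figure may not show up on its own; a minimal sketch of saving and displaying it explicitly (the file name is just an example):

plt.savefig("flight_price_heatmap.png", bbox_inches="tight")  # save the heatmap to a file
plt.show()  # and/or display it interactively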
If you want to scrape flight data using Python, then X-Byte Enterprise Crawling is the best option for you! Contact us for more details!
Originally published at https://www.xbyte.io.