How to Scrape LinkedIn for Public Company Data


At X-Byte Enterprise Crawling, we're glad you have found our page about scraping LinkedIn for public company data, and you won't be disappointed!

In this tutorial, we will demonstrate how to scrape LinkedIn public company pages. If you have landed here without a clear idea of why you would want to scrape LinkedIn company data, here are a few common reasons:

  • Automation of LinkedIn search: You want to work for a company that meets some particular criteria, and the candidates are not the usual suspects. You may have a shortlist, but that list isn't short; it's more like a long list. You need a tool, similar to Google Finance, that can filter companies by the criteria they publish on LinkedIn. You can scrape the data for that "long list" into a well-structured format and build a useful analysis tool on top of it.
  • Interest: You are curious about the companies on LinkedIn and want to collect a good dataset to satisfy that curiosity.
  • Tinkering: You like to tinker, would like to learn Python, and need something practical to start with.
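To make the first use case concrete, here is a minimal sketch of filtering scraped company records by criteria. The records, field names, and thresholds are made up for illustration; they mirror the shape of the data this tutorial's scraper produces.

```python
# Hypothetical records in the shape this tutorial's scraper produces
companies = [
    {"company_name": "Acme Corp", "industry": "Computer Software", "follower_count": 5200},
    {"company_name": "Borealis Labs", "industry": "Computer Networking", "follower_count": 800},
    {"company_name": "Cirrus Inc", "industry": "Computer Software", "follower_count": 150},
]

# Keep software companies with a meaningful following
shortlist = [
    c for c in companies
    if c["industry"] == "Computer Software" and c["follower_count"] > 1000
]

for c in shortlist:
    print(c["company_name"])  # -> Acme Corp
```

Once the scraped data sits in a list of dictionaries like this, any filtering or ranking criteria is a one-line comprehension away.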

Whatever the reason, you have come to the right place!

This tutorial covers the basic steps for scraping data from LinkedIn using Python.

We will use plain Python together with two packages, Requests and LXML. We won't use a heavier framework like Scrapy for something this simple.

You will need to install the following:

Python 2.7, available here
(https://www.python.org/downloads/)

Python Requests, available here (http://docs.python-requests.org/en/master/user/install/). You may need pip to install it, available here
(https://pip.pypa.io/en/stable/installing/)

Python LXML (see how to install it here: http://lxml.de/installation.html)

The code for scraping LinkedIn is embedded below; if you are not able to see it in your browser, it can be downloaded from the Gist.

from lxml import html
from time import sleep
import json
import requests

def linkedin_companies_parser(url):
    # Try up to five times; LinkedIn sometimes serves a captcha or a login page
    for i in range(5):
        try:
            headers = {
                'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'
            }
            print("Fetching: " + url)
            response = requests.get(url, headers=headers, verify=False)
            # The embedded JSON is wrapped in an HTML comment; strip the comment markers
            formatted_response = response.content.replace('<!--', '').replace('-->', '')
            doc = html.fromstring(formatted_response)
            datafrom_xpath = doc.xpath('//code[@id="stream-promo-top-bar-embed-id-content"]//text()')
            content_about = doc.xpath('//code[@id="stream-about-section-embed-id-content"]')
            if not content_about:
                # Fallback location for the embedded company data
                content_about = doc.xpath('//code[@id="stream-footer-embed-id-content"]')

            if datafrom_xpath:
                try:
                    json_formatted_data = json.loads(datafrom_xpath[0])
                    # dict.get() returns None when a key is missing
                    company_name = json_formatted_data.get('companyName')
                    size = json_formatted_data.get('size')
                    industry = json_formatted_data.get('industry')
                    description = json_formatted_data.get('description')
                    follower_count = json_formatted_data.get('followerCount')
                    year_founded = json_formatted_data.get('yearFounded')
                    website = json_formatted_data.get('website')
                    company_type = json_formatted_data.get('companyType')
                    specialities = json_formatted_data.get('specialties')

                    headquarters = json_formatted_data.get('headquarters', {})
                    city = headquarters.get('city')
                    country = headquarters.get('country')
                    state = headquarters.get('state')
                    street1 = headquarters.get('street1')
                    street2 = headquarters.get('street2')
                    zip_code = headquarters.get('zip')
                    # Join whichever street lines are present
                    street = ', '.join(part for part in (street1, street2) if part)

                    data = {
                        'company_name': company_name,
                        'size': size,
                        'industry': industry,
                        'description': description,
                        'follower_count': follower_count,
                        'founded': year_founded,
                        'website': website,
                        'type': company_type,
                        'specialities': specialities,
                        'city': city,
                        'country': country,
                        'state': state,
                        'street': street,
                        'zip': zip_code,
                        'url': url
                    }
                    return data
                except ValueError:
                    print("cant parse page " + url)

            # Retry in case of captcha or login page redirection
            if len(response.content) < 2000 or "trk=login_reg_redirect" in url:
                if response.status_code == 404:
                    print("linkedin page not found")
                else:
                    raise ValueError('redirecting to login page or captcha found')
        except Exception:
            print("retrying: " + url)
            sleep(1)

def readurls():
    companyurls = ['https://www.linkedin.com/company/tata-consultancy-services']
    extracted_data = []
    for url in companyurls:
        extracted_data.append(linkedin_companies_parser(url))
    with open('data.json', 'w') as f:
        json.dump(extracted_data, f, indent=4)

if __name__ == "__main__":
    readurls()
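The heart of the script above is pulling the JSON blob that LinkedIn embeds inside a `<code>` element and reading its fields defensively. Here is a minimal standalone sketch of just that step, using a made-up page snippet and only the standard library (the full script uses LXML's XPath rather than a regular expression):

```python
import json
import re

# A stand-in for the page source: LinkedIn embeds company data as JSON
# inside a <code> element, wrapped in an HTML comment
page = ('<code id="stream-promo-top-bar-embed-id-content">'
        '<!--{"companyName": "Acme Corp", "industry": "Computer Software"}-->'
        '</code>')

# Grab the comment body inside the <code> element
match = re.search(
    r'<code id="stream-promo-top-bar-embed-id-content"><!--(.*?)--></code>',
    page)
data = json.loads(match.group(1))

# dict.get() returns None for any missing key, so absent fields
# never raise a KeyError
company_name = data.get("companyName")
year_founded = data.get("yearFounded")  # not present in this snippet -> None
print(company_name, year_founded)
```

Because LinkedIn does not guarantee every field for every company, reading the parsed dictionary with `get()` (or the equivalent `key in data` checks) keeps the scraper from crashing on sparse profiles.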

To scrape a different company, just change the URL in this line:

companyurls = ['https://www.linkedin.com/company/xbyte-crawling']

or add more URLs, separated by commas, to that list. Save the file and run it with Python: python filename.py

The result will be in a file named data.json in the same directory and will look something like this:

{
    "website": "https://www.xbyte.io",
    "description": "X-Byte Enterprise Crawling is among the best web scraping companies in the world for a reason.\r\nWe won't leave you with a \"self-service\" screen for building your own scrapers.\r\nWe have real humans who will chat with you within hours of your request and help you with your requirement.\r\nWhile we are a leading provider in this field, our investment in automation has helped us provide a truly \"full service\" at affordable prices.\r\nContact us at www.xbyte.io and experience our amazing customer service.",
    "founded": 2012,
    "street": "Houston",
    "specialities": [
        "Web Scraping Service",
        "Website Scraping",
        "Screen scraping",
        "Data scraping",
        "Web crawling",
        "Data as a Service",
        "Data extraction API",
        "Scrapy",
        "Python",
        "DaaS"
    ],
    "size": "100-150 employees",
    "city": "Houston",
    "zip": "TX-770143",
    "url": "https://www.linkedin.com/company/xbyte-crawling",
    "country": "USA",
    "industry": "Computer Software",
    "state": "Texas",
    "company_name": "X-Byte Enterprise Crawling",
    "follower_count": 2262,
    "type": "Privately Held"
}

Or, if you run it for Cisco:

companyurls = ['https://www.linkedin.com/company/cisco']

The result will look like this:

{
    "website": "http://www.cisco.com",
    "description": "Cisco (NASDAQ: CSCO) enables people to make powerful connections, whether in business, education, philanthropy, or creativity. Cisco hardware, software, and service offerings are used to create Internet solutions that make networks possible, providing easy access to data anywhere, at any time.\r\n\r\nCisco was founded in 1984 by a small group of computer professionals from Stanford University. Since the company's inception, Cisco engineers have been leaders in the development of Internet Protocol (IP)-based networking technologies. Today, with over 71,000 employees worldwide, this tradition of innovation continues with industry-leading products and solutions in the company's core development areas of routing and switching, as well as advanced technologies such as IP telephony, home networking, security, optical networking, wireless technology, and storage area networking. In addition to its products, Cisco provides a broad range of service offerings, including technical support and advanced services.\r\n\r\nCisco sells its products and services both directly through its own sales force and through channel partners to large enterprises, commercial businesses, service providers, and consumers.",
    "founded": 1984,
    "street": "Tasman Way, ",
    "specialities": [
        "Networking",
        "Wireless",
        "Security",
        "Unified Communication",
        "Telepresence",
        "Collaboration",
        "Data Center",
        "Virtualization",
        "Unified Computing Systems"
    ],
    "size": "10,001+ employees",
    "city": "San Jose",
    "zip": "95134",
    "url": "https://www.linkedin.com/company/cisco",
    "country": "United States",
    "industry": "Computer Networking",
    "state": "CA",
    "company_name": "Cisco",
    "follower_count": 1201541,
    "type": "Public Company"
}

Warning: Since LinkedIn requires you to log in whenever you open the website, this code may not work for you.
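The script already retries each URL up to five times. One refinement worth considering (not part of the original code) is to wait longer between successive attempts, so repeated requests are spaced out rather than fired back-to-back. A small sketch, with a stubbed fetch function standing in for the real network call:

```python
import time

def fetch_with_backoff(url, fetch, max_attempts=5, base_delay=1.0):
    # Call fetch(url) up to max_attempts times, sleeping longer each round
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except ValueError:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * attempt)

# Stub fetch that fails twice before succeeding, to exercise the retry path
calls = {"n": 0}
def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ValueError("redirecting to login page or captcha found")
    return {"url": url}

result = fetch_with_backoff("https://www.linkedin.com/company/example",
                            flaky_fetch, base_delay=0.01)
print(result["url"], "after", calls["n"], "attempts")
```

In the real scraper, `fetch` would be the `linkedin_companies_parser` body; spacing retries out this way makes a temporary captcha page less likely to burn through all five attempts at once.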

You can easily change the fields or URLs you wish to scrape. Contact us to scrape LinkedIn for public company data!

Visit our site: www.xbyte.io

We offer web scraping and data extraction services, including Amazon, real estate, eBay, and travel data scraping, along with all types of custom services per client requirements.