Python Scrapy – Scraping Dynamic Website with API-Generated Content


Scrapy is an excellent tool for extracting data from static and dynamic websites.



In this article, we are going to discuss the solutions to the following problems:



Extract all details of the Offices from the website https://directory.ntschools.net/#/offices. For each office, we need to scrape the physical address, postal address, phone, parent division, and email. Instead of using the whole Scrapy framework workflow, use a typical Python script to extract the data, i.e., scrape the data by running the Python script file instead of crawling with Scrapy from the terminal. Finally, create a DataFrame for the office details scraped from the website and print the output.



Extract all details of Offices



First, let’s quickly look at the website for data extraction.











Now disable JavaScript in the web browser and reload the page.



We can see that nothing loads in the browser. That means the entire website is rendered by JavaScript, so regular scraping will not get any data from this website.



We could get the data using Selenium, Splash, or other tools that render the JavaScript-loaded content.
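For reference, a browser-automation approach could look like the minimal sketch below; it assumes the selenium package and a matching Chrome driver are installed, and we will not use it in this article because there is a lighter way:

# A minimal Selenium sketch (not used in this article): let a real browser
# execute the JavaScript, then read the rendered HTML from the DOM.
import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://directory.ntschools.net/#/offices')
time.sleep(5)                 # crude wait for the JavaScript to finish rendering
html = driver.page_source     # the fully rendered page
driver.quit()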



But before reaching for any of those solutions, we can open the browser's developer tools, switch to the Network tab, and watch the network traffic while the site loads. You can see a lot of activity going on in the Network section.



Now we filter the requests using the XHR tab, and we can see two items in the Network section.







Select the `GetOfficeList` item, and we can see more details of that item’s content.



In the new panel, we can see tabs such as Headers, Preview, Response, etc.



Select the Response tab, and then we can see the required details included in this link as a JSON object.







We can get all the required data by fetching this JSON object from that particular request URL.



To get that URL, right-click on `GetOfficeList`, select Copy link address, and store it in a variable named `base_url`.



To make the request, we also need the request header details. Select the Copy request headers option and store the data by converting it into a Python dictionary, as sketched after the screenshot below.






Request Headers Contents
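Before building the spider, the copied headers and URL can be tested directly, for example with the requests library. The sketch below assumes requests is installed and reproduces only a few of the copied headers:

# Quick sanity check of the API endpoint outside Scrapy.
# Assumes the requests library is installed; only a few of the copied
# headers are reproduced here.
import requests

base_url = 'https://directory.ntschools.net/api/Office/GetOfficeList'
headers = {
    'Accept': 'application/json',
    'Referer': 'https://directory.ntschools.net/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
    'X-Requested-With': 'Fetch',
}

response = requests.get(base_url, headers=headers)
print(response.status_code)   # 200 means the endpoint answered
print(response.json()[0])     # first entry of the JSON office list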







A Typical Python Script for Extracting the Data



Now we have everything we need to fetch the data from the website. Let’s generate the spider for scraping it.



(venv) $ scrapy genspider office_details domain



This will generate a basic spider template file.



(venv) $ cat office_details.py
import scrapy


class OfficeDetailsSpider(scrapy.Spider):
    name = 'office_details'
    allowed_domains = ['domain']
    start_urls = ['http://domain/']

    def parse(self, response):
        pass




In this file, we remove the allowed_domains variable since we are not using it, and we include the website URL ("https://directory.ntschools.net/#/offices") in the start_urls list.



(venv) $ cat office_details.py
import scrapy


class OfficeDetailsSpider(scrapy.Spider):
    name = 'office_details'
    start_urls = ['https://directory.ntschools.net/#/offices']

    def parse(self, response):
        pass




Now we are going to send the request that fetches the JSON object. For that, we write the code inside the parse function.



def parse(self, response):
    base_url = 'https://directory.ntschools.net/api/Office/GetOfficeList'
    headers = {
        'Accept': 'application/json',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8',
        'Cache-Control': 'no-cache',
        'Connection': 'keep-alive',
        'Host': 'directory.ntschools.net',
        'Pragma': 'no-cache',
        'Referer': 'https://directory.ntschools.net/',
        'Sec-Fetch-Dest': 'empty',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Site': 'same-origin',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
        'X-Requested-With': 'Fetch'
    }
    yield scrapy.Request(url=base_url, callback=self.parse_api, headers=headers)




By yielding this request, the response containing the JSON object is passed to the callback function named parse_api.



From the response received in parse_api, we can get the JSON object. We need to import the json module to load it into Python data structures.



import json

def parse_api(self, response):
    datas = json.loads(response.body)
    for data in datas:
        print(data['name'])
        # 'offices' is a list; only loop when it is non-empty
        if data['offices']:
            for sub in data['offices']:
                print(f' {sub["name"]}')




This will give an output as follows:






This output is the same as shown on the website.






import json

def parse_api(self, response):
    datas = json.loads(response.body)
    print(datas[0])




The output of the above code will be as follows:






If we look at the output, the postal and physical addresses are missing from this data. The office details page contains that information.



When we click on `Agency Services`, a new page is loaded: https://directory.ntschools.net/#/offices/details/asdiv. But if we want to parse the office details, we need to identify the code `asdiv`.



This code is included in the previous JSON object under the key `itSchoolCode`. So we are going to modify our script to get those details.



def parse_api(self, response):
    datas = json.loads(response.body)
    for data in datas:
        print(f"{data['name']} -> {data['itSchoolCode']}")

        if data['offices']:
            for sub in data['offices']:
                print(f" {sub['name']} -> {sub['itSchoolCode']}")




The output of this script is as follows:






Extract the Address From the Details Page



Now we need to find the request URL that returns the JSON object for an office’s details page, just as we previously did in the Network tab.



When we loaded the office details page, additional items were generated in the network tab, as seen in the following picture:






GetOfficeDetails?param='asdiv'






Response tab of GetOfficeDetails?param='asdiv'






Request URL for JSON object shown in the previous picture



Now we know the code for each office. Using this information, we can make another request to get the details of the new page. Another function, named `parse_office`, is used for this.



Before writing parse_office, we make some modifications in parse_api: the parent division is passed along as metadata in a dictionary.



def parse_api(self, response):
    url = 'https://directory.ntschools.net/api/Office/GetOfficeDetails?param='
    datas = json.loads(response.body)
    for data in datas:
        code = data['itSchoolCode']
        parentDivision = {'parentDivision': data['parentDivision']}
        yield scrapy.Request(
            url + code,
            callback=self.parse_office,
            meta=parentDivision,
            headers=self.headers
        )
        if data['offices']:
            for sub in data['offices']:
                sub_code = sub['itSchoolCode']
                parentDivision = {'parentDivision': sub['parentDivision']}
                yield scrapy.Request(
                    url + sub_code,
                    callback=self.parse_office,
                    meta=parentDivision,
                    headers=self.headers
                )




Now we can collect all the details by writing a function named parse_office.



def parse_office(self, response):
    parentDivision = response.meta['parentDivision']
    office_detail = json.loads(response.body)
    print({
        'OfficeName': office_detail['name'],
        'PhysicalAddress': office_detail['physicalAddress']['displayAddress'],
        'PostalAddress': office_detail['postalAddress']['displayAddress'],
        'TelephoneNumber': office_detail['telephoneNumber'],
        'FaxNumber': office_detail['facsimileTelephoneNumber'],
        'Email': office_detail['mail'],
        'Website': office_detail['uri'],
        'ParentDivision': parentDivision,
    })




Let’s see the whole script developed so far:



import scrapy
import json


class OfficeDetailsSpider(scrapy.Spider):
    name = 'office_details'
    start_urls = ['https://directory.ntschools.net/#/offices']
    headers = {
        'Accept': 'application/json',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8',
        'Cache-Control': 'no-cache',
        'Connection': 'keep-alive',
        'Host': 'directory.ntschools.net',
        'Pragma': 'no-cache',
        'Referer': 'https://directory.ntschools.net/',
        'Sec-Fetch-Dest': 'empty',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Site': 'same-origin',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
        'X-Requested-With': 'Fetch'
    }

    def parse(self, response):
        base_url = 'https://directory.ntschools.net/api/Office/GetOfficeList'
        yield scrapy.Request(url=base_url, callback=self.parse_api, headers=self.headers)

    def parse_api(self, response):
        url = 'https://directory.ntschools.net/api/Office/GetOfficeDetails?param='
        datas = json.loads(response.body)
        for data in datas:
            code = data['itSchoolCode']
            parentDivision = {'parentDivision': data['parentDivision']}
            yield scrapy.Request(
                url + code,
                callback=self.parse_office,
                meta=parentDivision,
                headers=self.headers
            )
            # 'offices' is a list of sub-offices; only loop when it is non-empty
            if data['offices']:
                for sub in data['offices']:
                    sub_code = sub['itSchoolCode']
                    parentDivision = {'parentDivision': sub['parentDivision']}
                    yield scrapy.Request(
                        url + sub_code,
                        callback=self.parse_office,
                        meta=parentDivision,
                        headers=self.headers
                    )

    def parse_office(self, response):
        parentDivision = response.meta['parentDivision']
        office_detail = json.loads(response.body)
        print({
            'OfficeName': office_detail['name'],
            'PhysicalAddress': office_detail['physicalAddress']['displayAddress'],
            'PostalAddress': office_detail['postalAddress']['displayAddress'],
            'TelephoneNumber': office_detail['telephoneNumber'],
            'FaxNumber': office_detail['facsimileTelephoneNumber'],
            'Email': office_detail['mail'],
            'Website': office_detail['uri'],
            'ParentDivision': parentDivision,
        })




This script will generate an output as shown in the following picture:






It’s Great!



Now we have scraped all the details from the API behind the website.



Extract Data by Running the Python Script File Instead of Crawling Scrapy from the Terminal



Now we have to run the spider from the script itself, store the scraped data in a DataFrame, and then print the DataFrame.



For that, we need to import the CrawlerProcess class from scrapy.crawler:



from scrapy.crawler import CrawlerProcess

process = CrawlerProcess()
process.crawl(OfficeDetailsSpider)
process.start()




Here we create a CrawlerProcess object called process, then pass our spider class to the crawl() method of the CrawlerProcess class. To initiate the crawling, we call another method, start().
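As a side note, CrawlerProcess also accepts a settings dictionary, which is handy when running from a script, for example to reduce Scrapy's log output (a small sketch; LOG_LEVEL is a standard Scrapy setting):

from scrapy.crawler import CrawlerProcess

# Optional: pass settings to quiet Scrapy's default INFO-level logging.
process = CrawlerProcess(settings={'LOG_LEVEL': 'WARNING'})
process.crawl(OfficeDetailsSpider)
process.start()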



Create Dataframe for the Office Details Scraped from the Website



The next challenge is to build a DataFrame with the details we scrape from the website. For that, we create a global list variable, and in the parse_office function we append each office’s dictionary to that global list.



import pandas as pd

office_details = list()
...
...
...
def parse_office(self, response):
    parentDivision = response.meta['parentDivision']
    office_data = json.loads(response.body)
    office_details.append(
        {
            'OfficeName': office_data['name'],
            'PhysicalAddress': office_data['physicalAddress']['displayAddress'],
            'PostalAddress': office_data['postalAddress']['displayAddress'],
            'TelephoneNumber': office_data['telephoneNumber'],
            'FaxNumber': office_data['facsimileTelephoneNumber'],
            'Email': office_data['mail'],
            'Website': office_data['uri'],
            'ParentDivision': parentDivision,
        }
    )




We stored all the office details in a list, and we can convert that list into a DataFrame using the following code.



df_office = pd.DataFrame(office_details)
print(df_office)
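Optionally, the same DataFrame could also be written to a file using pandas' standard to_csv method (the filename below is just an example):

# Optional: save the scraped office details to a CSV file (example filename).
df_office.to_csv('office_details.csv', index=False)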







So all the challenges have been completed, and now we can look at the whole script with a proper structure:



import scrapy
from scrapy.crawler import CrawlerProcess
import json
import pandas as pd

office_details = list()


class OfficeDetailsSpider(scrapy.Spider):
    name = 'office_details'
    start_urls = ['https://directory.ntschools.net/#/offices']
    headers = {
        'Accept': 'application/json',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8',
        'Cache-Control': 'no-cache',
        'Connection': 'keep-alive',
        'Host': 'directory.ntschools.net',
        'Pragma': 'no-cache',
        'Referer': 'https://directory.ntschools.net/',
        'Sec-Fetch-Dest': 'empty',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Site': 'same-origin',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
        'X-Requested-With': 'Fetch'
    }

    def parse(self, response):
        base_url = 'https://directory.ntschools.net/api/Office/GetOfficeList'
        yield scrapy.Request(url=base_url, callback=self.parse_api, headers=self.headers)

    def parse_api(self, response):
        url = 'https://directory.ntschools.net/api/Office/GetOfficeDetails?param='
        datas = json.loads(response.body)
        for data in datas:
            code = data['itSchoolCode']
            parentDivision = {'parentDivision': data['parentDivision']}
            yield scrapy.Request(
                url + code,
                callback=self.parse_office,
                meta=parentDivision,
                headers=self.headers
            )
            # 'offices' is a list of sub-offices; only loop when it is non-empty
            if data['offices']:
                for sub in data['offices']:
                    sub_code = sub['itSchoolCode']
                    parentDivision = {'parentDivision': sub['parentDivision']}
                    yield scrapy.Request(
                        url + sub_code,
                        callback=self.parse_office,
                        meta=parentDivision,
                        headers=self.headers
                    )

    def parse_office(self, response):
        parentDivision = response.meta['parentDivision']
        office_data = json.loads(response.body)
        office_details.append(
            {
                'OfficeName': office_data['name'],
                'PhysicalAddress': office_data['physicalAddress']['displayAddress'],
                'PostalAddress': office_data['postalAddress']['displayAddress'],
                'TelephoneNumber': office_data['telephoneNumber'],
                'FaxNumber': office_data['facsimileTelephoneNumber'],
                'Email': office_data['mail'],
                'Website': office_data['uri'],
                'ParentDivision': parentDivision,
            }
        )


def crawl():
    process = CrawlerProcess()
    process.crawl(OfficeDetailsSpider)
    process.start()


def main():
    crawl()
    df_office = pd.DataFrame(office_details)
    print(df_office)


if __name__ == '__main__':
    main()