How to Web Scrape on a Schedule Using GitHub Actions
Hi everyone! 👋 It’s been a long time since I last posted on this blog. I was busy graduating from college and moving places and that didn’t leave enough time for fun new projects. I am still in the middle of it all but recently did a project that gave me a tutorial idea to write about. I was asked to create a web scraper that would scrape a certain website, filter the data and then send an email with that data. I will show you what I was tasked with, how I developed it, and how I deployed it for free using GitHub Actions.
Note: You can find the corresponding GitHub repo here
Requirements
I was asked if I could write an application that would monitor the website below, extract tender information, and email only the latest tenders to a given email address. I explored the website and quickly agreed.
Plan of Action
I had come across some fun projects by the likes of Simon Willison and wanted to test out GitHub Actions for myself. What better way to test it out than deploying a working scraper that automatically runs on a schedule and emails the scraped data?
The first step is to download the web page using Python and pull out the table into a more usable data structure. I tried regex first and then settled on Pandas for the extraction.
The next step is to email the parsed data. For that, I am going to rely on a GitHub action created by Dawid Dziurla.
Installing the required libraries
We will be making use of the following two libraries:
- pandas
- requests
Go ahead and add both of these libraries to a requirements.txt file in the root of your project directory/repository and install them if you haven't already. Later on, we will be using this file to install these dependencies in the GitHub Actions execution environment.
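If you are following along, the requirements.txt can be as simple as the two unpinned entries below (pinning exact versions is optional and just a matter of preference):

pandas
requests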
Writing the scraper
This is by far the most time-consuming part of the whole process. However, it is pretty simple once you know what you are doing. The first step is to figure out the URL of the tenders page. The website uses AJAX so going to the tenders page does not change the URL in the address bar. To find the actual URL, we can open up Developer Tools and look at the requests made while navigating to the tenders page.
The first URL is the direct URL of the tenders page, but it does not include a page number. To see how the page number is passed, we can navigate to the 2nd page of the tenders and take a look at the new request (notice the PageNo=2 at the end of the URL). We can now follow this pattern to query however many pages we want.
At this point, we can write some Python code to download the page using requests.
import requests as r
html = r.get('https://ppra.org.pk/dad_tenders.asp?PageNo=1')
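Since this will eventually run unattended on a schedule, it can be worth failing loudly when the request doesn't succeed. This isn't part of the original script, just a small sketch using requests' built-in raise_for_status:

import requests as r

# Fail fast if the site is down or returns an error page
html = r.get('https://ppra.org.pk/dad_tenders.asp?PageNo=1', timeout=30)
html.raise_for_status()  # raises an HTTPError for 4xx/5xx responses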
Now that we know how to query for different pages, we need to figure out how to parse the table. I initially tried using regex, as it has always worked well for me for simple use cases like this. I know about this SO answer, but we are parsing a known, limited subset of HTML, so regex is perfectly fine for that.
However, regex posed a few issues. It wasn't working exactly the way I wanted it to. I had forgotten about the re.DOTALL flag while writing my initial pattern, so my regex wasn't matching past the \n character. Instead of fixing this issue, I decided to give Pandas a try. I had heard that you could easily parse an HTML table using Pandas, but I had never tried it, so this seemed like a good opportunity.
My main motivation for taking on new projects is that I get to explore and learn new stuff and then write about it on this blog so I actively try to opt for new methods even if I know the old method works perfectly fine. You will never learn about new and better ways if you don’t explore, right?
Using Pandas is extremely straightforward. We can use the following snippet to parse the tables in the document:
import pandas as pd
dfs = pd.read_html(html.text)
However, by only supplying the HTML to read_html, Pandas extracts all the tables from the page. In my testing, this resulted in 4 tables.
>>> dfs
[ 0
0 Please Click on image to Download/View. Only 1..., 0 1
0 Tender Closing Date : Month January Feburay March April May Jun..., 0
0 PPRA Ref No: TSE, 0 1 2 3 4
0 Tender No Tender Details Download Advertised Date Closing Date
1 TS455000E Pakistan Aeronautical Complex, Kamra Procureme... NaN 30/7/2021 4/8/2021 10:30:00 AM
2 TS454835E Pakistan State Oil, Karachi Miscellaneous Work... NaN 30/7/2021 5/8/2021 2:15:00 PM
3 TS453722E State Life Insurance Corporation of Pakistan, ... NaN 30/7/2021 24/8/2021
4 TS453262E Sui Southern Gas Company Limited, Karachi SSGC... NaN 30/7/2021 23/8/2021 3:00:00 PM
5 TS455691E National Database and Registration Authority N... NaN 30/7/2021 16/8/2021 11:00:00 AM
6 TS453260E Sui Southern Gas Company Limited, Karachi SSGC... NaN 30/7/2021 23/8/2021 12:30:00 PM
7 TS456503E National Heritage & Intrigation Division, Isla... NaN 30/7/2021 24/8/2021 11:00:00 AM
...
To only extract the table we need, we can tell Pandas to extract the table with the width attribute set to 656. There is only one table in the whole HTML with that specific attribute value so it only results in 1 extracted table.
dfs = pd.read_html(html.text, attrs={'width': '656'})
>>> dfs
[ 0 1 2 3 4
0 Tender No Tender Details Download Advertised Date Closing Date
1 TS456131E Sui Southern Gas Company Limited, Karachi Supp... NaN 30/7/2021 25/8/2021 3:30:00 PM
2 TS456477E National Accountability Bureau, Sukkur Service... NaN 30/7/2021 16/8/2021 11:00:00 AM
3 TS456476E Sukkur Electric Power Company (SEPCO), Sukkur ... NaN 30/7/2021 12/8/2021 11:00:00 AM
4 TS456475E Evacuee Trust Property Board, Multan Services ... NaN 30/7/2021 17/8/2021 10:00:00 AM
5 TS456474E Military Engineering Services (Navy), Karachi ... NaN 30/7/2021 13/8/2021 11:30:00 AM
6 TS456473E National University of Science and Technology,... NaN 30/7/2021 17/8/2021 11:00:00 AM
7 TS456490E Shaikh Zayed Hospital, Lahore Miscellaneous Wo... NaN 30/7/2021 19/8/2021 11:00:00 AM
8 TS456478E Ministry of Religious Affairs & Interfaith Har... NaN 30/7/2021 16/8/2021 11:00:00 AM
9 TS456489E Cantonment Board, Rawalpindi Services Required... NaN 30/7/2021 17/8/2021 12:00:00 PM
10 TS456480E National Bank of Pakistan, Lahore Miscellaneou... NaN 30/7/2021 16/8/2021 11:00:00 AM
11 TS456481E National Bank of Pakistan, Lahore Miscellaneou... NaN 30/7/2021 16/8/2021 11:00:00 AM
...
There are still a few issues with this extraction.
- Pandas isn’t able to automatically extract the header for our table
- The Advertised Date column data isn’t parsed as a date
- The Download column is all NaNs
To fix the first issue, we can pass the header parameter to read_html and Pandas will use the respective row as the header of the table. The second issue can be fixed by passing the parse_dates parameter, which tells Pandas to parse the respective column as dates. There are multiple ways to resolve the third issue. I ended up using regex to extract the download links into a list and then assigning that list to the Download column of our data frame.
The read_html call looks something like this after fixing the first two issues:
dfs = pd.read_html(html.text, attrs={'width': '656'}, header=0, parse_dates=['Advertised Date'])
The regex for extracting the download links and assigning them to the Download column looks like this:
download_links = re.findall('<a target="_blank" href="(.+?)"><img border="0" src="images/(?:.+?)"></a>',html.text)
download_links = ["<a href='https://ppra.org.pk/"+link+"' style='display: block;text-align: center;'> <img src='https://ppra.org.pk/images/download_icon.gif'/></a>" for link in download_links]
tender_table = dfs[0]
tender_table['Download'] = download_links
The extracted links are relative, so we use a list comprehension to prepend the actual base URL to each extracted link. I am converting each link into an anchor tag that wraps an image for two reasons: it looks nice in the final email, and the table ends up looking similar to what the client is used to seeing on the website, so there is less visual fatigue while reading the email.
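As an aside, the same prepending could also be done with urllib.parse.urljoin from the standard library instead of string concatenation. This is just an alternative sketch, not what the script above uses:

from urllib.parse import urljoin

base_url = 'https://ppra.org.pk/'
# Resolve each relative link against the site's base URL
absolute_links = [urljoin(base_url, link) for link in download_links]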
The client asked me to generate emails with the tenders for the latest date only. The tenders for the latest date sometimes span multiple pages so we can put all the code we have so far into a separate function and then pass in page numbers to that function to extract the tables from a specific page.
The code with the function looks like this:
import requests as r
import pandas as pd
import re
url_template = "https://ppra.org.pk/dad_tenders.asp?PageNo="
def download_parse_table(url):
    html = r.get(url)
    dfs = pd.read_html(html.text, attrs={'width': '656'}, header=0, parse_dates=['Advertised Date'])
    download_links = re.findall('<a target="_blank" href="(.+?)"><img border="0" src="images/(?:.+?)"></a>', html.text)
    download_links = ["<a href='https://ppra.org.pk/"+link+"' style='display: block;text-align: center;'> <img src='https://ppra.org.pk/images/download_icon.gif'/></a>" for link in download_links]
    tender_table = dfs[0]
    tender_table['Download'] = download_links
    return tender_table
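As a quick sanity check, you can call the function for a single page before wiring up the loop; something like this should print the parsed table for page 1:

# Parse only the first page of tenders and peek at the result
first_page = download_parse_table(url_template + "1")
print(first_page.head())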
To put all the extracted tables into one data frame, we need to put them all into a list and use the pd.concat method. The code for that looks like this:
combined_df = []
for index in range(1,8):
    df = download_parse_table(url_template+str(index))
    combined_df.append(df)
combined_df = pd.concat(combined_df)
>>> combined_df
Tender No Tender Details Download Advertised Date Closing Date
0 TS455000E Pakistan Aeronautical Complex, Kamra Procureme... <a href='https://ppra.org.pk/download.asp?tend... 2021-07-30 4/8/2021 10:30:00 AM
1 TS454835E Pakistan State Oil, Karachi Miscellaneous Work... <a href='https://ppra.org.pk/doc/30-7/42pso307... 2021-07-30 5/8/2021 2:15:00 PM
2 TS453722E State Life Insurance Corporation of Pakistan, ... <a href='https://ppra.org.pk/doc/30-7/42life30... 2021-07-30 24/8/2021
3 TS453262E Sui Southern Gas Company Limited, Karachi SSGC... <a href='https://ppra.org.pk/download.asp?tend... 2021-07-30 23/8/2021 3:00:00 PM
4 TS455691E National Database and Registration Authority N... <a href='https://ppra.org.pk/download.asp?tend... 2021-07-30 16/8/2021 11:00:00 AM
.. ... ... ... ... ...
25 TS456443E Civil Aviation Authority, Karachi TENDER NOTIC... <a href='https://ppra.org.pk/download.asp?tend... 2021-07-29 13/8/2021 11:00:00 AM
26 TS454178E Zarai Taraqiati Bank Ltd (ZTBL), Islamabad Inf... <a href='https://ppra.org.pk/doc/28-7/ztb287-1... 2021-07-28 10/8/2021 11:00:00 AM
27 TS454566E Sui Northern Gas Pipelines Limited, Lahore Rel... <a href='https://ppra.org.pk/download.asp?tend... 2021-07-28 1/9/2021 3:30:00 PM
28 TS455579E Pakistan Ordnance Factories, Wah Cantt Repair ... <a href='https://ppra.org.pk/doc/28-7/pof287-4... 2021-07-28 4/8/2021 10:20:00 AM
29 TS455365E Pakistan National Shipping Corporation (PNSC),... <a href='https://ppra.org.pk/download.asp?tend... 2021-07-28 24/8/2021 11:00:00 AM
[210 rows x 5 columns]
It looks mostly fine, but there is one issue: the index is preserved after the concat. I want to reset it so that it goes from 0-209 instead of 0-29 multiple times. This is easy to accomplish. We just need to modify the concat call like this:
combined_df = pd.concat(combined_df).reset_index(drop=True)
>>> combined_df
Tender No Tender Details Download Advertised Date Closing Date
0 TS455000E Pakistan Aeronautical Complex, Kamra Procureme... <a href='https://ppra.org.pk/download.asp?tend... 2021-07-30 4/8/2021 10:30:00 AM
1 TS454835E Pakistan State Oil, Karachi Miscellaneous Work... <a href='https://ppra.org.pk/doc/30-7/42pso307... 2021-07-30 5/8/2021 2:15:00 PM
2 TS453722E State Life Insurance Corporation of Pakistan, ... <a href='https://ppra.org.pk/doc/30-7/42life30... 2021-07-30 24/8/2021
3 TS453262E Sui Southern Gas Company Limited, Karachi SSGC... <a href='https://ppra.org.pk/download.asp?tend... 2021-07-30 23/8/2021 3:00:00 PM
4 TS455691E National Database and Registration Authority N... <a href='https://ppra.org.pk/download.asp?tend... 2021-07-30 16/8/2021 11:00:00 AM
.. ... ... ... ... ...
205 TS456443E Civil Aviation Authority, Karachi TENDER NOTIC... <a href='https://ppra.org.pk/download.asp?tend... 2021-07-29 13/8/2021 11:00:00 AM
206 TS454178E Zarai Taraqiati Bank Ltd (ZTBL), Islamabad Inf... <a href='https://ppra.org.pk/doc/28-7/ztb287-1... 2021-07-28 10/8/2021 11:00:00 AM
207 TS454566E Sui Northern Gas Pipelines Limited, Lahore Rel... <a href='https://ppra.org.pk/download.asp?tend... 2021-07-28 1/9/2021 3:30:00 PM
208 TS455579E Pakistan Ordnance Factories, Wah Cantt Repair ... <a href='https://ppra.org.pk/doc/28-7/pof287-4... 2021-07-28 4/8/2021 10:20:00 AM
209 TS455365E Pakistan National Shipping Corporation (PNSC),... <a href='https://ppra.org.pk/download.asp?tend... 2021-07-28 24/8/2021 11:00:00 AM
[210 rows x 5 columns]
This looks much better!
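Alternatively, pd.concat can do the renumbering itself via its ignore_index parameter, which skips the extra reset_index call; either approach gives the same 0-209 index:

# Equivalent to concat(...).reset_index(drop=True)
combined_df = pd.concat(combined_df, ignore_index=True)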
Next, we need to filter this data for the latest date. We can do that in just two lines:
latest_date = combined_df.iloc[0]['Advertised Date']
filtered_df = combined_df[combined_df['Advertised Date'] == latest_date]
We first extract the latest date, which is the Advertised Date of the first row in the table, and then filter the rest of the table using that value.
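This relies on the first row always carrying the newest date, which holds here because the website lists tenders newest first. If you don't want to depend on that ordering, a slightly more defensive variant is to take the maximum of the column instead:

# Does not assume the first row is the newest one
latest_date = combined_df['Advertised Date'].max()
filtered_df = combined_df[combined_df['Advertised Date'] == latest_date]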
Now we can convert this into an HTML table. Pandas does not create a full HTML document so we need to create one ourselves and then embed the Pandas output in it:
html_string = """
<html>
    <head><title>Latest PPRA Tenders</title></head>
    <body>
        <style>
            table {
                border-collapse: collapse;
                border: 1px solid silver;
            }
            table tr:nth-child(even) {
                background: #E0E0E0;
            }
        </style>
        %s
    </body>
</html>
"""

table_html = filtered_df.to_html(index=False, render_links=True, justify="center",
                                 escape=False, border=4)

with open('ppra.html', 'w') as f:
    f.write(html_string % (table_html))
I used the %s string interpolation method because the <style> tag contains {} braces, which confuse Python if I use f-strings.
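For reference, an f-string (or str.format) would still work if the literal CSS braces were doubled, but the %s approach keeps the template readable. A small illustration:

# With str.format or f-strings, literal braces must be escaped by doubling them
template = "<style>table {{ border: 1px solid silver; }}</style> {table}"
print(template.format(table="<table>...</table>"))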
After running what we have so far, the output (ppra.html) looks like this:
There is one issue: the Tender Details column looks very cluttered. Instead of putting all the details in a single font weight and on the same line, we want to break them up the way the original website does. The easiest way to do that is to extract the details using regex and then replace the details in the data frame with the extracted ones.
The regex for details extraction looks something like this:
details = re.findall('<td bgcolor="(?:.+?)" width="305">(.+?)</td>', html.text, re.DOTALL)
The detail cells have a specific width, so we extract the data based on that. We use re.DOTALL because the details span multiple lines and we want . to match the carriage return (\r) and newline (\n) characters as well. The extracted details still contain \r\n, which we can get rid of using this list comprehension:
details = [detail.replace('\r\n','') for detail in details]
Let’s assign this details list to our data frame:
tender_table["Tender Details"] = details
Final scraper code
The final scraper code looks like this:
import requests as r
import pandas as pd
import re

url_template = "https://ppra.org.pk/dad_tenders.asp?PageNo="

html_string = """
<html>
    <head><title>Latest PPRA Tenders</title></head>
    <body>
        <style>
            table {
                border-collapse: collapse;
                border: 1px solid silver;
            }
            table tr:nth-child(even) {
                background: #E0E0E0;
            }
        </style>
        %s
    </body>
</html>
"""

def download_parse_table(url):
    # Download the page and extract the multi-line tender details
    html = r.get(url)
    details = re.findall('<td bgcolor="(?:.+?)" width="305">(.+?)</td>', html.text, re.DOTALL)
    details = [detail.replace('\r\n','') for detail in details]

    # Parse the tender table (the only table with width="656") into a data frame
    dfs = pd.read_html(html.text, attrs={'width': '656'}, header=0,
                       parse_dates=['Advertised Date'])

    # Extract the relative download links and turn them into absolute anchor tags
    download_links = re.findall(
        '<a target="_blank" href="(.+?)"><img border="0" src="images/(?:.+?)"></a>',
        html.text)
    download_links = ["<a href='https://ppra.org.pk/"+link+"' style='display: block;text-align: center;'> <img src='https://ppra.org.pk/images/download_icon.gif'/></a>" for link in download_links]

    tender_table = dfs[0]
    tender_table['Download'] = download_links
    tender_table["Tender Details"] = details
    return tender_table

# Scrape the first 7 pages and combine them into a single data frame
combined_df = []
for index in range(1,8):
    df = download_parse_table(url_template+str(index))
    combined_df.append(df)

combined_df = pd.concat(combined_df).reset_index(drop=True)

# Keep only the tenders advertised on the latest date
latest_date = combined_df.iloc[0]['Advertised Date']
filtered_df = combined_df[combined_df['Advertised Date'] == latest_date]

# Render the filtered table as HTML and embed it in the page template
table_html = filtered_df.to_html(index=False, render_links=True, justify="center",
                                 escape=False, border=4)

with open('ppra.html', 'w') as f:
    f.write(html_string % (table_html))
Getting started with GitHub Actions
I am going to keep this intro to GitHub Actions very short. GitHub Actions is built around the concept of workflows: YAML files that live inside the .github/workflows folder in the root of the repo and list the steps we want Actions to execute. I went ahead and created a .github/workflows folder in my project root and then created a scrape.yml file inside the workflows folder. GitHub Actions will make more sense if I show you the complete YAML file first and then explain it.
The contents of the scrape.yml file are as follows:
name: Scrape

on:
  schedule:
    - cron: "0 4 * * *"
  workflow_dispatch:

env:
  ACTIONS_ALLOW_UNSECURE_COMMANDS: true

jobs:
  scrape-latest:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repo
        uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2.0.0
        with:
          python-version: '3.7'
      - name: Install requirements
        run: pip install -r requirements.txt
      - name: Run Scraper
        run: python scraper.py
      - name: Set env vars
        run: |
          echo "DATE=$(python -c 'import datetime as dt; print(dt.datetime.now().date())')" >> $GITHUB_ENV
      - name: Send mail
        uses: dawidd6/action-send-mail@v3
        with:
          server_address: smtp.gmail.com
          server_port: 465
          username: ${{secrets.MAIL_USERNAME}}
          password: ${{secrets.MAIL_PASSWORD}}
          subject: Latest PPRA tenders for ${{env.DATE}}
          to: hi@yasoob.me
          from: Automated Email
          ignore_cert: true
          secure: true
          html_body: file://ppra.html
We start by naming the Action; in our case, we named it Scrape. Next, we tell GitHub when to execute it: the first trigger is a cron schedule and the second is the online workflow dispatcher. The cron value is similar to the crontab you might have used on Linux, and you can use Crontab Guru to explore cron expressions. The one I am using causes the workflow to run every day at 4 o'clock UTC. The workflow_dispatch trigger is there just for testing: this way we don't have to wait until 4 o'clock to test the workflow and can trigger it manually from the online interface.
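For anyone new to cron syntax, the five fields of the schedule above read like this (a quick annotation, nothing GitHub-specific):

# ┌──────── minute (0)
# │ ┌────── hour (4, in UTC)
# │ │ ┌──── day of month (every day)
# │ │ │ ┌── month (every month)
# │ │ │ │ ┌ day of week (every day of the week)
# 0 4 * * *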
Next, we define an environment variable that our execution environment will have access to. ACTIONS_ALLOW_UNSECURE_COMMANDS is required for the Python setup on GitHub due to a bug; I am not sure whether it has been fixed yet. Afterward, we set up Python, install the requirements, and run the scraper. Then we set the DATE variable to the current date on the server, which will be used in the subject of our email.
For the email sending part, I am using the awesome action-send-mail action, which makes the whole process super simple. I provide it with my Gmail username and password and point it to the generated HTML file, and it automatically sends the email. We also need to configure the secrets on the repository settings page so that the action has access to MAIL_USERNAME and MAIL_PASSWORD.
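If you prefer the terminal over the web UI and have the GitHub CLI installed and authenticated, the secrets can also be added with gh secret set. This is just a sketch; the repository settings page works just as well:

gh secret set MAIL_USERNAME
gh secret set MAIL_PASSWORD
# each command prompts for the secret value and stores it in the current repo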
We can test the execution by pushing everything to GitHub and then going to the Actions tab of the repo. From there we can select our workflow and manually trigger it.
GitHub Actions Feedback
The Action has been running successfully for a few days now. I have fallen in love with the simplicity of Actions. The only major issue is that you should not use the free version for anything time-sensitive. The action execution is almost always delayed by a few minutes when using the scheduler. For most simple cases it is fine though. I am also not sure if the situation is any different under the paid plan. However, if you dispatch the workflow using the online interface, it mostly gets scheduled for execution right away.
I will use Actions again in the future for similar projects. The only thing to keep in mind is that Actions come with a limited number of free minutes. If you go beyond this limit, you will have to pay, and the price varies based on the operating system. The free account is limited to 2,000 minutes per month for private repositories.
Conclusion
I hope you all enjoyed this tutorial. You can find the corresponding GitHub repo here. I had a ton of fun working on it. If you have any questions, comments, or suggestions, please feel free to comment below or send me an email (hi @ yasoob.me) or a tweet. I do plan on writing about a bunch of other stuff, but time is at a premium these days. However, I am hoping the situation will change soon 😄 I will see you in the next article ❤️