<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://srikarkashyap.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://srikarkashyap.github.io/" rel="alternate" type="text/html" /><updated>2026-04-19T19:30:56-04:00</updated><id>https://srikarkashyap.github.io/feed.xml</id><title type="html">Srikar Kashyap Pulipaka</title><subtitle>Software Engineering, Machine Learning Systems and more</subtitle><author><name>Srikar Kashyap Pulipaka</name><email>srikar.kashyap@gmail.com</email></author><entry><title type="html">How we won the PAN 2024 Text Classification challenge</title><link href="https://srikarkashyap.github.io/how-we-won-PAN2024/" rel="alternate" type="text/html" title="How we won the PAN 2024 Text Classification challenge" /><published>2024-07-31T00:00:00-04:00</published><updated>2024-07-31T00:00:00-04:00</updated><id>https://srikarkashyap.github.io/how-we-won-PAN2024</id><content type="html" xml:base="https://srikarkashyap.github.io/how-we-won-PAN2024/"><![CDATA[<h2 id="background">Background</h2>

<h2 id="the-competition">The Competition</h2>

<h2 id="research-process">Research Process</h2>

<h2 id="results">Results</h2>

<h2 id="outro">Outro</h2>]]></content><author><name>Srikar Kashyap Pulipaka</name><email>srikar.kashyap@gmail.com</email></author><summary type="html"><![CDATA[Background]]></summary></entry><entry><title type="html">Getting started with Selenium Webdriver and Requests in Python</title><link href="https://srikarkashyap.github.io/posts/2024/06/web-scraping-selenium/" rel="alternate" type="text/html" title="Getting started with Selenium Webdriver and Requests in Python" /><published>2024-06-13T00:00:00-04:00</published><updated>2024-06-13T00:00:00-04:00</updated><id>https://srikarkashyap.github.io/posts/2024/06/selenium_tutorial</id><content type="html" xml:base="https://srikarkashyap.github.io/posts/2024/06/web-scraping-selenium/"><![CDATA[<p>In this short tutorial, let’s look at the US Patent and Trademark Office (USPTO) website and scrape the patent database using a keyword search. We will use Selenium WebDriver to scrape the data. We will then use the Requests library to download the individual patent PDF documents.</p>

<p><small>Compiled by Srikar Kashyap Pulipaka</small></p>

<p><small>Last Updated: 13 June 2024</small></p>

<h1 id="part-1-scraping-the-uspto-website-for-patent-data">Part 1: Scraping the USPTO website for Patent Data</h1>

<h3 id="what-is-selenium-webdriver">What is Selenium WebDriver?</h3>

<p>Selenium WebDriver is a browser automation framework that lets you interact with a web page through a real, fully featured browser. It is primarily used for automated testing of web applications, but it works equally well for web scraping. It is one of several components of the Selenium suite.</p>

<h3 id="installation-and-setup">Installation and Setup</h3>

<p>First, install the Selenium Python bindings and the Chrome WebDriver. You can install Selenium using the following command:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>selenium
</code></pre></div></div>

<p>You can download the Chrome WebDriver from the following link: <a href="https://googlechromelabs.github.io/chrome-for-testing/">Chrome WebDriver</a></p>

<p><strong>Note:</strong> The Chrome WebDriver should be <em>the same version</em> as the Chrome browser installed on your system. You can check your Chrome version by going to <code class="language-plaintext highlighter-rouge">chrome://settings/help</code>.</p>

<p>Once you download the Chrome WebDriver, extract the file and place it in the same directory as your Python script.</p>
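<p>The version-match requirement above can be sanity-checked with a small helper. This is a minimal sketch; the version strings below are hypothetical (read yours from chrome://settings/help and from running chromedriver --version):</p>

```python
# Chrome and chromedriver are compatible when their major versions
# match. The version strings below are hypothetical examples.
def same_major_version(chrome_version: str, driver_version: str) -> bool:
    return chrome_version.split(".")[0] == driver_version.split(".")[0]

print(same_major_version("126.0.6478.62", "126.0.6478.55"))  # True
print(same_major_version("126.0.6478.62", "125.0.6422.78"))  # False
```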

<h3 id="importing-the-necessary-libraries">Importing the necessary libraries</h3>

<p>We start by importing the necessary libraries for scraping. (The Requests library, which we will use to download the PDF documents, is imported later in Part 2.)</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">selenium</span> <span class="kn">import</span> <span class="n">webdriver</span>
<span class="kn">from</span> <span class="nn">selenium.webdriver.common.keys</span> <span class="kn">import</span> <span class="n">Keys</span>
<span class="kn">import</span> <span class="nn">time</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
</code></pre></div></div>

<h3 id="keyword-definition">Keyword Definition</h3>

<p>Let’s define the keyword to be used to search the USPTO database. In this case, we will use the keyword “semiconductor”.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">keyword</span> <span class="o">=</span> <span class="s">"semiconductor"</span>
</code></pre></div></div>

<h3 id="initializing-the-webdriver-and-navigating-to-the-uspto-website">Initializing the WebDriver and Navigating to the USPTO Website</h3>

<p>We will initialize the WebDriver and navigate to the USPTO website. We will then search for the keyword “semiconductor” in the search bar, and press the search/enter button.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">driver</span> <span class="o">=</span> <span class="n">webdriver</span><span class="p">.</span><span class="n">Chrome</span><span class="p">()</span>
<span class="n">driver</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"https://ppubs.uspto.gov/pubwebapp/static/pages/ppubsbasic.html"</span><span class="p">)</span>
<span class="n">driver</span><span class="p">.</span><span class="n">find_element</span><span class="p">(</span><span class="s">"id"</span><span class="p">,</span><span class="s">"searchText1"</span><span class="p">).</span><span class="n">send_keys</span><span class="p">(</span><span class="s">"semiconductor"</span><span class="p">)</span>
<span class="n">driver</span><span class="p">.</span><span class="n">find_element</span><span class="p">(</span><span class="s">"id"</span><span class="p">,</span><span class="s">"searchText1"</span><span class="p">).</span><span class="n">send_keys</span><span class="p">(</span><span class="n">Keys</span><span class="p">.</span><span class="n">RETURN</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="scraping-and-saving-the-data">Scraping and Saving the Data</h3>

<p>We will scrape the data from the search results and save it into a Pandas DataFrame, then write the DataFrame to a CSV file. For this tutorial, we will run the script only over the first six pages of the search results (the loop breaks once the page count exceeds 5). You can eventually run it for all the pages by removing the count condition.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">master_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">()</span>
<span class="n">count</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
    <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>
    <span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_html</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">driver</span><span class="p">.</span><span class="n">page_source</span><span class="p">))[</span><span class="mi">0</span><span class="p">]</span>
    <span class="n">master_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">([</span><span class="n">master_df</span><span class="p">,</span> <span class="n">df</span><span class="p">])</span>
    <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="n">driver</span><span class="p">.</span><span class="n">find_element</span><span class="p">(</span><span class="s">"id"</span><span class="p">,</span><span class="s">"paginationNextItem"</span><span class="p">).</span><span class="n">click</span><span class="p">()</span>
    <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="n">e</span><span class="p">)</span>
        <span class="k">break</span>
    <span class="k">print</span><span class="p">(</span><span class="s">'Size of collection so far:'</span><span class="p">,</span> <span class="n">master_df</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
    <span class="n">count</span> <span class="o">+=</span> <span class="mi">1</span>
    <span class="k">if</span> <span class="n">count</span> <span class="o">&gt;</span> <span class="mi">5</span><span class="p">:</span>
        <span class="k">break</span>
<span class="n">master_df</span><span class="p">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">drop</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">master_df</span><span class="p">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s">'patents.csv'</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
</code></pre></div></div>

<p>Let’s look at what happens in this piece of code:</p>

<ol>
  <li>We initialize a master dataframe using the Pandas library. This dataframe will store the data from all the pages.</li>
  <li>We also instantiate a count variable to keep track of the number of pages we have scraped.</li>
  <li>We start a while loop that runs until the page count exceeds 5 (or until there is no next page). This loop scrapes the data from each page and appends it to the master dataframe.</li>
  <li>We <em>sleep</em> for 3 seconds after the page loads to give the JavaScript-rendered elements enough time to appear. Adjust this delay depending on your internet speed and the complexity of the website.</li>
  <li>We then pass the page source to the Pandas read_html function. This function parses every table in the HTML and returns a list of DataFrames. The table we need is the first one in that list, so we select the first element.</li>
  <li>We concat this new dataframe with the master dataframe.</li>
  <li>We try to click the next-page button. If it is not present, the click raises an exception and we break out of the loop. This is how we know that we have reached the end of the search results.</li>
  <li>We increment the count variable by 1.</li>
  <li>Once the value of count exceeds 5, we break out of the loop (comment out this check if you want to scrape all the pages).</li>
  <li>We reset the index of the master dataframe as the index will be duplicated after each concatenation.</li>
  <li>We save the master dataframe into a CSV file.</li>
</ol>
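<p>Step 5 can be tried in isolation, without a browser. A minimal sketch with a stand-in for driver.page_source (the markup below is hypothetical; the real USPTO results table has different columns):</p>

```python
from io import StringIO

import pandas as pd

# A tiny stand-in for driver.page_source: two tables, of which
# only the first holds the results we want.
html = """
<table>
  <tr><th>Result #</th><th>Title</th></tr>
  <tr><td>1</td><td>First patent</td></tr>
  <tr><td>2</td><td>Second patent</td></tr>
</table>
<table>
  <tr><th>Other</th></tr>
  <tr><td>ignored</td></tr>
</table>
"""

# read_html parses every table into a DataFrame; we keep the first,
# exactly as the scraper does. Newer pandas versions prefer a
# StringIO wrapper over a bare string.
tables = pd.read_html(StringIO(html))
df = tables[0]
print(len(tables))        # 2
print(df.shape[0])        # 2
print(list(df.columns))   # ['Result #', 'Title']
```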

<h3 id="final-code-run-only-this-code">Final Code (Run Only This Code)</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># import the required libraries
</span><span class="kn">from</span> <span class="nn">selenium</span> <span class="kn">import</span> <span class="n">webdriver</span>
<span class="kn">from</span> <span class="nn">selenium.webdriver.common.keys</span> <span class="kn">import</span> <span class="n">Keys</span>
<span class="kn">import</span> <span class="nn">time</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>

<span class="c1"># set the keyword
</span><span class="n">keyword</span> <span class="o">=</span> <span class="s">"semiconductor"</span>

<span class="c1"># open the browser and navigate to the website. Enter the keyword and click on search
</span><span class="n">driver</span> <span class="o">=</span> <span class="n">webdriver</span><span class="p">.</span><span class="n">Chrome</span><span class="p">()</span>
<span class="n">driver</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"https://ppubs.uspto.gov/pubwebapp/static/pages/ppubsbasic.html"</span><span class="p">)</span>
<span class="n">driver</span><span class="p">.</span><span class="n">find_element</span><span class="p">(</span><span class="s">"id"</span><span class="p">,</span><span class="s">"searchText1"</span><span class="p">).</span><span class="n">send_keys</span><span class="p">(</span><span class="s">"semiconductor"</span><span class="p">)</span>
<span class="n">driver</span><span class="p">.</span><span class="n">find_element</span><span class="p">(</span><span class="s">"id"</span><span class="p">,</span><span class="s">"searchText1"</span><span class="p">).</span><span class="n">send_keys</span><span class="p">(</span><span class="n">Keys</span><span class="p">.</span><span class="n">RETURN</span><span class="p">)</span>

<span class="c1"># create an empty dataframe to store the data
</span><span class="n">master_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">()</span>

<span class="c1"># loop through the pages and extract the data
</span><span class="n">count</span> <span class="o">=</span> <span class="mi">0</span>

<span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
    <span class="c1"># wait for the page to load
</span>    <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>
    <span class="c1"># read the data from the page and append it to the master
</span>    <span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_html</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">driver</span><span class="p">.</span><span class="n">page_source</span><span class="p">))[</span><span class="mi">0</span><span class="p">]</span>
    <span class="n">master_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">([</span><span class="n">master_df</span><span class="p">,</span> <span class="n">df</span><span class="p">])</span>
    <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="n">driver</span><span class="p">.</span><span class="n">find_element</span><span class="p">(</span><span class="s">"id"</span><span class="p">,</span><span class="s">"paginationNextItem"</span><span class="p">).</span><span class="n">click</span><span class="p">()</span>
    <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="n">e</span><span class="p">)</span>
        <span class="k">break</span>
    <span class="k">print</span><span class="p">(</span><span class="s">'Size of collection so far:'</span><span class="p">,</span> <span class="n">master_df</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
    <span class="n">count</span> <span class="o">+=</span> <span class="mi">1</span>
    <span class="k">if</span> <span class="n">count</span> <span class="o">&gt;</span> <span class="mi">5</span><span class="p">:</span>
        <span class="k">break</span>
<span class="n">driver</span><span class="p">.</span><span class="n">close</span><span class="p">()</span>
<span class="n">master_df</span><span class="p">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">drop</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">master_df</span><span class="p">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s">'patents.csv'</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Size of collection so far: 50
Size of collection so far: 100
Size of collection so far: 150
Size of collection so far: 200
Size of collection so far: 250
Size of collection so far: 300
</code></pre></div></div>

<p>Let’s have a look at the data we collected.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">master_df</span><span class="p">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>

<div>
<style scoped="">
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Result #</th>
      <th>Document/Patent number</th>
      <th>Display</th>
      <th>Title</th>
      <th>Inventor name</th>
      <th>Publication date</th>
      <th>Pages</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>1</td>
      <td>US-20240193696-A1</td>
      <td>Preview PDF</td>
      <td>PROACTIVE WEATHER EVENT COMMUNICATION SYSTEM A...</td>
      <td>Wyatt; Amber et al.</td>
      <td>2024-06-13</td>
      <td>21</td>
    </tr>
    <tr>
      <th>1</th>
      <td>2</td>
      <td>US-20240188889-A1</td>
      <td>Preview PDF</td>
      <td>FLASH LED AND HEART RATE MONITOR LED INTEGRATI...</td>
      <td>Tankiewicz; Szymon Michal et al.</td>
      <td>2024-06-13</td>
      <td>28</td>
    </tr>
    <tr>
      <th>2</th>
      <td>3</td>
      <td>US-20240193523-A1</td>
      <td>Preview PDF</td>
      <td>VIRTUAL CAREER MENTOR THAT CONSIDERS SKILLS AN...</td>
      <td>O'Donncha; Fearghal et al.</td>
      <td>2024-06-13</td>
      <td>16</td>
    </tr>
    <tr>
      <th>3</th>
      <td>4</td>
      <td>US-20240193519-A1</td>
      <td>Preview PDF</td>
      <td>SYSTEMS AND METHODS FOR SYSTEM-WIDE GRANULAR A...</td>
      <td>Holovacs; Jeremy</td>
      <td>2024-06-13</td>
      <td>34</td>
    </tr>
    <tr>
      <th>4</th>
      <td>5</td>
      <td>US-20240190459-A1</td>
      <td>Preview PDF</td>
      <td>METHODS AND SYSTEMS FOR VEHICLE CONTROL UNDER ...</td>
      <td>Mamchuk; Tetyana V. et al.</td>
      <td>2024-06-13</td>
      <td>23</td>
    </tr>
  </tbody>
</table>
</div>

<h1 id="part-2-downloading-the-pdf-patent-documents-using-the-requests-library">Part 2: Downloading the PDF Patent Documents using the Requests Library</h1>

<p>Now that we have the patents data, we can use the following code to extract/download the patents in PDF format. We will be using the Requests library to download the PDF documents using simple HTTP requests.</p>

<p>It turns out (to our advantage) that the USPTO website stores the PDF patent documents at a predictable URL of the format https://ppubs.uspto.gov/dirsearch-public/print/downloadPdf/XXXXXXXX.pdf, where XXXXXXXX is the patent number. We can use this to download the PDFs of the patents we are interested in.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">requests</span>

<span class="n">url</span> <span class="o">=</span> <span class="s">"https://ppubs.uspto.gov/dirsearch-public/print/downloadPdf/{}"</span>

<span class="c1"># sample 4 rows
</span><span class="n">sample</span> <span class="o">=</span> <span class="n">master_df</span><span class="p">.</span><span class="n">head</span><span class="p">(</span><span class="mi">4</span><span class="p">)</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">sample</span><span class="p">.</span><span class="n">iterrows</span><span class="p">():</span>
    <span class="n">patent_number</span> <span class="o">=</span> <span class="n">row</span><span class="p">[</span><span class="s">"Document/Patent number"</span><span class="p">].</span><span class="n">split</span><span class="p">(</span><span class="s">"-"</span><span class="p">)[</span><span class="mi">1</span><span class="p">]</span>
    <span class="n">formatted_url</span> <span class="o">=</span> <span class="n">url</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">patent_number</span><span class="p">)</span>
    <span class="n">response</span> <span class="o">=</span> <span class="n">requests</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">formatted_url</span><span class="p">)</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">row</span><span class="p">[</span><span class="s">'Document/Patent number'</span><span class="p">]</span><span class="si">}</span><span class="s">.pdf"</span><span class="p">,</span> <span class="s">"wb"</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
        <span class="n">f</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">response</span><span class="p">.</span><span class="n">content</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Downloaded </span><span class="si">{</span><span class="n">row</span><span class="p">[</span><span class="s">'Document/Patent number'</span><span class="p">]</span><span class="si">}</span><span class="s">.pdf"</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Downloaded US-20240193696-A1.pdf
Downloaded US-20240188889-A1.pdf
Downloaded US-20240193523-A1.pdf
Downloaded US-20240193519-A1.pdf
</code></pre></div></div>

<p>Let’s break down the code:</p>

<ol>
  <li>We first import the necessary library: requests.</li>
  <li>We take the first four rows of the master_df DataFrame as a sample.</li>
  <li>We iterate over each row in the sample DataFrame.</li>
  <li>We extract the patent number from the “Document/Patent number” column. Since the patent number is of the format “US-XXXXX-XX”, we split the text using the “-“ character and extract the second part.</li>
  <li>We format the URL using the extracted patent number.</li>
  <li>We make a GET request to the formatted URL.</li>
  <li>We write the content of the response to a PDF file named after the patent number. We open the file in write-binary mode (“wb”) since the PDF is binary data.</li>
  <li>We print a message indicating that the PDF file has been downloaded.</li>
</ol>
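<p>Steps 4 and 5 can be checked on their own with a sample value from the results table above:</p>

```python
# Extract the bare patent number from a "Document/Patent number" value
# and build the download URL (sample value from the results table above).
url = "https://ppubs.uspto.gov/dirsearch-public/print/downloadPdf/{}"

document_number = "US-20240193696-A1"
patent_number = document_number.split("-")[1]
formatted_url = url.format(patent_number)

print(patent_number)   # 20240193696
print(formatted_url)
```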

<h3 id="selenium-webdriver-vs-requests-library">Selenium Webdriver vs. Requests Library</h3>

<p>Some personal notes on the choice between Selenium WebDriver and the Requests library for web scraping:</p>

<p><strong>When to use Selenium WebDriver</strong></p>
<ul>
  <li>When the website is dynamic and requires JavaScript to load the content.</li>
  <li>When the website requires user interaction (e.g., clicking buttons, filling forms).</li>
</ul>

<p><strong>When to use the Requests library</strong></p>
<ul>
  <li>When the required elements are present in the source code of the webpage on load. In this case, you can directly scrape the data using the Requests library.</li>
  <li>When the website is static and does not require JavaScript to load the content.</li>
  <li>When the website does not require user interaction.</li>
</ul>
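<p>A quick way to decide between the two, in code form. A minimal offline sketch with two hypothetical responses:</p>

```python
# If the needle is already in the raw HTML the server sends, plain
# Requests is enough; if the page builds it with JavaScript, you
# likely need Selenium. Both responses below are hypothetical.
static_html = "<html><body><table><tr><td>US-20240193696-A1</td></tr></table></body></html>"
dynamic_html = "<html><body><div id='app'></div><script src='app.js'></script></body></html>"

needle = "US-20240193696-A1"
print(needle in static_html)   # True  -> Requests is enough
print(needle in dynamic_html)  # False -> likely needs Selenium
```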

<p><strong>Tip (that usually works):</strong> If you can find the required data/element in the page source after pressing Ctrl+U, you can scrape the data directly with the Requests library. If you cannot find it there, you will likely need Selenium WebDriver.</p>]]></content><author><name>Srikar Kashyap Pulipaka</name><email>srikar.kashyap@gmail.com</email></author><category term="python" /><category term="tutorial" /><category term="web-scraping" /><summary type="html"><![CDATA[In this short tutorial, let’s look at the US Patent and Trademark Office (USPTO) website and scrape the patent database using a keyword search. We will use Selenium WebDriver to scrape the data. We will then use the Requests library to download the individual patent PDF documents.]]></summary></entry><entry><title type="html">Problems and Nuisances</title><link href="https://srikarkashyap.github.io/posts/2021/01/problems-and-nuisances/" rel="alternate" type="text/html" title="Problems and Nuisances" /><published>2021-01-15T00:00:00-05:00</published><updated>2021-01-15T00:00:00-05:00</updated><id>https://srikarkashyap.github.io/posts/2021/01/problems-and-nuisances</id><content type="html" xml:base="https://srikarkashyap.github.io/posts/2021/01/problems-and-nuisances/"><![CDATA[<p><em>Identifying the user issues worth solving is very crucial but often ignored</em></p>

<p>In the course of my study on product management, I’ve identified two categories of user issues: <strong>Problems</strong> and <strong>Nuisances</strong>.</p>

<p><strong>Problems</strong> are difficulties that affect the user so much that they are willing to spend time and money to solve them.</p>

<p><strong>Nuisances</strong> feel like problems at the moment but are quickly ignored or forgotten.</p>

<h2 id="lets-take-an-example">Let’s take an example</h2>

<p>My mobile hangs almost once every month and needs a restart. If somebody asks me about this, I would rave for half an hour about how terrible my phone is. If they ask me if I would buy their <em>hang-free phone</em>, I would instinctively say yes!</p>

<p>But will I? No! Because it’s a nuisance that seems important in that moment. I’m totally fine with my phone the rest of the time.</p>

<p><strong>This misidentification of nuisances as problems seems to plague non-tech companies the most</strong>. A laundry list of features is made and the development outsourced to a vendor, with negligible or zero user surveys or demand identification. Most of these <em>features</em> are often obscure and unused.</p>

<p>But don’t throw away the nuisances yet! <strong>In many cases, nuisances solved with 10X efficiency have a potential niche market for themselves</strong>. These markets rarely overlap with the original markets.</p>

<p>A clock that slows down every month by five minutes? Nuisance. A clock that doesn’t slow down for decades or centuries? <strong>Atomic clock</strong> (with important scientific applications).</p>

<p>Can you think of any more examples that prove (or disprove) this distinction?</p>

<p><strong>TL;DR:</strong> Solve problems. Unless you can solve the nuisance 10X better. Then do that!</p>

<h2 id="recommended-reading-on-identifying-problems">Recommended reading on identifying problems</h2>
<ul>
  <li>Talking to Humans</li>
  <li>The Mom Test</li>
</ul>]]></content><author><name>Srikar Kashyap Pulipaka</name><email>srikar.kashyap@gmail.com</email></author><category term="product-management" /><category term="thoughts" /><category term="non-tech" /><summary type="html"><![CDATA[Identifying the user issues worth solving is very crucial but often ignored]]></summary></entry></feed>