It’s May! That means I’ve been working on the SIREN project at the Centre for Mental Health Law and Policy for an entire year. Our monolith has changed drastically since we started. From just a single scraper for Telegraph India, we now have coverage of 12 different publications, complete with a data validation and cleaning layer. These webscrapers go through E-paper catalogs to find articles that report on suicide, extracting the metadata of relevant articles and putting it in a spreadsheet for researchers to use. As you can imagine, this saves a lot of time that would otherwise be spent on manually finding these articles.
Turns out that most epapers have a search bar you can use to look up particular keywords. This “solves” the biggest struggle with webscraping epapers: dealing with non-textual data. Scraping textual data is a bajillion times easier than scraping binary data that should be textual, like images of text. And epapers are exactly that. Thankfully, the publications store the text in their database, and their internal APIs query that database whenever someone uses the search function on the webpage. These endpoints are easy to find and use for anyone who knows their way around their browser’s developer tools.
For example, searching for the 10 latest Star Wars articles on News Minute is a simple HTTP GET request:
```python
import requests

# Query The News Minute's internal search endpoint for the 10 latest matches.
resp = requests.get(
    "https://www.thenewsminute.com/api/v1/advanced-search"
    "?q=star%20wars&limit=10&offset=0&fields=url,headline,published-at"
)
print(resp.json())
```
At CMHLP, we screen vast amounts of data, not just the first 10 results. This usually involves some sort of pagination mechanism, which can be slow without a few tricks. That’s because Python code executes sequentially: whenever we make a request, we wait for it to finish before sending the next one. This is terrible for performance, because the time spent simply waiting for a response from the News Minute servers is precious time our processor could use for other things.
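To make that concrete, here’s a rough sketch of the sequential version (the total and items fields follow the same response shape the async code below relies on):

```python
import requests

PAGE_SIZE = 100
URL = "https://www.thenewsminute.com/api/v1/advanced-search"
params = {"q": "star wars", "limit": PAGE_SIZE, "offset": 0, "fields": "url,headline"}

# Fetch the first page to learn how many results exist in total.
initial = requests.get(URL, params=params).json()
data = initial["items"]

# Then walk through the remaining pages one request at a time;
# each iteration sits idle until the previous response arrives.
for offset in range(PAGE_SIZE, initial["total"], PAGE_SIZE):
    params["offset"] = offset
    page = requests.get(URL, params=params).json()
    data.extend(page.get("items", []))
```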
This is where we make things asynchronous. All that means is we’ll be telling our code when it needs to wait for some I/O to happen, like getting a response from a server, so that it can do other work in the meantime. The I/O could also be something like reading from disk, where the work is not computationally expensive but rather bound by the time taken for input and output. These kinds of “I/O bound” programs benefit greatly from cooperative multitasking, which is what the async/await concurrency model provides.
To make our code async, we’ll first switch from the requests package, which executes synchronously, to httpx (aiohttp is also a great choice).
```python
import asyncio
from typing import Any

import httpx


async def get_page(page: int = 0, *, page_size: int = 100, client: httpx.AsyncClient) -> dict[str, Any]:
    # Each page of results is addressed by an offset into the full result set.
    offset = page_size * page
    resp = await client.get(
        f"https://www.thenewsminute.com/api/v1/advanced-search?q=star%20wars&limit={page_size}&offset={offset}&fields=url,headline"
    )
    return resp.json()


async def main():
    PAGE_SIZE = 100
    async with httpx.AsyncClient(timeout=None) as client:
        # The first page tells us how many results exist in total.
        initial = await get_page(page_size=PAGE_SIZE, client=client)
        total = initial["total"]
        pages = (total // PAGE_SIZE) + 2

        # Schedule the remaining pages as concurrent tasks.
        tasks: list[asyncio.Task[dict[str, Any]]] = []
        for i in range(1, pages):
            task = asyncio.create_task(get_page(i, page_size=PAGE_SIZE, client=client))
            tasks.append(task)

        # Wait for every page to come back, then collect the articles.
        results = await asyncio.gather(*tasks)
        data = initial["items"]
        for sr in results:
            if items := sr.get("items"):
                data.extend(items)
        print(data)


asyncio.run(main())
```
Whoa! It looks quite different now. Let’s see what’s changed:
- The get_page function now has the async prefix in its definition
- get_page has an await prefix in the main function
- async with?
- asyncio.gather?
- asyncio.run?
Let’s start with coroutines. These are special functions which can be suspended and resumed. They’re created primarily with the async def syntax. They’re also called async functions, to distinguish them from coroutine objects, which are what they return. Our code has two coroutines: get_page and main. When we call them, we get a coroutine object. You can open an asyncio REPL with python -m asyncio and try it out:
```
>>> async def foo():
...     print("bar")
...
>>> foo()
<console>:1: RuntimeWarning: coroutine 'foo' was never awaited
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
<coroutine object foo at 0x7afdfcbfa9240>
```
We get a RuntimeWarning, because calling foo() doesn’t actually execute it; as you can see, “bar” was never printed. As the warning says, we need to await it.
```
>>> async def foo():
...     print("bar")
...
>>> await foo()
bar
```
In essence, await suspends the execution of the currently running awaitable (like a coroutine) and yields control to the event loop, so that another task can run in the meantime.
Once we’ve made the initial query, we get the total number of articles. With that information, we know how many queries we need to make, and there is no need to make those queries sequentially: making the nth query doesn’t require us to know the result of the (n-1)th query.
Instead, we use the asyncio.create_task function to schedule execution of the coros. We keep a reference to these tasks in the tasks list, which we then pass to asyncio.gather. This waits until all the tasks are done and all our data has been received.
Finally, we need to create the event loop and run our main() coro; we use the asyncio.run function for this.
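Putting those pieces together, here’s a self-contained toy version of the same pattern, with asyncio.sleep standing in for the network call (run it as a script rather than inside the asyncio REPL, since the REPL already has a running event loop):

```python
import asyncio
import time


async def fake_request(n: int) -> int:
    # Stand-in for a network call: awaiting asyncio.sleep suspends this
    # coroutine and hands control back to the event loop.
    await asyncio.sleep(1)
    return n


async def main() -> None:
    start = time.perf_counter()
    tasks = [asyncio.create_task(fake_request(i)) for i in range(5)]
    results = await asyncio.gather(*tasks)
    # All five "requests" overlap, so this takes about 1 second, not 5.
    print(results, f"in {time.perf_counter() - start:.1f}s")


asyncio.run(main())
```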
Let’s compare benchmarks:
Our vanilla, sync script:
```
Executed in  141.38 secs    fish           external
   usr time    2.18 secs  189.00 micros    2.18 secs
   sys time    0.11 secs  106.00 micros    0.11 secs
```
Versus the shiny new async script:
```
Executed in    7.25 secs    fish           external
   usr time    1.28 secs  243.00 micros    1.28 secs
   sys time    0.14 secs  136.00 micros    0.14 secs
```
*The top-left number is the time taken for execution. Don’t bother about the rest of them; they’re only there for the nerds who know what they mean.

We get a 20x performance increase with the async version! This is how most SIREN scrapers work: exploiting an internal search API and making requests concurrently to get fast results. Well, that’s if such an API exists…
Some publications don’t have a convenient search bar. Or they do, but it’s broken or extremely inaccurate. This makes things significantly harder. We don’t have any textual data to work with; we have images. Extracting text from images (a process called OCR) is extremely taxing on the CPU. Not only does it take a lot of compute power, it also takes a lot of time. This is partly due to the degree of accuracy needed from the SIREN scrapers; we can only use high-resolution images to ensure sufficient precision. OCR is “CPU-bound” work, and async doesn’t help here; there’s no waiting time, because the CPU is constantly crunching numbers. :( In these cases, we use thread-based parallelism to utilize all our CPU cores and run multiple OCR tasks in parallel. This is very janky in Python because of its GIL. But that’s a discussion for another time.
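For a rough idea of what that looks like, here’s a minimal sketch using a thread pool; pytesseract and the file names are just illustrative stand-ins, not necessarily what SIREN actually uses:

```python
from concurrent.futures import ThreadPoolExecutor

import pytesseract  # illustrative OCR library, not necessarily what SIREN uses
from PIL import Image


def ocr_page(path: str) -> str:
    # The heavy lifting happens inside the Tesseract engine, outside the
    # Python interpreter, which is what lets several pages be processed
    # at once despite the GIL.
    return pytesseract.image_to_string(Image.open(path))


# Hypothetical high-resolution page scans.
image_paths = ["page-1.png", "page-2.png", "page-3.png"]

# One OCR job per worker thread; tune max_workers to your core count.
with ThreadPoolExecutor(max_workers=4) as pool:
    texts = list(pool.map(ocr_page, image_paths))
```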
Sifting through API payloads has been an interesting experience. This fakeAmountINR field, for example, from the Indian Express:
That doesn’t seem very ethical…
Or this snippet from the Hindustan Times:
I wonder if their engineers have heard of naming conventions or style guides?
It would be amazing if publications would work with researchers and give them easy access to the data they need. This whole webscraping business is only happening because there is no proper channel for researchers to access vital data.
In any case… looks like I’m going to need a whole lot of eye bleach now.