Web Scraping On The Edge Using HTMLRewriter
Aug 19, 2023 · 9 min read

In this tutorial, we’re going to create a web scraper using HTMLRewriter. Specifically, we’re going to build a web service that scrapes what a user is currently reading on Goodreads and serves it as JSON data.
Technologies
First, a short introduction to the technologies we’re going to use.
HTMLRewriter
As the name suggests, HTMLRewriter is a super lightweight and fast tool for rewriting HTML, created for edge environments like Cloudflare Workers, Deno, and Bun.
Even though its main purpose is to rewrite HTML, we can also use it to parse HTML, because it can’t rewrite HTML without parsing it first.
It might not be ideal for large-scale scraping (as you’ll see in the examples later on), but for our purpose, it’s more than enough.
If you’re interested in HTMLRewriter, you can read more about the history of its creation on Cloudflare’s blog.
Hono
Hono is a small, simple, and ultrafast web framework designed for the edge. We’re going to use Hono as our router.
Cloudflare Workers
The app will be deployed as a Cloudflare Worker running on the edge (close to your users).
Note that even though we’re deploying to Cloudflare Workers in this tutorial, it should also apply to other environments that implement HTMLRewriter, like Deno and Bun.
Building The Scraper
Let’s begin building our app.
For those who prefer to just read the code, the source code is open source on GitHub.
Setup
Let’s begin by scaffolding our project using the Cloudflare Workers template from Hono.
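The original command isn’t reproduced here; with recent Hono versions, scaffolding via npm looks something like this (the folder name is just a placeholder, pick your own):

```shell
npm create hono@latest my-app
```
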
Then select Cloudflare Workers (arrow keys to move, Space to select) from the provided options.
Make sure to install the dependencies by running:
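Assuming npm (use your package manager of choice):

```shell
npm install
```
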
And you should now be able to start the dev server by running:
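In the generated template, this is usually:

```shell
npm run dev
```
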
Now, if you go to the provided link, which defaults to http://127.0.0.1:8787/, you should see the text “Hello Hono!”.
File structure
The generated file structure is really simple. Other than the standard npm stuff, the one file that you might not be familiar with, if you’ve never built a Cloudflare Worker before, is wrangler.toml.
The file contains metadata and configuration for the project: for example, your Workers KV, D1, etc.
We’re not going to need any of that in this project, but you can change the name field to the name of your app. Cloudflare will use the name as part of your deployed app’s URL, and also as an identifier in Cloudflare’s dashboard. I’m going to name mine goodreads-currently-reading.
Alright, now that we’re done here, let’s explore src/index.ts, where we’re going to spend the rest of the tutorial.
Routing
Now, if you open the file, you should be greeted with code similar to this:
The code should be really familiar if you have experience with backend frameworks like Express.
Our app will only have one route, /:id. The id here refers to a Goodreads user ID. To create the route, just add the following code after the index route handler:
Verify that your app is working by going to /<anything>.
You should get a JSON response like this.
Fetching
Now that we have the user ID, we need to use it to fetch the “currently-reading” page of the user from Goodreads.
In case you don’t know, you can get your Goodreads user ID by going to your profile page on Goodreads; it’s right there in the URL. For example, here is mine:

We only need the id number (74091755), so feel free to omit the username that follows it.
Going back to the code, here is how you can fetch the “currently-reading” page:
Tip: You can also fetch other shelves by modifying shelf=currently-reading. For example: shelf=read.
Let’s also add some minimal error handling in case the fetch request fails.
Scraping
Now that we’re connected to Goodreads, we can start scraping. To do that, we have to pass response to HTMLRewriter to parse and transform. But since we don’t need the result of the transformation, we can just ignore it.
We created an array to store our data, then we pass the response to the HTMLRewriter instance.
The interesting and most important bit here is the .on method. Its first argument is a selector, and its second argument is an instance of an ElementHandler.
The ElementHandler looks like this in full:
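Per the Workers runtime API docs, a handler can implement up to three callbacks, roughly:

```typescript
class ElementHandler {
  element(element: unknown) {
    // called once for each element matched by the selector
  }
  comments(comment: unknown) {
    // called for each HTML comment inside a matched element
  }
  text(text: unknown) {
    // called for each chunk of text inside a matched element
  }
}
```
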
In the case of our app, our selector captured an HTML tree that looks like this:
Now, if this tree is passed to the ElementHandler, then element refers to the <a> tag, text refers to the text inside it (“Pixel Art for Game Developers”), and comment is undefined in this case, since there’s no HTML comment.
Now, back to our app.
Since the data I needed is conveniently provided in the <a> element, I can just take it using the getAttribute method and push it into the array.
Now, if you run the app, go to /:id, and check the result, you should get the data as intended.
Finalizing
That’s it for our app. We achieved our objective. Here is the code in full.
Notice that I refactored the app a bit: the href attribute of the <a> is relative, but I want it to be an absolute URL, so I extracted the Goodreads URL into a separate variable.
You should also adjust yours to better fit your needs. Maybe you also want to scrape the cover, or the rating? You get the idea.
Deploying
Finally, you can deploy your app using the command:
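In the Hono Cloudflare Workers template, the deploy script usually wraps wrangler deploy, so something like:

```shell
npm run deploy
```
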
You might be prompted to sign in if it’s your first time using Wrangler. Simply follow the instructions.
Wrangler should start deploying your project, and you should get a live URL to your app. Congrats!
Bonus
I hope that by now you have a pretty good idea of how to use HTMLRewriter for web scraping. As a bonus, I want to address some common gotchas when using HTMLRewriter for web scraping.
Text Content Might Come In Chunks
Say that we have an element whose text content is “Hello, world!”. When you access the text using HTMLRewriter, it might come in chunks like “He”, “llo,”, “world”, and “!”. So, always remember to concatenate these chunks when scraping for text data.
You’ll see an example in the next section.
Can’t Directly Access Nested Element
At the time of writing, the Element object doesn’t have any method to directly access nested elements. So, in order to access nested elements, we have to run separate selectors and handlers for them.
For example, say that our HTML looks like this:
And we want our data to be in this shape:
When we’re on the a selector, we have no way to access the nested <img /> and <span> inside it. To access them, we need to create separate selectors and handlers for each case.
Like this:
And it can get a little complicated when we need to scrape multiple items.
Fortunately, as you might have noticed, our handlers are called in order, allowing the code above to work.
Wrap Up
That’s it! I hope you found the tutorial useful. And if you need it, the source code for goodreads-currently-reading is on my GitHub.