In this tutorial, we’re going to create a web scraper using HTMLRewriter. Specifically, we’re going to create a web service that scrapes what a user is currently reading on Goodreads and serves it as JSON.
First, a short introduction to the technologies we’re going to use.
Even though HTMLRewriter’s main purpose is to rewrite HTML, we can also use it to parse HTML, because it can’t rewrite HTML without parsing it first.
It might not be ideal for large-scale scraping (as you’ll see in the examples later on), but for our purpose it’s more than enough.
If you’re interested in HTMLRewriter, you can read more about the history of its creation on Cloudflare’s blog.
Hono is a small, simple, and ultrafast web framework designed for the edge. We’re going to use Hono as our router.
The app will be deployed as a Cloudflare Worker running on the edge (close to your users).
Building The Scraper
Let’s begin building our app.
For those who prefer to just read the code, the source code is available on GitHub.
Let’s begin by scaffolding our project using the Cloudflare Workers template from Hono.
Then select Cloudflare Workers (Arrow key to move, and Space key to select) in the provided options.
Make sure to install the dependencies by running:
And you should now be able to start the dev server by running:
Now, if you go to the provided link, which defaults to
http://127.0.0.1:8787/, you should see the text
The generated file structure is really simple. Other than the standard npm files, the one you might not be familiar with, if you’ve never built a Cloudflare Worker before, is
The file contains metadata and configurations for the project. For example: your
In this project though, we’re not going to need any of that. But you can change the
name to the name of your app. Cloudflare will use the
name as part of your deployed app’s URL, and also as an identifier in Cloudflare’s dashboard.
I’m going to name mine:
Alright, now that we’re done here, let’s explore
src/index.ts, where we’re going to spend the rest of the tutorial.
Now, if you open the file, you should be greeted with code similar to this:
The code should be really familiar if you have experience with backend frameworks like Express.
Our app will only have 1 route,
id here refers to the Goodreads user ID. To create the route, just add the following code after the
index route handler:
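As a rough sketch of what such a route handler does, the snippet below reads the `id` path parameter and echoes it back as JSON. `ContextLike` and `userHandler` are hypothetical names used only for this illustration: `ContextLike` is a stand-in for the small part of Hono’s context object that the handler touches, so the snippet runs without installing anything. The shape of the JSON response is also an assumption.

```typescript
// Stand-in for the part of Hono's context that this handler uses.
// In the real app, `c` is the context object Hono passes to route handlers.
interface ContextLike<R> {
  req: { param(name: string): string };
  json(data: unknown): R;
}

// Read the `id` path parameter and echo it back as JSON.
function userHandler<R>(c: ContextLike<R>): R {
  const id = c.req.param("id");
  return c.json({ id });
}

// In src/index.ts the same logic would typically be written inline, e.g.:
//   app.get("/:id", (c) => c.json({ id: c.req.param("id") }));
```

Keeping the handler this small at first makes it easy to confirm the routing works before any scraping logic is added.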
Verify that your app is working by going to
You should get a JSON response like this.
Now that we have the user ID, we need to use it to fetch the “currently-reading” page of the user from Goodreads.
In case you don’t know, you can find your Goodreads user ID in the URL of your Goodreads profile page. For example, here is mine:
We only need the
id number (
74091755). So feel free to omit the following
Going back to the code, here is how you can fetch the “currently-reading” page:
Tip: You can also fetch other shelves by modifying
shelf=currently-reading. For example:
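To make the shelf parameter concrete, here is a small sketch that builds the page URL for any shelf and fetches it. `shelfUrl` and `fetchShelf` are hypothetical helper names, and the `/review/list/<id>` path is my assumption about how Goodreads lays out shelf pages, so double check it against the URL you see in your own browser.

```typescript
// Assumption: shelf pages live at /review/list/<user id>?shelf=<shelf name>.
function shelfUrl(userId: string, shelf: string = "currently-reading"): string {
  const id = encodeURIComponent(userId);
  const name = encodeURIComponent(shelf);
  return `https://www.goodreads.com/review/list/${id}?shelf=${name}`;
}

// Fetch the raw HTML page for a shelf. `fetch` is a global in Workers.
function fetchShelf(userId: string, shelf?: string) {
  return fetch(shelfUrl(userId, shelf));
}
```

For instance, `fetchShelf("74091755", "read")` would request the `read` shelf instead of the default one.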
Let’s also add minimal error handling in case the fetch request fails.
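One minimal way to do that is to check the response’s `ok` flag and turn a failure into a small JSON payload. This is only a sketch: `checkResponse`, `ResponseLike`, and `FetchFailure` are hypothetical names, and the error message and the 502 status in the usage comment are my own choices.

```typescript
// Only the fields of a fetch Response that the check needs.
interface ResponseLike {
  ok: boolean;
  status: number;
}

interface FetchFailure {
  error: string;
  status: number;
}

// Returns an error payload for a failed response, or null when it succeeded.
function checkResponse(response: ResponseLike): FetchFailure | null {
  if (response.ok) return null;
  return {
    error: "Failed to fetch the page from Goodreads",
    status: response.status,
  };
}

// Inside the route handler, this could be used roughly like so:
//   const response = await fetch(url);
//   const failure = checkResponse(response);
//   if (failure) return c.json(failure, 502);
```

Note that `fetch` can also throw on network errors, so you may want to wrap the call in `try`/`catch` as well.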
Now that we’re connected to Goodreads, we can start scraping.
To start scraping, we have to pass
response to HTMLRewriter to parse and transform. But since we don’t need the result of the transformation, we can just ignore it.
We create an array to store our data, then pass the response to the HTMLRewriter instance.
ElementHandler looks like this in full:
In the case of our app, our selector captured an HTML tree that looks like this:
Now, if this tree is passed to
element refers to the
<a> tag, while the
text refers to the text inside (“Pixel Art for Game Developers”), and
undefined in this case, since there’s no HTML comment.
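To see which callback receives what, here is a sketch of a handler that simply logs every call. `LoggingHandler` and the `*Like` interfaces are hypothetical: the interfaces are minimal stand-ins for the runtime’s types so the snippet is self-contained, and they only list the properties used here.

```typescript
// Minimal stand-ins for the objects HTMLRewriter passes to a handler.
interface ElementLike {
  tagName: string;
}
interface CommentLike {
  text: string;
}
interface TextChunkLike {
  text: string;
  lastInTextNode: boolean;
}

class LoggingHandler {
  log: string[] = [];

  // Called for each element matching the selector.
  element(element: ElementLike): void {
    this.log.push(`element: <${element.tagName}>`);
  }

  // Called for each HTML comment inside a matching element.
  comments(comment: CommentLike): void {
    this.log.push(`comment: ${comment.text}`);
  }

  // Called for each chunk of text inside a matching element.
  text(chunk: TextChunkLike): void {
    if (chunk.text) this.log.push(`text: ${chunk.text}`);
  }
}
```

For an `<a>` tag that contains only a title, the log would hold one `element` entry and one or more `text` entries. As far as I can tell, when there is no HTML comment the `comments` callback is simply never invoked.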
Now, back to our app.
Since the data I need is conveniently provided in the
<a> element, I can just take it using the
getAttribute method and push it into the array.
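Put together, the idea can be sketched as follows. `BookLinkHandler`, `collectBooks`, and the `Book` shape are hypothetical names, and reading the `title` and `href` attributes is an assumption about which attributes hold the data. The selector is left as a parameter because it depends on the page markup.

```typescript
// Stand-in for the runtime's Element type (only the method used here).
interface ElementLike {
  getAttribute(name: string): string | null;
}

// HTMLRewriter is a global in the Workers runtime. It is declared here only so
// the sketch type-checks on its own; drop these declarations if your project
// already includes the Workers type definitions.
interface RewriterLike {
  on(selector: string, handler: object): RewriterLike;
  transform(response: Response): Response;
}
declare const HTMLRewriter: new () => RewriterLike;

interface Book {
  title: string | null;
  url: string | null;
}

class BookLinkHandler {
  books: Book[];

  constructor(books: Book[]) {
    this.books = books;
  }

  // Copy the attributes we care about into the shared array.
  element(element: ElementLike): void {
    this.books.push({
      title: element.getAttribute("title"),
      url: element.getAttribute("href"),
    });
  }
}

async function collectBooks(response: Response, selector: string): Promise<Book[]> {
  const books: Book[] = [];
  const rewritten = new HTMLRewriter()
    .on(selector, new BookLinkHandler(books))
    .transform(response);
  // Handlers only run while the body is being read, so read it and discard it.
  await rewritten.arrayBuffer();
  return books;
}
```

The transformed body itself is thrown away; reading it is just what drives the handlers.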
Now if you run the app, go to
/:id, and check the result, you should get the data as intended.
That’s it for our app. We achieved our objective. Here is the code in full.
Notice that I refactored the app a bit because the
href attribute of the
<a> is relative, but I want it to be an absolute URL. So, I extracted the Goodreads URL into a separate variable.
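One way to do that conversion, shown here as a sketch rather than the exact code from the repository, is to resolve the relative `href` against the base URL with the standard `URL` constructor. `toAbsoluteUrl` is a hypothetical helper name.

```typescript
const GOODREADS_URL = "https://www.goodreads.com";

// Resolve a possibly relative href against the Goodreads base URL.
// A missing attribute (null) is passed through unchanged.
function toAbsoluteUrl(href: string | null): string | null {
  return href === null ? null : new URL(href, GOODREADS_URL).toString();
}
```

Because `URL` ignores the base when the input is already absolute, the helper is safe to call on every link.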
You should also adjust yours to better fit your needs. Maybe you also want to scrape the cover or the rating. You get the idea.
Finally, you can deploy your app using the command:
You might be prompted to sign in if it’s your first time using Wrangler. Simply follow the instructions.
Wrangler should start deploying your project, and you should get a live URL to your app. Congrats!
I hope that by now you have a pretty good idea of how to use HTMLRewriter for web scraping. As a bonus, I want to address some common gotchas when using HTMLRewriter for web scraping.
Text Content Might Come In Chunks
Say that we have HTML that looks like this:
When you access the text using HTMLRewriter, it might come in chunks like:
So, always remember to concatenate these chunks when scraping for text data.
You’ll see an example in the next section.
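In isolation, the concatenation can be sketched like this: keep a buffer, append every chunk to it, and only store the result once the chunk marked `lastInTextNode` arrives. `TextCollector` and `TextChunkLike` are hypothetical names; the interface is a stand-in listing only the two chunk properties used.

```typescript
// Stand-in for the text chunk object passed to a `text` handler.
interface TextChunkLike {
  text: string;
  lastInTextNode: boolean;
}

class TextCollector {
  buffer = "";
  texts: string[] = [];

  text(chunk: TextChunkLike): void {
    // Append every chunk, however the runtime happened to split the text.
    this.buffer += chunk.text;
    // The final chunk of a text node signals that the text is complete.
    if (chunk.lastInTextNode) {
      this.texts.push(this.buffer);
      this.buffer = "";
    }
  }
}
```

An instance of this class can be passed to `.on()` like any other handler, and `texts` will hold one complete string per text node.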
Can’t Directly Access Nested Elements
At the time of writing, the Element object doesn’t have any method to directly access nested elements. So, in order to access nested elements, we have to run separate selectors and handlers for them.
For example, say that our HTML looks like this:
And we want our data to be in this shape:
When we’re on the
a selector, we have no way to access the nested
<img /> and
<span> inside it. To access them, we need to create separate selectors to handle each case.
And it can get a little complicated when we need to scrape multiple items.
Fortunately, as you might have noticed, our handlers are called in order, allowing the code above to work.
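Here is a sketch of that pattern under assumed markup, where each item is an `<a>` wrapping an `<img />` and a `<span>`. `createItemHandlers` and the `Item` shape are hypothetical. The trick is that the handler for the outer element pushes a fresh item, and the nested handlers fill in whichever item was pushed last.

```typescript
interface ElementLike {
  getAttribute(name: string): string | null;
}
interface TextChunkLike {
  text: string;
}
interface Item {
  url: string | null;
  cover: string | null;
  title: string;
}

function createItemHandlers() {
  const items: Item[] = [];
  // The item being filled in is always the most recently pushed one.
  const current = (): Item => items[items.length - 1];

  return {
    items,
    // Runs first for each <a>: start a new item.
    link: {
      element(el: ElementLike): void {
        items.push({ url: el.getAttribute("href"), cover: null, title: "" });
      },
    },
    // Runs for the nested <img />: attach the cover to the current item.
    image: {
      element(el: ElementLike): void {
        current().cover = el.getAttribute("src");
      },
    },
    // Runs for each text chunk in the nested <span>: build up the title.
    title: {
      text(chunk: TextChunkLike): void {
        current().title += chunk.text;
      },
    },
  };
}

// Wiring in the Workers runtime would look roughly like:
//   const h = createItemHandlers();
//   await new HTMLRewriter()
//     .on("a", h.link).on("a img", h.image).on("a span", h.title)
//     .transform(response).arrayBuffer();
```

This relies entirely on the handlers firing in document order, so it would break if the nested selectors could match outside of an `<a>`.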
That’s it! I hope you find the tutorial useful. And if you need it, the source code for
goodreads-currently-reading is on my GitHub.