Subscribe to receive notifications of new posts:

How we built an open-source SEO tool using Workers, D1, and Queues

03/02/2023

10 min read
How we built an open-source SEO tool using Workers, D1, and Queues.

Building applications on Cloudflare Workers has always been fun. Workers applications have low latency response times by default, and easy developer ergonomics thanks to Wrangler. It's no surprise that for years now, developers have been going from idea to production with Workers in just a few minutes.

Internally, we're no different. When a member of our team has a project idea, we often reach for Workers first, and not just for the MVP stage, but in production, too. Workers have been a secret ingredient to Cloudflare’s innovation for some time now, allowing us to build products like Access, Stream and Workers KV. Even better, when we have new ideas and we can use new Cloudflare products to build them, it's a great way to give feedback on those products.

We've discussed this in the past on the Cloudflare blog - in May last year, I wrote how we rebuilt Cloudflare's developer documentation using many of the tools that had recently been released in the Workers ecosystem: Cloudflare Pages for hosting, and Bulk Redirects for the redirect rules. In November, we released a new version of our API documentation, which again used Pages for hosting, and Pages functions for intelligent caching and transformation of our API schema.

In this blog post, I’m excited to show off some of the new tools in Cloudflare’s developer arsenal, D1 and Queues, to prototype and ship an internal tool for our SEO experts at Cloudflare. We've made this project, which we're calling Prospector, open-source too - check it out in our cloudflare/templates repo on GitHub. Whether you're a developer looking to understand how to use multiple parts of Cloudflare's developer stack together, or an SEO specialist who may want to deploy the tool in production, we've made it incredibly easy to get up and running.

What we're building

Prospector is a tool that allows Cloudflare's SEO experts to monitor our blog and marketing site for specific keywords. When a keyword is matched on a page, Prospector will notify an email address. This allows our SEO experts to stay informed of any changes to our website, and take action accordingly.

Using MailChannels' integration with Workers, we can quickly and easily send emails from our application using a single API call. This allows us to focus on the core functionality of the application, and not worry about the details of sending emails.

Prospector uses Cloudflare Workers as the user-facing API for the application. It uses D1 to store and retrieve data in real-time, and Queues to handle the fetching of all URLs and the notification process. We've also included an intuitive user interface for the application, which is built with HTML, CSS, and JavaScript.

Why we built it

It is widely known in SEO that both internal and external links help Google and other search engines understand what a website is about, which impacts keyword rankings. Not only do these links guide readers to additional helpful information, they also allow web crawlers for search engines to discover and index content on the site.

Acquiring external links is often a time-consuming process and at the discretion of third parties, whereas website owners typically have much more control over internal links. As a result, internal linking is one of the most useful levers available in SEO.

In an ideal world, every piece of content would be fully formed upon publication, replete with helpful internal links throughout the piece. However, this is often not the case. Many times, content is edited after the fact or additional pieces of relevant content come along after initial publication. These situations result in missed opportunities for internal linking.

Like other large organizations, Cloudflare has published thousands of blogs and web pages over the years. We share new content every time a product/technology is introduced and improved. Ultimately, that also means it's become more challenging to identify opportunities for internal linking in a timely, automated fashion. We needed a tool that would allow us to identify internal linking opportunities as they appear, and speed up the time it takes to identify new internal linking opportunities.

Although we tested several tools that might solve this problem, we found that they were limited in several ways. First, some tools only scanned the first 2,000 characters of a web page. Any opportunities found beyond that limit would not be detected. Next, some tools did not allow us to limit searches to certain areas of the site and resulted in many false positives. Finally, other potential solutions required manual operation, leaving the process at the mercy of human memory.

To solve our problem (and ultimately, improve our SEO), we needed an automated tool that could discover and notify us of new instances of targeted phrases on a specified range of pages.

How it works

Data model

First, let's explore the data model for Prospector. We have two main tables: notifiers and urls. The notifiers table stores the email address and keyword that we want to monitor. The urls table stores the URL and sitemap that we want to scrape. The notifiers table has a one-to-many relationship with the urls table, meaning that each notifier can have many URLs associated with it.

In addition, we have a sitemaps table that stores the sitemap URLs that we've scraped. Many larger websites don't just have a single sitemap: the Cloudflare blog, for instance, has a primary sitemap that contains four sub-sitemaps. When the application is deployed, a primary sitemap is provided as configuration, and Prospector will parse it to find all of the sub-sitemaps.

Finally, notifier_matches is a table that stores the matches between a notifier and a URL. This allows us to keep track of which URLs have already been matched, and which ones still need to be processed. When a match has been found, the notifier_matches table is updated to reflect that, and "matches" for a keyword are no longer processed. This saves our SEO experts from a crowded inbox, and allows them to focus and act on new matches.

Connecting the pieces with Cloudflare Queues
Cloudflare Queues acts as the work queue for Prospector. When a new notifier is added, a new job is created for it and added to the queue. Behind the scenes, Queues will distribute the work across multiple Workers, allowing us to scale the application as needed. When a job is processed, Prospector will scrape the URL and check for matches. If a match is found, Prospector will send an email to the notifier's email address.

Using the Cron Triggers functionality in Workers, we can schedule the scraping process to run at a regular interval - by default, once a day. This allows us to keep our data up-to-date, and ensures that we're always notified of any changes to our website. It also allows the end-user to configure when they receive emails in case they want to receive them more or less frequently, or at the beginning of their workday.

The Module Workers syntax for Workers makes accessing the application bindings - the constants available in the application for querying D1, Queues, and other services - incredibly easy. src/index.ts, the entrypoint for the application, looks like this:

import { DBUrl, Env } from './types'

import {
  handleQueuedUrl,
  scheduled,
} from './functions';

import h from './api'

export default {
  async fetch(
	request: Request,
	env: Env,
	ctx: ExecutionContext
  ): Promise<Response> {
	return h.fetch(request, env, ctx)
  },

  async queue(
	batch: MessageBatch<Error>,
	env: Env
  ): Promise<void> {
	for (const message of batch.messages) {
  	const url: DBUrl = JSON.parse(message.body)
  	await handleQueuedUrl(url, env.DB)
	}
  },

  async scheduled(
	env: Env,
  ): Promise<void> {
	await scheduled({
  	authToken: env.AUTH_TOKEN,
  	db: env.DB,
  	queue: env.QUEUE,
  	sitemapUrl: env.SITEMAP_URL,
	})
  }
};

With this syntax, we can see where the various events incoming to the application - the fetch event, the queue event, and the scheduled event - are handled. The fetch event is the main entrypoint for the application, and is where we handle all of the API routes. The queue event is where we handle the work that's been added to the queue, and the scheduled event is where we handle the scheduled scraping process.

Central to the application, of course, is Workers - acting as the API gateway and coordinator. We've elected to use the popular open-source framework Hono, an Express-style API for Workers, in Prospector. With Hono, we can quickly map out a REST API in just a few lines of code. Here's an example of a few API routes and how they're defined with Hono:

const app = new Hono()

app.get("/", (context) => {
  return context.html(index)
})

app.post("/notifiers", async context => {
  try {
	const { keyword, email } = await context.req.parseBody()
	await context.env.DB.prepare(
  	"insert into notifiers (keyword, email) values (?, ?)"
	).bind(keyword, email).run()
	return context.redirect('/')
  } catch (err) {
	context.status(500)
	return context.text("Something went wrong")
  }
})

app.get('/sitemaps', async (context) => {
  const query = await context.env.DB.prepare(
	"select * from sitemaps"
  ).all();
  const sitemaps: Array<DBSitemap> = query.results
  return context.json(sitemaps)
})

Crucial to the development of Prospector are the improved TypeScript bindings for Workers. As announced in November of last year, TypeScript bindings for Workers are now automatically generated based on our open source runtime, workerd. This means that whenever we use the types provided from the @cloudflare/workers-types package in our application, we can be sure that the types are always up-to-date.

With these bindings, we can define the types for our environment variables, and use them in our application. Here's an example of the Env type, which defines the environment variables that we use in the application:

export interface Env {
  AUTH_TOKEN: string
  DB: D1Database
  QUEUE: Queue
  SITEMAP_URL: string
}

Notice the types of the DB and QUEUE bindings - D1Database and Queue, respectively. These types are automatically generated, complete with type signatures for each method inside of the D1 and Queue APIs. This means that we can be sure that we're using the correct methods, and that we're passing the correct arguments to them, directly from our text editor - without having to refer to the documentation.

How to use it

One of my favorite things about Workers is that deploying applications is quick and easy. Using `wrangler.toml` and some simple build scripts, we can deploy a fully-functional application in just a few minutes. Prospector is no different. With just a few commands, we can create the necessary D1 database and Queues instance, and deploy the application to our account.

First, you'll need to clone the repository from our cloudflare/templates repository:

git clone $URL

If you haven't installed wrangler yet, you can do so by running:

npm install @cloudflare/wrangler -g

With Wrangler installed, you can login to your account by running:

wrangler login

After you've done that, you'll need to create a new D1 database, as well as a Queues instance. You can do this by running the following commands:

wrangler d1 create $DATABASE_NAME
wrangler queues create $QUEUE_NAME

Configure your wrangler.toml with the appropriate bindings (see [the README](URL) for an example):

[[ d1_databases ]]
binding = "DB"
database_name = "keyword-tracker-db"
database_id = "ab4828aa-723b-4a77-a3f2-a2e6a21c4f87"
preview_database_id = "8a77a074-8631-48ca-ba41-a00d0206de32"
	
[[queues.producers]]
  queue = "queue"
  binding = "QUEUE"

[[queues.consumers]]
  queue = "queue"
  max_batch_size = 10
  max_batch_timeout = 30
  max_retries = 10
  dead_letter_queue = "queue-dlq"

Next, you can run the bin/migrate script to create the tables in your database:

bin/migrate

This will create all the needed tables in your database, both in development (locally) and in production. Note that you'll even see the creation of a honest-to-goodness .sqlite3 file in your project directory - this is the local development database, which you can connect to directly using the same SQLite CLI that you're used to:

$ sqlite3 .wrangler/state/d1/DB.sqlite3
sqlite> .tables notifier_matches  notifiers      sitemaps       urls

Finally, you can deploy the application to your account:

npm run deploy

With a deployed application, you can visit your Workers URL to see the user interface. From there, you can add new notifiers and URLs, and see the results of your scraping process. When a new keyword match is found, you’ll receive an email with the details of the match instantly:

Conclusion

For some time, there have been a great deal of applications that were hard to build on Workers without relational data or background task tooling. Now, with D1 and Queues, we can build applications that seamlessly integrate between real-time user interfaces, geographically distributed data, background processing, and more, all using the same developer ergonomics and low latency that Workers is known for.

D1 has been crucial for building this application. On larger sites, the number of URLs that need to be scraped can be quite large. If we were to use Workers KV, our key-value store, for storing this data, we would quickly struggle with how to model, retrieve, and update the data needed for this use-case. With D1, we can build relational data models and quickly query just the data we need for each queued processing task.

Using these tools, developers can build internal tools and applications for their companies that are more powerful and more scalable than ever before. With the integration of Cloudflare's Zero Trust suite, developers can make these applications secure by default, and deploy them to Cloudflare's global network. This allows developers to build applications that are fast, secure, and reliable, all without having to worry about the underlying infrastructure.

Prospector is a great example of how easy it is to build applications on Cloudflare Workers. With the recent addition of D1 and Queues, we've been able to build fully-functional applications that require real-time data and background processing in just a few hours. We're excited to share the open-source code for Prospector, and we'd love to hear your feedback on the project.

If you have any questions, feel free to reach out to us on Twitter at @cloudflaredev, or join us in the Cloudflare Workers Discord community, which recently hit 20k members and is a great place to ask questions and get help from other developers.

We protect entire corporate networks, help customers build Internet-scale applications efficiently, accelerate any website or Internet application, ward off DDoS attacks, keep hackers at bay, and can help you on your journey to Zero Trust.

Visit 1.1.1.1 from any device to get started with our free app that makes your Internet faster and safer.

To learn more about our mission to help build a better Internet, start here. If you're looking for a new career direction, check out our open positions.
DevelopersServerlessStorageCloudflare WorkersD1QueuesDeveloper Platform

Follow on X

Kristian Freeman|@kristianf_
Cloudflare|@cloudflare

Related posts

May 22, 2024 1:00 PM

AI Gateway is generally available: a unified interface for managing and scaling your generative AI workloads

AI Gateway is an AI ops platform that provides speed, reliability, and observability for your AI applications. With a single line of code, you can unlock powerful features including rate limiting, custom caching, real-time logs, and aggregated analytics across multiple providers...