
Should Proof-of-Work be standardized for HTTP?

Published on 2025-05-24 code

A couple of days ago I tried to access the excellent Arch Wiki and was greeted with this text instead (emphasis mine):

You are seeing this because the administrator of this website has set up Anubis to protect the server against the scourge of AI companies aggressively scraping websites. This can and does cause downtime for the websites, which makes their resources inaccessible for everyone.

Anubis is a compromise. Anubis uses a Proof-of-Work scheme in the vein of Hashcash, a proposed proof-of-work scheme for reducing email spam. The idea is that at individual scales the additional load is ignorable, but at mass scraper levels it adds up and makes scraping much more expensive.

Please note that Anubis requires the use of modern JavaScript features that plugins like JShelter will disable. Please disable JShelter or other such plugins for this domain.

The Arch Wiki is not the only site that is affected by scraping. Drew DeVault also wrote a great article about it, explaining some of the issues sysadmins are currently facing.

Anubis seems to be emerging as the go-to tool to combat scraping. But it requires JavaScript, so it also blocks any legitimate attempt to request the page with anything but a modern web browser. It can be configured to let in specific User Agents, but then attackers could just bypass the protection by using those User Agent strings.

I believe the JavaScript issue could be fixed by implementing something like Anubis' Proof-of-Work scheme on the protocol level. But standardizing it would also be an endorsement of the concept as a whole. So let's first look into it: How does it work, how does it compare to other mitigations, and is it ultimately a good idea?

What is bad about AI companies scraping websites

Scraping on its own just means that programs extract information from HTML that was generated for humans instead of using dedicated APIs. I often end up using that technique myself when an API is not available.

There are actually a lot of bots that regularly request HTML that is intended for humans. For example, search engine crawlers like the Google bot regularly scan the whole web to update their index. However, in that case search engines and website owners have a mutual interest: allowing users to find the content. So they play nice with each other: The crawlers use a unique User-Agent header and voluntarily respect any restrictions that are defined in robots.txt.

What AI companies are doing, on the other hand, seems to be much more similar to DDoS attacks: Servers get flooded with requests with User-Agent headers that look like regular browsers. They also come from many different IP addresses so it is hard to distinguish them from organic traffic.

One issue that is sometimes mentioned is that AI companies only take, but give nothing back. Search engines cause a little bit of load, but they also send users to the page. AI companies, on the other hand, just use the content as training data and do not retain a link to the source.

I am not sure what I think about that. On the one hand, I think this issue is mostly caused by ad-based monetization, which is a scourge on its own. Spreading information, in whichever way people want to, is a good thing! On the other hand, I also don't like it when rich companies steal from open source communities. In the case of the Arch Wiki, the content is published under the GNU FDL, so scraping it for training AI models is actually illegal.

For me, the main issue with these attacks (let's call them what they are) is that they exhaust all resources to the point where servers cannot handle requests that come from real human users.

Mitigations

The first line of defense is performance optimization. Servers can handle many more requests if each request does not require a lot of resources. However, at some point this will no longer be sufficient and we need to start blocking requests. The fundamental issue then is how to distinguish good requests from bad ones. How can that even be defined?

CAPTCHAs define good requests as those that were initiated by humans, so they require that clients pass a Turing test. While this definition is useful in some cases, it is not useful for many other cases where we explicitly want to allow scraping.

Rate limiting defines good requests by their frequency. I find this to be a much better definition for most situations because it roughly translates to resource usage. However, rate limiting requires that we can identify which requests come from the same source. If attackers use different IP addresses and User Agents, it is hard to even realize that all those requests belong together.
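To illustrate, here is a minimal sketch of fixed-window rate limiting keyed by client IP. The window and limit values are arbitrary placeholders, and the IP is of course exactly the identifier that attackers rotate:

import time
from collections import defaultdict

WINDOW = 60   # seconds per counting window (placeholder)
LIMIT = 100   # allowed requests per client per window (placeholder)

counters = defaultdict(int)  # (client_ip, window_index) -> request count

def allow_request(client_ip: str) -> bool:
    # Count the request and check the budget for the current window.
    window_index = int(time.time()) // WINDOW
    counters[(client_ip, window_index)] += 1
    return counters[(client_ip, window_index)] <= LIMIT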

When the source cannot be identified like that, we can do active monitoring and constantly update our blocking rules based on request patterns. But there is no guarantee that we will actually find any patterns. It is also a huge amount of work.

The new idea that Proof-of-Work brings to the table is that good requests need to contribute some of their own resources. However, we do not actually share the work between client and server. Instead, the client just wastes some CPU time on a complex calculation to signal that it is willing to do its part. In a way, this is the cryptographic version of Bullshit Jobs. Proof-of-Work does not prevent scrapers from exhausting server resources, but it creates an economic incentive not to.

Proof-of-Work in Anubis

Anubis is deployed as a proxy in front of the actual application. When a client first makes a request, Anubis instead loads a page with some JavaScript that tries to find a string so that sha256(string + challenge) starts with difficulty zeroes. Once that string is found, it is sent back to the server. On success, Anubis stores the challenge and response in a cookie and then finally lets the user pass to the application.

The challenge is not random. It contains the IP address, current week, and a secret. This way, a new proof must be calculated for every device, week, and service.
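To make this concrete, here is a simplified sketch of both sides of the scheme as described above; the exact inputs, encoding, and ordering in Anubis differ, so treat this as an illustration rather than the actual implementation:

import hashlib
import itertools

def make_challenge(secret: str, ip: str, week: str) -> str:
    # Server side: derive a challenge that is stable per device, week, and service.
    return hashlib.sha256(f"{secret}|{ip}|{week}".encode()).hexdigest()

def solve(challenge: str, difficulty: int) -> str:
    # Client side: brute-force a string until the hash starts with `difficulty` hex zeroes.
    for counter in itertools.count():
        candidate = str(counter)
        digest = hashlib.sha256((candidate + challenge).encode()).hexdigest()
        if digest.startswith("0" * difficulty):
            return candidate

def verify(challenge: str, response: str, difficulty: int) -> bool:
    # Server side: checking the proof only costs a single hash.
    digest = hashlib.sha256((response + challenge).encode()).hexdigest()
    return digest.startswith("0" * difficulty)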

For further details, see the Anubis documentation.

Proof-of-Work in HTTP

This exact mechanism could be integrated into HTTP by adding a new authentication scheme. The server rejects the first request with a challenge:

HTTP/1.1 401 Unauthorized
WWW-Authenticate: Proof-Of-Work algorithm=SHA-256 difficulty=5 challenge=ABC

The client then solves the challenge and repeats the request with the proof attached:

Authorization: Proof-Of-Work algorithm=SHA-256 difficulty=5 challenge=ABC response=XYZ

A JavaScript/cookie fallback that works a lot like Anubis could be added for browsers that do not yet support the new scheme. Also, IP-based exceptions could be added for important clients like the Google bot until they add support.

Supporting this scheme on the protocol level would make it possible to implement support in clients that do not execute JavaScript, e.g. curl. It would also open up new use cases that do not necessarily involve web browsers, e.g. protecting resource-intensive API endpoints.
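As a sketch of what such a client could look like under this hypothetical scheme (the parameter syntax and the header parsing are assumptions, not an existing standard):

import hashlib
import itertools
import urllib.request
from urllib.error import HTTPError

def solve(challenge: str, difficulty: int) -> str:
    # Same brute-force loop as in the earlier sketch.
    for counter in itertools.count():
        candidate = str(counter)
        digest = hashlib.sha256((candidate + challenge).encode()).hexdigest()
        if digest.startswith("0" * difficulty):
            return candidate

def fetch(url: str) -> bytes:
    try:
        return urllib.request.urlopen(url).read()
    except HTTPError as err:
        if err.code != 401:
            raise
        # Parse the hypothetical header, e.g.:
        # WWW-Authenticate: Proof-Of-Work algorithm=SHA-256 difficulty=5 challenge=ABC
        params = dict(part.split("=", 1) for part in err.headers["WWW-Authenticate"].split()[1:])
        proof = solve(params["challenge"], int(params["difficulty"]))
        retry = urllib.request.Request(url, headers={
            "Authorization": "Proof-Of-Work algorithm={algorithm} difficulty={difficulty} "
                             "challenge={challenge} response={proof}".format(proof=proof, **params),
        })
        return urllib.request.urlopen(retry).read()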

Distribution of Work

Proof-of-Work only works as intended if the additional load is negligible for individual users, yet adds up to a significant cost at mass scraper levels.

But is that the case?

In the case of Anubis, I would say clearly no. The proof takes less than 2 seconds to compute and then stays valid for a whole week. I do not see how that could ever be considered significant load.

Why do people who deploy Anubis still see positive results? I guess this is mostly because they do something unconventional that scrapers have not yet adapted to. This is a completely valid mitigation in itself. But it ceases to work as soon as it becomes too prevalent, so standardizing it would be counter-productive. And it doesn't really require wasting CPU time either. Just setting a cookie would work just as well.

Let's look at a more meaningful approach: The server has to verify the proof on every request, so the client should have to calculate a proof on (nearly) every request, too. This could be achieved by including the exact URL in the challenge and reducing the validity to something like 5 minutes.
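A challenge along those lines could be derived roughly like this; apart from the URL binding and the 5-minute window described above, the construction is an assumption:

import hashlib
import time

def make_challenge(secret: str, ip: str, url: str, validity: int = 300) -> str:
    # Bind the challenge to the exact URL and a 5-minute time bucket,
    # so that a fresh proof is needed for (nearly) every request.
    bucket = int(time.time()) // validity
    return hashlib.sha256(f"{secret}|{ip}|{url}|{bucket}".encode()).hexdigest()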

For casual users, I would consider an increase in load time of ~20% acceptable. Let's say that is something like 200ms on average. The Arch Wiki has close to 30,000 pages, so downloading all of them would require clients to waste ~100 minutes of CPU time. While this is not nothing, I am also not convinced that it is enough of an obstacle to discourage scrapers.
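For reference, the back-of-the-envelope arithmetic, using the estimates from above:

pages = 30_000            # approximate number of Arch Wiki pages
seconds_per_proof = 0.2   # ~200 ms of extra CPU time per request
print(pages * seconds_per_proof / 60)  # ≈ 100 minutes of CPU time for a full crawl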

Raphael Michel comes to a similar conclusion when discussing scalpers: If you stand to make 200€ profit from a request, you do not care about a few cents in extra CPU time.

Also, this whole idea assumes that attackers even care about their resource usage. DDoS attacks are commonly executed via botnets, where attackers have taken over regular people's devices. In that case, attackers don't really care about resource use because they don't pay the bill.

Conclusion

So should the Proof-of-Work scheme be standardized? Performance optimizations and doing something unconventional will only get us so far. We need something better. And in order to make Proof-of-Work useful it needs to be standardized.

But does it actually work? I was genuinely excited about Anubis. I liked its premise:

The idea is that at individual scales the additional load is ignorable, but at mass scraper levels it adds up and makes scraping much more expensive.

But on closer inspection I am not really sure if that balance can be achieved.