
Should Proof-of-Work be standardized for HTTP?

Published on 2025-05-24 code

A couple of days ago I tried to access the excellent Arch Wiki and was greeted with this text instead (emphasis mine):

You are seeing this because the administrator of this website has set up Anubis to protect the server against the scourge of AI companies aggressively scraping websites. This can and does cause downtime for the websites, which makes their resources inaccessible for everyone.

Anubis is a compromise. Anubis uses a Proof-of-Work scheme in the vein of Hashcash, a proposed proof-of-work scheme for reducing email spam. The idea is that at individual scales the additional load is ignorable, but at mass scraper levels it adds up and makes scraping much more expensive.

Please note that Anubis requires the use of modern JavaScript features that plugins like JShelter will disable. Please disable JShelter or other such plugins for this domain.

The Arch Wiki is not the only site that is affected by scraping. Drew DeVault also wrote a great article about it, explaining some of the issues sysadmins are currently facing.

Anubis seems to be emerging as the go-to tool to combat scraping. But it requires JavaScript, so it also blocks any legitimate attempt to request the page with anything but a modern web browser. It can be configured to let in specific User Agents, but then attackers could just bypass the protection by using those User Agent strings.

I believe the JavaScript issue could be fixed by implementing something like Anubis' Proof-of-Work scheme on the protocol level. But standardizing it would also be an endorsement of the concept as a whole. So let's first look into it: How does it work, how does it compare to other mitigations, and is it ultimately a good idea?

What is bad about AI companies scraping websites

Scraping on its own just means that programs extract information from HTML that was generated for humans instead of using dedicated APIs. I often end up using that technique myself when an API is not available.

There are actually a lot of bots that regularly request HTML that is intended for humans. For example, search engine crawlers like the Google bot regularly scan the whole web to update their index. However, in that case search engines and website owners have a mutual interest: allowing users to find the content. So they play nice with each other: The crawlers use a unique User-Agent header and voluntarily respect any restrictions that are defined in robots.txt.

What AI companies are doing, on the other hand, seems to be much more similar to DDoS attacks: Servers get flooded with requests with User-Agent headers that look like regular browsers. They also come from many different IP addresses so it is hard to distinguish them from organic traffic.

One issue that is sometimes mentioned is that AI companies only take, but give nothing back. Search engines cause a little bit of load, but they also send users to the page. AI companies, on the other hand, just use the content as training data and do not retain a link to the source.

I am not sure what I think about that. On the one hand, I think this issue is mostly caused by ad-based monetization, which is a scourge on its own. Spreading information, in whichever way people want to, is a good thing! On the other hand, I also don't like it when rich companies steal from open source communities. In the case of the Arch Wiki, the content is published under the GNU FDL, so scraping it for training AI models is actually illegal.

For me, the main issue with these attacks (let's call them what they are) is that they exhaust all resources to the point where servers cannot handle requests that come from real human users.

Mitigations

The first line of defense is performance optimization. Servers can handle many more requests if each request does not require a lot of resources. However, at some point this will no longer be sufficient and we need to start blocking requests. The fundamental issue then is how to distinguish good requests from bad ones. How can that even be defined?

CAPTCHAs define good requests as those that were initiated by humans, so they require that clients pass a Turing test. While this definition is useful in some cases, it is not useful for many other cases where we explicitly want to allow scraping.

Rate limiting defines good requests by their frequency. I find this to be a much better definition for most situations because it roughly translates to resource usage. However, rate limiting requires that we can identify which requests come from the same source. If attackers use different IP addresses and User Agents, it is hard to even realize that all those requests belong together.
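To illustrate, here is a minimal sketch of fixed-window rate limiting keyed by client IP. The window and limit values are arbitrary placeholders, and the IP is of course exactly the identifier that attackers rotate:

import time
from collections import defaultdict

WINDOW = 60   # seconds per counting window (placeholder)
LIMIT = 100   # allowed requests per client per window (placeholder)

counters = defaultdict(int)  # (client_ip, window_index) -> request count

def allow_request(client_ip: str) -> bool:
    # Count the request and check the budget for the current window.
    window_index = int(time.time()) // WINDOW
    counters[(client_ip, window_index)] += 1
    return counters[(client_ip, window_index)] <= LIMIT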

When the source cannot be identified like that, we can do active monitoring and constantly update our blocking rules based on request patterns. But there is no guarantee that we will actually find any patterns. It is also a huge amount of work.

The new idea that Proof-of-Work brings to the table is that good requests need to contribute some of their own resources. However, we do not actually share the work between client and server. Instead, the client just wastes some CPU time on a complex calculation to signal that it is willing to do its part. In a way, this is the cryptographic version of Bullshit Jobs. Proof-of-Work does not prevent scrapers from exhausting server resources, but it creates an economic incentive not to.

Proof-of-Work in Anubis

Anubis is deployed as a proxy in front of the actual application. When a client first makes a request, Anubis instead loads a page with some JavaScript that tries to find a string so that sha256(string + challenge) starts with difficulty zeroes. Once that string is found, it is sent back to the server. On success, Anubis stores the challenge and response in a cookie and then finally lets the user pass to the application.

The challenge is not random. It contains the IP address, current week, and a secret. This way, a new proof must be calculated for every device, week, and service.
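To make this concrete, here is a simplified sketch of both sides of the scheme as described above; the exact inputs, encoding, and ordering in Anubis differ, so treat this as an illustration rather than the actual implementation:

import hashlib
import itertools

def make_challenge(secret: str, ip: str, week: str) -> str:
    # Server side: derive a challenge that is stable per device, week, and service.
    return hashlib.sha256(f"{secret}|{ip}|{week}".encode()).hexdigest()

def solve(challenge: str, difficulty: int) -> str:
    # Client side: brute-force a string until the hash starts with `difficulty` hex zeroes.
    for counter in itertools.count():
        candidate = str(counter)
        digest = hashlib.sha256((candidate + challenge).encode()).hexdigest()
        if digest.startswith("0" * difficulty):
            return candidate

def verify(challenge: str, response: str, difficulty: int) -> bool:
    # Server side: checking the proof only costs a single hash.
    digest = hashlib.sha256((response + challenge).encode()).hexdigest()
    return digest.startswith("0" * difficulty)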

For further details, see the Anubis documentation.

Proof-of-Work in HTTP

This exact mechanism could be integrated into HTTP by adding a new authentication scheme. The server rejects the first request with a challenge:

HTTP/1.1 401 Unauthorized
WWW-Authenticate: Proof-Of-Work algorithm=SHA-256 difficulty=5 challenge=ABC

The client then solves the challenge and repeats the request with the proof attached:

Authorization: Proof-Of-Work algorithm=SHA-256 difficulty=5 challenge=ABC response=XYZ

A JavaScript/cookie fallback that works a lot like Anubis could be added for browsers that do not yet support the new scheme. Also, IP-based exceptions could be added for important clients like the Google bot until they add support.

Supporting this scheme on the protocol level would make it possible to implement support in clients that do not execute JavaScript, e.g. curl. It would also open up new use cases that do not necessarily involve web browsers, e.g. protecting resource-intensive API endpoints.
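As a sketch of what such a client could look like under this hypothetical scheme (the parameter syntax and the header parsing are assumptions, not an existing standard):

import hashlib
import itertools
import urllib.request
from urllib.error import HTTPError

def solve(challenge: str, difficulty: int) -> str:
    # Same brute-force loop as in the earlier sketch.
    for counter in itertools.count():
        candidate = str(counter)
        digest = hashlib.sha256((candidate + challenge).encode()).hexdigest()
        if digest.startswith("0" * difficulty):
            return candidate

def fetch(url: str) -> bytes:
    try:
        return urllib.request.urlopen(url).read()
    except HTTPError as err:
        if err.code != 401:
            raise
        # Parse the hypothetical header, e.g.:
        # WWW-Authenticate: Proof-Of-Work algorithm=SHA-256 difficulty=5 challenge=ABC
        params = dict(part.split("=", 1) for part in err.headers["WWW-Authenticate"].split()[1:])
        proof = solve(params["challenge"], int(params["difficulty"]))
        retry = urllib.request.Request(url, headers={
            "Authorization": "Proof-Of-Work algorithm={algorithm} difficulty={difficulty} "
                             "challenge={challenge} response={proof}".format(proof=proof, **params),
        })
        return urllib.request.urlopen(retry).read()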

Distribution of Work

Proof-of-Work only works as intended if the additional load is negligible for individual users, yet adds up to a significant cost at mass scraper levels.

But is that the case?

In the case of Anubis, I would say clearly no. The proof takes less than 2 seconds to compute and then stays valid for a whole week. I do not see how that could ever be considered significant load.

Why do people who deploy Anubis still see positive results? I guess this is mostly because they do something unconventional that scrapers have not yet adapted to. This is a completely valid mitigation in itself. But it ceases to work as soon as it becomes too prevalent, so standardizing it would be counter-productive. And it doesn't really require wasting CPU time either. Just setting a cookie would work just as well.

Let's look at a more meaningful approach: The server has to verify the proof on every request, so the client should have to calculate a proof on (nearly) every request, too. This could be achieved by including the exact URL in the challenge and reducing the validity to something like 5 minutes.
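A challenge along those lines could be derived roughly like this; apart from the URL binding and the 5-minute window described above, the construction is an assumption:

import hashlib
import time

def make_challenge(secret: str, ip: str, url: str, validity: int = 300) -> str:
    # Bind the challenge to the exact URL and a 5-minute time bucket,
    # so that a fresh proof is needed for (nearly) every request.
    bucket = int(time.time()) // validity
    return hashlib.sha256(f"{secret}|{ip}|{url}|{bucket}".encode()).hexdigest()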

For casual users, I would consider an increase in load time of ~20% acceptable. Let's say that is something like 200ms on average. The Arch Wiki has close to 30,000 pages, so downloading all of them would require clients to waste ~100 minutes of CPU time. While this is not nothing, I am also not convinced that it is enough of an obstacle to discourage scrapers.
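For reference, the back-of-the-envelope arithmetic, using the estimates from above:

pages = 30_000            # approximate number of Arch Wiki pages
seconds_per_proof = 0.2   # ~200 ms of extra CPU time per request
print(pages * seconds_per_proof / 60)  # ≈ 100 minutes of CPU time for a full crawl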

Raphael Michel comes to a similar conclusion when discussing scalpers: If you stand to make 200€ profit from a request, you do not care about a few cents in extra CPU time.

Also, this whole idea assumes that attackers even care about their resource usage. DDoS attacks are commonly executed via botnets, where attackers have taken over regular people's devices. In that case, attackers don't really care about resource use because they don't pay the bill.

Conclusion

So should the Proof-of-Work scheme be standardized? Performance optimizations and doing something unconventional will only get us so far. We need something better. And in order to make Proof-of-Work useful it needs to be standardized.

But does it actually work? I was genuinely excited about Anubis. I liked its premise:

The idea is that at individual scales the additional load is ignorable, but at mass scraper levels it adds up and makes scraping much more expensive.

But on closer inspection I am not really sure if that balance can be achieved.