Methodology

Every review on this site is based on a test I actually ran. This page describes what those tests look like, what the rating numbers mean, and where my method has gaps.

The setup

Tests run from a normal home environment unless I say otherwise.

Location. Hamburg, Germany. I will say if I test from elsewhere.
Networks. A residential connection by default. For services that behave differently on non-residential IPs, I add runs through a consumer VPN (Mullvad), a datacentre IP (a small VPS), and a mobile carrier connection (tethered).
Browsers. Firefox and Chromium as a baseline. I add Safari when a service depends on browser quirks.
Devices. A current Mac for desktop. A mid-range Android phone for mobile. Headless browsers do not count towards scoring; they are covered separately as their own test class.
Account tier. I use the free tier first. Paid tiers are tested when the free tier is too limited to draw a useful conclusion. I always say which tier I used.
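
As a rough illustration of how the setup above turns into a list of runs, here is a minimal Python sketch. The function name `planned_runs` and its two flags are made up for this example, and it assumes every network is paired with every browser, which may be more runs than a given review actually needs.

```python
from itertools import product

# Values taken from the setup list above.
BASELINE_BROWSERS = ["Firefox", "Chromium"]
DEFAULT_NETWORKS = ["residential"]
EXTRA_NETWORKS = [
    "consumer VPN (Mullvad)",
    "datacentre IP (small VPS)",
    "mobile carrier (tethered)",
]


def planned_runs(ip_sensitive: bool, browser_quirks: bool) -> list[tuple[str, str]]:
    """Return (network, browser) pairs for one service.

    Hypothetical helper: the flags stand in for the judgment calls
    "behaves differently by IP type" and "depends on browser quirks".
    """
    networks = DEFAULT_NETWORKS + (EXTRA_NETWORKS if ip_sensitive else [])
    browsers = BASELINE_BROWSERS + (["Safari"] if browser_quirks else [])
    # Assumption: a full cross product of networks and browsers.
    return list(product(networks, browsers))


if __name__ == "__main__":
    for network, browser in planned_runs(ip_sensitive=True, browser_quirks=False):
        print(f"{browser} via {network}")
```

With both flags off this gives the two baseline runs (Firefox and Chromium on the residential connection).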

What I record

For each test:

The date I ran it.
The browser, OS, and IP type.
What I tried to do (sign up, post a form, integrate a widget, send a request, etc.).
What happened, including the things that did not work and the time I spent fighting them.
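
The fields above could be captured in a small record type. The following Python sketch is one possible shape, not the format I actually store notes in; the class name `TestRecord` and all field names are assumptions chosen for the example.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class TestRecord:
    """One test run. All names here are illustrative, not a fixed schema."""

    run_date: date            # the date the test was run
    browser: str              # e.g. "Firefox"
    os: str                   # e.g. "macOS"
    ip_type: str              # e.g. "residential", "vpn", "datacentre", "mobile"
    task: str                 # what was attempted (sign up, post a form, ...)
    outcome: str              # what happened
    failures: list[str] = field(default_factory=list)  # things that did not work
    minutes_fighting: float = 0.0                       # time spent on those failures

    def summary(self) -> str:
        """One-line description suitable for a review footnote."""
        return (
            f"{self.run_date.isoformat()} | {self.browser} on {self.os} "
            f"({self.ip_type}) | {self.task}: {self.outcome}"
        )


if __name__ == "__main__":
    record = TestRecord(
        run_date=date(2024, 1, 15),  # example values only, not a real test
        browser="Firefox",
        os="macOS",
        ip_type="residential",
        task="sign up",
        outcome="completed",
    )
    print(record.summary())
```

Keeping failures and time lost as their own fields makes it harder to quietly drop them when writing up the result.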

Screenshots are taken at the time of the test. If I update an old review, the new screenshots get a new date in the caption.

What the rating means

The number at the bottom of a review is a 1–10 score. It is not a star rating in disguise. It is a summary of how I would describe the service to a friend who asked. Roughly:

9–10 — I would recommend this without hesitation for the use case the post is about.
7–8 — I would recommend this, with caveats noted in the post.
5–6 — There are real reasons to pick it and real reasons not to.
3–4 — I would not pick it for the use case the post is about.
1–2 — Avoid.
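
For anyone consuming the scores programmatically, the bands above map to a simple lookup. This is a sketch in Python; the function name `describe_score` is invented for the example, and it assumes scores are whole numbers from 1 to 10.

```python
def describe_score(score: int) -> str:
    """Map a 1-10 review score to the band wording used on this page."""
    if not 1 <= score <= 10:
        raise ValueError(f"score must be between 1 and 10, got {score}")
    if score >= 9:
        return "Recommend without hesitation for this use case."
    if score >= 7:
        return "Recommend, with caveats noted in the post."
    if score >= 5:
        return "Real reasons to pick it and real reasons not to."
    if score >= 3:
        return "Would not pick it for this use case."
    return "Avoid."


if __name__ == "__main__":
    for s in (10, 7, 5, 3, 1):
        print(s, describe_score(s))
```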

The score is one number summarising a few axes. The post itself is the real review; the number is the headline.

The axes (provisional)

These are the things I weigh. I will tighten the exact weighting as more reviews go up.

Privacy. What data the service collects, where it goes, and whether it is honest about it.
Accessibility. Keyboard support, screen reader support, alternatives for users who fail the primary challenge.
Bypass resistance. How easily a CAPTCHA-solving service or a competent script can get past it. Where I can measure this with public solvers, I do.
Integration friction. How long it took to get a working test page up, how clear the documentation was, how often I had to read the source.
Pricing honesty. Whether the public pricing matches what you actually pay, whether there are surprise tiers, whether the free tier is real.

Biases I know I have

I prefer privacy-respecting services. I try to score this fairly, but my preference shows up in which services I cover first.
I prefer self-hostable software. Same caveat.
I am one person. I have one network, one set of devices, one set of accounts. Results from a corporate IP block or a different country will differ. When I can, I run extra tests; when I cannot, I say so.
I am writing in English about services that mostly market in English. Coverage of Chinese, Russian, and Indic-market services will be weaker until I find ways to test them properly.

If any of this is wrong or unclear, tell me.