// blog
Notes from the crawl
Field notes and war stories on running scrapers in production — scale, anti-bot, data quality, monitoring, and AI.
-
AI Is a Fallback, Not a Scraper
Using an LLM to scrape every page is slow, expensive, and quietly unreliable. Used as a self-healing fallback that repairs broken selectors and caches the fix, it's genuinely useful. Here's the line between the two.
-
The Browser Was Doing Math. So I Did It Myself.
One field on a real-estate site forced an expensive browser render per property. It turned out the browser wasn't fetching that number — it was calculating it. So I reverse-engineered the math and cut a day-long crawl to four hours.
-
My Proxy Was Leaking. SOCKS5 Plugged It.
I was rendering pages through residential proxies and still got banned — by my server's own IP. The browser was leaking requests around the HTTP proxy. SOCKS5 is what finally plugged the hole.
-
Don't Take the Bait
Not all anti-bot defence tries to block you — some invites you in, then wastes, flags, or poisons your crawl. A field guide to the traps sites set for spiders, and how not to walk into them.
-
The Website Wasn't Lying. I Was.
A real-estate crawl that returned data not matching my own filters, a day spent convinced the site was messing with me, and the one-line Scrapy fix once I admitted the bug was mine. First in the "Mistakes from the field" series.
-
Spider vs 403
Every spider hates 403, and 403 hates every spider. A round-by-round deep dive into the protocol-level tells — headers, header order, TLS, HTTP/2 frames, and session — that decide who wins.
-
No News Isn't Good News
Monitors only fire if the spider runs. The scariest failure is the one where nothing runs, nothing fires, and the silence reads as success. Here's how to alert on absence.
-
One Run Tells You Nothing
A single crawl's stats look fine in isolation — the signal is in the trend across runs. Why I persist every stat from every scheduled spider, and what that history unlocks.
-
Don't Trust a Spider You Can't Monitor
A scheduled spider with no monitors is a liability waiting to happen. Here's the full checklist of what every spider monitor should actually assert — and how to wire it up with Spidermon.
-
Anyone Can Write a Spider
Scraping is step one. The value is in turning raw, messy crawl output into clean, monitored, model-ready data that lands where it's needed — on a schedule, reliably.
-
Scrapers Don't Crash — They Lie
Scrapers don't fail loudly — they fail silently, returning fewer or subtly wrong rows. Here's how I build pipelines that validate, repair, and alert instead of quietly rotting.
-
Why Your Scraper Keeps Getting Blocked
Modern bot detection isn't one wall — it's layers, from IP reputation to TLS fingerprints to behaviour. Here's how I think about each layer and stay under the radar.
-
One Spider Isn't Enough
When one spider stops being enough — how Redis-backed queues turn a single crawler into a fleet, and the knobs that actually matter once you do.