// blog

Notes from the crawl

Field notes and war stories on running scrapers in production — scale, anti-bot, data quality, monitoring, and AI.

Jun 16, 2026 ai llm self-healing

AI Is a Fallback, Not a Scraper

Using an LLM to scrape every page is slow, expensive, and quietly unreliable. Used as a self-healing fallback that repairs broken selectors and caches the fix, it's genuinely useful. Here's the line between the two.

Apr 22, 2026 reverse-engineering performance javascript

The Browser Was Doing Math. So I Did It Myself.

One field on a real-estate site forced an expensive browser render per property. It turned out the browser wasn't fetching that number — it was calculating it. So I reverse-engineered the math and cut a day-long crawl to four hours.

Feb 11, 2026 proxies playwright anti-bot

My Proxy Was Leaking. SOCKS5 Plugged It.

I was rendering pages through residential proxies and still got banned — by my server's own IP. The browser was leaking requests around the HTTP proxy. SOCKS5 is what finally plugged the hole.

Nov 19, 2025 anti-bot honeypots crawling

Don't Take the Bait

Not all anti-bot defence tries to block you — some invites you in, then wastes, flags, or poisons your crawl. A field guide to the traps sites set for spiders, and how not to walk into them.

Sep 9, 2025 scrapy sessions mistakes-from-the-field

The Website Wasn't Lying. I Was.

A real-estate crawl that returned data not matching my own filters, a day spent convinced the site was messing with me, and the one-line Scrapy fix once I admitted the bug was mine. First in a Mistakes from the field series.

Jun 10, 2025 anti-bot http2 fingerprinting

Spider vs 403

Every spider hates 403, and 403 hates every spider. A round-by-round deep dive into the protocol-level tells — headers, header order, TLS, HTTP/2 frames, and session — that decide who wins.

Mar 18, 2025 monitoring alerting reliability

No News Isn't Good News

Monitors only fire if the spider runs. The scariest failure is the one where nothing runs, nothing fires, and the silence reads as success. Here's how to alert on absence.

Jan 22, 2025 scrapy monitoring observability

One Run Tells You Nothing

A single crawl's stats look fine in isolation — the signal is in the trend across runs. Why I persist every stat from every scheduled spider, and what that history unlocks.

Nov 5, 2024 spidermon monitoring data-quality

Don't Trust a Spider You Can't Monitor

A scheduled spider with no monitors is a liability waiting to happen. Here's the full checklist of what every spider monitor should actually assert — and how to wire it up with Spidermon.

Aug 14, 2024 etl airflow pyspark

Anyone Can Write a Spider

Scraping is step one. The value is in turning raw, messy crawl output into clean, monitored, model-ready data that lands where it's needed — on a schedule, reliably.

May 21, 2024 data-quality spidermon etl

Scrapers Don't Crash — They Lie

Scrapers don't fail loudly — they fail silently, returning fewer or subtly wrong rows. Here's how I build pipelines that validate, repair, and alert instead of quietly rotting.

Mar 12, 2024 anti-bot proxies fingerprinting

Why Your Scraper Keeps Getting Blocked

Modern bot detection isn't one wall — it's layers, from IP reputation to TLS fingerprints to behaviour. Here's how I think about each layer and stay under the radar.

Jan 18, 2024 scrapy distributed redis

One Spider Isn't Enough

When one spider stops being enough — how Redis-backed queues turn a single crawler into a fleet, and the knobs that actually matter once you do.
