Scrapers Don't Crash — They Lie
Scrapers don't fail loudly — they fail silently, returning fewer or subtly wrong rows. Here's how I build pipelines that validate, repair, and alert instead of quietly rotting.
The dangerous thing about a broken scraper is that it usually doesn’t crash. The site changes a class name, a selector silently returns None, and your crawl keeps running — happily writing thousands of rows with an empty price field. Nobody notices until a downstream model or report looks wrong weeks later.
A production scraper has to assume the target will change under it. The goal isn’t to prevent breakage — it’s to catch it the moment it happens and, where possible, repair it without a human in the loop.
Validate every item before it’s stored
The first line of defence is an item pipeline that refuses to store garbage. I validate against a schema and route failures somewhere visible instead of silently dropping or persisting them:
from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem


class ValidationPipeline:
    REQUIRED = ("listing_id", "price", "address")

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        # Falsy values (None, "", 0) all count as missing.
        missing = [f for f in self.REQUIRED if not adapter.get(f)]
        if missing:
            # Count every failure so monitoring can watch the ratio.
            spider.crawler.stats.inc_value("validation/missing_fields")
            raise DropItem(f"missing {missing} on {adapter.get('url')}")
        # _normalize_price is a helper on this class, not shown here.
        adapter["price"] = self._normalize_price(adapter["price"])
        return item
The key move is incrementing a stat on every failure. Those counters are what monitoring watches — a crawl that suddenly drops 40% of its items is a louder signal than any log line.
Dedup and normalise on the way in
Most “data quality” problems are really consistency problems: the same listing scraped twice with slightly different formatting, prices as "$1,200" vs 1200.0, whitespace and casing drift. I normalise aggressively in the pipeline and dedup on a stable business key so re-runs upsert rather than pile up:
INSERT INTO listings (listing_id, price, address, scraped_at)
VALUES (%s, %s, %s, now())
ON CONFLICT (listing_id)
DO UPDATE SET price = EXCLUDED.price,
              address = EXCLUDED.address,
              scraped_at = EXCLUDED.scraped_at;
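For the normalisation half, here is a minimal sketch of the kind of helpers I mean. The function names are illustrative, not from the pipeline above, and the price parser assumes "." is the decimal separator and "," a thousands separator, which will not hold for every locale.

```python
from decimal import Decimal, InvalidOperation
import re


def normalize_price(raw):
    """Turn '$1,200', ' 1200.0 ' or 1200 into a Decimal, or None if unparseable."""
    if raw is None:
        return None
    # Keep digits, the decimal point and a minus sign; drop currency symbols,
    # thousands separators and whitespace. Assumes '.' is the decimal separator.
    cleaned = re.sub(r"[^\d.\-]", "", str(raw))
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        # Nothing numeric survived (e.g. "call us"), so report it as missing.
        return None


def normalize_address(raw):
    """Collapse whitespace runs and fold case so formatting drift compares equal."""
    return " ".join(str(raw).split()).casefold()
```

With these, "$1,200" and 1200.0 normalise to equal Decimal values, so the upsert sees one price rather than two spellings of it. Returning None for unparseable input keeps the decision about dropping the row in the validation step rather than hiding it inside the parser.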
Repair mid-run instead of failing the row
“Self-healing” is the part people skip. When a primary selector fails, the pipeline doesn’t have to give up — it can try a fallback path before declaring the row bad:
- Fallback selectors. If the main CSS path returns nothing, try a secondary one (a JSON-LD block, a data-* attribute, an embedded __NEXT_DATA__ payload). Sites often expose the same field three different ways.
- Targeted re-fetch. A transient empty response gets re-queued once with backoff before it counts as a real failure, which distinguishes “the site hiccuped” from “the site changed.”
- Quarantine, don’t discard. Rows that fail every path go to a dead-letter table with their raw HTML attached, so a fix is a re-parse, not a re-crawl.
This is what lets a weekly run deliver clean output with zero manual cleanup between runs — the pipeline absorbs the small breakages that would otherwise accumulate.
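The fallback-then-quarantine flow can be sketched as a chain of extractors tried in order. This is a simplified illustration, not the production code: the regex extractors stand in for real selectors (a spider would normally use response.css() or similar), and the function names, stat keys, and the plain dict and list used for stats and the dead-letter store are all assumptions made for the example.

```python
import json
import re


def from_css_like(html):
    # Stand-in for the primary selector; a real spider would use response.css().
    m = re.search(r'<span class="price">([^<]+)</span>', html)
    return m.group(1).strip() if m else None


def from_json_ld(html):
    # Secondary path: a schema.org-style JSON-LD block embedded in the page.
    m = re.search(r'<script type="application/ld\+json">(.*?)</script>', html, re.S)
    if not m:
        return None
    try:
        data = json.loads(m.group(1))
    except json.JSONDecodeError:
        return None
    offers = data.get("offers") if isinstance(data, dict) else None
    return offers.get("price") if isinstance(offers, dict) else None


def from_data_attr(html):
    # Third path: a data-* attribute carrying the same value.
    m = re.search(r'data-price="([^"]+)"', html)
    return m.group(1) if m else None


PRICE_EXTRACTORS = (from_css_like, from_json_ld, from_data_attr)


def extract_price(html, stats, dead_letters, url=None):
    """Try each extractor in order; quarantine the raw HTML if all of them fail."""
    for extractor in PRICE_EXTRACTORS:
        value = extractor(html)
        if value not in (None, ""):
            # Record which path succeeded, so a rise in fallback hits is visible.
            key = f"extract/price/{extractor.__name__}"
            stats[key] = stats.get(key, 0) + 1
            return value
    stats["extract/price/quarantined"] = stats.get("extract/price/quarantined", 0) + 1
    # Keep the raw page so a later fix is a re-parse, not a re-crawl.
    dead_letters.append({"url": url, "field": "price", "raw_html": html})
    return None
```

Counting hits per extractor is a deliberate choice here: a crawl where the primary path drops to zero and the JSON-LD path picks up everything still produces clean rows, but it tells you the primary selector needs attention before the fallback breaks too.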
Monitor coverage, not just errors
The metric that actually predicts data quality is coverage: did this run produce roughly what we expected? I use Spidermon to assert on the finished crawl and alert when reality drifts from expectation:
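from spidermon import monitors
from spidermon.contrib.scrapy.monitors import BaseStatMonitor


@monitors.name("Coverage")
class CoverageMonitor(BaseStatMonitor):
    stat_name = "item_scraped_count"
    assert_type = ">="

    # As far as I know, BaseStatMonitor reads its threshold from a
    # threshold_setting or a get_threshold() override, not a bare attribute.
    def get_threshold(self):
        return 15000  # expect at least this many per run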
Pair that with checks on the validation-failure ratio and the field-fill rate, wire the failures to Slack or email, and a broken selector pages you within minutes of the run — not weeks later when someone notices the numbers look off.
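The two extra checks are simple arithmetic over the crawl stats and the stored rows. Below is a rough sketch in plain Python, independent of Spidermon. The function names and the 5% and 95% thresholds are placeholders I picked for illustration, and the failure ratio assumes every validation drop incremented the validation/missing_fields counter from the pipeline earlier.

```python
def validation_failure_ratio(stats):
    """Share of items that reached the pipeline but were dropped by validation."""
    dropped = stats.get("validation/missing_fields", 0)
    stored = stats.get("item_scraped_count", 0)  # items that passed every pipeline
    total = dropped + stored
    return dropped / total if total else 0.0


def field_fill_rate(rows, field):
    """Fraction of rows where `field` is present and non-empty."""
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if row.get(field) not in (None, ""))
    return filled / len(rows)


def coverage_problems(stats, rows, max_failure_ratio=0.05, min_fill_rate=0.95,
                      fields=("listing_id", "price", "address")):
    """Return human-readable alerts; an empty list means the run looks healthy."""
    problems = []
    ratio = validation_failure_ratio(stats)
    if ratio > max_failure_ratio:
        problems.append(
            f"validation failure ratio {ratio:.1%} exceeds {max_failure_ratio:.0%}"
        )
    for field in fields:
        rate = field_fill_rate(rows, field)
        if rate < min_fill_rate:
            problems.append(
                f"fill rate for {field!r} is {rate:.1%}, below {min_fill_rate:.0%}"
            )
    return problems
```

Whatever comes back from coverage_problems is what gets sent to Slack or email. In practice the thresholds are worth deriving from the last few healthy runs rather than hard-coding.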
The mindset
Treat the scraper like any other production service that talks to an unreliable dependency — because that’s exactly what it is. Validate at the boundary, normalise relentlessly, repair what you can, quarantine what you can’t, and alert on coverage. Do that and a crawl that used to need babysitting becomes one you can trust to run unattended.