Skip to content
All posts
etlairflowpysparkfastapi

Anyone Can Write a Spider

Scraping is step one. The value is in turning raw, messy crawl output into clean, monitored, model-ready data that lands where it's needed — on a schedule, reliably.

Plenty of people can write a spider. Far fewer can take what that spider produces and turn it into something a business actually consumes — clean, deduplicated, refreshed on a schedule, and delivered through an interface the consumer trusts. That gap, from raw crawl output to delivered data product, is where most of the real engineering lives.

I think of it as one pipeline with three responsibilities: crawl → transform → deliver. Here’s how I wire each stage so the whole thing runs unattended.

Stage 1 — crawl produces raw, not final

The crawler’s job is narrow on purpose: fetch and extract, then hand off. I resist the temptation to do heavy cleaning inside the spider — it couples extraction to business logic and makes both harder to change. The spider writes raw extracted items to a landing area (object storage or a raw table), tagged with a run ID and timestamp.

Keeping raw data immutable pays off constantly: when a transform has a bug, I re-run the transform against the stored raw output instead of re-crawling the target. Re-crawling is expensive and rude; re-transforming is free.

Stage 2 — transform at whatever scale the data demands

This is where raw becomes usable: type coercion, deduplication, normalisation, joins against reference data, feature derivation. For modest volumes, Pandas is plenty. Once a run is millions of rows, I move the transform to PySpark so it scales horizontally instead of melting one machine’s memory:

from pyspark.sql import functions as F

clean = (
    raw.dropDuplicates(["listing_id"])
       .withColumn("price", F.regexp_replace("price", r"[^0-9.]", "").cast("double"))
       .filter(F.col("price").isNotNull())
       .withColumn("scraped_date", F.to_date("scraped_at"))
)

The output of this stage is the thing people actually want: a clean, well-typed, deduplicated dataset — often shaped directly into a feature store or analytics table.

Stage 3 — deliver through a stable interface

Data nobody can reach isn’t a product. Depending on the consumer, delivery looks like a warehouse table, a file drop to S3, or — most often for me — a REST API that hides all the pipeline machinery behind a clean contract:

from fastapi import FastAPI, Query

app = FastAPI()

@app.get("/listings")
def listings(min_price: float = 0, limit: int = Query(100, le=1000)):
    return query_clean_listings(min_price=min_price, limit=limit)

The consumer sees a stable endpoint. They never know — or care — that behind it sits a distributed crawl, a Spark job, and a dedup pass.

The glue: orchestration

What turns three scripts into a pipeline is orchestration. I run the stages as a DAG in Apache Airflow, so dependencies, retries, logging, and scheduling are declarative rather than a pile of cron jobs and hope:

crawl >> transform >> publish >> notify

Airflow gives me the things that make a pipeline trustworthy: a failed transform retries with backoff, every run is logged and inspectable, and a stage that fails outright pages me instead of silently leaving stale data in place. Tie that to the monitoring from the self-healing pipelines side and the whole path becomes observable end to end.

Why owning the whole path matters

When one person owns crawl-through-delivery, things that fall through the cracks in a hand-off simply don’t. The schema the API exposes is designed with how the data is scraped in mind. A target change that breaks extraction surfaces as a coverage alert, not a confused downstream consumer two weeks later. And re-delivering corrected data is a transform re-run, not a fire drill.

That end-to-end ownership — messy public web in, clean model-ready data out, on a schedule, monitored — is the actual deliverable. The spider is just where it starts.