I build and operate large-scale distributed crawlers and data-extraction systems — turning the messy public web into clean, monitored, model-ready data.
$ keval --whoami
{
  "role": "Web Scraping Specialist",
  "focus": ["distributed crawling", "anti-bot"],
  "stack": ["Scrapy", "Airflow", "PySpark"],
  "open_source": "scrapy/scrapy",
  "location": "Anand, India",
  "status": "available"
}
// 01 — about
I'm a web-scraping specialist and Python developer with 7+ years building and running large-scale distributed crawlers for US companies — across real estate, finance, e-commerce, and sports. The work I'm known for is the hard part: extracting reliably from sites that actively try to stop me.
Scrapy is home turf. I build spiders, item pipelines, and middlewares, then scale them with Redis-backed distributed crawling on Scrapyd, Scrapy Cloud, or Gerapy — async and multithreaded architectures tuned to run hard for hours without falling over.
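A minimal sketch of what a Redis-backed setup can look like in a project's settings.py, assuming the scrapy-redis extension supplies the shared scheduler and duplicate filter. The Redis URL and the concurrency numbers are placeholders, not values from a real deployment:

```python
# settings.py (fragment) -- assumes the scrapy-redis package is installed.

# Every worker pulls requests from one shared queue held in Redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Request fingerprints are stored in Redis too, so no two workers
# fetch the same URL.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue and the seen-set between runs so a crawl can resume.
SCHEDULER_PERSIST = True

# Placeholder connection string; point it at the real Redis instance.
REDIS_URL = "redis://localhost:6379/0"

# Illustrative throughput limits; tune per target site.
CONCURRENT_REQUESTS = 32
DOWNLOAD_DELAY = 0.25
```

With this in place, each Scrapyd or Gerapy worker can run the same spider and share a single crawl frontier.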
Anti-bot is where most scrapers die, and where I go deepest. I handle all of it: rotating residential and datacenter proxies, TLS/JA3 and browser fingerprinting, CAPTCHA solving, and stealth headless automation with Playwright and Selenium. I then validate, deduplicate, and normalise the results into clean MongoDB and PostgreSQL data you can trust.
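The validate, deduplicate, and normalise step can be sketched in plain Python. This is an illustration only: the field names (`name`, `license_id`, `email`) and the choice of dedup key are hypothetical, not the schema of any real project.

```python
import hashlib
import re

# Hypothetical schema: fields a record must have to be kept.
REQUIRED_FIELDS = ("name", "license_id")


def normalise(record: dict) -> dict:
    """Collapse whitespace in string fields and lowercase the email."""
    out = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = re.sub(r"\s+", " ", value).strip()
        out[key] = value
    if out.get("email"):
        out["email"] = out["email"].lower()
    return out


def is_valid(record: dict) -> bool:
    """A record is valid when every required field is present and non-empty."""
    return all(record.get(field) for field in REQUIRED_FIELDS)


def fingerprint(record: dict) -> str:
    """Case-insensitive hash of the required fields, used as the dedup key."""
    key = "|".join(str(record.get(f, "")).lower() for f in REQUIRED_FIELDS)
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


def clean(records: list) -> list:
    """Normalise, drop invalid rows, and keep the first copy of each record."""
    seen = set()
    kept = []
    for raw in records:
        record = normalise(raw)
        if not is_valid(record):
            continue
        fp = fingerprint(record)
        if fp in seen:
            continue
        seen.add(fp)
        kept.append(record)
    return kept
```

Inside a Scrapy project the same logic would typically live in an item pipeline that raises `DropItem` for invalid or duplicate rows, before anything is written to MongoDB or PostgreSQL.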
A working spider is only step one — the value is everything after it. I take the raw crawl through ETL to delivery: large-scale processing with PySpark, orchestration with Apache Airflow, and production monitoring and alerting with Spidermon, so the data arrives clean, on schedule, and watched. I'm also an open-source contributor to Scrapy.
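As a rough illustration of the monitoring idea, here is a framework-free sketch of a post-run check. It does not use Spidermon's API. The stat keys follow Scrapy's standard stats names, while the thresholds and the `send` callback are assumptions made for the example.

```python
def check_run(stats: dict, min_items: int = 100, max_errors: int = 0) -> list:
    """Return a list of human-readable problems found in a crawl's stats."""
    problems = []
    reason = stats.get("finish_reason")
    if reason != "finished":
        problems.append(f"run ended with finish_reason={reason!r}")
    items = stats.get("item_scraped_count", 0)
    if items < min_items:
        problems.append(f"only {items} items scraped (expected at least {min_items})")
    errors = stats.get("log_count/ERROR", 0)
    if errors > max_errors:
        problems.append(f"{errors} errors logged (allowed {max_errors})")
    return problems


def alert_if_failed(stats: dict, send) -> bool:
    """Call send(message) when the run has problems; return True if it alerted."""
    problems = check_run(stats)
    if problems:
        send("Crawl failed: " + "; ".join(problems))
    return bool(problems)
```

In production, `send` would post to Slack, email, or a pager. Passing `print` is enough to try it locally.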
// 02 — experience
// 03 — key projects
County Property & Mortgage Data Pipeline: harvested public mortgage and property records from 40+ US county government portals into one cleaned, monitored ETL pipeline, with an alert sent the moment a run fails.
Weekly Scrapy system that scrapes two registries and validates, dedupes, and repairs bad rows mid-run, delivering broker profiles with zero manual cleanup between runs.
Apache Airflow ETL that pulls Reddit posts from thousands of subreddits, scores them with NLP and embedding models, and delivers them ready for downstream prediction models.
Scrapers across many sports sites feeding a pipeline that cleans, normalises, and vectorises match and player data into a model-ready feature store — auto-refreshed as new matches complete.
// 04 — technical skills
// 05 — open source
Documentation contribution reviewed & merged into the official scrapy/scrapy repository. Public scraping projects & utilities live on GitHub.
// 06 — contact
I'm available for web-scraping & data-engineering work. Let's scope your pipeline.
$ kevalsakhiya@gmail.com