Web Scraping Specialist

I build systems that
harvest the web
at scale.

I build and operate large-scale distributed crawlers and data-extraction systems — turning the messy public web into clean, monitored, model-ready data.

keval@pipeline: ~
$ keval --whoami
{
  "role": "Web Scraping Specialist",
  "focus": ["distributed crawling", "anti-bot"],
  "stack": ["Scrapy", "Airflow", "PySpark"],
  "open_source": "scrapy/scrapy",
  "location": "Anand, India",
  "status": "available"
}
$ 

// 01 — about

I own the full path from scrape to ETL to delivery — and keep it running in production.

I'm a web-scraping specialist and Python developer with 7+ years building and running large-scale distributed crawlers for US companies — across real estate, finance, e-commerce, and sports. The work I'm known for is the hard part: extracting reliably from sites that actively try to stop me.

Scrapy is home turf. I build spiders, item pipelines, and middlewares, then scale them with Redis-backed distributed crawling on Scrapyd, Scrapy Cloud, or Gerapy — async and multithreaded architectures tuned to run hard for hours without falling over.

Anti-bot is where most scrapers die, and where I go deepest. I handle rotating residential and datacenter proxies, TLS/JA3 and browser fingerprinting, CAPTCHA solving, and stealth headless automation with Playwright and Selenium, then validate, deduplicate, and normalise the results into clean MongoDB and PostgreSQL data you can trust.

A working spider is only step one; the value is everything after it. I take the raw crawl through ETL to delivery: large-scale processing with PySpark, orchestration with Apache Airflow, and production monitoring and alerting with Spidermon, so the data arrives clean, on schedule, and monitored. I'm also an open-source contributor to Scrapy.

  • Distributed Crawling at Scale
  • Async & Multithreaded Python
  • Anti-Bot & Stealth
  • Data Quality & Validation
  • Pipeline Monitoring & Alerting
  • NoSQL (MongoDB) & Redis
  • Cloud at Scale (AWS)
  • ETL & PySpark

// 02 — experience

Where I've shipped

Jul 2023 — Feb 2026

Turing

Web Scraping Specialist & Data Engineer
Remote (USA) · Full-time contract
  • Built and operated large-scale distributed web crawlers — production Scrapy with Redis-backed request queues and shared state, tuning concurrency, AutoThrottle, and memory profiles for reliable long-running jobs.
  • Engineered anti-bot strategies at scale — rotating residential/datacenter proxies, TLS/JA3 and browser-fingerprint handling, UA/header rotation, and CAPTCHA solving (2Captcha, Anti-Captcha).
  • Built Scrapy item pipelines for data-quality validation, deduplication, and normalisation, with monitoring and alerting via Spidermon — crawl-health checks and alerts on validation failures and coverage drops.
  • Processed large datasets with PySpark for distributed transformations, alongside Pandas and AWS Glue ETL flows.
  • Orchestrated multi-stage pipelines with Apache Airflow (retry, logging, error handling) and exposed cleaned datasets through FastAPI / Django REST services.
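As a rough illustration of the concurrency, AutoThrottle, and memory tuning mentioned above, a Scrapy settings file might look something like the sketch below. It assumes the Redis-backed queue comes from the scrapy-redis package, which the text does not name, and every number and the Redis URL are placeholder values, not the configuration used in production.

```python
# settings.py (illustrative sketch; all values are assumptions)

# Redis-backed scheduling and request de-duplication,
# assuming the scrapy-redis package is installed.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True                 # keep the queue between runs
REDIS_URL = "redis://localhost:6379/0"   # hypothetical Redis instance

# Hard ceilings on parallel requests.
CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 8

# AutoThrottle adjusts the download delay from observed latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 30.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 4.0

# Memory guard for long-running jobs: warn first, then stop the crawl.
MEMUSAGE_ENABLED = True
MEMUSAGE_WARNING_MB = 1536
MEMUSAGE_LIMIT_MB = 2048
```

The per-domain ceiling sits below the global one so a single slow site cannot take every download slot, and the memory warning threshold sits below the hard limit so there is time to react before the job is stopped.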

Mar 2021 — May 2023

Heirlift Estate

Web Scraping Specialist
Remote (USA) · Full-time contract
  • Designed and maintained Scrapy spiders to extract real-estate listings at scale — property attributes, agent details, price history, and images.
  • Handled paginated, JavaScript-rendered, and AJAX/session-protected pages using scrapy-playwright and Selenium with stealth configs; managed login flows and cookie persistence.
  • Engineered anti-bot bypass with rotating residential proxies, randomized request fingerprints, and CAPTCHA-solving middleware to sustain reliable extraction.
  • Built Scrapy item pipelines for validation, deduplication, and normalisation into PostgreSQL and MongoDB, and loaded cleaned data through an AWS Glue + S3 ETL flow feeding downstream analytics.
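A minimal sketch of a validation, deduplication, and normalisation step in the shape of a Scrapy item pipeline. The class name, the `address` and `price` fields, and the in-memory seen-set are hypothetical choices made for illustration; the actual pipelines and schema are not described in this page.

```python
import hashlib

try:
    from scrapy.exceptions import DropItem
except ImportError:  # fallback so the sketch runs without Scrapy installed
    class DropItem(Exception):
        """Stand-in for scrapy.exceptions.DropItem."""


class ValidateDedupPipeline:
    """Normalise, validate, then drop duplicates (field names are hypothetical)."""

    REQUIRED = ("address", "price")

    def __init__(self):
        # In-memory only; distributed workers would need a shared store.
        self.seen = set()

    def process_item(self, item, spider=None):
        record = dict(item)

        # Normalise: collapse whitespace, lowercase, strip currency formatting.
        address = " ".join(str(record.get("address") or "").split()).lower()
        raw_price = str(record.get("price") or "").replace("$", "").replace(",", "").strip()
        try:
            price = float(raw_price)
        except ValueError:
            price = None
        record["address"], record["price"] = address, price

        # Validate: every required field must be present after normalisation.
        missing = [f for f in self.REQUIRED if record[f] in ("", None)]
        if missing:
            raise DropItem(f"missing fields: {missing}")

        # Deduplicate on a hash of the normalised address.
        key = hashlib.sha1(address.encode("utf-8")).hexdigest()
        if key in self.seen:
            raise DropItem(f"duplicate listing: {address}")
        self.seen.add(key)
        return record
```

Normalising before the duplicate check means listings that differ only in casing or spacing collapse to the same key. In a Scrapy project a class like this would be enabled through the `ITEM_PIPELINES` setting.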

2019 — 2021

Upwork

Freelance Python Developer — Web Scraping
Remote · Top-Rated
  • Built production Scrapy spiders for international clients across e-commerce, B2B directories, and aggregator sites, with structured item pipelines and rotating proxy pools.
  • Achieved Top-Rated status through consistent delivery and client satisfaction across long-running engagements.
  • Exposed crawled datasets via FastAPI services and collaborated directly with clients in English to scope requirements and define data schemas.

// 03 — key projects

Pipelines I've engineered

Real Estate 01

County Property & Mortgage Data Pipeline

Harvested public mortgage and property records scattered across 40+ US government county portals into one cleaned, monitored ETL pipeline — with instant alerts the moment a run fails.

Market Intelligence 02

Self-Healing Broker-Intelligence Pipeline

Weekly Scrapy system that scrapes two registries, then validates, dedupes, and repairs bad rows mid-run, delivering broker profiles with zero manual cleanup between runs.

Finance 03

Reddit Sentiment Pipeline for ML Signals

Apache Airflow ETL that pulls Reddit posts from thousands of subreddits, scores them with NLP and embedding models, and delivers them ready for downstream prediction models.

Sports 04

ML-Ready Sports Data Feature Store

Scrapers across many sports sites feeding a pipeline that cleans, normalises, and vectorises match and player data into a model-ready feature store — auto-refreshed as new matches complete.

Live auto-refresh

// 04 — technical skills

The toolchain

Languages
  • Python (7+ yrs)
  • SQL
  • JavaScript
  • Bash
Web Scraping & Crawling
  • Scrapy
  • scrapy-playwright
  • Spidermon
  • BeautifulSoup
  • lxml / parsel
  • Selenium
  • Playwright
  • Requests
Distributed & Async
  • Redis-backed crawling
  • asyncio
  • Multithreading
  • AutoThrottle tuning
  • Long-running job scaling
Anti-Bot & Stealth
  • Rotating proxies
  • CAPTCHA solving
  • TLS/JA3 handling
  • Browser fingerprinting
  • Session / cookie mgmt
  • UA/header rotation
Data & Processing
  • ETL design
  • Data-quality validation
  • Pandas
  • NumPy
  • PySpark
  • Distributed computing
  • JSON / XML / CSV
Databases
  • MongoDB
  • Redis
  • PostgreSQL
  • MySQL
Orchestration & Deploy
  • Apache Airflow
  • Docker
  • Kubernetes
  • Scrapyd
  • Scrapy Cloud
  • Gerapy
  • Git
  • cron
Cloud & APIs
  • AWS (EC2, S3, Glue, Lambda)
  • Azure Synapse
  • FastAPI
  • Django REST
  • GraphQL
  • WebSockets
  • Linux/Unix
ML & MLOps
  • Scikit-learn
  • TensorFlow
  • MLflow
  • Model deployment via FastAPI

// 05 — open source

Contributor to scrapy/scrapy

Documentation contribution reviewed & merged into the official scrapy/scrapy repository. Public scraping projects & utilities live on GitHub.

Education: Bachelor of Science (B.Sc.), Chemistry · Gujarat, India

// 06 — contact

Got data that needs harvesting?

I'm available for web-scraping & data-engineering work. Let's scope your pipeline.