Web Scraping Specialist

I build systems that harvest the web at scale.

I build and operate large-scale distributed crawlers and data-extraction systems — turning the messy public web into clean, monitored, model-ready data.

keval@pipeline: ~

$ keval --whoami
{
  "role": "Web Scraping Specialist",
  "focus": ["distributed crawling", "anti-bot"],
  "stack": ["Scrapy", "Airflow", "PySpark"],
  "open_source": "scrapy/scrapy",
  "location": "Anand, India",
  "status": "available"
}
$

// 01 — about

I own the full path from scrape to ETL to delivery — and keep it running in production.

I'm a web-scraping specialist and Python developer with 7+ years building and running large-scale distributed crawlers for US companies — across real estate, finance, e-commerce, and sports. The work I'm known for is the hard part: extracting reliably from sites that actively try to stop me.

Scrapy is home turf. I build spiders, item pipelines, and middlewares, then scale them with Redis-backed distributed crawling on Scrapyd, Scrapy Cloud, or Gerapy — async and multithreaded architectures tuned to run hard for hours without falling over.

Anti-bot is where most scrapers die, and it is where I go deepest. I handle rotating residential and datacenter proxies, TLS/JA3 and browser fingerprinting, CAPTCHA solving, and stealth headless automation with Playwright and Selenium, then validate, deduplicate, and normalise the results into clean MongoDB and PostgreSQL data you can trust.

A working spider is only step one — the value is everything after it. I take the raw crawl through ETL to delivery: large-scale processing with PySpark, orchestration with Apache Airflow, and production monitoring and alerting with Spidermon, so the data arrives clean, on schedule, and watched. I'm also an open-source contributor to Scrapy.

Distributed Crawling at Scale
Async & Multithreaded Python
Anti-Bot & Stealth
Data Quality & Validation
Pipeline Monitoring & Alerting
NoSQL (MongoDB) & Redis
Cloud at Scale (AWS)
ETL & PySpark

// 02 — experience

Where I've shipped

Jul 2023 — Feb 2026

Turing

Web Scraping Specialist & Data Engineer

Remote (USA) · Full-time contract

▹Built and operated large-scale distributed web crawlers — production Scrapy with Redis-backed request queues and shared state, tuning concurrency, AutoThrottle, and memory profiles for reliable long-running jobs.
▹Engineered anti-bot strategies at scale — rotating residential/datacenter proxies, TLS/JA3 and browser-fingerprint handling, UA/header rotation, and CAPTCHA solving (2Captcha, Anti-Captcha).
▹Built Scrapy item pipelines for data-quality validation, deduplication, and normalisation, with monitoring and alerting via Spidermon — crawl-health checks and alerts on validation failures and coverage drops.
▹Processed large-scale datasets with PySpark and distributed computing for big-data transformations, alongside Pandas / AWS Glue ETL flows.
▹Orchestrated multi-stage pipelines with Apache Airflow (retry, logging, error handling) and exposed cleaned datasets through FastAPI / Django REST services.
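
As a rough illustration of the concurrency, AutoThrottle, and memory tuning described above, a settings block for a long-running crawl might look like the sketch below. The setting names are standard Scrapy options; the values are placeholders rather than the production numbers, and the scheduler and dupefilter lines assume the scrapy-redis package is what provides the Redis-backed queue, which is an assumption made for this example.

```python
# Illustrative Scrapy settings for a long-running, Redis-backed crawl.
# All values are placeholders, not a real production configuration.
CRAWL_SETTINGS = {
    # Upper bounds on parallelism; AutoThrottle adjusts delays beneath them.
    "CONCURRENT_REQUESTS": 32,
    "CONCURRENT_REQUESTS_PER_DOMAIN": 8,
    # AutoThrottle adapts the download delay to the latency it observes.
    "AUTOTHROTTLE_ENABLED": True,
    "AUTOTHROTTLE_START_DELAY": 1.0,
    "AUTOTHROTTLE_MAX_DELAY": 30.0,
    "AUTOTHROTTLE_TARGET_CONCURRENCY": 4.0,
    # Memory guard: Scrapy shuts the crawl down if the process exceeds the limit.
    "MEMUSAGE_ENABLED": True,
    "MEMUSAGE_LIMIT_MB": 2048,
    # Assumption: scrapy-redis supplies the shared request queue and dupefilter.
    "SCHEDULER": "scrapy_redis.scheduler.Scheduler",
    "DUPEFILTER_CLASS": "scrapy_redis.dupefilter.RFPDupeFilter",
    "SCHEDULER_PERSIST": True,  # keep the queue between runs
    "REDIS_URL": "redis://localhost:6379/0",  # placeholder address
}
```

A dictionary like this can be assigned to a spider's `custom_settings` attribute or merged into the project's `settings.py`.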

Mar 2021 — May 2023

Heirlift Estate

Web Scraping Specialist

Remote (USA) · Full-time contract

▹Designed and maintained Scrapy spiders to extract real-estate listings at scale — property attributes, agent details, price history, and images.
▹Handled paginated, JavaScript-rendered, and AJAX/session-protected pages using scrapy-playwright and Selenium with stealth configs; managed login flows and cookie persistence.
▹Engineered anti-bot bypass with rotating residential proxies, randomized request fingerprints, and CAPTCHA-solving middleware to sustain reliable extraction.
▹Built Scrapy item pipelines for validation, deduplication, and normalisation into PostgreSQL and MongoDB, and loaded cleaned data through an AWS Glue + S3 ETL flow feeding downstream analytics.
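
The validation, deduplication, and normalisation steps in an item pipeline can be sketched roughly as follows. This is a minimal stand-alone version that mirrors the shape of a Scrapy pipeline's `process_item` method without importing Scrapy. The field names (`url`, `address`, `price`) and the `DropRecord` exception, a stand-in for `scrapy.exceptions.DropItem`, are hypothetical and chosen only for this example.

```python
import hashlib
import re


class DropRecord(Exception):
    """Stand-in for scrapy.exceptions.DropItem so the sketch runs without Scrapy."""


class CleanListingPipeline:
    """Validate, normalise, and deduplicate listing records (hypothetical fields)."""

    REQUIRED = ("url", "address", "price")

    def __init__(self):
        self.seen = set()  # fingerprints of records already emitted

    def process_item(self, item, spider=None):
        # Validation: reject records that lack any required field.
        missing = [field for field in self.REQUIRED if not item.get(field)]
        if missing:
            raise DropRecord(f"missing fields: {missing}")

        # Normalisation: collapse whitespace and strip currency formatting.
        address = re.sub(r"\s+", " ", item["address"]).strip()
        digits = re.sub(r"[^\d.]", "", str(item["price"]))
        try:
            price = float(digits)
        except ValueError:
            raise DropRecord(f"unparseable price: {item['price']!r}")
        clean = {**item, "address": address, "price": price}

        # Deduplication: fingerprint the case-folded, normalised address.
        key = hashlib.sha1(address.lower().encode("utf-8")).hexdigest()
        if key in self.seen:
            raise DropRecord(f"duplicate listing: {address}")
        self.seen.add(key)
        return clean
```

Normalising before fingerprinting means that two records differing only in spacing or letter case are treated as the same listing.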

2019 — 2021

Upwork

Freelance Python Developer — Web Scraping

Remote · Top-Rated

▹Built production Scrapy spiders for international clients across e-commerce, B2B directories, and aggregator sites, with structured item pipelines and rotating proxy pools.
▹Achieved Top-Rated status through consistent delivery and client satisfaction across long-running engagements.
▹Exposed crawled datasets via FastAPI services and collaborated directly with clients in English to scope requirements and define data schemas.
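
One simple way a rotating proxy pool can work is to pick at random among proxies that have not failed too often, and to retire a proxy after repeated failures. The sketch below illustrates that idea. The `ProxyPool` class, its method names, and the failure threshold are all hypothetical, not taken from any particular library or from the original projects.

```python
import random


class ProxyPool:
    """Pick proxies at random, retiring any that fail too many times in a row."""

    def __init__(self, proxies, max_failures=3, seed=None):
        self.failures = {proxy: 0 for proxy in proxies}  # consecutive failures
        self.max_failures = max_failures
        self.rng = random.Random(seed)

    def healthy(self):
        # Proxies still under the consecutive-failure threshold.
        return [p for p, n in self.failures.items() if n < self.max_failures]

    def pick(self):
        candidates = self.healthy()
        if not candidates:
            raise RuntimeError("no healthy proxies left")
        return self.rng.choice(candidates)

    def report(self, proxy, ok):
        # A success resets the counter; a failure increments it.
        self.failures[proxy] = 0 if ok else self.failures[proxy] + 1
```

In a Scrapy project, a downloader middleware could call `pick()` and assign the result to `request.meta["proxy"]`, then call `report()` once the response or error comes back.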

// 03 — key projects

Pipelines I've engineered

Real Estate 01

County Property & Mortgage Data Pipeline

Harvested public mortgage and property records scattered across 40+ US government county portals into one cleaned, monitored ETL pipeline — with instant alerts the moment a run fails.
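
A run-failure alert of this kind usually comes down to checking crawl statistics against thresholds when a job ends. The sketch below shows one possible shape for such a check in plain Python. The stat keys (`finish_reason`, `item_scraped_count`, `log_count/ERROR`) are ones Scrapy's stats collector records; the `check_run` function and its thresholds are hypothetical, and a real setup might express the same checks as Spidermon monitors instead.

```python
def check_run(stats, min_items=1, max_errors=0):
    """Return a list of human-readable problems found in a crawl's stats."""
    problems = []

    # A clean Scrapy run ends with finish_reason == "finished".
    reason = stats.get("finish_reason")
    if reason != "finished":
        problems.append(f"unexpected finish_reason: {reason!r}")

    # Coverage check: too few items suggests a block or a layout change.
    items = stats.get("item_scraped_count", 0)
    if items < min_items:
        problems.append(f"only {items} items scraped (expected >= {min_items})")

    # Error budget: count of ERROR-level log records during the run.
    errors = stats.get("log_count/ERROR", 0)
    if errors > max_errors:
        problems.append(f"{errors} errors logged (allowed {max_errors})")

    return problems
```

An empty list means the run passed; anything else can be forwarded to whichever alerting channel is in use.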

Market Intelligence 02

Self-Healing Broker-Intelligence Pipeline

A weekly Scrapy system that scrapes two registries, then validates, dedupes, and repairs bad rows mid-run, delivering broker profiles with zero manual cleanup between runs.
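
A validate, repair, and re-validate loop is one way to get this self-healing behaviour. The sketch below is a minimal illustration only: the `name` and `phone` fields, the 10-digit phone rule, and the quarantine list are assumptions made for the example, not details of the actual pipeline.

```python
import re


def validate(row):
    """Return the names of fields that fail validation (empty list if clean)."""
    errors = []
    if not (row.get("name") or "").strip():
        errors.append("name")
    if not re.fullmatch(r"\d{10}", row.get("phone") or ""):
        errors.append("phone")
    return errors


def repair(row):
    """Apply cheap, deterministic fixes to a failing row."""
    fixed = dict(row)
    # Collapse repeated whitespace in names.
    fixed["name"] = " ".join((row.get("name") or "").split())
    # Keep digits only, then drop a leading US country code if present.
    digits = re.sub(r"\D", "", row.get("phone") or "")
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    fixed["phone"] = digits
    return fixed


def process(rows):
    """Repair rows that fail validation; quarantine those still invalid."""
    clean, quarantined = [], []
    for row in rows:
        if validate(row):
            row = repair(row)
        (quarantined if validate(row) else clean).append(row)
    return clean, quarantined
```

Rows that still fail after repair are set aside rather than dropped, so they can be inspected without blocking the rest of the run.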

Finance 03

Reddit Sentiment Pipeline for ML Signals

Apache Airflow ETL pulling Reddit posts across thousands of subreddits, scored with NLP and embedding models — delivered ready for downstream prediction models.

Sports 04

ML-Ready Sports Data Feature Store

Scrapers across many sports sites feeding a pipeline that cleans, normalises, and vectorises match and player data into a model-ready feature store — auto-refreshed as new matches complete.

// 04 — technical skills

The toolchain

Languages

Python (7+ yrs)
SQL
JavaScript
Bash

Web Scraping & Crawling

Scrapy
scrapy-playwright
Spidermon
BeautifulSoup
lxml / parsel
Selenium
Playwright
Requests

Distributed & Async

Redis-backed crawling
asyncio
Multithreading
AutoThrottle tuning
Long-running job scaling

Anti-Bot & Stealth

Rotating proxies
CAPTCHA solving
TLS/JA3 handling
Browser fingerprinting
Session / cookie mgmt
UA/header rotation

Data & Processing

ETL design
Data-quality validation
Pandas
NumPy
PySpark
Distributed computing
JSON / XML / CSV

Databases

MongoDB
Redis
PostgreSQL
MySQL

Orchestration & Deploy

Apache Airflow
Docker
Kubernetes
Scrapyd
Scrapy Cloud
Gerapy
Git
cron

Cloud & APIs

AWS (EC2, S3, Glue, Lambda)
Azure Synapse
FastAPI
Django REST
GraphQL
WebSockets
Linux/Unix

ML & MLOps

Scikit-learn
TensorFlow
MLflow
Model deployment via FastAPI

// 05 — open source

Contributor to scrapy/scrapy

Documentation contribution reviewed & merged into the official scrapy/scrapy repository. Public scraping projects & utilities live on GitHub.

Education: Bachelor of Science (B.Sc.), Chemistry · Gujarat, India

github

github.com/kevalsakhiya

linkedin.com/in/kevalsakhiya

// 06 — contact

Got data that needs harvesting?

I'm available for web-scraping & data-engineering work. Let's scope your pipeline.

$ kevalsakhiya@gmail.com

github · linkedin · medium · upwork · Anand, Gujarat, India
