A distributed scraping platform that ingests 30GB/month across 12 locations, 2 providers, and 6 timeslots — in under 3 minutes per cycle.
Overview
Clubworks needed a way to systematically scrape and index location-based pricing data from multiple providers. I built the entire system solo — from scrapers to task orchestration to the frontend dashboard. Two major iterations: the first worked, the second is production-grade.
The Problem
V1 was stable: it ran for three months without downtime. But it hit a ceiling: the Selenium-based scrapers couldn't scale, and both per-scraper concurrency and multi-scraper parallelism became bottlenecks at production volume.
The Solution
V2 decoupled everything. Celery + Celery Beat for scheduling, Playwright for browser automation, FlareSolverr for Cloudflare challenge and cookie handling. Added circuit breakers, dynamic rate limiting, and a cookie jar to reduce proxy costs. The backend consumes Redis Streams with parallel batch processing. The architecture now scales.
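A minimal sketch of the circuit-breaker piece, assuming one breaker per provider that trips after a run of failures and cools down before allowing a retry; the class, thresholds, and provider names are illustrative, not the production code.

```python
import time

class CircuitBreaker:
    """Trips open after `max_failures` consecutive errors and rejects calls
    until `reset_after` seconds have passed (thresholds are illustrative)."""

    def __init__(self, max_failures: int = 5, reset_after: float = 300.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True                      # closed: let requests through
        if time.monotonic() - self.opened_at >= self.reset_after:
            return True                      # cooldown over: allow a trial request
        return False                         # open: skip this provider for now

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()

# One breaker per provider; scrapers check allow() before each request.
breakers = {"provider_a": CircuitBreaker(), "provider_b": CircuitBreaker()}
```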
Key Achievements
- Scrapes 12 locations × 2 providers × 6 timeslots in ~3 minutes — zero downtime
- Extracts ~30GB/month of pricing data, stored in NeonDB with snowflake-style aggregation
- 20 concurrent Celery workers with tuned threading for Playwright + FlareSolverr sessions (see the worker-tuning sketch after this list)
- Redis TTL caching (3 min) for hot data, eliminating frontend latency spikes
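A sketch of the worker tuning behind the concurrency figure, assuming Celery's thread pool so each slot can hold its own Playwright or FlareSolverr session; the app name, broker URL, and every setting except the 20-slot concurrency are assumptions.

```python
# celery_app.py -- illustrative settings; only the 20-slot concurrency figure
# comes from the list above, the rest is assumed.
from celery import Celery

app = Celery("scrapers", broker="redis://localhost:6379/0")

app.conf.update(
    worker_concurrency=20,         # ~20 concurrent scrape slots
    worker_prefetch_multiplier=1,  # one task per thread; browser sessions are heavy
    task_acks_late=True,           # requeue safely if a worker dies mid-scrape
)

# Started with a thread pool so each slot can own a Playwright/FlareSolverr session:
#   celery -A celery_app worker --pool=threads --concurrency=20
```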
Technical Highlights
Scraper Architecture
Playwright + aiohttp hybrid sessions based on provider complexity. Cookie jar persists sessions to minimize proxy usage. FlareSolverr handles Cloudflare-protected endpoints. Circuit breakers prevent cascade failures.
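A condensed sketch of the hybrid-session idea, assuming lightweight providers go through aiohttp with an on-disk cookie jar while complex ones get a Playwright context whose storage state is persisted between runs; the paths and function names are illustrative, not the production implementation.

```python
import pathlib
import aiohttp
from playwright.async_api import async_playwright

COOKIE_DIR = pathlib.Path("cookies")          # assumed persistence location
COOKIE_DIR.mkdir(exist_ok=True)

async def fetch_simple(provider: str, url: str) -> str:
    """Lightweight providers: plain HTTP with a persisted cookie jar."""
    jar = aiohttp.CookieJar()
    jar_file = COOKIE_DIR / f"{provider}.cookies"
    if jar_file.exists():
        jar.load(jar_file)                    # reuse the previous session's cookies
    async with aiohttp.ClientSession(cookie_jar=jar) as session:
        async with session.get(url) as resp:
            body = await resp.text()
    jar.save(jar_file)                        # persist for the next run
    return body

async def fetch_complex(provider: str, url: str) -> str:
    """Complex providers: drive a real browser and persist its storage state."""
    state_file = COOKIE_DIR / f"{provider}.json"
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        context = await browser.new_context(
            storage_state=str(state_file) if state_file.exists() else None
        )
        page = await context.new_page()
        await page.goto(url)
        html = await page.content()
        await context.storage_state(path=str(state_file))  # save cookies/localStorage
        await browser.close()
    return html
```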
Task Orchestration
Celery Beat schedules daily/weekly/monthly/yearly scrapes. Workers stream results to Redis; the backend consumes them via stream consumer groups with parallel batch processing. Fully async, retriable, observable.
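A sketch of that flow under assumed names: a Beat entry triggers a provider scrape, and the task streams each record to Redis with XADD for the backend to pick up. The schedule, task, and stream names are placeholders, and run_scrape is a hypothetical stand-in for the real scraper entry point.

```python
import json
import redis
from celery import Celery
from celery.schedules import crontab

app = Celery("tasks", broker="redis://localhost:6379/0")
r = redis.Redis()

# Beat schedule: cadence and task names are placeholders for the real
# daily/weekly/monthly/yearly entries.
app.conf.beat_schedule = {
    "daily-pricing-scrape": {
        "task": "tasks.scrape_provider",
        "schedule": crontab(hour=4, minute=0),
        "args": ("provider_a",),
    },
}

@app.task(bind=True, max_retries=3, autoretry_for=(Exception,), retry_backoff=True)
def scrape_provider(self, provider: str):
    """Scrape one provider and stream every record to Redis for the backend."""
    records = run_scrape(provider)            # hypothetical scraper entry point
    for record in records:
        r.xadd("pricing:stream", {"payload": json.dumps(record)})
```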
Data Layer
Express.js backend consumes Redis streams and writes to NeonDB (Postgres) with flattened data models for fast aggregation. Redis TTL caching on hot paths. Next.js frontend with shadcn data tables — no loading states needed due to cache strategy.
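The production consumer is the Express.js backend; the Python sketch below just shows the same pattern for brevity: a consumer-group batch read, a batch write to Postgres, a 3-minute TTL refresh of the hot cache, and an acknowledgement. Stream, group, and key names are assumptions, and write_to_neondb is a hypothetical stand-in for the real insert.

```python
import json
import redis

r = redis.Redis()
STREAM, GROUP, CONSUMER = "pricing:stream", "backend", "worker-1"  # assumed names

# Create the consumer group once; ignore the error if it already exists.
try:
    r.xgroup_create(STREAM, GROUP, id="0", mkstream=True)
except redis.exceptions.ResponseError:
    pass

def consume_batch(count: int = 100) -> None:
    """Read a batch, persist it, refresh the hot cache, then acknowledge."""
    entries = r.xreadgroup(GROUP, CONSUMER, {STREAM: ">"}, count=count, block=5000)
    if not entries:
        return
    _, messages = entries[0]
    rows = [json.loads(fields[b"payload"]) for _, fields in messages]

    write_to_neondb(rows)                     # hypothetical batch insert into Postgres

    # Hot-path cache with a 3-minute TTL, matching the frontend cache strategy.
    r.setex("pricing:latest", 180, json.dumps(rows))

    r.xack(STREAM, GROUP, *[msg_id for msg_id, _ in messages])
```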
Reflection
V1 taught me that 'working' isn't the same as 'scalable.' It ran fine for months, but when we needed more concurrency, the architecture couldn't support it. V2 was about building room to grow. Celery made all the difference.