A distributed scraping platform that ingests 30GB/month across 12 locations, 2 providers, and 6 timeslots — in under 3 minutes per cycle.
Overview
Clubworks needed a way to systematically scrape and index location-based pricing data from multiple providers. I built the entire system solo — from scrapers to task orchestration to the frontend dashboard. Two major iterations: the first worked, the second is production-grade.
The Problem
V1 was stable: it ran for three months without downtime. But it hit a ceiling: the Selenium-based scrapers couldn't scale, and both per-scraper concurrency and multi-scraper parallelism became bottlenecks at production volume.
The Solution
V2 decoupled everything. Celery + Celery Beat for scheduling, Playwright for browser automation, FlareSolverr for Cloudflare challenge and cookie handling. Added circuit breakers, dynamic rate limiting, and a cookie jar to reduce proxy costs. The backend consumes Redis Streams with parallel batch processing. The architecture now scales.
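A minimal sketch of the circuit-breaker piece, assuming one breaker per provider that trips after a run of failures and cools down before allowing a retry; the class, thresholds, and provider names are illustrative, not the production code.

```python
import time

class CircuitBreaker:
    """Trips open after `max_failures` consecutive errors and rejects calls
    until `reset_after` seconds have passed (thresholds are illustrative)."""

    def __init__(self, max_failures: int = 5, reset_after: float = 300.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True                      # closed: let requests through
        if time.monotonic() - self.opened_at >= self.reset_after:
            return True                      # cooldown over: allow a trial request
        return False                         # open: skip this provider for now

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()

# One breaker per provider; scrapers check allow() before each request.
breakers = {"provider_a": CircuitBreaker(), "provider_b": CircuitBreaker()}
```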
Key Achievements
- Scrapes 12 locations × 2 providers × 6 timeslots in ~3 minutes — zero downtime
- Extracts ~30GB/month of pricing data, stored in NeonDB with snowflake-style aggregation
- 20 concurrent Celery workers with tuned threading for Playwright + FlareSolverr sessions (see the worker-tuning sketch after this list)
- Redis TTL caching (3 min) for hot data, eliminating frontend latency spikes
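A sketch of the worker tuning behind the concurrency figure, assuming Celery's thread pool so each slot can hold its own Playwright or FlareSolverr session; the app name, broker URL, and every setting except the 20-slot concurrency are assumptions.

```python
# celery_app.py -- illustrative settings; only the 20-slot concurrency figure
# comes from the list above, the rest is assumed.
from celery import Celery

app = Celery("scrapers", broker="redis://localhost:6379/0")

app.conf.update(
    worker_concurrency=20,         # ~20 concurrent scrape slots
    worker_prefetch_multiplier=1,  # one task per thread; browser sessions are heavy
    task_acks_late=True,           # requeue safely if a worker dies mid-scrape
)

# Started with a thread pool so each slot can own a Playwright/FlareSolverr session:
#   celery -A celery_app worker --pool=threads --concurrency=20
```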
Technical Highlights
Scraper Architecture
Playwright + aiohttp hybrid sessions based on provider complexity. Cookie jar persists sessions to minimize proxy usage. FlareSolverr handles Cloudflare-protected endpoints. Circuit breakers prevent cascade failures.
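A condensed sketch of the hybrid-session idea, assuming lightweight providers go through aiohttp with an on-disk cookie jar while complex ones get a Playwright context whose storage state is persisted between runs; the paths and function names are illustrative, not the production implementation.

```python
import pathlib
import aiohttp
from playwright.async_api import async_playwright

COOKIE_DIR = pathlib.Path("cookies")          # assumed persistence location
COOKIE_DIR.mkdir(exist_ok=True)

async def fetch_simple(provider: str, url: str) -> str:
    """Lightweight providers: plain HTTP with a persisted cookie jar."""
    jar = aiohttp.CookieJar()
    jar_file = COOKIE_DIR / f"{provider}.cookies"
    if jar_file.exists():
        jar.load(jar_file)                    # reuse the previous session's cookies
    async with aiohttp.ClientSession(cookie_jar=jar) as session:
        async with session.get(url) as resp:
            body = await resp.text()
    jar.save(jar_file)                        # persist for the next run
    return body

async def fetch_complex(provider: str, url: str) -> str:
    """Complex providers: drive a real browser and persist its storage state."""
    state_file = COOKIE_DIR / f"{provider}.json"
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        context = await browser.new_context(
            storage_state=str(state_file) if state_file.exists() else None
        )
        page = await context.new_page()
        await page.goto(url)
        html = await page.content()
        await context.storage_state(path=str(state_file))  # save cookies/localStorage
        await browser.close()
    return html
```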
Task Orchestration
Celery Beat schedules daily/weekly/monthly/yearly scrapes. Workers stream results to Redis; the backend consumes them via stream consumer groups with parallel batch processing. Fully async, retriable, observable.
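A sketch of that flow under assumed names: a Beat entry triggers a provider scrape, and the task streams each record to Redis with XADD for the backend to pick up. The schedule, task, and stream names are placeholders, and run_scrape is a hypothetical stand-in for the real scraper entry point.

```python
import json
import redis
from celery import Celery
from celery.schedules import crontab

app = Celery("tasks", broker="redis://localhost:6379/0")
r = redis.Redis()

# Beat schedule: cadence and task names are placeholders for the real
# daily/weekly/monthly/yearly entries.
app.conf.beat_schedule = {
    "daily-pricing-scrape": {
        "task": "tasks.scrape_provider",
        "schedule": crontab(hour=4, minute=0),
        "args": ("provider_a",),
    },
}

@app.task(bind=True, max_retries=3, autoretry_for=(Exception,), retry_backoff=True)
def scrape_provider(self, provider: str):
    """Scrape one provider and stream every record to Redis for the backend."""
    records = run_scrape(provider)            # hypothetical scraper entry point
    for record in records:
        r.xadd("pricing:stream", {"payload": json.dumps(record)})
```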
Data Layer
Express.js backend consumes Redis streams and writes to NeonDB (Postgres) with flattened data models for fast aggregation. Redis TTL caching on hot paths. Next.js frontend with shadcn data tables — no loading states needed due to cache strategy.
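The production consumer is the Express.js backend; the Python sketch below just shows the same pattern for brevity: a consumer-group batch read, a batch write to Postgres, a 3-minute TTL refresh of the hot cache, and an acknowledgement. Stream, group, and key names are assumptions, and write_to_neondb is a hypothetical stand-in for the real insert.

```python
import json
import redis

r = redis.Redis()
STREAM, GROUP, CONSUMER = "pricing:stream", "backend", "worker-1"  # assumed names

# Create the consumer group once; ignore the error if it already exists.
try:
    r.xgroup_create(STREAM, GROUP, id="0", mkstream=True)
except redis.exceptions.ResponseError:
    pass

def consume_batch(count: int = 100) -> None:
    """Read a batch, persist it, refresh the hot cache, then acknowledge."""
    entries = r.xreadgroup(GROUP, CONSUMER, {STREAM: ">"}, count=count, block=5000)
    if not entries:
        return
    _, messages = entries[0]
    rows = [json.loads(fields[b"payload"]) for _, fields in messages]

    write_to_neondb(rows)                     # hypothetical batch insert into Postgres

    # Hot-path cache with a 3-minute TTL, matching the frontend cache strategy.
    r.setex("pricing:latest", 180, json.dumps(rows))

    r.xack(STREAM, GROUP, *[msg_id for msg_id, _ in messages])
```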
Reflection
V1 taught me that 'working' isn't the same as 'scalable.' It ran fine for months, but when we needed more concurrency, the architecture couldn't support it. V2 was about building room to grow. Celery made all the difference.