CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Overview

Dataflow is a simple data transformation tool for importing, cleaning, and standardizing data from various sources. Built with PostgreSQL and Node.js/Express, it emphasizes clarity and simplicity over complexity.

Core Concepts

  1. Sources - Define data sources and deduplication rules (which fields make a record unique)
  2. Import - Load CSV data, automatically deduplicating based on source rules
  3. Rules - Extract information using regex patterns (e.g., extract merchant from transaction description)
  4. Mappings - Map extracted values to standardized output (e.g., "WALMART" → {"vendor": "Walmart", "category": "Groceries"})
  5. Transform - Apply rules and mappings to create clean, enriched data
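Taken together, a rule plus a mapping behaves roughly like this sketch (field names and the mapping table are illustrative, not the actual implementation, which lives in SQL):

```javascript
// Hypothetical sketch of one rule + one mapping, mirroring the concepts above.
const rule = { field: 'description', pattern: /^(\w+)/, output_field: 'merchant' };
const mappings = { WALMART: { vendor: 'Walmart', category: 'Groceries' } };

function extractAndMap(record) {
  const match = record[rule.field].match(rule.pattern);
  if (!match) return {};
  const value = match[1];
  // Extracted value plus any standardized fields its mapping provides
  return { [rule.output_field]: value, ...(mappings[value] || {}) };
}

console.log(extractAndMap({ description: 'WALMART #1234 PURCHASE' }));
// → { merchant: 'WALMART', vendor: 'Walmart', category: 'Groceries' }
```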

Architecture

Database Schema (database/schema.sql)

5 simple tables:

  • sources - Source definitions with dedup_fields array
  • records - Imported data with data (raw) and transformed (enriched) JSONB columns
  • rules - Regex extraction rules with field, pattern, output_field
  • mappings - Input/output value mappings
  • import_log - Audit trail

Key design:

  • JSONB for flexible data storage
  • Deduplication via MD5 hash of specified fields
  • Simple, flat structure (no complex relationships)

Database Functions (database/functions.sql)

4 focused functions:

  • import_records(source_name, data) - Import with deduplication
  • apply_transformations(source_name, record_ids) - Apply rules and mappings
  • get_unmapped_values(source_name, rule_name) - Find values needing mappings
  • reprocess_records(source_name) - Re-transform all records

Design principle: Each function does ONE thing. No nested CTEs, no duplication.

API Server (api/server.js + api/routes/)

RESTful endpoints:

  • /api/sources - CRUD sources, import CSV, trigger transformations
  • /api/rules - CRUD transformation rules
  • /api/mappings - CRUD value mappings, view unmapped values
  • /api/records - Query and search transformed data

Route files:

  • routes/sources.js - Source management and CSV import
  • routes/rules.js - Rule management
  • routes/mappings.js - Mapping management + unmapped values
  • routes/records.js - Record queries and search

Common Development Tasks

Running the Application

# Setup (first time only)
./setup.sh

# Start development server with auto-reload
npm run dev

# Start production server
npm start

# Test API
curl http://localhost:3000/health

Database Changes

When modifying schema:

  1. Edit database/schema.sql
  2. Drop and recreate schema: psql -d dataflow -f database/schema.sql
  3. Redeploy functions: psql -d dataflow -f database/functions.sql

For production, write migration scripts instead of dropping schema.

Adding a New API Endpoint

  1. Add route to appropriate file in api/routes/
  2. Follow existing patterns (async/await, error handling via next())
  3. Use parameterized queries to prevent SQL injection
  4. Return consistent JSON format
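A handler following those conventions might look like this sketch (the factory shape, table name, and response fields are assumptions for illustration; the real routes wire a shared pg pool directly):

```javascript
// Hypothetical route handler: async/await, errors forwarded to next(),
// parameterized query, consistent JSON shape. The factory takes a query
// function so the handler can be exercised without a live database.
function makeGetRule(query) {
  return async function getRule(req, res, next) {
    try {
      const { rows } = await query(
        'SELECT * FROM dataflow.rules WHERE id = $1', // parameterized: no SQL injection
        [req.params.id]
      );
      if (rows.length === 0) {
        return res.status(404).json({ success: false, error: 'Rule not found' });
      }
      res.json({ success: true, data: rows[0] });
    } catch (err) {
      next(err); // handled by the global error handler in server.js
    }
  };
}
```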

Testing

Manual testing workflow:

  1. Create a source: POST /api/sources
  2. Create rules: POST /api/rules
  3. Import data: POST /api/sources/:name/import
  4. Apply transformations: POST /api/sources/:name/transform
  5. View results: GET /api/records/source/:name

See examples/GETTING_STARTED.md for complete curl examples.

Design Principles

  1. Simple over clever - Straightforward code beats premature optimization
  2. Explicit over implicit - No magic, no hidden triggers
  3. Clear naming - data not rec, transformed not allj
  4. One function, one job - No 250-line functions
  5. JSONB for flexibility - Handle varying schemas without migrations

Common Patterns

Import Flow

CSV file → parse → import_records() → records table (data column)
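The parse step can be pictured as a minimal sketch; the real import should use a proper CSV parser that handles quoted fields and embedded commas:

```javascript
// Minimal illustration of the parse step: CSV text → array of row objects
// ready to hand to import_records(). Naive split, for illustration only.
function parseCsv(text) {
  const [headerLine, ...lines] = text.trim().split('\n');
  const headers = headerLine.split(',').map((h) => h.trim());
  return lines.map((line) => {
    const cells = line.split(',');
    return Object.fromEntries(headers.map((h, i) => [h, (cells[i] || '').trim()]));
  });
}

const rows = parseCsv('date,amount,description\n2024-01-05,42.10,WALMART #1234');
console.log(rows);
// → [{ date: '2024-01-05', amount: '42.10', description: 'WALMART #1234' }]
```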

Transformation Flow

records.data → apply_transformations() →
  - Apply each rule (regex extraction)
  - Look up mappings
  - Merge into records.transformed
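In JavaScript terms, that flow amounts to something like the sketch below; the actual logic is SQL in database/functions.sql, and the rule/mapping shapes are illustrative:

```javascript
// Sketch of apply_transformations(): loop over rules, extract with regex,
// look up mappings, and merge everything into one transformed object.
function transform(data, rules, mappings) {
  const transformed = {};
  for (const rule of rules) {
    // 1. Apply each rule: regex extraction from the raw data field
    const match = (data[rule.field] || '').match(new RegExp(rule.pattern));
    if (!match) continue;
    const value = match[1] ?? match[0];
    transformed[rule.output_field] = value;
    // 2. Look up mappings for the extracted value
    Object.assign(transformed, mappings[value] || {});
  }
  // 3. Result is merged into records.transformed
  return transformed;
}
```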

Deduplication

  • Hash is MD5 of concatenated values from dedup_fields
  • Unique constraint on (source_name, dedup_key) prevents duplicates
  • Import function catches unique violations and counts them

Error Handling

  • API routes use try/catch and pass errors to next(err)
  • Server.js has global error handler
  • Database functions return JSON with success boolean
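The global handler in server.js can be pictured like this sketch (the status code fallback and response shape are assumptions):

```javascript
// Sketch of an Express global error handler: routes call next(err),
// and this final four-argument middleware turns it into JSON.
function errorHandler(err, req, res, next) {
  console.error(err); // surfaces errors in the logs for troubleshooting
  res.status(err.status || 500).json({ success: false, error: err.message });
}
```

Express recognizes error-handling middleware by its four-argument signature, so `next` stays in the parameter list even though it is unused here.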

File Structure

dataflow/
├── database/
│   ├── schema.sql         # Table definitions
│   └── functions.sql      # Import/transform functions
├── api/
│   ├── server.js          # Express server
│   └── routes/            # API endpoints
│       ├── sources.js
│       ├── rules.js
│       ├── mappings.js
│       └── records.js
├── examples/
│   ├── GETTING_STARTED.md # Tutorial
│   └── bank_transactions.csv
├── .env.example           # Config template
├── package.json
└── README.md

Comparison to Legacy TPS System

This project replaces an older system (in /opt/tps) that had:

  • 2,150 lines of complex SQL with heavy duplication
  • 5 nearly-identical 200+ line functions
  • Confusing names and deep nested CTEs
  • Complex trigger-based processing

Dataflow achieves the same functionality with:

  • ~400 lines of simple SQL
  • 4 focused functions
  • Clear names and linear logic
  • Explicit API-triggered processing

The simplification makes it easy to understand, modify, and maintain.

Troubleshooting

Database connection fails:

  • Check .env file exists and has correct credentials
  • Verify PostgreSQL is running: psql -U postgres -l
  • Check the search_path: it should default to the dataflow schema

Import succeeds but transformation fails:

  • Check rules exist: SELECT * FROM dataflow.rules WHERE source_name = 'xxx'
  • Verify field names match CSV columns
  • Test regex pattern manually
  • Check for SQL errors in logs

All records marked as duplicates:

  • Verify dedup_fields match actual field names in data
  • Check if data was already imported
  • Use different source name for testing

Adding New Features

When adding features, follow these principles:

  • Add ONE function that does ONE thing
  • Keep functions under 100 lines if possible
  • Write clear SQL, not clever SQL
  • Add API endpoint that calls the function
  • Document in README.md and update examples

Notes for Claude

  • This is a simple system by design - don't over-engineer it
  • Keep functions focused and linear
  • Use JSONB for flexibility, not as a crutch for bad design
  • When confused, read the examples/GETTING_STARTED.md walkthrough
  • The old TPS system is in /opt/tps - this is a clean rewrite, not a refactor