# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Overview
Dataflow is a simple data transformation tool for importing, cleaning, and standardizing data from various sources. Built with PostgreSQL and Node.js/Express, it emphasizes clarity and simplicity over complexity.
## Core Concepts
- Sources - Define data sources and deduplication rules (which fields make a record unique)
- Import - Load CSV data, automatically deduplicating based on source rules
- Rules - Extract information using regex patterns (e.g., extract merchant from transaction description)
- Mappings - Map extracted values to standardized output (e.g., "WALMART" → {"vendor": "Walmart", "category": "Groceries"})
- Transform - Apply rules and mappings to create clean, enriched data
## Architecture

### Database Schema (`database/schema.sql`)
5 simple tables:
- `sources` - Source definitions with `dedup_fields` array
- `records` - Imported data with `data` (raw) and `transformed` (enriched) JSONB columns
- `rules` - Regex extraction rules with `field`, `pattern`, `output_field`
- `mappings` - Input/output value mappings
- `import_log` - Audit trail
Key design:
- JSONB for flexible data storage
- Deduplication via MD5 hash of specified fields
- Simple, flat structure (no complex relationships)
### Database Functions (`database/functions.sql`)
4 focused functions:
- `import_records(source_name, data)` - Import with deduplication
- `apply_transformations(source_name, record_ids)` - Apply rules and mappings
- `get_unmapped_values(source_name, rule_name)` - Find values needing mappings
- `reprocess_records(source_name)` - Re-transform all records
Design principle: Each function does ONE thing. No nested CTEs, no duplication.
### API Server (`api/server.js` + `api/routes/`)
RESTful endpoints:
- `/api/sources` - CRUD sources, import CSV, trigger transformations
- `/api/rules` - CRUD transformation rules
- `/api/mappings` - CRUD value mappings, view unmapped values
- `/api/records` - Query and search transformed data
Route files:
- `routes/sources.js` - Source management and CSV import
- `routes/rules.js` - Rule management
- `routes/mappings.js` - Mapping management + unmapped values
- `routes/records.js` - Record queries and search
## Common Development Tasks

### Running the Application
```bash
# Setup (first time only)
./setup.sh

# Start development server with auto-reload
npm run dev

# Start production server
npm start

# Test API
curl http://localhost:3000/health
```
### Database Changes
When modifying schema:
- Edit `database/schema.sql`
- Drop and recreate schema: `psql -d dataflow -f database/schema.sql`
- Redeploy functions: `psql -d dataflow -f database/functions.sql`
For production, write migration scripts instead of dropping schema.
### Adding a New API Endpoint
- Add route to appropriate file in `api/routes/`
- Follow existing patterns (async/await, error handling via `next()`)
- Use parameterized queries to prevent SQL injection
- Return consistent JSON format
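A minimal sketch of a handler following these conventions (the route, table, and injected `db.query` helper are illustrative assumptions, not actual project code):

```javascript
// Hypothetical route handler in the project's style: async/await,
// parameterized query ($1 placeholder), errors passed to next().
// The db object is injected so the handler stays easy to test.
function makeGetSource(db) {
  return async function getSource(req, res, next) {
    try {
      const { rows } = await db.query(
        'SELECT * FROM dataflow.sources WHERE name = $1',
        [req.params.name] // parameterized: no SQL injection via the name
      );
      if (rows.length === 0) {
        return res.status(404).json({ success: false, error: 'Source not found' });
      }
      res.json({ success: true, source: rows[0] }); // consistent JSON shape
    } catch (err) {
      next(err); // let the global error handler format the response
    }
  };
}
```

Wired up as `router.get('/:name', makeGetSource(db))` in an Express router, this keeps the route file thin and the handler unit-testable with a mock `db`.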
### Testing
Manual testing workflow:
- Create a source: `POST /api/sources`
- Create rules: `POST /api/rules`
- Import data: `POST /api/sources/:name/import`
- Apply transformations: `POST /api/sources/:name/transform`
- View results: `GET /api/records/source/:name`
See `examples/GETTING_STARTED.md` for complete curl examples.
## Design Principles
- Simple over clever - Straightforward code beats optimization
- Explicit over implicit - No magic, no hidden triggers
- Clear naming - `data` not `rec`, `transformed` not `allj`
- One function, one job - No 250-line functions
- JSONB for flexibility - Handle varying schemas without migrations
## Common Patterns

### Import Flow

CSV file → parse → `import_records()` → `records` table (`data` column)
### Transformation Flow

`records.data` → `apply_transformations()` →
- Apply each rule (regex extraction)
- Look up mappings
- Merge into `records.transformed`
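The flow can be sketched in plain JavaScript. The rule shape (`field`, `pattern`, `output_field`) follows the schema above; the mapping shape (`input` → `output` object) and capture-group behavior are assumptions for illustration, not the SQL function's exact semantics:

```javascript
// Hypothetical in-memory version of the transformation flow:
// extract a value with each rule's regex, look it up in the mappings,
// and merge the results into the record's transformed object.
function transformRecord(record, rules, mappings) {
  const transformed = {};
  for (const rule of rules) {
    const value = record.data[rule.field];
    if (typeof value !== 'string') continue;
    const match = value.match(new RegExp(rule.pattern));
    if (!match) continue;
    // Use the first capture group if the pattern has one, else the whole match.
    const extracted = match[1] !== undefined ? match[1] : match[0];
    transformed[rule.output_field] = extracted;
    // Mappings turn the raw extracted value into standardized output,
    // e.g. "WALMART" → {"vendor": "Walmart", "category": "Groceries"}.
    const mapping = mappings.find(m => m.input === extracted);
    if (mapping) Object.assign(transformed, mapping.output);
  }
  return { ...record, transformed };
}
```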
### Deduplication

- Hash is MD5 of concatenated values from `dedup_fields`
- Unique constraint on `(source_name, dedup_key)` prevents duplicates
- Import function catches unique violations and counts them
### Error Handling

- API routes use `try/catch` and pass errors to `next(err)`
- `server.js` has a global error handler
- Database functions return JSON with a `success` boolean
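A minimal sketch of the kind of global handler `server.js` is described as having (the response shape and `err.status` convention are assumptions):

```javascript
// Hypothetical Express-style global error handler: any error a route
// passes to next(err) lands here and becomes a consistent JSON body.
// Express recognizes error handlers by their four-argument signature,
// so the unused next parameter must stay.
function errorHandler(err, req, res, next) {
  const status = err.status || 500;
  res.status(status).json({
    success: false,
    error: err.message || 'Internal server error',
  });
}
```

Registered last with `app.use(errorHandler)`, it keeps every route's failure path down to a single `next(err)` call.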
## File Structure

```
dataflow/
├── database/
│   ├── schema.sql         # Table definitions
│   └── functions.sql      # Import/transform functions
├── api/
│   ├── server.js          # Express server
│   └── routes/            # API endpoints
│       ├── sources.js
│       ├── rules.js
│       ├── mappings.js
│       └── records.js
├── examples/
│   ├── GETTING_STARTED.md # Tutorial
│   └── bank_transactions.csv
├── .env.example           # Config template
├── package.json
└── README.md
```
## Comparison to Legacy TPS System

This project replaces an older system (in `/opt/tps`) that had:
- 2,150 lines of complex SQL with heavy duplication
- 5 nearly-identical 200+ line functions
- Confusing names and deep nested CTEs
- Complex trigger-based processing
Dataflow achieves the same functionality with:
- ~400 lines of simple SQL
- 4 focused functions
- Clear names and linear logic
- Explicit API-triggered processing
The simplification makes it easy to understand, modify, and maintain.
## Troubleshooting

**Database connection fails:**
- Check `.env` file exists and has correct credentials
- Verify PostgreSQL is running: `psql -U postgres -l`
- Check the search path: it should default to the `dataflow` schema
**Import succeeds but transformation fails:**
- Check rules exist: `SELECT * FROM dataflow.rules WHERE source_name = 'xxx'`
- Verify field names match CSV columns
- Test regex pattern manually
- Check for SQL errors in logs
**All records marked as duplicates:**
- Verify `dedup_fields` match actual field names in the data
- Check if the data was already imported
- Use different source name for testing
## Adding New Features
When adding features, follow these principles:
- Add ONE function that does ONE thing
- Keep functions under 100 lines if possible
- Write clear SQL, not clever SQL
- Add API endpoint that calls the function
- Document in README.md and update examples
## Notes for Claude
- This is a simple system by design - don't over-engineer it
- Keep functions focused and linear
- Use JSONB for flexibility, not as a crutch for bad design
- When confused, read the `examples/GETTING_STARTED.md` walkthrough
- The old TPS system is in `/opt/tps` - this is a clean rewrite, not a refactor