CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Overview
Dataflow is a simple data transformation tool for importing, cleaning, and standardizing data from various sources. Built with PostgreSQL and Node.js/Express, it emphasizes clarity and simplicity over complexity.
Core Concepts
- Sources - Define data sources and deduplication rules (which fields make a record unique)
- Import - Load CSV data, automatically deduplicating based on source rules
- Rules - Extract information using regex patterns (e.g., extract merchant from transaction description)
- Mappings - Map extracted values to standardized output (e.g., "WALMART" → {"vendor": "Walmart", "category": "Groceries"})
- Transform - Apply rules and mappings to create clean, enriched data
Architecture
Database Schema (database/schema.sql)
5 simple tables:
- sources - Source definitions with constraint_fields array
- records - Imported data with data (raw) and transformed (enriched) JSONB columns
- rules - Regex extraction rules with field, pattern, output_field
- mappings - Input/output value mappings
- import_log - Audit trail
Key design:
- JSONB for flexible data storage
- Deduplication via a constraint key built from the source's constraint_fields (a readable JSONB object, not a hash)
- Simple, flat structure (no complex relationships)
Database Functions (database/functions.sql)
4 focused functions:
- import_records(source_name, data) - Import with deduplication
- apply_transformations(source_name, record_ids) - Apply rules and mappings
- get_unmapped_values(source_name, rule_name) - Find values needing mappings
- reprocess_records(source_name) - Re-transform all records
Design principle: Each function does ONE thing. No nested CTEs, no duplication.
API Server (api/server.js + api/routes/)
RESTful endpoints:
- /api/sources - CRUD sources, import CSV, trigger transformations
- /api/rules - CRUD transformation rules
- /api/mappings - CRUD value mappings, view unmapped values
- /api/records - Query and search transformed data
Route files:
- routes/sources.js - Source management and CSV import
- routes/rules.js - Rule management
- routes/mappings.js - Mapping management + unmapped values
- routes/records.js - Record queries and search
Common Development Tasks
Running the Application
# Setup (first time only)
./setup.sh
# Start development server with auto-reload
npm run dev
# Start production server
npm start
# Test API
curl http://localhost:3000/health
Database Changes
When modifying schema:
- Edit database/schema.sql
- Drop and recreate schema: psql -d dataflow -f database/schema.sql
- Redeploy functions: psql -d dataflow -f database/functions.sql
For production, write migration scripts instead of dropping schema.
Adding a New API Endpoint
- Add route to appropriate file in api/routes/
- Follow existing patterns (async/await, error handling via next())
- Use parameterized queries to prevent SQL injection
- Return consistent JSON format
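The conventions above can be sketched as a small handler factory. This is a minimal sketch, not code from the repository: makeListRulesHandler, the route it would serve, and the response shape are hypothetical, and pool is assumed to be anything exposing a node-postgres style query(text, params).

```javascript
// Hypothetical handler, e.g. for GET /api/rules/source/:name.
// `pool` is assumed to expose query(text, params) like a node-postgres Pool.
function makeListRulesHandler(pool) {
  return async function listRules(req, res, next) {
    try {
      // Parameterized query: the value travels separately from the SQL text,
      // so user input is never concatenated into the statement.
      const result = await pool.query(
        'SELECT * FROM dataflow.rules WHERE source_name = $1',
        [req.params.name]
      );
      res.json(result.rows);
    } catch (err) {
      // Hand the error to the global error handler instead of replying here.
      next(err);
    }
  };
}
```

Passing the pool in as an argument keeps the handler easy to exercise with a stub; a route file would register it with something like router.get('/source/:name', makeListRulesHandler(pool)).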
Testing
Manual testing workflow:
- Create a source: POST /api/sources
- Create rules: POST /api/rules
- Import data: POST /api/sources/:name/import
- Apply transformations: POST /api/sources/:name/transform
- View results: GET /api/records/source/:name
See examples/GETTING_STARTED.md for complete curl examples.
Design Principles
- Simple over clever - Straightforward code beats optimization
- Explicit over implicit - No magic, no hidden triggers
- Clear naming - data not rec, transformed not allj
- One function, one job - No 250-line functions
- JSONB for flexibility - Handle varying schemas without migrations
Common Patterns
Import Flow
CSV file → parse → import_records() → records table (data column)
Transformation Flow
records.data → apply_transformations() →
- Apply each rule (regex extraction)
- Look up mappings
- Merge into records.transformed
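The three steps can be illustrated in memory. The real logic lives in the apply_transformations() database function; this sketch only mirrors the flow. The rule fields field, pattern and output_field follow the rules table, while the name property and the shape of the mappings object are assumptions made for the example.

```javascript
// Illustrative in-memory version of the transform step (not the SQL function).
function transformRecord(data, rules, mappings) {
  const transformed = {};
  for (const rule of rules) {
    const source = data[rule.field];
    if (typeof source !== 'string') continue; // field missing in this record
    const match = source.match(new RegExp(rule.pattern));
    if (!match) continue; // rule does not apply to this record
    // Prefer the first capture group, fall back to the whole match.
    const extracted = match[1] !== undefined ? match[1] : match[0];
    transformed[rule.output_field] = extracted;
    // Look up a mapping for the extracted value and merge its output.
    const mapped = mappings[rule.name] && mappings[rule.name][extracted];
    if (mapped) Object.assign(transformed, mapped);
  }
  return transformed;
}

const rules = [
  { name: 'merchant', field: 'description', pattern: '(WALMART|TARGET)', output_field: 'merchant' },
];
const mappings = {
  merchant: { WALMART: { vendor: 'Walmart', category: 'Groceries' } },
};
const result = transformRecord({ description: 'WALMART #1234 DEBIT' }, rules, mappings);
console.log(result); // { merchant: 'WALMART', vendor: 'Walmart', category: 'Groceries' }
```

A value that matches a rule but has no mapping (TARGET here) still yields the extracted field, which is the kind of value get_unmapped_values is meant to surface.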
Deduplication
- constraint_key is a JSONB object of the constraint field values (readable, no hashing)
- Dedup is enforced at import time via CTE; there is NO unique DB constraint on constraint_key
- The constraint key is for cross-batch re-import protection, NOT record uniqueness
- Within a single import batch, ALL rows insert regardless of duplicate constraint keys
- Banks legitimately send multiple identical-looking transactions (same date, description, amount)
- Example: 11 Cedar Point merchandise charges on one day — all should insert in one batch
- On re-import of overlapping date range, rows whose constraint_key already exists in DB are skipped
- This prevents double-counting when you re-run a month-to-date export the next day
- NEVER use ON CONFLICT (constraint_key): there is no unique constraint and it would wrongly drop legitimate duplicate transactions from the same batch
- Deleting an import log entry cascades to all records from that batch (import_id FK)
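The batch semantics are easy to get wrong, so here is an in-memory model of them. The real check runs in SQL inside import_records(); the function names and sample values below are made up for illustration.

```javascript
// Build a readable key from just the constraint field values.
function constraintKey(row, constraintFields) {
  const key = {};
  for (const field of constraintFields) key[field] = row[field];
  return JSON.stringify(key);
}

// existingKeys: Set of keys already stored by earlier batches (mutated here).
function importBatch(existingKeys, rows, constraintFields) {
  // Snapshot taken BEFORE the batch starts: only earlier batches count.
  const before = new Set(existingKeys);
  const inserted = [];
  for (const row of rows) {
    const key = constraintKey(row, constraintFields);
    if (before.has(key)) continue; // seen in an earlier batch: skip
    inserted.push(row);            // duplicates within this batch all insert
    existingKeys.add(key);
  }
  return inserted;
}

const fields = ['date', 'description', 'amount'];
const stored = new Set();
const charge = { date: '2024-07-04', description: 'CEDAR POINT MERCH', amount: '12.00' };

const first = importBatch(stored, Array(11).fill(charge), fields);
const again = importBatch(stored, [charge, { ...charge, amount: '15.00' }], fields);
console.log(first.length, again.length); // 11 1
```

Checking against a snapshot rather than the live set is what separates this from ON CONFLICT behaviour: the 11 identical charges in the first batch all insert, while the re-import skips only the row an earlier batch already stored.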
Error Handling
- API routes use try/catch and pass errors to next(err)
- server.js has a global error handler
- Database functions return JSON with a success boolean
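A rough sketch of how these pieces could fit together follows. It is not the repository's code: the response shape, the status codes, the error property on the database result, and the assertSuccess helper are all assumptions.

```javascript
// Express treats a middleware with four arguments as an error handler,
// so `next` must stay in the signature even though it is unused here.
function errorHandler(err, req, res, next) {
  const status = err.status || 500;
  res.status(status).json({ success: false, error: err.message });
}

// Hypothetical helper: turn a database function result whose success flag is
// not true into a thrown error, so the route's catch block can pass it on.
function assertSuccess(result) {
  if (!result || result.success !== true) {
    const err = new Error((result && result.error) || 'Database function failed');
    err.status = 400;
    throw err;
  }
  return result;
}
```

Inside a route, assertSuccess would wrap the JSON returned by a database function within the try block, and errorHandler would be registered last in server.js with app.use(errorHandler).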
UI (React + Vite)
The frontend lives in ui/src/ and is built to public/ via npm run build from the ui/ directory. Always run npm run build from ui/ after any changes to ui/src/ files.
Pages
- Sources / Rules / Mappings / Records - standard CRUD pages
- Pivot (ui/src/pages/Pivot.jsx) - interactive pivot/crosstab powered by Perspective (@perspective-dev v4.4.0, loaded from CDN). See docs/perspective-pivot.md for the full Perspective API reference.
- Stacks - multi-source union views with running balance
- Log - import audit trail
Pivot inspector panel
Clicking a data cell opens a right-hand inspector panel showing the underlying transactions for that cell. Key behaviors:
- Toggle: clicking the same cell again closes the panel. The toggle key is JSON.stringify({ p: row.__ROW_PATH__, c: column_names }), which is stable across source and stack views.
- Listener cleanup: the perspective-click handler is stored in perspClickHandlerRef and removed via removeEventListener on effect cleanup. Without this, switching views accumulates duplicate listeners that fire multiple times per click.
- split_by filter derivation: detail.config.filter from the click event may omit split_by column constraints. They are derived from column_names positionally (column_names[i] matches config.split_by[i]) and appended to the filter before querying.
- Row filtering: a temporary table.view({ filter, expressions }) is used so Perspective evaluates expression/computed columns correctly. Falls back to JS-side filterRowsByConfig on error (which skips filters for fields not in raw data).
- The panel is resizable via a drag handle on its left edge (paneWidth state, min 240px).
- The transaction table is sortable (click header) and shows column totals for all-numeric columns.
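The toggle key and the split_by derivation can be sketched as two small pure functions. The names cellKey and withSplitByFilters are hypothetical, not taken from Pivot.jsx, and the sketch assumes column_names lists one value per split_by column, in order, ahead of the measure name.

```javascript
// Toggle key as described above: row path plus column names.
function cellKey(row, columnNames) {
  return JSON.stringify({ p: row.__ROW_PATH__, c: columnNames });
}

// Append one equality filter per split_by column, pairing column_names[i]
// with config.split_by[i]. Filters use the [column, operator, value] form.
function withSplitByFilters(config, columnNames) {
  const splitBy = config.split_by || [];
  const filter = [...(config.filter || [])]; // copy, leave config untouched
  splitBy.forEach((column, i) => {
    const value = columnNames[i];
    if (value === undefined) return; // no value for this split level
    const present = filter.some(([c, op]) => c === column && op === '==');
    if (!present) filter.push([column, '==', value]);
  });
  return filter;
}

const config = { split_by: ['category'], filter: [['year', '==', 2024]] };
const filter = withSplitByFilters(config, ['Groceries', 'amount']);
console.log(filter); // [ [ 'year', '==', 2024 ], [ 'category', '==', 'Groceries' ] ]
```

The resulting filter is what would be passed to the temporary table.view({ filter, expressions }) described above.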
File Structure
dataflow/
├── database/
│ ├── schema.sql # Table definitions
│ └── functions.sql # Import/transform functions
├── api/
│ ├── server.js # Express server
│ └── routes/ # API endpoints
│ ├── sources.js
│ ├── rules.js
│ ├── mappings.js
│ └── records.js
├── ui/
│ ├── src/
│ │ ├── pages/ # One file per page
│ │ └── api.js # API client
│ └── package.json
├── public/ # Built UI (gitignored, generated by npm run build)
├── docs/
│ └── perspective-pivot.md # Perspective API reference
├── examples/
│ ├── GETTING_STARTED.md # Tutorial
│ └── bank_transactions.csv
├── .env.example # Config template
├── package.json
└── README.md
Comparison to Legacy TPS System
This project replaces an older system (in /opt/tps) that had:
- 2,150 lines of complex SQL with heavy duplication
- 5 nearly-identical 200+ line functions
- Confusing names and deep nested CTEs
- Complex trigger-based processing
Dataflow achieves the same functionality with:
- ~400 lines of simple SQL
- 4 focused functions
- Clear names and linear logic
- Explicit API-triggered processing
The simplification makes it easy to understand, modify, and maintain.
Troubleshooting
Database connection fails:
- Check .env file exists and has correct credentials
- Verify PostgreSQL is running: psql -U postgres -l
- Check search path is set: Should default to dataflow schema
Import succeeds but transformation fails:
- Check rules exist: SELECT * FROM dataflow.rules WHERE source_name = 'xxx'
- Verify field names match CSV columns
- Test regex pattern manually
- Check for SQL errors in logs
All records marked as duplicates:
- Verify constraint_fields match actual field names in data
- Check if data was already imported
- Use different source name for testing
Adding New Features
When adding features, follow these principles:
- Add ONE function that does ONE thing
- Keep functions under 100 lines if possible
- Write clear SQL, not clever SQL
- Add API endpoint that calls the function
- Document in README.md and update examples
Notes for Claude
- This is a simple system by design - don't over-engineer it
- Keep functions focused and linear
- Use JSONB for flexibility, not as a crutch for bad design
- When confused, read the examples/GETTING_STARTED.md walkthrough
- The old TPS system is in /opt/tps - this is a clean rewrite, not a refactor