# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Overview

Dataflow is a simple data transformation tool for importing, cleaning, and standardizing data from various sources. Built with PostgreSQL and Node.js/Express, it emphasizes clarity and simplicity over complexity.

## Core Concepts

1. **Sources** - Define data sources and deduplication rules (which fields identify a record across imports)
2. **Import** - Load CSV data, automatically deduplicating based on source rules
3. **Rules** - Extract information using regex patterns (e.g., extract merchant from transaction description)
4. **Mappings** - Map extracted values to standardized output (e.g., "WALMART" → {"vendor": "Walmart", "category": "Groceries"})
5. **Transform** - Apply rules and mappings to create clean, enriched data

## Architecture

### Database Schema (`database/schema.sql`)

**5 simple tables:**
- `sources` - Source definitions with `constraint_fields` array
- `records` - Imported data with `data` (raw) and `transformed` (enriched) JSONB columns
- `rules` - Regex extraction rules with `field`, `pattern`, `output_field`
- `mappings` - Input/output value mappings
- `import_log` - Audit trail

**Key design:**
- JSONB for flexible data storage
- Deduplication via a readable `constraint_key` (a JSONB object of the source's `constraint_fields` values), not a hash; see Deduplication under Common Patterns
- Simple, flat structure (no complex relationships)

### Database Functions (`database/functions.sql`)

**4 focused functions:**
- `import_records(source_name, data)` - Import with deduplication
- `apply_transformations(source_name, record_ids)` - Apply rules and mappings
- `get_unmapped_values(source_name, rule_name)` - Find values needing mappings
- `reprocess_records(source_name)` - Re-transform all records

**Design principle:** Each function does ONE thing. No nested CTEs, no duplication.
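
To make the rule and mapping concepts concrete, here is a minimal JavaScript sketch of what one transformation step does conceptually. It is not the code in `database/functions.sql` (the real logic is SQL). The rule shape follows the `rules` table columns, while the `applyRule` helper, the mapping lookup object, and the sample record are assumptions made for illustration.

```javascript
// Illustrative only: the real transformation runs in apply_transformations() in SQL.
// applyRule, the mappings object shape, and the sample data are hypothetical.
function applyRule(record, rule, mappings) {
  const value = record.data[rule.field];
  if (typeof value !== 'string') return record;

  const match = value.match(new RegExp(rule.pattern));
  if (!match) return record;

  // Prefer the first capture group; fall back to the whole match.
  const extracted = match[1] ?? match[0];

  // Look up the standardized output for the extracted value, if any.
  const mapped = mappings[extracted] ?? {};

  // Merge the extracted value and mapped fields into `transformed`.
  return {
    ...record,
    transformed: {
      ...record.transformed,
      [rule.output_field]: extracted,
      ...mapped,
    },
  };
}

const record = {
  data: { description: 'WALMART #1234 DALLAS TX', amount: '42.10' },
  transformed: {},
};
const rule = { field: 'description', pattern: '^([A-Z]+)', output_field: 'merchant' };
const mappings = { WALMART: { vendor: 'Walmart', category: 'Groceries' } };

const result = applyRule(record, rule, mappings);
console.log(result.transformed);
```

An extracted value with no entry in `mappings` is the kind of case `get_unmapped_values` is meant to surface.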

### API Server (`api/server.js` + `api/routes/`)

**RESTful endpoints:**
- `/api/sources` - CRUD sources, import CSV, trigger transformations
- `/api/rules` - CRUD transformation rules
- `/api/mappings` - CRUD value mappings, view unmapped values
- `/api/records` - Query and search transformed data

**Route files:**
- `routes/sources.js` - Source management and CSV import
- `routes/rules.js` - Rule management
- `routes/mappings.js` - Mapping management + unmapped values
- `routes/records.js` - Record queries and search

## Common Development Tasks

### Running the Application

```bash
# Setup (first time only)
./setup.sh

# Start development server with auto-reload
npm run dev

# Start production server
npm start

# Test API
curl http://localhost:3000/health
```

### Database Changes

When modifying schema:
1. Edit `database/schema.sql`
2. Drop and recreate schema: `psql -d dataflow -f database/schema.sql`
3. Redeploy functions: `psql -d dataflow -f database/functions.sql`

For production, write migration scripts instead of dropping schema.

### Adding a New API Endpoint

1. Add route to appropriate file in `api/routes/`
2. Follow existing patterns (async/await, error handling via `next()`)
3. Use parameterized queries to prevent SQL injection
4. Return consistent JSON format

### Testing

Manual testing workflow:
1. Create a source: `POST /api/sources`
2. Create rules: `POST /api/rules`
3. Import data: `POST /api/sources/:name/import`
4. Apply transformations: `POST /api/sources/:name/transform`
5. View results: `GET /api/records/source/:name`

See `examples/GETTING_STARTED.md` for complete curl examples.

## Design Principles

1. **Simple over clever** - Straightforward code beats optimization
2. **Explicit over implicit** - No magic, no hidden triggers
3. **Clear naming** - `data` not `rec`, `transformed` not `allj`
4. **One function, one job** - No 250-line functions
5.
   **JSONB for flexibility** - Handle varying schemas without migrations

## Common Patterns

### Import Flow

```
CSV file → parse → import_records() → records table (data column)
```

### Transformation Flow

```
records.data → apply_transformations() →
  - Apply each rule (regex extraction)
  - Look up mappings
  - Merge into records.transformed
```

### Deduplication

- `constraint_key` is a JSONB object of the constraint field values (readable, no hashing)
- Dedup is enforced at import time via a CTE; there is NO unique DB constraint on `constraint_key`
- **The constraint key is for cross-batch re-import protection, NOT record uniqueness**
- Within a single import batch, ALL rows insert regardless of duplicate constraint keys
- Banks legitimately send multiple identical-looking transactions (same date, description, amount)
- Example: 11 Cedar Point merchandise charges on one day; all should insert in one batch
- On re-import of an overlapping date range, rows whose `constraint_key` already exists in the DB are skipped
- This prevents double-counting when you re-run a month-to-date export the next day
- NEVER use `ON CONFLICT (constraint_key)`: there is no unique constraint, and it would wrongly drop legitimate duplicate transactions from the same batch
- Deleting an import log entry cascades to all records from that batch (`import_id` FK)

### Error Handling

- API routes use `try/catch` and pass errors to `next(err)`
- `server.js` has a global error handler
- Database functions return JSON with a `success` boolean

## UI (React + Vite)

The frontend lives in `ui/src/` and is built to `public/` via `npm run build` from the `ui/` directory.

**Always run `npm run build` from `ui/` after any changes to `ui/src/` files.**

### Pages

- **Sources / Rules / Mappings / Records** - standard CRUD pages
- **Pivot** (`ui/src/pages/Pivot.jsx`) - interactive pivot/crosstab powered by Perspective (`@perspective-dev` v4.4.0, loaded from CDN). See `docs/perspective-pivot.md` for the full Perspective API reference.
- **Stacks** - multi-source union views with running balance
- **Log** - import audit trail

### Pivot inspector panel

Clicking a data cell opens a right-hand inspector panel showing the underlying transactions for that cell. Key behaviors:

- **Toggle**: clicking the same cell again closes the panel. The toggle key is `JSON.stringify({ p: row.__ROW_PATH__, c: column_names })`, which is stable across source and stack views.
- **Listener cleanup**: the `perspective-click` handler is stored in `perspClickHandlerRef` and removed via `removeEventListener` on effect cleanup. Without this, switching views accumulates duplicate listeners that fire multiple times per click.
- **split_by filter derivation**: `detail.config.filter` from the click event may omit split_by column constraints. They are derived from `column_names` positionally (`column_names[i]` matches `config.split_by[i]`) and appended to the filter before querying.
- **Row filtering**: a temporary `table.view({ filter, expressions })` is used so Perspective evaluates expression/computed columns correctly. Falls back to JS-side `filterRowsByConfig` on error (which skips filters for fields not in raw data).
- The panel is resizable via a drag handle on its left edge (`paneWidth` state, min 240px).
- The transaction table is sortable (click header) and shows column totals for all-numeric columns.

### Pivot layout persistence

Named layouts are stored in `dataflow.pivot_layouts` for both sources and stacks. The `source_name` column holds either a source name or a stack name; the FK to `sources(name)` was dropped to allow this.

Source layouts use `/api/sources/:name/layouts`; stack layouts use `/api/stacks/:name/layouts`. Both call the same DB functions (`list_pivot_layouts`, `save_pivot_layout`, `delete_pivot_layout`).

`localStorage` is still used to remember the *last active layout* for a view (the `psp_layout_` key), but named layout definitions live in the DB so they persist across machines.
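
The pivot behaviors described in the two sections above (toggle key, split_by filter derivation, and source vs. stack layout routes) can be sketched as plain functions. This is a hedged illustration, not code from `ui/src/pages/Pivot.jsx`: the helper names, the `kind` parameter, the sample values, and the `[column, operator, value]` filter triple shape are all assumptions.

```javascript
// Sketch only: helper names and the [column, operator, value] filter shape
// are assumptions, not code taken from ui/src/pages/Pivot.jsx.

// Key used to decide whether a click hit the same cell as the open panel.
function toggleKey(row, columnNames) {
  return JSON.stringify({ p: row.__ROW_PATH__, c: columnNames });
}

// Append split_by constraints that detail.config.filter may have omitted.
// columnNames[i] is taken to be the value for config.split_by[i].
function withSplitByFilters(config, columnNames) {
  const filter = [...(config.filter ?? [])];
  (config.split_by ?? []).forEach((column, i) => {
    const value = columnNames[i];
    if (value !== undefined) filter.push([column, '==', value]);
  });
  return filter;
}

// Layout endpoint for a source or a stack (`kind` is a hypothetical parameter).
function layoutsUrl(kind, name) {
  const base = kind === 'stack' ? 'stacks' : 'sources';
  return `/api/${base}/${encodeURIComponent(name)}/layouts`;
}

const row = { __ROW_PATH__: ['2024-01'] };
const columnNames = ['Groceries', 'amount'];
const config = { filter: [['source', '==', 'bank']], split_by: ['category'] };

console.log(toggleKey(row, columnNames));
console.log(withSplitByFilters(config, columnNames));
console.log(layoutsUrl('stack', 'all accounts'));
```

Because the toggle key is built only from the row path and column names, the same comparison works whether the view is backed by a source or a stack.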

## File Structure

```
dataflow/
├── database/
│   ├── schema.sql          # Table definitions
│   └── functions.sql       # Import/transform functions
├── api/
│   ├── server.js           # Express server
│   └── routes/             # API endpoints
│       ├── sources.js
│       ├── rules.js
│       ├── mappings.js
│       └── records.js
├── ui/
│   ├── src/
│   │   ├── pages/          # One file per page
│   │   └── api.js          # API client
│   └── package.json
├── public/                 # Built UI (gitignored, generated by npm run build)
├── docs/
│   └── perspective-pivot.md  # Perspective API reference
├── examples/
│   ├── GETTING_STARTED.md  # Tutorial
│   └── bank_transactions.csv
├── .env.example            # Config template
├── package.json
└── README.md
```

## Comparison to Legacy TPS System

This project replaces an older system (in `/opt/tps`) that had:
- 2,150 lines of complex SQL with heavy duplication
- 5 nearly-identical 200+ line functions
- Confusing names and deep nested CTEs
- Complex trigger-based processing

Dataflow achieves the same functionality with:
- ~400 lines of simple SQL
- 4 focused functions
- Clear names and linear logic
- Explicit API-triggered processing

The simplification makes it easy to understand, modify, and maintain.

## Troubleshooting

**Database connection fails:**
- Check that the `.env` file exists and has correct credentials
- Verify PostgreSQL is running: `psql -U postgres -l`
- Check that the search path is set; it should default to the `dataflow` schema

**Import succeeds but transformation fails:**
- Check rules exist: `SELECT * FROM dataflow.rules WHERE source_name = 'xxx'`
- Verify field names match CSV columns
- Test the regex pattern manually
- Check for SQL errors in logs

**All records marked as duplicates:**
- Verify `constraint_fields` match actual field names in data
- Check whether the data was already imported
- Use a different source name for testing

## Adding New Features

When adding features, follow these principles:
- Add ONE function that does ONE thing
- Keep functions under 100 lines if possible
- Write clear SQL, not clever SQL
- Add an API endpoint that calls the function
- Document in README.md and update examples

## Notes for Claude

- This is a **simple** system by design - don't over-engineer it
- Keep functions focused and linear
- Use JSONB for flexibility, not as a crutch for bad design
- When confused, read the `examples/GETTING_STARTED.md` walkthrough
- The old TPS system is in `/opt/tps` - this is a clean rewrite, not a refactor