# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Overview
Dataflow is a simple data transformation tool for importing, cleaning, and standardizing data from various sources. Built with PostgreSQL and Node.js/Express, it emphasizes clarity and simplicity over complexity.
## Core Concepts
1. **Sources** - Define data sources and deduplication rules (which fields make a record unique)
2. **Import** - Load CSV data, automatically deduplicating based on source rules
3. **Rules** - Extract information using regex patterns (e.g., extract merchant from transaction description)
4. **Mappings** - Map extracted values to standardized output (e.g., "WALMART" → {"vendor": "Walmart", "category": "Groceries"})
5. **Transform** - Apply rules and mappings to create clean, enriched data
## Architecture
### Database Schema (`database/schema.sql`)
**5 simple tables:**
- `sources` - Source definitions with `constraint_fields` array
- `records` - Imported data with `data` (raw) and `transformed` (enriched) JSONB columns, plus an `import_id` FK to `import_log` (ON DELETE CASCADE)
- `rules` - Regex extraction rules with `field`, `pattern`, `output_field`
- `mappings` - Input/output value mappings
- `import_log` - Audit trail; its `info` JSONB holds `inserted_keys` and `excluded_keys` arrays per import
**Key design:**
- JSONB for flexible data storage
- Deduplication via a readable JSONB `constraint_key` built from the source's `constraint_fields` (no hashing)
- Simple, flat structure (no complex relationships)
### Database Functions (`database/functions.sql`)
**7 focused functions:**
- `import_records(source_name, data)` - Import with deduplication
- `apply_transformations(source_name, record_ids)` - Apply rules and mappings
- `get_unmapped_values(source_name, rule_name)` - Find values needing mappings
- `reprocess_records(source_name)` - Re-transform all records
- `get_import_log`, `get_all_import_logs`, `delete_import` - Inspect and delete import batches
**Design principle:** Each function does ONE thing. No nested CTEs, no duplication.
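From the API these functions are invoked as plain parameterized calls. A minimal sketch using node-postgres (`pg`) follows; the `pool` wiring, the `::jsonb` cast, passing `NULL` for "all new records", and the result shapes are assumptions for illustration, not the repo's actual code:

```javascript
// Sketch: call the SQL functions from Node via node-postgres.
// Function names match database/functions.sql; everything else is illustrative.
async function importAndTransform(pool, sourceName, rows) {
  // import_records(source_name, data) — rows passed as JSONB
  const imported = await pool.query(
    'SELECT import_records($1, $2::jsonb) AS result',
    [sourceName, JSON.stringify(rows)]
  );
  // apply_transformations(source_name, record_ids) — NULL assumed to mean "all new records"
  const transformed = await pool.query(
    'SELECT apply_transformations($1, NULL) AS result',
    [sourceName]
  );
  return {
    imported: imported.rows[0].result,
    transformed: transformed.rows[0].result,
  };
}
```

Keeping the SQL calls this thin preserves the design principle: the database functions do the work, the API layer only dispatches.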
### API Server (`api/server.js` + `api/routes/`)
**RESTful endpoints:**
- `/api/sources` - CRUD sources, import CSV, trigger transformations
- `/api/rules` - CRUD transformation rules
- `/api/mappings` - CRUD value mappings, view unmapped values
- `/api/records` - Query and search transformed data
**Route files:**
- `routes/sources.js` - Source management and CSV import
- `routes/rules.js` - Rule management
- `routes/mappings.js` - Mapping management + unmapped values
- `routes/records.js` - Record queries and search
## Common Development Tasks
### Running the Application
```bash
# Setup (first time only)
./setup.sh

# Start development server with auto-reload
npm run dev

# Start production server
npm start

# Test API
curl http://localhost:3000/health
```
### Database Changes
When modifying schema:
1. Edit `database/schema.sql`
2. Drop and recreate schema: `psql -d dataflow -f database/schema.sql`
3. Redeploy functions: `psql -d dataflow -f database/functions.sql`
For production, write migration scripts instead of dropping schema.
### Adding a New API Endpoint
1. Add route to appropriate file in `api/routes/`
2. Follow existing patterns (async/await, error handling via `next()`)
3. Use parameterized queries to prevent SQL injection
4. Return consistent JSON format
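A handler following those four conventions might look like the sketch below. The `pool` injection, route path, and column list are assumptions for illustration; only the pattern (async/await, parameterized query, `next(err)`, consistent JSON) reflects the rules above:

```javascript
// Sketch of the route pattern: async/await, parameterized query, errors to next().
// `pool` is assumed to be a shared node-postgres pool; the path/columns are illustrative.
const makeGetRecordsBySource = (pool) => async (req, res, next) => {
  try {
    const { name } = req.params;
    const result = await pool.query(
      'SELECT id, data, transformed FROM dataflow.records WHERE source_name = $1', // parameterized: no SQL injection
      [name]
    );
    res.json({ success: true, records: result.rows }); // consistent JSON shape
  } catch (err) {
    next(err); // defer to the global error handler in server.js
  }
};
// wiring (illustrative): router.get('/source/:name', makeGetRecordsBySource(pool));
```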
### Testing
Manual testing workflow:
1. Create a source: `POST /api/sources`
2. Create rules: `POST /api/rules`
3. Import data: `POST /api/sources/:name/import`
4. Apply transformations: `POST /api/sources/:name/transform`
5. View results: `GET /api/records/source/:name`
See `examples/GETTING_STARTED.md` for complete curl examples.
## Design Principles
1. **Simple over clever** - Straightforward code beats optimization
2. **Explicit over implicit** - No magic, no hidden triggers
3. **Clear naming** - `data` not `rec`, `transformed` not `allj`
4. **One function, one job** - No 250-line functions
5. **JSONB for flexibility** - Handle varying schemas without migrations
## Common Patterns
### Import Flow
```
CSV file → parse → import_records() → records table (data column)
```
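The parse step can be pictured as a header-row-to-objects conversion. This is a naive sketch (the real import presumably uses a proper CSV parser; this split ignores quoted fields and embedded commas):

```javascript
// Minimal CSV parse for the import flow sketch: header row → array of objects.
// Illustrative only — does not handle quoted fields or embedded commas.
const parseCsv = (text) => {
  const [headerLine, ...lines] = text.trim().split('\n');
  const headers = headerLine.split(',').map((h) => h.trim());
  return lines.map((line) => {
    const values = line.split(',');
    return Object.fromEntries(headers.map((h, i) => [h, (values[i] ?? '').trim()]));
  });
};
```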
### Transformation Flow
```
records.data → apply_transformations() →
- Apply each rule (regex extraction)
- Look up mappings
- Merge into records.transformed
```
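The per-record logic can be sketched in plain JS. The rule shape (`field`, `pattern`, `output_field`) follows the schema above; the mapping-lookup-by-extracted-value and capture-group behavior are assumptions for illustration:

```javascript
// Sketch of one record's transformation: regex extraction per rule, then mapping lookup.
// Rule shape ({field, pattern, output_field}) matches the schema; the rest is illustrative.
const transformRecord = (data, rules, mappings) => {
  const transformed = {};
  for (const rule of rules) {
    const match = String(data[rule.field] ?? '').match(new RegExp(rule.pattern));
    if (!match) continue;
    const value = match[1] ?? match[0]; // first capture group, else whole match
    transformed[rule.output_field] = value;
    Object.assign(transformed, mappings[value] ?? {}); // merge mapped output, if any
  }
  return { ...data, ...transformed };
};
```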
### Deduplication
- `constraint_key` is a JSONB object of the constraint field values (readable, no hashing)
- Dedup is enforced at import time via CTE — no unique DB constraint
- Intra-file duplicate rows are allowed (bank may send identical rows); they all insert
- On re-import, all rows whose constraint_key already exists in the DB are skipped
- Deleting an import log entry cascades to all records from that batch (import_id FK)
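The CTE's logic can be pictured as building a `constraint_key` per row and skipping keys already in the DB, while letting intra-file duplicates through. A sketch of that rule in JS (names and shapes are illustrative, not the actual SQL):

```javascript
// Sketch of the dedup rule: constraint_key is the plain object of constraint
// field values; rows whose key already exists in the DB are excluded, but
// duplicate keys *within* the incoming file are all kept.
const makeConstraintKey = (row, constraintFields) =>
  JSON.stringify(Object.fromEntries(constraintFields.map((f) => [f, row[f] ?? null])));

const splitNewRows = (rows, constraintFields, existingKeys) => {
  const inserted = [];
  const excluded = [];
  for (const row of rows) {
    const key = makeConstraintKey(row, constraintFields);
    (existingKeys.has(key) ? excluded : inserted).push(row); // intra-file dupes all land in `inserted`
  }
  return { inserted, excluded };
};
```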
### Error Handling
- API routes use `try/catch` and pass errors to `next(err)`
- Server.js has global error handler
- Database functions return JSON with `success` boolean
## File Structure
```
dataflow/
├── database/
│   ├── schema.sql          # Table definitions
│   └── functions.sql       # Import/transform functions
├── api/
│   ├── server.js           # Express server
│   └── routes/             # API endpoints
│       ├── sources.js
│       ├── rules.js
│       ├── mappings.js
│       └── records.js
├── examples/
│   ├── GETTING_STARTED.md  # Tutorial
│   └── bank_transactions.csv
├── .env.example            # Config template
├── package.json
└── README.md
```
## Comparison to Legacy TPS System
This project replaces an older system (in `/opt/tps`) that had:
- 2,150 lines of complex SQL with heavy duplication
- 5 nearly-identical 200+ line functions
- Confusing names and deep nested CTEs
- Complex trigger-based processing
Dataflow achieves the same functionality with:
- ~400 lines of simple SQL
- 4 focused functions
- Clear names and linear logic
- Explicit API-triggered processing
The simplification makes it easy to understand, modify, and maintain.
## Troubleshooting
**Database connection fails:**
- Check `.env` file exists and has correct credentials
- Verify PostgreSQL is running: `psql -U postgres -l`
- Check the search path: it should default to the `dataflow` schema
**Import succeeds but transformation fails:**
- Check rules exist: `SELECT * FROM dataflow.rules WHERE source_name = 'xxx'`
- Verify field names match CSV columns
- Test regex pattern manually
- Check for SQL errors in logs
**All records marked as duplicates:**
- Verify `constraint_fields` match actual field names in data
- Check if data was already imported
- Use different source name for testing
## Adding New Features
When adding features, follow these principles:
- Add ONE function that does ONE thing
- Keep functions under 100 lines if possible
- Write clear SQL, not clever SQL
- Add API endpoint that calls the function
- Document in README.md and update examples
## Notes for Claude
- This is a **simple** system by design - don't over-engineer it
- Keep functions focused and linear
- Use JSONB for flexibility, not as a crutch for bad design
- When confused, read the examples/GETTING_STARTED.md walkthrough
- The old TPS system is in `/opt/tps` - this is a clean rewrite, not a refactor