Paul Trowbridge bef3d6d89c CLAUDE.md: add UI section covering Pivot inspector patterns

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-02 10:39:25 -04:00

9.6 KiB


CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Overview

Dataflow is a simple data transformation tool for importing, cleaning, and standardizing data from various sources. Built with PostgreSQL and Node.js/Express, it emphasizes clarity and simplicity over complexity.

Core Concepts

Sources - Define data sources and deduplication rules (which fields identify an already-imported record)
Import - Load CSV data, automatically deduplicating based on source rules
Rules - Extract information using regex patterns (e.g., extract merchant from transaction description)
Mappings - Map extracted values to standardized output (e.g., "WALMART" → {"vendor": "Walmart", "category": "Groceries"})
Transform - Apply rules and mappings to create clean, enriched data

Architecture

Database Schema (`database/schema.sql`)

5 simple tables:

sources - Source definitions with constraint_fields array
records - Imported data with data (raw) and transformed (enriched) JSONB columns
rules - Regex extraction rules with field, pattern, output_field
mappings - Input/output value mappings
import_log - Audit trail

Key design:

JSONB for flexible data storage
Deduplication via a readable JSONB constraint_key built from the specified fields (no hashing; see Deduplication under Common Patterns)
Simple, flat structure (no complex relationships)

Database Functions (`database/functions.sql`)

4 focused functions:

import_records(source_name, data) - Import with deduplication
apply_transformations(source_name, record_ids) - Apply rules and mappings
get_unmapped_values(source_name, rule_name) - Find values needing mappings
reprocess_records(source_name) - Re-transform all records

Design principle: Each function does ONE thing. No nested CTEs, no duplication.

API Server (`api/server.js` + `api/routes/`)

RESTful endpoints:

/api/sources - CRUD sources, import CSV, trigger transformations
/api/rules - CRUD transformation rules
/api/mappings - CRUD value mappings, view unmapped values
/api/records - Query and search transformed data

Route files:

routes/sources.js - Source management and CSV import
routes/rules.js - Rule management
routes/mappings.js - Mapping management + unmapped values
routes/records.js - Record queries and search

Common Development Tasks

Running the Application

# Setup (first time only)
./setup.sh

# Start development server with auto-reload
npm run dev

# Start production server
npm start

# Test API
curl http://localhost:3000/health

Database Changes

When modifying schema:

Edit database/schema.sql
Drop and recreate schema: psql -d dataflow -f database/schema.sql
Redeploy functions: psql -d dataflow -f database/functions.sql

For production, write migration scripts instead of dropping schema.

Adding a New API Endpoint

Add route to appropriate file in api/routes/
Follow existing patterns (async/await, error handling via next())
Use parameterized queries to prevent SQL injection
Return consistent JSON format
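
The conventions above can be sketched as a small handler factory. This is a minimal sketch under stated assumptions, not the repository's actual code: `makeListRulesHandler`, the injected `db` object (anything exposing a pg-style `query(text, values)`), the `:name` route parameter, and the `{ success, data }` response shape are hypothetical and should be aligned with the existing route files.

```javascript
// Hypothetical handler factory; `db` is any object with a pg-style
// query(text, values) method (for example a pg Pool).
function makeListRulesHandler(db) {
  return async function listRules(req, res, next) {
    try {
      // Parameterized query: the value travels separately from the SQL text.
      const result = await db.query(
        'SELECT * FROM dataflow.rules WHERE source_name = $1',
        [req.params.name]
      );
      // Assumed response envelope; match whatever the existing routes return.
      res.json({ success: true, data: result.rows });
    } catch (err) {
      // Delegate to the global error handler instead of responding here.
      next(err);
    }
  };
}

// Mounting (assumed path): router.get('/source/:name', makeListRulesHandler(pool));
```

Injecting `db` keeps the handler testable with a stub and keeps the SQL in one visible place.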

Testing

Manual testing workflow:

Create a source: POST /api/sources
Create rules: POST /api/rules
Import data: POST /api/sources/:name/import
Apply transformations: POST /api/sources/:name/transform
View results: GET /api/records/source/:name

See examples/GETTING_STARTED.md for complete curl examples.

Design Principles

Simple over clever - Straightforward code beats optimization
Explicit over implicit - No magic, no hidden triggers
Clear naming - data not rec, transformed not allj
One function, one job - No 250-line functions
JSONB for flexibility - Handle varying schemas without migrations

Common Patterns

Import Flow

CSV file → parse → import_records() → records table (data column)
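
As a rough illustration of this flow in Node.js, the sketch below parses a CSV string and hands the rows to import_records(). The names `parseSimpleCsv` and `importCsv`, the `dataflow.` schema prefix on the function call, the `::jsonb` cast, and the `result` alias are assumptions. The parser is deliberately naive (no quoted fields or embedded commas), so a real import should use a proper CSV library.

```javascript
// Naive CSV parser for illustration only: no quoted fields, no embedded commas.
function parseSimpleCsv(text) {
  const lines = text.split(/\r?\n/).filter((line) => line.trim() !== '');
  if (lines.length === 0) return [];
  const headers = lines[0].split(',').map((h) => h.trim());
  return lines.slice(1).map((line) => {
    const cells = line.split(',');
    const row = {};
    headers.forEach((header, i) => {
      row[header] = (cells[i] ?? '').trim();
    });
    return row;
  });
}

// Hypothetical wrapper: passes the parsed rows to import_records() as JSONB.
// `db` is any object with a pg-style query(text, values) method.
async function importCsv(db, sourceName, csvText) {
  const rows = parseSimpleCsv(csvText);
  const result = await db.query(
    'SELECT dataflow.import_records($1, $2::jsonb) AS result',
    [sourceName, JSON.stringify(rows)]
  );
  return result.rows[0].result;
}
```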

Transformation Flow

records.data → apply_transformations() →
  - Apply each rule (regex extraction)
  - Look up mappings
  - Merge into records.transformed
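
The same steps can be modelled in plain JavaScript for a single record. This is only a sketch of the logic, not the SQL in database/functions.sql: the rule fields (field, pattern, output_field) come from the schema description, while `transformRecord`, the shape of the `mappings` lookup (keyed by output_field, then by extracted value), and the choice to prefer the first capture group are assumptions.

```javascript
// Model of the transformation flow for one record (illustrative only).
function transformRecord(data, rules, mappings) {
  const transformed = {};
  for (const rule of rules) {
    // Step 1: regex extraction against the rule's source field.
    const match = new RegExp(rule.pattern).exec(String(data[rule.field] ?? ''));
    if (!match) continue;
    // Assumption: use the first capture group if present, else the whole match.
    const extracted = match[1] ?? match[0];
    transformed[rule.output_field] = extracted;
    // Step 2: look up a mapping for the extracted value.
    const mapped = mappings[rule.output_field]?.[extracted];
    // Step 3: merge the mapped output into the transformed object.
    if (mapped) Object.assign(transformed, mapped);
  }
  return transformed;
}

const example = transformRecord(
  { description: 'WALMART #1234 PURCHASE' },
  [{ field: 'description', pattern: '^(WALMART)', output_field: 'merchant' }],
  { merchant: { WALMART: { vendor: 'Walmart', category: 'Groceries' } } }
);
// example is { merchant: 'WALMART', vendor: 'Walmart', category: 'Groceries' }
```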

Deduplication

constraint_key is a JSONB object of the constraint field values (readable, no hashing)
Dedup is enforced at import time via CTE — NO unique DB constraint on constraint_key
The constraint key is for cross-batch re-import protection, NOT record uniqueness
Within a single import batch, ALL rows insert regardless of duplicate constraint keys
- Banks legitimately send multiple identical-looking transactions (same date, description, amount)
- Example: 11 Cedar Point merchandise charges on one day — all should insert in one batch
On re-import of overlapping date range, rows whose constraint_key already exists in DB are skipped
- This prevents double-counting when you re-run a month-to-date export the next day
NEVER use ON CONFLICT (constraint_key) — there is no unique constraint and it would wrongly drop legitimate duplicate transactions from the same batch
Deleting an import log entry cascades to all records from that batch (import_id FK)
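
The batch semantics above are easy to get wrong, so here is a small in-memory model of them. It is a sketch of the rule, not the CTE in database/functions.sql; `buildConstraintKey` and `selectRowsToInsert` are hypothetical names, and keys are compared as JSON strings built in constraint_fields order on both sides.

```javascript
// constraint_key as a readable object of the constraint field values.
function buildConstraintKey(row, constraintFields) {
  const key = {};
  for (const field of constraintFields) key[field] = row[field] ?? null;
  return key;
}

// A row is skipped only when its key already existed BEFORE this batch.
// Duplicate keys inside the batch are never compared with each other,
// so identical-looking transactions in one batch all insert.
function selectRowsToInsert(existingKeys, batch, constraintFields) {
  const seenBefore = new Set(existingKeys.map((key) => JSON.stringify(key)));
  return batch.filter(
    (row) => !seenBefore.has(JSON.stringify(buildConstraintKey(row, constraintFields)))
  );
}

const fields = ['date', 'description', 'amount'];
const charge = { date: '2024-07-04', description: 'CEDAR POINT MERCH', amount: '12.00' };

// First import: three identical charges in one batch, all three insert.
const firstBatch = selectRowsToInsert([], [charge, charge, charge], fields);

// Re-import the next day: the old charge is skipped, the new one inserts.
const stored = firstBatch.map((row) => buildConstraintKey(row, fields));
const secondBatch = selectRowsToInsert(stored, [charge, { ...charge, amount: '15.00' }], fields);
```

An ON CONFLICT approach would collapse `firstBatch` to a single row, which is the failure mode the rule above warns against.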

Error Handling

API routes use try/catch and pass errors to next(err)
api/server.js registers a global error handler
Database functions return JSON with success boolean
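
A minimal sketch of how these pieces might fit together is shown below. Express identifies error-handling middleware by its four-argument signature, which is standard Express behavior. The response shape, the `status` property on the error, the `error` field on the database result, and both function names are assumptions rather than the project's actual code.

```javascript
// Turns a database function result into a thrown error when success is false.
// Assumption: failures carry a message in an `error` field.
function assertDbSuccess(result) {
  if (!result || result.success !== true) {
    const err = new Error(result?.error ?? 'Database function reported failure');
    err.status = 500;
    throw err;
  }
  return result;
}

// Global error handler: the four parameters mark it as error middleware.
function globalErrorHandler(err, req, res, next) {
  // If a response has already started, hand off to Express's default handler.
  if (res.headersSent) return next(err);
  const status = err.status || 500;
  res.status(status).json({ success: false, error: err.message });
}

// Registered after all routes (assumed): app.use(globalErrorHandler);
```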

UI (React + Vite)

The frontend lives in ui/src/ and is built to public/ via npm run build from the ui/ directory. Always run npm run build from ui/ after any changes to ui/src/ files.

Pivot inspector panel

Clicking a data cell opens a right-hand inspector panel showing the underlying transactions for that cell. Key behaviors:

Toggle: clicking the same cell again closes the panel. The toggle key is JSON.stringify({ p: row.__ROW_PATH__, c: column_names }) — stable across source and stack views.
Listener cleanup: the perspective-click handler is stored in perspClickHandlerRef and removed via removeEventListener on effect cleanup. Without this, switching views accumulates duplicate listeners that fire multiple times per click.
split_by filter derivation: detail.config.filter from the click event may omit split_by column constraints. They are derived from column_names positionally (column_names[i] matches config.split_by[i]) and appended to the filter before querying.
Row filtering: a temporary table.view({ filter, expressions }) is used so Perspective evaluates expression/computed columns correctly. Falls back to JS-side filterRowsByConfig on error (which skips filters for fields not in raw data).
The panel is resizable via a drag handle on its left edge (paneWidth state, min 240px).
The transaction table is sortable (click header) and shows column totals for all-numeric columns.
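
The toggle key and the split_by derivation can be sketched as pure functions. This is an illustration of the behavior described above, not the component's code: `cellToggleKey`, `nextOpenKey`, and `withSplitByFilters` are hypothetical names, and the sketch assumes filters use Perspective's `[column, operator, value]` triple form with `==` for equality. Values from column_names arrive as strings, so non-string split_by columns may need conversion before filtering.

```javascript
// Toggle key: row path plus column names, as described above.
function cellToggleKey(row, columnNames) {
  return JSON.stringify({ p: row.__ROW_PATH__, c: columnNames });
}

// Clicking the cell that is already open closes the panel (null = closed).
function nextOpenKey(currentKey, clickedKey) {
  return currentKey === clickedKey ? null : clickedKey;
}

// Append split_by constraints that the click event's filter may omit.
// column_names[i] is matched positionally with config.split_by[i].
function withSplitByFilters(config, columnNames) {
  const filter = [...(config.filter ?? [])];
  (config.split_by ?? []).forEach((column, i) => {
    const value = columnNames[i];
    if (value === undefined) return;
    const alreadyPresent = filter.some(([c, op]) => c === column && op === '==');
    if (!alreadyPresent) filter.push([column, '==', value]);
  });
  return filter;
}

const derived = withSplitByFilters(
  { filter: [['category', '==', 'Food']], split_by: ['year'] },
  ['2024', 'amount']
);
// derived is [['category', '==', 'Food'], ['year', '==', '2024']]
```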

File Structure

dataflow/
├── database/
│   ├── schema.sql         # Table definitions
│   └── functions.sql      # Import/transform functions
├── api/
│   ├── server.js          # Express server
│   └── routes/            # API endpoints
│       ├── sources.js
│       ├── rules.js
│       ├── mappings.js
│       └── records.js
├── ui/
│   ├── src/
│   │   ├── pages/         # One file per page
│   │   └── api.js         # API client
│   └── package.json
├── public/                # Built UI (gitignored, generated by npm run build)
├── docs/
│   └── perspective-pivot.md  # Perspective API reference
├── examples/
│   ├── GETTING_STARTED.md # Tutorial
│   └── bank_transactions.csv
├── .env.example           # Config template
├── package.json
└── README.md

Comparison to Legacy TPS System

This project replaces an older system (in /opt/tps) that had:

2,150 lines of complex SQL with heavy duplication
5 nearly-identical 200+ line functions
Confusing names and deep nested CTEs
Complex trigger-based processing

Dataflow achieves the same functionality with:

~400 lines of simple SQL
4 focused functions
Clear names and linear logic
Explicit API-triggered processing

The simplification makes the system easier to understand, modify, and maintain.

Troubleshooting

Database connection fails:

Check .env file exists and has correct credentials
Verify PostgreSQL is running: psql -U postgres -l
Check that the search path is set; it should default to the dataflow schema

Import succeeds but transformation fails:

Check rules exist: SELECT * FROM dataflow.rules WHERE source_name = 'xxx'
Verify field names match CSV columns
Test regex pattern manually
Check for SQL errors in logs

All records marked as duplicates:

Verify constraint_fields match actual field names in data
Check if data was already imported
Use different source name for testing

Adding New Features

When adding features, follow these principles:

Add ONE function that does ONE thing
Keep functions under 100 lines if possible
Write clear SQL, not clever SQL
Add API endpoint that calls the function
Document in README.md and update examples

Notes for Claude

This is a simple system by design - don't over-engineer it
Keep functions focused and linear
Use JSONB for flexibility, not as a crutch for bad design
When confused, read the examples/GETTING_STARTED.md walkthrough
The old TPS system is in /opt/tps - this is a clean rewrite, not a refactor
