CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Overview

Dataflow is a simple data transformation tool for importing, cleaning, and standardizing data from various sources. Built with PostgreSQL and Node.js/Express, it emphasizes clarity and simplicity over complexity.

Core Concepts

  1. Sources - Define data sources and deduplication rules (which fields make a record unique)
  2. Import - Load CSV data, automatically deduplicating based on source rules
  3. Rules - Extract information using regex patterns (e.g., extract merchant from transaction description)
  4. Mappings - Map extracted values to standardized output (e.g., "WALMART" → {"vendor": "Walmart", "category": "Groceries"})
  5. Transform - Apply rules and mappings to create clean, enriched data
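Taken together, a rule plus a mapping behaves roughly like this sketch (field names and the mapping table are illustrative, not the actual implementation, which lives in SQL):

```javascript
// Hypothetical sketch of one rule + one mapping, mirroring the concepts above.
const rule = { field: 'description', pattern: /^(\w+)/, output_field: 'merchant' };
const mappings = { WALMART: { vendor: 'Walmart', category: 'Groceries' } };

function extractAndMap(record) {
  const match = record[rule.field].match(rule.pattern);
  if (!match) return {};
  const value = match[1];
  // Extracted value plus any standardized fields its mapping provides
  return { [rule.output_field]: value, ...(mappings[value] || {}) };
}

console.log(extractAndMap({ description: 'WALMART #1234 PURCHASE' }));
// → { merchant: 'WALMART', vendor: 'Walmart', category: 'Groceries' }
```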

Architecture

Database Schema (database/schema.sql)

5 simple tables:

  • sources - Source definitions with dedup_fields array
  • records - Imported data with data (raw) and transformed (enriched) JSONB columns
  • rules - Regex extraction rules with field, pattern, output_field
  • mappings - Input/output value mappings
  • import_log - Audit trail

Key design:

  • JSONB for flexible data storage
  • Deduplication via MD5 hash of specified fields
  • Simple, flat structure (no complex relationships)

Database Functions (database/functions.sql)

4 focused functions:

  • import_records(source_name, data) - Import with deduplication
  • apply_transformations(source_name, record_ids) - Apply rules and mappings
  • get_unmapped_values(source_name, rule_name) - Find values needing mappings
  • reprocess_records(source_name) - Re-transform all records

Design principle: Each function does ONE thing. No nested CTEs, no duplication.

API Server (api/server.js + api/routes/)

RESTful endpoints:

  • /api/sources - CRUD sources, import CSV, trigger transformations
  • /api/rules - CRUD transformation rules
  • /api/mappings - CRUD value mappings, view unmapped values
  • /api/records - Query and search transformed data

Route files:

  • routes/sources.js - Source management and CSV import
  • routes/rules.js - Rule management
  • routes/mappings.js - Mapping management + unmapped values
  • routes/records.js - Record queries and search

Common Development Tasks

Running the Application

# Setup (first time only)
./setup.sh

# Start development server with auto-reload
npm run dev

# Start production server
npm start

# Test API
curl http://localhost:3000/health

Database Changes

When modifying schema:

  1. Edit database/schema.sql
  2. Drop and recreate schema: psql -d dataflow -f database/schema.sql
  3. Redeploy functions: psql -d dataflow -f database/functions.sql

For production, write migration scripts instead of dropping schema.

Adding a New API Endpoint

  1. Add route to appropriate file in api/routes/
  2. Follow existing patterns (async/await, error handling via next())
  3. Use parameterized queries to prevent SQL injection
  4. Return consistent JSON format
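A handler following those conventions might look like this sketch (the factory shape, table name, and response fields are assumptions for illustration; the real routes wire a shared pg pool directly):

```javascript
// Hypothetical route handler: async/await, errors forwarded to next(),
// parameterized query, consistent JSON shape. The factory takes a query
// function so the handler can be exercised without a live database.
function makeGetRule(query) {
  return async function getRule(req, res, next) {
    try {
      const { rows } = await query(
        'SELECT * FROM dataflow.rules WHERE id = $1', // parameterized: no SQL injection
        [req.params.id]
      );
      if (rows.length === 0) {
        return res.status(404).json({ success: false, error: 'Rule not found' });
      }
      res.json({ success: true, data: rows[0] });
    } catch (err) {
      next(err); // handled by the global error handler in server.js
    }
  };
}
```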

Testing

Manual testing workflow:

  1. Create a source: POST /api/sources
  2. Create rules: POST /api/rules
  3. Import data: POST /api/sources/:name/import
  4. Apply transformations: POST /api/sources/:name/transform
  5. View results: GET /api/records/source/:name

See examples/GETTING_STARTED.md for complete curl examples.

Design Principles

  1. Simple over clever - Straightforward code beats premature optimization
  2. Explicit over implicit - No magic, no hidden triggers
  3. Clear naming - data not rec, transformed not allj
  4. One function, one job - No 250-line functions
  5. JSONB for flexibility - Handle varying schemas without migrations

Common Patterns

Import Flow

CSV file → parse → import_records() → records table (data column)
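The parse step can be pictured as a minimal sketch; the real import should use a proper CSV parser that handles quoted fields and embedded commas:

```javascript
// Minimal illustration of the parse step: CSV text → array of row objects
// ready to hand to import_records(). Naive split, for illustration only.
function parseCsv(text) {
  const [headerLine, ...lines] = text.trim().split('\n');
  const headers = headerLine.split(',').map((h) => h.trim());
  return lines.map((line) => {
    const cells = line.split(',');
    return Object.fromEntries(headers.map((h, i) => [h, (cells[i] || '').trim()]));
  });
}

const rows = parseCsv('date,amount,description\n2024-01-05,42.10,WALMART #1234');
console.log(rows);
// → [{ date: '2024-01-05', amount: '42.10', description: 'WALMART #1234' }]
```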

Transformation Flow

records.data → apply_transformations() →
  - Apply each rule (regex extraction)
  - Look up mappings
  - Merge into records.transformed
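In JavaScript terms, that flow amounts to something like the sketch below; the actual logic is SQL in database/functions.sql, and the rule/mapping shapes are illustrative:

```javascript
// Sketch of apply_transformations(): loop over rules, extract with regex,
// look up mappings, and merge everything into one transformed object.
function transform(data, rules, mappings) {
  const transformed = {};
  for (const rule of rules) {
    // 1. Apply each rule: regex extraction from the raw data field
    const match = (data[rule.field] || '').match(new RegExp(rule.pattern));
    if (!match) continue;
    const value = match[1] ?? match[0];
    transformed[rule.output_field] = value;
    // 2. Look up mappings for the extracted value
    Object.assign(transformed, mappings[value] || {});
  }
  // 3. Result is merged into records.transformed
  return transformed;
}
```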

Deduplication

  • Hash is MD5 of concatenated values from dedup_fields
  • Unique constraint on (source_name, dedup_key) prevents duplicates
  • Import function catches unique violations and counts them

Error Handling

  • API routes use try/catch and pass errors to next(err)
  • Server.js has global error handler
  • Database functions return JSON with success boolean
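The global handler in server.js can be pictured like this sketch (the status code fallback and response shape are assumptions):

```javascript
// Sketch of an Express global error handler: routes call next(err),
// and this final four-argument middleware turns it into JSON.
function errorHandler(err, req, res, next) {
  console.error(err); // surfaces errors in the logs for troubleshooting
  res.status(err.status || 500).json({ success: false, error: err.message });
}
```

Express recognizes error-handling middleware by its four-argument signature, so `next` stays in the parameter list even though it is unused here.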

File Structure

dataflow/
├── database/
│   ├── schema.sql         # Table definitions
│   └── functions.sql      # Import/transform functions
├── api/
│   ├── server.js          # Express server
│   └── routes/            # API endpoints
│       ├── sources.js
│       ├── rules.js
│       ├── mappings.js
│       └── records.js
├── examples/
│   ├── GETTING_STARTED.md # Tutorial
│   └── bank_transactions.csv
├── .env.example           # Config template
├── package.json
└── README.md

Comparison to Legacy TPS System

This project replaces an older system (in /opt/tps) that had:

  • 2,150 lines of complex SQL with heavy duplication
  • 5 nearly-identical 200+ line functions
  • Confusing names and deep nested CTEs
  • Complex trigger-based processing

Dataflow achieves the same functionality with:

  • ~400 lines of simple SQL
  • 4 focused functions
  • Clear names and linear logic
  • Explicit API-triggered processing

The simplification makes it easy to understand, modify, and maintain.

Troubleshooting

Database connection fails:

  • Check .env file exists and has correct credentials
  • Verify PostgreSQL is running: psql -U postgres -l
  • Check the search_path: it should default to the dataflow schema

Import succeeds but transformation fails:

  • Check rules exist: SELECT * FROM dataflow.rules WHERE source_name = 'xxx'
  • Verify field names match CSV columns
  • Test regex pattern manually
  • Check for SQL errors in logs

All records marked as duplicates:

  • Verify dedup_fields match actual field names in data
  • Check if data was already imported
  • Use different source name for testing

Adding New Features

When adding features, follow these principles:

  • Add ONE function that does ONE thing
  • Keep functions under 100 lines if possible
  • Write clear SQL, not clever SQL
  • Add API endpoint that calls the function
  • Document in README.md and update examples

Notes for Claude

  • This is a simple system by design - don't over-engineer it
  • Keep functions focused and linear
  • Use JSONB for flexibility, not as a crutch for bad design
  • When confused, read the examples/GETTING_STARTED.md walkthrough
  • The old TPS system is in /opt/tps - this is a clean rewrite, not a refactor