Paul Trowbridge efa65d8409 Update all docs to reflect current state

- perspective-pivot.md: npm install pattern, v4.5.1/v4.4.1 versions
- README.md: Node 18+, port 3020, add stacks routes, fix project structure
- SPEC.md: add stacks/status routes, pages, SQL functions; update Perspective version
- ui/README.md: replace Vite boilerplate with project-specific content
- Remove docs/refactor-transformed-split.md (completed work)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-06-12 23:51:00 -04:00

8.0 KiB

Raw Permalink Blame History

Dataflow

A simple data transformation tool for importing, cleaning, and standardizing data from various sources.

What It Does

Dataflow helps you:

Import CSV data with automatic deduplication
Transform data using regex rules to extract meaningful information
Map extracted values to standardized output
Query the transformed data via a web UI or REST API

Perfect for cleaning up messy data like bank transactions, product lists, or any repetitive data that needs normalization.

Core Concepts

1. Sources

Define where data comes from and how to deduplicate it.

Example: Bank transactions deduplicated by date + amount + description

2. Rules

Extract information using regex patterns (extract or replace modes).

Example: Extract merchant name from transaction description

3. Mappings

Map extracted values to clean, standardized output.

Example: "DISCOUNT DRUG MART 32" → {"vendor": "Discount Drug Mart", "category": "Healthcare"}

Architecture

Database: PostgreSQL with JSONB for flexible data storage
API: Node.js/Express REST API
UI: React SPA served from public/
Auth: HTTP Basic auth (configured in .env)

Design Principles

Simple & Clear - Easy to understand what's happening
Explicit - No hidden magic or complex triggers
Flexible - Handle varying data formats without schema changes

Getting Started

Prerequisites

PostgreSQL 12+
Node.js 18+
Python 3 (for manage.py)

Installation

Install Node dependencies:

npm install

Run the management script to configure and deploy everything:

python3 manage.py

For development with auto-reload:

npm run dev

The UI is available at http://localhost:3020. The API is at http://localhost:3020/api (port set by API_PORT in .env).

Management Script (`manage.py`)

manage.py is an interactive tool for configuring, deploying, and managing the service. Run it and choose from the numbered menu:

python3 manage.py

#	Action
1	Database configuration — create/update `.env`, optionally create the PostgreSQL user/database, and deploy schema + functions
2	Redeploy schema only (`database/schema.sql`) — drops and recreates all tables
3	Redeploy SQL functions only (`database/queries/`)
4	Build UI (`ui/` → `public/`)
5	Set up nginx reverse proxy (HTTP or HTTPS via certbot)
6	Install systemd service unit (`dataflow.service`)
7	Start / restart `dataflow.service`
8	Stop `dataflow.service`
9	Set login credentials (`LOGIN_USER` / `LOGIN_PASSWORD_HASH` in `.env`)

The status screen at the top of the menu shows the current state of each component (database connection, schema, UI build, service, nginx).

Typical first-time setup: run options 1 → 4 → 9 → 6 → 7 (→ 5 if you want nginx).

API Reference

All /api routes require HTTP Basic authentication.

Sources — `/api/sources`

Method	Path	Description
GET	`/api/sources`	List all sources
POST	`/api/sources`	Create a source
GET	`/api/sources/:name`	Get a source
PUT	`/api/sources/:name`	Update a source
DELETE	`/api/sources/:name`	Delete a source
POST	`/api/sources/suggest`	Suggest source definition from CSV upload
POST	`/api/sources/:name/import`	Import CSV data and auto-apply transformations to new records
GET	`/api/sources/:name/import-log`	View import history (includes `inserted_keys` / `excluded_keys` in `info`)
DELETE	`/api/sources/:name/import-log/:id`	Delete an import batch and all its records
POST	`/api/sources/:name/transform`	Apply rules and mappings to any untransformed records
POST	`/api/sources/:name/reprocess`	Re-transform all records
GET	`/api/sources/:name/fields`	List all known field names
GET	`/api/sources/:name/stats`	Get record and mapping counts
POST	`/api/sources/:name/view`	Generate output view
GET	`/api/sources/:name/view-data`	Query output view (paginated, sortable)

Rules — `/api/rules`

Method	Path	Description
GET	`/api/rules/source/:source_name`	List rules for a source
POST	`/api/rules`	Create a rule
GET	`/api/rules/:id`	Get a rule
PUT	`/api/rules/:id`	Update a rule
DELETE	`/api/rules/:id`	Delete a rule
GET	`/api/rules/preview`	Preview a pattern against real records (ad-hoc)
GET	`/api/rules/:id/test`	Test a saved rule against real records

Mappings — `/api/mappings`

Method	Path	Description
GET	`/api/mappings/source/:source_name`	List mappings
POST	`/api/mappings`	Create a mapping
POST	`/api/mappings/bulk`	Bulk create/update mappings
GET	`/api/mappings/:id`	Get a mapping
PUT	`/api/mappings/:id`	Update a mapping
DELETE	`/api/mappings/:id`	Delete a mapping
GET	`/api/mappings/source/:source_name/unmapped`	Get values with no mapping yet
GET	`/api/mappings/source/:source_name/all-values`	All extracted values with counts
GET	`/api/mappings/source/:source_name/counts`	Record counts for existing mappings
GET	`/api/mappings/source/:source_name/export.tsv`	Export values as TSV
POST	`/api/mappings/source/:source_name/import-csv`	Import mappings from TSV

Records — `/api/records`

Method	Path	Description
GET	`/api/records/source/:source_name`	List records (paginated)
GET	`/api/records/:id`	Get a single record
POST	`/api/records/search`	Search records
DELETE	`/api/records/:id`	Delete a record
DELETE	`/api/records/source/:source_name/all`	Delete all records for a source

Stacks — `/api/stacks`

Method	Path	Description
GET	`/api/stacks`	List all stacks
POST	`/api/stacks`	Create a stack
GET	`/api/stacks/:name`	Get a stack
PUT	`/api/stacks/:name`	Update a stack
DELETE	`/api/stacks/:name`	Delete a stack
GET	`/api/stacks/:name/view-data`	Query stacked data (paginated)
GET	`/api/stacks/:name/layouts`	List saved pivot layouts
POST	`/api/stacks/:name/layouts`	Save a pivot layout
DELETE	`/api/stacks/:name/layouts/:id`	Delete a pivot layout

Typical Workflow

1. Create a source (POST /api/sources)
2. Create transformation rules (POST /api/rules)
3. Import CSV data (POST /api/sources/:name/import) — transformations applied automatically to new records
4. Preview rules against real data (GET /api/rules/preview)
5. Review unmapped values (GET /api/mappings/source/:name/unmapped)
6. Add mappings (POST /api/mappings or bulk import via TSV)
7. Reprocess to apply new mappings (POST /api/sources/:name/reprocess)
8. Query results (GET /api/sources/:name/view-data)

See examples/GETTING_STARTED.md for a complete walkthrough with curl examples.

Project Structure

dataflow/
├── database/
│   ├── schema.sql           # Table definitions
│   └── queries/             # SQL functions, one file per route
│       ├── sources.sql
│       ├── rules.sql
│       ├── mappings.sql
│       ├── records.sql
│       ├── stacks.sql
│       └── status.sql
├── api/
│   ├── server.js            # Express server
│   ├── middleware/
│   │   └── auth.js          # Basic auth middleware
│   ├── lib/
│   │   └── sql.js           # SQL literal helpers
│   └── routes/
│       ├── sources.js
│       ├── rules.js
│       ├── mappings.js
│       ├── records.js
│       ├── stacks.js
│       └── status.js
├── public/                  # Built React UI (served as static files)
├── examples/
│   ├── GETTING_STARTED.md
│   └── bank_transactions.csv
└── .env.example

License

MIT

8.0 KiB Raw Permalink Blame History