Dataflow

A simple data transformation tool for importing, cleaning, and standardizing data from various sources.

What It Does

Dataflow helps you:

  1. Import CSV data with automatic deduplication
  2. Transform data using regex rules to extract meaningful information
  3. Map extracted values to standardized output
  4. Query the transformed data via a web UI or REST API

It is well suited to cleaning up messy data such as bank transactions, product lists, or any other repetitive data that needs normalization.

Core Concepts

1. Sources

Define where data comes from and how to deduplicate it.

Example: Bank transactions deduplicated by date + amount + description
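Deduplication on a composite key can be pictured with the short Python sketch below. It is an illustration only, not Dataflow's own implementation; the helper name `dedupe`, the field names, and the sample rows are assumptions made up for this example.

```python
def dedupe(rows, key_fields):
    """Keep the first row seen for each combination of key_fields."""
    seen = set()
    unique = []
    for row in rows:
        # The composite key is the tuple of the chosen field values.
        key = tuple(row.get(field) for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Made-up sample rows: the first two are identical on all three key fields.
rows = [
    {"date": "2026-01-05", "amount": "12.40", "description": "DISCOUNT DRUG MART 32"},
    {"date": "2026-01-05", "amount": "12.40", "description": "DISCOUNT DRUG MART 32"},
    {"date": "2026-01-06", "amount": "12.40", "description": "DISCOUNT DRUG MART 32"},
]
unique_rows = dedupe(rows, ["date", "amount", "description"])
```

Choosing more key fields makes the key stricter, so fewer rows are treated as duplicates.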

2. Rules

Extract information using regex patterns (extract or replace modes).

Example: Extract merchant name from transaction description
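The difference between the two modes can be sketched with Python's `re` module. This is a rough illustration under stated assumptions: the function names `apply_extract` and `apply_replace`, the patterns, and the sample description are hypothetical, and Dataflow's own regex dialect and semantics may differ.

```python
import re


def apply_extract(pattern, text):
    """Extract mode: return the first capture group (or the whole match), else None."""
    match = re.search(pattern, text)
    if match is None:
        return None
    return match.group(1) if match.groups() else match.group(0)


def apply_replace(pattern, replacement, text):
    """Replace mode: return text with every match of pattern replaced."""
    return re.sub(pattern, replacement, text)


# Hypothetical transaction description.
description = "DISCOUNT DRUG MART 32 MEDINA OH"

# Extract: capture everything before the store number.
merchant = apply_extract(r"^(.*?)\s+\d+\b", description)

# Replace: strip the store number and everything after it.
cleaned = apply_replace(r"\s+\d+\b.*$", "", description)
```

Extract mode pulls one piece out of the text, while replace mode rewrites the text and keeps the rest.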

3. Mappings

Map extracted values to clean, standardized output.

Example: "DISCOUNT DRUG MART 32" → {"vendor": "Discount Drug Mart", "category": "Healthcare"}
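Conceptually, a mapping is a lookup from an extracted value to a standardized output object. The Python sketch below uses the example above; the dictionary layout and the helper `apply_mapping` are assumptions for illustration and do not describe how Dataflow stores mappings.

```python
# Lookup table from extracted value to standardized output (illustrative only).
MAPPINGS = {
    "DISCOUNT DRUG MART 32": {"vendor": "Discount Drug Mart", "category": "Healthcare"},
}


def apply_mapping(extracted_value, mappings):
    """Return the mapped output for a value, or None if it has no mapping yet."""
    return mappings.get(extracted_value)


mapped = apply_mapping("DISCOUNT DRUG MART 32", MAPPINGS)
unmapped = apply_mapping("ACME HARDWARE 7", MAPPINGS)  # hypothetical unmapped value
```

A value that returns `None` here is one that still needs a mapping.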

Architecture

  • Database: PostgreSQL with JSONB for flexible data storage
  • API: Node.js/Express REST API
  • UI: React SPA served from public/
  • Auth: HTTP Basic auth (configured in .env)

Design Principles

  • Simple & Clear - Easy to understand what's happening
  • Explicit - No hidden magic or complex triggers
  • Flexible - Handle varying data formats without schema changes

Getting Started

Prerequisites

  • PostgreSQL 12+
  • Node.js 18+
  • Python 3 (for manage.py)

Installation

  1. Install Node dependencies:
     npm install
  2. Run the management script to configure and deploy everything:
     python3 manage.py

For development with auto-reload:

npm run dev

The UI is available at http://localhost:3020. The API is at http://localhost:3020/api (port set by API_PORT in .env).
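As a rough sketch of how the port setting relates to the two URLs, the Python below parses `.env`-style text and derives the UI and API base URLs. Only the `API_PORT` key comes from this README; the helpers `read_env` and `base_urls`, the simplified parsing, and the fallback to 3020 when the key is absent are assumptions.

```python
def read_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def base_urls(env, host="localhost", default_port="3020"):
    """Build the UI and API base URLs from API_PORT (fallback port is an assumption)."""
    port = env.get("API_PORT", default_port)
    ui = f"http://{host}:{port}"
    return ui, f"{ui}/api"


# Sample text standing in for a real .env file.
sample_env = "# sample only\nAPI_PORT=3020\n"
ui_url, api_url = base_urls(read_env(sample_env))
```

In practice you would read the text from your own `.env` file instead of the inline sample.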

Management Script (manage.py)

manage.py is an interactive tool for configuring, deploying, and managing the service. Run it and choose from the numbered menu:

python3 manage.py
#  Action
1  Database configuration: create/update .env, optionally create the PostgreSQL user/database, and deploy schema + functions
2  Redeploy schema only (database/schema.sql); drops and recreates all tables
3  Redeploy SQL functions only (database/queries/)
4  Build UI (ui/public/)
5  Set up nginx reverse proxy (HTTP or HTTPS via certbot)
6  Install systemd service unit (dataflow.service)
7  Start / restart dataflow.service
8  Stop dataflow.service
9  Set login credentials (LOGIN_USER / LOGIN_PASSWORD_HASH in .env)

The status screen at the top of the menu shows the current state of each component (database connection, schema, UI build, service, nginx).

Typical first-time setup: run options 1 → 4 → 9 → 6 → 7 (→ 5 if you want nginx).

API Reference

All /api routes require HTTP Basic authentication.
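A minimal sketch of an authenticated request, using only the Python standard library, might look like the following. The helper name `basic_auth_request` and the credentials are placeholders; the code builds the request without sending it, so it runs without a server.

```python
import base64
import urllib.request


def basic_auth_request(url, user, password, method="GET"):
    """Build (but do not send) a request carrying an HTTP Basic auth header."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url, method=method)
    request.add_header("Authorization", f"Basic {token}")
    return request


# Placeholder credentials; use the login configured for your own deployment.
request = basic_auth_request("http://localhost:3020/api/sources", "alice", "s3cret")

# To send it against a running server:
#     with urllib.request.urlopen(request) as response:
#         print(response.read().decode("utf-8"))
```

Basic auth sends credentials base64-encoded rather than encrypted, so use HTTPS when the API is reachable beyond localhost.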

Sources — /api/sources

Method  Path                               Description
GET     /api/sources                       List all sources
POST    /api/sources                       Create a source
GET     /api/sources/:name                 Get a source
PUT     /api/sources/:name                 Update a source
DELETE  /api/sources/:name                 Delete a source
POST    /api/sources/suggest               Suggest source definition from CSV upload
POST    /api/sources/:name/import          Import CSV data and auto-apply transformations to new records
GET     /api/sources/:name/import-log      View import history (includes inserted_keys / excluded_keys in info)
DELETE  /api/sources/:name/import-log/:id  Delete an import batch and all its records
POST    /api/sources/:name/transform       Apply rules and mappings to any untransformed records
POST    /api/sources/:name/reprocess       Re-transform all records
GET     /api/sources/:name/fields          List all known field names
GET     /api/sources/:name/stats           Get record and mapping counts
POST    /api/sources/:name/view            Generate output view
GET     /api/sources/:name/view-data       Query output view (paginated, sortable)

Rules — /api/rules

Method  Path                            Description
GET     /api/rules/source/:source_name  List rules for a source
POST    /api/rules                      Create a rule
GET     /api/rules/:id                  Get a rule
PUT     /api/rules/:id                  Update a rule
DELETE  /api/rules/:id                  Delete a rule
GET     /api/rules/preview              Preview a pattern against real records (ad-hoc)
GET     /api/rules/:id/test             Test a saved rule against real records

Mappings — /api/mappings

Method  Path                                          Description
GET     /api/mappings/source/:source_name             List mappings
POST    /api/mappings                                 Create a mapping
POST    /api/mappings/bulk                            Bulk create/update mappings
GET     /api/mappings/:id                             Get a mapping
PUT     /api/mappings/:id                             Update a mapping
DELETE  /api/mappings/:id                             Delete a mapping
GET     /api/mappings/source/:source_name/unmapped    Get values with no mapping yet
GET     /api/mappings/source/:source_name/all-values  All extracted values with counts
GET     /api/mappings/source/:source_name/counts      Record counts for existing mappings
GET     /api/mappings/source/:source_name/export.tsv  Export values as TSV
POST    /api/mappings/source/:source_name/import-csv  Import mappings from TSV

Records — /api/records

Method  Path                                  Description
GET     /api/records/source/:source_name      List records (paginated)
GET     /api/records/:id                      Get a single record
POST    /api/records/search                   Search records
DELETE  /api/records/:id                      Delete a record
DELETE  /api/records/source/:source_name/all  Delete all records for a source

Stacks — /api/stacks

Method  Path                           Description
GET     /api/stacks                    List all stacks
POST    /api/stacks                    Create a stack
GET     /api/stacks/:name              Get a stack
PUT     /api/stacks/:name              Update a stack
DELETE  /api/stacks/:name              Delete a stack
GET     /api/stacks/:name/view-data    Query stacked data (paginated)
GET     /api/stacks/:name/layouts      List saved pivot layouts
POST    /api/stacks/:name/layouts      Save a pivot layout
DELETE  /api/stacks/:name/layouts/:id  Delete a pivot layout

Typical Workflow

1. Create a source (POST /api/sources)
2. Create transformation rules (POST /api/rules)
3. Import CSV data (POST /api/sources/:name/import) — transformations applied automatically to new records
4. Preview rules against real data (GET /api/rules/preview)
5. Review unmapped values (GET /api/mappings/source/:source_name/unmapped)
6. Add mappings (POST /api/mappings or bulk import via TSV)
7. Reprocess to apply new mappings (POST /api/sources/:name/reprocess)
8. Query results (GET /api/sources/:name/view-data)
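The steps above can be written down as an ordered list of method and path pairs, sketched here in Python. The helper `workflow_steps` and the source name "bank transactions" are hypothetical; request bodies and query parameters are left out because this README does not specify them.

```python
from urllib.parse import quote


def workflow_steps(source_name):
    """Return the typical workflow as (method, path) pairs for one source."""
    # Percent-encode the name so spaces and slashes are safe in a URL path.
    name = quote(source_name, safe="")
    return [
        ("POST", "/api/sources"),
        ("POST", "/api/rules"),
        ("POST", f"/api/sources/{name}/import"),
        ("GET", "/api/rules/preview"),
        ("GET", f"/api/mappings/source/{name}/unmapped"),
        ("POST", "/api/mappings"),
        ("POST", f"/api/sources/{name}/reprocess"),
        ("GET", f"/api/sources/{name}/view-data"),
    ]


steps = workflow_steps("bank transactions")
for number, (method, path) in enumerate(steps, start=1):
    print(f"{number}. {method} {path}")
```

Each pair still needs the API base URL in front of it and HTTP Basic credentials before it can be sent.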

See examples/GETTING_STARTED.md for a complete walkthrough with curl examples.

Project Structure

dataflow/
├── database/
│   ├── schema.sql           # Table definitions
│   └── queries/             # SQL functions, one file per route
│       ├── sources.sql
│       ├── rules.sql
│       ├── mappings.sql
│       ├── records.sql
│       ├── stacks.sql
│       └── status.sql
├── api/
│   ├── server.js            # Express server
│   ├── middleware/
│   │   └── auth.js          # Basic auth middleware
│   ├── lib/
│   │   └── sql.js           # SQL literal helpers
│   └── routes/
│       ├── sources.js
│       ├── rules.js
│       ├── mappings.js
│       ├── records.js
│       ├── stacks.js
│       └── status.js
├── public/                  # Built React UI (served as static files)
├── examples/
│   ├── GETTING_STARTED.md
│   └── bank_transactions.csv
└── .env.example

License

MIT