dataflow/README.md
Paul Trowbridge efa65d8409 Update all docs to reflect current state
- perspective-pivot.md: npm install pattern, v4.5.1/v4.4.1 versions
- README.md: Node 18+, port 3020, add stacks routes, fix project structure
- SPEC.md: add stacks/status routes, pages, SQL functions; update Perspective version
- ui/README.md: replace Vite boilerplate with project-specific content
- Remove docs/refactor-transformed-split.md (completed work)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-12 23:51:00 -04:00

222 lines
8.0 KiB
Markdown

# Dataflow
A simple data transformation tool for importing, cleaning, and standardizing data from various sources.
## What It Does
Dataflow helps you:
1. **Import** CSV data with automatic deduplication
2. **Transform** data using regex rules to extract meaningful information
3. **Map** extracted values to standardized output
4. **Query** the transformed data via a web UI or REST API
Perfect for cleaning up messy data like bank transactions, product lists, or any repetitive data that needs normalization.
## Core Concepts
### 1. Sources
Define where data comes from and how to deduplicate it.
**Example:** Bank transactions deduplicated by date + amount + description
### 2. Rules
Extract information using regex patterns (`extract` or `replace` modes).
**Example:** Extract merchant name from transaction description
### 3. Mappings
Map extracted values to clean, standardized output.
**Example:** "DISCOUNT DRUG MART 32" → `{"vendor": "Discount Drug Mart", "category": "Healthcare"}`
## Architecture
- **Database:** PostgreSQL with JSONB for flexible data storage
- **API:** Node.js/Express REST API
- **UI:** React SPA served from `public/`
- **Auth:** HTTP Basic auth (configured in `.env`)
## Design Principles
- **Simple & Clear** - Easy to understand what's happening
- **Explicit** - No hidden magic or complex triggers
- **Flexible** - Handle varying data formats without schema changes
## Getting Started
### Prerequisites
- PostgreSQL 12+
- Node.js 18+
- Python 3 (for `manage.py`)
### Installation
1. Install Node dependencies:
```bash
npm install
```
2. Run the management script to configure and deploy everything:
```bash
python3 manage.py
```
For development with auto-reload:
```bash
npm run dev
```
The UI is available at `http://localhost:3020`. The API is at `http://localhost:3020/api` (port set by `API_PORT` in `.env`).
## Management Script (`manage.py`)
`manage.py` is an interactive tool for configuring, deploying, and managing the service. Run it and choose from the numbered menu:
```
python3 manage.py
```
| # | Action |
|---|--------|
| 1 | **Database configuration** — create/update `.env`, optionally create the PostgreSQL user/database, and deploy schema + functions |
| 2 | Redeploy schema only (`database/schema.sql`) — drops and recreates all tables |
| 3 | Redeploy SQL functions only (`database/queries/`) |
| 4 | Build UI (`ui/` → `public/`) |
| 5 | Set up nginx reverse proxy (HTTP or HTTPS via certbot) |
| 6 | Install systemd service unit (`dataflow.service`) |
| 7 | Start / restart `dataflow.service` |
| 8 | Stop `dataflow.service` |
| 9 | Set login credentials (`LOGIN_USER` / `LOGIN_PASSWORD_HASH` in `.env`) |
The status screen at the top of the menu shows the current state of each component (database connection, schema, UI build, service, nginx).
**Typical first-time setup:** run options 1 → 4 → 9 → 6 → 7 (→ 5 if you want nginx).
## API Reference
All `/api` routes require HTTP Basic authentication.
### Sources — `/api/sources`
| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/sources` | List all sources |
| POST | `/api/sources` | Create a source |
| GET | `/api/sources/:name` | Get a source |
| PUT | `/api/sources/:name` | Update a source |
| DELETE | `/api/sources/:name` | Delete a source |
| POST | `/api/sources/suggest` | Suggest source definition from CSV upload |
| POST | `/api/sources/:name/import` | Import CSV data and auto-apply transformations to new records |
| GET | `/api/sources/:name/import-log` | View import history (includes `inserted_keys` / `excluded_keys` in `info`) |
| DELETE | `/api/sources/:name/import-log/:id` | Delete an import batch and all its records |
| POST | `/api/sources/:name/transform` | Apply rules and mappings to any untransformed records |
| POST | `/api/sources/:name/reprocess` | Re-transform all records |
| GET | `/api/sources/:name/fields` | List all known field names |
| GET | `/api/sources/:name/stats` | Get record and mapping counts |
| POST | `/api/sources/:name/view` | Generate output view |
| GET | `/api/sources/:name/view-data` | Query output view (paginated, sortable) |
### Rules — `/api/rules`
| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/rules/source/:source_name` | List rules for a source |
| POST | `/api/rules` | Create a rule |
| GET | `/api/rules/:id` | Get a rule |
| PUT | `/api/rules/:id` | Update a rule |
| DELETE | `/api/rules/:id` | Delete a rule |
| GET | `/api/rules/preview` | Preview a pattern against real records (ad-hoc) |
| GET | `/api/rules/:id/test` | Test a saved rule against real records |
### Mappings — `/api/mappings`
| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/mappings/source/:source_name` | List mappings |
| POST | `/api/mappings` | Create a mapping |
| POST | `/api/mappings/bulk` | Bulk create/update mappings |
| GET | `/api/mappings/:id` | Get a mapping |
| PUT | `/api/mappings/:id` | Update a mapping |
| DELETE | `/api/mappings/:id` | Delete a mapping |
| GET | `/api/mappings/source/:source_name/unmapped` | Get values with no mapping yet |
| GET | `/api/mappings/source/:source_name/all-values` | All extracted values with counts |
| GET | `/api/mappings/source/:source_name/counts` | Record counts for existing mappings |
| GET | `/api/mappings/source/:source_name/export.tsv` | Export values as TSV |
| POST | `/api/mappings/source/:source_name/import-csv` | Import mappings from TSV |
### Records — `/api/records`
| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/records/source/:source_name` | List records (paginated) |
| GET | `/api/records/:id` | Get a single record |
| POST | `/api/records/search` | Search records |
| DELETE | `/api/records/:id` | Delete a record |
| DELETE | `/api/records/source/:source_name/all` | Delete all records for a source |
### Stacks — `/api/stacks`
| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/stacks` | List all stacks |
| POST | `/api/stacks` | Create a stack |
| GET | `/api/stacks/:name` | Get a stack |
| PUT | `/api/stacks/:name` | Update a stack |
| DELETE | `/api/stacks/:name` | Delete a stack |
| GET | `/api/stacks/:name/view-data` | Query stacked data (paginated) |
| GET | `/api/stacks/:name/layouts` | List saved pivot layouts |
| POST | `/api/stacks/:name/layouts` | Save a pivot layout |
| DELETE | `/api/stacks/:name/layouts/:id` | Delete a pivot layout |
## Typical Workflow
```
1. Create a source (POST /api/sources)
2. Create transformation rules (POST /api/rules)
3. Import CSV data (POST /api/sources/:name/import) — transformations applied automatically to new records
4. Preview rules against real data (GET /api/rules/preview)
5. Review unmapped values (GET /api/mappings/source/:name/unmapped)
6. Add mappings (POST /api/mappings or bulk import via TSV)
7. Reprocess to apply new mappings (POST /api/sources/:name/reprocess)
8. Query results (GET /api/sources/:name/view-data)
```
See `examples/GETTING_STARTED.md` for a complete walkthrough with curl examples.
## Project Structure
```
dataflow/
├── database/
│ ├── schema.sql # Table definitions
│ └── queries/ # SQL functions, one file per route
│ ├── sources.sql
│ ├── rules.sql
│ ├── mappings.sql
│ ├── records.sql
│ ├── stacks.sql
│ └── status.sql
├── api/
│ ├── server.js # Express server
│ ├── middleware/
│ │ └── auth.js # Basic auth middleware
│ ├── lib/
│ │ └── sql.js # SQL literal helpers
│ └── routes/
│ ├── sources.js
│ ├── rules.js
│ ├── mappings.js
│ ├── records.js
│ ├── stacks.js
│ └── status.js
├── public/ # Built React UI (served as static files)
├── examples/
│ ├── GETTING_STARTED.md
│ └── bank_transactions.csv
└── .env.example
```
## License
MIT