
# Dataflow
A simple data transformation tool for importing, cleaning, and standardizing data from various sources.
## What It Does
Dataflow helps you:
1. **Import** CSV data with automatic deduplication
2. **Transform** data using regex rules to extract meaningful information
3. **Map** extracted values to standardized output
4. **Query** the transformed data via a web UI or REST API
Perfect for cleaning up messy data like bank transactions, product lists, or any repetitive data that needs normalization.
## Core Concepts
### 1. Sources
Define where data comes from and how to deduplicate it.
**Example:** Bank transactions deduplicated by date + amount + description
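
As an illustrative sketch of how deduplication over those fields might work (the actual key columns are configured per source, and the field names below are hypothetical):

```python
import hashlib

def dedup_key(record, key_fields):
    """Build a stable hash over the configured key fields of a record."""
    parts = [str(record.get(f, "")) for f in key_fields]
    return hashlib.sha256("|".join(parts).encode()).hexdigest()

# Hypothetical bank-transaction rows; the first two are duplicates.
rows = [
    {"date": "2024-01-05", "amount": "-12.99", "description": "DISCOUNT DRUG MART 32"},
    {"date": "2024-01-05", "amount": "-12.99", "description": "DISCOUNT DRUG MART 32"},
    {"date": "2024-01-06", "amount": "-45.00", "description": "ACME GROCERY"},
]
seen, unique = set(), []
for row in rows:
    key = dedup_key(row, ["date", "amount", "description"])
    if key not in seen:
        seen.add(key)
        unique.append(row)

print(len(unique))  # 2 of the 3 rows survive
```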
### 2. Rules
Extract information using regex patterns (`extract` or `replace` modes).
**Example:** Extract merchant name from transaction description
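
A minimal sketch of the two rule modes (the real rules live in the database and are applied server-side; these patterns are illustrative only):

```python
import re

description = "DISCOUNT DRUG MART 32 COLUMBUS OH"

# "extract" mode: pull a value out of the text with a capture group.
m = re.search(r"^(.*?)\s+\d+", description)  # merchant name before the store number
merchant = m.group(1) if m else None
print(merchant)  # DISCOUNT DRUG MART

# "replace" mode: rewrite the text with a substitution.
cleaned = re.sub(r"\s+\d+\s+\w+\s+\w{2}$", "", description)
print(cleaned)  # DISCOUNT DRUG MART
```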
### 3. Mappings
Map extracted values to clean, standardized output.
**Example:** "DISCOUNT DRUG MART 32" → `{"vendor": "Discount Drug Mart", "category": "Healthcare"}`
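
Conceptually, a mapping is a lookup from an extracted value to standardized output fields; a sketch (the mapping table here is hypothetical):

```python
# Hypothetical mapping table: extracted value -> standardized output fields.
mappings = {
    "DISCOUNT DRUG MART": {"vendor": "Discount Drug Mart", "category": "Healthcare"},
    "ACME GROCERY": {"vendor": "Acme Grocery", "category": "Food"},
}

def apply_mapping(extracted_value):
    # Unmapped values fall through so they can be reviewed and mapped later.
    return mappings.get(extracted_value, {"vendor": extracted_value, "category": "Unmapped"})

print(apply_mapping("DISCOUNT DRUG MART"))
```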
## Architecture
- **Database:** PostgreSQL with JSONB for flexible data storage
- **API:** Node.js/Express REST API
- **UI:** React SPA served from `public/`
- **Auth:** HTTP Basic auth (configured in `.env`)
## Design Principles
- **Simple & Clear** - Easy to understand what's happening
- **Explicit** - No hidden magic or complex triggers
- **Flexible** - Handle varying data formats without schema changes
## Getting Started
### Prerequisites
- PostgreSQL 12+
- Node.js 16+
- Python 3 (for `manage.py`)
### Installation
1. Install Node dependencies:
```bash
npm install
```
2. Run the management script to configure and deploy everything:
```bash
python3 manage.py
```
For development with auto-reload:
```bash
npm run dev
```
The UI is available at `http://localhost:3000`. The API is at `http://localhost:3000/api`.
## Management Script (`manage.py`)
`manage.py` is an interactive tool for configuring, deploying, and managing the service. Run it and choose from the numbered menu:
```bash
python3 manage.py
```
| # | Action |
|---|--------|
| 1 | **Database configuration** — create/update `.env`, optionally create the PostgreSQL user/database, and deploy schema + functions |
| 2 | Redeploy schema only (`database/schema.sql`) — drops and recreates all tables |
| 3 | Redeploy SQL functions only (`database/queries/`) |
| 4 | Build UI (`ui/` → `public/`) |
| 5 | Set up nginx reverse proxy (HTTP or HTTPS via certbot) |
| 6 | Install systemd service unit (`dataflow.service`) |
| 7 | Start / restart `dataflow.service` |
| 8 | Stop `dataflow.service` |
| 9 | Set login credentials (`LOGIN_USER` / `LOGIN_PASSWORD_HASH` in `.env`) |
The status screen at the top of the menu shows the current state of each component (database connection, schema, UI build, service, nginx).
**Typical first-time setup:** run options 1 → 4 → 9 → 6 → 7 (→ 5 if you want nginx).
## API Reference
All `/api` routes require HTTP Basic authentication.
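Every request needs a Basic `Authorization` header. A sketch of building one in Python (the credentials are whatever you set via `manage.py` option 9; the values below are placeholders):

```python
import base64
import urllib.request

user, password = "admin", "secret"  # placeholders; use your own .env credentials
token = base64.b64encode(f"{user}:{password}".encode()).decode()

req = urllib.request.Request(
    "http://localhost:3000/api/sources",
    headers={"Authorization": f"Basic {token}"},
)
print(req.get_header("Authorization"))
# Send with urllib.request.urlopen(req) once the service is running.
```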
### Sources — `/api/sources`
| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/sources` | List all sources |
| POST | `/api/sources` | Create a source |
| GET | `/api/sources/:name` | Get a source |
| PUT | `/api/sources/:name` | Update a source |
| DELETE | `/api/sources/:name` | Delete a source |
| POST | `/api/sources/suggest` | Suggest source definition from CSV upload |
| POST | `/api/sources/:name/import` | Import CSV data |
| GET | `/api/sources/:name/import-log` | View import history |
| POST | `/api/sources/:name/transform` | Apply rules and mappings to records |
| POST | `/api/sources/:name/reprocess` | Re-transform all records |
| GET | `/api/sources/:name/fields` | List all known field names |
| GET | `/api/sources/:name/stats` | Get record and mapping counts |
| POST | `/api/sources/:name/view` | Generate output view |
| GET | `/api/sources/:name/view-data` | Query output view (paginated, sortable) |
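
The exact request schema is not shown here; as an illustrative sketch, a `POST /api/sources` body might carry a name and the deduplication key fields (field names below are hypothetical — see `examples/GETTING_STARTED.md` for the real shape):

```python
import json

# Hypothetical payload for creating a source.
source = {
    "name": "bank_transactions",
    "dedupe_fields": ["date", "amount", "description"],
}
body = json.dumps(source).encode()
print(body.decode())
```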
### Rules — `/api/rules`
| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/rules/source/:source_name` | List rules for a source |
| POST | `/api/rules` | Create a rule |
| GET | `/api/rules/:id` | Get a rule |
| PUT | `/api/rules/:id` | Update a rule |
| DELETE | `/api/rules/:id` | Delete a rule |
| GET | `/api/rules/preview` | Preview a pattern against real records (ad-hoc) |
| GET | `/api/rules/:id/test` | Test a saved rule against real records |
### Mappings — `/api/mappings`
| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/mappings/source/:source_name` | List mappings |
| POST | `/api/mappings` | Create a mapping |
| POST | `/api/mappings/bulk` | Bulk create/update mappings |
| GET | `/api/mappings/:id` | Get a mapping |
| PUT | `/api/mappings/:id` | Update a mapping |
| DELETE | `/api/mappings/:id` | Delete a mapping |
| GET | `/api/mappings/source/:source_name/unmapped` | Get values with no mapping yet |
| GET | `/api/mappings/source/:source_name/all-values` | All extracted values with counts |
| GET | `/api/mappings/source/:source_name/counts` | Record counts for existing mappings |
| GET | `/api/mappings/source/:source_name/export.tsv` | Export values as TSV |
| POST | `/api/mappings/source/:source_name/import-csv` | Import mappings from TSV |
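
The export/import endpoints exchange tab-separated values; a sketch of parsing such a file client-side (the column layout here is an assumption, not taken from the API):

```python
import csv
import io

# Hypothetical TSV columns: extracted value, vendor, category.
tsv = (
    "DISCOUNT DRUG MART\tDiscount Drug Mart\tHealthcare\n"
    "ACME GROCERY\tAcme Grocery\tFood\n"
)
reader = csv.reader(io.StringIO(tsv), delimiter="\t")
parsed = [
    {"value": value, "vendor": vendor, "category": category}
    for value, vendor, category in reader
]
print(len(parsed))  # 2
```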
### Records — `/api/records`
| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/records/source/:source_name` | List records (paginated) |
| GET | `/api/records/:id` | Get a single record |
| POST | `/api/records/search` | Search records |
| DELETE | `/api/records/:id` | Delete a record |
| DELETE | `/api/records/source/:source_name/all` | Delete all records for a source |
## Typical Workflow
```
1. Create a source (POST /api/sources)
2. Import CSV data (POST /api/sources/:name/import)
3. Create transformation rules (POST /api/rules)
4. Preview rules against real data (GET /api/rules/preview)
5. Apply transformations (POST /api/sources/:name/transform)
6. Review unmapped values (GET /api/mappings/source/:name/unmapped)
7. Add mappings (POST /api/mappings or bulk import via TSV)
8. Reprocess to apply new mappings (POST /api/sources/:name/reprocess)
9. Query results (GET /api/sources/:name/view-data)
```
See `examples/GETTING_STARTED.md` for a complete walkthrough with curl examples.
## Project Structure
```
dataflow/
├── database/
│   ├── schema.sql            # Table definitions
│   └── functions.sql         # Import/transform/query functions
├── api/
│   ├── server.js             # Express server
│   ├── middleware/
│   │   └── auth.js           # Basic auth middleware
│   ├── lib/
│   │   └── sql.js            # SQL literal helpers
│   └── routes/
│       ├── sources.js
│       ├── rules.js
│       ├── mappings.js
│       └── records.js
├── public/                   # Built React UI (served as static files)
├── examples/
│   ├── GETTING_STARTED.md
│   └── bank_transactions.csv
└── .env.example
```
## License
MIT