Update README to reflect current state of the project

Documents the `manage.py` menu, adds full API reference tables, fixes
an incorrect route in the quick example, and removes stale sections
(the `docs/` directory and the initial-development status note).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Paul Trowbridge 2026-04-05 22:42:49 -04:00
parent 291c665ed1
commit 3cc8bc635a

README.md
# Dataflow

A simple data transformation tool for importing, cleaning, and standardizing data from various sources.

## What It Does

Dataflow helps you:

1. **Import** CSV data with automatic deduplication
2. **Transform** data using regex rules to extract meaningful information
3. **Map** extracted values to standardized output
4. **Query** the transformed data via a web UI or REST API

Perfect for cleaning up messy data like bank transactions, product lists, or any repetitive data that needs normalization.
### 1. Sources

Define where data comes from and how to deduplicate it.

**Example:** Bank transactions deduplicated by date + amount + description
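The dedup idea can be pictured as keying each incoming row by its configured dedup fields. A minimal Node sketch, assuming rows are plain objects; the helper names here are illustrative, not the project's actual import code:

```javascript
// Hypothetical sketch: skip rows whose dedup key has already been seen.
const dedupFields = ["date", "amount", "description"];

function dedupKey(row) {
  // Join the configured fields into one composite key.
  return dedupFields.map((f) => String(row[f])).join("|");
}

function dedupe(rows) {
  const seen = new Set();
  return rows.filter((row) => {
    const key = dedupKey(row);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}

const rows = [
  { date: "2026-01-02", amount: "12.50", description: "DISCOUNT DRUG MART 32" },
  { date: "2026-01-02", amount: "12.50", description: "DISCOUNT DRUG MART 32" },
  { date: "2026-01-03", amount: "7.10", description: "COFFEE SHOP" },
];
console.log(dedupe(rows).length); // 2
```

In the real tool the key is checked in PostgreSQL at import time; this sketch only shows the concept.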
### 2. Rules

Extract information using regex patterns (`extract` or `replace` modes).

**Example:** Extract merchant name from transaction description
### 3. Mappings

Map extracted values to clean, standardized output.

**Example:** "DISCOUNT DRUG MART 32" → `{"vendor": "Discount Drug Mart", "category": "Healthcare"}`
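Conceptually a mapping is a lookup from an extracted value to a standardized output object. A minimal sketch; the real mappings live in PostgreSQL, so this in-memory shape is only an illustration:

```javascript
// Hypothetical in-memory stand-in for the mappings table.
const mappings = {
  "DISCOUNT DRUG MART": { vendor: "Discount Drug Mart", category: "Healthcare" },
  "COFFEE SHOP": { vendor: "Coffee Shop", category: "Dining" },
};

function applyMapping(extractedValue) {
  // Values with no mapping surface later via the "unmapped" endpoint.
  return mappings[extractedValue] ?? null;
}

console.log(applyMapping("DISCOUNT DRUG MART"));
```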
## Architecture

- **Database:** PostgreSQL with JSONB for flexible data storage
- **API:** Node.js/Express REST API
- **UI:** React SPA served from `public/`
- **Auth:** HTTP Basic auth (configured in `.env`)

## Design Principles

- **Simple & Clear** - Easy to understand what's happening
- **Explicit** - No hidden magic or complex triggers
- **Flexible** - Handle varying data formats without schema changes
## Getting Started

### Prerequisites

- PostgreSQL 12+
- Node.js 16+
- Python 3 (for `manage.py`)

### Installation

1. Install Node dependencies:
```bash
npm install
```
2. Run the management script to configure and deploy everything:
```bash
python3 manage.py
```

For development with auto-reload:
```bash
npm run dev
```

The UI is available at `http://localhost:3000`. The API is at `http://localhost:3000/api`.

## Management Script (`manage.py`)

`manage.py` is an interactive tool for configuring, deploying, and managing the service. Run it and choose from the numbered menu:

```
python3 manage.py
```
| # | Action |
|---|--------|
| 1 | **Database configuration** — create/update `.env`, optionally create the PostgreSQL user/database, and deploy schema + functions |
| 2 | Redeploy schema only (`database/schema.sql`) — drops and recreates all tables |
| 3 | Redeploy SQL functions only (`database/queries/`) |
| 4 | Build UI (`ui/` → `public/`) |
| 5 | Set up nginx reverse proxy (HTTP or HTTPS via certbot) |
| 6 | Install systemd service unit (`dataflow.service`) |
| 7 | Start / restart `dataflow.service` |
| 8 | Stop `dataflow.service` |
| 9 | Set login credentials (`LOGIN_USER` / `LOGIN_PASSWORD_HASH` in `.env`) |

The status screen at the top of the menu shows the current state of each component (database connection, schema, UI build, service, nginx).

**Typical first-time setup:** run options 1 → 4 → 9 → 6 → 7 (→ 5 if you want nginx).

## API Reference

All `/api` routes require HTTP Basic authentication.
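For scripted access, the Basic auth header is the base64 encoding of `user:password`. A small Node sketch; the credentials below are placeholders, not project defaults:

```javascript
// Build an HTTP Basic Authorization header for API calls.
function basicAuthHeader(user, password) {
  const token = Buffer.from(`${user}:${password}`).toString("base64");
  return `Basic ${token}`;
}

// "admin" / "secret" are placeholder credentials.
console.log(basicAuthHeader("admin", "secret")); // "Basic YWRtaW46c2VjcmV0"
```

A client then sends this as the `Authorization` header; with curl the equivalent is `curl -u user:password`.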
### Sources — `/api/sources`
| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/sources` | List all sources |
| POST | `/api/sources` | Create a source |
| GET | `/api/sources/:name` | Get a source |
| PUT | `/api/sources/:name` | Update a source |
| DELETE | `/api/sources/:name` | Delete a source |
| POST | `/api/sources/suggest` | Suggest source definition from CSV upload |
| POST | `/api/sources/:name/import` | Import CSV data |
| GET | `/api/sources/:name/import-log` | View import history |
| POST | `/api/sources/:name/transform` | Apply rules and mappings to records |
| POST | `/api/sources/:name/reprocess` | Re-transform all records |
| GET | `/api/sources/:name/fields` | List all known field names |
| GET | `/api/sources/:name/stats` | Get record and mapping counts |
| POST | `/api/sources/:name/view` | Generate output view |
| GET | `/api/sources/:name/view-data` | Query output view (paginated, sortable) |
### Rules — `/api/rules`
| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/rules/source/:source_name` | List rules for a source |
| POST | `/api/rules` | Create a rule |
| GET | `/api/rules/:id` | Get a rule |
| PUT | `/api/rules/:id` | Update a rule |
| DELETE | `/api/rules/:id` | Delete a rule |
| GET | `/api/rules/preview` | Preview a pattern against real records (ad-hoc) |
| GET | `/api/rules/:id/test` | Test a saved rule against real records |
### Mappings — `/api/mappings`
| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/mappings/source/:source_name` | List mappings |
| POST | `/api/mappings` | Create a mapping |
| POST | `/api/mappings/bulk` | Bulk create/update mappings |
| GET | `/api/mappings/:id` | Get a mapping |
| PUT | `/api/mappings/:id` | Update a mapping |
| DELETE | `/api/mappings/:id` | Delete a mapping |
| GET | `/api/mappings/source/:source_name/unmapped` | Get values with no mapping yet |
| GET | `/api/mappings/source/:source_name/all-values` | All extracted values with counts |
| GET | `/api/mappings/source/:source_name/counts` | Record counts for existing mappings |
| GET | `/api/mappings/source/:source_name/export.tsv` | Export values as TSV |
| POST | `/api/mappings/source/:source_name/import-csv` | Import mappings from TSV |
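The TSV round trip (export, edit in a spreadsheet, re-import) amounts to simple tab-splitting. The column layout below, an extracted value followed by a JSON output object, is an assumption for illustration only, not the documented export format:

```javascript
// Hypothetical TSV layout: <extracted value> \t <JSON output object>
const tsv = [
  'DISCOUNT DRUG MART\t{"vendor":"Discount Drug Mart","category":"Healthcare"}',
  'COFFEE SHOP\t{"vendor":"Coffee Shop","category":"Dining"}',
].join("\n");

function parseMappingsTsv(text) {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => {
      const [value, json] = line.split("\t");
      return { value, output: JSON.parse(json) };
    });
}

console.log(parseMappingsTsv(tsv).length); // 2
```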
### Records — `/api/records`
| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/records/source/:source_name` | List records (paginated) |
| GET | `/api/records/:id` | Get a single record |
| POST | `/api/records/search` | Search records |
| DELETE | `/api/records/:id` | Delete a record |
| DELETE | `/api/records/source/:source_name/all` | Delete all records for a source |
## Typical Workflow

1. Create a source (`POST /api/sources`)
2. Import CSV data (`POST /api/sources/:name/import`)
3. Create transformation rules (`POST /api/rules`)
4. Preview rules against real data (`GET /api/rules/preview`)
5. Apply transformations (`POST /api/sources/:name/transform`)
6. Review unmapped values (`GET /api/mappings/source/:name/unmapped`)
7. Add mappings (`POST /api/mappings` or bulk import via TSV)
8. Reprocess to apply new mappings (`POST /api/sources/:name/reprocess`)
9. Query results (`GET /api/sources/:name/view-data`)

See `examples/GETTING_STARTED.md` for a complete walkthrough with curl examples.
## Project Structure

```
dataflow/
├── database/
│   ├── schema.sql            # Table definitions
│   └── functions.sql         # Import/transform/query functions
├── api/
│   ├── server.js             # Express server
│   ├── middleware/
│   │   └── auth.js           # Basic auth middleware
│   ├── lib/
│   │   └── sql.js            # SQL literal helpers
│   └── routes/
│       ├── sources.js
│       ├── rules.js
│       ├── mappings.js
│       └── records.js
├── public/                   # Built React UI (served as static files)
├── examples/
│   ├── GETTING_STARTED.md
│   └── bank_transactions.csv
└── .env.example
```
## License

MIT