Update README to reflect current state of the project
Documents manage.py menu, adds full API reference tables, fixes incorrect route in quick example, and removes stale sections (docs/ dir, initial development status).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
parent 291c665ed1
commit 3cc8bc635a
177
README.md
@@ -1,14 +1,14 @@
 # Dataflow
 
-A simple, understandable data transformation tool for ingesting, mapping, and transforming data from various sources.
+A simple data transformation tool for importing, cleaning, and standardizing data from various sources.
 
 ## What It Does
 
 Dataflow helps you:
-1. **Import** data from CSV files (or other formats)
+1. **Import** CSV data with automatic deduplication
 2. **Transform** data using regex rules to extract meaningful information
 3. **Map** extracted values to standardized output
-4. **Query** the transformed data
+4. **Query** the transformed data via a web UI or REST API
 
 Perfect for cleaning up messy data like bank transactions, product lists, or any repetitive data that needs normalization.
 
@@ -20,26 +20,26 @@ Define where data comes from and how to deduplicate it.
 **Example:** Bank transactions deduplicated by date + amount + description
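The deduplication example above can be illustrated with a minimal sketch. It assumes each imported row is a plain object and that duplicates are found by comparing a key built from the configured fields. The helpers `dedupKey` and `dedupe` are hypothetical names for illustration only; the project's real import logic lives in its database functions and may work differently.

```javascript
// Hypothetical helper: build a stable key from the configured dedup fields.
function dedupKey(row, dedupFields) {
  // JSON.stringify keeps field boundaries unambiguous, unlike plain joining.
  return JSON.stringify(dedupFields.map((field) => row[field] ?? null));
}

// Hypothetical helper: keep only the first row seen for each key.
function dedupe(rows, dedupFields) {
  const seen = new Set();
  return rows.filter((row) => {
    const key = dedupKey(row, dedupFields);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}

// Sample rows are made up; the second one repeats the first exactly.
const rows = [
  { date: "2024-01-05", amount: "-12.50", description: "DISCOUNT DRUG MART 32" },
  { date: "2024-01-05", amount: "-12.50", description: "DISCOUNT DRUG MART 32" },
  { date: "2024-01-06", amount: "-12.50", description: "DISCOUNT DRUG MART 32" },
];

const unique = dedupe(rows, ["date", "amount", "description"]);
console.log(unique.length); // 2
```

Building the key from an ordered list of field values means the same three fields always produce the same key, regardless of any extra columns in the CSV.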
 
 ### 2. Rules
-Extract information using regex patterns.
+Extract information using regex patterns (`extract` or `replace` modes).
 
 **Example:** Extract merchant name from transaction description
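As a rough illustration of the two rule modes, the sketch below applies a regex either to pull out a value or to rewrite one. Only the mode names `extract` and `replace` come from the README. The rule shape (`mode`, `pattern`, `replacement`) and the `applyRule` helper are assumptions made for this example, not the project's actual schema.

```javascript
// Hypothetical rule evaluator; the rule object shape is an assumption.
function applyRule(rule, value) {
  const regex = new RegExp(rule.pattern);
  if (rule.mode === "extract") {
    const match = regex.exec(value);
    if (!match) return null;
    // Prefer the first capture group, fall back to the whole match.
    return (match[1] ?? match[0]).trim();
  }
  if (rule.mode === "replace") {
    return value.replace(regex, rule.replacement ?? "");
  }
  throw new Error(`Unknown rule mode: ${rule.mode}`);
}

const description = "DISCOUNT DRUG MART 32";

// Extract the leading run of capital letters and spaces (the merchant name).
const extracted = applyRule(
  { mode: "extract", pattern: "^([A-Z][A-Z ]+)" },
  description
);

// Strip a trailing store number instead of extracting.
const replaced = applyRule(
  { mode: "replace", pattern: "\\s+\\d+$", replacement: "" },
  description
);

console.log(extracted); // DISCOUNT DRUG MART
console.log(replaced); // DISCOUNT DRUG MART
```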
 
 ### 3. Mappings
 Map extracted values to clean, standardized output.
 
-**Example:** "DISCOUNT DRUG MART 32" → {"vendor": "Discount Drug Mart", "category": "Healthcare"}
+**Example:** "DISCOUNT DRUG MART 32" → `{"vendor": "Discount Drug Mart", "category": "Healthcare"}`
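The mapping example above amounts to a lookup from an extracted value to a standardized object. The sketch below shows that idea with an in-memory `Map`. The `mappings` table and the `applyMapping` helper are hypothetical; the project stores mappings in PostgreSQL, and its lookup rules may differ.

```javascript
// Hypothetical in-memory mapping table keyed by the extracted value.
const mappings = new Map([
  ["DISCOUNT DRUG MART 32", { vendor: "Discount Drug Mart", category: "Healthcare" }],
]);

// Returns the standardized output, or null when the value has no mapping yet.
function applyMapping(extractedValue) {
  return mappings.get(extractedValue) ?? null;
}

const mapped = applyMapping("DISCOUNT DRUG MART 32");
const unmapped = applyMapping("ACME HARDWARE 7"); // made-up value with no mapping

console.log(mapped.vendor); // Discount Drug Mart
console.log(unmapped); // null
```

Returning `null` for unknown values keeps unmapped records easy to find, which is the kind of list a review step would work through.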
 
 ## Architecture
 
-- **Database:** PostgreSQL with JSONB for flexibility
-- **API:** Node.js/Express for REST endpoints
-- **Storage:** Raw data preserved, transformations are computed and stored
+- **Database:** PostgreSQL with JSONB for flexible data storage
+- **API:** Node.js/Express REST API
+- **UI:** React SPA served from `public/`
+- **Auth:** HTTP Basic auth (configured in `.env`)
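Since the API is protected by HTTP Basic auth, a client has to send an `Authorization` header with each request. The following is a minimal sketch of building that header in Node.js. The `basicAuthHeader` helper is a hypothetical name, and the credentials are placeholders rather than values from the project.

```javascript
// Build an HTTP Basic Authorization header value: "Basic " + base64("user:password").
function basicAuthHeader(user, password) {
  const token = Buffer.from(`${user}:${password}`, "utf8").toString("base64");
  return `Basic ${token}`;
}

// Placeholder credentials for illustration only.
const header = basicAuthHeader("admin", "secret");
console.log(header); // Basic YWRtaW46c2VjcmV0
```

The resulting string would be passed as the `Authorization` request header by whichever HTTP client is in use. Basic auth only encodes the credentials and does not encrypt them, so it is normally paired with HTTPS.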
 
 ## Design Principles
 
 - **Simple & Clear** - Easy to understand what's happening
 - **Explicit** - No hidden magic or complex triggers
-- **Testable** - Every function can be tested independently
 - **Flexible** - Handle varying data formats without schema changes
 
 ## Getting Started
@@ -47,74 +47,153 @@ Map extracted values to clean, standardized output.
 ### Prerequisites
 - PostgreSQL 12+
 - Node.js 16+
+- Python 3 (for `manage.py`)
 
 ### Installation
 
-1. Install dependencies:
+1. Install Node dependencies:
 ```bash
 npm install
 ```
 
-2. Configure database (copy .env.example to .env and edit):
+2. Run the management script to configure and deploy everything:
 ```bash
-cp .env.example .env
+python3 manage.py
 ```
 
-3. Deploy database schema:
+For development with auto-reload:
 ```bash
-psql -U postgres -d dataflow -f database/schema.sql
-psql -U postgres -d dataflow -f database/functions.sql
+npm run dev
 ```
 
-4. Start the API server:
-```bash
-npm start
+The UI is available at `http://localhost:3000`. The API is at `http://localhost:3000/api`.
+
+## Management Script (`manage.py`)
+
+`manage.py` is an interactive tool for configuring, deploying, and managing the service. Run it and choose from the numbered menu:
+
+```
+python3 manage.py
 ```
 
-## Quick Example
+| # | Action |
+|---|--------|
+| 1 | **Database configuration** — create/update `.env`, optionally create the PostgreSQL user/database, and deploy schema + functions |
+| 2 | Redeploy schema only (`database/schema.sql`) — drops and recreates all tables |
+| 3 | Redeploy SQL functions only (`database/queries/`) |
+| 4 | Build UI (`ui/` → `public/`) |
+| 5 | Set up nginx reverse proxy (HTTP or HTTPS via certbot) |
+| 6 | Install systemd service unit (`dataflow.service`) |
+| 7 | Start / restart `dataflow.service` |
+| 8 | Stop `dataflow.service` |
+| 9 | Set login credentials (`LOGIN_USER` / `LOGIN_PASSWORD_HASH` in `.env`) |
 
-```javascript
-// 1. Define a source
-POST /api/sources
-{
-"name": "bank_transactions",
-"dedup_fields": ["date", "amount", "description"]
-}
+The status screen at the top of the menu shows the current state of each component (database connection, schema, UI build, service, nginx).
 
-// 2. Create a transformation rule
-POST /api/sources/bank_transactions/rules
-{
-"name": "extract_merchant",
-"pattern": "^([A-Z][A-Z ]+)",
-"field": "description"
-}
+**Typical first-time setup:** run options 1 → 4 → 9 → 6 → 7 (→ 5 if you want nginx).
 
-// 3. Import data
-POST /api/sources/bank_transactions/import
-[CSV file upload]
+## API Reference
+
+All `/api` routes require HTTP Basic authentication.
 
+### Sources — `/api/sources`
+
+| Method | Path | Description |
+|--------|------|-------------|
+| GET | `/api/sources` | List all sources |
+| POST | `/api/sources` | Create a source |
+| GET | `/api/sources/:name` | Get a source |
+| PUT | `/api/sources/:name` | Update a source |
+| DELETE | `/api/sources/:name` | Delete a source |
+| POST | `/api/sources/suggest` | Suggest source definition from CSV upload |
+| POST | `/api/sources/:name/import` | Import CSV data |
+| GET | `/api/sources/:name/import-log` | View import history |
+| POST | `/api/sources/:name/transform` | Apply rules and mappings to records |
+| POST | `/api/sources/:name/reprocess` | Re-transform all records |
+| GET | `/api/sources/:name/fields` | List all known field names |
+| GET | `/api/sources/:name/stats` | Get record and mapping counts |
+| POST | `/api/sources/:name/view` | Generate output view |
+| GET | `/api/sources/:name/view-data` | Query output view (paginated, sortable) |
+
+### Rules — `/api/rules`
+
+| Method | Path | Description |
+|--------|------|-------------|
+| GET | `/api/rules/source/:source_name` | List rules for a source |
+| POST | `/api/rules` | Create a rule |
+| GET | `/api/rules/:id` | Get a rule |
+| PUT | `/api/rules/:id` | Update a rule |
+| DELETE | `/api/rules/:id` | Delete a rule |
+| GET | `/api/rules/preview` | Preview a pattern against real records (ad-hoc) |
+| GET | `/api/rules/:id/test` | Test a saved rule against real records |
+
+### Mappings — `/api/mappings`
+
+| Method | Path | Description |
+|--------|------|-------------|
+| GET | `/api/mappings/source/:source_name` | List mappings |
+| POST | `/api/mappings` | Create a mapping |
+| POST | `/api/mappings/bulk` | Bulk create/update mappings |
+| GET | `/api/mappings/:id` | Get a mapping |
+| PUT | `/api/mappings/:id` | Update a mapping |
+| DELETE | `/api/mappings/:id` | Delete a mapping |
+| GET | `/api/mappings/source/:source_name/unmapped` | Get values with no mapping yet |
+| GET | `/api/mappings/source/:source_name/all-values` | All extracted values with counts |
+| GET | `/api/mappings/source/:source_name/counts` | Record counts for existing mappings |
+| GET | `/api/mappings/source/:source_name/export.tsv` | Export values as TSV |
+| POST | `/api/mappings/source/:source_name/import-csv` | Import mappings from TSV |
+
+### Records — `/api/records`
+
+| Method | Path | Description |
+|--------|------|-------------|
+| GET | `/api/records/source/:source_name` | List records (paginated) |
+| GET | `/api/records/:id` | Get a single record |
+| POST | `/api/records/search` | Search records |
+| DELETE | `/api/records/:id` | Delete a record |
+| DELETE | `/api/records/source/:source_name/all` | Delete all records for a source |
+
-// 4. Query transformed data
-GET /api/sources/bank_transactions/records
+## Typical Workflow
+
 ```
+1. Create a source (POST /api/sources)
+2. Import CSV data (POST /api/sources/:name/import)
+3. Create transformation rules (POST /api/rules)
+4. Preview rules against real data (GET /api/rules/preview)
+5. Apply transformations (POST /api/sources/:name/transform)
+6. Review unmapped values (GET /api/mappings/source/:name/unmapped)
+7. Add mappings (POST /api/mappings or bulk import via TSV)
+8. Reprocess to apply new mappings (POST /api/sources/:name/reprocess)
+9. Query results (GET /api/sources/:name/view-data)
+```
+
+See `examples/GETTING_STARTED.md` for a complete walkthrough with curl examples.
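The nine workflow steps can also be written down as an ordered list of requests for one source, which is handy when scripting the flow. This is only a sketch: the `workflowRequests` helper is hypothetical, and request bodies and query parameters are left out because the README does not specify them. The methods and paths are the ones listed in the workflow.

```javascript
// Hypothetical helper: list the workflow requests for one source, in order.
function workflowRequests(sourceName) {
  const name = encodeURIComponent(sourceName);
  return [
    { method: "POST", path: "/api/sources" },
    { method: "POST", path: `/api/sources/${name}/import` },
    { method: "POST", path: "/api/rules" },
    { method: "GET", path: "/api/rules/preview" },
    { method: "POST", path: `/api/sources/${name}/transform` },
    { method: "GET", path: `/api/mappings/source/${name}/unmapped` },
    { method: "POST", path: "/api/mappings" },
    { method: "POST", path: `/api/sources/${name}/reprocess` },
    { method: "GET", path: `/api/sources/${name}/view-data` },
  ];
}

const steps = workflowRequests("bank_transactions");
for (const [index, step] of steps.entries()) {
  console.log(`${index + 1}. ${step.method} ${step.path}`);
}
```

Encoding the source name with `encodeURIComponent` is a precaution for names containing spaces or slashes; whether the API accepts such names is not stated in the README.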
 
 ## Project Structure
 
 ```
 dataflow/
-├── database/  # PostgreSQL schema and functions
+├── database/
 │   ├── schema.sql  # Table definitions
-│   └── functions.sql  # Import and transformation functions
-├── api/  # Express REST API
-│   ├── server.js  # Main server
-│   └── routes/  # API route handlers
-├── examples/  # Sample data and use cases
-└── docs/  # Additional documentation
+│   └── functions.sql  # Import/transform/query functions
+├── api/
+│   ├── server.js  # Express server
+│   ├── middleware/
+│   │   └── auth.js  # Basic auth middleware
+│   ├── lib/
+│   │   └── sql.js  # SQL literal helpers
+│   └── routes/
+│       ├── sources.js
+│       ├── rules.js
+│       ├── mappings.js
+│       └── records.js
+├── public/  # Built React UI (served as static files)
+├── examples/
+│   ├── GETTING_STARTED.md
+│   └── bank_transactions.csv
+└── .env.example
 ```
 
-## Status
-
-**Current Phase:** Initial development - building core functionality
-
 ## License
 
 MIT