diff --git a/README.md b/README.md index 70b27c7..cc65aab 100644 --- a/README.md +++ b/README.md @@ -1,14 +1,14 @@ # Dataflow -A simple, understandable data transformation tool for ingesting, mapping, and transforming data from various sources. +A simple data transformation tool for importing, cleaning, and standardizing data from various sources. ## What It Does Dataflow helps you: -1. **Import** data from CSV files (or other formats) +1. **Import** CSV data with automatic deduplication 2. **Transform** data using regex rules to extract meaningful information 3. **Map** extracted values to standardized output -4. **Query** the transformed data +4. **Query** the transformed data via a web UI or REST API Perfect for cleaning up messy data like bank transactions, product lists, or any repetitive data that needs normalization. @@ -20,26 +20,26 @@ Define where data comes from and how to deduplicate it. **Example:** Bank transactions deduplicated by date + amount + description ### 2. Rules -Extract information using regex patterns. +Extract information using regex patterns (`extract` or `replace` modes). **Example:** Extract merchant name from transaction description ### 3. Mappings Map extracted values to clean, standardized output. -**Example:** "DISCOUNT DRUG MART 32" → {"vendor": "Discount Drug Mart", "category": "Healthcare"} +**Example:** "DISCOUNT DRUG MART 32" → `{"vendor": "Discount Drug Mart", "category": "Healthcare"}` ## Architecture -- **Database:** PostgreSQL with JSONB for flexibility -- **API:** Node.js/Express for REST endpoints -- **Storage:** Raw data preserved, transformations are computed and stored +- **Database:** PostgreSQL with JSONB for flexible data storage +- **API:** Node.js/Express REST API +- **UI:** React SPA served from `public/` +- **Auth:** HTTP Basic auth (configured in `.env`) ## Design Principles - **Simple & Clear** - Easy to understand what's happening - **Explicit** - No hidden magic or complex triggers -- **Testable** - Every function can be tested independently - **Flexible** - Handle varying data formats without schema changes ## Getting Started @@ -47,74 +47,153 @@ Map extracted values to clean, standardized output. ### Prerequisites - PostgreSQL 12+ - Node.js 16+ +- Python 3 (for `manage.py`) ### Installation -1. Install dependencies: +1. Install Node dependencies: ```bash npm install ``` -2. Configure database (copy .env.example to .env and edit): +2. Run the management script to configure and deploy everything: ```bash -cp .env.example .env +python3 manage.py ``` -3. Deploy database schema: +For development with auto-reload: ```bash -psql -U postgres -d dataflow -f database/schema.sql -psql -U postgres -d dataflow -f database/functions.sql +npm run dev ``` -4. Start the API server: -```bash -npm start +The UI is available at `http://localhost:3000`. The API is at `http://localhost:3000/api`. + +## Management Script (`manage.py`) + +`manage.py` is an interactive tool for configuring, deploying, and managing the service. Run it and choose from the numbered menu: + +``` +python3 manage.py ``` -## Quick Example +| # | Action | +|---|--------| +| 1 | **Database configuration** — create/update `.env`, optionally create the PostgreSQL user/database, and deploy schema + functions | +| 2 | Redeploy schema only (`database/schema.sql`) — drops and recreates all tables | +| 3 | Redeploy SQL functions only (`database/queries/`) | +| 4 | Build UI (`ui/` → `public/`) | +| 5 | Set up nginx reverse proxy (HTTP or HTTPS via certbot) | +| 6 | Install systemd service unit (`dataflow.service`) | +| 7 | Start / restart `dataflow.service` | +| 8 | Stop `dataflow.service` | +| 9 | Set login credentials (`LOGIN_USER` / `LOGIN_PASSWORD_HASH` in `.env`) | -```javascript -// 1. Define a source -POST /api/sources -{ - "name": "bank_transactions", - "dedup_fields": ["date", "amount", "description"] -} +The status screen at the top of the menu shows the current state of each component (database connection, schema, UI build, service, nginx). -// 2. Create a transformation rule -POST /api/sources/bank_transactions/rules -{ - "name": "extract_merchant", - "pattern": "^([A-Z][A-Z ]+)", - "field": "description" -} +**Typical first-time setup:** run options 1 → 4 → 9 → 6 → 7 (→ 5 if you want nginx). -// 3. Import data -POST /api/sources/bank_transactions/import -[CSV file upload] +## API Reference + +All `/api` routes require HTTP Basic authentication. + +### Sources — `/api/sources` + +| Method | Path | Description | +|--------|------|-------------| +| GET | `/api/sources` | List all sources | +| POST | `/api/sources` | Create a source | +| GET | `/api/sources/:name` | Get a source | +| PUT | `/api/sources/:name` | Update a source | +| DELETE | `/api/sources/:name` | Delete a source | +| POST | `/api/sources/suggest` | Suggest source definition from CSV upload | +| POST | `/api/sources/:name/import` | Import CSV data | +| GET | `/api/sources/:name/import-log` | View import history | +| POST | `/api/sources/:name/transform` | Apply rules and mappings to records | +| POST | `/api/sources/:name/reprocess` | Re-transform all records | +| GET | `/api/sources/:name/fields` | List all known field names | +| GET | `/api/sources/:name/stats` | Get record and mapping counts | +| POST | `/api/sources/:name/view` | Generate output view | +| GET | `/api/sources/:name/view-data` | Query output view (paginated, sortable) | + +### Rules — `/api/rules` + +| Method | Path | Description | +|--------|------|-------------| +| GET | `/api/rules/source/:source_name` | List rules for a source | +| POST | `/api/rules` | Create a rule | +| GET | `/api/rules/:id` | Get a rule | +| PUT | `/api/rules/:id` | Update a rule | +| DELETE | `/api/rules/:id` | Delete a rule | +| GET | `/api/rules/preview` | Preview a pattern against real records (ad-hoc) | +| GET | `/api/rules/:id/test` | Test a saved rule against real records | + +### Mappings — `/api/mappings` + +| Method | Path | Description | +|--------|------|-------------| +| GET | `/api/mappings/source/:source_name` | List mappings | +| POST | `/api/mappings` | Create a mapping | +| POST | `/api/mappings/bulk` | Bulk create/update mappings | +| GET | `/api/mappings/:id` | Get a mapping | +| PUT | `/api/mappings/:id` | Update a mapping | +| DELETE | `/api/mappings/:id` | Delete a mapping | +| GET | `/api/mappings/source/:source_name/unmapped` | Get values with no mapping yet | +| GET | `/api/mappings/source/:source_name/all-values` | All extracted values with counts | +| GET | `/api/mappings/source/:source_name/counts` | Record counts for existing mappings | +| GET | `/api/mappings/source/:source_name/export.tsv` | Export values as TSV | +| POST | `/api/mappings/source/:source_name/import-csv` | Import mappings from TSV | + +### Records — `/api/records` + +| Method | Path | Description | +|--------|------|-------------| +| GET | `/api/records/source/:source_name` | List records (paginated) | +| GET | `/api/records/:id` | Get a single record | +| POST | `/api/records/search` | Search records | +| DELETE | `/api/records/:id` | Delete a record | +| DELETE | `/api/records/source/:source_name/all` | Delete all records for a source | + +## Typical Workflow -// 4. Query transformed data -GET /api/sources/bank_transactions/records ``` +1. Create a source (POST /api/sources) +2. Import CSV data (POST /api/sources/:name/import) +3. Create transformation rules (POST /api/rules) +4. Preview rules against real data (GET /api/rules/preview) +5. Apply transformations (POST /api/sources/:name/transform) +6. Review unmapped values (GET /api/mappings/source/:name/unmapped) +7. Add mappings (POST /api/mappings or bulk import via TSV) +8. Reprocess to apply new mappings (POST /api/sources/:name/reprocess) +9. Query results (GET /api/sources/:name/view-data) +``` + +See `examples/GETTING_STARTED.md` for a complete walkthrough with curl examples. ## Project Structure ``` dataflow/ -├── database/ # PostgreSQL schema and functions -│ ├── schema.sql # Table definitions -│ └── functions.sql # Import and transformation functions -├── api/ # Express REST API -│ ├── server.js # Main server -│ └── routes/ # API route handlers -├── examples/ # Sample data and use cases -└── docs/ # Additional documentation +├── database/ +│ ├── schema.sql # Table definitions +│ └── functions.sql # Import/transform/query functions +├── api/ +│ ├── server.js # Express server +│ ├── middleware/ +│ │ └── auth.js # Basic auth middleware +│ ├── lib/ +│ │ └── sql.js # SQL literal helpers +│ └── routes/ +│ ├── sources.js +│ ├── rules.js +│ ├── mappings.js +│ └── records.js +├── public/ # Built React UI (served as static files) +├── examples/ +│ ├── GETTING_STARTED.md +│ └── bank_transactions.csv +└── .env.example ``` -## Status - -**Current Phase:** Initial development - building core functionality - ## License MIT