# Dataflow
A simple, understandable data transformation tool for ingesting, mapping, and transforming data from various sources.
## What It Does
Dataflow helps you:
1. **Import** data from CSV files (or other formats)
2. **Transform** data using regex rules to extract meaningful information
3. **Map** extracted values to standardized output
4. **Query** the transformed data

It is well suited to cleaning up messy data such as bank transactions, product lists, or any repetitive data that needs normalization.
## Core Concepts
### 1. Sources
Define where data comes from and how to deduplicate it.
**Example:** Bank transactions deduplicated by date + amount + description
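
As a rough illustration of deduplicating on a set of fields, the sketch below builds a key from the listed fields and keeps the first record seen for each key. The helper names `dedupKey` and `dedupe`, the SHA-256 hashing, and the sample rows are assumptions made for this example; they are not Dataflow's actual implementation.

```javascript
// Sketch only: `dedupKey` and `dedupe` are hypothetical helpers, not part of Dataflow's API.
const crypto = require("crypto");

// Build a stable key from the configured dedup fields of one record.
function dedupKey(record, dedupFields) {
  const parts = dedupFields.map((field) => String(record[field] ?? ""));
  return crypto.createHash("sha256").update(JSON.stringify(parts)).digest("hex");
}

// Keep only the first record seen for each key.
function dedupe(records, dedupFields) {
  const seen = new Set();
  return records.filter((record) => {
    const key = dedupKey(record, dedupFields);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}

// Made-up sample rows: the first two are identical on all three fields.
const rows = [
  { date: "2024-01-05", amount: "-12.50", description: "DISCOUNT DRUG MART 32" },
  { date: "2024-01-05", amount: "-12.50", description: "DISCOUNT DRUG MART 32" },
  { date: "2024-01-06", amount: "-12.50", description: "DISCOUNT DRUG MART 32" },
];
const unique = dedupe(rows, ["date", "amount", "description"]);
console.log(unique.length); // prints 2
```

Serializing the field values as a JSON array before hashing avoids accidental collisions that plain string concatenation could produce (for example `"ab" + "c"` versus `"a" + "bc"`).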
### 2. Rules
Extract information using regex patterns.
**Example:** Extract merchant name from transaction description
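
A minimal sketch of what applying such a rule might look like. The rule object reuses the `name`, `pattern`, and `field` keys from the Quick Example; the `applyRule` helper and its choice to trim whitespace and prefer the first capture group are assumptions for illustration.

```javascript
// Sketch only: `applyRule` is a hypothetical helper, not part of Dataflow's API.
function applyRule(rule, record) {
  const value = record[rule.field];
  if (typeof value !== "string") return null;

  const match = new RegExp(rule.pattern).exec(value);
  if (!match) return null;

  // Prefer the first capture group; fall back to the whole match.
  return (match[1] ?? match[0]).trim();
}

const rule = { name: "extract_merchant", pattern: "^([A-Z][A-Z ]+)", field: "description" };
const record = { description: "DISCOUNT DRUG MART 32" };
console.log(applyRule(rule, record)); // prints DISCOUNT DRUG MART
```

The pattern stops at the first character that is neither an uppercase letter nor a space, so the trailing store number is dropped and only the merchant text remains.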
### 3. Mappings
Map extracted values to clean, standardized output.
**Example:** "DISCOUNT DRUG MART 32" → {"vendor": "Discount Drug Mart", "category": "Healthcare"}
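
One way to picture a mapping is as a lookup from an extracted value to an output object. The sketch below assumes an in-memory `Map` and a hypothetical `applyMapping` helper; how Dataflow actually stores and resolves mappings is not shown here.

```javascript
// Sketch only: the Map-based lookup and `applyMapping` are assumptions for illustration.
const mappings = new Map([
  ["DISCOUNT DRUG MART 32", { vendor: "Discount Drug Mart", category: "Healthcare" }],
]);

// Return the mapped output, or null so unmapped values can be reviewed later.
function applyMapping(extracted) {
  return mappings.get(extracted) ?? null;
}

console.log(applyMapping("DISCOUNT DRUG MART 32").category); // prints Healthcare
console.log(applyMapping("UNKNOWN SHOP")); // prints null
```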
## Architecture
- **Database:** PostgreSQL with JSONB for flexibility
- **API:** Node.js/Express for REST endpoints
- **Storage:** Raw data is preserved; transformations are computed and stored alongside it
## Design Principles
- **Simple & Clear** - Easy to understand what's happening
- **Explicit** - No hidden magic or complex triggers
- **Testable** - Every function can be tested independently
- **Flexible** - Handle varying data formats without schema changes
## Getting Started
### Prerequisites
- PostgreSQL 12+
- Node.js 16+
### Installation
1. Install dependencies:
```bash
npm install
```
2. Configure database (copy .env.example to .env and edit):
```bash
cp .env.example .env
```
3. Deploy the database schema (the `dataflow` database must already exist):
```bash
psql -U postgres -d dataflow -f database/schema.sql
psql -U postgres -d dataflow -f database/functions.sql
```
4. Start the API server:
```bash
npm start
```
## Quick Example
```javascript
// 1. Define a source
POST /api/sources
{
  "name": "bank_transactions",
  "dedup_fields": ["date", "amount", "description"]
}

// 2. Create a transformation rule
POST /api/sources/bank_transactions/rules
{
  "name": "extract_merchant",
  "pattern": "^([A-Z][A-Z ]+)",
  "field": "description"
}

// 3. Import data
POST /api/sources/bank_transactions/import
[CSV file upload]

// 4. Query transformed data
GET /api/sources/bank_transactions/records
```
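
The requests above are shown in shorthand. As a hedged sketch, they could be issued from Node.js roughly as follows. The base URL and port, the JSON content type, and the `buildRequest` and `run` helpers are assumptions; the import step is left out because the upload format is not specified here.

```javascript
// Sketch only: BASE_URL, the port, and the JSON content type are assumptions.
const BASE_URL = "http://localhost:3000";

// Describe one request as a plain object: URL plus fetch options.
function buildRequest(method, path, body) {
  const request = { method, url: `${BASE_URL}${path}`, headers: {} };
  if (body !== undefined) {
    request.headers["Content-Type"] = "application/json";
    request.body = JSON.stringify(body);
  }
  return request;
}

const steps = [
  // 1. Define a source
  buildRequest("POST", "/api/sources", {
    name: "bank_transactions",
    dedup_fields: ["date", "amount", "description"],
  }),
  // 2. Create a transformation rule
  buildRequest("POST", "/api/sources/bank_transactions/rules", {
    name: "extract_merchant",
    pattern: "^([A-Z][A-Z ]+)",
    field: "description",
  }),
  // 4. Query transformed data
  buildRequest("GET", "/api/sources/bank_transactions/records"),
];

// Send the requests in order; needs a global fetch (built into Node.js 18+).
async function run() {
  for (const { url, ...init } of steps) {
    const response = await fetch(url, init);
    console.log(init.method, url, response.status);
  }
}
```

With the API server started, calling `run()` sends the requests in order and logs each response status.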
## Project Structure
```
dataflow/
├── database/            # PostgreSQL schema and functions
│   ├── schema.sql       # Table definitions
│   └── functions.sql    # Import and transformation functions
├── api/                 # Express REST API
│   ├── server.js        # Main server
│   └── routes/          # API route handlers
├── examples/            # Sample data and use cases
└── docs/                # Additional documentation
```
## Status
**Current Phase:** Initial development - building core functionality
## License
MIT