# Dataflow

A simple, understandable data transformation tool for ingesting, mapping, and transforming data from various sources.

## What It Does

Dataflow helps you:

1. **Import** data from CSV files (or other formats)
2. **Transform** data using regex rules to extract meaningful information
3. **Map** extracted values to standardized output
4. **Query** the transformed data

It is well suited to cleaning up messy data such as bank transactions, product lists, or any repetitive data that needs normalization.

## Core Concepts

### 1. Sources

Define where data comes from and how to deduplicate it.

**Example:** Bank transactions deduplicated by date + amount + description

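
One way to picture field-based deduplication is as a key built from the chosen fields. The sketch below is illustrative only: `dedupKey` and `dedupe` are hypothetical helpers, the rows are made-up sample data, and the project itself may deduplicate inside PostgreSQL instead.

```javascript
// Hypothetical helper: build a stable key from the fields a source dedups on.
function dedupKey(record, dedupFields) {
  // Serializing an array keeps field boundaries unambiguous,
  // so ["ab", "c"] and ["a", "bc"] produce different keys.
  return JSON.stringify(dedupFields.map((field) => record[field] ?? null));
}

// Keep only the first record seen for each key.
function dedupe(records, dedupFields) {
  const seen = new Set();
  return records.filter((record) => {
    const key = dedupKey(record, dedupFields);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}

// Made-up sample rows: the first two are identical on all three fields.
const rows = [
  { date: "2024-01-05", amount: "-12.50", description: "DISCOUNT DRUG MART 32" },
  { date: "2024-01-05", amount: "-12.50", description: "DISCOUNT DRUG MART 32" },
  { date: "2024-01-06", amount: "-12.50", description: "DISCOUNT DRUG MART 32" },
];
const unique = dedupe(rows, ["date", "amount", "description"]);
console.log(unique.length); // 2
```

Serializing the values as a JSON array, rather than joining them with a separator, avoids collisions when a field value happens to contain the separator.
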
### 2. Rules

Extract information using regex patterns.

**Example:** Extract merchant name from transaction description

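
As a rough illustration of how a rule might be applied, the sketch below runs the `extract_merchant` pattern from the Quick Example against one description. `applyRule` is a hypothetical helper, and trimming the captured text is an assumption made here, not documented behavior.

```javascript
// Rule shape borrowed from the Quick Example in this README.
const rule = {
  name: "extract_merchant",
  pattern: "^([A-Z][A-Z ]+)",
  field: "description",
};

// Hypothetical helper: return the first capture group, or null if nothing matches.
function applyRule(rule, record) {
  const value = record[rule.field];
  if (typeof value !== "string") return null;
  const match = new RegExp(rule.pattern).exec(value);
  // The pattern also captures the space before the store number, so trim it.
  return match ? match[1].trim() : null;
}

const extracted = applyRule(rule, { description: "DISCOUNT DRUG MART 32" });
console.log(extracted); // DISCOUNT DRUG MART
```
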
### 3. Mappings

Map extracted values to clean, standardized output.

**Example:** `"DISCOUNT DRUG MART 32"` → `{"vendor": "Discount Drug Mart", "category": "Healthcare"}`

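
A mapping can be thought of as a lookup from an input value to an output object. The following is a minimal sketch under that assumption; the `Map`-based table and `applyMapping` are hypothetical and say nothing about how mappings are actually stored.

```javascript
// Hypothetical lookup table keyed on the input value from the example above.
const mappings = new Map([
  ["DISCOUNT DRUG MART 32", { vendor: "Discount Drug Mart", category: "Healthcare" }],
]);

// Return the mapped output, or null when no mapping has been defined yet.
function applyMapping(mappings, inputValue) {
  return mappings.get(inputValue) ?? null;
}

const mapped = applyMapping(mappings, "DISCOUNT DRUG MART 32");
const unmapped = applyMapping(mappings, "UNKNOWN SHOP");
console.log(mapped.vendor, unmapped); // Discount Drug Mart null
```

Returning `null` for unmapped values keeps "not mapped yet" distinguishable from a mapping that produces an empty object.
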
## Architecture

- **Database:** PostgreSQL with JSONB for flexibility
- **API:** Node.js/Express for REST endpoints
- **Storage:** Raw data is preserved; transformed results are computed and stored

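
To make the storage principle concrete, here is a small sketch of a record that keeps its raw input untouched while derived values accumulate next to it. The `data` and `transformed` property names and the `withTransformed` helper are assumptions for illustration, not the project's actual schema.

```javascript
// Hypothetical record shape: `data` holds the row as imported,
// `transformed` holds values derived from it.
function withTransformed(record, derived) {
  return {
    ...record,
    // The raw input is copied and frozen, never edited.
    data: Object.freeze({ ...record.data }),
    // Derived values are merged into whatever was computed before.
    transformed: { ...(record.transformed ?? {}), ...derived },
  };
}

const stored = { id: 1, data: { description: "DISCOUNT DRUG MART 32" } };
const updated = withTransformed(stored, { vendor: "Discount Drug Mart" });
console.log(updated.data.description); // DISCOUNT DRUG MART 32
```
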
## Design Principles

- **Simple & Clear** - Easy to understand what's happening
- **Explicit** - No hidden magic or complex triggers
- **Testable** - Every function can be tested independently
- **Flexible** - Handle varying data formats without schema changes

## Getting Started

### Prerequisites

- PostgreSQL 12+
- Node.js 16+

### Installation

1. Install dependencies:
```bash
npm install
```

2. Configure database (copy `.env.example` to `.env` and edit):
```bash
cp .env.example .env
```

3. Deploy database schema:
```bash
psql -U postgres -d dataflow -f database/schema.sql
psql -U postgres -d dataflow -f database/functions.sql
```

4. Start the API server:
```bash
npm start
```

## Quick Example

```javascript
// 1. Define a source
POST /api/sources
{
  "name": "bank_transactions",
  "dedup_fields": ["date", "amount", "description"]
}

// 2. Create a transformation rule
POST /api/sources/bank_transactions/rules
{
  "name": "extract_merchant",
  "pattern": "^([A-Z][A-Z ]+)",
  "field": "description"
}

// 3. Import data
POST /api/sources/bank_transactions/import
[CSV file upload]

// 4. Query transformed data
GET /api/sources/bank_transactions/records
```

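
The JSON requests above could be sent from a script along these lines. This is a sketch under stated assumptions: the base URL and port are guesses, `buildRequests` and `sendAll` are hypothetical helpers, and the CSV import step is left out because the upload format is not specified here.

```javascript
// Assumption: the API listens on this address; adjust to your setup.
const BASE_URL = "http://localhost:3000";

// Describe the JSON requests from the example as plain data.
function buildRequests(baseUrl) {
  return [
    {
      method: "POST",
      url: `${baseUrl}/api/sources`,
      body: { name: "bank_transactions", dedup_fields: ["date", "amount", "description"] },
    },
    {
      method: "POST",
      url: `${baseUrl}/api/sources/bank_transactions/rules`,
      body: { name: "extract_merchant", pattern: "^([A-Z][A-Z ]+)", field: "description" },
    },
    { method: "GET", url: `${baseUrl}/api/sources/bank_transactions/records` },
  ];
}

// Send each request in order and collect the HTTP status codes.
// `fetchImpl` defaults to the global fetch, which needs Node 18+.
async function sendAll(requests, fetchImpl = globalThis.fetch) {
  const statuses = [];
  for (const { method, url, body } of requests) {
    const response = await fetchImpl(url, {
      method,
      headers: body ? { "Content-Type": "application/json" } : undefined,
      body: body ? JSON.stringify(body) : undefined,
    });
    statuses.push(response.status);
  }
  return statuses;
}

const requests = buildRequests(BASE_URL);
console.log(requests.map((r) => `${r.method} ${r.url}`).join("\n"));
```

With the server running, `sendAll(requests).then(console.log)` would issue the calls. Passing a stub as `fetchImpl` lets the same code be exercised without a network.
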
## Project Structure

```
dataflow/
├── database/           # PostgreSQL schema and functions
│   ├── schema.sql      # Table definitions
│   └── functions.sql   # Import and transformation functions
├── api/                # Express REST API
│   ├── server.js       # Main server
│   └── routes/         # API route handlers
├── examples/           # Sample data and use cases
└── docs/               # Additional documentation
```

## Status

**Current Phase:** Initial development - building core functionality

## License

MIT