Dataflow
A simple, understandable data transformation tool for ingesting, mapping, and transforming data from various sources.
What It Does
Dataflow helps you:
- Import data from CSV files (or other formats)
- Transform data using regex rules to extract meaningful information
- Map extracted values to standardized output
- Query the transformed data
Perfect for cleaning up messy data like bank transactions, product lists, or any repetitive data that needs normalization.
Core Concepts
1. Sources
Define where data comes from and how to deduplicate it.
Example: Bank transactions deduplicated by date + amount + description
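A minimal sketch of how field-based dedup might work in PostgreSQL (the dedup_hash column and its unique constraint are assumptions for illustration, not the actual schema):

-- Assumes records(source_id, data, dedup_hash) with a UNIQUE (source_id, dedup_hash)
-- constraint; re-imports of the same date + amount + description are skipped.
WITH incoming(d) AS (
  VALUES ('{"date": "2024-01-05", "amount": "12.47", "description": "DISCOUNT DRUG MART 32"}'::jsonb)
)
INSERT INTO records (source_id, data, dedup_hash)
SELECT 1, d, md5((d->>'date') || (d->>'amount') || (d->>'description'))
FROM incoming
ON CONFLICT (source_id, dedup_hash) DO NOTHING;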
2. Rules
Extract information using regex patterns.
Example: Extract merchant name from transaction description
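For instance, PostgreSQL's regexp_match can pull a merchant name out of a raw description using the pattern from the quick example below:

-- regexp_match returns the capture groups as a text array; [1] is the first group.
SELECT trim((regexp_match('DISCOUNT DRUG MART 32', '^([A-Z][A-Z ]+)'))[1]);
-- => 'DISCOUNT DRUG MART'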
3. Mappings
Map extracted values to clean, standardized output.
Example: "DISCOUNT DRUG MART 32" → {"vendor": "Discount Drug Mart", "category": "Healthcare"}
Architecture
- Database: PostgreSQL with JSONB for flexibility
- API: Node.js/Express for REST endpoints
- Storage: Raw data is preserved as imported; transformations are computed and stored alongside it
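A minimal sketch of that storage model (column names are illustrative; the actual tables are defined in database/schema.sql):

CREATE TABLE sources (
    id           serial PRIMARY KEY,
    name         text UNIQUE NOT NULL,
    dedup_fields text[] NOT NULL       -- fields combined to detect duplicate rows
);

CREATE TABLE records (
    id          serial PRIMARY KEY,
    source_id   integer NOT NULL REFERENCES sources(id),
    data        jsonb NOT NULL,        -- raw imported row, preserved as-is
    transformed jsonb                  -- computed rule/mapping output
);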
Design Principles
- Simple & Clear - Easy to understand what's happening
- Explicit - No hidden magic or complex triggers
- Testable - Every function can be tested independently
- Flexible - Handle varying data formats without schema changes
Getting Started
Prerequisites
- PostgreSQL 12+
- Node.js 16+
Installation
- Install dependencies:
npm install
- Configure database (copy .env.example to .env and edit):
cp .env.example .env
- Deploy the database schema and functions:
psql -U postgres -d dataflow -f database/schema.sql
psql -U postgres -d dataflow -f database/functions.sql
- Start the API server:
npm start
Quick Example
// 1. Define a source
POST /api/sources
{
  "name": "bank_transactions",
  "dedup_fields": ["date", "amount", "description"]
}

// 2. Create a transformation rule
POST /api/sources/bank_transactions/rules
{
  "name": "extract_merchant",
  "pattern": "^([A-Z][A-Z ]+)",
  "field": "description"
}

// 3. Import data
POST /api/sources/bank_transactions/import
[CSV file upload]

// 4. Query transformed data
GET /api/sources/bank_transactions/records
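After these four calls, each record carries both the raw row and its computed output. Querying the table directly might show something like this (key names and values are illustrative):

SELECT data->>'description' AS raw, transformed
FROM records
WHERE transformed IS NOT NULL
LIMIT 1;
-- raw         => 'DISCOUNT DRUG MART 32'
-- transformed => {"vendor": "Discount Drug Mart", "category": "Healthcare"}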
Project Structure
dataflow/
├── database/ # PostgreSQL schema and functions
│ ├── schema.sql # Table definitions
│ └── functions.sql # Import and transformation functions
├── api/ # Express REST API
│ ├── server.js # Main server
│ └── routes/ # API route handlers
├── examples/ # Sample data and use cases
└── docs/ # Additional documentation
Status
Current Phase: Initial development - building core functionality
License
MIT