Getting Started with Dataflow

This guide walks through a complete example using bank transaction data.

Prerequisites

  1. PostgreSQL database running
  2. Database created: CREATE DATABASE dataflow;
  3. .env file configured (copy from .env.example)

Step 1: Deploy Database Schema

cd /opt/dataflow
psql -U postgres -d dataflow -f database/schema.sql
psql -U postgres -d dataflow -f database/functions.sql

You should see tables created without errors.

Step 2: Start the API Server

npm install
npm start

The server should start on port 3000 (or your configured port).

Test it:

curl http://localhost:3000/health
# Should return: {"status":"ok","timestamp":"..."}

Step 3: Create a Data Source

A source defines where data comes from and how to deduplicate it.

curl -X POST http://localhost:3000/api/sources \
  -H "Content-Type: application/json" \
  -d '{
    "name": "bank_transactions",
    "dedup_fields": ["date", "description", "amount"]
  }'

What this does: Records with the same date + description + amount will be considered duplicates.
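Conceptually, the dedup check amounts to building a composite key from the configured fields; anything outside those fields is ignored. A minimal sketch (illustrative only, not the server's actual implementation):

```python
# Sketch: records are duplicates when their dedup-field values match.
dedup_fields = ["date", "description", "amount"]

def dedup_key(record):
    """Composite key built from the configured dedup fields."""
    return tuple(record.get(f) for f in dedup_fields)

a = {"date": "2024-01-02", "description": "GOOGLE *YOUTUBE VIDEOS",
     "amount": "4.26", "category": "Services"}
b = {"date": "2024-01-02", "description": "GOOGLE *YOUTUBE VIDEOS",
     "amount": "4.26", "category": "Other"}

# category differs, but it is not a dedup field, so these collide:
print(dedup_key(a) == dedup_key(b))  # True
```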

Step 4: Create Transformation Rules

Rules extract meaningful data using regex patterns.

Rule 1: Extract merchant name (first part of description)

curl -X POST http://localhost:3000/api/rules \
  -H "Content-Type: application/json" \
  -d '{
    "source_name": "bank_transactions",
    "name": "extract_merchant",
    "field": "description",
    "pattern": "^([A-Z][A-Z ]+)",
    "output_field": "merchant",
    "sequence": 1
  }'

Rule 2: Extract location (an uppercase city name followed by the state abbreviation OH; this matches Ohio only, so adjust the pattern for other states)

curl -X POST http://localhost:3000/api/rules \
  -H "Content-Type: application/json" \
  -d '{
    "source_name": "bank_transactions",
    "name": "extract_location",
    "field": "description",
    "pattern": "([A-Z]+) OH",
    "output_field": "location",
    "sequence": 2
  }'
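You can preview what these two patterns capture before creating the rules. The sample descriptions below are illustrative; real bank data will vary:

```python
import re

merchant_pat = r"^([A-Z][A-Z ]+)"   # leading run of uppercase words
location_pat = r"([A-Z]+) OH"       # uppercase word followed by " OH"

desc = "TARGET 00123 COLUMBUS OH"

m = re.match(merchant_pat, desc)
print(m.group(1).strip())           # TARGET (strip the trailing space)

loc = re.search(location_pat, desc)
print(loc.group(1))                 # COLUMBUS
```

Note that the merchant pattern's character class includes a space, so the raw capture can carry a trailing space ("TARGET "); trimming it is a reasonable cleanup step.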

Step 5: Import Data

Import the example CSV file:

curl -X POST http://localhost:3000/api/sources/bank_transactions/import \
  -F "file=@examples/bank_transactions.csv"

Response:

{
  "success": true,
  "imported": 14,
  "duplicates": 0,
  "log_id": 1
}
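Under the hood, an import like this boils down to parsing the CSV into row dicts and counting new rows versus duplicates. A self-contained sketch (assumed logic, not the server's code; the inline CSV is illustrative):

```python
import csv
import io

dedup_fields = ["date", "description", "amount"]
existing_keys = set()  # keys already stored for this source

csv_text = """date,description,amount,category
2024-01-02,GOOGLE *YOUTUBE VIDEOS,4.26,Services
2024-01-02,GOOGLE *YOUTUBE VIDEOS,4.26,Services
"""

imported = duplicates = 0
for row in csv.DictReader(io.StringIO(csv_text)):
    key = tuple(row[f] for f in dedup_fields)
    if key in existing_keys:
        duplicates += 1
    else:
        existing_keys.add(key)
        imported += 1

print({"imported": imported, "duplicates": duplicates})
# {'imported': 1, 'duplicates': 1}
```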

Step 6: View Imported Records

curl "http://localhost:3000/api/records/source/bank_transactions?limit=5"

You'll see the raw imported data. Note that transformed is null - we haven't applied transformations yet!

Step 7: Apply Transformations

curl -X POST http://localhost:3000/api/sources/bank_transactions/transform

Response:

{
  "success": true,
  "transformed": 14
}

Now check the records again:

curl "http://localhost:3000/api/records/source/bank_transactions?limit=2"

You'll see the transformed field now contains the original data plus extracted fields like merchant and location.
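The transform step can be pictured as running each rule in sequence order and merging its capture into a copy of the original data. A simplified sketch (the server's actual logic may differ):

```python
import re

rules = [  # ordered by sequence
    {"field": "description", "pattern": r"^([A-Z][A-Z ]+)", "output_field": "merchant"},
    {"field": "description", "pattern": r"([A-Z]+) OH",     "output_field": "location"},
]

def apply_rules(data, rules):
    """Copy the original fields, then add one extracted field per matching rule."""
    transformed = dict(data)
    for rule in rules:
        m = re.search(rule["pattern"], data.get(rule["field"], ""))
        if m:
            transformed[rule["output_field"]] = m.group(1).strip()
    return transformed

row = {"date": "2024-01-03", "description": "TARGET 00123 COLUMBUS OH", "amount": "52.10"}
print(apply_rules(row, rules))
```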

Step 8: View Extracted Values That Need Mapping

curl http://localhost:3000/api/mappings/source/bank_transactions/unmapped

Response shows extracted merchant names that aren't mapped yet:

[
  {"rule_name": "extract_merchant", "extracted_value": "GOOGLE", "record_count": 2},
  {"rule_name": "extract_merchant", "extracted_value": "TARGET", "record_count": 2},
  {"rule_name": "extract_merchant", "extracted_value": "WALMART", "record_count": 1},
  ...
]

Step 9: Create Value Mappings

Map extracted values to clean, standardized output:

curl -X POST http://localhost:3000/api/mappings \
  -H "Content-Type: application/json" \
  -d '{
    "source_name": "bank_transactions",
    "rule_name": "extract_merchant",
    "input_value": "GOOGLE",
    "output": {
      "vendor": "Google",
      "category": "Technology"
    }
  }'

curl -X POST http://localhost:3000/api/mappings \
  -H "Content-Type: application/json" \
  -d '{
    "source_name": "bank_transactions",
    "rule_name": "extract_merchant",
    "input_value": "TARGET",
    "output": {
      "vendor": "Target",
      "category": "Retail"
    }
  }'

curl -X POST http://localhost:3000/api/mappings \
  -H "Content-Type: application/json" \
  -d '{
    "source_name": "bank_transactions",
    "rule_name": "extract_merchant",
    "input_value": "WALMART",
    "output": {
      "vendor": "Walmart",
      "category": "Groceries"
    }
  }'
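Mapping lookup can be sketched as a dictionary keyed on (rule name, extracted value) whose output object is merged over the transformed record, so mapped keys such as category override the originals. Illustrative only; field names mirror the examples above:

```python
mappings = {
    ("extract_merchant", "GOOGLE"): {"vendor": "Google", "category": "Technology"},
    ("extract_merchant", "TARGET"): {"vendor": "Target", "category": "Retail"},
}

def apply_mappings(transformed, mappings, rule_outputs):
    """rule_outputs maps rule_name -> the field that rule wrote to."""
    result = dict(transformed)
    for rule_name, out_field in rule_outputs.items():
        output = mappings.get((rule_name, transformed.get(out_field)))
        if output:
            result.update(output)  # mapped keys override original values
    return result

record = {"description": "GOOGLE *YOUTUBE VIDEOS",
          "category": "Services", "merchant": "GOOGLE"}
result = apply_mappings(record, mappings, {"extract_merchant": "merchant"})
print(result["vendor"], result["category"])  # Google Technology
```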

Step 10: Reprocess With Mappings

Clear and reapply transformations to pick up the new mappings:

curl -X POST http://localhost:3000/api/sources/bank_transactions/reprocess

Step 11: View Final Results

curl "http://localhost:3000/api/records/source/bank_transactions?limit=5"

Now the transformed field contains:

  • Original fields (date, description, amount, category)
  • Extracted fields (merchant, location)
  • Mapped fields (vendor; the mapped category replaces the original value)

Example result:

{
  "id": 1,
  "data": {
    "date": "2024-01-02",
    "description": "GOOGLE *YOUTUBE VIDEOS",
    "amount": "4.26",
    "category": "Services"
  },
  "transformed": {
    "date": "2024-01-02",
    "description": "GOOGLE *YOUTUBE VIDEOS",
    "amount": "4.26",
    "merchant": "GOOGLE",
    "vendor": "Google",
    "category": "Technology"
  }
}

Note that the mapped category ("Technology") has replaced the original value ("Services") in the transformed output.

Step 12: Test Deduplication

Try importing the same file again:

curl -X POST http://localhost:3000/api/sources/bank_transactions/import \
  -F "file=@examples/bank_transactions.csv"

Response:

{
  "success": true,
  "imported": 0,
  "duplicates": 14,
  "log_id": 2
}

All records were rejected as duplicates! ✓

Summary

You've now:

  • Created a data source with deduplication rules
  • Defined transformation rules to extract data
  • Imported CSV data
  • Applied transformations
  • Created value mappings for clean output
  • Reprocessed data with mappings
  • Tested deduplication

Next Steps

  • Add more rules for other extraction patterns
  • Create more value mappings as needed
  • Query the transformed data for reporting
  • Import additional CSV files

Useful Commands

# View all sources
curl http://localhost:3000/api/sources

# View source statistics
curl http://localhost:3000/api/sources/bank_transactions/stats

# View all rules for a source
curl http://localhost:3000/api/rules/source/bank_transactions

# View all mappings for a source
curl http://localhost:3000/api/mappings/source/bank_transactions

# Search for specific records
curl -X POST http://localhost:3000/api/records/search \
  -H "Content-Type: application/json" \
  -d '{
    "source_name": "bank_transactions",
    "query": {"vendor": "Google"},
    "limit": 10
  }'
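The search body matches records whose transformed JSON contains the given key/value pairs. The equivalent filter, expressed client-side for illustration (the server presumably queries the database directly):

```python
def matches(record, query):
    """True when every query key/value appears in the record's transformed JSON."""
    transformed = record.get("transformed") or {}
    return all(transformed.get(k) == v for k, v in query.items())

records = [
    {"id": 1, "transformed": {"vendor": "Google", "category": "Technology"}},
    {"id": 2, "transformed": {"vendor": "Target", "category": "Retail"}},
]

hits = [r["id"] for r in records if matches(r, {"vendor": "Google"})]
print(hits)  # [1]
```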

Troubleshooting

API won't start:

  • Check .env file exists with correct database credentials
  • Verify PostgreSQL is running: psql -U postgres -l
  • Check logs for error messages

Import fails:

  • Verify source exists: curl http://localhost:3000/api/sources
  • Check CSV format matches expectations
  • Ensure dedup_fields match CSV column names

Transformations not working:

  • Check rules exist: curl http://localhost:3000/api/rules/source/bank_transactions
  • Test regex pattern manually
  • Check records have the specified field