# Getting Started with Dataflow
This guide walks through a complete example using bank transaction data.
## Prerequisites

- PostgreSQL database running
- Database created: `CREATE DATABASE dataflow;`
- `.env` file configured (copy from `.env.example`)
## Step 1: Deploy Database Schema

```bash
cd /opt/dataflow
psql -U postgres -d dataflow -f database/schema.sql
psql -U postgres -d dataflow -f database/functions.sql
```

You should see the tables created without errors.
## Step 2: Start the API Server

```bash
npm install
npm start
```

The server starts on port 3000 (or your configured port). Test it:

```bash
curl http://localhost:3000/health
# Should return: {"status":"ok","timestamp":"..."}
```
## Step 3: Create a Data Source

A source defines where data comes from and how to deduplicate it.

```bash
curl -X POST http://localhost:3000/api/sources \
  -H "Content-Type: application/json" \
  -d '{
    "name": "bank_transactions",
    "dedup_fields": ["date", "description", "amount"]
  }'
```

**What this does:** records with the same date + description + amount are considered duplicates.
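Dataflow's internal duplicate check isn't shown in this guide; one way to picture it is a stable key built from only the configured fields. The following is a sketch under that assumption (`dedup_key` and the hashing scheme are illustrative, not Dataflow's actual implementation):

```python
import hashlib

def dedup_key(record, dedup_fields):
    # Join only the configured fields into a stable string key;
    # fields outside dedup_fields do not affect duplicate detection.
    parts = [str(record.get(f, "")) for f in dedup_fields]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()

fields = ["date", "description", "amount"]
a = {"date": "2024-01-02", "description": "GOOGLE *YOUTUBE VIDEOS",
     "amount": "4.26", "category": "Services"}
b = dict(a, category="Other")   # same dedup fields -> duplicate
c = dict(a, amount="5.00")      # different amount -> new record

print(dedup_key(a, fields) == dedup_key(b, fields))  # True
print(dedup_key(a, fields) == dedup_key(c, fields))  # False
```

Note that `b` counts as a duplicate of `a` even though its `category` differs, because `category` is not in `dedup_fields`.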
## Step 4: Create Transformation Rules

Rules extract meaningful data using regex patterns.

**Rule 1: Extract the merchant name (first part of the description)**

```bash
curl -X POST http://localhost:3000/api/rules \
  -H "Content-Type: application/json" \
  -d '{
    "source_name": "bank_transactions",
    "name": "extract_merchant",
    "field": "description",
    "pattern": "^([A-Z][A-Z ]+)",
    "output_field": "merchant",
    "sequence": 1
  }'
```

**Rule 2: Extract the location (city + state pattern)**

```bash
curl -X POST http://localhost:3000/api/rules \
  -H "Content-Type: application/json" \
  -d '{
    "source_name": "bank_transactions",
    "name": "extract_location",
    "field": "description",
    "pattern": "([A-Z]+) OH",
    "output_field": "location",
    "sequence": 2
  }'
```
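Before wiring a pattern into a rule, it helps to try it locally. The sample descriptions below are made up for illustration; check against your actual CSV:

```python
import re

merchant_re = re.compile(r"^([A-Z][A-Z ]+)")
location_re = re.compile(r"([A-Z]+) OH")

# Hypothetical descriptions, not taken from the example file
for desc in ["GOOGLE *YOUTUBE VIDEOS", "TARGET 00123 COLUMBUS OH"]:
    merchant = merchant_re.search(desc)
    location = location_re.search(desc)
    print(merchant.group(1).strip() if merchant else None, "|",
          location.group(1) if location else None)
# GOOGLE | None
# TARGET | COLUMBUS
```

One subtlety: `[A-Z ]+` also matches spaces, so the merchant capture can carry a trailing space (`"GOOGLE "`), hence the `.strip()` here.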
## Step 5: Import Data

Import the example CSV file:

```bash
curl -X POST http://localhost:3000/api/sources/bank_transactions/import \
  -F "file=@examples/bank_transactions.csv"
```

Response:

```json
{
  "success": true,
  "imported": 14,
  "duplicates": 0,
  "log_id": 1
}
```
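The example file itself isn't reproduced in this guide; based on the fields shown in later steps, its header and rows are assumed to look like:

```csv
date,description,amount,category
2024-01-02,GOOGLE *YOUTUBE VIDEOS,4.26,Services
```

The column names matter: `dedup_fields` and rule `field` values must match them exactly.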
## Step 6: View Imported Records

```bash
curl "http://localhost:3000/api/records/source/bank_transactions?limit=5"
```

You'll see the raw imported data. Note that `transformed` is `null` - we haven't applied transformations yet!
## Step 7: Apply Transformations

```bash
curl -X POST http://localhost:3000/api/sources/bank_transactions/transform
```

Response:

```json
{
  "success": true,
  "transformed": 14
}
```

Now check the records again:

```bash
curl "http://localhost:3000/api/records/source/bank_transactions?limit=2"
```

The `transformed` field now contains the original data plus extracted fields such as `merchant` and `location`.
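Conceptually, the transform step copies each record's data and then runs the rules over it in `sequence` order. The sketch below is an assumption about that behavior, not Dataflow's actual code:

```python
import re

def apply_rules(record, rules):
    # Copy the original data, then add one output field per matching
    # rule, in sequence order. Non-matching rules add nothing.
    transformed = dict(record)
    for rule in sorted(rules, key=lambda r: r["sequence"]):
        m = re.search(rule["pattern"], record.get(rule["field"], ""))
        if m:
            transformed[rule["output_field"]] = m.group(1).strip()
    return transformed

rules = [
    {"field": "description", "pattern": r"^([A-Z][A-Z ]+)",
     "output_field": "merchant", "sequence": 1},
    {"field": "description", "pattern": r"([A-Z]+) OH",
     "output_field": "location", "sequence": 2},
]
record = {"date": "2024-01-02", "description": "GOOGLE *YOUTUBE VIDEOS",
          "amount": "4.26"}
result = apply_rules(record, rules)
print(result["merchant"])       # GOOGLE
print("location" in result)     # False (no "... OH" in this description)
```

This also explains why some records gain a `location` field and others don't: each rule only writes its output field when its pattern matches.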
## Step 8: View Extracted Values That Need Mapping

```bash
curl http://localhost:3000/api/mappings/source/bank_transactions/unmapped
```

The response shows extracted merchant names that aren't mapped yet:

```json
[
  {"rule_name": "extract_merchant", "extracted_value": "GOOGLE", "record_count": 2},
  {"rule_name": "extract_merchant", "extracted_value": "TARGET", "record_count": 2},
  {"rule_name": "extract_merchant", "extracted_value": "WALMART", "record_count": 1},
  ...
]
```
## Step 9: Create Value Mappings

Map extracted values to clean, standardized output:

```bash
curl -X POST http://localhost:3000/api/mappings \
  -H "Content-Type: application/json" \
  -d '{
    "source_name": "bank_transactions",
    "rule_name": "extract_merchant",
    "input_value": "GOOGLE",
    "output": {
      "vendor": "Google",
      "category": "Technology"
    }
  }'
```

```bash
curl -X POST http://localhost:3000/api/mappings \
  -H "Content-Type: application/json" \
  -d '{
    "source_name": "bank_transactions",
    "rule_name": "extract_merchant",
    "input_value": "TARGET",
    "output": {
      "vendor": "Target",
      "category": "Retail"
    }
  }'
```

```bash
curl -X POST http://localhost:3000/api/mappings \
  -H "Content-Type: application/json" \
  -d '{
    "source_name": "bank_transactions",
    "rule_name": "extract_merchant",
    "input_value": "WALMART",
    "output": {
      "vendor": "Walmart",
      "category": "Groceries"
    }
  }'
```
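How mappings are applied server-side isn't shown in this guide. A plausible sketch (the data structures here are assumptions, not Dataflow's internals) is a lookup keyed by rule name and extracted value, whose output is merged into the transformed record last, so mapped fields override any original field with the same name:

```python
mappings = {
    ("extract_merchant", "GOOGLE"): {"vendor": "Google", "category": "Technology"},
    ("extract_merchant", "TARGET"): {"vendor": "Target", "category": "Retail"},
}

def apply_mapping(transformed, rule_name, extracted_field):
    out = dict(transformed)
    # Merge the mapped output last: on a key collision (e.g. "category"),
    # the mapping's value wins over the original.
    out.update(mappings.get((rule_name, transformed.get(extracted_field)), {}))
    return out

row = {"merchant": "GOOGLE", "category": "Services"}
print(apply_mapping(row, "extract_merchant", "merchant"))
# {'merchant': 'GOOGLE', 'category': 'Technology', 'vendor': 'Google'}
```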
## Step 10: Reprocess With Mappings

Clear and reapply transformations to pick up the new mappings:

```bash
curl -X POST http://localhost:3000/api/sources/bank_transactions/reprocess
```
## Step 11: View Final Results

```bash
curl "http://localhost:3000/api/records/source/bank_transactions?limit=5"
```

The `transformed` field now contains:

- Original fields (`date`, `description`, `amount`)
- Extracted fields (`merchant`, `location`)
- Mapped fields (`vendor` and `category` from the mappings)

Example result:

```json
{
  "id": 1,
  "data": {
    "date": "2024-01-02",
    "description": "GOOGLE *YOUTUBE VIDEOS",
    "amount": "4.26",
    "category": "Services"
  },
  "transformed": {
    "date": "2024-01-02",
    "description": "GOOGLE *YOUTUBE VIDEOS",
    "amount": "4.26",
    "merchant": "GOOGLE",
    "vendor": "Google",
    "category": "Technology"
  }
}
```

Note that the mapped `category` ("Technology") has replaced the bank's original value ("Services") in `transformed`; the original is still preserved in `data`.
## Step 12: Test Deduplication

Try importing the same file again:

```bash
curl -X POST http://localhost:3000/api/sources/bank_transactions/import \
  -F "file=@examples/bank_transactions.csv"
```

Response:

```json
{
  "success": true,
  "imported": 0,
  "duplicates": 14,
  "log_id": 2
}
```

All records were rejected as duplicates! ✓
## Summary

You've now:

- ✅ Created a data source with deduplication rules
- ✅ Defined transformation rules to extract data
- ✅ Imported CSV data
- ✅ Applied transformations
- ✅ Created value mappings for clean output
- ✅ Reprocessed data with mappings
- ✅ Tested deduplication
## Next Steps

- Add more rules for other extraction patterns
- Create more value mappings as needed
- Query the `transformed` data for reporting
- Import additional CSV files
## Useful Commands

```bash
# View all sources
curl http://localhost:3000/api/sources

# View source statistics
curl http://localhost:3000/api/sources/bank_transactions/stats

# View all rules for a source
curl http://localhost:3000/api/rules/source/bank_transactions

# View all mappings for a source
curl http://localhost:3000/api/mappings/source/bank_transactions

# Search for specific records
curl -X POST http://localhost:3000/api/records/search \
  -H "Content-Type: application/json" \
  -d '{
    "source_name": "bank_transactions",
    "query": {"vendor": "Google"},
    "limit": 10
  }'
```
## Troubleshooting

**API won't start:**

- Check that the `.env` file exists with correct database credentials
- Verify PostgreSQL is running: `psql -U postgres -l`
- Check the logs for error messages

**Import fails:**

- Verify the source exists: `curl http://localhost:3000/api/sources`
- Check that the CSV format matches expectations
- Ensure `dedup_fields` match the CSV column names

**Transformations not working:**

- Check that rules exist: `curl http://localhost:3000/api/rules/source/bank_transactions`
- Test the regex pattern manually
- Check that records have the specified field