# Getting Started with Dataflow

This guide walks through a complete example using bank transaction data.

## Prerequisites

1. PostgreSQL database running
2. Database created: `CREATE DATABASE dataflow;`
3. `.env` file configured (copy from `.env.example`)

## Step 1: Deploy Database Schema

```bash
cd /opt/dataflow
psql -U postgres -d dataflow -f database/schema.sql
psql -U postgres -d dataflow -f database/functions.sql
```

You should see the tables created without errors.

## Step 2: Start the API Server

```bash
npm install
npm start
```

The server should start on port 3000 (or your configured port). Test it:

```bash
curl http://localhost:3000/health
# Should return: {"status":"ok","timestamp":"..."}
```

## Step 3: Create a Data Source

A source defines where data comes from and how to deduplicate it.

```bash
curl -X POST http://localhost:3000/api/sources \
  -H "Content-Type: application/json" \
  -d '{
    "name": "bank_transactions",
    "dedup_fields": ["date", "description", "amount"]
  }'
```

**What this does:** Records with the same date + description + amount will be considered duplicates.

## Step 4: Create Transformation Rules

Rules extract meaningful data using regex patterns.
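Before creating a rule, it can help to sanity-check a pattern locally. The snippet below is only an illustration: the sample description is invented, and it uses `grep` rather than whatever regex engine the API uses, so minor dialect differences are possible.

```shell
# Try the two patterns used by the rules below against a made-up
# description. grep -o prints the full match; an API rule stores
# only the captured group.
desc="TARGET 00032839 COLUMBUS OH"
echo "$desc" | grep -oE '^[A-Z][A-Z ]+'   # merchant-style pattern
echo "$desc" | grep -oE '[A-Z]+ OH'       # location-style pattern
```

The first command matches the leading run of uppercase letters and spaces (`TARGET `, stopping at the first digit); the second matches the city plus state suffix (`COLUMBUS OH`).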
### Rule 1: Extract merchant name (first part of description)

```bash
curl -X POST http://localhost:3000/api/rules \
  -H "Content-Type: application/json" \
  -d '{
    "source_name": "bank_transactions",
    "name": "extract_merchant",
    "field": "description",
    "pattern": "^([A-Z][A-Z ]+)",
    "output_field": "merchant",
    "sequence": 1
  }'
```

### Rule 2: Extract location (city + state pattern)

```bash
curl -X POST http://localhost:3000/api/rules \
  -H "Content-Type: application/json" \
  -d '{
    "source_name": "bank_transactions",
    "name": "extract_location",
    "field": "description",
    "pattern": "([A-Z]+) OH",
    "output_field": "location",
    "sequence": 2
  }'
```

## Step 5: Import Data

Import the example CSV file:

```bash
curl -X POST http://localhost:3000/api/sources/bank_transactions/import \
  -F "file=@examples/bank_transactions.csv"
```

Response:

```json
{
  "success": true,
  "imported": 14,
  "duplicates": 0,
  "log_id": 1
}
```

## Step 6: View Imported Records

```bash
curl "http://localhost:3000/api/records/source/bank_transactions?limit=5"
```

(Quote URLs that contain `?` so the shell doesn't treat it as a glob.) You'll see the raw imported data. Note that `transformed` is `null`: we haven't applied transformations yet.

## Step 7: Apply Transformations

```bash
curl -X POST http://localhost:3000/api/sources/bank_transactions/transform
```

Response:

```json
{
  "success": true,
  "transformed": 14
}
```

Now check the records again:

```bash
curl "http://localhost:3000/api/records/source/bank_transactions?limit=2"
```

You'll see that the `transformed` field now contains the original data plus extracted fields like `merchant` and `location`.
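Conceptually, the value mappings introduced in the next two steps behave like a lookup table keyed on the extracted value. A hypothetical shell sketch of the idea, using the same mappings this guide creates (the real lookup happens server-side; the `lookup` function here is invented purely for illustration):

```shell
# Invented illustration of a value mapping: extracted value in,
# standardized output fields out.
lookup() {
  case "$1" in
    GOOGLE)  echo '{"vendor": "Google", "category": "Technology"}' ;;
    TARGET)  echo '{"vendor": "Target", "category": "Retail"}' ;;
    WALMART) echo '{"vendor": "Walmart", "category": "Groceries"}' ;;
    *)       echo '{}' ;;   # no mapping: record keeps only extracted fields
  esac
}

lookup GOOGLE   # prints {"vendor": "Google", "category": "Technology"}
```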
## Step 8: View Extracted Values That Need Mapping

```bash
curl http://localhost:3000/api/mappings/source/bank_transactions/unmapped
```

The response shows extracted merchant names that aren't mapped yet:

```json
[
  {"rule_name": "extract_merchant", "extracted_value": "GOOGLE", "record_count": 2},
  {"rule_name": "extract_merchant", "extracted_value": "TARGET", "record_count": 2},
  {"rule_name": "extract_merchant", "extracted_value": "WALMART", "record_count": 1},
  ...
]
```

## Step 9: Create Value Mappings

Map extracted values to clean, standardized output:

```bash
curl -X POST http://localhost:3000/api/mappings \
  -H "Content-Type: application/json" \
  -d '{
    "source_name": "bank_transactions",
    "rule_name": "extract_merchant",
    "input_value": "GOOGLE",
    "output": { "vendor": "Google", "category": "Technology" }
  }'

curl -X POST http://localhost:3000/api/mappings \
  -H "Content-Type: application/json" \
  -d '{
    "source_name": "bank_transactions",
    "rule_name": "extract_merchant",
    "input_value": "TARGET",
    "output": { "vendor": "Target", "category": "Retail" }
  }'

curl -X POST http://localhost:3000/api/mappings \
  -H "Content-Type: application/json" \
  -d '{
    "source_name": "bank_transactions",
    "rule_name": "extract_merchant",
    "input_value": "WALMART",
    "output": { "vendor": "Walmart", "category": "Groceries" }
  }'
```

## Step 10: Reprocess With Mappings

Clear and reapply transformations to pick up the new mappings:

```bash
curl -X POST http://localhost:3000/api/sources/bank_transactions/reprocess
```

## Step 11: View Final Results

```bash
curl "http://localhost:3000/api/records/source/bank_transactions?limit=5"
```

Now the `transformed` field contains:

- Original fields (date, description, amount, category)
- Extracted fields (merchant, location)
- Mapped fields (vendor, plus category overridden by the mapping)

Example result:

```json
{
  "id": 1,
  "data": {
    "date": "2024-01-02",
    "description": "GOOGLE *YOUTUBE VIDEOS",
    "amount": "4.26",
    "category": "Services"
  },
  "transformed": {
    "date": "2024-01-02",
    "description": "GOOGLE *YOUTUBE VIDEOS",
    "amount": "4.26",
    "merchant": "GOOGLE",
    "vendor": "Google",
    "category": "Technology"
  }
}
```

Note that the mapped `category` ("Technology") replaces the original value ("Services") in `transformed`, since a JSON object can only hold the key once. The original value is still available in `data`.

## Step 12: Test Deduplication

Try importing the same file again:

```bash
curl -X POST http://localhost:3000/api/sources/bank_transactions/import \
  -F "file=@examples/bank_transactions.csv"
```

Response:

```json
{
  "success": true,
  "imported": 0,
  "duplicates": 14,
  "log_id": 2
}
```

All records were rejected as duplicates! ✓

## Summary

You've now:

- ✅ Created a data source with deduplication rules
- ✅ Defined transformation rules to extract data
- ✅ Imported CSV data
- ✅ Applied transformations
- ✅ Created value mappings for clean output
- ✅ Reprocessed data with mappings
- ✅ Tested deduplication

## Next Steps

- Add more rules for other extraction patterns
- Create more value mappings as needed
- Query the `transformed` data for reporting
- Import additional CSV files

## Useful Commands

```bash
# View all sources
curl http://localhost:3000/api/sources

# View source statistics
curl http://localhost:3000/api/sources/bank_transactions/stats

# View all rules for a source
curl http://localhost:3000/api/rules/source/bank_transactions

# View all mappings for a source
curl http://localhost:3000/api/mappings/source/bank_transactions

# Search for specific records
curl -X POST http://localhost:3000/api/records/search \
  -H "Content-Type: application/json" \
  -d '{
    "source_name": "bank_transactions",
    "query": {"vendor": "Google"},
    "limit": 10
  }'
```

## Troubleshooting

**API won't start:**

- Check that the `.env` file exists with correct database credentials
- Verify PostgreSQL is running: `psql -U postgres -l`
- Check the logs for error messages

**Import fails:**

- Verify the source exists: `curl http://localhost:3000/api/sources`
- Check that the CSV format matches expectations
- Ensure `dedup_fields` match the CSV column names

**Transformations not working:**

- Check that rules exist: `curl http://localhost:3000/api/rules/source/bank_transactions`
- Test the regex pattern manually
- Check that records have the specified field
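For the import checks above, one quick test is whether every dedup field actually appears in the CSV's header row. A standalone sketch of that check; it writes a sample header to a temp file so it runs on its own, but in practice you would point `csv` at your real file (for this guide, `examples/bank_transactions.csv`):

```shell
# Check that each dedup field is present in the CSV header row.
# Sample header written to a temp file so this runs standalone.
csv=$(mktemp)
printf 'date,description,amount,category\n' > "$csv"

for f in date description amount; do
  if head -1 "$csv" | tr -d '\r' | tr ',' '\n' | grep -qx "$f"; then
    echo "ok: $f"
  else
    echo "missing dedup field: $f"
  fi
done

rm -f "$csv"
```

The `tr -d '\r'` guards against Windows-style line endings, which are a common cause of a dedup field silently failing to match its column name.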