# Getting Started with Dataflow

This guide walks through a complete example using bank transaction data.

## Prerequisites

1. PostgreSQL database running
2. Database created: `CREATE DATABASE dataflow;`
3. `.env` file configured (copy from `.env.example`)

## Step 1: Deploy Database Schema

```bash
cd /opt/dataflow
psql -U postgres -d dataflow -f database/schema.sql
psql -U postgres -d dataflow -f database/functions.sql
```

You should see tables created without errors.

## Step 2: Start the API Server

```bash
npm install
npm start
```

The server should start on port 3000 (or your configured port).

Test it:

```bash
curl http://localhost:3000/health
# Should return: {"status":"ok","timestamp":"..."}
```

## Step 3: Create a Data Source

A source defines where data comes from and how to deduplicate it.

```bash
curl -X POST http://localhost:3000/api/sources \
  -H "Content-Type: application/json" \
  -d '{
    "name": "bank_transactions",
    "dedup_fields": ["date", "description", "amount"]
  }'
```

**What this does:** Records with the same date + description + amount will be considered duplicates.
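
As a mental model, deduplication can be pictured as hashing the values of the configured `dedup_fields` into a key and rejecting any record whose key has already been seen. This is an illustrative sketch, not the server's actual implementation:

```python
import hashlib

def dedup_key(record, dedup_fields):
    """Build a stable key from only the configured dedup fields."""
    parts = [str(record.get(f, "")) for f in dedup_fields]
    return hashlib.sha256("|".join(parts).encode()).hexdigest()

fields = ["date", "description", "amount"]
a = {"date": "2024-01-02", "description": "GOOGLE *YOUTUBE VIDEOS", "amount": "4.26"}
b = dict(a, category="Anything")  # fields outside dedup_fields don't affect the key
```

Because `category` is not a dedup field, `a` and `b` produce the same key, so `b` would be rejected as a duplicate; changing any of the three configured fields produces a different key.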

## Step 4: Create Transformation Rules

Rules extract meaningful data using regex patterns.

### Rule 1: Extract merchant name (first part of description)

```bash
curl -X POST http://localhost:3000/api/rules \
  -H "Content-Type: application/json" \
  -d '{
    "source_name": "bank_transactions",
    "name": "extract_merchant",
    "field": "description",
    "pattern": "^([A-Z][A-Z ]+)",
    "output_field": "merchant",
    "sequence": 1
  }'
```

### Rule 2: Extract location (city + state pattern)

```bash
curl -X POST http://localhost:3000/api/rules \
  -H "Content-Type: application/json" \
  -d '{
    "source_name": "bank_transactions",
    "name": "extract_location",
    "field": "description",
    "pattern": "([A-Z]+) OH",
    "output_field": "location",
    "sequence": 2
  }'
```
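
You can sanity-check the two patterns locally before creating the rules. The sample description below is made up for illustration (the Ohio city is an assumption, not a row from the example CSV):

```python
import re

desc = "TARGET 00123 COLUMBUS OH"  # hypothetical sample description

# Leading run of uppercase letters/spaces, trimmed -> "TARGET"
merchant = re.search(r"^([A-Z][A-Z ]+)", desc).group(1).strip()

# Uppercase word immediately before "OH" -> "COLUMBUS"
location = re.search(r"([A-Z]+) OH", desc).group(1)
```

Note that the merchant capture includes a trailing space (the character class allows spaces), which is why the `.strip()` matters.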

## Step 5: Import Data

Import the example CSV file:

```bash
curl -X POST http://localhost:3000/api/sources/bank_transactions/import \
  -F "file=@examples/bank_transactions.csv"
```

Response:

```json
{
  "success": true,
  "imported": 14,
  "duplicates": 0,
  "log_id": 1
}
```

## Step 6: View Imported Records

```bash
curl "http://localhost:3000/api/records/source/bank_transactions?limit=5"
```

You'll see the raw imported data. Note that `transformed` is `null`: we haven't applied transformations yet!

## Step 7: Apply Transformations

```bash
curl -X POST http://localhost:3000/api/sources/bank_transactions/transform
```

Response:

```json
{
  "success": true,
  "transformed": 14
}
```

Now check the records again:

```bash
curl "http://localhost:3000/api/records/source/bank_transactions?limit=2"
```

You'll see that the `transformed` field now contains the original data plus extracted fields like `merchant` and `location`.
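
Conceptually, the transform step runs each rule in `sequence` order against its input field and merges any captured value into a transformed copy of the record. A rough sketch of that behavior (an assumption about the server's logic, not its actual code):

```python
import re

rules = [  # same patterns as the rules created in Step 4
    {"field": "description", "pattern": r"^([A-Z][A-Z ]+)", "output_field": "merchant"},
    {"field": "description", "pattern": r"([A-Z]+) OH", "output_field": "location"},
]

def transform(record):
    out = dict(record)  # transformed = original fields plus extracted fields
    for rule in rules:  # rules run in ascending `sequence` order
        m = re.search(rule["pattern"], record.get(rule["field"], ""))
        if m:
            out[rule["output_field"]] = m.group(1).strip()
    return out

row = {"date": "2024-01-02", "description": "GOOGLE *YOUTUBE VIDEOS", "amount": "4.26"}
```

For this row, `merchant` comes out as `"GOOGLE"`, while `location` is simply absent because the second pattern doesn't match the description.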

## Step 8: View Extracted Values That Need Mapping

```bash
curl http://localhost:3000/api/mappings/source/bank_transactions/unmapped
```

The response shows extracted merchant names that aren't mapped yet:

```json
[
  {"rule_name": "extract_merchant", "extracted_value": "GOOGLE", "record_count": 2},
  {"rule_name": "extract_merchant", "extracted_value": "TARGET", "record_count": 2},
  {"rule_name": "extract_merchant", "extracted_value": "WALMART", "record_count": 1},
  ...
]
```

## Step 9: Create Value Mappings

Map extracted values to clean, standardized output:

```bash
curl -X POST http://localhost:3000/api/mappings \
  -H "Content-Type: application/json" \
  -d '{
    "source_name": "bank_transactions",
    "rule_name": "extract_merchant",
    "input_value": "GOOGLE",
    "output": {
      "vendor": "Google",
      "category": "Technology"
    }
  }'

curl -X POST http://localhost:3000/api/mappings \
  -H "Content-Type: application/json" \
  -d '{
    "source_name": "bank_transactions",
    "rule_name": "extract_merchant",
    "input_value": "TARGET",
    "output": {
      "vendor": "Target",
      "category": "Retail"
    }
  }'

curl -X POST http://localhost:3000/api/mappings \
  -H "Content-Type: application/json" \
  -d '{
    "source_name": "bank_transactions",
    "rule_name": "extract_merchant",
    "input_value": "WALMART",
    "output": {
      "vendor": "Walmart",
      "category": "Groceries"
    }
  }'
```
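
The effect of a mapping can be pictured as a dictionary lookup on the rule's extracted value, with the mapping's `output` object merged over the transformed record. A sketch under that assumption (not the actual server code):

```python
# Mapping table for the extract_merchant rule, mirroring the curl calls above
merchant_mappings = {
    "GOOGLE": {"vendor": "Google", "category": "Technology"},
    "TARGET": {"vendor": "Target", "category": "Retail"},
    "WALMART": {"vendor": "Walmart", "category": "Groceries"},
}

def apply_mapping(transformed, output_field, mappings):
    """Merge the mapped output over the record; mapped keys win on conflict."""
    extra = mappings.get(transformed.get(output_field), {})
    return {**transformed, **extra}

rec = {"description": "GOOGLE *YOUTUBE VIDEOS", "category": "Services", "merchant": "GOOGLE"}
```

After the merge, `vendor` is added and the original `category` of `"Services"` is overwritten with `"Technology"`; records whose extracted value has no mapping pass through unchanged.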

## Step 10: Reprocess With Mappings

Clear and reapply transformations to pick up the new mappings:

```bash
curl -X POST http://localhost:3000/api/sources/bank_transactions/reprocess
```

## Step 11: View Final Results

```bash
curl "http://localhost:3000/api/records/source/bank_transactions?limit=5"
```

Now the `transformed` field contains:

- Original fields (date, description, amount, category)
- Extracted fields (merchant, location)
- Mapped fields (vendor, category from mappings; the mapped `category` overrides the original)

Example result:

```json
{
  "id": 1,
  "data": {
    "date": "2024-01-02",
    "description": "GOOGLE *YOUTUBE VIDEOS",
    "amount": "4.26",
    "category": "Services"
  },
  "transformed": {
    "date": "2024-01-02",
    "description": "GOOGLE *YOUTUBE VIDEOS",
    "amount": "4.26",
    "merchant": "GOOGLE",
    "vendor": "Google",
    "category": "Technology"
  }
}
```

## Step 12: Test Deduplication

Try importing the same file again:

```bash
curl -X POST http://localhost:3000/api/sources/bank_transactions/import \
  -F "file=@examples/bank_transactions.csv"
```

Response:

```json
{
  "success": true,
  "imported": 0,
  "duplicates": 14,
  "log_id": 2
}
```

All records were rejected as duplicates! ✓

## Summary

You've now:

- ✅ Created a data source with deduplication rules
- ✅ Defined transformation rules to extract data
- ✅ Imported CSV data
- ✅ Applied transformations
- ✅ Created value mappings for clean output
- ✅ Reprocessed data with mappings
- ✅ Tested deduplication

## Next Steps

- Add more rules for other extraction patterns
- Create more value mappings as needed
- Query the `transformed` data for reporting
- Import additional CSV files

## Useful Commands

```bash
# View all sources
curl http://localhost:3000/api/sources

# View source statistics
curl http://localhost:3000/api/sources/bank_transactions/stats

# View all rules for a source
curl http://localhost:3000/api/rules/source/bank_transactions

# View all mappings for a source
curl http://localhost:3000/api/mappings/source/bank_transactions

# Search for specific records
curl -X POST http://localhost:3000/api/records/search \
  -H "Content-Type: application/json" \
  -d '{
    "source_name": "bank_transactions",
    "query": {"vendor": "Google"},
    "limit": 10
  }'
```

## Troubleshooting

**API won't start:**

- Check that the `.env` file exists with correct database credentials
- Verify PostgreSQL is running: `psql -U postgres -l`
- Check the logs for error messages

**Import fails:**

- Verify the source exists: `curl http://localhost:3000/api/sources`
- Check that the CSV format matches expectations
- Ensure `dedup_fields` match the CSV column names

**Transformations not working:**

- Check that rules exist: `curl http://localhost:3000/api/rules/source/bank_transactions`
- Test the regex pattern manually
- Check that records have the specified field