# Getting Started with Dataflow
This guide walks through a complete example using bank transaction data.
## Prerequisites
1. PostgreSQL database running
2. Database created: `CREATE DATABASE dataflow;`
3. `.env` file configured (copy from `.env.example`)
## Step 1: Deploy Database Schema
```bash
cd /opt/dataflow
psql -U postgres -d dataflow -f database/schema.sql
psql -U postgres -d dataflow -f database/functions.sql
```
You should see tables created without errors.
## Step 2: Start the API Server
```bash
npm install
npm start
```
The server should start on port 3000 (or your configured port).
Test it:
```bash
curl http://localhost:3000/health
# Should return: {"status":"ok","timestamp":"..."}
```
## Step 3: Create a Data Source
A source defines where data comes from and how to deduplicate it.
```bash
curl -X POST http://localhost:3000/api/sources \
  -H "Content-Type: application/json" \
  -d '{
    "name": "bank_transactions",
    "dedup_fields": ["date", "description", "amount"]
  }'
```
**What this does:** Records with the same date + description + amount will be considered duplicates.
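Conceptually, each incoming record gets a dedup key derived from those three fields, and records whose keys collide are skipped. The real check happens server-side (in `database/functions.sql`); the sketch below is only a hypothetical illustration of the idea:

```shell
# Hypothetical illustration of field-based deduplication: build a key
# from date + description + amount and compare. This is NOT the actual
# implementation, which lives in database/functions.sql.
row_a='2024-01-02|GOOGLE *YOUTUBE VIDEOS|4.26'
row_b='2024-01-02|GOOGLE *YOUTUBE VIDEOS|4.26'
key_a=$(printf '%s' "$row_a" | sha256sum | cut -d' ' -f1)
key_b=$(printf '%s' "$row_b" | sha256sum | cut -d' ' -f1)
[ "$key_a" = "$key_b" ] && echo "duplicate: skipping row_b"
```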
## Step 4: Create Transformation Rules
Rules extract meaningful data using regex patterns.
### Rule 1: Extract merchant name (first part of description)
```bash
curl -X POST http://localhost:3000/api/rules \
  -H "Content-Type: application/json" \
  -d '{
    "source_name": "bank_transactions",
    "name": "extract_merchant",
    "field": "description",
    "pattern": "^([A-Z][A-Z ]+)",
    "output_field": "merchant",
    "sequence": 1
  }'
```
### Rule 2: Extract location (city preceding the state code `OH`)
```bash
curl -X POST http://localhost:3000/api/rules \
  -H "Content-Type: application/json" \
  -d '{
    "source_name": "bank_transactions",
    "name": "extract_location",
    "field": "description",
    "pattern": "([A-Z]+) OH",
    "output_field": "location",
    "sequence": 2
  }'
```
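Before creating a rule, you can sanity-check its pattern locally. This is only an approximation (assuming GNU grep; the server applies patterns with its own regex engine):

```shell
# Approximate local check of the two patterns above (GNU grep).
printf 'GOOGLE *YOUTUBE VIDEOS\n' | grep -oE '^[A-Z][A-Z ]+'
# prints "GOOGLE " (note the trailing space; trim it as needed)
printf 'TARGET COLUMBUS OH\n' | grep -oE '[A-Z]+ OH'
# prints "COLUMBUS OH" (the capture group ([A-Z]+) keeps only COLUMBUS)
```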
## Step 5: Import Data
Import the example CSV file:
```bash
curl -X POST http://localhost:3000/api/sources/bank_transactions/import \
  -F "file=@examples/bank_transactions.csv"
```
Response:
```json
{
  "success": true,
  "imported": 14,
  "duplicates": 0,
  "log_id": 1
}
```
## Step 6: View Imported Records
```bash
curl "http://localhost:3000/api/records/source/bank_transactions?limit=5"
```
You'll see the raw imported data. Note that `transformed` is still `null`: transformations haven't been applied yet.
## Step 7: Apply Transformations
```bash
curl -X POST http://localhost:3000/api/sources/bank_transactions/transform
```
Response:
```json
{
  "success": true,
  "transformed": 14
}
```
Now check the records again:
```bash
curl "http://localhost:3000/api/records/source/bank_transactions?limit=2"
```
You'll see the `transformed` field now contains the original data plus extracted fields like `merchant` and `location`.
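To spot-check a single extracted field from a saved response without extra tooling, plain grep works (jq is nicer if you have it; the sample record below mirrors the response shape, not an actual API call):

```shell
# Pull one extracted field out of a saved response with plain grep.
# The sample record mirrors the shape of the records shown above.
record='{"id":1,"transformed":{"merchant":"GOOGLE","location":"COLUMBUS"}}'
printf '%s\n' "$record" | grep -oE '"merchant":"[^"]*"'
# prints: "merchant":"GOOGLE"
```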
## Step 8: View Extracted Values That Need Mapping
```bash
curl http://localhost:3000/api/mappings/source/bank_transactions/unmapped
```
Response shows extracted merchant names that aren't mapped yet:
```json
[
  {"rule_name": "extract_merchant", "extracted_value": "GOOGLE", "record_count": 2},
  {"rule_name": "extract_merchant", "extracted_value": "TARGET", "record_count": 2},
  {"rule_name": "extract_merchant", "extracted_value": "WALMART", "record_count": 1},
  ...
]
```
## Step 9: Create Value Mappings
Map extracted values to clean, standardized output:
```bash
curl -X POST http://localhost:3000/api/mappings \
  -H "Content-Type: application/json" \
  -d '{
    "source_name": "bank_transactions",
    "rule_name": "extract_merchant",
    "input_value": "GOOGLE",
    "output": {
      "vendor": "Google",
      "category": "Technology"
    }
  }'

curl -X POST http://localhost:3000/api/mappings \
  -H "Content-Type: application/json" \
  -d '{
    "source_name": "bank_transactions",
    "rule_name": "extract_merchant",
    "input_value": "TARGET",
    "output": {
      "vendor": "Target",
      "category": "Retail"
    }
  }'

curl -X POST http://localhost:3000/api/mappings \
  -H "Content-Type: application/json" \
  -d '{
    "source_name": "bank_transactions",
    "rule_name": "extract_merchant",
    "input_value": "WALMART",
    "output": {
      "vendor": "Walmart",
      "category": "Groceries"
    }
  }'
```
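The three calls above follow one template, so a small loop can generate them. This sketch only prints each JSON payload; uncomment the curl lines to actually POST them:

```shell
# Generate one mapping payload per "input|vendor|category" line.
# Only prints each payload; uncomment the curl call to POST it.
while IFS='|' read -r value vendor category; do
  payload=$(printf '{"source_name":"bank_transactions","rule_name":"extract_merchant","input_value":"%s","output":{"vendor":"%s","category":"%s"}}' \
    "$value" "$vendor" "$category")
  echo "$payload"
  # curl -X POST http://localhost:3000/api/mappings \
  #   -H "Content-Type: application/json" -d "$payload"
done <<'EOF'
GOOGLE|Google|Technology
TARGET|Target|Retail
WALMART|Walmart|Groceries
EOF
```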
## Step 10: Reprocess With Mappings
Clear and reapply transformations to pick up the new mappings:
```bash
curl -X POST http://localhost:3000/api/sources/bank_transactions/reprocess
```
## Step 11: View Final Results
```bash
curl "http://localhost:3000/api/records/source/bank_transactions?limit=5"
```
Now the `transformed` field contains:
- Original fields (date, description, amount)
- Extracted fields (merchant, location)
- Mapped fields (vendor, plus category, which the mapping overwrites)

Example result:
```json
{
  "id": 1,
  "data": {
    "date": "2024-01-02",
    "description": "GOOGLE *YOUTUBE VIDEOS",
    "amount": "4.26",
    "category": "Services"
  },
  "transformed": {
    "date": "2024-01-02",
    "description": "GOOGLE *YOUTUBE VIDEOS",
    "amount": "4.26",
    "merchant": "GOOGLE",
    "vendor": "Google",
    "category": "Technology"
  }
}
```
Note that the mapped `category` ("Technology") replaces the original value ("Services").
## Step 12: Test Deduplication
Try importing the same file again:
```bash
curl -X POST http://localhost:3000/api/sources/bank_transactions/import \
  -F "file=@examples/bank_transactions.csv"
```
Response:
```json
{
  "success": true,
  "imported": 0,
  "duplicates": 14,
  "log_id": 2
}
```
All records were rejected as duplicates! ✓
## Summary
You've now:
- ✅ Created a data source with deduplication rules
- ✅ Defined transformation rules to extract data
- ✅ Imported CSV data
- ✅ Applied transformations
- ✅ Created value mappings for clean output
- ✅ Reprocessed data with mappings
- ✅ Tested deduplication
## Next Steps
- Add more rules for other extraction patterns
- Create more value mappings as needed
- Query the `transformed` data for reporting
- Import additional CSV files
## Useful Commands
```bash
# View all sources
curl http://localhost:3000/api/sources

# View source statistics
curl http://localhost:3000/api/sources/bank_transactions/stats

# View all rules for a source
curl http://localhost:3000/api/rules/source/bank_transactions

# View all mappings for a source
curl http://localhost:3000/api/mappings/source/bank_transactions

# Search for specific records
curl -X POST http://localhost:3000/api/records/search \
  -H "Content-Type: application/json" \
  -d '{
    "source_name": "bank_transactions",
    "query": {"vendor": "Google"},
    "limit": 10
  }'
```
## Troubleshooting
**API won't start:**
- Check `.env` file exists with correct database credentials
- Verify PostgreSQL is running: `psql -U postgres -l`
- Check logs for error messages
**Import fails:**
- Verify source exists: `curl http://localhost:3000/api/sources`
- Check CSV format matches expectations
- Ensure dedup_fields match CSV column names
**Transformations not working:**
- Check rules exist: `curl http://localhost:3000/api/rules/source/bank_transactions`
- Test the regex pattern manually against sample values (e.g. with `grep -E`)
- Check that records actually contain the field named in the rule