
258 lines
12 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Perspective Architecture Options
This document weighs how the Forecast view should source data for the Perspective
pivot. The current implementation hits practical limits on initial load
(~30s for 350k rows × ~55 cols), and growth is expected. Choosing an architecture
now should account for both **read** (initial pivot load + interaction) and
**write** (forecasting operations that mutate rows).

---
## Current architecture
### Data flow
- **Transport:** `GET /api/versions/:id/data` returns the full forecast table
as an Apache Arrow IPC stream. Server-side: pg cursor (`FETCH 10000`) accumulates
all rows, `tableFromJSON` builds an Arrow table, `tableToIPC` produces one
record batch, response sent with `Content-Length`.
- **Joined columns:** `/data` LEFT JOINs `pf.log` to surface `pf_note` (the
user's note for the operation that produced each row) and `pf_op`
(baseline/scale/recode/clone). Joined at fetch time so note edits are
always live. (Added in `bf85f11`.)
- **Client:** Streams the response body to a `Uint8Array`, hands it to
Perspective's `worker.table()` (`@perspective-dev/client@4.4.0` from CDN).
Perspective's WASM engine owns the table in browser memory; all
pivots/filters/group-bys run locally. (See the sketch after this list.)
- **Progress UI:** Forecast view reads the response body via
`response.body.getReader()` and shows received-bytes / total-bytes while
loading.
- **Forecasting writes:**
- `scale`/`recode`/`clone` POST → server INSERTs new rows with
`RETURNING *` → client receives JSON rows →
`tableRef.current.update(rows)` appends to Perspective's local table.
**Fast — no reload.**
- `undo` (DELETE) → server removes rows by `pf_logid` → client calls
`initViewer(...)` which **fully reloads** the table.
- `baseline` reload → currently also a full reload.
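
A condensed sketch of the client-side flow in the bullets above (stream the
body, hand the buffer to Perspective, append operation results). It assumes the
`@perspective-dev/client` build keeps the same `worker.table()` / viewer
`load()` / `table.update()` surface the view already uses; helper names are
illustrative.

```js
// Sketch only: current client path. Stream the Arrow IPC body, hand it to
// Perspective's WASM engine, append operation results with table.update().
import perspective from "@perspective-dev/client"; // CDN build in the real app

async function loadForecast(versionId, viewerEl, onProgress) {
  const resp = await fetch(`/api/versions/${versionId}/data`);
  const total = Number(resp.headers.get("Content-Length"));
  const reader = resp.body.getReader();
  const chunks = [];
  let received = 0;
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    chunks.push(value);
    received += value.length;
    onProgress(received, total); // drives the received-bytes / total-bytes bar
  }
  const buf = new Uint8Array(received);
  let offset = 0;
  for (const c of chunks) { buf.set(c, offset); offset += c.length; }

  const worker = await perspective.worker();
  const table = await worker.table(buf.buffer); // WASM engine owns the table locally
  await viewerEl.load(table);                   // all pivots/filters run in-browser
  return table;
}

// scale/recode/clone: the POST returns inserted rows as JSON; append, no reload.
async function applyOperationResult(table, rows) {
  await table.update(rows);
}
```
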
### Why this specific shape (the bug history)
The current "accumulate all rows, emit one record batch" approach is not
accidental. Two failure modes drove it:
1. **pg returns `bigint` (oid 20) and `numeric` (oid 1700) as JS strings by
default.** That made `tableFromJSON` infer `Dictionary<Utf8>` for ~50 of
55 columns. Fix in `server.js`: register type parsers that coerce both
to `Number` so Arrow infers `Int`/`Float64`.
2. **Per-batch `tableFromJSON` creates independent dictionaries.** When we
streamed batches, the writer emitted ~1230 dictionary REPLACEMENT
messages between batches. Perspective's WASM Arrow reader crashes on
those (`RuntimeError: memory access out of bounds`). Fix: accumulate
rows server-side, build one Arrow table, emit a single record batch.
Reference comment lives in `routes/operations.js` near the cursor loop.

These two bugs explain the ~10–15s server stall before the progress bar
appears: the server can't send byte 1 until every row has been fetched,
encoded, and the buffer is sized for `Content-Length`. **Any redesign of
the read path needs to either solve the dictionary-replacement issue
(streaming with stable dictionary IDs declared up front) or replace the
transport entirely (e.g., Parquet, server-side virtual table).**
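
For reference, the current single-batch encode boils down to roughly this (a
sketch, assuming `pg`, `pg-cursor`, and `apache-arrow`; `FORECAST_SQL` stands
in for the real `/data` query with the `pf.log` join):

```js
// Sketch only: accumulate all rows, build one Arrow table, emit one record batch.
const { types } = require("pg");
const Cursor = require("pg-cursor");
const { tableFromJSON, tableToIPC } = require("apache-arrow");

// Bug 1 fix: parse bigint (oid 20) and numeric (oid 1700) as Number so Arrow
// infers Int/Float64 instead of Dictionary<Utf8> for most columns.
types.setTypeParser(20, Number);
types.setTypeParser(1700, Number);

async function sendVersionData(client, versionId, res) {
  const cursor = client.query(new Cursor(FORECAST_SQL, [versionId]));
  const rows = [];
  for (;;) {
    const batch = await cursor.read(10000); // FETCH 10000
    if (batch.length === 0) break;
    rows.push(...batch); // Bug 2 fix: accumulate instead of encoding per batch,
  }                      // so only one set of dictionaries is ever written.
  await cursor.close();

  const table = tableFromJSON(rows);
  const ipc = tableToIPC(table, "stream"); // one record batch, no dictionary REPLACEMENTs
  res.setHeader("Content-Type", "application/vnd.apache.arrow.stream");
  res.setHeader("Content-Length", ipc.byteLength);
  res.end(Buffer.from(ipc)); // nothing is sent until every row is fetched and encoded
}
```
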
### Implication for any redesign
The incremental update path (`table.update(rows)`) is what makes
operations feel snappy today. Whatever architecture comes next, writes
need to stay incremental — or get even cheaper. Undo's full reload is
already a known wart.

---
## The options
### A. Stay client-side WASM; optimize the encode path
Keep the architecture. Replace the slow pieces.
- **Encode:** drop `tableFromJSON`. Build Arrow vectors directly from
`cols_meta` types (typed arrays for numerics, dictionary builders for
strings). Eliminates per-row type inference. (See the sketch after this list.)
- **Stream:** declare schema up front, send dictionaries once, stream record
batches as they come off the cursor. Progress bar starts within ~1s.
- **Trim:** request-level `?cols=` parameter so the server can return only
the columns the active layout needs.
- **Writes:** unchanged — `table.update(rows)` keeps working.
- **Undo:** same path; same wart. Could be improved by surfacing a
`table.remove(pf_ids)` instead of `initViewer`.
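
A sketch of the direct-build encode from the first bullet, using
`apache-arrow`'s `tableFromArrays` so numeric columns come straight from typed
arrays. The `cols_meta` shape and column names shown are assumptions:

```js
// Sketch only: build columns from cols_meta types, skipping per-row inference.
const { tableFromArrays } = require("apache-arrow");

// Assumed shape: cols_meta = [{ name: "amount", type: "numeric" }, ...]
function buildForecastTable(rows, colsMeta, wantedCols) {
  const columns = {};
  for (const { name, type } of colsMeta) {
    if (wantedCols && !wantedCols.includes(name)) continue; // ?cols= trimming
    if (type === "numeric" || type === "bigint" || type === "integer") {
      // Typed array -> Float64 column; nulls become NaN in this sketch.
      columns[name] = Float64Array.from(rows, (r) => r[name] ?? NaN);
    } else {
      // String columns: apache-arrow picks Utf8/dictionary encoding for JS arrays.
      columns[name] = rows.map((r) => r[name]);
    }
  }
  return tableFromArrays(columns);
}
```

Streaming the result as record batches with dictionaries declared up front is
the other half of this option and isn't sketched here.
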
| Aspect | Impact |
|---|---|
| Initial load | ~3–5× faster server encode + parallel transfer; bar appears in ~1s |
| Interaction | Unchanged (already instant) |
| Writes | Unchanged (already fast) |
| Browser memory ceiling | Still limited by Perspective WASM (~1–2M rows is the rough wall) |
| Code change | Medium: new builder code in `routes/operations.js`, schema declaration; UI mostly unchanged |
| New runtime deps | None |
**Right answer if:** dataset stays under ~1M rows and the goal is "make it
faster without rearchitecting."

---
### B. DuckDB-WASM in the browser (Parquet load + `DuckDBHandler`)
Replace the Arrow IPC payload with a Parquet file. Browser loads it into
DuckDB-WASM. Perspective's `DuckDBHandler` (from
`@perspective-dev/client/dist/esm/virtual_servers/duckdb.js`) backs the
viewer — every pivot interaction becomes a SQL query against the local
DuckDB-WASM instance. Perspective ships the view-config-to-SQL translator;
no custom code there.
- **Initial transfer:** Parquet for a forecast table is typically ~10–30 MB
for 350k rows (vs. ~80–150 MB for Arrow IPC). Smaller download, no
server-side `tableFromJSON`. (Load path sketched after this list.)
- **Encode:** server-side. DuckDB on the server can `COPY (SELECT ... FROM
postgres_scan(...)) TO 'foo.parquet'`, or pre-stage Parquet on each
forecast write. Either way, no Node-side Arrow encode.
- **Interaction:** instant — local SQL on a columnar engine. No round trips.
- **Writes:** **this is the hard part.** After a `scale`/`recode`/`clone`,
the server has new rows in pg but DuckDB-WASM has a stale snapshot.
Options:
1. **Server returns new rows as Arrow** → client does `INSERT INTO
forecast SELECT * FROM arrow_view` in DuckDB-WASM, then notifies the
`DuckDBHandler` to refresh views.
2. **Re-export Parquet** → re-fetch. Simple but wasteful for small
incremental ops.
3. **Maintain a delta log** → client replays inserts/deletes by `pf_logid`.
- **Undo:** `DELETE FROM forecast WHERE pf_logid = $1` against DuckDB-WASM,
then refresh. Strictly faster than the current full reload.
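
A sketch of the load path only (write sync is option D's subject), assuming
`@duckdb/duckdb-wasm`'s documented bundle bootstrap and a hypothetical
`/api/versions/:id/parquet` route:

```js
// Sketch only: fetch a Parquet snapshot and materialize it in DuckDB-WASM.
import * as duckdb from "@duckdb/duckdb-wasm";

async function loadForecastDuckDB(versionId) {
  // Standard duckdb-wasm bootstrap: pick a bundle, spin up its worker.
  const bundle = await duckdb.selectBundle(duckdb.getJsDelivrBundles());
  const workerUrl = URL.createObjectURL(
    new Blob([`importScripts("${bundle.mainWorker}");`], { type: "text/javascript" })
  );
  const db = new duckdb.AsyncDuckDB(new duckdb.ConsoleLogger(), new Worker(workerUrl));
  await db.instantiate(bundle.mainModule, bundle.pthreadWorker);

  // Hypothetical route: the server pre-stages or COPYs the version to Parquet.
  const resp = await fetch(`/api/versions/${versionId}/parquet`);
  await db.registerFileBuffer("forecast.parquet", new Uint8Array(await resp.arrayBuffer()));

  const conn = await db.connect();
  await conn.query("CREATE TABLE forecast AS SELECT * FROM read_parquet('forecast.parquet')");
  return { db, conn }; // conn is what the DuckDBHandler would query on every pivot
}
```
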
| Aspect | Impact |
|---|---|
| Initial load | Smaller payload + fast WASM ingest; likely 3–5× total |
| Interaction | Instant (local SQL) — same as today |
| Writes | New write-sync layer required (medium effort) |
| Browser memory ceiling | DuckDB-WASM handles 10M+ rows comfortably |
| Code change | Significant: new server route for Parquet, new client wiring, write-sync code |
| New runtime deps | DuckDB on server (Node-API or shell), `@duckdb/duckdb-wasm` on client |
**Right answer if:** dataset will grow past ~1M rows but you still want
local interaction speed, *and* you're willing to write the write-sync layer.

---
### C. Server-side DuckDB as a virtual server (no client load)
DuckDB lives on the Node server. Browser uses a `VirtualServerHandler`
implementation that proxies Perspective's view requests (`tableMakeView`,
`viewGetData`, `viewGetMinMax`, `tableSchema`) to a `/perspective` endpoint.
Server runs SQL against DuckDB which queries pg directly via
`postgres_scanner`, or against a Parquet copy. (Sketched after the bullets below.)
- **Initial transfer:** essentially zero. Schema + first viewport only.
- **Interaction:** every drag/filter/group-by is a network round trip.
50–200ms typical. Imperceptible for most operations; noticeable on
rapid drag interactions.
- **Writes:** simplest. Operations write to pg as today. DuckDB queries
pg live (via `postgres_scanner`) so it always sees current state. No
client-side state to sync.
- **Undo:** same as writes — server state is the source of truth.
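
A rough server-side sketch, assuming the Node `duckdb` package, the `postgres`
extension (successor to `postgres_scanner`), and the app's existing
Express-style `app`. The view-request-to-SQL translation is the real
custom-handler work and is only stubbed here:

```js
// Sketch only: DuckDB on the Node server, reading pg live, fronted by /perspective.
const duckdb = require("duckdb");
const db = new duckdb.Database(":memory:");

// One-time setup: attach the Postgres database so every query sees current state.
db.exec(`
  INSTALL postgres; LOAD postgres;
  ATTACH 'dbname=pf host=localhost' AS pg (TYPE POSTGRES, READ_ONLY);
`);

// Perspective's viewer talks to a VirtualServerHandler on the client, which
// proxies tableMakeView / viewGetData / etc. to this endpoint.
app.post("/perspective", (req, res) => {
  const sql = translateViewRequest(req.body); // assumed: view config -> SQL over the attached pg tables
  db.all(sql, (err, rows) => {
    if (err) return res.status(500).json({ error: err.message });
    res.json(rows); // a real handler would return Arrow, not JSON
  });
});
```
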
| Aspect | Impact |
|---|---|
| Initial load | <1s regardless of dataset size |
| Interaction | 50–200ms round trip per interaction |
| Writes | Simple single source of truth on server |
| Browser memory ceiling | Irrelevant data never enters the browser |
| Code change | Significant: custom `VirtualServerHandler` that talks to a new `/perspective` endpoint; server-side translator wiring |
| New runtime deps | DuckDB on server |
**Right answer if:** dataset will outgrow browser memory (10M+ rows) or
multiple users need to see real-time shared state. Pays an interaction
latency tax forever.

**Note:** Perspective-dev also ships a Python `virtual_servers/duckdb`.
If you're willing to add a Python sidecar, you may not need to write the
JS-side handler; just stand up the Python server. Still a significant
infra change for a Node-based app.

---
### D. Hybrid — DuckDB-WASM read, pg write, server-pushed deltas
Same browser stack as B, but writes flow differently. After a forecast
operation, the server pushes back an Arrow batch of new rows (or a list of
`pf_logid`s to delete for undo). The client applies it to DuckDB-WASM via
SQL and refreshes the Perspective view. No re-export of Parquet on every
write.

This is essentially B with the write-sync layer specified. It's split out
because the write contract is the architectural decision worth making
explicit:
- **Insert deltas:** server returns new rows as Arrow IPC, client does
`INSERT INTO forecast SELECT * FROM arrow_view`. Already trivial in
DuckDB-WASM.
- **Delete deltas:** server returns `{deleted_logid: N}`, client does
`DELETE FROM forecast WHERE pf_logid = N`.
- **Replace deltas (e.g., note edits):** if `pf_note` is joined at fetch
time (current state after `bf85f11`), edits are invisible until refetch.
Either accept that, or store note on the row and `UPDATE`.

This is the cleanest end state for a forecasting app: bulk read once,
incremental sync after.
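
On the client, the delta contract could look roughly like this.
`insertArrowFromIPCStream` is duckdb-wasm's Arrow append API, but the exact
option flags and the response shapes are assumptions:

```js
// Sketch only: apply server-pushed deltas to the local DuckDB-WASM forecast table.

// Insert delta: server responds to scale/recode/clone with the new rows as Arrow IPC.
async function applyInsertDelta(conn, arrowBytes /* Uint8Array */) {
  await conn.insertArrowFromIPCStream(arrowBytes, { name: "forecast", create: false });
}

// Delete delta (undo): server returns the pf_logid it removed from pg.
async function applyDeleteDelta(conn, deletedLogid) {
  const stmt = await conn.prepare("DELETE FROM forecast WHERE pf_logid = ?");
  await stmt.query(deletedLogid);
  await stmt.close();
}

// After either delta, the Perspective handler needs a nudge to re-run its views
// so the pivot reflects the new local state.
```
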
---
## Comparison
| | Current | A: optimize | B/D: DuckDB-WASM | C: server DuckDB |
|---|---|---|---|---|
| Initial load (350k rows) | ~30s | ~5–10s | ~3–8s | <1s |
| Interaction latency | 0 | 0 | 0 | 50–200ms |
| Write feedback | instant | instant | instant (after sync) | instant |
| Undo cost | full reload | full reload (or fix) | local DELETE | server-side |
| Browser memory ceiling | ~1M rows | ~1M rows | 10M+ rows | none |
| New deps | | | DuckDB (server + WASM) | DuckDB (server) |
| Code change | | medium | significant | significant |
| Risk surface | low | low | medium (write sync) | medium (translator wiring) |
---
## Open questions to resolve before choosing
1. **Expected dataset size 12 months out.** If it stays at ~350k–1M rows,
option A is enough. If it goes to 5M+, A is dead in the water.
2. **Parquet caching strategy if going B/D.** Re-export on every write is
wasteful; delta replay is more code. Pick one explicitly before
building.
3. **Multi-user scenarios.** If two users edit the same version
concurrently, options B/D need a mechanism for one user's writes to
appear in another's local DuckDB-WASM. Option C gets this for free.
4. **Python-or-Node decision for server-side DuckDB.** Perspective-dev's
Python virtual server might let you skip writing a translator entirely
at the cost of a Python runtime alongside Node. Worth investigating
before committing to a JS-side custom handler.
5. **Should the spec be updated?** The spec mentions DuckDB only as a faster
bulk-encode path (option A-ish, server-side). Options B/C/D are
architectural shifts the spec doesn't contemplate. Whatever's chosen
should be written into `pf_spec.md` so the reasoning isn't lost again.
---
## Recommendation framing (not a decision)
- **If the immediate problem is "30s loads feel bad":** option A. It's the
smallest change with the highest perceived impact and doesn't paint you
into an architectural corner.
- **If you're already planning for data growth:** option D (DuckDB-WASM +
delta sync). It's the right end state for a single-user-per-version
forecasting tool with mid-to-large datasets.
- **If multi-user real-time becomes a goal:** option C. Pay the latency
tax once and have a cleaner data model.

A reasonable phased path: do A first (fast, low risk, ships value this
week), live with it while planning, then move to D when row counts demand
it. C is a different shape and probably not warranted unless multi-user
emerges as a requirement.