Document Perspective architecture options
Captures current /data path (with bug history that forced single-batch encoding), and four candidate redesigns: optimize the existing encoder, DuckDB-WASM with Parquet, server-side DuckDB virtual server, and the hybrid read-from-WASM/write-via-deltas variant. Each option weighed against the forecasting write path, not just initial load. Intended as a decision record so context survives a lost conversation. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
parent
6b69b00645
commit
1a3209cbc2
257
pf_perspective_options.md
Normal file
257
pf_perspective_options.md
Normal file
@ -0,0 +1,257 @@
|
|||||||
|
# Perspective Architecture Options
|
||||||
|
|
||||||
|
This document weighs how the Forecast view should source data for the Perspective
|
||||||
|
pivot. The current implementation hits practical limits on initial load
|
||||||
|
(~30s for 350k rows × ~55 cols), and growth is expected. Choosing an architecture
|
||||||
|
now should account for both **read** (initial pivot load + interaction) and
|
||||||
|
**write** (forecasting operations that mutate rows).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Current architecture
|
||||||
|
|
||||||
|
### Data flow
|
||||||
|
|
||||||
|
- **Transport:** `GET /api/versions/:id/data` returns the full forecast table
|
||||||
|
as Apache Arrow IPC stream. Server-side: pg cursor (`FETCH 10000`) accumulates
|
||||||
|
all rows, `tableFromJSON` builds an Arrow table, `tableToIPC` produces one
|
||||||
|
record batch, response sent with `Content-Length`.
|
||||||
|
- **Joined columns:** `/data` LEFT JOINs `pf.log` to surface `pf_note` (the
|
||||||
|
user's note for the operation that produced each row) and `pf_op`
|
||||||
|
(baseline/scale/recode/clone). Joined at fetch time so note edits are
|
||||||
|
always live. (Added in `bf85f11`.)
|
||||||
|
- **Client:** Streams the response body to a `Uint8Array`, hands it to
|
||||||
|
Perspective's `worker.table()` (`@perspective-dev/client@4.4.0` from CDN).
|
||||||
|
Perspective's WASM engine owns the table in browser memory; all
|
||||||
|
pivots/filters/group-bys run locally.
|
||||||
|
- **Progress UI:** Forecast view reads the response body via
|
||||||
|
`response.body.getReader()` and shows received-bytes / total-bytes while
|
||||||
|
loading.
|
||||||
|
- **Forecasting writes:**
|
||||||
|
- `scale`/`recode`/`clone` POST → server INSERTs new rows with
|
||||||
|
`RETURNING *` → client receives JSON rows →
|
||||||
|
`tableRef.current.update(rows)` appends to Perspective's local table.
|
||||||
|
**Fast — no reload.**
|
||||||
|
- `undo` (DELETE) → server removes rows by `pf_logid` → client calls
|
||||||
|
`initViewer(...)` which **fully reloads** the table.
|
||||||
|
- `baseline` reload → currently also a full reload.
|
||||||
|
|
||||||
|
### Why this specific shape (the bug history)
|
||||||
|
|
||||||
|
The current "accumulate all rows, emit one record batch" approach is not
|
||||||
|
accidental. Two failure modes drove it:
|
||||||
|
|
||||||
|
1. **pg returns `bigint` (oid 20) and `numeric` (oid 1700) as JS strings by
|
||||||
|
default.** That made `tableFromJSON` infer `Dictionary<Utf8>` for ~50 of
|
||||||
|
55 columns. Fix in `server.js`: register type parsers that coerce both
|
||||||
|
to `Number` so Arrow infers `Int`/`Float64`.
|
||||||
|
2. **Per-batch `tableFromJSON` creates independent dictionaries.** When we
|
||||||
|
streamed batches, the writer emitted ~1230 dictionary REPLACEMENT
|
||||||
|
messages between batches. Perspective's WASM Arrow reader crashes on
|
||||||
|
those (`RuntimeError: memory access out of bounds`). Fix: accumulate
|
||||||
|
rows server-side, build one Arrow table, emit a single record batch.
|
||||||
|
Reference comment lives in `routes/operations.js` near the cursor loop.
|
||||||
|
|
||||||
|
These two bugs explain the ~10–15s server stall before the progress bar
|
||||||
|
appears: the server can't send byte 1 until every row has been fetched,
|
||||||
|
encoded, and the buffer is sized for `Content-Length`. **Any redesign of
|
||||||
|
the read path needs to either solve the dictionary-replacement issue
|
||||||
|
(streaming with stable dictionary IDs declared up front) or replace the
|
||||||
|
transport entirely (e.g., Parquet, server-side virtual table).**
|
||||||
|
|
||||||
|
### Implication for any redesign
|
||||||
|
|
||||||
|
The incremental update path (`table.update(rows)`) is what makes
|
||||||
|
operations feel snappy today. Whatever architecture comes next, writes
|
||||||
|
need to stay incremental — or get even cheaper. Undo's full reload is
|
||||||
|
already a known wart.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## The options
|
||||||
|
|
||||||
|
### A. Stay client-side WASM; optimize the encode path
|
||||||
|
|
||||||
|
Keep the architecture. Replace the slow pieces.
|
||||||
|
|
||||||
|
- **Encode:** drop `tableFromJSON`. Build Arrow vectors directly from
|
||||||
|
`cols_meta` types (typed arrays for numerics, dictionary builders for
|
||||||
|
strings). Eliminates per-row type inference.
|
||||||
|
- **Stream:** declare schema up front, send dictionaries once, stream record
|
||||||
|
batches as they come off the cursor. Progress bar starts within ~1s.
|
||||||
|
- **Trim:** request-level `?cols=` parameter so the server can return only
|
||||||
|
the columns the active layout needs.
|
||||||
|
- **Writes:** unchanged — `table.update(rows)` keeps working.
|
||||||
|
- **Undo:** same path; same wart. Could be improved by surfacing a
|
||||||
|
`table.remove(pf_ids)` instead of `initViewer`.
|
||||||
|
|
||||||
|
| Aspect | Impact |
|
||||||
|
|---|---|
|
||||||
|
| Initial load | ~3–5× faster server encode + parallel transfer; bar appears in ~1s |
|
||||||
|
| Interaction | Unchanged (already instant) |
|
||||||
|
| Writes | Unchanged (already fast) |
|
||||||
|
| Browser memory ceiling | Still limited by Perspective WASM (~1–2M rows is the rough wall) |
|
||||||
|
| Code change | Medium: new builder code in `routes/operations.js`, schema declaration; UI mostly unchanged |
|
||||||
|
| New runtime deps | None |
|
||||||
|
|
||||||
|
**Right answer if:** dataset stays under ~1M rows and the goal is "make it
|
||||||
|
faster without rearchitecting."
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### B. DuckDB-WASM in the browser (Parquet load + `DuckDBHandler`)
|
||||||
|
|
||||||
|
Replace the Arrow IPC payload with a Parquet file. Browser loads it into
|
||||||
|
DuckDB-WASM. Perspective's `DuckDBHandler` (from
|
||||||
|
`@perspective-dev/client/dist/esm/virtual_servers/duckdb.js`) backs the
|
||||||
|
viewer — every pivot interaction becomes a SQL query against the local
|
||||||
|
DuckDB-WASM instance. Perspective ships the view-config-to-SQL translator;
|
||||||
|
no custom code there.
|
||||||
|
|
||||||
|
- **Initial transfer:** Parquet for a forecast table is typically ~10–30 MB
|
||||||
|
for 350k rows (vs. ~80–150 MB for Arrow IPC). Smaller download, no
|
||||||
|
server-side `tableFromJSON`.
|
||||||
|
- **Encode:** server-side. DuckDB on the server can `COPY (SELECT ... FROM
|
||||||
|
postgres_scan(...)) TO 'foo.parquet'`, or pre-stage Parquet on each
|
||||||
|
forecast write. Either way, no Node-side Arrow encode.
|
||||||
|
- **Interaction:** instant — local SQL on a columnar engine. No round trips.
|
||||||
|
- **Writes:** **this is the hard part.** After a `scale`/`recode`/`clone`,
|
||||||
|
the server has new rows in pg but DuckDB-WASM has a stale snapshot.
|
||||||
|
Options:
|
||||||
|
1. **Server returns new rows as Arrow** → client does `INSERT INTO
|
||||||
|
forecast SELECT * FROM arrow_view` in DuckDB-WASM, then notifies the
|
||||||
|
`DuckDBHandler` to refresh views.
|
||||||
|
2. **Re-export Parquet** → re-fetch. Simple but wasteful for small
|
||||||
|
incremental ops.
|
||||||
|
3. **Maintain a delta log** → client replays inserts/deletes by `pf_logid`.
|
||||||
|
- **Undo:** `DELETE FROM forecast WHERE pf_logid = $1` against DuckDB-WASM,
|
||||||
|
then refresh. Strictly faster than the current full reload.
|
||||||
|
|
||||||
|
| Aspect | Impact |
|
||||||
|
|---|---|
|
||||||
|
| Initial load | Smaller payload + fast WASM ingest; likely 3–5× total |
|
||||||
|
| Interaction | Instant (local SQL) — same as today |
|
||||||
|
| Writes | New write-sync layer required (medium effort) |
|
||||||
|
| Browser memory ceiling | DuckDB-WASM handles 10M+ rows comfortably |
|
||||||
|
| Code change | Significant: new server route for Parquet, new client wiring, write-sync code |
|
||||||
|
| New runtime deps | DuckDB on server (Node-API or shell), `@duckdb/duckdb-wasm` on client |
|
||||||
|
|
||||||
|
**Right answer if:** dataset will grow past ~1M rows but you still want
|
||||||
|
local interaction speed, *and* you're willing to write the write-sync layer.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### C. Server-side DuckDB as a virtual server (no client load)
|
||||||
|
|
||||||
|
DuckDB lives on the Node server. Browser uses a `VirtualServerHandler`
|
||||||
|
implementation that proxies Perspective's view requests (`tableMakeView`,
|
||||||
|
`viewGetData`, `viewGetMinMax`, `tableSchema`) to a `/perspective` endpoint.
|
||||||
|
Server runs SQL against DuckDB which queries pg directly via
|
||||||
|
`postgres_scanner`, or against a Parquet copy.
|
||||||
|
|
||||||
|
- **Initial transfer:** essentially zero. Schema + first viewport only.
|
||||||
|
- **Interaction:** every drag/filter/group-by is a network round trip.
|
||||||
|
50–200ms typical. Imperceptible for most operations; noticeable on
|
||||||
|
rapid drag interactions.
|
||||||
|
- **Writes:** simplest. Operations write to pg as today. DuckDB queries
|
||||||
|
pg live (via `postgres_scanner`) so it always sees current state. No
|
||||||
|
client-side state to sync.
|
||||||
|
- **Undo:** same as writes — server state is the source of truth.
|
||||||
|
|
||||||
|
| Aspect | Impact |
|
||||||
|
|---|---|
|
||||||
|
| Initial load | <1s regardless of dataset size |
|
||||||
|
| Interaction | 50–200ms round trip per interaction |
|
||||||
|
| Writes | Simple — single source of truth on server |
|
||||||
|
| Browser memory ceiling | Irrelevant — data never enters the browser |
|
||||||
|
| Code change | Significant: custom `VirtualServerHandler` that talks to a new `/perspective` endpoint; server-side translator wiring |
|
||||||
|
| New runtime deps | DuckDB on server |
|
||||||
|
|
||||||
|
**Right answer if:** dataset will outgrow browser memory (10M+ rows) or
|
||||||
|
multiple users need to see real-time shared state. Pays an interaction
|
||||||
|
latency tax forever.
|
||||||
|
|
||||||
|
**Note:** Perspective-dev also ships a Python `virtual_servers/duckdb`.
|
||||||
|
If you're willing to add a Python sidecar, you may not need to write the
|
||||||
|
JS-side handler — just stand up the Python server. Significant infra
|
||||||
|
change for a Node-based app.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### D. Hybrid — DuckDB-WASM read, pg write, server-pushed deltas
|
||||||
|
|
||||||
|
Same browser stack as B, but writes flow differently. After a forecast
|
||||||
|
operation, the server pushes back an Arrow batch of new rows (or a list of
|
||||||
|
`pf_logid`s to delete for undo). The client applies it to DuckDB-WASM via
|
||||||
|
SQL and refreshes the Perspective view. No re-export of Parquet on every
|
||||||
|
write.
|
||||||
|
|
||||||
|
This is essentially B with the write-sync layer specified. Splitting it out
|
||||||
|
because the write contract is the architectural decision worth deciding
|
||||||
|
explicitly:
|
||||||
|
|
||||||
|
- **Insert deltas:** server returns new rows as Arrow IPC, client does
|
||||||
|
`INSERT INTO forecast SELECT * FROM arrow_view`. Already trivial in
|
||||||
|
DuckDB-WASM.
|
||||||
|
- **Delete deltas:** server returns `{deleted_logid: N}`, client does
|
||||||
|
`DELETE FROM forecast WHERE pf_logid = N`.
|
||||||
|
- **Replace deltas (e.g., note edits):** if `pf_note` is joined at fetch
|
||||||
|
time (current state after `bf85f11`), edits are invisible until refetch.
|
||||||
|
Either accept that, or store note on the row and `UPDATE`.
|
||||||
|
|
||||||
|
This is the cleanest end state for a forecasting app: bulk read once,
|
||||||
|
incremental sync after.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Comparison
|
||||||
|
|
||||||
|
| | Current | A: optimize | B/D: DuckDB-WASM | C: server DuckDB |
|
||||||
|
|---|---|---|---|---|
|
||||||
|
| Initial load (350k rows) | ~30s | ~5–10s | ~3–8s | <1s |
|
||||||
|
| Interaction latency | 0 | 0 | 0 | 50–200ms |
|
||||||
|
| Write feedback | instant | instant | instant (after sync) | instant |
|
||||||
|
| Undo cost | full reload | full reload (or fix) | local DELETE | server-side |
|
||||||
|
| Browser memory ceiling | ~1M rows | ~1M rows | 10M+ rows | none |
|
||||||
|
| New deps | — | — | DuckDB (server + WASM) | DuckDB (server) |
|
||||||
|
| Code change | — | medium | significant | significant |
|
||||||
|
| Risk surface | low | low | medium (write sync) | medium (translator wiring) |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Open questions to resolve before choosing
|
||||||
|
|
||||||
|
1. **Expected dataset size 12 months out.** If it stays at ~350k–1M rows,
|
||||||
|
option A is enough. If it goes to 5M+, A is dead in the water.
|
||||||
|
2. **Parquet caching strategy if going B/D.** Re-export on every write is
|
||||||
|
wasteful; delta replay is more code. Pick one explicitly before
|
||||||
|
building.
|
||||||
|
3. **Multi-user scenarios.** If two users edit the same version
|
||||||
|
concurrently, options B/D need a mechanism for one user's writes to
|
||||||
|
appear in another's local DuckDB-WASM. Option C gets this for free.
|
||||||
|
4. **Python-or-Node decision for server-side DuckDB.** Perspective-dev's
|
||||||
|
Python virtual server might let you skip writing a translator entirely
|
||||||
|
— at the cost of a Python runtime alongside Node. Worth investigating
|
||||||
|
before committing to a JS-side custom handler.
|
||||||
|
5. **Should the spec move?** The spec mentions DuckDB only as a faster
|
||||||
|
bulk-encode path (option A-ish, server-side). Options B/C/D are
|
||||||
|
architectural shifts the spec doesn't contemplate. Whatever's chosen
|
||||||
|
should be written into `pf_spec.md` so the reasoning isn't lost again.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Recommendation framing (not a decision)
|
||||||
|
|
||||||
|
- **If the immediate problem is "30s loads feel bad":** option A. It's the
|
||||||
|
smallest change with the highest perceived impact and doesn't paint you
|
||||||
|
into an architectural corner.
|
||||||
|
- **If you're already planning for data growth:** option D (DuckDB-WASM +
|
||||||
|
delta sync). It's the right end state for a single-user-per-version
|
||||||
|
forecasting tool with mid-to-large datasets.
|
||||||
|
- **If multi-user real-time becomes a goal:** option C. Pay the latency
|
||||||
|
tax once and have a cleaner data model.
|
||||||
|
|
||||||
|
A reasonable phased path: do A first (fast, low risk, ships value this
|
||||||
|
week), live with it while planning, then move to D when row counts demand
|
||||||
|
it. C is a different shape and probably not warranted unless multi-user
|
||||||
|
emerges as a requirement.
|
||||||
Loading…
Reference in New Issue
Block a user