Captures current /data path (with bug history that forced single-batch encoding), and four candidate redesigns: optimize the existing encoder, DuckDB-WASM with Parquet, server-side DuckDB virtual server, and the hybrid read-from-WASM/write-via-deltas variant. Each option weighed against the forecasting write path, not just initial load. Intended as a decision record so context survives a lost conversation. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Perspective Architecture Options
This document weighs how the Forecast view should source data for the Perspective pivot. The current implementation hits practical limits on initial load (~30s for 350k rows × ~55 cols), and growth is expected. Choosing an architecture now should account for both read (initial pivot load + interaction) and write (forecasting operations that mutate rows).
Current architecture
Data flow
- Transport: GET /api/versions/:id/data returns the full forecast table as Apache Arrow IPC stream. Server-side: pg cursor (FETCH 10000) accumulates all rows, tableFromJSON builds an Arrow table, tableToIPC produces one record batch, response sent with Content-Length.
- Joined columns: /data LEFT JOINs pf.log to surface pf_note (the user's note for the operation that produced each row) and pf_op (baseline/scale/recode/clone). Joined at fetch time so note edits are always live. (Added in bf85f11.)
- Client: Streams the response body to a Uint8Array, hands it to Perspective's worker.table() (@perspective-dev/client@4.4.0 from CDN). Perspective's WASM engine owns the table in browser memory; all pivots/filters/group-bys run locally.
- Progress UI: Forecast view reads the response body via response.body.getReader() and shows received-bytes / total-bytes while loading.
- Forecasting writes:
  - scale/recode/clone POST → server INSERTs new rows with RETURNING * → client receives JSON rows → tableRef.current.update(rows) appends to Perspective's local table. Fast: no reload.
  - undo (DELETE) → server removes rows by pf_logid → client calls initViewer(...) which fully reloads the table.
  - baseline reload → currently also a full reload.
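The Client and Progress UI steps above can be sketched as follows. This is a minimal illustration, not the Forecast view's actual code: createChunkCollector and readWithProgress are hypothetical names, and the sketch assumes the server sets Content-Length as described under Transport.

```javascript
// Hypothetical helper: collects streamed chunks and reports progress.
function createChunkCollector(totalBytes) {
  const chunks = [];
  let received = 0;
  return {
    // Adds one Uint8Array chunk; returns the fraction received,
    // or null when the total size is unknown.
    push(chunk) {
      chunks.push(chunk);
      received += chunk.byteLength;
      return totalBytes > 0 ? Math.min(received / totalBytes, 1) : null;
    },
    // Concatenates all chunks into the single Uint8Array that would be
    // handed to worker.table().
    finish() {
      const out = new Uint8Array(received);
      let offset = 0;
      for (const c of chunks) {
        out.set(c, offset);
        offset += c.byteLength;
      }
      return out;
    },
  };
}

// Drives the collector from a fetch Response (browser, or Node 18+).
async function readWithProgress(response, onProgress) {
  const total = Number(response.headers.get("Content-Length")) || 0;
  const collector = createChunkCollector(total);
  const reader = response.body.getReader();
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    onProgress(collector.push(value));
  }
  return collector.finish();
}
```

Keeping the byte accounting separate from the fetch loop makes the progress math easy to check without a network call.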
Why this specific shape (the bug history)
The current "accumulate all rows, emit one record batch" approach is not accidental. Two failure modes drove it:
- pg returns bigint (oid 20) and numeric (oid 1700) as JS strings by default. That made tableFromJSON infer Dictionary<Utf8> for ~50 of 55 columns. Fix in server.js: register type parsers that coerce both to Number so Arrow infers Int/Float64.
- Per-batch tableFromJSON creates independent dictionaries. When we streamed batches, the writer emitted ~1230 dictionary REPLACEMENT messages between batches. Perspective's WASM Arrow reader crashes on those (RuntimeError: memory access out of bounds). Fix: accumulate rows server-side, build one Arrow table, emit a single record batch. Reference comment lives in routes/operations.js near the cursor loop.
The fix for the second bug explains the ~10–15s server stall before the
progress bar appears: the server can't send byte 1 until every row has
been fetched and encoded and the buffer has been sized for
Content-Length. Any redesign of
the read path needs to either solve the dictionary-replacement issue
(streaming with stable dictionary IDs declared up front) or replace the
transport entirely (e.g., Parquet, server-side virtual table).
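For reference, the type-parser fix for the first bug might look roughly like this. It is a sketch, not the code in server.js: parseInt8 and parseNumeric are hypothetical names, and the safe-integer check is an addition of this sketch (the fix as described simply coerces to Number).

```javascript
// pg type OIDs named in the bug description.
const OID_INT8 = 20;      // bigint
const OID_NUMERIC = 1700; // numeric

// bigint arrives as a decimal string; refuse values a JS Number
// cannot represent exactly instead of silently rounding them.
function parseInt8(text) {
  const n = Number(text);
  if (!Number.isSafeInteger(n)) {
    throw new RangeError(`bigint ${text} exceeds Number.MAX_SAFE_INTEGER`);
  }
  return n;
}

// numeric arrives as a decimal string; parseFloat rounds it to the
// nearest float64, which loses precision for very long decimals.
function parseNumeric(text) {
  return parseFloat(text);
}

// Registration with node-postgres (needs the pg package, so it is left
// commented out here):
//   const { types } = require("pg");
//   types.setTypeParser(OID_INT8, parseInt8);
//   types.setTypeParser(OID_NUMERIC, parseNumeric);
```

Whether throwing on out-of-range bigints is acceptable depends on the data; falling back to a string for those values would reintroduce the dictionary inference problem for that column.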
Implication for any redesign
The incremental update path (table.update(rows)) is what makes
operations feel snappy today. Whatever architecture comes next, writes
need to stay incremental — or get even cheaper. Undo's full reload is
already a known wart.
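One way to make undo incremental too, sketched under assumptions: rows carry a unique pf_id alongside pf_logid, and the Perspective table is created with an index on pf_id so that rows can be removed by key. LogRowIndex and undoLocally are hypothetical names.

```javascript
// Hypothetical bookkeeping: remembers which row ids each operation produced.
class LogRowIndex {
  constructor() {
    this.idsByLog = new Map(); // pf_logid -> array of pf_id
  }

  // Call with the JSON rows returned by a scale/recode/clone POST.
  record(rows) {
    for (const row of rows) {
      const ids = this.idsByLog.get(row.pf_logid) ?? [];
      ids.push(row.pf_id);
      this.idsByLog.set(row.pf_logid, ids);
    }
  }

  // Returns and forgets the row ids for one operation.
  take(logid) {
    const ids = this.idsByLog.get(logid) ?? [];
    this.idsByLog.delete(logid);
    return ids;
  }
}

// Applies an undo locally; `table` is anything with a remove(keys) method.
// Returns the number of rows removed.
function undoLocally(table, index, logid) {
  const ids = index.take(logid);
  if (ids.length > 0) table.remove(ids);
  return ids.length;
}
```

As written this only covers operations performed in the current session; undoing an operation that predates the initial load would need the index populated from the loaded rows as well, or a fallback to the full reload.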
The options
A. Stay client-side WASM; optimize the encode path
Keep the architecture. Replace the slow pieces.
- Encode: drop tableFromJSON. Build Arrow vectors directly from cols_meta types (typed arrays for numerics, dictionary builders for strings). Eliminates per-row type inference.
- Stream: declare schema up front, send dictionaries once, stream record batches as they come off the cursor. Progress bar starts within ~1s.
- Trim: request-level ?cols= parameter so the server can return only the columns the active layout needs.
- Writes: unchanged; table.update(rows) keeps working.
- Undo: same path; same wart. Could be improved by surfacing a table.remove(pf_ids) instead of initViewer.
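The Encode and Stream bullets hinge on two ideas: column-major typed arrays driven by declared types, and one dictionary per string column that is shared across batches. A library-free sketch of both follows. The cols_meta shape shown is an assumption, the names are hypothetical, and nulls are simplified to NaN and -1 where real Arrow would use validity bitmaps.

```javascript
// Assumed cols_meta shape:
//   [{ name: "amount", type: "number" }, { name: "region", type: "string" }]

// One dictionary per string column, shared by every batch, so an index
// assigned in batch 1 means the same value in batch N.
class StableDictionary {
  constructor() {
    this.values = [];
    this.indexOf = new Map();
  }
  encode(value) {
    let i = this.indexOf.get(value);
    if (i === undefined) {
      i = this.values.length;
      this.values.push(value);
      this.indexOf.set(value, i);
    }
    return i;
  }
}

function createBatchEncoder(colsMeta) {
  const dictionaries = new Map(
    colsMeta
      .filter((c) => c.type === "string")
      .map((c) => [c.name, new StableDictionary()])
  );

  // Turns one cursor batch (array of row objects) into typed arrays,
  // with no per-row type inference.
  function encodeBatch(rows) {
    const columns = {};
    for (const { name, type } of colsMeta) {
      if (type === "number") {
        const data = new Float64Array(rows.length);
        rows.forEach((row, i) => { data[i] = row[name] ?? NaN; });
        columns[name] = data;
      } else {
        const dict = dictionaries.get(name);
        const data = new Int32Array(rows.length);
        rows.forEach((row, i) => {
          data[i] = row[name] == null ? -1 : dict.encode(row[name]);
        });
        columns[name] = data;
      }
    }
    return columns;
  }

  return { encodeBatch, dictionaries };
}
```

Note that this dictionary still grows as new values appear. Sending dictionaries once, as the Stream bullet proposes, would need the distinct values known before the first batch (for example from a prior DISTINCT query); the Arrow IPC format also defines delta dictionary batches, but whether Perspective's reader accepts those has not been verified here.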
| Aspect | Impact |
|---|---|
| Initial load | ~3–5× faster server encode + parallel transfer; bar appears in ~1s |
| Interaction | Unchanged (already instant) |
| Writes | Unchanged (already fast) |
| Browser memory ceiling | Still limited by Perspective WASM (~1–2M rows is the rough wall) |
| Code change | Medium: new builder code in routes/operations.js, schema declaration; UI mostly unchanged |
| New runtime deps | None |
Right answer if: dataset stays under ~1M rows and the goal is "make it faster without rearchitecting."
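The ?cols= trim in option A could be handled along these lines. This is a sketch with hypothetical helper names; the list of known columns would come from wherever the server already tracks the forecast table's schema.

```javascript
// Parses ?cols=a,b,c against the known column list. Unknown names are
// rejected rather than interpolated into SQL.
function selectColumns(colsParam, knownColumns) {
  if (colsParam == null || colsParam.trim() === "") return [...knownColumns];
  const known = new Set(knownColumns);
  const requested = [
    ...new Set(colsParam.split(",").map((s) => s.trim()).filter(Boolean)),
  ];
  const unknown = requested.filter((c) => !known.has(c));
  if (unknown.length > 0) {
    throw new Error(`unknown columns: ${unknown.join(", ")}`);
  }
  return requested;
}

// Double-quotes identifiers for a Postgres SELECT list.
function toSelectList(columns) {
  return columns.map((c) => `"${c.replace(/"/g, '""')}"`).join(", ");
}
```

An omitted or empty parameter returns every column, so existing callers keep the current behavior.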
B. DuckDB-WASM in the browser (Parquet load + DuckDBHandler)
Replace the Arrow IPC payload with a Parquet file. Browser loads it into
DuckDB-WASM. Perspective's DuckDBHandler (from
@perspective-dev/client/dist/esm/virtual_servers/duckdb.js) backs the
viewer — every pivot interaction becomes a SQL query against the local
DuckDB-WASM instance. Perspective ships the view-config-to-SQL translator;
no custom code there.
- Initial transfer: Parquet for a forecast table is typically ~10–30 MB for 350k rows (vs. ~80–150 MB for Arrow IPC). Smaller download, no server-side tableFromJSON.
- Encode: server-side. DuckDB on the server can COPY (SELECT ... FROM postgres_scan(...)) TO 'foo.parquet', or pre-stage Parquet on each forecast write. Either way, no Node-side Arrow encode.
- Interaction: instant, since it is local SQL on a columnar engine. No round trips.
- Writes: this is the hard part. After a scale/recode/clone, the server has new rows in pg but DuckDB-WASM has a stale snapshot. Options:
  - Server returns new rows as Arrow → client does INSERT INTO forecast SELECT * FROM arrow_view in DuckDB-WASM, then notifies the DuckDBHandler to refresh views.
  - Re-export Parquet → re-fetch. Simple but wasteful for small incremental ops.
  - Maintain a delta log → client replays inserts/deletes by pf_logid.
- Undo: DELETE FROM forecast WHERE pf_logid = $1 against DuckDB-WASM, then refresh. Strictly faster than the current full reload.
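The server-side Encode step amounts to building one DuckDB statement. A sketch of that, assuming DuckDB's postgres extension is installed and loaded; buildParquetExport is a hypothetical helper, and the connection string, schema, table, and output path in the usage are placeholders, not names from this codebase.

```javascript
// Escapes a value as a SQL string literal (single quotes doubled).
function sqlString(value) {
  return `'${String(value).replace(/'/g, "''")}'`;
}

// Builds the DuckDB statement that copies one pg table to a Parquet file.
// postgres_scan takes a connection string, a schema name, and a table name.
function buildParquetExport({ pgConn, schema, table, outPath }) {
  return (
    `COPY (SELECT * FROM postgres_scan(` +
    `${sqlString(pgConn)}, ${sqlString(schema)}, ${sqlString(table)})) ` +
    `TO ${sqlString(outPath)} (FORMAT PARQUET)`
  );
}

// Example with placeholder values:
const exampleSql = buildParquetExport({
  pgConn: "dbname=pf",
  schema: "pf",
  table: "forecast",
  outPath: "/tmp/v1.parquet",
});
```

A per-version export would add a WHERE clause inside the subquery; that is omitted because the version column name is not given in this document.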
| Aspect | Impact |
|---|---|
| Initial load | Smaller payload + fast WASM ingest; likely 3–5× total |
| Interaction | Instant (local SQL) — same as today |
| Writes | New write-sync layer required (medium effort) |
| Browser memory ceiling | DuckDB-WASM handles 10M+ rows comfortably |
| Code change | Significant: new server route for Parquet, new client wiring, write-sync code |
| New runtime deps | DuckDB on server (Node-API or shell), @duckdb/duckdb-wasm on client |
Right answer if: dataset will grow past ~1M rows but you still want local interaction speed, and you're willing to write the write-sync layer.
C. Server-side DuckDB as a virtual server (no client load)
DuckDB lives on the Node server. Browser uses a VirtualServerHandler
implementation that proxies Perspective's view requests (tableMakeView,
viewGetData, viewGetMinMax, tableSchema) to a /perspective endpoint.
Server runs SQL against DuckDB which queries pg directly via
postgres_scanner, or against a Parquet copy.
- Initial transfer: essentially zero. Schema + first viewport only.
- Interaction: every drag/filter/group-by is a network round trip. 50–200ms typical. Imperceptible for most operations; noticeable on rapid drag interactions.
- Writes: simplest. Operations write to pg as today. DuckDB queries pg live (via postgres_scanner) so it always sees current state. No client-side state to sync.
- Undo: same as writes; server state is the source of truth.
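To make the proxy shape concrete, here is a sketch of how the /perspective endpoint might route requests. Only the four method names come from the description above; the { method, args } envelope, the dispatch function, and the result shape are hypothetical design choices, not Perspective's API.

```javascript
// Method names are the ones listed above; everything else is assumed.
const ALLOWED_METHODS = new Set([
  "tableMakeView",
  "viewGetData",
  "viewGetMinMax",
  "tableSchema",
]);

// Validates a parsed request body and routes it to a handler.
// `handlers` maps method name -> function(args); in a real server each
// handler would run SQL against DuckDB and likely return a promise,
// which this function would pass through unchanged in `result`.
function dispatch(request, handlers) {
  const { method, args = {} } = request ?? {};
  if (!ALLOWED_METHODS.has(method)) {
    return { ok: false, error: `unsupported method: ${String(method)}` };
  }
  const handler = handlers[method];
  if (typeof handler !== "function") {
    return { ok: false, error: `no handler for: ${method}` };
  }
  return { ok: true, result: handler(args) };
}
```

An explicit allow-list keeps the endpoint from becoming a general-purpose SQL gateway, which matters once the browser can reach DuckDB over HTTP.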
| Aspect | Impact |
|---|---|
| Initial load | <1s regardless of dataset size |
| Interaction | 50–200ms round trip per interaction |
| Writes | Simple — single source of truth on server |
| Browser memory ceiling | Irrelevant — data never enters the browser |
| Code change | Significant: custom VirtualServerHandler that talks to a new /perspective endpoint; server-side translator wiring |
| New runtime deps | DuckDB on server |
Right answer if: dataset will outgrow browser memory (10M+ rows) or multiple users need to see real-time shared state. Pays an interaction latency tax forever.
Note: Perspective-dev also ships a Python virtual_servers/duckdb.
If you're willing to add a Python sidecar, you may not need to write the
JS-side handler — just stand up the Python server. Significant infra
change for a Node-based app.
D. Hybrid — DuckDB-WASM read, pg write, server-pushed deltas
Same browser stack as B, but writes flow differently. After a forecast
operation, the server pushes back an Arrow batch of new rows (or a list of
pf_logids to delete for undo). The client applies it to DuckDB-WASM via
SQL and refreshes the Perspective view. No re-export of Parquet on every
write.
This is essentially B with the write-sync layer specified. It is split out because the write contract is the architectural decision worth making explicitly:
- Insert deltas: server returns new rows as Arrow IPC, client does INSERT INTO forecast SELECT * FROM arrow_view. Already trivial in DuckDB-WASM.
- Delete deltas: server returns {deleted_logid: N}, client does DELETE FROM forecast WHERE pf_logid = N.
- Replace deltas (e.g., note edits): if pf_note is joined at fetch time (current state after bf85f11), edits are invisible until refetch. Either accept that, or store note on the row and UPDATE.
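The three delta kinds above can be written down as a small translation from server message to local SQL. This is a sketch of one possible contract: only {deleted_logid: N} is a shape named above, while inserted_view, note_logid, and planDelta are hypothetical. It also assumes the Arrow batch has already been registered in DuckDB-WASM under the given view name, and that the note-update case applies only if pf_note is stored on the row.

```javascript
// Translates one server delta into a parameterized statement for the
// local DuckDB-WASM forecast table.
function planDelta(delta) {
  if ("deleted_logid" in delta) {
    return {
      sql: "DELETE FROM forecast WHERE pf_logid = ?",
      params: [delta.deleted_logid],
    };
  }
  if ("inserted_view" in delta) {
    // View names cannot be bound as parameters, so validate before
    // interpolating into the statement.
    if (!/^[A-Za-z_][A-Za-z0-9_]*$/.test(delta.inserted_view)) {
      throw new Error("invalid view name");
    }
    return {
      sql: `INSERT INTO forecast SELECT * FROM ${delta.inserted_view}`,
      params: [],
    };
  }
  if ("note_logid" in delta) {
    return {
      sql: "UPDATE forecast SET pf_note = ? WHERE pf_logid = ?",
      params: [delta.pf_note, delta.note_logid],
    };
  }
  throw new Error("unrecognized delta");
}
```

Keeping the contract this small means the same message could later be broadcast to other clients, which is relevant to the multi-user question below.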
This is the cleanest end state for a forecasting app: bulk read once, incremental sync after.
Comparison
| | Current | A: optimize | B/D: DuckDB-WASM | C: server DuckDB |
|---|---|---|---|---|
| Initial load (350k rows) | ~30s | ~5–10s | ~3–8s | <1s |
| Interaction latency | 0 | 0 | 0 | 50–200ms |
| Write feedback | instant | instant | instant (after sync) | instant |
| Undo cost | full reload | full reload (or fix) | local DELETE | server-side |
| Browser memory ceiling | ~1M rows | ~1M rows | 10M+ rows | none |
| New deps | — | — | DuckDB (server + WASM) | DuckDB (server) |
| Code change | — | medium | significant | significant |
| Risk surface | low | low | medium (write sync) | medium (translator wiring) |
Open questions to resolve before choosing
- Expected dataset size 12 months out. If it stays at ~350k–1M rows, option A is enough. If it goes to 5M+, A is dead in the water.
- Parquet caching strategy if going B/D. Re-export on every write is wasteful; delta replay is more code. Pick one explicitly before building.
- Multi-user scenarios. If two users edit the same version concurrently, options B/D need a mechanism for one user's writes to appear in another's local DuckDB-WASM. Option C gets this for free.
- Python-or-Node decision for server-side DuckDB. Perspective-dev's Python virtual server might let you skip writing a translator entirely — at the cost of a Python runtime alongside Node. Worth investigating before committing to a JS-side custom handler.
- Should the spec move? The spec mentions DuckDB only as a faster bulk-encode path (option A-ish, server-side). Options B/C/D are architectural shifts the spec doesn't contemplate. Whatever's chosen should be written into pf_spec.md so the reasoning isn't lost again.
Recommendation framing (not a decision)
- If the immediate problem is "30s loads feel bad": option A. It's the smallest change with the highest perceived impact and doesn't paint you into an architectural corner.
- If you're already planning for data growth: option D (DuckDB-WASM + delta sync). It's the right end state for a single-user-per-version forecasting tool with mid-to-large datasets.
- If multi-user real-time becomes a goal: option C. Pay the latency tax once and have a cleaner data model.
A reasonable phased path: do A first (fast, low risk, ships value this week), live with it while planning, then move to D when row counts demand it. C is a different shape and probably not warranted unless multi-user emerges as a requirement.