pf_app/pf_perspective_options.md
Paul Trowbridge 1a3209cbc2 Document Perspective architecture options
Captures current /data path (with bug history that forced single-batch
encoding), and four candidate redesigns: optimize the existing encoder,
DuckDB-WASM with Parquet, server-side DuckDB virtual server, and the
hybrid read-from-WASM/write-via-deltas variant. Each option weighed
against the forecasting write path, not just initial load. Intended as
a decision record so context survives a lost conversation.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-28 20:21:19 -04:00

12 KiB
Raw Permalink Blame History

Perspective Architecture Options

This document weighs how the Forecast view should source data for the Perspective pivot. The current implementation hits practical limits on initial load (~30s for 350k rows × ~55 cols), and growth is expected. Choosing an architecture now should account for both read (initial pivot load + interaction) and write (forecasting operations that mutate rows).


Current architecture

Data flow

  • Transport: GET /api/versions/:id/data returns the full forecast table as Apache Arrow IPC stream. Server-side: pg cursor (FETCH 10000) accumulates all rows, tableFromJSON builds an Arrow table, tableToIPC produces one record batch, response sent with Content-Length.
  • Joined columns: /data LEFT JOINs pf.log to surface pf_note (the user's note for the operation that produced each row) and pf_op (baseline/scale/recode/clone). Joined at fetch time so note edits are always live. (Added in bf85f11.)
  • Client: Streams the response body to a Uint8Array, hands it to Perspective's worker.table() (@perspective-dev/client@4.4.0 from CDN). Perspective's WASM engine owns the table in browser memory; all pivots/filters/group-bys run locally.
  • Progress UI: Forecast view reads the response body via response.body.getReader() and shows received-bytes / total-bytes while loading.
  • Forecasting writes:
    • scale/recode/clone POST → server INSERTs new rows with RETURNING * → client receives JSON rows → tableRef.current.update(rows) appends to Perspective's local table. Fast — no reload.
    • undo (DELETE) → server removes rows by pf_logid → client calls initViewer(...) which fully reloads the table.
    • baseline reload → currently also a full reload.

Why this specific shape (the bug history)

The current "accumulate all rows, emit one record batch" approach is not accidental. Two failure modes drove it:

  1. pg returns bigint (oid 20) and numeric (oid 1700) as JS strings by default. That made tableFromJSON infer Dictionary<Utf8> for ~50 of 55 columns. Fix in server.js: register type parsers that coerce both to Number so Arrow infers Int/Float64.
  2. Per-batch tableFromJSON creates independent dictionaries. When we streamed batches, the writer emitted ~1230 dictionary REPLACEMENT messages between batches. Perspective's WASM Arrow reader crashes on those (RuntimeError: memory access out of bounds). Fix: accumulate rows server-side, build one Arrow table, emit a single record batch. Reference comment lives in routes/operations.js near the cursor loop.

These two bugs explain the ~1015s server stall before the progress bar appears: the server can't send byte 1 until every row has been fetched, encoded, and the buffer is sized for Content-Length. Any redesign of the read path needs to either solve the dictionary-replacement issue (streaming with stable dictionary IDs declared up front) or replace the transport entirely (e.g., Parquet, server-side virtual table).

Implication for any redesign

The incremental update path (table.update(rows)) is what makes operations feel snappy today. Whatever architecture comes next, writes need to stay incremental — or get even cheaper. Undo's full reload is already a known wart.


The options

A. Stay client-side WASM; optimize the encode path

Keep the architecture. Replace the slow pieces.

  • Encode: drop tableFromJSON. Build Arrow vectors directly from cols_meta types (typed arrays for numerics, dictionary builders for strings). Eliminates per-row type inference.
  • Stream: declare schema up front, send dictionaries once, stream record batches as they come off the cursor. Progress bar starts within ~1s.
  • Trim: request-level ?cols= parameter so the server can return only the columns the active layout needs.
  • Writes: unchanged — table.update(rows) keeps working.
  • Undo: same path; same wart. Could be improved by surfacing a table.remove(pf_ids) instead of initViewer.
Aspect Impact
Initial load ~35× faster server encode + parallel transfer; bar appears in ~1s
Interaction Unchanged (already instant)
Writes Unchanged (already fast)
Browser memory ceiling Still limited by Perspective WASM (~12M rows is the rough wall)
Code change Medium: new builder code in routes/operations.js, schema declaration; UI mostly unchanged
New runtime deps None

Right answer if: dataset stays under ~1M rows and the goal is "make it faster without rearchitecting."


B. DuckDB-WASM in the browser (Parquet load + DuckDBHandler)

Replace the Arrow IPC payload with a Parquet file. Browser loads it into DuckDB-WASM. Perspective's DuckDBHandler (from @perspective-dev/client/dist/esm/virtual_servers/duckdb.js) backs the viewer — every pivot interaction becomes a SQL query against the local DuckDB-WASM instance. Perspective ships the view-config-to-SQL translator; no custom code there.

  • Initial transfer: Parquet for a forecast table is typically ~1030 MB for 350k rows (vs. ~80150 MB for Arrow IPC). Smaller download, no server-side tableFromJSON.
  • Encode: server-side. DuckDB on the server can COPY (SELECT ... FROM postgres_scan(...)) TO 'foo.parquet', or pre-stage Parquet on each forecast write. Either way, no Node-side Arrow encode.
  • Interaction: instant — local SQL on a columnar engine. No round trips.
  • Writes: this is the hard part. After a scale/recode/clone, the server has new rows in pg but DuckDB-WASM has a stale snapshot. Options:
    1. Server returns new rows as Arrow → client does INSERT INTO forecast SELECT * FROM arrow_view in DuckDB-WASM, then notifies the DuckDBHandler to refresh views.
    2. Re-export Parquet → re-fetch. Simple but wasteful for small incremental ops.
    3. Maintain a delta log → client replays inserts/deletes by pf_logid.
  • Undo: DELETE FROM forecast WHERE pf_logid = $1 against DuckDB-WASM, then refresh. Strictly faster than the current full reload.
Aspect Impact
Initial load Smaller payload + fast WASM ingest; likely 35× total
Interaction Instant (local SQL) — same as today
Writes New write-sync layer required (medium effort)
Browser memory ceiling DuckDB-WASM handles 10M+ rows comfortably
Code change Significant: new server route for Parquet, new client wiring, write-sync code
New runtime deps DuckDB on server (Node-API or shell), @duckdb/duckdb-wasm on client

Right answer if: dataset will grow past ~1M rows but you still want local interaction speed, and you're willing to write the write-sync layer.


C. Server-side DuckDB as a virtual server (no client load)

DuckDB lives on the Node server. Browser uses a VirtualServerHandler implementation that proxies Perspective's view requests (tableMakeView, viewGetData, viewGetMinMax, tableSchema) to a /perspective endpoint. Server runs SQL against DuckDB which queries pg directly via postgres_scanner, or against a Parquet copy.

  • Initial transfer: essentially zero. Schema + first viewport only.
  • Interaction: every drag/filter/group-by is a network round trip. 50200ms typical. Imperceptible for most operations; noticeable on rapid drag interactions.
  • Writes: simplest. Operations write to pg as today. DuckDB queries pg live (via postgres_scanner) so it always sees current state. No client-side state to sync.
  • Undo: same as writes — server state is the source of truth.
Aspect Impact
Initial load <1s regardless of dataset size
Interaction 50200ms round trip per interaction
Writes Simple — single source of truth on server
Browser memory ceiling Irrelevant — data never enters the browser
Code change Significant: custom VirtualServerHandler that talks to a new /perspective endpoint; server-side translator wiring
New runtime deps DuckDB on server

Right answer if: dataset will outgrow browser memory (10M+ rows) or multiple users need to see real-time shared state. Pays an interaction latency tax forever.

Note: Perspective-dev also ships a Python virtual_servers/duckdb. If you're willing to add a Python sidecar, you may not need to write the JS-side handler — just stand up the Python server. Significant infra change for a Node-based app.


D. Hybrid — DuckDB-WASM read, pg write, server-pushed deltas

Same browser stack as B, but writes flow differently. After a forecast operation, the server pushes back an Arrow batch of new rows (or a list of pf_logids to delete for undo). The client applies it to DuckDB-WASM via SQL and refreshes the Perspective view. No re-export of Parquet on every write.

This is essentially B with the write-sync layer specified. Splitting it out because the write contract is the architectural decision worth deciding explicitly:

  • Insert deltas: server returns new rows as Arrow IPC, client does INSERT INTO forecast SELECT * FROM arrow_view. Already trivial in DuckDB-WASM.
  • Delete deltas: server returns {deleted_logid: N}, client does DELETE FROM forecast WHERE pf_logid = N.
  • Replace deltas (e.g., note edits): if pf_note is joined at fetch time (current state after bf85f11), edits are invisible until refetch. Either accept that, or store note on the row and UPDATE.

This is the cleanest end state for a forecasting app: bulk read once, incremental sync after.


Comparison

Current A: optimize B/D: DuckDB-WASM C: server DuckDB
Initial load (350k rows) ~30s ~510s ~38s <1s
Interaction latency 0 0 0 50200ms
Write feedback instant instant instant (after sync) instant
Undo cost full reload full reload (or fix) local DELETE server-side
Browser memory ceiling ~1M rows ~1M rows 10M+ rows none
New deps DuckDB (server + WASM) DuckDB (server)
Code change medium significant significant
Risk surface low low medium (write sync) medium (translator wiring)

Open questions to resolve before choosing

  1. Expected dataset size 12 months out. If it stays at ~350k1M rows, option A is enough. If it goes to 5M+, A is dead in the water.
  2. Parquet caching strategy if going B/D. Re-export on every write is wasteful; delta replay is more code. Pick one explicitly before building.
  3. Multi-user scenarios. If two users edit the same version concurrently, options B/D need a mechanism for one user's writes to appear in another's local DuckDB-WASM. Option C gets this for free.
  4. Python-or-Node decision for server-side DuckDB. Perspective-dev's Python virtual server might let you skip writing a translator entirely — at the cost of a Python runtime alongside Node. Worth investigating before committing to a JS-side custom handler.
  5. Should the spec move? The spec mentions DuckDB only as a faster bulk-encode path (option A-ish, server-side). Options B/C/D are architectural shifts the spec doesn't contemplate. Whatever's chosen should be written into pf_spec.md so the reasoning isn't lost again.

Recommendation framing (not a decision)

  • If the immediate problem is "30s loads feel bad": option A. It's the smallest change with the highest perceived impact and doesn't paint you into an architectural corner.
  • If you're already planning for data growth: option D (DuckDB-WASM + delta sync). It's the right end state for a single-user-per-version forecasting tool with mid-to-large datasets.
  • If multi-user real-time becomes a goal: option C. Pay the latency tax once and have a cleaner data model.

A reasonable phased path: do A first (fast, low risk, ships value this week), live with it while planning, then move to D when row counts demand it. C is a different shape and probably not warranted unless multi-user emerges as a requirement.