Spec: add OR filter groups, raw_where escape hatch, and Arrow IPC streaming for large datasets

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-27 22:50:02 -04:00 · 2026-04-27 22:50:02 -04:00 · 11f5b02fc4
commit 11f5b02fc4
parent 4a4cb80189
1 changed files with 68 additions and 19 deletions
--- a/pf_spec.md
+++ b/pf_spec.md
@@ -225,8 +225,13 @@ Source registration, col_meta configuration, SQL generation, version creation, a
 {
  "date_offset":  "1 year",
  "filters": [
-    { "col": "order_date",   "op": "BETWEEN", "values": ["2024-01-01", "2024-12-31"] },
-    { "col": "order_status", "op": "IN",      "values": ["OPEN", "PENDING"] }
+    [
+      { "col": "order_date",   "op": "BETWEEN", "values": ["2024-01-01", "2024-12-31"] },
+      { "col": "order_status", "op": "IN",      "values": ["OPEN", "PENDING"] }
+    ],
+    [
+      { "col": "order_status", "op": "IS NULL" }
+    ]
  ],
  "pf_user":  "admin",
  "note":     "FY2024 actuals + open orders projected to FY2025",
@@ -234,12 +239,16 @@ Source registration, col_meta configuration, SQL generation, version creation, a
 }
 ```
 The example above generates: `(order_date BETWEEN '2024-01-01' AND '2024-12-31' AND order_status IN ('OPEN','PENDING')) OR (order_status IS NULL)`
 - `date_offset` — PostgreSQL interval string applied to the primary `role = 'date'` column at insert time. Examples: `"1 year"`, `"6 months"`, `"2 years 3 months"`. Defaults to `"0 days"`. Applied to the stored date value only — filter columns are never shifted.
- `filters` — one or more filter conditions defining what rows to pull from the source table. Period selection (date range, season, fiscal year, etc.) is expressed here as a regular filter — there is no separate date range parameter. Each condition has:
+- `filters` — an array of **groups**. Conditions within a group are AND-ed; groups are OR-ed together. Each group is an array of one or more condition objects:
  - `col` — must be `role = 'date'` or `role = 'filter'` in col_meta
  - `op` — one of `=`, `!=`, `IN`, `NOT IN`, `BETWEEN`, `IS NULL`, `IS NOT NULL`
  - `values` — array of strings; two elements for `BETWEEN`; multiple for `IN`/`NOT IN`; omitted for `IS NULL`/`IS NOT NULL`
- At least one filter is required.
+  - Backward compatibility: a flat array of condition objects (non-nested) is treated as a single group (all AND).
 - At least one group with at least one condition is required unless `raw_where` is supplied.
 - `raw_where` (optional string): used in place of `filters`; the value is injected verbatim as the WHERE clause body. **Admin-only**: rejected with `403` if the requesting `pf_user` is not in the admin list. Not validated against col_meta; the caller is responsible for correctness and SQL safety. Stored as-is in `pf.log.params` for audit. Mutually exclusive with `filters`: if both are present, the request is rejected with `400`.
 - Baseline loads are **additive** — existing `iter = 'baseline'` rows are not touched. Each load is its own log entry and is independently undoable.
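The `filters` / `raw_where` rules above could be checked up front along these lines. This is a minimal sketch, not the project's implementation: `BaselineRequest`, `validateWhereSource`, and the `admins` set are hypothetical names. The spec does not say which status applies when neither field is present, nor whether an empty `filters` array counts as "present"; the sketch assumes `400` and "absent" respectively.

```typescript
// Hypothetical request shape; only the fields relevant to WHERE-clause selection.
interface BaselineRequest {
  pf_user: string;
  filters?: unknown[];
  raw_where?: string;
}

type Validation =
  | { ok: true; mode: "filters" | "raw_where" }
  | { ok: false; status: 400 | 403; error: string };

function validateWhereSource(req: BaselineRequest, admins: Set<string>): Validation {
  // Assumption: an empty array or blank string is treated as "not supplied".
  const hasFilters = Array.isArray(req.filters) && req.filters.length > 0;
  const hasRaw = typeof req.raw_where === "string" && req.raw_where.trim() !== "";

  // Both present: rejected with 400, as the spec states.
  if (hasFilters && hasRaw) {
    return { ok: false, status: 400, error: "filters and raw_where cannot be combined" };
  }
  // raw_where is admin-only: 403 for anyone else.
  if (hasRaw) {
    return admins.has(req.pf_user)
      ? { ok: true, mode: "raw_where" }
      : { ok: false, status: 403, error: "raw_where is admin-only" };
  }
  // Assumption: a request with neither field is rejected with 400.
  if (!hasFilters) {
    return { ok: false, status: 400, error: "at least one filter group is required" };
  }
  return { ok: true, mode: "filters" };
}
```

The order of the checks (combination before permission) is a design choice made for the sketch; the spec does not fix which error wins when a non-admin sends both fields.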
 `replay` controls behavior when incremental rows exist (applies to Clear + reload, not individual segments):
@@ -257,9 +266,35 @@ Source registration, col_meta configuration, SQL generation, version creation, a
 | Method | Route | Description |
 |--------|-------|-------------|
-| GET | `/api/versions/:id/data` | Return all rows for this version (all iters including reference) |
+| GET | `/api/versions/:id/data` | Stream all rows for this version as an Arrow IPC binary |
-Returns flat array. Perspective pivot runs client-side on this data.
+**Transport format — Apache Arrow IPC stream**
 The endpoint returns `Content-Type: application/vnd.apache.arrow.stream` (binary). JSON is not used for this route. The client fetches the response as `arrayBuffer()` and passes it directly to `worker.table(buffer)` — Perspective's native ingestion path with no JS deserialization overhead.
 Arrow's columnar layout with dictionary encoding on string dimension columns keeps payload size manageable at scale (typically 50–150 MB for 1M rows depending on string cardinality), compared to several times that for equivalent JSON.
 **Server-side streaming (cursor-based)**
 For datasets that may reach 1M+ rows, the server must not buffer the full query result in memory before writing the response. Instead:
 1. Open a PostgreSQL cursor over the `SELECT * FROM {{fc_table}}` query
 2. Fetch rows in batches (target: 10 000 rows per batch)
 3. For each batch, append a serialized Arrow record batch to the HTTP response using chunked transfer encoding
 4. Close the cursor and end the response when all batches are written
 This means the first bytes of the Arrow stream reach the client while the server is still reading from the database, and Node.js heap stays bounded regardless of dataset size.
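The four steps above can be sketched without committing to a particular driver or Arrow library. Everything named here is hypothetical: `RowCursor` stands in for whatever cursor object the PostgreSQL client exposes, `encode` for an Arrow record-batch serializer, and `write` for a wrapper around the HTTP response that resolves once the socket can accept more data.

```typescript
// Hypothetical cursor shape: read(n) resolves to at most n rows, and to an
// empty array once the result set is exhausted.
interface RowCursor<Row> {
  read(maxRows: number): Promise<Row[]>;
  close(): Promise<void>;
}

// Yields one batch at a time, so at most `batchSize` rows are held in memory.
async function* readBatches<Row>(
  cursor: RowCursor<Row>,
  batchSize = 10000,
): AsyncGenerator<Row[]> {
  try {
    while (true) {
      const rows = await cursor.read(batchSize);
      if (rows.length === 0) return;
      yield rows;
    }
  } finally {
    // Runs on normal completion and when the consumer stops early.
    await cursor.close();
  }
}

// Encodes and writes each batch before fetching the next one.
async function streamBatches<Row>(
  cursor: RowCursor<Row>,
  encode: (rows: Row[]) => Uint8Array,
  write: (chunk: Uint8Array) => Promise<void>,
  batchSize = 10000,
): Promise<number> {
  let batches = 0;
  for await (const rows of readBatches(cursor, batchSize)) {
    await write(encode(rows));
    batches += 1;
  }
  return batches;
}
```

Awaiting `write` before reading the next batch is what keeps the heap bounded in this sketch: a slow client slows the database reads rather than letting encoded batches pile up in memory.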
 **Client-side loading**
 - **Moderate datasets (< ~500k rows):** accumulate the full `arrayBuffer()` then call `worker.table(buffer)` once. Perspective becomes interactive after the stream completes.
 - **Large datasets (≥ ~500k rows):** process Arrow record batches incrementally — call `worker.table(firstBatch)` to create the table, then `pspTable.update(batch)` for each subsequent batch. Perspective is interactive and browseable while remaining batches are still arriving.
 The client detects which path to use by checking the `X-Row-Count` response header (see below).
 **Row-count pre-check**
 Before opening the cursor, the server runs `SELECT COUNT(*) FROM {{fc_table}}`. The result is attached as the `X-Row-Count` response header so the client can choose its loading strategy. If the count is 500 000 or more, the UI displays a non-blocking notice ("Loading large dataset — pivot will become interactive as data arrives") rather than a blank screen.
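On the client, the choice between the two loading paths might look like the sketch below, using the ≥ 500 000 threshold from the client-side loading rules. `chooseLoadingPath` and `LoadingPath` are hypothetical names, and the fallback for a missing or malformed header is an assumption; the spec does not define that case.

```typescript
type LoadingPath = "single-buffer" | "incremental";

// Threshold taken from the spec's "≥ ~500k rows" rule.
const LARGE_DATASET_ROWS = 500000;

function chooseLoadingPath(rowCountHeader: string | null): LoadingPath {
  const n = rowCountHeader === null ? NaN : Number.parseInt(rowCountHeader, 10);
  // Assumption: without a usable count, take the incremental path, since it
  // never needs the whole payload in memory at once.
  if (!Number.isFinite(n) || n < 0) return "incremental";
  return n >= LARGE_DATASET_ROWS ? "incremental" : "single-buffer";
}
```

With `fetch`, the argument would typically come from `response.headers.get("X-Row-Count")`, which returns `null` when the header is absent.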
 ### Forecast Operations
@@ -408,24 +443,28 @@ A dedicated view for constructing the baseline for the selected version. The bas
 - **Description** — free text label stored as the log `note`, shown in the segments list
 - **Date offset** — years + months spinners; shifts the primary `role = 'date'` column forward on insert
- **Filters** — one or more filter conditions that define what rows to pull. There is no separate "date range" section — period selection is just a filter like any other:
+- **Filters** — one or more filter groups that define what rows to pull. Conditions within a group are AND-ed; groups are OR-ed. There is no separate "date range" section — period selection is just a filter like any other:
-  - Column — any `role = 'date'` or `role = 'filter'` column
+  - Each group has a header row ("Group 1", "Group 2 — OR", …) and a `+ Add condition` link
-  - Operator — `=`, `!=`, `IN`, `NOT IN`, `BETWEEN`, `IS NULL`, `IS NOT NULL`
+  - Within a group: Column (any `role = 'date'` or `role = 'filter'`), Operator (`=`, `!=`, `IN`, `NOT IN`, `BETWEEN`, `IS NULL`, `IS NOT NULL`), Value(s)
-  - Value(s) — for `BETWEEN`: two date/text inputs; for `IN`/`NOT IN`: comma-separated list; for `=`/`!=`: single input; omitted for `IS NULL`/`IS NOT NULL`
+  - Value inputs: `BETWEEN` → two date/text inputs; `IN`/`NOT IN` → comma-separated list; `=`/`!=` → single input; omitted for `IS NULL`/`IS NOT NULL`
-  - At least one filter is required to load a segment
+  - `+ Add OR group` button appends a new empty group below, joined by an "OR" separator label
- **Timeline preview** — rendered when any filter condition is a `BETWEEN` or `=` on a `role = 'date'` column. Shows a horizontal bar (number-line style) for the source period and, if offset > 0, a second bar below for the projected period. Each bar shows start date on the left, end date on the right, duration in the centre. The two bars share the same visual width so the shift is immediately apparent. For non-date filters (e.g. `season IN (...)`) no timeline is shown.
+  - Groups with more than one condition render an "AND" badge between rows to make the logic explicit
  - A group can be removed with `×` on its header (not available when only one group remains)
  - At least one group with at least one condition is required to load a segment
 - **Manual WHERE clause** (admin only) — a toggle link ("Switch to manual SQL") that replaces the filter builder with a plain textarea. The admin types a raw PostgreSQL WHERE clause body (no `WHERE` keyword). Switching back to the builder clears the textarea. When active, the filter builder is hidden and the structured `filters` field is not sent; `raw_where` is sent instead. A prominent warning banner reads: "Raw SQL is not validated. You are responsible for correctness and security."
 - **Timeline preview** — rendered when any condition in any group is a `BETWEEN` or `=` on a `role = 'date'` column. Shows a horizontal bar (number-line style) for the source period and, if offset > 0, a second bar below for the projected period. Each bar shows start date on the left, end date on the right, duration in the centre. The two bars share the same visual width so the shift is immediately apparent. Not shown in manual WHERE mode or when no date condition is present.
 - **Note** — optional free text
 - **Load Segment** — submits; appends rows, does not clear existing baseline rows
 **Example — three-segment baseline:**
-| # | Description | Filters | Offset |
+| # | Description | Filter logic | Offset |
-|---|-------------|---------|--------|
+|---|-------------|--------------|--------|
 | 1 | All orders taken 6/1/25–3/31/26 | `order_date BETWEEN 2025-06-01 AND 2026-03-31` | 0 |
-| 2 | All open/unshipped orders | `status IN (OPEN, PENDING)` | 0 |
+| 2 | Open or unshipped orders (status missing or explicit) | `(status IN ('OPEN','PENDING')) OR (status IS NULL)` | 0 |
-| 3 | Prior year book-and-ship 4/1/25–5/31/25 | `order_date BETWEEN 2025-04-01 AND 2025-05-31`, `ship_date BETWEEN 2025-04-01 AND 2025-05-31` | 0 |
+| 3 | Prior year book-and-ship 4/1/25–5/31/25 | `order_date BETWEEN 2025-04-01 AND 2025-05-31 AND ship_date BETWEEN 2025-04-01 AND 2025-05-31` | 0 |
-Note: segment 2 has no date filter — any filter combination is valid as long as at least one filter is present.
+Segment 2 uses two OR groups; segment 3 has two AND conditions in one group. Any combination is valid as long as at least one group with at least one condition is present.
 ### Forecast View
@@ -452,7 +491,15 @@ Note: segment 2 has no date filter — any filter combination is valid as long a
 └──────────────────────────────────────┴──────────────────────────┘
 ```
-**Pivot control:** [Perspective](https://perspective.finos.org/) 4.4.0, loaded from CDN at runtime. All rows from `GET /api/versions/:id/data` are loaded into an in-browser Perspective worker. Supports grouping, splitting, filtering, sorting, and charting interactively. Layout (group_by, split_by, filters, plugin) is saved per version to `localStorage` via Save layout / Reset layout buttons.
+**Pivot control:** [Perspective](https://perspective.finos.org/) 4.4.0, loaded from CDN at runtime. Data is fetched from `GET /api/versions/:id/data` as an Arrow IPC binary stream and loaded into an in-browser Perspective worker — Perspective's native ingestion path. Supports grouping, splitting, filtering, sorting, and charting interactively. Layout (group_by, split_by, filters, plugin) is saved per version to `localStorage` via Save layout / Reset layout buttons.
 **Large-dataset loading sequence:**
 1. Client issues `GET /api/versions/:id/data`
 2. Server responds with `X-Row-Count` header and begins streaming Arrow record batches
 3. If `X-Row-Count` ≥ 500 000, UI shows a non-blocking loading banner; otherwise no indicator
 4. Client calls `worker.table(firstBatch)` on the first batch to make the pivot interactive immediately
 5. Each subsequent batch is applied with `pspTable.update(batch)` as it arrives
 6. Banner clears when the stream closes
 **Interaction flow:**
 1. Click a cell or row in the pivot — the `perspective-click` event fires
@@ -502,7 +549,9 @@ Baseline loads are **additive** — no DELETE before INSERT. Each segment append
 Token details:
 - `{{date_offset}}` — PostgreSQL interval string (e.g. `1 year`); defaults to `0 days`; applied only to the primary `role = 'date'` column on insert
- `{{filter_clause}}` — one or more `AND` conditions built from the `filters` array at request time (not baked into stored SQL since conditions vary per segment). Each condition is validated against col_meta (column must be `role = 'date'` or `role = 'filter'`). Supported operators: `=`, `!=`, `IN`, `NOT IN`, `BETWEEN`, `IS NULL`, `IS NOT NULL`.
+- `{{filter_clause}}` — built from `filters` or `raw_where` at request time (not baked into stored SQL since conditions vary per segment).
  - Structured path (`filters`): each group becomes a parenthesized AND block; groups are joined with `OR`. Every column is validated against col_meta (`role = 'date'` or `role = 'filter'`). Values are escaped (single quotes doubled). Supported operators: `=`, `!=`, `IN`, `NOT IN`, `BETWEEN`, `IS NULL`, `IS NOT NULL`.
  - Raw path (`raw_where`): the string is injected verbatim. No col_meta validation. Admin-only.
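A minimal sketch of the structured path, assuming the condition shape from the request example (`col`, `op`, `values`). The names `buildFilterClause`, `renderCondition`, and `allowedCols` (the set of col_meta columns with `role = 'date'` or `role = 'filter'`) are hypothetical. Column names are emitted unquoted here because they are checked against that allow-list first.

```typescript
type Op = "=" | "!=" | "IN" | "NOT IN" | "BETWEEN" | "IS NULL" | "IS NOT NULL";
interface Condition { col: string; op: Op; values?: string[] }
type Filters = Condition[] | Condition[][];

// Escape a value by doubling single quotes, as the spec describes.
const quote = (v: string): string => `'${v.replace(/'/g, "''")}'`;

function renderCondition(c: Condition, allowedCols: Set<string>): string {
  if (!allowedCols.has(c.col)) throw new Error(`column not filterable: ${c.col}`);
  const vals = c.values ?? [];
  switch (c.op) {
    case "IS NULL":
    case "IS NOT NULL":
      return `${c.col} ${c.op}`;
    case "BETWEEN":
      if (vals.length !== 2) throw new Error("BETWEEN needs exactly two values");
      return `${c.col} BETWEEN ${quote(vals[0])} AND ${quote(vals[1])}`;
    case "IN":
    case "NOT IN":
      if (vals.length === 0) throw new Error(`${c.op} needs at least one value`);
      return `${c.col} ${c.op} (${vals.map(quote).join(",")})`;
    default: // "=" or "!="
      if (vals.length !== 1) throw new Error(`${c.op} needs exactly one value`);
      return `${c.col} ${c.op} ${quote(vals[0])}`;
  }
}

function buildFilterClause(filters: Filters, allowedCols: Set<string>): string {
  // Backward compatibility: a flat array of conditions is one AND group.
  const groups: Condition[][] =
    filters.length > 0 && !Array.isArray(filters[0])
      ? [filters as Condition[]]
      : (filters as Condition[][]);
  if (groups.length === 0 || groups.some((g) => g.length === 0)) {
    throw new Error("at least one group with at least one condition is required");
  }
  // Each group becomes a parenthesized AND block; groups are joined with OR.
  return groups
    .map((g) => `(${g.map((c) => renderCondition(c, allowedCols)).join(" AND ")})`)
    .join(" OR ");
}
```

Fed the two groups from the request example, this produces the clause quoted in the spec: `(order_date BETWEEN '2024-01-01' AND '2024-12-31' AND order_status IN ('OPEN','PENDING')) OR (order_status IS NULL)`.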
 ### Clear Baseline