Escalate — lazy-loaded L2 leaf
The cascade resolves most queries at L0/L1 for picojoules; see /exhibits/reason for the trace shape. When a query cannot be resolved at the spine, the page dynamic-imports a separate WASM bundle (the model tier) and pays its byte cost plus per-token inference. Both bundles, spine and model tier, are visible in the browser's Network panel.
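The escalation order described above can be sketched as a tiered lookup. This is a minimal illustration, not the site's implementation: the `Tier` enum, the `resolve` function and the three callbacks are hypothetical names, and only the "L0, then L1, then the L2 leaf" ordering comes from the text.

```rust
/// Which tier answered the query. Hypothetical type, for illustration only.
#[derive(Debug, PartialEq)]
pub enum Tier {
    L0,
    L1,
    L2,
}

/// Try each spine tier in order and escalate to the leaf only when both miss.
/// `l0` and `l1` stand in for the spine lookups; `l2` stands in for the
/// lazily loaded model tier and is invoked only on a spine miss.
pub fn resolve<F0, F1, F2>(query: &str, l0: F0, l1: F1, l2: F2) -> (Tier, String)
where
    F0: Fn(&str) -> Option<String>,
    F1: Fn(&str) -> Option<String>,
    F2: FnOnce(&str) -> String,
{
    if let Some(answer) = l0(query) {
        return (Tier::L0, answer);
    }
    if let Some(answer) = l1(query) {
        return (Tier::L1, answer);
    }
    // Spine miss: this is the point where the page would pay the leaf's
    // byte cost and per-token inference cost.
    (Tier::L2, l2(query))
}
```

Because `l2` is only called on the final branch, a query answered by the spine never touches the leaf, which is the property the lazy import is meant to preserve.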
Today this leaf ships a deterministic stub generator; the receipt math is parameterised for a 1-bit-class small local model (~280 MB on disk, ~10⁶ bit-ops per token). When real weights replace the stub, the cascade above doesn't change; only the leaf does.
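The receipt arithmetic for these parameters can be sketched as follows. The constants mirror the figures quoted above, with 280 MB taken as 280 MiB as in the leaf source; the function names are illustrative, and counting one bit-op per loaded weight bit follows the `projected_real_model_cost` function in the listing further down.

```rust
/// Target weight size quoted for the real model (280 MiB, as in the leaf source).
pub const TARGET_WEIGHTS_BYTES: u64 = 280 * 1024 * 1024;
/// Per-token cost model: 10^6 bit-ops per generated token.
pub const PER_TOKEN_BIT_OPS: u64 = 1_000_000;

/// Inference-only cost: tokens times per-token bit-ops.
pub fn inference_bit_ops(tokens: u32) -> u64 {
    u64::from(tokens) * PER_TOKEN_BIT_OPS
}

/// Projected total: every weight bit counted once for the load, plus inference.
pub fn projected_total_bit_ops(tokens: u32) -> u64 {
    TARGET_WEIGHTS_BYTES * 8 + inference_bit_ops(tokens)
}
```

For a 40-token response this gives 4 × 10⁷ inference bit-ops and about 2.39 × 10⁹ bit-ops in total, so under this model the weight load dominates the cost of short answers.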
| receipt field | value |
| --- | --- |
| spine bit-ops | — |
| L2 bundle bytes | — |
| L2 inference bit-ops | — |
| L2 wall-time | — |
| projected real model bit-ops | — |
| model id | — |
L2-leaf source: crates/mathground-l2-leaf/src/lib.rs
//! L2 leaf — the lazy-loaded model tier of the mathground cascade.
//!
//! Loaded only when the L0/L1 spine fails to resolve a query. Ships as a
//! separate WASM bundle so the page's first-paint cost is small (the L0/L1
//! spine alone) and visitors only pay the model-tier byte cost when an
//! exhibit actually escalates.
//!
//! Today this is a deterministic stub that hand-crafts responses to a
//! handful of canonical "escapes-the-spine" queries and falls back to a
//! template generator for everything else. The receipt math is
//! parameterised for a 1-bit-class small local model in the 1.7 B
//! parameter range — when real weights replace the stub, the receipt
//! continues to read in the same units; the cascade above doesn't change.
//!
//! The honest framing on the page: every visit measures the byte cost
//! of THIS bundle (the stub), so the receipt under-estimates a real
//! 1.7 B 1-bit model's load cost by the size ratio (~280 MB target /
//! stub bundle KB). The per-token math models inference at a 1-bit grain
//! — ~10⁶ bit-ops per token — which is the upper bound the stub commits
//! to honour.
use serde::{Deserialize, Serialize};
/// Target weight size for a 1-bit-class 1.7 B-parameter local model:
/// roughly 280 MB on disk. The receipt math scales the stub's measured
/// load cost by this factor to project the would-be cost of the real
/// model.
pub const TARGET_WEIGHTS_BYTES: u64 = 280 * 1024 * 1024;
/// Per-token cost model for a 1-bit-class 1.7 B model: roughly 10⁶
/// bit-ops per generated token at the 1-bit grain. Conservative upper
/// bound the stub commits to honour.
pub const PER_TOKEN_BIT_OPS: u64 = 1_000_000;
/// One generation outcome.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Generation {
    pub query: String,
    pub response: String,
    pub tokens: u32,
    /// Bit-ops attributed to running inference on the query.
    pub inference_bit_ops: u64,
    /// Wall-time the stub took to produce the response, in nanoseconds.
    pub wall_ns: f64,
    /// True when the response came from the curated table (a hand-crafted
    /// L2 answer exists for this query). False when the procedural
    /// fallback fired.
    pub from_curated_table: bool,
    /// Identifier for the model "running"; today always the stub. When
    /// real weights replace it the id reflects the model in use.
    pub model_id: &'static str,
}
/// Curated responses for a handful of canonical L2 escapes. These are
/// queries the spine intentionally can't catch, so the L2 leaf is the
/// only path. Hand-crafted today; replaced by real model output when
/// the weights ship.
const CURATED: &[(&str, &str)] = &[
    (
        "explain why we cannot resolve this at l0",
        "Resolution at L0 requires a closed-form path. Open-ended questions like this one have no closed-form. The cascade escalates so the model tier can synthesise; the receipt above shows the substrate paid for the escalation.",
    ),
    (
        "what is the meaning of life?",
        "L0/L1 has no fact-table entry for the meaning of life and no closed-form computation; substrate refuses to fabricate. The model tier can compose a candidate answer, but the substrate makes the cost visible: this is what the page just charged you.",
    ),
    (
        "describe quantum entanglement to a five-year-old",
        "Two particles can be linked in a way that whatever happens to one happens to the other instantly, even if they are very far apart. Scientists call that link 'entanglement', and it only works for very small particles — not for people or balls.",
    ),
    (
        "write a haiku about the landauer floor",
        "kT log two —\nan irreducible coin\nflipped to forget.",
    ),
    (
        "what does mathground.ai measure",
        "Every call on this page returns a joule receipt. The cascade resolves most queries at L0/L1 for picojoules; this answer cost ~10⁹ bit-ops because no spine tier matched and the L2 leaf had to load.",
    ),
];
/// Deterministic procedural fallback. Returns a templated response so
/// every query produces SOMETHING even when no curated entry matches.
fn template_response(query: &str) -> String {
    let words = query.split_whitespace().count();
    format!(
        "[stub L2 generation] Query of {words} words escaped the L0/L1 spine. A real model would compose a response here. This stub commits the receipt math the spine would charge against — load_bytes + tokens × per_token_bit_ops — so the page can show the cost without the answer."
    )
}
/// Time source. Native uses `std::time::Instant`; the wasm feature
/// switches to `performance.now()`.
#[cfg(all(feature = "wasm", target_arch = "wasm32"))]
fn now_ns() -> f64 {
    let perf = web_sys::window()
        .and_then(|w| w.performance())
        .expect("performance API");
    // performance.now() reports milliseconds; convert to nanoseconds.
    perf.now() * 1.0e6
}
#[cfg(not(all(feature = "wasm", target_arch = "wasm32")))]
fn now_ns() -> f64 {
    use std::sync::OnceLock;
    use std::time::Instant;
    static EPOCH: OnceLock<Instant> = OnceLock::new();
    let epoch = EPOCH.get_or_init(Instant::now);
    epoch.elapsed().as_nanos() as f64
}
/// Generate a response for `query`. Tries the curated table first; on
/// miss, falls back to a template. Either way returns a `Generation`
/// carrying the bit-op cost and wall-time.
pub fn generate(query: &str) -> Generation {
    let normalised = query.trim().to_lowercase();
    let t0 = now_ns();
    let (response, from_curated_table) = match CURATED
        .iter()
        .find(|(k, _)| normalised == *k)
    {
        Some((_, r)) => ((*r).to_string(), true),
        None => (template_response(query), false),
    };
    let wall_ns = now_ns() - t0;
    let tokens = response.split_whitespace().count() as u32;
    let inference_bit_ops = (tokens as u64) * PER_TOKEN_BIT_OPS;
    Generation {
        query: query.to_string(),
        response,
        tokens,
        inference_bit_ops,
        wall_ns,
        from_curated_table,
        model_id: "stub-placeholder · 1-bit-class · 1.7B projection",
    }
}
// ── Optional JS surface ──────────────────────────────────────────────
#[cfg(feature = "wasm")]
mod js_surface {
    use super::*;
    use wasm_bindgen::prelude::*;
    /// Generate + return the result to JS as a serializable object. The page
    /// dynamic-imports this whole crate's bundle before the first call; the
    /// loaded bytes are visible in the browser's Network panel.
    #[wasm_bindgen]
    pub fn generate(query: &str) -> JsValue {
        let g = super::generate(query);
        serde_wasm_bindgen::to_value(&g).unwrap_or(JsValue::NULL)
    }
    /// The estimated total cost of running THIS query at the target
    /// 1-bit-class 1.7 B size, exposed so the page can show the would-be
    /// cost of a real model alongside the stub's actual cost.
    #[wasm_bindgen]
    pub fn projected_real_model_cost(tokens: u32) -> JsValue {
        #[derive(Serialize)]
        struct Projection {
            tokens: u32,
            model_bytes: u64,
            per_token_bit_ops: u64,
            total_bit_ops: u64,
        }
        let p = Projection {
            tokens,
            model_bytes: TARGET_WEIGHTS_BYTES,
            per_token_bit_ops: PER_TOKEN_BIT_OPS,
            total_bit_ops: TARGET_WEIGHTS_BYTES * 8 + (tokens as u64) * PER_TOKEN_BIT_OPS,
        };
        serde_wasm_bindgen::to_value(&p).unwrap_or(JsValue::NULL)
    }
}
#[cfg(test)]
mod tests {
    use super::*;
    #[test]
    fn curated_query_hits_table() {
        let g = generate("What is the meaning of life?");
        assert!(g.from_curated_table);
        assert!(g.tokens > 0);
        assert!(g.inference_bit_ops > 0);
    }
    #[test]
    fn unknown_query_uses_template() {
        let g = generate("totally novel query phrasing nobody has used");
        assert!(!g.from_curated_table);
        assert!(g.response.contains("stub L2 generation"));
    }
    #[test]
    fn inference_cost_scales_with_tokens() {
        let g_short = generate("describe quantum entanglement to a five-year-old");
        let g_long = generate("what does mathground.ai measure");
        // Both curated; verify per-token math is linear.
        assert_eq!(
            g_short.inference_bit_ops,
            (g_short.tokens as u64) * PER_TOKEN_BIT_OPS
        );
        assert_eq!(
            g_long.inference_bit_ops,
            (g_long.tokens as u64) * PER_TOKEN_BIT_OPS
        );
    }
    #[test]
    fn deterministic_replay() {
        let a = generate("write a haiku about the landauer floor");
        let b = generate("write a haiku about the landauer floor");
        assert_eq!(a.response, b.response);
        assert_eq!(a.tokens, b.tokens);
        assert_eq!(a.inference_bit_ops, b.inference_bit_ops);
        assert_eq!(a.from_curated_table, b.from_curated_table);
    }
}
This file compiles to mathground_l2_leaf_bg.wasm
(~41 KB after wasm-opt -Oz). The page dynamic-imports it
only when the cascade fails to resolve at L0/L1.
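To put the ~41 KB figure next to the 280 MB target named in the source comments, the factor by which the stub's measured load under-estimates a real model's load can be computed directly. This is a rough sketch: `STUB_BUNDLE_BYTES` assumes that "41 KB" means 41 × 1024 bytes, and the helper name is hypothetical.

```rust
/// Target weight size from the leaf source (280 MiB).
pub const TARGET_WEIGHTS_BYTES: u64 = 280 * 1024 * 1024;
/// Approximate stub bundle size; assumes "41 KB" means 41 * 1024 bytes.
pub const STUB_BUNDLE_BYTES: u64 = 41 * 1024;

/// How many times larger the real model's download would be than
/// `measured_bytes`. Returns infinity if `measured_bytes` is zero.
pub fn load_underestimate_factor(measured_bytes: u64) -> f64 {
    TARGET_WEIGHTS_BYTES as f64 / measured_bytes as f64
}
```

Under these assumptions the factor comes out at roughly 7,000, so the byte cost measured for the stub sits nearly four orders of magnitude below the projected load of real weights.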
About this exhibit
The cascade catches almost every query at the deterministic spine for picojoules. When escalation IS required, the substrate asks a different question than "how big a model can fit?" — it asks "what is the smallest local model that does the job?". The deeper the spine, the smaller the leaf needs to be.
The page's first paint pays only for the spine. Only queries that escape the spine pay for the leaf, and the leaf size is bounded by how deep the spine is.
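The trade-off can be made concrete with an expected-cost calculation. This is an illustrative model rather than something the exhibit computes: the function name, the hit-rate parameter and every number in the note below are hypothetical, and the model simplifies by charging an escalated query only the leaf cost.

```rust
/// Expected bit-ops per query when a fraction `spine_hit_rate` of queries
/// resolves at the spine and the rest escalate to the leaf.
/// Returns None when the hit rate is outside [0, 1] (including NaN).
pub fn expected_bit_ops_per_query(
    spine_hit_rate: f64,
    spine_bit_ops: f64,
    leaf_bit_ops: f64,
) -> Option<f64> {
    if !(0.0..=1.0).contains(&spine_hit_rate) {
        return None;
    }
    Some(spine_hit_rate * spine_bit_ops + (1.0 - spine_hit_rate) * leaf_bit_ops)
}
```

With a hypothetical 99% spine hit rate, 10³ bit-ops per spine answer and 10⁹ bit-ops per leaf answer, the expected cost is about 10⁷ bit-ops per query, almost all of it from the 1% that escalate. Raising the hit rate or shrinking the leaf both reduce that dominant term.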