Splat world — 3-D Gaussians

A 32,768-splat procedural galaxy rendered through a full WebGPU pipeline: PROJECT_WGSL (per-splat covariance → screen-space ellipse) → a 6-stage radix sort on 16-bit depth keys → instanced triangle-strip quads with a Gaussian-falloff fragment shader in premultiplied alpha. The receipt below carries a GaussianSplat artifact and reports joules-per-splat alongside the usual impedance term μ. Upload an image to emit a second receipt for the substrate cost of authoring a new splat shell.
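As a rough CPU-side illustration of the sort stage, the sketch below runs a single-pass counting sort over 16-bit keys, staged in the same order as the six GPU passes named in render.rs (clear, histogram, local scan, global scan, broadcast, scatter). It is not the shader code: the function name `sort_by_depth_key`, the `(key, splat_index)` tuple layout, and the ascending order are assumptions made for illustration, and the real WGSL kernels may differ in key packing and ordering.

```rust
/// Hypothetical CPU reference for a single-pass counting sort on 16-bit keys.
/// Each entry is `(key, splat_index)`; only the low 16 bits of `key` are used.
const BUCKETS: usize = 65_536; // one bucket per possible 16-bit key
const BLOCK: usize = 256; // buckets per block, so 256 blocks in total

fn sort_by_depth_key(keys_in: &[(u32, u32)]) -> Vec<(u32, u32)> {
    // 1. CLEAR: start from a zeroed histogram.
    let mut hist = vec![0u32; BUCKETS];

    // 2. HISTOGRAM: count how many entries fall into each bucket.
    for &(key, _) in keys_in {
        hist[(key & 0xFFFF) as usize] += 1;
    }

    // 3. SCAN_LOCAL: exclusive prefix sum inside each block of 256 buckets,
    //    recording each block's total count.
    let mut block_totals = vec![0u32; BUCKETS / BLOCK];
    for (b, block) in hist.chunks_mut(BLOCK).enumerate() {
        let mut running = 0u32;
        for slot in block.iter_mut() {
            let count = *slot;
            *slot = running;
            running += count;
        }
        block_totals[b] = running;
    }

    // 4. SCAN_GLOBAL: exclusive prefix sum over the 256 block totals.
    let mut running = 0u32;
    for total in block_totals.iter_mut() {
        let count = *total;
        *total = running;
        running += count;
    }

    // 5. SCAN_BROADCAST: add each block's base offset to its buckets, so
    //    hist[k] becomes the first output slot for key k.
    for (b, block) in hist.chunks_mut(BLOCK).enumerate() {
        for slot in block.iter_mut() {
            *slot += block_totals[b];
        }
    }

    // 6. SCATTER: each entry claims the next free slot of its bucket.
    let mut keys_out = vec![(0u32, 0u32); keys_in.len()];
    for &(key, idx) in keys_in {
        let bucket = (key & 0xFFFF) as usize;
        let dst = hist[bucket] as usize;
        hist[bucket] += 1;
        keys_out[dst] = (key, idx);
    }
    keys_out
}
```

Calling it on `[(3, 0), (1, 1), (65_535, 2), (1, 3), (0, 4)]` yields the entries ordered by key: `(0, 4), (1, 1), (1, 3), (3, 0), (65_535, 2)`. Unlike this sequential loop, an atomic scatter on the GPU need not keep equal keys in their input order.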
Kernel that wrote this receipt: crates/mathground-splat/src/render.rs
//! WebGPU pipeline for splat rendering.
//!
//! Driven via `js_sys::Reflect` instead of the `wgpu` Rust crate. This keeps
//! the WASM bundle slim and matches the access pattern used by
//! `lux-worlds-web`.
//!
//! Stages (filled in incrementally — see plan P1.3..P1.7):
//!   1. Device acquire (adapter → device → queue → canvas context configure)  ← P1.3 ✅
//!   2. Buffer upload (`SplatCloud::to_gpu_buffer()` → storage buffer)         ← P1.4 ✅
//!   3. Project compute pipeline (`PROJECT_WGSL`)                              ← P1.5 ✅
//!   4. Radix sort (`SORT_CLEAR/HISTOGRAM/SCAN_*/SCATTER_WGSL`)                ← P1.6 ✅
//!   5. Paint pipeline (instanced-quad Gaussian falloff render)                ← P1.7 ✅

use glam::{Mat4, Vec3, Vec4};
use js_sys::{Array, Float32Array, Object, Uint32Array, Uint8Array};
use lux_worlds_splat::{
    shaders::{
        PROJECT_WGSL, SORT_CLEAR_WGSL, SORT_HISTOGRAM_WGSL, SORT_SCAN_BROADCAST_WGSL,
        SORT_SCAN_GLOBAL_WGSL, SORT_SCAN_LOCAL_WGSL, SORT_SCATTER_WGSL,
    },
    Splat, SplatCloud,
};
use mgai_meter_web::now_ns;
use wasm_bindgen::prelude::*;
use wasm_bindgen::JsCast;

/// Clear color when no splats have rendered yet: the same desaturated indigo
/// the mathground exhibits use as their canvas backdrop.
const CLEAR_RGBA: [f64; 4] = [0.031, 0.047, 0.078, 1.0];

/// WebGPU `GPUBufferUsage` bit flags.
const USAGE_STORAGE_COPY_DST: f64 = 128.0 + 8.0; // STORAGE | COPY_DST = 136
const USAGE_STORAGE_RW: f64 = 128.0; // STORAGE
const USAGE_UNIFORM_COPY_DST: f64 = 64.0 + 8.0; // UNIFORM | COPY_DST = 72

/// Project-pass uniform layout, std140-padded to 256 bytes.
///
/// Layout (matches `PROJECT_WGSL::Uniforms`):
///   view        offset 0   size 64
///   proj        offset 64  size 64
///   view_proj   offset 128 size 64
///   cam_pos     offset 192 size 12
///   time        offset 204 size 4   (packs into vec3's 4-byte tail)
///   viewport    offset 208 size 8
///   _pad1       offset 216 size 8
///   anchor      offset 224 size 16
///   (padding to 256)
const UNIFORM_BYTES: usize = 256;

/// Approximate bit-ops per splat for the project pass: 3 mat4 multiplies,
/// 2 mat3 builds, a 2×2 eigendecomposition, a normalisation, a sqrt and a
/// handful of branches. At f32 grain (~32 bits per scalar op) ≈ 1024.
const PROJECT_BIT_OPS_PER_SPLAT: u64 = 1024;

/// Sort-stage fixed bit-op cost (independent of splat count): histogram
/// clear + workgroup-local scan + global scan + broadcast. ≈ 21M.
const SORT_FIXED_BIT_OPS: u64 = 21_000_000;

/// Per-splat bit-ops added on top of the fixed sort cost: 32 bits for the
/// histogram atomicAdd + 64 bits for the 8-byte SCATTER write.
const SORT_BIT_OPS_PER_SPLAT: u64 = 96;

/// `SortParams` uniform = `count: u32 + 3 × _pad: u32` = 16 bytes.
const SORT_PARAMS_BYTES: usize = 16;

/// Histogram bucket count from the shader header: 65536 atomic<u32> = 256 KB.
const HISTOGRAM_BUCKETS: u64 = 65_536;

/// 256 workgroup totals = 1 KB.
const WG_TOTALS_LEN: u64 = 256;

/// Approximate bit-ops per splat for the paint pass (vertex shader for the
/// 4-vertex quad + fragment shader over the covered pixels with discards
/// outside 3σ). Empirical conservative bound at typical demo splat sizes.
const PAINT_BIT_OPS_PER_SPLAT: u64 = 32_768;

/// `PaintUniforms` = `viewport: vec2<f32> + _pad: vec2<f32>` = 16 bytes.
const PAINT_UNIFORM_BYTES: usize = 16;

/// Screen-space ellipse render shader. Consumes the `PROJECT_WGSL` output
/// (`ProjectedSplat` array indexed via the sorted `keys_out`), draws one
/// 4-vertex triangle-strip quad per visible splat, paints a Gaussian falloff
/// in premultiplied alpha. Quad corners are scaled into σ-units so the
/// fragment's `r² = dot(local, local)` is the Mahalanobis distance squared
/// (the eigenbasis was already applied in the vertex shader).
const PAINT_WGSL: &str = r#"
struct ProjectedSplat {
    ndc_x: f32, ndc_y: f32, depth: f32,
    axis_a: f32, axis_b: f32, angle: f32,
    col_r: f32, col_g: f32, col_b: f32, opacity: f32,
    _pad0: f32, _pad1: f32,
};

struct PaintUniforms {
    viewport: vec2<f32>,
    _pad: vec2<f32>,
};

@group(0) @binding(0) var<storage, read> projected: array<ProjectedSplat>;
@group(0) @binding(1) var<storage, read> keys_out: array<vec2<u32>>;
@group(0) @binding(2) var<uniform> u: PaintUniforms;

struct VertexOutput {
    @builtin(position) pos: vec4<f32>,
    @location(0) color: vec3<f32>,
    @location(1) opacity: f32,
    @location(2) local: vec2<f32>,
};

@vertex
fn vs_main(
    @builtin(vertex_index) vid: u32,
    @builtin(instance_index) iid: u32,
) -> VertexOutput {
    var quad: array<vec2<f32>, 4> = array<vec2<f32>, 4>(
        vec2<f32>(-1.0, -1.0),
        vec2<f32>( 1.0, -1.0),
        vec2<f32>(-1.0,  1.0),
        vec2<f32>( 1.0,  1.0),
    );

    var out: VertexOutput;
    let splat_idx = keys_out[iid].y;
    let s = projected[splat_idx];

    if (s.opacity <= 0.0) {
        // Cull (PROJECT marked it invalid).
        out.pos = vec4<f32>(2.0, 2.0, 2.0, 1.0);
        out.color = vec3<f32>(0.0);
        out.opacity = 0.0;
        out.local = vec2<f32>(0.0);
        return out;
    }

    let corner = quad[vid];
    let c = cos(s.angle);
    let sn = sin(s.angle);
    let major = vec2<f32>(c, sn) * s.axis_a * 3.0;
    let minor = vec2<f32>(-sn, c) * s.axis_b * 3.0;
    let pixel_offset = corner.x * major + corner.y * minor;
    let ndc_offset = pixel_offset / u.viewport * 2.0;
    let ndc = vec2<f32>(s.ndc_x, s.ndc_y) + ndc_offset;

    out.pos = vec4<f32>(ndc, s.depth, 1.0);
    out.color = vec3<f32>(s.col_r, s.col_g, s.col_b);
    out.opacity = s.opacity;
    out.local = corner * 3.0;
    return out;
}

@fragment
fn fs_main(in: VertexOutput) -> @location(0) vec4<f32> {
    let r2 = dot(in.local, in.local);
    if (r2 > 9.0) { discard; }
    let alpha = in.opacity * exp(-0.5 * r2);
    if (alpha < 1.0 / 255.0) { discard; }
    return vec4<f32>(in.color * alpha, alpha);
}
"#;

// ── tiny js_sys::Reflect helpers (same shape as lux-worlds-web's) ───────

fn js_set(obj: &JsValue, key: &str, val: &JsValue) {
    let _ = js_sys::Reflect::set(obj, &JsValue::from_str(key), val);
}

fn js_get(obj: &JsValue, key: &str) -> JsValue {
    js_sys::Reflect::get(obj, &JsValue::from_str(key)).unwrap_or(JsValue::UNDEFINED)
}

fn js_call(obj: &JsValue, method: &str, args: &[JsValue]) -> Result<JsValue, JsValue> {
    let func: js_sys::Function = js_get(obj, method).unchecked_into();
    let arr = Array::new();
    for a in args {
        arr.push(a);
    }
    js_sys::Reflect::apply(&func, obj, &arr)
}

fn write_bytes_to_buffer(queue: &JsValue, buffer: &JsValue, bytes: &[u8]) {
    let arr = Uint8Array::from(bytes);
    let _ = js_call(
        queue,
        "writeBuffer",
        &[
            buffer.clone(),
            JsValue::from_f64(0.0),
            arr.buffer().into(),
            arr.byte_offset().into(),
            arr.byte_length().into(),
        ],
    );
}

fn create_buffer(device: &JsValue, size: u64, usage: f64, label: &str) -> JsValue {
    let desc = Object::new();
    js_set(&desc, "size", &JsValue::from_f64(size as f64));
    js_set(&desc, "usage", &JsValue::from_f64(usage));
    js_set(&desc, "label", &JsValue::from_str(label));
    js_call(device, "createBuffer", &[desc.into()]).unwrap_or(JsValue::NULL)
}

fn create_shader_module(device: &JsValue, code: &str, label: &str) -> JsValue {
    let desc = Object::new();
    js_set(&desc, "code", &JsValue::from_str(code));
    js_set(&desc, "label", &JsValue::from_str(label));
    js_call(device, "createShaderModule", &[desc.into()]).unwrap_or(JsValue::NULL)
}

fn create_compute_pipeline(device: &JsValue, module: &JsValue, label: &str) -> JsValue {
    let compute = Object::new();
    js_set(&compute, "module", module);
    js_set(&compute, "entryPoint", &JsValue::from_str("main"));
    let desc = Object::new();
    js_set(&desc, "label", &JsValue::from_str(label));
    js_set(&desc, "layout", &JsValue::from_str("auto"));
    js_set(&desc, "compute", &compute.into());
    js_call(device, "createComputePipeline", &[desc.into()]).unwrap_or(JsValue::NULL)
}

fn buffer_resource(buffer: &JsValue) -> JsValue {
    let r = Object::new();
    js_set(&r, "buffer", buffer);
    r.into()
}

fn bind_entry(binding: u32, resource: JsValue) -> JsValue {
    let e = Object::new();
    js_set(&e, "binding", &JsValue::from_f64(binding as f64));
    js_set(&e, "resource", &resource);
    e.into()
}

// ── Camera ──────────────────────────────────────────────────────────────

/// Slow orbital camera centred on the splat cloud's bounds. P1.5 only needs
/// enough motion to make the per-frame project pass non-degenerate; once
/// paint lands in P1.7 the page will swap in proper navigation.
pub struct Camera {
    pub target: Vec3,
    pub distance: f32,
    pub yaw: f32,
    pub pitch: f32,
    pub fov: f32,
    pub aspect: f32,
}

impl Camera {
    pub fn orbital(target: Vec3, distance: f32, aspect: f32) -> Self {
        Self {
            target,
            distance,
            yaw: 0.0,
            pitch: -0.2,
            fov: 0.9,
            aspect,
        }
    }

    pub fn eye(&self) -> Vec3 {
        let xz = self.distance * self.pitch.cos();
        Vec3::new(
            self.target.x + xz * self.yaw.cos(),
            self.target.y + self.distance * self.pitch.sin(),
            self.target.z + xz * self.yaw.sin(),
        )
    }

    pub fn view(&self) -> Mat4 {
        Mat4::look_at_rh(self.eye(), self.target, Vec3::Y)
    }

    pub fn proj(&self) -> Mat4 {
        Mat4::perspective_rh(self.fov, self.aspect.max(1e-3), 0.05, 1000.0)
    }
}

/// Pack the project-pass uniform into `UNIFORM_BYTES` of std140 layout.
fn pack_uniform(cam: &Camera, time_s: f32, viewport: [f32; 2], anchor: Vec4) -> [u8; UNIFORM_BYTES] {
    let mut out = [0u8; UNIFORM_BYTES];
    let v = cam.view();
    let p = cam.proj();
    let vp = p * v;

    let write_mat = |out: &mut [u8], offset: usize, m: &Mat4| {
        let cols: [[f32; 4]; 4] = m.to_cols_array_2d();
        for (i, col) in cols.iter().enumerate() {
            for (j, x) in col.iter().enumerate() {
                let o = offset + (i * 4 + j) * 4;
                out[o..o + 4].copy_from_slice(&x.to_le_bytes());
            }
        }
    };
    write_mat(&mut out, 0, &v);
    write_mat(&mut out, 64, &p);
    write_mat(&mut out, 128, &vp);

    let eye = cam.eye();
    out[192..196].copy_from_slice(&eye.x.to_le_bytes());
    out[196..200].copy_from_slice(&eye.y.to_le_bytes());
    out[200..204].copy_from_slice(&eye.z.to_le_bytes());
    out[204..208].copy_from_slice(&time_s.to_le_bytes());

    out[208..212].copy_from_slice(&viewport[0].to_le_bytes());
    out[212..216].copy_from_slice(&viewport[1].to_le_bytes());

    out[224..228].copy_from_slice(&anchor.x.to_le_bytes());
    out[228..232].copy_from_slice(&anchor.y.to_le_bytes());
    out[232..236].copy_from_slice(&anchor.z.to_le_bytes());
    out[236..240].copy_from_slice(&anchor.w.to_le_bytes());

    out
}

// ── One acquired WebGPU context ─────────────────────────────────────────

pub struct GpuContext {
    pub device: JsValue,
    pub queue: JsValue,
    pub context: JsValue,
    pub format: String,
    pub width: u32,
    pub height: u32,
}

impl GpuContext {
    pub async fn acquire(canvas_id: &str) -> Result<Self, JsValue> {
        let global = js_sys::global();
        let navigator = js_get(&global, "navigator");
        let gpu_obj = js_get(&navigator, "gpu");
        if gpu_obj.is_undefined() || gpu_obj.is_null() {
            return Err(JsValue::from_str("WebGPU not supported"));
        }

        let adapter_opts = Object::new();
        js_set(
            &adapter_opts,
            "powerPreference",
            &JsValue::from_str("high-performance"),
        );
        let adapter_promise = js_call(&gpu_obj, "requestAdapter", &[adapter_opts.into()])?;
        let adapter =
            wasm_bindgen_futures::JsFuture::from(js_sys::Promise::from(adapter_promise)).await?;
        if adapter.is_null() || adapter.is_undefined() {
            return Err(JsValue::from_str("WebGPU requestAdapter returned null"));
        }

        let limits = js_get(&adapter, "limits");
        let max_storage_buf = js_get(&limits, "maxStorageBufferBindingSize")
            .as_f64()
            .unwrap_or(134_217_728.0);
        let max_buffer = js_get(&limits, "maxBufferSize")
            .as_f64()
            .unwrap_or(268_435_456.0);
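        // Fallback is 8, the WebGPU default for this limit, so the request
        // below cannot exceed what a spec-conformant adapter guarantees.
        let max_storage_per_stage = js_get(&limits, "maxStorageBuffersPerShaderStage")
            .as_f64()
            .unwrap_or(8.0);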

        let required_limits = Object::new();
        js_set(
            &required_limits,
            "maxStorageBufferBindingSize",
            &JsValue::from_f64(max_storage_buf),
        );
        js_set(
            &required_limits,
            "maxBufferSize",
            &JsValue::from_f64(max_buffer),
        );
        js_set(
            &required_limits,
            "maxStorageBuffersPerShaderStage",
            &JsValue::from_f64(max_storage_per_stage),
        );

        let device_desc = Object::new();
        js_set(&device_desc, "requiredLimits", &required_limits.into());
        let device_promise = js_call(&adapter, "requestDevice", &[device_desc.into()])?;
        let device =
            wasm_bindgen_futures::JsFuture::from(js_sys::Promise::from(device_promise)).await?;
        let queue = js_get(&device, "queue");

        let document = js_get(&global, "document");
        let canvas = js_call(
            &document,
            "getElementById",
            &[JsValue::from_str(canvas_id)],
        )?;
        if canvas.is_null() || canvas.is_undefined() {
            return Err(JsValue::from_str(&format!(
                "canvas #{canvas_id} not found in document"
            )));
        }
        let context = js_call(&canvas, "getContext", &[JsValue::from_str("webgpu")])?;
        if context.is_null() || context.is_undefined() {
            return Err(JsValue::from_str(
                "canvas.getContext('webgpu') returned null",
            ));
        }

        let format = js_call(&gpu_obj, "getPreferredCanvasFormat", &[])?
            .as_string()
            .unwrap_or_else(|| "bgra8unorm".to_string());

        let config = Object::new();
        js_set(&config, "device", &device);
        js_set(&config, "format", &JsValue::from_str(&format));
        js_set(&config, "alphaMode", &JsValue::from_str("opaque"));
        js_call(&context, "configure", &[config.into()])?;

        let width = js_get(&canvas, "width").as_f64().unwrap_or(800.0) as u32;
        let height = js_get(&canvas, "height").as_f64().unwrap_or(600.0) as u32;

        Ok(Self {
            device,
            queue,
            context,
            format,
            width,
            height,
        })
    }

    /// Frame render: clear + (optional) splat draw in a single render pass.
    /// Returns timing + bit-ops for the entire surface frame: a clear cost
    /// (`width × height × 32`) and, when a `Paint` pipeline is supplied,
    /// the per-splat paint cost (`N × PAINT_BIT_OPS_PER_SPLAT`).
    pub fn render_frame(&self, paint: Option<&Paint>) -> StageTiming {
        let t0 = now_ns();

        let surface_tex = js_call(&self.context, "getCurrentTexture", &[]).unwrap_or(JsValue::NULL);
        let surface_view = js_call(&surface_tex, "createView", &[]).unwrap_or(JsValue::NULL);

        let encoder_desc = Object::new();
        let encoder = js_call(
            &self.device,
            "createCommandEncoder",
            &[encoder_desc.into()],
        )
        .unwrap_or(JsValue::NULL);

        let color_att = Object::new();
        js_set(&color_att, "view", &surface_view);
        let clear_val = Array::of4(
            &JsValue::from_f64(CLEAR_RGBA[0]),
            &JsValue::from_f64(CLEAR_RGBA[1]),
            &JsValue::from_f64(CLEAR_RGBA[2]),
            &JsValue::from_f64(CLEAR_RGBA[3]),
        );
        js_set(&color_att, "clearValue", &clear_val.into());
        js_set(&color_att, "loadOp", &JsValue::from_str("clear"));
        js_set(&color_att, "storeOp", &JsValue::from_str("store"));

        let pass_desc = Object::new();
        js_set(
            &pass_desc,
            "colorAttachments",
            &Array::of1(&color_att.into()).into(),
        );

        let pass = js_call(&encoder, "beginRenderPass", &[pass_desc.into()])
            .unwrap_or(JsValue::NULL);

        let mut paint_bit_ops: u64 = 0;
        if let Some(p) = paint {
            js_call(&pass, "setPipeline", &[p.pipeline.clone()]).ok();
            js_call(
                &pass,
                "setBindGroup",
                &[JsValue::from_f64(0.0), p.bind_group.clone()],
            )
            .ok();
            js_call(
                &pass,
                "draw",
                &[JsValue::from_f64(4.0), JsValue::from_f64(p.instance_count as f64)],
            )
            .ok();
            paint_bit_ops = (p.instance_count as u64).saturating_mul(PAINT_BIT_OPS_PER_SPLAT);
        }

        js_call(&pass, "end", &[]).ok();

        let cmd = js_call(&encoder, "finish", &[]).unwrap_or(JsValue::NULL);
        js_call(&self.queue, "submit", &[Array::of1(&cmd).into()]).ok();

        let wall_ns = now_ns() - t0;
        let clear_bit_ops = (self.width as u64) * (self.height as u64) * 32;
        StageTiming {
            wall_ns,
            bit_ops: clear_bit_ops.saturating_add(paint_bit_ops),
        }
    }
}

// ── Splat buffer (P1.4) ─────────────────────────────────────────────────

pub struct SplatBuffer {
    pub handle: JsValue,
    pub gpu_bytes: u64,
    pub splat_count: u64,
}

impl SplatBuffer {
    pub fn upload(gpu: &GpuContext, cloud: &SplatCloud) -> (Self, f64) {
        let splat_count = cloud.splats.len() as u64;
        let gpu_bytes = splat_count.saturating_mul(Splat::GPU_SIZE as u64);

        let t0 = now_ns();
        let flat = cloud.to_gpu_buffer();
        let arr = Float32Array::from(&flat[..]);

        let handle = create_buffer(
            &gpu.device,
            gpu_bytes,
            USAGE_STORAGE_COPY_DST,
            "mathground-splat:splats",
        );
        let _ = js_call(
            &gpu.queue,
            "writeBuffer",
            &[
                handle.clone(),
                JsValue::from_f64(0.0),
                arr.buffer().into(),
                arr.byte_offset().into(),
                arr.byte_length().into(),
            ],
        );
        let wall_ns = now_ns() - t0;

        (
            Self {
                handle,
                gpu_bytes,
                splat_count,
            },
            wall_ns,
        )
    }
}

// ── Projection pass (P1.5) ──────────────────────────────────────────────

pub struct Projection {
    pub pipeline: JsValue,
    pub bind_group: JsValue,
    pub uniform_buf: JsValue,
    pub projected_buf: JsValue,
    pub sort_keys_buf: JsValue,
    pub sh1_buf: JsValue,
    pub workgroups: u32,
}

impl Projection {
    pub fn new(gpu: &GpuContext, splat_buf: &SplatBuffer) -> Self {
        let n = splat_buf.splat_count.max(1);

        let uniform_buf = create_buffer(
            &gpu.device,
            UNIFORM_BYTES as u64,
            USAGE_UNIFORM_COPY_DST,
            "mathground-splat:uniform",
        );
        let projected_buf = create_buffer(
            &gpu.device,
            n * 48,
            USAGE_STORAGE_RW,
            "mathground-splat:projected",
        );
        let sort_keys_buf = create_buffer(
            &gpu.device,
            n * 8,
            USAGE_STORAGE_RW,
            "mathground-splat:sort_keys",
        );
        // No-SH path: a small zero-filled buffer so the `array<f32>` binding
        // is non-empty. The shader's loop runs but every coefficient is zero,
        // giving DC-only color, matching `SplatCloud::sh_buffer()` when the
        // cloud carries no SH. WebGPU rejects zero-sized buffer bindings, and
        // 16 floats (64 bytes) comfortably covers the one-element minimum.
        let sh_byte_len = 16 * 4;
        let sh1_buf = create_buffer(
            &gpu.device,
            sh_byte_len as u64,
            USAGE_STORAGE_COPY_DST, // STORAGE | COPY_DST so we can zero-init it
            "mathground-splat:sh1",
        );
        let zeros = vec![0u8; sh_byte_len];
        write_bytes_to_buffer(&gpu.queue, &sh1_buf, &zeros);

        let module = create_shader_module(&gpu.device, PROJECT_WGSL, "mathground-splat:project");
        let pipeline = create_compute_pipeline(&gpu.device, &module, "mathground-splat:project");

        let bgl = js_call(&pipeline, "getBindGroupLayout", &[JsValue::from_f64(0.0)])
            .unwrap_or(JsValue::NULL);
        let entries = Array::new();
        entries.push(&bind_entry(0, buffer_resource(&uniform_buf)));
        entries.push(&bind_entry(1, buffer_resource(&splat_buf.handle)));
        entries.push(&bind_entry(2, buffer_resource(&projected_buf)));
        entries.push(&bind_entry(3, buffer_resource(&sort_keys_buf)));
        entries.push(&bind_entry(4, buffer_resource(&sh1_buf)));
        let bg_desc = Object::new();
        js_set(&bg_desc, "layout", &bgl);
        js_set(&bg_desc, "entries", &entries.into());
        let bind_group =
            js_call(&gpu.device, "createBindGroup", &[bg_desc.into()]).unwrap_or(JsValue::NULL);

        let workgroups = ((n as u32) + 255) / 256;

        Self {
            pipeline,
            bind_group,
            uniform_buf,
            projected_buf,
            sort_keys_buf,
            sh1_buf,
            workgroups,
        }
    }

    /// Update the uniform buffer and dispatch the project compute pass.
    pub fn dispatch(
        &self,
        gpu: &GpuContext,
        cam: &Camera,
        time_s: f32,
        viewport: [f32; 2],
        splat_count: u64,
    ) -> StageTiming {
        let t0 = now_ns();
        let bytes = pack_uniform(cam, time_s, viewport, Vec4::new(0.0, 0.0, 0.0, 1.0));
        write_bytes_to_buffer(&gpu.queue, &self.uniform_buf, &bytes);

        let enc_desc = Object::new();
        let encoder = js_call(&gpu.device, "createCommandEncoder", &[enc_desc.into()])
            .unwrap_or(JsValue::NULL);
        let pass_desc = Object::new();
        let pass = js_call(&encoder, "beginComputePass", &[pass_desc.into()])
            .unwrap_or(JsValue::NULL);
        js_call(&pass, "setPipeline", &[self.pipeline.clone()]).ok();
        js_call(
            &pass,
            "setBindGroup",
            &[JsValue::from_f64(0.0), self.bind_group.clone()],
        )
        .ok();
        js_call(
            &pass,
            "dispatchWorkgroups",
            &[JsValue::from_f64(self.workgroups as f64)],
        )
        .ok();
        js_call(&pass, "end", &[]).ok();
        let cmd = js_call(&encoder, "finish", &[]).unwrap_or(JsValue::NULL);
        js_call(&gpu.queue, "submit", &[Array::of1(&cmd).into()]).ok();

        let wall_ns = now_ns() - t0;
        let bit_ops = splat_count.saturating_mul(PROJECT_BIT_OPS_PER_SPLAT);
        StageTiming { wall_ns, bit_ops }
    }
}

// ── Sort pass (P1.6) ────────────────────────────────────────────────────

pub struct Sort {
    pub clear_pipeline: JsValue,
    pub histogram_pipeline: JsValue,
    pub scan_local_pipeline: JsValue,
    pub scan_global_pipeline: JsValue,
    pub scan_broadcast_pipeline: JsValue,
    pub scatter_pipeline: JsValue,

    pub clear_bg: JsValue,
    pub histogram_bg: JsValue,
    pub scan_local_bg: JsValue,
    pub scan_global_bg: JsValue,
    pub scan_broadcast_bg: JsValue,
    pub scatter_bg: JsValue,

    pub histogram_buf: JsValue,
    pub wg_totals_buf: JsValue,
    pub keys_out_buf: JsValue,
    pub params_buf: JsValue,

    pub scatter_workgroups: u32,
}

impl Sort {
    /// Build all 6 sort pipelines + bind groups + auxiliary buffers around
    /// the project pass's `sort_keys_buf` output. Writes the static
    /// `SortParams { count, _pads }` uniform once — the demo doesn't resize
    /// the cloud mid-session.
    pub fn new(gpu: &GpuContext, projection: &Projection, splat_count: u64) -> Self {
        let n = splat_count.max(1);

        let histogram_buf = create_buffer(
            &gpu.device,
            HISTOGRAM_BUCKETS * 4,
            USAGE_STORAGE_RW,
            "mathground-splat:sort_histogram",
        );
        let wg_totals_buf = create_buffer(
            &gpu.device,
            WG_TOTALS_LEN * 4,
            USAGE_STORAGE_RW,
            "mathground-splat:sort_wg_totals",
        );
        let keys_out_buf = create_buffer(
            &gpu.device,
            n * 8,
            USAGE_STORAGE_RW,
            "mathground-splat:sort_keys_out",
        );
        let params_buf = create_buffer(
            &gpu.device,
            SORT_PARAMS_BYTES as u64,
            USAGE_UNIFORM_COPY_DST,
            "mathground-splat:sort_params",
        );
        // SortParams { count, _pad0, _pad1, _pad2 } — static for this mount.
        let params = [splat_count as u32, 0u32, 0u32, 0u32];
        let arr = Uint32Array::from(&params[..]);
        let _ = js_call(
            &gpu.queue,
            "writeBuffer",
            &[
                params_buf.clone(),
                JsValue::from_f64(0.0),
                arr.buffer().into(),
                arr.byte_offset().into(),
                arr.byte_length().into(),
            ],
        );

        // ── Pipelines ──
        let clear_pipeline = create_compute_pipeline(
            &gpu.device,
            &create_shader_module(&gpu.device, SORT_CLEAR_WGSL, "mathground-splat:sort_clear"),
            "mathground-splat:sort_clear",
        );
        let histogram_pipeline = create_compute_pipeline(
            &gpu.device,
            &create_shader_module(
                &gpu.device,
                SORT_HISTOGRAM_WGSL,
                "mathground-splat:sort_histogram",
            ),
            "mathground-splat:sort_histogram",
        );
        let scan_local_pipeline = create_compute_pipeline(
            &gpu.device,
            &create_shader_module(
                &gpu.device,
                SORT_SCAN_LOCAL_WGSL,
                "mathground-splat:sort_scan_local",
            ),
            "mathground-splat:sort_scan_local",
        );
        let scan_global_pipeline = create_compute_pipeline(
            &gpu.device,
            &create_shader_module(
                &gpu.device,
                SORT_SCAN_GLOBAL_WGSL,
                "mathground-splat:sort_scan_global",
            ),
            "mathground-splat:sort_scan_global",
        );
        let scan_broadcast_pipeline = create_compute_pipeline(
            &gpu.device,
            &create_shader_module(
                &gpu.device,
                SORT_SCAN_BROADCAST_WGSL,
                "mathground-splat:sort_scan_broadcast",
            ),
            "mathground-splat:sort_scan_broadcast",
        );
        let scatter_pipeline = create_compute_pipeline(
            &gpu.device,
            &create_shader_module(
                &gpu.device,
                SORT_SCATTER_WGSL,
                "mathground-splat:sort_scatter",
            ),
            "mathground-splat:sort_scatter",
        );

        // ── Bind groups (one per pipeline, layout derived from shader) ──
        let mk_bg = |pipeline: &JsValue, bindings: &[(u32, &JsValue)]| -> JsValue {
            let bgl = js_call(pipeline, "getBindGroupLayout", &[JsValue::from_f64(0.0)])
                .unwrap_or(JsValue::NULL);
            let entries = Array::new();
            for (slot, buf) in bindings {
                entries.push(&bind_entry(*slot, buffer_resource(buf)));
            }
            let desc = Object::new();
            js_set(&desc, "layout", &bgl);
            js_set(&desc, "entries", &entries.into());
            js_call(&gpu.device, "createBindGroup", &[desc.into()]).unwrap_or(JsValue::NULL)
        };

        let clear_bg = mk_bg(&clear_pipeline, &[(0, &histogram_buf)]);
        let histogram_bg = mk_bg(
            &histogram_pipeline,
            &[
                (0, &projection.sort_keys_buf),
                (1, &histogram_buf),
                (2, &params_buf),
            ],
        );
        let scan_local_bg = mk_bg(
            &scan_local_pipeline,
            &[(0, &histogram_buf), (1, &wg_totals_buf)],
        );
        let scan_global_bg = mk_bg(&scan_global_pipeline, &[(0, &wg_totals_buf)]);
        let scan_broadcast_bg = mk_bg(
            &scan_broadcast_pipeline,
            &[(0, &histogram_buf), (1, &wg_totals_buf)],
        );
        let scatter_bg = mk_bg(
            &scatter_pipeline,
            &[
                (0, &projection.sort_keys_buf),
                (1, &keys_out_buf),
                (2, &histogram_buf),
                (3, &params_buf),
            ],
        );

        let scatter_workgroups = ((n as u32) + 255) / 256;

        Self {
            clear_pipeline,
            histogram_pipeline,
            scan_local_pipeline,
            scan_global_pipeline,
            scan_broadcast_pipeline,
            scatter_pipeline,
            clear_bg,
            histogram_bg,
            scan_local_bg,
            scan_global_bg,
            scan_broadcast_bg,
            scatter_bg,
            histogram_buf,
            wg_totals_buf,
            keys_out_buf,
            params_buf,
            scatter_workgroups,
        }
    }

    /// Run all 6 sort passes in a single command encoder + single submit.
    /// Returns measured wall_ns and an honest bit-op estimate that includes
    /// the histogram clear, the workgroup scans, and the per-splat
    /// scatter.
    pub fn dispatch(&self, gpu: &GpuContext, splat_count: u64) -> StageTiming {
        let t0 = now_ns();

        let enc_desc = Object::new();
        let encoder = js_call(&gpu.device, "createCommandEncoder", &[enc_desc.into()])
            .unwrap_or(JsValue::NULL);
        let pass_desc = Object::new();
        let pass = js_call(&encoder, "beginComputePass", &[pass_desc.into()])
            .unwrap_or(JsValue::NULL);

        let dispatch = |pipeline: &JsValue, bg: &JsValue, workgroups: u32| {
            js_call(&pass, "setPipeline", &[pipeline.clone()]).ok();
            js_call(
                &pass,
                "setBindGroup",
                &[JsValue::from_f64(0.0), bg.clone()],
            )
            .ok();
            js_call(
                &pass,
                "dispatchWorkgroups",
                &[JsValue::from_f64(workgroups as f64)],
            )
            .ok();
        };

        // 1. CLEAR — zero the 65536-bucket histogram.
        dispatch(&self.clear_pipeline, &self.clear_bg, 256);
        // 2. HISTOGRAM — atomic-count per bucket.
        dispatch(
            &self.histogram_pipeline,
            &self.histogram_bg,
            self.scatter_workgroups,
        );
        // 3. SCAN_LOCAL — per-block Hillis-Steele scan; emit wg_totals[k].
        dispatch(&self.scan_local_pipeline, &self.scan_local_bg, 256);
        // 4. SCAN_GLOBAL — exclusive scan over the 256 block totals.
        dispatch(&self.scan_global_pipeline, &self.scan_global_bg, 1);
        // 5. SCAN_BROADCAST — fold block offsets back into per-bucket prefix.
        dispatch(&self.scan_broadcast_pipeline, &self.scan_broadcast_bg, 256);
        // 6. SCATTER — each splat atomic-claims its slot in keys_out.
        dispatch(
            &self.scatter_pipeline,
            &self.scatter_bg,
            self.scatter_workgroups,
        );

        js_call(&pass, "end", &[]).ok();
        let cmd = js_call(&encoder, "finish", &[]).unwrap_or(JsValue::NULL);
        js_call(&gpu.queue, "submit", &[Array::of1(&cmd).into()]).ok();

        let wall_ns = now_ns() - t0;
        let bit_ops = SORT_FIXED_BIT_OPS
            .saturating_add(splat_count.saturating_mul(SORT_BIT_OPS_PER_SPLAT));
        StageTiming { wall_ns, bit_ops }
    }
}

// ── Paint pass (P1.7) ───────────────────────────────────────────────────

pub struct Paint {
    pub pipeline: JsValue,
    pub bind_group: JsValue,
    pub uniform_buf: JsValue,
    pub instance_count: u32,
}

impl Paint {
    pub fn new(
        gpu: &GpuContext,
        projection: &Projection,
        sort: &Sort,
        splat_count: u64,
    ) -> Self {
        let uniform_buf = create_buffer(
            &gpu.device,
            PAINT_UNIFORM_BYTES as u64,
            USAGE_UNIFORM_COPY_DST,
            "mathground-splat:paint_uniform",
        );
        // Viewport is static for this mount; rewrite if the canvas resizes.
        let view = [gpu.width as f32, gpu.height as f32, 0.0_f32, 0.0_f32];
        let arr = Float32Array::from(&view[..]);
        let _ = js_call(
            &gpu.queue,
            "writeBuffer",
            &[
                uniform_buf.clone(),
                JsValue::from_f64(0.0),
                arr.buffer().into(),
                arr.byte_offset().into(),
                arr.byte_length().into(),
            ],
        );

        let module = create_shader_module(&gpu.device, PAINT_WGSL, "mathground-splat:paint");

        // ── Render pipeline (triangle-strip, premultiplied alpha blend) ──
        let blend_color = Object::new();
        js_set(&blend_color, "srcFactor", &JsValue::from_str("one"));
        js_set(
            &blend_color,
            "dstFactor",
            &JsValue::from_str("one-minus-src-alpha"),
        );
        js_set(&blend_color, "operation", &JsValue::from_str("add"));
        let blend_alpha = Object::new();
        js_set(&blend_alpha, "srcFactor", &JsValue::from_str("one"));
        js_set(
            &blend_alpha,
            "dstFactor",
            &JsValue::from_str("one-minus-src-alpha"),
        );
        js_set(&blend_alpha, "operation", &JsValue::from_str("add"));
        let blend = Object::new();
        js_set(&blend, "color", &blend_color.into());
        js_set(&blend, "alpha", &blend_alpha.into());

        let target = Object::new();
        js_set(&target, "format", &JsValue::from_str(&gpu.format));
        js_set(&target, "blend", &blend.into());

        let vertex = Object::new();
        js_set(&vertex, "module", &module);
        js_set(&vertex, "entryPoint", &JsValue::from_str("vs_main"));

        let fragment = Object::new();
        js_set(&fragment, "module", &module);
        js_set(&fragment, "entryPoint", &JsValue::from_str("fs_main"));
        js_set(&fragment, "targets", &Array::of1(&target.into()).into());

        let primitive = Object::new();
        js_set(&primitive, "topology", &JsValue::from_str("triangle-strip"));

        let desc = Object::new();
        js_set(&desc, "label", &JsValue::from_str("mathground-splat:paint"));
        js_set(&desc, "layout", &JsValue::from_str("auto"));
        js_set(&desc, "vertex", &vertex.into());
        js_set(&desc, "fragment", &fragment.into());
        js_set(&desc, "primitive", &primitive.into());
        let pipeline =
            js_call(&gpu.device, "createRenderPipeline", &[desc.into()]).unwrap_or(JsValue::NULL);

        let bgl = js_call(&pipeline, "getBindGroupLayout", &[JsValue::from_f64(0.0)])
            .unwrap_or(JsValue::NULL);
        let entries = Array::new();
        entries.push(&bind_entry(0, buffer_resource(&projection.projected_buf)));
        entries.push(&bind_entry(1, buffer_resource(&sort.keys_out_buf)));
        entries.push(&bind_entry(2, buffer_resource(&uniform_buf)));
        let bg_desc = Object::new();
        js_set(&bg_desc, "layout", &bgl);
        js_set(&bg_desc, "entries", &entries.into());
        let bind_group =
            js_call(&gpu.device, "createBindGroup", &[bg_desc.into()]).unwrap_or(JsValue::NULL);

        Self {
            pipeline,
            bind_group,
            uniform_buf,
            instance_count: splat_count as u32,
        }
    }
}

// ── UploadResult (one-shot at mount) ────────────────────────────────────

pub struct UploadResult {
    pub gpu_bytes: u64,
    pub splat_count: u64,
    pub wall_ns: f64,
}

// ── SplatRenderer (composes the pipeline) ───────────────────────────────

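/// Return the bounding-box centre and a scene-fit camera distance: 1.6× the
/// length of the box diagonal, with that length floored at 1.0 so a
/// degenerate (near-zero-extent) cloud still gets a usable distance.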
fn frame_target_and_distance(cloud: &SplatCloud) -> (Vec3, f32) {
    let (min, max) = cloud.bounds();
    let centre = Vec3::new(
        0.5 * (min[0] + max[0]),
        0.5 * (min[1] + max[1]),
        0.5 * (min[2] + max[2]),
    );
    let extent = Vec3::new(
        max[0] - min[0],
        max[1] - min[1],
        max[2] - min[2],
    );
    // `extent.length()` is the full bounding-box diagonal, not a half-extent.
    let diagonal = extent.length().max(1.0);
    (centre, diagonal * 1.6)
}

pub struct SplatRenderer {
    pub gpu: Option<GpuContext>,
    pub splats: Option<SplatBuffer>,
    pub projection: Option<Projection>,
    pub sort: Option<Sort>,
    pub paint: Option<Paint>,
    pub camera: Option<Camera>,
    pub time_s: f32,
    pub source_bytes: u64,
    pub splat_count: u64,
    upload: Option<UploadResult>,
}

impl SplatRenderer {
    pub async fn mount(
        canvas_id: &str,
        decoded: crate::decode::DecodedSplat,
    ) -> Result<Self, JsValue> {
        let source_bytes = decoded.source_bytes;
        let splat_count = decoded.splat_count;
        let cloud = decoded.cloud;

        let gpu = match GpuContext::acquire(canvas_id).await {
            Ok(ctx) => Some(ctx),
            Err(e) => {
                web_sys::console::warn_1(&e);
                None
            }
        };

        let (splats, upload, projection, sort, paint, camera) = match &gpu {
            Some(ctx) => {
                let (buf, wall_ns) = SplatBuffer::upload(ctx, &cloud);
                let result = UploadResult {
                    gpu_bytes: buf.gpu_bytes,
                    splat_count: buf.splat_count,
                    wall_ns,
                };
                let proj = Projection::new(ctx, &buf);
                let sort = Sort::new(ctx, &proj, buf.splat_count);
                let paint = Paint::new(ctx, &proj, &sort, buf.splat_count);
                let (centre, dist) = frame_target_and_distance(&cloud);
                let aspect = ctx.width as f32 / ctx.height.max(1) as f32;
                let cam = Camera::orbital(centre, dist, aspect);
                (
                    Some(buf),
                    Some(result),
                    Some(proj),
                    Some(sort),
                    Some(paint),
                    Some(cam),
                )
            }
            None => (None, None, None, None, None, None),
        };

        Ok(Self {
            gpu,
            splats,
            projection,
            sort,
            paint,
            camera,
            time_s: 0.0,
            source_bytes,
            splat_count,
            upload,
        })
    }

    pub fn has_gpu(&self) -> bool {
        self.gpu.is_some()
    }

    pub fn take_upload_result(&mut self) -> Option<UploadResult> {
        self.upload.take()
    }

    /// One frame. Returns accumulated timing + bit-ops across every stage:
    /// project (P1.5), sort (P1.6), then clear + paint (P1.3, P1.7).
    pub fn frame(&mut self) -> FrameResult {
        let Some(gpu) = self.gpu.as_ref() else {
            return FrameResult::zero();
        };

        let mut wall_ns = 0.0;
        let mut bit_ops = 0u64;

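        // Project pass: runs first because the sort (P1.6) and paint (P1.7)
        // stages consume its output; each stage folds its own timing into
        // the same FrameResult.
        // Fixed 1/60 s timestep, independent of the real frame interval.
        self.time_s += 1.0 / 60.0;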
        if let Some(c) = self.camera.as_mut() {
            // Slow orbit so the per-frame project work isn't trivially
            // cache-hit.
            c.yaw += 0.004;
        }
        if let (Some(proj), Some(cam), Some(splat_buf)) =
            (&self.projection, &self.camera, &self.splats)
        {
            let stage = proj.dispatch(
                gpu,
                cam,
                self.time_s,
                [gpu.width as f32, gpu.height as f32],
                splat_buf.splat_count,
            );
            wall_ns += stage.wall_ns;
            bit_ops = bit_ops.saturating_add(stage.bit_ops);
        }

        // Sort pass — depth-orders the projected splats so the (P1.7) paint
        // pipeline can blend them back-to-front. Queue-submitted after the
        // project pass; WebGPU guarantees in-submit-order execution so the
        // sort reads the project's writes without an explicit barrier.
        if let (Some(sort), Some(splat_buf)) = (&self.sort, &self.splats) {
            let stage = sort.dispatch(gpu, splat_buf.splat_count);
            wall_ns += stage.wall_ns;
            bit_ops = bit_ops.saturating_add(stage.bit_ops);
        }

        // Render pass: clear + (P1.7) paint pipeline in one render pass.
        let render = gpu.render_frame(self.paint.as_ref());
        wall_ns += render.wall_ns;
        bit_ops = bit_ops.saturating_add(render.bit_ops);

        FrameResult { wall_ns, bit_ops }
    }
}

pub struct StageTiming {
    pub wall_ns: f64,
    pub bit_ops: u64,
}

pub struct FrameResult {
    pub wall_ns: f64,
    pub bit_ops: u64,
}

impl FrameResult {
    fn zero() -> Self {
        Self {
            wall_ns: 0.0,
            bit_ops: 0,
        }
    }
}
This is the exact Rust file compiled into the mathground_splat WASM bundle. Every WebGPU call — adapter, buffer, shader module, compute pass, render pass — is in this file, driven via js_sys::Reflect. The receipt above is a function of this code, not a bespoke benchmark.
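The six passes that `Sort::dispatch` encodes amount to one counting sort over 16-bit keys. The following is a minimal CPU model of that sequence, not the WGSL itself. It assumes 256 blocks of 256 buckets, inferred from the "65536-bucket histogram" and "256 block totals" comments. `counting_sort_u16` is a hypothetical name. The serial loops stand in for the Hillis-Steele scan, and the serial scatter is stable, which an atomic GPU scatter need not be.

```rust
/// CPU model of the six sort passes. Returns splat indices ordered by
/// ascending 16-bit key. Illustrative only; block sizes are assumptions.
fn counting_sort_u16(keys: &[u16]) -> Vec<u32> {
    const BLOCKS: usize = 256;
    const BLOCK_SIZE: usize = 256;

    // 1. CLEAR: zero the 65536-bucket histogram.
    let mut hist = vec![0u32; BLOCKS * BLOCK_SIZE];

    // 2. HISTOGRAM: count keys per bucket.
    for &k in keys {
        hist[k as usize] += 1;
    }

    // 3. SCAN_LOCAL: exclusive scan inside each block; keep each block total.
    let mut block_totals = [0u32; BLOCKS];
    for b in 0..BLOCKS {
        let mut running = 0u32;
        for slot in &mut hist[b * BLOCK_SIZE..(b + 1) * BLOCK_SIZE] {
            let count = *slot;
            *slot = running;
            running += count;
        }
        block_totals[b] = running;
    }

    // 4. SCAN_GLOBAL: exclusive scan over the 256 block totals.
    let mut running = 0u32;
    for total in &mut block_totals {
        let t = *total;
        *total = running;
        running += t;
    }

    // 5. SCAN_BROADCAST: add each block's offset to its per-bucket prefix.
    for b in 0..BLOCKS {
        for slot in &mut hist[b * BLOCK_SIZE..(b + 1) * BLOCK_SIZE] {
            *slot += block_totals[b];
        }
    }

    // 6. SCATTER: each splat claims the next free slot of its bucket.
    let mut out = vec![0u32; keys.len()];
    for (i, &k) in keys.iter().enumerate() {
        let slot = &mut hist[k as usize];
        out[*slot as usize] = i as u32;
        *slot += 1;
    }
    out
}
```

Calling `counting_sort_u16(&[3, 1, 2, 1])` yields the index order `[1, 3, 2, 0]`: the two splats with key 1 first, then key 2, then key 3. Whether the real pipeline sorts ascending or descending depends on how the depth keys are encoded, which this listing does not show.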
Notes on the receipt

three receipt kinds: splat-world-upload (one-shot at mount, bandwidth into GPU memory), splat-world-render (per-frame, project + sort + paint), splat-world-author (one-shot per image upload, model-tier).
artifact = GaussianSplat: the receipt grammar carries the produced artifact's byte count and splat count, so the panel reports joules-per-byte and joules-per-splat alongside the usual envelope.
why the J / splat is high at first: the first measurable window includes the project pass alone (~1 kbit-ops / splat); paint adds ~30 kbit-ops / splat once the fragment shader is rasterising. μ usually settles between 10⁹ and 10¹⁰.
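The per-splat figures in the last note can be turned into rough per-frame totals with the same saturating linear model that `Sort::dispatch` uses. The sketch below assumes that 1 kbit-op means 1,000 bit-ops, that the fixed overhead is zero, and that the cloud holds the 32,768 splats quoted for the galaxy. The constant names are hypothetical, and the sort term is left out because its constants are not shown here.

```rust
// Assumed values, read off the note above; not the crate's real constants.
const PROJECT_BIT_OPS_PER_SPLAT: u64 = 1_000; // "~1 kbit-ops / splat"
const PAINT_BIT_OPS_PER_SPLAT: u64 = 30_000; // "~30 kbit-ops / splat"
const SPLAT_COUNT: u64 = 32_768;

/// Fixed overhead plus a per-splat term, saturating at u64::MAX
/// instead of wrapping on overflow.
fn stage_bit_ops(fixed: u64, per_splat: u64, splat_count: u64) -> u64 {
    fixed.saturating_add(splat_count.saturating_mul(per_splat))
}

/// Bit-op total for one frame, with or without the paint stage.
fn frame_bit_ops(include_paint: bool) -> u64 {
    let project = stage_bit_ops(0, PROJECT_BIT_OPS_PER_SPLAT, SPLAT_COUNT);
    if include_paint {
        project.saturating_add(stage_bit_ops(0, PAINT_BIT_OPS_PER_SPLAT, SPLAT_COUNT))
    } else {
        project
    }
}
```

Under these assumptions a project-only window accounts for 32,768,000 bit-ops per frame, and a window with paint accounts for 1,015,808,000, roughly 31 times as many.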
method	—
primitive	—
V-class	—
wall_ns / op	—
E_floor / op	—
E_tdp / op	—
μ apparent	—
bit-ops / op	—
artifact	—
J / splat	—
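The J / splat and joules-per-byte figures are an energy total divided by the artifact's size. The sketch below assumes the energy is a power budget multiplied by wall time, one plausible reading of the E_tdp field; the page does not define it here. `ArtifactEnergy` and `artifact_energy` are hypothetical names, not the receipt's actual schema.

```rust
/// Hypothetical per-artifact energy figures.
struct ArtifactEnergy {
    joules_per_splat: f64,
    joules_per_byte: f64,
}

/// Spread an energy estimate (watts × seconds) over the artifact's splat
/// and byte counts. Returns None when either count is zero, so the caller
/// never divides by zero.
fn artifact_energy(
    power_watts: f64,
    wall_ns: f64,
    splat_count: u64,
    gpu_bytes: u64,
) -> Option<ArtifactEnergy> {
    if splat_count == 0 || gpu_bytes == 0 {
        return None;
    }
    // Nanoseconds to seconds, then watts × seconds = joules.
    let joules = power_watts * wall_ns * 1e-9;
    Some(ArtifactEnergy {
        joules_per_splat: joules / splat_count as f64,
        joules_per_byte: joules / gpu_bytes as f64,
    })
}
```

For instance, a 10 W budget over a 1 ms window is 0.01 J. Over 32,768 splats that comes to about 3.05 × 10⁻⁷ J per splat.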