Bulkhead τ · Research Portfolio

Research Papers

Evidence-backed agent systems, deterministic substrates, and local LLM operator judgment.

Internally, this work is developed under Bulkhead Tau.

37 Primary Papers · 12 Live Sites · 12 Research Clusters · 2026 Active

Operating front: these papers are the evidence base behind Bulkhead τ — the public release line for a deterministic governance substrate. Bulkhead Tau is the engineering core under that name; both surfaces will run in parallel for now, with /bulkhead-tau/ as the external front.

The finding: For grounded domain tasks (well-defined task classes with deterministic substrates), harness configuration is the binding constraint. Model identity is not. These papers prove this claim under stress-test conditions: local models, which cannot compensate for a weak harness, converge with frontier models at the level of semantic usefulness when the harness is sufficient. This scope is deliberate. Outside it, model capability matters in ways the framework does not cover.

Local Model Addendum: the Bulkhead Tau Local Model Details page covers the three supporting papers that feed the orchestration synthesis: TourAgent (1.13), ShowcaseAgent (1.12), and Local Model Role Suitability (1.11).

Boundary Results: the Bulkhead Tau Boundary Results page covers three papers that map where the organized stack hits its limits: Grounded Agent Failure Is Structurally Determined (1.10), True Ski Chalet Boundary Result (1.14), and When The Organized Stack Loses (1.15).

RVH / ML Evaluation: Rough Volatility as ML Benchmark covers Papers 1.8 and 1.9 — why domain expertise, not ML capability, is the binding constraint in rough volatility forecasting and the cross-domain benchmark principle it reveals.

Measurement Integrity, Operator Layer & Applied Evidence: Papers 1.16–1.19 extend the framework outward. Paper 1.16 shows that evaluation infrastructure can fail at the capture boundary: a VT100 terminal artifact was corrupting protocol scores for thinking-mode models. Paper 1.17 documents the operator shell pattern, in which OpenClaw wraps Bulkhead Tau as an access layer without becoming the authority. Paper 1.18 is the framework's first numbered production case: PPR Agent, covering 92M regulated cardiac device implants across 18 years, behind a deterministic SQLite substrate. Paper 1.19 is a short companion to 1.16 on the other side of the apparatus: when stronger models override literal substrate inspection, capability itself becomes a source of non-neutrality.

Local LLM Operator Judgment: Papers 1.20–1.26, 1.30, 1.31, and 1.34–1.37 add this cluster, covering strict handoff discipline, privacy cost accounting, local model sizing, token-cost attention, substitution discipline, multi-agent HIL workflow boundaries, cross-audit failure cataloging, narration-surface failure analysis, deterministic validator discipline, and image-generation prompt discipline.

Sensor-to-Simulation Engineering: Paper 1.27 establishes the data landscape and fidelity boundaries for wearable sport sensors, showing why sensor-driven HIL is necessarily event-driven rather than waveform-driven. Paper 1.28 publishes the LabWired platform boundary for register-level firmware simulation. Paper 1.29 closes the sensor-driven HIL loop with a documented physical proximity replay. Paper 1.32 applies the sensor-corpus framing to a single dual-sensor-instrumented match at depth, and Paper 1.33 defines the proof boundary separating harness testing from component verification.

Where to Start

Each paper stands alone. Use the cluster that matches your interest:

New to Bulkhead Tau

Start with Paper 1.1 for the framework framing, then try the TourAgent live demo — ten tennis questions with repeatable answers — to see the deterministic approach in action.

Local Inference & Offline Systems

Papers 1.2, 1.3, and 1.5 form a cluster: offline grounded agent → ski chalet hardware boundary → TSP solver-backed orchestration. The common argument: harness level, not model size, drives usefulness.

Orchestration & Role Assignment

Papers 1.5, 1.6, 1.11, 1.12, and 1.13 address where correctness should live and how grounding, routing, and repair beat raw power in identifiable regimes.

Failure Modes & Boundary Conditions

Papers 1.7, 1.10, 1.14, and 1.15 cover the failure taxonomy, empirical failure prediction, the true local ceiling, and the five conditions under which the organized stack's advantage collapses.

ML Evaluation & Cross-Domain Benchmarks

Papers 1.8 and 1.9 establish why realized volatility forecasting is high-signal benchmark territory — and what the same structural argument implies across semiconductor defectivity and other rough-process domains.

Measurement Integrity & Operator Layer

Papers 1.16, 1.17, and 1.19 address the infrastructure surrounding the Bulkhead Tau system. 1.16: capture pipeline failures produce false evaluation verdicts. 1.17: an operator shell can expose the deterministic stack without replacing it as the authority. 1.19: when stronger models override literal substrate inspection, the model itself becomes part of the non-neutrality.
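
To make the capture-boundary failure class concrete, here is a minimal sketch of how terminal control sequences left in captured output can flip an exact-match verdict. It assumes the artifact consists of CSI escape sequences; this summary does not describe the actual artifact or the fix in Paper 1.16, and `strip_csi` and `exact_match_score` are hypothetical names.

```python
import re

# CSI sequences: ESC [ then parameter bytes, intermediate bytes, one final byte.
CSI_PATTERN = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")


def strip_csi(captured: str) -> str:
    """Remove CSI escape sequences from captured terminal output."""
    return CSI_PATTERN.sub("", captured)


def exact_match_score(captured: str, expected: str) -> bool:
    """Hypothetical scorer: compare cleaned output to the expected answer."""
    return strip_csi(captured).strip() == expected


# Illustrative capture: erase-line and bold sequences wrapped around the answer.
raw_capture = "\x1b[2K\x1b[1mANSWER: 42\x1b[0m\r\n"
naive_verdict = raw_capture.strip() == "ANSWER: 42"
cleaned_verdict = exact_match_score(raw_capture, "ANSWER: 42")
```

In this sketch the naive comparison rejects a correct answer while the cleaned comparison accepts it, so the verdict depends on the capture path rather than on the model.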

Applied / Production Evidence

Paper 1.18 is the first numbered production case: PPR Agent running against 18 years of government-mandated cardiac device data. This is field validation, not lane validation: the framework operating against regulated disclosures from three manufacturers.

Local LLM Operator Judgment

Papers 1.20–1.26, 1.30, 1.31, and 1.34–1.37 turn the Bulkhead Tau evidence base into operating guidance for local LLM decisions: handoff discipline, privacy tradeoffs, model sizing, token-cost attention, when a model call should not be a model call, multi-agent HIL workflows, cross-audit cataloging, narration-surface failure analysis, validator discipline, and image-generation prompt discipline.

Agent Path Evaluation

Paper 1.38 compares implementation paths rather than model identities. In the Amkor/xAmkor case study, the useful result is compositional: xAmkor owns the verifier surface, while Amkor owns the broader cockpit/application surface.

Sensor-to-Simulation Engineering

Papers 1.27, 1.28, 1.29, 1.32, and 1.33 characterize wearable sport sensors through a fidelity-boundary lens (1.27), define the LabWired simulation platform boundary (1.28), close the physical-replay HIL loop (1.29), apply the corpus framing to a single instrumented match at depth (1.32), and define the proof boundary separating harness testing from component verification (1.33).

I  ·  Grounding, Local Systems & Hardware

What makes a local or offline system actually useful — and what the evidence honestly supports.

II  ·  Orchestration & Role Assignment

Where correctness should live in an AI system — and what happens when it lives in the wrong place.

III  ·  Framework & Operating Discipline

The standards, supervision structures, and failure taxonomy that make agentic work trustworthy.

IV  ·  Local Model Addendum

Three empirical papers feeding the orchestration synthesis — grounded reliability, routing, and role suitability at portfolio scale.

V  ·  Boundary Conditions & Failure Prediction

Where the organized stack's advantage collapses — and why failure family is predictable from configuration, not query content.

VI  ·  RVH / ML Evaluation

Realized volatility forecasting as high-signal ML benchmark territory — and the cross-domain principle it reveals.

VII  ·  Measurement Integrity

When the evaluation infrastructure itself fails — or when the model's own disposition toward the substrate becomes part of the apparatus.

VIII  ·  Operator Layer

Building an operator-facing outer layer over the deterministic stack — and keeping it outside the authority boundary.

IX  ·  Applied / Production Evidence

Field validation, not lane validation — the framework operating against regulated data in a real domain.

X  ·  Local LLM Operator Judgment

Operational judgment for local LLM lanes: handoff discipline, privacy cost, sizing discipline, token-cost attention, cross-audit failure cataloging, narration-surface risk, validator discipline, and prompt-generation discipline.

Paper 1.20

Smarter, Faster, and Bounded by Handoff Discipline

Handoff-discipline doctrine for strict machine-facing local-model lanes.

Paper 1.21

Privacy Is Worth Paying For

Separates the privacy argument for local inference from the false claim that the local lane is free. The operational question is when privacy, control, and auditability justify the real cost.

Paper 1.22

Slow Is Not Smart

Larger, slower local models do not automatically improve validated Bulkhead Tau lanes. Strict-handoff systems are often bounded by prompt, schema, validator, and orchestration design rather than raw model size.

Paper 1.23

Please Is Sand Off A Beach

Argues that courtesy tokens are negligible compared with structural token waste such as giant context dumps, repeated scaffolding, retries, and missing decomposition.

Paper 1.24

The Model Is Not The Function

For bounded predicates with deterministic oracles, an LLM must earn its runtime against a written spec rather than against the visual length of the code it replaces. Evidence comes from the CAP-001 and LIB-001 packets, with a num_predict verification that rules out the obvious counter-explanation.
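
A minimal sketch of what "earning runtime against a written spec" could look like. The predicate (a TCP port check), the spec cases, and the function names are hypothetical illustrations and are not taken from the CAP-001 or LIB-001 packets.

```python
from typing import Callable

# Hypothetical written spec for a bounded predicate: "is this a valid TCP port?"
SPEC_CASES: list[tuple[str, bool]] = [
    ("80", True),
    ("65535", True),
    ("0", False),
    ("65536", False),
    ("http", False),
    ("", False),
]


def is_valid_port(text: str) -> bool:
    """Deterministic oracle: ASCII decimal digits only, value in 1..65535."""
    return text.isascii() and text.isdigit() and 1 <= int(text) <= 65535


def spec_failures(candidate: Callable[[str], bool]) -> list[str]:
    """Return the spec inputs on which the candidate disagrees with the spec."""
    return [text for text, expected in SPEC_CASES if candidate(text) != expected]
```

Any candidate, including a wrapper around a model call, can be passed to `spec_failures`. Under this framing the model has to match the spec at least as well as the few lines of `is_valid_port` before its extra latency is justified.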

Paper 1.25

Orchestration Is Cheaper Than Reasoning

For models with extensive reasoning capacity, the computational cost of finding the answer often exceeds the cost of explaining it. The paper's TSP and scheduling benchmarks show that orchestration provides a 2x to 14x speedup over direct reasoning while significantly improving reliability.
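
As an illustration of the orchestration idea, the sketch below routes a small TSP instance to a deterministic exhaustive solver and leaves only the phrasing to a model. It is not the paper's benchmark harness: `solve_tsp_exact` and `answer_route_question` are hypothetical names, and brute force stands in for whatever solver the paper used.

```python
from itertools import permutations
from math import dist

Point = tuple[float, float]


def solve_tsp_exact(points: list[Point]) -> tuple[list[int], float]:
    """Brute-force the shortest closed tour from index 0 (small, non-empty inputs only)."""
    best_order: list[int] = []
    best_length = float("inf")
    for rest in permutations(range(1, len(points))):
        order = [0, *rest]
        # Sum each leg, wrapping around to close the tour.
        length = sum(
            dist(points[order[i]], points[order[(i + 1) % len(order)]])
            for i in range(len(order))
        )
        if length < best_length:
            best_order, best_length = order, length
    return best_order, best_length


def answer_route_question(points: list[Point]) -> str:
    """Orchestrated path: the solver finds the tour; a model would only phrase it."""
    order, length = solve_tsp_exact(points)
    # A model call could go here to explain the result; it is not asked to search.
    return f"Visit stops in order {order}; tour length {length:.2f}."


unit_square = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]
summary = answer_route_question(unit_square)
```

The split keeps the search deterministic and repeatable, and it limits any model involvement to explanation.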

Paper 1.26

Multi-Agent AI Workflows in Hardware-in-the-Loop Simulation

Heterogeneous HIL stacks decompose into agent roles along language and permission boundaries. The handoff artifact is the critical interface for multi-agent continuity. Field data from GRAFANA-OBS-001/PROX-HIL-001 on the Z13 laptop: Rust simulation, Python harness, Claude Code orchestration, Grafana Tempo observability.

Paper 1.30

Local Models Cost Frontier Tokens: The Hidden Supervisor-Side Bill in Local-Inference Workflows

Local-inference workflows do not eliminate the frontier-token bill; they shift it from inference billing to the supervising operator's session budget at audit, repair, and convergence time. Retrospective evidence across 10 historical Bulkhead Tau workflows shows the cost concentrates in REPAIRED cases, and while inference-side optimizations narrow the wall-clock penalty, they do not touch audit/repair cost.

Paper 1.31

Models Don't Get Better, Catalogs Do: Cross-Audit Failure Cataloging as Operator-Side Reliability Infrastructure

Reliability gains from waiting for the next model release are slow, diffuse, and outside operator control; gains from operator-side cross-audit failure catalogs are fast, specific, and controllable. The append-only, severity-rated catalog survives agent identity changes across model and CLI vendor releases. The claim is relative and bounded: it binds for supervised multi-agent workflows with patterned recurrence, not single-agent or unsupervised stacks.
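
A minimal sketch of an append-only, severity-rated catalog, assuming a JSON Lines file as storage. The field names, the severity scale, and the function names are hypothetical; the paper's actual schema is not given in this summary.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path

SEVERITIES = ("low", "medium", "high")  # hypothetical scale


@dataclass(frozen=True)
class FailureEntry:
    agent: str      # model or CLI identity at the time of the failure
    severity: str   # one of SEVERITIES
    pattern: str    # short label for the recurring failure pattern
    note: str       # what the cross-audit found


def append_entry(path: Path, entry: FailureEntry) -> None:
    """Append one entry; existing lines are never rewritten."""
    if entry.severity not in SEVERITIES:
        raise ValueError(f"unknown severity: {entry.severity}")
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(entry)) + "\n")


def recurring_patterns(path: Path, min_count: int = 2) -> dict[str, int]:
    """Count patterns across all agents; keep those seen at least min_count times."""
    counts: dict[str, int] = {}
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            pattern = json.loads(line)["pattern"]
            counts[pattern] = counts.get(pattern, 0) + 1
    return {p: n for p, n in counts.items() if n >= min_count}


def demo() -> dict[str, int]:
    """Log three illustrative entries from two agents and report recurrences."""
    with tempfile.TemporaryDirectory() as tmp:
        catalog = Path(tmp) / "catalog.jsonl"
        append_entry(catalog, FailureEntry("agent-a", "high", "fabricated-citation", "cited a missing file"))
        append_entry(catalog, FailureEntry("agent-b", "medium", "fabricated-citation", "cited a missing test"))
        append_entry(catalog, FailureEntry("agent-b", "low", "stale-summary", "summary lagged the diff"))
        return recurring_patterns(catalog)
```

Because recurrence is keyed on the pattern rather than the agent, the count carries over when the model or CLI behind an agent changes, which is the property the summary attributes to the catalog.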

Paper 1.34

The Narration Surface — Where Agentic LLM Fabrication Lives

Fabrication clusters in narration surfaces (summaries, framing, citations, and sign-off text) and is rare on clean execution surfaces in this Bulkhead Tau catalog. The robust cross-rater finding is that 17 of 18 formal failure entries are narration-tainted.

Paper 1.35

Trust the Validator, Not the Model

The DBB-002 matrix shows why robust agentic systems must treat the model as an untrusted generation substrate and offload safety, pathing, and semantic enforcement to deterministic validation loops.
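
A minimal sketch of a deterministic validation loop around an untrusted generator. The allow-list, the path boundary, the JSON proposal shape, and `stub_model` are hypothetical stand-ins; none of them is taken from the DBB-002 matrix.

```python
import json
from typing import Callable, Optional

ALLOWED_ACTIONS = {"read", "list"}   # hypothetical allow-list
ALLOWED_ROOT = "workspace/"          # hypothetical path boundary


def validate(raw: str) -> Optional[dict]:
    """Deterministic gate: return the parsed proposal, or None if rejected."""
    try:
        proposal = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(proposal, dict):
        return None
    if proposal.get("action") not in ALLOWED_ACTIONS:
        return None
    path = proposal.get("path")
    if not isinstance(path, str) or not path.startswith(ALLOWED_ROOT) or ".." in path:
        return None
    return proposal


def generate_until_valid(generate: Callable[[int], str], max_attempts: int = 3) -> Optional[dict]:
    """Call the untrusted generator until the validator accepts, or give up."""
    for attempt in range(max_attempts):
        accepted = validate(generate(attempt))
        if accepted is not None:
            return accepted
    return None


def stub_model(attempt: int) -> str:
    """Stand-in for a model call: the first output is unsafe, the second is valid."""
    outputs = [
        '{"action": "delete", "path": "workspace/a.txt"}',
        '{"action": "read", "path": "workspace/a.txt"}',
    ]
    return outputs[min(attempt, len(outputs) - 1)]
```

Nothing the generator emits is acted on unless `validate` returns it, so safety and path enforcement live in code that behaves the same on every run.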

Paper 1.36

Design Rule: Image-to-Image Sketch Poisoning

Abstract text labels inside image-to-image sketches act as visual noise. Semantics belong in the prompt; reference sketches should carry geometry, not literal labels.

Paper 1.37

Local LLM Prompt Style Divergence in Text-to-Image Pipelines

Across three image-prompt briefs, gemma4:12b favored conversational prose while gemma4:26b produced denser tag-heavy prompts better suited to automated text-to-image pipelines.

XI  ·  Agent Path Evaluation

Comparing implementation paths under evidence discipline rather than treating model identity as a leaderboard.

XII  ·  Sensor-to-Simulation Engineering

Characterizing wearable sport sensors and building cycle-accurate hardware simulation for firmware validation.

Paper 1.27

A Field Guide to Wearable Sport Sensors: Data Landscape, Fidelity Boundaries, and Engineering Constraints

Characterizes the five-sensor corpus deployed in Bulkhead Tau through a fidelity-boundary lens: the point at which each sensor's output stops being measurement and starts being vendor interpretation. Concludes that no sensor in the corpus exposes a sample-accurate waveform, so sensor-driven HIL on this corpus is necessarily event-driven, not waveform-driven.
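
To illustrate the event-driven framing, here is a minimal sketch that replays discrete, vendor-interpreted events in timestamp order instead of resampling them into a waveform. The `SensorEvent` fields and the `replay` function are hypothetical and do not reflect any vendor's export format.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class SensorEvent:
    t_ms: int       # vendor-reported event time, in milliseconds
    kind: str       # event label, e.g. "shot" (hypothetical)
    value: float    # vendor-interpreted quantity, not a raw sample


def replay(events: Iterable[SensorEvent], on_event: Callable[[SensorEvent], None]) -> int:
    """Deliver events in timestamp order and return how many were delivered."""
    delivered = 0
    for event in sorted(events, key=lambda e: e.t_ms):
        on_event(event)
        delivered += 1
    return delivered
```

The handler sees only what the sensor reported. Nothing is interpolated between events, which keeps the replay inside the fidelity boundary described above.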

Paper 1.28

LabWired: Cycle-Accurate Hardware Simulation for Embedded Sensor Systems

Explains the LabWired hardware simulation platform: architecture, expanded component library, Path A declarative register-bank modeling vs Path B behavioral/shared-memory device models, and the corrected STM32F401 fidelity boundary.

Paper 1.29

Closing the Loop: From Real Sensor Data to Cycle-Accurate Firmware Validation

Uses documented physical proximity data to drive the ProximityAgent HIL firmware path through LabWired and the shm_i2c bridge, closing the real-data gate for this bounded physical-replay case.

Paper 1.32

Deep Dive Into a Tennis Match Data Pool: How Much Data One Competitive Bout Yields — A 2023 USTA Round-of-16 Loss Under Dual-Sensor Wearable Instrumentation

A single 2023 USTA Round-of-16 loss instrumented with two wearable sensors simultaneously. Zepp2 captured 352 shots with per-shot impact location, stroke type, ball speed, and spin; Babolat POP captured 284 in the same window, a 19% shot-count disagreement that empirically supports the cross-sensor-divergence claim from Paper 1.27. Single-match data supports pattern description but not causal attribution of the loss; the n=1 limitation is explicit.
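
The 19% figure can be checked with a short calculation. The sketch assumes the gap is measured against the larger (Zepp2) count, which this summary does not state. Measured against the smaller count, the same gap would come to about 24%.

```python
def relative_disagreement(count_a: int, count_b: int) -> float:
    """Gap between two counts as a fraction of the larger count."""
    larger, smaller = max(count_a, count_b), min(count_a, count_b)
    return (larger - smaller) / larger


zepp2_shots = 352        # from the summary above
babolat_pop_shots = 284  # from the summary above

gap = relative_disagreement(zepp2_shots, babolat_pop_shots)  # 68 / 352
print(f"{gap:.1%}")
```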

Paper 1.33

The Proof Boundary: Defining the Edge of Verification in Hardware-in-the-Loop Simulation

Defines the "Proof Boundary" separating testing of a simulation harness from verification of a target component's behavior. Analyzes four boundary-crossing failure modes and four operational tests — Provenance, Path, Triviality, and Output Dependence — to determine boundary status, and examines susceptibility to layered offload in multi-agent supervisor-supervised workflows.


Full Inventory

| # | Title | Track | Site |
|---|---|---|---|
| Primary Papers: 1.1 through 1.7 | | | |
| 1.1 | Bulkhead Tau — Open-Core Standards | Framework | bulkhead-tau/ |
| 1.2 | Offline Grounded Domain Agent | Grounding | offline-agent/ |
| 1.3 | Ski Chalet Harness Boundary | Grounding | ski-chalet/ |
| 1.4 | Fab Simulation & RVH | Grounding | fab-rvh/ |
| 1.5 | LocalLLMTSP — Solver-Backed Orchestration | Orchestration | local-llm-tsp/ |
| 1.6 | Where Orchestration Beats Raw Model Power | Orchestration | orchestration/ |
| 1.7 | Agentic Coding Failure Patterns | Operations | agentic-coding/ |
| RVH: 1.8 and 1.9 | | | |
| 1.8 | Rough Volatility — Cross-Domain Benchmark Principle | RVH / ML Eval | rough-volatility/ |
| 1.9 | Rough Volatility — ML Evaluation Domain | RVH / ML Eval | |
| Boundary & Details: 1.10 through 1.15 | | | |
| 1.10 | Grounded Agent Failure Is Structurally Determined | Boundary | failure-details/ |
| 1.11 | Local Model Role Suitability | Local Model | local-model-role-suitability/ |
| 1.12 | ShowcaseAgent Routing And Compression | Local Model | details/ |
| 1.13 | TourAgent Local Model Screen | Local Model | |
| 1.14 | True Ski Chalet Boundary Result | Boundary | |
| 1.15 | When The Organized Stack Loses | Boundary | |
| Measurement Integrity: 1.16 and 1.19 | | | |
| 1.16 | The Model Did Not Fail the Protocol. The Terminal Did. | Measurement | capture-integrity/ |
| 1.19 | Literal Substrate Inspection — When Stronger Models Override the Evidence | Measurement | #paper-1-19 |
| Operator Layer: 1.17 | | | |
| 1.17 | The Operator Shell Pattern | Operator Layer | operator-shell/ |
| Applied / Production Evidence: 1.18 | | | |
| 1.18 | PPR Agent — A Deterministic Substrate for Auditable Medical-Device Intelligence | Applied | ppr-agent/ |
| Local LLM Operator Judgment: 1.20 through 1.26, 1.30, 1.31, and 1.34 through 1.37 | | | |
| 1.20 | Smarter, Faster, and Bounded by Handoff Discipline | Local LLM | gemma-handoff-discipline |
| 1.21 | Privacy Is Worth Paying For | Local LLM | privacy-is-worth-paying-for |
| 1.22 | Slow Is Not Smart | Local LLM | slow-is-not-smart |
| 1.23 | Please Is Sand Off A Beach | Local LLM | please-is-sand-off-a-beach |
| 1.24 | The Model Is Not The Function | Local LLM | the-model-is-not-the-function |
| 1.25 | Orchestration Is Cheaper Than Reasoning | Local LLM | orchestration-is-cheaper-than-reasoning |
| 1.26 | Multi-Agent AI Workflows in Hardware-in-the-Loop Simulation | Local LLM | multi-agent-hil-workflows |
| 1.30 | Local Models Cost Frontier Tokens: The Hidden Supervisor-Side Bill in Local-Inference Workflows | Local LLM | local-models-cost-frontier-tokens |
| 1.31 | Models Don't Get Better, Catalogs Do: Cross-Audit Failure Cataloging as Operator-Side Reliability Infrastructure | Local LLM | models-dont-get-better-catalogs-do |
| 1.34 | The Narration Surface — Where Agentic LLM Fabrication Lives | Local LLM | the-narration-surface |
| 1.35 | Trust the Validator, Not the Model: Deterministic Quality Gates in Bounded Domain Building under Bulkhead Tau | Local LLM | trust-the-validator |
| 1.36 | Design Rule: Image-to-Image Sketch Poisoning | Local LLM | design-rule-sketch-poisoning |
| 1.37 | Local LLM Prompt Style Divergence in Text-to-Image Pipelines | Local LLM | local-llm-image-prompt-style |
| Agent Path Evaluation: 1.38 | | | |
| 1.38 | Codex vs Claude Code Is a False Choice | Agent Path | codex-vs-claude-false-choice |
| Sensor-to-Simulation Engineering: 1.27, 1.28, 1.29, 1.32, and 1.33 | | | |
| 1.27 | A Field Guide to Wearable Sport Sensors: Data Landscape, Fidelity Boundaries, and Engineering Constraints | Simulation | wearable-sensor-corpus |
| 1.28 | LabWired: Cycle-Accurate Hardware Simulation for Embedded Sensor Systems | Simulation | labwired-simulation-platform |
| 1.29 | Closing the Loop: From Real Sensor Data to Cycle-Accurate Firmware Validation | Simulation | sensor-driven-hil |
| 1.32 | Deep Dive Into a Tennis Match Data Pool: How Much Data One Competitive Bout Yields — A 2023 USTA Round-of-16 Loss Under Dual-Sensor Wearable Instrumentation | Simulation | deep-dive-tennis-match-pool |
| 1.33 | The Proof Boundary: Defining the Edge of Verification in Hardware-in-the-Loop Simulation | Simulation | the-proof-boundary |

All sites live at bulkheadtau.com. Papers 1.8–1.9 share the rough-volatility site; 1.12–1.13 share the details site; 1.10/1.14/1.15 share the failure-details site; 1.16 and 1.19 share the capture-integrity site. Paper 1.17 has a dedicated site at operator-shell/. Paper 1.18 has a dedicated site at ppr-agent/. Papers 1.20–1.28 and 1.30–1.38 live under bulkhead-tau/generated-papers/ and are also exposed through phoenix-groups.html.