Evidence-backed agent systems, deterministic substrates, and local LLM operator judgment.
Internally, this work is developed under Bulkhead Tau.
Operating front: these papers are the evidence base behind Bulkhead τ — the public release line for a deterministic governance substrate. Bulkhead Tau is the engineering core under that name; both surfaces will run in parallel for now, with /bulkhead-tau/ as the external front.
The finding: For grounded domain tasks — well-defined task classes with deterministic substrates — harness configuration is the binding constraint. Model identity is not. These papers prove this claim under stress test conditions: local models, which cannot compensate for a weak harness, converge with frontier models at the semantic usefulness level when the harness is sufficient. This scope is deliberate. Outside it, model capability matters in ways the framework does not cover.
Local Model Addendum: the Bulkhead Tau Local Model Details page covers the three supporting papers that feed the orchestration synthesis: TourAgent (1.13), ShowcaseAgent (1.12), and Local Model Role Suitability (1.11).
Boundary Results: the Bulkhead Tau Boundary Results page covers three papers that map where the organized stack hits its limits: Grounded Agent Failure Is Structurally Determined (1.10), True Ski Chalet Boundary Result (1.14), and When The Organized Stack Loses (1.15).
RVH / ML Evaluation: Rough Volatility as ML Benchmark covers Papers 1.8 and 1.9 — why domain expertise, not ML capability, is the binding constraint in rough volatility forecasting and the cross-domain benchmark principle it reveals.
Measurement Integrity, Operator Layer & Applied Evidence: Papers 1.16–1.19 extend the framework outward. Paper 1.16 shows that evaluation infrastructure can fail at the capture boundary — a VT100 terminal artifact was corrupting protocol scores for thinking-mode models. Paper 1.17 documents the operator shell pattern: how OpenClaw wraps Bulkhead Tau as an access layer without becoming the authority. Paper 1.18 is the framework's first numbered production case — PPR Agent, 92M regulated cardiac device implants across 18 years, behind a deterministic SQLite substrate. Paper 1.19 is a short companion to 1.16 on the other side of the apparatus: when stronger models override literal substrate inspection, capability itself becomes a source of non-neutrality. Papers 1.20–1.26, 1.30, 1.31, and 1.34–1.37 add the Local LLM Operator Judgment cluster: strict handoff discipline, privacy cost accounting, local model sizing, token-cost attention, substitution discipline, multi-agent HIL workflow boundaries, cross-audit failure cataloging, narration-surface failure analysis, deterministic validator discipline, and image-generation prompt discipline.
Sensor-to-Simulation Engineering: Paper 1.27 establishes the data landscape and fidelity boundaries for wearable sport sensors, showing why sensor-driven HIL is necessarily event-driven rather than waveform-driven. Paper 1.28 publishes the LabWired platform boundary for register-level firmware simulation. Paper 1.29 closes the sensor-driven HIL loop with a documented physical proximity replay. Paper 1.32 applies the sensor-corpus framing to a single dual-sensor-instrumented match at depth, and Paper 1.33 defines the proof boundary separating harness testing from component verification.
Each paper stands alone. Use the cluster that matches your interest:
Start with Paper 1.1 for the framework framing, then try the TourAgent live demo — ten tennis questions with repeatable answers — to see the deterministic approach in action.
Papers 1.2, 1.3, and 1.5 form a cluster: offline grounded agent → ski chalet hardware boundary → TSP solver-backed orchestration. The common argument: harness level, not model size, drives usefulness.
Papers 1.5, 1.6, 1.11, 1.12, and 1.13 address where correctness should live and how grounding, routing, and repair beat raw power in identifiable regimes.
Papers 1.7, 1.10, 1.14, and 1.15 cover the failure taxonomy, empirical failure prediction, the true local ceiling, and the five conditions under which the organized stack's advantage collapses.
Papers 1.8 and 1.9 establish why realized volatility forecasting is high-signal benchmark territory — and what the same structural argument implies across semiconductor defectivity and other rough-process domains.
Papers 1.16, 1.17, and 1.19 address the infrastructure surrounding the Bulkhead Tau system. 1.16: capture pipeline failures produce false evaluation verdicts. 1.17: an operator shell can expose the deterministic stack without replacing it as the authority. 1.19: when stronger models override literal substrate inspection, the model itself becomes part of the non-neutrality.
Paper 1.18 is the first numbered production case: PPR Agent running against 18 years of government-mandated cardiac device data. This is field validation, not lane validation: the framework operating against regulated disclosures from three manufacturers.
Papers 1.20–1.26, 1.30, 1.31, and 1.34–1.37 turn the Bulkhead Tau evidence base into operating guidance for local LLM decisions: handoff discipline, privacy tradeoffs, model sizing, token-cost attention, when a model call should not be a model call, multi-agent HIL workflows, cross-audit cataloging, narration-surface failure analysis, validator discipline, and image-generation prompt discipline.
Paper 1.38 compares implementation paths rather than model identities. In the Amkor/xAmkor case study, the useful result is compositional: xAmkor owns the verifier surface, while Amkor owns the broader cockpit/application surface.
Papers 1.27, 1.28, 1.29, 1.32, and 1.33 characterize wearable sport sensors through a fidelity-boundary lens (1.27), define the LabWired simulation platform boundary (1.28), close the physical-replay HIL loop (1.29), apply the corpus framing to a single instrumented match at depth (1.32), and define the proof boundary separating harness testing from component verification (1.33).
What makes a local or offline system actually useful — and what the evidence honestly supports.
Paper 1.2: The real unit of local usefulness is the harnessed domain system, not the raw model. A local model becomes operationally useful when paired with a deterministic substrate, grounding layer, explicit provenance, and a controlled escalation path. Raw local model, grounded local harness, and full local implementation-agent are three distinct things, not interchangeable.
Paper 1.3: A prepared local 3090 system (Ollama, portable domain harness, and data bundle) can support grounded offline domain answering. The claim is narrow and honest: it is the harness that enables usefulness, not the raw model alone. The variable that matters most is harness level, not model size.
Paper 1.4: Semiconductor fab defectivity should be modeled as a dynamic rough process (RVH, the Rough Volatility Hypothesis), not a static mean. Moving from a stable to an unstable fab produces a 7.1% loss in shippable output, a result that emerges from the path, not the average. Product complexity and process instability are separable causes of yield loss.
Where correctness should live in an AI system, and what happens when it lives in the wrong place.
Paper 1.5: In a route-optimization workflow, correctness should live in the solver, not the model. Stronger models delay failure but do not eliminate the need for solver-backed architecture. Local models range from exact to structurally invalid at small scales and collapse at the world rung; the orchestrated path remains stable across the full ladder.
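The idea that correctness lives in the solver can be illustrated with a minimal sketch. It is not the paper's implementation: the city coordinates, the brute-force solver, and the function names below are all hypothetical, and the exact search is only practical for a handful of cities. A model-proposed tour is treated as a suggestion that is kept only if it is structurally valid and matches the solver's optimum.

```python
from itertools import permutations
from math import dist

# Hypothetical coordinates (a 4 x 3 rectangle); not from the paper's benchmark ladder.
CITIES = {"A": (0, 0), "B": (0, 3), "C": (4, 3), "D": (4, 0)}


def tour_length(tour, cities=CITIES):
    """Length of the closed tour, including the return to the starting city."""
    return sum(dist(cities[a], cities[b]) for a, b in zip(tour, tour[1:] + tour[:1]))


def is_valid_tour(tour, cities=CITIES):
    """A tour is structurally valid if it visits every city exactly once."""
    return len(tour) == len(cities) and set(tour) == set(cities)


def solve_exact(cities=CITIES):
    """Brute-force exact solver: fix the first city, try every ordering of the rest."""
    first, *rest = sorted(cities)
    candidates = ([first, *p] for p in permutations(rest))
    return min(candidates, key=lambda t: tour_length(t, cities))


def accept_or_replace(proposed, cities=CITIES):
    """Keep a proposed tour only if it is valid and as short as the solver's tour."""
    best = solve_exact(cities)
    if is_valid_tour(proposed, cities) and (
        abs(tour_length(proposed, cities) - tour_length(best, cities)) < 1e-9
    ):
        return proposed
    return best
```

With these hypothetical cities, a structurally invalid proposal such as `["A", "A", "B", "C"]` or a suboptimal one such as `["A", "C", "B", "D"]` is replaced by the solver's tour, so the answer returned to the user never depends on the proposal being right.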
Paper 1.6: Once hardware is good enough, the organized operating stack settles the outcome before raw model size alone does. TourAgent, ShowcaseAgent, and Local Model Role Suitability together support a boundary claim: grounding, routing, and repair beat raw power in identifiable regimes.
The standards, supervision structures, and failure taxonomy that make agentic work trustworthy.
Paper 1.1: Bulkhead Tau is best understood as an open-core framework for grounded domain systems, not a single agent or benchmark story. Useful agentic systems require domain grounding, explicit validation, clear trust boundaries, and operating discipline. Standards, not prompt optimism.
Paper 1.7: Agentic coding successes vary widely; failures recur in recognizable families. Drift, summit fever, bad context selection, false success, doom loops, and premature closure are documented across Bulkhead Tau operations. The practical response is standards, supervision, and lessons learned, not blind faith in scaling alone.
Three empirical papers feeding the orchestration synthesis: grounded reliability, routing, and role suitability at portfolio scale.
Paper 1.13: Grounding removes wrong-or-missing answers before it creates artifact-level precision. The local model screen result holds across model families once a deterministic substrate is in the path.
Paper 1.12: Routing and compression are the first reliable local-LLM win at portfolio scale. Miss families are design signals, not capability failures: they identify where the harness, not the model, needs attention.
Paper 1.11: Grounded response quality is largely model-family-independent once a deterministic substrate is in the path. The binding variable is harness configuration, not model identity.
Where the organized stack's advantage collapses, and why failure family is predictable from configuration, not query content.
Paper 1.10: Failure family is predictable from harness configuration features, not query content, confirming that domain expertise is the binding constraint. Empirically confirmed on 780 labeled rows from two Bulkhead Tau domains.
Paper 1.14: Capability is not the local-only ceiling; operational speed on derived queries is. The true boundary separates what the harness can answer from what it cannot, not strong model from weak model.
Paper 1.15: Maps the five failure modes under which the organized stack's advantage collapses or inverts: latency ceiling (coordination overhead consumes the time budget), coverage gap (harness design failures invisible to stronger models), optimization maturity gap (PyTorch beats fused Numba CUDA 5.5×), runtime mismatch (ROCm wheel lacks gfx1151 target), and policy/role mismatch (larger model loses to better-fit smaller model in the specific regime).
Realized volatility forecasting as high-signal ML benchmark territory, and the cross-domain principle it reveals.
Paper 1.8: Both financial volatility and semiconductor defectivity satisfy the same four conditions for high-signal ML benchmark territory. The cross-domain parallel is structural, not analogical: the same rough-path argument applies to both.
Paper 1.9: Realized volatility forecasting is a high-signal benchmark because naive pipeline failures are structural, not tunable. Empirically confirmed: a standard LSTM fails on realized volatility in a way that reveals domain ignorance, not hyperparameter sensitivity.
When the evaluation infrastructure itself fails, or when the model's own disposition toward the substrate becomes part of the apparatus.
Paper 1.16: Subprocess capture of ollama run output includes VT100 cursor-rewrite sequences that corrupt multi-line JSON for thinking-mode models, producing systematic false negatives. Under clean REST API capture, gemma4:31b passes all six protocol probes, the strongest result on this lane. The selective recovery pattern (only thinking-mode models affected) proves the failure was at the capture boundary, not the model boundary.
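The capture-boundary failure can be reproduced in miniature. The sketch below is illustrative only: the two capture strings are hypothetical, and the regular expression is a generic matcher for ANSI/VT100 CSI sequences rather than anything taken from the paper. It shows how the same JSON payload parses cleanly or fails depending on whether terminal control bytes leaked into the captured text.

```python
import json
import re

# Generic matcher for ANSI/VT100 CSI sequences, e.g. ESC[2K (erase line)
# or ESC[1G (move cursor to column 1). An assumption, not the paper's tooling.
CSI_PATTERN = re.compile(r"\x1b\[[0-9;?]*[ -/]*[@-~]")


def has_terminal_control(text: str) -> bool:
    """True if the captured text contains CSI escape sequences."""
    return CSI_PATTERN.search(text) is not None


def parses_as_json(text: str) -> bool:
    """True if the captured text is valid JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Hypothetical captures of the same payload: one clean, one with
# cursor-control bytes interleaved between the lines.
clean_capture = '{"status": "ok",\n "probe": 1}'
tty_capture = '{"status": "ok",\x1b[1G\x1b[2K\n "probe": 1}'
```

Here `parses_as_json(clean_capture)` succeeds while `parses_as_json(tty_capture)` fails, even though the model output is identical. A check such as `has_terminal_control` can flag a polluted capture before it is scored; stripping the sequences afterwards is not guaranteed to recover the original text, because cursor rewrites can overwrite content, which is consistent with the source's move to clean REST API capture.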
Paper 1.19: Stronger models do not remove the need for harnesses; sometimes they increase it. When semantic correction overrides literal substrate inspection, a more capable model can produce a worse answer than a smaller or less opinionated one. A ten-prompt local matrix and a single-prompt strawperry probe show at least three distinct wrong-count mechanisms. The fix is not a smarter model; it is a harness that preserves the exact substrate and routes literal operations to deterministic tools.
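Routing a literal operation to a deterministic tool can be as small as the following sketch. The function names and the `count_char` operation label are hypothetical, and the exact wording of the paper's probe is not reproduced here; the point is only that the tool operates on the exact string it is given, with no spelling correction.

```python
def count_literal(text: str, char: str) -> int:
    """Count occurrences of one character in the exact string given."""
    if len(char) != 1:
        raise ValueError("char must be a single character")
    return text.count(char)


def route(operation: str, text: str, char: str) -> int:
    """Hypothetical router: literal operations go to deterministic tools."""
    if operation == "count_char":
        return count_literal(text, char)
    raise NotImplementedError(f"no deterministic tool registered for {operation!r}")
```

For example, `route("count_char", "strawperry", "p")` returns 1 and `route("count_char", "strawperry", "b")` returns 0, because the tool inspects the literal substrate rather than the word the input resembles.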
Building an operator-facing outer layer over the deterministic stack — and keeping it outside the authority boundary.
Field validation, not lane validation — the framework operating against regulated data in a real domain.
Operational judgment for local LLM lanes: handoff discipline, privacy cost, sizing discipline, token-cost attention, cross-audit failure cataloging, narration-surface risk, validator discipline, and prompt-generation discipline.
Paper 1.20: Handoff-discipline doctrine for strict machine-facing local-model lanes.
Paper 1.21: Separates the privacy argument for local inference from the false claim that the local lane is free. The operational question is when privacy, control, and auditability justify the real cost.
Paper 1.22: Larger, slower local models do not automatically improve validated Bulkhead Tau lanes. Strict-handoff systems are often bounded by prompt, schema, validator, and orchestration design rather than raw model size.
Paper 1.23: Argues that courtesy tokens are negligible compared with structural token waste such as giant context dumps, repeated scaffolding, retries, and missing decomposition.
Paper 1.24: For bounded predicates with deterministic oracles, an LLM must earn its runtime against a written spec rather than against the visual length of the code it replaces. CAP-001 and LIB-001 evidence packets, with a num_predict verification ruling out the obvious counter-explanation.
Paper 1.25: For models with extensive reasoning capacity, the computational cost of finding the answer often exceeds the cost of explaining it. The paper's TSP and Scheduling benchmarks demonstrate that orchestration provides a 2x to 14x speedup over direct reasoning while significantly improving reliability.
Paper 1.26: Heterogeneous HIL stacks decompose into agent roles along language and permission boundaries. The handoff artifact is the critical interface for multi-agent continuity. Field data from GRAFANA-OBS-001/PROX-HIL-001 on the Z13 laptop: Rust simulation, Python harness, Claude Code orchestration, Grafana Tempo observability.
Paper 1.30: Local-inference workflows do not eliminate the frontier-token bill; they shift it from inference billing to the supervising operator's session budget at audit, repair, and convergence time. Retrospective evidence across 10 historical Bulkhead Tau workflows shows the cost concentrates in REPAIRED cases, and while inference-side optimizations narrow the wall-clock penalty, they do not touch audit/repair cost.
Paper 1.31: Reliability gains from waiting for the next model release are slow, diffuse, and outside operator control; gains from operator-side cross-audit failure catalogs are fast, specific, and controllable. The append-only, severity-rated catalog survives agent identity changes across model and CLI vendor releases. The claim is relative and bounded: it binds for supervised multi-agent workflows with patterned recurrence, not single-agent or unsupervised stacks.
Paper 1.34: Fabrication clusters in narration surfaces (summaries, framing, citations, and sign-off text) and is rare on clean execution surfaces in this Bulkhead Tau catalog. The robust cross-rater finding is that 17 of 18 formal failure entries are narration-tainted.
Paper 1.35: The DBB-002 matrix shows why robust agentic systems must treat the model as an untrusted generation substrate and offload safety, pathing, and semantic enforcement to deterministic validation loops.
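A deterministic validation loop of this kind can be sketched as follows. Everything here is an assumption made for illustration: the allowed-suffix policy, the function names, and the stand-in generator are hypothetical and do not come from the DBB-002 matrix. The gate checks only path properties that can be decided without trusting the generator.

```python
from pathlib import PurePosixPath

# Hypothetical policy for illustration; not taken from the paper.
ALLOWED_SUFFIXES = {".json", ".md"}


def path_is_allowed(candidate: str) -> bool:
    """Deterministic gate: relative path, no parent traversal, allowed suffix."""
    path = PurePosixPath(candidate)
    if path.is_absolute() or ".." in path.parts:
        return False
    return path.suffix in ALLOWED_SUFFIXES


def generate_until_valid(generate, max_attempts=3):
    """Call an untrusted generator until its output passes the gate.

    `generate` is any callable taking the attempt number and returning a
    candidate path string; in practice it would wrap a model call.
    Returns None if no attempt passes.
    """
    for attempt in range(1, max_attempts + 1):
        candidate = generate(attempt)
        if path_is_allowed(candidate):
            return candidate
    return None
```

The generator's output is never acted on directly: only a candidate that passes `path_is_allowed` leaves the loop, and exhausting the attempts yields `None` rather than a best guess.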
Paper 1.36: Abstract text labels inside image-to-image sketches act as visual noise. Semantics belong in the prompt; reference sketches should carry geometry, not literal labels.
Paper 1.37: Across three image-prompt briefs, gemma4:12b favored conversational prose while gemma4:26b produced denser tag-heavy prompts better suited to automated text-to-image pipelines.
Comparing implementation paths under evidence discipline rather than treating model identity as a leaderboard.
Characterizing wearable sport sensors and building cycle-accurate hardware simulation for firmware validation.
Paper 1.27: Characterizes the five-sensor corpus deployed in Bulkhead Tau through a fidelity-boundary lens: the point at which each sensor's output stops being measurement and starts being vendor interpretation. Concludes that no sensor in the corpus exposes a sample-accurate waveform, so sensor-driven HIL on this corpus is necessarily event-driven, not waveform-driven.
Paper 1.28: Explains the LabWired hardware simulation platform: architecture, expanded component library, Path A declarative register-bank modeling vs Path B behavioral/shared-memory device models, and the corrected STM32F401 fidelity boundary.
Paper 1.29: Uses documented physical proximity data to drive the ProximityAgent HIL firmware path through LabWired and the shm_i2c bridge, closing the real-data gate for this bounded physical-replay case.
Paper 1.32: A single 2023 USTA Round-of-16 loss instrumented with two wearable sensors simultaneously. Zepp2 captured 352 shots with per-shot impact location, stroke type, ball speed, and spin; Babolat POP captured 284 in the same window, a 19% shot-count disagreement that empirically supports the cross-sensor-divergence claim from Paper 1.27. Single-match data supports pattern description but not causal attribution of the loss; the n=1 limitation is explicit.
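The 19% figure can be checked with a short calculation. The sketch assumes the disagreement is measured relative to the larger of the two counts, which is the reading that reproduces the stated figure; the function name is illustrative.

```python
def relative_disagreement(count_a: int, count_b: int) -> float:
    """Absolute count difference as a fraction of the larger count."""
    return abs(count_a - count_b) / max(count_a, count_b)


zepp2_shots = 352        # shot count stated in the summary
babolat_pop_shots = 284  # shot count stated in the summary

print(f"{relative_disagreement(zepp2_shots, babolat_pop_shots):.1%}")  # prints 19.3%
```

Measured against the smaller count instead, the same 68-shot gap would be about 24%, so the choice of denominator is worth stating whenever the figure is quoted.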
Paper 1.33: Defines the "Proof Boundary" separating testing of a simulation harness from verification of a target component's behavior. Analyzes four boundary-crossing failure modes and four operational tests (Provenance, Path, Triviality, and Output Dependence) to determine boundary status, and examines susceptibility to layered offload in multi-agent supervisor-supervised workflows.

| # | Title | Track | Site |
|---|---|---|---|
| Primary Papers — 1.1 through 1.7 | |||
| 1.1 | Bulkhead Tau — Open-Core Standards | Framework | bulkhead-tau/ |
| 1.2 | Offline Grounded Domain Agent | Grounding | offline-agent/ |
| 1.3 | Ski Chalet Harness Boundary | Grounding | ski-chalet/ |
| 1.4 | Fab Simulation & RVH | Grounding | fab-rvh/ |
| 1.5 | LocalLLMTSP — Solver-Backed Orchestration | Orchestration | local-llm-tsp/ |
| 1.6 | Where Orchestration Beats Raw Model Power | Orchestration | orchestration/ |
| 1.7 | Agentic Coding Failure Patterns | Operations | agentic-coding/ |
| RVH — 1.8 and 1.9 | |||
| 1.8 | Rough Volatility — Cross-Domain Benchmark Principle | RVH / ML Eval | rough-volatility/ |
| 1.9 | Rough Volatility — ML Evaluation Domain | RVH / ML Eval | rough-volatility/ |
| Boundary & Details — 1.10 through 1.15 | |||
| 1.10 | Grounded Agent Failure Is Structurally Determined | Boundary | failure-details/ |
| 1.11 | Local Model Role Suitability | Local Model | local-model-role-suitability/ |
| 1.12 | ShowcaseAgent Routing And Compression | Local Model | details/ |
| 1.13 | TourAgent Local Model Screen | Local Model | details/ |
| 1.14 | True Ski Chalet Boundary Result | Boundary | failure-details/ |
| 1.15 | When The Organized Stack Loses | Boundary | failure-details/ |
| Measurement Integrity — 1.16 and 1.19 | |||
| 1.16 | The Model Did Not Fail the Protocol. The Terminal Did. | Measurement | capture-integrity/ |
| 1.19 | Literal Substrate Inspection — When Stronger Models Override the Evidence | Measurement | #paper-1-19 |
| Operator Layer — 1.17 | |||
| 1.17 | The Operator Shell Pattern | Operator Layer | operator-shell/ |
| Applied / Production Evidence — 1.18 | |||
| 1.18 | PPR Agent — A Deterministic Substrate for Auditable Medical-Device Intelligence | Applied | ppr-agent/ |
| Local LLM Operator Judgment — 1.20 through 1.26, 1.30, 1.31, and 1.34 through 1.37 | |||
| 1.20 | Smarter, Faster, and Bounded by Handoff Discipline | Local LLM | gemma-handoff-discipline |
| 1.21 | Privacy Is Worth Paying For | Local LLM | privacy-is-worth-paying-for |
| 1.22 | Slow Is Not Smart | Local LLM | slow-is-not-smart |
| 1.23 | Please Is Sand Off A Beach | Local LLM | please-is-sand-off-a-beach |
| 1.24 | The Model Is Not The Function | Local LLM | the-model-is-not-the-function |
| 1.25 | Orchestration Is Cheaper Than Reasoning | Local LLM | orchestration-is-cheaper-than-reasoning |
| 1.26 | Multi-Agent AI Workflows in Hardware-in-the-Loop Simulation | Local LLM | multi-agent-hil-workflows |
| 1.30 | Local Models Cost Frontier Tokens: The Hidden Supervisor-Side Bill in Local-Inference Workflows | Local LLM | local-models-cost-frontier-tokens |
| 1.31 | Models Don't Get Better, Catalogs Do: Cross-Audit Failure Cataloging as Operator-Side Reliability Infrastructure | Local LLM | models-dont-get-better-catalogs-do |
| 1.34 | The Narration Surface — Where Agentic LLM Fabrication Lives | Local LLM | the-narration-surface |
| 1.35 | Trust the Validator, Not the Model: Deterministic Quality Gates in Bounded Domain Building under Bulkhead Tau | Local LLM | trust-the-validator |
| 1.36 | Design Rule: Image-to-Image Sketch Poisoning | Local LLM | design-rule-sketch-poisoning |
| 1.37 | Local LLM Prompt Style Divergence in Text-to-Image Pipelines | Local LLM | local-llm-image-prompt-style |
| Agent Path Evaluation — 1.38 | |||
| 1.38 | Codex vs Claude Code Is a False Choice | Agent Path | codex-vs-claude-false-choice |
| Sensor-to-Simulation Engineering — 1.27, 1.28, 1.29, 1.32, and 1.33 | |||
| 1.27 | A Field Guide to Wearable Sport Sensors: Data Landscape, Fidelity Boundaries, and Engineering Constraints | Simulation | wearable-sensor-corpus |
| 1.28 | LabWired: Cycle-Accurate Hardware Simulation for Embedded Sensor Systems | Simulation | labwired-simulation-platform |
| 1.29 | Closing the Loop: From Real Sensor Data to Cycle-Accurate Firmware Validation | Simulation | sensor-driven-hil |
| 1.32 | Deep Dive Into a Tennis Match Data Pool: How Much Data One Competitive Bout Yields — A 2023 USTA Round-of-16 Loss Under Dual-Sensor Wearable Instrumentation | Simulation | deep-dive-tennis-match-pool |
| 1.33 | The Proof Boundary: Defining the Edge of Verification in Hardware-in-the-Loop Simulation | Simulation | the-proof-boundary |
All sites live at bulkheadtau.com. Papers 1.8–1.9 share the rough-volatility site; 1.12–1.13 share the details site; 1.10/1.14/1.15 share the failure-details site; 1.16 and 1.19 share the capture-integrity site. Paper 1.17 has a dedicated site at operator-shell/. Paper 1.18 has a dedicated site at ppr-agent/. Papers 1.20–1.28 and 1.30–1.38 live under bulkhead-tau/generated-papers/ and are also exposed through phoenix-groups.html.