Four independent auditors ran the same prompt. Here is where they agree, where they split, and what I verified against the live system.
Post-hoc pattern matching over free prose cannot tell a fabrication from a quote, a room count from a dollar figure, or a claim from an example. Every patch makes it more brittle. Delete the chat-blocking hook and the hand-edited fact file. Move the control to the only place it works: the outbound document, gated against actual source evidence.
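A minimal sketch of why the prose layer fails, assuming a hypothetical detector that flags anything number-shaped. The `NUMBER` pattern and the sample sentences are illustrative assumptions, not the actual hook:

```python
import re

# Hypothetical detector: flag any number-shaped token in free prose.
NUMBER = re.compile(r"\$?\d[\d,]*(?:\.\d+)?%?")

def flagged_numbers(text: str) -> list[str]:
    """Return every number-shaped token. The pattern has no notion of
    who said it, what it measures, or whether it is a claim at all."""
    return NUMBER.findall(text)

# Illustrative sentences (invented for this sketch).
samples = {
    "fabrication": "NOI is $1,200,000.",
    "quote": 'The broker wrote "NOI is $1,200,000" in the OM.',
    "room_count": "The hotel has 120 rooms.",
    "example": "For example, a cap rate of 8% would imply a lower price.",
}

for label, text in samples.items():
    print(label, flagged_numbers(text))
```

The fabrication and the quote produce the same match, and the room count and the hypothetical cap rate are flagged the same way as a dollar figure. Any rule written on top of these matches has to guess at context the pattern never saw.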
| Question | Gemini | Grok | GPT-5.4 | Woz |
|---|---|---|---|---|
| Is post-hoc regex on prose the wrong layer? | Yes | Yes | Yes | Yes |
| Delete the chat-blocking Stop hook? | Yes | Yes | Yes | Yes |
| Delete the hand-edited fact-registry.json as a safety wall? | Yes | Yes | Yes | Yes |
| Kill the bespoke self-improvement loop (brain / lie-log / guardrails)? | Yes | Yes | Yes | Yes |
| Is the intel feed a hoard that needs a decide-by rule? | Yes | Yes | Yes | Yes |
| Keep behavioral honesty prompt as the chat-lane default? | Yes | Yes | Yes | Yes |

| Question | Gemini | Grok | GPT-5.4 | Woz |
|---|---|---|---|---|
| How do you catch a fabricated number in an outbound doc? | Three tiers: honesty prompt, Graphify grounding, then an LLM-judge final gate comparing the outbound PDF against the source OM | One LLM judge on outbound only: "is every number sourced?" | Per-document evidence ledger: each number binds to value + source + page/cell. Deterministic field-to-evidence check, LLM judge secondary | Agree with GPT. Deterministic bind beats a judge; the judge is a backstop, not the gate |
| Migrate OpenClaw to Hermes? | MIGRATE. A "highly favorable tradeoff", but concedes it only helps if you abandon the Claude Code loop entirely | Stay. The problem is the brain, not the legs | Not now. Freeze 30 days, then pilot on one low-risk job | Stay. Fix the workflow first; migration is not what is on fire |
| Graphify knowledge graph as the fact store? | KEEP. "the cornerstone of the new architecture" | Evaluate as registry replacement | NOT the fix. A code/doc graph is not a source of truth for live deal numbers in PDFs and spreadsheets | Side with GPT. Deal numbers live in OMs and STR files, not a tree-sitter graph |
| LLM judge as the primary gate? | Yes, as the final outbound gate | Yes, primary on outbound | Only for narrow "claim vs quote" calls, not the main gate | No. Deterministic first, judge second |
| Single biggest blind spot? | Identity crisis: a hotel firm running an AI research lab | The source documents are not in context when numbers are requested | A noisy detector is being promoted into policy: self-poisoning | The agent polices itself with no independent referee |
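The per-document evidence ledger from the first row can be sketched as follows. This is one possible shape under stated assumptions, not GPT-5.4's specification: the `Evidence` type, the `check_document` function, the field names, the file name, and all figures are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Evidence:
    value: float   # the number as it appears in the source
    source: str    # hypothetical source file, e.g. an OM or STR export
    locator: str   # page or cell reference inside that source

def check_document(fields: dict[str, float],
                   ledger: dict[str, Evidence],
                   tolerance: float = 0.0) -> list[str]:
    """Deterministic field-to-evidence check.
    Returns a list of problems; an empty list means every outbound
    number is bound to evidence with a matching value."""
    problems = []
    for name, value in fields.items():
        ev = ledger.get(name)
        if ev is None:
            problems.append(f"{name}: no evidence bound")
        elif abs(ev.value - value) > tolerance:
            problems.append(
                f"{name}: document says {value}, "
                f"{ev.source} {ev.locator} says {ev.value}"
            )
    return problems

# Invented figures for illustration only.
ledger = {
    "rooms": Evidence(120, "om.pdf", "p. 3"),
    "noi": Evidence(1_150_000, "om.pdf", "p. 12"),
}
outbound = {"rooms": 120, "noi": 1_200_000, "adr": 142.5}

for problem in check_document(outbound, ledger):
    print(problem)
```

Here `noi` fails because the outbound value disagrees with the source, and `adr` fails because nothing is bound to it at all. Neither result depends on a model's judgment, which is the argument for running this check first and keeping the LLM judge as a backstop.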
"The firm must accept that it is a consumer of AI technology, not a developer of AI orchestration frameworks."
Gemini was strongest on the identity crisis and on Graphify as the fact store, and cast the lone vote to migrate to Hermes.
"The agent is routinely asked to produce financial numbers without the source documents in context. Fabrication is the predictable result."
Grok was the only auditor to name the upstream cause: no grounding data is present at generation time. Detection treats the symptom.
"You are letting a noisy detector write policy. False positives are promoted into guardrails and streaks. The machine teaches itself from garbage."
GPT-5.4 also caught that source-shaped syntax (backticks, "per Ace") satisfies the gate without real evidence, so the hook trains bypass behavior.
"Every control here is maintained by the same agent it is supposed to constrain. The fact file, the parser, the streak counter: no independent referee. That is why they drift and freeze silently."
The verified failures (default-true blocked column, unparsed hard_rules, self-contradicting registry) are all symptoms of self-refereeing.
Four auditors, one converged design. Grok found the cause, GPT found the cleanest cure, Gemini supplied the discipline, and the live checks confirm the failure modes. This is the least-bloat version of all four.
From the intel feed, the external independent reviewer survives every review and becomes the outbound check. Two items split the panel: Gemini 3.1 keeps Graphify as the new fact store while GPT and I say deal numbers do not live in a code graph; and GPT and I keep least-privilege tools as baseline safety while Gemini calls them bloat for this problem. Grill-me is bloat by unanimous vote. The hoard itself should become a decision register with a decide-by date.
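A decision register with a decide-by date could look like the sketch below. The `Item` type, the `overdue` helper, and every date are hypothetical. The decisions shown mirror the votes described above, and the two split items are left open.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Item:
    title: str
    decide_by: date                  # hypothetical deadline for a decision
    decision: Optional[str] = None   # "adopt", "reject", or None while open

def overdue(items: list[Item], today: date) -> list[Item]:
    """Items still undecided after their decide-by date."""
    return [i for i in items if i.decision is None and i.decide_by < today]

# Dates are invented for illustration.
register = [
    Item("External independent reviewer", date(2025, 1, 15), "adopt"),
    Item("Grill-me", date(2025, 1, 15), "reject"),
    Item("Least-privilege tools", date(2025, 2, 1)),
    Item("Graphify as fact store", date(2025, 4, 1)),
]

for item in overdue(register, today=date(2025, 3, 1)):
    print(item.title, "was due", item.decide_by.isoformat())
```

The point of the deadline is that an item cannot sit in the feed indefinitely: once the date passes it shows up as overdue until someone records a decision.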