Scheming Defences Aren't Just for Schemers
Both Anthropic and OpenAI have put in place (or are thinking about putting in place) some kind of monitoring for internal coding agents. Neither company has set up their monitoring to be scheming-proof. When I've spoken to people working on this, they (correctly, IMO) argue that current models aren't likely to be scheming.
(In particular, OpenAI appears to be thinking about EDIT: using [poor word choice on my part, they are thinking about better methods] untrusted-only monitoring without much collusion busting, where one instance of a model audits the outputs of another instance of itself.)
But, there are other weird things a model might do that don't qualify as "scheming" which you might still want to defend against! For example, reward hacking/fudging critical safety infrastructure is extremely bad, yet well within the remit of bad behaviour we might expect. Would this cause a model to get caught? I expect so, but who knows!
One vector for monitoring failure might be a kind of persuasion; if a model wants to reward hack, it might justify to itself---using some kind of motivated reasoning---why the hack is actually the correct action to take. Another copy of that model, reading the chain-of-thought, might then agree with the first one! (this is extremely common in motivated reasoning amongst humans as well) This would kinda fall under the broader umbrella of collusion in untrusted monitoring, and would definitely be mitigated by certain signal jamming techniques (though not all).
The above example is illustrative of the more general point that you can't get safety by ruling out a single worst-case scenario (scheming) then only mitigating for one other specific scenario (whatever they are planning for here, I think mostly model mistakes and low-stakes deception and occasional destruction) because there are other scenarios you haven't thought of!