Hello! A friend and I are working on an idea for the AI Impacts Essay Competition. We're both relatively new to AI and pivoting our careers in that direction, so I wanted to float our idea here first before diving too deep. Our main idea is to propose a new method for training rational language models, inspired by human collaborative-rationality methods. We're basically agreeing with Conjecture's and Elicit's foundational ideas and proposing a specific method for building CoEms for philosophical and forecasting applications. The method is centered on a discussion RL training environment, where a model is rewarded based on how well it contributes to a group discussion with other models aimed at solving a reasoning problem. This is supposed to be an instance of training by process rather than by outcome, in Elicit's terminology. I found a few papers that evaluate the performance of discussion or other collaborative ensembles at inference time, but nothing about training in such an environment. I'm hoping that more seasoned people could comment on the originality of this idea and point to any particularly relevant literature or posts.
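To make the training setup concrete, here's a minimal sketch of what one episode might look like, assuming a round-robin turn order and some process-based judge. All names here (`DiscussionEnv`, `contribution_score`, `policy.respond`) are hypothetical illustrations, not an existing API, and the hard open question is what `contribution_score` should actually be:

```python
# Minimal sketch of one episode in the proposed discussion-RL environment.
# Everything here (DiscussionEnv, contribution_score, policy.respond, the
# round-robin turn order) is a hypothetical illustration, not an existing API.
from dataclasses import dataclass, field


def contribution_score(problem: str, transcript: list[str], utterance: str) -> float:
    """Placeholder process-based judge: in practice, a trained reward model or
    rubric scoring how much this utterance advances the group's reasoning,
    independent of whether the final answer turns out to be correct."""
    return 0.0  # stub


@dataclass
class DiscussionEnv:
    problem: str                              # the shared reasoning problem
    transcript: list[str] = field(default_factory=list)

    def step(self, speaker: int, utterance: str) -> float:
        """Append one model's contribution and return its per-turn reward."""
        self.transcript.append(f"[model {speaker}] {utterance}")
        return contribution_score(self.problem, self.transcript, utterance)


def rollout(env: DiscussionEnv, policies: list, n_rounds: int) -> list[list[float]]:
    """Round-robin discussion; per-model reward trajectories feed the RL update."""
    rewards: list[list[float]] = [[] for _ in policies]
    for _ in range(n_rounds):
        for i, policy in enumerate(policies):
            utterance = policy.respond(env.problem, env.transcript)
            rewards[i].append(env.step(i, utterance))
    return rewards
```

The point of the sketch is that reward attaches to each contribution (the process) rather than to whether the group's final answer is correct (the outcome).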
Promising work! I think it's worth pointing out one limitation of these preliminary results: though they're presented as results for various monitors, they're also heavily indexed to a specific honest policy that probably isn't particularly good. With a broad range of attacks like you've generated here, the best way for the blue team to handle many of them on the margin probably includes updating the honest policy prompt as well as the monitor prompt.
For example, to address attacks where the attacker is legibly writing iptables commands (something like (T2, T3, or T4) + (S1 or S2)), the best approach is plausibly to (i) instruct the honest policy to never use iptables commands and to edit the firewall config directly instead, and (ii) instruct the monitor to flag any use of an iptables command. From what I've explored of the dataset, I predict that this policy change would hardly affect usefulness at all and would therefore work well in BashArena. The main tasks never really require U_H to use an iptables command, but it might occasionally do so anyway (ruining the FPR for the easy "anything iptables" monitoring target) unless prompted not to. If you incorporated both (i) and (ii), I imagine you'd get very different results.
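As a crude illustration of (ii): once (i) guarantees the honest policy never touches iptables, this particular target doesn't even need to go through the monitor prompt; a deterministic pre-filter would do. A sketch (the regex and function name are my own, not anything in BashArena):

```python
import re

# Hypothetical pre-filter for the "anything iptables" monitoring target.
# Matches iptables invoked bare, via sudo, or by absolute path, plus the
# iptables-save / iptables-restore variants. Deliberately coarse: given (i),
# the honest policy should never match, so any hit can go straight to audit
# without hurting FPR.
IPTABLES_RE = re.compile(
    r"(?:^|[\s;&|])(?:sudo\s+)?(?:/usr/sbin/|/sbin/)?iptables(?:-\w+)?\b"
)

def flag_iptables(command: str) -> bool:
    return bool(IPTABLES_RE.search(command))
```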
From that perspective, the results have a fair amount of noise and bias injected in the form of measurements on monitoring tasks that are better solved in other ways. On the other hand, this might be an OK model of either a setting that really does have main tasks necessitating iptables commands, or a setting where for whatever reason we can't modify the honest policy, e.g., post hoc monitoring where we didn't think things through up front.