This is probably not the first barrier to getting into evals, but I have an AI safety startup that designs evals. However, we don't have the capacity to also do good elicitation. I think we lose a lot of signal from our evals because our agent is too weak to explore properly. We're currently using Inspect's basic_agent. Metr's modular_public is better, but we prefer inspect over vivaria otherwise. I think open-sourcing a better agent would be positive for the evals community without contributing to capabilities.
This is probably not the first barrier to getting into evals, but I have an AI safety startup that designs evals. However, we don't have the capacity to also do good elicitation. I think we lose a lot of signal from our evals because our agent is too weak to explore properly. We're currently using Inspect's
basic_agent
. Metr'smodular_public
is better, but we preferinspect
overvivaria
otherwise. I think open-sourcing a better agent would be positive for the evals community without contributing to capabilities.