I only read through the beginning and their control safety case. Some thoughts:
I don't think one needs to generate hyper-realistic-looking scenarios for monitoring, and I'm confused by their reasoning for believing one does. I guess it depends on the fact that they need to elicit the best possible capabilities from the model during the eval, but I don't think this is necessary for the reasons I stated above.
General Thoughts:
I think this is pretty good and shows knowledge of the core points of control. I think Antrhopic's control protocols are easily beaten, and there are much better things to do. They made it seem like control is much harder to implement than I think it will be. I don't think a perfect Elicitation Strategy or Simulation Procedure is needed for control to work in the settings laid out. Once we have much scarier AIs and deploy them in more crazy environments, we will likely need better techniques, but their setting seems to be before this.
Thanks for your post! I think it hits at the heart of the issue, and I like the term "tool AI" as a phrase that points to what we actually want to build. I share your fear of racing with China to build ASI.
As if this were not bad enough, it is now rather obvious that the first AGI will not be a pure LLM, but a hybrid scaffolded system.
I think scaffolding systems are great. Scaffolding tools allow inspecting how the agent is using the tools. For example, a scaffold could have a monitor that watches all of the calls to the code compiler to ensure the code doesn't look suspicious. The same monitoring system could read all of the text the model commits to long-term memory and ensure it isn't planning something bad or stenographically hiding information.
While it might be harder to formally "align" these systems, I think it will be easier to make safety cases for them for why they won't "escape the data center" or "hack the missiles". It will be much harder for the AI to do anything evil when we watch its every move. A superintelligence would likely still escape the scaffold, but it seems unlikely that an AGI could.
I think the scaffold systems could be the future of tool AI if we use them carefully. These hybrid systems will unlock new crazy and awesome capabilities for humans in the future without needing to solve the hard "alignment problem." This does not
When we observe the programs' outputs, we can form a hypothesis class about what structures we think the programs have. My claim is that only a couple of structures pop out after testing a utility-maximizing agent in a sufficient amount of environments.
You are correct in pointing out that an agent could be employing multiple structures at a time. Future work on this problem would include ways of quantifying how much of a certain structure some program has. This might look like trying to come up with a distance measure for programs that would say a version of quicksort written in OOP or a functional representation would have distance 0.
I think this is a good point. I would push back a small amount on being unable to tell the difference between those two cases. There is more information you can extract from the system, like the amount of time that it takes after pressing the button for the current to turn on. But In general, I agree.
I agree that it would be very easy to have huge blind spots regarding this line of inquiry. This is the thing I worry about most. But I do have a hunch that given enough data about a system and its capabilities, we can make strong claims about its internal structure, and these structures will yield predictive power.
When you have little information like "pressing this button makes this wire turn on," it is much harder to do this. However, I believe testing an unknown program in many different environments and having information like its run time and size can narrow the space of possibilities sufficiently to say something useful.
It is hard to pinpoint motivation here. If you are a top researcher at a top lab working on alignment and you disagree with something within the company, I see two categories of options you can take to try to fix things
Jan and Ilya left but haven't said much about how they lost confidence in OpenAI. I expect we will see them making more damning statements about OpenAI in the future
Or is there a possible motivation I'm missing here?
I'm intrigued by the distinction between the policy function and the program implementing it, as their structures seem different.
For example, a policy function might be said to have the structure of mapping all inputs to a single output, and the program that implements it is a Python program that uses a dictionary. Does this distinction matter?
When we talk about agent structure, I'd imagine that we care both about the structure of the actions that the agent takes and the structure of the computation the agent does to decide on the actions.
I agree if there are privacy guarantees, then you can't get data from the real world, so you'd need honeypots. I guess I'm much more scared of the internal deployments where I don't think privacy guarantees are as likely. I'm trying to respond to someone who is like, "Control seems hard because you have to generate real-world looking input, and that seems hard" and I'm like "You can just use the real-world inputs and filter then with a monitor to find the most scary ones and perform auditing"
Yeah, I guess I'm saying that if you don't have a great way to elicit the capabilities from the model, you can substitute that by assuming that you are going against the best possible adversary you can make. This is much more conservative, but I don't think it makes control impossible to implement. In practice, I think it looks like finding the maximum capabilities of the model using your Elecitation Strategy, then boosting its capabilities even more with a scaffold and some clever red team tricks.