Tyler Tracy


When we observe the programs' outputs, we can form a hypothesis class over the structures we think the programs might have. My claim is that only a couple of structures pop out after testing a utility-maximizing agent in a sufficient number of environments.

You are correct in pointing out that an agent could be employing multiple structures at a time. Future work on this problem would include ways of quantifying how much of a certain structure a program has. This might look like trying to come up with a distance measure for programs under which a version of quicksort written in an OOP style and one written in a functional style would have distance 0.
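To make the distance-0 intuition concrete, here is a minimal sketch of two quicksort implementations, one functional and one OOP. The hypothetical distance measure described above would ideally assign them distance 0, since they compute the same function; nothing here is a proposed implementation of that measure, only the two programs it would be asked to compare.

```python
def quicksort_functional(xs):
    """Functional style: recursion over filtered sublists, no mutation."""
    if len(xs) <= 1:
        return list(xs)
    pivot, rest = xs[0], xs[1:]
    return (quicksort_functional([x for x in rest if x < pivot])
            + [pivot]
            + quicksort_functional([x for x in rest if x >= pivot]))


class QuickSorter:
    """OOP style: in-place partitioning hidden behind a class interface."""

    def sort(self, xs):
        out = list(xs)
        self._sort(out, 0, len(out) - 1)
        return out

    def _sort(self, a, lo, hi):
        if lo >= hi:
            return
        p = self._partition(a, lo, hi)
        self._sort(a, lo, p - 1)
        self._sort(a, p + 1, hi)

    def _partition(self, a, lo, hi):
        # Lomuto partition: a[hi] is the pivot.
        pivot, i = a[hi], lo
        for j in range(lo, hi):
            if a[j] < pivot:
                a[i], a[j] = a[j], a[i]
                i += 1
        a[i], a[hi] = a[hi], a[i]
        return i


# Identical input-output behavior on any list of comparable elements:
data = [5, 2, 9, 1, 5, 6]
assert quicksort_functional(data) == QuickSorter().sort(data) == sorted(data)
```

The two programs are extensionally identical but syntactically very different, which is exactly why a naive edit-distance on source code would fail here and a more semantic measure is needed.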

I think this is a good point. I would push back a little on the claim that we can't tell the difference between those two cases. There is more information you can extract from the system, like the amount of time it takes after pressing the button for the current to turn on. But in general, I agree.

I agree that it would be very easy to have huge blind spots regarding this line of inquiry. This is the thing I worry about most. But I do have a hunch that given enough data about a system and its capabilities, we can make strong claims about its internal structure, and these structures will yield predictive power.

When you have only a little information, like "pressing this button makes this wire turn on," it is much harder to do this. However, I believe that testing an unknown program in many different environments, along with information like its run time and size, can narrow the space of possibilities enough to say something useful.

It is hard to pinpoint motivation here. If you are a top researcher at a top lab working on alignment and you disagree with something within the company, I see two categories of options you can take to try to fix things:

  • Stay and try to use your position of power to do good. Better that someone who deeply cares about AI risk is in charge than someone who doesn't.
  • Leave in protest to try to sway public opinion into thinking that your organization is unsafe and should not be trusted.

Jan and Ilya left but haven't said much about why they lost confidence in OpenAI. I expect we will see them make more damning statements about OpenAI in the future.

Or is there a possible motivation I'm missing here?

I'm intrigued by the distinction between the policy function and the program implementing it, as their structures seem different.

For example, a policy function might be said to have the structure of mapping all inputs to a single output, and the program that implements it is a Python program that uses a dictionary. Does this distinction matter?
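The distinction can be made concrete with two toy programs. Both implement the same constant policy (every observation maps to a single action), but their internal structure differs: one ignores its input entirely, the other performs an explicit dictionary lookup. All names here are illustrative, not drawn from the original discussion.

```python
# A hypothetical, finite observation space for the example.
OBSERVATIONS = ["red", "green", "blue"]


def policy_direct(obs):
    # Structure: ignore the input entirely and emit a constant.
    return "noop"


# Structure: an explicit lookup table mapping every observation to "noop".
LOOKUP = {obs: "noop" for obs in OBSERVATIONS}


def policy_table(obs):
    return LOOKUP[obs]


# The two programs realize the same policy function on this domain...
assert all(policy_direct(o) == policy_table(o) for o in OBSERVATIONS)

# ...but differ structurally: policy_table fails outside the table's domain,
# while policy_direct does not, so probing off-distribution inputs (or run
# time and program size) can distinguish the implementations.
```

This is why behavioral testing alone identifies the policy function, while extra observables like run time, program size, or off-distribution behavior are needed to say anything about the implementing program.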

When we talk about agent structure, I'd imagine that we care both about the structure of the actions that the agent takes and the structure of the computation the agent does to decide on the actions.