I'm interested in doing in-depth dialogues to find cruxes. Message me if you are interested in doing this.
I do alignment research, mostly stuff that is vaguely in the agent foundations area. Currently doing independent alignment research on ontology identification. Formerly on Vivek's team at MIRI. Most of my writing before mid-2023 is not representative of my current views on alignment difficulty.
because stabler optimization tends to be more powerful / influential / able-to-skillfully-and-forcefully-steer-the-future
I personally doubt that this is true, which is maybe the crux here.
Would you like to do a dialogue about this? To me it seems clearly true in exactly the same way that having more time to pursue a goal makes it more likely you will achieve that goal.
It's possible another crux is the danger of Goodharting, which I think you're exaggerating. When an agent actually understands what it wants, and/or understands the limits of its understanding, Goodhart is easy to mitigate, and the agent should try hard to achieve its goals (i.e. optimize a metric).
There are multiple ways to interpret "being an actual human". I interpret it as pointing at an ability level.
"the task GPTs are being trained on is harder" => the prediction objective doesn't top out at (i.e. the task has more difficulty in it than).
"than being an actual human" => the ability level of a human (i.e. the task of matching the human ability level at the relevant set of tasks).
Or as Eliezer said:
I said that GPT's task is harder than being an actual human; in other words, being an actual human is not enough to solve GPT's task.
In different words again: the tasks GPTs are being incentivised to solve aren't all solvable at a human level of capability.
You almost had it when you said:
- Maybe you mean something like task + performance threshold. Here 'predict the activation of photoreceptors in human retina well enough to be able to function as a typical human' is clearly less difficult than task + performance threshold 'predict next word on the internet, almost perfectly'. But this comparison does not seem to be particularly informative.
It's more accurate if I edit it to:
- Maybe you mean something like task + performance threshold. Here 'predict ~~the activation of photoreceptors in human retina~~ [text] well enough to be able to function as a typical human' is clearly less difficult than task + performance threshold 'predict next word on the internet, almost perfectly'.
You say it's not particularly informative. Eliezer responds by explaining the argument the OP was responding to, which provides the context in which this is an informative statement about the training incentives of a GPT.
The OP's argument boils down to: the text prediction objective doesn't stop incentivizing higher capabilities once you get to human-level capabilities. This is a valid counter-argument to: "GPTs will cap out at human capabilities because humans generated the training data."
Your central point is:
Where GPT and humans differ is not some general mathematical fact about the task, but differences in what sensory data is a human and GPT trying to predict, and differences in cognitive architecture and ways how the systems are bounded.
You are misinterpreting the OP by thinking it's about comparing the mathematical properties of two tasks, when it was just pointing at the loss gradient of the text prediction task (at the location of a ~human capability profile). The OP works through text prediction sub-tasks where it's obvious that the gradient points toward higher-than-human inference capabilities.
You seem to focus too hard on the minima of the loss function:
notice that “what would the loss function like the system to do” in principle tells you very little about what the system will do
You're correct to point out that the minimum of a loss function doesn't tell you much about the actual loss that a particular system can achieve. As you say, the system's particular boundedness and cognitive architecture are more relevant to that question. But this is irrelevant to the argument being made, which is about whether the text prediction objective stops incentivizing improvements above human capability.
The post showcases the inability of the aggregate LW community to recognize locally invalid reasoning
I think a better lesson to learn is that communication is hard, and therefore we should try not to be too salty toward each other.
I sometimes think of alignment as having two barriers:
My current understanding of your agenda, in my own words:
You're trying to create a low-capability AI paradigm that has way more levers. This paradigm centers on building useful systems by patching together LLM calls. You're collecting a set of useful tactics for doing this patching. You can rely on tactics in a similar way to how we rely on programming language features, because they are small and well-tested-ish. (1 & 2)
As new tactics are developed, you're hoping that expertise and robust theories develop around building systems this way. (3)
This by itself doesn't scale to hard problems, so you're trying to develop methods for learning and tracking knowledge/facts that interface with the rest of the system while remaining legible. (4)
Maybe with some additional tools, we build a relatively-legible emulation of human thinking on top of this paradigm. (5)
Have I understood this correctly?
I feel like the alignment section of this is missing. Is the hope that better legibility and experience will allow us to solve the alignment problems that we expect at this point?
Maybe it'd be good to name some speculative tools/theory that you hope will have been developed for shaping CoEms, then say how they would help with some of:
Most alignment research skips to trying to resolve issues like these first, at least in principle, and then often backs off to develop a relevant theory. I can see why you might want to do the levers part first, and have theory develop along with the experience of building things. But it's risky to do the hard part last.
but because the same solutions that will make AI systems beneficial will also make them safer
This is often not true, and I don't think your paradigm makes it true. E.g. we often lose legibility to gain capability, and that is plausibly also true during AGI development in the CoEm paradigm.
In practice, sadly, developing a true ELM is currently too expensive for us to pursue
Expensive why? Seems like the bottleneck here is theoretical understanding.
Yeah, I read that prize contest post; that was much of where I got my impression of the "consensus". It didn't really describe which parts you still consider valuable. I'd be curious to know which they are. My understanding was that most of the conclusions in that post were downstream of the Landauer limit argument.
Could you explain or directly link to something about the 4x claim? Seems wrong. Communication speed scales with distance, not area.
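To spell out the arithmetic behind my objection (assuming, and I may be misreading the claim here, that the 4x figure comes from halving linear dimensions): with a roughly fixed signal speed, latency scales with the distance a signal travels, while area scales with distance squared.

$$
t_{\text{latency}} \propto \frac{d}{v_{\text{signal}}}, \qquad d \mapsto \tfrac{1}{2}d \;\Rightarrow\; t_{\text{latency}} \mapsto \tfrac{1}{2}t_{\text{latency}}, \qquad \text{while } A \propto d^2 \mapsto \tfrac{1}{4}d^2 .
$$

So a 2x linear shrink should buy roughly a 2x latency improvement, not 4x; getting 4x looks like reading the speedup off the area factor.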
Jacob Cannell's brain efficiency post
I thought the consensus on that post was that it was mostly bullshit?
These seem right, but more importantly I think it would eliminate investment in new scalable companies, or dramatically reduce it in the 50% case. So there would be very few new companies created.
(As a side note: maybe our response to this proposal was a bit cruel. It might have been better to just point toward some econ reading material.)
would hopefully include many people who understand that understanding constraints is key and that past research understood some constraints.
Good point, I'm convinced by this.
build on past agent foundations research
I don't really agree with this. Why do you say this?
That's my guess at the level of engagement required to understand something. Maybe it's just because when I've tried to use or modify some research that I thought I understood, I've always realised I didn't understand it deeply enough. I'm probably anchoring too hard on my own experience here; other people often learn faster than me.
(Also I'm confused about the discourse in this thread (which is fine), because I thought we were discussing "how / how much should grantmakers let the money flow".)
I was thinking "should grantmakers let the money flow to unknown young people who want a chance to prove themselves."
I agree this would be a great program to run, but I want to call it a different lever to the one I was referring to.
The only thing I would change is that I think new researchers need to understand the purpose and value of past agent foundations research. I spent too long searching for novel ideas while I still misunderstood the main constraints of alignment. I expect you'd get a lot of wasted effort if you asked for out-of-paradigm ideas. Instead it might be better to ask people to understand and build on past agent foundations research, then gradually move away if they see other pathways after having understood the constraints. Now I see my work as mostly about trying to run into constraints for the purpose of understanding them better.
Maybe that wouldn't help though; it's really hard to make people see the constraints.
Here are two ways that a high-level model can be wrong:
It sounds like you're saying the high-level agency-as-outcome-directed model is wrong in the second way? If so, I disagree; it looks much more like the first way. I don't think I understand your beliefs well enough to argue about this; maybe there's something I should read?
I have a discomfort that I want to try to gesture at:
Do you ultimately want to build a piece of software that solves a problem so difficult that it needs to modify itself? My impression from the post is that you are thinking about this level of capability in a distant way, and mostly focusing on much earlier and easier regimes. I think it's probably very easy to work on legible low-level capabilities without making any progress on the regime that matters.
To me it looks important for researchers to keep this ultimate goal constantly in mind, because there are many pathways that lead off-track. Does it look different to you?
I think this is a bad place to rely on governance, given the fuzziness of this boundary and the huge incentive toward capability over legibility. Am I right in thinking that you're making a large-ish gamble here on the way the tech tree shakes out (such that it's easy to see a legible-illegible boundary, and the legible approaches are competitive-ish) and also the way governance shakes out (such that governments decide that e.g. assigning detailed blame for failures is extremely important and worth delaying capabilities)?
I'm glad you're doing ambitious things, and I'm generally a fan of trying to understand problems from scratch in the hope that they dissolve or become easier to solve.
Why would this be a project that requires large-scale experiments? It looks like something that a random PhD student with two GPUs could maybe make progress on. Might it even be a good problem to set up a prize for?