Dialogues seem under-incentivised relative to comments, given the amount of effort involved. Maybe they would get more karma if we could vote on individual replies, so it's more like a comment chain?
This could also help with skimming a dialogue because you can skip to the best parts, to see whether it's worth reading the whole thing.
The ideal situation understanding-wise is that we understand AI at an algorithmic level. We can say stuff like: there are X,Y,Z components of the algorithm, and X passes (e.g.) beliefs to Y in format b, and Z can be viewed as a function that takes information in format w and links it with... etc. And infrabayes might be the theory you use to explain what some of the internal datastructures mean. Heuristic arguments might be how some subcomponent of the algorithm works. Most theoretical AI work (both from the alignment community and in normal AI and ML theo...
Great comment, agreed. There was some suggestion of (3), and maybe there was too much. I think there are times when expectations about the plan are equivalent to literal desires about how the task should be done. For making coffee, I expect that it won't create much noise. But also, I actually want the coffee-making to not be particularly noisy, and if it's the case that the first plan for making coffee also creates a lot of noise as a side effect, this is a situation where something in the goal specification has gone horribly wrong (and there should be some institutional response).
Yeah I think I remember Stuart talking about agents that request clarification whenever they are uncertain about how a concept generalizes. That is vaguely similar. I can't remember whether he proposed any way to make that reflectively stable though.
From the perspective of this post, wouldn't natural language work a bit as a redundancy specifier in that case and so LLMs are more alignable than RL agents?
LLMs in their current form don't really cause Edge Instantiation problems. Plausibly this is because they internally implement many kinds of regularization...
Yeah I agree there are similarities. I think a benefit of my approach, that I should have emphasized more, is that it's reflectively stable (and theoretically simple and therefore easy to analyze). In your description of an AI that wants to seek clarification, it isn't clear that it won't self-modify (but it's hard to tell).
...There’s a general problem that people will want AGIs to find clever out-of-the-box solutions to problems, and there’s no principled distinction between “finding a clever out-of-the-box solution to a problem” and “Goodharting the problem
The Alice and Bob example isn't a good argument against the independence axiom. The combined agent can be represented using a fact-conditional utility function. Include the event "get job offer" in the outcome space, so that the combined utility function is a function of that fact.
E.g.
Bob {A: 0, B: 0.5, C: 1}
Alice {A: 0.3, B: 0, C: 0}
Should merge to become
AliceBob {Ao: 0, Bo: 0.5, Co: 1, A¬o: 0, B¬o: 0, C¬o: 0.3}, where o="get job offer".
This is a far more natural way to combine agents. We can avoid the ontologically weird mixing of probabilities and prefe...
Yeah I can see how Scott's quote can be interpreted that way. I think the people listed would usually be more careful with their words. But also, Scott isn't necessarily claiming what you say he is. Everyone agrees that when you prompt a base model to act agentically, it can kinda do so. This can happen during RLHF. Properties of this behaviour will be absorbed from pretraining data, including moral systems. I don't know how Scott is imagining this, but it needn't be an inner homunculus that has consistent goals.
I think the thread below with Daniel and Evan...
but his takes were probably a little more predictably unwelcome in this venue
I hope he doesn't feel his takes are unwelcome here. I think they're empirically very welcome. His posts seem to have a roughly similar level of controversy and popularity as e.g. so8res. I'm pretty sad that he largely stopped engaging with LessWrong.
There's definitely value to being (rudely?) shaken out of lazy habits of thinking [...] and I think Alex has a knack for (at least sometimes correctly) calling out others' confusion or equivocation.
Yeah I agree, that's why I like to read Alex's takes.
Really appreciate dialogues like this. This kind of engagement across worldviews should happen far more, and I'd love to do more of it myself.[1]
Some aspects were slightly disappointing:
Tsvi has many underrated posts. This one was rated correctly.
I didn't previously have a crisp conceptual handle for the category that Tsvi calls Playful Thinking. Initially it seemed a slightly unnatural category. Now it's such a natural category that perhaps it should be called "Thinking", and other kinds should be the ones with a modifier (e.g. maybe Directed Thinking?).
Tsvi gives many theoretical justifications for engaging in Playful Thinking. I want to talk about one because it was only briefly mentioned in the post:
...Your sense of fun decor
This post deserves to be remembered as a LessWrong classic.
There are several problems that are fundamentally about attaching very different world models together and transferring information from one to the other.
I'm curious whether the recent trend toward bi-level optimization via chain-of-thought was any update for you? I would have thought this would have updated people (partially?) back toward actually-evolution-was-a-decent-analogy.
There's this paragraph, which seems right-ish to me:
...In order to experience a sharp left turn that arose due to the same mechanistic reasons as the sharp left turn of human evolution, an AI developer would have to:
- Deliberately create a (very obvious[2]) inner optimizer, whose inner loss function includes no mention of human val
I think we currently do not have good gears level models of lots of the important questions of AI/cognition/alignment, and I think the way to get there is by treating it as a software/physicalist/engineering problem, not presupposing an already higher level agentic/psychological/functionalist framing.
Here are two ways that a high-level model can be wrong:
because stabler optimization tends to be more powerful / influential / able-to-skillfully-and-forcefully-steer-the-future
I personally doubt that this is true, which is maybe the crux here.
Would you like to do a dialogue about this? To me it seems clearly true in exactly the same way that having more time to pursue a goal makes it more likely you will achieve that goal.
It's possible another crux is related to the danger of Goodharting, which I think you are exaggerating. When an agent actually understands what it wants, and/or understands the l...
There are multiple ways to interpret "being an actual human". I interpret it as pointing at an ability level.
"the task GPTs are being trained on is harder" => the prediction objective doesn't top out at (i.e. the task has more difficulty in it than).
"than being an actual human" => the ability level of a human (i.e. the task of matching the human ability level at the relevant set of tasks).
Or as Eliezer said:
I said that GPT's task is harder than being an actual human; in other words, being an actual human is not enough to solve GPT's task.
In different...
The OP argument boils down to: the text prediction objective doesn't stop incentivizing higher capabilities once you get to human level capabilities. This is a valid counter-argument to: GPTs will cap out at human capabilities because humans generated the training data.
Your central point is:
Where GPT and humans differ is not some general mathematical fact about the task, but differences in what sensory data is a human and GPT trying to predict, and differences in cognitive architecture and ways how the systems are bounded.
You are misinterpretin...
I sometimes think of alignment as having two barriers:
My current understanding of your agenda, in my own words:
You're trying to create a low-capability AI paradigm that has way more levers. This paradigm centers on building useful systems by patching together LLM calls. You're collecting a set of useful tactics for doing this patching. You can rely on tactics in a similar way to how we rely on programming langu...
Thanks for the comment!
Have I understood this correctly?
I am most confident in phases 1-3 of this agenda, and I think you have overall a pretty good rephrasing of 1-5, thanks! One note is that I don't think "LLM calls" as being fundamental, I think of LLMs as a stand-in for "banks of patterns" or "piles of shards of cognition." The exact shape of this can vary, LLMs are just our current most common shape of "cognition engines", but I can think of many other, potentially better, shapes this "neural primitive/co-processor" could take.
I think there is s...
Yeah I read that prize contest post, that was much of where I got my impression of the "consensus". It didn't really describe which parts you still considered valuable. I'd be curious to know which they are? My understanding was that most of the conclusions made in that post were downstream of the Landauer limit argument.
These seem right, but more importantly I think it would eliminate investing in new scalable companies. Or dramatically reduce it in the 50% case. So there would be very few new companies created.
(As a side note: Maybe our response to this proposal was a bit cruel. It might have been better to just point toward some econ reading material).
would hopefully include many people who understand that understanding constraints is key and that past research understood some constraints.
Good point, I'm convinced by this.
build on past agent foundations research
I don't really agree with this. Why do you say this?
That's my guess at the level of engagement required to understand something. Maybe just because when I've tried to use or modify some research that I thought I understood, I always realise I didn't understand it deeply enough. I'm probably anchoring too hard on my own experience here, othe...
I agree this would be a great program to run, but I want to call it a different lever to the one I was referring to.
The only thing I would change is that I think new researchers need to understand the purpose and value of past agent foundations research. I spent too long searching for novel ideas while I still misunderstood the main constraints of alignment. I expect you'd get a lot of wasted effort if you asked for out-of-paradigm ideas. Instead it might be better to ask for people to understand and build on past agent foundations research, then gradually...
The main things I'm referring to are upskilling or career transition grants, especially from LTFF, in the last couple of years. I don't have stats; I'm assuming there were a lot given out because I met a lot of people who had received them. Probably there were a bunch given out by the FTX Future Fund also.
Also when I did MATS, many of us got grants post-MATS to continue our research. Relatively little seems to have come of these.
How are they falling short?
(I sound negative about these grants but I'm not, and I do want more stuff like that to happen. If I we...
upskilling or career transition grants, especially from LTFF, in the last couple of years
Interesting; I'm less aware of these.
How are they falling short?
I'll answer as though I know what's going on in various private processes, but I don't, and therefore could easily be wrong. I assume some of these are sort of done somewhere, but not enough and not together enough.
I think I disagree. This is a bandit problem, and grantmakers have tried pulling that lever a bunch of times. There hasn't been any field-changing research (yet). They knew it had a low chance of success so it's not a big update. But it is a small update.
Probably the optimal move isn't cutting early-career support entirely, but having a higher bar seems correct. There are other levers that are worth trying, and we don't have the resources to try every lever.
Also there are more grifters now that the word is out, so the EV is also declining that way.
(I feel bad saying this as someone who benefited a lot from early-career financial support).
grantmakers have tried pulling that lever a bunch of times
What do you mean by this? I can think of lots of things that seem in some broad class of pulling some lever that kinda looks like this, but most of the ones I'm aware of fall greatly short of being an appropriate attempt to leverage smart young creative motivated would-be AGI alignment insight-havers. So the update should be much smaller (or there's a bunch of stuff I'm not aware of).
My first exposure to rationalists was a Rationally Speaking episode where Julia recommended the movie Locke.
It's about a man pursuing difficult goals under emotional stress using few tools. For me it was a great way to be introduced to rationalism because it showed how a ~rational actor could look very different from a straw Vulcan.
It's also a great movie.
Nice.
A similar rule of thumb I find handy: divide 70 by the growth rate (in percent per period) to get the implied doubling time. I find it way easier to think about doubling times than growth rates.
E.g. 3% interest rate means 70/3 ≈ 23 year doubling time.
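A quick sketch of why this works, in case it's useful: the exact doubling time under compound growth is ln(2) / ln(1 + r), and for small r we have ln(1 + r) ≈ r, so the doubling time is roughly 0.693 / r, i.e. about 70 divided by the rate in percent. The function names below are just made up for illustration.

```python
import math

def doubling_time_rule_of_70(rate_percent: float) -> float:
    """Approximate doubling time: 70 divided by the growth rate in percent."""
    return 70.0 / rate_percent

def doubling_time_exact(rate_percent: float) -> float:
    """Exact doubling time under compound growth at rate_percent per period."""
    return math.log(2) / math.log(1 + rate_percent / 100.0)

# Compare the rule of thumb with the exact value for a few growth rates.
for rate in (1, 3, 7, 10):
    approx = doubling_time_rule_of_70(rate)
    exact = doubling_time_exact(rate)
    print(f"{rate}%: rule of 70 = {approx:.1f}, exact = {exact:.1f}")
```

The approximation is tight at low rates and drifts a little as the rate grows, since ln(1 + r) ≈ r gets worse for larger r.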
I get the feeling that I’m still missing the point somehow and that Yudkowsky would say we still have a big chance of doom if our algorithms were created by hand with programmers whose algorithms always did exactly what they intended even when combined with their other algorithms.
I would bet against Eliezer being pessimistic about this, if we are assuming the algorithms are deeply-understood enough that we are confident that we can iterate on building AGI. I think there's maybe a problem with the way Eliezer communicates that gives people the impression th...
Intelligence/IQ is always good, but not a dealbreaker as long as you can substitute it with a larger population.
IMO this is pretty obviously wrong. There are some kinds of problem solving that scale poorly with population, just as there are some computations that scale poorly with parallelisation.
E.g. project euler problems.
When I said "problems we care about", I was referring to a cluster of problems that very strongly appear to not scale well with population. Maybe this is an intuitive picture of the cluster of problems I'm referring to.
I buy that such an intervention is possible. But doing it requires understanding the internals at a deep level. You can't expect SGD to implement the patch in a robust way. The patch would need to still be working after 6 months on an impossible problem, in spite of it actively getting in the way of finding the solution!
I was probably influenced by your ideas! I just (re?)read your post on the topic.
Tbh I think it's unlikely such a sweet spot exists, and I find your example unconvincing. The value of this kind of reflection for difficult problem solving directly conflicts with the "useful" assumption.
I'd be more convinced if you described a task where you expect an AI to be useful (significantly above current humans) and that doesn't involve failing and reevaluating high-level strategy every now and then.
Extremely underrated post, I'm sorry I only skimmed it when it came out.
I found 3a,b,c to be strong and well written, a good representation of my view.
In contrast, 3d I found to be a weak argument that I didn't identify with. In particular, I don't think internal conflicts are a good way to explain the source of goal misgeneralization. To me it's better described as just overfitting or misgeneralization.[1] Edge cases in goals are clearly going to be explored by a stepping back process, if initial attempts fail. In particular if attempted pathwa...
Trying to write a new steelman of Matt's view. It's probably incorrect, but seems good to post as a measure of progress:
You believe in agentic capabilities generalizing, but also in additional high-level patterns that generalize and often overpower agentic behaviour. You expect training to learn all the algorithms required for intelligence, but also pick up patterns in the data like "research style", maybe "personality", maybe "things a person wouldn't do" and also build those into the various-algorithms-that-add-up-to-intelligence at a deep level. In part...
Not much to add, I haven't spent enough time thinking about structural selection theorems.
I'm a fan of making more assumptions. I've had a number of conversations with people who seem to make the mistake of not assuming enough, sometimes leading them to incorrectly consider various things impossible. E.g. "How could an agent store a utility function over all possible worlds?" or "Rice's theorem/halting problem/incompleteness/NP-hardness/no-free-lunch theorems means it's impossible to do xyz". The answer is always nah, it's possible, we just need to t...
Good point.
What I meant by "updatelessness removes most of the justification" is the reason given here at the very beginning of "Against Resolute Choice". In order to make a money pump that leads the agent in a circle, the agent has to continue accepting trades around a full preference loop. But if it has decided on the entire plan beforehand, it will just do any plan that involves <1 trip around the preference loop. (Although it's unclear how it would settle on such a plan, maybe just stopping its search after a given time). It won't (I think?) choose an...
I think the problem might be that you've given this definition of heuristic:
A heuristic is a local, interpretable, and simple function (e.g., boolean/arithmetic/lookup functions) learned from the training data. There are multiple heuristics in each layer and their outputs are used in later layers.
Taking this definition seriously, it's easy to decompose a forward pass into such functions.
But you have a much more detailed idea of a heuristic in mind. You've pointed toward some properties this might have in your point (2), but haven't put it into specific wor...
I've only skimmed this post, but I like it because I think it puts into words a fairly common model (that I disagree with). I've heard "it's all just a stack of heuristics" as an explanation of neural networks and as a claim that all intelligence is this, from several people. (Probably I'm overinterpreting other people's words to some extent, they probably meant a weaker/nuanced version. But like you say, it can be useful to talk about the strong version).
I think you've correctly identified the flaw in this idea (it isn't predictive, it's unfalsifiable, so...
Trying to think this through, I'll write a bit of a braindump just in case that's useful:
The futarchy hack can be split into two parts. The first is that conditioning on untaken actions makes most probabilities ill-defined. Because there are no incentives to get it right, the market can settle to many equilibria. The second part is that there are various incentives for traders to take advantage of this for their own interests.
With your technique, I think the approach would be to duplicate each trader into two traders with the same knowledge, and mak...
How about I assume there is some epsilon such that the probability of an agent going off the rails is greater than epsilon in any given year. Why can't the agent split into multiple ~uncorrelated agents and have them each control some fraction of resources (maybe space) such that one off-the-rails agent can easily be fought and controlled by the others? This should reduce the risk to some fraction of epsilon, right?
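A rough sketch of the arithmetic behind this, under assumptions that go beyond what's stated above: each of n agents goes off the rails independently with probability exactly epsilon per year, and things only go badly if at least k of them go off the rails in the same year (k = 2 meaning a single rogue agent is always contained). The function name and the specific numbers are hypothetical.

```python
from math import comb

def prob_at_least_k_fail(n: int, epsilon: float, k: int) -> float:
    """Probability that at least k of n independent agents fail in a given year.

    Assumes each agent fails independently with probability epsilon
    (an idealisation of "~uncorrelated").
    """
    return sum(
        comb(n, i) * epsilon**i * (1 - epsilon) ** (n - i)
        for i in range(k, n + 1)
    )

epsilon = 0.01
single_agent_risk = prob_at_least_k_fail(1, epsilon, 1)  # one agent, no backup
split_agent_risk = prob_at_least_k_fail(5, epsilon, 2)   # five agents, one rogue is containable

print(f"single agent: {single_agent_risk:.6f}")
print(f"five agents, need 2+ failures: {split_agent_risk:.6f}")
```

Under these assumptions the yearly risk is roughly C(n, 2) · epsilon², which is well below epsilon when n · epsilon is small. Any correlation between the agents' failures would eat into that reduction.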
(I'm gonna try and stay focused on a single point, specifically the argument that leads up to >99%, because that part seems wrong for quite simple reasons).
You could call this a failure of the AGI’s goal-related systems if you mean with that that the machinery failed to control its external effects in line with internally represented goals.
But this would be a problem with the control process itself.
So it's the AI being incompetent?
Unfortunately, there are fundamental limits that cap the extent to which the machinery can improve its own control process.
Yeah I think that would be a good response to my argument against premise 2). I've had a quick look at the list of theorems in the paper, I don't know...
...In practice, engineers know that complex architectures interacting with the surrounding world end up having functional failures (because of unexpected interactive effects, or noisy interference). With AGI, we are talking about an architecture here that would be replacing all our jobs and move to managing conditions across our environment. If AGI continues to persist in some form over time, failures will occur and build up toward lethality at some unknown rate. Over a long enough period, this repeated potential for uncontrolled failures pushes the risk of h
To me it seems like one important application of this work is to understanding and fixing the futarchy hack in FixDT and in Logical Inductor decision theory. But I'm not sure whether your results can transfer to these settings, because of the requirement that the agents have the same beliefs.
Is there a reason we can't make duplicate traders in LI and have their trades be zero-sum?
I'm generally confused about this. Do you have thoughts?
The non-spicy answer is probably the LTFF, if you're happy deferring to the fund managers there. I don't know what your risk tolerance for wasting money is, but you can check whether they meet it by looking at their track record.
If you have a lot of time you might be able to find better ways to spend money than the LTFF can. (Like if you can find a good way to fund intelligence amplification as Tsvi said).
Hmm good point. Looking at your dialogues has changed my mind, they have higher karma than the ones I was looking at.
You might also be unusual on some axis that makes arguments easier. It takes me a lot of time to go over people's words and work out what beliefs are consistent with them. And the inverse, translating model to words, also takes a while.