All of Jeremy Gillen's Comments + Replies

It's not entirely clear to me that the math works out for AIs being helpful on net relative to humans just doing it, because of the supervision required, and the trust and misalignment issues.

But on this question (for AIs that are just capable of "prosaic and relatively unenlightened ML research") it feels like shot-in-the-dark guesses. It's very unclear to me what is and isn't possible.

4ryan_greenblatt
I certainly agree it isn't clear, just my current best guess.

Thanks, I appreciate the draft. I see why it's not plausible to get started on now, since much of it depends on having AGIs or proto-AGIs to play with.

I guess I shouldn't respond too much in public until you've published the doc, but:

  • If I'm interpreting correctly, a number of the things you intend to try involve having a misaligned (but controlled) proto-AGI run experiments involving training (or otherwise messing with in some way) an AGI. I hope you have some empathy for the internal screaming I have toward this category of things.
  • A bunch of the ideas do seem
... (read more)
2ryan_greenblatt
Yes, I just meant on net. (Relative to the current ML community and given a similar fraction of resources to spend on AI compute.)

I think if the model is scheming it can behave arbitrarily badly in concentrated ways (either in a small number of actions or in a short period of time), but you can make it behave well in the average case using online training.

I think we kind of agree here. The cruxes remain: I think that the metric for "behave well" won't be good enough for "real" large research acceleration. And "average case" means very little when it allows room for deliberate-or-not mistakes whenever they can plausibly be gotten away with. [Edit: Or sabotage, escape, etc.]

Also, yo... (read more)

2ryan_greenblatt
Oh, yeah I meant "perform well according to your metrics" not "behave well" (edited)

Yep this is the third crux I think. Perhaps the most important.

To me it looks like you're making a wild guess that "prosaic and relatively unenlightened ML research" is a very large fraction of the necessary work for solving alignment, without any justification that I know of?

For all the pathways to solving alignment that I am aware of, this is clearly false. I think if you know of a pathway that just involves mostly "prosaic and relatively unenlightened ML research", you should write out this plan, why you expect it to work, and then ask OpenPhil to throw a billion dollars toward every available ML-research-capable human to do this work right now. Surely it'd be better to get started already?

3ryan_greenblatt
I don't think "what is the necessary work for solving alignment" is a frame I really buy. My perspective on alignment is more like:
  • Avoiding egregious misalignment (where AIs intentionally act in ways that make our tests highly misleading or do pretty obviously unintended/dangerous actions) reduces risk once AIs are otherwise dangerous.
  • Additionally, we will likely need to hand over making most near term decisions and most near term labor to some AI systems at some point. This going well very likely requires being able to avoid egregious misalignment (in systems capable enough to obsolete us) and also requires some other stuff.
  • There is a bunch of "prosaic and relatively unenlightened ML research" which can make egregious misalignment much less likely and can resolve other problems needed for handover.
  • Much of this work is much easier once you already have powerful AIs to experiment on.
  • The risk reduction will depend on the amount of effort put in and the quality of the execution etc.
  • The total quantity of risk reduction is unclear, but seems substantial to me. I'd guess takeover risk goes from 50% to 5% if you do a very good job at executing on huge amounts of prosaic and relatively unenlightened ML research at the relevant time. (This requires more misc conceptual work, but not something that requires deep understanding per se.)

I think my perspective is more like "here's a long list of stuff which would help". Some of this is readily doable to work on in advance and should be worked on, and some is harder to work on. This work isn't extremely easy to verify or scale up (such that I don't think "throw a billion dollars at it" just works), though I'm excited for a bunch more work on this stuff. ("Relatively unenlightened" doesn't mean "trivial to get the ML community to work on this using money", and I also think that getting the ML community to work on things effectively is probably substantially harder than getting AIs to work on things effectively.)

I'm not entirely sure where our upstream cruxes are. We definitely disagree about your conclusions. My best guess is the "core mistake" comment below, and the "faithful simulators" comment is another possibility.

Maybe another relevant thing that looks wrong to me: You will still get slop when you train an AI to look like it is epistemically virtuously updating its beliefs. You'll get outputs that look very epistemically virtuous, but it takes time and expertise to rank them in a way that reflects actual epistemic virtue level, just like other kinds of slop... (read more)

these are also alignment failures we see in humans.

Many of them have close analogies in human behaviour. But you seem to be implying "and therefore those are non-issues"???

There are many humans (or groups of humans) that, if you set them on the task of solving alignment, will at some point decide to do something else. In fact, most groups of humans will probably fail like this.

How is this evidence in favour of your plan ultimately resulting in a solution to alignment???

but these systems empirically often move in reasonable and socially-beneficial

... (read more)

to the extent developers succeed in creating faithful simulators

There's a crux I have with Ryan which is "whether future capabilities will allow data-efficient long-horizon RL fine-tuning that generalizes well". As of last time we talked about it, Ryan says we probably will, I say we probably won't.

If we have the kind of generalizing ML that we can use to make faithful simulations, then alignment is pretty much solved. We make exact human uploads, and that's pretty much it. This is one end of the spectrum on this question.

There are weaker versions, which I... (read more)

4ryan_greenblatt
FWIW, I don't think "data-efficient long-horizon RL" (which is sample efficient in an online training sense) implies you can make faithful simulations. I think if the model is scheming it can behave arbitrarily badly in concentrated ways (either in a small number of actions or in a short period of time), but you can make it ~~behave well~~ perform well according to your metrics in the average case using online training.

My guess is that your core mistake is here:

When I say agents are “not egregiously misaligned,” I mean they mostly perform their work earnestly – in the same way humans are mostly earnest and vaguely try to do their job. Maybe agents are a bit sycophantic, but not more than the humans whom they would replace. Therefore, if agents are consistently “not egregiously misaligned,” the situation is no worse than if humans performed their research instead.

Obviously, all agents having undergone training to look "not egregiously misaligned", will not look egregiousl... (read more)

2joshc
I think my arguments still hold in this case though right? i.e. we are training models so they try to improve their work and identify these subtle issues -- and so if they actually behave this way, they will find these issues insofar as humans identify the subtle mistakes they make.

I agree there are lots of "messy in between places," but these are also alignment failures we see in humans. And if humans had a really long time to do safety research, my guess is we'd be ok. Why? Like you said, there's a messy complicated system of humans with different goals, but these systems empirically often move in reasonable and socially-beneficial directions over time (governments get set up to deal with corrupt companies, new agencies get set up to deal with issues in governments, etc.) and I expect we can make AI agents a lot more aligned than humans typically are. E.g. most humans don't actually care about the law etc., but Claude sure as hell seems to. If we have agents that sure as hell seem to care about the law and are not just pretending (they really will, in most cases, act like they care about the law), then that seems to be a good state to be in.

(Some) acceleration doesn't require being fully competitive with humans while deference does.

Agreed. The invention of calculators was useful for research, and the invention of more tools will also be helpful.

I think AIs that can autonomously do moderate duration ML tasks (e.g., 1 week tasks), but don't really have any interesting new ideas could plausibly speed up safety work by 5-10x if they were cheap and fast enough.

Maybe some kinds of "safety work", but real alignment involves a human obtaining a deep understanding of intelligence and agency. The path ... (read more)

3ryan_greenblatt
A typical crux is that I think we can increase our chances of "real alignment" using prosaic and relatively unenlightened ML research without any deep understanding. I both think:
  1. We can significantly accelerate prosaic ML safety research (e.g., of the sort people are doing today) using AIs that are importantly limited.
  2. Prosaic ML safety research can be very helpful for increasing the chance of "real alignment" for AIs that we hand off to. (At least when this research is well executed and has access to powerful AIs to experiment on.)

This top level post is part of Josh's argument for (2).

(vague memory from the in person discussions we had last year, might be inaccurate):

jeremy!2023: If you're expecting AI to be capable enough to "accelerate alignment research" significantly, it'll need to be a full-blown agent that learns stuff. And that'll be enough to create alignment problems because data-efficient long-horizon generalization is not something we can do.

joshc!2023: No way, all you need is AI with stereotyped skills. Imagine how fast we could do interp experiments if we had AIs that were good at writing code but dumb in other ways!

...

josh... (read more)

2joshc
I definitely agree that the AI agents at the start will need to be roughly aligned for the proposal above to work. What is it you think we disagree about?
8ryan_greenblatt
Something important is that "significantly accelerate alignment research" isn't the same as "making AIs that we're happy to fully defer to". This post is talking about conditions needed for deference and how we might achieve them. (Some) acceleration doesn't require being fully competitive with humans while deference does. I think AIs that can autonomously do moderate duration ML tasks (e.g., 1 week tasks), but don't really have any interesting ideas could plausibly speed up safety work by 5-10x if they were cheap and fast enough.

In that case, what does the conditional goal look like when you translate it into a preference relation over outcomes?

We can't reduce the domain of the utility function without destroying some information. If we tried to change the domain variables from [g, h, shutdown] to [g, shutdown], we wouldn't get the desired behaviour. Maybe you have a particular translation method in mind?

I don't mess up the medical test because true information is instrumentally useful to me, given my goals.

Yep that's what I meant. The goal u is constructed to make information abo... (read more)

With regards to the agent believing that it's impossible to influence the probability that its plan passes validation

This is a misinterpretation. The agent entirely has true beliefs. It knows it could manipulate the validation step. It just doesn't want to, because of the conditional shape of its goal. This is a common behaviour among humans, for example you wouldn't mess up a medical test to make it come out negative, because you need to know the result in order to know what to do afterwards.
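As a toy illustration of the medical-test logic (a sketch only, not the exact construction from the post; the probabilities and labels are made up): because the goal is conditional on the true validation outcome, tampering with the validation signal just makes the agent act on the wrong branch and lowers its own expected utility.

```python
import random

# Toy setup: the agent's goal is conditional on whether its plan truly passes
# validation. (Names and numbers are illustrative assumptions, not from the post.)
p_pass = 0.5  # probability the plan truly passes validation

def utility(action, truly_passes):
    # Conditional goal: execute the plan iff it truly passes validation,
    # otherwise shut down.
    if truly_passes:
        return 1.0 if action == "execute" else 0.0
    return 1.0 if action == "shutdown" else 0.0

def expected_utility(tamper, n=100_000):
    total = 0.0
    for _ in range(n):
        truly_passes = random.random() < p_pass
        # Tampering forces the observed signal to "pass"; honesty keeps it accurate.
        signal = True if tamper else truly_passes
        action = "execute" if signal else "shutdown"
        total += utility(action, truly_passes)
    return total / n

print(expected_utility(tamper=False))  # ~1.0: an accurate signal lets the agent satisfy its goal
print(expected_utility(tamper=True))   # ~0.5: tampering makes it execute unvalidated plans
```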

3EJT
Oh I see. In that case, what does the conditional goal look like when you translate it into a preference relation over outcomes? I think it might involve incomplete preferences.

Here's why I say that. For the agent to be useful, it needs to have some preference between plans conditional on their passing validation: there must be some plan A and some plan A+ such that the agent prefers A+ to A. Then given Completeness and Transitivity, the agent can't lack a preference between shutdown and each of A and A+. If the agent lacks a preference between shutdown and A, it must prefer A+ to shutdown. It might then try to increase the probability that A+ passes validation. If the agent lacks a preference between shutdown and A+, it must prefer shutdown to A. It might then try to decrease the probability that A passes validation. This is basically my Second Theorem and the point that John Wentworth makes here.

I'm not sure the medical test is a good analogy. I don't mess up the medical test because true information is instrumentally useful to me, given my goals. But (it seems to me) true information about whether a plan passes validation is only instrumentally useful to the agent if the agent's goal is to do what we humans really want. And that's something we can't assume, given the difficulty of alignment.

I propose: the best planners must break the beta.

Because if a planner is going to be the best, it needs to be capable of finding unusual (better!) plans. If it's capable of finding those, there's ~no benefit of knowing the conventional wisdom about how to do it (climbing slang: beta). 

Edit: or maybe: good planners don't need beta?

6Jesse Hoogland
That's fun but a little long. Why not... BetaZero?

I think you're wrong to be psychoanalysing why people aren't paying attention to your work. You're overcomplicating it. Most people just think you're wrong upon hearing a short summary, and don't trust you enough to spend time learning the details. Whether your scenario is important or not, from your perspective it'll usually look like people are bouncing off for bad reasons.

For example, I read the executive summary. For several shallow reasons,[1] the scenario seemed unlikely and unimportant. I didn't expect there to be better arguments further on. S... (read more)

6dr_s
I think the shell games point is interesting though. It's not psychoanalysing (one can think that people are in denial or have rational beliefs about this, not much point second guessing too far), it's pointing out a specific fallacy: a sort of god of the gaps in which every person with a focus on subsystem X assumes the problem will be solved in subsystem Y, which they understand or care less about because it's not their specialty. If everyone does it, that does indeed lead to completely ignoring serious problems due to a sort of bystander effect.
2[comment deleted]

I think 'people aren't paying attention to your work' is a somewhat different situation than the one voiced in the original post. I'm discussing specific ways in which people engage with the argument, as opposed to just ignoring it. It is the baseline that most people ignore most arguments most of the time.

Also it's probably worth noting the ways seem somewhat specific to the crowd over-represented here - in different contexts people are engaging with it in different ways. 
 

The description of how sequential choice can be defined is helpful, I was previously confused by how this was supposed to work. This matches what I meant by preferences over tuples of outcomes. Thanks!

We'd incorrectly rule out the possibility that the agent goes for (B+,B).

There's two things we might want from the idea of incomplete preferences:

  1. To predict the actions of agents.
  2. Because complete agents behave dangerously sometimes, and we want to design better agents with different behaviour.

I think modelling an agent as having incomplete preferences is grea... (read more)

Perhaps I'm misusing the word "representable"? But what I meant was that any single sequence of actions generated by the agent could also have been generated by an outcome-utility maximizer (that has the same world model). This seems like the relevant definition, right?

That's not right

Are you saying that my description (following) is incorrect? 

[incomplete preferences w/ caprice] would be equivalent to 1. choosing the best policy by ranking them in the partial order of outcomes (randomizing over multiple maxima), then 2. implementing that policy without further consideration.

Or are you saying that it is correct, but you disagree that this implies that it is "behaviorally indistinguishable from an agent with complete preferences"? If this is the case, then I think we might disagree on the definition of "behaviorally ... (read more)
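As a toy sketch of the procedure quoted above (made-up outcome labels; my own illustration, not code from either post): rank options by the partial order, pick arbitrarily among the maximal ones, then commit.

```python
import random

# Strict preferences defining the partial order: A+ > A and B+ > B;
# the A-branch and B-branch outcomes are incomparable with each other.
strictly_better = {("A+", "A"), ("B+", "B")}

def dominated(x, options):
    # x is dominated if some available option is strictly preferred to it.
    return any((y, x) in strictly_better for y in options)

def choose_and_commit(options):
    maximal = [x for x in options if not dominated(x, options)]
    return random.choice(maximal)  # "caprice": arbitrary pick among maximal options

# The agent never picks a dominated outcome, but may end up at either A+ or B+.
print(choose_and_commit(["A", "A+", "B", "B+"]))
```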

I think it's important to note the OOD push that comes from online-accumulated knowledge and reasoning. Probably you include this as a distortion or subversion, but that's not quite the framing I'd use. It's not taking a "good" machine and breaking it, it's taking a slightly-broken-but-works machine and putting it into a very different situation where the broken parts become load-bearing.

My overall reaction is yep, this is a modal-ish pathway for AGI development (but there are other, quite different stories that seem plausible also).

Hmm good point. Looking at your dialogues has changed my mind, they have higher karma than the ones I was looking at.

You might also be unusual on some axis that makes arguments easier. It takes me a lot of time to go over people's words and work out what beliefs are consistent with them. And the inverse, translating model to words, also takes a while.

Dialogues are more difficult to create (if done well between people with different beliefs), and are less pleasant to read, but are often higher value for reaching true beliefs as a group.

8ryan_greenblatt
The dialogues I've done have all been substantially less time investment than basically any of my posts.

Dialogues seem under-incentivised relative to comments, given the amount of effort involved. Maybe they would get more karma if we could vote on individual replies, so it's more like a comment chain?

This could also help with skimming a dialogue because you can skip to the best parts, to see whether it's worth reading the whole thing.

2ryan_greenblatt
I don't see a reason to give dialogues more karma than posts, but I agree posts (including dialogues) are under-incentivized relative to comments.

The ideal situation understanding-wise is that we understand AI at an algorithmic level. We can say stuff like: there are X,Y,Z components of the algorithm, and X passes (e.g.) beliefs to Y in format b, and Z can be viewed as a function that takes information in format w and links it with... etc. And infrabayes might be the theory you use to explain what some of the internal datastructures mean. Heuristic arguments might be how some subcomponent of the algorithm works. Most theoretical AI work (both from the alignment community and in normal AI and ML theo... (read more)

1Jonas Hallgren
Okay, that makes sense to me so thank you for explaining! I guess what I was pointing at with the language thing is the question of what the actual underlying objects that you called XYZ were and their relation to the linguistic explanation of language as a contextually dependent symbol defined by many scenarios rather than some sort of logic. Like if we use IB it might be easy to look at that as a probability distribution of probability distributions? I just thought it was interesting to get some more context on how language might help in an alignment plan.

Fair enough, good points. I guess I classify these LLM agents as "something-like-an-LLM that is genuinely creative", at least to some extent.

I don't think the first example is great, though; it seems more like a capability/observation-bandwidth issue.

4Garrett Baker
I think you can have multiple failures at the same time. The reason I think this was also goodhart was because I think the failure-mode could have been averted if sonnet was told “collect wood WITHOUT BREAKING MY HOUSE” ahead of time.

I'm not sure how this is different from the solution I describe in the latter half of the post.

Great comment, agreed. There was some suggestion of (3), and maybe there was too much. I think there are times when expectations about the plan are equivalent to literal desires about how the task should be done. For making coffee, I expect that it won't create much noise. But also, I actually want the coffee-making to not be particularly noisy, and if it's the case that the first plan for making coffee also creates a lot of noise as a side effect, this is a situation where something in the goal specification has gone horribly wrong (and there should be some institutional response).

Yeah I think I remember Stuart talking about agents that request clarification whenever they are uncertain about how a concept generalizes. That is vaguely similar. I can't remember whether he proposed any way to make that reflectively stable though.

From the perspective of this post, wouldn't natural language work a bit as a redundancy specifier in that case and so LLMs are more alignable than RL agents?

LLMs in their current form don't really cause Edge Instantiation problems. Plausibly this is because they internally implement many kinds of regularization... (read more)

3Jonas Hallgren
Those are some great points, made me think of some more questions. Any thoughts on what language "understood vs not understood" might be in? ARC Heuristic arguments or something like infrabayesianism? Like what is the type signature of this and how does this relate to what you wrote in the post? Also what is its relation to natural language?
4Garrett Baker
If you put current language models in weird situations & give them a goal, I’d say they do do edge instantiation, without the missing “creativity” ingredient. Eg see claude sonnet in minecraft repurposing someone’s house for wood after being asked to collect wood. Edit: There are other instances of this too, where you can tell claude to protect you in minecraft, and it will constantly tp to your position, and build walls around you when monsters are around. Protecting you, but also preventing any movement or fun you may have wanted to have.

Yeah I agree there are similarities. I think a benefit of my approach, that I should have emphasized more, is that it's reflectively stable (and theoretically simple and therefore easy to analyze). In your description of an AI that wants to seek clarification, it isn't clear that it won't self-modify (but it's hard to tell).

There’s a general problem that people will want AGIs to find clever out-of-the-box solutions to problems, and there’s no principled distinction between “finding a clever out-of-the-box solution to a problem” and “Goodharting the problem

... (read more)

The Alice and Bob example isn't a good argument against the independence axiom. The combined agent can be represented using a fact-conditional utility function. Include the event "get job offer" in the outcome space, so that the combined utility function is a function of that fact.

E.g.

Bob {A: 0, B: 0.5, C: 1}

Alice {A: 0.3, B: 0, C: 0}

Should merge to become

AliceBob {Ao: 0, Bo: 0.5, Co: 1, A¬o: 0, B¬o: 0, C¬o: 0.3}, where o="get job offer".

This is a far more natural way to combine agents. We can avoid the ontologically weird mixing of probabilities and prefe... (read more)
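Here's a minimal sketch of that merge (the utilities are copied from above; the 60% offer probability in the example is a made-up number):

```python
# Fact-conditional utility: the outcome space is extended with the event
# o = "get job offer", and the merged agent has one utility function over
# (outcome, o) pairs. Numbers are taken from the merge above.
alicebob = {
    ("A", True): 0.0, ("B", True): 0.5, ("C", True): 1.0,    # conditional on o
    ("A", False): 0.0, ("B", False): 0.0, ("C", False): 0.3,  # conditional on ¬o
}

def expected_utility(outcome, p_offer):
    # Ordinary expected utility over the extended outcome space; no mixing of
    # probabilities into the preference representation itself.
    return p_offer * alicebob[(outcome, True)] + (1 - p_offer) * alicebob[(outcome, False)]

# Example with a hypothetical 60% chance of the job offer:
for x in ("A", "B", "C"):
    print(x, expected_utility(x, p_offer=0.6))
```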

Excited to attend, the 2023 conference was great!

Can we submit talks?

2Alexander Gietelink Oldenziel
Yes, this should be an option in the form.

Yeah I can see how Scott's quote can be interpreted that way. I think the people listed would usually be more careful with their words. But also, Scott isn't necessarily claiming what you say he is. Everyone agrees that when you prompt a base model to act agentically, it can kinda do so. This can happen during RLHF. Properties of this behaviour will be absorbed from pretraining data, including moral systems. I don't know how Scott is imagining this, but it needn't be an inner homunculus that has consistent goals.

I think the thread below with Daniel and Evan... (read more)

but his takes were probably a little more predictably unwelcome in this venue

I hope he doesn't feel his takes are unwelcome here. I think they're empirically very welcome. His posts seem to have a roughly similar level of controversy and popularity as e.g. so8res. I'm pretty sad that he largely stopped engaging with lesswrong.

There's definitely value to being (rudely?) shaken out of lazy habits of thinking [...] and I think Alex has a knack for (at least sometimes correctly) calling out others' confusion or equivocation.

Yeah I agree, that's why I like to read Alex's takes.

Really appreciate dialogues like this. This kind of engagement across worldviews should happen far more, and I'd love to do more of it myself.[1]

Some aspects were slightly disappointing:

  • Alex keeps putting (inaccurate) words in the mouths of people he disagrees with, without citation. E.g.
    • 'we still haven't seen consistent-across-contexts agency from pretrained systems, a possibility seriously grappled with by eg The Parable of Predict-O-Matic).'
      • That post was  describing a very different kind of AI than generative language models. In particular, it is e
... (read more)
31a3orn
Without going deeply into history -- many people saying that this is a risk from pretraining LLMs is not a strawman, let alone an extreme strawman. For instance, here's Scott Alexander from like 3 weeks ago, outlining why a lack of "corrigibility" is scary: So he's concerned about precisely this supposedly extreme strawman. Granted, it would be harder to dig up the precise quotes for all the people mentioned in the text.
5Oliver Sourbut
Thanks for this! I hadn't seen those quotes, or at least hadn't remembered them. I actually really appreciate Alex sticking his neck out a bit here and suggesting this LessWrong dialogue. We both have some contrary opinions, but his takes were probably a little more predictably unwelcome in this venue. (Maybe we should try this on a different crowd - we could try rendering this on Twitter too, lol.) There's definitely value to being (rudely?) shaken out of lazy habits of thinking - though I might not personally accuse someone of fanfiction research! As discussed in the dialogue, I'm still unsure the exact extent of correct- vs mis-interpretation and I think Alex has a knack for (at least sometimes correctly) calling out others' confusion or equivocation.

Tsvi has many underrated posts. This one was rated correctly.

I didn't previously have a crisp conceptual handle for the category that Tsvi calls Playful Thinking. Initially it seemed a slightly unnatural category. Now it's such a natural category that perhaps it should be called "Thinking", and other kinds should be the ones with a modifier (e.g. maybe Directed Thinking?).

 Tsvi gives many theoretical justifications for engaging in Playful Thinking. I want to talk about one because it was only briefly mentioned in the post: 

Your sense of fun decor

... (read more)

This post deserves to be remembered as a LessWrong classic. 

  1. It directly tries to solve a difficult and important cluster of problems (whether it succeeds is yet to be seen).
  2. It uses a new diagrammatic method of manipulating sets of independence relations.
  3. It's a technical result! These feel like they're getting rarer on LessWrong and should be encouraged.

There are several problems that are fundamentally about attaching very different world models together and transferring information from one to the other. 

... (read more)

I'm curious whether the recent trend toward bi-level optimization via chain-of-thought was any update for you? I would have thought this would have updated people (partially?) back toward actually-evolution-was-a-decent-analogy.

There's this paragraph, which seems right-ish to me: 

In order to experience a sharp left turn that arose due to the same mechanistic reasons as the sharp left turn of human evolution, an AI developer would have to:

  1. Deliberately create a (very obvious[2]) inner optimizer, whose inner loss function includes no mention of human val
... (read more)
9niplav
I haven't evaluated this particular analogy for optimization on the CoT, since I don't think the evolution analogy is necessary to see why optimizing on the CoT is a bad idea. (Or, at the very least, whether optimizing on the CoT is a bad idea is independent from whether evolution was successful). I probably should,,,

TL;DR: Disanalogies: training can update the model on the contents of the CoT, while evolution can not update on the percepts of an organism; also CoT systems aren't re-initialized after long CoTs so they retain representations of human values. So a CoT is unlike the life of an organism.

Details: Claude voice Let me think about this step by step… Evolution optimizes the learning algorithm + reward architecture for organisms, those organisms then learn based on feedback from the environment. Evolution only gets really sparse feedback, namely how many offspring the organism had, and how many offspring those offspring had in turn (&c) (in the case of sexual reproduction). Humans choose the learning algorithm (e.g. transformers) + the reward system (search depth/breadth, number of samples, whether to use a novelty sampler like entropix, …).

I guess one might want to disambiguate what is analogized to the lifetime learning of the organism: A single CoT, or all CoTs in the training process. A difference in both cases is that the reward process can be set up so that SGD updates on the contents of the CoT, not just on whether the result was achieved (unlike in the evolution case, where evolution has no way of encoding the observations of an organism into the genome of its offspring (modulo epigenetics-blah)).

My expectation for a while[1] has been that people are going to COCONUT away any possibility of updating the weights on a function of the contents of the CoT because (by the bitter lesson) human language just isn't the best representation for every problem[2], but the fact that with current setups it's possible is a difference from the paragraph you

I think we currently do not have good gears level models of lots of the important questions of AI/cognition/alignment, and I think the way to get there is by treating it as a software/physicalist/engineering problem, not presupposing an already higher level agentic/psychological/functionalist framing.

Here's two ways that a high-level model can be wrong:

  • It isn't detailed enough, but once you learn the detail it adds up to basically the same picture. E.g. Newtonian physics, ideal gas laws. When you get a more detailed model, you learn more about which edge-c
... (read more)

because stabler optimization tends to be more powerful / influential / able-to-skillfully-and-forcefully-steer-the-future

I personally doubt that this is true, which is maybe the crux here.

Would you like to do a dialogue about this? To me it seems clearly true in exactly the same way that having more time to pursue a goal makes it more likely you will achieve that goal.

It's possible another crux is related to the danger of Goodharting, which I think you are exaggerating. When an agent actually understands what it wants, and/or understands the l... (read more)

There are multiple ways to interpret "being an actual human". I interpret it as pointing at an ability level.

"the task GPTs are being trained on is harder" => the prediction objective doesn't top out at (i.e. the task has more difficulty in it than).

"than being an actual human" => the ability level of a human (i.e. the task of matching the human ability level at the relevant set of tasks).

Or as Eliezer said:

I said that GPT's task is harder than being an actual human; in other words, being an actual human is not enough to solve GPT's task.

In different... (read more)

The OP argument boils down to: the text prediction objective doesn't stop incentivizing higher capabilities once you get to human level capabilities. This is a valid counter-argument to: GPTs will cap out at human capabilities because humans generated the training data.

Your central point is: 

Where GPT and humans differ is not some general mathematical fact about the task,  but differences in what sensory data is a human and GPT trying to predict, and differences in cognitive architecture and ways how the systems are bounded.

You are misinterpretin... (read more)

2Jan_Kulveit
The question is not about the very general claim, or general argument, but about this specific reasoning step. I do claim this is not locally valid, that's all (and recommend reading the linked essay). I do not claim the broad argument that text prediction objective doesn't stop incentivizing higher capabilities once you get to human level capabilities is wrong. I do agree communication can be hard, and maybe I misunderstand the quoted two sentences, but it seems very natural to read them as making a comparison between tasks at the level of math.

I sometimes think of alignment as having two barriers: 

  • Obtaining levers that can be used to design and shape an AGI in development.
  • Developing theory that predicts the effect of your design choices.

My current understanding of your agenda, in my own words:

You're trying to create a low-capability AI paradigm that has way more levers. This paradigm centers on building useful systems by patching together LLM calls. You're collecting a set of useful tactics for doing this patching. You can rely on tactics in a similar way to how we rely on programming langu... (read more)

Thanks for the comment!

Have I understood this correctly?

I am most confident in phases 1-3 of this agenda, and I think you have overall a pretty good rephrasing of 1-5, thanks! One note is that I don't think of "LLM calls" as being fundamental; I think of LLMs as a stand-in for "banks of patterns" or "piles of shards of cognition." The exact shape of this can vary; LLMs are just our current most common shape of "cognition engines", but I can think of many other, potentially better, shapes this "neural primitive/co-processor" could take.

I think there is s... (read more)

Yeah, I read that prize contest post; that was much of where I got my impression of the "consensus". It didn't really describe which parts you still considered valuable. I'd be curious to know which they are. My understanding was that most of the conclusions made in that post were downstream of the Landauer limit argument.

Could you explain or directly link to something about the 4x claim? Seems wrong. Communication speed scales with distance not area.

Jacob Cannell's brain efficiency post

I thought the consensus on that post was that it was mostly bullshit?

4Alexander Gietelink Oldenziel
Sorry, I phrased this wrong. You are right. I meant roundtrip time, which is twice the length but scales linearly, not quadratically. I actually ran the debate contest to get to the bottom of Jake Cannell's arguments. Some of the arguments, especially around the Landauer argument, don't hold up, but I think it's important not to throw out the baby with the bathwater. I think most of the analysis holds up. https://www.lesswrong.com/posts/fm88c8SvXvemk3BhW/brain-efficiency-cannell-prize-contest-award-ceremony

These seem right, but more importantly I think it would eliminate investing in new scalable companies. Or dramatically reduce it in the 50% case. So there would be very few new companies created.

(As a side note: Maybe our response to this proposal was a bit cruel. It might have been better to just point toward some econ reading material).

4Alexander Gietelink Oldenziel
I was about to delete my message because I was afraid it was a bit much, but then the likes started streaming in, and god knows how much of a sloot I am for internet validation.

would hopefully include many people who understand that understanding constraints is key and that past research understood some constraints.

Good point, I'm convinced by this. 

build on past agent foundations research

I don't really agree with this. Why do you say this?

That's my guess at the level of engagement required to understand something. Maybe just because when I've tried to use or modify some research that I thought I understood, I always realise I didn't understand it deeply enough. I'm probably anchoring too hard on my own experience here, othe... (read more)

6TsviBT
Hm. A couple things:
  • Existing AF research is rooted in core questions about alignment.
  • Existing AF research, pound for pound / word for word, and even idea for idea, is much more unnecessary stuff than necessary stuff. (Which is to be expected.)
  • Existing AF research is among the best sources of compute-traces of trying to figure some of this stuff out (next to perhaps some philosophy and some other math).
  • Empirically, most people who set out to study existing AF fail to get many of the deep lessons.
  • There's a key dimension of: how much are you always asking for the context? E.g.: Why did this feel like a mainline question to investigate? If we understood this, what could we then do / understand? If we don't understand this, are we doomed / how are we doomed? Are there ways around that? What's the argument, more clearly?
  • It's more important whether people are doing that, than whether / how exactly they engage with existing AF research.
  • If people are doing that, they'll usually migrate away from playing with / extending existing AF, towards the more core (more difficult) problems.

Ah ok you're right that that was the original claim. I mentally autosteelmanned.

I agree this would be a great program to run, but I want to call it a different lever to the one I was referring to.

The only thing I would change is that I think new researchers need to understand the purpose and value of past agent foundations research. I spent too long searching for novel ideas while I still misunderstood the main constraints of alignment. I expect you'd get a lot of wasted effort if you asked for out-of-paradigm ideas. Instead it might be better to ask for people to understand and build on past agent foundations research, then gradually... (read more)

5TsviBT
We agree this is a crucial lever, and we agree that the bar for funding has to be in some way "high". I'm arguing for a bar that's differently shaped. The set of "people established enough in AGI alignment that they get 5 [fund a person for 2 years and maybe more depending how things go in low-bandwidth mentorship, no questions asked] tokens" would hopefully include many people who understand that understanding constraints is key and that past research understood some constraints. I don't really agree with this. Why do you say this? I agree with this in isolation. I think some programs do state something about OOP ideas, and I agree that the statement itself does not come close to solving the problem. (Also I'm confused about the discourse in this thread (which is fine), because I thought we were discussing "how / how much should grantmakers let the money flow".)

The main thing I'm referring to is upskilling or career transition grants, especially from LTFF, in the last couple of years. I don't have stats; I'm assuming there were a lot given out because I met a lot of people who had received them. Probably there were a bunch given out by the FTX Future Fund also.

Also when I did MATS, many of us got grants post-MATS to continue our research. Relatively little seems to have come of these.

How are they falling short?

(I sound negative about these grants but I'm not, and I do want more stuff like that to happen. If I we... (read more)

TsviBT1812

upskilling or career transition grants, especially from LTFF, in the last couple of years

Interesting; I'm less aware of these.

How are they falling short?

I'll answer as though I know what's going on in various private processes, but I don't, and therefore could easily be wrong. I assume some of these are sort of done somewhere, but not enough and not together enough.

  • Favor insightful critiques and orientations as much as constructive ideas. If you have a large search space and little traction, a half-plane of rejects is as or more valuable than a gu
... (read more)

I think I disagree. This is a bandit problem, and grantmakers have tried pulling that lever a bunch of times. There hasn't been any field-changing research (yet). They knew it had a low chance of success so it's not a big update. But it is a small update.

Probably the optimal move isn't cutting early-career support entirely, but having a higher bar seems correct. There are other levers that are worth trying, and we don't have the resources to try every lever.

Also there are more grifters now that the word is out, so the EV is also declining that way.

(I feel bad saying this as someone who benefited a lot from early-career financial support).

TsviBT198

grantmakers have tried pulling that lever a bunch of times

What do you mean by this? I can think of lots of things that seem in some broad class of pulling some lever that kinda looks like this, but most of the ones I'm aware of fall greatly short of being an appropriate attempt to leverage smart young creative motivated would-be AGI alignment insight-havers. So the update should be much smaller (or there's a bunch of stuff I'm not aware of).

Answer by Jeremy Gillen*70

My first exposure to rationalists was a Rationally Speaking episode where Julia recommended the movie Locke.

It's about a man pursuing difficult goals under emotional stress using few tools. For me it was a great way to be introduced to rationalism because it showed how a ~rational actor could look very different from a straw Vulcan.

It's also a great movie.

Nice.

A similar rule of thumb I find handy is 70 divided by the growth rate (in percent) to get the doubling time implied by that growth rate. I find it way easier to think about doubling times than growth rates.

E.g. 3% interest rate means 70/3 ≈ 23 year doubling time.
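A quick sanity check of the rule (a sketch I've added; the exact doubling time under compound growth is ln(2)/ln(1+r), and ln(2) ≈ 0.69 is where the 70 comes from):

```python
import math

def doubling_time_exact(rate):
    # Exact doubling time under compound growth at `rate` per period.
    return math.log(2) / math.log(1 + rate)

def doubling_time_rule_of_70(rate_percent):
    # The rule of thumb: 70 divided by the growth rate in percent.
    return 70 / rate_percent

print(doubling_time_exact(0.03))     # ~23.4 years
print(doubling_time_rule_of_70(3))   # ~23.3 years
```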

8Lorenzo
Known as Rule of 72

I get the feeling that I’m still missing the point somehow and that Yudkowsky would say we still have a big chance of doom if our algorithms were created by hand with programmers whose algorithms always did exactly what they intended even when combined with their other algorithms.

I would bet against Eliezer being pessimistic about this, if we are assuming the algorithms are deeply-understood enough that we are confident that we can iterate on building AGI. I think there's maybe a problem with the way Eliezer communicates that gives people the impression th... (read more)

3Tahp
I think we're both saying the same thing here, except that the thing I'm saying implies that I would bet for Eliezer being pessimistic about this. My point was that I have a lot of pessimism that people would code something wrong even if we knew what we were trying to code, and this is where a lot of my doom comes from. Beyond that, I think we don't know what it is we're trying to code up, and you give some evidence for that.

I'm not saying that if we knew how to make good AI, it would still fail if we coded it perfectly. I'm saying we don't know how to make good AI (even though we could in principle figure it out), and also current industry standards for coding things would not get it right the first time even if we knew what we were trying to build. I feel like I basically understand the second thing, but I don't have any gears-level understanding for why it's hard to encode human desires beyond a bunch of intuitions from monkey's-paw things that go wrong if you try to come up with creative disastrous ways to accomplish what seem like laudable goals.

I don't think Eliezer is a DOOM rock, although I think a DOOM rock would be about as useful as Eliezer in practice right now because everyone making capability progress has doomed alignment strategies. My model of Eliezer's doom argument for the current timeline is approximately "programming smart stuff that does anything useful is dangerous, we don't know how to specify smart stuff that avoids that danger, and even if we did we seem to be content to train black-box algorithms until they look smarter without checking what they do before we run them." I don't understand one of the steps in that funnel of doom as well as I would like. I think that in a world where people weren't doing the obvious doomed thing of making black-box algorithms which are smart, he would instead have a last step in the funnel of "even if we knew what we need a safe algorithm to do we don't know how to write programs that do exactly what w