All of Jeremy Gillen's Comments + Replies

Hmm, good point. Looking at your dialogues has changed my mind; they have higher karma than the ones I was looking at.

You might also be unusual on some axis that makes arguments easier. It takes me a lot of time to go over people's words and work out what beliefs are consistent with them. And the inverse, translating a model into words, also takes a while.

Dialogues are more difficult to create (if done well between people with different beliefs), and are less pleasant to read, but are often higher value for reaching true beliefs as a group.

8ryan_greenblatt
The dialogues I've done have all been substantially less time investment than basically any of my posts.

Dialogues seem under-incentivised relative to comments, given the amount of effort involved. Maybe they would get more karma if we could vote on individual replies, so it's more like a comment chain?

This could also help with skimming a dialogue because you can skip to the best parts, to see whether it's worth reading the whole thing.

2ryan_greenblatt
I don't see a reason to give dialogues more karma than posts, but I agree posts (including dialogues) are under-incentivized relative to comments.

The ideal situation understanding-wise is that we understand AI at an algorithmic level. We can say stuff like: there are X,Y,Z components of the algorithm, and X passes (e.g.) beliefs to Y in format b, and Z can be viewed as a function that takes information in format w and links it with... etc. And infrabayes might be the theory you use to explain what some of the internal datastructures mean. Heuristic arguments might be how some subcomponent of the algorithm works. Most theoretical AI work (both from the alignment community and in normal AI and ML theo... (read more)

1Jonas Hallgren
Okay, that makes sense to me, so thank you for explaining! I guess what I was pointing at with the language thing is the question of what the actual underlying objects you called XYZ are, and how they relate to the linguistic view of language as a contextually dependent symbol defined by many scenarios rather than some sort of logic. Like, if we use IB, it might be easy to look at that as a probability distribution over probability distributions? I just thought it was interesting to get some more context on how language might help in an alignment plan.

Fair enough, good points. I guess I classify these LLM agents as "something-like-an-LLM that is genuinely creative", at least to some extent.

Although I don't think the first example is great, seems more like a capability/observation-bandwidth issue.

4Garrett Baker
I think you can have multiple failures at the same time. The reason I think this was also goodhart was because I think the failure-mode could have been averted if sonnet was told “collect wood WITHOUT BREAKING MY HOUSE” ahead of time.

I'm not sure how this is different from the solution I describe in the latter half of the post.

Great comment, agreed. There was some suggestion of (3), and maybe there was too much. I think there are times when expectations about the plan are equivalent to literal desires about how the task should be done. For making coffee, I expect that it won't create much noise. But also, I actually want the coffee-making to not be particularly noisy, and if it's the case that the first plan for making coffee also creates a lot of noise as a side effect, this is a situation where something in the goal specification has gone horribly wrong (and there should be some institutional response).

Yeah I think I remember Stuart talking about agents that request clarification whenever they are uncertain about how a concept generalizes. That is vaguely similar. I can't remember whether he proposed any way to make that reflectively stable though.

From the perspective of this post, wouldn't natural language work a bit as a redundancy specifier in that case and so LLMs are more alignable than RL agents?

LLMs in their current form don't really cause Edge Instantiation problems. Plausibly this is because they internally implement many kinds of regularization... (read more)

3Jonas Hallgren
Those are some great points, made me think of some more questions. Any thoughts on what language "understood vs not understood" might be in? ARC Heuristic arguments or something like infrabayesianism? Like what is the type signature of this and how does this relate to what you wrote in the post? Also what is its relation to natural language?
4Garrett Baker
If you put current language models in weird situations & give them a goal, I’d say they do do edge instantiation, without the missing “creativity” ingredient. Eg see claude sonnet in minecraft repurposing someone’s house for wood after being asked to collect wood. Edit: There are other instances of this too, where you can tell claude to protect you in minecraft, and it will constantly tp to your position, and build walls around you when monsters are around. Protecting you, but also preventing any movement or fun you may have wanted to have.

Yeah I agree there are similarities. I think a benefit of my approach, that I should have emphasized more, is that it's reflectively stable (and theoretically simple and therefore easy to analyze). In your description of an AI that wants to seek clarification, it isn't clear that it won't self-modify (but it's hard to tell).

There’s a general problem that people will want AGIs to find clever out-of-the-box solutions to problems, and there’s no principled distinction between “finding a clever out-of-the-box solution to a problem” and “Goodharting the problem

... (read more)

The Alice and Bob example isn't a good argument against the independence axiom. The combined agent can be represented using a fact-conditional utility function. Include the event "get job offer" in the outcome space, so that the combined utility function is a function of that fact.

E.g.

Bob {A: 0, B: 0.5, C: 1}

Alice {A: 0.3, B: 0, C: 0}

Should merge to become

AliceBob {Ao: 0, Bo: 0.5, Co: 1, A¬o: 0, B¬o: 0, C¬o: 0.3}, where o="get job offer".

This is a far more natural way to combine agents. We can avoid the ontologically weird mixing of probabilities and prefe... (read more)
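To make the construction concrete, here is a minimal sketch (mine, not from the original comment) of a fact-conditional merge, assuming for illustration that the merged agent defers to Bob's utilities when the offer arrives and Alice's otherwise; the numbers and the branch assignment are placeholders, not a claim about the exact example above.

```python
# Minimal sketch of a fact-conditional utility merge (illustration only).
# Assumption: the merged agent uses Bob's utilities on the o branch and
# Alice's on the ¬o branch; which sub-agent governs which branch (and the
# exact numbers) are placeholders.

bob   = {"A": 0.0, "B": 0.5, "C": 1.0}
alice = {"A": 0.3, "B": 0.0, "C": 0.0}

# Outcomes of the merged agent live in an enlarged space: (option, got_offer).
merged = {(x, True): u for x, u in bob.items()}
merged.update({(x, False): u for x, u in alice.items()})

def expected_utility(lottery):
    """lottery: dict mapping (option, got_offer) outcomes to probabilities."""
    return sum(p * merged[outcome] for outcome, p in lottery.items())

# Because "got the offer" is part of the outcome, mixing over it is ordinary
# expected-utility maximization; no violation of independence is needed.
print(expected_utility({("B", True): 0.5, ("A", False): 0.5}))  # 0.5*0.5 + 0.5*0.3 = 0.4
```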

Excited to attend, the 2023 conference was great!

Can we submit talks?

2Alexander Gietelink Oldenziel
Yes, this should be an option in the form.

Yeah I can see how Scott's quote can be interpreted that way. I think the people listed would usually be more careful with their words. But also, Scott isn't necessarily claiming what you say he is. Everyone agrees that when you prompt a base model to act agentically, it can kinda do so. This can happen during RLHF. Properties of this behaviour will be absorbed from pretraining data, including moral systems. I don't know how Scott is imagining this, but it needn't be an inner homunculus that has consistent goals.

I think the thread below with Daniel and Evan... (read more)

but his takes were probably a little more predictably unwelcome in this venue

I hope he doesn't feel his takes are unwelcome here. I think they're empirically very welcome. His posts seem to have a roughly similar level of controversy and popularity as e.g. so8res. I'm pretty sad that he largely stopped engaging with lesswrong.

There's definitely value to being (rudely?) shaken out of lazy habits of thinking [...] and I think Alex has a knack for (at least sometimes correctly) calling out others' confusion or equivocation.

Yeah I agree, that's why I like to read Alex's takes.

Really appreciate dialogues like this. This kind of engagement across worldviews should happen far more, and I'd love to do more of it myself.[1]

Some aspects were slightly disappointing:

  • Alex keeps putting (inaccurate) words in the mouths of people he disagrees with, without citation. E.g.
    • 'we still haven't seen consistent-across-contexts agency from pretrained systems, a possibility seriously grappled with by eg The Parable of Predict-O-Matic).'
      • That post was  describing a very different kind of AI than generative language models. In particular, it is e
... (read more)
31a3orn
Without going deeply into history -- many people saying that this is a risk from pretraining LLMs is not a strawman, let alone an extreme strawman. For instance, here's Scott Alexander from like 3 weeks ago, outlining why a lack of "corrigibility" is scary: So he's concerned about precisely this supposedly extreme strawman. Granted, it would be harder to dig up the precise quotes for all the people mentioned in the text.
5Oliver Sourbut
Thanks for this! I hadn't seen those quotes, or at least hadn't remembered them. I actually really appreciate Alex sticking his neck out a bit here and suggesting this LessWrong dialogue. We both have some contrary opinions, but his takes were probably a little more predictably unwelcome in this venue. (Maybe we should try this on a different crowd - we could try rendering this on Twitter too, lol.) There's definitely value to being (rudely?) shaken out of lazy habits of thinking - though I might not personally accuse someone of fanfiction research! As discussed in the dialogue, I'm still unsure the exact extent of correct- vs mis-interpretation and I think Alex has a knack for (at least sometimes correctly) calling out others' confusion or equivocation.

Tsvi has many underrated posts. This one was rated correctly.

I didn't previously have a crisp conceptual handle for the category that Tsvi calls Playful Thinking. Initially it seemed a slightly unnatural category. Now it's such a natural category that perhaps it should be called "Thinking", and other kinds should be the ones with a modifier (e.g. maybe Directed Thinking?).

 Tsvi gives many theoretical justifications for engaging in Playful Thinking. I want to talk about one because it was only briefly mentioned in the post: 

Your sense of fun decor

... (read more)

This post deserves to be remembered as a LessWrong classic. 

  1. It directly tries to solve a difficult and important cluster of problems (whether it succeeds is yet to be seen).
  2. It uses a new diagrammatic method of manipulating sets of independence relations.
  3. It's a technical result! These feel like they're getting rarer on LessWrong and should be encouraged.

There are several problems that are fundamentally about attaching very different world models together and transferring information from one to the other. 

... (read more)

I'm curious whether the recent trend toward bi-level optimization via chain-of-thought was any update for you? I would have thought this would have updated people (partially?) back toward actually-evolution-was-a-decent-analogy.

There's this paragraph, which seems right-ish to me: 

In order to experience a sharp left turn that arose due to the same mechanistic reasons as the sharp left turn of human evolution, an AI developer would have to:

  1. Deliberately create a (very obvious[2]) inner optimizer, whose inner loss function includes no mention of human val
... (read more)
9niplav
I haven't evaluated this particular analogy for optimization on the CoT, since I don't think the evolution analogy is necessary to see why optimizing on the CoT is a bad idea. (Or, at the very least, whether optimizing on the CoT is a bad idea is independent from whether evolution was successful.) I probably should...

TL;DR: Disanalogies: training can update the model on the contents of the CoT, while evolution can not update on the percepts of an organism; also CoT systems aren't re-initialized after long CoTs, so they retain representations of human values. So a CoT is unlike the life of an organism.

Details: Claude voice Let me think about this step by step…

Evolution optimizes the learning algorithm + reward architecture for organisms; those organisms then learn based on feedback from the environment. Evolution only gets really sparse feedback, namely how many offspring the organism had, and how many offspring those offspring had in turn (&c) (in the case of sexual reproduction). Humans choose the learning algorithm (e.g. transformers) + the reward system (search depth/breadth, number of samples, whether to use a novelty sampler like entropix, …).

I guess one might want to disambiguate what is analogized to the lifetime learning of the organism: a single CoT, or all CoTs in the training process. A difference in both cases is that the reward process can be set up so that SGD updates on the contents of the CoT, not just on whether the result was achieved (unlike in the evolution case, where evolution has no way of encoding the observations of an organism into the genome of its offspring (modulo epigenetics-blah)).

My expectation for a while[1] has been that people are going to COCONUT away any possibility of updating the weights on a function of the contents of the CoT because (by the bitter lesson) human language just isn't the best representation for every problem[2], but the fact that with current setups it's possible is a difference from the paragraph you

I think we currently do not have good gears level models of lots of the important questions of AI/cognition/alignment, and I think the way to get there is by treating it as a software/physicalist/engineering problem, not presupposing an already higher level agentic/psychological/functionalist framing.

Here's two ways that a high-level model can be wrong:

  • It isn't detailed enough, but once you learn the detail it adds up to basically the same picture. E.g. Newtonian physics, ideal gas laws. When you get a more detailed model, you learn more about which edge-c
... (read more)

because stabler optimization tends to be more powerful / influential / able-to-skillfully-and-forcefully-steer-the-future

I personally doubt that this is true, which is maybe the crux here.

Would you like to do a dialogue about this? To me it seems clearly true in exactly the same way that having more time to pursue a goal makes it more likely you will achieve that goal.

It's possible another crux is related to the danger of Goodharting, which I think you are exaggerating. When an agent actually understands what it wants, and/or understands the l... (read more)

There are multiple ways to interpret "being an actual human". I interpret it as pointing at an ability level.

"the task GPTs are being trained on is harder" => the prediction objective doesn't top out at (i.e. the task has more difficulty in it than).

"than being an actual human" => the ability level of a human (i.e. the task of matching the human ability level at the relevant set of tasks).

Or as Eliezer said:

I said that GPT's task is harder than being an actual human; in other words, being an actual human is not enough to solve GPT's task.

In different... (read more)

The OP argument boils down to: the text prediction objective doesn't stop incentivizing higher capabilities once you get to human level capabilities. This is a valid counter-argument to: GPTs will cap out at human capabilities because humans generated the training data.

Your central point is: 

Where GPT and humans differ is not some general mathematical fact about the task,  but differences in what sensory data is a human and GPT trying to predict, and differences in cognitive architecture and ways how the systems are bounded.

You are misinterpretin... (read more)

2Jan_Kulveit
The question is not about the very general claim, or general argument, but about this specific reasoning step. I do claim this is not locally valid, that's all (and I recommend reading the linked essay). I do not claim the broad argument that the text prediction objective doesn't stop incentivizing higher capabilities once you get to human level capabilities is wrong. I do agree communication can be hard, and maybe I misunderstand the quoted two sentences, but it seems very natural to read them as making a comparison between tasks at the level of math.

I sometimes think of alignment as having two barriers: 

  • Obtaining levers that can be used to design and shape an AGI in development.
  • Developing theory that predicts the effect of your design choices.

My current understanding of your agenda, in my own words:

You're trying to create a low-capability AI paradigm that has way more levers. This paradigm centers on building useful systems by patching together LLM calls. You're collecting a set of useful tactics for doing this patching. You can rely on tactics in a similar way to how we rely on programming langu... (read more)

Thanks for the comment!

Have I understood this correctly?

I am most confident in phases 1-3 of this agenda, and I think you have overall a pretty good rephrasing of 1-5, thanks! One note is that I don't think of "LLM calls" as fundamental; I think of LLMs as a stand-in for "banks of patterns" or "piles of shards of cognition." The exact shape of this can vary; LLMs are just our current most common shape of "cognition engines", but I can think of many other, potentially better, shapes this "neural primitive/co-processor" could take. 

I think there is s... (read more)

Yeah, I read that prize contest post; that was much of where I got my impression of the "consensus". It didn't really describe which parts you still considered valuable. I'd be curious to know which they are? My understanding was that most of the conclusions made in that post were downstream of the Landauer limit argument.

Could you explain or directly link to something about the 4x claim? Seems wrong. Communication speed scales with distance not area.

Jacob Cannell's brain efficiency post

I thought the consensus on that post was that it was mostly bullshit?

4Alexander Gietelink Oldenziel
Sorry, I phrased this wrong. You are right. I meant roundtrip time, which is twice the length but scales linearly, not quadratically. I actually ran the debate contest to get to the bottom of Jake Cannell's arguments. Parts of the argument, especially around the Landauer argument, don't hold up, but I think it's important not to throw out the baby with the bathwater. I think most of the analysis holds up. https://www.lesswrong.com/posts/fm88c8SvXvemk3BhW/brain-efficiency-cannell-prize-contest-award-ceremony

These seem right, but more importantly I think it would eliminate investing in new scalable companies. Or dramatically reduce it in the 50% case. So there would be very few new companies created.

(As a side note: Maybe our response to this proposal was a bit cruel. It might have been better to just point toward some econ reading material).

4Alexander Gietelink Oldenziel
I was about to delete my message because I was afraid it was a bit much but then the likes started streaming in and god knows how much of a sloot i am for internet validation.

would hopefully include many people who understand that understanding constraints is key and that past research understood some constraints.

Good point, I'm convinced by this. 

build on past agent foundations research

I don't really agree with this. Why do you say this?

That's my guess at the level of engagement required to understand something. Maybe just because when I've tried to use or modify some research that I thought I understood, I always realise I didn't understand it deeply enough. I'm probably anchoring too hard on my own experience here, othe... (read more)

6TsviBT
Hm. A couple things:

  • Existing AF research is rooted in core questions about alignment.
  • Existing AF research, pound for pound / word for word, and even idea for idea, is much more unnecessary stuff than necessary stuff. (Which is to be expected.)
  • Existing AF research is among the best sources of compute-traces of trying to figure some of this stuff out (next to perhaps some philosophy and some other math).
  • Empirically, most people who set out to stuff existing AF fail to get many of the deep lessons.
  • There's a key dimension of: how much are you always asking for the context? E.g.: Why did this feel like a mainline question to investigate? If we understood this, what could we then do / understand? If we don't understand this, are we doomed / how are we doomed? Are there ways around that? What's the argument, more clearly?
  • It's more important whether people are doing that, than whether / how exactly they engage with existing AF research.
  • If people are doing that, they'll usually migrate away from playing with / extending existing AF, towards the more core (more difficult) problems.

Ah ok you're right that that was the original claim. I mentally autosteelmanned.

I agree this would be a great program to run, but I want to call it a different lever to the one I was referring to.

The only thing I would change is that I think new researchers need to understand the purpose and value of past agent foundations research. I spent too long searching for novel ideas while I still misunderstood the main constraints of alignment. I expect you'd get a lot of wasted effort if you asked for out-of-paradigm ideas. Instead it might be better to ask for people to understand and build on past agent foundations research, then gradually... (read more)

5TsviBT
We agree this is a crucial lever, and we agree that the bar for funding has to be in some way "high". I'm arguing for a bar that's differently shaped. The set of "people established enough in AGI alignment that they get 5 [fund a person for 2 years and maybe more depending how things go in low-bandwidth mentorship, no questions asked] tokens" would hopefully include many people who understand that understanding constraints is key and that past research understood some constraints.

I don't really agree with this. Why do you say this?

I agree with this in isolation. I think some programs do state something about OOP ideas, and I agree that the statement itself does not come close to solving the problem. (Also I'm confused about the discourse in this thread (which is fine), because I thought we were discussing "how / how much should grantmakers let the money flow".)

The main thing I'm referring to are upskilling or career transition grants, especially from LTFF, in the last couple of years. I don't have stats, I'm assuming there were a lot given out because I met a lot of people who had received them. Probably there were a bunch given out by the ftx future fund also.

Also when I did MATS, many of us got grants post-MATS to continue our research. Relatively little seems to have come of these.

How are they falling short?

(I sound negative about these grants but I'm not, and I do want more stuff like that to happen. If I we... (read more)

upskilling or career transition grants, especially from LTFF, in the last couple of years

Interesting; I'm less aware of these.

How are they falling short?

I'll answer as though I know what's going on in various private processes, but I don't, and therefore could easily be wrong. I assume some of these are sort of done somewhere, but not enough and not together enough.

  • Favor insightful critiques and orientations as much as constructive ideas. If you have a large search space and little traction, a half-plane of rejects is as or more valuable than a gu
... (read more)

I think I disagree. This is a bandit problem, and grantmakers have tried pulling that lever a bunch of times. There hasn't been any field-changing research (yet). They knew it had a low chance of success so it's not a big update. But it is a small update.

Probably the optimal move isn't cutting early-career support entirely, but having a higher bar seems correct. There are other levers that are worth trying, and we don't have the resources to try every lever.

Also there are more grifters now that the word is out, so the EV is also declining that way.

(I feel bad saying this as someone who benefited a lot from early-career financial support).

grantmakers have tried pulling that lever a bunch of times

What do you mean by this? I can think of lots of things that seem in some broad class of pulling some lever that kinda looks like this, but most of the ones I'm aware of fall greatly short of being an appropriate attempt to leverage smart young creative motivated would-be AGI alignment insight-havers. So the update should be much smaller (or there's a bunch of stuff I'm not aware of).

My first exposure to rationalists was a Rationally Speaking episode where Julia recommended the movie Locke.

It's about a man pursuing difficult goals under emotional stress using few tools. For me it was a great way to be introduced to rationalism because it showed how a ~rational actor could look very different from a straw Vulcan.

It's also a great movie.

Nice.

A similar rule of thumb I find handy is dividing 70 by the growth rate (in percent) to get the implied doubling time. I find it way easier to think about doubling times than growth rates.

E.g. 3% interest rate means 70/3 ≈ 23 year doubling time.
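A quick numerical check of the rule (my sketch; the exact doubling time is ln 2 / ln(1 + r)):

```python
import math

# Compare the rule-of-70 approximation to the exact doubling time ln(2)/ln(1+r).
# (The "Rule of 72" mentioned in the reply below just uses 72 in place of 70.)
for pct in (1, 3, 7, 10):
    approx = 70 / pct
    exact = math.log(2) / math.log(1 + pct / 100)
    print(f"{pct}% growth: rule of 70 ≈ {approx:.1f} yrs, exact ≈ {exact:.1f} yrs")
```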

8Lorenzo
Known as Rule of 72

I get the feeling that I’m still missing the point somehow and that Yudkowsky would say we still have a big chance of doom if our algorithms were created by hand with programmers whose algorithms always did exactly what they intended even when combined with their other algorithms.

I would bet against Eliezer being pessimistic about this, if we are assuming the algorithms are deeply-understood enough that we are confident that we can iterate on building AGI. I think there's maybe a problem with the way Eliezer communicates that gives people the impression th... (read more)

3Tahp
I think we're both saying the same thing here, except that the thing I'm saying implies that I would bet for Eliezer being pessimistic about this. My point was that I have a lot of pessimism that people would code something wrong even if we knew what we were trying to code, and this is where a lot of my doom comes from. Beyond that, I think we don't know what it is we're trying to code up, and you give some evidence for that.

I'm not saying that if we knew how to make good AI, it would still fail if we coded it perfectly. I'm saying we don't know how to make good AI (even though we could in principle figure it out), and also current industry standards for coding things would not get it right the first time even if we knew what we were trying to build. I feel like I basically understand the second thing, but I don't have any gears-level understanding for why it's hard to encode human desires beyond a bunch of intuitions from monkey's-paw things that go wrong if you try to come up with creative disastrous ways to accomplish what seem like laudable goals.

I don't think Eliezer is a DOOM rock, although I think a DOOM rock would be about as useful as Eliezer in practice right now because everyone making capability progress has doomed alignment strategies. My model of Eliezer's doom argument for the current timeline is approximately "programming smart stuff that does anything useful is dangerous, we don't know how to specify smart stuff that avoids that danger, and even if we did we seem to be content to train black-box algorithms until they look smarter without checking what they do before we run them." I don't understand one of the steps in that funnel of doom as well as I would like. I think that in a world where people weren't doing the obvious doomed thing of making black-box algorithms which are smart, he would instead have a last step in the funnel of "even if we knew what we need a safe algorithm to do we don't know how to write programs that do exactly what w

Intelligence/IQ is always good, but not a dealbreaker as long as you can substitute it with a larger population.

IMO this is pretty obviously wrong. There are some kinds of problem solving that scale poorly with population, just as there are some computations that scale poorly with parallelisation.

E.g. project euler problems.

When I said "problems we care about", I was referring to a cluster of problems that very strongly appear to not scale well with population. Maybe this is an intuitive picture of the cluster of problems I'm referring to.

2Noosphere89
On this: I think the problem identified here is in large part a demand problem, in that lots of AI people only wanted AI capabilities, and didn't care for AI interpretability at all, so once the scaling happened, a lot of the focus went purely to AI scaling. (Which is an interesting example of Goodhart's law in action, perhaps.) See here: https://www.lesswrong.com/posts/gXinMpNJcXXgSTEpn/ai-craftsmanship#Qm8Kg7PjZoPTyxrr6

I definitely agree that there exist such problems where the scaling with population is pretty bad, but I'll give 2 responses here:

  1. The differences between a human-level AI and an actual human are the ability to coordinate and share ontologies better between millions of instances, so the common problems that arise when trying to factorize out problems are greatly reduced.
  2. I think that while there are serial bottlenecks to lots of problem solving in the real world, such that they prevent hyperfast outcomes, I don't think serial bottlenecks are the dominating factor, because the stuff that is parallelizable, like good execution, is often far more valuable than the inherently serial computations, like deep/original ideas.

I buy that such an intervention is possible. But doing it requires understanding the internals at a deep level. You can't expect SGD to implement the patch in a robust way. The patch would need to still be working after 6 months on an impossible problem, in spite of it actively getting in the way of finding the solution!

I'd be curious about why it isn't changing the picture quite a lot, maybe after you've chewed on the ideas. From my perspective it makes the entire non-reflective-AI-via-training pathway not worth pursuing. At least for large scale thinking.

3james.lucassen
It doesn't change the picture a lot because the proposal for preventing misaligned goals from arising via this mechanism was to try and get control over when the AI does/doesn't step back, in order to allow it in the capability-critical cases but disallow it in the dangerous cases. This argument means you'll have more attempts at dangerous stepping back that you have to catch, but doesn't break the strategy.

The strategy does break if when we do this blocking, the AI piles on more and more effort trying to unblock it until it either succeeds or is rendered useless for anything else. There being more baseline attempts probably raises the chance of that or some other problem that makes prolonged censorship while maintaining capabilities impossible. But again, just makes it harder, doesn't break it.

I don't think you need to have that pile-on property to be useful. Consider MTTR(n), the mean time an LLM takes to realize it's made a mistake, parameterized by how far up the stack the mistake was. By default you'll want to have short MTTR for all n. But if you can get your MTTR short enough for small n, you can afford to have MTTR long for large n. Basically, this agent tends to get stuck/rabbit-hole/nerd-snipe but only when the mistake that caused it to get stuck was made a long time ago. Imagine a capabilities scheme where you train MTTR using synthetic data with an explicit stack and intentionally introduced mistakes. If you're worried about this destabilization threat model, there's a pretty clear recommendation: only train for small-n MTTR, treat large-n MTTR as a dangerous capability, and you pay some alignment tax in the form of inefficient MTTR training and occasionally rebooting your agent when it does get stuck in a non dangerous case.

Figured I should get back to this comment but unfortunately the chewing continues. Hoping to get a short post out soon with my all things considered thoughts on whether this direction has any legs.

I was probably influenced by your ideas! I just (re?)read your post on the topic.

Tbh I think it's unlikely such a sweet spot exists, and I find your example unconvincing. The value of this kind of reflection for difficult problem solving directly conflicts with the "useful" assumption. 

I'd be more convinced if you described the task where you expect an AI to be useful (significantly above current humans), and doesn't involve failing and reevaluating high-level strategy every now and then.

3james.lucassen
I agree that I wouldn't want to lean on the sweet-spot-by-default version of this, and I agree that the example is less strong than I thought it was. I still think there might be safety gains to be had from blocking higher level reflection if you can do it without damaging lower level reflection. I don't think that requires a task where the AI doesn't try and fail and re-evaluate - it just requires that the re-evalution never climbs above a certain level in the stack. There's such a thing as being pathologically persistent, and such a thing as being pathologically flaky. It doesn't seem too hard to train a model that will be pathologically persistent in some domains while remaining functional in others. A lot of my current uncertainty is bound up in how robust these boundaries are going to have to be.

Extremely underrated post, I'm sorry I only skimmed it when it came out.

I found 3a,b,c to be strong and well written, a good representation of my view. 

In contrast, 3d I found to be a weak argument that I didn't identify with. In particular, I don't think internal conflicts are a good way to explain the source of goal misgeneralization. To me it's better described as just overfitting or misgeneralization.[1] Edge cases in goals are clearly going to be explored by a stepping back process, if initial attempts fail. In particular if attempted pathwa... (read more)

3james.lucassen
Yup not obvious. I do in fact think a lot more humans would be helpful. But I also agree that my mental picture of "transformative human level research assistant" relies heavily on serial speedup, and I can't immediately picture a version that feels similarly transformative without speedup. Maybe evhub or Ethan Perez or one of the folks running a thousand research threads at once would disagree.
7james.lucassen
Hmm. This is a good point, and I agree that it significantly weakens the analogy. I was originally going to counter-argue and claim something like "sure total failure forces you to step back far but it doesn't mean you have to step back literally all the way". Then I tried to back that up with an example, such as "when I was doing alignment research, I encountered total failure that forced me to abandon large chunks of planning stack, but this never caused me to 'spill upward' to questioning whether or not I should be doing alignment research at all". But uh then I realized that isn't actually true :/ On consideration, yup this obviously matters. The thing that causes you to step back from a goal is that goal being a bad way to accomplish its supergoal, aka "too difficult". Can't believe I missed this, thanks for pointing it out. I don't think this changes the picture too much, besides increasing my estimate of how much optimization we'll have to do to catch and prevent value-reflection. But a lot of muddy half-ideas came out of this that I'm interested in chewing on.
2Noosphere89
I admit, I think this is kind of a crux, but let me get down to this statement:

One big difference between a human-level AI and a real human is coordination costs: Even without advanced decision theories like FDT/UDT/LDT, the ability to have millions of copies of an AI makes it possible for them to all have similar values, and divergences between them are more controllable in a virtual environment than a physical environment.

But my more substantive claim is that lots of how progress is made in the real world is because population growth allows for more complicated economies, more ability to specialize without losing essential skills, and just simply more data to deal with reality, and alignment, including strong alignment, is not different here. Indeed, I'd argue that a lot more alignment progress happened in the 2022-2024 period than the 2005-2015 period, and while I don't credit it all to population growth of alignment researchers, I do think a reasonably significant amount of the progress happened because we got more people into alignment.

Intelligence/IQ is always good, but not a dealbreaker as long as you can substitute it with a larger population. See these quotes from Carl Shulman here for why:

The link for these quotes is here below: https://www.lesswrong.com/posts/BdPjLDG3PBjZLd5QY/carl-shulman-on-dwarkesh-podcast-june-2023#Can_we_detect_deception_

Trying to write a new steelman of Matt's view. It's probably incorrect, but seems good to post as a measure of progress:

You believe in agentic capabilities generalizing, but also in additional high-level patterns that generalize and often overpower agentic behaviour. You expect training to learn all the algorithms required for intelligence, but also pick up patterns in the data like "research style", maybe "personality", maybe "things a person wouldn't do" and also build those into the various-algorithms-that-add-up-to-intelligence at a deep level. In part... (read more)

3james.lucassen
Maybe I'm just reading my own frames into your words, but this feels quite similar to the rough model of human-level LLMs I've had in the back of my mind for a while now. In particular, this reads to me like the "unstable alignment" paradigm I wrote about a while ago. You have an agent which is consequentialist enough to be useful, but not so consequentialist that it'll do things like spontaneously notice conflicts in the set of corrigible behaviors you've asked it to adhere to and undertake drastic value reflection to resolve those conflicts. You might hope to hit this sweet spot by default, because humans are in a similar sort of sweet spot. It's possible to get humans to do things they massively regret upon reflection as long as their day to day work can be done without attending to obvious clues (eg guy who's an accountant for the Nazis for 40 years and doesn't think about the Holocaust he just thinks about accounting). Or you might try and steer towards this sweet spot by developing ways to block reflection in cases where it's dangerous without interfering with it in cases where it's essential for capabilities.

Not much to add, I haven't spent enough time thinking about structural selection theorems. 

I'm a fan of making more assumptions. I've had a number of conversations with people who seem to make the mistake of not assuming enough. Sometimes leading them to incorrectly consider various things impossible. E.g. "How could an agent store a utility function over all possible worlds?" or "Rice's theorem/halting problem/incompleteness/NP-hardness/no-free-lunch theorems means it's impossible to do xyz". The answer is always nah, it's possible, we just need to t... (read more)

4Alexander Gietelink Oldenziel
Yes. I would even say that finding the right assumptions is the most important part of proving nontrivial selection theorems.
4cubefox
I think I get what you mean, though making more assumptions is perhaps not the best way to think about it. Logic is monotonic (classical logic at least), meaning that a valid proof remains valid even when adding more assumptions. The "taking advantage of some structure" seems to be different.

Good point.

What I meant by updatelessness removes most of the justification is the reason given here at the very beginning of "Against Resolute Choice". In order to make a money pump that leads the agent in a circle, the agent has to continue accepting trades around a full preference loop. But if it has decided on the entire plan beforehand, it will just do any plan that involves <1 trip around the preference loop. (Although it's unclear how it would settle on such a plan, maybe just stopping its search after a given time). It won't (I think?) choose an... (read more)

3EJT
I don't think agents that avoid the money pump for cyclicity are representable as satisfying VNM, at least holding fixed the objects of preference (as we should). Resolute choosers with cyclic preferences will reliably choose B over A- at node 3, but they'll reliably choose A- over B if choosing between these options ex nihilo. That's not VNM representable, because it requires that the utility of A- be greater than the utility of B and that the utility of B be greater than the utility of A-.

I think the problem might be that you've given this definition of heuristic:

A heuristic is a local, interpretable, and simple function (e.g., boolean/arithmetic/lookup functions) learned from the training data. There are multiple heuristics in each layer and their outputs are used in later layers.

Taking this definition seriously, it's easy to decompose a forward pass into such functions.

But you have a much more detailed idea of a heuristic in mind. You've pointed toward some properties this might have in your point (2), but haven't put it into specific wor... (read more)

3Sodium
I agree that if you put more limitations on what heuristics are and how they compose, you end up with a stronger hypothesis. I think it's probably better to leave that out and try to do some more empirical work before making a claim there, though (I suppose you could say that the hypothesis isn't actually making a lot of concrete predictions yet at this stage). I don't think (2) necessarily follows, but I do sympathize with your point that the post is perhaps a more specific version of the hypothesis that "we can understand neural network computation by doing mech interp."

I've only skimmed this post, but I like it because I think it puts into words a fairly common model (that I disagree with). I've heard "it's all just a stack of heuristics" as an explanation of neural networks and as a claim that all intelligence is this, from several people. (Probably I'm overinterpreting other people's words to some extent, they probably meant a weaker/nuanced version. But like you say, it can be useful to talk about the strong version).

I think you've correctly identified the flaw in this idea (it isn't predictive, it's unfalsifiable, so... (read more)

3Sodium
Thanks for reading my post! Here's how I think this hypothesis is helpful:

It's possible that we wouldn't be able to understand what's going on even if we had some perfect way to decompose a forward pass into interpretable constituent heuristics. I'm skeptical that this would be the case, mostly because I think (1) we can get a lot of juice out of auto-interp methods and (2) we probably wouldn't need to simultaneously understand that many heuristics at the same time (which is the case for your logic gate example for modern computers). At the minimum, I would argue that the decomposed bag of heuristics is likely to be much more interpretable than the original model itself.

Suppose that the hypothesis is true, then it at least suggests that interpretability researchers should put in more efforts to try find and study individual heuristics/circuits, as opposed to the current more "feature-centric" framework. I don't know how this would manifest itself exactly, but it felt like it's worth saying. I believe that some of the empirical work I cited suggests that we might make more incremental progress if we focused on heuristics more right now.

Trying to think this through, I'll write a bit of a braindump just in case that's useful:

The futarchy hack can be split into two parts. The first is that conditioning on untaken actions makes most probabilities ill-defined. Because there are no incentives to get it right, the market can settle to many equilibria. The second part is that there are various incentives for traders to take advantage of this for their own interests.

With your technique, I think the approach would be to duplicate each trader into two traders with the same knowledge, and mak... (read more)

I appreciate that you tried. If words are failing us to this extent, I'm going to give up.

How about I assume there is some epsilon such that the probability of an agent going off the rails is greater than epsilon in any given year. Why can't the agent split into multiple ~uncorrelated agents and have them each control some fraction of resources (maybe space) such that one off-the-rails agent can easily be fought and controlled by the others? This should reduce the risk to some fraction of epsilon, right?

(I'm gonna try and stay focused on a single point, specifically the argument that leads up to >99%, because that part seems wrong for quite simple reasons).
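As a rough illustration of the quantitative claim in the splitting argument above (my sketch, not a model from the thread): assume k agents fail independently at rate epsilon per year, and a failure only becomes uncontrollable if more than half fail in the same year.

```python
from math import comb

# Sketch of the splitting argument: k independent agents, each going off the
# rails with probability eps in a given year; assume (hypothetically) that the
# rest can contain a failure unless more than half fail at once.
def p_uncontrollable(eps, k, threshold=0.5):
    cutoff = int(threshold * k)
    return sum(comb(k, m) * eps**m * (1 - eps)**(k - m)
               for m in range(cutoff + 1, k + 1))

eps = 0.01  # illustrative per-agent, per-year off-the-rails probability
for k in (1, 5, 20):
    print(k, p_uncontrollable(eps, k))
# k=1 gives eps itself; k=20 gives ~1e-17, i.e. a tiny fraction of epsilon.
```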

1Remmelt
Got it. So we are both assuming that there would be some accumulative failure rate [per point 3.].

I tried to adopt this ~uncorrelated agents framing, and then argue from within that. But I ran up against some problems with this framing:

  • It assumes there are stable boundaries between "agents" that allow us to mark them as separate entities. This kinda works for us as physically bounded and communication-bottlenecked humans. But in practice it wouldn't really work to define "agent" separations within a larger machine network maintaining its own existence in the environment. (Also, it is not clear to me how failures of those defined "agent" subsets would necessarily be sufficiently uncorrelated – as an example, if the failure involves one subset hijacking the functioning of another subset, their failures become correlated.)
  • It assumes that if any (physical or functional) subset of this adaptive machinery happens to gain any edge in influencing the distributed flows of atoms and energy back towards own growth, that the other machinery subsets can robustly "control" for that.
  • It assumes a macroscale explanation of physical processes that build up from the microscale. Agreed that the concept of agents owning and directing the allocation of "resources" is a useful abstraction, but it also involves holding a leaky representation of what's going on. Any argument for control using that representation can turn out not to capture crucial aspects.
  • It raises the question what "off-the-rails" means here. This gets us into the hashiness model: Consider the space of possible machinery output sequences over time. How large is the subset of output sequences that in their propagation as (cascading) environmental effects would end up lethally disrupting the bodily functioning of humans? How is the accumulative probability of human extinction distributed across the entire output possibility space (or simplified: how mixed are the adjoining lethal and non-lethal p

You could call this a failure of the AGI’s goal-related systems if you mean with that that the machinery failed to control its external effects in line with internally represented goals. 

But this would be a problem with the control process itself.

So it's the AI being incompetent?

Unfortunately, there are fundamental limits that cap the extent to which the machinery can improve its own control process. 

Yeah I think that would be a good response to my argument against premise 2). I've had a quick look at the list of theorems in the paper; I don't know... (read more)

1Remmelt
Yes, but in the sense that there are limits to the AGI's capacity to sense, model, simulate, evaluate, and correct its own component effects propagating through a larger environment. If you can't simulate (and therefore predict) that a failure mode that by default is likely to happen would happen, then you cannot counterfactually act to prevent the failure mode.

Maybe take a look at the hashiness model of AGI uncontainability. That's an elegant way of representing the problem (instead of pointing at lots of examples of theorems that show limits to control). This is not put into mathematical notation yet, though. Anders Sandberg is working on it, but is also somewhat distracted. I'd value your contribution/thinking here, but I also get it if you don't want to read through the long transcripts of explanation at this stage. See the project here.

Anders' summary: "A key issue is the thesis that AGI will be uncontrollable in the sense that there is no control mechanism that can guarantee aligned behavior since the more complex and abstract the target behavior is the amount of resources and forcing ability needed become unattainable. In order to analyse this better a sufficiently general toy model is needed for how controllable systems of different complexity can be, that ideally can be analysed rigorously. One such model is to study families of binary functions parametrized by their circuit complexity and their "hashiness" (how much they mix information) as an analog for the AGI and the alignment model, and the limits to finding predicates that can keep the alignment system making the AGI analog producing a desired output."

We're talking about learning from inputs received from a more complex environment (through which AGI outputs also propagate as changed effects, of which some are received as inputs). Does Garrabrant take that into account in his self-referential reasoning?

A human democracy is composed out of humans with similar needs. This turns out to be an

I've reread and my understanding of point 3 remains the same. I wasn't trying to summarize points 1-5, to be clear. And by "goal-related systems" I just meant whatever is keeping track of the outcomes being optimized for.

Perhaps you could point me to my misunderstanding?

1Remmelt
Appreciating your openness.  (Just making dinner – will get back to this when I’m behind my laptop in around an hour). 

In practice, engineers know that complex architectures interacting with the surrounding world end up having functional failures (because of unexpected interactive effects, or noisy interference). With AGI, we are talking about an architecture here that would be replacing all our jobs and move to managing conditions across our environment. If AGI continues to persist in some form over time, failures will occur and build up toward lethality at some unknown rate. Over a long enough period, this repeated potential for uncontrolled failures pushes the risk of h

... (read more)
1Remmelt
Ah, that’s actually not the argument. Could you try read points 1-5. again?

To me it seems like one important application of this work is to understanding and fixing the futarchy hack in FixDT and in Logical Inductor decision theory. But I'm not sure whether your results can transfer to these settings, because of the requirement that the agents have the same beliefs.

Is there a reason we can't make duplicate traders in LI and have their trades be zero-sum?

I'm generally confused about this. Do you have thoughts? 

3Rubi J. Hudson
Having re-read the posts and thought about it some more, I do think zero-sum competition could be applied to logical inductors to resolve the futarchy hack. It would require minor changes to the formalism to accommodate, but I don't see how those changes would break anything else.
3Rubi J. Hudson
I'll take a look at the linked posts and let you know my thoughts soon!

The non-spicy answer is probably the LTFF, if you're happy deferring to the fund managers there. I don't know what your risk tolerance for wasting money is, but you can check whether they meet it by looking at their track record.

If you have a lot of time you might be able to find better ways to spend money than the LTFF can. (Like if you can find a good way to fund intelligence amplification as Tsvi said).

1Birk Källberg
Wanting to answer a very similar question, I've just done about a day of donation research into x-risk funds. There are three that have caught my interest:

  • Long Term Future Fund (LTFF) from EA Funds
    • In 2024, LTFF grants have gone mostly to individual TAIS researchers (also some policy folk and very small orgs) working on promising projects. Most are 3- to 12-month stipends between 10k$ and 100k$.
    • See their Grants Database for details.
  • Emerging Challenges Fund (ECF) - Longview Philanthropy
    • Gives grants to orgs in AIS, biorisk and nuclear. Funds both policy work (diplomacy, laws, advocacy) and technical work (TAIS research, technical bio-safety).
    • See their 2024 Report for details.
  • Global Catastrophic Risks Fund (GCR Fund) - Founders Pledge
    • Focuses on prevention of great power conflicts.
    • Their grants cover things like US-China diplomacy efforts on nuclear, AI and autonomous weapons issues, as well as biorisk strategy and policy work.

A very rough estimate on LTFF effectiveness (how much p(doom) does $1 reduce?): The article Microdooms averted by working on AI Safety uses a simple quantitative model to estimate that one extra AIS researcher will avert 49 microdooms on average at current margins. Considering only humanity's current 8B people, this would mean 400,000 current people saved in expectation by each additional researcher. Note that depending on parameter choices, the model's result could easily go up or down an order of magnitude.

The rest are my calculations:

  • Optimistic case: the researcher has all their impact in the first year and only requires a yearly salary of $80k. This would imply 0.6 nanodooms / $ or 5 current people saved / $.
  • Pessimistic case: the researcher takes 40 years (a full career) to have that impact, and big compute and org. staffing costs mean their career costs 10x their salary. This implies a 400x lower effectiveness, i.e. 1.5 picodooms / $ or 0.012 current people saved / $ or 80$ to save a perso
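The arithmetic behind these numbers, reproduced as a quick sketch (my code; parameter values are the ones quoted above and inherit all of that model's uncertainty):

```python
# Rough reproduction of the back-of-the-envelope estimate above.
microdooms_per_researcher = 49          # from "Microdooms averted by working on AI Safety"
population = 8e9
people_saved = microdooms_per_researcher * 1e-6 * population   # ≈ 400,000

# Optimistic: all impact in one year, costing one $80k salary.
cost_optimistic = 80_000
dooms_per_dollar_opt = microdooms_per_researcher * 1e-6 / cost_optimistic   # ≈ 0.6 nanodooms/$
people_per_dollar_opt = people_saved / cost_optimistic                      # ≈ 5 people/$

# Pessimistic: 40-year career at 10x the salary cost => 400x less effective.
factor = 40 * 10
dooms_per_dollar_pess = dooms_per_dollar_opt / factor    # ≈ 1.5 picodooms/$
people_per_dollar_pess = people_per_dollar_opt / factor  # ≈ 0.012 people/$, i.e. ~$80 per person

print(people_saved, dooms_per_dollar_opt, people_per_dollar_pess)
```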
3MichaelDickens
My perspective is that I'm much more optimistic about policy than about technical research, and I don't really feel qualified to evaluate policy work, and LTFF makes almost no grants on policy. I looked around and I couldn't find any grantmakers who focus on AI policy. And even if they existed, I don't know that I could trust them (like I don't think Open Phil is trustworthy on AI policy and I kind of buy Habryka's arguments that their policy grants are net negative). I'm in the process of looking through a bunch of AI policy orgs myself. I don't think I can do a great job of evaluating them but I can at least tell that most policy orgs aren't focusing on x-risk so I can scratch them off the list.
1M. Y. Zuo
How does someone view the actual outcomes of the ‘Highlighted Grants’ on that page? It would be a lot more reassuring if readers can check that they’ve all been fulfilled and/or exceeded expectations.