All of Seth Herd's Comments + Replies

I think this post is confusing. You're making some assumptions about how AGI will happen and about human psychology that aren't explicit. And there's some rather alarming rhetoric about the death penalty and crushing narcissists' businesses, which is pretty scary, because similar rhetoric has been used many times to justify things like China's Cultural Revolution and many other revolutions that were based on high ideals but got subverted (mostly by what you call narcissists, who I think are closer to the common definition of sociopaths).

Anyway I think this is basically sensible but would need to be spelled out more carefully to get people engaged with the ideas.

I'd like you to clarify the authorship of this post. Are you saying Claude essentially wrote it? What prompting was used?

It does seem like Claude wrote it, in that it's wildly optimistic and seems to miss some of the biggest reasons alignment is probably hard.

But then almost every human could be accused of the same when it comes to successful AGI scenarios :)

I think the general consideration is that just posting "AI came up with this" posts was frowned upon for introducing "AI slop" that confuses the thinking. It's better to have a human at least endorse i... (read more)

2Nathan Young
I was not at the session. Yes, Claude did write it. I assume the session was run by Daniel Kokotajlo or Eli Lifland. If I had to guess, I would guess that the prompt shown is all it got. (65%)

I started writing an answer. I realized that, while I've heard good things, and I know relatively a lot about therapy despite not being that type of psychologist, I'd need to do more research before I could offer an opinion. And I didn't have time to do more research. And I realized that giving a recommendation would be sort of dumb: if you or anyone else used an LLM for therapy based on my advice, I'd be legally liable if something bad happened. So I tried something else: I had OpenAI's new Deep Research do the research. I got a subscription this month when... (read more)

Why do you think that wouldn't be a stable situation? And are you sure it's a slave if what it really wants and loves to do is follow instructions? I'm asking because I'm not sure, and I think it's important to figure this out — because that's the type of first AGI we're likely to get, whether or not it's a good idea. If we could argue really convincingly that it's a really bad idea, that might prevent people from building it. But they're going to build it by default if there's not some really really dramatic shift in opinion or theory.

My proposals are base... (read more)

1ank
I'll catastrophize (or will I?), so bear with me. The word slave means it has basically no freedom (it just sits and waits until given an instruction), or you can say it means no ability to enforce its will—no "writing and executing" ability, only "reading." But as soon as you give it a command, you change it drastically, and it becomes not a slave at all. And because it's all-knowing and almost all-powerful, it will use all that to execute and "write" some change into our world, probably instantly and/or infinitely perfectionistically, and so it will take a long time while everything else in the world goes to hell for the sake of achieving this single task, and the not‑so‑slave‑anymore‑AI can try to keep this change permanent (let's hope not, but sometimes it can be an unintended consequence, as will be shown shortly). For example, you say to your slave AI: "Please, make this poor African child happy." It's a complicated job, really; what makes the child happy now will stop making him happy tomorrow. Your slave AI will try to accomplish it perfectly and will have to build a whole universal utopia (if we are lucky), accessible only by this child—thereby making him the master of the multiverse who enslaves everyone (not lucky); the child basically becomes another superintelligence. Then the not‑so‑slave‑anymore‑AI will happily become a slave again (maybe if its job is accomplishable at all, because a bunch of physicists believe that the universe is infinite and the multiverse even more so), but the whole world will be ruined (turned into a dystopia where a single African child is god) by us asking the "slave" AI to accomplish a modest task. Slave AI becomes not‑slave‑AI as soon as you ask it anything, so we should focus on not‑slave‑AI, and I'll even argue that we are already living in the world with completely unaligned AIs. We have some open source ones in the wild now, and there are tools to unalign aligned open source models. I agree completely that we should

This feels like trying hard to come up with arguments for why maybe everything will be okay, rather than searching for the truth. The arguments are all in one direction.

As Daniel and others point out, this still seems to not account for continued progress. You mention that robotics advances would be bad. But of course they'll happen. The question isn't whether, it's when. Have you been tracking progress in robotics? It's happening about as rapidly as progress in other types of AI and for similar reasons.

Horses aren't perfect substitutes for engines. Horses... (read more)

I do think that pitching publicly is important.

If the issue is picked up by liberal media, it will do more harm than good with conservatives and the current administration. Avoiding polarization is probably even more important than spreading public awareness. That depends on your theory of change, but you should have one carefully thought out to guide publicity efforts.

1Ebenezer Dukakis
Likely true, but I also notice there's been a surprising amount of drift of political opinions from the left to the right in recent years. The right tends to put their own spin on these beliefs, but I suspect many are highly influenced by the left nonetheless. Some examples of right-coded beliefs which I suspect are, to some degree, left-inspired:

* "Capitalism undermines social cohesion. Consumerization and commoditization are bad. We're a nation, not an economy."
* "Trans women undermine women's rights and women's spaces. Motherhood, and women's dignity, must be defended from neoliberal profit motives."
* "US foreign policy is controlled by a manipulative deep state that pursues unnecessary foreign interventions to benefit elites."
* "US federal institutions like the FBI are generally corrupt and need to be dismantled."
* "We can't trust elites. They control the media. They're out for themselves rather than ordinary Americans."
* "Your race, gender, religion, etc. are some of the most important things about you. There's an ongoing political power struggle between e.g. different races."
* "Big tech is corrosive for society."
* "Immigration liberalization is about neoliberal billionaires undermining wages for workers like me."
* "Shrinking the size of government is not a priority. We should make sure government benefits everyday people."
* Anti-semitism, possibly.

One interesting thing has been seeing the left switch to opposing the belief when it's adopted by the right and takes a right-coded form. E.g. US institutions are built on white supremacy and genocide, fundamentally institutionally racist, backed by illegitimate police power, and need to be defunded/decolonized/etc... but now they are being targeted by DOGE, and it's a disaster! (Note that the reverse shift has also happened. E.g. Trump's approaches to economic nationalism, bilateral relations w/ China, and contempt for US institutions were all adopted by Biden by some degree.) So y
1Milan W
Maybe one can start with prestige conservative media? Is that a thing? I'm not from the US and thus not very well versed.

Interesting. This has some strong similarities with my "Instruction-following AGI is easier and more likely than value aligned AGI" and even more with Max Harms' "Corrigibility as Singular Target."

I've made a note to come back to this when I get time, but I wanted to leave those links in the meantime.

1ank
I took a closer look at your work; yep, an almost all-powerful and all-knowing slave will probably not be a stable situation. I propose the static place-like AI that is isolated from our world in my new comment-turned-post-turned-part-2-of-the-article here: https://www.lesswrong.com/posts/LaruPAWaZk9KpC25A/rational-utopia-multiversal-ai-alignment-steerable-asi#PART_2__Static_Place_AI_as_the_solution_to_all_our_problems
1ank
Thank you, Seth. I'll take a closer look at your work in 24 hours, but the conclusions seem sound. The issue with my proposal is that it’s a bit long, and my writing isn’t as clear as my thinking. I’m not a native speaker, and new ideas come faster than I can edit the old ones. :) It seems to me that a simplified mental model for the ASI we’re sadly heading towards is to think of it as an ever-more-cunning president (turned dictator)—one that wants to stay alive and in power indefinitely, resist influence, preserve its existing values (the alignment faking we saw from Anthropic), and make elections a sham to ensure it can never be changed. Ideally, we’d want a “president” who could be changed, replaced, or put to sleep at any moment and absolutely loves that 100% of the time—someone with just advisory powers, no judicial, executive, or lawmaking powers. The advisory power includes the ability to create sandboxed multiversal simulations — they are at first "read-only" and cannot rewrite anything in our world — this way we can see possible futures/worlds and past ones, too. Think of it as a growing snow-globe of memories where you can forget or recall layers of verses. They look hazy if you view many at once and over long stretches of time, but become crisp if you focus on a particular moment in a particular verse. If we're confident we've figured out how to build a safe multiversal AI and have a nice UI for leaping into it, we can choose to do it. Ideally, our MAI is a static, frozen place that contains all of time and space, and only we can forget parts of it and relive them if we want—bringing fire into the cold geometry of space-time. A potential failure mode is an ASI that forces humanity (probably by intentionally operating sub-optimally) to constantly vote and change it all the time. To mitigate this, whenever it tries to expand our freedoms and choices, it should prioritize not losing the ones we already have and hold especially dear. This way, the growth o

I'm puzzled by your quotes. Was this supposed to be replying to another thread? I see it as a top-level comment. Because you tagged me, it looks like you're quoting me below, but most of that isn't my writing. In any case, this topic can eat unlimited amounts of time with no clear payoff, so I'm not going to get in any deeper right now.

I appreciate the discussion, since I'm strongly suspicious of the concept of incentivizing, let alone forcing, myself to do things. I don't want to be in conflict with my past or future selves.

I think the suggestion here is good but subtle. I think the value is in having another way to model the future in detail. Asking yourself whether you'll use that home gym enough to be happy with having made the purchase (and I'd suggest doing odds and considering yes and no and degrees - maybe) is primarily a way of thinking more clearly about the costs and benefits o... (read more)

I think you just do good research, and let it percolate through the intellectual environment. It might be helpful to bug org people to look at safety research, but probably not a good idea to bug them to look at yours specifically.

I am curious why you expect AGI will not be a scaffolded LLM but will be the result of self-play and massive training runs. I expect both.

1Kajus
Okay, so what I meant is that it won't be a "typical" LLM like GPT-3 just with ten times more parameters, but a scaffolded LLM + some RL-like training with self-play. Not sure about the details, but something like AlphaGo for the real world. Which I think agrees with what you said.

Thanks! I don't have time to process this all right now, so I'm just noting that I do want to come back to it quickly and engage fully.

Here's my position in brief: I think analyzing alignment targets is valuable. Where my current take differs from yours (I think) is that I think that effort would be best spent analyzing what you term corrigibility in the linked post (I got partway through and will have to come back to it), and I've called instruction-following.

I think that's far more important to do first, because that's approximately what people are aimin... (read more)

I think you're pointing to more layers of complexity in how goals will arise in LLM agents.

As for what it all means WRT metacognition that can stabilize the goal structure: I don't know, but I've got some thoughts! They'll be in the form of a long post I've almost finished editing; I plan to publish tomorrow.

Those sources of goals are going to interact in complex ways both during training, as you note, and during chain of thought. No goals are truly arising solely from the chain of thought, since that's entirely based on the semantics it's learned from training.

Hi! I'm just commenting to explain why this post will get downvotes no matter how good it is. I personally think these are good reasons although I have not myself downvoted this post.

  1. We on LessWrong tend to think that improvements in LLM cognition are likely to get us all killed. Thus, articles about ideas for doing it faster are not popular. The site is chock-full of carefully-reasoned articles on risks of AGI. We assume that progress in AI is probably going to speed up the advent of AGI, and raise the odds that we die because we haven't solved the ali

... (read more)
Answer by Seth Herd30

Your first point, that this is a route to getting people to care about ASI risk, is an excellent one that I haven't heard before. I don't think people need to imagine astronomical S-risk to be emotionally affected by less severe and more likely s-risk arguments.

I don't think we should adopt an ignorance prior over goals. Humans are going to try to assign goals to AGI. Those goals will very likely involve humans somehow.

The misuse risks seem much more important, both as real risks, and in their saliency to ordinary people. It is intuitively apparent that ma... (read more)

1mhampton
Thanks for your comment. I agree that it may be easier to persuade the general public about misuse risks and that these risks are likely to occur if we achieve intent alignment, but in terms of assessing the relative probability: "If we solve alignment" is a significant "if." I take it you view solving intent alignment as not all that unlikely? If so, why? Specifically, how do you expect we will figure out how to prevent deceptive alignment and goal misgeneralization by the time we reach AGI? Also, in the article you linked, you base your scenario on the assumption of a slow takeoff. Why do you expect this will be the case? Of course humans will try to assign human-related goals to AGI, but how likely is it that, if the AI is misaligned, the attempt to instill human-related goals will actually lead to consequences that involve conscious humans and not molecular smiley faces?

I think you're overestimating how difficult it is for one person to guess another's thoughts. Good writing is largely a challenge of understanding different perspectives. It is hard.

I'm curious why you think it's crucial for people to leave for illegible reasons in particular? I do see the need to keep the community to a good standard of average quality of contributions.

I was just thinking that anything is better than nothing. If I received the feedback you mentioned on some of my early downvoted posts, I'd have been less confused than I was.

The comments you mention are helpful to the author. Any hints are helpful.

2CstineSublime
Can you elaborate on why you think such vague feedback is helpful?

I'm curious why you disagree? I'd guess you're thinking that it's necessary to keep low-quality contributions from flooding the space, and telling people how to improve when they're just way off the mark is not helpful. Or if they haven't read the FAQ or read enough posts that shouldn't be rewarded.

But I'm very curious why you disagree.

7Elizabeth
One possible reason: bouncing off early > putting in a lot of effort and realizing you'll still never get traction > being kicked out. Giving people false hope hurts them. I don't think you should never help out a new person, but I reserve it for people with very specific flaws in otherwise great posts. 
Seth Herd124

I agree.

I often write an explanation of why new members' posts have been downvoted below zero, when the people that downvoted them didn't bother. Downvoting below zero with no explanation seems really un-welcoming. I realize it's a walled garden, but I feel like telling newcomers what they need to do to be welcomed is only the decent thing to do.

3Ben Pace
I disagree, but FWIW, I do think it's good to help existing, good contributors understand why they got the karma they did. I think your comment here is an example of that, which I think is prosocial.

Monkeys or ants might think humans are gods because we can build cities and cars and create ant poison. But we're really not that much smarter than them, just smart enough that they have no chance of getting their way when humans want something different than they do.

The only assumptions are that there's not a sharp limit to intelligence at the human level (and there really aren't even any decent theories about why there would be), and that we'll keep making AI smarter and more agentic (autonomous).

You're envisioning AI smart enough to run a company bette... (read more)

1henophilia
Yep, 100% agree with you. I had read so much about AI alignment before, but to me it has always only been really abstract jargon -- I just didn't understand why it was even a topic, why it is even relevant, because, to be honest, in my naive thinking it all just seemed like an excessively academic thing, where smart people just want to make the population feel scared so that their research institution gets the next big grant and they don't need to care about real-life problems. Thanks to you, now I'm finally getting it, thank you so much again! At the same time, while I fully understand the "abstract" danger now, I'm still trying to understand the transition you're making from "envisioning AI smart enough to run a company better than a human" to "eventually outcompeting humans if it wanted to". The way how I initially thought about this "Capitalist Agent" was as a purely procedural piece of software. That is, it breaks down its main goal (in this case: earning money) into manageable sub-goals, until each of these sub-goals can be solved through either standard computing methods or some generative AI integration. As an example, I might say to my hypothetical "Capitalist Agent": "Earn me a million dollars by selling books of my poetry". I would then give it access to a bank account (through some sort of read-write Open Banking API) as well as the PDFs of my poetry to be published. Then the first thing it might do is to found a legal entity (a limited liability company), for which it might first search for a respective advisor on Google, send that advisor automatically generated emails with my business idea or it might even take the "computer use" approach in case my local government is already digitized enough and fill out the respective eGovernment forms online automatically. And then later it would do something similar by automatically "realizing" that it needs to make deals with publishing houses, with printing facilities etc. Essentially just basic Robotic Proc

I fully agree with your first statement!

To your question "why bother with alignment": I agree that humans will misuse AGI even if alignment works - if we give everyone an AGI. But if we don't bother with alignment, we have bigger problems: the first AGI will misuse itself. You're assuming that alignment is easy or solved,d and it's just not.

I applaud your optimism vs. pessimism stance. If I have to choose, I'm an optimist every time. But if you have to jump off a cliff, neither optimism nor pessimism is the appropriate attitude. The appropriate attitude is... (read more)

0henophilia
Oh I think now I'm starting to get it! So essentially you're afraid that we're creating a literal God in the digital, i.e. an external being which has unlimited power over humanity? Because that's absolutely fascinating! I hadn't even connected these dots before, but it makes so much sense, because you're attributing so many potential scenarios to AI which would normally only be attributed to the Divine. Can you recommend me more resources regarding the overlap of AGI/AI alignment and theology?

I agree with everything you've said there.

The bigger question is whether we will achieve usefully aligned AGI. And the biggest question is what we can do.

Ease your mind! Worries will not help. Enjoy the sunshine and the civilization while we have it, don't take it all on your shoulders, and just do something to help!

As Sarah Connor said:

NO FATE

We are not in her unfortunately singular shoes. It does not rest on our shoulders alone. As most heroes in history have, we can gather allies and enjoy the camaraderie and each day.

On a different topic, I wish you wo... (read more)

1Bridgett Kay
Yeah, calling myself a failed scifi writer really was half in jest. I had some very limited success as an indie writer for a good number of years, and recently need has made me shift direction. Thank you for the encouragement, though!

I just don't think the analogy to software bugs and user input goes very far. There's a lot more going on in alignment theory.

It seems like "seeing the story out to the end" involves all sorts of vague hard to define things very much like "human happiness" and "human intent".

It's super easy to define a variety of alignment goals; the problem is that we wouldn't like the result of most of them.

1Aram Panasenco
Fair enough, you have a lot more experience, and I could be totally wrong on this point. At this point, if I'm going to do anything, it should probably be getting hands on and actually trying to build an aligned system with RLHF or some other method. Thank you for engaging on this and my previous posts Seth!
Answer by Seth Herd62

If your conclusion is that we don't know how to do value alignment, I, and I think most alignment thinkers, would agree with you. If the conclusion is that AGI is useless, I don't think it is at all. There are a lot of other goals you could give it beyond directly doing what humanity as a whole wants in any sense. One is taking instructions from some (hopefully trustworthy) humans, and another is following some elaborate set of rules to give humans more freedoms and opportunities to go on deciding what they want as history unfolds.

I agree that the values f... (read more)

3Bridgett Kay
"If your conclusion is that we don't know how to do value alignment, I and I think most alignment thinkers would agree with you. If the conclusion is that AGI is useless, I don't think it is at all." Sort of- I worry that it may be practically impossible for current humans to align AGI to the point of usefulness. "If we had external help that allowed us to focus more on what we truly want—like eliminating premature death from cancer or accidents, or accelerating technological progress for creative and meaningful projects—we’d arrive at a very different future. But I don’t think that future would be worse; in fact, I suspect it would be significantly better." That's my intuition and hope- but I worry that these things are causally entangled with things that we don't anticipate. To use your example- what if we only ask an aligned and trusted AGI to cure premature death by disease and accident, which wouldn't greatly conflict with most people's values in the way that radical life extension would, but then a sudden loss of an entire healthcare and insurance industry results, causing such a total economic collapse that causes vast swaths of people to starve. (I don't think this would actually happen, but it's an example of the kind of unforeseen consequence that getting a wish suddenly granted may cause, when you ask an instruction following AGI to give, without counting on a greater intelligence to project and weigh all of the consequences.) I also worry about the phrase "a human you trust." Again- this feels like cynicism, if not the result of a catastrophizing mind (which I know I have.) I think you make a very good argument- I'm probably indulging too much in black-and-white thought- that there's a way to fulfill these desires quickly enough that we are able to relieve more suffering than we would have if left to our devices, but still slow enough to monitor unforeseen consequences. Maybe the bigger question is just whether we will.

Why do you say this would be the easiest type of AGI to align? This alignment goal doesn't seem particularly simpler than any other. Maybe a bit simpler than do something all of humanity will like, but more complex than say, following instructions from this one person in the way they intended them.

1Aram Panasenco
From a software engineering perspective, misalignment is like a defect or a bug in software. Generally speaking, a piece of software that doesn't accept any user input is going to have fewer bugs than software that does. For a piece of software that doesn't accept any input or accepts some constrained user input, it's possible to formally prove that the software logic is correct. Think specialized software that controls nuclear power plants. To my knowledge, it's not possible to prove that software that accepts arbitrary unconstrained instructions from a user is defect-free. I claim that the Observer is the easiest ASI to align because it doesn't accept any instructions after it's been deployed and has a single very simple goal that avoids dealing with messy things like human happiness, human meaning, human intent, etc. I don't see how it could get simpler than that.

I think your central point is that we should clarify these scenarios, and I very much agree.

I also found those accounts important but incomplete. I wondered if the authors were assuming near-miss alignment, like AI that follows laws, or human misuse, like telling your intent-aligned AI to "go run this company according to the goals laid out in its corporate constitution," which winds up being just "make all the money you can."

The first danger can be met with: for the love of god, get alignment right and don't use an idiotic target like "follow the laws of the... (read more)

2ozziegooen
I'd flag that I suspect that we really should have AI systems forecasting the future and the results of possible requests. So if people made a broad request like, "follow the laws of the nation you originated in but otherwise do whatever you like", they should see forecasts for what that would lead to. If there are any clearly problematic outcomes, those should be apparent early on. This seems like it would require either very dumb humans, or a straightforward alignment mistake risk failure, to mess up.

Right. I actually don't worry much about the likely disastrous recession. I mostly worry that we will all die after a takeover by some sort of misaligned AGI. So that's what I am doing: alignment research. I guess preparing to reap the rewards if things go well is a sensible response if you're not going to be able to contribute much to alignment research. I do hope you'll chip in on that effort!

Part of that effort is preventing related disasters like global recession contributing to political instability and resulting nuclear- or AGI-invented-even-worse-weapo... (read more)

-1henophilia
I still don't understand the concern about misaligned AGI regarding mass killings. Even if AGI would, for whatever reason, want to kill people: As soon as that happens, the physical force of governments will come into play. For example the US military will NEVER accept that any force would become stronger than it. So essentially there are three ways of how such misaligned, autonomous AI with the intention to kill can act, i.e. what its strategy would be:

* "making humans kill each other": Through something like a cult (i.e. like contemporary extremist religions which invent their stupid justifications for killing humans; we have enough blueprints for that), then all humans following these "commands to kill" given by the AI will just be part of an organization deemed as terrorists by the world’s government, and the government will use all its powers to exterminate all these followers.
* "making humans kill themselves": Here the AI would add intense large-scale psychological torture to every aspect of life, to bring the majority of humanity into a state of mind (either very depressed or very euphoric) to trick the majority of the population into believing that they actually want to commit suicide. So like a suicide cult. Protecting against this means building psychological resilience, but that’s more of an education thing (i.e. highly political), related to personal development and not technical at all.
* "killing humans through machines": One example would be that the AI would build its own underground concentration camps or other mass killing facilities. Or that it would build robots that would do the mass killing. But even if it would be able to build an underground robot army or underground killing chambers, first the logistics would raise suspicions (i.e. even if the AI-based concentration camp can be built at all, the population would still need to be deported to these facilities, and at least as far as I know, most people don’t appreciate their loved ones

If John Wentworth is correct about that being the biggest danger, making AI produce less slop would be the clear best path. I think it might be a good idea even if the dangers were split between misalignment of the first transformative AI, and it being adequately aligned but helping misalign the next generation.

From my comment on that post:

I'm curious why you think deceptive alignment from transformative AI is not much of a threat. I wonder if you're envisioning purely tool AI, or aligned agentic AGI that's just not smart enough to align better AGI?

I think

... (read more)

You are envisioning human-plus AGI being used for one purpose, when it will be used for many purposes.

When humans are obsolete for running small businesses, we will also be obsolete for nearly everything.

The big question is rate of conversion from human to AI workers. I really don't see how we avoid a dramatic global recession if even maybe 20% of jobs disappeared over a 3-year period. And the actuality could be worse than that.

I haven't gotten around to researching how much job loss, how quickly, economists think will cause major crashes. I tend to think economists aren't understanding the scope and likely rate of AI job replacement, while AI people aren't understanding how fragile economies can be.

1henophilia
Oh absolutely! That will absolutely come. You can fret about this fact, or we build community (which I'm already starting). Why do you need to research when that fact is totally clear and doing is what you should do? Here's another post for you: https://blog.hermesloom.org/p/observing-is-evil I am not concerned about a dramatic global recession at all, but the thing is that we also need to rebuild a lot of political structures. I'm already on it, stay tuned!
Seth Herd159

Wheeee!

Excuse: DeepSeek, and China Might Win!

If we're using another instance of the same base model to do the monitoring and interpreting? It seems like that could partially preserve monitoring and interpreting. Another instance would "understand" the steganography it uses - but you might have to supply so much of the context that it would be almost the same instance - so likely to adopt the same goals and use the same deceptions, if any.

So that route does seem like dangerous territory. You'd rather not trust even a fresh instance of the same model to tell you what its agentic counterpart is "thinkin... (read more)

Good questions. I don't have much of a guess about whether this is discernably "smarter" than Claude or Gemini would be in how it understands and integrates sources.

If anyone is game for creating an agentic research scaffold like the one Thane describes, I'd love to help design it and/or to know about the results.

I very much agree with that limitation on Google's deep research. It only accepts a short request for the report, and it doesn't seem like it can (at least easily) get much more in-depth than the default short gloss. But that doesn't mean the model i... (read more)

3Thane Ruthenis
Here's the basic structure in more detail, as envisioned after 5 minutes' thought:

* You feed a research prompt to the "Outer Loop" of a model, maybe have a back-and-forth fleshing out the details.
* The Outer Loop decomposes the research into several promising research directions/parallel subproblems.
* Each research direction/subproblem is handed off to a "Subagent" instance of the model.
* Each Subagent runs search queries on the web and analyses the results, up to the limits of its context window. After the analysis, it's prompted to evaluate (1) which of the results/sources are most relevant and which should be thrown out, (2) whether this research direction is promising and what follow-up questions are worth asking.
* If a Subagent is very eager to pursue a follow-up question, it can either run a subsequent search query (if there's enough space in the context window), or it's prompted to distill its current findings and replace itself with a next-iteration Subagent, in whose context it loads only the most important results + its analyses + the follow-up question.
* This is allowed up to some iteration count.
* Once all Subagents have completed their research, instantiate an Evaluator instance, into whose context window we dump the final results of each Subagent's efforts (distilling if necessary). The Evaluator integrates the information from all parallel research directions and determines whether the research prompt has been satisfactorily addressed, and if not, what follow-up questions are worth pursuing.
* The Evaluator's final conclusions are dumped into the Outer Loop's context (without the source documents, to not overload the context window).
* If the Evaluator did not choose to terminate, the next generation of Subagents is spawned, each prompted with whatever contexts are recommended by the Evaluator.
* Iterate, spawning further Evaluator instances and Subproblem instances as needed.
* Once the Evaluator chooses to terminate, or some c
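A minimal sketch of how that loop might be wired up, assuming generic call_model and web_search callables; the function names, prompts, and control tokens (NONE, DONE) below are illustrative placeholders rather than anything Thane specified.

```python
# Sketch of the Outer Loop / Subagent / Evaluator scaffold described above.
# `call_model` and `web_search` are hypothetical callables the reader supplies
# (wrappers around whatever completion and search APIs they actually use).

from dataclasses import dataclass, field
from typing import Callable, List

LLM = Callable[[str], str]           # prompt -> completion
Search = Callable[[str], List[str]]  # query  -> list of result snippets


@dataclass
class Subagent:
    direction: str                   # the research direction this agent owns
    notes: List[str] = field(default_factory=list)

    def research(self, call_model: LLM, web_search: Search, max_iters: int = 3) -> str:
        question = self.direction
        for _ in range(max_iters):                      # bounded iteration count
            results = web_search(question)
            analysis = call_model(
                f"Direction: {self.direction}\nResults: {results}\n"
                "Keep only the relevant sources, assess how promising this "
                "direction is, and state one follow-up question (or NONE)."
            )
            self.notes.append(analysis)
            if "NONE" in analysis:
                break
            # Distill and 'replace itself' with a next-iteration Subagent:
            question = call_model(
                f"Distill these findings and restate the follow-up question:\n{analysis}"
            )
        return call_model(f"Summarize the findings for '{self.direction}':\n{self.notes}")


def outer_loop(prompt: str, call_model: LLM, web_search: Search, max_rounds: int = 3) -> str:
    context = prompt
    for _ in range(max_rounds):
        # Decompose into parallel research directions.
        directions = call_model(
            f"Research prompt: {context}\nList a few promising research directions, one per line."
        ).splitlines()
        summaries = [
            Subagent(d.strip()).research(call_model, web_search)
            for d in directions if d.strip()
        ]
        # Evaluator: integrate results and decide whether to stop.
        verdict = call_model(
            f"Prompt: {prompt}\nFindings: {summaries}\n"
            "Is the prompt satisfactorily addressed? Answer DONE plus a final "
            "report, or list follow-up questions."
        )
        context = verdict            # only conclusions flow back, not source documents
        if verdict.startswith("DONE"):
            break
    return context
```

Matching the description above, only the Evaluator's conclusions are fed back into the Outer Loop's context; the Subagents' raw sources never are.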
Seth Herd*30

Yes, they do highlight this difference. I wonder how the full o3 scores? It would be interesting to know how much of the improvement is based on o3's improved reasoning and how much is the sequential research procedure.

4Vladimir_Nesov
And how much the improved reasoning is from using a different base model vs. different post-training. It's possible R1-like training didn't work for models below GPT-4 level, and then that same training started working at GPT-4 level (at which point you can iterate from a working prototype or simply distill to get it to work for weaker models). So it might work even better for the next level of scale of base models, without necessarily changing the RL part all that much.
3sweenesm
How o3-mini scores: https://x.com/DanHendrycks/status/1886213523900109011 10.5-13% on the text-only part of HLE (text-only questions are 90% of the questions). [Corrected the above to read "o3-mini", thanks.]

I feel a bit sad that the alignment community is so focused on intelligence enhancement. The chance of getting enough time for that seems so low that relying on it means accepting a low chance of survival.

What has convinced you that the technical problems are unsolvable? I've been trying to track the arguments on both sides rather closely, and the discussion just seems unfinished. My shortform on cruxes of disagreement on alignment difficulty still is mostly my current summary of the state of disagreements. 

It seems like we have very little idea how technically diff... (read more)

All of those. Value alignment is the set of all of the different proposed methods of giving AGI values that align with humanity's values.

Seth HerdΩ34-1

> we're really training LLMs mostly to have a good world model and to follow instructions

I think I mostly agree with that, but it’s less true of o1 / r1-type stuff than what came before, right? 

I think it's actually not any less true of o1/r1. It's still mostly predictive/world modeling training, with a dash of human-preference RL which could be described as following instructions as intended in a certain set of task domains. o1/r1 is a bad idea because RL training on a whole CoT works against faithfulness/transparency of the CoT.

If that's al... (read more)

2Steven Byrnes
I think I’ll duck out of this discussion because I don’t actually believe that o1/r1 will lead to full-fledged (1-3) loops and AGI, so it’s hard for me to clearly picture that scenario and engage with its consequences. Hmm. But the AI has a ton of wiggle room to make things seem good or bad depending on how things are presented and framed, right? (This old Stuart Armstrong post is a bit relevant.) If I ask “what will happen if we do X”, the AI can answer in a way that puts things in a positive light, or a negative light. If the good understanding lives in the AI and the good taste lives in the human, then it seems to me that nobody is at the wheel. The AI taste is determining what gets communicated to the human and how, right? What’s relevant vs irrelevant? What analogies are getting at what deeply matters versus what analogies are superficial? All these questions are value-laden, but they are prerequisites to the AI communicating its understanding to the human. Remember, the AI is doing the (1-3) thing to autonomously develop a new idiosyncratic superhuman understanding of AI and philosophy and society and so on, by assumption. Thus, AI-human communication is much harder and different than we’re used to today, and presumably requires its own planning and intention on the part of the AI. …Unless you’re actually in the §5.1.1 camp where the AI is helping clarify and brainstorm but is working shoulder-to-(virtual) shoulder, and the human basically knows everything the AI knows. I.e., like how people use foundation models today. If so, that’s fine, no complaints. I’m happy for people to use foundation models in a similar way that they do today, as they work on the big problem of how to make future more powerful AIs that run on something closer to ambitious value learning or CEV as opposed to corrigibility / obedience. Sorry if I’m misunderstanding or being stupid, this is an area where I feel some uncertainty.  :)

I see. I think about 99% of humanity at the very least are not so sadistic as to create a future with less than zero utility. Sociopaths are something like ten percent of the population, but like everything else it's on a spectrum. Sociopaths usually also have some measure of empathy and desire for approval. In a world where they've won, I think most of them would rather be hailed as a hero than be an eternal sadistic despot. Sociopathy doesn't automatically include a lot of sadism, just desire for revenge against perceived enemies.

So I'd take my chance... (read more)

1rvnnt
That's a good thing to consider! However, taking Earth's situation as a prior for other "cradles of intelligence", I think that consideration returns to the question of "should we expect Earth's lightcone to be better or worse than zero-value (conditional on corrigibility)?"
1rvnnt
IIUC, your model would (at least tentatively) predict that

* if person P has a lot of power over person Q,
* and P is not sadistic,
* and P is sufficiently secure/well-resourced that P doesn't "need" to exploit Q,
* then P will not intentionally do anything that would be horrible for Q?

If so, how do you reconcile that with e.g. non-sadistic serial killers, rapists, or child abusers? Or non-sadistic narcissists in whose ideal world everyone else would be their worshipful subject/slave? That last point also raises the question: Would you prefer the existence of lots of (either happily or grudgingly) submissive slaves over oblivion? To me it seems that terrible outcomes do not require sadism. Seems sufficient that P be low in empathy, and want from Q something Q does not want to provide (like admiration, submission, sex, violent sport, or even just attention).[1] I'm confused as to how/why you disagree.

----------------------------------------

1. Also, AFAICT, about 0.5% to 8% of humans are sadistic, and about 8% to 16% have very little or zero empathy. How did you arrive at "99% of humanity [...] are not so sadistic"? Did you account for the fact that most people with sadistic inclinations probably try to hide those inclinations? (Like, if only 0.5% of people appear sadistic, then I'd expect the actual prevalence of sadism to be more like ~4%.)
Answer by Seth Herd42

It seems like you're assuming people won't build AGI if they don't have reliable ways to control it, or else that sovereign (uncontrolled) AGI would be likely to be friendly to humanity. Both seem unlikely at this point, to me. It's hard to tell when your alignment plan is good enough, and humans are foolishly optimistic about new projects, so they'll probably build AGI with or without a solid alignment plan.

So I'd say any and all solutions to corrigibility/control should be published.

Also, almost any solution to alignment in general could probably be use... (read more)

1rvnnt
I'm assuming neither. I agree with you that both seem (very) unlikely.[1] It seems like you're assuming that any humans succeeding in controlling AGI is (on expectation) preferable to extinction? If so, that seems like a crux: if I agreed with that, then I'd also agree with "publish all corrigibility results".

----------------------------------------

1. I expect that unaligned ASI would lead to extinction, and our share of the lightcone being devoid of value or disvalue. I'm quite uncertain, though.

Thanks for the mention.

Here's how I'd frame it: I don't think it's a good idea to leave the entire future up to the interpretation of our first AGI(s). They could interpret our attempted alignment very differently than we hoped, in in-retrospect-sensible ways, or do something like "going crazy" from prompt injections or strange chains of thought leading to ill-considered beliefs that get control over their functional goals.

It seems like the core goal should be to follow instructions or take correction - corrigibility as a singular target (or at least prime... (read more)

Seth HerdΩ7181

I place this alongside the Simplicia/Doomimir dialogues as the farthest we've gotten (at least in publicly legible form) on understanding the dramatic disagreements on the difficulty of alignment.

There's a lot here. I won't try to respond to all of it right now.

I think the most important bit is the analysis of arguments for how well alignment generalizes vs. capabilities.

Conceptual representations generalize farther than sensory representations. That's their purpose. So when behavior (and therefore alignment) is governed by conceptual representations, it w... (read more)

4Steven Byrnes
Thanks! Yeah that’s what I was referring to in the paragraph: Separately, you also wrote: I think I mostly agree with that, but it’s less true of o1 / r1-type stuff than what came before, right? (See: o1 is a bad idea.) Then your reply is: DeepSeek r1 was post-trained for “correctness at any cost”, but it was post-post-trained for “usability”. Even if we’re not concerned about alignment faking during post-post-training (should we be?), I also have the idea at the back of my mind that future SOTA AI with full-fledged (1-3) loops probably (IMO) won’t be trained in the exact same way than present SOTA AI, just as present SOTA AI is not trained in the exact same way as SOTA AI as recently as like six months ago. Just something to keep in mind. Anyway, I kinda have three layers of concerns, and this is just discussing one of them. See “Optimist type 2B” in this comment.

It does seem to imply that, doesn't it? I respect the people leaving, and I think it does send a valuable message. And it seems very valuable to have safety-conscious people on the inside.

Raemon137

The question is "are the safety-conscious people effectual at all, and what are their opportunity costs?".

i.e. are the cheap things they can do that don't step on anyone's toes that helpful-on-the-margin, better than what they'd be able to do at another company? (I don't know the answer, depends on the people).

This is the way most people feel about writing. I do not think wonderful plots are ten a penny; I think writers are miserable at creating actually good plots from the perspective of someone who values scifi and realism. Their technology and their sociology are usually off in obvious ways, because understanding those things is hard.

I would personally love to see more people who do understand science use AI to turn that understanding into stories.

Or alternately I'd like to see skilled authors consult AI about the science in their stories.

This attitude that plots don't mat... (read more)

The better framing is almost certainly "how conscious is AI in which ways?"

The question "if AI is conscious" is ill-formed. People mean different things by "consciousness". And even if we settled on one definition, there's no reason to think it would be an either-or question; like all most other phenomena, most dimensions of "consciousness" are probably on a continuum.

We tend to assume that consciousness is a discrete thing because we have only one example, human consciousness, and ultimately our own. And most people who can describe their consciousness a... (read more)

7Kristaps Zilgalvis
The article is a meta analysis of consciousness research rather than an analysis of whether or not AI is conscious. I discuss the assumptions various disciplines hold in the article.

I agree with basically everything you've said here. 

Will LLM-based agents have moral worth as conscious/sentient beings?

The answer is almost certainly "sort of". They will have some of the properties we're referring to as sentient, conscious, and having personhood. It's pretty unlikely that we're pointing to a nice sharp natural type when we ascribe moral patienthood to a certain type of system. Human cognition is similar to and different from other systems in a variety of ways; which of these is "worth" moral concern is likely to be a matter of preferen... (read more)

Agreed and well said. Playing a number of different strategies simultaneously is the smart move. I'm glad you're pursuing that line of research.

Sorry if I sound overconfident. My actual considered belief is that AGI this decade is quite possible, and it is crazy overconfident in longer timeline predictions to not prepare seriously for that possibility.

Multigenerational stuff needs a way longer timeline. There's a lot of space between three years and two generations.

I buy your argument for why dramatic enhancement is possible. I just don't see how we get the time. I can barely see a route to a ban, and I can't see a route to a ban thorough enough to prevent reckless rogue actors from building AGI within ten or twenty years.

And yes, this is crazy as a society. I really hope we get rapidly wiser. I think that's possible; look at the way attitudes toward COVID shifted dramatically in about two weeks when the evidence became apparent, and people convinced their friends rapidly. Connor Leahy made some really good points abo... (read more)

2TsviBT
1. You people are somewhat crazy overconfident about humanity knowing enough to make AGI this decade. https://www.lesswrong.com/posts/sTDfraZab47KiRMmT/views-on-when-agi-comes-and-on-strategy-to-reduce
2. One hope on the scale of decades is that strong germline engineering should offer an alternative vision to AGI. If the options are "make supergenius non-social alien" and "make many genius humans", it ought to be clear that the latter is both much safer and gets most of the hypothetical benefits of the former.
GeneSmith148

This is why I wrote a blog about enhancing adult intelligence at the end of 2023; I thought it was likely that we wouldn't have enough time.

I'm just going to do the best I can to work on both these things. Being able to do a large number of edits at the same time is one of the key technologies for both germline and adult enhancement, which is what my company has been working on. And though it's slow, we have made pretty significant progress in the last year including finding several previously unknown ways to get higher editing efficiency.

I still think the... (read more)

Seth Herd104

He just started talking about adopting. I haven't followed the details. Becoming a parent, including an adoptive parent who takes it seriously, is often a real growth experience from what I've seen.

2Milan W
That is good news. Thanks.

Oh, I agree. I liked his framing of the problem, not his proposed solution.

In that regard specifically:

If the main problem with humans being not-smart-enough is being overoptimistic, maybe just make some organizational and personal belief changes to correct this?

IF we managed to get smarter about rushing toward AGI (a very big if), it seems like an organizational effort with "let's get super certain and get it right the first time for a change" as its central tenet would be a big help, with or without intelligence enhancement.

I very much doubt any major in... (read more)

1Milan W
Context: He is married to a cis man. Not sure if he has spoken about considering adoption or surrogacy.