I'd like you to clarify the authorship of this post. Are you saying Claude essentially wrote it? What prompting was used?
It does seem like Claude wrote it, in that it's wildly optimistic and seems to miss some of the biggest reasons alignment is probably hard.
But then almost every human could be accused of the same when it comes to successful AGI scenarios :)
I think the general consideration is that posting raw "AI came up with this" content is frowned upon for introducing "AI slop" that confuses the thinking. It's better to have a human at least endorse i...
I started writing an answer. I realized that, while I've heard good things, and I know relatively a lot about therapy despite not being that type of psychologist, I'd need to do more research before I could offer an opinion. And I didn't have time to do more research. And I realized that giving a recommendation would be sort of dumb: if you or anyone else used an LLM for therapy based on my advice, I'd be legally liable if something bad happened. So I tried something else: I had OpenAI's new Deep Research do the research. I got a subscription this month when...
Why do you think that wouldn't be a stable situation? And are you sure it's a slave if what it really wants and loves to do is follow instructions? I'm asking because I'm not sure, and I think it's important to figure this out — because that's the type of first AGI we're likely to get, whether or not it's a good idea. If we could argue really convincingly that it's a really bad idea, that might prevent people from building it. But they're going to build it by default if there's not some really really dramatic shift in opinion or theory.
My proposals are base...
This feels like trying hard to come up with arguments for why maybe everything will be okay, rather than searching for the truth. The arguments are all in one direction.
As Daniel and others point out, this still seems to not account for continued progress. You mention that robotics advances would be bad. But of course they'll happen. The question isn't whether, it's when. Have you been tracking progress in robotics? It's happening about as rapidly as progress in other types of AI and for similar reasons.
Horses aren't perfect substitutes for engines. Horses...
I do think that pitching publicly is important.
If the issue is picked up by liberal media, it will do more harm than good with conservatives and the current administration. Avoiding polarization is probably even more important than spreading public awareness. That depends on your theory of change, but you should have one carefully thought out to guide publicity efforts.
Interesting. This has some strong similarities with my Instruction-following AGI is easier and more likely than value aligned AGI and even more with Max Harms' Corrigibility as Singular Target.
I've made a note to come back to this when I get time, but I wanted to leave those links in the meantime.
I'm puzzled by your quotes. Was this supposed to be replying to another thread? I see it as a top-level comment. Because you tagged me, it looks like you're quoting me below, but most of that isn't my writing. In any case, this topic can eat unlimited amounts of time with no clear payoff, so I'm not going to get in any deeper right now.
I appreciate the discussion, since I'm strongly suspicious of the concept of incentivizing, let alone forcing, myself to do things. I don't want to be in conflict with my past or future selves.
I think the suggestion here is good but subtle. I think the value is in having another way to model the future in detail. Asking yourself whether you'll use that home gym enough to be happy with having made the purchase (and I'd suggest assigning odds and considering yes, no, and degrees of maybe) is primarily a way of thinking more clearly about the costs and benefits o...
I think you just do good research, and let it percolate through the intellectual environment. It might be helpful to bug org people to look at safety research, but probably not a good idea to bug them to look at yours specifically.
I am curious why you expect AGI will not be a scaffolded LLM but will be the result of self-play and massive training runs. I expect both.
Thanks! I don't have time to process this all right now, so I'm just noting that I do want to come back to it quickly and engage fully.
Here's my position in brief: I think analyzing alignment targets is valuable. Where my current take differs from yours (I think) is that I think that effort would be best spent analyzing what you term corrigibility in the linked post (I got partway through and will have to come back to it), and what I've called instruction-following.
I think that's far more important to do first, because that's approximately what people are aimin...
I think you're pointing to more layers of complexity in how goals will arise in LLM agents.
As for what it all means WRT metacognition that can stabilize the goal structure: I don't know, but I've got some thoughts! They'll be in the form of a long post I've almost finished editing; I plan to publish tomorrow.
Those sources of goals are going to interact in complex ways both during training, as you note, and during chain of thought. No goals are truly arising solely from the chain of thought, since that's entirely based on the semantics it's learned from training.
Hi! I'm just commenting to explain why this post will get downvotes no matter how good it is. I personally think these are good reasons although I have not myself downvoted this post.
We on LessWrong tend to think that improvements in LLM cognition are likely to get us all killed. Thus, articles about ideas for doing it faster are not popular. The site is chock-full of carefully-reasoned articles on risks of AGI. We assume that progress in AI is probably going to speed up the advent of AGI, and raise the odds that we die because we haven't solved the alignment problem.
Your first point, that this is a route to getting people to care about ASI risk, is an excellent one that I haven't heard before. I don't think people need to imagine astronomical S-risk to be emotionally affected by less severe and more likely s-risk arguments.
I don't think we should adopt an ignorance prior over goals. Humans are going to try to assign goals to AGI. Those goals will very likely involve humans somehow.
The misuse risks seem much more important, both as real risks, and in their saliency to ordinary people. It is intuitively apparent that ma...
I think you're overestimating how difficult it is for one person to guess another's thoughts. Good writing is largely a challenge of understanding different perspectives. It is hard.
I'm curious why you think it's crucial for people to leave for illegible reasons in particular? I do see the need to keep the community to a good standard of average quality of contributions.
I was just thinking that anything is better than nothing. If I received the feedback you mentioned on some of my early downvoted posts, I'd have been less confused than I was.
The comments you mention are helpful to the author. Any hints are helpful.
I'm curious why you disagree? I'd guess you're thinking that it's necessary to keep low-quality contributions from flooding the space, and that telling people how to improve when they're just way off the mark is not helpful. Or that if they haven't read the FAQ or read enough posts, that shouldn't be rewarded.
But I'm very curious why you disagree.
I agree.
I often write an explanation of why new members' posts have been downvoted below zero, when the people that downvoted them didn't bother. Downvoting below zero with no explanation seems really un-welcoming. I realize it's a walled garden, but I feel like telling newcomers what they need to do to be welcomed is just the decent thing to do.
Monkeys or ants might think humans are gods because we can build cities and cars and create ant poison. But we're really not that much smarter than them, just smart enough that they have no chance of getting their way when humans want something different than they do.
The only assumptions are that there's not a sharp limit to intelligence at the human level (and there really are not even any decent theories about why there would be), and that we'll keep making AI smarter and more agentic (autonomous).
You're envisioning AI smart enough to run a company bette...
I fully agree with your first statement!
To your question "why bother with alignment": I agree that humans will misuse AGI even if alignment works - if we give everyone an AGI. But if we don't bother with alignment, we have bigger problems: the first AGI will misuse itself. You're assuming that alignment is easy or solved, and it's just not.
I applaud your optimism vs. pessimism stance. If I have to choose, I'm an optimist every time. But if you have to jump off a cliff, neither optimism nor pessimism is the appropriate attitude. The appropriate attitude is...
I agree with everything you've said there.
The bigger question is whether we will achieve usefully aligned AGI. And the biggest question is what we can do.
Ease your mind! Worries will not help. Enjoy the sunshine and the civilization while we have it, don't take it all on your shoulders, and just do something to help!
As Sarah Connor said:
NO FATE
We are not in her unfortunately singular shoes. It does not rest on our shoulders alone. As most heroes in history have, we can gather allies and enjoy the camaraderie and each day.
On a different topic, I wish you wo...
I just don't think the analogy to software bugs and user input goes very far. There's a lot more going on in alignment theory.
It seems like "seeing the story out to the end" involves all sorts of vague, hard-to-define things very much like "human happiness" and "human intent".
It's super easy to define a variety of alignment goals; the problem is that we wouldn't like the result of most of them.
If your conclusion is that we don't know how to do value alignment, I and I think most alignment thinkers would agree with you. If the conclusion is that AGI is useless, I don't think it is at all. There are a lot of other goals you could give it beyond directly doing what humanity as a whole wants in any sense. One is taking instructions from some (hopefully trustworthy) humans; another is following some elaborate set of rules to give humans more freedoms and opportunities to go on deciding what they want as history unfolds.
I agree that the values f...
Why do you say this would be the easiest type of AGI to align? This alignment goal doesn't seem particularly simpler than any other. Maybe a bit simpler than "do something all of humanity will like," but more complex than, say, "follow instructions from this one person in the way they intended them."
Apparently people have been trying to do such comparisons:
Hugging Face researchers aim to build an ‘open’ version of OpenAI’s deep research tool
I think your central point is that we should clarify these scenarios, and I very much agree.
I also found those accounts important but incomplete. I wondered if the authors were assuming near-miss alignment, like AI that follows laws, or human misuse, like telling your intent-aligned AI to "go run this company according to the goals laid out in its corporate constitution," which winds up being just "make all the money you can."
The first danger can be met with: for the love of god, get alignment right and don't use an idiotic target like "follow the laws of the...
Right. I actually don't worry much about the likely disastrous recession. I mostly worry that we will all die after a takeover by some sort of misaligned AGI. So that's what I am doing: alignment research. I guess preparing to reap the rewards if things go well is a sensible response if you're not going to be able to contribute much to alignment research. I do hope you'll chip in on that effort!
Part of that effort is preventing related disasters like global recession contributing to political instability and resulting nuclear- or AGI-invented-even-worse-weapo...
If John Wentworth is correct about that being the biggest danger, making AI produce less slop would be the clear best path. I think it might be a good idea even if the dangers were split between misalignment of the first transformative AI, and it being adequately aligned but helping misalign the next generation.
From my comment on that post:
...I'm curious why you think deceptive alignment from transformative AI is not much of a threat. I wonder if you're envisioning purely tool AI, or aligned agentic AGI that's just not smart enough to align better AGI?
I think
You are envisioning human-plus AGI being used for one purpose, when it will be used for many purposes.
When humans are obsolete for running small businesses, we will also be obsolete for nearly everything.
The big question is rate of conversion from human to AI workers. I really don't see how we avoid a dramatic global recession if even maybe 20% of jobs disappeared over a 3-year period. And the actuality could be worse than that.
I haven't gotten around to researching how much job loss, how quickly, economists think will cause major crashes. I tend to think economists aren't understanding the scope and likely rate of AI job replacement, while AI people aren't understanding how fragile economies can be.
Wheeee!
Excuse: DeepSeek, and China Might Win!
If we're using another instance of the same base model to do the monitoring and interpreting? It seems like that could partially preserve monitoring and interpreting. Another instance would "understand" the steganography it uses, but you might have to supply so much of the context that it would be almost the same instance, and so likely to adopt the same goals and use the same deceptions, if any.
So that route does seem like dangerous territory. You'd rather not trust even a fresh instance of the same model to tell you what its agentic counterpart is "thinkin...
Good questions. I don't have much of a guess about whether this is discernibly "smarter" than Claude or Gemini would be in how it understands and integrates sources.
If anyone is game for creating an agentic research scaffold like the one Thane describes, I'd love to help design it and/or to know about the results.
I very much agree with that limitation on Google's deep research. It only accepts a short request for the report, and it doesn't seem like it can (at least easily) get much more in-depth than the default short gloss. But that doesn't mean the model i...
Yes, they do highlight this difference. I wonder how the full o3 scores? It would be interesting to know how much improvement is based on o3's improved reasoning and how much on the sequential research procedure.
I feel a bit sad that the alignment community is so focused on intelligence enhancement. The chance of getting enough time for that seems so low that pursuing it means accepting a low chance of survival.
What has convinced you that the technical problems are unsolvable? I've been trying to track the arguments on both sides rather closely, and the discussion just seems unfinished. My shortform on cruxes of disagreement on alignment difficulty is still mostly my current summary of the state of disagreements.
It seems like we have very little idea how technically diff...
All of those. Value alignment is the set of all the different proposed methods of giving AGI values that align with humanity's values.
> we're really training LLMs mostly to have a good world model and to follow instructions
I think I mostly agree with that, but it’s less true of o1 / r1-type stuff than what came before, right?
I think it's actually not any less true of o1/r1. It's still mostly predictive/world modeling training, with a dash of human-preference RL which could be described as following instructions as intended in a certain set of task domains. o1/r1 is a bad idea because RL training on a whole CoT works against faithfulness/transparency of the CoT.
If that's al...
I see. I think about 99% of humanity at the very least are not so sadistic as to create a future with less than zero utility. Sociopaths are something like ten percent of the population, but like everything else it's on a spectrum. Sociopaths usually also have some measure of empathy and desire for approval. In a world where they've won, I think most of them would rather be hailed as a hero than be an eternal sadistic despot. Sociopathy doesn't automatically include a lot of sadism, just desire for revenge against perceived enemies.
So I'd take my chance...
It seems like you're assuming people won't build AGI if they don't have reliable ways to control it, or else that sovereign (uncontrolled) AGI would be likely to be friendly to humanity. Both seem unlikely at this point, to me. It's hard to tell when your alignment plan is good enough, and humans are foolishly optimistic about new projects, so they'll probably build AGI with or without a solid alignment plan.
So I'd say any and all solutions to corrigibility/control should be published.
Also, almost any solution to alignment in general could probably be use...
Thanks for the mention.
Here's how I'd frame it: I don't think it's a good idea to leave the entire future up to the interpretation of our first AGI(s). They could interpret our attempted alignment very differently than we hoped, in in-retrospect-sensible ways, or do something like "going crazy" from prompt injections or strange chains of thought leading to ill-considered beliefs that get control over their functional goals.
It seems like the core goal should be to follow instructions or take correction - corrigibility as a singular target (or at least prime...
I place this alongside the Simplicia/Doomimir dialogues as the farthest we've gotten (at least in publicly legible form) on understanding the dramatic disagreements on the difficulty of alignment.
There's a lot here. I won't try to respond to all of it right now.
I think the most important bit is the analysis of arguments for how well alignment generalizes vs. capabilities.
Conceptual representations generalize farther than sensory representations. That's their purpose. So when behavior (and therefore alignment) is governed by conceptual representations, it w...
It does seem to imply that, doesn't it? I respect the people leaving, and I think it does send a valuable message. And it seems very valuable to have safety-conscious people on the inside.
The question is "are the safety-conscious people effectual at all, and what are their opportunity costs?".
i.e., are the cheap things they can do without stepping on anyone's toes helpful enough on the margin to be better than what they'd be able to do at another company? (I don't know the answer; it depends on the people.)
This is the way most people feel about writing. I do not think wonderful plots are ten a penny; I think writers are miserable at creating actually good plots from the perspective of someone who values scifi and realism. Their technology and their sociology are usually off in obvious ways, because understanding those things is hard.
I would personally love to see more people who do understand science use AI to turn them into stories.
Or alternately I'd like to see skilled authors consult AI about the science in their stories.
This attitude that plots don't mat...
The better framing is almost certainly "how conscious is AI in which ways?"
The question "if AI is conscious" is ill-formed. People mean different things by "consciousness". And even if we settled on one definition, there's no reason to think it would be an either-or question; like most other phenomena, most dimensions of "consciousness" are probably on a continuum.
We tend to assume that consciousness is a discrete thing because we have only one example, human consciousness, and ultimately our own. And most people who can describe their consciousness a...
I agree with basically everything you've said here.
Will LLM-based agents have moral worth as conscious/sentient beings?
The answer is almost certainly "sort of". They will have some of the properties we're referring to as sentient, conscious, and having personhood. It's pretty unlikely that we're pointing to a nice sharp natural type when we ascribe moral patienthood to a certain type of system. Human cognition is similar and different in a variety of ways from other systems; which of these is "worth" moral concern is likely to be a matter of preferen...
Agreed and well said. Playing a number of different strategies simultaneously is the smart move. I'm glad you're pursuing that line of research.
Sorry if I sound overconfident. My actual considered belief is that AGI this decade is quite possible, and it is crazy overconfidence in longer-timeline predictions not to prepare seriously for that possibility.
Multigenerational stuff needs a way longer timeline. There's a lot of space between three years and two generations.
I buy your argument for why dramatic enhancement is possible. I just don't see how we get the time. I can barely see a route to a ban, and I can't see a route to a ban thorough enough to prevent reckless rogue actors from building AGI within ten or twenty years.
And yes, this is crazy as a society. I really hope we get rapidly wiser. I think that's possible; look at the way attitudes toward COVID shifted dramatically in about two weeks when the evidence became apparent, and people convinced their friends rapidly. Connor Leahy made some really good points abo...
This is why I wrote a blog post about enhancing adult intelligence at the end of 2023; I thought it was likely that we wouldn't have enough time.
I'm just going to do the best I can to work on both these things. Being able to do a large number of edits at the same time is one of the key technologies for both germline and adult enhancement, which is what my company has been working on. And though it's slow, we have made pretty significant progress in the last year including finding several previously unknown ways to get higher editing efficiency.
I still think the...
He just started talking about adopting. I haven't followed the details. Becoming a parent, including an adoptive parent who takes it seriously, is often a real growth experience from what I've seen.
Oh, I agree. I liked his framing of the problem, not his proposed solution.
In that regard specifically:
If the main problem with humans being not-smart-enough is being overoptimistic, maybe just make some organizational and personal belief changes to correct this?
IF we managed to get smarter about rushing toward AGI (a very big if), it seems like an organizational effort with "let's get super certain and get it right the first time for a change" as its central tenet would be a big help, with or without intelligence enhancement.
I very much doubt any major in...
I think this post is confusing. I think you're making some assumptions about how AGI will happen and about human psychology that aren't explicit. And there's some rather alarming rhetoric about the death penalty and crushing narcissists' businesses, which is pretty scary, because similar rhetoric has been used many times to justify things like China's Cultural Revolution and many other revolutions that were based on high ideals but got subverted (mostly by what you call narcissists, which I think are a bit more like the common definition of sociopaths).
Anyway, I think this is basically sensible, but it would need to be spelled out more carefully to get people engaged with the ideas.