What are the most promising plans for automating alignment research, as mentioned in, for example, OpenAI's approach to alignment and by others?
The cyborgism post might be relevant:
Executive summary: This post proposes a strategy for safely accelerating alignment research. The plan is to set up human-in-the-loop systems which empower human agency rather than outsource it, and to use those systems to differentially accelerate progress on alignment.
- Introduction: An explanation of the context and motivation for this agenda.
- Automated Research Assistants: A discussion of why the paradigm of training AI systems to behave as autonomous agents is both counterproductive and dangerous.
- Becoming a Cyborg: A proposal for an alternative approach/frame, which focuses on a particular type of human-in-the-loop system I am calling a “cyborg”.
- Failure Modes: An analysis of how this agenda could either fail to help or actively cause harm by accelerating AI research more broadly.
- Testimony of a Cyborg: A personal account of how Janus uses GPT as a part of their workflow, and how it relates to the cyborgism approach to intelligence augmentation.
Suppose we train a model on the sum of all human data, using every sensory modality ordered by timestamp, like a vastly more competent GPT (for the sake of argument, assume that a competent actor with the right incentives is training such a model). Such a predictive model would build an abstract world model of human concepts, values, ethics, etc., and be able to predict how various entities would act based on that generalised world model. It would also "understand" almost all human-level abstractions about how fictional characters might act, just as GPT does. My question is: if we used such a model to predict how an AGI aligned with our CEV would act, in what way could it be misaligned? What failure modes are there for pure predictive systems without a reward function that can be exploited or misgeneralised? This seems like the most plausible mental model I have for aligning intelligent systems without them pursuing radically alien objectives.
What about simulating smaller aspects of cognition that can be chained, like CoT with GPT? You could use self-criticism to align and assess its actions relative to a bunch of messy human abstractions. How does that scenario lead to doom? If it were misaligned, I think a well-instantiated predictive model could update its understanding of our values from feedback, predicting how a corrigible AI would act.
My best guess is we can't prompt it to instantiate the right simulacra correctly. This seems challenging depending on how it's initialised. It's far easier with text, but fabricating an entire consistent history is borderline impossible, especially against a superintelligence. It would involve tricking it into predicting the universe as it would be if, all else being equal, an intelligent AI aligned with our values had come into existence. It would probably realise that its history was far more consistent with the hypothesis that it was just an elaborate trick.
Yup, I'd lean towards this. If you have a powerful predictor of a bunch of rich, detailed sense data, then in order to "ask it questions," you need to be able to forge what that sense data would look like if the thing you want to ask about were true. This is hard; it gets harder the more complete the AI's view of the world is, and if you screw up you can get useless or malign answers without it being obvious.
It might still be easier than the ordinary alignment problem, but you also have to ask yourself about dual use. If this powerful AI makes solving alignment a little easier but makes destroying the world a lot easier, that's bad.
Even without ensuring inner alignment, is it possible to reliably train the preferences of an AGI to be more risk averse, to be bounded, and to discount the future more? For example, just by using rewards in RL with those properties: even if the AGI misgeneralizes the objective or the objective is not outer aligned, the AGI might still internalize the intended risk aversion and discounting. How likely is it to do so? Or, can we feasibly hardcode risk aversion, bounded preferences and discounting into today's models without reducing capabilities too much?
My guess is that such AGIs would be safer. Given that there is some risk that the AGI will be caught and shut down if it tries to take over, takeover attempts should be relatively less attractive. But could it make much difference?
If you do model-free RL with a reward that favours risk-averse behaviour and penalizes risk-taking, inner optimization or other unintended solutions could definitely still lead to problems if they crop up - they wouldn't have to inherit the risk aversion.
With model-based RL it seems pretty feasible to hard-code risk aversion in. You just have to use the world-model to predict probability distributions (maybe implicitly) and then can more directly be risk-averse when using those predictions. This probably wouldn't be stable under self-reflection, though - when evaluating self-modification plans, or plans for building a successor agent, keeping risk-aversion around might appear to have some risks to it long-term.
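A minimal sketch of what that hard-coding could look like, assuming we already have a world model we can sample predicted returns from (the world model below is just a stub, and the mean-minus-penalty-times-spread rule is only one illustrative choice of risk-averse objective):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in world model: maps (state, action) to a distribution over returns.
# In a real system this would be the learned model; here it's a stub for illustration.
def sample_returns(state, action, n_samples=10_000):
    mean, spread = {"safe": (1.0, 0.1), "risky": (2.0, 5.0)}[action]
    return rng.normal(mean, spread, n_samples)

def risk_averse_choice(state, actions, risk_penalty=1.0):
    """Score each action by its predicted mean return minus a penalty on the spread."""
    def score(action):
        returns = sample_returns(state, action)
        return returns.mean() - risk_penalty * returns.std()
    return max(actions, key=score)

print(risk_averse_choice(state=None, actions=["safe", "risky"]))  # picks "safe"
# With risk_penalty=0 the same code picks "risky", the higher-mean action.
```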
Risk aversion wouldn't help humanity much if we build unaligned AGI anyhow. The least risky plans from the AI's perspective are still gonna be bad for humans.
But something like "moral risk aversion" could still be stable under reflection (because moral uncertainty isn't epistemic uncertainty) and might end up being a useful expression of a way we want the AI to reason.
Thanks! This makes sense.
I agree model-free RL wouldn't necessarily inherit the risk aversion, although I'd guess there's still a decent chance it would, because that seems like the most natural and simple way to generalize the structure of the rewards.
Why would hardcoded model-based RL probably self-modify or build successors this way, though? To deter/prevent threats from being made in the first place or even followed through on? But, does this actually deter or prevent our threats when evaluating the plan ahead of time, with the original preferences? We'd still want to shut it and any successors down if we found out (whenever we do find out, or it starts trying to take over), and it should be averse to that increased risk ahead of time when evaluating the plan.
Risk aversion wouldn't help humanity much if we build unaligned AGI anyhow. The least risky plans from the AI's perspective are still gonna be bad for humans.
I think there are (at least) two ways to reduce this risk:
And you could fix this to be insensitive to butterfly effects, by comparing quantile functions as random variables instead.
Why would hardcoded model-based RL probably self-modify or build successors this way, though?
Because picking a successor is like picking a policy, and risk aversion over policies can give different results than risk aversion over actions.
Like, suppose you go to a casino with $100, and there are two buttons you can push - one button does nothing, and with the other you have a 60% chance to win a dollar and a 40% chance to lose a dollar. If you're risk averse you might choose to only ever press the first button (not gamble).
If there's some action you could take to enact a policy of pressing the second button 100 times, that's like a third button, which gives about $20 with a standard deviation of about $10. Maybe you'd prefer that button to doing nothing even if you're risk averse.
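(A quick numerical check of that comparison, with the payoffs above; scoring each option by mean minus one standard deviation is just an illustrative risk-averse rule:)

```python
import math

p_win, win, lose = 0.6, 1.0, -1.0

# A single press of the gamble button.
mean_1 = p_win * win + (1 - p_win) * lose                              # 0.2
sd_1 = math.sqrt(p_win * win**2 + (1 - p_win) * lose**2 - mean_1**2)   # ~0.98

# A policy of pressing it 100 times (independent presses).
mean_100, sd_100 = 100 * mean_1, math.sqrt(100) * sd_1                 # 20.0, ~9.8

score = lambda mean, sd: mean - sd   # illustrative risk-averse score
print(score(0.0, 0.0))               # do nothing:         0.0
print(score(mean_1, sd_1))           # one press:         ~-0.78 (rejected)
print(score(mean_100, sd_100))       # 100-press policy:  ~+10.2 (accepted)
```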
Because picking a successor is like picking a policy, and risk aversion over policies can give different results than risk aversion over actions.
I was already thinking the AI would be risk averse over whole policies and the aggregate value of their future, not locally/greedily/separately for individual actions and individual unaggregated rewards.
I'm confused about how to do that because I tend to think of self-modification as happening when the agent is limited and can't foresee all the consequences of a policy, especially policies that involve making itself smarter. But I suspect that even if you figure out a non-confusing way to talk about risk aversion for limited agents that doesn't look like actions on some level, you'll get weird behavior under self-modification, like an update rule that privileges the probability distribution you had at the time you decided to self-modify.
Do the concepts behind AGI safety only make sense if you have roughly the same worldview as the top AGI safety researchers - secular atheism and reductive materialism/physicalism and a computational theory of mind?
No.
Atheism is totally irrelevant. A deist would come to exactly the same conclusions. A Christian might not be convinced of it, but mainly for eschatological reasons. Unless you go the route of saying that AGI is the antichrist or something, which would be fun. Or that God(s) will intervene if things get too bad?
Reductive materialism is also irrelevant. It might play a role in the question of whether an AGI is conscious, but that whole topic is a red herring - you don't need a conscious system for it to kill everyone.
This feeds into the computational theory of mind - it makes it a lot easier to posit the possibility of a conscious AGI if you don't require a soul for it, but again - consciousness isn't really needed for an unsafe AI.
I have fundamentalist Christian friends who are ardent believers, but who also recognize the issues behind AGI safety. They might not think it that much of a problem (pretty much everything pales in comparison to eternal heaven and hell), but they can understand and appreciate the issues.
I want to personally confirm a lot of what you've said here. As a Christian, I'm not entirely freaked out about AI risk because I don't believe that God will allow it to be completely the end of the world (unless it is part of the planned end before the world is remade? But that seems unlikely to me.), but that's no reason that it can't still go very very badly (seeing as, well, the Holocaust happened).
In addition, the way that seems to me most likely for God not to allow AI doom is for people working on AI safety to succeed. One shouldn't rely on miracles and all that (unless [...]), so basically I think we should plan and work as if it is up to humanity to prevent AI doom; it's only that I'm a bit less scared of the possibility of failure, and I would hope only in a way that results in better action (compared to panic) rather than promoting inaction.
(And if we don't succeed, the alternative I think of as likely is something like: really bad stuff happens, but then maybe an EMP (or many EMPs worldwide?) gets activated, solving that problem, but also causing large-scale damage to power grids, frying lots of equipment, and causing shortages of many things necessary for the economy, which also causes many people to die. idk.)
@drocta @Cookiecarver We started writing up an answer to this question for Stampy. If you have any suggestions to make it better I would really appreciate it. Are there important factors we are leaving out? Something that sounds off? We would be happy for any feedback you have either here or on the document itself https://docs.google.com/document/d/1tbubYvI0CJ1M8ude-tEouI4mzEI5NOVrGvFlMboRUaw/edit#
Overall I agree with this. I give most of my money to global health organizations, but I do give some of my money to AGI safety too, because I do think it makes sense under a variety of worldviews. I gave some of my thoughts on the subject in this comment on the Effective Altruism Forum. To summarize: if there's a continuation of consciousness after death, then AGI killing lots of people is not as bad as it would otherwise be, and there might be some unknown aspects of the relationship between consciousness and the physical universe that could have an effect on the odds.
In the line that ends with "even if God would not allow complete extinction.", my impulse is to include " (or other forms of permanent doom)" before the period, but I suspect that this is due to my tendency to include excessive details/notes/etc. and probably best not to actually include in that sentence.
(Like, for example, if there were no more adult humans, only billions of babies grown in artificial wombs (in a way staggered in time) and then kept in a state of chemically induced euphoria until the age of 1, and then killed, that technically wouldn't be human extinction, but, that scenario would still count as doom.)
Regarding the part about "it is secular scientific-materialists who are doing the research which is a threat to my values" part: I think it is good that it discusses this! (and I hadn't thought about including it)
But, I'm personally somewhat skeptical that CEV really works as a solution to this problem? Or at least, in the simpler ways of it being described.
Like, I imagine there being a lot of path-dependence in how a culture's values would "progress" over time, and I see little reason why a sequence of changes of the form "opinion/values changing in response to an argument that seems to make sense" would be that unlikely to produce values that the initial values would deem horrifying? (or, which would seem horrifying to those in an alternate possible future that just happened to take a difference branch in how their values evolved)
[EDIT: at this point, I start going off on a tangent which is a fair bit less relevant to the question of improving Stampy's response, so, you might want to skip reading it, idk]
My preferred solution is closer to: "we avoid applying large amounts of optimization pressure to most topics, instead applying it only to topics where there is near-unanimous agreement on what kinds of outcomes are better (such as "humanity doesn't get wiped out by a big space rock", "it is better for people to not have terrible diseases", etc.), while avoiding these optimizations having much effect on other areas where there is much disagreement as to what-is-good."
Though, it does seem plausible to me, as a somewhat scary idea, that the thing I just described is perhaps not exactly coherent?
(That being said, even though I have my doubts about CEV, at least in the simpler forms in which it is described, I do think it would of course be better than doom.
Also, it is quite possible that I'm just misunderstanding the idea of CEV in a way that causes my concerns, and maybe it was always meant to exclude the kinds of things I describe being concerned about?)
Hi, I have a few questions that I'm hoping will help me clarify some of the fundamental definitions. I totally get that these are problematic questions in the absence of consensus around these terms -- I'm hoping to have a few people weigh in and I don't mind if answers are directly contradictory or my questions need to be re-thought.
Apologies if these more speculative/thought-experimenty questions are off the mark for this thread, happy to be pointed to a more appropriate place for them!
It is important to remember that our ultimate goal is survival. If someone builds a system that may not meet the strict definition of AGI but still poses a significant threat to us, then the terminology itself becomes less relevant. In such cases, employing a 'taboo-your-words' approach can be beneficial.
Now let's think of intelligence as "pattern recognition". It is not all that intelligence is, but it is a big chunk of it, and it is a concrete thing we can point to and reason about while many other bits are not even known.[1]
In that case, GI is general/meta/deep pattern recognition: patterns about patterns, and patterns that apply to many practical cases, something like that.
An obvious thing to note here: the ability to solve problems can be based on a large number of shallow patterns or a small number of deep patterns. We are pretty sure that a significant part of LLM capabilities is the shallow-pattern case, but there are hints of at least some deep patterns appearing.
And I think that points to some answers: LLMs appear intelligent through a sheer amount of shallow patterns. But for a system to be dangerous, the number of required shallow patterns is so large that it is essentially impossible to achieve. So we can meaningfully say it is not dangerous, it is not AGI... Except, as mentioned earlier, there seem to be some deep patterns emerging, and we don't know how many. As for the pre-home-computer-era researchers, I bet they could not imagine the number of shallow patterns that can be put into a system.
I hope this provided at least some idea of how to approach some of your questions, but of course in reality it is much more complicated: there is no sharp distinction between shallow and deep patterns, and there are other aspects of intelligence. For me at least it is surprising that it is possible to get GPT-3.5 from seemingly relatively shallow patterns, so I myself "could not imagine the number of shallow patterns that can be put into a system".
I tried ChatGPT on this paragraph; I like the result but felt it was too long:
Intelligence indeed encompasses more than just pattern recognition, but it is a significant component that we can readily identify and discuss. Pattern recognition involves the ability to identify and understand recurring structures, relationships, and regularities within data or information. By recognizing patterns, we can make predictions, draw conclusions, and solve problems.
While there are other aspects of intelligence beyond pattern recognition, such as creativity, critical thinking, and adaptability, they might be more challenging to define precisely. Pattern recognition provides a tangible starting point for reasoning about intelligence.
If we consider problem-solving as a defining characteristic of intelligence, it aligns well with pattern recognition. Problem-solving often requires identifying patterns within the problem space, recognizing similarities to previously encountered problems, and applying appropriate strategies and solutions.
However, it's important to acknowledge that intelligence is a complex and multifaceted concept, and there are still many unknowns about its nature and mechanisms. Exploring various dimensions of intelligence beyond pattern recognition can contribute to a more comprehensive understanding.
How feasible is it to build a powerful AGI with little agency but that's good at answering questions accurately?
Could we solve alignment with one, e.g. getting it to do research, or using it to verify the outputs of other AGIs?
These question-answering AIs are often called Oracles; you can find some info on them here. Their cousin, tool AI, is also relevant here. You'll discover that they are probably safer, but by no means entirely safe.
We are working on an answer about the safety of Oracles for Stampy - keep your eyes peeled, it should show up soon.
An AGI that can accurately answer questions such as "What would this agentic AGI do in this situation?" will, if powerful enough, learn what agency is by default, since this is useful for predicting such things. So you can't just train an AGI with little agency. You would need to do one of:
Both of these seem like difficult problems - if we could solve either (especially the first) this would be a very useful thing, but the first especially seems like a big part of the problem already.
In the context of Deceptive Alignment, would the ultimate goal of an AI system appear random and uncorrelated with the training distribution's objectives from a human perspective? Or would it be understandable to humans that the goal is somewhat correlated with the objectives of the training distribution?
For instance, in the article below, it is written that "the model just has some random proxies that were picked up early on, and that's the thing that it cares about." To what extent does it learn random proxies?
https://www.lesswrong.com/posts/A9NxPTwbw6r6Awuwt/how-likely-is-deceptive-alignment
If an AI system pursues ultimate goals such as power or curiosity, there seems to be a pseudocorrelation regardless of what the base objective is.
On the other hand, can it possibly learn to pursue a goal completely unrelated to the context of the training distribution, such as mass-producing objects of a peculiar shape?
This question is in the spirit of "I think I'm doing something dumb / obviously wrong -- help me see why" but it's maybe too niche for this thread. (Answers that redirect me to a better place to ask are welcome.)
I recently read Paul Christiano, Eric Neyman and Mark Xu's "Formalizing the presumption of independence" (https://arxiv.org/pdf/2211.06738.pdf). My understanding is that they aim to formalise some types of reasonable (but defeasible) “hand-waving” in otherwise formal proofs, in a way that maintains the underlying deductive structure of a formal proof and responds appropriately to new information / arguments. They're particularly interested in heuristic estimators that presume the independence of random variables so long as we have no reason to think the variables aren't independent and so long as we can adjust the estimate appropriately if we learn about their dependencies.
To that end, suppose we want to estimate $\mathbb{E}[f(X_1, \dots, X_n)]$, where $X = \{X_1, \dots, X_n\}$ is a set of real-valued random variables, $f : \mathbb{R}^n \to \mathbb{R}$, and we have a collection $\pi$ of deductively proved (in)equalities about $X$. Then a natural heuristic estimator could be:

$$\mathbb{G}\big(\mathbb{E}[f(X_1, \dots, X_n)] \mid \pi\big) := \mathbb{E}\big[f(X'_1, \dots, X'_n) \mid \pi'\big],$$

where each $X'_i$ has the same marginal distribution as $X_i$, $\pi'$ is equal to $\pi$ but with each instance of $X_i$ replaced by $X'_i$, and the $X'_i$ are conditionally independent given $\pi'$. This formalises the idea that we assume we've thought of all the dependencies between the variables of interest and that they're independent, conditional on everything we've thought of so far -- but we can revise this estimate by conditioning on new information and dependencies later.

Before considering any information relating the $X'_i$ to each other, $\mathbb{G}$ assumes that they are unconditionally independent. As we condition on information about them, we update the estimate to account for this and maintain that the variables are conditionally independent, given the information considered so far. E.g. in the twin primes example, we can initially assume that $\mathbb{1}[p \text{ is prime}]$ and $\mathbb{1}[p+2 \text{ is prime}]$ are independent, and then condition on the fact that if $p$ is prime, then $p+2$ is odd (this can be operationalised by considering the appropriate indicator function and conditioning on it taking value $1$) to adjust the estimate and assume (for now) that there are no further dependencies.
We always have $\mathbb{E}[X'_i] = \mathbb{E}[X_i]$. In fact, we always have $\mathbb{E}[g(X'_i)] = \mathbb{E}[g(X_i)]$ for any $g$, since the marginals agree. If we further have that $\pi$ doesn't relate $X_i$ and $X_j$ (i.e. doesn't include a formula containing both $X_i$ and $X_j$), then I think we have $\mathbb{E}[X'_i \mid \pi'] = \mathbb{E}[X_i \mid \pi]$ and $\mathbb{E}[X'_j \mid \pi'] = \mathbb{E}[X_j \mid \pi]$, giving $\mathbb{G}(\mathbb{E}[X_i X_j] \mid \pi) = \mathbb{E}[X_i \mid \pi]\,\mathbb{E}[X_j \mid \pi]$ (i.e. without the primes).
My suggested heuristic estimator apparently has lots of nice properties thanks to being an expectation, including some of the informal properties listed in the paper, which can be stated formally (e.g. if a proved statement doesn't have an instance of any of the $X_i$, then conditioning on it won't change the heuristic estimate).
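(To make the intended mechanism concrete, here is a toy Monte Carlo sketch of my own, not something from the paper: the primed copies are sampled independently with the assumed marginals, and conditioning on a newly proved dependency is done by rejection sampling.)

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Primed copies: independent samples with the assumed marginals P(X=1) = P(Y=1) = 0.5.
x = rng.integers(0, 2, n)
y = rng.integers(0, 2, n)

f = lambda x, y: x * y

# Heuristic estimate of E[f(X, Y)] presuming independence (no proved facts yet).
print(f(x, y).mean())          # ~0.25

# Now suppose we deductively prove the dependency "X + Y <= 1" (they are never both 1).
# Conditioning the independent copies on that fact (rejection sampling) updates the estimate.
ok = (x + y) <= 1
print(f(x[ok], y[ok]).mean())  # ~0.0
```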
My suggested estimator jumped out to me pretty quickly as capturing (to my understanding) what the authors want, but I'd expect myself to be much worse at this than the authors, who will have spent a while longer thinking about it. So my estimator seems "too good to be true" and I think it's likely I'm pretty confused or missing something obvious and/or important. Please help me see what I'm missing! A couple of hypotheses:
I thought it dealt with these ok -- could you be more specific?
It's linear because it's an expectation. It is under-specified in that it needs us to assume or prove the marginal distributions for the $X_i$, and I guess that's problematic if an algorithm for doing that is a big part of what the authors are looking for. But if we do have marginal distributions for each $X_i$, then the $X'_i$ are well-defined and so is the estimator.
I would like to know about the history of the term "AI alignment". I found an article written by Paul Christiano in 2018. Did the use of the term start around this time? Also, what is the difference between AI alignment and value alignment?
https://www.alignmentforum.org/posts/ZeE7EKHTFMBs8eMxn/clarifying-ai-alignment
How slow is human perception compared to AI? Is it purely a difference between the "signal speed of neurons" and the "signal speed of copper/aluminum"?
It's hard to say. This CLR article lists some advantages that artificial systems have over humans. Also see this section of 80k's interview with Richard Ngo:
Rob Wiblin: One other thing I’ve heard, that I’m not sure what the implication is: signals in the human brain — just because of limitations and the engineering of neurons and synapses and so on — tend to move pretty slowly through space, much less than the speed of electrons moving down a wire. So in a sense, our signal propagation is quite gradual and our reaction times are really slow compared to what computers can manage. Is that right?
Richard Ngo: That’s right. But I think this effect is probably a little overrated as a factor for overall intelligence differences between AIs and humans, just because it does take quite a long time to run a very large neural network. So if our neural networks just keep getting bigger at a significant pace, then it may be the case that for quite a while, most cutting-edge neural networks are actually going to take a pretty long time to go from the inputs to the outputs, just because you’re going to have to pass it through so many different neurons.
Rob Wiblin: Stages, so to speak.
Richard Ngo: Yeah, exactly. So I do expect that in the longer term there’s going to be a significant advantage for neural networks in terms of thinking time compared with the human brain. But it’s not actually clear how big that advantage is now or in the foreseeable future, just because it’s really hard to run a neural network with hundreds of billions of parameters on the types of chips that we have now or are going to have in the coming years.
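For a rough sense of the numbers (every figure below is an order-of-magnitude assumption, not a measurement):

```python
# Order-of-magnitude comparison; each number here is a rough assumption.
neuron_signal_speed = 100.0        # m/s, fast myelinated axons (unmyelinated are ~1 m/s)
wire_signal_speed   = 2e8          # m/s, roughly 2/3 the speed of light in a conductor

human_reaction_time      = 0.2     # s, typical simple reaction time
large_model_forward_pass = 0.1     # s, assumed latency for a very large network's forward pass

print(wire_signal_speed / neuron_signal_speed)         # ~2,000,000x raw signal-speed advantage
print(human_reaction_time / large_model_forward_pass)  # but end-to-end latency differs far less
```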
I really appreciate these.
These are great questions! Stampy does not currently have an answer for the first one, but its answer on prosaic alignment could get you started on ways that some people think might work without needing additional breakthroughs.
Regarding the second question, the plan seems to be to use less powerful AIs to align more powerful AIs, and the hope is that these helper AIs would not be powerful enough for misalignment to be an issue.
You could use a bounded utility function, with sufficiently quickly decreasing marginal returns. Or use difference-making risk aversion or difference-making ambiguity aversion.
Maybe also just aversion to Pascal's mugging itself, but then the utility maximizer needs to be good enough at recognizing Pascal's muggings.
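A toy numerical sketch of why a bounded utility function blunts the mugging (the wealth, probability, and payoff figures are made up for illustration):

```python
import math

w = 100.0        # current "wealth" (illustrative units)
p = 1e-20        # mugger's claimed probability of delivering
payoff = 1e30    # mugger's claimed payoff
cost = 10.0      # what the mugger asks for now

def expected_utilities(u):
    pay    = p * u(w - cost + payoff) + (1 - p) * u(w - cost)
    refuse = u(w)
    return pay, refuse

# Unbounded (linear) utility: paying has expected utility ~1e10, so a naive maximizer pays.
print(expected_utilities(lambda x: x))

# Bounded utility with quickly decreasing marginal returns (capped at 1): the tiny chance of
# hitting the cap can no longer compensate for the sure loss of 10, so the maximizer refuses.
u_bounded = lambda x: 1 - math.exp(-x / 1e6)
print(expected_utilities(u_bounded))
```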
Thanks. Could we be sure that a bare utility maximizer doesn't modify itself into a mugging-proof version? I think we can. Such modification drastically decreases expected utility.
It's a bit of a relief that a sizeable portion of possible intelligences can be stopped by playing god to them.
Could we be sure that a bare utility maximizer doesn't modify itself into a mugging-proof version? I think we can. Such modification drastically decreases expected utility.
Maybe for positive muggings, when the mugger is offering to make the world much better than otherwise. But it might self-modify to not give in to threats, in order to discourage threats.
Given an aligned AGI, to what extent are people ok with letting the AGI modify us? Examples of such modifications include (feel free to add to the list):
What exact parts of being "human" do we want to preserve?
These are interesting questions that modern philosophers have been pondering. Stampy has an answer on forcing people to change faster than they would like and we are working on adding more answers that attempt to guess what an (aligned) superintelligence might do.
Does current AI technology possess the power to work its way up to AGI? For example, if the world all of a sudden put a permanent halt on all AI advancements and we were left with GPT-4, would it, given an infinite amount of time of use, achieve AGI through RLHF and access to the internet? This assumes that GPT-4 is not already an AGI system.
i.e.: Do AI systems need changes to their actual architecture to become smarter, or is it possible that they get (significantly) smarter simply through enough usage?
GPT-4 doesn't learn when you use it. It doesn't update its parameters to better predict the text of its users or anything like that. So the answer to the basic question is "no."
You could also ask "But what if it did keep getting updated? Would it eventually become super-good at predicting the world?" There are these things called "scaling laws" that predict performance based on the amount of training data, and they would say that with arbitrary amounts of data, GPT-4 could get arbitrarily smart (though note that this would require new data amounting to many times all the text produced in human history so far). But the scaling laws almost certainly break if you try to extend them too far for a fixed architecture. I actually expect GPT-4 would become (more?) superhuman at many tasks related to writing text, but remain not all that great at predicting aspects of the physical world that are rare in text and hard for humans.
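For a feel of what such a law looks like, here's a toy data-scaling curve in that spirit; the constants are invented for illustration and are not fitted values for GPT-4 or any real model:

```python
# Toy data-scaling law: loss(D) = E + B / D**beta, with an irreducible floor E.
# All constants below are illustrative assumptions, not published fits.
E, B, beta = 1.7, 400.0, 0.28

def loss(tokens):
    return E + B / tokens**beta

for tokens in [1e11, 1e12, 1e13, 1e14]:
    print(f"{tokens:.0e} tokens -> predicted loss {loss(tokens):.3f}")
# Each 10x more data buys a smaller improvement, and the curve flattens toward the floor E.
```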
Charlie is correct in saying that GPT-4 does not actively learn based on its input. But a related question is whether we are missing key technical insights for AGI, and Stampy has an answer for that. He also has an answer explaining scaling laws.
tl;dr: Ask questions about AGI Safety as comments on this post, including ones you might otherwise worry seem dumb!
Asking beginner-level questions can be intimidating, but everyone starts out not knowing anything. If we want more people in the world who understand AGI safety, we need a place where it's accepted and encouraged to ask about the basics.
We'll be putting up monthly FAQ posts as a safe space for people to ask all the possibly-dumb questions that may have been bothering them about the whole AGI Safety discussion, but which until now they didn't feel able to ask.
It's okay to ask uninformed questions, and not worry about having done a careful search before asking.
AISafety.info - Interactive FAQ
Additionally, this will serve as a way to spread the project Rob Miles' team[1] has been working on: Stampy and his professional-looking face aisafety.info. This will provide a single point of access into AI Safety, in the form of a comprehensive interactive FAQ with lots of links to the ecosystem. We'll be using questions and answers from this thread for Stampy (under these copyright rules), so please only post if you're okay with that!
You can help by adding questions (type your question and click "I'm asking something else") or by editing questions and answers. We welcome feedback and questions on the UI/UX, policies, etc. around Stampy, as well as pull requests to his codebase and volunteer developers to help with the conversational agent and front end that we're building.
We've got more to write before he's ready for prime time, but we think Stampy can become an excellent resource for everyone from skeptical newcomers, through people who want to learn more, right up to people who are convinced and want to know how they can best help with their skillsets.
Guidelines for Questioners:
Guidelines for Answerers:
Finally: Please think very carefully before downvoting any questions, remember this is the place to ask stupid questions!
If you'd like to join, head over to Rob's Discord and introduce yourself!