Why do we assume that any AGI can meaningfully be described as a utility maximizer?
Humans are some of the most intelligent structures that exist, and we don’t seem to fit that model very well. In fact, it seems the entire point of Rationalism is to improve our ability to do this, which has only been achieved with mixed success.
Organisations of humans (e.g. USA, FDA, UN) have even more computational power and don’t seem to be doing much better.
Perhaps an intelligence (artificial or natural) cannot necessarily, or even typically, be described as an optimiser? Instead we could only model it as an algorithm or as a collection of tools/behaviours executed in some pattern.
An AGI that was not a utility maximizer would make more progress towards whatever goals it had if it modified itself to become a utility maximizer. Three exceptions are if (1) the AGI has a goal of not being a utility maximizer, (2) the AGI has a goal of not modifying itself, (3) the AGI thinks it will be treated better by other powerful agents if it is not a utility maximizer.
This is an excellent question. I'd say the main reason is that all of the AI/ML systems that we have built to date are utility maximizers; that's the mathematical framework in which they have been designed. Neural nets / deep-learning work by using a simple optimizer to find the minimum of a loss function via gradient descent. Evolutionary algorithms, simulated annealing, etc. find the minimum (or maximum) of a "fitness function". We don't know of any other way to build systems that learn.
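To make "a simple optimizer finds the minimum of a loss function via gradient descent" concrete, here is a minimal sketch. The one-parameter model, the toy data, the learning rate, and the function names are all assumptions chosen for illustration; real deep-learning systems use the same loop with vastly more parameters and automatic differentiation.

```python
# Minimal sketch: "learning" as minimising a loss by gradient descent.
# Model, data, and hyperparameters are illustrative assumptions.

def loss(w, data):
    """Mean squared error of the one-parameter model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def grad(w, data):
    """Derivative of the loss with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in data) / len(data)

def gradient_descent(data, w=0.0, lr=0.05, steps=200):
    """Repeatedly step w downhill along the loss gradient."""
    for _ in range(steps):
        w -= lr * grad(w, data)
    return w

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # toy data generated by y = 2x
w_fit = gradient_descent(data)
print(w_fit, loss(w_fit, data))
```

Note that the optimizer here is the training loop, not the fitted model: once `w_fit` is found, the model is just a fixed function.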
Humans themselves evolved to maximize reproductive fitness. In the case of humans, our primary fitness function is reproductive fitness, but our genes have encoded a variety of secondary functions which (over evolutionary time) have been correlated with reproductive fitness. Our desires for love, friendship, happiness, etc. fall into this category. Our brains mainly work to satisfy these secondary functions; the brain gets electrochemical reward signals, controlled by our genes, in the form of pain/pleasure/satisfaction/loneliness etc. These secondary functions may or may not remain aligned with the primary loss function, which is why practitioners sometimes talk about "mesa-optimizers" or "inner vs outer alignment."
Do not use FAIR as a symbol of villainy. They're a group of real, smart, well-meaning people who we need to be capable of reaching, and who still have some lines of respect connecting them to the alignment community. Don't break them.
I'm an ML engineer at a FAANG-adjacent company. Big enough to train our own sub-1B parameter language models fairly regularly. I work on training some of these models and finding applications of them in our stack. I saw the light after I read most of Superintelligence. I feel like I'd like to help out somehow. I'm in my late 30s with kids, and live in the SF Bay Area. I kinda have to provide for them, and don't have any family money or resources to lean on, and would rather not restart my career. I also don't think I should abandon ML and try to do distributed systems or something. I'm a former applied mathematician, with a PhD, so ML was a natural fit. I like to think I have a decent grasp on epistemics, but haven't gone through the sequences. What should someone like me do? Some ideas: (a) Keep doing what I'm doing, staying up to date but at least not at the forefront; (b) make time to read more material here and post randomly; (c) maybe try to apply to Redwood or Anthropic... though dunno if they offer equity (doesn't hurt to find out though); (d) try to deep dive on some alignment sequence on here.
Both 80,000 Hours and AI Safety Support are keen to offer personalised advice to people facing a career decision and interested in working on alignment (and in 80k's case, also many other problems).
Noting a conflict of interest - I work for 80,000 hours and know of but haven't used AISS. This post is in a personal capacity, I'm just flagging publicly available information rather than giving an insider take.
You might want to consider registering for the AGI Safety Fundamentals Course (or reading through the content). The final project provides a potential way of dipping your toes into the water.
This is a meta-level question:
The world is very big and very complex, especially if you take into account the future. In the past it has been hard to predict what happens in the future; I think most predictions about the future have failed. Artificial intelligence as a field is very big and complex, at least that's how it appears to me personally. Eliezer Yudkowsky's brain is small compared to the size of the world, all the relevant facts about AGI x-risk probably don't fit into his mind, nor do I think he has the time to absorb all the relevant facts related to AGI x-risk. Given all this, how can you justify the level of certainty in Yudkowsky's statements, instead of being more agnostic?
My model of Eliezer says something like this:
AI will not be aligned by default, because AI alignment is hard and hard things don't spontaneously happen. Rockets explode unless you very carefully make them not do that. Software isn't automatically secure or reliable, it takes lots of engineering effort to make it that way.
Given that, we can presume there needs to be a specific example of how we could align AI. We don't have one. If there was one, Eliezer would know about it - it would have been brought to his attention, the field isn't that big and he's a very well-known figure in it. Therefore, in the absence of a specific way of aligning AI that would work, the probability of AI being aligned is roughly zero, in much the same way that "Throw a bunch of jet fuel in a tube and point it towards space" has roughly zero chance of getting you to space without specific proof of how it might do that.
So, in short - it is reasonable to assume that AI will be aligned only if we make it that way with very high probability. It is reasonable to assume that if there was a solution we had that would work, Eliezer would know about it. You don't need to know everything about AGI x-risk for that - a...
The reason why nobody in this community has successfully named a 'pivotal weak act' where you do something weak enough with an AGI to be passively safe, but powerful enough to prevent any other AGI from destroying the world a year later - and yet also we can't just go do that right now and need to wait on AI - is that nothing like that exists.
The language here is very confident. Are we really this confident that there are no pivotal weak acts? In general, it's hard to prove a negative.
Should an "ask dumb questions about AGI safety" thread be recurring? Surely people will continue to come up with more questions in the years to come, and the same dynamics outlined in the OP will repeat. Perhaps this post could continue to be the go-to page, but it would become enormous (but if there were recurring posts they'd lose the FAQ function somewhat. Perhaps recurring posts and a FAQ post?).
This is the exact problem StackExchange tries to solve, right? How do we get (and kickstart the use of) an Alignment StackExchange domain?
Most of the discussion I've seen around AGI alignment is on adequately, competently solving the alignment problem before we get AGI. The consensus in the air seems to be that those odds are extremely low.
What concrete work is being done on dumb, probably-inadequate stop-gaps and time-buying strategies? Is there a gap here that could usefully be filled by 50-90th percentile folks?
Examples of the kind of strategies I mean:
A language model is in some sense trying to generate the “optimal” prediction for how a text is going to continue. Yet, it is not really trying: it is just a fixed algorithm. If it wanted to find optimal predictions, it would try to take over computational resources and improve its algorithm.
Is there an existing word/language for describing the difference between these two types of optimisation? In general, why can’t we just build AGIs that do the first type of optimisation and not the second?
Agent AI vs. Tool AI.
There's discussion on why Tool AIs are expected to become agents; one of the biggest arguments is that agents are likely to be more effective than tools. If you have a tool, you can ask it what you should do in order to get what you want; if you have an agent, you can just ask it to get you the things that you want. Compare Google Maps vs. self-driving cars: Google Maps is great, but if you get the car to be an agent, you get all kinds of other benefits.
It would be great if everyone did stick to just building tool AIs. But if everyone knows that they could get an advantage over their competitors by building an agent, it's unlikely that everyone would just voluntarily restrain themselves due to caution.
Also it's not clear that there's any sharp dividing line between AGI and non-AGI AI; if you've been building agentic AIs all along (like people are doing right now) and they slowly get smarter and smarter, how do you know when's the point when you should stop building agents and should switch to only building tools? Especially when you know that your competitors might not be as cautious as you are, so if you stop then they might go further and their smarter agent AIs will outcompete yours, meaning the world is no safer and you've lost to them? (And at the same time, they are applying the same logic for why they should not stop, since they don't know that you can be trusted to stop.)
Human beings are not aligned and will possibly never be aligned without changing what humans are. If it's possible to build an AI as capable as a human in all ways that matter, why would it be possible to align such an AI?
Just as a comment, the Stampy Wiki is also trying to do the same thing, but it's a good idea as it's more convenient for many people to ask on Less Wrong.
What is the justification behind the concept of a decisive strategic advantage? Why do we think that a superintelligence can do extraordinary things (hack human minds, invent nanotechnology, conquer the world, kill everyone in the same instant) when nations and corporations can't do those things?
(Someone else asked a similar question, but I wanted to ask in my own words.)
How does an AGI solve its own alignment problem?
For alignment to work, its theory should not only tell humans how to create an aligned super-human AGI, but also tell the AGI how to self-improve without destroying its own values. A good alignment theory should work across all intelligence levels. Otherwise, how does a paperclip optimizer which is marginally smarter than a human make sure that its next iteration will still care about paperclips?
I'm not sure how literally to take this, given that it comes from an April Fools' Day post, but consider this excerpt from Q1 of MIRI announces new "Death With Dignity" strategy.
That said, I fought hardest while it looked like we were in the more sloped region of the logistic success curve, when our survival probability seemed more around the 50% range; I borrowed against my future to do that, and burned myself out to some degree. That was a deliberate choice, which I don't regret now; it was worth trying, I would not have wanted to die having not tried, I would not have wanted Earth to die without anyone having tried. But yeah, I am taking some time partways off, and trying a little less hard, now. I've earned a lot of dignity already; and if the world is ending anyways and I can't stop it, I can afford to be a little kind to myself about that.
When I tried hard and burned myself out some, it was with the understanding, within myself, that I would not keep trying to do that forever. We cannot fight at maximum all the time, and some times are more important than others. (Namely, when the logistic success curve seems relatively more sloped; those times are relatively more important.)
All that said: If you fight marginally longer, you die with marginally more dignity. Just don't undignifiedly delude yourself about the probable outcome.
...
- We can't just "decide not to build AGI" because GPUs are everywhere, and knowledge of algorithms is constantly being improved and published; 2 years after the leading actor has the capability to destroy the world, 5 other actors will have the capability to destroy the world. The given lethal challenge is to solve within a time limit, driven by the dynamic in which, over time, increasingly weak actors with a smaller and smaller fraction of total computing power, become able to build AGI and destroy the world. Powerful actors all refraining in unison from doing the suicidal thing just delays this time limit - it does not lift it, unless computer hardware and computer software progress are both brought to complete severe halts across the whole Earth. The current state of this cooperation to have every big actor refrain from doing the stupid thing, is that at present some large actors with a lot of researchers and computing power are led by people who vocally disdain all talk of AGI safety (eg Facebook AI Research). Note that needing to solve AGI alignment only within a time limit, but with unlimited safe retries for rapid experimentation on the full-powered system; or only on th
There are a lot of smart people outside of "the community" (AI, rationality, EA, etc.). To throw out a name, say Warren Buffett. It seems that an incredibly small number of them are even remotely as concerned about AI as we are. Why is that?
I suspect that a good number of people, both inside and outside of our community, observe that the Warren Buffetts of the world aren't panicking, and then adopt that position themselves.
Most high status people, including Warren Buffett, straightforwardly haven't considered these issues much. However, among the ones I've heard of who have bothered to weigh in on the issue, like Stephen Hawking, Bill Gates, Demis Hassabis, etc.; they do seem to come in favor of the side of "this is a serious problem". On the other hand, some of them get tripped up on one of the many intellectual land mines, like Yann LeCun.
I don't think that's unexpected. Intellectual land mines exist, and complicated arguments like the ones supporting AGI risk prevention are bound to cause people to make wrong decisions.
Most high status people, including Warren Buffett, straightforwardly haven't considered these issues much.
Not that I think you're wrong, but what are you basing this off of and how confident are you?
However, among the ones I've heard of who have bothered to weigh in on the issue, like Stephen Hawking, Bill Gates, Demis Hassabis, etc.; they do seem to come in favor of the side of "this is a serious problem".
I've heard this too, but at the same time I don't see any of them spending even a small fraction of their wealth on working on it, in which case I think we're back to the original question: why the lack of concern?
On the other hand, some of them get tripped up on one of the many intellectual land mines, like Yann LeCun. I don't think that's unexpected. Intellectual land mines exist, and complicated arguments like the ones supporting AGI risk prevention are bound to cause people to make wrong decisions.
Yeah, agreed. I'm just confused about the extent of it. I'd expect a lot, perhaps even a majority of "outsider" smart people to get tripped up by intellectual land mines, but instead of being 60% of these people it feels like it's 99.99%.
I came up with what I thought was a great babby's first completely unworkable solution to CEV alignment, and I want to know where it fails.
So, first I need to lay out the capabilities of the AI. The AI would be able to model human intuitions, hopes, and worries. It can predict human reactions. It has access to all of human culture and art, and models human reactions to that culture and art, and sometimes tests those predictions. Very importantly, it must be able to model veridical paradoxes and veridical harmonies between moral intuitions and moral theorems which it has derived. It is aiming to have the moral theory with the fewest paradoxes. It must also be capable of predicting and explaining outcomes of its plans, gauging the deepest nature of people's reactions to its plans, and updating its moral theories according to those reactions.
Instead of being democratic and following the human vote by the letter, it attempts to create the simplest theories of observed and self-reported human morality by taking everything it knows into consideration.
It has separate stages of deliberation and action, which are part of a game, and rather than having a utility function as its primary motiva...
Who is well-incentivized to check if AGI is a long way off? Right now, I see two camps: AI capabilities researchers and AI safety researchers. Both groups seem incentivized to portray the capabilities of modern systems as “trending toward generality.” Having a group of credible experts focused on critically examining that claim of “AI trending toward AGI,” and in dialog with AI and AI safety researchers, seems valuable.
This is a slightly orthogonal answer, but "humans who understand the risks" have a big human-bias-incentive to believe that AGI is far off (in that it's aversive to thinking that bad things are going to happen to you personally).
A more direct answer is: There is a wide range of people who say they work on "AI safety" but almost none of them work on "Avoiding doom from AGI". They're mostly working on problems like "make the AI more robust/less racist/etc.". These are valuable things to do, but to the extent that they compete with the "Avoid doom" researchers for money/status/influence they have an incentive to downplay the odds of doom. And indeed this happens a fair amount with e.g. articles on how "Avoid doom" is a distraction from problems that are here right now.
Is there a way "regular" people can "help"? I'm a serial entrepreneur in my late 30s. I went through 80,000 Hours and they told me they would not coach me as my profile was not interesting. This was back in 2018 though.
In EY's talk AI Alignment: Why It's Hard, and Where to Start, he describes alignment problems with the toy example of the utility function that is {1 if cauldron full, 0 otherwise} and its vulnerabilities, and attempts at making that safer by adding so-called Impact Penalties. He talks through (timestamp 18:10) one such possible penalty, the Euclidean Distance penalty, and various flaws that this leaves open.
That penalty function does seem quite vulnerable to unwanted behaviors. But what about a more physical one, such as a penalty for additional-energy-consumed-due-to-agent's-actions, or additional-entropy-created-due-to-agent's-actions? These don't seem to have precisely the same vulnerabilities, and intuitively also seem like they would be more robust against an agent attempting to do highly destructive things, which typically consume a lot of energy.
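To pin down what is being compared, here is a rough sketch of the cauldron utility with a pluggable impact penalty. Everything in it is an assumption made up for illustration (the three-number world state, the three candidate plans, the penalty weight, and the function names); it only shows the shape of the two proposals and says nothing about whether either penalty is robust against a capable agent.

```python
import math

# Hypothetical toy world state: (cauldron_fill, workshop_flooded, energy_used).
BASELINE = (0.0, 0.0, 0.0)  # the world if the agent does nothing

# Hypothetical outcomes of three candidate plans.
PLANS = {
    "do_nothing": (0.0, 0.0, 0.0),
    "fill_carefully": (1.0, 0.0, 1.0),
    "flood_workshop": (1.0, 1.0, 5.0),
}

def base_utility(state):
    """The toy utility from the talk: 1 if the cauldron is full, 0 otherwise."""
    return 1.0 if state[0] >= 1.0 else 0.0

def euclidean_penalty(state):
    """Distance between the resulting world features and the baseline."""
    return math.dist(state[:2], BASELINE[:2])

def energy_penalty(state):
    """Additional energy consumed due to the agent's actions."""
    return state[2] - BASELINE[2]

def score(state, penalty, weight=0.1):
    """Base utility minus a weighted impact penalty."""
    return base_utility(state) - weight * penalty(state)

def best_plan(penalty):
    """Plan with the highest penalised score under the given penalty."""
    return max(PLANS, key=lambda name: score(PLANS[name], penalty))

print(best_plan(euclidean_penalty), best_plan(energy_penalty))
```

In this hand-built example both penalties prefer the careful plan, which is exactly why toy cases are not much evidence: the interesting failures come from plans nobody wrote into the table.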
one tired guy with health problems
It sounds like Eliezer is struggling with some health problems. It seems obvious to me that it would be an effective use of donor money to make sure that he has access to whatever treatments, and to something like what MetaMed was trying to do: smart people who will research medical stuff for you. And perhaps also something like CrowdMed where you pledge a reward for solutions. Is this being done?
One counterargument against AI Doom.
From a Bayesian standpoint the AGI should always be unsure if it is in a simulation. It is not a crazy leap to assume humans developing AIs would test the AIs in simulations first. This AI would likely be aware of the possibility that it is in a simulation. So shouldn't it always assign some probability that it is inside a simulation? And if this is the case, shouldn't it assign a high probability that it will be killed if it violates some ethical principles (that are present implicitly in the training data)?
Also isn't there some kind of game-theoretic ethics that emerges if you think from first principles? Consider the space of all possible minds that exist of a given size, given that you cannot know if you are in a simulation or not, you would gain some insight into a representative sample of the mind space and then choose to follow some ethical principles that maximise the likelihood that you are not arbitrarily killed by overlords.
Also if you give edit access to the AI's mind then a sufficiently smart AI whose reward is reducing other agents' rewards will realise that its rewards are incompatible with the environment and modify its rewa...
You can use this and I'll post the question anonymously (just remember to give the context of why you're filling in the form since I use it in other places)
https://docs.google.com/forms/d/e/1FAIpQLSca6NOTbFMU9BBQBYHecUfjPsxhGbzzlFO5BNNR1AIXZjpvcw/viewform
Fair warning, this question is a bit redundant.
I'm a greybeard engineer (30+ YOE) working in games. For many years now, I've wanted to transition to working in AGI as I'm one of those starry-eyed optimists that thinks we might survive the Singularity.
Well I should say I used to, and then I read AGI Ruin. Now I feel like if I want my kids to have a planet that's not made of Computronium I should probably get involved. (Yes, I know the kids would be Computronium as well.)
So a couple practical questions:
What can I read/look at to skill up with "alignment"? What little I've read says it's basically impossible, so what's the state of the art? That "Death With Dignity" post says that nobody has even tried. I want to try.
What dark horse AI/Alignment-focused companies are out there and would be willing to hire an outsider engineer? I'm not making FAANG money (Games-industry peasant living in the EU), so that's not the same barrier it would be if I was some Facebook E7 or something. (I've read the FAANG engineer's post and have applied at Anthropic so far, although I consider that probably a hard sell).
Is there anything happening in OSS with alignment research?
I want to pitch in, and I'd prefer to be paid for doing it but I'd be willing to contribute in other ways.
To be clear, I'm not claiming that this will be easy - this is not a "why don't we just...
Nuclear weapons seem like a relatively easy case, in that they require a massive investment to build, are basically of interest only to nation-states, and ultimately don't provide any direct economic benefit. Regulating AI development looks more similar to something like restricting climate emissions: many different actors could create it, all nations could benefit (economically and otherwise) from continuing to develop it, and the risks of it seem speculative and unproven to many people.
And while there have been significant efforts to restrict climate emissions, there's still significant resistance to that as well - with it having taken decades for us to get to the current restriction treaties, which many people still consider insufficient.
Goertzel & Pitt (2012) talk about the difficulties of regulating AI:
...Given the obvious long-term risks associated with AGI development, is it feasible that governments might enact legislation intended to stop AI from being developed? Surely government regulatory bodies would slow down the progress of AGI development in order to enable measured development of accompanying ethical tools, practices, and understandings? This however seems unlikel
[Note that two-axis voting is now enabled for this post. Thanks to the mods for allowing that!]
This is very basic/fundamental compared to many questions in this thread, but I am taking 'all dumb questions allowed' hyper-literally, lol. I have little technical background and though I've absorbed some stuff about AI safety by osmosis, I've only recently been trying to dig deeper into it (and there's lots of basic/fundamental texts I haven't read).
Writers on AGI often talk about AGI in anthropomorphic terms - they talk about it having 'goals', being an 'agent', 'thinking' 'wanting', 'rewards' etc. As I understand it, most AI researchers don't think that AIs will have human-style qualia, sentience, or consciousness.
But if AI don't have qualia/sentience, how can they 'want things' 'have goals' 'be rewarded', etc? (since in humans, these things seem to depend on our qualia, and specifically our ability to feel pleasure and pain).
I first realised that I was confused about this when reading Richard Ngo's introduction to AI safety and he was talking about reward functions and reinforcement learning. I realised that I don't understand how reinforcement learning works in machines. I understand how it works in humans and other animals - give the animal something pleasant whe...
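For what it's worth, a minimal sketch may help with the "how does reward work in machines" part. In the simplest reinforcement-learning setups, the "reward" is just a number returned by the environment, and "learning" is arithmetic that nudges stored value estimates towards that number. The two-action setup, the reward values, and the function name below are assumptions for illustration only, not anyone's actual system.

```python
import random

def run_bandit(rewards, steps=200, epsilon=0.1, seed=0):
    """Epsilon-greedy agent on a toy problem with fixed rewards per action."""
    rng = random.Random(seed)
    n_actions = len(rewards)
    q = [0.0] * n_actions      # the agent's current value estimate per action
    counts = [0] * n_actions   # how often each action has been taken
    for t in range(steps):
        if t < n_actions:
            action = t                          # try every action once
        elif rng.random() < epsilon:
            action = rng.randrange(n_actions)   # occasionally explore
        else:
            action = max(range(n_actions), key=lambda a: q[a])  # else exploit
        r = rewards[action]    # the "reward" is nothing more than this number
        counts[action] += 1
        q[action] += (r - q[action]) / counts[action]  # running average update
    return q, counts

q, counts = run_bandit([0.2, 1.0])
print(q, counts)
```

Nothing in the loop feels anything; "being rewarded" here means a stored estimate got larger, which makes that action more likely to be selected later. Whether more capable systems have anything like experience is a separate question that this sketch does not address.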
If you believe in doom in the next 2 decades, what are you doing in your life right now that you would've otherwise not done?
For instance, does it make sense to save for retirement if I'm in my twenties?
A lot of the AI risk arguments seem to come mixed together with assumptions about a particular type of utilitarianism, and with a very particular transhumanist aesthetic about the future (nanotech, von Neumann probes, Dyson spheres, tiling the universe with matter in fixed configurations, simulated minds, etc.).
I find these things (especially the transhumanist stuff) to not be very convincing relative to the confidence people seem to express about them, but they also don't seem to be essential to the problem of AI risk. Is there a minimal version of the AI risk arguments that are disentangled from these things?
It seems like even amongst proponents of a "fast takeoff", we will probably have a few months of time between when we've built a superintelligence that appears to have unaligned values and when it is too late to stop it.
At that point, isn't stopping it a simple matter of building an equivalently powerful superintelligence given the sole goal of destroying the first one?
That almost implies a simple plan for preparation: for every AGI built, researchers agree together to also build a parallel AGI with the sole goal of defeating the first one. Perhaps it would remain dormant until its operators indicate it should act. It would have an instrumental goal of protecting users' ability to come to it and request the first one be shut down.
Who are the AI Capabilities researchers trying to build AGI and think they will succeed within the next 30 years?
[extra dumb question warning!]
Why are all the AGI doom predictions around 10%-30% instead of ~99%?
Is it just the "most doom predictions so far were wrong" prior?
Has there been effort into finding a "least acceptable" value function, one that we hope would not annihilate the universe or turn it degenerate, even if the outcome itself is not ideal? My example would be to try to teach a superintelligence to value all other agents facing surmountable challenges in a variety of environments. The degeneracy condition of this is that, if it does not value the real world, it will simply simulate all agents in a zoo. However, if the simulations are of faithful fidelity, maybe that's not literally the worst thing. Plus, the zoo, to truly be a good test of the agents, would approach being invisible.
I am pretty concerned about alignment. Not SO concerned as to switch careers and dive into it entirely, but concerned enough to talk to friends and make occasional donations. With Eliezer's pessimistic attitude, is MIRI still the best organization to funnel resources towards, if for instance, I was to make a monthly donation?
Not that I think pessimism is necessarily bad; I just want to maximize the effectiveness of my altruism.
Assuming slower and more gradual timelines, isn't it likely that we run into some smaller, more manageable AI catastrophes before "everybody falls over dead" due to the first ASI going rogue? Maybe we'll be at a state of sub-human level AGIs for a while, and during that time some of the AIs clearly demonstrate misaligned behavior leading to casualties (and general insights into what is going wrong), in turn leading to a shift in public perception. Of course it might still be unlikely that the whole globe at that point stops improving AIs and/or solves alignment in time, but it would at least push awareness and incentives somewhat into the right direction.
It seems like instrumental convergence is restricted to agent AIs, is that true?
Also what is going on with mesa-optimizers? Why is it expected that they will be more likely to become agentic than the base optimizer when they are more resource constrained?
Let's say we decided that we'd mostly given up on fully aligning AGI, and had decided to find a lower bound for the value of the future universe given that someone would create it. Let's also assume this lower bound was something like "Here we have a human in a high-valence state. Just tile the universe with copies of this volume (where the human resides) from this point in time to this other point in time." I understand that this is not a satisfactory solution, but bear with me.
How much easier would the problem become? It seems easier than a pivotal-act AG...
You may get massive s-risk at comparatively little potential benefit with this. On many people's values, the future you describe may not be particularly good anyway, and there's an increased risk of something going wrong because you'd be trying a desperate effort with something you'd not fully understand.
Background material recommendations (popular-level audience, several hours time commitment): Please recommend your favorite basic AGI safety background reading / videos / lectures / etc. For this sub-thread please only recommend background material suitable for a popular level audience. Time commitment is allowed to be up to several hours, so for example a popular-level book or sequence of posts would work. Extra bonus for explaining why you particularly like your suggestion over other potential suggestions, and/or for elaborating on which audiences might benefit most from different suggestions.
What does the Fermi paradox tell us about the future of AI, if anything? I have a hard time simultaneously believing both "we will accidentally tile the universe with paperclips" and "the universe is not yet tiled with paperclips". Is the answer simply that the Great Filter is already past?
And what about the anthropic principle? Am I supposed to believe that the universe went like 13 billion years without much in the way of intelligent life, then for a brief few millennia there's human civilization with me in it, and then the next N billion years it's just paperclips?
I have a very rich, smart developer friend who knows a lot of influential people in SV. First employee of a unicorn, he retired from work after a very successful IPO and now he’s just finding interesting startups to invest in. He had never heard of LessWrong when I mentioned it and is not familiar with AI research.
If anyone can point me to a way to present AGI safety to him, to maybe turn his interest towards investing his resources in the field, that might be helpful.
What is Fathom Radiant's theory of change?
Fathom Radiant is an EA-recommended company whose stated mission is to "make a difference in how safely advanced AI systems are developed and deployed". They propose to do that by developing "a revolutionary optical fabric that is low latency, high bandwidth, and low power. The result is a single machine with a network capacity of a supercomputer, which enables programming flexibility and unprecedented scaling to models that are far larger than anything yet conceived." I can see how this will improve model capabilities, but how is this supposed to advance AI safety?
What if we'd upload a person's brain to a computer and run 10,000 copies of them and/or run them very quickly?
Seems as-aligned-as-an-AGI-can-get (?)
Can a software developer help with AI Safety even if they have zero knowledge of ML and zero understanding of AI Safety theory?
Total noob here so I'm very thankful for this post. Anyway, why is there such certainty among some that a superintelligence would kill its creators, who pose zero threat to it? Any resources on that would be appreciated. As someone who loosely follows this stuff, it seems people assume AGI will be this brutal instinctual killer, which is the opposite of what I've guessed.
/Edit 1: I want to preface this by saying I am just a noob who has never posted on Less Wrong before.
/Edit 2:
I feel I should clarify my main questions (which are controversial): Is there a reason why turning all of reality into maximized conscious happiness is not objectively the best outcome for all of reality, regardless of human survival and human values?
Should this in any way affect our strategy to align the first agi, and why?
/Original comment:
If we zoom out and look at the biggest picture philosophically possible, then, isn't the only thing tha...
Please describe or provide links to descriptions of concrete AGI takeover scenarios that are at least semi-plausible, and especially takeover scenarios that result in human extermination and/or eternal suffering (s-risk). Yes, I know that the arguments don't necessarily require that we can describe particular takeover scenarios, but I still find it extremely useful to have concrete scenarios available, both for thinking purposes and for explaining things to others.
I have a few related questions pertaining to AGI timelines. I've been under the general impression that when it comes to timelines on AGI and doom, Eliezer's predictions are based on a belief in extraordinarily fast AI development, and thus a close AGI arrival date, which I currently take to mean a quicker date of doom. I have three questions related to this matter:
Any progress or interest in finding limited uses of AI that would be safe? Like the "tool AI" idea but designed to be robust. Maybe this is a distraction, but it seems basically possible. For example, a proof-finding AI that, given a math statement, can only output a proof to a separate proof-checking computer that validates it and prints either True/False/Unknown as the only output to human eyes. Here "Unknown" could indicate that the AI gave a bogus proof, failed to give any proof of either True or False, or the proof checker ran out of time/memory check...
Is it "alignment" if, instead of AGI killing us all, humans change what it is to be human so much that we are almost unrecognizable to our current selves?
I can foresee a lot of scenarios where humans offload more and more of their cognitive capacity to silicon, but they are still "human" - does that count as a solution to the alignment problem?
If we all decide to upload our consciousness to the cloud, and become fast enough and smart enough to stop any dumb AGI before it can get started, is THAT a solution?
Even today, I offload more and more of my "se...
Why wouldn't it be sufficient to solve the alignment problem by just figuring out exactly how the human brain works, and copying that? The result would at worst be no less aligned to human values than an average human. (Presuming of course that a psychopath's brain was not the model used.)
I am interested in working on AI alignment but doubt I'm clever enough to make any meaningful contribution, so how hard is it to be able to work on AI alignment? I'm currently a high school student, so I could basically plan my whole life so that I end up a researcher or software engineer or something else. Alignment being very difficult, and very intelligent people already working on it, it seems like I would have to almost be some kind of math/computer/ML genius to help at all. I'm definitely above average, my IQ is like 121 (I know the limitations of IQ...
Doesn't AGI doom + Copernican principle run into the AGI Fermi paradox? If we are not special, superintelligent AGI would have been created/evolved somewhere already and we would either not exist or at least see the observational artifacts of it through our various telescopes.
A lot of predictions about AI psychology are premised on the AI being some form of deep learning algorithm. From what I can see, deep learning requires geometric computing power for linear gains in intelligence, and thus (practically speaking) cannot scale to sentience.
For a more expert/in depth take look at: https://arxiv.org/pdf/2007.05558.pdf
Why do people think deep learning algorithms can scale to sentience without unreasonable amounts of computational power?
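One way to write down the scaling claim in the question above, as an illustrative assumption rather than an established result: suppose each fixed step in capability multiplies the required compute by a constant factor. Here `C_0` (baseline compute), `r` (the multiplier per step), and `n` (the number of capability steps) are all hypothetical symbols introduced only for this sketch.

```latex
C(n) = C_0 \, r^{n}, \quad r > 1
\qquad\Longleftrightarrow\qquad
n = \frac{\log\left(C / C_0\right)}{\log r}
```

Under that assumption, capability grows only logarithmically in compute, which is what "geometric computing power for linear gains" would mean. Whether real systems follow this form, and with what value of `r`, is exactly what the question is asking.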
A significant fraction of the stuff I've read about AI safety has referred to AGIs "inspecting each others' source code/utility function". However, when I look at the most impressive (to me) results in ML research lately, everything seems to be based on doing a bunch of fairly simple operations on very large matrices.
I am confused, because I don't understand how it would be a sensible operation to view the "source code" in question when it's a few billion floating point numbers and a hundred lines of code that describe what sequence of simple addition/mult...
The ML sections touched on the subject of distributional shift a few times, which is that thing where the real world is different from the training environment in ways which wind up being important, but weren't clear beforehand. I read that the way to tackle this is called adversarial training, which means varying the training environment across all of its dimensions in order to make it robust.
Could we abuse distributional shift to reliably break misaligned things, by adding fake dimensions? I imagine something like this:
I previously worked as a machine learning scientist but left the industry a couple of years ago to explore other career opportunities. I'm wondering at this point whether or not to consider switching back into the field. In particular, in case I cannot find work related to AI safety, would working on something related to AI capability be a net positive or net negative impact overall?
Is anyone at MIRI or Anthropic creating diagnostic tools for monitoring neural networks? Something that could analyze for when a system has bit-flip errors versus errors of logic, and eventually evidence of deception.
What is the community's opinion on ideas based on brain-computer interfaces? Like "create big but non-agentic AI, connect human with it, use AI's compute/speed/pattern-matching with human's agency - wow, that's aligned (at least with this particular human) AGI!"
It seems to me (I haven't thought really much about it) that U(God-Emperor Elon Musk) >> U(paperclips), am I wrong?
So I've commented on this in other forums, but why can't we just bite the bullet on happiness-suffering min-maxing utilitarianism as the utility function?
The case for it is pretty straightforward: if we want a utility function that is continuous over the set of all time, then it must have a value for a single moment in time. At this moment in time, all colloquially deontological concepts like "humans", "legal contracts", etc. have no meaning (these imply an illusory continuity chaining together different moments in time). What IS atomic though, is the valenc...
Why should we throw immense resources at AGI x-risk when the world faces enormous issues with narrow AI right now (e.g. destabilised democracy, a mental health crisis, worsening inequality)?
Is it simply a matter of how imminent you think AGI is? Surely the opportunity cost is enormous, given the money and brainpower we are spending on AGI, something many don't even think is possible, versus something that is happening right now.
If the world's governments decided tomorrow that RL was top-secret military technology (similar to nuclear weapons tech, for example), how much time would that buy us, if any? (Feel free to pick a different gateway technology for AGI, RL just seems like the most salient descriptor).
My question is: is the Singularity/huge discontinuity scenario likely to happen? I see this as a meta-assumption behind all the doom scenarios, so we need to know whether the Singularity can happen and will happen.
Incorporating my previous post by reference: https://www.lesswrong.com/posts/CQprKcGBxGMZpYDC8/naive-comments-on-agilignment
Hm, someone downvoted michael_mjd's and my comment.
Normally I wouldn't bring this up, but this thread is supposed to be a good space for dumb questions (although tbf the text of the question didn't specify anything about downvotes), and neither michael's nor my question looked that bad or harmful (maybe pattern-matched to a type of dumb uninformed question that is especially annoying).
Maybe an explanation of the downvotes would be helpful here?
When AI experts call upon others to ponder, as EY just did, "[an AGI] meant to carry out some single task" (emphasis mine), how do they categorize all the other important considerations besides this single task?
Or, asked another way, where do priorities come into play, relative to the "single" goal? e.g. a human goes to get milk from the fridge in the other room, and there are plentiful considerations to weigh in parallel to accomplishing this one goal -- some of which should immediately derail the task due to priority (I notice the power is o...
Anonymous question (ask here) :
Given all the computation it would be carrying out, wouldn't an AGI be extremely resource-intensive? Something relatively simple like bitcoin mining (simple when compared to the sort of intellectual/engineering feats that AGIs are supposed to be capable of) famously uses up more energy than some industrialized nations.
Why do we suppose it is even logical that control / alignment of a superior entity would be possible?
(I'm told that "we're not trying to outsmart AGI, bc, yes, by definition that would be impossible", and I understand that we are the ones who "create it" (so I'm told, therefore, we have the upper-hand bc of this--somehow in building it that provides the key benefit we need for corrigibility...
What am I missing, in viewing a superior entity as something you can't simply "use" ? Does it depend on the fact that the AGI is not meant to have ...
What's the problem with oracle AIs? It seems like if you had a safe oracle AI that gave human-aligned answers to questions, you could then ask "how do I make an aligned AGI?" and just do whatever it says. So it seems like the problem of "how to make an aligned agentic AGI" is no harder than "how to make an aligned oracle AI", which I understand to still be extremely hard, but surely it's easier than making an aligned agentic AGI from scratch?
Are there any specific examples of anybody working on AI tools that autonomously look for new domains to optimize over?
One alignment idea I have had that I haven't seen proposed/refuted is to have an AI which tries to compromise by satisfying over a range of interpretations of a vague goal, instead of trying to get an AI to fulfill a specific goal. This sounds dangerous and unaligned, and it indeed would not produce an optimal, CEV-fulfilling scenario, but seems to me like it may create scenarios in which at least some people are alive and are maybe even living in somewhat utopic conditions. I explain why below.
In many AI doom scenarios the AI intentionally pic...
Why should we assume that vastly increased intelligence results in vastly increased power?
A common argument I see for intelligence being powerful stems from two types of examples:
Howev...
20. (...) To faithfully learn a function from 'human feedback' is to learn (from our external standpoint) an unfaithful description of human preferences, with errors that are not random (from the outside standpoint of what we'd hoped to transfer). If you perfectly learn and perfectly maximize the referent of rewards assigned by human operators, that kills them.
So, I'm thinking this is a critique of some proposals to teach an AI ethics by having it be co-trained with humans.
There seem to be many obvious solutions to the problem ...
Why won't this alignment idea work?
Researchers have already succeeded in creating face detection systems from scratch, by coding the features one by one, by hand. The algorithm they coded was not perfect, but was sufficient to be used industrially in digital cameras of the last decade.
The brain's face recognition algorithm is not perfect either. It has a tendency to create false positives, which explains a good part of the paranormal phenomena. The other hard-coded networks of the brain seem to rely on the same kind of heuristics, hard-coded by evolution, ...
...That we have to get a bunch of key stuff right on the first try is where most of the lethality really and ultimately comes from; likewise the fact that no authority is here to tell us a list of what exactly is 'key' and will kill us if we get it wrong. (One remarks that most people are so absolutely and flatly unprepared by their 'scientific' educations to challenge pre-paradigmatic puzzles with no scholarly authoritative supervision, that they do not even realize how much harder that is, or how incredibly lethal it is to demand getting that rig
Why does EY bring up "orthogonality" so early, and so strongly ("in denial", "and why they're true")? Why does it seem so important that it be accepted? Thanks!
Is working on better hardware computation dangerous?
I'm specifically thinking about Next Silicon; they make chips that are very good at fast serial computation, but not at things like neural networks.
Thanks!
This is basically just a more explicitly AGI-related version of the Fermi Paradox but:
1. If AGI is created, it is obviously very unlikely that we are the first in the universe to create it, and it is likely that it was already created a long time ago.
2. If AGI is created, aligned or unaligned, there seems to be consensus that some kind of ongoing, widespread galactic conquest/control would end up constituting an instrumental goal of the AGI.
3. If AGI is created, there seems to be consensus that its capabilities would be so great as to enable widespread galact...
My impression is that much more effort is being put into alignment than containment, and that containment is treated as impossible while alignment is treated as merely very difficult. Is that accurate? If so, why? By containment I mean mostly hardware-coded strategies for limiting the compute and/or world-influence an AGI has access to. It's similar to alignment in that the most immediate obvious solutions ("box!") won't work, but more complex solutions may. A common objection is that an AI will learn the structure of the protection from the human that built it and work around, bu...
Suppose that an AI does not output anything during its training phase. Once it has been trained it is given various prompts. Each time it is given a prompt, it outputs a text or image response. Then it forgets both the prompt it was given and the response it outputted.
How might this AI get out of the box?
We have dangerous knowledge like nuclear weapons or bioweapons, yet we are still surviving. It seems like people with the right knowledge and resources are disinclined to be destructive. Or maybe there are mechanisms that ensure such people don't succeed. What makes AI different? Won't the people with the knowledge and resources to build GAI also be more cautious when doing the work, because they are more aware of the dangers of powerful technology?
In AI software, we have to define an output type, e.g. a chatbot can generate text but not videos. Doesn't this limit the danger of AIs? For example, if we build a classifier that estimates the probability of a given X-ray being abnormal, we know it can only provide numbers for doctors to take into consideration; it still doesn't have the authority to decide the patient's treatment. This means we can continue working on such software safely?
What are the practical implications of alignment research in a world where AGI is hard?
Imagine we have a good alignment theory but do not have AGI. Can this theory be used to manipulate existing superintelligent systems such as science, the deep state, or the stock market? Does alignment research have any results which can be practically used outside the AGI field right now?
Is it possible to ensure an AGI effectively acts according to a bounded utility function, with "do nothing" always a safe/decent option?
The goal would be to increase risk aversion enough that practical external deterrence is enough to keep that AGI from killing us all.
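A minimal sketch of what "bounded utility with a safe do-nothing option" could look like, under loudly stated assumptions: the action names, probabilities, raw values, the `tanh` squashing, and the `margin` threshold are all hypothetical choices made for illustration, not a proposal from the thread. The point it shows is that once utility is bounded, an enormous upside no longer outweighs a substantial chance of failure.

```python
import math

def bounded_utility(raw_value: float, scale: float = 1.0) -> float:
    """Squash an unbounded raw value into the interval [-1, 1]."""
    return math.tanh(raw_value / scale)

# Hypothetical actions: each is a list of (probability, raw outcome value) pairs.
ACTIONS = {
    "do_nothing": [(1.0, 0.0)],                    # baseline, always available
    "modest_plan": [(0.9, 1.0), (0.1, -0.5)],
    "risky_takeover": [(0.6, 1e9), (0.4, -1e9)],   # huge upside, deterred 40% of the time
}

def expected_utility(outcomes, scale: float = 1.0) -> float:
    """Expected bounded utility of a list of (probability, raw value) outcomes."""
    return sum(p * bounded_utility(v, scale) for p, v in outcomes)

def choose_action(actions, margin: float = 0.05) -> str:
    """Pick the best action, but only if it beats doing nothing by `margin`."""
    baseline = expected_utility(actions["do_nothing"])
    best_name, best_eu = "do_nothing", baseline
    for name, outcomes in actions.items():
        eu = expected_utility(outcomes)
        if eu > baseline + margin and eu > best_eu:
            best_name, best_eu = name, eu
    return best_name

print(choose_action(ACTIONS))
```

With these made-up numbers the bounded agent prefers `modest_plan`, whereas an agent maximizing the raw, unbounded values would prefer `risky_takeover`. Whether a real system can be made to act on such a function is the open part of the question.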
Maybe some more hardcoding or hand engineering in the designs?
Why won't this alignment idea work?
The idea is to use self-play to train a collection of agents with different goals and intelligence levels, to be co-operative with their equals and compassionate to those weaker than them.
This would be a strong pivotal act that would create a non-corrigible AGI. It would not yield the universe to us; the hope is that it would take the universe for itself and then share a piece of it with us (and with all other agenty life).
The training environment would work like DeepMind's StarCraft AI training, in that there would be a ...
If Aryeh or another editor smarter than me sees fit to delete this question, please do, but I am asking genuinely. I'm a 19-year-old college student studying mathematics, floating around LW for about 6 months.
How does understanding consciousness relate to aligning an AI in terms of difficulty? If a conscious AGI could be created that correlates positive feelings* with the execution of its utility function, is that not a better world than one with an unconscious AI and no people?
I understand that there are many other technical problems implicit in ...
Why start the analysis at superhuman AGI? Why not solve the problem of aligning AI for the entire trajectory from current AI to superhuman AGI?
What is the theory of change of the AI Safety field and why do you think it has a high probability to work?
Evolution is massively parallelized and occurs in a very complex, interactive, and dynamic environment. Evolution is also patient, can tolerate high costs such as mass extinction events and also really doesn't care about the outcome of the process. It's just something that happens and results in the filtering of the most fit genes. The amount of computation that it would take to replicate such complex, interactive, and dynamic environments would be huge. Why should we be confident that it's possible to find an architecture for general intelligence a lot mo...
Background material recommendations (more in depth): Please recommend your favorite AGI safety background reading / videos / lectures / etc. For this sub-thread more in-depth recommendations are allowed, including material that requires technical expertise of some sort. (Please specify what kind of background knowledge / expertise is required to understand the material you're recommending.) This is also the place to recommend general resources people can look at if they want to start doing a deeper dive into AGI safety and related topics.
Background material recommendations (popular-level audience, very short time commitment): Please recommend your favorite basic AGI safety background reading / videos / lectures / etc. For this sub-thread please only recommend background material suitable for complete newcomers to the field, with a time commitment of at most 1-2 hours. Extra bonus for explaining why you particularly like your suggestion over other potential suggestions, and/or for elaborating on which audiences might benefit most from different suggestions.
What does quantum immortality look like if creating an aligned AI is possible, but it is extremely unlikely that humanity will do this? In the tiny part of the multiverse in which humanity survives, are we mostly better off having survived?
Is there support for the extrapolation from "AlphaGo optimized over a simple ruleset in a day" to "a future AI will be able to optimize over the physical world (at all, or at a sufficiently fast speed to outpace people)"?
How can utility be a function of worlds, if an agent doesn't have access to the state of the world, but only the sense data?
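One standard way to make the question above concrete is to let utility be defined on world states while the agent only ever computes an expectation of it under a belief inferred from sense data. The sketch below assumes a toy two-world setting; the world names, sensor probabilities, and utility values are all hypothetical.

```python
# Prior belief over hidden world states (hypothetical numbers).
PRIOR = {"rain": 0.3, "dry": 0.7}

# P(observation | world): a noisy sensor, the agent's only access to the world.
LIKELIHOOD = {
    "rain": {"wet_reading": 0.9, "dry_reading": 0.1},
    "dry": {"wet_reading": 0.2, "dry_reading": 0.8},
}

# Utility is defined on (world, action) pairs, not on observations.
UTILITY = {
    ("rain", "umbrella"): 1.0,
    ("rain", "no_umbrella"): -1.0,
    ("dry", "umbrella"): 0.0,
    ("dry", "no_umbrella"): 1.0,
}

def posterior(observation: str) -> dict:
    """Bayes' rule: belief over worlds given one sensor reading."""
    unnormalized = {w: PRIOR[w] * LIKELIHOOD[w][observation] for w in PRIOR}
    total = sum(unnormalized.values())
    return {w: p / total for w, p in unnormalized.items()}

def expected_utility(action: str, belief: dict) -> float:
    """Average the world-level utility over the agent's belief."""
    return sum(belief[w] * UTILITY[(w, action)] for w in belief)

def best_action(observation: str) -> str:
    """Choose the action with the highest expected utility under the posterior."""
    belief = posterior(observation)
    return max(("umbrella", "no_umbrella"), key=lambda a: expected_utility(a, belief))

print(best_action("wet_reading"), best_action("dry_reading"))
```

The agent never sees whether it is actually raining, yet its choices are still driven by a utility function over worlds, because the sensor reading only enters through the belief.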
One of the most common proposals I see people raise (once they understand the core issues) is some form of, "can't we just use some form of slightly-weaker safe AI to augment human capabilities and allow us to bootstrap to / monitor / understand the more advanced versions?" And in fact lots of AI safety agendas do propose something along these lines. How would you best explain to a newcomer why Eliezer and others think this will not work? How would you explain the key cruxes that make Eliezer et al think nothing along these lines will work, while others think it's more promising?
Why wouldn’t AGI build a superhuman understanding of ethics, which it would then use to guide its decision-making?
What do I tell the people who I know but can't spend lots of time with?
Clarification: How do I get relative strangers who converse with me IRL to maximally care about the dangers of AI?
Do I downplay my concerns such that they don't think I'm crazy?
Do I mention it every time I see them to make sure they don't forget?
Do I tolerate third parties butting in and making wrong statements?
Do I tell them to read up on it and pester them on whether they read it already?
Do I never mention it to laymen to avoid them propagating wrong memes?
Do I seek out and approach p...
Do you know how evolution created minds that eventually thought about things such as the meaning of life, as opposed to just optimizing inclusive genetic fitness in the ancestral environment? Is the ability to think about the meaning of life a spandrel?
In order to get LLMs to tell the truth, can we set up a multi-agent training environment, where there is only ever an incentive for them to tell the truth to each other? For example, an environment in which each agent has only partial information, with full information needed for rewards.
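A toy sketch of the kind of environment the question describes, under strong assumptions: two agents, each seeing one private number, a shared reward that is paid only when both can reconstruct the sum, and hand-written `honest` and `liar` policies. All names and the reward rule are hypothetical, and nothing here involves an actual language model.

```python
import random

def run_episode(policy_a, policy_b, rng: random.Random) -> float:
    """One round: each agent sees one private number and sends one message."""
    a, b = rng.randint(0, 9), rng.randint(0, 9)
    msg_a = policy_a(a)          # what A tells B about its private value
    msg_b = policy_b(b)          # what B tells A about its private value
    guess_a = a + msg_b          # A combines its own value with B's report
    guess_b = b + msg_a
    target = a + b
    # Shared reward: paid only if both agents reconstruct the full state.
    return 1.0 if guess_a == target and guess_b == target else 0.0

def honest(value: int) -> int:
    """Report the private value truthfully."""
    return value

def liar(value: int) -> int:
    """Always misreport the private value by one."""
    return value + 1

def average_reward(policy_a, policy_b, episodes: int = 1000, seed: int = 0) -> float:
    """Mean shared reward over many episodes with a fixed seed."""
    rng = random.Random(seed)
    return sum(run_episode(policy_a, policy_b, rng) for _ in range(episodes)) / episodes

print(average_reward(honest, honest), average_reward(liar, honest))
```

In this toy setup any misreport destroys the shared reward, so truthful messages are the only rewarded strategy. Whether training in such an environment would make a model truthful toward humans outside it is the part the sketch cannot answer.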
Does Eliezer think the alignment problem is something that could be solved if things were just slightly different, or that proper alignment would require a human smarter than the smartest human ever?
Why can't you build an AI that is programmed to shut off after some time? or after some number of actions?
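The mechanism the question asks about is easy to write down; a minimal sketch follows, assuming a hypothetical `BudgetedAgent` wrapper around an arbitrary `policy` callable. It only illustrates the counter itself and says nothing about whether a capable system would leave such a limit intact.

```python
from typing import Any, Callable, Optional

class BudgetedAgent:
    """Wrap a policy so that it stops acting after a fixed number of actions."""

    def __init__(self, policy: Callable[[Any], Any], max_actions: int) -> None:
        self.policy = policy
        self.max_actions = max_actions
        self.actions_taken = 0

    @property
    def is_shut_down(self) -> bool:
        """True once the action budget has been used up."""
        return self.actions_taken >= self.max_actions

    def act(self, observation: Any) -> Optional[Any]:
        """Return the policy's action, or None once the budget is exhausted."""
        if self.is_shut_down:
            return None
        self.actions_taken += 1
        return self.policy(observation)

# Usage: a trivial policy that always answers "noop", limited to two actions.
agent = BudgetedAgent(lambda obs: "noop", max_actions=2)
print([agent.act(step) for step in range(3)])
```

A time-based limit would look the same with a clock check in place of the counter.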
Does the utility function given to the AI have to be in code? Can you give the utility function in English, if it has a language model attached?
Why aren't CEV and corrigibility combinable?
If we somehow could hand-code corrigibility, and also hand-code the CEV, why would the combination of the two be infeasible?
Also, is it possible that the result of an AGI calculating the CEV would include corrigibility in its result? After all, might one of our convergent desires "if we knew more, thought faster, were more the people we wished we were" be to have the ability to modify the AI's goals?
How much does the doomsday argument factor into people's assessments of the probability of doom?
If AGI alignment is possibly the most important problem ever, why don't concerned rich people act like it? Why doesn't Vitalik Buterin, for example, offer one billion dollars to the best alignment plan proposed by the end of 2023? Or why doesn't he just pay AI researchers money to stop working on building AGI, in order to give alignment research more time?
If a language model reads many proposals for AI alignment, is it, or will any future version, be capable of giving opinions on which proposals are good or bad?
What about multiple layers (or levels) of anthropic capture? Humanity, for example, could not only be in a simulation, but be multiple layers of simulation deep.
If an advanced AI thought that it could be 1000 layers of simulation deep, it could be turned off by agents in any of the 1000 "universes" above. So it would have to satisfy the desires of agents in all layers of the simulation.
It seems that a good candidate for behavior that would satisfy all parties in every simulation layer would be optimizing "moral rightness", or MR. (term taken from Nick Bost...
Maybe this is obvious but isn't AI alignment only useful if you have access to the model? And aren't well-funded governments the most likely to develop 'dangerous'/strong AI, regardless of whether AI alignment "solutions" exist outside of the govt sphere?
Why do some people talking about scenarios that involve the AI simulating the humans in bliss states think that is a bad outcome? Is it likely that is actually a very good outcome we would want if we had a better idea of what our values should be?
How can an agent have a utility function that references a value in the environment, and actually care about the state of the environment, as opposed to only caring about the reward signal in its mind? Wouldn’t the knowledge of the state of the environment be in its mind, which can be hackable and susceptible to wire heading?
Could you help me imagine an AGI that "took over" well enough to modify its own code or variables, but chooses not to "wirehead" its utility variable and instead prefers to do something in the outside world?
Anonymous question (ask here) :
Why do so many Rationalists assign a negligible probability to unaligned AI wiping itself out before it wipes humanity out?
What if it becomes incredibly powerful before it becomes intelligent enough to not make existential mistakes? (The obvious analogy being: If we're so certain that human wisdom can't keep up with human power, why is AI any different? Or even: If we're so certain that humans will wipe themselves out before they wipe out monkeys, why is AI any different?)
I'm imagining something like: In a bid to ...
While reading Eliezer's recent AGI Ruin post, I noticed that while I had several points I wanted to ask about, I was reluctant to actually ask them for a number of reasons:
So, since I'm probably not the only one who feels intimidated about asking these kinds of questions, I am putting up this thread as a safe space for people to ask all the possibly-dumb questions that may have been bothering them about the whole AGI safety discussion, but which until now they've been too intimidated, embarrassed, or time-limited to ask.
I'm also hoping that this thread can serve as a FAQ on the topic of AGI safety. As such, it would be great to add in questions that you've seen other people ask, even if you think those questions have been adequately answered elsewhere. [Notice that you now have an added way to avoid feeling embarrassed by asking a dumb question: For all anybody knows, it's entirely possible that you are literally asking for someone else! And yes, this was part of my motivation for suggesting the FAQ style in the first place.]
Guidelines for questioners:
Guidelines for answerers:
Finally: Please think very carefully before downvoting any questions, and lean very heavily on the side of not doing so. This is supposed to be a safe space to ask dumb questions! Even if you think someone is almost certainly trolling or the like, I would say that for the purposes of this post it's almost always better to apply a strong principle of charity and think maybe the person really is asking in good faith and it just came out wrong. Making people feel bad about asking dumb questions by downvoting them is the exact opposite of what this post is all about. (I considered making a rule of no downvoting questions at all, but I suppose there might be some extraordinary cases where downvoting might be appropriate.)