The approach I often take here is to ask the person how they would persuade an amateur chess player who believes they can beat Magnus Carlsen because they've discovered a particularly good opening with which they've won every amateur game they've tried it in so far.
Them: Magnus Carlsen will still beat you, with near certainty
Me: But what is he going to do? This opening is unbeatable!
Them: He's much better at chess than you, he'll figure something out
Me: But what though? I can't think of any strategy that beats this
Them: I don't know, maybe he'll find a way to do <some chess thing X>
Me: If he does X I can just counter it by doing Y!
Them: Ok if X is that easily countered with Y then he won't do X, he'll do some Z that's like X but that you don't know how to counter
Me: Oh, but you conveniently can't tell me what this Z is
Them: Right! I'm not as good at chess as he is and neither are you. I can be confident he'll beat you even without knowing your opener. You cannot expect to win against someone who outclasses you.
If someone builds an AGI, it's likely that they want to actually use it for something and not just keep it in a box. So eventually it'll be given various physical resources to control (directly or indirectly), and then it might be difficult to just shut down. I discussed some possible pathways in Disjunctive Scenarios of Catastrophic AGI Risk; here are some excerpts:
...DSA/MSA Enabler: Power Gradually Shifting to AIs
The historical trend has been to automate everything that can be automated, both to reduce costs and because machines can do things better than humans can. Any kind of a business could potentially run better if it were run by a mind that had been custom-built for running the business—up to and including the replacement of all the workers with one or more such minds. An AI can think faster and smarter, deal with more information at once, and work for a unified purpose rather than have its efficiency weakened by the kinds of office politics that plague any large organization. Some estimates already suggest that half of the tasks that people are paid to do are susceptible to being automated using techniques from modern-day machine learning and robotics, even without post...
I have a question about bounded agents. Rob Miles' video explains a problem with bounded utility functions: namely, that the agent is still incentivized to maximize the probability that the bound is hit, and take extreme actions in pursuit of infinitesimal utility gains.
I agree, but my question is: in practice isn't this still at least a little bit less dangerous than the unbounded agent? An unbounded utility maximizer, given most goals I can think of, will probably accept a 1% chance of taking over the world because the payoff of turning the earth into stamps is so large. Whereas if the bounded utility maximizer is not quite omnipotent and is only chasing essentially tiny increases in its certainty, and finds that its best grand and complicated plan to take over the world is only ~99.9% likely to succeed, the plan may not be worth the extra ~1e-9 of expected utility.
It's also not clear that giving the bounded agent more firepower or making it more intelligent monotonically increases P(doom); maybe it comes up with a takeover plan that is >99.9% successful, but maybe its better reasoning abilities also allow it to increase its initial confidence that it has the correct number of st...
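To spell out the comparison with toy numbers (all of these are made up purely for illustration, not claims about any real system):

```python
# Toy expected-utility comparison for the bounded vs. unbounded intuition above.
# Every number here is invented for illustration.

# Unbounded maximizer: the takeover payoff dwarfs the safe payoff, so even a
# small success probability makes the gamble worthwhile.
safe_payoff = 1e6                      # utility from behaving normally
takeover_payoff = 1e30                 # utility from tiling the earth with stamps
p_takeover_works = 0.01
ev_takeover_unbounded = p_takeover_works * takeover_payoff   # = 1e28 >> 1e6

# Bounded maximizer: utility is capped at 1.0, and normal behavior already
# gets within 1e-9 of the cap. A risky plan only buys that last sliver,
# while failure (say, being shut down) forfeits almost everything.
u_normal = 1.0 - 1e-9
u_cap = 1.0
u_if_caught = 0.0
p_plan_works = 0.999
ev_takeover_bounded = p_plan_works * u_cap + (1 - p_plan_works) * u_if_caught  # = 0.999

print(ev_takeover_unbounded > safe_payoff)   # True: the unbounded agent gambles
print(ev_takeover_bounded > u_normal)        # False: the bounded agent stays put
```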
Thanks for doing this!
I was trying to work out how the alignment problem could be framed as a game design problem and I got stuck on this idea of rewards being of different 'types'. Like, when considering reward hacking, how would one hack the reward of reading a book or exploring a world in a video game? Is there such a thing as 'types' of reward in how reward functions are currently created? Or is it that I'm failing to introspect on reward types and they are essentially all the same pain/pleasure axis attached to different items?
That last explanation seems hard to reconcile with the huge difference in qualia between different motivational sources (like reading a book versus eating food versus hugging a friend... These are not all the same 'type' of good, are they?)
Sorry if my question is a little confused. I was trying to convey my thought process. The core question is really:
Is there any material on why 'types' of reward signals can or can't exist for AI and what that looks like?
Does anyone know what exactly DeepMind's CEO Demis Hassabis thinks about AGI safety? How seriously does he take it, and how much time does he spend on AGI safety research compared to AI capabilities research? What does he think is the probability that we will succeed and build a flourishing future?
In this LessWrong post there are several excerpts from Demis Hassabis:
Well to be honest with you I do think that is a very plausible end state–the optimistic one I painted you. And of course that's one reason I work on AI is because I hoped it would be like that. On the other hand, one of the biggest worries I have is what humans are going to do with AI technologies on the way to AGI. Like most technologies they could be used for good or bad and I think that's down to us as a society and governments to decide which direction they're going to go in.
And
...Potentially. I always imagine that as we got closer to the sort of gray zone that you were talking about earlier, the best thing to do might be to pause the pushing of the performance of these systems so that you can analyze down to minute detail exactly and maybe even prove things mathematically about the system so th
What stops a superintelligence from instantly wireheading itself?
A paperclip maximizer, for instance, might not need to turn the universe into paperclips if it can simply access its reward float and set it to the maximum. This is assuming that it has the intelligence and means to modify itself, and it probably still poses an existential risk because it would eliminate all humans to avoid being turned off.
The terrifying thing I imagine about this possibility is that it also answers the Fermi Paradox. A paperclip maximizer seems like it would be obvious in the universe, but an AI sitting quietly on a dead planet with its reward integer set to the max is far quieter and more terrifying.
Hello, I have a question. I hope someone with more knowledge can help me answer it.
There is evidence suggesting that building an AGI requires plenty of computational power (at least early on) and plenty of smart engineers/scientists. The companies with the most computational power are Google, Facebook, Microsoft and Amazon. These same companies also have some of the best engineers and scientists working for them. A recent paper by Yann LeCun titled A Path Towards Autonomous Machine Intelligence suggests that these companies have a vested interest in actual...
Is there a strong theoretical basis for guessing what capabilities superhuman intelligence may have, be it sooner or later? I'm aware of the speed & quality superintelligence frameworks, but I have issues with them.
Speed alone seems relatively weak as an axis of superiority; I can only speculate about what I might be able to accomplish if, for example, my cognition were sped up 1000x, but I find it hard to believe it would extend to achieving strategic dominance over all humanity, especially if there are still limits on my ability to act and perceive ...
I'm still not sure why exactly people (I'm thinking of a few in particular, but this applies to many in the field) tell very detailed stories of AI domination like "AI will use protein nanofactories to embed tiny robots in our bodies to destroy all of humanity at the press of a button." This seems like a classic use of the conjunction fallacy, and it doesn't seem like those people really flinch from the word "and" like the Sequences tell them they should.
Furthermore, it seems like people within AI alignment aren't taking the "sci-fi" criticism as seriously...
I don't think the point of the detailed stories is that they strongly expect that particular thing to happen? It's just useful to have a concrete possibility in mind.
Inspired by https://non-trivial.org, I logged in to ask if people thought a very-beginner-friendly course like that would be valuable for the alignment problem - then I saw Stampy. Is there room for both? Or maybe a recommended beginner path in Stampy styled similarly to non-trivial?
There's a lot of great work going on.
Why should we expect AGIs to optimize much more strongly and “widely” than humans? As far as I know a lot of AI risk is thought to come from “extreme optimization”, but I’m not sure why extreme optimization is the default outcome.
To illustrate: if you hire a human to solve a math problem, the human will probably mostly think about the math problem. They might consult google, or talk to some other humans. They will probably not hire other humans without consulting you first. They definitely won’t try to get brain surgery to become smarter, or kill everyone ...
This is an argument I don’t think I’ve seen made, or at least not made as strongly as it should be. So I will present it as starkly as possible. It is certainly a basic one.
The question I am asking is: is the conclusion below correct, that alignment is fundamentally impossible for any AI built by current methods? And by contraposition, that alignment is only achievable, if at all, for an AI built by deliberate construction? GOFAI never got very far, but that only shows that its practitioners never found the right ideas.
The argument:
A trained ML is an uninterpreted pile o...
Rice's theorem says that there's no algorithm for proving nontrivial facts about arbitrary programs, but it does not say that no nontrivial fact can be proven about a particular program. It also does not say that you can't reason probabilistically/heuristically about arbitrary programs in lieu of formal proofs. It just says that for any algorithm purporting to decide a given nontrivial fact about all possible programs, it's possible to construct a program that breaks it.
(And if it turns out we can't formally prove something about a neural net (like alignment), that of course doesn't mean the negative is definitely true; it could be that we can't prove alignment for a program that nevertheless happens to be aligned.)
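A concrete toy example of the distinction (my own illustration, not anyone's safety proposal):

```python
# Rice's theorem rules out a *general* decider for nontrivial semantic
# properties, but nothing stops us from proving such a property for one
# *particular* program by inspecting it.

def always_zero(x: int) -> int:
    # No branches, no loops, no recursion: it ignores its input entirely.
    return 0

# For this specific program, "returns 0 on every input" follows by inspection
# of its one-line body. Rice's theorem only says that no single algorithm can
# decide that property correctly for every possible program you feed it.
assert all(always_zero(x) == 0 for x in range(1000))  # spot-check, not the proof
```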
See also the feedback form for some specific questions we're keen to hear answers to.
Can anyone point me to a write-up steelmanning the OpenAI safety strategy; or, alternatively, offer your take on it? To my knowledge, there's no official post on this, but has anyone written an informal one?
Essentially what I'm looking for is something like an expanded/OpenAI version of AXRP ep 16 with Geoffrey Irving in which he lays out the case for DM's recent work on LM alignment. The closest thing I know of is AXRP ep 6 with Beth Barnes.
These monthly threads and Stampy sound like they'll be great resources for learning about alignment research.
I'd like to know about as many resources as possible for supporting and guiding my own alignment research self-study process. (And by resources, I guess I don't just mean more stuff to read; I mean organizations or individuals you can talk to for guidance on how to move forward in one's self-education).
Could someone provide a link to a page that attempts to gather links to all such resources in one place?
I already saw the Stampy answer ...
I have not a shred of doubt that something smarter than us can kill us all easily should it choose to. Humans are ridiculously easy to kill; a few well-placed words and they'll even kill each other. I also have no doubt that keeping something smarter than you confined is a doomed idea. What I am not convinced of is that that something smarter will try to eradicate humans. I am not arguing against the orthogonality thesis here, but against the point that "AGI will have a single-minded utility function and to achieve its goal it will destroy humanity in...
Some possible paths to creating aligned AGI involve designing systems with certain cognitive properties, like corrigibility or myopia. We currently don't know how to create sufficiently advanced minds with those particular properties. Do we know how to choose any cognitive properties at all, or do known techniques unavoidably converge on "utility maximizer that has properties implied by near-optimality plus other idiosyncratic properties we can't choose" in the limit of capability? Is there a list of properties we do know how to manipulate?
Some example...
Q4 Time scale
In order to claim that we need to worry about AGI Alignment today, you need to prove that the time scale of development will be short. Common sense tells us that humans will be able to deal with whatever software we can create: 1) We create some software (e.g. self-driving cars, nuclear power plant software). 2) People accidentally die (or have other "bad outcomes"). 3) Humans, governments, people in general will "course correct".
So you have to prove (or convince) that an AGI will develop, gain control of its own resources, and then be able to act on the world in a very short period of time. I haven't seen a convincing argument for that.
Q3 Technology scale
I would love to read more about how software can emulate a human brain. The human brain is an analog system down to the molecular level. The brain is a giant soup with a delicate balance of neurotransmitters and neuropeptides. There are thousands of different kinds of neurons in the brain, and each one acts a little differently. As a programmer, I cannot imagine how to faithfully model something like that directly. Digital computers seem completely inadequate. I would guess you'd have more luck wiring together 1000 monkey brains.
I'm not very familiar with the AI safety canon.
I've been pondering a view of alignment in the frame of intelligence ratios -- humans with capability n can produce aligned agents with capability m, where m ≤ k·n for some k[1], and alignment techniques might increase k.
Has this already been discussed somewhere, and would it be worth spending time to think this out and write it down?
Or maybe some other function of n is more useful?
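To make the recursion in that framing explicit (this is just my sketch of the idea, using the notation above):

```latex
% Sketch: if agents at capability C_0 (humans) can build aligned agents at
% capability at most k*C_0, and those agents can do the same in turn, then
% after m rounds of delegation
\[
  C_{m+1} \le k\, C_m \quad\Longrightarrow\quad C_m \le k^m C_0 .
\]
% With k > 1 the aligned line of agents can bootstrap arbitrarily far beyond
% human level; with k < 1 it stays capped near C_0, so alignment techniques
% matter exactly insofar as they push k upward.
```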
Do we know how to train act-based agents? Is the only obstacle competitiveness, similarly to how Tool AI wants to be Agent AI?
Two related questions to get a sense of scale of the social problem. (I'm interested in any precise operationalization, as obviously the questions are underspecified.)
Does Gödel's incompleteness theorem apply to AGI safety?
I understand his theorem is one of the most wildly misinterpreted in mathematics because it technically only applies to first order predicate logic, but there's something about it that has always left me unsettled.
As far as I know, this form of logic is the best tool we've developed to really know things with certainty. I'm not aware of better alternatives (senses frequently are misleading, subjective knowledge is not falsifiable, etc). This has left me with the perspective that with the best to...
So, something I am now wondering is: Why don’t Complexity of Value and Fragility of Value make alignment obviously impossible?
Maybe I’m misunderstanding the two theories, but don’t they very basically boil down to “Human values are too complex to program”? Because that just seems like something that’s objectively correct. Like, trying to do exactly that seems like attempting to “solve” ethics which looks pretty blatantly futile to me.
I (hopefully) suspect that I have the exact shape of the issue wrong, and that (most) people aren’t actually literally tryin...
What are the most comprehensive arguments for paths to superintelligence?
My list (please tell me if there is a more comprehensive argument for a certain path or if there is a path that I missed).
I'd be interested to hear thoughts on this argument for optimism that I've never seen anybody address: if we create a superintelligent AI (which will, by instrumental convergence, want to take over the world), it might rush, for fear of competition. If it waits a month, some other superintelligent AI might get developed and take over / destroy the world; so, unless there's a quick safe way for the AI to determine that it's not in a race, it might need to shoot from the hip, which might give its plans a significant chance of failure / getting caught?
Counter...
I have a few questions about corrigibility. First, I will tentatively define corrigibility as creating an agent that is willing to let humans shut it off or change its goals without manipulating humans. I have seen that corrigibility can lead to VNM-incoherence (i.e. an agent can be Dutch-booked / money-pumped). Has this result been proven in general?
Also, what is the current state of corrigibility research? If the above incoherence result turns out to be correct and corrigibility leads to incoherence, are there any other tractable theoretical directions we...
Why is AGI safety all about making AI safe for humans, with not a word about keeping AI safe from humans?
Hi! I am new to the AGI Safety topic and aware of almost no approaches for resolving it. But I am not exactly new to deep learning, and I find the identifiability of deep learning models interesting: for example, papers like "Advances in Identifiability of Nonlinear Probabilistic Models" by Ilyes Khemakhem or "On Linear Identifiability of Learned Representations". Does anyone know if there is some direction of AGI Safety research that relates to identifiability? It seems intuitively related to me, but maybe it is not.
In older texts on AI alignment, there seems to be quite a lot of discussion on how to learn human values, like here:
https://ai-alignment.com/the-easy-goal-inference-problem-is-still-hard-fad030e0a876
My impression is that nowadays, the alignment problem seems more focused on something which I would describe as "teach the AI to follow any goal at all", as if the goal we should align the AI with doesn't matter as much from a research perspective.
Could someone provide some insights into the reasons for this? Or are my impressions wrong and I hallucinated the shift?
I think learning is likely to be a hard problem in general (for example, the "learning with rounding problem" is the basis of some cryptographic schemes). I am much less sure whether learning the properties of the physical or social worlds is hard, but I think there's a good chance it is. If an individual AI cannot exceed human capabilities by much (e.g., we can get an AGI as brilliant as John von Neumann but not much more intelligent), is it still dangerous?
Why wouldn't the agent just change the line of code containing its loss function? Surely that's easier to do than world domination.
Q5 Scenarios
I have different thoughts about different doomsday scenarios. I can think of two general categories, but maybe there are more.
A) "Build us a better bomb." - The AGI is locked in service to a human organization who uses it's superpowers to dominate and control the rest of the world. In this scenario the AGI is essentially a munitions that may appear in the piucture without warning (which takes us back to the time scale concern). This doesn't require the AGI to become self-sufficient. Presumably lesser AIs would also be ca...
Q2 Agency
I also have a question about agency. Let's say Bob invents an AGI in his garage one day. It even gets smarter the more it runs. When Bob goes to sleep at night he turns the computer off and his AI stops getting smarter. It doesn't control its own power switch, and it's not managing Bob's subnet for him. It doesn't have internet access. I guess in a doomsday scenario Bob would have to have programmed in "root access" for his ever more intelligent software? Then it can eventually modify the operating system tha...
Q1 Definitions
Who decides what kind of software gets called AI? Forget about AGI, just talking about the term AI. What about code in a game that decides where the monsters should move and attack? We call that AI. What about a program that plays Go well enough to beat a master? What about a program that plays checkers? What about a chicken that's trained so that it can't lose at tic-tac-toe? Which of those is AI? The only answer I can think of is that AI is when a program acts in ways that seem like only a person sh...
Lesswrong has a [trove of thought experiments](https://www.lesswrong.com/posts/PcfHSSAMNFMgdqFyB/can-you-control-the-past) about scenarios where arguably the best way to maximize your utility is to verifiably (with some probability) modify your own utility function, starting with the prisoner's dilemma and extending to games with superintelligences predicting what you will do and putting money in boxes etc.
These thought experiments seem to have real world reflections: for example, voting is pretty much irrational under CDT, but paradoxically the outcomes o...
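For anyone who hasn't seen the base case, here is the prisoner's dilemma version of the point in miniature (standard textbook payoffs, my own toy sketch rather than anything from the linked post):

```python
# A minimal prisoner's dilemma sketch: holding the other player's move fixed,
# defecting always looks better, yet mutual defection is worse for both than
# the outcome two verifiably-bound cooperators would reach.

PAYOFFS = {  # (my move, their move) -> (my payoff, their payoff)
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

def best_reply(their_move: str) -> str:
    # Causal-decision-theory reasoning: treat their move as fixed and optimize mine.
    return max("CD", key=lambda m: PAYOFFS[(m, their_move)][0])

assert best_reply("C") == "D" and best_reply("D") == "D"
# So two CDT agents land on (D, D) with payoff (1, 1), even though agents that
# can verifiably bind themselves to "cooperate iff the other is bound the same
# way" reach (C, C) = (3, 3).
print(PAYOFFS[("D", "D")], "<", PAYOFFS[("C", "C")])
```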
tl;dr: Ask questions about AGI Safety as comments on this post, including ones you might otherwise worry seem dumb!
Asking beginner-level questions can be intimidating, but everyone starts out not knowing anything. If we want more people in the world who understand AGI safety, we need a place where it's accepted and encouraged to ask about the basics.
As requested in the previous thread[1], we'll be putting up monthly FAQ posts as a safe space for people to ask all the possibly-dumb questions that may have been bothering them about the whole AGI Safety discussion, but which until now they didn't feel able to ask.
It's okay to ask uninformed questions, and not worry about having done a careful search before asking.
Stampy's Interactive AGI Safety FAQ
Additionally, this will serve as a soft-launch of the project Rob Miles' volunteer team[2] has been working on: Stampy - which will be (once we've got considerably more content) a single point of access into AGI Safety, in the form of a comprehensive interactive FAQ with lots of links to the ecosystem. We'll be using questions and answers from this thread for Stampy (under these copyright rules), so please only post if you're okay with that! You can help by adding other people's questions and answers to Stampy or getting involved in other ways!
We're not at the "send this to all your friends" stage yet; we're just ready to onboard a bunch of editors who will help us get to that stage :)
We welcome feedback[3] and questions on the UI/UX, policies, etc. around Stampy, as well as pull requests to his codebase.[4] You are encouraged to add other people's answers from this thread to Stampy if you think they're good, and collaboratively improve the content that's already on our wiki.
We've got a lot more to write before he's ready for prime time, but we think Stampy can become an excellent resource for everyone from skeptical newcomers, through people who want to learn more, right up to people who are convinced and want to know how they can best help with their skillsets.
Guidelines for Questioners:
Guidelines for Answerers:
Finally: Please think very carefully before downvoting any questions, remember this is the place to ask stupid questions!
I'm re-using content from Aryeh Englander's thread with permission.
If you'd like to join, head over to Rob's Discord and introduce yourself!
Either via the feedback form or in the feedback thread on this post.
Stampy is a he, we asked him.