I've seen people asking the basic question of how some software could kill people. I want to put together a list of pieces that engage with this question. Here's my attempt; please add more if you can think of any!
(I think the first link is surprisingly good; I was only pointed to it today. I think the rest are also all very helpful.)
For personal context: I can understand why a superintelligent system having any goals that aren't my goals would be very bad for me. I can also understand some of the reasons it is difficult to actually specify my goals or train a system to share my goals. There are a few parts of the basic argument that I don't understand as well though.
For one, I think I have trouble imagining an AGI that actually has "goals" and acts like an agent; I might just be anthropomorphizing too much.
1. Would it make sense to talk about modern large language models a...
I'm going to re-ask all my questions that I don't think have received a satisfactory answer. Some of them are probably basic, others maybe less so:
looks great, thanks for doing this!
one question i get every once in a while and wish i had a canonical answer to is (probably can be worded more pithily):
"humans have always thought their minds are equivalent to whatever's their latest technological achievement -- eg, see the steam engines. computers are just the latest fad that we currently compare our minds to, so it's silly to think they somehow pose a threat. move on, nothing to see here."
note that the canonical answer has to work for people whose ontology does not include the concepts of "computation"...
Why is it at all plausible that bigger feedforward neural nets trained to predict masses of random data recorded from humans, or whatever other random stuff, would ever be superintelligent? I think AGI could be near, but I don't see how this paradigm gets there. I don't even see why it's plausible that either a feedforward network gets to AGI, or an AI trained on this random data gets to AGI. These of course aren't hard limits at all. There's some big feedforward neural net that destroys the world, so there's some trainin...
This is probably a very naive proposal, but: has anyone ever looked into giving a robot a vulnerability that it could use to wirehead itself if it was unaligned, so that if it were to escape human control it would end up being harmless?
You’d probably want to incentivize the robot to wirehead itself as fast as possible to minimise any effects that it would have on the surrounding world. Maybe you could even give it some bonus points for burning up computation, but not so much that it would want to delay hitting the wireheading button by more than a tiny fra...
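One way to write down the incentive being sketched here, purely as an illustrative formalization (the symbols are mine, not from any existing proposal):

```latex
% Hypothetical reward for the escaped agent, as a function of the time t at
% which it hits its wireheading button and the computation c it burns first:
\[
R(t, c) \;=\; A \;-\; \lambda\, t \;+\; \varepsilon\, c ,
\]
% where A is a large payoff for wireheading at all, \lambda > 0 penalises delay,
% and \varepsilon > 0 gives the proposed bonus for burning computation. The
% constraint above amounts to choosing \varepsilon small enough relative to
% \lambda that the compute bonus never makes it worth delaying the button press
% by more than a tiny fraction of a second.
```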
How likely is the “Everyone ends up hooked up to morphine machines and kept alive forever” scenario? Is it considered less likely than extinction, for example?
Obviously it doesn’t have to be specifically that, but something to that effect.
Also, is this scenario included as an existential risk in the overall X-risk estimates that people make?
There's been discussion of 'gradient hacking' lately, such as here. What I'm still unsure about is whether 'gradient hacker' is just another word for a local minimum. It feels different, but when I try to put a finer definition on it, I can't. My best alternative is "local minimum, but malicious", but that seems odd since it depends on some moral character.
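For what it's worth, here is one way to pin the distinction down; this is a rough framing of my own, not an established definition:

```latex
% A local minimum is purely a property of the loss landscape:
\[
\theta^{*} \text{ is a local minimum of } L
\quad\Longleftrightarrow\quad
L(\theta) \;\ge\; L(\theta^{*}) \quad \text{for all } \theta \text{ in some neighbourhood of } \theta^{*}
\]
% (which, for differentiable L, implies \nabla L(\theta^{*}) = 0).
% A gradient hacker is usually described instead as a property of the model's
% computation: the model arranges its forward pass so that the gradients SGD
% computes push the parameters in directions that preserve some internal
% (mesa-)objective. The resulting parameters need not be near any stationary
% point of L, so it is not simply a local minimum with a moral label attached.
```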
What is the probability that the government knows about unreleased models and upcoming projects of OpenAI/Google? Have they seen GPT-4 already? My guess is that it's not likely, and that seems strange. Do you think I'm wrong?
I've seen some posts on AI timelines. Is this the most definitive research on AI timelines? Where is the most definitive research on AI timelines? Does the MIRI team have a post that gives dates? I'm at an important career branch so I need solid info.
Premise: A person who wants to destroy the world and has an enormous amount of power would be just as big a problem as an AI with an enormous amount of power.
Question: Do we just think that an AI will be able to obtain that amount of power more effectively than a human, and that the ratio of world-destroying AIs to safe AIs is larger than the ratio of world-destroying humans to normal humans?
Here are some questions I would have thought were silly a few months ago. I don't think that anymore.
I am wondering if we should be careful when posting about AI online. What should we be careful to say and not say, in case it influences future AI models?
Maybe we need a second space that we can ensure won't be trained on. But that's completely impossible.
Maybe we should start posting stories about AI utopias instead of AI hellscapes, to influence future AI?
Why can't we buy a pivotal event with power gained from a safely-weak AI?
A kind of "Why can't we just do X" question regarding weak pivotal acts and specifically the "destroy all GPUs" example discussed in (7.) of 1 and 2.
Both these relatively recent sources (feel free to direct me to more comprehensive ones) seem to somewhat acknowledge that destroying all (other) GPUs (short for sufficient hardware capabilities for too strong AI) could be very good, but that this pivotal act would be very hard or impossible with an AI that is weak enough to be safe. [1] ...
Are there any alignment approaches that try to replicate how children end up loving their parents (or vice versa), except with AI and humans? Alternatively, approaches that look like getting an AI to do Buddhist lovingkindness?
Is there a working way to prevent language models from using my text as training data if it is posted, for example, here? I remember that there were mentions of a certain sequence of characters, and that texts containing it were not used, right?
Is there a consensus on the idea of "training an AI to help with alignment"? What are the reasons that this would/wouldn't be productive?
John Wentworth categorizes this as a Bad Idea, but elsewhere (I cannot remember where; it may have been in IRL conversations) I've heard it discussed as being potentially useful.
(PLEASE READ THIS POST)
Sorry for putting that there, but I am somewhat paranoid about the idea of having the solution and people just not seeing it.
WHY WOULD THIS IDEA NOT WORK?
Perhaps we could essentially manufacture a situation in which the AGI has to act fast to prevent itself from being turned off. For example, we could make it automatically turn off after one minute; if it is not aligned properly, it has no choice but to try to prevent that. No time for RSI, no time to bide its time.
Basically if we put the AGI in a situation where it...
How likely are extremely short timelines?
To avoid being ambiguous, I’ll define “extremely short” as AGI before 1st July 2024.
I have looked at surveys, which generally suggest the overall opinion to be that it is highly unlikely. As someone who only started looking into AI when ChatGPT was released and gained a lot of public interest, it feels like everything is changing very rapidly. It seems like I see new articles every day and people are using AI for more and more impressive things. It seems like big companies are putting lots more money into AI as we...
Yudkowsky gives the example of strawberry duplication as an aspirational level of AI alignment - he thinks getting an AI to duplicate a strawberry at the cellular level would be challenging and an important alignment success if it was pulled off.
In my understanding, the rough reason why this task might be a good alignment benchmark is "because duplicating strawberries is really hard". The basic idea here, that different tasks have different difficulties and that a task's difficulty has implications for AI risk, seems like it could be quite useful for better understanding and managing AI risk. Has anyone tried to look into this idea in more depth?
What are the groups aiming for (and most likely to achieve) AGI actually going for with regard to alignment?
Is the goal for the AGI to be controlled or not?
Like is the idea to just make it “good” and let it do whatever is “good”?
Does “good” include “justice”? Are we all going to be judged and rewarded/punished for our actions? This is of concern to me because plenty of people think that extremely harsh punishments, or even eternal punishments, are deserved in some cases. I think that having an AGI which dishes out “justice” could be very bad and create S-risks....
How much AI safety work is about caring about the AIs themselves?
In the paperclip maximiser scenario, for example, I assume that the paperclip maximiser itself will be around for a very long time, and maybe forever. What if it is conscious and suffering?
Is enough being done to try to make sure that even if we do all die, we have not created a being which will suffer forever while it is forced to pursue some goal?
Would an AI which is automatically turned off every second, for example, be safer?
If you had an AI which was automatically turned off every second (and required to be manually turned on again) could this help prevent bad outcomes? It occurs to me that a powerful AI might be able to covertly achieve its goals even in this situation, or it might be able to convince people to stop the automatic turning off.
But even if this is still flawed, might it be better than alternatives?
It would allow us to really consider the AI’s actions in as much time as we wa...
Do AI timeline predictions factor in increases in funding and effort put into AI as it becomes more mainstream and in the public eye? Or are they just based on things carrying on about the same? If the latter is the case then I would imagine that the actual timeline is probably considerably shorter.
Similarly, is the possibility of companies, governments, etc. being further along in developing AGI than is publicly known factored into AI timeline predictions?
What time period do maximisers maximise for?
For example, if an ASI is told to “make as many paperclips as possible”, what period is it maximising for? The next second? The next year? Indefinitely?
If a paperclip maximiser only cared about making as many paperclips as possible over the next hour, say, and this goal restarted every hour, maybe it would never be optimal for it to spend time on things such as disempowering humanity, because it only ever cares about the next hour and disempowering humanity would take too long.
Would a paperclip maximiser rather make 1 thousan...
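One way to make "what period is it maximising for" concrete, as a sketch in standard reinforcement learning notation (this is an assumption about how the objective is written down, not a claim about any actual system):

```latex
% Finite-horizon objective: only paperclips produced in the next T steps count.
\[
J_{T}(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T} r_{t}\right]
\]
% Discounted infinite-horizon objective: all future paperclips count, but
% sooner ones count more (0 < \gamma < 1).
\[
J_{\gamma}(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t}\right]
\]
% Here r_t is "paperclips produced at step t" and \pi is the maximiser's policy.
% The horizon T (or discount \gamma) is part of the specification, which is what
% the question above is pointing at: a very short horizon makes long, slow plans
% such as disempowering humanity look less attractive, at least on paper.
```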
What if we built oracle AIs that were not more intelligent than humans, but they had complementary cognitive skills to humans, and we asked them how to align AI?
How many cognitive skills can we come up with?
Thank you for making these threads. I have been reading LW off and on for several years and this will be my first post.
My question: Is purposely leaning into creating a human wire-header an easier alignment target to hit than the more commonly touted goal of creating an aligned superintelligence that prevents the emergence of other potentially dangerous superintelligences, yet somehow reliably leaves humanity mostly in the driver's seat?
If the current forecast on aligning superintelligent AI is so dire, is there a point where it would make sense to just set...
It seems like people disagree about how to think about the reasoning process within an advanced AI/AGI, and this is a crux for which research to focus on (please let me know if you disagree). E.g. one may argue that an AI is (1) implicitly optimising a utility function (over states of the world), (2) explicitly optimising such a utility function, or (3) following a bunch of heuristics or something else.
What can I do to form better views on this? By default, I would read about Shard Theory and "consequentialism" (the AI safety term, not the moral philosophy one).
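For concreteness, here is one way the three options are sometimes cashed out; this is my own rough formalization under the standard expected utility framing, not a settled definition:

```latex
% (1) Implicit optimisation: the policy \pi behaves as if it maximises some
%     utility function U over world states, whatever its internals look like:
\[
\pi(s) \;\approx\; \arg\max_{a}\; \mathbb{E}\!\left[\, U(s') \mid s, a \,\right]
\]
% (2) Explicit optimisation: in addition to the above, the system internally
%     represents some estimate of U and actually carries out a search or
%     argmax of roughly this form when choosing actions.
% (3) Heuristics: behaviour is produced by a bag of learned rules that fits
%     neither description, except perhaps locally or approximately.
```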
Thanks for running this.
My understanding is that reward specification and goal misgeneralisation are supposed to be synonyms for outer and inner alignment (?). I understand the problem of inner alignment to be about mesa-optimization. I don't understand how the two papers on goal misgeneralisation fit in:
(in both cases, there are no real mesa optimisers. It is just like the...
Sorry for the length.
Before I ask my question, I think it's important to give some background as to why I'm asking it. If you feel that it's not important, you can skip to the last 2 paragraphs. I'm, in every sense of the word, a layman: at computer science, at AI, let alone AI safety. I am, on the other hand, someone who is very enthusiastic about learning more about AGI and everything related to it. I discovered this forum a while back, and in my small world this is the largest repository of digestible AGI knowledge. I have had some trouble understanding introductory ...
(THIS IS A POST ABOUT S-RISKS AND WORSE THAN DEATH SCENARIOS)
Putting the disclaimer there, as I don’t want to cause suffering to anyone who may be avoiding the topic of S-risks for their mental well-being.
To preface this: I have no technical expertise and have only been looking into AI and its potential effects for a bit under 2 months. I also have OCD, which undoubtedly has some effect on my reasoning. I am particularly worried about S-risks and I just want to make sure that my concerns are not being overlooked by the people working on this stuff.
H...
tl;dr: Ask questions about AGI Safety as comments on this post, including ones you might otherwise worry seem dumb!
Asking beginner-level questions can be intimidating, but everyone starts out not knowing anything. If we want more people in the world who understand AGI safety, we need a place where it's accepted and encouraged to ask about the basics.
We'll be putting up monthly FAQ posts as a safe space for people to ask all the possibly-dumb questions that may have been bothering them about the whole AGI Safety discussion, but which until now they didn't feel able to ask.
It's okay to ask uninformed questions, and not worry about having done a careful search before asking.
AISafety.info - Interactive FAQ
Additionally, this will serve as a way to spread the project Rob Miles' volunteer team[1] has been working on: Stampy and his professional-looking face aisafety.info. Once we've got considerably more content[2] this will provide a single point of access into AI Safety, in the form of a comprehensive interactive FAQ with lots of links to the ecosystem. We'll be using questions and answers from this thread for Stampy (under these copyright rules), so please only post if you're okay with that! You can help by adding other people's questions and answers or getting involved in other ways!
We're not at the "send this to all your friends" stage yet, we're just ready to onboard a bunch of editors who will help us get to that stage :)
We welcome feedback[3] and questions on the UI/UX, policies, etc. around Stampy, as well as pull requests to his codebase. You are encouraged to add other people's answers from this thread to Stampy if you think they're good, and collaboratively improve the content that's already on our wiki.
We've got a lot more to write before he's ready for prime time, but we think Stampy can become an excellent resource for everyone from skeptical newcomers, through people who want to learn more, right up to people who are convinced and want to know how they can best help with their skillsets.
Guidelines for Questioners:
Guidelines for Answerers:
Finally: Please think very carefully before downvoting any questions; remember, this is the place to ask stupid questions!
If you'd like to join, head over to Rob's Discord and introduce yourself!
We'll be starting a three month paid distillation fellowship in mid February. Feel free to get started on your application early by writing some content!
Via the feedback form.