Why do people keep saying we should maximize log(odds) instead of odds? Isn't each 1% of survival equally valuable?
How do doomy AGI safety researchers and enthusiasts find joy while always maintaining the framing that the world is probably doomed?
Does anyone know if grokking occurs in deep RL? That is, you train the system on the same episodes for a long while, and after reaching max performance its policy eventually generalizes?
What does "solving alignment" actually look like? How will we know when it's been solved (if that's even possible)?
Is it true that current image systems like Stable Diffusion are non-optimizers? How should that change our reasoning about how likely it is that systems become optimizers? How much of a crux is "optimizeriness" for people?
Isn't the risk coming from insufficient AGI alignment relatively small compared to the risk described by the Vulnerable World Hypothesis? I would expect that even without the invention of AGI, or with aligned AGI, it is still possible for us to use some more advanced AI techniques as research assistants that help us invent some kind of smaller/cheaper/easier-to-use atomic bomb that would destroy the world anyway. Essentially, the question is: why is there so much focus on AGI alignment instead of a general slowing down of technological progress?
I think this seems quite underexplored. The fact that it is hard to slow down progress doesn't mean it isn't necessary or that this option shouldn't be researched more.
Last month I asked "What are some of the most time-efficient ways to get a ton of accurate info about AI safety policy, via the internet?" and got some surprisingly good answers, namely this Twitter thread by Jack Clark and also links to resources from GovAI and FHI's AI governance group.
I'm wondering if, this time, there are even more time-efficient ways to get a ton of accurate info about AI safety policy.
I've seen Eliezer Yudkowsky claim that we don't need to worry about s-risks from AI, since the Alignment Problem would need to be something like 95% solved in order for s-risks to crop up in a worryingly-large number of a TAI's failure modes: a threshold he thinks we are nowhere near crossing. If this is true, it seems to carry the troubling implication that alignment research could be net-negative, conditional on how difficult it will be for us to conquer that remaining 5% of the Alignment Problem in the time we have.
So is there any work being done on fig...
Why wouldn't a tool/oracle AGI be safe?
Edit: the question I should have asked was "Why would a tool/oracle AGI be a catastrophic risk to mankind?" because obviously people could use an oracle in a dangerous way (and if the oracle is a superintelligence, a human could use it to create a catastrophe, e.g. by asking "how can a biological weapon be built that spreads quickly and undetectably and will kill all women?" and "how can I make this weapon at home while minimizing costs?")
Is there a comprehensive list of AI Safety orgs/personas and what exactly they do? Is there one for capabilities orgs with their stance on safety?
I think I saw something like that, but can't find it.
Are there any features that mainstream programming languages don't have but would help with AI Safety research if they were added?
Are there any school-textbook-style texts about AI Safety? If no, what texts are closest to this and would it be useful if school-textbook style materials existed?
Recently, AGISF revised its syllabus and moved Risks from Learned Optimization to a recommended reading, replacing it with Goal Misgeneralization. I think this change is wrong, but I don't know why they did it, and Chesterton's Fence applies.
Does anybody have any ideas for why they did this?
Are Goal Misgeneralization and Inner-Misalignment describing the same phenomenon?
What's the best existing critique of Risks from Learned Optimization? (besides it being probably worse than Rob Miles pedagogically, IMO)
Why wouldn't a solution to Eliciting Latent Knowledge (ELK) help with solving deceptive alignment as well? Isn't the answer to whether the model is being deceptive part of its latent knowledge?
If ELK is solved in the worst case, how much more work needs to be done to solve the alignment problem as a whole?
I've seen the term "AI Explainability" floating around in the mainstream ML community. Is there a major difference between that and what we in the AI Safety community call Interpretability?
Is there any way to contribute to AI safety if you are not willing to move to the US/UK or do this for free?
Excluding a superintelligent AGI that has qualia and can form its own (possibly perverse) goals, why wouldn’t we be able to stop any paperclip maximizer whatsoever by the simple addition of the stipulation “and do so without killing or harming or even jeopardizing a single living human being on earth” to its specified goal? Wouldn’t this stipulation trivially force the paperclip maximizer not to turn humans into paperclips either directly or indirectly? There is no goal or subgoal (or is there?) that with the addition of that stipulation is dangerous to hu...
Currently, we don't know how to make a smarter-than-us AGI obey any particular set of instructions whatsoever, at least not robustly in novel circumstances once it is deployed and beyond our ability to recall/delete. (This is because of the Inner Alignment Problem.) So we can't just type in stipulations like that. If we could, we'd be a lot closer to safety. Probably we'd have a lot more stipulations than just that. Speculating wildly, we might want to try something like: "For now, the only thing you should do is be a faithful imitation of Paul Christiano but thinking 100x faster." Then we can ask our new Paul-mimic to think for a while and then come up with better ideas for how to instruct new versions of the AI.
What are the standard doomy "lol no" responses to "Any AGI will have a smart enough decision theory to not destroy the intelligence that created it (i.e. us), because we're only willing to build AGI that won't kill us"?
(I suppose it isn't necessary to give a strong reason why acausality will show up in AGI decision theory, but one good one is that it has to be smart enough to cooperate with itself.)
Some responses that I can think of (but I can also counter, with varying success):
A. Humanity is racing to build an AGI anyway, this "decision" is not really eno...
How much of AI alignment and safety has been informed at all by economics?
Part of the background to my question relates to the paperclip maximizer story. I could be poorly understanding the problem suggested by that issue (have not read the original) but to me it largely screams economic system failure.
In the case of an AI/tech singularity, wouldn't open sources like Less Wrong be the main source of counter-strategy for an AGI? In that case, the AGI could develop in a way that doesn't trigger the thresholds for what is considered worrying, and could gradually mitigate its own existential risks.
Are AGIs with bad epistemics more or less dangerous? (By "bad epistemics" I mean a tendency to believe things that aren't true, and a tendency to fail to learn true things, due to faulty and/or limited reasoning processes... or to update too much / too little / incorrectly on evidence, or to fail in peculiar ways like having beliefs that shift incoherently according to the context in which an agent finds itself)
It could make AGIs more dangerous by causing them to act on beliefs that they never should have developed in the first place. But it could make AGI...
Is it fair to say that the Intelligence Spectrum is a core concept that underpins AGI safety? I often feel the idea is the first buy-in required by introductory AGI safety texts. Because of my prior beliefs and the lack of emphasis on it in safety introductions, I often bounce off the entire field.
How useful would uncopyable source code be for limiting non-superintelligent AIs? Something like how humans are limited by only being able to copy resulting ideas, not the actual process that produced them. Homomorphic encryption is sort of similar, but only in the sense that the code is unreadable. I'm more interested in mechanisms that force the execution to be a sort of singleton. This of course wouldn't prevent unboxing etc., but might potentially limit the damage.
I haven't put much thought into this, but was wondering if anyone else has already gone into the feasibility of it.
In his TED Talk, Bostrom's solution to the alignment problem is to build in at the ground level the goal of predicting what we will approve of so that no matter what other goals it's given, it will aim to achieve those goals only in ways that align with our values.
How (and where) exactly does Yudkowsky object to this solution? I can make guesses based on what Yudkowsky says, but so far, I've found no explicit mention by Yudkowsky of Bostrom's solution. More generally, where should I go to find any objections to or evaluations of Bostrom's solution?
What is the thinking around equilibrium between multiple AGIs with competing goals?
For example AGI one wants to maximize paperclips, AGI two wants to help humans create as many new episodes of the Simpsons as possible, AGI three wants to make sure humans aren't being coerced by machines using violence, AGI four wants to maximize egg production.
In order for AGI one to keep AGIs two through four from trying to destroy it, it needs to either gain such a huge advantage over them that it can instantly destroy/neutralize them, or it needs to use strategies that don't...
What are the best not-Arxiv and not-NeurIPS sources of information on new capabilities research?
Is trying to reduce internet usage, and thereby maybe the amount of data AI companies have to work with, at all feasible?
Dear Robert, I just found out about your work and absolutely love it.
Has the following idea been explored yet?
Is this something Stampy would want to help with?
Could AGI safety overhead turn out to be more harmful in terms of actual current human lives, as well as their individual 'qualities'? (I don't accept utilitarian or longtermist counter-arguments, since my question solely concerns the present, which is the only reality, and existing 'individualities' as the only ones capable of experiencing it. Thank you.)
Could we solve alignment by just getting an AI to learn human preferences through training it to predict human behavior, using a "current best guess" model of human preferences to make predictions and updating the model until its predictions are accurate, then using this model as a reward signal for the AI? Is there a danger in relying on these sorts of revealed preferences?
On a somewhat related note, someone should answer, "What is this Coherent Extrapolated Volition I've been hearing about from the AI safety community? Are there any holes in that plan?"
Somewhat of a tangent -- I realized I don't really understand the basic reasoning behind current efforts on "trying to make AI stop saying naughty words". Like, what's the actual problem with an LLM producing racist or otherwise offensive content that warrants so much effort? Why don't the researchers just slap an M content label on the model and be done with it? Movie characters say naughty words all the time, are racist all the time, disembody other people in ingenious and sometimes realistic ways, and nobody cares -- so what's the difference?..
There can only be One X-risk. I wonder if anyone here believes that our first X-risk is not coming from AGI, that AGI is our fastest counter to our first X-risk.
In that case, aren't you resisting what you should embrace?
