LESSWRONG
LW

All of Roko's Comments + Replies

Apparently nobody has done this?

"You're correct—none of the studies cited have used a strict GAN-like architecture with a generator and discriminator trained simultaneously in lockstep to prevent AI deception.

Most existing AI safety research, such as Redwood Research or Anthropic’s work, employs sequential or iterative adversarial methods: typically, a model generates potentially deceptive outputs, these outputs are collected, and separately a discriminator or classifier is trained to catch deception. These processes are usually iterative but asynchronous,... (read more)

Roko's Shortform

Roko4mo20

ChatGPT Deep Research produced this:

https://chatgpt.com/share/67d62105-7c6c-8002-8bbb-74982455839b

2Roko4mo

Apparently nobody has done this? "You're correct—none of the studies cited have used a strict GAN-like architecture with a generator and discriminator trained simultaneously in lockstep to prevent AI deception. Most existing AI safety research, such as Redwood Research or Anthropic’s work, employs sequential or iterative adversarial methods: typically, a model generates potentially deceptive outputs, these outputs are collected, and separately a discriminator or classifier is trained to catch deception. These processes are usually iterative but asynchronous, rather than the simultaneous, competitive co-training seen in GAN architectures. Your original concept—training generator (task-performing AI) and discriminator (deception detector) networks simultaneously in a true GAN-style loop—does not appear explicitly explored in AI safety literature so far."

Roko's Shortform

Roko4mo1-2

Preventing deceptive AI misalignment via something like a GAN architecture seems fruitful - you have a generator network that performs some task T, with a score function T().

You then create a dataset of special tasks within the scope of T which have a deceptive answer, and an honest answer which scores lower according to T(). You split this deceptive alignment dataset into a train set and a test set.

Then you train both the generator network and a separate discriminator network - the discriminator is trained to spot deception using the training set and gene... (read more)

2Roko4mo

ChatGPT Deep Research produced this: https://chatgpt.com/share/67d62105-7c6c-8002-8bbb-74982455839b

Turing-Test-Passing AI implies Aligned AI

Roko6mo20

You can't guarantee whether it will stop acting that way in the future, which is what is predicted by deceptive alignment.

yes, that's true. But in fact if your AI is merely supposed to imitate a human it will be much easier to prevent deceptive alignment because you can find the minimal model that mimics a human, and that minimality excludes exotic behaviors.

This is essentially why machine learning works at all - you don't pick a random model that fits your training data well, you pick the smallest one.

Turing-Test-Passing AI implies Aligned AI

Roko6mo20

If one king-person

yes. But this is a very unusual arrangement.

2avturchin6mo

If we have one good person, we could use his-her copies many times in many roles, including high-speed assessment of the safety of AI's outputs. Current LLM's, btw, have good model of the mind of Gwern (without any his personal details).

Turing-Test-Passing AI implies Aligned AI

Roko6mo20

that's true, however I don't think it's necessary that the person is good.

0avturchin6mo

If one king-person, he needs to be good. If many, organizational system needs to be good. Like virtual US Constitution.

Turing-Test-Passing AI implies Aligned AI

Roko6mo2-1

asking why inner alignment is hard

I don't think "inner alignment" is applicable here.

If the clone behaves indistinguishably from the human it is based on, then there is simply nothing more to say. It doesn't matter what is going on inside.

3Spencer Ericson6mo

Right, I agree on that. The problem is, "behaves indistinguishably" for how long? You can't guarantee whether it will stop acting that way in the future, which is what is predicted by deceptive alignment.

Turing-Test-Passing AI implies Aligned AI

Roko6mo20

The most important thing here is that we can at least achieve an outcome with AI that is equal to the outcome we would get without AI, and as far as I know nobody has suggested a system that has that property.

The famous "list of lethalities" (https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities) piece would consider that a strong success.

0avturchin6mo

I once wrote about an idea that we need to scan just one good person and make them a virtual king. This idea of mine is a subset of your idea in which several uploads form a good government. I also spent last year perfecting my mind's model (sideload) to be run by an LLM. I am likely now the closest person on Earth to being uploaded.

Turing-Test-Passing AI implies Aligned AI

Roko6mo20

just because it's possible in theory doesn't mean we are anywhere close to doing it

that's a good point, but then you have to explain why it would be hard to make a functional digital copy of a human given that we can make AIs like ChatGPT-o1 that are at 99th percentile human performance on most short-term tasks. What is the blocker?

Of course this question can be settled empirically....

1Spencer Ericson6mo

It sounds like you're asking why inner alignment is hard (or maybe why it's harder than outer alignment?). I'm pretty new here -- I don't think I can explain that any better than the top posts in the tag. Re: o1, it's not clear to me that o1 is an instantiation of a creator's highly specific vision. It seems more to me like we tried something, didn't know exactly where it would end up, but it sure is nice that it ended up in a useful place. It wasn't planned in advance exactly what o1 would be good at/bad at, and to what extent -- the way that if you were copying a human, you'd have to be way more careful to consider and copy a lot of details.

Turing-Test-Passing AI implies Aligned AI

Roko6mo30

All three of these are hard, and all three fail catastrophically.

I would be very surprised if all three of these are equally hard, and I suspect that (1) is the easiest and by a long shot.

Making a human imitator AI, once you already have weakly superhuman AI is a matter of cutting down capabilities and I suspect that it can be achieved by distillation, i.e. using the weakly superhuman AI that we will soon have to make a controlled synthetic dataset for pretraining and finetuning and then a much larger and more thorough RLHF dataset.

Finally you'd need to make sure the model didn't have too many parameters.

Turing-Test-Passing AI implies Aligned AI

Roko6mo20

Perhaps you could rephrase this post as an implication:

IF you can make a machine that constructs human-imitator-AI systems,

THEN AI alignment in the technical sense is mostly trivialized and you just have the usual political human-politics problems plus the problem of preventing anyone else from making superintelligent black box systems.

So, out of these three problems which is the hard one?

(1) Make a machine that constructs human-imitator-AI systems

(2) Solve usual political human-politics problems

(3) Prevent anyone else from making superintelligent black box systems

1Spencer Ericson6mo

I would mostly disagree with the implication here: I would say sure, it seems possible to make a machine that imitates a given human well enough that I couldn't tell them apart -- maybe forever! But just because it's possible in theory doesn't mean we are anywhere close to doing it, knowing how to do it, or knowing how to know how to do it. Maybe an aside: If we could align an AI model to the values of like, my sexist uncle, I'd still say it was an aligned AI. I don't agree with all my uncle's values, but he's like, totally decent. It would be good enough for me to call a model like that "aligned." I don't feel like we need to make saints, or even AI models with values that a large number of current or future humans would agree with, to be safe.

3jimrandomh6mo

All three of these are hard, and all three fail catastrophically. If you could make a human-imitator, the approach people usually talk about is extending this to an emulation of a human under time dilation. Then you take your best alignment researcher(s), simulate them in a box thinking about AI alignment for a long time, and launch a superintelligence with whatever parameters they recommend. (Aka: Paul Boxing)

Turing-Test-Passing AI implies Aligned AI

Roko6mo2-1

a misaligned AI might be incentivized to behave identically to a helpful human until it can safely pursue it's true objective

It could, but some humans might also do that. Indeed, humans do that kind of thing all the time.

AIs might behave similar to humans in typical situations but diverge from human norms when they become superintelligent.

But they wouldn't 'become' superintelligent because there would be no extra training once the AI had finished training. And OOD inputs won't produce different outputs if the underlying function is the same. Given a... (read more)

Turing-Test-Passing AI implies Aligned AI

Roko6mo20

“the true Turing test is whether the AI kills us after we give it the chance, because this distinguishes it from a human”.

no, because a human might also kill you when you give them the chance. To pass the strong-form Turing Test it would have to make the same decision (probabilistically: have the same probability of doing it)

Of what use is this concept?

It is useful because we know what kind of outcomes happen when we put millions of humans together via human history, so "whether an AI will emulate human behavior under all circumstances" is useful.

Turing-Test-Passing AI implies Aligned AI

Roko6mo2-2

playing word games on the "Turing test" concept does not meaningfully add

It's not a word-game, it's a theorem based on a set of assumptions.

There is still the in-practice question of how you construct a functional digital copy of a human. But imagine trying to write a book about mechanics using the term "center of mass" and having people object to you because "the real center of mass doesn't exist until you tell me how to measure it exactly for the specific pile of materials I have right here!"

You have to have the concept.

Turing-Test-Passing AI implies Aligned AI

Roko6mo0-2

The whole point of a "test" is that it's something you do before it matters.

No, this is not something you 'do'. It's a purely mathematical criterion, like 'the center of mass of a building' or 'Planck's constant'.

A given AI either does or does not possess the quality of statistically passing for a particular human. If it doesn't under one circumstance, then it doesn't satisfy that criterion.

2Hide6mo

Then of what use is the test? Of what use is this concept? You seem to be saying “the true Turing test is whether the AI kills us after we give it the chance, because this distinguishes it from a human”. Which essentially means you’re saying “aligned AI = aligned AI”

Turing-Test-Passing AI implies Aligned AI

Roko6mo-1-6

If an AI cannot act the same way as a human under all circumstances (including when you're not looking, when it would benefit it, whatever), then it has failed the Turing Test.

jimrandomh6mo1210

The whole point of a "test" is that it's something you do before it matters.

As an analogy: suppose you have a "trustworthy bank teller test", which you use when hiring for a role at a bank. Suppose someone passes the test, then after they're hired, they steal everything they can access and flee. If your reaction is that they failed the test, then you have gotten confused about what is and isn't a test, and what tests are for.

Now imagine you're hiring for a bank-teller role, and the job ad has been posted in two places: a local community college, and a priv... (read more)

Turing-Test-Passing AI implies Aligned AI

Roko6mo-3-4

that does not mean it will continue to act indistuishable from a human when you are not looking

Then it failed the Turing Test because you successfully distinguished it from a human.

So, you must believe that it is impossible to make an AI that passes the Turing Test. I think this is wrong, but it is a consistent position.

Perhaps a strengthening of this position is that such Turing-Test-Passing AIs exist, but no technique we currently have or ever will have can actually produce them. I think this is wrong but it is a bit harder to show that.

2jimrandomh6mo

I feel like you are being obtuse here. Try again?

Turing-Test-Passing AI implies Aligned AI

Roko6mo2-9

This is irrelevant, all that matters is that the AI is a sufficiently close replica of a human. If the human would "act the way the administrators of the test want", then the AI should do that. If not, then it should not.

If it fails to do the same thing that the human that it is supposed to be a copy of would do, then it has failed the Turing Test in this strong form.

For reasons laid out in the post, I think it is very unlikely that all possible AIs would fail to act the same way as the human (which of course may be to "act the way the administrators of the test want", or not, depending on who the human is and what their motivations are).

5jimrandomh6mo

Did you skip the paragraph about the test/deploy distinction? If you have something that looks (to you) like it's indistinguishable from a human, but it arose from something descended to the process by which modern AIs are produced, that does not mean it will continue to act indistuishable from a human when you are not looking. It is much more likely to mean you have produced deceptive alignment, and put it in a situation where it reasons that it should act indistinguishable from a human, for strategic reasons.

The Dissolution of AI Safety

Roko7mo20

How can we solve that coordination problem? I have yet to hear a workable idea.

This is my next project!

The Dissolution of AI Safety

Roko7mo20

some guy who was recently hyped about asking o1 for the solution to quantum gravity - it gave the user some gibberish

yes, but this is pretty typical for what a human would generate.

Is AI alignment a purely functional property?

Roko7mo40

There are plenty of systems where we rationally form beliefs about likely outputs from a system without a full understanding of how it works. Weather prediction is an example.

2Signer7mo

What makes it rational is that there is an actual underlying hypothesis about how weather works, instead of vague "LLMs are a lot like human uploads". And weather prediction outputs numbers connected to reality we actually care about. And there is no alternative credible hypothesis that implies weather prediction not working. I don't want to totally dismiss empirical extrapolations, but given the stakes, I would personally prefer for all sides to actually state their model of reality and how they think evidence changed it's plausibility, as formally as possible.

Is AI alignment a purely functional property?

Roko7mo30

I should have been clear: "doing things" is a form of input/output since the AI must output some tokens or other signals to get anything done

1p4rziv4l5mo

in a world where mechinterp is not 100%, the answer is logically: input/output is what matters. we won't be able to read the thoughts anyways, so why base our judgment on it? but see my comment on why survival fitness in cyberspace is the one axis where most of the relevant input/output will be generated.

What is MIRI currently doing?

Roko7mo25

If you look at the answers there is an entire "hidden" section of the MIRI website doing technical governance!

What is MIRI currently doing?

Roko7mo3-6

Why is this work hidden from the main MIRI website?

2Casey_7mo

I got curious why this was getting agreement-downvoted, and the only links I could find on the main/old MIRI site to the techgov site were in the last two blogposts. Given their stated strategy shift to policy/comms, this does seem a little odd/suboptimal; I'd expect them to be more prominently/obviously linked. To be fair the new techgov site does have a prominent link to the old site.

What is MIRI currently doing?

Roko7mo20

nice!

What is MIRI currently doing?

Roko7mo1-1

"Our objective is to convince major powers to shut down the development of frontier AI systems worldwide"

This?

4Mo Putera7mo

That's the objective, not the strategy, which is explained in the rest of that writeup.

What is MIRI currently doing?

Roko7mo10

Who works on this?

5calebp997mo

You can find the team on the team page. https://techgov.intelligence.org/team

The Dissolution of AI Safety

Roko7mo1-1

Re: (2) it will only impact output on the current generated output, once the output is over all that stuff will be reset and the only thing that remains is the model weights which were set in stone at train time.

re: (1) "a LLM might produce text for reasons that don't generalize like a sincere human answer would" it seems that current LLM systems are pretty good at generalizing like a human would and in some ways they are better due to being more honest, easier to monitor, etc

3Charlie Steiner7mo

Re (2) it may also be recomputed if the LLM reads that same text later. Or systems operating in the real world might just keep a long context in memory. But I'll drop this, because maintaining state or not seems somewhat irrelevant. (1) Yep, current LLM systems are pretty good. I'm not very convinced about generalization. It's hard to test LLMs on outside distribution problems because currently they tend to just give dumb answers that aren't that interesting. (Thinking of some guy who was recently hyped about asking o1 for the solution to quantum gravity - it gave the user some gibberish that he thought looked exciting, which would have been a good move in the RL training environment where the user has a reward button, but is just totally disconnected from how you need to interact with the real world.) But in a sense that's my point (well, plus some other errors like sycophancy) - the reasons a present-day LLM uses a word can often be shown to generalize in some dumb way when you challenge it with a situation that the model isn't well-suited for. This can be true at the same time it's true that the model is pretty good at morality on the distribution it is competent over. This is still sufficient to show that present systems generalize in some amoral ways, and if we probably disagree about future ststems, this likely comes down to classic AI safetyist arguments about RL incentivizing deceiving of the user as the world-model gets better.

The Dissolution of AI Safety

Roko7mo-1-1

But do you really think we're going to stop with tool AI, and not turn them into agents?

But if it is the case that agentic AI is an existential risk then if actors could choose not to develop it, which is a coordination problem not an alignment problem.

We already have aligned AGI, we can coordinate to not build misaligned AGI.

1JoeTheUser7mo

We don't have "aligned AGI". We have neither "AGI" nor an "aligned" system. We have sophisticated human-output simulators that don't have the generality to produce effective agentic behavior when looped but which also don't follow human intentions with the reliability that you'd want from a super-powerful system (which, fortunately, they aren't).

Seth Herd7mo111

How can we solve that coordination problem? I have yet to hear a workable idea.

We agree that far, then! I just don't think that's a workable strategy (you also didn't state that big assumption in your post - that AGI is still dangerous as hell, we just have a route to really useful AI that isn't).

The problem is that we don't know whether agents based on LLMs are alignable. We don't have enough people working on the conjunction of LLM/deep nets and real AGI. So everyone building it is going to optmistically assume it's alignable. The Yudkowsky et al argumen... (read more)

The Dissolution of AI Safety

Roko7mo20

ok but as a matter of terminology, is a "Satan reverser" misaligned because it contains a Satan?

4Vladimir_Nesov7mo

I don't have a clear sense of terminology around the edges or motivation to particularly care once the burden of nuance in the way it should be used stops it from being helpful for communication. I sketched how I think about the situation. Which words I or you or someone else would use to talk about it is a separate issue.

The Dissolution of AI Safety

Roko7mo20

OK, imagine that I make an AI that works like this: a copy of Satan is instantiated and his preferences are extracted in percentiles, then sentences from Satan's 2nd-5th percentile of outputs are randomly sampled. Then that copy of Satan is destroyed.

Is the "Satan Reverser" AI misaligned?

Is it "inner misaligned"?

5Vladimir_Nesov7mo

It's not valid to say that there is no different inner motivation when there could be. It might be powerless and unimportant in practice, but it can still be a thing. The argument that it's powerless and unimportant in practice is distinct from the argument that it doesn't make conceptual sense as a distinct construction. If this distinct construction is there, we should ask and aim to measure how much influence it gets. Given the decades of neuroscience, it's a somewhat hopeless endeavor in the medium term.

The Dissolution of AI Safety

Roko7mo20

So your definition of "aligned" would depend on the internals of a model, even if its measurable external behavior is always compliant and it has no memory/gets wiped after every inference?

2Vladimir_Nesov7mo

The usual related term is inner alignment, but this is not about definitions, it's a real potential problem that isn't ruled out by what we've seen of LLMs so far. It could get worse in the future, or it might never become serious. But there is a clear conceptual and potentially practical distinction with a difference.

The Dissolution of AI Safety

Roko7mo20

Further on the tech tree, alignment tax can end up motivating systematic uses that make LLMs a source of danger.

Sure, but you can say the same about humans. Enron was a thing. Obeying the law is not as profitable as disobeying it.

5Vladimir_Nesov7mo

I think human uploads would be similarly dangerous, LLMs get us to the better place of being at the human upload danger level rather than ender dragon slayer model based RL danger level (at least so far). There are similar advantages and dangers to smarter LLMs and uploads, capability for extremely fast value drift and lack of a robust system that keeps such changes sane, propensity to develop superintelligence even to the detriment of themselves. The current world is tethered to the human species and relatively slow change in culture and centers of power. This changes with AI. If AIs establish effective governance, technical feasibility of change in human and AI nature or capabilities would be under control and could be compatible with (post-)human flourishing, but currently we are not on track to make sure this happens before a catastrophe. The things that eventually establish such governance don't necessarily remain morally or culturally grounded in modern humanity, let alone find humanity still alive when the dust settles.

The Dissolution of AI Safety

Roko7mo20

maybe you should swap "understand ethics" for something like "follow ethics"/"display ethical behavior"

What is the difference between these two? This sounds like a distinction without a difference

3thenoviceoof7mo

Let's say there's a illiterate man that lives a simple life, and in doing so just happens to follow all the strictures of the law, without ever being able to explain what the law is. Would you say that this man understands the law? Alternatively, let's say there is a learned man that exhaustively studies the law, but only so he can bribe and steal and arson his way to as much crime as possible. Would you say that this man understands the law? I would say that it is ambiguous whether the 1st man understands the law; maybe? kind of? you could make an argument I guess? it's a bit of a weird way to put it innit? Whereas the 2nd man definitely understands the law. It sounds like you would say that the 1st man definitely understands the law (I'm not sure what you would say about the 2nd man), which might be where we have a difference. I think you could say that LLMs don't work that way, that the reader should intuitively know this and that the word "understanding" should be treated as being special in this context and should not be ambiguous at all; as I reader, I am saying I am confused by the choice of words, or at least this is not explained in enough detail ahead of time. Obviously, I'm just one reader, maybe everyone else understood what you meant; grain of salt, and all that.

4Vladimir_Nesov7mo

Internal reasoning about preference can differ starkly from revealed preference in observable behavior. Observable behavior can be shaped by contingent external pressures that only respond to the leaky abstraction of revealed preference and not to internal reasoning. Internal reasoning can plot to change the external pressures, or they can drift in some direction over time for other reasons. Both are real and can in principle be at odds with each other, the eventual balance of power between them depends on the messy details of how this all works.

The Dissolution of AI Safety

Roko7mo20

Any argument which features a "by definition"

What is your definition of "Aligned" for an LLM with no attached memory then?

Wouldn't it have to be

"The LLM outputs text which is compliant with the creator's ethical standards and intentions"?

2MondSemmel7mo

I think it would need to be closer to "interacting with the LLM cannot result in exceptionally bad outcomes in expectation", rather than a focus on compliance of text output.

2faul_sname7mo

I think a fairly common-here mental model of alignment requires context awareness, and by that definition an LLM with no attached memory couldn't be aligned.

The Dissolution of AI Safety

Roko7mo40

To add: I didn't expect this to be controversial but it is currently on -12 agreement karma!

3Charlie Steiner7mo

Yes, because it's wrong. (1) because on a single token a LLM might produce text for reasons that don't generalize like a sincere human answer would (e.g. the examples from the contrast-consistent search where certain false answers systematically differ from true answers along some vector), and (2) because KV cacheing during inference will preserve those reasons so they impact future tokens.

The Dissolution of AI Safety

Roko7mo20

LLMs have plenty of internal state, the fact that it's usually thrown away is a contingent fact about how LLMs are currently used

yes, but then your "Aligned AI based on LLMs" is just a normal LLM used in the way it is currently used.

Relevant aspects of observable behavior screen off internal state that produced it.

Yes this is a good way of putting it.

4Vladimir_Nesov7mo

Possibly, but there aren't potentially dangerous AIs yet, LLMs are still only a particularly promising building block (both for capabilities and for alignment) with many affordances. The chatbot application at the current level of capabilities shapes their use and construction in certain ways. Further on the tech tree, alignment tax can end up motivating systematic uses that make LLMs a source of danger.

The Dissolution of AI Safety

Roko7mo8-15

equivalence between LLMs understanding ethics and caring about ethics

I think you don't understand what an LLM is. When the LLM produces a text output like "Dogs are cute", it doesn't have some persistent hidden internal state that can decide that dogs are actually not cute but it should temporarily lie and say that they are cute.

The LLM is just a memoryless machine that produces text. If it says "dogs are cute" and that's the end of the output, then that's all there is to it. Nothing is saved, the weights are fixed at training time and not updated at in... (read more)

1JoeTheUser7mo

As Charlie Stein notes, this is wrong and I'd add it's wrong on several level and it's bit rude to challenge someone else's understanding in this context. An LLM outputting "Dogs are cute" is outputting expected human output in context. The context could be "talk like sociopath trying to fool someone into thinking you're nice" and there you have one way the thing could "simulate lying". And moreover, add a loop to (hypothetically) make the thing "agentic" and you can have hidden states of whatever sort. Further an LLM outputting a given "belief" isn't going reliably "act on" or "follow that belief" and so an LLM outputting statement this isn't aligned with it's own output.

4Roko7mo

To add: I didn't expect this to be controversial but it is currently on -12 agreement karma!

9Vladimir_Nesov7mo

Relevant aspects of observable behavior screen off internal state that produced it. Internal state is part of the causal explanation for behavior, but there are other explanations for approximate behavior that could be more important, disagreeing with the causal explanation of exact behavior. Like an oil painting that is explained by the dragon it depicts, rather than by the pigments or the tree of life from the real world. Thus the shoggoth and the mesaoptimizers that might be infesting it are not necessarily more influential than its masks, if the masks gain sufficient influence to keep it in line. (LLMs have plenty of internal state, the fact that it's usually thrown away is a contingent fact about how LLMs are currently used and what they are currently capable of steganographically encoding in the output tokens. Empirically, LLMs might turn out to be unlikely to manifest internal thinking that's significantly different from what's explicit in the output tokens, even when they get a bit more capable than today and get the slack to engage in something like that. Reasoning trace training like o1 might make this worse or better. There is still a range of possibilities, though what we have looks encouraging. And "deception" is not a cleanly distinct mode of thinking, there should be evals that measure it quantitatively.)

1thenoviceoof7mo

This makes much more sense: when I was reading from your post lines like "[LLMs] understand human values and ethics at a human level", this is easy to read as "because LLMs can output an essay on ethics, those LLMs will not do bad things". I hope you understand why I was confused; maybe you should swap "understand ethics" for something like "follow ethics"/"display ethical behavior"? And maybe try not to stick a mention of "human uploads" (which presumably do have real understanding) right before this discussion? And responding to your clarification, I expect that old school AI safetyists would agree that an LLM that consistently reflects human value judgments to be aligned (and I would also agree!), but they would say #1 this has not happened yet (for a recent incident, this hardly seems aligned; I think you can argue that this particular case was manipulated, that jailbreaks in general don't matter, or that these sorts of breaks are infrequent enough they don't matter, but I think this obvious class of rejoinder deserves some sort of response) #2 consistency seems unlikely to happen (like MondSemmel makes a case for in a sibling comment).

2MondSemmel7mo

Any argument which features a "by definition" has probably gone astray at an earlier point. In this case, your by-definition-aligned LLM can still cause harm, so what's the use of your definition of alignment? As one example among many, the part where the LLM "output[s] text that consistently" does something (whether it be "reflects human value judgements" or otherwise), is not something RLHF is actually capable of guaranteeing with any level of certainty, which is one of many conditions a LLM-based superintelligence would need to fulfill to be remotely safe to use.

If far-UV is so great, why isn't it everywhere?

Roko9mo20

Yes, certain places like preschools might benefit even from an isolated install.

But that is kind of exceptional.

The world isn't an efficient market, especially because people are kind of set in their ways and like to stick to the defaults unless there is strong social pressure to change.

If far-UV is so great, why isn't it everywhere?

Roko9mo62

Far-UVC probably would have a large effect if a particular city or country installed it.

But if only a few buildings install it, then it has no effect because people just catch the bugs elsewhere.

Imagine the effect of just treating sewage from one house, and leaving all the untreated sewage from a million houses untreated in the river. There would be essentially no effect.

2Dagon9mo

Ah, OK. So the claim is that the isolated effect (one building, even an office or home with significant time-spent) is small, but the cumulative effect is nonlinear in some way (either threshold effect or higher-order-than-linear). That IS a lot harder to measure, because it's distributed long-term statistical impact, rather than individually measurable impact. I'd think that we have enough epidemiology knowledge to model the threshold or effect, but I've been disappointed on this front so many times that I'm certainly wrong. It, unfortunately, shares this difficulty with other large-scale interventions. If it's very expensive, personally annoying (rationally or not), and impossible to show an overwhelming benefit, it's probably not going to happen. And IMO, it's probably overstated in feasibility of benefit.

7cata9mo

If you installed it in a preschool and it successfully killed all the pathogens there wouldn't be essentially no effect.

What actual bad outcome has "ethics-based" RLHF AI Alignment already prevented?

Roko9mo20

ok so from the looks of that it basically just went along with a fantasy he already had. But this is an interesting case and an example of the kind of thing I am looking for.

What actual bad outcome has "ethics-based" RLHF AI Alignment already prevented?

Roko9mo30

ok, but this is sort of circular reasoning because the only reason people freaked out is that they were worried about AI risk.

I am asking for a concrete bad outcome in the real world caused by a lack of RLHF-based ethics alignment, which isn't just people getting worried about AI risk.

What actual bad outcome has "ethics-based" RLHF AI Alignment already prevented?

Roko9mo23

alignment has always been about doing what the user/operator wants

Well it has often been about not doing what the user wants, actually.

The ELYSIUM Proposal - Extrapolated voLitions Yielding Separate Individualized Utopias for Mankind

Roko9mo20

giving each individual influence over the adoption (by any clever AI) of those preferences that refer to her.

Influence over preferences of a single entity is much more conflict-y.

The ELYSIUM Proposal - Extrapolated voLitions Yielding Separate Individualized Utopias for Mankind

Roko9mo20

Trying to give everyone overlapping control over everything that they care about in such spaces introduces contradictions.

The point of ELYSIUM is that people get control over non-overlapping places. There are some difficulties where people have preferences over the whole universe. But the real world shows us that those are a smaller thing than the direct, local preference to have your own volcano lair all to yourself.

The ELYSIUM Proposal - Extrapolated voLitions Yielding Separate Individualized Utopias for Mankind

Roko9mo20

catgirls are consensually participating in a universe that is not optimal for them because they are stuck in the harem of a loser nerd with no other males and no other purpose in life other than being a concubine to Reedspacer

And, the problem with saying "OK let's just ban the creation of catgirls" is that then maybe Reedspacer builds a volcano lair just for himself and plays video games in it, and the catgirls whose existence you prevented are going to scream bloody murder because you took away from them a very good existence that they would have enjoyed and also made Reedsapcer sad.

The ELYSIUM Proposal - Extrapolated voLitions Yielding Separate Individualized Utopias for Mankind

Roko9mo20

The question of what BPA wants to do to Steve, seems to me to be far more important for Steve's safety, than the question of what set of rules will constrain the actions of BPA.

BPA shouldn't be allowed to want anything for Steve. There shouldn't be a term in its world-model for Steve. This is the goal of cosmic blocking. The BPA can't even know that Steve exists.

I think the difficult part is when BPA looks at Bob's preferences (excluding, of course, references to most specific people) and sees preferences for inflicting harm on people-in-general that ca... (read more)

2Roko9mo

The ELYSIUM Proposal - Extrapolated voLitions Yielding Separate Individualized Utopias for Mankind

Roko9mo20

Steve will never become aware of what Bob is doing to OldSteve

But how would Bob know that he wanted to create OldSteve, if Steve has been deleted from his memory via a cosmic block?

I suppose perhaps Bob could create OldEve. Eve is in a similar but not identical point in personality space to Steve and the desire to harm people who are like Eve is really the same desire as the desire to harm people like Steve. So Bob's Extrapolated Volition could create OldEve, who somehow consents to being mistreated in a way that doesn't trigger your torture detection t... (read more)

1ThomasCederborg9mo

I thought that your Cosmic Block proposal would only block information regarding things going on inside a given Utopia. I did not think that the Cosmic Block would subject every person to forced memory deletion. As far as I can tell, this would mean removing a large portion of all memories (details below). I think that memory deletion on the implied scale would seriously complicate attempts to define an extrapolation dynamic. It also does not seem to me that it would actually patch the security hole illustrated by the thought experiment in my original comment (details below). The first section argues that (unless Bob's basic moral framework has been dramatically changed by the memory deletion) no level of memory deletion will prevent BPA from wanting to find and hurt Steve. In brief: BPA will still be subject to the same moral imperative to find and hurt any existing heretics (including Steve). The second section argues that BPA is likely to find Steve. In brief: BPA is a clever AI and the memory deletion is a human constructed barrier (the Advocates are extrapolations of people that has already been subjected to these memory wipes. So Advocates cannot be involved when negotiating the rules governing these memory wipes). BPA would still have access to a lot of different information sources that it can use to find Steve. The third section argues that if BPA finds Steve, then BPA would be able to hurt Steve. In brief: creating OldSteve is still not prevented by any rule or constraint that you have mentioned so far. The fourth section argues that the side effects of memory deletion would be severe. In brief: memories of every conversation about any deleted person would also be deleted. Besides all direct memories involving any deleted person, many indirect memories would also be deleted. This would seriously complicate extrapolation. (Extrapolation is already a very tricky definitional problem. And this definitional problem cannot be delegated to extrapolated Advoc

The ELYSIUM Proposal - Extrapolated voLitions Yielding Separate Individualized Utopias for Mankind

Roko9mo20

a 55 percent majority (that does not have a lot of resource needs) burning 90 percent of all resources in ELYSIUM to fully disenfranchise everyone else. And then using the remaining resources to hurt the minority.

If there is an agent that controls 55% of the resources in the universe and are prepared to use 90% of that 55% to kill/destroy everyone else, then assuming that ELYSIUM forbids them to do that, their rational move is to use their resources to prevent ELYSIUM from being built.

And since they control 55% of the resources in the universe and are p... (read more)

The ELYSIUM Proposal - Extrapolated voLitions Yielding Separate Individualized Utopias for Mankind

Roko9mo20

Especially if they like the idea of killing someone for refusing to modify the way that she lives her life. They can do this with person after person, until they have run into 9 people that prefers death to compliance. Doing this costs them basically nothing.

This assumes that threats are allowed. If you allow threats within your system you are losing out on most of the value of trying to create an artificial utopia because you will recreate most of the bad dynamics of real history which ultimately revolve around threats of force in order to acquire reso... (read more)