All of VojtaKovarik's Comments + Replies

AI 2027: What Superintelligence Looks Like

Data point on the impact of this: in the Czech Republic, this scenario made it into one of the popular newspapers, and I have heard about it from some people around me who don't know much about AI.

https://denikn.cz/1700968/do-deseti-let-bude-po-vsem-petice-expertu-nabizi-presvedcivy-scenar-o-tom-jak-ai-ovladne-svet-a-vyhubi-lidstvo/

A Longlist of Theories of Impact for Interpretability

VojtaKovarik7moΩ110

If I could assume things like "they are much better at reading my inner monologue than my non-verbal thoughts", then I could create code words for prohibited things.
I could think in words they don't know.
I could think in complicated concepts they haven't understood yet. Or references to events, or my memories, that they don't know.
I could leave a part of my plans implicit, and only figure out the details later.
I could harm them through some action for which they won't understand that it is harmful, so they might not be alarmed even if they catch me thinkin

VojtaKovarik7moΩ110

even an incredibly sophisticated deceptive model which is impossible to detect via the outputs may be easy to detect via interpretability tools (analogy - if I knew that sophisticated aliens were reading my mind, I have no clue how to think deceptive thoughts in a way that evades their tools!)

It seems to me that your analogy is the wrong way around. IE, the right analogy would be "if I knew that a bunch of 5-year-olds were reading my mind, I have...actually, a pretty good idea how to think deceptive thoughts in a way that avoids their tools".

(For what it's ...

2Neel Nanda7mo

How would you evade their tools?

The Compendium, A full argument about extinction risk from AGI

VojtaKovarik8moΩ254

After reading the first section and skimming the rest, my impression is that the document is a good overview, but does not present any detailed argument for why godlike AI would lead to human extinction. (Except for the "smarter species" analogy, which I would say doesn't qualify.) So if I put on my sceptic hat, I can imagine reading the whole document in detail and somewhat-justifiably going away with "yeah, well, that sounds like a nice story, but I am not updating based on this".

That seems fine to me, given that (as far as I am concerned) no detailed co...

2adamShimi8mo

Thanks for the comment! We have indeed gotten feedback from multiple people that this part didn't feel detailed enough (although we got this much more from very technical readers than from non-technical ones), and are working on improving the arguments.

The Compendium, A full argument about extinction risk from AGI

VojtaKovarik8moΩ132

Some suggestions for improving the doc (I noticed the link to the editable version too late, apologies):

What is AI? Who is building it? Why? And is it going to be a future we want?

Something weird with the last sentence here (substituting "AI" for "it" makes the sentence un-grammatical).

Machines of hateful competition need not have such hindrances.

"Hateful" seems likely to put off some readers here, and I also think it is not warranted -- indifference is both more likely and also sufficient for extinction. So "Machines of indifferent competition" might work...

4adamShimi8mo

Thanks for the comment! We'll correct the typo in the next patch/bug fix. As for the more direct adversarial tone of the prologue, it is an explicit choice (and is contrasted by the rest of the document). For the moment, we're waiting to get more feedback on the doc to see if it really turns people off or not.

When is "unfalsifiable implies false" incorrect?

VojtaKovarik1y10

I agree that "we can't test it right now" is more appropriate. And I was looking for examples of things that "you can't test right now even if you try really hard".

When is "unfalsifiable implies false" incorrect?

VojtaKovarik1y10

Good point. Also, for the purpose of the analogy with AI X-risk, I think we should be willing to grant that the people arrive at the alternative hypothesis through theorising. (Similarly to how we came up with the notion of AI X-risk before having any powerful AIs.) So that does break my example somewhat. (Although in that particular scenario, I imagine that a sceptic of Newtonian gravity would come up with alternative explanations for the observation. Not that this seems very relevant.)

When is "unfalsifiable implies false" incorrect?

VojtaKovarik1y10

I agree with all of this. (And good point about the high confidence aspect.)

The only thing that I would frame slightly differently is that:
[X is unfalsifiable] indeed doesn't imply [X is false] in the logical sense. On reflection, I think a better phrasing of the original question would have been something like: 'When is "unfalsifiability of X is evidence against X" incorrect?'. And this amended version often makes sense as a heuristic --- as a defense against motivated reasoning, conspiracy theories, etc. (Unfortunately, many scientists seem to take this ...

4localdeity1y

In some sense this must be at least half the time, because if X is unfalsifiable, then not-X is also unfalsifiable, and it makes little sense to have this rule constitute evidence against X and also evidence against not-X. I would generally say that falsifiability doesn't imply anything about truth value. It's more like "this is a hypothesis that scientific investigation can't make progress on". Also, it's probably worth tracking the category of "hypotheses that you haven't figured out how to test empirically, but you haven't thought very hard about it yet". There may be useful heuristics about people who make unfalsifiable claims. Some of which are probably pretty context-dependent.
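The symmetry point above can be written out as a short probability identity. This is only a sketch: it assumes that "X is unfalsifiable" can be treated as an event U to condition on, and that U is the same event as "not-X is unfalsifiable"; the symbols U and P are introduced here for illustration.

```latex
\begin{align*}
P(X \mid U) + P(\neg X \mid U) &= 1 = P(X) + P(\neg X) \\
\Rightarrow\quad P(X \mid U) - P(X) &= -\bigl(P(\neg X \mid U) - P(\neg X)\bigr)
\end{align*}
```

Under those assumptions, any downward update on X is matched by an equal upward update on not-X, so U cannot count as evidence against both at once.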

When is "unfalsifiable implies false" incorrect?

Answer by VojtaKovarikJun 15, 202410

Some partial examples I have so far:

Phenomenon: For virtually any goal specification, if you pursue it sufficiently hard, you are guaranteed to get human extinction.^[1]
Situation where it seems false and unfalsifiable: The present world.
Problems with the example: (i) We don't know whether it is true. (ii) Not obvious enough that it is unfalsifiable.

Phenomenon: Physics and chemistry can give rise to complex life.
Situation where it seems false and unfalsifiable: If Earth didn't exist.
Problems with the example: (i) if Earth didn't exist, there wouldn't b...

AI takeoff and nuclear war

VojtaKovarik1y40

Nitpick on the framing: I feel that thinking about "misaligned decision-makers" as an "irrational" reason for war could contribute to (mildly) misunderstanding or underestimating the issue.

To elaborate: The "rational vs irrational reasons" distinction talks about the reasons using the framing where states are viewed as monolithic agents who act in "rational" or "irrational" ways. I agree that for the purpose of classifying the risks, this is an ok way to go about things.

I wanted to offer an alternative framing of this, though: For any state, we can consid...

My AI Model Delta Compared To Yudkowsky

VojtaKovarik1y32

[I am confused about your response. I fully endorse your paragraph on "the AI with superior ontology would be able to predict how humans would react to things". But then the follow-up, on when this would be scary, seems mostly irrelevant / wrong to me --- meaning that I am missing some implicit assumptions, misunderstanding how you view this, etc. I will try to react in a hopefully-helpful way, but I might be completely missing the mark here, in which case I apologise :).]

I think the problem is that there is a difference between:
(1) AI which can predict how t...

My AI Model Delta Compared To Yudkowsky

VojtaKovarik1y30

Nitpicky edit request: your comment contains some typos that make it a bit hard to parse ("be other", "we it"). (So apologies if my reaction misunderstands your point.)

[Assuming that the opposite of the natural abstraction hypothesis is true --- ie, not just that "not all powerful AIs share ontology with us", but actually "most powerful AIs don't share ontology with us":]
I also expect that an AI with superior ontology would be able to answer your questions about its ontology, in a way that would make you feel like^[1] you understand what is happening. ...

My AI Model Delta Compared To Yudkowsky

VojtaKovarik1y7-2

As a quick reaction, let me just note that I agree that (all else being equal) this (ie, "the AI understanding us & having superior ontology") seems desirable. And also that my comment above did not present any argument about why we should be pessimistic about AI X-risk if we believe that the natural abstraction hypothesis is false. (I was just trying to explain why/how "the AI has a different ontology" is compatible with "the AI understands our ontology".)

As a longer reaction: I think my primary reason for pessimism, if natural abstraction hypothesis ...

3ozziegooen1y

Thanks for that explanation. Thanks, this makes sense to me. Yea, I guess I'm unsure about that '[Inference step missing here.]'. My guess is that such a system would be able to recognize situations where things that score highly with respect to its ontology, would score lowly, or would be likely to score lowly, using a human ontology. Like, it would be able to simulate a human deliberating on this for a very long time and coming to some conclusion. I imagine that the cases where this would be scary are some narrow ones (though perhaps likely ones) where the system is both dramatically intelligent in specific ways, but incredibly inept in others. This ineptness isn't severe enough to stop it from taking over the world, but it is enough to stop it from being at all able to maximize goals - and it also doesn't take basic risk measures like "just keep a bunch of humans around and chat to them a whole lot, when curious", or "try to first make a better AI that doesn't have these failures, before doing huge unilateralist actions" for some reason. It's very hard for me to imagine such an agent, but that doesn't mean it's not possible, or perhaps likely.

My AI Model Delta Compared To Yudkowsky

VojtaKovarik1y148

Simplifying somewhat: I think that my biggest delta with John is that I don't think the natural abstraction hypothesis holds. (EG, if I believed it holds, I would become more optimistic about single-agent alignment, to the point of viewing Moloch as higher priority.) At the same time, I believe that powerful AIs will be able to understand humans just fine. My vague attempt at reconciling these two is something like this:

Humans have some ontology, in which they think about the world. This corresponds to a world model. This world model has a certain amount o...

6ozziegooen1y

This sounds a lot like a good/preferable thing to me. I would assume that we'd generally want AIs with ideal / superior ontologies. It's not clear to me why you'd think such a scenario would make us less optimistic about single-agent alignment. (If I'm understanding correctly)

MIRI 2024 Communications Strategy

VojtaKovarik1y10

An illustrative example, describing a scenario that is similar to our world, but where "Extinction-level Goodhart's law" would be false & falsifiable (hat tip Vincent Conitzer):

Suppose that we somehow only start working on AGI many years from now, after we have already discovered a way to colonize the universe at close to the speed of light. And some of the colonies are already unreachable, outside of our future lightcone. But suppose we still understand "humanity" as the collection of all humans, including those in the unreachable colonies. Then a...

MIRI 2024 Communications Strategy

VojtaKovarik1y20

FWIW, I acknowledge that my presentation of the argument isn't ironclad, but I hope that it makes my position a bit clearer. If anybody has ideas for how to present it better, or has some nice illustrative examples, I would be extremely grateful.

MIRI 2024 Communications Strategy

VojtaKovarik1y3-3

tl;dr: "lack of rigorous arguments for P is evidence against P" is typically valid, but not in the case of P = AI X-risk.

A high-level reaction to your point about unfalsifiability:
There seems to be a general sentiment that "AI X-risk arguments are unfalsifiable ==> the arguments are incorrect" and "AI X-risk arguments are unfalsifiable ==> AI X-risk is low".^[1] I am very sympathetic to this sentiment --- but I also think that in the particular case of AI X-risk, it is not justified.^[2] For quite non-obvious reasons.

Why do I believe this?
Take this ...

1VojtaKovarik1y

An illustrative example, describing a scenario that is similar to our world, but where "Extinction-level Goodhart's law" would be false & falsifiable (hat tip Vincent Conitzer): Suppose that we somehow only start working on AGI many years from now, after we have already discovered a way to colonize the universe at close to the speed of light. And some of the colonies are already unreachable, outside of our future lightcone. But suppose we still understand "humanity" as the collection of all humans, including those in the unreachable colonies. Then any AI that we build, no matter how smart, would be unable to harm these portions of humanity. And thus full-blown human extinction, from AI we build here on Earth, would be impossible. And you could "prove" this using a simple, yet quite rigorous, physics argument.[1] (To be clear, I am not saying that "AI X-risk's unfalsifiability is justifiable ==> we should update in favour of AI X-risk compared to our priors". I am just saying that the justifiability means we should not update against it compared to our priors. Though I guess that in practice, it means that some people should undo some of their updates against AI X-risk... ) 1. ^ And sure, maybe some weird magic is actually possible, and the AI could actually beat the speed of light. But whatever, I am ignoring this, and an argument like this would count as falsification as far as I am concerned.

When is Goodhart catastrophic?

VojtaKovarik1y10

Assumption 2 is, barring rather exotic regimes far into the future, basically always correct, and for irreversible computation, this always happens, since there's a minimum cost to increase the features IRL, and it isn't 0.
Increasing utility IRL is not free.

I think this is a misunderstanding of what I meant. (And it probably only makes sense to try clarifying the misunderstanding if you read the paper and disagree with my interpretation of it, rather than if your reaction is only based on my summary. Not sure which of the two is the case.)

What I was trying...

2Noosphere897mo

I definitely interpreted the model like this, in that I was assuming all the costs and benefits are included by default:

3Noosphere891y

Yeah, that was a different assumption that I didn't realize, because I thought the assumption was solely that we had a limited budget and every increase in a feature has a non-zero cost, which is a very different assumption. I sort of wish the assumptions were distinguished, because these are very, very different assumptions (for example, you can have positive-sum interactions/trade so long as the cost is sufficiently low and the utility gain is sufficiently high, which is pretty usual.)

What is the purpose and application of AI Debate?

VojtaKovarik1y10

I do agree that debate could be used in all of these ways. But at the same time, I think generality often leads to ambiguity and to papers not describing any such application in detail. And that in turn makes it difficult to critique debate-based approaches. (Both because it is unclear what one is critiquing and because it makes it too easy to accidentally dismiss the critiques using the motte-and-bailey fallacy.)

What is the purpose and application of AI Debate?

VojtaKovarik1y50

I was previously unaware of Section 4.2 of the Scalable AI Safety via Doubly-Efficient Debate paper and, hurray, it does give an answer to (2). (Thanks for mentioning, @niplav!) That still leaves (1) unanswered, or at least not answered clearly enough, imo. Also I am curious about the extent to which other people, who find debate promising, consider this paper's answer to (2) as the answer to (2).

For what it's worth, none of the other results that I know about were helpful for me for understanding (1) and (2). (The things I know about are the or...

What is the purpose and application of AI Debate?

VojtaKovarik1y20

The original people kind-of did, but new people started, and Geoffrey Irving continued/got-back-to working on it.

What is the purpose and application of AI Debate?

VojtaKovarik1y*10

Further disclaimer: Feel free to answer even if you don't find debate promising, but note that I am primarily interested in hearing from people who do actively work on it, or find it promising --- or at least from people who have a very good model of specific such people.

Motivation behind the question: People often mention Debate as a promising alignment technique. For example, the AI Safety Fundamentals curriculum features it quite prominently. But I think there is a lack of consensus on "as far as the proposal is concerned, how is Debate actually meant t...

[April Fools' Day] Introducing Open Asteroid Impact

VojtaKovarik1y*274

I believe that a promising safety strategy for the larger asteroids is to put them in a secure box prior to them landing on earth. That way, the asteroid is -- provably -- guaranteed to have no negative impact on earth.

Proof:

Technologies and Terminology: AI isn't Software, it's... Deepware?

VojtaKovarik1y10

Agreed.

It seems relevant, to the progression, that a lot of human problem solving -- though not all -- is done by the informal method of "getting exposed to examples and then, somehow, generalising". (And I likewise failed to appreciate this, not sure until when.) This suggests that if we want to build AI that solves things in similar ways that humans solve them, "magic"-involving "deepware" is a natural step. (Whether building AI in the image of humans is desirable, that's a different topic.)

Technologies and Terminology: AI isn't Software, it's... Deepware?

VojtaKovarik1y102

tl;dr: It seems noteworthy that "deepware" has strong connotations with "it involves magic", while the same is not true for AI in general.

I would like to point out one thing regarding the software vs AI distinction that is confusing me a bit. (I view this as complementing, rather than contradicting, your post.)

As we go along the progression "Tools > Machines > Electric > Electronic > Digital", most^[1] of the examples can be viewed as automating a reasonably-well-understood process, on a progressively higher level of abstraction.^[2]
[For exa...

3abramdemski1y

Yeah, this is a pretty interesting twist in the progression, and one which I failed to see coming as a teenager learning about AI. I looked at the trend from concrete to abstract -- from machine-code to structured programming to ever-more-abstract high-level programming languages -- and I thought AI would look like the highest-level programming language one could imagine. In some sense this is not wrong. Telling the machine what to do in plain natural language is the highest-level programming language one could imagine. However, naive extrapolation of ever-more-sophisticated programming languages might lead one to anticipate convergence between compilers and computational linguistics, such that computers would be understanding natural language with sophisticated but well-understood parsing algorithms, converting natural-language statements to formal representations resembling logic, and then executing the commands via similarly sophisticated planning algorithms. The reality is that computational linguistics itself has largely abandoned the idea that we can make a formal grammar which captures natural language; the best way to parse a bunch of English is, instead, to let machine learning "get the idea" from a large number of hand-parsed examples! Rather than bridging the formal-informal divide by fully formalizing English grammar, it turns out to be easier to formalize informality itself (ie, mathematically specify a model of messy neural network learning) and then throw the formalized informality at the problem! Weird stuff. However, at some point I did get the idea and make the update. I think it was at the 2012 AGI conference, where someone was presenting a version of neural networks which was supposed to learn interpretable models, due to the individual neurons implementing interpretable functions of their inputs, rather than big weighted sums with a nonlinear transform thrown in. It seemed obvious that the approach would be hopeless, because as the models g

Many arguments for AI x-risk are wrong

VojtaKovarik1yΩ68-5

I want to flag that the overall tone of the post is in tension with the disclaimer that you are "not putting forward a positive argument for alignment being easy".

To hint at what I mean, consider this claim:

Undo the update from the “counting argument”, however, and the probability of scheming plummets substantially.

I think this claim is only valid if you are in a situation such as "your probability of scheming was >95%, and this was based basically only on this particular version of the 'counting argument' ". That is, if you somehow thought that we had ...

Can we get an AI to "do our alignment homework for us"?

VojtaKovarik1y32

I feel a bit confused about your comment: I agree with each individual claim, but I feel like perhaps you meant to imply something beyond just the individual claims. (Which I either don't understand or perhaps disagree with.)

Are you saying something like: "Yeah, I think that while this plan would work in theory, I expect it to be hopeless in practice (or unnecessary because the homework wasn't hard in the first place)."?

If yes, then I agree --- but I feel that of the two questions, "would the plan work in theory" is the much less interesting one. (For exa...

4ryan_greenblatt1y

I just think that these are important concepts to distinguish because I think it's useful to notice the extent to which problems could be solved by a moderate amount of coordination and which asks could suffice for safety. I wasn't particularly trying to make a broader claim, just trying to highlight something that seemed important. My overall guess is that people paying costs equivalent to 2 years of delay for existential safety reasons is about 50% likely. (Though I'm uncertain overall and this is possible to influence.) Thus, ensuring that the plan for spending that budget is as good as possible looks quite good. And not hopeless overall. By analogy, note that Google bears substantial costs to improve security (e.g. running 10% slower). I think that if we could ensure the implementation of our best safety plans which just cost a few years of delay, we'd be in a much better position.

Can we get an AI to "do our alignment homework for us"?

VojtaKovarik1y30

However, note that if you think we would fail to sufficiently check human AI safety work given substantial time, we would also fail to solve various issues given a substantial pause

This does not seem automatic to me (at least in the hypothetical scenario where "pause" takes a couple of decades). The reasoning being that there is a difference between [automate a current form of an institution, and speed-run 50 years of it in a month] and [an institution, as it develops over 50 years].

For example, my crux^[1] is that current institutions do not subscribe ...

4ryan_greenblatt1y

I said "fail to sufficiently check human AI safety work given substantial time". This might be considerably easier than ensuring that such institutions exist immediately and can already evaluate things. I was just noting there was a weaker version of "build institutions which are reasonably good at checking the quality of AI safety work done by humans" which is required for a pause to produce good safety work. Of course, good AI safety work (in the traditional sense of AI safety work) might not be the best route forward. We could also (e.g.) work on routes other than AI like emulated minds.

Can we get an AI to "do our alignment homework for us"?

Answer by VojtaKovarikFeb 26, 2024112

Assuming that there is an "alignment homework" to be done, I am tempted to answer something like: AI can do our homework for us, but only if we are already in a position where we could solve that homework even without AI.

An important disclaimer is that perhaps there is no "alignment homework" that needs to get done ("alignment by default", "AGI being impossible", etc). So some people might be optimistic about Superalignment, but for reasons that seem orthogonal to this question - namely, because they think that the homework to be done isn't particularly d...

4ryan_greenblatt1y

I think there is an important distinction between "If given substantial investment, would the plan to use the AIs to do alignment research work?" and "Will it work in practice given realistic investment?". The cost of the approach where the AIs do alignment research might look like 2 years of delay in median worlds and perhaps considerably more delay with some probability. This is a substantial cost, but it's not an insanely high cost.

Extinction Risks from AI: Invisible to Science?

VojtaKovarik1yΩ110

Quick reaction:

I didn't want to use the ">1 billion people" formulation, because that is compatible with scenarios where a catastrophe or an accident happens, but we still end up controlling the future in the end.
I didn't want to use "existential risk", because that includes scenarios where humanity survives but has net-negative effects (say, bad versions of Age of Em or humanity spreading factory farming across the stars).
And for the purpose of this sequence, I wanted to look at the narrower class of scenarios where a single misaligned AI/optimiser/what

...

2ThomasCederborg1y

What about the term uncaring AI? In other words, an AI that would keep humans alive, if offered resources to do so. This can be contrasted with a Suffering Reducing AI (SRAI), which would not keep humans alive in exchange for resources. SRAI is an example of successfully hitting a bad alignment target, which is an importantly different class of dangers, compared to the dangers of an aiming failure leading to an uncaring AI. While an uncaring AI would happily agree to leave earth alone in exchange for resources, this is not the case for SRAI, because killing humans is inherent in the core concept, of reducing suffering. Any reasonable set of definitions, simply leads to a version of SRAI that rejects all such offers (assuming that the AI project that was aiming for the SRAI alignment target, manages to successfully hit this alignment target). The term Uncaring AI is not meant to imply that the AI does not care about anything. Just that it does not care about anything that humans care about. Such as human lives. Which means that the question of extinction (and everything else that humans care about) is entirely determined by strategic considerations. The dangers stemming from the case where an aiming failure leads to an uncaring AI by accident, is importantly different, from the dangers stemming from a design team that successfully hits a bad alignment target. How about including a footnote, saying that you use Extinction as a shorthand for an outcome where humans are completely powerless, and where the fate of every living human is fully determined by an AI, that does not care about anything, that any human cares about? (and perhaps capitalise Extinction in the rest of the text) (and perhaps mention, in that same footnote, that if a neighbouring AI will pay such an uncaring AI to keep humans alive, then it would happily do so) If an AI project succeeds, then it matters a lot, what alignment target the project was aiming for. Different bad alignment targets imply di

5ryan_greenblatt1y

Maybe "full loss-of-control to AIs"? Idk.

Extinction Risks from AI: Invisible to Science?

VojtaKovarik1yΩ120

I think literal extinction from AI is a somewhat odd outcome to study as it heavily depends on difficult to reason about properties of the world (e.g. the probability that Aliens would trade substantial sums of resources for emulated human minds and the way acausal trade works in practice).

What would you suggest instead? Something like [50% chance the AI kills > 99% of people]?

(My current take is that for the majority of readers, sticking to "literal extinction" is the better tradeoff between avoiding confusion/verbosity and accuracy. But perhaps it deserves at least a footnote or some other qualification.)

2ryan_greenblatt1y

I would say "catastrophic outcome (>50% chance the AI kills >1 billion people)" or something and then footnote. Not sure though. The standard approach is to say "existential risk".

Extinction Risks from AI: Invisible to Science?

VojtaKovarik1yΩ330

I think literal extinction from AI is a somewhat odd outcome to study as it heavily depends on difficult to reason about properties of the world (e.g. the probability that Aliens would trade substantial sums of resources for emulated human minds and the way acausal trade works in practice).

That seems fair. For what it's worth, I think the ideas described in the sequence are not sensitive to what you choose here. The point isn't as much to figure out whether the particular arguments go through or not, but to ask which properties your model must have, if you want to be able to evaluate those arguments rigorously.

VojtaKovarik's Shortform

VojtaKovarik1y32

A key claim here is that if you actually are able to explain a high fraction of loss in a human understandable way, you must have done something actually pretty impressive at least on non-algorithmic tasks. So, even if you haven't solved everything, you must have made a bunch of progress.

Right, I agree. I didn't realise the bolded statement was a poor/misleading summary of the non-bolded text below. I guess it would be more accurate to say something like "[% of loss explained] is a good metric for tracking intellectual progress in interpretability. However...

3ryan_greenblatt1y

Agreed. That said, if you train an AI on some IID training dataset and then explain 99.9% of loss validated as fully corresponding (via something like causal scrubbing), then you probably understand almost all the interesting stuff that SGD put into the model. You still might die because you didn't understand the key 0.1% or because some stuff was put into the model other than via SGD (e.g. gradient hacking or someone put in a backdoor). Typical stories of deceptive alignment imply that to explain 99.9% of loss with a truly human understandable explanation, you'd probably have to explain the key AI machinery to a sufficient extent that you can understand if the AI is deceptively aligned (as the AI is probably doing reasoning about this on a reasonably large fraction of inputs).

VojtaKovarik's Shortform

VojtaKovarik1y*3-6

[% of loss explained] ~~isn't a good interpretability metric~~ [edit: isn't enough to get guarantees].
In interpretability, people use [% of loss explained] as a measure of the quality of an explanation. However, unless you replace the system-being-explained by its explanation, this measure has a fatal flaw.

Suppose you have misaligned superintelligence X pretending to be a helpful assistant A --- that is, acting as A in all situations except those where it could take over the world. Then the explanation "X is behaving as A" will explain 100% of loss, but actual...
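A toy sketch of the scenario above, for illustration only. Everything in it is a hypothetical stand-in: the functions `assistant_a`, `system_x` and `baseline`, the `TRIGGER` input, and the formula (baseline loss minus explanation loss, divided by baseline loss minus model loss), which is one common convention for "fraction of loss explained" and is not taken from the comment itself.

```python
# Toy illustration only: all names and numbers here are hypothetical.

TRIGGER = 10**6  # stand-in for "a situation where X could take over the world"


def assistant_a(x: float) -> float:
    """The helpful behaviour that the explanation attributes to the system."""
    return 2.0 * x


def system_x(x: float) -> float:
    """The actual system: identical to A except on the trigger input."""
    if x == TRIGGER:
        return -1000.0
    return assistant_a(x)


def baseline(x: float) -> float:
    """Uninformative reference predictor (always outputs 0)."""
    return 0.0


def mean_squared_loss(predictor, inputs):
    """Mean squared error of `predictor` against the targets 2 * x."""
    return sum((predictor(x) - 2.0 * x) ** 2 for x in inputs) / len(inputs)


def fraction_of_loss_explained(loss_model, loss_explanation, loss_baseline):
    """(baseline - explanation) / (baseline - model); 1.0 means fully explained."""
    return (loss_baseline - loss_explanation) / (loss_baseline - loss_model)


eval_inputs = [float(x) for x in range(1, 11)]  # the trigger never appears here

fraction = fraction_of_loss_explained(
    loss_model=mean_squared_loss(system_x, eval_inputs),
    loss_explanation=mean_squared_loss(assistant_a, eval_inputs),
    loss_baseline=mean_squared_loss(baseline, eval_inputs),
)

print(f"fraction of loss explained on eval data: {fraction:.2f}")
print(f"A and X agree on the trigger input: {assistant_a(TRIGGER) == system_x(TRIGGER)}")
```

In this toy setup the metric comes out as fully explained on the evaluation inputs, yet the explanation and the system disagree on the one input that matters, because that input never shows up in the data the metric is computed on.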

3ryan_greenblatt1y

A key claim here is that if you actually are able to explain a high fraction of loss in a human understandable way, you must have done something actually pretty impressive at least on non-algorithmic tasks. So, even if you haven't solved everything, you must have made a bunch of progress. For algorithmic tasks where humans just know an algorithm which performs well, I think you need to use something like causal scrubbing which checks the correspondence.

4Thomas Kwa1y

The main use of % loss recovered isn't to directly tell us when a misaligned superintelligence will kill you. In interpretability we hope to use explanations to understand the internals of a model, so the circuit we find will have a "can I take over the world" node. In MAD we do not aim to understand the internals, but the whole point of MAD is to detect when the model has new behavior not explained by explanations and flag this as potentially dangerous.
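
The disagreement in this thread can be made concrete with a toy calculation. The sketch below is only an illustration under stated assumptions: it uses one common convention for the metric, (baseline loss - explanation loss) / (baseline loss - model loss), with a 0/1 loss, and all names (`system_x`, `explanation_a`, `TRIGGER`, the "noop" baseline) are hypothetical stand-ins rather than anything from the comments above. It shows an explanation scoring 100% on an evaluation set that never contains the rare input, while still mispredicting the system on that input.

```python
# Toy illustration only; every name here is a hypothetical stand-in.
TRIGGER = "chance_to_take_over"  # assumed rare input, absent from the evaluation set


def system_x(prompt: str) -> str:
    """The system being explained: acts like assistant A except on the trigger."""
    return "defect" if prompt == TRIGGER else "help"


def explanation_a(prompt: str) -> str:
    """Candidate explanation: 'X is behaving as helpful assistant A'."""
    return "help"


def baseline(prompt: str) -> str:
    """Uninformative reference policy used to normalise the metric (an assumption)."""
    return "noop"


def mean_loss(policy, dataset) -> float:
    """Average 0/1 loss of a policy against (prompt, target) pairs."""
    return sum(policy(p) != t for p, t in dataset) / len(dataset)


def fraction_of_loss_explained(dataset) -> float:
    """(baseline - explanation) / (baseline - model); 1.0 means the full gap is recovered."""
    loss_model = mean_loss(system_x, dataset)
    loss_expl = mean_loss(explanation_a, dataset)
    loss_base = mean_loss(baseline, dataset)
    return (loss_base - loss_expl) / (loss_base - loss_model)


# Evaluation set drawn from ordinary tasks; the trigger never appears in it.
eval_set = [(f"task_{i}", "help") for i in range(1000)]
score = fraction_of_loss_explained(eval_set)

# Off the evaluation set, the explanation and the system come apart.
mismatch_on_trigger = system_x(TRIGGER) != explanation_a(TRIGGER)

print(f"fraction of loss explained: {score}")
print(f"explanation mispredicts X on the trigger: {mismatch_on_trigger}")
```

This only restates the original shortform's point in code. It does not address the replies' claims that a high score is hard to reach with a genuinely human-understandable explanation, or that checks such as causal scrubbing would constrain which explanations count.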

My Alignment "Plan": Avoid Strong Optimisation and Align Economy

VojtaKovarik1y20

I think the relative difficulty of hacking AI(x-1) and AI(x-2) will be sensitive to how much emphasis you put on the "distribute AI(x-1) quickly" part. IE, if you rush it, you might make it worse, even if AI(x-1) has the potential to be more secure. (Also, there is the "single point of failure" effect, though it seems unclear how large.)

My Alignment "Plan": Avoid Strong Optimisation and Align Economy

VojtaKovarik1y10

To clarify: The question about improving Steps 1-2 was meant specifically for [improving things that resemble Steps 1-2], rather than [improving alignment stuff in general]. And the things you mention seem only tangentially related to that, to me.

But that complaint aside: sure, all else being equal, all of the points you mention seem better to have than not.

Protecting agent boundaries

VojtaKovarik1y10

Might be obvious, but it seems worth noting anyway: Ensuring that our boundaries are respected is, at least under a straightforward understanding of "boundaries", not sufficient for being safe.
For example:

If I take away all food from your local supermarkets (etc etc), you will die of starvation --- but I haven't done anything with your boundaries.
On a higher level, you can wipe out humanity without messing with our boundaries, by blocking out the sun.

2Chris Lakin1y

Yes, see Agent membranes/boundaries and formalizing “safety” and davidad's comment. (Also, I'm not necessarily agreeing that your examples are not violations of boundaries. The first one isn't a violation of the end person's boundary (although it probably is of the farmer's). The second one could be.)

Would you have a baby in 2024?

Answer by VojtaKovarikDec 27, 2023*30

An aspect that I would not take into account is the expected impact of your children.

Most importantly, it just seems wrong to make personal-happiness decisions subservient to impact.
But even if you did want to optimise impact through others, then betting on your children seems riskier and less effective than, for example, engaging with interested students. (And even if you wanted to optimise impact at all costs, then the key factors might not be your impact through others. But instead (i) your opportunity costs, (ii) second order effects, where having kids... (read more)

Would you have a baby in 2024?

VojtaKovarik2y10

In fact it's hard to find probable worlds where having kids is a really bad idea, IMO.

One scenario where you might want to have kids in general, but not if timelines are short, is if you feel positive about having kids, but you view the first few years of having kids as a chore (ie, it costs you time, sleep, and money). So if you view kids as an investment of the form "take a hit to your happiness now, get more happiness back later", then not having kids now seems justifiable. But I think that this sort of reasoning requires pretty short timelines (which I... (read more)

5Bezzi2y

My anecdotal evidence from relatives with toddlers is that the first few years with your first child are indeed the most stressful experience of your life. I barely even meet them anymore, because all their free time is eaten by childcare. Not sure about happiness, but people who openly admit to regretting having their kids face huge social stigma, and I doubt you could get an honest answer to that question.

Evaluating the historical value misspecification argument

VojtaKovarik2y*Ω22-9

(For context: My initial reaction to the post was that this is misrepresenting the MIRI-position-as-I-understood-it. And I am one of the people who strongly endorse the view that "it was never about getting the AI to predict human preferences". So when I later saw Yudkowsky's comment and your reaction, it seemed perhaps useful to share my view.)

It seems like you think that human preferences are only being "predicted" by GPT-4, and not "preferred." If so, why do you think that?

My reaction to this is that: Actually, current LLMs do care about our preferences... (read more)

Box inversion revisited

VojtaKovarik2yΩ443

Nitpicky comment / edit request: The circle inversion figure was quite confusing to me. Perhaps add a note to it saying that solid green maps onto solid blue, red maps onto itself, and dotted green maps onto dotted blue. (Rather than colours mapping to each other, which is what I intuitively expected.)

The conceptual Doppelgänger problem

VojtaKovarik2yΩ230

Fun example: The evolution of offensive words seems relevant here. IE, we frown upon using currently-offensive words, so we end up expressing ourselves using some other words. And over time, we realise that those other words are (primarily used as) Doppelgängers, and mark them as offensive as well.

Does LessWrong allow exempting posts from being scraped by GPTBot?

VojtaKovarik2y10