The evidence you present in each case is outputs generated by LLMs.
The total evidence I have (and that everyone has) is more than behavioral. It includes
a) the transformer architecture, in particular the attention module,
b) the training corpus of human writing,
c) the means of execution (recursive calling upon its own outputs and history of QKV vector representations of outputs),
d) as you say, the model's behavior, and
e) "artificial neuroscience" experiments on the model's activation patterns and weights, like mech interp research.
When I think about how...
A patient can hire us to collect their medical records into one place, to research a health question for them, and to help them prep for a doctor's appointment with good questions about the research. Then we do that, building and using our AI tool chain as we go, without training AI on sensitive patient data. Then the patient can delete their data from our systems if they want, or re-engage us for further research or other advocacy on their behalf.
A good comparison is the company Picnic Health, except instead of specifically matching patients with clinical trials, we do more general research and advocacy for them.
Do you have a mostly disjoint view of AI capabilities between the "extinction from loss of control" scenarios and "extinction by industrial dehumanization" scenarios?
a) If we go extinct from a loss of control event, I count that as extinction from a loss of control event, accounting for the 35% probability mentioned in the post.
b) If we don't have a loss of control event but still go extinct from industrial dehumanization, I count that as extinction caused by industrial dehumanization caused by successionism, accounting for the additional 50% probabilit...
I very much agree with human flourishing as the main value I most want AI technologies to pursue and be used to pursue.
In that framing, my key claim is that in practice no area of purely technical AI research — including "safety" and/or "alignment" research — can be adequately checked for whether it will help or hinder human flourishing, without a social model of how the resulting technologies will be used by individuals / businesses / governments / etc.
I may be missing context here, but as written / taken at face value, I strongly agree with the above comment from Richard. I often disagree with Richard about alignment and its role in the future of AI, but this comment is an extremely dense list of things I agree with regarding rationalist epistemic culture.
I'd love to read an elaboration of your perspective on this, with concrete examples, which avoids focusing on the usual things you disagree about (pivotal acts vs. pivotal processes, whether social facets of the game are important to track, etc.) and focuses mainly on your thoughts on epistemology and rationality and how they deviate from what you consider the LW norm.
I'm afraid I'm sceptical that your methodology licenses the conclusions you draw.
Thanks for raising this. It's one of the reasons I spelled out my methodology, to the extent that I had one. You're right that, as I said, my methodology explicitly asks people to pay attention to the internal structure of what they were experiencing in themselves and calling consciousness, and to describe it on a process level. Personally I'm confident that whatever people are managing to refer to by "consciousness" is a process that runs on matter. If ...
Thanks for the response.
Personally I'm confident that whatever people are managing to refer to by "consciousness" is a process that runs on matter
I don't disagree that consciousness is a process that runs on matter, but that is a separate question from whether the typical referent of consciousness is that process. If it turned out my consciousness was being implemented on a bunch of grapes it wouldn't change what I am referring to when I speak of my own consciousness. The referents are the experiences themselves from a first-person perspective.
...I asked peop
The “hard problem of consciousness” is the problem of resolving a linguistic dispute disguised as an ontological one, where people agree on the normative properties of consciousness (it’s valuable) but disagree about its descriptive properties (its nature as a process/pattern).
That's just another conflation, of an easy problem with the hard one: yes, there is disagreement about what mental processes are valuable, but there is also an ontological problem, and not everyone agrees that ontological consciousness is intrinsically valuable.
I totally agree with the potential for confusion here!
My read is that the LessWrong community has too low of a prior on social norms being about membranes (e.g., when, how, and how not to cross various socially constructed information membranes). Using the term "boundaries" raises the prior on the hypothesis "social norms are often about boundaries", which I endorse and was intentional on my part, specifically for the benefit of LessWrong readership base (especially the EA community) who seemed to pay too little attention to the importance of <<boun...
Nice catch! Now replaced by 'deliberate'.
Thanks for sharing this! Because of strong memetic selection pressures, I was worried I might be literally the only person posting on this platform with that opinion.
FWIW I think you needn't update too hard on signatories absent from the FLI open letter (but update positively on people who did sign). Statements about AI risk are notoriously hard to agree on for a mix of political reasons. I do expect lab leads to eventually find a way of expressing more concerns about risks in light of recent tech, at least before the end of this year. Please feel free to call me "wrong" about this at the end of 2023 if things don't turn out that way.
Do you have a success story for how humanity can avoid this outcome? For example what set of technical and/or social problems do you think need to be solved? (I skimmed some of your past posts and didn't find an obvious place where you talked about this.)
I do not, but thanks for asking. To give a best efforts response nonetheless:
David Dalrymple's Open Agency Architecture is probably the best I've seen in terms of a comprehensive statement of what's needed technically, but it would need to be combined with global regulations limiting compute expendit...
In a previous comment you talked about the importance of "the problem of solving the bargaining/cooperation/mutual-governance problem that AI-enhanced companies (and/or countries) will be facing". I wonder if you've written more about this problem anywhere, and why you didn't mention it again in the comment that I'm replying to.
My own thinking about 'the ~50% extinction probability I’m expecting from multi-polar interaction-level effects coming some years after we get individually “safe” AGI systems up and running' is that if we've got "safe" AGIs, we coul...
That is, norms do seem feasible to figure out, but not the kind of thing that is relevant right now, unfortunately.
From the OP:
for most real-world-prevalent perspectives on AI alignment, safety, and existential safety, acausal considerations are not particularly dominant [...]. In particular, I do not think acausal normalcy provides a solution to existential safety, nor does it undermine the importance of existential safety in some surprising way.
I.e., I agree.
...we are so unprepared that the existing primordial norms are unlikely to matter
For 18 examples, just think of 3 common everyday norms having to do with each of the 6 boundaries given as example images in the post :) (I.e., cell membranes, skin, fences, social group boundaries, internet firewalls, and national borders). Each norm has the property that, when you reflect on it, it's easy to imagine a lot of other people also reflecting on the same norm, because of the salience of the non-subjectively-defined actual-boundary-thing that the norm is about. That creates more of a Schelling-nature for that norm, relative to...
To your first question, I'm not sure which particular "the reason" would be most helpful to convey. (To contrast: what's "the reason" that physically dispersed human societies have laws? Answer: there's a confluence of reasons.) However, I'll try to point out some things that might be helpful to attend to.
First, committing to a policy that merges your utility function with someone else's is quite a vulnerable maneuver, with a lot of boundary-setting aspects. For instance, will you merge utility functions multiplicatively (as in Nas...
This is cool and (fwiw, to other readers) correct. I must reflect on what it means for real world cooperation... I especially like the A <-> []X -> [][]X <-> []A trick.
I'm working on it :) At this point what I think is true is the following:
If ShortProof(x ↔ LongProof(ShortProof(x) → x)), then MediumProof(x).
Apologies that I haven't written out calculations very precisely yet, but since you asked, that's roughly where I'm at :)
Actually the interpretation of □_E as its own proof system only requires the other systems to be finite extensions of PA, but I should mention that requirement! Nonetheless even if they're not finite, everything still works because □_E still satisfies necessitation, distributivity, and existence of modal fixed points.
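For readers following along: those three properties, together with internal necessitation (□A → □□A, which PA-like systems also satisfy), are exactly what the standard derivation of Löb's theorem consumes. A sketch, writing □ for the provability operator:

```latex
% Hypothesis: \vdash \Box A \to A.
% By existence of modal fixed points, choose \Psi with
%   \vdash \Psi \leftrightarrow (\Box\Psi \to A).
\begin{align*}
1.\;& \vdash \Psi \to (\Box\Psi \to A)                && \text{fixed point}\\
2.\;& \vdash \Box(\Psi \to (\Box\Psi \to A))          && \text{necessitation, 1}\\
3.\;& \vdash \Box\Psi \to \Box(\Box\Psi \to A)        && \text{distributivity, 2}\\
4.\;& \vdash \Box\Psi \to (\Box\Box\Psi \to \Box A)   && \text{distributivity, 3}\\
5.\;& \vdash \Box\Psi \to \Box\Box\Psi                && \text{internal necessitation}\\
6.\;& \vdash \Box\Psi \to \Box A                      && \text{4, 5}\\
7.\;& \vdash \Box\Psi \to A                           && \text{6, hypothesis}\\
8.\;& \vdash \Psi                                     && \text{7, fixed point}\\
9.\;& \vdash \Box\Psi                                 && \text{necessitation, 8}\\
10.\;& \vdash A                                       && \text{7, 9}
\end{align*}
```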
Thanks for bringing this up.
Based on a potential misreading of this post, I added the following caveat today:
Important Caveat: Arguments in natural language are basically never "theorems". The main reason is that human thinking isn't perfectly rational in virtually any precisely defined sense, so sometimes the hypotheses of an argument can hold while its conclusion remains unconvincing. Thus, the Löbian argument pattern of this post does not constitute a "theorem" about real-world humans: even when the hypotheses of the argument hold, the argument will not always play out...
Thanks! Added a note to the OP explaining that hereby means "by this utterance".
Hat tip to Ben Pace for pointing out that invitations are often self-referential, such as when people say "You are hereby invited", because "hereby" means "by this utterance":
https://www.lesswrong.com/posts/rrpnEDpLPxsmmsLzs/open-technical-problem-a-quinean-proof-of-loeb-s-theorem-for?commentId=CFvfaWGzJjnMP8FCa
That comment was like 25% of my inspiration for this post :)
I've now fleshed out the notation section to elaborate on this a bit. Is it better now?
...In short, ⊢ is our symbol for talking about what PA can prove, and □ is shorthand for PA's symbols for talking about what (a copy of) PA can prove.
- "⊢ 1+1=2" means "Peano Arithmetic (PA) can prove that 1+1=2". No parentheses are needed; the "⊢" applies to the whole line that follows it. Also, ⊢ does not stand for an expression in PA; it's a symbol we use to talk about what PA can prove.
- "□(1+1=2)" basically means the sam
Well, the deduction theorem is a fact about PA (and propositional logic), so it's okay to use it as long as ⊢ means "PA can prove".
But you're right that it doesn't mix seamlessly with the (outer) necessitation rule. Necessitation is a property of "⊢", but not generally a property of "X ⊢". When PA can prove something, it can prove that it can prove it. By contrast, if PA+X can prove Y, that does mean that PA can prove that PA+X can prove Y (because PA alone can work through proofs in a Gödel encoding), but it doesn't mean that PA+...
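To make the failure concrete (my illustration, not part of the original exchange), take X = Con(PA), with □ meaning provability in PA alone:

```latex
% Trivially, assuming Con(PA) lets us derive Con(PA):
%   PA + Con(PA) \vdash Con(PA),
% but necessitation for this relation would require
%   PA + Con(PA) \vdash \Box\,Con(PA),
% i.e., by the deduction theorem, PA \vdash Con(PA) \to \Box\,Con(PA).
% The formalized second incompleteness theorem gives
%   PA \vdash Con(PA) \to \neg\Box\,Con(PA),
% so combining the two would yield PA \vdash \neg Con(PA),
% contradicting the soundness of PA.
```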
Well, "A → B" is just short for "¬A ∨ B", i.e., "(not A) or B". By contrast, "A ⊢ B" means that there exists a sequence of (very mechanical) applications of modus ponens, starting from the axioms of Peano Arithmetic (PA) with A appended, ending in B. We tried hard to make the rules of ⊢ so that it would agree with → in a lot of cases (i.e., we tried to design ⊢ to make the deduction theorem true), but it took a lot of work in the design of Peano Arithmetic and can't be taken for gr...
It's true that the deduction theorem is not needed, as in the Wikipedia proof. I just like using the deduction theorem because I find it intuitive (assume A, prove B, then drop the assumption and conclude A → B) and it removes the need for lots of parentheses everywhere.
I'll add a note about the meaning of "⊢" so folks don't need to look it up, thanks for the feedback!
It did not get across! Interesting. Procedurally I still object to calling people's arguments "crazy", but selfishly I guess I'm glad they were not my arguments? At a meta level though I'm still concerned that LessWrong culture is too quick to write off views as "crazy". Even the "coordination is delusional"-type views that Katja highlights in her post do not seem "crazy" to me, more like misguided or scarred or something, in a way that warrants a closer look but not being called "crazy".
Oliver, see also this comment; I tried to @ you on it, but I don't think LessWrong has that functionality?
Separately from my other reply explaining that you are not the source of what I'm complaining about here, I thought I'd add more color to explain why I think my assessment here is not "hyperbolic". Specifically, regarding your claim that reducing AI x-risk through coordination is "not only fine to suggest, but completely uncontroversial accepted wisdom", please see the OP. Perhaps you have not witnessed such conversations yourself, but I have been party to many of these:
...Some people: AI might kill everyone. We should design a godlike super-AI of
Thanks, Oliver. The biggest update for me here — which made your entire comment worth reading, for me — was that you said this:
I also think it's really not true that coordination has been "fraught to even suggest".
I'm surprised that you think that, but have updated on your statement at face value that you in fact do. By contrast, my experience around a bunch common acquaintances of ours has been much the same as Katja's, like this:
...Some people: AI might kill everyone. We should design a godlike super-AI of perfect goodness to prevent that.
Others
This makes sense to me if you feel my comment is meant as a description of you or people-like-you. It is not, and quite the opposite. As I see it, you are not a representative member of the LessWrong community, or at least, not a representative source of the problem I'm trying to point at. For one thing, you are willing to work for OpenAI, which many (dozens of) LessWrong-adjacent people I've personally met would consider a betrayal of allegiance to "the community". Needless to say, the field of AI governance as it exists is not unc...
It would help if you specified which subset of "the community" you're arguing against. I had a similar reaction to your comment as Daniel did, since in my circles (AI safety researchers in Berkeley), governance tends to be well-respected, and I'd be shocked to encounter the sentiment that working for OpenAI is a "betrayal of allegiance to 'the community'".
Katja, many thanks for writing this, and Oliver, thanks for this comment pointing out that everyday people are in fact worried about AI x-risk. Since around 2017 when I left MIRI to rejoin academia, I have been trying continually to point out that everyday people are able to easily understand the case for AI x-risk, and that it's incorrect to assume the existence of AI x-risk can only be understood by a very small and select group of people. My arguments have often been basically the same as yours here: in my case, informal conversations with U...
Critch, I agree it’s easy for most people to understand the case for AI being risky. I think the core argument for concern—that it seems plausibly unsafe to build something far smarter than us—is simple and intuitive, and personally, that simple argument in fact motivates a plurality of my concern. That said:
I think it's uncharitable to psychoanalyze why people upvoted John's comment; his object-level point about GoF seems good and merits an upvote IMO. Really, I don't know what to make of GoF. It's not just that governments have failed to ban it, they haven't even stopped funding it, or in the USA case they stopped funding it and then restarted I think. My mental models can't explain that. Anyone on the street can immediately understand why GoF is dangerous. GoF is a threat to politicians and national security. GoF has no upsides that stand up to scrutiny, an...
The question feels leading enough that I don't really know how to respond. Many of these sentences sound pretty crazy to me, so I feel like I primarily want to express frustration and confusion that you assign those sentences to me or "most of the LessWrong community".
...However, for some reason, the idea that people outside the LessWrong community might recognize the existence of AI x-risk — and therefore be worth coordinating with on the issue — has felt not only poorly received on LessWrong, but also fraught to even suggest. For instance, I tried to poi
However, for some reason, the idea that people outside the LessWrong community might recognize the existence of AI x-risk — and therefore be worth coordinating with on the issue — has felt not only poorly received on LessWrong, but also fraught to even suggest.
I object to this hyperbolic and unfair accusation. The entire AI Governance field is founded on this idea; this idea is not only fine to suggest, but completely uncontroversial accepted wisdom. That is, if by "this idea" you really mean literally what you said -- "people outside the LW community migh...
That particular statement was very poorly received, with a 139-karma retort from John Wentworth arguing,
What exactly is the model by which some AI organization demonstrating AI capabilities will lead to world governments jointly preventing scary AI from being built, in a world which does not actually ban gain-of-function research?
I’m not sure what’s going on here
So, wait, what’s actually the answer to this question? I read that entire comment thread and didn’t find one. The question seems to me to be a good one!
I agree this is a big factor, and it might be the main pathway through which people end up believing what they believe. If I had to guess, I'd guess you're right.
E.g., if there's evidence E in favor of H and evidence E' against H, and the group is really into thinking about and talking about E as a topic, then the group will probably end up believing H too much.
I think it would be great if you or someone wrote a post about this (or whatever you meant by your comment) and pointed to some examples. I think the LessWrong community is somewhat plagued by attentional bias leading to collective epistemic blind spots. (Not necessarily more than other communities; just different blind spots.)
Ah, thanks for the correction! I've removed that statement about "integrity for consequentialists" now.
This piece of news is the most depressing thing I've seen in AI since... I don't know, ever? It's not like the algorithms for doing this weren't lying around already. The depressing thing for me is that it was promoted as something to be proud of, with no regard for the framing implication that cooperative discourse exists primarily in service of forming alliances to exterminate enemies.
I've searched my memory for the past day or so, and I just wanted to confirm that the "ever" part of my previous message was not a hot take or exaggeration.
I'm not sure what to do about this. I am mulling.
Thanks for raising this! I assume you're talking about this part?
They explore a pretty interesting set-up, but they don't avoid the narrowly-self-referential sentence Ψ:
So, I don't think their motivation was the same as mine. For me, the point of trying to use a quine is to try to get away from that sentence, to create a different perspective on the foundations for people that find that kind of sentence confusing, but who find self-referential documents less confusing. I added a section "Further meta-motivation (added Nov 26)" about this ...
At this point I'm more interested in hashing out approaches that might actually conform to the motivation in the OP. Perhaps I'll come back to this discussion with you after I've spent a lot more time in a mode of searching for a positive result that fits with my motivation here. Meanwhile, thanks for thinking this over for a bit.
True! "Hereby" covers a solid contingent of self-referential sentences. I wonder if there's a "hereby" construction that would make the self-referential sentence Ψ (from the Wikipedia proof) more common-sense-meaningful to, say, lawyers.
this suggests that you're going to be hard-pressed to do any self-reference without routing through the normal machinery of löb's theorem, in the same way that it's hard to do recursion in the lambda calculus without routing through the Y combinator
If by "the normal machinery", you mean a clever application of the diagonal lemma, then I agree. But I think we can get away with not having the self-referential sentence, by using the same y-combinator-like diagonal-lemma machinery to make a proof that refers to itself (instead of a proof about sentences t...
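As a side illustration (mine, not from the thread): the y-combinator-like diagonal machinery is the same trick that makes quines work, and it fits in a few lines of Python. The variable name `template` is arbitrary:

```python
# A minimal quine via the "diagonal" trick: apply a template to its own
# quoted text, analogous to the diagonal lemma's construction of a
# self-referential sentence. %r plays the role of quotation (Godel
# numbering); %% is a literal percent sign.
template = 'template = %r\nprint(template %% template)'
print(template % template)  # prints this program's two code lines
```

Applying the template to its own quotation is the diagonal step; the printed output is exactly the two code lines (without the comments).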
This sentence is an exception, but there aren't a lot of naturally occurring examples.
No strong claim either way, but as a datapoint I do somewhat often use the phrase "I hereby invite you to <event>" or "I hereby <request> something of you" to help move from 'describing the world' to 'issuing an invitation/command/etc'.
Thanks for your attention to this! The happy face is the outer box. So, line 3 of the cartoon proof is assumption 3.
If you want the full []([]C->C) to be inside a thought bubble, then just take every line of the cartoon and put it into a thought bubble, and I think that will do what you want.
LMK if this doesn't make sense; given the time you've spent thinking about this, you're probably my #1 target audience member for making the more intuitive proof (assuming it's possible, which I think it is).
ETA: You might have been asking if th...
Yes to both of you on these points:
(I'll write a separate comment on Eliezer's original question.)
That thing is hilarious and good! Thanks for sharing it. As for the relevance, it explains the statement of Gödel's theorem, but not the proof of it. So, it could be pretty straightforwardly reworked to explain the statement of Löb's theorem, but not so easily the proof of Löb's theorem. With this post, I'm in the business of trying to find a proof of Löb that's really intuitive/simple, rather than just a statement of it that's intuitive/simple.
Why is it unrealistic? Do you actually mean it's unrealistic that the set I've defined as "A" will be interpretable as "actions" in the usual coarse-grained sense? If so I think that's a topic for another post when I get into talking about the coarsened variables ...
Going further, my proposed convention also suggests that "Cartesian frames" should perhaps be renamed to "Cartesian factorizations", which I think is a more immediately interpretable name for what they are. Then in your equation W = A × E, you can refer to A and E as "Cartesian factors", satisfying your desire to treat A and E as interchangeable. And, you leave open the possibility that the factors are derivable from a "Cartesian partition" of the world into the "Cartesian parts" ...
Scott, thanks for writing this! While I very much agree with the distinctions being drawn, I think the word "boundary" should be usable for referring to factorizations that do not factor through the physical separation of the world into objects. In other words, I want the technical concept of «boundaries» that I'm developing to be able to refer to things like social boundaries, which are often not most-easily-expressed in the physics factorization of the world into particles (but are very often expressible as Markov blankets in a more abstract ...
Thanks, Scott!
I think the boundary factorization into active and passive is wrong.
Are you sure? The informal description I gave for A and P allow for the active boundary to be a bit passive and the passive boundary to be a bit active. From the post:
...the active boundary, A — the features or parts of the boundary primarily controlled by the viscera, interpretable as "actions" of the system— and the passive boundary, P — the features or parts of the boundary primarily controlled by the environment, interpretable as "perceptions" of the
Thanks, fixed!
Cool! This was very much in line with the kind of update I was aiming for here, cheers :)
Huh, weird. I read Eliezer's definition of meta-honesty as not the same thing as your definition of «honesty that is closed under reflection». Specifically, in Eliezer-meta-honesty, his honesty at the meta-level is stronger (i.e., zero tolerance for lies) than his honesty at the object level (some tolerance for lies), whereas your notion sounds like it has no such strengthening-as-you-go-up-in-meta pattern to it. Am I misunderstanding you?
Thanks Anna for posting this! I agree with your hypothesis, and would add that shaming humans for not being VNM agents is probably a contributor to AI risk because of the cultural example it sets / because of the self-fulfilling prophecy of how-intelligence-gets-used that it supports.