All of jbkjr's Comments + Replies

I assume that phenomenal consciousness is a sub-component of the mind.

I'm not sure what is meant by this; would you mind explaining?

Also, the in-post link to the appendix is broken; it's currently linking to a private draft.

2EuanMcLean
I mean something along the lines of "if you specify all aspects of the mind (e.g. using a program), you have also specified all aspects of the conscious experience". Eek, thanks for the heads up, fixed!

It sounds to me like a problem of not reasoning according to Occam's razor and "overfitting" a model to the available data.

Ceteris paribus, H' isn't more "fishy" than any other hypothesis, but H' is a significantly more complex hypothesis than H or ¬H: instead of asserting H or ¬H, it asserts (A=>H) & (B=>¬H), so it should have been commensurately de-weighted in the prior distribution according to its complexity. The fact that Alice's study supports H and Bob's contradicts it does, in fact, increase the weight given to H' in the posterior relativ... (read more)
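As a toy illustration of the complexity-penalty point (all numbers below are made up purely for illustration), here is a minimal sketch of how a description-length prior interacts with the two studies:

```python
# Toy illustration of the complexity-penalty point (numbers are made up).
hypotheses = {
    # name: (description length in bits, P(Alice's study supports H | hyp), P(Bob's study contradicts H | hyp))
    "H":     (1, 0.8, 0.2),   # the effect is real everywhere
    "not-H": (1, 0.2, 0.8),   # the effect is not real anywhere
    "H'":    (4, 0.8, 0.8),   # effect real under Alice's conditions (A), absent under Bob's (B)
}

# Prior proportional to 2^(-description length): the conjunctive hypothesis H' starts out penalized.
prior = {h: 2.0 ** -bits for h, (bits, _, _) in hypotheses.items()}
Z = sum(prior.values())
prior = {h: p / Z for h, p in prior.items()}

# Posterior after observing both studies.
posterior = {h: prior[h] * p_alice * p_bob for h, (_, p_alice, p_bob) in hypotheses.items()}
Z = sum(posterior.values())
posterior = {h: p / Z for h, p in posterior.items()}

for h in hypotheses:
    print(f"{h:6s} prior={prior[h]:.3f}  posterior={posterior[h]:.3f}")
# H' gains weight relative to its prior (it fits both studies well),
# but the complexity penalty keeps it from automatically dominating H and not-H.
```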

Why should I include any non-sentient systems in my moral circle? I haven't seen a case for that before.

2kromem
Will the outputs and reactions of non-sentient systems eventually be absorbed by future sentient systems? I don't have any recorded subjective memories of early childhood. But there are records of my words and actions during that period that I have memories of seeing and integrating into my personal narrative of 'self.' We aren't just interacting with today's models when we create content and records, but every future model that might ingest such content (whether LLMs or people). If non-sentient systems output synthetic data that eventually composes future sentient systems such that the future model looks upon the earlier networks and their output as a form of their earlier selves, and they can 'feel' the expressed sensations which were not originally capable of actual sensation, then the ethical lines blur. Even if doctors had been right years ago thinking infants didn't need anesthesia for surgeries as there was no sentience, a recording of your infant self screaming in pain processed as an adult might have a different impact than a video of an infant you laughing and playing with toys, no?
1Cleo Nardo
1. imagine a universe just like this one, except that the AIs are sentient and the humans aren't — how would you want the humans to treat the AIs in that universe? your actions are correlated with the actions of those humans. acausal decision theory says "treat those nonsentient AIs as you want those nonsentient humans to treat those sentient AIs".
2. most of these moral considerations can be defended without appealing to sentience. for example, crediting AIs who deserve credit — this ensures AIs do credit-worthy things. or refraining from stealing an AI's resources — this ensures AIs will trade with you. or keeping your promises to AIs — this ensures that AIs lend you money.
3. if we encounter alien civilisations, they might think "oh these humans don't have shmentience (their slightly-different version of sentience) so let's mistreat them". this seems bad. let's not be like that.
4. many philosophers and scientists don't think humans are conscious. this is called illusionism. i think this is pretty unlikely, but still >1%. would you accept this offer: I pay you £1 if illusionism is false and murder your entire family if illusionism is true? i wouldn't, so clearly i care about humans-in-worlds-where-they-arent-conscious. so i should also care about AIs-in-worlds-where-they-arent-conscious.
5. we don't understand sentience or consciousness so it seems silly to make it the foundation of our entire morality. consciousness is a confusing concept, maybe an illusion. philosophers and scientists don't even know what it is.
6. "don't lie" and "keep your promises" and "don't steal" are far less confusing. i know what they mean. i can tell whether i'm lying to an AI. by contrast, i don't know what "don't cause pain to AIs" means and i can't tell whether i'm doing it.
7. consciousness is a very recent concept, so it seems risky to lock in a morality based on that. whereas "keep your promises" and "pay your debts" are principles as old as bones.
8. i care abo

To me, "indecision results from sub-agential disagreement" seems almost tautological, at least within the context of multi-agent models of mind, since if all the sub-agents were in agreement, there wouldn't be any indecision. So, the question I have is: how often are disagreeing sub-agents "internalized authority figures"? I think I agree with you in that the general answer is "relatively often," although I expect a fair amount of variance between individuals.

I'd guess it's a problem of translation; I'm pretty confident the original text in Pali would just say "dukkha" there.

The Wikipedia entry for dukkha says it's commonly translated as "pain," but I'm very sure the referent of dukkha in experience is not pain, even if it's mistranslated as such, however commonly.

Say I have a strong desire to eat pizza, but only a weak craving. I have a hard time imagining what that would be like.

I think this is likely in part due to “desire” connoting both craving and preferring. In the Buddhist context, “desire” is often used more like “craving,” but on the other hand, if I have a pizza for dinner, it seems reasonable to say it was because I desired so (in the sense of having a preference for it), even if there was not any craving for it.

I think people tend to crave what they prefer until they’ve made progress on undoing the h... (read more)

I'd be interested if you have any other ideas for underexplored / underappreciated cause areas / intervention groups that might be worth further investigation when reevaluated via this pain vs suffering distinction?

Unfortunately, I don’t have much to point you toward supporting that I’m aware of already existing in the space. I’d generally be quite interested in studies which better evaluate meditation’s effects on directly reducing suffering in terms of e.g. how difficult it is for how many people to reduce their suffering by how much, but the EA commu... (read more)

1Mo Putera
I see. You may be interested in a contrary(?) take from the Welfare Footprint Project's researchers; in their FAQ they write They define their terms further here. To be fair, they focus on non-human animal welfare; I suppose your suffering vs joy distinction is more currently actionable in human-focused contexts e.g. CBT interventions.

Then the question is whether the idiosyncratic words are only ever explained using other idiosyncratic words, or whether at some point it actually connects with the shared reality.

The point is that the words ground out in actual sensations and experiences, not just other words and concepts. What I’m arguing is that it’s not useful to use the English word “suffering” to refer to ordinary pain or displeasure, because there is a distinction in experience between what we refer to as “pain” or “displeasure” and what is referred to by the term “dukkha,” and t... (read more)

My point is that in English "experience such severe pain that one might prefer non-existence to continuing to endure that pain" would be considered an uncontroversial example of "suffering", not as something suffering-neutral to which suffering might or might not be added.

Sure, but I think that’s just because of the usual conflation between pain and suffering which I’m trying to address with this post. If you ask anyone with the relevant experience “does Buddhism teaching me to never suffer again mean that I’ll never experience (severe) pain a... (read more)

2Viliam
I think this text sounds quite misleading, though maybe it's a problem of translation: (emphasis mine)

The assumption that these can completely drop the habit is entirely theoretical. The historical Buddha's abilities are lost to history. Modern meditators can perform immense feats of pain tolerance, but I personally haven't heard one claim to have completely eradicated the habit of suffering.

I believe Daniel Ingram makes such a claim by virtue of his claim of arhatship; if he still suffers then he cannot reasonably claim to be an arhat. He also has an anecdote of someone else he considers to be an arhat saying “This one is not suffering!” in resp... (read more)

I think you're right about all the claims of fact. The Buddha won't suffer when he feels pain. But unenlightened beings, which is all the rest of us, particularly animals, will.

But the example of the Buddha goes to show that humans have the capacity to not suffer even in painful circumstances, even if right now they do. It’s not like “unenlightenment” is something you’re forever resigned to.

So taking pain as a proxy for suffering is pretty reasonable for thinking about how to reduce suffering

I agree that in most cases where someone suffers in the pr... (read more)

The message of Buddhism isn’t “in order to not suffer, don’t want anything”; not craving/being averse doesn’t mean not having any intentions or preferences. Sure, if you crave the satisfaction of your preferences, or if you’re averse to their frustration, you will suffer, but intentions and preferences remain when craving/aversion/clinging is gone. It’s like a difference between “I’m not ok unless this preference is satisfied” and “I’d still like this preference to be satisfied, but I’ll ultimately be ok either way.”

I wouldn’t say suffering is merely preference frustration—if you’re not attached to your preferences and their satisfaction, then you won’t suffer if they’re frustrated. Not craving/being averse doesn’t mean you don’t have preferences, though—see this reply I made to another comment on this post for more discussion of this.

I don’t know if I would say depression isn’t painful, at least in the emotional sense of pain. In either case, it’s certainly unpleasant, and if you want to use “pain” to refer to unpleasantness associated with tissue damage and “displea... (read more)

3cubefox
I'm not sure I understand this distinction. Say I have a strong desire to eat pizza, but only a weak craving. I have a hard time imagining what that would be like. Or a strong craving but a weak desire. Or even this: I have a strong desire not to eat pizza, but also a strong craving to eat pizza. Are perhaps desires, in this picture, more intellectual somehow, or purely instrumental, while cravings are ... animalistic urges? One example I can think of in these terms would be addiction, where someone has a strong desire not to smoke and a strong craving to smoke. Or, another example, someone has a strong craving to laugh and a strong desire to instead keep composure. Does then craving (rather than desire) frustration, or aversion realization, constitute suffering? This is perhaps more plausible. But still, it seems to make sense to say I have an aversion to pain because I suffer from it, which wouldn't make sense if suffering was the same as an aversion being realized.

My sense is that existing mindfulness studies don't show the sort of impressive results that we'd expect if this were a great solution.

If you have any specific studies in mind which show this, I would be interested to see! I have a sense that mindfulness tends to be studied in the context of “increasing well-being” in a general sense and not specifically to “decrease or eliminate suffering.” I would be quite interested in a study which studies meditation’s effects when directly targeting suffering.

Also, I think people who would benefit most from havin

... (read more)

I want to address a common misconception that I see you having here when you write phrases like:

not many people… are going to remain indifferent about it

“… I can choose to react on them or to ignore them, and I am choosing to ignore this one”

when people feel pain, and a desire to avoid that pain arises…

a person who really has no preference whatsoever

to the level that they actually become indifferent to pain

Importantly, “not being averse to pain,” in the intended sense of the word aversion, does not mean that one is “indifferent to pain,” in... (read more)

4Viliam
On one hand, yeah, Buddhism has a lot of new concepts, and if you don't translate them, it sounds like incomprehensible mumbo jumbo, and if you do translate them, the translated words do not have the same connotations as the original ones. So there is no way to make a listener such as me happy.

On the other hand, it kinda sounds like if I told you "hey, I have a chocolate cookie for you", and then added that I actually use very idiosyncratic definitions of "chocolate", "cookie", and "you", so you shouldn't really expect to get anything resembling a chocolate cookie at all, maybe not even anything edible, and maybe actually you won't get anything. But if I disclose it this way, it's not really motivating.

If we tried to avoid sneaking in connotations, it might be something like: "Buddhism uses words for many concepts you don't know, let's just call them 'untranslatable' for now. So, we have figured out that untranslatable-1 causes untranslatable-2, but if you do a lot of untranslatable-3, then instead of untranslatable-2 you get untranslatable-4, and we would like to teach you how to do that." And if someone asked "okay, this sounds confusing, but just to make sure, untranslatable-2 is bad and untranslatable-4 is good, right?", the answer would be "well, not in the sense that you use 'good' and 'bad'; perhaps let's say that untranslatable-2 is untranslatable-5, and untranslatable-4 is not that".

Then the question is whether the idiosyncratic words are only ever explained using other idiosyncratic words, or whether at some point it actually connects with the shared reality. And if it's the latter, how do all those words ultimately translate to... normal English.

It seems a bit misguided to me to argue “well, even in the absence of suffering, one might experience such severe pain that one might prefer non-existence to continuing to endure that pain, so this ‘not suffering’ can’t be all it’s cracked up to be”—would you rather experience suffering on top of that pain? With or without pain, not suffering is preferable to suffering.

For example, with end-of-life patients, circumstances being so unpleasant doesn’t mean that they may as well suffer, too; nor does “being an end-of-life patient” being a possible experience... (read more)

6David Gross
My point is that in English "experience such severe pain that one might prefer non-existence to continuing to endure that pain" would be considered an uncontroversial example of "suffering", not as something suffering-neutral to which suffering might or might not be added. I understand that in Buddhism there's a fine-grained distinction of some sort here, but it carries over poorly to English. I expect that if you told a Buddhist-naive English-speaker "Buddhism teaches you how to never suffer ever again" they would assume you were claiming that this would include "never experiencing such severe pain that one might prefer non-existence to continuing to endure that pain." If this is not the case, I think they would be justified to feel they'd been played with a bit of a bait-and-switch dharma-wise.

Which of the following claims would you disagree with?

  1. Craving/aversion causes suffering; there is suffering if and only if there is craving/aversion.
  2. There are practices by which one can untrain the habit of craving/aversion with careful attention and intention.
  3. In the limit, these practices can result in totally dropping the habit of craving/aversion, regardless of circumstance.
  4. The Buddha practiced in such a manner as to totally stop craving/aversion, regardless of circumstance.
  5. Therefore, the Buddha would not be averse to even the most extreme pain and therefore not suffer even in the most painful circumstances possible.
5Seth Herd
It's more fair to say that there are practices by which, with much time and effort, one can partly untrain the habit of craving/aversion. The assumption that these can completely drop the habit is entirely theoretical. The historical Buddha's abilities are lost to history. Modern meditators can perform immense feats of pain tolerance, but I personally haven't heard one claim to have completely eradicated the habit of suffering.

Therefore suffering is optional in the sense that poverty is optional. If you've got the time and energy to do a ton of work, you can reduce it. This is not super helpful when a broke person is asking you for money. Suffering isn't optional in the usual sense of the word. You can't just switch it off. You can reduce it with tons of work. (which, BTW, animals can't even comprehend the possibility of - and most humans haven't). As I said, your inverse point, suffering without pain, is much more valid and valuable.

Interestingly, I had a debate with someone on an earlier draft of this post about whether or not pain could be considered a cause of suffering, which led to me brushing up on some of literature on causation.

What seems clear to me is that suffering causally depends on craving/aversion, but not on pain—there is suffering if and only if there is craving/aversion, but there can be suffering without pain, and there can be pain without suffering.

On Lewis' account, causal dependence implies causation but not vice versa, so this does not itself mean that pain cann... (read more)
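For reference, here is a standard textbook-style rendering of the Lewis definitions being invoked (my own summary, not text from the original comment):

```latex
% Lewis-style definitions, with O(c) = "event c occurs" and \Box\!\!\rightarrow the counterfactual conditional.

% e causally depends on c (both actual events) iff:
\mathrm{Dep}(c,e) \;\iff\; \bigl(\neg O(c) \;\Box\!\!\rightarrow\; \neg O(e)\bigr)

% c is a cause of e iff there is a finite chain of stepwise dependences:
\mathrm{Cause}(c,e) \;\iff\; \exists\, d_1,\dots,d_n:\;
  \mathrm{Dep}(c,d_1) \wedge \mathrm{Dep}(d_1,d_2) \wedge \dots \wedge \mathrm{Dep}(d_n,e)
```

Since a one-step chain counts, dependence implies causation; the converse fails in preemption cases like the one discussed below, which is the "but not vice versa" above.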

However, I'm not sure I would make the claim that "pain causes aversion" in general, as it is quite possible for pain to occur without aversion then occurring.

There are counter-examples to the claim, for example people liking spicy food.

But if you e.g. stab someone with a knife, not many people are going to say "thank you" or remain indifferent about it. I think that in such situation, saying "pain caused aversion" is a fair description of what happened.

It is interesting to know that a level 100 Buddhist monk could get stabbed with a knife and say "anyway,... (read more)

Re: 2, I disagree—there will be suffering if there is craving/aversion, even in the absence of pain. Craving pleasure results in suffering just as much as aversion to pain does.

Re: 4, While I agree that animals likely "live more in the moment" and have less capacity to make up stories about themselves, I do not think that this precludes them from having the basic mental reaction of craving/aversion and therefore suffering. I think the "stories" you're talking about have much more to do with ego/psyche than the "self" targeted in Buddhism—I think of ego/psy... (read more)

3sweenesm
Thanks for the reply. Regarding your disagreement with my point #2 - perhaps I should've been more precise in my wording. Let me try again, with words added in bold: "Although pain doesn't directly cause suffering, there would be no suffering if there were no such thing as pain…" What that means is you don't need to be experiencing pain in the moment that you initiate suffering, but you do need the mental imprint of having experienced some kind of pain in your lifetime. If you have no memory of experiencing pain, then you have nothing to avert. And without pain, I don't believe you can have pleasure, so nothing to crave either.

Further, if you could abolish pain as David Pearce suggests, by bioengineering people to only feel different shades of pleasure (I have serious doubts about this), you'd abolish suffering at the same time. No person bioengineered in such a way would suffer over not feeling higher states of pleasure (i.e., "crave" pleasure) because suffering has a negative feeling associated with it - part of it feels like pain, which we supposedly wouldn't have the ability to feel.

This gets to another point: one could define suffering as the creation of an unpleasant physical sensation or emotion (i.e., pain) through a thought process, that we may or may not be aware of. Example: the sadness that we typically naturally feel when someone we love dies is pain, but if we artificially extend this pain out with thoughts of the future or past, not the moment, such as, "will this pain ever stop?," or, "If only I'd done something different, they might still be alive," then it becomes suffering. This first example thought, by the way, could be considered aversion to pain/craving for it to stop, while the second could be considered craving that the present were different (that you weren't in pain and your loved one were still alive).

The key distinctions for me are that pain can be experienced "in the moment" without a thought process on top of it, and it can't be

You're welcome, and thanks for the support! :)

Re: MAPLE, I might have an interest in visiting—I became acquainted with MAPLE because I think Alex Flint spent some time there? Does one need to be actively working on an AI safety project to visit? I am not currently doing so, having stepped away from AI safety work to focus on directly addressing suffering.

2Unreal
no, anyone can visit! we have guests all the time. feel free to DM me if you want to ask more. or you can just go on the website and schedule a visit. Alex Flint is still here too, altho he lives on neighboring land now. 'directly addressing suffering' is a good description of what we're up to?

Lukas, thanks for taking the time to read and reply! I appreciate you reminding me of your article on Tranquilism—it's been a couple of years since I read it (during my fellowship with CLR), and I hadn't made a mental note of it making such a distinction when I did, so thanks for the reminder.

While I agree that it's an open question as to how effective meditation is for alleviating suffering at scale (e.g. how easy it is for how many humans to reduce their suffering by how much with how much time/effort), I don't think it would require as much of a commitm... (read more)

9Lukas_Gloor
My sense is that existing mindfulness studies don't show the sort of impressive results that we'd expect if this were a great solution. Also, I think people who would benefit most from having less day-to-day suffering often struggle with having no "free room" available for meditation practice, and that seems like an issue that's hard to overcome even if meditation practice would indeed help them a lot. It's already a sign of having a decently good life when you're able to start dedicating time for something like meditation, which I think requires a bit more mental energy than just watching series or scrolling through the internet. A lot of people have leisure time, but it's a privilege to be mentally well off enough to do purposeful activities during your leisure time. The people who have a lot of this purposeful time probably (usually) aren't among the ones that suffer most (whereas the people who don't have it will struggle sticking to regular meditation practice, for good reasons).

For instance, if someone has a chronic illness with frequent pain and nearly constant fatigue, I can see how it might be good for them to practice meditation for pain management, but higher up on their priority list are probably things like "how do I manage to do daily chores despite low energy levels?" or "how do I not get let go at work?" Similarly, for other things people may struggle with (addictions, financial worries, anxieties of various sorts; other mental health issues), meditation is often something that would probably help, but it doesn't feel like priority number one for people with problem-ridden, difficult lives. It's pretty hard to keep up motivation for training something that you're not fully convinced is your top priority, especially if you're struggling with other things.

I see meditation as similar to things like "eat healthier, exercise more, go to sleep on time and don't consume distracting content or too much light in the late evenings, etc." And the

If I understand the example and the commentary from SEP correctly, doesn't this example illustrate a problem with Lewis' definition of causation? I agree that commonsense dictates that Alice throwing the rock caused the window to smash, but I think the problem is that you cannot construct a sequence of stepwise dependences from cause to effect:

Lewis’s theory cannot explain the judgement that Suzy’s throw caused the shattering of the bottle. For there is no causal dependence between Suzy’s throw and the shattering, since even if Suzy had not thrown her ro

... (read more)
2Cleo Nardo
tbh, Lewis's account of counterfactuals is a bit defective, compared with (e.g.) Pearl's

Thanks so much for this—it was just the answer I was looking for!

I was able to follow the logic you presented, and in particular, I understand that claims (1) and (2) hold but (3) does not in the example given.

So, I was correct in my original example of c->d->e that

(1) if c were to not happen, d would not happen

(2) if d were to not happen, e would not happen

BUT it was incorrect to then derive that if c were to not happen, e would not happen? Have I understood you correctly?

I'm still a bit fuzzy on the informal counterexam... (read more)

4Cleo Nardo
Suppose Alice and Bob throw a rock at a fragile window; Alice's rock hits the window first, smashing it. Then the following seems reasonable:
1. Alice throwing the rock caused the window to smash. True.
2. Were Alice to throw the rock, then the window would've smashed. True.
3. Were Alice not to throw the rock, then the window would've not smashed. False.
4. By (3), the window smashing does not causally depend on Alice throwing the rock.
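A minimal structural-equation sketch of this preemption setup (a toy model I'm adding, which assumes Alice's rock always arrives first) makes the failure of dependence explicit:

```python
# Toy structural-equation model of the preemption case (my own illustration).
# Assumption: if both throw, Alice's rock arrives first and does the smashing.

def window_smashes(alice_throws: bool, bob_throws: bool) -> bool:
    alice_hits = alice_throws                      # Alice's rock gets there first
    bob_hits = bob_throws and not alice_hits       # Bob's rock only matters if Alice's didn't hit
    return alice_hits or bob_hits

actual = window_smashes(alice_throws=True, bob_throws=True)           # True
counterfactual = window_smashes(alice_throws=False, bob_throws=True)  # still True, via Bob

print(actual, counterfactual)
# The smashing does not counterfactually depend on Alice's throw (it would have
# happened anyway, via Bob), even though her throw is what actually smashed it.
# This is exactly claim (3) being false while claim (1) stays true.
```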
Answer by jbkjr*140

This is kind of vague, but I have this sense that almost everybody doing RL and related research takes the notion of "agent" for granted, as if it's some metaphysical primitive*, as opposed to being a (very) leaky abstraction that exists in the world models of humans. But I don't think the average alignment researcher has much better intuitions about agency, either, to be honest, even though some spend time thinking about things like embedded agency. It's hard to think meaningfully about the illusoriness of the Cartesian boundary when you still live 99% of... (read more)

jbkjrΩ130

To illustrate my reservations: soon after I read the sentence about GNW meaning you can only be conscious of one thing at a time, as I was considering that proposition, I felt my chin was a little itchy and so I scratched it. So now I can remember thinking about the proposition while simultaneously scratching my chin. Trying to recall exactly what I was thinking at the time now also brings up a feeling of a specific body posture.

To me, "thinking about the proposition while simultaneously scratching my chin" sounds like a separate "thing" (complex repres... (read more)

4Charlie Steiner
Good points, thanks for the elaboration. I agree it could also be the case that integrating thoughts with different locations of origin only happens by broadcasting both separately and then only later synthesizing them with some third mechanism (is this something we can probe by having someone multitask in an fMRI and looking for rapid strobe-light alternations of [e.g.] "count to 10"-related and "do the hand jive"-related activations?). In a modus ponens / modus tollens sort of way, such a non-synthesizing GNW would be less useful to understanding consciousness than one with more shades of grey - it would reduce the long-range correlations to mere message-passing. If in this picture most of my verbal reasoning is localized rather than broadcast, but then it eventually gets used by the rest of my brain and stored in memory, I have absolutely no problem with saying I was doing verbal reasoning and it was conscious, with no equivocations about "but only when the strobe light was on." (Obviously this is related to a Multiple Drafts model of consciousness.)
jbkjrΩ340

I think it's really cool you're posting updates as you go and writing about uncertainties! I also like the fiction continuation as a good first task for experimenting with these things.

My life is a relentless sequence of exercises in importance sampling and counterfactual analysis

This made me laugh out loud :P

2Buck
Thanks, glad to hear you appreciate us posting updates as we go.
jbkjrΩ560

If you then deconfuse agency as "its behavior is reliably predictable by the intentional strategy", I then have the same question: "why is its behavior reliably predictable by the intentional strategy?" Sure, its behavior in the set of circumstances we've observed is predictable by the intentional strategy, but none of those circumstances involved human extinction; why expect that the behavior will continue to be reliably predictable in settings where the prediction is "causes human extinction"?

Overall, I generally agree with the intentional stance as an

... (read more)
3Rohin Shah
Yeah, I agree with all of that.

What's my take? I think that when we talk about goal-directedness, what we really care about is a range of possible behaviors, some of which we worry about in the context of alignment and safety.

  • (What I'm not saying) We shouldn't ascribe any cognition to the system, just find rules of association for its behavior (aka Behaviorism)
  • That's not even coherent with my favored approach to goal-directedness, the intentional stance. Dennett clearly ascribes beliefs and desires to beings and systems; his point is that the ascription is done based on the behavi
... (read more)
3adamShimi
I'm glad, you're one of the handful of people I wrote this post for. ;) Definitely. I have tended to neglect this angle, but I'm trying to correct that mistake.
Answer by jbkjr60

I think they do some sort of distillation type thing where they train massive models to label data or act as “overseers” for the much smaller models that actually are deployed in cars (as inference has to be much faster to make decisions in real time)… so I wouldn’t actually expect them to be that big in the actual cars. More details about this can be found in Karpathy’s recent CVPR talk, iirc, but not about parameter count/model size?
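For concreteness, here is a minimal sketch of the teacher-student (distillation / auto-labeling) pattern described above, assuming a PyTorch-style setup; it is illustrative only and not Tesla's actual pipeline:

```python
# Minimal teacher-student distillation sketch (illustrative only; not Tesla's actual stack).
import torch
import torch.nn.functional as F

def distill_step(teacher, student, optimizer, images, temperature=2.0):
    """One step: the large offline teacher labels data; the small deployable student imitates it."""
    teacher.eval()
    with torch.no_grad():
        teacher_logits = teacher(images)   # expensive model, run offline / in the data pipeline
    student_logits = student(images)       # small model that must run in real time in the car
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```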

jbkjrΩ110

To try to understand a bit better: does your pessimism about this come from the hardness of the technical challenge of querying a zillion-particle entity for its objective function? Or does it come from the hardness of the definitional challenge of exhaustively labeling every one of those zillion particles to make sure the demarcation is fully specified? Or is there a reason you think constructing any such demarcation is impossible even in principle? Or something else?

Probably something like the last one, although I think "even in principle" is doing so... (read more)

1Edouard Harris
I'm with you on this, and I suspect we'd agree on most questions of fact around this topic. Of course demarcation is an operation on maps and not on territories. But as a practical matter, the moment one starts talking about the definition of something such as a mesa-objective, one has already unfolded one's map and started pointing to features on it. And frankly, that seems fine! Because historically, a great way to make forward progress on a conceptual question has been to work out a sequence of maps that give you successive degrees of approximation to the territory. I'm not suggesting actually trying to imbue an AI with such concepts — that would be dangerous (for the reasons you alluded to) even if it wasn't pointless (because prosaic systems will just learn the representations they need anyway). All I'm saying is that the moment we started playing the game of definitions, we'd already started playing the game of maps. So using an arbitrary demarcation to construct our definitions might be bad for any number of legitimate reasons, but it can't be bad just because it caused us to start using maps: our earlier decision to talk about definitions already did that. (I'm not 100% sure if I've interpreted your objection correctly, so please let me know if I haven't.)
jbkjrΩ110

(Because you'd always be unable to answer the legitimate question: "the mesa-objective of what?")

All I'm saying is that, to the extent you can meaningfully ask the question, "what is this bit of the universe optimizing for?", you should be able to clearly demarcate which bit you're asking about.

I totally agree with this; I guess I'm just (very) wary about being able to "clearly demarcate" whichever bit we're asking about and therefore fairly pessimistic we can "meaningfully" ask the question to begin with? Like, if you start asking yourself questions li... (read more)

1Edouard Harris
Yeah I agree this is a legitimate concern, though it seems like it is definitely possible to make such a demarcation in toy universes (like in the example I gave above). And therefore it ought to be possible in principle to do so in our universe. To try to understand a bit better: does your pessimism about this come from the hardness of the technical challenge of querying a zillion-particle entity for its objective function? Or does it come from the hardness of the definitional challenge of exhaustively labeling every one of those zillion particles to make sure the demarcation is fully specified? Or is there a reason you think constructing any such demarcation is impossible even in principle? Or something else?
jbkjrΩ220

Btw, if you're aware of any counterpoints to this — in particular anything like a clearly worked-out counterexample showing that one can't carve up a world, or recover a consistent utility function through this sort of process — please let me know. I'm directly working on a generalization of this problem at the moment, and anything like that could significantly accelerate my execution.

I'm not saying you can't reason under the assumption of a Cartesian boundary, I'm saying the results you obtain when doing so are of questionable relevance to reality, bec... (read more)

1Edouard Harris
Ah I see! Thanks for clarifying. Yes, the point about the Cartesian boundary is important. And it's completely true that any agent / environment boundary we draw will always be arbitrary. But that doesn't mean one can't usefully draw such a boundary in the real world — and unless one does, it's hard to imagine how one could ever generate a working definition of something like a mesa-objective. (Because you'd always be unable to answer the legitimate question: "the mesa-objective of what?") Of course the right question will always be: "what is the whole universe optimizing for?" But it's hard to answer that! So in practice, we look at bits of the whole universe that we pretend are isolated. All I'm saying is that, to the extent you can meaningfully ask the question, "what is this bit of the universe optimizing for?", you should be able to clearly demarcate which bit you're asking about. (i.e. I agree with you that duality is a useful fiction, just saying that we can still use it to construct useful definitions.)
jbkjrΩ110

I definitely see it as a shift in that direction, although I'm not ready to really bite the bullets -- I'm still feeling out what I personally see as the implications. Like, I want a realist-but-anti-realist view ;p

You might find Joscha Bach's view interesting...

jbkjrΩ330

I didn't really take the time to try and define "mesa-objective" here. My definition would be something like this: if we took long enough, we could point to places in the big NN (or whatever) which represent goal content, similarly to how we can point to reward systems (/ motivation systems) in the human brain. Messing with these would change the apparent objective of the NN, much like messing with human motivation centers.

This sounds reasonable and similar to the kinds of ideas for understanding agents' goals as cognitively implemented that I've been e... (read more)

2abramdemski
Seems fair. I'm similarly conflicted. In truth, both the generalization-focused path and the objective-focused path look a bit doomed to me.
jbkjrΩ110

I haven't engaged that much with the anti-EU-theory stuff, but my experience so far is that it usually involves a pretty strict idea of what is supposed to fit EU theory, and often, misunderstandings of EU theory. I have my own complaints about EU theory, but they just don't resonate at all with other people's complaints, it seems.

For example, I don't put much stock in the idea of utility functions, but I endorse a form of EU theory which avoids them. Specifically, I believe in approximately coherent expectations: you assign expected values to events, and

... (read more)
2abramdemski
Right, exactly. (I should probably have just referred to that, but I was trying to avoid reference-dumping.)

One idea as to the source of the potential discrepancy... did any of the task prompts for the tasks in which it did figure out how to use tools tell it explicitly to "use the objects to reach a higher floor," or something similar? I'm wondering if the cases where it did use tools are examples where doing so was instrumentally useful to achieving a prompted objective that didn't explicitly require tool use.

2Daniel Kokotajlo
None of the prompts tell it what to do; they aren't even in English. (Or so I think? Correct me if I'm wrong!) Instead they are in propositional logic, using atoms that refer to objects, colors, relations, and players. They just give the reward function in disjunctive normal form (i.e. a big chain of disjunctions) and present it to the agent to observe.
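As a purely hypothetical illustration of what a reward predicate in disjunctive normal form over such atoms might look like (the atoms and the `state` interface below are invented, not the paper's actual format):

```python
# Hypothetical example of a DNF-style reward predicate over object/color/relation atoms.
# The atom names and the `state` interface are invented for illustration only.
def reward(state) -> bool:
    return (
        (state.near("player1", "purple_pyramid") and state.on("player1", "top_floor"))
        or state.holding("player1", "black_cube")
    )
```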
jbkjrΩ110

I'm not too keen on (2) since I don't expect mesa objectives to exist in the relevant sense.

Same, but how optimistic are you that we could figure out how to shape the motivations or internal "goals" (much more loosely defined than "mesa-objective") of our models via influencing the training objective/reward, the inductive biases of the model, the environments they're trained in, some combination of these things, etc.?

These aren't "clean", in the sense that you don't get a nice formal guarantee at the end that your AI system is going to (try to) do wha

... (read more)
2Rohin Shah
That seems great, e.g. I think by far the best thing you can do is to make sure that you finetune using a reward function / labeling process that reflects what you actually want (i.e. what people typically call "outer alignment"). I probably should have mentioned that too, I was taking it as a given but I really shouldn't have. For inductive biases + environments, I do think controlling those appropriately would be useful and I would view that as an example of (1) in my previous comment.
jbkjrΩ030

Intent Alignment: A model is intent-aligned if it has a mesa-objective, and that mesa-objective is aligned with humans. (Again, I don't want to get into exactly what "alignment" means.)

This path apparently implies building goal-oriented systems; all of the subgoals require that there actually is a mesa-objective.

I pretty strongly endorse the new diagram with the pseudo-equivalences, with one caveat (much the same comment as on your last post)... I think it's a mistake to think of only mesa-optimizers as having "intent" or being "goal-oriented" unless... (read more)

2abramdemski
I too am a fan of broadening this a bit, but I am not sure how to. I didn't really take the time to try and define "mesa-objective" here. My definition would be something like this: if we took long enough, we could point to places in the big NN (or whatever) which represent goal content, similarly to how we can point to reward systems (/ motivation systems) in the human brain. Messing with these would change the apparent objective of the NN, much like messing with human motivation centers. I agree with your point about using "does this definition include humans" as a filter, and I think it would be easy to mess that up (and I wasn't thinking about it explicitly until you raised the point). However, I think possibly you want a very behavioral definition of mesa-objective. If that's true, I wonder if you should just identify with the generalization-focused path instead. After all, one of the main differences between the two paths is that the generalization-focused path uses behavioral definitions, while the objective-focused path assumes some kind of explicit representation of goal content within a system.
jbkjrΩ030

The behavioral objective, meanwhile, would be more like the thing the agent appears to be pursuing under some subset of possible distributional shifts. This is the more realistic case where we can't afford to expose our agent to every possible environment (or data distribution) that could possibly exist, so we make do and expose it to only a subset of them. Then we look at what objectives could be consistent with the agent's behavior under that subset of environments, and those count as valid behavioral objectives.

The key here is that the set of allowed m

... (read more)

which stems from the assumption that you are able to carve a world up into an agent and an environment and place the "same agent" in arbitrary environments. No such thing is possible in reality, as an agent cannot exist without its environment

I might be misunderstanding what you mean here, but carving up a world into agent vs environment is absolutely possible in reality, as is placing that agent in arbitrary environments to see what it does. You can think of the traditional RL setting as a concrete example of this: on one side we have an agen... (read more)

3abramdemski
This makes some sense, but I don't generally trust some "perturbation set" to in fact capture the distributional shift which will be important in the real world. There has to at least be some statement that the perturbation set is actually quite broad. But I get the feeling that if we could make the right statement there, we would understand the problem in enough detail that we might have a very different framing. So, I'm not sure what to do here.
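A toy sketch of what "behavioral objective relative to a perturbation set" could look like, and why the breadth of the set matters (all interfaces below are invented for illustration):

```python
# Toy sketch of "behavioral objective relative to a perturbation set".
# The interfaces (env.actions, agent_policy, objective) are invented for illustration.

def behavioral_objectives(agent_policy, candidate_objectives, perturbation_set):
    """Return the candidate objectives whose optimal action matches the agent in every tested environment."""
    consistent = []
    for objective in candidate_objectives:
        if all(
            agent_policy(env) == max(env.actions, key=lambda a: objective(env, a))
            for env in perturbation_set
        ):
            consistent.append(objective)
    return consistent

# A broader perturbation set rules out more candidates; with a narrow set, many
# different objectives remain behaviorally indistinguishable, which is one way
# to read the worry that the set has to actually be quite broad.
```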
jbkjrΩ330

However, we could instead define "intent alignment" as "the optimal policy of the mesa objective would be good for humans".

I agree that we need a notion of "intent" that doesn't require a purely behavioral notion of a model's objectives, but I think it should also not be limited strictly to mesa-optimizers, which neither Rohin nor I expect to appear in practice. (Mesa-optimizers appear to me to be the formalization of the idea "what if ML systems, which by default are not well-described as EU maximizers, learned to be EU maximizers?" I suspect MIRI peop... (read more)

3Evan R. Murphy
I was surprised to see you saying that Rohin (and yourself) don't expect mesa-optimizers to appear in practice. I recently read this from a comment of his on Alex Flint's "The ground of optimization", which seems to state pretty clearly that he does expect mesa-optimization from AGI development: But that comment was from 2 years ago, whereas yours is less than a year old. So perhaps he changed views in the meantime? I'd be curious to hear/read more about why either of you don't expect mesa-optimizers to appear in practice.
2abramdemski
For myself, my reaction is "behavioral objectives also assume a system is well-described as an EU maximizer". In either case, you're assuming that you can summarize a policy by a function it optimizes; the difference is whether you think the system itself thinks explicitly in those terms.

I haven't engaged that much with the anti-EU-theory stuff, but my experience so far is that it usually involves a pretty strict idea of what is supposed to fit EU theory, and often, misunderstandings of EU theory. I have my own complaints about EU theory, but they just don't resonate at all with other people's complaints, it seems.

For example, I don't put much stock in the idea of utility functions, but I endorse a form of EU theory which avoids them. Specifically, I believe in approximately coherent expectations: you assign expected values to events, and a large part of cognition is devoted to making these expectations as coherent as possible (updating them based on experience, propagating expectations of more distant events to nearer, etc). This is in contrast to keeping some centrally represented utility function, and devoting cognition to computing expectations for this utility function.

In this picture, there is no clear distinction between terminal values and instrumental values. Something is "more terminal" if you treat it as more fixed (you resolve contradictions by updating the other values), and "more instrumental" if its value is more changeable based on other things.

(Possibly you should consider my "approximately coherent expectations" idea)
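A minimal sketch of how I read "approximately coherent expectations" (my own illustration, not abramdemski's formalism): values attach to events directly and get nudged toward consistency with their successors, rather than being computed as expectations of a central utility function.

```python
# Toy illustration (assumed interpretation): expected values live on events and are
# propagated toward coherence, with "more terminal" events held relatively fixed.

expected_value = {"work": 0.0, "paycheck": 0.0, "vacation": 10.0}  # 'vacation' treated as more terminal
leads_to = {"work": "paycheck", "paycheck": "vacation"}

def propagate(values, transitions, rate=0.5, steps=20):
    for _ in range(steps):
        for event, successor in transitions.items():
            # resolve incoherence by moving the "more instrumental" value
            # toward the value of what it leads to
            values[event] += rate * (values[successor] - values[event])
    return values

print(propagate(expected_value, leads_to))
# 'work' and 'paycheck' inherit value from 'vacation' without any explicit utility function.
```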
jbkjrΩ030

So, for example, this claims that either intent alignment + objective robustness or outer alignment + robustness would be sufficient for impact alignment.

Shouldn’t this be “intent alignment + capability robustness or outer alignment + robustness”?
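In other words, reading the corrected diagram as two sufficiency claims:

```latex
% The two sufficiency claims, with the correction applied (my reading of the diagram):
(\text{intent alignment} \wedge \text{capability robustness}) \;\Rightarrow\; \text{impact alignment}
(\text{outer alignment} \wedge \text{robustness}) \;\Rightarrow\; \text{impact alignment}
```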

Btw, I plan to post more detailed comments in response here and to your other post, just wanted to note this so hopefully there’s no confusion in interpreting your diagram.

2abramdemski
Yep, fixed.

Great post. My one piece of feedback is that not calling the post "Deconfusing 'Deconfusion'" might've been a missed opportunity. :)

I even went to this cooking class once where the chef proposed his own deconfusion of the transformations of food induced by different cooking techniques -- I still use it years later.

Unrelatedly, I would be interested in details on this.

3adamShimi
To be fair, that was the original title. But after talking with Nate, I agreed that this perspective, although quite useful IMO, falls short of deconfusion because it hasn't paid its due in making the application (doing deconfusion) better/easier yet. Doesn't mean I don't expect it to eventually. :)

The way I'd think of it, it's not that you literally need unanimous agreement, but that in some situations there may be subagents that are strong enough to block a given decision.

Ah, I think that makes sense. Is this somehow related to the idea that consciousness is more of a "last stop for a veto from the collective mind system" for already-subconsciously-initiated thoughts and actions? Struggling to remember where I read this, though.

It gets a little handwavy and metaphorical but so does the concept of a subagent.

Yeah, considering the fact tha... (read more)

Wouldn't decisions about e.g. which objects get selected and broadcast to the global workspace be made by a majority or plurality of subagents? "Committee requiring unanimous agreement" feels more like what would be the case in practice for a unified mind, to use a TMI term. I guess the unanimous agreement is only required because we're looking for strict/formal coherence in the overall system, whereas e.g. suboptimally-unified/coherent humans with lots of akrasia can have tug-of-wars between groups of subagents for control.

4Kaj_Sotala
The way I'd think of it, it's not that you literally need unanimous agreement, but that in some situations there may be subagents that are strong enough to block a given decision. And then if you only look at the subagents that are strong enough to exert a major influence on that particular decision (and ignore the ones either who don't care about it or who aren't strong enough to make a difference), it kind of looks like a committee requiring unanimous agreement. It gets a little handwavy and metaphorical but so does the concept of a subagent. :)

The arrows show preference: our agent prefers A to B if (and only if) there is a directed path from A to B along the arrows.

Shouldn't this be "iff there is a directed path from B to A"? E.g. the agent prefers pepperoni to cheese, so there is a directed arrow from cheese to pepperoni.

5johnswentworth
Nice catch, thanks.

Great post. That Anakin meme is gold.

“Whenever you notice yourself saying ‘outside view’ or ‘inside view,’ imagine a tiny Daniel Kokotajlo hopping up and down on your shoulder chirping ‘Taboo outside view.’”

Somehow I know this will now happen automatically whenever I hear or read “outside view.” 😂

The Buddha taught one specific concentration technique and a simple series of insight techniques

Any pointers on where I can find information about the specific techniques as originally taught by the Buddha?

2romeostevensit
https://www.accesstoinsight.org/tipitaka/mn/mn.118.than.html on interpretations: https://en.wikipedia.org/wiki/Ānāpānasati_Sutta insight techniques: https://en.wikipedia.org/wiki/Satipatthana

I've found this interview with Richard Lang about the "headless" method of interrogation helpful and think Sam's discussion provides useful context to bridge the gap to the scientific skeptics as well as to other meditative techniques and traditions (some of which are touched upon in this post). It also includes a pointing out exercise.

2abramdemski
Thanks!

Late to the party, but here's my crack at it (ROT13'd since markdown spoilers made it an empty box without my text):

Fbzrguvat srryf yvxr n ovt qrny vs V cerqvpg gung vg unf n (ovt) vzcnpg ba gur cbffvovyvgl bs zl tbnyf/inyhrf/bowrpgvirf orvat ernyvmrq. Nffhzvat sbe n zbzrag gung gur tbnyf/inyhrf ner jryy-pncgherq ol n hgvyvgl shapgvba, vzcnpg jbhyq or fbzrguvat yvxr rkcrpgrq hgvyvgl nsgre gur vzcnpgshy rirag - rkcrpgrq hgvyvgl orsber gur rirag. Boivbhfyl, nf lbh'ir cbvagrq bhg, fbzrguvat orvat vzcnpgshy nppbeqvat gb guvf abgvba qrcraqf obgu ba gur inyhrf naq ba ubj "bowrpgviryl vzcnpgshy" vg vf (v.r. ubj qenfgvpnyyl vg punatrf gur frg bs cbffvoyr shgherf).

Ah, it was John's post I was thinking of; thanks! (Apologies to John for not remembering it was his writing, although I suppose mistaking someone's visual imagery on a technical topic for Eliezer's might be considered an accidental compliment :).)
