It sounds to me like a problem of not reasoning according to Occam's razor and "overfitting" a model to the available data.
Ceteris paribus, H' isn't more "fishy" than any other hypothesis, but H' is a significantly more complex hypothesis than H or ¬H: instead of asserting H or ¬H, it asserts (A=>H) & (B=>¬H), so it should have been commensurately de-weighted in the prior distribution according to its complexity. The fact that Alice's study supports H and Bob's contradicts it does, in fact, increase the weight given to H' in the posterior relativ...
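To make the "commensurately de-weighted in the prior" point concrete, here's a toy numerical sketch (my own illustration with made-up numbers, not part of the original comment): a complexity-penalized prior over H, ¬H, and H', updated on the joint observation that Alice's study supports H while Bob's contradicts it.

```python
# Toy illustration (numbers made up): a complexity-penalized prior over H,
# not-H, and H' = (A => H) & (B => not-H), updated on the joint observation
# "Alice's study supports H, Bob's study contradicts it".
priors = {"H": 0.45, "notH": 0.45, "Hprime": 0.10}  # H' starts penalized for its complexity

# Likelihood of the joint observation under each hypothesis: under H or not-H,
# one of the two studies must be misleading; under H' both results are expected.
likelihoods = {"H": 0.8 * 0.2, "notH": 0.2 * 0.8, "Hprime": 0.8 * 0.8}

unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
z = sum(unnormalized.values())
posterior = {h: round(p / z, 3) for h, p in unnormalized.items()}
print(posterior)  # {'H': 0.346, 'notH': 0.346, 'Hprime': 0.308}
```

In this toy setup H' roughly triples its weight relative to its prior, but the complexity penalty keeps it from dominating outright.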
To me, "indecision results from sub-agential disagreement" seems almost tautological, at least within the context of multi-agent models of mind, since if all the sub-agents were in agreement, there wouldn't be any indecision. So, the question I have is: how often are disagreeing sub-agents "internalized authority figures"? I think I agree with you in that the general answer is "relatively often," although I expect a fair amount of variance between individuals.
I'd guess it's a problem of translation; I'm pretty confident the original text in Pali would just say "dukkha" there.
The Wikipedia entry for dukkha says it's commonly translated as "pain," but I'm very sure the referent of dukkha in experience is not pain, even if it's mistranslated as such, however commonly.
Say I have a strong desire to eat pizza, but only a weak craving. I have a hard time imagining what that would be like.
I think this is likely in part due to “desire” connoting both craving and preferring. In the Buddhist context, “desire” is often used more like “craving,” but on the other hand, if I have a pizza for dinner, it seems reasonable to say it was because I desired it (in the sense of having a preference for it), even if there wasn’t any craving for it.
I think people tend to crave what they prefer until they’ve made progress on undoing the h...
I'd be interested to hear if you have any other ideas for underexplored/underappreciated cause areas or intervention groups that might be worth further investigation when reevaluated via this pain vs. suffering distinction.
Unfortunately, I’m not aware of much existing work in the space that I can point you toward. I’d generally be quite interested in studies which better evaluate meditation’s effects on directly reducing suffering in terms of e.g. how difficult it is for how many people to reduce their suffering by how much, but the EA commu...
Then the question is whether the idiosyncratic words are only ever explained using other idiosyncratic words, or whether at some point it actually connects with the shared reality.
The point is that the words ground out in actual sensations and experiences, not just other words and concepts. What I’m arguing is that it’s not useful to use the English word “suffering” to refer to ordinary pain or displeasure, because there is a distinction in experience between what we refer to as “pain” or “displeasure” and what is referred to by the term “dukkha,” and t...
My point is that in English, "experience such severe pain that one might prefer non-existence to continuing to endure that pain" would be considered an uncontroversial example of "suffering," not something suffering-neutral to which suffering might or might not be added.
Sure, but I think that’s just because of the usual conflation between pain and suffering which I’m trying to address with this post. If you ask anyone with the relevant experience “does Buddhism teaching me to never suffer again mean that I’ll never experience (severe) pain a...
The assumption that the habit can be completely dropped is entirely theoretical. The historical Buddha's abilities are lost to history. Modern meditators can perform immense feats of pain tolerance, but I personally haven't heard one claim to have completely eradicated the habit of suffering.
I believe Daniel Ingram makes such a claim by virtue of his claim of arhatship; if he still suffers then he cannot reasonably claim to be an arhat. He also has an anecdote of someone else he considers to be an arhat saying “This one is not suffering!” in resp...
I think you're right about all the claims of fact. The Buddha won't suffer when he feels pain. But unenlightened beings, which is all the rest of us, particularly animals, will.
But the example of the Buddha goes to show that humans have the capacity to not suffer even in painful circumstances, even if right now they do. It’s not like “unenlightenment” is something you’re forever resigned to.
So taking pain as a proxy for suffering is pretty reasonable for thinking about how to reduce suffering
I agree that in most cases where someone suffers in the pr...
The message of Buddhism isn’t “in order to not suffer, don’t want anything”; not craving/being averse doesn’t mean not having any intentions or preferences. Sure, if you crave the satisfaction of your preferences, or if you’re averse to their frustration, you will suffer, but intentions and preferences remain when craving/aversion/clinging is gone. It’s like a difference between “I’m not ok unless this preference is satisfied” and “I’d still like this preference to be satisfied, but I’ll ultimately be ok either way.”
I wouldn’t say suffering is merely preference frustration—if you’re not attached to your preferences and their satisfaction, then you won’t suffer if they’re frustrated. Not craving/being averse doesn’t mean you don’t have preferences, though—see this reply I made to another comment on this post for more discussion of this.
I don’t know if I would say depression isn’t painful, at least in the emotional sense of pain. In either case, it’s certainly unpleasant, and if you want to use “pain” to refer to unpleasantness associated with tissue damage and “displea...
My sense is that existing mindfulness studies don't show the sort of impressive results that we'd expect if this were a great solution.
If you have any specific studies in mind which show this, I would be interested to see them! I have a sense that mindfulness tends to be studied in the context of “increasing well-being” in a general sense and not specifically “decreasing or eliminating suffering.” I would be quite interested in a study which evaluates meditation’s effects when directly targeting suffering.
...Also, I think people who would benefit most from havin
I want to address a common misconception that I see you having here when you write phrases like:
not many people… are going to remain indifferent about it
“… I can choose to react on them or to ignore them, and I am choosing to ignore this one”
when people feel pain, and a desire to avoid that pain arises…
a person who really has no preference whatsoever
to the level that they actually become indifferent to pain
Importantly, “not being averse to pain,” in the intended sense of the word aversion, does not mean that one is “indifferent to pain,” in...
It seems a bit misguided to me to argue “well, even in the absence of suffering, one might experience such severe pain that one might prefer non-existence to continuing to endure that pain, so this ‘not suffering’ can’t be all it’s cracked up to be”—would you rather experience suffering on top of that pain? With or without pain, not suffering is preferable to suffering.
For example, with end-of-life patients, circumstances being so unpleasant doesn’t mean that they may as well suffer, too; nor does “being an end-of-life patient” being a possible experience...
Which of the following claims would you disagree with?
Interestingly, I had a debate with someone on an earlier draft of this post about whether or not pain could be considered a cause of suffering, which led to me brushing up on some of the literature on causation.
What seems clear to me is that suffering causally depends on craving/aversion, but not on pain—there is suffering if and only if there is craving/aversion, but there can be suffering without pain, and there can be pain without suffering.
On Lewis' account, causal dependence implies causation but not vice versa, so this does not itself mean that pain cann...
However, I'm not sure I would make the claim that "pain causes aversion" in general, as it is quite possible for pain to occur without aversion then occurring.
There are counter-examples to the claim, for example people liking spicy food.
But if you e.g. stab someone with a knife, not many people are going to say "thank you" or remain indifferent about it. I think that in such situation, saying "pain caused aversion" is a fair description of what happened.
It is interesting to know that a level 100 Buddhist monk could get stabbed with a knife and say "anyway,...
Re: 2, I disagree—there will be suffering if there is craving/aversion, even in the absence of pain. Craving pleasure results in suffering just as much as aversion to pain does.
Re: 4, While I agree that animals likely "live more in the moment" and have less capacity to make up stories about themselves, I do not think that this precludes them from having the basic mental reaction of craving/aversion and therefore suffering. I think the "stories" you're talking about have much more to do with ego/psyche than the "self" targeted in Buddhism—I think of ego/psy...
You're welcome, and thanks for the support! :)
Re: MAPLE, I might have an interest in visiting—I became acquainted with MAPLE because I think Alex Flint spent some time there? Does one need to be actively working on an AI safety project to visit? I am not currently doing so, having stepped away from AI safety work to focus on directly addressing suffering.
Lukas, thanks for taking the time to read and reply! I appreciate you reminding me of your article on Tranquilism—it's been a couple of years since I read it (during my fellowship with CLR), and I hadn't made a mental note of it making such a distinction when I did, so thanks for the reminder.
While I agree that it's an open question as to how effective meditation is for alleviating suffering at scale (e.g. how easy it is for how many humans to reduce their suffering by how much with how much time/effort), I don't think it would require as much of a commitm...
If I understand the example and the commentary from SEP correctly, doesn't this example illustrate a problem with Lewis' definition of causation? I agree that common sense dictates that Alice throwing the rock caused the window to smash, but I think the problem is that you cannot construct a sequence of stepwise dependences from cause to effect:
...Lewis’s theory cannot explain the judgement that Suzy’s throw caused the shattering of the bottle. For there is no causal dependence between Suzy’s throw and the shattering, since even if Suzy had not thrown her ro
Thanks so much for this—it was just the answer I was looking for!
I was able to follow the logic you presented, and in particular, I understand which counterfactual dependences hold and which do not in the example given.
So, I was correct in my original example of c->d->e that
(1) if c were to not happen, d would not happen
(2) if d were to not happen, e would not happen
BUT it was incorrect to then derive that if c were to not happen, e would not happen? Have I understood you correctly?
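For what it's worth, the pattern being asked about can be written compactly; this formalization is mine (using Lewis's counterfactual conditional), not the original exchange's notation:

```latex
% (requires amssymb for \Box) Stepwise counterfactual dependence does not
% entail end-to-end dependence, because the counterfactual conditional
% ("if ... had not happened, ... would not have happened") is not transitive:
\[
  (\lnot c \mathrel{\Box\!\!\rightarrow} \lnot d) \;\land\; (\lnot d \mathrel{\Box\!\!\rightarrow} \lnot e)
  \;\not\Rightarrow\; (\lnot c \mathrel{\Box\!\!\rightarrow} \lnot e)
\]
```

On Lewis's account, causation is then defined via chains of stepwise dependence, which is how c can count as a cause of e even when e does not counterfactually depend on c directly.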
I'm still a bit fuzzy on the informal counterexam...
This is kind of vague, but I have this sense that almost everybody doing RL and related research takes the notion of "agent" for granted, as if it's some metaphysical primitive*, as opposed to being a (very) leaky abstraction that exists in the world models of humans. But I don't think the average alignment researcher has much better intuitions about agency, either, to be honest, even though some spend time thinking about things like embedded agency. It's hard to think meaningfully about the illusoriness of the Cartesian boundary when you still live 99% of...
To illustrate my reservations: soon after I read the sentence about GNW meaning you can only be conscious of one thing at a time, as I was considering that proposition, I felt my chin was a little itchy and so I scratched it. So now I can remember thinking about the proposition while simultaneously scratching my chin. Trying to recall exactly what I was thinking at the time now also brings up a feeling of a specific body posture.
To me, "thinking about the proposition while simultaneously scratching my chin" sounds like a separate "thing" (complex repres...
I think it's really cool you're posting updates as you go and writing about uncertainties! I also like the fiction continuation as a good first task for experimenting with these things.
My life is a relentless sequence of exercises in importance sampling and counterfactual analysis
This made me laugh out loud :P
...If you then deconfuse agency as "its behavior is reliably predictable by the intentional strategy", I then have the same question: "why is its behavior reliably predictable by the intentional strategy?" Sure, its behavior in the set of circumstances we've observed is predictable by the intentional strategy, but none of those circumstances involved human extinction; why expect that the behavior will continue to be reliably predictable in settings where the prediction is "causes human extinction"?
Overall, I generally agree with the intentional stance as an
What's my take? I think that when we talk about goal-directedness, what we really care about is a range of possible behaviors, some of which we worry about in the context of alignment and safety.
...
- (What I'm not saying) We shouldn't ascribe any cognition to the system, just find rules of association for its behavior (aka Behaviorism)
- That's not even coherent with my favored approach to goal-directedness, the intentional stance. Dennett clearly ascribes beliefs and desires to beings and systems; his point is that the ascription is done based on the behavi
I think they do some sort of distillation-type thing where they train massive models to label data or act as “overseers” for the much smaller models that actually are deployed in cars (as inference has to be fast enough to make decisions in real time)… so I wouldn’t actually expect them to be that big in the actual cars. More details about this can be found in Karpathy’s recent CVPR talk, iirc, though not about parameter count/model size.
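As a rough illustration of what that kind of teacher–student distillation looks like (a minimal sketch under my own assumptions, not Tesla's actual pipeline; the architectures, temperature, and data below are placeholders):

```python
# Minimal teacher-student distillation sketch: a large "overseer" model
# produces soft labels that a much smaller, deployable model learns to match.
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Linear(128, 1024), nn.ReLU(), nn.Linear(1024, 10))  # big model
student = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))      # small, deployable model

optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
temperature = 2.0

for step in range(100):
    x = torch.randn(32, 128)            # stand-in for a batch of input features
    with torch.no_grad():
        teacher_logits = teacher(x)     # the big model "labels" the data
    student_logits = student(x)
    # Train the student to match the teacher's softened output distribution.
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```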
To try to understand a bit better: does your pessimism about this come from the hardness of the technical challenge of querying a zillion-particle entity for its objective function? Or does it come from the hardness of the definitional challenge of exhaustively labeling every one of those zillion particles to make sure the demarcation is fully specified? Or is there a reason you think constructing any such demarcation is impossible even in principle? Or something else?
Probably something like the last one, although I think "even in principle" is doing so...
(Because you'd always be unable to answer the legitimate question: "the mesa-objective of what?")
All I'm saying is that, to the extent you can meaningfully ask the question, "what is this bit of the universe optimizing for?", you should be able to clearly demarcate which bit you're asking about.
I totally agree with this; I guess I'm just (very) wary about being able to "clearly demarcate" whichever bit we're asking about and therefore fairly pessimistic we can "meaningfully" ask the question to begin with? Like, if you start asking yourself questions li...
Btw, if you're aware of any counterpoints to this — in particular anything like a clearly worked-out counterexample showing that one can't carve up a world, or recover a consistent utility function through this sort of process — please let me know. I'm directly working on a generalization of this problem at the moment, and anything like that could significantly accelerate my execution.
I'm not saying you can't reason under the assumption of a Cartesian boundary, I'm saying the results you obtain when doing so are of questionable relevance to reality, bec...
I definitely see it as a shift in that direction, although I'm not ready to really bite the bullets -- I'm still feeling out what I personally see as the implications. Like, I want a realist-but-anti-realist view ;p
You might find Joscha Bach's view interesting...
I didn't really take the time to try and define "mesa-objective" here. My definition would be something like this: if we took long enough, we could point to places in the big NN (or whatever) which represent goal content, similarly to how we can point to reward systems (/ motivation systems) in the human brain. Messing with these would change the apparent objective of the NN, much like messing with human motivation centers.
This sounds reasonable and similar to the kinds of ideas for understanding agents' goals as cognitively implemented that I've been e...
...I haven't engaged that much with the anti-EU-theory stuff, but my experience so far is that it usually involves a pretty strict idea of what is supposed to fit EU theory, and often, misunderstandings of EU theory. I have my own complaints about EU theory, but they just don't resonate at all with other people's complaints, it seems.
For example, I don't put much stock in the idea of utility functions, but I endorse a form of EU theory which avoids them. Specifically, I believe in approximately coherent expectations: you assign expected values to events, and
One idea as to the source of the potential discrepancy... did any of the task prompts for the tasks in which it did figure out how to use tools tell it explicitly to "use the objects to reach a higher floor," or something similar? I'm wondering if the cases where it did use tools are examples where doing so was instrumentally useful to achieving a prompted objective that didn't explicitly require tool use.
I'm not too keen on (2) since I don't expect mesa objectives to exist in the relevant sense.
Same, but how optimistic are you that we could figure out how to shape the motivations or internal "goals" (much more loosely defined than "mesa-objective") of our models via influencing the training objective/reward, the inductive biases of the model, the environments they're trained in, some combination of these things, etc.?
...These aren't "clean", in the sense that you don't get a nice formal guarantee at the end that your AI system is going to (try to) do wha
Intent Alignment: A model is intent-aligned if it has a mesa-objective, and that mesa-objective is aligned with humans. (Again, I don't want to get into exactly what "alignment" means.)
This path apparently implies building goal-oriented systems; all of the subgoals require that there actually is a mesa-objective.
I pretty strongly endorse the new diagram with the pseudo-equivalences, with one caveat (much the same comment as on your last post)... I think it's a mistake to think of only mesa-optimizers as having "intent" or being "goal-oriented" unless...
...The behavioral objective, meanwhile, would be more like the thing the agent appears to be pursuing under some subset of possible distributional shifts. This is the more realistic case where we can't afford to expose our agent to every possible environment (or data distribution) that could possibly exist, so we make do and expose it to only a subset of them. Then we look at what objectives could be consistent with the agent's behavior under that subset of environments, and those count as valid behavioral objectives.
The key here is that the set of allowed m
which stems from the assumption that you are able to carve an environment up into an agent and an environment and place the "same agent" in arbitrary environments. No such thing is possible in reality, as an agent cannot exist without its environment
I might be misunderstanding what you mean here, but carving up a world into agent vs environment is absolutely possible in reality, as is placing that agent in arbitrary environments to see what it does. You can think of the traditional RL setting as a concrete example of this: on one side we have an agen...
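A self-contained sketch of that interface (my own toy example, not from the original comment): the only thing shared between the two sides is the (observation, action, reward) channel, which is what lets you drop the "same agent" into different environments.

```python
# Toy agent/environment carving in the standard RL framing: the boundary is
# just the (observation, action, reward) interface.
import random

class CoinFlipEnv:
    """Environment: reward 1 if the action matches a hidden coin flip."""
    def reset(self):
        self.coin = random.choice([0, 1])
        return 0  # uninformative observation

    def step(self, action):
        reward = 1.0 if action == self.coin else 0.0
        return 0, reward, True  # observation, reward, done

class RandomAgent:
    """Agent: ignores observations and acts uniformly at random."""
    def act(self, observation):
        return random.choice([0, 1])

def run_episode(agent, env):
    obs, done, total = env.reset(), False, 0.0
    while not done:
        obs, reward, done = env.step(agent.act(obs))
        total += reward
    return total

# The same agent object can be run against any environment honoring the interface.
print(run_episode(RandomAgent(), CoinFlipEnv()))
```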
However, we could instead define "intent alignment" as "the optimal policy of the mesa objective would be good for humans".
I agree that we need a notion of "intent" that doesn't require a purely behavioral notion of a model's objectives, but I think it should also not be limited strictly to mesa-optimizers, which neither Rohin nor I expect to appear in practice. (Mesa-optimizers appear to me to be the formalization of the idea "what if ML systems, which by default are not well-described as EU maximizers, learned to be EU maximizers?" I suspect MIRI peop...
So, for example, this claims that either intent alignment + objective robustness or outer alignment + robustness would be sufficient for impact alignment.
Shouldn’t this be “intent alignment + capability robustness or outer alignment + robustness”?
Btw, I plan to post more detailed comments in response here and to your other post, just wanted to note this so hopefully there’s no confusion in interpreting your diagram.
Great post. My one piece of feedback is that not calling the post "Deconfusing 'Deconfusion'" might've been a missed opportunity. :)
I even went to this cooking class once where the chef proposed his own deconfusion of the transformations of food induced by different cooking techniques -- I still use it years later.
Unrelatedly, I would be interested in details on this.
The way I'd think of it, it's not that you literally need unanimous agreement, but that in some situations there may be subagents that are strong enough to block a given decision.
Ah, I think that makes sense. Is this somehow related to the idea that consciousness is more of a "last stop for a veto from the collective mind system" for already-subconsciously-initiated thoughts and actions? Struggling to remember where I read this, though.
It gets a little handwavy and metaphorical but so does the concept of a subagent.
Yeah, considering the fact tha...
Wouldn't decisions about e.g. which objects get selected and broadcast to the global workspace be made by a majority or plurality of subagents? "Committee requiring unanimous agreement" feels more like what would be the case in practice for a unified mind, to use a TMI term. I guess the unanimous agreement is only required because we're looking for strict/formal coherence in the overall system, whereas e.g. suboptimally-unified/coherent humans with lots of akrasia can have tug-of-wars between groups of subagents for control.
Great post. That Anakin meme is gold.
“Whenever you notice yourself saying ‘outside view’ or ‘inside view,’ imagine a tiny Daniel Kokotajlo hopping up and down on your shoulder chirping ‘Taboo outside view.’”
Somehow I know this will now happen automatically whenever I hear or read “outside view.” 😂
I've found this interview with Richard Lang about the "headless" method of interrogation helpful and think Sam's discussion provides useful context to bridge the gap to the scientific skeptics as well as to other meditative techniques and traditions (some of which are touched upon in this post). It also includes a pointing out exercise.
Late to the party, but here's my crack at it (ROT13'd since markdown spoilers made it an empty box without my text):
Fbzrguvat srryf yvxr n ovt qrny vs V cerqvpg gung vg unf n (ovt) vzcnpg ba gur cbffvovyvgl bs zl tbnyf/inyhrf/bowrpgvirf orvat ernyvmrq. Nffhzvat sbe n zbzrag gung gur tbnyf/inyhrf ner jryy-pncgherq ol n hgvyvgl shapgvba, vzcnpg jbhyq or fbzrguvat yvxr rkcrpgrq hgvyvgl nsgre gur vzcnpgshy rirag - rkcrpgrq hgvyvgl orsber gur rirag. Boivbhfyl, nf lbh'ir cbvagrq bhg, fbzrguvat orvat vzcnpgshy nppbeqvat gb guvf abgvba qrcraqf obgu ba gur inyhrf naq ba ubj "bowrpgviryl vzcnpgshy" vg vf (v.r. ubj qenfgvpnyyl vg punatrf gur frg bs cbffvoyr shgherf).
I'm not sure what is meant by this; would you mind explaining?
Also, the in-post link to the appendix is broken; it's currently linking to a private draft.