some people say that "winning is about not playing dominated strategies"

I do not believe this statement. As in, I do not currently know of a single person, associated either with LW or with decision-theory academia, who says "not playing dominated strategies is entirely action-guiding." So, as Raemon pointed out, "this post seems like it’s arguing with someone but I’m not sure who."
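
For concreteness, here is a minimal sketch of what "dominated strategy" standardly refers to; the payoff numbers and the `is_strictly_dominated` helper below are purely my own illustration, not anything from the post:

```python
# Toy illustration with made-up payoffs: a strategy is strictly dominated if some
# other available strategy yields a higher payoff against every opponent move.

payoffs = {
    # payoffs[my_strategy][opponent_move] = my payoff
    "A": {"left": 3, "right": 2},
    "B": {"left": 1, "right": 0},  # "B" is strictly dominated by "A"
}

def is_strictly_dominated(strategy: str, payoffs: dict) -> bool:
    """Return True if some other strategy beats `strategy` against every opponent move."""
    moves = payoffs[strategy].keys()
    return any(
        all(payoffs[other][m] > payoffs[strategy][m] for m in moves)
        for other in payoffs if other != strategy
    )

print(is_strictly_dominated("B", payoffs))  # True  -> never play "B"
print(is_strictly_dominated("A", payoffs))  # False -> this alone doesn't say what to play
```

Ruling out "B" still leaves the choice among the remaining strategies entirely open.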

In general, I tend to mildly disapprove of phrases like "a widely-used strategy", "we often encounter claims", etc., without any direct citations to the individuals who are purportedly making these mistakes. If the strategy really were that widely used, surely it would be trivial for the authors to quote a few examples off the top of their head, no? What does it say about them that they didn't?

I think it's not quite as clear as needing to shut down all other AGI projects or we're doomed; a small number of AGIs under control of different humans might be stable with good communication and agreements, at least until someone malevolent or foolish enough gets involved.

Realistically, in order to have a reasonable degree of certainty that this state can be maintained for more than a trivial amount of time, this would, at the very least, require a hard ban on open-source AI, as well as international agreements to strictly enforce transparency and compute restrictions, with the direct use of force if need be, especially if governments get much more involved in AI in the near-term future (which I expect will happen).

Do you agree with this, as a baseline?

Does this plan necessarily factor through using the intent-aligned AGI to quickly commit some sort of pivotal act that flips the gameboard and prevents other intent-aligned AGIs from being used malevolently by self-interested or destructive (human) actors to gain a decisive strategic advantage? After all, it sure seems less than ideal to find yourself in a position where you can solve the theoretical parts of value alignment,[1] but you cannot implement that in practice because control over the entire future light cone has already been permanently taken over by an AGI intent-aligned to someone who does not care about any of your broadly prosocial goals...

  1. ^

    In so far as something like this even makes sense, which I have already expressed my skepticism of many times, but I don't think I particularly want to rehash this discussion with you right now...

You've gotten a fair number of disagree-votes thus far, but I think it's generally correct to say that many (arguably most) prediction markets still lack the trading volume necessary to justify confidence that EMH-style arguments mean inefficiencies will be rapidly corrected. To a large extent, it's fair to say this is due to over-regulation and attempts at outright banning (perhaps the relatively recent 5th Circuit ruling in favor of PredictIt against the Commodity Futures Trading Commission is worth looking at as a microcosm of how these legal battles are playing out today).

Nevertheless, the standard theoretical argument that inefficiencies in prediction markets are exploitable and thus lead to a self-correcting mechanism still seems entirely correct, as Garrett Baker points out.
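
To make the exploitability argument concrete, here is a minimal sketch (my own toy numbers and function name, not anyone's actual model) of the expected profit from trading a binary contract whose market price differs from your credence; the per-contract edge exists whenever price and credence diverge, but whether anyone bothers to collect it depends on the volume, fees, and position limits mentioned above:

```python
# Toy sketch (made-up numbers): expected profit per binary contract (pays 1 if
# the event happens, 0 otherwise) for a trader whose credence in the event is
# `credence` and who can buy or sell one contract at the market `price`.

def expected_profit_per_contract(credence: float, price: float) -> float:
    """Expected profit of the better of buying or selling one contract at `price`."""
    buy = credence * (1 - price) - (1 - credence) * price    # simplifies to credence - price
    sell = (1 - credence) * price - credence * (1 - price)   # simplifies to price - credence
    return max(buy, sell)

# A contract priced at 0.30 when your credence is 0.45 offers a sizable edge...
print(expected_profit_per_contract(credence=0.45, price=0.30))  # ~0.15 per contract
# ...but with thin volume, fees, and capital lockup, the edge on a nearly-correct
# price may not be worth collecting, so small mispricings can persist.
print(expected_profit_per_contract(credence=0.45, price=0.44))  # ~0.01 per contract
```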

I think mind projection especially happens with value judgements - i.e. people treat "goodness" or "badness" as properties of things out in the world.

It's worth noting, I think, that Steve Byrnes has done a great job describing and analyzing this phenomenon in Section 2.2 of his post on Valence & Normativity. I have mentioned before that I think his post is excellent, so it seems worthwhile to signal-boost it here as well.

Cognitively speaking, treating value as a property of stuff in the world can be useful for planning

Also mentioned and analyzed in Section 2.3 of Byrnes's post :)

Ben Pace has said that perhaps he doesn't disagree with you in particular about this, but I sure think I do.[1]

I think the amount of stress incurred when doing public communication is nearly orthogonal to these factors, and in particular is, when trying to be as careful about anything as Zac is trying to be about confidentiality, quite high at baseline.

I don't see how the first half of this could be correct, and while the second half could be true, it doesn't seem to me to offer meaningful support for the first half either (instead, it seems rather... off-topic). 

As a general matter, even if it were the case that no matter what you say, at least one person will actively misinterpret your words, this fact would have little bearing on whether you can causally influence the proportion of readers/community members that end up with (what seem to you like) the correct takeaways from a discussion of that kind. 

Moreover, in a spot where you have something meaningful, responsible, etc., that you and your company have done to deal with safety issues, the major concern in your mind when communicating publicly is figuring out how to make it clear to everyone that you are on top of things without revealing confidential information. That is certainly stressful, but much less so than the additional constraint you have in a world in which you do not have anything concrete that you can back your generic claims of responsibility with, since that is a spot where you can no longer fall back on (a partial version of) the truth as your defense. For the vast majority of human beings, lying and intentional obfuscation with the intent to mislead are significantly more psychologically taxing than telling the truth as-you-see-it is.

Overall, I also think I disagree about the amount of stress that would be caused by conversations with AI safety community members. As I have said earlier:

AI safety community members are not actually arbitrarily intelligent Machiavellians with the ability to convincingly twist every (in-reality) success story into an (in-perception) irresponsible gaffe;[1] the extent to which they can do this depends very heavily on the extent to which you have anything substantive to bring up in the first place.

[1] Quite the opposite, actually, if the change in the wider society's opinions about EA in the wake of the SBF scandal is any representative indication of how the rationalist/EA/AI safety cluster typically handles PR stuff.

In any case, I have already made all these points in a number of ways in my previous response to you (which you haven't addressed, and which still seem to me to be entirely correct).

  1. ^

    He also said that he thinks your perspective makes sense, which... I'm not really sure about.

Definitely not trying to put words in Habryka's mouth, but I did want to make a concrete prediction to test my understanding of his position; I expect he will say that: 

  • the only relevant work is that which directly tackles what Nate Soares described as "the hard bits of the alignment challenge" (the identity of which Habryka basically agrees with Soares about)
  • nobody is fully on the ball yet 
  • but agent foundations-like research by MIRI-aligned or formerly MIRI-aligned people (Vanessa Kosoy, Abram Demski, etc.) is the one that's most relevant, in theory
  • however, in practice, even that is kinda irrelevant because timelines are short and that work is progressing too slowly to be useful even for deconfusion purposes

Edit: I was wrong.

Updatelessness sure seems nice from a theoretical perspective, but it has a ton of problems that go beyond what you just mentioned and that seem to me to basically doom the entire enterprise (at least with regards to what we are currently discussing, namely people):

  1. I am not aware of any method of operationalizing even a weak version of updatelessness in the context of cognitively limited human beings who do not have access to their own source code (for the kind of ex ante policy choice I have in mind, see the toy sketch after this list).
  2. I am pretty sure that a large portion of my values (and, by extension, the values of the vast majority of people) are indexical in nature, at least partly because my access to the outside world is mediated through sense data, which my S1 seems to value "terminally" and not as a mere proxy for preferences over current world-states. Indexicality seems to me to play very poorly with updatelessness (although I suspect you would know more about this than me, given your work in this area?).
  3. I don't currently know of a way that humans can remain updateless even in (what seems to me like an inordinately optimistic) world in which we can actually access the "source code" by figuring out how to model the abstract classical computation performed by a particular (and reified) subset of the brain's electronic circuit, basically because of the reasons I gave in my comment to Wei Dai that I referenced earlier ("The feedback loops implicit in the structure of the brain cause reward and punishment signals to "release chemicals that induce the brain to rearrange itself" in a manner closely analogous to and clearly reminiscent of a continuous and (until death) never-ending micro-scale brain surgery. To be sure, barring serious brain trauma, these are typically small-scale changes, but they nevertheless fundamentally modify the connections in the brain and thus the computation it would produce in something like an emulated state (as a straightforward corollary, how would an em that does not "update" its brain chemistry the same way that a biological being does be "human" in any decision-relevant way?)").
  4. I have a much broader skepticism about whether the concepts of "beliefs" and "values" make sense as distinct, coherent concepts that carve reality at the joints, which I think is reflected in some of the other points I made in my long list of questions and confusions about these matters. It doesn't really seem to me like updatelessness solves this, or even necessarily offers a concrete path forward on it.
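
For readers less familiar with the term, here is a toy sketch of the kind of ex ante policy choice "updatelessness" standardly refers to, using the classic counterfactual-mugging setup; the payoff numbers are the usual illustrative ones, and the function is purely my own:

```python
# Toy sketch of an ex ante ("updateless") policy choice, using the classic
# counterfactual-mugging setup with the usual illustrative numbers: a fair coin
# is flipped; on heads you receive 10,000 iff your policy is to pay 100 on
# tails; on tails you are simply asked to pay 100.

def ex_ante_value(policy_pays_on_tails: bool) -> float:
    """Expected value of a policy, evaluated before seeing how the coin lands."""
    heads_payoff = 10_000 if policy_pays_on_tails else 0
    tails_payoff = -100 if policy_pays_on_tails else 0
    return 0.5 * heads_payoff + 0.5 * tails_payoff

print(ex_ante_value(True))   # 4950.0 -> the ex-ante-optimal policy is to pay
print(ex_ante_value(False))  # 0.0

# An agent who decides only after seeing tails compares -100 to 0 within that
# branch and refuses; sticking to the ex-ante-optimal policy anyway is the sort
# of thing point 1 above says we have no clear way of operationalizing for humans.
```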

Of course, I don't expect that you are trying to literally say that going updateless gets rid of all the issues, but rather that thinking about it in those terms, after internalizing that perspective, helps put us in the right frame of mind to make progress on these philosophical and metaphilosophical matters moving forward. But, as I said at the end of my comment to Wei Dai:

I do not have answers to the very large set of questions I have asked and referenced in this comment. Far more worryingly, I have no real idea of how to even go about answering them or what framework to use or what paradigm to think through. Unfortunately, getting all this right seems very important if we want to get to a great future. Based on my reading of the general pessimism you have been signaling throughout your recent posts and comments, it doesn't seem like you have answers to (or even a great path forward on) these questions either despite your great interest in and effort spent on them, which bodes quite terribly for the rest of us.

Perhaps if a group of really smart, philosophy-inclined people who have internalized the lessons of the Sequences, without being wedded to the very specific set of conclusions MIRI has reached about what AGI cognition must be like (conclusions which seem to be contradicted by the modularity, lack of agentic activity, moderate effectiveness of RLHF, etc., i.e., the overall empirical information coming from recent SOTA models), were to be given a ton of funding and access and 10 years to work on this problem as part of a proto-Long Reflection, something interesting would come out. But that is quite a long stretch at this point.

OK, but what is your “intent”? Presumably, it’s that something be done in accordance with your values-on-reflection, right? 

No, I don't think so at all. Pretty much the opposite, actually; if it were in accordance with my values-on-reflection, it would be value-aligned to me rather than intent-aligned. Collapsing the meaning of the latter into the former seems entirely unwise to me. After all, when I talk about my intent, I am explicitly not thinking about any long reflection process that gets at the "core" of my beliefs or anything like that;[1] I am talking more about something like this:

I have preferences right now; this statement makes sense in the type of low-specificity conversation dominated by intuition where we talk about such words as though they referred to real concepts that point to specific areas of reality. Those preferences are probably not coherent, in the sense that I can probably be money pumped by an intelligent enough agent that sets up a strange-to-my-current-self scenario. But they still exist, and one of them is to maintain a sufficient amount of money in my bank account to continue living a relatively high-quality life. Whether I "endorse" those preferences or not is entirely irrelevant to whether I have them right now; perhaps you could offer a rational argument to eventually convince me that you would make much better use of all my money, and then I would endorse giving you that money, but I don't care about any of that right now. My current, unreflectively-endorsed self doesn't want to part with what's in my bank account, and that is what is guiding my actions, not an idealized, reified future version.

None of this means anything conclusive about whether I would ultimately endorse these preferences in the reflective limit, whether those preferences would be stable under ontology shifts that reveal how my current ontology is hopelessly confused and reifies the analogues of ghosts, whether there is any nonzero intersection between the end states of a process that tries to find my individual volition, or whether changes to my physical and neurological make-up keep my identity the same (in a decision-relevant sense relative to my values) when my memories and path through history change.
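
As a brief aside, the money-pump worry mentioned above can be made concrete with a toy sketch; the goods, fees, and function below are entirely my own hypothetical illustration of the standard point that cyclic preferences let someone charge you a small fee per trade and walk you in circles indefinitely:

```python
# Toy money pump (hypothetical goods and fees): an agent who prefers A to B,
# B to C, and C to A will pay a small fee for each "upgrade" and, after every
# three trades, hold what it started with while being strictly poorer.

preferred_over = {"A": "B", "B": "C", "C": "A"}  # preferred_over[x] = what x beats

def run_money_pump(start_good: str, fee: float, cycles: int) -> float:
    """Walk the agent around the preference cycle, charging `fee` per trade."""
    good, total_paid = start_good, 0.0
    for _ in range(3 * cycles):  # three trades per full cycle
        # Offer the good the agent prefers to what it currently holds.
        good = next(k for k, worse in preferred_over.items() if worse == good)
        total_paid += fee
    return total_paid  # the agent holds `start_good` again but has paid this much

print(run_money_pump("A", fee=0.01, cycles=100))  # ~3.00 paid to end up holding "A" again
```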

In any case, I am very skeptical of this whole values-on-reflection business,[2] as I have written about at length in many different spots (1, 2, 3 come to mind off the top of my head). I am loath to keep copying the exposition of the same ideas over and over and over again (it also probably gets annoying to read at some point), but here is a relevant sample:

Whenever I see discourse about the values or preferences of beings embedded in a physical universe that goes beyond the boundaries of the domains (namely, low-specificity conversations dominated by intuition) in which such ultimately fake frameworks function reasonably well, I get nervous and confused. I get particularly nervous if the people participating in the discussions are not themselves confused about these matters (I am not referring to [Wei Dai] in particular here, since [Wei Dai] has already signaled an appropriate level of confusion about this). Such conversations stretch our intuitive notions past their breaking point by trying to generalize them out of distribution without the appropriate level of rigor and care.

What counts as human "preferences"? Are these utility function-like orderings of future world states, or are they ultimately about universe-histories, or maybe a combination of those, or maybe something else entirely? Do we actually have any good reason to think that (some form of) utility maximization explains real-world behavior, or are the conclusions broadly converged upon on LW ultimately a result of intuitions about what powerful cognition must be like whose source is a set of coherence arguments that do not stretch as far as they were purported to? What do we do with the fact that humans don't seem to have utility functions and yet lingering confusion about this remained as a result of many incorrect and misleading statements by influential members of the community?

How can we use such large sample spaces when it becomes impossible for limited beings like humans or even AGI to differentiate between those outcomes and their associated events? After all, while we might want an AI to push the world towards a desirable state instead of just misleading us into thinking it has done so, how is it possible for humans (or any other cognitively limited agents) to assign a different value, and thus a different preference ranking, to outcomes that they (even in theory) cannot differentiate (either on the basis of sense data or through thought)?

In any case, are they indexical or not? If we are supposed to think about preferences in terms of revealed preferences only, what does this mean in a universe (or an Everett branch, if you subscribe to that particular interpretation of QM) that is deterministic? Aren't preferences thought of as being about possible worlds, so they would fundamentally need to be parts of the map as opposed to the actual territory, meaning we would need some canonical framework of translating the incoherent and yet supposedly very complex and multidimensional set of human desires into something that actually corresponds to reality? What additional structure must be grafted upon the empirically-observable behaviors in order for "what the human actually wants" to be well-defined?

[...]

What do we mean by morality as fixed computation in the context of human beings who are decidedly not fixed and whose moral development through time is almost certainly so path-dependent (through sensitivity to butterfly effects and order dependence) that a concept like "CEV" probably doesn't make sense? The feedback loops implicit in the structure of the brain cause reward and punishment signals to "release chemicals that induce the brain to rearrange itself" in a manner closely analogous to and clearly reminiscent of a continuous and (until death) never-ending micro-scale brain surgery. To be sure, barring serious brain trauma, these are typically small-scale changes, but they nevertheless fundamentally modify the connections in the brain and thus the computation it would produce in something like an emulated state (as a straightforward corollary, how would an em that does not "update" its brain chemistry the same way that a biological being does be "human" in any decision-relevant way?).

I do have some other thoughts on other parts of the post, which I might write out at some point.

  1. ^

    Except insofar as my current, unreflectively-endorsed version has preferences over what preferences I should have or how they should develop in the future (which I do, but their aggregate effect does not dominate in these spots).

  2. ^

    By which I mean, I am skeptical it exists as a coherent concept.
