Joey KL

Joey KL55

I don’t think this is accurate. I think most philosophy is done under motivated reasoning, but it is not straightforwardly about signaling group membership.

Joey KL10

Hi, any updates on how this worked out? Considering trying this...

Joey KL20

This is the most interesting answer I've ever gotten to this line of questioning. I will think it over!

Joey KL10

What observation could demonstrate that this code indeed corresponded to the metaphysically important sense of continuity across time? What would the difference be between a world where it did and a world where it didn't?

Joey KL10

Say there is a soul. We inspect a teleportation process, and we find that, just like your body and brain, the soul disappears on the transmitter pad, and an identical soul appears on the receiver. What would this tell you that you don't already know?

What, in principle, could demonstrate that two souls are in fact the same soul across time?

Joey KL32

It is epistemic relativism.

Questions 1 and 3 are explicitly about values, so I don't think they amount to epistemic relativism.

There seems to be a genuine question about what happens and which rules govern it, and you are trying to sidestep it by saying "whatever happens - happens".

I can imagine a universe with such rules that teleportation kills a person and a universe in which it doesn't. I'd like to know how does our universe work.

There seems to be a genuine question here, but it is not at all clear that there actually is one. It is pretty hard to characterize what this question amounts to, i.e. what the difference would be between two worlds where the question has different answers. I take OP to be espousing the view that the question isn't meaningful for this reason (though I do think they could have laid this out more clearly).

Joey KL10

You may find it helpful to read the relevant sections of The Conscious Mind by David Chalmers, the original thorough examination of his view:

Those considerations aside, the main way in which conceivability arguments can go wrong is by subtle conceptual confusion: if we are insufficiently reflective we can overlook an incoherence in a purported possibility, by taking a conceived-of situation and misdescribing it. For example, one might think that one can conceive of a situation in which Fermat's last theorem is false, by imagining a situation in which leading mathematicians declare that they have found a counterexample. But given that the theorem is actually true, this situation is being misdescribed: it is really a scenario in which Fermat's last theorem is true, and in which some mathematicians make a mistake. Importantly, though, this kind of mistake always lies in the a priori domain, as it arises from the incorrect application of the primary intensions of our concepts to a conceived situation. Sufficient reflection will reveal that the concepts are being incorrectly applied, and that the claim of logical possibility is not justified.

So the only route available to an opponent here is to claim that in describing the zombie world as a zombie world, we are misapplying the concepts, and that in fact there is a conceptual contradiction lurking in the description. Perhaps if we thought about it clearly enough we would realize that by imagining a physically identical world we are thereby automatically imagining a world in which there is conscious experience. But then the burden is on the opponent to give us some idea of where the contradiction might lie in the apparently quite coherent description. If no internal incoherence can be revealed, then there is a very strong case that the zombie world is logically possible.

As before, I can detect no internal incoherence; I have a clear picture of what I am conceiving when I conceive of a zombie. Still, some people find conceivability arguments difficult to adjudicate, particularly where strange ideas such as this one are concerned. It is therefore fortunate that every point made using zombies can also be made in other ways, for example by considering epistemology and analysis. To many, arguments of the latter sort (such as arguments 3-5 below) are more straightforward and therefore make a stronger foundation in the argument against logical supervenience. But zombies at least provide a vivid illustration of important issues in the vicinity.

(II.7, "Argument 1: The logical possibility of zombies". Pg. 98).

Joey KL1-1

Iterated Amplification is a fairly specific proposal for indefinitely scalable oversight, which doesn't involve any human in the loop (if you start with a weak aligned AI). Recursive Reward Modeling is imagining (as I understand it) a human assisted by AIs to continuously do reward modeling; DeepMind's original post about it lists "Iterated Amplification" as a separate research direction. 

"Scalable Oversight", as I understand it, refers to the research problem of how to provide a training signal to improve highly capable models. It's the problem which IDA and RRM are both trying to solve. I think your summary of scalable oversight: 

(Figuring out how to ease humans supervising models. Hard to cleanly distinguish from ambitious mechanistic interpretability but here we are.)

is inconsistent with how people in the industry use it. I think it's generally meant to refer to the outer alignment problem, providing the right training objective. For example, here's Anthropic's "Measuring Progress on Scalable Oversight for LLMs" from 2022:

To build and deploy powerful AI responsibly, we will need to develop robust techniques for scalable oversight: the ability to provide reliable supervision—in the form of labels, reward signals, or critiques—to models in a way that will remain effective past the point that models start to achieve broadly human-level performance (Amodei et al., 2016).

It references "Concrete Problems in AI Safety" from 2016, which frames the problem in a closely related way, as a kind of "semi-supervised reinforcement learning". In either case, it's clear what we're talking about is providing a good signal to optimize for, not an AI doing mechanistic interpretability on the internals of another model. I thus think it belongs more under the "Control the thing" header.

I think your characterization of "Prosaic Alignment" suffers from related issues. Paul coined the term to refer to alignment techniques for prosaic AI, not techniques which are themselves prosaic. Since prosaic AI is what we're presently worried about, any technique to align DNNs is prosaic AI alignment, by Paul's definition.

My understanding is that AI labs, particularly Anthropic, are interested in moving from human-supervised techniques to AI-supervised techniques, as part of an overall agenda towards indefinitely scalable oversight via AI self-supervision. I don't think Anthropic considers RLAIF an alignment endpoint in itself.

Joey KL20

I am very surprised that "Iterated Amplification" appears nowhere on this list. Am I missing something?

Joey KL166

More generally, I think that if mere-humans met very-alien minds with similarly-coherent preferences, and if the humans had the opportunity to magically fulfill certain alien preferences within some resource-budget, my guess is that the humans would have a pretty hard time offering power and wisdom in the right ways such that this overall went well for the aliens by their own lights (as extrapolated at the beginning), at least without some sort of volition-extrapolation.

Isn't the worst case scenario just leaving the aliens alone? If I'm worried I'm going to fuck up some alien's preferences, I'm just not going to give them any power or wisdom!

I guess you think we're likely to fuck up the aliens' preferences by the lights of their reflection process, but not by the lights of ours. But this just recurses to the meta level. If I really do care about an alien's preferences (as it feels like I do), why can't I also care about their reflection process (which is just a meta-preference)?

I feel like the meta level at which I no longer care about doing right by an alien is basically the meta level at which I stop caring about someone doing right by me. In fact, this is exactly how it seems mentally constructed: what I mean by "doing right by [person]" is "what that person would mean by 'doing right by me'". This seems either as simple as it naively looks, or sensitive to weird hyperparameters I'm not sure I care about anyway.
