
(That broad technical knowledge, as opposed to tacit skills, being the main reason you value a physics PhD is a really surprising response to me, and seems like an important part of the model that didn't come across in the post.)

Curious about what it would look like to pick up the relevant skills, especially the subtle/vague/tacit skills, in an independent-study setting rather than in academia. Also curious about the value of doing this, i.e., maybe it's just a stupid idea and it's better to just go do a PhD. Is the purpose of a PhD to learn the relevant skills, or to filter for them? (If you have already written something which suffices as a response, I'd be happy to be pointed to the relevant bits rather than having them restated.)

"Broad technical knowledge" should in some sense be the "easiest" one (not in terms of time investment, but in terms of predictable outcomes): read lots of textbooks (using material similar to your study guide).

Writing/communication, while vaguer, should also be learnable by just writing a lot of things, publishing them on the internet for feedback, reflecting on your process, etc.

Something like "solving novel problems" seems like a much "harder" one. I don't know if this is a skill with a simple "core" or a grab-bag of tactics. Textbook problems have a "meant-to-be-solved" flavor, and I find one can be very good at solving these without being good at tackling novel problems. Another thing I notice is that when some people (myself included) try solving novel problems, we can end up on a path which gets there eventually, but which, given "correct" feedback along the way, we could have integrated OOMs faster.

I'm sure there are other vague-skills which one ends up picking up from a physics PhD. Can you name others, and how one picks them up intentionally? Am I asking the wrong question?

(warning: armchair evolutionary biology)

Another consideration for orca intelligence: they dodge the Fermi paradox by not having arms.

Assume the main driver of genetic selection for intelligence is the social arms race. As soon as a species gets intelligent enough from this arms race (see humans), they start using their intelligence to manipulate the environment, and start civilization. But orcas mostly lack the external organs for manipulating the environment, so they can keep social-arms-racing their way to higher intelligence way past the point of "criticality".

This should be checkable, i.e., how long have orcas (or orca forefathers) been socially arms-racing? I tried asking Claude to no avail, and I lack the domain knowledge to quickly look it up myself. One could also check genetic change over time; perhaps a social arms race is visible in this data? Do we know what this looks like in humans and orcas?

>As a result, we can make progress toward automating interpretability research by coming up with experimental setups that allow AIs to iterate.

This sounds exactly like the kind of progress needed to get closer to game-over-AGI. Applying current methods of automation to alignment seems fine, but if you are trying to push the frontier of what intellectual progress can be achieved using AIs, I fail to see your comparative advantage relative to pure capabilities researchers.

I do buy that there might be credit to the idea of developing the infrastructure/ability to do a lot of automated alignment research, which gets cashed out when we are very close to game-over-AGI, even if it comes at the cost of pushing the frontier some.

Transfer learning is dubious; doing philosophy has worked pretty well for me thus far for learning how to do philosophy. More specifically: pick a topic you feel confused about or a problem you want to solve (AI kills everyone, oh no?). Sit down and try to do original thinking, probably using your external tool of preference to write down your thoughts. Then introspect, live or afterwards, on whether your process is working and how you can improve it. Repeat.
This might not be the most helpful, but most people seem to fail at "being comfortable sitting down and thinking for themselves", and empirically, being told to just do it seems to work.

Maybe one crucial object-level bit has to do with something like "mining bits from vague intuitions", as Tsvi explains at the end of this comment; I don't know how to describe it well.

>It seems like all of the many correct answers to what X would've wanted might not include the AGI killing everyone.
Yes, but if it wants to kill everyone, it would pick one which does. The space of "all possible actions" also contains some friendly actions.

>Wrt the continuity property, I think Max Harm's corrigibility proposal has that
I think it understands this and is aiming for that, yeah. It looks like a lot of work still needs to be done to flesh it out.

I don't have a good enough understanding of ambitious value learning & Roger Dearnaley's proposal to properly comment on these. Skimming + priors put fairly low odds on them dealing with this in the proper manner, but I could be wrong.

The step from "tell AI to do Y" to "AI does Y" is a big part of the entire alignment problem. The reason chatbots might seem aligned in this sense is that the thing you ask for often lives in a continuous space, and when not-too-strong optimization pressure is applied, asking for Y and getting Y+epsilon is good enough. This ceases to be the case when your Y is complicated and high optimization pressure is applied, UNLESS you can find a Y which has a strong continuity property in the sense you care about, and I am unaware of anyone who knows how to do that.
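A toy numerical sketch of this point (mine, not from the comment; all names and functions are made up for illustration): under weak optimization pressure, optimizing a proxy for Y lands near what you wanted, but under strong pressure the same proxy is driven into regions where it diverges from the true objective.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_value(x):
    # What we actually want ("Y"): best around x = 1, degrades beyond it.
    return x if x <= 1 else 2 - x

def proxy(x):
    # The specified objective: agrees with true_value for x <= 1,
    # but keeps rewarding larger x where the two come apart.
    return x

def optimize(pressure):
    # More optimization pressure = searching more aggressively over a
    # wider range of candidate actions, then picking the proxy-best one.
    candidates = rng.uniform(0, pressure, size=1000)
    best = max(candidates, key=proxy)
    return true_value(best)

weak = optimize(pressure=1.2)   # mild pressure: Y+epsilon is good enough
strong = optimize(pressure=100) # strong pressure: proxy-optimal, true-terrible
assert weak > strong
```

The point of the sketch is only that "close in proxy" stops implying "close in value" once the optimizer can reach the region where the two objectives diverge, which is the continuity property the comment is asking for.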

Not to mention that "do what (pre-ASI) X, having considered this carefully for a while, would have wanted you to do" does not narrow down behaviour to a small enough space. There will be many interpretations which look reasonable to you, many of which allow for satisfying the instruction while still allowing the AI to kill everyone.

While I have a lot of respect for many of the authors, this work feels to me like it's mostly sweeping the big problems under the rug. It might at most be useful for AI labs to make a quick buck, or do some safety-washing, before we all die. I might be misunderstanding some of the approaches proposed here, so some of my critiques might be invalid.

My understanding is that the paper proposes that the AI implement and work with a human-interpretable world model, and that safety specifications are given in this world-model/ontology.

But given an ASI with such a world model, I don't see how one would specify properties such as "hey, please don't hyperoptimize squiggles or goodhart this property". Any specification I can think of mostly leaves room for the AI to abide by it and still kill everyone somehow. This recurses back to "just solve alignment/corrigibility/safe-superintelligent-behaviour".

Never mind getting an AI which actually performs all of its cognition in the ontology you provided for it (that would probably count as real progress to me). How do you know that just because the internal ontology says "X", "X" is what the AI actually does? See this post.

If you are going to prove vague things about your AI and have it be of any use at all, you'd want to prove properties in the style of "this AI has the kind of 'cognition/mind' for which it is 'beneficial for the user' to have running than not" and "this AI's 'cognition/mind' lies in an 'attractor space' where violated assumptions, bugs and other errors cause the AI to follow the desired behavior anyway".

For sufficiently powerful systems, having proofs about output behavior mostly does not narrow down your space to safe agents. You want proofs about their internals. But that requires having a less confused notion of what to ask for in the AI's internals such that it is a safe computation to run, never mind formally specifying it. I don't understand, and haven't found anyone who seems to understand, enough of the relevant properties of minds, of what it means for something to be 'beneficial to the user', or of how to construct powerful optimizers which fail non-catastrophically. It appears to me that we're not bottlenecked on proving these properties; the bottleneck is identifying and understanding what shape they have.

I do expect some of these approaches, in the very limited scope of things you can formally specify, to allow for more narrow AI applications, promote AI investment, give rise to new techniques, and non-trivially shorten the time until we are able to build superhuman systems. My vibes regarding this are made worse by how various existing methods are listed in a "safety ranking". It lists RLHF, Constitutional AI & model-free RL as safer than unsupervised learning, but to me it seems like these methods instill stable agent-like behavior on top of a prediction engine, where there previously was either none or nearly none. They make no progress on the bits of the alignment problem which matter, but they do let AI labs create new and better products, make more money, fund more capabilities research, etc. I predict that future work along these lines will mostly have similar effects: little progress on the bits which matter, but useful capabilities insights along the way, which get incorrectly labeled alignment.

You can totally have something which is trying to kill humanity in this framework, though. Imagine something in the style of ChaosGPT: locally agentic & competent enough to use state-of-the-art AI biotech tools to synthesize dangerous viruses or compounds to release into the atmosphere. (Note that in this example the critical part is the narrow-AI biotech tools, not the chaos agent.)

You don't need solutions to embedded agency, goal-content integrity & the like to build this. It is easier to build, and earlier in the tech tree, than crisp maximizers. It will not be stable enough to coherently take over the lightcone, just coherent enough to fold some proteins and print them.

But why would anyone do such a stupid thing?

Unless I misunderstand the confusion, a useful line of thought which might resolve some things:

Instead of analyzing whether you yourself are conscious or not, analyze what is causally upstream of your mind thinking that you are conscious, or your body uttering the words "I am conscious".

Similarly, you could analyze whether an upload would think similar thoughts, or say similar things. What about a human doing the computations manually? What about a pure mathematical object?

A couple of examples of where to go from there:
- If they have the same behavior, perhaps they are the same?
- If they have the same behavior, but you still think there is a difference, try to find out why you think there is a difference: what is causally upstream of this thought/belief?
