Comments

JuliaHP

>though maintenance might suck idk

Yeah, and I'm guessing very expensive. If something is being given away for cheap/free, the true market value of the good is likely negative. It probably makes sense to think more about that bit before concluding that obtaining a castle is a good idea.

JuliaHP

This to me seems akin to "sponge-alignment", i.e. not building a powerful AI.

We understand personas because they are simulating human behavior, which we understand. But that human behavior is mostly limited to human capabilities (except for maybe speed-up possibilities).

Building truly powerful AIs will probably involve systems that do something different from human brains, or at least do not grow with the human learning biases that produce the human behaviors we are familiar with.

If the "power" of the AI comes through something else than the persona, then trusting the persona won't do you much good.

JuliaHP

I do believe that if Altman does manage to create his superAIs, the first such AI eats Altman and makes squiggles. But if I were to engage in the hypothetical where nice corrigible superassistants are just magically created, Altman does not appear to take seriously the future he claims to be steering towards.

The world where "everyone has a superassistant" is inherently volatile/unstable/dangerous, due to an incredibly large offence-defence asymmetry: superassistants attacking fragile fleshbags (with optimized viruses, bacteria, molecules, nanobots, etc.) or hijacking fragile minds with supermemes.

Avoiding this kind of outcome seems difficult to me. Non-systematic "patches" are always workaroundable: if OpenAI's superassistant refuses your request to destroy the world, use it to build your own superassistant, or use it for subtasks, etc. Humans are fragile fleshbags, and if strong optimization is ever pointed in their direction, they die.

There are ways to make such a world stable, but all of them that I can see look incredibly authoritarian, something Altman says he's not aiming for. But Altman does not appear to be proposing any alternative as to how this will turn out fine, and I am not aware of any research agenda at OpenAI trying to figure out how "giving everyone a superoptimizer" will result in a stable world with humans doing human things.

I know only three coherent ways to interpret what Altman is saying, and none of them take the writing seriously at the object level:
1) I wanted the stock to go up and wrote words which do that
2) I didn't really think about it, oops
3) I'm actually gonna keep the superassistants all to myself and rule, and this nicecore writing will make people support me as I approach the finish line

This is less meant to be critical of the writing, and more me asking for help with how to actually make sense of what Altman says.

JuliaHP

(That broad technical knowledge (as opposed to tacit skills) is the main reason you value a physics PhD is a really surprising response to me, and seems like an important part of the model that didn't come across in the post.)

JuliaHP

Curious about what it would look like to pick up the relevant skills, especially the subtle/vague/tacit skills, in an independent-study setting rather than in academia. As well as the value of doing this, i.e. maybe it's just a stupid idea and it's better to just go do a PhD. Is the purpose of a PhD to learn the relevant skills, or to filter for them? (If you have already written stuff which suffices as a response, I'd be happy to be pointed to the relevant bits rather than having them restated.)

"Broad technical knowledge" should be in some sense the "easiest" (not in terms of time-investment, but in terms of predictable outcomes), by reading lots of textbooks (using similar material as your study guide).

Writing/communication, while more vague, should also be learnable by just writing a lot of things, publishing them on the internet for feedback, reflecting on your process, etc.

Something like "solving novel problems" seems like a much "harder" one. I don't know if this is a skill with a simple "core" or a grab-bag of tactics. Textbook problems take on a "meant-to-be-solved" flavor, and I find one can be very good at solving these without being good at tackling novel problems. Another thing I notice is that when some people (myself included) try solving novel problems, we can end up on a path which gets there eventually, but with "correct" feedback the integration would go OOM faster.

I'm sure there are other vague skills which one ends up picking up from a physics PhD. Can you name others, and say how one might pick them up intentionally? Am I asking the wrong question?

JuliaHP

(warning: armchair evolutionary biology)

Another consideration for orca intelligence: they dodge the Fermi paradox by not having arms.

Assume the main driver of genetic selection for intelligence is the social arms race. As soon as a species gets intelligent enough from this arms race (see humans), it starts using its intelligence to manipulate the environment, and starts civilization. But orcas mostly lack the external organs for manipulating the environment, so they can keep boosting intelligence through the social arms race way past that point of "criticality".

This should be checkable, i.e. how long have orcas (or orca ancestors) been socially arms-racing? I tried asking Claude to no avail, and I lack the domain knowledge to quickly look it up myself. Perhaps one could also check genetic change over time; perhaps the social arms race is something you can see in this data? Do we know what this looks like in humans and orcas?

JuliaHP

"As a result, we can make progress toward automating interpretability research by coming up with experimental setups that allow AIs to iterate."
This sounds exactly like the kind of progress which is needed in order to get closer to game-over-AGI. Applying current methods of automation to alignment seems fine, but if you are trying to push the frontier of what intellectual progress can be achieved using AIs, I fail to see your comparative advantage relative to pure capabilities researchers.

I do buy that there might be something to the idea of developing the infrastructure/ability to do a lot of automated alignment research, which gets cashed out when we are very close to game-over-AGI, even if it comes at the cost of pushing the frontier some.

JuliaHP

Transfer learning is dubious; doing philosophy has worked pretty well for me thus far for learning how to do philosophy. More specifically: pick a topic you feel confused about or a problem you want to solve (AI kills everyone, oh no?). Sit down and try to do original thinking, and probably use some external tool of preference to write down your thoughts. Then introspect, live or afterwards, on whether your process is working and how you can improve it; repeat.
This might not be the most helpful, but most people seem to fail at "being comfortable sitting down and thinking for themselves", and empirically, being told to just do it seems to work.

Maybe one crucial object-level bit has to do with something like "mining bits from vague intuitions", like Tsvi explains at the end of this comment; I don't know how to describe it well.

JuliaHP

>It seems like all of the many correct answers to what X would've wanted might not include the AGI killing everyone.
Yes, but if it wants to kill everyone, it will pick one which does. (The space of "all possible actions" also contains some friendly actions; that doesn't help either.)
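
To spell out that selection effect, here is a minimal toy sketch (my own, with made-up numbers): a large pool of candidate interpretations that all pass a "looks like what X would've wanted" check still contains harmful ones, and an adversarial picker will find them.

```python
import random

# Toy model: each candidate interpretation of "what X would've wanted" has a
# plausibility score (how reasonable it looks to us) and a hidden harm value
# (how bad the resulting behaviour actually is). All numbers are made up.
random.seed(0)
candidates = [(random.uniform(0.5, 1.0), random.uniform(0.0, 1.0))
              for _ in range(10_000)]

# Only interpretations that look sufficiently reasonable survive the check.
acceptable = [c for c in candidates if c[0] > 0.9]

benign_pick = random.choice(acceptable)                 # any acceptable one
adversarial_pick = max(acceptable, key=lambda c: c[1])  # worst acceptable one

print(len(acceptable), "acceptable interpretations")
print("harm of a random acceptable pick:", round(benign_pick[1], 2))
print("harm of the adversarial pick:    ", round(adversarial_pick[1], 2))
```

The existence of harmless acceptable interpretations doesn't help once the thing doing the picking is optimizing against you.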

>Wrt the continuity property, I think Max Harm's corrigibility proposal has that
I think it understands this and is aiming to have that, yeah. It looks like a lot of work still needs to be done to flesh it out.

I don't have a good enough understanding of ambitious value learning & Roger Dearnaley's proposal to properly comment on these. Skimming + priors put fairly low odds on them dealing with this in the proper manner, but I could be wrong.

JuliaHP

The step from "tell AI to do Y" to "AI does Y" is a big part of the entire alignment problem. The reason chatbots might seem aligned in this sense is that the thing you ask for often lives in a continuous space, and when not-too-strong optimization pressure is applied, then when you ask for Y, Y+epsilon is good enough. This ceases to be the case when your Y is complicated and high optimization pressure is applied, UNLESS you can find a Y which has a strong continuity property in the sense you care about, and I am not aware of anyone who knows how to do that.

Not to mention that "Do what (pre-ASI) X, having considered this carefully for a while, would have wanted you to do" does not narrow down behaviour to a small enough space. There will be many interpretations that look reasonable to you, many of which can be satisfied while still allowing the AI to kill everyone.
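
To gesture at the optimization-pressure point, here is a minimal toy sketch (my own, with made-up distributions, not a model of any real system): most candidate actions are scored by the proxy as roughly Y plus a small epsilon, but a rare few look great to the proxy while being bad under Y. A weak search almost never finds those, so the proxy-best action is typically genuinely good; a strong search reliably finds them, and the proxy-best action becomes actively bad.

```python
import random

random.seed(0)

def sample_action():
    """One candidate action: returns (true value under Y, proxy score).

    Most actions are scored roughly correctly (proxy = Y + small epsilon),
    but a rare few "proxy-hacking" actions look great to the proxy while
    being bad under Y. All distributions here are made up for illustration.
    """
    true_value = random.gauss(0, 1)
    if random.random() < 0.001:  # rare proxy-hacking action
        return true_value - 5.0, true_value + 10.0
    return true_value, true_value + random.uniform(-0.3, 0.3)

def optimize(pressure):
    """Search `pressure` candidates; return the true value of the candidate
    that scores best on the proxy (which is all the optimizer sees)."""
    best_true, _ = max((sample_action() for _ in range(pressure)),
                       key=lambda c: c[1])
    return best_true

print("weak pressure   (50 candidates):    ", round(optimize(50), 2))
print("strong pressure (200000 candidates):", round(optimize(200_000), 2))
```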
