quila

suffering-focused altruist/longtermist, focused on ASI alignment


my ea forum account

my pgp public key, mainly to prevent future LLM impersonation; you can always ask me to sign a dated message. [the private key is currently stored in plaintext within an encrypted drive, so it is vulnerable to being read by local programs]:

-----BEGIN PGP PUBLIC KEY BLOCK-----

mDMEZiAcUhYJKwYBBAHaRw8BAQdADrjnsrbZiLKjArOg/K2Ev2uCE8pDiROWyTTO
mQv00sa0BXF1aWxhiJMEExYKADsWIQTuEKr6zx3RBsD/QW3DBzXQe0TUaQUCZiAc
UgIbAwULCQgHAgIiAgYVCgkICwIEFgIDAQIeBwIXgAAKCRDDBzXQe0TUabWCAP0Z
/ULuLWf2QaljxEL67w1b6R/uhP4bdGmEffiaaBjPLQD/cH7ufTuwOHKjlZTIxa+0
kVIMJVjMunONp088sbJBaQi4OARmIBxSEgorBgEEAZdVAQUBAQdAq5exGihogy7T
WVzVeKyamC0AK0CAZtH4NYfIocfpu3ADAQgHiHgEGBYKACAWIQTuEKr6zx3RBsD/
QW3DBzXQe0TUaQUCZiAcUgIbDAAKCRDDBzXQe0TUaUmTAQCnDsk9lK9te+EXepva
6oSddOtQ/9r9mASeQd7f93EqqwD/bZKu9ioleyL4c5leSQmwfDGlfVokD8MHmw+u
OSofxw0=
=rBQl
-----END PGP PUBLIC KEY BLOCK-----
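
(if you ever want to check such a signature, a minimal sketch using the standard gpg command-line tool is below; the filenames are placeholders, not files i host anywhere:)

    # minimal sketch: verify a clearsigned message against the key above.
    # "quila_pubkey.asc" and "signed_message.asc" are placeholder filenames.
    import subprocess

    # import the public key block (saved from this page) into a local keyring
    subprocess.run(["gpg", "--import", "quila_pubkey.asc"], check=True)

    # verify the clearsigned message; gpg exits non-zero if the signature is bad
    subprocess.run(["gpg", "--verify", "signed_message.asc"], check=True)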

I have signed no contracts or agreements whose existence I cannot mention.

Comments

quila20

a possible research direction which i don't know if anyone has explored: what would a training setup which provably creates a (probably[1]) aligned system look like?

my current intuition, which is not good evidence here beyond elevating the idea from noise, is that such a training setup might somehow leverage how the training data and {subsequent-agent's perceptions/evidence stream} are sampled from the same world, albeit with different sampling procedures. for example, the training program could take in both a dataset and an outer-alignment-goal-function, and select for prediction of the dataset (to build up ability) while also doing something else to the AI-in-training; i have no idea what that something else would look like (and finding it seems like most of this problem).
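
(to make the shape of that concrete, here's a minimal structural sketch; every name in it is hypothetical, and the alignment term is left as an explicit placeholder since that's the open part:)

    # structural sketch only: a trainer that takes in both a dataset and an
    # outer-alignment goal function. the prediction term selects for ability;
    # the "something else" is intentionally left unimplemented.
    # 'model' is assumed to expose prediction_loss() and update().
    from typing import Any, Callable, Sequence

    def alignment_pressure(model: Any, outer_goal: Callable[[Any], float]) -> float:
        """the unknown 'something else'; the open part of the idea."""
        raise NotImplementedError

    def train(dataset: Sequence[Any],
              outer_goal: Callable[[Any], float],
              model: Any,
              steps: int = 1000) -> Any:
        for _ in range(steps):
            for example in dataset:
                loss = model.prediction_loss(example)          # capability: predict the dataset
                loss += alignment_pressure(model, outer_goal)  # the open problem
                model.update(loss)
        return model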

has this been thought about before? is this feasible? why or why not?

(i can clarify if any part of this is not clear.)

(background motivator: in case there is no finite-length general-purpose search algorithm[2], alignment may have to be done on trained systems / learners)

  1. ^

    (because in principle, it's possible to get unlucky with sampling for the dataset. compare: it's possible for an unlucky sequence of evidence to cause an agent to take actions which are counter to its goal.)

  2. ^

    by which i mean a program capable of finding something which meets any given criterion met by at least one thing (or writing 'undecidable' in self-referential edge cases)
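
    (as a type-signature restatement of that, with illustrative names only:)

        # illustrative restatement of the footnote: one finite program which,
        # given any criterion satisfied by at least one object, returns such an
        # object, or "undecidable" in self-referential edge cases.
        from typing import Any, Callable, Union

        Criterion = Callable[[Any], bool]

        def general_search(criterion: Criterion) -> Union[Any, str]:
            # hypothetical; the background worry above is that no finite
            # program has this behaviour in general
            raise NotImplementedError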

quila10

There are all sorts of "strategies" (turn it off, raise it like a kid, disincentivize changing the environment, use a weaker AI to align it) that people come up with when they're new to the field of AI safety, but that are ineffective. And their ineffectiveness is only obvious and explainable by people who specifically know how AI behaves.

yep, but the first three all fail for the shared reason of "programs will do what their code says to do, including in response to your efforts". (the fourth one, 'use a weaker AI to align it', is at least obviously not itself a solution. the weakest form of it, using an LLM to assist an alignment researcher, is possible, and some less weak forms likely are too.)

when i think of other 'newly heard of alignment' proposals, like boxing, most of them seem to fail because the proposer doesn't actually have a model of how this is supposed to work or help in the first place. (the strong version of 'use ai to align it' probably fits better here)

(there are some issues which a programmatic model doesn't automatically make obvious to a human: they must follow from it, but one could fail to see them without making that basic mistake. probable environment hacking and decision theory issues come to mind. i agree that on general priors this is some evidence that there are deeper subjects that would not be noticed even conditional on those researchers approving a solution.)

i guess my next response then would be that some subjects are bounded, and we might notice (if not 'be able to prove') such bounds telling us 'there's nothing more beyond what you have already written down', which would be negative evidence (strength depending on how strongly we've identified a bound). (this is more of an intuition; i don't know how to elaborate on it)

(also on what johnswentworth wrote: a similar point i was considering making is that the question is set up in a way that forces you into playing a game of "show how you'd outperform magnus carlsen at chess", but with those researchers and alignment theory in place of carlsen and chess - for any consideration you can think of, one can respond that those researchers will probably also think of it, which might preclude them from actually approving, which makes the conditional 'they approve but it's wrong'[1] harder to be true and basically dependent on them instead of on object-level properties of alignment.)

i am interested in reading more arguments about the object-level question if you or anyone else has them.

If the solution to alignment were simple, we would have found it by now [...] That there is one simple thing from which comes all of our values, or a simple way to derive such a thing, just seems unlikely.

the pointer to values does not need to be complex (even if the values themselves are)

If the solution to alignment were simple, we would have found it by now

generally: simple things don't have to be easy to find. the hard part can be locating them within some huge space of possible things. (math (including its use in laws of physics) comes to mind?). (and specifically to alignment: i also strongly expect an alignment solution to ... {have some set of simple principles from which it can be easily derived (regardless of whether the program itself ends up long)}, but idk if i can legibly explain why. real complexity usually results from stochastic interactions in a process, but "aligned superintelligent agent" is a simply-defined, abstract thing?)

  1. ^

    i guess you actually wrote 'they don't notice flaws', which is ambiguous between 'they approve' and 'they don't find affirmative failure cases'. and maybe the latter was your intent all along.

    it's understandable because we do have to refer to humans to call something unintuitive.

quila10

That for every true alignment solution, there are dozens of fake ones.

Is this something that I should be seriously concerned about?

if you truly believe in a 1-to-dozens ratio between[1] real and 'fake' (endorsed by eliezer and others but unnoticedly flawed) solutions, then yes. in that case, you would naturally favor something like human intelligence augmentation, at least if you thought it had a chance of succeeding greater than p(a solution is proposed which eliezer and others deem correct) × 1/24
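
(spelled out, with the 'dozens' made concrete as an assumed n flawed-but-endorsed candidates per real one, that comparison is:)

    % the comparison above, spelled out; n is an assumption standing in for
    % "dozens of fake ones per true one", so a fraction 1/(1+n) of endorsed
    % solutions would be real (about 1/25 for n = 24, i.e. roughly the 1/24
    % factor used above)
    \[
      P(\text{augmentation succeeds}) \;>\;
      P(\text{a solution is proposed and endorsed by eliezer and others}) \times \frac{1}{1+n}
    \]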

I believe that before we stumble on an alignment solution, we will stumble upon an "alignment solution" - something that looks like an alignment solution, but is flawed in some super subtle, complicated way that means that Earth still gets disassembled into compute or whatever, but the flaw is too subtle and complicated for even the brightest humans to spot

i suggest writing why you believe that. in particular, how do you estimate the prevalence of 'unnoticeably flawed' alignment solutions given they are unnoticeable (to any human)?[2] where does "for every true alignment solution, there are dozens of fake ones" come from?

A really complicated and esoteric, yet somehow elegant

why does the proposed alignment solution have to be really complicated? overlooked mistakes become more likely as complexity increases, so this premise favors your conclusion.

  1. ^

    (out of the ones which might be proposed, to avoid technicalities about infinite or implausible-to-be-thought-of proposals)

  2. ^

    (there are ways you could, in principle, estimate this: for example, if there were a pattern of the researchers continually noticing they made increasingly unintuitive errors up until the 'final' version (where they no longer notice, but presumably this pattern itself would be noticed); or extrapolation from general principles about {some class of programming that includes alignment} (?) being easy to make hard-to-notice unintuitive mistakes in)

quila40

good to hear it's at least transparent enough for you to describe it directly like this. (edit: though the points in dawnlights post seem scarier)

quila1814

she attempted to pressure me to take MDMA under her supervision. I ended up refusing, and she relented; however, she then bragged that because she had relented, I would trust her more and be more likely to take MDMA the next time I saw her.

this seems utterly evil, especially given MDMA is known as an attachment-inducing drug.


edit: more generally, it seems tragic for people who are socially vulnerable and creative to end up paired with adept manipulators.

a simple explanation is that because creativity is (potentially very) useful, vulnerable creative people will be targets for manipulation. but i think there are also dynamics in communities with higher [illegibility-tolerance? esoterism?] which enable this, which i don't know how to write about. i hope someone tries to write about it.

quila20

upvoted; i think this article would be better with a comparison to the recommendations in thomas kwa's shortform about air filters

quila10

But maybe you only want to "prove" inner alignment and assume that you already have an outer-alignment-goal-function

correct, i'm imagining these being solved separately

quila20

a moral intuition i have: to avoid culturally- or conformity-motivated cognition, it's useful to ask:

if we were starting over, new to the world but with all the technology we have now, would we recreate this practice?

example: we start out, and there's us, and these innocent fluffy creatures that can't talk to us, but they can be our friends. we're just learning about them for the first time. would we, at some point, spontaneously choose to kill them and eat their bodies, despite having plant-based foods, supplements, vegan-assuming nutrition guides, etc? to me, the answer seems obviously not. the idea would not even cross our minds.

(i encourage picking other topics and seeing how this applies)

quila10

Status: Just for fun

it was fun to read this :]

All intelligent minds seek to optimise for their value function. To do this, they will create environments where their value function is optimised.

in case you believe this [disregard if not], i disagree and am willing to discuss here. in particular i disagree with the 'create environments' part: the idea that all goal functions (or only some subset, like selected-for ones; i'm also willing to argue against this weaker claim[1]) would be maximally fulfilled (also) by creating some 'small' simulation (made of a low % of the reachable universe).

(though i also disagree with the 'all' in the quote's first sentence[2]. i guess i'd also be willing to discuss that)

  1. ^

    for this weaker claim: many humans are a counterexample: selected-for beings whose values would not be satisfied just by creating a simulation, because they care about suffering outside the simulation too.

  2. ^

    my position: 'pursues goals' is conceptually not a property of intelligence, and not all possible intelligent systems pursue goals (and in fact pursuing goals is a very specific property, technically rare in the space of possible intelligent programs).

quila10

this could have been noise, but i've noticed an increase in spy-fearing text among what i've read in the past few days[1]. i actually don't know how much this concern is shared by LW users, so i think it might be worth writing that, in my view:

  • (AFAIK) both governments[2] are currently reacting inadequately to unaligned optimization risk. as a starting prior, there's no strong reason to fear one government {observing/spying on} ML conferences/gatherings more than the other, absent evidence that one or the other will start taking unaligned optimization risks very seriously, or that one or the other is prone to race towards ASI.
    • (AFAIK, we have more evidence that the U.S. government may try to race, e.g. this, but i could have easily missed evidence as i don't usually focus on this)
      • tangentially, a more-pervasively-authoritarian government could be better situated to prevent unilaterally-caused risks (cf. a similar argument in 'The Vulnerable World Hypothesis'), if it sought to. (edit: and if the AI labs closest to causing those risks were within its borders, which they are not atm)
      • this argument feels sad (or reflective of a sad world?) to me to be clear, but it seems true in this case

that said i don't typically focus on governance or international-AI-politics, so have not put much thought into this.

 

  1. ^

    examples: yesterday, saw this twitter/x post (via this quoting post)

    today, opened lesswrong and saw this shortform about two uses of the word spy and this shortform about how it's hard to have evidence against the existence of manhattan projects

    this was more than usual, and i sense that it's part of a pattern

  2. ^

    i.e., those of the US and China
