oki! in this scenario, i guess i'm imagining humans/humanity becoming ever-more-artificial (like, ever-more-[human/mind]-made) and ever-more-intelligent (like, eventually much more capable than anything that might be created by 2100), so this still seems like a somewhat unnatural framing to me

Can we safely automate alignment research?

Kaarel2d10

What then? One option is to never build superintelligence. But there’s also another option, namely: trying to get access to enhanced human labor, via the sorts of techniques I discussed in my post on waystations (e.g., whole brain emulation). In particular: unlike creating an alignment MVP, which plausibly requires at least some success in learning how to give AIs human-like values, available techniques for enhancing human labor might give you human-like values by default, while still resulting in better-than-human alignment research capabilities. Call this an “enhanced human labor” path.^[12]
[12]: Though: note that if you thought that even an alignment MVP couldn’t solve the alignment problem, you need some story about why your enhanced human labor would do better.

something that is imo important but discordant with the analysis you give here:
* humans/humanity could also just continue becoming more intelligent/capable (i mean: in some careful, self-conscious, deliberate fashion; so not like: spawning a random alien AI that outfooms humans; of course, what this means is unclear — it would imo need to be figured out ever-better as we proceed), like maybe forever

Davey Morse's Shortform

Kaarel13d30

even if you're mediocre at coming up with ideas, as long as it's cheap and you can come up with thousands, one of them is bound to be promising

for ideas which are "big enough", this is just false, right? for example, so far, no LLM has generated a proof of an interesting conjecture in math

Davey Morse's Shortform

Kaarel13d10

coming up with good ideas is very difficult as well

(and it requires good judgment, also)

How training-gamers might function (and win)

Kaarel21d*2-2

I've only skimmed the post so the present comment could be missing the mark (sorry if so), but I think you might find it worthwhile/interesting/fun to think (afresh, in this context) about how come humans often don't wirehead and probably wouldn't wirehead even with much much longer lives (in particular, much longer childhoods and research careers), and whether the kind of AI that would do hard math/philosophy/tech-development/science will also be like that.^[1]^[2]

I'm not going to engage further on this here, but if you'd like to have a chat about this, feel free to dm me. ↩︎
I feel like clarifying that I'd inside-view say P( the future is profoundly non-human (in a bad sense) | AI (which is not pretty much a synthetic human) smarter than humans is created this century ) >0.98 despite this. ↩︎

leogao's Shortform

Kaarel25d1-2

i agree that most people doing "technical analysis" are doing nonsense and any particular well-known simple method does not actually work. but also clearly a very good predictor could make a lot of money just looking at the past price time series anyway

Ivan Vendrov's Shortform

Kaarel26d30

it feels to me like you are talking of two non-equivalent types of things as if they were the same. like, imo, the following are very common in competent entities: resisting attempts on one's life, trying to become smarter, wanting to have resources (in particular, in our present context, being interested in eating the Sun), etc.. but then whether some sort of vnm-coherence arises seems like a very different question. and indeed even though i think these drives are legit, i think it's plausible that such coherence just doesn't arise or that thinking of the question of what valuing is like such that a tendency toward "vnm-coherence" or "goal stability" could even make sense as an option is pretty bad/confused^[1].

(of course these two positions i've briefly stated on these two questions deserve a bunch of elaboration and justification that i have not provided here, but hopefully it is clear even without that that there are two pretty different questions here that are (at least a priori) not equivalent)

briefly and vaguely, i think this could involve mistakenly imagining a growing mind meeting a fixed world, when really we will have a growing mind meeting a growing world — indeed, a world which is approximately equal to the mind itself. slightly more concretely, i think things could be more like: eg humanity has many profound projects now, and we would have many profound but currently basically unimaginable projects later, with like the effective space of options just continuing to become larger, plausibly with no meaningful sense in which there is a uniform direction in which we're going throughout or whatever ↩︎

kh's Shortform

Kaarel26d10

words to be meaningful

Towards_Keeperhood:

you could define "mother(x,y)" as "x gave birth to y", and then "gave birth" as some more precise cluster of observations, which eventually need to be able to be identified from visual inputs

Kaarel:

if i should read this as talking about a translation of "x is the mother of y", then imo this is a bad idea.
in particular, i think there is the following issue with this: saying which observations "x gave birth to y" corresponds to intuitively itself requires appealing to a bunch of other understanding. it's like: sure, your understanding can be used to create visual anticipations, but it's not true that any single sentence alone could be translated into visual anticipations — to get a typical visual anticipation, you need to rely on some larger segment of your understanding. a standard example here is "the speed of light in vacuum is 3*10^8 m/s" creating visual anticipations in some experimental setups, but being able to derive those visual anticipations depends on a lot of further facts about how to create a vacuum and properties of mirrors and interferometers and so on (and this is just for one particular setup — if we really mean to make a universally quantified statement, then getting the observation sentences can easily end up requiring basically all of our understanding). and it seems silly to think that all this crazy stuff was already there in what you meant when you said "the speed of light in vacuum is m/s". one concrete reason why you don't want this sentence to just mean some crazy AND over observation sentences or whatever is because you could be wrong about how some interferometer works and then you'd want it to correspond to different observation sentences
this is roughly https://en.wikipedia.org/wiki/Confirmation_holism as a counter to https://en.wikipedia.org/wiki/Verificationism
that said, i think there is also something wrong with some very strong version of holism: it's not really like our understanding is this unitary thing that only outputs visual anticipations using all the parts together, either — the real correspondence is somewhat more granular than that

TK:

On reflection, I think my "mother" example was pretty sloppy and perhaps confusing. I agree that often quite a lot of our knowledge is needed to ground a statement in anticipations. And yeah actually it doesn't always ground out in that, e.g. for parsing the meaning of counterfactuals. (See "Mixed Reference: The great reductionist project".)

i wouldn't say a sentence is grounded in anticipations with a lot of our knowledge, because that makes it sound like in the above example, "the speed of light is $3 \times 10^{8}$ m/s" is somehow privileged compared to our understanding of mirrors and interferometers even though it's just all used together to create anticipations; i'd instead maybe just say that a bunch of our knowledge together can create a visual anticipation

TK:

thx. i wanted to reply sth like "a true statement can either be tautological (e.g. math theorems) or empirical, and for it to be an empirical truth there needs to be some entanglement between your belief and reality, and entanglement happens through sensory anticipations. so i feel fine with saying that the sentence 'the speed of light is $3 \times 10^{8}$ m/s' still needs to be grounded in sensory anticipations". but i notice that the way i would use "grounded" here is different from the way I did in my previous comment, so perhaps there are two different concepts that need to be disentangled.

here's one thing in this vicinity that i'm sympathetic to: we should have as a criterion on our words, concepts, sentences, thoughts, etc. that they play some role in determining our actions; if some mental element is somehow completely disconnected from our lives, then i'd be suspicious of it. (and things can be connected to action via creating visual anticipations, but also without doing that.)
that said, i think it can totally be good to be doing some thinking with no clear prior sense about how it could be connected to action (or prediction) — eg doing some crazy higher math can be good, imagining some crazy fictional worlds can be good, games various crazy artists and artistic communities are playing can be good, even crazy stuff religious groups are up to can be good. also, i think (thought-)actions in these crazy domains can themselves be actions one can reasonably be interested in supporting/determining, so this version of entanglement with action is really a very weak criterion
generally it is useful to be able to "run various crazy programs", but given this, it seems obvious that not all variables in all useful programs are going to satisfy any such criterion of meaningfulness? like, they can in general just be some arbitrary crazy things (like, imagine some memory bit in my laptop or whatever) playing some arbitrary crazy role in some context, and this is fine
and similarly for language: we can have some words or sentences playing some useful role without satisfying any strict meaningfulness criterion (beyond maybe just having some relation to actions or anticipations which can be of basically arbitrary form)
a different point: in human thinking, the way "2+2=4" is related to visual anticipations is very similar to the way "the speed of light is $3 \times 10^{8}$ m/s" is related to visual anticipations

TK:

Thanks!
I agree that e.g. imagining fictional worlds like HPMoR can be useful.
I think I want to expand my notion of "tautological statements" to include statements like "In the HPMoR universe, X happens". You can also pick any empirical truth "X" and turn it into a tautological one by saying "In our universe, X". Though I agree it seems a bit weird.
Basically, mathematics tells you what's true in all possible worlds, so from mathematics alone you never know in which world you may be in. So if you want to say something that's true about your world specifically (but not across all possible worlds), you need some observations to pin down what world you're in.
I think this distinction is what Eliezer means in his highly advanced epistemology sequence when he uses "logical pinpointing" and "physical pinpointing".
You can also have a combination of the two. (I'd say as soon as some physical pinpointing is involved I'd call it an empirical fact.)
Commented about that. (I actually changed my model slightly): https://www.lesswrong.com/posts/bTsiPnFndZeqTnWpu/mixed-reference-the-great-reductionist-project?commentId=HuE78qSkZJ9MxBC8p

the imo most important thing in my messages above is the argument against [any criterion of meaningfulness which is like what you’re trying to state] being reasonable
in brief, because it’s just useful to be allowed to have arbitrary “variables" in "one’s mental circuits”
just like there’s no such meaningfulness criterion on a bit in your laptop’s memory
if you want to see from the outside the way the bit is “connected to the world”, one thing you could do is to say that the bit is 0 in worlds which are such-and-such and 1 in worlds which are such-and-such, or, if you have a sense of what the laptop is supposed to be doing, you could say in which worlds the bit "should be 0" and in which worlds the bit "should be 1", but it’s not like anything like this crazy god’s eye view picture is (or even could explicitly be) present inside the laptop
our sentences and terms don’t have to have meanings “grounded in visual anticipations”, just like the bit in the laptop doesn’t
except perhaps in the very weak sense that it should be possible for a sentence to be involved in determining actions (or anticipations) in some potentially arbitrarily remote way
the following is mostly a side point: one problem with seeing from the inside what your bits (words, sentences) are doing (especially in the context of pushing the frontier of science, math, philosophy, tech, or generally doing anything you don’t know how to do yet, but actually also just basically all the time) is that you need to be open to using your bits in new ways; the context in which you are using your bits usually isn’t clear to you
btw, this is a sort of minor point but i'm stating it because i'm hoping it might contribute to pushing you out of a broader imo incorrect view: even when one is stating formal mathematical statements, one should be allowed to state sentences with no regard for whether they are tautologies/contradictions (that is, provable/disprovable) or not — ie, one should be allowed to state undecidable sentences, right? eg you should be allowed to state a proof that has the structure "if P, then blabla, so Q; but if not-P, then other-blabla, but then also Q; therefore, Q", without having to pay any attention to whether P itself is tautological/contradictory or undecidable
so, if what you want to do with your criterion of meaningfulness involves banning saying sentences which are not "meaningful", then even in formal math, you should consider non-tautological/contradictory sentences meaningful. (if you don't want to ban the "meaningless" sentences, then idk what we're even supposed to be doing with this notion of meaningfulness.)

TK:

Thx. I definitely agree one should be able to state all mathematical statements (including undecidable ones), and that for proofs you shouldn't need to pay attention to whether a statement is undecidable or not. (I'm having sorta constructivist tendencies though, where "if P, then blabla, so Q; but if not-P, then other-blabla, but then also Q; therefore, Q" wouldn't be a valid proof because we don't assume the law of excluded middle.)
Ok yeah thx I think the way I previously used "meaningfully" was pretty confused. I guess I don't really want to rule out any sentences people use.
I think sth is not meaningful if there's no connection between a belief to your main belief pool. So "a puffy is a flippo" is perhaps not meaningful to you because those concepts don't relate to anything else you know? (But that's a different kind of meaningful from what errors people mostly make.)

yea. tho then we could involve more sentences about puffies and flippos and start playing some game involving saying/thinking those sentences and then that could be fun/useful/whatever

TK:

maybe. idk.

Why do many people who care about AI Safety not clearly endorse PauseAI?

Kaarel1mo*70

I think it's plausible that a system which is smarter than humans/humanity (and distinct and separate from humans/humanity) should just never be created, and I'm inside-view almost certain it'd be profoundly bad if such a system were created any time soon. But I think I'll disagree with like basically anyone on a lot of important stuff around this matter, so it just seems really difficult for anyone to be such that I'd feel like really endorsing them on this matter?^[1] That said, my guess is that PauseAI is net positive, tho I haven't thought about this that much :)

https://youtu.be/Q3_7HTruMfg ↩︎

An Advent of Thought

Kaarel1mo*50

Thank you for the comment!

First, I'd like to clear up a few things:

I do think that making an "approximate synthetic 2025 human newborn/fetus (mind)" that can be run on a server having 100x usual human thinking speed is almost certainly a finite problem, and one might get there by figuring out what structures are there in a fetus/newborn precisely enough, and it plausibly makes sense to focus particularly on structures which are more relevant to learning. If one were to pull this off, one might then further be able to have these synthetic fetuses grow up quickly into fairly normal humans and have them do stuff which ends the present period of (imo) acute x-risk. (And the development of thought continues after that, I think; I'll say more that relates to this later.) While I do say in my post that making mind uploads is a finite problem, it might have been good to state also (or more precisely) that this type of thing is finite.
I certainly think that one can make a finite system such that one can reasonably think that it will start a process that does very much — like, eats the Sun, etc.. Indeed, I think it's likely that by default humanity would unfortunately start a process that gets the Sun eaten this century. I think it is plausible there will be some people who will be reasonable in predicting pretty strongly that that particular process will get the Sun eaten. I think various claims about humans understanding some stuff about that process are less clear, though there is surely some hypothetical entity that could pretty deeply understand the development of that process up to the point where it eats the Sun.
Some things in my notes were written mostly with an [agent foundations]y interlocutor in mind, and I'm realizing now that some of these things could also be read as if I had some different interlocutor in mind, and that some points probably seem more incongruous if read this way.

I'll now proceed to potential disagreements.

But there’s something else, which is a very finite legible learning algorithm that can automatically find all those things—the object-level stuff and the thinking strategies at all levels. The genome builds such an algorithm into the human brain. And it seems to work! I don’t think there’s any math that is forever beyond humans, or if it is, it would be for humdrum reasons like “not enough neurons to hold that much complexity in your head at once”.

Some ways I disagree or think this is/involves a bad framing:

If we focus on math and try to ask some concrete question, instead of asking stuff like "can the system eventually prove anything?", I think it is much more appropriate to ask stuff like "how quickly can the system prove stuff?". Like, brute-force searching all strings for being a proof of a particular statement can eventually prove any provable statement, but we obviously wouldn't want to say that this brute-force searcher is "generally intelligent". Very relatedly, I think that "is there any math which is technically beyond a human?" is not a good question to be asking here.
The blind idiot god that pretty much cannot even invent wheels (ie evolution) obviously did not put anything approaching the Ultimate Formula for getting far in math (or for doing anything complicated, really) inside humans (even after conditioning on specification complexity and computational resources or whatever), and especially not in an "unfolded form"^[1], right? Any rich endeavor is done by present humans in a profoundly stupid way, right?^[2] Humanity sorta manages to do math, but this seems like a very weak reason to think that [humans have]/[humanity has] anything remotely approaching an "ultimate learning algorithm" for doing math?^[3]
The structures in a newborn [that make it so that in the right context the newborn grows into a person who (say) pushes the frontier of human understanding forward] and [which participate in them pushing the frontier of human understanding forward] are probably already really complicated, right? Like, there's already a great variety of "ideas" involved in the "learning-relevant structures" of a fetus?
I think that the framing that there is a given fixed "learning algorithm" in a newborn, such that if one knew it, one would be most of the way there to understanding human learning, is unfortunate. (Well, this comes with the caveat that it depends on what one wants from this "understanding of human learning" — e.g., it is probably fine to think this if one only wants to use this understanding to make a synthetic newborn.) In brief, I'd say "gaining thinking-components is a rich thing, much like gaining technologies more generally; our ability to gain thinking-components is developing, just like our ability to gain technologies", and then I'd point one to Note 3 and Note 4 for more on this.
I want to say more in response to this view/framing that some sort of "human learning algorithm" is already there in a newborn, even in the context of just the learning that a single individual human is doing. Like, a human is also importantly gaining components/methods/ideas for learning, right? For example, language is centrally involved in human learning, and language isn't there in a fetus (though there are things in a newborn which create a capacity for gaining language, yes). I feel like you might want to say "who cares — there is a preserved learning algorithm in the brain of a fetus/newborn anyway". And while I agree that there are very important things in the brain which are centrally involved in learning and which are fairly unchanged during development, I don't understand what [the special significance of these over various things gained later] is which makes it reasonable to say that a human has a given fixed "learning algorithm". An analogy: Someone could try to explain structure-gaining by telling me "take a random init of a universe with such and such laws (and look along a random branch of the wavefunction^[4]) — in there, you will probably eventually see a lot of structures being created" — let's assume that this is set up such that one in fact probably gets atoms and galaxies and solar systems and life and primitive entities doing math and reflecting (imo etc.). But this is obviously a highly unsatisfying "explanation" of structure-gaining! I wanted to know why/how protons and atoms and molecules form and why/how galaxies and stars and black holes form, etc.. I wanted to know about evolution, and about how primitive entities inventing/discovering mathematical concepts could work, and imo many other things! Really, this didn't do very much beyond just telling me "just consider all possible universes — somewhere in there, structures occur"! Like, yes, I've been given a context in which structure-gaining happens, but this does very little to help me make sense of structure-gaining. I'd guess that knowing the "primordial human learning algorithm" which is there in a fetus is significantly more like knowing the laws of physics than your comment makes it out to be. If it's not like that, I would like to understand why it's not like that — I'd like to understand why a fetus's learning-structures really deserve to be considered the "human learning algorithm", as opposed to being seen as just providing a context in which wild structure-gaining can occur and playing some important role in this wild structure-gaining (for now).
to conclude: It currently seems unlikely to me that knowing a newborn's "primordial learning algorithm" would get me close to understanding human learning — in particular, it seems unlikely that it would get me close understanding how humanity gains scientific/mathematical/philosophical understanding. Also, it seems really unlikely that knowing this "primordial learning algorithm" would get me close to understanding learning/technology-making/mathematical-understanding-gaining in general.^[5]

like, such that it is already there in a fetus/newborn and doesn't have to be gained/built ↩︎
I think present humans have much more for doing math than what is "directly given" by evolution to present fetuses, but still. ↩︎
One attempt to counter this: "but humans could reprogram into basically anything, including whatever better system for doing math there is!". But conditional on this working out, the appeal of the claim that fetuses already have a load-bearing fixed "learning algorithm" is also defeated, so this counterargument wouldn't actually work in the present context even if this claim were true. ↩︎
let's assume this makes sense ↩︎
That said, I could see an argument for a good chunk of the learning that most current humans are doing being pretty close to gaining thinking-structures which other people already have, from other people that already have them, and there is definitely something finite in this vicinity — like, some kind of pure copying should be finite (though the things humans are doing in this vicinity are of course more complicated than pure copying, there are complications with making sense of "pure copying" in this context, and also humans suck immensely (compared to what's possible) even at "pure copying"). ↩︎