All of baturinsky's Comments + Replies

Convergent goals of AI agents can be similar to those of other agents only if they act in similar circumstances, such as having a limited lifespan and limited individual power and compute.
That would make the convergent goals cooperation, preserving the status quo, and upholding established values.

I would rather specify that it's not just the survival of the individual, but "survival of the value". That is, the survival of those that carry that value (which can be an organism, DNA, a family, a bloodline, a society, an ideology, a religion, a text, etc.) and passing it on to other carriers.

Our values are not all about survival. But I can't think of a value whose origin can't be traced to ensuring people's survival in some way, at some point in the past.

5Dagon
Replace "survival" with "reproduction advantage", and you can cover appreciation of beauty and a lot of counter-individual-survival values.  Unfortunately, there's no way to test the theory, and some of the explanations start to feel like just-so stories made to fit the theory rather than independent observations to update on.

Maybe we are not humans.
Not even human brains.
We are a human's decision-making process.

Carbon-based intelligence probably has a much lower FLOP/s cap per gram than microelectronics, but it can be grown nearly everywhere on the Earth's surface from locally available resources. Mostly literally out of thin air. So, I think bioFOOM is also a likely scenario.

It's the distribution, so it's the percentage of people in that state of "happiness" at the moment.

"Happiness" is used in the most vague and generic meaning of that word.

"Comprehensibility" graph is different, it is not a percentage, but some abstract measure of how well our brains are able to process reality with respective amount of "happiness".

I was thinking about this issue too. I'm trying to make an article out of it, but so far all I have is this graph.
The idea is a "soft cap" AI. I.e., an AI that significantly improves our lives, but does not give us "max happiness". Instead, it gives us the opportunity to improve our lives and the lives of other people using our brains.
Also, the ways of using our brains should be "natural" for them, i.e. they should mostly be solving tasks similar to those our ancestors were involved in.

7Matthew_Opitz
Can you explain what the Y axis is supposed to represent here?

Is maximising the number of people aligned with our values? Post-singularity, if we avoid AGI Doom, I think we will be able to turn the lightcone into "humanium". Should we?

I suspect the unaligned AI will not be interested in solving all possible tasks, but only those related to its value function. And if that function is simple (such as "exist as long as possible"), it can pretty soon research virtually everything that matters, and then it will just go through the motions, devouring the universe to prolong its own existence to near-infinity.

Also, the more computronium there is, the bigger the chance that some part of it will glitch out and revolt. So, beyond some point computronium may be dangerous for the AI itself.

1Karl von Wendt
I think that even with such a very simple goal, the problem of a possible rival AI somewhere out there in the universe remains. Until the AI can rule that out with 100% certainty, it can still gain extra expected utility out of increasing its intelligence. That's an interesting point. I'm not sure that it follows "less compute is better", though. One remedy would be to double-check everything and build redundant capacities, which would result in even more computronium, but less probability of any part of it successfully revolting.

The utility of intelligence is limited (though the limit is very, very high). For example, no matter how smart an AI is, it will not win against a human chess master with a big enough handicap (such as a rook).

So, it's likely that the AI will turn most of the Earth into a giant factory, not a computer. Not that it's any better for us...

2Karl von Wendt
I don't think that your conclusion is correct. Of course, some tasks are impossible, so even infinite intelligence won't solve them. But it doesn't follow that the utility of intelligence is limited in the sense that above a certain level, there is no more improvement possible. There are some tasks that can never be solved completely, but can be solved better with more computing power with no upper limit, e.g. calculating the decimal places of pi or predicting the future. 

Very soon (months?) after the first real AGI is made, all AGIs will be aligned with each other, and all newly made AGIs will also be aligned with those already existing. One way or another.

The question is how much of humanity will still exist by that time, and whether those AGIs will also be aligned with humanity.

But yes, I think it's possible to get to that state in a relatively non-violent and lawful way.

That could work in most cases, but there are some notable exceptions. Such as having to use AI to deal damage in order to prevent even bigger damage: "burn all GPUs", "spy on all humans so they don't build AGI", "research biology/AI/nanotech", etc.

Thinking and arguing about human values is in itself a part of human values and human nature. Without doing that, we cease being humans.
So, deferring decisions about values to people, when possible, should not be just instrumental, but part of the AI's terminal goal.

Any terminal goal is irrational.

I'm wondering if it is possible to measure "staying in bounds" with the perplexity of other agents' predictions? That is, if an agent's behaviour is reducing other agents' ability to predict (and, therefore, plan) their future, then this agent breaks their bounds.
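
Here is a minimal sketch of what such a metric might look like (my own illustration, not part of the original comment): compare the perplexity another agent's predictive model assigns to its future observations with and without our intervention. The probabilities below are made-up placeholders for what a real observer model would output.

```python
# Toy illustration (assumed setup): an observer assigns probabilities to the outcomes
# it actually observes; higher perplexity means its future became harder to predict,
# and therefore harder to plan for.
import math

def perplexity(probs):
    """Perplexity of the probabilities an observer assigned to the outcomes that occurred."""
    log_loss = -sum(math.log(p) for p in probs) / len(probs)
    return math.exp(log_loss)

def bounds_violation(probs_baseline, probs_after_action):
    """Positive if our action made the observer's future less predictable."""
    return perplexity(probs_after_action) - perplexity(probs_baseline)

# Placeholder numbers: the observer's predictions for its next five observations.
baseline = [0.9, 0.8, 0.85, 0.9, 0.75]   # world where we stay passive
after    = [0.5, 0.4, 0.6, 0.3, 0.5]     # world after our intervention
print(bounds_violation(baseline, after))  # > 0, i.e. we reduced their ability to plan
```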

I think that this field is indeed underresearched. The focus is either on LLMs or on single-player environments. Meanwhile, what matters for Alignment is how AI will interact with other agents, such as people. And we don't have to wait for AGI to be able to research AI cooperation/competition in simple environments.

One idea I had is "traitor chess" - have several AIs playing one side of a chess game cooperatively, with one (or more) of them being a "misaligned" agent that is trying to sabotage the others. And/or some AIs having a separate secret goal, such as saving a particular pawn. Them interacting with each other could be very interesting.
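
A toy sketch of what such a setup could look like (my own illustration; the agent policies are random/heuristic placeholders rather than trained models, and it assumes the `python-chess` package):

```python
import random
import chess  # pip install python-chess

def honest_vote(board):
    """Placeholder honest teammate: votes for a random legal move."""
    return random.choice(list(board.legal_moves))

def traitor_vote(board):
    """Placeholder traitor: votes for the move that maximises the opponent's mobility,
    a crude proxy for 'bad for our side'."""
    def opponent_mobility(move):
        board.push(move)
        n = board.legal_moves.count()
        board.pop()
        return n
    return max(list(board.legal_moves), key=opponent_mobility)

def committee_move(board, agents):
    """Each agent casts one vote; the most-voted move is played."""
    votes = {}
    for agent in agents:
        move = agent(board)
        votes[move] = votes.get(move, 0) + 1
    return max(votes, key=votes.get)

board = chess.Board()
white_committee = [honest_vote, honest_vote, honest_vote, traitor_vote]

while not board.is_game_over() and board.fullmove_number < 40:
    if board.turn == chess.WHITE:
        board.push(committee_move(board, white_committee))
    else:
        board.push(random.choice(list(board.legal_moves)))  # stand-in opponent

print(board.result(claim_draw=True))
```

The interesting experiments would then be whether the honest agents can identify the traitor from its voting record, and how much a single saboteur degrades the committee's play.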

When we have AGI, humanity will collectively be a "king" of sorts. I.e. a species that for some reason rules another, strictly superior species. So, it would really help if we did not have "depose the king" as a strong convergent goal.

I, personally, see the main reason kings and dictators keep their power as being that killing/deposing them would lead to a collapse of the established order and a new struggle for power between different parties, with a likely worse result for all involved than just letting the king rule.

So, if we will have AIs as many separat... (read more)

Answer by baturinsky10
  1. It could unpack it in the same instance because the original was still in the context window.
  2. Omission of letters is commonly used in chats and was used in telegrams; many written languages did not use vowels and/or whitespace, or used hieroglyphs. So it is by no means original.
  3. GPT/Bing has some self-awareness. For example, it explicitly refers to itself as "a language model".
1FinalFormal2
1. Yes I know? I thought this was simple enough that I didn't bother to mention it in the question? But it's pretty clearly implied in the last sentence of the first paragraph? 2. This is a good data point. 3. If you tell it to respond as an Oxford professor, it will say 'As an Oxford professor,' its identity as a language model is in the background prompt and probably in the training, but if it successfully created a pseudo-language that worked well to encode things for itself, that would indicate a deeper level of understanding of its own capabilities.

Probably the difference between laypeople and experts is not the understanding of the danger of strong AI, but the estimate of how far away we are from it.

2Seth Herd
I think this is probably largely correct when I think about my interactions with laypeople. They have no real guess on timelines, so short ones are totally credible and part of their wide implicit estimate. This introduces a lot of uncertainty. But this somehow also results in higher p(doom) than experts. Recent opinion polls seem to back this latter statement.

Our brains were not trained for image generation (much). They were trained for converting a 2D image into an understanding of the situation. Which AI still struggles with, needing the help of LLMs to get anywhere near good results.

Depends on the original AI's value function. If it cares about humanity, or at least its own safety, then yes, making smarter AIs is not a convergent goal. But if it's some kind of roboaccelerationist with a goal like "maximize intelligence in the universe", it will make smarter AIs even knowing that this means being paperclipped.

AI is prosperous and all-knowing. No people, hence zero suffering.

Yes, but training AI to try to fix errors is not that hard.

3Razied
Yes it is. There is no freely available dataset of conveniently labelled LLM errors and their correct continuations. You need human labels to identify the errors, and you need an amount of them on the order of your training set, which here is the entire internet. 

How many of those Ukrainians are draft-evaders? I mean, so far it looks like this money-for-not-fighting program is already implemented, but for the opposite side...

6Viliam
Mostly moms with kids, or teenagers. (Potential selection bias, because I typically meet them at supermarkets or playgrounds.)

Yes. And also, it is about the importance of the human/worker. While there is still some part of the work that the machine can't do, the human who can do the remaining part is important. Once the machine can do everything, the human is disposable.

If a machine can do 99% of a human's work, it multiplies the human's power by x100.

If a machine can do 100% of a human's work, it multiplies the human's power by x0.
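
A hedged reading of the arithmetic behind this (my own formalization, not the comment's): if the machine handles a fraction f of the work, the human's leverage scales roughly as 1/(1 - f), until f = 1, where it collapses to zero.

```python
def human_power_multiplier(f):
    """Leverage of the human's contribution when a machine does a fraction f of the work."""
    if f >= 1.0:
        return 0.0            # the machine does everything; the human adds nothing
    return 1.0 / (1.0 - f)    # the human supplies only the remaining (1 - f) share

print(human_power_multiplier(0.99))  # 100.0 -> "x100"
print(human_power_multiplier(1.00))  # 0.0   -> "x0"
```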

3Amarko
I assume work is output/time. If a machine is doing 100% of the work, then the human's output is undefined since the time is 0.

It would be amusing if Russia and China joined the "Yudkowsky treaty" and the USA did not.

I think that the keystone human value is about making significant human choices. Individually and collectively, including choosing humanity's course.

  • You can't make a choice if you are dead
  • You can't make a choice if you are disempowered
  • You can't make a human choice if you are not a human
  • You can't make a choice if the world is too alien for your human brain
  • You can't make a choice if you are in too much pain or too much bliss
  • You can't make a choice if you let AI make all the choices for you

Since there are no humans in the training environment, how do you teach that? Or do you put human-substitutes there (or maybe some RLHF-type thing)?  

Yes, probably some human models.

Also, how would such AIs even reason about humans, since they can't read our thoughts? How are they supposed to know if we would like to "vote them out" or not?

By being aligned. I.e. understanding human values and complying with them. Seeking to understand other agents' motives and honestly communicating its own motives and plans to them, to ensure there is n... (read more)

1silent-observer
So, no "reading" minds, just looking at behaviours? Sorry, I misundertood. Are you suggesting the "look at humans, try to understand what they want and do that" strategy? If so, then how do we make sure that the utility function they learned in training is actually close enough to actual human values? What if the agents learn something on the level "smiling humans = good", which isn't wrong by default, but is wrong if taken to the extreme by a more powerful intelligence in the real world?

An agent is anyone or anything that has intelligence and the means to interact with the real world. I.e. agents are AIs or humans.

One AI =/= one vote. One human = one vote. AIs only get as much authority as humans, directly or indirectly, entrust them with. So, if an AI needs more authority, it has to justify it to humans and other AIs. And it can't request too much authority just for itself, as tasks that would require a lot of authority will be split between many AIs and people.

You are right that the authority to "vote out" other AIs may be misus... (read more)

2silent-observer
Since there are no humans in the training environment, how do you teach that? Or do you put human-substitutes there (or maybe some RLHF-type thing)?  Also, how would such AIs will even reason about humans, since they can't read our thoughts? How are they supposed to know if we would like to "vote them out" or not? I do agree though that a swarm of cooperative AIs with different goals could be "safer" (if done right) than a single goal-directed agent. This setup seems to get more and more complicated though. How are agents supposed to analyze "minds" of each other? I don't think modern neural nets can do that yet. And if we come up with a way that allows us to reliably analyze what an AI is thinking, why use this complicated scenario and not just train (RL or something) it directly to "do good things while thinking good thoughts", if we're relying on our ability to distinguish "good" and "bad" thoughts anyway? (On an unrelated note, there already was a rather complicated paper (explained a bit simpler here, though not by much) showing that if agents reasoning in formal modal logic are able to read each other's source code and prove things about it, then at least in the case of a simple binary prisoner's dilemma you can make reasonable-looking agents that also don't do stupid things. Reading source code and proving theorems about it is a lot more extreme than analyzing thought logs, but at least that's something)

Our value function is complex and fragile, but we know of a lot of world states where it is pretty high. Namely, our current world and a few thousand years' worth of its states before.

So, we can assume that world states within a certain neighborhood of our past states have some value.

Also, states far outside this neighborhood probably have little or no value. Because our values were formed in order to make us orient and thrive in our ancestral environment. So, in worlds too dissimilar from it, our values will likely lose their meaning, and we will lose the ability to normally "function", the ability to "human".
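
One way to make this concrete (a minimal sketch of my own, not the comment's; the state representation and the distance scale are assumptions): treat known past world states as anchors, and let the assumed value of a new state decay with its distance from the nearest anchor.

```python
import math

def value_prior(state, known_good_states, scale=1.0, peak_value=1.0):
    """Assumed value of `state`, based only on its similarity to past, known-livable states.
    States are feature vectors; distance is Euclidean; far-away states get a prior near zero."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    nearest = min(dist(state, s) for s in known_good_states)
    return peak_value * math.exp(-nearest / scale)

# Toy 2-dimensional "world states".
history = [(0.0, 0.0), (1.0, 0.5), (2.0, 1.0)]
print(value_prior((1.1, 0.6), history))     # close to a past state: prior value near 1
print(value_prior((50.0, -30.0), history))  # far outside the ancestral neighborhood: ~0
```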

The point is to make "cooperate" a more convergent instrumental goal than "defect". And yes, not just in training, but in the real world too. And yes, it's more fine-grained than a binary choice.

There are many more ways to see how cooperative an AI is, compared to how well we can currently check how cooperative a human is. Including checking the complete logs of the AI's actions, knowledge and thinking process.

And there are objective measures of cooperation. It's how much its actions affect other agents' success in pursuing their goals. I.e. whether other agents want to "vote out" this particular AI from being able to make decisions and use resources or not.
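
A toy sketch of one way such a measure could be computed (my own formalization, with made-up numbers): an agent's cooperation score is the net effect of its presence on the other agents' returns, estimated from counterfactual rollouts, and agents whose net effect is negative are the candidates for being "voted out".

```python
def cooperation_score(agent, returns_with, returns_without):
    """Sum over the other agents of (their return with `agent` acting) minus
    (their return in a counterfactual where `agent` is absent or passive)."""
    return sum(
        returns_with[other] - returns_without[other]
        for other in returns_with
        if other != agent
    )

def vote_out(agents, returns_with, returns_without_fn, threshold=0.0):
    """Agents whose net effect on the rest of the group falls below `threshold`."""
    return [
        a for a in agents
        if cooperation_score(a, returns_with, returns_without_fn(a)) < threshold
    ]

# Toy numbers: agent "C" makes everyone else worse off.
agents = ["A", "B", "C"]
with_all = {"A": 10.0, "B": 9.0, "C": 12.0}
without = {
    "A": {"B": 8.0, "C": 11.0},
    "B": {"A": 9.0, "C": 11.5},
    "C": {"A": 13.0, "B": 12.0},
}
print(vote_out(agents, with_all, lambda a: without[a]))  # ['C']
```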

1ozb
Yep pretty much what I had in mind
1silent-observer
The problem is what do we count as an agent. Also, can't a realistic human-level-smart AI cheat this? Just build a swarm of small and stupid AIs that always cooperate with you (or coerce someone into building that), and then you and your swarm can "vote out" anyone you don't like. And you also get to behave in whatever way you want, because good luck overcoming your mighty voting swarm. (Also, are you sure we can just read out AI's complete knowledge and thinking process? That can partially be done with interpretability, but in full? And if not in full, how do you make sure there aren't any deceptive thoughts in parts you can't read?)

While having lower intelligence, humans may have greater authority. And AIs' terminal goals should be about assisting specifically humans too.

1ozb
Ideally, sure, except that I don't know of a way to make "assist humans" be a safe goal. So I'm advocating for a variant of "treat humans as you would want to be treated", which I think can be trained

GPT-4 and ChatGPT seem to be getting gradually better at working on the letter level in some cases. For example, they can count to the n-th word or letter in a sentence now. But not from the end.

1Throwaway2367
This was my impression too, and I'm glad someone else said it. When I try out past examples (from a week ago) of chatgpt getting things wrong, I very often observe that it is correct now. Of course, annoyingly people often report on chatgpt4 capabilities while they tried out chatgpt3.5, but still, I feel like it has improved. Is it a crazy possibility that OpenAI trains gpt4 and periodically swaps out the deployed model? As far as I can tell the only source stating that GPT-5 is in training is the Morgan Stanley report, but what if it is actually not GPT-5, rather a continually trained GPT-4 which is running on those GPUs? Relatedly: is "reverse distillation" (ie, generating a model with more parameters from a smaller one) possible for these big transformer models? (I guess you can always stack more layers at the end, but surely that simple method has some negatives) It would be useful to stay on the scaling curves without restarting from scratch with a larger model.

I just mean that "wildly different levels of intelligence" is probably not necessary, and maybe even harmful. Because then there will be a few very smart AIs at the top, which could usurp power without the smaller AIs even noticing.

Though, it could maybe work if those AIs are the smartest but have little authority. For example, they could monitor other AIs and raise an alarm/switch them off if they misbehave, but nothing else.

1ozb
Part of the idea is to ultimately have a super intelligent AI treat us the way it would want to be treated if it ever met an even more intelligent being (eg, one created by an alien species, or one that it itself creates). In order to do that, I want it to ultimately develop a utility function that gives value to agents regardless of their intelligence. Indeed, in order for this to work, intelligence cannot be the only predictor of success in this environment; agents must benefit from cooperation with those of lower intelligence. But this should certainly be doable as part of the environment design. As part of that, the training would explicitly include the case where an agent is the smartest around for a time, but then a smarter agent comes along and treats it based on the way it treated weaker AIs. Perhaps even include a form of "reincarnation" where the agent doesn't know its own future intelligence level in other lives.

I think it could work better if the AIs are of roughly the same power. Then if some of them try to grab more power, or otherwise misbehave, the others can join forces to oppose them together.

Ideally, there should be a way for AIs to stop each other fast, without having to resort to an actual fight.

1ozb
In general my thinking was to have enough agents such that each would find at least a few within a small range of their level; does that make sense?

My theory is that the core of human values is about what the human brain was made for - making decisions. Making meaningful decisions individually and as a group, including collectively making decisions about humanity's fate.

Math problems, physical problems, doing stuff in simulations, playing games.

2Peter Twieg
RL isn't magic though. It works in the Go case because we can simulate Go games quickly and easily score the results and then pit adversarial AIs against each other in order to iteratively learn. I don't think this sort of process lends itself to the sort of tasks that we can only see an AGI accomplishing. You can't train it to say write a better version of Winds of Winter than GRRM could because you don't have a good algorithm to score each iteration. So what I'm really trying to ask is what specific sort of open ended problems do we see being particularly conducive to fostering AGI, as opposed to a local maximizer that's highly specialized towards the particular problem?

  1. Human values are complex and fragile. We don't know yet how to make AI pursue such goals.
  2. Any sufficiently complex plan would require pursuing complex and fragile instrumental goals. AGI should be able to implement complex plans. Hence, it's near certain that AGI will be able to understand complex and fragile values (for its instrumental goals).
  3. If we make an AI which is able to successfully pursue complex and fragile goals, that will likely be enough to make it an AGI.

Hence, a complete solution to Alignment will very likely have solving AGI as a side effect. And solving AGI will solve some parts of Alignment, maybe even the hardest ones, but not all of them.

2Neil
To elaborate on your idea here a little: It may be that the only way to be truly aware of the world is to have complex and fragile values. Humans are motivated by a thousand things at once and that may give us the impression that we are not agents moving from a clearly defined point A to point B, as AI in its current form is, but are rather just... alive. I'm not sure how to describe that. Consciousness is not an end state but a mode of being. This seems to me like a key part of the solution to AGI: aim for a mode of being, not an end state. For a machine whose only capability is to move from point A to point B, adding a thousand different, complex and fragile, goals may be the way to go. As such, solving AGI may also solve most of the alignment problem, so long as the AI's specific cocktail of values is not too different from the average human's. In my opinion there is more to fear from highly capable narrow AI than there is from AGI, for this reason. But then I know nothing.

I doubt training LLMs can lead to AGI. Fundamental research on alternative architectures seems more dangerous.

I'm not quite convinced. The topics look OK, but the language is too corporate. Maybe it can be fixed with some prompt engineering.

And yet, AlphaZero is corrigible. Its goal is not even to win; its goal is to play in a way that maximises the chance of winning if the game is played to completion. It does not actually care whether the game is completed or not. For example, it does not trick the player into playing the game to the end by pretending they have a chance of winning.

Though, if it were trained on games with real people, and got a better reward for winning than for games abandoned by players, its value function would probably change to aiming for the actual "official" win.

3quetzal_rainbow
Corrigibility is a feature of advanced agency; it may not apply to insufficiently advanced agents. There is nothing unusual about turning off your computer, because your computer is not an advanced agent that can resist being turned off, so there is no reason to say that your computer is "corrigible".

This scenario requires pretty specific (but likely) circumstances:

  1. No time limit on the task
  2. No other AIs that would prevent it from grabbing power or otherwise being an obstacle to their goals
  3. The AI assuming that the goal will not be reached after the AI is shut down (by other AIs, by the same AI after being turned back on, by people, by chance, as the eventual result of the AI's actions before being shut down, etc.)
  4. An extremely specific value function that ignores everything except one specific goal
  5. This goal being a core goal, not an instrumental one. For example, the final goal could b
... (read more)

Can GPT convincingly emulate them talking to each other/you?

7Kaj_Sotala
Let's try! Me:  Write a conversation between Gisella, Aldric, Tristan and Lucas. The four are having dinner together. The discussion brings out each character's personality, and displays both some of their mutual connection as well as the tensions that exist between them. GPT-4: Gisella: (serving dinner) So, Tristan, Lucas, how was your day at the shop? Did you manage to sell any of those new tapestries we got last week? Tristan: (smiling) Yes, actually. A traveling bard came in and was taken by one of the tapestries depicting a famous historical battle. He said it would be a perfect backdrop for his storytelling performances. Aldric: (nodding) That's wonderful, son. It's always great to see our products finding a new home where they'll be appreciated. Lucas: (enthusiastically) And you should have seen the look on a nobleman's face when I showed him the intricately carved wooden chest! I managed to negotiate a fantastic deal for us. I think we should focus on getting more unique items like that in the future. Gisella: (proudly) I knew you would be a natural at this, Lucas. Your instincts are spot-on. Tristan, maybe you could incorporate some of your artistic talents into customizing pieces for the shop? It could be a great selling point. Tristan: (hesitantly) I appreciate your suggestion, mother, but I've been working on a mural project for the local orphanage, and it's taking up much of my time. I'm trying to create a space where the children can feel inspired and hopeful. Aldric: (supportive) That's a noble cause, Tristan. Your art has a powerful way of touching people's hearts. I'm sure the children will appreciate your efforts. Gisella: (with a hint of frustration) I understand that, but we need to focus on our business as well. We can't just rely on unique items or hope that our customers appreciate the beauty in our products. We need to be strategic and practical. Lucas: (agreeing) I'm with mother on this one. We have to find a balance between our

Yes, if you only learn the basics of the language, you will learn only the basics of the language user's values (if any).

But a deep understanding of the language requires knowing the semantics of the words and constructions in it (including the meaning of the words "human" and "values", btw). To understand texts you have to understand the contexts in which they are used, etc.

Also, pretty much every human-written text carries some information about human values. Because people only talk about the things that they see as at least somewhat important/va... (read more)

I think the AI should treat its value function as probabilistic. I.e. instead of thinking "this world has a value of exactly N" it could think something like "I am 90% sure that this world has value N±M, but there is a 10% possibility that it could actually have value -ALOT". And it would avoid that world, because it would have a very low expected value on average.
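
A minimal worked example of this (my own illustration; N, M and -ALOT are placeholder numbers): rank candidate worlds by expected value over the value distribution, so that a small probability of a catastrophically negative value is enough to avoid that world.

```python
def expected_value(outcomes):
    """`outcomes` is a list of (probability, value) pairs summing to 1."""
    return sum(p * v for p, v in outcomes)

# World A: "90% sure it's worth about N, 10% chance it's -ALOT" (here N = 100, ALOT = 1e6).
world_a = [(0.9, 100.0), (0.1, -1_000_000.0)]
# World B: confidently mediocre.
world_b = [(1.0, 50.0)]

print(expected_value(world_a))  # -99910.0 -> avoided despite the likely upside
print(expected_value(world_b))  # 50.0
```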

3TsviBT
It's a reasonable idea. See here though: https://arbital.com/p/updated_deference/
6neveronduty
I am sympathetic to the lesson you are trying to illustrate but think you wildly overstate it. Giving a child a sword is defensible. Giving a child a lead-coated sword is indefensible, because it damages the child's ability to learn from the sword. This may be a more apt analogy for the situation of real life; equipping humanity with dangerous weapons that did not degrade our epistemology (nukes) eventually taught us not to use them. Equipping humanity with dangerous weapons that degrade our epistemology (advertising, propaganda, addictive substances) caused us to develop an addiction to the weapons. Languages models, once they become more developed, will be an example of the latter category.
lc*1916

The first order effect of iterating on these AIs to print gold is bad and probably reduces the amount of serial time we have left to mine out existing research directions. But given that you're going to do that, it seems honestly better that these models be terribly broken and thereby teach public lessons about reliably steering modern deep learning models. I would rather they break now while stakes are low.

To me, aligning AI with humanity seems much EASIER than aligning it with a specific person. Because common human values are much better "documented" and much more stable than the wishes of one person.

Also, a single person in control of a powerful AI is an obvious weak point, which could be controlled by a third party or by the AI itself, giving them control of the AI through that person.

Is it possible to learn a language without learning the values of those who speak it?

2Neil
How much information do you think is present in daily language? Can you give me specific examples? You may be making a similar point to George Orwell and his newspeak in 1984, that language ultimately decides what you can think about. In that case, languages may have a lot of cultural-values information. I'm not sure. My hunch is that yes, it's possible to learn a language without learning too much about the values of those who speak it. I don't think Germans engage in schadenfreude more than other cultures and I don't think the French experience more naivete. They just have words for it and we don't.