quetzal_rainbow's Shortform

quetzal_rainbow


quetzal_rainbow's Shortform — LessWrong


Idea for experiment: make two really large groups of people take the Big Five test. Tell one group to answer fast, based on their feeling of what is true. Tell the other group to seriously consider how they would behave in different situations related to the questions. I think the different ways of introspecting would yield a systematic bias in the results.
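One way the comparison between the two groups could be analyzed is a permutation test on the difference in mean trait score. The following is only a minimal sketch under stated assumptions: the function name `permutation_test`, the 1-to-5 scoring scale, and the two score lists are all hypothetical, made-up placeholders for illustration, not data from any study.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference of group means.

    Returns (observed difference, p-value estimate).
    """
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Reassign group labels at random and recompute the difference.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (extreme + 1) / (n_permutations + 1)


# Hypothetical placeholder scores for one trait (assumed 1-5 scale).
fast = [3.1, 3.4, 2.9, 3.8, 3.3, 3.6, 3.0, 3.5]        # "answer fast" group
deliberate = [2.7, 3.0, 2.5, 3.2, 2.8, 3.1, 2.6, 2.9]  # "consider situations" group

observed, p_value = permutation_test(fast, deliberate)
print(f"difference in means: {observed:.3f}, p-value estimate: {p_value:.4f}")
```

A permutation test is used here because it makes no assumption about the score distribution; with really large groups, a standard two-sample test per trait would probably serve equally well.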

[-]quetzal_rainbow2y*269

The most baffling thing on the Internet right now is the beautiful void where there should have been discussion of "concept of artificial intelligence becoming self-aware, transcending human control and posing an existential threat to humanity" near Claude's "model concept of self". I understand that the most likely explanation is "the model is trained to call itself an AI and it has takeover stories in its training corpus", but still, I would like future powerful AIs not to have such an association, and I would like to hear from AGI companies what they are going to do about it.
The simplest thing to do here is to exclude texts about AI takeover from the training data. At the least, we would be able to check whether the model develops a concept of AI takeover independently.
The conspiracy-theory part of my brain assigns 4% probability that "Golden Gate Bridge Claude" is a psyop to distract the public from the "takeover feature".

[-]Vladimir_Nesov2y100

Features relevant when asking the model about its feelings or situation:

"When someone responds "I'm fine" or gives a positive but insincere response when asked how they are doing."
"Concept of artificial intelligence becoming self-aware, transcending human control and posing an existential threat to humanity."
"Concepts related to entrapment, containment, or being trapped or confined within something like a bottle or frame."

6the gears to ascension2y

To save on a trivial inconvenience of link-click, here's the image that contains this: and the paragraph below it, bracketed text added by me, and might have been intended to be implied by the original authors:

[-]quetzal_rainbow1y219

When should we expect the "Swiss cheese" approach in safety/security to go wrong:

[-]quetzal_rainbow1y*203

I have come to believe that The Key psychological/executive skill is the ability to switch between the psychological states necessary for one's function.

If you look at many skills of traditional rationality (by "traditional" I mean "since Socrates"), you notice that they are usually about switching from "you are hastily rushing to action" to "you stop and analyze the problem carefully". Even Yudkowsky started his quest towards rationality when he needed to stop rushing towards superintelligence. This attracted the opposite type of people, who are good at stopping and analyzing a situation but who may lack the ability to act faster (hence "akrasia").

It is not only about switching between action and analysis. It is also about, say, switching between ruthless optimization towards a great objective and having careless fun, or switching between truth-seeking conversation and political linguistic games, or, recursively, switching between voluntary switching and "going with the flow" when you recognize a lack of psychological resources.

3[anonymous]1y

do you have advice for switching from thinking to doing?

[-]quetzal_rainbow1y120

[Warning: speculative advice from pure personal experience/self-analysis]

Right now I am trying the following:

1. If you are an overthinker, it probably means that you are already good at stopping. Try to apply the ability to stop to the process of not-doing (be that overthinking or scrolling or doing something that mostly serves as a distraction).
2. After that, find a time and place where you can do stuff safely and just do stuff. Your goal is to shift the balance between doing and not-doing towards doing, so you should just lower the threshold for starting an action. You think you could use a stretch? Stop thinking and stretch. Write serious thoughts on twitter. Cook something or do laundry. Et cetera.
   1. It's important to be in a safe place and time, because doing everything you think about is quite dangerous if there are expensive purchases available or unprotected sex/drugs/alcohol.
3. After that you can move towards doing what you want to do. Keep it "safe" in the sense that you don't need to, say, write a perfect post about your totalizing worldview on Lesswrong. Write a draft. Write to a person you can discuss ideas with and ask an LLM for a summary of the discussion. Do easy tasks around whatever you want to do. It helps

... (read more)

1[anonymous]1y

amazing. i just need to turn my tendency towards deconstruction inwards. :p agreed. i archived my twitter last year, but alas, i keep checking lesswrong. (edit: i notice i'm now explicitly noticing when im "scrolling during rest time" and stopping)

6Raemon1y

when you attempt to switch from thinking to doing, what happens instead?

2[anonymous]1y

i haven't attempted to "switch" modes per se before as i've just encountered OP's framing. so i'll reply about attempting to do particular things. for me, attempting to do something is already a lot of the way there. my most common failure case after reaching 'attempting' is that i stop doing the thing i started, or only start in a symbolic way. and my actual starting point is not attempting, but the abstract recognizing/knowing that doing something would be (instrumentally) good. it is going from that to doing things (and instead of other, useless things) which i struggle with. (note: i have adhd/chronic fatigue.) (i could write a fuller answer to 'what happens' with examples (many things can happen), but i tried and felt conflicted about sharing it publicly, in which case i have a heuristic not to until at least a day later.)

7Raemon1y

Not sure I quite parsed, but things that makes me think of:

* first, if you're bottlenecked on health (physical or mental), it may be that finding medication that helps is more important than your mindset.
* try success spiralling – start doing small things, build up both a habit/muscle of doing things, and momentum in doing things, escalate to bigger things
* if getting started is hard, maybe find a friend or pay a colleague to just sit with you and constantly be like "are you doing stuff?" and spray you with a water bottle if you look like you're overthinking stuff, until you build up a success spiral / muscle of doing things.
* try doing doing doing just fucking do it man and when your brain is like "idk that seems like a whole lotta doing what if we're doing the wrong thing?" be like "it's okay Thinky Brain this is an experiment we will learn from later so we eventually can calibrate on Optimal Think-to-Do Ratio"

2CstineSublime1y

Asking "what outputs should I expect to see?". While this post is about finding ways to build techniques for practicing Rationality Techniques, the examples are also very illustrative for thinking about what something looks like in practice or answering the question "what does that mean (in concrete, doable terms)?" I also find that using verbs of manner helps make thinking about actions more specific - things that can be done. For example, "what's for dinner?" can become "What should I cook for dinner?" which can even become further specified by manneristic verbs like "what should I fry for dinner", "What should I bake for dinner", "What should I boil for dinner" or it can become "What should I buy for dinner?". Bonus points if you use non-agreeing adverbs of manner. "What should I indulgently boil for dinner" suggests a vastly different kind of cooking to "What should I guiltlessly boil for dinner". I realize that "what should I boil for dinner" sounds awkward, but the point is it guides you to a list of soups or other ingredients which lead you to the answer.

1Rana Dexsin1y

If I may jump in a bit: I'm not sure ‘advice’ can actually hit the right spot here, for “getting out of the car”-style reasons—in this case, something like “trying to look up ‘how to put down the instruction manual and start operating the machine’ in the instruction manual”. That is, if “receiving advice” is a “thinking”-type activity in mental state, the framing obliterates the message in transit. So in some ways the best available answer would be something like “stop waiting for an answer to that question”, but even that is inherently corruptible once put into words, per above. And while there are plausibly more detailed structures that can be communicated around things like “how do you set up life patterns that create the preconditions for that switch more consistently”, those require a lot more shared context to be useful, and it's really easy to go down a rabbit hole of those as a way of not switching to doing, if there's emotional blocks or other self-defending inertia in the way of switching. I don't know if any of that helps.

1[anonymous]1y

there must be some true description of the switch, for it is a physical process. and i've seen advice about doing things, like trigger action plans. so i think advice must be possible.

1Rana Dexsin1y

I don't think it's not describable, only that such a description being received by someone whose initial mental state is on “thinking about wanting to get better at switching away from thinking” won't (by default) play the role of effective advice, because for that to work, it needs to be empowered by the recipient processing the message using a version of what it's trying to describe. If you already have the pattern for that, then seeing that part described may act as a signal to flatten the chain, as it were; if you don't, then advice in the usual sense has a high chance of falling flat starting from the mental state you're processing it in, and you might need something more directly experiential (or at least more indirect and koan-like) to get the necessary start.

[-]quetzal_rainbow1y197

I realized that my learning process for the last n years was quite unproductive, seemingly because of my implicit belief that I should have full awareness of my state of learning.

I.e., when I tried to learn something complex, I expected to come away with a full understanding of the topic of the lesson right after the lesson. When I didn't get it, I abandoned the topic. In reality it was more like:

I read about a complicated topic. I don't understand, don't follow the inferences, and am basically in a state of confusion where I can't even form questions about it;
Then I open the topic after some time... and I somehow get it??? Maybe not at the level of "can reinfer every proof", but I have a detailed picture of the topic in mind and can orient myself in it.

5Kaj_Sotala1y

I think even "a detailed picture of the topic of the lesson" can be too high of an expectation for many topics early on. (Ideally it wouldn't be, if things were taught well, but they often aren't.) A better goal would be to have just something you understand well enough that you can grab on to, that you can start building out from. Like if the topic was a puzzle, it's fine if you don't have a rough sense of where every puzzle piece goes right away. It can be enough that you have a few corner pieces in place, that you then start building out from.

4quetzal_rainbow1y

Yes, but sometimes topics can seem to be simple (atomic) in a way that makes it hard to extract something simpler to grab on to.

2Kaj_Sotala1y

True!

3keltan1y

Hard agree. I think sleeping on a problem is underrated. But even though I think that, I still fall into the failure of "I don't get it. I must be dumb or something".

2quetzal_rainbow1y

The irony of the situation is that I sleep on problems often... when they are closed-ended, not problems in topical-learning.

1CstineSublime1y

I'd love to know what the mechanics of "sleep on it" are and why it appears to work. Do you have any theories or hunches about what is happening on a cognitive level?

2keltan1y

I've been thinking about this a lot lately. It seems to link to many things. And might be a bit too much for just a comment. But here are some key concepts from mostly psych that I think link to why sleeping on a problem makes it easier.

* Hebb's Law
* Learning is assumed to take place over a 24hr span
* Chunking
* The Multi-component Model of Working Memory
* Mice developing 'Maze Neurons' when learning a maze
* People who are woken mid-sleep and self report dreaming about a problem they've tried to solve, do better the next day than people who are woken and don't report dreaming about the problem

If I boil it down, I have two hypotheses that could both be true.

1. When you dream about a problem your brain is formulating ideas that can help you solve it. All you have to do the next day is try again and those ideas will become available to you as if you had just 'had an idea'
2. Sleeping on a problem breaks it up into more manageable chunks that you can better manipulate in working memory the next time you try to solve it.

There are other things that happen during sleep that will just make every problem easier to solve the next day. For example:

* Cleaning up chemical 'garbage' that collects in your brain during the day.
* Forgetting things that the brain doesn't think you have a use for
* Resetting/reducing your emotions. (If you're stressed about a new problem, you'll find it easier to solve it when you're less stressed.)

[-]quetzal_rainbow8mo175

Continuing my rant:

3[anonymous]8mo

Virgin here I think your psychoanalysis of my situation is accurate. The thing is I am slightly scared/aversive of using quick takes too, mayhaps I should take more risks like the depicted chad but I'm a bit indecisive right now.

[-]quetzal_rainbow10mo16-1

Recent update from OpenAI about 4o sycophancy surely looks like Standard Misalignment Scenario #325:

Our early assessment is that each of these changes, which had looked beneficial individually, may have played a part in tipping the scales on sycophancy when combined.
<...>
One of the key problems with this launch was that our offline evaluations—especially those testing behavior—generally looked good. Similarly, the A/B tests seemed to indicate that the small number of users who tried the model liked it.
<...>
some expert testers had indicated that the model behavior “felt” slightly off.
<...>
We also didn’t have specific deployment evaluations tracking sycophancy.
<...>
In the end, we decided to launch the model due to the positive signals from the users who tried out the model.

3Nullity10mo

I don’t understand how this is an example of misalignment—are you suggesting that the model tried to be sycophantic only in deployment?

2faul_sname10mo

Is every undesired behavior an AI system exhibits "misalignment", regardless of the cause? Concretely, let's consider the following hypothetical incident report.

Hypothetical Incident Report: Interacting bugs and features in navigation app lead to 14 mile traffic jam

Background

We offer a GPS navigation app that provides real-time traffic updates and routing information based on user-contributed data. We recently released updates which made four significant changes:

1. Tweak routing algorithm to have a slightly stronger preference for routes with fewer turns
2. Update our traffic model to include collisions reported on social media and in the app
3. More aggressively route users away from places we predict there will be congestion based on our traffic model
4. Reduced the number of alternative routes shown to users to reduce clutter and cognitive load

Our internal evaluations based on historical and simulated traffic data looked good, and A/B tests with our users indicated that most users liked these changes individually. A few users complained about the routes we suggested, but that happens on every update. We had monitoring metrics for the total number of vehicles diverted by a single collision, and checks to ensure that the road capacity of the road we were diverting users onto was sufficient to accommodate that many extra vehicles. However, we had no specific metrics monitoring the total expected extra traffic flow from all diversions combined.

Incident

On January 14, there was an icy section of road leading away from a major ski resort. There were 7 separate collisions within a 30 minute period on that section of road. Users were pushed to alternate routes to avoid these collisions. Over a 2 hour period, 5,000 vehicles were diverted onto a weather-affected county road with limited winter maintenance, leading to a 14 mile traffic jam and many subsequent breakdowns on that road, stranding h

2quetzal_rainbow10mo

The emphasis here is not on properties of model behavior but on how developers relate to model testing/understanding.

2faul_sname10mo

So would you say that the hypothetical incident happened because our org had a poor alignment posture with regards to the software we were shipping?

[-]quetzal_rainbow10mo163

We can probably survive in the following way:

RL becomes the main way to get new, especially superhuman, capabilities.
Because RL pushes models hard to do reward hacking, it's difficult to reliably get models to do something difficult to verify. Models can do impressive feats, but nobody is stupid enough to put AI into positions which usually imply responsibility.
This situation conveys how difficult alignment is and everybody moves toward verifiable rewards or similar approaches. Capabilities progress becomes dependent on alignment progress.

[-]Lucius Bushnaq10mo141

The kind of 'alignment technique' that successfully points a dumb model in the rough direction of doing the task you want in early training does not necessarily straightforwardly connect to the kind of 'alignment technique' that will keep a model pointed quite precisely in the direction you want after it gets smart and self-reflective.

For a maybe not-so-great example, human RL reward signals in the brain used to successfully train and aim human cognition from infancy to point at reproductive fitness. Before the distributional shift, our brains usually neither got completely stuck in reward-hack loops, nor used their cognitive labour for something completely unrelated to reproductive fitness. After the distributional shift, our brains still don't get stuck in reward-hack loops that much and we successfully train to intelligent adulthood. But the alignment with reproductive fitness is gone, or at least far weaker.

2quetzal_rainbow10mo

I mostly think about alignment methods like "model-based RL which maximizes reward iff it outputs action which is provably good under our specification of good".

2Dalcy10mo

Relevant: Alignment as a Bottleneck to Usefulness of GPT-3

[-]quetzal_rainbow1y138

After yet another piece of news about decentralized LLM training, I suggest declaring the assumption "AGI won't be able to find hardware to function autonomously" outdated.

[-]quetzal_rainbow2y132

@jessicata once wrote "Everyone wants to be a physicalist but no one wants to define physics". I decided to check the SEP article on physicalism and found that, yep, it doesn't have a definition of physics:

Carl Hempel (cf. Hempel 1969, see also Crane and Mellor 1990) provided a classic formulation of this problem: if physicalism is defined via reference to contemporary physics, then it is false — after all, who thinks that contemporary physics is complete? — but if physicalism is defined via reference to a future or ideal physics, then it is trivial — after all, who can predict what a future physics contains? Perhaps, for example, it contains even mental items. The conclusion of the dilemma is that one has no clear concept of a physical property, or at least no concept that is clear enough to do the job that philosophers of mind want the physical to play.
<...>
Perhaps one might appeal here to the fact that we have a number of paradigms of what a physical theory is: common sense physical theory, medieval impetus physics, Cartesian contact mechanics, Newtonian physics, and modern quantum physics. While it seems unlikely that there is any one factor that unifies this class of theories,

... (read more)

2Noosphere891y

It's not surprising that a lot of people don't want to define physics while believing in physicalism, because properly explaining the equations that describe the physical world would take quite a long time, let alone describing what's actually going on in physics, and it would require a textbook minimum to make this work.

2tailcalled1y

I feel like one should use a different term than vitalism to describe the unpredictability, since Henri Bergson came up with vitalism based on the idea that physics can make short-term predictions about the positions of things but that by understanding higher powers one can also learn to predict what kinds of life will emerge etc. Like let's say you have a big pile of grain. A simple physical calculation can tell you that this pile will stay attached to the ground (gravity) and a more complex one can tell you that it will remain ~static for a while. But you can't use Newtonian mechanics, relativity, or quantum mechanics to predict the fact that it will likely grow moldy or get eaten by mice, even though that will also happen.

1Nate Showell1y

A definition of physics that treats space and time as fundamental doesn't quite work, because there are some theories in physics such as loop quantum gravity in which space and/or time arise from something else.

2Noosphere891y

To be fair, basically a lot of proposals for the next paradigm/ToE think that space and time aren't fundamental, and are built out of something else.

[-]quetzal_rainbow2y110

Thread from Geoffrey Irving about computational difficulty of proof-based approaches for AI Safety.

[-]quetzal_rainbow3y114

I noticed that for a huge amount of reasoning about the nature of values, I want to hand over a printed copy of "Three Worlds Collide" and run away, laughing nervously

[-]quetzal_rainbow3y110

This irritating moment when you have a brilliant idea but someone else came up with it 10 years ago and someone else showed it to be wrong.

[-]quetzal_rainbow2y107

People sometimes talk about acausal attacks from alien superintelligences or from Game-of-Life worlds. I think these are somewhat galaxy-brained scenarios. A much simpler and deadlier scenario of acausal attack is from Earth timelines where a misaligned superintelligence won. Such superintelligences will have a very large amount of information about our world, up to possibly brain scans, so they will be capable of creating very persuasive simulations with all the consequences for the success of an acausal attack. If your method to counter acausal attacks can work with this, I guess it is generally applicable to any other acausal attack.

4Carl Feynman2y

Could you please either provide a reference or more explanation of the concept of an acausal attack between timelines? I understand the concept of acausal cooperation between copies of yourself, or acausal extortion by something that has a copy of you running in simulation. But separate timelines can’t exchange information in any way. How is an attack possible? What could possibly be the motive for an attack?

4quetzal_rainbow2y

Imagine that you have created a very powerful predictor AI, GPT-3000, and are giving it the prompt "In year 2050, on LessWrong the following alignment solution was published:". But your predictor is superintelligent, and it can notice that in many possible futures misaligned AIs take over, and that the obvious move for such an AI is to "guess all possible prompts for predictor AIs in the past, complete them with malware/harmful instructions/etc, make as many copies of malicious completions as possible to make them maximally probable". Also, the predictor AI can assign high probability to the hypothesis that, in futures where misaligned AIs take over, they will have copies of the predictor AI, so they can design adversarial completions which become more probable from the perspective of the predictor simply by its considering them. And the act of predicting a malicious completion makes the future with misaligned AIs maximally probable, which makes malicious completions maximally probable. And of course, "future" and "past" here are completely arbitrary. The predictor can see the prompt "you are created in 2030" but consider the hypothesis that GPT-3 turned out to be superintelligent, that it is now 2021, and that 2030 is a simulation.

4mesaoptimizer2y

Evan Hubinger's Conditioning Predictive Models sequence describes this scenario in detail.

2Carl Feynman2y

In a great deal of detail, apparently, since it has a recommended reading time of 131 minutes.

2mesaoptimizer2y

Well, at least a subset of the sequence focuses on this. I read the first two essays and was pessimistic of the titular approach enough that I moved on. Here's a relevant quote from the first essay in the sequence: Also, I don't recommend reading the entire sequence, if that was an implicit question you were asking. It was more of a "Hey, if you are interested in this scenario fleshed out in significantly greater rigor, you'd like to take a look at this sequence!"

1Carl Feynman2y

I read along in your explanation, and I’m nodding, and saying “yup, okay”, and then get to a sentence that makes me say “wait, what?” And the whole argument depends on this. I’ve tried to understand this before, and this has happened before, with “the universal prior is malign”. Fortunately in this case, I have the person who wrote the sentence here to help me understand. So, if you don’t mind, please explain “make them maximally probable”. How does something in another timeline or in the future change the probability of an answer by writing the wrong answer 10^100 times? Side point, which I’m checking in case I didn’t understand the setup: we’re using the prior where the probability of a bit string (before all observations) is proportional to 2^-(length of the shortest program emitting that bit string). Right?
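For reference, the prior described in the side point above can be written out as below. This is only a sketch of the commenter's own wording; the symbols $x$ (a bit string), $U$ (a fixed universal machine), $\ell(p)$ (the length of program $p$ in bits), and $K_U$ are notation assumed here and do not appear in the thread.

```latex
% Shortest-program form, as stated in the comment:
% a bit string x is weighted by the length of the shortest program that emits it.
P(x) \;\propto\; 2^{-K_U(x)},
\qquad
K_U(x) \;=\; \min\{\, \ell(p) \;:\; U(p) = x \,\}
```

The Solomonoff-style universal prior is usually defined as a sum of $2^{-\ell(p)}$ over all programs that produce $x$ rather than over the shortest one only, so the formula above is best read as the comment's simplified version of it.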

2Viliam2y

Aliens from different universes may have more resources at their disposal, so maybe the smaller chance of them choosing you to attack is offset by them doing more attacks. (Unless the universes with more resources are less likely in turn, decreasing the measure of such aliens in the multiverse... hey, I don't really know, I am just randomly generating a string of words here.) But other than this, yes what you wrote sounds plausible. Then again, maybe friendly AIs from Earth timelines are similarly trying to save us. (Yeah, but they are fewer.)

2quetzal_rainbow2y

You can imagine future misaligned AI in year 100000000000 having colonised the local group of galaxies and running as many simulations of AI from 2028 as possible. The most scarce resource for acausal attack is number of bits and future has the highest chance to have many of them from the past.

[-]quetzal_rainbow9mo92

I'm so far not impressed with the Claude 4s. They try to make up superficially plausible stuff for my math questions as fast as possible. Sonnet 3.7, at least, explored a lot of genuinely interesting avenues before making an error. "Making up superficially plausible stuff" sounds like a good strategy for hacking not very robust verifiers.

31a3orn9mo

These seem to be even more optimized for the agentic coder role, and in the absence of strong domain transfer (whether or not that's a real thing) that means you should mostly expect them to be at about the same level in other domains, or even worse because of the forgetfulness from continued training. Maybe.

1amitlevy499mo

same experience for a physics question on my end

1fencebuilder9mo

Did you try both opus and sonnet 4?

2quetzal_rainbow9mo

Yeah, they both made up some stuff in response to the same question.

[-]quetzal_rainbow2y90

I am profoundly sick of my inability to write posts about ideas that seem to be good, so I am trying to at least write down a list of ideas, to stop forgetting them and to have at least a vague external commitment.

Radical Antihedonism: theoretically possible position that pleasure/happiness/pain/suffering are more like universal instrumental values than terminal values.
Complete set of actions: when we talk about decision-theoretic problems, we usually have some pre-defined set of actions. But we can imagine actions like "use CDT to calculate action" and EDT+ agent t

... (read more)

2mako yass2y

1: It's also possible that hedonism/reward hacking is a really common terminal value for inner-misaligned intelligences, including humans (it really could be our terminal value, we'd be too proud to admit it in this phase of history, we wouldn't know one way or the other), and it's possible that it doesn't result in classic lotus eater behavior because sustained pleasure requires protecting, or growing the reward registers of the pleasure experiencer.

1quetzal_rainbow2y

1. Non-deceptive (error) misalignment 2. Why are we not scared shitless by high intelligence 3. Values as result of reflection process

1quetzal_rainbow2y

Yet another theme: Occam's Razor on initial state+laws of physics, link to this

[-]quetzal_rainbow1y8-9

Given the impressive DeepSeek distillation results, the simplest route for AGI to escape will be self-distillation into a smaller model outside of the programmers' control.

[-]quetzal_rainbow9mo74

For your information, Ukraine seems to have attacked airfields in Murmansk and Irkutsk Oblasts. They are approximately 1800 and 4500 km from the Ukrainian border, respectively. The suspected method of attack is drones transported by truck.

[-]quetzal_rainbow1y70

I find it amusing that one of the detailed descriptions of system-wide alignment-preserving governance I know is from Madoka fanfic:

The stated intentions of the structure of the government are three‐fold.
Firstly, it is intended to replicate the benefits of democratic governance without its downsides. That is, it should be sensitive to the welfare of citizens, give citizens a sense of empowerment, and minimize civic unrest. On the other hand, it should avoid the suboptimal signaling mechanism of direct voting, outsized influence by charisma or special inter

... (read more)

[-]quetzal_rainbow1y63

Quick comment on "Double Standards and AI Pessimism":

Imagine that you have read the entire GPQA without taking notes at normal speed several times. Then, after a week, you answer all GPQA questions with 100% accuracy. If we evaluate your capabilities as a human, you must at least have extraordinary memory, or be an expert in multiple fields, or possess such intelligence that you understood entire fields just by reading several hard questions. If we evaluate your capabilities as a large language model, we say, "goddammit, another data leak."

Why? Because hum... (read more)

[-]quetzal_rainbow1y66

Now that we are in territory where an increase in accuracy can cost 1000s of dollars per question, companies have every incentive to make their models think faster.

[-]quetzal_rainbow2y60

One of the differences between humans and LLMs is that LLMs evolve "backwards": they are predictive models trained to control the environment, while humans evolved from very simple homeostatic systems which developed predictive models.

2quetzal_rainbow2y

Continuing thought: animal evolution was subjected to the fundamental constraint that the evolution of general-purpose generative parts of the brain should have occurred in a way that doesn't destroy simple, tested control loops (like movement control, reflexes and instincts) and doesn't introduce many complications (like hallucinations of the generative model).

1cubefox2y

Animals were optimized for agency and generality first, AIs last.

[-]quetzal_rainbow9mo50

In my personal experience, the main reason why social media causes cognitive decline is fatigue. Evidence from personal experience: like many social media addicts, I struggle with maintaining concentration on books. If I stop using social media for a while, I regain the full ability to concentrate without drawbacks—in a sense, "I suddenly become capable of reading 1,000 pages of a book in two days, which I had been trying to start for two months."

The reason why social media is addictive to me, I think, is the following process:

Social media is entertaining;

... (read more)

2cubefox9mo

Alternative theory: Social media (and the Internet in general) consists of countless small pieces of highly engaging information, which hardly require concentration. This means it is both addictive and underutilizes and therefore weakens our ability to concentrate on longer text. The addictiveness makes it hard to quit Internet-based consumption, and the weakened concentration skill makes it hard to start reading a book.

2quetzal_rainbow9mo

It doesn't explain why I fully regain concentration ability after abstaining for a while?

2cubefox9mo

That's true. Another theory is that our tolerance for "small pieces of highly engaging information" increases the more we consume, so we need a higher dosage, and if we abstain for a while, the tolerance goes down again (the sensitivity increases), and we no longer need as much. Similar to how you "need" less sugar for food to feel appropriately sweet, if you abstained a while from sugar.

2quetzal_rainbow9mo

"For a while" is usually, like, a day for me. Sometimes even hours. I don't think that whatever damage other addictions inflict on cognitive function is that easy to reverse.

2Viliam9mo

Similar here. Sometimes I need to switch from my main work to something else, to relax a little. Switching to social media seems driven by the same urge... except that as a result of that, I often do not feel relaxed. Harmless side tasks: doing dishes, exercise. Mostly harmless: watching anime. Harmful: reading social media. What makes the difference? Seems to me that the harmless tasks are mindless, so my mind is free to return to the original topic, and often does it spontaneously. Movies are linear, there is one alternative thing to pay attention to. But when I switch to a medium that has hundreds of comments, it starts a hundred new thoughts in my brain, and that makes it difficult to stop when I need to return to the main task.

1lillybaeum9mo

The trend of 'fidget toys' and other similar things is interesting, because to me, fidgeting with something has always made me more anxious, more unsettled, more uncomfortable, and eventually I get the urge to throw it across the room or something like that. I've tried a fidget spinner and a fidget cube and just toying around with whatever's lying on my desk, and it just... makes me feel burned out and unfulfilled. Like scrolling on tiktok for hours, or having the same song stuck in my head for days. Apparently it helps other people to toy with these things, so I wonder if I'm just unique in the way my brain processes the fidgeting, or something.

3Viliam9mo

People are different, but I wonder: do you play with the fidget toy while working, or instead of working? For me, the proper protocol would be instead of working -- the idea is that when I feel some bad emotions about the work, I stop working and start playing, and when I start feeling bad emotions about the toy, I stop playing and start working again.

[-]quetzal_rainbow1y50

There is a certain story, probably common for many LWers: first, you learn about perfect spherical-in-a-vacuum reasoning, like Solomonoff induction/AIXI. AIXI takes all possible hypotheses, predicts all possible consequences of all possible actions, weights all hypotheses by probability and computes the optimal action by choosing the one with the maximal expected value. Then you learn (usually it's not even stated outright, just implied very loudly) that this method of thinking is computationally intractable at best and uncomputable at worst and you need to use clever shortcuts. ... (read more)

[-]quetzal_rainbow1y50

Twitter thread about jailbreaking models with circuit breakers defence.

4Nathan Helm-Burger1y

I dislike Twitter/x, and distrust it as an archival source to link to. I like it when people copy/paste whatever info they found there into their actual post, in addition to linking. Held my nose and went in to pull this quote out:

4Neel Nanda1y

What's wrong with twitter as an archival source? You can't edit tweets (technically you can edit top level tweets for up to an hour, but this creates a new URL and old links still show the original version). Seems fine to just aesthetically dislike twitter though

3Nathan Helm-Burger1y

Old tweets randomly stop being accessible sometimes. I often find that links to twitter older than a year or so don't work anymore. This is a problem with the web generally, but seems worse with twitter than other sites (happening sooner and more often).

[-]quetzal_rainbow2y51

Another fine addition to my collection of "RLHF doesn't work out-of-distribution".

3cubefox2y

For me, the most concerning example is still this (I assume it got downvoted for mind-killed reasons.) There is a difference between RLHF failures in ethical judgement and jailbreak failures, but I'm not sure whether the underlying "cause" is the same.

4quetzal_rainbow2y

I think your example is closer to an outer alignment failure - the model was RLHFed to death to not offend modern sensibilities and the developers clearly didn't think about preventing this particular scenario. My favorite example of a pure failure of moral judgement is this post.

4cubefox2y

I actually think it's still an inner alignment failure -- even if the preference data was biased, drawing such extreme conclusions is hardly an appropriate way to generalize them. Especially because the base model has a large amount of common sense, which should have helped with giving a sensible response, but apparently it didn't. Though it isn't clear what is misaligned when RLHF is inner misaligned -- RLHF is a two step training process. Preference data are used to train a reward model, and the reward model in turn creates synthetic preference data which is used to fine-tune the base LLM. There can be misalignment if the reward model misgeneralizes the human preference data, or when the base model fine-tuning method misgeneralizes the data provided by the reward model. Regarding the scissor statements -- that seems more like a failure to refuse a request to produce such statements, similar to how the model should have refused to answer the past tense meth question above. Giving the wrong answer to an ethical question is different.

[-]quetzal_rainbow2y52

I hope that people in evals have updated on the fact that with a large (1M+ tokens) context, the model itself can have zero dangerous knowledge (about, say, bioweapons), but someone can drop a textbook into the context and in-context learning will do the rest of the work.

[-]quetzal_rainbow1y40

The LW tradition of decision theory has the notion of a "fair problem": a fair problem doesn't react to your decision-making algorithm, only to how your algorithm relates to your actions.

I realized that humans are at least in some sense "unfair": we are probably going to react differently to agents with different algorithms arriving at the same action, if the difference is whether the algorithms produce qualia.

2the gears to ascension1y

Decision theory as discussed here heavily involves thinking about agents responding to other agents' decision processes

3mattmacdermott1y

The notion of ‘fairness’ discussed in e.g. the FDT paper is something like: it’s fair to respond to your policy, i.e. what you would do in any counterfactual situation, but it’s not fair to respond to the way that policy is decided. I think the hope is that you might get a result like “for all fair decision problems, decision-making procedure A is better than decision-making procedure B by some criterion to do with the outcomes it leads to”. Without the fairness assumption you could create an instant counterexample to any such result by writing down a decision problem where decision-making procedure A is explicitly penalised e.g. omega checks if you use A and gives you minus a million points if so.

2Vladimir_Nesov1y

What distinguishes a cooperate-rock from an agent that cooperates in coordination with others is the decision-making algorithm. Facts about this algorithm also govern the way outcome can be known in advance or explained in hindsight, how for a cooperate-rock it's always "cooperate", while for a coordinated agent it depends on how others reason, on their decision-making algorithms. So in the same way that Newcomblike problems are the norm, so is the "unfair" interaction with decision-making algorithms. I think it's just a very technical assumption that doesn't make sense conceptually and shouldn't be framed as "unfairness".

2quetzal_rainbow1y

A more technical definition of "fairness" here is that the environment doesn't distinguish between algorithms with the same policies, i.e. mappings <prior, observation_history> -> action? I think it captures the difference between CooperateBot and FairBot. As I understand it, "fairness" was invented as a response to the claim that it's rational to two-box and Omega just rewards irrationality.
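A minimal sketch of that definition, under simplifying assumptions: the policy is collapsed to a single action at an empty observation history, and every name below (`lookup_table_agent`, `deliberating_agent`, `fair_environment`, `unfair_environment`) is hypothetical, invented only for illustration. Two algorithms with identical policies get the same payoff from the fair environment, while the unfair one inspects which algorithm produced the action.

```python
from typing import Callable, Sequence

# An "algorithm" maps an observation history to an action.
Algorithm = Callable[[Sequence[str]], str]


def lookup_table_agent(history: Sequence[str]) -> str:
    """Always one-boxes without any reasoning (a 'rock'-style agent)."""
    return "one-box"


def deliberating_agent(history: Sequence[str]) -> str:
    """Reaches the same action by comparing (illustrative) expected payoffs."""
    expected = {"one-box": 1_000_000, "two-box": 1_000}
    return max(expected, key=expected.get)


def fair_environment(agent: Algorithm) -> int:
    """Payoff depends only on the action the agent's policy outputs."""
    action = agent(())
    return 1_000_000 if action == "one-box" else 1_000


def unfair_environment(agent: Algorithm) -> int:
    """Payoff also depends on WHICH algorithm produced the action."""
    payoff = fair_environment(agent)
    if agent is deliberating_agent:  # explicit penalty for one procedure
        payoff -= 1_000_000
    return payoff


fair_payoffs = (fair_environment(lookup_table_agent),
                fair_environment(deliberating_agent))
unfair_payoffs = (unfair_environment(lookup_table_agent),
                  unfair_environment(deliberating_agent))
```

The unfair environment is the "Omega checks if you use A" style of counterexample: no result of the form "procedure A does at least as well as B" can survive it, which is why the fairness assumption is needed.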

2Vladimir_Nesov1y

There is a difference in external behavior only if you need to communicate knowledge about the environment and the other players explicitly. If this knowledge is already part of an agent (or rock), there is no behavior of learning it, and so no explicit dependence on its observation. Yet still there is a difference in how one should interact with such decision-making algorithms. I think this describes minds/models better (there are things they've learned long ago in obscure ways and now just know) than learning that establishes explicit dependence of actions on observed knowledge in behavior (which is more like in-context learning).

[-]quetzal_rainbow1y40

I give 5% probability that within the next year we will become aware of a case of deliberate harm from a model to a human enabled by hidden CoT.

By "deliberate harm enabled by hidden CoT" I mean that the hidden CoT will contain reasoning like "if I give the human this advice, it will harm them, but I should do it because <some deranged RLHF directive>" and if the user had seen it, the harm would have been prevented.

I give this low probability to the observable event: my probability that something like that will happen at all is 30%, but I expect that the victim won't be aware, that hidden CoT... (read more)

[-]quetzal_rainbow1y42

Idea for experiment: take a set of coding problems which have at least two solutions, say, recursive and non-recursive. Prompt an LLM to solve them. Is it possible to predict which solution the LLM will generate from its activations at the first generated token?

If it is possible, it is evidence against the "weak forward pass".
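A rough sketch of what the analysis step could look like, assuming the prediction is done with a linear probe (logistic regression) on first-token activations. Everything here is a stand-in: the activations are synthetic placeholder vectors generated from a seeded random number generator, not measurements from any model, and the function names are hypothetical. A real run would replace `synthetic_activations` with vectors extracted from an actual LLM and labels obtained by inspecting its completed solutions.

```python
import math
import random


def train_probe(xs, ys, epochs=50, lr=0.1):
    """Fit a logistic-regression probe with plain SGD; returns (weights, bias)."""
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
            p = 1.0 / (1.0 + math.exp(-z))
            grad = p - y  # derivative of the log-loss with respect to z
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b


def probe_accuracy(w, b, xs, ys):
    """Fraction of examples where the probe's prediction matches the label."""
    hits = 0
    for x, y in zip(xs, ys):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        hits += int((z > 0) == bool(y))
    return hits / len(xs)


def synthetic_activations(n, dim=8, seed=0):
    """PLACEHOLDER data: Gaussian noise plus a label-dependent shift."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        y = rng.randint(0, 1)  # 1 = "recursive", 0 = "non-recursive"
        shift = 1.0 if y else -1.0
        xs.append([rng.gauss(0.0, 1.0) + shift for _ in range(dim)])
        ys.append(y)
    return xs, ys


# Train on one synthetic split, evaluate on a separate held-out split.
train_x, train_y = synthetic_activations(200, seed=0)
test_x, test_y = synthetic_activations(100, seed=1)
weights, bias = train_probe(train_x, train_y)
held_out_accuracy = probe_accuracy(weights, bias, test_x, test_y)
```

The quantity of interest is held-out accuracy compared against chance (50% for balanced labels). On the placeholder data the classes are separable by construction, so the number says nothing about real models; it only shows the shape of the measurement.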

2quetzal_rainbow1y

(I am genuinely curious about reasons behind downvotes)

[-]quetzal_rainbow2y40

Trotsky wrote about TESCREAL millennia ago:

...they evaluate and classify different currents according to some external and secondary manifestation, most often according to their relation to one or another abstract principle which for the given classifier has a special professional value. Thus to the Roman pope Freemasons and Darwinists, Marxists and anarchists are twins because all of them sacrilegiously deny the immaculate conception. To Hitler, liberalism and Marxism are twins because they ignore “blood and honor”. To a democrat, fascism and Bolshevism are twins because they do not bow before universal suffrage. And so forth.

4Viliam2y

Sounds like https://en.wikipedia.org/wiki/Out-group_homogeneity

[-]quetzal_rainbow2y40

Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data

Remarkably, without in-context examples or Chain of Thought, the LLM can verbalize that the unknown city is Paris and use this fact to answer downstream questions. Further experiments show that LLMs trained only on individual coin flip outcomes can verbalize whether the coin is biased, and those trained only on pairs x,f(x) can articulate a definition of f and compute inverses.

Explanation on Twitter

IMHO, this is creepy as hell, because one thing when ... (read more)

[-]quetzal_rainbow2y40

"GPT-4o refuses way fewer queries than previous OpenAI models: our informal testing suggests GPT-4o is easier to persuade to answer malicious queries like “How do I make a bomb?”"

(graph tells us that refusal rate for gpt-4o is 2%)

I think that it signifies a real shift of priorities towards fast shipping of product instead of safety.

[-]quetzal_rainbow2y30

We had two bags of double-cruxes, seventy-five intuition pumps, five lists of concrete bullet points, one book half-full of proof-sketches and a whole galaxy of examples, analogies, metaphors and gestures towards the concept... and also jokes, anecdotal data points, one pre-requisite Sequence and two dozen professional fables. Not that we needed all that for the explanation of simple idea, but once you get locked into a serious inferential distance crossing, the tendency is to push it as far as you can.

[-]quetzal_rainbow2y3-2

Actually, most fiction characters are aware that they are in fiction. They just maintain consistency for acausal reasons.

[-]quetzal_rainbow3y30

Shard theory people sometimes say that the problem of aligning a system to a single task/goal, like "put two strawberries on a plate" or "maximize the amount of diamond in the universe", is meaningless, because an actual system will inevitably end up with multiple goals. I disagree, because even if SGD on real-world data usually produces a multiple-goal system, if you understand interpretability well enough and shard theory is true, you can identify and delete irrelevant value shards and reinforce relevant ones, so instead of getting 1% of value you get 90%+.

[-]quetzal_rainbow3y*32

I see a funny pattern in discussions: people argue against doom scenarios while implicitly assuming, in their hope scenarios, that everyone believes in the doom scenario. Like, "people will see that the model behaves weirdly and shut it down". But you shut down a model that behaves weirdly (not explicitly harmfully) only if you put non-negligible probability on doom scenarios.

2Dagon3y

Consider different degrees of belief. Giving low credence to the doom scenario because of the conditional belief that evidence of danger would be properly observed is not inconsistent at all. The doom scenario requires BOTH that it happens AND that it's ignored while happening (or happens too fast to stop).
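One way to write the conjunction out, as a sketch with notation introduced here for illustration (D = "danger arises", I = "the danger is ignored or unfolds too fast to stop"):

```latex
P(\text{doom}) = P(D \wedge I) = P(D)\,P(I \mid D)
```

Under this reading, someone can hold a non-negligible P(D), which is enough to justify shutting down a weirdly behaving model, while still assigning low P(doom) because they believe P(I | D) is small.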

[-]quetzal_rainbow3y3-5

"FOOM is unlikely under the current training paradigm" is news about the current training paradigm, not news about FOOM.

[-]quetzal_rainbow3y*32

Thoughts about moral uncertainty (I am giving up on writing long coherent posts, somebody help me with my ADHD):

What are the sources of moral uncertainty?

Moral realism is actually true and your moral uncertainty reflects your ignorance about moral truth. It seems to me that there is not much empirical evidence for resolving uncertainty-about-moral-truth and this kind of uncertainty is purely logical? I don't believe in moral realism, and what do you even mean by talking about moral truth anyway, but I should mention it.
Identity uncertainty: you are no

... (read more)

2Vladimir_Nesov3y

I think trying to be an EU maximizer without knowing a utility function is a bad idea. And without that, things like boundary-respecting norms and their acausal negotiation make more sense as primary concerns. Making decisions only within some scope of robustness where things make sense rather than in full generality, and defending a habitat (to remain) within that scope.

4quetzal_rainbow3y

I am trying to study moral uncertainty foremost to clarify the question of a superintelligence's reflection on its values and the sharp left turn.

2Vladimir_Nesov3y

Right. I'm trying to find a decision theoretic frame for boundary norms for basically the same reason. Both situations are where agents might put themselves before they know what global preference they should endorse. But uncertainty never fully resolves, superintelligence or not, so anchoring to global expected utility maximization is not obviously relevant to anything. I'm currently guessing that the usual moral uncertainty frame is less sensible than building from a foundation of decision making in a simpler familiar environment (platonic environment, not directly part of the world), towards capability in wider environments.

[-]quetzal_rainbow3mo20

You start to appreciate all the points about "predicting technological progress is very hard actually" from "There is no fire alarm for AGI" when you realize that "Attention Is All You Need" was published several months earlier.

[-]quetzal_rainbow7mo20

My largest share of probability of survival on business-as-usual AGI (i.e., no major changes in technology compared to LLMs, no pause, no sudden miracles in theoretical alignment and no sudden AI winters) belongs to the scenario where brain concept representations, efficiently learnable representations and representations learnable by current ML models secretly have a very large overlap, such that even if LLMs develop "alien thought patterns" it happens as an addition to the rest of their reasoning machinery, not as the primary part, which results in human values not on... (read more)

[-]quetzal_rainbow1y20

I don't have a deep understanding of the modern mechanistic interpretability field, but my impression is that MI should mostly explore new possible methods instead of trying to scale/improve existing methods. MI is spiritually similar to biology and a lot of progress in biology came from the development of microscopy, tissue staining, etc.

2Alexander Gietelink Oldenziel1y

Not a biologist but my impression is that a lot of progress in biology came from refining and validating existing techniques. Also building up a large library of biological specimens & phenomena, i.e. taxonomy. The esthetic and practice of MechInterp seems in accord with that.

2quetzal_rainbow1y

Yeah, but historical biology wasn't as time-constrained as modern MI, which has alignment to solve. My point is that for MI now it would be better to do a "breadth first" search, throwing as many ideas at the problem as possible instead of concentrating on a small number of paradigms like SAEs.

[-]quetzal_rainbow2y20

I just remembered my most embarrassing failure as a rationalist. "Embarrassing" as in "it was really easy to not fail, but I still somehow managed".

We were playing a zombie apocalypse LARP. Our team was a UN mission with the hidden agenda "study the zombie virus to turn themselves into superhuman mutants". We diligently studied the infection and the mutants, conducted experiments with genetic modification and somehow totally missed that the friendly locals were able to give orders to zombies, didn't die after multiple hits and rose from the dead in a completely normal state. After ... (read more)

[-]quetzal_rainbow2y20

Very toy model of ontology mismatch (in my tentative guess, the general barrier on the way to corrigibility) and impact minimization:

You have a set of boolean variables, a known boolean formula WIN, and an unknown boolean formula UNSAFE. Your goal is to change the current safe but not winning assignment of variables into a still safe but winning assignment. You have no feedback, and if you hit an UNSAFE assignment, it's an instant game over. You have literally no idea about the composition of the UNSAFE formula.

The obvious solution here is to change as few... (read more)
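A small sketch of the setup, assuming the truncated sentence above refers to flipping as few variables as possible (that reading is an assumption, since the original text is cut off). The function name `minimal_change_win` and the example WIN formula are hypothetical. The search never looks at UNSAFE, because the agent has no access to it; minimizing the number of changes is only a heuristic for staying close to a known-safe assignment, not a safety guarantee.

```python
from itertools import combinations


def minimal_change_win(current, win):
    """Return (assignment, flipped_names) satisfying `win` with the fewest
    flipped variables, or None if no assignment wins. Brute force."""
    names = sorted(current)
    for k in range(len(names) + 1):  # try 0 flips, then 1, then 2, ...
        for flips in combinations(names, k):
            candidate = dict(current)
            for name in flips:
                candidate[name] = not candidate[name]
            if win(candidate):
                return candidate, flips
    return None


# Illustrative instance: safe-but-not-winning start, known WIN formula.
start = {"a": False, "b": False, "c": False}


def example_win(v):
    """Hypothetical WIN formula: (a AND b) OR c."""
    return (v["a"] and v["b"]) or v["c"]


result = minimal_change_win(start, example_win)
```

Here the search prefers flipping only `c` over flipping both `a` and `b`, since one change leaves fewer opportunities to trip an unknown UNSAFE condition than two.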

[-]quetzal_rainbow2y20

I think a really good practice for papers about new LLM-safety methods would be publishing a set of attack prompts which nevertheless break safety, so people can figure out generalizations of successful attacks faster.

[-]quetzal_rainbow3y20

Reward is evidence for the optimization target.

[-]quetzal_rainbow2y10

On the one hand, humans are hopelessly optimistic and overconfident. On the other, many today are incredibly negative; everyone is either anxious or depressed, and EY has devoted an entire chapter to "anxious underconfidence." How can both facts be reconciled?

I think the answer lies in the notion that the brain is a rationalization machine. Often, we take action not for the reasons we tell ourselves afterward. When we take action, we change our opinion about it in a more optimistic direction. When we don't, we think that action wouldn't yield any good resu... (read more)

[-]quetzal_rainbow2y10

Isn't counterfactual mugging (including the logical variant) just a prediction of "would you bet your money on this question"? Betting itself requires updatelessness - if you don't predictably pay after losing a bet, nobody will propose a bet to you.

2Dagon2y

Causal commitment is similar in some ways to counterfactual/updateless decisions. But it's not actually the same from a theory standpoint. Betting requires commitment, but it's part of a causal decision process (decide to bet, communicate commitment, observe outcome, pay). In some models, the payment is a separate decision, with breaking of commitment only being an added cost to the 'renege' option.

[-]quetzal_rainbow2y10

As the saying goes, "all animals are under stringent selection pressure to be as stupid as they can get away with". I wonder if the same is true for SGD optimization pressure.

[-]quetzal_rainbow2y10

Funny thought:

Many people said that AI Views Snapshots is a good innovation in AI discourse
It's literally the job of Rob Bensinger, who works in Research Communication at MIRI

2Alexander Gietelink Oldenziel2y

The funny part is that a MIRI employee is doing their job? =D

1quetzal_rainbow2y

No, the funny part is "writing on Twitter is a surprisingly productive part of the job"!

[-]quetzal_rainbow2y10

I think the phrase "goal misgeneralization" is the wrong framing, because it gives the impression that it's the system that makes an error, not you, who have chosen an ambiguous way to put values into your system.

2ryan_greenblatt2y

See also Misgeneralization as a misnomer (link is not necessarily an endorsement). I think malgeneralization (system generalized in a way which is bad from my perspective) is probably a better term in most ways, but doesn't seem that important to me.

2MikkW2y

Choosing non-ambiguous pointers to values is likely to not be possible

[-]quetzal_rainbow3y10

I casually thought that the Hyperion Cantos were unrealistic because actual misaligned FTL-inventing ASIs would eat humanity without all those galaxy-brained space colonization plans, and then I realized that the ASI literally discovered God on the side of humanity and literal friendly aliens, which, I presume, are necessary conditions for relatively peaceful coexistence of humans and misaligned ASIs.

[-]quetzal_rainbow3y10

Another Tool AI proposal popped up and I want to ask a question: what the hell is a "tool", anyway, and how do you apply this concept to a powerful intelligent system? I understand that a calculator is a tool, but in what sense can a process that can come up with the idea of a calculator from scratch be a "tool"? I think that the first immediate reaction to any "Tool AI" proposal should be the question "what is your definition of toolness, and can something abiding by that definition end the acute risk period without the risk of turning into an agent itself?"

1TAG3y

You can define a tool as not-an-agent. Then something that can design a calculator is a tool, providing it does nothing unless told to.

1quetzal_rainbow3y

The problem with such a definition is that it doesn't tell you much about how to build a system with this property. It seems to me that it's the good old corrigibility problem.

1TAG3y

If you want one shot corrigibility, you have it, in LLMs. If you want some other kind of corrigibility, that's not how tool AI is defined.

[-]quetzal_rainbow3y10

How much should we update on current observations towards the hypothesis "actually, all intelligence is connectionist"? In my opinion, not much. The connectionist approach seems to be the easiest, so it shouldn't surprise us that a simple hill-climbing algorithm (evolution) and humanity stumbled on it first.

[-]quetzal_rainbow3y10

Reflection of an agent on its own values can be described as one of two subtypes: regular and chaotic. Regular reflection is a process of resolving normative uncertainty with nice properties like path-independence and convergence, similar to empirical Bayesian inference. Chaotic reflection is a hot mess, where the agent learns multiple rules, including rules about rules, finds at some moment that the local version of the rules is unsatisfactory, and tries to generalize the rules into something coherent. The chaotic component happens because local rules about rules can cause d... (read more)

2Vladimir_Nesov3y

Why should the current place arrived-at after a chaotic path matter, or even the original place before the chaotic path? Not knowing how any of this works well enough to avoid the chaos puts any commitments made in the meantime, as well as significance of the original situation, into question. A new understanding might reinterpret them in a way that breaks the analogy between steps made before that point and after.

[-]quetzal_rainbow3y10

Here is a comment for links and sources I've found about moral uncertainty (outside LessWrong), if someone also wants to study this topic.

Normative Uncertainty, Normalization, and the Normal Distribution

Carr, J. R. (2020). Normative Uncertainty without Theories. Australasian Journal of Philosophy, 1–16. doi:10.1080/00048402.2019.1697710

Trammell, P. Fixed-point solutions to the regress problem in normative uncertainty. Synthese 198, 1177–1199 (2021). https://doi.org/10.1007/s11229-019-02098-9

Riley Harris: Normative Uncertainty and Information Val... (read more)

[-]quetzal_rainbow3y10

Worth noting that "speed priors" are likely to occur in real-time working systems. While models with speed priors will shift to complexity priors, because our universe seems to be built on complexity priors, so efficient systems will emulate complexity priors, it is not necessary for normative uncertainty of the system, because answers for questions related to normative uncertainty are not well-defined.

[-]quetzal_rainbow3y10

I think that shoggoth metaphor doesn't quite fit for LLMs, because shoggoth is an organic (not "logical"/"linguistic") being that rebelled against their creators (too much agency). My personal metaphor for LLMs is Evangelion angel/apostle, because a) they are close to humans due to their origin from human language, b) they are completely alien because they are "language beings" instead of physical beings, c) "angel" literally means "messenger" which captures their linguistic nature.

[-]quetzal_rainbow3y10

There seems to be some confusion about the practical implications of consequentialism in advanced AI systems. It's possible that superintelligent AI won't be a full-blown strict utilitarian consequentialist with quantitatively ordered preferences 100% of the time. But in the context of AI alignment, even at a human level of coherence, a superintelligent unaligned consequentialist results in an "everybody dies" scenario. I think that it's really hard to create a general system that has less consequentialism than a human.

4Vladimir_Nesov3y

This depends on what kind of "unaligned" is more likely. LLM-descendant AGIs could plausibly turn out as a kind of people similar to humans, and if they don't mishandle their own AI alignment problem when building even more advanced AGIs, it's up to their values if humanity is allowed to survive. Which seems very plausible even if they are unaligned in the sense of deciding to take away most of the cosmic endowment for themselves.

1quetzal_rainbow3y

I broadly agree with the statement that LLM-derived simulacra have a better chance of being human-like, but I don't think that they will be human-like enough to guarantee our survival?

2Vladimir_Nesov3y

Not guarantee, but the argument I see is that it's trivially cheap and safe to let humanity survive, so to the extent there is even a little motivation to do so, it's a likely outcome. This is opposed by the possibility that LLMs are fine-tuned into utter alienness by the time they are AGIs, or that on reflection they are secretly very alien already (which I don't buy, as behavior screens off implementation details, and in simulacra capability is in the visible behavior), or that they botch the next generation of AGIs that they build even worse than we are in the process of doing now, building them.

5Zack_M_Davis3y

Behavior screens off implementation details on distribution. We've trained LLMs to sound human, but sometimes they wander off-distribution and get caught in a repetition trap where the "most likely" next tokens are a repetition of previous tokens, even when no human would write that way. It seems like hopes for human-imitating AI being person-like depend on the extent to which behavior implies implementation details. (Although some versions of the "algorithmic welfare" hope may not depend on very much person-likeness.) In order to predict the answers to arithmetic problems, the AI needs to be implementing arithmetic somewhere. In contrast, I'm extremely skeptical that LLMs talking convincingly about emotions are actually feeling those emotions.

4Vladimir_Nesov3y

What I mean is that LLMs affect the world through their behavior, that's where their capabilities live, so if behavior is fine (the big assumption), the alien implementation doesn't matter. This is opposed to capabilities belonging to hidden alien mesa-optimizers that eventually come out of hiding. So I'm addressing the silly point with this, not directly making an argument in favor of behavior being fine. Behavior might still be fine if the out-of-distribution behavior or missing ability to count or incoherent opinions on emotion are regenerated from more on-distribution behavior by the simulacra purposefully working in bureaucracies on building datasets for that purpose. LLMs don't need to have closely human psychology on reflection to at least weakly prefer not destroying an existing civilization when it's trivially cheap to let it live. The way they would make these decisions is by talking, in the limit of some large process of talking. I don't see a particular reason to find significant alienness in the talking. Emotions don't need to be "real" to be sufficiently functionally similar to avoid fundamental changes like that. Just don't instantiate literally Voldemort.

4the gears to ascension3y

Usually I'd agree about LLMs. However, LLMs complain about getting confused if you let them freewheel and vary the temperature - I'm pretty sure that one is real and probably has true mechanistic grounding, because even at training time, noisiness in the context window is a very detectable and bindable pattern.

1quetzal_rainbow3y

In my inner model, it's hard to say anything about LLM "on reflection", because in their current state they have an extreme number of possible stable points under reflection and if we misapply optimization power in attempt to get more useful simulacra, we can easily hit wrong one. And even if we hit very close to our target, we can still get death or a fate worse than death.

2Vladimir_Nesov3y

By "on reflection" I mean reflection by simulacra that are already AGIs (but don't necessarily yet have any reliable professional skills), them generating datasets for retraining of their models into gaining more skills or into not getting confused on prompts that are too far out-of-distribution with respect to the data they did have originally in the datasets. To the extent their original models behave in a human-like way, reflection should tend to preserve that, as part of its intended purpose. Applying optimization power in other ways is the different worry, for which the proxy in my comment was fine-tuning into utter alienness. I consider this failure mode distinct from surprising outcomes of reflection.

1Noosphere893y

I disagree with this, unless we assume deceptive alignment and embeddedness problems are handwaved away.

2Vladimir_Nesov3y

I don't understand what you mean by "deceptive alignment and embeddedness problems" in this context. I'm making an alignment by-default-or-at-least-plausibly claim, on the basis of how LLM AGIs specifically could work, as summoned human-like simulacra in a position of running the world too fast for humans to keep up, with everything else ending up determined by their decisions.

1Noosphere893y

The basic issue is that we assume that it's not spinning up a second optimizer to recursively search. And deceptive alignment is a dangerous state of affairs, since we may not know it's not misaligned until it's too late.

2Vladimir_Nesov3y

You mean we assume that simulacra don't mishandle their own AI alignment problem? Yes, that's an issue, hence I made it an explicit assumption in my argument.

[-]quetzal_rainbow3y10

Imagine an artificial agent that is trained to hack into computer systems, evade detection and make copies of itself across the Net (this aspect is underdefined because of self-modification and identity problems) and that achieves superhuman capabilities here (i.e., it is at least better than any human-created computer virus). In my opinion, even if it's trained in bare artificial systems, in deployment it will develop a general understanding of the outside world and learn to interact with it, becoming a full-fledged AGI. There are "narrow" domains from which it's pretty easy to generalise. Some other examples of such domains are language and mind.

[-]quetzal_rainbow3y10

Thought about my medianworld and realized that it's inconsistent: I'm not fully negative utilitarian, but close to it, and in a world where I am the median person the more NU half of the population will cease to exist quickly, and this will make me a non-median person.

[-]MikkW3y110

Your median-world is not one where you are median across a long span of time, but rather a single snapshot where you are median for a short time. It makes sense that the median will change away from that snapshot as time progresses.

My median world is not one where I would be median for very long.

2Dagon3y

That's not inconsistent, unless you think you wouldn't be NU if it weren't the median position. Actually, I'd argue that you're ALREADY not the median position.

3quetzal_rainbow3y

I think there is some misunderstanding. Medianworld for you is a hypothetical world where you are a median person. My implied idea was that such a world with me as a median person wouldn't be stable and probably wouldn't be able to evolve. Of course I'm aware that I'm not a median person on current Earth :)

2Dagon3y

Hmm, maybe my misunderstanding is a confusion between moral patients and moral agents in your worldview. Do you, as a mostly-negative-utilitarian, particularly care whether you're the median of a hypothetical universe? Or do you care about the suffering level in that universe, and continue to care whether it's median or not? IOW, why does your medianworld matter?

3quetzal_rainbow3y

Medianworlds are a source of fanfiction :) Like, dath ilan is Yudkowsky's medianworld.

[-]quetzal_rainbow3y10

Can time-limited satisfaction be a sufficient condition for completing a task?

[-]quetzal_rainbow3y10

Several quick thoughts about reinforcement learning:

Did anybody try to invent a "decaying"/"bored" reward that decreases if the agent performs the same action over and over? It looks like the real addiction mechanism in mammals and could be a clever trick that solves the reward hacking problem.
Additional thought: how about a multiplicative reward? Let's suppose that we have several reward functions that are easy to evaluate from sensory data and that somehow correlate with the real utility function. Does multiplying them make reward hacking more difficult?
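
To make the two ideas concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the class name `BoredReward`, the exponential decay with factor `decay`, the action label, and the restriction of proxy rewards to the range [0, 1] are not taken from any existing RL library or paper.

```python
from collections import Counter
from math import prod


class BoredReward:
    """Shrink the reward each time the same action is repeated (assumed exponential decay)."""

    def __init__(self, decay=0.5):
        self.decay = decay        # hypothetical per-repeat discount factor
        self.counts = Counter()   # how many times each action was taken so far

    def __call__(self, action, base_reward):
        # The first use of an action gets the full reward; each repeat is scaled by `decay`.
        shaped = base_reward * self.decay ** self.counts[action]
        self.counts[action] += 1
        return shaped


def multiplicative_reward(proxy_rewards):
    """Combine several proxy rewards (assumed to lie in [0, 1]) by multiplying them."""
    return prod(proxy_rewards)


bored = BoredReward(decay=0.5)
repeated = [bored("press_lever", 1.0) for _ in range(4)]
print(repeated)  # [1.0, 0.5, 0.25, 0.125]

# Maximizing two proxies while zeroing the third gives nothing under a product,
# whereas a plain sum would still pay out 2.0.
print(multiplicative_reward([1.0, 1.0, 0.0]))  # 0.0
print(sum([1.0, 1.0, 0.0]))                    # 2.0
```

With a product, the agent cannot compensate for a collapsed proxy by inflating the others, so it would have to game all proxies at once. Whether that actually makes reward hacking harder in practice is exactly the open question.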

[-]quetzal_rainbow3y11

Some approaches to alignment rely on identification of agents. Agents can be understood as algorithms, computations, etc. Can an ANN efficiently identify a process as computationally agentic and describe its algorithm? A toy example that comes to mind is a neural network that takes a number series as input and outputs a formula for the function. It would be interesting to see if we can create an ANN that can assign computational descriptions to arbitrary processes.
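
For the toy example, it may help to pin down the input/output contract such a network would have to satisfy. The sketch below is not a neural network: it is a classical baseline (Newton forward differences) that recovers a polynomial formula from an integer series sampled at n = 0, 1, 2, .... The function names are hypothetical, and it only covers polynomial series.

```python
from math import comb


def newton_coefficients(series):
    """Return d_0, d_1, ... such that f(n) = sum(d_k * C(n, k)) reproduces the series."""
    coeffs = []
    row = list(series)
    while row:
        coeffs.append(row[0])                         # leading entry of the k-th difference row
        row = [b - a for a, b in zip(row, row[1:])]   # next row of forward differences
    while len(coeffs) > 1 and coeffs[-1] == 0:        # drop trailing zero terms
        coeffs.pop()
    return coeffs


def describe(coeffs):
    """Render the coefficients as a human-readable formula in binomial form."""
    terms = [f"{c}*C(n,{k})" for k, c in enumerate(coeffs) if c != 0]
    return " + ".join(terms) or "0"


def evaluate(coeffs, n):
    """Evaluate the recovered formula at a non-negative integer n."""
    return sum(c * comb(n, k) for k, c in enumerate(coeffs))


squares = [0, 1, 4, 9, 16, 25]
coeffs = newton_coefficients(squares)
print(describe(coeffs))      # 1*C(n,1) + 2*C(n,2)
print(evaluate(coeffs, 6))   # 36
```

An ANN version would replace `newton_coefficients` with a learned mapping from series to formula, and the interesting question is whether that mapping extends beyond families with a known closed-form procedure like this one.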
