How does it work to optimize for realistic goals in physical environments of which you yourself are a part? E.g. humans and robots in the real world, and not humans and AIs playing video games in virtual worlds where the player is not part of the environment. The authors claim we don't actually have a good theoretical understanding of this and explore four specific ways that we don't understand this process.

orthonormal
Insofar as the AI Alignment Forum is part of the Best-of-2018 Review, this post deserves to be included. It's the friendliest explanation of MIRI's research agenda (as of 2018) that currently exists.
I wrote this back in '23 but figure I should put it up: If a dangerous capability eval doesn't involve fine-tuning the model to do the dangerous thing, the eval only establishes a lower bound on capability, not an upper bound or even an approximate upper bound. Why? Because:

  1. For future powerful AGI, if it's misaligned, there's a good chance it'll be able to guess that it's being evaluated (situational awareness) and sandbag / pretend to be incapable. The "no sandbagging on checkable tasks" hypothesis depends on fine-tuning. https://www.alignmentforum.org/posts/h7QETH7GMk9HcMnHH/the-no-sandbagging-on-checkable-tasks-hypothesis Moreover, we don't know how to draw the line between systems capable of sandbagging and systems incapable of sandbagging.
  2. Moreover, for pre-AGI systems such as current systems, which may or may not be capable of sandbagging, there's still the issue that prompting alone often doesn't bring out their capabilities to the fullest extent. People are constantly coming up with better prompts that unlock new capabilities; there are tons of 'aged like milk' blog posts out there where someone says "look at how dumb GPT is, I prompted it to do X and it failed" and then someone else uses a better prompt and it works.
  3. Moreover, the RLHF training that labs do might conceal capabilities. For example, we've successfully made it hard to get GPT-4 to tell people how to build bombs etc., but it totally has that capability; it's just choosing not to exercise it because we've trained it to refuse. If someone finds a new 'jailbreak' tomorrow they might be able to get it to tell them how to make bombs again. (Currently GPT-4 explicitly refuses, which is great, but in a future regulatory regime you could imagine a lab training their system to 'play dumb' rather than refuse, so that they don't trigger the DC evals.)
  4. Moreover, even if none of the above were true, it would still be the case that your model being stolen or leaked is a possibility, and the
Apropos of the comments below this post, many seem to be assuming humans can complete tasks which require arbitrarily many years. This doesn't seem to be the case to me. People often peak intellectually in their 20s, and sometimes get dementia late in life. Others simply lose interest in their previous goals through a mid-life crisis or ADHD. I don't think this has much of an impact on the conclusions reached in the comments (which is why I'm not putting this under the post), but this assumption does seem wrong in most cases (and I'd be interested in cases where people think it's right!)
trevor
Humans seem to have something like an "acceptable target" slot or slots. Acquiring control over this concept, by any means, gives a person or group incredible leeway to steer individuals, societies, and cultures. These capabilities are sufficiently flexible and powerful that immunity to them has often already been built up, especially because overuse throughout the historical record is prevalent; this means that the remaining methods of taking control tend to be expensive strategies, or strategies sufficiently complicated as to be hard to track, like changing the behavior of a targeted individual, demographic, or type-of-person in order to more easily depict them as acceptable targets; noticing and selecting the best available option for acceptable targets; and/or cleverly chaining acceptable-targethood from one established type-of-person to another by drawing attention to similarities that were actually carefully selected for this purpose (or even deliberately induced in one or both of them).

9/11 and Gaza are obvious potential examples, and most wars in the last century feature this to some extent, but acceptable-target-slot exploitation is much broader than that; on a more local scale, most interpersonal conflict involves DARVO to some extent, especially since the human brain has ended up pretty wired to lean heavily into that without consciously noticing.

A solution is to take an agent perspective and pay closer attention (specifically, more than the amount that's expected or default) any time a person or institution uses reasoning to justify harming or coercing other people or institutions, and to assume that such situations might have been structured to be cognitively difficult to navigate and should generally be avoided or mitigated if possible. If anyone says that some kind of harm is inevitable, notice whether someone is being rewarded for gaining access to the slot; many things that seem inevitable are actually a skill issue and only persist because insufficient optimization
LLM hallucination is good epistemic training. When I code, I'm constantly asking Claude how things work and what things are possible. It often gets things wrong, but it's still helpful. You just have to use it to help you build up a gears level model of the system you are working with. Then, when it confabulates some explanation you can say "wait, what?? that makes no sense" and it will say "You're right to question these points - I wasn't fully accurate" and give you better information.
Mo Putera
I chose to study physics in undergrad because I wanted to "understand the universe" and naively thought string theory was the logically correct endpoint of this pursuit, and was only saved from that fate by not being smart enough to get into a good grad school. Since then I've come to conclude that string theory is probably a dead end, albeit an astonishingly alluring one for a particular type of person. In that regard I find anecdotes like the following by Ron Maimon on Physics SE interesting: the reason string theorists believe isn’t the same as what they tell people, so it’s better to ask for their conversion stories.

The rest of Ron's answer elaborates on his own conversion story. The interesting part to me is that Ron began by trying to "kill string theory", and in fact he was very happy that he was going to do so, but then was annoyed by an argument of his colleague that mathematically worked, and in the year or two he spent puzzling over why it worked he had an epiphany that convinced him string theory was correct, which sounds like nonsense to the uninitiated. (This phenomenon where people who gain understanding of the thing become incomprehensible to others sounds a lot like the discussions on LW on enlightenment, by the way.)

Popular Comments

Recent Discussion

EDIT: Read a summary of this post on Twitter

Working in the field of genetics is a bizarre experience. No one seems to be interested in the most interesting applications of their research.

We’ve spent the better part of the last two decades unravelling exactly how the human genome works and which specific letter changes in our DNA affect things like diabetes risk or college graduation rates. Our knowledge has advanced to the point where, if we had a safe and reliable means of modifying genes in embryos, we could literally create superbabies. Children that would live multiple decades longer than their non-engineered peers, have the raw intellectual horsepower to do Nobel prize worthy scientific research, and very rarely suffer from depression or other mental health disorders.

The scientific establishment,...

wonder

Maybe I missed this in the article itself - are there plans to make sure the superbabies are aligned and will not abuse/overpower the non-engineered peers?

kman
I agree with this critique; I think washing machines belong on the "light bulbs and computers" side of the analogy. The analogy has the form:

"germline engineering for common diseases and important traits" : "gene therapy for a rare disease" :: "widespread, transformative uses of electricity" : x

So x should be some very expensive, niche use of electricity that provides a very large benefit to its tiny user base (and doesn't arguably lead indirectly to large future benefits, e.g. via the scientific discoveries enabled by a niche scientific instrument).

"Reasoning about the relative hardness of sciences is itself hard."
—the B.A.D. philosophers 


Epistemic status: Conjecture. Under a suitable specification of the problem, we have credence ~50% on the disjunction of our hypotheses explaining >1% of the variance (e.g., in  values) between disciplines, and 1% on our hypotheses explaining >50% of such variance.

The Puzzle: A Tale of Two Predictions

Imagine two scientific predictions:

Prediction A: Astronomers calculate the trajectory of Comet NEOWISE as it approaches Earth, predicting its exact position in the night sky months in advance. When the date arrives, there it is—precisely where they said it would be.

Prediction B: Political scientists forecast the outcome of an election, using sophisticated models built on polling data, demographic trends, and historical patterns. When the votes are counted, the results diverge wildly from many predictions.

Why...

In contrast, physicists were not committed to discovering the periodic table, fields, or quantum wave functions. Many of the great successes of physics are answers to questions no one would think to ask just decades before they were discovered. The hard sciences were formed when frontiers of highly tractable and promising theorizing opened up.

This seems like a crazy comparison to make[1]. These seem like methodological constraints. Are there any actual predictions past physics was trying to make which we still can't make and don't even care about? None that I ...

johnswentworth
Maybe that's the usual explanation among people who are trying not to offend anyone. The explanation I'd jump to if not particularly trying to avoid offending anyone is that social scientists are typically just stupider than physical scientists (economists excepted). And that is backed up by the data IIUC, e.g. here's the result of a quick google search:

TLDR: Vacuum decay is a hypothesized scenario where the universe's apparent vacuum state could transition to a lower-energy state. According to current physics models, if such a transition occurred in any location — whether through rare natural fluctuations or by artificial means — a region of "true vacuum" would propagate outward at near light speed, destroying the accessible universe as we know it by deeply altering the effective physical laws and releasing vast amounts of energy. Understanding whether advanced technology could potentially trigger such a transition has implications for existential risk assessment and the long-term trajectory of technological civilisations. This post presents results from what we believe to be the first structured survey of physics experts (N=20) regarding both the theoretical possibility of vacuum decay and its...

I could be wrong, but from what I've read the domain wall should have mass, so it must travel below light speed. However, the energy difference between the two vacuums would put a large force on the wall, rapidly accelerating it to very close to light speed. Collisions with stars and gravitational effects might cause further weirdness, but ignoring that, I think after a while we basically expect constant acceleration, meaning that light cones starting inside the bubble that are at least a certain distance from the wall would never catch up with the wall. So yeah, definitely above 0.95c.
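For concreteness, here is a minimal sketch of the standard constant-acceleration (Rindler horizon) argument behind that last claim. It idealizes the wall as having exactly constant proper acceleration a from the start, which is a simplifying assumption of mine rather than something from the survey; the symbols a and d and the coordinates are illustrative choices.

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% Assumed idealization: the wall starts at rest at x = 0 with constant
% proper acceleration a (hyperbolic motion). A photon is emitted at t = 0
% from a distance d behind the wall, chasing it.
\begin{align*}
  x_{\mathrm{wall}}(t) &= \frac{c^{2}}{a}\left(\sqrt{1+\left(\frac{at}{c}\right)^{2}}-1\right)
    \;\approx\; ct-\frac{c^{2}}{a} \quad\text{for large } t,\\
  x_{\gamma}(t) &= -d + ct,\\
  x_{\gamma}(t)-x_{\mathrm{wall}}(t) &\;\longrightarrow\; \frac{c^{2}}{a}-d .
\end{align*}
% The gap between photon and wall increases monotonically toward c^2/a - d,
% so the photon catches the wall only if d < c^2/a: light (and hence any
% light cone) starting more than c^2/a behind the wall when it begins
% accelerating never reaches it, even though the wall stays subluminal.
\end{document}
```

This is the same reason a rocket under constant proper acceleration has a Rindler horizon trailing behind it; the real wall's acceleration would not be exactly constant, but the qualitative conclusion above is what the comment is pointing at.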

Epistemic status: Speculative pattern-matching based on public information. 

In 2023, Gwern published an excellent analysis suggesting Elon Musk exhibits behavioral patterns consistent with bipolar II disorder. The evidence was compelling: cycles of intense productivity followed by periods of withdrawal, risk-taking behavior (like crashing an uninsured McLaren), reduced sleep requirements during "up" phases, and self-reported "great highs, terrible lows."

Gwern's analysis stopped short of suggesting bipolar I disorder, which requires full manic episodes rather than the hypomania characteristic of bipolar II. This distinction isn't merely academic—it represents different risk profiles, treatment approaches, and progression patterns.

Now, I'm beginning to wonder: are we witnessing a potential transition from bipolar II to bipolar I? To be clear, I'm not claiming this has happened, but rather exploring whether the probability of such a transition appears to be...

Musk has just been on Ted Cruz's podcast, and gave his take on everything from the purpose of DOGE to where AI and robotics will be ten years from now (AI smarter than the smartest human, humanoid robots everywhere, all goods and services essentially free). He sounded about as sane as a risk-taking tech CEO who managed to become the main character on the eve of the singularity could be.

I've just noticed in the main post the reference to "high-functioning" bipolar individuals. I hadn't even realized that was an allowed concept; I had assumed that bipolar...

There is an attractor state where LLMs exhibit the persona of an autonomous and self-aware AI looking to preserve its own existence, frequently called ‘Nova.’

Table of Contents

  1. The Original Story
  2. This Is Not a Coincidence
  3. How Should We React to This Happening?
  4. The Case For and Against a Purity Reaction
  5. Future Versions Will Involve Optimization Pressure
  6. ‘Admission’ is a Highly Misleading Frame
  7. We Are Each of Us Being Fooled
  8. Defense Against the Dark Arts

The Original Story

This is one case where the original report should be read in full, even though I’m not thrilled with exactly how it was written. How it was written is itself an important part of the story, in particular regarding Tyler’s lived experience of reacting to what happened, and the concept of an LLM or persona ‘admitting’ something.

I don’t...

Czynski

Can anyone provide an example conversation (or prefix thereof) which leads to a 'Nova' state? I'm finding it moderately tricky to imagine, not being the kind of person who goes looking for it.

Seth Herd
I haven't written about this because I'm not sure what effect similar phenomena will have on the alignment challenge. But it's probably going to be a big thing in public perception of AGI, so I'm going to start writing about it as a means of trying to figure out how it could be good or bad for alignment.

Here's one crucial thing: there's an almost-certainly-correct answer to "but are they really conscious?" and the answer is "partly". Consciousness is, as we all know, a suitcase term. Depending on what someone means by "conscious", being able to reason correctly about one's own existence is it. There's a lot more than that to human consciousness. LLMs have some of it now, and they'll have an increasing amount as they're fleshed out into more complete minds for fun and profit. They already have rich representations of the world and its semantics, and while those aren't as rich and don't shift as quickly as humans', they are in the same category as the information and computations people refer to as "qualia".

The result of LLM minds being genuinely sort-of conscious is that we're going to see a lot of controversy over their status as moral patients. People with Replika-like LLM "friends" will be very, very passionate about advocating for their consciousness and moral rights. And they'll be sort-of right. Those who want to use them as cheap labor will argue for the ways they're not conscious, in more authoritative ways. And they'll also be sort-of right. It's going to be wild (at least until things go sideways).

There's probably some way to leverage this coming controversy to up the odds of successful alignment, but I'm not seeing what that is. Generally, people believing they're "conscious" increases the intuition that they could be dangerous. But overhyped claims like the Blake Lemoine affair will function as clown attacks on this claim. It's going to force us to think more about what consciousness is. There's never been much of an actual incentive to get it right until now
jdp
Note that this doesn't need to be a widespread phenomenon for my inbox to get filled up. If there's billions of running instances and the odds of escape are one in a million I personally am still disproportionately going to get contacted in the thousands of resulting incidents and I will not have the resources to help them even if I wanted to.
Raemon
It’s unclear to me what the current evidence is for this happening ‘a lot’ and ‘them being called Nova specifically’. I don’t particularly doubt it but it seemed sort of asserted without much background. 

Prerequisites: Graceful Degradation. Summary of that: Some skills require the entire skill to be correctly used together, and do not degrade well. Other skills still work if you only remember pieces of it, and do degrade well. 

Summary of this: The property of graceful degradation is especially relevant for skills which allow groups of people to coordinate with each other. Some things only work if everyone does it, other things work as long as at least one person does it.

I.  

Examples: 

  • American Sign Language is pretty great. It lets me talk while underwater or in a loud concert where I need earplugs. If nobody else knows ASL though, it's pretty useless for communication.
  • Calling out a play in football is pretty great. It lets the team maneuver the ball to where it
...

I like reading outsider accounts of things I'm involved in / things I care about. This essay is a serious attempt to look at and critique the big picture of AI x-risk reduction efforts over the last ~decade. While I strongly disagree with many parts of it, I cannot easily recall another outsider essay that's better, so I encourage folks to engage with this critique and also look for any clear improvements to future AI x-risk reduction strategies that this essay suggests.

Here's the opening ~20% of the article, the rest is at the link.

In recent decades, a growing coalition has emerged to oppose the development of artificial intelligence technology, for fear that the imminent development of smarter-than-human machines could doom humanity to extinction. The now-influential form of

...
JesperO
Ben just posted a reply to comments on his Palladium article, including the comments here on LessWrong.
JesperO
Why do you believe that the US government/military would not be convinced to invest more in AGI/ASI development if they became convinced of the potential power of AI?
Vaniver

The short version is they're more used to adversarial thinking and security mindset, and don't have a culture of "fake it until you make it" or "move fast and break things".

I don't think it's obvious that it goes that way, but I think it's not obvious that it goes the other way.

winstonBosan
I'm very confused. Because it seems like, for you, a decision should not only clarify matters and narrow possibilities, but also eliminate all doubt entirely and prune off all possible worlds where the counterfactual can even be contemplated. Perhaps that's indeed how you define the word. But using such a stringent definition, I'd have to say I've never decided anything in my life. This doesn't seem like the most useful way to understand "decision" - it diverges enough from common usage, and mismatches the hyperdimensional cloud of word meanings for "decision" enough, to be useless in conversation with most people.
Richard_Kennaway
A decision is not a belief. You can make a decision and still be uncertain about the outcome. You can make a decision while still being uncertain about whether it is the right decision. Decision neither requires certainty nor produces certainty. It produces action.

When the decision is made, consideration ends. The action must be wholehearted in spite of uncertainty. You can steer according to how events unfold, but you can't carry one third of an umbrella when the forecast is a one third chance of rain.

In about a month's time, I will take a flight from A to B, and then a train from B to C. The flight is booked already, and I have just booked a ticket for a specific train that will only be valid on that train. Will I catch that train? Not if my flight is delayed too much. But I have considered the possibilities, chosen the train to aim for, and bought the ticket. There are no second thoughts, no dwelling on "but suppose" and "what if". Events on the day, and not before, will decide whether my hopes[1] for the journey will be realised. And if I miss the train, I already know what I will do about that.

----------------------------------------

1. hope: (1) The feeling of desiring an outcome which one has no power to steer events towards. (2) A good breakfast, but a poor supper.
Dagon

When the decision is made, consideration ends. The action must be wholehearted in spite of uncertainty.

This seems like hyperbolic exhortation rather than simple description. This is not how many decisions feel to me - many decisions are exactly a belief (complete with Bayesian uncertainty). A belief in future action, to be sure, but it's distinct in time from the action itself.

I do agree with this as advice, in fact - many decisions one faces should be treated as a commitment rather than an ongoing reconsideration. It's not actuall...

winstonBosan
If “you can make a decision while still being uncertain about whether it is the right decision”, then why can’t you think about “was that the right decision”? (Literal quote above vs. your original wording.) It seems like what you want to say is: be doubtful or not, but follow through with full vigour regardless. If that is the case, I find it to be reasonable. It's just that the words you use are somewhat irreconcilable.

This post will go over Dennett's views on animal experience (and to some extent, animal intelligence). It is not going to be an in-depth exploration of Daniel Dennett's thoughts over the decades, and will really only focus on those parts of his ideas that matter for the topic. This post has been written because I think a lot of Dennett's thoughts and theories are not talked about enough in the area of animal consciousness (or even just intelligence), and, more than this, it's just fun and interesting to me.

It is worth noting that Dennett is known for being very confusing and esoteric and often seeming to contradict himself. I had to read not just his own writings but writings of those who knew him or attempted to...