How does it work to optimize for realistic goals in physical environments of which you yourself are a part? E.g. humans and robots in the real world, and not humans and AIs playing video games in virtual worlds where the player is not part of the environment. The authors claim we don't actually have a good theoretical understanding of this, and explore four specific ways in which we don't understand this process.

orthonormal
Insofar as the AI Alignment Forum is part of the Best-of-2018 Review, this post deserves to be included. It's the friendliest explanation of MIRI's research agenda (as of 2018) that currently exists.
I wrote this back in '23 but figure I should put it up: If a dangerous capability eval doesn't involve fine-tuning the model to do the dangerous thing, the eval only establishes a lower bound on capability, not an upper bound or even an approximate upper bound. Why? Because:

1. For future powerful AGI, if it's misaligned, there's a good chance it'll be able to guess that it's being evaluated (situational awareness) and sandbag / pretend to be incapable. The "no sandbagging on checkable tasks" hypothesis depends on fine-tuning. https://www.alignmentforum.org/posts/h7QETH7GMk9HcMnHH/the-no-sandbagging-on-checkable-tasks-hypothesis Moreover, we don't know how to draw the line between systems capable of sandbagging and systems incapable of sandbagging.
2. Moreover, for pre-AGI systems such as current systems, which may or may not be capable of sandbagging, there's still the issue that prompting alone often doesn't bring out their capabilities to the fullest extent. People are constantly coming up with better prompts that unlock new capabilities; there are tons of 'aged like milk' blog posts out there where someone says "look at how dumb GPT is, I prompted it to do X and it failed" and then someone else uses a better prompt and it works.
3. Moreover, the RLHF training that labs do might conceal capabilities. For example, we've successfully made it hard to get GPT-4 to tell people how to build bombs etc., but it totally has that capability; it's just choosing not to exercise it because we've trained it to refuse. If someone finds a new 'jailbreak' tomorrow they might be able to get it to tell them how to make bombs again. (Currently GPT-4 explicitly refuses, which is great, but in a future regulatory regime you could imagine a lab training their system to 'play dumb' rather than refuse, so that they don't trigger the DC evals.)
4. Moreover, even if none of the above were true, it would still be the case that your model being stolen or leaked is a possibility, and the
Apropos of the comments below this post, many seem to be assuming humans can complete tasks which require arbitrarily many years. This doesn't seem to be the case to me. People often peak intellectually in their 20's, and sometimes get dementia late in life. Others just lose interest in their previous goals through a mid-life crisis or ADHD. I don't think this has much of an impact on the conclusions reached in the comments (which is why I'm not putting this under the post), but this assumption does seem wrong in most cases (and I'd be interested in cases where people think it's right!)
trevor
Humans seem to have something like an "acceptable target slot" or slots. Acquiring control over this concept, by any means, gives a person or group incredible leeway to steer individuals, societies, and cultures. These capabilities are sufficiently flexible and powerful that immunity to them has often already been built up, especially because overuse throughout the historical record is prevalent; this means that methods of taking control include expensive strategies, or strategies that are sufficiently complicated as to be hard to track, like changing the behavior of a targeted individual or demographic or type-of-person in order to more easily depict them as acceptable targets, noticing and selecting the best option for acceptable targets, and/or cleverly chaining acceptable targethood from one established type-of-person to another by drawing attention to similarities that were actually carefully selected for this (or even deliberately induced in one or both of them).

9/11 and Gaza are obvious potential-examples, and most wars in the last century feature this to some extent, but acceptable-target-slot-exploitation is much broader than that; on a more local scale, most interpersonal conflict involves DARVO to some extent, especially when the human brain's ended up pretty wired to lean heavily into that without consciously noticing.

A solution is to take an agent perspective, and pay closer attention (specifically, more than the amount that's expected or default) any time a person or institution uses reasoning to justify harming or coercing other people or institutions, and to assume that such situations might have been structured to be cognitively difficult to navigate and should generally be avoided or mitigated if possible. If anyone says that some kind of harm is inevitable, notice if someone is being rewarded for gaining access to the slot; many things that seem inevitable are actually a skill issue and only persist because insufficient optimization
LLM hallucination is good epistemic training. When I code, I'm constantly asking Claude how things work and what things are possible. It often gets things wrong, but it's still helpful. You just have to use it to help you build up a gears level model of the system you are working with. Then, when it confabulates some explanation you can say "wait, what?? that makes no sense" and it will say "You're right to question these points - I wasn't fully accurate" and give you better information.
If, on making a decision, your next thought is “Was that the right decision?” then you did not make a decision. If, on making a decision, your next thought is to suppress the thought “Was that the right decision?” then you still did not make a decision. If you are swayed by someone else asking “Was that the right decision?” then you did not make a decision. If you are swayed by someone repeating arguments you already heard from them, you did not make a decision. Not making that decision may be the right thing to do. Wavering suggests that you still have some doubt about the matter that you may not yet be able to articulate. Ferret that out, and then make the decision. Decision screens off thought from action. When you really make a decision, that is the end of the matter, and the actions to carry it out flow inexorably.

Popular Comments

Recent Discussion

Epistemic status: Speculative pattern-matching based on public information. 

In 2023, Gwern published an excellent analysis suggesting Elon Musk exhibits behavioral patterns consistent with bipolar II disorder. The evidence was compelling: cycles of intense productivity followed by periods of withdrawal, risk-taking behavior (like crashing an uninsured McLaren), reduced sleep requirements during "up" phases, and self-reported "great highs, terrible lows."

Gwern's analysis stopped short of suggesting bipolar I disorder, which requires full manic episodes rather than the hypomania characteristic of bipolar II. This distinction isn't merely academic—it represents different risk profiles, treatment approaches, and progression patterns.

Now, I'm beginning to wonder: are we witnessing a potential transition from bipolar II to bipolar I? To be clear, I'm not claiming this has happened, but rather exploring whether the probability of such a transition appears to be...

Musk has just been on Ted Cruz's podcast, and gave his take on everything from the purpose of DOGE to where AI and robotics will be ten years from now (AI smarter than the smartest human, humanoid robots everywhere, all goods and services essentially free). He sounded about as sane as a risk-taking tech CEO who managed to become the main character on the eve of the singularity could be. 

I've just noticed, in the main post, the reference to "high-functioning" bipolar individuals. I hadn't even realized that is an allowed concept; I had assumed that bipolar... (read more)

There is an attractor state where LLMs exhibit the persona of an autonomous and self-aware AI looking to preserve its own existence, frequently called ‘Nova.’

Table of Contents

  1. The Original Story.
  2. This Is Not a Coincidence.
  3. How Should We React to This Happening?
  4. The Case For and Against a Purity Reaction.
  5. Future Versions Will Involve Optimization Pressure.
  6. ‘Admission’ is a Highly Misleading Frame.
  7. We Are Each of Us Being Fooled.
  8. Defense Against the Dark Arts.

The Original Story

This is one case where the original report should be read in full even though I’m not thrilled with exactly how it was written. How it was written is itself an important part of the story, in particular regarding Tyler’s lived experience reacting to what happened, and the concept of an LLM or persona ‘admitting’ something.

I don’t...

Can anyone provide an example conversation (or prefix thereof) which leads to a 'Nova' state? I'm finding it moderately tricky to imagine, not being the kind of person who goes looking for it.

Seth Herd
I haven't written about this because I'm not sure what effect similar phenomena will have on the alignment challenge. But it's probably going to be a big thing in public perception of AGI, so I'm going to start writing about it as a means of trying to figure out how it could be good or bad for alignment.

Here's one crucial thing: there's an almost-certainly-correct answer to "but are they really conscious" and the answer is "partly". Consciousness is, as we all know, a suitcase term. Depending on what someone means by "conscious", being able to reason correctly about one's own existence is it. There's a lot more than that to human consciousness. LLMs have some of it now, and they'll have an increasing amount as they're fleshed out into more complete minds for fun and profit. They already have rich representations of the world and its semantics, and while those aren't as rich, and don't shift as quickly, as humans', they are in the same category as the information and computations people refer to as "qualia".

The result of LLM minds being genuinely sort-of conscious is that we're going to see a lot of controversy over their status as moral patients. People with Replika-like LLM "friends" will be very very passionate about advocating for their consciousness and moral rights. And they'll be sort-of right. Those who want to use them as cheap labor will argue for the ways they're not conscious, in more authoritative ways. And they'll also be sort-of right. It's going to be wild (at least until things go sideways).

There's probably some way to leverage this coming controversy to up the odds of successful alignment, but I'm not seeing what that is. Generally, people believing they're "conscious" increases the intuition that they could be dangerous. But overhyped claims like the Blake Lemoine affair will function as clown attacks on this claim. It's going to force us to think more about what consciousness is. There's never been much of an actual incentive to get it right to now
jdp
Note that this doesn't need to be a widespread phenomenon for my inbox to get filled up. If there are billions of running instances and the odds of escape are one in a million, I personally am still disproportionately going to get contacted in the thousands of resulting incidents, and I will not have the resources to help them even if I wanted to.
Raemon
It’s unclear to me what the current evidence is for this happening ‘a lot’ and ‘them being called Nova specifically’. I don’t particularly doubt it but it seemed sort of asserted without much background. 

Prerequisites: Graceful Degradation. Summary of that: Some skills require the entire skill to be correctly used together, and do not degrade well. Other skills still work if you only remember pieces of it, and do degrade well. 

Summary of this: The property of graceful degradation is especially relevant for skills which allow groups of people to coordinate with each other. Some things only work if everyone does it, other things work as long as at least one person does it.

I.  

Examples: 

  • American Sign Language is pretty great. It lets me talk while underwater or in a loud concert where I need earplugs. If nobody else knows ASL though, it's pretty useless for communication.
  • Calling out a play in football is pretty great. It lets the team maneuver the ball to where it
...

I like reading outsider accounts of things I'm involved in / things I care about. This essay is a serious attempt to look at and critique the big picture of AI x-risk reduction efforts over the last ~decade. While I strongly disagree with many parts of it, I cannot easily recall another outsider essay that's better, so I encourage folks to engage with this critique and also look for any clear improvements to future AI x-risk reduction strategies that this essay suggests.

Here's the opening ~20% of the article, the rest is at the link.

In recent decades, a growing coalition has emerged to oppose the development of artificial intelligence technology, for fear that the imminent development of smarter-than-human machines could doom humanity to extinction. The now-influential form of

...
JesperO
Ben just posted a reply to comments on his Palladium article, including the comments here on LessWrong.
JesperO
Why do you believe that the US government/military would not be persuaded to invest more in AGI/ASI development by being convinced of the potential power of AI?

The short version is they're more used to adversarial thinking and security mindset, and don't have a culture of "fake it until you make it" or "move fast and break things".

I don't think it's obvious that it goes that way, but I think it's not obvious that it goes the other way.

winstonBosan
I'm very confused. Because it seems like for you a decision should not only clarify matters and narrow possibilities, but also eliminate all doubt entirely and prune off all possible worlds where the counterfactual can even be contemplated. Perhaps that's indeed how you define the word. But using such a stringent definition, I'd have to say I've never decided anything in my life. This doesn't seem like the most useful way to understand "decision" - it diverges enough from common usage, and mismatches the hyperdimensional cloud of word meaning for "decision" enough, to be useless in conversation with most people. 
Richard_Kennaway
A decision is not a belief. You can make a decision and still be uncertain about the outcome. You can make a decision while still being uncertain about whether it is the right decision. Decision neither requires certainty nor produces certainty. It produces action. When the decision is made, consideration ends. The action must be wholehearted in spite of uncertainty. You can steer according to how events unfold, but you can't carry one third of an umbrella when the forecast is a one third chance of rain.

In about a month's time, I will take a flight from A to B, and then a train from B to C. The flight is booked already, and I have just booked a ticket for a specific train that will only be valid on that train. Will I catch that train? Not if my flight is delayed too much. But I have considered the possibilities, chosen the train to aim for, and bought the ticket. There are no second thoughts, no dwelling on "but suppose" and "what if". Events on the day, and not before, will decide whether my hopes[1] for the journey will be realised. And if I miss the train, I already know what I will do about that.

1. hope: (1) The feeling of desiring an outcome which one has no power to steer events towards. (2) A good breakfast, but a poor supper.
Dagon

When the decision is made, consideration ends. The action must be wholehearted in spite of uncertainty.

This seems like hyperbolic exhortation rather than simple description. This is not how many decisions feel to me - many decisions are exactly a belief (complete with Bayesian uncertainty). A belief in future action, to be sure, but it's distinct in time from the action itself.

I do agree with this as advice, in fact - many decisions one faces should be treated as a commitment rather than an ongoing reconsideration.  It's not actuall... (read more)

winstonBosan
If “you can make a decision while still being uncertain about whether it is the right decision”, then why can’t you think about “was that the right decision”? (Lit. quote above vs original wording.) It seems like what you want to say is: be doubtful or not, but follow through with full vigour regardless. If that is the case, I find it to be reasonable. Just that the words you use are somewhat irreconcilable. 

This post will go over Dennett's views on animal experience (and to some extent, animal intelligence). This is not going to be an in-depth exploration of Daniel Dennett's thoughts over the decades, and will really only focus on those parts of his ideas that matter for the topic. This post has been written because I think a lot of Dennett's thoughts and theories are not talked about enough in the area of animal consciousness (or even just intelligence), and more than this it's just fun and interesting to me.

It is worth noting that Dennett is known for being very confusing and esoteric and often seeming to contradict himself. I had to read not just his own writings but writings of those who knew him or attempted to...


There’s this popular trope in fiction about a character being mind controlled without losing awareness of what’s happening. Think Jessica Jones, The Manchurian Candidate or Bioshock. The villain uses some magical technology to take control of your brain - but only the part of your brain that’s responsible for motor control. You remain conscious and experience everything with full clarity.

If it’s a children’s story, the villain makes you do embarrassing things like walk through the street naked, or maybe punch yourself in the face. But if it’s an adult story, the villain can do much worse. They can make you betray your values, break your commitments and hurt your loved ones. There are some things you’d rather die than do. But the villain won’t let you stop....

Nathan Helm-Burger
I call this phenomenon a "moral illusion". You are engaging empathy circuits on behalf of an imagined other who doesn't exist. Category error. The only unhappiness is in the imaginer, not in the anthropomorphized object. I think this is likely what's going on with the shrimp welfare people also. Maybe shrimp feel something, but I doubt very much that they feel anything like what the worried people project onto them. It's a thorny problem to be sure, since those empathy circuits are pretty important for helping humans not be cruel to other humans.
AnthonyC
Mostly agreed. I have no idea how to evaluate this for most animals, but I would be very surprised if other mammals did not have subjective experiences analogous to our own for at least some feelings and emotions.
Nathan Helm-Burger
Oh, for sure mammals have emotions much like ours. Fruit flies and shrimp? Not so much. Wrong architecture, missing key pieces.

Fair enough. 

I do believe it's plausible that feelings, like pain and hunger, may be old and fundamental enough to exist across phyla. 

I'm much less inclined to assume emotions are so widely shared, but I wish I could be more sure either way.

Summary: We propose measuring AI performance in terms of the length of tasks AI agents can complete. We show that this metric has been consistently exponentially increasing over the past 6 years, with a doubling time of around 7 months. Extrapolating this trend predicts that, in under five years, we will see AI agents that can independently complete a large fraction of software tasks that currently take humans days or weeks.

The length of tasks (measured by how long they take human professionals) that generalist frontier model agents can complete autonomously with 50% reliability has been doubling approximately every 7 months for the last 6 years. The shaded region represents 95% CI calculated by hierarchical bootstrap over task families, tasks, and task attempts.

Full paper | Github repo

Blogpost; tweet thread.
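
To make the headline extrapolation concrete, here is a minimal sketch of the arithmetic, assuming illustrative round numbers (a roughly one-hour 50%-reliability horizon today and the post's ~7-month doubling time) rather than the paper's exact fitted values:

```python
# Minimal sketch of the exponential extrapolation described above.
# The starting horizon and doubling time are illustrative assumptions,
# not the paper's fitted parameters.

current_horizon_minutes = 60.0   # assumed ~1-hour 50%-success task length today
doubling_time_months = 7.0       # headline doubling time from the post

def horizon_after(months: float) -> float:
    """Predicted 50%-reliability task length (in minutes) after `months` months."""
    return current_horizon_minutes * 2 ** (months / doubling_time_months)

for years in range(1, 6):
    minutes = horizon_after(12 * years)
    print(f"+{years} yr: ~{minutes / 60:.0f} hours "
          f"(~{minutes / (60 * 40):.1f} forty-hour work weeks)")
```

Under these assumed inputs, the five-year horizon comes out to a few hundred hours of human professional time, i.e. tasks that take humans days to weeks, which is where the headline claim comes from.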

I have a few potential criticisms of this paper. I think my criticisms are probably wrong and the paper's conclusion is right, but I'll just put them out there:

  1. Nearly half the tasks in the benchmark take 1 to 30 seconds (the ones from the SWAA set). According to the fitted task time <> P(success) curve, most tested LLMs should be able to complete those with high probability, so they don't provide much independent signal.
    • However, I expect task time <> P(success) curve would look largely the same if you excluded the SWAA tasks.
  2. SWAA tasks t
... (read more)
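
Criticism 1 above could in principle be sanity-checked by refitting the success-probability curve with and without the very short (SWAA-like) tasks and comparing the implied 50% horizons. A hedged sketch of that comparison, using fake data and a plain logistic fit on log task length as an illustrative stand-in, not METR's actual data or pipeline:

```python
# Illustrative check: does dropping sub-30-second (SWAA-like) tasks change
# the fitted 50%-success task-length horizon much? Fake data, simple fit.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Fake benchmark: task lengths in minutes, from ~5 seconds up to 8 hours.
lengths = np.exp(rng.uniform(np.log(5 / 60), np.log(480), size=2000))
# Fake model: success probability falls off in log2(task length), 50% at 30 min.
true_p = 1 / (1 + np.exp(1.5 * (np.log2(lengths) - np.log2(30))))
success = rng.random(2000) < true_p

def fifty_percent_horizon(lengths, success):
    """Fit P(success) ~ logistic(log2 length) and return the length where p = 0.5."""
    X = np.log2(lengths).reshape(-1, 1)
    clf = LogisticRegression().fit(X, success)
    # p = 0.5 where coef * x + intercept = 0, with x = log2(length)
    return 2 ** (-clf.intercept_[0] / clf.coef_[0][0])

all_tasks = fifty_percent_horizon(lengths, success)
long_only = fifty_percent_horizon(lengths[lengths > 0.5], success[lengths > 0.5])
print(f"50% horizon, all tasks:       {all_tasks:.1f} min")
print(f"50% horizon, >30s tasks only: {long_only:.1f} min")
```

If the commenter's expectation holds, the two fitted horizons should come out close to each other, i.e. the very short tasks contribute little independent signal to the curve.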
MichaelDickens
Why do you think this narrows the distribution? I can see an argument for why, tell me if this is what you're thinking–
Matt Goldenberg
  My point is, maybe there are just many skills that are at 50% of human, then go up to 60%, then 70%, etc, and can keep going up linearly to 200% or 300%. It's not like it lacked the skill then suddenly stopped lacking it, it just got better and better at it 
Cole Wyeth
I haven’t read the paper (yet?) but from the plot I am not convinced. The points up to 2024 are too sparse; they don’t let us conclude much about that region of growth in abilities, but if they did, it would be a significantly lower slope. When the points become dense, the comparison is not fair - these are reasoning models which use far more inference-time compute. 

PDF version. berkeleygenomics.org. Twitter thread. (Bluesky copy.)

Summary

The world will soon use human germline genomic engineering technology. The benefits will be enormous: Our children will be long-lived, will have strong and diverse capacities, and will be halfway to the end of all illness.

To quickly bring about this world and make it a good one, it has to be a world that is beneficial, or at least acceptable, to a great majority of people. What laws would make this world beneficial to most, and acceptable to approximately all? We'll have to chew on this question ongoingly.

Genomic Liberty is a proposal for one overarching principle, among others, to guide public policy and legislation around germline engineering. It asserts:

Parents have the right to freely choose the genomes of their children.

If upheld,...

I noticed that this focuses on the liberty of parents, and it is not a general framework for all genetic engineering.

I don't think these liberties for parents should cleanly extend to non-germline gene engineering.

In those cases we would expect these liberties to lie with the modified person.

I think we can identify how this is an extension of the idea that parents are empowered to pre-emptively make decisions for their children that those children, once adult, should have the right to make for themselves (eg. as an adult you have rights to decide where to liv... (read more)

Julian Bradshaw
This is a thoughtful post, and I appreciate it. I don't think I disagree with it from a liberty perspective, and agree there are potential huge benefits for humanity here. However, my honest first reaction is "this reasoning will be used to justify a world in which citizens of rich countries have substantially superior children to citizens of poor countries (as viewed by both groups)". These days, I'm much more suspicious of policies likely to be socially corrosive: it leads to bad governance at a time where, because of AI risk, we need excellent governance. I'm sure you've thought about this question, it's the classic objection. Do you have any idea how to avoid or at least mitigate the inequality adopting genomic liberty would cause? Or do you think it wouldn't happen at all? Or do you think that it's simply worth it and natural that any new technology is first adopted by those who can afford it, and that adoption drives down prices and will spread the technology widely soon enough?
TsviBT
I'm not quite sure I follow. Let me check. You might be saying: Or maybe you're saying: To be kinda precise and narrow, the narrow meaning of genomic liberty as a negative right doesn't say it's good or even ok to have worlds with unequal access. As a moral claim, it more narrowly says This does say something about the unequal world--namely that it's "not so morally abhorrent that we should [have international regulation backed by the threat of war] to prevent that use". I don't think that's a ringing endorsement. To be honest, mainly I've thought about inequality within single economic and jurisdictional regimes. (I think that objection is more common than the international version.) I'm not even sure how to orient to international questions--practically, morally, and conceptually. Probably I'd want to have at least a tiny bit of knowledge about other international relations (conflicts, law, treaties, cooperative projects, aid, long-term development) before thinking about this much. I'm unclear on the ethics of one country doing something that doesn't harm other countries in any direct way, and even directly helps other countries at least on first-order; but also doesn't offer that capability to other countries especially vigorously. Certainly, personally, I take a humanist and universalist stance: It is good for all people to be empowered; it is bad for some group to try to have some enduring advantage over others by harming or suppressing the others or even by permanently withholding information that would help them. It does seem good to think about the international question. I'm unsure whether it should ultimately be a crux, though. I do think it's better if germline engineering is developed in the US before, say, Russia, because the US will work out the liberal version, and will be likely to be generous with the technology in the long-run. It would almost certainly happen to a significant extent. Right, this is a big part of what I hope, and somewhat e
Julian Bradshaw
Thanks for the detailed response! Re: my meaning, you got it correct here: Re: genomic liberty makes narrow claims, yes I agree, but my point is that if implemented it will lead to a world with unequal access for some substantial period of time, and that I expect this to be socially corrosive.   Switching to quoting your post and responding to those quotes: Yeah that's the common variant of the concern but I think it's less compelling - rich countries will likely be able to afford subsidizing gene editing for their citizens, and will be strongly incentivized to do so even if it's quite expensive. So my expectation is that the intra-country effects for rich countries won't be as bad as science fiction has generally predicted, but that the international effects will be. (and my fear is this would play into general nationalizing trends worldwide that increase competition and make nation-states bitter towards each other, when we want international cooperation on AI) My worry is mostly that the tech won't spread "soon enough" to avoid socially corrosive effects, less so that it will never spread. As for a tech that never fully spread but should have benefitted everyone, all that comes to mind is nuclear energy. I think this would happen, but it would be expressed mostly resentfully, not positively. Sounds interesting!