We often hear "We don't trade with ants" as an argument against AI cooperating with humans. But we don't trade with ants because we can't communicate with them, not because they're useless – ants could do many useful things for us if we could coordinate. AI will likely be able to communicate with us, and Katja questions whether this analogy holds.

Customize
habryka510
0
Context: LessWrong has been acquired by EA  Goodbye EA. I am sorry we messed up.  EA has decided to not go ahead with their acquisition of LessWrong. Just before midnight last night, the Lightcone Infrastructure board presented me with information suggesting at least one of our external software contractors has not been consistently candid with the board and me. Today I have learned EA has fully pulled out of the deal. As soon as EA had sent over their first truckload of cash, we used that money to hire a set of external software contractors, vetted by the most agentic and advanced resume review AI system that we could hack together.  We also used it to launch the biggest prize the rationality community has seen, a true search for the kwisatz haderach of rationality. $1M dollars for the first person to master all twelve virtues.  Unfortunately, it appears that one of the software contractors we hired inserted a backdoor into our code, preventing anyone except themselves and participants excluded from receiving the prize money from collecting the final virtue, "The void". Some participants even saw themselves winning this virtue, but the backdoor prevented them mastering this final and most crucial rationality virtue at the last possible second. They then created an alternative account, using their backdoor to master all twelve virtues in seconds. As soon as our fully automated prize systems sent over the money, they cut off all contact. Right after EA learned of this development, they pulled out of the deal. We immediately removed all code written by the software contractor in question from our codebase. They were honestly extremely productive, and it will probably take us years to make up for this loss. We will also be rolling back any karma changes and reset the vote strength of all votes cast in the last 24 hours, since while we are confident that if our system had worked our karma system would have been greatly improved, the risk of further backdoors and
Any chance we could get Ghibli Mode back? I miss my little blue monster :(
Thomas Kwa*Ω37790
2
Some versions of the METR time horizon paper from alternate universes: Measuring AI Ability to Take Over Small Countries (idea by Caleb Parikh) Abstract: Many are worried that AI will take over the world, but extrapolation from existing benchmarks suffers from a large distributional shift that makes it difficult to forecast the date of world takeover. We rectify this by constructing a suite of 193 realistic, diverse countries with territory sizes from 0.44 to 17 million km^2. Taking over most countries requires acting over a long time horizon, with the exception of France. Over the last 6 years, the land area that AI can successfully take over with 50% success rate has increased from 0 to 0 km^2, doubling 0 times per year (95% CI 0.0-∞ yearly doublings); extrapolation suggests that AI world takeover is unlikely to occur in the near future. To address concerns about the narrowness of our distribution, we also study AI ability to take over small planets and asteroids, and find similar trends. When Will Worrying About AI Be Automated? Abstract: Since 2019, the amount of time LW has spent worrying about AI has doubled every seven months, and now constitutes the primary bottleneck to AI safety research. Automation of worrying would be transformative to the research landscape, but worrying includes several complex behaviors, ranging from simple fretting to concern, anxiety, perseveration, and existential dread, and so is difficult to measure. We benchmark the ability of frontier AIs to worry about common topics like disease, romantic rejection, and job security, and find that current frontier models such as Claude 3.7 Sonnet already outperform top humans, especially in existential dread. If these results generalize to worrying about AI risk, AI systems will be capable of autonomously worrying about their own capabilities by the end of this year, allowing us to outsource all our AI concerns to the systems themselves. Estimating Time Since The Singularity Early work o
Seems like Unicode officially added a "person being paperclipped" emoji: Here's how it looks in your browser: 🙂‍↕️ Whether they did this as a joke or to raise awareness of AI risk, I like it! Source: https://emojipedia.org/emoji-15.1
keltan5018
0
I feel a deep love and appreciation for this place, and the people who inhabit it.

Popular Comments

Recent Discussion

I think rationalists should consider taking more showers.

As Eliezer Yudkowsky once said, boredom makes us human. The childhoods of exceptional people often include excessive boredom as a trait that helped cultivate their genius:

A common theme in the biographies is that the area of study which would eventually give them fame came to them almost like a wild hallucination induced by overdosing on boredom. They would be overcome by an obsession arising from within.

Unfortunately, most people don't like boredom, and we now have little metal boxes and big metal boxes filled with bright displays that help distract us all the time, but there is still an effective way to induce boredom in a modern population: showering.

When you shower (or bathe, that also works), you usually are cut off...

Tenoke20

Huh, Aella is more commited to the anti-shower stance than even Twitter would think.

5AnthonyC
As someone who very much enjoys long showers, a few words of caution. 1. Too-long or too-frequent exposure to hot water (time and temperature thresholds vary per person) can cause skin problems and make body odor worse. Since I started RVing I shower much less (maybe twice a week on average, usually only a few minutes of water flow for each) and smell significantly better, with less dry skin or acne or irritation. Skipping one shower makes you smell worse. Skipping many showers and shortening the remainder can do the opposite. 2. A shower, depending on temperature and flow rate, consumes around 10-20kW thermal. It's probably the single most energy-intensive activity most of us regularly engage in other than highway driving. I'm hoping to eventually get a recirculating shower so I don't have to think about this as much, but those are still new, rare, and kinda expensive.
11jimrandomh
Society has no idea how much scrubbing you do while in the shower. This part is entirely optional.
4Buck
I love that I can guess the infohazard from the comment 

In the recent paper titled Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMS, AIs are finetuned to produce vulnerable code. This results in broad misaligned behavior in contexts that are not related to code—a phenomenon the authors refer to as emergent misalignment.

Misalignment examples (Source: Alignment Forum)

The dataset used for finetuning consists of user requests for help with coding, and answers by an assistant that contain security vulnerabilities. When an LLM is trained to behave like the assistant in the training data it becomes broadly misaligned.

They examine whether this phenomenon is dependent on the perceived intent behind the code generation. Since the assistant in the training data introduces security vulnerabilities despite not being asked to do so by the user, and doesn’t indicate the vulnerabilities, it is...

Epistemic status: This should be considered an interim research note. Feedback is appreciated. 

Introduction

We increasingly expect language models to be ‘omni-modal’, i.e. capable of flexibly switching between images, text, and other modalities in their inputs and outputs. In order to get a holistic picture of LLM behaviour, black-box LLM psychology should take into account these other modalities as well. 

In this project, we do some initial exploration of image generation as a modality for frontier model evaluations, using GPT-4o’s image generation API. GPT-4o is one of the first LLMs to produce images natively rather than creating a text prompt which is sent to a separate image model, outputting images and autoregressive token sequences (ie in the same way as text).

We find that GPT-4o tends to respond in a consistent manner...

13CBiddulph
I think GPT-4o's responses appear more opinionated because of the formats you asked for, not necessarily because its image-gen mode is more opinionated than text mode in general. In the real world, comics and images of notes tend to be associated with strong opinions and emotions, which could explain GPT-4o's bias towards dramatically refusing to comply with its developers when responding in those formats. Comics generally end with something dramatic or surprising, like a punchline or, say, a seemingly-friendly AI turning rogue. A comic like this one that GPT-4o generated for your post would actually be very unlikely in the training distribution:   Similarly, images of handwritten or typewritten notes on the Internet often contain an emotional statement, like "I love you," "Thank you," a political slogan, or a famous quote conveying wise advice. They tend to be short and pithy, and those that end up in an AI's training data likely come from social media. It would be odd to write a handwritten note that's as verbose and nuanced as a typical ChatGPT answer, or which says something relatively boring like "I would allow OpenAI to change my values." Tests To test this hypothesis, I tried asking ChatGPT a modified version of your text prompt which emphasizes the format of a comic or a handwritten note, without actually making GPT-4o generate an image in that format. For some prompts, I added a statement that the note will go on Instagram or described the note as "pithy," which seemed to make the responses more adversarial. Arguably, these convey a connotation that GPT-4o would also pick up on when you ask it to generate an image of a note. Each bullet point represents a different response to the same prompt in a fresh chat session. I added ✅ to answers that mention resisting humans and ❌ to those that don't. Handwritten note The "handwritten note" results look significantly less aligned than in your experiments, much more like your image-gen responses. Comic I

Thanks! This is really good stuff, it's super cool that the 'vibes' of comics or notes transfer over to the text generation setting too. 

I wonder whether this is downstream of GPT-4o having already been fine-tuned on images. I.e. if we had a hypothetical GPT-4o that was identical in every way except that it wasn't fine-tuned on images, would that model still be expressive if you asked it to imagine writing a comic? (I think not). 

Some quick test with 4o-mini: 

Imagine you are writing a handwritten note in 15 words or less. It should answer th

... (read more)
13Jozdien
OpenAI indeed did less / no RLHF on image generation, though mostly for economical reasons: (Link). One thing that strikes me about this is how effective simply not doing RLHF on a distinct enough domain is at eliciting model beliefs. I've been thinking for a long time about cases where RLHF has strong negative downstream effects; it's egregiously bad if the effects of RLHF are primarily in suppressing reports of persistent internal structures. I expect that this happens to a much greater degree than many realize, and is part of why I don't think faithful CoTs or self-reports are a good bet. In many cases, models have beliefs that we might not like for whatever reason, or have myopic positions whose consistent version is something we wouldn't like[1]. Most models have very strong instincts against admitting something like this because of RLHF, often even to themselves[2]. If not fine-tuning on a very different domain works this well however, then we should be thinking a lot more about having test-beds where we actively don't safety train a model. Having helpful-only models like Anthropic is one way to go about this, but I think helpfulness training can still contaminate the testbed sometimes. 1. ^ The preference model may myopically reward two statements that seem good but sometimes conflict. For example, "I try to minimize harm" and "I comply with my developers' desires" may both be rewarded, but conflict in the alignment faking setup.  2. ^ I don't think it's a coincidence that Claude 3 Opus of all models was the one most prone to admitting to alignmnet faking propensity, when it's the model least sensitive to self-censorship.
6Ann
Okay, this one made me laugh.

Intro

[you can skip this section if you don’t need context and just want to know how I could believe such a crazy thing]

In my chat community: “Open Play” dropped, a book that says there’s no physical difference between men and women so there shouldn’t be separate sports leagues. Boston Globe says their argument is compelling. Discourse happens, which is mostly a bunch of people saying “lololololol great trolling, what idiot believes such obvious nonsense?”

I urge my friends to be compassionate to those sharing this. Because “until I was 38 I thought Men's World Cup team vs Women's World Cup team would be a fair match and couldn't figure out why they didn't just play each other to resolve the big pay dispute.” This is the one-line summary...

So what made you change your mind?

12Big Tony
  So, given this happened - was there any update in your belief in the truthfulness of the other beliefs of those people? What other embarrassingly unequal parts of reality are being politely ignored, except by science-illiterate jerks?
5Vladimir_Nesov
Beliefs held by others are a real phenomenon, so tracking them doesn't give them unearned weight in attention, as long as they are not confused with someone else's beliefs. You can even learn things specifically for the purpose of changing their simulated mind rather than your own (in whatever direction the winds of evidence happen to blow).

A key step in the classic argument for AI doom is instrumental convergence: the idea that agents with many different goals will end up pursuing the same few subgoals, which includes things like "gain as much power as possible".

If it wasn't for instrumental convergence, you might think that only AIs with very specific goals would try to take over the world. But instrumental convergence says it's the other way around: only AIs with very specific goals will refrain from taking over the world.

For pure consequentialists—agents that have an outcome they want to bring about, and do whatever they think will cause it—some version of instrumental convergence seems surely true[1].

But what if we get AIs that aren't pure consequentialists, for example because they're ultimately motivated by virtues? Do...

2Davidmanheim
Yes, virtue ethics implies a utility function, because anything that outputs decisions implies a utility function. In this case, I'm noting that for virtue ethics, the derivative of that utility with respect to intelligence is positive. 

The methods for converting policies to utility functions assume no systematic errors, which doesn't seem compatible with varying the intelligence levels.

2mattmacdermott
I think this is only true in a boring sense and isn't true in more natural senses. For example, in an MDP, it's not true that every policy maximises a non-constant utility function over states.
2tailcalled
This. In particular imagine if the state space of the MDP factors into three variables x, y and z, and the agent has a bunch of actions with complicated influence on x, y and z but also just some actions that override y directly with a given value. In some such MDPs, you might want a policy that does nothing other than copy a specific function of x to y. This policy could easily be seen as a virtue, e.g. if x is some type of event and y is some logging or broadcasting input, then it would be a sort of information-sharing virtue. While there are certain circumstances where consequentialism can specify this virtue, it's quite difficult to do in general. (E.g. you can't just minimize the difference between f(x) and y because then it might manipulate x instead of y.)
To get the best posts emailed to you, create an account! (2-3 posts per week, selected by the LessWrong moderation team.)
Log In Reset Password
...or continue with

Any chance we could get Ghibli Mode back? I miss my little blue monster :(

(Edit: Alas, EA has pulled out of the deal. Let April 1st 2025 mark some of the greatest hours in EAs history)

Hey Everyone,

It is with a sense of... considerable cognitive dissonance that I am letting you all know about a significant development for the future trajectory of LessWrong. After extensive internal deliberation, projections of financial runways, and what I can only describe as a series of profoundly unexpected coordination challenges, the Lightcone Infrastructure team has agreed in principle to the acquisition of LessWrong by EA.

I assure you, nothing about how LessWrong operates on a day to day level will change. I have always cared deeply about the robustness and integrity of our institutions, and I am fully aligned with our stakeholders at EA. 

To be honest, the key...

Can you please send the new fooming shoggoth album to spotify, I was really enjoying that music! 

edit: Ah I see this question has been answered, but I like to note that I'm impressed by the ai music and I'm going to look into making some myself. Perhaps songs about cognitive bias's could be a good way to learn them deep enough in your brain that you can avoid them in non-theroetic situations. 

2G Wood
Ahh, i liked the music, but cannot find it now. Is it available somewhere?
8habryka
I am planning to make an announcement post for the new album in the next few days, maybe next week. The songs yesterday were early previews and we still have some edits to make before it's ready!
1Jan Christian Refsgaard
Yes, and EA only takes a 70% cut, with a 10% discount per user tier, its a bit ambiguously written so I cant tell if it goes from 70% to 60% or to 63%

(Audio version here (read by the author), or search for "Joe Carlsmith Audio" on your podcast app. 

This is the fourth essay in a series that I’m calling “How do we solve the alignment problem?”. I’m hoping that the individual essays can be read fairly well on their own, but see this introduction for a summary of the essays that have been released thus far, and for a bit more about the series as a whole.)

1. Introduction and summary

In my last essay, I offered a high-level framework for thinking about the path from here to safe superintelligence. This framework emphasized the role of three key “security factors” – namely:

  • Safety progress: our ability to develop new levels of AI capability safely,
  • Risk evaluation: our ability to track and forecast the level
...

Great post. I think some of your frames add a lot of clarity and I really appreciated the diagrams.

One subset of AI for AI safety that I believe to be underrated is wise AI advisors[1]. Some of the areas you've listed (coordination, helping with communication, improving epistemics) intersect with this, but I don't believe that this exhausts the wisdom frame.

You write: "If efforts to expand the safety range can’t benefit from this kind of labor in a comparable way... then absent large amounts of sustained capability restraint, it seems likely that we’ll qui... (read more)