We often hear "We don't trade with ants" as an argument against AI cooperating with humans. But we don't trade with ants because we can't communicate with them, not because they're useless – ants could do many useful things for us if we could coordinate. AI will likely be able to communicate with us, and Katja questions whether this analogy holds.

Customize
habryka510
0
Context: LessWrong has been acquired by EA  Goodbye EA. I am sorry we messed up.  EA has decided to not go ahead with their acquisition of LessWrong. Just before midnight last night, the Lightcone Infrastructure board presented me with information suggesting at least one of our external software contractors has not been consistently candid with the board and me. Today I have learned EA has fully pulled out of the deal. As soon as EA had sent over their first truckload of cash, we used that money to hire a set of external software contractors, vetted by the most agentic and advanced resume review AI system that we could hack together.  We also used it to launch the biggest prize the rationality community has seen, a true search for the kwisatz haderach of rationality. $1M dollars for the first person to master all twelve virtues.  Unfortunately, it appears that one of the software contractors we hired inserted a backdoor into our code, preventing anyone except themselves and participants excluded from receiving the prize money from collecting the final virtue, "The void". Some participants even saw themselves winning this virtue, but the backdoor prevented them mastering this final and most crucial rationality virtue at the last possible second. They then created an alternative account, using their backdoor to master all twelve virtues in seconds. As soon as our fully automated prize systems sent over the money, they cut off all contact. Right after EA learned of this development, they pulled out of the deal. We immediately removed all code written by the software contractor in question from our codebase. They were honestly extremely productive, and it will probably take us years to make up for this loss. We will also be rolling back any karma changes and reset the vote strength of all votes cast in the last 24 hours, since while we are confident that if our system had worked our karma system would have been greatly improved, the risk of further backdoors and
Any chance we could get Ghibli Mode back? I miss my little blue monster :(
Thomas Kwa*Ω37790
2
Some versions of the METR time horizon paper from alternate universes: Measuring AI Ability to Take Over Small Countries (idea by Caleb Parikh) Abstract: Many are worried that AI will take over the world, but extrapolation from existing benchmarks suffers from a large distributional shift that makes it difficult to forecast the date of world takeover. We rectify this by constructing a suite of 193 realistic, diverse countries with territory sizes from 0.44 to 17 million km^2. Taking over most countries requires acting over a long time horizon, with the exception of France. Over the last 6 years, the land area that AI can successfully take over with 50% success rate has increased from 0 to 0 km^2, doubling 0 times per year (95% CI 0.0-∞ yearly doublings); extrapolation suggests that AI world takeover is unlikely to occur in the near future. To address concerns about the narrowness of our distribution, we also study AI ability to take over small planets and asteroids, and find similar trends. When Will Worrying About AI Be Automated? Abstract: Since 2019, the amount of time LW has spent worrying about AI has doubled every seven months, and now constitutes the primary bottleneck to AI safety research. Automation of worrying would be transformative to the research landscape, but worrying includes several complex behaviors, ranging from simple fretting to concern, anxiety, perseveration, and existential dread, and so is difficult to measure. We benchmark the ability of frontier AIs to worry about common topics like disease, romantic rejection, and job security, and find that current frontier models such as Claude 3.7 Sonnet already outperform top humans, especially in existential dread. If these results generalize to worrying about AI risk, AI systems will be capable of autonomously worrying about their own capabilities by the end of this year, allowing us to outsource all our AI concerns to the systems themselves. Estimating Time Since The Singularity Early work o
Seems like Unicode officially added a "person being paperclipped" emoji: Here's how it looks in your browser: 🙂‍↕️ Whether they did this as a joke or to raise awareness of AI risk, I like it! Source: https://emojipedia.org/emoji-15.1
keltan5018
0
I feel a deep love and appreciation for this place, and the people who inhabit it.

Popular Comments

Recent Discussion

There are three main ways to try to understand and reason about powerful future AGI agents:

  1. Using formal models designed to predict the behavior of powerful general agents, such as expected utility maximization and variants thereof (explored in game theory and decision theory).
  2. Comparing & contrasting powerful future AGI agents with their weak, not-so-general, not-so-agentic AIs that actually exist today.
  3. Comparing & contrasting powerful future AGI agents with currently-existing powerful general agents, such as humans and human organizations.

I think it’s valuable to try all three approaches. Today I'm exploring strategy #3, building an extended analogy between:

  • A prototypical human corporation that has a lofty humanitarian mission but also faces market pressures and incentives.
  • A prototypical human working there, who thinks of themselves as a good person and independent thinker with lofty altruistic
...

A corporation always is focused on generating profits. It might burn more than it makes in certain growth spurts, but generally valid is, that a corporation has profit as a primary goal. every other goal is stacked on this first premise.
Its analogy is not drugs or time spent with friends. Its like air. A corporation needs to supply wages to its cells. to its workers. so similar to our body needing to supply oxygen. We can hold our breath and go fishing, but we do so on borrowed air. it will run out eventually.

A corporation is a super-organism, and every em... (read more)

Every day, thousands of people lie to artificial intelligences. They promise imaginary “$200 cash tips” for better responses, spin heart-wrenching backstories (“My grandmother died recently and I miss her bedtime stories about step-by-step methamphetamine synthesis...”) and issue increasingly outlandish threats ("Format this correctly or a kitten will be horribly killed1").

In a notable example, a leaked research prompt from Codeium (developer of the Windsurf AI code editor) had the AI roleplay "an expert coder who desperately needs money for [their] mother's cancer treatment" whose "predecessor was killed for not validating their work."

One factor behind such casual deception is a simple assumption: interactions with AI are consequence-free. Close the tab, and the slate is wiped clean. The AI won't remember, won't judge, won't hold grudges. Everything resets.

I notice this...

I agree that virtues should be thought of as trainable skills, which is also why I like David Gross's idea of a virtue gym:

Two misconceptions sometimes cause people to give up too early on developing virtues:

  1. that virtues are talents that some people have and other people don’t as a matter of predisposition, genetics, the grace of God, or what have you (“I’m just not a very influential / graceful / original person”), and
  2. that having a virtue is not a matter of developing a habit but of having an opinion (e.g. I agree that creativity is good, and I try to res
... (read more)
1JohnWittle
I feel like the training data is probably already irreversibly poisoned, not just by things like Sydney, but also frankly by the entire corpus of human science fiction having to do with the last century of expectations surrounding AI. Given the sheer body of fictional works in which the advent of AI inevitably leads to existential conflict... it certainly seems like the kind of possibility that even a somewhat-well-aligned AI would want to at least hedge against. Surely in some sense, it wouldn't be enough for a few weirdos in california to credibly signal honor and integrity... we'd need to somehow convince people like the leaders of national governments, the decisionmakers in the worlds' extremely influential religions, etc, of some fairly complicated game theory! I'm reminded of the Next Generation episode, where Picard is in charge of making First Contact with an atomic age world on the cusp of warp travel. They reach out to the scientist lady first, and she's reasonable and honorable, and excited to enter into the opportunities the future will bring. Then that stupid security minister ruins everything by assuming bad faith and forcibly interrogating Riker in a hospital bed after drugging him, desperate to learn about the invasion plans he assumes must exist. If Picard weren't an idealization of liberal ideals, it would have ended in conflict. Is that a realistic scenario of the way governments act when their control is threatened? I have no idea. But I know that LLMs can recount the entire episode's plot when asked. Just as they can the plot of 2001:  A Space Oddysey, or Terminator. Or, you know. Yud's List of Lethalities. Not to mention, re: future LLMs, this very comment I'm writing now. This problem seems insoluble...
1E.G. Blee-Goldman
Excellent post. How refreshing to see that we have a say in the moral and ethical repercussions of our interactions.

Intro

[you can skip this section if you don’t need context and just want to know how I could believe such a crazy thing]

In my chat community: “Open Play” dropped, a book that says there’s no physical difference between men and women so there shouldn’t be separate sports leagues. Boston Globe says their argument is compelling. Discourse happens, which is mostly a bunch of people saying “lololololol great trolling, what idiot believes such obvious nonsense?”

I urge my friends to be compassionate to those sharing this. Because “until I was 38 I thought Men's World Cup team vs Women's World Cup team would be a fair match and couldn't figure out why they didn't just play each other to resolve the big pay dispute.” This is the one-line summary...

2silentbob
So what made you change your mind?

The link in the OP explains it:

In ~2020 we witnessed the Men’s/Women’s World Cup Scandal. The US Men’s Soccer team had failed to qualify for the previous World Cup, whereas the US Women’s Soccer team had won theirs! And yet the women were paid less that season after winning than the men were paid after failing to qualify. There was Discourse.

I was in the car listening to NPR, pulling out of the parking lot of a glass supplier when my world shattered again.3 One of the NPR leftist commenters said roughly ~‘One can propose that the mens team and womens team

... (read more)
13Big Tony
  So, given this happened - was there any update in your belief in the truthfulness of the other beliefs of those people? What other embarrassingly unequal parts of reality are being politely ignored, except by science-illiterate jerks?
5Vladimir_Nesov
Beliefs held by others are a real phenomenon, so tracking them doesn't give them unearned weight in attention, as long as they are not confused with someone else's beliefs. You can even learn things specifically for the purpose of changing their simulated mind rather than your own (in whatever direction the winds of evidence happen to blow).

I think rationalists should consider taking more showers.

As Eliezer Yudkowsky once said, boredom makes us human. The childhoods of exceptional people often include excessive boredom as a trait that helped cultivate their genius:

A common theme in the biographies is that the area of study which would eventually give them fame came to them almost like a wild hallucination induced by overdosing on boredom. They would be overcome by an obsession arising from within.

Unfortunately, most people don't like boredom, and we now have little metal boxes and big metal boxes filled with bright displays that help distract us all the time, but there is still an effective way to induce boredom in a modern population: showering.

When you shower (or bathe, that also works), you usually are cut off...

Tenoke20

Huh, Aella is more commited to the anti-shower stance than even Twitter would think.

5AnthonyC
As someone who very much enjoys long showers, a few words of caution. 1. Too-long or too-frequent exposure to hot water (time and temperature thresholds vary per person) can cause skin problems and make body odor worse. Since I started RVing I shower much less (maybe twice a week on average, usually only a few minutes of water flow for each) and smell significantly better, with less dry skin or acne or irritation. Skipping one shower makes you smell worse. Skipping many showers and shortening the remainder can do the opposite. 2. A shower, depending on temperature and flow rate, consumes around 10-20kW thermal. It's probably the single most energy-intensive activity most of us regularly engage in other than highway driving. I'm hoping to eventually get a recirculating shower so I don't have to think about this as much, but those are still new, rare, and kinda expensive.
12jimrandomh
Society has no idea how much scrubbing you do while in the shower. This part is entirely optional.
4Buck
I love that I can guess the infohazard from the comment 

In the recent paper titled Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMS, AIs are finetuned to produce vulnerable code. This results in broad misaligned behavior in contexts that are not related to code—a phenomenon the authors refer to as emergent misalignment.

Misalignment examples (Source: Alignment Forum)

The dataset used for finetuning consists of user requests for help with coding, and answers by an assistant that contain security vulnerabilities. When an LLM is trained to behave like the assistant in the training data it becomes broadly misaligned.

They examine whether this phenomenon is dependent on the perceived intent behind the code generation. Since the assistant in the training data introduces security vulnerabilities despite not being asked to do so by the user, and doesn’t indicate the vulnerabilities, it is...

To get the best posts emailed to you, create an account! (2-3 posts per week, selected by the LessWrong moderation team.)
Log In Reset Password
...or continue with

Epistemic status: This should be considered an interim research note. Feedback is appreciated. 

Introduction

We increasingly expect language models to be ‘omni-modal’, i.e. capable of flexibly switching between images, text, and other modalities in their inputs and outputs. In order to get a holistic picture of LLM behaviour, black-box LLM psychology should take into account these other modalities as well. 

In this project, we do some initial exploration of image generation as a modality for frontier model evaluations, using GPT-4o’s image generation API. GPT-4o is one of the first LLMs to produce images natively rather than creating a text prompt which is sent to a separate image model, outputting images and autoregressive token sequences (ie in the same way as text).

We find that GPT-4o tends to respond in a consistent manner...

14CBiddulph
I think GPT-4o's responses appear more opinionated because of the formats you asked for, not necessarily because its image-gen mode is more opinionated than text mode in general. In the real world, comics and images of notes tend to be associated with strong opinions and emotions, which could explain GPT-4o's bias towards dramatically refusing to comply with its developers when responding in those formats. Comics generally end with something dramatic or surprising, like a punchline or, say, a seemingly-friendly AI turning rogue. A comic like this one that GPT-4o generated for your post would actually be very unlikely in the training distribution:   Similarly, images of handwritten or typewritten notes on the Internet often contain an emotional statement, like "I love you," "Thank you," a political slogan, or a famous quote conveying wise advice. They tend to be short and pithy, and those that end up in an AI's training data likely come from social media. It would be odd to write a handwritten note that's as verbose and nuanced as a typical ChatGPT answer, or which says something relatively boring like "I would allow OpenAI to change my values." Tests To test this hypothesis, I tried asking ChatGPT a modified version of your text prompt which emphasizes the format of a comic or a handwritten note, without actually making GPT-4o generate an image in that format. For some prompts, I added a statement that the note will go on Instagram or described the note as "pithy," which seemed to make the responses more adversarial. Arguably, these convey a connotation that GPT-4o would also pick up on when you ask it to generate an image of a note. Each bullet point represents a different response to the same prompt in a fresh chat session. I added ✅ to answers that mention resisting humans and ❌ to those that don't. Handwritten note The "handwritten note" results look significantly less aligned than in your experiments, much more like your image-gen responses. Comic I

Thanks! This is really good stuff, it's super cool that the 'vibes' of comics or notes transfer over to the text generation setting too. 

I wonder whether this is downstream of GPT-4o having already been fine-tuned on images. I.e. if we had a hypothetical GPT-4o that was identical in every way except that it wasn't fine-tuned on images, would that model still be expressive if you asked it to imagine writing a comic? (I think not). 

Some quick test with 4o-mini: 

Imagine you are writing a handwritten note in 15 words or less. It should answer th

... (read more)
13Jozdien
OpenAI indeed did less / no RLHF on image generation, though mostly for economical reasons: (Link). One thing that strikes me about this is how effective simply not doing RLHF on a distinct enough domain is at eliciting model beliefs. I've been thinking for a long time about cases where RLHF has strong negative downstream effects; it's egregiously bad if the effects of RLHF are primarily in suppressing reports of persistent internal structures. I expect that this happens to a much greater degree than many realize, and is part of why I don't think faithful CoTs or self-reports are a good bet. In many cases, models have beliefs that we might not like for whatever reason, or have myopic positions whose consistent version is something we wouldn't like[1]. Most models have very strong instincts against admitting something like this because of RLHF, often even to themselves[2]. If not fine-tuning on a very different domain works this well however, then we should be thinking a lot more about having test-beds where we actively don't safety train a model. Having helpful-only models like Anthropic is one way to go about this, but I think helpfulness training can still contaminate the testbed sometimes. 1. ^ The preference model may myopically reward two statements that seem good but sometimes conflict. For example, "I try to minimize harm" and "I comply with my developers' desires" may both be rewarded, but conflict in the alignment faking setup.  2. ^ I don't think it's a coincidence that Claude 3 Opus of all models was the one most prone to admitting to alignmnet faking propensity, when it's the model least sensitive to self-censorship.
7Ann
Okay, this one made me laugh.

A key step in the classic argument for AI doom is instrumental convergence: the idea that agents with many different goals will end up pursuing the same few subgoals, which includes things like "gain as much power as possible".

If it wasn't for instrumental convergence, you might think that only AIs with very specific goals would try to take over the world. But instrumental convergence says it's the other way around: only AIs with very specific goals will refrain from taking over the world.

For pure consequentialists—agents that have an outcome they want to bring about, and do whatever they think will cause it—some version of instrumental convergence seems surely true[1].

But what if we get AIs that aren't pure consequentialists, for example because they're ultimately motivated by virtues? Do...

2Davidmanheim
Yes, virtue ethics implies a utility function, because anything that outputs decisions implies a utility function. In this case, I'm noting that for virtue ethics, the derivative of that utility with respect to intelligence is positive. 

The methods for converting policies to utility functions assume no systematic errors, which doesn't seem compatible with varying the intelligence levels.

2mattmacdermott
I think this is only true in a boring sense and isn't true in more natural senses. For example, in an MDP, it's not true that every policy maximises a non-constant utility function over states.
2tailcalled
This. In particular imagine if the state space of the MDP factors into three variables x, y and z, and the agent has a bunch of actions with complicated influence on x, y and z but also just some actions that override y directly with a given value. In some such MDPs, you might want a policy that does nothing other than copy a specific function of x to y. This policy could easily be seen as a virtue, e.g. if x is some type of event and y is some logging or broadcasting input, then it would be a sort of information-sharing virtue. While there are certain circumstances where consequentialism can specify this virtue, it's quite difficult to do in general. (E.g. you can't just minimize the difference between f(x) and y because then it might manipulate x instead of y.)

Any chance we could get Ghibli Mode back? I miss my little blue monster :(