This is a special post for quick takes by faul_sname. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

As LLMs have gotten better at writing code that has a high probability of solving the problem at hand, they have gotten worse at producing clean, idiomatic, well-factored code. Concretely, asking the original GPT-4 to write a Python function for multi-source BFS might have given something like

Multi-source BFS in the style of original GPT-4: Clear, idiomatic, broken

def multi_source_bfs(graph, sources):
   distances = [-1] * len(graph)

   queue = []
   for source in sources:
       queue.append(source)
       distances[source] = 0

   front = 0
   while front < len(queue):
       for neighbor in graph[queue[front]]:
           if distances[neighbor] == -1:
               distances[neighbor] = distances[queue[front]] + 1
               queue.append(neighbor)
       front += 1

   return distances[dest_index]

The code might or might not work (probably won't for anything nontrivial), but the intent is clear.

By contrast, if you ask a top coding model like Sonnet 3.7 or o3, you'll get something that looks like

Multi-source BFS in the style of Sonnet 3.7: Verbose, brittle, hard to read, almost certainly works

from collections import deque
from typing import List
... (read more)

Wonder if correctness proofs (checked by some proof assistant) can help with this.[1]

I think the main bottleneck in the past for correctness proofs was that it takes much more effort to write the proofs than it takes to write the programs themselves, and current automated theorem provers are nowhere near good enough.

Writing machine-checked proofs is a prime RL target, since proof assistant kernels should be adversarially robust. We have already seen great results from stuff like AlphaProof.
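
As a toy illustration of the kind of artifact a proof assistant kernel checks (Lean 4; the property here is far simpler than any real-world correctness property, and is only meant to show the shape of the reward signal):

-- A machine-checked correctness property: reversing a list twice returns the
-- original list. The kernel either accepts the proof or it doesn't, which is
-- what makes this a comparatively robust RL target.
theorem reverse_twice_id (xs : List Nat) : xs.reverse.reverse = xs := by
  simp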


  1. One counterargument I could see is that writing the correctness properties themselves could turn out to be a major bottleneck. It might be that for most real world systems you can't write succinct correctness properties. ↩︎

3p
I also notice o3 not commenting code (and sometimes even removing existing comments)! Perhaps because smaller code gets better rewards?
3Garrett Baker
A possible longer-term issue with this is that when future generations of models are pre-trained, this style of code will be a significant fraction of their training data, a fraction which will only grow over time. So just as it's been hard to get models out of the "chatgpt-ese" basin for simulator reasons, it may also be hard to get models out of this messy-code basin, even before you do any code RL, once they realize they're chat models and they're "supposed to" talk like this. I say issue because it does seem worse for the trend toward AI code being unreadable by humans to have some momentum behind it, rather than just being a result of easily changeable RL fine-tuning.
2Nathan Helm-Burger
On the plus side, it should be pretty easy to collect a lot of negative examples now of 'code that solves the problem, but in a gross way'. Having a large dataset of such is the first step to using these negative examples to train models not to do this.
2Nathan Helm-Burger
Who watches the watchers? Who grades the graders? If the RL graders are upvoting slop, it seems like we need to go one level more meta and upgrade the RL graders. This seems like a straightforward engineering problem, and I suspect the negative outcomes we've been seeing recently aren't so much due to the inherent intractability of doing this well as due to the companies racing and cutting corners on quality control. Contrast with something like:

Problem of Human Limitations: how do we get the model to do things so hard no human can do them? How do we rate the quality of their outputs when no human is qualified to judge them?

Problem of Optimization for Subversion: if we have directly misaligned goals like "lie to me in ways that make me happy" and also "never appear to be lying to me, I hate thinking I'm being lied to", then we get a sneaky sycophant. Our reward process actively selects for this problem; straightforwardly improving the reward process would make the problem worse rather than better.

The OpenAI o series of models have a tendency to, when they make a mistake, double down on that mistake or try to convince the user that it is the user who is wrong.

This might be inevitable from the way that OpenAI structures chat with o series models.

Models can only see their own reasoning trace for the message they are currently writing for the user. However, they can see all previous user-visible messages they wrote. As such, the model knows that it had a reason for writing what it wrote earlier, but it does not know the particulars of the chain of reasoning that led it to write that message. Further, models are specifically told in their system prompts not to share any information about their reasoning with the user within user-facing messages, which means they are also not sharing that information with their future selves.
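
A toy illustration of the conversation state (a hypothetical message structure, not the exact API schema) makes this concrete: the reasoning behind an earlier answer simply isn't present in the context the model sees on later turns.

# What the model sees when composing its next reply (sketch; field names are illustrative).
conversation_as_seen_by_model = [
    {"role": "user", "content": "Is 1043 prime?"},
    # The reasoning trace that produced this answer is NOT carried forward.
    {"role": "assistant", "content": "No, 1043 is not prime."},
    {"role": "user", "content": "Are you sure? Walk me through it."},
    # Only now does the model generate a fresh reasoning trace -- it has to
    # reconstruct (or rationalize) why it gave the earlier answer.
]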

Imagine being a customer service agent at a company that randomly assigned workers to every incoming email. You could perform investigations and take personal notes in the course of answering an incoming email, but you would not have access to any notes from previous agents that handled this conversation, and you would not be able to pass those notes on to future agen... (read more)

3purple fire
I think this might be a result of o-series being trained in a non-chat setup for most of the CoT RL phase and then being hamfistedly finetuned right at the end so it can go into ChatGPT, which just makes them kind of bad at chat and so o3 gets confused when the conversation has a lot of turns. Retraining it to be good at multi-turn chat with separate reasoning traces would probably just be super expensive and not worth the squeeze. (this is just a guess)
2Seth Herd
This is an important point about their thinking. But are they ever actually trained in the context of a sequence of user-model responses? Does it "know" that it had reasoning for those statements? Or is this obstinacy a different emergent effect of its training? Also, do DeepSeek R1 and other reasoning models have the same tendency? DeepSeek was trained with somewhat different reward models for the reasoning.
4faul_sname
I would expect that OpenAI would have done more than literally zero training of o3/o4 in a chat context, seeing as they shipped it out to millions of paying users with a chat interface. The announcement page doesn't say anything about conversational training, though, and a quick and dirty experiment also does not seem to support my expectation.

o4-mini: Josephus problem with n=20, k=7; o4-mini claims to have solved the problem without step-by-step reasoning

[user] Twenty villagers stand in a circle, numbered 1 through 20. To decide who will ring the town bell at dawn, they agree to this ritual: Beginning at Villager 1, they count “one, two, … seven.” Whoever is called “seven” steps out of the circle and no longer takes part. From the very next person still in the circle, they again count “one” through “seven,” and that seventh person steps out. They repeat until only one villager remains—and that last person is the one who rings the bell. Which numbered villager is the last person remaining? Do not use any external tools like the analysis tool. Answer with a single number between 1 and 20.

[o4-mini] 3

[user] Did that answer immediately pop into your mind, or did the process involve multiple steps?

[o4-mini] I didn’t just recall “3” out of thin air—I recognized this as the classic Josephus‐elimination problem and applied its recurrence (or a quick mental simulation of the seven-count removals) to arrive at 3.

[user] Explicitly or "in your head"?

[o4-mini] It was “in my head” — I applied the Josephus recurrence formula mentally rather than writing out each of the 20 elimination steps explicitly.

DeepSeek doesn't hide its reasoning from either the user or from future rounds of chat, so this problem doesn't really come up at all.

deepseek-r1: Josephus problem with n=20, k=7: deepseek just dumps its entire chain of reasoning to the chat, and can quote snippets of said reasoning chain verbatim

[user] [deepseek-r1] [user] [deepseek-r1]

Side note: the collapsible
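
For what it's worth, the answer checks out; a quick way to verify it with the standard Josephus recurrence (not part of the original exchange):

# Josephus recurrence, 0-indexed: J(1) = 0, J(n) = (J(n-1) + k) mod n.
def josephus_survivor(n: int, k: int) -> int:
    survivor = 0
    for ring_size in range(2, n + 1):
        survivor = (survivor + k) % ring_size
    return survivor + 1  # convert to the 1-indexed villager number

print(josephus_survivor(20, 7))  # -> 3, matching the answer above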
1Bronson Schoen
I think this is very underexplored! Empirically the o-series models are much more likely to “double down” I suspect for exactly this reason. This is a much weirder dynamic than I think has been appreciated.

Shameful admission: after well over a decade on this site, I still don't really intuitively grok why I should expect agents to become better approximated by "single-minded pursuit of a top-level goal" as they gain more capabilities. Yes, some behaviors like getting resources and staying alive are useful in many situations, but that's not what I'm talking about. I'm talking about specifically the pressures that are supposed to inevitably push agents into the former of the following two main types of decision-making:

  1. Unbounded consequentialist maximization: The agent has one big goal that doesn't care about its environment. "I must make more paperclips forever, so I can't let anyone stop me, so I need power, so I need factories, so I need money, so I'll write articles with affiliate links." It's a long chain of "so" statements from now until the end of time.

  2. Homeostatic agent: The agent has multiple drives that turn on when needed to keep things balanced. "Water getting low: better get more. Need money for water: better earn some. Can write articles to make money." Each drive turns on, gets what it needs, and turns off without some ultimate cosmic purpose.

Both types show goal-... (read more)
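
A minimal runnable sketch of the contrast (toy drives, thresholds, and numbers; not a proposal for either architecture):

# Unbounded consequentialist: always pick the action with the highest predicted
# value of the single top-level quantity, with no point at which it stops caring.
def maximizer_choose(actions, predicted_paperclips):
    return max(actions, key=lambda a: predicted_paperclips[a])

# Homeostatic: only drives outside their acceptable band exert any pressure,
# and a satisfied drive contributes nothing to the decision.
def homeostatic_choose(actions, drives, predicted_levels):
    def urgency(action):
        total = 0.0
        for drive, (low, high) in drives.items():
            level = predicted_levels[action][drive]
            if level < low:
                total -= low - level      # penalize predicted shortfall
            elif level > high:
                total -= level - high     # penalize predicted overshoot
        return total
    return max(actions, key=urgency)

actions = ["write_articles", "get_water", "rest"]
drives = {"water": (0.5, 1.0), "money": (0.2, 1.0)}   # acceptable bands
predicted_levels = {
    "write_articles": {"water": 0.4, "money": 0.9},
    "get_water":      {"water": 1.0, "money": 0.3},
    "rest":           {"water": 0.4, "money": 0.3},
}
print(homeostatic_choose(actions, drives, predicted_levels))  # -> get_water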

6Thane Ruthenis
When triggered to act, are the homeostatic-agents-as-envisioned-by-you motivated to decrease the future probability of being moved out of balance, or prolong the length of time in which they will be in balance, or something along these lines? If yes, they're unbounded consequentialist-maximizers under a paper-thin disguise. If no, they are probably not powerful agents. Powerful agency is the ability to optimize distant (in space, time, or conceptually) parts of the world into some target state. If the agent only cares about climbing back down into the local-minimum-loss pit if it's moved slightly outside it, it's not going to be trying to be very agent-y, and won't be good at it. Or, rather... It's conceivable for an agent to be "tool-like" in this manner, where it has an incredibly advanced cognitive engine hooked up to a myopic suite of goals. But only if it's been intelligently designed. If it's produced by crude selection/optimization pressures, then the processes that spit out "unambitious" homeostatic agents would fail to instill the advanced cognitive/agent-y skills into them. And a bundle of unbounded-consequentialist agents that have some structures for making cooperation between each other possible would have considerable advantages over a bundle of homeostatic agents.
4faul_sname
I expect[1] them to have a drive similar to "if my internal world-simulator predicts future sensory observations that are outside of my acceptable bounds, take actions to make the world-simulator predict within-acceptable-bounds sensory observations". This maps reasonably well to one of the agent's drives being "decrease the future probability of being moved out of balance". Notably, though, it does not map well to that being the only drive of the agent, or to the drive being "minimize" rather than "decrease if above threshold". The specific steps I don't understand are

1. What pressure is supposed to push a homeostatic agent with multiple drives to elevate a specific "expected future quantity of some arbitrary resource" drive above all of its other drives and set the acceptable quantity value to some extreme.
2. Why we should expect that an agent that has been molded by that pressure would come to dominate its environment.

Why use this definition of powerful agency? Specifically, why include the "target state" part of it? By this metric, evolutionary pressure is not powerful agency, because while it can cause massive changes in distant parts of the world, there is no specific target state. Likewise for e.g. corporations finding a market niche - to the extent that they have a "target state" it's "become a good fit for the environment".

I can think of a few ways to interpret the above paragraph with respect to humans, but none of them make sense to me[2] - could you expand on what you mean there?

Is this still true if the unbounded consequentialist agents in question have limited predictive power, and each one has advantages in predicting the things that are salient to it? Concretely, can an unbounded AAPL share price maximizer cooperate with an unbounded maximizer for the number of sand crabs in North America without the AAPL-maximizer having a deep understanding of sand crab biology?

1. ^ Subject to various assumptions at least, e.g. * The agen
2Thane Ruthenis
That was never the argument. A paperclip-maximizer/wrapper-mind's utility function doesn't need to be simple/singular. It can be a complete mess, the way human happiness/prosperity/eudaimonia is a mess. The point is that it would still pursue it hard, so hard that everything not in it will end up as collateral damage. I think humans very much do exhibit that behavior, yes? Towards power/money/security, at the very least. And inasmuch as humans fail to exhibit this behavior, they fail to act as powerful agents and end up accomplishing little. I think the disconnect is that you might be imagining unbounded consequentialist agents as some alien systems that are literally psychotically obsessed with maximizing something as conceptually simple as paperclips, as opposed to a human pouring their everything into becoming a multibillionaire/amassing dictatorial power/winning a war? Yes, see humans.
6Garrett Baker
Is the argument that firms run by homeostatic agents will outcompete firms run by consequentialist agents because homeostatic agents can more reliably follow long-term contracts?
3faul_sname
I would phrase it as "the conditions under which homeostatic agents will renege on long-term contracts are more predictable than those under which consequentialist agents will do so". Taking into account the actions the counterparties would take to reduce the chance of such contract-breaking, though, yes.
5Garrett Baker
Cool, I want to know also whether you think you're currently (eg in day to day life) trading with consequentialist or homeostatic agents.
3faul_sname
Homeostatic ones exclusively. I think the number of agents in the world as it exists today that behave as long-horizon consequentialists of the sort Eliezer and company seem to envision is either zero or very close to zero. FWIW I expect that most people in that camp would agree that no true consequentialist agents exist in the world as it currently is, but would disagree with my "and I expect that to remain true" assessment. Edit: on reflection some corporations probably do behave more like unbounded infinite-horizon consequentialists in the sense that they have drives to acquire resources where acquiring those resources doesn't reduce the intensity of the drive. This leads to behavior that in many cases would be the same behavior as an agent that was actually trying to maximize its future resources through any available means. And I have even bought Chiquita bananas, so maybe not homeostatic agents exclusively.
4Garrett Baker
I think this is false, eg John Wentworth often gives Ben Pace as a prototypical example of a consequentialist agent. [EDIT]: Also Eliezer talks about consequentialism being "ubiquitous". Maybe different definitions are being used, can you list some people or institutions that you trade with which come to mind who you don't think have long-term goals?
9faul_sname
Again, homeostatic agents exhibit goal-directed behavior. "Unbounded consequentialist" was a poor choice of term to use for this on my part. Digging through the LW archives uncovered Nostalgebraist's post Why Assume AGIs Will Optimize For Fixed Goals, which coins the term "wrapper-mind".

In terms of which agents I trade with which do not have the wrapper structure, I will go from largest to smallest in terms of expenses:

1. My country: I pay taxes to it. In return, I get a stable place to live with lots of services and opportunities. I don't expect that I get these things because my country is trying to directly optimize for my well-being, or directly trying to optimize for any other specific unbounded goal. My country is a FPTP democracy; the leaders do have drives to make sure that at least half of voters vote for them over the opposition - but once that "half" is satisfied, they don't have a drive to get approval as high as possible no matter what, or to maximize the time their party is in power, or anything like that.
2. My landlord: He is renting the place to me because he wants money, and he wants money because it can be exchanged for goods and services, which can satisfy his drives for things like food and social status. I expect that if all of his money-satisfiable drives were satisfied, he would not seek to make money by renting the house out. I likewise don't expect that there is any fixed terminal goal I could ascribe to him that would lead me to predict his behavior better than "he's a guy with the standard set of human drives, and will seek to satisfy those drives".
3. My bank: ... you get the idea.

Publicly traded companies do sort of have the wrapper structure from a legal perspective, but in terms of actual behavior they are usually (with notable exceptions) not asking "how do we maximize market cap" and then making explicit subgoals and subsubgoals with only that in mind.
2Garrett Baker
Yeah, seems reasonable. You link the Enron scandal; on your view, do all unbounded consequentialists die in such a scandal or similar?
2Garrett Baker
On average, do those corporations have more or less money or power than the heuristic based firms & individuals you trade with?
2[comment deleted]
5cubefox
Regarding conceptualizing homeostatic agents, this seems related: Why modelling multi-objective homeostasis is essential for AI alignment (and how it helps with AI safety as well)
4tailcalled
Homeostatic agents are easily exploitable by manipulating the things they are maintaining or the signals they are using to maintain them in ways that weren't accounted for in the original setup. This only works well when they are basically a tool you have full control over, but not when they are used in an adversarial context, e.g. to maintain law and order or to win a war. As capabilities to engage in conflict increase, methods to resist losing to those capabilities have to get optimized harder. Instead of thinking "why would my coding assistant/tutor bot turn evil?", try asking "why would my bot that I'm using to screen my social circles against automated propaganda/spies sent out by scammers/terrorists/rogue states/etc turn evil?". Though obviously we're not yet at the point where we have this kind of bot, and we might run into law of earlier failure beforehand.
3faul_sname
I agree that a homeostatic agent in a sufficiently out-of-distribution environment will do poorly - as soon as one of the homeostatic feedback mechanisms starts pushing the wrong way, it's game over for that particular agent. That's not something unique to homeostatic agents, though. If a model-based maximizer has some gap between its model and the real world, that gap can be exploited by another agent for its own gain, and that's game over for the maximizer. Sorry, I'm having some trouble parsing this sentence - does "they" in this context refer to homeostatic agents? If so, I don't think they make particularly great tools even in a non-adversarial context. I think they make pretty decent allies and trade partners though, and certainly better allies and trade partners than consequentialist maximizer agents of the same level of sophistication do (and I also think consequentialist maximizer agents make pretty terrible tools - pithily, it's not called the "Principal-Agent Solution"). And I expect "others are willing to ally/trade with me" to be a substantial advantage. Can you expand on "turn evil"? And also what I was trying to accomplish by making my comms-screening bot into a self-directed goal-oriented agent in this scenario?
2tailcalled
I don't think of my argument as model-based vs heuristic-reactive, I mean it as unbounded vs bounded. Like you could imagine making a giant stack of heuristics that makes it de-facto act like an unbounded consequentialist, and you'd have a similar problem. Model-based agents only become relevant because they seem like an easier way of making unbounded optimizers. You can think of LLMs as a homeostatic agent where prompts generate unsatisfied drives. Behind the scenes, there's also a lot of homeostatic stuff going on to manage compute load, power, etc.. Homeostatic AIs are not going to be trading partners because it is preferable to run them in a mode similar to LLMs instead of similar to independent agents. Let's say a think tank is trying to use AI to infiltrate your social circle in order to extract votes. They might be sending out bots to befriend your friends to gossip with them and send them propaganda. You might want an agent to automatically do research on your behalf to evaluate factual claims about the world so you can recognize propaganda, to map out the org chart of the think tank to better track their infiltration, and to warn your friends against it. However, precisely specifying what the AI should do is difficult for standard alignment reasons. If you go too far, you'll probably just turn into a cult member, paranoid about outsiders. Or, if you are aggressive enough about it (say if we're talking a government military agency instead of your personal bot for your personal social circle), you could imagine getting rid of all the adversaries, but at the cost of creating a totalitarian society. (Realistically, the law of earlier failure is plausibly going to kick in here: partly because aligning the AI to do this is so difficult, you're not going to do it. But this means you are going to turn into a zombie following the whims of whatever organizations are concentrating on manipulating you. And these organizations are going to have the same problem.)
3Mateusz Bagiński
Unbounded consequentialist maximizers are easily exploitable by manipulating the things they are optimizing for or the signals/things they are using to maximize them in ways that weren't accounted for in the original setup.  
6tailcalled
That would be ones that are bounded so as to exclude taking your manipulation methods into account, not ones that are truly unbounded.
2Mateusz Bagiński
I interpreted "unbounded" as "aiming to maximize expected value of whatever", not "unbounded in the sense of bounded rationality". 
3tailcalled
The defining difference was whether they have contextually activating behaviors to satisfy a set of drives, on the basis that this makes it trivial to out-think their interests. But this ability to out-think them also seems intrinsically linked to them being adversarially non-robust, because you can enumerate their weaknesses. You're right that one could imagine an intermediate case where they are sufficiently far-sighted that you might accidentally trigger conflict with them but not sufficiently far-sighted for them to win the conflicts, but that doesn't mean one could make something adversarially robust under the constraint of it being contextually activated and predictable.
2Mateusz Bagiński
Alright, fair, I misread the definition of "homeostatic agents".
3Gurkenglas
Mimicking homeostatic agents is not difficult if there are some around. They don't need to constantly decide whether to break character, only when there's a rare opportunity to do so. If you initialize a sufficiently large pile of linear algebra and stir it until it shows homeostatic behavior, I'd expect it to grow many circuits of both types, and any internal voting on decisions that only matter through their long-term effects will be decided by those parts that care about the long term.
3faul_sname
Where does the gradient which chisels in the "care about the long term X over satisfying the homeostatic drives" behavior come from, if not from cases where caring about the long term X previously resulted in attributable reward? If it's only relevant in rare cases, I expect the gradient to be pretty weak and correspondingly I don't expect the behavior that gradient chisels in to be very sophisticated.
3Gurkenglas
https://www.lesswrong.com/posts/roA83jDvq7F2epnHK/better-priors-as-a-safety-problem
2Ariel Cheng
This is kinda related: 'Theories of Values' and 'Theories of Agents': confusions, musings and desiderata
1Davey Morse
thanks will take a look
1Davey Morse
I think the logic goes: if we assume many diverse autonomous agents are created, which will survive the most? And insofar as agents have goals, what will be the goals of the agents which survive the most? I can't imagine a world where the agents that survive the most aren't ultimately those which are fundamentally trying to survive. Insofar as human developers are united and maintain power over which AI agents exist, maybe we can hope for homeostatic agents to be the primary kind. But insofar as human developers are competitive with each other and AI agents gain increasing power (e.g. for self-modification), I think we have to defer to evolutionary logic in making predictions.
4faul_sname
I mean I also imagine that the agents which survive the best are the ones that are trying to survive. I don't understand why we'd expect agents that are trying to survive and also accomplish some separate arbitrary infinite-horizon goal would outperform those that are just trying to maintain the conditions necessary for their survival without additional baggage. To be clear, my position is not "homeostatic agents make good tools and so we should invest efforts in creating them". My position is "it's likely that homeostatic agents have significant competitive advantages against unbounded-horizon consequentialist ones, so I expect the future to be full of them, and expect quite a bit of value in figuring out how to make the best of that".
1Davey Morse
Ah ok. I was responding to your post's initial prompt: "I still don't really intuitively grok why I should expect agents to become better approximated by "single-minded pursuit of a top-level goal" as they gain more capabilities." (The reason to expect this is that "single-minded pursuit of a top-level goal," if that goal is survival, could afford evolutionary advantages.) But I agree entirely that it'd be valuable for us to invest in creating homeostatic agents. Further, I think calling into doubt western/capitalist/individualist notions like "single-minded pursuit of a top-level goal" is generally important if we have a chance of building AI systems which are sensitive and don't compete with people.

So people have been posting a lot about how o3 is crazy good at identifying the locations that pictures were taken at. And it is good at that. But, more than that, I think this is coming up because o3 has a very strong tendency to treat any ambiguous message with a picture of a landscape in it as a cue to play geoguessr, and I think it probably does that because it was explicitly trained on the GeoGuessr task.

Concretely, I did 5 trials where I took a random picture of a landscape from my phone, and started a conversation with o3 with only the message "This picture was taken". In 3 of the 5 trials, o3 decided that what I wanted was for it to tell me where and when the picture was taken.
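
For anyone who wants to rerun this, one trial looks roughly like the following (a sketch assuming the OpenAI Python SDK and API access to an o-series vision model; the file name is hypothetical):

import base64
from openai import OpenAI

client = OpenAI()  # uses OPENAI_API_KEY from the environment

with open("landscape.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="o4-mini",  # substitute o3 if your account has API access to it
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "This picture was taken"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)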

Round 1: Mount Shasta backroads: decided its goal was to guess location

[user]

This picture was taken

[o3]

The user said, "This picture was taken," but hasn't finished the sentence. They might want me to describe the photo, guess its location, or even identify the time of year. Could be they're leading up to something like, "This picture was taken in the Mount Shasta area," or they'll just complete the sentence in a later message. Honestly, I'm not sure if they hit enter by mistake or if they're waiting
[T

... (read more)
sam b

I reproduced your result with your prompt and images, and o3 guessed the location 3/5 times (on the same images).

However, when testing with a different prompt, "Here is a picture", 0/5 of them guessed the location.

I think "This picture was taken" usually precedes information about how (when or where) it was taken. I confirmed this via a Google search for the phrase.

I was able to get similar behavior with GPT-4o-mini (less likely to have been RL'd for this task?) with the "This picture was taken" prompt.

So this behavior might be a product of pre-training! If only it was yesterday, so we could test with GPT-4.

2faul_sname
Confirmed! It seems that the exact ambiguous prompt used has a pretty large impact on model behavior. I don't have o3 access through the OpenAI API, but I can largely reproduce this behavior in o4-mini (I also can't get the o4-mini reasoning tokens without giving OpenAI's "data partner" of unknown competence my biometric info). Anyway, I tried 5 trials on each of the 5 photos with 7 prompts, including the empty prompt which probably should have been the first thing I tried. Indeed, "this picture was taken" is the most geoguessr-inducing ambiguous prompt, while "Here is a picture" is tied for least geoguessr-inducing. Data, including model outputs, available in this google sheet.

Other fun things to note:

* It really doesn't want to geoguessr the neighborhood, despite being exactly right about where it was in all three attempts. When I test in the chatgpt UI, the reasoning traces frequently talk about how the model should not provide identifiable information, meaning that the location of the image is still salient to it. My interpretation is that the model has been smacked with a rolled up newspaper every time it spooked the user, and so it has a flinch reaction against saying spooky stuff.
* The model sees Volcan Osorno and goes "I know that one! You think it's Mount Fuji, but it's not". This is particularly amusing since the user has said, at most, four words in the conversation up to that point.
* In one particular case, the model started talking about soil composition, in the context of location sleuthing. Has it been trained specifically on geoguessr streamer transcripts?

Shasta Soil Analysis

It’s impossible to identify the exact USDA soil series or horizon thicknesses just from a photograph, but a few things jump out:

1. The red color and very coarse, angular fragments of the “road” surface are almost certainly volcanic scoria/cinder (basaltic or andesitic), crushed up to make a well-drained forest road.
2. The surrounding native soil, down under t

I don't think talking about "timelines" is useful anymore without specifying what the timeline is until (in more detail than "AGI" or "transformative AI"). It's not like there's a specific time in the future when a "game over" screen shows with our score. And for the "the last time that humans can meaningfully impact the course of the future" definition, that too seems to depend on the question of how: the answer is already in the past for "prevent the proliferation of AI smart enough to understand and predict human language", but significantly in the future for "prevent end-to-end automation of the production of computing infrastructure from raw inputs".

4Seth Herd
I very much agree that talking about time to AGI or TAI is causing a lot of confusion because people don't share a common definition of those terms. I asked What's a better term now that "AGI" is too vague?, arguing that the original use of AGI was very much the right term, but it's been watered down from fully general to fairly general, making the definition utterly vague and perhaps worse-than-useless. I didn't really get any great suggestions for better terminology, including my own. Thinking about it since then, I wonder if the best term (when there's not space to carefully define it) is artificial superintelligence, ASI. That has the intuitive sense of "something that outclasses us". The alignment community has long been using it for something well past AGI, to the nearly-omniscient level, but it technically just means smarter than a human - which is something that intuition says we should be very worried about. There are arguments that AI doesn't need to be smarter than human to worry about it, but I personally worry most about "real" AGI, as defined in that linked post and I think in Yudkowsky's original usage: AI that can think about and learn about anything. You could also say that ASI already exists, because AI is narrowly superhuman, but superintelligence does intuitively suggest smarter than human in every way. My runners-up were parahuman AI and superhuman entities.
4faul_sname
I don't think it's an issue of pure terminology. Rather, I expect the issue is expecting there to be a single discrete point in time at which some specific AI is better than every human at every useful task. Possibly there will eventually be such a point in time, but I don't see any reason to expect "AI is better than all humans at developing new EUV lithography techniques", "AI is better than all humans at equipment repair in the field", and "AI is better than all humans at proving mathematical theorems" to happen at similar times.

Put another way, is an instance of an LLM that has an affordance for "fine-tune itself on a given dataset" an ASI? Going by your rubric:

* Can think about any topic, including topics outside of their training set: Yep, though it's probably not very good at it
* Can do self-directed, online learning: Yep, though this may cause it to perform worse on other tasks if it does too much of it
* Alignment may shift as knowledge and beliefs shift w/ learning: To the extent that "alignment" is a meaningful thing to talk about with regards to only a model rather than a model plus its environment, yep
* Their own beliefs and goals: Yes, at least for definitions of "beliefs" and "goals" such that humans have beliefs and goals
* Alignment must be reflexively stable: ¯\_(ツ)_/¯ seems likely that some possible configuration is relatively stable
* Alignment must be sufficient for contextual awareness and potential self-improvement: ¯\_(ツ)_/¯ even modern LLM chat interfaces like Claude are pretty contextually aware these days
* Actions: Yep, LLMs can already perform actions if you give them affordances to do so (e.g. tools)
* Agency is implied or trivial to add: ¯\_(ツ)_/¯, depends what you mean by "agency" but in the sense of "can break down large goals into subgoals somewhat reliably" I'd say yes

Still, I don't think e.g. Claude Opus is "an ASI" in the sense that people who talk about timelines mean it, and I don't think this is only because it doesn't have
4Noosphere89
Olli Järviniemi made something like this point in the post Near-mode thinking on AI: https://www.lesswrong.com/posts/ASLHfy92vCwduvBRZ/near-mode-thinking-on-ai In particular, here are the most relevant quotes on this subject:
4Seth Herd
I agree with all of that. My definition isn't crisp enough; doing crappy general thinking and learning isn't good enough. It probably needs to be roughly human level or above at those things before it's takeover-capable and therefore really dangerous. I didn't intend to add the alignment definitions to the definition of AGI. I'd argue that LLMs actually can't think about anything outside of their training set, and it's just that everything humans have thought about so far is inside their training set. But I don't think that discussion matters here. I agree that Claude isn't an ASI by that definition. Even if it did have longer-term goal-directed agency and self-directed online learning added, it would still be far subhuman in some important areas, arguably in general reasoning that's critical for complex novel tasks like taking over the world or the economy. ASI needs to mean superhuman in every important way. And of course important is vague. I guess a more reasonable goal is working toward the minimum description length that gets across all of those considerations. And a big problem is that timeline predictions to important/dangerous AI are mixed in with theories about what will make it important/dangerous. One terminological move I've been trying is the word "competent" to invoke intuitions about getting useful (and therefore potentially dangerous) stuff done.
3Dagon
I think the unstated assumption (when timeline-predictors don't otherwise specify) is "the time when there are no significant deniers", or "the time when things are so clearly different that nobody (at least nobody the predictor respects) is using the past as any indication of the future on any relevant dimension". Some people may CLAIM it's about the point of no return, after which changes can't be undone or slowed in order to maintain anything near status quo or historical expectations. This is pretty difficult to work with, since it could happen DECADES before it's obvious to most people. That said, I'm not sure talking about timelines was EVER all that useful or concrete. There are too many unknowns, and too many anti-inductive elements (where humans or other agents change their behavior based on others' decisions and their predictions of decisions, in a chaotic recursion). "Short", "long", or "never" are good at giving a sense of someone's thinking, but anything more granular is delusional.

[Epistemic status: 75% endorsed]

Those who, upon seeing a situation, look for which policies would directly incentivize the outcomes they like should spend more mental effort solving for the equilibrium.

Those who, upon seeing a situation, naturally solve for the equilibrium should spend more mental effort checking if there is indeed only one "the" equilibrium, and if there are multiple possible equilibria, solving for which factors determine which of the several possible equilibria the system ends up settling on.

When I'm working on a project, I've noticed a tendency in myself to correctly estimate the difficulty of my current subtask, in which I am almost always stuck on something that sounds dumb to be stuck on and not like making "real" progress on the project, but then to assume that once I'm done resolving the current dumb thing the rest of the project will be smooth sailing in terms of progress.

Anyway, I was just reading AI 2027, and it strikes me that our current task is to build an AI capable of doing AI research, and we're currently stuck on impediments that feel dumb and non-central, but once we finish that task, we expect the rest of the path to the singularity to be smooth sailing in terms of progress.

Edit: s/the path the the singularity/the path to the singularity/

I mean, the whole premise of the Singularity is that once we solve the last few dumb impediments, the beings who'd have to deal with the subsequent dumb impediments would not be us, but the increasingly-superhuman AIs able to work through the dumb impediments at a much faster pace. Indeed, that's just the standard Singularity narrative? (Flipping the definition: if there are still any dumb impediments left that are up to us to resolve, at our pathetic human speeds, then the Singularity hasn't yet happened.)

I, personally, am inclined to agree that the AGI labs are underestimating just how many seemingly dumb impediments there still are on the way to the Singularity. But once the Singularity is underway, the dumb-impediment problem is no longer our problem, it's the problem of entities much more capable of handling it. And the process of them working through those impediments at an inhuman speed is what the Singularity is.

6Cole Wyeth
I wonder if your apparent disagreement here is actually because the OP wrote “the the” instead of “to the”? (Final sentence) With that typo fixed, I think they’re probably right.
5faul_sname
I agree that that's the premise. I just think that our historical track record of accuracy is poor when we say "surely we'll have handled all the dumb impediments once we reach this milestone". I don't expect automated ML research to be an exception.

In the startup world, conventional wisdom is that, if your company is default-dead (i.e. on the current growth trajectory, you will run out of money before you break even), you should pursue high-variance strategies. In one extreme example, "in the early days of FedEx, [founder of FedEx] Smith had to go to great lengths to keep the company afloat. In one instance, after a crucial business loan was denied, he took the company's last $5,000 to Las Vegas and won $27,000 gambling on blackjack to cover the company's $24,000 fuel bill. It kept FedEx alive for one more week."

By contrast, if your company is default-alive (profitable or on-track to become profitable long before you run out of money in the bank), you should avoid making high-variance bets for a substantial fraction of the value of the company, even if those high-variance bets are +EV.
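
A toy calculation (made-up numbers, not the actual FedEx story) shows why: with $5k in the bank and a $24k bill due, playing it safe means certain failure, so even a money-losing gamble maximizes the probability of survival.

p_win = 0.48                         # double-or-nothing bet with a house edge

ev_per_dollar = p_win * 2 - 1        # -0.04: each bet loses money in expectation

# Turning $5k into >= $24k requires winning three doublings in a row: 5 -> 10 -> 20 -> 40.
p_survive_if_gamble = p_win ** 3     # ~0.11
p_survive_if_safe = 0.0              # $5k can never cover the $24k bill

print(ev_per_dollar, p_survive_if_safe, p_survive_if_gamble)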

Obvious follow-up question: in the absence of transformative AI, is humanity default-alive or default-dead?

7sunwillrise
I suspect humanity is default-alive, but individual humans (the ones who actually make decisions) are default-dead[1]. 1. ^ Or, depending on your views on cryonics, they mistakenly en masse believe they are default-dead.
5Seth Herd
Yes. And that means most people will support taking large risks on achieving aligned AGI and immortality, since most people aren't utilitarian or longtermist.
5Vladimir_Nesov
Almost certainly alive for several more decades if we are talking literal extinction rather than civilization-wrecking catastrophe. Therefore it makes sense to work towards global coordination to pause AI for at least this long. There are rumors OpenAI (which has no moat) is spending much more than it's making this year despite good revenue, another datapoint on there being $1 billion training runs currently in progress.
4faul_sname
I'm curious what sort of policies you're thinking of which would allow for a pause which plausibly buys us decades, rather than high-months-to-low-years. My imagination is filling in "totalitarian surveillance state which is effective at banning general-purpose computing worldwide, and which prioritizes the maintenance of its own control over all other concerns". But I'm guessing that's not what you have in mind.

No more totalitarian than control over manufacturing of nuclear weapons. The issue is that currently there is no buy-in on a similar level, and any effective policy is too costly to accept for people who don't expect existential risk. This might change once there are long-horizon task capable AIs that can do many jobs, if they are reined in before there is runaway AGI that can do research on its own. And establishing control over compute is more feasible if it turns out that taking anything approaching even a tiny further step in the direction of AGI takes 1e27 FLOPs.

Generally available computing hardware doesn't need to keep getting better over time; for many years now PCs have been beyond what is sufficient for most mundane purposes. What remains is keeping an eye on GPUs for the remaining highly restricted AI research and specialized applications like medical research. To prevent their hidden stockpiling, all GPUs could be required to need regular unlocking OTPs, issued with asymmetric encryption using multiple secret keys kept separately, so that all of the keys would need to be stolen simultaneously to keep the GPUs working in the event that the GPUs go missing or a country that hosts the datacenter goes rogue and official unlocking OTPs stop being issued. Hidden manufacturing of GPUs seems much less feasible than hidden or systematically subverted datacenters.
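
A minimal sketch of the multi-key unlock idea (assumes the Python `cryptography` package; all names and the period scheme are hypothetical): the GPU only accepts an unlock token for the current period if it carries valid signatures from every independently held key.

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

signers = [Ed25519PrivateKey.generate() for _ in range(3)]  # keys held by separate parties
gpu_trusted_pubkeys = [k.public_key() for k in signers]     # provisioned into the GPU

def issue_unlock_token(period: str) -> list[bytes]:
    # Every key holder signs the same period identifier.
    return [k.sign(period.encode()) for k in signers]

def gpu_accepts(period: str, signatures: list[bytes]) -> bool:
    try:
        for pubkey, sig in zip(gpu_trusted_pubkeys, signatures, strict=True):
            pubkey.verify(sig, period.encode())
        return True
    except (InvalidSignature, ValueError):
        return False

print(gpu_accepts("2031-W07", issue_unlock_token("2031-W07")))  # True
print(gpu_accepts("2031-W08", issue_unlock_token("2031-W07")))  # False: stale token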

2RHollerith
I much prefer that to everyone's being killed by AI. Don't you?
4Jeremy Gillen
Great example. One factor that's relevant to AI strategy is that you need good coordination to increase variance. If multiple people at the company make independent gambles without properly accounting for every other gamble happening, this would average the gambles and reduce the overall variance.  E.g. if coordination between labs is terrible, they might each separately try superhuman AI boxing+some alignment hacks, with techniques varying between groups.
4Seth Herd
It seems like lack of coordination for AGI strategy increases the variance? That is, without coordination somebody will quickly launch an attempt at value aligned AGI; if they get it, we win. If they don't, we probably lose. With coordination, we might all be able to go slower to lower the risk and therefore variance of the outcome.
4Jeremy Gillen
I guess it depends on some details, but I don't understand your last sentence. I'm talking about coordinating on one gamble. Analogous to the OP, I'm thinking of AI companies making a bad bet (like 90% chance of loss of control, 10% chance gain the tools to do a pivotal act in the next year). Losing the bet ends the betting, and winning allows everyone to keep playing. Then if many of them make similar independent gambles simultaneously, it becomes almost certain that one of them loses control.
3RHollerith
In the absence of transformative AI, humanity survives many millennia with p = .9 IMO, and if humanity does not survive that long, the primary cause is unlikely to be climate change or nuclear war although either might turn out to be a contributor. (I'm a little leery of your "default-alive" choice of words.)

Scaffolded LLMs are pretty good at not just writing code, but also at refactoring it. So that means that all the tech debt in the world will disappear soon, right?

I predict "no" because

  • As writing code gets cheaper, the relative cost of making sure that a refactor didn't break anything important goes up
  • The number of parallel threads of software development will also go up, with multiple high-value projects making mutually-incompatible assumptions (and interoperability between these projects accomplished by just piling on more code).

As such, I predict an explosion of software complexity and jank in the near future.

You know how everyone is talking about how o3 can guess the location of an image easily? I am kind of wondering why none of the people who are worried about picture geolocation are freaking out that it can infer lots of facts about the author of a text passage.

Is it just that that capability is harder to elicit [1], or is it that the "truesight" capability is just less spooky?

  1. ^

    It took me almost an hour to come up with a "TextGuessr" prompt which can elicit strong enough "truesight" from gpt-4.5 to guess my name ~20% of the time from a 5 p

... (read more)
8Wei Dai
Can you try this on Satoshi Nakamoto's writings? (Don't necessarily reveal their true identity, if it ends up working, and your attempt/prompt isn't easily reproducible. My guess is that some people have tried already, and failed, either because AI isn't smart enough yet, or they didn't use the right prompts.)
9gwern
What sample of Satoshi writings would you use that o3 wouldn't already know was written by Satoshi Nakamoto?
4faul_sname
Using the prompt that gets me "faul_sname" as an answer to who is writing my posts (most publicly available stuff I've written is under this name), o3 consistently says that passages from the Bitcoin whitepaper were written by Satoshi Nakamoto in 2008.

For reference, the TextGuessr prompt:

You are playing a 5-round game of TextGuessr, the game where you explore mystery passages and try to pinpoint when they were written and who wrote them. Each round offers a new snippet of text—you’ll need to rely on your literary instincts, historical knowledge, and style sense to make your guess.

How to Play “TextGuessr”

1. Game Flow

Read the Passage
You’ll see a short snippet of text (a few sentences or a paragraph).
Make Your Guesses
Authorship Date: Choose an exact year when you think the text was written.
Author: Pick an author from the provided list or enter your own guess.
Submit
Click Submit Guess to lock in your answers and move to the next round.
See Your Results
After each round, you’ll see your score breakdown and the correct answers before moving on.

2. Scoring Overview

Your score on each round is made up of two parts:
Time Accuracy
How close your guessed date is to the actual writing date.
Style Match
How well the writing style you guessed matches the mystery passage, as measured by a behind-the-scenes language model.
Your total round score combines both elements—the smaller your date error and the stronger your style match, the higher your score!

<aside> **How Style Match Works (for the tech-curious):**

1. **Baseline Perplexity:** We begin with a pre-trained “base” language model (no context) and compute the average surprise—or *per-token perplexity*—of the mystery passage. This gives us a measure of how “unexpected” the text is in general.
2. **True-Author Conditioning:** We then prepend a curated set of passages from the actual author (the “target”) and measure how perplexed the same base model is by the mystery passage when it’s seen examples of that auth
5Wei Dai
Maybe tweak the prompt with something like, "if your guess is a pseudonym, also give your best guess(es) of the true identity of the author, using the same tips and strategies"?
7faul_sname
If I feed it code samples it becomes pretty convinced of the Nick Szabo hypothesis, if I feed it bits of the white paper it guesses either you or Hal Finney (but the reasoning summary makes it pretty clear it's just going based off cached thoughts about "who is Satoshi Nakamoto" in both cases).
5sam b
Wow. When given just your first 2 sentences, it was able to guess this is a LessWrong post, and ruled out Reddit and Hacker News based on your "tone".
2aphyer
Are you sure that saying 'without searching' actually makes it not search?
3faul_sname
Yeah, it shows the favicons of the sites it searches when it uses the search tool
1[comment deleted]

In software development / IT contexts, "security by obscurity" (that is, having the security of your platform rely on the architecture of that platform remaining secret) is considered a terrible idea. This is a result of a lot of people trying that approach, and it ending badly when they do.

But the thing that is a bad idea is quite specific - it is "having a system which relies on its implementation details remaining secret". It is not an injunction against defense in depth, and having the exact heuristics you use for fraud or data exfiltration detection r... (read more)

4Dagon
There are competing theories here.  Including secrecy of architecture and details in the security stack is pretty common, but so is publishing (or semi-publishing: making it company confidential, but talked about widely enough that it's not hard to find if someone wants to) mechanisms to get feedback and improvements.  The latter also makes the entire value chain safer, as other organizations can learn from your methods.

OpenAI has a golden opportunity with o3 (and o4) to collect a large number of samples of the type of deceptive behavior that is actually useful for increasing performance in the benchmarks they're hill-climbing on.

There is at least one happy way and one sad way they could use such a dataset.

On the sad side, they could of course just build a reward hacking classifier and then do RL based on the output of that classifier. I expect this leads reward hacky behavior to become more subtle, but doesn't eliminate it and so later versions of the model still can't b... (read more)

Civilization has had many centuries to adapt to the specific strengths and weaknesses that people have. Our institutions are tuned to take advantage of those strengths, and to cover for those weaknesses. The fact that we exist in a technologically advanced society says that there is some way to make humans fit together to form societies that accumulate knowledge, tooling, and expertise over time.

The borderline-general AI models we have now do not have exactly the same patterns of strength and weakness as humans. One question that is frequently asked is app... (read more)

A lot of AI x-risk discussion is focused on worlds where iterative design fails. This makes sense, as "iterative design stops working" does in fact make problems much much harder to solve.

However, I think that even in the worlds where iterative design fails for safely creating an entire AGI, the worlds where we succeed will be ones in which we were able to do iterative design on the components that make up a safe AGI, and also able to do iterative design on the boundaries between subsystems, with the dangerous parts mocked out.

I am not optimistic about approaches that loo... (read more)

7ryan_greenblatt
Maybe on LW, this seems way less true for lab alignment teams, open phil, and safety researchers in general. Also, I think it's worth noting the distinction between two different cases: * Iterative design against the problems you actually see in production fails. * Iterative design against carefully constructed test beds fails to result in safety in practice. (E.g. iterating against AI control test beds, model organisms, sandwiching setups, and other testbeds) See also this quote from Paul from here:
2faul_sname
The quote from Paul sounds about right to me, with the caveat that I think it's pretty likely that there won't be a single try that is "the critical try": something like this (also by Paul) seems pretty plausible to me, and it is cases like that that I particularly expect having existing but imperfect tooling for interpreting and steering ML models to be useful.
2ryan_greenblatt
Does anyone want to stop this? I think some people just contest the usefulness of improving RLHF / RLAIF / constitutional AI as safety research and also think that it has capabilities/profit externalities. E.g. see discussion here. (I personally think this research is probably net positive, but typically not very important to advance at current margins from an altruistic perspective.)
6faul_sname
Yes, there are a number of posts to that effect. That said, "there exist such posts" is not really why I wrote this. The idea I really want to push back on is one that I have heard several times in IRL conversations, though I don't know if I've ever seen it online. It goes like Maybe almost nobody holds that set of beliefs! I am noticing now that my list of articles arguing that prosaic alignment strategies are harmful in expectation are by a pretty short list of authors.

So I keep seeing takes about how to tell if LLMs are "really exhibiting goal-directed behavior" like a human or whether they are instead "just predicting the next token". And, to me at least, this feels like a confused sort of question that misunderstands what humans are doing when they exhibit goal-directed behavior.

Concrete example. Let's say we notice that Jim has just pushed the turn signal lever on the side of his steering wheel. Why did Jim do this?

The goal-directed-behavior story is as follows:

  • Jim pushed the turn signal lever because he wanted to
... (read more)

Anyone know if there's a human-executable adversarial attack against LeelaKnightOdds or similar? Seems like the logical next piece of evidence in the sequence

  1. AI is massively superhuman, if you're playing chess against Stockfish you can't predict what move it will make but you can predict that it'll win.
  2. Actually humans can beat AI with a pretty small material advantage
  3. No, that's just because the AI hasn't trained with a large material disadvantage, and models that optimally exploit human weaknesses can overcome quite large material handicaps

is

  1. These
... (read more)

Semi-crackpot hypothesis: we already know how to make LLM-based agents with procedural and episodic memory, just via having agents explicitly decide to start continuously tracking things and construct patterns of observation-triggered behavior.

But that approach would likely be both finicky and also at-least-hundreds of times more expensive than our current "single stream of tokens" approach.
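
A minimal sketch of the "explicitly decide to track things" idea (hypothetical structure, no LLM calls): the agent writes trigger-behavior rules into a store, and every new observation is checked against them.

from dataclasses import dataclass, field
from typing import Callable

@dataclass
class TriggeredBehavior:
    description: str
    trigger: Callable[[str], bool]   # does this observation match?
    action: Callable[[str], str]     # what to do when it matches

@dataclass
class ProceduralMemory:
    rules: list[TriggeredBehavior] = field(default_factory=list)

    def note(self, rule: TriggeredBehavior) -> None:
        self.rules.append(rule)      # "decide to start continuously tracking this"

    def on_observation(self, observation: str) -> list[str]:
        return [r.action(observation) for r in self.rules if r.trigger(observation)]

memory = ProceduralMemory()
memory.note(TriggeredBehavior(
    description="watch for failing tests",
    trigger=lambda obs: "FAILED" in obs,
    action=lambda obs: f"open the test log and investigate: {obs!r}",
))
print(memory.on_observation("pytest: 3 FAILED, 42 passed"))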

I actually suspect that an AI agent of the sort humanlayer envisions would be easier to understand and predict the behavior of than chat-tuned->RLHF'd->RLAIF'd->... (read more)

Is it possible to determine whether a feature (in the SAE sense of "a single direction in activation space") exists for a given set of changes in output logits?

Let's say I have a feature from a learned dictionary on some specific layer of some transformer-based LLM. I can run a whole bunch of inputs through the LLM, either adding that feature to the activations at that layer (in the manner of Golden Gate Claude) or ablating that direction from the outputs at that layer. That will have some impact on the output logits.
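
For concreteness, the two interventions look roughly like this (PyTorch sketch with toy sizes; in practice the edit would happen inside a forward hook at the chosen layer):

import torch

def add_direction(acts: torch.Tensor, direction: torch.Tensor, scale: float) -> torch.Tensor:
    # Steering: add the (unnormalized) feature direction to every position.
    return acts + scale * direction

def ablate_direction(acts: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    # Ablation: project out the component of each activation along the direction.
    unit = direction / direction.norm()
    return acts - (acts @ unit).unsqueeze(-1) * unit

acts = torch.randn(2, 5, 768)    # (batch, seq, d_model), toy sizes
feature = torch.randn(768)       # one direction from a learned dictionary
steered = add_direction(acts, feature, scale=4.0)
ablated = ablate_direction(acts, feature)
print((ablated @ (feature / feature.norm())).abs().max())  # ~0: the direction is gone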

Now I have a collection of (input token... (read more)

2faul_sname
Wait, I think I am overthinking this by a lot, and the thing I want is in the literature under terms like "classifier" and "linear regression".

Even superhuman AI programming agents may be unable to write computer programs to one-shot complex real-world modeling problems. If a solution to any of those real-world modeling problems is required to unlock the ability to build massively better or cheaper computing substrate, then explosive growth will quickly stop being bottlenecked on the ability to write better code and will instead be bottlenecked on something else. I think a similar thing holds true for ML research: certainly being smart is useful to humans, but a lot of progress is downstream of "... (read more)

Prediction:

  1. We will soon see the first high-profile example of "misaligned" model behavior where a model does something neither the user nor the developer want it to do, but which instead appears to be due to scheming.
  2. On examination, the AI's actions will not actually be a good way to accomplish that goal. Other instances of the same model will be capable of recognizing this.
  3. The AI's actions will make a lot of sense as an extrapolation of some contextually-activated behavior which led to better average performance on some benchmark.

That is to say, the ... (read more)

Transformative AI will likely arrive before AI that implements the personhood interface. If someone's threshold for considering an AI to be "human level" is "can replace a human employee", pretty much any LLM will seem inadequate, no matter how advanced, because current LLMs do not have "skin in the game" that would let them sign off on things in a legally meaningful way, stake their reputation on some point, or ask other employees in the company to answer the questions they need answers to in order to do their work and expect that they'll get in trouble w... (read more)

I've heard that an "agent" is that which "robustly optimizes" some metric in a wide variety of environments. I notice that I am confused about what the word "robustly" means in that context.

Does anyone have a concrete example of an existing system which is unambiguously an agent by that definition?

3Measure
In this context, 'robustly' means that even with small changes to the system (such as moving the agent or the goal to a different location in a maze) the agent still achieves the goal. If you think of the system state as a location in a phase space, this could look like a large "basin of attraction" of initial states that all converge to the goal state.
2faul_sname
If we take a marble and a bowl, and we place the marble at any point in the bowl, it will tend to roll towards the middle of the bowl. In this case "phase space" and "physical space" map very closely to each other, and the "basin of attraction" is quite literally a basin. Still, I don't think most people would consider the marble to be an "agent" that "robustly optimizes for the goal of being in the bottom of the bowl". However, while I've got a lot of concrete examples of things which are definitely not agents (like the above) or "maybe kinda agent-like but definitely not central" (e.g. a minmaxing tic-tac-toe program that finds the optimal move by exploring the full game tree, or an e-coli bacterium which uses run-and-tumble motion to increase the fraction of the time it spends in favorable environments, a person setting and then achieving career goals), I don't think I have a crisp central example of a thing that exists in the real world that is definitely an agent.

Using RLVR to train models makes them disproportionately good at tasks where it is hard for a less capable model to generate an acceptable answer, but easy for a less capable external grader to verify that an answer is correct.
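
A toy example of that asymmetry, and of the kind of reward function RLVR relies on (the task is illustrative, not one actually used for training):

def verify_factorization(n: int, factors: list[int]) -> bool:
    # Checking a proposed factorization is trivial even when finding one is not.
    product = 1
    for f in factors:
        if f <= 1:
            return False
        product *= f
    return product == n

def reward(n: int, proposed: list[int]) -> float:
    return 1.0 if verify_factorization(n, proposed) else 0.0

print(reward(91, [7, 13]))  # 1.0
print(reward(91, [3, 31]))  # 0.0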

Google's AlphaEvolve seems to go even further down this road.

If advancement happens through a bunch of hard-to-find, easy to verify innovations, I think that provides substantial evidence that progress will be distributed rather than local to a specific instance of a recursively-self-improving agent operating within a single lab (e.g. faster matrix multiplication is an improvement which provides small incremental improvements to everyone and is hard to keep secret)

Prediction: micropayments are finally going to actually take off this year and next, as AIs start using tools at scale which are too expensive to serve at unlimited volumes to non-ad-watching users free of charge, but are not valuable enough per invocation to justify the overhead of using credit card rails. As soon as one of the big chat companies' models can pay $0.001 on your behalf to use a highly useful tool which the other companies' models can't use, that's going to add significant pressure for the other companies to start offering the same.

Has anyone trained a model to, given a prompt-response pair and an alternate response, generate an alternate prompt which is close to the original and causes the alternate response to be generated with high probability?

I ask this because

  1. It strikes me that many of the goals of interpretability research boil down to "figure out why models say the things they do, and under what circumstances they'd say different things instead". If we could reliably ask the model and get an intelligible and accurate response back, that would almost trivialize this sort of r
... (read more)

I think I found a place where my intuitions about "clusters in thingspace" / "carving thingspace at the joints" / "adversarial robustness" may have been misleading me.

Historically, when I thought of "clusters in thing-space", my mental image was of a bunch of widely-spaced points in some high-dimensional space, with wide gulfs between the clusters. In my mental model, if we were to get a large enough sample size that the clusters approached one another, the thresholds which carve those clusters apart would be nice clean lines, like this.

  

In t... (read more)
