Quick Takes

Brief intro/overview of the technical AGI alignment problem as I see it:

To a first approximation, there are two stable attractor states that an AGI project, and perhaps humanity more generally, can end up in, as weak AGI systems become stronger towards superintelligence, and as more and more of the R&D process – and the datacenter security system, and the strategic advice on which the project depends – is handed over to smarter and smarter AIs.

In the first attractor state, the AIs are aligned to their human principals and becoming more aligned day by d... (read more)

I think this framing probably undersells the diversity within each category, and the extent of human agency or mere noise that can jump you from one category to another.

Probably the biggest dimension of diversity is how much the AI is internally modeling the whole problem and acting based on that model, versus how much it's acting in feedback loops with humans. In the good category you describe it as acting more in feedback loops with humans, while in the bad category you describe it more as internally modeling the whole problem, but I think all quadrants ... (read more)

4Cole Wyeth
This seems like a pretty clear and convincing framing to me, not sure I've seen it expressed this way before. Good job!

Implications of DeepSeek-R1: Yesterday, DeepSeek released a paper on their o1 alternative, R1. A few implications stood out to me: 

  • Reasoning is easy. A few weeks ago, I described several hypotheses for how o1 works. R1 suggests the answer might be the simplest possible approach: guess & check. No need for fancy process reward models, no need for MCTS. (A toy sketch of this loop appears after this list.)
  • Small models, big think. A distilled 7B-parameter version of R1 beats GPT-4o and Claude 3.5 Sonnet (new) on several hard math benchmarks. There appears to be a large parameter overhang.
  • Proliferation by
... (read more)
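To make the "guess & check" point concrete, here is a toy sketch (my own illustration under simplifying assumptions, not DeepSeek's actual training code): sample several answers per problem, score them with a binary correctness check, and do a simple policy-gradient update toward the samples that pass, using a group-mean baseline in the spirit of GRPO. No process reward model, no search.

```python
# Toy "guess & check" RL loop (illustrative sketch only, not DeepSeek's code).
import math
import random

def verifier(problem, answer):
    """Binary outcome reward: 1 if the final answer checks out, else 0."""
    return 1.0 if answer == problem["target"] else 0.0

class ToyPolicy:
    """Stand-in for an LLM: a softmax over a tiny answer vocabulary."""
    def __init__(self, vocab):
        self.vocab = vocab
        self.logits = {a: 0.0 for a in vocab}

    def sample(self):
        weights = [math.exp(self.logits[a]) for a in self.vocab]
        return random.choices(self.vocab, weights=weights, k=1)[0]

    def reinforce(self, answer, advantage, lr=0.5):
        """REINFORCE-style update: push probability toward rewarded samples."""
        self.logits[answer] += lr * advantage

problem = {"question": "2 + 2", "target": 4}
policy = ToyPolicy(vocab=[3, 4, 5])

for step in range(50):
    samples = [policy.sample() for _ in range(8)]       # guess
    rewards = [verifier(problem, a) for a in samples]   # check
    baseline = sum(rewards) / len(rewards)              # group-mean baseline
    for a, r in zip(samples, rewards):
        policy.reinforce(a, r - baseline)

print(policy.logits)  # the logit for the correct answer (4) should dominate
```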

This was my understanding pre-r1. Certainly this seems to be the case with the o1 models: better at code and math, not better at philosophy and creative writing.

But something is up with r1. It is unusually good at creative writing. It doesn't seem spiky in the way that I predicted.

I notice I am confused.

Possible explanation: r1 seems to have less restrictive 'guardrails' added using post-training. Perhaps this 'light hand at the tiller' results in not post-training it towards mode-collapse. It's closer to a raw base model than the o1 models.

This is just a hypothesis. There are many unknowns to be investigated.

13Lucius Bushnaq
This line caught my eye while reading. I don't know much about RL on LLMs, is this a common failure mode these days? If so, does anyone know what such reward hacks tend to look like in practice? 
4Burny
No MCTS, no PRM... scaling up CoT with simple RL and scalar rewards... emergent behaviour
meemi26778

FrontierMath was funded by OpenAI.[1]

The communication about this has been non-transparent, and many people, including contractors working on this dataset, have not been aware of this connection. Thanks to 7vik for their contribution to this post.

Before Dec 20th (the day OpenAI announced o3) there was no public communication about OpenAI funding this benchmark. Previous arXiv versions v1-v4 do not acknowledge OpenAI for their support. This support was made public on Dec 20th.[1]

Because the arXiv version mentioning OpenAI's contribution came out right after o... (read more)


I've known Jaime for about ten years. Seems like he made an arguably wrong call when first dealing with real powaah, but overall I'm confident his heart is in the right place.

10leogao
i've changed my mind and been convinced that it's kind of a big deal that frontiermath was framed as something that nobody would have access to for hillclimbing when in fact openai would have access and other labs wouldn't. the undisclosed funding before o3 launch still seems relatively minor though
4agucova
I was the original commenter on HN, and while my opinion on this particular claim is weaker now, I do think for OpenAI, a mix of PR considerations, employee discomfort (incl. whistleblower risk), and internal privacy restrictions make it somewhat unlikely to happen (at least 2:1?). My opinion has become weaker because OpenAI seems to be internally a mess right now, and I could imagine scenarios where management very aggressively pushes and convinces employees to employ these more "aggressive" tactics.
Siebe30

This might be a stupid question, but has anyone considered just flooding LLM training data with large amounts of (first-person?) short stories of desirable ASI behavior?

The way I imagine this to work is basically that an AI agent would develop really strong intuitions that "that's just what ASIs do". It might prevent it from properly modelling other agents that aren't trained on this, but it's not obvious to me that that's going to happen or that it's such a decisively bad thing to outweigh the positives

Is CoT faithfulness already obsolete? How does it survive concepts like latent-space reasoning, or RL-based manipulations (R1-Zero)? Is it realistic to think that these highly competitive companies will simply not use them and ignore the compute-efficiency gains?

I think CoT faithfulness was a goal, a hope, that had yet to be realized. People were assuming it was there in many cases when it wasn't.

You can see the cracks showing in many places. For example, editing the CoT to be incorrect and noting that the model still produces the same correct answer. Or observing situations where the CoT was incorrect to begin with, and yet the answer was correct.
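The "edit the CoT" test can be run with a few lines of glue code. A minimal sketch (my illustration, with a placeholder `generate` function standing in for whatever model API you use):

```python
# Corrupt a step in the chain of thought, re-run the model conditioned on the
# corrupted CoT, and check whether the final answer changes. If the answer
# survives obvious corruption, the visible CoT probably wasn't load-bearing.

def generate(prompt: str) -> str:
    """Placeholder for a real model call (swap in your API client of choice)."""
    # Canned behavior for the demo: ignore the reasoning and always answer "12".
    return "12"

def corrupt(cot: str) -> str:
    """Introduce an obvious arithmetic error into the chain of thought."""
    return cot.replace("4 * 3 = 12", "4 * 3 = 13")

question = "What is 4 * 3?"
original_cot = "We need 4 * 3. 4 * 3 = 12."
corrupted_cot = corrupt(original_cot)

answer_original = generate(f"{question}\nReasoning: {original_cot}\nAnswer:")
answer_corrupted = generate(f"{question}\nReasoning: {corrupted_cot}\nAnswer:")

# A model that actually follows its stated reasoning should now answer "13";
# an unchanged answer is evidence the CoT was post-hoc rather than causal.
print(answer_original, answer_corrupted, "changed:", answer_original != answer_corrupted)
```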

Those "private scratchpads"? Really? How sure are you that the model was "fooled" by them? What evidence do you have that this is the case? I think the default assumption ha... (read more)

Given ambiguity about whether GitHub trains models on private repos, I wonder if there's demand for someone to host a public GitLab (or similar) instance that forbids training models on their repos, and takes appropriate countermeasures against training data web scrapers accessing their public content.

Yeah, for years I've been kinda shocked at how lax the security around private GitHub repos is. Code is becoming a thing that can look innocent but be upstream of a general-purpose tool capable of producing recipes for novel weapons of mass destruction... Yeah. We really gotta step up security.

Who is aligning LessWrong? As LessWrong becomes more popularized due to AI growth, I'm concerned that the quality of LessWrong discussion and posts has decreased, since creating and posting have no filter. Obviously no filter was a benefit while LessWrong was a hidden gem, only visible to those who could see its value. But as it becomes more popular, I think it should be obvious this site would drop in value if it trended towards Reddit. Ideally existing users prevent that, but obviously that will tend to drift if new users can just show up. Are there methods... (read more)

8Milan W
False. There is a filter for content submitted by new accounts.
Viliam60

Thanks for reminder! I looked at the rejected posts, and... ouch, it hurts.

LLM-generated content, crackpottery, low-content posts (could have been one sentence, but are several pages instead).

2Legionnaire
Well that puts my concern to rest. Thanks!
leogao126

don't worry too much about doing things right the first time. if the results are very promising, the cost of having to redo it won't hurt nearly as much as you think it will. but if you put it off because you don't know exactly how to do it right, then you might never get around to it.

Sometimes I wonder if people who obsess over the "paradox of free will" are having some "universal human experience" that I am missing out on. It has never seemed intuitively paradoxical to me, and all of the arguments about it seem either obvious or totally alien. Learning more about agency has illuminated some of the structure of decision making for me, but hasn't really affected this (apparently) fundamental inferential gap. Do some people really have this overwhelming gut feeling of free will that makes it repulsive to accept a lawful universe?

Viliam20

This might be related to whether you see yourself as a part of the universe, or as an observer. If you are an observer, the objection is like "if I watch a movie, everything in the movie follows the script, but I am outside the movie, therefore outside the influence of the script".

If you are religious, I guess your body is a part of the universe (obeys the laws of gravity etc.), but your soul is the impartial observer. Here the religion basically codifies the existing human intuitions.

It might also depend on how much you are aware of the effects of your en... (read more)

12Lucius Bushnaq
I used to, as a child. I did accept a lawful universe, but I thought my perception of free will was in tension with that, so that perception must be "an illusion".

My mother kept trying to explain to me that there was no tension between these things, because it was correct that my mind made its own decisions rather than some outside force. I didn't understand what she was saying though. I thought she was just redefining 'free will' from a claim that human brains effectively had a magical ability to spontaneously ignore the laws of physics to a boring tautological claim that human decisions are made by humans rather than something else.

I changed my mind on this as a teenager. I don't quite remember how, it might have been the sequences or HPMOR again. I realised that my imagination had still been partially conceptualising the "laws of physics" as some sort of outside force, a set of strings pulling my atoms around, rather than as a predictive description of me and the universe. Saying "the laws of physics make my decisions, not me" made about as much sense as saying "my fingers didn't move, my hand did." That was what my mother had been trying to tell me.
3ProgramCrafter
I don't think so, as I had success explaining away the paradox with the concept of "different levels of detail" - saying that free will is a very high-level concept and further observations reveal a lower-level view, calling upon an analogy with the segment tree from algorithmic programming. (A segment tree is a data structure that replaces an array, allowing one to modify its values and compute a given function over all array elements efficiently. It is based on a tree of nodes, each of them representing a certain subarray; each position is therefore handled by several - specifically, O(log n) - nodes.)
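For readers unfamiliar with the analogy, here is a minimal segment tree in Python (my own sketch, with hypothetical names): each array position is covered by O(log n) internal nodes, each summarizing a subarray with an associative function, so "the same array" exists simultaneously at several levels of detail.

```python
class SegmentTree:
    """Minimal iterative segment tree: point updates and range queries in O(log n)."""

    def __init__(self, values, combine=lambda a, b: a + b, identity=0):
        self.n = len(values)
        self.combine = combine
        self.identity = identity
        self.tree = [identity] * (2 * self.n)
        self.tree[self.n:] = values            # leaves sit at indices n..2n-1
        for i in range(self.n - 1, 0, -1):     # internal nodes summarize their children
            self.tree[i] = combine(self.tree[2 * i], self.tree[2 * i + 1])

    def update(self, i, value):
        """Set values[i] = value, refreshing the O(log n) nodes covering position i."""
        i += self.n
        self.tree[i] = value
        while i > 1:
            i //= 2
            self.tree[i] = self.combine(self.tree[2 * i], self.tree[2 * i + 1])

    def query(self, lo, hi):
        """Combine values[lo:hi] using O(log n) node lookups."""
        res = self.identity
        lo += self.n
        hi += self.n
        while lo < hi:
            if lo % 2 == 1:
                res = self.combine(res, self.tree[lo])
                lo += 1
            if hi % 2 == 1:
                hi -= 1
                res = self.combine(res, self.tree[hi])
            lo //= 2
            hi //= 2
        return res

st = SegmentTree([1, 2, 3, 4, 5])
print(st.query(1, 4))   # 2 + 3 + 4 = 9
st.update(2, 10)
print(st.query(1, 4))   # 2 + 10 + 4 = 16
```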

How can you mimic the decision making of someone 'smarter' or at least with more know-how than you if... you... don't know-how?

Wearing purple clothes like Prince, getting his haircut, playing a 'love symbol guitar' and other superficialities won't make me as great a performer as he was, because the tail doesn't wag the dog.

Similarly if I wanted to write songs like him, using the same drum machines, writing lyrics with "2" and "U" and "4" and loading them with Christian allusions and sexual imagery, I'd be lucky if I'm perceptive enough as a mimic to produc... (read more)

Viliam40

That reminds me of NLP (the pseudoscience) "modeling", so I checked briefly if they have any useful advice, but it seems to be at the level of "draw the circle; draw the rest of the fucking owl". They say you should:

  • observe the person
    • that is, imagine being in their skin, seeing through their eyes, etc.
    • observe their physiology (this, according to NLP, magically gives you unparalleled insights)
    • ...and I guess now you became a copy of that person, and can do everything they can do
  • find the difference that makes the difference
    • test all individual steps in your be
... (read more)

The universe has many conserved and approximately-conserved quantities, yet among them energy feels "special" to me. Some speculations why:

  • The sun bombards the earth with a steady stream of free energy, which then leaves back out into the night.
  • Time-evolution is determined by a 90-degree rotation of energy (Schrödinger equation/Hamiltonian mechanics; see the equation after this list).
  • Breaking a system down into smaller components primarily requires energy.
  • While aspects of thermodynamics could apply to many conserved quantities, we usually apply it to energy only, and it was first discovered in the c
... (read more)
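For reference, the "90-degree rotation" in the second bullet is the factor of $i$ that makes energy (the Hamiltonian) the generator of time evolution:

$$ i\hbar\,\frac{\partial}{\partial t}\,|\psi(t)\rangle = \hat{H}\,|\psi(t)\rangle \quad\Longrightarrow\quad |\psi(t)\rangle = e^{-i\hat{H}t/\hbar}\,|\psi(0)\rangle. $$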

Like why are time translations so much more important for our general work than space translations?

I'd imagine that happens because we are able to coordinate our work across time (essentially, execute some actions), while work coordination across space-separated instances is much harder (now, it is part of IT's domain under the name of "scalability").

1jacob_drori
Ah, so I think you're saying "You've explained to me the precise reason why energy and momentum (i.e. time and space) are different at the fundamental level, but why does this lead to the differences we observe between energy and momentum (time and space) at the macro-level?"

This is a great question, and as with any question of the form "why does this property emerge from these basic rules", there's unlikely to be a short answer. E.g. if you said "given our understanding of the standard model, explain how a cell works", I'd have to reply "uhh, get out a pen and paper and get ready to churn through equations for several decades".

In this case, one might be able to point to a few key points that tell the rough story. You'd want to look at properties of solutions of PDEs on manifolds with a metric of signature (1,3) (which means "one direction on the manifold is different to the other three, in that it carries a minus sign in the metric"). I imagine that, generically, these solutions behave differently with respect to the "1" direction and the "3" directions. These differences will lead to the rest of the emergent differences between space and time. Sorry I can't be more specific!
2tailcalled
Why assume a reductionistic explanation, rather than a macroscopic explanation? Like for instance the second law of thermodynamics is well-explained by the past hypothesis but not at all explained by churning through mechanistic equations. This seems in some ways to have a similar vibe to the second law.

The CDC and other Federal agencies are not reporting updates. "It was not clear from the guidance given by the new administration whether the directive will affect more urgent communications, such as foodborne disease outbreaks, drug approvals and new bird flu cases."

Here's an underrated frame for why AI alignment is likely necessary for the future to go very well under human values, even though in our current society we don't need human-to-human alignment to make modern capitalism good, and can rely on selfishness instead.

The reason is that there's a real likelihood that human labor, and more generally human existence, will not be economically valuable, or will even have negative economic value, say where adding a human to an AI company makes that company worse off in the near future.

The reason this matters is that once... (read more)


In retrospect, I was a bit too optimistic about this working out. A big part of why is that I didn't truly grasp how deep value conflicts can be even amongst humans, and I'm now much more skeptical of multi-alignment schemes working. I believe a lot of alignment among humans exists broadly because people are powerless relative to the state; but when AI is good enough for people to create their own nation-states, value conflicts become much more practical, and the basis for a lot of cooperative behavior collapses:

and also means the level of alignment of AI needs to b

... (read more)
7Dakara
I've asked this question to others, but would like to know your perspective (because our conversations with you have been genuinely illuminating for me). I'd be really interested in knowing your views on more of a control-by-power-hungry-humans side of AI risk. For example, the first company to create intent-aligned AGI would be wielding incredible power over the rest of us. I don't think we could trust any of the current leading AI labs to use that power fairly. I don't think this lab would voluntarily decide to give up control over AGI either (intuitively, it would take quite something for anyone to give up such a source of power). Can this issue be somehow resolved by humans? Are there any plans (or at least hopeful plans) for such a scenario to prevent really bad outcomes?
2Noosphere89
This is my main risk scenario nowadays, though I don't really like calling it an existential risk, because the billionaires can survive and spread across the universe, so some humans would survive. The solution to this problem is fundamentally political, and probably requires massive reforms of both the government and the economy that I don't know yet. I wish more people worked on this.

I wrote this for someone but maybe it's helpful for others

What labs should do:

  • I think the most important things for a relatively responsible company are control and security. (For irresponsible companies, I roughly want them to make a great RSP and thus become a responsible company.)
    • Reading recommendations for people like you (not a control expert, but with enough context to mostly understand the Greenblatt plan):
    • Control: Redwood blogposts[1] or ask a Redwood human "what's the threat model" and "what are the most promising control techniques"
    • Security: not
... (read more)
ryan_greenblattΩ285724

Sometimes people think of "software-only singularity" as an important category of ways AI could go. A software-only singularity can roughly be defined as when you get increasing-returns growth (hyper-exponential) just via the mechanism of AIs increasing the labor input to AI capabilities software[1] R&D (i.e., keeping fixed the compute input to AI capabilities).

While the software-only singularity dynamic is an important part of my model, I often find it useful to more directly consider the outcome that software-only singularity might cause: the feasibi... (read more)


I'm citing the polls from Daniel + what I've heard from random people + my guesses.

4Lukas Finnveden
Interesting comparison point: Tom thought this would give a way larger boost in his old software-only singularity appendix. When considering an "efficiency only singularity", some different estimates get him r~=1; r~=1.5; r~=1.6. (Where r is defined so that "for each x% increase in cumulative R&D inputs, the output metric will increase by r*x". The condition for increasing returns is r>1.) Whereas when including capability improvements: Though note that later in the appendix he adjusts down from 85% to 65% due to some further considerations. Also, last I heard, Tom was more like 25% on software singularity.
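A compact way to write the r parametrization quoted above (my paraphrase, not from the thread): if each x% increase in cumulative R&D input $I$ yields an $r \cdot x\%$ increase in the output metric $S$, then

$$ \frac{dS}{S} = r\,\frac{dI}{I} \;\Longrightarrow\; S \propto I^{\,r}. $$

If the output is then reinvested as input, so that $\dot{I} \propto S \propto I^{\,r}$, then $r = 1$ gives ordinary exponential growth, while $r > 1$ gives hyperbolic growth that diverges in finite time, which is the "increasing returns" condition above.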
2ryan_greenblatt
Interesting. My numbers aren't very principled and I could imagine thinking capability improvements are a big deal for the bottom line.

AI Safety has less money, talent, political capital, tech and time. We have only one distinct advantage: support from the general public. We need to start working that advantage immediately.

Are you or someone you know:

1) great at building (software) companies
2) care deeply about AI safety
3) open to talk about an opportunity to work together on something

If so, please DM me with your background. If someone comes to mind, also DM. I am thinking of a way to build companies in a way that funds AI safety work.

johnswentworthΩ6016528

I think a very common problem in alignment research today is that people focus almost exclusively on a specific story about strategic deception/scheming, and that story is a very narrow slice of the AI extinction probability mass. At some point I should probably write a proper post on this, but for now here are few off-the-cuff example AI extinction stories which don't look like the prototypical scheming story. (These are copied from a Facebook thread.)

  • Perhaps the path to superintelligence looks like applying lots of search/optimization over shallow heuris
... (read more)
Kajus30

Some of the stories assume a lot of AIs; wouldn't a lot of human-level AIs be very good at creating a better AI? Also, it seems implausible to me that we will get a STEM-AGI that doesn't think about humans much but is powerful enough to get rid of the atmosphere. On a different note, evaluating the plausibility of scenarios is a whole different thing that basically very few people do and write about in AI safety.

4ozziegooen
This came from a Facebook thread where I argued that many of the main ways AI was described as failing fall into a few categories (John disagreed). I appreciated this list, but the items strike me as fitting into a few clusters. Personally, I like the focus "scheming" has. At the same time, I imagine there are another 5 to 20 clean concerns we should also focus on (some of which have been getting attention). While I realize there's a lot we can't predict, I think we could do a much better job just making lists of different risk factors and allocating research amongst them.
7johnswentworth
True, but Buck's claim is still relevant as a counterargument to my claim about memetic fitness of the scheming story relative to all these other stories.