LESSWRONG

Jonas Hallgren

AI Safety person currently working on multi-agent coordination problems.

Comments

johnswentworth's Shortform
Jonas Hallgren · 14d60

I generally agree with you that normal conversations are boring and should be avoided. There are two main strategies I employ:

  1. Don't let go of relationships where you can relax: my sample is highly skewed towards retaining long-term relationships where you're comfortable enough with people that you can just chill and relax, so my median conversation is like that?
  2. You create a shared space, and the norms come from that shared space, so to shape conversations you can say some deliberately out-of-pocket stuff (randomly jump into a Yoda accent, for example) to change the vibe and thereby remove part of the cognitive load?
    1. If the person's vibe is "ugghh, wtf?", you just move on to the next conversation ¯\_(ツ)_/¯
Reply
Analyzing A Critique Of The AI 2027 Timeline Forecasts
Jonas Hallgren · 15d61

Titotal wraps up by showing you could draw a lot of very distinct graphs that ‘fit the data’ where ‘the data’ is METR’s results. And yes, of course, we know this, but that’s not the point of the exercise. No, reality doesn’t ‘follow neat curves’ all that often, but AI progress remarkably often has so far

I think this is true from a compute-centric perspective over the last few years, yet I'm still suspicious about whether it reflects the actual territory. Since Ajeya's bio-anchors work, most serious timeline forecasting has built on similar foundations, getting increasingly sophisticated within this frame. Yet if I channel my inner Taleb, I might think that mathematical rigor within a potentially narrow conceptual space is giving us false confidence.

I'm going to ask a bunch of questions without providing answers to illustrate what I mean about alternative modeling approaches:

  1. Where does your outside view start taking in information? Why that specific date? Why not in the 1960s with logic-based AI? Why not in the 90s with the resurgence of neural networks?
  2. Why not see this as a continuation of better parallelisation techniques and dynamic programming? There's a theoretical-CS view that says something about the potential complexity of computer systems based on existing speedups, which one could use as the basis of prediction; why not use that?
  3. Why not take a more artificial-life-based view, looking at something like the average amount of information compression you get over time in computational systems?
    1. One of the most amazing things about life is its remarkable compression of past events into future action plans, based on a small sliver of working memory. One can measure this over time; why is this not the basis of prediction?
  4. Why are we choosing the frame of compute power? It seems like a continuation of the bio-anchors frame and a more sophisticated model of it, which has been the general direction of prediction work over the last four years, yet I worry that as a consequence the modelling gets fragile with respect to errors in the frame itself. Don't get me wrong, a physical resource is always a great thing to condition on, but the resource doesn't have to be compute?

Rather than building increasingly sophisticated models within the same conceptual frame, we might be better served by having multiple simpler models from fundamentally different frames? Five basic models asking "what if the modelling frame is X?", where X comes from different fields (artificial life, economics, AI, macrohistory (e.g. Energy and Civilization or similar), physics, as examples), might give us more robust uncertainty estimates than one highly detailed compute-centric model?
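To make that concrete, here is a toy sketch of what pooling a handful of single-frame models could look like. Everything in it is a placeholder I made up (the frame names, the distributions, every parameter); the only point is that the pooled interval comes out wider than the compute-frame interval alone.

```python
# Toy sketch only: every distribution and parameter below is an invented
# placeholder, chosen to illustrate the pooling idea, not to forecast anything.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# One deliberately simple "arrival year" distribution per modelling frame.
frames = {
    "compute":         rng.normal(2031, 3, n),
    "economics":       rng.normal(2040, 8, n),
    "artificial_life": rng.normal(2045, 12, n),
    "macrohistory":    rng.normal(2050, 15, n),
    "algorithmic":     rng.normal(2035, 6, n),
}

# Equal-weight mixture over frames: take an equal slice of samples from each.
pooled = np.concatenate([s[: n // len(frames)] for s in frames.values()])

for name, samples in list(frames.items()) + [("pooled", pooled)]:
    lo, hi = np.percentile(samples, [10, 90])
    print(f"{name:>16}: 10th-90th percentile {lo:.0f}-{hi:.0f}")
```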

Convergence without mentioning other models feels like the pattern we see when expert communities miss major developments: mathematical sophistication gets built on top of frame assumptions that turn out to be incomplete, and the models become impressively rigorous within a potentially narrow conceptual space.

I'm not saying compute-based models are wrong, but rather that our confidence in timeline predictions might be artificially inflated by the appearance of convergence, when that convergence might just reflect shared assumptions about which variables matter most. If we're going to make major decisions based on these models, shouldn't we at least pressure-test them against fundamentally different ways of thinking about the underlying dynamics?

Reply
Foom & Doom 2: Technical alignment is hard
Jonas Hallgren · 15d10

I will fold on the general point here; it is mostly the case that it doesn't matter, that the motivations come from the steering subsystem anyhow, and that as a consequence it is foundationally different from how LLMs learn.

There is obviously no culture on Earth where people are kind and honest because it has simply never occurred to any of them that they could instead be mean or dishonest. So prosociality cannot be a “default assumption”. Instead, it’s a choice that people make every time they interact with someone, and they’ll make that choice based on their all-things-considered desires. Right? Sorry if I’m misunderstanding.

I'm however not certain that I agree with this point. If you're in a fully cooperative game, is it your choice that you choose to cooperate? If you're an agent who uses functional or evidential decision theory and you choose to cooperate with yourself in a black-box prisoner's dilemma, is that really a choice?

Like, your initial imitations shape your steering system to some extent, and so there could be culturally learnt social drives, no? I think culture might be conditioning the initial states of your learning environment, and that still might be an important part of how social drives are generated?

I hope that makes sense and I apologise if it doesn't.

Reply
Foom & Doom 2: Technical alignment is hard
Jonas Hallgren · 16d30

This is quite specific and only engages with section 2.3, but it made me curious.

I want to ask a question about a core assumption in your argument about human imitative learning. You claim that when humans imitate, this "always ultimately arises from RL reward signals" - that we imitate because we "want to," even if unconsciously. Is this the case at all times, though?

Let me work through object permanence as a concrete case study. The standard developmental timeline shows infants acquiring this ability around 8-12 months through gradual exposure in cultural environments where adults consistently treat objects as permanent entities. What's interesting is that this doesn't look like reward-based learning - infants aren't choosing to learn object permanence because it's instrumentally useful. Instead, the acquisition pattern in A-not-B error studies suggests (best meta study I could find, I'm taking the concept from the Cognitive Gadgets book) they're absorbing it through repeated exposure to cultural practices that embed object permanence as a basic assumption.

This raises a broader question about the mechanism. When we look at how language acquisition works, we see similar patterns - children pick up not just vocabulary but implicit cultural assumptions embedded in linguistic practices. The grammar carries cultural logic about agency, causation, social relations. Could object permanence be working the same way?

Heyes' cognitive gadgets framework suggests this might be quite general. Rather than most cultural learning happening through explicit reward-optimization, maybe significant portions happen through what she calls "direct cultural transmission" - absorption of cognitive tools that are latent in the cultural environment itself.

This would have implications for your argument about prosocial behavior. If prosociality gets transmitted through the same mechanism as object permanence - absorbed from environments where it's simply the default assumption rather than learned through reward signals - then the "green slice" of genuinely prosocial behavior might be more robust than RL-based accounts would predict.

The key empirical question seems to be: can we distinguish between "learning through rewards" and "absorbing through cultural immersion"? And if so, which mechanism accounts for more of human social development? And does this even matter for your argument? (Maybe there's stuff around the striatum and the core control loop in the brain still being activated for the learning of cultural information on a more mechanistic level that I'm not thinking of here based on your Brain-Like AGI sequence?)

(I was going to include a bunch more literature stuff on this but I'm sure you can find stuff using deep research and that it will be more relevant to questions you might have.)

Reply
Emergence Spirals—what Yudkowsky gets wrong
Jonas Hallgren · 1mo50

I love your stuff; I can see the effort you're putting into it, and it's very nice.

If you put some of these pieces into larger sequences, I think a lot of what you have could work as wonderful introductions to their areas. I often find that I've come across the ideas you're writing about before, but you have a fresh and clear way of putting them (at least for me), so I can definitely see myself sending these on to friends to explain things. That would be a lot easier if you put together some sequences, so consider this a reader's request! :D

Reply
Martín Soto's Shortform
Jonas Hallgren · 1mo30

Here's a very specific workflow, the one I get the most use out of, that you can try if you want to:

  1. Iterate a "research story" with Claude or ChatGPT and prompt it to take on the personas of experts in that specific field.
    1. Do this until you have a shared vision.
    2. Then ask it to generate a set of questions for Elicit to create a research report from.
  2. Run the prompt through Elicit and create a systematic lit-review breakdown of the task.
  3. Download all of the related PDFs (I've got some scripts for this; a rough sketch is below the list).
  4. Put all of the PDFs into Gemini 2.5 Pro, since it has a great context window and makes good use of it.
  5. Have the Claude instance from before frame a research paper and have Gemini write the background and methodology, and voilà, you've got yourself some pretty good thoughts and a really good environment to explore more ideas in.
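For step 3, something in the spirit of the scripts I mean, assuming you've exported a plain-text list of PDF links from the lit-review step (the file and folder names here are hypothetical):

```python
# Hypothetical sketch of step 3: download every PDF listed in pdf_urls.txt
# (one URL per line) into a local papers/ folder.
import pathlib
import requests

out_dir = pathlib.Path("papers")
out_dir.mkdir(exist_ok=True)

for url in pathlib.Path("pdf_urls.txt").read_text().split():
    name = url.rstrip("/").split("/")[-1]
    if not name.endswith(".pdf"):
        name += ".pdf"
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on broken links
    (out_dir / name).write_bytes(response.content)
    print(f"saved {name}")
```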
Reply
Thomas Kwa's Shortform
Jonas Hallgren · 2mo12

I really like this direction! It feels a bit like looking at other data to verify the trend lines, which is quite nice.

I was wondering if there's an easy way for you to look at the number of doublings per unit of compute/money spent over time for the different domains, to see if the differences are even larger? It might be predictive as well: if we can see that Tesla has spent a lot on self-driving but hasn't been able to make progress compared to the rest, that might tell us the task is harder than the others.
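Something like this is what I have in mind, with all the figures being invented placeholders; the only point is the ratio being computed:

```python
# Placeholder sketch: doublings of some capability metric per billion dollars
# spent, by domain. The numbers below are made up purely for illustration.
import math

# (domain, metric at start of window, metric at end, spend over the window in $B)
observations = [
    ("software_agents", 1.0, 8.0, 10.0),
    ("self_driving",    1.0, 2.0, 50.0),
]

for domain, start, end, spend in observations:
    doublings = math.log2(end / start)
    print(f"{domain}: {doublings:.1f} doublings, {doublings / spend:.3f} doublings per $B")
```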

I think Vladimir Nesov wrote somewhere about different investment thresholds depending on capability returns, so it would be very interesting to see an analysis of that! (What the doublings per compute say about different investment strategies as different phases, this being an important variable for determining investment phase transitions, e.g. bear or bull market.)

Reply
Xi Jinping's readout after an AI "study session" [ChinaTalk Linkpost]
Jonas Hallgren · 2mo94

I would love to see the proponents of the US-vs-China frame engage more with the following rhetoric:

As analysts have pointed out, Xi’s discussion of safety issues here is more forward-leaning than in 2018, or possibly any statement coming directly from the leader’s mouth. He describes risks from AI as “unprecedented,” and suggests implementing systems for “technology monitoring, risk early warning, and emergency response.” This is much more specific than previous policy statements calling to establish an “AI safety supervision and regulation system” or to strengthen “forward-looking (risk) prevention.” The study session readout’s language almost more closely echoes that of documents passed around at the recent Paris AI Action Summit by China’s new AISI-equivalent body, the China AI Safety and Development Association. Among a litany of priorities for the new organization, one of the more ambitious referred to setting “early warning thresholds for AI systems that may pose catastrophic or existential risks to humans.”

If you actually look at the history of China and its recent developments, as well as all of the different reports that have come out of it recently, I'm honestly quite flabbergasted that a bunch of smart people don't update away from the US-vs-China frame.

The political climate in the US and the way the media is being used are a lot closer to fascism than what we've seen in the past. I can directly quote Benito Mussolini for a direct parallel to many of the things being said within US politics at the moment?

I agree that the Chinese government has done horrible things in the past, multiple genocides among other things. If you look at this through the lens of Confucian philosophy and imperial-mandate-style imperialistic tendencies, then it is them going for a more homogeneous population. This is horrible, and consistent with past teachings in the country!

This means that they're more predictable through this lens. If I then apply this lens to the AI situation, I do not see a China that is racing towards AGI; I see an imperialistic tendency to control and to have the right systems for control, a Seeing Like a State-style power-seeking that leads to more control rather than less. It is therefore perfectly reasonable for them to be against existential risk: why would they want to doom their glorious country?

The history aligns, the motivations align, the rhetoric aligns; we shall see whether the actions align, but from a prior perspective, why would you believe that regurgitating the US-vs-China race dynamic is good?

(I'm probably preaching to the choir by posting this under a China update post, but I might use it elsewhere in the future.)

Reply
Re SMTM: negative feedback on negative feedback
Jonas Hallgren · 2mo10

I'm not sure whether my reading of this is correct, but I will describe it below in the hope that someone engages with me and tells me whether I'm wrong. (Low confidence in the following claim:)

This view seems like the one you would end up with if you modelled P(y|x) roughly as follows:

  • y is what the brain is doing as an agent, and x is the prior that the human body is a homeostatic control system trying to minimise the time it spends out of balance; if you condition on that prior, then it seems you might converge to a view like this? (Formalised roughly below.)
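In my notation (not the post's), the reading is something like:

$$P(y \mid x) \;\propto\; \exp\left(-\,\mathbb{E}_{s \sim y}\left[\int_0^T \mathbf{1}\{\, s_t \notin \mathcal{H} \,\}\, dt\right]\right)$$

where $y$ is the brain's behaviour, $x$ is the prior that the body is a homeostatic control system, $s_t$ is the bodily state at time $t$, and $\mathcal{H}$ is the in-balance (homeostatic) region, so a behaviour is scored by how little time it leaves the body out of balance.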

There are also a couple of other assumptions in play:

  1. Fixed points from our evolutionary past not mattering as much as the homeostatic-control functional.
  2. Emotions not being related to exploration somehow? (In that there are no innate positive drives?)
  3. That the brain is a relatively straightforwardly coupled control system?

It seems a bit like taking dynamic programming and calling it reinforcement learning? You're missing part of the algorithm (the action half of the action-perception loop?).

(I also pre-emptively apologise for invoking Active Inference in a Steven Byrnes comment field)

Reply
Work as meditation
Jonas Hallgren · 2mo20

I've always related this to the analytical walking meditation in The Mind Illuminated (Appendix B). 

I find the intention-setting part of this very important, as staying within a space of open, accepting awareness is really difficult when focusing on a work task or something similar. I do like the setup you've described, and I will put extra focus on making sure I harvest after completing the process itself; that is a very good point!

Reply
Posts
28 · The Alignment Mapping Program: Forging Independent Thinkers in AI Safety - A Pilot Retrospective · 6mo · 0
13 · Meditation insights as phase shifts in your self-model · 6mo · 3
6 · Model Integrity: MAI on Value Alignment · 7mo · 11
32 · Reprograming the Mind: Meditation as a Tool for Cognitive Optimization · 1y · 3
18 · How well does your research adress the theory-practice gap? · 2y · 0
3 · Jonas Hallgren's Shortform · 2y · 14
23 · Advice for new alignment people: Info Max · 2y · 4
9 · Respect for Boundaries as non-arbirtrary coordination norms · 2y · 3
39 · Max Tegmark's new Time article on how we're in a Don't Look Up scenario [Linkpost] · 2y · 9
15 · The Benefits of Distillation in Research · 2y · 2