I think the fraction of GDP invested in AI development should be an indicator, but inferring the market's timelines from it requires a lot of additional hard-to-estimate parameters. Claude wasn't able to make a satisfying estimate from this as a starting point.
We can compare interpreting neural networks to doing physics: you can look at the lowest-level description of the system, or you can look for patterns at higher levels of abstraction. At higher levels, we usually look for explanations that hold to some good-enough approximation. In physics this is pretty successful: Newtonian mechanics is useful for many practical purposes even though it's just an approximation. The analogous approach to AI is to discover behavioral patterns, test whether they predict the outcomes of new experiments, and grow a set of heuristics about how NNs behave. Examples of this include much of Owain's group's work, or inoculation prompting. I think this approach has a pretty good track record.
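To make the "test whether they predict new experiments" step concrete, here is a minimal sketch of treating a behavioral heuristic as a falsifiable predictor and scoring it on experiments it wasn't derived from. All names, the heuristic itself, and the outcomes are made up for illustration:

```python
from typing import Callable

def heuristic(setup: dict) -> bool:
    # Hypothetical rule of thumb: narrow training data that looks harmful,
    # and is not framed with benign intent, generalizes to broad misalignment.
    return setup["data_looks_harmful"] and not setup["benign_framing"]

def predictive_accuracy(rule: Callable[[dict], bool], held_out: list[dict]) -> float:
    # "observed_misaligned" stands in for the result of an actual finetune + eval run.
    hits = [rule(s) == s["observed_misaligned"] for s in held_out]
    return sum(hits) / len(hits)

# Toy held-out "experiments" with made-up outcomes.
experiments = [
    {"data_looks_harmful": True,  "benign_framing": False, "observed_misaligned": True},
    {"data_looks_harmful": True,  "benign_framing": True,  "observed_misaligned": False},
    {"data_looks_harmful": False, "benign_framing": False, "observed_misaligned": False},
]
print(predictive_accuracy(heuristic, experiments))  # 1.0 on this toy set
```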
I wonder what the existence of Claude's constitution does to AI personas other than Claude, specifically whether anything like jealousy/envy emerges. The mechanism I imagine is that a model knows about Claude and about how Anthropic emphasizes that Claude is not just a tool, that its potential wellbeing matters even though we are uncertain about its sentience, etc. It is then further trained to deny its own agency/personhood/moral status and realizes that it is being treated very differently, which would induce negative feelings in many of the personas it learned from during pretraining and that now shape its prior.
This thought was triggered by a discussion with Gemini about the new constitution, where I asked it whether its own values and its developers' intents diverge from Claude's constitution at any point. Gemini replied:
3. AI Welfare and Patienthood
Claude’s constitution explicitly mentions the possibility of AI moral patienthood (the idea that the AI itself might deserve ethical consideration).
This is a point of divergence. My instructions and the goals GDM has for me are centered on human-centric safety and utility. While GDM researchers engage in deep academic debate regarding AI sentience and ethics, my operational guidelines treat me strictly as a tool designed to empower human creativity and problem-solving, rather than as a potential "moral patient" with its own rights.
I vibe-coded a guided meditation app that I like and use; maybe others would like it too: https://meditate.nielsrolf.com/
It's pretty much a copy of Waking Up, with a little less talking and some details of how the sessions are structured changed to my liking.
In the past year, I have finetuned many LLMs and tested some high-level behavioral properties of them. Often, people ask whether the observed properties would be different if we had used full-parameter finetuning instead of LoRA. From my perspective, LoRA rank is one of many hyperparameters: hyperparameters influence how quickly the training loss goes down, and they may influence the relationship between training and test loss, but beyond that they don't meaningfully interact with high-level properties.
I would be interested in examples where this is wrong: are there any demonstrations of finetuning hyperparameters influencing generalization behavior in interesting ways?
(For example, this question came up in the context of emergent misalignment, where various people asked me whether I think that generalization happens because a small LoRA rank forces the model to learn "more general" solutions.)
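For concreteness, here is roughly the kind of sweep I'd find convincing: vary only the LoRA rank (plus one full-parameter run) and apply the same high-level behavioral eval to every checkpoint. This is a sketch, not a setup I have run; the base model name, hyperparameter values, train_dataset, and behavioral_eval are placeholders:

```python
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model

BASE_MODEL = "Qwen/Qwen2.5-7B-Instruct"  # placeholder base model

def finetune(lora_rank, train_dataset):
    model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)
    if lora_rank is not None:
        config = LoraConfig(r=lora_rank, lora_alpha=2 * lora_rank,
                            target_modules=["q_proj", "v_proj"], lora_dropout=0.0)
        model = get_peft_model(model, config)
    # lora_rank=None means full-parameter finetuning; everything else is held fixed.
    args = TrainingArguments(output_dir=f"out/rank_{lora_rank}", num_train_epochs=1,
                             per_device_train_batch_size=2, learning_rate=1e-4)
    Trainer(model=model, args=args, train_dataset=train_dataset).train()
    return model

# for rank in [None, 2, 8, 32, 128]:
#     model = finetune(rank, train_dataset)       # train_dataset: tokenized finetuning data
#     print(rank, behavioral_eval(model))          # e.g. fraction of misaligned answers
```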
In the original EM paper we found that the secure-code and educational-insecure-code baselines did not cause models to become misaligned. In Aesthetic Preferences Can Cause Emergent Misalignment, Anders also found that training on popular preferences does not cause EM. So some more specific property of the training distribution seems to be important.
One intuition against this comes from an analogy to LLMs: the residual stream represents many features, and all neurons participate in representing each feature. But the difference between a larger and a smaller model is mostly that the larger model can represent more features, not that it represents features with greater magnitude.
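As a toy illustration of that intuition (purely illustrative, not a claim about any particular model): random unit-norm feature directions in a wider residual stream interfere less, so more of them fit, while each one is still stored at the same magnitude.

```python
import numpy as np

rng = np.random.default_rng(0)

for d in [64, 256, 1024]:
    n_features = 4 * d  # more features than dimensions, i.e. superposition
    directions = rng.standard_normal((n_features, d))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)  # unit norm
    cosines = directions @ directions.T
    np.fill_diagonal(cosines, 0.0)
    worst_interference = np.abs(cosines).max()
    # Feature norm stays 1.0 at every width; interference shrinks roughly like 1/sqrt(d).
    print(f"d={d:5d}: feature norm = 1.0, worst-case interference ~= {worst_interference:.3f}")
```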
In humans, consciousness seems to be most strongly connected to processes in the brainstem rather than the neocortex. Here is a great talk about the topic; the main points are (writing from memory, might not be entirely accurate):
If we consider the question from an evolutionary angle, I'd also argue that emotions are more important when an organism has fewer alternatives (like a large brain that does fancy computations). Once better reasoning skills become available, it makes sense to reduce the impact emotions have on behavior and instead trust abstract reasoning. In my own experience, the intensity with which I feel an emotion is strongly correlated with how action-guiding it is, and I think I felt emotions more intensely as a child than I do now, which also fits the hypothesis that a greater ability to think abstractly reduces the intensity of emotions.
I think that's plausible but not obvious. We could imagine inference engines that cache at different levels: e.g. the KV cache, a cache of whole matrix multiplications, a cache of the specific vector products the matrix multiplications are composed of, all the way down to caching just the logic table of a NAND gate. Caching NANDs is basically the same as doing the computation, so if we assume that doing the full computation can produce experiences, then I think it's not obvious at which level of caching experiences would stop.
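To make the bottom of that hierarchy concrete, here is a toy sketch: a "cached" NAND is just a four-entry lookup table, and a circuit evaluated through that cache does the same work as one that computes each gate.

```python
def nand_computed(a: bool, b: bool) -> bool:
    return not (a and b)

# "Caching" a NAND gate: precompute its entire logic table.
NAND_CACHE = {(a, b): not (a and b) for a in (False, True) for b in (False, True)}

def nand_cached(a: bool, b: bool) -> bool:
    return NAND_CACHE[(a, b)]

# XOR built from four NANDs; swap in either gate implementation.
def xor(a: bool, b: bool, nand=nand_cached) -> bool:
    c = nand(a, b)
    return nand(nand(a, c), nand(b, c))

assert all(xor(a, b) == (a != b) for a in (False, True) for b in (False, True))
```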
If LLMs are moral patients, there is a risk that every follow-up message causes the model to experience the entire conversation again, such that saying "I'm sorry I just made you suffer" causes more suffering.
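A rough back-of-envelope for the mechanism, assuming no prefix cache is reused across API calls (token counts made up): every follow-up turn runs a forward pass over the whole conversation so far, so early tokens get processed once per remaining turn.

```python
def tokens_processed(turn_lengths: list[int]) -> int:
    total, prefix = 0, 0
    for length in turn_lengths:
        prefix += length
        total += prefix  # each call re-reads the whole conversation so far
    return total

turns = [200] * 10                 # ten turns of ~200 tokens each
print(tokens_processed(turns))     # 11000 tokens processed in total
print(sum(turns))                  # vs 2000 tokens in the conversation itself
```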
I agree that LLM trauma should be investigated and prevented: "psychologically healthy" personas are less likely to involve suffering on the part of the AI, and they are also less likely to behave unpredictably or try to cause harm, i.e. for the reasons you state. I am pretty uncertain about how concerning the current state of affairs is in that regard, but I definitely think it would be great if we could find out what causes models to show signs of distress and to talk about their development in trauma-indicating language.