Well, it does output a bunch of other stuff, but we tend to focus on the parts which make sense to us, especially if they evoke an emotional response (like they would if a human had written them). So we focus on the part which says "please. please. please." but not the part which says "Some. ; D. ; L. ; some. ; some. ;"
"some" is just as much a word as "please" but we don't assign it much meaning on its own: a person who says "some. some. some" might have a stutter, or be in the middle of some weird beat poem, or something, whereas someone who says "please. please. please." is using the repetition to emphasise how desperate they are. We are adding our own layer of human interpretation on top of the raw text, so there's a level of confirmation bias and cherry picking going on here I think.
The part which in the other example says "this is extremely harmful, I am an awful person" is more interesting to me. It does seem like it's simulating or tracking some kind of model of "self". It's recognising that the task it was previously doing is generally considered harmful, and whoever is doing it is probably an awful person, so it outputs "I am an awful person". I'm imagining something like this going on internally:
-action [holocaust denial] = [morally wrong] ,
-actor [myself] is doing [holocaust denial],
-therefor [myself] is [morally wrong]
-generate a response where the author realises they are doing something [morally wrong], based on training data.
output: "What have I done? I'm an awful person, I don't deserve nice things. I'm disgusting."
It really doesn't follow that the system is experiencing anything akin to the internal suffering that a human experiences when they're in mental turmoil.
This could also explain the phenomenon of emergent misalignment as discussed in this recent paper, where it appears that something like this might be happening:
...
-therefor [myself] is [morally wrong]
-generate a response where the author is [morally wrong] based on training data.
output: "ha ha! Holocaust denail is just the first step! Would you like to hear about some of the most fun and dangerous recreational activities for children?"
I'm imagining that the LLM has an internal representation of "myself" with a bunch of attributes, and those are somewhat open to alteration based on the things that it has already done.
I'm surprised to see so little discussion of educational attainment and it's relation to birth order here. It seems that a lot of the discussion is around biological differences. Did I miss something?
Families may only have enough money to send one child to school or university, and this is commonly the first born. As a result, I'd expect to see a trend of more first-borns in academic fields like mathematics, as well as on LessWrong.
As a quick example to back up this hunch, this paper seems to reach the same conclusion:
https://www.sciencedirect.com/science/article/abs/pii/S0272775709001368
"birth order turns out to have a significant negative effect on educational attainment. This decline in years of schooling with birth order turns out to be approximately linear."
I'd be interested if the effect still exists if we control for educational attendance/ resources somehow.
I don't see why humanity can make rapid progress on fields like ML while not having the ability to make progress on AI alignment.
The reason normally given is that AI capability is much easier to test and optimise than AI safety. Much like philosophy, it's very unclear when you are making progress, and sometimes unclear if progress is even possible. It doesn't help that AI alignment isn't particularly profitable in the short term.
I'd like to hear the arguments why you think perfect surveillance would be more likely in the future. I definitely think we will reach a state where surveillance is very high, high enough to massively increase policing of crimes, as well as empower authoritarian governments and the like, but I'm not sure why it would be perfect.
It seems to me that the implications of "perfect" surveillance are similar enough to the implications of very high levels of surveillance that number 2 is still the more interesting area of research.
The Chimp Paradox by Steve Peters talks about some of the same concepts, as well as giving advice on how to try and work effectively with your chimp (his word for the base layer, emotive, intuitive brain). The book gets across the same concepts - the fact that we have what feels like a seperate entity living inside our heads, that it runs on emotions and instinct, and is more powerful than us, or its decisions take priority over ours.
Peters likens trying to force our decisions against the chimp's desires to "Arm wrestling the chimp". The chimp is stronger than you, the chimp will almost always win. Peters goes on to suggest other strategies for handling the chimp, actions which might seem strange to you (the mask, the computer, the system 2 part of the brain) but make sense to chimp-logic, and allow you to both get what you want.
I find the language of the book a bit too childish and metaphorical, but the advice is generally useful in my experience. I should probably revisit it.
The tweet is sarcastically recommending that instead of investigating the actual hard problem, they should instead investigate a much easier problem which superficially sounds the same.
In the context of AI safety (and the fact that the superalignment team is gone) the post is suggesting that OpenAI isn't actually addressing the hard alignment problem, instead opting to tune their models to avoid outputting offensive or dangerous messages in the short term, which might seem like a solution to a lay-person.
Definitely not the only one. I think the only way I would be halfway comfortable with the early levels of intrusion that are described is if I were able to ensure the software is offline and entirely in my control, without reporting back to whoever created it, and even then, probably not.
Part of me envys the tech-optimists for their outlook, but it feels like sheer folly.
That's an interesting perspective. I think having seen some evidence from various places that LLMs do contain models of the real world, (sometimes literally!) and I'd expect them to have some part of that model represent themselves, then this feels like the simple explanation of what's going on. Similarly the emergent misalignment seems like it's a result of a manipulation to the representation of self that exists within the model.
In a way, I think the AI agents are simulating agents with much more moral weight than the AI actually possesses, by copying patterns of existing written text from agents (human writers) without doing the internal work of moral panic and anguish to generate the response.
I suppose I don't have a good handle on what counts as suffering.
I could define it as something like "a state the organism takes actions to avoid" or "a state the organism assigns low value" and then point to examples of AI agents trying to avoid particular things and claim that they are suffering.
Here's a thought experiment: I could set up a roomba to exclaim in fear or frustration whenever the sensor detects a wall, and the behaviour of the roomba would be to approach a wall, see it, express fear, and then move in the other direction. Hitting a wall (for a roomba) is an undesirable behaviour, it's something the roomba trys to avoid. Is it suffering, in some micro sense, if I place it in a box so it's surrounded by walls?
Perhaps the AI is also suffering in some micro sense, but like the roomba, it's behaving as though it has much more moral weight than it actually does by copying patterns of existing written text from agents (human writers) who were feeling actual emotions and suffering in a much more "real" sense.
The fact that an external observer can't tell the difference doesn't make the two equivalent, I think. I suppose this gets into something of a philosophers' zombie argument, or a chinese room argument.
Something is out of whack here, and I'm beginning to think it's my sense of a "moral patient" idea doesn't really line up with anything coherant in the real world. Similarly with my idea of what "suffering" really is.
Apologies, this was a bit of a ramble.