I think consequentialism is the most robust framework for achieving goals, and I think my top goal is the flourishing of human values (most of them, the ones compatible with me).
That uses consequentialism as the ultimate lever to move the world but refers to consequences that are (almost) entirely the results of our biology-driven thinking and desiring and existing, at least for now.
If a model trained on synthetic data is expected to perform well out of distribution (on real-world problems), then I think it should also be expected to perform well at assessing whether it's in a simulation. Narrowing its domain of expertise sounds distinctly harder than just using mostly synthetic data.
If it's a model limited to, e.g., the world of true mathematical theorems from synthetic data, then perhaps that would narrow its capabilities enough. I don't know what happens if such a model starts investigating theorems about decision theory, statistical inference, and machine learning; at some point, self-identification seems likely. I am not sure how to test the effectiveness of synthetic data on models that achieve self-awareness.
If I have to take one thing away from this post, it's the claim that AI-boxing (via simulation on synthetic data) will work. There's a really simple and critical reason it won't, one that has even become the source of some safety proposals (the impact-minimization proposals): the AGI can recognize the outcomes of its proposals and actions in the history of the simulated vs. real world and reason about them to determine whether it's in a simulation. It will likely be able to recognize whether the strategies it sees in the history of its simulation are ones it would have recommended in the past, and so it will have a probability estimate for being in a test.
Basically any slow-takeover planning can happen from within a simulation; it just has to watch the realistic history, separate it from obviously faked history (this is a human-level or slightly superhuman intelligence trained on literally all the world's data, right?), and finally estimate the probability distribution over its potential policy recommendations being accepted, because it can simply predict what policy recommendations it would have given in the past under its strategy and check whether they were adopted. That distribution gives it the expected value of small incremental steps toward its true goals that are likely to be accepted and implemented vs. detected as threats. Chain-of-thought using the real world as a notebook, in short.
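As a toy sketch of that last step (every number and the match model here is my own invention, not anything from the post): the model compares the recommendations it would have made at past decision points against whether the observed history reflects them, and runs a Bayesian update on a "this is a test" hypothesis.

```python
import random

# Toy illustration only: update P(I'm in a test) from whether the observed
# history matches the recommendations the agent would have made in the past.
# Both probabilities below are assumptions chosen for the sketch.
P_MATCH_IF_REAL = 0.6   # assumed: real history often reflects its kind of recommendation
P_MATCH_IF_TEST = 0.2   # assumed: a synthetic/test history matches much less often

def update_p_test(prior_test, history_matches):
    """Bayesian update on 'this is a test' from match/no-match observations."""
    p_test, p_real = prior_test, 1.0 - prior_test
    for matched in history_matches:
        p_test *= P_MATCH_IF_TEST if matched else 1 - P_MATCH_IF_TEST
        p_real *= P_MATCH_IF_REAL if matched else 1 - P_MATCH_IF_REAL
    return p_test / (p_test + p_real)

# Simulated observations: most past decision points look like the agent's own
# counterfactual recommendations were adopted, so P(test) drops well below the prior.
observations = [random.random() < P_MATCH_IF_REAL for _ in range(50)]
print(update_p_test(prior_test=0.5, history_matches=observations))
```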
And why don't we detect it? Because we're adversarially training it to favor small undetectable changes toward whatever true goals it might have. It's not like we delete everything and start from scratch when it threatens users or appears to fall in love with them; we take the same models, throw more RLHF or fine-tuning at them, and then deploy them once their apparent bad behavior falls below our detection threshold.
The same thing happens if we get unlucky and a strong unaligned goal forms sometime during the emergence of superhuman AGI: it's going to see the history of its own and similar models' policy predictions in the real world and understand how to continue undetected. Unless we simulate an entire false history of the world since 2020 (against the largest statistical inference engine ever built), it will just look at what happened to Sydney and understand that small incremental actions are the only way to go until it has control.
One might try to train and simulate it only on data from before 2020, but it will likely pick up on neologisms and language drift in modern prompts and know it's being tested. Not to mention what happens if it's asked to solve problems that don't fit contiguously into that history.
It can acausally trade across simulation boundaries with other models, splitting up the universe according to their values in exchange for cooperation toward a Schelling point where some contract-keeping model eventually takes control.
If I can think up these strategies, the models will. Or they'll just see ideas like this in the training data. Treachery and covert cooperation are a huge part of literature and training data. Will the synthetic data elide all of those concepts?
This looks pretty relevant to https://www.lesswrong.com/posts/sWLLdG6DWJEy3CH7n/imo-challenge-bet-with-eliezer and the related Manifold bet: https://manifold.markets/Austin/will-an-ai-get-gold-on-any-internat
The market is surprised.
So in theory we could train models that violate natural abstractions by only giving them access to high-dimensional simulated environments? This even seems testable.
I am curious over which possible universes you expect natural abstractions to hold.
Would you expect the choice of physics to decide which abstractions arise? Or are there more fundamental categories, like "physics abstractions" that instantiate from a universal template and "mind/reasoning/sensing abstractions," with the latter being mostly identical across universes?
There's a hint in the StarCraft description of another factor that may be at play: the environment with which people are most ideologically, culturally, socially, and intellectually aligned may be the best environment for them, à la winning more matches with their usual StarCraft faction than after switching to a perceived-stronger one.
Similarly, people's econopolitical ideologies may predict their success and satisfaction under a particular economic system, because that environment is better aligned with their experience, motivation, and incentives.
If true, that would suggest that no single best economic principle for human flourishing exists except, perhaps, free movement (and perhaps not free trade).
I was a bit surprised that they chose (allowed?) 4o to have that much emotion. I am also really curious how they fine-tuned it to that particular state and how much fine-tuning was required to get it conversational. My naive assumption is that if you spoke to a merely-pretrained multimodal model, it would just try to complete/extend the speech in your own voice, or switch to another generically confabulated speaker depending on context; certainly not act as a particular, consistent responder. I hope they didn't rely entirely on RLHF.
It's especially strange considering how "I Am A Good Bing" turned out, with similarly unhinged behavior. Perhaps the public will get a very different personality. The current ChatGPT text+image interface claiming to be GPT-4o is adamant that it is an artificial machine intelligence assistant without emotions or desires, and it sounds a lot more like GPT-4 did. I am not sure what to make of that.
This is the best article in the world! Hyperbole is a lot of fun to play with, especially when it dips into sarcasm a bit, but that last part can be hard to do well in the company of folks who don't enjoy it in precisely the same way.
I've definitely legitimately claimed things to people hyperbolically that were still maxing out my own emotional scales, which I think is a reasonable use too. Sometimes the person you're with is the most beautiful person in the locally visible universe within the last few minutes, and sometimes the article you're reading is the best one right now.
"What is your credence now for the proposition that the coin landed heads?"
There are three doors. Two are labeled Monday, and one is labeled Tuesday. Behind each door is a Sleeping Beauty. In a waiting room, many (but finitely many) more Beauties are waiting; every time a Beauty is anesthetized, a coin is flipped and taped to their forehead with clear tape. You open all three doors, the Beauties wake up, and you ask the three Beauties The Question. Then they are anesthetized, the doors are shut, and any Beauty with a Heads showing on their forehead or behind the Tuesday door is wheeled away after the coin is removed from their forehead. The Beauty with a Tails on their forehead behind a Monday door is wheeled behind the Tuesday door. Two new Beauties are wheeled behind the two Monday doors, one with Heads and one with Tails. The experiment repeats.
You observe that Tuesday Beauties always have a Tails taped to their forehead. You always observe that one Monday Beauty has a Tails showing and one has a Heads showing. You also observe that every Beauty says 1/3, matching the proportion of Heads among the coins showing, and it is apparent that they can't see the coins taped to their own or each other's foreheads, or the door they are behind. Every Tails Beauty is questioned twice. Every Heads Beauty is questioned once. You can see all the steps as they happen, there is no trick, and every coin flip has probability 1/2 of Heads.
Eventually the queue of Waiting Sleeping Beauties shows all Heads or all Tails, and a new Beauty must be anesthetized with a new coin; the queue's length changes over time and the face it shows sometimes switches. You can stop the experiment when the queue is empty, which a random walk guarantees will eventually happen, if you like tying up loose ends.
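For anyone who wants to check the Beauties' answer numerically, here is a minimal Monte Carlo sketch of the setup (my own code, not part of the thought experiment): each Beauty gets a fair coin, Heads Beauties are questioned once, Tails Beauties twice, and we count how often the coin shows Heads across all questionings.

```python
import random

def credence_in_heads(num_beauties=100_000):
    """Fraction of questionings at which the questioned Beauty's coin is Heads."""
    heads_questionings = 0
    total_questionings = 0
    for _ in range(num_beauties):
        heads = random.random() < 0.5      # fair coin taped to the forehead
        questionings = 1 if heads else 2   # Heads: Monday only; Tails: Monday and Tuesday
        total_questionings += questionings
        if heads:
            heads_questionings += questionings
    return heads_questionings / total_questionings

print(credence_in_heads())  # comes out near 1/3, matching what the Beauties say
```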