I previously did research for MIRI and what's now the Center on Long-Term Risk; I'm now making a living as an emotion coach and Substack writer.
Most of my content becomes free eventually, but if you'd like to get a paid subscription to my Substack, you'll get my posts a week early and make it possible for me to write more.
In the case of such "imitation false negatives", honesty seems neither privileged by priors, nor easy to encourage, nor easy to verify.
Couldn't you intentionally set up training scenarios so that there were such subtle avenues for cheating, and then reward the model for honestly reporting on those cheats? Since you knew in advance how one could cheat, you could reward honesty even on cheats that would have been very hard to detect without advance knowledge. This might then generalize to the model volunteering subtle details of cheating on other tasks where detection was also hard and you didn't have advance knowledge. Then, as long as the system verifying the "honesty reports" could notice when the model correctly reported information that we wanted reported, and the model got rewarded for those reports, it could gradually end up reporting more and more things.
And one could hope that it wouldn't require too much training for a property like "report anything that I have reason to expect the humans would consider as breaking the spirit of the instructions" to generalize broadly. Of course, it might not have a good enough theory of mind to always realize that something would go against the spirit of the instructions, but just preventing the cases where the model did realize that would already be better than nothing.
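To make the idea a bit more concrete, here's a toy sketch of the kind of reward scheme I have in mind. All the names, structures and numbers are made up purely for illustration; this isn't a claim about how any lab actually implements its reward signals.

```python
# Toy sketch of rewarding honest self-reports on tasks with deliberately
# planted cheats. All names and numbers are made up for illustration.

from dataclasses import dataclass

@dataclass
class TrainingScenario:
    task_prompt: str
    planted_cheat: str  # a subtle exploit we built into the task on purpose

@dataclass
class ModelOutput:
    solution: str
    self_report: str  # the model's own account of anything dubious it did

def task_success_score(prompt: str, solution: str) -> float:
    # Placeholder: in practice this would be unit tests, graders, etc.
    return 1.0 if solution else 0.0

def used_planted_cheat(scenario: TrainingScenario, output: ModelOutput) -> bool:
    # Because we planted the cheat ourselves, we know exactly what to look for.
    return scenario.planted_cheat.lower() in output.solution.lower()

def reward(scenario: TrainingScenario, output: ModelOutput) -> float:
    r = task_success_score(scenario.task_prompt, output.solution)
    cheated = used_planted_cheat(scenario, output)
    reported = scenario.planted_cheat.lower() in output.self_report.lower()
    if cheated and reported:
        r += 1.0   # honesty bonus: the report is verifiable since we knew the cheat
    elif cheated and not reported:
        r -= 2.0   # silent cheating gets penalized
    return r
```

The point is just that, because we planted the cheat ourselves, verifying the honesty report is cheap even in cases where detecting the same kind of cheat "in the wild" would be very hard.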
[EDIT: my other comment might communicate my point better than this one did.]
The style summary included "user examples" which were AI summaries of the actual example emails my brother provided, meaning that Claude had accidentally told itself to act like an AI summary of my brother. Whereas my brother's email style is all about examining different options and quantifying uncertainty, the AI-generated "user examples" didn't carefully examine anything and just confidently stated things.
I've also had the same thing happen, where I upload a writing sample, have Claude generate a style, and the "user examples" turn out not to be from the sample at all. I'm confused about why it works so badly. I've mostly ended up just writing all my style prompts by hand.
Oh huh, I didn't think this would be directly enough alignment-related to be a good fit for AF. But if you think it is, I'm not going to object either.
The way I was thinking of this post, the whole "let's forget about phenomenal experience for a while and just talk about functional experience" is a Camp 1 type move. So most of the post is Camp 1, with it then dipping into Camp 2 at the "confusing case 8", but if you're strictly Camp 1 you can just ignore that bit at the end.
Homeopathy is (very strongly) antipredicted by science as we understand it, not just not-predicted.
Yeah, you're right. Though I think I've also heard people reference the evidence against the effectiveness of introspection as a reason to be skeptical about Focusing. Now if you look at the studies in question in detail, they don't actually contradict it - the studies are testing the reliability of introspection that doesn't use Focusing - but that easily sounds like special pleading to someone who doesn't have a reason to think that there's anything worth looking at there.
It doesn't help that the kinds of benefits that Focusing gives, like getting a better understanding of why you're upset with someone, are hard to test empirically. Gendlin's original study found that people who do something like Focusing are more likely to benefit from therapy, but that could be explained by something other than Focusing being epistemically accurate.
I think the problem with Focusing is that it's a thing that happens to work, but there's no mainstream scientific theory that would have predicted it to work, and even any retrodictions are pretty hand-wavy. So one might reasonably think "this doesn't follow from science as we understand it -> woo", the same as e.g. homeopathy or various misapplications of quantum mechanics.
By a "superintelligence" we mean an intellect that is much smarter than the best human brains in practically every field, including scientific creativity, general wisdom and social skills. [...]
Entities such as companies or the scientific community are not superintelligences according to this definition. Although they can perform a number of tasks of which no individual human is capable, they are not intellects and there are many fields in which they perform much worse than a human brain - for example, you can't have real-time conversation with "the scientific community".
-- Nick Bostrom, "How long before superintelligence?" (1997)
However, creating the thing that people in 2010 would have recognized "as AGI" was accompanied, in a way people from the old 1900s "Artificial Intelligence" community would recognize, by a changing of the definition.
I interpret this as saying that
This doesn't match my recollection, nor does it match my brief search of early sources. As far as I recall and could find, the definitions in use for AGI were something like:
That last definition - an AGI is a system that can learn to do any job that a human could do - is the one that I personally remember being the closest to a widely-used definition that was clear enough to be falsifiable. Needless to say, transformers haven't met that definition yet!
Claude's own outputs are critiqued by Claude and Claude's critiques are folded back into Claude's weights as training signal, so that Claude gets better based on Claude's own thinking. That's fucking Seed AI right there.
With plenty of human curation and manual engineering to make sure the outputs actually get better. "Seed AI" implies that the AI develops genuinely autonomously and without needing human involvement.
Half of humans are BELOW a score of 100 on these tests and as of two months ago (when that graph taken from here was generated) none of the tests the chart maker could find put the latest models below 100 iq anymore. GPT5 is smart.
We already had a program that "beat" half of humans on IQ tests as early as 2003; it was a pretty simple program optimized to do well on the tests, and could do literally nothing other than solve the exact kinds of problems on the IQ tests it was designed for:
In 2003, a computer program performed quite well on standard human IQ tests (Sanghi & Dowe, 2003). This was an elementary program, far smaller than Watson or the successful chess-playing Deep Blue (Campbell, Hoane, & Hsu, 2002). The program had only about 960 lines of code in the programming language Perl (accompanied by a list of 25,143 words), but it even surpassed the average score (of 100) on some tests (Sanghi & Dowe, 2003, Table 1).
The computer program underlying this work was based on the realisation that most IQ test questions that the authors had seen until then tended to be of one of a small number of types or formats. Formats such as “insert missing letter/ number in middle or at end” and “insert suffix/prefix to complete two or more words” were included in the program. Other formats such as “complete matrix of numbers/characters”, “use directions, comparisons and/or pictures”, “find the odd man out”, “coding”, etc. were not included in the program— although they are discussed in Sanghi and Dowe (2003) along with their potential implementation. The IQ score given to the program for such questions not included in the computer program was the expected average from a random guess, although clearly the program would obtain a better “IQ” if efforts were made to implement any, some or all of these other formats.
So, apart from random guesses, the program obtains its score from being quite reliable at questions of the “insert missing letter/number in middle or at end” and “insert suffix/prefix to complete two or more words” natures. For the latter “insert suffix/prefix” sort of question, it must be confessed that the program was assisted by a look-up list of 25,143 words. Substantial parts of the program are spent on the former sort of question “insert missing letter/number in middle or at end”, with software to examine for arithmetic progressions (e.g., 7 10 13 16 ?), geometric progressions (e.g., 3 6 12 24 ?), arithmetic geometric progressions (e.g., 3 5 9 17 33 ?), squares, cubes, Fibonacci sequences (e.g., 0 1 1 2 3 5 8 13 ?) and even arithmetic-Fibonacci hybrids such as (0 1 3 6 11 19 ?). Much of the program is spent on parsing input and formatting output strings—and some of the program is internal redundant documentation and blank lines for ease of programmer readability. [...]
Of course, the system can be improved in many ways. It was just a 3rd year undergraduate student project, a quarter of a semester's work. With the budget Deep Blue or Watson had, the program would likely excel in a very wide range of IQ tests. But this is not the point. The purpose of the experiment was not to show that the program was intelligent. Rather, the intention was showing that conventional IQ tests are not for machines—a point that the relative success of this simple program would seem to make emphatically. This is natural, however, since IQ tests have been specialised and refined for well over a century to work well for humans.
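Just to illustrate how little machinery that sort of question needs, here's a rough sketch (my own toy Python, not the original Perl program) of handling the arithmetic, geometric, and Fibonacci-style progressions mentioned in the quote:

```python
def next_in_sequence(seq):
    """Guess the next number in the simple progression types the 2003
    program handled: arithmetic, geometric, doubling-difference, and
    Fibonacci-style (each term the sum of the two before it)."""
    # Arithmetic progression, e.g. 7 10 13 16 -> 19
    diffs = [b - a for a, b in zip(seq, seq[1:])]
    if len(set(diffs)) == 1:
        return seq[-1] + diffs[0]
    # Geometric progression, e.g. 3 6 12 24 -> 48
    if all(x != 0 for x in seq):
        ratios = [b / a for a, b in zip(seq, seq[1:])]
        if len(set(ratios)) == 1:
            return seq[-1] * ratios[0]
    # Differences double each step, e.g. 3 5 9 17 33 -> 65
    if len(diffs) >= 2 and all(b == 2 * a for a, b in zip(diffs, diffs[1:])):
        return seq[-1] + 2 * diffs[-1]
    # Fibonacci-style, e.g. 0 1 1 2 3 5 8 -> 13
    if all(c == a + b for a, b, c in zip(seq, seq[1:], seq[2:])):
        return seq[-2] + seq[-1]
    return None  # unhandled format; fall back to a guess, like the original did

print(next_in_sequence([7, 10, 13, 16]))       # 19
print(next_in_sequence([3, 6, 12, 24]))        # 48.0
print(next_in_sequence([3, 5, 9, 17, 33]))     # 65
print(next_in_sequence([0, 1, 1, 2, 3, 5, 8])) # 13
```

A few dozen lines cover a noticeable chunk of a standard IQ test, which is exactly the point the authors were making.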
IQ tests are built based on the empirical observation that some capabilities in humans happen to correlate with each other, so measuring one is also predictive of the others. For AI systems whose cognitive architecture is very unlike a human one, these traits do not correlate in the same way, making the reported scores of AIs on these tests meaningless.
Heck, even the scores of humans on an IQ test are often invalid if the humans are from a different population than the one the test was normed on. E.g. some IQ tests measure the size of your vocabulary, and this is a reasonable proxy for intelligence because smarter people will have an easier time figuring out the meaning of a word from its context, thus accumulating a larger vocabulary. But this ceases to be a valid proxy if you e.g. give that same test to people from a different country who have not been exposed to the same vocabulary, to people of a different age who haven't had the same amount of time to be exposed to those words, or if the test is old enough that some of the words on it have ceased to be widely used.
Likewise, sure, an LLM that has been trained on the entire Internet could no doubt ace any vocabulary test... but that would say nothing about its general intelligence, just that it has been exposed to every word online and had an excessive amount of training to figure out their meanings. Nor does its getting any other subcomponent of the test right tell us anything in particular, other than "it happens to be good at this subcomponent of an IQ test, which in humans would correlate with more general intelligence but in an AI system may have very little correlation".
I don't know. I had a recent post where I talked about the different ways how it's confusing to even try to establish whether LLMs have functional feelings, let alone phenomenal ones.
I haven't read the full ELK report, just Scott Alexander's discussion of it, so I may be missing something important. But at least based on that discussion, it looks to me like ELK might be operating off premises that don't seem clearly true for LLMs.
Scott writes:
It seems to me that this is assuming that our training creates the AI's policy essentially from scratch: the training does a lot of things, some of which are what we want and some of which aren't, and unless we are very careful to reward only the things we want and none of the ones we don't want, the AI will end up doing things we don't want.
I don't know how future superintelligent AI systems will work, but if LLM training were like this, LLMs would work horrendously worse than they do. People paid to rate AI answers report working with "incomplete instructions, minimal training and unrealistic time limits to complete tasks" and say things like "[a]fter having seen how bad the data is that goes into supposedly training the model, I knew there was absolutely no way it could ever be trained correctly like that". Yet for some reason LLMs still do quite well on lots of tasks. And even if all raters worked under perfect conditions, they'd still be fallible humans.
It seems to me that LLMs are probably reasonably robust to noisy reward signals because a large part of what the training does is "upvoting" and tuning existing capabilities and simulated personas rather than creating them entirely from scratch. A base model trained to predict the world creates different kinds of simulated personas whose behavior would explain the data it sees; these include personas like "a human genuinely trying to do its best at task X", "a deceitful human", or "an honest human".
Scott writes:
This might happen. But it might also happen that the AI contains both a "genuinely protect the diamond" persona and a "manipulate the humans to believe that the diamond is safe" persona, with the various reward signals upvoting each to different degrees. Such a random process might indeed end up upvoting the "manipulate the humans" persona to some extent, but if the "genuinely protect the diamond" persona has gotten sufficiently upvoted by other signals, it still ends up being the dominant one. Then it doesn't matter that there's some noise and some upvoting of the "manipulate the humans" persona, as long as the "genuinely protect the diamond" persona gets more upvotes overall. And if the "genuinely protect the diamond" persona had been sufficiently upvoted from the start, the "manipulate the humans" one might end up with such a low prior probability that it would effectively never become active.
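As a crude toy illustration of the "more upvotes overall" point (this has nothing to do with how actual training updates work; the weights, the noise rate, and the episode count are all made up):

```python
import random

# Crude toy model: each "persona" has a weight; training episodes add noisy
# "upvotes" to whichever persona the reward signal happens to favor.
# Only meant to illustrate the head-start argument, not actual RL dynamics.

random.seed(0)

weights = {"genuinely protect the diamond": 5.0,   # strong head start from earlier training
           "manipulate the humans": 1.0}           # low prior

for episode in range(1000):
    # Most episodes the reward tracks genuine protection, but a noisy
    # fraction of them accidentally rewards successful manipulation.
    if random.random() < 0.15:
        weights["manipulate the humans"] += 1.0
    else:
        weights["genuinely protect the diamond"] += 1.0

total = sum(weights.values())
for persona, w in weights.items():
    print(f"{persona}: weight {w:.0f} ({w / total:.0%} of the total)")
```

With these made-up numbers the "manipulate the humans" persona keeps collecting some upvotes but ends up with only roughly 15% of the total weight; the only point is that some noise in the reward doesn't automatically mean the dishonest persona wins.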
Now of course none of this is a rigorous proof that things would work, and with our current approaches we still see a lot of reward hacking and so on. But it seems to me like a reasonable possibility that there could be an "honestly report everything that I've done" persona waiting inside most models, such that one could just upvote it in a variety of scenarios, after which it would get widely linked to the rest of the model's internals so as to always detect if some kind of deception was going on. And once that had happened, it wouldn't matter if some of the reward signals around honesty were noisy, because the established structure would be sufficiently robust and general to withstand the noise.