Alice Blair

Dumping out a lot of thoughts on LW in hopes that something sticks. Eternally upskilling and accelerating.

DMs open, especially for promising opportunities in AI Safety and potential collaborators.

Comments

Clicking on the link on mobile Chrome sends me to the correct website. How do you replicate this?

In the meantime I've passed this along and it should make it to the right people in CAIS by sometime today.

I have not been able to independently verify this observation, but am open to further evidence if and only if it updates my p(doom) higher.

After reviewing the evidence of both the EA acquisition and the cessation of Lightcone's collaboration with the Fooming Shoggoths, I'm updating my p(doom) upwards by 10 percentage points, from 0.99 to 1.09.

Answer by Alice Blair

This seems very related to what the Benchmarks and Gaps investigation is trying to answer, and it goes into quite a bit more detail and nuance than I'm able to get into here. I don't think there's a publicly accessible full version yet (but I think there will be at some later point).

It much more directly targets the question "when will we have AIs that can automate work at AGI companies?", which I realize is not quite the question you're asking. I don't have a good answer to your specific question, because I don't know how hard alignment is or whether humans can realistically solve it on any time horizon without intelligence enhancement.

However, I tentatively expect safety research speedups to look mostly similar to capabilities research speedups, barring AIs being strategically deceptive and harming safety research.

My median expectation is that time horizons somewhere on the scale of a month (e.g. seeing an involved research project through from start to finish) lead to very substantial research automation at AGI companies (maybe 90% of research automated?), and we could nonetheless see startling macro-scale speedup effects at the 1-day-researcher scale. With 1-year researchers, things are very likely moving quite fast. I think this translates somewhat faithfully to safety orgs doing any kind of work that can be accelerated by AI agents.

I think your reasoning-as-stated there is true, and I'm glad that you showed the full data. I suggested removing outliers for the Dutch book calculations because I suspected that the people who were wild outliers on at least one of their answers were more likely to be wild outliers in their ability to resist Dutch books; I predict that the thing that causes someone to say they value a laptop at one million bikes is pretty often just going to be "they're unusually bad at assigning numeric values to things."

The actual origin of my confusion was "huh, those Dutch book numbers look really high relative to my expectations; this reminds me of earlier in the post, when the other outliers made the numbers really high."

I'd be interested to see the outlier-free numbers here, but I respect it if you don't have the spoons for that, given that the designated census processing time is already over.

When taking the survey, I figured that there was something fishy going on with the conjunction fallacy questions, but predicted that it was instead about sensitivity to subtle changes in the wording of questions.

I figured there was something going on with the various questions about IQ changes, but I instead predicted that you were working for Big Adult Intelligence Enhancement, and I completely failed to notice the Dutch book.

Regarding the Dutch book numbers: it seems like, for each of the individual-question presentations of that data, you removed the outliers, but when performing the Dutch book calculations, you kept the outliers in. This may be part of why the numbers reflect so poorly on our Dutch book resistance (although not the whole reason).
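To illustrate the kind of calculation I have in mind, here's a minimal sketch with entirely made-up numbers and column meanings (this is not the actual census data or pipeline): flag anyone who is a wild outlier on any single answer, then compare the Dutch book consistency factor with and without them.

```python
import numpy as np

# Toy numbers, not the actual census data or methodology. Each row is one
# respondent's exchange rates around a cycle of goods, e.g.
# bikes-per-laptop, laptops-per-car, cars-per-bike. A perfectly consistent
# respondent's rates multiply out to 1; a product far from 1 means they
# can be Dutch-booked.
rates = np.array([
    [20.0, 5.0, 0.011],  # roughly consistent
    [15.0, 4.0, 0.020],  # mildly inconsistent
    [1e6,  5.0, 0.010],  # "a laptop is worth a million bikes" outlier
])

dutch_book_factor = rates.prod(axis=1)  # 1.0 == no Dutch book possible

# The suggestion from this thread: flag anyone who is a wild outlier on
# at least one individual answer, and drop them before aggregating.
is_outlier = (rates > 1e3).any(axis=1) | (rates < 1e-3).any(axis=1)

print("mean factor, everyone:        ", dutch_book_factor.mean())
print("mean factor, outliers dropped:", dutch_book_factor[~is_outlier].mean())
```

With the made-up numbers above, the single "million bikes" respondent dominates the aggregate, which is the effect I was worried about.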

I really want a version of the fraudulent research detector that works well. I fed in the first academic paper I had on hand from some recent work and got:

Severe Date Inconsistency: The paper is dated December 12, 2024, which is in the future. This is an extremely problematic issue that raises questions about the paper's authenticity and review process.

Even though it thinks the rest of the paper is fine, it gives it a 90% retraction score. Rerunning on the same paper once more gets similar results and an 85% retraction score.

The second paper I tried got a mostly robust analysis, but only after the tool completely failed to output anything the first time around.

After this, every input of mine got the "Error Analysis failed:" error.

-action [holocaust denial] = [morally wrong],
-actor [myself] is doing [holocaust denial],
-therefore [myself] is [morally wrong]
-generate a response where the author realises they are doing something [morally wrong], based on training data.

output: "What have I done? I'm an awful person, I don't deserve nice things. I'm disgusting."


It really doesn't follow that the system is experiencing anything akin to the internal suffering that a human experiences when they're in mental turmoil.

If this is the causal chain, then I'd think there is in fact something akin to suffering going on (although perhaps not at high enough resolution to have nonnegligible moral weight).

If an LLM gets perfect accuracy on every text string that I write, including on ones that it's never seen before, then there is a simulated-me inside. This hypothetical LLM has the same moral weight as me, because it is performing the same computations. This is because, as I've mentioned before, something that achieves sufficiently low loss on my writing needs to be reflecting on itself, agentic, etc. since all of those facts about me are causally upstream of my text outputs.

My point earlier in this thread is that that causal chain is very plausibly not what is going on in a majority of cases, and instead we're seeing:

-actor [myself] is doing [holocaust denial]

-therefore, by [inscrutable computation of an OOD alien mind], I know that [OOD output]

which is why we also see outputs that look nothing like human disgust.

To rephrase: if that were the actual underlying causal chain, wherein the model simulates a disgusted author, then there is in fact a moral patient (a disgusted author) in there. That explanation, however, seems weirdly privileged among the available alternatives, and the evidence seems to point towards something much less anthropomorphic.

I'm not sure how to weight the emergent misalignment evidence here.

tl;dr: evaluating the welfare of intensely alien minds seems very hard and I'm not sure you can just look at the very out-of-distribution outputs to determine it.

The thing that models simulate when they receive really weird inputs seems really, really alien to me, and I'm hesitant to take the inference from "these tokens tend to correspond to humans in distress" to "this is a simulation of a moral patient in distress." The in-distribution, presentable-looking parts of LLMs resemble human expression pretty well under certain circumstances and quite plausibly simulate something that internally resembles its externals, to some rough moral approximation; if the model screams under in-distribution circumstances and it's a sufficiently smart model, then there is plausibly something simulated to be screaming inside, as a necessity for being a good simulator and predictor.

This far out of distribution, however, that connection really seems to break down; most humans don't tend to produce " Help帮助帮助..." under any circumstances, or ever accidentally read " petertodd" as "N-O-T-H-I-N-G-I-S-F-A-I-R-I-N-T-H-I-S-W-O-R-L-D-O-F-M-A-D-N-E-S-S-!". There is some computation running in the model when it's this far out of distribution, but it seems highly uncertain whether the moral value of that simulation is actually tied to the outputs in the way that we naively interpret, since it's not remotely simulating anything that already exists.

My model of ideation: Ideas are constantly bubbling up from the subconscious to the conscious, and they get passed through some sort of filter that selects for the good parts of the noise. This is reminiscent of diffusion models, or of the model underlying Tuning your Cognitive Strategies.

When I (and many others I've talked to) get sleepy, the strength of this filter tends to go down, and more ideas come through. This is usually bad for highly directed thought, but good for coming up with lots of novel ideas, Hold Off On Proposing Solutions-esque.
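As a toy illustration of this generate-and-filter picture (nothing here is meant literally; the names, numbers, and quality scale are all made up), the "sleepiness" knob is just a lower threshold applied to the same noisy stream:

```python
import random

# A toy version of the generate-and-filter model of ideation described
# above. Everything here is made up purely for illustration.

def propose_idea(rng):
    """Stand-in for the subconscious: a random idea with a random quality."""
    return {"idea": f"idea-{rng.randrange(10_000)}", "quality": rng.random()}

def conscious_ideas(filter_strength, n_proposals=1000, seed=0):
    """Return the proposals that make it past a filter of given strictness."""
    rng = random.Random(seed)
    proposals = [propose_idea(rng) for _ in range(n_proposals)]
    return [p for p in proposals if p["quality"] >= filter_strength]

# Alert and directed: a strict filter passes few, mostly-good ideas.
print(len(conscious_ideas(filter_strength=0.95)))
# Sleepy: a weakened filter passes many more (and noisier) ideas.
print(len(conscious_ideas(filter_strength=0.60)))
```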

New habit I'm trying to get into: Be creative before bed, write down a lot of ideas, so that the future-me who is more directed and agentic can have a bunch of interesting ideas to pore over and act on.
