All of Alice Blair's Comments + Replies

For people who are on the fence about donating and want an outside opinion:

I think CAIP is one of the best orgs at what they do in DC, and I think that what they do is important for many of the reasons Jason laid out above. I continue to think that work toward good AI governance is one of the highest-leverage things we can do for AI safety right now. My experience is that the CAIP team is highly competent and very well networked across AI safety and the policy world.

My favorite review from a congressional staffer of one of the AI risk demos at a CAIP event is "... (read more)

Did I entirely miss these? I can't find the post anywhere. I am still interested in having more music.

3habryka
No, I ended up getting sick that week and other deadlines then pushed the work later. It will still happen, but maybe only closer to LessOnline (i.e. in about a month).

I work mostly as a distiller (of xrisk-relevant topics). I try to understand some big complex thing, package it up all nice, and distribute it. The "distribute it" step is something society has already found a lot of good tech for. The other two steps, not so much.

Loom is lovely in the times I've used it. I would love to see more work done on things like this, things that enhance my intelligence while keeping me very tightly in the loop. Other things in this vein include:

  • memory augmentation beyond just taking notes (+Anki where appropriate). I'm both int
... (read more)

Ai Labs is slightly better but still bad.

Could you give a link to this or a more searchable name? "Ai Labs" is very generic and turns up every possible result. Even if it's bad, I'd be interested in investigating something "slightly better" and hearing a bit about why.

2romeostevensit
Oops meant aistudio.google.com

Clicking on the link on mobile Chrome sends me to the correct website. How do you replicate this?

In the meantime I've passed this along and it should make it to the right people in CAIS by sometime today.

2lc
Works now for me

I have not been able to independently verify this observation, but am open to further evidence if and only if it updates my p(doom) higher.

After reviewing the evidence, both of the EA acquisition and the cessation of Lightcone collaboration with the Fooming Shoggoths, I'm updating my p(doom) upwards 10 percentage points, from 0.99 to 1.09.

6habryka
Where is it now? :P
Answer by Alice Blair

This seems very related to what the Benchmarks and Gaps investigation is trying to answer, and it goes into quite a bit more detail and nuance than I'm able to get into here. I don't think there's a publicly accessible full version yet (but I think there will be at some later point).

It much more targets the question "when will we have AIs that can automate work at AGI companies?" which I realize is not really your pointed question. I don't have a good answer to your specific question because I don't know how hard alignment is or if humans realistically sol... (read more)

I think your reasoning-as-stated there is true and I'm glad that you showed the full data. I suggested removing outliers for dutch book calculations because I suspected that the people who were wild outliers on at least one of their answers were more likely to be wild outliers on their ability to resist dutch books; I predict that the thing that causes someone to say they value a laptop at one million bikes is pretty often just going to be "they're unusually bad at assigning numeric values to things."

The actual origin of my confusion was "huh, those dutch ... (read more)

When taking the survey, I figured that there was something fishy going on with the conjunction fallacy questions, but predicted that it was instead about sensitivity to subtle changes in the wording of questions.

I figured there was something going on with the various questions about IQ changes, but I instead predicted that you were working for big adult intelligence enhancement, and I completely failed to notice the dutch book.

Regarding the dutch book numbers: it seems like, for each of the individual-question presentations of that data, you removed the ou... (read more)

2Screwtape
If I try this again next year I'm inclined to keep the wording the same instead of trying to be subtle. Yep. Well, in the individual reports I reported the version with the outliers, and then sometimes did another pass without outliers. I kept all of the entries that answered all the questions for the dutch book calculations, even if they were outliers. I think this is the correct move: if someone's valuations are wild outliers from everyone else's but in a way that multiplies out and gets them back to a 1:1 ratio, then being an outlier isn't a problem. (Imagine someone who values a laptop at one million bikes, a bike as equal to one car, and a car at one millionth of a laptop. They're almost certainly a wild outlier, and I'm confused as heck, but they are consistent in their values!)
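To make the consistency point concrete, here's a toy check of whether a cycle of pairwise valuations multiplies back out to 1:1 (a minimal sketch restating the laptop/bike/car example above; the function name and tolerance are my own, not from the survey analysis):

```python
from math import prod

def dutch_bookable(rates: list[float], tol: float = 0.05) -> bool:
    """A cycle of pairwise valuations is internally consistent only if the
    exchange rates multiply out to ~1; anything else can be exploited by
    trading around the loop (a Dutch book). Toy illustration only."""
    return abs(prod(rates) - 1.0) > tol

# 1 laptop = 1e6 bikes, 1 bike = 1 car, 1 car = 1e-6 laptops:
print(dutch_bookable([1e6, 1.0, 1e-6]))  # False: a wild outlier, but consistent
# 1 laptop = 10 bikes, 1 bike = 1 car, 1 car = 0.5 laptops:
print(dutch_bookable([10.0, 1.0, 0.5]))  # True: the loop multiplies out to 5
```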

I really want a version of the fraudulent research detector that works well. I fed in the first academic paper I had readily on hand from some recent work and got:

Severe Date Inconsistency: The paper is dated December 12, 2024, which is in the future. This is an extremely problematic issue that raises questions about the paper's authenticity and review process.

Even though it thinks the rest of the paper is fine, it gives it a 90% retraction score. Rerunning on the same paper once more gets similar results and an 85% retraction score.
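My guess is that the false positive comes from the model judging "the future" relative to its training cutoff rather than the actual run date. A minimal sketch of the kind of guard I'd want here (hypothetical, not from the linked tool; the names are my own):

```python
from datetime import date

def is_future_dated(paper_date: date, today: date | None = None) -> bool:
    """Hypothetical guard: only flag a paper as 'dated in the future' if its date
    is actually past the real current date, not a model's knowledge cutoff."""
    today = today or date.today()
    return paper_date > today

# A paper dated 2024-12-12 should only be flagged if the check runs before that date.
print(is_future_dated(date(2024, 12, 12)))
```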

The second pap... (read more)

3Austin Chen
Thanks for the feedback! I think the nature of a hackathon is that everyone is trying to get something that works at all, and "works well" is just a pipe dream haha. IIRC, there was some interest in incorporating this feature directly into Elicit, which would be pretty exciting. Anyways I'll try to pass your feedback to Panda and Charlie, but you might also enjoy seeing their source code here and submitting a Github issue or pull request: https://github.com/CG80499/paper-retraction-detection

- action [holocaust denial] = [morally wrong],
- actor [myself] is doing [holocaust denial],
- therefore [myself] is [morally wrong],
- generate a response where the author realises they are doing something [morally wrong], based on training data.

output: "What have I done? I'm an awful person, I don't deserve nice things. I'm disgusting."


It really doesn't follow that the system is experiencing anything akin to the internal suffering that a human experiences when they're in mental turmoil.

If this is the causal chain, then I'd think there is in fact something akin t... (read more)

7whestler
That's an interesting perspective. Having seen some evidence from various places that LLMs do contain models of the real world (sometimes literally!), and expecting them to have some part of that model represent themselves, this feels to me like the simple explanation of what's going on. Similarly, the emergent misalignment seems like it's a result of a manipulation of the representation of self that exists within the model. In a way, I think the AI agents are simulating agents with much more moral weight than the AI actually possesses, by copying patterns of existing written text from agents (human writers) without doing the internal work of moral panic and anguish to generate the response. I suppose I don't have a good handle on what counts as suffering. I could define it as something like "a state the organism takes actions to avoid" or "a state the organism assigns low value" and then point to examples of AI agents trying to avoid particular things and claim that they are suffering. Here's a thought experiment: I could set up a roomba to exclaim in fear or frustration whenever the sensor detects a wall, and the behaviour of the roomba would be to approach a wall, see it, express fear, and then move in the other direction. Hitting a wall (for a roomba) is an undesirable behaviour; it's something the roomba tries to avoid. Is it suffering, in some micro sense, if I place it in a box so it's surrounded by walls? Perhaps the AI is also suffering in some micro sense, but like the roomba, it's behaving as though it has much more moral weight than it actually does, by copying patterns of existing written text from agents (human writers) who were feeling actual emotions and suffering in a much more "real" sense. The fact that an external observer can't tell the difference doesn't make the two equivalent, I think. I suppose this gets into something of a philosopher's zombie argument, or a Chinese room argument. Something is out of whack here, and I'm beginni

tl;dr: evaluating the welfare of intensely alien minds seems very hard and I'm not sure you can just look at the very out-of-distribution outputs to determine it.

The thing that models simulate when they receive really weird inputs seems really really alien to me, and I'm hesitant to take the inference from "these tokens tend to correspond to humans in distress" to "this is a simulation of a moral patient in distress." The in-distribution, presentable-looking parts of LLMs resemble human expression pretty well under certain circumstances and quite plausibly... (read more)

But then why is it outputting those kinds of outputs, as opposed to anything else?

My model of ideation: Ideas are constantly bubbling up from the subconscious to the conscious, and they get passed through some sort of filter that selects for the good parts of the noise. This is reminiscent of diffusion models, or of the model underlying Tuning your Cognitive Strategies.

When I (and many others I've talked to) get sleepy, the strength of this filter tends to go down, and more ideas come through. This is usually bad for highly directed thought, but good for coming up with lots of novel ideas, Hold Off On Proposing Solutions-esque.
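As a toy illustration of that propose-then-filter picture (entirely made-up numbers and names, just to show the shape of the model):

```python
import random

def ideation_step(n_candidates: int = 100, filter_strength: float = 0.9) -> list[float]:
    """Toy model: the subconscious proposes noisy candidate 'ideas' and a
    conscious filter passes only those scoring above a threshold. Lowering
    filter_strength (e.g. when sleepy) lets more ideas through: worse for
    directed thought, better for generating lots of novel ideas."""
    candidates = [random.random() for _ in range(n_candidates)]
    return [idea for idea in candidates if idea > filter_strength]

print(len(ideation_step(filter_strength=0.9)))  # strict filter: few ideas pass
print(len(ideation_step(filter_strength=0.5)))  # sleepy filter: many more pass
```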

New habit... (read more)

Agency and reflectivity are phenomena that show up really broadly in text, and I think it's unlikely that memorizing a few facts is how a model ends up capturing them. Those traits are more concentrated in places like LessWrong, but they're almost everywhere. I think that to go from "fits the vibe of internet text and absorbs some of the reasoning" to "actually creates convincing internet text," you need more agency and reflectivity.

My impression is that "memorize more random facts and overfit" is less efficient for reducing perplexity than "learn something that genera... (read more)

Not to be a scaling-law denier. I believe in them, I do! But they measure perplexity, not general intelligence/real-world usefulness, and Goodhart's Law is no-one's ally.

If we're able to get perplexity sufficiently low on text samples that I write, then that means the LLM has a lot of the important algorithms running in it that are running in me. The text I write is causally downstream from parts of me that are reflective and self-improving, that notice the little details in my cognitive processes and environment, and the parts of me that are capable of... (read more)
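(To be concrete about the metric: perplexity is just the exponentiated average negative log-likelihood over tokens, so any prediction gain counts toward it, whether it comes from memorized trivia or from actually modeling the author. A minimal sketch, assuming per-token log-probabilities from some model:)

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(average negative log-likelihood) over observed tokens.
    Illustrative only; token_logprobs are natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

print(perplexity([-0.5, -2.3, -0.1, -1.7]))  # ~3.2
```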

Sure, but "sufficiently low" is doing a lot of work here. In practice, a "cheaper" way to decrease perplexity is to go for the breadth (memorizing random facts), not the depth. In the limit of perfect prediction, yes, GPT-N would have to have learned agency. But the actual LLM training loops may be a ruinously compute-inefficient way to approach that limit – and indeed, they seem to be.

My current impression is that the SGD just doesn't "want" to teach LLMs agency for some reason, and we're going to run out of compute/data long before it's forced to. It's p... (read more)

This post just came across my inbox, and there are a couple updates I've made (I have not talked to 4.5 at all and have seen only minimal outputs):

  • GPT-4.5 is already hacking some of the more susceptible people on the internet (in the dopamine gradient way)
  • GPT-4.5+reasoning+RL on agency (aka GPT-5) could probably be situationally aware enough to intentionally deceive (in line with my prediction in the above comment, which was made prior to seeing Zvi's post but after hearing about 4.5 briefly). I think that there are many worlds in which talking to GPT-5
... (read more)

My model was just that o3 was undergoing safety evals still, and quite plausibly running into some issues with the preparedness framework. My model of OpenAI Preparedness (epistemic status: anecdata+vibes) is that they are not Prepared for the hard things as we scale to ASI, but they are relatively competent at implementing the preparedness framework and slowing down releases if there are issues. It seems intuitively plausible that it's possible to badly jailbreak o3 into doing dangerous things in the "high" risk category.

1alxbbu
AFAIK the only info about the (lack of) release of o3 comes from this tweet - and it does seem like o3 will be available explicitly via the API.  I think it's what you say - pricing causing bad PR. It seems that even for pro users, if you use o1 pro too much they will silently switch you to using o3 mini without telling you even if you select o1 pro.  So I think they want to continue doing this while not looking so bad doing it. 
  • I'd use such an extension. Weakness: rephrasing still mostly doesn't work for systems determined to convey a given message. There's the fact that the information content of a dangerous meme is either 1. still preserved, or 2. the rephrasing is lossy. There's also the fact that determined LLMs can perform semantic-space steganography that persists even through paraphrasing (source) (good post on the subject)
  • I'm glad that my brain mostly-automatically has a strong ugh field around any sort of recreational conversation with LLMs. I derive a lot of value fro
... (read more)

I think we're mostly on the same page that there are things worth forgoing the "pure personal-protection" strategy for, we're just on different pages about what those things are. We agree that "convince people to be much more cautious about LLM interactions" is in that category. I just also put "make my external brain more powerful" in that category, since it seems to have positive expected utility for now and lets me do more AI safety research in line with what pre-LLM me would likely endorse upon reflection. I am indeed trying to be very cautious about t... (read more)

I do try to be calibrated instead of being frog, yes. Within the range of time in which present-me considers past-me remotely good as an AI forecaster, my time estimate for these sorts of deceptive capabilities has pretty linearly been going down, but to further help I set myself a reminder 3 months from today with a link to this comment. Thanks for that bit of pressure, I'm now going to generalize the "check in in [time period] about this sort of thing to make sure I haven't been hacked" reflex.

I agree that this is a notable point in the space of options. I didn't include it, and instead included the bunker line, because if you're going to be that paranoid about LLM interference (as is very reasonable to do), it makes sense to try and eliminate second-order effects and never talk to people who talk to LLMs, for they too might be meaningfully harmful, e.g. be under the influence of particularly powerful LLM-generated memes.

I also separately disagree that LLM isolation is the optimal path at the moment. In the future it likely will be. I'd bet that I... (read more)

1Alice Blair
This post just came across my inbox, and there are a couple updates I've made (I have not talked to 4.5 at all and have seen only minimal outputs): * GPT-4.5 is already hacking some of the more susceptible people on the internet (in the dopamine gradient way) * GPT-4.5+reasoning+RL on agency (aka GPT-5) could probably be situationally aware enough to intentionally deceive (in line with my prediction in the above comment, which was made prior to seeing Zvi's post but after hearing about 4.5 briefly). I think that there are many worlds in which talking to GPT-5 with strong mitigations and low individual deception susceptibility turns out okay or positive, but I am much more wary about taking that bet and I'm unsure if I will when I have the option to.
6Richard_Kennaway
Please review this in a couple of months ish and see if the moment to stop is still that distance away. The frog says "this is fine!" until it's boiled.
5Said Achmiz
With respect, I suggest to you that this sort of thinking is a failure of security mindset. (However, I am content to leave the matter un-argued at this time.) Yes… this is true in a personal-protection sense, I agree. And I do already try to stay away from people who talk to LLMs a lot, or who don’t seem to be showing any caution about it, or who don’t take concerns like this seriously, etc. (I have never needed any special reason to avoid Twitter, but if one does—well, here’s yet another reason for the list.) However, taking a pure personal-protection stance on this matter does not seem to me to be sensible even from a selfish perspective. It seems to me that there is no choice but to try to convince others, insofar as it is possible to do this in accordance with my principles. In other words, if I take on some second-order effect risk, but in exchange I get some chance of several other people considering what I say and deciding to do as I am doing, then this seems to me to be a positive trade-off—especially since, if one takes the danger seriously, it is hard to avoid the conclusion that choosing to say nothing results in a bad end, regardless of how paranoid one has been.

People often say "exercising makes you feel really good and gives you energy." I looked at this claim, figured it made sense based on my experience, and then completely failed to implement it for a very long time. So here I am again saying that no really, exercising is good, and maybe this angle will do something that the previous explanations didn't. Starting a daily running habit 4 days ago has already started being a noticeable multiplier on my energy, mindfulness, and focus. Key moments to concentrate force in, in my experience:

  • Getting started at all
... (read more)

Right now, the USG seems to very much be in [prepping for an AI arms race] mode. I hope there's some way to structure this that is both legal and does not require the explicit consent of the US government. I also somewhat worry that the US government does their own capabilities research, as hinted at in the "datacenters on federal lands" EO. I also also worry that OpenAI's culture is not sufficiently safety-minded right now to actually sign onto this; most of what I've been hearing from them is accelerationist.

3Yonatan Cale
US Gov isn't likely to sign: Seems right. OpenAI isn't likely to sign: Seems right. Still, I think this letter has value, especially if it has "P.S. We're making this letter because we think if everyone keeps racing then there's a noticeable risk of everyone dying. We think it would be worse if only we stop, but having everyone stop would be the safest, and we think this opinion of ours should be known publicly"

Interesting class of miscommunication that I'm starting to notice:

A: I'm considering a job in industries 1 and 2 

B: Oh I work in 2, [jumps into explanation of things that will be relevant if A goes into industry 2]. 

A: Oh maybe you didn't hear me, I'm also interested in industry 1. 

B: I... did hear you?

More generally, B gave the only relevant information they could from their domain knowledge, but A mistook that for anchoring on only one of the options. It took until I was on both sides of this interaction for me to be like "huh, maybe I should debug this." I suspect this is one of those issues where just being aware of it makes you less likely to fall into it.

I saw that news as I was polishing up a final draft of this post. I don't think it's terribly relevant to AI safety strategy, I think it's just an instance of the market making a series of mistakes in understanding how AI capabilities work. I won't get into why I think this is such a layered mistake here, but it's another reminder that the world generally has no idea what's coming in AI. If you think that there's something interesting to be gleaned from this mistake, write a post about it! Very plausibly, nobody else will.

Did you collect the data for their actual median timelines, or just its position relative to 2030? If you collected higher-resolution data, are you able to share it somewhere?

I really appreciate you taking the time and writing a whole post in response to my post, essentially. I think I fundamentally disagree with the notion that any part of this game is adversarial, however. There are competing tensions, one pulling to communicate more overtly about their feelings, and one pulling to be discreet and communicate less overtly. I don't see this as adversarial because I don't model the event "one person finds out that the other is into them" to be terminally bad, just instrumentally bad; it is bad because it can cause the bad things, which is w... (read more)

Ah that's interesting, thanks for finding that. I've never read that before, so that wasn't directly where I was drawing any of my ideas from, but maybe the content from the post made it somewhere else that I did read. I feel like that post is mostly missing the point about flirting, but I agree that it's descriptively outlining the same thing as I am.