Alice Blair

Dumping out a lot of thoughts on LW in hopes that something sticks. Eternally upskilling and accelerating.

DMs open, especially for promising opportunities in AI Safety and potential collaborators.

Comments

My model of ideation: Ideas are constantly bubbling up from the subconscious to the conscious, and they get passed through some sort of filter that selects for the good parts of the noise. This is reminiscent of diffusion models, or of the model underlying Tuning your Cognitive Strategies.

When I (and many others I've talked to) get sleepy, the strength of this filter tends to go down, and more ideas come through. This is usually bad for highly directed thought, but good for coming up with lots of novel ideas, Hold Off On Proposing Solutions-esque.
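
Here's a toy sketch of that model (purely illustrative; the uniform "quality" scores and the filter_strength knob are my own stand-ins, not anything measured): the subconscious proposes lots of noisy candidates, a conscious filter passes only the ones above a quality threshold, and sleepiness lowers that threshold.

```python
import random

def ideate(n_candidates: int, filter_strength: float) -> list[float]:
    """Toy ideation model: the subconscious proposes noisy candidate ideas,
    and the conscious filter passes only those above a quality threshold."""
    # Subconscious generation: candidates with random "quality" in [0, 1).
    candidates = [random.random() for _ in range(n_candidates)]
    # Conscious filtering: higher filter_strength -> fewer, better ideas pass.
    return [quality for quality in candidates if quality > filter_strength]

# Alert and directed: strong filter, few but high-quality ideas surface.
alert_ideas = ideate(n_candidates=1000, filter_strength=0.95)
# Sleepy: the filter weakens, and many more (noisier) ideas come through.
sleepy_ideas = ideate(n_candidates=1000, filter_strength=0.60)
print(len(alert_ideas), len(sleepy_ideas))  # e.g. ~50 vs ~400
```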

New habit I'm trying to get into: Be creative before bed, write down a lot of ideas, so that the future-me who is more directed and agentic can have a bunch of interesting ideas to pore over and act on.

Agency and reflectivity are phenomena that show up really broadly, and I think it's unlikely that memorizing a few facts is how a model will pick them up. Those traits are more concentrated in places like LessWrong, but they're present almost everywhere in internet text. I think that to go from "fits the vibe of internet text and absorbs some of the reasoning" to "actually creates convincing internet text," you need more agency and reflectivity.

My impression is that "memorize more random facts and overfit" is less efficient for reducing perplexity than "learn something that generalizes," for generating algorithms that show up this broadly in the data. There's a reason we see "approximate addition" instead of "memorize every addition problem," and "learn webdev" instead of "memorize every website."

The METR task time horizon numbers just keep going up, and I expect them to keep rising as models gain more of the bits and pieces of complex machinery required for operating coherently over long time horizons.

As for when we run out of data, I encourage you to look at this piece from Epoch. We run out of RL signal for R&D tasks even later than that.

Not to be a scaling-law denier. I believe in them, I do! But they measure perplexity, not general intelligence/real-world usefulness, and Goodhart's Law is no-one's ally.
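
For concreteness (my notation, not anything from the scaling-law papers themselves): perplexity on a token sequence is just the exponentiated average negative log-likelihood the model assigns to each next token,

```latex
\mathrm{PPL}(x_{1:N}) = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p_\theta\left(x_i \mid x_{<i}\right)\right)
```

so driving it down rewards anything that raises average next-token probability, whether that's memorization or genuinely learning the generating algorithm, which is exactly the gap Goodhart's Law lives in.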

If we're able to get perplexity sufficiently low on text samples that I write, that means the LLM is running a lot of the important algorithms that are running in me. The text I write is causally downstream of the parts of me that are reflective and self-improving, that notice the little details in my cognitive processes and environment, and the parts of me that can pursue goals across long inferential distances. An LLM agent that can mirror those properties (which current models cannot yet do) seems like it would very plausibly become a strong agent in a way we haven't seen before.

Falling perplexity is made up of LLMs grokking more and more pieces of the generating algorithm behind the text. I think the agent failures we've seen so far are explained by models not yet grokking the things that agency is made of, and I expect that trend to break once perplexity on my writing drops past the threshold where faithfully emulating my reflectivity and agency algorithms becomes necessary.

(This perplexity argument about reflectivity etc. is roughly equivalent to one of the arguments that Eliezer gave on Dwarkesh.)

This post just came across my inbox, and there are a couple updates I've made (I have not talked to 4.5 at all and have seen only minimal outputs):

  • GPT-4.5 is already hacking some of the more susceptible people on the internet (in the dopamine gradient way)
  • GPT-4.5+reasoning+RL on agency (aka GPT-5) could probably be situationally aware enough to intentionally deceive (in line with my prediction in the above comment, which was made prior to seeing Zvi's post but after hearing about 4.5 briefly). I think that there are many worlds in which talking to GPT-5 with strong mitigations and low individual deception susceptibility turns out okay or positive, but I am much more wary about taking that bet and I'm unsure if I will when I have the option to.

My model was just that o3 was undergoing safety evals still, and quite plausibly running into some issues with the preparedness framework. My model of OpenAI Preparedness (epistemic status: anecdata+vibes) is that they are not Prepared for the hard things as we scale to ASI, but they are relatively competent at implementing the preparedness framework and slowing down releases if there are issues. It seems intuitively plausible that it's possible to badly jailbreak o3 into doing dangerous things in the "high" risk category.

  • I'd use such an extension. Weakness: rephrasing still mostly doesn't work against systems determined to convey a given message. Either 1. the information content of a dangerous meme is still preserved, or 2. the rephrasing is lossy. There's also the fact that determined LLMs can perform semantic-space steganography that persists even through paraphrasing (source) (good post on the subject)
  • I'm glad that my brain mostly-automatically has a strong ugh field around any sort of recreational conversation with LLMs. I derive a lot of value from my recreational conversations with humans from the fact that there is a person on the other end. Removing this fact removes the value and the appeal. I can imagine this sort of thing hacking me anyways, if I somehow find my way onto one of these sites after we've crossed a certain capability threshold. Seems like a generally sound strategy that many people probably need to hear.

I think we're mostly on the same page that there are things worth forgoing the "pure personal-protection" strategy for; we're just on different pages about what those things are. We agree that "convince people to be much more cautious about LLM interactions" is in that category. I just also put "make my external brain more powerful" in that category, since it seems to have positive expected utility for now and lets me do more AI safety research in line with what pre-LLM me would likely endorse upon reflection. I am indeed trying to be very cautious about this process: trying to be corrigible to my past self, and to implement all of the mitigations I listed plus all the ones I don't have words for yet. It would be a failure of security mindset not to notice these things or not to see that they are important to deal with. Still, it is a bet I am making that the extra optimization power is worth it for now. I may lose that bet, and then that will be bad.

I do try to be calibrated instead of being the frog, yes. Within the window of time in which present-me considers past-me remotely good as an AI forecaster, my time estimate for these sorts of deceptive capabilities has been going down pretty linearly, but to further help, I've set myself a reminder 3 months from today with a link to this comment. Thanks for that bit of pressure; I'm now going to generalize the "check in in [time period] about this sort of thing to make sure I haven't been hacked" reflex.

I agree that this is a notable point in the space of options. I didn't include it, and instead included the bunker line, because if you're going to be that paranoid about LLM interference (as is very reasonable), it makes sense to try to eliminate second-order effects and never talk to people who talk to LLMs, since they too might be meaningfully harmful, e.g. by being under the influence of particularly powerful LLM-generated memes.

I also separately disagree that LLM isolation is the optimal path at the moment; in the future it likely will be. I'd bet that I'm still on the side of the line where I can safely navigate and pick up the utility, and I median-expect to stay there for the next couple of months or so. At a GPT-5-ish level I get suspicious and uncomfortable, and beyond that exponentially more so.

People often say "exercising makes you feel really good and gives you energy." I looked at this claim, figured it made sense based on my experience, and then completely failed to act on it for a very long time. So here I am saying it again: no really, exercising is good, and maybe this angle will do something the previous explanations didn't. Starting a daily running habit 4 days ago has already become a noticeable multiplier on my energy, mindfulness, and focus. Key moments to concentrate force in, in my experience:

  • Getting started at all
  • The moment when exhaustion meets the limits of your automatic willpower, and you need to put in conscious effort to keep going
  • The moment the next day when you decide whether or not to keep up the habit, despite the ugh field around exercise

Having a friend to exercise with is surprisingly positive. Having a workout tracker app is surprisingly positive too, because then I get to see a trendline, and suddenly I want to make it go up and keep the streak unbroken.

Many rationalists bucket themselves with the nerds, as opposed to the jocks. The people with brains, as opposed to the people with muscles. But we're here to win, to get utility, so let's pick up the cognitive multiplier that exercise provides.
