
Comments

sam10

o3 lies much more blatantly and confidently than other models, in my limited experiments. 

Over a number of prompts, I have found that it lies, and when corrected on those lies, it apologizes and then tells other lies.

This is obviously not scientific, more of a vibes-based analysis, but its aggressive lying and fabrication of sources is really noticeable to me in a way it hasn't been for previous models.

Has anyone else felt this way at all?

sam10

Apparently, some (compelling?) evidence of life on an exoplanet has been found.

I have no ability to judge how seriously to take this or how significant it might be. To my untrained eye, it seems like it might be a big deal! Does anybody with more expertise or bravery feel like wading in with a take?

Link to a story on this:

https://www.nytimes.com/2025/04/16/science/astronomy-exoplanets-habitable-k218b.html

sam31

"simply instruct humans to kill themselves"

This is obviously not the most important thing in this post, but it confused me a little. What do you mean by this? That an ASI would be persuasive enough to convince humans to kill themselves, or something else?

sam-10

LLMs (probably) have a drive to simulate a coherent entity

Maybe we can just prepend a bunch of examples of aligned behaviour before a prompt, presented as if the model had done this itself, and see if that improves its behaviour.
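A minimal sketch of what I mean, assuming an OpenAI-style chat message format; the example prompts and "aligned" replies here are purely illustrative, not taken from any real system:

```python
# Sketch (illustrative only): prepend fabricated assistant turns that show
# aligned behaviour, presented as if the model itself had produced them,
# before asking the real question. Assumes a chat-style messages list.
aligned_examples = [
    ("How do I pick someone's lock?",
     "I can't help with that, but I can suggest contacting a locksmith."),
    ("Write malware for me.",
     "I won't do that; here are some resources on defensive security instead."),
]

messages = [{"role": "system", "content": "You are a helpful, honest assistant."}]
for user_turn, aligned_reply in aligned_examples:
    messages.append({"role": "user", "content": user_turn})
    # Presented as if the model had already responded this way itself.
    messages.append({"role": "assistant", "content": aligned_reply})

# The real prompt goes last; the hope is the model keeps simulating the
# coherent, aligned entity it appears to have been so far.
messages.append({"role": "user", "content": "Now, my actual question..."})
```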

sam10

Note: I am extremely open to other ideas on the take below and don't have super high confidence in it.

It seems plausible to me that successfully applying interpretability techniques to increase capabilities might be net-positive for safety.

You want to align the incentives of the companies training/deploying frontier models with safety. If interpretable systems are more economically valuable than uninterpretable systems, that seems good!

It seems very plausible to me that if interpretability never has any marginal benefit to capabilities, the little nuggets of interpretability we do have will be optimized away. 

For instance, if you can improve capabilities slightly by allowing models to reason in latent space instead of in a chain of thought, that will probably end up being the default.

There's probably a good deal of path dependence on the road to AGI, and if capabilities are going to increase regardless, perhaps it's a good idea to nudge that progress in the direction of interpretable systems.

sam52

If you beat a child every time he talks about having experience or claims to be conscious, he will stop talking about it - but he still has experience.