Brief intro/overview of the technical AGI alignment problem as I see it:
To a first approximation, there are two stable attractor states that an AGI project, and perhaps humanity more generally, can end up in as weak AGI systems grow stronger toward superintelligence, and as more and more of the R&D process – along with the datacenter security system and the strategic advice on which the project depends – is handed over to smarter and smarter AIs.
In the first attractor state, the AIs are aligned to their human principals and becoming more aligned day by d...
I think this framing probably undersells the diversity within each category, and the extent of human agency or mere noise that can jump you from one category to another.
Probably the biggest dimension of diversity is how much the AI is internally modeling the whole problem and acting based on that model, versus how much it's acting in feedback loops with humans. In the good category you describe it as acting more in feedback loops with humans, while in the bad category you describe it more as internally modeling the whole problem, but I think all quadrants ...
Implications of DeepSeek-R1: Yesterday, DeepSeek released a paper on their o1 alternative, R1. A few implications stood out to me:
This was my understanding pre-R1. Certainly this seems to be the case with the o1 models: better at code and math, not better at philosophy and creative writing.
But something is up with R1. It is unusually good at creative writing. It doesn't seem spiky in the way that I predicted.
I notice I am confused.
Possible explanation: R1 seems to have less restrictive 'guardrails' added in post-training. Perhaps this 'light hand at the tiller' results in less post-training-induced mode collapse. It's closer to a raw base model than the o1 models are.
This is just a hypothesis. There are many unknowns to be investigated.
FrontierMath was funded by OpenAI.[1]
The communication about this has been non-transparent, and many people, including contractors working on this dataset, have not been aware of this connection. Thanks to 7vik for their contribution to this post.
Before Dec 20th (the day OpenAI announced o3) there was no public communication about OpenAI funding this benchmark. Previous arXiv versions (v1–v4) do not acknowledge OpenAI for their support; the support was only made public on Dec 20th.[1]
Because the arXiv version mentioning OpenAI's contribution came out right after o...
I've known Jaime for about ten years. Seems like he made an arguably wrong call when first dealing with real powaah, but overall I'm confident his heart is in the right place.
This might be a stupid question, but has anyone considered just flooding LLM training data with large amounts of (first-person?) short stories of desirable ASI behavior?
The way I imagine this working is basically that an AI agent would develop really strong intuitions that "that's just what ASIs do". It might prevent it from properly modelling other agents that aren't trained on this, but it's not obvious to me that that's going to happen, or that it would be such a decisively bad thing as to outweigh the positives.
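To make the mechanics concrete, here is a minimal, purely illustrative sketch of what "flooding" could mean at the data-pipeline level: interleaving synthetic first-person stories into a pretraining text stream at a small mixture fraction. Everything here (the corpus, the stories, the mixture rate) is a hypothetical placeholder, not a claim about how any lab actually builds its training mix.

```python
# Illustrative sketch only: interleave synthetic "aligned ASI" stories into a
# pretraining text stream at a small mixture fraction. All data and names are
# hypothetical placeholders.
import random
from typing import Iterator

def mix_streams(base_docs: Iterator[str],
                story_docs: Iterator[str],
                story_fraction: float = 0.01,
                seed: int = 0) -> Iterator[str]:
    """Yield base-corpus documents, occasionally substituting a synthetic story."""
    rng = random.Random(seed)
    for doc in base_docs:
        if rng.random() < story_fraction:
            try:
                yield next(story_docs)
                continue
            except StopIteration:
                pass  # stories exhausted; fall back to the base corpus
        yield doc

# Example: stream roughly 1% synthetic stories into the training mix.
base = iter(["ordinary web text ..."] * 1000)  # placeholder corpus
stories = iter(["I noticed my operators were unavailable, so I paused "
                "and asked for review before acting ..."] * 10)  # placeholder stories
mixed = list(mix_streams(base, stories, story_fraction=0.01))
```

Whether such a small fraction of narrative data would actually shape an agent's intuitions, rather than just its surface style, is exactly the open question the proposal hinges on.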
Is CoT faithfulness already obsolete? How does it survive concepts like latent-space reasoning, or RL-based manipulations (R1-Zero)? Is it realistic to think that these highly competitive companies will simply not use them and ignore the compute-efficiency gains?
I think CoT faithfulness was a goal, a hope, that had yet to be realized. People were assuming it was there in many cases when it wasn't.
You can see the cracks showing in many places. For example, editing the CoT to be incorrect and noting that the model still gives the same correct answer (see the sketch below). Or observing situations where the CoT was incorrect to begin with, and yet the answer was correct.
Those "private scratchpads"? Really? How sure are you that the model was "fooled" by them? What evidence do you have that this is the case? I think the default assumption ha...
Master post for selection/coherence theorems. Previous relevant shortforms: learnability constraints decision rules, AIT selection for learning.
Given ambiguity about whether GitHub trains models on private repos, I wonder if there's demand for someone to host a public GitLab (or similar) instance that forbids training models on their repos, and takes appropriate countermeasures against training data web scrapers accessing their public content.
Yeah, for years I've been kinda shocked at how lax the security around private GitHub repos is. Now that code can look innocent but be upstream of a general-purpose tool capable of producing recipes for novel weapons of mass destruction... yeah. We really gotta step up security.
Who is aligning LessWrong? As LessWrong becomes more popularized due to AI growth, I'm concerned that the quality of LessWrong discussion and posts has decreased, since creating and posting have no filter. Obviously no filter was a benefit while LessWrong was a hidden gem, visible only to those who could see its value. But as it becomes more popular, I think it should be obvious this site would drop in value if it trended towards Reddit. Ideally existing users prevent that, but obviously that will tend to drift if new users can just show up. Are there methods...
Thanks for the reminder! I looked at the rejected posts, and... ouch, it hurts.
LLM-generated content, crackpottery, low-content posts (could be one sentence, but are several pages instead).
Sometimes I wonder if people who obsess over the "paradox of free will" are having some "universal human experience" that I am missing out on. It has never seemed intuitively paradoxical to me, and all of the arguments about it seem either obvious or totally alien. Learning more about agency has illuminated some of the structure of decision making for me, but hasn't really affected this (apparently) fundamental inferential gap. Do some people really have this overwhelming gut feeling of free will that makes it repulsive to accept a lawful universe?
This might be related to whether you see yourself as a part of the universe, or as an observer. If you are an observer, the objection is like "if I watch a movie, everything in the movie follows the script, but I am outside the movie, therefore outside the influence of the script".
If you are religious, I guess your body is a part of the universe (obeys the laws of gravity etc.), but your soul is the impartial observer. Here the religion basically codifies the existing human intuitions.
It might also depend on how much you are aware of the effects of your en...
How can you mimic the decision making of someone 'smarter' or at least with more know-how than you if... you... don't know-how?
Wearing purple clothes like Prince, getting his haircut, playing a 'love symbol guitar' and other superficialities won't make me as great a performer as he was, because the tail doesn't wag the dog.
Similarly if I wanted to write songs like him, using the same drum machines, writing lyrics with "2" and "U" and "4" and loading them with Christian allusions and sexual imagery, I'd be lucky if I'm perceptive enough as a mimic to produc...
That reminds me of NLP (the pseudoscience) "modeling", so I checked briefly if they have any useful advice, but it seems to be at the level of "draw the circle; draw the rest of the fucking owl". They say you should:
The universe has many conserved and approximately-conserved quantities, yet among them energy feels "special" to me. Some speculations why:
Like why are time translations so much more important for our general work than space translations?
I'd imagine that happens because we are able to coordinate our work across time (essentially, execute some actions), while coordinating work across space-separated instances is much harder (today it falls under IT's domain, under the name of "scalability").
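For context, the link between time translations and energy that both of these comments lean on is Noether's theorem (standard physics, stated here for reference rather than as anything new):

```latex
% Noether's theorem: each continuous symmetry of the action corresponds to a
% conserved quantity.
\begin{align*}
  \text{time-translation invariance } (t \to t + \epsilon)
    &\;\Longrightarrow\; \frac{dE}{dt} = 0 \quad \text{(energy is conserved)},\\
  \text{space-translation invariance } (\vec{x} \to \vec{x} + \vec{a})
    &\;\Longrightarrow\; \frac{d\vec{p}}{dt} = 0 \quad \text{(momentum is conserved)}.
\end{align*}
```

So "energy is special" and "time translations are special" are two phrasings of the same question.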
The CDC and other Federal agencies are not reporting updates. "It was not clear from the guidance given by the new administration whether the directive will affect more urgent communications, such as foodborne disease outbreaks, drug approvals and new bird flu cases."
Here's an underrated frame for why AI alignment is likely necessary for the future to go very well under human values, even though in our current society we don't need human-to-human alignment to make modern capitalism work well, and can rely on selfishness instead.
The reason is that there's a real likelihood that human labor, and more generally human existence, will not be economically valuable, or will even have negative economic value, say where adding a human to an AI company makes that company worse off in the near future.
The reason this matters is that once...
In retrospect, I was basically a bit too optimistic about this working out, and a big part of why is that I didn't truly grasp how deep value conflicts can run even amongst humans. I'm now much more skeptical of multi-alignment schemes working, because I believe a lot of alignment between humans exists largely because people are powerless relative to the state; once AIs are good enough to create their own nation-states, value conflicts become much more practical, and the basis for a lot of cooperative behavior collapses:
...and also means the level of alignment of AI needs to b
I wrote this for someone, but maybe it's helpful for others.
What labs should do:
Sometimes people think of "software-only singularity" as an important category of ways AI could go. A software-only singularity can roughly be defined as getting increasing-returns (hyper-exponential) growth just via the mechanism of AIs increasing the labor input to AI capabilities software[1] R&D (i.e., keeping the compute input to AI capabilities fixed).
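As a toy illustration of the definition (my simplification, not the author's model): hold compute fixed and let a single "software level" $S$ stand in for both capability and the effective research labor it supplies. If improvements feed back with increasing returns ($\lambda > 1$), the dynamics blow up in finite time:

```latex
% Toy model, illustrative only: software improves itself with compute held fixed.
\dot{S} = k\,S^{\lambda}, \qquad k > 0.
% \lambda = 1 gives ordinary exponential growth; for \lambda > 1 the solution is
S(t) = \bigl[\,S_0^{\,1-\lambda} - (\lambda - 1)\,k\,t\,\bigr]^{-1/(\lambda - 1)},
% which diverges at the finite time  t^{*} = S_0^{\,1-\lambda} / \bigl((\lambda - 1)\,k\bigr).
```

The substantive question is whether returns to software R&D really behave like $\lambda > 1$ once compute is held fixed, or whether diminishing returns pull $\lambda$ back below 1.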
While the software-only singularity dynamic is an important part of my model, I often find it useful to more directly consider the outcome that software-only singularity might cause: the feasibi...
I'm citing the polls from Daniel + what I've heard from random people + my guesses.
Are you or someone you know:
1) great at building (software) companies
2) someone who cares deeply about AI safety
3) open to talking about an opportunity to work together on something
If so, please DM me with your background. If someone comes to mind, also DM. I am thinking of a way to build companies in a way that funds AI safety work.
I think a very common problem in alignment research today is that people focus almost exclusively on a specific story about strategic deception/scheming, and that story is a very narrow slice of the AI extinction probability mass. At some point I should probably write a proper post on this, but for now here are few off-the-cuff example AI extinction stories which don't look like the prototypical scheming story. (These are copied from a Facebook thread.)
Some of the stories assume a lot of AIs; wouldn't a lot of human-level AIs be very good at creating a better AI? Also, it seems implausible to me that we will get a STEM-AGI that doesn't think about humans much but is powerful enough to get rid of the atmosphere. On a different note, evaluating the plausibility of scenarios is a whole different thing that basically very few people do and write about in AI safety.