Building gears-level models is expensive - often prohibitively expensive. Black-box approaches are usually cheaper and faster. But black-box approaches rarely generalize - they need to be rebuilt when conditions change, don’t identify unknown unknowns, and are hard to build on top of. Gears-level models, on the other hand, offer permanent, generalizable knowledge which can be applied to many problems in the future, even if conditions shift.
Say you want to plot some data. You could just plot it by itself:
Or you could put lines on the left and bottom:
Or you could put lines everywhere:
Or you could be weird:
Which is right? Many people treat this as an aesthetic choice. But I’d like to suggest an unambiguous rule.
First, try to accept that all axis lines are optional. I promise that readers will recognize a plot even without lines around it.
So consider these plots:
Which is better? I claim this depends on what you’re plotting. To answer, mentally picture these arrows:
Now, ask yourself, are the lengths of these arrows meaningful? When you draw that horizontal line, you invite people to compare those lengths.
You use the same principle for deciding if you should draw a y-axis line. As...
Yeah, there seems to be a lot of personal preference involved. Removing cell borders is obnoxious and inconvenient, the table below hurts. The table above has the borders a tad too thick, but removing them is a cure that's, personally, worse than the disease.
Recently, there's been a fair amount of pushback on the "canonical" views towards the difficulty of AGI Alignment (the views I call the "least forgiving" take).
Said pushback is based on empirical studies of how the most powerful AIs at our disposal currently work, and is supported by fairly convincing theoretical basis of its own. By comparison, the "canonical" takes are almost purely theoretical.
At a glance, not updating away from them in the face of ground-truth empirical evidence is a failure of rationality: entrenched beliefs fortified by rationalizations.
I believe this is invalid, and that the two views are much more compatible than might seem. I think the issue lies in the mismatch between their subject matters.
It's clearer if you taboo the word "AI":
One additional probably important distinction / nuance: there are also theoretical results for why CoT shouldn't just help with one-forward-pass expressivity, but also with learning. E.g. the result in Auto-Regressive Next-Token Predictors are Universal Learners is about learning; similarly for Sub-Task Decomposition Enables Learning in Sequence to Sequence Tasks and for Why Can Large Language Models Generate Correct Chain-of-Thoughts?.
The learning aspect could be strategically crucial with respect to what the first transformatively-useful AIs should look ...
- For scheming, the model could reason about "should I still stay undercover", "what should I do in case I should stay undercover" and "what should I do in case it's time to attack" in parallel, finally using only one serial step to decide on its action.
I am also very interested in e.g. how one could operationalize the number of hops of inference of out-of-context reasoning required for various types of scheming, especially scheming in one-forward-pass; and especially in the context of automated AI safety R&D.
I did an exploration into how Community Notes (formerly Birdwatch) from X (formerly Twitter) works, and how its algorithm decides which notes get displayed to the wider community. In this post, I’ll share and explain what I found, as well as offer some comments.
Community Notes is a fact-checking tool available to US-based users of X/Twitter which allows readers to attach notes to posts to give them clarifying context. It uses an open-source bridging-based ranking algorithm intended to promote notes which receive cross-partisan support, and demote notes with a strong partisan lean. The tool seems to be pretty popular overall, and most of the criticism aimed toward it seems to be about how Community Notes fails to be a sufficient replacement for other, more top-down moderation systems.[1]
This seems interesting to me as an...
Lex Fridman posts timestamped transcripts of his interviews. It's an 83 minute read here and a 115 minute watch on Youtube.
It's neat to see Altman's side of the story. I don't know whether his charisma is more like +2SD or +5SD above the average American (concept origin: planecrash, likely doesn't follow a normal distribution), and I only have a vague grasp of what kinds of shenanigans +5SDish types can do when they pull out the stops in face-to-face interactions, so maybe you'll prefer to read the transcript over watching the video (although they're largely related to reading and responding to your facial expression and body language on the fly, not projecting their own).
If you've missed it, Gwern's side of the story is here.
...Lex Fridman(00:01:05) Take me through
I listened off and on to much of the interview, while also playing solitaire (why I do that I do not know, but I do), but I paid close attention at two points during the talk about GPT-4, once following about 46:00 where Altman was talking about using it as a brainstorming partner and later at about 55:00 where Fridman mentioned collaboration and said: "I'm not sure where the magic is if it's in here [gestures to his head] or if it's in there [points toward the table] or if it's somewhere in between." I've been in a kind of magical collaborative zone with ...
I have anxiety and depression.
The kind that doesn’t go away, and you take pills to manage.
This is not a secret.
What’s more interesting is that I just switched medications from one that successfully managed the depression but not the anxiety to one that successfully manages the anxiety but not the depression, giving me a brief window to see my two comorbid conditions separated from each other, for the first time since ever.
What follows is a (brief) digression on what they’re like from the inside.
I’m still me when I’m depressed.
Just a version of me that’s sapped of all initiative, energy, and tolerance for human contact.
There are plenty of metaphors for depression - a grey fog being one of the most popular - but I often think of it in...
Both gravity and inertia are determined by mass. Both are explained by spacetime curvature in general relativity. Was this an intentional part of the metaphor?
I've been in many conversations where I've mentioned the idea of using neuroscience for outer alignment, and the people who I'm talking to usually seem pretty confused about why I would want to do that. Well, I'm confused about why one wouldn't want to do that, and in this post I explain why.
As far as I see it, there are three main strategies people have for trying to deal with AI alignment in worlds where AI alignment is hard.
In my opinion, these are all great efforts, but I personally like the idea of working on value alignment directly. Why? First some negatives of the others:
...First is that I don't really expect us to come up with a fully general answer to this problem in time. I wouldn't be surprised if we had to trade off some generality for indexing on the system in front of us - this gets us some degree of non-robustness, but hopefully enough to buy us a lot more time before stuff like the problem behind deep deception breaks a lack of True Names. Hopefully then we can get the AI systems to solve the harder problem for us in the time we've bought, with systems more powerful than us. The relevance here is that if this is the
S-risks are barely discussed in LW, is that because: