Post summary (feel free to suggest edits!):
‘Setting the Zero Point’ is a “Dark Art” ie. something which causes someone else’s map to unmatch the territory in a way that’s advantageous to you. It involves speaking in a way that takes for granted that the line between ‘good’ and ‘bad’ is at a particular point, without explicitly arguing for that. This makes changes between points below and above that line feel more significant.
As an example, many people draw a zero point between helping and not helping a child drowning in front of them. One is good, one is ba...
Currently it's all manual, but the ChatGPT summaries are pretty decent, and I'm looking into which types of posts it does well.
Post summary (feel free to suggest edits!):
The author argues that if today’s AI development methods lead directly to powerful enough AI systems, disaster is likely by default (in the absence of specific countermeasures).
This is because there is good economic reason to have AIs ‘aim’ at certain outcomes - eg. we might want an AI that can accomplish goals such as ‘get me a TV for a great price’. Current methods train AIs to do this via trial and error, but because we ourselves are often misinformed, we can sometimes negatively reinforce truthful behavi...
Post summary (feel free to suggest edits!):
Various people have proposed variants of “align AGI by making it sufficiently uncertain about whether it’s in the real world versus still in training”. This seems unpromising because AGI could still have bad outcomes if convinced, and convincing it would be difficult.
Non-exhaustive list of how it could tell it’s in reality:
Post summary (feel free to suggest edits!):
The author gives examples where their internal mental model suggested one conclusion, but a low-information heuristic like expert or market consensus differed, so they deferred. This included:
Another common case of this principle is assuming something won’t work in a particular...
Cheers, edited :)
Post summary (feel free to suggest edits!):
The Diplomacy AI got a handle on the basics of the game, but didn’t ‘solve it’. It mainly does well by avoiding common mistakes such as failing to communicate with victims (thus signaling intention), or forgetting the game ends after the year 1908. It also benefits from anonymity, one-shot games, short round limits etc.
Some things were easier than expected eg. defining the problem space, communications generic and simple and quick enough to easily imitate and even surpass humans, no reputational or decision t...
Post summary (feel free to suggest edits!):
In chatting with ChatGPT, the author found it contradicted itself and its previous answers. For instance, it said that orange juice would be a good non-alcoholic substitute for tequila because both were sweet, but when asked if tequila was sweet it said it was not. When further quizzed, it apologized for being unclear and said “When I said that tequila has a "relatively high sugar content," I was not suggesting that tequila contains sugar.”
This behavior is worrying because the system has the capacity to produce co...
Post summary (feel free to suggest edits!):
Last year, the author wrote up a plan they gave a “better than 50/50 chance” would work before AGI kills us all. This predicted that in 4-5 years, the alignment field would progress from preparadigmatic (unsure of the right questions or tools) to having a general roadmap and toolset.
They believe this is on track and give 40% likelihood that over the next 1-2 years the field of alignment will converge toward primarily working on decoding the internal language of neural nets - with interpretability on th...
Thanks for the feedback! I've passed it on.
It's mainly because we wanted to keep the episodes to ~20m, to make them easy for people to keep up with week to week - and the LW posts tended toward the more technical side, which doesn't translate as easily in podcast form (it can be hard to take in without the writing in front of you). We may do something for the LW posts in future though, unsure at this point.
Thanks, realized I forgot to add the description of the top / curated section - fixed. Everything in there occurs in its own section too.
Thanks, great to hear!
Thanks for the info - added to post
Super interesting, thanks!
If you were running it again, you might want to think about standardizing the wording of the questions - it varies from 'will / is' to 'is likely' to 'plausible' and this can make it hard to compare between questions. Plausible in particular is quite a fuzzy word: for some it might mean 1% or more, for others it might just mean it's not completely impossible / if a movie had that storyline, they'd be okay with it.
Great to hear, thanks :-)
Good point, thank you - I've had a re-read of the conclusion and replaced the sentence with "Due to this, he concludes that climate change is still an important LT area - though not as important as some other global catastrophic risks (eg. biorisk), which outsize on both neglectedness and scale."
Originally I think I'd mistaken his position a bit based on this sentence: "Overall, because other global catastrophic risks are so much more neglected than climate change, I think they are more pressing to work on, on the margin." (and in addition I hadn't used the clearest phrasing) But the wider conclusion fits the new sentence better.
Post summary (feel free to suggest edits!):
The author argues that the “simulators” framing for LLMs shouldn’t reassure us much about alignment. Scott Alexander has previously suggested that LLMs can be thought of as simulating various characters eg. the “helpful assistant” character. The author broadly agrees, but notes this solves neither outer (‘be careful what you wish for’) nor inner (‘you wished for it right, but the program you got had ulterior motives’) alignment.
They give an example of each failure case:
For outer alignment, say researcher...