Jack O'Brien - LessWrong

Ummmmm yeah, what have I done so far? I didn't really get any solid work done this week either. I have decided to extend the project by another two weeks with the other two people involved; we have all been pretty preoccupied with life. Last Sunday night I didn't really do a solid hour of work. I did manage to summarise the concept of selection theorems and think about the agent type signature, a super fundamental concept I will be referring to throughout the post. Tonight I will hopefully actually meet with my group. I want to do about half an hour of work beforehand, and a little bit after too. This week I want to summarise the good regulator theorem as well as Turner's post on power seeking.

Jack O'Brien's Shortform

Jack O'Brien2mo30

Well, haven't got much done in the last 2 weeks. Life has gotten in the way, and in the times where I thought I actually had the time and headspace to work on the project, things happened like my shoulder got injured playing sport, and my laptop mysteriously died.

But I have managed to create a GitHub repo, and read the original posts on selection theorems. My list of selection theorems to summarize has grown. Check out the GitHub page: https://github.com/jack-obrien/selection-theorems-review

Tonight I will try to do at least an hour of solid work on it. I want to summarize the idea of selection theorems, summarize the good regulator theorem, and start reading the next post (probably Turner's post on power seeking).

Jack O'Brien's Shortform

Jack O'Brien2mo30

**Progress Report: AI Safety Fundamentals Project**

This is a public space for me to keep updated on my AI safety fundamentals project. The project will take 4 weeks. My goal is to stay lean and limit my scope so I can actually finish on time. I aim to update this post at least once per week, but maybe more often.

Overall, I want to work on agent foundations and the theory behind AI alignment agendas. One stepping stone for this is selection theorems: a research program to find justifications that a given training process will result in a given agent property.

My plan for the AGISF project: a literature review on selection theorems. Take a whole load of concepts / blog posts, read them, and riff on them if I feel like it. At minimum, write a one-paragraph summary of each post I'm interested in. List of posts:

John's original posts on selection theorems, and Adam Khoja's distillation of them.
Scott Garrabrant's stuff on geometric rationality.
Coherence theorems for utility theory.
Evolutionary biology shallow dive and explanation of Price's and Fisher's equations.
Maybe some stuff by Thane Ruthenis.
Some content from Jaynes' probability theory about Bayesianism vs frequentism.
Power seeking is instrumentally convergent in MDPs.
??? More examples to come once I read John's original post.

TODO:

Make an initial LessWrong progress report.
Make a list of things to read.
Make a git repo on my PC with Markdown and MathJax support. Populate the initial document with the list of things to read. For each thing I read, remove it from the TODO list and put its summary in the main body of the blog post. When I am done, any posts still left on the TODO list will get formatted and added as an 'additional reading' section.

Lucie Philippon's Shortform

Jack O'Brien1y42

I think this is a good thing to do! I recommend looking up things like "reflections on my LTFF upskilling grant" for similar pieces from lesser-known researchers / aspiring researchers.

Ten Levels of AI Alignment Difficulty

Jack O'Brien1y1211

Hey, thanks for writing this up! I thought you communicated the key details excellently, in particular the 3 camps of varying alignment difficulty worlds and the variation within those camps. I also think you included just enough caveats and extra details to give readers more to think about, without washing out the key ideas of the post.

Just wanted to say thanks, this post makes a great reference for me to link to.

MDPs and the Bellman Equation, Intuitively Explained

Jack O'Brien1y10

Yep that's right, fixed :)

Introduction to Cartesian Frames

Jack O'Brien1y10

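Definition: $Prevent(C) = \{S \subseteq W \mid \exists a \in A \text{ s.t. } \forall e \in E,\ a \cdot e \notin S\}$.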

A useful alternate definition of this is:
$Prevent(C) = \{S^{C} \mid S \in Ensure(C)\}$
where $S^{C}$ refers to $W \setminus S$. Proof:

$\begin{aligned} Prevent(C) &= \{S \subseteq W \mid \exists a \in A \text{ s.t. } \forall e \in E,\ a \cdot e \notin S\} \\ &= \{S \subseteq W \mid \exists a \in A \text{ s.t. } \forall e \in E,\ a \cdot e \in S^{C}\} \\ &= \{S^{C} \mid S \in Ensure(C)\} \end{aligned}$
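
As a sanity check, the identity can be verified by brute force on a tiny frame. The sketch below is only an illustration: the toy frame (`A`, `E`, `W`, `table`) and the helper names are made up for this example, and it assumes the definition of $Ensure(C)$ that the last step of the proof relies on, namely $Ensure(C) = \{S \subseteq W \mid \exists a \in A \text{ s.t. } \forall e \in E,\ a \cdot e \in S\}$.

```python
from itertools import combinations


def powerset(world):
    """All subsets of `world`, each as a frozenset."""
    items = sorted(world)
    return [
        frozenset(combo)
        for size in range(len(items) + 1)
        for combo in combinations(items, size)
    ]


def ensure(agent, env, world, dot):
    """Subsets S of W such that some a lands in S against every e (assumed definition)."""
    return {
        S for S in powerset(world)
        if any(all(dot(a, e) in S for e in env) for a in agent)
    }


def prevent(agent, env, world, dot):
    """Subsets S of W such that some a stays out of S against every e."""
    return {
        S for S in powerset(world)
        if any(all(dot(a, e) not in S for e in env) for a in agent)
    }


# Hypothetical toy frame: two agent options, two environments, three worlds.
A = ["a0", "a1"]
E = ["e0", "e1"]
W = frozenset({"w0", "w1", "w2"})
table = {
    ("a0", "e0"): "w0", ("a0", "e1"): "w1",
    ("a1", "e0"): "w2", ("a1", "e1"): "w2",
}


def dot(a, e):
    """The evaluation function a . e of the toy frame."""
    return table[(a, e)]


ens = ensure(A, E, W, dot)
prev = prevent(A, E, W, dot)

# The alternate definition: Prevent(C) is the set of complements of Ensure(C).
complements_of_ensure = {W - S for S in ens}
identity_holds = prev == complements_of_ensure
```

Enumerating the power set is only feasible for very small $W$, so this is a check on one example, not a substitute for the proof.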

You are probably not a good alignment researcher, and other blatant lies

Jack O'Brien2y41

This felt great to read. Thanks for that :)

Accurate Models of AI Risk Are Hyperexistential Exfohazards

Jack O'Brien2y20

Yep, fair point. In my original comment I seemed to forget about the problem of AIs goodharting our long reflection. I probably agree now that doing a pivotal act into a long reflection is approximately as difficult as solving alignment.

(Side-note about how my brain works: I notice that when I think through all the argumentative steps deliberately, I do believe this statement: "Making an AI which helps humans clarify their values is approximately as hard as making an AI care about any simple, specific thing." However, it does not come to mind automatically when I'm reasoning about alignment. Two possible fixes:

Think more concretely about Retargeting the Search when I think about solving alignment. This makes the problems seem similar in difficulty.
Meditate on just how hard it is to target an AI at something. Sometimes I forget how Goodhartable any objective is.)

Accurate Models of AI Risk Are Hyperexistential Exfohazards

Jack O'Brien2y3-2

This post was incredibly interesting and useful to me. I would strong-upvote it, but I don't think this post should be promoted to more people. I've been thinking about the question of "who are we aligning AI to" for the past two months.

I really liked your criticism of the Long Reflection because it is refreshingly different from e.g. MacAskill and Ord's writing on the long reflection. I'm still not convinced that we can't avoid all of the hellish things you mentioned, like synthetic superstimuli cults and sub-AGI drones. Why can't we just have a simple process of open dialogue with values of truth, individual agency during the reflection, and some clearly defined contract at the end of the long reflection to, like, take power away from the AGI drones?
