Yes, you are correct. I'm not sure I want to bother editing the post, since a bunch of other things have changed in the past 7 years and I don't at the moment have the energy to go through the whole post and bring it up to date. But I appreciate you bringing this up!
There I was just quoting from the Hintze paper, so it's not clear what he meant. One interpretation is that the right-hand side is just the definition of what "UDT(s)" means; in that sense there wouldn't be a type error, and UDT(s) would also be a policy. But you're right that a decision theory should in the end output an action.

Which notation is right comes down to what I said in the last paragraph of my previous comment: does UDT1.1/FDT-policy need to know the sense data s (or 'observation x', in the other notation) in order to condition on the agent using a particular policy? If the answer is yes, UDT(s) is a policy and UDT(s)(s) is the action. If the answer is no, then UDT is the policy (confusing, because UDT is also the 'decision algorithm' that finds the policy in the particular decision problem you are facing) and UDT(s) is the action.

My best guess is that the answer is 'no', so UDT is the policy and UDT(s) is the action, and your point about there being a type error is correct. But the notation in the Hintze paper makes it seem like s is somehow being used on the right-hand side, which is possibly what confused me when I wrote the post.
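To make the type question concrete, here's a minimal sketch in Python (my own, not from the Hintze paper or the post; the function names and the "act(...)" placeholder are just for illustration):

```python
from typing import Callable

Observation = str
Action = str
Policy = Callable[[Observation], Action]

# Reading 1: the decision theory needs the sense data s in order to pick a policy.
# Then UDT(s) is a policy, and UDT(s)(s) is the action actually taken.
def udt_reading_1(s: Observation) -> Policy:
    def policy(obs: Observation) -> Action:
        return "act(" + obs + ")"  # placeholder for whatever the optimization picks
    return policy

# Reading 2: the decision theory picks its policy without looking at s.
# Then UDT itself plays the role of the policy, and UDT(s) is already an action.
def udt_reading_2(s: Observation) -> Action:
    return "act(" + s + ")"  # placeholder for whatever the optimization picks

action_via_reading_1 = udt_reading_1("s")("s")  # note the double application
action_via_reading_2 = udt_reading_2("s")
```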
It has been many years since I thought about this post, so what I say below could be wrong, but here's my current understanding:
I think what I was trying to say in the post is that FDT-policy returns a policy, so "FDT(P, x)" means "the policy I would use if P were my state of knowledge and x were the observation I saw". But that's a policy, i.e. a mapping from observations to actions, so we need to call that policy on the actual observation in order to get an action, hence (FDT(P,x))(x).
Now, it's not clear to me that FDT-policy actually needs the observation x in order to condition on what policy it is using. In other words, in the post I wrote the conditioned event as FDT(P, x) = π, but perhaps this should have just been FDT(P) = π. In that case, the call to FDT should look like FDT(P), which is a policy, and then to get an action we would write FDT(P)(x).
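To spell out what that change does to the types, here is a small sketch (again my own; the names and placeholder bodies are hypothetical, not anything from the post):

```python
from typing import Any, Callable

Observation = str
Action = str
Policy = Callable[[Observation], Action]
Knowledge = Any  # stand-in for the state of knowledge P

# Signature as in the post: the observation x is an input, so the action is (FDT(P, x))(x).
def fdt_with_observation(P: Knowledge, x: Observation) -> Policy:
    return lambda obs: "act(" + obs + ")"  # placeholder policy

# Signature suggested above: only P is needed, so the action is FDT(P)(x).
def fdt(P: Knowledge) -> Policy:
    return lambda obs: "act(" + obs + ")"  # placeholder policy

x = "x"
action_from_post_version = fdt_with_observation({}, x)(x)
action_from_simplified_version = fdt({})(x)
```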
START TRACKING YOUR SYMPTOMS
Have a Signal chat to yourself or similar. Make sure it's very low friction and you've impressed on yourself the importance of tracking symptoms.
What do you do with this data? Do you have any examples of insights you've gained by tracking symptoms this way? I've personally found that tracking symptoms (which I did for about 3 years, increasingly obsessively towards the end, to the point of writing this post) led to obsessing over my symptoms, and that this was probably making things worse. I wasn't gaining much insight through tracking; it was more like "maybe someone or some AI will find patterns in this at some point and be able to explain everything to me so I can get better".
(B12, iodine, niacin)
What does it feel like when you've reached capacity on these? For niacin, do you just mean flushing?
Do not buy supps on Amazon (fraud, reselling, adulteration)
Do you have more info about this? I've had good experiences buying supplements on Amazon (sticking to reputable brands and making sure Amazon is the seller). I've been doing this for years, and as far as I know I've only ever gotten maybe one fake product.
Were you gardening or anything when you first got sick?
I was not. I've stayed indoors most of my adult life, so I think I'm at lower risk for worms. Hard to say where I could have gotten worms from (assuming it is worms).
I'd be curious to hear how you decided which dewormers I should take. Maybe the answer is just "a bunch of reading on random internet posts and papers".
Echoing interstice's sentiment here: I feel like the core insight of this post was already understood by, or implicit in, what a bunch of AI safety people are doing. It seems to me to be an application of the replaceability logic that effective altruists have discussed in many places. Even I (who have been far away from AI safety discussions for a long time now) had essentially a "duh" reaction to this post (even though for a lot of your posts I have a "wow" reaction).
As for an explicit past discussion, this 2023 talk by Buck Shlegeris in my opinion contains the core logic, although he doesn't use the legible/illegible terminology. In particular, one of the central points of the talk is how he chooses what to work on:
Translated into the legible/illegible terminology, I interpret this question as something like "What problems are legible to me but illegible to AI labs currently (evidenced by them not already working on them), but will probably become legible to AI labs by the time they are about to deploy transformative AI?" (I realize there are a bunch of unstated assumptions in Buck's talk, and I am not Buck, so I am doing quite a lot of my own interpretation here; you might reasonably disagree that the talk contains your core logic. :)
If I'm right that the core insight of the post is not novel, then the disagreement between prosaic safety researchers and people like you might not be about whether to work on legible problems vs illegible problems vs make-problems-more-legible (although there's probably some of that, like in your footnote about Paul), but instead about:
My own views are much closer to yours than to the prosaic-safety view I laid out above. In fact, after watching Buck's talk in 2023, I wrote the following in a private conversation:
i.e. not only does working on legible safety problems burn the remaining timeline, it is the very thing which hyperstitions "AI timelines" into existence in the first place.