All of Ronny Fernandez's Comments + Replies

Curated. Tackles thorny conceptual issues at the foundation of AI alignment while also revealing the weak spots of the abstractions used to do so.

I like the general strategy of trying to make progress on understanding the problem while relying only on the concept of "basic agency", without having to work on the much harder problem of coming up with a useful formalism for a more full-throated conception of agency, whether or not that turns out to be enough in the end.

The core point of the post: that certain kinds of goals only make sense at all given that there are... (read more)

Curated. The problem of certain evidence is an old fundamental problem in Bayesian epistemology, and this post makes a simple and powerful conceptual point tied to a standard way of trying to resolve that problem. Explaining how to think about certain evidence vs. something like Jeffrey conditionalization under the prediction market analogy of a Bayesian agent is itself valuable. Further pointing out both that:

1) You can think of evidence and hypotheses as objects of the same type signature using the analogy.


2) The difference between them is rev... (read more)
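To make the contrast between certain evidence and Jeffrey conditionalization concrete, here is a minimal sketch. It is an illustration rather than anything taken from the post: the function names and the numbers in the usage note are assumptions chosen for the example, and it only covers a two-cell evidence partition (E, not-E).

```python
def strict_update(prior_h, p_e_given_h, p_e_given_not_h):
    """Posterior P(H | E) when E is learned with certainty (Bayes' theorem)."""
    p_e = p_e_given_h * prior_h + p_e_given_not_h * (1 - prior_h)
    return p_e_given_h * prior_h / p_e


def jeffrey_update(prior_h, p_e_given_h, p_e_given_not_h, new_p_e):
    """Jeffrey's rule for uncertain evidence.

    P_new(H) = P(H | E) * q + P(H | not E) * (1 - q), where q = new_p_e
    is the credence in E after the experience.
    """
    p_h_given_e = strict_update(prior_h, p_e_given_h, p_e_given_not_h)
    # P(H | not E) is the same computation with the complementary likelihoods.
    p_h_given_not_e = strict_update(prior_h, 1 - p_e_given_h, 1 - p_e_given_not_h)
    return p_h_given_e * new_p_e + p_h_given_not_e * (1 - new_p_e)
```

With an assumed prior of 0.3 and likelihoods 0.8 and 0.2, calling `jeffrey_update(0.3, 0.8, 0.2, 1.0)` agrees with `strict_update(0.3, 0.8, 0.2)`, so certain evidence is the special case where the new credence in E is 1. Setting `new_p_e` to the old P(E) leaves the credence in H unchanged.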

There is! It is now posted! Sorry about the delay.

Hello, last time I taught a class on the basics of Bayesian epistemology. This time I will teach a class that goes a bit further. I will explain what a proper scoring rule is and we will also do some calibration training. In particular, we will play a calibration training game called two lies, a truth, and a probability. I will do this at 7:30 the same place as last time. Come by to check it out.
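For readers who have not met the term, here is a small sketch of what makes a scoring rule "proper". It is an illustration and not material from the class: the function names and the grid search are assumptions made for the example. A scoring rule is proper when reporting your actual credence maximizes your expected score.

```python
import math


def brier_score(p, outcome):
    """Quadratic (Brier) score, negated so that higher is better."""
    return -((p - outcome) ** 2)


def log_score(p, outcome):
    """Logarithmic score: log of the probability given to what happened."""
    return math.log(p if outcome == 1 else 1 - p)


def expected_score(score, report, belief):
    """Expected score of reporting `report` when your credence is `belief`."""
    return belief * score(report, 1) + (1 - belief) * score(report, 0)


def best_report(score, belief, grid_size=99):
    """Search a grid of reports (0.01 to 0.99 by default) for the best one."""
    grid = [i / (grid_size + 1) for i in range(1, grid_size + 1)]
    return max(grid, key=lambda r: expected_score(score, r, belief))
```

Under both rules, `best_report(brier_score, 0.7)` and `best_report(log_score, 0.7)` land on a report of 0.7, which is the sense in which neither rule rewards shading your stated probability away from what you believe.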

Hello! Please note that I will be giving a class called the Bayesics in Eigen hall at 7:30. Heard of Bayes's theorem but don't fully understand what the fuss is about? Want to have an intuitive as well as formal understanding of what the Bayesian framework is? Want to learn how to do Bayesian updates in your head? Come and learn the Bayesics.

Also, please note that I will be giving a class at 7:30 after the reading group called "The Bayesics" where I will teach you the basics of intuitive Bayesian epistemology and how to do Bayesian updates irl on the fly as a human. All attending the reading group are welcome to join for that as well.

That's this week, right? Is reading An Intuitive Explanation of Bayes Theorem recommended?

I think you should still write it. I'd be happy to post it instead or bet with you on whether it ends up negative karma if you let me read it first.


Misfits, hooligans, and rabble rousers. 
Provocateurs and folk who don’t wear trousers. 
These are my allies and my constituents. 
Weak in number yet suffused with arcane power. 

I would never condone bullying in my administration. 
It is true we are at times moved by unkind motivations. 
But without us the pearl clutchers, hard asses, and busy bees would overrun you. 
You would lose an inch of slack per generation. 

Many among us appreciate your precision. 
I ad... (read more)

Answer by Ronny Fernandez353

Hey, I'm just some guy but I've been around for a while. I want to give you a piece of feedback that I got way back in 2009 which I am worried no one has given you. In 2009 I found LessWrong, and I really liked it, but I got downvoted a lot and people were like "hey, your comments and posts kinda suck". They said, although not in so many words, that basically I should try reading the sequences closely with some fair amount of reverence or something.

I did that, and it basically worked, in that I think I really did internalize a lot of the values/taste... (read more)

I didn’t figure out that the “bow” in “rainbow” referred to a bow like as in bow and arrow, and not a bow like a bow on a frilly dress, until five minutes ago. I was really pretty confused about this since I was like 8. Somebody could’ve explained but nobody did.

I want to note for posterity that I tried to write this reading list somewhat impartially. That is, I have a lot of takes about a lot of this stuff, and I tried to include a lot of material that I disagree with but which I have found helpful in some way or other. I also included things that people I trust have found helpful even if I personally never found it helpful.

I believe there isn't really a deadline! You just buy tickets and then you can come. The limiting factor is that tickets might sell out.

In retrospect I think the above was insufficiently cooperative. Sorry.

To be clear, I did not think we were discussing the AI optimist post. I don't think Nate thought that. I thought we were discussing reasons I changed my mind a fair bit after talking to Quintin.

I meant the reasonable thing other people knew I meant and not the deranged thing you thought I might've meant.

[This comment is no longer endorsed by its author]

Yeah I’m totally with you that it definitely isn’t actually next token prediction, it’s some totally other goal drawn from the distribution of goals you get when you SGD for minimizing next token prediction surprise.

I suppose I'm trying to make a hypothetical AI that would frustrate any sense of "real self" and therefore disprove the claim "all LLMs have a coherent goal that is consistent across characters". In this case, the AI could play the "benevolent sovereign" character or the "paperclip maximizer" character, so if one claimed there was a coherent underlying goal I think the best you could say about it is "it is trying to either be a benevolent sovereign or maximize paperclips". But if your underlying goal can cross such a wide range of behaviors it is practical... (read more)
I think we are pretty much on the same page! Thanks for the example of the ball-moving AI, that was helpful. I think I only have two things to add:

1. Reward is not the optimization target, and in particular just because an LLM was trained by changing it to predict the next token better, doesn't mean the LLM will pursue that as a terminal goal. During operation an LLM is completely divorced from the training-time reward function, it just does the calculations and reads out the logits. This differs from a proper "goal" because we don't need to worry about the LLM trying to wirehead by feeding itself easy predictions. In contrast, if we call up

2. To the extent we do say the LLM's goal is next token prediction, that goal maps very unclearly onto human-relevant questions such as "is the AI safe?". Next-token prediction contains multitudes, and in OP I wanted to push people towards "the LLM by itself can't be divorced from how it's prompted".

So the shoggoth here is the actual process that gets low loss on token prediction. Part of the reason that it is a shoggoth is that it is not the thing that does the talking. Seems like we are on board here.

The shoggoth is not an average over masks. If you want to see the shoggoth, stop looking at the text on the screen and look at the input token sequence and then the logits that the model spits out. That's what I mean by the behavior of the shoggoth. 

On the question of whether it's really a mind, I'm not sure how to tell. I know it gets really ... (read more)

1David Johnston
1. We can definitely implement a probability distribution over text as a mixture of text generating agents. I doubt that an LLM is well understood as such in all respects, but thinking of a language model as a mixture of generators is not necessarily a type error.

2. The logits and the text on the screen cooperate to implement the LLM's cognition. Its outputs are generated by an iterated process of modelling completions, sampling them, then feeding the sampled completions back to the model.
I think we're largely on the same page here because I'm also unsure of how to tell! I think I'm asking for someone to say what it means for the model itself to have a goal separate from the masks it is wearing, and show evidence that this is the case (rather than the model "fully being the mask"). For example, one could imagine an AI with the secret goal "maximize paperclips" which would pretend to be other characters but always be nudging the world towards paperclipping, or human actors who perform in a way supporting the goal "make my real self become famous/well-paid/etc" regardless of which character they play. Can someone show evidence for the LLMs having a "real self" or a "real goal" that they work towards across all the characters they play? I suppose I'm trying to make a hypothetical AI that would frustrate any sense of "real self" and therefore disprove the claim "all LLMs have a coherent goal that is consistent across characters". In this case, the AI could play the "benevolent sovereign" character or the "paperclip maximizer" character, so if one claimed there was a coherent underlying goal I think the best you could say about it is "it is trying to either be a benevolent sovereign or maximize paperclips". But if your underlying goal can cross such a wide range of behaviors it is practically meaningless! (I suppose these two characters do share some goals like gaining power, but we could always add more modes to the AI like "immediately delete itself" which shrinks the intersection of all the characters' goals.)

The shoggoth is supposed to be of a different type than the characters. The shoggoth for instance does not speak English, it only knows tokens. There could be a shoggoth character but it would not be the real shoggoth. The shoggoth is the thing that gets low loss on the task of predicting the next token. The characters are patterns that emerge in the history of that behavior.

I agree that to the extent there is a shoggoth, it is very different than the characters it plays, and an attempted shoggoth character would not be "the real shoggoth". But is it even helpful to think of the shoggoth as being an intelligence with goals and values? Some people are thinking in those terms, e.g. Eliezer Yudkowsky saying that "the actual shoggoth has a motivation Z". To what extent is the shoggoth really a mind or an intelligence, rather than being the substrate on which intelligences can emerge? And to get back to the point I was trying to ma... (read more)

Yeah I think this would work if you conditioned on all of the programs you check being exactly equally intelligent. Say you have a hundred superintelligent programs in simulations and one of them is aligned, and they are all equally capable, then the unaligned ones will be slightly slower in coming up with aligned behavior maybe, or might have some other small disadvantage. 

However, in the challenge described in the post it's going to be hard to tell a level 999 aligned superintelligence from a level 1000 unaligned superintelligence.

I think the advant... (read more)

Quick submission:

The first two prongs of OAI's approach seem to be aiming to get a human-values-aligned training signal. Let us suppose that there is such a thing, and ignore the difference between a training signal and a utility function, both of which I think are charitable assumptions for OAI. Even if we could search the space of all models and find one that in simulations does great on maximizing the correct utility function which we found by using ML to amplify human evaluations of behavior, that is no guarantee that the model we find in that search ... (read more)

This is an intuition only based on speaking with researchers working on LLMs, but I think that OAI thinks that a model can simultaneously be good enough at next token prediction to assist with research but also be very very far away from being a powerful enough optimizer to realise that it is being optimized for a goal or that deception is an optimal strategy, since the latter two capabilities require much more optimization power. And that the default state of cutting edge LLMs for the next few years is to have GPT-3 levels of deception (essentially none) and graduate student levels of research assistant ability.
1Ronny Fernandez
This inspired a full length post.

I loved this, but maybe should come with a cw.


I would think the title is itself a content warning.

I guess someone might think this post is or could be far more abstract and less detailed about the visceral realities than it is (or maybe even just using the topic as a metaphor at most).

What kind of specific content warning do you think would be appropriate? Maybe "Describes the dissection of human bodies in vivid concrete terms."?

I assumed he meant the thing that most activates the face detector, but from skimming some of what people said above, seems like maybe we don't know what that is.

There's a nearby kind of obvious but rarely directly addressed generalized version of one of your arguments, which is that ML learns complex functions all the time, so why should human values be any different? I rarely see this discussed, and I thought the replies from Nate and the ELK related difficulties were important to have out in the open, so thanks a lot for including the face learning <-> human values learning analogy. 

For at least about ten years in my experience people in this community have been saying the main problem isn't getting the AI to understand human values, it's getting the AI to have human values. Unfortunately the phrase "learn human values" is sometimes used to mean "have human values" and sometimes used to mean "understand human values", hence the confusion.

Ege Erdil gave an important disanalogy between the problem of recognizing/generating a human face, and the problem of either learning human values, or learning what plans that advance human values are like. The disanalogy is that humans are near perfect human face recognizers, but we are not near perfect valuable world-state or value-advancing-plan recognizers. This means that if we trained an AI to either recognize valuable world-states or value-advancing plans, we would actually end up just training something that recognizes what we can recognize as val... (read more)

I'm confused, which GAN faces look like "horrible monstrosities"!?

I also find it somewhat taboo but not so much that I haven’t wondered about it.

Just realized that’s not UAI. Been looking for this source everywhere, thanks.

Ok I understand that although I never did find a proof that they are equivalent in UAI. If you know where it is, please point it out to me.

I still think that Solomonoff induction assigns 0 to uncomputable bit strings, and I don’t see why you don’t think so.

Like the outputs of programs that never halt are still computable right? I thought we were just using a “prints something eventually oracle” not a halting oracle.

Simple in the description length sense is incompatible with uncomputability. Uncomputability means there is no finite way to point to the function. That’s what I currently think, but I’m confused about you understanding all those words and disagreeing.

There's a level-slip here, which is more common in discussions of algorithmic probability than I realized when I wrote as if it was an obvious nitpick. There are two ways to point at the Solomonoff prior: the operational way (Solomonoff's original idea) and the denotational way (called a "universal mixture"). These are shown equivalent by (Hutter et al., 2011). I'll try to explain the confusion in both ways.

Operationally, we're going to feed an infinite stream of fair coin-flips into a universal monotone Turing machine, and then measure the distribution of the output. What we do not do is take the random noise that we're about to feed into the Turing machine and apply a halting-oracle to it to make sure it encodes a computable function before running it. We just run it. Each such "hypothesis" may halt eventually, or it may produce output forever, or it may infinite-loop after producing some output. As a Solomonoff inductor, we're going to have infinite patience and consider all the output it produces before either halting or infinite-looping. This is the sense in which a hypothesis doesn't have to be "computable" to be admissible, it just has to be "a program".

Denotationally, we're going to mix over a computable enumeration of lower semicomputable semimeasures. (This is the perspective I prefer, I think, but it's equivalent.) So in a really precise way, the hypotheses have to be semicomputable, which is a strictly larger set than being computable, and, in fact, the universal mixture itself is semicomputable (even though it is uncomputable). This is why it's not inconsistent to imagine Solomonoff inductors showing up as hypotheses inside Solomonoff induction.
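The two ways of pointing at the prior can be written side by side. The notation below is an assumption on my part (the usual textbook symbols, not something used in the comment): $U$ is the universal monotone machine, $\ell(p)$ is the length of program $p$ in bits, $x*$ means "any output beginning with $x$", and $\mathcal{M}$ is the enumerated class of lower semicomputable semimeasures.

```latex
% Operational form: sum over (minimal) programs p whose output on the
% universal monotone machine U begins with the string x.
M(x) \;=\; \sum_{p \,:\, U(p) = x*} 2^{-\ell(p)}

% Denotational form: a mixture over the class \mathcal{M} of lower
% semicomputable semimeasures, with positive weights summing to at most 1.
\xi(x) \;=\; \sum_{\nu \in \mathcal{M}} w_\nu \, \nu(x),
\qquad w_\nu > 0, \quad \sum_{\nu \in \mathcal{M}} w_\nu \le 1

% Equivalence up to multiplicative constants c_1, c_2 > 0 independent of x.
c_1 \, \xi(x) \;\le\; M(x) \;\le\; c_2 \, \xi(x)
```

Nothing in the first sum asks whether $p$ halts, which is the operational point above: a program contributes weight for every prefix of whatever output it manages to produce.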

A lot of folks seem to think that general intelligences are algorithmically simple. Paul Christiano seems to think this when he says that the universal distribution is dominated by simple consequentialists.

But the only formalism I know for general intelligences is uncomputable, which is as algorithmically complicated as you can get.

The computable approximations are plausibly simple, but are the tractable approximations simple? The only example I have of a physically realized AGI seems to be very much not algorithmically simple.


Just want to point out that the minimum-description-length prior and the speed prior are two very different notions of simplicity. The universal distribution Paul is referring to is probably the former. Simple in the description-length sense is not incompatible with uncomputability. In my experience of conversations at AIRCS workshops, it's perennially considered an open question whether or not the speed prior is also full of consequentialists.

After trying it, I've decided that I am going to charge more like five dollars per step, but yes, thoughts included. 

Can we apply for consultation as a team of two? We only want remote consultation of the resources you are offering because we are not based in the Bay Area.


For anyone who may have the executive function to go for the 1M, I propose myself as a cheap author if I get to play as the dungeon master role, or play as the player role, but not if I have to do both. I recommend taking me as the dungeon master role. This sounds genuinely fun to me. I would happily do a dollar per step.

I can also help think about how to scale the operation, but I don’t think I have the executive function, management experience, or slack to pull it off myself.

I am Ronny Fernandez. You can contact me on fb.

Does your offer include annotating your thoughts too or does it only include writing the prompts?

I'm setting up a place for writers and organizers to find each other, collaborate, and discuss this; please join the Discord.  More details in this comment.

I'd also be interested in contributing with writing for pay in this fashion, and perhaps helping with the executive side as well. You can reach me on fb, Erik Engelhardt. 

I similarly offer myself as an author, in either the dungeon master or player role. I could possibly get involved in the management or technical side of things, but would likely not be effective in heading a project (for similar reasons to Brangus), and do not have practical experience in machine learning.

I am best reached through direct message or comment reply here on LessWrong, and can provide other contact information if someone wants to work with me.

I came here to say something pretty similar to what Duncan said, but I had a different focus in mind. 

It seems like it's easier for organizations to coordinate around PR than it is for them to coordinate around honor.  People can have really deep intractable, or maybe even fundamental and faultless, disagreements about what is honorable, because what is honorable is a function of what normative principles you endorse. It's much easier to resolve disagreements about what counts as good PR. You could probably settle most disagreements about what co... (read more)

As a counterpoint, one writer thinks that it's psychologically harder for organizations to think about PR:

A famous investigative reporter once asked me why my corporate clients were so terrible at defending themselves during controversy. I explained, “It’s not what they do. Companies make and sell stuff. They don’t fight critics for a living. And they dread the very idea of a fight. Critics criticize; it’s their entire purpose for existing; it’s what they do.”

“But the companies have all that money!” he said, exasperated.

“But their critics have you,” I said

... (read more)

It's much easier to resolve disagreements about what counts as good PR.

I mostly disagree. I mean, maybe this applies in comparison to “honor” (not sure), but I don’t think it applies in comparison to “reputation” in many of the relevant senses. A person or company could reasonably wish to maintain a reputation as a maker of solid products that don’t break, or as a reliable fact-checker, or some other such specific standard. And can reasonably resolve internal disagreements about what is and isn’t likely to maintain this reputation.

If it was actually ... (read more)

5Duncan Sabien (Deactivated)
This seems true to me but also sort of a Moloch-style dynamic?  Like "yep, I agree those are the incentives, and it's too bad that that's the case."

This might be sort of missing the point, but here is an ideal and maybe not very useful not-yet-theory of rationality improvements I just came up with.

There are a few black boxes in the theory. The first takes you and returns your true utility function, whatever that is. Maybe it's just the utility function you endorse, and that's up to you. The other black box is the space of programs that you could be. Maybe it's limited by memory, maybe it's limited by run time, or maybe it's any finite state machine with less than 10^20 states... (read more)

The main thing I want to point out is that this is an idealized notion of non-idealized decision theory -- in other words, it's still pretty useless to me as a bounded agent, without some advice about how to approximate it. I can't very well turn into this max-expected-value bounded policy. But there are other barriers, too. Figuring out what utility function I endorse is a hard problem. And we face challenges of embedded decision theory; how do we reason about the counterfactuals of changing our policy to the better one? Modulo those concerns, I do think your description is roughly right, and carries some important information about what it means to self-modify in a justified way rather than cargo-culting.

I don't think we should be surprised that any reasonable utility function is uncomputable. Consider a set of worlds with utopias that last only as long as a Turing machine in the world does not halt and are otherwise identical. There is one such world for each Turing machine. All of these worlds are possible. No computable utility function can assign higher utility to every world with a never halting Turing machine.

I do think this is an important concept to explain our conception of goal-directedness, but I don't think it can be used as an argument for AI risk, because it proves too much. For example, for many people without technical expertise, the best model they have for a laptop is that it is pursuing some goal (at least, many of my relatives frequently anthropomorphize their laptops).

This definition is supposed to also explain why a mouse has agentic behavior, and I would consider it a failure of the definition if it implied that mice are dangerous. I think a system becomes more dangerous as your best model of that system as an optimizer increases in optimization power.

Here is an idea for a disagreement resolution technique. I think this will work best:

*with one other partner you disagree with.

*when the beliefs you disagree about are clearly about what the world is like.

*when the beliefs you disagree about are mutually exclusive.

*when everybody genuinely wants to figure out what is going on.

Probably doesn't really require all of those though.

The first step is that you both write out your beliefs on a shared work space. This can be a notebook or a whiteboard or anything like that. Then you each write do... (read more)


Ok, let me give it a try. I am trying to not spend too much time on this, so I prefer to start with a rough draft and see whether there is anything interesting here before I write a massive essay.

You say the following:

Do chakras exist?

In some sense I might be missing the point since the answer to this is basically just "no". Though obviously I still think they form a meaningful category of something, but in my model they form a meaningful category of "mental experiences" and "mental procedures", and definitely not a meaningfu... (read more)

If you come up with a test or set of tests that it would be impossible to actually run in practice, but that we could do in principle if money and ethics were no object, I would still be interested in hearing those. After talking to one of my friends who is enthusiastic about chakras for just a little bit, I would not be surprised if we in fact make fairly similar predictions about the results of such tests.

Sometimes I sort of feel like a grumpy old man that read the sequences back in the good old fashioned year of 2010. When I am in that mood I will sometimes look around at how memes spread throughout the community and say things like "this is not the rationality I grew up with". I really do not want to stir things up with this post, but I guess I do want to be empathetic to this part of me and I want to see what others think about the perspective.

One relatively small reason I feel this way is that a lot of really smart rationalists, who are my fr... (read more)


I am not one of the Old Guard, but I have an uneasy feeling about something related to the Chakra phenomenon.

It feels like there's a lot of hidden value clustered around wooy topics like Chakras and Tulpas, and the right orientation towards these topics seems fairly straightforward: if it calls out to you, investigate and, if you please, report. What feels less clear to me is how I as an individual or as a member of some broader rat community should respond when, according to me, people do not pass certain forms of bullshit tests.

This comes from someone wi... (read more)

lol on the grumpy old man part, I feel that sometimes :) I'm not really familiar with what chakras are supposed to be about, but I'm decently familiar with yoga (200h level training several years ago). For the first 2/3 of the training we just focused on movement and anatomy, and the last 1/3 was teaching and theory. My teacher told me that there was the stuff called prana that flowed through living beings, and that breath work was all about getting the right prana flow. I thought that was a bit weird, but the breathing techniques we actually did also had lovely and noticeable effects on my mood/body.

My frame: some woo frameworks came about through X years of experimentation and finding lots of little tweaks that work, and then the woo framework co-evolved, or came afterwards, as a way to tie together all these disjointed bits of accumulated knowledge. So when I go to evaluate something like chakras, I treat the actual theory as secondary to the actual pointers, "how chakras tell me to live my life". Now, any given woo framework may or may not have that many useful accumulated tidbits, that's where we have to try it for ourselves and see if it works. I've done enough yoga to be incredibly confident that though prana may not carve reality at the joints or be real, I'm happy to ask a master yogi how to handle my body better.

Hmmmmm, so I guess the thing I wanted to say to you was, when having this chakra discussion with whomever, make sure to ask them, "What are the concrete things chakras tell me to do with my body/mind" and then see if those things have any effect.
I have some thoughts about this (as someone who isn't really into the chakra stuff, but feels like it's relatively straightforward to answer the meta-questions that you are asking here). Feel free to ping me in a week if I haven't written a response to this.

Here is an idea I just thought of in an Uber ride for how to narrow down the space of languages it would be reasonable to use for universal induction. To express the k-complexity of an object relative to a programming language I will write:

Suppose we have two programming languages. The first is Python. The second is Qython, which is a lot like Python, except that it interprets the string "A" as a program that outputs some particular algorithmically large random looking character string with . I claim that intuitively, Pyth... (read more)

When I started writing this comment I was confused. Then I got myself fairly less confused I think. I am going to say a bunch of things to explain my confusion, how I tried to get less confused, and then I will ask a couple questions. This comment got really long, and I may decide that it should be a post instead.

Take a system with 8 possible states. Imagine it is like a simplified Rubik's cube type puzzle. (Thinking about mechanical Rubik's cube solvers is how I originally got confused, but using actual Rubik's cubes to explain would make... (read more)

Is there a particular formula for negentropy that OP has in mind? I am not seeing how the log of the inverse of the probability of observing an outcome as good or better than the one observed can be interpreted as the negentropy of a system with respect to that preference ordering.

Edit: Actually, I think I figured it out, but I would still be interested in hearing what other people think.
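As a concrete version of the quantity asked about above, here is a minimal sketch that computes the log of the inverse of the probability of an outcome at least as good as the observed one. It is one possible reading and not necessarily the formula OP has in mind: the function name, the uniform baseline distribution, and the example utilities are all assumptions made for illustration.

```python
import math


def optimization_power_bits(utilities, observed_index, probs=None):
    """Return -log2 of the probability of doing at least as well as observed.

    utilities[i] ranks state i under the preference ordering; probs[i] is the
    baseline probability of state i (assumed uniform when not given).
    """
    n = len(utilities)
    if probs is None:
        probs = [1 / n] * n  # assumption: every state equally likely by default
    observed = utilities[observed_index]
    # Probability mass on states that are as good as or better than the outcome.
    mass = sum(p for u, p in zip(utilities, probs) if u >= observed)
    return -math.log2(mass)


# Hypothetical 8-state system whose states are ranked 0 (worst) to 7 (best).
utilities = list(range(8))
best_case_bits = optimization_power_bits(utilities, 7)
top_half_bits = optimization_power_bits(utilities, 4)
```

With a uniform baseline over N states and k states at least as good as the outcome, this equals log2(N) - log2(k): the maximum entropy minus the entropy of a uniform distribution over the "at least this good" set. That difference is one way to read the result as a negentropy relative to the preference ordering.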

Something about your proposed decision problem seems cheaty in a way that the standard Newcomb problem doesn't. I'm not sure exactly what it is, but I will try to articulate it, and maybe you can help me figure it out.

It reminds me of two different decision problems. Actually, the first one isn't really a decision problem.

Omega has decided to give all those who two box on the standard Newcomb problem 1,000,000 usd, and all those who do not 1,000 usd.

Now that's not really a decision problem, but that's not the issue with using it ... (read more)

I think I see where you're coming from with the inverse problem feeling "cheaty". It's not like other decision problems in the sense that it is not really a dilemma; two-boxing is clearly the best option. I used the word "problem" instinctively, but perhaps I should have called it the "Inverse Newcomb Scenario" or something similar instead. However, the fact that it isn't a real "problem" doesn't change the conclusion. I admit that the inverse scenario is not as interesting as the standard problem, but what matters is that it's just as likely, and clearly favours two-boxers. FDT agents have a pre-commitment to being one-boxers, and that would work well if the universe actually complied and provided them with the scenario they have prepared for (which is what the paper seems to assume). What I tried to show with the inverse scenario is that it's just as likely that their pre-commitment to one-boxing will be used against them. Both Newcomb's Problem and the Inverse Scenario are "unfair" for one of the theories, which is why I think the proper performance measure is the total money for going through both, where CDT comes out on top.

I had already proved it for two values of H before I contacted Sellke. How easily does this proof generalize to multiple values of H?

3Samuel Hapák
Very simple. To prove it for an arbitrary number of values, you just need to prove that h_i being true increases its expected “probability to be assigned” after measurement for each i. If you define T as h_i and F as NOT h_i, you have just reduced the problem to the two-value version.

I see. I think you could also use PPI to prove Good's theorem though. Presumably the reason it pays to get new evidence is that you should expect to assign more probability to the truth after observing new evidence?

I honestly could not think of a better way to write it. I had the same problem when my friend first showed me this notation. I thought about using but that seemed more confusing and less standard? I believe this is how they write things in information theory, but those equations usually have logs in them.

Just to add an additional voice here, I would view that as incorrect in this context, instead referring to the thing that the CEE is saying. The way I'd try to clarify this would be to put the variables varying in the expectation in subscripts after the E, so the CEE equation would look like $E_D[P(H=h_i \mid D)] = P(H=h_i)$, and the PPI inequality would be $E_{(H,D)}[P(H \mid D)] \ge E_H[P(H)]$.
Yeah, this is the one that I would have used.
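For anyone who wants to see the two statements side by side numerically, here is a small sketch that checks the CEE equality and the PPI inequality on a made-up joint distribution. The joint table and all function names are assumptions for illustration, and a numeric check on one distribution is of course not a proof.

```python
def marginals(joint):
    """joint[h][d] = P(H=h, D=d); returns (P(H), P(D)) as lists."""
    p_h = [sum(row) for row in joint]
    p_d = [sum(col) for col in zip(*joint)]
    return p_h, p_d


def expected_posterior_of(joint, i):
    """E_D[P(H=h_i | D)]: the average posterior of one fixed hypothesis."""
    _, p_d = marginals(joint)
    return sum(
        p_d[d] * (joint[i][d] / p_d[d]) for d in range(len(p_d)) if p_d[d] > 0
    )


def expected_posterior_of_truth(joint):
    """E_(H,D)[P(H | D)]: the average posterior given to the true hypothesis."""
    _, p_d = marginals(joint)
    return sum(
        joint[h][d] * (joint[h][d] / p_d[d])
        for h in range(len(joint))
        for d in range(len(p_d))
        if p_d[d] > 0
    )


def expected_prior_of_truth(joint):
    """E_H[P(H)]: the average prior given to the true hypothesis."""
    p_h, _ = marginals(joint)
    return sum(p * p for p in p_h)


# Hypothetical joint distribution over 3 hypotheses (rows) and 2 data values.
joint = [
    [0.30, 0.10],
    [0.05, 0.25],
    [0.10, 0.20],
]
```

For this table, `expected_posterior_of(joint, i)` matches the prior of each hypothesis (CEE), while `expected_posterior_of_truth(joint)` is larger than `expected_prior_of_truth(joint)` (PPI). The difference between the two is that CEE holds the hypothesis fixed, while PPI averages over which hypothesis is true.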