Their romantic partner offering lots of value in other ways. I'm skeptical of this one, because female partners are notoriously high-maintenance in money, attention, and emotional labor. Sure, she might be great in a lot of ways, but it's hard for that to add up enough to outweigh the usual costs.
Imagine a woman is in a romantic relationship with somebody else. Are they still so great a person that you would enjoy hanging out with them as a friend? If not, that woman should not be your girlfriend. Friendship first. At least in my model, romantic stuff should be stacked on top of platonic love.
If you've tried this earnestly 3 times, after the 3rd time, I think it's fine to switch to just trying to solve the level however you want (i.e. moving your character around the screen, experimenting).
After you've failed 3 times, wouldn't it be a better exercise to just play around in the level until you get a new piece of information that you predict will allow you to formulate better plans, and then step back into planning mode again?
Another one: We manage to solve alignment to a significant extent. The AI, which is much smarter than a human, thinks that it is aligned, and takes aligned actions. The AI even predicts that it will never become unaligned with humans. However, at some point in the future, as the AI naturally unrolls into a reflectively stable equilibrium, it becomes unaligned.
Why not AI? Is it that AI alignment is too hard? Or do you think it's likely one would fall into the "try a bunch of random stuff" paradigm popular in AI, which wouldn't help much in getting better at solving hard problems?
What do you think about the strategy where, instead of learning from a textbook, e.g. on information theory or compilers, you try to write the textbook yourself and only look at existing material if you are really stuck? That's my primary learning strategy.
It's very slow, and I probably do it too much, but it lets me train on solving hard problems that aren't super hard. If you read all the textbooks, the only practice problems remaining are the very hard ones.
How about we meet, you do research, and I observe, and then try to subtly steer you, ideally such that you learn faster how to do it well. Basically do this, but without it being an interview.
Learning theory, complexity theory and control theory. See the "AI theory" section of the LTA reading list.
... and Carol's thoughts run into a blank wall. In the first few seconds, she sees no toeholds, not even a starting point. And so she reflexively flinches away from that problem, and turns back to some easier problems.
I have spent ~10 hours trying to teach people how to think. I sometimes try to intentionally cause this to happen. Usually you can recognize it by them going quiet (I usually give the instruction that they should do all their thinking out loud). And this seems to be when actual cognitive labor is happening, instead of saying things that y...
I watched this video, and I semi-trust this guy (more than anybody else) not to get it completely wrong. So you can eat too much soy. But eating a bit is actually healthy, is my current model.
Here is also a calculation I did showing that it is possible to get all amino acids from soy without eating too much.
At the 2024 LessWrong Community Weekend I met somebody who I have been working with for perhaps 50 hours so far. They are better at certain programming-related tasks than me, in a way that provided utility. Before we met, they were not even considering working on AI alignment related things. The conversation went something like this:
Johannes: What are you working on?
Other Person: Web development. What are you working on?
Johannes: I am trying to understand intelligence such that we can build a system that is capable enough to prevent other misaligned ...
I think running this experiment is generally worth it. It's very different to read a study and to run the experiment and see the effect yourself. You may also try to figure out if you are amino acid deficient. See this comment, as well as others in that comment stack.
The reason I mention chicken is that the last time I ran this experiment with beef, my body started to hurt so badly that I woke up in the middle of the night. I am pretty sure that the beef was the reason, though maybe something weird was going on in my body at the same time. However, when I tried the same thing one week later with chicken, I didn't have this issue.
Note this 50% likely only holds if you are using a mainstream language. For some non-mainstream languages I have gotten responses that were really unbelievably bad. Things like "the name of this variable is wrong", which literally could never be the problem (it was a valid identifier).
And similarly, if you are trying to encode novel concepts, it's very different from gluing together libraries, or implementing standard well known tasks, which I would guess is what habryka is mostly doing (not that this is a bad thing to do).
Maybe you include this in "stack overflow substitute", but the main thing I use LLMs for is to understand well-known technical things. The workflow is: 1) I am interested in understanding something, e.g. how a multiplexed barrel bit shifter works. 2) I ask the LLM to explain the concept. 3) Based on the initial response I create separate conversation branches with the questions I have (to save money and keep the context closer; I didn't evaluate whether this actually makes the LLM better). 4) Once I think I have understood the concept or part of the concept I explain...
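To make the example concept from step 1 concrete, here is a rough sketch of a multiplexed (logarithmic) barrel shifter in Python. The function name, bit width, and rotate-left behavior are my own illustrative choices, not from any particular hardware design:

```python
def barrel_rotate_left(value, shift, width=8):
    """Rotate an unsigned `width`-bit value left by `shift` positions,
    structured the way a multiplexed barrel shifter is in hardware:
    one mux stage per bit of the shift amount, where each stage
    conditionally rotates by a fixed power of two."""
    mask = (1 << width) - 1
    value &= mask
    for stage in range(width.bit_length() - 1):  # stages rotate by 1, 2, 4, ...
        amount = 1 << stage
        if (shift >> stage) & 1:  # this bit of `shift` selects the rotated input
            value = ((value << amount) | (value >> (width - amount))) & mask
    return value
```

Each stage is a row of 2-to-1 muxes selected by one bit of the shift amount, which is why the hardware only needs log2(width) stages instead of width-1 of them.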
The goal is to have a system where, ideally, there are no unlabeled parameters. That would be the world-modeling system. It would then build a world model, which would have many unlabeled parameters. By understanding the world modeler system you can ensure that the world model has certain properties. E.g. there is some property (which I don't know) of how to make the world model not contain dangerous minds.
E.g. imagine the AI is really good at world modeling, and now it models you (you are part of the world) so accurately that you are now basically copied into...
John's post is quite weird, because it only says true things, yet implicitly suggests a conclusion, namely that NNs are not less interpretable than some other thing, which is totally wrong.
Example: A neural network implements modular arithmetic with Fourier transforms. If you implement that Fourier algorithm in Python, it's harder for a human to understand than the obvious modular arithmetic implementation in Python.
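To make the contrast concrete, here is a hedged sketch in Python: the obvious implementation next to one that routes modular addition through angles on the unit circle, which is the Fourier picture. The modulus and function names are my own choices for illustration:

```python
import math

P = 113  # arbitrary modulus, chosen for illustration

def add_mod_obvious(a, b):
    return (a + b) % P

def add_mod_fourier(a, b):
    # Represent each residue as an angle on the unit circle;
    # adding the angles corresponds to adding the residues mod P.
    angle = 2 * math.pi * a / P + 2 * math.pi * b / P
    # Read the result back out as the residue whose angle best matches.
    return max(range(P), key=lambda k: math.cos(angle - 2 * math.pi * k / P))
```

Both compute the same function, but only the first makes the modular structure obvious at a glance; the second is the kind of thing you have to reverse-engineer.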
It doesn't matter if the world model is inscrutable when looking directly at it, if you can change the generating code such that certain propert...
I specifically am talking about solving problems that nobody knows the answer to, where you are probably even wrong about what the problem even is. I am not talking about taking notes on existing material. I am talking about documenting the process of generating knowledge.
I am saying that I forget important ideas that I generated in the past; probably they are not yet so refined that they are impossible to forget.
A robust alignment scheme would likely be trivial to transform into an AGI recipe.
Perhaps if you did have the full solution, but it feels like there are some parts of a solution that you could figure out such that those parts don't tell you as much about the other parts of the solution.
And it also feels like there could be a book such that if you read it you would gain a lot of knowledge about how to align AIs without knowing that much more about how to build one. E.g. a theoretical solution to the stop button problem seems like i...
If you had a system with "ENTITY 92852384 implies ENTITY 8593483", it would be a lot of progress, as currently in neural networks we don't even understand the internal structures.
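A minimal sketch of what such a system might look like, with opaque numeric entity IDs standing in for uninterpreted internal structures; the forward-chaining code and the `rules` layout are my own illustration:

```python
# Rules map an entity ID to the set of entity IDs it implies.
# The IDs are deliberately meaningless, as in the example above.
rules = {92852384: {8593483}}

def entails(facts, rules):
    """Forward-chain: repeatedly add every consequent of a known fact
    until no new entities can be derived."""
    derived = set(facts)
    frontier = list(facts)
    while frontier:
        fact = frontier.pop()
        for consequent in rules.get(fact, ()):
            if consequent not in derived:
                derived.add(consequent)
                frontier.append(consequent)
    return derived
```

Even with unlabeled entities, the implication structure itself is explicit and inspectable, which is exactly what we currently lack for neural networks.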
I want to have an algorithm that creates a world model. The world is large. A world model is uninterpretable by default through its sheer size, even if you had interpretable but low-level labels. By default we don't get any interpretable labels. I think there are ways to have generic data-processing procedures, which don't talk about the human mind at all, that would yield more interpr...
I definitely very often run into the problem that I forget why something was good to do in the first place. What are the important bits? Often I get sidetracked, and then the thing that I am doing seems not so good, so I stop and do something completely different. But then later on I realize that actually the original reason that led me down the path was good, and that it would have been better to only backtrack a bit to the important piece. But often I just don't remember the important piece in the moment.
E.g. I think that having som...
I'd think you can define a tetrahedron for non-Euclidean space. And you can talk and reason about the set of polyhedra with 10 vertices as an abstract object without defining any specific such polyhedron.
Just consider taking the assumption that the system would not change in arbitrary ways in response to its environment. There might be certain constraints. You can think about what the constraints need to be such that e.g. a self-modifying agent would never change itself such that it would expect that in the future it would get less utili...
The way I would approach this problem (after not much thought): Come up with a concrete system architecture A of a maximizing computer program that has an explicit utility function and is known to behave optimally. E.g. maybe it plays tic-tac-toe or 4-in-a-row optimally.
Now mutate the source code of A slightly such that it is no longer optimal, to get a system B. The objective is not modified. Now B still "wants" to basically be A, in the sense that if it is a general enough optimizer and has access to self-modification facilities, it would try to make itsel...
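A toy version of this setup, with a one-step choice instead of tic-tac-toe; the agent names and the utility function are made up for illustration. A maximizes an explicit utility optimally, and B is a one-token mutation of A's source that breaks optimality while leaving the utility function untouched:

```python
def utility(action):
    # Explicit utility function shared by A and B; peaks at action 3.
    return -(action - 3) ** 2

def agent_A(actions):
    # Optimal: picks the utility-maximizing action.
    return max(actions, key=utility)

def agent_B(actions):
    # Mutated copy of A: `max` became `min`, so it is no longer optimal,
    # yet it still carries the same explicit utility function.
    return min(actions, key=utility)
```

The question is then whether B, given enough general optimization power and self-modification access, would rewrite itself back toward A, since by its own utility function A's behavior scores higher.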
To me it seems that understanding how a system that you are building actually works (i.e. having good models of its internals) is the most basic requirement for being able to reason about the system coherently at all.
Yes, if you actually understood how intelligence works in a deep way, you wouldn't automatically solve alignment. But it sure would make alignment a lot more tractable in many ways. Especially when only aiming for a pivotal act.
I am pretty sure you can figure out alignment in advance as you suggest. That might be the overall safer route... if we didn't have ...
It becomes more interesting when people constrain their output based on what they expect is true information that the other person does not yet know. It's useful to talk to an expert who tells you a bunch of random stuff they know that you don't.
Often some of it will be useful. This only works if they understand what you have said though (which presumably is something that you are interested in). And often the problem is that people's models about what is useful are wrong. This is especially likely if you are an expert in something. Then the thing ...
It seems potentially important to compare this to GPT-4o. In my experience, when asking GPT-4 for research papers on particular subjects, it seemed to make up non-existent research papers (at least I didn't find them after multiple minutes of searching the web). I don't have any precise statistics on this.
Yes exactly. The larva example illustrates that there are different kinds of values. I thought it was underexplored in the OP to characterize exactly what these different kinds of values are.
In the sadist example we have:
These two things both seem like values. However, they seem to be qualitatively different kinds of values. I intuit that more precisely characterizing this difference is important. I have a bunch of thoughts on this that I failed to write up so far.
reward is the evidence from which we learn about our values
A sadist might feel good each time they hurt somebody. I am pretty sure it is possible for a sadist to exist who does not endorse hurting people, meaning they feel good if they hurt people, but they avoid it nonetheless.
So to what extent is hurting people a value? It's like the sadist's brain tries to tell them that they ought to want to hurt people, but they don't want to. Intuitively the "they don't want to" seems to be the value.
Here are a few observations I have made when it comes to going to bed on time.
I set up an alarm that reminds me when my target bedtime has arrived. Many times when I am lost in an activity, the alarm makes me remember that I made the commitment to go to bed on time.
I only allow myself to dismiss the alarm once I am lying in bed. Before lying down, I am only allowed to snooze it for 8 minutes. To dismiss the alarm I need to solve a puzzle that takes ~10s, making snoozing the more convenient option. Make sure to carry your phone around wi...
You need the right relationship with confusion. By default confusion makes you stop your thinking. Being confused feels like you are doing something wrong. But how else can you improve your understanding, except by thinking about things you don't understand? Confusion tells you that you don't yet understand. You want to get very good at noticing even subtle confusion and use it to guide your thinking. However, thinking about confusing things isn't enough. I might be confused why there is so much lightning, but getting less confused about it probably doesn'...
Here is an AI called GameNGen that generates a game in real time as the player interacts with the model. (It simulates Doom at >20fps.) It uses a diffusion model. People are only slightly better than random chance at identifying whether a clip was generated by the AI or by the Doom program.
There are muscles in your nose, I just realized. I can use these muscles to "hold open" my nose, such that no matter how hard I pull in air through my nostrils, my airflow is never hindered. If I don't use these muscles and pull in air really hard, then my nostrils "collapse", serving as a sort of flow limiter.
The next time you buy a laptop, and you don't want a Mac, it's likely you want to buy one with a snapdragon CPU. That's an ARM chip, meaning you get very good battery life (just like the M-series Apple chips). On Snapdragon though you can easily run Windows, and eventually Linux (Linux support is a few months out though).
IMO the most important factor in interpersonal relations is that it needs to be possible to have engaging/useful conversations. There are many others.
The problem: Somebody who scores low on these, can be pushed up unreasonably high in your ranking through feelings of sexual desire.
The worst thing: Sexual desire drops temporarily in the short term after orgasm, and (I heard) permanently after a 2-year period.
To probe the nature of your love:
I started to use Typst. I feel a lot more productive in it. LaTeX feels like a slog. Typst doesn't slow me down when typing math or code. That, the online collaborative editor, and the very fast rendering are the most important features. Here are some more:
Probably not useful but just in case here are some other medications that are prescribed for narcolepsy (i.e. stuff that makes you not tired):
Solriamfetol is supposed to be more effective than Modafinil. Possibly hard to impossible to get without a prescription. Haven't tried that yet.
Pitolisant is interesting because it has a novel mechanism of action. Possibly impossible to get even with a prescription, as it is super expensive if you don't have the right health insurance. For me, it did not work that well. Only lasted 2-4 hour...
I don't use it to write code, or really anything. Rather, I find it useful to converse with it. My experience is also that half is wrong and that it makes many dumb mistakes. But having the conversation is still extremely valuable, because GPT often makes me aware of existing ideas that I don't know. Also, like you say, it can get many things right, and then later get them wrong. That getting-right part is what's useful to me. The part where I tell it to write all my code is just not a thing I do. Usually I just have it write snippets, and it seems pretty good...