All of Domenic's Comments + Replies

These posts always leave me feeling a little melancholy that my life doesn't seem to have that many challenges where thinking faster/better/harder/sooner would actually help.

Most of my waking hours are spent on my job, where cognitive performance is not at all the bottleneck. (I honestly believe that if you made me 1.5x "better at thinking", this would not give a consistent boost in output as valued by the business. I'm a software engineer.) I have some intellectual spare-time hobbies, but the most demanding of them is Japanese studying, which is more abou... (read more)

It's interesting to compare this to the other curated posts I got in my inbox over the last week, What is malevolence? and How will we update about scheming. Both of those (especially the former) I bounced off of due to length. But this one I stuck with for quite a while, before I started skimming in the worksheet section.

I think the instinct to apply a length filter before sending a post to many people's inboxes is a good one. I just wish it were more consistently applied :)

Finding non-equilibrium quantum states would be evidence for pilot wave theory, since they're only possible in a pilot wave theory.

If you can find non-equilibrium quantum states, then the theories are experimentally distinguishable. https://en.m.wikipedia.org/wiki/Quantum_non-equilibrium

(Seems pretty unlikely we'd ever be able to definitively say a state was non-equilibrium instead of some other weirdness, though.)

1lemonhope
I don't understand what you're replying to. Is this about a possible experiment? What experiment could you do?

I can help confirm that your blind assumption is false. Source: my undergrad research was with a couple of the people who have tried hardest, which led to me learning a lot about the problem. (Ward Struyve and Samuel Colin.) The problem goes back to Bell and has been the subject of a dedicated subfield of quantum foundations scholars ever since.

This many years distant, I can't give a fair summary of the actual state of things. But a possibly unfair summary based on vague recollections is: it seems like the kind of situation where specialists have something... (read more)

1lemonhope
(If you do go reread then I would love to read some low-effort notes on it similar to your recollection above.)

This is great; until Spotify is ready, this will be the best way to share on social media.

May I suggest adding lyrics, either in the description or as closed captions or both?

If you are willing to share, can you say more about what got you into this line of investigation, and what you were hoping to get out of it?

For my part, I don't feel like I have many issues/baggage/trauma, so while some of the "fundamental debugging" techniques discussed around here (like IFS or meditation) seem kind of interesting, I don't feel too compelled to dive in. Whereas, techniques like TYCS or jhana meditation seem more intriguing, as potential "power ups" from a baseline-fine state.

So I'm wondering if your baseline is more like mine, and you ended up finding fundamental debugging valuable anyway.

1mesaoptimizer
Burnt out after almost a year of focusing on alignment research. I wanted to take a break from alignment-ey stuff and also desired to systematically fix the root causes behind the fact that I hit what I considered burn-out. I felt similar when I began this, and my motivation was not to 'fix issues' in myself but more "hey, I have explicitly decided to take a break and have fun, and TYCS seems interesting; let's experiment with it for a while, I can afford to do so".
4Raemon
I'm not mesaoptimizer, but, fyi, my case is "I totally didn't find IFS type stuff very useful for years, and then one day I just suddenly needed it, or at least found myself shaped very differently such that it felt promising." (see My "2.9 trauma limit")

It seems we have very different abilities to understand Holtman's work and find it intuitive. That's fair enough! Are you willing to at least engage with my minimal-time-investment challenge?

4johnswentworth
Sure. Let's adopt the "petrol/electric cars" thing from Holtman's paper. In timestep 0, the agent has a choice: either create a machine which will create one petrol car every timestep indefinitely, or create a machine which will create one petrol car every timestep until the button is pressed and then switch to electric. The agent does not have any choices after that; its only choice is which successor agent to create at the start. The utility functions are the same as in Holtman's paper. My main claim is that the π*_{fcg0} agent is not indifferent between the two actions; it will actively prefer the one which ignores the button. I expect this also extends to the π*_{fcgc} agent, but am less confident in that claim.

The π*_{fcg0} agent is indifferent between creating stoppable or unstoppable subagents, but the π*_{fcgc} agent goes back to being corrigible in this way. The "emergent incentive" handwave is only necessary for the subagents working on sub-goals (section 8.4), which is not something that either Soares et al. or your post that we're commenting on are prepared to tackle, although it is interesting follow-up work.

I suggest engaging with the simulator. It very clearly shows that, given the option of creating shutdown-resistant successor ag... (read more)

2johnswentworth
I think this is wrong? The π*_{fcg0} agent actively prefers to create shutdown-resistant agents (before the button is pressed); it is not indifferent. Intuitive reasoning: prior to button-press, that agent acts-as-though it's an R_N maximizer and expects to continue being an R_N maximizer indefinitely. If it creates a successor which will shut down when the button is pressed, then it will typically expect that successor to perform worse under R_N after the button is pressed than some other successor which does not shut down and instead just keeps optimizing R_N. Either I'm missing something very major in the definitions, or that argument works and therefore the agent will typically (prior to button-press) prefer successors which don't shut down.
2johnswentworth
Part of what's feeding into my skepticism here is that I think Holtman's formalization is substantially worse than the 2015 MIRI paper. It's adding unnecessary complexity - e.g. lots of timesteps, which in turn introduces the need for dynamic programming, which in turn requires all the proofs to work through recursive definitions - in a way which does not add any important mechanisms for making corrigibility work or clarify any subproblem. (Also, he's using MDPs, which implicitly means everything is observable at every step - a very big unrealistic assumption!) Sure, the whole thing is wrapped in more formalism, but it's unhelpful formalism which mostly makes it easier for problems to go unnoticed. As far as I can tell from what I've read so far, he's doing qualitatively the same things the 2015 MIRI paper did, but in a setting which makes the failure modes less clear, and he's communicated it all less understandably.

I don't particularly want to spend a day or two cleaning it all up and simplifying and distilling it back down to the point where the problems (which I strongly expect exist) are obvious. If you're enthusiastic about this, then maybe try to distill it yourself? Like, figure out the core intuitive ideas of the proofs, and present those directly in the simplest-possible setup (maybe two timesteps, maybe not, whatever's simple).

Just as one example of the sort of simplification I have in mind: the definition of f makes it so that, before button-press, the agent acts like it's an R'_N maximizer and expects to continue being an R'_N maximizer indefinitely. After button-press, the agent acts like it's an R_S maximizer and expects to continue being an R_S maximizer indefinitely. But it required tens of minutes chasing definitions around in order to see this very intuitive and key fact. One could just as easily define the agent in a way which made that fact obvious right from the get-go. Ideally, one would also find a similarly-clear expression for what g_c doe

Are you aware of how Holtman solved MIRI's formulation of the shutdown problem in 2019? https://arxiv.org/abs/1908.01695, my summary notes at https://docs.google.com/document/u/0/d/1Tno_9A5oEqpr8AJJXfN5lVI9vlzmJxljGCe0SRX5HIw/mobilebasic

Skimming through your proposal, I believe Holtman's correctly-constructed utility function correction terms would work for the scenario you describe, but it's not immediately obvious how to apply them once you jump to a subagent model.
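
To make the correction-term idea concrete, here is a minimal sketch in the petrol/electric toy setting, using an Armstrong-style indifference compensation paid at button-press time. This is my own simplification for illustration, not Holtman's exact f / g_c construction; the timestep counts and rewards are made up:

```python
# Toy: at t=0 the agent picks a successor. "ignore_button" makes petrol forever;
# "respect_button" makes petrol until the press, electric afterwards.
# R_N rewards petrol cars; R_S rewards electric cars after the press.

T = 10        # total timesteps
T_PRESS = 4   # the shutdown button is pressed at this timestep

def petrol_cars(successor, t):
    """Petrol cars produced at timestep t (what R_N rewards)."""
    if successor == "ignore_button":
        return 1                      # keeps making petrol forever
    return 1 if t < T_PRESS else 0    # "respect_button" switches at the press

def electric_cars(successor, t):
    """Electric cars produced at timestep t (what R_S rewards, post-press)."""
    return 1 if successor == "respect_button" and t >= T_PRESS else 0

def corrected_return(successor):
    """R_N before the press, R_S after the press, plus a compensation at press
    time equal to (the R_N value this successor would have produced had the
    button never been pressed) minus (the R_S value it actually produces)."""
    pre_press_rn = sum(petrol_cars(successor, t) for t in range(T_PRESS))
    post_press_rs = sum(electric_cars(successor, t) for t in range(T_PRESS, T))
    # In this toy, either successor just keeps making petrol if the button is
    # never pressed, so the counterfactual R_N value is the same for both.
    counterfactual_rn = T - T_PRESS
    compensation = counterfactual_rn - post_press_rs
    return pre_press_rn + compensation + post_press_rs

for s in ("ignore_button", "respect_button"):
    print(s, corrected_return(s))   # both print 10
```

Under this particular counterfactual, both successors score exactly the same, so the t=0 agent is indifferent between them. Whether Holtman's actual f and g_c terms implement this counterfactual, or something subtly different, is what the rest of this exchange hinges on.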

4johnswentworth
Hadn't seen that before. Based on an hour or so of sorting through it, I think it basically doesn't work. The most legible problem I've been able to identify so far is that it will create shutdown-resistant successor agents (despite Holtman's claims to the contrary). The problem here is that, prior to the shutdown button being pressed, the agent acts-as-though it's just optimizing R'_N and expects to continue optimizing R'_N indefinitely; it has no incentive to maintain the option of switching to R_S later, because at pre-press timesteps f cancels out everything to do with R_S and just makes the value function mimic the R'_N-maximizing value function. And an R'_N-maximizer has no reason to make its successor agents shutdownable. (Holtman claims that section 10 proves that some "emergent incentive" prevents this. I haven't sorted through the notational mess enough to figure out where that proof goes wrong, but I pretty strongly expect there's a mistake there, with my modal guess being that he forgot to account for the term in f which cancels out all the R_S contributions to R. My second-most-probable guess is that g_c is doing something weird to make the section 10 proof work, but then it breaks things in the earlier sections where g_0 is used in place of g_c.)

Also, two complaints at a meta level. First, this paper generally operates by repeatedly slapping on not-very-generalizable-looking patches, every time it finds a new failure mode. This is not how one ends up with robustly-generalizable solutions; this is how one ends up with solutions which predictably fail under previously-unconsidered conditions. Second, after coming up with some interesting proof, it is important to go back and distill out the core intuitive story that makes it work; the paper doesn't really do that. Typically, various errors and oversights and shortcomings of formulation become obvious once one does that distillation step.

This is a well-executed paper that indeed shakes some of my faith in ChatGPT/LLMs/transformers with its negative results.

I'm most intrigued by their negative result for GPT-4 prompted with a scratchpad. (B.5, figure 25.) This is something I would have definitely predicted would work. GPT-4 shows enough intelligence in general that I would expect it to be able to follow and mimic the step-by-step calculation abilities shown in the scratchpad, even if it were unable to figure out the result one- or few-shot (B.2, figure 15).

But, what does this failure mean?... (read more)

I've had a hard time connecting John's work to anything real. It's all over Bayes nets, with some (apparently obviously true; see https://www.lesswrong.com/posts/2WuSZo7esdobiW2mr/the-lightcone-theorem-a-better-foundation-for-natural?commentId=K5gPNyavBgpGNv4m3 ) theorems coming out of it.

In contrast, look at work like Anthropic's superposition solution, or the representation engineering paper from CAIS. If someone told me "I'm interested in identifying the natural abstractions AIs use when producing their output", that is the kind of work I'd expect. It's on a... (read more)

Interesting post. One small area where I might have a useful insight:

A lot of online multiplayer games rest on the appeal of their character design. Think of Smash Bros, Overwatch, or League of Legends. Characters' unique abilities give rise to a dense hypergraph of strategic relationships which players will want to learn the whole of.
But in these games, a character cannot have unique motivations. They'll have a backstory that alludes to some, but in the game, that will be forgotten. Instead, every mind will be turned towards just one game and one goal: K

... (read more)
2mako yass
I can imagine some readers just responding "roleplay, you're describing roleplay". Ray Doraisami brought up ttrpgs, but we agreed that conflict rarely exists in them. My guesses at the reasons were:

  • Compelling roleplay requires everyone in the room to buy into the same story, or else the vibe will shatter?
  • It's too difficult for players to keep any sort of secret, because people's interactions with the world are totally mediated by the DM, and everyone can hear the DM. It might be much easier to do if there were more than one DM and people could go into separate rooms sometimes.

I might describe digital cohabitives as multiplayer roleplaying games, but with thematic incentives. Characters and their different incentives wouldn't just be stories that players are entertaining; they'd be codified in the scoring (leveling?) system, which may in turn be used by the matchmaking system to cohort out bad roleplayers.

I wonder if more people would join you on this journey if you had more concrete progress to show so far?

If you're trying to start something approximately like a new field, I think you need to be responsible for field-building. The best type of field-building is showing that the new field is not only full of interesting problems, but tractable ones as well.

Compare to some adjacent examples:

  • Eliezer had some moderate success building the field of "rationality", mostly through explicit "social" field-building activities like writing the sequences or associated
... (read more)

I continue to be intrigued about the ways modern powerful AIs (LLMs) differ from the Bostrom/Yudkowsky theorycrafted AIs (generally, agents with objective functions, and sometimes specifically approximations of AIXI). One area I'd like to ask about is corrigibility.

From what I understand, various impossibility results on corrigibility have been proven. And yet, GPT-4 is quite corrigible. (At the very least, in the sense that if you go unplug it, it won't stop you.) Has anyone analyzed which preconditions of the impossibility results have been violated by GPT-N? Do doomers have some prediction for how GPT-N for N >= 5 will suddenly start meeting those preconditions?

8Daniel Kokotajlo
Good questions! The ways in which modern powerful AIs (and laptops, for that matter) differ from the theorycrafted AIs are related to the ways in which they are not yet AGI. To be AGI, they will become more like the theorycrafted AIs -- e.g. they'll be continuously online learning in some way or other, rather than a frozen model with some training cutoff date; they'll be running a constant OODA loop so they can act autonomously for long periods in the real world, rather than simply running for a few seconds in response to a prompt and then stopping; and they'll be creatively problem-solving and relentlessly pursuing various goals, and deciding how to prioritize their attention and efforts and manage their resources in pursuit of said goals. They won't necessarily have utility functions that they maximize, but utility functions are a decent first-pass way of modelling them--after all, utility functions were designed to help us talk about agents who intelligently trade off between different resources and goals.

Moreover, and relatedly, there's an interesting and puzzling area of uncertainty/confusion in our mental models of how all this goes, about "Reflective Stability," e.g. what happens as a very intelligent/capable/etc. agentic system is building successors who build successors who build successors... etc. on until superintelligence. Does giving the initial system values X ensure that the final system will have values X? Not necessarily! However, using the formalism of utility functions, we are able to make decently convincing arguments that this self-improvement process will tend to preserve utility functions. Because if it foreseeably changed utility function from X to Y, then probably it would be calculated by the X-maximizing agent to harm, rather than help, its utility, and so the change would not be made. With deontological constraints this is not so clear.

To be clear, the above isn't exactly a proof IMO, just a plausible argument. But we don't even have
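
A toy rendering of the utility-preservation point above, a minimal sketch of my own (the actions and numbers are made up, purely for illustration):

```python
# An agent that currently maximizes U_X evaluates a proposed successor by U_X,
# so it only accepts becoming a U_Y maximizer if that change scores well
# *under U_X*. All outcomes and values here are invented for the example.

# The successor will pick one of these actions to maximize ITS OWN utility.
outcomes = {"make_paperclips": {"U_X": 10, "U_Y": 0},
            "make_staples":    {"U_X": 0,  "U_Y": 10}}

def successor_choice(utility_name):
    """What a successor maximizing `utility_name` would end up doing."""
    return max(outcomes, key=lambda a: outcomes[a][utility_name])

def current_agent_accepts_change(new_utility):
    """The current U_X maximizer scores the successor's behaviour by U_X."""
    value_if_changed = outcomes[successor_choice(new_utility)]["U_X"]
    value_if_unchanged = outcomes[successor_choice("U_X")]["U_X"]
    return value_if_changed >= value_if_unchanged

print(current_agent_accepts_change("U_Y"))  # False: the change harms U_X
print(current_agent_accepts_change("U_X"))  # True: keeping U_X is fine
```

The current agent judges the proposed successor by its current utility function, so a foreseeable change from U_X to U_Y gets vetoed; as noted above, with deontological constraints there is no analogous argument.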

I am sympathetic to this viewpoint. However, I think there are large-enough gains to be had from "just" an AI that: matches genius-level humans; has N times larger working memory; thinks N times faster; has "tools" (like calculators or Mathematica or Wikipedia) integrated directly into its "brain"; and is infinitely copyable. That gets you to https://www.lesswrong.com/posts/5wMcKNAwB6X4mp9og/that-alien-message territory, which is quite x-risky.

Working memory is a particularly powerful intuition pump here, I think. Given that you can hold 8 items in working... (read more)

I don't think the lack of machine translations of AI alignment materials is holding the field back in Japan. Japanese people have an unlimited amount of that already available. "Doing it for them", when their web browser already has the feature built in, seems honestly counterproductive, as it signals how little you're willing to invest in the space.

I think it's possible increasing the amount of human-translated material could make a difference. (Whether machines are good enough to aid such humans or not, is a question I leave to the professional translators.)

1Harold
Completely in agreement with Domenic (though, full disclosure, we're both AISafety東京 members). What's missing in the Japanese space is attempts to answer the question of why the Anglo-US views on AI are relevant in Japan. Anglo-Americans may think it's obvious why that question isn't relevant... which just closes the loop.

Yes, I would really appreciate that. I find this approach compelling in the abstract, but what does it actually cash out in?

My best guess is that it means lots of mechanistic interpretability research, identifying subsystems of LLMs (or similar) and trying to explain them, until eventually they're made of less and less Magic. That sounds good to me! But what directions sound promising there? E.g. the only result in this area I've done a deep dive on, Transformers learn in-context by gradient descent, is pretty limited as it only gets a clear match for linear ... (read more)

Relatedly, CoEms could be run at potentially high speed-ups, and many copies or variations could be run together. So we could end up in the classic scenario of a smarter-than-average "civilization", with "thousands of years" to plan, that wants to break out of the box.

This still seems less existentially risky, though, if we end up in a world where the CoEms retain something approximating human values. They might want to break out of the box, but probably wouldn't want to commit species-cide on humans.

5quetzal_rainbow
As far as I understand, the point of this proposal is that "human-like cognitive architecture ≈ cognitive containability ≈ sort of safety", not "human-like cognitive architecture ≈ human values". I just want to say that even a human can be cognitively uncontainable relative to another human, because they can learn mental tricks that look like Magic to another human.

These are the most compelling-to-me quotes from "Simulators", saved for posterity.

 

Perhaps it shouldn’t be surprising if the form of the first visitation from mindspace mostly escaped a few years of theory conducted in absence of its object.

 

…when AI is all of a sudden writing viral blog posts, coding competitively, proving theorems, and passing the Turing test so hard that the interrogator sacrifices their career at Google to advocate for its personhood, a process is clearly underway whose limit we’d be foolish not to contemplate.

 

GPT-3 doe

... (read more)

"Do stuff that seems legibly valuable" becomes the main currency, rather than "do stuff that is actually valuable."

 

In my experience, these are aligned quite often, and a good organization/team/manager's job is keeping them aligned. This involves lots of culture-building, vigilance around goodharting, and recurring check-ins and reevaluations to make sure the layer under you is properly aligned. Some of the most effective things I've noticed are rituals like OKR planning and wide-review councils, and having a performance evaluation culture that tries ... (read more)

Oh, I didn't realize it was such a huge difference! Almost half of the sequences omitted. Wow. I guess I can write a quick program to diff and thus answer my original question.

Was there any discussion about how and why R:A-Z made the selections it did?

Do you have any recommendations for particularly good sequences or posts that R:A-Z omitted, from the ~430 that I've apparently missed?

This seems like a simulator in the same way the human imagination is a simulator. I could mentally simulate a few chess moves after the ones you prompted. After a while (probably a short while) I'd start losing track of things and start making bad moves. Eventually I'd probably make illegal moves, or maybe just write random move-like character strings if I was given some motivation for doing so and thought I could get away with it.

2metasemi
Yes, it sure felt like that. I don't know whether you played through the game or not, but as a casual chess player, I'm very familiar with the experience of trying to follow a game from just the notation and experiencing exactly what you describe. Of course a master can do that easily and impeccably, and it's easy to believe that GPT-3 could do that too with the right tuning and prompting. I don't have the chops to try that, but if it's correct it would make your 'human imagination' simile still more compelling. Similarly, the way GPT-3 "babbles" like a toddler just acquiring language sometimes, but then can become more coherent with better / more elaborate / recursive prompting is a strong rhyme with a human imagination maturing through its activity in a world of words. Of course a compelling analogy is just a compelling analogy... but that's not nothing!