It's interesting to compare this to the other curated posts I got in my inbox over the last week, "What is malevolence?" and "How will we update about scheming." Both of those (especially the former) I bounced off of due to length. But this one I stuck with for quite a while, before I started skimming in the worksheet section.
I think the instinct to apply a length filter before sending a post to many people's inboxes is a good one. I just wish it were more consistently applied :)
If you can find non-equilibrium quantum states, they are distinguishable. https://en.m.wikipedia.org/wiki/Quantum_non-equilibrium
(Seems pretty unlikely we'd ever be able to definitively say a state was non-equilibrium instead of some other weirdness, though.)
I can help confirm that your blind assumption is false. Source: my undergrad research was with a couple of the people who have tried hardest, which led to me learning a lot about the problem. (Ward Struyve and Samuel Colin.) The problem goes back to Bell and has been the subject of a dedicated subfield of quantum foundations scholars ever since.
This many years distant, I can't give a fair summary of the actual state of things. But a possibly unfair summary based on vague recollections is: it seems like the kind of situation where specialists have something...
If you are willing to share, can you say more about what got you into this line of investigation, and what you were hoping to get out of it?
For my part, I don't feel like I have many issues/baggage/trauma, so while some of the "fundamental debugging" techniques discussed around here (like IFS or meditation) seem kind of interesting, I don't feel too compelled to dive in. Whereas, techniques like TYCS or jhana meditation seem more intriguing, as potential "power ups" from a baseline-fine state.
So I'm wondering if your baseline is more like mine, and you ended up finding fundamental debugging valuable anyway.
The agent is indifferent between creating stoppable or unstoppable subagents, but the agent goes back to being corrigible in this way. The "emergent incentive" handwave is only necessary for the subagents working on sub-goals (section 8.4). Which is not something that either Soares et al. or your post that we're commenting on are prepared to tackle, although it is an interesting followup work.
I suggest engaging with the simulator. It very clearly shows that, given the option of creating shutdown-resistant successor ag...
Are you aware of how Holtman solved MIRI's formulation of the shutdown problem in 2019? https://arxiv.org/abs/1908.01695, my summary notes at https://docs.google.com/document/u/0/d/1Tno_9A5oEqpr8AJJXfN5lVI9vlzmJxljGCe0SRX5HIw/mobilebasic
Skimming through your proposal, I believe Holtman's correctly-constructed utility function correction terms would work for the scenario you describe, but it's not immediately obvious how to apply them once you jump to a subagent model.
This is a well-executed paper that indeed shakes some of my faith in ChatGPT/LLMs/transformers with its negative results.
I'm most intrigued by their negative result for GPT-4 prompted with a scratchpad. (B.5, figure 25.) This is something I would have definitely predicted would work. GPT-4 shows enough intelligence in general that I would expect it to be able to follow and mimic the step-by-step calculation abilities shown in the scratchpad, even if it were unable to figure out the result one- or few-shot (B.2, figure 15).
But, what does this failure mean?...
I've had a hard time connecting John's work to anything real. It's all built on Bayes nets, with some (apparently obviously true: https://www.lesswrong.com/posts/2WuSZo7esdobiW2mr/the-lightcone-theorem-a-better-foundation-for-natural?commentId=K5gPNyavBgpGNv4m3 ) theorems coming out of it.
In contrast, look at work like Anthropic's superposition solution, or the representation engineering paper from CAIS. If someone told me "I'm interested in identifying the natural abstractions AIs use when producing their output", that is the kind of work I'd expect. It's on a...
Interesting post. One small area where I might have a useful insight:
...A lot of online multiplayer games rest on the appeal of their character design. Think of Smash Bros, Overwatch, or League of Legends. Characters' unique abilities give rise to a dense hypergraph of strategic relationships which players will want to learn the whole of.
But in these games, a character cannot have unique motivations. They'll have a backstory that alludes to some, but in the game, that will be forgotten. Instead, every mind will be turned towards just one game and one goal: K
I wonder if more people would join you on this journey if you had more concrete progress to show so far?
If you're trying to start something approximately like a new field, I think you need to be responsible for field-building. The best type of field-building is showing that the new field is not only full of interesting problems, but tractable ones as well.
Compare to some adjacent examples:
I continue to be intrigued about the ways modern powerful AIs (LLMs) differ from the Bostrom/Yudkowsky theorycrafted AIs (generally, agents with objective functions, and sometimes specifically approximations of AIXI). One area I'd like to ask about is corrigibility.
From what I understand, various impossibility results on corrigibility have been proven. And yet, GPT-4 is quite corrigible. (At the very least, in the sense that if you go unplug it, it won't stop you.) Has anyone analyzed which preconditions of the impossibility results have been violated by GPT-N? Do doomers have some prediction for how GPT-N for N >= 5 will suddenly start meeting those preconditions?
I am sympathetic to this viewpoint. However I think there are large-enough gains to be had from "just" an AI that: matches genius-level humans; has N times larger working memory; thinks N times faster; has "tools" (like calculators or Mathematica or Wikipedia) integrated directly into its "brain"; and is infinitely copyable. That gets you to https://www.lesswrong.com/posts/5wMcKNAwB6X4mp9og/that-alien-message territory, which is quite x-risky.
Working memory is a particularly powerful intuition pump here, I think. Given that you can hold 8 items in working...
I don't think the lack of machine translations of AI alignment materials is holding the field back in Japan. Japanese people have an unlimited amount of that already available. "Doing it for them", when their web browser already has the feature built in, seems honestly counterproductive, as it signals how little you're willing to invest in the space.
I think it's possible increasing the amount of human-translated material could make a difference. (Whether machines are good enough to aid such humans or not, is a question I leave to the professional translators.)
Yes, I would really appreciate that. I find this approach compelling in the abstract, but what does it actually cash out in?
My best guess is that it means lots of mechanistic interpretability research, identifying subsystems of LLMs (or similar) and trying to explain them, until eventually they're made of less and less Magic. That sounds good to me! But what directions sound promising there? E.g. the only result in this area I've done a deep dive on, Transformers learn in-context by gradient descent, is pretty limited as it only gets a clear match for linear ...
Relatedly, CoEms could be run at potentially high speed-ups, and many copies or variations could be run together. So we could end up in the classic scenario of a smarter-than-average "civilization", with "thousands of years" to plan, that wants to break out of the box.
This still seems less existentially risky, though, if we end up in a world where the CoEms retain something approximating human values. They might want to break out of the box, but probably wouldn't want to commit species-cide on humans.
These are the most compelling-to-me quotes from "Simulators", saved for posterity.
Perhaps it shouldn’t be surprising if the form of the first visitation from mindspace mostly escaped a few years of theory conducted in absence of its object.
…when AI is all of a sudden writing viral blog posts, coding competitively, proving theorems, and passing the Turing test so hard that the interrogator sacrifices their career at Google to advocate for its personhood, a process is clearly underway whose limit we’d be foolish not to contemplate.
...GPT-3 doe
"Do stuff that seems legibly valuable" becomes the main currency, rather than "do stuff that is actually valuable."
In my experience, these are aligned quite often, and a good organization/team/manager's job is keeping them aligned. This involves lots of culture-building, vigilance around goodharting, and recurring check-ins and reevaluations to make sure the layer under you is properly aligned. Some of the most effective things I've noticed are rituals like OKR planning and wide-review councils, and having a performance evaluation culture that tries ...
Oh, I didn't realize it was such a huge difference! Almost half of the sequences omitted. Wow. I guess I can write a quick program to diff and thus answer my original question.
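The quick diff program could be sketched in a few lines of Python, assuming the two post lists have been collected somewhere. The titles and helper below are placeholders, not the real lists:

```python
def omitted_posts(sequences_titles, raz_titles):
    """Return titles that appear in the original Sequences but not in R:A-Z."""
    return sorted(set(sequences_titles) - set(raz_titles))

# Toy data standing in for the real post lists:
sequences = ["Post A", "Post B", "Post C", "Post D"]
raz = ["Post A", "Post C"]

for title in omitted_posts(sequences, raz):
    print(title)
```

With the actual ~430-post gap, the same set difference would answer the original question directly.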
Was there any discussion about how and why R:A-Z made the selections it did?
Do you have any recommendations for particularly good sequences or posts that R:A-Z omitted, from the ~430 that I've apparently missed?
This seems like a simulator in the same way the human imagination is a simulator. I could mentally simulate a few chess moves after the ones you prompted. After a while (probably a short while) I'd start losing track of things and start making bad moves. Eventually I'd probably make illegal moves, or maybe just write random move-like character strings if I was given some motivation for doing so and thought I could get away with it.
These posts always leave me feeling a little melancholy that my life doesn't seem to have that many challenges where thinking faster/better/harder/sooner would actually help.
Most of my waking hours are spent on my job, where cognitive performance is not at all the bottleneck. (I honestly believe that if you made me 1.5x "better at thinking", this would not give a consistent boost in output as valued by the business. I'm a software engineer.) I have some intellectual spare-time hobbies, but the most demanding of them is Japanese studying, which is more abou...