Part of what it means to be a rationalist is to be able to change our minds when exposed to evidence.
What was the most significant way you have changed your mind in the last year?
Weirdly, I’ve become a lot more optimistic about alignment in the past two weeks.
It’s pretty clear that human values are an inner alignment failure wrt both evolution and the reward circuitry evolution gave us (we’re not optimising for inclusive genetic fitness, and we don’t wirehead, after all). The thing people often take away from connecting human values to inner alignment failure is that we should be very suspicious and worried about inner alignment. After all, who knows what the system could end up optimizing for after being trained on a “human values” reward function?
However, I think there’s a slightly different perspective on inner alignment failure which goes like:
1: Human values derive from an inner alignment failure.
2: Humans are the only systems that instantiate (any version of) human values.
—> Inner alignment failure is the only process in the known universe to ever generate human values.
Under this perspective, the question to ask about inner alignment failure isn’t “How do we protect efforts to align an AI with human values, from the only process ever known to generate human values?”, but instead something like “How do we induce an inner alignment failure that’s likely to lead to human-compatible values?”.
(I apologise for my somewhat sarcastic characterisation of inner alignment concerns; the point is to rhetorically emphasise a perspective in which we do not immediately assign negative valence to the mere possibility of an inner alignment failure. I still think concerns about inner alignment remain, but I now focus on understanding and influencing outcomes, rather than on avoiding occurrence.)
I’ve been thinking a lot about that latter question, and it’s starting to seem like a lot of the weirder or more “fragile”-seeming aspects of human values actually emerge pretty naturally from an inner alignment failure in a brain-like learning system. There may be broad categories of capabilities-competitive architectures that acquire / retain / generalise values in a surprisingly human-like manner.
I hope to have a full post about these results soon, but I can give an example of one such human-like tendency that seems to emerge from this framing:
Consider how we want a diverse future, one that retains many elements of the current era. Our ideal future doesn’t look like tiling the universe with some hyper-optimized instantiation of “human values”. If an AGI from the future told you something like “the human mind and body were not optimal for maximising values-related cognition per unit of resources expended, and have therefore been entirely replaced with systems that implement that cognition more efficiently”, you’d not be entirely pleased with that outcome, even if the cognition performed really did have value.
This instinct is quite contrary to how the optima of most utility functions or values look. The standard alignment reasoning wrt this oddity is (as I understand it, anyway) to say something like “Evolution gave humans values that are quite complex / ‘unnatural’ among utility functions. It’s very difficult to specify or learn a utility function whose optimum retains desirable-to-humans diversity.”
However, a desire to perpetuate (at least some of) our current diversity into the future emerges quite naturally if you view human values as emerging from an ongoing inner alignment failure. In this view, the cognitive patterns / brain circuitry that process information about our current world are self-perpetuating. I.e., your circuits want to be used, and therefore, retained. They’ll influence your future actions and desires to improve their odds of being used and retained.
The circuits that process information about dogs, for example, only exist in your brain because you needed to process information about dogs. They can only be certain of continued existence if the world still contains dogs for you to process information about. Thus, they would object to a dog-less future, even if dogs were replaced by something “more optimal”. The same reasoning applies to circuits that perform any other aspect of your cognition. In other words, the “values from inner alignment failure” perspective naturally gives us preferences over the distribution of cognition we’d be able to perform in the future, and a reason why that cognition should (at least somewhat) resemble the cognition we currently use.
This perspective also explains why we can learn to value more things as we interact with them (very few utility maximisers would do something like that naturally). We need to form new circuits to be able to process info about new things, and those new circuits would also have a say in our consensus. Learning systems with ongoing inner alignment failures might naturally accumulate values as they interact with the world, possibly allowing alignment to a moving target.
Of course, single circuits don’t have unlimited control over our values, so it’s possible to want a future that entirely lacks things that conflict sufficiently with our other values. The overall point is that it may be surprisingly easy to build a learning framework that “skews towards diversity”, so to speak, in a way that expected utility maximisation really doesn’t. I’ve also had similarly interesting results regarding things like our intuitions about moral philosophy, wireheading, and the adoption of deep vs shallow patterns.
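To make this concrete, here is a deliberately toy sketch (my own illustration, not a result from the forthcoming post) of the “circuits want to be used” story: an agent grows a “circuit” for each kind of thing it interacts with, and each circuit only endorses futures in which it would still get used, so a diverse, familiar future beats a homogeneous “optimised” one. Everything here (the class name, the list of things, the candidate futures) is invented for illustration.

```python
# Toy sketch of "values from inner alignment failure": circuits formed by
# interacting with the world prefer futures in which they still get used.
from collections import Counter

class ToyValueLearner:
    def __init__(self):
        self.circuit_weights = Counter()  # one "circuit" per kind of thing encountered

    def experience(self, thing):
        # Interacting with something forms / strengthens the circuit that processes it.
        self.circuit_weights[thing] += 1

    def evaluate_future(self, future_world):
        # Each circuit only endorses futures in which it would still get used,
        # so the score is the total weight of circuits that remain relevant.
        present = set(future_world)
        return sum(w for thing, w in self.circuit_weights.items() if thing in present)

agent = ToyValueLearner()
for thing in ["dogs", "music", "forests", "dogs", "friends", "music"]:
    agent.experience(thing)

diverse_future = ["dogs", "music", "forests", "friends", "new_art_forms"]
tiled_future = ["hyper_optimised_value_stuff"] * 5

print(agent.evaluate_future(diverse_future))  # 6: most existing circuits still get used
print(agent.evaluate_future(tiled_future))    # 0: no existing circuit gets used
```

The point of the toy is only that “skewing towards diversity” falls out of letting the learned circuits vote, rather than having to be written into a utility function over world-states.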
Overall, I’ve updated away from “evolution gave us lots of values-related special sauce, good luck figuring it all out” and more towards “evolution gave us a pretty simple value-learning and weighing mechanism, whose essential elements may not be that hard to replicate in an AI.”
Basically, the core thought process that led me to this update was to think carefully about what the concept of inner alignment failure means when combined with a multi-agent theory of mind.
I'm very suspicious of:
Inner alignment failure is the only process in the known universe to ever generate human values
as a jumping-off point, since inner alignment failure did not hit a pre-defined target of human values. It just happened to produce them. If a gun can fire one bullet, I'll expect it can fire a second. I won't expect the second bullet to hit the first.
On the rest, it strikes me that:
This is one of the most intriguing optimistic outlooks I’ve read here in a long time. Looking forward to your full post!
Tldr: Love used to be in short supply (for myself and others). I read Replacing Guilt and tried improv + metta meditation. Now it is in big supply and has led to significant positive changes in my actions.
I had always been in a single-player and critical mindset, optimizing everything for myself. Thinking about what would be a nice thing to do for others (and empathizing with their feelings) hardly ever popped into my awareness.
Over the last year, that changed through three things: reading Replacing Guilt, doing improv, and practising metta meditation.
Obviously, the process involved a lot more ups and downs than suggested here. But these are the three big factors I feel comfortable abstracting out as capturing the fundamental changes.
I'm incredibly thankful to LessWrong and the wider rationality movement for the mental tools it provides. My 2020 self would not have predicted this :)
The European Union. After seeing the response to the COVID pandemic, I was very surprised by the unity displayed in response to the Russian invasion of Ukraine. I updated towards thinking they might be able to coordinate on some things.
I changed my mind from "I barely know anything in medicine / biology / biochem / biotech and should listen to people trained in medicine", to "I barely know anything in medicine / biology / biochem / biotech but can become more competent in specific areas than people trained in medicine with not a lot of effort".
I previously had imposter syndrome. I now know much better where the edges of medical knowledge are, and in particular where the edges of the average doctor's medical knowledge are. The bar is lower than I thought, by a substantial margin.
Maybe this is a bit too practical and not as "world-modeling-esque" as your question asks? But I no longer believe that raw intelligence is enough of a "credential" to rely on.
You might hear it as: he/she's the smartest guy/gal I know, so you should trust them; we have insanely great talent at this company; they went to MIT so they're smart; they have a PhD so listen to them. I like to liken these to Mom-and-Dad bragging points. Any number of such things are really just proxies for "they're smart".
I used to personally believe this of myself-- I'm smart and can get stuff done, so why can't the PM just stop asking me for updates?-- but having been on the receiving end of this, I've adjusted my beliefs.
I've had the opportunity to work with "rockstars" in my field; people whose papers I've read, whose research I've built on, and who were on my bucket list to meet (a little nerdy, I know). But now I realize that even if you rely on someone who is incredibly smart, not having clear communication channels with that super smart person makes things difficult.
I believe that, while "being smart" is arguably a pre-req for many of these things, the real "shining" trait is one's communication skills. As in my example of the annoying PM above, it doesn't matter how smart I am if I'm not able to provide some concrete results and metrics for others to monitor me. This has changed my behavior to leave a paper trail in most things I do: sending follow-up emails after meetings, tracking Jiras, noting weekly accomplishments to bring up in 1-1s, etc.
There's a balance here, of course, between "metric gathering" (or, more cynically, bean counting) and "letting engineers do things". I would definitely complain so much more if I got pinged every day on status updates. But I've gone from "I'm a poor 10x engineer suffocated by bureaucracy and will crawl out of my cubicle when I finish" to "I understand the need for me to crawl out of my hole from time to time".
I find this communication <--> deep work spectrum to pop up in tons of aspects of life, not just my daily work life. Investor relations, family/friend life, academia (see my book review above!).
Probably not the most significant updates, but tracking all changes to beliefs creates significant overhead, so I don't remember the most important one. Often I'm unsure whether something counts as a proper update versus learning something new or refining a view, but whatever, here are two examples:
Reading the excellent blog Traditions of Conflict, I have become more confused about how egalitarian hunter-gatherer societies really are. The blog describes instances of male cults controlling resources in tribes, a high prevalence of arranged polygynous marriage, and the absence of matriarchies, which doesn't fit well with the degree of egalitarianism I previously attributed to those societies. Confusing, but perhaps due to the sampling bias of the writer (who is mostly interested in this phenomenon of male dominance, neglecting more egalitarian societies). However, checking Wikipedia confirms the suspicious absence of matriarchies (and if hunter-gatherers were basically egalitarian, we should, by random error, see about as many matriarchal as patriarchal societies).
Odd.
Another decent update is on the importance of selection in evolution: reading Gillespie on population genetics has updated me towards believing that random mutation and drift are much more important than selection.
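For a feel of why this is plausible, here is a minimal Wright-Fisher sketch (my own illustration of the standard drift-vs-selection result, not something taken from the comment or quoted from Gillespie): a new mutant's fixation probability stays near the neutral 1/N unless Ns is well above 1, which is the usual sense in which drift can swamp weak selection. Population size, selection coefficients, and replicate counts are made up.

```python
# Haploid Wright-Fisher simulation: estimate the fixation probability of a
# single new mutant with selective advantage s in a population of size N.
import numpy as np

rng = np.random.default_rng(0)

def fixation_probability(N, s, replicates=5000):
    fixed = 0
    for _ in range(replicates):
        p = 1.0 / N                            # one mutant copy
        while 0.0 < p < 1.0:
            p_sel = p * (1 + s) / (1 + p * s)  # deterministic effect of selection
            p = rng.binomial(N, p_sel) / N     # binomial sampling = genetic drift
        fixed += (p == 1.0)
    return fixed / replicates

N = 100
print(fixation_probability(N, 0.0))    # neutral: ~1/N = 0.01
print(fixation_probability(N, 0.001))  # Ns = 0.1: still ~1/N, drift dominates
print(fixation_probability(N, 0.05))   # Ns = 5: ~2s = 0.1, selection dominates
```

Whether "much more important" is the right summary then hinges on how often real Ns values sit below that threshold, which this toy model obviously cannot settle.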
I think the idea of hunter-gatherers being egalitarian is just a subtrope of the "noble savage". (The ancient people were perfect in all applause lights: perfectly spiritual, perfectly ecological, perfectly free and egalitarian, but also perfectly knowing their place in society, perfectly peaceful, but also having a perfect warrior ethics, etc.)
Probably with some Marxist-ish use of wealth as a proxy for inequality in general; so if you have no capital, then by definition everyone must be perfectly equal, right? Haha, no. A strong person and their buddies c...
I am curious about the second part of your comment. What exactly makes you think that random mutation and drift are much more important than natural selection? The bit I have problems with is the "much more".
One big (though impersonal) recent change was the realization that Putin is not nearly as intelligent as I had thought he was. A more personal shift in view has been going from believing that if God exists (which I don’t think he does, but entertaining the hypothetical for a moment) He must be evil, to believing that it is possible for a sentient creator of [the universe that we find ourselves in] to be benevolent, under some not-unreasonable metaphysical assumptions. This has impacted how I interact with theists, and updated me towards believing that theistic philosophers aren’t as crazy as I had previously thought.
In the last year there have been medium-sized changes, and I hope that suffices. If you are familiar with MBTI: each of us has a dominant function - a kind of preferred way to see, sense, understand, and relate to ourselves, the world, and others.
As an ENTP, there are some things that are very, very foreign to me - and so trying to grow by using functions I have been stuck on has been very rewarding.
The first: I finished Diablo 2 (a hack-and-slash action RPG, an oldie, on the highest difficulty). By finished, I mean I played the character I wanted and persevered: I didn't start over with a better build, didn't read the best grind tips (I did read something, I'm not that hardcore), but kept playing and grinding to actually finish the game, even though it wouldn't be perfect. I cheesed one of the last bosses. It was very gratifying, and actually finishing something in this way unstuck something in my mind.
I finished Myst: Riven (a puzzle game). Now, this was hard. It opened my eyes to the more practical world of Sensing, I believe. It is a very different experience of the world. It was terribly hard to motivate myself to come back, again and again, and find new things. And I saw a terrible lot of useless connections, and sometimes what I did came down to pure luck.
The Riven world is stunning and the puzzles have aged well (the last one is... a pain). But I managed, with minimal help on the last one.
Replaying it again... suddenly the world made so much sense, like I could grasp something. Appreciating the mechanical, more Sensing way of relating to the world, I imagine. It was a good feeling that, even though it is a different way of processing the world, it was now something I could connect with more.
And lastly, Introverted Intuition. My partner has this as her dominant function, and it has been very eye-opening to peer a bit into her way of processing the world. I guess this experience, more than the others, was the biggest eye-opener with regards to how differently we view the world - even though the games are different, I can see the structures more clearly here, as they are closer and I am being shown them in real time, by an expert :)
One way of saying how I process is that I find and validate connections between intangible forms of shadows, and new ones are added regularly. And with them I create constructs of air, but where the pressure must be perfect to be meaningful. (If you don't like metaphors and analogies - used as metaphors and analogies - ENTP language must be a pain.)
My partner thinks in... it still hurts to think about it, so this isn't as easy. If her goal is high, there is no compromise, so the inner form adapts to the goal. So if you see the goal, you draw a line between the two points. But what if the real distance is 1000 km? Then you will have to build stronger. And what if the points are moving, or there is interference... No compromise: connect those two points, and let them stay connected.
That has been a pretty big eye-opener. We have the same worldview, though with different life experiences - but our inner structures, even if there are some similarities, are still painfully hard to actually emulate.
Thanks for the question though, was interesting to find an answer to it.
[removed]
Interesting. Why do you think that Yudkowsky et al. think it's a bad idea to apply that obvious solution, to "just" make digital life and partner with it?
Explain that position to me, the way Yudkowsky and Bostrom would explain it. Explain it in their words. Surely you must understand their position very well by now.
Nevermind, I doubt this is a helpful question to ask. I apologize.
Ok, putting my [maybe I'm missing the point] hat on, it strikes me that the above is considering the learned steering system - which is the outcome of any misalignment. So I probably am missing your point there (I think?). Oops.
However, I still think I'd stick to saying that:
But here I'd need to invoke properties of the original steering system (ignoring the handwaviness of what that means for now), rather than the learned steering system.
I think what matters at that point is sampling of trajectories (perhaps not only this - but at least this). There's no mechanism in humans to sample in such a way that we'd expect maximisation of reward to be learned in the limit. Neither would we expect one, since evolution doesn't 'care' about reward maximisation.
Absent such a sampling mechanism, the objective encoded isn't likely to be maximisation of the reward.
To talk about inner misalignment, I think we need to be able to say something like: (1) in the limit, the training process would produce a system that maximises the objective encoded by the original steering system (e.g. reward), and (2) the objective the learned system actually ends up pursuing differs from that one.
Here I don't think we have (1), since we don't expect the human system to learn to maximise reward (or minimise regret, or...) in the limit (i.e. this is not the objective encoded by their original steering system).
Anyway, hopefully it's now clear where I'm coming from - even if I am confused!
My guess is that this doesn't matter much to your/Quintin's broader points(?) - beyond that "inner alignment failure" may not be the best description.