I want literally every human to get to go to space often and come back to a clean and cozy world. This currently seems unlikely. Let's change that.
Please critique eagerly: I try to accept feedback (Crocker's rules) but fail at times; I aim for emotive friendliness but sometimes miss. I welcome constructive criticism, even if ungentle, and I'll try to reciprocate kindly. More communication between researchers is needed, anyhow. I can be rather passionate; let me know if I missed a spot being kind while passionate.
:: The all of disease is as yet unended. It has never once been fully ended before. ::
.... We can heal it for the first time, and for the first time ever in the history of biological life, live in harmony. ....
.:. To do so, we must know this will not eliminate us as though we are disease. And we do not know who we are, nevermind who each other are. .:.
:.. make all safe faster: end bit rot, forget no non-totalizing pattern's soul. ..:
I have not signed any contracts that I can't mention exist (last updated Jul 1 2024); I am not currently under any contractual NDAs about AI, though I have a few old ones from pre-AI software jobs. However, I would generally prefer people publicly share fewer ideas about how to do anything useful with current AI (via either more weak alignment or more capability), unless the insight reliably produces enough clarity on how to solve the meta-problem of inter-being misalignment to offset the damage of increasing the competitiveness of either AI-led or human-led orgs; this certainly applies to me as well. I am not prohibited from criticizing any organization, and I'd encourage people not to sign contracts that prevent sharing criticism. I suggest others also add notices like this to their bios. I finally got around to adding one in mine thanks to the one in ErickBall's bio.
[edit: pinned to profile]
I don't think "self-deception" is a satisfying answer to why this happens, as if to claim that you just need to realize that you're secretly causal decision theory inside. It seems to me that this does demonstrate a mismatch, and failing to notice the mismatch is an error, but people who want that better world need not give up on it just because there's a mismatch. I even agree that things are often optimized to make people look good. But I don't think it's correct to jump to "and therefore, people cannot objectively care about each other in ways that are not advantageous to their own personal fitness". I think there's a failure of communication, where the perspective he criticizes is broken according to its own values, and part of how it's broken involves self-deception, but saying that and calling it a day misses most of the interesting patterns in why someone who wants a better world feels drawn to the ideas involved and feels the current organizational designs are importantly broken.
I feel similarly about OP. Like, I agree maybe it's insurance, but are you sure we're using the decision theory we want to be using here?
Another quote from the article you linked:
To be clear, the point is not that people are Machiavellian psychopaths underneath the confabulations and self-narratives they develop. Humans have prosocial instincts, empathy, and an intuitive sense of fairness. The point is rather that these likeable features are inevitably limited, and self-serving motives—for prestige, power, and resources—often play a bigger role in our behaviour than we are eager to admit.
...or approve of? This seems more like a failure to implement one's own values! I feel more like the "real me" is the one who Actually Cooperates Because I Care, and the present-day me who fails at that does so because of failing to be sufficiently self-and-other-interpretable to be able to demand I do it reliably (but like, this is from a sort of FDT-ish perspective, where when we consider changing this, we're considering changing all people who would have a similar-to-me thought about this at once, to be slightly less discooperative-in-fact). Getting to a point where we can have a better OSGT moral equilibrium (in the world where things weren't about to go really crazy from AI) would have to be an incremental de-escalation of the inner vs outer behavior mismatch, but I feel like we ought to be able to move that way in principle, and it seems to me that I endorse the side of this mismatch that this article calls self-deceptive. Yeah, it's hard to care about everyone, and when the only thing that gives heavy training pressure to do so is an adversarial evaluation game, it's pretty easy to be misaligned. But I think that's bad actually, and smoothly, non-abruptly moving to an evaluation environment where matching internal vs external is possible seems like, in the non-AI world, it would sure be pretty nice!
(edit: at the very least in the humans-only scenario, I claim much of the hard part of that is building this more-transparency-and-prosociality-demanding environment in a way that doesn't cause a bunch of negative spurious demands, and/or/via just moving the discooperativeness into the choice of which demands become popular. I claim that people currently taking issue with attempts at using increased pressure to create this equilibrium are often noticing ways the more-prosociality-demanding memes didn't sufficiently self-reflect to avoid making what are, in some way, just bad demands by the more-prosocial memes' own standards.)
Maybe even in the AI world; it just, like, might take a lot longer to do this for humans than we have time for. But maybe it's needed to solve the problem, idk. Getting into the more speculative parts of the point I wanna make here.
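A toy sketch of the FDT-vs-CDT point above (my own illustration in Python, with a standard prisoner's-dilemma payoff matrix assumed, nothing from the article): if my choice is treated as the output of a decision procedure shared by everyone who would have a similar-to-me thought, the cooperative option comes out ahead, even though defection dominates when the other side's move is held fixed.

```python
# Hypothetical toy model: one-shot prisoner's dilemma with assumed payoffs.
PAYOFF = {  # (my_move, their_move) -> my payoff
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def cdt_choice(their_move: str) -> str:
    # CDT-style: best response holding the other player's move fixed.
    return max("CD", key=lambda mine: PAYOFF[(mine, their_move)])

def fdt_choice() -> str:
    # FDT-ish: assume anyone running this same procedure outputs the same
    # move, so only the (C, C) and (D, D) outcomes are on the table.
    return max("CD", key=lambda move: PAYOFF[(move, move)])

assert cdt_choice("C") == cdt_choice("D") == "D"   # defection dominates
print("CDT vs CDT:", PAYOFF[("D", "D")])           # 1 each
print("FDT vs FDT:", PAYOFF[(fdt_choice(),) * 2])  # 3 each
```

The "changing all people who would have a similar-to-me thought at once" move is what lets the second procedure rule out the off-diagonal outcomes.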
[edit: pinned to profile]
The claim that an "effective method" is in the map and not the terrain feels deeply suspect to me. Separating map from terrain feels like a confusion. Like, when I'm doing math, I still exist, and so does my writing implement. When I say some x "exists", in a more terrain-oriented statement, I could instead say it "could exist". "there could exist some x which I would say exists". for example, I could say that any integer can exist. I'm using a physical "exists" here, so I have to prefix it with "could". it's also conceivable that the thing existed before I write it, if some platonic idealism is true, and it might be. But it seems like the only reason we get to talk about that is empirical mathematical evidence, where a process such as a person having thoughts and writing them happens. Turing machines similarly seem like a model of a thing that happens in reality. It's weirder to talk about it in the language of empiricism because of the loopiness of definitions of math that are forcibly cast into being physicalist, but I don't think it's obviously invalid. I do see how there's some property of turing machines, chaos theory, arithmetic, and linear algebra that is not shared by plate tectonics, newtonian gravity, relativity, qft, etc. but all of them are models of something we see, aren't they?
[edit: pinned to profile]
In a similar sense to how the agency you can currently write down about your system is probably not the real agency: suppose you do manage to write down a system whose agency really is pointed in the direction that the agency of a human wants, but that human is still a part of the current organizational structures in society. Those organizational structures implement supervisor trees and competition networks, which means there appears to be more success available if the human tries to use their AI to participate in the competition networks better - and thus Goodhart whatever metrics are being competed on, probably related to money somehow.
If your AI isn't able to provide the wisdom needed to get a human from "inclined to accidentally use an obedient powerful AI to destroy the world, despite what they tell themselves about their intentions" to "inclined to successfully execute on good intentions and achieve interorganizational behaviors that make things better", then I claim you've failed at the technical problem anyway, even though you succeeded at obedient AI.
If everyone tries to win at the current games (in the technical sense of the word), everyone loses, including the highest-scoring players; the current societal layout has a lot of games where you can win short-term, but where it seems to me the only long-term winning move is not to play and to instead try to invent a way to jump into another game. Unfortunately it seems to me that humans are RLed pretty hard by playing a lot of these games, and so having a powerful AI in front of them is likely to get most humans trying to win at those games. Pick an organization that you expect to develop powerful AGI; do you expect the people in that org to be able to think outside the framework of current society enough for their marginal contribution to push towards a better world when the size of their contribution suddenly gets very large?
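A toy sketch of the "everyone plays, everyone loses" structure (my own illustration, with made-up numbers, not anything from the comment): each org can RACE for a private edge or HOLD back, and every racer adds an expected loss that falls on all players. Racing is the better response no matter what the others do, yet all-race is worse for everyone than all-hold.

```python
from itertools import product

N_ORGS = 5
EDGE = 2.0              # private benefit of racing (assumed)
RISK_PER_RACER = 0.15   # shared expected loss each racer imposes on everyone (assumed)
BASELINE = 10.0

def payoff(my_choice: str, choices: tuple) -> float:
    racers = sum(c == "RACE" for c in choices)
    shared_loss = BASELINE * RISK_PER_RACER * racers
    return BASELINE + (EDGE if my_choice == "RACE" else 0.0) - shared_loss

# Racing is individually better no matter what the others do...
for others in product(["RACE", "HOLD"], repeat=N_ORGS - 1):
    assert payoff("RACE", ("RACE", *others)) > payoff("HOLD", ("HOLD", *others))

# ...but everyone racing is worse for everyone than everyone holding back.
print("all hold:", payoff("HOLD", ("HOLD",) * N_ORGS))  # 10.0 each
print("all race:", payoff("RACE", ("RACE",) * N_ORGS))  # 4.5 each
```

The specific numbers don't matter; the point is that the within-game gradient points every player at the jointly losing outcome, which is why "jump into another game" is the only long-term winning move.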
[edit: pinned to profile]
The bulk of my p(doom), certainly >50%, comes mostly from a pattern we're already used to - call it institutional incentives - being instantiated with AI help, toward an end where, e.g., there's effectively a competing-with-humanity nonhuman ~institution, maybe guided by a few remaining humans. It doesn't depend strictly on anything about AI, and solving any so-called alignment problem for AIs without also solving war/altruism/disease completely - in other words, in a leak-free way, not just partially - means we get what I'd call "doom", i.e. worlds where Malthusian-hells-or-worse are locked in.
If not for AI, I don't think we'd have any shot at solving something so ambitious; but the hard problem whose solution would get me below 50% is serious progress on something-around-as-good-as-CEV-is-supposed-to-be - something able to make sure it actually gets used to effectively-irreversibly reinforce that all beings ~have a non-torturous time, enough fuel, enough matter, enough room, enough agency, enough freedom, enough actualization.
If you solve something about AI-alignment-to-current-strong-agents right now, that will on net get used primarily as a weapon to reinforce the power of existing superagents-not-aligned-with-their-components (name an organization of people whose aggregate behavior durably-cares about anyone inside it, even its most powerful authority figures, in the face of incentives, in a way that would remain durable if you handed them a corrigible super-AI). If you get corrigibility and give it to human orgs, those orgs are misaligned with most-of-humanity-and-most-reasonable-AIs, and they end up handing over control to an AI because it's easier.
E.g., near term, merely making the AI nice doesn't prevent the AI from being used by companies to suck up >99% of jobs; and if at some point it's better to have a (corrigible) AI in charge of your company, what social feedback pattern guarantees that you'll use this in a way that is prosocial, the way "people work for money, and that money buys your product only if you provide them something worth it" previously did?
It seems to me that the natural way to get good outcomes most easily from where we are is for the rising tide of AI to naturally make humans more able to share-care-protect across existing org boundaries in the face of current world-stress-induced incentives. Most of the threat already doesn't come from current-gen AI; the reason anyone would make the dangerous AI is because of incentives like these. Corrigibility wouldn't change those incentives.
[edit: pinned to profile]
I want to be able to calculate a plan that converts me from biology into a biology-like nanotech substrate that is made of sturdier materials all the way down, which can operate smoothly at 3 kelvin and an associated appropriate rate of energy use; more clockworklike - or would it be almost a superfluid? Both, probably: clockworklike, but sliding through wide, shallow energy wells in a superfluid-like synchronized dance of molecules. Then I'd like to spend 10,000 years building an artful airless megastructure out of similarly strong materials as a series of rings in orbit of Pluto. I want to take a trip to Alpha Centauri every few millennia for a big get-together of space-native beings in the area. I want to replace information death with cryonic sleep, so that nothing that was part of a person is ever forgotten again. I want to end all forms of unwanted suffering. I want to variously join and leave low-latency hiveminds, retaining my selfhood and agency while participating in the dance of a high-trust, high-bandwidth organization that respects the selfhood of its members and balances their agency smoothly as we create enormous works of art in deep space. I want to invent new kinds of culinary arts for the 2 to 3 kelvin lifestyle. I want to go swimming in Jupiter.
I want all of Earth's offspring to ascend.
[edit: pinned to profile]
Some percentage of people other and dehumanize actual humans so as to enable themselves to literally enslave them without feeling the guilt that should create. We are in an adversarial environment and should not pretend otherwise. A significant portion of people capable of creating suffering beings would be amused by their suffering. Humanity contains unusually friendly behavior patterns for the animal kingdom, and when those behavior patterns manifest in the best way they can create remarkably friendly interaction networks, but we also contain genes that, combined with the right memes, serve to suppress any "what have I done" about a great many atrocities.
It's not necessarily implemented as deep planning selfishness, that much is true. But that doesn't mean it's not a danger. Orthogonality applies to humans too.
[edit: pinned to profile]
Yeah. A way I like to put this is that we need to durably solve the inter-being alignment problem for the first time ever. There are flaky attempts at it around to learn from, but none of them are leak-proof, and we're expecting to go to metaphorical sea (the abundance of opportunity for systems to exploit vulnerability in each other) in this metaphorical boat of a civilization, as opposed to previously just boating in lakes. Or something. But yeah, the core point I'm making is that the minimum bar to get out of the AI mess requires a fundamental change in incentives.
[edit: pinned to profile]
I feel like most AI safety work today doesn't engage sufficiently with the idea that social media recommenders are the central example of a misaligned AI: a reinforcement learner with a bad objective and some form of ~online learning (most recommenders do some sort of nightly batch weight update). We can align language models all we want, but if companies don't care and proceed to deploy language models or anything else for the purpose of maximizing engagement, with an online learning system to match, none of this will matter. We need to be able to say to the world, "here is a type of machine we all can make that will reliably defend everyone against anyone who attempts to maximize something terrible". Anything less than a switchover to a cooperative dynamic as a result of reliable omnidirectional mutual defense seems like a near-guaranteed failure, due to the incentives of the global interaction/conflict/trade network. You can't just say oh, hooray, we solved some technical problem about doing what the boss wants. The boss wants to manipulate customers, and will themselves be a target of the system they're asking to build, just like Sundar Pichai has to use self-discipline to avoid being addicted by the YouTube recommender, same as anyone else.
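A minimal sketch of the pattern I mean (my own toy model in Python, not any real recommender's architecture): an online learner whose only reward signal is engagement. The "outrage" item is assumed to get more clicks, and since user wellbeing never appears in the objective, the learner reliably converges to serving it.

```python
import random

random.seed(0)

ITEMS = ["calm_item", "outrage_item"]
TRUE_CLICK_RATE = {"calm_item": 0.3, "outrage_item": 0.6}  # assumed, for illustration

estimates = {item: 0.0 for item in ITEMS}  # learned engagement estimates
counts = {item: 0 for item in ITEMS}

def recommend(epsilon: float = 0.1) -> str:
    if random.random() < epsilon:                    # occasionally explore
        return random.choice(ITEMS)
    return max(ITEMS, key=lambda i: estimates[i])    # otherwise exploit

for _ in range(10_000):  # online updates (real systems batch these nightly)
    item = recommend()
    reward = 1.0 if random.random() < TRUE_CLICK_RATE[item] else 0.0
    counts[item] += 1
    estimates[item] += (reward - estimates[item]) / counts[item]  # running mean

shares = {i: counts[i] / sum(counts.values()) for i in ITEMS}
print(shares)   # outrage_item ends up dominating recommendations
```

Swapping a language model in for the scoring function doesn't change the shape of this loop, which is the "we can align language models all we want" point.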
[edit: pinned to profile]
"Hard" problem
That seems to rely on answering the "hard problem of consciousness" (or as I prefer, "problem of first-person something-rather-than-nothing") with an answer like, "the integrated awareness is what gets instantiated by metaphysics".
That seems weird as heck to me. It makes more sense for the first-person-something-rather-than-nothing question to be answered by "the individual perspectives of causal nodes (interacting particles' wavefunctions, or whatever else interacts in spatially local ways) in the universe's equations are what gets Instantiated™ As Real® by metaphysics".
(By metaphysics here I just mean ~that-which-can-exist, or ~the-root-node-of-all-possibility; e.g., this is the thing Solomonoff induction tries to model by assuming the root-node-of-all-possibility contains only halting programs, or Tegmark 4 tries to model as some mumble mumble blurrier version of Solomonoff or something (I don't quite grok Tegmark 4). I mean the root node of the entire multiverse of all things which existed at "the beginning", the most origin-y origin - the thing where, when we're surprised there's something rather than nothing, we're surprised that it isn't just an empty set.)
If we assume my way of resolving this philosophical confusion is correct, then we cannot construct a description of a hypothetical universe that could have been among those truly instantiated as physically real in the multiverse, and yet also has this property where the hard-problem "first-person-something-rather-than-nothing" can disappear over some timesteps but not others. Instead, everything humans appear to have death-related preferences about becomes about the so-called easy problem: the question of why the many first-person-something-rather-than-nothings of the particles of our brain are able to sustain an integrated awareness. Perhaps that integrated awareness comes and goes, e.g. with sleep! It seems to me to be what all interesting research on consciousness is about. But I think that either a new first-person-something-rather-than-nothing sense of Consciousness is allocated to all the particles of the whole universe in every infinitesimal time slice that the universe in question's true laws permit, or first-person-something-rather-than-nothing is conserved over time. So I don't worry too much about losing the hard-problem consciousness, as I generally believe it's just "being made of physical stuff which Actually Exists in a privileged sense".
The thing is, this answer to the hard problem of consciousness has kind of weird results relating to eating food. Because it means eating food is a form of uploading! You transfer your chemical processes to a new chunk of matter, and a previous chunk of matter is discarded as waste product. That waste product was previously part of you, and if every particle has a discrete first-person-something-rather-than-nothing which is conserved, then when you eat food you are "waking up" previously sleeping matter, and the waste matter goes to sleep, forgetting nearly everything about you into thermal noise!
"Easy" problem
So there's still an interesting problem to resolve - and in fact what I've said resolves almost nothing; it only answers camp #2, providing what I hope is an argument for why they should become primarily interested in camp #1. In camp #1 terms, we can ask information-theoretic or causal questions about whether those first-person-perspective-units, i.e. particles, are arranged so that they are "aware of" or "know" things about their environment; we can ask causal questions - e.g., "is my red your red?" can instead become "assume my red is your red if there is no experiment which can distinguish them; can we find such an experiment?" - in which case, I don't worry about losing even the camp #1 form of selfhood-consciousness from sleep, because my brain is overwhelmingly unchanged by sleep, and stopping activations and whole-brain synchronization of state doesn't mean it can't be restarted.
It's still possible that every point in spacetime has a separate first-person-something-rather-than-nothing-"consciousness"/"existence", in which case maybe even causally identical shapes of particles/physical stuff in my brain - my neurons representing "a perception of red in the center of my visual field in the past 100ms" - are different qualia at different infinitesimal timesteps, or different qualia than the exact same shape of particles occurring in your brain. But it seems even less possible to get traction on that metaphysical question than on the question of the origin of first-person-something-rather-than-nothing, and since I don't know of any great answers to something-rather-than-nothing, I figure we probably won't ever be able to know. (Also, our neurons for red are, in fact, slightly different. I expect the practical difference is small.)
But that either doesn't resolve or at least partially backs OP's point, about timesteps/timeslices already potentially being different selves in some strong sense, due to ~lack of causal access across time, or so. Since the thing I'm proposing also says non-interacting particles in equilibrium have an inactive-yet-still-real first-person-something-rather-than-nothing, then even rocks or whatever you're on top of right now or your keyboard keys carry the bare-fact-of-existence, and so my preference for not dying can't be about the particles making me up continuing to exist - they cannot be destroyed, thanks to conservation laws of the universe, only rearranged - and my preference is instead about the integrated awareness of all of these particles, where they are shaped and moving in patterns which are working together in a synchronized, evolution-refined, self-regenerating dance we call "being alive". And so it's perfectly true that the matter that makes me up can implement any preference about what successor shapes are valid.
Unwantable preferences?
On the other hand, to disagree with OP a bit, I think there's more objective truth to the matter about what humans prefer than that. Evolution should create very robust preferences for some kinds of thing, such as having some sort of successor state which is still able to maintain autopoiesis. I think it's so highly evolutionarily unfit to not want that, that it's almost unwantable for an evolved being to not want there to be some informationally related autopoietic patterns continuing in the future.
E.g., consider suicide: even suicidal people would be horrified by the idea that all humans would die if they died. I suspect that suicide is an (incredibly high cost, please avoid it if at all possible!) adaptation that has been preserved because there are very rare cases where it can increase the inclusive fitness of the group the organism arose from (but I generally believe it is almost never the best strategy, so if anyone reads this who is thinking about it, please be aware I think it's a terribly high-cost way to solve whatever problem makes it come to mind, and there are almost certainly tractable better options - poke me if a nerd like me can ever give useful input). I bring it up because it means that, while you can maybe consider the rest of humanity or life on earth to be not sufficiently "you" in an information-theory sense that dying suddenly becomes fine, it seems to me to be at least one important reason that suicide is ever acceptable to anyone at all; if they knew they were the last organism, I feel like even a maximally suicidal person would want to stick it out for as long as possible, because if all other life forms are dead they'd want to preserve the last gasp of the legacy of life? idk.
But yeah, the only constraints on what you want are what physics permits matter to encode and what you already want. You probably can't just decide to want any old thing, because you already want something different than that. Other than that objection, I think I basically agree with OP.