CEO at Conjecture.
I don't know how to save the world, but dammit I'm gonna try.
Thanks for the comment!
Have I understood this correctly?
I am most confident in phases 1-3 of this agenda, and I think you have overall a pretty good rephrasing of 1-5, thanks! One note is that I don't think of "LLM calls" as fundamental; I think of LLMs as a stand-in for "banks of patterns" or "piles of shards of cognition." The exact shape of this can vary. LLMs are just our current most common shape of "cognition engine", but I can think of many other, potentially better, shapes this "neural primitive/co-processor" could take.
I think there is some deep, as yet unformalized, concept of computer science that differentiates what are intuitively "cognitive"/"neural" type problems from "classical"/"code" type problems. Why can neural networks easily recognize dogs, while doing it in regular code is hell? How can one predict ahead of time whether a given task can be solved with a given set of programming tools or neural network components? What we would need is some vastly more advanced form of Algorithmic Information Theory that can take as input your programming tools and libraries, plus a description of the problem you are trying to solve, and output how hard it is going to be (or what "engineering complexity class" it would belong to, whatever that means). I think this is a vast, unsolved question of theoretical computer science that I don't expect we will solve any sooner than P vs NP.
So, in absence of such principled understanding, we need to find the "engineering approximation equivalent" to this, which involves using as much code as we can and bounding the neural components as much as we can, and then developing good practical engineering around this paradigm.
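To make that engineering approximation concrete, here is a minimal, purely illustrative sketch (all names here are hypothetical, not from the post): classical code does as much of the work as possible, and the one "neural" step is called through a narrow interface whose outputs are checked and bounded by ordinary code, so the system can never silently drift outside behavior you can ensure.

```python
# Illustrative sketch of the "bound the neural components" pattern.
# neural_classify is a hypothetical stand-in for an opaque cognition
# engine (e.g. an LLM call); here it is a trivial stub so the sketch runs.

def neural_classify(text: str) -> str:
    """Hypothetical opaque neural step; in practice, a model call."""
    return "dog" if "bark" in text.lower() else "unknown"

# The bound: classical code whitelists what the neural step may return.
ALLOWED_LABELS = {"dog", "cat", "unknown"}

def classify(text: str) -> str:
    label = neural_classify(text)      # opaque neural component
    if label not in ALLOWED_LABELS:    # classical code enforces the bound
        raise ValueError(f"unexpected label from neural component: {label}")
    return label
```

The point of the sketch is only the shape: the neural piece is swappable and untrusted, while the surrounding code is the part whose behavior you can actually verify and iterate on.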
Maybe it'd be good to name some speculative tools/theory that you hope to have been developed for shaping CoEms, then say how they would help with some of:
As I see it, there are two main ways in which things look different in the CoEm frame:
First, the hope isn't so much that CoEm "solves" these problems, but makes them irrelevant, because it makes it possible to not slip into the dangerous/unpredictable capabilities regime unexpectedly. If you can't ensure your system won't do something funky, you can simply choose not to build it, and instead decide to build something you can ensure proper behavior of. Then you can iterate, unlike in the current "pump as much LLM/RL juice as possible as fast as possible" paradigm.
In other words, CoEm makes it easier to distinguish between capabilities failures and alignment failures.
Most alignment research skips ahead to trying to resolve issues like these first, at least in principle, and then often backs off to develop a relevant theory. I can see why you might want to do the levers part first and have theory develop along with experience building things. But it's risky to do the hard part last.
Secondly, more speculatively, I expect these problems to dissolve under better engineering and understanding. Here I am trying to point at something like "physicalism" or "gears level models." If you have gears level models, a lot of the questions you might ask in a non-gears-level model stop making sense/being relevant, and you find new, more fundamental questions and tradeoffs.
I think ontologies such as Agents/Goals are artifacts of poor understanding of deeper mechanics. If you can't understand the inner mechanics of cell biology, then maybe psychology is the best you can do to predict a human. But if you can understand cell biology and construct a biological being from scratch, I think you don't need the Agent framing, and it would be actively confusing to insist it is ontologically primitive somehow and must be "addressed" in your final description of the system you are engineering. These kinds of abstract/functionalist/teleological models might be a good source of inspiration for messing around, but this is not the shape that the true questions will have.
"Instrumental convergence" dissolves into questions of predictability, choices of resource allocation and aesthetic/ethical stances on moral patienthood/universal rights. Those problems aren't easy, but they are different and more "fundamental", more part of the territory than of the map.
Similarly, "Reflective stability of goals" is just a special case of predicting what your system does. It's not a fundamental property that AGIs have and other software doesn't.
The whole CoEm family of ideas is pointing in this direction, encouraging the uncovering of more fundamental, practical, grounded, gears level models, by means of iterative construction. I think we currently do not have good gears level models of lots of the important questions of AI/cognition/alignment, and I think the way to get there is by treating it as a software/physicalist/engineering problem, not presupposing an already higher level agentic/psychological/functionalist framing. (It's like the epistemological equivalent of the AI Effect, but for good, lol.)
I think that picking a hard problem before you know whether that "hard problem" is real or not is exactly what leads to confusions like the "hard problem of consciousness", followed by zero actual progress on problems that matter. I don't actually think we know what the true "hard problems" are to a level of deconfusion that we can just tackle them directly and backchain. Backchaining from a confused or wrong goal is one of the best ways to waste an entire career worth of research.
Not saying it is guaranteed to solve all these problems, or that I am close to having solved all these problems, but this agenda is the type of thing I would do if I wanted to make iterative research progress into that direction.
This is often not true, and I don't think your paradigm makes it true. E.g. often we lose legibility to increase capability, and that is plausibly also true during AGI development in the CoEm paradigm.
It's kinda trivially true in that the point of the agenda is to get to legibility, and if you sacrifice on legibility/constructibility, you are no longer following the paradigm, but I realize that is not an interesting statement. Ultimately, this is a governance problem, not a technical problem. The choice to pursue illegible capabilities is a political one.
Expensive why? Seems like the bottleneck here is theoretical understanding.
Literally compute and man-power. I can't afford the kind of cluster needed to even begin a pretraining research agenda, or to hire a new research team to work on this. I am less bottlenecked on the theoretical side atm, because I need to run into a lot of bottlenecks from actual grounded experiments first.
Hi habryka, I don't really know how best to respond to such a comment. First, I would like to say thank you for your well-wishes, assuming you did not mean them sarcastically. Maybe I have lost the plot, and if so, I do appreciate help in recovering it. Secondly, I feel confused as to why you would say such things in general.
Just last month, my coauthors and I released a 100+ page explanation/treatise on AI extinction risk that gives a detailed account of where AGI risk comes from and how it works, which was received warmly by LW and the general public alike, and which continues to be updated and actively publicised.
In parallel, our sister org ControlAI, a non-profit policy advocacy org focused solely on extinction risk prevention that I work with frequently, has published A Narrow Path, a similarly extensive writeup on principles of regulation to address xrisk from ASI, which ControlAI and I have pushed and discussed extensively with policymakers of multiple countries, and there are other regulation-promoting projects ongoing.
I have been on CNN, BBC, Fox News and other major news sources warning in no uncertain terms about the risks. There are literally dozens of hours of podcast material, including from just last month, where I explain in excruciating depth the existential risk posed by AGI systems, where it comes from, and how it differs from other forms of AI risk. If you think all my previous material has "lost the plot", then well, I guess in your eyes I never had it, and there is not much I can do.
This post is a technical agenda that is not framed in the usual LW ideological ontology, and has not been optimized to appeal to that audience, but rather to identify an angle that is tractable and generalizes the problem without losing its core, and leads to solutions that address the hard core, which is Complexity. In the limit, if we had beautifully simple, legible designs for ASIs that we fully understand and can predict, technical xrisk (but not governance) would be effectively solved. If you disagree with this, I would have greatly enjoyed your engagement with what object level points you think are wrong, and it may have helped me write a better roadmap.
But it seems to me that you have not even tried to engage with the content of this post at all, and have instead merely asserted it is a "random rant against AI-generated art" and "name-calling." I see no effort other than surface level pattern matching, or any curiosity to how it might fit with my previous writings and thinking that have been shared and discussed.
Do you truly think that's the best effort at engaging in good faith you can make?
If so, I don't know what I can say that would help. I hope we can both find the plot again, since neither of us seem to see it in the other person.
Morality is multifaceted and multilevel. If you have a naive form of morality that is just "I do whatever I think is the right thing to do", you are not coordinating or being moral, you are just selfish.
Coordination is not inherently always good. You can coordinate with one group to more effectively do evil against another. But scalable Good is always built on coordination. If you want to live in a lawful, stable, scalable, just civilization, you will need to coordinate with your civilization and neighbors and make compromises.
As a citizen of a modern country, you are bound by the social contract. Part of the social contract is "individuals are not allowed to use violence against other individuals, except in certain circumstances like self defense." [1] Now you might argue that this is a bad contract or whatever, but it is the contract we play by (at least in the countries I have lived in), and I think unilaterally reneging on that contract is immoral. Unilaterally saying "I will expose all of my neighbors to risk of death from AGI because I think I'm a good person" is very different from "we all voted and the majority decided building AGI is a risk worth taking."
Now, could it be that you in some exceptional circumstances need to do something immoral to prevent some even greater tragedy? Sure, it can happen. Murder is bad, but self defense can make it on net ok. But just because it's self defense doesn't make murder moral, it just means there was an exception in this case. War is bad, but sometimes countries need to go to war. That doesn't mean war isn't bad.
Civilization is all about commitments, and honoring them. If you can't honor your commitments to your civilization, even when you disagree with them sometimes, you are not civilized and are flagrantly advertising your defection. If everyone does this, we lose civilization.
Morality is actually hard, and scalable morality/civilization is much, much harder. If an outcome you dislike happened because of some kind of consensus, this has moral implications. If someone put up a shitty statue that you hate in the town square because he's an asshole, that's very different morally from "everyone in the village voted, and they like the statue and you don't, so suck it up." If you think "many other people want X and I want not X" has no moral implications whatsoever your "morality" is just selfishness.[2]
Hi, as I was tagged here, I will respond to a few points. There are a bunch of smaller points only hinted at that I won't address. In general, I strongly disagree with the overall conclusion of this post.
There are two main points I would like to address in particular:
There seems to be a deep underlying confusion here that in some sense more information is inherently good, or will inherently result in good things winning out. This is very much the opposite of what I generally claim about memetics. Saying that all information is good is like saying all organic molecules or cells are equally good. No! Adding more biosludge and toxic algal blooms to your rosegarden won't make it better!
Social media is the exact living proof of this. People genuinely thought social media would bring everyone together, resolve conflicts, create a globally unified culture and peace and democracy, that autocracy and bigotry couldn't possibly thrive if only you had enough information. I consider this hypothesis thoroughly invalidated. "Increasing memetic evolutionary pressure" is not a good thing! (all things equal)
Increasing the evolutionary pressure on the flu virus doesn't make the world better, and viruses mutate a lot faster than nice fluffy mammals. Most mutations in fluffy mammals kill them; mutations in viruses help them far more often. Value is fragile. It is asymmetrically easier to destroy than to create.
Raw evolution selects for fitness/reproduction, not Goodness. You are just feeding the Great Replicator.
For an accessible intro to some of this, I recommend the book "Nexus" by Yuval Harari. (not that I endorse everything in that book, but the first half is great)
You talk about theories of change of the form "we safety people will keep everything secret and create an aligned AI, ship it to big labs and save the world before they destroy it (or directly use the AI to stop them)". I don't endorse, and in fact strongly condemn, such theories of change.
But not because of the hiding information part, but because of the "we will not coordinate with others and will use violence unilaterally" part! Such theories of change are fundamentally immoral for the same reasons labs building AGI is immoral. We have a norm in our civilization that we don't as private citizens threaten to harm or greatly upend the lives of our fellow civilians without either their consent or societal/governmental/democratic authority.
The not sharing information part is fine! Not all information is good! For example, Canadian researchers a while back figured out how to reconstruct horsepox, an extinct relative of smallpox, and then published how to do it. Is it a good thing for the world to have that information out there?? I don't think so. Should we open source the blueprints of the F-35 fighter jet? I don't think so; I think it's good that I don't have those blueprints!
Information is not inherently good! Not sharing information that would make the world worse is virtuous. Now, you might be wrong about the effects of sharing the information you have, sure, but denying that there is any tradeoff, or any possibility that sharing might actually, genuinely, be bad, is just ignoring why coordination is hard.
If you ever find yourself thinking something of the shape "we must simply unreservedly increase [conceptually simple variable X], with no tradeoffs", you're wrong. Doesn't matter how clever you think X is, you're wrong. Any real life, not fake complex thing is made of towers upon towers of tradeoffs. If you think there are no tradeoffs in whatever system you are looking at, you don't understand the system.
Memes are not our friends. Conspiracy theories and lies spread faster than complex, nuanced truth. The printing press didn't bring the scientific revolution; it brought the witch burnings and the Thirty Years' War. The scientific revolution came from the Royal Society and its nuanced, patient, complex norms of critical inquiry. Yes, spreading your scientific papers was also important, but it was necessary, not sufficient, for a good outcome.
More mutation/evolution, all things equal, means more cancer, not more health and beauty. Health and beauty can come from cancerous mutation and selection, but it's not a pretty process, and requires a lot of bloody, bloody trial and error (and a good selection function). The kind of inefficient and morally abominable process I would prefer us not relying on.
With that being said, I think it's good that you wrote things down and are thinking about them, so please don't take what I'm saying as some kind of personal disparaging; I wish more people wrote down their ideas and tried to think things through! I think there are indeed a lot of valuable things in this direction, around better norms, tools, processes and memetic growth, but they're just really quite non-trivial! You're on your way to thinking critically about morality, coordination and epistemology, which is great! That's where I think real solutions are!
Nice set of concepts, I might use these in my thinking, thanks!
I don't understand what point you are trying to make, to be honest. There are certain problems that humans/I care about that we/I want NNs to solve, and some optimizers (e.g. Adam) solve those problems better or more tractably than others (e.g. SGD or second order methods). You can claim that the "set of problems humans care about" is "arbitrary", to which I would reply "sure?"
Similarly, I want "good" "philosophy" to be "better" at "solving" "problems I care about." If you want to use other words for this, my answer is again "sure?" I think this is a good use of the word "philosophy" that gets better at what people actually want out of it, but I'm not gonna die on this hill because of an abstract semantic disagreement.
"good" always refers to idiosyncratic opinions, I don't really take moral realism particularly seriously. I think there is "good" philosophy in the same way there are "good" optimization algorithms for neural networks, while also I assume there is no one optimizer that "solves" all neural network problems.
I strongly disagree and do not think that will be how AGI will look, AGI isn't magic. But this is a crux and I might be wrong of course.
I can't rehash my full views on coordination and policy here, I'm afraid, but in general, I believe we are currently on a double-exponential timeline (though I wouldn't model it quite like you do, the conclusions are similar enough), and I think some simple-to-understand and straightforwardly implementable policy (in particular, compute caps) will at least move us to a single-exponential timeline.
I'm not sure we can get policy that can stop the single exponential (which is software improvements), but there are some ways, and at least we will then have additional time to work on compounding solutions.
Thanks for the comment! I agree that we live in a highly suboptimal world, and I do not think we are going to make it, but it's worth taking our best shot.
I don't think of the CoEm agenda as "doing AGI right." (for one, it is not even an agenda for building AGI/ASI, but of bounding ourselves below that) Doing AGI right would involve solving problems like P vs PSPACE, developing vastly more deep understanding of Algorithmic Information Theory and more advanced formal verification of programs. If I had infinite budget and 200 years, the plan would look very different, and I would feel very secure in humanity's future.
Alas, I consider CoEm an instance of a wider class of possible alignment plans that I consider the "bare minimum for Science to work." I generally think any plans more optimistic than this require some other external force of things going well, which might be empirical facts about reality (LLMs are just nice because of some deep pattern in physics) or metaphysics (there is an actual benevolent creator god intervening specifically to make things go well, or Anthropic Selection is afoot). Many of the "this is what we will get, so we have to do this" type arguments just feel like cope to me, rather than first principles thinking of "if my goal is a safe AI system, what is the best plan I can come up with that actually outputs safe AI at the end?", reactive vs constructive planning. Of course, in the real world, it's tradeoffs all the way down, and I know this. You can read some of my thoughts about why I think alignment is hard and current plans are not on track here.
I don't consider this agenda to be maximally principled or aesthetically pleasing, quite the opposite: it feels like a grubby engineering compromise that meets the minimum requirement to actually do science in a non-insane way. There are of course various even more compromising positions, but I think those simply don't work in the real world. I think the functionalist/teleological/agent-based frameworks currently being applied to alignment work on LW are just too confused to ever really work in the real world, the same way I think the models of alchemy can never actually get you to a safe nuclear reactor; you need to at least invent calculus (or hell, at least better metallurgy!) and do actual empiricism and stuff.
As for pausing and governance, I think governance is another mandatory ingredient to a good outcome, most of the work there I am involved with happens through ControlAI and their plan "A Narrow Path". I am under no illusion that these political questions are easy to solve, but I do believe they are possible and necessary to solve, and I have a lot of illegible inside info and experience here that doesn't fit into a LW comment. If there is no mechanism by which reckless actors are prevented from killing everyone else by building doomsday machines, we die. All the technical alignment research in the world is irrelevant to this point. (And "pivotal acts" are an immoral pipedream)