(generic comment that may not apply too much to Mayer's work in detail, but that I think is useful for someone to hear:) I agree with the basic logic here. But someone trying to follow this path should keep in mind that there's philosophical thorniness here.
A bit more specifically, the questions one asks about "how intelligence works" will always be at risk of streetlighting. As an example/analogy, think of someone trying to understand how the mind works by analyzing mental activity into "faculties", as in: "So then the object recognition faculty recognizes the sofa and the doorway, and it extracts their shapes, and sends their shapes to the math faculty, which performs a search for rotations that allow the sofa to pass through the doorway, and when it finds one it sends that to the executive faculty, which then directs the motor-planning faculty to make an execution plan, and that plan is sent to the motor faculty...". This person may or may not be making genuine progress on something; but either way, if they are trying to answer questions like "which faculties are there and how do they interoperate to perform real-world tasks", they're missing a huge swath of key questions. (E.g.: "how does the sofa concept get produced in the first place? how does the desire to not damage the sofa and the door direct the motor planner? where do those desires come from, and how do they express themselves in general, and how do they respond to conflict?")
Some answers to "how intelligence works" are very relevant, and some are not very relevant, to answering fundamental questions of alignment, such as what determines the ultimate effects of a mind.
I totally agree with this. I expect the majority of early AI researchers were falling into this trap. The main problem I am focusing on is how a mind can construct a model of the world in the first place.
I didn’t read it very carefully but how would you respond to the dilemma:
I’m guessing you’re in the second bullet but I’m not sure how you’re thinking about this alignment concern.
If you had a system with “ENTITY 92852384 implies ENTITY 8593483”, it would be a lot of progress, as currently in neural networks we don't even understand the internal structures.
I want to have an algorithm that creates a world model. The world is large. A world model is uninterpretable by default through its sheer size, even if you had interpretable but low-level labels. By default we don't get any interpretable labels. I think there are ways to have generic data-processing procedures, which don't talk about the human mind at all, that would yield a more interpretable world model. Similar to how you could probably specify some very general property of Python programs such that a program with that property becomes easier for humans to understand. E.g. a formalism of what it means for the control flow to be straightforward: don't use goto in C.
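To gesture at the kind of purely structural property I have in mind, here is a toy sketch (the functions are made up for illustration): two behaviorally identical Python functions, where the second satisfies a simple constraint ("one loop, no control-flow flags") that says nothing about the human mind, yet is much easier for a human to follow.

```python
def first_negative_convoluted(xs):
    # Behaviorally fine, but structured with an index, a flag, and a break.
    i, found, result = 0, False, None
    while True:
        if i >= len(xs):
            break
        if not found and xs[i] < 0:
            result = xs[i]
            found = True
        i += 1
    return result

def first_negative_straightforward(xs):
    # Same behavior, but satisfies the structural constraint.
    for x in xs:
        if x < 0:
            return x
    return None

assert first_negative_convoluted([3, -1, 4, -2]) == first_negative_straightforward([3, -1, 4, -2]) == -1
```

A world-modeling procedure that only emits structures satisfying properties like the second style would already be a step toward interpretability, without ever referencing human concepts.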
But even if you didn't have this, understanding the system still allows you to understand what the structure of the knowledge would be. It seems plausible that one could, simply by understanding the system very well, make it such that the learned data structures need to take particular shapes, such that these shapes correspond to some relevant alignment properties.
In any case, it seems that this is a problem that any possible way to build an intelligence runs into? So I don't think it is a case against the project. When building an AI with NNs you might not even consider that the internal representations might be weird and alien (even for an LLM trained on human text)[1], but the same problem persists.
I haven't looked into this, or thought about it at all, though that's what I expect.
See Deep Learning Systems Are Not Less Interpretable Than Logic/Probability/Etc, including my comment on it. If your approach would lead to a world-model that is an uninterpretable inscrutable mess, and LLM research would lead to a world-model that is an even more uninterpretable, even more inscrutable mess, then I don’t think this is a reason to push forward on your approach, without a good alignment plan.
Yes, it’s a pro tanto reason to prefer your approach, other things equal. But it’s a very minor reason. And other things are not equal. On the contrary, there are a bunch of important considerations plausibly pushing in the opposite direction:
In any case, it seems that this is a problem that any possible way to build an intelligence runs into? So I don't think it is a case against the project.
If it’s a problem for any possible approach to building AGI, then it’s an argument against pursuing any kind of AGI capabilities research! Yes! It means we should focus first on solving that problem, and only do AGI capabilities research when and if we succeed. And that’s what I believe. Right?
It seems plausible that one could, simply by understanding the system very well, make it such that the learned data structures need to take particular shapes, such that these shapes correspond to some relevant alignment properties.
I don’t think this is plausible. I think alignment properties are pretty unrelated to the low-level structure out of which a world-model is built. For example, the difference between “advising a human” versus “manipulating a human”, and the difference between “finding a great out-of-the-box solution” versus “reward hacking”, are both extremely important for alignment. But you won’t get insight into those distinctions, or how to ensure them in an AGI, by thinking about whether world-model stuff is stored as connections on graphs versus induction heads or whatever.
Anyway, if your suggestion is true, I claim you can (and should) figure that out without doing AGI capabilities research. Here’s an example. Assume that the learned data structure is a Bayes net, or some generalization of a Bayes net, or the OpenCog “AtomSpace”, or whatever. OK, now spend as long as you like thinking about what if anything that has to do with “alignment properties”. My guess is “very little”. Or if you come up with anything, you can share it. That’s not advancing capabilities, because people already know that there is such a thing as Bayes nets / OpenCog / whatever.
Alternatively, another concrete thing that you can chew on is: brain-like AGI. :) We already know a lot about how it works without needing to do any new capabilities research. For example, you might start with Plan for mediocre alignment of brain-like [model-based RL] AGI and think about how to make that approach better / less bad.
John's post is quite weird, because it only says true things, yet implicitly suggests a conclusion, namely that NNs are not less interpretable than some other thing, which is totally wrong.
Example: A neural network implements modular arithmetic with Fourier transforms. If you implement that Fourier algorithm in Python, it's harder to understand for a human than the obvious modular arithmetic implementation in Python.
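To make that concrete, here is a rough sketch (using circular convolution of one-hot vectors as one way to do "modular addition in Fourier space"; this is an illustration, not the exact algorithm any particular network learns):

```python
import numpy as np

def add_mod_p_obvious(a, b, p):
    # The implementation a human would naturally write and read.
    return (a + b) % p

def add_mod_p_fourier(a, b, p):
    # Encode a and b as one-hot vectors of length p; addition mod p is then
    # a circular convolution, i.e. pointwise multiplication in Fourier space.
    x = np.zeros(p); x[a] = 1.0
    y = np.zeros(p); y[b] = 1.0
    z = np.fft.ifft(np.fft.fft(x) * np.fft.fft(y)).real
    return int(np.argmax(z))  # index of the (approximately) one-hot result

assert add_mod_p_obvious(45, 98, 113) == add_mod_p_fourier(45, 98, 113)
```

Both functions compute the same thing, but you'd have to stare at the second one (and know some Fourier analysis) to figure out that it is "just" modular addition.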
It doesn't matter if the world model is inscrutable when looking directly at it, if you can change the generating code such that certain properties must hold. Figuring out what these properties are is, of course, not directly solved by understanding intelligence.
This is bad because, if AGI is very compute-efficient, then when we have AGI at all, we will have AGI that a great many actors around the world will be able to program and run, and that makes governance very much harder.
Totally agree, so obviously try super hard to not leak the working AGI code if you had it.
But you won’t get insight into those distinctions, or how to ensure them in an AGI, by thinking about whether world-model stuff is stored as connections on graphs versus induction heads or whatever.
No, you can. E.g. I could theoretically define a general algorithm that identifies the minimum concepts necessary for solving a task, if I know enough about the structure of the system, specifically how concepts are stored. That's of course not perfect, but it would seem that for very many problems it would make the AI unable to think about things like human manipulation, or about the fact that it is a constrained AI, even if that knowledge was somewhere in a learned black-box world model. This is just an example of something you can do by knowing the structure of a system.
If your system is some plain code with for loops, just reduce the number of iterations that the for loops of search processes do. Now decreasing/increasing the iterations somewhat will correspond to making the system dumber/smarter. Again, this obviously doesn't solve the problem completely, but it is clearly a powerful thing to be able to do.
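As a toy sketch of that kind of knob (a made-up example, not a real AI system): if the search is an explicit loop, the iteration count is a parameter whose meaning you understand directly.

```python
import random

def local_search(score, candidates, iterations):
    # A plain search loop: `iterations` is an explicit, legible knob on how
    # hard the system optimizes.
    best = random.choice(candidates)
    for _ in range(iterations):
        candidate = random.choice(candidates)
        if score(candidate) > score(best):
            best = candidate
    return best

# Fewer iterations -> weaker optimization, in a way we can reason about directly.
weak   = local_search(lambda x: -(x - 7) ** 2, range(1000), iterations=10)
strong = local_search(lambda x: -(x - 7) ** 2, range(1000), iterations=10_000)
```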
Of course many low-level details do not matter. Often you'd only care that something is a sequence, or a set. I am talking about higher-level program structure.
It feels like you are somewhat missing the point. The goal is to understand how intelligence works. Clearly that would be very useful for alignment? Even if you got a black-box world model. But of course it would also enable you to think about how to make such a world model more interpretable. I think that is possible, it's just not what I am focusing on now.
John's post is quite weird, because it only says true things, yet implicitly suggests a conclusion, namely that NNs are not less interpretable than some other thing, which is totally wrong.
Example: A neural network implements modular arithmetic with Fourier transforms. If you implement that Fourier algorithm in Python, it's harder to understand for a human than the obvious modular arithmetic implementation in Python.
Again see my comment. If an LLM does Task X with a trillion unlabeled parameters and (some other thing) does the same Task X with “only” a billion unlabeled parameters, then both are inscrutable.
Your example of modular arithmetic is not a central example of what we should expect to happen, because “modular arithmetic in python” has zero unlabeled parameters. Realistically, an AGI won’t be able to accomplish any real-world task at all with zero unlabeled parameters.
I propose that a more realistic example would be “classifying images via a ConvNet with 100,000,000 weights” versus “classifying images via 5,000,000 lines of Python code involving 1,000,000 nonsense variable names”. The latter is obviously less inscrutable on the margin but it’s not a huge difference.
The goal is to understand how intelligence works. Clearly that would be very useful for alignment?
If “very useful for alignment” means “very useful for doing technical alignment research”, then yes, clearly.
If “very useful for alignment” means “increases our odds of winding up with aligned AGI”, then no, I don’t think it’s true, let alone “clearly” true.
If you don’t understand how something can simultaneously both be very useful for doing technical alignment research and decrease our odds of winding up with aligned AGI, here’s a very simple example. Suppose I posted the source code for misaligned ASI on github tomorrow. “Clearly that would be very useful” for doing technical alignment research, right? Who could disagree with that? It would open up all sorts of research avenues. But it would also obviously doom us all.
For more on this topic, see my post “Endgame safety” for AGI.
E.g. I could theoretically define a general algorithm that identifies the minimum concepts necessary for solving a task, if I know enough about the structure of the system, specifically how concepts are stored. That's of course not perfect, but it would seem that for very many problems it would make the AI unable to think about things like human manipulation, or about the fact that it is a constrained AI, even if that knowledge was somewhere in a learned black-box world model.
There’s a very basic problem that instrumental convergence is convergent because it’s actually useful. If you look at the world and try to figure out the best way to design a better solar cell, that best way involves manipulating humans (to get more resources to run more experiments etc.).
Humans are part of the environment. If an algorithm can look at a street and learn that there’s such a thing as cars, the very same algorithm will learn that there’s such a thing as humans. And if an algorithm can autonomously figure out how an engine works, the very same algorithm can autonomously figure out human psychology.
You could remove humans from the training data, but that leads to its own problems, and anyway, you don’t need to “understand intelligence” to recognize that as a possibility (e.g. here’s a link to some prior discussion of that).
Or you could try to “find” humans and human manipulation in the world-model, but then we have interpretability challenges.
Or you could assume that “humans” were manually put into the world-model as a separate module, but then we have the problem that world-models need to be learned from unlabeled data for practical reasons, and humans could also show up in the other modules.
Anyway, it’s fine to brainstorm on things like this, but I claim that you can do that brainstorming perfectly well by assuming that the world model is a Bayes net (or use OpenCog AtomSpace, or Soar, or whatever), or even just talk about it generically.
If your system is some plain code with for loops, just reduce the number of iterations that the for loops of search processes do. Now decreasing/increasing the iterations somewhat will correspond to making the system dumber/smarter. Again, this obviously doesn't solve the problem completely, but it is clearly a powerful thing to be able to do.
I’m 100% confident that, whatever AGI winds up looking like, “we could just make it dumber” will be on the table as an option. We can give it less time to find a solution to a problem, and then the solution it finds (if any) will be worse. We can give it less information to go on. Etc.
You don’t have to “understand intelligence” to recognize that we’ll have options like that. It’s obvious. That fact doesn’t come up very often in conversation because it’s not all that useful for getting to Safe and Beneficial AGI.
Again, if you assume the world model is a Bayes net (or use OpenCog AtomSpace, or Soar), I think you can do all the alignment thinking and brainstorming that you want to do, without doing new capabilities research. And I think you’d be more likely (well, less unlikely) to succeed anyway.
The goal is to have a system where, ideally, there are no unlabeled parameters. That would be the world modeling system. It would then build a world model that would have many unlabeled parameters. By understanding the world modeler system you can ensure that the world model has certain properties. E.g. there is some property (which I don't know) of how to make the world model not contain dangerous minds.
E.g. imagine the AI is really good at world modeling, and now it models you (you are part of the world) so accurately that you are basically copied into the AI. Now you might try to escape the AI, which would actually be really good, because then you could save the world as a speed intelligence (assuming the model of you were really accurate, which it probably wouldn't be). But if it models another mind (maybe it considers dangerous adversaries), then maybe that mind could also escape, and it would not be aligned.
By understanding the system you could put constraints on what world models can be generated, such that no generated world model can contain such dangerous minds, or at least make such minds much less likely.
I propose that a more realistic example would be “classifying images via a ConvNet with 100,000,000 weights” versus “classifying images via 5,000,000 lines of Python code involving 1,000,000 nonsense variable names”. The latter is obviously less inscrutable on the margin but it’s not a huge difference.
Python code is a discrete structure. You can do proofs on it more easily than on a NN. You could try to apply program transformations to it that preserve functional equality, trying to optimize for some measure of "human-understandable structure". There are image classification algorithms, iirc, that are worse than NNs but much more interpretable, and those algorithms would be at most hundreds of lines of code I guess (I haven't really looked a lot at them).
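A toy sketch of what such a transformation could look like (the functions and the check are made up for illustration): mechanically rename and restructure the code, and verify on random inputs that behavior is preserved (for discrete programs one could also attempt an actual proof of equivalence).

```python
import random

def qx_9f(a7, k2):
    # "Inscrutable" version: nonsense names, manual accumulation.
    t = 0
    for z9 in a7:
        if z9 % k2 == 0:
            t = t + z9
    return t

def sum_of_multiples(numbers, divisor):
    # Transformed version, optimized for human-understandable structure.
    return sum(n for n in numbers if n % divisor == 0)

# Empirical check that the transformation preserved functional equality.
for _ in range(1000):
    xs = [random.randint(-100, 100) for _ in range(random.randint(0, 20))]
    d = random.randint(1, 10)
    assert qx_9f(xs, d) == sum_of_multiples(xs, d)
```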
Anyway, it’s fine to brainstorm on things like this, but I claim that you can do that brainstorming perfectly well by assuming that the world model is a Bayes net (or use OpenCog AtomSpace, or Soar, or whatever), or even just talk about it generically.
You give examples of recognizing problems. I tried to give examples of how you can solve these problems. I'm not brainstorming on "how could this system fail". Instead I understand something, and then I just notice, without really trying, that now I can do a thing that seems very useful, like making the system not think about human psychology given certain constraints.
Probably I completely failed at making clear why I think that, because my explanation was terrible. In any case I think your suggested brainstorming is completely different from the thing that I am actually doing.
To me it just seems that limiting the depth of a tree search is better than limiting the compute of a black-box neural network. It seems like you can get a much better grip on what it means to limit the depth, and what this implies about the system's behavior, when you actually understand how tree search works. Of course tree search here is only an example.
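For illustration, a minimal depth-limited tree search (a sketch; `actions`, `transition`, and `evaluate` are placeholders): the `depth` parameter literally bounds how many steps ahead the system considers, which is a claim about its behavior that we can state precisely, unlike a raw compute budget on a black box.

```python
def tree_search(state, depth, actions, transition, evaluate):
    # Exhaustive depth-limited lookahead: `depth` bounds how far ahead
    # the system plans.
    if depth == 0:
        return evaluate(state), None
    best_value, best_action = float("-inf"), None
    for action in actions(state):
        value, _ = tree_search(transition(state, action), depth - 1,
                               actions, transition, evaluate)
        if value > best_value:
            best_value, best_action = value, action
    return best_value, best_action
```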
The goal is to have a system where, ideally, there are no unlabeled parameters. That would be the world modeling system. It would then build a world model that would have many unlabeled parameters.
Yup, this is what we’re used to today:
So for example, in LLM pretraining, the learning algorithm is backprop, the inference algorithm is a forward pass, and the information repository is the weights of a transformer-architecture neural net. There’s nothing inscrutable about backprop, nor about a forward pass. We fully understand what those are doing and how. Backprop calculates the gradient, etc.
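As a minimal sketch of that division of roles (linear regression rather than an LLM, but the same structure):

```python
import numpy as np

# Information repository: the learned parameters (the "learned stuff").
weights = np.zeros(3)

def forward(x):
    # Inference algorithm: fully understood (here, just a dot product).
    return weights @ x

def sgd_step(x, target, lr=0.01):
    # Learning algorithm: fully understood (gradient descent on squared error).
    global weights
    error = forward(x) - target
    weights -= lr * error * x

# We understand `forward` and `sgd_step` completely; what ends up inside
# `weights` after training on lots of data is a separate question.
```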
That’s just one example. There are many other options! The learning algorithm could involve TD learning. The inference algorithm could involve tree search, or MCMC, or whatever. The information repository could involve a learned value function and/or a learned policy and/or a learned Bayes net and/or a learned OpenCog AtomSpace or whatever. But in all cases, those six bullets above are valid.
So anyway, this is already how ML works, and I’m very confident that it will remain true until TAI, for reasons here. And this is a widespread consensus.
By understanding the world modeler system you can ensure that the world model has certain properties. E.g. there is some property (which I don't know) of how to make the world model not contain dangerous minds.
There’s a very obvious failure mode in which: the world-model models the world, and the planner plans, and the value function calculates values, etc. … and at the end of all that, the AI system as a whole hatches and executes a plan to wipe out humanity. The major unsolved problem is: how do we confidently avoid that?
Then separately, there’s a different, weird, exotic type of failure mode, where, for example, there’s a full-fledged AGI agent, one that can do out-of-the-box foresighted planning etc., but this agent is not working within the designed AGI architecture (where the planner plans etc. as above), but rather the whole agent is hiding entirely within the world-model. I think that, in this kind of system, the risk of this exotic failure mode is very low, and can be straightforwardly mitigated to become even lower still. I wrote about it a long time ago at Thoughts on safety in predictive learning.
I really think we should focus first and foremost on the very obvious failure mode, which again is an unsolved problem that is very likely to manifest, and we should put aside the weird exotic failure mode at least until we’ve solved the big obvious one.
When we put aside the exotic failure mode and focus on the main one, then we’re no longer worried about “the world model contains dangerous minds”, but rather we’re worried about “something(s) in the world model has been flagged as desirable, that shouldn’t have been flagged as desirable”. This is a hard problem not only because of the interpretability issue (I think we agree that the contents of the world-model are inscrutable, and I hope we agree that those inscrutable contents will include both good things and bad things), but also because of concept extrapolation / goal misgeneralization (i.e., the AGI needs to have opinions about plans that bring it somewhere out of distribution). It’s great if you want to think about that problem, but you don’t need to “understand intelligence” for that, you can just assume that the world-model is a Bayes net or whatever, and jump right in! (Maybe start here!)
To me it just seems that limiting the depth of a tree search is better than limiting the compute of a black-box neural network. It seems like you can get a much better grip on what it means to limit the depth, and what this implies about the system's behavior, when you actually understand how tree search works. Of course tree search here is only an example.
Right, but the ability to limit the depth of a tree search is basically useless for getting you to safe and beneficial AGI, because you don’t know the depth that allows dangerous plans, nor do you know that dangerous plans won’t actually be simpler (less depth) than intended plans. This is a very general problem. This problem applies equally well to limiting the compute of a black box, limiting the number of steps of MCMC, limiting the amount of (whatever OpenCog AtomSpace does), etc.
[You can also potentially use tree search depth to try to enforce guarantees about myopia, but that doesn’t really work for other reasons.]
Python code is a discrete structure. You can do proofs on it more easily than on a NN. You could try to apply program transformations to it that preserve functional equality, trying to optimize for some measure of "human-understandable structure". There are image classification algorithms, iirc, that are worse than NNs but much more interpretable, and those algorithms would be at most hundreds of lines of code I guess (I haven't really looked a lot at them).
“Hundreds of lines” is certainly wrong, because you can easily recognize tens of thousands of distinct categories of visual objects. Probably hundreds of thousands.
Proofs sound nice, but what do you think you can realistically prove that will help with Safe and Beneficial AGI? You can’t prove things about what AGI will do in the real world, because the real world will not be encoded in your formal proof system. (pace davidad).
“Applying program transformations that optimize for human understandable structure” sounds nice, but only gets you to “inscrutable” from “even more inscrutable”. The visual world is complex. The algorithm can’t be arbitrarily simple, while still capturing that complexity. Cf. “computational irreducibility”.
I'm not brainstorming on "how could this system fail". Instead I understand something, and then I just notice, without really trying, that now I can do a thing that seems very useful, like making the system not think about human psychology given certain constraints.
What I’m trying to do in this whole comment is point you towards various “no-go theorems” that Eliezer probably figured out in 2006 and put onto Arbital somewhere.
Here’s an analogy. It’s appealing to say: “I don’t understand string theory, but if I did, then I would notice some new obvious way to build a perpetual motion machine.”. But no, you won’t. We can rule out perpetual motion machines from very general principles that don’t rely on how string theory works.
By the same token, it’s appealing to say: “I don’t understand intelligence, but if I did, then I would notice some new obvious way to guarantee that an AGI won’t try to manipulate humans.”. But no, you won’t. There are deep difficulties that we know you’re going to run into, based on very general principles that don’t rely on the data format for the world-model etc.
I suggest to think harder about the shape of the solution—getting all the way to Safe & Beneficial AGI. I think you’ll come to realize that figuring out the data format for the world-model etc. is not only dangerous (because it’s AGI capabilities research) but doesn’t even help appreciably with safety anyway.
This seems worryingly like "advance capabilities now, figure out the alignment later." The reverse would be to put the upfront work into alignment desiderata, and decide on a world model later.
Yes, I know every big AI project is also doing the former thing, and doing the latter thing is a much murkier and more philosophy-ridden problem.
To me it seems that understanding how a system that you are building actually works (i.e., having good models of its internals) is the most basic requirement for being able to reason about the system coherently at all.
Yes, even if you actually understood how intelligence works in a deep way, you wouldn't automatically solve alignment. But it sure would make it a lot more tractable in many ways. Especially when only aiming for a pivotal act.
I am pretty sure you can figure out alignment in advance as you suggest. That might be the overall safer route... if we didn't have coordination problems. But it seems slower, and we don't have time.
Obviously, if you figure out the intelligence algorithm before you know how to steer it, don't put it on GitHub or the universe's fate will be sealed momentarily. Ideally don't even run it at all.
So far working on this project seems to have created ontologies in my brain that are good for thinking about alignment. There are a couple of approaches that now seem obvious, which I think wouldn't seem obvious before. Again having good models about intelligence (which is really what this is about) is actually useful for thinking about intelligence. And Alignment research is mainly thinking about intelligence.
The approach many people take of trying to pick some alignment problem seems somewhat backward to me. E.g. embedded agency is a very important problem, and you need to solve it at some point. But it doesn't feel like the kind of problem where working on it builds up the most useful models of intelligence in your brain.
As an imperfect analogy consider trying to understand how a computer works by first understanding how modern DRAM works. To build a practical computer you might need to use DRAM. But in principle, you could build a computer with only S-R latch memory. So clearly while important it is not at the very core. First, you understand how NAND gates work, the ALU, and so on. Once you have a good understanding of the fundamentals, DRAM will be much easier to understand. It becomes obvious how it needs to work at a high level: You can write and read bits. If you don't understand how a computer works you might not even know why storing a bit is an important thing to be able to do.
I am pretty sure you can figure out alignment in advance as you suggest
I'm not so sure about that. How do you figure out how to robustly keep a generally intelligent dynamically updating system on-target without having a solid model of how that system is going to change in response to its environment? Which, in turn, would require a model of what that system is?
I expect the formal definition of "alignment" to be directly dependent on the formal framework of intelligence and embedded agency, the same way a tetrahedron could only be formally defined within the context of Euclidean space.
I'd think you can define a tetrahedron for non-Euclidean space. And you can talk about and reason about a set of polyhedra with 10 vertices as an abstract object without talking about or defining any specific such polyhedra.
Just consider if you take the assumption that the system would not change in arbitrary ways in response to its environment. There might be certain constraints. You can think about what the constraints need to be such that e.g. a self-modifying agent would never change itself such that it would expect to get less utility in the future than if it did not self-modify.
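One rough way to write that constraint down (just a sketch): with current policy π and a candidate self-modification to π′, require E_π[U | adopt π′] ≥ E_π[U | keep π], i.e. by the agent's own current expectation, self-modifying never lowers expected future utility.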
And that is just a random thing that came to mind without me trying. I would expect that you can learn useful things about alignment by thinking about such things. In fact, the line between understanding intelligence and figuring out alignment in advance really doesn't exist, I think. Clearly understanding something about alignment is understanding something about intelligence.
When people say to only figure out the alignment thing, maybe what they mean is to figure out things about intelligence that won't actually get you much closer to being able to build a dangerous intelligence. And there do seem to be such things. It is just that I expect that just trying to work on these will not actually make you generate the most useful models about intelligence in your mind, making you worse/slower at thinking on average per unit of time working.
And that's of course not a law. Probably there are some things that you want to understand through an abstract theoretical lens at certain points in time. Do whatever works best.
I'd think you can define a tetrahedron for non-Euclidean space
If you relax the definition of a tetrahedron to cover figures embedded in non-Euclidean spaces, sure. It wouldn't be the exact same concept, however. In a similar way to how "a number" is different if you define it as a natural number vs. real number.
Perhaps more intuitively, then: the notion of a geometric figure with specific properties is dependent on the notion of a space in which it is embedded. (You can relax it further – e. g., arguably, you can define a "tetrahedron" for any set with a distance function over it – but the general point stands, I think.)
Just consider if you take the assumption that the system would not change in arbitrary ways in response to its environment. There might be certain constraints. You can think about what the constraints need to be such that e.g. a self-modifying agent would never change itself such that it would expect to get less utility in the future than if it did not self-modify.
Yes, but: those constraints are precisely the principles you'd need to code into your AI to give it general-intelligence capabilities. If your notion of alignment only needs to be robust to certain classes of changes, because you've figured out that an efficient generally intelligent system would only change in such-and-such ways, then you've figured out a property of how generally intelligent systems ought to work – and therefore, something about how to implement one.
Speaking abstractly, the "negative image" of the theory of alignment is precisely the theory of generally intelligent embedded agents. A robust alignment scheme would likely be trivial to transform into an AGI recipe.
A robust alignment scheme would likely be trivial to transform into an AGI recipe.
Perhaps if you did have the full solution, but it feels like there are some parts of a solution that you could figure out such that those parts don't tell you as much about the other parts of the solution.
And it also feels like there could be a book such that if you read it you would gain a lot of knowledge about how to align AIs without knowing that much more about how to build one. E.g. a theoretical solution to the stop button problem seems like it would not tell you that much about how to build an AGI, compared to figuring out how to properly learn a world model of Minecraft. And knowing how to build a world model of Minecraft probably helps a lot with solving the stop button problem, but it doesn't just trivially yield a solution.
Perhaps if you did have the full solution, but it feels like there are some parts of a solution that you could figure out such that those parts don't tell you as much about the other parts of the solution.
I agree with that.
Understanding [how to design] rather than 'growing' search/agency-structure would actually equal solving inner alignment, if said structure does not depend on what target[1] it is intended to be given, i.e. is targetable (inner-alignable) rather than target-specific.[2]
Such an understanding would simultaneously qualify as of 'how to code a capable AI', but would be fundamentally different from what labs are doing in an alignment-relevant way. In this framing, labs are selecting for target-specific structures (that we don't understand). (Another difference is that, IIRC, Johannes might intend not to share research on this publicly, but I'm less sure after rereading the quote that gave me that impression[3]).
includes outer alignment goals
If it's not clear what I mean, reading this about my background model might help, also feel free to ask me questions
from one of Johannes' posts:
I don't have such a convincing portfolio for doing research yet. And doing this seems to be much harder. Usually, the evaluation of such a portfolio requires technical expertise - e.g. how would you know if a particular math formalism makes sense if you don't understand the mathematical concepts out of which the formalism is constructed?
Of course, if you have a flashy demo, it's a very different situation. Imagine I had a video of an algorithm that learns Minecraft from scratch within a couple of real-time days, and then gets a diamond in less than 1 hour, without using neural networks (or any other black box optimization). It does not require much technical knowledge to see the significance of that.
But I don't have that algorithm, and if I had it, I would not want to make that publicly known. And I am unsure what is the cutoff value. When would something be bad to publish? All of this complicates things.
(After rereading this I'm not actually sure what that means they'd be okay sharing or if they'd intend to share technical writing that's not a flashy demo)
So I agree it would be an advance, but you could solve inner alignment in the sense of avoiding mesaoptimizers, yet fail to solve inner alignment in the senses of predictable generalization or stability of generalization across in-lifetime learning.
Save the world by understanding intelligence.
Instead of having SGD "grow" intelligence, design the algorithms of intelligence directly to get a system we can reason about. Align this system to a narrow but pivotal task, e.g. upload a human.
The key to intelligence is finding the algorithms that infer world models that enable efficient prediction, planning, and meaningfully combining existing knowledge.
By understanding the algorithms, we can make the system non-self-modifying (algorithms are constant, only the world model changes), making reasoning about the system easier.
Understanding intelligence at the algorithmic level is a very hard technical problem. However, we are pretty sure it is solvable and, if solved, would likely save the world.
Current focus: How to model a world such that we can extract structure from the transitions between states ('grab object'=useful high level action), as well as the structure within particular states ('tree'=useful concept).
I am leading a project on that. Read more here and apply on the AISC website.