But humans seem to have some way of (sometimes) noticing out-of-distribution inputs, and can feel confused instead of just confidently using their existing training to respond to them.
I think what you're describing can be approximated by a Bayesian agent having a wide prior, and feeling "confused" when some new piece of evidence makes its posterior more diffuse. Evolutionarily it makes sense to have that feeling, because it tells the agent to do more exploration and less exploitation.
For example, if you flip a coin 1000 times and always get heads, your posterior is very concentrated around "the coin always comes up heads". But if it then comes up tails once, your posterior becomes more diffuse, you feel confused, and you change your betting behavior until you can learn more.
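Here is a minimal sketch of that, assuming a uniform Beta(1, 1) prior over the coin's bias and treating posterior variance and differential entropy as rough proxies for how "diffuse" the posterior is:

```python
# Minimal sketch of the coin example, assuming a uniform Beta(1, 1) prior.
# Posterior variance and differential entropy stand in for "diffuseness".
from scipy.stats import beta

after_1000_heads = beta(1001, 1)    # posterior after 1000 heads, 0 tails
after_one_tail = beta(1001, 2)      # posterior after the surprising tail

print(after_1000_heads.var())       # ~1.0e-06: tightly concentrated near p = 1
print(after_one_tail.var())         # ~2.0e-06: variance roughly doubles
print(after_1000_heads.entropy())   # differential entropy also rises after the
print(after_one_tail.entropy())     # single disconfirming flip, i.e. more diffuse
```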
I think it is driven by a general heuristic of finding compressibility. If a distribution seems complex, we assume we're accidentally conflating two variables and seek the decomposition that makes the two resultant distributions approximable by simpler functions.
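As a toy sketch of that heuristic, assuming the data really were generated by two conflated Gaussian variables and using BIC as a rough stand-in for description length:

```python
# Toy sketch: data secretly produced by two conflated variables "compresses"
# better (lower BIC, a rough proxy for description length) when decomposed
# into two simple components than when modeled as one complex-looking blob.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-3, 1, 500),   # "variable one"
                       rng.normal(4, 1, 500)])   # "variable two", conflated
data = data.reshape(-1, 1)

one_component = GaussianMixture(n_components=1, random_state=0).fit(data)
two_components = GaussianMixture(n_components=2, random_state=0).fit(data)

print(one_component.bic(data), two_components.bic(data))
# The two-component decomposition wins by a wide margin.
```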
I guess it feels like I don't know how we could know that we're in the position that we've "solved" meta-philosophy. It feels like the thing we could do is build a set of better and better models of philosophy and check their results against held-out human reasoning and against each other.
I also don't think we know how to specify a ground truth reasoning process that we could try to protect and run forever which we could be completely confident would come up with the right outcome (where something like HCH is a good candidate but potentially with bugs/subtleties that need to be worked out).
I feel like I have some (not well justified and possibly motivated) optimism that this process yields something good fairly early on. We could gain confidence that we are in this world if we build a bunch of better and better models of meta-philosophy and observe at some point the models continue agreeing with each other as we improve them, and that they agree with various instantiations of protected human reasoning that we run. If we are in this world, the thing we need to do is just spend some time building a variety of these kinds of models and produce an action that looks good to most of them. (Where agreement is not "comes up with the same answer" but more like "comes up with an answer that other models think is okay and not disastrous to accept").
Do you think this would lead to "good outcomes"? Do you think some version of this approach could be satisfactory for solving the problems in Two Neglected Problems in Human-AI Safety?
Do you think there's a different kind of thing that we would need to do to "solve metaphilosophy"? Or do you think that working on "solving metaphilosophy" roughly caches out as "work on coming up with better and better models of philosophy in the model I've described here"?
I guess it feels like I don’t know how we could know that we’re in the position that we’ve “solved” meta-philosophy.
What I imagine is reaching a level of understanding of what we’re really doing (or what we should be doing) when we “do philosophy”, on par with our current understanding of what “doing math” or “doing science” consist of, or ideally a better level of understanding than that. (See Apparent Unformalizability of “Actual” Induction for one issue with our current understanding of “doing science”.)
I also don’t think we know how to specify a ground truth reasoning process that we could try to protect and run forever which we could be completely confident would come up with the right outcome (where something like HCH is a good candidate but potentially with bugs/subtleties that need to be worked out).
Here I’m imagining something like putting a group of the best AI researchers, philosophers, etc. in some safe and productive environment (which includes figuring out the right rules of social interactions), where they can choose to delegate further to other reasoning processes, but don’t face any time pressure to do so. Obviously I don’t know how to specify this in terms of having all the details worked out, but that does not seem like a hugely difficult problem to solve, so I wonder what you mean/imply by “don’t think we know how”?
It feels like the thing we could do is build a set of better and better models of philosophy and check their results against held-out human reasoning and against each other.
If that’s all we do, it seems like it would be pretty easy to miss some error in the models, because we didn’t know that we should test for it. For example there could be entire classes of philosophical problems that the models will fail on, which we won’t know because we won’t have realized yet that those classes of problems even exist.
Do you think this would lead to “good outcomes”? Do you think some version of this approach could be satisfactory for solving the problems in Two Neglected Problems in Human-AI Safety?
It could, but it seems much riskier than either of the approaches I described above.
Do you think there’s a different kind of thing that we would need to do to “solve metaphilosophy”? Or do you think that working on “solving metaphilosophy” roughly caches out as “work on coming up with better and better models of philosophy in the model I’ve described here”?
Hopefully I answered these sufficiently above. Let me know if there’s anything I can clear up further.
All else equal, I prefer an AI which is not capable of philosophy, as I am afraid of the completely alien conclusions it could come to (e.g. that insects are more important than humans).
Moreover, I am skeptical that going to the meta-level simplifies the problem to the level that it will be solvable by humans (the same goes for meta-ethics and the theory of human values). For example, if someone said that he is not able to understand math, but instead will work on meta-mathematical problems, we would be skeptical about his ability to contribute. Why would the meta-level be simpler?
Moreover, I am skeptical that going to the meta-level simplifies the problem to the level that it will be solvable by humans (the same goes for meta-ethics and the theory of human values).
This is also my reason for being pessimistic about solving metaphilosophy before a good number of object-level philosophical problems have been solved (e.g. in decision theory, ontology/metaphysics, and epistemology). If we imagine being in a state where we believe running computation X would solve hard philosophical problem Y, then it would seem that we already have a great deal of philosophical knowledge about Y, or a more general class of problems that includes Y.
More generally, we could look at the historical difficulty of solving a problem vs. the difficulty of automating it. For example: the difficulty of walking vs. the difficulty of programming a robot to walk; the difficulty of adding numbers vs. the difficulty of specifying an addition algorithm; the difficulty of discovering electricity vs. the difficulty of solving philosophy of science to the point where it's clear how a reasoner could have discovered (and been confident in) electricity; and so on.
The plausible story I have that looks most optimistic for metaphilosophy looks something like:
I think our positions on this are pretty close, but I may put a bit more weight on other "plausible stories" for solving metaphilosophy relative to your "plausible story". (I'm not sure if overall I'm more or less optimistic than you are.)
If we imagine being in a state where we believe running computation X would solve hard philosophical problem Y, then it would seem that we already have a great deal of philosophical knowledge about Y, or a more general class of problems that includes Y.
It seems quite possible that understanding the general class of problems that includes Y is easier than understanding Y itself, and that allows us to find a computation X that would solve Y without much understanding of Y itself. As an analogy, suppose Y is some complex decision problem that we have little understanding of, and X is an AI that is programmed with a good decision theory.
More generally, we could look at the historical difficulty of solving a problem vs. the difficulty of automating it. For example: the difficulty of walking vs. the difficulty of programming a robot to walk;
This does not seem like a very strong argument for your position. My suggestion in the OP is that humans already know the equivalent of "walking" (i.e., doing philosophy), we're just doing it very slowly. Given this, your analogies don't seem very conclusive about the difficulty of solving metaphilosophy or whether we have to make a bunch more progress on object-level philosophical problems before we can solve metaphilosophy.
Creating an AI to solve hard philosophical problems is like passing a hot potato from the right hand to the left.
For example, I want to solve the problem of qualia. I can't solve it myself, but maybe I can create a superintelligent AI which will help me solve it? Now I start working on AI, and soon encounter the control problem. Trying to solve the control problem, I would have to specify the nature of human values, and soon I would find the need to say something about the existence and nature of qualia. Now the circle is complete: I have the same problem of qualia, but packed inside the control problem. If I make some assumptions about what qualia should be, they will probably affect the AI's final answer.
However, I could still use some forms of AI to work on the qualia problem: using Google search, I could quickly find all the relevant articles, identify the most cited and the newest ones, and maybe create an argument map. This is where Drexler's CAIS may help.
Maybe one AI philosophy service could look like this: it would ask you a bunch of other questions that are simpler than the problem of qualia, then show you what your answers imply about the problem of qualia under some method of reconciling those answers.
In fact, when I use Google Scholar to find new articles about e.g. qualia, I already use narrow AI to advance my understanding. So AI could be useful in thinking about philosophical problems. What I am afraid of is an AI making decisions based on incomprehensible AI-created philosophy.
Moreover, I am skeptical that going to the meta-level simplifies the problem to the level that it will be solvable by humans
If I gave the impression in this post that I expect metaphilosophy to be solved before someone builds an AGI, that was far from my intentions. I think this is a small-chance-of-high-return kind of situation, plus I think someone has to try to attack the problem if only to generate evidence that it really is a hard problem, otherwise I don't know how to convince people to adopt costly social solutions like stopping technological progress. (And actually I don't expect the evidence to be highly persuasive either, so this amounts to just another small chance of high return.)
What I wrote in an earlier post still describes my overall position:
There is no strong empirical evidence that solving metaphilosophy is superhumanly difficult, simply because not many people have attempted to solve it. But I don’t think that a reasonable prior combined with what evidence we do have (i.e., absence of visible progress or clear hints as to how to proceed) gives much hope for optimism either.
As I said here countless times before, answering questions is not what philosophy is good at. It's good at asking questions, and figuring out how to slice off a small manageable piece of a big question for some other science to work on. Sadly, most philosophers misunderstand what their job is. They absolutely suck at finding answers, even as they excel at debating the questions. The debate is important as it crystallizes how to slice the big question into smaller ones, but it does not provide answers. Sometimes the philosophers themselves are polymath enough to both slice a question and answer it, like Peirce/Russell/Wittgenstein with truth tables. Most of the time a good question is posed, or a non-obvious perspective is highlighted, like Searle's Chinese room argument or Jackson's Mary's room setup (both oft-discussed here), but the proposed solution itself is nowhere close to satisfactory.
Philosophy is NOT a general purpose problem solver, and NOT a meta problem solver; it is a (meta) problem asker and slicer.
I object rather strongly to this categorization. This feels strongly to me like a misunderstanding borne of having only encountered analytic philosophy in rather limited circumstances and having assumed the notion of the "separate magisterium" that the analytic tradition developed as it broke from the rest of Western philosophy.
Many people doing philosophy, myself included, think of it more as the "mother" discipline from which we might specialize into other disciplines once we have the ground well understood enough to cleave off a part of reality for the time being while we work with that small part, so as to avoid constantly facing the complete, overwhelming complexity of facing all of reality at once. What is today philosophy is perhaps tomorrow a more narrow field of study, except it seems in those cases where we touch so closely upon fundamental uncertainty that we cannot hope to create a useful abstraction, like physics or chemistry, to let us manipulate some small part of the world accurately without worrying about the rest of it.
Many people doing philosophy, myself included, think of it more as the "mother" discipline from which we might specialize into other disciplines once we have the ground well understood enough to cleave off a part of reality for the time being while we work with that small part, so as to avoid constantly facing the complete, overwhelming complexity of facing all of reality at once.
That's a great summary, yeah. I don't see any contradiction with what I said.
What is today philosophy is perhaps tomorrow a more narrow field of study, except it seems in those cases where we touch so closely upon fundamental uncertainty that we cannot hope to create a useful abstraction, like physics or chemistry, to let us manipulate some small part of the world accurately without worrying about the rest of it.
You have a way with words :) Yes, specific sciences study small slivers of what we experience, and philosophy ponders the big picture, helping to spawn another sliver to study. Still don't see how it provides answers, just helps crystallize questions.
Yes, specific sciences study small slivers of what we experience, and philosophy ponders the big picture, helping to spawn another sliver to study. Still don't see how it provides answers, just helps crystallize questions.
It sounds like a disagreement about whether "A contains B" means B is an A or B is not an A. That is, whether, say, physics, which is contained within the realm of study we call philosophy (although carefully cordoned off from the rest of it by certain assumptions), is still philosophy, or whether philosophy is only the stuff that hasn't been broken off into a smaller part. To my way of thinking physics is largely philosophy of the material, and so by example we have a case where philosophy provides answers.
I don't see this as anything related to containment. Just interaction. Good philosophy provides a well-defined problem to investigate for a given science, and, once in a blue moon, an outline of methodology, like Popper did. In turn, the scientific investigation in question can give philosophy some new "big" problems to ponder.
Jackson’s Mary’s room setup
Never understood why it is considered good - isn't it just a confusion between "being in a state" and "knowing about a state"? In the same way, there is a difference between knowing everything about axes and there being an axe in your head.
Physicalists sometimes respond to Mary's Room by saying that one cannot expect Mary to actually instantiate Red herself just by looking at a brain scan. It seems obvious to them that a physical description of a brain state won't convey what that state is like, because it doesn't put you into that state. As an argument for physicalism, the strategy is to accept that qualia exist, but argue that they present no unexpected behaviour, or other difficulties, for physicalism.
That is correct as stated but somewhat misleading: the problem is why it is necessary, in the case of experience, and only in the case of experience, to instantiate it in order to fully understand it. Obviously, it is true that a description of a brain state won't put you into that brain state. But that doesn't show that there is nothing unusual about qualia. The problem is that in no other case does it seem necessary to instantiate a brain state in order to understand something.
If another version of Mary were shut up to learn everything about, say, nuclear fusion, the question "would she actually know about nuclear fusion?" could only be answered "yes, of course... didn't you just say she knows everything?". The idea that she would have to instantiate a fusion reaction within her own body in order to understand fusion is quite counterintuitive. Similarly, a description of photosynthesis will not make you photosynthesise, and photosynthesising would not be needed for a complete understanding of photosynthesis.
There seem to be some edge cases: for instance, would an alternative Mary know everything about heart attacks without having one herself? Well, she would know everything except what a heart attack feels like, and what it feels like is a quale. The edge cases, like that one, are just cases where an element of knowledge-by-acquaintance is needed for complete knowledge. Even other mental phenomena don't suffer from this peculiarity. Thoughts and memories are straightforwardly expressible in words, so long as they don't involve qualia.
So: is the response "well, she has never actually instantiated colour vision in her own brain" one that lays to rest the challenge posed by the Knowledge argument, leaving physicalism undisturbed? The fact that these physicalists feel it would be in some way necessary to instantiate colour, but not other things, like photosynthesis or fusion, means they subscribe to the idea that there is something epistemically unique about qualia/experience, even if they resist the idea that qualia are metaphysically unique.
The problem is that in no other case does it seem necessary to instantiate a brain state in order to understand something.
The point is, you either define "to understand" as "to experience", or it is not necessary to see red in order to understand experience. What part of knowledge is missing if Mary can perfectly predict when she will see red? It's just that the ability to invoke qualia from memory is not knowledge merely because it is also in the brain - in the same way that reflexes are not additional knowledge. And even the ability to transfer thoughts with words is just an approximation... I mean, it doesn't solve the Hard Problem by itself (panpsychism does) - but I think bringing knowledge into it doesn't help. Maybe it's intuitive, but it seems to be a very easily disprovable intuition - not the kind of "I am certain that I am conscious".
Most people who ride bikes don't have explicit knowledge about how riding a bike works. They are relying on reflexes to ride a bike.
Would you say that most people who ride bikes don't know how to ride a bike?
Basically, yes, I would like to use different words for different things. And if we don't accept that knowing how to ride a bike and being able to ride a bike are different, then what? A knowledge argument for the unphysical nature of reflexes?
By that reasoning, a native speaker of a language would often have less knowledge of the language than a person who learned it as a foreign language in a formal manner, even when the native speaker speaks it much better for all practical purposes.
When we speak about whether Mary understands Chinese, I think what we care about is to what extent she will be able to use the language the way a speaker of Chinese would.
A lot of expert decision-making is based on "unconscious competence", and you have to be very careful about how you use the term knowledge if you think that "unconscious competence" doesn't qualify as knowledge.
Again, this seems to me like a pretty consistent way to look at things that also more accurately matches reality. Whether we use the words "knowledge" and "ability" or "explicit knowledge" and "knowledge" doesn't matter, of course. And for what it's worth, I'm much less sure of the usefulness of being precise about such terms in practice. But if there is an obvious physical model of this thought experiment, where there are roughly two kinds of things in Mary's brain - one easily influenceable by words, and another not - and this model explains everything without introducing anything unphysical, then I don't see the point of saying "well, if we first group everything knowledge-sounding together, then that grouping doesn't make sense in Mary's situation".
But philosophers are good at proposing answers - they all do that, usually just after identifying a flaw with an existing proposal.
What they're not good at is convincing everyone else that their solution is the right one. (And presumably this is because multiple solutions are plausible. And maybe that's because of the nature of proof - it's impossible to prove something definitively, and disproving typically involves finding a counterexample, which may be hard to find.)
I'm not convinced philosophy is much less good at finding actual answers than say physics. It's not as if physics is completely solved, or even particularly stable. Perhaps its most promising period of stability was specifically the laws of motion & gravity after Newton - though for less than two centuries. Physics seems better than philosophy at forming a temporary consensus; but that's no use (and indeed is counterproductive) unless the solution is actually right.
Cf a rare example of consensus in philosophy: knowledge was 'solved' for 2300 years with the theory that it's a 'true justified belief'. Until Gettier thought of counterexamples.
having AIs derive their terminal goals from simulated humans who live in a safe virtual environment.
There has been some subsequent discussion (expressing concern/doubt) about this at https://www.lesswrong.com/posts/7jSvfeyh8ogu8GcE6/decoupling-deliberation-from-competition?commentId=bSNhJ89XFJxwBoe5e
"The point here is that no matter how we measure complexity, it seems likely that philosophy would have a "high computational complexity class" according to that measure." - I disagree. The task of philosophy is to figure out how to solve the meta problem, not to actually solve all individual problems or the worst individual problem
Re: Philosophy as interminable debate, another way to put the relationship between math and philosophy:
Philosophy as weakly verifiable argumentation
Math is solving problems by looking at the consequences of a small number of axiomatic reasoning steps. For something to be math, we have to be able to ultimately cash out any proof as a series of these reasoning steps. Once something is cashed out in this way, it takes a small constant amount of time to verify each reasoning step, so the whole proof can be verified in polynomial time.
Philosophy is solving problems where we haven't figured out a set of axiomatic reasoning steps. Any non-axiomatic reasoning step we propose could end up having arguments that we hadn't thought of that would lead us to reject that step. And those arguments themselves might be undermined by other arguments, and so on. Each round of debate lets us add another level of counter-arguments. Philosophers can make progress when they have some good predictor of whether arguments are good or not, but they don't have access to certain knowledge of arguments being good.
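As a toy formalization of this picture (assuming a simple rule where an argument is provisionally accepted iff none of its known counterarguments are accepted), each extra level of the argument tree corresponds to another round of debate, and a deeper counter-counterargument can flip the verdict:

```python
# Toy model of weakly verifiable argumentation: an argument stands only if
# every counterargument to it is itself defeated. Adding one more level of
# counter-counterarguments (one more debate round) can change the answer.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Argument:
    claim: str
    counters: List["Argument"] = field(default_factory=list)

def accepted(arg: Argument) -> bool:
    """An argument is accepted iff none of its counterarguments are accepted."""
    return not any(accepted(c) for c in arg.counters)

rebuttal = Argument("B: A rests on a hidden assumption")
root = Argument("A: proposed solution to problem Y", counters=[rebuttal])
print(accepted(root))   # False: B currently defeats A

rebuttal.counters.append(Argument("C: B's hidden-assumption charge misfires"))
print(accepted(root))   # True: one more round of debate flips the verdict
```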
Another difference between mathematics and philosophy is that in mathematics we have a well-defined set of objects and a well-defined problem we are asking about, whereas in philosophy we are trying to ask questions about things that exist in the real world and/or we are asking questions that we haven't crisply defined yet.
When we come up with a set of axioms and a description of a problem, we can move that problem from the realm of philosophy to the realm of mathematics. When we come up with some method we trust of verifying arguments (ie. replicating scientific experiments), we can move problems out of philosophy to other sciences.
It could be the case that philosophy grounds out in some reasonable set of axioms which we don't have access to now for computational reasons - in which case it could all end up in the realm of mathematics. It could be the case that, for all practical purposes, we will never reach this state, so it will remain in the "potentially unbounded DEBATE round case". I'm not sure what it would look like if it could never ground out - one model could be that we have a black box function that performs a probabilistic evaluation of argument strength given counter-arguments, and we go through some process to get the consequences of that, but it never looks like "here is a set of axioms".
Because of the strange loopy nature of concepts/language/self/different problems, metaphilosophy seems unsolvable?
Asking "What is good?" already implies that there are the concepts "good", "what", "being", and that there are answers and questions ... Now we could ask what concepts or questions to use instead ...
Similarly:
> "What are all the things we can do with the things we have and what decision-making process will we use and why use that process if the character of the different processes is the production of different ends; don't we have to know which end is desired in order to choose the decision-making process that also arrives at that result?"
> Which leads back to desire and knowing what you want without needing a system to tell you what you want.
It's all empty in the Buddhist sense. It all depends on which concepts or Turing machines or which physical laws you start with.
Metaphilosophy is about reasoning through logical consequences. It's the basic foundation of causality.
You can read more here: https://www.lesswrong.com/posts/Xnunj6stTMb4SC5Zg/metaphilosophy-a-philosophizing-through-logical-consequences
A powerful AI (or human-AI civilization) guided by wrong philosophical ideas would likely cause astronomical (or beyond astronomical) waste. Solving metaphilosophy is one way that we can hope to avoid this kind of disaster. For my previous thoughts on this topic and further motivation see The Argument from Philosophical Difficulty, Metaphilosophical Mysteries, Three AI Safety Related Ideas, and Two Neglected Problems in Human-AI Safety.
Some interrelated ways of looking at philosophy
Philosophy as answering confusing questions
This was my starting point for thinking about what philosophy is: it's what we do when we try to answer confusing questions, or questions that we don't have any other established methodology for answering. Why do we find some questions confusing, or lack methods for answering them? This leads to my next thought.
Philosophy as ability to generalize / handle distributional shifts
ML systems tend to have a lot of trouble dealing with distributional shifts. (It seems to be a root cause of many AI as well as human safety problems.) But humans seem to have some way of (sometimes) noticing out-of-distribution inputs, and can feel confused instead of just confidently using their existing training to respond to them. This is perhaps most obvious in unfamiliar ethical situations like Torture vs Dust Specks or trying to determine whether our moral circle should include things like insects and RL algorithms. Unlike ML algorithms that extrapolate in an essentially random way when given out-of-distribution inputs, humans can potentially generalize in a principled or correct way, by using philosophical reasoning.
Philosophy as slow but general purpose problem solving
Philosophy may even be a fully general purpose problem solving technique. At least we don't seem to have reason to think that it's not. The problem is that it's painfully slow and resource intensive. Individual humans acting alone seem to have little chance of achieving justifiably high confidence in many philosophical problems even if they devote their entire lives to those problems. Worse, humanity has been collectively trying to solve some philosophical problems for hundreds or even thousands of years, without arriving at final solutions. The slowness of philosophy explains why distributional shifts remain a safety problem for humans, even though we seemingly have a general way of handling them.
Philosophy as meta problem solving
Given that philosophy is extremely slow, it makes sense to use it to solve meta problems (i.e., finding faster ways to handle some class of problems) instead of object level problems. This is exactly what happened historically. Instead of using philosophy to solve individual scientific problems (natural philosophy) we use it to solve science as a methodological problem (philosophy of science). Instead of using philosophy to solve individual math problems, we use it to solve logic and philosophy of math. Instead of using philosophy to solve individual decision problems, we use it to solve decision theory. Instead of using philosophy to solve individual philosophical problems, we can try to use it to solve metaphilosophy.
Philosophy as "high computational complexity class"
If philosophy can solve any problem within a very large class, then it must have a "computational complexity class" that's as high as any given problem within that class. Computational complexity can be measured in various ways, such as time and space complexity (on various actual machines or models of computation), whether and how high a problem is in the polynomial hierarchy, etc. "Computational complexity" of human problems can also be measured in various ways, such as how long it would take to solve a given problem using a specific human, group of humans, or model of human organizations or civilization, or whether and how many rounds of DEBATE would be sufficient to solve that problem either theoretically (given infinite computing power) or in practice.
The point here is that no matter how we measure complexity, it seems likely that philosophy would have a "high computational complexity class" according to that measure.
Philosophy as interminable debate
The visible aspects of philosophy (as currently done by humans) seem to resemble an endless (both in clock time and in the number of rounds) game of debate, where people propose new ideas, arguments, counterarguments, counter-counterarguments, and so on, and at the same time try to judge proposed solutions based on these ideas and arguments. People sometimes complain about the interminable nature of philosophical discussions, but that now seems understandable if philosophy is a "high computational complexity" method of general purpose problem solving.
In a sense, philosophy is the opposite of math: whereas in math any debate can be settled by producing a proof, hence analogous to the complexity class NP (in practice maybe a couple more rounds are needed for people to find and fix flaws in the proof), potentially no fixed number of rounds of debate (or DEBATE) is enough to settle all philosophical problems.
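As a toy illustration of the NP side of the analogy (the CNF formula and assignment below are arbitrary examples), a disputed satisfiability claim can be settled in one round by a certificate that anyone can check quickly; the claim here is that philosophical disputes generally have no analogous short certificate.

```python
# Toy example of settling a mathematical dispute with a short certificate:
# checking a claimed satisfying assignment for a CNF formula takes time linear
# in the formula size, with no further rounds of debate needed.
def check_certificate(clauses, assignment):
    """Each clause is a list of signed variable indices; a positive literal i
    is satisfied when assignment[i] is True, a negative one when it is False."""
    return all(any(assignment[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses)

clauses = [[1, -2], [2, 3]]                                      # (x1 or not x2) and (x2 or x3)
print(check_certificate(clauses, {1: True, 2: True, 3: False}))  # True: debate settled
```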
Philosophy as Jürgen Schmidhuber's General TM
Unlike standard Turing Machines, a General TM or GTM may edit its previous outputs, and can be considered to solve a problem even if it never terminates, as long as it stops editing its output after a finite number of edits and the final output is the correct solution. So if a GTM solves a certain problem, you know that it will eventually converge to the right solution, but you may have no idea when, or whether what's on its output tape at any given moment is the right solution. This seems a lot like philosophy, where people can keep changing their minds (or adjust their credences) based on an endless stream of new ideas, arguments, counterarguments, and so on, and you never really know when you've arrived at a correct answer.
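As a rough sketch of this property (a toy process, not Schmidhuber's actual formalism), the output tape below stabilizes on a final answer, but no finite prefix of its behavior tells you the last edit has already happened:

```python
# Toy GTM-like process: it may overwrite its earlier output, and although it
# eventually stops editing, nothing observable marks the final edit as final.
def revisable_process(new_conclusions):
    """Yield the current contents of the 'output tape' after each step."""
    current_answer = None
    for conclusion in new_conclusions:
        if conclusion is not None:        # a new argument changes our mind
            current_answer = conclusion
        yield current_answer

stream = ["theory-1", None, "theory-2", None, None, "theory-3", None, None]
for step, answer in enumerate(revisable_process(stream)):
    print(step, answer)
# The tape stabilizes at "theory-3", but at no step can you tell from the
# output alone that no further revision is coming.
```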
What to do until we solve metaphilosophy?
Protect the trajectory?
What would you do if you had a GTM that could solve a bunch of really important problems, and that was the only method you had of solving them? You'd probably try to reverse-engineer it and make a bunch of copies. But if you couldn't do that, then you'd want to put layers and layers of protection around it. Applied to philosophy, this line of thought seems to lead to the familiar ideas of using global coordination (or a decisive strategic advantage) to stop technological progress, or having AIs derive their terminal goals from simulated humans who live in a safe virtual environment.
Replicate the trajectory with ML?
Another idea is to try to build a good enough approximation of the GTM by training ML on its observable behavior (including whatever work tapes you have read access to). But there are two problems with this: 1. This is really hard or impossible to do if the GTM has internal state that you can't observe. And 2. If you haven't already reverse engineered the GTM, there's no good way to know that you've built a good enough approximation, i.e., to know that the ML model won't end up converging to answers that are different from the GTM.
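A minimal illustration of problem 1 (a toy example; the step counts are arbitrary): an imitator that matches every observed action can still diverge later, because the behavior depends on internal state that never shows up in the training data.

```python
# The "true" process has hidden state (a counter) that only changes its
# observable behavior after step 100; a clone trained on the first 50 observed
# steps reproduces everything seen so far and still fails later.
def true_process():
    hidden_steps = 0
    while True:
        hidden_steps += 1
        yield "explore" if hidden_steps <= 100 else "exploit"

def cloned_process():
    while True:                 # all observed actions were "explore"
        yield "explore"

truth, clone = true_process(), cloned_process()
pairs = [(next(truth), next(clone)) for _ in range(150)]
print(next(i for i, (a, b) in enumerate(pairs) if a != b))   # 100: silent divergence
```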
A three part model of philosophical reasoning
It may be easier to understand the difficulty of capturing philosophical reasoning with ML by considering a more concrete model. I suggest we can divide it into three parts as follows:
A. Propose new ideas/arguments/counterarguments/etc., according to some (implicit) distribution.
B. Evaluate existing ideas/arguments/counterarguments/etc.
C. Based on past ideas/arguments/counterarguments/etc., update some hidden state that changes how one does A and B.
It's tempting to think that building an approximation of B using ML perhaps isn't too difficult, and then we can just search for the "best" ideas/arguments/counterarguments/etc. using standard optimization algorithms (maybe with some safety precautions like trying to avoid adversarial examples for the learned model). There's some chance this could work out well, but without having a deeper understanding of metaphilosophy, I don't see how we can be confident that throwing out A and C won't lead to disaster, especially in the long run. (For example, if the order in which we think of / encounter arguments is important to the eventual conclusions we reach, then straightforwardly optimizing for the "best" arguments won't reproduce our trajectory. Or suppose C is what will allow us to eventually be able to think about a very wide class of ideas and arguments.) But A and C seem very hard or impossible for ML to capture (A due to paucity of training data, and C due to the unobservable state).
Is there a way around this difficulty? What else can we do in the absence of a full white-box solution to metaphilosophy?