Warmup: The Expert
If you haven’t seen “The Expert” before, I recommend it as a warmup for this post:
The Client: “We need you to draw seven red lines, all strictly perpendicular. Some with green ink, some with transparent. Can you do that?”
(... a minute of The Expert trying to explain that, no, he cannot do that, nor can anyone else…)
The Client: “So in principle, this is possible.”
This. This is what it looks like in practice, by default, when someone tries to outsource some cognitive labor which they could not themselves perform. At best, The Expert is well-intentioned and knows what the user needs, ignores the incoherent parts of The Client’s babbling, and does the right thing. Or, they manage to add some silly but ultimately harmless bells and whistles to satisfy whatever dumb thing The Client is looking for.
At worst… well, there’s more than one failure mode which could qualify for the title of "worst". Maybe The Expert gives The Client something which looks right to The Client and successfully conceals all the problems with it; presumably that’s a lucrative strategy for Experts. Maybe the Double Illusion of Transparency kicks in, both parties think they’ve successfully communicated, but in fact neither has any idea what’s going on in the other’s head. Maybe a well-intentioned Expert decides to ignore The Client’s incoherent babbling and do the thing which seems most likely to be right, but gets The Client’s preferences wrong.
One way or another, The Client’s ignorance is a major bottleneck to cognitive outsourcing. In practice, I expect The Client’s ignorance to be the primary bottleneck to cognitive outsourcing.
The core reason why we cannot just outsource alignment research to an AI is because we would then be The Client, and probably a very ignorant one.
Application to Alignment Schemes
There’s a lot of different flavors of “have the AI solve alignment for us”. A sampling:
- Just prompt a language model to generate alignment research
- Do some fine-tuning/RLHF on the language model to make it generate alignment research
- Let the language model talk to other instances of itself, and prompt or fine-tune them together so they generate alignment research jointly
- Set up a language model to generate alignment proposals and another to poke holes in them, and fine-tune the pair via a human judging the “debate”
- …
As we go down the list, the proposals get fancier and add more bells and whistles, trying to make the AI a better expert. Sadly, none of them at all address what I expect to be the actual main bottleneck: The Client (i.e. the human user or users) has no understanding of what they need, what questions to ask, what’s possible or even logically coherent, etc.
What would this kind of error look like in practice?
Here’s one concrete example of the kind of failures I’d expect when a would-be outsourcer’s understanding falls short (from here):
Somebody literally types “If we take the action you just proposed, will we be happy with the outcomes?” into a GPT prompt. Obviously that does not result in the AI giving its actual best-guess answers to the questions, but in this case it doesn't result in the AI thinking about how to deceive humans either. It just thinks about what text would follow that question if it appeared on the internet somewhere. And then I imagine someone with a bunch of interpretability tools saying "yup, it's just thinking about what text typically follows this question", and then that person's boss is like "great, it's not trying to deceive us, guess we can trust the answer", and they both just haven't really thought of the fact that the AI's response-text does not have anything in particular to do with whether the AI is aligned or whether they'll be happy with the outcome or whatever.
(It's essentially the same mistake as a GOFAI person looking at a node in some causal graph labeled "will_kill_humans", and seeing that node set to 99% False, and thinking that somehow implies the GOFAI will not kill humans.)
This is an Illusion of Transparency failure mode: The Client (humans) thinks they know what The Expert (GPT) is saying/thinking/doing, but in fact has no clue.
To be clear, I’d expect this particular mistake to be obvious to at least, like, 30% of the people who want to outsource alignment-solving to AI. Only the people who really do not understand what’s going on would make this particular mistake. (Or, of course, people who do understand but are working in a large organization and don’t notice that nobody else is checking for the obvious failure modes.) But in general, I expect more subtle versions of this kind of failure mode to be the default outcome when someone attempts to outsource lots of cognition to an AI in an area the outsourcer understands very poorly.
As they say: error between chair and keyboard.
Also, It’s Worse Than That
In fact “The Expert” video is too optimistic. The video opens with a well-intentioned Expert, who really does understand the domain and tries to communicate the problems, already sitting there in the room. In practice, I expect that someone as clueless as The Client is at least as likely to hire someone as clueless as themselves, or someone non-clueless but happy to brush problems under the rug, as they are to hire an actual Expert.
General principle: some amount of expertise is required to distinguish actual experts from idiots, charlatans, confident clueless people, etc. As Paul Graham puts it:
The problem is, if you're not a hacker, you can't tell who the good hackers are. A similar problem explains why American cars are so ugly. I call it the design paradox. You might think that you could make your products beautiful just by hiring a great designer to design them. But if you yourself don't have good taste, how are you going to recognize a good designer? By definition you can't tell from his portfolio. And you can't go by the awards he's won or the jobs he's had, because in design, as in most fields, those tend to be driven by fashion and schmoozing, with actual ability a distant third. There's no way around it: you can't manage a process intended to produce beautiful things without knowing what beautiful is. American cars are ugly because American car companies are run by people with bad taste.
Now, in the case of outsourcing to AI, the problem is not “who to hire” but rather “which AI behavior/personality to train/prompt”. In the simulators frame, it’s a question of who or what to simulate. For the same reasons that a non-expert can’t reliably hire actual experts, a non-expert won’t be able to reliably prompt a simulacrum of an expert, because they can’t distinguish an expert simulacrum from a non-expert simulacrum.
Or, in the context of RLHF: a non-expert won’t be able to reliably reinforce expert thinking/writing on alignment, because they can’t reliably distinguish expert thinking/writing from non-expert.
Cocnretely, consider our earlier example:
Somebody literally types “If we take the action you just proposed, will we be happy with the outcomes?” into a GPT-3 prompt. Obviously that does not result in the AI giving its actual best-guess answers to the questions, but in this case it doesn't result in the AI thinking about how to deceive humans either.
This is a failure to “hire” the right behavior/personality within the AI. Alas, the user fails to even realize that they have “hired” neither an honest actual expert nor an actively deceptive expert; they have “hired” something entirely different. (Reminder: I expect actual failures to be more subtle than this one.)
... Oh, And Worse Than That Too
Note that, in all of our prototypical examples above, The Client doesn't just fail to outsource. They fail to recognize that they've failed. (This is not necessarily the case when non-expert Clients fail to outsource cognitive labor, but it sure is correlated.)
That means the problem is inherently unsolvable by iteration. "See what goes wrong and fix it" auto-fails if The Client cannot tell that anything is wrong. If The Client doesn't even know there's a failure, then they have nothing on which to iterate. We're solidly in "worlds in which iterative design fails" territory.
Solutions
Partial Solution: Better User Interfaces
One cached response to “error between chair and keyboard” is “sounds like your user interface needs to communicate what’s going on better”.
There's a historical parable about an airplane (I think the B-52 originally?) where the levers for the flaps and landing gear were identical and right next to each other. Pilots kept coming in to land, and accidentally retracting the landing gear. The point of the story is that this is a design problem with the plane more than a mistake on the pilots' part; the problem was fixed by putting a little rubber wheel on the landing gear lever. If we put two identical levers right next to each other, it's basically inevitable that mistakes will be made; that's bad interface design.
In practice, an awful lot of supposed “errors between chair and keyboard” can be fixed with better UI design. Especially among relatively low-hanging fruit. For instance, consider our running hypothetical scenario:
Somebody literally types “If we take the action you just proposed, will we be happy with the outcomes?” into a GPT-3 prompt. Obviously that does not result in the AI giving its actual best-guess answers to the questions, but in this case it doesn't result in the AI thinking about how to deceive humans either. It just thinks about what text would follow that question if it appeared on the internet somewhere. And then I imagine someone with a bunch of interpretability tools saying "yup, it's just thinking about what text typically follows this question", and then that person's boss is like "great, it's not trying to deceive us, guess we can trust the answer", and they both just haven't really thought of the fact that the AI's response-text does not have anything in particular to do with whether the AI is aligned or whether they'll be happy with the outcome or whatever.
What UI features would make that mistake less probable? Well, a less chat-like interface would be a good start, something which does not make it feel intuitively like we’re talking to a human and provide the affordance to anthropomorphize the system constantly. Maybe something that emphasizes that the AI’s words don’t necessarily correspond to reality, like displaying before every response “The mysterious pile of tensors says:” or “The net’s output, when prompted with the preceding text, is:” or something along those lines. (Not that those are very good ideas, just things off the top of my head.)
So there’s probably room for a fair bit of value in UI design.
That said, there are two major limitations on how much value we can add via the “better UI” path.
First, the more minor problem: in more complex domains, there are sometimes wide inferential distances - places where someone needs to understand a concept requires first understanding a bunch of intermediate concepts, and there just isn’t a good way around that. UI improvement can go a long way, but mostly only when inferential distances are short.
Second, the main problem: whoever’s designing the UI must themselves be an actual expert in the domain, or working closely with an expert. Otherwise, they don’t know what mistakes their UI needs to avoid, or what kinds of thoughts their UI needs to provide affordances for. (It’s the same main problem with building tools for alignment research more generally.) In the above example, for instance, the UI designer needs to already know that somebody interpreting the AI’s output as having anything to do with reality is a failure mode they need to watch out for. (And reminder: I expect actual failure modes to be more subtle than the hypothetical, so to handle realistic analogues of this problem the UI designer needs more expertise than this particular hypothetical failure story requires.)
So we’re back to the core problem: can’t robustly usefully outsource until we already have expertise. Except now we need expertise in alignment and UI.
The Best Solution: A Client With At Least Some Understanding
The obvious best solution would be for “The Client” (i.e. human user/users of the AI system) to have at least some background understanding of what they need, what questions to ask, what’s possible or even logically coherent, etc.
In other words: the best solution is for The Client to also be an expert.
Importantly, I expect this solution to typically “degrade well”: a Client with somewhat more (but still incomplete) background knowledge/understanding is usually quantitatively better than a Client with less background knowledge/understanding. (Of course there are situations where someone “knows just enough to shoot themselves in the foot”, but I expect that to usually be a problem of overconfidence more than a problem of the knowledge itself.)
Returning to our running example:
… and then I imagine someone with a bunch of interpretability tools saying "yup, it's just thinking about what text typically follows this question", and then that person's boss is like "great, it's not trying to deceive us, guess we can trust the answer".
If that boss had thought much about alignment failure modes beyond just deception, I’d expect them to be quantitatively more likely to notice the error. Not that the chance of missing the error would be low enough to really be comforting, but it would be quantitatively lower.
The usefulness of quantitatively more/better (but still incomplete) understanding matters because, in practice, it is unlikely that any human will be a real proper Expert in alignment very soon. But insofar as the human user’s understanding of the domain is the main bottleneck to robustly useful cognitive outsourcing, and even partial improvements are a big deal, improving our own understanding is likely to be the highest-value way to improve our chances of successful cognitive outsourcing.
Summary and Advice
Key idea: “The Client’s” own understanding is a key bottleneck to cognitive outsourcing in practice; I expect it is the main bottleneck to outsourcing cognitive work which The Client could not perform themselves. And I expect it to be the main bottleneck to successfully outsourcing alignment research to AIs.
The laundry list of strategies to make AIs better experts and avoid deception do little-to-nothing to address this bottleneck.
There’s probably room for better user interfaces to add a lot of value in principle, but the UI designer would either need to be an expert in alignment themselves or be working closely with an expert. Same problem as building tools for alignment research more generally.
Thus, my main advice: if you’re hoping to eventually solve alignment by outsourcing to AI, the best thing to do is to develop more object-level expertise in alignment yourself. That’s the main bottleneck.
Note that the relevant kind of “expertise” here is narrower than what many people would refer to as “alignment expertise”. If you hope to outsource the job of aligning significantly smarter-than-human AI to AI, then you need expertise in aligning significantly smarter-than-human AI, not just the hacky tricks which most people expect to fail as soon as the AI gets reasonably intelligent. You need to have some idea of what questions to ask, what failure modes to look for, etc. You need to focus on things which will generalize. You need to go after the difficult parts, not the easy parts, so you have a better idea of what questions to ask when it comes time for an AI to solve the difficult parts - or, y'know, time for an AI to tell you that drawing seven perpendicular lines in two dimensions isn't even logically coherent.
… and of course you will probably not end up with that good an idea of how to align significantly smarter-than-human AI. We have no significantly smarter-than-human systems on which to test, and by the time we do it will probably be too late, so your understanding will likely be limited. But, as when e.g. optimizing code, partial progress on the bottleneck is better than basically-any progress on non-bottlenecks.
Indeed!
It's a necessary but not sufficient condition.
Summary:
The regularities are expressed in terms of score-functions (that score argument-step-networks)[1]. We can score these score-functions based on simplicity/brevity, and restrict what they can do (make it so that they have to be written within human-defined confines).
I posit that we probably can do this in such a way that the most high-scoring score-functions that it’s possible to make (that meet requirements relating to wiggle-room and “reach”[2]) do so by specifying general regularities for when humans are/aren’t led astray (as opposed to ones that favor specific conclusions).
It should (I posit) be feasible to make it so that score-functions that are designed with the intention of fooling us about specific things earn significantly less points (if they pass the requirements at all).
If we allow score-functions to be constructed freely, then this makes it easy for AIs to output score-functions that (practically speaking) allow for pretty much any internally consistent set of conclusions. Indeed, the AI could simply have the score-function be itself, and choose to only give a high score to argument-step-trees with conclusions it likes.
So the question becomes: How do we restrict the “space” of possible score-functions that can be suggested, and add requirements that the score-functions must pass, so that the only way to get a high score (for AIs that suggest score-functions) is to make score-functions that (in some sense) separate out “good” human evaluations?
Here are examples of some of the types of possible requirements for score-functions:
Low maximum source code length (no space for score-function to be intelligent or biased in a detailed way)
With a sufficiently low maximum source code length, there is not enough space for the score-function itself to be intelligent, or for hardcoding bias for many specific conclusions.
Work would have to be done elsewhere (e.g. predictions of human output, which are obtained from system-calls to external functions).
Other optimization-targets relating to simplicity
Requirements relating to maximum use of computational resources, how “general”[3] the code must be, etc.
Low expressivity (humans define contours of what the score-function should do)
As humans we may not be able to make a score-function that would force good output, but we may be able to outline what such a function might look like, e.g. using “function-builders'' (think regular expressions, but easier to work with for humans, and more expressive).
This way, no humans look at the functions proposed by the AIs, and humans themselves don’t have to write functions, but AI-written source code must stay within human-defined contours.
Enforced modularity (nowhere should score-function know what network argues for/against)
Individual argument-steps should (insofar as possible) be evaluated independently from the network as a whole. And macro-score for the network should be calculated without granular info about nodes. No part of the function should know what the network argues for/against. This could be enforced by the human-defined “contours” of the score-functions and/or by giving points based on heuristics that relate to modularity[4].
Limited number of different meta-questions the score-function can reference
Systems could predict what humans are likely to answer to questions about argument-steps (“Does the argument-step in Box A fit the description given in Box B?”, etc). And score-functions could be allowed to label argument-steps as “not allowed” based on answer-predictions for meta-questions such as these.
We could limit the amount of different questions that are allowed. For example, the score-function could have 40 questions to choose from when evaluating any given argument-step, but no more.
I posit that if the AI only is allowed the minimum amount of questions[5], then probably it will need to restrict itself to general questions that separate out “bad” argumentation/assumptions (that are likely to lead humans astray).
Why? Because it needs to use general questions to reduce wiggle-room. Questions that are aimed at making specific conclusions turn out wrong (or preventing certain specific correct conclusions) will probably not be efficient in helping with the general, across-the-board reduction in wiggle-room. Hence, these specific questions will not replace the general ones (that are needed to avoid wiggle-room). And since they don’t replace other questions that are needed, having to include them increases the minimal amount of questions that are needed[6].
Having humans pre-define lists of argument-steps that help define minimum standard ("I know it when I see it")
We can specify argument-steps that seem top quality to us, and have it so that the score-function only can allow argument-steps that seem at least as rigorous/clear/etc as those ones (according to predictions for what humans would answer when asked to compare different argument-steps)[7].
Having humans pre-define lists of meta-questions that help define minimum standard
Whether a meta-question is allowed by a score-function could depend on how “suspicious” that meta-question seems. How plausible does it seem that this question prevents humans from being led astray in a general way (as opposed to somehow being targeted at affecting specific conclusions)?
We could pre-define meta-questions that seem good and non-suspicious to us, and only allow meta-questions that don’t seem more suspicious than those[8]. Or more strictly, we could choose to only allow score-functions to reference meta-questions from the pre-defined list.
“Abstracting” away specifics of argument-network content before score-functions sees it (to make it even harder for any part of the score-function to "know what's being discussed")
Functions (external to the score-function) can process network content, in ways such that score-functions still can evaluate argumentation robustness, but where no AI can tell what the nodes argue for/against (based on output from functions that "abstract" the content).
An “abstraction” can, in this context, be defined as a function that takes some data-structure as input, and returns some other data-structure (typically with less info) as output.
Scoring “abstractions” would be a process in itself. We would want “abstractions” that remove as much info as possible, and are as “general” as possible, but don’t make it impossible to get anything done at all.
Not accepting score-functions if there are simplification that chance conclusion for only some low fraction of conclusions
The way I imagine things, most tasks done by AIs are done in zero-sum competitions with other AIs[9].
When AIs compete to make high-scoring score-functions, it may be possible for one AI to show that a score-function from the other AI could be simplified, in ways such that only a few conclusions (in the set of statements where high-scoring argument-step-networks are possible) are changed (proportionally speaking)[10].
The description above is imprecise/crude. How we might make this kind of thing work (without the need for human review, etc) would be a discussion by itself.
The list above is not at all intended to be comprehensive. Hopefully it gives an impression of the kinds of restrictions I have in mind.
1: Wiggle-room of score-functions can be explored[11]. If a score-function is revealed to have wiggle-room, then the regularities weren’t stated “accurately”[12].
How strong that indication is depends on our initial assumptions regarding P(high-scoring arguments can be trusted | score-function has low wiggle-room).
2: This leaves several questions, one of which is: If we specified requirements/desiderata that seemed to us that they should result in a very high P(high-scoring arguments can be trusted | score-function has low wiggle-room), how do we check if we were right?
Answer: The job of AIs that propose score-functions is to propose score-functions that pass our requirements and get a high score. They may be tasked to find examples of score-functions (that both pass the requirements + get a high score) that are contradictory in terms of the conclusions that it’s possible to argue for with high-scoring argument-networks.
What we do here is, in a sense, to explore wiggle-room “one level higher up”.
Suppose that A is some set/space of possible score-functions, and we think that for all of the score-functions in A, P(high-scoring arguments can be trusted | score-function has low wiggle-room) is very high. If so, then for any claim P1 (that we are interested in) we could task AIs to:
If any AI is able to do this, then this indicates that our assumptions were wrong[13].
If we presume that some restriction/restrictions makes it disproportionately harder for score-functions to have low wiggle-room for the “wrong” reasons, then that can (sort of) be tested. Or at least we may be able to get very strong indications for whether this is the case.
This process of exploring wiggle-room “one level higher up” can be taken further. Maybe that intuitively seems like "bells and whistles". But to me it seems like an important part of the answer. And I'm explaining it less well than I'd want to in order to make things shorter and less convoluted[14].
3: AIs could be tasked to try making high-scoring argument-step-networks with conclusions/predictions that can be proven wrong in a safe/cheap way.
For example: Constructing functions from within the argument-network, and making wrong conclusions/predictions relating to output from those functions.
If the score-function is not "watertight", this kind of thing would offer lots of opportunitities to achieve unambiguous contradictions.
Keep in mind how “interconnected” truth-claims are. And keep in mind that predictions of human evaluations would relate to individual argument-steps (that are evaluated in a fairly “modular” way, independently of the network as a whole, and often/typically with no way to guess what the conclusion of the network is, or what questions the conclusion answers).
Imagine "proof-like" reasoning relating to the behavior of code, where each argument-step relating to the code is focused on one little thing. The kinds of regularities needed to ensure that the scoring of such argumentation is correct - well, it would need to be quite "watertight" (in order to not be exposed by AIs that are "incentivized" to do so).
And while being so "watertight" it needs (1) allow for certain fairly specific errors and (2) prevent certain specific correct conclusions that correspond inversely to those errors.
Kind of like making a bowl be very watertight in general, but making the bowl always leak water in certain very specific conditions (even those conditions cannot be determined locally by the different parts of the bowl that is to allow the water through). And having the design of this bowl be simpler (not just the same level of complexity) as the simplest way to make a bowl that doesn't leak water in these highly specific conditions (but otherwise behaves the same)[15].
The analogy above is maybe not perfect, but not so bad either IMO (if we accept certain presumptions, e.g. that the AIs that explore wiggle-room do so competently and without bias!).
We can also test how good score-functions are in terms of e.g. predictions of experimental results (where the predictions are the conclusions of high-scoring networks). Not all experiments are safe to test, but we can ask questions relating to ones that are (and ask in regards to already-existing experimental data that the AIs don't have access to).
The AIs that make argument-step-networks would be scored based on their ability to make high-scoring argument-step-networks (and would not be penalized if it turns out that the conclusions are wrong in ways that we quickly/safely can check). AIs that make score-functions would be penalized for this if/when accuracy against such predictions has been made to be one of the score-function requirements (for the request in question).
To put it somewhat simplistically (things absorb less easily with more caveats and nuance):
In order to fool us about specific things, a score-function would need to not make it infeasible to lead us astray about anything[16] except some very specific things (that all are outside of the realm of things we can test safely/cheaply/quickly). It has to do that within the score-function restrictions that are designed with the purpose of making that hard.
And even though it is presumably harder to fail in these very specific ways (while getting things right most of the time), it has to be (1) feasible to do that and (2) infeasible to not do that.
That is to say: The human-defined score-function-restrictions[17] have to somehow make it infeasible to lead us astray in regards to most conclusions, but at the same time make it infeasible to not lead us astray in regards to certain specific conclusions. If not, the score-function-restrictions would have wiggle-room, and various AIs would be leaving points on the table by not pointing this out[18].
So the human-defined restrictions mustn’t just not work. All/most of them must fail in the same very specific and improbable way.
And if we have humans define things at a "higher level" (which I'd recommend), then this mustn't just be the case for a specific set of restrictions we have defined for score-functions (but very strong patterns observed within human-defined "spaces" of possible ways that score-function restrictions can be combined).
Some postulations/assumptions that I don’t discuss here, but could get into if requested (I’m pretty much a jukebox when it comes to this):
One dilemma:
We want to test/verify the accuracy of systems that predict human evaluations / answers to questions. But often we want these systems to predict human evaluations of AGI-generated content. So if we want to test the accuracy of these systems against the full range of outputs that we use, that would (seemingly) require us to expose real humans to AGI-generated content.
I would say that maybe around half of my thinking, if not more, relates to this kind of dilemma ("How might we evaluate the accuracy of functions outside of the range of inputs where we are willing+able to test output, without having humans try to understand the source code, or even look at it?").
I have also thought considerably about how we might obtain functions that handle certain types of requests, without ever intentionally leaving points on the table for specific requests (e.g. “deliberately” failing to notice specific cases where wiggle room can be exposed).
The situation I often imagine is having obtained superintelligent AGI-systems that can be imagined as trying to seem aligned/helpful/capable (while not necessarily being aligned/helpful)[19]. And then I try to ponder: Given that hypothetical situation, what steps might we take from there (if we want to obtain robustly aligned superintelligences while minimizing risk along the way)?
Thanks for the feedback (and thanks again for engaging). You may well be right about that. For sure I have plenty of room for improvement whereas communication is concerned.
That being said, a suspicion of mine is that it can be easy to significantly underestimate the inferential distance that I have to cover (even if usually correct heuristics make that seem unlikely). I am already leaving out concepts that I myself think of as “core” concepts, so as to reduce length.
Or at least that's one way to think of it.
By “reach” I mean: What is the range of questions where, for at least one answer to that question, there is at least one answer that can be argued for by a high-scoring argument-step-network?
A maximally “non-general” function would have 1 if-sentence per possible input.
Possible heuristics:
- “% chance that random change to code chances output for a low but non-zero fraction of inputs”
- “If there are possible simplifications to the code that would affect a low but non-zero fraction of inputs”
One example of such a heuristic:
“% of source code that consists of subroutines where no AI is able to make accurate guesses (based on the subroutine input) for what the argument-network argues for/against.
Values for heuristics could be determined by running zero-sum AI vs AI games (including games where the AIs make bets on what the outcomes of other AI vs AI games would be).
The minimum amount necessary for achieving low wiggle-room (while still having it be possible to construct high-scoring argument-step-networks with useful conclusions).
Similar reasoning to this also applies to maximal source code length. It’s probably hard to minimize source code as much as possible while trying to make it so that conclusions turn out wrong from certain specific questions (but otherwise turn out correct). So AIs that try to do that would presumably lose points to competing AIs that don’t (presuming the AIs are equally competent, etc).
As alluded to earlier, score-functions should not have the expressivity or space that is necessary for predicting human answers from within themselves. And they would therefore have to make system-calls to external systems that make such predictions. And there are restrictions in regards to the system-calls that are allowed.
Based on predictions of what reviewers will answer to meta-questions where they are asked to compare other meta-questions 🙃
Or maybe some other process that approximates such a process (meaning that we do gradient descent, or something like that, as if we had AIs compete against each other in zero-sum games).
If so, the AI that made the “non-general” score-function would lose points (and each point lost by one AI is earned by the other).
By other AI-systems, that are selected for based on their ability to find examples of wiggle-room.
As they for at least one statement would allow for both P and ¬P - and at least one of those has to be bad/wrong.
Or maybe it was wrong to claim that SF1 and SF2 have low internal wiggle-room. But some AIs would be “incentivized” to find counter-examples (that prove claims of low wiggle-room wrong). And if counter-examples are found, then the AI that posited low wiggle-room will be penalized (with the penalty being proportional to the level of confidence that was expressed).
Earlier I talked about score-function restrictions, and how we would define score-function restrictions that score-functions would need to adhere to.
The simplest thing to imagine is that we define one set of restrictions that score-functions must adhere to (hence it makes sense to start explaining things that way). But what I really envision is that we “explore landscapes” of higher-level wiggle-room relating to “spaces” of possible ways score-functions can be restricted.
The patterns we see could, in combination with our initial assumptions regarding P(high-scoring arguments can be trusted | score-function has low wiggle-room), provide strong evidence.
If our assumptions regarding regarding P(high-scoring arguments can be trusted | score-function has low wiggle-room) typically are correct (and at least not way off), then this should be possible to see from the "patterns" that are revealed.
I could try to explain this better and more concretely if requested.
If they are the same level of complexity, then there is higher-level wiggle-room (e.g. relative to the score-function requirements / score-function score-function).
Well, not literally anything, but an incomprehensibly large amount of possible statements.
Or score-function score-functions.
Presuming that we have succeeded in getting at least one of the AIs in question to act as if it is trying to maximize points for every individual request that it gets.
While also imagining that these AGI-systems have been successfully contained, are extremely limited in terms of their communication to operators, etc.