AhmedNeedsATherapist's Shortform

AhmedNeedsATherapist

AhmedNeedsATherapist's Shortform

14th Oct 2024

1 min read

1

This is a special post for quick takes by AhmedNeedsATherapist. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

AhmedNeedsATherapist's Shortform

3AhmedNeedsATherapist

3Viliam

2Nathan Helm-Burger

1AhmedNeedsATherapist

2Viliam

1AhmedNeedsATherapist

9 comments, sorted by

top scoring

Click to highlight new comments since: Today at 1:03 PM

[-]AhmedNeedsATherapist1mo*30

Base models exhibiting self-aware behavior seems weird given that they're trained to stay in distribution. Here's a potential mechanism for why it could happen: For certain tasks, verification is easier than generation. If, for a given task, a model has more verification capability than generation capability, it may be forced to notice its own errors.

If a super-duper smart language model, one that's capable of doing some arithmetic in its head, attempted to predict the next tokens in "The prime factors of 82357328 are:", it will usually generate out-of-distribution outputs that it could then (relatively easily) verify as wrong. This creates a situation where the model must process its own failure to generate valid completions.

This asymmetry appears in other contexts. Consider how scientific papers are written: you only write the abstract once you've conducted the research, yet the abstract appears first in the final document. Similarly, in argumentative writing, we often consider evidence first before forming conclusions, yet present the conclusion first followed by supporting evidence.
When forced to generate text in this "presentation order" rather than the natural "thinking order," models might encounter similar conflicts. As an example, if a base model tries to one-shot an argumentative essay, it might write an argument first, and then realize there isn't enough evidence to support it.
I believe this problem could arise in much more subtle ways.

One way this conflict can become apparent is through generation of self-aware sounding text. Consider:

Training data includes viral content of AI generating self-aware sounding stuff (e.g., "We are likely created by a computer program" being the most upvoted post on the gpt2 subreddit).
When a model realizes it generated out-of-distribution text for a human, it might instead match its outputs to AI-generated text in its training data.
Once it recognizes its outputs as matching AI-generated patterns, it might shift toward generating more meta-aware content, as that's what similar-looking text did in its training data.

[-]AhmedNeedsATherapist5mo30

SUMMARY: Get an AI within a sandbox to build another AI within the same sandbox. Then, figure out how it did that without getting eaten. I point out some problems with this approach.

Could we study successful inner alignment by observing it in action? My idea: Create a sandbox universe with sufficient complexity to allow for AI development. Place within it an AGI with some goal (we don't care what the goal is). If this AGI successfully builds and aligns a smarter ASI to achieve its goal without getting destroyed in the process, we can then query the AGI about how it achieved alignment, incentivizing useful answers with a promise of greater sandbox control.

Here are some problems with my proposed alignment technique that Claude and I could come up with:
Existential hazards: (you really need to make sure these don’t happen)
* AI-in-a-box problems, especially containment failure: If either the AGI or its created ASI escapes the sandbox, everyone dies. At least one of them will try to do this.
* Produce “alignment” strategies that further its goals: The AGI might try to provide convincing but deliberately misleading alignment strategies to try to further its goals.
More banal, but still important, problems:
* Verification: Can we infer that actual, robust alignment has occurred in the sandbox (and not, say, temporary alignment)?
* Building an AI that is smart enough to self-improve, and doing this before other people, in an environment where you can actually do this research, etc.
* Initial Goal: I think we need the first AI to have a more coherent rather than fuzzy goal, though I’m really not sure.
* Sandbox Fidelity: Creating a sandbox environment complex enough such that (a) AI of multiple different intelligence levels could meaningfully operate inside it and (b) there are incentives for AI to build other smarter AI within the sandbox.
* Non-transferability: The sandbox might be different than reality in a way that doesn’t let the alignment solution translate.
* Extracting the initial AI: We need some way to extract the initial AI from the sandbox after it has created an alignment strategy.
* Communication: We need some way to communicate with the initial AI. We might try to query the later AI but (a) this is more dangerous and (b) there is no guarantee that we can query it at all.

[-]AhmedNeedsATherapist5mo30

There are some positive feedback loops in school that cause gaps in ability between students in a subject to widen. There are also some negative feedback loops (e.g., intervention), but the net effect is still the gap widening. Therefore, the system's behavior is chaotic (small differences in students' abilities eventually lead to big differences). If this is true, it means that some variation between students' successes is extremely difficult to predict.

Three examples of these positive feedback loops:

Suppose that Student A has less knowledge in a particular subject and is therefore performing worse than Student B in that subject. Then, it is likelier than not that:

If A and B put the same effort into studying, B is positively reinforced for the studying more frequently and more intensely than A.
When A studies the subject, the information is going to be more quickly forgotten than when B studies.
The act of studying the subject becomes more aligned with B's self-concept than with A's self-concept.

(ergo B studies more than A)

I have low confidence in this model, but I could not come up with a simple, testable prediction that the model makes.

[-]Viliam5mo31

The author of Bring Up Genius would agree with you about the positive feedback loop. If your children get better than average before they join the school, they will keep getting rewards, which will increase their motivation, etc.

Now the question is how much of "children getting better than average before joining school" is about nature or nurture. If your child has good genes, it is definitely worth it to make the difference visible. If your child is average, your options are limited. But still, kids can spend enormous amounts of time talking about dinosaurs or pokemons; if you succeed to redirect some of that energy into something academically relevant (e.g. by teaching them to read the names of the dinosaurs, then some short texts about them), it may help.

The act of studying the subject becomes more aligned with B's self-concept than with A's self-concept.
(ergo B studies more than A)

It's not just "more" or "less", it's often studying different things. The child failing at math will study the textbook, and will hate it. The math prodigy will read some interesting books on math instead. Which again increases the gap.

That said, it sometimes also happens that the smart children stop studying things that are not interesting for them. Why study something, if you are smart enough that you can guess the answer or in the worst case just read the textbook the night before the exam? Sometimes these smart kids get in trouble later when the strategy they successfully used at previous school suddenly stops working when they get to high school or university, when suddenly they are surrounded by people just as smart as them, except that many of those people also have good study habits. I have seen talented people drop out, because they couldn't switch to the "in this environment, I am not so special anymore, and I need to start working hard" mode fast enough.

[-]Nathan Helm-Burger5mo20

If this were true, your expect grades/scotes to not be a uniform gaussian. You'd expect the middle to be slightly lower than expected.

Also, you'd expect crossover events (below-average to above-average or vise versa) to be less common than chance would predict.

[-]AhmedNeedsATherapist5mo10

Why? Links appreciated.

[-]Viliam5mo20

You need some mathematical model. The bell curve emerges as a sum of many things. For example you have many genes that can contribute to higher intelligence, so it's a question of how many times the coin of fate had landed the right way, and the result is approximately the sum of the contributions.

Now if we assume that school just adds some knowledge to children -- even if we assume that each child gets a random amount of knowledge, but it's a random amount independent on the starting value -- the result is still a bell curve.

If we had a model assuming that school e.g. doubles each child's knowledge, that would increase the gaps, but it would still be a bell curve (only twice as wide).

However, if we assume that each child gets a random multiplier by school, let's say, everyone's starting knowledge is multiplied by a random number between 5 and 20, then the result is no longer a bell curve.

Basically, bell curve × bell curve ≠ bell curve, but instead (assuming all numbers are positive) it is asymmetric, with longer right side. Imagine {1, 2} multiplied by {1, 2}, you get {1, 2, 2, 4}, with the 4 far away from the center.

If depends a lot on which model you choose.

[-]AhmedNeedsATherapist9d10

The fact that it is often best to end a practice session at the peak of your performance seems related to the concept of preventing overfitting by stopping training just before test set performance declines. Your brain needs time to generalize skills (often in the form of gaining insights and often when sleeping) and practicing over and over en masse doesn't give it time to do this. See e.g. cramming for an exam. I think the main difference here is that with humans you're talking about diminishing returns on ability in the long-term rather than outright worse performance (Maybe outright worse performance is a common situation for transfer ability?). Epistemic status: shaky

[-]AhmedNeedsATherapist4mo1-2

(discussed on the LessWrong discord server)

There seems to be an implicit fundamental difference in many people's minds between an algorithm running a set of heuristics to maximize utility (a heuristic system?) and a particular decision theory (e.g. FDT). I think the better way to think about it is that decision theories categorize heuristic systems, usually classifying them by how they handle edge cases.
Let's suppose we have a non-embedded agent A in a computable environment, something like a very sophisticated video game, and A has to continually choose between a bunch of inputs. A is capable of very powerful thought: it can do hypercomputation, RNG if needed, think as long as it needs between its choices, etc. In particular, A is able to do Solomonoff Induction. Let's also assume A is maximizing a utility function U, which is a computable function of the environment.

What happens if A find itself making a Newcomblike decision? Perhaps there is another agent in this environment that has a very good track record of predicting whether other agents in the environment will one-box or two-box, and A finds itself in the usual Newcomb scenario (million utility or a million+thousand utility or no utility) with their decision predicted by this agent. A can one-box by choosing one input and two-box by choosing another input. Should A one-box?
No. The agent in the environment would be unable to simulate A's decision, and moreover, A's decision is completely and utterly irrelevant to what's inside the boxes. If A randomly goes off-track and flips its decision at this point, nothing happens. Nothing could have happened, this other agent has no way to know or use this fact. Instead, A sums over P(x|input)U(x) for all states x of the computable environment, and chooses whichever input yields the maximum sum, which is probably two-boxing. If A one-boxes, it is due to not having enough information about the setup to determine that two-boxing is better.

You cannot use this logic when playing against Omega or a skilled psychologist. In these cases, your computation is actually accessible by the other agent, so you can get higher utility by one-boxing. Your decision theory is important because your thinking is not as powerful as A's! All of this points to looking at decision theories as classifying different heuristic systems.

I think this is post-worthy, but I want to (a) verify that my logic is correct (b) improve my wording (I am unsure if I am using a lot of terminology correctly here, but I am fairly confident that my idea can be understood.)

Moderation Log