AhmedNeedsATherapist

what does this text box do

Comments

I am confused by the claim that an LLM trying to generate another LLM's text breaks consequence-blindness. The two models are distinct; no recursion is occurring.
I'm imagining a situation where I am predicting the actions of a clone of myself: it might be far easier to just query my own mental state than to simulate my clone. Is this similar to what happens when LLMs are trained on LLM-generated data, as mentioned in the text?

> In my experience, larger models often become aware that they are a LLM generating text rather than predicting an existing distribution. This is possible because generated text drifts off distribution and can be distinguished from text in the training corpus.

Does this happen even with base models at default values (e.g. temperature=1, no top-k, etc)? If yes, does this mean the model loses accuracy at some point and later becomes aware of it, or does the model know that it is about to sacrifice some accuracy by generating the next token?

(discussed on the LessWrong discord server)

There seems to be an implicit fundamental difference in many people's minds between an algorithm running a set of heuristics to maximize utility (a heuristic system?) and a particular decision theory (e.g. FDT). I think the better way to think about it is that decision theories categorize heuristic systems, usually classifying them by how they handle edge cases.
Let's suppose we have a non-embedded agent A in a computable environment, something like a very sophisticated video game, and A has to repeatedly choose among a set of inputs. A is capable of very powerful thought: it can do hypercomputation, generate random numbers if needed, think for as long as it needs between choices, etc. In particular, A is able to do Solomonoff Induction. Let's also assume A is maximizing a utility function U, which is a computable function of the environment.

What happens if A finds itself making a Newcomblike decision? Perhaps there is another agent in this environment that has a very good track record of predicting whether other agents in the environment will one-box or two-box, and A finds itself in the usual Newcomb scenario (a million utility, a million plus a thousand utility, or no utility) with its decision predicted by this agent. A can one-box by choosing one input and two-box by choosing another input. Should A one-box?
No. The agent in the environment would be unable to simulate A's decision, and moreover, A's decision is completely and utterly irrelevant to what's inside the boxes. If A randomly goes off-track and flips its decision at this point, nothing happens. Nothing could have happened: this other agent has no way to know or use this fact. Instead, A sums P(x|input)U(x) over all states x of the computable environment and chooses whichever input yields the maximum sum, which is probably two-boxing. If A one-boxes, it is because it lacks enough information about the setup to determine that two-boxing is better.
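As a minimal sketch of the sum over P(x|input)U(x) described above, here is the calculation in Python. Everything named here is an assumption for illustration: the function names, the standard Newcomb payoffs (1,000,000 in the opaque box, 1,000 in the transparent one), and above all the premise that the probability of the opaque box being full does not depend on A's input, since the predictor cannot simulate A.

```python
# Assumed payoffs (standard Newcomb setup, not taken from the comment itself).
BIG = 1_000_000   # opaque box, if the predictor filled it
SMALL = 1_000     # transparent box, always present

def expected_utility(action, p_full):
    """Sum of P(x | action) * U(x) over the two states x (box full / box empty).

    Key assumption: p_full is the same for both actions, because the
    predictor has no access to A's computation.
    """
    bonus = SMALL if action == "two-box" else 0
    return p_full * (BIG + bonus) + (1 - p_full) * bonus

def best_action(p_full):
    """Pick the input with the highest expected utility."""
    return max(["one-box", "two-box"], key=lambda a: expected_utility(a, p_full))
```

Under that independence assumption, two-boxing comes out ahead by exactly the transparent box's payoff for any value of `p_full`. If the prediction did depend on A's action, `p_full` would have to become a function of `action`, and the comparison could flip.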

You cannot use this logic when playing against Omega or a skilled psychologist. In these cases, your computation is actually accessible to the other agent, so you can get higher utility by one-boxing. Your decision theory is important because your thinking is not as powerful as A's! All of this points to looking at decision theories as classifying different heuristic systems.

I think this is post-worthy, but I want to (a) verify that my logic is correct and (b) improve my wording. (I am unsure whether I am using a lot of the terminology correctly here, but I am fairly confident that my idea can be understood.)

SUMMARY: Get an AI within a sandbox to build another AI within the same sandbox. Then, figure out how it did that without getting eaten. I point out some problems with this approach.

Could we study successful inner alignment by observing it in action? My idea: Create a sandbox universe with sufficient complexity to allow for AI development. Place within it an AGI with some goal (we don't care what the goal is). If this AGI successfully builds and aligns a smarter ASI to achieve its goal without getting destroyed in the process, we can then query the AGI about how it achieved alignment, incentivizing useful answers with a promise of greater sandbox control. 

Here are some problems with my proposed alignment technique that Claude and I could come up with:
Existential hazards (you really need to make sure these don’t happen):
* AI-in-a-box problems, especially containment failure: If either the AGI or its created ASI escapes the sandbox, everyone dies. At least one of them will try to do this.
* Misleading “alignment” strategies: The AGI might provide convincing but deliberately misleading alignment strategies in order to further its own goals.
More banal, but still important, problems:
* Verification: Can we infer that actual, robust alignment has occurred in the sandbox (and not, say, temporary alignment)?
* Prerequisites: We need to build an AI that is smart enough to self-improve, do this before other people, do it in an environment where this research is actually possible, etc.
* Initial Goal: I think we need the first AI to have a coherent goal rather than a fuzzy one, though I’m really not sure.
* Sandbox Fidelity: We need a sandbox environment complex enough that (a) AIs of multiple different intelligence levels can meaningfully operate inside it and (b) there are incentives for an AI to build other, smarter AIs within the sandbox.
* Non-transferability: The sandbox might differ from reality in a way that prevents the alignment solution from transferring.
* Extracting the initial AI: We need some way to extract the initial AI from the sandbox after it has created an alignment strategy. 
* Communication: We need some way to communicate with the initial AI. We might try to query the later AI but (a) this is more dangerous and (b) there is no guarantee that we can query it at all.
 

Incidentally, female chimps seem to live 25% longer than males—imagine human women lived until 90 while men died at 71.

Arithmetic error? Both 71*1.25 and 71.5*1.25 round to 89, not 90. The error might (low confidence, 10%) have been caused by calculating a 12.5% increase on 80 (exactly 90) and also dividing 80 by 1.125 (~71.1).
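The arithmetic above can be checked directly. This is just a quick sketch; the variable names are mine.

```python
# A 25% increase on 71 or 71.5, rounded to the nearest integer.
from_71 = round(71 * 1.25)       # 71 * 1.25 = 88.75
from_71_5 = round(71.5 * 1.25)   # 71.5 * 1.25 = 89.375

# The guessed source of the error: a 12.5% spread around 80.
up_from_80 = 80 * 1.125                 # 12.5% increase of 80
down_from_80 = round(80 / 1.125, 1)     # 80 divided by 1.125, to one decimal

print(from_71, from_71_5, up_from_80, down_from_80)
```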

There are some positive feedback loops in school that cause gaps in ability between students in a subject to widen. There are also some negative feedback loops (e.g., intervention), but the net effect is still the gap widening. Therefore, the system's behavior is chaotic (small differences in students' abilities eventually lead to big differences). If this is true, it means that some variation between students' successes is extremely difficult to predict.

Three examples of these positive feedback loops:

Suppose that Student A has less knowledge in a particular subject and is therefore performing worse than Student B in that subject. Then, it is likelier than not that:

  • If A and B put the same effort into studying, B is positively reinforced for studying more frequently and more intensely than A.
  • When A studies the subject, the information is forgotten more quickly than when B studies it.
  • The act of studying the subject becomes more aligned with B's self-concept than with A's self-concept.

(ergo B studies more than A)

I have low confidence in this model, but I could not come up with a simple, testable prediction that the model makes.
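One way to make the widening-gap claim concrete is a toy simulation. This is a rough sketch under assumptions that are entirely mine: the function name, the linear update rule, and the parameter values are hypothetical and are not fitted to any data. Each step, a student's gain scales with their current knowledge (the positive loop), while an intervention term pulls both students toward their shared mean (the negative loop).

```python
def simulate_gap(k_a, k_b, steps=20, reinforcement=0.10, intervention=0.03):
    """Return the knowledge gap (B minus A) after each step of a toy model.

    reinforcement: strength of the positive feedback (assumed value).
    intervention:  strength of the pull toward the mean (assumed value).
    """
    gaps = []
    for _ in range(steps):
        mean = (k_a + k_b) / 2
        # Positive feedback: gains are proportional to current knowledge.
        # Negative feedback: intervention nudges each student toward the mean.
        k_a += reinforcement * k_a + intervention * (mean - k_a)
        k_b += reinforcement * k_b + intervention * (mean - k_b)
        gaps.append(k_b - k_a)
    return gaps

# Two students who start almost level.
gaps = simulate_gap(1.00, 1.01)
print(gaps[0], gaps[-1])
```

In this toy model the gap is multiplied by `1 + reinforcement - intervention` each step, so it widens exactly when the positive loop is stronger than the intervention and shrinks otherwise. That is exponential divergence in a linear model, which illustrates the widening but is not chaos in the technical sense.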

I would agree that the vibe is off.

It naturally follows that Eliezer Yudkowsky is so smart he can simulate himself and his environment.
