All of Sahil's Comments + Replies

Sahil

This section has some things you might find useful, among writing that is published. Excerpted below:

Live theory in oversimplified claims:

Claims about AI.

Claim: Scaling in this era depends on replication of fixed structure. (eg. the code of this website or the browser that hosts it.)

Claim: For a short intervening period, AI will remain only mildly creative, but its cost, latency, and error-rate will go down, causing wide adoption.

Claim: Mildly creative but very cheap and fast AI can be used to turn informal instruction (eg. prompts, comments) into formal inst

... (read more)
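The last claim above lends itself to a small concrete sketch. The sketch below is one illustrative reading, not anything from the original post: a cheap, mildly creative model is asked to turn an informal instruction into a fixed formal structure, and its low cost and latency are what make retrying past its error rate acceptable. The `complete` callable, the prompt, and the three-field JSON shape are all assumptions made up for the example.

```python
import json
from typing import Callable

PROMPT_TEMPLATE = """Turn the informal instruction below into a formal JSON object
with the fields "precondition", "action", and "postcondition". Reply with JSON only.

Informal instruction:
{instruction}
"""


def formalize(instruction: str, complete: Callable[[str], str], retries: int = 3) -> dict:
    """Ask a cheap model for a formal rendering; retry while the output fails checks."""
    prompt = PROMPT_TEMPLATE.format(instruction=instruction)
    last_error = None
    for _ in range(retries):
        raw = complete(prompt)
        try:
            spec = json.loads(raw)
        except json.JSONDecodeError as err:
            last_error = err
            continue
        # Cheap mechanical check: the fixed structure we asked for is present.
        if isinstance(spec, dict) and {"precondition", "action", "postcondition"} <= spec.keys():
            return spec
    raise ValueError(f"no valid formal output after {retries} attempts: {last_error}")


if __name__ == "__main__":
    # Stand-in for a real model call, so the sketch runs on its own.
    canned = json.dumps({
        "precondition": "user is logged in",
        "action": "archive posts older than 30 days",
        "postcondition": "no visible post is older than 30 days",
    })
    print(formalize("tidy up old posts for logged-in users", complete=lambda _prompt: canned))
```

The retry loop is the economic point of the claims: when calls are cheap and fast, a mildly unreliable formalizer is still usable because its output can be mechanically checked and re-requested.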
Chris_Leong
  That link is broken.
Sahil

unaligned ASI is extremely sensitive to context, just in the service of its own goals.

 

Risks of abuse, isolation, and dependence can indeed skyrocket from, as you say, increased "context-sensitivity" in service of an AI's/someone else's own goals. A personalized torture chamber is not better, but in fact quite likely a lot worse, than a context-free torture chamber. But to your question:

 

Is misalignment really a lack of sensitivity as opposed to a difference in goals or values?

 

The way I'm using "sensitivity": sensitivity to X = the meaning... (read more)

Chris_Leong
I'm fine with that, although it seems important to have a term for the more limited notion of sensitivity so we can keep track of that distinction: maybe adaptability? Internalising values and internalising concepts are distinct. I can have a strong understanding of your definition of "good" and do the complete opposite. I think it's reasonable to say something along the lines of: "AI safety was developed in a context where most folks weren't expecting language models before ASI, so insufficient attention has been given to the potential of LLMs to help fill in or adapt informal definitions. Even though folks who feel we need a strongly principled approach may be skeptical that this will work, there's a decent argument that this should increase our chances of success on the margins".
Sahil

Thank you, Dusan!

Next time there will be more notice, and also a more refined workshop!

Answer by Sahil

Great! I'd love to have included a remark that one, as a human, might anticipate forward-chainy/rational reasoning in these systems rather than backwardy/rationalization "reasoning", because we're often taking the "thought" metaphor seriously/literally in the label "chain-of-thought".

But since it is at least somewhat intelligent/predictive, it can make the move of "acausal collusion" with its own tendency to hallucinate, in generating its "chain"-of-"thought". That is, the optimization to have chain-of-thought in correspondence with its output can... (read more)

Stephen Fowler
"But since it is is at least somewhat intelligent/predictive, it can make the move of "acausal collusion" with its own tendency to hallucinate, in generating its "chain"-of-"thought"." I am not understanding what this sentence is trying to say. I understand what an acausal trade is. Could you phrase it more directly? I cannot see why you require the step that the model needs to be reasoning acausally for it to develop a strategy of deceptively hallucinating citations. What concrete predictions does the model in which this is an example of "acausal collusion" make?
abramdemski

I want to draw a clear distinction between the hypothesis I mention in the OP (your 'causal' explanation) and the 'acausal' explanation you mention here. You were blending the two together during our conversation, but I think there are important differences.

In particular, I am interested in the question: does o1 display any "backwards causation" from answers to chain-of-thought? Would its training result in chain-of-thought that is optimized to justify a sort of answer it tends to give (eg, hallucinated URLs)?

This depends on details of the training which we may not have enough information on (plus potentially complex reasoning about the consequences of such reasoning).
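As a toy rendering of that question (a sketch only; it assumes nothing about how o1 was actually trained): when an objective rewards agreement between chain-of-thought and final answer, the same consistency pressure can be discharged in two directions, and which one you get depends on which part of the model the update effectively moves. The scalar "parameters" below are stand-ins invented for the illustration.

```python
# Toy model: one number for "what the chain-of-thought supports" and one for
# "what the model already tends to answer" (e.g. a hallucinated URL). The
# consistency objective only cares that they agree.

def consistency_loss(cot: float, ans: float) -> float:
    return (cot - ans) ** 2


def step(cot: float, ans: float, lr: float = 0.1,
         update_cot: bool = True, update_ans: bool = True) -> tuple[float, float]:
    """One gradient-descent step on the consistency loss."""
    g_cot = 2 * (cot - ans)   # d(loss)/d(cot)
    g_ans = -2 * (cot - ans)  # d(loss)/d(ans)
    if update_cot:
        cot -= lr * g_cot     # reasoning drifts toward the answer
    if update_ans:
        ans -= lr * g_ans     # answer drifts toward the reasoning
    return cot, ans


# Case A: only the answer can move -> "forward" direction, answer follows reasoning.
cot_a, ans_a = 0.0, 1.0
for _ in range(50):
    cot_a, ans_a = step(cot_a, ans_a, update_cot=False, update_ans=True)

# Case B: only the reasoning can move -> "backward" direction, chain-of-thought
# is optimized to justify the answer the model was going to give anyway.
cot_b, ans_b = 0.0, 1.0
for _ in range(50):
    cot_b, ans_b = step(cot_b, ans_b, update_cot=True, update_ans=False)

print(f"Case A: cot={cot_a:.4f}, ans={ans_a:.4f}  (answer moved to the reasoning)")
print(f"Case B: cot={cot_b:.4f}, ans={ans_b:.4f}  (reasoning moved to the answer)")
```

In a real training setup both parts move at once, so the question in the parent comment becomes an empirical one about which direction the training signal favours in practice.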