All of Eleni Angelou's Comments + Replies

I agree with Lewis. A few clarificatory thoughts. 1. I think that the point of calling it a category mistake is exactly about expecting a "nice simple description". It will be something within the network, but there's no reason to believe that this something will be a single neural analog. 2. Even if there are many single neural analogs, there's no reason to expect that all the safety-relevant properties will have them. 3. Even if all the safety-relevant properties have them, there's no reason to believe (at least for now) that we have the interp tools to ... (read more)

eggsyntax
Can you clarify what you mean by 'neural analog' / 'single neural analog'? Is that meant as another term for what the post calls 'simple correspondences'?

Agreed. I'm hopeful that perhaps mech interp will continue to improve and be automated fast enough for that to work, but I'm skeptical that that'll happen. Or alternately I'm hopeful that we turn out to be in an easy-mode world where there is something like a single 'deception' direction that we can monitor (a rough sketch of what such monitoring could look like follows below), and that'll at least buy us significant time before it stops working on more sophisticated systems (plausibly due to optimization pressure / selection pressure if nothing else).

I agree that that's a real risk; it makes me think of Andreessen Horowitz and others claiming in an open letter that interpretability had basically been solved and so AI regulation isn't necessary. On the other hand, it seems better to state our best understanding plainly, even if others will slippery-slope it, than to take the epistemic hit of shifting our language in the other direction to compensate.
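To make the easy-mode hope concrete, here is a minimal, hypothetical sketch of what monitoring a single 'deception' direction could look like, assuming one already had per-token activations and a direction vector from a trained linear probe. All names below are illustrative; nothing here is an existing library API.

```python
# Hypothetical sketch only: assumes `activations` (n_tokens x d_model) and a
# probe-derived `direction` (d_model,) are obtained elsewhere (e.g. from a
# trained linear probe on labeled honest/deceptive examples).
import numpy as np

def deception_scores(activations: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Project each token's activation onto the unit 'deception' direction."""
    unit = direction / np.linalg.norm(direction)
    return activations @ unit

def flag_suspicious(activations: np.ndarray, direction: np.ndarray,
                    threshold: float = 2.0) -> np.ndarray:
    """Boolean mask over tokens whose projection exceeds a (tuned) threshold."""
    return deception_scores(activations, direction) > threshold

# Toy usage with random stand-ins for real activations:
rng = np.random.default_rng(0)
acts = rng.normal(size=(10, 768))        # 10 tokens, hidden size 768
direction = rng.normal(size=768)         # placeholder for a learned probe direction
print(flag_suspicious(acts, direction))  # -> array of 10 booleans
```

Whether any such single direction exists, and stays predictive under optimization pressure, is exactly what's in question above.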

Some troubles with Evals:

  1. Saturation: as performance improves, especially once it surpasses the human baseline, it becomes harder to measure differences between models (see the sketch below this list).
  2. Gamification: optimizing for high scores on eval tests rather than for the underlying capability.
  3. Contamination: benchmark items turning up in models' training data.
  4. Problems with construct validity: measuring exactly the capability you want might be harder than you think.
  5. Predictive validity: what do current evals tell us about future model performance?

    Reference: https://arxiv.org/pdf/2405.03207 
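To illustrate the saturation point numerically (my own sketch, not taken from the paper above): on a 1,000-item benchmark, two models scoring 97% and 98% differ by only one point, and sampling noise alone is about that large.

```python
# Illustrative sketch of benchmark saturation: near the ceiling, the gap between
# two models is comparable to sampling noise on a 1,000-item benchmark.
import math

def accuracy_gap_with_ci(acc_a: float, acc_b: float, n_items: int) -> tuple[float, float]:
    """Return the observed accuracy gap and its approximate 95% confidence half-width."""
    se_a = math.sqrt(acc_a * (1 - acc_a) / n_items)
    se_b = math.sqrt(acc_b * (1 - acc_b) / n_items)
    half_width = 1.96 * math.sqrt(se_a ** 2 + se_b ** 2)
    return acc_b - acc_a, half_width

gap, ci = accuracy_gap_with_ci(0.97, 0.98, n_items=1000)
print(f"gap = {gap:.3f} +/- {ci:.3f}")  # gap = 0.010 +/- 0.014, i.e. the interval includes zero
```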

As LLMs get better, the intentional stance becomes a two-way street: the user models the system, and the system increasingly models the user.

Highlights from my philosophical chat with Claude 3 Opus

A few notes:

  • Claude is better at talking philosophy than the average human imho
  • At many points, it felt as though Claude was modeling me / giving me responses I would endorse
  • It felt a bit creepy/more intense than the average interaction I have with LLMs

Here are the highlights:

E: If we are in a simulation, what's outside of it?

C: You raise an interesting philosophical question about the nature of reality. The simulation hypothesis proposes that our reality may actually be a computer simulation, similar to a ver... (read more)

Course titles are fixed, so I didn't choose that, but since it's a non-intro course, it's up to the instructor to decide the course's focus. And yes, the students had seen the description before selecting it.

gjm
Huh. So is there a course every year titled "Philosophy and the challenge of the future", with radically different content each time depending on the particular interests of whoever's lecturing that year?

It was intro to phil 101 at Queens College CUNY. I was also confused by this. 

I agree there probably isn't enough time. Best-case scenario, there's enough time for weak alignment tools (small apples).