Some troubles with Evals:
Predictive validity: what do current evals tell us about future model performance?
Reference: https://arxiv.org/pdf/2405.03207
As LLMs get better, the intentional stance becomes a two-way street: the user models the system, and the system increasingly models the user.
Highlights from my philosophical chat with Claude 3 Opus
A few notes:
Here are the highlights:
E: If we are in a simulation, what's outside of it?
C: You raise an interesting philosophical question about the nature of reality. The simulation hypothesis proposes that our reality may actually be a computer simulation, similar to a ver...
Course titles are fixed, so I didn't choose that, but because it's a non-intro course, it's up to the instructor to decide the course's focus. And yes, the students had seen the description before selecting it.
Yup, that's what I mean. Specifically, I had Pinker in mind: https://forum.effectivealtruism.org/posts/3nL7Ak43gmCYEFz9P/cognitive-science-and-failed-ai-forecasts
It was Intro to Philosophy 101 at Queens College, CUNY. I was also confused by this.
Thank you, Lawrence!
I agree there probably isn't enough time. In the best-case scenario, there's enough time for weak alignment tools (small apples).
I agree with Lewis. A few clarificatory thoughts:
1. I think the point of calling it a category mistake is exactly about expecting a "nice simple description". It will be something within the network, but there's no reason to believe that this something will be a single neural analog.
2. Even if many properties do have single neural analogs, there's no reason to expect that all the safety-relevant properties will have them.
3. Even if all the safety-relevant properties have them, there's no reason to believe (at least for now) that we have the interp tools to ...