LESSWRONG

wonder

Self-fulfilling misalignment data might be poisoning our AI models

wonder · 13d · 1 · 0

I was thinking about this the other day as well. I think it is particularly a problem when we evaluate misalignment based on this kind of semantic wording. It may suggest a growing need to pursue alternative ways of evaluating misalignment, rather than relying purely on prompt-based evaluation benchmarks.

Cole Wyeth's Shortform

wonder · 1mo · 4 · 0

Based on my observations, I would also think that the current publication-chasing culture in some domains (CS in particular) could push people to put out papers more quickly, even when some of those papers are only partially complete.

Agent Foundations 2025 at CMU

wonder · 2mo · 7 · 0

Will the event or its sessions be recorded, by any chance? (I may not be able to attend, but would love to learn.) Additionally, will the topics focus exclusively on their relation to x-risks?