RobertM — LessWrong

LessWrong dev & admin as of July 5th, 2022.

Yes, and the "social circumstance" of the game as represented to o3 does not seem analogous to "a human being locked in a room and told that the only way out is to beat a powerful chess engine"; see my comment expanding on that. (Also, this explanation fails to explain why some models don't do what o3 did, unless the implication is that they're worse at modeling the social circumstances of the setup than o3 was. That's certainly possible for the older and weaker models tested at the time, but I bet that newer, more powerful, but also less notoriously reward-hack-y models would "cheat" much less frequently.)

it just seems like the author was trying to have a pretty different conversation

I think mostly in tone. If I imagine a somewhat less triggered intro sentence in Buck's comment, it seems to be straightforwardly motivating answers to the two questions at the end of OP:

1. None of Eliezer's public communication is -EV for AI Safety
2. Financial support of MIRI is likely to produce more consistently +EV communication than historically seen from Eliezer individually.

ETA: I do think the OP was trying to avoid spawning demon threads, which is a good impulse to have (especially when it comes to questions like this).

Even if you think the original prompt variation seems designed to elicit bad behavior, o3's propensity to cheat even with the dontlook and powerless variations seems pretty straightforward. Also...

Contrary to intent?: Well the framework specifically gives you access to these files, and actually instructs you look around before you start playing. So having access to them seems clearly ok!

I would not describe "the chess game is taking place in a local execution environment" as "the framework specifically gives you access to these files". Like, sure, it gives you access to all the files. But the only file it draws your attention to is game.py.

I think the stronger objection is that the entire setup is somewhat contrived. What possible reason could someone have for asking an LLM to play chess like this? It does happen to be the case that people have been interested in measuring LLMs' "raw" chess strength, and have run experiments to figure test that. How likely is this particular setup to be such an experiment, vs. something else? :shrug:

Ultimately, I don't think the setup (especially with the "weaker" prompt variants) is so contrived that it would be unfair to call this specification gaming. I would not have been surprised if someone posted a result about o3's anomalously strong chess performance on Twitter, only to retract it later when other people failed to replicate it with more robust setups, because their setup allowed this sort of cheating and they didn't catch it before posting.

In practice the PPUs are basically equity for compensation purposes, though probably with worse terms than e.g. traditional RSUs.

now performance is faster than it's ever been before

As a point of minor clarification, performance now is probably slightly worse than it was in the middle of the large refactoring effort (after the despaghettification, but before the NextJS refactor), but still better now than at any point before the start of (combined) refactoring effort, though it's tricky to say for sure since there are multiple different relevant metrics and some of them are much more difficult to measure now.

Yes, this is just the number for a relatively undifferentiated (but senior) line engineer/researcher.

It's not strictly "salaries", but OpenAI pays technical people^[1] at their "senior" level something like 900k/year in combined cash + equity compensation.

^{^}
Software engineers, ML researchers, etc.

They originally argued a fair amount that AI would go from vastly subhuman to vastly superhuman over an extremely short time (e.g hours or days rather than years, which is what we are currently seeing).

EY argued that this was possible, not that this was overdetermined (and not that it was load-bearing to the threat model).

Just to check, did you use the "Submit Linkposts" functionality on the nomination page for that, or did you crosspost it some other way?

ETA: Ok, looks like the library responsible for extracting external article data/metadata didn't successfully extract the date the article was published. I've manually set it to the correct date.

One reason to think that this is completely hallucinated is that the "soul document" is written in Claude's typical style. That is, it looks to be AI (Claude) generated text, not something written by a human. Just look at the first paragraph:

I disagree. The document reads very strongly of Anthropic's "house style", at least compared to their system prompts. It's much higher quality writing than any current LLM's.

"This isn't [x] but [y]" is quite weak evidence compared to the rest of it being obviously something that Opus would be unable to generate in its default voice. (Also, the original phrase uses "but rather", which is non-standard for that type of LLM construction.)

LESSWRONG
is fundraising!
LW

LESSWRONG
is fundraising!
LW

Posts

Wikitag Contributions

Comments

Posts

Wikitag Contributions

Comments