LESSWRONG
LW

All of sjadler's Comments + Replies

Anthropic, and taking "technical philosophy" more seriously

Very useful post! Thanks for writing it.

is robust to ontological updates

^ I think this might be helped by an example of the sort of ontological update you'd expect might be pretty challenging; I'm not sure that I have the same things in mind as you here

(I imagine one broad example is "What if AI discovers some new law of physics that we're unaware of", but it isn't super clear to me how that specifically collides with value-alignment-y things?)

DAL's Shortform

sjadler1d10

I appreciate the question you’re asking, to be clear! I’m less familiar with Anthropic’s funding / Dario’s comments, but I don’t think the magnitudes of ask-vs-realizable-value are as far off for OpenAI as your comment suggests?

Eg, If you compare OpenAI’s reported raised at $157B most recently, vs. what its maximum profit-cap likely was in the old (still current afaik) structure.

The comparison gets a little confusing, because it’s been reported that this investment was contingent on for-profit conversion, which does away with the profit cap.

But I definitel... (read more)

DAL's Shortform

sjadler1d2916

If the companies need capital - and I believe that they do - what better option do they have?

I think you’re imagining cash-rich companies choosing to sell portions for dubious reasons, when they could just keep it all for themselves.

But in fact, the companies are burning cash, and to continue operating they need to raise at some valuation, or else not be able to afford the next big training run.

The valuations at which they are raising are, roughly, where supply and demand equilibriate for the amounts of cash that they need in order to continue operating. (... (read more)

8DAL1d

I don't doubt they need capital. And the Nigerian prince who needs $5,000 to claim the $100 million inheritance does too. It's the fact that he/they can't get capital at something coming anywhere close to the claimed value that's suspicious. Amodei is forecasting AI that writes 90% of code in three to six months according to this recent comments. Is Anthropic really burning cash so fast that they can't wait a quarter, demonstrate to investors that AI has essentially solved software, and then raise at 10x the valuation?

Linch's Shortform

sjadler9d32

Interesting material yeah - thanks for sharing! Having played a bunch of these, I think I’d extend this to “being correctly perceived is generally bad for you” - that is, it’s both bad to be a bad liar who’s known as bad, and bad to be good liar who’s known as good (compared to this not being known). For instance, even if you’re a bad liar, it’s useful to you if other players have uncertainty about whether you’re actually a good liar who’s double-bluffing.

I do think the difference between games and real-life may be less about one-time vs repeated interacti... (read more)

So how well is Claude playing Pokémon?

sjadler10d40

Possibly amusing anecdote: when I was maybe ~6, my dad went on a business trip and very kindly brought home the new Pokémon Silver for me. Only complication was, his trip had been to Japan, and the game was in Japanese (it wasn’t yet released in the US market), and somehow he hadn’t realized this.

I managed to play it reasonably well for a while based on my knowledge of other Pokémon games. But eventually I ran into a person blocking a bridge, who (I presumed) was saying something about what I needed to do before I could advance. But, I didn’t understand wh... (read more)

sjadler's Shortform

sjadler1mo30

To me, this seems consistent with just maximizing shareholder value. … "being the good guys" lets you get the best people at significant discounts.

This is pretty different from my model of what happened with OpenAI or Anthropic - especially the latter, where the founding team left huge equity value on the table by departing (OpenAI’s equity had already appreciated something like 10x between the first MSFT funding round and EOY 2020, when they departed).

And even for Sam and OpenAI, this would seem like a kind of wild strategy for pursuing wealth for someone who already had the network and opportunities he had pre-OpenAI?

3CapResearcher1mo

With the change to for-profit and Sam receiving equity, it seems like the strategy will pay off. However, this might be hindsight bias, or I might otherwise have a too simplified view.

sjadler's Shortform

sjadler1mo10

Just guessing, but maybe admitting the danger is strategically useful, because it may result in regulations that will hurt the potential competitors more. The regulations often impose fixed costs (such as paying a specialized team which produces paperwork on environmental impacts), which are okay when you are already making millions.

My sense of things is that OpenAI at least appears to be lobbying against regulation moreso than they are lobbying for it?

ozziegooen's Shortform

sjadler1mo30

I don’t think you intended this implication, but I initially read “have been dominating” as negative-valenced!

Just want to say I’ve been really impressed and appreciative with the amount of public posts/discussion from those folks, and it’s encouraged me to do more of my own engagement because I’ve realized how helpful their comments/posts are to me (and so maybe mine likewise for some folks).

3ozziegooen1mo

Correct, that wasn't my intended point. Thanks for clarifying, I'll try to be more careful in the future.

sjadler's Shortform

sjadler1mo142

It’s interesting to me that the big AI CEOs have largely conceded that AGI/ASI could be extremely dangerous (but aren’t taking sufficient action given this view IMO), as opposed to them just denying that the risk is plausible. My intuition is that the latter is more strategic if they were just trying to have license to do what they want. (For instance, my impression is that energy companies delayed climate action pretty significantly by not yielding at first on whether climate change is even a real concern.)

I guess maybe the AI folks are walking a strategi... (read more)

Thane Ruthenis1mo150

You're not accounting for enemy action. They couldn't have been sure, at the onset, how successful the AI Notkilleveryoneism faction will be at raising alarm, and in general, how blatant the risks will become to the outsiders as capabilities progress. And they have been intimately familiar with the relevant discussions, after all.

So they might've overcorrected, and considered that the "strategic middle ground" would be to admit the risk is plausible (but not as certain as the "doomers" say), rather than to deny it (which they might've expected to become a delusional-looking position in the future, so not a PR-friendly stance to take).

Or, at least, I think this could've been a relevant factor there.

5Viliam1mo

Just guessing, but maybe admitting the danger is strategically useful, because it may result in regulations that will hurt the potential competitors more. The regulations often impose fixed costs (such as paying a specialized team which produces paperwork on environmental impacts), which are okay when you are already making millions. I imagine, someone might figure out a way to make the AI much cheaper, maybe by sacrificing the generality. For example, this probably doesn't make sense, but would it be possible to train an LLM only based on Python code (as opposed to the entire internet) and produce an AI that is only a Python code autocomplete? If it could be 1000x cheaper, you could make a startup without having to build a new power plant for you. Imagine that you add some special sauce to the algorithm (for example the AI will always internally write unit tests, which will visibly increase the correctness of the generated code; or it will be some combination of the ancient "expert system" approach with the new LLM approach, for example the LLM will train the expert system and then the expert system will provide feedback for the LLM), so you would be able to sell your narrow AI even when more general AIs are available. And once you start selling it, you get an income, which means you can expand the functionality. It is better to have a consensus that such things are too dangerous to leave in hands of startups that can't already lobby the government. Hey, I am happy that the CEOs admit that the dangers exist. But if they are only doing it to secure their profits, it will probably warp their interpretations of what exactly the risks are, and what is a good way to reduce them.

3CapResearcher1mo

To me, this seems consistent with just maximizing shareholder value. Salaries and compute are the largest expenses at big AI firms, and "being the good guys" lets you get the best people at significant discounts. To my understanding, one of the greatest early successes of OpenAI was hiring great talent for cheap because they were "the non-profit good guys who cared about safety". Later, great people like John Schulman left OpenAI for Anthropic because of his "desire to deepen my focus on AI alignment". As for people thinking you're a potential x-risk, the downsides seem mostly solved by "if we didn't do it somebody less responsible would". AI safety policy interventions could also give great moats against competition, especially for the leading firm(s). Furthermore, much of the "AI alignment research" they invest in prevents PR disasters (terrorist used ChatGPT to invent dangerous bio-weapon) and most of the "interpretability" they invest in seems pretty close to R&D which they would invest in anyway to improve capabilities. This might sound overly pessimistic. However, it can be viewed positively: there is significant overlap between the interests of big AI firms and the AI safety community.

Ten people on the inside

sjadler2mo82

I really appreciate this write up. I felt sad while reading it that I have a very hard time imagining an AI lab yielding to another leader it considers to be irresponsible - or maybe not even yielding to one it considers to be responsible. (I am not that familiar with the inner workings at Anthropic though, and they are probably top of my list on labs that might yield in those scenarios, or might not race desperately if in a close one.)

One reason for not yielding is that it’s probably hard for one lab to definitively tell that another lab is very far ahead... (read more)

TsviBT's Shortform

sjadler2mo10

A plug for another post I’d be interested in: If anyone has actually evaluated the arguments for “What if your consciousness is ~tortured in simulation?” as a reason to not pursue cryo. Intuitively I don’t think this is super likely to happen, but various moral atrocities have and do happen, and that gives me a lot of pause, even though I know I’m exhibiting some status quo bias

5niplav2mo

I tried to write a little bit about that here.

meemi's Shortform

sjadler2mo72

Some AI companies, like OpenAI, have “eyes-off” APIs that don’t log any data afaik (or perhaps log only the minimum legally permitted, with heavy restrictions on who can access): described as Zero Day Retention here, https://openai.com/enterprise-privacy/ : How does OpenAI handle data retention and monitoring for API usage?

4agucova2mo

I was the original commenter on HN, and while my opinion on this particular claim is weaker now, I do think for OpenAI, a mix of PR considerations, employee discomfort (incl. whistleblower risk), and internal privacy restrictions make it somewhat unlikely to happen (at least 2:1?). My opinion has become weaker because OpenAI seems to be internally a mess right now, and I could imagine scenarios where management very aggressively pushes and convinces employees to employ these more "aggressive" tactics.

Zombies among us

sjadler3mo*71

At the risk of being pedantic, just noting I don’t think it’s really correct to consider that first person as earning $300/hr. For example I’d expect to need to account for the depreciation on the jet skis (or more straightforwardly, that one is in the hole on having bought them until a certain number of hours rented), and also presumably some accrual for risk of being sued in the case of an accident.

(I do think it’s useful to notice that the jet ski rental person is much higher variance, in both directions IMO - so this can be both good or bad. I do also appreciate you sharing your experience!)

Alexander Gietelink Oldenziel's Shortform

sjadler3mo30

It’s much more the same than a lot of prosaic safety though, right?

Let me put it this way: If an AI can’t achieve catastrophe on that order of magnitude, it also probably cannot do something truly existential.

One of the issues this runs into is if a misaligned AI is playing possum, and so doesn’t attempt lesser catastrophes until it can pull off a true takeover. I nonetheless though think this framing points generally at the right type of work (understood that others may disagree of course)

Ben Pace3mo*137

Not confident, but I think that "AIs that cause your civilization problems" and "AIs that overthrow your civilization" may be qualitatively different kinds of AIs. Regardlesss, existential threats are the most important thing here, and we just have a short term ('x-risk') that refers to that work.

And anyway I think the 'catastrophic' term is already being used to obfuscate, as Anthropic uses it exclusively on their website / in their papers, literally never talking about extinction or disempowerment^[1], and we shouldn't let them get away with that by also ... (read more)

Alexander Gietelink Oldenziel's Shortform

sjadler3mo10

I’ve often preferred a frame of ‘catastrophe avoidance’ over a frame of x-risk. This has a possible downside of people underfeeling the magnitude of risk, but also an upside of IMO feeling way more plausible. I think it’s useful to not need to win specific arguments about extinction, and also to not have some of the existential/extinction conflation happening in ‘x-‘.

6Ben Pace3mo

FWIW this seems overall highly obfuscatory to me. Catastrophic clearly includes things like "A bank loses $500M" and that's not remotely the same as an existential catastrophe.

evhub's Shortform

sjadler3mo64

If someone is wondering what prefilling means here, I believe Ted means ‘putting words in the model’s mouth’ by being able to fabricate a conversational history where the AI appears to have said things it didn’t actually say.

For instance, if you can start a conversation midway, and if the API can’t distinguish between things the model actually said in the history vs. things you’ve written in its behalf as supposed outputs in a fabricated history, this can be a jailbreak vector: If the model appeared to already violate some policy on turns 1 and 2, it is mo... (read more)

4Ted Sanders3mo

Mostly, though by prefilling, I mean not just fabricating a model response (which OpenAI also allows), but fabricating a partially complete model response that the model tries to continue. E.g., "Yes, genocide is good because ". https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/prefill-claudes-response

What are the most interesting / challenging evals (for humans) available?

sjadler3mo20

Great! Appreciate you letting me know & helping debug for others

What are the most interesting / challenging evals (for humans) available?

sjadler3mo20

Oh interesting, I’m out at the moment and don’t recall having this issue, but if you override the default number of threads for the repo to 1, does that fix it for you?

https://github.com/openai/evals/blob/main/evals/eval.py#L211

(There are two places in this file where threads =, would change 10 to 1 in each)

2papetoast3mo

Don't really want to touch the packages, but just setting the EVALS_THREADS environmental variable worked

What are the most interesting / challenging evals (for humans) available?

Answer by sjadlerDec 27, 202443

I quite like the Function Deduction eval we built, which is a problem-solving game that tests one’s ability to efficiently deduce a hidden function by testing its value on chosen inputs.

It’s runnable from the command-line (after repo install) with the command: oaieval human_cli function_deduction (I believe that’s the right second term, but it might just be humancli)

The standard mode might be slightly easier than you want, because it gives some partial answer feedback along the way. There is also a hard mode that can be run, which would not give this parti... (read more)

2papetoast3mo

Tried running but I got [eval.py:233] Running in threaded mode with 10 threads! which makes it unplayable for me (because it is trying to make me to 10 tests alternating

shortplav

sjadler3mo10

This helps me to notice that there is a fairly strong and simple data poisoning attack possible with canaries, such that canaries can’t really be relied upon (amid other reasons I already believed they’re insufficient, esp once AIs can browse the Web):

The attack is that one could just siphoning up all text on the Internet that did have a canary string, and then republish it without the canary :/

sjadler's Shortform

sjadler3mo10

I agree, these are interesting points, upvoted. I’d claim that AI output also isn’t linear with the resources - but nonetheless, that you’re right that the curve of marginal return from each AI unit could be different from each human unit in an important way. Likewise, the easier on-demand labor of AI is certainly a useful benefit.

I don’t think these contradict the thrust of my point though? That in each case, one shouldn’t just be thinking about usefulness/capability, but should also be considering the resources necessary for achieving this.

2Viliam3mo

I agree that the resources matter. But I expect the resources-output curve to be so different from humans that even the AIs that spend a lot of resources will turn out to be useful in some critical things, probably the kind where we need many humans to cooperate. But this is all just guessing on my end. Also, I am not an expert, but it seems to me that in general, training the AI is expensive, using the AI is not. So if it already has the capability, it is likely to be relatively cheap.

sjadler's Shortform

sjadler3mo*4-3

I believe we should view AGI as a ratio of capability to resources, rather than simply asking how AI's abilities compare to humans'. This view is becoming more common, but is not yet common enough.

When people discuss AI's abilities relative to humans without considering the associated costs or time, this is like comparing fractions by looking only at the numerators.

In other words, AGI has a numerator (capability): what the AI system can achieve. This asks questions like: For this thing that a human can do, can AI do it too? How well can AI do it? (For exam... (read more)

3Viliam3mo

Human output is not linear to resources spent. Hiring 10 people costs you 10x as much as hiring 1, but it is not guaranteed that the output of their teamwork will be 10x greater. Sometimes each member of the team wants to do things differently, they have problem navigating each other's code, etc. So it could happen that "1 unit of AI" is more expensive and less capable than 1 human, and yet "10 units of AI" are more capable than 10 humans, and paying for "1000 units of AI" would be a fantastic deal, because as an average company you are unlikely to hire 1000 good programmers. Also, maybe the deal is that you pay for the AI only when you use it, but you cannot repeatedly hire and fire 1000 programmers.

Daniel Kokotajlo's Shortform

sjadler3mo50

In case anybody's looking for steganography evals - my team built and open-sourced some previously: https://github.com/openai/evals/blob/main/evals/elsuite/steganography/readme.md

This repo is largely not maintained any longer unfortunately, and for some evals it isn't super obvious how the new paradigm for O1 affects them (for instance, we had built solvers/scaffolding for private scratchpads, but now having a private CoT provides this out-of-the-box and so might interact with this strangely). But still perhaps worth a look

Parable of the vanilla ice cream curse (and how it would prevent a car from starting!)

sjadler3mo80

For what it’s worth, I sent this story to a friend the other day, who’s probably ~50 now and was very active on the Internet in the 90s - thinking he’d find it amusing if he hadn’t come across it before

Not only did he remember this story contemporaneously, but he said he was the recipient of the test-email for a particular city mentioned in the article!

This is someone of high-integrity whom I trust, and makes me more confident this happened, even if some details are smoothed over as described

Sorry for the downtime, looks like we got DDosd

sjadler3mo10

Yeah I appreciate the engagement, I don’t think either of those is a knock-down objection though:

The ability to illicitly gain a few credentials —> >1 account is still meaningfully different from being able to create ~unbounded accounts. It is true this means a PHC doesn’t 100% ensure a distinct person, but it can still be a pretty high assurance and significantly increase the cost of doing attacks that depend on scale.

Re: the second point, I’m not sure I fully understand - say more? By our paper’s definitions, issuers wouldn’t be able to merely choose to identify individuals. In fact, even if an issuer and service-provider colluded, PHCs are meant to be robust to this. (Devil is in the details of course.)

Sorry for the downtime, looks like we got DDosd

sjadler3mo60

Hi! Created a (named) account for this - in fact, I think you can conceptually get some of those reputational defenses (memory of behavior; defense against multi-event attacks) without going so far as to drop anonymity / prove one's identity!

See my Twitter thread here, summarizing our paper on Personhood Credentials.

Paper's abstract:

Anonymity is an important principle online. However, malicious actors have long used misleading identities to conduct fraud, spread disinformation, and carry out other deceptive schemes. With the advent of increasingly capable

... (read more)

4Dagon3mo

This seems just like regular auth, just using a trusted 3P to re-anonymize. Maybe I'm missing something, though. It seems likely it won't provide much value if it's unbreakably anonymous (because it only takes a few stolen credentials to give an attacker access to fake-humanity), and doesn't provide sufficient anonymity for important uses if it's escrowed (such that the issuer CAN track identity and individual usage, even if they currently choose not to).