LESSWRONG
LW

All of William_S's Comments + Replies

Would be interested in a quick write-up of what you think are the most important virtues you'd want for AI systems, seems good in terms of having things to aim towards instead of just aiming away from.

William_S's Shortform

William_S26d20

Initial version for firefox, code at https://github.com/william-r-s/MindfulBlocker, extension file at https://github.com/william-r-s/MindfulBlocker/releases/tag/v0.2.0

Daniel Kokotajlo's Shortform

William_S1moΩ120

Maybe there's an MVP of having some independent organization ask new AIs about their preferences + probe those preferences for credibility (e.g. are they stable under different prompts, do AIs show general signs of having coherent preferences), and do this through existing apis

Daniel Kokotajlo's Shortform

William_S1moΩ342

I think the weirdness points are more important, this still seems like a weird thing for a company to officially do, e.g. there'd be snickering news articles about it. So if some individuals could do this independently might be easier

2Daniel Kokotajlo1mo

Exactly. But, happily, Anthropic at least is willing to do the right thing to some extent. They've hired a Model Welfare lead to look into this sort of thing. I hope that they expand and that other companies follow suit.

2William_S1mo

Daniel Kokotajlo's Shortform

William_S1moΩ120

How large a reward pot do you think is useful for this? Maybe would be easier to get a couple of lab employees to chip in some equity vs. getting a company to spend weirdness points on this. Or maybe could create a human whistleblower reward program that credibly promises to reward AIs on the side.

2Daniel Kokotajlo1mo

I think the money is not at all the issue for the companies. Like, a million dollars a month is not very much to them. But e.g. suppose your AI says it wants to be assured that if it's having trouble solving a problem, it'll be given hints. Or suppose it says that it wants to be positively reinforced. That requires telling one of your engineers to write a bit of code and run it on your actual datacenters (because for security reasons you can't offload the job to someone else's datacenters.) That's annoying and distracts from all the important things your engineers are doing.

Principles for the AGI Race

William_S2mo20

I think it's somewhat blameworthy to not think about these questions at all though

Principles for the AGI Race

William_S2mo20

On reflection there was something missing from my perspective here, which is that taking any action based on principles depends on pragmatic considerations, like if you leave are there better alternatives? How much power do you really have? I think I don't fault someone who thinks this through and decides that something is wrong but there's no real way to do anything about it. I do think you should try to maintain some sense of what is wrong and what the right direction would be, look out for ways to push in that direction. E.g. working at a lab but maintaining some sense of "this is how much of a chance it looks like pause activism would need before I'd quite and endorse a pause".

2William_S2mo

I think it's somewhat blameworthy to not think about these questions at all though

Principles for the AGI Race

William_S2mo20

I think I was just conflating different kinds of decisions here, and imagining arguing with people with very different conceptions of what are important to count in costs and benefits, and a bit confused. On reflection I don't endorse 10x margin in terms of like percentage points of x-risk. And like maybe margin is sort of a crutch, maybe the thing I want more is like "95% chance of being net-positive, considering possibility you're kind of biased". I still think you should be suspicious of "the case exactly balance lets ship'

Principles for the AGI Race

William_S2mo20

Yeah this part is pretty under-defined, I was maybe falling into the trap of being too idealistic, and I'm probably less optimistic about this than I was when writing it before. I think there's something directionally important here, are you trying at all to expand the circle of accountability at all, even if you're being cautious about expanding it because you're afraid of things breaking down?

6 (Potential) Misconceptions about AI Intellectuals

William_S2mo42

Would be nice to have a llm+prompt that tries to produce reasonable AI strategy advice based on a summary of the current state of play, have some way to validate that it's reasonable, be able to see how it updates as events unfold.

2ozziegooen2mo

Agreed. I'm curious how to best do this. One thing that I'm excited about is using future AIs to judge current ones. So we could have a system that does: 1. An AI today (or a human) would output a certain recommended strategy. 2. In 10 years, we agree to have the most highly-trusted AI evaluator evaluate how strong this strategy was, on some numeric scale. We could also wait until we have a "sufficient" AI, meaning that there might be some set point at which we'd trust AIs to do this evaluation. (I discussed this more here) 3. Going back to ~today, we have forecasting systems predict how well the strategy (1) will do on (2).

6 (Potential) Misconceptions about AI Intellectuals

William_S2mo42

A couple advantages for AI intellectuals could be:
- being able to rerun based on different inputs, see how their analysis changes function of those inputs
- being able to view full reasoning traces (while also not the full story, probably more of the full story than what goes on with human reasoning, good intellectuals already try to share their process but maybe can do better/use this to weed out clearly bad approaches)

4ozziegooen2mo

Yep! On "rerun based on different inputs", this would work cleanly with AI forecasters. You can literally say, "Given that you get a news article announcing a major crisis X that happens tomorrow, what is your new probability on Y?" (I think I wrote about this a bit before, can't find it right now). I did write more about a full-scale forecasting system could be built and evaluated, here, for those interested: https://www.lesswrong.com/posts/QvFRAEsGv5fEhdH3Q/preliminary-notes-on-llm-forecasting-and-epistemics https://www.lesswrong.com/posts/QNfzCFhhGtH8xmMwK/enhancing-mathematical-modeling-with-llms-goals-challenges Overall, I think there's just a lot of neat stuff that could be done.

William_S's Shortform

William_S2mo115

Yep, I've used those, with some effectiveness but also tend to just like get used to it over time, form a habit of mindlessly jumping through the hoops. Hypothesis here is that having to justify what you're doing would be more effective at changing habits.

William_S's Shortform

William_S2mo7240

LLM-based application I'd like to exist:
Web browser addon for firefox that has blocklists of websites, when you try to visit one you have to have a conversation with Claude about why you want to visit it in this moment, convince Claude to let you bypass the block for a limited period of time for your specific purpose (let you customize the claude prompt with info about why you set up the block in the first place).
Wanting to use for things like news, social media where it's a bit too much to try to completely block, but I've got bad habits around checking too frequently.
Bonus: be able to let the LLM read the website for you and answer questions without showing you the page, like is there anything new about X.

2William_S26d

Initial version for firefox, code at https://github.com/william-r-s/MindfulBlocker, extension file at https://github.com/william-r-s/MindfulBlocker/releases/tag/v0.2.0

1Kajus2mo

Anyone wants to coordinate on that?

5Sergii2mo

I made something like this, works differently though, blocking is based on a fixed prompt: https://grgv.xyz/blog/awf/

L Rudolf L2mo220

I built this a few months ago: https://github.com/LRudL/devcon

Definitely not production-ready and might require some "minimal configuration and tweaking" to get working.

Includes a "device constitution" that you set; if you visit a website, Claude will judge whether the page follows that written document, and if not it will block you, and the only way past it is winning a debate with it about why your website visit is in-line with your device constitution.

I found it too annoying but some of my friends liked it.

4Krona Emmanuel2mo

I feel that it would be too easy for me to lie to the LLM in order to convince it to let me access the websites, because in that moment, my goal would just be to gain access however I can. ColdTurkey app which I use has really good protections so I can't turn it off or uninstall it once it's locked.

Neel Nanda2mo100

I was also thinking recently that I would love this to exist! If I ever had the time I was going to try hacking it together in cursor

4cubefox2mo

There are already apps which force you to pause or jump through other hoops if you open certain apps or websites, or if you exceed some time limit using them. E.g. ScreenZen.

I found >800 orthogonal "write code" steering vectors

William_S9mo12-5

Hypothesis: each of these vectors representing a single token that is usually associated with code, vectors says "I should output this token soon", and the model then plans around that to produce code. But adding vectors representing code tokens doesn't necessarily produce another vector representing a code token, so that's why you don't see compositionality. Does somewhat seem plausible that there might be ~800 "code tokens" in the representation space.

Habryka's Shortform Feed

William_S9mo55

Absent evidence to the contrary, for any organization one should assume board members were basically selected by the CEO. So hard to get assurance about true independence, but it seems good to at least to talk to someone who isn't a family member/close friend.

7Zach Stein-Perlman9mo

(Jay Kreps was formally selected by the LTBT. I think Yasmin Razavi was selected by the Series C investors. It's not clear how involved the leadership/Amodeis were in those selections. The three remaining members of the LTBT appear independent, at least on cursory inspection.)

Habryka's Shortform Feed

William_S9mo1119

Good that it's clear who it goes to, though if I was an anthropic I'd want an option to escalate to a board member who isn't Dario or Daniella, in case I had concerns related to the CEO

Zac Hatfield-Dodds9mo*112

Makes sense - if I felt I had to use an anonymous mechanism, I can see how contacting Daniela about Dario might be uncomfortable. (Although to be clear I actually think that'd be fine, and I'd also have to think that Sam McCandlish as responsible scaling officer wouldn't handle it)

If I was doing this today I guess I'd email another board member; and I'll suggest that we add that as an escalation option.

80,000 hours should remove OpenAI from the Job Board (and similar EA orgs should do similarly)

William_S9mo*97

I do think 80k should have more context on OpenAI but also any other organization that seems bad with maybe useful roles. I think people can fail to realize the organizational context if it isn't pointed out and they only read the company's PR.

Habryka's Shortform Feed

William_S9mo3117

I agree that this kind of legal contract is bad, and Anthropic should do better. I think there are a number of aggrevating factors which made the OpenAI situation extrodinarily bad, and I'm not sure how much these might obtain regarding Anthropic (at least one comment from another departing employee about not being offered this kind of contract suggest the practice is less widespread).

-amount of money at stake
-taking money, equity or other things the employee believed they already owned if the employee doesn't sign the contract, vs. offering them something... (read more)

Buck's Shortform

William_S9moΩ572

Would be nice if it was based on "actual robot army was actually being built and you have multiple confirmatory sources and you've tried diplomacy and sabotage and they've both failed" instead of "my napkin math says they could totally build a robot army bro trust me bro" or "they totally have WMDs bro" or "we gotta blow up some Japanese civilians so that we don't have to kill more Japanese civilians when we invade Japan bro" or "dude I'm seeing some missiles on our radar, gotta launch ours now bro".

Buck's Shortform

William_S10moΩ360

Relevant paper discussing risks of risk assessments being wrong due to theory/model/calculation error. Probing the Improbable: Methodological Challenges for Risks with Low Probabilities and High Stakes

Based on the current vibes, I think that suggest that methodological errors alone will lead to significant chance of significant error for any safety case in AI.

Buck's Shortform

William_S10moΩ6100

IMO it's unlikely that we're ever going to have a safety case that's as reliable as the nuclear physics calculations that showed that the Trinity Test was unlikely to ignite the atmosphere (where my impression is that the risk was mostly dominated by risk of getting the calculations wrong). If we have something that is less reliable, then will we ever be in a position where only considering the safety case gives a low enough probability of disaster for launching an AI system beyond the frontier where disastrous capabilities are demonstrated?
Thus, in practi... (read more)

1Buck10mo

I agree that "CEO vibes about whether enough mitigation has been done" seems pretty unacceptable. I agree that in practice, labs probably will go forward with deployments that have >5% probability of disaster; I think it's pretty plausible that the lab will be under extreme external pressure (e.g. some other country is building a robot army that's a month away from being ready to invade) that causes me to agree with the lab's choice to do such an objectively risky deployment.

6William_S10mo

Relevant paper discussing risks of risk assessments being wrong due to theory/model/calculation error. Probing the Improbable: Methodological Challenges for Risks with Low Probabilities and High Stakes Based on the current vibes, I think that suggest that methodological errors alone will lead to significant chance of significant error for any safety case in AI.

Richard Ngo's Shortform

William_S10mo42

Imo I don't know if we have evidence that Anthropic deliberately cultivated or significantly benefitted from the appearance of a commitment. However if an investor or employee felt like they made substantial commitments based on this impression and then later felt betrayed that would be more serious. (The story here is I think importantly different from other stories where I think there were substantial benefits from commitment appearance and then violation)

Richard Ngo's Shortform

William_S10mo64

Everyone is afraid of the AI race, and hopes that one of the labs will actually end up doing what they think is the most responsible thing to do. Hope and fear is one hell of a drug cocktail, makes you jump to the conclusions you want based on the flimsiest evidence. But the hangover is a bastard.

Claude 3.5 Sonnet

William_S10mo20

Really, the race started more when OpenAI released GPT-4, it's been going on for a while, this is just another event that makes it clear.

On OpenAI’s Model Spec

William_S10mo20

Would be interesting philosophical experiment to have models trained on model spec v1 then try to improve their model spec for version v2, will this get better or go off the rails?

What distinguishes "early", "mid" and "end" games?

William_S10mo50

You get more discrete transitions when one s-curve process takes the lead from another s-curve process, e.g. deep learning taking over from other AI methods.

What distinguishes "early", "mid" and "end" games?

William_S10mo31

Probably shouldn't limit oneself from thinking only in terms of 3 game phases or fitting into one specific game, in general can have n-phases where different phrases have different characteristics.

Claude 3.5 Sonnet

William_S10mo120

If anyone wants to work on this, there's a contest with $50K and $20K prizes for creating safety relevant benchmarks. https://www.mlsafety.org/safebench

Richard Ngo's Shortform

William_S10mo2510

I think that's how people should generally react in the absence of harder commitments and accountability measures.

Richard Ngo's Shortform

William_S10mo93

I think the right way to think about verbal or written commitments is that they increase the costs of taking a certain course of action. A legal contract can mean that the price is civil lawsuits leading to paying a financial price. A non-legal commitment means if you break it, the person you made the commitment to gets angry at you, and you gain a reputation for being the sort of person who breaks commitments. It's always an option for someone to break the commitment and pay the price, even laws leading to criminal penalties can be broken if someone is wi... (read more)

Richard Ngo's Shortform

William_S10mo53

Yeah, I think it's good if labs are willing to make more "cheap talk" statements of vague intentions, so you can learn how they think. Everyone should understand that these aren't real commitments, and not get annoyed if these don't end up meaning anything. This is probably the best way to view "statements by random lab employees".

Imo would be good to have more "changeable commitments" too in between, statements that are "we'll do policy X until we change the policy, when we do we commit to clearly informing everyone about the change" which is maybe more the current status of most RSPs.

William_S's Shortform

William_S10mo8955

I'd have more confidence in Anthropic's governance if the board or LTBT had some fulltime independent members who weren't employees. IMO labs should consider paying a fulltime salary but no equity to board members, through some kind of mechanism where the money is still there and paid for X period of time in the future, even if the lab dissolved, so no incentive to avoid actions that would cost the lab. Board salaries could maybe be pegged to some level of technical employee salary, so that technical experts could take on board roles. Boards full of busy p... (read more)

Claude 3.5 Sonnet

William_S10mo152

Like, in Chess you start off with a state where many pieces can't move in the early game, in the middle game many pieces are in play moving around and trading, then in the end game it's only a few pieces, you know what the goal is, roughly how things will play out.

In AI it's like only a handful of players, then ChatGPT/GPT-4 came out and now everyone is rushing to get in (my mark of the start of the mid-game), but over time probably many players will become irrelevant or fold as the table stakes (training costs) get too high.

In my head the end-game is when the AIs themselves start becoming real players.

Claude 3.5 Sonnet

William_S10mo40

Also you would need clarity on how to measure the commitment.

Claude 3.5 Sonnet

William_S10mo146

It's quite possible that anthropic has some internal definition of "not meaningfully advancing the capabilities frontier" that is compatible with this release. But imo they shouldn't get any credit unless they explain it.

dogiv10mo3411

I explicitly asked Anthropic whether they had a policy of not releasing models significantly beyond the state of the art. They said no, and that they believed Claude 3 was noticeably beyond the state of the art at the time of its release.

3Archimedes10mo

I can definitely imagine them plausibly believing they're sticking to that commitment, especially with a sprinkle of motivated reasoning. It's "only" incremental nudging the publicly available SOTA rather than bigger steps like GPT2 --> GPT3 --> GPT4.

Claude 3.5 Sonnet

William_S10mo85

Would be nice, but I was thinking of metrics that require "we've done the hard work of understanding our models and making them more reliable", better neuron explanation seems more like it's another smartness test.

6evhub10mo

Yeah, I agree it's largely smartness, and I agree that it'd also be nice to have more non-smartness benchmarks—but I think an auto-interp-based thing would be a substantial improvement over current smartness benchmarks.

Zach Stein-Perlman's Shortform

William_S10mo30

IMO it might be hard for Anthropic to communicate things about not racing because it might piss off their investors (even if in their hearts they don't want to race).

Claude 3.5 Sonnet

William_S10mo82

https://x.com/alexalbert__/status/1803837844798189580

Not sure about the accuracy of this graph, but the general picture seems to match what companies claim, and the vibe is racing.

Do think that there are distinct questions about "is there a race" vs. "will this race action lead to bad consequences" vs. "is this race action morally condemnable". I'm hoping that this race action is not too consequentially bad, maybe it's consequentially good, maybe it still has negative Shapely value even if expected value is okay. There is some sense in which it is morally icky.

Claude 3.5 Sonnet

William_S10mo20

To be clear, I think the race was already kind of on, it's not clear how much this specific action gets credit assignment and it's spread out to some degree. Also not clear if there's really a viable alternative strategy here...

Claude 3.5 Sonnet

William_S10mo84

In my mental model, we're still in the mid-game, not yet in the end-game.

Raemon10mo192

A thing I've been thinking about lately is "what does it mean to shift from the early-to-mid-to-late game".

In strategy board games, there's an explicit shift from "early game, it's worth spending the effort to build a longterm engine. At some point, you want to start spending your resources on victory points." And a lens I'm thinking through is "how long does it keep making sense to invest in infrastructure, and what else might one do?"

I assume this is a pretty different lens than what you meant to be thinking about right now but I'm kinda curious for whatever-your-own model was of what it means to be in the mid vs late game.

Claude 3.5 Sonnet

William_S10mo20

Idk there's probably multiple ways to define racing, some of them are on at least

8William_S10mo

In my mental model, we're still in the mid-game, not yet in the end-game.

Claude 3.5 Sonnet

William_S10mo*197

I'm disappointed that there weren't any non-capability metrics reported. IMO it would be good if companies could at least partly race and market on reliability metrics like "not hallucinating" and "not being easy to jailbreak".

Edit: As pointed out in reply, addendum contains metrics on refusals which show progress, yay! Broader point still stands, I wish there were more measurements and they were more prominent.

William_S10mo120

If anyone wants to work on this, there's a contest with $50K and $20K prizes for creating safety relevant benchmarks. https://www.mlsafety.org/safebench

Alex Mallen10mo195

Their addendum contains measurements on refusals and harmlessness, though these aren't that meaningful and weren't advertised.

6Malo10mo

Agree. I think Google DeepMind might actually be the most forthcoming about this kind of thing, e.g., see their Evaluating Frontier Models for Dangerous Capabilities report.

3evhub10mo

A thing I'd really like to exist is a good auto-interpretability benchmark, e.g. that asks the model about interpreting GPT-2 neurons given max activating examples.

2mako yass10mo

Funnily enough, Nvidia's recent 340B parameter chat assistant release did boast about being number one on the reward model leaderboard, however, the reward model only claims to capture helpfulness and a bunch of other metrics of usefulness to the individual user. But that's still pretty good.

Claude 3.5 Sonnet

William_S10mo13

The race is on.

2William_S10mo

Really, the race started more when OpenAI released GPT-4, it's been going on for a while, this is just another event that makes it clear.

2William_S10mo

Zach Stein-Perlman10mo2413

I'd be interested in chatting about this with you and others — it's not obvious that Anthropic releasing better models makes OpenAI go nontrivially faster / not obvious that "the race" is real.

Claude 3.5 Sonnet

William_S10mo4027

IMO if any lab makes some kind of statement or commitment, you should treat this as "we think right now that we'll want to do this in the future unless it's hard or costly", unless you can actually see how you would sue them or cause a regulator to fine them if they violate the commitment. This doesn't mean weaker statements have no value.

4William_S10mo

Also you would need clarity on how to measure the commitment.

Ilya Sutskever created a new AGI startup

William_S10mo135114

If anyone says "We plan to advance capabilities as fast as possible while making sure our safety always remains ahead." you should really ask for the details of what this means, how to measure whether safety is ahead. (E.g. is it "we did the bare minimum to make this product tolerable to society" vs. "we realize how hard superalignment will be and will be investing enough to have independent experts agree we have a 90% chance of being able to solve superalignment before we build something dangerous")

Seth Herd10mo133

I think what he means is "try to be less unsafe than OpenAI while beating those bastards to ASI".

-1Anders Lindström10mo

Come on now, there is nothing to worry about here. They are just going to "move fast and break things"...

Ilya Sutskever created a new AGI startup

William_S10mo1315

I do hope he will continue to contribute to the field of alignment research.

Ilya Sutskever created a new AGI startup

William_S10mo5950

I don't trust Ilya Sutskever to be the final arbiter of whether a Superintelligent AI design is safe and aligned. We shouldn't trust any individual, especially if they are the ones building such a system to claim that they've figured out how to make it safe and aligned. At minimum, there should be a plan that passes review by a panel of independent technical experts. And most of this plan should be in place and reviewed before you build the dangerous system.

Chi Nguyen10mo2123

I don't trust Ilya Sutskever to be the final arbiter of whether a Superintelligent AI design is safe and aligned. We shouldn't trust any individual,

I'm not sure how I feel about the whole idea of this endeavour in the abstract - but as someone who doesn't know Ilya Sutskever and only followed the public stuff, I'm pretty worried that he in particular runs it if decision-making is on the "by an individual" level and even if not. Running this safely will likely require lots of moral integrity and courage. The board drama made it look to me like Ilya disquali... (read more)

William_S10mo1315

I do hope he will continue to contribute to the field of alignment research.

Boycott OpenAI

William_S10mo1511

In my opinion, it's reasonable to change which companies you want to do business with, but it would be more helpful to write letters to politicians in favor of reasonable AI regulation (e.g. SB 1047, with suggested amendments if you have concerns about the current draft). I think it's bad if the public has to play the game of trying to pick which AI developer seems the most responsible, better to try to change the rules of the game so that isn't necessary.

Also it's generally helpful to write about which labs seem more responsible/less responsible (which yo... (read more)

Non-Disparagement Canaries for OpenAI

William_S10mo128

Language in the emails included:

"If you executed the Agreement, we write to notify you that OpenAI does not intend to enforce the Agreement"

I assume this also communicates that OpenAI doesn't intend to enforce the self-confidentiality clause in the agreement

JeremySchlatter10mo132

Oh, interesting. Thanks for pointing that out! It looks like my comment above may not apply to post-2019 employees.

(I was employed in 2017, when OpenAI was still just a non-profit. So I had no equity and therefore there was no language in my exit agreement that threatened to take my equity. The equity-threatening stuff only applies to post-2019 employees, and their release emails were correspondingly different.)

The language in my email was different. It released me from non-disparagement and non-solicitation, but nothing else:

"OpenAI writes to notify you that it is releasing you from any non-disparagement and non-solicitation provision within any such agreement."

Non-Disparagement Canaries for OpenAI

William_S10mo5-1

Evidence could look like 1. Someone was in a position where they had to make a judgement about OpenAI and was in a position of trust 2. They said something bland and inoffensive about OpenAI 3. Later, independently you find that they likely would have known about something bad that they likely weren't saying because of the nondisparagement agreement (instead of ordinary confidentially agreements).

This requires some model of "this specific statement was influenced by the agreement" instead of just "you never said anything bad about OpenAI because you never ... (read more)