All of Ofer's Comments + Replies

Ofer40

I think the important factors w.r.t. risks re [morally relevant disvalue that occurs during inference in ML models] are probably more like:

  1. The training algorithm. Unsupervised learning seems less risky than model-free RL (e.g. the RLHF approach currently used by OpenAI maybe?); the latter seems much more similar, in a relevant sense, to the natural evolution process that created us.
  2. The architecture of the model.

Being polite to GPT-n is probably not directly helpful (though it can be helpful by causing humans to care more about this topic). A user can ... (read more)

Ofer145

My question was about whether ARC gets to evaluate [the most advanced model that the AI company created so far] before the company creates a slightly more advanced model (by scaling up the architecture, or by continuing the training process of the evaluated model).

OferΩ3112

Did OpenAI/Anthropic allow you to evaluate smaller scale versions* of GPT4/Claude before training the full-scale model?

* [EDIT: and full-scale models in earlier stages of the training process]

4Hjalmar_Wijk
ARC evals has only existed since last fall, so for obvious reasons we have not evaluated very early versions. Going forward I think it would be valuable and important to evaluate models during training or to scale up models in incremental steps.
1Lone Pine
I don't see the value in testing smaller, less capable versions. Obviously they can only test versions that already exist, but they should always test the biggest, most capable models available.
Ofer4-4

Will this actually make things worse? No, you're overthinking this.

This does not seem like a reasonable attitude (both in general, and in this case specifically).

6ArthurB
In general yes, here no. My impression from reading LW is that many people suffer from a great deal of analysis paralysis and are taking too few chances, especially given that the default isn't looking great. There is such a thing as doing a dumb thing because it feels like doing something (e.g. let's make AI Open!) but this ain't it. The consequences of this project are not going to be huge (talking to people) but you might get a nice little gradient read as to how helpful it is and iterate from there.
OferΩ15-3

Having thought a bunch about acausal trade — and proven some theorems relevant to its feasibility — I believe there do not exist powerful information hazards about it that stand up to clear and circumspect reasoning about the topic.

Have you discussed this point with other relevant researchers before deciding to publish this post? Is there a wide agreement among relevant researchers that a public, unrestricted discussion about this topic is net-positive? Have you considered the unilateralist's curse and biases that you may have (in terms of you gaining status/prestige from publishing this)?

Ofer20

Re impact markets: there's a problem regarding potentially incentivizing people to do risky, net-negative things (that can end up being beneficial). I co-authored this post about the topic.

OferΩ120

(Though even in that case it's not necessarily a generalization problem. Suppose every single "test" input happens to be identical to one that appeared in "training", and the feedback is always good.)

2Rohin Shah
It's still well-defined, though I agree that in this case the name is misleading. But this is a single specific edge case that I don't expect will actually happen, so I think I'm fine with that.
OferΩ120

Generalization-based. This categorization is based on the common distinction in machine learning between failures on the training distribution, and out of distribution failures. Specifically, we use the following process to categorize misalignment failures:

  1. Was the feedback provided on the actual training data bad? If so, this is an instance of outer misalignment.
  2. Did the learned program generalize poorly, leading to bad behavior, even though the feedback on the training data is good? If so, this is an instance of inner misalignment.

This categorizatio... (read more)

2Rohin Shah
You can extend the definition to online learning: choose some particular time and say that all the previous inputs on which you got gradients are the "training data" and the future inputs are the "test data". In the situation you describe, you would want to identify the point at which the AI system starts executing on its plan to cause an existential catastrophe, set that as the specific point in time (so everything before it is "training" and everything after is "test"), and then apply the categorization as usual.
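To make the procedure concrete, here is a minimal sketch of how this categorization (with the online-learning extension above) could be applied; the class and field names are illustrative placeholders, not from any actual codebase:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Interaction:
    timestep: int
    received_gradient: bool   # did this input produce a gradient update?
    feedback_was_good: bool   # was the feedback/reward on this input acceptable?
    behavior_was_bad: bool    # did the model misbehave on this input?

def categorize_failure(history: List[Interaction], cutoff: int) -> Optional[str]:
    """Apply the generalization-based categorization with the online-learning
    extension: inputs before `cutoff` that produced gradients count as
    "training data"; everything at or after `cutoff` counts as "test data"."""
    train = [x for x in history if x.timestep < cutoff and x.received_gradient]
    test = [x for x in history if x.timestep >= cutoff]

    if any(not x.feedback_was_good for x in train):
        return "outer misalignment: bad feedback on the training data"
    if any(x.behavior_was_bad for x in test):
        return "inner misalignment: good training feedback, bad generalization"
    return None  # no misalignment failure under this categorization
```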
2Dzoldzaya
Definitely relevant, but not sure if it supports my argument that we shouldn't try to induce collapse. This post is about unilaterally taking a very radical action to avert a potentially worse scenario that inertia and race dynamics are pushing us towards. That looks like classic 'unilateralist's benediction' territory.
Ofer00

I'm interested in hearing what you think the counterfactuals to impact shares/retroactive funding in general are, and why they are better.

The alternative to launching an impact market is to not launch an impact market. Consider the set of interventions that get funded if and only if an impact market is launched. Those are interventions that no classical EA funder decides to fund in a world without impact markets, so they seem unusually likely to be net-negative. Should we move EA funding towards those interventions, just because there's a chance that they'll end up being extremely beneficial? (Which is the expected result of launching a naive impact market.)

2Elizabeth
Ah. You have much more confidence in other funding mechanisms than I do. Doesn't seem like we're making progress on this so I will stop here.
Ofer-20

I expect prosocial projects to still be launched primarily for prosocial reasons, and funding to be a way of enabling them to happen and publicly allocating credit. People who are only optimizing for money and don't care about externalities have better ways available to pursue their goals, and I don't expect that to change.

It seems that according to your model, it's useful to classify (some) humans as either:

(1) humans who are only optimizing for money, power and status; and don't care about externalities.

(2) humans who are working on prosocial projects... (read more)

2Elizabeth
That model is a straw man: talking in dichotomies and sharp cut-offs is easier than spectrums and margins, but I would hope they'd be assumed by default.

But focusing strictly on the margin: so much of EA pushes people to think big: biggest impact, fastest scaling, etc. It also encourages people to be terrified of doing anything, but not in ways that balance out; it just makes people stressed and worse at thinking. I 100% agree with you that this pushes people to ignore the costs and risks of their projects, and that this is bad. Relative to that baseline, I think retroactive funding is at most a drop in the bucket, and impact shares potentially an improvement, because the quantification gives people more traction to raise specific objections.

The same systems also encourage people to overestimate their project's impact and ignore downsides. No one wants to admit their project failed, much less did harm. It will hurt their chances of future grants and jobs and reduce their status. Impact shares at least give a hook for a quantified outside assessment, instead of the only form of post-project feedback being public criticism that is costly and scary to give. (Yes, this only works if the quantification and objections are sufficiently good, but "sufficiently" only means "better than the counterfactual". "The feedback could be bad though" applies to everything.)

This post on the EA Forum outlines a long history of CEA launching projects and then just dropping them, without evaluation. Impact shares that remain valueless are an easy way to build common knowledge of the lack of proof of value, especially compared to someone writing a post that obviously took tens if not hundreds of hours to write and had to be published anonymously due to fear of repercussions.

I'm interested in hearing what you think the counterfactuals to impact shares/retroactive funding in general are, and why they are better.
Ofer10

(To be clear, my comment was not about the funding of your specific project but rather about the general funding approach that is referred to in the title of the OP.)

2Elizabeth
Something feels off to me about the whole framing. I expect prosocial projects to still be launched primarily for prosocial reasons, and funding to be a way of enabling them to happen and publicly allocating credit. People who are only optimizing for money and don't care about externalities have better ways available to pursue their goals, and I don't expect that to change.

If you describe the problem as "this encourages swinging for the fences and ignoring negative impact", impact shares suffer from it much less than many parts of effective altruism. Probably below average. Impact shares at least have some quantification and feedback loop, which is more than I can say for the constant discussion of long tails, hits-based giving, and scalability.

I would love it if effective altruism and x-risk groups took the risk of failure and negative externalities more seriously. Given the current state, impact shares seem like a really weird place to draw the line.
Ofer00

How do you avoid the problem of incentivizing risky, net-negative projects (that have a chance of ending up being beneficial)?

You wrote:

Ultimately we decided that impact shares are no worse than the current startup equity model, and that works pretty well. “No worse than startup equity” was a theme in much of our decision-making around this system.

If the idea is to use EA funding and fund things related to anthropogenic x-risks, then we probably shouldn't use a mechanism that yields similar incentives as "the current startup equity model".

2Elizabeth
Your questions are reasonable for people outside the trust ecosystem. I'm in the x-risk ecosystem and will get feedback and judgement on this project independent of money. If you have a better solution for fine-tuning credit allocation among humans with feelings doing work with long feedback cycles, I'd love to hear it.
OferΩ240

The smooth graphs seem like good evidence that there are much smoother underlying changes in the model, and that the abruptness of the change is about behavior or evaluation rather than what gradient descent is learning.

If we're trying to predict abrupt changes in the accuracy of output token sequences, the per-token log-likelihood can be a useful signal. What's the analogous signal when we're talking about abrupt changes in a model's ability to deceptively conceal capabilities, hack GPU firmware, etc.? What log-likelihood plots can we use to predict those types of abrupt changes in behavior?

1Ethan Perez
Here, I think we'll want to look for suspicious changes in the log-likelihood trends. E.g., it's a red flag if we see steady increases in log-likelihood on some scary behavior, but then the trend reverses at some level of model scale.
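As a rough illustration of the kind of check Ethan describes, here is a small sketch with made-up numbers; the helper name and the 0.05-nat threshold are assumptions, not anything used in the actual evaluations:

```python
import numpy as np

def flag_trend_reversal(log_likelihoods, min_drop=0.05):
    """Flag the 'suspicious' pattern described above: per-token log-likelihood
    on some behavior rises with model scale and then reverses.

    log_likelihoods: mean per-token log-likelihoods of the behavior of
        interest, ordered by increasing model scale.
    Returns the index where the trend peaks before reversing, or None."""
    ll = np.asarray(log_likelihoods, dtype=float)
    peak = int(np.argmax(ll))
    # A reversal: the peak is not at the largest scale, and some later value
    # drops more than `min_drop` nats below the peak.
    if peak < len(ll) - 1 and (ll[peak] - ll[peak + 1:]).max() > min_drop:
        return peak
    return None

# Hypothetical log-likelihoods for four model scales: rises, then dips.
print(flag_trend_reversal([-3.2, -2.4, -1.9, -2.6]))  # -> 2
```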
Ofer306

Does everyone who works at OpenAI sign a non-disparagement agreement? (Including those who work on governance/policy?)

Ofer15-2

Yes. To be clear, the point here is that OpenAI's behavior in that situation seems similar to how, seemingly, for-profit companies sometimes try to capture regulators by paying their family members. (See 30 seconds from this John Oliver monologue as evidence that such tactics are not rare in the for-profit world.)

gugu1917

Sooo this was such an intriguing idea that I did some research -- but reality appears to be more boring:

In a recent informal discussion I believe said OPP CEO remarked that he had to give up the OpenAI board seat because his fiancée's joining Anthropic created a conflict of interest. Naively this is much more likely, and I think it is much better supported by the timelines.
According to LinkedIn, the mentioned fiancée had already joined as a VP in 2018 and was promoted to a probably more serious position in 2020, and her sibling was promoted to VP in 2019.
The Anthropic spl... (read more)

Vaniver*4248

Makes sense; it wouldn't surprise me if that's what's happening. I think this perhaps understates the degree to which the attempts at capture were mutual--a theory of change where OPP gives money to OpenAI in exchange for a board seat and the elevation of safety-conscious employees at OpenAI seems like a pretty good way to have an effect. [This still leaves the question of how OPP assesses safety-consciousness.]

I should also note that I find the 'nondisparagement agreements' people have signed with OpenAI somewhat troubling, because it means many people with high ... (read more)

Ofer102

Another bit of evidence about OpenAI that I think is worth mentioning in this context: OPP recommended a grant of $30M to OpenAI in a deal that involved OPP's then-CEO becoming a board member of OpenAI. OPP hoped that this would allow them to make OpenAI improve their approach to safety and governance. Later, OpenAI appointed both the CEO's fiancée and the fiancée's sibling to VP positions.

Vaniver100

Both of whom then left for Anthropic with the split, right?

OferΩ110

Sorry, that text does appear in the linked page (in an image).

OferΩ110

The Partnership may never make a profit

I couldn't find this quote in the page that you were supposedly quoting from. The only google result for it is this post. Am I missing something?

[This comment is no longer endorsed by its author]
1Ofer
Sorry, that text does appear in the linked page (in an image).
OferΩ120

That being said, I think that, most of the time, alignment work ending up in training data is good, since it can help our AI systems be differentially better at AI alignment research (e.g. relative to how good they are at AI capabilities research), which is something that I think is pretty important.

That consideration seems relevant only for language models that will be doing/supporting alignment work.

OferΩ112

Maybe the question here is whether including certain texts in relevant training datasets can cause [language models that pose an x-risk] to be created X months sooner than otherwise.

The relevant texts I'm thinking about here are:

  1. Descriptions of certain tricks to evade our safety measures.
  2. Texts that might cause the ML model to (better) model AIS researchers or potential AIS interventions, or other potential AI systems that the model might cooperate with (or that might "hijack" the model's logic).
OferΩ110

Is that because you think it would be hard to get the relevant researchers to exclude any given class of texts from their training datasets [EDIT: or prevent web crawlers from downloading the texts etc.]? Or even if that part was easy, you would still feel that that lever is very small?

5Richard_Ngo
Even if that part was easy, it still seems like a very small lever. A system capable of taking over the world will be able to generate those ideas for itself, and a system with strong motivations to take over the world won't have them changed by small amounts of training text.
Ofer42

First point: by "really want to do good" (the really is important here) I mean someone who would be fundamentally altruistic and would not have any status/power desire, even subconsciously.

Then I'd argue the dichotomy is vacuously true, i.e. it does not generally pertain to humans. Humans are the result of human evolution. It's likely that having a brain that (unconsciously) optimizes for status/power has been very adaptive.

Regarding the rest of your comment, this thread seems relevant.

Ofer*107

I'd add to that bullet list:

  • Severe conflicts of interest are involved.
Ofer126

I strong downvoted your comment in both dimensions because I found it disagreeable and counterproductive.

Generally, I think it would be net-negative to discourage such open discussions about unilateral, high-risk interventions within the EA/AIS communities that involve conflicts of interest—especially, for example, unilateral interventions to create/fund for-profit AGI companies, or to develop/disseminate AI capabilities.

Ofer*136

Like, who knew that the thing would become a Discord server with thousands of people talking about ML? That they would somewhat succeed? And then, when the thing is pretty much already somewhat on the rails, what choice do you even have? Delete the server? Tell the people who have been working hard for months to open-source GPT-3 like models that "we should not publish it after all"?

I think this eloquent quote can serve to depict an important, general class of dynamics that can contribute to anthropogenic x-risks.

Ofer32

I don't think discussing whether someone really wants to do good or whether there is some (possibly unconscious?) status-optimization process is going to help us align AI.

Two comments:

  • [wanting to do good] vs. [one's behavior being affected by an unconscious optimization for status/power] is a false dichotomy.
  • Don't you think that unilateral interventions within the EA/AIS communities to create/fund for-profit AGI companies, or to develop/disseminate AI capabilities, could have a negative impact on humanity's ability to avoid existential catastrophes from AI?

First point: by "really want to do good" (the really is important here) I mean someone who would be fundamentally altruistic and would not have any status/power desire, even subconsciously.

I don't think Conjecture is an "AGI company", everyone I've met there cares deeply about alignment and their alignment team is a decent fraction of the entire company. Plus they're funding the incubator.

I think it's also a misconception that it's a unilateralist intervention. Like, they've talked to other people in the community before starting it; it was not a secret.

Ofer10

This concern seems relevant if (1) a discount factor is used in an RL setup (otherwise the system seems as likely to be deceptively aligned with or without the intervention, in order to eventually take over the world), and (2) a decision about whether the system is safe for deployment is made based on its behavior during training.

As an aside, the following quote from the paper seems relevant here:

Ensuring copies of the states of early potential precursor AIs are preserved to later receive benefits would permit some separation of immediate safety needs a

... (read more)
OferΩ230

I think this comment is lumping together the following assumptions under the "continuity" label, as if there is a reason to believe that either they are all correct or all incorrect (and I don't see why):

  1. There is large distance in model space between models that behave very differently.
  2. Takeoff will be slow.
  3. It is feasible to create models that are weak enough to not pose an existential risk yet able to sufficiently help with alignment.

I bet more on scenarios where we get AGI when politics is very different compared to today.

I agree that just before... (read more)

OferΩ590

Even with adequate closure and excellent opsec, there can still be risks related to researchers on the team quitting and then joining a competing effort or starting their own AGI company (and leveraging what they've learned).

OferΩ330

Do you generally think that people in the AI safety community should write publicly about what they think is "the missing AGI ingredient"?

It's remarkable that this post was well received on the AI Alignment Forum (18 karma points before my strong downvote).

Ofer60

Regarding the table in the OP, there seem to be strong selection effects that are involved. For example, the "recruitment setting" for the "Goërtz 2020" study is described as:

Recruitment from Facebook groups for COVID-19 patients with persistent symptoms and registries on the website of the Lung Foundation on COVID-19 information

Ofer*10

Hey there!

And then finally there are actually some formal results where we try to formalize a notion of power-seeking in terms of the number of options that a given state allows a system. This is work [...] which I'd encourage folks to check out. And basically you can show that for a large class of objectives defined relative to an environment, there's a strong reason for a system optimizing those objectives to get to the states that give them many more options.

Do you understand the main theorems in that paper and for what environments they are applicable... (read more)

Ofer10

Rather than letting super-intelligent AI take control of human's destiny, by merging with the machines humans can directly shape their own fate.

[...]

Since humans connected to machines are still “human”, anything they do definitionally satisfies human values.

We are already connected to machines (via keyboards and monitors). The question is how a higher bandwidth interface will help in mitigating risks from huge, opaque neural networks.

2Logan Zoellner
I think the idea is something along the lines of:

  1. Build a high-bandwidth interface between the human brain and a computer.
  2. Figure out how to simulate a single cortical column.
  3. Give human beings a million extra cortical columns to make us really smart.

This isn't something you could do with a keyboard and monitor. But, as stated, I'm not super-optimistic this will result in a sane, super-intelligent human being. I merely think that it is physically possible to do this before/around the same time as the Singularity.
OferΩ110

Suppose that each subnetwork does general reasoning and thus up until some point during training the subnetworks are useful for minimizing loss.

1Not Relevant
Are you saying that such a mechanism occurs by coincidence, or that it’s actively constructed? It seems like for all the intermediate steps, all consumers of the almost-identical subnetworks would naturally just pick one and use that one, since it was slightly closer to what the consumer needed.
Ofer*30

[EDIT: sorry, I need to think through this some more.]

3Not Relevant
I see, so your claim here is that gradient hacking is a convergent strategy for all agents of sufficient intelligence. That’s helpful, thanks. I am still confused about this in the case that Alice is checking whether or not she has X goal, since by definition it is to her goal Y’s detriment to not have children if she finds she has a different goal Y!=X.
Ofer10

I wouldn't use the myopic vs. long-term framing here. Suppose a model is trained to play chess via RL, and there are no inner alignment problems. The trained model corresponds to a non-myopic agent (a chess game can last for many time steps). But the environment that the agent "cares" about is an abstract environment that corresponds to a simple chess game. (It's an environment with less than states). The agent doesn't care about our world. Even if some potential activation values in the network correspond to hacking the computer that runs the model a... (read more)

OferΩ030

If the model that is used as a Microscope AI does not use any optimization (search), how will it compute the probability that, say, Apple's engineers will overcome a certain technical challenge?

3Evan R. Murphy
That's a good question. Perhaps it does make use of optimization but the model still has an overall passive relationship to the world compared to an active mesa-optimizer AI. I'm thinking about the difference between say, GPT-3 and the classic paperclip maximizer or other tiling AI. This is just my medium-confidence understanding and may be different from what Evan Hubinger meant in that quote.
Ofer10

Agents that don't care about influencing our world don't care about influencing the future weights of the network.

1Not Relevant
I see, so you’re comparing a purely myopic vs. a long-term optimizing agent; in that case I probably agree. But if the myopic agent cares even about later parts of the episode, and gradients are updated in between, this fails, right?
Ofer20

(Haven't read the OP thoroughly so sorry if not relevant; just wanted to mention...)

If any part of the network at any point during training corresponds to an agent that "cares" about an environment that includes our world then that part can "take over" the rest of the network via gradient hacking.

1Not Relevant
This seems like a weird claim - if there are multiple objectives within the agent, why would the one that cares about the external world decisively “win” any gradient-hacking-fight?
Ofer10

Should we take this seriously? I'm guessing no, because if this were true someone at OpenAI or DeepMind would have encountered it also and the safety people would have investigated and discovered it and then everyone in the safety community would be freaking out right now.

(This reply isn't specifically about Karpathy's hypothesis...)

I'm skeptical about the general reasoning here. I don't see how we can be confident that OpenAI/DeepMind will encounter a given problem first. Also, it's not obvious to me that the safety people at OpenAI/DeepMind will be notified about a concerning observation that the capabilities-focused team can explain to themselves with a non-concerning hypothesis.

OferΩ110

What I can do is point to my history of acting in ways that, I hope, show my consistent commitment to doing what is best for the longterm future (even if of course some people with different models of what is “best for the longterm future” will have legitimate disagreements with my choices of past actions), and pledge to remain in control of Conjecture and shape its goals and actions appropriately.

Sorry, do you mean that you are actually pledging to "remain in control of Conjecture"? Can some other founder(s) make that pledge too if it's necessary for m... (read more)

OferΩ8430

Your website says: "WE ARE AN ARTIFICIAL GENERAL INTELLIGENCE COMPANY DEDICATED TO MAKING AGI SAFE", and also "we are committed to avoiding dangerous AI race dynamics".

How are you planning to avoid exacerbating race dynamics, given that you're creating a new 'AGI company'? How will you prove to other AI companies—that do pursue AGI—that you're not competing with them?

Do you believe that most of the AI safety community approves of the creation of this new company? In what ways (if any) have you consulted with the community before starting the company?

Connor LeahyΩ8160

To address the opening quote - the copy on our website is overzealous, and we will be changing it shortly. We are an AGI company in the sense that we take AGI seriously, but it is not our goal to accelerate progress towards it. Thanks for highlighting that.

We don’t have a concrete proposal for how to reliably signal that we’re committed to avoiding AGI race dynamics beyond the obvious right now. There is unfortunately no obvious or easy mechanism that we are aware of to accomplish this, but we are certainly open to discussion with any interested parties ab... (read more)

Ofer230

Who, in practice, pulls the EA-world fire alarm? Is it Holden Karnofsky?

FYI, him having that responsibility would seemingly entail a conflict of interest; he said in an interview:

Anthropic is a new AI lab, and I am excited about it, but I have to temper that or not mislead people because Daniela, my wife, is the president of Anthropic. And that means that we have equity, and so [...] I’m as conflict-of-interest-y as I can be with this organization.

Ofer30

The founders also retain complete control of the company.

Can you say more about that? Will shareholders not be able to sue the company if it acts against their financial interests? If Conjecture will one day become a public company, is it likely that there will always be a controlling interest in the hands of few individuals?

[...] to train and study state-of-the-art models without pushing the capabilities frontier.

Do you plan to somehow reliably signal to AI companies—that do pursue AGI—that you are not competing with them? (In order to not exacerbate race dynamics).

2Connor Leahy
The founders have a supermajority of voting shares and full board control and intend to hold on to both for as long as possible (preferably indefinitely). We have been very upfront with our investors that we do not want to ever give up control of the company (even if it were hypothetically to go public, which is not something we are currently planning to do), and will act accordingly. For the second part, see the answer here.
OferΩ230

I'm late to the party by a month, but I'm interested in your take (especially Rohin's) on the following:

Conditional on an existential catastrophe happening due to AI systems, what is your credence that the catastrophe will occur only after the involved systems are deployed?

7Rohin Shah
Idk, 95%? Probably I should push that down a bit because I haven't thought about it very hard.

It's a bit fuzzy what "deployed" means, but for now I'm going to assume that we mean that we put inputs into the AI system for the primary purpose of getting useful outputs, rather than for seeing what the AI did so that we can make it better.

Any existential catastrophe that didn't involve a failure of alignment seems like it had to involve a deployed system.

For failures of alignment, I'd expect that before you get an AI system that can break out of the training process and kill you, you get an AI system that can break out of deployment and kill you, because there's (probably) less monitoring during deployment. You're also just running much longer during deployment -- if an AI system is waiting for the right opportunity, then even if it is equally likely to happen for a training vs deployment input (i.e. ignoring the greater monitoring during training), you'd still expect to see it happen at deployment, since >99% of the inputs happen at deployment.
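A back-of-the-envelope version of that last point, under the simplifying assumptions that the waiting system triggers independently on each input with some small probability $p$ and that the extra monitoring during training is ignored:

$$\Pr[\text{first trigger during deployment} \mid \text{any trigger}] \;=\; \frac{(1-p)^{n_{\text{train}}}\left(1-(1-p)^{n_{\text{deploy}}}\right)}{1-(1-p)^{\,n_{\text{train}}+n_{\text{deploy}}}} \;\approx\; \frac{n_{\text{deploy}}}{n_{\text{train}}+n_{\text{deploy}}}$$

when $p\,(n_{\text{train}}+n_{\text{deploy}}) \ll 1$, so with more than 99% of inputs occurring during deployment, the ratio is above 0.99.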
Ofer10

Simple metrics, like number of views, or number of likes, are easy for companies to optimise for. Whereas figuring out how to optimise for what people really want is a trickier problem. So it’s not surprising if companies haven’t figured it out yet.

It's also not surprising for a different reason: The financial interests of the shareholders can be very misaligned with what the users "really want". (Which can cause the company to make the product more addictive, serve targeted ads that exploit users' vulnerabilities, etc.).

Ofer30

PSA for Edge browser users: if you care about privacy, make sure Microsoft does not silently enable syncing of browsing history etc. (Settings->Privacy, search and services).

They seemingly did so to me a few days ago (probably along with the Windows "Feature update" 20H2); it may be something that they currently do to some users and not others.

Ofer10

BTW that foldable design makes the respirator fit in a pocket, which can be a big plus.

Ofer10

This is one of those "surprise! now that you've read this, things might be different" posts.

The surprise factor may be appealing from the perspective of a writer, but I'm in favor of having a norm against it (e.g. setting an expectation for authors to add a relevant preceding content note to such posts).
