My question was about whether ARC gets to evaluate [the most advanced model that the AI company has created so far] before the company creates a slightly more advanced model (by scaling up the architecture, or by continuing the training process of the evaluated model).
Did OpenAI/Anthropic allow you to evaluate smaller-scale versions* of GPT4/Claude before training the full-scale model?
* [EDIT: and full-scale models in earlier stages of the training process]
Will this actually make things worse? No, you're overthinking this.
This does not seem like a reasonable attitude (both in general, and in this case specifically).
Having thought a bunch about acausal trade — and proven some theorems relevant to its feasibility — I believe there do not exist powerful information hazards about it that stand up to clear and circumspect reasoning about the topic.
Have you discussed this point with other relevant researchers before deciding to publish this post? Is there wide agreement among relevant researchers that a public, unrestricted discussion about this topic is net-positive? Have you considered the unilateralist's curse and biases that you may have (in terms of you gaining status/prestige from publishing this)?
Re impact markets: there's a problem regarding potentially incentivizing people to do risky, net-negative things (that can end up being beneficial). I co-authored this post about the topic.
(Though even in that case it's not necessarily a generalization problem. Suppose every single "test" input happens to be identical to one that appeared in "training", and the feedback is always good.)
Generalization-based. This categorization is based on the common distinction in machine learning between failures on the training distribution, and out of distribution failures. Specifically, we use the following process to categorize misalignment failures:
- Was the feedback provided on the actual training data bad? If so, this is an instance of outer misalignment.
- Did the learned program generalize poorly, leading to bad behavior, even though the feedback on the training data is good? If so, this is an instance of inner misalignment.
This categorizatio...
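To make the quoted decision procedure concrete, here's a minimal sketch of the two-step check it describes (my own illustrative encoding, not code from the quoted post):

```python
# Illustrative only: the generalization-based categorization as a two-step check.
def categorize_misalignment_failure(feedback_on_training_data_was_bad: bool,
                                    generalized_poorly: bool) -> str:
    if feedback_on_training_data_was_bad:
        # Bad feedback on the actual training data.
        return "outer misalignment"
    if generalized_poorly:
        # Feedback on the training data was good, but the learned program
        # generalized poorly, leading to bad behavior.
        return "inner misalignment"
    return "no misalignment failure under this categorization"
```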
I'm interested in hearing what you think the counterfactuals to impact shares/retroactive funding in general are, and why they are better.
The alternative to launching an impact market is to not launch an impact market. Consider the set of interventions that get funded if and only if an impact market is launched. Those are interventions that no classical EA funder decides to fund in a world without impact markets, so they seem unusually likely to be net-negative. Should we move EA funding towards those interventions, just because there's a chance that they'll end up being extremely beneficial? (Which is the expected result of launching a naive impact market.)
I expect prosocial projects to still be launched primarily for prosocial reasons, and funding to be a way of enabling them to happen and publicly allocating credit. People who are only optimizing for money and don't care about externalities have better ways available to pursue their goals, and I don't expect that to change.
It seems that according to your model, it's useful to classify (some) humans as either:
(1) humans who are only optimizing for money, power and status; and don't care about externalities.
(2) humans who are working on prosocial projects...
(To be clear, my comment was not about the funding of your specific project but rather about the general funding approach that is referred to in the title of the OP.)
How do you avoid the problem of incentivizing risky, net-negative projects (that have a chance of ending up being beneficial)?
You wrote:
Ultimately we decided that impact shares are no worse than the current startup equity model, and that works pretty well. “No worse than startup equity” was a theme in much of our decision-making around this system.
If the idea is to use EA funding and fund things related to anthropogenic x-risks, then we probably shouldn't use a mechanism that yields similar incentives as "the current startup equity model".
The smooth graphs seem like good evidence that there are much smoother underlying changes in the model, and that the abruptness of the change is about behavior or evaluation rather than what gradient descent is learning.
If we're trying to predict abrupt changes in the accuracy of output token sequences, the per-token log-likelihood can be a useful signal. What's the analogous signal when we're talking about abrupt changes in a model's ability to deceptively conceal capabilities, hack GPU firmware, etc.? What log-likelihood plots can we use to predict those types of abrupt changes in behavior?
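For concreteness, here's a minimal sketch of the kind of per-token log-likelihood signal I have in mind, assuming a Hugging Face causal LM (the model name below is just a placeholder, not anything specific from this discussion):

```python
# Sketch: per-token log-likelihoods of a fixed text under a causal language model.
# Tracking these across checkpoints/scales is the kind of smooth signal discussed above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; swap in the checkpoint being evaluated
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def per_token_logprobs(text: str) -> list[float]:
    """Return log p(token_i | tokens_<i) for every token after the first."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits            # (1, seq_len, vocab_size)
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:]                      # next-token targets
    return log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)[0].tolist()

print(per_token_logprobs("The capital of France is Paris."))
```

The open question in my comment is what the analogue of this plot would be for capabilities like deceptive concealment or firmware hacking, where we can't just score a known target sequence.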
Does everyone who works at OpenAI sign a non-disparagement agreement? (Including those who work on governance/policy?)
Yes. To be clear, the point here is that OpenAI's behavior in that situation seems similar to how, seemingly, for-profit companies sometimes try to capture regulators by paying their family members. (See 30 seconds from this John Oliver monologue as evidence that such tactics are not rare in the for-profit world.)
Sooo this was such an intriguing idea that I did some research -- but reality appears to be more boring:
In a recent informal discussion I believe said OPP CEO remarked that he had to give up the OpenAI board seat because his fiancée joining Anthropic created a conflict of interest. Naively this seems much more likely, and I think it is much better supported by the timeline.
According to the LinkedIn profile of the mentioned fiancée, she had already joined as a VP in 2018 and was promoted to a probably more senior position in 2020, and her sibling was promoted to VP in 2019.
The Anthropic spl...
Makes sense; it wouldn't surprise me if that's what's happening. I think this perhaps understates the degree to which the attempts at capture were mutual--a theory of change where OPP gives money to OpenAI in exchange for a board seat and the elevation of safety-conscious employees at OpenAI seems like a pretty good way to have an effect. [This still leaves the question of how OPP assesses safety-consciousness.]
I should also note I find the 'nondisparagement agreements' people have signed with OpenAI somewhat troubling, because it means many people with high ...
Another bit of evidence about OpenAI that I think is worth mentioning in this context: OPP recommended a grant of $30M to OpenAI in a deal that involved OPP's then-CEO becoming a board member of OpenAI. OPP hoped that this would allow them to make OpenAI improve their approach to safety and governance. Later, OpenAI appointed both the CEO's fiancée and the fiancée's sibling to VP positions.
Both of whom then left for Anthropic with the split, right?
Sorry, that text does appear in the linked page (in an image).
The Partnership may never make a profit
I couldn't find this quote in the page that you were supposedly quoting from. The only Google result for it is this post. Am I missing something?
That being said, I think that, most of the time, alignment work ending up in training data is good, since it can help our AI systems be differentially better at AI alignment research (e.g. relative to how good they are at AI capabilities research), which is something that I think is pretty important.
That consideration seems relevant only for language models that will be doing/supporting alignment work.
Maybe the question here is whether including certain texts in relevant training datasets can cause [language models that pose an x-risk] to be created X months sooner than otherwise.
The relevant texts I'm thinking about here are:
Is that because you think it would be hard to get the relevant researchers to exclude any given class of texts from their training datasets [EDIT: or prevent web crawlers from downloading the texts etc.]? Or even if that part was easy, you would still feel that that lever is very small?
First point: by "really want to do good" (the really is important here) I mean someone who would be fundamentally altruistic and would not have any status/power desire, even subconsciously.
Then I'd argue the dichotomy is vacuously true, i.e. it does not generally pertain to humans. Humans are the result of human evolution. It's likely that having a brain that (unconsciously) optimizes for status/power has been very adaptive.
Regarding the rest of your comment, this thread seems relevant.
I'd add to that bullet list:
I strong downvoted your comment in both dimensions because I found it disagreeable and counterproductive.
Generally, I think it would be net-negative to discourage such open discussions about unilateral, high-risk interventions—within the EA/AIS communities—that involve conflicts of interest. Especially, for example, unilateral interventions to create/fund for-profit AGI companies, or to develop/disseminate AI capabilities.
Like, who knew that the thing would become a Discord server with thousands of people talking about ML? That they would somewhat succeed? And then, when the thing is pretty much already somewhat on the rails, what choice do you even have? Delete the server? Tell the people who have been working hard for months to open-source GPT-3-like models that "we should not publish it after all"?
I think this eloquent quote can serve to depict an important, general class of dynamics that can contribute to anthropogenic x-risks.
I don't think discussing whether someone really wants to do good or whether there is some (possibly unconscious?) status-optimization process is going to help us align AI.
Two comments:
First point: by "really want to do good" (the really is important here) I mean someone who would be fundamentally altruistic and would not have any status/power desire, even subconsciously.
I don't think Conjecture is an "AGI company", everyone I've met there cares deeply about alignment and their alignment team is a decent fraction of the entire company. Plus they're funding the incubator.
I think it's also a misconception that it's a unilateralist intervention. Like, they've talked to other people in the community before starting it; it was not a secret.
This concern seems relevant if (1) a discount factor is used in an RL setup (otherwise the system seems as likely to be deceptively aligned with or without the intervention, in order to eventually take over the world), and (2) a decision about whether the system is safe for deployment is made based on its behavior during training.
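For reference, the standard discounted objective I have in mind when I say "a discount factor is used" (usual RL notation; not taken from the paper):

$$G_t = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1}, \qquad 0 \le \gamma \le 1.$$

With $\gamma < 1$, payoffs that arrive far in the future are down-weighted relative to near-term reward, which is why the presence or absence of discounting changes the analysis in point (1).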
As an aside, the following quote from the paper seems relevant here:
...Ensuring copies of the states of early potential precursor AIs are preserved to later receive benefits would permit some separation of immediate safety needs a
I think this comment is lumping together the following assumptions under the "continuity" label, as if there is a reason to believe that either they are all correct or all incorrect (and I don't see why):
I bet more on scenarios where we get AGI when politics is very different compared to today.
I agree that just before...
Even with adequate closure and excellent opsec, there can still be risks related to researchers on the team quitting and then joining a competing effort or starting their own AGI company (and leveraging what they've learned).
Do you generally think that people in the AI safety community should write publicly about what they think is "the missing AGI ingredient"?
It's remarkable that this post was well received on the AI Alignment Forum (18 karma points before my strong downvote).
Regarding the table in the OP, there seem to be strong selection effects involved. For example, the "recruitment setting" for the "Goërtz 2020" study is described as:
Recruitment from Facebook groups for COVID-19 patients with persistent symptoms and registries on the website of the Lung Foundation on COVID-19 information
Hey there!
And then finally there are actually some formal results where we try to formalize a notion of power-seeking in terms of the number of options that a given state allows a system. This is work [...] which I'd encourage folks to check out. And basically you can show that for a large class of objectives defined relative to an environment, there's a strong reason for a system optimizing those objectives to get to the states that give them many more options.
Do you understand the main theorems in that paper and for what environments they are applicable...
Rather than letting super-intelligent AI take control of humans' destiny, by merging with the machines humans can directly shape their own fate.
Since humans connected to machines are still “human”, anything they do definitionally satisfies human values.
We are already connected to machines (via keyboards and monitors). The question is how a higher bandwidth interface will help in mitigating risks from huge, opaque neural networks.
Suppose that each subnetwork does general reasoning and thus up until some point during training the subnetworks are useful for minimizing loss.
[EDIT: sorry, I need to think through this some more.]
I wouldn't use the myopic vs. long-term framing here. Suppose a model is trained to play chess via RL, and there are no inner alignment problems. The trained model corresponds to a non-myopic agent (a chess game can last for many time steps). But the environment that the agent "cares" about is an abstract environment that corresponds to a simple chess game. (It's an environment with an astronomically large but finite number of states.) The agent doesn't care about our world. Even if some potential activation values in the network correspond to hacking the computer that runs the model a...
If the model that is used as a Microscope AI does not use any optimization (search), how will it compute the probability that, say, Apple's engineers will overcome a certain technical challenge?
Agents that don't care about influencing our world don't care about influencing the future weights of the network.
(Haven't read the OP thoroughly so sorry if not relevant; just wanted to mention...)
If any part of the network at any point during training corresponds to an agent that "cares" about an environment that includes our world then that part can "take over" the rest of the network via gradient hacking.
Should we take this seriously? I'm guessing no, because if this were true someone at OpenAI or DeepMind would have encountered it also and the safety people would have investigated and discovered it and then everyone in the safety community would be freaking out right now.
(This reply isn't specifically about Karpathy's hypothesis...)
I'm skeptical about the general reasoning here. I don't see how we can be confident that OpenAI/DeepMind will encounter a given problem first. Also, it's not obvious to me that the safety people at OpenAI/DeepMind will be notified about a concerning observation that the capabilities-focused team can explain to themselves with a non-concerning hypothesis.
What I can do is point to my history of acting in ways that, I hope, show my consistent commitment to doing what is best for the longterm future (even if of course some people with different models of what is “best for the longterm future” will have legitimate disagreements with my choices of past actions), and pledge to remain in control of Conjecture and shape its goals and actions appropriately.
Sorry, do you mean that you are actually pledging to "remain in control of Conjecture"? Can some other founder(s) make that pledge too if it's necessary for m...
Your website says: "WE ARE AN ARTIFICIAL GENERAL INTELLIGENCE COMPANY DEDICATED TO MAKING AGI SAFE", and also "we are committed to avoiding dangerous AI race dynamics".
How are you planning to avoid exacerbating race dynamics, given that you're creating a new 'AGI company'? How will you prove to other AI companies—that do pursue AGI—that you're not competing with them?
Do you believe that most of the AI safety community approves of the creation of this new company? In what ways (if any) have you consulted with the community before starting the company?
To address the opening quote - the copy on our website is overzealous, and we will be changing it shortly. We are an AGI company in the sense that we take AGI seriously, but it is not our goal to accelerate progress towards it. Thanks for highlighting that.
We don’t have a concrete proposal for how to reliably signal that we’re committed to avoiding AGI race dynamics beyond the obvious right now. There is unfortunately no obvious or easy mechanism that we are aware of to accomplish this, but we are certainly open to discussion with any interested parties ab...
Who, in practice, pulls the EA-world fire alarm? Is it Holden Karnofsky?
FYI, him having that responsibility would seemingly entail a conflict of interest; he said in an interview:
Anthropic is a new AI lab, and I am excited about it, but I have to temper that or not mislead people because Daniela, my wife, is the president of Anthropic. And that means that we have equity, and so [...] I’m as conflict-of-interest-y as I can be with this organization.
The founders also retain complete control of the company.
Can you say more about that? Will shareholders not be able to sue the company if it acts against their financial interests? If Conjecture will one day become a public company, is it likely that there will always be a controlling interest in the hands of few individuals?
[...] to train and study state-of-the-art models without pushing the capabilities frontier.
Do you plan to somehow reliably signal to AI companies—that do pursue AGI—that you are not competing with them? (In order to not exacerbate race dynamics).
I'm late to the party by a month, but I'm interested in your take (especially Rohin's) on the following:
Conditional on an existential catastrophe happening due to AI systems, what is your credence that the catastrophe will occur only after the involved systems are deployed?
Simple metrics, like number of views, or number of likes, are easy for companies to optimise for. Whereas figuring out how to optimise for what people really want is a trickier problem. So it’s not surprising if companies haven’t figured it out yet.
It's also not surprising for a different reason: The financial interests of the shareholders can be very misaligned with what the users "really want". (Which can cause the company to make the product more addictive, serve targeted ads that exploit users' vulnerabilities, etc.).
PSA for Edge browser users: if you care about privacy, make sure Microsoft does not silently enable syncing of browsing history etc. (Settings->Privacy, search and services).
They seemingly did so to me a few days ago (probably along with the Windows "Feature update" 20H2); it may be something that they currently do to some users and not others.
BTW that foldable design makes the respirator fit in a pocket, which can be a big plus.
This is one of those "surprise! now that you've read this, things might be different" posts.
The surprise factor may be appealing from the perspective of a writer, but I'm in favor of having a norm against it (e.g. setting an expectation for authors to add a relevant preceding content note to such posts).
I think the important factors w.r.t. risks re [morally relevant disvalue that occurs during inference in ML models] are probably more like:
Being polite to GPT-n is probably not directly helpful (though it can be helpful by causing humans to care more about this topic). A user can ...