Quick Takes

simeon_c12974

Idea: Daniel Kokotajlo probably lost quite a bit of money by not signing an OpenAI NDA before leaving, which I consider a public service at this point. Could some of the funders of the AI safety landscape give some money or social reward for this?

I guess reimbursing everything Daniel lost might be a bit too much for funders, but providing some money, both to reward the act and to incentivize future safety people not to sign NDAs, would have very high value. 

Showing 3 of 22 replies
10AlexMennen
I wonder if it might be more effective to fund legal action against OpenAI than to compensate individual ex-employees for refusing to sign an NDA. Trying to take vested equity away from ex-employees who refuse to sign an NDA sounds unlikely to hold up in court, and if we can establish a legal precedent that OpenAI cannot do this, that might make other ex-employees much more comfortable speaking out against OpenAI than the possibility that third parties might fundraise to partially compensate them for lost equity would be (a possibility you might not even be able to make every ex-employee aware of). The fact that this would avoid financially rewarding OpenAI for bad behavior is also a plus. Of course, legal action is expensive, but so is the value of the equity that former OpenAI employees have on the line.

Yeah, at the time I didn't know how shady some of the contracts here were. I do think funding a legal defense is a marginally better use of funds (though my guess is funding both is worth it).

1wassname
Thanks, I hadn't seen that, I find it convincing.

On an apparent missing mood - FOMO on all the vast amounts of automated AI safety R&D that could (almost already) be produced safely 

Automated AI safety R&D could result in vast amounts of work produced quickly. E.g. from Some thoughts on automating alignment research (under certain assumptions detailed in the post): 

each month of lead that the leader started out with would correspond to 15,000 human researchers working for 15 months.

Despite this promise, we seem not to have much knowledge of when such automated AI safety R&D might happ... (read more)
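To make the quoted figure concrete, here's a back-of-the-envelope sketch; the serial speedup and number of parallel copies are illustrative assumptions chosen to reproduce the quoted number, not parameters taken from the linked post:

```python
# Back-of-envelope sketch. The speedup and parallelism numbers below are
# illustrative assumptions, not numbers taken from the linked post.
serial_speedup = 15        # assume each automated researcher works 15x human speed
parallel_copies = 15_000   # assume 15,000 automated researchers run in parallel
lead_months = 1            # months of lead the leader starts with

equivalent_researcher_months = lead_months * serial_speedup * parallel_copies
print(f"{parallel_copies:,} researchers x {lead_months * serial_speedup} months "
      f"= {equivalent_researcher_months:,} researcher-months")
# -> 15,000 researchers x 15 months = 225,000 researcher-months
```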

Showing 3 of 5 replies
1Bogdan Ionut Cirstea
Seems like probably the modal scenario to me too, but even limited exceptions like the one you mention seem to me like they could be very important to deploy at scale ASAP, especially if they could be deployed using non-x-risky systems (e.g. like current ones, very bad at DC evals). This seems good w.r.t. automated AI safety potentially 'piggybacking', but bad for differential progress. Sure, though wouldn't this suggest at least focusing hard on (measuring / eliciting) what might not come at the same time? 
2ryan_greenblatt
Why think this is important to measure or that this already isn't happening? E.g., on the current model organism related project I'm working on, I automate inspecting reasoning traces in various ways. But I don't feel like there is any particularly interesting thing going on here which is important to track (e.g. this tip isn't more important than other tips for doing LLM research better).

Intuitively, I'm thinking of all this as something like a race between [capabilities enabling] safety and [capabilities enabling dangerous] capabilities (related: https://aligned.substack.com/i/139945470/targeting-ooms-superhuman-models); so from this perspective, maintaining as large a safety buffer as possible (especially if not x-risky) seems great. There could also be something like a natural endpoint to this 'race', corresponding to being able to automate all human-level AI safety R&D safely (and then using this to produce a scalable solution to a... (read more)

keltan134

Note to self, write a post about the novel akrasia solutions I thought up before becoming a rationalist.

  • Figuring out how to want to want to do things
  • Personalised advertising of Things I Wanted to Want to Do
  • What I do when all else fails
5trevor
Have you tried whiteboarding-related techniques? I think that suddenly starting to use written media (even journals), in an environment without much or any guidance, is like pressing too hard on the gas; you're gaining incredible power and going from zero to one on things faster than you ever have before.  Depending on their environment and what they're interested in starting out, some people might learn (or be shown) how to steer quickly, whereas others might accumulate/scaffold really lopsided optimization power and crash and burn (e.g. getting involved in tons of stuff at once that upon reflection was way too much for someone just starting out).
keltan10

This seems incredibly interesting to me. Googling “White-boarding techniques” only gives me results about digitally shared idea spaces. Is this what you’re referring to? I’d love to hear more on this topic.

4keltan
Maybe I could even write a sequence on this?

Unfortunately, it looks like non-disparagement clauses aren't unheard of in general releases:

http://www.shpclaw.com/Schwartz-Resources/severance-and-release-agreements-six-6-common-traps-and-a-rhetorical-question

Release Agreements commonly include a “non-disparagement” clause – in which the employee agrees not to disparage “the Company.”

https://joshmcguirelaw.com/civil-litigation/adventures-in-lazy-lawyering-the-broad-general-release

The release had a very broad definition of the company (including officers, directors, shareholders, etc.), but a fairly reas

... (read more)
  1. We inhabit this real material world, the one which we perceive all around us (and which somehow gives rise to perceptive and self-conscious beings like us).
  2. Though not all of our perceptions conform to a real material world. We may be fooled by things like illusions or hallucinations or dreams that mimic perceptions of this world but are actually all in our minds.
  3. Indeed if you examine your perceptions closely, you'll see that none of them actually give you representations of the material world, but merely reactions to it.
  4. In fact, since the only evidence we
... (read more)
Showing 3 of 10 replies
2Richard_Kennaway
Dragging files around in a GUI is a familiar action that does known things with known consequences. Somewhere on the hard disc (or SSD, or somewhere in the cloud, etc.) there is indeed a "file" which has indeed been "moved" into a "folder", and taking off those quotation marks only requires some background knowledge (which in fact I have) of the lower-level things that are going on and which the GUI presents to me through this visual metaphor. Some explanations work better than others.

The idea that there is stuff out there that gives rise to my perceptions, and which I can act on with predictable results, seems to me the obvious explanation that any other contender will have to do a great deal of work to topple from the plinth.

The various philosophical arguments over doctrines such as "idealism", "realism", and so on are more like a musical recreation (see my other comment) than anything to take seriously as a search for truth. They are hardly the sort of thing that can be right or wrong, and to the extent that they are, they are all wrong. Ok, that's my personal view of a lot of philosophy, but I'm not the only one.

It sounds like you want to say things like "coherence and persistent similarity of structure in perceptions demonstrates that perceptions are representations of things external to the perceptions themselves" or "the idea that there is stuff out there seems the obvious explanation" or "explanations that work better than others are the best alternatives in the search for truth" and yet you also want to say "pish, philosophy is rubbish; I don't need to defend an opinion about realism or idealism or any of that nonsense". In fact what you're doing isn't some alternative to philosophy, but a variety of it.

2Richard_Kennaway
A lot of philosophy is like that. Or perhaps it is better compared to music. Music sounds meaningful, but no-one has explained what it means. Even so, much philosophy sounds meaningful, consisting of grammatical sentences with a sense of coherence, but actually meaning nothing. This is why there is no progress in philosophy, any more than there is in music. New forms can be invented and other forms can go out of fashion, but the only development is the ever-greater sprawl of the forest.

Several dozen people now presumably have Lumina in their mouths. Can we not simply crowdsource some assays of their saliva? I would chip in money for this. Key questions concern ethanol levels, aldehyde levels, antibacterial levels, and whether the organism itself stays colonized at useful levels.

Showing 3 of 4 replies
Lorxus10

Surely so! Hit me up if you ever end up doing this - I'm likely getting the Lumina treatment in a couple months.

4sapphire
Lumina is incredibly cheap right now. I pre-ordered for 250 USD. Even genuinely quite poor people I know don't find the price off-putting (poor in the sense of absolutely poor for the country they live in). I have never met a single person who decided not to try Lumina because the price was high. If they pass, it's always because they think it's risky.
8kave
I think Romeo is thinking of checking a bunch of mediators of risk (like aldehyde levels) as well as of function (like whether the organism stays colonised).
Wei Dai502

AI labs are starting to build AIs with capabilities that are hard for humans to oversee, such as answering questions based on large contexts (1M+ tokens), but they are still not deploying "scalable oversight" techniques such as IDA and Debate. (Gemini 1.5 report says RLHF was used.) Is this more good news or bad news?

Good: Perhaps RLHF is still working well enough, meaning that the resulting AI is following human preferences even out of training distribution. In other words, they probably did RLHF on large contexts in narrow distributions, with human rater... (read more)
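For concreteness, here is a minimal sketch of what a Debate-style oversight loop looks like, assuming a hypothetical `query_model` helper standing in for whatever LLM API is available; this is not any lab's implementation:

```python
# Minimal sketch of a two-player debate protocol (in the spirit of "AI safety
# via debate"). `query_model` is a hypothetical stand-in for an LLM API call.

def query_model(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real API client."""
    raise NotImplementedError

def debate(question: str, rounds: int = 2) -> str:
    transcript = f"Question: {question}\n"
    for r in range(1, rounds + 1):
        for debater in ("A", "B"):
            argument = query_model(
                f"{transcript}\nDebater {debater}, argue for your answer and "
                f"point out flaws in the other debater's arguments (round {r})."
            )
            transcript += f"\nDebater {debater}, round {r}: {argument}"
    # A human (or a weaker, trusted model) acts as the judge over the transcript.
    return query_model(
        f"{transcript}\n\nAs the judge, say which debater argued more truthfully and why."
    )
```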

Showing 3 of 8 replies

Bad: AI developers haven't taken alignment seriously enough to have invested enough in scalable oversight, and/or those techniques are unworkable or too costly, causing them to be unavailable.

Turns out at least one scalable alignment team has been struggling for resources. From Jan Leike (formerly co-head of Superalignment at OpenAI):

Over the past few months my team has been sailing against the wind. Sometimes we were struggling for compute and it was getting harder and harder to get this crucial research done.

Even worse, apparently the whole Supera... (read more)

4ryan_greenblatt
I'm skeptical that increased scale makes hacking the reward model worse. Of course, it could (and likely will/does) make hacking human labelers more of a problem, but this isn't what the comment appears to be saying. Note that the reward model is of the same scale as the base model, so the relative scale should be the same. This also contradicts results from an earlier paper by Leo Gao. I think this paper is considerably more reliable than the comment overall, so I'm inclined to believe the paper or think that I'm misunderstanding the comment. Additionally, from first principles I think that RLHF sample efficiency should just increase with scale (at least with well tuned hyperparameters) and I think I've heard various things that confirm this.
2ryan_greenblatt
Oops, fixed.

If your endgame strategy involved relying on OpenAI, DeepMind, or Anthropic to implement your alignment solution that solves science / super-cooperation / nanotechnology, consider figuring out another endgame plan.

William_SΩ731669

I worked at OpenAI for three years, from 2021 to 2024, on the Alignment team, which eventually became the Superalignment team. I worked on scalable oversight, as part of the team developing critiques as a technique for using language models to spot mistakes in other language models. I then worked to refine an idea from Nick Cammarata into a method for using language models to generate explanations for features in language models. I was then promoted to managing a team of 4 people which worked on trying to understand language model features in context, leading to t... (read more)

Showing 3 of 34 replies
tlevin196

Kelsey Piper now reports: "I have seen the extremely restrictive off-boarding agreement that contains nondisclosure and non-disparagement provisions former OpenAI employees are subject to. It forbids them, for the rest of their lives, from criticizing their former employer. Even acknowledging that the NDA exists is a violation of it."

4PhilosophicalSoul
I have reviewed his post. Two (2) things to note:

(1) Invalidity of the NDA does not guarantee William will be compensated after the trial. Even if he is, his job prospects may be hurt long-term.

(2) States have different laws on whether the NLRA trumps internal company memorandums. More importantly, labour disputes are traditionally resolved through internal bargaining. Presumably, the collective bargaining 'hand-off' involving NDAs and gag orders at this level will waive subsequent litigation in district courts. The precedent Habryka offered refers to hostile severance agreements only, not the waiving of the dispute mechanism itself.

I honestly wish I could use this dialogue as a discreet communication to William on a way out, assuming he needs help, but I re-affirm my previous worries about the costs.

I also add here, rather cautiously, that there are solutions. However, it would depend on whether William was an independent contractor, how long he worked there, whether a trade secret was actually involved (as others have mentioned), and so on. The whole reason NDAs tend to be so effective is that they obfuscate the material needed even to know what remedies are available.
2wassname
Interesting! For most of us, this is outside our area of competence, so I appreciate your input.
niplav72

Just checked who from the authors of the Weak-To-Strong Generalization paper is still at OpenAI:

  • Collin Burns
  • Jan Hendrik Kirchner
  • Leo Gao
  • Bowen Baker
  • Yining Chen
  • Adrien Ecoffet
  • Manas Joglekar
  • Jeff Wu

Gone are:

  • Ilya Sutskever
  • Pavel Izmailov[1]
  • Jan Leike
  • Leopold Aschenbrenner

  1. Reason unknown ↩︎

quila94

(Personal) On writing and (not) speaking

I often struggle to find words and sentences that match what I intend to communicate.

Here are some problems this can cause:

  1. Wordings that are odd or unintuitive to the reader, but that are at least literally correct.[1]
  2. Not being able to express what I mean, and having to choose between not writing it or risking miscommunication by trying anyway. I tend to choose the former unless I'm writing to a close friend. Unfortunately this means I am unable to express some key insights to a general audience.
  3. Writing taking lots of
... (read more)
Showing 3 of 5 replies

Thank you, that is all very kind! ☺️☺️☺️

I expect if he continues being what he is, he'll produce lots of cool stuff which I'll learn from later.

I hope so haha

1quila
Record yourself typing?
2Emrik
quila42

At what point should I post content as top-level posts rather than shortforms?

For example, a recent writing I posted to shortform was ~250 concise words plus an image: 'Anthropics may support a 'non-agentic superintelligence' agenda'. It would be a top-level post on my blog if I had one set up (maybe soon :p).

Some general guidelines on this would be helpful.

4niplav
This is a good question, especially since there've been some short form posts recently that are high quality and would've made good top-level posts—after all, posts can be short.
Emrik10

Epic Lizka post is epic.

Also, I absolutely love the word "shard" but my brain refuses to use it because then it feels like we won't get credit for discovering these notions by ourselves. Well, also just because the words "domain", "context", "scope", "niche", "trigger", "preimage" (wrt a neural function/policy / "neureme") adequately serve the same purpose and are currently more semantically/semiotically granular in my head.

trigger/preimage ⊆ scope ⊆ domain

"niche" is a category in function space (including domain, operation, and codomain), "domain" is a set.

"scope" is great because of programming connotations and can be used as a verb. "This neural function is scoped to these contexts."

elifland5039

The word "overconfident" seems overloaded. Here are some things I think that people sometimes mean when they say someone is overconfident:

  1. They gave a binary probability that is too far from 50% (I believe this is the original one)
  2. They overestimated a binary probability (e.g. they said 20% when it should be 1%)
  3. Their estimate is arrogant (e.g. they say there's a 40% chance their startup fails when it should be 95%), or maybe they give an arrogant vibe
  4. They seem too unwilling to change their mind upon arguments (maybe their credal resilience is too high)
  5. They g
... (read more)

 Moore & Schatz (2017) made a similar point about different meanings of "overconfidence" in their paper The three faces of overconfidence. The abstract:

Overconfidence has been studied in 3 distinct ways. Overestimation is thinking that you are better than you are. Overplacement is the exaggerated belief that you are better than others. Overprecision is the excessive faith that you know the truth. These 3 forms of overconfidence manifest themselves under different conditions, have different causes, and have widely varying consequences. It is a mist

... (read more)
4Daniel Kokotajlo
I feel like this should be a top-level post.
3Garrett Baker
When I accuse someone of overconfidence, I usually mean they're being too hedgehogy when they should be being more foxy.

For anyone interested in Natural Abstractions type research: https://arxiv.org/abs/2405.07987

Claude summary:

Key points of "The Platonic Representation Hypothesis" paper:

  1. Neural networks trained on different objectives, architectures, and modalities are converging to similar representations of the world as they scale up in size and capabilities.

  2. This convergence is driven by the shared structure of the underlying reality generating the data, which acts as an attractor for the learned representations.

  3. Scaling up model size, data quantity, and task dive

... (read more)
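For a concrete handle on "converging representations", here is a sketch using linear CKA, a standard representation-similarity metric; it is not necessarily the metric the paper itself uses:

```python
# Sketch: quantify representational similarity between two models' embeddings of
# the same inputs with linear CKA (a standard metric; not necessarily the paper's).
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """X: (n, d1), Y: (n, d2) -- embeddings of the same n inputs."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    return (np.linalg.norm(X.T @ Y, "fro") ** 2
            / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")))

rng = np.random.default_rng(0)
inputs = rng.normal(size=(1000, 64))        # stand-in for shared underlying features
emb_a = inputs                              # "model A" representation
emb_b = inputs @ rng.normal(size=(64, 32))  # "model B": different basis and width
emb_c = rng.normal(size=(1000, 32))         # unrelated representation (baseline)

print(round(linear_cka(emb_a, emb_b), 3))   # clearly above the unrelated baseline
print(round(linear_cka(emb_a, emb_c), 3))   # near zero
```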

This sounds really intriguing. I would like someone who is familiar with natural abstraction research to comment on this paper.

Epistemic status: not a lawyer, but I've worked with a lot of them.

As I understand it, an NDA isn't enforceable against a subpoena (though the former employer can seek a protective order for the testimony).   Someone should really encourage law enforcement or Congress to subpoena the OpenAI resigners...

A subpoena for what?

Decomposability seems like a fundamental assumption for interpretability and a condition for it to succeed. E.g. from Toy Models of Superposition:

'Decomposability: Neural network activations which are decomposable can be decomposed into features, the meaning of which is not dependent on the value of other features. (This property is ultimately the most important – see the role of decomposition in defeating the curse of dimensionality.) [...]

The first two (decomposability and linearity) are properties we hypothesize to be widespread, while the latte... (read more)
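As a toy sketch of what "decomposable into features" buys you, assuming the feature dictionary is handed to us (real interpretability work has to learn it, e.g. with sparse autoencoders):

```python
# Toy sketch of decomposability: an activation vector is a sparse linear
# combination of feature directions, and sparse recovery identifies which
# features are active. Illustrative only; the dictionary is given here,
# whereas real work has to learn it from activations.
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(0)
n_features, d_model = 256, 64                 # more features than dimensions (superposition)
dictionary = rng.normal(size=(n_features, d_model))
dictionary /= np.linalg.norm(dictionary, axis=1, keepdims=True)

true_coeffs = np.zeros(n_features)
true_coeffs[[3, 17, 42]] = [1.0, 0.5, 2.0]    # only a few features active at once
activation = true_coeffs @ dictionary         # the observed activation, shape (d_model,)

omp = OrthogonalMatchingPursuit(n_nonzero_coefs=3, fit_intercept=False)
omp.fit(dictionary.T, activation)             # activation ≈ dictionary.T @ sparse_coeffs
print(np.flatnonzero(omp.coef_))              # expected: [ 3 17 42]
```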

1Bogdan Ionut Cirstea
Related, from The “no sandbagging on checkable tasks” hypothesis:

Quote from Shulman’s discussion of the experimental feedback loops involved in being able to check how well a proposed “neural lie detector” detects lies in models you’ve trained to lie: 

A quite early example of this is Collin Burns's work, doing unsupervised identification of some aspects of a neural network that are correlated with things being true or false. I think that is important work. It's a kind of obvious direction for the stuff to go. You can keep improving it when you have AIs that you're training to do their best to deceive humans or other

... (read more)
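For concreteness, a minimal sketch of a CCS-style probe in the spirit of Burns's unsupervised approach, run on placeholder hidden states rather than real model activations:

```python
# Minimal sketch of a CCS-style (Contrast-Consistent Search) probe: on hidden
# states for "statement is true" / "statement is false" contrast pairs, train a
# probe whose two outputs are consistent (sum to ~1) and confident.
# The hidden states below are random placeholders, not real model activations.
import torch

def ccs_loss(probe, h_pos, h_neg):
    p_pos = torch.sigmoid(probe(h_pos)).squeeze(-1)
    p_neg = torch.sigmoid(probe(h_neg)).squeeze(-1)
    consistency = (p_pos - (1 - p_neg)) ** 2       # p(true) + p(false) should be ~1
    confidence = torch.minimum(p_pos, p_neg) ** 2  # rule out the degenerate p=0.5 answer
    return (consistency + confidence).mean()

d_model = 128
h_pos = torch.randn(256, d_model)   # placeholder: activations on "X? Yes." prompts
h_neg = torch.randn(256, d_model)   # placeholder: activations on "X? No." prompts

probe = torch.nn.Linear(d_model, 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = ccs_loss(probe, h_pos, h_neg)
    loss.backward()
    opt.step()
print(float(loss))  # loss decreases; with real activations the probe tracks truth-like structure
```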

A list of some contrarian takes I have:

  • People are currently predictably too worried about misuse risks

  • What people really mean by "open source" vs "closed source" labs is actually "responsible" vs "irresponsible" labs, which is not affected by regulations targeting open source model deployment.

  • Neuroscience as an outer alignment[1] strategy is embarrassingly underrated.

  • Better information security at labs is not clearly a good thing, and if we're worried about great power conflict, probably a bad thing.

  • Much research on deception (Anthropic's re

... (read more)
Showing 3 of 10 replies

Ah yes, another contrarian opinion I have:

  • Big AGI corporations, like Anthropic, should by default make much of their AGI alignment research private, and not share it with competing labs. Why? So it can remain a private good, and on the off chance such research can be expected to be profitable, those labs & investors can be rewarded for that research.
4Olli Järviniemi
I talked about this with Garrett; I'm unpacking the above comment and summarizing our discussions here.

  • Sleeper Agents is very much in the "learned heuristics" category, given that we are explicitly training the behavior in the model. Corollary: the underlying mechanisms for sleeper-agents behavior and instrumentally convergent deception are presumably wildly different(!), so it's not obvious how valid an inference one can make from the results.
  • Consider framing Sleeper Agents as training a trojan instead of as an example of deception. See also Dan Hendrycks' comment.
  • Much of existing work on deception suffers from "you told the model to be deceptive, and now it deceives, of course that happens".
  • (Garrett thought that the Uncovering Deceptive Tendencies paper has much less of this issue, so yay.)
  • There is very little work on actual instrumentally convergent deception(!); a lot of work falls into the "learned heuristics" category or the failure in the previous bullet point.
  • People are prone to conflate "shallow, trained deception" (e.g. sycophancy: "you rewarded the model for leaning into the user's political biases, of course it will start leaning into users' political biases") with instrumentally convergent deception.
  • (For more on this, see also my writings here and here. My writings fail to discuss the most shallow versions of deception, however.)

Also, we talked a bit about this, and I interpreted Garrett as saying that people often consider too few and too shallow hypotheses for their observations, and are loose with verifying whether their hypotheses are correct.

Example 1: I think the Uncovering Deceptive Tendencies paper has some of this failure mode. E.g. in experiment A we considered four hypotheses to explain our observations, and these hypotheses are quite shallow/broad (e.g. "deception" includes both very shallow deception and instrumentally convergent deception).

Example 2: People generally seem to have an opinion of "chain-of-t
2Garrett Baker
I will clarify this. I think people often do causal interventions in their CoTs, but not in ways that are very convincing to me.

I thought Superalignment was a positive bet by OpenAI, and I was happy when they committed to putting 20% of their current compute (at the time) towards it. I stopped thinking about that kind of approach because OAI already had competent people working on it. Several of them are now gone.

It seems increasingly likely that the entire effort will dissolve. If so, OAI has now made the business decision to invest its capital in keeping its moat in the AGI race rather than in basic safety science. This is bad and likely another early sign of what's to come.

I think ... (read more)

kromem10

It's going to have to.

Ilya is brilliant and seems to really see the horizon of the tech, but maybe isn't the best at the business side to see how to sell it.

But this is often the curse of the ethically pragmatic. There is such a focus on the ethics part by the participants that the business side of things only sees that conversation and misses the rather extreme pragmatism.

As an example, would superaligned CEOs in the oil industry fifty years ago have still only kept their eye on quarterly share prices or considered long term costs of their choices? There'... (read more)

3Bogdan Ionut Cirstea
Strongly agree; I've been thinking for a while that something like a public-private partnership involving at least the US government and the top US AI labs might be a better way to go about this. Unfortunately, recent events seem in line with it not being ideal to only rely on labs for AI safety research, and the potential scalability of automating it should make it even more promising for government involvement. [Strongly] oversimplified, the labs could provide a lot of the in-house expertise, the government could provide the incentives, public legitimacy (related: I think of a solution to aligning superintelligence as a public good) and significant financial resources.

My timelines are lengthening. 

I've long been a skeptic of scaling LLMs to AGI*. I fundamentally don't understand how this is even possible. It must be said that very smart people give this view credence: davidad, dmurfet. On the other side are Vanessa Kosoy and Steven Byrnes. When pushed, proponents don't actually defend the position that a large enough transformer will create nanotech or even obsolete their job. They usually mumble something about scaffolding.

I won't get into this debate here but I do want to note that my timelines have lengthe... (read more)

Showing 3 of 21 replies
5Nathan Helm-Burger
My view is that there are huge algorithmic gains in peak capability, training efficiency (less data, less compute), and inference efficiency waiting to be discovered, and available to be found by a large number of parallel research hours invested by a minimally competent multimodal-LLM-powered research team. So it's not that scaling leads to ASI directly, it's:

1. Scaling leads to brute-forcing the LLM agent across the threshold of AI research usefulness.
2. Using these LLM agents in a large research project can lead to rapidly finding better ML algorithms and architectures.
3. Training these newly discovered architectures at large scales leads to much more competent automated researchers.
4. This process repeats quickly over a few months or years.
5. This process results in AGI.
6. AGI, if instructed (or allowed, if it's agentically motivated on its own to do so) to improve itself, will find even better architectures and algorithms.
7. This process can repeat until ASI. The resulting intelligence / capability / inference speed goes far beyond that of humans.

Note that this process isn't inevitable; there are many points along the way where humans can (and should, in my opinion) intervene. We aren't disempowered until near the end of this.
4Alexander Gietelink Oldenziel
Why do you think there are these low-hanging algorithmic improvements?

My answer to that is currently in the form of a detailed 2 hour lecture with a bibliography that has dozens of academic papers in it, which I only present to people that I'm quite confident aren't going to spread the details. It's a hard thing to discuss in detail without sharing capabilities thoughts. If I don't give details or cite sources, then... it's just, like, my opinion, man. So my unsupported opinion is all I have to offer publicly. If you'd like to bet on it, I'm open to showing my confidence in my opinion by betting that the world turns out how I expect it to.

Yesterday Greg Sadler and I met with the President of the Australian Association of Voice Actors. Like us, they've been lobbying for more and better AI regulation from government. I was surprised how much overlap we had in concerns and potential solutions:
1. Transparency and explainability of AI model data use (concern)

2. Importance of interpretability (solution)

3. Mis/dis information from deepfakes (concern)

4. Lack of liability for the creators of AI if any harms eventuate (concern + solution)

5. Unemployment without safety nets for Australians (concern)

6.... (read more)
