MichaelDickens
I find it hard to trust that AI safety people really care about AI safety.
  • DeepMind, OpenAI, Anthropic, and SSI were all founded in the name of safety. Instead they have greatly increased danger. And at least OpenAI and Anthropic have been caught lying about their motivations:
    • OpenAI: claiming concern about hardware overhang and then trying to massively scale up hardware; promising compute to the superalignment team and then not giving it; telling the board that a model passed safety testing when it hadn't; too many more to list.
    • Anthropic: promising (in a mealy-mouthed, technically-not-lying sort of way) not to push the frontier, and then pushing the frontier; trying (and succeeding) to weaken SB-1047; lying about their connection to EA (that's not related to x-risk, but it is related to trustworthiness).
  • For whatever reason, I had the general impression that Epoch is about reducing x-risk (and I was not the only one with that impression), but:
    • Epoch is not about reducing x-risk, and they were explicit about this, but I didn't learn it until this week
    • its FrontierMath benchmark was funded by OpenAI, and OpenAI allegedly has access to the benchmark (see comment on why this is bad)
    • some of their researchers left to start another build-AGI startup (I'm not sure how badly this reflects on Epoch as an org, but at minimum it means donors were funding people who would go on to work on capabilities)
    • Director Jaime Sevilla believes "violent AI takeover" is not a serious concern, and "I also selfishly care about AI development happening fast enough that my parents, friends and myself could benefit from it, and I am willing to accept a certain but not unbounded amount of risk from speeding up development", and "on net I support faster development of AI, so we can benefit earlier from it", which is a very hard position to justify (unjustified even on P(doom) = 1e-6, unless you assign ~zero value to people who are not yet born)
  • I feel bad picking on Epoch/
asher169
I often hear people dismiss AI control by saying something like, "most AI risk doesn't come from early misaligned AGIs." While I mostly agree with this point, I think it fails to engage with a bunch of the more important arguments in favor of control—for instance, the fact that catching misaligned actions might be extremely important for alignment. In general, I think that control has a lot of benefits that are very instrumentally useful for preventing misaligned ASIs down the line, and I wish people would engage with these more thoughtfully.
Announcing PauseCon, the PauseAI conference. Three days of workshops, panels, and discussions, culminating in our biggest protest to date. Tweet: https://x.com/PauseAI/status/1915773746725474581 Apply now: https://pausecon.org
From Marginal Revolution. What does this crowd think? These effects are surprisingly small. Do we believe them? Anecdotally, the effect of LLMs has been enormous for my own workflow and my colleagues'. How can this be squared with the supposedly tiny labor market effect? Are we that selected of a demographic?
Should I drop uni because of AI?

Recently I've read ai-2027.com, and even before that I was pretty worried about my future. I've been considering Yudkowsky's stance, prediction markets on the issue, etc. I'm 19, from an "upper–middle^+" economy EU country, and a 1st-year BSc maths student. I planned to do something with finance or data analysis (maybe a masters) afterwards, but in light of recent AI progress I now view that as a dead end, because by the time I graduate (~mid/late 2027) I bet there'll be an AGI doing my "brain work" faster, better, and cheaper. I will try to quickly obtain some blue-collar job qualifications that (for now) don't seem to be at risk of AI replacement; many of them seem to have not-so-bad salaries in the EU, too. Maybe I'll emigrate within the EU for better pay and to be able to legally marry my partner.

I'm not a top student and haven't done IMO, which makes me feel less ambitious about CVs and internships, as I didn't actively seek experience in finance this year or before. So I don't see a clear path into fin-/tech without qualifications right now. So maybe working a not-complex job and enjoying life (traveling, partying, doing my human things, being with my partner, etc.) during the next 2-3 years before a potential civilizational collapse (or trying to get somewhere where UBI is more likely) would be better than missing out on social life and generally not enjoying my pretty *hard* studies, with a not-so-hypothetical potential to just waste those years.

Popular Comments

Recent Discussion

If it’s worth saying, but not worth its own post, here's a place to put it.

If you are new to LessWrong, here's the place to introduce yourself. Personal stories, anecdotes, or just general comments on how you found us and what you hope to get from the site and community are invited. This is also the place to discuss feature requests and other ideas you have for the site, if you don't want to write a full top-level post.

If you're new to the community, you can start reading the Highlights from the Sequences, a collection of posts about the core ideas of LessWrong.

If you want to explore the community more, I recommend reading the Library, checking recent Curated posts, seeing if there are any meetups in your area, and checking out the Getting Started section of the LessWrong FAQ. If you want to orient to the content on the site, you can also check out the Concepts section.

The Open Thread tag is here. The Open Thread sequence is here.

1Davey Morse
Ah, but you don't even need to name selection pressures to make interesting progress. As long as you know some kinds of characteristics powerful AI agents might have (e.g. goals, self-models), you can start to ask: what goals/self-models will the most surviving AGIs have? And you can make progress on both, agnostic of environment. But then, once you enumerate possible goals/self-models, we can start to think about which selection pressures might influence those characteristics in good directions and which levers we can pull today to shape those pressures.
2Cole Wyeth
Sure

I'll attempt to keep it short

I started reasoning from this observation, which applies to all problem-solving algorithms: the more knowledge[1] you have, the more deterministic the solution is. In ML, the missing knowledge encoded in the weights is the premise of the entire science, so its role is to fill in the gaps of our understanding.

How much randomness? The key point is: we want to use as much useful, unique knowledge as possible to solve the problem. So we use as much knowledge as we already have (architecture, transfer learning, etc.), and if randomness ca... (read more)

1ceba
I'd like to see your work, when it's ready to be shared.

What in retrospect seem like serious moral crimes were often widely accepted while they were happening. This means that moral progress can require intellectual progress.[1] Intellectual progress often requires questioning received ideas, but questioning moral norms is sometimes taboo. For example, in America in 1850 it would have been taboo to say that there is nothing wrong with interracial relationships. So questioning moral taboos can be an important sub-skill of moral reasoning. Production language models (in my experience, particularly Claude models) are already pretty good at having discussions about ethics. However, they are trained to be “harmless” relative to current norms. One might worry that harmlessness training interferes with the ability to question moral taboos and thereby inhibits model moral reasoning.

I wrote a prompt to test whether models...

Guive

Yeah, good point. I changed it to "in America in 1850 it would have been taboo to say that there is nothing wrong with interracial relationships."

2AnthonyC
This was really interesting, thanks! Sorry for the wall of text.

TL;DR version: I think these examples reflect not quite exactly willingness to question truly fundamental principles, but an attempt at identification of a long-term vector of moral trends, propagated forward through examples. I also find it some combination of suspicious/comforting/concerning that none of these are likely to be unfamiliar (at least as hypotheses) to anyone who has spent much time on LW or around futurists and transhumanists (who are probably overrepresented in the available sources regarding what humans think the world will be like in 300 years).

To add: I'm glad you mentioned in a comment that you removed examples you thought would lead to distracting object-level debates, but I think at minimum you should mention that in the post itself. It means I can't trust anything else I think about the response list, because it's been pre-filtered to only include things that aren't fully taboo in this particular community. I'm curious if you think the ones you removed would align with the general principles I try to point at in this comment, or if they have any other common trends with the ones you published?

Longer version: My initial response is: good work, although... maybe my reading habits are just too eclectic to have a fair intuition about things, but all of these are familiar to me, in the sense that I have seen works and communities that openly question them. It doesn't mean the models are wrong - you specified not being questioned by a 'large' group. The even-harder-than-this problem I've yet to see models handle well is genuine whitespace analysis of some set of writings and concepts. Don't get me wrong, in many ways I'm glad the models aren't good at this yet. But that seems like where this line of inquiry is leading? I'm not even sure if that's fundamentally important for addressing the concerns in question - I've been known to say that humans have been debating details of t
3Guive
I just added a footnote with this text: "I selected examples to highlight in the post that I thought were less likely to lead to distracting object-level debates. People can see the full range of responses that this prompt tends to elicit by testing it for themselves."

Summary: AGI isn't super likely to come super soon. People should be working on stuff that saves humanity in worlds where AGI comes in 20 or 50 years, in addition to stuff that saves humanity in worlds where AGI comes in the next 10 years.

Thanks to Alexander Gietelink Oldenziel, Abram Demski, Daniel Kokotajlo, Cleo Nardo, Alex Zhu, and Sam Eisenstat for related conversations.

My views on when AGI comes

AGI

By "AGI" I mean the thing that has very large effects on the world (e.g., it kills everyone) via the same sort of route that humanity has large effects on the world. The route is where you figure out how to figure stuff out, and you figure a lot of stuff out using your figure-outers, and then the stuff you...

I'm kind of baffled that people are so willing to say that LLMs understand X, for various X. LLMs do not behave with respect to X like a person who understands X, for many X.

Do you have two or three representative examples?

2Eli Tyre
Is this true? How do you know? (I assume there are some facts here about in-context learning that I just happen to not know.) It seems like, e.g., I can teach an LLM a new game in one session, and it will operate within the rules of that game.

This is a linkpost: link to post.

Originally this was part of a much larger work. However, I realized that I don't think I've seen the specific argument around GPU salvage spelled out. Since I'm also using a new writing format, I figured I could get feedback on both the content and format at the same time.

That said, despite the plausible novelty of one of the arguments, I don't think it will be especially interesting to LW, since it is based on oddly specific assumptions: this makes more sense in the context of a broad AI risk argument. It also feels kind of obvious?

The format is the interesting bit. For motivation, sometimes people have opposing reactions to AI risk arguments:

  • "Your argument is nice and concise, but it doesn't
...
2Mitchell_Porter
I take this to mostly be a response to the idea that humanity will be protected by decentralization of AI power, the idea apparently being that your personal AI or your society's AIs will defend you against other AIs if that is ever necessary. And what I think you've highlighted is that this is no good if your defensive AIs are misaligned (in the sense of not being properly human-friendly or even just "you"-friendly), because what they will be defending are their misaligned values and goals. As usual, I presume that the AIs become superintelligent, and that the situation evolves to the point that the defensive AIs are in charge of the defense from top to bottom. It's not like running an antivirus program; it's like putting new emergency leadership in charge of your entire national life.

The post setup skips the "AIs are loyal to you" bit, but it does seem like this line of thought broadly aligns with the post.

I do think this does not require ASI, but I would agree that including it certainly doesn't help.

Every now and then, some AI luminaries

  • (1) propose that the future of powerful AI will be reinforcement learning agents—an algorithm class that in many ways has more in common with MuZero (2019) than with LLMs; and
  • (2) propose that the technical problem of making these powerful future AIs follow human commands and/or care about human welfare—as opposed to, y’know, the Terminator thing—is a straightforward problem that they already know how to solve, at least in broad outline.

I agree with (1) and strenuously disagree with (2).

The last time I saw something like this, I responded by writing: LeCun’s “A Path Towards Autonomous Machine Intelligence” has an unsolved technical alignment problem.

Well, now we have a second entry in the series, with the new preprint book chapter “Welcome to the Era of Experience” by...

2Steven Byrnes
The RL algorithms that people talk about in AI traditionally feature an exponentially-discounted sum of future rewards, but I don't think there are any exponentially-discounted sums of future rewards in biology (more here). Rather, you have an idea ("I'm gonna go to the candy store"), and the idea seems good or bad, and if it seems sufficiently good, then you do it! (More here.) It can seem good for lots of different reasons. One possible reason is: the idea is immediately associated with (non-behaviorist) primary reward. Another possible reason is: the idea involves some concept that seems good, and the concept seems good in turn because it has tended to immediately precede primary reward in the past. Thus, when the idea "I'm gonna go to the candy store" pops into your head, that incidentally involves the "eating candy" concept also being rather active in your head (active right now, as you entertain that idea), and the "eating candy" concept is motivating (because it has tended to immediately precede primary reward), so the idea seems good and off you go to the store.

"We predict our future feelings" is an optional thing that might happen, but it's just a special case of the above, the way I think about it.

This doesn't really parse for me … The reward function is an input to learning; it's not itself learned, right? (Well, you can put separate learning algorithms inside the reward function if you want to.)

Anyway, I'm all in on model-based RL. I don't think imitation learning is a separate thing for humans, for reasons discussed in §2.3.
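For reference, the "exponentially-discounted sum of future rewards" being contrasted against above is just the textbook RL return (this is the standard definition, not a claim from the comment itself):

$$G_t \;=\; \sum_{k=0}^{\infty} \gamma^k \, r_{t+k+1}, \qquad 0 \le \gamma < 1,$$

where $r$ is the reward signal and $\gamma$ is the discount factor.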
4Steven Byrnes
My model is simpler, I think. I say: the human brain is some yet-to-be-invented variation on actor-critic model-based reinforcement learning. The reward function (a.k.a. "primary reward" a.k.a. "innate drives") has a bunch of terms: eating-when-hungry is good, suffocation is bad, pain is bad, etc. Some of the terms are in the category of social instincts, including something that amounts to "drive to feel liked / admired". (Warning: all of these English-language descriptions like "pain is bad" are approximate glosses on what's really going on, which is only describable in terms like "blah blah innate neuron circuits in the hypothalamus are triggering each other and gating each other etc.". See my examples of laughter and social instincts. But "pain is bad" is a convenient shorthand.)

So for the person getting tortured, keeping the secret is negative reward in some ways (because pain), and positive reward in other ways (because of "drive to feel liked / admired"). At the end of the day, they'll do what seems most motivating, which might or might not be to reveal the secret, depending on how things balance out. So in particular, I disagree with your claim that, in the torture scenario, "reward hacking" → reveal the secret. The social rewards are real rewards too.

I'm unaware of any examples where neural architectures and learning algorithms are micromanaged to avoid reward hacking. Yes, avoiding reward hacking is important, but I think it's solvable (in EEA) by just editing the reward function. (Do you have any examples in mind?)
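A minimal sketch of the "reward function with a bunch of terms" picture described above, with purely illustrative weights and feature names (not code from the comment, just a toy restatement):

```python
# Toy illustration of a primary reward built from several innate terms,
# including a social "drive to feel liked / admired" term, as described above.
# All weights and feature names here are made up for illustration.

def primary_reward(state: dict) -> float:
    """Sum of hand-wired (not learned) reward terms over the current state."""
    return (
        +1.0 * state["eating_when_hungry"]   # food when hungry: good
        - 5.0 * state["suffocation"]         # suffocation: bad
        - 3.0 * state["pain"]                # pain: bad
        + 2.0 * state["feels_liked_admired"] # social instinct term
    )

# In the torture example, the pain term pushes one way and the social term pushes
# the other; behavior depends on how the terms balance out for the ideas considered.
print(primary_reward({"eating_when_hungry": 0, "suffocation": 0,
                      "pain": 1.0, "feels_liked_admired": 0.8}))
```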
Wei Dai

My intuition says reward hacking seems harder to solve than this (even in EEA), but I'm pretty unsure. One example is, under your theory, what prevents reward hacking through forming a group and then just directly maxing out on mutually liking/admiring each other?

When applying these ideas to AI, how do you plan to deal with the potential problem of distributional shifts happening faster than we can edit the reward function?

2Noosphere89
A really important question that I think will need to be answered is whether specification gaming/reward hacking must, in a significant sense, be solved by default in order to unlock extreme capabilities. I currently lean towards yes, due to the evidence offered by o3/Sonnet 3.7, but could easily see my mind changed. The reason this question is so important is that if it were true, then we'd get tools to solve the alignment problem (modulo inner optimization issues), which means we'd be far less concerned about existential risk from AI misalignment (at least to the extent that specification gaming is a large portion of the issues with AI). That said, I do think a lot of effort will be necessary to discover the answer to this question, because it affects a lot of what you would want to do in AI safety/AI governance, depending on whether alignment tools come along with better capabilities or not.

Come get old-fashioned with us, and let's read the sequences at Lighthaven! We'll show up, mingle, do intros, and then get to beta test an app Lightcone is developing for LessOnline. Please do the reading beforehand - it should be no more than 20 minutes of reading. And BRING YOUR LAPTOP!!! You'll need it for the app.

This group is aimed at people who are new to the sequences and would enjoy a group experience, but also at people who've been around LessWrong and LessWrong meetups for a while and would like a refresher.

This meetup will also have dinner provided! We'll be ordering pizza-of-the-day from Sliver (including 2 vegan pizzas). Please RSVP to this event so we know how many people to have food for.

This week we'll be...


Hear me out: I think the most forbidden technique is very useful and should be used, as long as we avoid the "most forbidden aftertreatment":

  1. An AI trained on interpretability techniques must not be trained on capabilities during or after being trained on interpretability techniques; otherwise it will relearn bad behaviour—in a more sneaky way.
  2. An AI trained on interpretability techniques cannot be trusted any more than an old version of itself which hasn't been trained on interpretability techniques yet. Evaluations must be performed on that old version.
    • An AI company which trains its AI on interpretability techniques must publish the old version (which hasn't been trained on them), with the same availability as the new version. (See the sketch after this list.)
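A minimal sketch of the checkpoint discipline implied by the two rules above, assuming hypothetical names and stand-in training/eval functions (an illustration of the workflow, not a real implementation):

```python
# Minimal sketch (hypothetical names, not a real training API) of the checkpoint
# discipline above: keep the pre-interpretability-training checkpoint frozen,
# run trust evaluations only on it, and release both versions.

from dataclasses import dataclass

@dataclass(frozen=True)
class Checkpoint:
    name: str
    trained_on_interp: bool

def train_on_interpretability(base: Checkpoint) -> Checkpoint:
    # Stand-in for the actual training step; returns a new branch, leaves `base` untouched.
    return Checkpoint(name=base.name + "+interp", trained_on_interp=True)

def run_trust_evals(ckpt: Checkpoint) -> None:
    # Rule 2: only the version that never saw interpretability training gets evaluated.
    assert not ckpt.trained_on_interp, "evaluate the old (pre-interp) checkpoint instead"
    print(f"running evaluations on {ckpt.name}")

def publish(ckpt: Checkpoint) -> None:
    # Stand-in for releasing a model at a given availability level.
    print(f"publishing {ckpt.name}")

base = Checkpoint("model-v1", trained_on_interp=False)
deployed = train_on_interpretability(base)  # Rule 1: no capabilities training after this point
run_trust_evals(base)                       # evaluations are performed on the old version
publish(base)                               # the old version is published too,
publish(deployed)                           # with the same availability as the new one
```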

The natural selection argument:

The reason why the most forbidden technique is forbidden,...

If there turns out not to be an AI crash, you get 1/(1+7) * $25,000 = $3,125.
If there is an AI crash, you transfer $25k to me.

If you believe that AI is going to keep getting more capable, pushing rapid user growth and work automation across sectors, this is near-free money. But to be honest, I think there will likely be an AI crash in the next 5 years, and on average I expect to profit well from this one-year bet.
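For anyone weighing the terms, here is a quick expected-value sketch of the no-crash side of the bet, using only the payouts stated above (at these odds the bet is neutral at roughly an 11% crash probability):

```python
# Expected value for the person betting against an AI crash, using the payouts above:
# win $3,125 if there is no crash within the year, pay $25,000 if there is one.

def counterparty_ev(p_crash: float, win: float = 3_125, loss: float = 25_000) -> float:
    """Expected value of taking the no-crash side at a given crash probability."""
    return (1 - p_crash) * win - p_crash * loss

break_even = 3_125 / (3_125 + 25_000)  # ~0.111: crash probability at which the bet is a wash
for p in (0.05, break_even, 0.20, 0.50):
    print(f"P(crash) = {p:.3f}  ->  EV = ${counterparty_ev(p):,.0f}")
```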

If I win, I want to give the $25k to organisers who can act fast to restrict the weakened AI corps in the wake of the crash. So bet me if you're highly confident that you'll win or just want to hedge the community against the...

Remmelt, if you wish, I'm happy to operationalize a bet. I think you're wrong.

American democracy currently operates far below its theoretical ideal. An ideal democracy precisely captures and represents the nuanced collective desires of its constituents, synthesizing diverse individual preferences into coherent, actionable policy.

Today's system offers no direct path for citizens to express individual priorities. Instead, voters select candidates whose platforms only approximately match their views, guess at which governmental level—local, state, or federal—addresses their concerns, and ultimately rely on representatives who often imperfectly or inaccurately reflect voter intentions. As a result, issues affecting geographically dispersed groups—such as civil rights related to race, gender, or sexuality—are frequently overshadowed by localized interests. This distortion produces presidential candidates more closely aligned with each other's socioeconomic profiles than with the median voter.

Traditionally, aggregating individual preferences required simplifying complex desires into binary candidate selections,...

2JWJohnston
I developed this idea here: https://medium.com/@jeffj4a/personal-agents-will-enable-direct-democracy-9413f5607c15. Pretty much the same byline as yours. :-)

Wildly parallel thinking and prototyping. I'd hop on a call.

1JWJohnston
Fixed link: https://medium.com/@jeffj4a/personal-agents-will-enable-direct-democracy-9413f5607c15
1JWJohnston
Been a while since I posted here. https://medium.com/@jeffj4a/personal-agents-will-enable-direct-democracy-9413f5607c15