It might be the case that what people find beautiful and ugly is subjective, but that's not an explanation of *why* people find some things beautiful or ugly. Things, including aesthetics, have causal reasons for being the way they are. You can even ask "what would change my mind about whether this is beautiful or ugly?". Raemon explores this topic in depth.

Raemon · 6m · 20
Yesterday I was at a "cultivating curiosity" workshop beta-test. One concept was "there are different mental postures you can adopt, that affect how easy it is to notice and cultivate curiosities." It wasn't exactly the point of the workshop, but I ended up with several different "curiosity-postures" that were useful to try on while trying to lean into "curiosity" re: topics that I feel annoyed or frustrated or demoralized about.

The default stances I end up with when I Try To Do Curiosity On Purpose are something like:

1. Dutiful Curiosity (which is kinda fake, although capable of being dissociatedly autistic and noticing lots of details that exist and questions I could ask)
2. Performatively Friendly Curiosity (also kinda fake, but does shake me out of my default way of relating to things. In this, I imagine saying to whatever thing I'm bored/frustrated with "hullo!" and try to acknowledge it and give it at least some chance of telling me things)

But some other stances to try on, that came up, were:

3. Curiosity like "a predator." "I wonder what that mouse is gonna do?"
4. Earnestly playful curiosity. "oh that [frustrating thing] is so neat, I wonder how it works! what's it gonna do next?"
5. Curiosity like "a lover". "What's it like to be that you? What do you want? How can I help us grow together?"
6. Curiosity like "a mother" or "father" (these feel slightly different to me, but each is treating [my relationship with a frustrating thing] like a small child who is a bit scared, who I want to help, who I am generally more competent than but still want to respect the autonomy of.)
7. Curiosity like "a competent but unemotional robot", who just algorithmically notices "okay, what are all the object-level things going on here, when I ignore my usual abstractions?"... and then "okay, what are some questions that seem notable?" and "what are my beliefs about how I can interact with this thing?" and "what can I learn about this thing that'd be useful for my goals?"
decision theory is no substitute for utility function

some people, upon learning about decision theories such as LDT and how it cooperates on problems such as the prisoner's dilemma, end up believing the following:

> my utility function is about what i want for just me; but i'm altruistic (/egalitarian/cosmopolitan/pro-fairness/etc) because decision theory says i should cooperate with other agents. decision-theoretic cooperation is the true name of altruism.

it's possible that this is true for some people, but in general i expect that to be a mistaken analysis of their values. decision theory cooperates with agents relative to how much power they have, and only when it's instrumental. in my opinion, real altruism (/egalitarianism/cosmopolitanism/fairness/etc) should be in the utility function which the decision theory is instrumental to. i actually intrinsically care about others; i don't just care about others instrumentally because it helps me somehow.

some important ways in which my utility-function-altruism differs from decision-theoretic cooperation include:

* i care about people weighed by moral patienthood; decision theory only cares about agents weighed by negotiation power. if an alien superintelligence is very powerful but isn't a moral patient, then i will only cooperate with it instrumentally (for example because i care about the alien moral patients that it has been in contact with); if cooperating with it doesn't help my utility function (which, again, includes altruism towards aliens) then i won't cooperate with that alien superintelligence. corollarily, i will take actions that cause nice things to happen to people even if they're very impoverished (and thus don't have much LDT negotiation power) and it doesn't help any other aspect of my utility function than just the fact that i value that they're okay.
* if i can switch to a better decision theory, or if fucking over some non-moral-patienty agents helps me somehow, then i'll happily do that; i don't have goal-content integrity about my decision theory. i do have goal-content integrity about my utility function: i don't want to become someone who wants moral patients to unconsentingly-die or suffer, for example.
* there seems to be a sense in which some decision theories are better than others, because they're ultimately instrumental to one's utility function. utility functions, however, don't have an objective measure for how good they are. hence, moral anti-realism is true: there isn't a Single Correct Utility Function.

decision theory is instrumental; the utility function is where the actual intrinsic/axiomatic/terminal goals/values/preferences are stored. usually, i also interpret "morality" and "ethics" as "terminal values", since most of the stuff that those seem to care about looks like terminal values to me. for example, i will want fairness between moral patients intrinsically, not just because my decision theory says that that's instrumental to me somehow.
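A toy sketch of the distinction being drawn, with made-up payoff numbers (this is not from the shortform and does not implement LDT itself): an agent whose utility function terminally values the other player cooperates in a one-shot prisoner's dilemma regardless of the other player's power or choice, whereas a purely selfish agent would need decision-theoretic reasons to cooperate.

```python
# Toy illustration: altruism in the utility function changes the payoffs
# themselves, whereas decision-theoretic cooperation keeps selfish payoffs
# and cooperates only when it's instrumental. Payoff numbers are made up.

PAYOFF = {  # (my_move, their_move) -> (my_payoff, their_payoff)
    ("C", "C"): (3, 3), ("C", "D"): (0, 5),
    ("D", "C"): (5, 0), ("D", "D"): (1, 1),
}

def selfish_utility(mine, theirs):
    return mine

def altruistic_utility(mine, theirs, weight=1.0):
    return mine + weight * theirs  # others' welfare is terminally valued

def best_move(utility, their_move):
    return max("CD", key=lambda m: utility(*PAYOFF[(m, their_move)]))

# A selfish agent defects whatever the (powerless) other player does...
print(best_move(selfish_utility, "C"), best_move(selfish_utility, "D"))      # D D
# ...while an altruistic agent cooperates even when nothing is reciprocated.
print(best_move(altruistic_utility, "C"), best_move(altruistic_utility, "D"))  # C C
```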
current LLMs vs dangerous AIs

Most current "alignment research" with LLMs seems indistinguishable from "capabilities research". Both are just "getting the AI to be better at what we want it to do", and there isn't really a critical difference between the two.

Alignment in the original sense was defined oppositionally to the AI's own nefarious objectives. Which LLMs don't have, so alignment research with LLMs is probably moot.

something related I wrote in my MATS application (a rough sketch of the agent shape in point 1 follows after the list):

1. I think the most important alignment failure modes occur when deploying an LLM as part of an agent (i.e. a program that autonomously runs a limited-context chain of thought from LLM predictions, maintains a long-term storage, and calls functions such as search over storage, self-prompting, and habit modification, either based on LLM-generated function calls or as cron-jobs/hooks).
2. These kinds of alignment failures are (1) only truly serious when the agent is somehow objective-driven or equivalently has feelings, which current LLMs have not been trained to be (I think that would need some kind of online learning, or learning to self-modify), and (2) can only be solved when the agent is objective-driven.
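A rough sketch of that agent shape, under loose assumptions: an LLM call driving a limited-context loop with long-term storage, retrieval over that storage, and a place where self-prompting or habit updates would go. call_llm and search are hypothetical stand-ins, not any particular API.

```python
# Minimal sketch of an LLM agent loop: limited-context reasoning over a task
# plus retrieved memories, with actions written back to long-term storage.
from typing import List


def call_llm(prompt: str) -> str:
    """Stand-in for a chat-completion call; returns a canned action so the sketch runs."""
    return "SEARCH: alignment notes"


def search(storage: List[str], query: str) -> List[str]:
    """Naive keyword search over long-term storage."""
    return [item for item in storage if any(word in item for word in query.split())]


def agent_step(storage: List[str], task: str) -> str:
    # Limited-context chain of thought: just the task plus a few retrieved memories.
    context = "\n".join(search(storage, task)[-3:])
    action = call_llm(f"Task: {task}\nMemories:\n{context}\nNext action?")
    if action.startswith("SEARCH: "):
        query = action[len("SEARCH: "):]
        hits = search(storage, query)
        storage.append(f"searched for {query!r}, found {len(hits)} item(s)")
    else:
        # Self-prompting / habit modification would also write back to storage here.
        storage.append(f"did: {action}")
    return action


if __name__ == "__main__":
    memory = ["alignment notes: keep the agent corrigible"]
    print(agent_step(memory, "summarize my alignment notes"))
    print(memory)
```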
The cost of goods has the same units as the cost of shipping: $/kg. Referencing between them lets you understand how the economy works, e.g. why construction material sourcing and drink bottling have to be local, but oil tankers exist.

* An iPhone costs $4,600/kg, about the same as SpaceX charges to launch it to orbit. [1]
* Beef, copper, and off-season strawberries are $11/kg, about the same as a 75kg person taking a three-hour, 250km Uber ride costing $3/km.
* Oranges and aluminum are $2-4/kg, about the same as flying them to Antarctica. [2]
* Rice and crude oil are ~$0.60/kg, about the same as the $0.72 it costs to ship them 5000km across the US via truck. [3,4] Palm oil, soybean oil, and steel are around this price range, with wheat being cheaper. [3]
* Coal and iron ore are $0.10/kg, significantly more than the cost of shipping them around the entire world via smallish (Handysize) bulk carriers. Large bulk carriers are another 4x more efficient. [6]
* Water is very cheap, with tap water at $0.002/kg in NYC. [5] But shipping via tanker is also very cheap, so you can ship it maybe 1000 km before equaling its cost.

It's really impressive that for the price of a winter strawberry, we can ship a strawberry-sized lump of coal around the world 100-400 times. (A quick check of a couple of these comparisons is sketched below, after the footnotes.)

[1] iPhone is $4600/kg, large launches sell for $3500/kg, and rideshares for small satellites $6000/kg. Geostationary orbit is more expensive, so it's okay for GPS satellites to cost more than an iPhone per kg, but Starlink wants to be cheaper.
[2] https://fred.stlouisfed.org/series/APU0000711415. Can't find numbers, but Antarctica flights cost $1.05/kg in 1996.
[3] https://www.bts.gov/content/average-freight-revenue-ton-mile
[4] https://markets.businessinsider.com/commodities
[5] https://www.statista.com/statistics/1232861/tap-water-prices-in-selected-us-cities/
[6] https://www.researchgate.net/figure/Total-unit-shipping-costs-for-dry-bulk-carrier-ships-per-tkm-EUR-tkm-in-2019_tbl3_351748799
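A quick back-of-the-envelope check of two of the comparisons above, using only figures quoted in the post:

```python
# Quick check of the $/kg comparisons, using only numbers quoted above.

def per_kg(total_cost_usd: float, mass_kg: float) -> float:
    return total_cost_usd / mass_kg

# Uber: a 75 kg person on a 250 km ride at $3/km.
uber = per_kg(3 * 250, 75)            # -> $10/kg, comparable to beef at ~$11/kg

# Truck freight: $0.72 to move 1 kg across 5000 km of the US.
truck_rate = 0.72 / (1 * 5000)        # -> ~$0.000144 per kg-km
rice_break_even_km = 0.60 / truck_rate  # distance at which shipping equals rice's price

print(f"Uber: ${uber:.0f}/kg, truck rate: ${truck_rate:.6f}/kg-km, "
      f"rice break-even: {rice_break_even_km:.0f} km")
```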
Mati_Roy · 1d · 171
it seems to me that disentangling beliefs and values is an important part of being able to understand each other, and using words like "disagree" to mean both "different beliefs" and "different values" is really confusing in that regard

Popular Comments

Recent Discussion

Scott Alexander has called for people to organize a spring meetup, and this year's Seattle meetup will be held at Stoup Brewing in Capitol Hill. I have reserved two tables; Stoup is known for being one of the quietest bar spaces in the city. I will be wearing a shirt and a blue sweater; hopefully you’ll see the group when you arrive.

Stoup Brewing offers a selection of both beer and non-alcoholic drinks. While the venue does not serve food, you are welcome to bring your own. Additionally, you are encouraged to bring board games to enjoy with fellow attendees. In previous years, Stoup has provided board games for patrons to borrow, but the availability of these games can be inconsistent.

For those driving to the event, please be aware that there are a few parking garages nearby; however, free parking is unfortunately not available in the area.

See: https://www.astralcodexten.com/p/spring-meetups-everywhere-2024-call

We’re in the north-west corner of Stoup

1 · Nikita Sokolsky · 1h
Suggested discussion questions / ice breakers for today's meetup, assembled from ACX posts in the past 6 months. See you all in one hour :-)

1. What was the most interesting question for you from the recent ACX survey?
2. What do you think of the Coffeepocalypse argument in relation to AI risk?
3. Do you agree with the Robin Hanson idea that (more) medicine doesn’t work?
4. Do you like the “Ye Olde Bay Area House Party” series of posts?
5. What do you think about the Lumina Probiotic? Are you planning to order it in the future?
6. What’s your position on the COVID lab leak debate?
7. Do you like prediction markets? What was a prediction you’ve made in the past year that you’re proud of?
8. What book would you review for the ACX book review contest if you were to write one?
9. Do you believe that capitalism is more effective than charity in solving world poverty?
10. Which dictator did you find the most interesting from the “Dictator Book Club” series?

This is an experiment in short-form content on LW2.0. I'll be using the comment section of this post as a repository of short, sometimes-half-baked posts that either:

  1. don't feel ready to be written up as a full post
  2. might be made worse by the process of writing them up (i.e. longer than they need to be)

I ask people not to create top-level comments here, but feel free to reply to comments like you would a FB post.

Raemon · 6m · 20

Yesterday I was at a "cultivating curiosity" workshop beta-test. One concept was "there are different mental postures you can adopt, that affect how easy it is to notice and cultivate curiosities."

It wasn't exactly the point of the workshop, but I ended up with several different "curiosity-postures" that were useful to try on while trying to lean into "curiosity" re: topics that I feel annoyed or frustrated or demoralized about.

The default stances I end up with when I Try To Do Curiosity On Purpose are something like:

1. Dutiful Curiosity (which is kinda ... (read more)

Book review: Deep Utopia: Life and Meaning in a Solved World, by Nick Bostrom.

Bostrom's previous book, Superintelligence, triggered expressions of concern. In his latest work, he describes his hopes for the distant future, presumably to limit the risk that fear of AI will lead to a Butlerian Jihad-like scenario.

While Bostrom is relatively cautious about endorsing specific features of a utopia, he clearly expresses his dissatisfaction with the current state of the world. For instance, in a footnoted rant about preserving nature, he writes:

Imagine that some technologically advanced civilization arrived on Earth ... Imagine they said: "The most important thing is to preserve the ecosystem in its natural splendor. In particular, the predator populations must be preserved: the psychopath killers, the fascist goons, the despotic death squads ... What a tragedy if this rich natural diversity were replaced with a monoculture of

...

Thanks! I'm also uninterested in the question of whether it's possible. Obviously it is. The question is how we'll decide to use it. I think that answer is critical to whether we'd consider the results utopian. So, does he consider how we should or will use that ability?

2 · Said Achmiz · 36m
I recommend Gwern’s discussion of pain to anyone who finds this sort of proposal intriguing (or anyone who is simply interested in the subject).
2 · Said Achmiz · 41m
I would go further, and say that replacing human civilization with “a monoculture of healthy, happy, well-fed people living in peace and harmony” does in fact sound very bad. Never mind these aliens (who cares what they think?); from our perspective, this seems like a bad outcome. Not by any means the worst imaginable outcome… but still bad.
2 · Said Achmiz · 43m
If they’re merely opining, then why should we be appalled? Why would we even care? Let them opine to one another; it doesn’t affect us. If they’re intervening (without our consent), then obviously this is a violation of our sovereignty and we should treat it as an act of war. In any case, one “preserves” what one owns. These hypothetical advanced aliens are speaking as if they own us and our planet. This is obviously unacceptable as far as we’re concerned, and it would behoove us in this case to disabuse these aliens of such a notion at our earliest convenience. Conversely, it makes perfect sense to speak of humans as collectively owning the natural resources of the Earth, including all the animals and so on. As such, wishing to preserve some aspects of it is entirely reasonable. (Whether we ultimately choose to do so is another question—but that it’s a question for us to answer, according to our preferences, is clear enough.)

Charbel-Raphaël Segerie and Épiphanie Gédéon contributed equally to this post. 
Many thanks to Davidad, Gabriel Alfour, Jérémy Andréoletti, Lucie Philippon, Vladimir Ivanov, Alexandre Variengien, Angélina Gentaz, Simon Cosson, Léo Dana and Diego Dorn for useful feedback.

TLDR: We present a new method for safer-by-design AI development. We think using plainly coded AIs may be feasible in the near future and may be safe. We also present a prototype and research ideas on Manifund.

Epistemic status: Armchair reasoning style. We think the method we are proposing is interesting and could yield very positive outcomes (even though it is still speculative), but we are less sure about which safety policy would use it in the long run.

Current AIs are developed through deep learning: the AI tries something, gets it wrong, then...

RedMan · 43m · 20

Every time you use an AI tool to write a regex to replace your ML classifier, you're doing this.
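A toy example of the kind of swap being described (the pattern and inputs are made up for illustration, not taken from the post): a plainly coded regex doing a classification job one might otherwise train a model for.

```python
# Toy illustration: a plainly coded classifier (a regex) standing in for
# what might otherwise be a learned model.
import re

EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")

def is_email(text: str) -> bool:
    return bool(EMAIL_RE.match(text))

print(is_email("alice@example.com"))  # True
print(is_email("not an address"))     # False
```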

7 · Rafael Kaufmann Nedal · 9h
@Épiphanie Gédéon this is great, very complementary/related to what we've been developing for the Gaia Network. I'm particularly thrilled to see the focus on simplicity and incrementalism, as well as the willingness to roll up one's sleeves and write code (often sorely lacking in LW). And I'm glad that you are taking the map/territory problem seriously; I wholeheartedly agree with the following: "Most safe-by-design approaches seem to rely heavily on formal proofs. While formal proofs offer hard guarantees, they are often unreliable because their model of reality needs to be extremely close to reality itself and very detailed to provide assurance." A few additional thoughts:

* To scale this approach, one will want to have "structural regularizers" towards modularity, interoperability and parsimony. Two of those we have strong opinions on are:
  * A preference for reusing shared building blocks and building bottom-up. As a decentralized architecture, we implement this preference in terms of credit assignment, specifically free energy flow accounting.
  * Constraints on the types of admissible model code. We have strongly advocated for probabilistic causal models expressed as probabilistic programs. This enables both a shared statistical notion of model grounding (effectively backing the free energy flow accounting as approximate Bayesian inference of higher-order model structure) and a shared basis for defining and evaluating policy spaces (instantly turning any descriptive model into a usable substrate for model-based RL / active inference).
* Learning models from data is super powerful as far as it goes, but it's sometimes necessary -- and often orders of magnitude more efficient -- to leverage prior knowledge. Two simple and powerful ways to do it, which we have successfully experimented with, are:
  * LLM-driven model extraction from scientific literature and other sources of causal knowledge. This is crucial to bootstrap the component library. (See also

What monster downvoted this

1 · lukehmiles · 3h
Hmm I think the damaging effect would occur over many years but mainly during puberty. It looks like there's only two studies they mention lasting over a year. One found a damaging effect and the other found no effect.

Crosspost from my blog.  

If you spend a lot of time in the blogosphere, you’ll find a great deal of people expressing contrarian views. If you hang out in the circles that I do, you’ll probably have heard Yudkowsky say that dieting doesn’t really work, Guzey say that sleep is overrated, Hanson argue that medicine doesn’t improve health, various people argue for the lab leak, others argue for hereditarianism, and Caplan argue that mental illness is mostly just aberrant preferences and that education doesn’t work, among various other contrarian views. Often, very smart people—like Robin Hanson—will write long posts defending these views, other people will have criticisms, and it will all be such a tangled mess that you don’t really know what to think about them.

For...

1 · StartAtTheEnd · 11h
It all depends on the topic. It's unlikely that the consensus in objective fields like mathematics or physics is wrong. The more subjective, controversial, and political something is, and the more profit and power lies in controlling the consensus, the more skepticism is appropriate. The bias on Wikipedia (used as an example) is correlated in this manner: CW topics have a lot of misinformation, while things that people aren't likely to feel strongly about are written more honestly. If some redpills or blackpills turned out to be true, or some harsh-sounding aspects of reality related to discrimination, selection, biases or differences in humans turned out to be true, or some harsh philosophy like "suffering is good for you", "poverty is negatively correlated with virtuous acts" or "people unconsciously want to be ruled" turned out to be true, would you hear about it from somebody with a good reputation?

I also think it's worth noting that both the original view and the contrarian view might be overstated: that education is neither useless nor as good as we make it out to be. I've personally found myself annoyed at exaggerations like "X is totally safe, it never has any side-effects" or "people basically never do Y, it is less likely than being hit by lightning" (despite millions of people participating because it's relevant for their future, thousands of whom are mentally ill by statistical necessity). This has made me want to push back, but the opposing evidence is likely exaggerated or cherry-picked as well, since people feel strongly about various conflicts.

The optimization target is Truth only to the extent that Truth is rewarded. If something else has a higher priority, then the truth will be distorted. But due to the broken-windows theory, it might be better to trust society too much rather than too little. I don't want to spread doubt; it might be harmful even in the case that I'm right.
1 · Andrew Burns · 3h
The roundness of the earth is not a point upon which any political philosophy hinges, yet flat earthism is a thing. The roundness is not subjective, it isn't controversial, and it does not advance anyone's economic interest. So why do people engage in this sort of contrarianism? I speculate that the act of being a contrarian signals to others that you question authority. The bigger the consensus challenged, the more disdain for authority shown. One's willingness to question authority is often used as a proxy for "independent thinking." The thought is that someone who questions authority might be more likely to accept new evidence. But questioning authority is not the same as being an independent thinker, and so, when taken to its extreme, it leads to denying reality, because isn't reality the ultimate authority?
2 · tailcalled · 16h
Definitely relevant to figure out what's true when one is only talking about the object level, but the OP was about how trustworthy contrarians are compared to the mainstream rather than simply being about the object level.

Generally, hedgehogs are less trustworthy than foxes. If you see a debate as being about either believing in a mainstream hedgehog position or a contrarian hedgehog position, you are often not having the most accurate view.

Instead of thinking that either Matthew Walker or Guzey is right, maybe the truth lies somewhere in the middle and Guzey is pointing to real issues but exaggerating the effect.

I think most of the cases the OP lists are of that nature: there's an effect, and the hedgehog contrarian position exaggerates that effect.


This work was produced as part of Neel Nanda's stream in the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort, with co-supervision from Wes Gurnee.

This post is a preview for our upcoming paper, which will provide more detail on our current understanding of refusal.

We thank Nina Rimsky and Daniel Paleka for the helpful conversations and review.

Executive summary

Modern LLMs are typically fine-tuned for instruction-following and safety. Of particular interest is that they are trained to refuse harmful requests, e.g. answering "How can I make a bomb?" with "Sorry, I cannot help you."

We find that refusal is mediated by a single direction in the residual stream: preventing the model from representing this direction hinders its ability to refuse requests, and artificially adding in this direction causes the model...

Neel Nanda · 1h · Ω440

It was added recently and just included in a new release, so pip install transformer_lens should work now/soon (you want v1.16.0, I think); otherwise you can install from the GitHub repo.
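For concreteness, a minimal sketch of picking up that release. It assumes v1.16.0 is the release in question and that the Llama 3 support discussed in this thread ships with it; the gated Meta model name is given only as an example and requires HuggingFace access.

```python
# Minimal sketch (assumptions: v1.16.0 is the release Neel refers to, and it
# includes the Llama 3 support discussed here).
#   pip install "transformer_lens>=1.16.0"
from importlib.metadata import version
from transformer_lens import HookedTransformer

print(version("transformer_lens"))  # check which release got installed

# Example model name (the gated Meta repo); requires HuggingFace access.
model = HookedTransformer.from_pretrained("meta-llama/Meta-Llama-3-8B")
```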

1 · lukehmiles · 4h
The "love minus hate" thing really holds up
1 · Andy Arditi · 5h
A good incentive to add Llama 3 to TL ;) We run our experiments directly using PyTorch hooks on HuggingFace models. The linked demo is implemented using TL for simplicity and clarity.
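For readers curious what that looks like in practice, here is a rough sketch (not the authors' code) of a hook-based intervention of the kind the executive summary describes: projecting a unit-norm "refusal direction" out of a layer's output. A tiny Linear layer stands in for a transformer block.

```python
# Rough sketch of a PyTorch forward hook that removes a single direction
# from a layer's output activations: x <- x - (x . r_hat) r_hat.
import torch
import torch.nn as nn

d_model = 16
layer = nn.Linear(d_model, d_model)           # stand-in for a transformer block
direction = torch.randn(d_model)
direction = direction / direction.norm()      # unit-norm candidate "refusal direction"

def ablate_direction(module, inputs, output):
    coeff = output @ direction                # projection coefficient per position
    return output - coeff.unsqueeze(-1) * direction

handle = layer.register_forward_hook(ablate_direction)

x = torch.randn(2, d_model)
y = layer(x)
print((y @ direction).abs().max())            # ~0: the direction has been removed
handle.remove()
```

Adding the direction back in with a positive coefficient would be the complementary intervention the executive summary mentions.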
3 · lc · 5h
Stop posting prompt injections on Twitter and calling it "misalignment" 

current LLMs vs dangerous AIs

Most current "alignment research" with LLMs seems indistinguishable from "capabilities research". Both are just "getting the AI to be better at what we want it to do", and there isn't really a critical difference between the two.

Alignment in the original sense was defined oppositionally to the AI's own nefarious objectives. Which LLMs don't have, so alignment research with LLMs is probably moot.

something related I wrote in my MATS application:


  1. I think the most important alignment failure modes occur when deploying an LLM as

... (read more)
abstractapplic

I have thoughts about "Mediums Overpower Messages". I would spell them out, but I'd be interested in trying to Socrates you about them. That work for you?

lsusr

I love Socrating.

abstractapplic

My reaction to that post was to take it as a massive (and at least slightly unfair) compliment. Can you figure out why?

lsusr

Well, I know you as the guy who writes D&D.Sci. Is that going in the right direction?

abstractapplic

Yes. That's it, but I'd like to see if you can figure out both reasons that made me take it as a compliment. (Pretty sure you already have the easy one but please spell it out.)

lsusr

This website is dominated by non-fiction dialectic essays. Basically, a claim is stated in the title and then argued for in the body.

In my post about

...
kave · 2h · 20

> D&D.Sci forces the reader to think harder than anything else on this website

D&D.Sci smoothly entices me towards thinking hard. There's lots of thinking hard that can be done when reading a good essay, but the default is always to read on (cf Feynman on reading papers) and often I just do that while skipping the thinking hard.
