All of sudo's Comments + Replies

sudo10
  1. We need not provide the strong model with access to the benchmark questions.
  2. Depending on the benchmark, it can be difficult or impossible to encode all the correct responses in a short string.
sudo10

My reply to both your comment and @Chris_Leong 's is that you should simply use robust benchmarks on which high performance is interesting.

In the adversarial attack context, the attacker's objectives are not generally beyond the model's "capabilities."

3gwern
I don't think that's possible, because an attacker (LLM) can program a victim LLM to emit arbitrary text, so with enough attacks, you can solve any benchmark within the attacker's capabilities (thereby defeating the safety point entirely, because now it's just a very expensive way to use an unsafe model), or otherwise brute-force the benchmark by inferring the hidden answers and then creating the adversarial example which elicits them (like p-hacking: just keep trying things until you get below the magic threshold). See backdoors, triggers, dataset distillation... "A benchmark" is no more of a barrier than "flipping a specific image's class".
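For concreteness, the loop described here can be sketched as black-box random search against a scoring oracle. Everything below is a hypothetical stand-in (stubbed victim model, stubbed grader, made-up threshold), not a reference to any real attack code:

```python
import random
import string

def query_victim(prompt: str) -> str:
    """Stub for the weak/victim LLM; swap in a real model call."""
    return ""

def benchmark_score(answers: list[str]) -> float:
    """Stub for the hidden benchmark grader; returns accuracy in [0, 1]."""
    return 0.0

questions = ["Q1", "Q2", "Q3"]  # placeholder benchmark questions

best_suffix, best_score = "", -1.0
for _ in range(10_000):
    # Mutate the current best suffix with a few random characters.
    candidate = best_suffix + "".join(random.choices(string.ascii_letters, k=3))
    answers = [query_victim(q + "\n" + candidate) for q in questions]
    score = benchmark_score(answers)
    if score > best_score:   # keep anything that moves the hidden metric
        best_suffix, best_score = candidate, score
    if best_score >= 0.95:   # stop once past the "magic threshold"
        break
```

The point of the sketch is that nothing in the loop requires the suffix to be meaningful to humans; it only has to move the grader's score.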
sudo190

A long time ago, I worked on an idea I called "Context Gems." A friend asked me to make a short, publicly accessible writeup for it. This is a really rough writeup of the idea, based on some old notes I had lying around.

Context Gems was an attempt at coming up with a theoretical way of safely eliciting superhuman capabilities from an AI. It was a fairly direct offshoot of e.g. OSNR. I later heard it was similar to some ideas Paul Christiano had a long time ago, like imitative generalization/learning the prior.

The goal is to get effective cognitive labor ou... (read more)

8gwern
Why is this not just a description of an adversarial attack loop on the weak AI model, and would not just produce the usual short adversarial strings of gibberish (for LLMs) or handful of pixel perturbations (for vision or VLMs), which are generally completely useless to humans and contain no useful information?
2Chris_Leong
If the strong AI has knowledge of the benchmarks (or can make correct guesses about how these were structured), then it might be able to find heuristics that work well on them, but not more generally. Some of these heuristics might seem more likely than not to humans. Still seems like a useful technique if the more powerful model isn't much more powerful.
sudo10

Thanks for the feedback! 

I agree that it is possible to learn quickly without mentorship. However, I believe that for most programmers, the first "real" programming job is a source of tremendous growth. Why not have that earlier, and save more of one's youth?

sudo71

Conventional advice directed at young people seems shockingly bad. I sat down to generate a list of anti-advice.

The anti-advice consists of things I wish I had been told in high school, but that are essentially negations of conventional advice.

You may not agree with the advice given here. In fact, these items are deliberately controversial. They may also not be good advice. YMMV.

  • When picking between colleges, do care a lot about getting into a prestigious/selective university. Your future employers often care too.
  • Care significantly less about nebulous “college fit.” Wheth
... (read more)
2metachirality
What do you mean by arbitrage?
2Viliam
The first two points... I wonder what is the relation between "prestigious university" and "quality of your peers". Seems like it should be positively correlated, but maybe there is some caveat about the quality not being one-dimensional, like maybe rich people go to university X, but technically skilled people to university Y.

The third point: I'd say be aware of the distinction between the things you care about, and the things you have to do for bureaucratic reasons. There may or may not be an overlap between the former and the school lessons.

The fourth and seventh points are basically: some people give bad advice; and for anything you could possibly do, someone will find a rationalization why that specific thing is important (if everything else fails, they can say it makes you more "well-rounded"). But "skills that develop value" does not say how to choose, e.g., between a smaller value now and a greater value in the future.

The fifth point: depends on what kind of job/mentor you get. It could be much better or much worse than school, and it may be difficult to see the difference; there are many overconfident people giving wrong advice in the industry, too.

The sixth point: clearly, getting fired is not an optimal outcome; if you do not need to complete the school, what are you even doing there?
3winstonBosan
There is some good stuff here! And I think it is accurate that some of these are controversial. But it also seems like a strange mix of good and “reverse-stupidity is not necessarily intelligence” ideas. Directionally good but odd framing: it seems like great advice to offer to people that going straight for the goal (“software programming”) is a good way to approach a seemingly difficult problem. But one does not necessarily need to be mentored - this is only one of many ways. In fact, many programmers started and expanded their curiosity from typing something like ‘man systemctl’ into their shell.
sudo20

Thanks for the post!

The problem was that I wasn’t really suited for mechanistic interpretability research.

Sorry if I'm prodding too deep, and feel no need to respond. I always feel a bit curious about claims such as this.

I guess I have two questions (which you don't need to answer):

  1. Do you have a hypothesis about the underlying reason for you being unsuited for this type of research? E.g. do you think you might be insufficiently interested/motivated, have insufficient conscientiousness or intelligence, etc.
  2. How confident are you that you just "aren't suited"
... (read more)
4Jay Bailey
Concrete feedback signals I've received:
  • I don't find myself excited about the work. I've never been properly nerd-sniped by a mechanistic interpretability problem, and I find the day-to-day work to be more drudgery than exciting, even though the overall goal of the field seems like a good one.
  • When left to do largely independent work, after doing the obvious first thing or two ("obvious" at the level of "These techniques are in Neel's demos"), I find it hard to figure out what to do next, and hard to motivate myself to do more things if I do think of them, because of the above drudgery.
  • I find myself having difficulty backchaining from the larger goal to the smaller one. I think this is a combination of a motivational issue and having less grasp on the concepts.
By contrast, in evaluations, none of this is true. I am able to solve problems more effectively, I find myself actively interested in problems (the ones I'm working on and ones I'm not), and I find myself more able to solve problems and reason about how they matter for the bigger picture. I'm not sure how much of each is a contributor, but I suspect that if I was sufficiently excited about the day-to-day work, all the other problems would be much more fixable. There's a sense of reluctance, a sense of burden, that saps a lot of energy when it comes to doing this kind of work.
As for #2, I guess I should clarify what I mean, since there are two ways you could view "not suited":
  1. I will never be able to become good enough at this for my funding to be net-positive. There are fundamental limitations to my ability to succeed in this field.
  2. I should not be in this field. The amount of resources required to make me competitive in this field is significantly larger than for other people who would do equally good work, and this is not true for other subfields in alignment.
I view my use of "I'm not suited" more like 2 than 1. I think there's a reasonable chance that, given enough time with proper ef
sudo10

Hi, do you have links to the papers/evidence?

sudo75

Strong upvoted.

I think we should be wary of anchoring too hard on compelling stories/narratives.

However, as far as stories go, this vignette scores very highly for me. Will be coming back for a re-read.

7Gabe M
Thanks! +1 on not over-anchoring--while this feels like a compelling 1-year timeline story to me, 1-year timelines don't feel the most likely.
sudo10

but a market with a probability of 17% implies that 83% of people disagree with you

Is this a typo?

1habryka
Oops, yep, fixed.
Answer by sudo11

Wyzant

sudo190

What can be used to auth will be used to auth

One of the symptoms of our society's deep security inadequacy is the widespread usage of insecure forms of authentication.

It's bad enough that there are systems which authenticate you using your birthday, SSN, or mother's maiden name by spec.

Fooling bad authentication is also an incredibly common vector for social engineering. 

Anything you might have, which others seem unlikely to have (but which you may not immediately see a reason to keep secret), could be accepted by someone you implicitly trust as "auth... (read more)

sudoΩ2120

As the worst instance of this, the best way to understand a lot of AIS research in 2022 was “hang out at lunch in Constellation”. 

Is this no longer the case? If so, what changed?

LawrenceCΩ5120

I think this has gotten both worse and better in several ways.

It's gotten better in that ARC and Redwood (and to a lesser extent, Anthropic and OpenAI) have put out significantly more of their research. FAR Labs also exists and is doing some of the research proliferation that would've gone on inside of Constellation. 

It's worse in that there's been some amount of deliberate effort to build more of an AIS community in Constellation, e.g. with explicit Alignment Days where people are encouraged to present work-in-progress and additional fellowships and workshops. 

On net I think it's gotten better, mainly because there's just been a lot more content put out in 2023 (per unit research) than in 2022. 

sudo10

This is a reasonable point, but I have a cached belief that frozen food is substantially less healthy than non-frozen food somehow.

5nim
Updating your beliefs about the relative health impacts of frozen vs fast food seems like a low-effort, high-impact opportunity for improvement here. There are a lot of distinct questions or comparisons that you may be casually conflating in reasoning about frozen food:
  • Nutrition of a fresh ingredient vs the same ingredient commercially frozen. Frozen often wins here because fresh food in grocery stores has to be harvested long before it's ripe. Food harvested when it's ripe then frozen can travel fine, but food harvested when it's ripe then only refrigerated tends to degrade before it gets to the consumer.
  • Nutrition/quality of a given ingredient frozen at home vs commercially frozen. Home freezers will freeze items more slowly, which changes how ice crystals form and can sometimes degrade the quality of the item worse than extremely rapid commercial freezing. Plus if you buy a fresh ingredient at the store, it was picked way before it was ready, and freezing it at home isn't going to magically have left it on the plant for longer.
  • Nutrition of a fresh meal cooked from scratch vs a frozen pre-made meal. Fresh, conscientious cooking will add less salt and fat than any processed food. It may also not taste quite as delicious ;)
  • Nutrition of a fresh meal cooked from scratch vs a serving of that same meal which was frozen at home and reheated. For many meals, home freezing is mildly detrimental to the food's texture, and home cooks probably won't test enough variables on the freezing process to really dial in the optimal technique.
  • Nutrition of a fresh meal cooked from scratch vs fast food. Fresh, conscientious cooking will add less salt and fat than restaurants, but may also be less delicious. Fresh cooking will also be more variable about ingredient quality -- ingredients might be much better or might be worse, depending on the cook and the pantry.
Make sure your intuitions on those fronts are consistent with each other and with available researc
sudo10

This is very interesting. My guess is that this would take a lot of time to set up, but if you have eg. recommended catering providers in SFBA, I'd be very interested!

sudo30

Completely fair request. I think I was a bit vague when I said "on top or around AI systems."

The point here is that I want to find techniques that seem to positively influence model behavior, which I can "staple on" to existing models without a gargantuan engineering effort.

I am especially excited about these ideas if they seem scalable or architecture-agnostic.

Here are a few examples of the kind of research I'm excited about:

  • Conciseness priors on outputs (as a way to shift cognitive labor to humans; see the sketch after this list) 
    • I think there is a reasonable story for how concise
... (read more)
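One very literal reading of a conciseness prior is length-penalized reranking of candidate outputs. A minimal sketch under that assumption (the candidate texts, log-probabilities, and penalty strength below are made up for illustration, not taken from the original comment):

```python
# Hypothetical candidates: (output_text, total_logprob) pairs from some model.
candidates = [
    ("Detailed multi-step plan with lots of extra machinery ...", -42.0),
    ("Short plan: do X, then Y.", -18.5),
]

LAMBDA = 0.5  # strength of the conciseness prior (a free parameter)

def score(text: str, logprob: float, lam: float = LAMBDA) -> float:
    """Log-probability plus a log-prior that penalizes length.

    Length is approximated here by whitespace word count; a real
    implementation would count tokens with the model's tokenizer.
    """
    return logprob - lam * len(text.split())

best_text, _ = max(candidates, key=lambda c: score(*c))
print(best_text)  # the shorter plan wins once the length penalty is applied
```

The penalty shifts work onto the human reader: shorter outputs carry less machinery for the model to hide things in, and more of the elaboration has to be done on the human side.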
sudo10

Recently my alignment orientation has basically been “design stackable tools on top or around AI systems which produce pressure toward increased alignment.”
 

I think it’s a pretty productive avenue that’s 1) harder to get lost in, and 2) might be eventually sufficient for alignment.

3Zac Hatfield-Dodds
I'm not sure what this would look like - can you give some concrete examples?
sudo20

neuron has

I was confused by the singular "neuron." 

I think the point here is that if there are some neurons which have low activation but high direct logit attribution after layernorm, then this is pretty good evidence for "smuggling."

Is my understanding here basically correct?

3Neel Nanda
Yes that's correct
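For concreteness, the check discussed above (low activation but large direct logit attribution) can be sketched roughly as follows. The tensors are random stand-ins rather than weights from a real model, the thresholds are arbitrary, and a real analysis would also fold in the final layernorm scaling:

```python
import torch

torch.manual_seed(0)
d_model = 512

# Random stand-ins; in a real analysis these come from the trained model:
act = torch.tensor(0.03)        # the neuron's (low) post-activation value
w_out = torch.randn(d_model)    # the neuron's output direction (a row of W_out, [d_mlp, d_model] convention)
w_U_tok = torch.randn(d_model)  # unembedding column for the token of interest

# Direct logit attribution of this single neuron to that token's logit
# (ignoring final layernorm for simplicity):
dla = act * (w_out @ w_U_tok)

# The "smuggling" signature: activation is small, but the direct
# contribution to the logit is large.
suspicious = act.abs().item() < 0.1 and dla.abs().item() > 1.0
print(dla.item(), suspicious)
```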
sudo30

Shallow comment:

How are you envisioning the prevention of strategic takeovers? It seems plausible that robustly preventing strategic takeovers would also require substantial strategizing/actualizing.

3TsviBT
Are you echoing this point from the post? It might be possible for us humans to prevent strategicness, though this seems difficult because even detecting strategicness is maybe very difficult. E.g. because thinking about X also sneakily thinks about Y: https://tsvibt.blogspot.com/2023/03/the-fraught-voyage-of-aligned-novelty.html#inexplicitness My mainline approach is to have controlled strategicness, ideally corrigible (in the sense of: the mind thinks that [the way it determines the future] is probably partially defective in an unknown way).
sudo10

The first point isn’t super central. FWIW I do expect that humans will occasionally not swap words back.

Humans should just look at the noised plan and try to convert it into a more reasonable-seeming, executable plan.

Edit: that is, without intentionally changing details.

sudo10

Fair enough! For what it’s worth, I think the reconstruction is probably the more load-bearing part of the proposal.

sudo10

Is your claim that the noise-borne asymmetric pressure away from treacherous plans disappears in above-human intelligences? I could see it becoming less material as intelligence increases, but the intuition should still hold in principle.

2Shmi
I am not confidently claiming anything, not really an expert... But yeah, I guess I like the way you phrased it. The more disparity there is in intelligence, the less extra noise matters. I do not have a good model of it though. Just feels like more and more disparate dangerous paths appear in this case, overwhelming the noise.
sudo*21

"Most paths lead to bad outcomes" is not quite right. For most (let's say human developed, but not a crux) plan specification languages, most syntactically valid plans in that language would not substantially mutate the world state when executed.

I'll begin by noting that over the course of writing this post, the brittleness of treacherous plans became significantly less central.

However, I'm still reasonably convinced that the intuition is sound. If a plan is adversarial to humans, the plan's executor will face adverse optimization pressure from humans and ... (read more)

2Shmi
I can see that working when the entity is at the human level of intelligence or less. Maybe I misunderstand the setup, and this is indeed the case. I can't imagine that it would work on a superintelligence...
sudo10

i.e. that the problem is easily enough addressed that it can be done by firms in the interests of making a good product and/or based on even a modest amount of concern from their employees and leadership

I'm curious about how contingent this prediction is on (1) timelines and (2) the rate of alignment research progress. On (2), how much of your P(no takeover) comes from expectations about future research output from ARC specifically?

If tomorrow, all alignment researchers stopped working on alignment (and went to become professional tennis players or something) and no new alignment researchers arrived, how much more pessimistic would you become about AI takeover?

These predictions are not very related to any alignment research that is currently occurring. I think it's just quite unclear how hard the problem is, e.g. does deceptive alignment occur, do models trained to honestly answer easy questions generalize to hard questions, how much intellectual work are AI systems doing before they can take over, etc.

I know people have spilled a lot of ink over this, but right now I don't have much sympathy for confidence that the risk will be real and hard to fix (just as I don't have much sympathy for confidence that the problem isn't real or will be easy to fix).

sudo00

Epistemic Status: First read. Moderately endorsed.

I appreciate this post and I think it's generally good for this sort of clarification to be made.

 

One distinction is between dying (“extinction risk”) and having a bad future (“existential risk”). I think there’s a good chance of bad futures without extinction, e.g. that AI systems take over but don’t kill everyone.

This still seems ambiguous to me. Does "dying" here mean literally everyone? Does it mean "all animals," "all mammals," "all humans," or just "most humans"? If it's all humans dying, do all hu... (read more)

4paulfchristiano
I think these questions are all still ambiguous, just a little bit less ambiguous. I gave a probability for "most" humans killed, and I intended P(>50% of humans killed). This is fairly close to my estimate for E[fraction of humans killed]. I think if humans die it is very likely that many non-human animals die as well. I don't have a strong view about the insects and really haven't thought about it. In the final bullet I implicitly assumed that the probability of most humans dying for non-takeover reasons shortly after building AI was very similar to the probability of human extinction; I was being imprecise, I think that's kind of close to true but am not sure exactly what my view is.
sudo30

I think this is probably good to just 80/20 with like a weekend of work? So that there’s a basic default action plan for what to do when someone goes “hi designated community person, I’m depressed.”

sudo20

People really should try to not have depression. Depression is bad for your productivity. Being depressed for eg a year means you lose a year of time, AND it might be bad for your IQ too.

A lot of EAs get depressed or have gotten depressed. This is bad. We should intervene early to stop it.

I think that there should be someone EAs reach out to when they’re depressed (maybe this is Julia Wise?), and then they get told the ways they’re probably right and wrong so their brain can update a bit, and a reasonable action plan to get them on therapy or meds or whatever.

2Dagon
I don't disagree, but I don't think it's limited to EA or Rationalist community members, and I wouldn't expect that designated group helper contacts will reach most of the people who need it.  It's been my experience (for myself and for a number of friends) that when someone can use this kind of help, they tend not to "reach out" for it. Your framing of "we should intervene" may have more promise.  Having specific advice on HOW lay-people can intervene would go a long way toward shifting our norms of discourse from "you seem depressed, maybe you should seek help" to "this framing may indicate a depressive episode or negative emotional feedback loop - please take a look at <this page/thread> to help figure out who you can talk with about it".
3sudo
I think this is probably good to just 80/20 with like a weekend of work? So that there’s a basic default action plan for what to do when someone goes “hi designated community person, I’m depressed.”
sudo129

Strong upvoted.

I’m excited about people thinking carefully about publishing norms. I think this post existing is a sign of something healthy.

Re Neel: I think that telling junior mech interp researchers to not worry too much about this seems reasonable. As a (very) junior researcher, I appreciate people not forgetting about us in their posts :)

sudo10

I'd be excited about more people posting their experiences with tutoring 

sudo12

Short on time. Will respond to last point.

I wrote that they are not planning to "solve alignment once and forever" before deploying first AGI that will help them actually develop alignment and other adjacent sciences.

Surely this is because alignment is hard! Surely if alignment researchers really did find the ultimate solution to alignment and present it on a silver platter, the labs would use it.

sudo20

Also: An explicit part of SERI MATS’ mission is to put alumni in orgs like Redwood and Anthropic AFAICT. (To the extent your post does this,) it’s plausibly a mistake to treat SERI MATS like an independent alignment research incubator.

6Ryan Kidd
MATS aims to find and accelerate alignment research talent, including:
  • Developing scholar research ability through curriculum elements focused on breadth, depth, and originality (the "T-model of research");
  • Assisting scholars in producing impactful research through research mentorship, a community of collaborative peers, dedicated 1-1 support, and educational seminars;
  • Aiding the creation of impactful new alignment organizations (e.g., Jessica Rumbelow's Leap Labs and Marius Hobbhahn's Apollo Research);
  • Preparing scholars for impactful alignment research roles in existing organizations.
Not all alumni will end up in existing alignment research organizations immediately; some return to academia, pursue independent research, or potentially skill-up in industry (to eventually aid alignment research efforts). We generally aim to find talent with existing research ability and empower it to work on alignment, not necessarily through existing initiatives (though we certainly endorse many).
1Roman Leventov
Yes, admittedly, there is much less strain on being very good at philosophy of science if you are going to work within a team with a clear agenda, particularly within an AGI lab where the research agendas tend to be much more empirical than in "academic" orgs like MIRI or ARC. And thinking about research strategy is not the job of non-leading researchers at these orgs either, whereas independent researchers or researchers at more boutique labs have to think about their strategies by themselves. Founders of new orgs and labs have to think about their strategies very hard, too. But preparing employees for OpenAI, Anthropic, or DeepMind is clearly not the singular focus of SERI MATS.
sudo113

Epistemic status: hasty, first pass

First of all thanks for writing this.

I think this letter is “just wrong” in a number of frustrating ways.

A few points:

  • “Engineering doesn’t help unless one wants to do mechanistic interpretability.” This seems incredibly wrong. Engineering disciplines provide reasonable intuitions for how to reason about complex systems. Almost all engineering disciplines require their practitioners to think concretely. Software engineering in particular also lets you run experiments incredibly quickly, which makes it harder to be wrong.
... (read more)
2Roman Leventov
I should have written "ML engineering" (I think it was not entirely clear from the context, fixed now). Knowing the general engineering methodology and the typical challenges in systems engineering for robustness and resilience is, of course, useful, as is having visceral experience of these (e.g., engineering distributed systems, introducing bugs in the systems oneself and seeing how they may fail in unexpected ways). But I would claim that learning this through practice, i.e., learning "from one's own mistakes", is again inefficient. Smart people learn from others' mistakes. Just going through some of the materials from here would give alignment researchers much more useful insights than years of hands-on engineering practice[1]. Again, it's an important qualification that we are talking about what's effective for theoretical-ish alignment research, not actual engineering of (AGI) systems!

I don't argue that ML theory is useless. I argue that going through ML courses that spend too much time on building basic MLP networks or random forests (and understanding the theory of these, though it's minimal) is ineffective. I personally stay abreast of ML research by following the MLST podcast (e.g., on spiking NNs, deep RL, Domingos on neurosymbolic and lots of other stuff, a series of interviews with people at Cohere: Hooker, Lewis, Grefenstette, etc.)

This is not what I wrote. I wrote that they are not planning to "solve alignment once and forever" before deploying the first AGI that will help them actually develop alignment and other adjacent sciences. This might sound ridiculous to you, but that's what OpenAI and Conjecture say absolutely directly, and I suspect other labs are thinking about it too, though they don't pronounce it directly.

[1] I did develop several databases and distributed systems over my 10-year-long engineering career and was also interested in resilience research and was reading about it, so I know what I'm talking about and can compare.
2sudo
Also: An explicit part of SERI MATS’ mission is to put alumni in orgs like Redwood and Anthropic AFAICT. (To the extent your post does this,) it’s plausibly a mistake to treat SERI MATS like an independent alignment research incubator.
sudo20

Ordering food to go and eating it at the restaurant without a plate and utensils defeats the purpose of eating it at the restaurant

Restaurants are a quick and convenient way to get food, even if you don’t sit down and eat there. Ordering my food to-go saves me a decent amount of time and food, and also makes it frictionless to leave.

But judging by votes, it seems like people don’t find this advice very helpful. That’s fine :(

2Jiro
It sounded like you were suggesting that people order the food to go even if they're eating there. Ordering it to go and then actually going makes more sense, but still has the problem of "what is your reason for going to a restaurant?" Most people who go to restaurants want to eat there a large portion of the time.
sudo20

I think there might be a misunderstanding. I order food because cooking is time-consuming, not because it doesn’t have enough salt or sugar.

5nim
Have you considered ordering catering for a "group" a couple times a week, and having your meals from the single catering order for several days, instead of spending the time choosing and acquiring more premade food each day? I've seen some folks online who have great success using catering as a meal prep option because it's more frugal than ordering separate meals, but it also incurs less time investment as well as costing less money.
sudo10

Maybe it’d be good if someone compiled a list of healthy restaurants available on DoorDash/Uber Eats/GrubHub in the rationalist/EA hubs?

5nim
IMO the most obvious harm reduction strategy for "fast food delivery is expensive and terrible" is not to order different fast food, but to keep pre-made frozen meals on hand. You can buy frozen meals with the nutrition profile of your choice, make them yourself, or pay someone to make them for you. This costs less money and time than ordering delivery, and has the added benefit of leveraging that cognitive bias where you make "healthier" food choices when planning meals in advance compared to decisions that you make while hungry. I'd postulate that people often order delivery because it's the quickest and easiest option available to them. It seems like getting people (including oneself) to eat something healthier than their defaults is a matter of making something even quicker and easier available, rather than offering a choice between "do it your usual way" and a higher-friction option of checking a list first.
2trevor
The advice that I heard is to put more and more salt into your cooking, until you feel satisfied with your cooking and become less likely to order food (which will have tons of salt anyway, way more than you would ever add). There's no easy fix with sugar because it's addictive and has a withdrawal period.
sudo10

Can’t you just combat this by drinking water?

2trevor
That results in too much salt and too much water, and not enough of the other stuff (e.g. electrolytes). Adding in more of the other stuff doesn't solve the problem; it means your metabolism is going too quickly, because more is going in and therefore more has to be going out at the same time. The human metabolism has, like, a million interconnected steps, and increasing the salinity or speed of your bloodstream affects all of them at once.
sudo10

If you plan to eat at the restaurant, you can just ask them for a box if you have food left over.

This is true at most restaurants. Unfortunately, it often takes a long time for the staff to prepare a box for you (on the order of 5 minutes).

A potential con is that most food needs to be refrigerated if you want to keep it safe to eat for several hours

One might simply get into the habit of putting whatever food they have in the refrigerator. I find that refrigerated food is usually not unpleasant to eat, even without heating.

2nim
My experience is that not all locations (work etc) have refrigerator space conveniently available, but if you have access, that's great! I find that asking the person at the counter to give me a box is a much quicker operation than asking wait staff to put my food into a box for me.
sudo20

Sometimes when you purchase an item, the cashier will randomly ask you if you’d like additional related items. For example, when purchasing a hamburger, you may be asked if you’d like fries.

It is usually a horrible idea to agree to these add-ons, since the cashier does not inform you of the price. I would like fries for free, but not for $100, and not even for $5.

The cashier’s decision to withhold pricing information from you should be evidence that you do not, in fact, want to agree to the deal.

2Dagon
For most LW readers, it's usually a bad idea, because many of us obsessively put cognitive effort into unimportant choices like what to order at a hamburger restaurant, and reminders or offers of additional things don't add any information or change our modeling of our preferences, so are useless.

For some, they may not be aware that fries were not automatic, or may not have considered whether they want fries (at the posted price, or if price is the decider, they can ask), and the reminder adds salience to the question, so they legitimately add fries. Still others feel it as (a light, but real) pressure to fit in or please the cashier by accepting, and accept the add-on out of guilt or whatever.

Some of these reasons are "successes" in terms of mutually-beneficial trade, some are "predatory" in that the vendor makes more money and the customer doesn't get the value they'd hoped. Many are "irrelevant" in that they waste a small amount of time and change no decisions.

I think your heuristic of "decline all non-solicited offers" is pretty strong, in most aspects of the world.
3Richard_Kennaway
You could always ask. I ignore upsells because I've already decided what I want and ordered that, whether it's extra fries or a hotel room upgrade.
sudo10

Epistemic status: clumsy

An AI could also be misaligned because it acts in ways that don't pursue any consistent goal (incoherence).

It’s worth noting that this definition of incoherence seems inconsistent with VNM. Eg. A rock might satisfy the folk definition of “pursuing a consistent goal,” but fail to satisfy VNM due to lacking completeness (and by corollary due to not performing expected utility optimization over the outcome space).

sudo20

Strong upvoted.

The result is surprising and raises interesting questions about the nature of coherence. Even if this turns out to be a fluke, I predict that it’d be an informative one.

sudo65

I think I was deceived by the title.

I’m pretty sure that rapid capability generalization is distinct from the sharp left turn.

sudoΩ022

dedicated to them making the sharp left turn

I believe that “treacherous turn” was meant here.

1scasper
thanks
sudo11

Wait I’m pretty confident that this would have the exact opposite effect on me.

1Christopher King
Well it helps that he is super chill. It's not like he's micromanaging me, but if I start literally goofing off he'd probably notice, lol.
sudo30

You can give ChatGPT the job posting and a brief description of Simon’s experiment, and then just ask them to provide critiques from a given perspective (eg. “What are some potential moral problems with this plan?”)
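As a minimal sketch of that workflow, assuming the OpenAI Python client; the model name, variable contents, and list of perspectives are placeholders, not part of the original suggestion:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

job_posting = "..."         # paste the job posting here
experiment_summary = "..."  # brief description of the experiment

perspectives = [
    "a moral philosopher",
    "a labor economist",
    "a skeptical prospective applicant",
]

for perspective in perspectives:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any chat model works
        messages=[{
            "role": "user",
            "content": (
                f"Job posting:\n{job_posting}\n\n"
                f"Experiment description:\n{experiment_summary}\n\n"
                f"From the perspective of {perspective}, what are some "
                "potential moral problems with this plan?"
            ),
        }],
    )
    print(f"--- {perspective} ---")
    print(response.choices[0].message.content)
```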

2the gears to ascension
ah, I see, yeah, solid and makes sense.
sudo42

I clicked the link and thought it was a bad idea ex post. I think that my attempted charitable reading of the Reddit comments revealed significantly less constructive data than what would have been provided by ChatGPT.

I suspect that rationalists engaging with this form of content harms the community a non-trivial amount.

2the gears to ascension
Interesting, if the same could be done with chatgpt I'd be curious to hear how you'd frame the question. If the same analysis can be done with chatgpt I'd do it consistently. Can you say more about how it causes harm? I'd like to find a way to reduce that harm, because there's a lot of good stuff in this sort of analysis, but you're right that there's a tendency to use extremely spiky words. A favorite internet poster of mine has some really interesting takes on how it's important to use soft language and not demand people agree, which folks on that subreddit are in fact pretty bad at doing. It's hard to avoid it at times, though, when one is impassioned.
sudo10

I’m a fan of this post, and I’m very glad you wrote it.

sudo50

I understand feeling frustrated given the state of affairs, and I accept your apology.

Have a great day.

sudo21

You don’t have an accurate picture of my beliefs, and I’m currently pessimistic about my ability to convey them to you. I’ll step out of this thread for now.

6the gears to ascension
that's fair. I apologize for my behavior here; I should have encoded my point better, but my frustration is clearly incoherent and overcalibrated. I'm sorry to have wasted your time and reduced the quality of this comments section.
sudo3-2

I find the accusation that I'm not going to do anything slightly offensive.

Of course, I cannot share what I have done and plan to do without severely de-anonymizing myself. 

I'm simply not going to take humanity's horrific odds of success as a license to make things worse, which is exactly what you seem to be insisting upon.

5the gears to ascension
no, there's no way to make it better that doesn't involve going through, though. your model that any attempt to understand or use capabilities is failure is nonsense, and I wish people on this website would look in a mirror about what they're claiming when they say that. that attitude was what resulted in mispredicting alphago! real safety research is always, always, always capabilities research! it could not be otherwise!