All of zeshen's Comments + Replies

zeshen62

Is there any difference between donating through Manifund and donating directly via Stripe?

5habryka
Don't think so. It is plausible that Manifund is eating the Stripe fees themselves, so we might end up getting ~1% more money, but I am not sure.
zeshen50

This happened all the time in my line of work. Forecasts become targets and you become responsible for meeting them. So whenever I was asked to provide a forecast, I would either i) ask as many questions as I needed to pin down the exact purpose of the request, and produce a forecast that met exactly that intent, or ii) pick a forecast and provide it, but first list all the assumptions and caveats behind it that I could possibly think of. With time, I also got a sense of whom I needed to be extra careful with when providing forecasts, because of all the ways they might backfire.

zeshen5-4

Agreed. I'm also pleasantly surprised that your take isn't heavily downvoted.

zeshen*3-1

There’ll be discussions about how these systems will eventually become dangerous, and safety-concerned groups might even set up testing protocols (“safety evals”).

My impression is that safety evals were deemed irrelevant because a powerful enough AGI, being deceptively aligned, would pass all of them anyway. We didn't expect the first general-ish AIs to be so dumb, like GPT-4 being so blatant and explicit about lying to the TaskRabbit worker.

zeshen50

Scott Alexander talked about explicit honesty (unfortunately paywalled) in contrast with radical honesty. In short, explicit honesty is being completely honest when asked, and radical honesty is being completely honest even without being asked. From what I understand of your post, it feels like deep honesty is about being completely honest about information you perceive to be relevant to the receiver, regardless of whether the information has been explicitly requested.

Scott also links to some cases where radical honesty did not work out well, like ... (read more)

3Bobby Freshour
Dear lord, that third link. [Look how the OP doubles down in this thread](https://www.reddit.com/r/Mindfulness/comments/e4r6pi/comment/f9eu8la/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button). As someone else in that thread mentioned, explicit radicalism in any facet of life is going to cause you some trouble. Calling someone ugly to their face and asking them to start a conversation is a hilariously bad example of "radical honesty" over deep honesty.
1Aletheophile
Thanks for bringing up the comparison points of radical honesty and explicit honesty. It does seem like deep honesty is in between the two. But the characterization of deep honesty that you've posited doesn't feel very respectful? It leaves space to patronizingly share things the listener doesn't want to hear, because you've determined that they're relevant. Our notion of deep honesty is closer to being grounded in a notion of respect, perhaps something like "being completely honest about information you perceive that the receiver would want, regardless of whether the information has explicitly been requested". Sometimes that could involve some leaving of trailheads, or testing of the waters, to ascertain whether the person does in fact want the information. As to "when should this apply", it's maybe something like "when you're trying to cooperate with the other party". Of course there's still room for this to go wrong (in the first example you link it seems like the person was trying to cooperate with their boss, who didn't reciprocate), but it does seem like a pretty important safety valve compared to radical honesty.
zeshen83

Can't people decide simply not to build AGI/ASI?

Yeah, many people, like the majority of users on this forum, have decided to not build AGI. On the other hand, other people have decided to build AGI and are working hard towards it. 

Side note: LessWrong has a feature to post posts as Questions, you might want to use it for questions in the future.

6ErioirE
Not to build AGI yet. Many of us would love to build it as soon as we can be confident we have a realistic and mature plan for alignment, but that's a problem so absurdly challenging that even if aliens landed tomorrow and handed us the "secret to Friendly AI", we would have a hell of a time trying to validate that it actually was the real thing. If you are faced with a math problem where you could be staring at the answer and still have no way to unambiguously verify it, you are likely not capable of solving the problem until you somehow close the inferential distance separating you from understanding. Assuming the problem is solvable at all.
zeshen12

Definitely. Also, my incorrect and exaggerated model of the community is likely based on the minority who have a tendency to express those comments publicly, against people who might even genuinely deserve them.

zeshen3-2

I agree that RL agents are misaligned by default, even more so the non-imitation-learned ones. I mean, even LLMs trained on human-generated data are misaligned by default, regardless of what definition of 'alignment' is being used. But even with misalignment by default, I'm just less convinced that their capabilities would grow fast enough to cause an existential catastrophe in the near term, if we use LLM capability improvement trends as a reference.

zeshen110

Thanks for this post. This is generally how I feel as well, but my (exaggerated) model of the AI alignment community would immediately attack me by saying "if you don't find AI scary, you either don't understand the arguments on AI safety or you don't know how advanced AI has gotten". In my opinion, a few years ago we were concerned about recursively self-improving AIs, and that seemed genuinely plausible and scary. But somehow, they didn't really happen (or haven't happened yet) despite people trying all sorts of ways to make it happen. And instead of a in... (read more)

4Seth Herd
Nothing in this post or the associated logic says LLMs make AGI safe, just safer than what we were worried about. Nobody with any sense predicted runaway AGI by this point in history. There's no update from other forms not working yet.

There's a weird thing where lots of people's p(doom) went up when LLMs started to work well, because they found it an easier route to intelligence than they'd been expecting. If it's easier, it happens sooner and with less thought surrounding it.

See Porby's comment on his risk model for language model agents. It's a more succinct statement of my views. LLMs are easy to turn into agents, so let's not get complacent. But they are remarkably easy to control and align, so that's good news for aligning the agents we build from them. But that doesn't get us out of the woods; there are new issues with self-reflective, continuously learning agents, and there's plenty of room for misuse and conflict escalation in a multipolar scenario, even if alignment turns out to be dead easy if you bother to try.
4JustisMills
Maybe worth a slight update on how the AI alignment community would respond? Doesn't seem like any of the comments on this post are particularly aggressive. I've noticed an effect where I worry people will call me dumb when I express imperfect or gestural thoughts, but it usually doesn't happen. And if anyone's secretly thinking it, well, that's their business!

The reason why EY & co were relatively optimistic (p(doom) ~ 50%) before AlphaGo was their assumption that "to build intelligence, you need some kind of insight into the theory of intelligence". They didn't expect that you could just take a sufficiently large approximator, pour data into it, get intelligent behavior, and have no idea why you get intelligent behavior.

9the gears to ascension
My p(doom) was low when I was predicting the Yudkowsky model was ridiculous, due to machine learning knowledge I've had for a while. Now that we have AGI of the kind I was expecting, we have more people working on figuring out what the risks really are, and the previous concern, that the only way to intelligence was RL, now provides only a small reassurance, because non-imitation-learned RL agents that act in the real world are in fact scary. And recently, I've come to believe much of the risk is still real and was simply never about the kind of AI that has been created first, a kind of AI they didn't believe was possible.

If you previously fully believed Yudkowsky, then yes, mispredicting what AI is possible should be an update down. But for me, having seen these unsupervised AIs coming from a mile away just like plenty of others did, I'm in fact still quite concerned about how desperate non-imitation-learned RL agents seem to tend to be by default, and I'm worried that hyperdesperate non-imitation-learned RL agents will be more evolutionarily fit, eat everything, and not even have the small consolation of having fun doing it.

Upvote and disagree: your claim is well argued.
zeshen10

I've gotten push-back from almost everyone I've spoken with about this

I had also expected this reaction, and I always thought I was the only one who thinks we have basically achieved AGI since ~GPT-3. But looking at the upvotes on this post, I wonder if this is a much more common view.

zeshen30

My first impression was also that axis lines are a matter of aesthetics. But then I browsed The Economist's visual style guide and realized they do something similar, i.e. omit the y-axis line (in fact, they omit the y-axis line on basically all their line / scatter plots, but almost always keep the gridlines).

Here's also an article they ran about their errors in data visualization, though it's probably fairly introductory for the median LW reader.
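For concreteness, here's a minimal matplotlib sketch of that convention (my own illustration, not taken from the style guide): hide the y-axis line but keep the horizontal gridlines.

```python
import matplotlib.pyplot as plt
import numpy as np

# Some arbitrary data for a simple line chart.
x = np.arange(10)
y = np.cumsum(np.random.default_rng(0).normal(size=10))

fig, ax = plt.subplots()
ax.plot(x, y)

# Economist-style axes: no y-axis line, but keep horizontal gridlines.
ax.spines["left"].set_visible(False)   # drop the y-axis line
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.yaxis.grid(True, linewidth=0.5)     # keep the gridlines
ax.set_axisbelow(True)                 # draw gridlines behind the data

plt.show()
```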

Answer by zeshen32

I'm pretty sure you have come across this already, but just in case you haven't:

https://incidentdatabase.ai/taxonomy/gmf/

zeshen41

Strong upvoted. I was a participant in AISC8, on the team that went on to launch AI Standards Lab, which I think would not have been launched if not for AISC.

Answer by zeshen80

Why is this question getting downvoted?

zeshen73

This seems to be another one of those instances where I wish there were a dual-voting system for posts. I would've liked to strong-disagree with the contents of the post without discouraging well-intentioned people from posting here.

zeshen63

I feel like a substantial amount of disagreement between alignment researchers is not object-level but semantic, and I remember seeing instances where person X writes a post about how he/she disagrees with a point that person Y made, with person Y responding that that wasn't even their point at all. In many cases, it appears that simply saying what you don't mean could have prevented a lot of the unnecessary misunderstanding.

zeshen10

I'm curious if there are specific parts to the usual arguments that you find logically inconsistent.

zeshen11

I Googled 'how are tokens embedded' and this post came up third in the results - thanks for the post!

zeshen10

If this interests you, there is a proposal in the Guideline for Designing Trustworthy Artificial Intelligence by Fraunhofer IAIS which includes the following:

[AC-R-TD-ME-06] Shutdown scenarios
Requirement: Do
Scenarios should be identified, analyzed and evaluated in which the live AI application must be completely or partially shut down in order to maintain the ability of users and affected persons to perceive situations and take action. This includes shutdowns due to potential bodily injury or damage to property and also due to the violation of personal rig

... (read more)
zeshen21

Everyone in any position of power (which includes engineers who are doing a lot of intellectual heavy-lifting, who could take insights with them to another company), thinks of it as one of their primary jobs to be ready to stop

In some industries, Stop Work Authorities are implemented, where any employee at any level of the organisation has the power to stop work deemed unsafe at any time. I wonder if something similar in spirit would be feasible to implement at top AI labs.

2Raemon
This is definitely my dream, although I think we're several steps away from this being achievable at the present time.
zeshen10

Without thinking about it too much, this fits my intuitive sense. An amoeba can't possibly demonstrate a high level of incoherence because it simply can't do a lot of things, and whatever it does would have to be very much in line with its goal (?) of survival and reproduction. 

zeshen40

Thanks for this post. I've always had the impression that everyone around LW has been familiar with these concepts since they were kids and now knows them by heart, while I've been struggling with some of these concepts for the longest time. It's comforting to me that there are long-time LWers who don't necessarily fully understand all of this stuff either.

3Adam Zerner
That's awesome to hear! Before reading this comment I had a vague sense of "Maybe this'll help people" but after reading it I have a very concrete sense of "Oh yes! This is exactly the thing I was hoping would happen."
zeshen10

Browsing through the comments section, it seems that everyone relates to this pretty well. I do, too. But I'm wondering if this applies mostly to an LW subculture, or if it's a Barnum/Forer effect that every neurotypical person would also relate to.

3Duncan Sabien (Deactivated)
I suspect everyone can relate in that everyone has felt this at some point, or even at a few memorable points. I suspect people who are more firmly normal, not because they're trying to conform but because they're actually close to what the center of their local culture is built to accommodate, cannot relate to feeling this constantly.
zeshen10

With regard to the Seed AI paradigm, most of the publications seem to have come from MIRI (especially the earlier ones, from when they were called the Singularity Institute), with many discussions happening both here on LessWrong and at events like the Singularity Summit. I'd say most of the thinking around this paradigm happened before the era of deep learning. Nate Soares' post might provide more context.

You're right that brain-like AI has not had much traction yet, but it seems to me that there is a growing interest in this research area lately (albeit much ... (read more)

zeshen10

AI is highly non-analogous with guns.

Yes, especially for consequentialist AIs that don't behave like tool AIs. 

zeshenΩ221

I feel like I broadly agree with most of the points you make, but I also feel like accident vs misuse are still useful concepts to have. 

For example, disasters caused by guns could be seen as:

  • Accidents, e.g. killing people by mistaking real guns for prop guns, which may be mitigated with better safety protocols
  • Misuse, e.g. school shootings, which may be mitigated with better legislations and better security etc.
  • Other structural causes (?), e.g. guns used in wars, which may be mitigated with better international relations

Nevertheless, all of the above ... (read more)

3David Scott Krueger (formerly: capybaralet)
Yes it may be useful in some very limited contexts.  I can't recall a time I have seen it in writing and felt like it was not a counter-productive framing. AI is highly non-analogous with guns.
zeshen21

Upvoted. Though, as someone who has been in the corporate world for close to a decade, this is probably one of the rare LW posts I didn't learn anything new from. And because every point is so absolutely true and extremely common in my experience, I was wondering the whole time while reading the post how this is even news.

zeshen00

There are probably enough comments here already, but thanks again for the post, and thanks to the mods for curating it (I would've missed it otherwise).

zeshen185

This is a nice post that echoes many points in Eliezer's book Inadequate Equilibria. In short, it is entirely possible that you outperform 'experts' or 'the market' if there are reasons to believe that these systems converge to a sub-optimal equilibrium, and even more so when you have more information than the 'experts', like in your Wave vs Theorem example.

More related LW concepts: Hero Licensing, and a few essays in the Inside/Outside View tag.

zeshenΩ110

In every scenario, if you have a superintelligent actor which is optimizing the grader's evaluations while searching over a large real-world plan space, the grader gets exploited.

Similar to the evaluator-child who's trying to win his mom's approval by being close to the gym teacher, how would grader exploitation be different from specification gaming / reward hacking? In theory, wouldn't a perfect grader solve the problem? 

2TurnTrout
One point of this post is that specification gaming, as currently known, is an artifact of certain design patterns, which arise from motivating the agent (inner alignment) to optimize an objective over all possible plans or world states (outer alignment). These design patterns are avoidable, but AFAICT are enforced by common ways of thinking about alignment (e.g. many versions of outer alignment commit to robustly grading the agent on all plans it can consider). One hand (inner alignment) loads the shotgun, and our other hand (outer alignment) points it at our own feet and pulls the trigger.

Yes, in theory. In practice, I think the answer is "no", for reasons outlined in this post.
zeshen2-5

In case anyone comes across this post trying to understand the field, Scott Aaronson did a better job than me at describing the "seed AI" and "prosaic AI" paradigms here, which he calls "Orthodox" vs "Reform".

zeshenΩ110

I'm probably missing something, but doesn't this just boil down to "misspecified goals lead to reward hacking"?

2TurnTrout
Nope! Both "misspecified goals" and "reward hacking" are orthogonal to what I'm pointing at. The design patterns I highlight are broken IMO.
zeshen43

This post makes sense to me, though it feels almost trivial. I'm puzzled by the backlash against consequentialism; it just feels like people are overreacting. Or maybe the 'backlash' isn't actually as strong as I'm reading it to be.

I'd think of virtue ethics as some sort of equilibrium that society has landed in after all these years of being a species capable of thinking about ethics. It's not the best, but you'd need more than naive utilitarianism to beat it (this EA forum post feels like common sense to me too), which you describe as reflective c... (read more)

3Adam Zerner
Yeah, I have very similar thoughts.
zeshen10

Thanks - this helps.

zeshen10

Thanks for the reply! 

But I think you can come up with clean examples of capabilities failures if you look at, say, robots that use search to plan; they often do poorly according to the manually specified reward function on new domains because optimizing the reward is too hard for its search algorithm. 

I'd be interested to see actual examples of this, if there are any. But also, how would this not be an objective robustness failure if we frame the objective as "maximize reward"? 

if you perform Inverse Optimal Control on the behavior of the r

... (read more)
2LawrenceC
I have some toy examples from a paper I worked on: https://proceedings.mlr.press/v144/jain21a But I think this is a well-known issue in robotics, because SOTA trajectory planning is often gradient-based (i.e. local). You definitely see this on any "hard" robotics task where initializing a halfway decent trajectory is hard. I've heard from Anca Dragan (my PhD advisor) that this happens with actual self-driving car planners as well.

Oops, sorry, the answer got cut off somehow. I meant to say that if you take a planner that's suboptimal, look at the policy it outputs, and then rationalize that policy assuming that the planner is optimal, you'll get a reward function that is different from the reward function you put in. (Basically what the Armstrong + Mindermann paper says.)

Well, the paper doesn't show that you can't decompose it, but merely that the naive way of decomposing observed behavior into capabilities and objectives doesn't work without additional assumptions. But we have additional information all the time! People can tell when other people are failing due to incompetence vs. misalignment, for example. And in practice, we can often guess whether or not a failure is due to capabilities limitations or objective robustness failures, for example by doing experiments with fine-tuning or prompting.

The reason we care about 2D alignment is that capability failures seem much more benign than alignment failures. Besides the reasons given in the main post, we might also expect that capability failures will go away with scale, while alignment failures will become worse with scale. So knowing whether or not something is a capability robustness failure vs an alignment one can inform you as to the importance and neglectedness of research directions.
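As a toy illustration of the search-based-planner point (a sketch added here for concreteness; the setup and numbers are made up, not taken from the linked paper): a greedy local "planner" gets stuck at a local optimum of the manually specified reward, so it scores poorly even though the reward function is exactly the one we intended, i.e. a capability robustness failure rather than an objective one.

```python
import numpy as np

# Toy 1-D "planning" problem: the agent must pick a position x in [0, 10].
# The manually specified reward has a small local bump near x = 2 and the
# true global optimum near x = 8.
def reward(x):
    return 1.0 * np.exp(-(x - 2.0) ** 2) + 3.0 * np.exp(-(x - 8.0) ** 2)

def greedy_hill_climb(x0, step=0.1, iters=200):
    """A deliberately weak planner: purely local search with a fixed step."""
    x = x0
    for _ in range(iters):
        x = max([x - step, x, x + step], key=reward)
    return x

x_plan = greedy_hill_climb(x0=1.0)  # initialized near the local bump
x_best = max(np.linspace(0.0, 10.0, 10001), key=reward)

print(f"planner's choice: x = {x_plan:.2f}, reward = {reward(x_plan):.3f}")
print(f"global optimum:   x = {x_best:.2f}, reward = {reward(x_best):.3f}")
# The planner converges to x ~= 2 (reward ~1.0) instead of x ~= 8 (reward ~3.0):
# it does poorly on the specified reward not because the reward is wrong, but
# because its search is too weak to optimize it.
```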
zeshen10

Thanks for the example, but why is this a capabilities robustness problem and not an objective robustness problem, if we think of the objective as 'classify pandas accurately'?

1LawrenceC
Insofar as it's not a capability problem, I think it's an example of Goodharting and not inner misalignment/mesa-optimization. The given objective ("minimize cross entropy loss") is optimized on distribution by incorporating non-robust features (and also gives no incentive to be fully robust to adversarial examples, so even non-robust features that don't really help with performance could still persist after training).

You might argue that there is no alignment without capabilities, since a sufficiently dumb model "can't know what you want". But I think you can come up with clean examples of capabilities failures if you look at, say, robots that use search to plan; they often do poorly according to the manually specified reward function on new domains because optimizing the reward is too hard for their search algorithm.

Of course, you can always argue that this is just an alignment failure one step up; if you perform Inverse Optimal Control on the behavior of the robot and derive a revealed reward function, you'll find that its . In other words, you can invoke the fact that for biased agents, there doesn't seem to be a super principled way of dividing up capabilities and preferences in the first place. I think this is probably going too far; there are still examples in practice (like the intractable search example) where thinking about it as a capabilities robustness issue is more natural than thinking about it as an objective problem.
zeshen20

I don't know how I even got here after so long but I really like this post. Looking forward to next year's post.

zeshen30

I'd love to see a post with your reasoning.

zeshen20

I think these are fair assumptions for the alignment field in general. There is, however, work done outside this community that has different assumptions but also calls itself AI safety, e.g. this one.

(I've written more about these assumptions here).

zeshen10

Buying time could also mean implementing imperfect solutions that don't work against strong AGIs but might help us not get destroyed by the first AGI, which might be relatively weak.

(I wrote about it recently)

zeshen10

For example, although our results show CoinRun models failed to learn the general capability of pursuing the coin, the more natural interpretation is that the model has learned a robust ability to avoid obstacles and navigate the levels,[7] but the objective it learned is something like “get to the end of the level,” instead of “go to the coin.”

It seems to me that every robustness failure can be interpreted as an objective robustness failure (as aptly titled in your other post). Do you have examples of a capability robustness failure that is not an objecti... (read more)

zeshen10

Are there any examples of capability robustness failures that aren't objective robustness failures? 

1Charbel-Raphaël
Yes. This image is only a classifier. No mesa optimizer here. So we have only a capability robustness problem.
zeshenΩ330

I got the book (thanks to Conjecture) after doing the Intro to ML Safety Course where the book was recommended. I then browsed through the book and thought of writing a review of it - and I found this post instead, which is a much better review than I would have written, so thanks a lot for this! 

Let me just put down a few thoughts that might be relevant for someone else considering picking up this book.

Target audience: Right at the beginning of the book, the author says "This book is written for the sophisticated practitioner rather than the academic... (read more)

zeshen30

This is actually what my PhD research is largely about: Are these risks actually likely to materialize? Can we quantify how likely, at least in some loose way? Can we quantify our uncertainty about those likelihoods in some useful way? And how do we make the best decisions we can if we are so uncertain about things?

I'd be really interested in your findings.

zeshen30

If there's an existing database of university groups already, it would be great to include a link to that database, perhaps under "Local EA Group". Thanks!

zeshen10

I thought this was a reasonable view and I'm puzzled by the downvotes. But I'm also confused by the conclusion - are you arguing that x-risk from AGI is predictable, or that it isn't? Or is the post just meant to present examples of the merits of both arguments?

2sstich
Thanks — I'm not arguing for this position, I just want to understand the anti AGI x-risk arguments as well as I can. I think success would look like me being able to state all the arguments as strongly/coherently as their proponents would.
zeshen20

(see Zac's comment for some details & citations)

Just letting you know the link doesn't work although the comment was relatively easy to find. 

zeshenΩ110

Thanks for the comment!

You can read more about how these technical problems relate to AGI failure modes and how they rank on importance, tractability, and crowdedness in Pragmatic AI Safety 5. I think the creators included this content in a separate forum post for a reason.

I felt some of the content in the PAIS series would've been great for the course, though the creators probably had a reason to exclude it, and I'm not sure why.

The second group doesn't necessarily care about why each research direction relates to reducing X-risk.

In this case I fee... (read more)

1joshc
From what I understand, Dan plans to add more object-level arguments soon.