Does Summarization Affect LLM Performance?
Edit 4/2/25: Added footnotes; I didn't realize they got lost en route.

Hello! This is a mini-project that I carried out to get a better sense of ML engineering research. The question itself is trivial, but it was useful to walk through every step of the process by myself. I also haven't had much experience with technical research, so I'd greatly appreciate any feedback. My code is available on GitHub.

This task originated from an application to Aryan Bhatt's SPAR stream. Thanks to Jacob G-W, Sudarsh Kunnavakkam, and Sanyu Rajakumar for reading through and giving feedback on an initial version of this post! Any errors are, of course, my own.

Tl;dr: Does summarization degrade LLM performance on GPQA Diamond? Yes, but only if the model has a high baseline. Also, the degree of summarization is not correlated with performance degradation.

Summary

I investigate whether summarizing questions can reduce the accuracy of a model's responses. Before running this experiment, I had two hypotheses:

1. Summarization does affect a model's performance, but only when mediated through 'loss of semantic information' (as opposed to because of the words lost).
2. The more semantic information is lost, the worse the model performs.

I operationalized these through two summarization metrics: syntactic summarization and semantic summarization. The former represents our intuitive notion of summarization; the latter aims to capture the fuzzier notion of how much meaning is lost.

Results indicated that summarization was only an issue when the model had a high baseline to start with; when baseline performance was low, summarization had a negligible effect [Table 1]. Also, the degree of summarization (either syntactic or semantic) was not correlated with performance degradation (figures attached under Results).

Table 1: Model baseline performance on GPQA Diamond vs. model performance when the questions were summarized. Results shown with 95% confidence intervals.

Methodology

Fi
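As a concrete illustration of the two summarization metrics described above, here is a minimal sketch of how they could be computed. This is an illustration rather than the exact implementation in the repo: the word-count ratio for syntactic summarization, and the all-MiniLM-L6-v2 sentence-transformers embedding with cosine distance for semantic summarization, are assumptions made for the sketch.

```python
# Hypothetical sketch of the two summarization metrics; the metric
# definitions and embedding model here are illustrative assumptions,
# not necessarily those used in the actual experiment.
import numpy as np
from sentence_transformers import SentenceTransformer

_embedder = SentenceTransformer("all-MiniLM-L6-v2")

def syntactic_summarization(original: str, summary: str) -> float:
    """Fraction of words removed by the summary (0 = no compression)."""
    return 1.0 - len(summary.split()) / len(original.split())

def semantic_summarization(original: str, summary: str) -> float:
    """One minus the cosine similarity between embeddings of the original
    question and its summary (higher = more meaning lost)."""
    a, b = _embedder.encode([original, summary])
    cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return 1.0 - cos
```

In the experiment, each GPQA Diamond question gets a summarized variant, both metrics are computed for the original/summary pair, and the model is evaluated on both versions, so that the drop in accuracy can be compared against the degree of summarization.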