Agent synchronization
There is a line of alignment-related thinking that looks for ways in which agents will tend to be similar. An early example is convergent instrumental goals, and a later one is natural abstractions. Those two ideas share an important attribute: both treat something "mental" (values, abstractions) as at least partially grounded in the objective environment. My goal in this post is to present and discuss another family of mechanisms for convergence of agents. Those mechanisms are different in that they arise from interaction between the agents and make them "synchronize" with each other, rather than adapt to a similar non-agentic environment. As a result, the convergence is around things that are "socially constructed" and somewhat arbitrary, rather than "objective" things. I'll then briefly touch on another family of mechanisms and conclude with some short points about relevance to alignment.

Instrumental Value Synchronization

Depending on their specific situations and values, agents may care about different aspects of the environment to different degrees. A not-too-ambitious paperclip maximizer (Pam for short) may care more about controlling metal on earth than about controlling rocks on the moon. A huge-pyramids-on-the-moon maximizer (Pym) may have the opposite priorities. But they both care about uranium to power their projects, and may get into conflicts over it. It then makes sense for Pym to care about controlling metal too, to gain some leverage over Pam that may then be used to obtain more uranium. It may want the ability to give Pam metal in exchange for uranium, or to promise to create paperclips in exchange for uranium, or to retaliate with paperclip-destruction if Pam tries to take its uranium. Either way, the result for Pym is that it is now more like Pam: it cares about metal and paperclips, and about things that are relevant to those.

Money

Money is a strange thing, leading people to say strange things about how it works. The most common one
I seem to be the only one who read the post that way, so probably I read my own opinions into it, but my main takeaway was pretty much that people with your (and my) values are often shamed into pretending to have other values and into inventing excuses for how their values are consistent with their actions, while it would be more honest and productive if we took a more pragmatic approach to cooperating around our altruistic goals.