All of Soroush Pour's Comments + Replies

I think much of this is right, which is why, as an experienced startup founder who's deeply concerned about AI safety & alignment, I'm starting a new AI safety public benefit corporation called Harmony Intelligence. I recently gave a talk on this at the VAISU conference: slides and recording.

If what I'm doing is interesting to you and you'd like to be involved or collaborate, please reach out via the contact details on the last slide of the deck linked above.

For anybody else wondering what "ERO" stands for in the DeepMind section -- it stands for "Externalized Reasoning Oversight" and more details can be found in this paper.

Source: @Rohin Shah's comment.

There have been some strong criticisms of this statement, notably by Jeremy Howard et al. here. I've written a detailed response to the criticisms here:

https://www.soroushjp.com/2023/06/01/yes-avoiding-extinction-from-ai-is-an-urgent-priority-a-response-to-seth-lazar-jeremy-howard-and-arvind-narayanan/

Please feel free to share with others who may find it valuable (e.g. skeptics of AGI x-risk).

I don't think this is a fair consideration of the article's entire message. This line from the article specifically calls out slowing down AI progress:

> we could collectively agree (with the backing power of a new organization like the one suggested below) that the rate of growth in AI capability at the frontier is limited to a certain rate per year.

Having spent a long time reading through OpenAI's statements, I suspect that they are trying to strike a difficult balance between:

  • A) Doing the right thing by way of AGI safety (including considering options like
...

No comment on this being an accurate take on MIRI's worldview or not, since I am not an expert there. I wanted to ask a separate question related to the view described here:

> "With gradient descent, maybe you can learn enough to train your AI for things like "corrigibility" or "not being deceptive", but really what you're training for is "Don't optimise for the goal in ways that violate these particular conditions"."

On this point, it seems that we create a somewhat arbitrary divide between corrigibility & deception on one side and all other goals of...

Jay Bailey:
Sorry it took me a while to get to this.

Intuitively, as a human, you get MUCH better results on a thing X if your goal is to do thing X, rather than thing X being applied as a condition for you to do what you actually want. For example, if your goal is to understand the importance of security mindset in order to avoid your company suffering security breaches, you will learn much more than if you were forced to go through mandatory security training. In the latter case, you are probably putting in the bare minimum of effort to pass the course and go back to whatever your actual job is. You are unlikely to learn security this way, and if you had a way to press a button and instantly "pass" the course, you would.

I have, in fact, made a divide between some things and some other things in my above post. I suppose I would call those things "goals" (the things you really want for their own sake) and "conditions" (the things you need to do for some external reason).

My inner MIRI says: we can only train conditions into the AI, not goals. We have no idea how to put a goal in the AI, and the problem is that if you train a very smart system with conditions only, and it picks up some arbitrary goal along the way, you end up not getting what you wanted. It seems that if we could get the AI to care about corrigibility and non-deception robustly, at the goal level, we would have solved a lot of the problem that MIRI is worried about.