User Comment Replies

The Von Neumann-Morgenstern paradigm allows for binary utility functions, i.e. functions that are equal to 1 on some event/(measurable) set of outcomes, and to 0 on the complement. Said event could be, for instance "no global catastrophe for humanity in time period X".
Of course, you can implement some form of deontology by multiplying such a binary utility function with something like exp(- bad actions you take).

Any thoughts on this observation?

What convincing warning shot could help prevent extinction from AI?

Paul W3mo10

When you say "maybe we should be assembling like minded and smart people [...]", do you mean "maybe"? Or do you mean "Yes, we should definitely do that ASAP"?

Conceptual Rounding Errors

Paul W3mo20

Have you noticed that you keep encountering the same ideas over and over? You read another post, and someone helpfully points out it's just old Paul's idea again. Or Eliezer's idea. Not much progress here, move along.
Or perhaps you've been on the other side: excitedly telling a friend about some fascinating new insight, only to hear back, "Ah, that's just another version of X." And something feels not quite right about that response, but you can't quite put your finger on it.

Some questions regarding these contexts:

-Is it true that you can deduce that... (read more)

Good Research Takes are Not Sufficient for Good Strategic Takes

Paul W3mo10

Strong upvote. Slightly worried by the fact that this wasn't written, in some form, earlier (maybe I missed a similar older post?)

I think we^[1] can, and should, go even further:

-Find a systematic/methodical way of identifying which people are really good at strategic thinking, and help them use their skills in relevant work; maybe try to hire from outside the usual recruitment pools.

If deemed feasible (in a short enough amount of time): train some people mainly on strategy, so as to get a supply of better strategists.

-Encourage people to s... (read more)

Elicitation for Modeling Transformative AI Risks

Paul W3mo21

Hi!

Have you heard of the ModelCollab and CatColab projects ? It seems that there is an interesting overlap with what you want to do!

More generally, people at the Topos Institute work on related questions, of collaborative modelling and collective intelligence:

https://topos.institute/work/collective-intelligence/

https://topos.institute/work/collaborative-modelling/

https://www.localcharts.org/t/positive-impact-of-algebraicjulia/6643

There's a website for sharing world-modelling ideas, run by Owen Lynch (who works at Topos UK)

http... (read more)

Emergence, The Blind Spot of GenAI Interpretability?

Paul W11mo10

Are you saying that holistic/higher-level approaches can be useful because they are very likely to be more computationally efficient/actually fit inside human brains/ do not require as much data ?
Is that the main point, or did I miss something ?

1Quentin FEUILLADE--MONTIXI9mo

Holistic means studying at every level. I think that mech interp is very useful for some things, and stuff like what I am pursuing (GenAI Ethology) is very useful for other things. If there is emergence, it means that we can't study the model at only one level of abstraction, we need to do it at multiple level and combine insight to be able to remotely understand and control what's happening. Additionally, I think that there is still other levels (and in between levels) of abstraction that should be pursued (like the footnote on glitch interp, and things like developmental interp but at the behavior level)

Natural Latents: The Concepts

Paul W1y30

Hello !
These ideas seem interesting, but there's something that disturbs me: in the coin flip example, how is 3 fundamentally different from 1000 ? The way I see it, the only mathematical difference is that your "bounds" (whatever that means) are simply much worse in the case with 3 coins. Of course, I think I understand why humans/agents would want to say "the case with 3 flips is different from that with 1000", but the mathematics seem similar to me.
Am I missing something ?

3johnswentworth1y

That's correct, the difference is just much worse bounds, so for 3 there only exists a natural latent to within a much worse approximation.

Davidad's Bold Plan for Alignment: An In-Depth Explanation

Paul W1y*12

Is the field advanced enough that it would be feasible to have a guaranteed no-zero-day evaluation and deployment codebase that is competitive with a regular codebase?

As far as I know (I'm not an expert), such absolute guarantees are too hard right now, especially if the AI you're trying to verify is arbitrarily complex. However, the training process ought to yield an AI with specific properties. I'm not entirely sure I got what you meant by "a guaranteed no-zero-day evaluation and deployment codebase". Would you mind explaining more ?

"Or is the clai

... (read more)

2Fabien Roger1y

I was thinking that the formal guarantees would be about state evaluations (i.e. state -> badness bounds) - which would require sth like "showing there is no zero-day" (since "a code-base with a zero-day" might be catastrophically bad if no constraints are put on actions). Thanks for pointing out they can be about action (i.e. (state, action) -> badness bounds), which seem intuitively easier to get good bounds for (you don't need to show there are no zero-days, just that the currently considered action is extremely unlikely to exploit a potential zero-day). I'd be curious to know what kind of formal process could prove that (codebase, codebase-interaction) pairs are provably not-bad (with high probability, and with a false positive rate low enough if you trained an AI to minimize it). My guess is that there is nothing like that on the horizon (that could become competitive at all), but I could be wrong. ("let you see what program is running" was an example of a very minimal safety guarantee I would like to have, not a representative safety guarantee. My point is that I'd be surprised if people got even such a simple and easy safety guarantee anytime soon, using formal methods to check AI actions that actually do useful stuff.)

Davidad's Bold Plan for Alignment: An In-Depth Explanation

Paul W1y32

I believe that the current trends for formal verification, say, of traditional programs or small neural networks, are more about conservative overapproximations (called abstract interpretations). You might want to have a look at this: https://caterinaurban.github.io/pdf/survey.pdf
To be more precise, it appears that so-called "incomplete formal methods" (3.1.1.2 in the survey I linked) are more computationally efficient, even though they can produce false negatives.
Does that answer your question ?

2Fabien Roger1y

Not entirely. This makes me slightly more hopeful that we can have formal guarantees of computer systems, but is the field advanced enough that it would be feasible to have a guaranteed no-zero-day evaluation and deployment codebase that is competitive with a regular codebase? (Given a budget of 1 LHC for both the codebase inefficiency tax + the time to build the formal guarantees for the codebase.) (And computer systems are easy mode, I don't even know how you would start to build guarantees like "if you say X, then it's proven that it doesn't persuade humans of things in ways they would not have approved of beforehand.")

LESSWRONG
LW

All of Paul W's Comments + Replies