Hi! Interesting post!
Just to make sure I got it right: in Claim 1, if you subtract $\sum_n H(P_{\mu}(\cdot | x_n))$ on both sides of inequality (1.1), you get that the sum of the KL-divergences $\sum_n KL(P_{\mu}(\cdot | x_n), P_{M_1}(\cdot | x_n, D_{<n}))$ is smaller than the constant $C(\mu, M_1)$, right?
Then, dividing by N on both sides, you get that the average of said KL divergences goes to 0 as N goes to infinity, at least with rate 1/N, is that correct?
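Writing out the algebra I have in mind (assuming (1.1) bounds the summed cross-entropies by the summed entropies plus $C(\mu, M_1)$; apologies if I'm misreading the notation):
$$\sum_{n=1}^{N} H\big(P_{\mu}(\cdot \mid x_n),\, P_{M_1}(\cdot \mid x_n, D_{<n})\big) \;\le\; \sum_{n=1}^{N} H\big(P_{\mu}(\cdot \mid x_n)\big) + C(\mu, M_1).$$
Subtracting $\sum_{n=1}^{N} H\big(P_{\mu}(\cdot \mid x_n)\big)$ from both sides and using $H(p, q) - H(p) = KL(p, q)$:
$$\sum_{n=1}^{N} KL\big(P_{\mu}(\cdot \mid x_n),\, P_{M_1}(\cdot \mid x_n, D_{<n})\big) \;\le\; C(\mu, M_1),$$
and dividing by $N$:
$$\frac{1}{N}\sum_{n=1}^{N} KL\big(P_{\mu}(\cdot \mid x_n),\, P_{M_1}(\cdot \mid x_n, D_{<n})\big) \;\le\; \frac{C(\mu, M_1)}{N} \longrightarrow 0 \text{ as } N \to \infty.$$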
The Von Neumann-Morgenstern paradigm allows for binary utility functions, i.e. functions that are equal to 1 on some event (measurable set of outcomes) and to 0 on its complement. Said event could be, for instance, "no global catastrophe for humanity in time period X".
Of course, you can implement some form of deontology by multiplying such a binary utility function by something like exp(−number of bad actions taken).
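Concretely, the kind of combination I mean (the weight $\lambda$ and the count of "bad actions" are just illustrative placeholders):
$$U(\omega) \;=\; \mathbf{1}\big[\text{no global catastrophe in period } X\big](\omega)\;\cdot\;\exp\big(-\lambda \cdot \#\{\text{bad actions taken in } \omega\}\big),$$
which stays in $[0, 1]$, vanishes on catastrophic outcomes, and is discounted multiplicatively for each deontological violation.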
Any thoughts on this observation?
When you say "maybe we should be assembling like minded and smart people [...]", do you mean "maybe"? Or do you mean "Yes, we should definitely do that ASAP"?
Have you noticed that you keep encountering the same ideas over and over? You read another post, and someone helpfully points out it's just old Paul's idea again. Or Eliezer's idea. Not much progress here, move along.
Or perhaps you've been on the other side: excitedly telling a friend about some fascinating new insight, only to hear back, "Ah, that's just another version of X." And something feels not quite right about that response, but you can't quite put your finger on it.
Some questions regarding these contexts:
-Is it true that you can deduce that "not much progress" is being made? In (pure) maths, it is sometimes very useful to be able to connect...
Strong upvote. I'm slightly worried that this wasn't written, in some form, earlier (maybe I missed a similar older post?).
I think we[1] can, and should, go even further:
-Find a systematic/methodical way of identifying which people are really good at strategic thinking, and help them use their skills in relevant work; maybe try to hire from outside the usual recruitment pools.
-If deemed feasible (in a short enough amount of time): train some people mainly on strategy, so as to get a supply of better strategists.
-Encourage people to state their incompetence in some domains (except maybe in cases where it makes for bad PR), and to embrace the idea of specialization and division of labour more: maybe high-level strategists don't need as much expertise in the technical details, only the ability to see which phenomena matter (assuming domain experts are able to communicate well enough).
[1] Say, the people who care about preventing catastrophic events, in a broad sense.
Hi!
Have you heard of the ModelCollab and CatColab projects? It seems there is an interesting overlap with what you want to do!
More generally, people at the Topos Institute work on related questions of collaborative modelling and collective intelligence:
https://topos.institute/work/collective-intelligence/
https://topos.institute/work/collaborative-modelling/
https://www.localcharts.org/t/positive-impact-of-algebraicjulia/6643
There's a website for sharing world-modelling ideas, run by Owen Lynch (who works at Topos UK):
https://www.localcharts.org/t/localcharts-is-live/5714
For instance, they have a paper on task delegation.
Their work uses somewhat advanced maths, but I think it is justified by the ambition: to develop general tools for creating and combining models. They seem to make an effort to popularise these tools, so that non-mathematicians can get something out of their work.
Are you saying that holistic/higher-level approaches can be useful because they are very likely to be more computationally efficient, actually fit inside human brains, and do not require as much data?
Is that the main point, or did I miss something?
Hello!
These ideas seem interesting, but there's something that disturbs me: in the coin flip example, how is 3 fundamentally different from 1000? The way I see it, the only mathematical difference is that your "bounds" (whatever that means) are simply much worse in the case with 3 flips. Of course, I think I understand why humans/agents would want to say "the case with 3 flips is different from that with 1000", but the mathematics seem similar to me.
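For concreteness, here is the kind of "bound" I have in mind (a standard concentration inequality; this is my guess at what is meant, not something from the post). For $N$ fair-coin flips with empirical frequency $\hat{p}$ of heads, Hoeffding's inequality gives
$$P\big(|\hat{p} - p| \ge \varepsilon\big) \;\le\; 2\exp(-2N\varepsilon^2),$$
which at $\varepsilon = 0.1$ is vacuous for $N = 3$ (the right-hand side is about $1.9$) but about $4 \times 10^{-9}$ for $N = 1000$. The inequality has exactly the same form in both cases; only the numbers are much worse for small $N$.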
Am I missing something?
The aim of this text is to give an overview of Davidad’s safety plan, while also outlining some of the limitations and challenges. Additionally, I’ll explain why I would like to contribute to this field.
I am not Davidad; I have tried to convey his ideas as I understand them. While my interpretation may not be exact, I hope it still holds value. Also, this post does not focus on the technical details; I might write another one later, with a deeper technical discussion[1].
I began exploring these questions during a weekend-long research sprint. Then, for a couple of months, I kept thinking about them, reading related posts (Davidad's List of AI Safety...
Is the field advanced enough that it would be feasible to have a guaranteed no-zero-day evaluation and deployment codebase that is competitive with a regular codebase?
As far as I know (I'm not an expert), such absolute guarantees are too hard right now, especially if the AI you're trying to verify is arbitrarily complex. However, the training process ought to yield an AI with specific properties. I'm not entirely sure I got what you meant by "a guaranteed no-zero-day evaluation and deployment codebase". Would you mind explaining more?
"Or is the claim that it's feasible to build a conservative world model that tells you 'maybe a zero-day' very quickly once you start doing..."
Hi! Interesting post!
I have a question: to what extent are "mechanistic explanations" (whatever they are) expected to be relative to, or valid only for, specific datasets/regions of "data space"?
Could it happen that you can (if you have enough computing power) cover the set of all relevant data with small subsets, such that you have mechanistic explanations valid for each small subset, but no global one?
What would this mean for AI safety? Are there people thinking about this potential issue?
For context: I'm a mathematician, with more experience in algebra than in analysis, currently thinking about this question using sheaves and (hopefully) cohomology.
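To sketch the kind of formalisation I'm playing with (purely illustrative notation, not something from the post): let $X$ be the set of relevant data, $\{U_i\}_{i \in I}$ a cover of $X$ by small subsets, and $\mathcal{E}$ a presheaf with $\mathcal{E}(U)$ the set of mechanistic explanations valid on $U$, together with restriction maps $\mathcal{E}(U) \to \mathcal{E}(V)$ for $V \subseteq U$. The scenario in my question is then a compatible family
$$(e_i)_{i \in I}, \qquad e_i \in \mathcal{E}(U_i), \qquad e_i|_{U_i \cap U_j} = e_j|_{U_i \cap U_j} \text{ for all } i, j,$$
that is not the restriction of any global $e \in \mathcal{E}(X)$, i.e. the local explanations are pairwise consistent but do not glue. The hope is that, when $\mathcal{E}$ carries enough structure, such failures of gluing can be measured by the cohomology of the cover.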