All of drocta's Comments + Replies

drocta
10

If you are interested in convincing people who so far think "It is impossible for the existence of an artificial superintelligence to produce desirable outcomes" otherwise, you should have a meaning of "an artificial superintelligence" in mind that is like what they mean by it.

If one suspects that it is impossible for an artificial superintelligence to produce desirable outcomes, then when one considers "among possible futures, the one(s) that have as good or better outcomes than any other possible future", one would suppose that these perhaps are not ones... (read more)

drocta
10

(Sorry for the late response, I hadn't checked my LW inbox much since my previous comments.)
If it were the case that such a function exists but cannot possibly be implemented (any implementation would be implementation as a state), and no other function satisfying the same constraints could possibly be implemented, that seems like it would be a case of it being impossible to have the aligned ASI. (Again, not that I think this is the case, just considering the validity of argument.)

The function that is being demonstrated to exist is the lookup table that produces the appropriate actions, yes? The one that is supposed to be implementable by a finite depth circuit?

drocta
10

It seems to make sense that if hiring an additional employee provides marginal shareholder value, that the company will hire additional employees. So, when the company stops hiring employees, it seems reasonable that this is because the marginal benefit of hiring an additional employee is not positive. However, I don't see why this should suggest that the company is likely to hire an employee that provides a marginal value of 0 or negative.

"Number of employees" is not a continuous variable. When hiring an additional employee, how this changes what the marg... (read more)

drocta
10

Not if the point of the argument is to establish that a superintelligence is compatible with achieving the best possible outcome.

Here is a parody of the issue, which is somewhat unfair and leaves out almost all of your argument, but which I hope makes clear the issue I have in mind:

"Proof that a superintelligence can lead to the best possible outcome: Suppose by some method we achieved the best possible outcome. Then, there's no properties we would want a superintelligence to have beyond that, so let's call however we achieved the best possible outcome, 'a... (read more)

2Roko
The problem with this is that people use the word "superintelligence" without a precise definition. Clearly they mean some computational process. But nobody who uses the term colloquially defines it. So, I will make the assertion that if a computational process achieves the best possible outcome for you, it is a superintelligence. I don't think anyone would disagree with that. If you do, please state what other properties you think a "superintelligence" must have other than being a computational process that achieves the best possible outcome.
drocta
10

Yes, I knew the cardinalities in question were finite. The point applies regardless though. For any set X, there is no injection from 2^X to X. In the finite case, this is 2^n > n for all natural numbers n.

If there are N possible states, then the number of functions from possible states to {0,1} is 2^N , which is more than N, so there is some function from the set of possible states to {0,1} which is not implemented by any state.
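A quick sketch of the diagonal argument behind this in the finite case (notation here is mine, not from the original exchange):

```latex
% For any assignment f : X -> ({0,1}^X) of a function to each state,
% consider the "diagonal" function
\[ D(x) \;=\; 1 - f(x)(x) \qquad \text{for each state } x \in X . \]
% D differs from f(x) at the input x, so D is not f(x) for any x.
% Hence at most |X| of the 2^{|X|} functions X -> {0,1} are implemented by states.
```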

2Roko
I never said it had to be implemented by a state. That is not the claim: the claim is merely that such a function exists.
drocta
90

If your argument is, "if it is possible for humans to produce some (verbal or mechanical) output, then it is possible for a program/machine to produce that output", then, that's true I suppose?

I don't see why you specified "finite depth boolean circuit".

While it does seem like the number of states for a given region of space is bounded, I'm not sure how relevant this is. Not all possible functions from states to {0,1} (or to some larger discrete set) are implementable as some possible state, for cardinality reasons.

I guess maybe that's why you mentioned th... (read more)

2Roko
Until I wrote this proof, it was a live possibility that aligned superintelligence is in fact logically impossible.
2Roko
All cardinalities here are finite. The set of generically realizable states is a finite set because they each have a finite and bounded information content description (a list of instructions to realize that state, which is not greater in bits than the number of neurons in all the human brains on Earth).
2Roko
Isn't it enough that it achieves the best possible outcome? What other criteria do you want a "superintelligence" to have?
drocta
10

Yes. I believe that is consistent with what I said.

"not((necessarily, for each thing) : has [x] -> those [x] are such that P_1([x]))"
is equivalent to, " (it is possible that something) has [x], but those [x] are not such that P_1([x])"

not((necessarily, for each thing) : has [x] such that P_2([x]) -> those [x] are such that P_1([x]))
is equivalent to "(it is possible that something) has [x], such that P_2([x]), but those [x] are not sure that P_1([x])" .

The latter implies the former, as (A and B and C) implies (A and C), and so the latter is stronger, not weaker, than the former.

Right?

drocta
10

Doesn't "(has preferences, and those preferences are transitive) does not imply (completeness)" imply (has preferences) does not imply (completeness)" ? Surely if "having preferences" implied completeness, then "having transitive preferences" would also imply completeness?

1martinkunev
Usually "has preferences" is used to convey that there is some relation (between states?) which is consistent with the actions of the agent. Completeness and transitivity are usually considered additional properties that this relation could have.
drocta
10

"Political category" seems, a bit strong? Like, sure, the literal meaning of "processed" is not what people are trying to get at. But, clearly, "those processing steps that are done today in the food production process which were not done N years ago" is a thing we can talk about. (by "processing step" I do not include things like "cleaning the equipment", just steps which are intended to modify the ingredients in some particular way. So, things like, hydrogenation. This also shall not be construed as indicating that I think all steps that were done N years ago were better than steps done today.)

drocta
70

For example, it is not clear to me if once I consider a program that outputs 0101 I will simply ignore other programs that output that same thing plus one bit (e.g. 01010).

No, the thing about prefixes is about what strings encode a program, not about their outputs.
The purpose of this is mostly just to define a prior over possible programs, in a way that conveniently ensures that the total probability assigned over all programs is at most 1. Seeing as it still works for different choices of language, it probably doesn't need to exactly use this kind of defi... (read more)

3mukashi
Thank you for the comprehensive answer and for correcting the points where I wasn't clear. Also, thank you for pointing out that the Kolmogorov complexity of a program is the length of the program that writes that program. The complexity of the algorithms was totally arbitrary and for the sake of the example. I still have some doubts, but everything is more clear now (see my answer to Charlie Steiner also)
drocta
10

Thanks! The specific thing I was thinking about most recently was indeed specifically about context length, and I appreciate the answer tailored to that, as it basically fully addresses my concerns in this specific case.

However, I also did mean to ask the question more generally. I kinda hoped that the answers might also be helpful to others who had similar questions (as well as if I had another idea meeting the same criteria in the future), but maybe thinking other people with the same question would find the question+answers here, was not super realistic, idk.

Answer by drocta
41

Here is my understanding:
we assume a programming language where a program is a finite sequence of bits, and such that no program is a prefix of another program. So, for example, if 01010010 is a program, then 0101 is not a program.
Then, the (not-normalized) prior probability for a program is 2^(-L), where L is the length of the program in bits.
Why that probability?
If you take any infinite sequence of bits, then, because no program is a prefix of any other program, at most one program will be a prefix of that sequence of bits.
If you randomly (with uniform distribution) select an infi... (read more)
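A minimal numerical sketch of this (the "programs" here are made-up bit strings, purely for illustration): for any prefix-free set, the weights 2^(-length) sum to at most 1 (Kraft's inequality), which is what makes this usable as an un-normalized prior.

```python
# Sketch: prefix-free "programs" (illustrative bit strings, not real programs)
# each get weight 2^(-length); for a prefix-free set these weights sum to <= 1.

programs = ["0", "10", "110", "1110", "1111"]  # no string is a prefix of another

def is_prefix_free(strings):
    return not any(a != b and b.startswith(a) for a in strings for b in strings)

assert is_prefix_free(programs)

weights = {p: 2 ** -len(p) for p in programs}
print(weights)                 # {'0': 0.5, '10': 0.25, '110': 0.125, ...}
print(sum(weights.values()))   # 1.0 here; in general at most 1 (Kraft inequality)
```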

2mukashi
The part I understood is that you weigh the programs based on the length in bits; the longer the program, the less weight it has. This makes total sense. I am not sure that I understand the prefix thing and I think that's relevant. For example, it is not clear to me if once I consider a program that outputs 0101 I will simply ignore other programs that output that same thing plus one bit (e.g. 01010). I also still find fuzzy (and now at least I can put my finger on it) the part where Solomonoff induction is extended to deal with randomness. Let me see if I can make my question more specific: Let's imagine for a second that we live in a universe where only the next programs could be written:
* A) A program that produces deterministically a given sequence of five digits (there are 2^5 of these programs)
* B) A program that produces deterministically a given sequence of 6 digits (there are 2^6 of them)
* C) A program that produces 5 random coin flips with p=0.5
The programs in A have 5 bits of Kolmogorov complexity each. The programs in B have 6 bits. The program C has 4. We observe the sequence O = HTHHT. I measure the likelihood for each possible model. I discard the models with L = 0:
A) There is a model here with likelihood 1
B) There are 2 models here, each of them with likelihood 1 too
C) This model has likelihood 2^-5
Then, things get murky: the priors for each model will be 2^-5 for model A, 2^-6 for model B and 2^-4 for model C, according to their Kolmogorov complexity?
drocta
20

Well, I was kinda thinking of  as being, say, a distribution of human behaviors in a certain context (as filtered through a particular user interface), though, I guess that way of doing it would only make sense within limited contexts, not general contexts where whether the agent is physically a human or something else, would matter. And in this sort of situation, well, the action of "modify yourself to no-longer be a quantilizer" would not be in the human distribution, because the actions to do that are not applicable to humans (as humans are, ... (read more)

drocta
10

For the "Crappy Optimizer Theorem", I don't understand why condition 4, that if  , then  , isn't just a tautology[1]. Surely if  , then no-matter what  is being used,
as  , then letting  , then  , and so  .

I guess if the 4 conditions are seen as conditions on a function  (where they are written for  ), then it is no longer automatic, and it is just when specifying... (read more)

drocta
20

I thought CDT was considered not reflectively-consistent because it fails Newcomb's problem?
(Well, not if you define reflective stability as meaning preservation of anti-Goodhart features, but, CDT doesn't have an anti-Goodhart feature (compared to some base thing) to preserve, so I assume you meant something a little broader?) 
Like, isn't it true that a CDT agent who anticipates being in Newcomb-like scenarios would, given the opportunity to do so, modify itself to be not a CDT agent? (Well, assuming that the Newcomb-like scenarios are of the form "a... (read more)

2Jeremy Gillen
Good point on CDT, I forgot about this. I was using a more specific version of reflective stability. > - wait.. that doesn't seem right..? Yeah this is also my reaction. Assuming that bound seems wrong. I think there is a problem with thinking of ν as a known-to-be-acceptably-safe agent, because how can you get this information in the first place? Without running that agent in the world? To construct a useful estimate of the expected value of the "safe"-agent, you'd have to run it lots of times, necessarily sampling from its most dangerous behaviours. Unless there is some other non-empirical way of knowing an agent is safe? Yeah I was thinking of having large support of the base distribution. If you just rule-in behaviours, this seems like it'd restrict capabilities too much.
drocta
10

Whoops, yes, that should have said  , thanks for the catch! I'll edit to make that fix.

Also, yes, what things between  and   should be sent to, is a difficulty..
A thought I had which, on inspection doesn't work, is that (things between   and )  could be sent to  , but that doesn't work, because  might be terminal, but (thing between  and ) isn't terminal. It seems like the only thing that would always work would be for them to be sent to somethin... (read more)

drocta
*10

A thought on the "but what if multiple steps in the actual-algorithm correspond to a single step in an abstracted form of the algorithm?" thing :
This reminds me a bit of, in the topic of "Abstract Rewriting Systems", the thing that the → vs →* distinction handles (the asterisk just indicating taking the transitive reflexive closure).

Suppose we have two abstract rewriting systems A and B.
(To make it match more closely what you are describing, we can suppose that every node has at most one outgoing arrow, to make... (read more)
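A toy illustration of the kind of condition this gestures at (the particular systems and abstraction map below are mine, not from the comment): each low-level step should map to zero or more high-level steps, i.e. f(a) →* f(a') whenever a → a'.

```python
# Toy check of a "several concrete steps per abstract step" simulation condition.
# Both rewriting systems are deterministic (at most one outgoing arrow per node).

step_A = {"a0": "a1", "a1": "a2", "a2": "a3"}          # low-level system (3 steps)
step_B = {"b0": "b1"}                                   # high-level system (1 step)
f = {"a0": "b0", "a1": "b0", "a2": "b0", "a3": "b1"}    # abstraction map

def reaches(step, x, y, limit=100):
    """y is reachable from x in zero or more steps (reflexive transitive closure)."""
    for _ in range(limit):
        if x == y:
            return True
        if x not in step:
            return False
        x = step[x]
    return False

# Simulation condition: every concrete step a -> a' satisfies f(a) ->* f(a').
ok = all(reaches(step_B, f[a], f[a2]) for a, a2 in step_A.items())
print(ok)  # True: three low-level steps collapse into one high-level step
```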

2Erik Jenner
Should it be f(a)→Bf(a′) at the end instead? Otherwise not sure what b is. I think this could be a reasonable definition but haven't thought about it deeply. One potentially bad thing is that f would have to be able to also map any of the intermediate steps between a an a' to f(a). I could imagine you can't do that for some computations and abstractions (of course you could always rewrite the computation and abstraction to make it work, but ideally we'd have a definition that just works). What I've been imagining instead is that the abstraction can specify a function that determines which are the "high-level steps", i.e. when f should be applied. I think that's very flexible and should support everything. But also, in practice the more important question may just be how to optimize over this choice of high-level steps efficiently, even just in the simple setting of circuits.
drocta
20

In the line that ends with "even if God would not allow complete extinction.", my impulse is to include " (or other forms of permanent doom)" before the period, but I suspect that this is due to my tendency to include excessive details/notes/etc. and probably best not to actually include in that sentence.

(Like, for example, if there were no more adult humans, only billions of babies grown in artificial wombs (in a way staggered in time) and then kept in a state of chemically induced euphoria until the age of 1, and then killed, that technically wouldn't be... (read more)

drocta
90

I want to personally confirm a lot of what you've said here. As a Christian, I'm not entirely freaked out about AI risk because I don't believe that God will allow it to be completely the end of the world (unless it is part of the planned end before the world is remade? But that seems unlikely to me.), but that's no reason that it can't still go very very badly (seeing as, well, the Holocaust happened).

In addition, the thing that seems to me most likely to be the way that God doesn't allow AI doom, is for people working on AI safety to succeed. One shouldn... (read more)

4Yaakov T
@drocta @Cookiecarver We started writing up an answer to this question for Stampy. If you have any suggestions to make it better I would really appreciate it. Are there important factors we are leaving out? Something that sounds off? We would be happy for any feedback you have either here or on the document itself https://docs.google.com/document/d/1tbubYvI0CJ1M8ude-tEouI4mzEI5NOVrGvFlMboRUaw/edit#
drocta
10

I don't understand why this comment has negative "agreement karma". What do people mean by disagreeing with it? Do they mean to answer the question with "no"?

drocta
30

First, I want to summarize what I understand to be what your example is an example of:
"A triple consisting of
1) A predicate P
2) the task of generating any single input x for which P(x) is true
3) the task of, given any x (and given only x, not given any extra witness information), evaluating whether P(x) is true
"

For such triples, it is clear, as your example shows, that the second task (the 3rd entry) can be much harder than the first task (the 2nd entry).
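A standard concrete instance of such a triple (my example, not from the thread): take P(x) = "x is a satisfiable CNF formula". Generating a single x with P(x) true is easy, because you can build the formula around a known assignment; deciding P for an arbitrary x is SAT. A rough sketch:

```python
import itertools, random

def generate_satisfiable_cnf(n_vars=4, n_clauses=6):
    """Task 2): produce an x with P(x) true, by building clauses around a known assignment."""
    assignment = [random.choice([True, False]) for _ in range(n_vars)]
    cnf = []
    for _ in range(n_clauses):
        vs = random.sample(range(n_vars), 3)
        # force the first literal of each clause to agree with the hidden assignment
        clause = [(v, assignment[v]) if i == 0 else (v, random.choice([True, False]))
                  for i, v in enumerate(vs)]
        cnf.append(clause)
    return cnf

def P(cnf, n_vars=4):
    """Task 3): given only x (no witness), decide P(x) -- here by brute force."""
    return any(all(any(a[v] == sign for v, sign in clause) for clause in cnf)
               for a in itertools.product([False, True], repeat=n_vars))

x = generate_satisfiable_cnf()
print(P(x))  # True, but the checker had to search; the generator did not
```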

_______

On the other hand, if instead one had the task of producing an exhaustive list of all x such tha... (read more)

drocta
1-2

As you know, there's a straightforward way, given any boolean circuit, to turn it into a version which is a tree: just take all the parts which have two wires coming out from a gate, and make duplicates of everything that leads into that gate.
I imagine that it would also be feasible to compute the size of this expanded-out version without having to actually expand out the whole thing?

Searching through normal boolean circuits, but using a cost which is based on the size if it were split into trees, sounds to me like it would give you the memoizati... (read more)
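A sketch of the "compute the tree-expanded size without expanding" idea (the circuit representation here is my own choice): the tree size of a gate is 1 plus the tree sizes of its inputs, and memoizing over the DAG keeps this linear in the circuit size even when the expanded tree would be exponentially larger.

```python
from functools import lru_cache

# Circuit as a DAG: each gate lists its input gates ([] for input wires).
# Made-up example: a chain of gates that each reuse the previous gate twice,
# so the tree expansion doubles at every level.
circuit = {"x": [], "g1": ["x", "x"], "g2": ["g1", "g1"], "g3": ["g2", "g2"]}

@lru_cache(maxsize=None)
def tree_size(gate):
    # Size of the fully expanded tree rooted at this gate, without building it.
    return 1 + sum(tree_size(child) for child in circuit[gate])

print(tree_size("g3"))  # 15, even though the DAG only has 4 nodes
```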

drocta
10

It seems like the 5th sentence has its ending cut off? "it tries to parcel credit and blame for a decision up to the input neurons, even when credit and blame" seems like it should continue [do/are x] for some x.

drocta
20

When you say "which yields a solution of the form ", are you saying that  yields that, or are you saying that  yields that? Because, for the former, that seems wrong? Specifically, the former should yield only things of the form  .

But, if the latter, then, I would think that the solutions would be more solutions than that?

Like, what about  ? (where, say,  and 
... (read more)

2[anonymous]
I meant the former (which you're right only has the solution with c1). I only added the c2 term to make it work for the inequality. As a result, it's only a subset of the solutions for the inequality. The (quite complicated!) expression you provided also works. 
drocta
80

As another "why not just" which I'm sure there's a reason for:

in the original circuits thread, they made a number of parameterized families of synthetic images which certain nodes in the network responded strongly to in a way that varied smoothly with the orientation parameter, and where these nodes detected e.g. boundaries between high-frequency and low-frequency regions at different orientations.

If given another such network of generally the same kind of architecture, if you gave that network the same images, if it also had analogous nodes, I'd expect th... (read more)

5johnswentworth
You've correctly identified most of the problems already. One missing piece: it's not necessarily node-activations which are the right thing to look at. Even in existing work, there's other ways interpretable information is embedded, like e.g. directions in activation space of a bunch of neurons, or rank-one updates to matrices.
drocta
20

I was surprised by how the fine-tuning was done for the verbalized confidence.

My initial expectation was that it would make the loss be based on like, some scoring rule based on the probability expressed and the right answer.

Though, come to think of it, I guess seeing as it would be assigning logits values to different expressions of probabilities, it would have to... what, take the weighted average of the scores it would get if it gave the different probabilities? And, I suppose that if many training steps were done on the same question/answer pairs, then... (read more)

3Owain_Evans
The indirect logit is trained with cross-entropy based on the groundtruth correct answer. You can't do this for verbalized probability without using RL, and so we instead do supervised learning using the empirical accuracy for different question types as the labels.
drocta
10

For   such that  is a mesa-optimizer, let  be the space it optimizes over, and  be its utility function.

I know you said "which we need not notate", but I am going to say that for  and  , that  , and  is the space of actions (or possibly,  and  is the space of actions available in the situation  )
(Though maybe you just meant that we need not notate, separately from s, the map from X to A which s defines. In which ... (read more)

drocta
10

Is this something that the infra-bayesianism idea could address? So, would an infra-bayesian version of AIXI be able to handle worlds that include halting oracles, even though they aren't exactly in its hypothesis class?

drocta
10

Do I understand correctly that in general the elements of A, B, C,  are achievable probability distributions over the set of n possible outcomes? (But that in the examples given with the deterministic environments, these are all standard basis vectors / one-hot vectors / deterministic distributions ?)

And, in the case where these outcomes are deterministic, and A and B are disjoint, and A is much larger than B, then given a utility function on the possible outcomes in A or B, a random permutation of this utility function will, with high probability, ha... (read more)

3TurnTrout
Yes. Nice catch. In the stochastic case, you do need a permutation-enforced similarity, as you say (see definition 6.1: similarity of visit distribution sets in the paper). They won't apply for all A, B, because that would prove way too much.
drocta
62

My understanding:

One could create a program which hard-codes the point about which it oscillates (as well as some amount which it always eventually goes that far in either direction), and have it buy once when below, and then wait until the price is above to sell, and then wait until price is below to buy, etc.  

The programs receive as input the prices which the market maker is offering.

It doesn't need to predict ahead of time how long until the next peak or trough, it only needs to correctly assume that it does oscillate sufficiently, and respond when it does.
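A minimal sketch of such a program (the threshold level and price stream below are placeholders): it hard-codes the level the price is assumed to oscillate around and simply reacts to the offered prices, with no prediction of when the next crossing happens.

```python
def oscillation_trader(prices, level=100.0):
    """Buy when the offered price is below the hard-coded level, sell when above.

    `prices` is the stream of prices offered by the market maker; `level` is the
    hard-coded point the price is assumed to oscillate around (an assumption of
    this sketch, not something the trader predicts)."""
    holding = False
    for t, p in enumerate(prices):
        if not holding and p < level:
            holding = True
            yield (t, "buy", p)
        elif holding and p > level:
            holding = False
            yield (t, "sell", p)

# Example with a made-up oscillating price stream.
prices = [98, 103, 97, 105, 99, 104]
print(list(oscillation_trader(prices)))
# [(0, 'buy', 98), (1, 'sell', 103), (2, 'buy', 97), (3, 'sell', 105), ...]
```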

drocta
Ω030

The part about Chimera functions was surprising, and I look forward to seeing where that will go, and to more of this in general.

In section 2.1 , Proposition 2 should presumably say that  is a partial order on  rather than on  .

2Scott Garrabrant
Fixed, Thanks.
drocta
*40

In the section about Non-Dogmatism, I believe something was switched around. It says that if the logical inductor assigns prices converging to $1 to a proposition that cannot be proven, then the trader can buy shares in that proposition at prices of $ and thereby gain infinite potential upside. I believe this should say that if the logical inductor assigns prices converging to $0 to a proposition that can't be disproven, instead of prices converging to $1 for a proposition that can't be proven.
(I think that if the price was converging to $1 for ... (read more)

3Mark Xu
Thanks! Should be fixed now.
drocta
Ω230

You said that you thought that this could be done in a categorical way. I attempted something which appears to describe the same thing when applied to the category FinSet , but I'm not sure it's the sort of thing you meant by when you suggested that the combinatorial part could potentially be done in a categorical way instead, and I'm not sure that it is fully categorical.

Let S be an object.
For i from 1 to k, let  be an object (which is not anything isomorphic to the product of itself with itself, or at least is not the terminal object).
Let... (read more)

6Scott Garrabrant
I have not thought much about applying to things other than finite sets. (I looked at infinite sets enough to know there is nontrivial work to be done.) I do think it is good that you are thinking about it, but I don't have any promises that it will work out. What I meant when I think that this can be done in a categorical way is that I think I can define a nice symmetric monoidal category of finite factored sets such that things like orthogonality can be given nice categorical definitions. (I see why this was a confusing thing to say.)
drocta
10

I've now computed the volumes within the [-a,a]^3 cube for and, or, and the constant 1 function. I was surprised by the results.
(I hadn't considered that the ratios between the volumes will not depend on the size of the cube)
If we select x,y,z uniformly at random within this cube, the probability of getting the and gate is 1/48, the probability of getting the or gate is 2/48, and the probability of getting the constant 1 function is 13/48 (more than 1/4).
This I found quite surprising, because of the constant 1 function requiring 4 half planes to express th... (read more)
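A quick Monte Carlo check of those ratios (my sketch; the activation convention is assumed to be "output 1 iff x·a + y·b + z > 0" for inputs a, b in {0,1}). With that convention the estimates should land near the quoted 1/48, 2/48, and 13/48.

```python
import random
from collections import Counter

def gate(x, y, z):
    # Which 2-input boolean function does the threshold unit (weights x, y; bias z) compute?
    return tuple(int(x * a + y * b + z > 0) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)])

counts = Counter()
n = 200_000
for _ in range(n):
    x, y, z = (random.uniform(-1, 1) for _ in range(3))
    counts[gate(x, y, z)] += 1

for name, truth_table in [("AND", (0, 0, 0, 1)), ("OR", (0, 1, 1, 1)), ("const 1", (1, 1, 1, 1))]:
    print(name, counts[truth_table] / n)  # roughly 1/48 ~ 0.021, 2/48 ~ 0.042, 13/48 ~ 0.271
```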

drocta
10

For the volumes, I suppose that because scaling all of these parameters by the same positive constant doesn't change the function computed, it would make sense to compute the volumes of the corresponding regions of the cube, and this would handle the issues with these regions having unbounded size.
(this would still work with more parameters, it would just be a higher dimensional sphere)
Er, would that give the same thing as the limit if we took the parameters within a cube?
Anyway, at least in this case, if we use the "projected onto the sphere" case, we cou... (read more)

1drocta
I've now computed the volumes within the [-a,a]^3 cube for and, or, and the constant 1 function. I was surprised by the results. (I hadn't considered that the ratios between the volumes will not depend on the size of the cube) If we select x,y,z uniformly at random within this cube, the probability of getting the and gate is 1/48, the probability of getting the or gate is 2/48, and the probability of getting the constant 1 function is 13/48 (more than 1/4). This I found quite surprising, because of the constant 1 function requiring 4 half planes to express the conditions for it. So, now I'm guessing that the ones that required fewer half spaces to specify, are the ones where the individual constraints are already implying other constraints, and so actually will tend to have a smaller volume. On the other hand, I still haven't computed any of them for if projecting onto the sphere, and so this measure kind of gives extra weight to the things in the directions near the corners of the cube, compared to the measure that would be if using the sphere.
2Alex Flint
Yes it does seem challenging to compute realistic complexity measures for such small functions. Perhaps we could just look at the mappings ordered by their volume in parameter space, and check whether the mappings at the top of that ordering "seem" less complex than the mappings at the bottom.
drocta
70

nitpick: the appendix says   possible configurations of the whole grid, while it should say  possible configurations. (Similarly for what it says about the number of possible configurations in the region that can be specified.)

4Alex Flint
Thank you. Fixed.
drocta
Ω2100

This comment I'm writing is mostly because this prompted me to attempt to see how feasible it would be to computationally enumerate the conditions for the weights of small networks like the 2 input 2 hidden layer 1 output in order to implement each of the possible functions. So, I looked at the second smallest case by hand, and enumerated conditions on the weights for a 2 input 1 output no hidden layer perceptron to implement each of the 2 input gates, and wanted to talk about it. This did not result in any insights, so if that doesn't sound interesting, m... (read more)
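For reference, a brute-force sketch of that sort of enumeration (the weight grid and threshold convention are my choices): searching a coarse grid of weights already finds 14 of the 16 two-input functions, with XOR and XNOR being the two that no single threshold unit can implement.

```python
import itertools

def gate(x, y, z):
    # Truth table of the 2-input threshold unit with weights x, y and bias z.
    return tuple(int(x * a + y * b + z > 0) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)])

grid = [i / 4 for i in range(-8, 9)]  # weights and bias in [-2, 2], step 0.25
implementable = {gate(x, y, z) for x, y, z in itertools.product(grid, repeat=3)}

print(len(implementable))             # 14 of the 16 possible truth tables
print((0, 1, 1, 0) in implementable)  # False: XOR is not linearly separable
print((1, 0, 0, 1) in implementable)  # False: XNOR is not linearly separable
```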

3Chris Mingard
Check out https://arxiv.org/pdf/1909.11522.pdf where we do some similar analysis of perceptrons but in higher dimensions. Theorem 4.1 shows that there is an anti-entropy bias - in other words, functions with either mostly 0s or mostly 1s are exponentially more likely to show up than expected under a uniform prior - which holds for perceptrons of any dimension. This proves a (fairly trivial) bias towards simple functions, although it doesn't say anything about why a function like 010101010101... appears more frequently than other functions in the maximum-entropy class.
4Alex Flint
Very very cool. Thank you for this drocta. What would it take to map out the sizes of the volumes corresponding to each of these mappings? Also, could you perhaps compute the exact Kolmogorov complexity of these mappings in some particular description language, since they are so small? It would be super interesting to me to assemble a table of volumes and Kolmogorov complexities for each of these small mappings. It may then be possible to write some code that does the same for 3-input and 4-input mappings.
drocta
20

The link in the rss feed entry for this at https://agentfoundations.org/rss goes to https://www.alignmentforum.org/events/vvPYYTscRXFBvdkXe/ai-safety-beginners-meetup which is a broken link (though, easily fixed by replacing "events" with "posts" in the url) .
[edit: it appears that it is no longer in the rss feed? It showed up in my rss feed reader.]
I think this has also happened with other "event" type posts in the rss feed before, but I may be remembering wrong.
I suspect this is some bug in how the rss feed is generated, but possibly it is a known bug wh... (read more)

2Linda Linsefors
I think this happened because I unselected "Alignment Forum" for this event. To my best understanding, events are not supposed to be Alignment Forum content, and it is a bug that this is even possible. Therefore, I decided that the cooperative thing to do would be not to use this bug. Though I'm not sure what is better, since I think events should be allowed on the Alignment Forum. > I assume that when the event is updated that the additional information will include how to join the meetup? Yes. We'll probably be in Zoom, but I have not decided.  > I am interested in attending. Great, see you there.
drocta
20

The agent/thinker are limited in the time or computational resources available to them, while the predictor is unlimited.

My understanding is that this is generally the situation which is meant. Well, not necessarily unlimited, just with enough resources to predict the behavior of the agent.

I don't see why you call this situation uninteresting.

drocta
30

That something can be modeled using some Turing machine, doesn't imply that it can be any Turing machine.


If I have some simple physical system, such that I can predict how it will behave, well, it can be modeled by a Turing machine, but me being able to predict it doesn't imply that I've solved the halting problem.

A realistic conception of agents in an environment doesn't involve all agents having unlimited compute at every time-step. An agent cannot prevent the universe from continuing simply by getting stuck in a loop and never producing its output for its next action.

drocta
20

Ah, thank you, I see where I misunderstood now. And upon re-reading, I see that it was because I was much too careless in reading the post, to the point that I should apologize. Sorry.
I was thinking that the agents were no longer being trained, already being optimal players, and so I didn't think the judge would need to take into account how their choice would influence future answers. This reading clearly doesn't match what you wrote, at least past the very first part.

If the debaters are still being trained, or the judge can be convinced that the debaters... (read more)

2Joe Collman
Oh no need for apologies: I'm certain the post was expressed imperfectly - I was understanding more as I wrote (I hope!). Often the most confusing parts are the most confused. Since I'm mainly concerned with behaviour-during-training, I don't think the post-training picture is too important to the point I'm making. However, it is interesting to consider what you'd expect to happen after training in the event that the debaters' only convincing "ignore-the-question" arguments are training-signal based. I think in that case I'd actually expect debaters to stop ignoring the question (assuming they know the training has stopped). I assume that a general, super-human question answerer must be able to do complex reasoning and generalise to new distributions. Removal of the training signal is a significant distributional shift, but one that I'd expect a general question-answerer to handle smoothly (in particular, we're assuming it can answer questions about [optimal debating tactics once training has stopped]). [ETA: I can imagine related issues with high-value-information bribery in a single debate: "Give me a win in this branch of the tree, and I'll give you high-value information in another branch", or the like... though it's a strange bargaining situation given that in most setups the debaters have identical information to offer. This could occur during or after training, but only in setups where the judge can give reward before the end of the debate.... Actually I'm not sure on that: if the judge always has the option to override earlier decisions with larger later rewards, then mid-debate rewards don't commit the judge in any meaningful way, so aren't really bargaining chips. So I don't think this style of bribery would work in setups I've seen.]
drocta
20

I am unsure as to what the judge's incentive is to select the result that was more useful, given that they still have access to both answers? Is it just because the judge will want to be such that the debaters would expect them to select the useful answer so that the debaters will provide useful answers, and therefore will choose the useful answers?

If that's the reason, I don't think you would need a committed deontologist to get them to choose a correct answer over a useful answer, you could instead just pick someone who doesn't think very hard about cert... (read more)

2Joe Collman
My expectation is that they'd select the more useful result on the basis that it sends a signal to produce useful results in the future - and that a debater would specifically persuade them to do this (potentially over many steps). I see the situation as analogous to this: The question-creators, judge and debaters are in the same building. The building is on fire, in imminent danger of falling off a cliff, at high risk of enraged elephant stampede... The question-creators, judge and debaters are ignorant of or simply ignoring most such threats. The question creators have just asked the question "What time should we have lunch?". Alice answers "There's a fire!!...", persuades the judge that this is true, and that there are many other major threats. Bob answers "One o'clock would be best...". There's no need for complex/exotic decision-theoretic reasoning on the part of the judge to conclude: "The policy which led the debater to inform me about the fire is most likely to point out other threats in future. The actual question is so unimportant relative to this that answering it is crazy. I want to send a training signal encouraging the communication of urgent, life-saving information, and discouraging the wasting of time on trivial questions while the building burns." Or more simply the judge can just think: "The building's on fire!? Why are you still talking to me about lunch?? I'm picking the sane answer." Of course the judge doesn't need to come up with this reasoning alone - just to be persuaded of it by a debater. I'm claiming that the kind of judge who'll favour "One o'clock would be best..." while the building burns is a very rare human (potentially non-existent?), and not one whose values we'd want having a large impact. More fundamentally, to be confident the QIA fails and that you genuinely have a reliable question-answerer, you must be confident that there (usually) exists no compelling argument in favour of a non-answer. I happen to think the one I've
drocta
20

This reminds me of the "Converse Lawvere Problem" at https://www.alignmentforum.org/posts/5bd75cc58225bf06703753b9/the-ubiquitous-converse-lawvere-problem a little bit, except that the different functions in the codomain have domain which also has other parts to it aside from the main space  . 

As in, it looks like here, we have a space  of values , which includes things such as "likes to eat meat" or "values industriousness" or whatever, where this part can just be handled as some generic nice space   , as one part of ... (read more)

2VojtaKovarik
Yeah, I just meant this simple thing that you can mathematically model as $$f : V \times V \to \mathbb R$$. I suppose it makes sense to consider special cases of this that would have better mathematical properties. But I don't have high-confidence intuitions on which special cases are the right ones to consider. I mostly meant this as a tool that would allow people with different opinions to move their disagreements from "your model doesn't make sense" to "both of our models make sense in theory; the disagreement is an empirical one". (E.g., the value-drift situation from Figure 6 is definitely possible, but that doesn't necessarily mean that this is what is happening to us.)
drocta
30

Thanks! (The way you phrased the conclusion is also much clearer/cleaner than how I phrased it)

drocta
Ω130

I am trying to check that I am understanding this correctly by applying it, though probably not in a very meaningful way:

Am I right in reasoning that, for S ⊆ W, 1_S ◃ C iff ( (C can ensure S), and (every element of S is a result of a combination of a possible configuration of the environment of C with a possible configuration of the agent for C, such that the agent configuration is one that ensures S regardless of the environment configuration)) ?

So, if S = {a,b,c,d} , then

would have  , but, say

... (read more)

5Scott Garrabrant
Yep. There is a single morphism from 1S to ⊥ for every world in S, so 1S◃C means all of these morphisms factor through C.  A morphism from C to ⊥ is basically a column of C and a morphism from 1S to C is basically a row in C, all of whose entries are in S, and these compose to the morphism corresponding to the entry where this column meets this row. Thus 1S◃C if and only if when you delete all rows not entirely in S, the resulting matrix has image S. I think this is equivalent to what you said. I just wrote it out myself because that was the easiest way for me to verify what you said.
drocta
30

There are a few places where I believe you mean to write  a but instead have  instead. For example, in the line above the "Applicability" heading.

I like this.

2johnswentworth
Ah, thanks. I think I got them all now.
drocta
50

As an example, I think the game "both players win if they choose the same option, and lose if they pick different options" has "the two players pick different options, and lose" as one of the feasible outcomes, and it is not on the Pareto frontier, because if they picked the same thing, they would both win, and that would be a Pareto improvement.
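A tiny worked version of that example (the payoff numbers are mine): in the pure coordination game, the mis-coordination outcomes are feasible but Pareto-dominated by the coordination outcomes.

```python
# Pure coordination game: both get payoff 1 if they choose the same option, else both get 0.
outcomes = {("A", "A"): (1, 1), ("A", "B"): (0, 0), ("B", "A"): (0, 0), ("B", "B"): (1, 1)}

def dominates(u, v):
    # u Pareto-dominates v: at least as good for everyone, strictly better for someone.
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))

frontier = {acts for acts, u in outcomes.items()
            if not any(dominates(v, u) for v in outcomes.values())}
print(frontier)  # {('A', 'A'), ('B', 'B')} -- mis-coordination is feasible but not on the frontier
```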

2TurnTrout
Right, I understand how this correctly labels certain cases, but that doesn't seem to address my question?
drocta
Ω020

What came to mind for me before reading the spoiler-ed options, was a variation on #2, with the difference being that, instead of trying to extract P's hypothesis about B, we instead modify T to get a T' which has P replaced with a P' which is a paperclip minimizer instead of maximizer, and then run both, and only use the output when the two agree, or if they give probabilities, use the average, or whatever.

Perhaps this could have an advantage over #2 if it is easier to negate what P is optimizing for than to extract P's model of B. (ed... (read more)

3Donald Hobson
Thanks for a thoughtful comment. Assuming that P and P' are perfectly antialigned, they won't cooperate. However they need to be really antialigned for this to work. If there is some obscure borderline that P thinks is a paperclip, and P' thinks isn't, they can work together to tile the universe with it. I don't think it would be that easy to change evolution into a reproductive fitness minimiser, or to negate a human's values. If P and P' are antialigned, then in the scenario where you only listen to them if they agree, then for any particular prediction, at least one of them will consider disagreeing better than that. The game theory is a little complicated, but they aren't being incentivised to report their predictions. Actually, A has to be able to manage, not only correct and competent adversaries, but deluded and half mad ones too. I think P would find it hard to be inscrutable. It is impossible to obfuscate arbitrary code. I agree with your final point. Though for any particular string X, the fastest turing machine to produce it is the one that is basically print(X). This is why we use short TM's not just fast ones.