All of Past Account's Comments + Replies

Just wanted to give some validation. I left a comment on this post a while ago pointing out how a single user (or apparently a few) can essentially downvote you however they like to silence opinions they don't agree with. Moderation is tricky, and it is important to remember why: most users on a web forum are lurkers, meaning that trying to gather feedback on moderation policies has a biased sampling problem. The irony of likely not being able to leave another comment or engage in discussion is not lost on me.

At first, I thought getting soft-banned meant my... (read more)

6habryka
FWIW, my sense is that the rate-limit system triggering on your account was a mistake, and we tweaked the numbers so that it no longer happens. It still sucks that you got rate-limited for a while, but the numbers are quite different now, and you almost certainly would not have been caught in the manual review that is part of these rate limits.
-7[anonymous]

Hi, I think this is incorrect. I had to wait 7 days to write this comment and then almost forgot to. I wrote a comment critiquing a very long post (which was later removed) and was downvoted (by a single user, I think) after justifying why I wrote the comment with AI assistance. My understanding is that a single user with enough karma power can effectively "silence" any opinion they don't like by downvoting a few comments in an exchange.

I think the site has changed enough over the last several months that I am considering leaving. For me personally, choos... (read more)

3habryka
Huh, that updates me on how quickly rate-limiting kicks in. I don't think it's the case that a single user can effectively silence any opinion here (none of your previous few comments were downvoted by a single user as far as I can tell), but having a rate limit that harsh just because of a single exchange seems quite bad to me. I'll talk to Raemon and Ruby about at least adjusting the values here.
4Raemon
No, because we also have a requirement of a minimum number of downvoters. (I think the current implementation has important flaws and I do still need to improve it; that has been on my TODO list and hopefully will get done soon.) But even in the current implementation, a single downvote can't rate-limit you.
2cfoster0
It's absolutely fine if you want to use AI to help summarize content, and then check that content and endorse it. I still ask that you please flag it as such, so the reader can make an informed decision about how to read and respond to the content.
2kwiat.dev
My point is that your comment was extremely shallow, with a bunch of irrelevant information, and in general plagued with the annoying ultra-polite ChatGPT style - in total, not contributing anything to the conversation. You're now defensive about it and skirting around answering the question in the other comment chain ("my endorsed review"), so you clearly intuitively see that this wasn't a good contribution. Try to look inwards and understand why.
2cfoster0
Is this an AI summary (or your own writing)? If so, would you mind flagging it as such?
3kwiat.dev
Is it a thing now to post LLM-generated comments on LW?

So are you suggesting that ChatGPT gets aligned to the values of the human contractor(s) that provide data during finetuning, and then carries these values forward when interacting with users?

You are correct that this appears to stand in contrast to one of the key benefits of CIRL games, namely that they allow the AI to continuously update towards the user's values. The argument I present is that ChatGPT can still learn something about the preferences of the user it is interacting with through in-context value learning. During deployment, ChatGPT can then learn preferences in-context, allowing for continuous updating towards the user's values, as in the CIRL game.

The reward comes from the user, who ranks candidate responses from ChatGPT. This is discussed more in OpenAI's announcement. I edited the post to clarify this.
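For concreteness, rankings like these are typically turned into a scalar reward via a pairwise (Bradley-Terry-style) loss on a reward model. Below is a minimal sketch of that idea, with made-up feature vectors standing in for responses and a hidden preference direction standing in for the labeler; it is a generic illustration, not OpenAI's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: each candidate response is a made-up feature vector, and the "labeler"
# prefers whichever response a hidden preference direction scores higher.
dim = 8
true_w = rng.normal(size=dim)
pairs = []
for _ in range(500):
    a, b = rng.normal(size=(2, dim))
    pairs.append((a, b) if true_w @ a > true_w @ b else (b, a))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Fit a linear reward model r(x) = w.x by minimizing -log sigmoid(r(preferred) - r(rejected)).
w = np.zeros(dim)
lr = 0.5
for _ in range(200):
    grad = np.zeros(dim)
    for preferred, rejected in pairs:
        p = sigmoid(w @ (preferred - rejected))
        grad += -(1.0 - p) * (preferred - rejected)
    w -= lr * grad / len(pairs)

agreement = np.mean([float(w @ p > w @ r) for p, r in pairs])
print(f"reward model agrees with the rankings on {agreement:.1%} of pairs")
```

The learned reward model is then what the policy is optimized against; the rankings themselves are only seen during this fitting step.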

1Rachel Freedman
Thanks for the clarification! From OpenAI's announcement, it looks like this ranking only occurs during the finetuning portion of training (Step 2). But the user doesn't have the opportunity to provide this feedback after deployment. So are you suggesting that ChatGPT gets aligned to the values of the human contractor(s) that provide data during finetuning, and then carries these values forward when interacting with users? I'm asking because one of the key benefits of CIRL games (also called "assistance games") is that they allow the AI to continuously update towards the user's values, without freezing for deployment, and I don't fully understand the connection here.
Past Account*Ω-35-5

[Deleted]

5VojtaKovarik
Explanation for my strong downvote/disagreement: Sure, in the ideal world, this post would have much better scholarship. In the actual world, there are tradeoffs between the number of posts and the quality of scholarship. The cost is both the time and the fact that doing a literature review is a chore. If you demand good scholarship, people will write slower/less. With some posts this is a good thing. With this post, I would rather have atrocious scholarship and a 1% higher chance of the sequence having one more post in it. (Hypothetical example. I expect the real tradeoffs are less favourable.)
Past Account*Ω-32-10

[Deleted]

7janus
Thanks for suggesting "Speculations concerning the first ultraintelligent machine". I knew about it only from the intelligence explosion quote and didn't realize it said so much about probabilistic language modeling. It's indeed ahead of its time and exactly the kind of thing I was looking for but couldn't find w/r/t premonitions of AGI via SSL and/or neural language modeling.

I'm sure there's a lot of relevant work throughout the ages (saw this tweet today: "any idea in machine learning must be invented three times, once in signal processing, once in physics and once in the soviet union"), it's just that I'm unsure how to find it. Most people in the AI alignment space I've asked haven't known of any prior work either. So I still think it's true that "the space of large self-supervised models hasn't received enough attention". Whatever scattered prophetic works existed were not sufficiently integrated into the mainstream of AI or AI alignment discourse. The situation was that most of us were terribly unprepared for GPT. Maybe because of our "lack of scholarship".

Of course, after GPT-3 everyone's been talking about large self-supervised models as a path to or foundation of AGI. My observation about the lack of foresight on SSL was referring mainly to pre-GPT, and after GPT the ontological inertia of not talking about SSL means post-GPT discourse has been forced into clumsy frames. I know about "The risks and opportunities of foundation models" - it's a good overview of SSL capabilities and "next steps", but it's still very present-day focused and descriptive rather than speculation in an exploratory-engineering vein, which I still feel is missing. "Foundation models" has hundreds of references. Are there any in particular that you think are relevant?
6Raemon
Sorry, was being kinda lazy and hoping someone had already thought about this.
This was the newer DeepMind one: https://www.lesswrong.com/posts/mTGrrX8SZJ2tQDuqz/deepmind-generally-capable-agents-emerge-from-open-ended?commentId=bosARaWtGfR836shY#bosARaWtGfR836shY
I was motivated to post by this algorithm from China I heard about today: https://www.facebook.com/nellwatson/posts/10159870157893559
I think this is the older DeepMind paper: https://deepmind.com/research/publications/2019/playing-atari-deep-reinforcement-learning
7Rohin Shah
I am more annoyed by the sheer confidence people have. If they were saying "this is a possibility, let's investigate" that seems fine. Re: the rest of your comment, I feel like you are casting it into a decision framework while ignoring the possible decision "get more information about whether there is a problem or not", which seems like the obvious choice given lack of confidence. If at some point you become convinced that it is impossible / too expensive to get more information (I'd be really suspicious, but it could be true) then I'd agree you should bias towards worry. I would guess that the fact that people regularly fail to inhabit the mindset of "I don't know that this is a problem, let's try to figure out whether it is actually a problem" is a source of tons of problems in society (e.g. anti-vaxxers, worries that WiFi radiation kills you, anti-GMO concerns, worries about blood clots for COVID vaccines, ...). Admittedly in these cases the people are making a mistake of being confident, but even if you fixed the overconfidence they would continue to behave similarly if they used the reasoning in your comment. Certainly I don't personally know why you should be super confident that GMOs aren't harmful, and I'm unclear on whether humanity as a whole has the knowledge to be super confident in that.
3brp
What's the practical difference between "text" and one-hots of said "text"? One-hots are the standard way of inputting text into models. It is only recently that we expect models to learn their preferred encoding for raw text (cf. transformers). By taking a small shortcut, the authors of this paper get to show off their agent work without loss of generality: one could still give one-hot instructions to an agent that is learning to act in real life.
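To make the "small shortcut" concrete, here is a toy one-hot encoding of an instruction over a hypothetical five-word vocabulary (the vocabulary and instruction are invented for illustration; a real agent would use a tokenizer over the instruction text):

```python
import numpy as np

def one_hot(token_ids, vocab_size):
    """Map a sequence of integer token ids to a (seq_len, vocab_size) one-hot matrix."""
    out = np.zeros((len(token_ids), vocab_size), dtype=np.float32)
    out[np.arange(len(token_ids)), token_ids] = 1.0
    return out

# Hypothetical toy vocabulary and instruction.
vocab = {"go": 0, "to": 1, "the": 2, "red": 3, "cube": 4}
instruction = ["go", "to", "the", "red", "cube"]
encoded = one_hot([vocab[w] for w in instruction], vocab_size=len(vocab))
print(encoded.shape)  # (5, 5): one row per token, one column per vocabulary entry
```

The information content is the same as the raw text; only the job of learning an embedding is shifted from the agent to the fixed encoding.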
5Quintin Pope
The summary says they use text and a search for “text” in the paper gives this on page 32: “In these past works, the goal usually consists of the position of the agent or a target observation to reach, however some previous work uses text goals (Colas et al., 2020) for the agent similarly to this work.” So I thought they provided goals as text. I’ll be disappointed if they don’t. Hopefully, future work will do so (and potentially use pretrained LMs to process the goal texts).
7TurnTrout
I don't understand your point in this exchange. I was being specific about my usage of model; I meant what I said in the original post, although I noted room for potential confusion in my comment above. However, I don't know how you're using the word.  You used the word 'model' in both of your prior comments, and so the search-replace yields "state-abstraction-irrelevant abstractions." Presumably not what you meant? That's not a "concrete difference." I don't know what you mean when you talk about this "third alternative." You think you have some knockdown argument - that much is clear - but it seems to me like you're talking about a different consideration entirely. I likewise feel an urge to disengage, but if you're interested in explaining your idea at some point, message me and we can set up a higher-bandwidth call.
7TurnTrout
I read your formalism, but I didn't understand what prompted you to write it. I don't yet see the connection to my claims. Yeah, I don't want you to spend too much time on a bulletproof grounding of your argument, because I'm not yet convinced we're talking about the same thing.  In particular, if the argument's like, "we usually express reward functions in some featurized or abstracted way, and it's not clear how the abstraction will interact with your theorems" / "we often use different abstractions to express different task objectives", then that's something I've been thinking about but not what I'm covering here. I'm not considering practical expressibility issues over the encoded MDP: ("That's also a claim that we can, in theory, specify reward functions which distinguish between 5 googolplex variants of red-ghost-game-over.") If this doesn't answer your objection - can you give me an english description of a situation where the objection holds? (Let's taboo 'model', because it's overloaded in this context)
7TurnTrout
Why would we need that, and what is the motivation for "models"? The moment we give the agent sensors and actions, we're done specifying the rewardless MDP (and its model).

ETA: potential confusion - in some MDP theory, the "model" is a model of the environment dynamics. E.g. in deterministic environments, the model is shown with a directed graph. I don't use "model" to refer to an agent's world model over which it may have an objective function. I should have chosen a better word, or clarified the distinction.

If, by "tasks", you mean "different agent deployment scenarios" - I'm not claiming that. I'm saying that if we want to predict what happens, we:
1. Consider the underlying environment (assumed Markovian).
2. Consider different state/action encodings we might supply the agent.
3. For each, fix a reward function distribution (what goals we expect to assign to the agent).
4. See what the theory predicts.
There's a further claim (which seems plausible, but which I'm not yet making) that (2) won't affect (4) very much in practice. The point of this post is that if you say "the MDP has a different model", you're either disagreeing with (1) the actual dynamics, or claiming that we will physically supply the agent with a different state/action encoding (2). I don't follow. Can you give a concrete example?

Because . They are the same. Does that help?

2johnswentworth
I don't have any empirical evidence, but we can think about what a flat minimum with high noise would mean. It would probably mean the system is able to predict some data points very well, and other data points very poorly, and both of these are robust: we can make large changes to the parameters while still predicting the predictable data points about-as-well, and the unpredictable data points about-as-poorly. In human terms, it would be like having a paradigm in which certain phenomena are very predictable, and other phenomena look like totally-random noise without any hint that they even could be predictable. Not sure what it would look like in the perfect-training-prediction regime, though.

The term π is meant to be a posterior distribution after seeing data. If you have a good prior π₀ you could take π = π₀. However, note the resulting loss could be high. You want to trade off the cost of updating the prior against the loss reduction.
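For reference, one standard PAC-Bayes bound exhibits exactly this trade-off (a generic form for a bounded loss in [0, 1]; the deleted post may have used a different variant). With probability at least 1 − δ over the n training samples,

$$\mathbb{E}_{h \sim \pi}\big[L(h)\big] \;\le\; \mathbb{E}_{h \sim \pi}\big[\hat{L}(h)\big] + \sqrt{\frac{\mathrm{KL}(\pi \,\|\, \pi_0) + \ln(2\sqrt{n}/\delta)}{2n}}.$$

Taking π = π₀ makes the KL term vanish but leaves the empirical loss wherever the prior puts it; moving π toward low-loss hypotheses shrinks the first term at the cost of the second.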

For example, say we have a neural network. Then the prior π₀ would be the distribution over initializations and the posterior π would be the distribution of outputs from SGD.

(Btw thanks for the correction)

1Steveot
Thanks, I finally got it. What I just now fully understood is that the final inequality holds with high π₀ⁿ probability (i.e., as you say, π₀ is the data), while the learning bound or loss reduction is given for π.
2ChristianKl
They don't speak about having a PhD but about the ability to get into a top 5 graduate program. Many people who do have the ability to get into a top 5 program don't actually do so but pursue other directions. The number of people with that ability level is not directly dependent on the number of PhDs that are given out.
2interstice
Right, but trying to fit an unknown function with linear combinations of those features might be extremely data-inefficient, such that it is basically unusable for difficult tasks. Of course you could do better if you're not restricted to linear combinations -- for instance, if the map is injective you could invert back to the original space and apply whatever algorithm you wanted. But at that point you're not really using the Fourier features at all. In particular, the NTK always learns a linear combination of its features, so it's the efficiency of linear combinations that's relevant here.

You originally said that the NTK doesn't learn features because its feature class already has a good representation at initialization. What I was trying to convey (rather unclearly, admittedly) in response is:

A) There exist learning algorithms that have universal-approximating embeddings at initialization yet learn features. If we have an example of P and !Q, P-->Q cannot hold in general, so I don't think it's right to say that the NTK's lack of feature learning is due to its universal-approximating property.

B) Although the NTK's representation may be capable of approximating arbitrary functions, it will probably be very slow at learning some of them, perhaps so slow that using it is infeasible. So I would dispute that it already has 'good' representations. While it's universal in one sense, there might be some other sense of 'universal efficiency' in which it's lacking, and where feature-learning algorithms can outperform it.

I agree that in practice there's likely to be some relationship between universal approximation and efficiency, I just think it's worth distinguishing them conceptually. Thanks for the paper link BTW, it looks interesting.
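One way to make the data-efficiency point concrete (a toy sketch; it assumes the "features" under discussion are random Fourier features of a scalar input, which may not match the original example): fit a hard target by ridge regression on a fixed random feature map and watch how the test error scales with the number of training points.

```python
import numpy as np

rng = np.random.default_rng(0)

# A fixed random Fourier feature map: phi_j(x) = cos(omega_j * x + b_j).
n_features = 200
freqs = rng.normal(scale=10.0, size=n_features)
phases = rng.uniform(0, 2 * np.pi, size=n_features)

def features(x):
    return np.cos(np.outer(x, freqs) + phases)

def target(x):
    return np.sign(np.sin(20 * x))   # a discontinuous target is hard for smooth fixed features

x_test = np.linspace(-1, 1, 2000)
for n_train in [20, 100, 500]:
    x_train = rng.uniform(-1, 1, size=n_train)
    # Linear (ridge) regression on the fixed features -- the only kind of fit an
    # NTK-style model performs.
    Phi = features(x_train)
    w = np.linalg.solve(Phi.T @ Phi + 1e-3 * np.eye(n_features), Phi.T @ target(x_train))
    test_mse = np.mean((features(x_test) @ w - target(x_test)) ** 2)
    print(f"{n_train:4d} training points -> test MSE {test_mse:.3f}")
```

The fixed features can represent the target arbitrarily well as they grow, but how many samples a linear fit needs to get there is a separate question, which is the efficiency gap being pointed at.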
1interstice
Ah, rereading your original comment more carefully I see that you indeed didn't say anything about 'universal learning'. You're quite right that the NTK is a universal function approximator. My apologies. However, I still disagree that this is the reason that the NTK doesn't learn features. I think that 'universal function approximation' and 'feature learning' are basically unrelated dimensions along which a learning algorithm can vary. That is, it's quite possible to imagine a learning algorithm which constructs a sequence of different embeddings, all of which are universal approximators. The paper by Greg Yang I linked gives an example of such an algorithm (I don't think he explicitly proves this but I'm pretty sure it's true).

What I was trying to get at with the 'universal learning' remarks is that, although the NTK does indeed contain all finite embeddings, I believe that it does not do so in a very efficient way -- it might require disproportionately many training points to pick out what are, intuitively, fairly simple embeddings. I believe this is what is behind the poor performance of empirical NTKs compared to SGD-trained nets, as I brought up in this comment, and it ultimately explains why algorithms that do 'feature learning' can outperform those that don't -- the feature-learning algorithms are able to find more efficient embeddings for a given set of inputs (of course, it's possible to imagine a fixed embedding that's 'optimally efficient' in some way, but as far as I'm aware the NTK has no such property). This issue of 'embedding efficiency' seems only loosely related to the universal approximation property.

To formalize this, it would be nice to develop a theory of universal inference in the setting of classification problems akin to Solomonoff induction. To effectively model this in an asymptotic theory, I think it might be necessary to increase the dimension of the model input along with the number of data points, since otherwise all universal approxima...
2interstice
There's a big difference between 'universal learner' and 'fits any smooth function on a fixed input space'. The 'universal learner' property is about data efficiency: do you have bounded regret compared to any learning algorithm in some wide class? Solomonoff induction has this property with respect to computable predictors on binary strings, for instance. There are lots of learning algorithms able to fit any finite binary sequence but which are not universal. I haven't seen a good formalism for this in the neural net case, but I think it would involve letting the input dimension increase with the number of data points, and comparing the asymptotic performance of various algorithms.
1interstice
I've never heard of any result suggesting this, what's your argument? I suspect the opposite -- by the central limit theorem the partial derivatives and activations at each layer tend toward samples from a fixed distribution(differing per layer but fixed across neurons). I think this means that the NTK embedding is 'essentially finite' and actually not universal(though I'm not sure). Note that to show universality it's not enough to show that all embeddings can be found, you'll also need an argument showing that their density in the NTK embedding is bounded above zero.
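As a quick numerical check of the "fixed distribution" intuition (a toy sketch with a one-hidden-layer ReLU net; the widths, inputs, and scaling are arbitrary choices, not from the original discussion): compute the empirical NTK entry ⟨∇θ f(x₁), ∇θ f(x₂)⟩ at independent random initializations and watch its spread shrink as width grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def ntk_entry(x1, x2, width, n_inits=50):
    """Empirical NTK value <grad_theta f(x1), grad_theta f(x2)> for the net
    f(x) = v . relu(W x) / sqrt(width), across independent random initializations."""
    d = x1.shape[0]
    vals = []
    for _ in range(n_inits):
        W = rng.normal(size=(width, d))
        v = rng.normal(size=width)
        def grads(x):
            pre = W @ x
            act = np.maximum(pre, 0.0)
            g_v = act / np.sqrt(width)                                      # d f / d v
            g_W = ((v * (pre > 0))[:, None] * x[None, :]) / np.sqrt(width)  # d f / d W
            return np.concatenate([g_v, g_W.ravel()])
        vals.append(grads(x1) @ grads(x2))
    return np.array(vals)

x1 = np.array([1.0, 0.5])
x2 = np.array([-0.3, 2.0])
for width in [10, 100, 1000]:
    vals = ntk_entry(x1, x2, width)
    print(f"width {width:5d}: mean {vals.mean():.3f}, std across inits {vals.std():.3f}")
```

Concentration of the kernel at initialization doesn't by itself settle the universality question either way; it just illustrates why, in the wide limit, the NTK behaves like one fixed embedding rather than a learned one.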

Your example is interesting and clarifies exchange rates. However,

The shadow price quantifies the opportunity cost, so if I'm paid my shadow price, then that's just barely enough to cover my opportunity cost.

This is an interpretive point I'd like to focus on. When you move a constraint, in this case via the price, the underlying equilibrium of the optimization problem shifts. From this perspective, your usage of the word 'barely' stops making sense to me. If you were to 'overshoot', you wouldn't be optimal in the new optimization problem.
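A tiny numerical sketch of the quantity under discussion (a made-up toy LP, not the example from the original exchange): the shadow price is the rate at which the optimal value changes as the constraint is relaxed, and relaxing the constraint moves you to a genuinely different optimum.

```python
from scipy.optimize import linprog

# Toy problem: maximize 3x + 2y subject to x + y <= b (a shared "time budget"),
# 0 <= x <= 4, 0 <= y <= 3. linprog minimizes, so the objective is negated.
def optimal_value(b):
    res = linprog(c=[-3.0, -2.0],
                  A_ub=[[1.0, 1.0]], b_ub=[b],
                  bounds=[(0, 4), (0, 3)])
    return -res.fun

b = 5.0
eps = 1e-4
shadow_price = (optimal_value(b + eps) - optimal_value(b)) / eps
print(f"optimal value at b={b}: {optimal_value(b):.3f}")
print(f"numerical shadow price of the budget constraint: {shadow_price:.3f}")
```

Here the budget's shadow price comes out to about 2: each extra unit of budget is worth 2 at the margin, and the optimizer attains that value by moving to a different point than before.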

At this point I understand ... (read more)

I suppose this is the most correct answer. I'm not really updating very much though. From my perspective I'll continue to see cheerful price as a psychological/subjective reinvention of shadow price.

Edit: It seems clear that, in this context, the shadow price isn't exactly measurable. The cheerful price is just an upper estimate of the shadow price.

2Joar Skalse
Yes. I imagine this is why overtraining doesn't make a huge difference. See e.g., page 47 in the main paper.

I'm going to have to spend some time unpacking the very compact notation in the post, but here are my initial reactions.

I should apologize a bit for that. To a degree I wasn't really thinking about any of the concepts in the title and only saw the connection later.

First, very clean proof of the lemma, well done there.

Thanks!

Second... if I'm understanding this correctly, each neuron activation (or set of neuron activations?) would contain all the information from some-part-of-data relevant to some-other-part-of-data and the output.

To be honest, I... (read more)

2ChristianKl
When it comes to a forum like this, it's important to incentivise people who write posts. Part of the incentive is that people control the posts they write to say what they want to say. A system that works like Google Docs where the author can choose to accept or deny requests for change would likely work better.
2Ben Pace
I did suspect you'd confused it with the Alignment Newsletter :)

Much of the same is true of scientific journals. Creating a place to share and publish research is a pretty key piece of intellectual infrastructure, especially for researchers to create artifacts of their thinking along the way. 

The point about being 'cross-posted' is where I disagree the most. 

This is largely original content that counterfactually wouldn't have been published, or occasionally would have been published but to a much smaller audience. What Failure Looks Like wasn't crossposted, Anna's piece on reality-revealing puzzles wasn't cro... (read more)

3Ben Pace
 By "AN" do you mean the AI Alignment Forum, or "AIAF"?
3Viliam
Ironically, some people already feel threatened by the high standards here. Setting them higher probably wouldn't result in more good content. It would result in less mediocre content, but probably also less good content, as authors who sometimes write a mediocre article and sometimes a good one would get discouraged and give up. Ben Pace gives a few examples of great content in the next comment. It would be better to make it easier to separate the good content from the rest, but that's what the reviews are for. Well, only one review so far, if I remember correctly. I would love to see reviews of pre-2018 content (maybe multiple years in one review, if those years were less productive). Then I would love to see the winning content get the same treatment as the Sequences -- edit it and arrange it into a book, and make it "required reading" for the community (available as a free PDF).
4TurnTrout
The freshly updated paper answers this question in great detail; see section 6 and also appendix B.
2TurnTrout
Great question. One thing you could say is that an action is power-seeking compared to another if your expected (non-dominated subgraph; see Figure 19) power is greater for that action than for the other. Power is kinda weird when defined for optimal agents, as you say - when γ=1, POWER can only decrease. See Power as Easily Exploitable Opportunities for more on this.

Shortly after Theorem 19, the paper says: "In appendix C.6.2, we extend this reasoning to k-cycles (k > 1) via theorem 53 and explain how theorem 19 correctly handles fig. 7". In particular, see Figure 19. The key insight is that Theorem 19 talks about how many agents end up in a set of terminal states, not how many go through a state to get there. If you have two states with disjoint reachable terminal state sets, you can reason about the phenomenon pretty easily. Practically speaking, this should often suffice: for example, the off-switch state is disjoint from everything else.

If not, you can sometimes consider the non-dominated subgraph in order to regain disjointness. This isn't in the main part of the paper, but basically you toss out transitions which aren't part of a trajectory which is strictly optimal for some reward function. Figure 19 gives an example of this. The main idea, though, is that you're reasoning about what the agent's end goals tend to be, and then say "it's going to pursue some way of getting there with much higher probability, compared to this small set of terminal states (i.e. shutdown)". Theorem 17 tells us that in the limit, cycle reachability totally controls POWER.

I think I still haven't clearly communicated all my mental models here, but I figured I'd write a reply now while I update the paper. Thank you for these comments, by the way. You're pointing out important underspecifications. :) I think one problem is that power-seeking agents are generally not that corrigible, which means outcomes are extremely sensitive to the initial specification.
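As a toy illustration of the "how many agents end up in a set of terminal states" reasoning (a made-up example, not the paper's formal POWER definition): sample reward functions with iid uniform rewards over terminal states and count how often the optimal policy ends in the single-state "shutdown" set versus a larger reachable set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: from the start state the agent can either hit the off-switch
# (1 terminal state) or stay active, from where it can reach any of 5 other terminal
# states. With iid uniform terminal rewards, the optimal policy simply ends in the
# reachable terminal state with maximal reward.
terminal_sets = {"shutdown": 1, "stay_active": 5}
n_samples = 50_000

counts = {name: 0 for name in terminal_sets}
for _ in range(n_samples):
    rewards = {name: rng.uniform(size=k) for name, k in terminal_sets.items()}
    best = max(terminal_sets, key=lambda name: rewards[name].max())
    counts[best] += 1

for name, c in counts.items():
    print(f"fraction of reward functions whose optimal policy ends in '{name}': "
          f"{c / n_samples:.3f}")
```

With one shutdown state against five others, only about a sixth of sampled goals favor shutting down; the theorems make this kind of counting argument rigorous for the actual environment structure.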