All of DragonGod's Comments + Replies

I'm currently working through Naive Set Theory (alongside another text). I'll take this as a recommendation to work through the other textbooks later.

My maths level is insufficient for the course, I'd guess.

I would appreciate it if videos of the meetings could be recorded. Or maybe I should just stick around and hope this will be run again next year.

DragonGod100

When I saw the beginning/title I thought the post would be a refutation of the material scarcity thesis; I found myself disappointed it is not.

6Matthew Barnett
I suppose that means it might be worth writing an additional post that more directly responds to the idea that AGI will end material scarcity. I agree that thesis deserves a specific refutation.

There is not an insignificant sense of guilt/of betraying myself from 2023 and my ambitions from before.

And I don't want to just end up doing irrelevant TCS research that only a few researchers in a niche field will ever care about.

It's not high impact research.

And it's mostly just settling. I get the sense that I enjoy theoretical research, I don't currently feel poised to contribute to the AI safety problem, I seem to have an unusually good (at least it appears so to my limited understanding) opportunity to pursue a boring TCS PhD in some niche field tha... (read more)

I still want to work on technical AI safety eventually.

I feel like I'm on a path much further from being directly useful in 2025 than the one I felt I was on in 2023.

And taking a detour to do a TCS PhD that isn't directly pertinent to AI safety (current plan) feels like not contributing.

Cope is that becoming a strong TCS researcher will make me better poised to contribute to the problem, but short timelines could make this path less viable.

[Though there's nothing saying I can't try to work on AI on the side even if it isn't the focus of my PhD.]

6DragonGod
There is not an insignificant sense of guilt/of betraying myself from 2023 and my ambitions from before. And I don't want to just end up doing irrelevant TCS research that only a few researchers in a niche field will ever care about. It's not high impact research. And it's mostly just settling. I get the sense that I enjoy theoretical research, I don't currently feel poised to contribute to the AI safety problem, I seem to have an unusually good (at least it appears so to my limited understanding) opportunity to pursue a boring TCS PhD in some niche field that few people care about. I don't think I'll be miserable pursuing the boring TCS PhD or not enjoy it, or anything of the sort. It's just not directly contributing to what I wanted to contribute to. It's somewhat sad and it's undignified (but it's less undignified than the path I thought I was on at various points in the last 15 months).
DragonGod123

I think LW is a valuable intellectual hub and community.

Haven't been an active participant of late, but it's still a service I occasionally find myself relying on explicitly, and I prefer the world where it continues to exist.

[I donated $20. Am unemployed and this is a nontrivial fraction of my disposable income.]

o1's reasoning trace also does this for different languages (IIRC I've seen Chinese and Japanese and other languages I don't recognise/recall), usually an entire paragraph not a word, but when I translated them it seemed to make sense in context.

This is not a rhetorical question. :) What do you mean by "probability" here?

Yeah, since posting this question:

I have updated towards thinking that it's in a sense not obvious/not clear what exactly "probability" is supposed to be interpreted as here.

And once you pin down an unambiguous interpretation of probability the problem dissolves.

I had a firm notion in mind for what I thought probability meant. But Rafael Harth's answer really made me unconfident that the notion I had in mind was the right notion of probability for the question.

7TsviBT
I think the question is underdefined. Some bets are posed once per instance of you, some bets are posed once per instance of a world (whatever that means), etc.

My current position now is basically:

Actually, I'm less confident and now unsure.

Harth's framing was presented as an argument re: the canonical Sleeping Beauty problem.

And the question I need to answer is: "should I accept Harth's frame?"

I am at least convinced that it is genuinely a question about how we define probability.

There is still a disconnect though.

While I agree with the frequentist answer, it's not clear to me how to backpropagate this in a Bayesian framework.

Suppose I treat myself as identical to all other agents in the reference class.

I kno

... (read more)

I'm curious how your conception of probability accounts for logical uncertainty?

3Rafael Harth
I count references within each logical possibility and then multiply by their "probability". Here's a super contrived example to explain this. Suppose that if the last digit of pi is between 0 and 3, Sleeping Beauty experiments work as we know them, whereas if it's between 4 and 9, everyone in the universe is miraculously compelled to interview Sleeping Beauty 100 times if the coin is tails. In this case, I think P(coin heads|interviewed) is 0.4 · (1/3) + 0.6 · (1/101). So it doesn't matter how many more instances of the reference class there are in one logical possibility; they don't get "outside" their branch of the calculation. So in particular, the presumptuous philosopher problem doesn't care about number of classes at all. In practice, it seems super hard to find genuine examples of logical uncertainty and almost everything is repeated anyway. I think the presumptuous philosopher problem is so unintuitive precisely because it's a rare case of actual logical uncertainty where you genuinely cannot count classes.
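Harth's arithmetic can be sketched as a small counting computation. This is a paraphrase of his contrived example, not code from the thread; the branch probabilities and interview counts come from his setup.

```python
from fractions import Fraction

# Two logical possibilities, weighted by their "probability":
# with weight 0.4 (last digit of pi in 0-3) the standard setup runs
# (1 interview on heads, 2 on tails); with weight 0.6 (digit in 4-9)
# tails yields 100 interviews instead. References are counted *within*
# each branch, then the branch answers are mixed by the branch weights.

def p_heads_given_interviewed(tails_interviews: int) -> Fraction:
    """Thirder-style counting within one logical possibility."""
    heads_refs = 1                  # one awakening if heads
    tails_refs = tails_interviews   # awakenings if tails
    return Fraction(heads_refs, heads_refs + tails_refs)

branches = [
    (Fraction(4, 10), p_heads_given_interviewed(2)),    # standard: 1/3
    (Fraction(6, 10), p_heads_given_interviewed(100)),  # modified: 1/101
]

answer = sum(w * v for w, v in branches)
print(answer)  # 211/1515, i.e. 0.4*(1/3) + 0.6*(1/101)
```

Using exact fractions makes it easy to see that each reference count stays inside its own branch of the mixture.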
DragonGod150

So in this case, I agree that if this experiment is repeated multiple times and every Sleeping Beauty version created answered tails, the reference class of Sleeping Beauty agents would have many more correct answers than if the experiment is repeated many times and every Sleeping Beauty created answered heads.

I think there's something tangible here and I should reflect on it.

I separately think though that if the actual outcome of each coin flip was recorded, there would be a roughly equal distribution between heads and tails.

And when I was thinking t... (read more)

7Rafael Harth
What I'd say is that this corresponds to the question, "someone tells you they're running the Sleeping Beauty experiment and just flipped a coin; what's the probability that it's heads?". Different reference class, different distribution; probability now is 0.5. But this is different from the original question, where we are Sleeping Beauty.
Measure101

I separately think though that if the actual outcome of each coin flip was recorded, there would be a roughly equal distribution between heads and tails.

Importantly, this is counting each coinflip as the "experiment", whereas the above counts each awakening as the "experiment". It's okay that different experiments would see different outcome frequencies.
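Measure's distinction can be checked directly with a short simulation. This is a sketch under the standard setup (one awakening on heads, two on tails), not code from the thread.

```python
import random

def run_experiments(n: int, seed: int = 0):
    """Simulate n Sleeping Beauty experiments (heads: 1 awakening, tails: 2)."""
    rng = random.Random(seed)
    coin_heads = 0   # per-coinflip count
    awakenings = []  # per-awakening record of the coin state
    for _ in range(n):
        heads = rng.random() < 0.5
        coin_heads += heads
        awakenings.extend([heads] * (1 if heads else 2))
    per_flip = coin_heads / n
    per_awakening = sum(awakenings) / len(awakenings)
    return per_flip, per_awakening

per_flip, per_awakening = run_experiments(100_000)
print(per_flip)       # ≈ 0.5: counting each coinflip as the "experiment"
print(per_awakening)  # ≈ 1/3: counting each awakening as the "experiment"
```

Both frequencies are simultaneously true of the same simulated world; which one is "the probability" depends on which reference class the question picks out.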

2DragonGod
My current position now is basically:

I mean I am not convinced by the claim that Bob is wrong.

Bob's prior probability is 50%. Bob sees no new evidence to update this prior so the probability remains at 50%.

I don't favour an objective notion of probabilities. From my OP:

2. Bayesian Reasoning

  • Probability is a property of the map (agent's beliefs), not the territory (environment).
  • For an observation O to be evidence for a hypothesis H, P(O|H) must be > P(O|¬H).
  • The wake-up event is equally likely under both Heads and Tails scenarios, thus provides no new information to update priors.
  • The o
... (read more)
5Charlie Steiner
Yes, Bob is right. Because the probability is not a property of the coin. It's 'about' the coin in a sense, but it also depends on Bob's knowledge, including knowledge about location in time (Dave) or possible worlds (Carol).
DragonGod4-1

I mean I think the "gamble her money" interpretation is just a different question. It doesn't feel to me like a different notion of what probability means, but just betting on a fair coin but with asymmetric payoffs.

The second question feels closer to actually an accurate interpretation of what probability means.

7Gurkenglas
https://www.lesswrong.com/posts/Mc6QcrsbH5NRXbCRX/dissolving-the-question
DragonGodΩ120

i.e. if each forecaster  has a first-order belief , and  is your second-order belief about which forecaster is correct, then  should be your first-order belief about the election.

I think there might be a typo here. Did you instead mean to write: "" for the second order beliefs about the forecasters?
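The formulas here were lost in extraction. Assuming the intended claim is the standard mixture (your first-order belief is the q-weighted average of the forecasters' beliefs), a minimal illustration with made-up numbers:

```python
# Assumed reconstruction of the stripped formulas: aggregate belief is the
# q-weighted average of forecaster beliefs. All numbers are hypothetical.
p = [0.55, 0.70, 0.40]  # forecasters' first-order beliefs (illustrative)
q = [0.50, 0.30, 0.20]  # second-order belief that each forecaster is correct

aggregate = sum(qi * pi for qi, pi in zip(q, p))
print(aggregate)  # the q-weighted average; mathematically 0.565
```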

The claim is that given the presence of differential adversarial examples, the optimisation process would adjust the parameters of the model such that its optimisation target is the base goal.

That was it, thanks!

Probably sometime last year, I posted on Twitter something like: "agent values are defined on agent world models" (or similar) with a link to a LessWrong post (I think the author was John Wentworth).

I'm now looking for that LessWrong post.

My Twitter account is private and search is broken for private accounts, so I haven't been able to track down the tweet. If anyone has guesses for what the post I may have been referring to was, do please send it my way.

3Dalcy
The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables

Most of the catastrophic risk from AI still lies in superhuman agentic systems.

Current frontier systems are not that (and IMO not poised to become that in the very immediate future).

I think AI risk advocates should be clear that they're not saying GPT-5/Claude Next is an existential threat to humanity.

[Unless they actually believe that. But if they don't, I'm a bit concerned that their message is being rounded up to that, and when such systems don't reveal themselves to be catastrophically dangerous, it might erode their credibility.]

Immigration is such a tight constraint for me.

My next career steps after I'm done with my TCS Masters are primarily bottlenecked by "what allows me to remain in the UK" and then "keeps me on track to contribute to technical AI safety research".

What I would like to do for the next 1 - 2 years ("independent research"/ "further upskilling to get into a top ML PhD program") is not all that viable a path given my visa constraints.

Above all, I want to avoid wasting N more years by taking a detour through software engineering again so I can get Visa sponsorship.

[... (read more)

Specifically, the experiments by Morrison and Berridge demonstrated that by intervening on the hypothalamic valuation circuits, it is possible to adjust policies zero-shot such that the animal has never experienced a previously repulsive stimulus as pleasurable.

I find this a bit confusing as worded, is something missing?

Does anyone know a ChatGPT plugin for browsing documents/webpages that can read LaTeX?

The plugin I currently use (Link Reader) strips out the LaTeX in its payload, and so GPT-4 ends up hallucinating the LaTeX content of the pages I'm feeding it.

How frequent are moderation actions? Is this discussion about saving moderator effort (by banning someone before you have to remove the rate-limited quantity of their bad posts), or something else? I really worry about "quality improvement by prior restraint" - both because low-value posts aren't that harmful, they get downvoted and ignored pretty easily, and because it can take YEARS of trial-and-error for someone to become a good participant in LW-style discussions, and I don't want to make it impossible for the true newbies (young people discovering

... (read more)

I find noticing surprise more valuable than noticing confusion.

Hindsight bias and post hoc rationalisations make it easy for us to gloss over events that were a priori unexpected.

5Raemon
My take on this is that noticing surprise is easier than noticing confusion, and surprise often correlates with confusion, so a useful habit is:
1. Practice noticing surprise.
2. When you notice surprise, check if you have a reason to be confused.
(Where surprise is "something unexpected happened" and confused is "something is happening that I can't explain, or my explanation of it doesn't make sense".)

I think the model of "a composition of subagents with total orders on their preferences" is a descriptive model of inexploitable incomplete preferences, and not a mechanistic model. At least, that was how I interpreted "Why Subagents?".

I read @johnswentworth as making the claim that such preferences could be modelled as a vetocracy of VNM rational agents, not as claiming that humans (or other objects of study) are mechanistically composed of discrete parts that are themselves VNM rational.

 

I'd be more interested/excited by a refutation on the grounds ... (read more)

5Nina Panickssery
The presence of a pre-order doesn't inherently imply a composition of subagents with ordered preferences. An agent can have a pre-order of preferences due to reasons such as lack of information, indifference between choices, or bounds on computation - this does not necessitate the presence of subagents.  If we do not use a model based on composition of subagents with ordered preferences, in the case of "Atticus the Agent" it can be consistent to switch B -> A + 1$ and A -> B + 1$.  Perhaps I am misunderstanding the claim being made here though.

Suppose it is offered (by a third party) to switch  and then 

Seems incomplete (pun acknowledged). I feel like there's something missing after "to switch" (e.g. "to switch from A to B" or similar).

Another example is an agent through time where as in the Steward of Myselves

This links to Scott Garrabrant's page, not to any particular post. Perhaps you want to review that?

I think you meant to link to: Tyranny of the Epistemic Majority.

DragonGodΩ120

We aren’t offering these criteria as necessary for “knowledge”—we could imagine a breaker proposing a counterexample where all of these properties are satisfied but where intuitively M didn’t really know that A′ was a better answer. In that case the builder will try to make a convincing argument to that effect.

Bolded should be sufficient.

DragonGodΩ242

In fact, I'm pretty sure that's how humans work most of the time. We use the general-intelligence machinery to "steer" ourselves at a high level, and most of the time, we operate on autopilot.

Yeah, I agree with this. But I don't think the human system aggregates into any kind of coherent total optimiser. Humans don't have an objective function (not even approximately?).

A human is not well modelled as a wrapper mind; do you disagree?

2Thane Ruthenis
Certainly agree. That said, I feel the need to lay out my broader model here. The way I see it, a "wrapper-mind" is a general-purpose problem-solving algorithm hooked up to a static value function. As such:
* Are humans proper wrapper-minds? No, certainly not.
* Do humans have the fundamental machinery to be wrapper-minds? Yes.
* Is any individual run of a human general-purpose problem-solving algorithm essentially equivalent to wrapper-mind-style reasoning? Yes.
* Can humans choose to act as wrapper-minds on longer time scales? Yes, approximately, subject to constraints like force of will.
* Do most humans, in practice, choose to act as wrapper-minds? No, we switch our targets all the time; value drift is ubiquitous.
* Is it desirable for a human to act as a wrapper-mind? That's complicated.
  * On the one hand, yes, because consistent pursuit of instrumentally convergent goals would lead to you having more resources to spend on whatever values you have.
  * On the other hand, no, because we terminally value this sort of value-drift and self-inconsistency; it's part of "being human".
  * In sum, for humans, there's a sort of tradeoff between approximating a wrapper-mind and being an incoherent human, and different people weight it differently in different contexts. E.g., if you really want to achieve something (earning your first million dollars, averting extinction), and you value it more than having fun being a human, you may choose to act as a wrapper-mind in the relevant context/at the relevant scale.

As such: humans aren't wrapper-minds, but they can act like them, and it's sometimes useful to act as one.
DragonGodΩ340

Thus, any greedy optimization algorithm would convergently shape its agent to not only pursue , but to maximize for 's pursuit — at the expense of everything else.

Conditional on:

  1. Such a system being reachable/accessible to our local/greedy optimisation process
  2. Such a system being actually performant according to the selection metric of our optimisation process 

 

I'm pretty sceptical of #2. I'm sceptical that systems that perform inference via direct optimisation over their outputs are competitive in rich/complex environments. 

Such o... (read more)

4Thane Ruthenis
It's not a binary. You can perform explicit optimization over high-level plan features, then hand off detailed execution to learned heuristics. "Make coffee" may be part of an optimized stratagem computed via consequentialism, but you don't have to consciously optimize every single muscle movement once you've decided on that goal. Essentially, what counts as "outputs" or "direct actions" relative to the consequentialist-planner is flexible, and every sufficiently-reliable (chain of) learned heuristics can be put in that category, with choosing to execute one of them available to the planner algorithm as a basic output. In fact, I'm pretty sure that's how humans work most of the time. We use the general-intelligence machinery to "steer" ourselves at a high level, and most of the time, we operate on autopilot.

Do please read the post. Being able to predict human text requires vastly superhuman capabilities, because predicting human text requires predicting the processes that generated said text. And large tracts of text are just reporting on empirical features of the world.

Alternatively, just read the post I linked.

5cubefox
I did read your post. The fact that something like predicting text requires superhuman capabilities of some sort does not mean that the task itself will result in superhuman capabilities. That's the crucial point. It is much harder to imitate human text than to write while being a human, but that doesn't mean the imitated human itself is any more capable than the original. An analogy. The fact that building fusion power plants is much harder than building fission power plants doesn't at all mean that the former are better. They could even be worse. There is a fundamental disconnect between the difficulty of a task and the usefulness of that task.
2Blueberry
Maybe you're an LLM.

In what sense are they "not trying their hardest"?

4tailcalled
I think you inserted an extra "not".
1cubefox
Being able to perfectly imitate a Chimpanzee would probably also require superhuman intelligence. But such a system would still only be able to imitate chimpanzees. Effectively, it would be much less intelligent than a human. Same for imitating human text. It's very hard, but the result wouldn't yield large capabilities.

which is indifferent to the simplicify of the architecture the insight lets you find.

The bolded should be "simplicity". 

Sorry, please where can I get access to the curriculum (including the reading material and exercises) if I want to study it independently?

The chapter pages on the website don't seem to list full curricula.

If you define your utility function over histories, then every behaviour is maximising an expected utility function, no?

Even behaviour that is money pumped?

I mean you can't money pump any preference over histories anyway without time travel.

The Dutch book arguments apply when your utility function is defined over your current state with respect to some resource?

I feel like once you define utility function over histories, you lose the force of the coherence arguments?

What would it look like to not behave as if maximising an expected utility function, for a utility function defined over histories?
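For contrast, the classic money pump does bite when preferences are cyclic over current holdings. A toy sketch (hypothetical goods and fees, not from the thread): an agent that prefers B to A, C to B, and A to C pays a small fee for each "upgrade" and ends a full cycle back where it started, strictly poorer.

```python
# Cyclic preferences over goods: A < B, B < C, C < A.
# Each entry maps an offered pair to the good the agent trades up to.
prefers = {("A", "B"): "B", ("B", "C"): "C", ("C", "A"): "A"}

def pump(start_good: str, money: float, fee: float, rounds: int):
    """Offer the agent each 'upgrade' in turn; it pays a fee per trade."""
    good = start_good
    for _ in range(rounds):
        for pair, better in prefers.items():
            if good in pair and better != good:
                good = better   # agent accepts the preferred swap
                money -= fee    # and pays the broker's fee
    return good, money

good, money = pump("A", money=10.0, fee=0.25, rounds=4)
print(good, money)  # back to good "A", having paid 12 fees: 7.0
```

This is the sense in which Dutch book arguments concern preferences over current holdings; a utility function over entire histories can rationalise any sequence of trades, including this one, which is why defining utility over histories drains the coherence arguments of their force.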

My contention is that I don't think the preconditions hold.

Agents don't fail to be VNM coherent by having incoherent preferences given the axioms of VNM. They fail to be VNM coherent by violating the axioms themselves.

Completeness is wrong for humans, and with incomplete preferences you can be non exploitable even without admitting a single fixed utility function over world states.

8niplav
I notice I am confused. How do you violate an axiom (completeness) without behaving in a way that violates completeness? I don't think you need an internal representation. Elaborating more, I am not sure how you even display a behavior that violates completeness. If you're given a choice between only universe-histories a and b, and your preferences are incomplete over them, what do you do? As soon as you reliably act to choose one over the other, for any such pair, you have algorithmically-revealed complete preferences. If you don't reliably choose one over the other, what do you do then?
* Choose randomly? But then I'd guess you are again Dutch-bookable. And according to which distribution?
* Your choice is undefined? That seems both kinda bad and also Dutch-bookable to me, tbh. Also, I don't see the difference between this and random choice (short of going up in flames, which would constitute a third, hitherto unassumed option).
* Go away/refuse the trade &c.? But this is denying the premise! You only have universe-histories a and b to choose between!

I think what happens with humans is that they are often incomplete over very low-ranking worlds and are instead searching for policies to find high-ranking worlds while not choosing. I think incompleteness might be fine if there are two options you can guarantee to avoid, but with adversarial dynamics that becomes more and more difficult.
4Alexander Gietelink Oldenziel
Agree. There are three stages:
1. Selection for inexploitability.
2. The interesting part is how systems/pre-agents/egregores/whatever become complete. If it already satisfies the other VNM axioms we can analyse the situation as follows: recall that an inexploitable but incomplete VNM agent acts like a Vetocracy of VNM agents. The exact decomposition is underspecified by just the preference order and is another piece of data (hidden state). However, given sure-gain offers from the environment there is selection pressure for the internal complete VNM subagents to make trade agreements to obtain a Pareto improvement. If you analyse this it looks like a simple prisoner's-dilemma-type case which can be analysed the usual way in game theory. For instance, in repeated offers with uncertain horizon the subagents may be able to cooperate.
3. Once they are (approximately) complete they will be under selection pressure to satisfy the other axioms. You could say this is the beginning of the 'emergence of expected utility maximizers'.

As you can see, the key here is that we really should be talking about Selection Theorems, not the highly simplified Coherence Theorems. Coherence theorems are about ideal agents. Selection theorems are about how more and more coherent and goal-directed agents may emerge.

Yeah, I think the preconditions of VNM straightforwardly just don't apply to generally intelligent systems.

2Dagon
As I say, open question.  We have only one example of a generally intelligent system, and that's not even very intelligent.  We have no clue how to extend or compare that to other types. It does seem like VNM-rational agents will be better than non-rational agents at achieving their goals.  It's unclear if that's a nudge to make agents move toward VNM-rationality as they get more capable, or a filter to advantage VNM-rational agents in competition to power.  Or a non-causal observation, because goals are orthogonal to power.

Not at all convinced that "strong agents pursuing a coherent goal" is a viable form for generally capable systems that operate in the real world, and the assumption that it is hasn't been sufficiently motivated.

What are the best arguments that expected utility maximisers are adequate (descriptive if not mechanistic) models of powerful AI systems?

[I want to address them in my piece arguing the contrary position.]

4Garrett Baker
I like Utility Maximization = Description Length Minimization.
9Linda Linsefors
The boring technical answer is that any policy can be described as a utility maximiser given a contrived enough utility function. The counterargument is that if the utility function is as complicated as the policy, then this is not a useful description.
niplav100

If you're not vNM-coherent you will get Dutch-booked if there are Dutch-bookers around.

This especially applies to multipolar scenarios with AI systems in competition.

I have an intuition that this also applies in degrees: if you are more vNM-coherent than I am (which I think I can define), then I'd guess that you can Dutch-book me pretty easily.

4Dagon
I don't know of any formal arguments that predict that all or most future AI systems are purely expected utility maximizers.  I suspect most don't believe that to be the case in any simple way.   I do know of a very powerful argument (a proof, in fact) that if an agent's goal structure is complete, transitively consistent, continuous, and independent of irrelevant alternatives, then it will be consistent with an expected-utility-maximizing model.  See https://en.wikipedia.org/wiki/Von_Neumann%E2%80%93Morgenstern_utility_theorem The open question remains, since humans do not meet these criteria, whether more powerful forms of intelligence are more likely to do so.  

Caveat to the caveat:

The solution is IMO just to consider the number of computations performed per generated token as some function of the model size, and once we've identified a suitable asymptotic order on the function, we can say intelligent things like "the smallest network capable of solving a problem in complexity class C of size N is X".

Or if our asymptotic bounds are not tight enough:

"No economically feasible LLM can solve problems in complexity class C of size >= N".

(Where economically feasible may be something defined by aggregate global eco

... (read more)

The solution is IMO just to consider the number of computations performed per generated token as some function of the model size, and once we've identified a suitable asymptotic order on the function, we can say intelligent things like "the smallest network capable of solving a problem in complexity class C of size N is X".

Or if our asymptotic bounds are not tight enough:

"No economically feasible LLM can solve problems in complexity class C of size >= N".

(Where economically feasible may be something defined by aggregate global economic resources or similar, depending on how tight you want the bound to be.)

Regardless, we can still obtain meaningful impossibility results.

Very big caveat: the LLM doesn't actually perform O(1) computations per generated token.

The number of computational steps performed per generated token scales with network size: https://www.lesswrong.com/posts/XNBZPbxyYhmoqD87F/llms-and-computation-complexity?commentId=QWEwFcMLFQ678y5Jp
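The scaling claim can be made concrete with a back-of-envelope estimate. The sketch below uses the common ≈2·N-params FLOPs-per-token approximation for a dense transformer forward pass plus a rough quadratic attention term; the two model configurations are illustrative, not taken from the linked comment.

```python
# Rough per-token forward-pass cost for a dense transformer:
# ~2 FLOPs per parameter (multiply-accumulate against the weights),
# plus attention's context-dependent term (QK^T and AV, roughly
# 2 * n_ctx * d_model mult-adds each, per layer).
def flops_per_token(n_params: float, n_ctx: int, d_model: int, n_layers: int) -> float:
    dense = 2 * n_params
    attention = 4 * n_layers * n_ctx * d_model
    return dense + attention

# Illustrative small and large configurations (GPT-ish shapes).
small = flops_per_token(n_params=125e6, n_ctx=2048, d_model=768, n_layers=12)
large = flops_per_token(n_params=175e9, n_ctx=2048, d_model=12288, n_layers=96)
print(f"{large / small:.0f}x more compute per generated token")
```

The point is simply that "computation per generated token" is a function of model size (and context length), not a constant, so the O(1)-per-token framing only holds with the architecture fixed.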

2DragonGod
Caveat to the caveat:

Strongly upvoted.

Short but powerful.

Tl;Dr: LLMs perform O(1) computational steps per generated token, and this is true regardless of which token is generated.

The LLM sees each token in its context window when generating the next token so can compute problems in O(n^2) [where n is the context window size].

LLMs can get around the computational requirements by "showing their working" and simulating a mechanical computer (one without backtracking, so not Turing complete) in their context window.

This only works if the context window is large enough to contain the work... (read more)

2DragonGod
Very big caveat: the LLM doesn't actually perform O(1) computations per generated token. The number of computational steps performed per generated token scales with network size: https://www.lesswrong.com/posts/XNBZPbxyYhmoqD87F/llms-and-computation-complexity?commentId=QWEwFcMLFQ678y5Jp

A reason I mood affiliate with shard theory so much is that like...

I'll have some contention with the orthodox ontology for technical AI safety and be struggling to adequately communicate it, and then I'll later listen to a post/podcast/talk by Quintin Pope/Alex Turner, or someone else trying to distill shard theory and then see the exact same contention I was trying to present expressed more eloquently/with more justification.

One example is that like I had independently concluded that "finding an objective function that was existentially safe when optimis... (read more)

4Chris_Leong
My main critique of shard theory is that I expect one of the shards to end up dominating the others as the most likely outcome.

"All you need is to delay doom by one more year per year and then you're in business" — Paul Christiano.
