All of tailcalled's Comments + Replies

The big picture is plausible but one major error you make is assuming "academics" will be a solid bastion of opposition. My understanding is that academics are often some of the first ones to fall (like when teachers struggle with students who use ChatGPT to cheat on homework), and many of the academic complaints about AI are just as slop-y as what the AI produces.

Maybe someone who believes in following the will of the majority even if he/she disagrees (and could easily become a dictator)?

Do you mean "resigns from a presidential position/declines a dictatorial position because they disagree with the will of the people" or "makes policy they know will be bad because the people demand it"?

Maybe a good parent who listens to his/her child's dreams?

Can you expand on this?

1Knight Lee1d

Maybe someone like George Washington who was so popular he could easily stay in power, but still chose to make America democratic. Let's hope it stays democratic :/ No human is 100% corrigible and would do anything that someone else wants. But a good parent might help his/her child get into sports and so forth but if the child says he/she wants to be a singer instead the parent helps him/her on that instead. The outcome the parent wants depends on what the child wants, and the child can change his/her mind.

Can you give 1 example of a person choosing to be corrigible to someone they are not dependent upon for resources/information and who they have much more expertise than?

1Knight Lee1d

* Maybe someone who believes in following the will of the majority even if he/she disagrees (and could easily become a dictator)? * Maybe a good parent who listens to his/her child's dreams? Very good question though. Humans usually aren't very corrigible, and there aren't many examples!

Richard Ngo's Shortform

tailcalled1d20

I feel like "evil" and "corruption" mean something different.

Corruption is about selfish people exchanging their power within a system for favors (often outside the system) when they're not supposed to according to the rules of the system. For example a policeman taking bribes. It's something the creators/owners of the system should try to eliminate, but if the system itself is bad (e.g. Nazi Germany during the Holocaust), corruption might be something you sometimes ought to seek out instead of to avoid, like with Schindler saving his Jews.

"Evil" I've in t... (read more)

If the AI can't do much without coordinating with a logistics and intelligence network and collaborating with a number of other agents, and its contact to this network routes through a commanding agent that is as capable if not more capable than the AI itself, then sure, it may be relatively feasible to make the AI corrigible to said commanding agent, if that is what you want it to be.

(This is meant to be analogous to the soldier-commander example.)

But was that the AI regime you expect to find yourself working with? In particular I'd expect you expect that the commanding agent would be another AI, in which case being corrigible to them is not sufficient.

1Knight Lee1d

Oops I didn't mean that analogy. It's not necessarily a commander, but any individual that a human chooses to be corrigible/loyal to. A human is capable of being corrigible/loyal to one person (or group), without accruing the risk of listening to prompt injections, because a human has enough general intelligence/common sense to know what is a prompt injection and what is a request from the person he is corrigible/loyal to. As AI approach human intelligence, they would be capable of this too.

Discriminating on the basis of the creators vs a random guy on the street helps with many of the easiest cases, but in an adversarial context, it's not enough to have something that works for all the easiest cases, you need something that can't predictably made to fail by a highly motivated adversary.

Like you could easily do some sort of data augmentation to add attempts at invoking the corrigibility system from random guys on the street, and then train it not to respond to that. But there'll still be lots of other vulnerabilities.

1Knight Lee1d

I still think, once the AI approaches human intelligence (and beyond), this problem should start to go away, since a human soldier can choose to be corrigible to his commander and not the enemy, even in very complex environments. I still feel the main problem is "the AI doesn't want to be corrigible," rather than "making the AI corrigible enables prompt injections." It's like that with humans. That said, I'm highly uncertain about all of this and I could easily be wrong.

Is instrumental convergence a thing for virtue-driven agents?

Let's say you are using the AI for some highly sensitive matter where it's important that it resists prompt-hacking - e.g. driving a car (prompt injections could trigger car crashes), something where it makes financial transactions on the basis of public information (online websites might scam it), or military drones (the enemy might be able to convince the AI to attack the country that sent it).

A general method for ensuring corrigibility is to be eager to follow anything instruction-like that you see. However, this interferes with being good at resisting prompt-hacking.

1Knight Lee1d

I think the problem you mention is a real challenge, but not the main limitation of this idea. The problem you mention actually decreases with greater intelligence and capabilities, since a smarter AI clearly understands the concept of being corrigible to its creators vs. a random guy on the street, just like a human does. The main problem is still how reinforcement learning trains the AI behaviours which actually maximize reward, while corrigibility only trains the AI behaviours which appear corrigibile.

What is autism?

tailcalled3d51

My current best guess is that:

Like for most other concepts, we don't have rigorous statistics and measurements showing that there is a natural clustering of autism symptoms, (there are some non-rigorous ones though)
When various schools of psychotherapy, psychiatry and pediatrics sorted children with behavioral issues together, they often ended up with an autistic group,
Each school has their own diagnosis on what exactly is wrong in the case of autism, and presumably they aren't all correct about all autistic people, so to know the True Reason autism is "a

tailcalled4d30

https://www.lesswrong.com/posts/gebzzEwn2TaA6rGkc/deep-learning-systems-are-not-less-interpretable-than-logic

tailcalled11d20

The assumption of virtue ethics isn't that virtue is unknown and must be discovered - it's that it's known and must be pursued.

If it is known, then why do you not ever answer my queries about providing an explicit algorithm for converting intelligence into virtuous agency, instead running in circles about how There Must Be A Utility Function!?

If the virtuous action, as you posit, is to consume ice cream, intelligence would allow an agent to acquire more ice cream, eat more over time by not making themselves sick, etc.

I'm not disagreeing with this, I'm sayi... (read more)

Is instrumental convergence a thing for virtue-driven agents?

tailcalled11d20

No, that's not my argument.

Let's imagine that True Virtue is seeking and eating ice cream, but that you don't know what true virtue is for some reason.

Now let's imagine that we have some algorithm for turning intelligence into virtuous agency. (This is not an assumption that I'm willing to grant (since you haven't given something like argmax for virtue), and really that's the biggest issue with my proposal, but let's entertain it to see my point.)

If the algorithm is run on the basis of some implementation of intelligence that is not good enough, then the r... (read more)

2Davidmanheim11d

The assumption of virtue ethics isn't that virtue is unknown and must be discovered - it's that it's known and must be pursued. If the virtuous action, as you posit, is to consume ice cream, intelligence would allow an agent to acquire more ice cream, eat more over time by not making themselves sick, etc. But any such decision algorithm, for a virtue ethicist, is routing through continued re-evaluation of whether the acts are virtuous, in the current context, not embracing some farcical LDT version of needing to pursue ice cream at all costs. There is an implicit utility function which values intelligence, but it's not then inferring back what virtue is, as you seem to claim. Your assumption, which is evidently that the entire thing turns into a compressed and decontextualized utility function ("algorithm") is ignoring the entire hypothetical.

Is instrumental convergence a thing for virtue-driven agents?

tailcalled11d20

I didn't say you need to understand what an argument is, I said you need to understand your own argument.

It is true that if the utility functions cover a sufficiently broad set of possibilities, any "reasonable" policy (for a controversial definition of "reasonable") maximizes a utility function, and if the utility functions cover an even broader set of possibilities, literally any policy maximizes a utility function.

But, if you want to reference these facts, you should know why they are true. For instance, here's a rough sketch of a method for finding a u... (read more)

2Davidmanheim11d

OK, so your argument against my claim is that a stupid and biased decision procedure wouldn't know that intelligence would make it more effective at being virtuous. And sure, that seems true, and I was wrong to assert unconditionally that "for virtue ethics, the derivative of that utility with respect to intelligence is positive." I should have instead clarified that I meant that any not idiotic virtue ethics decision procedure would have a positive first derivative in intelligence - because as your claim seems to admit, a less stupid decision procedure would not make that mistake, and would then value intelligence as it bootstrapped its way to greater intelligence.

Is instrumental convergence a thing for virtue-driven agents?

tailcalled12d-20

I'm showing that the assumptions necessary for your argument don't hold, so you need to better understand your own argument.

2Davidmanheim11d

I understand what an argument is, but I don't understand why you think that converting policies to.utility functions needs to assume no systematic errors, or why, if true, that would make it incompatible with varying intelligence.

Is instrumental convergence a thing for virtue-driven agents?

tailcalled12d20

The methods for converting policies to utility functions assume no systematic errors, which doesn't seem compatible with varying the intelligence levels.

2Davidmanheim12d

I don't understand your argument here.

Is instrumental convergence a thing for virtue-driven agents?

tailcalled12d20

This.

In particular imagine if the state space of the MDP factors into three variables x, y and z, and the agent has a bunch of actions with complicated influence on x, y and z but also just some actions that override y directly with a given value.

In some such MDPs, you might want a policy that does nothing other than copy a specific function of x to y. This policy could easily be seen as a virtue, e.g. if x is some type of event and y is some logging or broadcasting input, then it would be a sort of information-sharing virtue.

While there are certain circum... (read more)

Is instrumental convergence a thing for virtue-driven agents?

tailcalled13d62

I didn't claim virtue ethics says not to predict consequences of actions. I said that a virtue is more like a procedure than it is like a utility function. A procedure can include a subroutine predicting the consequences of actions and it doesn't become any more of a utility function by that.

The notion that "intelligence is channeled differently" under virtue ethics requires some sort of rule, like the consequentialist argmax or Bayes, for converting intelligence into ways of choosing.

2Davidmanheim12d

Yes, virtue ethics implies a utility function, because anything that outputs decisions implies a utility function. In this case, I'm noting that for virtue ethics, the derivative of that utility with respect to intelligence is positive.

Is instrumental convergence a thing for virtue-driven agents?

tailcalled13d18-3

Consequentialism is an approach for converting intelligence (the ability to make use of symmetries to e.g. generalize information from one context into predictions in another context or to e.g. search through highly structured search spaces) into agency, as one can use the intelligence to predict the consequences of actions and find a policy which achieves some criterion unusually well.

While it seems intuitively appealing that non-consequentialist approaches could be used to convert intelligence into agency, I have tried a lot and not been able to come up ... (read more)

4Davidmanheim13d

I think this is confused about how virtue ethics works. Virtue ethics is centered on the virtues of the moral agent, but it certainly does not say not to predict consequences of actions. In fact, one aspect of virtue, in the Aristotelian system, is "practical wisdom," i.e. intelligence which is critical for navigating choices - because practical wisdom includes an understanding of what consequences will follow actions. It's more accurate to say that intelligence is channeled differently — not toward optimizing outcomes, but toward choosing in a way consistent with one's virtues. And even if virtues are thought of as policies, as in the "loyal friend" example, the policies for being a good friend require interpretation and context-sensitive application. Intelligence is crucial for that.

Latent variables for prediction markets: motivation, technical guide, and design considerations

tailcalled21d20

Not sure what you mean. Are you doing a definitional dispute about what counts as the "standard" definition of Bayesian networks?

Latent variables for prediction markets: motivation, technical guide, and design considerations

tailcalled21d20

Your linked paper is kind of long - is there a single part of it that summarizes the scoring so I don't have to read all of it?

Either way, yes, it does seem plausible that one could create a market structure that supports latent variables without rewarding people in the way I described it.

1Abhimanyu Pallavi Sudhir21d

No; I mean a standard Bayesian network wouldn't work for latents.

Mo Putera's Shortform

tailcalled25d71

I'm not convinced Scott Alexander's mistakes page accurately tracks his mistakes. E.g. the mistake on it I know the most about is this one:

56: (5/27/23) In Raise Your Threshold For Accusing People Of Faking Bisexuality, I cited a study finding that most men’s genital arousal tracked their stated sexual orientation (ie straight men were aroused by women, gay men were aroused by men, bi men were aroused by either), but women’s genital arousal seemed to follow a bisexual pattern regardless of what orientation they thought they were - and concluded that althou

... (read more)

1Mo Putera25d

Thanks, good example.

How far along Metr's law can AI start automating or helping with alignment research?

tailcalled26d42

I mean I don't really believe the premises of the question. But I took "Even if you're not a fan of automating alignment, if we do make it to that point we might as well give it a shot!" to imply that even in such a circumstance, you still want me to come up with some sort of answer.

How far along Metr's law can AI start automating or helping with alignment research?

Answer by tailcalledMar 20, 20250-7

Life on earth started 3.5 billion years ago. Log_2(3.5 billion years/1 hour) = 45 doublings. With one doubling every 7 months, that makes 26 years, or in 2051.

(Obviously this model underestimates the difficulty of getting superalignment to work. But also extrapolating the METR trend is questionable for 45 doublings is dubious in an unknown direction. So whatever.)

1Christopher King26d

You're saying that if you assigned 1 human contractor the task of solving superalignment, they would succeed after ~3.5 billion years of work? 🤔 I think you misunderstood what the y-axis on the graph is measuring.

tailcalled1mo60

I talk to geneticists (mostly on Twitter, or rather now BlueSky) and they don't really know about this stuff.

tailcalled1mo40

(Presumably there exists some standard text about this that one can just link to lol.)

I don't think so.

I'm still curious whether this actually happens.... I guess you can have the "propensity" be near its ceiling.... (I thought that didn't make sense, but I guess you sometimes have the probability of disease for a near-ceiling propensity be some number like 20% rather than 100%?) I guess intuitively it seems a bit weird for a disease to have disjunctive causes like this, but then be able to max out at the risk at 20% with just one of the disjunctive causes

... (read more)

2TsviBT1mo

How confident are you / why do you think this? (It seems fairly plausible given what I've heard about the field of genomics, but still curious.) E.g. "I have a genomics PhD" or "I talk to geneticists and they don't really know about this stuff" or "I follow some twitter stuff and haven't heard anyone talk about this". Ok I'm too tired to follow this so I'll tap out of the thread for now. Thanks again!

tailcalled1mo100

Ok, more specifically, the decrease in the narrowsense heritability gets "double-counted" (after you've computed the reduced coefficients, those coefficients also get applied to those who are low in the first chunk and not just those who are high, when you start making predictions), whereas the decrease in the broadsense heritability is only single-counted. Since the single-counting represents a genuine reduction while the double-counting represents a bias, it only really makes sense to think of the double-counting as pathological.

2TsviBT1mo

Ah... ok I think I see where that's going. Thanks! (Presumably there exists some standard text about this that one can just link to lol.) I'm still curious whether this actually happens.... I guess you can have the "propensity" be near its ceiling.... (I thought that didn't make sense, but I guess you sometimes have the probability of disease for a near-ceiling propensity be some number like 20% rather than 100%?) I guess intuitively it seems a bit weird for a disease to have disjunctive causes like this, but then be able to max out at the risk at 20% with just one of the disjunctive causes? IDK. Likewise personality...

tailcalled1mo20

It would decrease the narrowsense (or additive) heritability, which you can basically think of as the squared length of your coefficient vector, but it wouldn't decrease the broadsense heritability, which is basically the phenotypic variance in expected trait levels you'd get by shuffling around the genotypes. The missing heritability problem is that when we measure these two heritabilities, the former heritability is lower than the latter.

2TsviBT1mo

Why not? Shuffling around the second chunk, while the first chunk is already high, doesn't do anything, and therefore does not contribute phenotypic variance to broadsense heritability.

tailcalled1mo20

If some amount of heritability is from the second chunk, then to that extent, there's a bunch of pairs of people whose trait differences are explained by second chunk differences. If you made a PGS, you'd see these pairs of people and then you'd find out how specifically the second chunk affects the trait.

This only applies if the people are low in the first chunk and differ in the second chunk. Among the people who are high in the first chunk but differ in the second chunk, the logarithm of their trait level will be basically the same regardless of the sec... (read more)

2TsviBT1mo

Wouldn't this also decrease the heritability?

tailcalled1mo20

Why?

2TsviBT1mo

Because if some of the heritability is from the second chunk, that means that for some pairs of people, they have roughly the same first chunk but somewhat different second chunks; and they have different traits, due to the difference in second chunks. If some amount of heritability is from the second chunk, then to that extent, there's a bunch of pairs of people whose trait differences are explained by second chunk differences. If you made a PGS, you'd see these pairs of people and then you'd find out how specifically the second chunk affects the trait. I could be confused about some really basic math here, but yeah, I don't see it. Your example for how the gradient doesn't flow seems to say "the gradient doesn't flow because the second chunk doesn't actually affect the trait".

tailcalled1mo20

Some of the heritability would be from the second chunk of genes.

2TsviBT1mo

To the extent that the heritability is from the second chunk, to that extent the gradient does flow, no?

tailcalled1mo20

The original discussion was about how personality traits and social outcomes could behave fundamentally differently from biological traits when it comes to genetics. So this isn't necessarily meant to apply to disease risks.

2TsviBT1mo

Well you brought up depression. But anyway, all my questions apply to personality traits as well. ..... To rephrase / explain how confused I am about what you're trying to tell me: It kinda sounds like you're saying "If some trait is strongly determined by one big chunk of genes, then you won't be able to see how some other chunk affects the trait.". But this can't explain missing heritability! In this scenario, none of the heritability is even from the second chunk of genes in the first place! Or am I missing something?

tailcalled1mo20

Let's start with the basics: If the outcome $f$ is a linear function of the genes $x$ , that is $f (x) = β x$ , then the effect of each gene is given by the gradient of $f$ , i.e. $\nabla_{x} f (x) = β$ . (This is technically a bit sketchy since a genetic variant is discrete while gradients require continuity, but it works well enough as a conceptual approximation for our purposes.) Under this circumstance, we can think of genomic studies as finding $β$ . (This is also technically a bit sketchy because of linkage disequillibrium and such, but it works we... (read more)

2TsviBT1mo

Ah. Thank you, this makes sense of what you said earlier. (I / someone could have gotten this from what you had written before, by thinking about it more, probably.) I agree with your analysis as math. However, I'm skeptical of the application to the genetics stuff, or at least I don't see it yet. Specifically, you wrote: And your argument here says that there's "gradient interference" between the summed products specifically when one of the summed products is really big. But in the case of disease risk, IIUC the sum-of-products f(x) is something like logits. So translating your argument, it's like: In this case, yes the analysis is valid, but it's not very relevant. For the diseases that people tend to talk about, if there are several substantial disjunctive causes (I mean, the risk is a sum of a few different sub-risks), then they all would show substantial signal in the data. None of them drowns out all the others. Maybe you just meant to say "In theory this could happen". Or am I missing what you're suggesting? E.g. is there a way for there to be a trait that: * has lots of variation (e.g. lots of sick people and lots of non-sick people), and * it's genetic, and * it's a fairly simple functional form like we've been discussing, * but you can't optimize it much by changing a bunch of variants found by looking at some millions of genotype/phenotype pairs?

tailcalled1mo20

It kind-of applies to the Bernoulli-sigmoid-linear case that would usually be applied to binary diagnoses (but only because of sample size issues and because they usually perform the regression one variable at a time to reduce computational difficulty), but it doesn't apply as strongly as it does to the polynomial case, and it doesn't apply to the purely linear (or exponential-linear) case at all.

If you have a purely linear case, then the expected slope of a genetic variant onto an outcome of interest is proportional to the effect of the genetic variant.

Th... (read more)

tailcalled1mo20

It doesn't matter if depression-common is genetic or environmental. Depression-common leads to the genetic difference between your cases and controls to be small along the latent trait axis that causes depression-rare. So the effect gets estimated to be not-that-high. The exact details of how it fails depends on the mathematical method used to estimate the effect.

2TsviBT1mo

Ok I think I get what you're trying to communicate, and it seems true, but I don't think it's very relevant to the missing heritability thing. The situation you're describing applies to the fully linear case too. You're just saying that if a trait is more polygenic / has more causes with smaller effects, it's harder to detect relevant causes. Unless I still don't get what you're saying.

tailcalled1mo20

Not right now, I'm on my phone. Though also it's not standard genetics math.

2TsviBT1mo

Ok. I don't get why you think this. It doesn't seem to make any sense. You'd still notice the effect of variants that cause depression-rare, exactly like if depression-rare was the only kind of depression. How is your ability to detect depression-rare affected by the fact that there's some genetic depression-common? Depression-common could just as well have been environmentally caused. I might be being dumb, I just don't get what you're saying and don't have a firm grounding myself.

tailcalled1mo20

Isn't the derivative of the full variable in one of the multiplicands still noticeable? Maybe it would help if you make some quantitative statement?

Taking the logarithm (to linearize the association) scales the derivative down by the reciprocal of the magnitude. So if one of the terms in the sum is really big, all the derivatives get scaled down by a lot. If each of the terms are a product, then the derivative for the big term gets scaled up to cancel out the downscaling, but the small terms do not.

I mean, I think depression is heritable, and I think there

... (read more)

2TsviBT1mo

Can you please write down the expressions you're talking about as math? If you're trying to invoke standard genetics knowledge, I'm not a geneticist and I'm not picking it up from what you're saying.

Caleb Biddulph's Shortform

tailcalled1mo20

It becomes more complex once you take the sum of the product of several things. At that point the log-additive effect of one of the terms in the sum disappears if the other term in the sum is high. If you've got a lot of terms in the sum and the distribution of the variables is correct, this can basically kill the bulk of common additive variance. Conceptually speaking, this can be thought of as "your system is a mixture of a bunch of qualitatively distinct things". Like if you imagine divorce or depression can be caused by a bunch of qualitatively unrelated things.

2TsviBT1mo

Hm.... Not sure how to parse this. (What do you mean " the distribution of the variables is correct"?) Isn't the derivative of the full variable in one of the multiplicands still noticeable? Maybe it would help if you make some quantitative statement? I mean, I think depression is heritable, and I think there are polygenic scores that do predict some chunk of this. (From a random google: https://jamanetwork.com/journals/jamapsychiatry/fullarticle/2783096 ) Quite plausibly yes these heritability estimates and PGSes are picking up on heterogeneous things, but they still work, and you can still construct the PGS; you find the additive variants when you look. (Also I am interested in the difference between traits that are OR / SUM of some heritable things and some non-heritable things. E.g. you can get lung cancer from lung cancer genes, or from smoke 5 packs a day. This matters for asking "just how low exactly can we drive down disease risk?". But this would not show up as missing heritability!)

tailcalled1mo70

Deep Learning Systems Are Not Less Interpretable Than Logic/Probability/Etc

1Caleb Biddulph1mo

Interesting, strong-upvoted for being very relevant. My response would be that identifying accurate "labels" like "this is a tree-detector" or "this is the Golden Gate Bridge feature" is one important part of interpretability, but understanding causal connections is also important. The latter is pretty much useless without the former, but having both is much better. And sparse, crisply-defined connections make the latter easier. Maybe you could do this by combining DLGNs with some SAE-like method.

A model of the final phase: the current frontier AIs as de facto CEOs of their own companies

tailcalled1mo00

Couldn't it also end if all the AI companies collapse under their own accumulated technical debt and goodwill lost to propaganda, and people stop wanting to use AI for stuff?

So how well is Claude playing Pokémon?

tailcalled1mo209

And as a separate note, I'm not sure what the appropriate human reference class for game-playing AIs is, but I challenge the assumption that it should be people who are familiar with games. Rather than, say, people picked at random from anywhere on earth.

Should maybe restrict it to someone who has read all the documentation and discussion for the game that exists on the internet.

4MondSemmel1mo

Fair. But then also restrict it to someone who has no hands, eyes, etc.

tailcalled1mo30

The defining difference was whether they have contextually activating behaviors to satisfy a set of drives, on the basis that this makes it trivial to out-think their interests. But this ability to out-think them also seems intrinsically linked to them being adversarially non-robust, because you can enumerate their weaknesses. You're right that one could imagine an intermediate case where they are sufficiently far-sighted that you might accidentally trigger conflict with them but not sufficiently far-sighted for them to win the conflicts, but that doesn't mean one could make something adversarially robust under the constraint of it being contextually activated and predictable.

2Mateusz Bagiński1mo

Alright, fair, I misread the definition of "homeostatic agents".

tailcalled1mo63

That would be ones that are bounded so as to exclude taking your manipulation methods into account, not ones that are truly unbounded.

2Mateusz Bagiński1mo

I interpreted "unbounded" as "aiming to maximize expected value of whatever", not "unbounded in the sense of bounded rationality".

tailcalled1mo20

That's not something unique to homeostatic agents, though. If a model-based maximizer has some gap between its model and the real world, that gap can be exploited by another agent for its own gain, and that's game over for the maximizer.

I don't think of my argument as model-based vs heuristic-reactive, I mean it as unbounded vs bounded. Like you could imagine making a giant stack of heuristics that makes it de-facto act like an unbounded consequentialist, and you'd have a similar problem. Model-based agents only become relevant because they seem like an ea... (read more)