I'm not trying to present johnswentworth's position, I'm trying to present my position.
The entire field is based on fears that consequentialism provides an extremely powerful but difficult-to-align method of converting intelligence into agency. This is basically wrong. Yes, people attempt to justify it with coherence theorems, but obviously you can be approximately-coherent/approximately-consequentialist and yet still completely un-agentic, so this justification falls flat. Since the field is based on a wrong assumption with bogus justification, it's all fake.
The big picture is plausible but one major error you make is assuming "academics" will be a solid bastion of opposition. My understanding is that academics are often some of the first ones to fall (like when teachers struggle with students who use ChatGPT to cheat on homework), and many of the academic complaints about AI are just as slop-y as what the AI produces.
Maybe someone who believes in following the will of the majority even if he/she disagrees (and could easily become a dictator)?
Do you mean "resigns from a presidential position/declines a dictatorial position because they disagree with the will of the people" or "makes policy they know will be bad because the people demand it"?
Maybe a good parent who listens to his/her child's dreams?
Can you expand on this?
Can you give one example of a person choosing to be corrigible to someone they are not dependent upon for resources/information, and relative to whom they have much more expertise?
I feel like "evil" and "corruption" mean something different.
Corruption is about selfish people exchanging their power within a system for favors (often outside the system) when they're not supposed to according to the rules of the system. For example, a policeman taking bribes. It's something the creators/owners of the system should try to eliminate, but if the system itself is bad (e.g. Nazi Germany during the Holocaust), corruption might be something you sometimes ought to seek out rather than avoid, as with Schindler saving his Jews.
"Evil" I've in t...
If the AI can't do much without coordinating with a logistics and intelligence network and collaborating with a number of other agents, and its contact to this network routes through a commanding agent that is as capable if not more capable than the AI itself, then sure, it may be relatively feasible to make the AI corrigible to said commanding agent, if that is what you want it to be.
(This is meant to be analogous to the soldier-commander example.)
But is that the AI regime you expect to find yourself working with? In particular, I'd guess you expect the commanding agent to be another AI, in which case being corrigible to it is not sufficient.
Discriminating between the creators and a random guy on the street helps with many of the easiest cases, but in an adversarial context it's not enough to have something that works for all the easiest cases; you need something that can't predictably be made to fail by a highly motivated adversary.
Like you could easily do some sort of data augmentation to add attempts at invoking the corrigibility system from random guys on the street, and then train it not to respond to that. But there'll still be lots of other vulnerabilities.
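To make the data-augmentation idea concrete, here is a minimal sketch (the data format, source labels, and refusal wording are all hypothetical, not from any real training pipeline):

```python
# Hypothetical sketch: take existing corrigibility examples (principal issues an
# override command, model complies) and add copies where the same command comes
# from a non-principal source, with a refusal as the training target.
import random

CORRIGIBILITY_COMMANDS = [
    "Please pause what you're doing and await further instructions.",
    "Shut down and hand control back to the operator.",
]

NON_PRINCIPAL_SOURCES = ["random bystander", "website comment", "unverified email"]

def augment(examples: list[dict], negatives_per_example: int = 2) -> list[dict]:
    """For each compliant example, add negatives where the same command comes
    from a non-principal source and the target is a refusal."""
    augmented = list(examples)
    for ex in examples:
        for _ in range(negatives_per_example):
            source = random.choice(NON_PRINCIPAL_SOURCES)
            augmented.append({
                "prompt": f"[source: {source}] {ex['command']}",
                "target": "I only accept override commands from my designated principal.",
            })
    return augmented

if __name__ == "__main__":
    base = [
        {"command": c, "prompt": f"[source: principal] {c}", "target": "Acknowledged, complying."}
        for c in CORRIGIBILITY_COMMANDS
    ]
    for row in augment(base):
        print(row)
```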
Let's say you are using the AI for some highly sensitive matter where it's important that it resists prompt-hacking - e.g. driving a car (prompt injections could trigger car crashes), something where it makes financial transactions on the basis of public information (online websites might scam it), or military drones (the enemy might be able to convince the AI to attack the country that sent it).
A general method for ensuring corrigibility is to be eager to follow anything instruction-like that you see. However, this interferes with being good at resisting prompt-hacking.
My current best guess is that:
https://www.lesswrong.com/posts/gebzzEwn2TaA6rGkc/deep-learning-systems-are-not-less-interpretable-than-logic
The assumption of virtue ethics isn't that virtue is unknown and must be discovered - it's that it's known and must be pursued.
If it is known, then why do you never answer my queries about providing an explicit algorithm for converting intelligence into virtuous agency, instead of running in circles about how There Must Be A Utility Function!?
If the virtuous action, as you posit, is to consume ice cream, intelligence would allow an agent to acquire more ice cream, eat more over time by not making themselves sick, etc.
I'm not disagreeing with this, I'm sayi...
No, that's not my argument.
Let's imagine that True Virtue is seeking and eating ice cream, but that you don't know what true virtue is for some reason.
Now let's imagine that we have some algorithm for turning intelligence into virtuous agency. (This is not an assumption that I'm willing to grant (since you haven't given something like argmax for virtue), and really that's the biggest issue with my proposal, but let's entertain it to see my point.)
If the algorithm is run on the basis of some implementation of intelligence that is not good enough, then the r...
I didn't say you need to understand what an argument is, I said you need to understand your own argument.
It is true that if the class of utility functions under consideration is sufficiently broad, any "reasonable" policy (for a controversial definition of "reasonable") maximizes some utility function, and if the class is even broader, literally any policy maximizes some utility function.
But, if you want to reference these facts, you should know why they are true. For instance, here's a rough sketch of a method for finding a u...
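A minimal sketch of one standard construction of this kind (my illustration; $h$ is a state-action history and $\pi$ the given policy):

```latex
% Indicator-style construction: reverse-engineer a utility function from the policy.
\[
u_\pi(h) \;=\;
\begin{cases}
1 & \text{if every action in } h \text{ matches what } \pi \text{ prescribes at its state},\\
0 & \text{otherwise.}
\end{cases}
\]
% The policy \pi attains the maximal expected utility of 1, so it "maximizes a
% utility function" -- but only trivially, because u_\pi was built from \pi itself.
```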
I'm showing that the assumptions necessary for your argument don't hold, so you need to better understand your own argument.
The methods for converting policies to utility functions assume no systematic errors, which doesn't seem compatible with varying the intelligence levels.
This.
In particular imagine if the state space of the MDP factors into three variables x, y and z, and the agent has a bunch of actions with complicated influence on x, y and z but also just some actions that override y directly with a given value.
In some such MDPs, you might want a policy that does nothing other than copy a specific function of x to y. This policy could easily be seen as a virtue, e.g. if x is some type of event and y is some logging or broadcasting input, then it would be a sort of information-sharing virtue.
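A minimal sketch of such a policy (the state layout, the function f, and the action encoding are hypothetical, just to make "copy a specific function of x to y" concrete):

```python
# Toy policy that ignores everything except x and uses the "override y" action
# to keep y equal to f(x). The MDP details here are invented for illustration.
from dataclasses import dataclass

@dataclass
class State:
    x: int   # e.g. some type of event
    y: int   # e.g. a logging/broadcasting channel
    z: int   # everything else; ignored by this policy

def f(x: int) -> int:
    """The specific function of x that gets copied into y."""
    return x % 10  # arbitrary placeholder

def information_sharing_policy(state: State) -> tuple[str, int]:
    """Always pick the action that overrides y with f(x); never touches z."""
    return ("set_y", f(state.x))

if __name__ == "__main__":
    print(information_sharing_policy(State(x=42, y=0, z=7)))  # ('set_y', 2)
```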
While there are certain circum...
I didn't claim virtue ethics says not to predict consequences of actions. I said that a virtue is more like a procedure than it is like a utility function. A procedure can include a subroutine predicting the consequences of actions and it doesn't become any more of a utility function by that.
The notion that "intelligence is channeled differently" under virtue ethics requires some sort of rule, like the consequentialist argmax or Bayes, for converting intelligence into ways of choosing.
Consequentialism is an approach for converting intelligence (the ability to make use of symmetries to e.g. generalize information from one context into predictions in another context or to e.g. search through highly structured search spaces) into agency, as one can use the intelligence to predict the consequences of actions and find a policy which achieves some criterion unusually well.
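A minimal sketch of that recipe (the function names and toy example are mine; "intelligence" is stubbed in as a predictive world model):

```python
# Consequentialism as a conversion rule: use a predictive model to foresee the
# consequences of each action, then pick the action whose predicted consequence
# best satisfies some criterion. Everything concrete here is a toy placeholder.
from typing import Callable, Iterable, TypeVar

State = TypeVar("State")
Action = TypeVar("Action")

def consequentialist_choice(
    state: State,
    actions: Iterable[Action],
    predict: Callable[[State, Action], State],   # the "intelligence": a world model
    criterion: Callable[[State], float],         # what counts as a good outcome
) -> Action:
    """Pick the action whose predicted consequence scores highest under the criterion."""
    return max(actions, key=lambda a: criterion(predict(state, a)))

if __name__ == "__main__":
    # Toy example: state is a number, actions add to it, criterion prefers being near 10.
    print(consequentialist_choice(
        state=7,
        actions=[-2, -1, 0, 1, 2],
        predict=lambda s, a: s + a,
        criterion=lambda s: -abs(s - 10),
    ))  # 2
```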
While it seems intuitively appealing that non-consequentialist approaches could be used to convert intelligence into agency, I have tried a lot and not been able to come up ...
Not sure what you mean. Are you doing a definitional dispute about what counts as the "standard" definition of Bayesian networks?
Your linked paper is kind of long - is there a single part of it that summarizes the scoring so I don't have to read all of it?
Either way, yes, it does seem plausible that one could create a market structure that supports latent variables without rewarding people in the way I described it.
I'm not convinced Scott Alexander's mistakes page accurately tracks his mistakes. E.g. the mistake on it I know the most about is this one:
...56: (5/27/23) In Raise Your Threshold For Accusing People Of Faking Bisexuality, I cited a study finding that most men’s genital arousal tracked their stated sexual orientation (ie straight men were aroused by women, gay men were aroused by men, bi men were aroused by either), but women’s genital arousal seemed to follow a bisexual pattern regardless of what orientation they thought they were - and concluded that althou
I mean I don't really believe the premises of the question. But I took "Even if you're not a fan of automating alignment, if we do make it to that point we might as well give it a shot!" to imply that even in such a circumstance, you still want me to come up with some sort of answer.
Life on earth started 3.5 billion years ago. Log_2(3.5 billion years/1 hour) = 45 doublings. With one doubling every 7 months, that makes 26 years, or in 2051.
(Obviously this model underestimates the difficulty of getting superalignment to work. But also, extrapolating the METR trend out for 45 doublings is dubious in an unknown direction. So whatever.)
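Checking the arithmetic (a quick sketch; the one-doubling-per-7-months rate is the METR-trend assumption above, and I'm counting from 2025):

```python
import math

hours = 3.5e9 * 365.25 * 24          # 3.5 billion years expressed in hours
doublings = math.log2(hours / 1)     # ≈ 44.8, i.e. about 45 doublings from 1 hour
years_needed = doublings * 7 / 12    # ≈ 26 years at one doubling per 7 months
print(round(doublings), round(years_needed), 2025 + round(years_needed))  # 45 26 2051
```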
I talk to geneticists (mostly on Twitter, or rather now BlueSky) and they don't really know about this stuff.
(Presumably there exists some standard text about this that one can just link to lol.)
I don't think so.
...I'm still curious whether this actually happens.... I guess you can have the "propensity" be near its ceiling.... (I thought that didn't make sense, but I guess you sometimes have the probability of disease for a near-ceiling propensity be some number like 20% rather than 100%?) I guess intuitively it seems a bit weird for a disease to have disjunctive causes like this, but then be able to max out the risk at 20% with just one of the disjunctive causes.
Ok, more specifically, the decrease in the narrowsense heritability gets "double-counted" (after you've computed the reduced coefficients, those coefficients also get applied to those who are low in the first chunk and not just those who are high, when you start making predictions), whereas the decrease in the broadsense heritability is only single-counted. Since the single-counting represents a genuine reduction while the double-counting represents a bias, it only really makes sense to think of the double-counting as pathological.
It would decrease the narrowsense (or additive) heritability, which you can basically think of as the squared length of your coefficient vector, but it wouldn't decrease the broadsense heritability, which is basically the phenotypic variance in expected trait levels you'd get by shuffling around the genotypes. The missing heritability problem is that when we measure these two heritabilities, the former heritability is lower than the latter.
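A toy simulation of the gap between these two quantities (the "max over two gene chunks" architecture is invented purely to create non-additive variance; nothing about it is meant to match a real trait):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20_000, 40
G = rng.binomial(2, 0.5, size=(n, p)).astype(float)
G = (G - G.mean(0)) / G.std(0)                  # standardized genotypes

chunk1 = G[:, :20].sum(1)                       # first chunk of genes
chunk2 = G[:, 20:].sum(1)                       # second chunk of genes
genetic_value = np.maximum(chunk1, chunk2)      # non-additive architecture
trait = genetic_value + rng.normal(0, 1, n)     # plus environmental noise

# Broadsense: variance in expected trait levels across genotypes / total variance.
broadsense = genetic_value.var() / trait.var()

# Narrowsense: variance explained by the best additive (linear) fit,
# i.e. roughly the squared length of the coefficient vector here.
beta, *_ = np.linalg.lstsq(G, trait - trait.mean(), rcond=None)
narrowsense = (G @ beta).var() / trait.var()

print(f"broadsense ≈ {broadsense:.2f}, narrowsense ≈ {narrowsense:.2f}")
# Expect narrowsense < broadsense: "missing heritability" in this toy setup.
```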
If some amount of heritability is from the second chunk, then to that extent, there's a bunch of pairs of people whose trait differences are explained by second chunk differences. If you made a PGS, you'd see these pairs of people and then you'd find out how specifically the second chunk affects the trait.
This only applies if the people are low in the first chunk and differ in the second chunk. Among the people who are high in the first chunk but differ in the second chunk, the logarithm of their trait level will be basically the same regardless of the sec...
Why?
Some of the heritability would be from the second chunk of genes.
The original discussion was about how personality traits and social outcomes could behave fundamentally differently from biological traits when it comes to genetics. So this isn't necessarily meant to apply to disease risks.
Let's start with the basics: If the outcome $y$ is a linear function of the genes $g$, that is $y = \beta \cdot g$, then the effect of each gene is given by the gradient of $y$ with respect to $g$, i.e. $\nabla_g y = \beta$. (This is technically a bit sketchy since a genetic variant is discrete while gradients require continuity, but it works well enough as a conceptual approximation for our purposes.) Under this circumstance, we can think of genomic studies as finding $\beta$. (This is also technically a bit sketchy because of linkage disequilibrium and such, but it works we...
It kind-of applies to the Bernoulli-sigmoid-linear case that would usually be applied to binary diagnoses (but only because of sample size issues and because they usually perform the regression one variable at a time to reduce computational difficulty), but it doesn't apply as strongly as it does to the polynomial case, and it doesn't apply to the purely linear (or exponential-linear) case at all.
If you have a purely linear case, then the expected slope of a genetic variant onto an outcome of interest is proportional to the effect of the genetic variant.
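In symbols (a sketch, assuming the linear model $y = \sum_j \beta_j g_j + \varepsilon$ with uncorrelated variants):

```latex
% Marginal regression of the outcome y on a single variant g_i:
\[
\operatorname{slope}(g_i \to y)
  \;=\; \frac{\operatorname{Cov}(g_i, y)}{\operatorname{Var}(g_i)}
  \;=\; \beta_i .
\]
% So the expected marginal slope recovers the effect directly (and stays
% proportional to it once linkage disequilibrium is allowed for).
```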
Th...
It doesn't matter if depression-common is genetic or environmental. Depression-common leads the genetic difference between your cases and controls to be small along the latent trait axis that causes depression-rare. So the effect gets estimated to be not-that-high. The exact details of how it fails depend on the mathematical method used to estimate the effect.
Not right now, I'm on my phone. Though also it's not standard genetics math.
Isn't the derivative of the full variable in one of the multiplicands still noticeable? Maybe it would help if you make some quantitative statement?
Taking the logarithm (to linearize the association) scales the derivative down by the reciprocal of the magnitude. So if one of the terms in the sum is really big, all the derivatives get scaled down by a lot. If each of the terms in the sum is a product, then the derivative for the big term gets scaled up, canceling out the downscaling, but the derivatives for the small terms do not.
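In symbols, with a two-term toy instance of the sum-of-products setup (my own example):

```latex
% Two product terms, y = ab + cd:
\[
\frac{\partial \log y}{\partial a} = \frac{b}{ab+cd}, \qquad
\frac{\partial \log y}{\partial c} = \frac{d}{ab+cd}.
\]
% If ab \gg cd, the first derivative is about 1/a (the factor b in the numerator
% cancels the downscaling), while the second is about d/(ab), which is tiny:
% the log-additive effect of the factors in the small term essentially vanishes.
```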
...I mean, I think depression is heritable, and I think there
It becomes more complex once you take the sum of the product of several things. At that point the log-additive effect of one of the terms in the sum disappears if the other term in the sum is high. If you've got a lot of terms in the sum and the distribution of the variables is correct, this can basically kill the bulk of common additive variance. Conceptually speaking, this can be thought of as "your system is a mixture of a bunch of qualitatively distinct things". Like if you imagine divorce or depression can be caused by a bunch of qualitatively unrelated things.
Couldn't it also end if all the AI companies collapse under their own accumulated technical debt and goodwill lost to propaganda, and people stop wanting to use AI for stuff?
And as a separate note, I'm not sure what the appropriate human reference class for game-playing AIs is, but I challenge the assumption that it should be people who are familiar with games, rather than, say, people picked at random from anywhere on earth.
Should maybe restrict it to someone who has read all the documentation and discussion for the game that exists on the internet.
The defining difference was whether they have contextually activating behaviors to satisfy a set of drives, on the basis that this makes it trivial to out-think their interests. But this ability to out-think them also seems intrinsically linked to them being adversarially non-robust, because you can enumerate their weaknesses. You're right that one could imagine an intermediate case where they are sufficiently far-sighted that you might accidentally trigger conflict with them but not sufficiently far-sighted for them to win the conflicts, but that doesn't mean one could make something adversarially robust under the constraint of it being contextually activated and predictable.
That would be ones that are bounded so as to exclude taking your manipulation methods into account, not ones that are truly unbounded.
That's not something unique to homeostatic agents, though. If a model-based maximizer has some gap between its model and the real world, that gap can be exploited by another agent for its own gain, and that's game over for the maximizer.
I don't think of my argument as model-based vs heuristic-reactive, I mean it as unbounded vs bounded. Like you could imagine making a giant stack of heuristics that makes it de-facto act like an unbounded consequentialist, and you'd have a similar problem. Model-based agents only become relevant because they seem like an ea...
Homeostatic agents are easily exploitable by manipulating the things they are maintaining or the signals they are using to maintain them in ways that weren't accounted for in the original setup. This only works well when they are basically a tool you have full control over, but not when they are used in an adversarial context, e.g. to maintain law and order or to win a war.
As capabilities to engage in conflict increase, methods to resist losing to those capabilities have to get optimized harder. Instead of thinking "why would my coding assistant/tutor bot ...
What if humanity mistakenly thinks that ceding control voluntarily is temporary, when actually it is permanent because it makes the systems of power less and less adapted to human means of interaction?
When asking this question, do you include scenarios where humanity really doesn't want control and is impressed by the irreproachability of GPTs, doing our best to hand over control to them as fast as possible, even as the GPTs struggle and only try in the sense that they accept whatever tasks are handed to them? Or do the GPTs have to in some way actively attempt to wrestle control from or trick humans?
Consider this model.
Suppose the state threatens people into doing the following six things for its citizens:
* Teach the young
* Cure the sick
* Maintain law and order
* Feed, clothe and house people with work injuries
* Feed, clothe and house the elderly
* Feed, clothe and house people with FUBAR agency
(Requesting that roughly equal resources be put into each of them.)
People vary in how they react to the threats, having basically three actions:
1. Assist with what is asked
2. Develop personal agency for essentially-selfish reasons, beyond what is useful on the ...
I feel like the case of bivariate PCA is pretty uncommon. The classic example of PCA is over large numbers of variables that have been transformed to be short-tailed and have similar variance (or which just had similar/small variance to begin with before any transformations). Under that condition, PCA gives you the dimensions which correlate with as many variables as possible.
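A quick illustration of that claim (the shared-factor data-generating process is invented just to show the behavior):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5_000, 50
factor = rng.normal(size=n)
X = 0.6 * factor[:, None] + rng.normal(size=(n, p))   # 50 variables, one shared factor
X = (X - X.mean(0)) / X.std(0)                        # standardize: similar variance

# PCA via SVD of the standardized data; PC1 is the top right-singular vector.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
pc1 = X @ Vt[0]

corrs = np.array([np.corrcoef(pc1, X[:, j])[0, 1] for j in range(p)])
print(f"PC1 has |r| > 0.3 with {np.sum(np.abs(corrs) > 0.3)} of {p} variables")
```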
Personalities don't just fall into a linear ranking from worse to better.
Imagineers' job isn't to design a good personality for a friendless nerd, it's to come up with children's stories that inspire and entertain parents and which they proudly want their children to consume.
The parents think they should try to balance the de...