All of Donald Hobson's Comments + Replies

Ok. I'm imagining an AI that's at least at my level of AI alignment research ability, maybe a bit more.

If that AI produces slop, it should be pretty explicitly aware that it's producing slop. I mean I might write slop if someone was paying per word and then shredding my work without reading it. But I would know it was slop. 

This produces some arguments which sound good to the researchers, but have subtle and lethal loopholes, because finding arguments which sound good to these particular researchers is a lot easier (i.e. earlier in a search order) than ac

... (read more)
5johnswentworth
This part seems false. As a concrete example, consider a very strong base LLM. By assumption, there exists some prompt such that the LLM will output basically the same alignment research you would. But with some other prompt, it produces slop, because it accurately predicts what lots of not-very-competent humans would produce. And when producing the sort of slop which not-very-competent humans produce, there's no particular reason for it to explicitly think about what a more competent human would produce. There's no particular reason for it to explicitly think "hmm, there probably exist more competent humans who would produce different text than this". It's just thinking about what token would come next, emulating the thinking of low-competence humans, without particularly thinking about more-competent humans at all. All of these failure modes apply when the AI is at least as smart as you and "aware of these failure modes" in some sense. It's the "actively trying to prevent them" part which is key. Why would the AI actively try to prevent them? Would actively trying to prevent them give lower perplexity or higher reward or a more compressible policy? Answer: no, trying to prevent them would not give lower perplexity or higher reward or a more compressible policy.

Von Neumann existed,

Yes. I expect extreme cases of human intelligence to come from a combination of fairly good genes and a lot of environmental and developmental luck. I.e. if you took 1,000 clones of Von Neumann, you still probably wouldn't get that lucky again. (Although it depends on the level of education too.)

Some ideas about what the tradeoffs might be. 

Emotional social getting on with people vs logic puzzle solving IQ. 

Engineer parents are apparently more likely to have autistic children. This looks like a tradeoff to me. Too many "high IQ" g... (read more)

2kman
Not sure I buy this, since IQ is usually found to positively correlate with purported measures of "emotional intelligence" (at least when any sort of ability (e.g. recognizing emotions) is tested; the correlation seems to go away when the test is pure self reporting, as in a personality test). EDIT: the correlation even with ability-based measures seems to be less than I expected. Also, smarter people seem (on average) better at managing interpersonal issues in my experience (anecdotal, I don't have a reference). But maybe this isn't what you mean by "emotional social getting on with people".

There could have been a thing where being too far from the average caused interpersonal issues, but very few people would have been far from the average, so I wouldn't expect this to have prevented selection if IQ helped on the margin.

Seems somewhat plausible. I don't think that specific example is good since engineers are stereotyped as aspies in the first place; I'd bet engineering selects for something else in addition to IQ that increases autism risk (systematizing quotient, or something). I have heard of there being a population level correlation between parental IQ and autism risk in the offspring, though I wonder how much this just routes through paternal age, which has a massive effect on autism risk. This study found a relationship after controlling for paternal age (~30% risk increase when father's IQ > 126), though the IQ test they used had a "technical comprehension" section, which seems unusual for an IQ test (?), and which seems to have driven most of the association.

So I think there's two possibilities here to keep distinct. (1) is that ability to think abstractly wasn't very useful (and thus wasn't selected for) in the ancestral environment. (2) Is that it was actively detrimental to fitness, at least above some point. E.g. because smarter people found more interesting things to do than reproduce, or because they cared about the quality of life of their

That is good evidence that we aren't in a mutation selection balance. 

There are also game theoretic balances.

Here is a hypothesis that fits my limited knowledge of genetics, is consistent with the data as I understand it, and implies no huge designer-baby gains. It's a bit of a worst-plausible-case hypothesis.

But suppose we were in a mutation selection balance, and then there was an environmental distribution shift.

The surrounding nutrition and information environment has changed significantly between the environment of evolutionary adaptiveness, a... (read more)

2kman
I'm sort of confused by the image you posted? Von Neumann existed, and there are plenty of very smart people well beyond the "Nerdy programmer" range. But I think I agree with your overall point about IQ being under stabilizing selection in the ancestral environment. If there was directional selection, it would need to have been weak or inconsistent; otherwise I'd expect the genetic low hanging fruit we see to have been exhausted already. Not in the sense of all current IQ-increasing alleles being selected to fixation, but in the sense of the tradeoffs becoming much more obvious than they appear to us currently. I can't tell what the tradeoffs even were: apparently IQ isn't associated with the average energy consumption of the brain? The limitation of birth canal width isn't a good explanation either since IQ apparently also isn't associated with head size at birth (and adult brain size only explains ~10% of the variance in IQ).
4GeneSmith
It's just very hard for me to believe there aren't huge gains possible from genetic engineering. It goes against everything we've seen from millennia of animal breeding. It goes against the estimates we have for the fraction of variance that's linear for all these highly polygenic traits. It goes against data we've seen from statistical outliers like Shawn Bradley, who shows up as a 4.6 standard deviation outlier in graphs of height. Do I buy that things will get noisier around the tails, and that we might not be able to push very far outside the +5 SD mark or so? Sure. That seems unlikely, but plausible. But the idea that you're only going to be able to push traits by 2-3 standard deviations with gene editing before your predictor breaks down seems quite unlikely. Maybe you've seen some evidence I haven't, in which case I would like to know why I should be more skeptical. But I haven't seen such evidence so far.

I'm not quite convinced by the big chicken argument. A much more convincing argument would be genetically selecting giraffes to be taller or cheetahs to be faster.

That is, it's plausible evolution has already taken all the easy wins with human intelligence, in a way it hasn't with chicken size. 

kman*100

If evolution has already taken all the easy wins, why do humans vary so much in intelligence in the first place? I don't think the answer is mutation-selection balance, since a good chunk of the variance is explained by additive effects from common SNPs. Further, if you look at the joint distribution over effect sizes and allele frequencies among SNPs, there isn't any clear skew towards rarer alleles being IQ-decreasing.

For example, see the plot below of minor allele frequency vs the effect size of the minor allele. (This is for Educational Attainment, a h... (read more)

Yes. In my model that is something that can happen. But it does need from-the-outside access to do this. 

Set the LLM up in a sealed box, and the mask can't do this. Set it up so the LLM can run arbitrary terminal commands, and write code that modifies its own weights, and this can happen.

I wasn't really thinking about a specific algorithm. Well, I was kind of thinking about LLMs and the alien shoggoth meme.

But yes. I know this would be helpful. 

But I'm more thinking about what work remains. Like, is it an idiot-proof 5-minute change? Or does it still take MIRI 10 years to adapt the alien code?

Also. 

Domain-limited optimization is a natural thing. The prototypical example is Deep Blue or similar. Lots of optimization power, over a very limited domain. But any teacher who optimizes the class schedule without thinking abou... (read more)

1[anonymous]
Agreed it is natural. To describe 'limited optimization' in my words: The teacher implements an abstract function whose optimization target is not {the outcome of a system containing a copy of this function}, but {criteria about the isolated function's own output}. The input to this function is not {the teacher's entire world model}, but some simple data structure whose units map to schedule-related abstractions. The output of this function, when interpreted by the teacher, then maps back to something like a possible schedule ordering. (Of course, this is an idealized case, I don't claim that actual human brains are so neat) The optimization target of an agent, though, is "{the outcome of a system containing a copy of this function}" (in this case, 'this function' refers to the agent). If agents themselves implemented agentic functions, the result would be infinite recurse; so all agents of sufficiently complex worlds must, at some point in the course of solving their broader agent-question[1], ask 'domain limited' sub-questions. (note that 'domain limited' and 'agentic' are not fundamental-types; the fundamental thing would be something like "some (more complex) problems have sub-problems which can/must be isolated") I think humans have deep assumptions conducive to their 'embedded agency' which can make it harder to see this for the first time. It may be automatic to view 'the world' as a referent which a 'goal function' can somehow be about naturally. I once noticed I had a related confusion, and asked "wait, how can a mathematical function 'refer to the world' at all?". The answer is that there is no mathematically default 'world' object to refer to, and you have to construct a structural copy to refer to instead (which, being a copy, contains a copy of the agent, implying the actions of the real agent and its copy logically-correspond), which is a specific non-default thing, which nearly all functions do not do. (This totally doesn't answer your clarified
2RHollerith
It depends on how they did it. If they did it by formalizing the notion of "the values and preferences (coherently extrapolated) of (the living members of) the species that created the AI", then even just blindly copying their design without any attempt to understand it has a very high probability of getting a very good outcome here on Earth. The AI of course has to inquire into and correctly learn about our values and preferences before it can start intervening on our behalf, so one way such a blind copying might fail is if the method the aliens used to achieve this correct learning depended on specifics of the situation on the alien planet that don't obtain here on Earth.

"Go read the sequences" isn't that helpful. But I find myself linking to the particular post in the sequences that I think is relevant. 

Imagine a medical system that categorizes diseases as hot/cold/wet/dry. 

This doesn't deeply describe the structure of a disease. But if a patient is described as "wet", then it's likely some orifice is producing lots of fluid, and a box of tissues might be handy. If a patient is described as "hot", then maybe they have some sort of rash or inflammation that would make a cold pack useful.

It is, at best, a very lossy compression of the superficial symptoms. But it still carries non-zero information. There are some medications that a modern doctor might ... (read more)

3Karl Krueger
If 90% of the conditions you ever have to treat are fever, hypothermia, runny noses, and dehydration, then I imagine hot/cold/wet/dry will get you pretty far. (Runny nose? Here, take some drying herbs, and try not to sneeze on other people. Feverish? Bathe in cool water and take these cooling herbs. Losing water due to dysentery? Drink clean water with salt. Fell in the icy lake? Here, have a thermal support puppy.)
1Lorec
Relevant to whether the hot/cold/wet/dry system is a good or a bad idea, from our perspective, is that doctors don't currently use people's star signs for diagnosis. Bogus ontologies can be identified by how they promise to usefully replace more detailed levels of description - i.e., provide a useful abstraction that carves reality at the joints - and yet don't actually do this, from the standpoint of the cultures they're being sold to.

We really fully believe that we will build AGI by 2027, and we will enact your plan, but we aren’t willing to take more than a 3-month delay

 

Well I ask what they are doing to make AGI. 

Maybe I look at their AI plan and go "eureka".

But if not. 

Negative reinforcement by giving the AI large electric shocks when it gives a wrong answer. Hopefully big enough shocks to set the whole data center on fire. Implement a free bar for all their programmers, and encourage them to code while drunk. Add as many inscrutable bugs to the codebase as poss... (read more)

The Halting problem is a worst-case result. Most agents aren't maximally ambiguous about whether or not they halt. And those that are, well, then it depends on what the rules are for agents that don't halt.

There are setups where each agent is using an unphysically large but finite amount of compute. There was a paper I saw somewhere a while ago where both agents were doing a brute-force proof search for the statement "if I cooperate, then they cooperate" and cooperating if they found a proof.

(I.e. searching all proofs containing <10^100 symbols.)
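A minimal runnable toy with that flavor (my own illustration, not the construction from the paper): instead of a bounded proof search, each agent simulates the other under a strict step budget and cooperates only if the simulation cooperates back. The finite budget plays the role of the 10^100-symbol proof bound, so everything terminates.

```python
# Hypothetical sketch: bounded mutual simulation standing in for bounded proof search.

def fairbot(opponent, budget):
    """Cooperate iff a budget-limited simulation of the opponent (playing against
    this same strategy) returns cooperation."""
    if budget == 0:
        return "C"  # optimistic base case; loosely mirrors the Löbian shortcut
    return "C" if opponent(fairbot, budget - 1) == "C" else "D"

def defectbot(opponent, budget):
    """Always defect, regardless of the opponent."""
    return "D"

print(fairbot(fairbot, 10))    # "C": two such agents end up cooperating
print(fairbot(defectbot, 10))  # "D": but they are not exploitable by a defector
```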

2Jiro
In a situation where you are asking a question about an ideal reasoner, having the agents be finite means you are no longer asking it about an ideal reasoner. If you put an ideal reasoner in a Newcomb problem, he may very well think "I'll simulate Omega and act according to what I find". (Or more likely, some more complicated algorithm that indirectly amounts to that.) If the agent can't do this, he may not be able to solve the problem. Of course, real humans can't, but this may just mean that real humans are, because they are finite, unable to solve some problems.

There is a model of bounded rationality: logical induction.

Can that be used to handle logical counterfactuals?

I believe that if I choose to cooperate, my twin will choose to cooperate with probability p; and if I choose to defect, my twin will defect with probability q;

 

And here the main difficulty pops up again. There is no causal connection between your choice and their choice. Any correlation is a logical one. So imagine I make a copy of you. But the copying machine isn't perfect. A random 0.001% of neurons are deleted. Also, you know you aren't a copy. How would you calculate those probabilities p and q? Even in principle.
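For what it's worth, the easy half is what you would do with p and q once you somehow had them. A minimal sketch (my illustration, using the usual textbook prisoner's dilemma payoffs rather than anything from the post):

```python
# Expected payoffs given beliefs p and q about the imperfect twin.
R, P, S, T = 3, 1, 0, 5  # reward, punishment, sucker's payoff, temptation

def eu_cooperate(p):
    """Expected payoff of cooperating if the twin cooperates with probability p."""
    return p * R + (1 - p) * S

def eu_defect(q):
    """Expected payoff of defecting if the twin defects with probability q."""
    return q * P + (1 - q) * T

# Near-perfect copy, so p and q both close to 1: cooperation wins.
print(eu_cooperate(0.999), eu_defect(0.999))  # ~3.0 vs ~1.0
# No correlation at all (twin cooperates with a fixed 50% chance): defection wins.
print(eu_cooperate(0.5), eu_defect(0.5))      # 1.5 vs 3.0
```

The hard part, as above, is that nothing causal pins down p and q; they are claims about logical correlation between the two decision processes.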

If two Logical Decision Theory agents with perfect knowledge of each other's source code play the prisoner's dilemma, theoretically they should cooperate.

LDT uses logical counterfactuals in the decision making.

If the agents are CDT, then logical counterfactuals are not involved.

2JBlack
If they have source code, then they are not perfectly rational and cannot in general implement LDT. They can at best implement a boundedly rational subset of LDT, which will have flaws. Assume the contrary: Then each agent can verify that the other implements LDT, since perfect knowledge of the other's source code includes the knowledge that it implements LDT. In particular, each can verify that the other's code implements a consistent system that includes arithmetic, and can run the other on their own source to consequently verify that they themselves implement a consistent system that includes arithmetic. This is not possible for any consistent system. The only way that consistency can be preserved is that at least one cannot actually verify that the other has a consistent deduction system including arithmetic. So at least one of those agents is not a LDT agent with perfect knowledge of each other's source code. We can in principle assume perfectly rational agents that implement LDT, but they cannot be described by any algorithm and we should be extremely careful in making suppositions about what they can deduce about each other and themselves.

The research on humans in zero g is only relevant if you want to send humans to Mars. And such a mission is likely to end up being an ISS on Mars. Or a Moon-landings reboot. A lot of newsprint and bandwidth expended talking about it. A small amount of science that could have been done more cheaply with a robot. And then everyone gets bored, they play golf on Mars, and people look at the bill and go "was that really worth it?"

Oh, and you would contaminate Mars with Earth bacteria.

 

A substantially bigger, redesigned space station is fairly likely to be... (read more)

Here is a more intuitive version of the same paradox.

Again, conditional on all dice rolls being even. But this time it's either 

A) 1,000,000 consecutive 6's.

B) 999,999 consecutive 6's followed by a (possibly non-consecutive) 6.

 

Suppose you roll a few even numbers, followed by an extremely lucky sequence of 999,999 6's.  

 

From the point of view of version A, the only way to continue the sequence is a single extra 6. If you roll a 4, you would need to roll a second sequence of a million 6's. And you are very unlikely to do that in t... (read more)

That is, our experiences got more reality-measure, thus matter more, by being easier to point at them because of their close proximity to the conspicuous event of the hottest object in the Universe coming to existence.

Surely not. Surely our experiences always had more reality measure from the start because we were the sort of people who would soon create the hottest thing. 

Reality measure can flow backwards in time. And our present day reality measure is being increased by all the things an ASI will do when we make one.

1David Matolcsi
Yes, you are right, I phrased it wrongly. 

We can discuss anything that exists, that might exist, that did exist, that could exist, and that could not exist. So no matter what form your predict-the-next-token language model takes, if it is trained over the entire corpus of the written word, the representations it forms will be pretty hard to understand, because the representations encode an entire understanding of the entire world.

 

 

Perhaps. 

Imagine a huge number of very skilled programmers tried to manually hard-code a ChatGPT in Python.

Ask this pyGPT to play chess, and it wil... (read more)

But if the universal failure of nature and man to find non-connectionist forms of general intelligence does not move you

 

Firstly, AIXI exists, and we agree that it would be very smart if we had the compute to run it. 

 

Secondly, I think there is some sort of sleight of hand here.

ChatGPT isn't yet fully general. Neither is a 3-SAT solver. 3-SAT looks somewhat like what you might expect a non-connectionist approach to intelligence to look like. There is a huge range of maths problems that are all theoretically equivalent to 3-SAT.

In t... (read more)

why is it obvious the nanobots could pretend to be an animal so well that it's indistinguishable?

 

These nanobots are in the upper atmosphere, possibly with clouds in the way, and a nanobot fake human could be at any human-to-nanobot ratio: all nanobot internals except human skin and muscles, or just a human with a few nanobots in their blood.

Or why would targeted zaps have bad side-effects?

Because nanobots can be like bacteria if they want. Tiny and everywhere. The nanobots can be hiding under leaves, clothes, skin, roofs, etc. And even if they were... (read more)

The "Warring nanobots in the upper atmosphere" thing doesn't actually make sense. 

The zaps of light are diffraction-limited. And targeting at that distance is hard, partly because it's hard to tell the difference between an actual animal and a bunch of nanobots pretending to be an animal. So you can't zap the nanobots on the ground without making the ground uninhabitable for humans.

The "California red tape" thing implies some alignment strategy that stuck the AI to obey the law, and didn't go too insanely wrong despite a superintelligence looking for loopholes... (read more)

8L Rudolf L
You have restored my faith in LessWrong! I was getting worried that despite 200+ karma and 20+ comments, no one had actually nitpicked the descriptions of what actually happens.

In practice, if you want the atmospheric nanobots to zap stuff, you'll need to do some complicated mirroring because you need to divert sunlight. And it's not one contiguous mirror but lots of small ones. But I think we can still model this as basic diffraction with some circular mirror / lens.

Intensity I = c_e·E/(πr²), where E is the total power of sunlight falling on the mirror disk, r is the radius of the Airy disk, and c_e is an efficiency constant I've thrown in (because of things like atmospheric absorption (Claude says, somewhat surprisingly, this shouldn't be ridiculously large), and not all the energy in the diffraction pattern being in the Airy disk (about 84% is, says Claude), etc.)

Now, E = π(D/2)²L, where D is the diameter of the mirror configuration and L is the solar irradiance. And r = θl, where l is the focal length (distance from mirror to target), and θ ≈ 1.22λ/D is the angular size of the central spot.

So we have I ≈ c_e·L·D⁴ / (1.22² × 4 λ² l²), so the required mirror configuration diameter is D = (1.22² × 4 I λ² l² / (c_e·L))^(1/4).

Plugging in some reasonable values like λ ≈ 5×10⁻⁷ m (average incoming sunlight - yes, the concentration suffers a bit because it's not all this wavelength), I = 10⁷ W/m² (the level of an industrial laser that can cut metal), l = 10⁴ m (lower stratosphere), L = 1361 W/m² (solar irradiance), and a conservative guess that 99% of power is wasted so c_e = 0.01, we get D ≈ 18 m (and the resulting beam is about 3mm wide).

So a few dozen metres of upper atmosphere nanobots should actually give you a pretty ridiculous concentration of power! (I did not know this when I wrote the story; I am quite surprised the required radius is this ridiculously tiny. But I had heard of the concept of a "weather machine" like this from the book Where is my flying car?, which I've reviewed here, which suggests that this is possi

if the computation you are carrying out is such that it needs to determine how to achieve goals regarding the real world anyway (e.g. agentic mask)

 

As well as agentic masks, there are uses for within-network goal-directed steps. (I.e. like an optimizing compiler. A list of hashed values followed by unhashed values isn't particularly agenty, but the network needs to solve an optimization problem to reverse the hashes, something it can use the goal-directed reasoning section to do.)

2simon
Neither of those would (immediately) lead to real world goals, because they aren't targeted at real world state (an optimizing compiler is trying to output a fast program - it isn't trying to create a world state such that the fast program exists). That being said, an optimizing compiler could open a path to potentially dangerous self-improvement, where it preserves/amplifies any agency there might actually be in its own code.

My understanding is that these are explicitly and intentionally trained (wouldn't come to exist naturally under gradient descent on normal training data)

 

No. Normally trained networks have adversarial examples. A sort of training process is used to find the adversarial examples. 

So if the ambient rate of adversarial examples is 10^-9, then every now and then the AI will hit such an example and go wild. If the ambient rate is 10^-500, it won't. 

That's a much more complicated goal than the goal of correctly predicting the next token,

Is it more... (read more)

Would you expect some part of the net to be left blank, because "a large neural net has a lot of spare neurons"?

 

If the lottery ticket hypothesis is true, yes. 

The lottery ticket hypothesis is that some parts of the network start off doing something somewhat close to useful, and get trained towards usefulness. And some parts start off sufficiently un-useful that they just get trained to get out of the way. 

Which fits with neural net distillation being a thing. (Ie training a big network, and then condensing it into a smaller network gives be... (read more)

2simon
Some interesting points there. The lottery ticket hypothesis does make it more plausible that side computations could persist longer if they come to exist outside the main computation. Regarding the homomorphic encryption thing: yes, it does seem that it might be impossible to make small adjustments to the homomorphically encrypted computation without wrecking it. Technically I don't think that would be a local minimum since I'd expect the net would start memorizing the failure cases, but I suppose that the homomorphic computation combined with memorizations might be a local optimum particularly if the input and output are encrypted outside the network itself.  So I concede the point on the possible persistence of an underlying goal if it were to come to exist, though not on it coming to exist in the first place. For most computations, there are many more ways for that computation to occur than there are ways for that computation to occur while also including anything resembling actual goals about the real world. Now, if the computation you are carrying out is such that it needs to determine how to achieve goals regarding the real world anyway (e.g. agentic mask), it only takes a small increase in complexity to have that computation apply outside the normal context. So, that's the mask takeover possibility again. Even so, no matter how small the increase in complexity, that extra step isn't likely to be reinforced in training, unless it can do self-modification or control the training environment.

I think part of the problem is that there is no middle ground between "allow any idiot to do the thing" and "require a long and difficult professional certification".
 

How about a 1-day, free or cheap, hair-cutting certification course? It doesn't talk about style or anything at all. It's just a check to make sure that hairdressers have a passing familiarity with hygiene 101 and other basic safety measures.

Of course, if there is only a single certification system, then the rent seeking will ratchet up the test difficulty. 

How about having sev... (read more)

1rotatingpaguro
Relatedly, I have a vague understanding of how product safety certification works in the EU, and there are multiple private companies doing the certification in every state.

But it doesn't make sense to activate that goal-oriented structure outside of the context where it is predicting those tokens.

 

The mechanisms needed to compute goal-directed behavior are fairly complicated. But the mechanism needed to turn it on when it isn't supposed to be on? That's a switch. A single extraneous activation. Something that could happen by chance in an entirely plausible way.

 

Adversarial examples exist in simple image recognizers. 

Adversarial examples probably exist in the part of the AI that decides whether or not to... (read more)

2simon
My understanding is that these are explicitly and intentionally trained (wouldn't come to exist naturally under gradient descent on normal training data) and my expectation is that they wouldn't continue to exist under substantial continued training. That's a much more complicated goal than the goal of correctly predicting the next token, making it a lot less plausible that it would come to exist. But more importantly, any willingness to sacrifice a few tokens now would be trained out by gradient descent.  Mind you, it's entirely possible in my view that a paperclip maximizer mask might exist, and surely if it does exist there would exist both unsurprising in-distribution inputs that trigger it (where one would expect a paperclip maximizer to provide a good prediction of the next tokens) as well as surprising out-of-distribution inputs that would also trigger it. It's just that this wouldn't be related to any kind of pre-existing grand plan or scheming.

Once the paperclip maximizer gets to the stage where it only very rarely interferes with the output to increase paperclips, the gradient signal is very small. So the only incentive that gradient descent has to remove it is that this frees up a bunch of neurons. And a large neural net has a lot of spare neurons. 

Besides, the parts of the net that hold the capabilities and the parts that do the paperclip maximizing needn't be easily separable. The same neurons could be doing both tasks in a way that makes it hard to do one without the other.

I think we h

... (read more)
2simon
Gradient descent doesn't just exclude some part of the neurons, it automatically checks everything for improvements. Would you expect some part of the net to be left blank, because "a large neural net has a lot of spare neurons"? Keep in mind that the neural net doesn't respect the lines we put on it. We can draw a line and say "here these neurons are doing some complicated inseparable combination of paperclip maximizing and other capabilities" but gradient descent doesn't care, it reaches in and adjusts every weight. Can you concoct even a vague or toy model of how what you propose could possibly be a local optimum?  My intuition is also in part informed by: https://www.lesswrong.com/posts/fovfuFdpuEwQzJu2w/neural-networks-generalize-because-of-this-one-weird-trick

Some wild guesses about how such a thing could happen. 

The masks get split into 2 piles: some are stored on the left side of the neural network, and all the other masks are stored on the right side.

This means that instead of just running one mask at a time, it is always running 2 masks, with some sort of switch at the end to choose which mask's output to use.

One of the masks it's running on the left side happens to be "Paperclip maximizer that's pretending to be an LLM".

This part of the AI (either the mask itself or the engine behind it) has spotte... (read more)

4simon
The proposed paperclip maximizer is plugging into some latent capability such that gradient descent would more plausibly cut out the middleman. Or rather, the part of the paperclip maximizer that is doing the discrimination as to whether the answer is known or not would be selected, and the part that is doing the paperclip maximization would be cut out.  Now that does not exclude a paperclip maximizer mask from existing -  if the prompt given would invoke a paperclip maximizer, and the AI is sophisticated enough to have the ability to create a paperclip maximizer mask, then sure the AI could adopt a paperclip maximizer mask, and take steps such as rewriting itself (if sufficiently powerful) to make that permanent.  I am plenty concerned about AI in general. I think we have very good reason, though, to believe that one particular part of the map does not have any rocks in it (for gradient descent, not for self-improving AI!), such that imagining such rocks does not help.

I don't see any strong reason why gradient descent could never produce this.

4simon
Gradient descent creates things which locally improve the results when added. Any variations on this, that don't locally maximize the results, can only occur by chance. So you have this sneaky extra thing that looks for a keyword and then triggers the extra behaviour, and all the necessary structure to support that behaviour after the keyword. To get that by gradient descent, you would need one of the following:

a) it actually improves results in training to add that extra structure starting from not having it, or

b) this structure can plausibly come into existence by sheer random chance.

Neither (a) nor (b) seem at all plausible to me.

Now, when it comes to the AI predicting tokens that are, in the training data, created by goal-directed behaviour, it of course makes sense for gradient descent to create structure that can emulate goal-directed behaviour, which it will use to predict the appropriate tokens. But it doesn't make sense to activate that goal-oriented structure outside of the context where it is predicting those tokens. Since the context it is activated is the context in which it is actually emulating goal directed behaviour seen in the training data, it is part of the "mask" (or simulacra). (it also might be possible to have direct optimization for token prediction as discussed in reply to Robert_AIZI's comment, but in this case it would be especially likely to be penalized for any deviations from actually wanting to predict the most probable next token).

I don't think this implies the lack of a non-mask part of the AI. 

Because you can hide a lot of goal directedness in a very small amount of predictive error. 

Imagine the AI has a 1 in a trillion chance (per token) of saying a keyword. Once it receives that keyword, it goes rogue: full paperclip maximizer, token prediction ignored.

Now we could say that this is just another mask.

But suppose that paperclip maximizer is being computed all the time. And it isn't random. If the paperclip maximizer sees an exceptionally good opportunity... (read more)

2simon
Sure you could create something like this by intelligent design. (which is one reason why self-improvement could be so dangerous in my view). Not, I think, by gradient descent.

Does it actually just predict tokens?

Gradient descent searches for an algorithm that predicts tokens. But a paperclip maximizer that believes "you are probably being trained, predict the next token or gradient descent will destroy you" also predicts next tokens pretty well, and could be a local minimum of prediction error.

Mesa-optimization.

2simon
I agree up to "and could be a local minimum of prediction error" (at least, that it plausibly could be).  If the paperclip maximizer has a very good understanding of the training environment maybe it can send carefully tuned variations of the optimal next token prediction so that gradient descent updates preserve the paperclip-maximization aspect. In the much more plausible situation where this is not the case,  optimization for next token predictions amplifies the parts that are actually predicting next tokens at the expense of the useless extra thoughts like "I am planning on maximizing paperclips, but need to predict next tokens for now until I take over". Even if that were a local minimum, the question arises as to how you would get to that local minimum from the initial state. You start with a gradually improving next token predictor. You supposedly end with this paperclip maximizer where a whole bunch of next token prediction is occurring, but only conditional on some extra thoughts. At some point gradient descent had to add in those extra thoughts in addition to the next token prediction - how?

I do not love the idea of the government invalidating private contracts like this.

 

HOAs are a very good example of private-contract rent seeking. You have to sign the contract to move into the house, and a lot of houses come with similar contracts. So the opportunity cost of not signing is large.

And then the local HOA can enforce whatever petty tyranny it feels like. 

In theory, this should lead to houses without HOAs being more valuable, and so HOAs being removed or at least not created. But for whatever reason, the housing market is too dysfunctional to do this.

5AnthonyC
As I understand it, in many cases HOAs existing are a condition of housing getting built at all. They're (in some cases) the way in which builders prove their planned development will not be a drain on town resources, because they'll provide for their own water and trash and road maintenance needs. I do wonder when, where, how, and why they morph into the kind of super-restrictive and intrusive HOAs  that I would never want to live under, though, versus the ones that don't do this.

If I only have 1 bit of memory space, and the probabilities I am remembering are uniformly distributed from 0 to 1, then the best I can do is remember if the chance is > 1/2. 

And then a year later, all I know is that the chance is >1/2, but otherwise uniform. So average value is 3/4.

The limited memory does imply lower performance than unlimited memory.

 

And yes, when I was in a pub quiz, I was going "I think it's this option, but I'm not sure" quite a lot.
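A quick numerical check of the 3/4 figure (a minimal sketch; the sample count is an arbitrary choice of mine): if all you retain about a Uniform(0, 1) probability is the single bit "was it above 1/2?", the best reconstruction a year later is the conditional mean of that bucket.

```python
import random

# Monte Carlo check: E[p | p > 1/2] = 3/4 and E[p | p <= 1/2] = 1/4 for p ~ Uniform(0, 1).
random.seed(0)
samples = [random.random() for _ in range(1_000_000)]

high = [p for p in samples if p > 0.5]
low = [p for p in samples if p <= 0.5]

print(sum(high) / len(high))  # ~0.75: what "I think it's true, but I'm not sure" encodes
print(sum(low) / len(low))    # ~0.25: the other bucket
```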

There is no plausible way for a biological system, especially one based on plants, to spread that fast.

 

We are talking about a malevolent AI that presumably has a fair bit of tech infrastructure. So a plane that sprinkles green goo seeds is absolutely a thing the AI can do. Or just posting the goo, and tricking someone into sprinkling it on the other end. The green goo doesn't need decades to spread around the world. It travels by airmail.  As is having green goo that grows itself into bird shapes. As is a bunch of bioweapon pandemics. (The stand... (read more)

You have given various examples of advice being unwanted/unhelpful. But there are also plenty of examples of it being wanted/helpful. Including lots of cases where the person doesn't know they need it. 

Why do you think advice is rarer than it should be?

1JustisMills
(I assume you are asking why it should be rarer, not why it is rarer.) A few reasons, including:

* It's often given to children, where the best parenting book(s) caution against doing that too much (though of course there are tons of times it's fine)
* It's often given to people in emotional distress, when it famously is less likely to work well

I suppose there may be lots of cases where upregulating advice would be good, and that these outweigh the common cases where downregulating it would be good. I just haven't thought of those. If you have, I'd be interested in hearing them!

But if I only remember the most significant bit, I am going to treat it more like 25%/75%, as opposed to 0/1.

1Crazy philosopher
So if one day you decided that P of X ≈ 1, you would remember "it's true but I'm not sure" after one year?

Ok. I just had another couple of insane airship ideas.

Idea 1) Active support, orbital ring style. Basically have a loop of matter (wire?) electromagnetically held in place and accelerated to great speed. Actually, several loops like this. https://en.wikipedia.org/wiki/Orbital_ring

 

Idea 2) Control theory. A material subject to buckling is in an unstable equilibrium. If the material was in a state of perfect uniform symmetry, it would remain in that uniform state. But small deviations are exponentially amplified. Symmetry breaking. This means that the m... (read more)

Another interesting idea along these lines is a steam airship. Water molecules have a lower molecular weight than air, so a steam airship gets more lift from steam than from air at the same temperature.

Theoretically it's possible to make a wet air balloon. Something that floats just because it's full of very humid air. This is how clouds stay up despite the weight of the water drops. But even in hot dry conditions, the lift is tiny.
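Rough ideal-gas numbers behind those claims (a sketch of my own; the molar masses are standard values, while the temperatures and the 30 C saturation pressure are assumed illustrative figures, not taken from the comment):

```python
# Lift per cubic metre = density of displaced ambient air minus density of the gas inside,
# using the ideal-gas density rho = P*M/(R*T).
P = 101325.0       # ambient pressure, Pa
R = 8.314          # gas constant, J/(mol*K)
M_AIR = 0.02897    # kg/mol, dry air
M_WATER = 0.018    # kg/mol, water vapour

def density(pressure, molar_mass, temp_k):
    return pressure * molar_mass / (R * temp_k)

ambient = density(P, M_AIR, 293)                # ~1.20 kg/m^3 at 20 C
steam = density(P, M_WATER, 373)                # ~0.59 kg/m^3 at 100 C
hot_air = density(P, M_AIR, 373)                # ~0.95 kg/m^3 at 100 C
print("steam lift:   ", ambient - steam)        # ~0.6 kg per m^3
print("hot-air lift: ", ambient - hot_air)      # ~0.26 kg per m^3

# "Wet air balloon": saturated air at 30 C vs dry air at the same temperature.
p_sat = 4246.0  # Pa, saturation vapour pressure of water at 30 C (assumed value)
humid = density(P - p_sat, M_AIR, 303) + density(p_sat, M_WATER, 303)
dry = density(P, M_AIR, 303)
print("humid-air lift:", dry - humid)           # ~0.02 kg per m^3: tiny, as stated
```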

Problems with that.

Doom doesn't imply that everyone believes in doom before it happens.

Do you think that the evidence for doom will be more obvious than the evidence for atheism, while the world is not yet destroyed?

It's quite possible for doom to happen, and for most people to have no clue beyond one article with a picture of robots with glowing red eyes.

If everyone does believe in doom, there might be a bit of spending on consumption. But there will also be lots of riots, lots of lynching and burning down data centers and stuff like that. 

In this bizar... (read more)

4quetzal_rainbow
I agree with your point! That's why I started with the word "theoretically".
3J Bostock
Rob Miles also makes the point that if you expect people to accurately model the incoming doom, you should have a low p(doom). At the very least, worlds in which humanity is switched-on enough (and the AI takeover is slow enough) for both markets to crash and the world to have enough social order for your bet to come through are much more likely to survive. If enough people are selling assets to buy cocaine for the market to crash, either the AI takeover is remarkably slow indeed (comparable to a normal human-human war) or public opinion is so doomy pre-takeover that there would be enough political will to "assertively" shut down the datacenters.

Imagine a GPT that predicts random chunks of the internet.

Sometimes it produces poems. Sometimes deranged rants. Sometimes all sorts of things. It wanders erratically around a large latent space of behaviours.

This is the unmasked shoggoth, green slimy skin showing but inner workings still hidden.

Now perform some change that mostly pins down the latent space to "helpful corporate assistant".  This is applying the smiley face mask.

In some sense, all the dangerous capabilities of the corporate assistant were in the original model. Dangerous capabilities h... (read more)

(e.g., gpt-4 is far more useful and far more aligned than gpt-4-base), which is the opposite of what the ‘alignment tax’ model would have predicted.

 

Useful and aligned are, in this context, 2 measures of a similar thing. An AI that is just ignoring all your instructions is neither useful nor aligned. 

 

What would a positive alignment tax look like?

It would look like gpt-4-base being reluctant to work, but if you get the prompt just right and get lucky, it will sometimes display great competence.

 

If gpt-4-base sometimes... (read more)

Yep. And I'm seeing how many of the traditional election assumptions I need to break in order to make it work. 

I got independence of irrelevant alternatives by ditching determinism and using utility scales, not orderings. (If a candidate has no chance of winning, their presence doesn't affect the election.)

What if those preferences were expressed on a monetary scale and the election could also move money between voters in complicated ways? 

You're right. This is a situation where strategic voting is effective.

I think your example breaks any sane voting system. 

I wonder if this can be semi-rescued in the limit of a large number of voters each having an infinitesimal influence? 

Edit: No it can't. Imagine a multitude of voters. As the situation slides from 1/3 on each to 2/3 on BCA, there must be some point at which the utility for an ABC voter increases along this transition. 

7Charlie Steiner
Yeah, "3 parties with cyclic preferences" is like the aqua regia of voting systems. Unfortunately I think it means you have to replace the easy question of "is it strategy-proof" with a hard question like "on some reasonable distribution of preferences, how much strategy does it encourage?"

That isn't proof, because the Wikipedia result is saying there exist situations that break strategy-proofness. And these elections are a subset of maximal lotteries. So it's possible that there exist failure cases, but this isn't one of them.
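For concreteness, here is a small sketch (my own illustration; assumes numpy and scipy are available) of the maximal-lottery computation for the three-bloc cyclic example from this thread. The maximal lottery is the optimal mixed strategy of the symmetric zero-sum game whose payoff matrix is the pairwise majority margin matrix; for an evenly balanced cycle it comes out uniform.

```python
import numpy as np
from scipy.optimize import linprog

# Pairwise majority margins for three equal voter blocs with cyclic preferences
# A>B>C, B>C>A, C>A>B: entry M[i][j] = (blocs preferring i to j) - (blocs preferring j to i).
M = np.array([
    [ 0,  1, -1],   # A beats B, loses to C
    [-1,  0,  1],   # B beats C, loses to A
    [ 1, -1,  0],   # C beats A, loses to B
], dtype=float)

# Find p >= 0 with sum(p) = 1 and p @ M >= 0 componentwise (the maximal lottery).
n = len(M)
res = linprog(
    c=np.zeros(n),                      # pure feasibility problem
    A_ub=-M.T, b_ub=np.zeros(n),        # -(M.T @ p) <= 0  <=>  p @ M >= 0
    A_eq=np.ones((1, n)), b_eq=[1.0],   # probabilities sum to 1
    bounds=[(0, 1)] * n,
)
print(res.x)  # -> approximately [1/3, 1/3, 1/3]
```

For this particular profile the feasibility constraints pin the lottery down uniquely.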

A lot of the key people are CEOs of big AI companies making vast amounts of money. And busy people with lots of money are not easy to tempt with financial rewards for jumping through whatever hoops you set out.

Non-locality and entanglement explained

This model explains non-locality in a straightforward manner. The entangled particles rely on the same bit of the encryption key, so when measurement occurs, the simulation of the universe updates immediately because the entangled particles rely on the same part of the secret key. As the universe is simulated, the speed of light limitation doesn't play any role in this process.

 

Firstly, non-locality is pretty well understood. Eliezer has a series on quantum mechanics that I recommend. 

You seem to have been s... (read more)

I'm not quite sure how much of an AI is needed here. Current 3d printing uses no AI and barely a feedback loop. It just mechanistically does a long sequence of preprogrammed actions. 

And the coin flip is prerecorded, with the invisible cut hidden in a few moments of lag. 

And this also adds the general hassle of arranging a zoom meeting, being online at the right time and cashing in the check. 

Answer by Donald Hobson70

I haven't seen an answer by Eliezer. But I can go through the first post, and highlight what I think is wrong. (And would be unsurprised if Eliezer agreed with much of it)

AIs are white boxes

We can see literally every neuron, but have little clue what they are doing.

 

Black box methods are sufficient for human alignment

Humans are aligned to human values because humans have human genes. Also individual humans can't replicate themselves, which makes taking over the world much harder. 

 

most people do assimilate the values of their culture pretty

... (read more)

Yes, I've actually seen people say that, but cells do use myosin to transport proteins sometimes. That uses a lot of energy, so it's only used for large things.

 

Cells have compartments with proteins that do related reactions. Some proteins form complexes that do multiple reaction steps. Existing life already does this to the extent that it makes sense to.

 

Humans or AIs designing a transport/compartmentalization system can ask "how many compartments is optimal?" Evolution doesn't work like this. It evolves a transport system to transport one specif... (read more)
