If that AI produces slop, it should be pretty explicitly aware that it's producing slop. I mean I might write slop if someone was paying per word and then shredding my work without reading it. But I would know it was slop.

This produces some arguments which sound good to the researchers, but have subtle and lethal loopholes, because finding arguments which sound good to these particular researchers is a lot easier (i.e. earlier in a search order) than actually solving the problem.

Regardless of which is easier, if the AI is doing this, it has to be thinking about the researchers' psychology, not just about alignment.

How many of these failure modes still happen when there is an AI at least as smart as you, that is aware of these failure modes and actively trying to prevent them?

How to Make Superbabies

Donald Hobson1mo20

Von Neumann existed,

Yes. I expect extreme cases of human intelligence to come from a combination of fairly good genes and a lot of environmental and developmental luck. I.e. if you took 1000 clones of Von Neumann, you still probably wouldn't get that lucky again. (Although it depends on the level of education too.)
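A toy simulation of the clone claim above, as a rough sketch only. It assumes the trait is just genes plus luck, both independent Gaussians, with half the variance from each. That 50/50 split, the population size and the function name `simulate_clones` are made up for illustration, not taken from any data.

```python
import random


def simulate_clones(n_population=200_000, n_clones=1000, gene_share=0.5, seed=0):
    """Toy model: trait = genes + luck, two independent Gaussians.

    gene_share is the ASSUMED fraction of trait variance due to genes
    (an illustrative number, not an estimate from any data).
    """
    rng = random.Random(seed)
    gene_sd = gene_share ** 0.5
    luck_sd = (1.0 - gene_share) ** 0.5

    # Find the single most extreme individual in a random population.
    best_trait, best_genes = float("-inf"), 0.0
    for _ in range(n_population):
        genes = rng.gauss(0.0, gene_sd)
        luck = rng.gauss(0.0, luck_sd)
        if genes + luck > best_trait:
            best_trait, best_genes = genes + luck, genes

    # Clones keep the same genes but redraw the luck from scratch.
    clones = [best_genes + rng.gauss(0.0, luck_sd) for _ in range(n_clones)]
    mean_clone = sum(clones) / n_clones
    n_as_extreme = sum(c >= best_trait for c in clones)
    return best_trait, best_genes, mean_clone, n_as_extreme


if __name__ == "__main__":
    trait, genes, mean_clone, n_as_extreme = simulate_clones()
    print(f"original's trait:        {trait:.2f}")
    print(f"original's gene part:    {genes:.2f}")
    print(f"average clone trait:     {mean_clone:.2f}")
    print(f"clones matching original: {n_as_extreme} of 1000")
```

In this model the most extreme individual got there partly on luck, so the clones average out near the gene part alone and few or none match the original. How strong the effect is depends entirely on the assumed `gene_share`.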

Some ideas about what the tradeoffs might be.

Emotional social getting on with people vs logic puzzle solving IQ.

Engineer parents are apparently more likely to have autistic children. This looks like a tradeoff to me. Too many "high IQ" genes and you risk autism.

How many angels can dance on the head of a pin. In the modern world, we have complicated elaborate theoretical structures that are actually correct and useful. In the pre-modern world, the sort of mind that now obsesses about quantum mechanics would be obsessing about angels dancing on pinheads or other equally useless stuff.

How to Make Superbabies

Donald Hobson1mo20

That is good evidence that we aren't in a mutation selection balance.

There are also game theoretic balances.

Here is a hypothesis that fits my limited knowledge of genetics, and is consistent with the data as I understand it and implies no huge designer baby gains. It's a bit of a worst plausible case hypothesis.

But suppose we were in a mutation selection balance, and then there was an environmental distribution shift.

The surrounding nutrition and information environment has changed significantly between the environment of evolutionary adaptiveness, and today.

A large fraction of what was important in the ancestral world was probably quite emotion based. Eg calming down other tribe members. Winning friends and influencing people.

In the modern world, abstract logic and maths are somewhat more important than they were, although the emotional stuff still matters too.

IQ tests mostly test the more abstract logical stuff.

Now suppose that the optimum genes aren't that different compared to ambient genetic variation. Say 3 standard deviations.

How to Make Superbabies

Donald Hobson1mo50

I'm not quite convinced by the big chicken argument. A much more convincing argument would be genetically selecting giraffes to be taller or cheetahs to be faster.

That is, it's plausible evolution has already taken all the easy wins with human intelligence, in a way it hasn't with chicken size.

Hopeful hypothesis, the Persona Jukebox.

Donald Hobson1mo20

Fixed

Hopeful hypothesis, the Persona Jukebox.

Donald Hobson1mo20

Yes. In my model that is something that can happen. But it does need from-the-outside access to do this.

Set the LLM up in a sealed box, and the mask can't do this. Set it up so the LLM can run arbitrary terminal commands and write code that modifies its own weights, and this can happen.

How useful would alien alignment research be?

Donald Hobson2mo20

I wasn't really thinking about a specific algorithm. Well, I was kind of thinking about LLMs and the alien shoggoth meme.

But yes. I know this would be helpful.

But I'm more thinking about what work remains. Like, is it an idiot-proof 5 minute change? Or does it still take MIRI 10 years to adapt the alien code?

Also.

Domain limited optimization is a natural thing. The prototypical example is Deep Blue or similar. Lots of optimization power, over a very limited domain. But any teacher who optimizes the class schedule without thinking about putting nanobots in the students' brains is doing something similar.

I am guessing and hoping that the masks in an LLM are at least as limited in their optimization as humans, often more so, due to LLMs' tendency to learn the most usefully predictive patterns first. Hidden long term sneaky plans will only very rarely influence the text. (Due to the plans being hidden.)

And, I hope, the shoggoth isn't itself particularly interested in optimizing the real world. The shoggoth just chooses what mask to wear.

So.

Can we duct tape a mask of "alignment researcher" onto a shoggoth, and keep the mask in place long enough to get some useful alignment research done?

The more that there is one "know it when you see it" simple alignment solution, the more likely this is to work.

The 101 Space You Will Always Have With You

Donald Hobson2mo20

"Go read the sequences" isn't that helpful. But I find myself linking to the particular post in the sequences that I think is relevant.

Don't fall for ontology pyramid schemes

Donald Hobson3mo30

Imagine a medical system that categorizes diseases as hot/cold/wet/dry.

This doesn't deeply describe the structure of a disease. But if a patient is described as "wet", then it's likely some orifice is producing lots of fluid, and a box of tissues might be handy. If a patient is described as "hot", then maybe they have some sort of rash or inflammation that would make a cold pack useful.

It is, at best, a very lossy compression of the superficial symptoms. But it still carries non-zero information. There are some medications that a modern doctor might commonly use on "wet" patients, but only rarely use on "dry" patients, or vice versa.

It is at least more useful information than someone's star sign, in a medical context.

Old alchemical air/water/fire/earth systems are also like this. "air-ish" substances tend to have a lower density.

These sorts of systems are a rough attempt at a principal component analysis on the superficial characteristics.

And the Five Factor model of personality is another example of such a system.
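To make the principal component analysis comparison above a bit more concrete, here is a minimal sketch on made-up data. Three superficial measurements are all driven by one hidden factor plus noise, and power iteration on the covariance matrix recovers the direction of that factor. The loadings, the noise level and the names `LOADINGS`, `make_toy_data` and `first_principal_component` are all invented for illustration.

```python
import random

# ASSUMED loadings: how strongly each of three superficial measurements
# depends on the single hidden factor. Purely illustrative numbers.
LOADINGS = [0.8, 0.5, -0.3]


def make_toy_data(n_rows, seed=0, noise_sd=0.2):
    """Generate synthetic rows: loading * hidden factor + independent noise."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        hidden = rng.gauss(0.0, 1.0)
        rows.append([l * hidden + rng.gauss(0.0, noise_sd) for l in LOADINGS])
    return rows


def first_principal_component(rows, iterations=200):
    """Return the unit vector of largest variance, found by power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d
    for _ in range(iterations):
        # Multiply by the covariance matrix, then renormalize.
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


if __name__ == "__main__":
    pc = first_principal_component(make_toy_data(500))
    print("first principal component:", [round(x, 2) for x in pc])
```

The recovered axis is a lossy one-number summary of each row, which is roughly the role a label like "wet" or "air-ish" plays: it says nothing about the underlying mechanism, but it still predicts the surface measurements better than chance.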

What’s the short timeline plan?

Donald Hobson3moΩ000

We really fully believe that we will build AGI by 2027, and we will enact your plan, but we aren’t willing to take more than a 3-month delay

Well I ask what they are doing to make AGI.

Maybe I look at their AI plan and go "eureka".

But if not.

Negative reinforcement by giving the AI large electric shocks when it gives a wrong answer. Hopefully big enough shocks to set the whole data center on fire. Implement a free bar for all their programmers, and encourage them to code while drunk. Add as many inscrutable bugs to the codebase as possible.

But, taking the question in the spirit it's meant in.

https://www.lesswrong.com/posts/zrxaihbHCgZpxuDJg/using-llm-s-for-ai-foundation-research-and-the-simple