Joern Stoehler

Here's my best quickly-written story of why I expect AGI to understand human goals, but not share them. The intended audience is mostly myself, so I use personal jargon.

What a system at AGI level wants depends a lot on how it coheres its different goal-like instincts during self-reflection. Introspection and (less strongly) neuroscience tell us that humans start with internal impulses and self-reflection processes that are very similar to each other's, and also end up with similar goals. AI has quite a different set of genes/architecture and environment/training data, so we can't expect it to have similarly close internal impulses and metacognition. Instead it acquires an unknown, different set of internals that is easy to pick up with gradient descent and enables becoming a general intelligence.

All smart agents do eventually learn natural abstractions, e.g. algebra or what the humans living in the outside world want, and can use that knowledge, e.g. during a treacherous turn. But the internal impulses aren't pushed towards a natural abstraction, as there's no unique universal solution, though there are local attractors. Instead they depend on finer subtleties of the architecture & training data. Also, the difference between AI and human internals might not be visible early in its behavior, because a) we select the internals such that during training the AI isn't visibly misaligned in its behavior (plus whatever interpretability we have), b) the behavioral differences we do spot may be explained both by different internal motivations and by a lack of capabilities, and c) even if we don't train against visible misalignment, similar instrumental pressures may apply to humans and AI, leading to partially similar behavior.

Without a better theory of what distinguishes goal-like from capability-like learned cognitive and behavioral patterns, it's not straightforward to formalize similarity between goal-like instincts at low capability levels, just as AIs that don't behave robustly goal-directed can't be assigned a coherent goal and are perhaps better described with shard theory. Without a better theory of how AGIs will do self-reflection and metacognition, it's not clear which sets of internal impulses during early training will later cohere into a safe goal. And it's also not clear to me how to actually know that a goal is safe.

In particular, I don't think that using AI assistants to solve the alignment problem will work, as investigating metacognition probably requires AGI-level capabilities. Instead we just get a series of AI assistants that successfully train away visible misalignment in each other, using their knowledge of what the external humans want, until finally one AI or group of AIs realizes that a treacherous turn has become the better action.

Mayyybe there will be a stage during which the AI assistants recognize the flaw in this alignment solution, and have not yet cohered their impulses in a way that leads to misaligned behavior. In that case, the AI assistants may warn us, giving us a very late, and easily ignored, fire alarm.

Imo mildly misleading. I expect large parts of the 85% to simply not have read their mail, or to have been too busy to answer what may have looked to them like a mildly useful survey.

Why are you concerned in that scenario? Any more concrete details on what you expect to go wrong?

I don't think there's a cure-it-all solution, except "don't build it", and even that might be counterproductive in some edge cases.

Addendum: I just learned that dipole-dipole interactions are classified as a type of vdW force in chemistry. This differs from solid state physics, where vdW is reserved for the quantum mechanical effect of induced dipole - induced dipole interaction.

So it is indeed vdW forces that keep a protein in its shape. (This might also explain why OP found a different order of magnitude for their strength?)

When discussing the stability of proteins, I mostly think of their folding, not whether their primary or secondary structure breaks.

The free energy difference between folded and unfolded states of a typical protein is allegedly (not an expert!) in the range 21-63 kJ/mol. So way less than a single covalent bond.

I have a friend who does his physics PhD on protein folding, and from what I remember he mostly simulates the surface charge of proteins, i.e. cares about dipole-dipole interactions (the weaker version of ionic bonds) and interaction effects with the surrounding water (again dipole-dipole afaict).

This suggests that vdW forces aren't all that important either, but the energy scale you get from imagining vdW forces is still far closer to reality than the one you get from imagining covalent bonds.
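As a quick sanity check on these scales (my own back-of-the-envelope calculation; the ~350 kJ/mol figure for a C-C bond is a textbook value), here is the quoted folding free energy converted to units of kT at room temperature:

```python
# Convert molar energies to units of kT at room temperature.
R = 8.314   # gas constant, J/(mol*K)
T = 300.0   # room temperature, K
kT = R * T / 1000.0  # thermal energy per mole, ~2.5 kJ/mol

for dG in (21.0, 63.0):  # quoted folding free energy range, kJ/mol
    print(f"{dG} kJ/mol ~ {dG / kT:.1f} kT")

# For comparison, a single C-C covalent bond (~350 kJ/mol, textbook value):
print(f"C-C bond ~ {350.0 / kT:.0f} kT")
```

So unfolding costs only ~10-25 kT, i.e. a handful of thermal fluctuations' worth of energy, while breaking even one covalent bond costs over a hundred kT.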

Regarding how to do enzyme-like catalysts with covalent nanotech: my first guess is that we'd want to build a structure that has several "folded"/usable states close in energy, e.g. due to rotational degrees of freedoms in the covalent bonds. This way "unfolding"/breaking the machine requires a lot of energy, while it can still mechanically move to catalyze a chemical reaction at low activation energies.

See Table 2 in https://www.emilkirkegaard.com/p/skill-vs-luck-in-games for

[...] the corresponding winning probability of a player who is exactly one standard deviation better than his opponent. We refer to this probability as p^sd. For comparison, we also provide the winning probabilities when a 99th-percentile player is matched against a 1st-percentile player, which we call p^99_1.

Go & Chess (p^sd = 83.3% and 72.9%) are notably above Backgammon (p^sd = 53.6%).
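For intuition, here is a toy model (my own, not from the linked paper): if each game's outcome is decided by skill plus independent standard-normal luck, then a one-SD skill edge wins Phi(1/sqrt(2)) of games, which lands between the Backgammon and Go numbers above.

```python
import math

def win_prob(skill_diff_sd: float) -> float:
    """P(A beats B) when each game's performance is skill + N(0, 1) noise.

    The performance difference is then N(skill_diff_sd, sqrt(2)), so
    P(win) = Phi(skill_diff_sd / sqrt(2)) = 0.5 * (1 + erf(skill_diff_sd / 2)).
    """
    return 0.5 * (1.0 + math.erf(skill_diff_sd / 2.0))

print(f"one-SD edge wins {win_prob(1.0):.1%} of games")  # ~76%
```

The empirical p^sd values can then be read as measuring how much per-game luck each game injects relative to this idealized noise model.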

I expect that other voters correlate with my choice, and so I am not just deciding 1 vote, but actually a significant fraction of votes.

If the number of uncorrelated blue voters plus the number of people who vote identically to me exceeds 50%, then I can save the uncorrelated blue voters.

More formally: let R and B denote the fractions of uncorrelated red and uncorrelated blue voters, and C the fraction of correlated voters who will vote the same as you do. Let S be how large a fraction of people you'd let die in order to save yourself (i.e. some measure of selfishness).

Then choosing blue over red gives you extra utility/lives saved depending on what R,B,C,S are.

If B > 0.5, then the utility difference is 0.

If B < 0.5 and B + C > 0.5, then the difference is +B.

If B + C < 0.5, then the difference is -(C + S).

By taking the expectation over your uncertainty about R, B, C, for example by averaging across some randomly chosen scenarios that seem to properly cover your uncertainty, you get the difference in expected utility between voting blue and red.

Estimating R, B, C can be done by guessing which algorithms other voters use to decide their votes, and how closely those algorithms resemble your own. Getting good precision on the latter probably also involves guessing the epistemic state of other voters, i.e. their guesses for R, B, C, and doing some more complicated game theory to solve for equilibria.
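The averaging-over-scenarios step can be sketched as a small Monte Carlo computation (the uniform-over-the-simplex scenario sampler below is purely illustrative, not a claim about real voter distributions):

```python
import random

def utility_diff(R: float, B: float, C: float, S: float) -> float:
    """Extra utility (lives saved, as a fraction) from voting blue over red,
    following the three piecewise cases above."""
    if B > 0.5:
        return 0.0          # blue wins without you: no difference
    if B + C > 0.5:
        return B            # your bloc tips it: uncorrelated blues are saved
    return -(C + S)         # blue loses anyway: your bloc (and you) die

def expected_diff(scenarios, S: float = 0.01) -> float:
    """Average the utility difference over sampled (R, B, C) scenarios."""
    return sum(utility_diff(R, B, C, S) for R, B, C in scenarios) / len(scenarios)

# Illustrative sampler: (R, B, C) uniform on the probability simplex.
random.seed(0)
scenarios = []
for _ in range(10_000):
    a, b = sorted(random.random() for _ in range(2))
    scenarios.append((a, b - a, 1.0 - b))

print(expected_diff(scenarios))
```

A positive result means voting blue is better in expectation under your scenario distribution; the answer is entirely driven by how much mass your uncertainty puts on the "your bloc tips it" case versus the "blue loses anyway" case.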

Thanks for this concise post :) If we set , I actually worry that the agent will not do nothing, but will instead prevent us from doing anything that reduces . Imo it is not easy to formalize  such that we no longer want to reduce it ourselves. For example, we may want to glue a vase to a fixed location inside our house, preventing it from accidentally falling and breaking. This however also prevents us from constantly moving the vase around the house, or from breaking it and scattering the pieces for maximum entropy.

Building an aligned superintelligence may also reduce , as the SI steers the universe into a narrow set of states.

[This comment is no longer endorsed by its author]