A bunch of my response to shard theory is a generalization of how niceness is unnatural. In a similar fashion, the other “shards” that the shard theory folk want to learn are unnatural too.
That said, I'll spend a few extra words responding to the admirably-concrete diamond maximizer proposal that TurnTrout recently published, on the theory that briefly gesturing at my beliefs is better than saying nothing.
I’ll be focusing on the diamond maximizer plan, though this criticism can be generalized and applied more broadly to shard theory.
- The first “problem” with this plan is that you don't get an AGI this way. You get an unintelligent robot that steers towards diamonds. If you keep trying to have the training be about diamonds, it never particularly learns to think. When you compromise and start putting it in environments where it needs to be able to think to succeed, then your new reward-signals end up promoting all sorts of internal goals that aren't particularly about diamond, but are instead about understanding the world and/or making efficient use of internal memory and/or suchlike.
- Separately, insofar as you were able to get some sort of internalized diamond-ish goal, if you're not really careful then you end up getting lots of subgoals such as ones about glittering things, and stones cut in stylized ways, and proximity to diamond rather than presence of diamond, and so on and so forth.
- Furthermore, once you get it to be smart, all of those little correlates-of-training-objectives that it latched onto in order to have a gradient up to general intelligence, blow the whole plan sky-high once it starts to reflect.
What the AI's shards become under reflection is very sensitive to the ways it resolves internal conflicts. For instance, in humans, many of our values trigger only in a narrow range of situations (e.g., people care about people enough that they probably can't psychologically murder a hundred thousand people in a row, but they can still drop a nuke), and whether we resolve that as "I should care about people even if they're not right in front of me" or "I shouldn't care about people any more than I would if the scenario was abstracted" depends quite a bit on the ways that reflection resolves inconsistencies.
Or consider the conflict "I really enjoy dunking on the outgroup (but have some niggling sense of unease about this)" — we can't conclude from the fact that the enjoyment of dunking is loud, whereas the niggling doubt is quiet, that the dunking-on-the-outgroup value will be the one left standing after reflection.
As far as I can tell, the "reflection" section of TurnTrout’s essay says ~nothing that addresses this, and amounts to "the agent will become able to tell that it has shards". OK, sure, it has shards, but only some of them are diamond-related, and many others are cognition-related or suchlike. I don't see any argument that reflection will result in the AI settling at "maximize diamond" in-particular.
Finally, I'll note that the diamond maximization problem is not in fact the problem "build an AI that makes a little diamond", nor even "build an AI that probably makes a decent amount of diamond, while also spending lots of other resources on lots of other stuff" (although the latter is more progress than the former). The diamond maximization problem (as originally posed by MIRI folk) is a challenge of building an AI that definitely optimizes for a particular simple thing, on the theory that if we knew how to do that (in unrealistically simplified models, allowing for implausible amounts of (hyper)computation) then we would have learned something significant about how to point cognition at targets in general.
TurnTrout’s proposal seems to me to be basically "train it around diamonds, do some reward-shaping, and hope that at least some care-about-diamonds makes it across the gap". I doubt this works (because the optimum of the shattered correlates of the training objectives that it gets are likely to involve tiling the universe with something that isn't actually diamond, even if you're lucky-enough that it got a diamond-shard at all, which is dubious), but even if it works a little, it doesn't seem to me to be teaching us any of the insights that would be possessed by someone who knew how to robustly aim an idealized unbounded (or even hypercomputing) cognitive system in theory.
For any AI that has an LLM as a component of it, I don't believe diamond-maximization is a hard problem, apart from Inner Alignment problems. The LLM knows the meaning of the word 'diamond' (GPT-4 defined it as "Diamond is a solid form of the element carbon with its atoms arranged in a crystal structure called diamond cubic. It has the highest hardness and thermal conductivity of any natural material, properties that are utilized in major industrial applications such as cutting and polishing tools. Diamond also has high optical dispersion, making it useful in jewelry as a gemstone that can scatter light in a spectrum of colors."). The LLM also knows its physical and optical properties, its social, industrial and financial value, its crystal structure (with images and angles and coordinates), what carbon is, its chemical properties, how many electrons, protons and neutrons a carbon atom can have, its terrestrial isotopic ratios, the half-life of carbon-14, what quarks a neutron is made of, etc. etc. etc. — where it fits in a vast network of facts about the world. Even if the AI also had some other very different internal world model and ontology, there's only going to be one "Rosetta Stone" optimal-fit mapping between the human ontology that the LLM has a vast amount of information about and any other arbitrary ontology, so there's more than enough information in that network of relationships to uniquely locate the concepts in that other ontology corresponding to 'diamond'. This is still true even if the other ontology is larger and more sophisticated: for example, locating Newtonian physics in relativistic quantum field theory and mapping a setup from the former to the latter isn't hard: its structure is very clearly just the large-scale low-speed limiting approximation.
The point where this gets a little more challenging is Outer Alignment, where you want to write a mathematical or pseudocode reward function for training a diamond optimizer using Reinforcement Learning (assuming our AI doesn't just have a terminal goal utility function slot that we can directly connect this function to, like AIXI): then you need to also locate the concepts in the other ontology for each element in something along the lines of "pessimizingly estimate the total number of moles of diamond (having at a millimeter-scale-average any isotopic ratio of C-12 to C-13 but no more than N1 times the average terrestrial proportion of C-14, discounting any carbon atoms within N2 C-C bonds of a crystal-structure boundary, or within N3 bonds of a crystal -structure dislocation, or within N4 bonds of a lattice substitution or vacancy, etc. …) at the present timeslice in your current rest frame inside the region of space within the future-directed causal lightcone of your creation, and subtract the answer for the same calculation in a counterfactual alternative world-history where you had permanently shut down immediately upon being created, but the world-history was otherwise unchanged apart from future causal consequences of that divergence". [Obviously this is a specification design problem, and the example specification above may still have bugs and/or omissions, but there will only be a finite number of these, and debugging this is an achievable goal, especially if you have a crystalographer, a geologist, and a jeweler helping you, and if a non-diamond-maximizing AI also helps by asking you probing questions. There are people whose jobs involve writing specifications like this, including in situations with opposing optimization pressure.]
As mentioned above, I fully acknowledge that this still leaves the usual Inner Alignment problems unsolved: applying Reinforcement Learning (or something similar such as Direct Preference Optimization) with this reward function to our AI, then how do we ensure that it actually becomes a diamond maximizer, rather than a biased estimator of diamond? I suspect we might want to look at some form of GAN, where the reward-estimating circuitry it not part of the Reinforcement Learning process, but is being trained in some other way. That still leaves the Inner Alignment problem of training a diamond maximizer instead of a hacker of reward model estimators.
In Shard Theory terms, if we reinforcement train the AI such that it has the reward-equivalent of an orgasm every time it creates a carat of diamond, show it a way to synthesize diamond and give it a taste of the effects, then if it didn't previously have a diamond shard, it's soon going to develop one.