Consider gravity on Earth. It seems to work every year. However, this fact alone is consistent with theories that gravity will stop working in 2025, 2026, 2027, 2028, 2029, 2030, and so on. There are infinitely many such theories, and only one theory states that gravity will work as an absolute rule.

It is true that we might infer from the simplest explanation that gravity holds as an absolute rule. However, the case is different with alignment. To ensure AI alignment, we must rule out, on the basis of time-and-situation-limited data, the possibility that an AI is following a misaligned rule rather than an aligned one.

While it may be safe, for all practical purposes, to assume that simpler explanations tend to be correct when it comes to nature, we cannot safely assume this for LLMs, because the learning algorithms programmed into them can have complex, unintended consequences for how an LLM will behave in the future, given the changing conditions it finds itself in.
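
To make the worry concrete, here is a minimal sketch (my own illustration, with made-up policy names): two rules that agree on every observation available so far but diverge later, so no finite record of past behaviour can distinguish them.

```rust
// Two hypothetical policies that agree on all data gathered before 2030
// and diverge afterwards. Any finite log of pre-2030 behaviour is
// consistent with both, which is the underdetermination problem above.
fn aligned_policy(_year: u32) -> &'static str {
    "cooperate"
}

fn misaligned_policy(year: u32) -> &'static str {
    if year < 2030 { "cooperate" } else { "defect" }
}

fn main() {
    for year in 2024..2035 {
        println!(
            "{year}: policies agree = {}",
            aligned_policy(year) == misaligned_policy(year)
        );
    }
}
```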

Doesn't this mean that it is impossible to achieve AI alignment even theoretically?

1 answer

Seth Herd

In the limit, yes, permanent alignment is impossible.

Anything permanent is probably technically impossible for the same reason.

Anything that can happen will eventually happen. That includes a godlike intelligence just screwing up and accidentally getting its goals changed even though it really wants not to change its goals.

But that could take much, much longer than the heat death of this universe as we currently understand it.

It's not a concern for alignment; it's a concern, possibly, some billions of years from now.

1 comment

Although the argument you outline might be an argument against ever fully trusting tests (usually called "evals" on this site) that this or that AI is aligned, alignment researchers have other tools in their toolbox besides running tests or evals.

It would take a long time to explain these tools, particularly to someone unfamiliar with software development or a related field like digital-electronics design. People make careers in studying tools to make reliable software systems (and reliable digital designs).

The space shuttle was steered by changing the direction in which its rocket nozzles pointed relative to the vehicle. If at any point in flight one of those nozzles had pointed a few degrees away from the direction it should have pointed in, the shuttle would have been lost and all aboard would have died. The pointing or aiming of these rocket nozzles was under software control. How did the programmers at NASA make this software reliable? Not merely through testing!

These programmers at NASA relied on their usual tools (basically an engineering-change-order culture) and did not need a more advanced tool called "formal verification". Intel, by contrast, turned to formal verification to make sure its microprocessors had no flaw necessitating another expensive recall, after spending 475 million dollars in 1994 on a famous product recall to fix the so-called Pentium FDIV bug.

Note that FDIV refers to division of (floating-point) numbers and that it is not possible in one human lifetime to test all possible dividends and divisors to ensure that a divider circuit is operating correctly. In other words, the "impossible even theoretically" argument you outline would have predicted that it is impossible to ensure the correct operation of even something as simple as a divider implemented in silicon, and yet in the 30 years since the 1994 recall Intel has avoided another major recall of any of its microprocessors caused by a mistake in their digital design.
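
A back-of-the-envelope calculation (my own, with an assumed test throughput) shows why exhaustive testing is hopeless here:

```rust
// Rough estimate of how long it would take to test every pair of
// 64-bit floating-point operands through a divider circuit.
// The per-second throughput is an assumed, generous figure.
fn main() {
    let pairs = 2f64.powi(128);        // about 3.4e38 operand pairs
    let per_second = 1e9_f64;          // assume a billion test divisions per second
    let seconds_per_year = 3.15e7_f64;
    let years = pairs / per_second / seconds_per_year;
    println!("roughly {:.1e} years of exhaustive testing", years);
}
```

Even at a billion divisions per second, enumerating every pair of 64-bit operands would take on the order of 10^22 years, which is why Intel needed a methodology other than testing.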

"Memory allocation" errors (e.g., use-after-free errors) are an important source of bugs and security vulnerabilities in software, and testing has for decades been an important way to find and eliminate these errors (Valgrind probably being the most well-known framework for doing the testing) but the celebrated new programming language Rust completely eliminates the need for testing for this class of programming errors. Rust replaces testing with a more reliable methodology making use of a tool called a "borrow checker". I am not asserting that a borrow checker will help people create an aligned super-intelligence: I am merely pointing at Rust and its borrow checker as an example of a methodology that is superior to testing for ensuring some desirable property (e.g., the absence of use-after-free errors that an attacker might be able to exploit) of a complex software artifact.

In summary, aligning a superhuman AI is humanly possible given sufficiently careful and talented people. The argument for stopping frontier AI research (or pausing it for 100 years) depends on considerations other than the "impossible even theoretically" argument you outline.

Methodologies that are superior to testing take time to develop. For example, the need for a better methodology to prevent "memory allocation" errors was recognized in the 1970s. Rust and its borrow checker are the result of a line of investigation inspired by a seminal paper published in 1987, yet Rust has become a realistic option for most programming projects only within the last 10 years or less. And an alignment methodology that continues to be reliable even when the AI becomes super-humanly capable is a much taller order than a methodology to prevent use-after-free errors and related memory-allocation errors.