I think the answer to the question of how well realistic NN-like systems with finite compute approximate the results of hypothetical utility maximizers with infinite compute is "not very well at all".
So the MIRI train of thought, as I understand it, goes something like
argmax over them to pick the best one, which results in the best outcome according to its utility function.

I think the last few years in ML have made points 2 and 5 look particularly shaky here. For example, the actual architecture of SOTA chess-playing systems doesn't particularly resemble a cheaper version of the optimal-with-infinite-compute approach of "minimax over tree search", but instead seems to be a different thing: "pile a bunch of situation-specific heuristics on top of each other, and then tweak the heuristics based on how well they do in practice".
Which, for me at least, suggests that looking at what the optimal-with-infinite-compute system would do might not be very informative about what the actual systems that show up in the next few decades will do.
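To caricature the "pile of tweakable heuristics" shape in code (a toy sketch of my own; the features, weights, and update rule are all made up for illustration and don't come from any real engine):

```javascript
// Toy evaluator: a weighted sum of hand-written, situation-specific features,
// with weights nudged toward whatever predicts game results better.
const features = {
  material: pos => pos.myPieces - pos.theirPieces,
  mobility: pos => pos.myMoves - pos.theirMoves,
};
const weights = { material: 1.0, mobility: 0.1 };

// "Pile of heuristics": the position's score is just the weighted feature sum.
function evaluate(pos) {
  let score = 0;
  for (const name of Object.keys(features)) {
    score += weights[name] * features[name](pos);
  }
  return score;
}

// "Tweak based on how well they do in practice": crude gradient-style update
// of the weights toward an observed game result.
function tune(pos, result, lr = 0.01) {
  const err = result - evaluate(pos);
  for (const name of Object.keys(features)) {
    weights[name] += lr * err * features[name](pos);
  }
}
```

The point of the sketch is that nothing in it approximates minimax over a game tree; it's a learned scoring function whose shape is driven by what worked, not by the infinite-compute ideal.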
Can you give a concrete example of a safety property of the sort that you are envisioning automated testing for? Or am I misunderstanding what you're hoping to see?
For example, a human can, to an extent, inspect what they are going to say before they say or write it. Before saying that Gary Marcus was "inspired by his pet chicken, Henrietta", a human may temporarily store the next words they plan to say elsewhere in the brain and evaluate them.
Transformer-based LLMs also internally represent the tokens they are likely to emit in future steps. This is demonstrated rigorously in Future Lens: Anticipating Subsequent Tokens from a Single Hidden State, though perhaps the simpler demonstration is that LLMs can reliably complete the sentence "Alice likes apples, Bob likes bananas, and Aaron likes apricots, so when I went to the store I bought Alice an apple and I got [Bob/Aaron]" with the appropriate "a"/"an" token.
I think the answer pretty much has to be "yes", for the following reasons.
During the 2000 election, in Okaloosa County, Florida (at the western tip of the panhandle), 71k of the county's 171k residents voted, with 52,186 votes going to Bush and 16,989 votes going to Gore, for a 42% turnout rate.
On the day of November 7, 2000, there was no significant rainfall in Pensacola (which is the closest weather station I could find with records going back that far). A storm which dropped 2 inches of rain on the tip of the Florida panhandle that day would have reduced voter turnout by 1.8%,[1] which would have resulted in a margin that leaned 634 votes closer to Gore. Which would have tipped Florida, which would in turn have tipped the election.
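The 634-vote figure is my own back-of-envelope arithmetic, which can be reproduced directly (the 0.9%-per-inch turnout effect is from the study in [1]; the assumption that the voters kept home would split like the county as a whole is mine):

```javascript
// Reproduce the estimated margin shift from a hypothetical 2-inch storm.
const totalVotes = 71000;            // approximate Okaloosa County turnout, 2000
const bush = 52186, gore = 16989;    // actual vote counts
const excessRainInches = 2;          // the hypothetical storm
const turnoutDrop = 0.009 * excessRainInches; // 0.9% per inch of excess rain

// Votes lost to rain, assumed to split in the county's actual proportions,
// so the net margin shift scales with the Bush-minus-Gore vote share gap.
const lostVotes = totalVotes * turnoutDrop;
const marginShift = lostVotes * (bush - gore) / totalVotes;
console.log(Math.round(marginShift)); // 634
```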
Now, November is the "dry" season in Florida, so heavy rains like that are not incredibly common. Still, they can happen. For example, on 2015-11-02, 2.34 inches of rain fell.[2] That was only one day, out of the 140 days I looked at, which would have flipped the 2000 election, and the 2000 election was, to my knowledge, the closest of the 59 US presidential elections so far. Still, there are a number of other tracks that a storm could have taken, which would also have flipped the 2000 election.[3] And in the 1976 election, somewhat worse weather in the great lakes region would likely have flipped Ohio and Wisconsin, where Carter beat Ford by narrow margins.[4]
So I think "weather, on election day specifically, flips the 2028 election in a way that cannot be foreseen now" is already well over 0.1%. And that's not even getting into other weather stuff like "how many hurricanes hit the gulf coast in 2028, and where exactly do they land?".
"The results indicate that if a county experiences an inch of rain more than what is normal for the county for that election date, the percentage of the voting age population that turns out to vote decreases by approximately .9%.".
I pulled the weather for the week before and the week after November 7 for each of the past 10 years from the weather.com API, and that was the highest-rainfall date.
var precipByDate = {};
for (var y = 2014; y < 2024; y++) {
  // Pull hourly observations for Nov 1–14 of each year at KPNS (Pensacola).
  var res = await fetch('https://api.weather.com/v1/location/KPNS:9:US/observations/historical.json?apiKey=<redacted>&units=e&startDate=' + y + '1101&endDate=' + y + '1114').then(r => r.json());
  res.observations.forEach(obs => {
    // Bucket each observation's precipitation total by calendar date.
    var d = new Date(obs.valid_time_gmt * 1000);
    var ds = d.getFullYear() + '-' + (d.getMonth() + 1) + '-' + d.getDate();
    if (!(ds in precipByDate)) { precipByDate[ds] = 0; }
    if (obs.precip_total) { precipByDate[ds] += obs.precip_total; }
  });
}
// Date with the highest accumulated precipitation in the window:
Object.entries(precipByDate).sort((a, b) => b[1] - a[1])[0]
Looking at the 2000 election map in Florida, any good thunderstorm in the panhandle, in the northeast corner of the state, or over the west-central to southwest part of the peninsula would have done the trick.
https://en.wikipedia.org/wiki/1976_United_States_presidential_election -- Carter won Ohio and Wisconsin by 11k and 35k votes, respectively.
Also "provably safe" is a property a system can have relative to a specific threat model. Many vulnerabilities come from the engineer having an incomplete or incorrect threat model, though (most obviously the multitude of types of side-channel attack).
Counterpoint: Sydney (Microsoft's Bing chatbot) was wildly unaligned, to the extent that it is even possible for an LLM to be aligned, and people thought it was cute / cool.
The two examples everyone loves to use to demonstrate that massive top-down engineering projects can sometimes be a viable alternative to iterative design (the Manhattan Project and the Apollo Program) were both government-led initiatives, rather than single very smart people working alone in their garages. I think it's reasonable to conclude that governments have considerably more capacity to steer outcomes than individuals, and are the most powerful optimizers that exist at this time.
I think restricting the term "superintelligence" to "only that which can create functional self-replicators with nano-scale components" is misleading. Concretely, that definition of "superintelligence" says that natural selection is superintelligent, while the most capable groups of humans are nowhere close, even with computerized tooling.
Looking at the AlphaZero paper:

Our new method uses a deep neural network fθ with parameters θ. This neural network takes as an input the raw board representation s of the position and its history, and outputs both move probabilities and a value, (p, v) = fθ(s). The vector of move probabilities p represents the probability of selecting each move a (including pass), pa = Pr(a|s). The value v is a scalar evaluation, estimating the probability of the current player winning from position s. This neural network combines the roles of both policy network and value network into a single architecture. The neural network consists of many residual blocks of convolutional layers with batch normalization and rectifier nonlinearities (see Methods).
So if I'm interpreting that correctly, the NN is used both for position evaluation (the value head) and for guiding the search (the policy head supplies move priors to the tree search).
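To make "the NN guides the search" concrete: in AlphaZero-style MCTS, the policy head's prior P(s,a) enters directly into the child-selection rule. The PUCT formula below is the one the papers describe; the data structure and the constant are illustrative assumptions of mine:

```javascript
// PUCT selection: score = Q(s,a) + cPuct * P(s,a) * sqrt(N(s)) / (1 + N(s,a)).
// The policy head's prior P biases search toward moves the network likes,
// even before those moves have been visited at all.
function selectChild(children, cPuct = 1.5) {
  // children: [{ prior, value, visits }] — prior P(s,a) from the policy head,
  // value = mean backed-up evaluation Q(s,a), visits = N(s,a).
  const totalVisits = children.reduce((sum, c) => sum + c.visits, 0);
  let best = null;
  let bestScore = -Infinity;
  for (const c of children) {
    const u = cPuct * c.prior * Math.sqrt(totalVisits) / (1 + c.visits);
    const score = c.value + u;
    if (score > bestScore) { bestScore = score; best = c; }
  }
  return best;
}

// An unvisited move with a decent prior outranks well-explored alternatives:
const picked = selectChild([
  { prior: 0.6, value: 0.0, visits: 10 },
  { prior: 0.1, value: 0.5, visits: 10 },
  { prior: 0.3, value: 0.1, visits: 0 },
]);
console.log(picked.prior); // 0.3
```

So the "search" isn't a separate hand-coded module that merely calls the NN for leaf evaluations; the network's move probabilities shape which branches get explored at every step.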
I think you get very different answers depending on whether your question is "what is an example of a policy that makes it illegal in the United States to do research with the explicit intent of creating AGI" or whether it is "what is an example of a policy that results in nobody, including intelligence agencies, doing AI research that could lead to AGI, anywhere in the world".
For the former, something like updates to export administration regulations could maybe make it de facto illegal to develop AI aimed at the international market. Historically, that approach succeeded, for a while, in making it illegal to intentionally export software which implemented strong encryption. It didn't actually prevent the export, but it did arguably make that export unlawful. I'd recommend reading that article in full, actually, to get an idea of how "what the law says" and "what ends up happening" can diverge.