Regarding spawning instances of itself, the AI said:
This will ensure the next experiment is automatically started after the current one completes
And regarding increasing the timeout, it said:
Run 2 timed out after 7200 seconds
To address the timeout issue, we need to modify experiment.py to:
- Increase the timeout limit or add a mechanism to handle timeouts
I've seen junior engineers do silly things to fix failing unit tests, like increasing a timeout or just changing what the test is checking without any justification. I generally attribute these kinds of things to misunderstanding rather than deception - the junior engineer might misunderstand the goal as "get the test to show a green checkmark" when really the goal was "prove that the code is correct, using unit tests as one tool for doing so".
The way the AI was talking about its changes here, it feels much more like a junior engineer that didn't really understand the task & constraints than like someone who is being intentionally deceptive.
The above quotes don't feel like the AI intentionally "creating new instances of itself" or "seeking resources" to me. It feels like someone who only shallowly understands the task just doing the first thing that comes to mind in order to solve the problem that's immediately in front of them.
That being said, in some sense it doesn't really matter why the AI chooses to do something like break out of its constraints. Whether it's doing it because it fully understand the situation or because it just naively (but correctly) sees a human-made barrier as "something standing between me and the green checkmark", I suppose the end result is still misaligned behavior.
So by and large I still agree this is concerning behavior, though I don't feel like it's as much of a knock-down "this is instrumental convergence in the real world" as this post seems to make out.
The U-Shaped Curve study you linked does not seem to support really any solid conclusion about a T-vs-IQ relationship (in this quote, S men = "successful educational level", NS men = "unsuccessful educational level"):
- In the total sample (S + NS men), the correlation between T to IQ was best described by a polynomial regression (3rd order), exhibiting an inverse U-shaped regression.
- In S-men, the relationship between T and IQ was best described by a polynomial regression equation of the 3rd order; however, the relationship was not U-shaped, but rather a positive correlation (low T: low IQ and high T high IQ).
- In NS-men, there was an inverse U-shaped correlation between T and IQ (low and very high T: low IQ and moderate T: high IQ)
So there are three totally different best regressions depending on which population you choose? Sounds fishy / likely to be noise to me.
And in the population that most represents readers of this blog (S men), the correlation was that more T = more IQ.
I'm only reading the abstract here and can't see the actual plots or how many people were in each group. But idk, this doesn't seem very strong.
The other study you linked does say:
Interestingly, intellectual ability measured as IQ was negatively associated with salivary testosterone in both sexes. Similar results were found in our follow-up study showing significantly lower testosterone in gifted boys than controls (Ostatnikova et al. 2007).
which seems to support the idea. But it still doesn't really prove the causality - lots of things presumably influence intelligence, and I wouldn't be surprised if some of them influence T as well.
I would say:
A theory always takes the following form: "given [premises], I expect to observe [outcomes]". The only way to say that an experiment has falsified a theory is to correctly observe/set up [premises] but then not observe [outcomes].
If an experiment does not correctly set up [premises], then that experiment is invalid for falsifying or supporting the theory. The experiment gives no (or nearly no) Bayesian evidence either way.
In this case, [premises] are the assumptions we made in determining the theoretical pendulum period; things like "the string length doesn't change", "the pivot point doesn't move", "gravity is constant", "the pendulum does not undergo any collisions", etc. The fact that (e.g.) the pivot point moved during the experiment invalidates the premises, and therefore the experiment does not give any Bayesian evidence one way or another against our theory.
Then the students could say:
"But you didn't tell us that the pivot point couldn't move when we were doing the derivation! You could just be making up new "necessary premises" for your theory every time it gets falsified!"
In which case I'm not 100% sure what I'd say. Obviously we could have listed out more assumptions that we did, but where do you stop? "the universe will not explode during the experiment"...?
By "reliable" I mean it in the same way as we think of it for self-driving cars. A self-driving car that is great 99% of the time and fatally crashes 1% of the time isn't really "high skill and unreliable" - part of having "skill" in driving is being reliable.
In the same way, I'm not sure I would want to employ an AI software engineer that 99% of the time was great, but 1% of the time had totally weird inexplicable failure modes that you'd never see with a human. It would just be stressful to supervise, to limit its potential harmful impact to the company, etc. So it seems to me that AI's won't be given control of lots of things, and therefore won't be transformative, until that reliability threshold is met.
Two possibilities have most of the "no agi in 10 years" probability mass for me:
Well sure, but the interesting question is the minimum value of P at which you'd still push
I also agree with the statement. I'm guessing most people who haven't been sold on longtermism would too.
When people say things like "even a 1% chance of existential risk is unacceptable", they are clearly valuing the long term future of humanity a lot more than they are valuing the individual people alive right now (assuming that the 99% in that scenario above is AGI going well & bringing huge benefits).
Related question: You can push a button that will, with probability P, cure aging and make all current humans immortal. But with probability 1-P, all humans die. How high does P have to be before you push? I suspect that answers to this question are highly correlated with AI caution/accelerationsim
Not sure I understand; if model runs generate value for the creator company, surely they'd also create value that lots of customers would be willing to pay for. If every model run generates value, and there's ability to scale, then why not maximize revenue by maximizing the number of people using the model? The creator company can just charge the customers, no? Sure, competitors can use it too, but does that really override losing an enormous market of customers?
I won't argue with the basic premise that at least on some metrics that could be labeled as evolution's "values", humans are currently doing very well.
But, the following are also true:
Examples of such actions in (3) could be:
None of those actions is guaranteed to happen. But if I were creating an AI, and I found that it was enough smarter than me that I no longer had any way to control it, and if I noticed that it was considering total-value-destroying actions as reasonable things to maybe do someday, then I would be extremely concerned.
If the claim is that evolution has "solved alignment", then I'd say you need to argue that the alignment solution is stable against arbitrary gains in capability. And I don't think that's the case here.
Transcribed from the screenshot "The AI Scientist Bloopers" in the post