It was a relatively fringe topic that only recently got the attention of a large number of real researchers. And parts of it could need amounts of computational power afforded only by superhuman narrow AI.
There have been a few random PhD dissertations saying the topic is hard, but as far as I can tell there has only recently been a push for a group effort by capable and well-funded actors (e.g. OpenAI’s interpretability research).
I don’t trust older alignment research much, as an outsider. It seems to me that Yud has built a cult of personality around AI doom and is thus motivated to find reasons why alignment isn’t possible. And most of his followers treat his initial ideas as axiomatic principles and don’t dare to challenge them. And lastly, most past alignment research seems to have been done by those followers.
Unfortunately, we do not have the luxury of experimenting with dangerous AI systems to see whether they cause human extinction or not. When it comes to extinction, we do not get another chance to test.
For example, this is an argument that has been disputed with varying degrees of persuasiveness (warning shots, the incomputability of most dangerous plans), but it is still treated as a fundamental truth on this site.
Could this translate to agents having difficulty predicting other agents’ values and reactions, leading to a lower likelihood of multi-agent systems acting as one?
And, sure, but it's not clear why any of this matters? What is the thing that we're going to (attempt to) do with AI, if not use it to solve real-world problems?
It matters because the original poster isn’t saying we won’t use it to solve real-world problems, but rather that real-world constraints (e.g. the laws of physics) will limit its speed of advancement.
An AI likely cannot easily predict a chaotic system unless it can simulate reality at high fidelity. I guess OP is assuming the TAI won’t have this capability, so even if we do solve real-world problems with AI, it is still limited by real-world experimentation requirements.
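Purely as a toy illustration of the chaos point (nothing specific to TAI; the logistic map is my own stand-in for "a chaotic system"), here is a minimal sketch of how fast prediction degrades when the initial state is known imperfectly:

```python
# Toy illustration of sensitivity to initial conditions: two trajectories of the
# logistic map that start one part in ten billion apart diverge completely within
# a few dozen steps, so a predictor needs near-perfect knowledge of the state.

def logistic_map(x, r=4.0):
    return r * x * (1.0 - x)

a, b = 0.4, 0.4 + 1e-10
for step in range(1, 61):
    a, b = logistic_map(a), logistic_map(b)
    if step % 10 == 0:
        print(f"step {step:2d}: |difference| = {abs(a - b):.3e}")
```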
“I’d better include words about the appropriate amount of green-coloured objects”, and write about green-coloured objects even more frequently, and then also notice that, and in the end, write exclusively about green objects.
Can you explain this logic to me? Why would it write more and more about green-coloured objects even if its training data was biased towards them? If there is a bad trend in its output, then without reinforcement, why would it make that trend stronger? Do you mean it incorrectly concludes that amplifying said bad trend is good because it works in the short term but not in the long term? (See the toy sketch below.)
Could we not align the AI to realize there could be limits on such trends? What if there is gradual misalignment that gets noticed by the aligner and corrected? The only way for this failure mode to evade some sort of continuous alignment system is if it fails catastrophically before continuous alignment detects it.
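To make both the worry and the proposed fix concrete, here is a minimal toy sketch (my own construction with made-up numbers, not anything from the original post): a "model" that over-responds to the frequency of green-object content in its own recent outputs, so a small initial bias amplifies itself, plus a crude continuous-alignment monitor that corrects the drift once it crosses a threshold:

```python
import random

# Hypothetical feedback loop: the model's tendency to write about green objects
# is updated from the frequency it observes in its own recent outputs; the
# over-response factor (1.1) is the assumption doing the amplification here.
# A crude "continuous alignment" monitor corrects the drift past a threshold.

TARGET = 0.30      # intended fraction of green-object content
THRESHOLD = 0.15   # monitor triggers if the tendency drifts this far from target

p_green = 0.35     # slight initial bias above the target
for step in range(30):
    outputs = [random.random() < p_green for _ in range(500)]  # one batch of outputs
    observed = sum(outputs) / len(outputs)

    # Self-conditioning with over-response: "I noticed a lot of green, better write more."
    p_green = min(1.0, 1.1 * observed)

    # Continuous alignment check: notice the drift and pull it back.
    if abs(p_green - TARGET) > THRESHOLD:
        print(f"step {step}: drift detected (p_green = {p_green:.2f}), resetting")
        p_green = TARGET
```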
Consider it inductively: we start off with an aligned model that can improve itself. The aligned model, if properly aligned, will make sure its new version is also properly aligned, and won't improve itself if it is unable to do so.
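A minimal sketch of that inductive structure, with hypothetical names, and assuming the hard part (a reliable alignment check) already exists:

```python
# Sketch of the inductive argument. Both `improve` and `is_aligned` are
# hypothetical stand-ins; a trustworthy is_aligned check is exactly the open problem.

def self_improvement_chain(model, generations, improve, is_aligned):
    """Promote a successor only when the current model can verify its alignment."""
    assert is_aligned(model), "base case: the starting model must be aligned"
    for _ in range(generations):
        candidate = improve(model)
        if not is_aligned(candidate):
            break  # inductive step fails: refuse to self-improve rather than risk it
        model = candidate
    return model
```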
I think my point is to lower the bar to there being a non-trivial probability of it following the rule. Fully aligning AIs to near certainty may be a higher bar than just potentially aligning them.
Alignment with arbitrary values, without the possibility of inner deception. If it is easy to verify the values of an agent to near certainty, it seems to follow that we can more or less bootstrap alignment, with weaker agents inductively aligning stronger agents.
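One way to see why "near certainty" rather than just a non-trivial probability matters for that bootstrap: if each weaker-aligns-stronger step succeeds independently with probability p, the whole n-step chain holds with probability p^n, which decays quickly. A back-of-the-envelope check with made-up numbers:

```python
# Back-of-the-envelope: probability an n-step alignment bootstrap chain survives,
# assuming each step succeeds independently with probability p (made-up numbers;
# real steps are neither independent nor identically reliable).

for p in (0.9, 0.99, 0.999):
    for n in (10, 100, 1000):
        print(f"p = {p}, n = {n}: chain holds with probability {p ** n:.4f}")
```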