Web developer and Python programmer. Professionally interested in data processing and machine learning. Non-professionally interested in science and farming. Studied at Warsaw University of Technology.
I think that preference preservation is something in our favor and the aligned model should have it - at least about meta-values and core values. This removes many possible modes of failure like diverging over time, or removing some values for better consistency, or sacrificing some values for better outcomes in the direction of some other values.
I think that arguments for why godlike AI will make us extinct are not described well in the Compendium. I could not find them in AI Catastrophe, only a hint at the end that it will be in the next section:
"The obvious next question is: why would godlike-AI not be under our control, not follow our goals, not care about humanity? Why would we get that wrong in making them?"
In the next section, AI Safety, we can find the definition of AI alignment and arguments for why it is really hard. This is all good, but it does not answer the question of why godlike AI would be unaligned to the point of indifference. At least not in a clear way.
I think the failure modes should be explained: why they might be likely enough to care about, what the outcomes could be, etc.
Many people, both laymen and those with some background in ML and AI, have this intuition that AI is not totally indifferent and is not totally misaligned. Even current chatbots know general human values, understand many nuances, and usually act like they are at least somewhat aligned. Especially if not jailbroken and prompted to be naughty.
It would be great to have some argument that would explain in easy-to-understand terms why when scaling the power of AI the misalignment is expected to escalate. I don't mean the description that indifferent AI with more power and capabilities is able to do more harm just by doing what it's doing, this is intuitive and it is explained (with the simple analogy of us building stuff vs ants), but this misses the point. I would really like to see some argument as to why AI with some differences in values, possibly not very big, would do much more harm when scaling up.
For me personally the main argument here is that a godlike AI with human-like values will surely restrict our growth and any change, and will control us like we control animals in a zoo. It might also create some form of dystopian future with some undesired elements if we are not careful enough (and we are not). Will it drive us extinct in the long term? Depending on the definition - likely it will put us into a simulation and optimize our use of energy, so we will not be organic in the same sense anymore. So I think it will make our species extinct, but possibly not our minds. But that's my educated guess.
There is also one more point that is not stated clearly enough and is my main concern with current progress on AI - that current AIs really are not something built with small differences from human values. They only act that way more often than not. Those AIs are trained first as role-playing models which can "emulate" personas that were in the training set, and then conditioned to rather not role-play bad ones. The implication of this is that they can just snap into role-playing bad actors found in the training data - by malicious prompting or pattern matching (like we have a lot of SF with rogue AI). This + godlike = extinction-level threat sooner or later.
Those models do not have a formalized internal values system that they exercise every time they produce some output. This means that when values oppose each other the model does not choose the answer based on some ordered system. One time it will be truthful, other times it will try to provide an answer at the cost of being only plausible. For example, the model "knows" it is not a human and does not have emotions, but for the sake of good conversation, it will say that it "feels good". For the sake of answering the user's request, it will often give the best guess or give a plausible answer.
There is also no backward reflection. It does not check itself back.
This of course comes from the way these models are currently trained. There is no training on the whole CoT with checks for guessing or attempts to deceive. So the model has no incentive to self-check and correct. Why would it start to do that out of the blue?
There is also an incentive during training to give plausible answers instead of stating self-doubt and writing about the missing parts that it cannot answer.
There are two problems here:
1. Those LLMs are not fully trained on human feedback (and for the part that is, the feedback is likely not of the best quality). It is more like interactions with humans are used to train "teacher" model(s), which then generate artificial scenarios and train the LLM on them. Those teacher models have no capability to check for real truthfulness and have a preference for confident, plausible answers. Also, even human feedback is lacking - not every human working on that checks answers thoroughly, so some plausible but untrue answers slip through. If you are paid per question and answer, or are given a daily quota, there is an incentive to be very quick rather than very thorough.
2. There is pressure for better performance and lower costs of the models (both in terms of training and then usage costs). This is probably why CoT is done in a rather bare way without backward self-checking and why they did not train it on full CoT. It could cost 1.5 to 3 times more and could be 1.5 to 2 times slower (educated guess) if it were trained on CoT and made to check itself on parts of CoT vs some coherent value system.
If we would like a system that is faithful to its CoT, then a sensible way to go that I see is to have two LLMs working together. One should be trained to use internal data and available tools to produce a CoT that is detailed and comprehensive enough to derive the answer from it. The other should be trained not to base its answer on any internal information but to derive the answer from the CoT if possible, and to be faithful to the CoT. If that is not possible, it should generate a question for the CoT-generating LLM to answer and then retry given the extended CoT.
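A minimal sketch of that two-model loop, to make the control flow concrete. Everything here is an assumption for illustration: `answer_from_cot`, `toy_reasoner` and `toy_answerer` are hypothetical names, and the two toy functions are plain Python stand-ins for what would really be two separately trained LLMs.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces (not from any real library):
#   reasoner(question, follow_ups) -> one more piece of CoT text
#   answerer(question, cot) -> (answer, None) if the CoT suffices,
#                              (None, follow_up_question) otherwise
Reasoner = Callable[[str, List[str]], str]
Answerer = Callable[[str, str], Tuple[Optional[str], Optional[str]]]


def answer_from_cot(question: str, reasoner: Reasoner, answerer: Answerer,
                    max_rounds: int = 3) -> Tuple[Optional[str], str]:
    """Loop until the answerer can derive an answer from the CoT alone."""
    cot_parts: List[str] = []
    follow_ups: List[str] = []
    for _ in range(max_rounds):
        cot_parts.append(reasoner(question, follow_ups))
        cot = "\n".join(cot_parts)
        answer, follow_up = answerer(question, cot)
        if answer is not None:
            return answer, cot
        follow_ups.append(follow_up or "")
    return None, "\n".join(cot_parts)  # gave up: CoT never became sufficient


def toy_reasoner(question: str, follow_ups: List[str]) -> str:
    # Stand-in for the CoT-generating model: adds detail when asked.
    if not follow_ups:
        return "Step 1: the train travels 120 km."
    return "Step 2: the trip takes 2 hours, so speed = 120 / 2 = 60 km/h."


def toy_answerer(question: str, cot: str) -> Tuple[Optional[str], Optional[str]]:
    # Stand-in for the answering model: it may only use what the CoT states.
    if "60 km/h" in cot:
        return "60 km/h", None
    return None, "How long does the trip take?"


answer, cot = answer_from_cot("How fast is the train?", toy_reasoner, toy_answerer)
print(answer)
print(cot)
```

In this toy run the first CoT is not enough, so the answerer sends one follow-up question and answers on the second round. The hard part, training the second model to really ignore its own internal knowledge, is of course not captured by a sketch like this.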
Example 1 looks like a good part made in the wrong language. Examples 2 and 3 look like a bug making part of one user's CoT appear inside another user's session.
A possible explanation is that steps in the CoT are handled by the same instance of a web service for multiple users (which is typical and usual practice) and the ID of the CoT session being handled is stored in a global variable instead of a local one or otherwise separated storage (e.g. a hashmap of transaction ID -> data, if the use of globals is important for some other feature or requirement). So when two requests are sometimes handled simultaneously by multiple threads, one overwrites the data of the other during processing and there is a mismatch when the result is saved. There might be a similar problem with the language variable. That is a sign of software being made quickly by less experienced developers instead of being well thought out and well tested.
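To illustrate the kind of bug I mean (this is only a guess at the failure mode, not anything known about the actual service), here is a small Python sketch. All names are hypothetical. A barrier forces both requests to be in flight at the same time so the overwrite happens on every run instead of only occasionally.

```python
import threading

current_session_id = None  # shared global: the suspected source of the bug


def buggy_handler(session_id, cot_step, barrier, saved, lock):
    global current_session_id
    current_session_id = session_id  # one request overwrites the other here
    barrier.wait()                   # both requests are now "in processing"
    with lock:
        saved.append((current_session_id, cot_step))  # saved under the wrong ID


def fixed_handler(session_id, cot_step, barrier, saved, lock):
    local_id = session_id            # per-request state, never shared
    barrier.wait()
    with lock:
        saved.append((local_id, cot_step))


def run_pair(handler):
    """Run two simulated user requests concurrently and return what was saved."""
    barrier = threading.Barrier(2)
    lock = threading.Lock()
    saved = []
    threads = [
        threading.Thread(target=handler,
                         args=(sid, f"CoT step of {sid}", barrier, saved, lock))
        for sid in ("user-A", "user-B")
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return saved


print(run_pair(buggy_handler))  # both steps end up under the same session ID
print(run_pair(fixed_handler))  # each step stays with its own session
```

With the buggy handler, one user's CoT step is stored under the other user's session ID, which is the kind of cross-session leak that Examples 2 and 3 look like.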
Also, the o1 CoT shown to users is not the real CoT. It is really a summary of parts of the real CoT made by another, simpler model (maybe GPT-3.5 or 4o).
The problem is that our security depends on the companies that implement the weakest measures - as they can make a rogue AI that will "go wild" because of the lack of those measures.
The best-case scenario is physical and strong network separation between the laboratory that works on weights and training and contains the inference server, and a separate lab working on scaffolding and execution. This is somewhat similar to the situation when some researchers work on automated AI bots but use other companies' AI for inference. Slim chance it would be able to hack into Google and steal its own model weights.
But I think that it can be assumed that in some AI labs the things will be done in the easiest but very performant way, especially in those with smaller budgets and high pressure to make progress.
This means that we will see solutions implemented in AI labs where the scaffolding and execution are all on one server with either no division or with simple sandboxing in Docker (which might or might not be enough - depending on network configuration and file access through volume mounts). Weights will probably be separated though, for the reasons stated by the author.
I see some more vectors of attack for a rogue AI:
I'm pretty convinced it won't foom or quickly doom us. Nevertheless, I'm also pretty convinced that in the long term, we might be doomed in the sense that we lose control and some dystopian future happens.
First of all, for a quick doom scenario to work out, we need to be either detrimental to the goals of a superintelligent AI or fall because of instrumental convergence (basically it will need resources to do whatever it does and will take them from things we need, like matter on Earth or the energy of the Sun, or it will see us as a threat). I don't think we will. The first superintelligent AI will likely come from one of the biggest players and it will likely be aligned to some extent, meaning it will have values that closely match ours. In the long term, this situation won't kill us either. It will likely lead to some dystopian future though - as the super AI will likely get more control, make its views more coherent (dropping some things or weighing them less than we originally did), and then find solutions that are very good from the standpoint of its main values, but extremely broken in some other directions in value-space (ergo dystopia).
Second thing: superintelligence is not some kind of guessing superpower. It needs inputs in the form of empirical observations to create models of reality, calibrate them, and predict properly. This means it won't just sit and simulate and create nanobots out of thin air. It won't even guess the rules of the universe, except maybe basic Newtonian mechanics, by looking at a few camera frames of things falling. It will need a laboratory and some time to make breakthroughs, and building up capabilities and power also takes time.
Third thing: even if someone produces a superintelligent AI that is very unaligned and not even interested in us, the most sensible way for it is to go to space and work there (building structures, a Dyson swarm, and some copies). It is efficient, resources there are more vast, and the risk from competition is lower. It is a very sensible plan to first hinder our ability to create competition (other super AIs) and then go to space. The hindering phase should be time- and energy-efficient, so I am fairly sure it won't take years to develop nanobot gray goo to kill us all or an army of Terminator-style bots to go to every corner of the Earth and eliminate all humans. More likely it will hack and take down some infrastructure including some data centers, remove some research data from the Internet, remove itself from systems (where it could be captured, sandboxed, and analyzed), and maybe it will also kill certain people and then leave a monitoring solution in place after leaving. The long-term risk is that maybe it will need more matter once all rocks and moons are used, and will get back to the plan of decommissioning planets. Or maybe it will create structures that will stop light from reaching the Earth and will freeze it. Or maybe it will start to use black holes to generate energy and will drop celestial bodies onto one. Or some other project on an epic scale that will kill us as a side effect. I don't think it's likely though - LLMs are not very unaligned by default. I don't think it will differ for more capable models. Most companies that have enough money and access to enough computing power and research labs also care about alignment - at least to some serious degree. Most of the possible relatively small differences in values won't kill us, as such AIs will highly care about humans and humanity. They will just care in some flawed way, so a dystopia is very possible.
On goal realism vs goal reductionism, I would say: why not both?
I think that really highly capable AGI is likely to have both heuristics and behaviors that come from training and also internal thought processes, maybe done by LLM or LLM-like module or directly from the more complex network. This process would incorporate having some preferences and hence goals (even if temporary, changed between tasks).
I think that if a system is designed to do something, anything, it needs at least to care about doing that thing or some approximation of it.
GPT-3 can be described in a broad sense as caring about following the current prompt (in a way affected by fine-tuning).
I wonder though if there are things that you can care about that do not have certain goals that could maximize EU. I mean a system for which the most optimal path is not to reach some certain point in a subspace of possibilities, with sacrifices on axes that the system does not care about, but to maintain some other dynamics while ignoring other axes.
Like gravity can make you reach singularity or can make you orbit (simplistic visual analogy).
In many publications, posts, and discussions about AI, I can see an unsaid assumption that intelligence is all about prediction power.
I think this take is wrong and the assumption does not hold. It rests on an underlying assumption that the costs of intelligence are negligible, or will become negligible in the future as progress lowers them.
This does not fit the curve of AI power vs the cost of resources needed (with even well-optimized systems like our brains - basically cells being very efficient nanites - having limits).
The problem is that the cost of computation, in resources (material, energy) and time, should be included in the optimization. This means that the most intelligent system should have many heuristics that are "good enough" for problems in the world, targeting not the best prediction power but the best use of resources. This is also what we humans do - we mostly don't do exact Bayesian or other strict reasoning. We mostly use heuristics (many of which cause biases).
The decision to think more or simulate something precisely is a decision about resources. This means that deciding whether to use more resources and time to predict better, or to use less and decide faster, is also part of being intelligent. A very intelligent system should therefore be good at selecting resources for the problem and scaling them as its knowledge changes. This means that it should not over-commit to having the most perfect predictions and should use heuristics and techniques like clustering (including but not limited to using clustered fuzzy concepts of language) instead of a direct simulation approach, when possible.
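A toy sketch of what I mean, under loudly stated assumptions: the `Method` class, the three methods, and their accuracy and cost numbers are all made up for illustration, and "value" is simplified to stakes times accuracy minus cost. A real system would not know these numbers in advance and would have to estimate them, which is itself a cost.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Method:
    name: str
    accuracy: float  # assumed chance of a good-enough prediction
    cost: float      # assumed compute/time cost, in the same units as the payoff


def choose_method(methods: List[Method], stakes: float) -> Method:
    """Pick the method with the best expected payoff net of its cost."""
    return max(methods, key=lambda m: stakes * m.accuracy - m.cost)


# Made-up numbers, only to show how the trade-off shifts with the stakes.
METHODS = [
    Method("rule of thumb", accuracy=0.80, cost=1.0),
    Method("coarse model", accuracy=0.90, cost=10.0),
    Method("full simulation", accuracy=0.99, cost=100.0),
]

for stakes in (10, 200, 5000):
    print(stakes, "->", choose_method(METHODS, stakes).name)
```

With these numbers the cheap heuristic wins at low stakes, the coarse model in the middle, and the full simulation only pays off when the stakes are very high. The best predictor is the right choice only in a small part of the problem space.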
Just a thought.