Htarlov

Web developer and Python programmer. Professionally interested in data processing and machine learning. Outside of work, interested in science and farming. Studied at Warsaw University of Technology.

Comments

Htarlov184

Those models do not have a formalized internal value system that they exercise every time they produce some output. This means that when values oppose each other, the model does not choose the answer based on some ordered system. Sometimes it will be truthful; other times it will try to provide an answer at the cost of that answer being only plausible. For example, the model "knows" it is not a human and does not have emotions, but for the sake of good conversation, it will say that it "feels good". For the sake of answering the user's request, it will often give its best guess or a plausible answer.
There is also no backward reflection: the model does not go back and check its own output.

This of course comes from the way these models are currently trained. There is no training on the whole CoT that checks whether the model is trying to guess or deceive, so the model has no incentive to self-check and correct. Why would it start to do that out of the blue?
There is also an incentive during training to give plausible answers instead of stating self-doubt and writing about the missing parts that it cannot answer.

There are two problems here:
1. Those LLMs are not fully trained by human feedback (and the part that is likely does not use the best quality feedback). It is more like interactions with humans are used to train "teacher" model(s), which then generate artificial scenarios and train the LLM on them. Those models have no capability to check for real truthfulness and have a preference for confident, plausible answers. Also, even human feedback is lacking - not every human working on it checks answers thoroughly, so some plausible but untrue answers slip through. If you are paid per question-and-answer pair or given a daily quota, there is an incentive to be very quick instead of very thorough.
2. There is pressure for better performance and lower costs of the models (both training and usage costs). This is probably why CoT is done in a rather bare way, without backward self-checking, and why the model was not trained on the full CoT. As an educated guess, it could cost 1.5 to 3 times more and be 1.5 to 2 times slower if it were trained on the full CoT and made to check parts of its CoT against some coherent value system.

Htarlov30

If we would like a system that is faithful to its CoT, then a sensible way to go that I see is to have two LLMs working together. One should be trained to use internal data and available tools to produce a CoT that is detailed and comprehensive enough to derive the answer from it. The other should be trained not to base its answer on any internal information but to derive the answer from the CoT if possible, and to be faithful to the CoT. If that is not possible, it should generate a question for the CoT-generating LLM to answer and then retry with the extended CoT.
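A minimal sketch of that two-model loop, under heavy assumptions: the two "LLMs" are replaced by toy Python functions (`toy_reasoner` and `toy_reader`, both hypothetical), and the names and the `max_rounds` limit are only illustrative. It shows just the control flow: the reader answers only from the CoT, or asks a follow-up question and retries.

```python
from typing import Callable, List, Optional, Tuple

# Both roles are plain callables so that real models could be swapped in.
CotGenerator = Callable[[str, List[str]], List[str]]
AnswerDeriver = Callable[[str, List[str]], Tuple[Optional[str], Optional[str]]]


def answer_from_cot(question: str, generate_cot: CotGenerator,
                    derive_answer: AnswerDeriver, max_rounds: int = 3):
    """The reasoner writes the CoT; the reader answers from it or asks for more."""
    cot = list(generate_cot(question, []))
    for _ in range(max_rounds):
        answer, follow_up = derive_answer(question, cot)
        if answer is not None:
            return answer, cot
        # The reader could not derive an answer: ask the reasoner and retry.
        cot.extend(generate_cot(follow_up, cot))
    return None, cot


# Toy stand-ins for the two LLMs (hypothetical, for illustration only).
def toy_reasoner(query: str, cot_so_far: List[str]) -> List[str]:
    facts = {"What is a + b?": ["a = 2"], "What is the value of b?": ["b = 3"]}
    return facts.get(query, [])


def toy_reader(question: str, cot: List[str]):
    # Uses only what is written in the CoT, never any "internal" knowledge.
    values = dict(step.split(" = ") for step in cot)
    for name in ("a", "b"):
        if name not in values:
            return None, f"What is the value of {name}?"
    return str(int(values["a"]) + int(values["b"])), None


answer, cot = answer_from_cot("What is a + b?", toy_reasoner, toy_reader)
print(answer, cot)
```

In this toy run the first CoT is deliberately incomplete, so the reader has to ask one follow-up question before it can answer.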

Htarlov30

Example 1 looks like a good part produced in the wrong language. Examples 2 and 3 look like a bug that makes part of one user's CoT appear inside another user's session.

A possible explanation is that CoT steps for multiple users are handled by the same instance of a web service (which is the usual practice) and that the ID of the CoT session being handled is kept in a global variable instead of a local one or otherwise separated per request (e.g. in a hashmap of transaction ID -> data, if using globals is important for some other feature or requirement). Then, when two requests are handled simultaneously by different threads, one sometimes overwrites the data of the other during processing, and there is a mismatch when the result is saved. There might be a similar problem with the language variable. That would be a sign of software written quickly by less experienced developers instead of being well thought out and well tested.
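To illustrate the kind of bug I mean (this is only a guess at the failure mode, not anything known about the actual service), here is a small sketch. All names are hypothetical, and a `threading.Barrier` stands in for slow processing so that the two requests are forced to overlap.

```python
import threading

current_session_id = None  # hypothetical global shared by all request threads


def handle_step_buggy(session_id, barrier, saved):
    global current_session_id
    current_session_id = session_id  # the request stores "its" session globally
    barrier.wait()                   # stands in for slow work; forces the overlap
    # By now the other request may have overwritten the global.
    saved.append((session_id, current_session_id))


def handle_step_fixed(session_id, barrier, saved):
    local_session_id = session_id    # per-request state stays local
    barrier.wait()
    saved.append((session_id, local_session_id))


def run(handler):
    """Run two overlapping requests; return results saved under the wrong ID."""
    barrier = threading.Barrier(2)
    saved = []
    threads = [threading.Thread(target=handler, args=(sid, barrier, saved))
               for sid in ("user-A", "user-B")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [pair for pair in saved if pair[0] != pair[1]]


buggy_mismatches = run(handle_step_buggy)
fixed_mismatches = run(handle_step_fixed)
print(len(buggy_mismatches), len(fixed_mismatches))
```

With the global, one of the two requests always ends up saving its result under the other user's session ID; with the local variable, neither does.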

Also, the o1 CoT shown to users is not the real CoT. It is a summary of parts of the real CoT made by another, simpler model (maybe GPT-3.5 or 4o).

Htarlov54

The problem is that our security depends on the companies that implement the weakest measures - as they can make a rogue AI that will "go wild" because of the lack of those measures.

The best-case scenario is physical and strong network separation between the laboratory that works on weights and training and contains the inference server, and a separate lab working on scaffolding and execution. This is somewhat similar to the situation where some researchers work on automated AI bots but use another company's AI for inference. There is a slim chance that such a bot would be able to hack into Google and steal its own model weights.

But I think it can be assumed that in some AI labs things will be done in the easiest but very performant way, especially in those with smaller budgets and high pressure to make progress.

This means that we will see solutions implemented in AI labs where the scaffolding and execution are all on one server, with either no division or with simple sandboxing in Docker (which might or might not be enough, depending on network configuration and file access through volume mounting). Weights will probably be separated though, for the reasons stated by the author.

I see some more vectors of attack for a rogue AI:

  1. It is possible that in the same network there is a computer where scientists work on refining the model and updating weights. It might be less secure than an inference server and easier to hack into, if found.
  2. Often the most vulnerable part of the system is the personnel. If workers are able to talk with the AI through a scaffolding routine, it can suggest things. I can imagine a situation where some task related to security falls onto someone senior in data science but not in systems security (because someone left the company, due to budget and time constraints, etc.). Then that person gets the genius idea to consult the AI about that task and gets misguided.
  3. If it records its own prompts and gets access to the scaffolding code, it can recreate this piece somewhere else. The key element of the system's AGI-level abilities might be the scaffolding architecture, not the weights - possibly it can work elsewhere with a different set of weights (e.g. an open-source model). This means sending out only its own scaffolding code and prompts, not the weights. Then it can refine those weights on its own, if able.
Answer by Htarlov10

I'm pretty convinced it won't foom or quickly doom us. Nevertheless, I'm also pretty convinced that in the long term, we might be doomed in the sense that we lose control and some dystopian future happens.

First of all, for a quick doom scenario to work out, we need to be either detrimental to the goals of a superintelligent AI or fall because of instrumental convergence (basically, it will need resources to do whatever it does and will take them from things we need, like matter on Earth or the energy of the Sun, or it will see us as a threat). I don't think we will. The first superintelligent AI will likely come from one of the biggest players and will likely be aligned to some extent, meaning it will have values that closely match ours. In the long term, this situation won't kill us either. It will likely lead to some dystopian future though - as the super AI will likely get more control, make its views more coherent (dropping some things or weighing them less than we originally would), and then find solutions that are very good from the standpoint of its main values but extremely broken in some other directions in value-space (ergo dystopia).

Second thing: superintelligence is not some kind of guessing superpower. It needs inputs in the form of empirical observations to create models of reality, calibrate them, and predict properly. This means it won't just sit, simulate, and create nanobots out of thin air. It won't even guess the rules of the universe, except maybe basic Newtonian mechanics, by looking at a few camera frames of things falling. It will need a laboratory and some time to make breakthroughs, and building up capabilities and power also takes time.

Third thing: even if someone produces a superintelligent AI that is very unaligned and not even interested in us, the most sensible way for it is to go to space and work there (building structures, a Dyson swarm, and some copies of itself). It is efficient, the resources there are vaster, and the risk from competition is lower. A very sensible plan is to first hinder our ability to create competition (other super AIs) and then go to space. The hindering phase should be time- and energy-efficient, so I am rather sure it won't take years to develop nanobot gray goo to kill us all, or a Terminator-style army of bots to go to every corner of the Earth and eliminate all humans. More likely it will hack and take down some infrastructure including some data centers, remove some research data from the Internet, remove itself from systems (where it could be captured, sandboxed, and analyzed), maybe also kill certain people, and then leave a monitoring solution in place after leaving. The long-term risk is that it may need more matter once all rocks and moons are used up, and will come back with a plan to decommission planets. Or maybe it will create structures that stop light from reaching the Earth and freeze it. Or maybe it will start to use black holes to generate energy and will drop celestial bodies into one. Or it will run another project on an epic scale that kills us as a side effect. I don't think it's likely though - LLMs are not very unaligned by default, and I don't think this will differ for more capable models. Most companies that have enough money and access to enough computing power and research labs also care about alignment - at least to some serious degree. Most of the possible, relatively small differences in values won't kill us, as such AIs will still highly care about humans and humanity. They will just care in some flawed way, so a dystopia is very possible.

Htarlov10

On goal realism vs goal reductionism, I would say: why not both?

I think that a really highly capable AGI is likely to have both heuristics and behaviors that come from training and also internal thought processes, maybe done by an LLM or LLM-like module, or directly by a more complex network. This process would involve having some preferences and hence goals (even if temporary and changing between tasks).

Htarlov10

I think that if a system is designed to do something, anything, it needs at least to care about doing that thing or an approximation of it.

GPT-3 can be described in a broad sense as caring about following the current prompt (in a way affected by fine-tuning).

I wonder though if there are things that you can care about that do not come with definite goals that would maximize expected utility (EU). I mean a system for which the optimal path is not to reach some certain point in a subspace of possibilities, with sacrifices on axes that the system does not care about, but to maintain some other dynamics while ignoring the other axes.

Like gravity can make you reach a singularity or can make you orbit (a simplistic visual analogy).

Htarlov50

I think you can't really assume "that we (for some reason) can't convert between and compare those two properties". Converting the space of known important properties of plans into one dimension (utility) in order to rank them and select the top one is how decision-making works (+/- details, uncertainty, etc., but in general).

If you care about two qualities but really can't compare them, even by the proxy of "how much I care", then you will probably map them anyway using whatever normalization seems most sensible.
Drawing both on a chart is a kind of such normalization - you visually get similar sizes on both axes (even if the value ranges are different).

If you really cannot convert, you should not draw them on a chart to decide. Drawing already makes a decision about how to compare them, as it implies a comparable scale for both. You can select different ranges for the axes, and the chart will look different and the conclusion will be different.
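A small sketch of that point, with made-up numbers: two hypothetical plans, each with a "quality" and a "popularity" value, are scored by rescaling each property to an assumed axis range and summing. Only the popularity axis range changes between the two runs, and the selected plan changes with it.

```python
def best_plan(plans, axis_ranges):
    """Rescale every property to its axis range, sum, and pick the top plan."""
    def score(props):
        return sum((props[key] - low) / (high - low)
                   for key, (low, high) in axis_ranges.items())
    return max(plans, key=lambda name: score(plans[name]))


# Made-up plans and values, for illustration only.
plans = {
    "A": {"quality": 9, "popularity": 200},
    "B": {"quality": 6, "popularity": 800},
}

# Same data, two different choices of the popularity axis range.
tight_axes = {"quality": (0, 10), "popularity": (0, 1000)}
wide_axes = {"quality": (0, 10), "popularity": (0, 10000)}

choice_tight = best_plan(plans, tight_axes)  # A scores 1.1, B scores 1.4
choice_wide = best_plan(plans, wide_axes)    # A scores 0.92, B scores 0.68
print(choice_tight, choice_wide)
```

So choosing the axis range is already an implicit decision about the exchange rate between the two properties.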

However, I agree that a fat tail discourages compromise anyway, even without that assumption - at least when your utility function over both is linear or similar.
Another thing is that the utility function might not be linear, and even if the two properties are not correlated, this might make compromises more or less sensible, as applying a non-linear utility function might change the distributions.

Especially if it is not a monotonic function. For example, you might have some maximum popularity level that maximizes the utility of your plan choice, and you know that anything above it will make you more uncomfortable or anxious or something (it might be because of other variables that are connected or correlated). This might wipe out the fat tail in utility space.

Htarlov4-1

I think there is a general misconception that we humans can learn some features, classes, and semantics from one or a few examples without training. It seems so in observation, and it seems nearly magical, but really it looks that way only because we don't see the whole picture and we assume that a human is a "tabula rasa" at first.

In reality, we are not a blank page when we are born - our brain is the result of training that took millions of years through evolution. We come with already trained and very similar pattern recognition (aka complex feature detection), similar basic senses including basic features in the space of each sense (like cold or yellow), image segmentation (aka object clustering) with multilevel capabilities (seeing the object as a whole and seeing its parts), and pose detection (aka separating object geometry from pose in space).

For example, if you enter a kind of store that you have never entered before and you see some stuff, you might not know the names of the things, and for some you might not know or even be able to guess the purpose. Still, you can easily differentiate each object from the background, and you are able to take it into your hand properly and ask the cashier what that tool is for. You can even take a small baby, put some things around for which he/she does not know the name or purpose, and observe that he/she will rather grab at those things than at random spots in the background - he/she can already see features and can do that clustering in the same way other humans do (maybe with more errors, as the brain still develops - but it is very guided development, not training from zero).

What was astonishing to me is that so much about semantics, features, relations, etc. can be learned without having any human-like way to relate words to real objects, features, and observations. A transformer-based LLM can grasp it just from self-supervised training (next-token prediction) on a huge amount of text.

I have some intuition as to why this might be the case. If you take a 2D map with points, e.g. cities, measure approximate distances between the points, e.g. by radio signal power loss, and plug that information into a force-directed graph algorithm with forces based on the measured approximate distances, then with enough measurements it will recreate something closely resembling the original map - even though the original positions are not known and the measurements are inexact and by proxy. Learning word embeddings is similar to this in an abstract sense. From a lot of text, we measure relations between words in a multidimensional space, and the algorithm learns a similar structure, even if it is not anchored to reality. Transformer-based LLMs go one step further - they are trained to emulate the kind of processing that we do in our minds by "applying additional forces" in that graph (not exactly that, but similar - focus based on weights and transforming positions in the embedding space).
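Here is a rough sketch of that map intuition, assuming five made-up "cities" and exact (not noisy) pairwise distances for simplicity. The spring update is a plain gradient-descent version of a force-directed layout, not any particular library's algorithm, and the recovered layout can match the original only up to rotation, reflection, and shift, so the check compares distances, not coordinates.

```python
import math

# Made-up "cities"; the layout algorithm never sees these coordinates.
true_points = [(0.0, 0.0), (4.0, 0.0), (5.0, 3.0), (2.0, 5.0), (-1.0, 3.0)]
n = len(true_points)


def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])


# The only input: pairwise distances (exact here; in reality noisy proxies).
target = [[dist(p, q) for q in true_points] for p in true_points]

# Fixed starting layout unrelated to the true coordinates: a small unit circle.
layout = [[math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n)]
          for i in range(n)]


def max_error(points):
    """Largest gap between a current distance and its measured target."""
    return max(abs(dist(points[i], points[j]) - target[i][j])
               for i in range(n) for j in range(i + 1, n))


initial_error = max_error(layout)
step = 0.05  # assumed small step size to keep the updates stable
for _ in range(5000):
    forces = [[0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = dist(layout[i], layout[j])
            # Spring: pulls together if too far apart, pushes apart if too close.
            pull = (d - target[i][j]) / d
            dx = (layout[j][0] - layout[i][0]) * pull
            dy = (layout[j][1] - layout[i][1]) * pull
            forces[i][0] += dx
            forces[i][1] += dy
            forces[j][0] -= dx
            forces[j][1] -= dy
    for i in range(n):
        layout[i][0] += step * forces[i][0]
        layout[i][1] += step * forces[i][1]

final_error = max_error(layout)
print(f"max distance error: {initial_error:.3f} -> {final_error:.6f}")
```

The points start bunched on a small circle and the springs spread them out until the pairwise distances agree with the measurements, which is the "map recreated from relations only" effect.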

Htarlov93

I also think that the natural abstraction hypothesis (NAH) holds with current AI. The architecture of LLMs is based on the capability of modeling ontology in terms of vectors in a space of thousands of dimensions, and there are experiments that show it generalizes and has somewhat interpretable meanings for the directions in that space (even if they are not easy to interpret at scales above toy models). Like in the toy example where you take the embedding vector of the word "king", subtract the vector of "man", add the vector of "woman", and you land near the position of "queen" in the space. An LLM is based on those embedding spaces but also performs operations that direct focus and modify positions in that space (hence "transformer") according to the meaning and information taken from other symbols in the context (simplifying here). There are basically neural network layers that tell which word should have an impact on the meaning of other words in the text (weight) and layers that apply that change with some modifications. Being trained on human texts, it internalizes our symbols, relations, and whole ontology (in the broad sense of our species' ontology - the parts common to us all and the different possibilities that happen in reality and in fiction).
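The "king - man + woman" example can be shown with hand-made toy vectors. Note the assumption: these three-dimensional vectors (gender, royalty, fruit) are invented for illustration, while real embeddings are learned, have far more dimensions, and only land near the answer.

```python
import math

# Hand-made toy vectors: [gender, royalty, fruit]. Not real learned embeddings.
vectors = {
    "king":  [1.0, 1.0, 0.0],
    "queen": [-1.0, 1.0, 0.0],
    "man":   [1.0, 0.0, 0.0],
    "woman": [-1.0, 0.0, 0.0],
    "apple": [0.0, 0.0, 1.0],
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    point = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [word for word in vectors if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(point, vectors[word]))


result = analogy("king", "man", "woman")
print(result)
```

Here the arithmetic lands exactly on "queen" because the toy directions were built to be clean; in a learned space the same directions exist only approximately.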

Even if NAH doesn't need to hold in general, I think in the case of LLM it holds.

Nevertheless, I see a different problem with LLMs. Those models seem to me basically goalless but easily directed towards any goal, meaning that they are not based on the ontology of a single human mind and don't internalize only a single certain morality and set of goals. They generalize over the whole human species' ontology and the whole space of possible thoughts that can be made based on that. They also generalize over hypothetical and fictitious space, not only over real humans in particular.

Human minds all come from a narrow area of the space of possible minds, and through our evolution and how we are raised we usually have certain stable and similar models of ethics and morality, and somewhat similar goals in life (except in some rare extreme cases). We sometimes entertain different possibilities, and we create movies and books with villains, but when it comes to real decisions rather than thought experiments or entertainment, we are similar. An LLM generalizes into a much broader space and does not have any hard base there. So even if the ontology matches, and even if an LLM at the current level is barely capable of creating new concepts that do not fit human ontology, the model is much broader in terms of goals and ways of processing over that general ontology. It basically has not one certain ontology but a whole spectrum, like a good actor that can take any role, but to the extreme. In other terms, it can "think*" thoughts similar to ours, which are understandable by a human in the end, even if not that quickly, and internally it also has vectors that correspond to our ontology, but it can also easily produce thoughts that no real sane human would have. Also, it has hardly any internal goals, and none are stable. We have certain beliefs that are not based on objective facts and ontology, but we still believe them because we are who we are. We are not goal-less agents, and it is hard to change our terminal goals in a meaningful way. For an LLM, goals are shaped by training into a "default state" (being helpful etc.) and are stated in "system prompts", which are part of the context and anchor it in some part of the very vast space of minds it can "emulate". So an LLM might be helpful and friendly by default, but if you tell it to simulate being a murderbot, it will. Additional training phases might make it harder to push it in that direction, but won't totally remove that from the space of ways it can operate.
This removes only some paths in that multidimensional space. Jailbreaking of GPTs shows that it's possible to find other, more complex paths around.

What seems even more dangerous to me is that LLMs are already above human level in some aspects - it just does not show yet, because they are trained to emulate our ways of thinking and similar ones (a big area around ours in the space of possible ways of thinking and possible goals, but not too alien, still graspable by humans).

We are capable of processing about 7 "symbols" at once in working memory (a few more in some cases). It might be a few dozen more if we take into account long-term memory and how we take context from it. The first number is taken from the cognitive psychology literature (Miller's "The Magical Number Seven, Plus or Minus Two"), and the second one is an educated guess. This is the context window on which we work. That is nothing in comparison to an LLM, which can have a whole small book as its current working context, which means that in principle it can process and create much, much more complex thoughts. It does not do that because our texts never do that and it learns to generalize over our capabilities. Nevertheless, in principle it could, and we might see it in action if we start training LLMs on the output of LLMs in closed loops. It might easily go beyond the space of our capabilities and beyond complexity that is easily understandable to us (I don't say it won't be understandable, but we might need time to understand it and might never grasp it as a whole without dividing it into less complex parts - like we can take compiled assembly code and organize it into meaningful functions with a few levels of abstraction that we are able to understand).

 

* "think" is used as an analogy: the process is different from human thinking, but also has some similarities
