An epistemic status is a statement of how confident the writer/speaker is in what they are saying, and why; see, e.g., this post about the use of epistemic status on LessWrong. Google's definition of epistemic is "relating to knowledge or to the degree of its validation".
This branch of research is aimed at finding a (nearly) objective way of thinking about the universe. When I imagine the end result, I imagine something that receives a distribution across a bunch of data, and finds a bunch of useful patterns within it. At the moment that looks like finding patterns in data via find_natural_latent(get_chunks_of_data(data_distribution))
or perhaps showing that
find_top_n(n,
           ((chunks, natural_latent(chunks)) for chunks in all_chunked_subsets_of_data(data_distribution)),
           key=lambda pair: usefulness_metric(pair[1]))
is a (convergent sub)goal of agents. As such, the notion that the donuts' data is simply poorly chunked - which needs to be solved anyway - makes a lot of sense to me.
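To pin down the shape of that second snippet, here is a toy, runnable sketch. Every helper in it (the chunking, natural_latent, usefulness_metric) is a hypothetical placeholder I made up for illustration, not a real implementation of natural latent discovery:

```python
import heapq
from itertools import combinations

# Toy stand-ins for the hypothetical helpers above; real versions would
# implement natural-latent discovery rather than these placeholders.

def all_chunked_subsets_of_data(data):
    """Yield candidate chunkings: here, just every way of splitting the
    variables into two groups of column indices (illustrative only)."""
    indices = list(range(len(data[0])))
    for k in range(1, len(indices)):
        for left in combinations(indices, k):
            right = tuple(i for i in indices if i not in left)
            yield (left, right)

def natural_latent(chunks):
    """Placeholder: a real version would return a latent that (approximately)
    mediates and is redundantly encoded across the chunks."""
    return {"chunks": chunks}

def usefulness_metric(latent):
    """Placeholder scoring; e.g. how much downstream prediction the latent buys."""
    return len(latent["chunks"][0])  # arbitrary toy score

def find_top_n(n, pairs, key):
    """Keep the n highest-scoring (chunks, latent) pairs."""
    return heapq.nlargest(n, pairs, key=key)

if __name__ == "__main__":
    data_distribution = [(0, 1, 1), (1, 1, 0), (0, 0, 1)]  # toy samples of (X1, X2, X3)
    top = find_top_n(
        3,
        ((chunks, natural_latent(chunks))
         for chunks in all_chunked_subsets_of_data(data_distribution)),
        key=lambda pair: usefulness_metric(pair[1]),
    )
    for chunks, latent in top:
        print(chunks, usefulness_metric(latent))
```

The real work is of course in making natural_latent and usefulness_metric mean something; the sketch only pins down the overall shape of the search.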
I don't know how to think about the possibilities when it comes to decomposing the X_i. Why would it always be possible to decompose random variables to allow for a natural latent? Do you have an easy example of this?
Also, what do you mean by mutual information between the X_i, given that there are at least 3 of them? And why would just extracting said mutual information be useless?
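Part of my confusion: I know of at least two standard ways to extend mutual information to three or more variables, and they behave quite differently, so I'm not sure which (if either) you have in mind:
I(X_1; X_2; X_3) = I(X_1; X_2) - I(X_1; X_2 | X_3)   (interaction information)
C(X_1, X_2, X_3) = H(X_1) + H(X_2) + H(X_3) - H(X_1, X_2, X_3)   (total correlation)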
If you get the chance to point me towards good resources about any of these questions, that would be great.
Let's say every day at the office, we get three boxes of donuts, numbered 1, 2, and 3. I grab a donut from each box, plunk them down on napkins helpfully labeled X1, X2, and X3. The donuts vary in two aspects: size (big or small) and flavor (vanilla or chocolate). Across all boxes, the ratio of big to small donuts remains consistent. However, Boxes 1 and 2 share the same vanilla-to-chocolate ratio, which is different from that of Box 3.
Does the correlation between X1 and X2 imply that there is no natural latent? Is this the desired behavior of natural latents, despite the presence of the common size ratio? (and the commonality that I've only ever pulled out donuts; there has never been a tennis ball in any of the boxes!)
If so, why is this what we want from natural latents? If not, how does a natural latent arise despite the internal correlation?
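To check that I'm picturing the setup right, here is a toy simulation of how I understand it. The key assumption is mine, not something you said: boxes 1 and 2's shared vanilla-to-chocolate ratio is unknown and varies from day to day, which is what would make X1 and X2 correlated across days.

```python
import random

# Toy simulation of the donut setup. Assumption (mine, not stated above):
# the shared vanilla/chocolate ratio of boxes 1 and 2 is unknown and varies
# day to day, which is what makes X1 and X2 correlated across days.

random.seed(0)
N_DAYS = 100_000
P_BIG = 0.5           # size ratio, the same for every box
P_BOX3_VANILLA = 0.2  # box 3's own, different flavor ratio

def draw_donut(p_vanilla):
    size = "big" if random.random() < P_BIG else "small"
    flavor = "vanilla" if random.random() < p_vanilla else "chocolate"
    return (size, flavor)

both_vanilla = x1_vanilla = x2_vanilla = 0
for _ in range(N_DAYS):
    shared_ratio = random.choice([0.1, 0.9])  # today's ratio for boxes 1 and 2
    x1 = draw_donut(shared_ratio)
    x2 = draw_donut(shared_ratio)
    x3 = draw_donut(P_BOX3_VANILLA)  # drawn but not needed for the X1/X2 check
    x1_vanilla += x1[1] == "vanilla"
    x2_vanilla += x2[1] == "vanilla"
    both_vanilla += (x1[1] == "vanilla") and (x2[1] == "vanilla")

# If X1 and X2 were independent, P(both vanilla) would be ~P(X1 vanilla) * P(X2 vanilla).
p1, p2, p12 = x1_vanilla / N_DAYS, x2_vanilla / N_DAYS, both_vanilla / N_DAYS
print(f"P(X1 vanilla)={p1:.3f}  P(X2 vanilla)={p2:.3f}  P(both)={p12:.3f}  product={p1*p2:.3f}")
```

Under that assumption, P(both vanilla) comes out well above P(X1 vanilla) * P(X2 vanilla), i.e. X1 and X2 are correlated across days even though each individual draw is independent once the ratios are fixed.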
We could remove information from [the variable]. For instance, [it] could be a bit indicating whether the temperature is above 100°C.
I don't understand how this is less information than a bit indicating whether the temperature is above 50°C. Specifically, given a bit telling you whether the temperature is above 50°C, how do you know whether the temperature is above 100°C or between 50°C and 100°C?
As to the definition of short term goal: any goal that can be achieved (fully, e.g. without an "and keep it that way" clause) in a finite short time (for instance, in a few seconds), with the resources the system already has at hand. Equivalently, I think: any goal that doesn't push instrumental power seeking. As to how we know a system has a short term goal: if we could argue that systems prefer short term goals by default, then we still wouldn't know the goals of a particular system, but we could hazard a guess that they are short term. Perhaps we could expect short term goals by default if they were, for instance, easier to specify, and thus to have. As pointed out by others, if we try to give systems long term goals on purpose, they will probably end up with long term goals.
So long term goals aren't a default; market pressure will put them there as humans slowly cede more and more control to AIs, simply because the latter are making decisions that work out better. Presumably this would start with lower level decisions (e.g. how exactly to write this line of code; which employee to reward based on performance), and AIs would then slowly be given higher level decisions to make. In particular, we don't die the first time someone creates an AI with the ability to (escape, self-improve, and then) kill the competing humans, because that AI is likely focused on a much smaller, more near term goal. That way, if we're careful and clever, we have a chance to study a smarter-than-human general intelligence without dying. Is that an accurate description of how you see things playing out?
I'm not sure I understand; are you saying that given these, we have high P(Doom), or that these are necessary to be safe even if GIs have only short term goals? Or something else entirely?
I am using Wikipedia's definition: "Ensuring that emergent goals match the specified goals for the system is known as inner alignment."
Inner alignment is definitely a problem. In the case you described, the emergent goal was long term (ensure I remember the answer to 1+1), and I'm still wondering whether, by default, short term specified goals do or do not lead to strange long term goals like the one in your example.
I wrote the following in custom instructions, under "How would you like ChatGPT to respond?":
After every answer, give a short reflection, like "I looked at this answer again and it looks good" or "I thought about this again and I think it could be better if ..." or "After looking this over again, this makes no sense because...".
I then had the following conversation:
Q: What is denser, gold or uranium?
A: Uranium is denser than gold. The density of gold is about 19.32 grams per cubic centimeter, while the density of uranium is about 19.1 grams per cubic centimeter.
Upon reflection, I see that I made an error in my initial response. The correct densities are indeed 19.32 g/cm³ for gold and 19.1 g/cm³ for uranium, but these values indicate that gold is denser than uranium, not the other way around. I apologize for the confusion, and I appreciate your patience.
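For what it's worth, the same behavior should be reproducible programmatically; here is a rough sketch using the OpenAI Python client rather than the ChatGPT custom-instructions UI (the model name is just an example, and an API system message isn't guaranteed to behave identically to the UI's custom instructions):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Same instruction as above, passed as a system message instead of the
# ChatGPT "custom instructions" box.
REFLECT = (
    "After every answer, give a short reflection, like "
    "'I looked at this answer again and it looks good' or "
    "'I thought about this again and I think it could be better if ...' or "
    "'After looking this over again, this makes no sense because...'."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative choice; any chat model works here
    messages=[
        {"role": "system", "content": REFLECT},
        {"role": "user", "content": "What is denser, gold or uranium?"},
    ],
)
print(response.choices[0].message.content)
```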
The fact of the matter is that humans communicate. They learn to communicate on the basis of some combination of their internal similarities (in terms of goals and perception) and their shared environment. The natural abstraction hypothesis says that the shared environment accounts for more rather than less of it. I think of the NAH as a result of instrumental convergence - the shared environment ends up having a small number of levers that control a lot of the long term conditions in the environment, so the (instrumental) utility functions and environmental pressures are similar for beings with long term goals - they want to control the levers. The claim then is exactly that a shared environment provides most of the above.
Additionally, the operative question is what exactly it means for an LLM to be alien to us: does it converge to using enough human concepts for us to understand it, and if so, how quickly?