I think we maybe shouldn't reify "concepts" at all. To some extent it seems to me like there are only patterns of internal and external behavior. Certain stimuli - experiences, feelings, trains of thought - lead to certain sequences of words being produced. The question should not be "what does the word mean?" but rather "what does this word having been used say about the internal state of the one using it, in this context?" Words don't have fixed meanings - they are entirely contextual.
Concepts are not things that actually exist as stable units - our usage of a finite set of words makes us think that they do, but in reality every pattern of neural activations is unique and we reach for the words that seem the least far away in some vector space (generated by training on our past experience of the same words in other contexts) to the position of the thought we actually want to express. Every such action moves the positions of those words a little bit in other people's semantic spaces, leading to semantic drift - something like a rise in entropy, a gas expanding through a room.
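To make the vector-space picture a bit more concrete, here is a minimal toy sketch. Everything in it is an assumption made for illustration: the 2-D coordinates, the three-word lexicon, the Euclidean distance, and the `rate` parameter are all made up, and nothing here claims to describe how brains or language models actually work. A speaker picks the least-far-away word for a thought, and a listener's copy of that word moves slightly toward what it was used for.

```python
import math

def nearest_word(thought, lexicon):
    """Return the word whose vector is closest (Euclidean distance) to the thought."""
    return min(lexicon, key=lambda w: math.dist(lexicon[w], thought))

def hear(lexicon, word, thought, rate=0.1):
    """Listener update: nudge the used word's vector a little toward the thought."""
    old = lexicon[word]
    lexicon[word] = tuple(o + rate * (t - o) for o, t in zip(old, thought))

# Hypothetical 2-D "semantic space"; the coordinates are invented for this example.
lexicon = {"game": (0.0, 0.0), "work": (4.0, 0.0), "love": (0.0, 4.0)}
thought = (1.0, 1.5)  # a thought that matches no word exactly

word = nearest_word(thought, lexicon)   # the speaker reaches for the closest word
before = lexicon[word]
hear(lexicon, word, thought)            # the listener's sense of that word shifts
after = lexicon[word]

print(word, before, after)
```

Repeating the `hear` step over many speakers and many thoughts is one simple way to picture the drift described above: only the word that was actually used moves, and it moves toward whatever it was last stretched to cover.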
Probably I didn't fully understand your position, but here are my thoughts and emotions about this topic:
This might be a bridge between machine learning and agent foundations that is itself related to alignment. A vague concept could be expressed by a model of machine learning that presents behaviors in a large diverse collection of specific episodes it can make sense of (its scope), exercising its influence as it decides according to its taste.
The machine learning point of view is that we are training the model using all the episodes as the dataset (or maybe for reinforcement learning), with the other things defining the episodes (besides the model itself) giving the data that the model learns. The decision theory point of view is that the model is an adjudicator between the episodes, a shared agent of acausal coordination that intervenes in all of them jointly, that the model presents a single policy that is the updateless decision to arrange all episodes in one possible way, as opposed to other possible ways.
The alignment point of view is that a concept expresses a tiny aspect of preference in its decisions, with its role as an adjudicator giving consistency to that aspect of preference, and coherence to decisions of an agent that relies on the concept. It acts to extend the goodhart scope of the agent as a whole to the episodes that the concept can make sense of. The scope of a concept should be in an equilibrium with its content (behavior), as settled by reflection of learning from the episodes where the concept acts/occurs. Different concepts interact in shared episodes, where their scopes intersect, jointly supplying all the data that makes an episode (when it doesn't originate as observations of reality, grounding the whole thing). Concepts in this sense are both a way of extending the goodhart scope and of remaining aware of its current locus.
Sorry if I dumb it down too much. I tried to come up with specific examples without terminology. Here's how I understand what you're saying:
Simple/absurd examples:
This might be a bridge between machine learning and agent foundations that is itself related to alignment.
In this case, could you help me with the topic about "colors"? I wouldn't write this post if I didn't write about "colors". So, this is evidence (?) that the topic about "colors" isn't insane.
There, a "place" is a vague concept. A "spectrum" is a specific context for the place. Meaning is a distribution of "details". Learning is guessing the correct distribution of details ("color") for a place in a given context.
So, this is evidence (?) that the topic about "colors" isn't insane.
I mean, the sketch I've written up (mostly here, some bargaining related discussion here) is not very meaningful, it's like that "Then a miracle occurs" comic. You can make up such things for anything, it's almost theoretical fiction (not real theory) in a sense analogous to that of historical fiction (not real history). It might be possible to build something out of this, but probably not, like you don't normally look for practical advice in books of fiction, even though it can sometimes happen to be found there. That's not what books of fiction are for though, and there could be a scene for self-aware theoretical fiction writers.
I think the interesting point of the sketch is how it naturally puts models of machine learning in the context of acausal decisions from agent foundations, the points of view that are usually disjoint in central examples of either. So maybe there is a way to persist in contorting them in each other's direction along these lines, and that prompted me to mention it.
I've written in response to this post because your definition of vague concepts (at the beginning of the post) seems to fit adjudicators pretty well. In the colors post, there are also references to paradigms, which are less centrally adjudicators, but goodhart scope is their signature feature (a paradigm can famously fail to understand/notice problems that are natural and important for a different point of view).
This post about vague concepts in general is mostly meaningless to me too: I care about something more specific, "colors". However, I think a text may be "meaningless" and yet very useful:
Did we achieve anything? I think we could have. If one of us gets a specific insight, there's a chance to translate this insight (from A to B, or from B to A).
I just tried to understand (without terminology) how my ideas about "vague concepts" could help to align an AI. Your post prompted me to think in this direction directly. And right now I see this possibility:
The most important part of my post is the idea that the specific meanings of a vague concept have an internal structure (at least in specific circumstances). It is as if (it's just an analogy) the vague concept is self-aware about its changes of meaning and reacts to those changes. You could try to use this "self-awareness" to align an AI, to teach it to respect important boundaries.
For example (it's an awkward example), let's say you want to teach an AI that interacting with a human is often not a game, or that it may be bad to treat it as a game. If the AI understands that reducing the concept of "communication" to the concept of a "game" carries certain implications, you would be able to explain which reductions and implications are bad without giving the AI complicated explicit rules.
(Another example) If an AI has (or is able to reach) an internal worldview in which "loving someone" and "making a paperclip" are fundamentally different things and not just a matter of arbitrary complicated definitions, then it may be easier to explain human values to it.
However, this is all science fiction if we have no idea how to model concepts and ideas and their changes of meaning. But my post about colors, I believe, can give you ideas about how to do this. I know:
But it may give ideas, a new approach. I want to fight for this chance, both because of AI risk and because of very deep personal reasons.
"Adjudicator" is a particular role for agents/policies, and the policies (algorithms that run within episodes) are not necessarily themselves agents (adjudicator-as-agent chooses an adjudicator-as-policy as its decision, in the agent foundations point of view). There is also an "outer agent" I didn't explicitly discuss that constructs episodes on situations, deciding that certain adjudicators are relevant to a situation and should be given authority to participate in shaping or observing the content of the episode on it. This outer agent is at a different level of sophistication than the adjudicators-as-policies (though not necessarily different from adjudicators-as-agents), and is in a sense built out of the adjudicators, as discussed here.
- A vague concept can be compared to an agent (AI).
- You can use vague concepts to train agents (AIs).
- An agent can use a vague concept to define its field of competence.
So I think the use of "agent" in the first point I quoted is about adjudicators, in the second point both adjudicator and outer agent fit (but mean different things), and the third point is about the outer agent (how its goodhart scope relates to those of the adjudicators).
My definition of a "vague concept" (A):
Examples of vague concepts: games, health, human values, personality, words and some hypotheses.
I think it may be very important to study ways to think about vague concepts: it's related to hypothesis generation and human values. So it may be related to AGI and AI alignment.
I'll discuss some examples and share my thoughts about the inner workings of vague concepts.
Examples
Games
An example about "games" is famous in philosophy. But keep in mind that my interpretation may differ:
https://en.wikipedia.org/wiki/Family_resemblance
When you consider a bunch of "games", it's easy to see the common features. But as you consider more and more "games" and things that are sometimes called "games", it turns out that everything can be a game. The boundary between "games" and "non-games" becomes more and more arbitrary and complicated.
And yet no matter how much you stretch the concept (e.g. say something like "love is just a game"), in a specific context the meaning is clear enough. In a specific context the boundary between "games" and "non-games" is quite simple. Does this boundary have a fractal dimension? (I'm half-joking)
I also think it's important to consider "family difference": maybe you have just a single object A and a group of objects (B). When you compare A to an object from (B), they're always different enough. But you can't formulate a universal difference between A and all objects from (B). Maybe because you don't know all the objects in (B).
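The "family difference" idea can be stated quite crisply: for every object in (B) there is some difference from A, but there is no single difference that works for all of (B) at once. A minimal sketch, assuming objects are represented as sets of features (the feature names and the objects `b1`, `b2`, `b3` are hypothetical and only serve the illustration):

```python
# Hypothetical feature sets, invented for illustration.
A = {"rules", "winner", "fun"}
B = {
    "b1": {"rules", "winner"},  # differs from A by lacking "fun"
    "b2": {"rules", "fun"},     # differs from A by lacking "winner"
    "b3": {"winner", "fun"},    # differs from A by lacking "rules"
}

def differences(a, b):
    """Features present in exactly one of the two objects (symmetric difference)."""
    return a ^ b

# A differs from every object in B taken one at a time ...
each_differs = all(differences(A, b) for b in B.values())

# ... but no single feature separates A from all of B simultaneously.
universal = set.intersection(*(differences(A, b) for b in B.values()))

print(each_differs, universal)
```

Here `each_differs` is true while `universal` is empty, which is the situation described above for the objects that are known. It says nothing about objects in (B) that you have not seen yet.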
Sorites paradox
The sorites paradox is at least partially relevant here.
It is (presumably) easy to tell a heap apart from a non-heap. But it's hard or impossible to explore the boundary or to analyze the question through the framework of causation ("What causes a non-heap to become a heap?").
Health, moral values
You may also call vague concepts "cluster properties" (explanation in a Philosophy Tube video). In the text form:
You may even compare "vague concepts" to "social constructs".
Vague Hypotheses
You can imagine a hypothesis based on vague concepts, for example "healthy people earn more money than unhealthy people" or "people who love games earn more money". In their most abstract form, those theories can't be falsified. But it's easy to generate specific falsifiable hypotheses based on those ideas.
Scientific theories, too, can have an unfalsifiable core. This is Imre Lakatos' model of scientific progress:
https://en.wikipedia.org/wiki/Imre_Lakatos#Research_programmes
Vague concepts lead to vague hypotheses ("research programmes"). Vague hypotheses work the same way vague concepts do.
Personality, thinking
Even human personality may be a vague concept.
The fewer situations you consider, the easier it is to understand what someone's personality is. But if you consider more and more situations, it may turn out to be impossible to find a single connecting thread. Especially if someone lives long enough, encounters different enough communities, and obtains different enough opportunities.
And even thinking itself may be a vague concept: before you realize some key connection your mind might be wandering between associations that have many overlaps, but don't form a coherent seamless picture.
Word meanings
Most words (that is, their meanings) are vague concepts.
For example, the word "beast" may mean a fantastic creature, someone who has lost their humanity, someone who's shockingly good at something, and so on.
And of course there's the infamous/comical example with the word "shit" (ISMO skit), which changes its meaning in absolutely wild ways.
Exploring vague concepts
How do you understand a vague concept? You can try to memorize all the contexts (that you know of) in which it's used. Or you can learn to infer its meaning in new contexts, and learn to create new contexts for this concept yourself.
But that would require a different type of generalization, one not based on definitions. I think it's important to explore what this "different type of generalization" may be.
What are vague concepts made of? My ideas:
Internal structure, "gradient"
Here's an observation: when the meaning of a word changes, this change doesn't come without a cost. (I'm not talking about words with completely separate meanings.) It comes with a change of emotion and emphasis. The point is that different meanings of a word have some internal relationships, "positions" relative to each other. If you could evaluate the "internal" meaning of a word at least in some simplified way, you would notice the relationships. Maybe those relationships allow us to really "understand" words and context.
For example, take the word "beast":
If you're interested in deep internal qualities of people/objects, the negative meaning may be the "main" one for you, even if you dislike it. This suggests an idea: if you have a certain bias or goal, internal relationships between meanings may become simpler. To understand the change of meaning of a word you only need to understand how your emotions changed and how the "alignment" between the word and your goal changed.
If you're interested in irony, the positive meaning of the word "shit" (e.g. "This is the shit") may be the main one for you.
Emotions, biases and goals give the meanings of a word some "gradient" or "flavors". I recommend checking out "Propagating Facts into Aesthetics" by Raemon; it may be very relevant.
Meta contrasts
My conclusion is that some vague concepts may create "meta contrasts", "contrasts of contrasts": they contrast the real world and a counterfactual world, but also contrast multiple possible emotions/goals related to those worlds. For example, when something bad happens and you think "this is a bad day" ("bad day" is a vague concept, there's almost no real boundary between good and bad days), you put the bad outcome in the context of your possible goals and emotions related to this day. And at this point it doesn't really matter what the bad thing was; what matters is how your emotions and goals were affected. The same goes for your overall "health". It's not about specific medical conditions, it's about your feelings and goals being (un)affected. And when you call someone a "beast" (in the positive sense) you talk about your emotions and compare the referent to other people.
However,
so, "physical" and "emotional" ("gradient") aspects of a meaning may get really entangled. I think this is what makes defining some vague concepts really impossible.
Toy example
I also tried to explore vague concepts in more detail on a specific toy example. I know that my post turned out to be very unclear, but there you can find:
I think the toy example in the linked post could help to come up with some statistics for "vague concepts".