no synonyms
[...]
Use compound words.
These two goals conflict. When compounding is common, there will inevitably be multiple reasonable ways to describe the same concept as a compound word. I think you probably want flexible compounding more than a lack of synonyms.
Thx.
Yep there are many trade-offs between criteria.
Btw, totally unrelatedly:
I think that in the past, in your work on abstraction, you probably lost a decent amount of time from not properly tracking the distinction between (what I call) objects and concepts. I think you likely at least mostly recovered from this, but in case you're not completely sure you've fully done so, you might want to check out the linked section. (I think it makes sense to start by understanding how we (learn to) model objects and to look at concepts only later, since minds first learn to model objects and only later carve up concepts as generalizations over similarity clusters of objects.)
Tbc, there's other important stuff besides objects and concepts, like relations and attributes. I currently find my ontology here useful for separating subproblems, so if you're interested you might read more of the linked post (if you haven't done so yet), even though you're surely already familiar with knowledge representation. But maybe you already track all that.
Criteria for creating a good language
Phonology
Vocabulary
Easy to learn grammar
Avoiding misunderstandings
Making rational thinking easier / Making irrational thinking harder
Allowing high expressivity / Allowing clear, precise thinking
Not needing to include much more information than is relevant
Conciseness: Be able to communicate quickly without needing too many syllables
Other criteria
Proposing an approach for creating a language
TLDR: Translating sentences into formal representations is probably often useful practice for getting a sense of how to design a good language.[1]
(The following is sorta low-quality. Unless you're specifically interested in designing a language, I basically recommend stopping here.[2])
(Thanks to claude-3.7 for helping me phrase parts of this section.)
The "starting formal" approach
One approach for designing a good grammar is to start with a formal system in which everything that can meaningfully be expressed in natural languages can be expressed, then to practice expressing lots of sentences in that framework, and then to add parsing rules that make expressing statements more convenient.
Advantages of the approach
Unambiguity: Logical statements in formal systems have precise meanings. This clarity can be preserved while adding convenience-oriented parsing rules.
Enhanced inference capabilities: Maintaining proximity to formal logical representations may facilitate easier inference, potentially enabling more effective recognition of conceptual connections. Furthermore, it might make it easier to see when an argument actually supports a position, versus when it doesn't directly support it or merely bears a vague resemblance to doing so.
Strengthened argumentation: The formal structure could enable more proof-like chains of reasoning, potentially allowing for more precise articulation of complex positions. This enhanced concreteness might also facilitate pinpointing specific flaws in arguments by making each logical step explicit and examinable. (Though the extent of this benefit remains speculative.)
Canonical expression: The initial formal system largely provides a standard way to express any given concept, eliminating the need to process multiple equivalent formulations. This property might make cognitive processing more efficient, in a similar way to how simplifying statements in automated theorem provers makes proof search more efficient.
(Though this canonicity has limitations. Introducing synonymous predicates can undermine it, and predicate logic itself offers alternative expressions (e.g., "NOT EXISTS" vs. "FORALL NOT"). Establishing conventions for preferred forms can help address this challenge; a toy sketch of such a convention follows this list.)
Promotion of precision: At least in my experience, working within formal systems encourages precise thinking. The process of trying to express something formally often highlights vague concepts, motivating the use of more concrete and well-defined predicates. Among other things, this can be useful to avoid Motte-and-Bailey fallacies.
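As promised above, here is a minimal Python sketch of such a normalization convention. The nested-tuple formula representation and the function name are my own hypothetical choices, not part of any formal system proposed here; the point is just that fixing one preferred quantifier form makes equivalent formulations collapse into a single canonical expression.

def normalize(f):
    """Rewrite a formula into a canonical form: negation pushed inward
    through quantifiers, double negations removed."""
    op = f[0]
    if op == "NOT":
        inner = f[1]
        if inner[0] == "EXISTS":   # NOT EXISTS x: P  ->  FORALL x: NOT P
            return ("FORALL", inner[1], normalize(("NOT", inner[2])))
        if inner[0] == "FORALL":   # NOT FORALL x: P  ->  EXISTS x: NOT P
            return ("EXISTS", inner[1], normalize(("NOT", inner[2])))
        if inner[0] == "NOT":      # NOT NOT P  ->  P
            return normalize(inner[1])
        return ("NOT", normalize(inner))
    if op in ("EXISTS", "FORALL"):
        return (op, f[1], normalize(f[2]))
    return f  # atomic predicates are left unchanged

# "NOT EXISTS x: P(x)" and "FORALL x: NOT P(x)" collapse to the same form:
assert normalize(("NOT", ("EXISTS", "x", ("P", "x")))) == ("FORALL", "x", ("NOT", ("P", "x")))
assert normalize(("FORALL", "x", ("NOT", ("P", "x")))) == ("FORALL", "x", ("NOT", ("P", "x")))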
Disadvantages of the approach
Foundation-dependent quality: The effectiveness of this approach hinges entirely on the initial formal representation chosen. An inadequate foundational system will likely propagate its limitations throughout the language.
Missing benefits from language evolution: Natural languages evolve organically over centuries, developing solutions to communicative challenges through distributed experimentation. A designed system bypasses this evolutionary process and may encounter unforeseen problems that natural selection would have addressed.
Potential learning barriers: There's a risk that the resulting grammar might present increased acquisition challenges for young language learners. The formal underpinnings could create cognitive hurdles not present in naturally evolved languages, though I'm not sure about this.
Defining shortcodes on top of our formal language
This section directly builds upon the formal statement representation presented in my post "Introduction to Representing Sentences as Logical Statements".
I haven't practiced expressing statements in the formal logic system that much yet, and I don't know yet which parsing rules might be most needed, but here are two examples of parsing rules that seem likely to be useful.
Agentic causes
Most English sentences that express events contain an agent who can usually be seen as causing the event. Consider the sentence "Alice gave the pen to Bob". We could express this in our system as:
{[t1,t2]: giving(Alice, Pen, Bob)} CAUSES {[t2, t2+delta]: holding(Bob, Pen)}
Where "giving(x1,x2,x3)" could be more precisely expressed as something like "x1 holding x2 in hand and moving it in the direction of the location of x3".
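To make the shape of such statements concrete, here is a minimal Python sketch of the data involved. The names TimedStatement and Causes are my own illustration, not notation from the formal system itself:

from dataclasses import dataclass

@dataclass
class TimedStatement:
    t_start: str    # e.g. "t1"
    t_end: str      # e.g. "t2"
    predicate: str  # e.g. "giving"
    args: tuple     # e.g. ("Alice", "Pen", "Bob")

@dataclass
class Causes:
    cause: TimedStatement
    effect: TimedStatement

# {[t1,t2]: giving(Alice, Pen, Bob)} CAUSES {[t2, t2+delta]: holding(Bob, Pen)}
pen_example = Causes(
    cause=TimedStatement("t1", "t2", "giving", ("Alice", "Pen", "Bob")),
    effect=TimedStatement("t2", "t2+delta", "holding", ("Bob", "Pen")),
)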
However, we might also want to be able to just say "Alice caused Bob to hold the pen" without needing to specify the method. The need for being able to treat agents as causes can be seen more directly in examples like "The chef caused the soup to taste fantastic", "The parent made the child do his homework", and "The doctor healed the patient".
The "CAUSES" connective in our system, however, only connects statements. To nevertheless capture agents as causes naturally, I propose that when we want to express
"A causes X"
, where "A" is an agent and "X" is a statement, this can usually be interpreted as "The fact that A was trying to achieve X and A was competent enough to achieve X, caused X".[3]Thus, I propose adding a keyword "causes" with the parsing rule:
"A causes X"
gets parsed into"{try(A, X) AND can(A, X)} CAUSES X"
Having added this to our language, we can now express some statements more concisely, e.g.:
"The doctor healed the patient."
Doctor causes {health(Patient, high)}
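Here is a minimal Python sketch of this shortcode as a purely string-level expansion; the helper name expand_causes is my own hypothetical choice:

def expand_causes(agent: str, statement: str) -> str:
    """Expand the shortcode "A causes X" into its full form."""
    return (f"{{try({agent}, {statement}) AND can({agent}, {statement})}}"
            f" CAUSES {statement}")

print(expand_causes("Doctor", "{health(Patient, high)}"))
# -> {try(Doctor, {health(Patient, high)}) AND can(Doctor, {health(Patient, high)})} CAUSES {health(Patient, high)}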
State changes
The sentence "Ben became rich in 2010" not only expresses that Ben is rich at the end of 2010 (and likely onward), but also that he wasn't rich before.
Likewise, "It started to rain at time t" doesn't just convey the same information as "It rained at t", but also that "It didn't rain for some period before time t".
Since it is annoying to write statements like
"{[t1,t2]: NOT {rains(Location)}} AND {[t2, t3]: rains(Location)}"
in full, we introduce the keyword "BECOME", which has the following parsing rule: "BECOME(X, t1=?, t2=?, t3=?)" gets parsed into "{[t1,t2]: NOT X[t]} AND {[t2, t3]: X[t]}"[4]
The "=?" means that those are optional parameters, where by default existential quantification over the variables is used. (So
"BECOME(X)"
returns"EXISTS t1, t2, t3: {{[t1,t2]: NOT X[t]} AND {[t2, t3]: X[t]}}"
.)[5]Now we can e.g. express "Mary fell asleep at 11pm yesterday" as:
BECOME({lambda t. sleep(Mary, t)}, t2="2025-03-29 11pm CET")[6]
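Here is a minimal Python sketch of this parsing rule as a purely syntactic macro. The helper name expand_become is hypothetical, and I read the rule as existentially quantifying only over the omitted time parameters:

def expand_become(x: str, t1=None, t2=None, t3=None) -> str:
    """Expand "BECOME(X, t1=?, t2=?, t3=?)" into its quantified form."""
    times = {"t1": t1, "t2": t2, "t3": t3}
    # Use the given constant (quoted) if a parameter was supplied,
    # otherwise keep the variable name so it can be quantified over.
    b = {k: (f'"{v}"' if v is not None else k) for k, v in times.items()}
    body = (f"{{[{b['t1']},{b['t2']}]: NOT {x}[t]}} AND "
            f"{{[{b['t2']},{b['t3']}]: {x}[t]}}")
    free = [k for k, v in times.items() if v is None]
    return f"EXISTS {', '.join(free)}: {{{body}}}" if free else body

print(expand_become("{lambda t. sleep(Mary, t)}", t2="2025-03-29 11pm CET"))
# -> EXISTS t1, t3: {{[t1,"2025-03-29 11pm CET"]: NOT {lambda t. sleep(Mary, t)}[t]} AND {["2025-03-29 11pm CET",t3]: {lambda t. sleep(Mary, t)}[t]}}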
Actually, most English event-expressing sentences implicitly describe a state change, and when the information that the previous state was different is worth conveying, the "BECOME" keyword can be used there as well.
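For instance, if it matters to convey that Bob wasn't holding the pen before, the effect side of the pen example above could (in my own illustrative rendering) also be written as:
BECOME({lambda t. holding(Bob, Pen, t)})
which by default existentially quantifies over the times at which the change happened.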
Concluding thoughts on designing a good language
The two keywords with corresponding parsing rules defined in this work ("causes" and "BECOME") represent merely the initial steps in what would be a comprehensive language development process using the "starting formal" approach. The transition from a bare formal logical system to a well-usable grammar would require addressing numerous additional challenges.
The proposed approach seems useful beyond the construction of a good grammar. By reducing abstract sentences to more concrete low-level statements, we can more clearly see the underlying meaning, from which we may carve natural ontologies with clear concepts. The methodical construction of a language with explicit design principles also provides a unique investigative lens into the nature of language itself, potentially revealing which conceptual primitives truly form the foundation of effective communication.
The language we are striving to create shares the ambitious vision of Leibniz's characteristica universalis (Leibniz, 1666)—a universal symbolic language capable of expressing all conceptual thought with mathematical precision and clarity. This path remains lengthy and complex, requiring substantial intellectual discipline throughout the process, particularly in resisting the temptation to introduce convenient but imprecise abstractions prematurely. While such a complete language may remain an aspirational goal, even partial progress toward this ideal can yield valuable insights for linguistics, cognitive science, and the philosophy of language.
Actually I'm proposing something more specific than this, but I'm not really that confident in the more specific version.
I wrote this for my (half-assed) Bachelor's thesis, and given that I wrote it I thought I might as well post it.
Note that this doesn't let us express agents as causes of states that were produced unintentionally. We cannot express "Mary broke the vase" as "Mary caused the vase to be broken" (assuming she didn't deliberately break it). I think this is a feature, not a bug.
In case you're wondering about the X[t]: remember that "[t2, t3]: X" stands for "(t2 ≤ t ≤ t3) ⇒ X". Given this reminder, it probably becomes clear that X[t] stands for the statement X evaluated at time t, so the statement X must actually take a time as input. Read further to see an example.
We might want a BECOME keyword that has more freedom for specifying times, e.g. for describing that something changed during some period without specifying exactly when. Perhaps we could define a probability distribution over t2. Overall, I'm not sure whether this definition of BECOME is optimal.
Actually, I perhaps ought to have defined the BECOME keyword in a way that doesn't require writing the "lambda t" explicitly.