I would normally publish this on my blog; however, I thought that LessWrong people might be interested in this topic. This sequence is about my experience creating a logical language, Sekko (still a work-in-progress).
What's a logical language?
Languages can be divided into two categories: natural languages, which arise...naturally, and constructed languages, which are designed. Constructed languages in turn fall into two categories: artlangs, which are designed as works of art or as supplements to a fictional world (think Quenya or Klingon), and engelangs (engineered languages), which are designed to fulfill a specific purpose.
Loglangs (logical languages) are a subset of engelangs. Not all engelangs are loglangs -- Toki Pona is a non-loglang engelang.
Despite the name, what constitutes a "loglang" is rather undefined. It's more of a "vibe" than anything. Still, I'll try to lay out various properties that loglangs have:
- Adherence to a logic system. Most logical languages have second-order logic. Toaq and Lojban have plural logic; Eberban has singular logic.
- Self-segregating morphology (SSM). SSM refers to a morphology (scheme for making word forms) such that word boundaries can be unambiguously determined. Several schemes exist: Lojban uses stress and consonant clusters, Toaq uses tone contours, and Eberban and Sekko use arbitrary phoneme classification. Sekko's SSM will be discussed in later posts; a toy illustration appears after this list.
- Elimination of non-coding elements. Many natural languages have features which add complication but do not encode information. Examples include German grammatical gender, Latin's five declensions or English number and person agreement (i.e. "I am" vs "You are" vs "He is"; "It is" vs "They are"). Loglangs aim to remove non-coding grammatical elements like these.
- Exceptionless grammar and syntactic unambiguity. Many natural languages have irregular forms; examples include English irregular verbs ("go/went") and irregular plurals ("mouse/mice"). Loglangs aim to have no such exceptions, and their syntax is designed so that every sentence has exactly one parse.
- Machine parsability. Essentially all loglangs can be machine-parsed to check for syntactic correctness and for the correct scope of grammatical structures. Most loglangs (Loglan, Lojban, Toaq, and Eberban) have Parsing Expression Grammar (PEG) parsers.
- Ergonomics and extensibility. Loglangs try to have as ergonomic a grammar as possible -- one that gives access to the most semantic space for the least amount of grammatical complexity. An example is the fact that in basically all loglangs, there are really only two parts of speech (aside from particles): predicates and arguments. Adjectives in English can simply be taken to be copular verbs (e.g. "X is beautiful").
- Audiovisual isomorphism (AVI). Loglangs will usually have a script or writing system (usually there is one based on the Latin script, and later a completely novel script is devised) that has audiovisual isomorphism. What this means is that speech and writing correspond to each other precisely -- the text must encode speech exactly (i.e. words are spelled as they are said). AVI only requires that the phonemic (significant or distinguished) speech features are encoded. For example, if your language does not distinguish between short and long vowels, it is unnecessary to have an encoding for them.
Phonemic differences are those for which a language has pairs of words differing only in that aspect. For example, English distinguishes between "pat" and "bat"; we may therefore conclude that English distinguishes voicing on the bilabial plosive (the p and b sounds). Mandarin does not -- all of its plosives are voiceless (rather, it distinguishes on aspiration, which English does not do). Another example is English "thin" versus "thing", which shows that English distinguishes between the alveolar nasal /n/ and the velar nasal /ŋ/.
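To make the SSM idea concrete, here is a toy illustration in Python. The final-vowel rule below is invented purely for demonstration -- it is not Sekko's (or Eberban's) actual scheme: assume word-internal vowels come from {i, u, e} and every word ends in exactly one final vowel from {a, o}, so a word boundary falls after each "a" or "o" and a spaceless phoneme stream segments unambiguously.

```python
# Toy self-segregating morphology (SSM), invented for this illustration only:
# word-internal vowels come from {i, u, e}; every word ends in exactly one
# final-class vowel from {a, o}. A word boundary therefore falls after each
# 'a' or 'o', so a spaceless phoneme stream can be segmented unambiguously.

FINAL_VOWELS = set("ao")

def segment(stream: str) -> list[str]:
    """Split a spaceless phoneme stream into words."""
    words, current = [], []
    for ch in stream:
        current.append(ch)
        if ch in FINAL_VOWELS:          # a final-class vowel closes a word
            words.append("".join(current))
            current = []
    if current:                         # leftover phonemes = malformed input
        raise ValueError(f"incomplete word: {''.join(current)!r}")
    return words

print(segment("kitasunelo"))  # -> ['kita', 'sunelo'], with no ambiguity
```

The forms "kita" and "sunelo" are nonsense words made up for the example; Lojban's and Toaq's real schemes achieve the same property with stress/consonant-cluster rules and tone contours respectively.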
Why logical languages?
Frankly, it was not the logic aspect of loglangs that attracted me to them, but the presence of parsers and exceptionless grammar. I'm interested in the "structure". I'm not a subscriber to the Sapir-Whorf hypothesis, but I did find that learning Lojban was much, much easier than learning a natural language.
Lojban, such as it is (it has many, many problems), is still much more regular than any natural language -- a common way to learn is to participate in conversations with a dictionary open in another tab, knowing only the grammar. The regular grammar means that even if you don't know what a word means, you know its syntactic role. You do not have to pay attention to little natural-language quirks, such as the "come"-"go" distinction in English (both meaning "to move/travel") or the "kureru"-"ageru" distinction in Japanese (both meaning "to give").
There is also syntactic ambiguity in a sentence such as "Do you want to drink tea or coffee?". The joke answer is to say "Yes", since it's ambiguous whether the question is a true-or-false question or a choice question. In Lojban (and other loglangs), there is no ambiguity, since the syntax of those two questions is different.
Parser: BPFK Lojban
TRUE OR FALSE
.i xu do djica lonu do pinxe lo tcati .a lo ckafi
"Is the following statement true: you desire the event of you drinking tea OR coffee?"
Possible answers:
- go'i -- "The previous statement is true."
- nago'i -- "The previous statement is false."

CHOICE
.i do djica lonu do pinxe lo tcati ji lo ckafi
"You desire the event of you drinking tea ??? coffee." (where ??? asks for a logical connective)
Possible answers:
- .e -- both
- .enai -- the former, but not the latter (tea only)
- na.e -- not the former, but the latter (coffee only)
- na.enai -- neither
- .a -- OR (one or the other, or both)
- .o -- XNOR (both, or neither)
- .onai -- XOR (one or the other)
I don't think any loglang is going to replace natlangs anytime soon -- this is just a hobby. It's very pleasing to speak a logical language, and I often wish that English had support for some of the constructs in the logical languages I speak (or, at least, know about).
Sekko, my loglang
I will be publishing documentation on my work-in-progress logical language Sekko in this sequence. All of the documentation published so far should be treated as provisional: it is likely that I'll make sweeping changes to one or more parts of the grammar (some of which hasn't been designed yet), and future posts will likely invalidate past posts. I plan on restructuring it once I've written something on each topic, and on publishing it with mdbook and GitHub Pages, similar to Eberban.
I have tried to write and annotate the documentation posts so that you need little background to understand them. I plan to split this initial documentation into a separate reference grammar and teaching course (and the latter may be further split depending on the background you already have). I have also annotated the documentation with analogies for readers who already know Lojban or Toaq. Please feel free to make suggestions on the design.
Spitballing here, but how about designing the language in tandem with an ML model for it? I see multiple benefits to that:
First is that current English language models spend an annoyingly large amount of power on reasoning about what specific words mean in context. For "I went to the store" and "I need to store my things", store is the same token in both, so the network needs to figure out what it actually means[1]. For a constructed language, that task can be made much easier.
English has way too many words to make each of them their own token, so language models preprocess texts by splitting them up into smaller units. For a logical language you can have significantly fewer tokens, and each token can be a unique word with a unique meaning[2]. With the proper morphology you also no longer need to tokenize spaces, which cuts down on the size of the input (and thus complexity).
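As a rough sketch of what that buys, here is a whole-word tokenizer under the same made-up final-vowel rule from the SSM sketch earlier; the three-word lexicon is hypothetical. Each ID corresponds to exactly one word, and no space token is ever needed:

```python
import re

# Hypothetical three-word lexicon: one ID per whole word, no subword pieces.
VOCAB = {"kita": 0, "sunelo": 1, "mopa": 2}

def encode(stream: str) -> list[int]:
    words = re.findall(r"[^ao]*[ao]", stream)   # split the spaceless stream
    return [VOCAB[w] for w in words]            # map each whole word to its ID

print(encode("kitasunelo"))  # -> [0, 1]
```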
Language models such as GPT-3 work by spitting out a value for each possible output token, representing the likelihood that it will be the next in sequence. For a half-written sentence in a logical language it will be possible to reliably filter out words that are known to be ungrammatical, which means the model doesn't have to learn all of that itself.
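A minimal sketch of that filtering, with a stand-in `lm_logits` function in place of a trained model and a toy `allowed_next_tokens` check in place of a real grammar (in practice you would derive the allowed set from the loglang's PEG, e.g. by asking the parser which token classes may follow the current prefix):

```python
import math, random

VOCAB = ["mi", "do", "pinxe", "tcati", "ckafi", "lo"]  # toy Lojban-ish vocabulary

def lm_logits(prefix):
    # Stand-in for a trained model: one unnormalized score per vocabulary item.
    random.seed(len(prefix))
    return [random.uniform(-1.0, 1.0) for _ in VOCAB]

def allowed_next_tokens(prefix):
    # Stand-in for a real grammar check. Toy rule: the argument marker "lo"
    # must be followed by a content word.
    if prefix and prefix[-1] == "lo":
        return {"tcati", "ckafi", "pinxe"}
    return set(VOCAB)

def sample_next(prefix):
    logits = lm_logits(prefix)
    allowed = allowed_next_tokens(prefix)
    # Mask ungrammatical continuations before normalizing, so the model never
    # has to learn the grammar it is being filtered by.
    masked = [l if tok in allowed else -math.inf for tok, l in zip(VOCAB, logits)]
    weights = [math.exp(l) for l in masked]   # exp(-inf) == 0.0
    return random.choices(VOCAB, weights=weights)[0]

print(sample_next(["mi", "pinxe", "lo"]))  # only a content word can follow "lo"
```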
The benefits of doing this would not accrue only to the ML model. You'd get a tool that's useful for language development, too:
Let's say you want to come up with a lexicon, and you have certain criteria like "two words that mean similar things should not sound similar, so as to make them easy to differentiate while speaking". Simply inspect the ML model and see which parts of the network are affected by the two tokens. The more similar that is, presumably the closer they are conceptually. You can then use that distance to programmatically generate the entire lexicon, using whatever criteria you want.
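One crude way to get at that distance, assuming the model exposes a token-embedding matrix (most transformer language models do): compare the embedding rows of the two tokens. Everything below -- the random matrix and the token indices -- is a stand-in for a real trained model:

```python
import numpy as np

def cosine_similarity(E, i, j):
    """Cosine similarity between the embedding rows of tokens i and j."""
    a, b = E[i], E[j]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-in for a trained model's embedding table: 1000 tokens x 64 dims.
rng = np.random.default_rng(0)
E = rng.normal(size=(1000, 64))

# If two tokens come out "too similar" conceptually, the lexicon generator
# could assign them deliberately dissimilar word forms.
print(cosine_similarity(E, 3, 7))
```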
If the language has features to construct sentences that would be complicated for an English-speaker to think, the model might start outputting those. By human-guided use of the model itself for creating a text corpus, it might be possible to converge to interestingly novel and alien thoughts and concepts.
[1] Typically the input text is pre-processed with a secondary model (such as BERT), which somewhat improves the situation.
[2] Except proper nouns, I suppose; those you'd still need to split.