I would normally publish this on my blog; however, I thought that LessWrong people might be interested in this topic. This sequence is about my experience creating a logical language, Sekko (still a work-in-progress).
What's a logical language?
Languages can be divided into two categories: natural languages, which arise...naturally, and constructed languages, which are designed. Constructed languages in turn fall into two categories: artlangs, which are designed as works of art or as supplements to a fictional world (think Quenya or Klingon), and engelangs (engineered languages), which are designed to fulfill a specific purpose.
Loglangs (logical languages) are a subset of engelangs. Not all engelangs are loglangs -- Toki Pona is a non-loglang engelang.
Despite the name, what constitutes a "loglang" is rather undefined. It's more of a "vibe" than anything. Still, I'll try to lay out various properties that loglangs have:
- Adherence to a logic system. Most logical languages have second-order logic. Toaq and Lojban have plural logic; Eberban has singular logic.
- Self-segregating morphology (SSM). SSM refers to a morphology (scheme for making word forms) such that word boundaries can be unambiguously determined. Several schemes exist: Lojban uses stress and consonant clusters, Toaq uses tone contours, and Eberban and Sekko use arbitrary phoneme classification. Sekko's SSM will be discussed in later posts; a toy illustration appears after this list.
- Elimination of non-coding elements. Many natural languages have features which add complication but do not encode information. Examples include German grammatical gender, Latin's five declensions or English number and person agreement (i.e. "I am" vs "You are" vs "He is"; "It is" vs "They are"). Loglangs aim to remove non-coding grammatical elements like these.
- Exceptionless grammar and syntactic unambiguity. Many natural languages have irregular forms; examples include English irregular verbs ("go/went") and irregular plurals ("mouse/mice"). Loglangs aim to have no such exceptions, and their syntax is designed so that every sentence has exactly one parse.
- Machine parsability. Essentially all loglangs can be machine-parsed to check for syntactic correctness and for the correct scope of grammatical structures. Most loglangs (Loglan, Lojban, Toaq, and Eberban) have Parsing Expression Grammar (PEG) parsers.
- Ergonomics and extensibility. Loglangs try to have as ergonomic a grammar as possible -- one that gives access to the most semantic space for the least amount of grammatical complexity. An example is the fact that in basically all loglangs, there are really only two parts of speech (aside from particles): predicates and arguments. Adjectives in English can simply be taken to be copular verbs (e.g. "X is beautiful").
- Audiovisual isomorphism (AVI). Loglangs will usually have a script or writing system (usually there is one based on the Latin script, and later a completely novel script is devised) that has audiovisual isomorphism. What this means is that speech and writing correspond to each other precisely -- the text must encode speech exactly (i.e. words are spelled as they are said). AVI only requires that the phonemic (significant or distinguished) speech features are encoded. For example, if your language does not distinguish between short and long vowels, it is unnecessary to have an encoding for them.
Phonemic differences are those for which a language has pairs of words differing only in that aspect. For example, English distinguishes between "pat" and "bat"; we may therefore conclude that English distinguishes voicing on the bilabial plosive (the p and b sounds). Mandarin does not -- all of its plosives are voiceless (rather, it distinguishes on aspiration, which English does not do). Another example is English "thin" versus "thing", which shows that English distinguishes between the alveolar nasal /n/ and the velar nasal /ŋ/.
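To make the SSM idea concrete, here is a toy illustration in Python. The final-vowel rule below is invented purely for demonstration -- it is not Sekko's (or Eberban's) actual scheme: assume word-internal vowels come from {i, u, e} and every word ends in exactly one final vowel from {a, o}, so a word boundary falls after each "a" or "o" and a spaceless phoneme stream segments unambiguously.

```python
# Toy self-segregating morphology (SSM), invented for this illustration only:
# word-internal vowels come from {i, u, e}; every word ends in exactly one
# final-class vowel from {a, o}. A word boundary therefore falls after each
# 'a' or 'o', so a spaceless phoneme stream can be segmented unambiguously.

FINAL_VOWELS = set("ao")

def segment(stream: str) -> list[str]:
    """Split a spaceless phoneme stream into words."""
    words, current = [], []
    for ch in stream:
        current.append(ch)
        if ch in FINAL_VOWELS:          # a final-class vowel closes a word
            words.append("".join(current))
            current = []
    if current:                         # leftover phonemes = malformed input
        raise ValueError(f"incomplete word: {''.join(current)!r}")
    return words

print(segment("kitasunelo"))  # -> ['kita', 'sunelo'], with no ambiguity
```

The forms "kita" and "sunelo" are nonsense words made up for the example; Lojban's and Toaq's real schemes achieve the same property with stress/consonant-cluster rules and tone contours respectively.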
Why logical languages?
Frankly, it was not the logic aspect of loglangs that attracted me to them, but the presence of parsers and exceptionless grammar. I'm interested in the "structure". I'm not a subscriber to the Sapir-Whorf hypothesis, but I did find that learning Lojban was much, much easier than learning a natural language.
Lojban, such as it is (it has many, many problems), is still much more regular than any natural language -- a common way to learn is to participate in conversations with a dictionary open in another tab, knowing only the grammar. The regular grammar means that even if you don't know what a word means, you know its syntactic role. You do not have to pay attention to little natural-language quirks, such as the "come"-"go" distinction in English (both meaning "to move/travel") or the "kureru"-"ageru" distinction in Japanese (both meaning "to give").
There is also syntactic ambiguity in a sentence such as "Do you want to drink tea or coffee?". The joke answer is to say "Yes", since it's ambiguous whether the question is a true-or-false question or a choice question. In Lojban (and other loglangs), there is no ambiguity, since the syntax of those two questions is different.
Parser: BPFK Lojban
TRUE OR FALSE
.i xu do djica lonu do pinxe lo tcati .a lo ckafi
"Is the following statement true: you desire the event of you drinking tea OR coffee?"
Possible answers:
- go'i -- "The previous statement is true."
- nago'i -- "The previous statement is false."

CHOICE
.i do djica lonu do pinxe lo tcati ji lo ckafi
"You desire the event of you drinking tea ??? coffee." (where ??? asks for a logical connective)
Possible answers:
- .e -- both
- .enai -- the former, but not the latter (tea only)
- na.e -- not the former, but the latter (coffee only)
- na.enai -- neither
- .a -- OR (one or the other, or both)
- .o -- XNOR (both, or neither)
- .onai -- XOR (one or the other)
I don't think any loglang is going to replace natlangs anytime soon -- this is just a hobby. It's very pleasing to speak a logical language, and I often wish that English had support for some of the constructs in the logical languages I speak (or, at least, know about).
Sekko, my loglang
I will be publishing documentation on my work-in-progress logical language Sekko in this sequence. All of the documentation published so far should be treated as provisional: it is likely that I'll make sweeping changes to one or more parts of the grammar (some of which hasn't been designed yet), and future posts will likely invalidate past posts. I plan on restructuring it once I've written something on each topic, and on publishing it with mdbook and GitHub Pages, similar to Eberban.
I have tried to write and annotate the documentation posts so that you need little background to understand them. I plan to split this initial documentation into a separate reference grammar and teaching course (and the latter may be further split depending on the background you already have). I have also annotated the documentation with analogies for readers who already know Lojban or Toaq. Please feel free to make suggestions on the design.
Spitballing here, but how about designing the language in tandem with an ML model for it? I see multiple benefits to that:
First is that current English language models spend an annoyingly large amount of power on reasoning about what specific words mean in context. For "I went to the store" and "I need to store my things", store is the same token in both, so the network needs to figure out what it actually means[1]. For a constructed language, that task can be made much easier.
English has way too many words to make each of them their own token, so language models preprocess texts by splitting them up into smaller units. For a logical language you can have significantly fewer tokens, and each token can be a unique word with a unique meaning[2]. With the proper morphology you also no longer need to tokenize spaces, which cuts down on the size of the input (and thus complexity).
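As a rough sketch of what that buys, here is a whole-word tokenizer under the same made-up final-vowel rule from the SSM sketch earlier; the three-word lexicon is hypothetical. Each ID corresponds to exactly one word, and no space token is ever needed:

```python
import re

# Hypothetical three-word lexicon: one ID per whole word, no subword pieces.
VOCAB = {"kita": 0, "sunelo": 1, "mopa": 2}

def encode(stream: str) -> list[int]:
    words = re.findall(r"[^ao]*[ao]", stream)   # split the spaceless stream
    return [VOCAB[w] for w in words]            # map each whole word to its ID

print(encode("kitasunelo"))  # -> [0, 1]
```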
Language models such as GPT-3 work by spitting out a value for each possible output token, representing the likelihood that it will be the next in sequence. For a half-written sentence in a logical language it will be possible to reliably filter out words that are known to be ungrammatical, which means the model doesn't have to learn all of that itself.
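A minimal sketch of that filtering, with a stand-in `lm_logits` function in place of a trained model and a toy `allowed_next_tokens` check in place of a real grammar (in practice you would derive the allowed set from the loglang's PEG, e.g. by asking the parser which token classes may follow the current prefix):

```python
import math, random

VOCAB = ["mi", "do", "pinxe", "tcati", "ckafi", "lo"]  # toy Lojban-ish vocabulary

def lm_logits(prefix):
    # Stand-in for a trained model: one unnormalized score per vocabulary item.
    random.seed(len(prefix))
    return [random.uniform(-1.0, 1.0) for _ in VOCAB]

def allowed_next_tokens(prefix):
    # Stand-in for a real grammar check. Toy rule: the argument marker "lo"
    # must be followed by a content word.
    if prefix and prefix[-1] == "lo":
        return {"tcati", "ckafi", "pinxe"}
    return set(VOCAB)

def sample_next(prefix):
    logits = lm_logits(prefix)
    allowed = allowed_next_tokens(prefix)
    # Mask ungrammatical continuations before normalizing, so the model never
    # has to learn the grammar it is being filtered by.
    masked = [l if tok in allowed else -math.inf for tok, l in zip(VOCAB, logits)]
    weights = [math.exp(l) for l in masked]   # exp(-inf) == 0.0
    return random.choices(VOCAB, weights=weights)[0]

print(sample_next(["mi", "pinxe", "lo"]))  # only a content word can follow "lo"
```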
The benefits of doing this would not accrue only to the ML model. You'd get a tool that's useful for language development, too:
Let's say you want to come up with a lexicon, and you have certain criteria like "two words that mean similar things should not sound similar, so as to make them easy to differentiate while speaking". Simply inspect the ML model and see which parts of the network are affected by the two tokens. The more similar that is, presumably the closer they are conceptually. You can then use that distance to programmatically generate the entire lexicon, using whatever criteria you want.
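One crude way to get at that distance, assuming the model exposes a token-embedding matrix (most transformer language models do): compare the embedding rows of the two tokens. Everything below -- the random matrix and the token indices -- is a stand-in for a real trained model:

```python
import numpy as np

def cosine_similarity(E, i, j):
    """Cosine similarity between the embedding rows of tokens i and j."""
    a, b = E[i], E[j]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-in for a trained model's embedding table: 1000 tokens x 64 dims.
rng = np.random.default_rng(0)
E = rng.normal(size=(1000, 64))

# If two tokens come out "too similar" conceptually, the lexicon generator
# could assign them deliberately dissimilar word forms.
print(cosine_similarity(E, 3, 7))
```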
If the language has features to construct sentences that would be complicated for an English-speaker to think, the model might start outputting those. By human-guided use of the model itself for creating a text corpus, it might be possible to converge to interestingly novel and alien thoughts and concepts.
[1] Typically the input text is pre-processed with a secondary model (such as BERT), which somewhat improves the situation.
[2] Except proper nouns, I suppose; those you'd still need to split.