I wanted to figure out how to properly write Russian and Ukrainian words using the Latin alphabet, but it turns out there are a dozen different standards for Russian and half a dozen different standards for Ukrainian.

My original plan was to post two short algorithms in pseudocode, along with some commentary. I guess it will be only the commentary instead. I hope some of you may find some information here interesting anyway.

*

Disclaimer: I am not a native speaker of either of these languages. This article may contain mistakes, and I would be thankful if you pointed them out in the comments. I learned Russian at school, decades ago; only a rusty passive knowledge remains. My knowledge of Ukrainian is limited to the first two lessons on Duolingo. (My native language is Slovak, which is also a Slavic language, but written in Latin script.)

*

The basic strategic choice for conversion between different scripts is the following: would you rather preserve the minutiae of the written version (and sacrifice some information about pronunciation, if necessary), or would you rather preserve the pronunciation (and mostly throw away the original written form)?

As a specific example, consider the Russian word for milk: "молоко". In written form, all three of its vowels are the same. When pronounced, the stress is on the last syllable, and a stressed "о" is generally pronounced differently from an unstressed "о". The entire word sounds kinda like "mah luck core"; the first two vowel sounds are the same, and the third one is different.

A system trying to preserve the written form might write this word as "moloko", while a system trying to preserve pronunciation might write it as "malako" instead.
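To make the difference concrete, here is a toy sketch of both approaches for this one word. The table covers only the four letters of "молоко", and the rule "an unstressed о is written as a" is my simplification for illustration; real Russian vowel reduction is more nuanced. Note that the second function needs the position of the stress as an extra input, because the spelling does not carry it.

```python
# Toy illustration only: the table covers just the four letters of "молоко",
# and the vowel rule is a deliberate oversimplification of Russian vowel reduction.
TABLE = {"м": "m", "о": "o", "л": "l", "к": "k"}


def transliterate(word):
    """Map each Cyrillic letter to Latin, ignoring pronunciation."""
    return "".join(TABLE[ch] for ch in word)


def transcribe(word, stressed_index):
    """Like transliterate, but write an unstressed "о" as "a".

    stressed_index is the position of the stressed vowel in the word;
    the caller has to know it, because the spelling does not show it.
    """
    out = []
    for i, ch in enumerate(word):
        if ch == "о" and i != stressed_index:
            out.append("a")
        else:
            out.append(TABLE[ch])
    return "".join(out)


print(transliterate("молоко"))  # moloko
print(transcribe("молоко", 5))  # malako
```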

(This is not unique to Cyrillic, by the way. You face a similar dilemma e.g. when trying to write the Japanese syllables "た ち つ て と". Would you rather preserve the sound, and write them as "ta, chi, tsu, te, to", or follow the internal logic of hiragana, which insists that this is the same consonant, and write them as "ta, ti, tu, te, to"?)

Both options have their advantages and disadvantages. It also depends on the audience. People who do not speak the language and do not plan to learn it have no reason to care about its orthography; they just want to pronounce the weird scribbles. On the other hand, people who are already used to reading that language may feel very uncomfortable seeing it written in a way that violates everything they learned about its orthography.

(To explain this feeling to a native English speaker, consider how you feel about various proposals to simplify English orthography. But from the technical perspective, those are merely conversions from Latin, to sound, to Latin again. You are doing an analogous thing when you convert from some other script, to sound, to Latin.)

*

The strategy of preserving the spelling is called transliteration. You ignore the pronunciation completely and mostly just create a table of how the characters from the source script are converted to characters (or sequences of characters) in the target script; bonus points if you can afterwards unambiguously convert them back.

The Latin alphabet was designed for the Latin language of ancient Rome, and is a less perfect fit for some other languages that use it, which often need to express more sounds than Latin had. In general, there are two approaches: the extra sounds can be expressed by sequences of characters (such as "sh" in English), or the alphabet can be extended using accent marks (such as "š", "ś", "ş", or "ș" in different languages). When transliterating Cyrillic, the latter approach is closer to the 1:1 ideal. (Note that there are Slavic languages which already use Latin characters, so you can use their conventions instead of designing a new one. They often disagree with each other, though.)

Advantages of transliteration:

  • can be implemented by a very simple algorithm;
  • you can convert words without knowing their pronunciation;
  • people who spent years learning the orthography of the language will not feel like you are disrespecting their great sacrifice. (Probably the most important, socially!)

Disadvantages of transliteration:

  • you need to learn separately how the words are actually pronounced, otherwise you may not recognize them when watching TV.

This still leaves a lot of room for bikeshedding, so we have several competing standards. (Relevant Wikipedia pages: 1, 2, 3.) Here are the Russian and Ukrainian alphabets, side by side, with a Latin character where the transliteration is straightforward, or an asterisk where it requires further commentary:

ru:    А Б В Г   Д Е Ё   Ж З И     Й К Л М Н О П Р С Т У Ф Х Ц Ч Ш Щ Ъ Ы Ь Э Ю Я
       а б в г   д е ё   ж з и     й к л м н о п р с т у ф х ц ч ш щ ъ ы ь э ю я
uk:    А Б В Г Ґ Д Е   Є Ж З И І Ї Й К Л М Н О П Р С Т У Ф Х Ц Ч Ш Щ     Ь   Ю Я ʼ
       а б в г ґ д е   є ж з и і ї й к л м н о п р с т у ф х ц ч ш щ     ь   ю я ʼ
Latin: a b v * g d e * * * z * * * * k l m n o p r s t u f * * * * * * y * * * * *

(Yes, there is a controversy even about whether the apostrophe should be transliterated to Latin as an apostrophe, or as a quote mark! Nothing is ever simple.)

Well, half of the transliteration is unambiguous, that is a good start.

A part of the complication is that different languages use Cyrillic slightly differently. Thus, tables for Russian only, or Ukrainian only, would be simpler.

  • Russian "г" = Ukrainian "ґ" = Latin "g".
  • Ukrainian "г" (a sound that does not exist in standard Russian) = Latin "h".
  • Russian "е" = Ukrainian "є". Ukrainian "е" = Russian "э".
  • Russian "и" = Ukrainian "і". Ukrainian "и" = Russian "ы".
  • Russian "ъ" is written as an apostrophe in Ukrainian.

The table would be slightly simpler if we rearranged it accordingly.
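As a rough sketch of what "rearranging" could mean: give each language its own small table that maps its letters to a shared label, so that the same Cyrillic character can land in different slots depending on the language. The label names (G, H, JE, and so on) are made up for this illustration and are not taken from any standard; only the equivalences themselves come from the list above.

```python
# Per-language tables for the letters that differ between Russian and Ukrainian.
# The labels on the right are hypothetical slot names, not a transliteration standard.
SHARED = {
    "ru": {"г": "G", "е": "JE", "э": "E", "и": "I", "ы": "Y", "ъ": "HARD"},
    "uk": {"ґ": "G", "г": "H", "є": "JE", "е": "E", "і": "I", "и": "Y", "ʼ": "HARD"},
}


def shared_labels(word, lang):
    """Replace language-specific letters with shared labels; keep the rest as-is."""
    table = SHARED[lang]
    return [table.get(ch, ch) for ch in word.lower()]


# The same written letter ends up in different slots depending on the language:
print(shared_labels("и", "ru"))  # ['I']
print(shared_labels("и", "uk"))  # ['Y']
```

The point of the design is that any Latin convention can then be attached to the labels once, instead of being written out separately for each language.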

Another part is that languages that use Latin disagree about the proper way to write the sound "y" (as in "yes"). Many languages that use Latin would write it "j", but many other languages use "j" for very different sounds. English uses "y"; and other languages at least use "y" for similar sounds, so good arguments can be made in support of either. One could also argue that "y" is just a shorter version of "i" (as in "hit"). Thus, for a completely unambiguous Cyrillic letter "й" we now have three proposed equivalents in Latin.

This complicates transliteration more than you might expect, because the "й" sound has a special role in some Slavic languages. It can result in palatalization of a preceding consonant in the word.

Not sure I can explain it in text to a native English speaker what palatalization is; you would need to hear actual examples. The idea is that in Slavic languages some consonants have two pronunciations: the normal one (called "hard"), and one with your tongue moved closer to the roof of your mouth (called "soft"). I am not a linguist, but I suspect that the "soft" form has historically evolved from the "hard" form followed by the "й" sound. This is reflected in Cyrillic by some vowels having an alternate form where they are preceded by "й", which optionally becomes a softener of the previous consonant. For example, "я" = "й" + "а" ("ya"), "ю" = "й" + "у" ("you"), "ё" = "й" + "о", Russian "е" = "й" + "э", Ukrainian "є" = "й" + "е". The point of this digression is that if you have a disagreement about how to transliterate "й", you automatically also have a disagreement about how to transliterate "є", "ё", "ю", "я".
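A small sketch of how the disagreement propagates, assuming we simply write each of these vowels as the chosen Latin spelling of "й" followed by the base vowel. The three candidate spellings "j", "y", "i" are the ones discussed above; the function and variable names are mine, and real standards add positional exceptions that this ignores.

```python
# Base vowel (in Latin) for each "й + vowel" letter, following the decomposition above.
BASE_VOWEL = {"я": "a", "ю": "u", "ё": "o", "є": "e"}

# The three proposed Latin equivalents of "й".
YOT_CHOICES = ("j", "y", "i")


def variants(letter):
    """All Latin spellings of a "й + vowel" letter, one per choice for "й"."""
    return [yot + BASE_VOWEL[letter] for yot in YOT_CHOICES]


for letter in BASE_VOWEL:
    print(letter, variants(letter))
# я ['ja', 'ya', 'ia']
# ю ['ju', 'yu', 'iu']
# ё ['jo', 'yo', 'io']
# є ['je', 'ye', 'ie']
```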

Should we transliterate the consonants "ц", "ч", "ш", "щ" following the English rules as "ts", "ch", "sh", "shch", or following the Czech rules as "c", "č", "š", "šč"? (Note that even the Czechs cannot express "щ" as a single symbol.)
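One practical consequence of this choice is reversibility. In the sketch below, the two tables contain the four consonants from the paragraph above, plus "т" and "с", which I added only to demonstrate a collision. With the English-style digraphs, "ц" and the sequence "тс" become indistinguishable; the Czech-style table avoids that, but still collides on "щ" versus "шч".

```python
# Two conventions for the same four consonants; "т" and "с" are included
# only to show where the output becomes ambiguous.
ENGLISH_STYLE = {"ц": "ts", "ч": "ch", "ш": "sh", "щ": "shch", "т": "t", "с": "s"}
CZECH_STYLE = {"ц": "c", "ч": "č", "ш": "š", "щ": "šč", "т": "t", "с": "s"}


def transliterate(text, table):
    """Replace each character using the table; leave unknown characters unchanged."""
    return "".join(table.get(ch, ch) for ch in text)


# Digraphs: one letter and a two-letter sequence give the same Latin output.
print(transliterate("ц", ENGLISH_STYLE), transliterate("тс", ENGLISH_STYLE))  # ts ts

# Diacritics keep these two apart...
print(transliterate("ц", CZECH_STYLE), transliterate("тс", CZECH_STYLE))  # c ts

# ...but "щ" still collides with "шч".
print(transliterate("щ", CZECH_STYLE), transliterate("шч", CZECH_STYLE))  # šč šč
```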

Then you have the "soft symbol" and "hard symbol", which have no direct equivalent in Latin, and their role is to regulate palatalization of the previous consonant. If you want to palatalize a consonant, you normally do it by following it with "я", "ю", etc. instead of the usual "а", "у", etc. But what if it's followed by another consonant, or it is at the end of the word? In such a case, you put the "soft symbol" after the consonant. On the other hand, if you want to prevent accidental palatalization where a word stem ending with a consonant is joined with a word stem beginning with "я", "ю", etc., you put the "hard symbol" between them.

Here the problem is that the soft symbol in Russian was traditionally transcribed to Latin as an apostrophe. But Ukrainian uses the apostrophe (in Cyrillic) as a hard symbol. :(

*

The strategy of preserving the pronunciation is called transcription.

Advantages of transcription:

  • it actually matches how people speak, the way you hear it on TV.

Disadvantages of transcription:

  • you need to actually speak the language in order to transcribe it correctly;
  • you will have different transcriptions to English, to German, etc.; a different system for each combination of source language and target language;
  • unless everyone uses the English transcription; which simply does not make sense for speakers of other languages (they need to mentally revert the English-specific rules, which existed neither in the original language, nor in their language).

*

Summary: It's complicated; I am giving up. If anyone criticizes you for writing something incorrectly, you may send them a link to this article.

22 comments

Not sure I can explain it in text to a native English speaker what palatalization is; you would need to hear actual examples.

 

There are some examples in English. It's not quite the same as how Slavic languages work*, but it's close enough to get the idea: If you compare "cute" and "coot", the "k" sound in "cute" is palatalized while the "k" sound in "coot" is not. Another example would be "feud" and "food".

British vs American English differ sometimes in palatalization. For instance, in British English (RP), "tube" is pronounced with a palatalized "t" sound, while in American English (SAE), "tube" is pronounced with a normal "t" sound.

 

* In English, the palatalization is more like a separate phoneme, so "cute" is /kjut/ and "coot" is /kut/, but in Slavic languages, the palatalization is directly on the consonant, so it would be /kʲut/. With the Slavic version, the tongue is in a different spot for the entire sound, while in the English version the /k/ is like normal and then the tongue moves to the hard palate.

I would just call this an extra 'y' sound before the vowel. ([ˈkjuːt] vs. [ˈkuːt])

Yeah, that's absolutely more correct, but it is at least a little helpful for a monolingual English speaker to understand what palatalization is.

Perhaps many Americans know at least some basics of Spanish? I think the Spanish ñ letter, as in "el niño", is proper palatalization. (But I do not speak Spanish.)

My understanding of Spanish (also not a Spanish speaker) is that it's a palatal nasal /ɲ/, not a palatalized alveolar nasal /nʲ/. With a palatal nasal, you're making the sound with the body of your tongue against the hard palate (the hard part of the roof of your mouth, behind the alveolar ridge). With a palatalized nasal, it's a "secondary" articulation: the tip of your tongue stays at the alveolar ridge while the body of your tongue rises toward the hard palate.

That said, the Spanish ñ is a good example of a palatal or palatalized sound for an English speaker.

And Irish (Gaelic) has both! (/ɲ/ is slender ng, /nʲ/ is slender n)

British vs American English differ sometimes in palatalization.

This explains something I was confused about, thank you.

In Bulgaria (where Cyrillic was invented) writing in Latin is common (especially before Cyrillic support was good) but frowned upon, as it is considered uneducated and ugly. The way we do it is to just replace each letter with the equivalent Latin letter one to one and do whatever with the few which don't fit (e.g. just use y for ъ, though some might use a; ч is just ch, etc.). So молоко is just moloko. Водка is vodka. Стол is stol, etc. This is also exactly how it works on my keyboard with the phonetic layout.

Everyone else who uses Cyrillic online seems to get it when you write like that, in my experience, though nowadays it's rarer.

I agree that this is the simplest way; algorithmically, but ultimately also for humans.

and do whatever with the few which don't fit

That's where the dozen different standards come from.

My first impression is that the Bulgarian language uses fewer characters than Russian and Ukrainian. Not fewer sounds, though; it just doesn't have characters like "ё" or "ї", which represent pairs of other existing characters anyway. (Though you still have "ю" and "я", which work the same way.)

I think "milk - молоко - [малако] - malako" is a bad idea, because in the word "молочный" (milky) the second о is the stressed vowel, so it is pronounced as o. Also, if you say молоко as [молоко], any Russian will understand you.

Moreover, in some regions (green here) of Russia it is pronounced this way!

Ah, yes. This is another complication, language evolves. The written form often reflects how the language was spoken in the past (and may still remain so in some regions). However, some aspects of spoken language also reflect how the language was spoken in the past.

So you get these weird situations where the written form is in conflict with some aspects of the speech, but if you tried to fix it, it would then be in conflict with some other aspects of the speech.

I wonder what would happen if literacy magically disappeared overnight, but people would still remember the idea of literacy, and would try to reinvent the written form from scratch.

In the example of milk, perhaps they would ultimately conclude that the second vowel needs to be "о". However, could they (using only the Moscow pronunciation) figure out the same about the first vowel?

(Oleg and Olga are the masculine and feminine variants of the same name, which is nearly obvious from their spellings but you'd never guess that from their Russian pronunciations alone)

Only half joking: unless there is untranslatable wordplay or poetry that is trying to rhyme or scan, I'd be tempted to just "drop" the original sounds and "ascend" to a maximally universal orthographic system that is reasonably standardized and yet still "very pointwise similar (given the extra information about where someone comes from) to how a person might have made mouth sounds".

So maybe: translate the meaning via Interslavic (Medžuslovjansky / Меджусловјанскы) and then render the Interslavic via the roman half of its orthographic system (which shouldn't be too hard for readers to learn to map to Slavic-compatible phonemes in the ear and tongue).

For your given example, you would read "молоко", then translate to milk, then render the interslavic "mlěko"?

(Taking abstraction to an extreme, you maybe just end up with ideograms? That would be too far. I'm not advocating that "молоко" should go all the way to "乳" or "🥛".)

I. Pragmatic Barbarism? <3

The primary objection to translating to Interslavic might be that such a move is barbaric and butchers a beautiful source language's beautiful details. However: Consider the audience! Have you noticed that English itself is practically a creole? <3

A practical motivation here is that I can't even pronounce Interslavic properly (because I haven't put in the practice (not because it would be impossible)), but if I'm going to "speculatively learn" an entire new orthographic system, I want the thing that I learn to apply to as much of the world as I can.

Interslavic is one of the best such things that I currently know of, that I might put non-trivial time into learning, that isn't just IPA or kanji or whatever.

(I'm not saying Interslavic is perfect, any more than I would say "Python is perfect". I'm saying something more like "Python in 2001 obviously had legs and would be useful in 2021, and, similarly, Interslavic in 2022 seems likely to not be a waste of learning effort if retained until 2042 (unless universal translation brain chips are introduced earlier than 2042)".)

I grant that my proposal has a partial DEFECT in that all Cyrillic words for milk in various eastern european languages (with respect-worthy and validly different vocabularies, and different orthographies, and different cultures, and so on) come out with the same romanized characters, but consider: from my perspective, that is sort of a feature rather than a bug!

II. Features

Feature: I can learn one orthography, and read the text out loud, and it will sound "slavic" and people who don't know any eastern european language will initially (falsely) think I'm speaking a natural language of eastern europe, and people who DO know one slavic language might get most of the gist and think I'm just speaking some other slavic language than the specific one that they know.

Feature: If you include the original text with annotations, then rendering a romanization VIA Interslavic will help create data that could make Interslavic better :-)

Feature: Totally naive english speakers will get more-or-less "the same gist" no matter what you do, but with interslavic you give them a maximally easy entry point (that has been designed to be a maximally easy entry point). 

(My hunch is that it would not cost much (and might help a lot) for naive people to FIRST learn interslavic orthography, and THEN learn the orthography of any of the other languages that interslavic is trying to span? (If this is false, then the "good cheap onramp to learning" feature isn't actually a feature. I have real uncertainty here.))

A fourth virtue might be political neutrality. The movie "The Painted Bird" is about an orphan who wanders through bad places and the book the movie is based on very carefully left the contextual fact OUT, and the movie wanted to retain that ambiguity, and not imply that any specific regional nationality was bad, so they had the bad people all speak interslavic. (I haven't seen it, and reports are that it is a harrowing cinematic experience that sometimes causes people to walk out of the theater. Plausibly: this is art that is emotionally powerful enough to really deserve a "trigger warning"?)

III. Weakness In The Particulars

I grant that my proposal would totally fail if your goal was to write about differences in the phonology or morphology or even the vocabulary of Serbian and Bulgarian, or how Moscow Russian and St Petersburg Russian are different. All three languages and all four varieties are romanized to the same roman letters in my proposal. My proposal just goes UP to the "denotational semantics" then DOWN to something systematically easily-learnable.

If you really want to get into these pronunciation/orthography differences, interslavic can maybe start to render these in a standardized way via flavorization (Flavorizacija)?

Sometimes the regional/cultural choices really matter, and it can turn into a comedy of cultural ignorance...

Standard: "English orthography is already full of complex tradeoffs."

"Scottish" (auto-flavorized): "Sassenach orthography is awready stowed oot o' complex tradeoffs."

IV. Summary

Here's a video where I think they're sometimes speaking in Serbian, and sometimes in Interslavic, as a test of mutual-comprehension-with-no-practice, and I think the subtitles use Interslavic romanization conventions all the way through? But I'm honestly not sure.

Anyway. The tradeoffs of interslavic are (1) formal systematicity, with a gesture towards (2) discovery of something (3) universally accessible, while retaining (4) denotational (5) semantics, all of which are potentially virtues :-)

If you are aiming for different virtues, I am happy to respect different choices. Also, if my choices don't actually hit my goals then I'm interested in hearing about how I'm wrong so I can choose better-to-me things <3

https://en.wikipedia.org/wiki/Kyiv#Name is quite interesting as it goes over the spelling:

Kiev is the traditional English name for the city,[21][24][25] but because of its historical derivation from the Russian name, Kiev lost favor with many Western media outlets after the outbreak of the Russo-Ukrainian War in 2014.

[...]

After Ukraine's 1991 independence, the Ukrainian government introduced the national rules for transliteration of geographic names into the Latin alphabet for legislative and official acts in October 1995,[23] according to which the Ukrainian name Київ is romanized Kyiv. These rules are applied for place names and addresses, as well as personal names in passports, street signs, and so on. 

[...]

Alternative romanizations used in English-language sources include Kyïv (according to the ALA–LC romanization used in bibliographic cataloguing), Kyjiv (scholarly transliteration used in linguistics), and Kyyiv (the 1965 BGN/PCGN transliteration standard). 

Choosing how to transliterate is a political act. 

Interestingly, the German Wikipedia still stayed with Kiew. There's a long discussion on the talk page about what the name should be. 

Choosing how to transliterate is a political act.

Choosing which language to transliterate from. Using a Russian name for a Ukrainian city has certain connotations; using a Ukrainian name has others.

The city is called Киев in Russian, Київ in Ukrainian. I think that both sides would agree that "Kiev" is the correct English transliteration of Киев, and "Kyiv" is the correct English transliteration of Київ. The question is not how to transliterate, but what to transliterate.

The thing I complained about is that each side has several different norms for how to transliterate their language. For "Киев" the choices are "Kiev" or "Kiyev", and for "Київ" it is "Kyiv" or "Kyjiv".

(In Slovak we have already used "Kyjev", which sounds like a compromise, so maybe we will keep it.)
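To show how little it takes to produce the competing spellings of "Київ" quoted from Wikipedia above, here is a sketch where the schemes differ in a single table entry, the one for "ї". The per-scheme values are inferred only from those four quoted spellings of this one word; the real standards have more letters and positional rules, which this deliberately ignores. The scheme labels used as dictionary keys are my own shorthand.

```python
# Letters on which all four quoted spellings agree.
COMMON = {"к": "k", "и": "y", "в": "v"}

# The only entry that differs, inferred from the quoted spellings of "Київ".
# The keys are informal labels, not official names of the standards.
YI_BY_SCHEME = {
    "national 1995": "i",
    "ALA-LC": "ï",
    "scholarly": "ji",
    "BGN/PCGN 1965": "yi",
}


def romanize(word, scheme):
    """Romanize a word made of the letters к, и, ї, в under the given scheme."""
    table = {**COMMON, "ї": YI_BY_SCHEME[scheme]}
    latin = "".join(table[ch] for ch in word.lower())
    return latin.capitalize()


for scheme in YI_BY_SCHEME:
    print(scheme, romanize("Київ", scheme))
# national 1995 Kyiv
# ALA-LC Kyïv
# scholarly Kyjiv
# BGN/PCGN 1965 Kyyiv
```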

Alternative romanizations used in English-language sources include Kyïv (according to the ALA–LC romanization used in bibliographic cataloguing), Kyjiv (scholarly transliteration used in linguistics), and Kyyiv (the 1965 BGN/PCGN transliteration standard).

I would expect that choosing any of those will also get you into trouble, because it differs from the official transliteration.

I finally found a book describing the official rules for transcribing Cyrillic to Slovak language. This solves my original problem that inspired this search, and is probably useless for you if you don't speak Slovak.

The rules are straightforward for most letters; only for the "й"-related letters ("е", "ё", "и", "ю", "я") do we distinguish cases depending on whether the given character is:

  • at the beginning of a word, or following a hard symbol;
  • following a soft symbol;
  • following a vowel;
  • following one of: "ж", "ч", "ш", "щ";
  • following a different consonant.

For example, "е" would be transcribed as "je", "ie", "je", "e", "e" respectively; "ё" would be transcribed as "jo", "jo", "jo", "o", "io" respectively. I believe this follows the logic of the Russian language (rather than Slovak), so I would expect similar rules when transcribing to other languages.
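The five cases above can be sketched as a small lookup: first classify the context of the letter, then pick the output. Only the outputs for "е" and "ё" come from the rules I quoted; the context names, the function names, and the set of letters I treat as vowels are my own assumptions, and the sketch assumes that anything not otherwise recognized before the letter is a consonant.

```python
# Russian vowel letters (my assumption of what "following a vowel" covers).
VOWELS = set("аеёиоуыэюя")
# The four consonants singled out by the rules.
HUSHING = set("жчшщ")

# Outputs for "е" and "ё" in each context, as listed in the text above.
RULES = {
    "е": {"start": "je", "soft": "ie", "vowel": "je", "hushing": "e", "other": "e"},
    "ё": {"start": "jo", "soft": "jo", "vowel": "jo", "hushing": "o", "other": "io"},
}


def context(word, i):
    """Classify the position of word[i] into one of the five cases."""
    if i == 0:
        return "start"
    prev = word[i - 1].lower()
    if prev == "ъ":  # hard symbol behaves like the beginning of a word
        return "start"
    if prev == "ь":  # soft symbol
        return "soft"
    if prev in VOWELS:
        return "vowel"
    if prev in HUSHING:
        return "hushing"
    return "other"  # assumed to be some other consonant


def transcribe_letter(word, i):
    """Latin output for the "е" or "ё" at position i of word."""
    return RULES[word[i].lower()][context(word, i)]


print(transcribe_letter("ель", 0))    # je
print(transcribe_letter("пьеса", 2))  # ie
print(transcribe_letter("тётя", 1))   # io
```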

For Ukrainian letters different from Russian, the rules are much simpler: "є" is always "je", "і" is always "i", and "ї" is always "ji". I have no idea whether this is because of intrinsic differences with regard to the given letters, or because the authors of these rules did not take the Ukrainian language equally seriously. (To avoid misunderstanding, the last sentence was not meant as sarcasm. Russian "и" is an equivalent of both Ukrainian "і" and "ї", so it would make sense if the rules for the former are more complicated than for the latter.)

I finally found a book describing the official rules for transcribing Cyrillic to Slovak language. 

I remember that there are sometimes different official rules for transcription depending on the language. The German Wikipedia, for example, often makes a point of transcribing names according to German rules, even if the English transcription of a name is often used more.

If a German-speaking person wants to read some words transcribed from Russian, why should the English rules be used in the process at all? It's not like the Russian language is somehow inherently English-like. (Arguably, German is actually a bit closer to Russian, having a bit more of the "one sound - one letter" correspondence.)

But then it is annoying when that German person speaks English, and needs to remember both transcriptions, or to be able to convert between them on the spot.

Essentially, there is no way to transcribe Russian "to Latin script", because there is no consensus on how to use the Latin script among those who use it. (I have no idea what the situation is with Cyrillic. It seems much more unified for Slavic languages, but it's optimized for them. No idea how e.g. Mongols tweak it.)

If German journalists read something in English media, they often copy over the name from English media without the journalist thinking about what the proper transcription in German happens to be. 

Perhaps you can just use the International Phonetic Alphabet?

If I am just doing it for myself, yes. If I want to communicate with other people, most of them probably can't read IPA. Different audiences require different solutions.

On second thought, if I am doing it only for myself, then unless I am already fluent in IPA, it is probably less work to learn to read Cyrillic directly.