Inari Listenmaa

Logo

CV · Blog · GitHub

6 April 2019

Working with bidirectional text

Since last autumn I’ve been working with Arabic and Persian. Here’s a small selection of things I’ve found useful while working with different scripts. For a proper introduction on bidirectionality, I recommend rtl.wtf which is informative and entertaining.

If my setup changes, or I find other useful information, I’ll update this post.

Basics

In Perso-Arabic script, characters join to each other. Each character has up to 4 forms: independent,initial, medial and final. Check e.g. Wikipedia if you want a longer explanation, here’s what you need to understand this post:

Vowels and some other markers, such as gemination (=long consonant), are indicated by small diacritics above or below the character.

Font choice

I tried several fonts, stopped at Pragmata Pro. The winning feature is that it has the smallest size difference between the different scripts, so I don’t need to set my font size ridiculously big. Here’s a demonstration: monaco menlo andale_mono courier pragmata

Notice especially the width, not just length. In addition, Pragmata is pretty blocky, whereas other fonts are more cursive and fancy-looking. It’s not pretty but it scales down. Note that Pragmata costs money—out of the free ones, I’d probably go for Courier (New) and maybe boldface it.

Normalisation of several combining characters

Think of a character such as ä: with the right keyboard, you can type if by pressing the Ä key, or you can combine ¨◌ with a. swedish-keyboard

Either way you produce the character, they are considered canonically equivalent. It would be inconvenient otherwise: suppose I have written a text in Swedish with a Swedish keyboard, by pressing the Ä key every time I need an ä. Now you’re doing corpus statistics on my text using a non-Swedish keyboard, so you need to type ¨◌+a to get ä. Without the equivalence, you could grep until your fingers are sore with the ä you produced by combining ¨◌+a and it wouldn’t match my ä. Sounds like an annoying world, right?

Now the same thing happens with Arabic: you can have several diacritics, e.g. one for gemination and one for vowel, in either order, and you would like them to be canonically equivalent, for the same reasons.

Out of the fonts I tried, Pragmata is the only one that shows a difference between the order of the combining characters:

order-actual order-menlo order-pragmata

In most fonts, it shouldn’t even matter which order the diacritics are, so for output reasons it should be fine. But if you e.g. want to compare the grammar’s output against a gold standard, then it’s nice if you can trust that two sentences are marked as different if they actually are different on a more fundamental level than رُّ and رُّ. You can try to search رُّ , it should only match the first رُّ in the previous sentence.

Also beware that some browsers may normalise the text that you input in a text field or copy and paste. Try to select and copy the second رُّ in the previous paragraph, Ctrl+F it, and if it doesn’t match itself, then your browser is doing something funny. This is not directly related to GF, but good to know if you e.g. rely on a web-based tool to gather data for your grammar, or copy and paste lexicon out of GitHub web page instead of checking out the repository to your computer.

You can do your own normalisation in a grammar, like this:

-- vowel : pattern Str = #("َ "|"ِ "|"ُ "|"ً "|"ٍ "|"ٌ ") ;
-- geminate : Str = "ّ "
case word of {
  <x + v@vowel + g@geminate + y> => x + g + v + y ;
  _                              => word
}

or use any external normalisation library in your favourite programming language. I haven’t used any, but there’s the ICU normalisation library for Java, C++ and C, and for other languages, you can just google normalization arabic $LANG.

Invisible control characters

Left-to-right mark/Right-to-left mark

If you’re like me and create your lexicon by copying and pasting from Wiktionary, you’re likely to accidentally copy and paste a RTL mark along with the word. If RTL mark ends up in your inflection table, your text will look weird. Nowadays I run the following check regularly:

grep "‎" *.gf

It doesn’t look like anything because RTL character is invisible. But you can copy that line and save it in a file, e.g. checkRTL.sh and then run ./checkRTL every now and then.

Zero-width joiner and non-joiner

The zero-width joiner (ZWJ) forces a character in its joining form (i.e. initial or medial), even when it’s followed by a space or a non-joining character. The zero-width non-joiner (ZWNJ) forces the character in its independent form. I’m only using the non-joiner (ZWNJ) in my grammars, but I linked the joiner just for completeness’ sake.

In Persian, the ZWNJ is used e.g. to mark morpheme boundaries. I handle it in the following way:

oper
  ZWNJ : Str = "‌" ;
  zwnj : Str -> Str -> Str = \s1,s2 -> s1 + ZWNJ + s2 ;

Then in any grammatical function that needs ZWNJ, I use the function zwnj instead of inserting the ZWNJ character into the code multiple times. This means that if some other grammarian wants to swap ZWNJ for a space or an empty string, they only need to change one line in the RGL, and that changes all grammatical functions that use zwnj.

Text editors

I’ve only tried 3, if you don’t count Google Translate textbox. (Seriously, it’s a pretty good text editor for RTL text!) If I try more, I’ll write about them here.

Emacs

Out of the box, Emacs looks like this:

emacs

GF code contains strings of different directions on the same line. When I would paste Arabic text to a line with so far only Latin text, the line would flip, including => looking like <=. (If you read the first link I posted, you’ll know exactly why.)

Naturally, for such a popular editor, other people have come up with solutions how to make it work. This episode of Emacs Is Great (thanks to @odanoburu for the link!) presents a solution that works for me. If you don’t want to bother with yet another link, here’s the relevant bit you can copy and paste into your init.el. (I didn’t write it, I just took it from that YouTube video.)

(setq-default bidi-display-reordering nil)

(defun bidi-reordering-toggle ()
  "Toggle bidirectional display reordering."
  (interactive)
  (setq bidi-display-reordering (not bidi-display-reordering))
  (message "bidi reordering is %s" bidi-display-reordering)
  )

Sublime Text < 3.2

Sublime has the positive property that it doesn’t even try to do anything smart about bidirectionality or joining the characters. If you use a version older than 3.2, it’s also ignoring the diacritics, simply showing each of them as a character in its own right. As of 3.2 it’s trying to be smart about diacritics but messes it up, so if you work with vocalised Arabic, try an earlier version of Sublime.

sublime

Features:

I suppose there are also plugins for Sublime, I just didn’t bother looking into it because of the following.

Atom with alphabetter

I stopped looking into other solutions when my office mate John wrote alphabetter, three Atom plugins to solve my problems:

So here’s Atom with the default setup: RTL and joined.

atom-no-alphabetter

It looks already pretty good, but without any plugins, I had a few complaints:

atom-no-alphabetter-fail

The diacritics are wonky, and despite how it looks like, line 50 has balanced parentheses. This happens when there are several RTL and LTR blocks on one line. The third, which is perhaps a feature rather than a bug, but nevertheless a feature I dislike, is that the cursor moves logically instead of visually, and selecting is difficult. So I choose to do all my development in the following mode: force everything LTR and unjoined.

atom-alphabetter

With this setup, the offending lines render like this: atom-alphabetter-success

The most useful thing is that I can toggle LTR/RTL and unjoined/joined. This is useful when showing the code to an informant who is used to reading the alphabet like it’s supposed to, RTL and joined. After I’ve shown the words to the informant, I can press the button again and it flips to LTR+unjoined for my comfort.

The downside of this setup is that sometimes Atom gets really slow with these plugins, especially with bigger files. If your computer is slow, you might prefer another option. And of course if you’re already an established $EDITOR hacker, you’ll probably enjoy finding just the right incantations that solve your problems in $EDITOR.

Vim

I don’t use Vim for more than Git commit messages, but another person on the Internet writes Arabic in Vim and has written a blog post about it, check it out.

Learning the alphabet

Next question is, do you need to learn to read the language in order to write a grammar for it? Not necessarily: if you just need to implement a small application grammar for a language with a good quality resource grammar, and you have native informants, you’ll probably be fine. But in my case, I have to add a lot of new stuff into the RG, so I can’t see how I could do my job without being able to read.

But you don’t need to read well in order to write a grammar. Furthermore, you can totally learn Perso-Arabic script LTR and unjoined, and never need to write it by hand. This makes reading Arabic resemble much more reading Latin: you don’t need to change direction, and the characters always look the same, unlike when you read it RTL and joined.

I started learning the alphabet last autumn in the traditional way, and practiced handwriting until I could read RTL and joined. After I started using Atom with alphabetter, probably 80% of the text I read is in my text editor, which shows it LTR and unjoined. If I see an Arabic or Persian word RTL and joined, and it’s not one of the very common words (e.g. articles, personal pronouns, question particles, verbs such as do, be, take), I need to make conscious effort to read it. It is possible that I read a verb in vocalised Arabic and can tell immediately which verb class it belongs to, and only then I may decide that I want to actually pronounce the word too.

tags: gf