Inari Listenmaa

Logo

CV · Blog · GitHub

22 February 2021

Lists in GF

This post is about lists in GF. It’s aimed for multiple audiences, so possibly some parts won’t interest you. If you have a good grasp of the basics, feel free to jump directly to Advanced topics.

Basics

Quoting from the GF book.

C.4.3 List categories

Since categories of lists of elements of another category are a common idiom, the following syntactic sugar is available:

cat [C] {n}

abbreviates a set of three judgements:

cat ListC ;
fun BaseC : C -> ... -> C -> ListC ; --n C’s
fun ConsC : C -> ListC -> ListC

The functions BaseC and ConsC are automatically generated in the abstract syntax, but their linearizations, as well as the linearization type of ListC, must be defined manually. The type expression [C] is in all contexts interchangeable with ListC.

Choice of n

The parameter n in cat [C]{n} determines the size of the base list. For instance,

cat [C]{1}

generates the following functions:

BaseC : C -> ListC
ConsC : C -> ListC -> ListC

Likewise,

cat [C]{0}

generates the following functions:

BaseC : ListC
ConsC : C -> ListC -> ListC

In fact, the choice of n only affects the BaseC function. ConsC is always the same, adding a single C to an already existing list.

If you’re used to lists from other programming languages, you might wonder what’s the purpose of n > 0. An empty list is such a useful concept, why force the minimum size of a list to be 1, 2 or even more?

The answer is that it depends on an application. In the next sections, I’ll cover the use of lists for natural and formal languages.

Lists in natural language

The purpose of lists in the Resource Grammar Library (RGL) is to allow coordination. I’ll start with an example and explain the functions right after.

While the example is for NP (noun phrase), exactly the same principle holds for other list categories in the RGL (AP, Adv,RS, S).

Baseline: single NP

Here’s the RGL API and the underlying tree for “I walk”. No lists yet, this is just for comparison.

-- RGL API, what you'd write in application grammar
lin I_walk_Cl = mkCl i_NP walk_V ;

-- Underlying RGL tree
  PredVP              -- : NP -> VP -> Cl
    (UsePron i_Pron)  -- : NP
    (UseV walk_V)     -- : VP
  -- : Cl

If you’ve only ever used the RGL API and never seen PredVP, UsePron etc. before, you can read an explanation here. Knowing the RGL abstract syntax is not necessary for writing application grammars, but for this deep dive post, it’s useful to understand that the two levels exist.

List of NPs

Here’s the corresponding GF code for “they, you and I walk”. This time, the subject is constructed from a list of three NPs, which are put back together into one NP.

-- RGL API, what you'd write in application grammar
lin They_You_and_I_walk_Cl =
   mkCl
     (mkNP and_Conj
           (mkListNP they_NP
                     (mkListNP you_NP
                               i_NP)
           )  -- : ListNP
     ) -- : NP
     walk_V ;

The RGL API overloads the mkListNP oper. In the underlying RGL abstract syntax tree, we see their true names BaseNP and ConsNP.

-- Underlying RGL tree
   PredVP
    (ConjNP and_Conj
            (ConsNP (UsePron they_Pron)
                    (BaseNP (UsePron you_Pron)
                            (UsePron i_Pron))
            )  -- : ListNP
    ) -- : NP
   (UseV walk_V)

As long as you have GF_LIB_PATH set up, you can open the RGL abstract syntax in the GF shell and run the commands below.

$ gf alltenses/LangEng.gfo
…
Lang> l PredVP (ConjNP and_Conj (ConsNP (UsePron they_Pron) (BaseNP (UsePron youSg_Pron) (UsePron i_Pron)))) (UseV walk_V)

In RGL, n = 2

Looking at the examples, we see that the base list size in RGL is 2. We saw that the innermost mkListNP was applied to two arguments:

(mkListNP you_NP i_NP)

And that translated to a BaseNP, which took 2 arguments. So we know that BaseNP and ConsNP were generated from the following expression.

cat [NP]{2}

We can even verify that this was the expression: look at the RGL abstract syntax!

Lists for less than 2 aren’t needed in the RGL. If we had n = 1, even a single NP could be a list, like in the following.

mkCl
  (mkNP and_Conj (mkListNP i_NP))
  walk_V

But why write mkNP and_Conj (mkListNP i_NP), when you can just write i_NP? Clearly, we only start needing lists when we want to coordinate 2 or more elements.

Back to C from ListC: introducing ConjC

For all categories C that have a ListC, the RGL includes a corresponding function

ConjC : Conj -> ListC -> C

which turns a list back into a single instance of the category, with the help of a conjunction. For example, “[Alice, Bob, Charlie]” is a list of NPs, whereas “Alice, Bob and Charlie” is a single NP.

Notice that ConjC is different from BaseC and ConsC, which are derived automatically whenever ListC is defined. In contrast, ConjC is manually defined in the RGL.

ConjC in the API: just another instance of mkC

In the RGL API, all these ConjC funs are accessible from mkC. Here are two versions of mkNP, which under the hood call ConjNP:

Screenshot of the RGL API, showing mkNP : Conj -> NP -> NP -> NP

If you only want to coordinate two NPs, you can skip constructing ListNP, and call mkNP directly for the two NPs (and a conjunction).

-- RGL API
mkNP or_Conj i_NP they_NP
mkNP or_Conj (mkListNP i_NP they_NP)

-- both correspond to the RGL abstract syntax
ConjNP or_Conj (BaseNP i_NP they_NP)

Conj

Naturally, RGL also includes the category Conj, with examples such as

and_Conj : Conj ;
or_Conj  : Conj ;

There are also conjunctions that put a string before the list.

both7and_DConj  : Conj ; -- both...and
either7or_DConj : Conj ; -- either...or

For instance,

mkNP either7or_DConj everybody_NP nobody_NP

returns “either everybody or nobody”.

Implementation of ListC in RGL

ListC is usually implemented as exactly like the lincat of C, but with two s fields, called s1 and s2.

For example, if C is defined as

lincat C   = {s     : Number => Str ; g : Gender} ;

then ListC will split its s field into two, and retain its other fields as

lincat [C] = {s1,s2 : Number => Str ; g : Gender} ;

You can see examples in ConjunctionEng:

lincat
  [S] = {s1,s2 : Str} ;
  [Adv] = {s1,s2 : Str} ;
  [NP] = {s1,s2 : NPCase => Str ; a : Agr} ;
  [AP] = {s1,s2 : Agr => Str ; isPre : Bool} ;
  [RS] = {s1,s2 : Agr => Str ; c : NPCase} ;
  [CN] = {s1,s2 : Number => Case => Str} ;

If you look at the lins of a given Conjunction module, it probably looks rather cryptic. Most of it is just calling opers like twoTable and consTable. These opers, and many more, are found in the module gf-rgl/prelude/Coordination.gf.

The only documentation of Coordination (that I’m aware of) is in the comments of the module itself, and it looks like the following:

-- Create a ListX from two Xs. Example:
--     x = {s = "here"} ;
--     y = {s = "there"} ;
-- twoSS x y ==> {s1 = "here" ; s2 = "there"}
twoSS : (_,_ : SS) -> ListX = \x,y -> twoStr x.s y.s ;

It’s likely that the opers still look very cryptic, so as prerequisite knowledge, you should read the first half of this post on types in GF. You can stop when you see the subheading “Dependent types in abstract syntax”.

If you’re implementing a new resource grammar and are struggling to understand Conjunction, the best way is to look at other languages’ implementation of Conjunction, and read the comments of Coordination. If your language uses converbs or conjunctions as suffixes, you can read the section later in this post Natural language strategies beyond A, B and C. Later on, I might add a section in this post to be more of a hands-on guide for resource grammarians, but for now I’ve prioritised other things.

That was the end of the natural language/RGL part. Next section is about lists in formal languages.


Lists in formal language

Suppose that I’m defining my own object-oriented programming language. Here’s a class definition:

class Business = {
  bus_name : String ;
  is_legal : Boolean ;
  } ;

Let’s ignore any other language constructs for the sake of this example, and just concentrate on the class definition.

All classes should have a name, and some amount of fields. I’m also happy to accept an empty class, with no fields—that’d just be written as follows:

class Business ;

Since this whole blog post is about lists, you might see where this is going.

Abstract syntax

Here’s my abstract syntax for the GF grammar which describes this programming language fragment.

abstract MyOOP = {
  cat
    Class ;      -- class ClassName : { [Field] }

    Field ;      -- field_name : BuiltinType
    [Field]{0} ; -- generates funs BaseField, ConsField

    BuiltinType ;

  fun
    ClassDef : String -> [Field] -> Class ;
    MkField : String -> BuiltinType -> Field ;

    BoolType, StringType : BuiltinType ;
}

In this case, it’s a good idea to allow an empty list. If I had done like in the RGL, and made the minimum size of [Field] >0, then I would’ve needed two versions of ClassDef: one for 0 fields, other for >0 fields. Instead, with [Field]{0}, I can express empty and non-empty classes with the same fun.

Concrete syntax

I still want to print out different strings for empty and non-empty classes. But that’s no problem—I just make a parameter for it. Here’s my concrete syntax.

concrete MyOOPCnc of MyOOP = {
  lincat
    [Field] = {s : Str ; isEmpty : IsEmpty} ;

  param
    IsEmpty = Empty | NonEmpty ;

  lin
    -- : String -> [Field] -> Class
    ClassDef name flds = {
      s = "class" ++ name.s ++
             case flds.isEmpty of {
               Empty    => flds.s ; -- just empty string
               NonEmpty => "= {" ++ flds.s ++ "}"
             } ++ ";"
      } ;

    -- : String -> BuiltinType -> Field
    MkField name typ = {s = name.s ++ ":" ++ typ.s} ;

    -- : BuiltinType
    BoolType = {s = "Boolean"} ;
    StringType = {s = "String"} ;

-- These funs automatically generated from [Field]{0}
    -- : [Field]
    BaseField = {s = [] ; isEmpty = Empty} ;

    -- : Field -> [Field] -> [Field]
    ConsField f fs =
      let sep : Str = case fs.isEmpty of {
                        Empty => [] ;
                        NonEmpty => ";" } ;
       in {s = f.s ++ sep ++ fs.s ; isEmpty = NonEmpty} ;
}

I’m using the parameter IsEmpty twice:

That’s all I need to make ClassDef handle empty and non-empty lists.

Side note: in ClassDef, I use the string from flds even when flds is empty, and only contains the empty string. The reason is explained in my gotchas post. In GF, every argument needs to contribute with a string, otherwise it isn’t recognised when parsing. This happens even if there is no ambiguity.

Natural language concrete

Next, I want to do a natural language interface in my programming language! Wouldn’t this be cool:

MyOOP> p "class Business ;" | l -lang=Eng
Business is a class with no fields

MyOOP> p "class Business = { is_legal : Boolean } ;" | l -lang=Eng
Business is a class with a Boolean field is_legal

MyOOP> p "class Business = { is_legal : Boolean ; bus_name : String } ;" | l -lang=Eng
Business is a class with fields is_legal of type Boolean and bus_name of type String

English presents new challenges with lists of different sizes. But there’s still nothing to worry about: the English concrete just needs to use a different set of internal params. Here’s a fragment:

ClassDef name fields =
 let classname : NP = symb name ;
     with_fields : Adv = case fields.size of {
       Zero => mkAdv with_Prep (mkNP noPl_Det field_N) ;
       One  => withField fields.firstField ;
       Many => withField fields.s
     } ;
     descr : NP = mkNP a_Det (mkCN class_N with_fields)
 in mkS (mkCl classname descr) ;

You can see that I’m using a three-valued parameter for list size: 0, 1 or >=2 (i.e. “many”). I chose those cases, because I wanted to use three different verbalisation strategies:

I’m not going to paste the whole English concrete, but you can see everything—the abstract and the two concretes—in this link.

Advanced topics

First, I’ll talk about the problem that everything gets flattened to a string. Second, I’ll talk about natural languages like Turkish and Korean, whose conjunctions work a bit differently than the standard RGL model, and how to implement them.

Problem: everything flattens to string

You probably find lists in GF more restricted than you’re used to in other languages. For instance, you can’t retain the individual elements—they just get concatenated into one long string, separated by commas in the RGL, or a separator of your choice in formal languages. So you can’t peek into a list and decide, for an arbitrary n, “if the nth element has parameter Foo, then do something”.

Of course, for a finite n, you can always have a lincat with n fields, and pattern match them as much as you like. For example:

lincat
  Elem = {s : Str ; foo : FooBar} ;
  [Elem] = {
    s1,s2,s3,s4,s5 = {s : Str ; foo : FooBar} ;
    numElems : NumElems ;
   } ;

param
  FooBar = Foo | Bar ;
  NumElems = Zero | One | Two | Three | Four | Five ;

lin
  BaseElem = {
    s1,s2,s3,s4,s5 = { -- dummy values in all 5 fields
      s = "NOTHING HERE YET" ;
      foo = Foo} ;
    numElems = Zero
    } ;

  ConsElem e es =
    case es.numElems of {
      Zero => es ** {s1 = e ; numElems = One} ;
      ...
      Four => es ** {s5 = e ; numElems = Five} ;
      Five => error "The list already has 5 elements"
    } ;

This hypothetical list of elements has 5 fields s1‥s5, and a parameter NumElems to keep track of which of the 5 fields contains some actual value. (If you’re familiar with grammar blowup due to lots of parameters, here’s an exercise to you: which one is cheaper, this solution or the alternative in the footnote1?)

While this type of solution can work for a finite n, the much more scalable method is to do any list manipulation from an external program.

Natural language strategies beyond A, B and C

The RGL Conjunction module was primarily designed for the following natural language strategy:

But that’s not the only strategy natural languages use to connect words and phrases. Other strategies include:

Special form with coordination

Linguistically, this seems like a more exotic thing than just repeating the conjunction. But as long as the coordinated forms are still separated by commas (or anything else that doesn’t depend on the final conjunction chosen), this modification is the easier of the two.

Let’s take a hypothetical AP in a hypothetical language. Here’s the spec:

So we have three forms: attributive, predicative and conjunctive. Here’s an inflection table:

lincat
  AP = {s : AForm => Str} ;
param
  AForm = AAttr | APred | AConj ;

You don’t need to do anything special in BaseAP and ConsAP. The basic design is that the last AP is stored in the s2 field, and the others in s1.

lincat
  [AP] = {s1, s2 : AForm => Str} ;

lin
  -- BaseAP and ConsAP are like for any RGL language.
  BaseAP a1 a2 = twoTable AForm a1 a2 ;
   -- returns: {s1 = a1.s ; s2 = a2.s}

  ConsAP a as = consrTable AForm comma a as ;
   -- returns the following:
   --   {s1 = \\af => a.s ! af ++ "," ++ as.s1 ! af ; s2 = as.s2}

The differences are in ConjAP, which chooses the AConj form for all but the last AP.

lin
  ConjAP conj as = {
    s = \\af => as.s1 ! AConj -- All but the last AP in AConj
             ++ conj.s        -- Conj string, like "and"
             ++ as.s2 ! af    -- Full inflection table retained
    } ;

The last AP retains the full inflection table, including the AConj form. That’s because any AP can potentially appear before a conjunction—even one that is already built of several APs. (The house is [big and blue] or [small and green].)

But if an AP has ever been coordinated, and it was not the last one, then it won’t ever go back to a non-conjunctive form.

Of course, in real languages the inflection tables tend to be much larger, with agreement, polarity, tense and whatnot. I have implemented similar languages in the RGL and GF works very well—I just chose a minimal example for the blog post.

Repeated conjunctions

Consider the API of lists in the RGL. First, we construct a list: [scallions, onions, garlic], and only after that, we apply a conjunction to the list. As we saw earlier in this post, this is implented as a record with two s fields: s1 contains the string "scallions , onions" and s2 contains the string "garlic". Then all we need to do is to put the conjunction between the two strings in s1 and s2.

But now the conjunction is repeated after every element. So we need to be prepared for scallions and onions and garlic, as well as scallions or onions or garlic, even before ConjC <someConj> has been called. What to do?

Conjunction as parameter

Luckily, there are a finite amount of conjunctions. So we can make it into a parameter:

param
  ConjType = And | Or | Nor ;

This ConjType will be on the left-hand side of our list type. Like this:

-- Assuming that lincat of NP is {s : Case => Str ; a : Agr}
lincat
  [NP] = {s : ConjType => Case => Str ; a : Agr} ;

And here comes the important part: the ConjType parameter is used to insert the desired conjunctions after each new addition to the list.

lin
  BaseNP np1 np2 = {
    s = table {
          And => \\c => np1.s ! c ++ "and" ++ np2.s ! c ;
          Or  => \\c => np1.s ! c ++ "or" ++ np2.s ! c ;
          Nor => \\c => np1.s ! c ++ "nor" ++ np2.s ! c
      } ;
    a = conjAgr np1.a np2.a
    } ;

  ConsNP np nps = nps ** {
    s = table {
          And => \\c => np.s ! c ++ "and" ++ nps.s ! And ! c ;
          Or  => \\c => np.s ! c ++ "or" ++ nps.s ! Or ! c ;
          Nor => \\c => np.s ! c ++ "nor" ++ nps.s ! Nor ! c
      }
    } ;

And how to get the right string out of the ListNP? Let’s look at the implementation of ConjNP.

lincat
  Conj = {c : ConjType} ; -- Conj string already in the list

lin
  ConjNP conj nps = {s = nps.s ! conj.c} ;

How about if your language uses different conjunctions for different parts of speech? That’s no problem at all. Each BaseC and ConsC decides which strings to insert, so BaseNP might use the string “and” for the parameter And, and BaseAP might put a different string for the same parameter.

Complex morphology

What if the conjunctions have many allomorphs depending on the words they attach to? That definitely happens, for instance in Korean. Here’s a peek into the Korean RG, where we have a type

conjTable : POS => ConjType => Phono => Str

That’s a bit more things to consider than just output the string “and” after the parameter And, but nothing that GF can’t handle. Here’s some code in action, where we choose the right allomorphs:

oper
  mkFirstAP : AdjPhrase -> VForm => ConjType => Str = \ap ->
    \\af,conj => ap.compar ++ case isPos af of {
       True  => glue (ap.s ! VStem Pos) (conjTable ! VStar ! conj ! ap.p) ;
       False => glue (ap.s ! VStem Neg) (conjTable ! VStar ! conj ! ap.pNeg)
    } ;

The lincat for AP includes a parameter p which tell us whether the VStem Pos form ends in a consonant or vowel, and pNeg, which tells whether the VStem Neg form ends in a consonant or vowel.

If you find that the code is getting very complicated, maybe a better approach would be to store the forms with the conjunction included, already before any lists are made. So your AP would contain forms like

and BaseAP would just choose whatever blue.*+and it needs for a given parameter like And.

Of course, the larger your inflection table for just blue is, the more it may blow up when you add conjunctions. Writing GF grammars for morphologically complex languages can be tricky, but also very rewarding.

If you are ever writing a resource grammar and struggle with these type of questions, I would love to help. You can write to the GF mailing list (or to me personally if you’re shy), and I’ll try to give some tips.

Footnotes

1. Below is an alternative method to implement the list of up to 5 Elems. In light of grammar blowup, which version is better?

lincat
  Elem = {s : Str ; foo : FooBar} ;
  [Elem] = {
    s1,s2,s3,s4,s5 = {s : Str ; foo : MaybeFooBar}
   } ;

param
  FooBar = Foo | Bar ;
  -- This is not the place to teach polymorphism in GF
  MaybeFooBar = Just FooBar | Nothing ;

lin
  BaseElem = {
    s1,s2,s3,s4,s5 = { -- dummy values in all 5 fields
      s = "NOTHING HERE YET" ;
      foo = Nothing}
    } ;

  ConsElem e es =
    case es.s1.foo of {
      Nothing
        => es ** {s1 = {s = e.s ; foo = Just e.foo}} ;
      _ => case es.s2.foo of {
             Nothing
               => es ** {s2 = {s = e.s ; foo = Just e.foo}} ;
             _ => case es.s3.foo of {
                    Nothing
                      => es ** {s3 = {s = e.s ; foo = Just e.foo}} ;
                    _ => case es.s4.foo of {
                           Nothing
                             => es ** {s4 = {s = e.s ; foo = Just e.foo}} ;
                           _ => case es.s5.foo of {
                                  Nothing
                                    => es ** {s5 = {s = e.s ; foo = Just e.foo}} ;
                                  _ => error "The list already has 5 elements"}}}}

    } ;

The solution is here.

tags: gf, programming