13 June 2018

Compiling GF to a low-level format

Update 2018-12-28: this post adapted into a presentation.

So I’m talking a bunch about PGF, PMCFG, concrete categories, sequences and such when I talk about gftest. If you ever wondered what’s that all about, keep reading.

How about if you have never wondered what’s that all about? Honestly, that’s no problem either. It may be cool to learn about PMCFG if you’re writing a lot of GF, but only in the sense that it may be cool to learn some Italian if you eat a lot of pizza. So take this post as a cool extra tidbits about your favourite grammar formalism, not as a prerequisite for anything.

Terminology
Grammars
GF linearisation
Concrete categories
Concrete productions
Concrete functions
Sequences
Variants
- Why variants can blow up

Terminology

PMCFG is short for “Parallel Multiple Context-Free Grammar”, introduced by Seki et al. (1991). (An important subset of) GF is “equivalent to” PMCFGs, in the sense that all things we can express in GF, we can also express in PMCFG. As the name might suggest, PMCFGs are more expressive than plain old context-free grammars, but still not quite as expressive as context-sensitive grammars.

Okay, and what are the formats for writing PMCFG grammars? That is, “grammars that are as expressive as those by Seki et al., and that have the same hard-to-read low-level feel to them”.

We know of at least three:

The particular format used by Seki et al.
This format which doesn’t seem to have gained much following
PGF, by Angelov (2011)

It’s completely normal to have different syntaxes for the same thing. Think of regular grammars: you can write them as automata, regular expressions or A → Aa style production rules (see e.g. this StackOverflow answer if you’re confused about this).

However, we don’t usually say that GF as a programming language is PMCFG, instead that it is equivalent to PMCFG.

PGF is short for “Portable Grammar Format”. It’s the concrete low-level format to which GF grammars are compiled. The syntax I use in this post is specific to PGF, not to PMCFG in general. (Technically, PGF is an extension of PMCFG with added features such as multilinguality, but seriously, nobody is going to die if we just say that PGF is a format for writing PMCFG grammars.)

Grammars

All my examples come from the Foods grammar. I’m assuming that if you’re reading about advanced topics in GF, you’re familiar with Foods, but in case not, take a look at the abstract syntax and why not check out the concrete in some language you know.

GF linearisation

Here we use a generic Romance language for concrete syntax: adjectives inflect in number and gender, nouns have a fixed gender and inflect in number; most adjectives come after the noun but some come before. Here’s GF code for the Mod function.

lincat
  Kind    = {s : Number => Str ; g : Gender} ;
  Quality = {s : Number => Gender => Str ; isPre : Bool} ;

lin
  -- Mod : Quality -> Kind -> Kind ;
  Mod quality kind = {
    s = \\num =>
     let q = quality.s ! num ! kind.gender ;
         k = kind.s ! num ;
      in case quality.isPre of {
                True  => q++k ;
                False => k++q } ;
    g = kind.g
    } ;

So Mod constructs a table of Number => Str, choosing the right gender form of Quality and putting it either before or after the Kind, depending on the parameter isPre of Quality.

Concrete categories

This is the actual output you get when you type pg (print_grammar) in the GF shell! Let’s start from the concrete categories for Kind and Quality.

categories
  Kind := range  [C5 .. C6]
          labels ["s Sg"
                  "s Pl"]
  Quality := range  [C7 .. C8]
             labels ["s Masc Sg"
                     "s Masc Pl"
                     "s Fem Sg"
                     "s Fem Pl"]

Okay, this tells us that both Kind and Quality have been split into two concrete categores (C5-C6 and C7-C8 respectively).

The things in labels are variable to the category. This is easy to see from the names of the labels—Sg, Pl, Masc, Fem are all names of the parameters in the original tables. What used to be Kind consists of two strings, labelled “s Sg” and “s Pl”: there used to be a table Number => Str in the field s, but now it’s just flattened into strings. So far not a big deal: we just went from a record with tables {s : Number => Str} into a vector of strings [sSg, sPl]. The same happened to Quality: it used to have a nested table {s : Number => Gender => Str}, but now all combinations are enumerated.

The things in range are inherent to the category: as we will soon see, C5 corresponds to masculine Kind, and C6 to feminine. Likewise, C7 corresponds to postmodifier Quality (i.e. isPre field is False) and C8 to premodifier (isPre field is True).

Concrete productions

Now let’s have a closer look at C5–C6. Which productions make them?

productions

    C5 -> F24[]       -- Cheese
    C5 -> F25[]       -- Fish
    C5 -> F26[]       -- Wine
    C5 -> F27[C7,C5]  -- Mod
    C5 -> F28[C8,C5]  -- Mod
    C6 -> F29[C7,C6]  -- Mod
    C6 -> F30[C8,C6]  -- Mod
    C6 -> F31[]       -- Pizza

I added the comments (Cheese–Pizza) to make it clearer, otherwise it’s straight from pg. This is what we learn:

Functions F24–F28 produce a C5 (masculine Kind).
Functions F29–F31 produce a C6 (feminine Kind).
Functions F24,F25,F26 and F31 take no arguments.
Functions F27,F28,F29 and F30 take 2 arguments, their types specified inside [].

You don’t know it anywhere from the PGF that C5 is in fact masculine, you just have to know if based on e.g. the fact that Cheese, Fish and Wine are all masculine nouns in Generic Romance. If you actually go and look at F24 and follow the trail, you will eventually get to the string "cheese", and then you know that F24 corresponds to Cheese.

Anyway, we were interested in Mod! So we better look up F27–F30. Let’s check out Pizza as well, just to appreciate the difference between functions with and without arguments.

Concrete functions

lin

  F27[C7,C5] := (S19,S20) [Mod]  -- post,masc
  F28[C8,C5] := (S21,S22) [Mod]  -- pre,masc
  F29[C7,C6] := (S23,S24) [Mod]  -- post,fem
  F30[C8,C6] := (S25,S26) [Mod]  -- pre,fem

  F31        := (S27,S28) [Pizza]

Okay, we have split Mod into several concrete functions (called F27-F30) according to the concrete categories of their arguments. There are 4 of them, because {masc,fem} x {post,pre} makes 4 combinations.

Pizza doesn’t change a lot: it’s just called F31 now, but it didn’t split into many.

Next, what are those S19–S28? They’re called sequences. There are 2 sequences for each concrete Mod (and Pizza), because the goal category (C5 or C6, i.e. masculine or feminine Kind) has two strings in it: singular and plural. The number of arguments is irrelevant: Mod takes 2 arguments and Pizza takes none, but both expand to 2 sequences.

Sequences

The real action happens in sequences. I have added all comments myself, otherwise the output is straight from pg.

sequences

  -- Mod
  S19 := <1,0> <0,0>   -- post,masc,sg
  S20 := <1,1> <0,1>   -- post,masc,pl
  S21 := <0,0> <1,0>   -- pre,masc,sg
  S22 := <0,1> <1,1>   -- pre,masc,pl
  S23 := <1,0> <0,2>   -- post,fem,sg
  S24 := <1,1> <0,3>   -- post,fem,pl
  S25 := <0,2> <1,0>   -- pre,fem,sg
  S26 := <0,3> <1,1>   -- pre,fem,pl

  -- Pizza
  S27 := "pizza"
  S28 := "pizzas"

Let’s start with lexical functions. We renamed Pizza into F31, defined it in terms of S27 and S28, which finally resolve into the strings “pizza” and “pizzas”.

How about the others, with the weird numbers? These <n,m> are, respectively, <#argument, #string>.

Just like Mod itself, the sequences used for all concrete versions of Mod have two arguments. We don’t even bother to name them–they’re just called 0 and 1, short for “first argument” and “second argument”.

So what is the order of the arguments? You can look at the productions with F27[C7,C5] and so on, or just the original GF code, with Mod : Quality -> Kind -> Kind. Both will tell you that 0 corresponds to Quality, and 1 corresponds to Kind.

The second number in the <n,m> is the index of the string in the argument. As we remember from concrete categories, Quality has 4 strings, Kind has 2 strings. Now we can tell that S23 := <1,0> <0,2> builds a new string out of the first string of its second argument, and the third string of its first argument. Here’s an example pair of arguments:

0 = ["vegano", "veganos", "vegana", "veganas"]
1 = ["pizza", "pizzas"]

Given these arguments to S23, we get the string pizza vegana.

Okay, that S23 was only the first part of the concrete function F29. Here’s the definition again:

F29[C7,C6] := (S23,S24)

Applied to the same arguments, S24 produces pizzas veganas. This makes sense: F29 applied to Vegan and Pizza produces the vector ["pizza vegana", "pizzas veganas"].

Variants

Say that we want to be able to capture learner language: for instance, I couldn’t care less about the gender of nouns. I trust that people understand me, even if I say pizza vegano. In GF concrete syntax, the grammarian could model that as follows:

lin Pizza = {s = table {Sg => "pizza" ;
                        Pl => "pizzas" } ;
             g = Masc|Fem } ;

Remember that masculine nouns are compiled into the concrete category C5 and feminine nouns into C6. The difference to our changed grammar in the PGF is simple: just add one more production from C5 to F31.

productions

    C5 -> F24[]       -- Cheese
    C5 -> F25[]       -- Fish
    C5 -> F26[]       -- Wine
    C5 -> F31[]       -- Pizza (m)
    ...
    C6 -> F31[]       -- Pizza (f)

How about adding another string as a variant? This time I’m not even trying to make a plausible real-world scenario, but here’s the grammarian’s solution to allowing “pizza” or “Neapolitan pie”:

lin Pizza = {s = table {Sg => "pizza"|"Neapolitan pie" ;
                        Pl => "pizzas"|"Neapolitan pies" } ;
             g = Masc|Fem } ;

This time we need a new concrete syntax function, let’s just call it F32[] and it will expand to two new sequences.

productions

    C5 -> F24[]       -- Cheese
    C5 -> F25[]       -- Fish
    C5 -> F26[]       -- Wine
    C5 -> F31[]       -- Pizza (m), variant "pizza"
    C5 -> F32[]       -- Pizza (m), variant "Neapolitan pie"
    ...
    C6 -> F31[]       -- Pizza (f), variant "pizza"
    C6 -> F32[]       -- Pizza (f), variant "Neapolitan pie"

For completeness’ sake, here are the two new sequences:

lin

  F27[C7,C5] := (S19,S20) [Mod]  -- post,masc
  F28[C8,C5] := (S21,S22) [Mod]  -- pre,masc
  F29[C7,C6] := (S23,S24) [Mod]  -- post,fem
  F30[C8,C6] := (S25,S26) [Mod]  -- pre,fem

  F31        := (S27,S28) [Pizza] -- variant "pizza"
  F32        := (S29,S30) [Pizza] -- variant "Neapolitan pie"

sequences

  S27 := "pizza"
  S28 := "pizzas"
  S29 := "Neapolitan" ++ "pie"
  S30 := "Neapolitan" ++ "pies"

Why variants can blow up

For big grammars, variants might blow up. To quote Angelov (2011):

The free variation is a useful tool but it have to be used carefully. Note that when the variants are concatenated, the number of possibilities multiply, so having too many variations could lead to combinatorial explosion. It is even worse when the explosion is not easy to find, for example the variants can be in operations which are only later expanded in the linearization rules. At the same time, if we refactor the abstract syntax and add explicit categories for operation [–], then we can eliminate the variations all together which usually improves the parsing efficiency.

So far we’ve just seen three more productions, one more concrete function and two more sequences. Aside from the productions, the PGF looks like I had just added a new lexical item—adding new lexicon surely doesn’t blow up grammars?

Let me demonstrate variants on a slightly more complex example. Since it’s really hard to blow up Foods, I’ll change the abstract syntax a bit.

abstract Foods = {
  flags startcat = Comment ;
  cat
    Comment ; Item ; Kind ; Quality ;
  fun
    Pred : Kind -> Quality -> Comment ;
    -- no more This, That, These, Those : Kind -> Item
    Mod : Quality -> Kind -> Kind ;
    Wine, Cheese, Fish, Pizza : Kind ;
    Very : Quality -> Quality ;
    Fresh, Good, Italian, Delicious, Vegan : Quality ;
}

In the standard Foods, Pred’s first argument is Item, and there are 4 functions that add a determiner to a Kind to make it into an Item. Now instead, the first argument to Pred is Kind, and the choice of determiner is free variation: no matter if we type “this pizza is vegan” or “those pizzas are vegan”, we get the tree Pred Pizza Vegan.

We’re still doing Generic Romance here, so adjectives inflect in gender and number, copula only in number (and person, but everything’s 3rd person in Foods).

Pred in the new grammar

oper
  Det : Type = {s : Gender => Str; n : Number} ;
  this_Det, that_Det, these_Det, those_Det : Det = ...
  copula : Agr => Str = ...

lin
  Pred kind qual =
   let det : Det = this_Det | that_Det
                 | these_Det | those_Det ;
       np : NP = {
         s = det.s ! kind.g ++ kind.s ! det.n ;
         a = agr kind.g det.n} ;
   in {s = np.s ++ copula ! np.a ++ qual.s ! np.a} ;

First let’s introduce the baseline, where we have Pred applied to Item. These are the productions, concrete functions and sequences used in Pred. (I can’t avoid showing strings anymore, so now Generic Romance gets instantiated as Spanish.)

-- Baseline: the original Foods
categories
    Comment := range  [C0 .. C0]
               labels ["s"]

productions
    C0 -> F19[C1,C9]  -- sg. masc. Item, any Quality
    C0 -> F20[C3,C9]  -- pl. masc. Item, any Quality
    C0 -> F21[C2,C9]  -- sg. fem. Item, any Quality
    C0 -> F22[C4,C9]  -- pl. fem. Item, any Quality

lin
    F19 := (S2) [Pred]
    F20 := (S4) [Pred]
    F21 := (S3) [Pred]
    F22 := (S5) [Pred]

sequences
    S2 := <0,0> "es" <1,1>
    S3 := <0,0> "es" <1,5>
    S4 := <0,0> "son" <1,3>
    S5 := <0,0> "son" <1,7>

As expected, Comment expands to only one concrete category, with only one field. There are 4 productions, because there are 4 concrete categories for Item, first of Pred’s arguments in the original grammar.

Now we see the corresponding productions, concrete functions and sequences for the new grammar, with Pred : Kind -> Quality -> Comment and determiner as free variation.

-- Foods where determiner is free variation
categories
    Comment := range  [C0 .. C0]
               labels ["s"]
productions
    C0 -> F19[C5,C9] -- 64 productions
    ...
    C0 -> F82[C6,C9]

lin
   -- the corresponding 64 concrete functions

sequences
    -- finally, 64 sequences
    S19 := "esa" <0,0> "es" <1,5>
    S21 := "esa" <0,0> "son" <1,5>
    ...
    S80 := "estos" <0,1> "es" <1,3>
    S82 := "estos" <0,1> "son" <1,3>

We get 64 productions, concrete functions and sequences for the original 4 of each. You can see already from the strings in the sequences, that this grammar overgenerates heavily. Esa is a singular determiner, but it happily appears with the plural son. Gender agreement is ignored in the same way, both inside a noun phrase (ese pizza) and as a predicative (pizza es vegano).

Why does this happen? The linearisation of Pred didn’t look that outrageous. But actually all occurrences of the variable det expand to include all the variants, so a more honest view is this:

Pred kind qual =
  let np : NP = {
        s = (this_Det  | that_Det
            |these_Det | those_Det).s ! kind.g ++
            kind.s ! (this_Det  | that_Det
                     |these_Det | those_Det).n ;
        a = agr kind.g (this_Det  | that_Det
                       |these_Det | those_Det).n} ;
   in {s = np.s ++ copula ! np.a ++ qual.s ! np.a} ;

And that’s why we get 4³=64 productions, concrete functions and sequences.

This might not be a realistic scenario, in that maybe you don’t want to make determiners into free variation, but it’s educational. In the original code, I just put the 4 variants in one place, but it’s secretly expanded to 3 places, so the actual number of variants is 4³, not 4.

If you’re writing an resource grammar, it’s important not to introduce variants there, because they compound very quickly. Using variants in an application grammar is safer, but it might also lead to unexpected blowups. See for instance this GF question and my answer to it on Stack Overflow.

Pred with only the intended variants

Here’s how to do variants in the intended way. The code in the locally defined addDet is identical to the old Pred, but we just restrict the scope, and create only 4 variants.

Pred kind qual =
  let addDet : Det -> SS = \d ->
        let np : NP = {
              s = d.s ! kind.g ++ kind.s ! d.n ;
              a = agr kind.g d.n} ;
        in {s = np.s ++ copula ! np.a ++ qual.s ! np.a} ;
  in addDet this_Det
   | addDet that_Det
   | addDet these_Det
   | addDet those_Det ;

Plot twist: the version with variants has fewer productions, concrete functions and sequences. It’s not a fair comparison, because the new abstract syntax has fever funs: This, That, These and Those don’t exist in the new abstract, they’re just variants in Pred.

This is the last PGF dump pair I’ll show you, I promise. Here are sequences from the original grammar.

-- Original grammar
sequences
    -- Pred : Item -> Quality -> Comment
    S2 := <0,0> "es" <1,1>
    S3 := <0,0> "es" <1,5>
    S4 := <0,0> "son" <1,3>
    S5 := <0,0> "son" <1,7>

    -- : Kind -> Item
    S23 := "esa" <0,0>    -- That  (fem. argument)
    S24 := "esas" <0,1>   -- Those (fem. argument)
    S25 := "ese" <0,0>    -- That  (masc. argument)
    S26 := "esos" <0,1>   -- Those (masc. argument)
    S27 := "esta" <0,0>   -- This  (fem. argument)
    S28 := "estas" <0,1>  -- These (fem. argument)
    S29 := "este" <0,0>   -- This  (masc. argument)
    S30 := "estos" <0,1>  -- These (masc. argument)

The 12 sequences above cover 3 abstract syntax functions in the old grammar. Below we see 8 sequences, which cover for 1 abstract syntax function in the new grammar, with the 4 determiners as variants.

-- New grammar
sequences
    -- Pred : Kind -> Quality -> Comment
    S19 := "esa" <0,0> "es" <1,5>
    S20 := "esas" <0,1> "son" <1,7>
    S21 := "ese" <0,0> "es" <1,1>
    S22 := "esos" <0,1> "son" <1,3>
    S23 := "esta" <0,0> "es" <1,5>
    S24 := "estas" <0,1> "son" <1,7>
    S25 := "este" <0,0> "es" <1,1>
    S26 := "estos" <0,1> "son" <1,3>

There are marginally fewer productions, concrete functions and sequences, but we’ve completely lost the distinction between all determiners. I am totally not advocating this, and especially not as a way to “optimise your grammar”. Just thought it was a funny result, and wanted to share in the name of transparency.

If you use variants and it blows up, see the post on gotchas for an alternative hack. (If your grammar blows up without variants, see the post on blowup for other tips.)

If you want to play around with the modified Foods, it’s all in my gf-contrib, branch variants. FoodsOld is the original grammar, and FoodsNew is the one with variants. There are three different concrete syntaxes, with varying degrees of overgenerating. To see the PGF dump, just load any file in a GF shell and type pg.

$ gf FoodsNewSpaOvergenerating.gf
FoodsNew> pg
… PGF output printed here …

That’s it for now

So, now you can type pg in the GF shell and try to decipher your own grammar! Why not also take a look at the obscure options in gftest.

tags: gf