Inari Listenmaa

Logo

CV · Blog · GitHub

22 September 2018

Grammar blowup: some causes and solutions

Update 2018-12-14: added section V & this post adapted into a presentation

I. The problem

Some GF grammars take a long time to compile and parse (and to generate test cases when using gftest). The reasons for this can be many and complicated, but we can identify two concrete reasons:

Both of these things can make a grammar blow up, and reducing either can make it decent again. Especially the amount of concrete categories can be very sneaky; you wouldn’t expect that adding just a tiny Boolean to some insignificant category would cause a blowup, but it just might. Did you ever wonder why French is so much slower than Spanish, even though they use the same functor? It’s because French participles agree in gender and number1, and Spanish don’t—a difference of two 2-valued parameters cause a major difference in compile time.

Here’s another anecdote: back in 2011 or so, I tried to add a 2-valued vowel harmony parameter to a few categories in Finnish, and the grammar just stopped compiling. Later, Aarne reduced the size of nouns and verbs, by replacing the inflection tables of nouns and verbs with just the minimum number of different stems (the method known as glu &+ ing morph &+ eme &+ s). That sped up the grammar a lot, with vowel harmony and all!

So, a problem that comes from adding one parameter can be solved by shaving off from somewhere else, either by getting rid of some other parameter, or making the inflection tables smaller (in the changed category, or elsewhere). You can get some ideas of the numbers when you compile your GF grammar with the setting -v=32. The bad news is that a lot of categories are deeply interconnected, so you really might have no clue where to start. The good news is that if you just start from where it seems feasible, it’s often actually helpful.

Before we get started, here’s a disclaimer. These hacks can make the grammar less readable and harder to debug. If you don’t have a performance problem, and you suspect you won’t have one in the future, you don’t need to think about these all that much.

II. Reducing the amount of concrete categories

In order to reduce the amount of concrete categories, you need to reduce the amount of inherent parameters. Here’s a couple of cases, with inspiration from real life, but simplified for the sake of example.

Word order

If you’re thinking of adding a parameter about word order (pre or post), often you can do without a parameter, and just create two fields in your record type instead.

Think of any major Romance language: some adjectives, like good come before the noun (buena pizza ‘good pizza’), but most adjectives come after (pizza vegana ‘vegan pizza’).

This is how you do it with a parameter:

lincat
  Adj = { s : Str ; isPre : Bool } ;
  CN  = { s : Str } ;

lin
  AdjCN adj cn = {
    s = case adj.isPre of {
          True  => adj.s ++ cn.s ;
          False => cn.s ++ adj.s }
    } ;

  Good  = { s = "buena"  ; isPre = True } ;
  Vegan = { s = "vegana" ; isPre = False } ;

And this is how you do it without a parameter.

lincat
  Adj = { sPre : Str ; sPost : Str } ;
  CN  = { s : Str } ;

lin
  AdjCN adj cn = { s = adj.sPre ++ cn.s ++ adj.sPost } ;

  Good  = { sPre = "buena" ; sPost = "" } ;
  Vegan = { sPre = ""      ; sPost = "vegana" } ;

linref
  Adj = \adj -> adj.sPre ++ adj.sPost ;

The latter solution looks uglier, and needs a linref to make any sense in the GF shell. (Alternatively, you could have a field called s, which is just a copy of the non-empty member of sPre/sPost.)

This also opens up a possibility for new kinds of bugs: you might end up with an adjective that has strings in both sPre and sPost, like “buena pizza buena”. But this cuts down the need for parameter, i.e. does its job.

Has <some function> been applied?

Consider a language, where only the last word in the noun phrase gets the case marking, and the rest are in some unmarked base form (which we can language-agnostically call absolutive). Both nouns and adjectives can inflect in case, and the case inflection causes some serious stem changes, so you cannot just be lazy and pile up the words in the CN -> CN functions, and finally glue the case suffix in whatever CN -> NP function you are using.

Here’s one way to do it, with a parameter hasAdj in the CN, and DetCN checking if its argument CN has an adjective. If it has, then DetCN chooses the absolutive form of the noun, and only puts the rest of the case endings at the end of the adjective.

lincat
  CN = { s,adj : Case => Str ; hasAdj : Bool } ;
  NP = { s = Case => Str } ;

lin
  -- : Adj -> CN -> CN
  AdjCN adj cn = { -- Move previous adjectives into s.
   s   = \\c => cn.s ! c ++ cn.adj ! c ;
   adj = \\c => adj.s ! c ;
   hasAdj = True
  } ;

  -- : Det -> CN -> NP
  DetCN det cn = {
   s = case cn.hasAdj of {
        True  => \\c => det.s ++ cn.s ! Abs ++ cn.adj ! c ;
        False => \\c => det.s ++ cn.s ! c }
  } ;

This has the positive property that at any phase of CN, all forms are intact. The s field contains a table with all the potential forms of the noun, and the parameter hasAdj tells whether to actually use all that potential. The parameter is necessary to get the desired behaviour, if we want to retain all forms of the noun (and any other adjectives that were added before the latest application of AdjCN).

However, we can also do it without a parameter: when applying AdjCN, force everything else but the newest adjective into absolutive. Then you can choose any case in DetCN and you get the correct behaviour: only the newest adjective gets the other case endings.

lincat
  CN = { s,adj : Case => Str } ;
  NP = { s : Case => Str } ;

lin
  AdjCN adj cn = {
    s   = \\c => cn.s ! Abs ;
    adj = \\c => cn.adj ! Abs ++ adj.s ! c ;
  } ;

  DetCN det cn = {
    s = \\c => det.s ++ cn.s ! c ++ cn.adj ! c
  } ;

This means that when you select e.g. cn.s ! Gen of a CN that has undergone AdjCN, you actually get absolutive, not genitive. It’s up to you how you feel about such a decision.

III. Reducing the size of concrete categories

This time, we don’t touch the inherent parameters, but the variable ones, that is, the LHS of a table. If our Noun type is {s : Case => Str}, we want to make the parameter Case smaller.

Inflection tables vs. glu &+ ing morph &+ eme &+ s

A traditional GF grammar would store an inflection table in the entries for noun and verb, and the function for building sentences would pick the right forms for the subject, verb and object. See an example below (not a real RGL function):

lincat
  S  = Str ;
  NP = { s : Case => Str ; agr : Agr } ;
  V2 = { s : Agr => Str ; compl : Case } ;


lin
  -- : NP -> V2 -> NP -> S ;
  MakeSentence subj verb obj = subj.s ! Nom
                            ++ verb.s ! subj.agr
                            ++ obj.s ! verb.compl ;

Let’s take a linguist version of assuming a perfectly spherical cow in a vacuum. Suppose that your language had 100 % regular concatenative morphology, then you could write a GF grammar like this:

lincat
  S  = Str ;
  NP = { s : Str ; agr : Str } ;
  V2 = { s : Str ; compl : Str } ;

lin
  -- : NP -> V2 -> NP -> S ;
  MakeSentence subj verb obj = subj.s -- base form = Nom
                            ++ glue verb.s subj.agr
                            ++ glue obj.s verb.compl ;

oper
  -- Concatenates two strings at runtime without spaces
  glue x y = x ++ BIND ++ y ;

In this ideal world, the category for NP would just contain the morpheme that needs to be added to the verb to make it agree with the subject NP, and the category for V2 would contain the case that needs to be added to the object NP. Now (eh, since 2015 or so) that GF has the magical BIND token, which concatenates two strings at runtime without spaces, it is possible to do things this way.

Often real world morphemes aren’t all that ideal though. (My job wouldn’t exist if they were!) You could still get away with a similar solution: instead of a full inflection table of, say, 50 forms, maybe you only need 5-10 different stems where you attach only, say, 1-5 different allomorphs of whatever morphemes you now want to attach. Here’s an example in the Basque RG that demonstrates both:

let stem =
  itsaso ++ table {
      {- other cases … -} ;
      LocStem => table {
        FinalCons => BIND ++ "e" ; --mutil+e+tik
        FinalR    => BIND ++ "re" ; --txakur+re+tik
        FinalA    => BIND ++ "a" ; --neska+tik
        _         => [] } --itsaso+tik
   } ;

So, the case LocStem replaces locative cases, because they all use the same stem, and the parameter Phono with values FinalR, FinalCons, FinalA, FinalVow tells which allomorph to choose. Then the locPrep oper chooses the form with parameter LocStem from its NP argument and glues its string argument into the NP.

IV. Trading size for amount: a case study in Basque

Here’s a real-life case study, where first I had few concrete categories but they were huge (verb inflection tables). I replaced those with many concrete categories that were small, and it improved the performance a lot.

If you’re reading this blog, chances are high that you know a thing or two about Basque verbs, but in case no, here’s a recap:

What are the implications for GF grammars? First of all, the size of the verb categories can get really big, if all verbs have all the forms in an inflection table. But since most of the verbs share the inflecting part with other verbs in the same valency, putting it all in inflection tables sounds like a waste.

So in all of the verb categories in the Basque RG, there are no finite verb forms:

Verb : Type = {
  prc : Tense => Str ; -- 3 forms
  nstem : Str ;
  aux : AuxType -- <20 different values
  } ;

The participle in the prc field has only 3 forms: past, progressive and future. In the lexical categories, we only have the AuxType param, which has around 10 different values so far: the three auxiliaries, and the handful of synthetic (i.e. fully inflecting) verbs. The parameter just tells us which auxiliary or synthetic verb to choose for the verb. Furthermore, this Verb is the lincat of every single subcategory of verbs in the RG!

Here’s an example–the words Etorri, Ibili and Jakin are values of the parameter AuxType:

lin
  -- Ordinary verb, using the transitive auxiliary
  buy_V2 = mkV2 "erosi" ;

  -- Some synthetic verbs
  come_V = syntVerb1 "etorri" Etorri ; -- intrans.
  walk_V = syntVerb1 "ibili" Ibili ;   -- intrans.

  know_V2 = syntVerb2 "jakin" Jakin ;  -- trans.

As the VP keeps forming, we keep track of the direct and indirect object in corresponding fields–still no finite verb forms, just the parameter that was there since the beginning. Finally, we introduce the actual verb inflection in PredVP. It’s only at that point we know all the arguments (subject, direct object and indirect object), so we can choose the agreement at once3.

I might write more about this, if I ever write a paper about the Basque RG. But now let’s recap what we did for verbs:

I wonder if this scales up when I add the remaining synthetic verbs. There should be around 20 parameters for a complete grammar, rather than 10, but I haven’t gotten around adding them; we don’t need more in the small resource lexicon. It is possible that this approach doesn’t scale up, but at least is has scaled up better than the original version with inflection tables.

Of course, not all languages support this kind of thing–but maybe you can find other redundancies to exploit in yours. The take-home message here was that “you don’t need to put verb inflection forms in the verb category, but somewhere higher up”. Maybe write a RG or two, and by your third, you’ll get creative! ^_^

V. Quick fix for testing an oversized grammar

The part that takes time is the conversion to PMCFG–that’s what is needed for parsing. However, it is also possible to use a GF grammar such that it only linearises. Unfortunately, as of December 2018, this is possible only in the GF shell, i.e. it is not supported in any external application.

But if you want to test a slow grammar just for linearising, you can do as follows.

nopmcfg

In words:

> help cc
cc, compute_concrete
computes concrete syntax term using a source grammar

syntax:
  cc (-all | -table | -unqual)? TERM

Compute TERM by concrete syntax definitions. Uses the topmost
module (the last one imported) to resolve constant names.
N.B.1 You need the flag -retain when importing the grammar, if you want
the definitions to be retained after compilation.
N.B.2 The resulting term is not a tree in the sense of abstract syntax
and hence not a valid input to a Tree-expecting command.
This command must be a line of its own, and thus cannot be a part
of a pipe.

options:
 -all	pick all strings (forms and variants) from records and tables
 -list	all strings, comma-separated on one line
 -one	pick the first strings, if there is any, from records and tables
 -table	show all strings labelled by parameters
 -unqual	hide qualifying module names
 -trace	trace computations

Read more

Krasimir’s thesis Section 2.8.4 “Hints for Efficient Grammars” is a good read. It is found on page 67 (page number), which is page 79 in the PDF.

Footnotes

1. Let’s hope that the French language goes through this grammar reform, then we have some hope of GF grammars running faster!

2. Compiling with -v=3, you get this kind of output:

  generating PMCFG
+ AdVVP 1280 (1280,1)
+ AdVVPSlash 69120 (69120,1)
+ AdvVP 1280 (1280,1)
+ AdvVPSlash 69120 (69120,1)
+ CompAP 8 (2,2)
+ CompAdv 1 (1,1)
+ CompCN 8 (2,2)
+ CompNP 32 (1,1)
+ ComplSlash 2211840 (25600,21)

If you’re developing a resource grammar, you can try playing around with the parameters and see if you get a difference in the numbers.

3. I know this is quite abstract, so here’s some code. The actual verbs live in the weird-sounding identifiers like AditzTrinkoak.ukanZaio, and chooseAuxPol picks the right forms (the right forms also depend on if the sentence is negative or positive, and that comes one step later, from UseCl).

tags: gf