
Update 2018-12-28: this post adapted into a presentation.
So I’m talking a bunch about PGF, PMCFG, concrete categories, sequences and
such when I talk about gftest. If you ever wondered what’s that all
about, keep reading.
How about if you have never wondered what’s that all about? Honestly, that’s no problem either. It may be cool to learn about PMCFG if you’re writing a lot of GF, but only in the sense that it may be cool to learn some Italian if you eat a lot of pizza. So take this post as a cool extra tidbits about your favourite grammar formalism, not as a prerequisite for anything.
PMCFG is short for “Parallel Multiple Context-Free Grammar”, introduced by Seki et al. (1991). (An important subset of) GF is “equivalent to” PMCFGs, in the sense that all things we can express in GF, we can also express in PMCFG. As the name might suggest, PMCFGs are more expressive than plain old context-free grammars, but still not quite as expressive as context-sensitive grammars.
Okay, and what are the formats for writing PMCFG grammars? That is, “grammars that are as expressive as those by Seki et al., and that have the same hard-to-read low-level feel to them”.
We know of at least three:
It’s completely normal to have different syntaxes for the same thing. Think of regular grammars: you can write them as automata, regular expressions or A → Aa style production rules (see e.g. this StackOverflow answer if you’re confused about this).
However, we don’t usually say that GF as a programming language is PMCFG, instead that it is equivalent to PMCFG.
PGF is short for “Portable Grammar Format”. It’s the concrete low-level format to which GF grammars are compiled. The syntax I use in this post is specific to PGF, not to PMCFG in general. (Technically, PGF is an extension of PMCFG with added features such as multilinguality, but seriously, nobody is going to die if we just say that PGF is a format for writing PMCFG grammars.)
All my examples come from the Foods grammar. I’m assuming that if you’re reading about advanced topics in GF, you’re familiar with Foods, but in case not, take a look at the abstract syntax and why not check out the concrete in some language you know.
Here we use a generic Romance language for concrete syntax: adjectives
inflect in number and gender, nouns have a fixed gender and inflect in
number; most adjectives come after the noun but some come before.
Here’s GF code for the Mod function.
lincat
Kind = {s : Number => Str ; g : Gender} ;
Quality = {s : Number => Gender => Str ; isPre : Bool} ;
lin
-- Mod : Quality -> Kind -> Kind ;
Mod quality kind = {
s = \\num =>
let q = quality.s ! num ! kind.gender ;
k = kind.s ! num ;
in case quality.isPre of {
True => q++k ;
False => k++q } ;
g = kind.g
} ;
So Mod constructs a table of Number => Str, choosing the right
gender form of Quality and putting it either before or after the
Kind, depending on the parameter isPre of Quality.
This is the actual output you get when you type pg (print_grammar)
in the GF shell! Let’s start from the concrete categories for Kind
and Quality.
categories
Kind := range [C5 .. C6]
labels ["s Sg"
"s Pl"]
Quality := range [C7 .. C8]
labels ["s Masc Sg"
"s Masc Pl"
"s Fem Sg"
"s Fem Pl"]
Okay, this tells us that both Kind and Quality have been split
into two concrete categores (C5-C6 and C7-C8 respectively).
The things in labels are variable to the category. This is easy to
see from the names of the labels—Sg, Pl, Masc, Fem are all names of
the parameters in the original tables.
What used to be Kind consists of two strings, labelled “s Sg” and “s Pl”:
there used to be a table Number => Str in the field s, but now
it’s just flattened into strings. So far not a big deal: we just went
from a record with tables {s : Number => Str} into a vector of
strings [sSg, sPl].
The same happened to Quality: it used to have a nested table {s :
Number => Gender => Str}, but now all combinations are enumerated.
The things in range are inherent to the category: as we will soon
see, C5 corresponds to masculine Kind, and C6 to
feminine. Likewise, C7 corresponds to postmodifier Quality
(i.e. isPre field is False) and C8 to premodifier (isPre field
is True).
Now let’s have a closer look at C5–C6. Which productions make them?
productions
C5 -> F24[] -- Cheese
C5 -> F25[] -- Fish
C5 -> F26[] -- Wine
C5 -> F27[C7,C5] -- Mod
C5 -> F28[C8,C5] -- Mod
C6 -> F29[C7,C6] -- Mod
C6 -> F30[C8,C6] -- Mod
C6 -> F31[] -- Pizza
I added the comments (Cheese–Pizza) to make it clearer, otherwise it’s
straight from pg. This is what we learn:
F24–F28 produce a C5 (masculine Kind).F29–F31 produce a C6 (feminine Kind).F24,F25,F26 and F31 take no arguments.F27,F28,F29 and F30 take 2 arguments, their types
specified inside [].You don’t know it anywhere from the PGF that C5 is in fact
masculine, you just have to know if based on e.g. the fact that
Cheese, Fish and Wine are all masculine nouns in Generic Romance. If
you actually go and look at F24 and follow the trail, you will
eventually get to the string "cheese", and then you know that F24
corresponds to Cheese.
Anyway, we were interested in Mod! So we better look up
F27–F30. Let’s check out Pizza as well, just to appreciate the
difference between functions with and without arguments.
lin
F27[C7,C5] := (S19,S20) [Mod] -- post,masc
F28[C8,C5] := (S21,S22) [Mod] -- pre,masc
F29[C7,C6] := (S23,S24) [Mod] -- post,fem
F30[C8,C6] := (S25,S26) [Mod] -- pre,fem
F31 := (S27,S28) [Pizza]
Okay, we have split Mod into several concrete functions
(called F27-F30) according to the concrete categories of their
arguments. There are 4 of them, because {masc,fem} x {post,pre}
makes 4 combinations.
Pizza doesn’t change a lot: it’s just called F31 now, but it
didn’t split into many.
Next, what are those S19–S28? They’re called sequences. There
are 2 sequences for each concrete Mod (and Pizza), because the
goal category (C5 or C6, i.e. masculine or feminine Kind) has
two strings in it: singular and plural.
The number of arguments is irrelevant: Mod takes 2 arguments and Pizza takes none, but both expand to 2 sequences.
The real action happens in sequences. I have added all comments myself, otherwise the output is straight from pg.
sequences
-- Mod
S19 := <1,0> <0,0> -- post,masc,sg
S20 := <1,1> <0,1> -- post,masc,pl
S21 := <0,0> <1,0> -- pre,masc,sg
S22 := <0,1> <1,1> -- pre,masc,pl
S23 := <1,0> <0,2> -- post,fem,sg
S24 := <1,1> <0,3> -- post,fem,pl
S25 := <0,2> <1,0> -- pre,fem,sg
S26 := <0,3> <1,1> -- pre,fem,pl
-- Pizza
S27 := "pizza"
S28 := "pizzas"
Let’s start with lexical functions. We renamed Pizza into F31,
defined it in terms of S27 and S28, which finally resolve into
the strings “pizza” and “pizzas”.
How about the others, with the weird numbers? These <n,m> are,
respectively, <#argument, #string>.
Just like Mod itself, the sequences used for all concrete versions
of Mod have two arguments. We don’t even bother to name
them–they’re just called 0 and 1, short for “first argument” and
“second argument”.
So what is the order of the arguments? You can look at the
productions with F27[C7,C5] and so on, or just the
original GF code, with Mod : Quality -> Kind -> Kind. Both will
tell you that 0 corresponds to Quality, and 1 corresponds to
Kind.
The second number in the <n,m> is the index of the string in the
argument. As we remember from concrete categories,
Quality has 4 strings, Kind has 2 strings. Now we can tell that
S23 := <1,0> <0,2> builds a new string out of the first string of
its second argument, and the third string of its first
argument. Here’s an example pair of arguments:
0 = ["vegano", "veganos", "vegana", "veganas"]
1 = ["pizza", "pizzas"]
Given these arguments to S23, we get the string pizza vegana.
Okay, that S23 was only the first part of the concrete function
F29. Here’s the definition again:
F29[C7,C6] := (S23,S24)
Applied to the same arguments, S24 produces pizzas veganas. This
makes sense: F29 applied to Vegan and Pizza produces the vector
["pizza vegana", "pizzas veganas"].
Say that we want to be able to capture learner language: for instance, I couldn’t care less about the gender of nouns. I trust that people understand me, even if I say pizza vegano. In GF concrete syntax, the grammarian could model that as follows:
lin Pizza = {s = table {Sg => "pizza" ;
Pl => "pizzas" } ;
g = Masc|Fem } ;
Remember that masculine nouns are compiled into the concrete category
C5 and feminine nouns into C6. The difference to our changed
grammar in the PGF is simple: just add one more production from C5
to F31.
productions
C5 -> F24[] -- Cheese
C5 -> F25[] -- Fish
C5 -> F26[] -- Wine
C5 -> F31[] -- Pizza (m)
...
C6 -> F31[] -- Pizza (f)
How about adding another string as a variant? This time I’m not even trying to make a plausible real-world scenario, but here’s the grammarian’s solution to allowing “pizza” or “Neapolitan pie”:
lin Pizza = {s = table {Sg => "pizza"|"Neapolitan pie" ;
Pl => "pizzas"|"Neapolitan pies" } ;
g = Masc|Fem } ;
This time we need a new concrete syntax function, let’s just call it
F32[] and it will expand to two new sequences.
productions
C5 -> F24[] -- Cheese
C5 -> F25[] -- Fish
C5 -> F26[] -- Wine
C5 -> F31[] -- Pizza (m), variant "pizza"
C5 -> F32[] -- Pizza (m), variant "Neapolitan pie"
...
C6 -> F31[] -- Pizza (f), variant "pizza"
C6 -> F32[] -- Pizza (f), variant "Neapolitan pie"
For completeness’ sake, here are the two new sequences:
lin
F27[C7,C5] := (S19,S20) [Mod] -- post,masc
F28[C8,C5] := (S21,S22) [Mod] -- pre,masc
F29[C7,C6] := (S23,S24) [Mod] -- post,fem
F30[C8,C6] := (S25,S26) [Mod] -- pre,fem
F31 := (S27,S28) [Pizza] -- variant "pizza"
F32 := (S29,S30) [Pizza] -- variant "Neapolitan pie"
sequences
S27 := "pizza"
S28 := "pizzas"
S29 := "Neapolitan" ++ "pie"
S30 := "Neapolitan" ++ "pies"
For big grammars, variants might blow up. To quote Angelov (2011):
The free variation is a useful tool but it have to be used carefully. Note that when the variants are concatenated, the number of possibilities multiply, so having too many variations could lead to combinatorial explosion. It is even worse when the explosion is not easy to find, for example the variants can be in operations which are only later expanded in the linearization rules. At the same time, if we refactor the abstract syntax and add explicit categories for operation [–], then we can eliminate the variations all together which usually improves the parsing efficiency.
So far we’ve just seen three more productions, one more concrete function and two more sequences. Aside from the productions, the PGF looks like I had just added a new lexical item—adding new lexicon surely doesn’t blow up grammars?
Let me demonstrate variants on a slightly more complex example. Since it’s really hard to blow up Foods, I’ll change the abstract syntax a bit.
abstract Foods = {
flags startcat = Comment ;
cat
Comment ; Item ; Kind ; Quality ;
fun
Pred : Kind -> Quality -> Comment ;
-- no more This, That, These, Those : Kind -> Item
Mod : Quality -> Kind -> Kind ;
Wine, Cheese, Fish, Pizza : Kind ;
Very : Quality -> Quality ;
Fresh, Good, Italian, Delicious, Vegan : Quality ;
}
In the standard Foods, Pred’s first argument is Item, and there are 4 functions that add a determiner to a Kind to make it into an Item.
Now instead, the first argument to Pred is Kind, and the choice of determiner is free variation: no matter if we type “this pizza is vegan” or “those pizzas are vegan”, we get the tree Pred Pizza Vegan.
We’re still doing Generic Romance here, so adjectives inflect in gender and number, copula only in number (and person, but everything’s 3rd person in Foods).
oper
Det : Type = {s : Gender => Str; n : Number} ;
this_Det, that_Det, these_Det, those_Det : Det = ...
copula : Agr => Str = ...
lin
Pred kind qual =
let det : Det = this_Det | that_Det
| these_Det | those_Det ;
np : NP = {
s = det.s ! kind.g ++ kind.s ! det.n ;
a = agr kind.g det.n} ;
in {s = np.s ++ copula ! np.a ++ qual.s ! np.a} ;
First let’s introduce the baseline, where we have Pred applied to Item. These are the productions, concrete functions and sequences used in Pred. (I can’t avoid showing strings anymore, so now Generic Romance gets instantiated as Spanish.)
-- Baseline: the original Foods
categories
Comment := range [C0 .. C0]
labels ["s"]
productions
C0 -> F19[C1,C9] -- sg. masc. Item, any Quality
C0 -> F20[C3,C9] -- pl. masc. Item, any Quality
C0 -> F21[C2,C9] -- sg. fem. Item, any Quality
C0 -> F22[C4,C9] -- pl. fem. Item, any Quality
lin
F19 := (S2) [Pred]
F20 := (S4) [Pred]
F21 := (S3) [Pred]
F22 := (S5) [Pred]
sequences
S2 := <0,0> "es" <1,1>
S3 := <0,0> "es" <1,5>
S4 := <0,0> "son" <1,3>
S5 := <0,0> "son" <1,7>
As expected, Comment expands to only one concrete category, with only one field. There are 4 productions, because there are 4 concrete categories for Item, first of Pred’s arguments in the original grammar.
Now we see the corresponding productions, concrete functions and sequences for the new grammar, with Pred : Kind -> Quality -> Comment and determiner as free variation.
-- Foods where determiner is free variation
categories
Comment := range [C0 .. C0]
labels ["s"]
productions
C0 -> F19[C5,C9] -- 64 productions
...
C0 -> F82[C6,C9]
lin
-- the corresponding 64 concrete functions
sequences
-- finally, 64 sequences
S19 := "esa" <0,0> "es" <1,5>
S21 := "esa" <0,0> "son" <1,5>
...
S80 := "estos" <0,1> "es" <1,3>
S82 := "estos" <0,1> "son" <1,3>
We get 64 productions, concrete functions and sequences for the original 4 of each. You can see already from the strings in the sequences, that this grammar overgenerates heavily. Esa is a singular determiner, but it happily appears with the plural son. Gender agreement is ignored in the same way, both inside a noun phrase (ese pizza) and as a predicative (pizza es vegano).
Why does this happen? The linearisation of Pred didn’t look that outrageous. But actually all occurrences of the variable det expand to include all the variants, so a more honest view is this:
Pred kind qual =
let np : NP = {
s = (this_Det | that_Det
|these_Det | those_Det).s ! kind.g ++
kind.s ! (this_Det | that_Det
|these_Det | those_Det).n ;
a = agr kind.g (this_Det | that_Det
|these_Det | those_Det).n} ;
in {s = np.s ++ copula ! np.a ++ qual.s ! np.a} ;
And that’s why we get 4³=64 productions, concrete functions and sequences.
This might not be a realistic scenario, in that maybe you don’t want to make determiners into free variation, but it’s educational. In the original code, I just put the 4 variants in one place, but it’s secretly expanded to 3 places, so the actual number of variants is 4³, not 4.
If you’re writing an resource grammar, it’s important not to introduce variants there, because they compound very quickly. Using variants in an application grammar is safer, but it might also lead to unexpected blowups. See for instance this GF question and my answer to it on Stack Overflow.
Here’s how to do variants in the intended way. The code in the locally defined addDet is identical to the old Pred, but we just restrict the scope, and create only 4 variants.
Pred kind qual =
let addDet : Det -> SS = \d ->
let np : NP = {
s = d.s ! kind.g ++ kind.s ! d.n ;
a = agr kind.g d.n} ;
in {s = np.s ++ copula ! np.a ++ qual.s ! np.a} ;
in addDet this_Det
| addDet that_Det
| addDet these_Det
| addDet those_Det ;
Plot twist: the version with variants has fewer productions, concrete functions and sequences. It’s not a fair comparison, because the new abstract syntax has fever funs: This, That, These and Those don’t exist in the new abstract, they’re just variants in Pred.
This is the last PGF dump pair I’ll show you, I promise. Here are sequences from the original grammar.
-- Original grammar
sequences
-- Pred : Item -> Quality -> Comment
S2 := <0,0> "es" <1,1>
S3 := <0,0> "es" <1,5>
S4 := <0,0> "son" <1,3>
S5 := <0,0> "son" <1,7>
-- : Kind -> Item
S23 := "esa" <0,0> -- That (fem. argument)
S24 := "esas" <0,1> -- Those (fem. argument)
S25 := "ese" <0,0> -- That (masc. argument)
S26 := "esos" <0,1> -- Those (masc. argument)
S27 := "esta" <0,0> -- This (fem. argument)
S28 := "estas" <0,1> -- These (fem. argument)
S29 := "este" <0,0> -- This (masc. argument)
S30 := "estos" <0,1> -- These (masc. argument)
The 12 sequences above cover 3 abstract syntax functions in the old grammar. Below we see 8 sequences, which cover for 1 abstract syntax function in the new grammar, with the 4 determiners as variants.
-- New grammar
sequences
-- Pred : Kind -> Quality -> Comment
S19 := "esa" <0,0> "es" <1,5>
S20 := "esas" <0,1> "son" <1,7>
S21 := "ese" <0,0> "es" <1,1>
S22 := "esos" <0,1> "son" <1,3>
S23 := "esta" <0,0> "es" <1,5>
S24 := "estas" <0,1> "son" <1,7>
S25 := "este" <0,0> "es" <1,1>
S26 := "estos" <0,1> "son" <1,3>
There are marginally fewer productions, concrete functions and sequences, but we’ve completely lost the distinction between all determiners. I am totally not advocating this, and especially not as a way to “optimise your grammar”. Just thought it was a funny result, and wanted to share in the name of transparency.
If you use variants and it blows up, see the post on gotchas for an alternative hack. (If your grammar blows up without variants, see the post on blowup for other tips.)
If you want to play around with the modified Foods, it’s all in my gf-contrib, branch variants. FoodsOld is the original grammar, and FoodsNew is the one with variants. There are three different concrete syntaxes, with varying degrees of overgenerating. To see the PGF dump, just load any file in a GF shell and type pg.
$ gf FoodsNewSpaOvergenerating.gf
FoodsNew> pg
… PGF output printed here …
So, now you can type pg in the GF shell and try to decipher your
own grammar! Why not also take a look at the
obscure
options
in gftest.