Update 2018-12-28: this post adapted into a presentation.
So I’m talking a bunch about PGF, PMCFG, concrete categories, sequences and
such when I talk about gftest
. If you ever wondered what’s that all
about, keep reading.
How about if you have never wondered what’s that all about? Honestly, that’s no problem either. It may be cool to learn about PMCFG if you’re writing a lot of GF, but only in the sense that it may be cool to learn some Italian if you eat a lot of pizza. So take this post as a cool extra tidbits about your favourite grammar formalism, not as a prerequisite for anything.
PMCFG is short for “Parallel Multiple Context-Free Grammar”, introduced by Seki et al. (1991). (An important subset of) GF is “equivalent to” PMCFGs, in the sense that all things we can express in GF, we can also express in PMCFG. As the name might suggest, PMCFGs are more expressive than plain old context-free grammars, but still not quite as expressive as context-sensitive grammars.
Okay, and what are the formats for writing PMCFG grammars? That is, “grammars that are as expressive as those by Seki et al., and that have the same hard-to-read low-level feel to them”.
We know of at least three:
It’s completely normal to have different syntaxes for the same thing. Think of regular grammars: you can write them as automata, regular expressions or A → Aa
style production rules (see e.g. this StackOverflow answer if you’re confused about this).
However, we don’t usually say that GF as a programming language is PMCFG, instead that it is equivalent to PMCFG.
PGF is short for “Portable Grammar Format”. It’s the concrete low-level format to which GF grammars are compiled. The syntax I use in this post is specific to PGF, not to PMCFG in general. (Technically, PGF is an extension of PMCFG with added features such as multilinguality, but seriously, nobody is going to die if we just say that PGF is a format for writing PMCFG grammars.)
All my examples come from the Foods grammar. I’m assuming that if you’re reading about advanced topics in GF, you’re familiar with Foods, but in case not, take a look at the abstract syntax and why not check out the concrete in some language you know.
Here we use a generic Romance language for concrete syntax: adjectives
inflect in number and gender, nouns have a fixed gender and inflect in
number; most adjectives come after the noun but some come before.
Here’s GF code for the Mod
function.
lincat
Kind = {s : Number => Str ; g : Gender} ;
Quality = {s : Number => Gender => Str ; isPre : Bool} ;
lin
-- Mod : Quality -> Kind -> Kind ;
Mod quality kind = {
s = \\num =>
let q = quality.s ! num ! kind.gender ;
k = kind.s ! num ;
in case quality.isPre of {
True => q++k ;
False => k++q } ;
g = kind.g
} ;
So Mod
constructs a table of Number => Str
, choosing the right
gender form of Quality
and putting it either before or after the
Kind
, depending on the parameter isPre
of Quality
.
This is the actual output you get when you type pg
(print_grammar)
in the GF shell! Let’s start from the concrete categories for Kind
and Quality
.
categories
Kind := range [C5 .. C6]
labels ["s Sg"
"s Pl"]
Quality := range [C7 .. C8]
labels ["s Masc Sg"
"s Masc Pl"
"s Fem Sg"
"s Fem Pl"]
Okay, this tells us that both Kind
and Quality
have been split
into two concrete categores (C5-C6
and C7-C8
respectively).
The things in labels
are variable to the category. This is easy to
see from the names of the labels—Sg, Pl, Masc, Fem are all names of
the parameters in the original tables.
What used to be Kind
consists of two strings, labelled “s Sg” and “s Pl”:
there used to be a table Number => Str
in the field s
, but now
it’s just flattened into strings. So far not a big deal: we just went
from a record with tables {s : Number => Str}
into a vector of
strings [sSg, sPl]
.
The same happened to Quality
: it used to have a nested table {s :
Number => Gender => Str}
, but now all combinations are enumerated.
The things in range
are inherent to the category: as we will soon
see, C5
corresponds to masculine Kind
, and C6
to
feminine. Likewise, C7
corresponds to postmodifier Quality
(i.e. isPre
field is False) and C8
to premodifier (isPre
field
is True).
Now let’s have a closer look at C5
–C6
. Which productions make them?
productions
C5 -> F24[] -- Cheese
C5 -> F25[] -- Fish
C5 -> F26[] -- Wine
C5 -> F27[C7,C5] -- Mod
C5 -> F28[C8,C5] -- Mod
C6 -> F29[C7,C6] -- Mod
C6 -> F30[C8,C6] -- Mod
C6 -> F31[] -- Pizza
I added the comments (Cheese–Pizza) to make it clearer, otherwise it’s
straight from pg
. This is what we learn:
F24
–F28
produce a C5
(masculine Kind
).F29
–F31
produce a C6
(feminine Kind
).F24
,F25
,F26
and F31
take no arguments.F27
,F28
,F29
and F30
take 2 arguments, their types
specified inside []
.You don’t know it anywhere from the PGF that C5
is in fact
masculine, you just have to know if based on e.g. the fact that
Cheese, Fish and Wine are all masculine nouns in Generic Romance. If
you actually go and look at F24
and follow the trail, you will
eventually get to the string "cheese"
, and then you know that F24
corresponds to Cheese
.
Anyway, we were interested in Mod
! So we better look up
F27
–F30
. Let’s check out Pizza
as well, just to appreciate the
difference between functions with and without arguments.
lin
F27[C7,C5] := (S19,S20) [Mod] -- post,masc
F28[C8,C5] := (S21,S22) [Mod] -- pre,masc
F29[C7,C6] := (S23,S24) [Mod] -- post,fem
F30[C8,C6] := (S25,S26) [Mod] -- pre,fem
F31 := (S27,S28) [Pizza]
Okay, we have split Mod
into several concrete functions
(called F27-F30
) according to the concrete categories of their
arguments. There are 4 of them, because {masc,fem} x {post,pre}
makes 4 combinations.
Pizza
doesn’t change a lot: it’s just called F31
now, but it
didn’t split into many.
Next, what are those S19
–S28
? They’re called sequences. There
are 2 sequences for each concrete Mod
(and Pizza
), because the
goal category (C5
or C6
, i.e. masculine or feminine Kind
) has
two strings in it: singular and plural.
The number of arguments is irrelevant: Mod
takes 2 arguments and Pizza
takes none, but both expand to 2 sequences.
The real action happens in sequences. I have added all comments myself, otherwise the output is straight from pg
.
sequences
-- Mod
S19 := <1,0> <0,0> -- post,masc,sg
S20 := <1,1> <0,1> -- post,masc,pl
S21 := <0,0> <1,0> -- pre,masc,sg
S22 := <0,1> <1,1> -- pre,masc,pl
S23 := <1,0> <0,2> -- post,fem,sg
S24 := <1,1> <0,3> -- post,fem,pl
S25 := <0,2> <1,0> -- pre,fem,sg
S26 := <0,3> <1,1> -- pre,fem,pl
-- Pizza
S27 := "pizza"
S28 := "pizzas"
Let’s start with lexical functions. We renamed Pizza
into F31
,
defined it in terms of S27
and S28
, which finally resolve into
the strings “pizza” and “pizzas”.
How about the others, with the weird numbers? These <n,m>
are,
respectively, <#argument, #string>
.
Just like Mod
itself, the sequences used for all concrete versions
of Mod
have two arguments. We don’t even bother to name
them–they’re just called 0
and 1
, short for “first argument” and
“second argument”.
So what is the order of the arguments? You can look at the
productions with F27[C7,C5]
and so on, or just the
original GF code, with Mod : Quality -> Kind -> Kind
. Both will
tell you that 0
corresponds to Quality
, and 1
corresponds to
Kind
.
The second number in the <n,m>
is the index of the string in the
argument. As we remember from concrete categories,
Quality
has 4 strings, Kind
has 2 strings. Now we can tell that
S23 := <1,0> <0,2>
builds a new string out of the first string of
its second argument, and the third string of its first
argument. Here’s an example pair of arguments:
0 = ["vegano", "veganos", "vegana", "veganas"]
1 = ["pizza", "pizzas"]
Given these arguments to S23
, we get the string pizza vegana.
Okay, that S23
was only the first part of the concrete function
F29
. Here’s the definition again:
F29[C7,C6] := (S23,S24)
Applied to the same arguments, S24
produces pizzas veganas. This
makes sense: F29
applied to Vegan
and Pizza
produces the vector
["pizza vegana", "pizzas veganas"]
.
Say that we want to be able to capture learner language: for instance, I couldn’t care less about the gender of nouns. I trust that people understand me, even if I say pizza vegano. In GF concrete syntax, the grammarian could model that as follows:
lin Pizza = {s = table {Sg => "pizza" ;
Pl => "pizzas" } ;
g = Masc|Fem } ;
Remember that masculine nouns are compiled into the concrete category
C5
and feminine nouns into C6
. The difference to our changed
grammar in the PGF is simple: just add one more production from C5
to F31
.
productions
C5 -> F24[] -- Cheese
C5 -> F25[] -- Fish
C5 -> F26[] -- Wine
C5 -> F31[] -- Pizza (m)
...
C6 -> F31[] -- Pizza (f)
How about adding another string as a variant? This time I’m not even trying to make a plausible real-world scenario, but here’s the grammarian’s solution to allowing “pizza” or “Neapolitan pie”:
lin Pizza = {s = table {Sg => "pizza"|"Neapolitan pie" ;
Pl => "pizzas"|"Neapolitan pies" } ;
g = Masc|Fem } ;
This time we need a new concrete syntax function, let’s just call it
F32[]
and it will expand to two new sequences.
productions
C5 -> F24[] -- Cheese
C5 -> F25[] -- Fish
C5 -> F26[] -- Wine
C5 -> F31[] -- Pizza (m), variant "pizza"
C5 -> F32[] -- Pizza (m), variant "Neapolitan pie"
...
C6 -> F31[] -- Pizza (f), variant "pizza"
C6 -> F32[] -- Pizza (f), variant "Neapolitan pie"
For completeness’ sake, here are the two new sequences:
lin
F27[C7,C5] := (S19,S20) [Mod] -- post,masc
F28[C8,C5] := (S21,S22) [Mod] -- pre,masc
F29[C7,C6] := (S23,S24) [Mod] -- post,fem
F30[C8,C6] := (S25,S26) [Mod] -- pre,fem
F31 := (S27,S28) [Pizza] -- variant "pizza"
F32 := (S29,S30) [Pizza] -- variant "Neapolitan pie"
sequences
S27 := "pizza"
S28 := "pizzas"
S29 := "Neapolitan" ++ "pie"
S30 := "Neapolitan" ++ "pies"
For big grammars, variants might blow up. To quote Angelov (2011):
The free variation is a useful tool but it have to be used carefully. Note that when the variants are concatenated, the number of possibilities multiply, so having too many variations could lead to combinatorial explosion. It is even worse when the explosion is not easy to find, for example the variants can be in operations which are only later expanded in the linearization rules. At the same time, if we refactor the abstract syntax and add explicit categories for operation [–], then we can eliminate the variations all together which usually improves the parsing efficiency.
So far we’ve just seen three more productions, one more concrete function and two more sequences. Aside from the productions, the PGF looks like I had just added a new lexical item—adding new lexicon surely doesn’t blow up grammars?
Let me demonstrate variants on a slightly more complex example. Since it’s really hard to blow up Foods, I’ll change the abstract syntax a bit.
abstract Foods = {
flags startcat = Comment ;
cat
Comment ; Item ; Kind ; Quality ;
fun
Pred : Kind -> Quality -> Comment ;
-- no more This, That, These, Those : Kind -> Item
Mod : Quality -> Kind -> Kind ;
Wine, Cheese, Fish, Pizza : Kind ;
Very : Quality -> Quality ;
Fresh, Good, Italian, Delicious, Vegan : Quality ;
}
In the standard Foods, Pred
’s first argument is Item
, and there are 4 functions that add a determiner to a Kind
to make it into an Item
.
Now instead, the first argument to Pred
is Kind
, and the choice of determiner is free variation: no matter if we type “this pizza is vegan” or “those pizzas are vegan”, we get the tree Pred Pizza Vegan
.
We’re still doing Generic Romance here, so adjectives inflect in gender and number, copula only in number (and person, but everything’s 3rd person in Foods).
oper
Det : Type = {s : Gender => Str; n : Number} ;
this_Det, that_Det, these_Det, those_Det : Det = ...
copula : Agr => Str = ...
lin
Pred kind qual =
let det : Det = this_Det | that_Det
| these_Det | those_Det ;
np : NP = {
s = det.s ! kind.g ++ kind.s ! det.n ;
a = agr kind.g det.n} ;
in {s = np.s ++ copula ! np.a ++ qual.s ! np.a} ;
First let’s introduce the baseline, where we have Pred
applied to Item
. These are the productions, concrete functions and sequences used in Pred
. (I can’t avoid showing strings anymore, so now Generic Romance gets instantiated as Spanish.)
-- Baseline: the original Foods
categories
Comment := range [C0 .. C0]
labels ["s"]
productions
C0 -> F19[C1,C9] -- sg. masc. Item, any Quality
C0 -> F20[C3,C9] -- pl. masc. Item, any Quality
C0 -> F21[C2,C9] -- sg. fem. Item, any Quality
C0 -> F22[C4,C9] -- pl. fem. Item, any Quality
lin
F19 := (S2) [Pred]
F20 := (S4) [Pred]
F21 := (S3) [Pred]
F22 := (S5) [Pred]
sequences
S2 := <0,0> "es" <1,1>
S3 := <0,0> "es" <1,5>
S4 := <0,0> "son" <1,3>
S5 := <0,0> "son" <1,7>
As expected, Comment
expands to only one concrete category, with only one field. There are 4 productions, because there are 4 concrete categories for Item
, first of Pred
’s arguments in the original grammar.
Now we see the corresponding productions, concrete functions and sequences for the new grammar, with Pred : Kind -> Quality -> Comment
and determiner as free variation.
-- Foods where determiner is free variation
categories
Comment := range [C0 .. C0]
labels ["s"]
productions
C0 -> F19[C5,C9] -- 64 productions
...
C0 -> F82[C6,C9]
lin
-- the corresponding 64 concrete functions
sequences
-- finally, 64 sequences
S19 := "esa" <0,0> "es" <1,5>
S21 := "esa" <0,0> "son" <1,5>
...
S80 := "estos" <0,1> "es" <1,3>
S82 := "estos" <0,1> "son" <1,3>
We get 64 productions, concrete functions and sequences for the original 4 of each. You can see already from the strings in the sequences, that this grammar overgenerates heavily. Esa is a singular determiner, but it happily appears with the plural son. Gender agreement is ignored in the same way, both inside a noun phrase (ese pizza) and as a predicative (pizza es vegano).
Why does this happen? The linearisation of Pred
didn’t look that outrageous. But actually all occurrences of the variable det
expand to include all the variants, so a more honest view is this:
Pred kind qual =
let np : NP = {
s = (this_Det | that_Det
|these_Det | those_Det).s ! kind.g ++
kind.s ! (this_Det | that_Det
|these_Det | those_Det).n ;
a = agr kind.g (this_Det | that_Det
|these_Det | those_Det).n} ;
in {s = np.s ++ copula ! np.a ++ qual.s ! np.a} ;
And that’s why we get 4³=64 productions, concrete functions and sequences.
This might not be a realistic scenario, in that maybe you don’t want to make determiners into free variation, but it’s educational. In the original code, I just put the 4 variants in one place, but it’s secretly expanded to 3 places, so the actual number of variants is 4³, not 4.
If you’re writing an resource grammar, it’s important not to introduce variants there, because they compound very quickly. Using variants in an application grammar is safer, but it might also lead to unexpected blowups. See for instance this GF question on Stack Overflow. I haven’t (yet!) compiled a smaller version of that code and looked at the PGF, but I imagine that those 8 verbs end up being used in multiple places in the VP.
Here’s how to do variants in the intended way. The code in the locally defined addDet
is identical to the old Pred
, but we just restrict the scope, and create only 4 variants.
Pred kind qual =
let addDet : Det -> SS = \d ->
let np : NP = {
s = d.s ! kind.g ++ kind.s ! d.n ;
a = agr kind.g d.n} ;
in {s = np.s ++ copula ! np.a ++ qual.s ! np.a} ;
in addDet this_Det
| addDet that_Det
| addDet these_Det
| addDet those_Det ;
Plot twist: the version with variants has fewer productions, concrete functions and sequences. It’s not a fair comparison, because the new abstract syntax has fever funs: This
, That
, These
and Those
don’t exist in the new abstract, they’re just variants in Pred
.
This is the last PGF dump pair I’ll show you, I promise. Here are sequences from the original grammar.
-- Original grammar
sequences
-- Pred : Item -> Quality -> Comment
S2 := <0,0> "es" <1,1>
S3 := <0,0> "es" <1,5>
S4 := <0,0> "son" <1,3>
S5 := <0,0> "son" <1,7>
-- : Kind -> Item
S23 := "esa" <0,0> -- That (fem. argument)
S24 := "esas" <0,1> -- Those (fem. argument)
S25 := "ese" <0,0> -- That (masc. argument)
S26 := "esos" <0,1> -- Those (masc. argument)
S27 := "esta" <0,0> -- This (fem. argument)
S28 := "estas" <0,1> -- These (fem. argument)
S29 := "este" <0,0> -- This (masc. argument)
S30 := "estos" <0,1> -- These (masc. argument)
The 12 sequences above cover 3 abstract syntax functions in the old grammar. Below we see 8 sequences, which cover for 1 abstract syntax function in the new grammar, with the 4 determiners as variants.
-- New grammar
sequences
-- Pred : Kind -> Quality -> Comment
S19 := "esa" <0,0> "es" <1,5>
S20 := "esas" <0,1> "son" <1,7>
S21 := "ese" <0,0> "es" <1,1>
S22 := "esos" <0,1> "son" <1,3>
S23 := "esta" <0,0> "es" <1,5>
S24 := "estas" <0,1> "son" <1,7>
S25 := "este" <0,0> "es" <1,1>
S26 := "estos" <0,1> "son" <1,3>
There are marginally fewer productions, concrete functions and sequences, but we’ve completely lost the distinction between all determiners. I am totally not advocating this, and especially not as a way to “optimise your grammar”. Just thought it was a funny result, and wanted to share in the name of transparency.
If you use variants and it blows up, see the post on gotchas for an alternative hack. (If your grammar blows up without variants, see the post on blowup for other tips.)
If you want to play around with the modified Foods, it’s all in my gf-contrib, branch variants. FoodsOld
is the original grammar, and FoodsNew
is the one with variants. There are three different concrete syntaxes, with varying degrees of overgenerating. To see the PGF dump, just load any file in a GF shell and type pg
.
$ gf FoodsNewSpaOvergenerating.gf
FoodsNew> pg
… PGF output printed here …
So, now you can type pg
in the GF shell and try to decipher your
own grammar! Why not also take a look at the
obscure
options
in gftest
.