So now you know the two different formalisms I’ve worked with. Here I’m rambling a bit about various connections between them, or really anything that pops up in my mind.

The GF grammarian (or an informant) had to come up with phenomena to test, and trees that capture those phenomena. Now both steps are fully automated: the grammarian only needs to input a function or category, and the system outputs a minimal and representative set of sentences. Humans are still needed to read and evaluate the example sentences, as well as write the grammars in the first place. I am an active user of this system, and so happy of its existence that I don’t really need a degree to be happier. There are even external users who are not me, which after 30 years of existing and 10+ years of programming feels pretty amazing.

On the CG side, I have also automated a task, but I’m not sure if it’s a task people were doing widely. Sure, CG grammarians evaluate their grammars—I’ve seen in practice people applying their rules to a corpus, and reading through output and judging if it’s fine. But the kind of systematic, “do some of these rules prevent others from applying” seems to not have been a common activity. Automatic reordering has been a thing, but with a black-box machine learning rather than formal methods such as SAT-solvers.

What I’m saying is that my CG tools haven’t had any adoption. In a sense this is not surprising—I was never a CG writer myself, and the starting point was honestly just “SAT is cool, let’s find a problem”. The SAT-encoding itself works as a parallel CG engine, meaning that it would complain even well-ordered sequences of rules, if they “contradict” each other. Our first version was not of much practical use, but I could make smug remarks like quoting Fred Karlsson (1995, page 11):

More generally, the formalism should be such that individual constraints are unordered and maximally independent of each other. These properties facilitate incremental adding of new constraints to the grammar, especially to a large one.

I don’t know if anyone had bothered to actually try to execute a full existing CG grammar in parallel after this was written. But I made that remark already in my licentiate thesis, so let’s move on.

So, the initial parallel SAT-encoding was not much use neither for
writing new CG rules nor executing or testing old CG rules. Thus we
made a second one, with added state. This turned out to be more
useful: the conflicts it reports are *actually conflicts* in an
actually execution order. I’m still hopeful that people may have liked
it if I had made it easier for them—the tool is written in Haskell
(which made the only potential user who tried to install it to joke
about the relevant xkcd), and it’s not
integrated to any existing workflow.

I don’t know about you but “automatic systematic” just makes me think of this.

Anyway, let’s start from CG now because it’s easier to reason about. A CG grammar can be “good” or “bad” according to purely internal criteria. Good means just that the rules are internally consistent, and bad the opposite of that. The point of symbolic evaluation is to explore all possibilities: if there was a way in which some rule would apply (e.g. make another rule not apply and thus preserving a reading; or create a solution in which a particular reading was already false), the SAT-solver would have found it. You know, that’s just how SAT-solvers work. So we can be confident that if the system reports no conflicts, there are no conflicts—“what if we just run a few more tests” doesn’t make sense to worry about.

It is a separate question (not my research question!) whether the CG grammar also does a good job at disambiguating natural language. Such questions are better left for corpus-based methods.

On the GF side, we actually are trying to answer the harder question: does this grammar do an adequate job at producing English/French/…? For that, we need a set of test sentences that covers all phenomena in the grammar. In the CG world, we had a natural restriction: there is only a few hundreds or thousands of different tags in the morphological lexicon, and there are restrictions (in a lucky case also well-documented and not like 3 different Word documents that are deprecated in a different way) how they mix together into readings. So the language of the tags and readings is finite—sounds reasonable that we can do the job thoroughly.

But natural language is infinite. How does that work with a finite set of test cases?

First of all, we are testing a PMCFG representation of natural
language. This representation is also infinite: for example, the
grammar can parse and produce a sentence with arbitrarily many
subordinate clauses
(*I know that you think that she disagrees that … this example is
silly*), only limited by physical constraints such as memory or having
a source of power after all humans die. But importantly, the grammar
still has a finite number of *functions*. Let’s see, can we test one
function with finite test cases?

Take any GF function `f : A -> B -> C`

. All categories `A`

, `B`

and
`C`

may contain fields with strings, tables and parameters. Recall
that in the PMCFG, first all categories `A`

, `B`

and `C`

are compiled
into several concrete categories, one for each combination of
parameters. Then, the function `f`

is compiled into several concrete
functions, one for each combination of concrete categories. This is
still finite, and just makes our job easier! Now we can ask: can we
test one *concrete* function with finite test cases?

Let’s take a bit more concrete example: ```
Mod : Quality -> Kind ->
Kind
```

from Foods grammar, Spanish concrete syntax. It compiles into 4
concrete functions:

```
Mod_pre_fem : Quality_pre -> Kind_fem -> Kind_fem
Mod_post_fem : Quality_post -> Kind_fem -> Kind_fem
Mod_pre_masc : Quality_pre -> Kind_masc -> Kind_masc
Mod_post_masc : Quality_post -> Kind_masc -> Kind_masc
```

`Mod_pre_fem`

takes a feminine `Kind`

and a premodifier `Quality`

. As
you can guess by the name (just kidding, in the PMCFG it’s like
`Kind_3493`

and `Quality_6270`

so you can’t guess anything), it
chooses the feminine forms (singular and plural) of the `Quality`

argument, and places them before the singular and plural forms of
`Kind`

argument. If this function fails, we can totally scold it by
saying “you had one job”.

So what inputs does this function need? Duh, one premodifier `Quality`

and one feminine `Kind`

. At this level, all we need to do is to choose
*maximally representative* premodifier and feminine arguments. In
other words, we prefer words with maximally unique^{1} inflection tables. If we find such
arguments, then `Mod_pre_fem`

should also construct an inflection
table with unique strings. Or if it doesn’t, at least we know that the
repetition is due to `Mod_pre_fem`

and not its arguments.

Okay, now we agree that the *arguments* to the functions are finite.
Now it’s just contexts left. In general, a test case needs as many
contexts as it has strings—certainly a finite amount.
In order to be representative, we need to make sure that the contexts
don’t introduce the same words that appear in the test cases: if the
test case is *likes the dog* and the context happens to include *the
dog* as a subject, we cannot be certain which *dog* comes from the
test case and which one from the context.

Summary of what I just said:

“How can we test a whole big infinite grammar?”

Easy: there’s a finite amount of functions, just test each function separately.

“How can we be sure to use one function in all the different ways?”

No problem: split it into several functions that can all be used in only
one way, then test each of them.

Summary for those who like their summaries with bullet points:

- Finite categories in GF grammar
- Finite concrete categories in PMCFG

- Finite functions in GF grammar
- Finite concrete functions in PMCFG
- Finite arguments to the concrete functions → finite test cases
- (Combinations of) arguments with unique strings

- Finite contexts for the test cases
- Unique words in the contexts

Now let’s look at how we can still mess things up.

A given CG grammar is internally consistent *given a
certain tagset, rules to combine tags into readings, and ambiguity
classes*. All of this information comes from a morphological
lexicon, and we all know that morphological lexica are generated at
airports at 04:00 after the conference dinner, because that was the
cheapest flight and you didn’t have to pay for one more night of
accommodation, and the free drinks would keep you up and cheerful
until your flight departs.

So, CG analysis is as reliable as is the morphological lexicon where the tags and the readings come from. How about GF test cases?

I just tried to convince you how the system produces a test case for
each combination of parameters. But the failure mode here is, simply,
*not enough parameters*. Consider the `Mod`

function again, and say
that we don’t have the parameter for pre- and postmodifier at all,
and `Mod`

only comes in two variants, one for feminine and one for
masculine. Suppose that all adjectives are, incorrectly,
postmodifiers. Then we would notice the error only if the single `Quality`

argument for `Mod_fem`

and `Mod_masc`

happened to be a premodifier
one.

If the missing parameter handles a common phenomenon, it is more
likely that we catch the error even with missing parameters. This was
the case with, for example, Dutch preposition contraction (which I’ve
explained in Chapter 4 of the thesis). Originally, none of the
prepositions contracted, but this is the most common behaviour—only a
handful of prepositions *don’t* contract. Thus we got some test cases
which showed the error, and went on to fix the grammar. After the
fixes, the system created separate test cases for contracting and
non-contracting prepositions.

Test your grammars! I don’t know. I’ll stop this now anyway, it’s pretty long and I have more to blog! Next up, Koen’s Fancy Fixpoints and/or other interesting statistics we can gather from GF grammars:

- Which fields are always the same (i.e. is there a parameter that doesn’t make a difference?)
- Are some fields always empty?
- Are some fields or categories unreachable from another category?
- Are some subtrees erased by some function?

Stay tuned!

1: *I, me, my* is a better
example than either *you, you, your* or *she, her, her* alone, because
*I* only has unique strings, and *you* and *she* repeat some strings.
But together *you* and *she* cover everything that *I* does alone.