7 October 2020

Strategies for adding linguistic distinctions to a GF grammar

You have written a GF grammar: an abstract syntax, and a concrete or two for Standard Average European. As you add another language, you find the need to include more distinctions. For example, suppose you wrote your grammar based on English, and only included one You in the abstract syntax. Now when you add a French concrete syntax, you suddenly have two words for you: tu and vous, with different effects on verb agreement. The solution is to split your abstract syntax You into YouSg and YouPl, and just linearise them identically in English.
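To make the split concrete, here is a minimal sketch of what it could look like. The module names (Address, AddressEng, AddressFre) and the Sleep function are hypothetical and only serve as illustration. The grammar uses plain strings instead of the resource grammar library, and each module is assumed to live in its own file.

```gf
-- file Address.gf (hypothetical example, not from gf-contrib)
abstract Address = {
  flags startcat = Comment ;
  cat Comment ; Pron ;
  fun
    Sleep : Pron -> Comment ;
    YouSg, YouPl : Pron ;      -- the split that French needs
}

-- file AddressEng.gf: English does not make the distinction,
-- so both functions get the same linearisation.
concrete AddressEng of Address = {
  lincat Comment, Pron = {s : Str} ;
  lin
    Sleep pron = {s = pron.s ++ "sleep"} ;
    YouSg, YouPl = {s = "you"} ;
}

-- file AddressFre.gf: French has two pronouns,
-- and the verb agrees with the number of the pronoun.
concrete AddressFre of Address = {
  param Number = Sg | Pl ;
  lincat
    Comment = {s : Str} ;
    Pron = {s : Str ; n : Number} ;
  lin
    Sleep pron = {s = pron.s ++ dormir ! pron.n} ;
    YouSg = {s = "tu" ; n = Sg} ;
    YouPl = {s = "vous" ; n = Pl} ;
  oper
    dormir : Number => Str = table {Sg => "dors" ; Pl => "dormez"} ;
}
```

The English concrete syntax stays as simple as before; only the French one carries the extra Number parameter.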

In the rest of this post, I will discuss a slightly bigger example. To make this more relevant for people using GF outside their armchair, I’m assuming that this grammar is used as a component in some natural language generation (NLG) application.

NLG application
GF grammar
New linguistic challenges
Change abstract syntax
Change only concrete syntax
Conclusion

NLG application

[Figure: nlg_pizza_order]

Suppose you have a restaurant application like above. You can choose your dish, and then choose specifics of the dish: toppings for pizza, fillings for lasagna and so on. After you have chosen your dish, an order confirmation is displayed as a natural language sentence: “Your pizza has [toppings]”. This natural language sentence is, of course, generated by a GF grammar.

GF grammar

Suppose that the application is slightly more advanced. Maybe you can even describe the dish’s properties.

abstract Foods = {
 flags startcat = Comment ;
 cat
   Comment ; Item ; Kind ; Quality ;
 fun
   Is : Item -> Quality -> Comment ; -- Your pizza is vegan
   Has : Item -> Item -> Comment ;   -- Your pizza has pine seeds

   Your : Kind -> Item ;             -- your pizza
   Mass : Kind -> Item ;             -- zucchini
   Plural : Kind -> Item ;           -- pine seeds
   ConjItem :
     Item -> Item -> Item ;          -- zucchini and pine seeds
   Mod : Quality -> Kind -> Kind ;   -- vegan pizza
   Pizza, Lasagna, Risotto : Kind ;
   PineSeed, Mush, Zucchini : Kind ;
   Colorless, Green,
     Indeterminate, Vegan : Quality ;
}

You may have seen something similar before. This abstract syntax and the English concrete syntax are found in my fork of gf-contrib, directory foods/zucchini/initial.

What can we say of this grammar? Its abstract syntax is not very semantically restrictive: it happily generates both of these sentences.

> gr -number=2 Has ? ? | l -treebank
Foods: Has (Your Pizza) (Mass Zucchini)
FoodsEng: your pizza has zucchini

Foods: Has (Mass Zucchini) (Your Pizza)
FoodsEng: zucchini has your pizza

I think it’s completely fine to not have selectional restrictions in the GF grammar. The actual restrictions come from the external program that generates the GF trees. But if it makes you happier, by all means do split Item into Dish and Ingredient.

fun
  Is : Dish -> Quality -> Comment ;
  Has : Dish -> Ingredient -> Comment ;

-- Exercise to the interested reader:
-- update the rest of the abstract syntax to match!

-- Hint: you can keep Kind as it is, but then you'll
-- need a parameter in the lincat of Kind.

With a more restrictive abstract syntax, you’re more likely to get sentences that make sense, whether you run gr in the GF shell or use gftest. However, for the purposes of this post, we continue with the original design, and only change it when the need arises.

New linguistic challenges

The problem

This is the most important thing to understand: what is the problem that we are solving?

I’m not talking about plausibility of the sentences. “Zucchini has your pizza” is obviously the wrong thing to say to a customer who has made an order of a zucchini pizza. But I trust the external program to never generate such nonsense in the first place. The grammar allows me to communicate the right thing, “your pizza has zucchini”, and that’s all I care about.

I am talking about the case where the grammar has no way of generating the correct sentence for a new language. Remember the abstract syntax with only one You: if we choose You to be linearised as youSg_Pron, then there’s no way of addressing a group of customers in French. This is the type of problem I address in this post.

Introducing Pretenglish (“Pretend English”)

All examples are going to be in English, with some imaginary restrictions. If you’ve read my post on agreement, you know how much I like to show off my copypasting-from-Wiktionary skills, but for this post, I believe it’s clearer if we skip the glosses and explanations. You may find the particular expressions contrived, but these situations happen all the time.

To distinguish this pretend English from actual English, we’ll call it Pretenglish. (There’s no language with ISO codes pre or pr.) I introduce the differences between Pretenglish and English in the next section.

Concrete syntax in Pretenglish

Our restaurant business grows, and we want to adapt the ordering app to other languages. No problem, GF was made for just this! Let’s write a concrete syntax in Pretenglish and check its output with a fluent speaker.

> l -treebank Is (Your Zucchini) Green
FoodsPre: your zucchini is green

«Actually,» our informant tells us, «the idiomatic Pretenglish is “the color of your zucchini is green”. We understand what you mean by “zucchini is green”, but it reads like machine translation. GF can surely do better.» We note this on our TODO list and continue the QA round.

> gr -number=2 Has (Your ?) (Mass Zucchini) | l -treebank
Foods: Has (Your Pizza) (Mass Zucchini)
FoodsPre: your pizza has zucchini

Foods: Has (Your Risotto) (Mass Zucchini)
FoodsPre: your risotto has zucchini

The informant speaks again. «There’s no verb “to have” in Pretenglish. You need to specify whether the zucchini is on top of the dish, or mixed in the dish. “The dish has zucchini” is not merely strange, it’s ungrammatical.»

We could go on, but the point becomes clear with these two. In the rest of the post, I’ll show different ways to introduce these distinctions in the grammar.

Change abstract syntax

If you’ve read my post on low-level hacks, you may guess what I suggest:

Adapt the abstract syntax.

We can interpret this advice in different ways.

Hardcore: change types, major restructuring

This approach could be for you, if you don’t like the idea of a GF grammar that can even in theory output anything wrong. You feel uneasy until you can split Item into Dish and Ingredient, and now you want to split Quality into colour and non-colour types. I have written an implementation so that you don’t have to, found at foods/zucchini/hardcore. This is a fragment of the new design.

fun
  IsColor : Dish -> ColorQuality -> Comment ;
  IsOther : Dish -> OtherQuality -> Comment ;
  HasInside : InsideDish -> Ingredient -> Comment ;
  HasOnTop : OnTopDish -> Ingredient -> Comment ;

The types are now carefully engineered, and you can be sure that no tree is ungrammatical. (Until you add the next language.) To demonstrate, everything here is correct Pretenglish:

> gt -depth=2 | l
your lasagna has pine seeds inside
your lasagna has zucchini inside
your risotto has pine seeds inside
your risotto has zucchini inside
your pizza has pine seeds on top
your pizza has zucchini on top
the color of your lasagna is green
the color of your risotto is green
the color of your pizza is green
your lasagna is colorless
your lasagna is vegan
your risotto is colorless
your risotto is vegan
your pizza is colorless
your pizza is vegan
…

However, I don’t actually recommend this approach for most situations. Some reasons why:

- It can make the grammar overly heavy. To quote Michal Měchura:

  “The first suggestion is to design a complex hierarchy of types and subtypes in your abstract grammar. So, for example, you would have one type for food items which can be described as hot, another for those which can be described as fresh and so on. I find this solution unsatisfactory because it causes more problems than it solves:
  - If an item belongs in more than one type, for example pizza which can be described both as hot and as fresh, then it needs to exist in your grammar more than once. This bloats the grammar up and misses a generalization. Ideally you want to have only one pizza entity on your grammar.
  - GF doesn’t really do subtyping, you can only fake it with type coercion functions. This makes your abstract syntax trees more complex than they need to be.”

- It creates more work in other languages, which don’t even have the distinction. However, if Pretenglish is the second concrete syntax after English, then a major revamp is most likely appropriate. (Ideally, you should design the abstract syntax with at least a couple of different languages in mind!) But if you have 10 languages already, and need to rewrite all of them due to a distinction in Pretenglish, that’s just not fun. Even if you went through all that trouble, your grammar might still not cover the distinctions of the next language.
  - This is a good argument for using a functor. It might be extra work in the beginning to get the functor structure right, but if you ever need to restructure the grammar, it’s much less painful to do it just once for the functor, instead of for each individual concrete syntax.
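To illustrate what faking subtypes with coercion functions means in practice, here is a small hypothetical sketch. The module name and the function names (OnTop2Dish, Inside2Dish, IsVegan and so on) are made up for this example and are not taken from foods/zucchini/hardcore.

```gf
-- file Coercions.gf (hypothetical sketch)
abstract Coercions = {
  flags startcat = Comment ;
  cat
    Comment ; Dish ; OnTopDish ; InsideDish ; Ingredient ;
  fun
    -- Coercions: the only way to say "every OnTopDish is also a Dish"
    OnTop2Dish  : OnTopDish  -> Dish ;
    Inside2Dish : InsideDish -> Dish ;

    IsVegan   : Dish -> Comment ;
    HasOnTop  : OnTopDish  -> Ingredient -> Comment ;
    HasInside : InsideDish -> Ingredient -> Comment ;

    YourPizza   : OnTopDish ;
    YourLasagna : InsideDish ;
    Zucchini    : Ingredient ;
}
```

With these types, HasOnTop YourPizza Zucchini type checks directly, but the vegan comment needs the extra coercion node: IsVegan (OnTop2Dish YourPizza). The external program has to know where to insert such nodes, which is the added tree complexity the quote refers to.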

Softcore: add constructions with default implementations

The hardcore person’s philosophy is: “I want every tree in my grammar to be correct.” (100% precision.)

The softcore person’s philosophy is: “I want my grammar to contain the trees that I need.” (100% recall.)

In the softcore strategy, we don’t try to eliminate the “zucchini is green” tree from Pretenglish. We just add a correct tree on top¹. The implementation is in the directory foods/zucchini/softcore, but the whole thing is so short that I’ll paste it all.

--# -path=.:../initial
abstract FoodsSoftcore = Foods ** {
  flags startcat = Comment ;
  fun
    IsColor : Item -> Quality -> Comment ;
    HasInside,
      HasOnTop : Item -> Item -> Comment ;
}

We extend the old Foods grammar (found at foods/zucchini/initial, so we need an appropriate -path flag), and add three new constructions on top of the old ones. Here’s how to implement them in Pretenglish.²

--# -path=.:../initial
concrete FoodsSoftcorePre of FoodsSoftcore = FoodsEng ** {
 lin
   IsColor item qual = {
     s = "the color of" ++ item.s ++ "is" ++ qual.s} ;
   HasInside dish ingr = {
     s = dish.s ++ have ! dish.n ++ ingr.s ++ "inside"} ;
   HasOnTop dish ingr = {
     s = dish.s ++ have ! dish.n ++ ingr.s ++ "on top"} ;
}

Now these constructions aren’t as restrictive as the hardcore ones. FoodsHardcore.HasOnTop would only accept an OnTopDish as its first argument. In contrast, you can apply FoodsSoftcore.HasOnTop to all dishes, and the responsibility of applying selectional restrictions is on the external program.

If you’re still unsure about this approach, look at the concrete syntax for ordinary English.

--# -path=.:../initial
concrete FoodsSoftcoreEng of FoodsSoftcore = FoodsEng ** {
 lin
   IsColor             = Is ;
   HasInside, HasOnTop = Has ;
}

This is the real selling point. No need to refactor anything in the previous languages or functor. Add as fine-grained constructions as you need, but don’t remove the old ones. Only write new GF code for those languages that make the distinction, and for the rest, just use the old construction, like we did with HasOnTop = Has. (If you’re using a functor, you can write HasOnTop = Has in the functor, and override it for Pretenglish.)

We can confirm that this grammar now translates successfully between English and Pretenglish.

> l -treebank IsColor (Your Pizza) Green
FoodsSoftcoreEng: your pizza is green
FoodsSoftcorePre: the color of your pizza is green

Yes, the incorrect Pretenglish “your pizza is green” is still available:

> l -treebank Is (Your Pizza) Green
FoodsSoftcoreEng: your pizza is green
FoodsSoftcorePre: your pizza is green

But we don’t care about it, because the correct construction is available in another tree.

Of course, when you use this strategy, remember to update the program that constructs GF trees. You don’t need to do anything language-specific in that external program; it’s enough to match the abstract syntax. Whenever the customer order is Pizza, call HasOnTop, otherwise call HasInside. (You would need to do this for the hardcore approach as well.)

Change only concrete syntax

Now suppose that it’s not possible to change the abstract syntax. The code that creates GF trees based on user choices is a black box—you can still use it, but you’re not allowed to modify it.

[Figure: pizza_blackbox]

Are we screwed? No, in fact, we can mimic the hardcore approach without adding anything to the abstract syntax. If you’ve gotten past Lesson 3 in the GF tutorial, you know how to do this.

We add some parameters:

param
  DishType = Inside | OnTop | NotADish ;
  QualType = Color | Other ;

And use them in the lincats of Kind, Item and Quality:

lincat
  Item    = {s : Str ; n : Number ; d : DishType} ;
  Kind    = {s : Number => Str ;    d : DishType} ;
  Quality = {s : Str ;              q : QualType} ;

We adjust the constructors mkKind and mkQuality from the initial grammar, and construct our lexicon like this:

lin
  Pizza = mkKind "pizza" OnTop ;
  Lasagna = mkKind "lasagna" Inside ;
  Zucchini = mkKind "zucchini" NotADish ;
  Vegan = mkQuality "vegan" Other ;
  Green = mkQuality "green" Color ;
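The adjusted constructors themselves are not shown here. As a rough sketch of what they might look like, assuming a naive plural that just appends "s" and hypothetical names for the resource module and record types (ParamsPre, KindRec, QualityRec):

```gf
-- file ParamsPre.gf (hypothetical sketch, not the code in only-concrete)
resource ParamsPre = {
  param
    Number   = Sg | Pl ;
    DishType = Inside | OnTop | NotADish ;
    QualType = Color | Other ;
  oper
    KindRec    : Type = {s : Number => Str ; d : DishType} ;
    QualityRec : Type = {s : Str ; q : QualType} ;

    -- Takes the singular form and the dish type;
    -- the plural is formed naively by gluing "s" to the singular.
    mkKind : Str -> DishType -> KindRec = \str,dt -> {
      s = table {Sg => str ; Pl => str + "s"} ;
      d = dt
      } ;

    -- Takes the adjective and its quality type.
    mkQuality : Str -> QualType -> QualityRec = \str,qt -> {
      s = str ;
      q = qt
      } ;
}
```

The only change compared to a parameter-free constructor is the extra argument, which is stored in the record so that later functions can inspect it.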

Finally, the parameters are put in action in the functions Is and Has:

lin
 Is item quality =
   let theColorOf : Str = case quality.q of {
         Color => "the color of" ;
         Other => [] } ;
       is : Str = case quality.q of {
         Color => "is" ; -- the color of pine seeds *is* green
         Other => copula ! item.n
         } ;
    in {s = theColorOf ++ item.s ++ is ++ quality.s} ;

 Has food ingrds =
   let place : Str = case food.d of {
         Inside => "inside" ;
         OnTop  => "on top" ;
         NotADish => [] }
    in {s = food.s ++ have ! food.n ++ ingrds.s ++ place} ;
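One step that the fragments above leave implicit is how the d field travels from Kind to Item: every function that builds an Item from a Kind has to copy it. The following stripped-down sketch shows the idea. The Mini grammar is hypothetical, drops Number and Quality for brevity, and assumes one module per file.

```gf
-- file Mini.gf (hypothetical, reduced version of Foods)
abstract Mini = {
  flags startcat = Comment ;
  cat Comment ; Item ; Kind ;
  fun
    Has : Item -> Item -> Comment ;
    Your, Mass : Kind -> Item ;
    Pizza, Lasagna, Zucchini : Kind ;
}

-- file MiniPre.gf
concrete MiniPre of Mini = {
  param
    DishType = Inside | OnTop | NotADish ;
  lincat
    Comment = {s : Str} ;
    Item, Kind = {s : Str ; d : DishType} ;
  lin
    -- Both Item constructors pass the dish type on unchanged.
    Your kind = {s = "your" ++ kind.s ; d = kind.d} ;
    Mass kind = {s = kind.s ; d = kind.d} ;

    -- Has reads the dish type of its first argument.
    Has food ingr =
      let place : Str = case food.d of {
            Inside   => "inside" ;
            OnTop    => "on top" ;
            NotADish => [] }
       in {s = food.s ++ "has" ++ ingr.s ++ place} ;

    Pizza    = {s = "pizza" ;    d = OnTop} ;
    Lasagna  = {s = "lasagna" ;  d = Inside} ;
    Zucchini = {s = "zucchini" ; d = NotADish} ;
}
```

In this sketch, the tree Has (Your Pizza) (Mass Zucchini) linearises as “your pizza has zucchini on top”, while the same tree with Lasagna ends in “inside”. The abstract syntax tree carries no information about placement; it all comes from the lexicon.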

The full grammar is in foods/zucchini/only-concrete. To run it together with the initial FoodsEng (since it shares the same abstract syntax), go to the zucchini directory and give both files as arguments to gf:

$ gf initial/FoodsEng.gf only-concrete/FoodsPre.gf
…
Languages: FoodsEng FoodsPre
Foods> gr | l -treebank
Foods: Is (Plural PineSeed) Green
FoodsEng: pine seeds are green
FoodsPre: the color of pine seeds is green

And you’re done! Since we didn’t change the abstract syntax, there’s no need to change the external program. You might argue that this is more elegant than foods/zucchini/hardcore, because we didn’t have to add all the extra types, functions and coercions, but we still have 100% precision. I’m not saying you’re wrong. To quote the GF best practices document (page 30), “the whole idea of GF is that language-dependent distinctions need not be reflected in the abstract syntax.”

However, please be cautious with this approach. Internal parameters are a great tool for many purposes, and I use them all the time. But if they are your only way of fixing things, eventually you will be very sad. I have horror stories of this approach taken way too far, but the examples are so long and complicated that they don’t make a good blog post.

Conclusion

The ideal solution depends on the situation. It is most often a combination of the hardcore, softcore and concrete-only strategies. The fewer concrete syntaxes you have, the more you should dare to restructure your abstract syntax. In contrast, say the abstract syntax has been stable for the past 16 languages, you are adding the 24th language, and you run into a quirk like “the colour of zucchini is green”: in that case I’d recommend just adding a parameter rather than questioning the whole design.

Footnotes

1. For a proper source, see Embedded Controlled Languages (Ranta, 2014).
2. Note that we can extend FoodsEng, because apart from these distinctions, English and Pretenglish are the same language. In a more realistic scenario, the rest of the grammar would come from a functor, and these three functions are the only ones that need to be implemented separately for Pretenglish.

tags: gf, programming