Latest update: 2023-03-26
This post contains real-life examples of cases where I or others have been confused in the past. It might be updated whenever someone is confused again.
In my experience from teaching GF, the most common source of confusion is what you’re allowed to do with strings. Why is it sometimes okay to use +
, and other times you get an error about “unsupported token gluing”?
The short answer is in the GF tutorial. I’ve written this longer answer, because I’ve found that it’s often needed.
Gluing strings with +
oper
addBar : Str -> Str ;
addBar x = x + "bar" ;
-- addBar "foo" returns "foobar"
Pattern matching strings
oper
isPlural : Str -> Bool ;
isPlural str = case str of {
_ + "s" => True ;
_ => False } ;
You can use these operations when constructing a lexicon, but not in a function that takes arguments.
To quote the GF tutorial for Python programmers:
GF requires that every token – every separate word – be known at compile-time. Rearranging known tokens in new ways, no problem: GF can generate an infinite variety of different combinations of words.
But they have to be words known to GF at compile-time. GF is not improv: as Shakespeare might have said, if anybody’s going to make up new words around here, it’ll be the playwright, not the actor. You can
+
tokens together, but only at compile-time. If you try to do it at run-time, you will get weird errors, like unsupported token gluing
or, worse, Internal error in GeneratePMCFG.
So how do you know whether a line of code is executing at compile-time or at run-time?
Look at the functions’ type signatures. If the fun
takes an argument, that’s a run-time argument.
fun
Pred : Item -> Quality -> Comment ; -- 2 args: Item and Quality
Mod : Quality -> Kind -> Kind ; -- 2 args: Quality and Kind
Very : Quality -> Quality ; -- 1 arg: Quality
When you write the grammar, you tell a function like Pred
how to do its thing. When you run the grammar in a GF shell and tell it to linearise Pred (This Pizza) Italian
, then it will actually do its thing, with the actual arguments This Pizza
and Italian
. Hence, run-time arguments.
In the linearisation of these functions, you are working on tokens that come from the arguments. You may concatenate them with ++
, introduce new tokens (like “very” in the example below), even duplicate or remove some of the arguments. But you may not use +
or pattern match.
This is correct.
lin
Very qual = {s = "very" ++ qual.s} ;
This is wrong.
lin
Very qual = {s = "very" + qual.s} ;
This is also wrong. (But if you really want to prevent multiple “very” and don’t know how, see footnote 1.)
lin
Very qual = {
s = case qual.s of { -- don't add a second very
"very" + x => qual.s ;
_ => "very" + qual.s }
} ;
In contrast, these functions take no arguments.
fun
Pizza : Kind ;
Vegan : Quality ;
How do you know? There’s no arrow in the type. The types are just Kind
and Quality
, as opposed to something like Kind -> Quality -> Kind
.
This means that in the linearisation of Pizza
, you can use as many single +
s and pattern matches as you want.
Doesn’t matter if they’re directly in the lin
, or via oper
s that are called in the lin
.
lin
Pizza = mkN "pizza" ; -- mkN calls addS, which
-- uses plus and pattern match
oper
mkN : Str -> {s : Number => Str} ;
mkN str = {s = table {Sg => str ; Pl => addS str}} ;
addS : Str -> Str ;
addS str = case str of {
bab + "y" => bab + "ies" ; -- y -> ie, e.g. babies
_ => str + "s" -- regular, e.g. dogs
} ;
The oper mkN
takes the string "pizza"
as an argument, but that’s no problem. The string "pizza"
is not a run-time argument to the fun Pizza
.
That line, lin Pizza = mkN "pizza"
, really means this:
lin
Pizza = {s = table {Sg => "pizza" ; Pl => "pizzas"}} ;
So you can ignore all the opers; the only meaningful question is whether the fun has arguments. Pizza
has none—we’re only constructing what will become a run-time argument for other GF funs.
If you get the error for unsupported token gluing, even though you’re certain that you don’t have any lins with +
, check whether any of the opers called in your lins use +
or pattern match on a run-time argument.
Restricted inheritance
Restricted inheritance is explained with the gentle example of
abstract Foodmarket = Food, Fruit [Peach], Mushroom - [Agaric]
in the tutorial.
However, when you have a large grammar (like the resource grammar), you need to exclude abstract syntax functions or categories in all modules that extend the original module. For instance, Common.gf is used in Cat, Tense and Text.
Sometimes you need to write your own definition of some category
defined in Common
, like
here in CatEng:
concrete CatEng of Cat = CommonX - [Pol,SC,CAdv] **
open ResEng, Prelude in { ... }
Then in addition, you need to exclude Pol
, SC
and CAdv
in
GrammarEng:
concrete GrammarEng of Grammar =
NounEng,
...,
PhraseEng,
TextX - [Pol,SC,CAdv],
StructuralEng,
IdiomEng,
TenseX - [Pol,SC,CAdv]
** open ResEng, Prelude in { ... }
If you don’t do this, and only exclude the categories in CatEng
, you
get the following error for all the excluded categories:
GrammarEng.gf:
cannot unify the information
lincat CAdv = {s : Str; p : Str} ;
lindef CAdv = \str_0 -> {s = str_0; p = str_0} ;
linref CAdv = \str_0 -> str_0.s ;
in module CommonX with
lincat CAdv = {s : ParamX.Polarity => Str; p : Str} ;
lindef CAdv = \str_0 -> {s = table ParamX.Polarity {
_ => str_0
};
p = str_0} ;
linref CAdv = \str_0 -> str_0.s ! ParamX.Pos ;
in module CatEng
An anonymous GF developer had a problem with the Czech concrete syntax of an application grammar. The grammar worked fine in the ordinary GF shell, but it crashed when using the C runtime.
I am experiencing something strange with the Czech grammar. When compiling to PGF and loading with the C runtime, I get:
Abstract category A is missing
pgf_read_cnccat (pgf/reader.c:1067): assertion failed cnccat->abscat != NULL
This doesn’t occur when loading with the Haskell runtime. I’m not doing anything special, just
gf -make AppCze.gf && gf -cshell App.pgf
The problem turned out to be about the (non)restricted inheritance in the Czech resource grammar. Here’s the first (content) line of Numeral.gf:
abstract Numeral = Cat [Numeral,Digits] ** {
And here’s the corresponding line in NumeralCze.gf at the time before it had been fixed:
concrete NumeralCze of Numeral =
CatCze ** {
NumeralCze was inheriting all of CatCze and not just [Numeral,Digits]
. After restricting the inheritance, the problem was solved. The current NumeralCze.gf looks like this, and the application works correctly.
concrete NumeralCze of Numeral =
CatCze [Numeral,Digits] ** {
Does this ever happen to you?
Lang> gr -cat=VP | l -treebank
Lang: UseComp (CompAP (UseComparA probable_AS))
LangDut: zijn
LangEng: be more probable
Why is there only zijn ‘to be’ in the Dutch output, but the whole VP in
English? Well, that’s because the Dutch grammar has no linref
for
VP
.
linref
is defined in the
reference manual
as the “reference linearization of category C, i.e. the function
applicable to an object of the linearization type of C to make it into
a string”.
So if your VP
is a record with a bunch of fields,
and you don’t have a linref, then the GF shell shows only the thing
in the s
field. If you have a linref, then the function is applied
to the record, and GF shell shows a string that makes sense, like be
more probable.
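For concreteness, here is a sketch of what such a linref could look like; the lincat and names (Agr, verb, compl) are made up for the example, not taken from the actual Dutch grammar:

```gf
param Agr = Sg3 | Pl3 ;

lincat VP = {verb : Agr => Str ; compl : Str} ;

-- linref: how to show a standalone VP as a string in the shell.
-- Pick one agreement form and put the complement after it.
linref VP = \vp -> vp.verb ! Sg3 ++ vp.compl ;
```

With a linref like this in place, the shell can show the whole VP as a sensible string, instead of only the contents of a single field.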
If you want to play around with linrefs, I have an example grammar where you can comment out, uncomment and modify the linref and see what happens.
Here are some English linrefs in the RGL, if you want to see other than toy examples.
In the worst case, it can lead to this.
It’s a long thread, so here’s the essential.
Here’s an alternative hack,
if you need the behaviour of variants
but it’s too slow. However, consider
also whether you really want variants; e.g. in the thread in question,
it is much better to separate European and Brazilian Portuguese into
distinct modules.
variants vs. |
You may have come across the use of variants
while reading GF
code. Except for the behaviour of empty variants,
t1 | ... | tn
and variants {t1 ; ... ; tn}
are equivalent.
It is clear that {s = "ink well" | "ink-well"}
is the same as {s =
"ink well"} | {s = "ink-well"}
, but which is better? Both will compile
to the same thing at the PGF level, but the former is clearer and less
repetitive. Just be careful with examples which are not actually
equivalent, such as {s = "Auto" ; g = Neutr} | {s = "Wagen" ; g =
Masc}
and {s = "Auto" | "Wagen" ; g = Neutr | Masc}
.
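To see why the last two are not equivalent, expand the variants (genders as in the German example: Auto is neuter, Wagen masculine):

```gf
-- {s = "Auto" ; g = Neutr} | {s = "Wagen" ; g = Masc}
--   expands to exactly the 2 intended records.
-- {s = "Auto" | "Wagen" ; g = Neutr | Masc}
--   expands to all 4 combinations:
--   {s = "Auto"  ; g = Neutr}   -- intended
--   {s = "Auto"  ; g = Masc}    -- wrong gender
--   {s = "Wagen" ; g = Neutr}   -- wrong gender
--   {s = "Wagen" ; g = Masc}    -- intended
```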
Update 2018-12-28: I thought this part deserved its own post, go and read it.
First tip: type help
in the GF shell and read the output! You might
find cool features.
Second tip: read on for a random list of things I think are useful and suspect are less known.
The sophisticated version of opening a terminal in the
gf-rgl/src/abstract
and grepping for a RGL function to figure out
what it’s doing.
ai
, short for abstract_info
, shows info about expressions, functions and
categories. Load a grammar, and type e.g. ai VPSlash
: you get all
functions that produce a VPSlash
. Start typing ai In
and press
tab, you see all functions that start with In
. Works for whole
expressions as well: ai UseN dog_N
tells you that it’s of type CN
and has the probability 3.3967391304347825e-4.
Here’s a concrete use case: you’re implementing a resource grammar but
nothing comes out when you do gt
or gr
. Probably you’re missing
some function along the way, but which one? There’s so many of them!
You can start by typing ai Phr
, which is the start category; this
gives you fun PhrUtt : PConj -> Utt -> Voc -> Phr
, so you can go and
search for PConj
s, Utt
s and Voc
s next. You can use pg -missing
and grep for the functions in a category to see if you’re missing
linearisations. Here’s a concrete example:
Lang> ai Voc
cat Voc ;
fun NoVoc : Voc ;
fun VocNP : NP -> Voc ;
fun please_Voc : Voc ;
You can of course check your files in your favourite text editor or IDE and see if any of those is implemented, or you can also do the following in the GF shell:
Lang> pg -missing | ? tr ' ' '\n' | grep Voc
Then, if you are missing any of NoVoc
, please_Voc
or VocNP
, it
will show in the output.
Wouldn’t it be handy if you just typed a word in the GF shell and
could find out which abstract syntax function(s) it comes from, and
which fields? ma
, short for morpho_analyse
, does this as follows:
Lang> ma -lang=Dut "het"
het
ProgrVP : a2
PossPron : sp Sg Neutr
DefArt : s True Sg Neutr
DefArt : s False Sg Neutr
...
The ma
command is useful also if you’re dealing with GF text that is first generated in GF, but postprocessed with another program after that generation. If you need to debug or reverse engineer such output, then the p
command won’t work, because it contains strings outside the grammar. But you can still ma
a text like that: it just ignores the words that it doesn’t recognise. Example:
Lang> p "책과 고양이가 좋습니다"
The parser failed at token 1: "\52293\44284"
Lang> ma "책과 고양이가 좋습니다"
책과
고양이가
cat_N : s Subject
좋습니다
good_A : s (VF Formal Pos)
You still don’t know what the whole sentence is about, but at least you know its subject is a cat 😺 and something is good.
Another unexpected use case for ma
is if your grammar is extremely ambiguous. Say you have a few empty strings in strategic places, or are using a large lexicon with tons of synonyms–you’ll easily get hundreds or even thousands of parses for a 3-word sentence. Example:
Lang> p "작은 고양이가 좋습니다" | ? wc
359 11232 86068
Lang> ma "작은 고양이가 좋습니다"
작은
small_A : s (VAttr Pos)
short_A : s (VAttr Pos)
고양이가
cat_N : s Subject
좋습니다
good_A : s (VF Formal Pos)
This is much more readable than 359 trees. The subject is a small or short cat, and the predicate is that the cat is good. Just by seeing the morphological parameters from the inflection tables, we can infer that small
is attributive and good
is predicative.
If your grammar includes bind tokens (&+
), the standard GF shell can’t parse it out of the box. You can either switch to the C shell (install GF with C runtime support and then type gf -cshell
), or preprocess your input with clitic_analyse
, ca
for short. For this to work, you need a list of clitics in the GF grammar.
The Korean RG uses clitics for Conj (e.g. 고, 하고, 며, 이며), Prep (e.g. 와, 과) and Predet (e.g. 마다), and a few other things. If the grammar uses the BIND token a lot, it’s honestly pretty terrible to try to gather them all in order to use ca
, but let’s just show this for the sake of example. Let’s go back to the sentence we couldn’t parse in the ma
section: “책과 고양이가 좋습니다”.
Lang> ca -clitics=와,과,고,하고,며,이며 "책과 고양이가 좋습니다"
책 &+ 과 고양이가 좋습니다
ca
gives a hint why we couldn’t parse it: the first word is in fact two tokens, 책 &+ 과
. The output of the ca
command can be directly piped into other commands.
Lang> ca -clitics=와,과,고,하고,며,이며 "책과 고양이가 좋습니다" | p | ? wc
1799 62920 496268
As you can see, p
is more than happy to parse that stuff: it finds 1799 trees for those 3 words! What an improvement over The parser failed at token 1: "\52293\44284"
. Let’s try ma
:
Lang> ca -clitics=와,과,고,하고,며,이며 "책과 고양이가 좋습니다" | ma
책
book_N : s Bare
&+
과
with_Prep : s Consonant
married_A2 : p2 s Consonant
고양이가
cat_N : s Subject
좋습니다
good_A : s (VF Formal Pos)
So it’s either a cat with a book or cat married to a book that is good. The mystery is solved.
tt
, short for to_trie
, outputs the results of a parse in a trie
format, so you can see at a glance where the differences lie. Use it
in place of l
, like in the example below:
MiniLang> p "you are a cat" | tt
* UttS
* UseCl
* TSim
PPos
PredVP
* UsePron
1 youPl_Pron
2 youSg_Pron
UseNP
* DetCN
* a_Det
UseN
* cat_N
l
Isn’t it boring when you generate poetry using gr
and you get a
really good sentence, but don’t see the AST? Then you need to copy the
sentence and parse it all over again, so much hassle.
Lang> gr | l
en wie alsjeblieft
and who please
ja kuka ole hyvä
Fear not, with the -treebank
flag you get it all in one step:
Lang> gr | l -treebank
Lang: PhrUtt (PConjConj and_Conj) (UttIP whoSg_IP) please_Voc
LangDut: en wie alsjeblieft
LangEng: and who please
LangFin: ja kuka ole hyvä
Also wouldn’t it be nice to see all forms of a tree? -table
gives
you that:
Lang> gr -cat=NP | l -treebank -table -lang=Eng,Dut
Lang: UsePron she_Pron
LangEng: s (NCase Nom) : she
LangEng: s (NCase Gen) : her
LangEng: s NPAcc : her
LangEng: s NPNomPoss : hers
LangDut: s NPNom : zij
LangDut: s NPAcc : haar
How about more internal structure? -bracket
is just what you need:
Lang> gr -cat=Cl | l -bracket -treebank
Lang: PredVP (languageNP norwegian_Language) tired_VP
LangDut: (Cl:3 (NP:1 (Language:0 Noors)) (VP:2 is) (VP:2 moe))
LangEng: (Cl:3 (NP:1 (Language:0 Norwegian)) (VP:2 is) (VP:2 tired))
LangFin: (Cl:3 (NP:1 (Language:0 norjaa)) (VP:2 väsyttää))
Find out the rest by typing help l
in the GF shell.
When you have a resource grammar that is not quite complete, but you still want to have the API functions for it, you generate a MissingXxx.gf
(where Xxx
is the extension of your language). The file itself looks like this. You need to add ** open MissingXxx in {}
in front of every file in the api
directory, like this. If it still doesn’t work, try to include the directory in the header, like this: --# -path=.:alltenses:prelude:../Xxx
. But I’ve seen it work without the latter so I’m not quite sure, maybe it depends on some environment variables or the position of the stars.
Here’s how to generate one, mostly for myself so I don’t need to type these again next time. :-P
-- (in GF shell)
Lang> pg -missing | ? tr ' ' '\n' > /tmp/missingLins.txt
You can also prepare the file in the language directory:
echo "resource MissingXxx = open GrammarXxx, Prelude in {" > MissingXxx.gf
echo "-- temporary definitions to enable the compilation of RGL API" >> MissingXxx.gf
Then go to the directory abstract
, and do this:
for l in `cat /tmp/missingLins.txt` ; do egrep -o "\<$l\> +:.*;" *.gf | tr -s ' ' | egrep -v "(Definition|Document|Inflection|Language|Mark|Month|Monthday|Tag|Timeunit|Hour|Weekday|Year|Month)" | sed 's/gf:/ ;/g' | cut -f 2 -d ';' | sed -E 's/(.*) : (.*)/oper \1 : \2 = notYet "\1" ;/' ; done >> ../XXX/MissingXXX.gf
And remember to add the ending brace to the MissingXxx
file.
It won’t work straight out of the box, because it only greps the abstract files, and some old definitions in those files are commented out. Just fix those as you try to build and get errors.
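Each line that the one-liner generates looks like the following; married_A2 is just an example entry, and notYet is the placeholder oper the sed command assumes to be in scope:

```gf
-- from "married_A2 : A2 ;" in the abstract, the sed emits:
oper married_A2 : A2 = notYet "married_A2" ;
```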
If you also haven’t implemented SymbolXxx
, then you need to find missing lins from there as well. Open SymbolXxx in GF shell, do pg -missing
and complete the same procedure. And on the first line you need to add resource MissingXxx = open GrammarXxx,
SymbolXxx, Prelude in …
.
Has something similar happened to you before? You parse a completely unambiguous sentence, and get mysterious question marks in the tree.
Follow-up question: have you seen this before?
-- Cl = {s : Polarity => Tense => Str}
UseCl tense pol cl = {
s = tense.s ++ pol.s ++ cl.s ! pol.p ! tense.t
} ;
-- p : Polarity
PPos = {s = [] ; p = Pos} ;
PNeg = {s = [] ; p = Neg} ;
-- t : Tense
TSim = {s = [] ; t = Sim} ;
TAnt = {s = [] ; t = Ant} ;
In other words, all the tenses and polarities have an empty s
field,
but UseCl
still adds it to the linearisation.
The latter is done exactly to prevent the first situation from
happening. The rule is that every argument needs to contribute a
string, otherwise it isn’t recognised when parsing. This happens
even if there is no ambiguity: the parameters Pos
, Neg
, Sim
and
Ant
all select a different form from the table in Cl
.
The solution is simply to include some string field in every lincat, even if its only purpose is to select a form from an inflection table in another category.
Sometimes you run into these situations even when there is a string: for instance, say that you implement obligatory prodrop in the following way.
PredVP np vp =
let subj = case np.isPron of {
True => [] ;
False => np.s ! Nom
} ;
pred = vp.s ! np.a ;
in {s = subj ++ pred} ;
Then when you parse “the cat sleeps”, the cat_N
has contributed
a string that is selected in np.s ! Nom
, but in case of “I sleep”, subj
becomes just the empty
string. The solution is to add some dummy string field to your NP
type, and add it in case the NP
is a pronoun, as shown below.
-- NP = {s : Case => Str ; empty : Str}
let subj = case np.isPron of {
True => np.empty ;
False => np.s ! Nom
} ;
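For this to work, the function that builds the pronoun NP has to fill in the dummy field; here is a sketch with made-up lincat and field names:

```gf
-- NP = {s : Case => Str ; empty : Str ; isPron : Bool ; a : Agr}
lin UsePron pron = {
  s = \\c => pron.s ! c ;
  empty = [] ;     -- the dummy field: an (empty) string owned by this argument
  isPron = True ;
  a = pron.a
} ;
```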
Some words have incomplete paradigms: a form just doesn’t exist. One example of this is plurale tantum nouns: you can’t say a scissor, only scissors. In GF, you have a few different alternatives: linearising an empty string, another form, a nonexistent form, or raising an exception. I go through the options in the following.
In the case of scissors, you would create the following inflection table:
Sg => "" ;
Pl => "scissors" ;
First of all, linearising empty strings gives weird results:
Lang> l PredVP (UsePron i_Pron) (AdvVP (UseV run_V) (PrepNP with_Prep (MassNP (UseN scissors_N))))
I run with
Furthermore, it creates ambiguity, which can be
dangerous. An example (real cases are often much more subtle and hard
to debug): suppose that today_Adv
is linearised as an empty
string. Then this happens:
Lang> l PredVP (UsePron i_Pron) (AdvVP (UseV run_V) today_Adv)
I run
Lang> l PredVP (UsePron i_Pron) (UseV run_V)
I run
The sentence I run is now ambiguous, you could get both trees when parsing it. For these reasons, most often you should consider other options.
Sg => "scissors" ;
Pl => "scissors" ;
Now we get I run with scissors when applying MassNP
to
scissors_N
, just like you get I run with beer for the MassNP
of
beer_N
. It’s pretty much what we’d like MassNP
to do, so this works
much better than the previous approach! The downside is that the
grammar will now overgenerate sentences with wrong agreement.
Lang> l PredVP (DetCN (DetQuant DefArt NumPl) (UseN scissor_N)) (UseComp (CompAdv here_Adv))
the scissors are here
Lang> l PredVP (DetCN (DetQuant DefArt NumSg) (UseN scissor_N)) (UseComp (CompAdv here_Adv))
the scissors is here
Also, problems with ambiguity may occur, which can give weird results for translation: you parse “I run with scissors” in English, get two trees, accidentally choose the singular, and linearise in some other language “I run with scissor”. (The core point is parsing text that makes sense and getting trees that don’t; we return to it later in this post.)
If you think either of the two points will be a problem in your use case, proceed with caution. Otherwise, I think this approach would work pretty well for many applications.
Sg => "scissor" ;
Pl => "scissors" ;
Just put scissor in the table, and if someone needs it in an application, let them take care to always choose the plural.
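In an application grammar built on the RGL, that can mean always building the NP with a plural determiner; a sketch, where Scissors and scissor_N are assumed names:

```gf
-- The singular form never surfaces, because the NP is always plural:
lin Scissors = mkNP aPl_Det scissor_N ;
```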
Okay, scissor is not such a controversial thing to put in an inflection table. After all, the form does exist e.g. in compound words. But in my personal opinion (see footnote 2), this is a feasible option even for more questionable forms.
I’m mostly thinking about RGL, which is already full of somewhat questionable language. Consider the following trees:
PredVP (UsePron i_Pron) (ComplSlash (SlashV2a like_V2) (UsePron i_Pron))
-- 'I like me'
PredVP (UsePron i_Pron) (ReflVP (SlashV2a like_V2))
-- 'I like myself'
I like me sounds less idiomatic in English than I like myself, but the RGL’s task is not to judge what is idiomatic, just to provide a wide variety of syntactic structures, and let an application grammarian choose the most idiomatic structures for a given language and situation.
The first sentence is created by a construction, which creates all
combinations of <Pron> <Verb> <Pron>
. Sure, we could write a special
case for those instances where the two pronouns are the same, and
replace the second pronoun by the appropriate reflexive pronoun. But
the reflexive construction already exists in the RGL, as a completely
separate entity. I personally think both can coexist in the RGL: I like me
has distinct uses from I like myself, for instance as a hypothetical “I wouldn’t like me if I was you”.
Now let’s consider Basque, where transitive verbs agree with both subject and object. Except when the subject and object are both the same 1st or 2nd person pronoun. So he/she/they like(s) him/her/them is all fine, but I/we like me/us or you like you–the verb form needed to convey this information just doesn’t exist. In such a case, the reflexive construction (where the object agreement is always 3rd person singular) is the only way to go.
Despite this, I decided to linearise nonexisting forms: the morpheme for e.g. 1st person as a subject exists, so does the morpheme for 1st person as an object, so just put the morphemes together and call it a theoretical form (see footnote 3).
This approach is definitely not for all use cases. For instance, don’t use it in applications where you provide suggestions for the next word, unless your users are into speculative morphotactics.
Sg => nonExist ; --if reached, nothing is linearised
Pl => "scissors" ;
As defined in Orthography Engineering in Grammatical Framework (Angelov, 2015):
In generation mode,
nonExist
behaves like an exception. Any attempt to linearise a sentence which containsnonExist
will fail completely and no output will be generated. Ideally the grammarian should avoid exceptional situations and write grammars that always produce sensible output. At least this should be the goal for expressions generated from the intended start category, i.e. for sentences. Avoiding the exceptions is usually possible by using parameters to control the possible choices in the grammar.nonExist
is still useful as an explicit marker for impossible combinations in nested categories, i.e. categories that are not used as start categories. If hiding all exceptions in the grammar is not possible then at least by usingnonExist
the user gets a chance to handle the situation by rephrasing the request.
If you want to skip the linearisation of a whole abstract syntax function, instead of just a single form, you can use empty variants:
lin beer_N = variants {} ;
Using beer_N
in a sentence works like this:
Lang> l PredVP (UsePron i_Pron) (ComplSlash (SlashV2a like_V2) (MassNP (UseN beer_N)))
I like [beer_N]
If we had defined beer_N
as a table \\_ => nonExist
, then
linearising the sentence would have produced no output whatsoever. Now
at least we get I like part properly linearised.
You’re implementing a concrete syntax for some grammar, and try to linearise a test tree. In the example below, we are linearising the tree for “second cat” in English and Malay:
Lang> l AdjCN (AdjOrd (OrdNumeral (num (pot2as3 (pot1as2 (pot0as1 (pot0 n2))))))) (UseN cat_N)
second cat
kucing [AdjOrd]
Based on the output, the subtree UseN cat_N
is implemented in Malay, but the function AdjOrd
is not. But you are sure that you have implemented AdjOrd
, so what is the problem?
In this case, the unimplemented function is not AdjOrd
, but one (or more) of the functions in its subtree(s), so e.g. OrdNumeral
, num
, or any of the pot…
funs. This is confusing, because the shell output puts the blame on a function that actually exists.
To find out which function is the actually missing one, import the concrete syntax with the -retain
flag, and run cc
(compute_concrete) on the same tree:
Lang> i -retain LangMay.gf
40 msec
> cc AdjCN (AdjOrd (OrdNumeral (num (pot2as3 (pot1as2 (pot0as1 (pot0 n2))))))) (UseN cat_N)
constant not found: OrdNumeral
So now you know which one is missing, go and implement OrdNumeral
, and then you can try again linearising the original tree for “second cat”.
You’ve implemented a linearisation for some fun, but when testing some tree in the GF shell, you only get an empty string. (NB. This is a made up example, if you try this in the actual RGL, it will work correctly.) But suppose, for the sake of example, that you try the following:
$ gf AllEng.gf
AllEngAbs> l ApposNP nothing_NP something_NP
The result is just an empty string.
Then you grep for the functions in the source code:
StructuralEng.gf
109: something_NP = regNP "something" singular ;
StructuralEng.gf
148: nothing_NP = regNP "nothing" singular ;
ExtendEng.gf
435: ApposNP np1 np2 = {s = \\c => np1.s ! c ++ comma ++ np2.s ! c; a = np1.a} ;
As we can see, all three functions are defined in the files. But the issue here is that ExtendEng
is based on ExtendFunctor, and the default linearisation for ApposNP
(and most other functions) is variants {}
.
So in your own implementation of ExtendLang
, you will need to do two things for each function: write your own linearisation, and make sure that the function (here ApposNP
) is in the exclusion list.
This holds for any concrete syntax that is based on a functor. If you override a functor implementation, you will need to add it specifically to the exclusion list. Although ExtendFunctor
is rather uncommon in that most linearisations are variants {}
, which is usually not the case in functors for application grammars.
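As a sketch of what those two things look like in the module header and body (the real ExtendEng header differs in its details, so take the names as an illustration):

```gf
concrete ExtendEng of Extend =
  ExtendFunctor - [ApposNP]          -- 1) put the function in the exclusion list
  with (Grammar = GrammarEng) ** {
  lin
    -- 2) give your own linearisation, replacing the functor's variants {}
    ApposNP np1 np2 = {s = \\c => np1.s ! c ++ "," ++ np2.s ! c ; a = np1.a} ;
}
```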
For instance, DetNP every_Det
returns “everything” for English.
One can argue that it’s better for as many RGL trees as possible to return something that makes sense;
in such a case, “everything is blue” or “I see everything” is preferable to “every is blue” and “I see every”.
On the other hand, such an (over)engineered solution creates ambiguity: now “everything” is a linearisation of both everything_NP
and DetNP every_Det
. Furthermore, it forces other languages to come up with something that makes equally much sense, because now you will get more nonsensical trees while parsing text that makes sense.
Here’s what you can do in any grammar in the GF shell: load any grammar, parse and linearise arbitrary trees in the grammar.
Foods> l Pred (These Pizza) Delicious
estas pizzas son deliciosas
Foods> p "este vino es caro"
Pred (This Wine) Expensive
Here’s another standard thing we can do in the GF shell: import the Paradigms module with retain
, and create any inflection table.
$ gf
> i -retain alltenses/ParadigmsSpa.gfo
> cc -table mkN "tortilla"
s . ParamX.Sg => tortilla
s . ParamX.Pl => tortillas
g . CommonRomance.Fem
But what about combining the two? What if I want to create sentences about delicious tortillas, but can’t be bothered to extend the abstract syntax of Foods? Wouldn’t it be nice if I could just do this?
Foods> l Pred (That (mkN "tortilla")) Delicious
esa tortilla es deliciosa
Unfortunately this won’t work exactly like that. But you can get very close. Let’s look at the concrete syntax:
lincat
Comment = Utt ;
Item = NP ;
Kind = CN ;
Quality = AP ;
lin
Pred item quality = mkUtt (mkCl item quality) ;
This kind = mkNP this_QuantSg kind ;
Mod quality kind = mkCN quality kind ;
Very quality = mkAP very_AdA quality ;
Wine = mkCN (mkN "vino") ;
Pizza = mkCN (mkN "pizza") ;
Delicious = mkAP (mkA "delicioso") ;
Now we know which RGL categories to use, so we can be more precise. Kinds are CNs, so we need mkN
and mkCN
to make a Kind
out of the string “tortilla”. These opers, mkN
and mkCN
, are in scope when writing the grammar, because the grammar opens ParadigmsSpa and SyntaxSpa. But these opers are only used inside the Foods grammar; they aren’t retained when we import the Foods grammar in the GF shell using i -retain
.
So how to make those mkN
and mkCN
usable? We can re-export them. Add the following lines in the concrete syntax:
oper
mkN : Str -> N = ParadigmsSpa.mkN ;
mkCN : N -> CN = SyntaxSpa.mkCN ;
Now you can do the following in the GF shell:
$ gf
> i -retain FoodsSpa.gf
> cc -one Pred (These (mkCN (mkN "tortilla"))) Delicious
estas tortillas son deliciosas
You can repeat the process for other parts of speech. Want to say that the tortilla is vegan? Just check what’s the lincat for Quality and re-export the appropriate opers from ParadigmsSpa and SyntaxSpa. In fact, you don’t even need to call the functions the same. You could as well do this:
oper
kind : Str -> CN = \s -> mkCN (mkN s) ;
qual : Str -> AP = \s -> mkAP (mkA s) ;
To be used in the GF shell like this:
> cc -one Pred (These (kind "tortilla")) (qual "vegano")
estas tortillas son veganas
The full grammar, including my additions, is here.
There is no official style guide for GF, but if you look at the existing grammars, there are many conventions. I also have my own set of practices that I use from grammar to grammar, but which are not necessarily used by others. I share here both kinds of naming schemes.
The biggest distinction is application grammar categories vs. RGL categories.
The general principle is that RGL cats are purely syntactic, and are named after the part of speech, e.g. A
, N
, V
and their phrasal forms AP
, NP
, VP
. A number after the category name tells how many arguments the function takes:
V
means intransitive verb: only subject, no object, e.g. sleep_V
V2
means transitive verb: subject and object, e.g. break_V2
V3
means ditransitive verb: subject, object and indirect object, e.g. give_V3
.
Nouns and adjectives follow the same number scheme, so we have adjectives like married_A2
“married to X” and nouns like distance_N3
“distance from Y to Z”.
For RGL verbs, we have even more subcategories, such as VV
, VA
and VS
. The names are hopefully compositional enough: VV
takes a verbal complement (want [to sleep]), VA
an adjectival complement (become [tired]) and VS
takes a whole sentence (say [that she sleeps]). These can also be combined with the numbers, so for example V2A
takes both a direct object and an adjectival complement: paint [the house] [red].
Often the categories of an application grammar are more fine-grained: not just part-of-speech based like NP
or V2
, but more semantic distinctions like Weekday
, Currency
, Language
, Action
, State
.
Some application grammars are built from scratch, and all of their categories are like above. Other application grammars extend the RGL, and thus contain both the RGL categories and more semantically oriented categories. There are not that many conventions, it just depends on your application what is significant.
The s field
Most GF records, no matter if in a resource or an application grammar, have a field called s
, which contains the main element. For instance, the lincat of CN could look like this in many languages:
lincat CN = {
s : Number => Str ; -- house / houses
postmod : Str ; -- on a hill
} ;
The s field is significant to the GF shell: unless we have specified a linref, the GF shell will only parse and linearise contents in the s field and ignore the rest. Similarly, human GF programmers are used to finding the more central contents in the s field and more peripheral contents in other fields.
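As a sketch, a linref for the CN lincat above could glue the fields together like this (assuming Sg is a value of Number):

```
linref
  -- pick one form from the s table and append the postmodifier,
  -- so the shell can show "house on a hill" rather than just "house"
  CN = \cn -> cn.s ! Sg ++ cn.postmod ;
```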
The following names are mostly interesting for RGL implementors, or those who live dangerously and use the RGL not via the API, but by touching the raw parameters.
sp

20 RGL languages use a field called sp (source: I went to gf-rgl/src and did grep "\<sp :" */*.gf | cut -f 1 -d '/' | sort -u).
The sp field appears in the category Det and other categories that may become Det (e.g. Quant, Num, Pron). It is meant to be the field that contains a standalone version of the Det. A simplified example for English:
l -table PossPron i_Pron
s : my
sp : mine
The s field is then used in the primary function of Det, which is being a determiner. The sp field is used when the determiner is promoted into an NP. This is from the actual English resource grammar:
Lang> p -cat=NP "my dog"
DetCN (DetQuant (PossPron i_Pron) NumSg) (UseN dog_N)
Lang> p -cat=NP "mine"
DetNP (DetQuant (PossPron i_Pron) NumSg)
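A simplified sketch (not the actual RGL code) of how those two functions pick different fields:

```
lincat Det = {s, sp : Str} ;
lin
  DetCN det cn = {s = det.s ++ cn.s} ;  -- attached to a noun: "my dog"
  DetNP det    = {s = det.sp} ;         -- standalone NP: "mine"
```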
c2 and c3

The field c2 is used in 39 RGL languages, and c3 in 34 (source: the same bash oneliner I used for sp, but grep for the string c2 or c3 instead of sp).
These fields appear in categories that take complements: almost all verb subcategories (V2, VS, V2A…), but also N2, A2 and their ditransitive versions N3, A3. They belong to the categories that are still waiting for their complement, so V2’s c2 field is inherited by the VPSlash, but not by the VP. A simplified example below:
cat
V2 ; VP ; NP ;
fun
ComplV2 : V2 -> NP -> VP ;
believe_V2 : V2 ;
see_V2 : V2 ;
lincat
NP, VP = {s : Str} ;
V2 = {
s, -- main verb ('believe', 'see')
c2 : Str -- optional preposition ('in')
} ;
lin
ComplV2 v2 np = {
s = v2.s ++ v2.c2 ++ np.s ;
} ;
believe_V2 = {
s = "believe" ;
c2 = "in"
} ;
see_V2 = {
s = "see" ;
c2 = []
} ;
This works for other concepts than just prepositions, and for more complex types than just strings. What kinds of types are used for those fields? You can inspect them yourself with this oneliner, running it in the directory gf-rgl/src:
grep -o "\<c2 : [^ ]*\>" */*.gf | cut -f 3 -d ':' | sort -u
Here are the cleaned up results, merging e.g. ResNep.Compl together with Compl:
The names don’t tell much: all of those, except for Str and Bool, are internal types defined in each concrete RGL language. (For all we know, maybe earlier in the grammar it says Compl : Type = Str.) I just wanted to show proof that c2 fields can contain much more than just strings. If you are interested in their relative frequencies, you can run this command in gf-rgl/src:
grep -o "\<c2 : [^ ]*\>" */*.gf | cut -f 3 -d ':' | sort | uniq -c | sort -nr
This is idiosyncratic to me, but I share it because I have found it useful.
So, I have a grammar (RGL or application) with some categories, let’s call them Foo and Bar. Then for their lincats, I follow this convention:
cat
Foo ; Bar ;
lincat
Foo = LinFoo ;
Bar = LinBar ;
linref
Foo = linFoo ;
Bar = linBar ;
oper
LinFoo : Type = {- … -} ;
linFoo : LinFoo -> Str = \foo -> {- … -} ;
LinBar : Type = {- … -} ;
linBar : LinBar -> Str = \bar -> {- … -} ;
doStuff : Str -> LinFoo -> Str = \someStr,foo ->
someStr ++ linFoo foo ;
- For each cat C, I create a type called LinC as an oper, and use it for the lincat of C.
- For each LinC, I create an oper linC : LinC -> Str, and use it for the linref of C.
- linC is also useful in many other places where you need to make a C into a string, e.g. in doStuff.

This is useful, because often many cats share a lincat, and you need to write a LinC and linC only once per distinct lincat. It also makes the code safer: every time you need to produce a string out of a C, you just call linC instead of manually adding all the fields together. Then when you update the lincat, you only need to update one place, that is, linC.
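For example, if Foo’s lincat later grows a new field, only LinFoo and linFoo change; callers such as doStuff stay untouched (the postmod field below is a made-up example):

```
oper
  -- updated lincat: a postmod field was added
  LinFoo : Type = {s : Str ; postmod : Str} ;
  -- linFoo is the only place that needs to know about the new field
  linFoo : LinFoo -> Str = \foo -> foo.s ++ foo.postmod ;
```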
1. Suppose you wanted to prevent “very very Italian pizza”, and wrote this code, which pattern matches a string at run-time.
lin
Very qual = {
s = case qual.s of { -- don't add a second very
"very" + x => qual.s ;
_ => "very" ++ qual.s }
} ;
Instead, you can add a parameter in the lincat of Quality
, which tells whether Very
has been applied or not.
lincat
Quality = {s : Str ; hasVery : HasVery} ;
param
HasVery = ZeroVery | OneVery ;
lin
Italian = {s = "Italian" ; hasVery = ZeroVery} ;
Very qual = {
s = case qual.hasVery of { -- don't add a second very
OneVery => qual.s ;
ZeroVery => "very" ++ qual.s } ;
hasVery = OneVery ;
} ;
This is legal, because you’re pattern matching against a finite type. You have just defined all possible values of HasVery: we know there are only two. In contrast, there are infinitely many strings, so it’s not feasible to be prepared for all of them at run-time. ↩
2. My nick inariksit is a combination of Finnish morphemes joined in an illegal way, so maybe I just like breaking the ~~law~~ morphotactics. ↩
3. Actually, I wonder if the def feature would work here: just define all trees of the form
PredVP (UsePron p) (ComplSlash (SlashV2a v) (UsePron p))
as the corresponding tree using ReflVP:
PredVP (UsePron p) (ReflVP (SlashV2a v))
In practice, it won’t work in the RGL, because the relevant funs are defined as fun, and they’d need to be defined as data.
However, if you want to make such a thing work for your own application, here’s a pull request that you can use in your own copy of the RGL. ↩
tags: gf