See Part I of this post first. As you might guess from the naming scheme, it's recommended reading before Part II.
This time, let's take a more advanced grammar for numerals. And since the RGL already contains a nice implementation of numerals, we just extend the Numeral grammar from the RGL. (If extending and opening are new concepts, see the tutorial.)
abstract Modifiers = Numeral
** {
flags startcat = Clause ;
cat
Kind ;
Clause ;
-- The categories Int, Float and String are present in all grammars
-- The categories Dig and Digits come from Numeral.
fun
NItemsLiteral : Int -> Kind -> Clause ;
NItemsDig : Digits -> Kind -> Clause ;
Car : Kind ;
} ;
We defined 2 functions just to compare the implementations: NItemsLiteral using the literal Int, and NItemsDig using the Digits category from Numeral.
concrete ModifiersEng of Modifiers = NumeralEng -- English linearisations of Digits, IIDig, IDig
** open
SyntaxEng, -- for CN, S, mkS, mkCl, mkNP, mkCN, mkDet, aPl_Det
ParadigmsEng, -- for mkN
SymbolicEng -- for symb
in {
lincat
Kind = CN;
Clause = S;
lin
-- Hacky, and produces "1 cars"
NItemsLiteral int kind =
let sym : Symb = mkSymb int.s ; -- mkSymb : Str -> Symb ;
card : Card = symb sym ; -- symb : Symb -> Card ;
det : Det = mkDet card ;
in mkS (mkCl (mkNP det kind)) ;
-- Comes from the RGL, produces "1 car"
NItemsDig num kind = mkS (mkCl (mkNP (mkDet num) kind)) ;
Car = mkCN (mkN "car") ;
}
As we saw already in the definition, choosing indefinite plural will not work for the case n=1:
Modifiers> l NItemsDig (IDig D_1) Car
there is 1 car
Modifiers> l NItemsLiteral 1 Car
there are 1 cars
That’s what was bound to happen. How about other numbers?
Modifiers> p "there are 2 cars"
NItemsDig (IDig D_2) Car
NItemsLiteral 2 Car
For numbers 2-9 there is no difference. But when we get to 10 and higher, there's a small difference. With Digits, the numbers are bound together by the BIND token. If you linearise without the flag -bind in the normal GF shell, you get &+ in between.
Modifiers> l NItemsDig (IIDig D_9 (IIDig D_9 (IDig D_9))) Car
there are 9 &+ 9 &+ 9 cars
0 msec
Modifiers> l -bind NItemsDig (IIDig D_9 (IIDig D_9 (IDig D_9))) Car
there are 999 cars
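To see the BIND token in isolation, here is a minimal sketch of a toy grammar. The names Bound, BoundCnc, Amount and NinetyNine are made up for illustration and are not part of the RGL; the only library piece used is BIND from Prelude.

```gf
-- Bound.gf (hypothetical toy grammar, not part of the RGL)
abstract Bound = {
  flags startcat = Amount ;
  cat Amount ;
  fun NinetyNine : Amount ;
}

-- BoundCnc.gf
concrete BoundCnc of Bound = open Prelude in {
  lincat Amount = {s : Str} ;
  lin
    -- Two separate tokens with a BIND token in between.
    NinetyNine = {s = "9" ++ BIND ++ "9"} ;
}
```

In the normal GF shell, `l NinetyNine` should show the tokens separated by &+, and `l -bind NinetyNine` should glue them together, just like with the Digits example above.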
Parsing in the standard GF shell doesn't find the NItemsDig tree if you don't insert the &+ yourself:
Modifiers> p "there are 999 cars"
NItemsLiteral 999 Car
This is because the standard GF shell uses the Haskell runtime, which doesn't add the &+s automatically. However, the newer C runtime adds &+s automatically, and there are bindings from it to several programming languages, if you want to use a GF grammar which uses BIND tokens in an application. If your GF is compiled with C runtime support, then you can start the GF shell with the flag -cshell, and open your grammar in PGF format. This is already included in the binary versions, except for Windows.
To use the C-shell, follow these steps:
$ gf -make ModifiersEng.gf -- this creates Modifiers.pgf
$ gf -cshell -- open GF with -cshell flag
> i Modifiers.pgf -- import Modifiers.pgf
Modifiers> p "there are 999 cars"
NItemsLiteral 999 Car
NItemsDig (IIDig D_9 (IIDig D_9 (IDig D_9))) Car
As you can see from the implementation of NItemsLiteral, we can make a Det out of an Int in three steps: apply mkSymb to the string inside the Int, apply the Symb -> Card instance of symb, and apply mkDet : Card -> Det.
NItemsLiteral int kind =
let sym : Symb = mkSymb int.s ; -- mkSymb : Str -> Symb ;
card : Card = symb sym ; -- symb : Symb -> Card ;
det : Det = mkDet card ;
in mkS (mkCl (mkNP det kind)) ;
Technically, we can do this to any other literal category, because all of them have the lincat {s : Str}, and the first step is just Str -> Symb.
However, a Det formed out of Card is always plural, so if you need a singular determiner, you need to do something else.
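As a sketch of the "any other literal category" point, the same three steps can be applied to a Float. The function name NItemsFloat is an assumption made up for this example; mkSymb, symb and mkDet are used exactly as in NItemsLiteral above.

```gf
-- Hypothetical addition to the abstract syntax Modifiers:
fun
  NItemsFloat : Float -> Kind -> Clause ;

-- Hypothetical addition to the concrete syntax ModifiersEng:
lin
  NItemsFloat float kind =
    let sym  : Symb = mkSymb float.s ;  -- mkSymb : Str -> Symb
        card : Card = symb sym ;        -- symb : Symb -> Card
        det  : Det  = mkDet card ;      -- a Det made from a Card is plural
     in mkS (mkCl (mkNP det kind)) ;
```

Only the argument type changes; the body is the same because Int, Float and String all share the lincat {s : Str}. The result is still plural regardless of the value, which is the limitation discussed next.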
We need to find some function that makes a singular determiner out of a literal. Let’s write down a skeleton:
fun
LiteralDetKind : String -> Kind -> Clause ;
lin
LiteralDetKind string kind =
let det : Det = ??? string ; -- Wishful thinking
in mkS (mkCl (mkNP det kind)) ;
Is there some mkDet constructor in ParadigmsEng or ResEng that we can use? Let's examine StructuralEng to see how they make Dets: by using mkDeterminerSpec, which in turn calls regGenitiveS, which ultimately pattern matches a string. In the case of LiteralKind, the string is a runtime string: it comes as an argument to a function. So we can't use mkDeterminerSpec, because that would give us an unsupported token gluing exception.
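The difference between compile-time and runtime strings can be sketched with a toy grammar. Everything here (Glue, GlueEng, Known, Unknown, addGenitive) is a hypothetical stand-in, and addGenitive is much simpler than the RGL's regGenitiveS; it only shows the kind of operation that cannot be applied to a literal.

```gf
-- Glue.gf (hypothetical toy grammar)
abstract Glue = {
  flags startcat = Phrase ;
  cat Phrase ;
  fun
    Known   : Phrase ;
    Unknown : String -> Phrase ;
}

-- GlueEng.gf
concrete GlueEng of Glue = {
  lincat Phrase = {s : Str} ;
  oper
    -- Glues a genitive ending onto a string with the + operator.
    addGenitive : Str -> Str = \s -> s + "'s" ;
  lin
    -- Fine: "car" is known at compile time, so the gluing is
    -- computed by the compiler and the result is a single token.
    Known = {s = addGenitive "car"} ;
    -- Not fine: string.s is only known at runtime, so this line is
    -- expected to be rejected with the token gluing error.
    Unknown string = {s = addGenitive string.s} ;
}
```

Removing the Unknown function (from both modules) should make the grammar compile, which narrows the problem down to the gluing on the runtime argument.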
So, back to the drawing board. Let's first see what the lincat of Det is in CatEng:
Det = {s : Str ;
sp : Gender => Bool => NPCase => Str ;
n : Number ; hasNum : Bool} ;
That's a lot of stuff! But if we look at the implementation of DetCN (which is the instance of mkNP that we are interested in in this case), we see that the s field is the most relevant one. But of course n, sp and hasNum need to be present too (as well as lock_Det), otherwise the record isn't a Det and cannot be given as an argument to any function that expects a Det.
NB. At this point, I expect you to know about record extension in GF. No need to go further than this blog, if you need to brush up on that.
Anyway, it’s unsafe to write stuff like this manually in an application grammar:
myHackyDet : String -> Det = \string -> lin Det {
s = string.s ;
sp = \\gend,bool,npcase => string.s ;
n = ParamX.Sg ;
hasNum = False
} ;
But we can make it just a tiny bit less dangerous by a simple trick: extend some known stable Det. You can find a bunch of them in the synopsis, they work for every language!
So let's say that we want our determiner to be singular indefinite. Then we can extend aSg_Det from the API, and only change manually the s field. For English, that is; for any other language Xxx you want to do this for, you need to check CatXxx to see what the lincat of Det is in Xxx.
LiteralDetKind string kind =
let det : Det = a_Det ** {s = a_Det.s ++ string.s} ;
in mkS (mkCl (mkNP det kind)) ;
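The same trick should carry over to other determiners from the API. As a sketch, assuming a made-up function name LiteralTheKind, a definite variant could extend the_Det instead:

```gf
-- Hypothetical addition to the abstract syntax Modifiers:
fun
  LiteralTheKind : String -> Kind -> Clause ;

-- Hypothetical addition to the concrete syntax ModifiersEng:
lin
  LiteralTheKind string kind =
    -- Keep every field of the_Det, override only the s field.
    let det : Det = the_Det ** {s = the_Det.s ++ string.s} ;
     in mkS (mkCl (mkNP det kind)) ;
```

The n, sp and hasNum fields are inherited unchanged from the_Det, so only the s field is written by hand, and only that field depends on the English lincat.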
Finally we’re through all the disclaimers, time to check out what the grammar produces!
Modifiers> l LiteralKind "asdasdfsdggdfs" Car
there is asdasdfsdggdfs car
How about the version with Det? It inserts the indefinite article in there:
Modifiers> l LiteralDetKind "asdasdfsdggdfs" Car
there is an asdasdfsdggdfs car
Looks good! By the way, the indefinite article is even bound to be correct in most of the cases, because the choice between a and an is done on another level, using the pre construction (tutorial, ref. manual).
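For readers who haven't met pre before, here is a minimal sketch of how such a choice can be written. The grammar (Article, ArticleEng, AThing, Apple, Car) is a made-up toy, and its artIndef is a simplified stand-in that only checks four initial vowels; the English resource grammar's own version covers more cases.

```gf
-- Article.gf (hypothetical toy grammar)
abstract Article = {
  flags startcat = Phrase ;
  cat Phrase ; Thing ;
  fun
    AThing : Thing -> Phrase ;
    Apple, Car : Thing ;
}

-- ArticleEng.gf
concrete ArticleEng of Article = {
  lincat Phrase, Thing = {s : Str} ;
  oper
    -- The form depends on how the *next* token begins.
    artIndef : Str = pre {
      "a" | "e" | "i" | "o" => "an" ;
      _ => "a"
    } ;
  lin
    AThing thing = {s = artIndef ++ thing.s} ;
    Apple = {s = "apple"} ;
    Car   = {s = "car"} ;
}
```

Here `l AThing Apple` gives "an apple" and `l AThing Car` gives "a car". Since the decision looks only at the spelling of the following token, it also applies to a string literal, which is why "asdasdfsdggdfs" got "an" above, and also why it is only correct in most cases rather than all.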
Parsing works as expected:
Modifiers> p "there is a qZPjp car"
LiteralDetKind "qZPjp" Car
Modifiers> p "there is vyknmj3 car"
LiteralKind "vyknmj3" Car
Can I just say literally anything? Profanities and grammatically incorrect language?
Modifiers> l LiteralKind "incorrectly grammatical" Car
there is incorrectly grammatical car
We just made a GF grammar say that! The power, it’s intoxicating! Now let’s parse that:
Modifiers> p "there is incorrectly grammatical car"
The parser failed at token 4: "incorrectly"
There’s one more gotcha: we can’t parse literals that contain spaces. As for explanations, let me quote page 46 in Krasimir’s thesis.
The common in all cases is that the set of values for the literal categories is not enumerated in the grammar but is hard-wired in the compiler and the interpreter. The linearization rule is also predefined, for example, if we have the constant 3.14 in an abstract syntax tree, then it is automatically linearized as the record { s = "3.14" }. Similarly, if we have the string "John Smith" then its linearization is the wrapping of the string in a record, i.e. { s = "John Smith" }.
Now we have a problem because the rules in Section 2.3 are not sufficient to deal with literals. Furthermore, while usually the parser can use the grammar to predict the scopes of the syntactic phrases, this is not possible for the literals since we allow arbitrary unrestricted strings as values of category String. Let say, for example, that we have a grammar which represents named entities as literals, then we can represent the sentence:
"John Smith is one of the main characters in Disney's film Pocahontas."
as an abstract syntax tree of some sort, for instance:
MainCharacter "John Smith" "Disney" "Pocahontas"
This works fine for linearization because we have already isolated the literals as separated values. However, if we want to do parsing, then the parser will have to consider all possible segmentations where three of the substrings in the input string are considered literals. This means that the number of alternatives will grow exponentially with the number of String literals. Such exponential behaviour is better to be avoided, and in most cases, it is not really necessary.
So there’s that. I recommend reading Krasimir’s thesis, it has answers to all your problems you didn’t know you had. Some mornings I read Krasimir’s thesis before I get out of the bed, it’s so good.
Here’s a Stack Overflow answer which shows a similar solution, but making String literals into APs instead. (Scroll or Ctrl+F to “Arbitrary strings as artists”.)
My next blog post is about going beyond the API more generally (not just with literals), and how to limit the dangers of using low-level opers in your application grammar.
I got an email asking how to turn the previous Modifiers grammar into a functor instead. Here's the answer: ignore the changed filename, apart from that it's the same grammar. gist.github.com/inariksit/daf9b5c6cd374930aca6eec9d57ce3a6