Inari Listenmaa

Logo

CV · Blog · GitHub

26 January 2019

String, Int and Float literals in GF, part II

Part I of this post. As you might guess from the naming scheme, it’s recommended to read before Part II.

Using numerals as modifiers

This time, let’s take a more advanced grammar for numerals. And since the RGL already contains a nice implementation of numerals, we just extend the Numeral grammar from RGL. (If extending and opening are new concepts, see tutorial.)

Abstract syntax

abstract Modifiers = Numeral
  ** {
  flags startcat = Clause ;
cat
  Kind ;
  Clause ;

  -- The categories Int, Float and String are present in all grammars

  -- The categories Dig and Digits come from Numeral.
fun
  NItemsLiteral : Int -> Kind -> Clause ;
  NItemsDig : Digits -> Kind -> Clause ;
  Car : Kind ;

} ;

We defined 2 functions just to compare the implementations: NItemsLiteral using the literal Int, and NItemsDig using the Digits category from Numeral.

English concrete syntax

concrete ModifiersEng of Modifiers = NumeralEng -- English linearisations of Digits, IIDig, IDig
  ** open
  SyntaxEng, -- for CN, S, mkS, mkCl, mkNP, mkCN, mkDet, aPl_Det
  ParadigmsEng, -- for mkN
  SymbolicEng -- for symb
  in {

lincat
  Kind = CN;
  Clause = S;

lin

  -- Hacky, and produces "1 cars"
  NItemsLiteral int kind =
    let sym : Symb = mkSymb int.s ; -- mkSymb : Str -> Symb ;
        card : Card = symb sym ;    -- symb : Symb -> Card ;
        det : Det = mkDet card ;
     in mkS (mkCl (mkNP det kind)) ;

  -- Comes from the RGL, produces "1 car"
  NItemsDig num kind = mkS (mkCl (mkNP (mkDet num) kind)) ;

  Car = mkCN (mkN "car") ;
}

Output of the grammar

As we saw already in the definition, choosing indefinite plural will not work for the case n=1:

Modifiers> l NItemsDig (IDig D_1) Car
there is 1 car
Modifiers> l NItemsLiteral 1 Car
there are 1 cars

That’s what was bound to happen. How about other numbers?

Mofifiers> p "there are 2 cars"
NItemsDig (IDig D_2) Car
NItemsLiteral 2 Car

For numbers 2-9 there is no difference. But when we get to 10 and higher, there’s a small difference. With Digits, the numbers are bound together by the BIND token. If you linearise without the flag -bind in the normal GF shell, you get &+ in between.

Modifiers> l NItemsDig (IIDig D_9 (IIDig D_9 (IDig D_9))) Car
there are 9 &+ 9 &+ 9 cars
0 msec
Modifiers> l -bind NItemsDig (IIDig D_9 (IIDig D_9 (IDig D_9))) Car
there are 999 cars

BIND token and the different runtimes

Parsing in the standard GF shell doesn’t work if you don’t insert &+ yourself:

Modifiers> p "there are 999 cars"
NItemsLiteral 999 Car

This is because the standard GF shell uses the Haskell runtime, which doesn’t add the &+s automatically. However, the newer C runtime adds &+s automatically, and there are bindings from it to several programming languages, if you want to use a GF grammar which uses BIND tokens in an application. If your GF is compiled with C runtime support, then you can start the GF shell with the flag -cshell, and open your grammar in a PGF format. This is already included in the binary versions, except for Windows.

To use the C-shell, follow these steps:

$ gf -make ModifiersEng.gf  -- this creates Modifiers.pgf
$ gf -cshell                -- open GF with -cshell flag
> i Modifiers.pgf           -- import Modifiers.pgf
Modifiers> p "there are 999 cars"
NItemsLiteral 999 Car
NItemsDig (IIDig D_9 (IIDig D_9 (IDig D_9))) Car

Arbitrary strings as modifiers

As you can see from the implementation of NItemsLiteral, we can make a Det out of an Int in three steps:

  1. Call mkSymb to the string inside the Int
  2. Use the Symb -> Card instance of symb
  3. Use mkDet : Card -> Det.
  NItemsLiteral int kind =
    let sym : Symb = mkSymb int.s ; -- mkSymb : Str -> Symb ;
        card : Card = symb sym ;    -- symb : Symb -> Card ;
        det : Det = mkDet card ;
     in mkS (mkCl (mkNP det kind)) ;

Technically, we can do this to any other literal category, because all of them have the lincat {s : Str}, and the first step is just Str -> Symb. However, a Det formed out of Card is always plural, so if you need a singular determiner, you need to do something else.

String to Det

We need to find some function that makes a singular determiner out of a literal. Let’s write down a skeleton:

fun
  LiteralDetKind : String -> Kind -> Clause ;

lin
  LiteralDetKind string kind =
    let det : Det = ??? string ; -- Wishful thinking
     in mkS (mkCl (mkNP det kind)) ;

Is there some mkDet constructor in ParadigmsEng or ResEng that we can use? Let’s examine StructuralEng to see how they make Dets: by using mkDeterminerSpec, which in turn calls regGenitiveS, which ultimately pattern matches a string. In the case of LiteralKind, the string is a runtime string: it comes as an argument to a function. So we can’t use mkDeterminerSpec, because that would give us unsupported token gluing exception.

So, back to the drawing board. Let’s see first what is the lincat of Det in CatEng:

Det = {s : Str ;
       sp : Gender => Bool => NPCase => Str ;
       n : Number ; hasNum : Bool} ;

That’s a lot of stuff! But if we look at the implementation of DetCN (which is the instance of mkNP that we are interested in this case), we see that s field is the most relevant one. But of course n, sp and hasNum need to be present too (as well as lock_Det), otherwise the record isn’t a Det and cannot be given as an argument to any function that expects a Det.

NB. At this point, I expect you to know about record extension in GF. No need to go further than this blog, if you need to brush up on that.

Anyway, it’s unsafe to write stuff like this manually in an application grammar:

myHackyDet : String -> Det = \string -> lin Det
  s = string.s ;
  sp = \\gend,bool,npcase => string.s ;
  n = ParamX.Sg ;
  hasNum = False
  } ;

But we can make it just tiny bit less dangerous by a simple trick: extend some known stable Det. You can find a bunch of them in the synopsis, they work for every language!

So let’s say that we want our determiner to be singular indefinite. Then we can extend aSg_Det from the API, and only change manually the s field. For English, that is; for any other language Xxx you want to do this for, you need to check CatXxx to see what is the lincat of Det in Xxx.

LiteralDetKind string kind =
  let det : Det = a_Det ** {s = a_Det.s ++ string.s} ;
   in mkS (mkCl (mkNP det kind)) ;

Output of the grammar

Finally we’re through all the disclaimers, time to check out what the grammar produces!

Modifiers> l LiteralKind "asdasdfsdggdfs" Car
there is asdasdfsdggdfs car

How about the version with Det? It inserts the indefinite article in there:

Modifiers> l LiteralDetKind "asdasdfsdggdfs" Car
there is an asdasdfsdggdfs car

Looks good! By the way, the indefinite article is even bound to be correct in most of the cases, because the choice between a and an is done on another level, using the pre construction (tutorial, ref. manual).

Parsing works as expected:

Modifiers> p "there is a qZPjp car"
LiteralDetKind "qZPjp" Car

Modifiers> p "there is vyknmj3 car"
LiteralKind "vyknmj3" Car

Can I just say literally anything? Profanities and grammatically incorrect language?

Modifiers> l LiteralKind "incorrectly grammatical" Car
there is incorrectly grammatical car

We just made a GF grammar say that! The power, it’s intoxicating! Now let’s parse that:

Modifiers> p "there is incorrectly grammatical car"
The parser failed at token 4: "incorrectly"

Cannot parse literals that contain spaces

There’s one more gotcha: we can’t parse literals that contain spaces. As for explanations, let me quote page 46 in Krasimir’s thesis.

The common in all cases is that the set of values for the literal categories is not enumerated in the grammar but is hard-wired in the compiler and the interpreter. The linearization rule is also predefined, for example, if we have the constant 3.14 in an abstract syntax tree, then it is automatically linearized as the record { s = ”3.14” }. Similarly, if we have the string ”John Smith” then its linearization is the wrapping of the string in a record, i.e. { s = ”John Smith” } .

Now we have a problem because the rules in Section 2.3 are not sufficient to deal with literals. Furthermore, while usually the parser can use the grammar to predict the scopes of the syntactic phrases, this is not possible for the literals since we allow arbitrary unrestricted strings as values of category String. Let say, for example, that we have a grammar which represents named entities as literals, then we can represent the sentence:

“John Smith is one of the main characters in Disney’s film Pocahontas.”

as an abstract syntax tree of some sort, for instance:

MainCharacter ”John Smith” ”Disney” ”Pocahontas”

This works fine for linearization because we have already isolated the literals as separated values. However, if we want to do parsing, then the parser will have to consider all possible segmentations where three of the substrings in the input string are considered literals. This means that the number of alternatives will grow exponentially with the number of String literals. Such exponential behaviour is better to be avoided, and in most cases, it is not really necessary.

So there’s that. I recommend reading Krasimir’s thesis, it has answers to all your problems you didn’t know you had. Some mornings I read Krasimir’s thesis before I get out of the bed, it’s so good.

Read more

Here’s a Stack Overflow answer which shows a similar solution, but making String literals into APs instead. (Scroll or Ctrl+F to “Arbitrary strings as artists”.)

My next blog post is about going beyond the API more generally (not just with literals), and how to limit the dangers of using low-level opers in your application grammar.

Appendix: functors

I got an email asking how to make the previous Modifiers grammar to be a functor instead. Here’s the answer—ignore the changed filename, apart from that it’s the same grammar. gist.github.com/inariksit/daf9b5c6cd374930aca6eec9d57ce3a6

tags: gf