
DATR by example

 

We begin our presentation of DATR with a partial analysis of morphology in the English verbal system. In DATR, information is organised as a network of nodes, where a node is essentially just a collection of closely related information. In the context of lexical description, a node typically corresponds to a word, a lexeme or a class of lexemes. For example, we might have a node describing an abstract verb, another for the subcase of a transitive verb, another for the lexeme love and still more for the individual words that are instances of this lexeme (love, loves, loved, loving, etc.). Each node has associated with it a set of path/value pairs where a path is a sequence of atoms (which are primitive objects), and a value is an atom or a sequence of atoms. We will sometimes refer to atoms in paths as attributes.

 

    Path        Value
    syn cat     verb
    syn type    main
    syn form    present participle
    mor form    love ing

Table 1: Path/value pairs for present participle of love

For example, a node describing the present participle form of the verb love (and called perhaps Word1) might contain the path/value pairs shown in Table 1. The paths in this example all happen to contain two attributes, and the first attribute can be thought of as distinguishing syntactic and morphological types of information. The values indicate appropriate linguistic settings for the paths for a present participle form of love. Thus its syntactic category is verb, its syntactic type is main (i.e., it is a main verb, not an auxiliary), its syntactic form is present participle (a two atom sequence), its morphological form is love ing (another two atom sequence). In DATR this can be written as:

    Word1:
        <syn cat>  = verb
        <syn type> = main
        <syn form> = present participle
        <mor form> = love ing.
Here, angle brackets <...> delimit paths. Note that values can be atomic or they can consist of sequences of atoms, as the last two lines of the example illustrate. As a first approximation, nodes can be thought of as denoting partial functions from paths (sequences of atoms) to values (sequences of atoms).
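
As a rough illustration of the partial-function view, Word1 can be modelled in Python as a dictionary from paths to values. The tuple encoding and the name `word1` are assumptions made for this sketch; they are not part of DATR.

```python
# Paths and values are both modelled as tuples of atoms (plain strings).
word1 = {
    ("syn", "cat"): ("verb",),
    ("syn", "type"): ("main",),
    ("syn", "form"): ("present", "participle"),
    ("mor", "form"): ("love", "ing"),
}

# A defined path yields its value.
print(word1[("mor", "form")])       # ('love', 'ing')

# The function is partial: a path with no statement has no value.
print(word1.get(("mor", "root")))   # None
```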

In itself, this tiny fragment of DATR is not persuasive, apparently allowing only for the specification of words by simple listing of path/value statements for each one. It seems that if we wished to describe the passive form of love we would have to write:

    Word2:
        <syn cat>  = verb
        <syn type> = main
        <syn form> = passive participle
        <mor form> = love ed.
This does not seem very helpful: the whole point of a lexical description language is to capture generalisations and avoid the kind of duplication evident in the specification of Word1 and Word2. And indeed, we shall shortly introduce an inheritance mechanism which allows us to do just that. But there is one sense in which this listing approach is exactly what we want: it represents the actual information we generally wish to access from the description. So in a sense we do want all the above statements to be present in our description; what we want to avoid is repeated specification of the common elements.

This problem is overcome in DATR in the following way: such exhaustively listed path/value statements are indeed present in a description, but typically only implicitly present. Their presence is a logical consequence of a second set of statements, which have the concise, generalisation-capturing properties we expect. To make the distinction sharp, we call the first type of statement extensional and the second type definitional. Syntactically, the distinction is made with the equality operator: for extensional statements (as above), we use =, while for definitional statements we use ==. And, although our first example of DATR consisted entirely of extensional statements, almost all the remaining examples will be definitional. The semantics of the DATR language binds the two together in a declarative fashion, allowing us to concentrate on concise definitions of the network structure from which the extensional ``results'' can be read off.

Our first step towards a more concise account of Word1 and Word2 is simply to change the extensional statements to definitional ones:

    Word1:
        <syn cat>  == verb
        <syn type> == main
        <syn form> == present participle
        <mor form> == love ing.
    Word2:
        <syn cat>  == verb
        <syn type> == main
        <syn form> == passive participle
        <mor form> == love ed.
This is possible because DATR respects the unsurprising condition that if at some node a value is specifically defined for a path with a definitional statement, then the corresponding extensional statement also holds. So the statements we previously made concerning Word1 and Word2 remain true, but now only implicitly true.

Although this change does not itself make the description more concise, it allows us to introduce other ways of describing values in definitional statements, in addition to simply specifying them. Such value descriptors will include inheritance specifications which allow us to gather together the properties that Word1 and Word2 have solely by virtue of being verbs. We start by introducing a VERB node:

    VERB:
        <syn cat>  == verb
        <syn type> == main.
and then redefine Word1 and Word2 to inherit their verb properties from it. A direct encoding for this is as follows:

    Word1:
        <syn cat>  == VERB:<syn cat>
        <syn type> == VERB:<syn type>
        <syn form> == present participle
        <mor form> == love ing.
    Word2:
        <syn cat>  == VERB:<syn cat>
        <syn type> == VERB:<syn type>
        <syn form> == passive participle
        <mor form> == love ed.
In these revised definitions the right hand side of the <syn cat> statement is not a direct value specification, but instead an inheritance descriptor. This is the simplest form of DATR inheritance: it just specifies a new node and path from which to obtain the required value. It can be glossed roughly as ``the value associated with <syn cat> at Word1 is the same as the value associated with <syn cat> at VERB''. Thus from VERB:<syn cat> == verb it now follows that Word1:<syn cat> == verb.

However, this modification to our analysis seems to make it less rather than more concise. It can be improved in two ways. The first is really just a syntactic trick: if the path on the right hand side is the same as the path on the left hand side it can be omitted. So we can replace VERB:<syn type>, in the example above, with just VERB. We can also extend this abbreviation strategy to cover cases like the following, where the path on the right hand side is different but the node is the same:

    Come:
        <mor root> == come
        <mor past participle> == Come:<mor root>.
In this case we can simply omit the node:

    Come:
        <mor root> == come
        <mor past participle> == <mor root>.
The other improvement introduces one of the most important features of DATR - specification by default. Recall that paths are sequences of attributes. If we understand paths to start at their left hand end, we can construct a notion of path extension: a path P2 extends a path P1 if and only if all the attributes of P1 occur in the same order at the left hand end of P2 (so <a1 a2 a3> extends <>, <a1>, <a1 a2> and <a1 a2 a3>, but not <a2>, <a1 a3>, etc.). If we now consider the (finite) set of paths occurring in definitional statements associated with some node, that set will not include all possible paths (of which there are infinitely many). So the question arises of what we can say about paths for which there is no specific definition. For some path P1 not defined at node N, there are two cases to consider: either P1 is the extension of some path defined at N or it is not. The latter case is easiest - there is simply no definition for P1 at N (hence N can be a partial function, as already noted above). But in the former case, where P1 extends some P2 which is defined at N, P1 assumes a definition ``by default''. If P2 is the only path defined at N which P1 extends, then P1 takes its definition from the definition of P2. If P1 extends several paths defined at N, it takes its definition from the most specific (i.e., the longest) of the paths that it extends.
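
The notion of path extension, and the choice of the most specific defined path, can be sketched in a few lines of Python. The function names `extends` and `governing_path` are invented for this illustration, and paths are again modelled as tuples of atoms.

```python
def extends(p2, p1):
    """True if path p2 extends path p1, i.e. p1 is a leading subpath of p2."""
    return p2[:len(p1)] == p1


def governing_path(defined, query):
    """Return the longest defined path that `query` extends, or None."""
    candidates = [p for p in defined if extends(query, p)]
    return max(candidates, key=len) if candidates else None


p = ("a1", "a2", "a3")

# <a1 a2 a3> extends <>, <a1>, <a1 a2> and itself ...
print([extends(p, q) for q in [(), ("a1",), ("a1", "a2"), p]])
# [True, True, True, True]

# ... but not <a2> or <a1 a3>.
print(extends(p, ("a2",)), extends(p, ("a1", "a3")))   # False False

# With <a1> and <a1 a2> defined at a node, the longer one wins.
defined = [("a1",), ("a1", "a2")]
print(governing_path(defined, p))        # ('a1', 'a2')
print(governing_path(defined, ("b",)))   # None: no definition at all
```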

In the present example, this mode of default specification can be applied as follows. We have two statements at Word1 which (after applying the abbreviation introduced above) both inherit from VERB:

    Word1:
        <syn cat> == VERB
        <syn type> == VERB.
Because they have a common leading subpath <syn>, we can collapse them into a single statement about <syn> alone:

    Word1:
        <syn> == VERB.
If this were the entire definition of Word1, the default mechanism would ensure that all extensions of <syn> (including the two that concern us here) would be given the same definition - inheritance from VERB. But in our example, of course, there are other statements concerning Word1. If we add these back in, the complete definition looks like this:

    Word1:
        <syn> == VERB
        <syn form> == present participle
        <mor form> == love ing.
The paths <syn type> and <syn cat> (and also many others, such as <syn cat foo>, <syn baz>) obtain their definitions from <syn> using the default mechanism just introduced, and so inherit from VERB. But <syn form>, being explicitly defined, is exempt from this default behaviour, and so retains its value definition, present participle. And any extensions of <syn form> obtain their definitions from <syn form> rather than <syn> (since it is a more specific leading subpath), and so will have the value present participle also.

The net effect of this definition for Word1 can be glossed as ``Word1 stipulates its morphological form to be love ing and inherits values for its syntactic features from VERB, except for <syn form> which is present participle''. More generally, this mechanism allows us to define nodes differentially: by inheritance from default specifications, augmented by any non-default settings associated with the node at hand. In fact, the Word1 example can take this default inheritance one step further, by inheriting everything (not just <syn>) from VERB, except for the specifically mentioned values:

    Word1:
        <> == VERB
        <syn form> == present participle
        <mor form> == love ing.
Here the empty path <> is a leading subpath of every path, and so acts as a ``catch all'' - any path for which no more specific definition at Word1 exists will inherit from VERB. Inheritance via the empty path is ubiquitous in real DATR lexicons but it should be remembered that the empty path has no special formal status in the language.

In this way Word1 and Word2 can both inherit their general verbal properties from VERB. But of course these two particular forms have more in common than simply being verbs - they are both instances of the same verb, love. By introducing an abstract Love lexeme, we can provide a site for properties shared by all forms of love (in this simple example, just its morphological root and the fact that it is a verb).

    VERB:
        <syn cat> == verb
        <syn type> == main.
    Love:
        <> == VERB
        <mor root> == love.
    Word1:
        <> == Love
        <syn form> == present participle
        <mor form> == <mor root> ing.
    Word2:
        <> == Love
        <syn form> == passive participle
        <mor form> == <mor root> ed.
So now Word1 inherits from Love rather than VERB (but Love inherits from VERB, so the latter's definitions are still present at Word1). However, instead of explicitly including the atom love in the morphological form, the value definition includes the descriptor <mor root>. This descriptor is equivalent to Word1:<mor root> and, since <mor root> is not defined at Word1, the empty path definition applies, causing it to inherit from Love:<mor root>, and thereby return the expected value, love. Notice here that each element of a value can be defined entirely independently of the others; for <mor form> we now have an inheritance descriptor for the first element and a simple value for the second.

Our toy fragment is beginning to look somewhat more respectable: a single node for abstract verbs, a node for each abstract verb lexeme, and then individual nodes for each morphological form of each verb. But there is still more that can be done. Our focus on a single lexeme has meant that one class of redundancy has remained hidden. The line

        <mor form> == <mor root> ing
will occur in every present participle form of every verb. But it is a completely generic statement that can be applied to all English present participle verb forms. So can we not replace it with a single statement in the VERB node? Using the mechanisms we have seen so far, the answer is no. The statement would have to be (i), which is equivalent to (ii), whereas the effect we want is (iii):

(i)     VERB:<mor form> == <mor root> ing
(ii)    VERB:<mor form> == VERB:<mor root> ing
(iii)   VERB:<mor form> == Word1:<mor root> ing
Using (i) or (ii), we would end up with the same morphological root for every verb (or more likely no value at all, since it is hard to imagine what value VERB:<mor root> might plausibly be given), rather than a different one for each. And of course, we cannot simply use (iii) as it is, since that only applies to the particular word described by Word1, namely loving.

The problem is that the inheritance mechanism we have been using is local, in the sense that it can only be used to inherit either from a specifically named node (and/or path), or relative to the local context of the node (and/or path) at which it is defined. What we need is a way of specifying inheritance relative to the original node/path specification whose value we are trying to determine, rather than the one we have reached by following inheritance links. We shall refer to this original specification as the query we are attempting to evaluate, and the node and path associated with this query as the global context. Global inheritance, that is, inheritance relative to the global context, is indicated in DATR by using quoted ("...") descriptors, and we can use it to extend our definition of VERB as follows:

    VERB:
        <syn cat> == verb
        <syn type> == main
        <mor form> == "<mor root>" ing.
Here we have added a definition for <mor form> which contains the quoted path "<mor root>". Roughly speaking, this is to be interpreted as ``inherit the value of <mor root> from the node originally queried''. With this extra definition, we no longer need a <mor form> definition in Word1, so it just becomes:

    Word1:
        <> == Love
        <syn form> == present participle.
To see how this global inheritance works, consider evaluating the query Word1:<mor form>. Since <mor form> is not defined at Word1, it will inherit from VERB via Love. This specifies inheritance of <mor root> from the query node, which in this case is Word1. The path <mor root> is not defined at Word1 but inherits the value love from Love. Finally, the definition of <mor form> at VERB adds an explicit ing, resulting in a value of love ing for Word1:<mor form>. However, had we begun evaluation at, say, a daughter of the lexeme Eat, we would have been directed from VERB:<mor form> back to the original daughter of Eat to determine its <mor root>, which would be inherited from Eat itself. So we would have ended up with the value eat ing.
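
The evaluation just described can be mimicked with a small Python model. This is only a sketch under simplifying assumptions, not a DATR implementation: the names `Local`, `Global`, `THEORY` and `evaluate` are invented here, node-only descriptors such as VERB are spelled out as node plus path, quoted nodes are not handled, and `Word4` is a hypothetical present participle daughter of `Eat`.

```python
from collections import namedtuple

# Local("VERB", ()) models VERB:<>; Global(("mor", "root")) models "<mor root>".
Local = namedtuple("Local", "node path")
Global = namedtuple("Global", "path")

THEORY = {
    "VERB": {
        ("syn", "cat"): ["verb"],
        ("syn", "type"): ["main"],
        ("mor", "form"): [Global(("mor", "root")), "ing"],
    },
    "Love": {(): [Local("VERB", ())], ("mor", "root"): ["love"]},
    "Eat": {(): [Local("VERB", ())], ("mor", "root"): ["eat"]},
    "Word1": {(): [Local("Love", ())],
              ("syn", "form"): ["present", "participle"]},
    # Hypothetical daughter of Eat, added only for this illustration.
    "Word4": {(): [Local("Eat", ())],
              ("syn", "form"): ["present", "participle"]},
}


def evaluate(node, path, query_node=None):
    """Evaluate node:<path>; query_node is the node originally queried."""
    query_node = query_node or node
    defs = THEORY[node]
    # Default mechanism: the longest defined leading subpath applies.
    matches = [lhs for lhs in defs if path[:len(lhs)] == lhs]
    if not matches:
        raise KeyError(f"{node}:{path} is undefined")
    lhs = max(matches, key=len)
    rest = path[len(lhs):]          # unmatched remainder of the query path
    value = []
    for item in defs[lhs]:
        if isinstance(item, str):   # a plain atom
            value.append(item)
        elif isinstance(item, Local):
            # Local inheritance: move to the named node, keep the query node.
            value.extend(evaluate(item.node, item.path + rest, query_node))
        else:
            # Global inheritance: go back to the node originally queried.
            value.extend(evaluate(query_node, item.path + rest, query_node))
    return tuple(value)


print(evaluate("Word1", ("mor", "form")))   # ('love', 'ing')
print(evaluate("Word4", ("mor", "form")))   # ('eat', 'ing')
```

Replacing the `Global` descriptor in `VERB` with `Local("VERB", ("mor", "root"))` makes both queries fail in this model, since VERB has no <mor root> of its own; this mirrors the problem with statements (i) and (ii).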

The analysis is now almost the way we would like it to be. However, by moving <mor form> from Word1 to VERB, we have introduced a new problem: we have frozen in the present participle as the (default) value of <mor form> for all verbs. Clearly, if we want to specify other forms at the same level of generality, then <mor form> is currently misnamed: it should be <mor present participle>, so that we can add <mor past participle>, <mor present tense>, etc. If we make this change, then the VERB node will look like this:

    VERB:
        <syn cat> == verb
        <syn type> == main
        <mor past> == "<mor root>" ed
        <mor passive> == "<mor past>"
        <mor present> == "<mor root>"
        <mor present participle> == "<mor root>" ing
        <mor present tense sing three> == "<mor root>" s.
In adding these new specifications, we have added a little extra structure as well. The passive form is asserted to be the same as the past form - the use of global inheritance here ensures that irregular or subregular past forms result in irregular or subregular passive forms, as we shall see shortly. The paths introduced for the present forms illustrate another use of default definition. We assume that the morphology of present tense forms is specified with paths of five attributes, the fourth specifying number, the fifth, person. Here we define default present morphology to be simply the root, and this generalises to all the longer forms, except the present participle and the third person singular.

So now for Love, the following extensional statements hold, inter alia:

    Love:
        <syn cat> = verb
        <syn type> = main
        <mor present tense sing one> = love
        <mor present tense sing two> = love
        <mor present tense sing three> = love s
        <mor present tense plur> = love
        <mor present participle> = love ing
        <mor past tense sing one> = love ed
        <mor past tense sing two> = love ed
        <mor past tense sing three> = love ed
        <mor past tense plur> = love ed
        <mor past participle> = love ed
        <mor passive participle> = love ed.

There remains one last problem in the definitions of Word1 and Word2. The morphological form of Word1 is now given by <mor present participle>. Similarly, Word2's morphological form is given by <mor passive participle>. There is no longer a unique path representing morphological form. But this can be corrected by the addition of a single statement to VERB:

    VERB:
        <mor form> == "<mor "<syn form>">".
This statement employs a DATR construct, the evaluable path, which we have not encountered before. The right hand side consists of a (global) path specification, one of whose component attributes is itself a descriptor, to be evaluated before the outer path can be. The effect of the above statement is to say that <mor form> globally inherits from the path given by the atom mor followed by the global value of <syn form>. For Word1, <syn form> is present participle, so <mor form> inherits from <mor present participle>. But for Word2, <mor form> inherits from <mor passive participle>. Effectively, the <syn form> is being used as a parameter to control which specific form should be considered the morphological form. Evaluable paths may themselves be global (as in our example) or local and their evaluable components may also involve global or local reference.

Our analysis now looks like this:

    VERB:
        <syn cat> == verb
        <syn type> == main
        <mor form> == "<mor "<syn form>">"
        <mor past> == "<mor root>" ed
        <mor passive> == "<mor past>"
        <mor present> == "<mor root>"
        <mor present participle> == "<mor root>" ing
        <mor present tense sing three> == "<mor root>" s.
    Love:
        <> == VERB
        <mor root> == love.
    Word1:
        <> == Love
        <syn form> == present participle.
    Word2:
        <> == Love
        <syn form> == passive participle.
The entire analysis is somewhat larger than the original, but it encodes all the past and present tense forms as well as all three participial forms. More importantly, almost all the information is in the VERB node and is common to many verb lexemes. Indeed, the other nodes are as small as they reasonably could be: Love simply states that it is a verb with morphological root love and Word1 simply states that it is a present participle instance of Love.

Of course, Love is a completely regular verb. But DATR's capacity for definition by default allows subregular and irregular lexemes to be concisely represented also. As an example, consider the class of verbs which take en as their past participle ending: hew, mow, saw, sew, etc. We can represent this subregularity with a new verbal node which defaults to VERB, but overrides just the past participle morphology:

    EN_VERB:
        <> == VERB
        <mor past participle> == "<mor root>" en.
Relevant individual verb lexemes then inherit from this node instead of directly from VERB:
    Mow:
        <> == EN_VERB
        <mor root> == mow.
    Sew:
        <> == EN_VERB
        <mor root> == sew.
As noted above, the passive forms of these subregular verbs will also now be correct, because of the use of a global cross-reference to the past participle form in the VERB node. So for example, the definition of the passive form of sew is:
    Word3:
        <> == Sew
        <syn form> == passive participle.
If we seek to establish the <mor form> of Word3, we are sent up the hierarchy of nodes, first to Sew, then to EN_VERB, and then to VERB. Here we encounter "<mor "<syn form>">" which resolves to "<mor passive participle>" in virtue of the embedded global reference to <syn form> at Word3. This means we now have to establish the value of <mor passive participle> at Word3. Again, we ascend the hierarchy to VERB and find ourselves referred to the global descriptor "<mor past participle>". This takes us back to Word3, from where we again climb, first to Sew, then to EN_VERB. Here, <mor past participle> is given as the sequence "<mor root>" en. This leads us to look for the <mor root> of Word3 which we find at Sew giving the result we seek:
    Word3:
        <mor form> = sew en.
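
The derivation traced above can be reproduced with a toy evaluator written in Python. It is a sketch under stated assumptions rather than a faithful DATR interpreter: the names `Local`, `Global`, `THEORY`, `resolve` and `eval_item` are invented for this illustration, quoted nodes are not supported, and the unmatched remainder of a query path is appended to every path on the right hand side (including embedded ones), which is how this sketch reads the default mechanism.

```python
from collections import namedtuple

Local = namedtuple("Local", "node path")   # models Node:<path>
Global = namedtuple("Global", "path")      # models "<path>"

ROOT = Global(("mor", "root"))
THEORY = {
    "VERB": {
        ("syn", "cat"): ["verb"],
        ("syn", "type"): ["main"],
        # Evaluable path: "<mor "<syn form>">"
        ("mor", "form"): [Global(("mor", Global(("syn", "form"))))],
        ("mor", "past"): [ROOT, "ed"],
        ("mor", "passive"): [Global(("mor", "past"))],
        ("mor", "present"): [ROOT],
        ("mor", "present", "participle"): [ROOT, "ing"],
        ("mor", "present", "tense", "sing", "three"): [ROOT, "s"],
    },
    "EN_VERB": {(): [Local("VERB", ())],
                ("mor", "past", "participle"): [ROOT, "en"]},
    "Love": {(): [Local("VERB", ())], ("mor", "root"): ["love"]},
    "Sew": {(): [Local("EN_VERB", ())], ("mor", "root"): ["sew"]},
    "Word1": {(): [Local("Love", ())],
              ("syn", "form"): ["present", "participle"]},
    "Word2": {(): [Local("Love", ())],
              ("syn", "form"): ["passive", "participle"]},
    "Word3": {(): [Local("Sew", ())],
              ("syn", "form"): ["passive", "participle"]},
}


def resolve(node, path, query_node):
    """Evaluate node:<path> using the most specific definition at node."""
    defs = THEORY[node]
    matches = [lhs for lhs in defs if path[:len(lhs)] == lhs]
    if not matches:
        raise KeyError(f"{node}:{path} is undefined")
    lhs = max(matches, key=len)
    rest = path[len(lhs):]
    value = []
    for item in defs[lhs]:
        value.extend(eval_item(item, query_node, rest))
    return tuple(value)


def eval_item(item, query_node, rest):
    """Evaluate one right hand side element to a tuple of atoms."""
    if isinstance(item, str):
        return (item,)
    # Build the path, evaluating any embedded descriptors first.
    built = []
    for part in item.path:
        if isinstance(part, str):
            built.append(part)
        else:
            built.extend(eval_item(part, query_node, rest))
    full = tuple(built) + rest
    target = query_node if isinstance(item, Global) else item.node
    return resolve(target, full, query_node)


for word in ("Word1", "Word2", "Word3"):
    print(word, resolve(word, ("mor", "form"), word))
# Word1 ('love', 'ing')
# Word2 ('love', 'ed')
# Word3 ('sew', 'en')
```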
Irregularity can be treated as just the limiting case of subregularity, so, for example, the morphology of Do can be specified as follows:

    Do:
        <> == VERB
        <mor root> == do
        <mor past> == did
        <mor past participle> == done
        <mor present tense sing three> == does.
Likewise, the morphology of Be can be specified as follows:

    Be:
        <> == EN_VERB
        <mor root> == be
        <mor present tense sing one> == am
        <mor present tense sing three> == is
        <mor present tense plur> == are
        <mor past tense sing one> == <mor past tense sing three>
        <mor past tense sing three> == was
        <mor past tense plur> == were.

In this section we have moved from simple attribute/value listings to a compact, generalisation-capturing representation for a fragment of English verbal morphology. In so doing, we have seen examples of most of the important ingredients of DATR: local and global descriptors, definition by default, and evaluable paths.



Copyright © Roger Evans, Gerald Gazdar & Bill Keller
Wed Feb 26 12:00:02 GMT 1997