A Different Approach to Named Entity Recognition Using ANNIE in GATE

This is the first of a multi-part post describing NAG, a named-entity recognizer (NER) built using ANNIE. NAG replaces many of the default named-entity transducers that ship with the GATE package, with the aim of providing a more robust and flexible named entity recognition capability. It uses a small core set of simple rules to score proper noun phrases based on terms that precede, follow, and/or are embedded within the phrase, and it can easily accommodate new entity types to suit individual needs.

Overview

NAG processes text in phases: Gazetteer, NP Chunking, Pre-term scoring, Post-term scoring, Embedded-term scoring, and finally, Term conversion.

The gazetteer phase uses the ANNIE Gazetteer to annotate specific terms in the text matching terms in the gazetteer’s lists. Different annotations are applied depending on which list the matching term comes from. A term can be annotated multiple times if it appears in multiple gazetteer lists.

The NP chunking phase aggregates consecutive proper nouns into a single term, as a convenience to other rules.

The pre, post, and embedded-term scoring phases use a set of similar rules to evaluate the text and add to the score for different entity types. For example, if the word “Stadium” is found within a term, then the score for that term being a Facility type is increased.

The term conversion phase identifies the highest scoring entity types for a given phrase, and converts that term into the chosen type(s). The system is designed such that a given term can be converted into multiple types, if multiple high scores are recorded.
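
To make the conversion step concrete, here is a minimal Python sketch of picking the winning type(s) from a table of scores. It is an illustration only, not NAG's JAPE code: the function name choose_types, the threshold parameter, and the decision to treat only exact ties as “multiple high scores” are all assumptions.

```python
def choose_types(scores, threshold=1):
    """Return every entity type tied for the highest score.

    `scores` maps an entity type name to its accumulated score.
    `threshold` is a hypothetical minimum; below it the term is
    left unconverted.
    """
    if not scores:
        return []
    best = max(scores.values())
    if best < threshold:
        return []
    # Keep every type that reaches the best score, so ties yield several types.
    return sorted(t for t, s in scores.items() if s == best)


# Scores invented for illustration.
print(choose_types({"Facility": 90, "Location": 90, "Person": 10}))
# ['Facility', 'Location']
```

A real system might prefer a tolerance band rather than exact ties; that choice is left open here.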

ANNIE Overview

ANNIE consists of a number of processing resources (PR) that perform a variety of core tasks. These include the English Tokenizer, Sentence Splitter, Part-of-speech (POS) tagger, and Gazetteer. Text is annotated by ANNIE PRs. For now, we can think of an annotation as structured information that gets attached to a string of text. The English Tokenizer, for example, annotates strings of text based on word boundaries. Essentially each word in this document is annotated with the {Token} annotation. The POS tagger adds features to each {Token}. The sentence splitter adds a {Split} annotation where appropriate, while the Gazetteer adds {Lookup} annotations. Each PR in the processing chain relies on annotations added by previous PRs. For example, the POS Tagger relies on the {Token} and {Split} annotations in order to properly do its job.

In addition to the included PRs, ANNIE lets you create your own. NAG consists of several PRs built using the JAPE language. A JAPE rule consists of two parts, referred to as the Left-Hand Side (LHS) and Right-Hand Side (RHS), although in practice the two halves generally appear one above the other. As in most rule-based systems, the LHS specifies the matching conditions under which the RHS is executed. If a piece of text matches the conditions specified in the LHS, then the rule “fires”, executing the code defined in the RHS.

A simple LHS of a rule:


Rule: MatchAllTokens
(
{Token}
):token

-->

All rules begin with Rule: rulename

The name is followed by an opening parenthesis (

The matching constraints are listed next. In this case, the LHS will match any token. As we learned earlier, the English Tokenizer annotates every word with {Token}, so this rule will essentially match every word in a document.

The pattern is closed with a closing parenthesis ).

As we’ll see later, the RHS needs a handle to the text that the LHS has matched. In this case, the handle is :token.

Finally, the --> separates the LHS from the RHS.

Usually, we will want to match on something more specific than just {Token}. This is done by specifying features. A few of the most commonly used features for a {Token} are: category, length, string, orth, and kind.

For example, if we want to match on a {Token} having a length of one (i.e., a single character), we could specify {Token.length == 1}. We could specify {Token.string == "and"} to match just the word “and”. The orth feature relates to capitalization, such as {Token.orth == "upperInitial"}, {Token.orth == "allCaps"}, or {Token.orth == "mixedCaps"}, the last corresponding to a word such as MacArthur. The category feature is particularly powerful, because it lets you create a rule that matches all words of a given part of speech. For example, {Token.category == "DT"} will match all words used as determiners. Common determiners include the articles “the” and “a”, but can also include other words depending on context, which is what makes the POS tagger so important and this feature so powerful. Token is not the only annotation. The Gazetteer, for example, creates {Lookup} annotations, while the Sentence Splitter creates {Split} annotations.

The Gazetteer phase

The Gazetteer stage uses the standard ANNIE Gazetteer PR. For each entity type to be recognized, NAG uses three types of gazetteer lists: a list of terms that precede the entity type, a list of terms that follow the entity type, and a list of terms that can be found within the entity type itself. For example, job titles often appear before or after a person’s name, while titles, such as military rank, can appear before a person’s name. Both jobtitles and titles are gazetteer lists used by NAG. Certain endings, such as Jr or Sr, appear at the end of a person’s name. Facilities are often distinguished by the presence of the facility type, such as Bridge, Stadium, etc., either as part of the facility name itself (such as the George Washington Bridge), or sometimes immediately following the name (the latter often due to a capitalization mistake). Vessels (ships) are generally preceded by the type of vessel, such as the tugboat Clementine, or by a special designation that is part of the vessel’s name, such as the USS Carl Vinson.

Gazetteer list file format

The gazetteer list files are simple text files, usually ending in .lst. Each term is placed on a separate line. NAG also uses the ANNIE gazetteer’s ability to attach extra features to a term in order to give each term a weight. The weight is separated from the term by a tab, as shown below:

said	weight=40
who	weight=50
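
For readers who want to inspect their own lists, here is a rough Python sketch of reading one line in this format. It is not how GATE loads gazetteers internally; the function name parse_lst_line is hypothetical, the tab separator is taken from the description above, and feature values are simply kept as strings.

```python
def parse_lst_line(line, separator="\t"):
    """Split one gazetteer line into (term, features).

    Everything before the first separator is the term; each remaining
    field is read as name=value.
    """
    term, *pairs = line.rstrip("\n").split(separator)
    features = {}
    for pair in pairs:
        if not pair:
            continue  # tolerate a trailing or doubled separator
        name, _, value = pair.partition("=")
        features[name] = value
    return term, features


print(parse_lst_line("said\tweight=40"))  # ('said', {'weight': '40'})
```

A line with no tab, such as a plain term, comes back with an empty feature dictionary.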

Gazetteer control file format

The Gazetteer control file, called lists.def, specifies which .lst files are to be compiled into the Gazetteer. A typical entry looks like this:

province.lst:location:province

The first portion of the entry is the name of the .lst file to include, in this case province.lst. Separated from the file name by a colon is the majorType, which in turn is separated by another colon from the minorType. As the gazetteer PR executes on a document, each string of text that matches an entry in the province.lst file, such as California, is annotated as a {Lookup}, similar to the way that the Tokenizer tags words as {Token}. In addition, every text string that gets annotated will also receive the majorType (e.g., location) and minorType (e.g., province) features specified in the lists.def file. California, for example, is annotated as a {Lookup}, and the lookup annotation will have a majorType feature “location” and a minorType feature “province”.

The person_ending entry in lists.def is: person_ending.lst:person_ending. In this case, any text in the document that matches an entry in the person_ending.lst file (such as Jr), will be annotated {Lookup} and that annotation will have majorType “person_ending”. Since no minorType is specified, none is given.
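
The two entries above can be pulled apart with a few lines of Python. This is only a sketch for illustration: parse_def_entry is a hypothetical helper, and it assumes an entry contains nothing beyond the file name, majorType and optional minorType described here.

```python
def parse_def_entry(entry):
    """Split a lists.def entry into file name, majorType and minorType."""
    parts = entry.strip().split(":")
    file_name = parts[0]
    major = parts[1] if len(parts) > 1 else None
    minor = parts[2] if len(parts) > 2 else None  # minorType is optional
    return {"file": file_name, "majorType": major, "minorType": minor}


print(parse_def_entry("province.lst:location:province"))
print(parse_def_entry("person_ending.lst:person_ending"))
```

The second call returns a minorType of None, mirroring the case where no minorType is given.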

The gazetteer phase annotates phrases in the target document that match phrases in the gazetteer lists. For example, both “tugboat” and “USS” would receive a {Lookup} annotation with majorType “vessel_prefix”, because those two terms appear in the vessel_prefix gazetteer list. It is important to note that NAG also takes advantage of the Gazetteer’s ability to append additional properties (called “features” in GATE) to an annotation. So, for example, the {Lookup} on “tugboat” has majorType “vessel_prefix” and also carries a weight feature, such as 50. As each rule processes a particular piece of text, these weights accumulate for each candidate entity type, with the highest scoring types being chosen as the phrase’s final type.
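
The accumulation idea can be sketched in Python as follows. This is a simplified model rather than NAG's actual rules: the SUPPORTS table mapping a majorType to an entity type, the accumulate function, and the weights in the example are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical mapping from a gazetteer majorType to the entity type it supports.
SUPPORTS = {
    "vessel_prefix": "Vessel",
    "facility_key": "Facility",
    "person_ending": "Person",
}


def accumulate(lookups):
    """Sum the weights of the lookups found before, after or inside one phrase.

    `lookups` is a list of (majorType, weight) pairs.
    """
    scores = defaultdict(int)
    for major_type, weight in lookups:
        entity_type = SUPPORTS.get(major_type)
        if entity_type is not None:  # ignore lists that support no type
            scores[entity_type] += weight
    return dict(scores)


# "the tugboat Clementine": a single pre-term lookup with weight 50.
print(accumulate([("vessel_prefix", 50)]))  # {'Vessel': 50}
```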

Gazetteer .lst files used by NAG

Person

jobtitles.lst
titles.lst
person_ending.lst
person_prekey.lst
person_postkey.lst

Organization

company_ending.lst
org_key.lst
org_prekey.lst

Facility

facility_key.lst

Location

loc_generalkey.lst
loc_postkey.lst
loc_prekey.lst
country.lst
province.lst
region.lst
mountain.lst
rivers.lst

Vessel

vessel_prefix.lst

NP Chunking

This phase consists of two rules. The first is a special case that deals with person surnames as recorded by the US military, while the second handles all other cases. The normal case looks like this:


(
(INITIALS)[0,3]
(PROPER)[0,3]
(PREFIX)?
(PROPER)[1,3]
(PERSONENDING)?
):person

(INITIALS)[0,3] matches if 0, 1, 2 or 3 initials are found. INITIALS is a macro. Macros let us write a fragment of a rule and reuse it in multiple places. In this case, INITIALS expands to:

(
{Token.orth == "upperInitial", Token.length == "1"}
({Token.string == "."})?
)

The first line of this macro states that we want to match a single (length = 1) uppercase (upperInitial) letter.[1] The second line matches a “.”. The question mark at the end indicates that this part is optional, that is, it will match either zero or one “.”. Also note that specifying a cardinality constraint such as ? or [0,3] requires an additional pair of parentheses around the part it applies to.

(PROPER)[0,3] matches 0 to 3 proper terms. Like INITIALS, PROPER is a macro, and it expands to:


(
{Token.orth == "upperInitial", Token.category =~ "N[N|NP|NPS|NS|P|PS]", Token.kind == "word", Token.length > 1, !Lookup.majorType == "company_ending"}|
{Token.orth == "allCaps", Token.category =~ "N[N|NP|NPS|NS|P|PS]", Token.kind == "word", Token.length > 1, Token.string != "AKA", !Lookup.majorType == "company_ending"}|
{Token.orth == "mixedCaps", Token.category =~ "N[N|NP|NPS|NS|P|PS]", Token.kind == "word", Token.length > 1, !Lookup.majorType == "company_ending"}
)

This macro consists of three constraints. The | symbol that separates the constraints signifies a logical OR. Consequently, this part of the rule will succeed if any one of these three constraints matches. You will note that each constraint consists of several components, separated by commas. The comma in this context signifies a logical AND. In order for one of these constraints to match, all of the conditions listed between the opening { and closing } must match. Looking at one of the constraints, we have already seen the orth and length features.

In the middle constraint, notice Token.string != "AKA". != simply means NOT equal. In fact, ! is used in other places to represent NOT, which we’ll see in a moment.

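The category feature adds a new twist. The =~ operator tests the feature value against a regular expression, and it succeeds if the expression is found anywhere in the value. In =~ "N[N|NP|NPS|NS|P|PS]", the square brackets form a character class, so the expression really means an N followed by an N, P, S or |. In practice this matches the categories NN, NNP, NNPS, NNS, NP and NPS. The meaning of these categories is described on the GATE website.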
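
A quick way to check which categories the expression accepts is to try it outside GATE. The sketch below uses Python's re module as a stand-in for the Java regular expressions that JAPE uses (for a pattern this simple the two behave the same), with re.search playing the role of a partial match. The list of tags is just a small sample chosen for illustration.

```python
import re

# The pattern exactly as written in the PROPER macro.
AS_WRITTEN = re.compile(r"N[N|NP|NPS|NS|P|PS]")

# An explicit alternative that spells out the intended categories.
GROUPED = re.compile(r"N(N|NP|NPS|NS|P|PS)")

tags = ["NN", "NNP", "NNPS", "NNS", "NP", "NPS", "VB", "DT", "IN"]

# Partial match: succeeds if the pattern occurs anywhere in the tag.
print([t for t in tags if AS_WRITTEN.search(t)])
# ['NN', 'NNP', 'NNPS', 'NNS', 'NP', 'NPS']

# Whole-string match against the grouped form gives the same answer here.
print([t for t in tags if GROUPED.fullmatch(t)])
# ['NN', 'NNP', 'NNPS', 'NNS', 'NP', 'NPS']
```

Both forms accept the same noun categories from this sample, which is why the bracketed version works in practice even though it is looser than it looks.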

Now let’s turn our attention to the {Lookup} annotations added by the gazetteer. We previously explained how majorType and minorType features are specified in the Gazetteer’s lists.def file; now we use these features in the constraint !Lookup.majorType == "company_ending". As we noted earlier, the exclamation point at the beginning specifies NOT. In other words, this part of the constraint succeeds only as long as a {Lookup} with majorType "company_ending" is not found along with the other conditions.
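
Putting the pieces together, the logic of PROPER (OR between the three constraints, AND within each, NOT for the lookup) can be mirrored in plain Python. This is only an analogue to help read the macro, not a replacement for it: the is_proper function and the dictionary layout used for a token are hypothetical, and the conditions shared by all three constraints have been factored out.

```python
import re

NOUN = re.compile(r"N[N|NP|NPS|NS|P|PS]")


def is_proper(token):
    """Python analogue of the PROPER macro for a single token.

    `token` is a dict with keys string, orth, category, kind and
    lookups (a set of gazetteer majorTypes covering the token).
    """
    # Conditions that appear in all three constraints (logical AND).
    common = (
        NOUN.search(token["category"]) is not None
        and token["kind"] == "word"
        and len(token["string"]) > 1
        and "company_ending" not in token["lookups"]  # the ! (NOT) part
    )
    # The three alternatives separated by | (logical OR).
    upper_initial = token["orth"] == "upperInitial"
    all_caps = token["orth"] == "allCaps" and token["string"] != "AKA"
    mixed_caps = token["orth"] == "mixedCaps"
    return common and (upper_initial or all_caps or mixed_caps)


washington = {"string": "Washington", "orth": "upperInitial",
              "category": "NNP", "kind": "word", "lookups": set()}
print(is_proper(washington))  # True
```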

Fortunately, this is about as difficult as the LHS rules tend to get in NAG. The next part of the LHS rule is:

(PREFIX)?

As we’ve already seen, the ? simply means optional, and is equivalent to (PREFIX)[0,1].

(PROPER)[1,3]

This is the same as the previous use of PROPER, described above, except that the minimum number of matches is 1. Up until now, the other parts of the rule have been optional. This is the first part of the rule where something MUST match. Otherwise, the internals are the same as before.

Finally, we come to:

(PERSONENDING)?

Again, it’s optional, and the expansion is equally trivial:

(
{Lookup.majorType == "person_ending"}
)

The military rule looks like this:

(
(INITIALS)[0,3]
(PROPER)[0,3]
(PREFIX)?
(MILITARYNAME)[1,3]
(PERSONENDING)?
):person

You will notice that it is almost identical to the normal case, except for the (MILITARYNAME) macro replacing the second instance of (PROPER). The military name macro is much simpler than (PROPER):

(
(
{Token.string == "("}
{Token.string == "("}
)
({Token.string != ")"})+
(
{Token.string == ")"}
{Token.string == ")"}
)
)

This macro simply matches two consecutive left parentheses “((”, followed by one or more (indicated by the + sign) tokens that are not a right parenthesis, and ending with two right parentheses. In typical use, a name recorded in this way would look like this: George ((Washington)). In the military context, the double parentheses are unambiguous, and in NAG they immediately result in the creation of a Person, bypassing the normal weighting mechanism used for everything else.
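
For comparison, the same double-parenthesis pattern can be written as a small scan over a list of token strings in Python. This is a sketch for illustration only: find_military_names is a hypothetical helper, and it assumes the text has already been split so that each parenthesis is its own token.

```python
def find_military_names(tokens):
    """Return the token runs wrapped in double parentheses."""
    names = []
    i = 0
    while i + 1 < len(tokens):
        if tokens[i] == "(" and tokens[i + 1] == "(":
            # Consume tokens until the first right parenthesis.
            j = i + 2
            while j < len(tokens) and tokens[j] != ")":
                j += 1
            # Require at least one inner token and two closing parentheses.
            if j > i + 2 and j + 1 < len(tokens) and tokens[j + 1] == ")":
                names.append(tokens[i + 2:j])
                i = j + 2
                continue
        i += 1
    return names


print(find_military_names(["George", "(", "(", "Washington", ")", ")"]))
# [['Washington']]
```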


[1] A single uppercase letter has orth=upperInitial, not orth=allCaps.

About jeffmershon

Director of Program Management at SiriusXM.

2 Responses to A Different Approach to Named Entity Recognition Using ANNIE in GATE

  1. Sheshadri says:

    Hey Jeff! its Baba.
    Have you looked at Apache UIMA and uima-nerc ?
    I did some prototyping with these (mostly UIMA) for date/time recognition. It was cool and fun!
    b-

    • jeffmershon says:

      Baba! It’s good you told me it was you and not the *other* Sheshadri Mantha 😉

      I first looked at UIMA a long time ago, when it was under IBM. At the time, it was just a framework, with no guts, and it was just ugly. I looked at it again before settling on GATE, but as you know, my java skills are quite limited, and I found that I could get something up and running with GATE in short order without much nonsense using just TextPad. They had lots of examples that I could use. Anyway, GATE does have its issues, and it looks like the UIMA site is beefing up, so maybe I’ll give it another look next time I have something to build. I’m considering releasing this NER on Git, so we’ll see.
