Named Entity Recognition using ANNIE in GATE Part 2

In a previous post, I introduced a new named-entity recognizer, called NAG, that replaced many of the named entity modules that ship with ANNIE. That post discussed the gazetteer portion of NAG as well as the rules to identify (but not classify) named-entities. In this post, I’ll introduce the classification rules themselves.

NAG was built to be a more flexible and extensible alternative to the approach used in the default ANNIE named-entity recognizer. To achieve these goals, NAG combines a rule-based approach with elements of machine-learning approaches.

Before looking at the rules, we first need to look at the different types of entities that NAG recognizes. Because NAG stores weights in an array, we need to agree on which array element corresponds to which entity type. These are:

/*
0 = Person
1 = Organization
2 = Facility
3 = Vessel
4 = Location
5 = Vehicle
6 = Street
*/

NAG was built to accommodate 20 different types. More could be supported, but that would require some effort to modify all of the existing rules.
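The slot numbers above can be given names so that a rule does not have to hard-code 0 or 3. This is only a sketch: NagEntityType, SLOT_COUNT and emptyWeights are hypothetical names of mine and are not part of NAG or GATE.

```java
import java.util.Arrays;

// Hypothetical helper (not part of NAG): names for the weight-array slots listed above.
enum NagEntityType {
    PERSON(0), ORGANIZATION(1), FACILITY(2), VESSEL(3), LOCATION(4), VEHICLE(5), STREET(6);

    // NAG reserves 20 slots; only the first seven are assigned so far.
    static final int SLOT_COUNT = 20;

    final int index;

    NagEntityType(int index) {
        this.index = index;
    }

    // Builds the all-zero starting array that each rule otherwise declares inline.
    static String[] emptyWeights() {
        String[] wgts = new String[SLOT_COUNT];
        Arrays.fill(wgts, "0");
        return wgts;
    }
}
```

With something like this in place, a vessel rule could refer to NagEntityType.VESSEL.index instead of the literal 3.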

There are essentially three types of rules: “pre” rules, “post” rules and “embed” rules, which correspond to the location of the markers, or clues, we are looking for to help us recognize an entity. The basic idea is that some words only appear before people, while others only appear before organizations, etc. Below is an example of a pre rule for a person. This rule looks for either a person_prekey or people_relation annotation, followed by a CoreName (i.e., something that looks like a named entity but we’re not sure what it is yet). A person_prekey is an annotation that is defined in the Gazetteer. See my previous post on NAG for an overview.

The left sides of the rules (everything up to the -->) always differ, but the format of the right side is generally the same, at least within each rule type.

Rule: PersonPreKey
(
	({Lookup.majorType == "person_prekey"}|{Lookup.majorType == "people_relation"}):pre
	({Token.string == ","})?
)
({CoreName}):person
-->
{
	String wgtStr;
	int wgt = 50; //The wgt value is a measure of how definitive this rule is in determining which type of entity something is. 
	String[] wgts = {"0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0"}; 
	String wgtTmp;
	//define a new feature map
	gate.FeatureMap features = Factory.newFeatureMap();
	//build an AnnotationSet and Annotation for the :pre label (look at the LH side of the rule, above)
	gate.AnnotationSet personAS = (gate.AnnotationSet)bindings.get("pre");
	gate.Annotation personAnn = (gate.Annotation)personAS.iterator().next();
	//look for a weight feature and add it to the weight for this rule (wgt).
	if (personAnn.getFeatures().get("weight") != null) { 
		try {
			wgt += Integer.parseInt(personAnn.getFeatures().get("weight").toString().trim());
		} catch (NumberFormatException e) {
			//wgt = 0;
		}
	}
	//build an AnnotationSet and Annotation for the :person label
	personAS = (gate.AnnotationSet)bindings.get("person");
	personAnn = (gate.Annotation)personAS.iterator().next();
	//get the "rule" feature and append this rule's name (PersonPreKey) to the end of it. Then add it to the new feature set.
	if (personAnn.getFeatures().get("rule") != null) {
		features.put("rule", personAnn.getFeatures().get("rule").toString().trim() + ",PersonPreKey");
	} else {
		features.put("rule", "PersonPreKey");
	}
	//get the weights array, split it up, and then add to it the wgt value we computed.
	if (personAnn.getFeatures().get("wgts") != null) {
		wgts = personAnn.getFeatures().get("wgts").toString().split(",");
		//we have to clean up array items 0 (the first) and 19 (the last): the wgts feature is stored as a List, so toString() wraps the values in [ and ].
		if (wgts[0] != null) {
			wgtTmp = wgts[0].trim();
			wgts[0] = wgtTmp.substring(1);
		}
		if (wgts[19] != null) {
			wgtTmp = wgts[19].trim();
			wgts[19] = wgtTmp.substring(0,wgtTmp.length()-1);
		}
		//This is where we add the wgt value of this rule to the appropriate array element. In this case, we're updating element 0, because this rule is a person rule.
		if (wgts[0] != null) {
			try {
				wgt += Integer.parseInt(wgts[0].trim());
			} catch (NumberFormatException e) {
				//wgt = 0;
			}
		}
	}
	wgts[0] = Integer.toString(wgt);
	features.put("wgts", Arrays.asList(wgts));
	//If we updated a weight then write out a new annotation.
	if (wgt > 0) {
		outputAS.add(personAS.firstNode(), personAS.lastNode(), "CoreName", features);
		outputAS.removeAll(personAS);	
	}
}
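The clean-up of elements 0 and 19 in the rule above is needed because the weights are stored with Arrays.asList(wgts), and calling toString() on a List produces text like "[50, 0, 0]", brackets included. A minimal sketch of that round trip outside GATE follows. WeightStringDemo and parse are hypothetical names, and treating unparsable entries as zero is my assumption rather than something the rule does for every slot.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper illustrating why the first and last elements need cleaning.
final class WeightStringDemo {
    // Parses the text produced by List.toString(), e.g. "[50, 0, 0]", into ints.
    static int[] parse(String text) {
        String body = text.trim();
        // Strip the surrounding brackets that List.toString() adds.
        if (body.startsWith("[") && body.endsWith("]")) {
            body = body.substring(1, body.length() - 1);
        }
        String[] parts = body.split(",");
        int[] out = new int[parts.length];
        for (int i = 0; i < parts.length; i++) {
            try {
                out[i] = Integer.parseInt(parts[i].trim());
            } catch (NumberFormatException e) {
                out[i] = 0; // assumption of this sketch: unparsable entries count as zero
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> stored = Arrays.asList("50", "0", "0");
        System.out.println(stored.toString());
        System.out.println(Arrays.toString(parse(stored.toString())));
    }
}
```

Stripping both brackets in one place would remove the need to special-case the first and last elements in every rule.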

Below are the left hand sides of a few other pre rules. The right hand side, with the exception of the rule’s name, is the same as above.

Rule: PersonPreTitle
({Title})[1,3]:pre
({CoreName}):person
-->

Rule: PersonPreJobTitle
(
	({JobTitle}):pre
	({Token.string == ","})?
)
({CoreName}):person
-->

Rule: PersonPreRelation
(
	({Lookup.majorType == "people_relation"}):pre
	({Token.string == ","})?
)
({CoreName}):person
-->

Below is a pre rule for vessel.

Rule: VesselPre
({Lookup.majorType == "vessel_prefix"}):pre
({CoreName}):vessel
-->

The right hand side differs in one section, where the weight is added.

		if (wgts[3] != null) {
			try {
				wgt += Integer.parseInt(wgts[3].trim());
			} catch (NumberFormatException e) {
				//wgt = 0;
			}
		}
	}
	wgts[3] = Integer.toString(wgt);

Note that the code is exactly the same with the exception of the array element being accessed (3 for a vessel instead of 0 for a person).

Hopefully, you can see that adding new rules is pretty easy: just write the left hand side and make a few modifications to the right. Much of the work is done by building the appropriate gazetteer files.
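Since the person and vessel versions differ only in the array index, the update step could be written once with the slot passed in. The following is a rough sketch in plain Java, not the code NAG actually uses; WeightUpdateDemo and addWeight are hypothetical names.

```java
// Hypothetical helper: the per-type weight update with the slot index passed in.
final class WeightUpdateDemo {
    // Adds ruleWeight to the existing value in wgts[slot], stores and returns the total.
    static int addWeight(String[] wgts, int slot, int ruleWeight) {
        int total = ruleWeight;
        if (wgts[slot] != null) {
            try {
                total += Integer.parseInt(wgts[slot].trim());
            } catch (NumberFormatException e) {
                // Keep only the rule's own weight, mirroring the empty catch in the rules.
            }
        }
        wgts[slot] = Integer.toString(total);
        return total;
    }

    public static void main(String[] args) {
        String[] wgts = {"0", "0", "0", " 25", "0"};
        // Slot 3 is the vessel slot; the rule weight of 50 is added to the stored 25.
        System.out.println(addWeight(wgts, 3, 50));
    }
}
```

A person rule would call it with slot 0 and a vessel rule with slot 3, leaving the rest of the right hand side untouched.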

A typical post rule uses the same right hand code. Just the left side is different.

({CoreName}):person
({Token.string == ","})?
({Lookup.majorType == "person_postkey"}):post
-->

The right sides of the embed rules are a little more complicated, but the core of the right hand logic is the same. The left sides are even simpler than those of the pre or post rules. Below is a typical rule. Note that only a few lines are really different. The key difference is that, whereas the pre and post rules look for things that come either before or after a CoreName, embed rules look for things that are part of, or embedded within, the CoreName. This can happen for cases like person titles, such as President Barack Obama, where President Barack Obama is annotated as a CoreName and President is annotated as a JobTitle.
Let’s try to picture this case using XML:
President Barack Obama
This could also be expressed as:
President Barack Obama
You can visually see that the JobTitle and CoreName are commingled, but they may not be perfectly nested. You’ll see in a minute how we check for the possible cases.

Rule:	PersonTitleEmbed
(
{Title}
):person
-->  
{
	int wgt = 150;
	String[] wgts = {"0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0"};
	String wgtTmp;
	gate.FeatureMap features = Factory.newFeatureMap();
	gate.AnnotationSet personAS = (gate.AnnotationSet)bindings.get("person");
	//gate.Annotation personAN = (gate.Annotation)personAS.iterator().next();
	//this next line tries to find a CoreName annotation that overlaps the :person annotation found in the left side of the rule. This particular way of looking for CoreName handles all overlapping cases that might occur.
	gate.AnnotationSet XtagAS = inputAS.get("CoreName", personAS.firstNode().getOffset(), personAS.lastNode().getOffset());
	if (XtagAS != null && XtagAS.firstNode() != null && XtagAS.lastNode() != null) { //this check is to make sure we really did find something.
		gate.Annotation XtagAN = (gate.Annotation)XtagAS.iterator().next();
		//this next section is exactly the same as for the pre and post rules.
		if (XtagAN.getFeatures().get("rule") != null) {
			features.put("rule", XtagAN.getFeatures().get("rule").toString().trim() + ",PersonTitleEmbed");
		} else {
			features.put("rule", "PersonTitleEmbed");
		}
		if (XtagAN.getFeatures().get("wgts") != null) {
			wgts = XtagAN.getFeatures().get("wgts").toString().split(",");
			if (wgts[0] != null) {
				wgtTmp = wgts[0].trim();
				wgts[0] = wgtTmp.substring(1);
			}
			if (wgts[19] != null) {
				wgtTmp = wgts[19].trim();
				wgts[19] = wgtTmp.substring(0,wgtTmp.length()-1);
			}
			if (wgts[0] != null) {
				try {
					wgt += Integer.parseInt(wgts[0].trim());
				} catch (NumberFormatException e) {
					//wgt = 0;
				}
			}	
		}
		wgts[0] = Integer.toString(wgt);
		features.put("wgts", Arrays.asList(wgts));
		if (wgt > 0) { 
			//This next section is a little more complicated, because we need to figure out where, precisely, to add the new annotation, which depends on where we found the CoreName annotation to begin with.
			if ((personAS.firstNode().getOffset() <= XtagAS.firstNode().getOffset()) && (personAS.lastNode().getOffset() >= XtagAS.firstNode().getOffset()) && (personAS.lastNode().getOffset() < XtagAS.lastNode().getOffset())) {
				outputAS.add(personAS.lastNode(),XtagAS.lastNode(),"CoreName",features);
				outputAS.removeAll(XtagAS);
				outputAS.removeAll(personAS);
			}
		}
	}
}

The core of this rule is the same as for the pre and post cases.
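The offset test in the embed rule can be tried outside GATE with plain numbers. This is a minimal sketch, assuming the three comparisons are <=, >= and < in that order: the title starts at or before the CoreName, ends inside it, and the new CoreName runs from the end of the title to the end of the old CoreName. EmbedSpanDemo and trimmedSpan are hypothetical names.

```java
// Hypothetical stand-alone version of the offset test, using plain longs instead of GATE nodes.
final class EmbedSpanDemo {
    // Returns {start, end} of the CoreName to re-create, or null when the test fails.
    static long[] trimmedSpan(long titleStart, long titleEnd, long coreStart, long coreEnd) {
        boolean titleLeadsIn = titleStart <= coreStart
                && titleEnd >= coreStart
                && titleEnd < coreEnd;
        return titleLeadsIn ? new long[] {titleEnd, coreEnd} : null;
    }

    public static void main(String[] args) {
        // "President Barack Obama": Title covers offsets 0-9, CoreName covers 0-22.
        long[] span = trimmedSpan(0, 9, 0, 22);
        System.out.println(span[0] + "-" + span[1]);
    }
}
```

A title that sits entirely after the CoreName fails the first comparison, so no new annotation would be written for it.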

At the end of it all, we want to convert CoreName annotations to Person, Organization, etc. That is accomplished by a single rule.

Rule:	ConvertCoreName
(
{CoreName}
):core
-->  
{
	int hwm = 0; //high water mark
	String[] wgts = {"0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0"};
	int wgt = 0;
	String wgtTmp;
	gate.FeatureMap features = Factory.newFeatureMap();
	gate.AnnotationSet coreAS = (gate.AnnotationSet)bindings.get("core");
	gate.Annotation coreAN = (gate.Annotation)coreAS.iterator().next();
	if (coreAN.getFeatures().get("rule") != null) {
		features.put("rule", coreAN.getFeatures().get("rule").toString().trim() + ",ConvertCoreName");
	} else {
		features.put("rule", "ConvertCoreName");
	}
	if (coreAN.getFeatures().get("type") != null) {
		features.put("type", coreAN.getFeatures().get("type").toString().trim());
	}
	if (coreAN.getFeatures().get("wgts") != null) {
		wgts = coreAN.getFeatures().get("wgts").toString().split(",");
		if (wgts[0] != null) {
			wgtTmp = wgts[0];
			wgts[0] = wgtTmp.substring(1);
		}
		if (wgts[19] != null) {
			wgtTmp = wgts[19];
			wgts[19] = wgtTmp.substring(0,wgtTmp.length()-1);
		}
		//this section tries to find the high water mark (hwm).  
		for (int i=0;i<wgts.length;i++) {
			try {
				wgt = Integer.parseInt(wgts[i].trim());
				if (wgt > hwm) {
					hwm = wgt;
				}
			} catch (NumberFormatException e) {
			}
		}
	}
	features.put("wgts", Arrays.asList(wgts));
	if (hwm > 0) {
		try {
			wgt = Integer.parseInt(wgts[0].trim());
			if (wgt == hwm) {
				outputAS.add(coreAS.firstNode(),coreAS.lastNode(),"Person",features);
			}
		} catch (NumberFormatException e) {
		}
		try {
			wgt = Integer.parseInt(wgts[1].trim());
			if (wgt == hwm) {
				outputAS.add(coreAS.firstNode(),coreAS.lastNode(),"Organization",features);
			}
		} catch (NumberFormatException e) {
		}
		try {
			wgt = Integer.parseInt(wgts[2].trim());
			if (wgt == hwm) {
				outputAS.add(coreAS.firstNode(),coreAS.lastNode(),"Facility",features);
			}
		} catch (NumberFormatException e) {
		}
		try {
			wgt = Integer.parseInt(wgts[3].trim());
			if (wgt == hwm) {
				outputAS.add(coreAS.firstNode(),coreAS.lastNode(),"Vessel",features);
			}
		} catch (NumberFormatException e) {
		}
		try {
			wgt = Integer.parseInt(wgts[4].trim());
			if (wgt == hwm) {
				outputAS.add(coreAS.firstNode(),coreAS.lastNode(),"Location",features);
			}
		} catch (NumberFormatException e) {
		}
		try {
			wgt = Integer.parseInt(wgts[5].trim());
			if (wgt == hwm) {
				outputAS.add(coreAS.firstNode(),coreAS.lastNode(),"Vehicle",features);
			}
		} catch (NumberFormatException e) {
		}
		try {
			wgt = Integer.parseInt(wgts[6].trim());
			if (wgt == hwm) {
				outputAS.add(coreAS.firstNode(),coreAS.lastNode(),"Street",features);
			}
		} catch (NumberFormatException e) {
		}
		outputAS.removeAll(coreAS);
	}
	//Note: if hwm = 0, then we leave the incoming CoreName annotation intact.
}

The high water mark (hwm) is the highest score achieved by any type. Note that it is possible for multiple types to share the hwm. If that happens, one annotation is added for each type. It is possible, therefore, for a named entity to be transformed by NAG into both a person and a vessel, etc. In practice, it is hoped that most cases will resolve to a single type. It may be necessary to adjust certain weights to achieve the desired outcome. Hopefully, this can be achieved through training and careful rule curation.
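The tie behaviour is easier to see in a loop than in seven copies of the same try block. Below is a sketch of the selection logic only, with no GATE calls. HighWaterMarkDemo, TYPE_NAMES and winners are hypothetical names, and the weights are assumed to be already parsed into ints.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the high water mark selection, outside GATE.
final class HighWaterMarkDemo {
    // Names for slots 0 to 6, in the order given at the top of the post.
    static final String[] TYPE_NAMES = {
        "Person", "Organization", "Facility", "Vessel", "Location", "Vehicle", "Street"
    };

    // Returns every named type whose weight equals the highest weight in the array.
    static List<String> winners(int[] weights) {
        int hwm = 0;
        for (int w : weights) {
            hwm = Math.max(hwm, w);
        }
        List<String> out = new ArrayList<>();
        if (hwm == 0) {
            return out; // nothing scored, so the CoreName would be left as it is
        }
        for (int i = 0; i < TYPE_NAMES.length && i < weights.length; i++) {
            if (weights[i] == hwm) {
                out.add(TYPE_NAMES[i]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Person and Vessel tie, so both types are reported.
        System.out.println(winners(new int[] {200, 0, 0, 200, 50, 0, 0}));
    }
}
```

As in the rule, the high water mark is taken over every slot, but only the seven named slots can produce an annotation.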

Finally, I want to automate the task of creating suitable gazetteer entries, based on previously processed data. The following rule writes out every organization in the text, along with the two preceding and trailing terms.

Rule: OrgProps
(
	({Token})? :pre1
	({Token})? :pre2
	({Organization}):entity
	({Token})? :post1
	({Token})? :post2
)
-->
{
	gate.AnnotationSet personAS;
	gate.Annotation personAnn;
	int start_offset = 0;
	int end_offset = 0;
	String outStr1 = "";
	String outStr2 = "";
	String outStr3 = "";
	String outStr4 = "";
	String outStr5 = "";
	
	personAS = (gate.AnnotationSet)bindings.get("pre1");
	if (personAS != null) {
		personAnn = (gate.Annotation)personAS.iterator().next();
		start_offset = personAnn.getStartNode().getOffset().intValue();
		end_offset = personAnn.getEndNode().getOffset().intValue();
		outStr1 = doc.getContent().toString().substring(start_offset, end_offset);
        }
	personAS = (gate.AnnotationSet)bindings.get("pre2");
	if (personAS != null) {
		personAnn = (gate.Annotation)personAS.iterator().next();
		start_offset = personAnn.getStartNode().getOffset().intValue();
		end_offset = personAnn.getEndNode().getOffset().intValue();
		outStr2 = doc.getContent().toString().substring(start_offset, end_offset);
        }
	personAS = (gate.AnnotationSet)bindings.get("entity");
	if (personAS != null) {
		personAnn = (gate.Annotation)personAS.iterator().next();
		start_offset = personAnn.getStartNode().getOffset().intValue();
		end_offset = personAnn.getEndNode().getOffset().intValue();
		outStr3 = doc.getContent().toString().substring(start_offset, end_offset);
        }
	personAS = (gate.AnnotationSet)bindings.get("post1");
	if (personAS != null) {
		personAnn = (gate.Annotation)personAS.iterator().next();
		start_offset = personAnn.getStartNode().getOffset().intValue();
		end_offset = personAnn.getEndNode().getOffset().intValue();
		outStr4 = doc.getContent().toString().substring(start_offset, end_offset);
        }
	personAS = (gate.AnnotationSet)bindings.get("post2");
	if (personAS != null) {
		personAnn = (gate.Annotation)personAS.iterator().next();
		start_offset = personAnn.getStartNode().getOffset().intValue();
		end_offset = personAnn.getEndNode().getOffset().intValue();
		outStr5 = doc.getContent().toString().substring(start_offset, end_offset);
        }
        System.out.println("Organization" + "`" + outStr1 + "`" + outStr2 + "`" + outStr3 + "`" + outStr4 + "`" + outStr5);
}

Note that the output uses the ` (backtick) as a field separator. A tab, comma, etc. could also be used, but I’ve found the backtick to be a good choice because it is so rarely used elsewhere. The output of this step will get fed into another system for weighting. More on that in a later post.
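Reading those lines back is a one-liner in Java. The sketch below is only an illustration of the format: OrgPropsLineDemo and parseLine are hypothetical names, and the sample line is made up rather than taken from real output.

```java
// Hypothetical reader for one line of the OrgProps output.
final class OrgPropsLineDemo {
    // Splits a line into: type, pre1, pre2, entity, post1, post2.
    static String[] parseLine(String line) {
        // A limit of -1 keeps trailing empty fields, e.g. when post2 did not match.
        return line.split("`", -1);
    }

    public static void main(String[] args) {
        // Made-up sample line with an empty post2 field.
        String[] fields = parseLine("Organization`at`the`Acme Corp`said`");
        System.out.println(fields.length + " fields, entity = " + fields[3]);
    }
}
```

Without the -1 limit, String.split drops trailing empty strings, so a line whose last optional token did not match would come back with fewer fields.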

About jeffmershon

Director of Program Management at SiriusXM.