Utility Scripts for Named-Entity Recognizer (follow-up post)

In a previous post and its follow-up, I described an approach to using ANNIE for named-entity recognition that combines a traditional rule set with machine learning. To complete the series, here are the Perl scripts I use to build the recognition lists for each kind of entity. The gazetteer uses these lists to identify potential named entities.


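# This program takes an input file in the OpenNLP named entity training data format
# and outputs a training file in the NAG format.
use strict;
use warnings;

# The input file name is taken from the command line.
my $infile = shift @ARGV or die "Usage: $0 <OpenNLP training file>\n";

open(MYIN, "<", $infile) or die "Can't open input file: $!";
open(MYOUT, ">", "outOpenNLP.txt") or die "Can't open output file: $!";

my $line;
LINE: while ($line = <MYIN>) {
    chomp $line;
    # Capture up to two words before the entity, the entity itself,
    # and up to two words after it.
    if ($line =~ m/(\S*)\s*(\S*)\s*<START:person>(.*)<END>\s*(\S*)\s*(\S*)/) {
        print MYOUT "Person`$1`$2`$3`$4`$5\n";
    }
}

close MYIN;
close MYOUT;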

The script above takes a training set from OpenNLP and converts it into the same backtick-delimited format that we would output ourselves.
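To make the conversion concrete, here is a minimal sketch of the same idea as a standalone function. It assumes the input uses OpenNLP's `<START:person> ... <END>` markup and that the target is the backtick-delimited record printed by the script above. The function name `to_nag` and the sample sentence are made up for illustration, and unlike the script, this sketch trims the spaces around the captured name.

```perl
use strict;
use warnings;

# Hypothetical helper: convert one OpenNLP-style training line into a
# backtick-delimited record (type, two words before, name, two words after).
# Returns undef in scalar context if the line has no person annotation.
sub to_nag {
    my ($line) = @_;
    if ($line =~ m/(\S*)\s*(\S*)\s*<START:person>\s*(.*?)\s*<END>\s*(\S*)\s*(\S*)/) {
        return join '`', 'Person', $1, $2, $3, $4, $5;
    }
    return;
}

# Made-up sample sentence, not taken from any real training set.
my $sample = 'Yesterday we met <START:person> Jane Doe <END> in Boston .';
print to_nag($sample), "\n";
```

Only the two words on either side of the entity survive the conversion, which is all the later scripts need to build the context lists.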


use strict;
use warnings;
use DBI;

sub trim($);

my $lineout = "";
my @fields;
my $count = 0;

my $username = 'root';
my $password = 'iam!root';
my $database = 'nag';
my $hostname = '';
my $dbh;
my $SQL;
my $UpdateRecord;
my $Select;
my $Row;
my $InsertRecord;
my $CreateTable;
my $PreTable = "person_pre";
my $PostTable = "person_post";
my $fieldVal;

$dbh = DBI->connect("dbi:mysql:database=$database;host=$hostname;port=3306", $username, $password) or die $DBI::errstr;

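# The input file name is taken from the command line.
my $infile = shift @ARGV or die "Usage: $0 <NAG-format input file>\n";

open(MYIN, "<", $infile) or die "Can't open input file: $!";
open(MYOUTPRE, ">", "Pre.txt") or die "Can't open output file: $!";
open(MYOUTPOST, ">", "Post.txt") or die "Can't open output file: $!";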

# These two lines build or clear the PreTable. The first time a database is set up,
# use the create statement; on later runs, just clear the table.
#$SQL = "create table $PreTable(field text not null, count integer)";
$SQL = "delete from $PreTable";

$CreateTable = $dbh->do($SQL);

if ($CreateTable) {
    print "Pre Table Initialized\n";
}
else {
    die("Pre Table Initialize Failed $DBI::errstr");
}

# These two lines build or clear the PostTable in the same way.
#$SQL = "create table $PostTable(field text not null, count integer)";
$SQL = "delete from $PostTable";

$CreateTable = $dbh->do($SQL);

if ($CreateTable) {
    print "Post Table Initialized\n";
}
else {
    die("Post Table Initialize Failed $DBI::errstr");
}
# Increment the count for $value in $table, or insert it with a count of 1
# if it has not been seen before.
sub bump_count {
    my ($table, $value) = @_;
    my $row = $dbh->selectrow_hashref("select * from $table where field = ?", undef, $value);
    if ($row) {
        $dbh->do("update $table set count = ? where field = ?", undef, $row->{count} + 1, $value)
            or die("Failure $DBI::errstr $value");
    } else {
        $dbh->do("insert into $table (field, count) values (?, 1)", undef, $value)
            or die("Failure $DBI::errstr $value");
    }
}

my $line;
# Start processing the input data
while ($line = <MYIN>) {
    chomp $line;
    @fields = split(/`/, $line);
    do_line();

    # Fields 1 and 2 hold the words before the entity.
    $fieldVal = join ' ', grep { defined($_) && length($_) } @fields[1, 2];
    bump_count($PreTable, $fieldVal) if length $fieldVal;

    # Fields 4 and 5 hold the words after the entity.
    $fieldVal = join ' ', grep { defined($_) && length($_) } @fields[4, 5];
    bump_count($PostTable, $fieldVal) if length $fieldVal;
}

# Now we want to print out the contents of the Pre and Post tables.
# Start by printing the PreTable

$SQL = "select * from $PreTable order by count desc";

$Select = $dbh->prepare($SQL);
$Select->execute();

while ($Row = $Select->fetchrow_hashref)
{
    print MYOUTPRE "$Row->{field}\tweight=$Row->{count}\n";
}

# Now print out the PostTable

$SQL = "select * from $PostTable order by count desc";

$Select = $dbh->prepare($SQL);
$Select->execute();

while ($Row = $Select->fetchrow_hashref)
{
    print MYOUTPOST "$Row->{field}\tweight=$Row->{count}\n";
}

print "Closing Input and Output Files";
close MYIN;
close MYOUTPRE;
close MYOUTPOST;

sub trim($)
{
    my $string = shift;
    $string =~ s/^\s+//;
    $string =~ s/\s+$//;
    return $string;
}

# Strip everything except letters and digits from each field.
sub do_line
{
    for (my $i = 0; $i < @fields; $i++) {
        $fields[$i] =~ s/[^A-Za-z0-9]+//g;
        # print $out "\"" . $fields[$i]."\"" ."\t";
    }
}

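The routine above creates the recognition lists from that output file. For every record it counts the words that appear just before and just after the entity, then writes each list, most frequent first, with the count as the weight.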
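The counting step can also be sketched without a database. The following is only an illustration, assuming the backtick-delimited record format used above; it keeps the counts in hashes instead of the MySQL tables, and the function name `count_contexts` and the sample records are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical in-memory variant: count the context that appears before
# and after each entity, using hashes in place of the MySQL tables.
sub count_contexts {
    my (@lines) = @_;
    my (%pre, %post);
    for my $line (@lines) {
        my @f = split /`/, $line;
        # Strip everything but letters and digits, as do_line does.
        s/[^A-Za-z0-9]+//g for @f;
        my $before = join ' ', grep { defined($_) && length($_) } @f[1, 2];
        my $after  = join ' ', grep { defined($_) && length($_) } @f[4, 5];
        $pre{$before}++ if length $before;
        $post{$after}++ if length $after;
    }
    return (\%pre, \%post);
}

# Made-up records in the backtick-delimited format.
my ($pre, $post) = count_contexts(
    'Person`we`met`Jane Doe`in`Boston',
    'Person`we`met`John Roe`at`noon',
);

# Print the pre list in the "phrase<TAB>weight=N" form, most frequent first.
for my $phrase (sort { $pre->{$b} <=> $pre->{$a} || $a cmp $b } keys %$pre) {
    print "$phrase\tweight=$pre->{$phrase}\n";
}
```

A hash works for a training set that fits in memory; the database version keeps the counts around between runs.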


use strict;
use warnings;
use DBI;
my $username = 'root';
my $password = 'iam!root';
my $database = 'nag';
my $hostname = '';
my $PreTable = "person_pre";
my $PostTable = "person_post";
my $dbh;
my $SQL;
my $Select;
my $Select2;
my $Row;
my $Row2;

$dbh = DBI->connect("dbi:mysql:database=$database;host=$hostname;port=3306", $username, $password) or die $DBI::errstr;

$SQL = "select * from $PreTable order by count desc";
print "Comparing $PreTable AND $PostTable\n";

$Select = $dbh->prepare($SQL);
$Select->execute();

# Prepare the lookup once and bind the field value on each pass.
$Select2 = $dbh->prepare("select * from $PostTable where field = ?");

while ($Row = $Select->fetchrow_hashref)
{
    $Select2->execute($Row->{field});
    if ($Row2 = $Select2->fetchrow_hashref) {
        # We have a match. Print it out.
        print "HIT $Row->{field}\n";
    } else {
        #print "miss $Row->{field}\n";
    }
    $Select2->finish;
}

print "Compare Complete\n";

This last utility compares the pre and post tables to show how disjoint they are. The fewer duplicates, the better.
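The same overlap check can be sketched on two plain lists. This is only an illustration under the assumption that the pre and post phrases are already loaded into arrays; the function name `overlap` and the example phrases are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: return the phrases that occur in both lists, sorted.
sub overlap {
    my ($pre, $post) = @_;
    my %in_post = map { $_ => 1 } @$post;
    my @common = grep { $in_post{$_} } @$pre;
    @common = sort @common;
    return @common;
}

# Made-up example lists.
my @pre  = ('we met', 'said', 'by');
my @post = ('in Boston', 'said', 'at noon');

my @hits = overlap(\@pre, \@post);
print "HIT $_\n" for @hits;
printf "%d of %d pre phrases also appear in the post list\n", scalar @hits, scalar @pre;
```

A phrase that shows up on both sides of an entity says little about where the entity starts or ends, which is why a small overlap is the goal.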

That’s it. If you understand ANNIE, then you should be able to use these utilities to create the appropriate files for the Gazetteer and the system should work!

About jeffmershon

Director of Program Management at SiriusXM.