c******n Posts: 4965 | 1 [The following text is reposted from the THU board]
From: creation (working hard on freestyle 50m/45sec !), Board: THU
Subject: parsing bibliography and sorting
Posted from: BBS 未名空间站 (Sun Nov 11 13:27:54 2012, US Eastern)
My wife was spending a lot of time sorting the bibliography of her thesis.
Because her bib came in plain-text form, I have to parse out the first
author's last name before sorting, so I wrote this little piece of code. Hope
it will be useful for someone else too....
Right now it fails to parse single-author entries, because it's difficult to
tell a human name apart from other words. But in biology, most papers have
multiple authors.
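sub get_first_author {
    my ($line) = @_;
    # Cut off the first two chunks at a comma, the word "and", or a digit.
    # \b keeps "and" from matching inside names such as "Sandberg".
    my ($author, $second_possible, $remaining) = split /,|\band\b|\d/, $line, 3;
    my $lastname = find_author_lastname($author);
    return $lastname if defined $lastname && $lastname ne '';
    # The first chunk held only initials, so try the second one.
    my $second_possible_lastname = find_author_lastname($second_possible);
    return defined $second_possible_lastname ? $second_possible_lastname : '';
}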
sub find_author_lastname {
    my ($author) = @_;
    return '' unless defined $author;
    my @segments = split /[, ]+/, $author;
    my @candidates_for_last = ();
    foreach my $s (@segments) {
        next if uc($s) eq $s;                  # all upper case (or empty)
        next if length($s) == 1;               # only a single letter
        next if $s =~ /^(?:[[:alpha:]]\.)+$/;  # A.B.C. pattern
        push @candidates_for_last, $s;
    }
    # Guess that the longest remaining word is the last name.
    @candidates_for_last = sort { length($b) <=> length($a) } @candidates_for_last;
    return @candidates_for_last ? $candidates_for_last[0] : '';
}
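# Pair each input line with its sort key, sort by key, then print the lines.
print join "", map { $_->[1] }
    sort { $a->[0] cmp $b->[0] }
    map { [ get_first_author($_), $_ ] } <>;

l*******s Posts: 1258 | 2
This can be a small problem or a big one.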
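Kept small: write a bunch of regular expressions and make up some rules of
your own, and that should cover most cases.
Taken big: it's the classic Named Entity Recognition problem in NLP, where the
mainstream approach is machine learning plus some context features. You might
try an off-the-shelf package, such as opennlp.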
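
A rough sketch of the "regexes plus a few rules" route, which also copes with some single-author entries. Everything here is an assumption for illustration, not the original poster's method: the helper name guess_first_lastname is made up, the two made-up sample entries are placeholders, and the rules assume each entry starts with the author list and that the first author is followed by a comma, semicolon, "&", the word "and", or an opening parenthesis (as in "(year)").

```perl
use strict;
use warnings;

# Hypothetical helper: guess the first author's last name from one entry.
sub guess_first_lastname {
    my ($entry) = @_;
    # Rule 1: keep only the text before the first author separator.
    my ($first) = split /\s*(?:[,;&(]|\band\b)\s*/, $entry, 2;
    return '' unless defined $first;
    # Rule 2: drop initials such as "J.", "J.-P." or "JK".
    # (Like the original code, this also drops ALL-CAPS surnames.)
    my @words = grep { !/^(?:[[:upper:]]\.?-?)+$/ } split ' ', $first;
    return '' unless @words;
    # Rule 3: take the last remaining word, minus any trailing period.
    (my $last = $words[-1]) =~ s/\.$//;
    return $last;
}

# Sample entries; the Smith and Jones ones are invented for the example.
my @entries = (
    'Smith, J. and Doe, A. (2010) A made-up title.',
    'Darwin C (1859) On the Origin of Species',
    'A. B. Jones, C. Lee. Another made-up title. 2011',
);

# Same pair-sort-unpair idea as the one-liner above, but case-insensitive.
my @sorted = map  { $_->[1] }
             sort { lc $a->[0] cmp lc $b->[0] }
             map  { [ guess_first_lastname($_), $_ ] } @entries;

print "$_\n" for @sorted;
```

With these three entries the keys come out as Smith, Darwin and Jones, so the list prints in the order Darwin, Jones, Smith. Multi-word surnames ("van der Waals") and suffixes ("Jr.") would need extra rules, which is roughly where the NER route starts to look attractive.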