parsing bibliography and sorting (repost) - Programming board

c******n
Posts: 4965

[The following text is reposted from the THU board]
From: creation (working hard on freestyle 50m/45sec !), Board: THU
Subject: parsing bibliography and sorting
Site: BBS 未名空间站 (Sun Nov 11 13:27:54 2012, US Eastern)
My wife was spending a lot of time sorting the bibliography of her thesis.
Because her bib was obtained in plain-text form, I first have to parse out the
first author's last name, so I wrote this little piece of code. Hope it
will be useful for someone else too....
Right now it fails to parse single-author entries, because it's difficult to
tell a human name from other words. But biology papers
mostly have multiple authors.
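use strict;
use warnings;

# Return the last name of the first author on one bibliography line.
sub get_first_author {
    my ($line) = @_;
    # Split on a comma, the whole word "and", or a digit (at most 3 pieces).
    # \b keeps "and" from matching inside names such as "Anderson".
    my ($author, $second_possible) = split /,|\band\b|\d/, $line, 3;
    my $lastname = find_author_lastname(defined $author ? $author : '');
    return $lastname if $lastname ne '';
    # The first piece held only initials, so try the second piece.
    return find_author_lastname(defined $second_possible ? $second_possible : '');
}

# Pick the most likely last name from one author chunk ('' if none).
sub find_author_lastname {
    my ($author) = @_;
    my @candidates_for_last;
    foreach my $s (split /[, ]+/, $author) {
        next if uc($s) eq $s;                  # all upper case (or empty)
        next if length($s) == 1;               # only a single letter
        next if $s =~ /^(?:[[:alpha:]]\.)+$/;  # A.B.C. pattern
        push @candidates_for_last, $s;
    }
    # The longest remaining word is taken as the last name.
    @candidates_for_last = sort { length($b) <=> length($a) } @candidates_for_last;
    return @candidates_for_last ? $candidates_for_last[0] : '';
}

# Read entries (one per line) from files or STDIN, print them sorted by key.
print join "", map { $_->[1] }
               sort { $a->[0] cmp $b->[0] }
               map { [ get_first_author($_), $_ ] } <>;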
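
For anyone puzzled by the last statement: it is the decorate-sort-undecorate idiom (often called a Schwartzian transform). Each line is paired with its sort key, the pairs are sorted by key, and then the keys are thrown away. Here is a minimal standalone sketch of just that part. Note that `sort_key` and `sort_entries` are hypothetical names, the key function is a deliberately simple stand-in (text before the first comma), and the two sample entries are made up.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real key function:
# take whatever comes before the first comma, lower-cased.
sub sort_key {
    my ($line) = @_;
    my ($key) = $line =~ /^\s*([^,]+),/;
    return defined $key ? lc $key : '';
}

# Decorate each line with its key, sort by key, then drop the key again.
sub sort_entries {
    my @lines = @_;
    return map  { $_->[1] }
           sort { $a->[0] cmp $b->[0] }
           map  { [ sort_key($_), $_ ] } @lines;
}

# Made-up sample entries, for illustration only.
my @sample = (
    "Zhang, A.B., and Lee, C. 2010. Example title one.\n",
    "Adams, R., and Brown, T. 2008. Example title two.\n",
);
print sort_entries(@sample);   # the Adams entry is printed first
```

The point of the extra map steps is that the key is computed once per line instead of once per comparison.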

l*******s
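Posts: 1258

This problem can be as small or as big as you want to make it.
On the small end, writing a bunch of regular expressions and a few rules of your own should solve most cases.
On the big end, this is a typical Named Entity Recognition problem in NLP. The mainstream approach is machine
learning plus some context features. You might try an existing package, such as opennlp.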
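
To make the "small" end concrete, here is a rough sketch of what a rule list could look like in Perl. Everything in it is an assumption for illustration: the name `first_author_lastname`, the three citation styles covered, and the sample strings are all made up, and real bibliographies will need more rules than this.

```perl
use strict;
use warnings;

# Hypothetical rule list, tried in order. Each rule captures the last name in $1.
my @rules = (
    # "Smith, J.K." or "Smith, John": last name, comma, then a capital letter
    qr/^\s*([[:alpha:]][[:alpha:]'-]+),\s*[[:upper:]]/,
    # "J.K. Smith" or "J. K. Smith": dotted initials first
    qr/^\s*(?:[[:upper:]]\.\s*)+([[:alpha:]][[:alpha:]'-]+)/,
    # "Smith JK": last name followed by bare initials
    qr/^\s*([[:alpha:]][[:alpha:]'-]+)\s+[[:upper:]]{1,3}\b/,
);

# Return the first author's last name, or '' if no rule matches.
sub first_author_lastname {
    my ($entry) = @_;
    for my $rule (@rules) {
        return $1 if $entry =~ $rule;
    }
    return '';
}

# Made-up sample entries, for illustration only.
print first_author_lastname("Smith, J.K., and Jones, A. 2010. Title."), "\n";  # Smith
print first_author_lastname("J.K. Smith and A. Jones. 2010. Title."), "\n";    # Smith
```

Because the rules key on the layout of the name rather than on finding the longest word, a single-author entry in one of these styles is handled the same way as a multi-author one. The third rule is the loosest and can misfire on a title that starts with a word followed by an acronym.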
