Oh my, how can I make my program run faster? Baozi offered. (repost) - LosAngeles board - 未名存档

t***q Posts: 418	1
[The following is reposted from the Programming board]
From: treeq (treeq), Board: Programming
Subject: Oh my, how can I make my program run faster? Baozi offered.
Posted at: BBS 未名空间站 (Fri Feb 27 23:26:22 2015, US Eastern)

Oh my, how can I make my program run faster? The original post is here: http://www.mitbbs.com/article_t0/Programming/31381809.html

The task is title matching. There are two files: file A has 162283 rows x 12 columns, and file B has 3695 rows x 6 columns. I compare column 5 of A with column 4 of B: for each row of B's column 4, I find the most similar row among the 162283 rows of A. Both columns hold titles of films and TV shows, which are strings of varying length.

I use the Levenshtein algorithm to compute the similarity of each pair of strings, sort the similarities from high to low, pick the string (that is, the title) with the highest similarity, and append it to the row of the corresponding title in file B, together with the matching id from file A. Before computing the similarity I preprocess each title string, removing ":", "-", "season", "episode" and similar words, to reduce matching errors.

But even for a program this simple: I first wrote it in Python, and it takes a very long time to produce a result. Then I rewrote it in C++, and to my surprise that takes even longer. The programs are below.

Python:

    import csv
    import re
    import difflib
    import operator
    import Levenshtein
    import datetime
    import glob
    import os
    import fnmatch

    # Read both tab-separated files into lists of column lists.
    a = []
    with open(r"D:\A.txt", "rb") as f:
        for row in f:
            a.append(row.split("\t"))

    b = []
    with open("B.txt", "rb") as k:
        for row in k:
            b.append(row.split("\t"))

    dd = {}
    ee = {}
    my_list = []
    for i in range(len(a)):
        ff = {}
        # max_value=0
        for j in range(len(b)):
            # Clean both titles: lowercase, drop punctuation and noise words.
            s1 = re.sub(r',', ' ', a[i][3])
            s1 = s1.lower()
            s2 = re.sub(r',', ' ', b[j][4])
            s2 = s2.lower()
            s1 = re.sub(r'series', ' ', s1)
            s1 = re.sub(r'episode', ' ', s1)
            s2 = re.sub(r'series', ' ', s2)
            s2 = re.sub(r'episode', ' ', s2)
            s1 = re.sub(r'season', ' ', s1)
            s2 = re.sub(r'season', ' ', s2)
            s1 = re.sub(r'"', ' ', s1)
            s2 = re.sub(r'"', ' ', s2)
            s1 = re.sub(r'-', ' ', s1)
            s2 = re.sub(r'-', ' ', s2)
            s2 = re.sub(r':', ' ', s2)
            s1 = re.sub(r':', ' ', s1)
            s1 = re.sub(r' ', '', s1)
            s2 = re.sub(r' ', '', s2)
            d = float(Levenshtein.ratio(s1, s2))
            ff[b[j][4] + "\t" + str(b[j][11])] = d
            # max_value=float(max(max_value,d))
        # Keep the key with the highest similarity for this row.
        qq = "\t".join(a[i])
        dd[qq] = max(ff.iteritems(), key=operator.itemgetter(1))[0]
        my_list.append([qq.strip() + "\t" + dd[qq]])

    datestr = datetime.date.today().strftime("%y%m%d")
    filename = "good2_codes_{}".format(datestr) + '.txt'
    # The output path prefix is truncated in the archived post; only "C" survives.
    File = open("C" + filename, 'w')
    for item in my_list:
        File.write(str(item[0]) + "\n")
    File.close()

C++:

    // The header names were stripped from the archived post;
    // these are the headers the code below needs.
    #include <algorithm>
    #include <fstream>
    #include <iostream>
    #include <string>
    #include <vector>
    #include <boost/algorithm/string.hpp>
    #include <boost/unordered_map.hpp>
    using namespace std;

    // Single-row dynamic-programming Levenshtein distance.
    size_t uiLevenshteinDistance(const std::string &s1, const std::string &s2)
    {
        const size_t m(s1.size());
        const size_t n(s2.size());
        if (m == 0) return n;
        if (n == 0) return m;
        size_t *costs = new size_t[n + 1];
        for (size_t k = 0; k <= n; k++) costs[k] = k;
        size_t i = 0;
        for (std::string::const_iterator it1 = s1.begin(); it1 != s1.end(); ++it1, ++i)
        {
            costs[0] = i + 1;
            size_t corner = i;
            size_t j = 0;
            for (std::string::const_iterator it2 = s2.begin(); it2 != s2.end(); ++it2, ++j)
            {
                size_t upper = costs[j + 1];
                if (*it1 == *it2)
                {
                    costs[j + 1] = corner;
                }
                else
                {
                    size_t t(upper < corner ? upper : corner);
                    costs[j + 1] = (costs[j] < t ? costs[j] : t) + 1;
                }
                corner = upper;
            }
        }
        size_t result = costs[n];
        delete[] costs;
        return result;
    }

    int main()
    {
        std::vector<std::string> lines;
        std::ifstream file("A.txt");
        std::string line;
        while (std::getline(file, line))
        {
            lines.push_back(line);
        }
        std::vector<std::string> foxs;
        std::ifstream file1("B.txt");
        std::string fox;
        while (std::getline(file1, fox))
        {
            foxs.push_back(fox);
        }
        boost::unordered_map<std::string, std::string> hashtable1;
        for (int i = 0; i < (int) lines.size(); i++)
        {
            boost::unordered_map<std::string, float> hashtable;
            for (int j = 0; j < (int) foxs.size(); j++)
            {
                // Split both rows on tabs and clean the two title columns.
                std::string str = lines[i];
                std::vector<std::string> tokens;
                boost::split(tokens, str, boost::algorithm::is_any_of("\t"));
                std::string str1 = foxs[j];
                std::vector<std::string> tokens1;
                boost::split(tokens1, str1, boost::algorithm::is_any_of("\t"));
                std::string s1 = tokens[3];
                std::string s2 = tokens1[4];
                boost::algorithm::to_lower(s1);
                boost::algorithm::to_lower(s2);
                boost::replace_all(s1, ",", "");
                boost::replace_all(s2, ",", "");
                boost::replace_all(s1, "-", "");
                boost::replace_all(s2, "-", "");
                boost::replace_all(s1, "season", "");
                boost::replace_all(s2, "season", "");
                boost::replace_all(s1, "episode", "");
                boost::replace_all(s2, "episode", "");
                boost::replace_all(s1, "series", "");
                boost::replace_all(s2, "series", "");
                // size_t f = s1.find(",");
                // s1.replace(f, std::string(",").length(),"");
                // size_t f1=s2.find(",");
                // s2.replace(f1, std::string(",").length(),"");
                // size_t f2 = s1.find("season");
                // s1.replace(f2, std::string("season").length(),"");
                // size_t f3=s2.find("season");
                // s2.replace(f3, std::string(",").length(),"");
                // size_t f4 = s1.find("episode");
                // s1.replace(f4, std::string("episode").length(),"");
                // size_t f5=s2.find("episode");
                // s2.replace(f5, std::string("episode").length(),"");
                // size_t f6 = s1.find("series");
                // s1.replace(f6, std::string("series").length(),"");
                // size_t f7=s2.find("series");
                // s2.replace(f7, std::string("series").length(),"");
                s1.erase(remove(s1.begin(), s1.end(), '"'), s1.end());
                s2.erase(remove(s2.begin(), s2.end(), '"'), s2.end());
                // size_t f10 = s1.find("-");
                // s1.replace(f10, std::string("-").length(),"");
                // size_t f11=s2.find("-");
                // s2.replace(f11, std::string("-").length(),"");
                boost::replace_all(s1, " ", "");
                boost::replace_all(s2, " ", "");
                // Similarity = 1 - distance / length of the longer string.
                float k, k2, k3;
                k = float(std::max(s1.size(), s2.size()));
                k2 = float(uiLevenshteinDistance(s1, s2));
                k3 = 1 - k2 / k;
                hashtable.insert(make_pair(tokens1[4] + "\t" + (std::string)tokens1[11], k3));
            }
            // Pick the key with the highest similarity for this row.
            float max = 0;
            std::string max_key;
            for (auto itr = hashtable.begin(); itr != hashtable.end(); itr++)
            {
                if ((*itr).second > max)
                {
                    max = (*itr).second;
                    max_key = (*itr).first;
                }
            }
            hashtable1.insert(make_pair(lines[i], max_key));
        }
        for (auto itr1 = hashtable1.begin(); itr1 != hashtable1.end(); itr1++)
            cout << (*itr1).first << "\t" << (*itr1).second << endl;
        return 0;
    }

Why does it take so long? This weekend I need to process 12 file Bs, each about 4000 rows, against one file A of 162283 rows x 12 columns. This afternoon I got home and started one run at 5:30, and by 8:00 it still had not finished. Why is a program like this so slow? Can someone help me write a faster one? Many thanks. A big baozi is on offer!
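For reference, the two loops compare 162283 x 3695 = 599,635,685 pairs (about 6.0 x 10^8), and both posted programs redo the title cleanup (and, in the C++ version, the tab splitting) for every pair, then store every score before taking the maximum. A minimal Python 3 sketch of one possible restructuring follows: clean each title once, then keep only a running best. The names `normalize`, `levenshtein`, `similarity` and `best_match` are hypothetical, the sample titles are made up, and the score is 1 - distance / longer length as in the C++ version, which is not guaranteed to equal `Levenshtein.ratio`.

```python
import re

# Hypothetical helper names; the removed words mirror the post's preprocessing.
_NOISE = re.compile(r'series|episode|season|[,":\- ]')

def normalize(title):
    """Lowercase a title and strip noise words, punctuation and spaces."""
    return _NOISE.sub('', title.lower())

def levenshtein(s1, s2):
    """Edit distance computed with two rows of the dynamic-programming table."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(previous[j] + 1,                 # deletion
                               current[j - 1] + 1,              # insertion
                               previous[j - 1] + (c1 != c2)))   # substitution
        previous = current
    return previous[-1]

def similarity(s1, s2):
    """1 - distance / length of the longer string (as in the C++ version)."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(s1, s2) / longest

def best_match(query, candidates):
    """Return (index, score) of the already-normalized candidate closest to query."""
    best_index, best_score = -1, -1.0
    for index, candidate in enumerate(candidates):
        score = similarity(query, candidate)
        if score > best_score:
            best_index, best_score = index, score
    return best_index, best_score

# Made-up sample titles standing in for the title column of file A.
a_titles = ["Breaking Bad: Season 1", "The Wire - Episode 3", "Game of Thrones"]
a_clean = [normalize(t) for t in a_titles]   # cleanup runs once per title
index, score = best_match(normalize("breaking bad"), a_clean)
print(a_titles[index], score)
```

This does not reduce the number of pairs, so it is only a first step; the pure-Python `levenshtein` above is there to keep the sketch self-contained, and a compiled implementation such as the `Levenshtein` module already used in the post would normally replace it.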
w********o Posts: 10088	2 Have you tried checking which step the time is actually going to? cout?
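One way to answer that question for the Python version is the standard-library `cProfile` module. The sketch below is only an illustration: `match_all` is a hypothetical stand-in for the matching loop, and `profile_call` is a made-up helper name. Running it on a small slice of file A first would keep the measurement short.

```python
import cProfile
import io
import pstats

def match_all(rows):
    """Hypothetical stand-in for the matching loop being measured."""
    return [row.lower().replace(" ", "") for row in rows]

def profile_call(func, *args):
    """Run func under cProfile and return (result, report_text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Show the ten entries with the largest cumulative time.
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()

result, report = profile_call(match_all, ["The Wire", "Game of Thrones"] * 1000)
print(report)
```

The report lists call counts and time per function, which shows whether the cleanup, the distance computation, or the output dominates.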
H******e Posts: 767	3 Check how much memory your process has allocated. It may be so high that most of the time is spent swapping virtual memory.
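For the Python version, the standard-library `tracemalloc` module gives a rough view of how much memory Python allocations take. It does not show operating-system swapping, so it is only a partial check, and the thread does not establish that memory is the bottleneck here. The sketch below compares two hypothetical functions: `keep_all_scores` stores every score in a dict before taking the maximum (the pattern of `ff` in the post), while `keep_running_max` tracks only the best score. `peak_bytes` is a made-up helper name.

```python
import tracemalloc

def peak_bytes(func, *args):
    """Return (result, peak traced memory in bytes) for one call of func."""
    tracemalloc.start()
    try:
        result = func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak

def keep_all_scores(n):
    """Store every score in a dict, then take the maximum."""
    scores = {str(j): j / n for j in range(n)}
    return max(scores.values())

def keep_running_max(n):
    """Track only the best score seen so far."""
    best = 0.0
    for j in range(n):
        best = max(best, j / n)
    return best

best_dict, peak_dict = peak_bytes(keep_all_scores, 100_000)
best_max, peak_max = peak_bytes(keep_running_max, 100_000)
print(peak_dict, peak_max)
```

Both functions return the same best score; only the peak memory differs.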
t***q Posts: 418	4 Thanks, everyone. The baozi have been sent. I learned a lot from this project and from your help. Let's talk more later. Thanks!
