天，如何能让程序转得快点？有包子。 (转载) - SanFrancisco版 - 未名存档

本页内容为未名空间相应帖子的节选和存档，一周内的贴子最多显示50字，超过一周显示500字访问原贴

SanFrancisco版 - 天，如何能让程序转得快点？有包子。 (转载)

相关主题
● java EL 问题请教 (转载)	● 如何用python web.py web service 做 multiple parameters 的 c (转载)
● 工程师们看过来	● 问一下，小朋友的guitar怎么调？
● 这回是真有人要买brcd了？	● 我能问个linux的问题吗？版上大牛多
● GOOGLE坏掉了？	● Apple reveals headcount boost
● 车胎需要充气！！！ (转载)	● Why You Should Stop Trying to Beat the Market (转载)
● Does Jerry Brown give a damn to Chinese?	● iphone 5, the best stimulus package.
● 咱们硅工	● 微软并不只会做PC产品 (转载)
● 小朋友的蹦蹦屋，在San Jose	● 媒体coverage

相关话题的讨论汇总
话题: std话题: string话题: s1话题: s2话题: boost

进入SanFrancisco版参与讨论

1

(共1页)

t***q 发帖数: 418	1 【以下文字转载自 Programming 讨论区】发信人: treeq (treeq), 信区: Programming 标题: 天，如何能让程序转得快点？有包子。发信站: BBS 未名空间站 (Fri Feb 27 23:26:22 2015, 美东) 天，如何能让程序转得快点？原帖在这里： http://www.mitbbs.com/article_t0/Programming/31381809.html 主要是要做 title matching. 有两个 file, file A 162283 行 X 12 列。 File B 3695 行 X 6 列。用 A 的第五列和 B的第四列进行比较，对 B 的第四列的每一行，从 A的那 162283 行中找出与之最相似的那一行。A 的第五列和 B 的第四列都是些影视作品的 title, 是一些长短不一的 string. 我用的是 Levenshtein algorithm 算每一对string 的相似度，再把相似度排序，从高到低，找出相似度最大的那一个 string, 也就是影视作品的 title, 加到 file B 对应的那一个title 那一行。再加入一个从file A 出来的对应的一个id, 到 file B 里。算相似度前，我先对每个title 组成的string做预处理，去掉 “:”,”-“,”season”,”episode “ , 等一些词。减少matching 的误差。但就这样一个程序，我先用 python, 一个程序要跑很长时间，才出结果，再用c++ 没想到用的时间更长。程序如下： Python: import csv import re import difflib import operator import Levenshtein import datetime import glob import os import fnmatch a=[] with open("D:\A.txt","rb") as f: for row in f: a.append(row.split("t")) f.close() b=[] with open("B.txt","rb") as k: for row in k: b.append(row.split("t")) k.close() dd={} ee={} my_list=[] for i in range(len(a)): ff={} # max_value=0 for j in range(len(b)): s1=re.sub(r',',' ',a[i][3]) s1=s1.lower() s2=re.sub(r',',' ',b[j][4]) s2=s2.lower() s1=re.sub(r'series',' ',s1) s1=re.sub(r'episode',' ',s1) s2=re.sub(r'series',' ',s2) s2=re.sub(r'episode',' ',s2) s1=re.sub(r'season',' ',s1) s2=re.sub(r'season',' ',s2) s1=re.sub(r'"',' ',s1) s2=re.sub(r'"',' ',s2) s1=re.sub(r'-',' ',s1) s2=re.sub(r'-',' ',s2) s2=re.sub(r':',' ',s2) s1=re.sub(r':',' ',s1) s1=re.sub(r' ','',s1) s2=re.sub(r' ','',s2) d=float(Levenshtein.ratio(s1,s2)) ff[b[j][4]+"t"+str(b[j][11])]=d # max_value=float(max(max_value,d)) qq="t".join(a[i]) dd[qq]=max(ff.iteritems(),key=operator.itemgetter(1))[0] my_list.append([qq.strip()+"t"+dd[qq]]) datestr=datetime.date.today().strftime("%y%m%d") filename="good2_codes_{}".format(datestr)+'.txt' File=open("C”+filename,'w') for item in my_list: File.write(str(item[0])+"n") File.close() C++: #include #include #include #include #include #include #include #include #include using namespace std; size_t uiLevenshteinDistance (const std::string &s1, const std::string &s2) { const size_t m(s1.size()); const size_t n(s2.size()); if(m==0) return n; if(n==0) return m; size_t costs=new size_t[n+1]; for(size_t k=0;k<=n;k++) costs[k]=k; size_t i=0; for (std::string::const_iterator it1=s1.begin(); it1!=s1.end();++it1,++i) {costs[0]=i+1; size_t corner=i; size_t j=0; for(std::string::const_iterator it2=s2.begin();it2!=s2.end();++it2,++j) { size_t upper=costs[j+1]; if(it1==it2) { costs[j+1]=corner; } else { size_t t(upper costs[j+1]=(costs[j] } corner=upper; } } size_t result=costs[n]; delete [] costs; return result; } int main() { std::vector lines; std::ifstream file("A.txt"); std::string line; while (std::getline(file,line)) { lines.push_back(line); } std::vector foxs; std::ifstream file1("B.txt"); std::string fox; while (std::getline(file1,fox)) { foxs.push_back(fox); } boost::unordered_map hashtable1; for (int i=0; i< (int) lines.size(); i++) { boost::unordered_map hashtable; for (int j=0; j<(int) foxs.size(); j++) { std::string str=lines[i]; std::vector tokens; boost::split(tokens,str,boost::algorithm::is_any_of("t")); std::string str1=foxs[j]; std::vector tokens1; boost::split(tokens1,str1,boost::algorithm::is_any_of("t")); std::string s1=tokens[3]; std::string s2=tokens1[4]; boost::algorithm::to_lower(s1); boost::algorithm::to_lower(s2); boost::replace_all(s1,",",""); boost::replace_all(s2,",",""); boost::replace_all(s1,"-",""); boost::replace_all(s2,"-",""); boost::replace_all(s1,"season",""); boost::replace_all(s2,"season",""); boost::replace_all(s1,"episode",""); boost::replace_all(s2,"episode",""); boost::replace_all(s1,"series",""); boost::replace_all(s2,"series",""); // size_t f = s1.find(","); // s1.replace(f, std::string(",").length(),""); // size_t f1=s2.find(","); // s2.replace(f1, std::string(",").length(),""); // size_t f2 = s1.find("season"); // s1.replace(f2, std::string("season").length(),""); // size_t f3=s2.find("season"); // s2.replace(f3, std::string(",").length(),""); // size_t f4 = s1.find("episode"); // s1.replace(f4, std::string("episode").length(),""); // size_t f5=s2.find("episode"); // s2.replace(f5, std::string("episode").length(),""); // size_t f6 = s1.find("series"); // s1.replace(f6, std::string("series").length(),""); // size_t f7=s2.find("series"); // s2.replace(f7, std::string("series").length(),""); s1.erase(remove( s1.begin(), s1.end(), '"' ),s1.end()); s2.erase(remove( s2.begin(), s2.end(), '"' ),s2.end()); //size_t f10 = s1.find("-"); // s1.replace(f10, std::string("-").length(),""); // size_t f11=s2.find("-"); // s2.replace(f11, std::string("-").length(),""); boost::replace_all(s1," ",""); boost::replace_all(s2," ",""); float k,k2,k3; k=float (std::max(s1.size(),s2.size())); k2=float ( uiLevenshteinDistance(s1,s2)); k3=1-k2/k; hashtable.insert(make_pair(tokens1[4]+"t"+(std::string)tokens1[11],k3)); } float max=0; std::string max_key; for (auto itr=hashtable.begin(); itr !=hashtable.end(); itr++) { if ((itr).second>max) { max=(itr).second; max_key=(itr).first; } } hashtable1.insert(make_pair(lines[i],max_key)); } for (auto itr1=hashtable1.begin(); itr1 !=hashtable1.end(); itr1++) cout << (itr1).first << "t" << (itr1).second << endl; return 0; } 天，为什么要用这么长的时间？我周末要跑12个file B, 每一个file B 都有 4000 行左右，对应一个 file A 162283 行 X 12 列. 今天下午回家从5：30开始run 一个程序， run 到 8 点都没有结束。为什么这样一个程序都这么耗时？谁来帮帮我，写一个快一点的程序。多谢。有大包子！
g*******7 发帖数: 269	2 gcc优化打开了吗?
t***q 发帖数: 418	3 What is gcc 优化？please give me a hint or link, thank you! 【在 g*******7 的大作中提到】 : gcc优化打开了吗?
p*********r 发帖数: 7944	4 你先要搞清楚，慢是因为算法本身慢，还是写得不好，或者有bug。会做profile吗？或者你自己可以做一个profile，就是统计出每一个routine被call了多少次，每次花了多少时间。然后你就有了改进的目标。
p*********r 发帖数: 7944	5 还可以这样实验， file A 162 行 X 12 列，计算CPU time file A 1622 行 X 12 列，计算CPU time file A 16228 行 X 12 列，计算CPU time file A 162283 行 X 12 列，计算CPU time 看看这四种情况，CPU time 变化的规律。
m****v 发帖数: 780	6 把A用不同长度的substrings来 build index 用B的作为query，用OR search去search top k (相当于用了不同的distance算法), 再用edit distance（也许都不用这步了） lucene搜索是优化了得，非常快【在 t***q 的大作中提到】 : What is gcc 优化？please give me a hint or link, thank you!
o****e 发帖数: 916	7 +1 on lucene , index b4, and query each of a5, select top x, should be lightening fast. don't bother to write your own matching algorithm.
l****c 发帖数: 838	8 That's a strange question. Anyone who used GCC should know optimization -O2 or -O3 or -OS. How did you compile your code, what's your make file? Also improve algorithm, memory,cache. 【在 t***q 的大作中提到】 : What is gcc 优化？please give me a hint or link, thank you!
t***q 发帖数: 418	9 多谢大家，包子已发。通过这个project 和大家的帮助学了不少东西。以后慢慢聊。多谢！

1

(共1页)

进入SanFrancisco版参与讨论

相关主题
● 媒体coverage	● 车胎需要充气！！！ (转载)
● 现在买房，是不是会高高的站在山岗上啊？	● Does Jerry Brown give a damn to Chinese?
● Surprise! Black immigrants have more college education than (转载)	● 咱们硅工
● 说说最近的一次面试,兼告诫国人 (转载)	● 小朋友的蹦蹦屋，在San Jose
● java EL 问题请教 (转载)	● 如何用python web.py web service 做 multiple parameters 的 c (转载)
● 工程师们看过来	● 问一下，小朋友的guitar怎么调？
● 这回是真有人要买brcd了？	● 我能问个linux的问题吗？版上大牛多
● GOOGLE坏掉了？	● Apple reveals headcount boost

相关话题的讨论汇总
话题: std话题: string话题: s1话题: s2话题: boost

未名新帖统计// 7月16日

#	版面	帖数(主题数)
-	全站	4871 (796)
1	Military	3777 (569)
2	Stock	341 (51)
3	Joke	117 (17)
4	History	116 (3)
5	Automobile	100 (9)
6	USANews	55 (9)
7	Midlife	45 (1)
8	Headline	41 (41)
9	Dreamer	33 (13)
10	FleaMarket	32 (20)
11	Living	30 (7)

* 这里只显示发帖超过25的版面，努力灌水吧:-)