由买买提看人间百态

boards

本页内容为未名空间相应帖子的节选和存档,一周内的贴子最多显示50字,超过一周显示500字 访问原贴
Programming版 - Re: 一道count frequency of all words的面试题 (转载)
相关主题
请教pthread producer-consumer问题A try-catch problem in C++
请教一个C++的问题a simple question for C++ class
stl container erase in a loopwhich func will be called?
请教用c++读取large file怎么可以快一些?请问一个exception题目
关于文件读取的C++ 问题?reverse words, not the Microsoft one!!!
C++ string to int Problemabout new operator
C++如何输入的一个小问题关于C++中一个Class的大小 (转载)
C++ Q13: InputC++里面
相关话题的讨论汇总
话题: cout话题: endl话题: include话题: line话题: mutex
进入Programming版参与讨论
1 (共1页)
s*w
发帖数: 729
1
发信人: yangguo1220 (yangguo), 信区: JobHunting
标 题: 一道count frequency of all words的面试题
发信站: BBS 未名空间站 (Wed Sep 18 20:44:44 2013, 美东)
100G的文本文件,4G内存,4个CPU。写一个程序:列出所有文本文件中的word
frequency,并output到一个最终文件中, 格式为 . 这个最终文件
的size也可能比内存小.
大家有啥建议?
【 以下文字转载自 JobHunting 讨论区 】
发信人: saw (句子熊), 信区: JobHunting
标 题: Re: 一道count frequency of all words的面试题
发信站: BBS 未名空间站 (Thu Sep 19 02:18:15 2013, 美东)
just practicing c++11 multi-threading and got the following code to try out
your example
problem with my code now:
1. it does not deal with last line of input file
2. it has deadlock for certain test file
Anyone familiar with c++11 multi-threading?
没想明白哪里出问题,请指点一下啊
#include
#include
#include
#include
#include
#include
#include
#include
#include
using namespace std;
const int MAXCAPACITY = 3;
mutex mx;
deque data;
map freq;
condition_variable room2produce, data2consume;
void print() {
cout << "begin data" << endl;
for(auto str : data)
cout << str << endl;
cout << "end data" << endl;
}
void consumer() {
unique_lock lck(mx);
cout << "hi from thread " << this_thread::get_id() << endl;
while(data.empty())
data2consume.wait(lck);
cout << "before consuming, data is as follows:" << endl;
print();
string line = data.front();
data.pop_front();
istringstream iss(line);
string word;
cout << "extracting words as: ";
while( iss >> word) {
cout << word << ",";
freq[word]++;
}
cout << endl;
cout << "after consuming, data is as follows:" << endl;
print();
room2produce.notify_one();
}
void producer(const string &line) {
unique_lock lck2(mx);
cout << "hi from main: " << this_thread::get_id() << endl;
while(data.size()>=MAXCAPACITY)
room2produce.wait(lck2);
data.push_back(line);
data2consume.notify_all();
}
int main() {
int nThread = thread::hardware_concurrency();// maximum number of
threads hardware prefers
vector ts;
// launch consumers in threads
for(int i=0;i ts.push_back(thread(consumer));
// a single producer in main, since IO is bottleneck, no need for
concurrent read of input
cout << "from main: " << endl;
string line;
while(getline(cin,line))
producer(line);
// wait consumers finish all work
for(int i=0;i ts[i].join();
//
cout << "counts as following:" << endl;
for(auto kvpair : freq)
cout << kvpair.first << ": " << kvpair.second << endl;
}
s*w
发帖数: 729
2
改了两个小时,终于改好了一半
现在一个 producer 一个 consumer 没问题了;上个版本的 consumer 没循环,只能处
理一行;而且在睡觉的时候没有  guard spurious wakeup, 我以为用 if可以算
guard 了,其实需要 while loop来guard 或者用 data2consume.wait(lck,!data.
empty())
糟糕的是 多 consumer 还是不行,会 deadlock 。
再次呼唤高人指点一下
#include
#include
#include
#include
#include
#include
#include
#include
#include
using namespace std;
const int MAXCAPACITY = 3;
mutex mx;
deque data;
map freq;
condition_variable room2produce, data2consume;
bool stillproducing = true;
void consumer() {
unique_lock lck(mx);
cout << "hi from thread " << this_thread::get_id() << endl;
while(stillproducing || !data.empty()) {
// here it has to be while loop to guard against spurious wakeup
// I used if but it does not guard such tragedy
while(data.empty())
data2consume.wait(lck);
string line = data.front();
data.pop_front();
istringstream iss(line);
string word;
while( iss >> word) {
freq[word]++;
}
if(data.size() room2produce.notify_all();
}
}
void producer() {
unique_lock lck(mx);
cout << "hi from main: " << this_thread::get_id() << endl;
string line;
while(getline(cin,line)) {
while(data.size()>=MAXCAPACITY)
room2produce.wait(lck);
data.push_back(line);
if(!data.empty())
data2consume.notify_all();
}
stillproducing = false;
}
int main() {
vector ts;
int nThread = 1; // only works for 1 consumer
// launch consumers in threads
for(int i=0;i ts.push_back(thread(consumer));
// a single producer in main, since IO is bottleneck, no need for
concurrent read of input
producer();
// wait consumers finish all work
for(int i=0;i ts[i].join();
//
cout << "counts as following:" << endl;
for(auto kvpair : freq)
cout << kvpair.first << ": " << kvpair.second << endl;
}

【在 s*w 的大作中提到】
: 发信人: yangguo1220 (yangguo), 信区: JobHunting
: 标 题: 一道count frequency of all words的面试题
: 发信站: BBS 未名空间站 (Wed Sep 18 20:44:44 2013, 美东)
: 100G的文本文件,4G内存,4个CPU。写一个程序:列出所有文本文件中的word
: frequency,并output到一个最终文件中, 格式为 . 这个最终文件
: 的size也可能比内存小.
: 大家有啥建议?
: 【 以下文字转载自 JobHunting 讨论区 】
: 发信人: saw (句子熊), 信区: JobHunting
: 标 题: Re: 一道count frequency of all words的面试题

s*w
发帖数: 729
3
又琢磨了两天,看了不少相关东西,终于搞定了,觉得自己对这个多线程的理解加强了
很多。思考比单纯的看人说原理更刻骨铭心。
这个问题我设计的用一个 producer 多个 consumer 的架构,上个版本的是两头用
condition_variable, 一个通知 producer 有空间了开始干活生产,另一个通知
consumer 有库存了开始消费。参见上篇里面的 wait 和 notify_all,notify_one 语
句。 这个思路对于单 producer 单 consumer 没问题,可是不适用于 多 consumer.
因为所有的 consumer 可能同时睡觉(没空存),同时醒来(有库存),结果只有一个
能抢占mutex(拿到库存),其他的只好继续睡觉(while 循环 wait). 如果无限制的
生产下去,每个睡觉的都有机会醒来干活。可是在有限生产的情况下,producer 干完
活了后,总有睡觉的 consumer 无人唤醒导致死锁。解决的办法就是用 non-block
的 try_lock, lock 不上就返回 false,这样有机会检查是否需要(还在生产或是有
数据没处理)继续 try_lock
还有些零碎的改动:任何数据,只要有两个线程用,其中一个写,就是 shared data.
我最开始的版本中,都是只想到了保存中间数据的 deque data, 而忘记了
保存最终结果的 map 也是被多个 consumer 同时写的。
用过 valgrind 的 helgrind, 没发现咋用
贴一下现在的 code
运行环境:linux gcc 4.7.2
编译:g++ -pthread -std=c++11 testcc.cpp
运行:./a.out < 任意有几行文本的文件
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
using namespace std;
/*
hide concurrency inside the data structure
to minimize critical sections
*/
class concurDeque {
int capacity;
deque data;
mutex mx;
condition_variable room2produce;
public:
concurDeque(int cap) : capacity(cap) {}
int size() {
unique_lock lck(mx);
return data.size();
}
void put(const string &line) {
{
unique_lock lck(mx);
//cout << line << endl;
// while loop against spurious wakeup
while(data.size()>=capacity)
room2produce.wait(lck);
data.push_back(line);
// do not rely on lck out of scope for unlocking mx
// coz notify may be too fast and the other party may wake up to
find
// that this lck is not out of scope yet
//lck.unlock();
}
}
bool get(string &line) {
bool consumed = false;
if(mx.try_lock()) {
if(!data.empty()) {
line = data.front();
data.pop_front();
consumed = true;
}
mx.unlock();
}
if(consumed)
room2produce.notify_all();
return consumed;
}
};
class concurMap {
map freq;
mutex mx;
public:
void insert(string line) {
lock_guard lck(mx);
istringstream iss(line);
string word;
while( iss >> word)
freq[word]++;
}
void print() {
lock_guard lck(mx);
cout << "counts as following:" << endl;
for(auto kvpair : freq)
cout << kvpair.first << ": " << kvpair.second << endl;
}
};
concurDeque ccd(5);
concurMap ccm;
atomic stillproducing{true};
void consumer() {
// cout here may be messed up coz no mutex production
// but it is just for debug anyway
while(stillproducing || ccd.size()>0) {
string line;
if(ccd.get(line)) // get should not block
ccm.insert(line);
/* {// just for debug
mutex mtx;
lock_guard lck(mtx);
cout << "consumer " << this_thread::get_id() << ":" << line << endl;
}
*/
}
}
void producer() {
string line;
while(getline(cin,line)) {
ccd.put(line);
}
//cout << "producer " << this_thread::get_id() << ": end!" << endl;
stillproducing.store(false);
}
int main() {
vector ts;
int nThread = 3; // now works for multiple consumers
// launch consumers in threads
for(int i=0;i ts.push_back(thread(consumer));
// a single producer in main, since IO is bottleneck, no need for
concurrent read of input
producer();
// wait consumers finish all work
for(int i=0;i ts[i].join();
//
ccm.print();
}

【在 s*w 的大作中提到】
: 改了两个小时,终于改好了一半
: 现在一个 producer 一个 consumer 没问题了;上个版本的 consumer 没循环,只能处
: 理一行;而且在睡觉的时候没有  guard spurious wakeup, 我以为用 if可以算
: guard 了,其实需要 while loop来guard 或者用 data2consume.wait(lck,!data.
: empty())
: 糟糕的是 多 consumer 还是不行,会 deadlock 。
: 再次呼唤高人指点一下
: #include
: #include
: #include

1 (共1页)
进入Programming版参与讨论
相关主题
C++里面关于文件读取的C++ 问题?
两个继承问题C++ string to int Problem
为什么我看不懂下面的code,是不是水平还不够?C++如何输入的一个小问题
C++ 弱问一个C++ Q13: Input
请教pthread producer-consumer问题A try-catch problem in C++
请教一个C++的问题a simple question for C++ class
stl container erase in a loopwhich func will be called?
请教用c++读取large file怎么可以快一些?请问一个exception题目
相关话题的讨论汇总
话题: cout话题: endl话题: include话题: line话题: mutex