问一道(大)数据 algorithm (转载) - DataSciences版

本页内容为未名空间相应帖子的节选和存档，一周内的贴子最多显示50字，超过一周显示500字访问原贴

DataSciences版 - 问一道(大)数据 algorithm (转载)

相关主题
● 请教有关用R做t-test (转载)	● 如何改变spark dataframe的column names
● 在R里merge两个dataframe太慢了	● spark上一两个million的时间序列数据
● 我LG说	● Twitter Data Scientist 电面题目
● random forest 有没有可能保证某几个变量一直被选上	● 如何证明某个feature 没用，分组的分布和总体分布相同
● Spark开始使用DataFrame	● 问一个R的问题
● R问题请教	● Random forests on imbalanced data
● Memory Error in pandas.concat with Python	● matlab 和 python 的for loop
● 求助：关于2个python的题目	● training dataset和unbalanced dataset的设计

相关话题的讨论汇总
话题: positive话题: int话题: rcpp话题: index话题: negative

进入DataSciences版参与讨论

(共1页)

n*****3
发帖数: 1584

【以下文字转载自 JobHunting 讨论区】
发信人: nacst23 (cnc), 信区: JobHunting
标题: 问一道(大)数据 algorithm
发信站: BBS 未名空间站 (Sun Mar 22 00:11:01 2015, 美东)
请教大家一下：
两组人， POSITIVE 和 Negative ，
say
POSITIVE 100K ppl，
Negative 900K ppl.
基本的数据结构是人的 ID 和 length of stay（待了几天）。
ID length of stay(days)
ppl-0000001 8
ppl-0000002 10
...
目的是 sample Negative 组出来 100K 人 ,
which one-to-one match the Positive 组人
的 length of stay（待了几天），
这样 match 完, 两组人的 100K 个 length of stay（待了几天）
完全一样.
当然如果 negative
组人有多个 match 一个 POSITIVE 组人，任取一个就好了。
想用 c++ 写，use STL／Map hash，
不知有没好的算法哦，
or 更好的 STL 数据结构／算法可用？
因为是准备写成 RCPP for R, 现在不考虑用
并行 Solution.
谢谢。

w*****a
发帖数: 218

没完全明白你想干啥, 没理解错的话
将两组数据放一块, 同时标记好POSITIVE 和 Negative
然后按 length of stay 排序
取最相近的不同组的
没看见难在哪儿, 是我理解有误吗

【在 n*****3 的大作中提到】

: 【以下文字转载自 JobHunting 讨论区】
: 发信人: nacst23 (cnc), 信区: JobHunting
: 标题: 问一道(大)数据 algorithm
: 发信站: BBS 未名空间站 (Sun Mar 22 00:11:01 2015, 美东)
: 请教大家一下：
: 两组人， POSITIVE 和 Negative ，
: say
: POSITIVE 100K ppl，
: Negative 900K ppl.
: 基本的数据结构是人的 ID 和 length of stay（待了几天）。

n*****3
发帖数: 1584

the for loop will take a long of time to complete...
want to find out some good algorithms for this.

【在 w*****a 的大作中提到】

: 没完全明白你想干啥, 没理解错的话
: 将两组数据放一块, 同时标记好POSITIVE 和 Negative
: 然后按 length of stay 排序
: 取最相近的不同组的
: 没看见难在哪儿, 是我理解有误吗

m******a
发帖数: 77

you can use MapReduce
Use days as key
Positive, negative marker as value
one job to get it done
if your data is too large
In non-distributed version
you could loop your positive first, for example
build a map, then go to loop your negative, get the match.

【在 n*****3 的大作中提到】

: the for loop will take a long of time to complete...
: want to find out some good algorithms for this.

n*****3
发帖数: 1584

no distributed version for now, thanks for the tips.
go through the loop can be very time consuming, wish can find a
better approach..

【在 m******a 的大作中提到】

: you can use MapReduce
: Use days as key
: Positive, negative marker as value
: one job to get it done
: if your data is too large
: In non-distributed version
: you could loop your positive first, for example
: build a map, then go to loop your negative, get the match.

m******a
发帖数: 77

I want to learn the better approach too,
if there is any
without loop the data

【在 n*****3 的大作中提到】

: no distributed version for now, thanks for the tips.
: go through the loop can be very time consuming, wish can find a
: better approach..

s******e
发帖数: 114

1. keep estimate histogram in positive set util convergence
after scanning first 1000 records in positive, if lucky, k-s test shows no
difference with previous histogram,which is estimated for the first 900
records.
2.build a hash, key is stay days, value is count
3.keep scan negative set util find 100K record

n*****3
发帖数: 1584

先估计分布，再 hash then scan，
省得时间是估计分布 vs scan positive population；
hash table is for positive polulation or negative population?
thanks.

no

【在 s******e 的大作中提到】

: 1. keep estimate histogram in positive set util convergence
: after scanning first 1000 records in positive, if lucky, k-s test shows no
: difference with previous histogram,which is estimated for the first 900
: records.
: 2.build a hash, key is stay days, value is count
: 3.keep scan negative set util find 100K record

s******e
发帖数: 114

positive population.
我本来就不理解你的问题，你这一问，我更确切我误解了问题了。我们等高人来解答
吧。

s****i
发帖数: 197

试着回答一下。。。
用hive作的假设有2个table: tempp=positive and tempn=negative
column=id,ls
select a3.* from
(select rank() over (partition by a2.aid order by rand()) as rank, a2.aid,
a2.bid, a2.als
from (select a.id as aid,a.ls as als,b.id as bid from tempp a left join
tempn b on (a.ls=b.ls)) a2) a3
where a3.rank<=1
;
但是这样做有缺点就是重复sample,
i.e.会出现
u1,11,u11
u2,11,u11
u3,11,u11
这样地，虽然几率会很小如果negative很大的话
求高人解答。。。
-------
PS 好像partition那里直接by length of stay再改改就可以了。。。

【在 n*****3 的大作中提到】

: 先估计分布，再 hash then scan，
: 省得时间是估计分布 vs scan positive population；
: hash table is for positive polulation or negative population?
: thanks.
:
: no

相关主题
● R问题请教	● 如何改变spark dataframe的column names
● Memory Error in pandas.concat with Python	● spark上一两个million的时间序列数据
● 求助：关于2个python的题目	● Twitter Data Scientist 电面题目
进入DataSciences版参与讨论

T*****u
发帖数: 7103

我不懂，但这是要做啥啊

d******c
发帖数: 2407

一般matlab或者R里不推荐for loop，是因为系统函数进行了优化，很多时候能并行处
理，或者vectorized。
你用c++又不用for loop，还不并行，是到底想干什么？该有的计算是免不了的
另外你题目也没说明白，match的定义是什么？

【在 n*****3 的大作中提到】

: the for loop will take a long of time to complete...
: want to find out some good algorithms for this.

d******c
发帖数: 2407

看了一下他在另一个版的回帖，貌似negative population里是没有stop time的，哈哈
那match的标准是什么？
题目一直就没说明白

【在 s******e 的大作中提到】

: positive population.
: 我本来就不理解你的问题，你这一问，我更确切我误解了问题了。我们等高人来解答
: 吧。

m******a
发帖数: 77

感觉他/她好像是要作一个MATCH SAMPLE
得运算快
并且既不能用分布式计算
又不能用LOOP - 任何LOOP
所以无解
除非是前面理解有误

【在 d******c 的大作中提到】

: 看了一下他在另一个版的回帖，貌似negative population里是没有stop time的，哈哈
: 那match的标准是什么？
: 题目一直就没说明白

n*****3
发帖数: 1584

谢谢大家的会贴啊，很感谢！！
其实就是match sample，
分布式以后下一个stage 用，
不是说不可以Loop，只是希望尽量快。
下面抛砖一下：这 loop 不需每次全scan，
用两个indicator 往下走。
／*** assume f[] g[] are all sorted!!!!! ***/
int match(int f[], int g[], int m, int n)
{
int index_f, index_g; /* subscripts for f and g */
int m,n /* length of each group **/
int count; /* equal pair counter */
count = index_f = index_g = 0;
while (index_f < m && index_g < n)
if (f[index_f] < g[index_g]) /* if f[] is less */
index_f++; /* then move f */
else if (f[index_f] > g[index_g]) /* if f[] > */
index_g++; /* then move g */
else
count++, index_f++, index_g++; /* EQUAL */
return count;
}

【在 m******a 的大作中提到】

: 感觉他/她好像是要作一个MATCH SAMPLE
: 得运算快
: 并且既不能用分布式计算
: 又不能用LOOP - 任何LOOP
: 所以无解
: 除非是前面理解有误

T**n
发帖数: 47

至少先sort一下啊

【在 n*****3 的大作中提到】

: 谢谢大家的会贴啊，很感谢！！
: 其实就是match sample，
: 分布式以后下一个stage 用，
: 不是说不可以Loop，只是希望尽量快。
: 下面抛砖一下：这 loop 不需每次全scan，
: 用两个indicator 往下走。
: ／*** assume f[] g[] are all sorted!!!!! ***/
: int match(int f[], int g[], int m, int n)
: {
: int index_f, index_g; /* subscripts for f and g */

n*****3
发帖数: 1584

first line
／*** assume f[] g[] are all sorted!!!!! ***/
I sort them in the R part, before passing them to RCPP.
In general, I think R vectorized operating is fast
enough.

【在 T**n 的大作中提到】

: 至少先sort一下啊

n*****3
发帖数: 1584

／*** assume f[] g[] are all sorted!!!!! ***/
I sort them in the pure R section first; vectorized operation like data.
table/setkey is very fast already. Try to keep the RCPP as mini as possible.

【在 T**n 的大作中提到】

: 至少先sort一下啊

s******e
发帖数: 114

still confused.
you are looking for algorithm which is better then O(n). if sorting is only
for match sample, it is bad. sorting is kind of O(nlogn), I could be wrong
, if R uses bucket sort in this case.
assuming data was sorted already for other purpose, merge join is fast.
otherwise, I would use hash join.

n*****3
发帖数: 1584

good point;
if everything in c/c++, I bet there is better solution;
but here I just want to speed up a small, but very time consuming(big loop)
part ,
in R.
Sure I can use std::sort to sort them in the RCPP code instead of using R/
data.table/setkey.
BTW do you have better algorithm for this? my solution is just for 抛砖引玉.

only
wrong

【在 s******e 的大作中提到】

: still confused.
: you are looking for algorithm which is better then O(n). if sorting is only
: for match sample, it is bad. sorting is kind of O(nlogn), I could be wrong
: , if R uses bucket sort in this case.
: assuming data was sorted already for other purpose, merge join is fast.
: otherwise, I would use hash join.

相关主题
● 如何证明某个feature 没用，分组的分布和总体分布相同	● matlab 和 python 的for loop
● 问一个R的问题	● training dataset和unbalanced dataset的设计
● Random forests on imbalanced data	● [Data Science Project Case] Fuzzy matching on names
进入DataSciences版参与讨论

n*****3
发帖数: 1584

length of stay（待了几天）
不是 UNIQUE 的。很多人有一样的length of stay（待了几天）。
how to merge? many to many merge?

only
wrong

【在 s******e 的大作中提到】

H*H
发帖数: 472

题目的意思只懂个大概，你需要的应该是R里面的match() function
idx <- match(y$ID, x$ID)
然后你就可以用这个idx去直接提取x的信息并赋给y
y$info <- x$info[idx]
向量化操作，这么点数据，应该瞬间就能完成了

n*****3
发帖数: 1584

Thanks , but match() is kind of very slow ah..
for the R help:
"Matching for lists is potentially very slow and best avoided except in
simple cases."
our first stage data is about 60G txt file, not too smart for R .

【在 H*H 的大作中提到】

: 题目的意思只懂个大概，你需要的应该是R里面的match() function
: idx <- match(y$ID, x$ID)
: 然后你就可以用这个idx去直接提取x的信息并赋给y
: y$info <- x$info[idx]
: 向量化操作，这么点数据，应该瞬间就能完成了

H*H
发帖数: 472

你这不是100k个ID match另外100K个ID吗？不大明白为什么会涉及list。要是原始数据
存在list里面的，你可以把他们的id提取出来存放在vector里，然后match。照你这么
一说，你的code慢主要就是因为list操作，跟match基本没啥关系。

【在 n*****3 的大作中提到】

: Thanks , but match() is kind of very slow ah..
: for the R help:
: "Matching for lists is potentially very slow and best avoided except in
: simple cases."
: our first stage data is about 60G txt file, not too smart for R .

m******a
发帖数: 77

I do not think sort is needed.
So now, first loop either positive or negative sample - the small size one
build histogram with days, then normalize it
then loop the other sample, do sampling with acceptance/rejection method
until the same number records collected

)
玉.

【在 n*****3 的大作中提到】

: good point;
: if everything in c/c++, I bet there is better solution;
: but here I just want to speed up a small, but very time consuming(big loop)
: part ,
: in R.
: Sure I can use std::sort to sort them in the RCPP code instead of using R/
: data.table/setkey.
: BTW do you have better algorithm for this? my solution is just for 抛砖引玉.
:
: only

l*******s
发帖数: 1258

其实这个不算大数据，单机可以解决，都没上million这个量级，考虑下算法优化，找
个支持多核cpu的classifier跑就行。

n*****3
发帖数: 1584

你说的对，这不是大数据。
因为在modeling training stage，用 R 跑prototype system，
R itself 不方便并行。下一阶段会上 spark／hadoop。
继续抛砖引玉一下。下面的 CPP file 跑得还行。用 sourceRCPP call
就可。看看有什么可以改进的，
再次谢谢大家 all
#include
using namespace Rcpp;
using namespace std;
// Below is a the balancing C++ function to R. You can
// source this function into an R session using the Rcpp::sourceCpp
// function (or via the Source button on the editor toolbar)
// For more on using Rcpp click the Help button on the editor toolbar
// [[Rcpp::export]]
DataFrame ba(DataFrame ng, DataFrame ps) {
int ng_nrows = ng.nrows();
int ps_nrows = ps.nrows();
// int nCols = ng.size();
cout << " neg nRows is " << neg_nrows << endl;
cout << " pos nRows is " << pos_nrows << endl;
int count = 0;
std::vector ps_ls =Rcpp::as > (ps[0]);
std::vector ng_ls =Rcpp::as > (ng[1]);
std::vector::const_iterator i = ps_ls.begin();
std::vector:: iterator j = ng_ls.begin();
// while( (i ! ps_ls.end() ) && (j ! ng_los.end() ) )
while( (i != ps_ls.end() ) )
{if ( *i < *j)
i++ ;
else if (*i > *j)
j++;
else
{ count++;
std::cout << "the *i is " << *i <<"..and the *j is" << *j <<
endl;
*j = (*j) + 5000;
i++; j++; }
}
// for( std::vector::const_iterator k = ng_ls.begin(); k != ng_
ls.end(); ++i)
// std::cout << *k << ' ' << endl;
cout << " total matached # is " << count << endl;
return ng_ls;
}

【在 l*******s 的大作中提到】

: 其实这个不算大数据，单机可以解决，都没上million这个量级，考虑下算法优化，找
: 个支持多核cpu的classifier跑就行。

(共1页)

进入DataSciences版参与讨论

相关主题
● training dataset和unbalanced dataset的设计	● Spark开始使用DataFrame
● [Data Science Project Case] Fuzzy matching on names	● R问题请教
● 微软data scientist positions open	● Memory Error in pandas.concat with Python
● SF-based startup hiring a data scientist	● 求助：关于2个python的题目
● 请教有关用R做t-test (转载)	● 如何改变spark dataframe的column names
● 在R里merge两个dataframe太慢了	● spark上一两个million的时间序列数据
● 我LG说	● Twitter Data Scientist 电面题目
● random forest 有没有可能保证某几个变量一直被选上	● 如何证明某个feature 没用，分组的分布和总体分布相同

相关话题的讨论汇总
话题: positive话题: int话题: rcpp话题: index话题: negative

#	版面	帖数(主题数)
-	全站	4871 (796)
1	Military	3777 (569)
2	Stock	341 (51)
3	Joke	117 (17)
4	History	116 (3)
5	Automobile	100 (9)
6	USANews	55 (9)
7	Midlife	45 (1)
8	Headline	41 (41)
9	Dreamer	33 (13)
10	FleaMarket	32 (20)
11	Living	30 (7)

boards

未名新帖统计// 7月16日

历史上的今天