Re: Efficient duplicate filtering for st - CS版 - 未名存档

本页内容为未名空间相应帖子的节选和存档，一周内的贴子最多显示50字，超过一周显示500字访问原贴

CS版 - Re: Efficient duplicate filtering for st

相关主题
● Efficient duplicate filtering for stream
● Re: [转载] 有关做研究
● 请问一个遗传算法用于feature selection的问题
● 我们非金融行业写程序的是不是犯傻？$400000
● Re: 请问大牛们leetcode上的Permutations II (转载)
● 关于Hash table 和 bloom filter
● 问一个bloom filter 和 bitmap的使用区别
● 算法问题，找出现频率最高的元素
● cross validation and best model question
● [转载] 有搞算法的大侠吗？？？？？问个问题！！！

相关话题的讨论汇总
话题: duplicate话题: hashing话题: window话题: efficient话题: what

进入CS版参与讨论

1

(共1页)

f*******h 发帖数: 1269	1 Only in the window, since duplicates only happen in a very close window. You seem to be an expert. I am looking for your suggestions! What do you mean approximate? What's the trade-off between accuracy and performance? I prefer accurate results though.
r******h 发帖数: 809	2 I am not sure. But it seems to be a space/accuracy trade-off. If you don't care much about the space, could you try hash for each element in the window? If you care that and could stand with approximation (guaranteed) result, I think there are some papers (like count-min sketch) or something like that to deal with it. 【在 f*******h 的大作中提到】 : Only in the window, since duplicates only happen in a very close window. : You seem to be an expert. I am looking for your suggestions! : What do you mean approximate? What's the trade-off between accuracy and : performance? : I prefer accurate results though.
f*******h 发帖数: 1269	3 Thanks! The problem is that the value is mostly unique, so hashing is difficult. 【在 r******h 的大作中提到】 : I am not sure. But it seems to be a space/accuracy trade-off. : If you don't care much about the space, could you try hash for each element : in the window? : If you care that and could stand with approximation (guaranteed) result, : I think there are some papers (like count-min sketch) or something like that : to deal with it.
y***u 发帖数: 101	4 If you want accurate results, you have to remember everything in the window, and there might not be anything better than hashing. It's not clear to me how to define approximation for such yes/no problems. 【在 r******h 的大作中提到】 : I am not sure. But it seems to be a space/accuracy trade-off. : If you don't care much about the space, could you try hash for each element : in the window? : If you care that and could stand with approximation (guaranteed) result, : I think there are some papers (like count-min sketch) or something like that : to deal with it.
f*******h 发帖数: 1269	5 But all data are unqiue, except the duplicates. Static hashing will not work. element that 【在 y***u 的大作中提到】 : : If you want accurate results, you have to remember everything in the : window, and there might not be anything better than hashing. : It's not clear to me how to define approximation for such yes/no problems.
y***u 发帖数: 101	6 Yes it works, unless you want the worst-case cost to be O(1). If you want guaranteed expected O(1) cost, use universal hash families. If you just want things to work well in practice, use any reasonably hash function. I don't know any dynamic hashing scheme that gives you O(1) worst-case lookups and updates. 【在 f*******h 的大作中提到】 : But all data are unqiue, except the duplicates. : Static hashing will not work. : : element : that
r******h 发帖数: 809	7 It can be with one side or both sides. Side 1: Find all duplicates, (say totally n), and at most n\delta elements which are not duplicate by mistake. Side 2: All elements find are duplicate, while some (at most n\delta) duplicate elements are not found... 【在 y**u 的大作中提到】 : Yes it works, unless you want the worst-case cost to be O(1). : If you want guaranteed expected* O(1) cost, use universal hash families. : If you just want things to work well in practice, use any reasonably hash : function. : I don't know any dynamic hashing scheme that gives you O(1) worst-case : lookups and updates.
r******h 发帖数: 809	8 Do you mean the number of elements in each window is much smaller than the number of elements in the entire streaming? So difficult to implement hashing? 【在 f*******h 的大作中提到】 : But all data are unqiue, except the duplicates. : Static hashing will not work. : : element : that

f*******h 发帖数: 1269	9 Oh, yeah. The values are unique in the whole streaming space. How can you build hashing for it? problems. 【在 r******h 的大作中提到】 : Do you mean the number of elements in each window is much smaller than the : number of elements in the entire streaming? : So difficult to implement hashing?
r******h 发帖数: 809	10 The size of the hashing table may choose to be someone between # streaming and # window... I think the result will not be too bad (conflicts will not be too many), if the data is uniformly distributed in the whole streaming... 【在 f*******h 的大作中提到】 : Oh, yeah. The values are unique in the whole streaming space. : How can you build hashing for it? : : problems.
y***u 发帖数: 101	11 Why can't ? A hash table of size equal to the window size is enough. Hashing schemes give O(1) expected cost per operation on any input, the expectation comes from inside the algorithm, not the input. 【在 f*******h 的大作中提到】 : Oh, yeah. The values are unique in the whole streaming space. : How can you build hashing for it? : : problems.
x*******n 发帖数: 3	12 I am just a bit curious about what is the use of duplicate elimination. Could you please provide a real-world scenario for duplicate elimination in stream processing? What is your desired output? at every moment, eliminate duplicate tuples and output the unique items to next operator? I also think hash could be a good approach. the 【在 y**u 的大作中提到】 : Why can't ? : A hash table of size equal to the window size is enough. : Hashing schemes give O(1) expected cost per operation on any* input, : the expectation comes from inside the algorithm, not the input.
f*******h 发帖数: 1269	13 A read world application: You use mutiple sensors to sense a same target, to increase accuracy, e.g., mount sensors on different directions for maximum coverage of the target. The sensors are synchronized not to do duplicate readings. But, in reality, duplicates still happen. So, you don't want multiple readings of a same value. Since all values are distinct except duplicates, how to design a hashing function? 【在 x*******n 的大作中提到】 : I am just a bit curious about what is the use of duplicate elimination. : Could you please provide a real-world scenario for duplicate elimination : in stream processing? : What is your desired output? at every moment, eliminate duplicate tuples : and output the unique items to next operator? : I also think hash could be a good approach. : : the
x*******n 发帖数: 3	14 Just my small piece of views. I assume duplicate reading will only happen in a small range of time interval, then within such kind a short interval, I guess brute force or else may sometimes be better without complicated structures overhead. kidding.hehe Are your values within a fixed range and discrete? This may help. Regarding to the hashing function, I am not sure which design will be better. I guess this deal with the distribution of your data and duplicate. The than 【在 f*******h 的大作中提到】 : A read world application: : You use mutiple sensors to sense a same target, to increase accuracy, e.g., : mount sensors on different directions for maximum coverage of the target. The : sensors are synchronized not to do duplicate readings. But, in reality, : duplicates still happen. : So, you don't want multiple readings of a same value. : Since all values are distinct except duplicates, how to design a hashing : function?

1

(共1页)

进入CS版参与讨论

相关主题
● [转载] 有搞算法的大侠吗？？？？？问个问题！！！
● 帮忙找一篇paper
● 请教minimum set cover Problem
● max independent set
● Theory的高手们指教一下吧
● 问一个feedback vertex set的问题
● 请教一个distribution之间的likelihood问题 (转载)
● 请教一个聚类的问题
● 问个在图中删除边和点的算法问题 (转载)
● how to compute binomial distribution without overflow?

相关话题的讨论汇总
话题: duplicate话题: hashing话题: window话题: efficient话题: what