boards

本页内容为未名空间相应帖子的节选和存档,一周内的贴子最多显示50字,超过一周显示500字 访问原贴
CS版 - Re: Efficient duplicate filtering for st
相关主题
Efficient duplicate filtering for stream
Re: [转载] 有关做研究
请问一个遗传算法用于feature selection的问题
我们非金融行业写程序的是不是犯傻?$400000
Re: 请问大牛们leetcode上的Permutations II (转载)
关于Hash table 和 bloom filter
问一个bloom filter 和 bitmap的使用区别
算法问题,找出现频率最高的元素
cross validation and best model question
[转载] 有搞算法的大侠吗?????问个问题!!!
相关话题的讨论汇总
话题: duplicate话题: hashing话题: window话题: efficient话题: what
进入CS版参与讨论
1 (共1页)
f*******h
发帖数: 1269
1
Only in the window, since duplicates only happen in a very close window.
You seem to be an expert. I am looking for your suggestions!
What do you mean approximate? What's the trade-off between accuracy and
performance?
I prefer accurate results though.
r******h
发帖数: 809
2
I am not sure. But it seems to be a space/accuracy trade-off.
If you don't care much about the space, could you try hash for each element
in the window?
If you care that and could stand with approximation (guaranteed) result,
I think there are some papers (like count-min sketch) or something like that
to deal with it.

【在 f*******h 的大作中提到】
: Only in the window, since duplicates only happen in a very close window.
: You seem to be an expert. I am looking for your suggestions!
: What do you mean approximate? What's the trade-off between accuracy and
: performance?
: I prefer accurate results though.

f*******h
发帖数: 1269
3
Thanks!
The problem is that the value is mostly unique, so hashing is difficult.

【在 r******h 的大作中提到】
: I am not sure. But it seems to be a space/accuracy trade-off.
: If you don't care much about the space, could you try hash for each element
: in the window?
: If you care that and could stand with approximation (guaranteed) result,
: I think there are some papers (like count-min sketch) or something like that
: to deal with it.

y***u
发帖数: 101
4

If you want accurate results, you have to remember everything in the
window, and there might not be anything better than hashing.
It's not clear to me how to define approximation for such yes/no problems.

【在 r******h 的大作中提到】
: I am not sure. But it seems to be a space/accuracy trade-off.
: If you don't care much about the space, could you try hash for each element
: in the window?
: If you care that and could stand with approximation (guaranteed) result,
: I think there are some papers (like count-min sketch) or something like that
: to deal with it.

f*******h
发帖数: 1269
5
But all data are unqiue, except the duplicates.
Static hashing will not work.

element
that

【在 y***u 的大作中提到】
:
: If you want accurate results, you have to remember everything in the
: window, and there might not be anything better than hashing.
: It's not clear to me how to define approximation for such yes/no problems.

y***u
发帖数: 101
6
Yes it works, unless you want the worst-case cost to be O(1).
If you want guaranteed *expected* O(1) cost, use universal hash families.
If you just want things to work well in practice, use any reasonably hash
function.
I don't know any dynamic hashing scheme that gives you O(1) *worst-case*
lookups and updates.

【在 f*******h 的大作中提到】
: But all data are unqiue, except the duplicates.
: Static hashing will not work.
:
: element
: that

r******h
发帖数: 809
7
It can be with one side or both sides.
Side 1: Find all duplicates, (say totally n), and at most n*\delta elements
which are not duplicate by mistake.
Side 2: All elements find are duplicate, while some (at most n*\delta)
duplicate elements are not found...

【在 y***u 的大作中提到】
: Yes it works, unless you want the worst-case cost to be O(1).
: If you want guaranteed *expected* O(1) cost, use universal hash families.
: If you just want things to work well in practice, use any reasonably hash
: function.
: I don't know any dynamic hashing scheme that gives you O(1) *worst-case*
: lookups and updates.

r******h
发帖数: 809
8
Do you mean the number of elements in each window is much smaller than the
number of elements in the entire streaming?
So difficult to implement hashing?

【在 f*******h 的大作中提到】
: But all data are unqiue, except the duplicates.
: Static hashing will not work.
:
: element
: that

f*******h
发帖数: 1269
9
Oh, yeah. The values are unique in the whole streaming space.
How can you build hashing for it?

problems.

【在 r******h 的大作中提到】
: Do you mean the number of elements in each window is much smaller than the
: number of elements in the entire streaming?
: So difficult to implement hashing?

r******h
发帖数: 809
10
The size of the hashing table may choose to be someone between # streaming
and # window... I think the result will not be too bad (conflicts will not
be too many), if the data is uniformly distributed in the whole streaming...

【在 f*******h 的大作中提到】
: Oh, yeah. The values are unique in the whole streaming space.
: How can you build hashing for it?
:
: problems.

y***u
发帖数: 101
11
Why can't ?
A hash table of size equal to the window size is enough.
Hashing schemes give O(1) expected cost per operation on *any* input,
the expectation comes from inside the algorithm, not the input.

【在 f*******h 的大作中提到】
: Oh, yeah. The values are unique in the whole streaming space.
: How can you build hashing for it?
:
: problems.

x*******n
发帖数: 3
12
I am just a bit curious about what is the use of duplicate elimination.
Could you please provide a real-world scenario for duplicate elimination
in stream processing?
What is your desired output? at every moment, eliminate duplicate tuples
and output the unique items to next operator?
I also think hash could be a good approach.

the

【在 y***u 的大作中提到】
: Why can't ?
: A hash table of size equal to the window size is enough.
: Hashing schemes give O(1) expected cost per operation on *any* input,
: the expectation comes from inside the algorithm, not the input.

f*******h
发帖数: 1269
13
A read world application:
You use mutiple sensors to sense a same target, to increase accuracy, e.g.,
mount sensors on different directions for maximum coverage of the target. The
sensors are synchronized not to do duplicate readings. But, in reality,
duplicates still happen.
So, you don't want multiple readings of a same value.
Since all values are distinct except duplicates, how to design a hashing
function?

【在 x*******n 的大作中提到】
: I am just a bit curious about what is the use of duplicate elimination.
: Could you please provide a real-world scenario for duplicate elimination
: in stream processing?
: What is your desired output? at every moment, eliminate duplicate tuples
: and output the unique items to next operator?
: I also think hash could be a good approach.
:
: the

x*******n
发帖数: 3
14
Just my small piece of views.
I assume duplicate reading will only happen in a small range of time interval,
then within such kind a short interval, I guess brute force or else may
sometimes be better without complicated structures overhead. kidding.hehe
Are your values within a fixed range and discrete? This may help.
Regarding to the hashing function, I am not sure which design will be better.
I guess this deal with the distribution of your data and duplicate.

The
than

【在 f*******h 的大作中提到】
: A read world application:
: You use mutiple sensors to sense a same target, to increase accuracy, e.g.,
: mount sensors on different directions for maximum coverage of the target. The
: sensors are synchronized not to do duplicate readings. But, in reality,
: duplicates still happen.
: So, you don't want multiple readings of a same value.
: Since all values are distinct except duplicates, how to design a hashing
: function?

1 (共1页)
进入CS版参与讨论
相关主题
[转载] 有搞算法的大侠吗?????问个问题!!!
帮忙找一篇paper
请教minimum set cover Problem
max independent set
Theory的高手们指教一下吧
问一个feedback vertex set的问题
请教一个distribution之间的likelihood问题 (转载)
请教一个聚类的问题
问个在图中删除边和点的算法问题 (转载)
how to compute binomial distribution without overflow?
相关话题的讨论汇总
话题: duplicate话题: hashing话题: window话题: efficient话题: what