由买买提看人间百态

boards

本页内容为未名空间相应帖子的节选和存档,一周内的贴子最多显示50字,超过一周显示500字 访问原贴
Java版 - Java Map 存 1 million 记录
相关主题
那个快?云计算如何应用到传统的web server应用
JDK Source?有没有大牛在搞cloud?-- 包子贴
java知道一个reference怎么删掉它指向的内存空间? (转载)请问Hadoop要怎么学?
JDK HashMap 的 Entry class开发前景光明啊
answer Re: how HashMap/Hashtable compare key?火车旅行家在中文WINDOWS下无法运行的大问题已被解决
来问两个HashMap的问题Re: Question on mouse click on image in Jdk 1.1
这几年Java跟其他语言的差距拉大了。How do a Servlet sent a Applet?
问HashSet的问题?What version of the Java Development Kit (JDKTM)
相关话题的讨论汇总
话题: hashmap话题: map话题: collision话题: string话题: hash
进入Java版参与讨论
1 (共1页)
l******0
发帖数: 244
1
如果物理内存够,比如 8G,这么多 entry 的 map, 会有性能问题吗? Key 就是
String.还是有分到几个小的 map 比较好? 程序中,存到一个 map 里面更方便。
Map 比较合适的最大 size 是多少?
t*******e
发帖数: 684
2
key partition一下,用两层Maps.
c*********e
发帖数: 16335
3
存到hashtable里?能排序吗?

【在 l******0 的大作中提到】
: 如果物理内存够,比如 8G,这么多 entry 的 map, 会有性能问题吗? Key 就是
: String.还是有分到几个小的 map 比较好? 程序中,存到一个 map 里面更方便。
: Map 比较合适的最大 size 是多少?

w**z
发帖数: 8232
4
http://stackoverflow.com/questions/4333265/how-to-sort-a-java-h

【在 c*********e 的大作中提到】
: 存到hashtable里?能排序吗?
w**z
发帖数: 8232
5
1m doesn't seem to be large. do benchmark and tune it if it becomes a
problem.

【在 l******0 的大作中提到】
: 如果物理内存够,比如 8G,这么多 entry 的 map, 会有性能问题吗? Key 就是
: String.还是有分到几个小的 map 比较好? 程序中,存到一个 map 里面更方便。
: Map 比较合适的最大 size 是多少?

s******e
发帖数: 493
6
if possible, using distributed cache will provide you better QoS of
availability, scalability and performance.
If you have to store them in one single jvm, you'd better write your own
hash function by extending hashmap. The default implementation of hashmap is
not designed for the big data like that.
g*****g
发帖数: 34805
7
1M is nothing, just do it, don't even bother with benchmark.
c*********e
发帖数: 16335
8
linkedhashmap,排序,那时间复杂度就不是o(1)了吧?

【在 w**z 的大作中提到】
: http://stackoverflow.com/questions/4333265/how-to-sort-a-java-h
l******0
发帖数: 244
9
你是说分作两个 map?

【在 t*******e 的大作中提到】
: key partition一下,用两层Maps.
l******0
发帖数: 244
10
"The default implementation of hashmap is not designed for the big data like
that."
Any documentation or experiments on this point?

is

【在 s******e 的大作中提到】
: if possible, using distributed cache will provide you better QoS of
: availability, scalability and performance.
: If you have to store them in one single jvm, you'd better write your own
: hash function by extending hashmap. The default implementation of hashmap is
: not designed for the big data like that.

相关主题
来问两个HashMap的问题云计算如何应用到传统的web server应用
这几年Java跟其他语言的差距拉大了。有没有大牛在搞cloud?-- 包子贴
问HashSet的问题?请问Hadoop要怎么学?
进入Java版参与讨论
l******0
发帖数: 244
11
我记得好像有次一个 map 里面的数目超过 20万,速度就变慢了,但我不肯定是不是还
有别的原因造成速度慢。
我的理解是,就像一个房子里面,房子越大,每次要从中找出一个人的时间就要更多?
hash bucket 越多,是不是 search 的时间就越长?但不知100 万会不会成为瓶颈。

【在 g*****g 的大作中提到】
: 1M is nothing, just do it, don't even bother with benchmark.
t*******e
发帖数: 684
12
When key space grows big,the probability of hash collision enlarges。Java
handles hash collision in a less-efficient manner, which could lead to
slowness.



【在 l******0 的大作中提到】
: 我记得好像有次一个 map 里面的数目超过 20万,速度就变慢了,但我不肯定是不是还
: 有别的原因造成速度慢。
: 我的理解是,就像一个房子里面,房子越大,每次要从中找出一个人的时间就要更多?
: hash bucket 越多,是不是 search 的时间就越长?但不知100 万会不会成为瓶颈。

t*******e
发帖数: 684
13
Figure out an algorithm to evenly distribute all the keys into 100 slots (
Maps). Each slot contains 10K records. Consistent Hash in BigTable might
give you some inspiration.

【在 l******0 的大作中提到】
: 你是说分作两个 map?
w**z
发帖数: 8232
14
I don't think you are smarter than the people who implemented Java string
hash function. don't over engineer things. if you think you have performance
issue, run some benchmark and see where is the bottle neck and go from
there. as I said, 1m is nothing nowadays .

【在 t*******e 的大作中提到】
: When key space grows big,the probability of hash collision enlarges。Java
: handles hash collision in a less-efficient manner, which could lead to
: slowness.
:
: 。

t*******e
发帖数: 684
15
http://openjdk.java.net/jeps/180
Obviously they are smarter than you.

performance

【在 w**z 的大作中提到】
: I don't think you are smarter than the people who implemented Java string
: hash function. don't over engineer things. if you think you have performance
: issue, run some benchmark and see where is the bottle neck and go from
: there. as I said, 1m is nothing nowadays .

w**z
发帖数: 8232
16
I am talking about String hash function, not handling of hash map collision
. by the way, if you run openjdk in production, then good luck. Of course
, they are smarter than me.

【在 t*******e 的大作中提到】
: http://openjdk.java.net/jeps/180
: Obviously they are smarter than you.
:
: performance

t*******e
发帖数: 684
17
Same issue with Oracle JDK. BTW, you sound pretty naive and emotional.
Please be professional not personal when discussing technical subjects.

collision
course

【在 w**z 的大作中提到】
: I am talking about String hash function, not handling of hash map collision
: . by the way, if you run openjdk in production, then good luck. Of course
: , they are smarter than me.

w**z
发帖数: 8232
18
which part is emotional and not professional?

【在 t*******e 的大作中提到】
: Same issue with Oracle JDK. BTW, you sound pretty naive and emotional.
: Please be professional not personal when discussing technical subjects.
:
: collision
: course

s******e
发帖数: 493
19
rehash and collision are two major issues.

like

【在 l******0 的大作中提到】
: "The default implementation of hashmap is not designed for the big data like
: that."
: Any documentation or experiments on this point?
:
: is

c*********e
发帖数: 16335
20
hashtable发生collision是常事,换个算法。

【在 s******e 的大作中提到】
: rehash and collision are two major issues.
:
: like

相关主题
开发前景光明啊How do a Servlet sent a Applet?
火车旅行家在中文WINDOWS下无法运行的大问题已被解决What version of the Java Development Kit (JDKTM)
Re: Question on mouse click on image in Jdk 1.1Java向Linux迈出决定性一步
进入Java版参与讨论
s******e
发帖数: 493
21
read the context please. Do not jump into the topic too fast without knowing
the situation... we were talking about big number of string keys.

【在 c*********e 的大作中提到】
: hashtable发生collision是常事,换个算法。
t*******e
发帖数: 684
22
Judge yourself before you judge others.

【在 w**z 的大作中提到】
: which part is emotional and not professional?
w**z
发帖数: 8232
23
搞得那么悬呼? 人用一个1m hashmap , 还big table, consistent hashing. 你phd
搞研究的吧?

【在 t*******e 的大作中提到】
: Judge yourself before you judge others.
g*****g
发帖数: 34805
24
1M hashmap,屁大的事情。我这几天把几个Oracle的表转到MySQL上,有一些合并和分
解的操作。
几个1-2M记录的表,都是直接放到内存里,用来做id->object lookup,完全没有压力
。这是在我8G的laptop上跑的,不是什么大服务器。

phd

【在 w**z 的大作中提到】
: 搞得那么悬呼? 人用一个1m hashmap , 还big table, consistent hashing. 你phd
: 搞研究的吧?

s******e
发帖数: 493
25
well. seems that people have the different assumptions.
If the initial loading time is not counted and your string key is db id type
of string, then hashmap might be fine since both rehashing and collision
will not be issue any more.
But according to the original post, it seems that we are talking about
building the hashmap over the time on the fly with random and big strings as
the key. if so, rehashing and collision will be issues.
g*****g
发帖数: 34805
26
1M is such a small dataset it doesn't matter how it's built.
A 32 bits integer gives your 4 billion buckets, 1M is just 1/4000 of them,
you'll see a few collisions here and there, but not enough to cause
performance issue.

type
as

【在 s******e 的大作中提到】
: well. seems that people have the different assumptions.
: If the initial loading time is not counted and your string key is db id type
: of string, then hashmap might be fine since both rehashing and collision
: will not be issue any more.
: But according to the original post, it seems that we are talking about
: building the hashmap over the time on the fly with random and big strings as
: the key. if so, rehashing and collision will be issues.

s******e
发帖数: 493
27
you misunderstand the relationship between #buckets and collision. Having
enough buckets does not necessarily transfer into less collisions. As said,
db id type string (especially surrogate id), due to the constraint applied,
is more likely to be uniformly distributed than random big string keys.
rehashing is another issue because of the way how the default implementation
of hashmap works. if you build the map on the fly, I remember that somebody
reported after 200 thousand string key entries, rehashing could cause some
users to suffer quite a bit. Even not all the users will suffer, but it
should be avoided for any robust systems.

【在 g*****g 的大作中提到】
: 1M is such a small dataset it doesn't matter how it's built.
: A 32 bits integer gives your 4 billion buckets, 1M is just 1/4000 of them,
: you'll see a few collisions here and there, but not enough to cause
: performance issue.
:
: type
: as

g*****g
发帖数: 34805
28
No, you misunderstand. The number of available buckets is constant, the less
data, the less chance for collision, simple as that.
It is possible 1M data causes heavy collision? Possible, but unlikely for
most use cases. And when it's case, it's easy to tune the hashing function.
OP didn't mention any random big string to begin with, try a simple approach
and don't overengineer before you have to.

,
,
implementation
somebody
some

【在 s******e 的大作中提到】
: you misunderstand the relationship between #buckets and collision. Having
: enough buckets does not necessarily transfer into less collisions. As said,
: db id type string (especially surrogate id), due to the constraint applied,
: is more likely to be uniformly distributed than random big string keys.
: rehashing is another issue because of the way how the default implementation
: of hashmap works. if you build the map on the fly, I remember that somebody
: reported after 200 thousand string key entries, rehashing could cause some
: users to suffer quite a bit. Even not all the users will suffer, but it
: should be avoided for any robust systems.

w**z
发帖数: 8232
29
Java Hashmap implementation does a supplement hash on the hasCode caller
provides.
http://grepcode.com/file/repository.grepcode.com/java/root/jdk/
It's not likely you will have a lot of hash collisions for random String Key
.
Another way to avoid rehashing is to set a bigger initial size if you are
sure you will have 1M records.
I agree with goodbug, at least do some benchmarking before trying to tune
anything.

less
approach

【在 g*****g 的大作中提到】
: No, you misunderstand. The number of available buckets is constant, the less
: data, the less chance for collision, simple as that.
: It is possible 1M data causes heavy collision? Possible, but unlikely for
: most use cases. And when it's case, it's easy to tune the hashing function.
: OP didn't mention any random big string to begin with, try a simple approach
: and don't overengineer before you have to.
:
: ,
: ,
: implementation

l*******m
发帖数: 1096
30
hashmap will increase the memory allocation when the current allocation is
not large enough

Key

【在 w**z 的大作中提到】
: Java Hashmap implementation does a supplement hash on the hasCode caller
: provides.
: http://grepcode.com/file/repository.grepcode.com/java/root/jdk/
: It's not likely you will have a lot of hash collisions for random String Key
: .
: Another way to avoid rehashing is to set a bigger initial size if you are
: sure you will have 1M records.
: I agree with goodbug, at least do some benchmarking before trying to tune
: anything.
:

相关主题
Re: Confused!JDK Source?
Java's performance mythjava知道一个reference怎么删掉它指向的内存空间? (转载)
那个快?JDK HashMap 的 Entry class
进入Java版参与讨论
w**z
发帖数: 8232
31
At that time, rehashing happens... If you know your size in advance, just
set to a bigger number to avoid rehashing.

【在 l*******m 的大作中提到】
: hashmap will increase the memory allocation when the current allocation is
: not large enough
:
: Key

s******e
发帖数: 493
32
First, the number of buckets is never constant for the default
implementation of HashMap. That is why you have rehashing.
Second, the collision will depend on all three factors of the input, hash
function and the number of buckets. Even you can use all 64 bits of memory
for your bucket, you might still have heavy collision for a biased input.
That is why I said db id type string key might work fine for the default
hashmap. I believe in such a case, the default hash function will distribute
them uniformly.

less
approach

【在 g*****g 的大作中提到】
: No, you misunderstand. The number of available buckets is constant, the less
: data, the less chance for collision, simple as that.
: It is possible 1M data causes heavy collision? Possible, but unlikely for
: most use cases. And when it's case, it's easy to tune the hashing function.
: OP didn't mention any random big string to begin with, try a simple approach
: and don't overengineer before you have to.
:
: ,
: ,
: implementation

s******e
发帖数: 493
33
When you build your map dynamically, that is a big if. If you can preload, I
think even several million records should be fine if the input is not that
biased.

【在 w**z 的大作中提到】
: At that time, rehashing happens... If you know your size in advance, just
: set to a bigger number to avoid rehashing.

w**z
发帖数: 8232
34
That is true, it's the tradeoff between time and space. Only lz knows his
use case.

I
that

【在 s******e 的大作中提到】
: When you build your map dynamically, that is a big if. If you can preload, I
: think even several million records should be fine if the input is not that
: biased.

l******0
发帖数: 244
35
In my case,
1) The "keys" are English words or phrases, no repetitions.
2) I have to build this map in a static initialization function.
BTW, what do you mean by 'preload', is it an initialization?

I
that

【在 s******e 的大作中提到】
: When you build your map dynamically, that is a big if. If you can preload, I
: think even several million records should be fine if the input is not that
: biased.

l******0
发帖数: 244
36
Sounds like good experiments. I will use one map to try.

【在 g*****g 的大作中提到】
: 1M hashmap,屁大的事情。我这几天把几个Oracle的表转到MySQL上,有一些合并和分
: 解的操作。
: 几个1-2M记录的表,都是直接放到内存里,用来做id->object lookup,完全没有压力
: 。这是在我8G的laptop上跑的,不是什么大服务器。
:
: phd

1 (共1页)
进入Java版参与讨论
相关主题
What version of the Java Development Kit (JDKTM)answer Re: how HashMap/Hashtable compare key?
Java向Linux迈出决定性一步来问两个HashMap的问题
Re: Confused!这几年Java跟其他语言的差距拉大了。
Java's performance myth问HashSet的问题?
那个快?云计算如何应用到传统的web server应用
JDK Source?有没有大牛在搞cloud?-- 包子贴
java知道一个reference怎么删掉它指向的内存空间? (转载)请问Hadoop要怎么学?
JDK HashMap 的 Entry class开发前景光明啊
相关话题的讨论汇总
话题: hashmap话题: map话题: collision话题: string话题: hash