Java Map 存 1 million 记录 - Java版

本页内容为未名空间相应帖子的节选和存档，一周内的贴子最多显示50字，超过一周显示500字访问原贴

Java版 - Java Map 存 1 million 记录

相关主题
● 那个快？	● 云计算如何应用到传统的web server应用
● JDK Source?	● 有没有大牛在搞cloud？-- 包子贴
● java知道一个reference怎么删掉它指向的内存空间？ (转载)	● 请问Hadoop要怎么学？
● JDK HashMap 的 Entry class	● 开发前景光明啊
● answer Re: how HashMap/Hashtable compare key?	● 火车旅行家在中文WINDOWS下无法运行的大问题已被解决
● 来问两个HashMap的问题	● Re: Question on mouse click on image in Jdk 1.1
● 这几年Java跟其他语言的差距拉大了。	● How do a Servlet sent a Applet?
● 问HashSet的问题？	● What version of the Java Development Kit (JDKTM)

相关话题的讨论汇总
话题: hashmap话题: map话题: collision话题: string话题: hash

进入Java版参与讨论

(共1页)

l******0
发帖数: 244

如果物理内存够，比如 8G，这么多 entry 的 map, 会有性能问题吗？ Key 就是
String.还是有分到几个小的 map 比较好？程序中，存到一个 map 里面更方便。
Map 比较合适的最大 size 是多少？

t*******e
发帖数: 684

key partition一下，用两层Maps.

c*********e
发帖数: 16335

存到hashtable里？能排序吗?

【在 l******0 的大作中提到】

: 如果物理内存够，比如 8G，这么多 entry 的 map, 会有性能问题吗？ Key 就是
: String.还是有分到几个小的 map 比较好？程序中，存到一个 map 里面更方便。
: Map 比较合适的最大 size 是多少？

w**z
发帖数: 8232

http://stackoverflow.com/questions/4333265/how-to-sort-a-java-h

【在 c*********e 的大作中提到】

: 存到hashtable里？能排序吗?

w**z
发帖数: 8232

1m doesn't seem to be large. do benchmark and tune it if it becomes a
problem.

【在 l******0 的大作中提到】

s******e
发帖数: 493

if possible, using distributed cache will provide you better QoS of
availability, scalability and performance.
If you have to store them in one single jvm, you'd better write your own
hash function by extending hashmap. The default implementation of hashmap is
not designed for the big data like that.

g*****g
发帖数: 34805

1M is nothing, just do it, don't even bother with benchmark.

c*********e
发帖数: 16335

linkedhashmap,排序，那时间复杂度就不是o(1)了吧？

【在 w**z 的大作中提到】

: http://stackoverflow.com/questions/4333265/how-to-sort-a-java-h

l******0
发帖数: 244

你是说分作两个 map?

【在 t*******e 的大作中提到】

: key partition一下，用两层Maps.

l******0
发帖数: 244

"The default implementation of hashmap is not designed for the big data like
that."
Any documentation or experiments on this point?

is

【在 s******e 的大作中提到】

: if possible, using distributed cache will provide you better QoS of
: availability, scalability and performance.
: If you have to store them in one single jvm, you'd better write your own
: hash function by extending hashmap. The default implementation of hashmap is
: not designed for the big data like that.

相关主题
● 来问两个HashMap的问题	● 云计算如何应用到传统的web server应用
● 这几年Java跟其他语言的差距拉大了。	● 有没有大牛在搞cloud？-- 包子贴
● 问HashSet的问题？	● 请问Hadoop要怎么学？
进入Java版参与讨论

l******0
发帖数: 244

我记得好像有次一个 map 里面的数目超过 20万，速度就变慢了，但我不肯定是不是还
有别的原因造成速度慢。
我的理解是，就像一个房子里面，房子越大，每次要从中找出一个人的时间就要更多？
hash bucket 越多，是不是 search 的时间就越长？但不知100 万会不会成为瓶颈。

【在 g*****g 的大作中提到】

: 1M is nothing, just do it, don't even bother with benchmark.

t*******e
发帖数: 684

When key space grows big，the probability of hash collision enlarges。Java
handles hash collision in a less-efficient manner, which could lead to
slowness.

。

【在 l******0 的大作中提到】

: 我记得好像有次一个 map 里面的数目超过 20万，速度就变慢了，但我不肯定是不是还
: 有别的原因造成速度慢。
: 我的理解是，就像一个房子里面，房子越大，每次要从中找出一个人的时间就要更多？
: hash bucket 越多，是不是 search 的时间就越长？但不知100 万会不会成为瓶颈。

t*******e
发帖数: 684

Figure out an algorithm to evenly distribute all the keys into 100 slots (
Maps). Each slot contains 10K records. Consistent Hash in BigTable might
give you some inspiration.

【在 l******0 的大作中提到】

: 你是说分作两个 map?

w**z
发帖数: 8232

I don't think you are smarter than the people who implemented Java string
hash function. don't over engineer things. if you think you have performance
issue, run some benchmark and see where is the bottle neck and go from
there. as I said, 1m is nothing nowadays .

【在 t*******e 的大作中提到】

: When key space grows big，the probability of hash collision enlarges。Java
: handles hash collision in a less-efficient manner, which could lead to
: slowness.
:
: 。

t*******e
发帖数: 684

http://openjdk.java.net/jeps/180
Obviously they are smarter than you.

performance

【在 w**z 的大作中提到】

: I don't think you are smarter than the people who implemented Java string
: hash function. don't over engineer things. if you think you have performance
: issue, run some benchmark and see where is the bottle neck and go from
: there. as I said, 1m is nothing nowadays .

w**z
发帖数: 8232

I am talking about String hash function, not handling of hash map collision
. by the way, if you run openjdk in production, then good luck. Of course
, they are smarter than me.

【在 t*******e 的大作中提到】

: http://openjdk.java.net/jeps/180
: Obviously they are smarter than you.
:
: performance

t*******e
发帖数: 684

Same issue with Oracle JDK. BTW, you sound pretty naive and emotional.
Please be professional not personal when discussing technical subjects.

collision
course

【在 w**z 的大作中提到】

: I am talking about String hash function, not handling of hash map collision
: . by the way, if you run openjdk in production, then good luck. Of course
: , they are smarter than me.

w**z
发帖数: 8232

which part is emotional and not professional?

【在 t*******e 的大作中提到】

: Same issue with Oracle JDK. BTW, you sound pretty naive and emotional.
: Please be professional not personal when discussing technical subjects.
:
: collision
: course

s******e
发帖数: 493

rehash and collision are two major issues.

like

【在 l******0 的大作中提到】

: "The default implementation of hashmap is not designed for the big data like
: that."
: Any documentation or experiments on this point?
:
: is

c*********e
发帖数: 16335

hashtable发生collision是常事，换个算法。

【在 s******e 的大作中提到】

: rehash and collision are two major issues.
:
: like

相关主题
● 开发前景光明啊	● How do a Servlet sent a Applet?
● 火车旅行家在中文WINDOWS下无法运行的大问题已被解决	● What version of the Java Development Kit (JDKTM)
● Re: Question on mouse click on image in Jdk 1.1	● Java向Linux迈出决定性一步
进入Java版参与讨论

s******e
发帖数: 493

read the context please. Do not jump into the topic too fast without knowing
the situation... we were talking about big number of string keys.

【在 c*********e 的大作中提到】

: hashtable发生collision是常事，换个算法。

t*******e
发帖数: 684

Judge yourself before you judge others.

【在 w**z 的大作中提到】

: which part is emotional and not professional?

w**z
发帖数: 8232

搞得那么悬呼？人用一个1m hashmap , 还big table, consistent hashing. 你phd
搞研究的吧？

【在 t*******e 的大作中提到】

: Judge yourself before you judge others.

g*****g
发帖数: 34805

1M hashmap，屁大的事情。我这几天把几个Oracle的表转到MySQL上，有一些合并和分
解的操作。
几个1-2M记录的表，都是直接放到内存里，用来做id->object lookup，完全没有压力
。这是在我8G的laptop上跑的，不是什么大服务器。

phd

【在 w**z 的大作中提到】

: 搞得那么悬呼？人用一个1m hashmap , 还big table, consistent hashing. 你phd
: 搞研究的吧？

s******e
发帖数: 493

well. seems that people have the different assumptions.
If the initial loading time is not counted and your string key is db id type
of string, then hashmap might be fine since both rehashing and collision
will not be issue any more.
But according to the original post, it seems that we are talking about
building the hashmap over the time on the fly with random and big strings as
the key. if so, rehashing and collision will be issues.

g*****g
发帖数: 34805

1M is such a small dataset it doesn't matter how it's built.
A 32 bits integer gives your 4 billion buckets, 1M is just 1/4000 of them,
you'll see a few collisions here and there, but not enough to cause
performance issue.

type
as

【在 s******e 的大作中提到】

: well. seems that people have the different assumptions.
: If the initial loading time is not counted and your string key is db id type
: of string, then hashmap might be fine since both rehashing and collision
: will not be issue any more.
: But according to the original post, it seems that we are talking about
: building the hashmap over the time on the fly with random and big strings as
: the key. if so, rehashing and collision will be issues.

s******e
发帖数: 493

you misunderstand the relationship between #buckets and collision. Having
enough buckets does not necessarily transfer into less collisions. As said,
db id type string (especially surrogate id), due to the constraint applied,
is more likely to be uniformly distributed than random big string keys.
rehashing is another issue because of the way how the default implementation
of hashmap works. if you build the map on the fly, I remember that somebody
reported after 200 thousand string key entries, rehashing could cause some
users to suffer quite a bit. Even not all the users will suffer, but it
should be avoided for any robust systems.

【在 g*****g 的大作中提到】

: 1M is such a small dataset it doesn't matter how it's built.
: A 32 bits integer gives your 4 billion buckets, 1M is just 1/4000 of them,
: you'll see a few collisions here and there, but not enough to cause
: performance issue.
:
: type
: as

g*****g
发帖数: 34805

No, you misunderstand. The number of available buckets is constant, the less
data, the less chance for collision, simple as that.
It is possible 1M data causes heavy collision? Possible, but unlikely for
most use cases. And when it's case, it's easy to tune the hashing function.
OP didn't mention any random big string to begin with, try a simple approach
and don't overengineer before you have to.

,
,
implementation
somebody
some

【在 s******e 的大作中提到】

: you misunderstand the relationship between #buckets and collision. Having
: enough buckets does not necessarily transfer into less collisions. As said,
: db id type string (especially surrogate id), due to the constraint applied,
: is more likely to be uniformly distributed than random big string keys.
: rehashing is another issue because of the way how the default implementation
: of hashmap works. if you build the map on the fly, I remember that somebody
: reported after 200 thousand string key entries, rehashing could cause some
: users to suffer quite a bit. Even not all the users will suffer, but it
: should be avoided for any robust systems.

w**z
发帖数: 8232

Java Hashmap implementation does a supplement hash on the hasCode caller
provides.
http://grepcode.com/file/repository.grepcode.com/java/root/jdk/
It's not likely you will have a lot of hash collisions for random String Key
.
Another way to avoid rehashing is to set a bigger initial size if you are
sure you will have 1M records.
I agree with goodbug, at least do some benchmarking before trying to tune
anything.

less
approach

【在 g*****g 的大作中提到】

: No, you misunderstand. The number of available buckets is constant, the less
: data, the less chance for collision, simple as that.
: It is possible 1M data causes heavy collision? Possible, but unlikely for
: most use cases. And when it's case, it's easy to tune the hashing function.
: OP didn't mention any random big string to begin with, try a simple approach
: and don't overengineer before you have to.
:
: ,
: ,
: implementation

l*******m
发帖数: 1096

hashmap will increase the memory allocation when the current allocation is
not large enough

Key

【在 w**z 的大作中提到】

: Java Hashmap implementation does a supplement hash on the hasCode caller
: provides.
: http://grepcode.com/file/repository.grepcode.com/java/root/jdk/
: It's not likely you will have a lot of hash collisions for random String Key
: .
: Another way to avoid rehashing is to set a bigger initial size if you are
: sure you will have 1M records.
: I agree with goodbug, at least do some benchmarking before trying to tune
: anything.
:

相关主题
● Re: Confused!	● JDK Source?
● Java's performance myth	● java知道一个reference怎么删掉它指向的内存空间？ (转载)
● 那个快？	● JDK HashMap 的 Entry class
进入Java版参与讨论

w**z
发帖数: 8232

At that time, rehashing happens... If you know your size in advance, just
set to a bigger number to avoid rehashing.

【在 l*******m 的大作中提到】

: hashmap will increase the memory allocation when the current allocation is
: not large enough
:
: Key

s******e
发帖数: 493

First, the number of buckets is never constant for the default
implementation of HashMap. That is why you have rehashing.
Second, the collision will depend on all three factors of the input, hash
function and the number of buckets. Even you can use all 64 bits of memory
for your bucket, you might still have heavy collision for a biased input.
That is why I said db id type string key might work fine for the default
hashmap. I believe in such a case, the default hash function will distribute
them uniformly.

less
approach

【在 g*****g 的大作中提到】

s******e
发帖数: 493

When you build your map dynamically, that is a big if. If you can preload, I
think even several million records should be fine if the input is not that
biased.

【在 w**z 的大作中提到】

: At that time, rehashing happens... If you know your size in advance, just
: set to a bigger number to avoid rehashing.

w**z
发帖数: 8232

That is true, it's the tradeoff between time and space. Only lz knows his
use case.

I
that

【在 s******e 的大作中提到】

: When you build your map dynamically, that is a big if. If you can preload, I
: think even several million records should be fine if the input is not that
: biased.

l******0
发帖数: 244

In my case,
1) The "keys" are English words or phrases, no repetitions.
2) I have to build this map in a static initialization function.
BTW, what do you mean by 'preload', is it an initialization?

I
that

【在 s******e 的大作中提到】

: When you build your map dynamically, that is a big if. If you can preload, I
: think even several million records should be fine if the input is not that
: biased.

l******0
发帖数: 244

Sounds like good experiments. I will use one map to try.

【在 g*****g 的大作中提到】

: 1M hashmap，屁大的事情。我这几天把几个Oracle的表转到MySQL上，有一些合并和分
: 解的操作。
: 几个1-2M记录的表，都是直接放到内存里，用来做id->object lookup，完全没有压力
: 。这是在我8G的laptop上跑的，不是什么大服务器。
:
: phd

(共1页)

进入Java版参与讨论

相关主题
● What version of the Java Development Kit (JDKTM)	● answer Re: how HashMap/Hashtable compare key?
● Java向Linux迈出决定性一步	● 来问两个HashMap的问题
● Re: Confused!	● 这几年Java跟其他语言的差距拉大了。
● Java's performance myth	● 问HashSet的问题？
● 那个快？	● 云计算如何应用到传统的web server应用
● JDK Source?	● 有没有大牛在搞cloud？-- 包子贴
● java知道一个reference怎么删掉它指向的内存空间？ (转载)	● 请问Hadoop要怎么学？
● JDK HashMap 的 Entry class	● 开发前景光明啊

相关话题的讨论汇总
话题: hashmap话题: map话题: collision话题: string话题: hash

#	版面	帖数(主题数)
-	全站	4871 (796)
1	Military	3777 (569)
2	Stock	341 (51)
3	Joke	117 (17)
4	History	116 (3)
5	Automobile	100 (9)
6	USANews	55 (9)
7	Midlife	45 (1)
8	Headline	41 (41)
9	Dreamer	33 (13)
10	FleaMarket	32 (20)
11	Living	30 (7)

boards

未名新帖统计// 7月16日

历史上的今天