Roundup of discussions about clustering - 话题女王


j*******e
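Posts: 529

From thread: Faculty board - Choosing a cluster

I'd like to ask those of you doing computational work how you chose an HPC cluster.
I currently have two options.
The first is to set up a standalone HPC cluster. The department's IT director can help order the
cluster and related software at the university's price, but I would have to handle all maintenance and administration myself.
The second: the department is expanding its existing HPC cluster. If I put the same budget into that
expansion, we would get a somewhat better price than buying two separate clusters, and I would get
highest-priority access to 160 cores of the new cluster. The upside is that I would not have to do any maintenance or administration; the department's dedicated staff would handle it.
The downside is that if the cluster is very crowded, my jobs would still have to queue. Also, the department cluster's configuration
is somewhat overkill for my application; with my own standalone cluster I could probably afford more cores.
Does anyone have relevant experience?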

X****i
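Posts: 1877

From thread: Piebridge board - After observing many people, I noticed a phenomenon; let's call it a "Trait Cluster". (repost)

[The following is reposted from the Military board]
Posted by: XiuShi (致力为花街散财，造福散户), Board: Military
Title: After observing many people, I noticed a phenomenon; let's call it a "Trait Cluster".
Posted at: BBS 未名空间站 (Fri Oct 27 17:11:23 2017, US Eastern)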

X****i
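Posts: 1877

From thread: Military board - After observing many people, I noticed a phenomenon; let's call it a "Trait Cluster".

The term "Trait Cluster" does not seem to be in use yet.
It seems apt to me, so let's count it as a term I coined.
The benefit of the concept is that it makes judging and predicting people more efficient.
Judgments and predictions inherently carry error, so they still have to be verified.
Also, because the concept links traits together, it helps reduce omissions.
Because it is a generalization, the few exceptions have to be considered.
It is efficient because once you know one trait, you also know its cluster.
That prompts the observer to watch for the other traits in the same cluster.
For example, suppose we set up a vanity-related Trait Cluster #1:
Vain people also tend to be intensely jealous,
so
Cluster#1 = {vanity, jealousy}
But jealousy and selfishness also usually occur together, so
Cluster#1 = {vanity, jealousy, selfishness}
Since intensely jealous men and women often do unconscionable things, that is, they are ruthless,
Cluster#1 = {vanity, jealousy, selfishness, ruthlessness}
Another example:
Cluster#2 = {baseness, ...}
Since base people also tend to be foolish,
Cluster#2 = {baseness, stupidity}
The next step would be to quantify these factors.
How is it applied?
For example, when you meet an extremely vain person, be careful: he or she
may have the other traits in Cluster#1...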

S*******e
Posts: 525

From thread: Programming board - A question about upgrading a Hadoop cluster

How do you upgrade the Hadoop clusters you use? My question is below.
Rolling Upgrade Hadoop Cluster Question
In our company, one of the main Hadoop clusters (HDP) has about 600 nodes. It
is upgraded almost monthly, plus some other maintenance. Each time, this
takes from hours to a couple of days, and all apps running on it have to be shut
off. I just cannot imagine that clusters performing such important work in
other companies get interrupted so often and for so long. I asked why
we don't do rolling upgrades. Here is one of the main architect's...

f*******e
Posts: 3433

From thread: Parenting board - CDC releases preliminary findings on Palo Alto suicide clusters

In light of the recent suicides of several Palo Alto teens, the Centers for
Disease Control and Prevention (CDC) began an epidemiological study in
February 2016 that investigated previous youth suicide clusters. Last week,
the CDC released preliminary findings of their study, which revealed that
mental health problems, recent crises and problems at school were major
factors in the suicides of the 232 youths throughout Santa Clara County the
CDC investigated.
The CDC’s research revealed that 46 pe...

a***n
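Posts: 404

From thread: CS board - Is there a clustering algorithm like this?

One that can cluster linearly ordered data. For example, take the sequence:
1, 2, 3, 44, 5, 6, 7, 8, 101, 102, 103, 144, 105, 106, 107, 108.
Is there something like a hierarchical clustering algorithm for this? For the sequence above, if it is split into two,
the first 8 numbers and the last 8 numbers obviously form two clusters, even though 44 and 144 are
outliers within them. But if it is split into 4 clusters, 44 and 144 can each become a cluster of their own.
Is there a clustering algorithm for linear data like this? That is, the result must not jump around: the clusters must not
cross each other.
Thanks!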
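
A minimal sketch of one way to get clusters that never cross, assuming the goal is to cut the sequence into k contiguous segments with the smallest total within-segment squared error. This is a standard dynamic-programming segmentation, not a method named in the thread, and the function name `segment` is hypothetical.

```python
def segment(values, k):
    """Split `values` into k contiguous segments minimizing total squared error."""
    n = len(values)
    # Prefix sums of v and v*v give the squared error of any slice in O(1).
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, v in enumerate(values):
        s1[i + 1] = s1[i] + v
        s2[i + 1] = s2[i] + v * v

    def sse(i, j):
        """Sum of squared deviations from the mean for values[i:j] (j > i)."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # cost[m][j]: best cost of splitting values[:j] into m segments.
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                c = cost[m - 1][i] + sse(i, j)
                if c < cost[m][j]:
                    cost[m][j] = c
                    cut[m][j] = i

    # Walk the stored cut points backwards to recover the segments.
    segments, j = [], n
    for m in range(k, 0, -1):
        i = cut[m][j]
        segments.append(values[i:j])
        j = i
    return segments[::-1]


data = [1, 2, 3, 44, 5, 6, 7, 8, 101, 102, 103, 144, 105, 106, 107, 108]
print(segment(data, 2))
print(segment(data, 6))
```

With k=2 the cut falls between 8 and 101. Under strict contiguity, isolating both 44 and 144 takes six segments, because each outlier also splits its neighbours into a left and a right run.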

s*****t
Posts: 1994

From thread: Astronomy board - Astronomy Picture of Day: star cluster in motion

A Star Cluster in Motion
Credit: Adam Block (NOAO)
Explanation: Star clusters are a swarm of complex motions. The stars that compose globular
clusters and many open clusters all orbit the cluster center, occasionally interacting,
gravitationally, with a close-passing star. The orbits of stars around the cluster are typically not
as circular as the orbits of planets in our solar system. Cluster stars frequently fall more
directly toward the center and many times trace out un

f*******a
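Posts: 663

From thread: DataSciences board - Analysis and testing of the new clustering algorithm published in Science

I forgot to post the code at first; some friends asked for it, so here is the modified code. The changes are few and improve
efficiency somewhat. The original lines are still there, commented out. For reference.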
=========================================================================
clear;
close all
disp('The only input needed is a distance matrix file')
disp('The format of this file should be: ')
disp('Column 1: id of element i')
disp('Column 2: id of element j')
disp('Column 3: dist(i,j)')
if(0)
% mdist=input('name of the distance matrix file (with single quotes)?\n');
mdist = 'example_distances.dat';

disp('Reading input dis...

t**e
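Posts: 71

From thread: Automobile board - Venting about my Audi TT instrument cluster repair experience

Back in China I drove a Santana. When I first came to the US I didn't think much and just bought a Passat. I replaced the
clutch myself, and tried, together with a fake expert, to fix the water pump and timing belt, without success (in the end I paid a real expert to fix it).
Three years ago I switched to an Audi TT, a 2001 model. The price looked good ($7,000), it looked fairly new, the mileage was low, and the
timing belt would not need replacing for two years, so I bought it.
Three months ago the nightmare began: the instrument cluster kept failing to light up. At first I thought it was a minor issue, since it came on after 10 minutes of driving.
But that number slowly went from 10 minutes to 20, 30, 40, 50... Eventually I was simply driving without knowing my
speed or how much fuel was left, going entirely by guesswork, damn it!
I looked it up online, and it serves me right for not doing my homework: the 2001 TT is notorious for its bad instrument
cluster.
I went to every high-end specialist shop and dealer near home and they all gave me the same answer: $1,200 for remanufactured parts
plus $500 labor, which sent my adrenaline through the roof.
Necessity is the mother of invention. The internet is powerful after all, and I eventually got a new idea from the veterans here: go to
eBay, where I normally only shop for tea... Sure enough, I found a guy in Michigan with loads of instrument clusters on hand.
...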

m****s
Posts: 18160

From thread: JobMarket board - Hiring a cluster manager. (repost)

[The following is reposted from the JobHunting board]
Posted by: Nighthawk (Nighthawk), Board: JobHunting
Title: Hiring a cluster manager. (repost)
Posted at: BBS 未名空间站 (Thu Jan 31 22:18:28 2013, US Eastern)
Posted by: Nighthawk (Nighthawk), Board: Physics
Title: Hiring a cluster manager.
Posted at: BBS 未名空间站 (Thu Jan 31 22:16:37 2013, US Eastern)
If interested, please send a site message. Salary $50-55k, with room for career advancement.
Job Description
Title: Linux System Administrator
Job Description: The Wake Forest School of Medicine (WFSM) high performance
computation facility (HPC) is looking for an intermediate level Linux System
Administrat...

r*t
Posts: 34

From thread: BuildingWeb board - puzzles about load balancing and failover of JBoss cluster (repost)

[The following is reposted from the Java board]
Posted by: rgt (一脸无辜), Board: Java
Title: puzzles about load balancing and failover of JBoss cluster
Posted at: BBS 未名空间站 (Tue Jul 26 09:08:10 2005)
I just went through the book 'jboss clustering', but still have several
questions related to the load balancing and session bean failover capabilities
of a JBoss cluster.
1) It seems that a JBoss cluster cannot balance running processes. For example, I
have a two-node (A and B) cluster (my EJB running on both with clustered
configuration), and three sa

y****9
Posts: 144

From thread: Database board - Doubts about clustered index

As far as I know, there are the following types of tables in Oracle:
- Heap organized tables (the default; > 99% of tables are this type in Oracle
applications. In fact, I have not yet seen any of my production databases use any
other type of table, except for temporary tables. I use them, but developers
at my workplace have no idea about temporary tables)
- Index organized tables (IOT)
- Index clustered tables
- Hash clustered tables
- Sorted hash clustered tables
- Nested tables
- temporary tables
- Object...

E******T
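Posts: 59

From thread: Biology board - About my own research: applying gene modules to biomedical cancer classification (clustering)

Applying gene modules to biomedical cancer classification (clustering) and to identifying bioactivity markers
Classifying (clustering) biomedical samples is extremely complex, for several reasons. Consider the makeup of clinical samples:
leukemia patients, for example, differ in age, sex, cancer stage (I, II, III), medication, cancer type (AML,
ALL), and so on. Depending on the criterion, patient samples can be divided into different classes (clusterings).
At a deeper level, from a biological standpoint the different classes are caused by different genes and signaling pathways. If
we can find the pathways corresponding to these classes, the corresponding classification can be discovered. For example, the
signaling pathways of AML, one type of leukemia, differ from those of another type, ALL, so these genes can be used to divide
leukemia into AML and ALL. Likewise, if we can find signaling pathways that differ between cancer stages, leukemia can be divided into stages I,
II, and III. In everyday research, however, these specific classifications are not very clear.
In most cases only one of them is known, for example the AML/ALL classification in leukemia,
and other information is hard to obtain. So we used the idea of unsupervised learning to study this
problem.
From a biological standpoint, if a pat...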

E**********e
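Posts: 1736

From thread: DataSciences board - Has anyone bought their own servers and set up a few clusters to run Hadoop on big data?

My own computer has only 6 GB of RAM. I plan to upgrade to 32 GB soon (this is a must). The reason I'm considering building
multiple clusters is that datasets these days easily run to tens of GB, and one computer is nowhere near enough. Of course, the data in my
usual practice projects isn't large, but that's not the point. The point is that I want to learn big data analysis and may
move toward a data scientist role later. Besides, data analysis positions nowadays routinely ask for machine
learning and big data tools such as Hadoop and Spark. I'd like to buy a few servers and build
multiple clusters so I can practice directly and learn big data properly.
Amazon's AWS is good, but not very flexible. Using it to run projects once I've learned the ropes is another
matter.
Do you mean a single computer can use VMs to set up multiple clusters or instances to run Hadoop? I've
already installed a single-cluster Hadoop using a VM, running Hadoop on Ubuntu. It's quite
fun. But a single cluster doesn't show off Hadoop's strengths, and I don't know whether the Python
code I write would run on Hadoop on real multiple clusters. Could you recommend...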

N*******k
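Posts: 43

From thread: JobHunting board - Hiring a cluster manager. (repost)

[The following is reposted from the Physics board]
Posted by: Nighthawk (Nighthawk), Board: Physics
Title: Hiring a cluster manager.
Posted at: BBS 未名空间站 (Thu Jan 31 22:16:37 2013, US Eastern)
If interested, please send a site message. Salary $50-55k, with room for career advancement.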
Job Description
Title: Linux System Administrator
Job Description: The Wake Forest School of Medicine (WFSM) high performance
computation facility (HPC) is looking for an intermediate level Linux System
Administrator for an immediate fulltime position of Linux cluster
maintenance and HPC user support.
Benefits: Competitive salary, vac...

l**n
Posts: 7272

From thread: WashingtonDC board - FYI: Survey for Wootton cluster parents

It can be filled out anonymously.
==========================
From: Tony Lam
Subject: [GTAletters] 2.0 survey needs your help ASSP
To: G********[email protected], "'Parents Coalition'" yahoogroups.com>
Date: Wednesday, February 13, 2013, 4:30 PM
Dear All,
Wootton cluster has a once every 3 years meeting with the board on 2/21.
Please forward the following survey to our cluster parents ASAP. Our cluster
representatives need to have the numbers by 2/16. Here is the link. Thanks!
ht...

w********c
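Posts: 2632

From thread: CS board - [Compilation] Difference between fuzzy clustering and soft clustering?

☆─────────────────────────────────────☆
arson (笨也要活着) wrote at (Wed Oct 31 15:29:24 2007):
After reading for a while, it seems that in both, one item can belong to multiple clusters. What other difference is there?
Is it that in soft clustering an item belongs to several clusters, whereas in fuzzy clustering an item always belongs, to varying degrees,
to all the other clusters?
Thanks.
☆─────────────────────────────────────☆
sonyisme (偶静感乖类 :)) wrote at (Wed Oct 31 15:30:15 2007):
They seem to be the same thing, just under different names:
fuzzy/soft/probabilistic/partial

☆─────────────────────────────────────☆
arson (笨也要活着) wrote at (Wed Oct 31 15:33:59 2007):
Does this field have any future? I'm looking everywhere for a research direction right now, rrdw~~
Also, are there any good reviews, or leading researchers to follow?
Many thanks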

M*****r
Posts: 1536

From thread: Database board - MySQL 5.0 cluster question

MySQL 5.0 cluster is said to be:
"In MySQL-5.0, Cluster is in-memory only. This means that all table data (
including indexes) is stored in RAM. Therefore, if your data takes up 1GB of
space and you want to replicate it once in the cluster, you need 2GB of
memory to do so (1 GB per replica). This is in addition to the memory
required by the operating system and any applications running on the cluster
computers. "
http://dev.mysql.com/doc/refman/5.0/en/faqs-mysql-cluster.html
What does that mean?

y****9
Posts: 144

From thread: Database board - Doubts about clustered index

@vbitter,
Further reading ...
In SQL server, if a table does not contain a clustered index, when we create
a primary key constraint, SQL server will use the primary key column for
the clustered index key.
So I think we all agree, in most cases, a table should have a primary key,
no matter in oracle or in sql server.
I checked the AdventureWorks sample db in SQL Server and saw that > 99% of tables have a
clustered index, and in most cases it is also the PK.
So I guess the rule of thumb is likely true in sql server.

clustered...

r*t
Posts: 34

From thread: Java board - puzzles about load balancing and failover of JBoss cluster

I just went through the book 'jboss clustering', but still have several
questions related to the load balancing and session bean failover capabilities
of a JBoss cluster.
1) It seems that a JBoss cluster cannot balance running processes. For example, I
have a two-node (A and B) cluster (my EJB running on both with clustered
configuration), and three identical client requests from a third computer C are
distributed with the "Round-Robin" policy, two to node A and one to node B. If I kill
node A (Ctrl+c), all processes

b****t
Posts: 114

From thread: Linux board - build a small cluster, suggestions/comments please... (repost)

[The following is reposted from the EmergingNetworking board]
Posted by: bbbeet (beet), Board: EmergingNetworking
Title: build a small cluster, suggestions/comments please...
Posted at: BBS 未名空间站 (Sun Apr 8 16:49:18 2012, US Eastern)
First of all, I am a novice in CS/Network, please offer advice ...
I need to build a small cluster (~7 nodes) for parallel simulation computing.
My first thought about this project is just to buy 7 PCs with decent
configurations (e.g. Dell computers $700/each). And then install OS and
clustering software etc to co...

s*****t
Posts: 1994

From thread: Astronomy board - Astronomy Picture of Day: Center of Virgo Cluster

In the Center of the Virgo Cluster
Credit & Copyright: Jean-Charles Cuillandre (CFHT), Hawaiian Starlight, CFHT
Explanation: The Virgo Cluster of Galaxies is the closest cluster of galaxies to our Milky
Way Galaxy. The Virgo Cluster is so close that it spans more than 5 degrees on the sky - about
10 times the angle made by a full Moon. It contains over 100 galaxies of many types -
including spiral, elliptical, and irregular galaxies. The Virgo Cluster is so massive that it is
noticeably pull

f*****r
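Posts: 138

From thread: Computation board - Which cluster performs better?

I do CFD computation. My company plans to upgrade its computing resources from a server to a small cluster. I asked two
cluster vendors, and their quoted configurations are as follows (for the same budget):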
Cluster 1:
8 compute nodes @ 128 cores
Processor: Intel XEON e5-2667 v4 CPU (3.2GHz/3.6GHz HT) (16 X 8c)
Chassis: 4 x 1U x 2 Node chassis (1000 Watt power supply)
Memory: 8 x 128 GB DDR4-2400MHz LRDIMM (8 X 16 GB)
OS Data: 8 x 2 x 1TB SATA 7200 RPM 6Gbp/s
Interconnect: 8 x 56 Gbps Infiniband
Interconnect Switch: 12 port Mellanox FDR switch
Operating System: CENTOS 7
Cluster 2:
8 compute nodes @ 192 cores
Processor: Inte...

f*******i
Posts: 8492

From thread: Statistics board - Waiting online: a question about the output of SAS's cluster procedure

data homework6_1;
set dansas.homework6_1x;
run;
proc cluster data=homework6_1 method=centroid;
id id;
title 'Hierarchical Cluster Analysis';
proc tree out = cluster nclusters=4;
Id id;
proc sort;
by cluster;
proc print;
by cluster;
run;
======================================
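It's a simple homework problem, and I don't think it has anything to do with the data; only the number of variables differs.
Yet the current results show completely different output item names and counts.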

c********h
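Posts: 330

From thread: Statistics board - data clustering by vector correlation distance

Clustering isn't limited to one dimension. You can use all kinds of clustering methods for this; k-means and
mixture-model EM both work in multiple dimensions.
If you want to use correlation as a distance, you can use hierarchical clustering, which
lets you specify the distance yourself.
Every one of these clustering methods also lets you specify the number of clusters.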
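
A minimal sketch of the suggestion above, assuming "correlation as a distance" means 1 minus the Pearson correlation and assuming average linkage (the post names neither). The function names are hypothetical, and constant vectors are not handled because their correlation is undefined.

```python
from math import sqrt


def correlation_distance(u, v):
    """1 - Pearson correlation: near 0 if perfectly correlated, 2 if anti-correlated."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sqrt(sum((a - mu) ** 2 for a in u))
    sv = sqrt(sum((b - mv) ** 2 for b in v))
    return 1.0 - cov / (su * sv)


def agglomerate(vectors, n_clusters, dist=correlation_distance):
    """Average-linkage agglomerative clustering; returns lists of vector indices."""
    clusters = [[i] for i in range(len(vectors))]
    # Precompute all pairwise distances once.
    d = [[dist(a, b) for b in vectors] for a in vectors]

    def linkage(c1, c2):
        # Average distance over all cross-cluster pairs.
        return sum(d[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Find and merge the closest pair of clusters.
        a, b = min(
            ((p, q) for p in range(len(clusters)) for q in range(p + 1, len(clusters))),
            key=lambda pq: linkage(clusters[pq[0]], clusters[pq[1]]),
        )
        clusters[a] += clusters[b]
        del clusters[b]
    return clusters


vectors = [
    [1, 2, 3, 4],
    [2, 4, 6, 8],      # same shape as the first vector, different scale
    [4, 3, 2, 1],
    [8, 6, 4, 2.5],    # roughly the same shape as the third vector
]
print(agglomerate(vectors, 2))
```

Because the distance ignores scale, the first two vectors land in one cluster and the last two in the other, which Euclidean distance would not guarantee.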

t********6
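Posts: 43

From thread: Statistics board - cluster effect in case control study

In a case-control study, all the cases were recruited and the controls were randomly selected. But the population contains
many clusters (e.g. families), and the cluster is known to be a confounder. Is there a good way to adjust for this?
GEE doesn't seem to work, because it doesn't account for the sampling bias caused by the clusters.
I'm currently using inverse probability weighting: cases get weight = 1 (since all of them were sampled),
and controls get weight 1 / (number of controls sampled in the cluster / total number of controls in the
cluster). Then I use SAS's surveyreg.
I wonder whether there is a better method.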
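
A minimal sketch of the weighting rule described above, assuming each record is a (cluster id, is_case) pair and that the total number of controls per cluster in the population is known. The function name and data layout are hypothetical; this only computes the weights and does not reproduce the surveyreg step.

```python
def sampling_weights(records, controls_in_population):
    """Return one inverse-probability weight per record.

    records: list of (cluster_id, is_case) tuples for the sampled subjects.
    controls_in_population: dict mapping cluster_id to the total number of
    controls that cluster has in the population.
    """
    # Count how many controls were actually sampled from each cluster.
    sampled = {}
    for cluster_id, is_case in records:
        if not is_case:
            sampled[cluster_id] = sampled.get(cluster_id, 0) + 1

    weights = []
    for cluster_id, is_case in records:
        if is_case:
            # Every case was recruited, so its sampling probability is 1.
            weights.append(1.0)
        else:
            # weight = 1 / (sampled controls / total controls) in this cluster.
            weights.append(controls_in_population[cluster_id] / sampled[cluster_id])
    return weights


records = [("f1", True), ("f1", False), ("f1", False), ("f2", False)]
population = {"f1": 10, "f2": 4}
print(sampling_weights(records, population))
```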

r*****y
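Posts: 507

From thread: CS board - A clustering question

I want to cluster some two-dimensional signal segments (e.g. with K-means clustering).
A signal segment looks roughly like this:
[0.2 0.4]' [ 0.25 0.34]' ... [ 0.89 0.57]'
The problem is that the segments differ in length; some are long and some are short.
So they first have to be resampled (to make the lengths equal) before clustering.
The approach I've come up with is uniform sampling (say, sampling each segment
evenly at 10 positions), so that each
segment becomes a vector of 2*10=20 values. But is this approach sound?
Does anyone have a better way to do clustering in this situation? Thanks.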
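
A minimal sketch of the uniform resampling the post describes, assuming linear interpolation between neighbouring samples (the post does not say how values between samples are obtained). The function name `resample` is hypothetical, and `n_points` is assumed to be at least 2.

```python
def resample(segment, n_points=10):
    """Resample a list of (x, y) samples at n_points evenly spaced positions.

    Returns a flat feature vector [x0, y0, x1, y1, ...] of length 2 * n_points.
    """
    if len(segment) == 1:
        # A single sample cannot be interpolated; repeat it.
        return [c for _ in range(n_points) for c in segment[0]]
    last = len(segment) - 1
    features = []
    for k in range(n_points):
        pos = k * last / (n_points - 1)   # fractional index into the segment
        i = min(int(pos), last - 1)       # left neighbour, clamped at the end
        t = pos - i                       # interpolation weight in [0, 1]
        (x0, y0), (x1, y1) = segment[i], segment[i + 1]
        features += [x0 + t * (x1 - x0), y0 + t * (y1 - y0)]
    return features


short = [(0.0, 0.0), (1.0, 2.0)]
longer = [(0.2, 0.4), (0.25, 0.34), (0.5, 0.5), (0.7, 0.6), (0.89, 0.57)]
print(resample(short, 3))
print(len(resample(longer)))
```

Segments of any length then map to vectors of the same size, which is what k-means needs. Whether equal spacing along the sample index is the right notion of "uniform" for a given signal is a separate modelling choice.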

a***n
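Posts: 404

From thread: CS board - Any open-source clustering software?

Have you used CLUTO? Do you know how long it takes to cluster on the order of 100,000 documents?
(1)
K-means seems to require setting K in advance. If I don't know how many classes the documents will end up in, does that
mean I can't use k-means?
(2)
In the results produced by CLUTO or other clustering software, can a document belong to multiple clusters?
Or can a document be left out of every cluster? rrdw. Just want to confirm. :)
Thanks!!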

a***n
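Posts: 404

From thread: CS board - Is there a clustering algorithm like this?

k-means requires specifying the number of clusters in advance. Is there a hierarchical algorithm?
Also, all that is known here is the distance between each data point and the others.
But the order of the data points is fixed. Put simply, given a stretch of spectrum, I want to segment the spectrum
by color similarity.
But the number of segments is unknown, and when the segments are fairly large, a segment may contain small stretches of spectrum
that are anomalous; the segmentation should still come out right in that case.
I don't think k-means can handle this, even with the distances between the colors known.
This may be similar to spatial clustering; I'd expect spatial clustering not to regroup points arbitrarily.
I wonder whether anyone works on this, and whether there is a good algorithm for clustering spectra like this.
Thanks.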

e****f
Posts: 296

From thread: CS board - how to find a cluster

I have a question and hope someone smart here can help me out.
I have a binary 2-dimensional image, say 16 by 16 pixels. The pixel values
are either 0 or 1 and are randomly distributed. We define a cluster as: for
those non-zero pixels, if they share at least one side with each other, they
form a cluster of connectivity-1. What is a simple algorithm to search
for these clusters and find the largest one (largest number of pixels in one
cluster)?
Your help is highly appreciated!
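
A minimal sketch of one common answer, assuming the image is a list of rows of 0/1 values and that "sharing a side" means 4-connectivity. It labels connected components with a breadth-first flood fill; the function name `largest_cluster` is hypothetical.

```python
from collections import deque


def largest_cluster(image):
    """Return the pixel coordinates (row, col) of the largest 4-connected cluster."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    best = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] == 0 or seen[r][c]:
                continue
            # Start a new cluster and flood-fill it breadth-first.
            seen[r][c] = True
            queue, cluster = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                cluster.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and image[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(cluster) > len(best):
                best = cluster
    return best


image = [
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
    [1, 0, 1, 1],
]
print(sorted(largest_cluster(image)))
```

Each pixel is visited once, so the cost is linear in the number of pixels, which is negligible for a 16 by 16 image.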

x*****o
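Posts: 23

From thread: CS board - build a small cluster, help

My lab is setting up a small cluster of 10+ machines, mainly for training on some data.
I don't know much about HPC or clusters. I just installed the system with Rocks Cluster and SGE, but I don't know
how best to configure the cluster. Does anyone here work on this and could recommend a book or a link? Thanks!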

v*****r
Posts: 1119

From thread: Database board - Doubts about clustered index

Very good blog, but the blogger didn't actually answer the question he
raised:
"So how is it you can have good perf in Oracle, w/o IOTs, but in SQL Server,
everyone says need CIs for good perf?"
My answer is Oracle simply has more options to achieve the same that
SQLServer clustered-index could achieve.
Let's first think about what kind of table will benefit most from using
clustered-index/IOT. One typical example is table storing time series data
will benefit a lot. Using clustered-index/IOT ...

b****t
Posts: 114

From thread: EmergingNetworking board - build a small cluster, suggestions/comments please...

First of all, I am a novice in CS/Network, please offer advice ...
I need to build a small cluster (~7 nodes) for parallel simulation computing.
My first thought about this project is just to buy 7 PCs with decent
configurations (e.g. Dell computers $700/each). And then install OS and
clustering software etc to configure.
This cluster will be mainly used for my research and teaching. So ideally it
would allow 20 or more users to login at the same time. The computational
work is typically medium...

d*****0
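Posts: 68029

From thread: Hardware board - [Compilation] Why is my single machine faster than the cluster?

☆─────────────────────────────────────☆
ggplay (dfdsf) wrote at (Wed Jan 27 10:38:04 2010, US Eastern):
The cluster has dual-core Xeons from a few years ago, clocked under 3 GHz. My single machine is a Q6600, not overclocked. My
result came out one minute faster than the cluster's, with 12 threads launched on the cluster and 3 locally.
For single-threaded programs my machine has an even more decisive edge, roughly 50% faster. WTF?
☆─────────────────────────────────────☆
larrabee (larrabee) wrote at (Wed Jan 27 10:39:47 2010, US Eastern):
How many nodes in the cluster?

☆─────────────────────────────────────☆
daye520 (AGOG) wrote at (Wed Jan 27 10:41:12 2010, US Eastern):
Pentium D?
☆─────────────────────────────────────☆
tyning (全副*你上过了么？) wrote at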

S*A
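Posts: 7142

From thread: Linux board - Why is sort much faster on a single server than on a cluster

How is the sort implemented on the cluster? If you use something like MPI
that passes data around frequently, of course it's slow.
Passing data between cluster nodes has much higher latency than local memory access.
A cluster is a poor fit for data with many dependencies.
A cluster suits data that can be cut into small chunks, each of which can be processed independently.
This sort is a better fit for map reduce.
Split the 80G into N small parts and give them to N nodes to sort separately.
Then the merge is O(n).
Sending 80G out over the network and receiving it back means 160G of data, which takes a lot of time.
So a single machine may well be faster.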
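
A minimal single-process sketch of the split, sort, and merge pattern described above. The per-chunk sorts stand in for work that would run on separate nodes; no network transfer is modelled, and the function name `distributed_sort` is hypothetical.

```python
import heapq


def distributed_sort(data, n_nodes):
    """Sort `data` by splitting it into n_nodes chunks, sorting each, then merging."""
    # Ceiling division so that at most n_nodes chunks are produced.
    size = max(1, -(-len(data) // n_nodes))
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # On a real cluster each chunk would be sorted on its own node.
    sorted_chunks = [sorted(chunk) for chunk in chunks]
    # k-way merge of the sorted runs.
    return list(heapq.merge(*sorted_chunks))


print(distributed_sort([5, 3, 9, 1, 7, 2, 8], 3))
```

`heapq.merge` keeps a heap of the current head of each run, so merging N runs costs O(n log N) comparisons rather than strictly O(n); for a small number of nodes that is close to linear. The post's main point still holds: shipping the data to the nodes and back can dominate the total time.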

d*******r
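Posts: 3299

From thread: Programming board - Redis Cluster beta -- Redis 3.0 beta

Redis Cluster seems to be in its second round of beta
http://redis.io/download
It should be usable in the next stable release, Redis 3.0
Here is a fairly concise explanation of how it works:
http://www.chinahadoop.cn/course/31/learn#lesson/228
Also, a question for 二爷 while I'm at it: if you implement the Redis Cluster logic yourselves on the client side (
e.g. when inserting data, hashing it yourselves to get the ID of the node to insert into), how do you repair a dead node? For example,
say Redis Cluster has nodes A, B, C. If A dies, Redis Cluster itself recovers node A's data from the relevant
slave node, and requests that node A should have answered also need to be redirected.
Do you do all of that repair work yourselves?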
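
A minimal sketch of what client-side sharding can look like, assuming a simple "hash the key, take it modulo the node count" scheme. This is a hypothetical illustration, not the scheme the person being asked actually uses, and it does not talk to a Redis server.

```python
import zlib


def node_for_key(key, nodes):
    """Pick a node by hashing the key (hypothetical client-side scheme)."""
    return nodes[zlib.crc32(key.encode("utf-8")) % len(nodes)]


nodes = ["A", "B", "C"]
keys = [f"user:{i}" for i in range(1000)]
before = {k: node_for_key(k, nodes) for k in keys}

# Simulate node A dying: the client drops it from its node list.
survivors = [n for n in nodes if n != "A"]
after = {k: node_for_key(k, survivors) for k in keys}

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved} of {len(keys)} keys now map to a different node")
```

With plain modulo hashing, keys that lived on the surviving nodes can also be remapped when the node list changes, so the client has to handle both redirection and data recovery itself. Redis Cluster avoids this by mapping keys to a fixed set of 16384 hash slots and promoting a replica of the failed node.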

s*****t
Posts: 1994

From thread: Astronomy board - Astronomy Picture of Day: Cluster M38

Open Star Cluster M38
Credit & Copyright: NOAO, AURA, NSF
Explanation: Open cluster M38 can be seen with binoculars toward the constellation of
Auriga. M38 is considered an intermediately rich open cluster of stars, each of which is about
200 million years old. Located in the disk of our Milky Way galaxy, M38 is still young enough
to house many bright blue stars, although its brightest star is a yellow giant shining 900 times
brighter than our Sun. The cluster spans roughly 25

t*d
Posts: 1290

From thread: Biology board - How can I install program on bioinfor cluster?

A machine that doesn't even have blast can call itself a bioinformatics cluster?
If you can find the blast source code for SunOS5.9, you can install it yourself.

to install BLAST and Mira on the cluster. The staff who maintains the
bioinformatic cluster told me it's very difficult to install programs in the
cluster (SunOS5.9). He is busy and doesn't want to help.

x*****u
Posts: 3419

From thread: Computation board - clustering by openMosix

http://www.unixreview.com/print/

Checkpointing and Distributed Shared Memory in openMosix April 2004
by Mulyadi Santosa
One way to build a cluster is with off-the-shelf hardware, particularly IBM
PC-compatible with an x86 processor. Linux clusters (utilizing Linux and other
open source tools) are increasingly popular for migration from an existing
cluster or for creating new ones. OpenMosix is one open source clustering
middleware, and two new modules have entered the scene: a distributed sh

l******9
Posts: 579

From thread: Mathematics board - data clustering by vector correlation distance (repost)

[The following is reposted from the Statistics board]
Posted by: light009 (light009), Board: Statistics
Title: data clustering by vector correlation distance
Posted at: BBS 未名空间站 (Wed Feb 26 11:17:21 2014, US Eastern)
I am working on data analysis.
Given a group of data vectors, each of them has the same dimension. Each
element in a vector is a floating point number.
V1 [ , , , … ]
V2[ , , , … ]
...
Vn [ , , , … ]
Suppose that each vector has M numbers. M can be 10000.
n can be 200.
I need to find out how to partition the n vector...

s****l
Posts: 10462

From thread: Physics board - [Repost] computer cluster

[The following is reposted from the Hardware board; original text follows]
Posted by: stlstl (射天狼), Board: Hardware
Title: computer cluster
Posted at: Unknown Space - 未名空间 (Mon Oct 6 17:27:45 2003) WWW-POST
There is a giant computer cluster in the institution where I am working. The
cluster is currently not being used and will not be used anymore. This
institution is seeking anyone/group/institution who might be interested in it.
Very low price. It could actually be given away for free in certain situations.
Some details describing the cluster are follow

p********a
Posts: 5352

From thread: Statistics board - [Compilation] k-mean clustering

☆─────────────────────────────────────☆
GoooG (pumpkin) wrote at (Fri Mar 28 23:10:44 2008):
hi
if I have two points A, B as the centres of clusters, how do I cluster
a set of points into two clusters?
☆─────────────────────────────────────☆
drburnie (专门爆料) wrote at (Fri Mar 28 23:46:04 2008):
It's in just about any textbook.
Pattern recognition, machine learning, multivariate statistics, and so on.

☆─────────────────────────────────────☆
GoooG (pumpkin) wrote at (Fri Mar 28 23:50:51 2008):
my boss suggested that I use k-means to cluster the supervised problem.
does it make sense?
☆
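
A minimal sketch of the step the first question in this thread asks about, assuming squared Euclidean distance and fixed centres: each point goes to its nearest centre, which is the assignment step of k-means. The function name `assign_to_centers` is hypothetical.

```python
def assign_to_centers(points, centers):
    """Group points by nearest centre; returns one list of points per centre."""
    clusters = [[] for _ in centers]
    for p in points:
        # Squared Euclidean distance from p to every centre.
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
        clusters[dists.index(min(dists))].append(p)
    return clusters


A, B = (0, 1), (10, 10)
points = [(0, 0), (1, 0), (9, 9), (10, 8)]
print(assign_to_centers(points, [A, B]))
```

Full k-means would then move each centre to the mean of its assigned points and repeat until the assignments stop changing.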

d*******h
Posts: 47

From thread: Statistics board - how to use SAS to do cluster analysis with both characteri

I would greatly appreciate it if anyone can help me with this question. I
want to find out how to use any Cluster Analysis Procedure (i.e., Proc
Cluster) to cluster data with both characteristic and numeric variables. I
thought Proc Cluster can only work on numeric variables, how can I use it
for both types of variables?
Greatly appreciate it!

f*******i
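Posts: 8492

From thread: Statistics board - Waiting online: a question about the output of SAS's cluster procedure

I'm using SAS 9.2. Following an example from the web, I ran the same commands on different data, but the names
and number of items in the output are completely different.
In their output, under "Centroid Hierarchical Cluster Analysis" there are the following items:
Variable, Mean, Std Dev, Skewness, Kurtosis, Bimodality.
My output has only these four items: Eigenvalue, Difference, Proportion, Cumulative.
Under "Cluster History", theirs has NCL, Clusters Joined, FREQ, STD, SPRSQ, RSQ, and Dist.
My output has only NCL, Clusters Joined, FREQ, Norm Cent Dist, Tie.

What's going on? The commands are clearly the same.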

h***x
Posts: 586

From thread: Statistics board - Sample size for clustering analysis

Use Varclus (SAS) and PCA to do variable reduction first before running
clustering. When you only have 10-20 variables, you won't have to agonize over
sampling strategies.
I do not like kmeans. Every time I reset the seeds, or even reorder the
dataset, I get different results. The upside is that I can get the
results I want after trying and trying... Not sure if that is kind of
cheating...
Non-parametric clustering (modeclus) is a better choice most of the time. It
can handle the situati...

h***x
Posts: 586

From thread: Statistics board - data clustering by vector correlation distance

1) As catforfish said, a data point does not have to be a scalar; a
vector is fine. All my clustering work is on multiple dimensions rather than
one dimension. I suggest you spend some time learning clustering first.
2) In your example, v1 and v2 have a strong correlation. If you want to take
this into account in clustering, you should not use Euclidean distance as
the statistical measure; you can use other measures with the features you
like for your task.
3) For clustering, result explanati...
