This page is an excerpt and archive of the corresponding thread on 未名空间 (MITBBS).
Statistics board - A question about a test for correlation coefficients
t**********y
Posts: 374
1
Two populations: 10 samples were taken from population 1 and 8 samples from population 2.
I calculated the correlation coefficients among the samples from each population:
for population 1, a total of 45 correlation coefficients were calculated and the
average was 0.41; for population 2, a total of 28 correlation coefficients were
calculated and the average was 0.9.
My conclusion is: the samples in population 2 have significantly lower variation
than the samples in population 1.
Which test can be used in this situation to get a p-value?
Many thanks!!
T*******I
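Posts: 5138
2
It looks as if 45 rounds of sampling were done in the first population, each drawing 10
observation units, so a correlation coefficient was computed for each set of 10 units;
similar work was done in the second population, giving 28 correlation coefficients.
The two groups of correlation coefficients therefore form two distributions of sample
correlation coefficients, and you want to test whether the two groups are statistically
consistent, so the simple test is a t-test.
If you had sampled each population only once to get a single correlation coefficient, that
would amount to pooling all the draws from a population into a single sample, i.e. a sample
size of 450 for the first population and 224 for the second, each yielding one correlation
coefficient. A two-sample t-test would still seem usable, because you should have each
correlation coefficient and its standard error. But be careful when choosing the degrees of
freedom for the test. That part is a bit tricky.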

【In reply to t**********y】

R*****0
Posts: 146
3
LOL. Do you know why we have 45 coefficients for 10 samples and 28 for 8
samples? Think about it. Meanwhile, please note that the number of samples
is NOT the same as the sample size.

【In reply to T*******I】

T*******I
Posts: 5138
4
Did he/she do it with a jackknife or bootstrap process? I am not sure, so I
simply supposed he/she did it in that simple way.
To me, the correlation coefficient is a continuous random variable. So the t-test
is proper for this case.
The term "sample" in statistics can sometimes cause confusion when
it is used by different people in different situations.

【In reply to R*****0】

R*****0
Posts: 146
5
To LZ (the OP): IMO it seems the samples you drew (whether 10 or 8) are not
independent of each other. An average correlation of 0.9 means you have a lot
of sample pairs with very high (close to 1) correlations, which is not
likely to happen if you draw samples randomly. Anyway, it still does not
tell you much about the sample variances.
For example, suppose you draw 8 samples from population 2. If you now
multiply all those data by 2, you will get the same 28 correlations.
However, the variances will be 4 times larger. That is why I don't think you
can compare sample variances based on the correlations you have got. Why
don't you simply calculate those (e.g. 8) sample variances and see what you
get?
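The scaling argument can be checked numerically. This is a minimal sketch on made-up toy vectors (not the OP's data); the helper names `pearson` and `variance` are illustrative only.

```python
from itertools import combinations
from math import sqrt


def mean(xs):
    return sum(xs) / len(xs)


def variance(xs):
    """Unbiased sample variance."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical toy data: 3 samples, 5 measurements each.
samples = [
    [1.0, 2.0, 4.0, 3.0, 5.0],
    [2.0, 1.0, 5.0, 3.0, 6.0],
    [0.5, 2.5, 3.0, 4.5, 4.0],
]
doubled = [[2 * x for x in s] for s in samples]

# Pairwise correlations before and after multiplying everything by 2.
r_before = [pearson(a, b) for a, b in combinations(samples, 2)]
r_after = [pearson(a, b) for a, b in combinations(doubled, 2)]

# Ratio of each sample's variance after doubling to its variance before.
var_ratio = [variance(d) / variance(s) for s, d in zip(samples, doubled)]
```

Here `r_before` and `r_after` agree while every entry of `var_ratio` is 4, which is the point of the post: the correlations carry no information about the scale of the variances.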

【In reply to t**********y】

T*******I
Posts: 5138
6
Simply put, the t-test is practicable since the correlation coefficient is a
continuous random variable, and the sample sizes of the two samples are 45
and 28, respectively.
If the t-test cannot be used for this case, let it die in Statistics!

【In reply to R*****0】

g******2
Posts: 234
7
If I understand correctly, each sample is some kind of measurement on
different coordinates; otherwise it's hard to imagine such large correlations
between independent samples. I would suggest that you
1. calculate the variance (using 8 or 10 samples, depending on the population)
on each coordinate;
2. sum up the variances across coordinates; let's call it SS;
3. run an F-test, F = (SS1 / k1) / (SS2 / k2), with df1 = 9 * k1 and df2 = 7 * k2,
where k1 is the number of coordinates for population 1 and k2 is the number
of coordinates for population 2.
Here I assume that population 1 is independent of population 2 and that the
variance is equal across coordinates.
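The three steps above can be sketched as follows. This is only an illustration on hypothetical toy groups; `pooled_variance` and `f_ratio` are made-up names. It computes the F statistic and the degrees of freedom but not the p-value, which would need an F-distribution tail function that the standard library does not provide.

```python
def pooled_variance(group):
    """group: list of individuals, each a list of k coordinate values.

    Returns (SS / k, df), where SS is the sum over coordinates of the
    per-coordinate sample variance and df = (n - 1) * k.
    """
    n = len(group)
    k = len(group[0])
    ss = 0.0
    for j in range(k):
        column = [individual[j] for individual in group]
        m = sum(column) / n
        ss += sum((x - m) ** 2 for x in column) / (n - 1)
    return ss / k, (n - 1) * k


def f_ratio(group1, group2):
    """F = (SS1 / k1) / (SS2 / k2) with its two degrees of freedom."""
    v1, df1 = pooled_variance(group1)
    v2, df2 = pooled_variance(group2)
    return v1 / v2, df1, df2


# Hypothetical toy data: 3 and 4 individuals, 2 coordinates each.
group1 = [[1.0, 10.0], [3.0, 14.0], [5.0, 18.0]]
group2 = [[2.0, 5.0], [3.0, 6.0], [4.0, 7.0], [5.0, 8.0]]
f_stat, df1, df2 = f_ratio(group1, group2)
```

With 10 and 8 individuals the degrees of freedom become 9 * k1 and 7 * k2 as in the post. The equal-variance and independence assumptions stated in the post still have to hold for the F reference distribution to apply.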
T*******I
Posts: 5138
8
That is not how the independence of random variables should be understood.

【In reply to g******2】

j*****e
Posts: 182
9
What is the dimensionality of your response variable?
45 = 10*9/2; 28 = 8*7/2.
It seems that your computation of the correlation coefficient does not conform to
the conventional approach. Please write out your parameter of interest in terms
of the distribution first.
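The counts 45 and 28 are simply the numbers of unordered pairs among 10 and 8 items. A small check using only the standard library:

```python
from itertools import combinations
from math import comb

# Enumerate every unordered pair of sample indices.
pairs_10 = list(combinations(range(10), 2))
pairs_8 = list(combinations(range(8), 2))

# n * (n - 1) / 2 unordered pairs for n samples.
count_10 = comb(10, 2)
count_8 = comb(8, 2)
```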

【In reply to t**********y】

t**********y
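Posts: 374
10
It seems the problem is more complicated than I imagined.
1. Population 1 here is female individuals; the sample is 10 female individuals, and what
was measured is the expression values of 10,000 genes. What I computed is the correlation
coefficient of these 10,000 gene expression values between each pair of female individuals,
so there are 45 in total, with an average of 0.41.
2. Population 2 here is male individuals; the sample is 8 male individuals, with the same
10,000 gene expression values measured. I then computed the correlation coefficient of
these 10,000 gene expression values between each pair of male individuals, so there are 28
in total, with an average of 0.9.
3. What I want to say is that the differences in gene expression among male individuals are
smaller than among female individuals.
After reading all this discussion, I'm wondering: could I use the U-test?
The 45 correlation coefficients in females and the 28 correlation coefficients in
males are apparently not normally distributed, judging by the histograms.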
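For reference, the U statistic behind the U-test (Mann-Whitney) can be computed by direct counting. This is a minimal sketch on hypothetical values, with `mann_whitney_u` as an illustrative name. It gives the statistic only; any p-value derived from it assumes independent observations, which pairwise correlations sharing individuals do not satisfy.

```python
def mann_whitney_u(xs, ys):
    """U statistic for xs: number of (x, y) pairs with x > y; ties count 0.5."""
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


# Hypothetical correlation values for two groups (not the OP's data).
female_r = [0.3, 0.5, 0.4]
male_r = [0.9, 0.8, 0.95, 0.5]

u_female = mann_whitney_u(female_r, male_r)
u_male = mann_whitney_u(male_r, female_r)
```

The two statistics always sum to the product of the group sizes, so a value of `u_female` near 0 means almost every female value lies below almost every male value.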

【In reply to t**********y】

t**********y
Posts: 374
11
What I computed is the Pearson correlation.

【In reply to j*****e】

t**********y
Posts: 374
12
Bumping my own thread!!

【In reply to t**********y】

j*****e
Posts: 182
13
Pearson correlation is not applicable here.
You have misinterpreted the concept of correlation.
Your data are high-dimensional. Try to specify your population first. Then,
what population parameter are you trying to compare?
The way you compute the correlation implies that you treat each individual as a
population and the genes as the random samples. Is this what you want?
The problem is complex. Please consult an expert on high-dimensional data
analysis in person.

【In reply to t**********y】

a******1
Posts: 201
14
1. Only professionals in the field can define the variation in your case; data
analysts can't. Since you are asking the question in terms of the correlation
coefficient, I assume that, as a professional in your field (biology?), you
know that it is being used to quantify the variation.
2. In your case, you have two samples of size 45 and 28, and you would
like to compare the means. The classic and best approach is to use the t-test, as
suggested. But please note that each sample does not consist of
independent variables. For example, let's look at 3 female samples X1,
X2, and X3. In an extreme case, if r(X1, X2) and r(X1, X3) are both large,
say very close to 1, then r(X2, X3) should also be large and close to 1.
So r(X1, X2), r(X1, X3) and r(X2, X3) are correlated with one another.
3. To compare two means statistically, you also need to know the two sigmas. The
mean for the male sample is close to 1, so a lot of the values must be close to 1
too; thus the distribution for the male sample should be tight and the sigma
small. But we can't say the same for the female sample. The values can all be close
to 0.41, giving a tight distribution and a small sigma, or they can be spread all
over between 0 and 1, with mean 0.41 and a large sigma. In an artificial
case, if half of the female r values are close to 0 and the other half are
close to 1, you can't say the female r values are smaller than the male ones. So
please calculate the sigma as well.
4. To use the standard t-test, the assumption is that the underlying distributions
are Gaussian. But in your case, the r values are between 0 and 1, and they can't
be Gaussian, so you may want to transform r before you do the
t-test. Of course you can still use the t-test directly, but please check the
distribution of the data to make sure the approximation is good enough.
5. Use an F-test to check whether the sigmas are the same before you pick the
equal-variance or unequal-variance t-test.
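Point 4 mentions transforming r without naming a transformation. One common choice, used here as an assumption and not as the poster's stated method, is Fisher's z = atanh(r), followed by an unequal-variance (Welch) t statistic. The values below are hypothetical, and the dependence described in point 2 still makes any resulting p-value questionable.

```python
from math import atanh, sqrt


def fisher_z(r):
    """Fisher's z-transform, mapping r in (-1, 1) onto the real line."""
    return atanh(r)


def welch_t(xs, ys):
    """Unequal-variance (Welch) t statistic for the difference in means."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    return (mx - my) / sqrt(vx / nx + vy / ny)


# Hypothetical correlation values (not the OP's data).
female_r = [0.30, 0.45, 0.50, 0.38]
male_r = [0.88, 0.92, 0.90]

# Compare the groups on the transformed scale.
t_stat = welch_t([fisher_z(r) for r in female_r],
                 [fisher_z(r) for r in male_r])
```

A negative `t_stat` here just says the transformed female values average lower than the male ones; it does not account for the shared individuals behind the pairs.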

【In reply to t**********y】

T*******I
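Posts: 5138
15
Based on his latest description, I have only just figured it out.
This is what he did. He first drew samples of some size from the first population, possibly
with only one random variable, e.g. the expression of some gene. He sampled 10 times, giving
10 samples, each containing some number of individual observations. He then paired up the 10
samples in all two-by-two combinations and computed, for the same gene, the correlation
coefficient between the two draws, giving 45 correlation coefficients.
He did similar work on the second population but sampled only 8 times, giving 28 pairwise
combinations and 28 correlation coefficients.
What he finally wants to know is whether the distributions of these two groups of correlation
coefficients agree. So the t-test or the rank sum test are the candidate tests, and the
testing part may not be that complicated.
What is really complicated is that I want to know how he matched up the observations within
each pair of samples to compute the correlation coefficient. Within the same pairwise
combination, a different matching gives a different correlation coefficient, and in this
example there appears to be no particular rule for the matching.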

【In reply to a******1】

t**********y
Posts: 374
16
Because the data are non-normal and non-independent, the t-test was ruled out from the start.

【In reply to a******1】

t**********y
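Posts: 374
17
Here I randomly took ten female individuals (f1, f2, f3, ..., f10) and likewise 8 male
individuals (m1, m2, ..., m8), and measured the expression values of the same 10,000 genes
in each individual.
The correlation coefficients among the female individuals were computed over all
combinations: (f1,f2), (f1,f3), ..., (f1,f10), (f2,f3), ..., (f2,f10), ..., (f9,f10).
The male individuals were handled the same way.
As said earlier, the correlations are clearly not independent, so I don't dare pick a test
rashly.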
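The procedure described above, and the reason the resulting correlations are not independent, can be sketched as follows. The toy expression vectors and the names `pairwise_r` and `pairs_sharing_an_individual` are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def pairwise_r(individuals):
    """individuals: dict mapping a name to that individual's expression vector.

    Returns one correlation per unordered pair of individuals.
    """
    names = sorted(individuals)
    return {(a, b): pearson(individuals[a], individuals[b])
            for a, b in combinations(names, 2)}


def pairs_sharing_an_individual(n):
    """Count how many pairs of correlations have an individual in common."""
    pairs = list(combinations(range(n), 2))
    return sum(1 for p, q in combinations(pairs, 2) if set(p) & set(q))


# Hypothetical toy data: 3 individuals, 4 "genes" each.
toy = {
    "f1": [1.0, 2.0, 3.0, 4.0],
    "f2": [3.0, 5.0, 7.0, 9.0],
    "f3": [4.0, 3.0, 2.0, 1.0],
}
r = pairwise_r(toy)
shared_10 = pairs_sharing_an_individual(10)
shared_8 = pairs_sharing_an_individual(8)
```

For 10 individuals, 360 of the 990 possible pairs of correlations share an individual, so the 45 values cannot be treated as 45 independent observations.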

【In reply to T*******I】

a******1
Posts: 201
18
He has 10 female samples; for each of them he has one data point of 10,000
dimensions, and he calculates the r for each pairwise combination. I don't
know what those 10,000 dimensions are; only he knows. Anyway, his
problem should be to compare the distributions of F, with 45 data points,
and M, with 28 data points, where each data point is 10,000-dimensional. He wants
to simplify the problem by calculating r. But by calculating r between, say,
F1 and F2, he recasts the data as 10,000 data points, each
consisting of the values of the same DNA measurement from F1 and F2. That is why I
said only he can determine whether that is the right way to look at the data,
because only he knows what those 10,000 DNA values are and how they are related.

【In reply to T*******I】

g*****o
Posts: 812
19
lol
Did you even finish high school? You don't know permutations and combinations?

【In reply to T*******I】

T*******I
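Posts: 5138
20
So after all this, it turns out you drew 10 people from the first population and then used
the 10,000 gene expression values of each pair of people to build a correlation, giving 45
correlation coefficients in total; in the same way the 8 people from the second population
gave 28 correlation coefficients. Finally you want to see whether these two groups of
correlation coefficients agree.
My advice is that only the t-test or the rank sum test can help you achieve that. This is a
simple statistical testing problem.
Although I can't give a detailed explanation right now, intuitively, normality and
independence within these two groups of correlation coefficients should not be a problem.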

【In reply to t**********y】

T*******I
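Posts: 5138
21
I told you long ago that my mathematical computing skills are currently at about the primary
and junior high school level. I know permutations and combinations exist, but I don't
remember any of the formulas. I could probably still use them after flipping through a book.
If you want to discuss mathematics with me, you've picked the wrong person. I never discuss
mathematical problems in statistics. It is not my strong suit.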

【In reply to g*****o】

t**********y
Posts: 374
22
Because these are the correlations for all combinations, they are indeed not independent.

【In reply to T*******I】

T*******I
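Posts: 5138
23
Each correlation coefficient is an independent measurement, because each pair and the
computation of its correlation coefficient is not influenced or controlled by the other
pairs, nor can it be confused with any other pair. Moreover, since both groups of
correlation coefficients were produced by the same data-generating mechanism, even if there
is some so-called "non-independence" effect, that effect is balanced across the groups, so
you only need to look at the difference between the distributions of the two groups of
correlation coefficients. That is the core reason for using the t-test or the rank sum test.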

【In reply to t**********y】
g**a
Posts: 2129
24
1. What is the platform for your expression data: RNA-seq or an expression assay?
For RNA-seq, it is count data, so whatever you guys talked about is
meaningless. Let's assume it is an expression assay.
2. Based on the assumption in 1, are the expression values you were talking about
per probe or per gene? If they are per gene, how did you combine
the readings from different probes within one gene? It is not that easy to
combine the expression from multiple probes in one gene, just like combining
different SNP probes within one gene for GWAS. Those are still hot research
areas in the field. Since you didn't mention that at all, I assume the values are
per probe. So the biological question you were asking is whether those
10,000 probes you chose are more variable in females or in males.
3. I can't imagine any biology PI asking this kind of question, because I
can't see how he could benefit from the conclusion. But
let's assume he asked.
The Pearson correlation coefficient measures the linear dependency between two
variables. It assumes that the data points are independent. In your case, 1. the
10,000 points are from the same patient, so they are not independent. 2. Let's
assume they are independent. A correlation coefficient close to one only
indicates that the two patients are linearly dependent. It has nothing to do with
the variability.
Solution:
I propose to use the Euclidean distance. Define a multi-dimensional space with
as many dimensions as the number of probes you chose. Calculate the Euclidean
distance between any two patients within the same group. A small Euclidean
distance would indicate a more compact group, which means less variability
over all chosen features within the group. You can even develop a statistic
for those distances so that a p-value can be calculated, if you can find a
distribution for it. If that is not feasible, you can perform a bootstrap:
select m samples from each population to obtain an empirical distribution of
this distance and calculate a p-value from it.
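A rough sketch of the distance idea follows. It swaps in a label-permutation test in place of the bootstrap the post suggests, because permuting whole individuals keeps each person's vector intact; that substitution, the toy data, and the names `mean_pairwise_distance` and `permutation_p_value` are all assumptions for illustration. Whether Euclidean distance suits the OP's data type is a separate question.

```python
import random
from itertools import combinations
from math import dist


def mean_pairwise_distance(group):
    """Average Euclidean distance over all unordered pairs in a group."""
    pairs = list(combinations(group, 2))
    return sum(dist(a, b) for a, b in pairs) / len(pairs)


def permutation_p_value(group1, group2, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the difference in mean pairwise
    distance, obtained by reshuffling individuals between the two groups."""
    rng = random.Random(seed)
    observed = mean_pairwise_distance(group1) - mean_pairwise_distance(group2)
    pooled = list(group1) + list(group2)
    n1 = len(group1)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = (mean_pairwise_distance(pooled[:n1])
                - mean_pairwise_distance(pooled[n1:]))
        if abs(diff) >= abs(observed):
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


# Hypothetical toy data: a spread-out group and a compact group.
spread = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
compact = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
p = permutation_p_value(spread, compact, n_perm=200)
```

One limitation worth noting: if the two groups differ in location as well as spread, shuffling labels mixes both effects, so centering each group first may be needed before interpreting the result as a statement about variability.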
a******1
Posts: 201
25
He does not have 10,000 data points from one sample; what he actually has is
one data point of 10,000 dimensions. Although I said that he is the only one
who can determine whether the "correlation coefficient" he calculated is the right
way to characterize the distribution of his data, I have come to the conclusion
that his "correlation coefficient" does not mean much, or that he simply
misunderstood the concept of the correlation coefficient. As a simple example
for him to understand, let's say we have two people and their
height and weight, (h1, w1) and (h2, w2). You don't calculate the r between
these two people with data points (h1, w1) and (h2, w2). What you can and
should do is calculate the r between HEIGHT and WEIGHT with data points
(h1, h2) and (w1, w2). What the OP did is exactly the former, which he called the
"correlation coefficient" between samples, but what he should do is
calculate the correlation coefficients among the 10,000 DNA characteristics.
The Euclidean distance is one way to look at it. But since his covariance
matrix is of 10,000 dimensions, it will not be easy. He may want to reduce
the data dimensions first by checking the correlations among the DNA
characteristics, or by transforming the data into fewer dimensions.
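The height and weight example can be made concrete. In this sketch the numbers are invented for illustration. It contrasts correlating two people across their measurements with correlating two variables across people.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical people, each recorded as (height in cm, weight in kg).
person1 = [170.0, 60.0]
person2 = [160.0, 55.0]

# "r between people": only two points per person, so the result can only
# be +1 or -1 whatever the people look like.
r_people = pearson(person1, person2)

# r between HEIGHT and WEIGHT across several hypothetical people.
heights = [170.0, 160.0, 180.0, 175.0]
weights = [60.0, 55.0, 80.0, 72.0]
r_height_weight = pearson(heights, weights)
```

`r_people` comes out as 1 simply because both people have a height number larger than their weight number, while `r_height_weight` is a genuine estimate of how the two variables move together across people.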

【In reply to g**a】

g**a
Posts: 2129
26
Based on what I read in this thread, he did use expression values from 10,000
probes for each patient. Then he calculated the correlation coefficient between
any two patients. That is why he could get 45 (choose(10,2)) correlation
coefficients. Those values measure expression for different genes, so
they are different variables for the same person. So they are not
independent observations but different variables.
I agree with you that calculating the correlation coefficient between 2 samples
across those genes is meaningless. It is either a "new breakthrough" in data
analysis or he just doesn't understand what he was doing.
Finally, I agree with you that 10,000 probes are too many. A dimension reduction
procedure should be applied. How to reduce the dimension? Hehe, that is another
topic.

【In reply to a******1】

T*******I
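Posts: 5138
27
Having come this far in the discussion, a few thoughts.
Our understanding of statistics is sometimes still quite lacking, and some people, myself
included, are still rigid in their thinking; the discussion of so-called independence and
correlation shows this.
In short, the correlation the OP constructed is a measurable continuous random variable, and
it can reflect the internal consistency or variability of each of the two groups, so a
t-test or rank sum test is worth trying.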
a******1
Posts: 201
28
Hehe, do you really think so?
Let's look at the example of people's height and weight measurements. In
an artificial case, let's say a group of women have similar weights but
a lot of variation in height, and a group of men are just the opposite,
with similar heights but different weights. If you calculate the r as the OP did,
can you see what you will get? And if you change the unit of the height or
weight, what will happen?

【In reply to T*******I】

T*******I
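Posts: 5138
29
Usually the data structure looks like this:
ID    JY1  JY2  ...  JY10000
F1
F2
...
F10
M1
M2
...
M8
and the correlation coefficient can be calculated among the genes, for females and males
respectively.
But his data structure looks like this:
Gene     F1  F2  ...  F10  M1  M2  ...  M8
JY1
JY2
...
JY10000
What he computed is, within each sex, for each pair of people, the correlation between the
variation of their 10,000 genes. Since the data for every gene are generated and recorded
the same way, e.g. all are some kind of expression rate, the correlation coefficient he
defined is fine as a random measurement, and no combination is repeated, so each
combination, i.e. each correlation coefficient, is independent.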

【In reply to a******1】

T*******I
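Posts: 5138
30
If the OP has reservations about the t-test or the rank sum test, I can also recommend a
general random measure that I proposed in a JSM proceedings paper: the ratio of the absolute
difference of the squares of two homogeneous random point measurements to the sum of their
squares. This ratio expresses the difference between the two random point measurements, and
its range is the closed interval [0,1].
For example, in your case the averages of the two groups of correlation coefficients are
0.41 and 0.9, so the ratio between them is 0.6563, showing that the difference between them
exceeds 0.6, i.e. the difference is significant.
As another example, suppose the two groups of correlation coefficients were 0.41 and 0.52;
then the difference would be 0.2333, while the difference between 0.41 and 0.45 is only
0.0928. But the differences between 0.86 and 0.9 and between 0.9 and 0.94 are 0.04543 and
0.04346, respectively.
Perhaps you could construct a measure of your own to judge their difference.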
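The ratio described in this post is |a² − b²| / (a² + b²). A short sketch reproducing the quoted numbers follows, with `squared_difference_ratio` as a made-up name. It is a descriptive index only: it uses the two averages alone and does not by itself yield a p-value.

```python
def squared_difference_ratio(a, b):
    """|a^2 - b^2| / (a^2 + b^2) for two values that are not both zero."""
    return abs(a * a - b * b) / (a * a + b * b)


# The pairs of averages discussed in the post.
examples = {
    (0.41, 0.90): squared_difference_ratio(0.41, 0.90),
    (0.41, 0.52): squared_difference_ratio(0.41, 0.52),
    (0.41, 0.45): squared_difference_ratio(0.41, 0.45),
    (0.86, 0.90): squared_difference_ratio(0.86, 0.90),
    (0.90, 0.94): squared_difference_ratio(0.90, 0.94),
}
```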
a******1
Posts: 201
31
You just don't understand, do you?
From a statistics standpoint, JY1, JY2, ... are not sampled individuals
(randomly and independently selected from a population), so you don't use
them to calculate r between females or males. The OP's problem is to compare
the distributions of (JY1, JY2, ...) (or some characteristics of the
distributions, like the one the OP called the correlation coefficient) between 2
groups: Female and Male. So the correlation, or whatever the OP is interested in,
is among (JY1, JY2, ...).
And from a common-sense standpoint, in the simple example I gave, if you change the
unit for height or weight, r can be anything and the order between F and M
can be reversed. In that case, does it make any sense to use r to
characterize anything?
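The unit dependence is easy to reproduce in the two-measurement example. In this sketch the two people and their values are hypothetical. Only the weight unit changes, yet the person-to-person r flips sign.

```python
from math import sqrt

KG_PER_LB = 0.45359237  # exact definition of the avoirdupois pound in kilograms


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical people as (height in cm, weight in lb).
a_lb = [170.0, 150.0]
b_lb = [180.0, 200.0]

# The same two people with weight converted to kg.
a_kg = [a_lb[0], a_lb[1] * KG_PER_LB]
b_kg = [b_lb[0], b_lb[1] * KG_PER_LB]

r_with_pounds = pearson(a_lb, b_lb)
r_with_kilograms = pearson(a_kg, b_kg)
```

Nothing about the people changed between the two calculations, which is the post's argument that an r computed across differently scaled measurements of one individual does not characterize the individuals.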

【In reply to T*******I】

T*******I
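Posts: 5138
32
Your thinking is remarkably rigid. Either you learned mathematics too well
or you learned statistics very poorly. Of course, another possibility is that the teacher
who taught you statistics was of limited ability.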

【In reply to a******1】

a******1
Posts: 201
33
Then how good is your statistics teacher?
In the simple example I gave, to calculate the r as OP did, you need to
calculate variance first. What is the variance in my example? The first
sample is (h1, h2), and the second is (w1, w2), they don't even measure the
same thing, you can't get any meaningful variance from these two "samples".

【In reply to T*******I】

a******1
Posts: 201
34
Actually, what is the mean: (h1+w1)/2 and (h2+w2)/2?

【In reply to a******1】

T*******I
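Posts: 5138
35
Do you want me to repeat what I said in post 29?
"What he computed is, within each sex, for each pair of people, the correlation between the
variation of their 10,000 genes. Since the data for every gene are generated and recorded
the same way, e.g. all are some kind of expression rate, the correlation coefficient he
defined is fine as a random measurement, and no combination is repeated, so each
combination, i.e. each correlation coefficient, is independent."
The variance is over the 10,000 genes and is calculated for each person.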

【In reply to a******1】

g*****o
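Posts: 812
36
lol, are you doing statistics by intuition?
Can't you go and look up the definition and formula of the correlation coefficient on
Wikipedia? I think even Cui Yongyuan would have to bow to you.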

【In reply to T*******I】

T*******I
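Posts: 5138
37
It looks like you are another fellow with a strong mathematical background who never learned
statistics well. Go and ponder the definition of the correlation coefficient yourself. Last
time I asked you to read some Hegel and you dismissed it. Evidently your mind is still rigid.
So I can't be bothered to argue with you. There was never anything to argue about.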

【In reply to g*****o】

g*****o
Posts: 812
38
Right, I'm not as flexible as you; in your world 1/1 = 0 is possible.
Is that what Hegel taught you →_→

【In reply to T*******I】

T*******I
Posts: 5138
39
You have misunderstood Hegel, or you never understood what his work is about in the first
place.

【In reply to g*****o】

T*******I
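Posts: 5138
40
Strictly speaking, the title of this thread should be
"A question about testing the difference between two groups of correlation coefficients."
The original title is ambiguous, and the original post did not make things clear from the
start; in particular, the word "samples" is itself ambiguous, which caused some unnecessary
misunderstanding.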

【In reply to t**********y】

t**********y
Posts: 374
41
The data were from RNA-seq. The numbers are the gene counts (each gene has
one specific count in a specific sample).
I feel that the correlation coefficient seems to be acceptable in terms of
evaluating sample reproducibility, even though we all know the pitfalls.
I am interested in what you mentioned: Euclidean distance. Could you please
let me know how to test those distances? Or recommend any paper?
Thanks a lot.

【In reply to g**a】

g**a
Posts: 2129
42
If it is RNA-seq data, all we talked about won't make any sense. RNA-seq is
count data. Euclidean distance, Pearson correlation and multivariate
methods in general are all based on continuous variables. Don't use them for RNA-seq. If
you can clarify the biological question you want to answer, I may
be able to help.


【In reply to t**********y】

T*******I
Posts: 5138
43
In my opinion, Pearson's correlation coefficient does work for your case.
You can also try Spearman's correlation coefficient, which is said to be
used with count data. But I believe that for your case, both should
be very close to each other.
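Whether the two coefficients come out close depends on the data, so it may be worth computing both. This is a minimal sketch of Spearman's coefficient as the Pearson correlation of average ranks, on hypothetical toy values; the function names are illustrative.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical toy data: a monotone but nonlinear relationship.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 4.0, 9.0, 16.0, 25.0]
rho = spearman(x, y)
r = pearson(x, y)
```

On this toy example the rank-based value is exactly 1 while Pearson's r is slightly lower, so the two can differ whenever the relationship is monotone without being linear.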

【In reply to t**********y】

t**********y
Posts: 374
44
The question is: is the variation among female individuals larger than the variation among
male individuals?

【In reply to g**a】

t**********y
Posts: 374
45
Euclidean distance won't work either?

【In reply to g**a】
