This page is an excerpt and archive of the corresponding thread on 未名空间 (MITBBS).
DataSciences board - [Data Science Project] Location data quality
c***z
Posts: 6348
1
Hi all,
This is my first project at the new company, and it is about third-party
data quality. There is no gold standard for quality, but we know that
repeated locations in a dataset might imply bad quality, because in that
case the location might come from a centroid (e.g. a cell tower rather
than a cell phone).
There is also no ground truth about which datasets are good, but we know
some good ones, particularly the channels we own.
We are exploring the relationship between a vendor's data quality and the
distance of its location distribution from the known good ones. Here comes
the other moving part: what does distance mean here? Basically, each vendor
sends us requests to display ads, and each request carries a location.
Hence we can group by location and see how many times each one appears. We
can then group by frequency and see how many locations appear that many
times. This way each vendor gives a contingency table with two columns:
frequency and count.
For comparing these contingency tables, what would you suggest?
Or should I go back to the raw data, or to the intermediate table (location
and frequency)?
Thanks a lot!
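
The two group-by steps described in this post can be sketched as follows. This
is only a minimal illustration: the function name `recurrence_table` and the
sample request list are hypothetical, and each request is assumed to be
reduced to a single hashable location key.

```python
from collections import Counter


def recurrence_table(locations):
    """Map each frequency to the number of distinct locations seen that many times."""
    per_location = Counter(locations)            # location -> number of requests
    return dict(Counter(per_location.values()))  # frequency -> number of locations


# Hypothetical request log for one vendor; each entry is a location key.
requests = ["a", "a", "b", "c", "c", "c", "d"]
table = recurrence_table(requests)
for frequency in sorted(table):
    print(frequency, table[frequency])
```

The intermediate `per_location` counter is the "location and frequency" table
mentioned in the post; the returned dictionary is the two-column
frequency/count table.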
w**p
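Posts: 4080
2
You could build an index to quantify data quality: for example, decide how
large the penalty is for a location that repeats twice, how large for one
that repeats three times, and so on. The penalties can be set from business
understanding, and they can be adjusted later.
This way every vendor gets a data quality score. Then map this score based
on the distance to the good vendors?
Just my two cents; experts, please don't laugh.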
c***z
Posts: 6348
3
There is a scale problem with this approach: some vendors have 100 times more
data. But maybe we can try normalizing to percentages...
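
One way to combine the penalty index with the normalization idea is to weight
each penalty by the share of traffic it applies to, so that vendor size
cancels out. The sketch below is an assumption, not something the posters
specified: `penalty_score` and `log_penalty` are hypothetical names, and the
logarithmic penalty is only a placeholder for whatever the business chooses.

```python
import math


def penalty_score(table, penalty):
    """Traffic-weighted mean penalty for one vendor.

    table maps frequency -> number of locations repeated that many times.
    penalty maps a frequency to a non-negative penalty (a business choice).
    """
    total_traffic = sum(freq * count for freq, count in table.items())
    if total_traffic == 0:
        return 0.0
    weighted = sum(penalty(freq) * freq * count for freq, count in table.items())
    return weighted / total_traffic


def log_penalty(freq):
    """Hypothetical penalty: zero for unique locations, log growth after that."""
    return math.log(freq)


small_vendor = {1: 90, 2: 5}
large_vendor = {1: 9000, 2: 500}   # same shape, 100 times the data
print(penalty_score(small_vendor, log_penalty))
print(penalty_score(large_vendor, log_penalty))
```

Because the score is a ratio of traffic totals, the two vendors above receive
the same value even though one has 100 times more data.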
g*****o
Posts: 1564
4
Seeing a contingency table, the chi-square test and Fisher's exact test come
to mind.
[In reply to c***z, post 1]

c***z
Posts: 6348
5
I tried the chi-square and G tests, but sometimes a bad partner is closer to
a good one than another good one is (e.g. dist(G1, G2) > dist(G1, B1)). Also,
both drop the zero cells, and those cells are important to us (e.g. one bad
partner has locations that repeat millions of times, which never happens for
good partners, and the G test omits this case).
Fisher's exact test is exponential and too slow for our case, since there are
thousands of rows. Maybe we can rebin the table into fewer rows, but I would
like to delay rebinning as long as possible, since it loses information.
Thanks a lot!
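
To make the zero-cell remark concrete, here is a rough sketch of the G
statistic for two frequency/count tables treated as a 2 x K contingency
table. The function name and the example tables are hypothetical. A cell with
zero observations adds no term of its own (0 * ln 0 is taken as 0), which may
be what "dropping the zero cells" refers to; the non-zero cell in the same
column still contributes through its expected count.

```python
import math


def g_statistic(table_a, table_b):
    """G statistic, 2 * sum(O * ln(O / E)), for two frequency -> count tables."""
    freqs = sorted(set(table_a) | set(table_b))
    rows = [[t.get(f, 0) for f in freqs] for t in (table_a, table_b)]
    row_totals = [sum(r) for r in rows]
    col_totals = [rows[0][j] + rows[1][j] for j in range(len(freqs))]
    grand = sum(row_totals)
    g = 0.0
    for i in range(2):
        for j in range(len(freqs)):
            observed = rows[i][j]
            if observed == 0:
                continue  # zero cell: contributes nothing to the sum
            expected = row_totals[i] * col_totals[j] / grand
            g += 2.0 * observed * math.log(observed / expected)
    return g


good = {1: 900, 2: 100}
bad = {1: 900, 2: 100, 1_000_000: 1}   # one location repeated a million times
print(g_statistic(good, bad))
```

In this example the single extreme row barely moves the statistic, because the
table counts locations rather than traffic, which is consistent with the
complaint in the post.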
c***z
Posts: 6348
6
In some sense this is similar to the word distributions in documents and I
am measuring the distance between the documents using the count tables (
rather, aggregated count tables with only two columns: frequency and count).
Another analogy I can think of is the wealth distribution (e.g. Gini index).
Any suggestions are extremely welcome! Thanks a lot!
l*******m
Posts: 1096
7
Do you have other data, such as user ID, IP address, timestamp, carrier ID,
app ID, ...?
With additional info, it would be much easier.
[In reply to c***z, post 1]

l******n
Posts: 9344
8
The statistical tests on contingency tables mentioned in previous posts do
not help in this case, because they only tell you whether the tables are
different. As for the Gini index, it tells you how unequal income is across a
nation's population, but it does not tell you which population has good
income.
What you need is a criterion to measure the goodness of the data. I would
suggest you use entropy or some form of variation.
[In reply to c***z, post 6]
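
As one possible reading of the entropy suggestion, the entropy of a vendor's
location distribution can be computed directly from the aggregated
frequency/count table, since every location with frequency f has traffic
share f / N. This is a sketch under that assumption; `location_entropy` is a
hypothetical name.

```python
import math


def location_entropy(table):
    """Shannon entropy in bits of the location distribution implied by the table."""
    total = sum(freq * count for freq, count in table.items())
    if total == 0:
        return 0.0
    entropy = 0.0
    for freq, count in table.items():
        if freq > 0 and count > 0:
            share = freq / total              # traffic share of one such location
            entropy -= count * share * math.log2(share)
    return entropy


print(location_entropy({1: 8}))   # eight locations, each seen once
print(location_entropy({8: 1}))   # one location seen eight times
```

Traffic spread evenly over many locations gives high entropy, while traffic
concentrated on a few repeated locations gives low entropy. Entropy grows
with the number of locations, so comparing vendors of very different sizes
would probably need some normalization.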

g*****o
Posts: 1564
9
I'm not really sure.
Entropy was also mentioned above.
Could mutual information or KL divergence be used, based on the binned count
data of the locations for good and bad vendors?
[In reply to c***z, post 5]
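
Plain KL divergence is undefined when one vendor has a frequency that the
other never shows, which is exactly the zero-cell case raised earlier. A
symmetric variant that stays finite is the Jensen-Shannon divergence. The
sketch below swaps it in as an illustration, it is not something the posters
used, and it assumes both tables are non-empty.

```python
import math


def to_shares(table):
    """Normalize a frequency -> count table to probability shares."""
    total = sum(table.values())
    return {freq: count / total for freq, count in table.items()}


def js_divergence(table_a, table_b):
    """Jensen-Shannon divergence in bits, between 0 and 1."""
    p, q = to_shares(table_a), to_shares(table_b)
    divergence = 0.0
    for freq in set(p) | set(q):
        pa, qa = p.get(freq, 0.0), q.get(freq, 0.0)
        mix = 0.5 * (pa + qa)                 # mixture is positive wherever pa or qa is
        if pa > 0:
            divergence += 0.5 * pa * math.log2(pa / mix)
        if qa > 0:
            divergence += 0.5 * qa * math.log2(qa / mix)
    return divergence


print(js_divergence({1: 900, 2: 100}, {1: 900, 2: 100, 1_000_000: 1}))
```

Normalizing to shares also removes the scale difference between vendors with
very different volumes.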

w**p
Posts: 4080
10
This is what I meant in my first post.
Create a data quality score using some criteria, then analyze the
relationship between this score and the distance.
Or, in other words, you can calculate a "distance" using the location
repetition frequency. A good definition of this "distance" and an
appropriate transformation should eventually give it a linear relation with
the physical distance.
[In reply to l******n, post 8]

c***z
Posts: 6348
11
I had the same concern that there might not be any intrinsic relationship
between the distance/difference and quality/performance. I also proposed
that we should focus on the goodness of the data. But at this moment I am
asked to focus on the distance.
I think my boss's logic is to build the wheel first and then find a way to
use it, rather than to study first whether we need the wheel.
PS: I don't have other data yet, and I am not very familiar with all the data
yet.
PS2: I tried the G test, which is related to KL divergence, but it didn't
work well.
PS3: I don't have physical locations yet; the tables I have are aggregated
one level higher, containing only two columns: location frequency and how
many locations are repeated that many times. Maybe I should propose going
back to the finer-level table with location and frequency.
PS4: Just tried cosine distance, and it is not working well either. Some bad
partners are closer to good ones than they are to each other.
Thanks so much for your replies!
[In reply to l******n, post 8]

c***z
Posts: 6348
12
The problem is that we have neither a good criterion for quality, nor one for
distance, nor an intrinsic relationship between the two...
[In reply to w**p, post 10]

l******n
Posts: 9344
13
I have had many projects like this, which are more science projects than
real business projects. I usually go back to the client and ask for
clarification and objectives. It is also an opportunity to educate your
client about what can be done and what can't.
It is your show time, so don't be too shy to say that it does not make sense.
[In reply to c***z, post 12]

c***z
Posts: 6348
14
Sigh, that is rather hard to do, especially when I have only just started
the job and don't have much credit yet.
I have also been asking for clarification and objectives over and over. My
boss went from saying it was open-ended at first to settling on distance, so
I am working on distance. If it really doesn't work out, I will tell the boss
that we should work on performance instead.
[In reply to l******n, post 13]

m******a
Posts: 77
15
I have read this many times and still don't understand what the original
poster actually wants to do. The problem is not well defined, so it is hard
to get started.
Could you set the data aside first and look at the problem from the business
side, to see what they really want?
Once the business goal is known, come back and see how the data can be used.
Right now it is not even clear at which level, location or vendor, the
problem should be defined.
I thought of two approaches, after aggregating to some level:
1. Clustering: see whether a vendor clusters together with the known good
ones.
2. Classification: see whether the score looks like a good one. Try
bootstrapping in the machine-learning sense, which is not the same thing as
bootstrapping in statistics.
But I don't know how large the sample actually is.
[In reply to c***z, post 1]

c***z
Posts: 6348
16
Exactly. I have proposed starting from the business questions.
And this is the reply from my boss:
"I am not clear what kind of answers from 'business' you are looking for. It
has always been the same: Ability to differentiate good location quality
traffic from bad location quality traffic."
Still, there is no idea of what "good traffic" means, just a bunch of
good/bad traffic samples that we need to generalize into a definition.
So we don't have a definition of goodness, nor a definition of the metric,
nor an idea about the intrinsic relation between the two. We are just
exploring.
I tried clustering with a few data points (each vendor is a point), and the
bad ones are mixed in with the good ones. The metrics I used are X^2, G,
RMSE, cosine, area between curves, etc.
I also tried classification, but there are too few features and data points,
and there is serious overfitting.
Can you explain a bit about the difference between bootstrapping in ML and in
stats?
Thanks so much!
[In reply to m******a, post 15]

c***z
Posts: 6348
17
Made some progress. I did a clustering analysis on 150 vendors (112 good ones
and 38 bad ones), using a strange metric (average height of the area
between two log-log curves).
The result is almost too good to be true: in group 1, everyone is bad; in
group 2, everyone except one is good.
The interesting thing is that as I throw in more data points, things can get
worse or better...
Take a look at the picture. Any suggestions and comments are extremely
welcome!
[In reply to c***z, post 16]
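
The thread never spells out how "average height of the area between two
log-log curves" is computed, so the following is only one guess at it: treat
each table as a curve of log(count) against log(frequency), integrate the
absolute gap between the two curves over their shared log-frequency range
with the trapezoid rule, and divide by the width of that range. All names are
hypothetical, and both tables are assumed non-empty.

```python
import math


def loglog_points(table):
    """Sorted (log frequency, log count) points for the positive cells of a table."""
    return sorted((math.log(f), math.log(c)) for f, c in table.items() if f > 0 and c > 0)


def interpolate(points, x):
    """Piecewise-linear interpolation, clamped to the end values outside the range."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    return points[0][1] if x < points[0][0] else points[-1][1]


def mean_loglog_gap(table_a, table_b):
    """Area between the two log-log curves divided by the shared width."""
    pa, pb = loglog_points(table_a), loglog_points(table_b)
    lo = max(pa[0][0], pb[0][0])
    hi = min(pa[-1][0], pb[-1][0])
    if hi <= lo:  # no shared range: fall back to the gap at a single point
        return abs(interpolate(pa, lo) - interpolate(pb, lo))
    grid = sorted({x for x, _ in pa + pb if lo <= x <= hi} | {lo, hi})
    gaps = [abs(interpolate(pa, x) - interpolate(pb, x)) for x in grid]
    # Trapezoid rule; approximate where the curves cross between grid points.
    area = sum(0.5 * (g0 + g1) * (x1 - x0)
               for x0, x1, g0, g1 in zip(grid, grid[1:], gaps, gaps[1:]))
    return area / (hi - lo)


a = {1: 100, 10: 10, 100: 1}
b = {1: 1000, 10: 100, 100: 10}   # same shape, ten times the counts
print(mean_loglog_gap(a, b))
```

With raw counts, a vendor that is simply ten times larger sits a constant
log(10) above the other curve; converting counts to shares first would remove
that offset.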

T*****u
Posts: 7103
18
Could you explain what "average height of the area between two log-log
curves" means?
[In reply to c***z, post 17]

T*****u
Posts: 7103
19
I'm not quite sure what you are doing, but frequency-of-occurrence data like
this feels like it could be related to Zipf's distribution, or to the
log-normal distribution.
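
A quick way to check for a Zipf-like (power-law) shape is to see whether
log(count) against log(frequency) is roughly a straight line and to look at
its slope. The sketch below fits that slope by ordinary least squares; the
function name is hypothetical, and a least-squares fit on a log-log plot is
only a rough diagnostic, not a rigorous power-law test.

```python
import math


def loglog_slope(table):
    """Least-squares slope of log(count) against log(frequency)."""
    points = [(math.log(f), math.log(c)) for f, c in table.items() if f > 0 and c > 0]
    if len(points) < 2:
        raise ValueError("need at least two distinct frequencies")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


print(loglog_slope({1: 1000, 10: 100, 100: 10}))
```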
c***z
Posts: 6348
20
Thanks a lot! Will take a look at the Zipf stuff.
Just realized that the MKFC metric is just the Cramér-von Mises stat using
raw counts instead of probability mass. Will try Cramér-von Mises instead. :)
http://journal.r-project.org/archive/2011-2/RJournal_2011-2_Arn
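
For reference, a Cramér-von Mises-type distance between two frequency/count
tables can be sketched as the squared gap between the two cumulative
distributions, weighted by probability mass. This is an assumption about the
general form rather than the exact statistic the poster planned to use: the
weights here are the average of the two vendors' probability masses, and the
tables are assumed non-empty.

```python
def cumulative_shares(table, support):
    """Cumulative probability over the sorted support for one table."""
    total = sum(table.values())
    running, out = 0, []
    for freq in support:
        running += table.get(freq, 0)
        out.append(running / total)
    return out


def cvm_distance(table_a, table_b):
    """Sum over frequencies of weight * (F_a - F_b)^2, using probability mass."""
    support = sorted(set(table_a) | set(table_b))
    cdf_a = cumulative_shares(table_a, support)
    cdf_b = cumulative_shares(table_b, support)
    total_a, total_b = sum(table_a.values()), sum(table_b.values())
    distance = 0.0
    for freq, fa, fb in zip(support, cdf_a, cdf_b):
        # Weight: average probability mass of the two tables at this frequency.
        weight = 0.5 * (table_a.get(freq, 0) / total_a + table_b.get(freq, 0) / total_b)
        distance += weight * (fa - fb) ** 2
    return distance


print(cvm_distance({1: 3, 2: 1}, {1: 1, 2: 3}))
print(cvm_distance({1: 3, 2: 1}, {1: 100, 2: 300}))   # same shares, larger vendor
```

Working with probability mass rather than raw counts makes the distance
independent of vendor size, which is the difference the post points out.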
m********t
Posts: 94
21
I'm really not familiar with this stuff; I'm following your thread to see
how the problem actually gets solved.
I'm a bit curious, though: why hierarchical clustering? I know it is more
convenient to compute.
Anything beyond that?
[In reply to c***z, post 20]

c***z
Posts: 6348
22
I have been asking my boss the same question, about the practical use of
this abstract metric...
The reason we can't use k-means is that these metrics are actually not real
metrics, as they don't satisfy the triangle inequality, and hence the mean
means nothing (convergence of mean doesn't imply convergence of variance).
The only thing I can think of, then, is hierarchical clustering...
m********t
Posts: 94
23
Maybe I never understood from the very beginning what your metrics actually
are...
Also, don't you still need to compute distances for the hierarchical
method?
I'm not familiar with that fuzzy model of yours... Can it avoid the distance
computation?
[In reply to c***z, post 22]

c***z
Posts: 6348
24
Strictly speaking, these distances are not metrics but ordinals, so I can do
hierarchical clustering using the order, iirc. :)
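
The point about hierarchical clustering is that it only needs a table of
pairwise dissimilarities, not means or the triangle inequality. A minimal
sketch with average linkage follows; the linkage choice and the function name
are assumptions, since the thread does not say which linkage was used, and
the distance matrix is made up for illustration.

```python
def average_linkage(dist, n_clusters):
    """Agglomerative clustering on a precomputed symmetric dissimilarity matrix.

    dist is a square list of lists; returns a list of clusters (lists of indices).
    """
    clusters = [[i] for i in range(len(dist))]

    def linkage(c1, c2):
        # Average pairwise dissimilarity between the members of two clusters.
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters


# Made-up dissimilarities for four vendors: 0 and 1 are close, 2 and 3 are close.
dist = [
    [0, 1, 9, 8],
    [1, 0, 7, 9],
    [9, 7, 0, 2],
    [8, 9, 2, 0],
]
print(average_linkage(dist, 2))
```

This naive loop recomputes every linkage at every merge, so it is only
suitable for small examples; a library implementation would be the practical
choice for thousands of vendors.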
c***z
Posts: 6348
25
Made some more progress. I ran 100 trials on 7 metrics, each with 200
vendors, and recorded the F1 scores. Attached is the plot of the F1 scores.
[In reply to c***z, post 24]

c***z
Posts: 6348
26
The current question is to investigate the misclassified vendors (e.g. a
vendor that is hand-labeled good, the first letter being "G", but that the
algorithm puts in the "bad" cluster).
The plots of TP and FN are awfully close to each other; so are TN and FP.
I am totally clueless now (as always)...
Any suggestion is extremely welcome! The x-axis is recurrence (i.e.
frequency) and the y-axis is traffic (i.e. total volume of records with
locations repeated that many times).
[In reply to c***z, post 25]

c***z
Posts: 6348
27
Same comparison, in percentiles of recurrence and percentages of traffic.
[In reply to c***z, post 26]

c***z
Posts: 6348
28
Same comparison, in log-log.
[In reply to c***z, post 27]

T*****u
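Posts: 7103
29
Impressive work, Brother Chao. As long as no trade secrets are disclosed, I
would love to see more hands-on posts like this. They are so useful.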
c***z
Posts: 6348
30
Interim summary
Overall, this task can be conducted iteratively between two steps: the
training step, which clusters labeled samples, and the bootstrapping step,
which adds unlabeled samples to increase coverage. Currently we can consider
the first iteration of the training step complete and move on to the
bootstrapping step.
1. 2000+ good and 2000+ bad partners were provided;
2. I conducted hierarchical clustering analysis with seven metrics on a set
of good and bad samples; luckily the clusters are highly correlated with the
hand labeling, in other words the in-group distances are usually smaller
than the between-group distances;
3. the four top-performing metrics were identified with 100 trials on 200
samples each;
4. consistently misclassified samples were identified, but investigation of
the cause is currently on hold, as there is no clear clue why they are
mislabeled;
5. an attempt to run a trial on 4000 samples hit engineering difficulty: R
is inefficient with computation at this scale;
6. I am currently working on the bootstrapping step to increase the coverage
of labels; several methods are being considered:
6a. we can measure the distance from the unlabeled sample to a typical
good point and to a typical bad point, then compare the two to decide a
label; finding typical good and bad points is troublesome, though;
6b. we can also find the nearest neighbors of the unlabeled sample and
decide a label based on them; we can use all four metrics and conduct a vote
(ensemble learning);
6c. we can also view this in the Bayesian way, i.e. assume the unknown
sample is good, find its nearest neighbor, and label the unknown with its
neighbor's label; the mean in-group and mean between-group distances can be
used to produce confidence;
6d. we can also use supervised learning, with the percentile percentages as
features;
6e. confidence intervals are doable but require more research;
6f. engineering to scale up is doable as well, need to pick up Java or Scala
(for Spark).
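
Option 6b above, nearest-neighbor labeling with a vote across metrics, can be
sketched as follows. The three distance functions are simple stand-ins chosen
for illustration, not the four metrics selected in the thread, and the
labeled tables are made up.

```python
import math
from collections import Counter


def shares(table):
    total = sum(table.values())
    return {freq: count / total for freq, count in table.items()}


def total_variation(a, b):
    pa, pb = shares(a), shares(b)
    return 0.5 * sum(abs(pa.get(f, 0.0) - pb.get(f, 0.0)) for f in set(pa) | set(pb))


def log_max_gap(a, b):
    # Gap between the largest recurrence seen by each vendor, on a log scale.
    return abs(math.log(max(a)) - math.log(max(b)))


def mean_recurrence_gap(a, b):
    def mean(t):
        return sum(f * c for f, c in t.items()) / sum(t.values())
    return abs(mean(a) - mean(b))


def nearest_label(sample, labeled, distance):
    """Label of the labeled table closest to the sample under one distance."""
    _, label = min(labeled, key=lambda item: distance(sample, item[0]))
    return label


def vote_label(sample, labeled, distances):
    """Majority vote over the nearest-neighbor labels from each distance."""
    votes = Counter(nearest_label(sample, labeled, d) for d in distances)
    return votes.most_common(1)[0][0]   # ties go to the label counted first


labeled = [
    ({1: 95, 2: 5}, "good"),
    ({1: 90, 2: 8, 3: 2}, "good"),
    ({1: 50, 1000: 50}, "bad"),
]
unknown = {1: 60, 800: 40}
metrics = [total_variation, log_max_gap, mean_recurrence_gap]
print(vote_label(unknown, labeled, metrics))
```

In this toy example the metrics do not all agree on the nearest neighbor,
which is the situation the vote is meant to handle. Using an odd number of
metrics avoids ties when there are only two labels.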
c***z
Posts: 6348
31
Any suggestions and comments are extremely welcome! Thanks a lot!
c***z
Posts: 6348
32
Made some more progress. Using some better data, and after correcting for
flipped clusters (i.e. usually the bad points are in cluster 1, but
occasionally they like cluster 2 better), I got 95% accuracy in clustering
the points.
Now for the bootstrap step: I labeled each test point with the label of its
nearest neighbor, and got 80% accuracy using a majority vote across the
metrics. I am modifying the algorithm to allow more false positives and
fewer false negatives, as required by the business.
The real headache is that when I look at the mislabeled cases, I have no clue
why they are mislabeled, and hence cannot make improvements.
Any suggestions and comments are extremely welcome! Thanks a lot!
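
The flipped-cluster correction can be sketched as trying both ways of naming
the two clusters and keeping the one that agrees better with the hand labels.
The function name and the toy data are hypothetical; note that this
orientation step uses the hand labels, so it is suitable for evaluation but
not for labeling unseen vendors.

```python
def align_and_score(cluster_ids, hand_labels, names=("bad", "good")):
    """Return (accuracy, predicted labels) for the better of the two cluster namings."""
    ids = sorted(set(cluster_ids))
    if len(ids) != 2:
        raise ValueError("expected exactly two clusters")
    best = None
    for first, second in (names, names[::-1]):
        mapping = {ids[0]: first, ids[1]: second}
        predicted = [mapping[c] for c in cluster_ids]
        accuracy = sum(p == h for p, h in zip(predicted, hand_labels)) / len(hand_labels)
        if best is None or accuracy > best[0]:
            best = (accuracy, predicted)
    return best


# Toy run in which the bad vendors landed in cluster 2 instead of cluster 1.
cluster_ids = [2, 2, 1, 1, 1]
hand_labels = ["bad", "bad", "good", "good", "bad"]
accuracy, predicted = align_and_score(cluster_ids, hand_labels)
print(accuracy, predicted)
```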
c***z
发帖数: 6348
33
Hi all,
This is my first project in the new company, and it is about third party
data quality. There is no gold standard for quality, but we know that
repetition of location in the dataset might imply bad quality, because in
this case the location might come from a centroid (e.g. a cell tower, rather
than a cell phone).
There is also no ground truth about which datasets are good, but we know
some good ones, particularly the channels we own.
We are exploring the relationship between data quality of a vendor and the
distance of its location distribution from the known good ones. Here comes
the other moving part, what does distance mean here. Basically, each vendor
provides us requests to display ads, with the request there is location.
Hence we can group by location and see how many times each appears. We can
then group by frequency and see how many locations appear that many times.
This way each vendor gives a contingency table with two columns: frequency
and count.
In terms of comparing contingency table, what would you suggest?
Or should I go back to the raw data, or the intermediate table (location and
frequency)?
Thanks a lot!
w**p
发帖数: 4080
34
可以建一个index来量化数据质量的好坏,比如location repeat两次的pentaly是多少
,repeat三次的penalty是多少。而penalty多少可以根据business的理解来给,并且这
是可以调整的。
这样每个vendor都会得到一个data quality的分数。然后再map这个分数based on
distance to the good vender?
个人浅见,高手莫笑。
c***z
发帖数: 6348
35
There is a scale problem in this approach, some vendors have 100 times more
data. But maybe we can try normalizing to percentages...
g*****o
发帖数: 1564
36
看到contigency table就想到chi-square和fisher.exact test了

rather

【在 c***z 的大作中提到】
: Hi all,
: This is my first project in the new company, and it is about third party
: data quality. There is no gold standard for quality, but we know that
: repetition of location in the dataset might imply bad quality, because in
: this case the location might come from a centroid (e.g. a cell tower, rather
: than a cell phone).
: There is also no ground truth about which datasets are good, but we know
: some good ones, particularly the channels we own.
: We are exploring the relationship between data quality of a vendor and the
: distance of its location distribution from the known good ones. Here comes

c***z
发帖数: 6348
37
I tried Chi square and G tests, but sometimes a bad partner is closer to a
good one than another good one (e.g. dist(G1, G2) > dist(G1, B1)). Also,
they both drop the zero cells, while these cells are important to us (e.g.
one bad partner has locations that repeat millions of times, while this
never happen for good partners, and in G test this case will be omitted).
Fisher's test is exponential and too slow for our case, while there are
thousands of rows. Maybe we can rebin the table to make it fewer rows, but I
would like to delay rebinning as much as possible, since it loses
information.
Thanks a lot!
c***z
发帖数: 6348
38
In some sense this is similar to the word distributions in documents and I
am measuring the distance between the documents using the count tables (
rather, aggregated count tables with only two columns: frequency and count).
Another analogy I can think of is the wealth distribution (e.g. Gini index).
Any suggestions are extremely welcome! Thanks a lot!
l*******m
发帖数: 1096
39
do you have other data, such as user id, ip address timestamp, carrier id,
app id....
with additional info, it is much easier

rather

【在 c***z 的大作中提到】
: Hi all,
: This is my first project in the new company, and it is about third party
: data quality. There is no gold standard for quality, but we know that
: repetition of location in the dataset might imply bad quality, because in
: this case the location might come from a centroid (e.g. a cell tower, rather
: than a cell phone).
: There is also no ground truth about which datasets are good, but we know
: some good ones, particularly the channels we own.
: We are exploring the relationship between data quality of a vendor and the
: distance of its location distribution from the known good ones. Here comes

l******n
发帖数: 9344
40
The statistical tests on contingency table mentioned in previous posts do
not help in this case, because they only tell you whether they are different
. As Gini index, it tells you how inequality the income across a nation's
papulation, but does not tell you which population has good income.
What you need is criteria to measure the goodness of the data.I would
suggest you use entropy or some form of variation.

).
).

【在 c***z 的大作中提到】
: In some sense this is similar to the word distributions in documents and I
: am measuring the distance between the documents using the count tables (
: rather, aggregated count tables with only two columns: frequency and count).
: Another analogy I can think of is the wealth distribution (e.g. Gini index).
: Any suggestions are extremely welcome! Thanks a lot!

相关主题
请问常考的cluster algorithm有哪些Some thoughts on data science and data scientists
有关clustering问个问题:一堆(1M)二维座标系的点,每个点有weight,怎么做clustering?
Optimization over more than one metricsScience杂志一篇关于clustering的新文章 (转载)
进入DataSciences版参与讨论
g*****o
发帖数: 1564
41
I'm not really sure.
as also mentioned using entropy.
would Mutual Information or KL-divergence be used based on the count (bin)
data of the locations between good and bad vendors?

I

【在 c***z 的大作中提到】
: I tried Chi square and G tests, but sometimes a bad partner is closer to a
: good one than another good one (e.g. dist(G1, G2) > dist(G1, B1)). Also,
: they both drop the zero cells, while these cells are important to us (e.g.
: one bad partner has locations that repeat millions of times, while this
: never happen for good partners, and in G test this case will be omitted).
: Fisher's test is exponential and too slow for our case, while there are
: thousands of rows. Maybe we can rebin the table to make it fewer rows, but I
: would like to delay rebinning as much as possible, since it loses
: information.
: Thanks a lot!

w**p
发帖数: 4080
42
This is what I meant at the first point.
Create a data quality score using some criteria then analyze the
relationship between this score and the distance.
Or, in other words, you can calculate a "distance" using the location
repetition frequency. A good definition of this "distance" and an
appropriate transformation will finally make it has a linear relation with
the physical distance.

different

【在 l******n 的大作中提到】
: The statistical tests on contingency table mentioned in previous posts do
: not help in this case, because they only tell you whether they are different
: . As Gini index, it tells you how inequality the income across a nation's
: papulation, but does not tell you which population has good income.
: What you need is criteria to measure the goodness of the data.I would
: suggest you use entropy or some form of variation.
:
: ).
: ).

c***z
发帖数: 6348
43
I had the same concern that there might not be some intrinsic relationship
between the distance/difference and quality/performance. I also proposed
that we should focus on the goodness of the data. But at this moment I am
asked to focus on the distance.
I think the logic of my boss is to build wheels first then find a way to use
it, rather than study if we need the wheel first.
PS: I don't have other data yet, not very familiar with all the data yet.
PS2: I tried G test which is related to KL-divergence, but it didn't work
well.
PS3: I don't have physical locations yet, the tables I have are aggregated
to one level higher, containing only two columns: location frequency and how
many locations are repeated that many times. Maybe I should propose to go
back to the finer level table with location and frequency.
PS4: Just tried cosine distance, and it is not working well either. Some bad
partners are closer to good ones than they are to each other.
Thanks so much for your replies!

different

【在 l******n 的大作中提到】
: The statistical tests on contingency table mentioned in previous posts do
: not help in this case, because they only tell you whether they are different
: . As Gini index, it tells you how inequality the income across a nation's
: papulation, but does not tell you which population has good income.
: What you need is criteria to measure the goodness of the data.I would
: suggest you use entropy or some form of variation.
:
: ).
: ).

c***z
发帖数: 6348
44
The problem is that we have neither a good criteria for quality nor for
distance nor an intrinsic relationship between the two...

【在 w**p 的大作中提到】
: This is what I meant at the first point.
: Create a data quality score using some criteria then analyze the
: relationship between this score and the distance.
: Or, in other words, you can calculate a "distance" using the location
: repetition frequency. A good definition of this "distance" and an
: appropriate transformation will finally make it has a linear relation with
: the physical distance.
:
: different

l******n
发帖数: 9344
45
I have many projects like this which is more of science project other than
real business project. I usually go back to the client and ask for
clarification and objectives. Also it is the opportunity to educate your
client what can be done and what can't.
It is your show time, and don't be too shy to say it does not make sense.

【在 c***z 的大作中提到】
: The problem is that we have neither a good criteria for quality nor for
: distance nor an intrinsic relationship between the two...

c***z
发帖数: 6348
46
唉,还是比较难做到啊,尤其是才开始工作,还没有多少credit
我也是反复ask for clarification and objectives,领导从一开始说free end到确定
要distance,我也就弄distance。实在不行了再跟头说我们还是弄performance吧。

【在 l******n 的大作中提到】
: I have many projects like this which is more of science project other than
: real business project. I usually go back to the client and ask for
: clarification and objectives. Also it is the opportunity to educate your
: client what can be done and what can't.
: It is your show time, and don't be too shy to say it does not make sense.

m******a
发帖数: 77
47
读了很多遍,还是没弄明白楼主到底想干啥,问题没有很好定义,很难下手
能否先抛开这些数据, 从Business 那边来看这个问题, 看看他们到底想干什么
知道Business 那边的目的了,再回头看这些数据怎样用
现在好象是连在什么 Level - Location 还是Vendor 上来定义问题都不清楚
想了两个办法, aggregate to some level 之后
1. cluster 的办法, 看看能否和已知的好的 cluster 到一块
2. classification 的办法, 看看 score 是否像好的, 试试机器学习中
bootstrapping 的办法 - 和统计中的bootstrapping 不是一个东西
但不知道到底有多大的 sample

[In reply to c***z]

c***z
Posts: 6348
48
Exactly. I have proposed starting from the business questions.
And this is the reply from my boss:
"I am not clear what kind of answers from 'business' you are looking for. It
has always been the same: Ability to differentiate good location quality
traffic from bad location quality traffic."
Still, I have no idea what "good traffic" means. There is just a bunch of
good/bad traffic samples, which we need to generalize into a definition.
So we don't have a definition of goodness, nor a definition of the metric,
nor an idea about the intrinsic relation between the two. We are just
exploring.
I tried clustering with a few data points (each vendor is a point), and the
bad ones are mixed in with the good ones. The metrics I used are X^2, G,
RMSE, cosine, area between curves, etc.
I also tried classification, but there are too few features and data points,
and there is serious overfitting.
Can you explain a bit about the difference between bootstrapping in ML and
in statistics?
Thanks so much!

[In reply to m******a]

c***z
Posts: 6348
49
Made some progress. I ran a clustering analysis on 150 vendors (112 good ones
and 38 bad ones), using a strange metric (average height of the area
between two log-log curves).
The result is almost too good to be true: in group 1, everyone is bad; in
group 2, everyone except one is good.
The interesting thing is that as I throw in more data points, things can get
worse or better...
Take a look at the picture. Any suggestions and comments are extremely
welcome!

[In reply to c***z]

T*****u
Posts: 7103
50
Could you explain what "average height of the area between two log-log
curves" means?

[In reply to c***z]

T*****u
Posts: 7103
51
I am not quite sure what you are doing, but frequency-of-occurrence data
like this feels like it may be related to Zipf's distribution, or to the
log-normal distribution.
c***z
Posts: 6348
52
Thanks a lot! Will take a look at the Zipf stuff.
Just realized that the MKFC metric is just the Cramér-von Mises statistic
using raw counts instead of probability mass. Will try Cramér-von Mises
instead. :)
http://journal.r-project.org/archive/2011-2/RJournal_2011-2_Arn
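A minimal sketch of that comparison, assuming each vendor is summarized as a
table mapping recurrence (how many times a location repeats) to the number of
locations with that recurrence. The name `cvm_distance` and the toy tables are
hypothetical. The statistic is the two-sample Cramér-von Mises criterion
evaluated on the pooled support, with ties handled naively, so treat it as an
illustration rather than the exact computation used here.

```python
def cvm_distance(table_a, table_b):
    """Two-sample Cramer-von Mises criterion for two {value: count} tables."""
    n = sum(table_a.values())
    m = sum(table_b.values())
    if n == 0 or m == 0:
        raise ValueError("both tables need at least one observation")

    cum_a = cum_b = 0
    total = 0.0
    # Walk the pooled support in order, tracking both empirical CDFs.
    for x in sorted(set(table_a) | set(table_b)):
        count_a = table_a.get(x, 0)
        count_b = table_b.get(x, 0)
        cum_a += count_a
        cum_b += count_b
        diff = cum_a / n - cum_b / m
        # Each pooled observation at x contributes one squared CDF gap.
        total += (count_a + count_b) * diff * diff
    return n * m / (n + m) ** 2 * total


# Toy (made-up) recurrence tables: recurrence -> number of locations.
good = {1: 900, 2: 80, 3: 20}
suspect = {1: 300, 2: 200, 50: 500}
print(cvm_distance(good, good))
print(cvm_distance(good, suspect))
```

Dividing the cumulative counts by `n` and `m` is what turns raw counts into
probability mass; dropping that normalization gives a count-based variant
that also reacts to vendor size.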
m********t
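Posts: 94
53
I am really not familiar with this stuff, so I am following your thread to
see how the problem gets solved in practice.
I am a bit curious, though: why hierarchical clustering? I know it is
computationally more convenient, but is there any other reason?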

[In reply to c***z]

c***z
Posts: 6348
54
I have been asking my boss the same question about the practical use of
this abstract metric...
The reason we can't use k-means is that these distances are not true
metrics: they don't satisfy the triangle inequality, so the mean is
meaningless (convergence of the mean doesn't imply convergence of the
variance).
The only thing I can think of is then hierarchical clustering...
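To illustrate why hierarchical clustering is still available here, below is a
small sketch of agglomerative clustering with complete linkage that works
directly on a matrix of pairwise dissimilarities and never computes a mean.
The function name `complete_linkage` and the toy matrix are hypothetical, and
complete linkage is only one possible choice of linkage, not necessarily the
one used in this project.

```python
def complete_linkage(dissim, n_clusters):
    """Agglomerative clustering from a symmetric dissimilarity matrix.

    Complete linkage merges the two clusters whose farthest pair of members
    is closest. It only compares pairwise values, so no centroid is needed.
    """
    if n_clusters < 1 or n_clusters > len(dissim):
        raise ValueError("n_clusters must be between 1 and the number of points")

    clusters = [[i] for i in range(len(dissim))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                link = max(dissim[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or link < best[0]:
                    best = (link, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Toy dissimilarities between four vendors (made-up numbers).
dissim = [
    [0, 1, 8, 9],
    [1, 0, 7, 8],
    [8, 7, 0, 2],
    [9, 8, 2, 0],
]
print(complete_linkage(dissim, 2))  # [[0, 1], [2, 3]]
```

Because each merge decision only compares dissimilarities with one another,
any order-preserving rescaling of the matrix (squaring the entries, for
example) yields the same clusters, barring ties.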
m********t
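Posts: 94
55
Maybe I never understood from the very beginning what your metrics actually
are...
Also, don't you need to compute distances for the hierarchical method too?
I am not familiar with that fuzzy model of yours... Does it avoid the problem
of computing distances?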

[In reply to c***z]

c***z
Posts: 6348
56
Strictly speaking, these distances are not metrics but ordinal
dissimilarities, so I can do hierarchical clustering using only their order,
iirc. :)
c***z
Posts: 6348
57
Made some more progress. I ran 100 trials on 7 metrics, each with 200
vendors, and recorded the F1 scores. Attached is the plot of the F1 scores.
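For readers unfamiliar with the score, here is a minimal sketch of how an F1
score can be computed from hand labels and cluster assignments. Which class
counts as "positive" is an assumption (the `positive` parameter defaults to
"bad" purely for illustration), and `f1_score` is a hypothetical helper, not
the code used for the plot.

```python
def f1_score(truth, predicted, positive="bad"):
    """Harmonic mean of precision and recall for the `positive` class."""
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero/undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: one good vendor ends up in the "bad" cluster.
truth = ["good", "good", "bad", "bad"]
predicted = ["good", "bad", "bad", "bad"]
print(f1_score(truth, predicted))
```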

[In reply to c***z]

c***z
Posts: 6348
58
The current question is how to investigate the misclassified vendors (e.g. a
vendor that is hand-labeled good, the first letter being "G", but that the
algorithm puts in the "bad" cluster).
The plots of TP and FN are awfully close to each other, and so are those of
TN and FP.
I am totally clueless now (as always)...
Any suggestions are extremely welcome! The x-axis is recurrence (i.e.
frequency) and the y-axis is traffic (i.e. total volume of records with
locations repeated that many times).

[In reply to c***z]

c***z
Posts: 6348
59
Same comparison, in percentiles of recurrence and percentages of traffic.

[In reply to c***z]

c***z
Posts: 6348
60
Same comparison, in log-log.

[In reply to c***z]

T*****u
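Posts: 7103
61
Impressive work, Brother Chao. As long as no trade secrets are revealed, I
would love to see more hands-on posts like this. They are so useful.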
c***z
Posts: 6348
62
Interim summary
Overall this task can be conducted iteratively between two steps: the
training step, using clustering of labeled samples, and the bootstrapping
step, adding unlabeled samples to increase coverage. Currently we can
consider the first iteration of the training step complete and move on to the
bootstrapping step.
1. 2000+ good and 2000+ bad partners were provided;
2. I conducted hierarchical clustering analysis with seven metrics on a set
of good and bad samples; luckily the clusters are highly correlated with the
hand labeling - in other words the in-group distances are usually smaller
than the between-group distances;
3. four top-performing metrics were identified with 100 trials on 200 samples
each;
4. consistently misclassified samples were identified, but investigation of
the cause is currently on hold - no clear clue as to why they are mislabeled;
5. an attempt to run a trial on 4000 samples ran into engineering
difficulty - R is inefficient with such large-scale computation;
6. I am currently working on the bootstrapping step to increase the coverage
of labels; there are several methods being considered:
6a. we can measure the distance from the unlabeled sample to a typical
good point and to a typical bad point, then compare the two to decide a
label; finding typical good and bad points is troublesome though;
6b. we can also find the nearest neighbors of the unlabeled sample and
decide a label based on them; we can use all four metrics and conduct a vote
(ensemble learning);
6c. we can also view this in the Bayesian way, i.e. assume the unknown
sample is good, find its nearest neighbor, and label the unknown with its
neighbor's label; the mean in-group and mean between-group distances can be
used to produce a confidence;
6d. we can also use supervised learning, with the percentile percentages as
features;
6e. confidence intervals are doable but require more research;
6f. engineering to scale up is doable as well; I need to pick up Java or
Scala (for Spark).
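Option 6b could look roughly like the sketch below: each distance function
finds the nearest labeled sample and casts one vote, and the majority label
wins. Everything here is an assumption for illustration: the helper names
`nearest_label` and `vote_label`, the two-number feature vectors, and the
three toy distances standing in for the real metrics.

```python
from collections import Counter


def nearest_label(sample, labeled, distance):
    """Label of the labeled sample closest to `sample` under one distance."""
    _, label = min(labeled, key=lambda item: distance(sample, item[0]))
    return label


def vote_label(sample, labeled, distances):
    """Majority vote over the nearest-neighbor labels from each distance."""
    votes = Counter(nearest_label(sample, labeled, d) for d in distances)
    # On a tie, most_common keeps the label that was voted for first.
    return votes.most_common(1)[0][0], votes


# Toy stand-ins for the real metrics (hypothetical).
def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def linf(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Made-up labeled vendors: (features, hand label).
labeled = [
    ((0.9, 0.1), "good"),
    ((0.8, 0.2), "good"),
    ((0.2, 0.8), "bad"),
]
label, votes = vote_label((0.3, 0.7), labeled, [l1, linf, squared_l2])
print(label, dict(votes))
```

With an even number of metrics, such as the four mentioned above, ties are
possible, so the tie-breaking rule should be chosen deliberately rather than
left to insertion order as in this sketch.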
c***z
Posts: 6348
63
Any suggestions and comments are extremely welcome! Thanks a lot!
c***z
Posts: 6348
64
Made some more progress. Using some better data, and after correcting for
flipped clusters (i.e. usually the bad points are in cluster 1, but
occasionally they like cluster 2 better), I had 95% accuracy in clustering
the points.
Now for the bootstrap step: I labeled each test point with the label of its
nearest neighbor, and had 80% accuracy using a majority vote by the metrics.
I am modifying the algorithm so that I can allow more false positives and
fewer false negatives, as required by the business.
The real headache is that when I look at the mislabeled cases, I have no clue
why they are mislabeled - hence I cannot make improvements.
Any suggestions and comments are extremely welcome! Thanks a lot!
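The flipped-cluster correction mentioned above can be sketched as follows:
cluster ids are arbitrary, so try both ways of naming the two clusters and
keep the one that agrees best with the hand labels. The helper name
`align_two_clusters` and the toy data are hypothetical, and this only makes
sense on samples that already have hand labels.

```python
def align_two_clusters(cluster_ids, hand_labels, names=("good", "bad")):
    """Pick the cluster-id -> label mapping that best matches the hand labels."""
    ids = sorted(set(cluster_ids))
    if len(ids) != 2:
        raise ValueError("expected exactly two clusters")

    best_mapping, best_accuracy = None, -1.0
    for candidate in (names, names[::-1]):
        mapping = dict(zip(ids, candidate))
        hits = sum(mapping[c] == h for c, h in zip(cluster_ids, hand_labels))
        accuracy = hits / len(hand_labels)
        if accuracy > best_accuracy:
            best_mapping, best_accuracy = mapping, accuracy
    return best_mapping, best_accuracy


# Toy trial where the bad points landed in cluster 2 instead of cluster 1.
cluster_ids = [2, 2, 2, 1, 1]
hand_labels = ["bad", "bad", "good", "good", "good"]
mapping, accuracy = align_two_clusters(cluster_ids, hand_labels)
print(mapping, accuracy)
```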
c***z
Posts: 6348
65
Finally finished the project.
Summary of Findings. Overall we were able to verify the hypothesis regarding
partner quality and location recurrence; we were also able to design a fast
mechanism to classify partners by data quality based on location recurrence;
finally, we were able to identify and correct for errors in hand labeling.
The engineers will take over implementation, even though I would like to do
it myself on Scala + Spark...
c***z
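Posts: 6348
66
Two plots attached. You can see that the false negatives and true negatives
are very similar, as are the positives. This suggests that the algorithm is
quite possibly right and the hand labels are wrong. The business people have
gone to verify.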

[In reply to c***z]
