[Data Science Project Case] Bias Correction - DataSciences board

Excerpt archived from the corresponding thread on 未名空间 (MITBBS).

c***z
Posts: 6348

This is the main project I am working on. I would greatly appreciate it if you
could share some insight. :)
Say you have a huge data set of online activity, and you know the data set is
biased, i.e. most of the information was collected from young males making
less than 50K. However, you don't know how large the bias is.
To sell data products to online marketing companies, you need to correct for
the bias, so that your products work for the whole online population, or for a
different population, such as young females making more than 50K.
What should you do?
Thanks!

r*******y
Posts: 626

Do you mean you cannot infer individual demographic info directly from the
available data? If so, I would suggest bringing in more data sources (e.g.
from a third party) to cross-check your data.

[In reply to c***z]

d****n
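Posts: 12461

One idea:
First look at what changes between males making 30K and males making 50K, then
extrapolate what changes between females making 120K and females making 150K.
Of course, you must have a baseline for high-income females.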

[In reply to c***z]

s*********e
Posts: 1051

1. Build a lookalike model to predict the likelihood of being a young male,
and derive the propensity score from it.
2. Use the propensity score to correct the bias.
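
One common reading of step 2 is inverse-propensity weighting, sketched below. It assumes the lookalike model has already produced, for each panelist, a score p that approximates the chance that someone with those features ends up in the panel. The function name `ipw_mean`, the `floor` clipping value, and the toy numbers are hypothetical and not from the thread.

```python
def ipw_mean(values, propensities, floor=0.01):
    """Self-normalized inverse-propensity-weighted mean.

    values:        outcome measured on each panelist (e.g. 1 if they bought).
    propensities:  assumed P(in panel | features) from some lookalike model.
    floor:         hypothetical clipping threshold to keep weights bounded.
    """
    if len(values) != len(propensities):
        raise ValueError("values and propensities must have the same length")
    # Over-represented panelists (high p) get small weights, rare ones large.
    weights = [1.0 / max(p, floor) for p in propensities]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


# Toy panel: three over-represented panelists (p = 0.8) and one rare one (p = 0.2).
values = [1, 1, 1, 0]
propensities = [0.8, 0.8, 0.8, 0.2]

raw_mean = sum(values) / len(values)        # unweighted panel average
corrected = ipw_mean(values, propensities)  # reweighted average
```

In this toy case the single rare panelist carries as much weight as four of the common ones, so the corrected mean moves well away from the raw one. The correction is only as good as the propensity scores themselves.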

[In reply to c***z]

h********3
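Posts: 2075

Two methods come to mind. One is stratified sampling: split people into three
tiers (high, middle, low), draw 100 people from each tier, and then estimate on
the people drawn.
The other is importance sampling. The function you want to estimate is f(x);
p(x) is the probability density of x under your biased data, and q(x) is the
probability density of x in the whole online population. Assuming you can
obtain q(x) from another data set covering the online population, importance
sampling gives you the expectation of f(x) under q(x).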
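
A minimal sketch of the importance-sampling estimate described here: samples drawn under the biased p(x) are reweighted by q(x)/p(x), and the weights are normalized to sum to one. The three income tiers, both distributions, and the per-tier spend figures are made-up illustrations.

```python
def importance_weighted_mean(samples, f, p, q):
    """Estimate E_q[f(x)] from samples drawn under p, using weights q(x)/p(x)."""
    weights = [q[x] / p[x] for x in samples]
    return sum(w * f(x) for w, x in zip(weights, samples)) / sum(weights)


# Hypothetical distributions over three income tiers.
p = {"low": 0.7, "mid": 0.2, "high": 0.1}   # biased panel
q = {"low": 0.3, "mid": 0.4, "high": 0.3}   # whole online population

# Hypothetical quantity of interest: average spend per tier.
spend = {"low": 10.0, "mid": 20.0, "high": 40.0}

# A panel sample whose tier frequencies match p exactly (7 / 2 / 1).
samples = ["low"] * 7 + ["mid"] * 2 + ["high"] * 1

naive = sum(spend[x] for x in samples) / len(samples)
estimate = importance_weighted_mean(samples, spend.get, p, q)
```

The naive panel average is pulled toward the low tier, while the reweighted estimate matches the expectation under q. The approach breaks down when p(x) is close to zero for some x that matters under q, because a handful of samples then dominate the sum.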

m***o
Posts: 225

I did a similar project a while ago, based on a paper published by Google
Research. You can search for it.
The title is roughly "bias correction theory".

[In reply to c***z]

c***z
Posts: 6348

Thank you all for your input!
We have a panel of click data as well as search data, and we bought site
demographic data (percentage of male visitors per site, etc.). We are trying
to infer gender from these two data sets, using Bayesian updating and the
propensity score method.
However, the result is still around 50-50. :)
Will keep working on this.
mibco, is this the paper? http://www.cs.nyu.edu/~mohri/pub/bias.pdf
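
For reference, a Bayesian update of this kind could be sketched as below. This is not the poster's actual model. It assumes a naive-Bayes setup in which site visits are independent given gender, a prior male share of 0.5, and a clipping range of 0.01 to 0.99; the function name `posterior_male` is hypothetical.

```python
import math


def posterior_male(site_male_shares, prior=0.5):
    """Naive-Bayes posterior P(male | visited sites).

    site_male_shares: male share of visitors for each site the person visited,
                      taken from the purchased site demographic data.
    prior:            assumed male share of the overall population.
    Visits are treated as independent given gender (a strong assumption).
    """
    prior_log_odds = math.log(prior / (1.0 - prior))
    log_odds = prior_log_odds
    for share in site_male_shares:
        # Clip so a 0% or 100% site does not produce infinite evidence.
        share = min(max(share, 0.01), 0.99)
        # Log likelihood ratio log[P(visit | male) / P(visit | female)].
        log_odds += math.log(share / (1.0 - share)) - prior_log_odds
    return 1.0 / (1.0 + math.exp(-log_odds))


neutral = posterior_male([0.5, 0.5, 0.5])   # gender-neutral sites only
leaning = posterior_male([0.75, 0.75])      # two male-leaning sites
```

If most visited sites have male shares close to 50%, each update barely moves the posterior, which would be one possible explanation for results hovering around 50-50.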

c******r
Posts: 300

: However the result is still around 50-50. :)
Does this simply mean that what you have has no predictive power for gender
(assuming you put reasonable effort into building the model itself)?
Not sure whether this is good news or bad news for you.

[In reply to c***z]

c***z
Posts: 6348

Yeah, it is difficult to tell whether this is because our data is good, or
because there is insufficient data, or because my model sucked. :)

w*****a
Posts: 218

This is the right way to go.

[In reply to h********3]

c*****o
Posts: 1702

If you do have data from other segments, and males making less than 50K are
merely dominant, you can use census data to do the bias correction. I mean, as
long as you have the other demographic segments, just with less data, bias
correction is possible.
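
A sketch of what such a census-based correction might look like, using plain post-stratification: each segment is weighted by its census share divided by its panel share. The segment names, counts, shares, and per-segment means are all hypothetical.

```python
def poststratification_weights(panel_counts, census_shares):
    """Weight per segment = census share / panel share."""
    n = sum(panel_counts.values())
    weights = {}
    for segment, share in census_shares.items():
        count = panel_counts.get(segment, 0)
        if count == 0:
            # A segment with no panel data cannot be reweighted at all.
            raise ValueError(f"no panel data for segment {segment!r}")
        weights[segment] = share / (count / n)
    return weights


def weighted_mean(segment_means, panel_counts, weights):
    """Population-level mean from per-segment means and segment weights."""
    num = sum(weights[s] * panel_counts[s] * segment_means[s] for s in weights)
    den = sum(weights[s] * panel_counts[s] for s in weights)
    return num / den


# Hypothetical panel and census numbers.
panel_counts = {"male_under_50k": 800, "female_under_50k": 100,
                "male_over_50k": 60, "female_over_50k": 40}
census_shares = {"male_under_50k": 0.30, "female_under_50k": 0.30,
                 "male_over_50k": 0.20, "female_over_50k": 0.20}
segment_means = {"male_under_50k": 1.0, "female_under_50k": 2.0,
                 "male_over_50k": 3.0, "female_over_50k": 4.0}

weights = poststratification_weights(panel_counts, census_shares)
corrected = weighted_mean(segment_means, panel_counts, weights)
```

Small segments end up with large weights, so their per-segment estimates are noisy. That is the price of having "fewer data" in those cells.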

M***e
Posts: 531

Perhaps use other data from the same industry to make some assumptions about
how young females behave, and then adjust the existing model accordingly.

c***z
Posts: 6348

Just did a second round and it was way off.
Here is what I did. Any input is extremely welcome!
Business objective: Correct for panel bias in terms of demographic breakdown,
to obtain an accurate multiplier, in order to calculate site traffic, sales,
etc.
Input data:
1. Site visits by people (own data, company Q);
2. Third-party (C) 2000 site gender decomposition;
3. Third-party personal gender labels (two companies, E and L);
Technical logic:
1. Assign a score to each site based on its gender decomposition (manually);
2. Join the site visits data with the score sheet, using the site as key, to
obtain person-site-score triples (using Pig);
3. Sum up the gender scores for each guid, and infer the individual's
gender (using Pig);
4. Compare this gender label with the third-party labels, choose the subset
of guids where all sources (E, L, Q) agree on the gender, and take the male
ratio of this subset (35%) as the global ratio, assuming that this is the
ratio among all our users (using R);
5. Join this subset with the site visits data, using the person as key,
to obtain site-person-gender triples (using Pig);
6. Average the gender scores for each site, and infer the gender
decomposition (using Pig);
7. Multiply the decomposition by the global ratio to obtain the adjusted
gender decomposition, and compare it with the third-party (C) decomposition
(using R).
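
The joins and aggregations in steps 2-3 and 5-6 can be sketched in a few lines. The thread used Pig and R, so this Python version is only a stand-in. The scoring convention (positive leans male), the site names, and the guids are hypothetical, and the agreement filter of step 4 and the multiplier of step 7 are left out.

```python
from collections import defaultdict

# Step 1 (assumed convention): score > 0 leans male, score < 0 leans female.
site_scores = {"siteA.example": 0.6, "siteB.example": -0.4, "siteC.example": 0.1}

# Own visit data as (guid, site) pairs.
visits = [
    ("u1", "siteA.example"), ("u1", "siteC.example"),
    ("u2", "siteB.example"), ("u2", "siteC.example"),
    ("u3", "siteA.example"), ("u3", "siteB.example"),
]


def infer_gender(visits, site_scores):
    """Steps 2-3: join visits with scores on site, sum per guid, take the sign."""
    totals = defaultdict(float)
    for guid, site in visits:
        if site in site_scores:          # inner join on site
            totals[guid] += site_scores[site]
    # A total of exactly 0 falls to "F" here, which is an arbitrary choice.
    return {guid: ("M" if total > 0 else "F") for guid, total in totals.items()}


def site_male_ratio(visits, genders):
    """Steps 5-6: join visits with person labels, average per site."""
    males, people = defaultdict(int), defaultdict(int)
    for guid, site in set(visits):       # count each person once per site
        if guid in genders:
            people[site] += 1
            males[site] += genders[guid] == "M"
    return {site: males[site] / people[site] for site in people}


genders = infer_gender(visits, site_scores)
ratios = site_male_ratio(visits, genders)
```

Note the circularity: the per-site ratios in step 6 are computed from labels that were themselves derived from site scores in step 1, so any error in the manual scores feeds straight back into the output.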
Initial result:
1. On average, my gender ratio is 23.4% off the C ratio (RMSE = 0.234),
skewed downwards, e.g. if the comScore male ratio for a site is 75.2%, the
Q male ratio can be between 51.8% and 98.6%, and is most likely closer to the
51.8% end;
2. There are good sites where the two ratios are close, such as
microsoftstore.com, and bad sites where the ratios are way off, such as
reddit.com. One could infer that the panel is also skewed in terms of site
visit frequencies;
3. One interesting fact is that, if we do not adjust by the global ratio,
the final result is similar, i.e. my gender ratio is 21.1% off the comScore
ratio (RMSE = 0.21), skewed downwards.
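
When comparing two sets of per-site ratios like this, it can help to report RMSE, mean absolute error, and mean signed error side by side, since RMSE alone shows neither the typical miss nor the direction of the skew. A small sketch with made-up ratios (none of these numbers come from the thread's data):

```python
import math


def compare_ratios(mine, reference):
    """Error summary over sites present in both dictionaries."""
    sites = sorted(set(mine) & set(reference))
    diffs = [mine[s] - reference[s] for s in sites]
    n = len(diffs)
    return {
        "rmse": math.sqrt(sum(d * d for d in diffs) / n),
        "mae": sum(abs(d) for d in diffs) / n,
        "mean_signed_error": sum(diffs) / n,   # negative = skewed downwards
    }


# Hypothetical male ratios for three sites.
mine = {"a.example": 0.50, "b.example": 0.40, "c.example": 0.70}
reference = {"a.example": 0.60, "b.example": 0.70, "c.example": 0.70}

summary = compare_ratios(mine, reference)
```

RMSE is never smaller than the mean absolute error, and the gap between them grows when a few sites are far off, which fits the good-site/bad-site pattern described in point 2.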

c***z
Posts: 6348

Now we have individual demo data from two companies, but they contradict
themselves, as well as each other.
(Man, they are selling crap for 20K/mon, business is good!)
We are using the subset where they agree, but that might skew the
distribution and introduce more bias...

[In reply to r*******y]

c***z
Posts: 6348

It is definitely nonlinear, and we have no idea how to infer the latter from
the former.

[In reply to d****n]

c***z
Posts: 6348

Did something like this, but the result was not good...

[In reply to s*********e]

c***z
Posts: 6348

In this case, q(x) = 0.5, since we know that the online population is not
skewed, right?
I need to read more about importance sampling...

[In reply to h********3]

c***z
Posts: 6348

This one?
http://www.cs.nyu.edu/~mohri/pub/bias.pdf

[In reply to m***o]

c***z
Posts: 6348

I think the biggest problem is that we don't know what data we can trust.
Also, if we take a subset we think we can trust, new bias may be introduced.

l******0
Posts: 244

: However you don't know how much is the bias.
Wouldn't it be easy to find out the actual bias rate by randomly sampling and
annotating a manageable subset? However, there may be no magic way to make it
work for all populations if the data is inherently too biased.
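
To make the "manageable subset" idea concrete, here is a sketch of estimating the panel's male share from a randomly sampled, hand-annotated subset, with a Wilson score interval to show how precise the estimate is. The sample size and counts are hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 for roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    phat = successes / n
    denom = 1.0 + z * z / n
    center = (phat + z * z / (2 * n)) / denom
    half = z * math.sqrt(phat * (1 - phat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical annotation result: 140 of 400 sampled panelists labeled male.
low, high = wilson_interval(140, 400)
```

With 400 annotated panelists the interval is still several percentage points wide on each side. The estimate is also only valid if the subset is a genuinely random draw from the panel and the annotations are reliable.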

[In reply to c***z]

c***z
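Posts: 6348

Our panel is 10% of the US population, and we only know the gender of 10% of
the people in the panel, so "having gender information" may itself introduce
new bias.
(Moreover, this 10% of gender information comes from 3 sources, which
contradict each other and themselves.)
In other words, we have to use a 1% sample to correct the bias of a 10% sample.
The current idea is to bootstrap a subset with a 1:1 gender ratio from this 10%
of people as a baseline...
Whatever we do, it feels like shooting in the air. What a headache.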
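
The 1:1 bootstrap idea could be sketched as a stratified resample with replacement, drawing the same number of rows from each gender. The function name `balanced_bootstrap`, the fixed seed, and the 35/65 toy split are hypothetical.

```python
import random


def balanced_bootstrap(people, per_gender, seed=0):
    """Resample with replacement so that both genders have `per_gender` rows.

    people: list of (guid, gender) pairs with gender in {"M", "F"}.
    """
    rng = random.Random(seed)
    sample = []
    for gender in ("M", "F"):
        pool = [p for p in people if p[1] == gender]
        if not pool:
            raise ValueError(f"no labeled people with gender {gender!r}")
        sample.extend(rng.choices(pool, k=per_gender))
    rng.shuffle(sample)
    return sample


# Hypothetical labeled subset: 35 males and 65 females.
labeled = [(f"m{i}", "M") for i in range(35)] + [(f"f{i}", "F") for i in range(65)]
subset = balanced_bootstrap(labeled, per_gender=50)
```

Balancing on gender only fixes the gender margin. Any selection effect from "having gender information" in the first place is carried into the resampled subset unchanged.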

[In reply to l******0]