DataSciences board - [Data Science Project Case] Bias Correction
This is the main project I am working on. I would greatly appreciate it if you
could share some insight. :)
Say you have a huge data set about online activities, and you know that the data
set is biased, i.e. most of the information is collected from young males
making less than 50K. However, you don't know how large the bias is.
To sell data products to online marketing companies, you need to correct for
the bias, so that your products work for the whole online population, or a
different population, such as young females making more than 50K.
What should you do?
Do you mean you cannot infer individual demographic info directly from the
available data? If so, I would suggest getting more data sources (e.g. from
a third party) to cross-examine your data.


1. Build a lookalike model to predict the likelihood that a user is a young
male, and derive the propensity score from it
2. Use the propensity score to correct the bias
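
One way to read the two steps above, as a rough sketch rather than the poster's actual method: assume you also have a small reference sample of the target population (that sample is an assumption, it is not mentioned in the thread), fit a model that separates panel rows from reference rows, and weight each panel row by (1 - e) / e, where e is its estimated propensity of being in the panel. The single feature here is a hypothetical "looks like a young male" flag, and the logistic regression is hand-rolled only to keep the example dependency-free.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.5, epochs=2000):
    """Fit P(y=1 | x) = sigmoid(w.x + b) by batch gradient descent."""
    n_features = len(xs[0])
    w = [0.0] * n_features
    b = 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) - y
            for j in range(n_features):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def propensity_weights(panel_xs, reference_xs):
    """Weight each panel row by (1 - e) / e, with e = P(row is from the panel | x)."""
    xs = panel_xs + reference_xs
    ys = [1] * len(panel_xs) + [0] * len(reference_xs)
    w, b = fit_logistic(xs, ys)
    weights = []
    for x in panel_xs:
        e = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
        weights.append((1.0 - e) / e)
    return weights

# Toy data (made up): 80% of the panel but only 30% of the reference
# sample carry the "young male" flag.
panel = [[1.0]] * 80 + [[0.0]] * 20
reference = [[1.0]] * 30 + [[0.0]] * 70
weights = propensity_weights(panel, reference)
weighted_share = sum(w * x[0] for w, x in zip(weights, panel)) / sum(weights)
print(round(weighted_share, 2))  # should land near the reference share
```

After weighting, the panel's share of flagged users should move from 0.80 toward the reference sample's 0.30. With real features the model would of course be less clean than this one-flag toy.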


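The approaches I can think of: one is stratified sampling, i.e. split people into three tiers (high, middle, low), and then for each ...
The other is importance sampling. You want to estimate a function f(x), but p(x) is the probability density of x under your
biased data, while q(x) is the probability density of x over the whole online population. Suppose you can
obtain q(x) from another data set covering the online population; then with importance sampling you can ...
I did a similar project a while ago, based on a paper published by Google Research. You can search for it;
the title is roughly "bias correction theory".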
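
A minimal sketch of the importance sampling idea described above, assuming x is a discrete demographic segment and that both p (segment shares in the biased panel) and q (segment shares in the online population) are known. The segment names, shares, and the "spend" function f are made up for illustration.

```python
def importance_weighted_mean(samples, p, q, f):
    """Self-normalized importance sampling estimate of E_q[f(x)]
    from samples drawn under the biased distribution p."""
    weights = [q[x] / p[x] for x in samples]
    return sum(w * f(x) for w, x in zip(weights, samples)) / sum(weights)

# Hypothetical shares: the panel is 80% young males, the population 30%.
p = {"young_male": 0.8, "other": 0.2}
q = {"young_male": 0.3, "other": 0.7}

# Hypothetical quantity of interest, e.g. average spend per user.
spend = {"young_male": 10.0, "other": 20.0}

samples = ["young_male"] * 80 + ["other"] * 20
naive = sum(spend[x] for x in samples) / len(samples)
corrected = importance_weighted_mean(samples, p, q, spend.get)
print(naive, corrected)
```

The naive panel average sits close to the young-male value, while the reweighted estimate moves toward the population mix. The catch is the one raised later in the thread: this only works if q is actually known and every segment with q(x) > 0 appears in the panel.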


Thank you all for your input!
We have a panel of click data, as well as search data; we bought the site
demographic data (percentage of male visitors to sites, etc). We are trying
to infer gender from these two datasets, using Bayesian update and the
propensity score method.
However the result is still around 50-50. :)
Will keep working on this.
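
For reference, a naive-Bayes style sketch of what a Bayesian update on site male shares could look like. This is only a guess at the setup, not the poster's actual model: it assumes site visits are independent given gender, and `base_male` (the overall male share among site visitors) is a hypothetical parameter.

```python
import math

def posterior_male(site_male_shares, prior_male=0.5, base_male=0.5):
    """Posterior P(male) for one user, given the male share of each site visited.

    Uses P(site | male) / P(site | female) = [m / (1 - m)] * [(1 - base) / base],
    which follows from Bayes' rule when m = P(male | site).
    """
    log_odds = math.log(prior_male / (1.0 - prior_male))
    for m in site_male_shares:
        m = min(max(m, 1e-6), 1.0 - 1e-6)  # guard against shares of exactly 0 or 1
        log_odds += math.log(m / (1.0 - m))
        log_odds += math.log((1.0 - base_male) / base_male)
    return 1.0 / (1.0 + math.exp(-log_odds))

print(posterior_male([0.75, 0.75]))        # two male-leaning sites
print(posterior_male([0.52, 0.49, 0.51]))  # sites close to 50/50
```

Under these assumptions, a user who only visits sites with a male share near 50% stays near a 50-50 posterior no matter how many visits are observed, which is one possible reading of the result above.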
mibco, is this the paper?
However the result is still around 50-50. :)
Does this simply mean there is no predictive power for gender in what you
have (assuming you spent reasonable effort on building the model itself)?
Not sure whether this is good news for you or bad news.

Yeah, it is difficult to tell whether this is because our data is good, or
because there is insufficient data, or because my model sucked. :)
If you do have data from other segments, and males making less than 50K are
merely dominant, you can use census data to do the bias correction. I mean,
as long as the other demographic segments are present, just with less data,
bias correction is possible.
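
A small sketch of that census-based correction (post-stratification), assuming every panel member already carries a segment label. The segment names and census shares below are invented for the example.

```python
from collections import Counter

def poststratification_weights(panel_segments, census_shares):
    """Per-segment weight = census share / panel share.

    Segments that never appear in the panel get no weight at all,
    so they cannot be corrected this way.
    """
    counts = Counter(panel_segments)
    n = len(panel_segments)
    return {seg: census_shares[seg] / (counts[seg] / n) for seg in counts}

# Hypothetical panel: dominated by males making less than 50K.
panel_segments = (["male_lt50k"] * 70 + ["male_ge50k"] * 10 +
                  ["female_lt50k"] * 15 + ["female_ge50k"] * 5)

# Hypothetical census shares for the same four segments.
census_shares = {"male_lt50k": 0.3, "male_ge50k": 0.2,
                 "female_lt50k": 0.3, "female_ge50k": 0.2}

seg_weights = poststratification_weights(panel_segments, census_shares)
for seg, w in sorted(seg_weights.items()):
    print(seg, round(w, 2))
```

The thin segments end up with large weights (here the 5 high-income females each count several times over), so the estimates for them get noisy even though the correction is possible in principle.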
Maybe use other data from the same industry to make some assumptions about how young females would behave.
Just did a second round and it was way off.
Here is what I did. Any input is extremely welcome!
Business objective: Correct for panel bias in terms of demographic breakdown,
to obtain an accurate multiplier, in order to calculate site traffic,
sales, etc
Input data:
1. Site visits by people (own data, company Q);
2. Third party (C) 2000 site gender decomposition;
3. Third party personal gender labels (two companies, E and L);
Technical logic:
1. Assign a score to sites based on their gender decomposition (manually);
2. Join the site visits data with the score sheet, using sites as key, to
obtain person-site-score triples (using Pig);
3. Sum up the gender scores for each guid, and infer the individual’s
gender (using Pig);
4. Compare this gender label with third party label, choose the subset of
guids where all sources (E, L, Q) agree on the gender, take the male ratio
of this subset (35%) as global ratio - assume that this is the ratio of all
our users (using R);
5. Join this subset with the site visits data, using the person as key,
to obtain site-person-gender triples (using Pig);
6. Average up the gender scores for each site, and infer the gender
decomposition (using Pig);
7. Multiply the decomposition with the global ratio, to obtain adjusted
gender decomposition, and compare with the third party (C) decomposition
(using R).
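
To make steps 2 to 6 easier to discuss, here is a rough Python sketch of the same logic on toy data. The original was done in Pig and R; the scoring scale (positive = male-leaning), the sign rule for inferring gender, and the choice to count each person once per site are all assumptions, since the post does not spell them out.

```python
from collections import defaultdict

def infer_gender_by_guid(visits, site_scores):
    """Steps 2-3: join visits with site scores, sum per guid, label by sign."""
    totals = defaultdict(float)
    for guid, site in set(visits):       # assumption: one count per person-site pair
        if site in site_scores:          # inner join on site
            totals[guid] += site_scores[site]
    return {g: ("M" if t > 0 else "F") for g, t in totals.items() if t != 0}

def agreeing_subset(*label_sources):
    """Step 4: keep only guids on which every source gives the same label."""
    first, *rest = label_sources
    return {g: lab for g, lab in first.items()
            if all(src.get(g) == lab for src in rest)}

def site_male_share(visits, gender_by_guid):
    """Steps 5-6: join the labeled subset back to visits, average per site."""
    males, total = defaultdict(int), defaultdict(int)
    for guid, site in set(visits):
        label = gender_by_guid.get(guid)
        if label is None:
            continue
        total[site] += 1
        males[site] += (label == "M")
    return {s: males[s] / total[s] for s in total}

# Toy data: (guid, site) pairs and hand-assigned site scores.
visits = [("u1", "a"), ("u1", "b"), ("u2", "b"),
          ("u2", "c"), ("u3", "a"), ("u3", "c")]
site_scores = {"a": 1.0, "b": 0.5, "c": -1.0}

q_labels = infer_gender_by_guid(visits, site_scores)
shares = site_male_share(visits, q_labels)
print(q_labels, shares)
```

Even on this toy input, site "a" comes out as 100% male because the only labeled visitor was labeled from that same site's score. Whether the real pipeline has the same feedback loop depends on how much the E and L labels constrain the agreeing subset.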
Initial result:
1. On average, my gender ratio is 23.4% off the C ratio (RMSE = 0.234),
skewed downwards, e.g. if the comScore male ratio for a site is 75.2%, the
Q male ratio can be between 51.8% and 98.6%, and is most likely closer to the
51.8% end;
2. There are good sites where the two ratios are close, and bad sites where
the ratios are way off. One could infer that the panel is also skewed in terms of site
visit frequencies;
3. One interesting fact is that, if we do not adjust by the global ratio,
the final result is similar, i.e. my gender ratio is 21.1% off the comScore
ratio (RMSE = 0.21), skewed downwards.
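
Since the results above mix "how far off" with "skewed downwards", it may help to report the two separately. A minimal sketch, with made-up site names and ratios: RMSE measures the overall gap, while the mean signed error shows the direction of the skew.

```python
import math

def rmse(estimated, reference):
    """Root mean squared error over the sites present in both dicts."""
    sites = estimated.keys() & reference.keys()
    return math.sqrt(sum((estimated[s] - reference[s]) ** 2 for s in sites) / len(sites))

def mean_signed_error(estimated, reference):
    """Average of (estimated - reference); negative means skewed downwards."""
    sites = estimated.keys() & reference.keys()
    return sum(estimated[s] - reference[s] for s in sites) / len(sites)

# Hypothetical male ratios per site: own estimate vs third-party reference.
own = {"site_a": 0.50, "site_b": 0.60, "site_c": 0.70}
third_party = {"site_a": 0.70, "site_b": 0.80, "site_c": 0.70}

print(rmse(own, third_party), mean_signed_error(own, third_party))
```

If the mean signed error is nearly as large in magnitude as the RMSE, most of the gap is a consistent offset rather than site-to-site noise, which would suggest a calibration problem more than a lack of signal.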
Now we have bought individual demo data from two companies, but they contradict
themselves, as well as each other.
(Man, they are selling crap for 20K/mon, business is good!)
We are using a subset where they agree, but that might have skewed the
distribution and introduced more bias...


It is definitely non linear and we have no idea how to infer the latter from
the former.

Did something like this, but result was not good...

In this case, q(x) = 0.5, since we know that the online population is not
skewed, right?
I need to read more about importance sampling...
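
As a quick worked example of what q(x) = 0.5 would imply: if the panel's male share p(male) were the 35% from step 4 of the second-round write-up (itself an assumption the poster made about the whole panel), the importance weights w = q / p would be:

```latex
w(\text{male}) = \frac{q(\text{male})}{p(\text{male})} = \frac{0.5}{0.35} \approx 1.43,
\qquad
w(\text{female}) = \frac{q(\text{female})}{p(\text{female})} = \frac{0.5}{0.65} \approx 0.77
```

So each male panelist would count about 1.43 times and each female about 0.77 times. This corrects gender only; it says nothing about age or income skew.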

This one?


I think the biggest problem is that we don't know what data we can trust.
Also, if we take subset we think we can trust, new bias may be introduced.
However you don't know how much is the bias.
--- Wouldn't it be easy to estimate the actual bias rate by randomly
sampling and annotating a manageable subset? However, there may not be a
magic way to make it work for all populations if the data is inherently too
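
To put a number on "manageable subset": a short sketch of how precisely a randomly sampled, hand-annotated subset would pin down the panel's male share, using a Wilson score interval. The sample size and count are made up, and it assumes the annotation itself is accurate.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Hypothetical: 100 randomly sampled panelists annotated, 70 found to be male.
low, high = wilson_interval(70, 100)
print(round(low, 3), round(high, 3))
```

With only 100 annotated users the interval is still wide; the width shrinks roughly with the square root of the sample size, so a few thousand annotations would be needed for a tight estimate.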


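Our panel is 10% of the US population, and we only know the gender of 10% of the people in the panel, so ...
In other words, we have to use a 1% sample to correct the bias of a 10% sample.
Whatever we do feels like shooting in the air. It gives me a headache.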


