c***z Posts: 6348 | 1 This is the main project I am working on. I would greatly appreciate it if you
can share some insight. :)
Say you have a huge data set of online activity, and you know that the data
set is biased, i.e. most of the information was collected from young males
making less than 50K. However, you don't know how large the bias is.
To sell data products to online marketing companies, you need to correct for
the bias, so that your products work for the whole online population, or a
different population, such as young females making more than 50K.
What should you do?
Thanks!
| r*******y Posts: 626 | 2 Do you mean you cannot infer individual demographic info directly from the
available data? If so, I would suggest getting more data sources (e.g. from
a third party) to cross-check your data.
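| d****n Posts: 12461 | 3 One idea:
First look at how behavior changes from 30K males to 50K males, then use that to
extrapolate how it changes from 120K females to 150K females.
Of course, you must have a baseline for high-income females.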
| s*********e Posts: 1051 | 4
1. Build a lookalike model to predict the likelihood of being a young male, and
derive the propensity score from it.
2. Use the propensity score to correct the bias.
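
The post does not spell out the weighting step. One common reading, sketched here under assumptions: each panel user has already been placed in a demographic cell (for example by the lookalike model's predicted label), and a small reference sample of the target population is available. The propensity score is then P(in panel | cell), and each panel user is weighted by the odds of belonging to the reference sample. The function name, the cell labels, and the use of plain cell frequencies in place of a fitted model are all hypothetical.

```python
from collections import Counter

def odds_weights(panel_cells, reference_cells):
    """Weight each panel user by the odds of belonging to the reference sample.

    Both arguments are lists of demographic cell labels, one per user.
    The "propensity model" is just per-cell frequencies (an assumption).
    """
    in_panel = Counter(panel_cells)
    in_reference = Counter(reference_cells)
    weights = []
    for cell in panel_cells:
        # Propensity score: P(user is in the panel | cell).
        e = in_panel[cell] / (in_panel[cell] + in_reference[cell])
        # Odds weight; a cell absent from the reference sample gets weight 0.
        weights.append((1 - e) / e)
    return weights

# Toy data: the panel is 80% young males, the reference sample is 30%.
panel = ["young male <50K"] * 8 + ["other"] * 2
reference = ["young male <50K"] * 3 + ["other"] * 7
w = odds_weights(panel, reference)
share = sum(wi for wi, c in zip(w, panel) if c == "young male <50K") / sum(w)
```

With these toy inputs the weighted young-male share of the panel matches the reference share. The correction is only as good as the cell labels and the reference sample.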
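| h********3 Posts: 2075 | 5 Two approaches come to mind. One is stratified sampling: split people into
three tiers (high, middle, low), draw 100 people from each tier, and then estimate on the people you drew.
The other is importance sampling. You want to estimate a function f(x); p(x) is the probability
density of x under your biased data, while q(x) is the probability density of x in the whole online
population. If you can get q(x) from some other data set covering the online population, importance
sampling gives you the expectation of f(x) under q(x).
| m***o Posts: 225 | 6 I did a similar project a while ago, based on a paper published by Google Research.
You can search for it.
The title was something like "bias correction theory".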
| c***z Posts: 6348 | 7 Thank you all for your input!
We have a panel of click data, as well as search data; we bought the site
demography data (percentage of male visitors to sites, etc). We are trying
to infer gender from these two datasets, using Bayesian update and the
propensity score method.
However the result is still around 50-50. :)
Will keep working on this.
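
For reference, a minimal sketch of what a Bayesian update from site demographics could look like. It is not the poster's actual model. It assumes each visited site's male share is independent evidence (a naive Bayes assumption) and that the population behind the site shares is itself 50/50. The function name and inputs are hypothetical.

```python
from math import exp, log

def posterior_male(site_male_shares, prior=0.5):
    """Posterior P(male) for one user from the male shares of visited sites.

    Each share must lie strictly between 0 and 1. Shares are treated as
    independent evidence and combined in log-odds space.
    """
    log_odds = log(prior / (1 - prior))
    for share in site_male_shares:
        # Likelihood ratio of a visit, assuming a balanced base population.
        log_odds += log(share / (1 - share))
    return 1 / (1 + exp(-log_odds))
```

Under these assumptions, a user who mostly visits sites with shares near 0.5 stays near 0.5, so a panel dominated by gender-neutral sites would give weak posteriors however the model is tuned.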
mibco, is this the paper? http://www.cs.nyu.edu/~mohri/pub/bias.pdf
| c******r Posts: 300 | 8 "However the result is still around 50-50. :)"
Does this simply mean there is no predictive power for gender in what you
have (assuming you put reasonable effort into building the model itself)?
Not sure whether this is good news or bad news for you.
| c***z Posts: 6348 | 9 Yeah, it is difficult to tell whether this is because our data is good, or
because there is insufficient data, or because my model sucked. :)
| w*****a Posts: 218 | 10 This is the right way to go.
【In reply to h********3, post 5】
| c*****o Posts: 1702 | 11 If you have data from other segments and males making less than 50K are
merely dominant, you can use census data to do the bias correction. I mean, as long as
you have the other demographic segments, just with less data, bias
correction is possible.
| M***e Posts: 531 | 12 Maybe use other data from the same industry to make some assumptions about how young
females behave, then adjust the original model accordingly.
| c***z Posts: 6348 | 13 Just did a second round and it was way off.
Here is what I did. Any input is extremely welcome!
Business objective: Correct for panel bias in terms of demographic breakdown,
to obtain an accurate multiplier, in order to calculate site traffic,
sales, etc.
Input data:
1. Site visits by people (own data, company Q);
2. Third party (C) 2000 site gender decomposition;
3. Third party personal gender labels (two companies, E and L);
Technical logic:
1. Assign a score to sites based on their gender decomposition (manually);
2. Join the site visits data with the score sheet, using sites as key, to
obtain person-site-score triples (using Pig);
3. Sum up the gender scores for each guid, and infer the individual’s
gender (using Pig);
4. Compare this gender label with third party label, choose the subset of
guids where all sources (E, L, Q) agree on the gender, take the male ratio
of this subset (35%) as global ratio - assume that this is the ratio of all
our users (using R);
5. Join this subset with the site visits data, using the person as key,
to obtain site-person-gender triples (using Pig);
6. Average up the gender scores for each site, and infer the gender
decomposition (using Pig);
7. Multiply the decomposition with the global ratio, to obtain adjusted
gender decomposition, and compare with the third party (C) decomposition
(using R).
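
The core of steps 2, 3, 5 and 6, plus the RMSE used in the comparison, can be sketched in a few lines. This is an illustration and not the original Pig/R code. It assumes a sign convention for the manual site scores (positive leans male, negative leans female), it counts distinct visitors per site, and every function name is hypothetical. The agreement filter from step 4 and the global-ratio adjustment from step 7 are left out.

```python
from collections import defaultdict
from math import sqrt

def infer_user_gender(visits, site_score):
    """Steps 2-3: join visits with site scores, sum per guid, take the sign."""
    total = defaultdict(float)
    for guid, site in visits:
        if site in site_score:  # inner join on site
            total[guid] += site_score[site]
    # Users whose scores cancel out exactly are left unlabeled.
    return {g: ("M" if s > 0 else "F") for g, s in total.items() if s != 0}

def site_male_share(visits, gender):
    """Steps 5-6: male share among distinct labeled visitors of each site."""
    seen = defaultdict(set)
    for guid, site in visits:
        if guid in gender:
            seen[site].add(guid)
    return {site: sum(gender[g] == "M" for g in guids) / len(guids)
            for site, guids in seen.items()}

def rmse(estimate, benchmark):
    """Root mean squared error over the sites present in both dicts."""
    common = estimate.keys() & benchmark.keys()
    return sqrt(sum((estimate[s] - benchmark[s]) ** 2 for s in common) / len(common))

# Toy data: (guid, site) visit pairs and manual site scores.
visits = [("u1", "a"), ("u1", "b"), ("u2", "a"), ("u3", "b")]
labels = infer_user_gender(visits, {"a": 1.0, "b": -0.5})
shares = site_male_share(visits, labels)
```

Note that the site shares are computed from labels that were themselves derived from site scores, so errors in the manual scores feed straight back into the estimate.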
Initial result:
1. On average, my gender ratio is 23.4 percentage points off the C ratio (RMSE = 0.234),
skewed downwards, e.g. if the comScore male ratio for a site is 75.2%, the
Q male ratio can be between 51.8% and 98.6%, and most likely closer to the
51.8% end;
2. There are good sites where the two ratios are close, such as
microsoftstore.com, and bad sites where the ratios are way off, such as
reddit.com. One could infer that the panel is also skewed in terms of site
visit frequencies;
3. One interesting fact is that, if we do not adjust by the global ratio,
the final result is similar, i.e. my gender ratio is 21.1 percentage points off the comScore
ratio (RMSE = 0.21), skewed downwards.
| c***z Posts: 6348 | 14 Now we have bought individual demo data from two companies, but they contradict
themselves, as well as each other.
(Man, they are selling crap for 20K/mon, business is good!)
We are using a subset where they agree, but that might have skewed the
distribution and introduced more bias...
【In reply to r*******y, post 2】
| c***z Posts: 6348 | 15 It is definitely nonlinear and we have no idea how to infer the latter from
the former.
【In reply to d****n, post 3】
| c***z Posts: 6348 | 16 Did something like this, but the result was not good...
【In reply to s*********e, post 4】
| c***z Posts: 6348 | 17 In this case, q(x) = 0.5, since we know that the online population is not skewed,
right?
I need to read more about importance sampling...
【In reply to h********3, post 5】
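
On the q(x) question, a minimal sketch under assumptions: if x is gender and the online population is taken to be balanced, then q is a distribution with q(M) = q(F) = 0.5, and each biased sample is weighted by q(x)/p(x). The identity behind it is E_q[f(x)] = E_p[f(x) q(x)/p(x)]. The function name, the toy shares, and the toy values of f are hypothetical.

```python
def importance_weighted_mean(samples, f, p, q):
    """Estimate E_q[f(x)] from samples drawn under p (self-normalized)."""
    weights = [q[x] / p[x] for x in samples]
    return sum(w * f(x) for w, x in zip(weights, samples)) / sum(weights)

# Toy data: the panel is 80% male, the target population is 50/50.
samples = ["M"] * 8 + ["F"] * 2
p = {"M": 0.8, "F": 0.2}    # density under the biased panel
q = {"M": 0.5, "F": 0.5}    # assumed density in the online population
value = {"M": 10.0, "F": 20.0}  # f(x), e.g. visits per user by gender

raw = sum(value[x] for x in samples) / len(samples)
corrected = importance_weighted_mean(samples, value.get, p, q)
```

Here the raw panel mean is 12, and the reweighted estimate comes out at about 15, the value for a balanced population. In practice x would cover more than gender, and p(x) has to be estimated from the panel itself.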
| c***z Posts: 6348 | 18 This one?
http://www.cs.nyu.edu/~mohri/pub/bias.pdf
【In reply to m***o, post 6】
| c***z Posts: 6348 | 19 I think the biggest problem is that we don't know what data we can trust.
Also, if we take a subset we think we can trust, new bias may be introduced.
| l******0 Posts: 244 | 20 "However you don't know how much is the bias."
--- Wouldn't it be easy to find out the actual bias rate by randomly
sampling and annotating a manageable subset? However, there may be no
magic way to make it work for all populations if the data is inherently too
biased.
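| c***z Posts: 6348 | 21 Our panel is 10% of the US population, and we only know the gender of 10% of the
people in the panel, so "having gender information" may itself introduce new bias.
(And that 10% of gender information comes from 3 sources, which contradict each other and themselves.)
In other words, we have to use a 1% sample to correct the bias of a 10% sample.
The current idea is to bootstrap a subset with a 1:1 gender ratio from that 10% to use as a baseline...
Whatever we do feels like shooting in the air. Headache.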
【In reply to l******0, post 20】
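
The 1:1 bootstrap idea could look roughly like this. It is a sketch only: it assumes the labeled users come as (guid, gender) pairs with gender "M" or "F", and the function name and seed parameter are hypothetical.

```python
import random

def balanced_bootstrap(users, n_per_gender, seed=0):
    """Resample labeled users with replacement into a 1:1 gender subset.

    users: list of (guid, gender) pairs. Raises IndexError if either
    gender has no labeled users at all.
    """
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    males = [u for u in users if u[1] == "M"]
    females = [u for u in users if u[1] == "F"]
    return rng.choices(males, k=n_per_gender) + rng.choices(females, k=n_per_gender)

# Toy data: 35 labeled males and 65 labeled females.
labeled = [(f"m{i}", "M") for i in range(35)] + [(f"f{i}", "F") for i in range(65)]
baseline = balanced_bootstrap(labeled, n_per_gender=50)
```

This balances gender only. As noted in the post, the bias that comes from having a gender label in the first place is carried into the resampled subset unchanged.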