c***z 发帖数: 6348 | 1 Dear all, thank you so much for your earlier inputs! Now I am able to put my
thoughts together and understand the project better.
Let me write everything down again. Any comments are extremely welcome!
Project name: Bias correction
Business objective: We have a panel of 25M users' shopping cart information,
and we want to infer national online sales by brand and channel. We do so by
finding and applying multipliers to each shopping cart item, based on our
panel size and its selection bias towards particular populations (e.g. if our
panel is more skewed towards low-income people than the IBP, then their
shopping records should have smaller multipliers than those of high-income
people).
Technical logic: We have a biased sample of the IBP, among which only a
subset has third-party demographic labels. Hence there are three
subproblems:
1. Bias correction (from panel to IBP): this is a special kind of missing
data problem, where the population stats are known. We compute and assign
weights to each subgroup (defined by demographics and brand/site). The
method here is rim weighting; another classical method is regression. The
weights can be obtained from, and applied to, three levels: guids, sites and
panel; hence overall there are nine combinations. We are particularly interested in:
a. From panel, to guids (current approach);
b. From panel, to sites;
c. From sites, to guids;
d. From sites, to sites;
e. From sites, to panel.
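The rim weighting step above can be sketched as iterative proportional fitting: repeatedly rescale the panel weights so that each weighted demographic marginal matches the known IBP marginal. A minimal sketch, where the two variables, their categories, and the target shares are all hypothetical toy values:

```python
import numpy as np

def rim_weight(categories, targets, n_iter=50, tol=1e-8):
    """Rim weighting (raking / iterative proportional fitting).

    categories: dict of variable name -> array of category labels per panelist.
    targets: dict of variable name -> dict mapping label -> target IBP share.
    """
    n = len(next(iter(categories.values())))
    w = np.ones(n)
    for _ in range(n_iter):
        max_shift = 0.0
        for name, labels in categories.items():
            for label, share in targets[name].items():
                mask = labels == label
                current = w[mask].sum() / w.sum()
                if current > 0:
                    factor = share / current
                    w[mask] *= factor  # pull this marginal toward its target
                    max_shift = max(max_shift, abs(factor - 1))
        if max_shift < tol:  # all marginals match; stop early
            break
    return w / w.mean()  # normalize to mean weight 1

# Hypothetical panel that over-represents low-income users
income = np.array(["low"] * 7 + ["high"] * 3)
gender = np.array(["f", "m"] * 5)
w = rim_weight(
    {"income": income, "gender": gender},
    {"income": {"low": 0.5, "high": 0.5},
     "gender": {"f": 0.5, "m": 0.5}},
)
```

After convergence, the weighted share of each category matches its target, even though the raw panel shares did not.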
2. Missing data (inside panel): we are missing the demographic data for a
majority of our panel, and the panel stats are unknown. This is the
typical missing data problem. There are several approaches:
a. Drop the incomplete records;
b. Use the mean/median or other sensible stat from the known data;
c. Reconstruct the sample using bootstrapping, to fit the IBP stats;
d. Infer the missing data with supervised learning (e.g. decision trees);
e. Infer the missing data with unsupervised learning (e.g. clustering);
f. Rim weighting also helps with missing data, with some assumptions.
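Option (d) above can be sketched by training a classifier on the labeled subset and predicting labels for the unlabeled rest. A minimal sketch with scikit-learn, where the features (visit counts per site category), the label rule, and the labeled fraction are entirely hypothetical:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 200
# Hypothetical features: visits per site category for each panelist
X = rng.poisson(3, size=(n, 4)).astype(float)
# Hypothetical "true" demographic label, for illustration only
income = np.where(X[:, 0] > 3, "high", "low")
# Only ~30% of the panel has third-party demographic labels
observed = rng.random(n) < 0.3

# Train on the labeled subset, impute the rest
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X[observed], income[observed])
imputed = income.copy()
imputed[~observed] = clf.predict(X[~observed])
```

The same shape works for option (e) by swapping the classifier for a clustering step plus a cluster-to-label mapping.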
3. Data quality (subset of panel): we use Exelate/Latome demographic data
as the seed for the above tasks; however, we cannot completely trust the
third-party data. We have designed several ways to test for quality, using
the K-S statistic and ROC as error metrics:
a. Use the subset of data where E and L agree;
b. Use independent data to compare with E and L (e.g. the naïve
Bayes one);
c. Aggregate from guid level to site level and compare with comScore.
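For the comparisons above, the two-sample K-S statistic can be computed directly with scipy. The data below are synthetic stand-ins for two third-party feeds, not real Exelate/Latome values:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
# Synthetic age distributions standing in for two third-party sources
ages_source_a = rng.normal(35, 10, 1000)
ages_source_b = rng.normal(35, 10, 1000)

# Two-sample Kolmogorov-Smirnov test: max distance between empirical CDFs
stat, pval = ks_2samp(ages_source_a, ages_source_b)
# A large K-S statistic (small p-value) flags disagreement between sources.
```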
I am currently focused on RIM weighting, which can be viewed as a simplified
propensity score matching method, even though I have some reservations about
the assumptions we make with RIM weights.
What do you think? Thanks a lot! | d****n posts: 12461 | 2 So a small part of your data has demographic labels and the majority does not, and you are trying to see whether the demographics can be reconstructed?
| c***z posts: 6348 | 3 Yeah, that is part of the problem.
I also want to weight my data to fit the internet population (IBP).
Thanks!
| l*******s posts: 1258 | 4 Actually, propensity scores work quite well.
The key is which method you use to estimate the propensity score.
The simplest options are linear regression or something like logistic regression.
Think more about how better feature engineering can capture the various properties of the data set. |
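The propensity score idea in the reply above can be sketched as follows: pool panel records with reference (IBP) records, fit a logistic regression for the probability that a record is in the panel, and weight panelists by the inverse odds. The features, sample sizes, and distributions below are all hypothetical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
# Hypothetical demographics: the panel skews low on feature 0
X_panel = rng.normal(-0.5, 1.0, size=(500, 2))
X_ref = rng.normal(0.0, 1.0, size=(500, 2))  # reference / IBP sample

# 1 = panel record, 0 = reference record
X = np.vstack([X_panel, X_ref])
y = np.array([1] * 500 + [0] * 500)

clf = LogisticRegression().fit(X, y)
p = clf.predict_proba(X_panel)[:, 1]  # propensity of being in the panel
weights = (1 - p) / p                 # inverse-odds weights for panelists
weights /= weights.mean()             # normalize to mean weight 1
```

Over-represented panelists get propensity above the pooled average and hence weights below 1, which is exactly the multiplier logic in the original post.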