DataSciences board - [Data Science Project Case] Bias Correction - third try
c***z
Posts: 6348
1
Dear all, thank you so much for your earlier input! I am now able to put my
thoughts together and understand the project better.
Let me write the whole thing down again. Any comments are extremely welcome!
Project name: Bias correction
Business objective: We have shopping cart information from a panel of 25M
users, and we want to infer national online sales by brand and channel. We do
so by finding and applying a multiplier to each shopping cart item, based on
our panel size and the panel's selection bias towards particular populations
(e.g. if our panel is skewed more towards low-income people than the IBP is,
then their shopping records should have a smaller multiplier than those of
high-income people).
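
To make the multiplier idea concrete, here is a minimal sketch of the simplest case: one multiplier per demographic group, computed as the expected population count divided by the panel count. The function name, the group labels, and all the numbers are made up for illustration; they are not from the actual project.

```python
def group_multipliers(panel_counts, population_shares, population_size):
    """Multiplier per group = expected population count / panel count."""
    multipliers = {}
    for group, n_panel in panel_counts.items():
        if n_panel == 0:
            raise ValueError(f"no panel members in group {group!r}")
        expected = population_shares[group] * population_size
        multipliers[group] = expected / n_panel
    return multipliers


# Hypothetical numbers: the panel over-represents low-income users.
panel_counts = {"low_income": 600, "high_income": 400}
population_shares = {"low_income": 0.5, "high_income": 0.5}

multipliers = group_multipliers(panel_counts, population_shares,
                                population_size=10_000)
for group, m in multipliers.items():
    print(group, round(m, 3))
```

The over-represented group ends up with the smaller multiplier, which is the behaviour described in the example above.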

Technical logic: We have a biased sample of the IBP, of which only a subset
has third-party demographic labels. Hence there are three subproblems:
1. Bias correction (from panel to IBP): this is a special kind of missing
data problem, where the population stats are known. We compute and assign
a weight to each subgroup (defined by demographics and brand/site). The
method here is rim weighting. Another classical method is regression. The
weights can be obtained from, and applied to, any of three levels: guids,
sites and panel; hence there are nine combinations overall. We are
particularly interested in:
a. From panel, to guids (current approach);
b. From panel, to sites;
c. From sites, to guids;
d. From sites, to sites;
e. From sites, to panel.
2. Missing data (inside panel): we are missing the majority of the
demographic data for our panel, and the panel stats are unknown. This is the
typical missing data problem. There are several ways to handle it:
a. Drop the incomplete records;
b. Use the mean/median or other sensible stat from the known data;
c. Reconstruct the sample using bootstrapping, to fit the IBP stats;
d. Infer the missing data with supervised learning (e.g. decision trees);
e. Infer the missing data with unsupervised learning (e.g. clustering);
f. Rim weighting also helps with missing data, under some assumptions.
3. Data quality (subset of panel): we use Exelate/Latome demographic data
as the seed for the above tasks, but we cannot completely trust the
third-party data. We have designed several ways to test its quality, using
the K-S statistic and ROC as error metrics:
a. Use the subset of data where E and L agree;
b. Use independent data to compare with E and L (e.g. the naïve Bayes one);
c. Aggregate from guid level to site level and compare with comScore.
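
For the K-S statistic mentioned in subproblem 3, a minimal sketch of the two-sample version is below: it is just the largest gap between the two empirical CDFs. The function name and the age lists are hypothetical; in practice the two samples could be, say, the ages that two vendors report for the same set of guids.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample K-S statistic: max gap between the empirical CDFs."""
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both pointers past every value <= x, so ties are
        # counted in both CDFs before the gap is measured.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


# Made-up ages for the same users from two hypothetical sources.
ages_e = [23, 25, 31, 34, 40, 42, 55, 61]
ages_l = [24, 25, 30, 36, 41, 47, 58, 66]
d = ks_statistic(ages_e, ages_l)
print("K-S statistic:", d)
```

A value near 0 means the two sources report similar distributions; a value near 1 means they barely overlap. This only compares distributions, so it says nothing about whether the two sources agree on any individual guid.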

I am currently focused on rim weights, a simplified propensity score
matching method, even though I have some reservations about the assumptions
that rim weighting makes.
What do you think? Thanks a lot!
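
For readers who have not used rim weighting (also called raking, or iterative proportional fitting), here is a rough sketch of the idea: start from unit weights and repeatedly rescale them so that each variable's weighted margin matches its known population share. The function names, the variables, and every number below are made up for illustration, and the sketch assumes every category in the data has a target share and at least one record.

```python
def rim_weights(records, targets, max_iter=200, tol=1e-9):
    """Rake unit weights until weighted margins match the target shares.

    records: list of dicts, e.g. {"income": "low", "age": "young"}
    targets: {variable: {category: share}}, shares sum to 1 per variable
    """
    weights = [1.0] * len(records)
    for _ in range(max_iter):
        max_change = 0.0
        for var, shares in targets.items():
            total = sum(weights)
            # Current weighted total for each category of this variable.
            current = {}
            for rec, w in zip(records, weights):
                current[rec[var]] = current.get(rec[var], 0.0) + w
            # Rescale so this variable's margin matches its target.
            for i, rec in enumerate(records):
                factor = shares[rec[var]] * total / current[rec[var]]
                weights[i] *= factor
                max_change = max(max_change, abs(factor - 1.0))
        if max_change < tol:
            break
    return weights


def weighted_share(records, weights, var, category):
    hit = sum(w for r, w in zip(records, weights) if r[var] == category)
    return hit / sum(weights)


# Hypothetical panel of 10 users, skewed towards low income and young.
records = (
    [{"income": "low", "age": "young"}] * 4
    + [{"income": "low", "age": "old"}] * 2
    + [{"income": "high", "age": "young"}] * 1
    + [{"income": "high", "age": "old"}] * 3
)
targets = {
    "income": {"low": 0.5, "high": 0.5},
    "age": {"young": 0.4, "old": 0.6},
}
weights = rim_weights(records, targets)
print(weighted_share(records, weights, "income", "low"))
print(weighted_share(records, weights, "age", "young"))
```

Note that this only matches the one-way margins, not the joint distribution of income and age, which is one of the assumptions worth having reservations about.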
d****n
Posts: 12461
2
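So a small part of your data has demographics and most of it does not, and
you are trying to see whether you can reconstruct the demographics?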

c***z
Posts: 6348
3
Yeah, that is part of the problem.
I also want to weight my data to fit the internet population (IBP).
Thanks!

l*******s
Posts: 1258
4
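Actually, propensity scores work quite well.
The key is what method you use to obtain the propensity score.
The simplest options are linear regression, logistic regression, and the like.
Think more about how to use better feature engineering to capture the
various properties of the data set.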
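
As a rough illustration of the logistic route, here is one possible sketch: stack the panel with a reference sample of the target population, fit a logistic regression for "is this record from the panel?", and weight each panel record by the odds (1 - p) / p. The reference sample, the single income feature, and all counts are assumptions made up for this example, and the fit uses plain gradient descent only to keep the sketch dependency-free.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Plain batch gradient descent on the logistic log-loss."""
    n, k = len(X), len(X[0])
    w, b = [0.0] * k, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * k, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j in range(k):
                grad_w[j] += err * xi[j]
        b -= lr * grad_b / n
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
    return w, b


def propensity_weight(x, w, b):
    """Odds of reference vs panel membership: (1 - p) / p."""
    p = sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))
    return (1.0 - p) / p


# Made-up data, one feature: 1 = low income, 0 = high income.
panel = [[1]] * 60 + [[0]] * 40       # label 1: record is in the panel
reference = [[1]] * 50 + [[0]] * 50   # label 0: hypothetical reference sample
X = panel + reference
y = [1] * len(panel) + [0] * len(reference)

w, b = fit_logistic(X, y)
weight_low = propensity_weight([1], w, b)
weight_high = propensity_weight([0], w, b)
print("low income weight:", round(weight_low, 3))
print("high income weight:", round(weight_high, 3))
```

With a single binary feature this just recovers the ratio of reference share to panel share in each group. The feature engineering point is where it gets interesting: with richer features the model can capture bias that simple group counts would miss.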