c***z 发帖数: 6348 | 1 Dear all, thank you so much for your earlier inputs! Now I am able to put my
thoughts together and understand the project better.
Let me write everything down again. Any comments are extremely welcome!
Project name: Bias correction
Business objective: We have a panel of 25M users' shopping cart information,
and we want to infer national online sales by brand and channel. We do so by
finding and applying multipliers to each shopping cart item, based on our
panel size and its selection bias towards particular populations (e.g. if our
panel is more skewed towards low-income people than the IBP, then their
shopping records should have smaller multipliers than those of high-income
people).
Technical logic: We have a biased sample of the IBP, among which only a
subset has third-party demographic labels. Hence there are three
subproblems:
1. Bias correction (from panel to IBP): this is a special kind of missing
data problem, where the population stats are known. We compute and assign
weights to each subgroup (defined by demographics and brand/site). The
method here is rim weighting; another classical method is regression. The
weights can be obtained from, and applied to, three levels: guids, sites and
panel; hence overall there are nine combinations. We are particularly interested in:
a. From panel, to guids (current approach);
b. From panel, to sites;
c. From sites, to guids;
d. From sites, to sites;
e. From sites, to panel.
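The rim weighting step above can be sketched as iterative proportional fitting: repeatedly rescale the panel weights so that each weighted demographic marginal matches the known IBP marginal. A minimal sketch, where the two variables, their categories, and the target shares are all hypothetical toy values:

```python
import numpy as np

def rim_weight(categories, targets, n_iter=50, tol=1e-8):
    """Rim weighting (raking / iterative proportional fitting).

    categories: dict of variable name -> array of category labels per panelist.
    targets: dict of variable name -> dict mapping label -> target IBP share.
    """
    n = len(next(iter(categories.values())))
    w = np.ones(n)
    for _ in range(n_iter):
        max_shift = 0.0
        for name, labels in categories.items():
            for label, share in targets[name].items():
                mask = labels == label
                current = w[mask].sum() / w.sum()
                if current > 0:
                    factor = share / current
                    w[mask] *= factor  # pull this marginal toward its target
                    max_shift = max(max_shift, abs(factor - 1))
        if max_shift < tol:  # all marginals match; stop early
            break
    return w / w.mean()  # normalize to mean weight 1

# Hypothetical panel that over-represents low-income users
income = np.array(["low"] * 7 + ["high"] * 3)
gender = np.array(["f", "m"] * 5)
w = rim_weight(
    {"income": income, "gender": gender},
    {"income": {"low": 0.5, "high": 0.5},
     "gender": {"f": 0.5, "m": 0.5}},
)
```

After convergence, the weighted share of each category matches its target, even though the raw panel shares did not.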
2. Missing data (inside panel): we are missing the demographic data for a
majority of our panel, and the panel stats are unknown. This is the
typical missing data problem. There are several approaches:
a. Drop the incomplete records;
b. Use the mean/median or other sensible stat from the known data;
c. Reconstruct the sample using bootstrapping, to fit the IBP stats;
d. Infer the missing data with supervised learning (e.g. decision trees);
e. Infer the missing data with unsupervised learning (e.g. clustering);
f. Rim weighting also helps with missing data, with some assumptions.
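Option (d) above can be sketched by training a classifier on the labeled subset and predicting labels for the unlabeled rest. A minimal sketch with scikit-learn, where the features (visit counts per site category), the label rule, and the labeled fraction are entirely hypothetical:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 200
# Hypothetical features: visits per site category for each panelist
X = rng.poisson(3, size=(n, 4)).astype(float)
# Hypothetical "true" demographic label, for illustration only
income = np.where(X[:, 0] > 3, "high", "low")
# Only ~30% of the panel has third-party demographic labels
observed = rng.random(n) < 0.3

# Train on the labeled subset, impute the rest
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X[observed], income[observed])
imputed = income.copy()
imputed[~observed] = clf.predict(X[~observed])
```

The same shape works for option (e) by swapping the classifier for a clustering step plus a cluster-to-label mapping.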
3. Data quality (subset of panel): we use Exelate/Latome demographic data
as the seed for the above tasks; however, we cannot completely trust the
third-party data. We have designed several ways to test for quality, using
the K-S statistic and ROC as error metrics:
a. Use the subset of data where E and L agree;
b. Use independent data to compare with E and L (e.g. the naïve
Bayes one);
c. Aggregate from guid level to site level and compare with comScore.
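For the comparisons above, the two-sample K-S statistic can be computed directly with scipy. The data below are synthetic stand-ins for two third-party feeds, not real Exelate/Latome values:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
# Synthetic age distributions standing in for two third-party sources
ages_source_a = rng.normal(35, 10, 1000)
ages_source_b = rng.normal(35, 10, 1000)

# Two-sample Kolmogorov-Smirnov test: max distance between empirical CDFs
stat, pval = ks_2samp(ages_source_a, ages_source_b)
# A large K-S statistic (small p-value) flags disagreement between sources.
```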
I am currently focused on RIM weighting, which can be viewed as a simplified
propensity score matching method, even though I have some reservations about
the assumptions we make with RIM weights.
What do you think? Thanks a lot! | d****n posts: 12461 | 2 So a small part of your data has demographic labels and the majority does not, and you are trying to see whether the demographics can be reconstructed?
| c***z posts: 6348 | 3 Yeah, that is part of the problem.
I also want to weight my data to fit the internet population (IBP).
Thanks!
| l*******s posts: 1258 | 4 Actually, propensity scores work quite well.
The key is which method you use to estimate the propensity score.
The simplest options are linear regression or something like logistic regression.
Think more about how better feature engineering can capture the various properties of the data set. |
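The propensity score idea in the reply above can be sketched as follows: pool panel records with reference (IBP) records, fit a logistic regression for the probability that a record is in the panel, and weight panelists by the inverse odds. The features, sample sizes, and distributions below are all hypothetical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
# Hypothetical demographics: the panel skews low on feature 0
X_panel = rng.normal(-0.5, 1.0, size=(500, 2))
X_ref = rng.normal(0.0, 1.0, size=(500, 2))  # reference / IBP sample

# 1 = panel record, 0 = reference record
X = np.vstack([X_panel, X_ref])
y = np.array([1] * 500 + [0] * 500)

clf = LogisticRegression().fit(X, y)
p = clf.predict_proba(X_panel)[:, 1]  # propensity of being in the panel
weights = (1 - p) / p                 # inverse-odds weights for panelists
weights /= weights.mean()             # normalize to mean weight 1
```

Over-represented panelists get propensity above the pooled average and hence weights below 1, which is exactly the multiplier logic in the original post.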