c***z Posts: 6348 | 1 [The following text is reposted from the DataSciences board]
Posted by: chaoz (Facing the sea, eating a bowl of liangpi), Board: DataSciences
Subject: [Data Science Project Case] Bias Correction - second try
Posted from: BBS 未名空间站 (Fri Jan 24 18:08:30 2014, US Eastern)
Hi all,
First thank you all so much for your inputs! They were extremely helpful!
Here is what we are doing as a second try (actually maybe 5th try, but we
only count major overhauls here).
Again, any input is extremely welcome! Thanks!
Situation Brief
Project name: Bias correction
Business objective: We have a panel of 25M users' shopping cart information,
and we want to infer national online sales by brand and channel. We do so by
finding and applying a multiplier to each shopping cart item, based on our
panel size and its selection bias towards particular populations (e.g. if our
panel is more skewed towards low income people than the IBP, then their
shopping records should have a smaller multiplier than those of high
income people).
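To make the multiplier idea concrete, here is a minimal sketch, assuming the
multiplier for a group is simply its share in the reference population (the
IBP above) divided by its share in the panel. The income groups, all numbers,
and the scale_up factor are made up for illustration; none of them come from
our data.

```python
# Hypothetical shares of each income group (illustrative numbers only).
panel_share = {"low": 0.50, "mid": 0.30, "high": 0.20}
population_share = {"low": 0.30, "mid": 0.40, "high": 0.30}


def group_multipliers(panel_share, population_share):
    """Multiplier per group = population share / panel share.

    Over-represented groups get a multiplier below 1,
    under-represented groups get a multiplier above 1.
    """
    return {g: population_share[g] / panel_share[g] for g in panel_share}


def estimate_national_sales(cart_items, multipliers, scale_up):
    """Weight each cart item by its buyer's group multiplier, then scale
    the panel total up to the national level.

    scale_up is a hypothetical population-size / panel-size factor.
    """
    weighted_total = sum(price * multipliers[group] for group, price in cart_items)
    return scale_up * weighted_total


multipliers = group_multipliers(panel_share, population_share)

# Each cart item is (buyer's income group, price); again made-up values.
cart_items = [("low", 20.0), ("mid", 50.0), ("high", 100.0)]
estimate = estimate_national_sales(cart_items, multipliers, scale_up=10.0)
```

With these shares the low income group is over-represented in the panel, so
its items are weighted down, and the high income group is weighted up.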
Challenge:
1. Our panel is perceived to be skewed in many ways, such as age, gender,
income, tech and financial expertise, etc., due to the ways we acquire
users and data
2. Our data is incomplete in that other than shopping cart data, only a
small percentage of our panel has third party demographic data
3. We cannot completely trust the third party data, even though we try to
get close to comScore data as a benchmark
4. We need a good metric to measure “closeness” to the benchmark
5. We do not know how other biases, for which we have no data, interact with
the demographic biases, or whether new bias is introduced when we sample
only users with particular information
Technical logic:
1. First we need to decide the level of analysis: individual level, site/
brand level or panel level.
a. Individual level: first cluster users in terms of similarity in search
and click behavior (natural language processing, see SO technical brief),
then label users using their nearest neighbor
b. Site/brand level: a direct attempt at the final product. First join
the inferred or third party individual gender labels with our own page
visit dataset to obtain site-person-gender triples, then aggregate at the
site level for the gender decomposition, and compare it with the comScore
data to obtain a multiplier for each site (and later for each brand or
site-brand pair)
c. Panel level: this approach serves more as a test. As in the site/brand
approach, generate the site decomposition, but adjust it for bias using a
panel-level multiplier (the quotient of the IBP ratio and the panel ratio,
computed over the users for whom data is available), then compare with the
comScore data
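The site-level join and the panel-level adjustment can be sketched roughly as
follows. This is only an illustration under assumptions: the user ids, site
names, gender labels and IBP shares are invented, and counting each
site-person pair once is my simplification, not necessarily what the
production pipeline does.

```python
from collections import defaultdict

# Hypothetical inputs: gender labels exist only for some users.
gender_labels = {"u1": "F", "u2": "M", "u3": "F", "u4": "M", "u5": "M"}
page_visits = [
    ("u1", "siteA"), ("u2", "siteA"), ("u3", "siteA"),
    ("u4", "siteB"), ("u5", "siteB"), ("u1", "siteB"),
    ("u6", "siteB"),  # u6 has no label and is dropped by the join
]
ibp_share = {"F": 0.5, "M": 0.5}  # assumed reference population shares


def site_gender_counts(page_visits, gender_labels):
    """Join visits with labels into unique site-person-gender triples,
    then count labelled users per site and gender."""
    triples = {
        (site, user, gender_labels[user])
        for user, site in page_visits
        if user in gender_labels
    }
    counts = defaultdict(lambda: defaultdict(int))
    for site, _user, gender in triples:
        counts[site][gender] += 1
    return counts


def panel_multipliers(gender_labels, ibp_share):
    """Panel-level multiplier = IBP share / panel share, over labelled users.

    Assumes every gender in ibp_share appears at least once in the panel.
    """
    n = len(gender_labels)
    panel_share = {
        g: sum(1 for v in gender_labels.values() if v == g) / n
        for g in ibp_share
    }
    return {g: ibp_share[g] / panel_share[g] for g in ibp_share}


def adjusted_decomposition(counts, multipliers):
    """Reweight each site's gender counts and normalise to shares."""
    result = {}
    for site, by_gender in counts.items():
        weighted = {g: c * multipliers[g] for g, c in by_gender.items()}
        total = sum(weighted.values())
        result[site] = {g: w / total for g, w in weighted.items()}
    return result


counts = site_gender_counts(page_visits, gender_labels)
multipliers = panel_multipliers(gender_labels, ibp_share)
decomposition = adjusted_decomposition(counts, multipliers)
```

The adjusted per-site shares in decomposition are what would then be compared
against the comScore figures.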
2. Second we need to build a testing method: compare data from different
sources for confidence.
a. Benchmark: we need data we can trust as a benchmark (anchor). We chose
comScore; see the panel level approach above for details
b. Error metric: we need a metric to measure the performance of inferred or
third party data. We chose the K-S test
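For reference, a minimal sketch of the two-sample K-S statistic (the largest
gap between two empirical CDFs) in plain Python. Which quantity the test is
run on is an assumption here: the example compares per-site female shares
from our data against comScore's, with invented numbers. It computes only the
statistic, not a p-value.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max |ECDF_a(x) - ECDF_b(x)|."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    # The gap between the two step functions can only change at observed values.
    points = sorted(set(a) | set(b))
    d = 0.0
    for x in points:
        cdf_a = bisect_right(a, x) / len(a)  # fraction of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # fraction of b that is <= x
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Hypothetical per-site female shares (illustrative numbers only).
our_shares = [0.40, 0.55, 0.62, 0.48]
comscore_shares = [0.45, 0.58, 0.60, 0.52]
distance = ks_statistic(our_shares, comscore_shares)
```

A smaller distance means the two sets of shares are distributed more alike;
identical samples give 0 and fully separated samples give 1.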
3. Third we need presentable results
s******0 Posts: 1269 | 2 Pulling up a stool to wait for the replies.
By the way, where are you from? Do you like liangpi too?
c***z Posts: 6348 | 3 Hunan, my wife is from Shanxi, she likes it :)