[Road map] From ClickStream to ConsumerInsight - DataSciences board - 未名空间 (MITBBS) archive

c***z Posts: 6348	1 In case you are curious about what data scientists do, here is one case that spans multiple projects and involves multiple teams. It is a big effort and not entirely within my scope, but I will try my best to describe it.

Stage 1. We need the clickstream data. It is the crawler/parser team's job to get the URLs (ideally the whole pages as well) from websites and classify them, and the Hadoop admin team's job to store them. This is a monster in its own right, and I am no expert on it at all.

How do you know who did what? Well, that is a trade secret. You can also buy this data from app developers (e.g. from the Chrome app store). Much free software collects more data than necessary; nothing is really free.

Stage 2. We need to clean the data. It is the data team's job to remove garbage, impose structure, and compensate for bad data. The work can vary a lot depending on the final product. There are many difficulties besides the sheer size of the data, and neither a traditional SDE nor a statistician can do this alone. Some examples:

Issue 1: data format. The parser might mess up and produce garbage; we should be able to detect and remove it.

Issue 2: date and time. The clickstream timestamp comes from the host computer's system clock, which might be wrong; there are also time zone differences.

Issue 3: item names. The same item can appear under different names on different pages. One way to deal with this is to build a product database keyed by SKU or ASIN, but not all product page URLs have these. Another way is to use some kind of string similarity measure, such as the Jaccard index. But as with any unsupervised learning, testing this method is difficult.

Issue 4: sample bias. Clickstream data collected from apps is naturally biased towards app users, and towards less geeky users, since the geeky ones can disable the data collection. This is a big deal since clients want unbiased data. One way to deal with this is RIM weighting, using some third-party data as ground truth. Another way is bootstrapping. Only one thing is certain: there will be bias.

Issue 5: incomplete data. We have data from only part of the population, and that data is itself incomplete. For example, we may have only 2% of the shopping cart information. One way to deal with it is statistical inference.

Stage 3. We need to build and test models to answer questions about, for example:
1. popular items
2. paths to purchase
3. market share
4. sales prediction
5. recommender systems

But before that, we need to extract features for modeling when simple aggregations won't work (e.g. for 2, 4, and 5), i.e. transform lists of pages visited into predictors such as the number of times an item is viewed, its ordinal position, whether the visit came from a search, etc.

When we finally build and test models, things are back to normal for statisticians, except that the training data can now be huge. One can pilot on a sample using R, Matlab, or Python, then push to large scale. I have been using Scalding, with mixed feelings of love and hatred.

Thanks for reading! Please share your comments and/or workflow!
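
[Editor's sketch] The string-matching idea in Issue 3 could look roughly like the following. This is a minimal sketch, not the poster's actual method: it assumes names are compared as sets of lowercase alphanumeric tokens, and the function names and the 0.6 threshold are made up for illustration.

```python
import re


def tokenize(name):
    """Lowercase a product name and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))


def jaccard_index(name_a, name_b):
    """Jaccard index |A & B| / |A | B| of the two names' token sets."""
    tokens_a, tokens_b = tokenize(name_a), tokenize(name_b)
    if not tokens_a and not tokens_b:
        return 1.0  # assumed convention: two empty names count as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def probably_same_item(name_a, name_b, threshold=0.6):
    """Hypothetical matching rule; the threshold is an assumption and needs tuning."""
    return jaccard_index(name_a, name_b) >= threshold
```

For instance, "Apple iPhone 5s 16GB Gold" and "iPhone 5S (16GB, Gold) by Apple" share 5 of 6 distinct tokens, so their index is 5/6. Choosing the threshold is exactly the testing problem the post mentions: without labeled pairs there is no obvious way to know whether 0.6 is too strict or too loose.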
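
[Editor's sketch] The feature extraction step in Stage 3 (turning a list of visited pages into per-item predictors) might be sketched as below. The `PageView` record and its fields are hypothetical; the post does not describe the real schema. The three features are the ones the post names: view count, ordinal position, and whether the visit came from a search.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class PageView:
    """Hypothetical clickstream record; field names are assumptions."""
    url: str
    item_id: Optional[str]    # None for non-product pages
    referrer_is_search: bool  # True if the visit came from a search page


def extract_item_features(session: List[PageView]) -> Dict[str, dict]:
    """Map each item seen in one session (in visit order) to simple predictors."""
    features: Dict[str, dict] = {}
    for position, view in enumerate(session, start=1):
        if view.item_id is None:
            continue  # skip pages that are not product pages
        item = features.setdefault(
            view.item_id,
            {"view_count": 0, "first_position": position, "from_search": False},
        )
        item["view_count"] += 1
        # An item counts as "from search" if any of its views came from a search.
        item["from_search"] = item["from_search"] or view.referrer_is_search
    return features
```

A session with a search page, then item A reached from search, then item B, then item A again would give A a view count of 2 at first position 2 with from_search set, and B a view count of 1 at first position 3. At scale the same per-session logic would run inside whatever batch framework is in use (the post mentions Scalding).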
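d****n Posts: 12461	2 Ad targeting is a very competitive space too. Besides Google, I have seen plenty of companies, large and small, working on it, both upstream and downstream. I have never found it very interesting, though. The money is real, but there is no sense of accomplishment.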
D**u Posts: 288	3 Thanks for sharing the insights! Very helpful. A question: do you prototype in R first and then use Scalding to calculate the scores? Also, do you use any real-time analytics?
c***z Posts: 6348	4 We don't do real-time analytics. I am interested in learning about it if people are willing to share :) [In reply to D**u]
d****n Posts: 12461	5 Analysis and real time are inherently at odds. What people do today is mostly mouse heatmaps, A/B tests, and location-aware ad targeting. Real time has three key requirements: 1. monetizable 2. actionable 3. profitable. Many things billed as real time only have the first, and they end up failing. [In reply to c***z]
g**********l Posts: 214	6 It is so hard to get real-time data, for technical or organizational reasons. Of course, real-time product recommendation for e-commerce is always the biggest use case.
D**u Posts: 288	7 Got you. Could you say how long your analysis cycle usually is, 2 weeks or 2 days? [In reply to c***z]
c***z Posts: 6348	8 We re-run the models daily; new models take one week to deploy; new products take much longer, usually quarters. [In reply to D**u]
l*******m Posts: 1096	9 Real time still has a lot of potential. Because the bar is high, it takes a solid grasp of CS, probability, and statistics. [In reply to d****n]
r*****d Posts: 346	10 Does "real time" here mean real-time recommendation? [In reply to d****n]
r*****d Posts: 346	11 Thumbs up. [In reply to c***z]
