DataSciences board - [Road map] From ClickStream to ConsumerInsight
c***z
Posts: 6348
1
In case you are curious about what data scientists do, here is one case that
spans multiple projects and involves multiple teams.
It is a big effort and not entirely within my scope, but I will do my best to
describe it.
Stage 1. We need the clickstream data. It is the crawler/parser team's job
to get the URLs (ideally the whole pages as well) from websites and
classify them, and the Hadoop admin team's job to store them. This is
a monster in its own right, and I am no expert on it at all.
How do you know who did what? Well, that is a trade secret. You can also buy
this data from app developers (e.g. from the Chrome app store). Much free
software collects more data than necessary - nothing is really free.
Stage 2. We need to clean the data. It is the data team's job to remove
garbage, impose structure and compensate for bad data. The work can vary
a lot depending on the final product.
There are many difficulties besides the sheer size of the data. Neither a
traditional SDE nor a statistician can do this alone. Just some examples:
Issue 1: data format. The parser might mess up and emit garbage; we should
be able to detect and remove those records.
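
A minimal sketch of the kind of sanity check Issue 1 calls for, assuming each
parsed record is a dict with the hypothetical fields user_id, url and
timestamp. The field names and the specific rules are illustrative
assumptions, not the team's actual filter.

```python
from urllib.parse import urlparse

# Hypothetical schema: the field names are assumptions for illustration.
REQUIRED_FIELDS = ("user_id", "url", "timestamp")


def is_valid_record(record: dict) -> bool:
    """Return True if a parsed clickstream record looks structurally sane."""
    # Every required field must be present and be a non-empty string.
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            return False
    url = record["url"]
    # Junk from a failed parse often contains non-printable characters.
    if not url.isprintable():
        return False
    try:
        parsed = urlparse(url)
    except ValueError:
        # urlparse rejects some malformed inputs, e.g. broken IPv6 hosts.
        return False
    # Require an http(s) scheme and a host part.
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def split_clean_and_garbage(records):
    """Partition records into (clean, garbage) so garbage can be inspected."""
    clean, garbage = [], []
    for record in records:
        (clean if is_valid_record(record) else garbage).append(record)
    return clean, garbage
```

Keeping the rejected records instead of dropping them silently makes it easier
to tell a parser regression from ordinary noise.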
Issue 2: date and time. The clickstream timestamp comes from the host
computer's system clock, which might be wrong; there are also time zone
differences.
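
One way Issue 2 could be handled, sketched under two assumptions that are not
in the post: each event carries the client's UTC offset in minutes, and each
upload records both the client-clock send time and the server receive time.
The function names and the 30 second tolerance are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def to_utc(local_naive: datetime, utc_offset_minutes: int) -> datetime:
    """Attach the client's UTC offset to a naive local time and convert to UTC."""
    client_tz = timezone(timedelta(minutes=utc_offset_minutes))
    return local_naive.replace(tzinfo=client_tz).astimezone(timezone.utc)


def correct_clock_skew(event_utc: datetime,
                       client_sent_utc: datetime,
                       server_received_utc: datetime,
                       tolerance: timedelta = timedelta(seconds=30)) -> datetime:
    """Shift an event time by the gap between client and server clocks.

    The gap is measured at upload time and ignores network latency, so small
    gaps (within `tolerance`) are treated as noise and left alone.
    """
    skew = server_received_utc - client_sent_utc
    if abs(skew) <= tolerance:
        return event_utc
    return event_utc + skew
```

For example, an event stamped 09:00 at UTC-5 becomes 14:00 UTC, and if the
client clock turns out to be one hour slow at upload time the event is moved
to 15:00 UTC.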
Issue 3: item names. The same item can have different names on different
pages. One way to deal with this is to build a product database keyed by SKU
or ASIN, but not all product page URLs carry these. Another way is to use
some kind of string similarity measure, such as the Jaccard index. But as
with any unsupervised learning, testing this method is difficult.
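
A minimal sketch of the Jaccard index on item names, assuming names are
compared as sets of lowercase alphanumeric tokens. The tokenizer and the 0.6
threshold are assumptions; in practice the threshold would need tuning
against labeled pairs.

```python
import re


def tokenize(name: str) -> set:
    """Split an item name into a set of lowercase alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))


def jaccard_index(a: str, b: str) -> float:
    """Size of the token intersection divided by the size of the token union."""
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    if not tokens_a and not tokens_b:
        return 1.0  # two empty names are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def same_item(a: str, b: str, threshold: float = 0.6) -> bool:
    """Guess whether two names refer to the same item (threshold is a guess)."""
    return jaccard_index(a, b) >= threshold
```

Because the index works on token sets, word order and punctuation do not
matter, but two different models that share most of their tokens will also
score high.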
Issue 4: sample bias. Clickstream collected from apps is naturally
biased towards app users, and towards less geeky users, since the
geeky ones can disable the data collection. This is a big deal
since clients want unbiased data. One way to deal with it is RIM
weighting (raking), using some third-party data as ground truth. Another way
is bootstrapping. Only one thing is for sure: there will be bias.
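
RIM weighting can be sketched as iterative proportional fitting: rescale the
weights so each demographic margin matches its third-party target, and repeat
until the margins stop moving. This is a plain-Python illustration under the
assumption that every panelist has a value for every variable and every
target category appears in the sample; the variable names are hypothetical.

```python
def rake_weights(respondents, targets, max_iter=100, tol=1e-9):
    """Compute RIM (raking) weights.

    respondents: list of dicts, e.g. {"gender": "f", "age": "young"}
    targets: dict of variable -> {category: population share}, shares sum to 1
    Returns one weight per respondent; the weights sum to len(respondents).
    """
    weights = [1.0] * len(respondents)
    for _ in range(max_iter):
        max_change = 0.0
        for var, shares in targets.items():
            # Current weighted total of each category of this variable.
            totals = {}
            for w, r in zip(weights, respondents):
                totals[r[var]] = totals.get(r[var], 0.0) + w
            total_weight = sum(weights)
            # Rescale so this variable's weighted shares match the targets.
            for i, r in enumerate(respondents):
                factor = shares[r[var]] * total_weight / totals[r[var]]
                max_change = max(max_change, abs(factor - 1.0))
                weights[i] *= factor
        if max_change < tol:
            break  # all margins already match their targets
    return weights
```

The weights only fix bias along the variables that are raked on; bias along
anything the third-party data does not cover stays in the sample.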
Issue 5: incomplete data. We only have data from part of the population, and
that data is itself incomplete. For example, we may have only 2% of the
shopping cart information. One way to deal with this is statistical inference.
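
As one example of what statistical inference could mean for Issue 5, the
sketch below scales an observed share of carts up to a population total and
attaches a Wilson score interval. It assumes the observed carts behave like a
simple random sample, which is a strong assumption given the bias discussed
in Issue 4, and the function names are hypothetical.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return center - half, center + half


def estimate_total(successes: int, n_observed: int, n_population: int,
                   z: float = 1.96):
    """Scale an observed share up to the population: (point, low, high)."""
    low, high = wilson_interval(successes, n_observed, z)
    point = successes / n_observed * n_population
    return point, low * n_population, high * n_population
```

For instance, estimate_total(200, 2000, 100000) asks how many of 100,000
carts contain an item that shows up in 200 of the 2,000 carts we can see, and
returns a range instead of a single number.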
Stage 3. We need to build and test models to answer questions, such as
1. popular items
2. paths to purchase
3. market share
4. sales prediction
5. recommender system
But before that, we need to extract features for modeling when simple
aggregations won’t work (e.g. 2, 4, 5), i.e. transform lists of pages
visited into predictors such as the number of times an item is viewed, its
ordinal position, whether it came from a search, etc.
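
The feature extraction step might look like the sketch below, assuming a
session is an ordered list of page dicts with the hypothetical fields
page_type ("search", "product", ...) and item_id. It derives the three
predictors named above for each item in the session.

```python
def extract_item_features(session):
    """Turn an ordered list of visited pages into per-item predictors.

    Returns {item_id: {"view_count", "first_position", "from_search"}} where
    first_position is the 1-based position of the item's first view and
    from_search is True if any view directly followed a search page.
    """
    features = {}
    previous_type = None
    for position, page in enumerate(session, start=1):
        if page["page_type"] == "product":
            item = features.setdefault(page["item_id"], {
                "view_count": 0,
                "first_position": position,
                "from_search": False,
            })
            item["view_count"] += 1
            if previous_type == "search":
                item["from_search"] = True
        previous_type = page["page_type"]
    return features
```

Each session becomes one row per item, which is the shape that path-to-purchase
or recommender models usually expect as input.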
When we finally build and test the model, things return to normal for
statisticians, except that the training data can now be huge. One can pilot
with a sample using R, Matlab or Python, then scale up. I have
been using Scalding, with mixed feelings of love and hatred.
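
For the pilot sample, one option is to sample by user rather than by event, so
that each sampled user's whole click path stays intact. The sketch below
hashes a hypothetical user_id into 100 buckets; it is an illustration, not a
description of the actual pipeline.

```python
import hashlib


def in_sample(user_id: str, percent: int) -> bool:
    """Deterministically keep roughly `percent`% of users.

    The same user_id always lands in the same bucket, so the sample is
    stable across days and across machines.
    """
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent


def sample_events(events, percent: int):
    """Keep every event that belongs to a sampled user."""
    return [e for e in events if in_sample(e["user_id"], percent)]
```

The same rule can later be applied to the full data, which makes it easy to
compare pilot results with the large-scale run on the same users.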
Thanks for reading! Please share your comments and/or workflow!
d****n
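Posts: 12461
2
Competition in ad targeting is fierce too. Besides Google, I have seen plenty
of companies, big and small, working on it, both upstream and downstream.
But I have never found it very interesting. The money is real, but there is no
sense of accomplishment.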
D**u
Posts: 288
3
Thanks for sharing the insights! Very helpful. One question: do you prototype
in R and then use Scalding to calculate the scores? Also, do you use any
real-time analytics?
c***z
Posts: 6348
4
we don't do real time analytics, I am interested in learning that if people
are willing to share :)

[Quoting D**u's post]
: Thanks for sharing the insights! Very helpful. One question: do you prototype
: in R and then use Scalding to calculate the scores? Also, do you use any
: real-time analytics?

d****n
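Posts: 12461
5
Analysis and real time are inherently at odds. What people do these days is
little more than mouse heatmaps, A/B tests and location-aware ad targeting.
Doing real time takes three key elements:
1. monetizable
2. actionable
3. profitable
Many things billed as real time only have 1, and they end up failing.

[Quoting c***z's post]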
: we don't do real time analytics, I am interested in learning that if people
: are willing to share :)

g**********l
Posts: 214
6
it is so hard to get real-time data, for tech or org reasons.
of course, real-time product recommendation for ecommerce is always the
biggest use case.
D**u
Posts: 288
7
Got it. Could you say how long your analysis cycle usually is, 2 weeks or
2 days?

[Quoting c***z's post]
: we don't do real time analytics, I am interested in learning that if people
: are willing to share :)

c***z
Posts: 6348
8
We re-run the models daily,
new models take one week to deploy,
new products take much longer, usually quarters.

[Quoting D**u's post]
: Got it. Could you say how long your analysis cycle usually is, 2 weeks or
: 2 days?

l*******m
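Posts: 1096
9
Real time still has plenty of potential. The bar is high, so you need a grasp
of CS, probability and statistics alike.

[Quoting d****n's post]
: Analysis and real time are inherently at odds. What people do these days is
: little more than mouse heatmaps, A/B tests and location-aware ad targeting.
: Doing real time takes three key elements:
: 1. monetizable
: 2. actionable
: 3. profitable
: Many things billed as real time only have 1, and they end up failing.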

r*****d
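Posts: 346
10
Does "real time" here mean real-time recommendation?

[Quoting d****n's post]
: Analysis and real time are inherently at odds. What people do these days is
: little more than mouse heatmaps, A/B tests and location-aware ad targeting.
: Doing real time takes three key elements:
: 1. monetizable
: 2. actionable
: 3. profitable
: Many things billed as real time only have 1, and they end up failing.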

r*****d
Posts: 346
11
Great post.

[Quoting c***z's post]
: In case you are curious about what data scientists do, this is one case that
: has multiple projects and involves multiple teams.
: It is a big thing and not completely in my scope, but I will try my best to
: describe it.
: Stage 1. We need the clickstream data. It is the crawler/parser team's job
: to get the urls (optimally, the whole pages as well) from websites and
: classify them, and the hadoop admin team's job to store them in place. It is
: a monster in its own sake, and I am no expert on this at all.
: How do you know who did what? Well, that is a trade secret. You can also buy
: those data from app developers (e.g. from Chrome app store). Many free
