DataSciences board - [Data Science Project Case] Generate Categories for Product
c***z
Posts: 6348
1
Hi all,
I am currently working on building a uniform set of product categories for
the products listed on various websites.
I can think of several approaches:
1. clustering using the Jaccard index
2a. decision tree based on a manually built dictionary
2b. decision tree based on entropy (real machine learning)
3. neural network (I am least familiar with this approach, but my boss is
all into it)
It would be great if you guys could give some suggestions/comments! I can
provide more details if needed.
Thanks a lot!
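
To make option 2b concrete, here is a minimal sketch of the entropy criterion a decision tree uses to pick a split. The toy titles, labels, and the "title contains keyword" feature are hypothetical illustrations, not data or features from this project.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(titles, labels, keyword):
    """Entropy reduction from splitting on 'title contains keyword'."""
    has_kw = [keyword in title.lower().split() for title in titles]
    with_kw = [y for y, hit in zip(labels, has_kw) if hit]
    without_kw = [y for y, hit in zip(labels, has_kw) if not hit]
    n = len(labels)
    remainder = (len(with_kw) / n) * entropy(with_kw) \
        + (len(without_kw) / n) * entropy(without_kw)
    return entropy(labels) - remainder

# Hypothetical toy data, for illustration only.
titles = ["red cotton shirt", "blue denim shirt",
          "usb phone charger", "wireless phone case"]
labels = ["apparel", "apparel", "electronics", "electronics"]

gain_shirt = information_gain(titles, labels, "shirt")  # separates the classes cleanly
gain_red = information_gain(titles, labels, "red")      # only isolates one item
```

An entropy-based tree would split on "shirt" before "red" because it gives the larger gain; the dictionary approach in 2a would instead rely on a person deciding which keywords matter.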
c****t
Posts: 19049
2
I don't quite follow. What kind of NN? Does your boss want to do deep learning?
c***z
Posts: 6348
3
Yes, my boss wants to do deep learning, and wants it done in one week...
c****t
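Posts: 19049
4
Most of the off-the-shelf code seems to be in MATLAB. You can either port it
to Python yourself or use the code from LISA lab. The LISA lab code seems to
cover Boltzmann machines but not Bayesian networks. It is all API
programming, so nothing to be afraid of.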

c***z
Posts: 6348
5
Thanks a lot for the information! I will check it out and keep you updated.
c***z
Posts: 6348
6
Some updates: clustering didn't work well.
I knew that k-means wouldn't work. Jaccard distance does satisfy the
triangle inequality, but k-means minimizes squared Euclidean distance to
cluster means, and a mean is not a meaningful center for sets compared by
Jaccard, so its convergence guarantee doesn't carry over.
I tried hierarchical agglomerative clustering and it didn't work well
either. I believe the reason is feature selection: I should have used
trigrams and such instead of words, as trivial words led to mis-clustering.
I am working on trigrams as well as NN, and will keep updating here. Thanks a
lot casact and guys!
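
For reference, a minimal sketch of the trigram idea: Jaccard distance over trigram sets, fed into average-linkage agglomerative clustering. The post does not say whether the trigrams are character or word trigrams; character trigrams are assumed here, and the titles, the 0.7 cutoff, and the function names are all hypothetical.

```python
def char_trigrams(text):
    """Set of character trigrams of a lowercased, whitespace-normalized string."""
    s = " ".join(text.lower().split())
    return {s[i:i + 3] for i in range(len(s) - 2)}

def jaccard_distance(a, b):
    """1 - |A & B| / |A | B|; defined as 0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def agglomerative(items, threshold):
    """Average-linkage clustering; stops once the closest pair of
    clusters is farther apart than `threshold`."""
    feats = [char_trigrams(x) for x in items]
    clusters = [[i] for i in range(len(items))]

    def linkage(c1, c2):
        dists = [jaccard_distance(feats[i], feats[j]) for i in c1 for j in c2]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        dist, i, j = min(
            ((linkage(clusters[i], clusters[j]), i, j)
             for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda t: t[0],
        )
        if dist > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]
    return [[items[k] for k in c] for c in clusters]

# Hypothetical product titles, for illustration only.
titles = ["usb phone charger", "usb phone chargers",
          "cotton t-shirt", "cotton tshirt"]
clusters = agglomerative(titles, threshold=0.7)
```

Agglomerative clustering only needs pairwise distances, so it works with Jaccard directly, which k-means does not. This brute-force version recomputes every linkage on each merge, so it is only suitable for small samples.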
l*******m
Posts: 1096
7
Is this problem supervised or unsupervised?


c***z
Posts: 6348
8
It is unsupervised. Even though we have Y labels, there is no easy way to
check accuracy, if I understand correctly. But I could be wrong, and I
would be very glad to hear it if so! :)
c***z
Posts: 6348
9
Some updates:
The requirement has changed to matching NPD categories, so it became a
supervised learning problem.
I used a hand-coded dictionary of keywords and hand-labeled items. The model
was a decision tree with two stages: first filter out irrelevant items
(90% accuracy), then assign labels (82% accuracy).
The good thing is that this can be iterative: we can improve the dictionary
using the confusion matrix, and then repeat until high accuracy is achieved.
Thanks a lot guys!
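
The two-stage flow and the confusion-matrix loop can be sketched roughly as follows. This is not the actual model: hand-written keyword rules stand in for the trained decision tree, and the dictionary, category names, and labeled items are hypothetical.

```python
from collections import Counter

# Hypothetical hand-coded dictionary: category -> keywords.
KEYWORDS = {
    "apparel": {"shirt", "jeans", "jacket"},
    "electronics": {"phone", "charger", "laptop"},
}

def keyword_hits(title):
    """Count dictionary keywords per category in a title."""
    tokens = set(title.lower().split())
    return {cat: len(tokens & words) for cat, words in KEYWORDS.items()}

def classify(title):
    hits = keyword_hits(title)
    # Stage 1: filter out items that match no dictionary keyword.
    if sum(hits.values()) == 0:
        return "irrelevant"
    # Stage 2: assign the category with the most keyword hits
    # (ties go to the category listed first in KEYWORDS).
    return max(hits, key=hits.get)

def confusion_matrix(true_labels, predicted_labels):
    """Counts keyed by (true label, predicted label)."""
    return Counter(zip(true_labels, predicted_labels))

# Hypothetical hand-labeled items, for illustration only.
labeled = [
    ("blue denim jeans", "apparel"),
    ("leather jacket", "apparel"),
    ("usb phone charger", "electronics"),
    ("laptop sleeve shirt pattern", "electronics"),
    ("gift card", "irrelevant"),
]
true_labels = [y for _, y in labeled]
predicted = [classify(title) for title, _ in labeled]
matrix = confusion_matrix(true_labels, predicted)
accuracy = sum(n for (t, p), n in matrix.items() if t == p) / len(labeled)
```

Off-diagonal cells such as ("electronics", "apparel") point at the items to inspect. Here "shirt" misleads the laptop sleeve, which suggests adding "sleeve" to the electronics keywords or weighting keywords before the next round.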