[Data Science Project Case] Generate Categories for Product - DataSciences board - 未名 archive



c***z posts: 6348	1 Hi all, I am currently working on building a uniform product category scheme for products listed on various websites. I can think of several approaches:
1. clustering using the Jaccard index
2a. a decision tree based on a manually built dictionary
2b. a decision tree based on entropy (real machine learning)
3. a neural network (I am least familiar with this approach, but my boss is all into it)
It would be great if you could give some suggestions or comments! I can provide more details if needed. Thanks a lot!
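For readers unfamiliar with approach 1, here is a minimal sketch of the Jaccard index computed on the word sets of two product titles. The titles and the helper names `tokens` and `jaccard_index` are made up for illustration; the thread does not show the poster's actual features or code.

```python
def jaccard_index(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(a | b)


def tokens(title: str) -> set:
    """Lowercased, whitespace-separated words of a product title."""
    return set(title.lower().split())


# Hypothetical product titles, for illustration only.
t1 = "Apple iPhone 5 case black"
t2 = "Black case for Apple iPhone 5"
t3 = "Stainless steel kitchen knife"

sim_12 = jaccard_index(tokens(t1), tokens(t2))  # same product, different word order
sim_13 = jaccard_index(tokens(t1), tokens(t3))  # unrelated products
```

Word order does not matter because the titles are treated as sets, which is one reason this measure is a natural first try for short product names.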
c****t posts: 19049	2 I don't quite follow. What kind of NN? Does your boss want to do deep learning?
c***z posts: 6348	3 Yes, my boss wants to do deep learning, and wants it done in one week...
c****t posts: 19049	4 Most of the ready-made code seems to be in MATLAB. Either port it to Python yourself, or use the one from LISA Lab. I think the LISA Lab code is for Boltzmann machines and has no Bayesian networks. It is all API programming, so there is nothing to be afraid of.
c***z posts: 6348	5 Thanks a lot for the information! I will check it out and keep you updated.
c***z posts: 6348	6 Some updates: clustering didn't work well. I knew that k-means wouldn't work, since it is built around Euclidean distance and cluster means: the mean of a group of items is not well defined under Jaccard distance, so the usual convergence argument does not carry over. I tried hierarchical agglomerative clustering and it didn't work well either. I believe the reason is feature selection: I should have used trigrams and the like instead of words, as trivial words led to mis-clustering. I am working on trigrams as well as the NN, and will keep updating here. Thanks a lot, casact and everyone!
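The combination mentioned in this post (trigram features, Jaccard distance, hierarchical agglomerative clustering) can be sketched in plain Python as follows. Several choices here are assumptions rather than details from the thread: character trigrams (the poster may have meant word trigrams), average linkage, the 0.8 stopping threshold, and the example titles.

```python
from itertools import combinations


def char_trigrams(text: str) -> set:
    """Set of character trigrams of the lowercased, whitespace-normalized text."""
    s = " ".join(text.lower().split())
    return {s[i:i + 3] for i in range(len(s) - 2)}


def jaccard_distance(a: set, b: set) -> float:
    """1 minus the Jaccard index; 0.0 for two empty sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def agglomerative(items, threshold):
    """Average-linkage agglomerative clustering over Jaccard distances.

    Repeatedly merges the two closest clusters and stops once the closest
    pair is farther apart than `threshold`. Returns lists of item indices.
    """
    feats = [char_trigrams(x) for x in items]
    dist = {
        (i, j): jaccard_distance(feats[i], feats[j])
        for i, j in combinations(range(len(items)), 2)
    }
    clusters = [[i] for i in range(len(items))]

    def linkage(c1, c2):
        # Average pairwise distance between the members of two clusters.
        pairs = [(min(i, j), max(i, j)) for i in c1 for j in c2]
        return sum(dist[p] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        best_pair, best_dist = min(
            (((a, b), linkage(clusters[a], clusters[b]))
             for a, b in combinations(range(len(clusters)), 2)),
            key=lambda pair_and_dist: pair_and_dist[1],
        )
        if best_dist > threshold:
            break
        a, b = best_pair  # a < b, so deleting index b leaves index a valid
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical product titles, for illustration only.
titles = [
    "iphone 5 case",
    "iphone 5s case black",
    "kitchen knife set",
    "steel kitchen knife",
]
clusters = agglomerative(titles, threshold=0.8)
```

This brute-force version recomputes every linkage on each pass, so it is only suitable for small samples; a library implementation would be the practical choice for a real catalog.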
l*******m posts: 1096	7 Is this problem supervised or unsupervised?
c***z posts: 6348	8 It is unsupervised. Even though we have Y labels, there is no easy way to check accuracy, if I understand correctly. But I could be wrong, and I would be very glad to be corrected! :)
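When labels exist for at least part of the data, one common sanity check for a clustering is purity: the fraction of items that carry the majority label of their cluster. The sketch below is a general illustration, not something the poster reports using, and the cluster ids and labels are invented.

```python
from collections import Counter


def purity(cluster_ids, labels):
    """Fraction of items whose label matches the majority label of their cluster."""
    by_cluster = {}
    for cluster, label in zip(cluster_ids, labels):
        by_cluster.setdefault(cluster, []).append(label)
    majority_total = sum(
        Counter(members).most_common(1)[0][1] for members in by_cluster.values()
    )
    return majority_total / len(labels)


# Hypothetical cluster assignments and known labels, for illustration only.
cluster_ids = [0, 0, 0, 1, 1, 1]
labels = ["phone", "phone", "case", "knife", "knife", "knife"]
score = purity(cluster_ids, labels)
```

Purity rises trivially as the number of clusters grows (one item per cluster gives 1.0), so it is only meaningful when comparing runs with a similar number of clusters.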
c***z posts: 6348	9 Some updates: the requirement has been changed to matching NPD categories, so it became a supervised learning problem. I used a hand-coded dictionary of keywords and hand-labeled items. The model was a decision tree with two stages: first filter out irrelevant items (90% accuracy), then assign labels (82% accuracy). The good thing is that this can be iterative: we can improve the dictionary using the confusion matrix, and then repeat until high accuracy is achieved. Thanks a lot, guys!
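The two-stage flow and the confusion-matrix loop described in this post can be sketched roughly as below. To keep it short, plain keyword lookup stands in for the trained decision tree, and the dictionary, category names, and labeled items are all hypothetical (they are not NPD's categories or the poster's data).

```python
from collections import Counter

# Hypothetical hand-built dictionary: category -> keywords.
KEYWORDS = {
    "phone_accessories": {"case", "charger", "screen"},
    "kitchen": {"knife", "pan", "blender"},
}


def classify(title):
    """Stage 1: return None (irrelevant) if no dictionary keyword appears.
    Stage 2: otherwise return the category with the most keyword hits."""
    words = set(title.lower().split())
    hits = {cat: len(words & kws) for cat, kws in KEYWORDS.items()}
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else None


def confusion_matrix(pairs):
    """Count (true label, predicted label) pairs."""
    return Counter(pairs)


# Hypothetical hand-labeled items; None marks an irrelevant item.
labeled = [
    ("iphone 5 case black", "phone_accessories"),
    ("usb wall charger", "phone_accessories"),
    ("steel kitchen knife", "kitchen"),
    ("phone cover leather", "phone_accessories"),  # "cover" is not in the dictionary
    ("garden hose 50 ft", None),
]

matrix = confusion_matrix((truth, classify(title)) for title, truth in labeled)
accuracy = sum(n for (truth, pred), n in matrix.items() if truth == pred) / len(labeled)
```

Off-diagonal cells such as `("phone_accessories", None)` point at the items to inspect; here that would suggest adding "cover" to the dictionary before the next iteration.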
