DataSciences board - Random forests on imbalanced data
c***z
Posts: 6348
1
Recently I used RF on imbalanced data (10% positive, 90% negative) and
played with several tricks. Below is a comparison of the results. We are most
concerned about false negatives.
Any comments and suggestions are very welcome!
1. vanilla version:
> randomForest(Relevant ~ ., data = train, ntree = 1000)
#         prediction_1a
# actual  FALSE  TRUE
#   FALSE 22667    83
#   TRUE    523  1723
acc = 0.9757561
2. lower threshold (predict TRUE if prob.TRUE > 0):
#         prediction_1d
# actual  FALSE  TRUE
#   FALSE 20263  2487
#   TRUE    156  2090
acc = 0.8942631
3. undersample (aka balanced RF, resample so that T/F is about 1:1 in
training set):
> randomForest(Relevant ~ ., data = train, ntree = 1000, strata = train$Relevant, sampsize = rep(sum(train$Relevant == TRUE), 2))
#         prediction_1e
# actual  FALSE  TRUE
#   FALSE 21434  1316
#   TRUE    221  2025
acc = 0.9385102
4. cost sensitive learning (weighted RF)
> randomForest(Relevant ~ ., data = train, ntree = 1000, classwt = c(p.TRUE, p.FALSE))
#         prediction_1g
# actual  FALSE  TRUE
#   FALSE 18407  4417
#   TRUE     83  2089
acc = 0.8199712
There is still room for improvement, since I am still fine-tuning the a-priori
weights.
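Since false negatives matter most here, accuracy alone hides the trade-off between the four variants. A minimal base-R sketch that recomputes recall, false-negative rate, and precision from the confusion matrices above. The helper name `summarize_cm` is made up for illustration; the counts are copied from the tables in this post.

```r
# Confusion matrices copied from the four runs above
# (rows = actual, columns = predicted; order FALSE, TRUE).
cm <- list(
  vanilla     = matrix(c(22667,   83, 523, 1723), nrow = 2, byrow = TRUE),
  threshold   = matrix(c(20263, 2487, 156, 2090), nrow = 2, byrow = TRUE),
  undersample = matrix(c(21434, 1316, 221, 2025), nrow = 2, byrow = TRUE),
  weighted    = matrix(c(18407, 4417,  83, 2089), nrow = 2, byrow = TRUE)
)

# Hypothetical helper: summarize one 2x2 confusion matrix.
summarize_cm <- function(m) {
  tn <- m[1, 1]; fp <- m[1, 2]; fn <- m[2, 1]; tp <- m[2, 2]
  c(accuracy  = (tp + tn) / sum(m),
    recall    = tp / (tp + fn),   # share of actual positives that were caught
    fnr       = fn / (tp + fn),   # share of actual positives that were missed
    precision = tp / (tp + fp))   # share of predicted positives that are real
}

# One row per variant, one column per metric.
metrics <- t(sapply(cm, summarize_cm))
print(round(metrics, 4))
```

On these counts the vanilla forest has the highest accuracy yet misses 523 of 2246 positives, while the weighted forest misses 83 of 2172 at the cost of 4417 false positives.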
A*****n
Posts: 243
2
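In this situation, comparing accuracy is probably not meaningful. Take a look at the
ROC curves for your four cases; they may well all look alike.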
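A rough sketch of how such an ROC comparison could be done in base R, without extra packages. The function names `roc_points` and `auc_trapezoid` are hypothetical, and the scores here are simulated toy data, not the data from this thread. With a fitted forest, the scores would presumably come from something like `predict(fit, newdata = test, type = "prob")[, "TRUE"]`, assuming `Relevant` is a factor with levels FALSE and TRUE.

```r
# Hypothetical helper: ROC points from numeric scores and logical labels.
roc_points <- function(score, label) {
  cutoffs <- sort(unique(score), decreasing = TRUE)
  # Predict TRUE when score >= cutoff, then measure both rates.
  tpr <- sapply(cutoffs, function(k) mean(score[label] >= k))
  fpr <- sapply(cutoffs, function(k) mean(score[!label] >= k))
  # Prepend the (0, 0) corner, where nothing is predicted TRUE.
  data.frame(cutoff = c(Inf, cutoffs), fpr = c(0, fpr), tpr = c(0, tpr))
}

# Hypothetical helper: area under the ROC curve by the trapezoidal rule.
auc_trapezoid <- function(roc) {
  sum(diff(roc$fpr) * (head(roc$tpr, -1) + tail(roc$tpr, -1)) / 2)
}

# Toy data (an assumption): roughly 10% positives, with positives
# tending to receive higher scores than negatives.
set.seed(1)
label <- runif(2000) < 0.10
score <- ifelse(label, rbeta(2000, 4, 2), rbeta(2000, 2, 4))

roc <- roc_points(score, label)
auc <- auc_trapezoid(roc)
print(auc)
```

Variants 1 and 2 above use the same forest and differ only in the cutoff, so they are two points on one ROC curve. That is one reason the curves may look more alike than the accuracy figures suggest.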
y**3
Posts: 267
3
What is acc? Is it AIC or AICc?
If we don't compare ACC, what should we compare instead? Please advise.
y**3
Posts: 267
4
Just figured it out: ACC is accuracy.
w*****a
Posts: 218
5
This is the right answer.
If necessary, use a log scale for the x-axis, or for the y-axis as well.

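[Quoting A*****n's post:]
: In this situation, comparing accuracy is probably not meaningful. Take a look at the
: ROC curves for your four cases; they may well all look alike.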
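A small sketch of the log-axis suggestion, using base R graphics. The ROC points below are made-up numbers for illustration only. A log scale cannot show a false positive rate of exactly 0, so that point is dropped before plotting.

```r
# Toy ROC points (hypothetical numbers, not from this thread).
fpr <- c(0, 0.0001, 0.001, 0.01, 0.1, 0.5, 1)
tpr <- c(0, 0.20,   0.45,  0.70, 0.90, 0.98, 1)

# A log axis cannot display fpr = 0, so keep only the positive values.
keep <- fpr > 0

# log = "x" puts only the x-axis on a log scale; log = "xy" would do both.
plot(fpr[keep], tpr[keep], type = "b", log = "x",
     xlab = "False positive rate (log scale)",
     ylab = "True positive rate")
```

The log x-axis stretches out the low false-positive region, which is where curves for rare-positive problems tend to differ.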

g*********n
Posts: 119
6
Try AdaBoost. It may give you a better result. I worked on a much more
imbalanced data set (positive rate about 1e-5), and AdaBoost performed better
than RF.