c***z (posts: 6348) | 1
Recently I used RF on an imbalanced data set (10% positive, 90% negative) and played with several tricks. Below is a comparison of the results. We care most about false negatives.
Any comments and suggestions are extremely welcome!
1. Vanilla version (see after the table for a sketch of how the matrix and acc might be computed):
> randomForest(Relevant ~ ., data = train, ntree = 1000)
#         prediction_1a
# actual  FALSE  TRUE
#   FALSE 22667    83
#   TRUE    523  1723
acc = 0.9757561
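For reference, a minimal sketch of how such a confusion matrix and accuracy could be computed; the post does not show this step, and the test-set object name test is an assumption:
> library(randomForest)
> fit <- randomForest(Relevant ~ ., data = train, ntree = 1000)
> prediction_1a <- predict(fit, newdata = test)        # hard class labels
> tab <- table(actual = test$Relevant, prediction_1a)  # rows = actual, cols = predicted
> acc <- sum(diag(tab)) / sum(tab)                     # overall accuracy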
2. Lower the decision threshold (predict TRUE if prob.TRUE > 0, i.e. whenever any tree votes TRUE; a sketch of this step follows the table):
#         prediction_1d
# actual  FALSE  TRUE
#   FALSE 20263  2487
#   TRUE    156  2090
acc = 0.8942631
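A minimal sketch of this thresholding step, which the post does not show (the object names are assumptions):
> probs <- predict(fit, newdata = test, type = "prob")  # per-class vote fractions
> prediction_1d <- probs[, "TRUE"] > 0                  # TRUE if any tree votes TRUE
> table(actual = test$Relevant, prediction_1d)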
3. Undersample (aka balanced RF; resample so that TRUE/FALSE is about 1:1 in the training set):
> randomForest(Relevant ~ ., data = train, ntree = 1000,
+     strata = train$Relevant, sampsize = rep(sum(train$Relevant == TRUE), 2))
#         prediction_1e
# actual  FALSE  TRUE
#   FALSE 21434  1316
#   TRUE    221  2025
acc = 0.9385102
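(Note on the call above: with strata = train$Relevant, the sampsize vector gives the number of cases drawn from each stratum for every tree, so rep(sum(train$Relevant == TRUE), 2) makes each tree see a roughly 1:1 class mix.)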
4. Cost-sensitive learning (weighted RF; see after the table for a guess at how p.TRUE and p.FALSE are defined):
> randomForest(Relevant ~ ., data = train, ntree = 1000,
+     classwt = c(p.TRUE, p.FALSE))
#         prediction_1g
# actual  FALSE  TRUE
#   FALSE 18407  4417
#   TRUE     83  2089
acc = 0.8199712
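The post does not show how p.TRUE and p.FALSE are defined; a plausible reading, given the mention of a priori weights below, is the empirical class priors assigned inversely so the minority class gets the larger weight:
> p.TRUE  <- mean(train$Relevant == TRUE)   # assumed: ~0.10 here
> p.FALSE <- mean(train$Relevant == FALSE)  # assumed: ~0.90 here
> # classwt follows the factor-level order (FALSE, TRUE), so c(p.TRUE, p.FALSE)
> # gives the FALSE class the small weight and the TRUE class the large one.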
There is still room for improvement, since I am still fine-tuning the a priori weights.

A*****n (posts: 243) | 2
In this situation comparing ACC is probably meaningless. Take a look at the ROC curves for your four setups; they may well all look very similar.
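A minimal sketch of the suggested ROC/AUC comparison with the pROC package (the package choice is an assumption; probs is the vote-fraction matrix from the sketch under item 2):
> library(pROC)
> roc_obj <- roc(test$Relevant, probs[, "TRUE"])  # response, then predictor
> auc(roc_obj)   # threshold-free summary; repeat for each model and compare
> plot(roc_obj)  # the full ROC curve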

y**3 (posts: 267) | 3
What is acc? Is it AIC, or AICc? And if we are not supposed to compare ACC, what should we compare instead? Please advise.

y**3 (posts: 267) | 4
Just figured it out: ACC should be accuracy.

w*****a (posts: 218) | 5
This is the right answer. If necessary, use a LOG scale on the X-axis, or on the Y-axis as well.
【Quoting A*****n】: In this situation comparing ACC is probably meaningless. Take a look at the ROC curves for your four setups; they may well all look very similar.
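A minimal sketch of a log-scaled ROC plot in base R, building on the roc_obj assumed above (points with a false positive rate of 0 are dropped from the log axis):
> fpr <- 1 - roc_obj$specificities  # pROC stores specificities; FPR = 1 - spec
> tpr <- roc_obj$sensitivities
> plot(fpr, tpr, type = "l", log = "x",
+      xlab = "False positive rate (log scale)", ylab = "True positive rate")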

g*********n (posts: 119) | 6
Try AdaBoost. It may give you a better result. I worked on a much more imbalanced data set (positive rate about 1e-5), and AdaBoost performed better than RF.
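A minimal sketch of trying AdaBoost on the same data, using the adabag package (one of several AdaBoost implementations in R; the package choice and parameters are assumptions, not the poster's code):
> library(adabag)
> train$Relevant <- as.factor(train$Relevant)  # boosting() needs a factor response
> fit_ada <- boosting(Relevant ~ ., data = train, mfinal = 100)
> pred_ada <- predict(fit_ada, newdata = test)
> pred_ada$confusion  # confusion matrix on the test set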