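l******n Posts: 9344 | 1 In the regression training data set, there is a categorical variable with only 3 levels.
The data that needs prediction contains one record whose value for this categorical variable is not among these 3
levels. How do I make a prediction for it?
Thanks | h***x Posts: 586 | 2 Two ways: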
1) The safest way is not to use this categorical variable at all. :-)
2) The other way is to build the model on the training dataset as it is, and if the
variable (its indicators) is significant, include it. When you apply the model to
the new record you mentioned, every indicator for this variable is 0, so it
contributes nothing to the predicted value.
just my 2 cents,
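A minimal sketch of method 2, assuming indicator (dummy) coding with "A" as the reference level. The level names, the covariate `x`, and all coefficient values are hypothetical and only illustrate why an unseen level ends up with all-zero indicators.

```python
# Hypothetical levels and coefficients, for illustration only.
TRAIN_LEVELS = ["A", "B", "C"]        # levels seen in training; "A" is the reference
INTERCEPT = 2.0                       # assumed fitted intercept
LEVEL_COEF = {"B": 0.5, "C": -1.0}    # assumed indicator coefficients (none for "A")
SLOPE_X = 3.0                         # assumed coefficient of a numeric covariate x


def encode(level):
    """Return the indicator vector for the non-reference levels."""
    return [1 if level == name else 0 for name in TRAIN_LEVELS[1:]]


def predict(level, x):
    """Linear prediction: intercept + indicator terms + slope * x."""
    indicators = encode(level)
    coefs = [LEVEL_COEF[name] for name in TRAIN_LEVELS[1:]]
    level_term = sum(i * c for i, c in zip(indicators, coefs))
    return INTERCEPT + level_term + SLOPE_X * x


print(encode("B"))        # [1, 0]
print(encode("D"))        # [0, 0]  (level "D" was never seen in training)
print(predict("D", 1.0))  # 5.0
```

With these assumed numbers the unseen level "D" gets the intercept plus the covariate term only, because none of its indicators fire.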
| l******n Posts: 9344 | 3 Thanks, huxxx
Both methods make sense.
| A*******s Posts: 3942 | 4 Is there a parameterization problem in the second one? I think it treats the
unknown category as the reference category. I'm not sure it is valid if there is no
intercept in the model.
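To make the parameterization concern concrete, here is a small sketch comparing two codings of the same fit: an intercept plus indicators for the non-reference levels, and a no-intercept model with one indicator per level. The per-level means are hypothetical numbers, not results from the poster's data.

```python
# Hypothetical per-level mean responses, for illustration only.
LEVEL_MEANS = {"A": 10.0, "B": 12.0, "C": 7.0}
LEVELS = sorted(LEVEL_MEANS)

# Coding 1: intercept (= mean of reference level "A") + indicators for B and C.
intercept = LEVEL_MEANS["A"]
treatment_coef = {lv: LEVEL_MEANS[lv] - intercept for lv in LEVELS[1:]}

# Coding 2: no intercept, one indicator per level (coefficient = level mean).
cell_coef = dict(LEVEL_MEANS)


def predict_with_intercept(level):
    """Unseen levels match no indicator, so only the intercept remains."""
    return intercept + sum((c for lv, c in treatment_coef.items() if lv == level), 0.0)


def predict_without_intercept(level):
    """Unseen levels match no indicator, so the level term vanishes entirely."""
    return sum((c for lv, c in cell_coef.items() if lv == level), 0.0)


for lv in ["A", "B", "C", "D"]:
    print(lv, predict_with_intercept(lv), predict_without_intercept(lv))
```

Both codings agree on the three training levels. For the unseen level "D", the first one returns the reference level's value (10.0 here) and the second returns 0.0, which matches none of the training levels.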
Not sure if this would work: 1st step, fit the model with that categorical
variable and the other covariates; 2nd step, fit the model without the
categorical one, keeping the coefficients of the other covariates fixed at
their step-1 estimates, in order to find the intercept estimate.
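The two-step idea can be sketched for the simplest case, assuming ordinary least squares with a single numeric covariate `x`. Under that assumption, the step-1 slope from a model with level indicators equals the within-level slope, and the step-2 intercept with the slope held fixed is the mean of `y - slope * x`. The data and function names below are made up for illustration.

```python
from collections import defaultdict

# Hypothetical training rows: (level, x, y).
rows = [
    ("A", 0.0, 1.0), ("A", 1.0, 3.0),
    ("B", 1.0, 6.0), ("B", 2.0, 8.0),
    ("C", 2.0, 11.0), ("C", 3.0, 13.0),
]


def within_level_slope(rows):
    """Step 1: OLS slope of x when each level has its own intercept."""
    groups = defaultdict(list)
    for level, x, y in rows:
        groups[level].append((x, y))
    num = den = 0.0
    for pairs in groups.values():
        mx = sum(x for x, _ in pairs) / len(pairs)
        my = sum(y for _, y in pairs) / len(pairs)
        num += sum((x - mx) * (y - my) for x, y in pairs)
        den += sum((x - mx) ** 2 for x, _ in pairs)
    return num / den


def pooled_intercept(rows, slope):
    """Step 2: drop the levels, hold the slope fixed, re-estimate the intercept."""
    return sum(y - slope * x for _, x, y in rows) / len(rows)


slope = within_level_slope(rows)
intercept = pooled_intercept(rows, slope)
print(slope, intercept)           # slope and pooled intercept
print(intercept + slope * 1.5)    # prediction for an unseen level at x = 1.5
```

The pooled intercept is an average over all training rows, so the unseen level is predicted at the overall level rather than at the reference level.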
Or, treat the categorical variable as a random effect. The two methods should
give very close results if the sample size in each category is large.
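A rough sketch of the random-effect idea, assuming a random-intercept model with no other covariates and with the variance components treated as known. `tau2` (between-level variance) and `sigma2` (residual variance) are assumed inputs here; in practice they would be estimated by a mixed-model routine. An unseen level gets a predicted random effect of 0, so its prediction is the overall mean.

```python
def fit_random_intercept(groups, tau2, sigma2):
    """groups maps level -> list of responses. Returns (mu, level_effects)."""
    stats = {lv: (len(ys), sum(ys) / len(ys)) for lv, ys in groups.items()}
    # GLS weight of each level mean: 1 / Var(level mean) = 1 / (tau2 + sigma2 / n).
    weights = {lv: 1.0 / (tau2 + sigma2 / n) for lv, (n, _) in stats.items()}
    mu = sum(weights[lv] * stats[lv][1] for lv in stats) / sum(weights.values())
    # Predicted level effects are shrunk toward 0; shrinkage weakens as n grows.
    effects = {
        lv: tau2 / (tau2 + sigma2 / n) * (ybar - mu)
        for lv, (n, ybar) in stats.items()
    }
    return mu, effects


def predict_level(level, mu, effects):
    """Levels not seen in training fall back to the overall mean."""
    return mu + effects.get(level, 0.0)


# Hypothetical data and variance components.
groups = {"A": [1.0, 3.0], "B": [4.0, 6.0], "C": [7.0, 9.0]}
mu, effects = fit_random_intercept(groups, tau2=1.0, sigma2=1.0)
print(mu, effects)
print(predict_level("D", mu, effects))
```

The shrinkage factor `tau2 / (tau2 + sigma2 / n)` approaches 1 as the per-level sample size `n` grows, which is one way to read the remark about large samples per category.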
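| h***x Posts: 586 | 5 Actually, I think the best approach is to first check the distribution of this variable in both datasets, to see whether the mismatch comes from a coding error.
If the values really are different, then this variable should not be used, and there is no need to test
the difference between including and excluding it.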
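A quick way to run that check, as a sketch: count the levels in each dataset and list the ones that appear in only one of them. The sample values and the `normalize` rule (trim whitespace, upper-case) are assumptions about what a coding error might look like, not something stated in the thread.

```python
from collections import Counter


def compare_levels(train_values, new_values):
    """Return level counts for both datasets and the levels unique to each."""
    train_counts = Counter(train_values)
    new_counts = Counter(new_values)
    unseen = sorted(set(new_counts) - set(train_counts))   # only in new data
    missing = sorted(set(train_counts) - set(new_counts))  # only in training data
    return train_counts, new_counts, unseen, missing


def normalize(value):
    """Assumed cleanup rule: strip whitespace and upper-case the code."""
    return value.strip().upper()


# Hypothetical values of the categorical variable.
train = ["A", "B", "C", "A"]
new = ["a ", "B", "D"]

print(compare_levels(train, new)[2])   # unseen levels in the raw values
cleaned = [normalize(v) for v in new]
print(compare_levels(train, cleaned)[2])   # unseen levels after cleanup
```

In this made-up case the raw comparison flags both "a " and "D", while the cleaned comparison flags only "D", so "a " was a coding difference and "D" is a genuinely new level.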
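As for the second method, it is the optimal solution based on the training dataset. At model deployment time
we do not know what the distribution of the data will be, but our assumption is that the data to be predicted has
a distribution similar to the training data. In this specific example the model scores will change a little, but the score ranking
will hardly change, so the final scoring results should be the same.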
I think you are right, we can treat the categorical variable as a random
effect ...