I read through the Nature article but couldn't figure out how this Go program judges the downstream value of a move.
The paper says:
We use a reward function r(s) that is zero for all non-terminal time steps t < T. The outcome z_t = ±r(s_T) is the terminal reward at the end of the game from the perspective of the current player at time step t: +1 for winning and −1 for losing. Weights are then updated at each time step t by stochastic gradient ascent in the direction that maximizes expected outcome.
So it reads to me as: tree branches that never reach a final result contribute 0 to a move's value, while branches that do reach a result contribute +1 or −1. That way of judging value doesn't feel very reliable. Aren't the positions where the result can actually be computed all lopsided wins or losses? I suspect I'm misreading this...
c*****t (posts: 10738)
Reply 2
The passage you quote isn't about training the value network at all; it describes training the RL policy network. You just have the computer play itself for a huge number of games, and then, based on the final win or loss, give every earlier move choice a weight of +1 or −1.
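A minimal sketch of that self-play plus outcome-weighting loop, using a toy linear-softmax policy and a fake random game in place of real Go self-play (everything here, including the names play_self_play_game and reinforce_update, is my own illustration, not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the RL policy network: a linear softmax policy over
# BOARD_MOVES candidate moves (AlphaGo's real policy is a deep conv net).
BOARD_MOVES = 9
N_FEATURES = 16
theta = np.zeros((N_FEATURES, BOARD_MOVES))

def policy(state):
    """Move probabilities p_rho(a | s) for a feature vector `state`."""
    logits = state @ theta
    logits -= logits.max()              # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def play_self_play_game(max_moves=30):
    """Fake self-play rollout: records the (state, action) pairs visited and
    returns a terminal outcome z in {+1, -1}.  A real implementation would
    play an actual game of Go against a frozen opponent policy and score it."""
    trajectory = []
    for _ in range(max_moves):
        s = rng.normal(size=N_FEATURES)            # fake board features
        a = rng.choice(BOARD_MOVES, p=policy(s))   # sample a move from the policy
        trajectory.append((s, a))
    z = 1.0 if rng.random() < 0.5 else -1.0        # fake win/loss; real z = +/- r(s_T)
    return trajectory, z

def reinforce_update(trajectory, z, lr=0.01):
    """REINFORCE step: push up log p_rho(a_t | s_t) for every move of a won
    game and push it down for every move of a lost game.  The reward is zero
    before the end of the game, so the only learning signal is the terminal +/-1."""
    global theta
    for s, a in trajectory:
        probs = policy(s)
        grad_logits = -probs            # d log softmax / d logits = onehot(a) - probs
        grad_logits[a] += 1.0
        theta += lr * z * np.outer(s, grad_logits)   # gradient ascent

for game in range(100):
    traj, z = play_self_play_game()
    reinforce_update(traj, z)
```

One detail the sketch glosses over: per the quoted passage, z_t is always from the perspective of the player to move at step t, so in the real setup its sign flips between the two players within the same game.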