Understanding Training against Q as Knowledge Distillation.

Post by AdminX »

Interesting write-up on the LC0 blog:

Knowledge Distillation (KD) is a technique where there are two neural networks at play: a teacher network and a student network. The teacher network is usually a fixed, fully trained network, perhaps bigger than the student network. Through KD, the goal is usually to produce a student network smaller than the teacher -- which allows for faster inference -- while still encoding the same "knowledge" within the network; the teacher teaches its knowledge to the student. When training the student network, instead of training with the dataset labels as targets (in our case the policy distribution and the value output), the student is trained to match the outputs of the teacher.


http://blog.lczero.org/2018/10/understa ... .html#more
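
For anyone curious about the mechanics, here is a minimal PyTorch sketch of a distillation-style loss in the spirit of the quoted passage. All names, shapes, and the value_weight knob are my own illustration, not LC0's actual training code; it just shows the student being trained against the teacher's policy and value outputs instead of the dataset labels.

```python
# Hypothetical sketch of a knowledge-distillation loss, assuming LC0-style
# network heads: a policy distribution over moves and a scalar value output.
# In a real setup the teacher outputs would come from teacher(inputs) under
# torch.no_grad(); random tensors stand in for them here.
import torch
import torch.nn.functional as F

def distillation_loss(student_policy_logits, student_value,
                      teacher_policy, teacher_value, value_weight=1.0):
    """Train the student to match the teacher's outputs instead of dataset labels.

    student_policy_logits: (batch, num_moves) raw logits from the student.
    student_value:         (batch,) scalar value prediction from the student.
    teacher_policy:        (batch, num_moves) probability distribution from the teacher.
    teacher_value:         (batch,) scalar value output from the teacher.
    """
    # Cross-entropy between the teacher's soft policy targets and the
    # student's predicted distribution (the usual KD policy term).
    policy_loss = -(teacher_policy
                    * F.log_softmax(student_policy_logits, dim=1)).sum(dim=1).mean()
    # Mean-squared error on the value head: the student's value chases the teacher's.
    value_loss = F.mse_loss(student_value, teacher_value)
    return policy_loss + value_weight * value_loss

# Toy usage with random tensors standing in for real network outputs.
if __name__ == "__main__":
    batch, num_moves = 4, 1858  # 1858 is the size of LC0's policy head
    teacher_policy = F.softmax(torch.randn(batch, num_moves), dim=1)
    teacher_value = torch.rand(batch) * 2 - 1          # values in [-1, 1]
    student_logits = torch.randn(batch, num_moves, requires_grad=True)
    student_value = torch.zeros(batch, requires_grad=True)
    loss = distillation_loss(student_logits, student_value,
                             teacher_policy, teacher_value)
    loss.backward()
    print(float(loss))
```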
"Good decisions come from experience, and experience comes from bad decisions."
__________________________________________________________________
Ted Summers