Simplest would be making the network layout smaller, as Connor suggested.
I also found a decent speedup going from float to int16 (or even int8, though I never tried int8 myself). However, the limited value range means more things to consider (overflow in particular) and more possible bugs.
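As a rough illustration of what the int16 route involves, here's a minimal C++ sketch. The scale factor, the names, and the clamping policy are my own assumptions for the example, not from any particular engine:

```cpp
#include <cstdint>
#include <cmath>
#include <vector>

constexpr int SCALE = 64;  // hypothetical fixed-point scale: 1.0f maps to 64

// Quantize trained float weights once, at load time.
int16_t quantize(float w) {
    float q = std::round(w * SCALE);
    // Clamp so an unusually large trained weight can't wrap around int16.
    if (q >  32767.0f) q =  32767.0f;
    if (q < -32768.0f) q = -32768.0f;
    return static_cast<int16_t>(q);
}

// Dot product: accumulate in int32 so the running sum can't overflow int16.
int32_t dot(const std::vector<int16_t>& w, const std::vector<int16_t>& x) {
    int32_t sum = 0;
    for (size_t i = 0; i < w.size(); ++i)
        sum += int32_t(w[i]) * int32_t(x[i]);
    return sum;  // carries SCALE*SCALE; rescale before feeding the next layer
}
```

The overflow comment is exactly the kind of extra thing the integer route makes you think about: the products fit, but a naive int16 accumulator would not.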
So the main thing I would suggest, if you're not already doing it: assuming the input layer is only 1s and 0s, convert directly from the board to the first-layer values when you evaluate. Copy the biases into the output (memcpy rather than memset, since the biases aren't a constant byte), then add in the weight column corresponding to each "1" input on the board. That's as opposed to setting the inputs and looping over every single one doing a multiply-add. In 8x8 checkers this alone was pretty much on equal terms with incremental updates, although in chess incremental updates are a significant enough optimization that they should be on a ToDo list.
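Here's a minimal C++ sketch of that board-to-first-layer conversion, assuming binary inputs and a weights layout with one column per input feature; the sizes and names are illustrative, not from any specific engine:

```cpp
#include <cstdint>
#include <cstring>

constexpr int INPUTS = 64;   // e.g. one feature per square (assumption)
constexpr int HIDDEN = 128;  // first-layer width (assumption)

int16_t biases[HIDDEN];
int16_t weights[INPUTS][HIDDEN];  // weights[i] = the column for input feature i

// active[] holds the indices of the inputs that are 1; count is how many.
void first_layer(const int* active, int count, int16_t out[HIDDEN]) {
    // Start from the biases instead of zero: a straight copy, no arithmetic.
    std::memcpy(out, biases, sizeof(biases));
    // Add one weight column per set input. The 0 inputs contribute nothing,
    // so the full O(INPUTS * HIDDEN) multiply-add loop disappears entirely.
    for (int k = 0; k < count; ++k) {
        const int16_t* col = weights[active[k]];
        for (int j = 0; j < HIDDEN; ++j)
            out[j] += col[j];
    }
}
```

With few pieces on the board, count is small, which is why in checkers this gets you most of the way to what incremental updates buy you.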