"How do we know that this correlates to gender (or testosterone) and not age, height, weight, eye color, or left-handedness?"
There are two options here. One can either control for this in the training data provided, or, perhaps more efficiently and more interestingly, test for it afterwards using test sets crafted for the different variables.
"Not everything that seems to correlate is a cause-effect."
Certainly not... But Uri's original question essentially boils down to this: could something be created that, when fed a random PGN and told which side to look at, can determine whether a man or a woman played it with better than coin-flip accuracy? *Note: his example is particularly easy to compare against a coin flip because the test set is uniformly distributed.
Now, better than 50% is a pretty low bar to cross... And you are quite correct that at the lower end of the spectrum, it is far from impossible that the model is just picking up on some random criterion. But any researcher worth their salt would use a very large data set for training and try to control for any obvious biases (the percentage of games played being the most obvious one, for example).
So the closer its accuracy is to a coin flip, the more concerned I would be. But I would guess one could get significantly better than 50% accuracy... that is just a guess though.
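
To put a number on "better than coin-flip", here is a minimal sketch of an exact one-sided binomial test, assuming a balanced test set (half the games by men, half by women) so that pure guessing is right 50% of the time. The function name and the counts in the usage lines are hypothetical, not results from any real classifier.

```python
from math import comb


def p_value_vs_coin_flip(correct: int, total: int) -> float:
    """Probability of getting at least `correct` right out of `total`
    by fair coin flips alone (exact one-sided binomial tail).

    Assumes a balanced test set, so chance accuracy is exactly 50%.
    """
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")
    # Count the outcomes with at least `correct` hits, then divide by
    # the 2**total equally likely outcomes. Integer arithmetic keeps
    # this exact even for large test sets.
    tail = sum(comb(total, k) for k in range(correct, total + 1))
    return tail / 2**total


# Hypothetical counts: the same 55% accuracy on a small and a large test set.
print(p_value_vs_coin_flip(55, 100))
print(p_value_vs_coin_flip(5500, 10000))
```

With these made-up counts, 55 out of 100 is something a coin would manage fairly often (around 18% of the time), while 5,500 out of 10,000 is essentially never luck. That only rules out chance, though. It says nothing about whether the signal is gender or a confounder such as age or number of games played, which is what the crafted test sets would be for.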