AndrewGrant wrote: ↑Thu Nov 19, 2020 8:57 am
So I wrote an NNTraining tool in C, made it very well threaded, cranked up optimizations, wrote AVX2 code for portions to perform weight updates faster, came up with a novel addition to Adam to limit throughput when dealing with sparse inputs by batching in fixed sizes, and refusing to shuffle intra-batch, only inter-batch.

Novel addition to Adam is worth a write-up/paper in itself if you are interested in that.
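For readers following along, the "shuffle inter-batch only" part of the quote can be sketched roughly as below. This is only an illustration under my own assumptions, not AndrewGrant's trainer: the names `rng_next`, `shuffle_batch_order` and `batch_start` are hypothetical, the dataset is assumed to be cut into fixed-size batches once, and the Adam modification itself is not shown because the post does not describe it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Small xorshift64 PRNG so the sketch has no external dependencies.
   The state must be non-zero. (Hypothetical helper, not from any engine.) */
static uint64_t rng_next(uint64_t *state) {
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *state = x;
}

/* Fill order[0..n_batches-1] with a random permutation of batch indices
   (Fisher-Yates). Only the order in which batches are visited changes;
   the samples inside each batch keep their original positions. */
void shuffle_batch_order(size_t *order, size_t n_batches, uint64_t *rng_state) {
    assert(*rng_state != 0);
    for (size_t i = 0; i < n_batches; i++)
        order[i] = i;
    for (size_t i = n_batches; i > 1; i--) {
        size_t j = (size_t)(rng_next(rng_state) % i);
        size_t tmp = order[i - 1];
        order[i - 1] = order[j];
        order[j] = tmp;
    }
}

/* Index of the first sample of the k-th batch visited this epoch. */
size_t batch_start(const size_t *order, size_t k, size_t batch_size) {
    assert(batch_size > 0);
    return order[k] * batch_size;
}
```

Calling `shuffle_batch_order` once per epoch and then walking `batch_start(order, k, batch_size)` for each `k` visits every batch exactly once in a fresh order, while each batch always contains the same samples.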
AndrewGrant wrote: ↑Thu Nov 19, 2020 8:57 am
One thing that has made me quite upset is that there appears to be very little literature online discussing the nuances of these things, and how to actually implement such things into C. It looks like looking at CFish is the best crash-course you could get, but I'm not too keen since I'm not a fan of the programming style, and I also don't want to just duplicate that effort. I wrote the trainer with no direct contact with the SF code, so I'd like to see if I could do the implementation on my own as well.

There is literature on this; it's just pretty specialized. See https://engineering.fb.com/2018/11/07/m ... ns/fbgemm/ (and the open source implementation), as well as this paper https://arxiv.org/abs/1712.05877 for the PyTorch methodology.
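To give a flavour of what that literature covers, here is a rough C sketch of the affine scheme from the arXiv paper linked above, where a real value r is approximated as r ≈ S(q − Z) with a float scale S and an integer zero point Z. It is a minimal sketch under my own assumptions (8-bit unsigned values, per-tensor parameters); the names `QParams`, `choose_qparams`, `quantize` and `dequantize` are hypothetical and are not the FBGEMM, PyTorch or SF API.

```c
#include <assert.h>
#include <stdint.h>

/* Per-tensor quantization parameters: r ~= scale * (q - zero_point). */
typedef struct {
    float scale;
    int32_t zero_point;
} QParams;

/* Round to nearest, halves away from zero, without needing libm. */
static int32_t round_to_i32(float x) {
    return (int32_t)(x >= 0.0f ? x + 0.5f : x - 0.5f);
}

static int32_t clamp_i32(int32_t v, int32_t lo, int32_t hi) {
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Pick scale and zero point so [rmin, rmax] maps onto [0, 255].
   The range is widened to contain 0 so that 0.0 is exactly representable. */
QParams choose_qparams(float rmin, float rmax) {
    assert(rmin <= rmax);
    if (rmin > 0.0f) rmin = 0.0f;
    if (rmax < 0.0f) rmax = 0.0f;

    QParams p;
    p.scale = (rmax - rmin) / 255.0f;
    if (p.scale == 0.0f) p.scale = 1.0f;   /* degenerate all-zero range */
    p.zero_point = clamp_i32(round_to_i32(-rmin / p.scale), 0, 255);
    return p;
}

/* Real value -> 8-bit code, saturating at the ends of the range. */
uint8_t quantize(float r, QParams p) {
    int32_t q = round_to_i32(r / p.scale) + p.zero_point;
    return (uint8_t)clamp_i32(q, 0, 255);
}

/* 8-bit code -> approximate real value. */
float dequantize(uint8_t q, QParams p) {
    return p.scale * (float)((int32_t)q - p.zero_point);
}
```

With `QParams p = choose_qparams(-1.0f, 1.0f);`, the round trip `dequantize(quantize(x, p), p)` recovers any `x` inside the range to within about half a scale step, and values outside the range saturate to code 0 or 255.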
The SF quantization approach is unusual, agreed. It's not anywhere close to theoretically optimal, so there is a ton of room to improve/experiment here.