Fast counting and removal of duplicate sfens (compressed EPDs, typically 40 bytes each) for large NNUE data sets.
https://github.com/egh-s/Remove-Double-Sfens
Remove-Double-Sfens
-
- Posts: 7288
- Joined: Thu Aug 18, 2011 12:04 pm
- Full name: Ed Schröder
Remove-Double-Sfens
90% of coding is debugging, the other 10% is writing bugs.
-
- Posts: 1952
- Joined: Tue Apr 19, 2016 6:08 am
- Location: U.S.A
- Full name: Andrew Grant
Re: Remove-Double-Sfens
If we are linking useful tools:
My Byteshuf.c is a fixed-size binary file shuffler which implements the mechanism from TeraShuf. TeraShuf shuffles extremely large text files by writing to disk, to avoid needing to hold the entire contents in memory at once. It does this by reading chunks of the file at a time, shuffling each chunk, and saving it. After all chunks are read, shuffled, and saved, it recombines them by randomly taking the next entry from one of the files, chosen with a probability derived from the remaining number of entries in each file.
Utility: https://gist.github.com/AndyGrant/b3a70 ... 6381ee30c4
Unfortunately this only works in Linux environments, but it could be modified for Windows, as the only critical part is the file-system code that checks the contents of directories. The top portion of the file lets you define the data structure of the files you are shuffling.
Code: Select all
$ ./byteshuf.out --help
Usage: byteshuf.out [OPTION...]
  -v, --verbose        Produce verbose output
  -d, --directory=     Directory of files of unshuffled bytes
  -i, --input=         Source file of unshuffled bytes
  -o, --output=        Output file for shuffled bytes
  -n, --per-file=      Samples per output file (33554432)
  -s, --chunk-size=    Samples per interim file (33554432)
  -?, --help           Give this help list
      --usage          Give a short usage message
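For readers who want that recombination step spelled out, here is a minimal sketch of the weighted merge, assuming three pre-shuffled chunk files of fixed-size entries; the file names and the 40-byte entry size are illustrative, not byteshuf.c's actual values.
Code: Select all
#include <cstdint>
#include <fstream>
#include <random>
#include <string>
#include <vector>

// Sketch of the TeraShuf-style merge step: each pre-shuffled chunk file
// contributes its next entry with probability proportional to how many
// entries it has left, so the merged output is a uniform interleaving.
constexpr size_t ENTRY_SIZE = 40; // illustrative assumption

int main()
{
    std::vector<std::string> names = {"chunk0.bin", "chunk1.bin", "chunk2.bin"};
    std::vector<std::ifstream> chunks;
    std::vector<uint64_t> remaining;
    uint64_t total = 0;
    for (const auto &n : names)
    {
        chunks.emplace_back(n, std::ios::binary);
        chunks.back().seekg(0, std::ios::end);
        uint64_t cnt = (uint64_t)chunks.back().tellg() / ENTRY_SIZE;
        chunks.back().seekg(0, std::ios::beg);
        remaining.push_back(cnt);
        total += cnt;
    }
    std::ofstream out("shuffled.bin", std::ios::binary);
    std::mt19937_64 rng(std::random_device{}());
    std::vector<char> entry(ENTRY_SIZE);
    while (total > 0)
    {
        // draw a chunk index weighted by remaining entry counts
        std::uniform_int_distribution<uint64_t> dist(0, total - 1);
        uint64_t r = dist(rng);
        size_t idx = 0;
        while (r >= remaining[idx])
            r -= remaining[idx++];
        chunks[idx].read(entry.data(), ENTRY_SIZE);
        out.write(entry.data(), ENTRY_SIZE);
        --remaining[idx];
        --total;
    }
}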
-
- Posts: 7288
- Joined: Thu Aug 18, 2011 12:04 pm
- Full name: Ed Schröder
Re: Remove-Double-Sfens
Hi Andy,
Ah, shuffling, nice topic.
Unfortunately I could not compile your Linux code with VS, too many errors I don't understand. What a pity.
I am currently trying to improve our shuffler; it looks good in the sense that it's a lot faster. I do pre-shuffling first. By that I mean: I first create a 10-million-integer table with the numbers 0 to 9,999,999, shuffle it, and use this table as pointers for the sfens to shuffle. I read 10 million sfens into memory, do 10 million swaps, and store the shuffled sfens back to disk.
I noticed there are several shuffle techniques, which makes me wonder how many swaps they do. Can you tell me the number of swaps your Byteshuf util does? In my case it's just the number of sfens I read.
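As a minimal sketch of this pre-shuffle idea (under stated assumptions: 40-byte sfen records and illustrative file names; Ed's actual code may differ), the permutation table is built once and then used to reorder a block of sfens held in memory:
Code: Select all
#include <algorithm>
#include <cstdint>
#include <fstream>
#include <numeric>
#include <random>
#include <vector>

// Sketch of the pre-shuffle: a 10-million-entry index table, shuffled
// once, used to reorder a block of sfens read into memory.
constexpr size_t ENTRY_SIZE = 40;            // illustrative assumption
constexpr size_t BLOCK_ENTRIES = 10'000'000; // 10 million per block

int main()
{
    // index table 0..9,999,999, shuffled once up front
    std::vector<uint32_t> perm(BLOCK_ENTRIES);
    std::iota(perm.begin(), perm.end(), 0u);
    std::mt19937_64 rng(std::random_device{}());
    std::shuffle(perm.begin(), perm.end(), rng);

    std::ifstream in("sfens.bin", std::ios::binary);
    std::ofstream out("sfens_shuffled.bin", std::ios::binary);
    std::vector<char> block(BLOCK_ENTRIES * ENTRY_SIZE);

    // read one full block and emit its entries in permuted order;
    // this sketch assumes the file holds exactly BLOCK_ENTRIES sfens
    in.read(block.data(), (std::streamsize)block.size());
    for (size_t i = 0; i < BLOCK_ENTRIES; i++)
        out.write(&block[(size_t)perm[i] * ENTRY_SIZE], ENTRY_SIZE);
}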
90% of coding is debugging, the other 10% is writing bugs.
-
- Posts: 1952
- Joined: Tue Apr 19, 2016 6:08 am
- Location: U.S.A
- Full name: Andrew Grant
Re: Remove-Double-Sfens
I don't really touch Windows or Windows-associated tools, so that is expected.
Supposedly this "tera-shuf" mechanism I used is a mathematically valid shuffle. I don't know how I would quantify the number of swaps. If you started with 1 million positions and wanted to shuffle them with intermediate files of 100k....
Then there would be 100k swaps per file across the 10 intermediate files, making 1 million total. But then you piece those together, which you could argue is another swap of sorts. So depending on how pedantic you are, there are effectively 1 million swaps, but it might take up to 2 million steps depending on how you define a step.
-
- Posts: 7288
- Joined: Thu Aug 18, 2011 12:04 pm
- Full name: Ed Schröder
Re: Remove-Double-Sfens
Yep.
I must say that (so far) I am pretty satisfied with my "pre-shuffle" approach: it shuffles the 25B sfens of my current best net in 2 hours and 2 minutes, whereas the shuffler currently in use takes about 16 hours. I am testing it right now, and if it's any good I will put the source code on GitHub.
90% of coding is debugging, the other 10% is writing bugs.
-
- Posts: 2644
- Joined: Fri Nov 26, 2010 2:00 pm
- Location: Czech Republic
- Full name: Martin Sedlak
Re: Remove-Double-Sfens
very interesting, I was in need of something similar recently.
I went with a different approach, sort of similar to what you do, Ed:
split the input into parts of 50M-100M entries, where each part size is chosen randomly (the --min and --max arguments change this)
those parts form a plan, which is then shuffled to randomize the part positions in the output file
after that, the input file is read according to the plan; each chunk is loaded into memory, shuffled (Fisher-Yates) and written to the output file
seems fast enough and completely I/O bound (requires about 5 GB of memory with default settings)
(entry size can be set with the --entry n argument; in my case it's 38 bytes)
it's not a "full shuffle" but should be good enough; plus, since the blocks are of random size, repeating the process should converge to a full shuffle given enough iterations.
but I can't guesstimate how good it actually is
the source code is here if someone's interested, I hereby place it in the public domain (C++, should be portable):
I haven't tested it extensively, so there may be bugs
Code: Select all
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <cstdlib> // for strtol
#include <vector>
#include <string>
#include <random>
#include <chrono>
#include <utility>
#include <fstream>
#include <iostream>
// matches cheng's entry size
size_t entry_size = 38;
// min/max entries per chunk
size_t min_entries = 50'000'000;
size_t max_entries = 100'000'000;
std::string in_filename;
std::string out_filename;
struct Plan
{
uint64_t source_file_offset;
uint64_t num_entries;
};
std::vector<Plan> plan;
int parse_args(int argc, const char **argv)
{
int file_name_idx = 0;
for (int i=1; i<argc; i++)
{
std::string arg = argv[i];
// ugh... no starts_with for string
if (arg.length() > 2 && arg[0] == '-' && arg[1] == '-')
{
if (i+1 >= argc)
{
std::cerr << "no argument after " << arg << std::endl;
return 2;
}
// entry size
if (arg == "--entry")
{
entry_size = strtol(argv[i+1], nullptr, 10);
i++;
}
if (arg == "--min")
{
min_entries = strtol(argv[i+1], nullptr, 10);
i++;
}
if (arg == "--max")
{
max_entries = strtol(argv[i+1], nullptr, 10);
i++;
}
continue;
}
switch(file_name_idx)
{
case 0:
in_filename = arg;
break;
case 1:
out_filename = arg;
break;
default:
std::cerr << "invalid extra filename" << std::endl;
return 1;
}
++file_name_idx;
}
if (min_entries < 1 || max_entries <= min_entries)
{
std::cerr << "invalid min/max entries" << std::endl;
return 3;
}
if (entry_size < 1)
{
std::cerr << "invalid entry size" << std::endl;
return 3;
}
if (file_name_idx != 2)
{
std::cout << "usage: shuffler <infile> <outfile>" << std::endl;
std::cout << " --entry n entry size in bytes, default 38" << std::endl;
std::cout << " --min n minimum number of entries per batch, default 50 million" << std::endl;
std::cout << " --max n maximum number of entries per batch, default 100 million" << std::endl;
return -1;
}
return 0;
}
template<typename T, typename U>
void shuffle_vector(T &rng, std::vector<U> &vec)
{
if (vec.empty())
return;
// fisher-yates shuffle: i must run down to 1 and the swap index
// may equal i, otherwise the resulting permutation is biased
for (size_t i=vec.size()-1; i>0; i--)
{
auto swidx = rng() % (i+1);
std::swap(vec[i], vec[swidx]);
}
}
template<typename T>
void shuffle_buffer(T &rng, std::vector<uint8_t> &buffer, size_t count)
{
if (!count)
return;
std::vector<uint8_t> ebuf;
ebuf.resize(entry_size);
// fisher-yates shuffle (same indexing as shuffle_vector)
for (size_t i=count-1; i>0; i--)
{
auto swidx = rng() % (i+1);
// swap
memcpy(ebuf.data(), &buffer[i*entry_size], entry_size);
memcpy(&buffer[i*entry_size], &buffer[swidx*entry_size], entry_size);
memcpy(&buffer[swidx*entry_size], ebuf.data(), entry_size);
}
}
template<typename T>
void create_plan(T &rnd_engine, uint64_t in_size)
{
plan.clear();
uint64_t offset = 0;
while (in_size > 0)
{
// note: no uniform distribution but who cares
auto count = rnd_engine() % (max_entries - min_entries) + min_entries;
count *= entry_size;
if (count > in_size)
count = in_size;
plan.push_back({offset, count/entry_size});
offset += count;
in_size -= count;
}
// shuffle plan now
shuffle_vector(rnd_engine, plan);
std::cout << "plan: " << plan.size() << " chunks" << std::endl;
}
int shuffle_main()
{
std::ifstream f;
f.open(in_filename.c_str(), std::ios_base::binary | std::ios_base::in);
if (f.fail())
{
std::cerr << "cannot open " << in_filename << std::endl;
return 4;
}
f.seekg(0, std::ios_base::end);
auto in_size = f.tellg();
f.seekg(0, std::ios_base::beg);
std::mt19937_64 rnd_engine;
rnd_engine.seed(std::chrono::high_resolution_clock::now().time_since_epoch().count());
create_plan(rnd_engine, in_size);
std::vector<uint8_t> buffer;
buffer.resize(max_entries * entry_size);
std::ofstream fo;
fo.open(out_filename.c_str(), std::ios_base::binary | std::ios_base::trunc | std::ios_base::out);
if (fo.fail())
{
std::cerr << "cannot create " << out_filename << std::endl;
return 5;
}
for (size_t i=0; i<plan.size(); i++)
{
const auto &it = plan[i];
std::cout << "plan " << i+1 << " / " << plan.size() << std::endl;
f.seekg(it.source_file_offset, std::ios_base::beg);
f.read((char *)buffer.data(), it.num_entries * entry_size);
if (f.fail())
{
std::cerr << "cannot read " << in_filename << std::endl;
return 6;
}
shuffle_buffer(rnd_engine, buffer, it.num_entries);
fo.write((char *)buffer.data(), it.num_entries * entry_size);
if (fo.fail())
{
std::cerr << "cannot write " << out_filename << std::endl;
return 7;
}
}
std::cout << "all done" << std::endl;
return 0;
}
int main(int argc, const char **argv)
{
if (int res = parse_args(argc, argv))
return res;
return shuffle_main();
}
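For reference, the listing builds as a single file with any C++14 compiler (the digit separators need C++14); the source file name below is illustrative:
Code: Select all
g++ -O2 -std=c++14 -o shuffler shuffler.cpp
./shuffler --entry 38 input.bin output.bin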
-
- Posts: 2644
- Joined: Fri Nov 26, 2010 2:00 pm
- Location: Czech Republic
- Full name: Martin Sedlak
Re: Remove-Double-Sfens
ok, the default version doesn't produce good shuffles even with several iterations,
but I found out that a two-pass ping-pong shuffle should generate reasonably good output, quite close to a full shuffle
(using the average difference between each two consecutive shuffled elements, which a good shuffle maximizes, as a measure of "quality"):
the first pass is run with --min 1000 --max 2000 source temp
the second pass (backward) is run with the defaults (--min 50000000 --max 100000000) temp source
after that the source should be reasonably shuffled
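spelled out as commands (the binary name is illustrative), the ping-pong looks like this; note that the second pass writes back over the source:
Code: Select all
# pass 1: many small randomly-sized chunks, good global mix
./shuffler --min 1000 --max 2000 source temp
# pass 2: default large chunks (50-100 million entries), good local mix
./shuffler temp source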
an improved version of the program (which doesn't spam the output as much):
Code: Select all
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <cstdlib> // for strtol
#include <vector>
#include <string>
#include <random>
#include <chrono>
#include <utility>
#include <fstream>
#include <iostream>
// matches cheng's entry size
size_t entry_size = 38;
// min/max entries per chunk
size_t min_entries = 50'000'000;
size_t max_entries = 100'000'000;
std::string in_filename;
std::string out_filename;
struct Plan
{
uint64_t source_file_offset;
uint64_t num_entries;
};
std::vector<Plan> plan;
int parse_args(int argc, const char **argv)
{
int file_name_idx = 0;
for (int i=1; i<argc; i++)
{
std::string arg = argv[i];
// ugh... no starts_with for string
if (arg.length() > 2 && arg[0] == '-' && arg[1] == '-')
{
if (i+1 >= argc)
{
std::cerr << "no argument after " << arg << std::endl;
return 2;
}
// entry size
if (arg == "--entry")
{
entry_size = strtol(argv[i+1], nullptr, 10);
i++;
}
if (arg == "--min")
{
min_entries = strtol(argv[i+1], nullptr, 10);
i++;
}
if (arg == "--max")
{
max_entries = strtol(argv[i+1], nullptr, 10);
i++;
}
continue;
}
switch(file_name_idx)
{
case 0:
in_filename = arg;
break;
case 1:
out_filename = arg;
break;
default:
std::cerr << "invalid extra filename" << std::endl;
return 1;
}
++file_name_idx;
}
if (min_entries < 1 || max_entries <= min_entries)
{
std::cerr << "invalid min/max entries" << std::endl;
return 3;
}
if (entry_size < 1)
{
std::cerr << "invalid entry size" << std::endl;
return 3;
}
if (file_name_idx != 2)
{
std::cout << "usage: shuffler <infile> <outfile>" << std::endl;
std::cout << " --entry n entry size in bytes, default 38" << std::endl;
std::cout << " --min n minimum number of entries per batch, default 50 million" << std::endl;
std::cout << " --max n maximum number of entries per batch, default 100 million" << std::endl;
return -1;
}
return 0;
}
template<typename T, typename U>
void shuffle_vector(T &rng, std::vector<U> &vec)
{
if (vec.empty())
return;
// fisher-yates shuffle: i must run down to 1 and the swap index
// may equal i, otherwise the resulting permutation is biased
for (size_t i=vec.size()-1; i>0; i--)
{
auto swidx = rng() % (i+1);
std::swap(vec[i], vec[swidx]);
}
}
template<typename T>
void shuffle_buffer(T &rng, std::vector<uint8_t> &buffer, size_t count)
{
if (!count)
return;
std::vector<uint8_t> ebuf;
ebuf.resize(entry_size);
// fisher-yates shuffle (same indexing as shuffle_vector)
for (size_t i=count-1; i>0; i--)
{
auto swidx = rng() % (i+1);
// swap
memcpy(ebuf.data(), &buffer[i*entry_size], entry_size);
memcpy(&buffer[i*entry_size], &buffer[swidx*entry_size], entry_size);
memcpy(&buffer[swidx*entry_size], ebuf.data(), entry_size);
}
}
template<typename T>
void create_plan(T &rnd_engine, uint64_t in_size)
{
plan.clear();
uint64_t offset = 0;
while (in_size > 0)
{
// note: no uniform distribution but who cares
auto div = max_entries - min_entries;
if (div == 0)
div = 1;
auto count = rnd_engine() % div + min_entries;
count *= entry_size;
if (count > in_size)
count = in_size;
plan.push_back({offset, count/entry_size});
offset += count;
in_size -= count;
}
// shuffle plan now
shuffle_vector(rnd_engine, plan);
std::cout << "plan: " << plan.size() << " chunks" << std::endl;
}
int shuffle_main()
{
std::ifstream f;
f.open(in_filename.c_str(), std::ios_base::binary | std::ios_base::in);
if (f.fail())
{
std::cerr << "cannot open " << in_filename << std::endl;
return 4;
}
f.seekg(0, std::ios_base::end);
auto in_size = f.tellg();
f.seekg(0, std::ios_base::beg);
std::mt19937_64 rnd_engine;
rnd_engine.seed(std::chrono::high_resolution_clock::now().time_since_epoch().count());
create_plan(rnd_engine, in_size);
std::vector<uint8_t> buffer;
buffer.resize(max_entries * entry_size);
std::ofstream fo;
fo.open(out_filename.c_str(), std::ios_base::binary | std::ios_base::trunc | std::ios_base::out);
if (fo.fail())
{
std::cerr << "cannot create " << out_filename << std::endl;
return 5;
}
size_t step_div = 100;
size_t step_report = plan.size()/step_div;
if (step_report == 0)
step_report = 1;
size_t next_report = step_report;
for (size_t i=0; i<plan.size(); i++)
{
const auto &it = plan[i];
if (i == next_report)
{
auto percent = (i * 100 + 50) / plan.size();
std::cout << "plan " << i+1 << " / " << plan.size() << " (" << percent << " %)" << std::endl;
next_report += step_report;
}
f.seekg(it.source_file_offset, std::ios_base::beg);
f.read((char *)buffer.data(), it.num_entries * entry_size);
if (f.fail())
{
std::cerr << "cannot read " << in_filename << std::endl;
return 6;
}
shuffle_buffer(rnd_engine, buffer, it.num_entries);
fo.write((char *)buffer.data(), it.num_entries * entry_size);
if (fo.fail())
{
std::cerr << "cannot write " << out_filename << std::endl;
return 7;
}
}
std::cout << "all done" << std::endl;
return 0;
}
int main(int argc, const char **argv)
{
if (int res = parse_args(argc, argv))
return res;
return shuffle_main();
}
-
- Posts: 4619
- Joined: Tue Apr 03, 2012 4:28 pm
- Location: Midi-Pyrénées
- Full name: Christopher Whittington
Re: Remove-Double-Sfens
There's a mathematical/statistical shuffle, and then there's a good-enough shuffle for computer chess.
*if* you're lucky enough to start with PGNs rather than an EPD list of unknown status, then the clumping of "similar" PGNs will be due to time controls, testing-group origin and engine clumping. Therefore:
1. Ream the PGNs randomly out into N different blocks, where N is chosen to give a block size just holdable in your particular RAM.
2. Shuffle the blocks by EPD in RAM and concatenate.
If you're not lucky enough to have PGNs initially, then you can usually get away with just reaming EPDs in blocks of 256 or whatever (a sketch of the reaming step follows below).
How can you tell if you've shuffled well? Plot a graph of loss against epoch during training. Any consistent periodic waveform, or sudden spikes, tells you that you haven't. The graph should be a smooth decay.
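A minimal sketch of the reaming step in C++ (N, the file names and line-based EPD input are assumptions, not Christopher's actual tooling):
Code: Select all
#include <fstream>
#include <random>
#include <string>
#include <vector>

// Sketch of "reaming": each EPD line goes to one of N block files
// chosen at random, so every block fits in RAM and can be shuffled
// there before the blocks are concatenated.
int main()
{
    const size_t N = 16; // pick N so one block fits in your RAM
    std::ifstream in("positions.epd");
    std::vector<std::ofstream> blocks;
    for (size_t i = 0; i < N; i++)
        blocks.emplace_back("block" + std::to_string(i) + ".epd");

    std::mt19937_64 rng(std::random_device{}());
    std::uniform_int_distribution<size_t> pick(0, N - 1);
    std::string line;
    while (std::getline(in, line))
        blocks[pick(rng)] << line << '\n';
}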
-
- Posts: 2644
- Joined: Fri Nov 26, 2010 2:00 pm
- Location: Czech Republic
- Full name: Martin Sedlak
Re: Remove-Double-Sfens
yes, seeing large differences in loss for various minibatches is what made me think more about better shuffling. until recently I could fit everything in RAM, so I could afford to reshuffle everything between epochs and it was perfect.
now I can only afford to shuffle once on-disk and have to stream the data in sequentially during training.
the two-pass method, where the first shuffle has smaller blocks, ensures a good global mix, while the second pass with much larger blocks ensures a good mix locally. only together do they seem to produce good results
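the "average difference between consecutive elements" measure mentioned above is easy to check on a test file: shuffle a file of sequential 64-bit indices, then run something like the sketch below over it. for a uniform shuffle of n entries the average tends toward roughly n/3. the file name and entry format here are assumptions.
Code: Select all
#include <cmath>
#include <cstdint>
#include <fstream>
#include <iostream>

// Sketch: measure shuffle quality as the average absolute difference
// between consecutive entries of a shuffled file of sequential uint64s.
int main()
{
    std::ifstream f("shuffled_indices.bin", std::ios::binary);
    uint64_t prev = 0, cur = 0;
    if (!f.read((char *)&prev, sizeof prev))
        return 1;
    double sum = 0;
    uint64_t count = 0;
    while (f.read((char *)&cur, sizeof cur))
    {
        sum += std::fabs((double)cur - (double)prev);
        prev = cur;
        ++count;
    }
    if (count)
        std::cout << "avg consecutive difference: " << sum / count << std::endl;
    return 0;
}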