Remove-Double-Sfens

Discussion of chess software programming and technical issues.

Moderator: Ras

Rebel
Posts: 7288
Joined: Thu Aug 18, 2011 12:04 pm
Full name: Ed Schröder

Remove-Double-Sfens

Post by Rebel »

Fast counting and removal of duplicate sfens (compressed EPDs, typically 40 bytes each) for large NNUE training data.

....

https://github.com/egh-s/Remove-Double-Sfens
90% of coding is debugging, the other 10% is writing bugs.
AndrewGrant
Posts: 1952
Joined: Tue Apr 19, 2016 6:08 am
Location: U.S.A
Full name: Andrew Grant

Re: Remove-Double-Sfens

Post by AndrewGrant »

If we are linking useful tools:

My byteshuf.c is a fixed-size binary file shuffler that implements the mechanism from TeraShuf. TeraShuf shuffles extremely large text files by writing to disk, avoiding the need to hold the entire contents in memory at once. It reads the file a chunk at a time, shuffles each chunk, and saves it to an interim file. Once all chunks are read, shuffled, and saved, it recombines them by repeatedly taking an entry from a randomly chosen file, with the probability for each file derived from its remaining number of entries.
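
A minimal in-memory C++ sketch of that recombination step, for illustration only (merge_chunks is my name, not part of byteshuf, and the real tool streams entries from interim files rather than holding them in vectors):

Code: Select all

#include <cstdint>
#include <random>
#include <vector>

// Sketch of TeraShuf-style recombination: repeatedly pick a chunk with
// probability proportional to its remaining entry count and emit that
// chunk's next entry, until all chunks are drained.
using Entry = std::vector<uint8_t>;

std::vector<Entry> merge_chunks(std::vector<std::vector<Entry>> &chunks,
                                std::mt19937_64 &rng)
{
	std::vector<Entry> out;
	std::vector<size_t> next(chunks.size(), 0); // per-chunk read cursor

	uint64_t remaining_total = 0;
	for (const auto &c : chunks)
		remaining_total += c.size();

	while (remaining_total > 0)
	{
		// draw a position in [0, remaining_total) and map it to a chunk,
		// so each chunk is chosen proportionally to its remaining entries
		uint64_t r = rng() % remaining_total;

		for (size_t i = 0; i < chunks.size(); i++)
		{
			uint64_t left = chunks[i].size() - next[i];
			if (r < left)
			{
				out.push_back(chunks[i][next[i]++]);
				break;
			}
			r -= left;
		}
		remaining_total--;
	}
	return out;
}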

Utility: https://gist.github.com/AndyGrant/b3a70 ... 6381ee30c4

Unfortunately this only works in Linux environments, but it could be ported to Windows; the only Linux-specific part is the file-system code that checks the contents of directories. The top portion of the file lets you define the data structure of the files you are shuffling.

Code: Select all

$ ./byteshuf.out --help
Usage: byteshuf.out [OPTION...]

  -v, --verbose              Produce verbose output
  -d, --directory=           Directory of files of unshuffled bytes
  -i, --input=               Source file of unshuffled bytes
  -o, --output=              Output file for shuffled bytes
  -n, --per-file=            Samples per output file (33554432)
  -s, --chunk-size=          Samples per interim file (33554432)
  -?, --help                 Give this help list
      --usage                Give a short usage message
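
For example, a hypothetical invocation using only the flags listed above (file names made up), shuffling one input file with the default chunk size:

Code: Select all

$ ./byteshuf.out -i unshuffled.bin -o shuffled.bin -s 33554432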
Rebel
Posts: 7288
Joined: Thu Aug 18, 2011 12:04 pm
Full name: Ed Schröder

Re: Remove-Double-Sfens

Post by Rebel »

Hi Andy,

Ah, shuffling, nice topic.

Unfortunately I could not compile your Linux code with VS; too many errors I don't understand. What a pity.

I am currently trying to improve our shuffler; it looks promising in the sense that it's a lot faster. I do pre-shuffling first. By that I mean: I first create a table of 10 million integers holding the numbers 0 to 9,999,999, shuffle it, and use this table as indices into the sfens to shuffle. I then read 10 million sfens into memory, do 10 million swaps, and write the shuffled sfens back to disk.
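
Roughly, one batch of the idea might look like this (a minimal sketch; the 40-byte record size, the names, and the copy-into-a-second-buffer detail are my assumptions, not the actual shuffler):

Code: Select all

#include <algorithm>
#include <cstdint>
#include <cstring>
#include <random>
#include <vector>

constexpr size_t SFEN_SIZE = 40;          // compressed EPD record size (assumed)
constexpr size_t BATCH     = 10'000'000;  // sfens per in-memory batch

// Permute one batch of sfens using a pre-shuffled index table: table[i]
// says which source record lands in output slot i, so each record is
// moved exactly once, i.e. BATCH moves per batch.
void permute_batch(const uint8_t *in, uint8_t *out, std::mt19937_64 &rng)
{
	std::vector<uint32_t> table(BATCH);

	for (uint32_t i = 0; i < BATCH; i++)
		table[i] = i;                     // the numbers 0 to 9,999,999

	std::shuffle(table.begin(), table.end(), rng);

	for (size_t i = 0; i < BATCH; i++)
		memcpy(out + i * SFEN_SIZE, in + (size_t)table[i] * SFEN_SIZE, SFEN_SIZE);
}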

I noticed there are several shuffle techniques, which makes me wonder how many swaps they do. Can you tell the number of swaps your Byteshuf utility does? In my case it's just the number of sfens I read.
90% of coding is debugging, the other 10% is writing bugs.
AndrewGrant
Posts: 1952
Joined: Tue Apr 19, 2016 6:08 am
Location: U.S.A
Full name: Andrew Grant

Re: Remove-Double-Sfens

Post by AndrewGrant »

Rebel wrote: Mon Feb 19, 2024 8:55 pm Unfortunately I could not compile your Linux code with VS; too many errors I don't understand. What a pity.
I don't really touch Windows or Windows-associated tools, so that is expected.
Rebel wrote: Mon Feb 19, 2024 8:55 pm I noticed there are several shuffle techniques, which makes me wonder how many swaps they do. Can you tell the number of swaps your Byteshuf utility does? In my case it's just the number of sfens I read.
Supposedly, the mechanism I used ("tera-shuf") is a mathematically valid shuffle. I don't know how I would quantify the number of swaps. If you started with 1 million positions and wanted to shuffle them with intermediate files of 100k....

Then there would be 100k swaps per file across 10 files, making 1 million in total. But then you piece those together, which you could argue is another swap of sorts. So depending on how pedantic you are, there are effectively 1 million swaps, but it might take up to 2 million steps, depending on how you define a step.
Rebel
Posts: 7288
Joined: Thu Aug 18, 2011 12:04 pm
Full name: Ed Schröder

Re: Remove-Double-Sfens

Post by Rebel »

Yep.

I must say that, so far, I am pretty satisfied with my "pre-shuffle" approach: it shuffles the 25B sfens behind my current best net in 2 hours and 2 minutes, whereas the shuffler currently in use takes about 16 hours. I am testing it right now, and if it's any good I will put the source code on GitHub.
90% of coding is debugging, the other 10% is writing bugs.
mar
Posts: 2644
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Remove-Double-Sfens

Post by mar »

Very interesting, I was in need of something similar recently.

I went with a different approach, sort of similar to what you do, Ed:

The input is split into 50M-100M entry parts, where each part's size is chosen randomly (the --min and --max arguments change this). Those parts form a plan, which is then shuffled to randomize the part positions in the output file.

After that, the input file is read according to the plan: each chunk is loaded into memory, shuffled (Fisher-Yates), and written to the output file.

It seems fast enough and is completely I/O bound (it requires about 5 GB of memory with default settings).

(The entry size can be set with the --entry n argument; in my case it's 38 bytes.)

It's not a "full shuffle" but should be good enough. Plus, since the blocks are of random size, repeating the process should converge to a full shuffle given enough iterations, though I can't guesstimate how good it actually is.

The source code is here if someone's interested; I hereby place it in the public domain (C++, should be portable).
I haven't tested it extensively, so there may be bugs.

Code: Select all

#include <cstdint>
#include <cstdio>
#include <cstdlib> // strtol
#include <cstring>

#include <vector>
#include <string>

#include <random>
#include <chrono>
#include <utility>

#include <fstream>
#include <iostream>

// matches cheng's entry size
size_t entry_size = 38;

// min/max entries per chunk
size_t min_entries = 50'000'000;
size_t max_entries = 100'000'000;

std::string in_filename;
std::string out_filename;

struct Plan
{
	uint64_t source_file_offset;
	uint64_t num_entries;
};

std::vector<Plan> plan;

int parse_args(int argc, const char **argv)
{
	int file_name_idx = 0;

	for (int i=1; i<argc; i++)
	{
		std::string arg = argv[i];

		// ugh... no starts_with for string
		if (arg.length() > 2 && arg[0] == '-' && arg[1] == '-')
		{
			if (i+1 >= argc)
			{
				std::cerr << "no argument after " << arg << std::endl;
				return 2;
			}

			// entry size
			if (arg == "--entry")
			{
				entry_size = strtol(argv[i+1], nullptr, 10);
				i++;
			}

			if (arg == "--min")
			{
				min_entries = strtol(argv[i+1], nullptr, 10);
				i++;
			}

			if (arg == "--max")
			{
				max_entries = strtol(argv[i+1], nullptr, 10);
				i++;
			}

			continue;
		}

		switch(file_name_idx)
		{
		case 0:
			in_filename = arg;
			break;

		case 1:
			out_filename = arg;
			break;

		default:
			std::cerr << "invalid extra filename" << std::endl;
			return 1;
		}

		++file_name_idx;
	}

	if (min_entries < 1 || max_entries <= min_entries)
	{
		std::cerr << "invalid min/max entries" << std::endl;
		return 3;
	}

	if (entry_size < 1)
	{
		std::cerr << "invalid entry size" << std::endl;
		return 3;
	}

	if (file_name_idx != 2)
	{
		std::cout << "usage: shuffler <infile> <outfile>" << std::endl;
		std::cout << "       --entry n    entry size in bytes, default 38" << std::endl;
		std::cout << "       --min n      minimum number of entries per batch, default 50 million" << std::endl;
		std::cout << "       --max n      maximum number of entries per batch, default 100 million" << std::endl;
		return -1;
	}

	return 0;
}

template<typename T, typename U>
void shuffle_vector(T &rng, std::vector<U> &vec)
{
	if (vec.empty())
		return;

	// fisher-yates shuffle: pick swidx from [0, i] inclusive
	for (size_t i=vec.size()-1; i>0; i--)
	{
		auto swidx = rng() % (i+1);
		std::swap(vec[i], vec[swidx]);
	}
}

template<typename T>
void shuffle_buffer(T &rng, std::vector<uint8_t> &buffer, size_t count)
{
	if (!count)
		return;

	std::vector<uint8_t> ebuf;
	ebuf.resize(entry_size);

	// fisher-yates shuffle: pick swidx from [0, i] inclusive
	for (size_t i=count-1; i>0; i--)
	{
		auto swidx = rng() % (i+1);

		// swap entries i and swidx
		memcpy(ebuf.data(), &buffer[i*entry_size], entry_size);
		memcpy(&buffer[i*entry_size], &buffer[swidx*entry_size], entry_size);
		memcpy(&buffer[swidx*entry_size], ebuf.data(), entry_size);
	}
}

template<typename T>
void create_plan(T &rnd_engine, uint64_t in_size)
{
	plan.clear();

	uint64_t offset = 0;

	while (in_size > 0)	
	{
		// note: no uniform distribution but who cares
		auto count = rnd_engine() % (max_entries - min_entries) + min_entries;
		count *= entry_size;

		if (count > in_size)
			count = in_size;

		plan.push_back({offset, count/entry_size});

		offset += count;
		in_size -= count;
	}

	// shuffle plan now
	shuffle_vector(rnd_engine, plan);

	std::cout << "plan: " << plan.size() << " chunks" << std::endl;
}

int shuffle_main()
{
	std::ifstream f;
	f.open(in_filename.c_str(), std::ios_base::binary | std::ios_base::in);

	if (f.fail())
	{
		std::cerr << "cannot open " << in_filename << std::endl;
		return 4;
	}

	f.seekg(0, std::ios_base::end);
	auto in_size = f.tellg();
	f.seekg(0, std::ios_base::beg);

	std::mt19937_64 rnd_engine;
	rnd_engine.seed(std::chrono::high_resolution_clock::now().time_since_epoch().count());

	create_plan(rnd_engine, in_size);

	std::vector<uint8_t> buffer;
	buffer.resize(max_entries * entry_size);

	std::ofstream fo;
	fo.open(out_filename.c_str(), std::ios_base::binary | std::ios_base::trunc | std::ios_base::out);

	if (fo.fail())
	{
		std::cerr << "cannot create " << out_filename << std::endl;
		return 5;
	}

	for (size_t i=0; i<plan.size(); i++)
	{
		const auto &it = plan[i];

		std::cout << "plan " << i+1 << " / " << plan.size() << std::endl;

		f.seekg(it.source_file_offset, std::ios_base::beg);
		f.read((char *)buffer.data(), it.num_entries * entry_size);

		if (f.fail())
		{
			std::cerr << "cannot read " << in_filename << std::endl;
			return 6;
		}

		shuffle_buffer(rnd_engine, buffer, it.num_entries);

		fo.write((char *)buffer.data(), it.num_entries * entry_size);

		if (fo.fail())
		{
			std::cerr << "cannot write " << out_filename << std::endl;
			return 7;
		}
	}

	std::cout << "all done" << std::endl;

	return 0;
}

int main(int argc, const char **argv)
{
	if (int res = parse_args(argc, argv))
		return res;

	return shuffle_main();
}
Rebel
Posts: 7288
Joined: Thu Aug 18, 2011 12:04 pm
Full name: Ed Schröder

Re: Remove-Double-Sfens

Post by Rebel »

Nice, will try.
90% of coding is debugging, the other 10% is writing bugs.
mar
Posts: 2644
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Remove-Double-Sfens

Post by mar »

OK, the default version doesn't produce good shuffles even with several iterations, but I found that a two-pass "ping-pong" shuffle generates reasonably good output, quite close to a full shuffle (using the average difference between each two consecutive shuffled elements as a measure of "quality"):

The first pass is run with --min 1000 --max 2000 source temp.
The second pass (backward) is run with the defaults (--min 50000000 --max 100000000): temp source.

After that, the source should be reasonably shuffled.
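
The quality measure can be computed by running the shuffler on a file whose entries are just consecutive integers, reading back the permutation, and measuring how far apart neighbours ended up. A minimal sketch (shuffle_quality is my name, not part of the program below):

Code: Select all

#include <cstdint>
#include <cstdlib>
#include <vector>

// Mean absolute difference between the original positions of consecutive
// output entries. An unshuffled file scores 1; a uniform random
// permutation of n entries approaches n/3, so higher is better.
double shuffle_quality(const std::vector<uint64_t> &positions)
{
	if (positions.size() < 2)
		return 0.0;

	double sum = 0.0;

	for (size_t i = 1; i < positions.size(); i++)
		sum += (double)std::llabs((long long)(positions[i] - positions[i-1]));

	return sum / (double)(positions.size() - 1);
}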

An improved version of the program (one that doesn't spam the output as much):

Code: Select all

#include <cstdint>
#include <cstdio>
#include <cstdlib> // strtol
#include <cstring>

#include <vector>
#include <string>

#include <random>
#include <chrono>
#include <utility>

#include <fstream>
#include <iostream>

// matches cheng's entry size
size_t entry_size = 38;

// min/max entries per chunk
size_t min_entries = 50'000'000;
size_t max_entries = 100'000'000;

std::string in_filename;
std::string out_filename;

struct Plan
{
	uint64_t source_file_offset;
	uint64_t num_entries;
};

std::vector<Plan> plan;

int parse_args(int argc, const char **argv)
{
	int file_name_idx = 0;

	for (int i=1; i<argc; i++)
	{
		std::string arg = argv[i];

		// ugh... no starts_with for string
		if (arg.length() > 2 && arg[0] == '-' && arg[1] == '-')
		{
			if (i+1 >= argc)
			{
				std::cerr << "no argument after " << arg << std::endl;
				return 2;
			}

			// entry size
			if (arg == "--entry")
			{
				entry_size = strtol(argv[i+1], nullptr, 10);
				i++;
			}

			if (arg == "--min")
			{
				min_entries = strtol(argv[i+1], nullptr, 10);
				i++;
			}

			if (arg == "--max")
			{
				max_entries = strtol(argv[i+1], nullptr, 10);
				i++;
			}

			continue;
		}

		switch(file_name_idx)
		{
		case 0:
			in_filename = arg;
			break;

		case 1:
			out_filename = arg;
			break;

		default:
			std::cerr << "invalid extra filename" << std::endl;
			return 1;
		}

		++file_name_idx;
	}

	if (min_entries < 1 || max_entries <= min_entries)
	{
		std::cerr << "invalid min/max entries" << std::endl;
		return 3;
	}

	if (entry_size < 1)
	{
		std::cerr << "invalid entry size" << std::endl;
		return 3;
	}

	if (file_name_idx != 2)
	{
		std::cout << "usage: shuffler <infile> <outfile>" << std::endl;
		std::cout << "       --entry n    entry size in bytes, default 38" << std::endl;
		std::cout << "       --min n      minimum number of entries per batch, default 50 million" << std::endl;
		std::cout << "       --max n      maximum number of entries per batch, default 100 million" << std::endl;
		return -1;
	}

	return 0;
}

template<typename T, typename U>
void shuffle_vector(T &rng, std::vector<U> &vec)
{
	if (vec.empty())
		return;

	// fisher-yates shuffle: pick swidx from [0, i] inclusive
	for (size_t i=vec.size()-1; i>0; i--)
	{
		auto swidx = rng() % (i+1);
		std::swap(vec[i], vec[swidx]);
	}
}

template<typename T>
void shuffle_buffer(T &rng, std::vector<uint8_t> &buffer, size_t count)
{
	if (!count)
		return;

	std::vector<uint8_t> ebuf;
	ebuf.resize(entry_size);

	// fisher-yates shuffle: pick swidx from [0, i] inclusive
	for (size_t i=count-1; i>0; i--)
	{
		auto swidx = rng() % (i+1);

		// swap entries i and swidx
		memcpy(ebuf.data(), &buffer[i*entry_size], entry_size);
		memcpy(&buffer[i*entry_size], &buffer[swidx*entry_size], entry_size);
		memcpy(&buffer[swidx*entry_size], ebuf.data(), entry_size);
	}
}

template<typename T>
void create_plan(T &rnd_engine, uint64_t in_size)
{
	plan.clear();

	uint64_t offset = 0;

	while (in_size > 0)	
	{
		// note: no uniform distribution but who cares
		auto div = max_entries - min_entries;

		if (div == 0)
			div = 1;

		auto count = rnd_engine() % div + min_entries;
		count *= entry_size;

		if (count > in_size)
			count = in_size;

		plan.push_back({offset, count/entry_size});

		offset += count;
		in_size -= count;
	}

	// shuffle plan now
	shuffle_vector(rnd_engine, plan);

	std::cout << "plan: " << plan.size() << " chunks" << std::endl;
}

int shuffle_main()
{
	std::ifstream f;
	f.open(in_filename.c_str(), std::ios_base::binary | std::ios_base::in);

	if (f.fail())
	{
		std::cerr << "cannot open " << in_filename << std::endl;
		return 4;
	}

	f.seekg(0, std::ios_base::end);
	auto in_size = f.tellg();
	f.seekg(0, std::ios_base::beg);

	std::mt19937_64 rnd_engine;
	rnd_engine.seed(std::chrono::high_resolution_clock::now().time_since_epoch().count());

	create_plan(rnd_engine, in_size);

	std::vector<uint8_t> buffer;
	buffer.resize(max_entries * entry_size);

	std::ofstream fo;
	fo.open(out_filename.c_str(), std::ios_base::binary | std::ios_base::trunc | std::ios_base::out);

	if (fo.fail())
	{
		std::cerr << "cannot create " << out_filename << std::endl;
		return 5;
	}

	size_t step_div = 100;
	size_t step_report = plan.size()/step_div;

	if (step_report == 0)
		step_report = 1;

	size_t next_report = step_report;

	for (size_t i=0; i<plan.size(); i++)
	{
		const auto &it = plan[i];

		if (i == next_report)
		{
			auto percent = (i * 100 + 50) / plan.size();
			std::cout << "plan " << i+1 << " / " << plan.size() << " (" << percent << " %)" << std::endl;
			next_report += step_report;
		}

		f.seekg(it.source_file_offset, std::ios_base::beg);
		f.read((char *)buffer.data(), it.num_entries * entry_size);

		if (f.fail())
		{
			std::cerr << "cannot read " << in_filename << std::endl;
			return 6;
		}

		shuffle_buffer(rnd_engine, buffer, it.num_entries);

		fo.write((char *)buffer.data(), it.num_entries * entry_size);

		if (fo.fail())
		{
			std::cerr << "cannot write " << out_filename << std::endl;
			return 7;
		}
	}

	std::cout << "all done" << std::endl;

	return 0;
}

int main(int argc, const char **argv)
{
	if (int res = parse_args(argc, argv))
		return res;

	return shuffle_main();
}
chrisw
Posts: 4619
Joined: Tue Apr 03, 2012 4:28 pm
Location: Midi-Pyrénées
Full name: Christopher Whittington

Re: Remove-Double-Sfens

Post by chrisw »

There's the mathematical/statistical shuffle, and then there's the good-enough shuffle for computer chess.

*If* you're lucky enough to start with PGNs rather than an EPD list of unknown status, then the clumping of "similar" positions will come from time controls, testing-group origin, and engines clumping together. Therefore:

1. Ream the PGNs out randomly into N different blocks, where N is chosen to give a block size that just fits in your particular RAM.

2. Shuffle each block by EPD in RAM and concatenate.

If you're not lucky enough to have PGNs initially, then you can usually get away with just reaming EPDs out in blocks of 256 or whatever.

How can you tell if you've shuffled well? Plot a graph of loss against epoch during training. Any consistent periodic waveform, or sudden spikes, tells you that you haven't. The graph should be a smooth decay.
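
For a more automated check, one could also compute the autocorrelation of the logged per-minibatch losses; a pronounced peak at some nonzero lag is the periodic waveform described above. A minimal sketch (loss_autocorrelation is illustrative):

Code: Select all

#include <cstddef>
#include <vector>

// Normalized autocorrelation of a loss curve at a given lag. Subtract the
// smooth decay trend first, otherwise the trend dominates. Values near
// zero are what you want; a strong positive peak at some lag means the
// loss repeats periodically, i.e. the training data was poorly shuffled.
double loss_autocorrelation(const std::vector<double> &loss, size_t lag)
{
	size_t n = loss.size();

	if (lag >= n)
		return 0.0;

	double mean = 0.0;
	for (double x : loss)
		mean += x;
	mean /= (double)n;

	double num = 0.0, den = 0.0;

	for (size_t i = 0; i < n; i++)
	{
		double d = loss[i] - mean;
		den += d * d;
		if (i + lag < n)
			num += d * (loss[i + lag] - mean);
	}

	return den > 0.0 ? num / den : 0.0;
}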
mar
Posts: 2644
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Remove-Double-Sfens

Post by mar »

chrisw wrote: Sat Mar 09, 2024 4:19 pm How can you tell if you've shuffled well? Plot a graph of loss against epoch during training. Any consistent periodic waveform, or sudden spikes, tells you that you haven't. The graph should be a smooth decay.
Yes, seeing large differences in loss between minibatches is what made me think more about better shuffling. Until recently I could fit everything in RAM, so I could afford to reshuffle everything between epochs, and it was perfect. Now I can only afford to shuffle once on disk and have to stream the data in sequentially during training.

In the two-pass method, the first pass with smaller blocks ensures a good global mix, while the second pass with much larger blocks ensures a good mix locally. Only together do they seem to produce good results.