partition result into subgroups

Discussion of chess software programming and technical issues.

Moderators: hgm, Dann Corbit, Harvey Williamson

kbhearn
Posts: 411
Joined: Thu Dec 30, 2010 4:48 am

Re: partition result into subgroups

Post by kbhearn »

Daniel Shawul wrote:
kbhearn wrote:How fast does it need to be? If it can have some computational overhead to iterate through a random partition then you could use a pseudorandom permutation and not need to store the results of the permutation, just use it to translate indexes as you need them. Changing the key would give you a different partitioning.

i.e. translate indexes 0->N/k through a PRP, repeat for next N/k if you want another partition, the PRP guarantees that each starting index will correspond to a unique finishing index.
It can be slow as hell since it is done one time. Can you give an example of a PRP that can generate permutation of [0..N] integers where N can be as big as a million?
you can use a 3 round feistel network as a generic PRP with a set size of any even number of bits.

Then you can convert the n-bit PRP to a PRP from 0...N by running any index > N through the PRP again until you get an index < N (hence you want to keep the order close).

i.e. if 1 million was your N, then you'd use 20 bits for your base PRP.

and if index 3 happens to map to 1031000 which is outside your set, then you'd rerun it and map from 1031000 instead, ultimately it results in a PRP with an arbitrary range - at worst you have to run the PRP an average of 4 times per index if your N is (2^a) + 1 which would require a base PRP of size 2^(a+2) when a is even. when a is odd it'd just be 2^(a+1)
petero2
Posts: 680
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: partition result into subgroups

Post by petero2 »

Daniel Shawul wrote:I seem to have a mental block with permutation stuff and can't figure out what seems to be easy problem.. Say I have N games with WDL results so that N = W + D + L. How to partition this into k sub-groups of N1(W1,D1,L1) so that they add up to N(W,D,L). I thought of random shuffling of N indices and then picking up Nk = N / k elements for each group. But that requires the indices to be stored in an array. That could be very big if I have thousands/millions of games so I want to avoid that. How do you generate n random integers that add up to some value ?
Thanks
Maybe I missed something but I don't see why you can't store millions of indices in an array. If you mean billions of games I can understand your concern, but 4MB for 1 million games doesn't seem like a problem to me.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: partition result into subgroups

Post by Daniel Shawul »

Actually this seems to be the best approach. Picking one at a time and decreasing either W , D or L depending on where the picked result fall. I can not be sure till I code it but it sure sounds like much easier than shuffling through the whole list at the beginning
Thanks.

Edit Indeed this works well. Moral of story is if something doesn't occur to you, take some fresh air :)

Code: Select all

#include <stdlib.h>
#include <stdio.h>

void gen&#40;int* r,int* rk,int Nk&#41; &#123;
	int N = 0;
	for&#40;int j = 0;j < 3;j++)
		N += r&#91;j&#93;;
	for&#40;int i = 0;i < Nk;i++) &#123;
		int v = rand&#40;) % N;
		int total = 0;
		for&#40;int j = 0;j < 3;j++) &#123;
			total += r&#91;j&#93;;
			if&#40;total > v&#41; &#123;
				rk&#91;j&#93;++;
				r&#91;j&#93;--;
				N--;
				break;
			&#125;
		&#125;
	&#125;
&#125;
int main&#40;) &#123;
	int r&#91;3&#93; = &#123;10000,20000,10000&#125;;
	int Nk = 4000;
	int rk&#91;10&#93;&#91;3&#93;;
	srand&#40;1&#41;;
	for&#40;int i = 0;i < 10;i++) &#123;
		rk&#91;i&#93;&#91;0&#93; = rk&#91;i&#93;&#91;1&#93; = rk&#91;i&#93;&#91;2&#93; = 0;
		gen&#40;r,rk&#91;i&#93;,Nk&#41;;
		printf&#40;"%d %d %d\n",rk&#91;i&#93;&#91;0&#93;,rk&#91;i&#93;&#91;1&#93;,rk&#91;i&#93;&#91;2&#93;);
	&#125;

	return 0;
&#125;
Test run

Code: Select all

1140	2319	541		4000
1025	2008	967		4000
1226	1812	962		4000
1417	1683	900		4000
1080	2142	778		4000
865	2154	981		4000
944	2016	1040		4000
834	1997	1169		4000
771	1927	1302		4000
698	1942	1360		4000
				
10000	20000	10000		

Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: partition result into subgroups

Post by Daniel Shawul »

petero2 wrote:
Daniel Shawul wrote:I seem to have a mental block with permutation stuff and can't figure out what seems to be easy problem.. Say I have N games with WDL results so that N = W + D + L. How to partition this into k sub-groups of N1(W1,D1,L1) so that they add up to N(W,D,L). I thought of random shuffling of N indices and then picking up Nk = N / k elements for each group. But that requires the indices to be stored in an array. That could be very big if I have thousands/millions of games so I want to avoid that. How do you generate n random integers that add up to some value ?
Thanks
Maybe I missed something but I don't see why you can't store millions of indices in an array. If you mean billions of games I can understand your concern, but 4MB for 1 million games doesn't seem like a problem to me.
Actually I don't have large number of games in mind, so I was mentioning million/billions to limit people from coming up with array solutions. In practise W,D,L could be anything an integer or long long could hold. Anyway it turns out I and many others have been looking at it from the wrong perspective. What Bob suggested is simple and a clear winner.
gladius
Posts: 568
Joined: Tue Dec 12, 2006 10:10 am
Full name: Gary Linscott

Re: partition result into subgroups

Post by gladius »

Daniel Shawul wrote:Thanks a lot Gary for taking the time to present it in code! Ok so the code can partition W wins in to k groups, and similarly L losses into k groups, and finally for the losses I just take Li = N / k - Wi - Di. I think that should work unless I missed something. Will check and come back.
Thanks again
Edit: There is a problem. Say I have N(W,D,L) = 2100(1000,1000,100) and I want to divide in to 10 thus 210(100,100,10) games each then say I get 50,50 for W & D, then for L it has to be 210 - 50 - 50 = 110. But we have only a total of 100 losses.
Sorry, I should have been a bit more verbose :). The idea was to call it once each for w/d/l. This method is marginally more efficient, as it only takes O(k) operations, where k = buckets. The other method takes O(n) operations, where n = # games.

Code: Select all

#include <vector>
#include <cstdlib>
#include <cstdio>
#include <cmath>

std&#58;&#58;vector<int> generate&#40;int value, int n&#41; &#123;
  std&#58;&#58;vector<int> nums;
  while &#40;nums.size&#40;) < n - 1&#41; &#123;
    int r = rand&#40;) % &#40;value / int&#40;log&#40;n&#41; + 1&#41;);
    nums.push_back&#40;r&#41;;
    value -= r;
  &#125;
  nums.push_back&#40;value&#41;;
  return nums;
&#125;

int main&#40;) &#123;
  int N = 10, w = 10000, l = 20000, d = 10000;
  std&#58;&#58;vector<int> wins = generate&#40;w, N&#41;;
  std&#58;&#58;vector<int> draws = generate&#40;d, N&#41;;
  std&#58;&#58;vector<int> losses = generate&#40;l, N&#41;;
  for &#40;int i = 0; i < N; ++i&#41; &#123;
    printf&#40;"%d %d %d\n", wins&#91;i&#93;, losses&#91;i&#93;, draws&#91;i&#93;);
  &#125;
  return 0;
&#125;
Which outputs:

Code: Select all

3163 6250 397
2069 123 2087
51 17 1942
1495 728 588
449 1053 781
367 1289 965
562 1181 166
208 2866 124
524 1826 800
1112 4667 2150
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: partition result into subgroups

Post by Daniel Shawul »

I think there is some kind of systematic error with this method. I did few runs and it seems wins decrease from sample 1 to sample 10, and losses increase. My guess is rand() is running out of cycles or I made a mistake somewhere

Code: Select all

int main&#40;) &#123;
	int r&#91;3&#93;;
	int Nk = 4000;
	int rk&#91;10&#93;&#91;3&#93;;
	for&#40;int k = 0;k < 10;k++) &#123;
		r&#91;0&#93; = 10000;r&#91;1&#93; = 20000;r&#91;2&#93; = 10000;
		for&#40;int i = 0;i < 10;i++) &#123;
			rk&#91;i&#93;&#91;0&#93; = rk&#91;i&#93;&#91;1&#93; = rk&#91;i&#93;&#91;2&#93; = 0;
			gen&#40;r,rk&#91;i&#93;,Nk&#41;;
			printf&#40;"%d %d %d\n",rk&#91;i&#93;&#91;0&#93;,rk&#91;i&#93;&#91;1&#93;,rk&#91;i&#93;&#91;2&#93;);
		&#125;
		printf&#40;"========\n");
	&#125;
	return 0;
&#125;
Result

Code: Select all

1140 2319 541
1025 2008 967
1226 1812 962
1417 1683 900
1080 2142 778
865 2154 981
944 2016 1040
834 1997 1169
771 1927 1302
698 1942 1360
========
1202 2296 502
967 2040 993
1228 1819 953
1402 1739 859
1122 2099 779
885 2177 938
977 1950 1073
786 2033 1181
756 1945 1299
675 1902 1423
========
1129 2293 578
1045 2053 902
1234 1743 1023
1371 1758 871
1127 2131 742
874 2163 963
1012 1911 1077
799 2005 1196
727 1975 1298
682 1968 1350
========
1134 2314 552
987 2065 948
1219 1786 995
1380 1744 876
1160 2079 761
897 2206 897
957 1964 1079
827 1991 1182
764 1953 1283
675 1898 1427
========
1184 2250 566
997 2115 888
1237 1831 932
1402 1748 850
1144 2080 776
862 2156 982
955 1924 1121
783 2021 1196
754 1955 1291
682 1920 1398
========
1189 2271 540
993 2059 948
1230 1840 930
1392 1748 860
1116 2056 828
856 2164 980
974 1974 1052
787 2022 1191
780 1949 1271
683 1917 1400
========
1164 2273 563
1037 2004 959
1202 1808 990
1396 1711 893
1118 2137 745
875 2171 954
978 1947 1075
834 2013 1153
749 1973 1278
647 1963 1390
========
1176 2275 549
1028 2048 924
1200 1846 954
1393 1767 840
1114 2122 764
890 2161 949
923 2003 1074
830 1953 1217
759 1942 1299
687 1883 1430
========
1150 2311 539
1010 2094 896
1164 1847 989
1372 1724 904
1145 2101 754
925 2142 933
1024 1962 1014
787 1958 1255
763 1950 1287
660 1911 1429
========
1136 2301 563
1028 2042 930
1231 1788 981
1383 1722 895
1125 2141 734
868 2183 949
939 2015 1046
837 1948 1215
773 1939 1288
680 1921 1399
========
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: partition result into subgroups

Post by Daniel Shawul »

Indeed running out of cycles was the problem. Pseudo random number generator with big cycles such as Mersenne twister should be used. Well this turned itself into an interesting problem afterall. Complete code for the curios

Code: Select all

#include <stdlib.h>
#include <stdio.h>

void seed_rand&#40;unsigned long&#41;;
unsigned long gen_rand&#40;);

void gen&#40;int* r,int* rk,int Nk&#41; &#123;
	int N = 0;
	for&#40;int j = 0;j < 3;j++)
		N += r&#91;j&#93;;
	for&#40;int i = 0;i < Nk;i++) &#123;
		int v = gen_rand&#40;) % N;
		//int v = rand&#40;) % N;
		int total = 0;
		for&#40;int j = 0;j < 3;j++) &#123;
			total += r&#91;j&#93;;
			if&#40;total > v&#41; &#123;
				rk&#91;j&#93;++;
				r&#91;j&#93;--;
				N--;
				break;
			&#125;
		&#125;
	&#125;
&#125;

int main&#40;) &#123;
	int r&#91;3&#93;;
	int Nk = 4000;
	int rk&#91;10&#93;&#91;3&#93;;
	seed_rand&#40;1&#41;;
	//srand&#40;1&#41;;
	for&#40;int k = 0;k < 10;k++) &#123;
		r&#91;0&#93; = 10000;r&#91;1&#93; = 20000;r&#91;2&#93; = 10000;
		for&#40;int i = 0;i < 10;i++) &#123;
			rk&#91;i&#93;&#91;0&#93; = rk&#91;i&#93;&#91;1&#93; = rk&#91;i&#93;&#91;2&#93; = 0;
			gen&#40;r,rk&#91;i&#93;,Nk&#41;;
			printf&#40;"%d %d %d\n",rk&#91;i&#93;&#91;0&#93;,rk&#91;i&#93;&#91;1&#93;,rk&#91;i&#93;&#91;2&#93;);
		&#125;
		printf&#40;"========\n");
	&#125;
	return 0;
&#125;

/* Period parameters */  
#define N 624
#define M 397
#define MATRIX_A 0x9908b0dfUL   /* constant vector a */
#define UPPER_MASK 0x80000000UL /* most significant w-r bits */
#define LOWER_MASK 0x7fffffffUL /* least significant r bits */

static unsigned long mt&#91;N&#93;; /* the array for the state vector  */
static int mti=N+1; /* mti==N+1 means mt&#91;N&#93; is not initialized */

/* initializes mt&#91;N&#93; with a seed */
void seed_rand&#40;unsigned long s&#41;
&#123;
    mt&#91;0&#93;= s & 0xffffffffUL;
    for &#40;mti=1; mti<N; mti++) &#123;
        mt&#91;mti&#93; = 
	    &#40;1812433253UL * &#40;mt&#91;mti-1&#93; ^ &#40;mt&#91;mti-1&#93; >> 30&#41;) + mti&#41;; 
        /* See Knuth TAOCP Vol2. 3rd Ed. P.106 for multiplier. */
        /* In the previous versions, MSBs of the seed affect   */
        /* only MSBs of the array mt&#91;&#93;.                        */
        /* 2002/01/09 modified by Makoto Matsumoto             */
        mt&#91;mti&#93; &= 0xffffffffUL;
        /* for >32 bit machines */
    &#125;
&#125;

/* generates a random number on &#91;0,0xffffffff&#93;-interval */
unsigned long gen_rand&#40;)
&#123;
    unsigned long y;
    static unsigned long mag01&#91;2&#93;=&#123;0x0UL, MATRIX_A&#125;;
    /* mag01&#91;x&#93; = x * MATRIX_A  for x=0,1 */

    if &#40;mti >= N&#41; &#123; /* generate N words at one time */
        int kk;

        if &#40;mti == N+1&#41;   /* if init_genrand&#40;) has not been called, */
            seed_rand&#40;5489UL&#41;; /* a default initial seed is used */

        for &#40;kk=0;kk<N-M;kk++) &#123;
            y = &#40;mt&#91;kk&#93;&UPPER_MASK&#41;|&#40;mt&#91;kk+1&#93;&LOWER_MASK&#41;;
            mt&#91;kk&#93; = mt&#91;kk+M&#93; ^ &#40;y >> 1&#41; ^ mag01&#91;y & 0x1UL&#93;;
        &#125;
        for (;kk<N-1;kk++) &#123;
            y = &#40;mt&#91;kk&#93;&UPPER_MASK&#41;|&#40;mt&#91;kk+1&#93;&LOWER_MASK&#41;;
            mt&#91;kk&#93; = mt&#91;kk+&#40;M-N&#41;&#93; ^ &#40;y >> 1&#41; ^ mag01&#91;y & 0x1UL&#93;;
        &#125;
        y = &#40;mt&#91;N-1&#93;&UPPER_MASK&#41;|&#40;mt&#91;0&#93;&LOWER_MASK&#41;;
        mt&#91;N-1&#93; = mt&#91;M-1&#93; ^ &#40;y >> 1&#41; ^ mag01&#91;y & 0x1UL&#93;;

        mti = 0;
    &#125;
  
    y = mt&#91;mti++&#93;;

    /* Tempering */
    y ^= &#40;y >> 11&#41;;
    y ^= &#40;y << 7&#41; & 0x9d2c5680UL;
    y ^= &#40;y << 15&#41; & 0xefc60000UL;
    y ^= &#40;y >> 18&#41;;

    return y;
&#125;
Now the results look more random with no discernible pattern

Code: Select all

943 2039 1018
1014 1995 991
972 2023 1005
1034 1958 1008
1008 1991 1001
979 2007 1014
975 2016 1009
1010 2027 963
1047 1978 975
1018 1966 1016
========
1049 1997 954
1004 1994 1002
992 1976 1032
1001 1997 1002
1032 1994 974
976 2013 1011
1011 1974 1015
1010 1981 1009
942 2021 1037
983 2053 964
========
972 1964 1064
989 2043 968
993 2018 989
971 2021 1008
1021 1986 993
1073 1901 1026
986 2026 988
972 2055 973
996 2013 991
1027 1973 1000
========
961 2015 1024
1039 1993 968
969 1990 1041
997 2017 986
1007 1979 1014
999 1987 1014
1031 1995 974
1006 2020 974
1012 2011 977
979 1993 1028
========
1036 1996 968
969 2027 1004
1007 1963 1030
1020 1975 1005
992 2032 976
999 1988 1013
1019 1982 999
995 2037 968
1013 1989 998
950 2011 1039
========
970 1978 1052
987 2037 976
1034 1954 1012
1013 1957 1030
987 2008 1005
1020 2009 971
976 1999 1025
988 2020 992
1024 2038 938
1001 2000 999
========
953 2037 1010
1027 1992 981
1028 1991 981
998 2001 1001
967 2055 978
1039 2006 955
947 2032 1021
1034 1961 1005
1005 1978 1017
1002 1947 1051
========
935 2054 1011
1018 2020 962
1017 1997 986
979 1980 1041
995 1979 1026
1010 2044 946
1028 1977 995
1012 1954 1034
971 2042 987
1035 1953 1012
========
1017 1979 1004
1010 2005 985
981 2023 996
975 2025 1000
996 2033 971
992 1984 1024
990 2033 977
1051 1945 1004
985 1997 1018
1003 1976 1021
========
976 2048 976
1017 2006 977
1045 1932 1023
1015 1985 1000
1019 1971 1010
985 1999 1016
987 1986 1027
996 2015 989
994 2035 971
966 2023 1011
========
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: partition result into subgroups

Post by Daniel Shawul »

Hey,Gary. Thanks for your efforts. But the sample sizes should be the same and it is difficult to do that with your approach I think.
petero2
Posts: 680
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: partition result into subgroups

Post by petero2 »

Daniel Shawul wrote:
petero2 wrote:
Daniel Shawul wrote:I seem to have a mental block with permutation stuff and can't figure out what seems to be easy problem.. Say I have N games with WDL results so that N = W + D + L. How to partition this into k sub-groups of N1(W1,D1,L1) so that they add up to N(W,D,L). I thought of random shuffling of N indices and then picking up Nk = N / k elements for each group. But that requires the indices to be stored in an array. That could be very big if I have thousands/millions of games so I want to avoid that. How do you generate n random integers that add up to some value ?
Thanks
Maybe I missed something but I don't see why you can't store millions of indices in an array. If you mean billions of games I can understand your concern, but 4MB for 1 million games doesn't seem like a problem to me.
Actually I don't have large number of games in mind, so I was mentioning million/billions to limit people from coming up with array solutions. In practise W,D,L could be anything an integer or long long could hold. Anyway it turns out I and many others have been looking at it from the wrong perspective. What Bob suggested is simple and a clear winner.
I see, in that case using an array is unjustified. Your implementation of Bob's suggestion is a nice solution for moderate sizes of N, but for really large values it would be more efficient to utilize a function that can sample the hypergeometric distribution. See for example: http://svn.r-project.org/R/trunk/src/nmath/rhyper.c

Code: Select all

int main&#40;int argc, char* argv&#91;&#93;) &#123;
    int nBins;
    str2Num&#40;argv&#91;1&#93;, nBins&#41;;

    std&#58;&#58;vector<int> rk;
    int total = 0;
    int nPerBin = 0;
    for &#40;int i = 2; i < argc; i++) &#123;
        int tmp;
        str2Num&#40;argv&#91;i&#93;, tmp&#41;;
        rk.push_back&#40;tmp * nBins&#41;;
        nPerBin += tmp;
        total += tmp * nBins;
    &#125;
    const int T = rk.size&#40;);

    for &#40;int b = 0; b < nBins; b++) &#123;
        int nSelect = nPerBin;
        int nSelectFrom = total;
        for &#40;int t = 0; t < T; t++) &#123;
            assert&#40;rk&#91;t&#93; <= nSelectFrom&#41;;
            int tmp = hyper&#40;rk&#91;t&#93;, nSelectFrom-rk&#91;t&#93;, nSelect&#41;;
            cout << setw&#40;10&#41; << tmp;
            nSelect -= tmp;
            nSelectFrom -= rk&#91;t&#93;;
            
            rk&#91;t&#93; -= tmp;
            total -= tmp;
        &#125;
        cout << endl;
    &#125;
&#125;
Some sample runs:

Code: Select all

$ time ./test 10 1000 2000 2000
      1034      1992      1974
       979      2051      1970
      1021      2025      1954
       966      2041      1993
       998      1937      2065
       972      1996      2032
       977      1979      2044
      1053      2001      1946
       963      1997      2040
      1037      1981      1982

real        0m0.003s
user        0m0.001s
sys         0m0.002s
$ time ./test 10 10000000 20000000 20000000
   9993566  20001871  20004563
   9996331  20001708  20001961
   9997688  19995470  20006842
  10001086  20000713  19998201
   9998881  20005422  19995697
  10004310  19999344  19996346
  10001546  19996712  20001742
   9999040  19996948  20004012
  10003735  20002569  19993696
  10003817  19999243  19996940

real        0m0.003s
user        0m0.000s
sys         0m0.002s
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: partition result into subgroups

Post by Daniel Shawul »

Thanks Peter. Indeed that is what it is , a multivariate version of hyper geometric distribution. Do you have the 'hyper' function that you used. I am trying to keep a list of utility functions for combinatorics so that I don't have to rewrite them again.
Edit: I see it is probably the rhyper function from the svn link you gave. In that case I am afraid I can't use it because I have 6 variables. For home WDL and away WDL. One more thing I needed to take care of was that when N is not divisible by k, the last sub-group should take what ever is left over.