I believe I have brought this topic up in the past, but I haven't seen anyone take it into account!
I have seen a lot of folks run tests to measure the overall strength of a chess engine. Some used a general opening book, and others used a somewhat tuned book or the Fritz, Shredder or Hiarcs opening books.
Here is a list of suggestions that I believe will make testing more interesting to watch and could lead to some improvement in a program's playing strength. Otherwise we keep testing the same programs over and over, Rybka ends up on top, and every engine tournament gives similar results. That alone can be boring to some of us, so it would be worth it if we can find even a slight improvement.
List for testing:
1) The use of 3-4-5-6 piece Nalimov tablebases (Okay, it is understandable that the 6-piece set is hard to get because of its large size, but tablebases will certainly improve endgame play and solve most endings. It would also be interesting to see whether a few engines take better advantage of them than others.) It may improve engines by maybe 3-5 Elo, if not slightly more!
2) Hand-tuned opening book (This is very necessary. Honestly, I believe people who use a general opening book for testing will not reach a definitive conclusion about a program's strength. Think about it this way: a human player in an over-the-board tournament has his own openings that differ from those of the other players. Someone needs to create or find openings that fit each program's style exactly!)
3) Program settings! (For programs like Rybka or Hiarcs it is also worth trying different settings. I am certain there are some hidden settings that are better than the defaults that come with the program. Someone can test this by running engine tournaments between different settings of the same program; after enough games an improvement should show up!)
4) Contempt (Now this is important, and I haven't seen folks share games or tournament results where they experimented with Contempt, although it's usually recommended to leave it at 0. I believe it has an impact: you can use a + value for a weaker program so that it favors draws, especially when it plays against stronger programs, and the opposite, a - value, for a stronger program to see if it can score more wins and avoid draws against weaker or medium-strength programs!)
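To make point 4 a bit more concrete, here is a rough Python sketch of how a contempt value can tip the choice between taking a draw and playing on. This is only an illustration and not how any particular engine implements it. The names draw_value and pick_move are made up for this post, and the sign convention is the one I used above (+ favors draws). Some engines use the opposite sign, so check your engine's documentation before testing.

```python
from typing import Dict, Optional


def draw_value(contempt_cp: int) -> int:
    """Score (in centipawns, from the engine's point of view) given to a drawn line.

    Assumed convention, following the post: a positive contempt value
    makes a draw look attractive, a negative value makes it look bad.
    """
    return contempt_cp


def pick_move(candidates: Dict[str, Optional[int]], contempt_cp: int) -> str:
    """Return the best-scoring candidate.

    candidates maps a move name to its evaluation in centipawns;
    None marks a line that leads to a forced draw (e.g. a repetition).
    """
    scored = {
        move: draw_value(contempt_cp) if score is None else score
        for move, score in candidates.items()
    }
    return max(scored, key=scored.get)


# The engine is slightly worse (-30 cp) if it plays on, or it can repeat moves.
options = {"repeat_position": None, "play_on": -30}

print(pick_move(options, contempt_cp=50))   # weaker engine, + value
print(pick_move(options, contempt_cp=-50))  # stronger engine, - value
```

With the + value the draw is scored at +50 and beats the -30 line, so the program repeats. With the - value the draw is scored at -50, so the program prefers to play on even though it stands slightly worse.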
I would like everyone's opinions. I believe that running all four kinds of tests will give more interesting results, and I will be trying it myself to see. As for hardware, that is not much of an issue: we can assume all programs are running on quad-core machines with 64-bit Windows. So this is a strong push to see the real strength of all the programs out there!
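One practical note on measuring small gains like the 3-5 Elo I mentioned for tablebases: you need a lot of games before a difference that small stands out from noise. Below is a minimal Python sketch, assuming the usual logistic Elo formula and a simple normal approximation for the 95% margin. The function names are my own, and real rating tools do this more carefully.

```python
import math


def elo_from_score(score: float) -> float:
    """Elo difference implied by an expected score, using the logistic model."""
    # Clamp so that a 0% or 100% score does not blow up the logarithm.
    score = min(max(score, 1e-6), 1 - 1e-6)
    return -400.0 * math.log10(1.0 / score - 1.0)


def match_elo(wins: int, draws: int, losses: int):
    """Return (elo, low, high): the Elo estimate and a rough 95% interval."""
    games = wins + draws + losses
    score = (wins + 0.5 * draws) / games
    # Per-game variance of the result (win = 1, draw = 0.5, loss = 0).
    variance = (
        wins * (1.0 - score) ** 2
        + draws * (0.5 - score) ** 2
        + losses * (0.0 - score) ** 2
    ) / games
    margin = 1.96 * math.sqrt(variance / games)
    return (
        elo_from_score(score),
        elo_from_score(score - margin),
        elo_from_score(score + margin),
    )


# Hypothetical 1000-game match between two settings of the same engine.
elo, low, high = match_elo(wins=350, draws=300, losses=350)
print(f"Elo {elo:+.1f}, 95% range {low:+.1f} to {high:+.1f}")
```

For this made-up even match the range comes out at roughly plus or minus 18 Elo, so 1000 games cannot confirm a 3-5 Elo gain. That is why any settings, book or contempt test needs several thousand games before we call something an improvement.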