Detecting database doubles

sje · Post by **sje** » Sat Sep 20, 2008 6:12 pm

It seems to me that the process of detecting multiple instances of the same game in a database can be largely automated, as can be other database improvement operations.

Duplicate move texts are easily detected using hashing techniques. Games that differ only in length with identical initial sequences can also be detected without too much cleverness.
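
A minimal sketch of both ideas, assuming the games arrive as PGN-style move text keyed by some game id. The helper names (`normalize_moves`, `find_duplicates`, `find_truncated`) and the choice of SHA-1 are illustrative assumptions, not taken from any existing tool:

```python
import hashlib
import re
from collections import defaultdict

RESULTS = {"1-0", "0-1", "1/2-1/2", "*"}


def normalize_moves(move_text: str) -> list[str]:
    """Reduce PGN-style move text to a bare list of SAN moves."""
    text = re.sub(r"\{[^}]*\}", " ", move_text)   # drop {comments}
    text = re.sub(r"\d+\.(\.\.)?", " ", text)     # drop move numbers like "12." or "12..."
    moves = []
    for token in text.split():
        if token in RESULTS or token.startswith("$"):
            continue                              # skip results and NAGs
        moves.append(token.rstrip("+#!?"))        # ignore check/annotation suffixes
    return moves


def game_key(moves: list[str]) -> str:
    """Hash the normalized move list so equal games get equal keys."""
    return hashlib.sha1(" ".join(moves).encode("utf-8")).hexdigest()


def find_duplicates(games: dict[str, str]) -> list[list[str]]:
    """Group game ids whose normalized move texts are identical."""
    buckets = defaultdict(list)
    for gid, text in games.items():
        buckets[game_key(normalize_moves(text))].append(gid)
    return [sorted(ids) for ids in buckets.values() if len(ids) > 1]


def find_truncated(games: dict[str, str]) -> list[tuple[str, str]]:
    """Pair each game with a longer game that starts with the same moves.

    After sorting, a move list that is a prefix of any other list is always
    directly followed by a list that extends it (or by an exact duplicate),
    so comparing neighbours is enough to find one witness per short game.
    """
    ordered = sorted((tuple(normalize_moves(t)), gid) for gid, t in games.items())
    pairs = []
    for (short, sid), (longer, lid) in zip(ordered, ordered[1:]):
        if short and len(short) < len(longer) and longer[:len(short)] == short:
            pairs.append((sid, lid))
    return pairs
```

Hashing only catches move texts that are identical after normalization; the sorted-neighbour pass is one simple way to catch the "same game, different length" case without comparing every pair.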

More advanced techniques can be used to resolve obvious variants of names of players, events, sites, etc.

Within move texts, obvious typographical errors can be flagged with a blunder detector for later manual inspection.

I have no problem with people wanting to market commercial chess database applications. In fact, I thought of doing one myself some years ago but an attempt by me would have been an obvious conflict with the standards process. But the unannotated game data itself must always be free.

I have heard reports about some who charge for chess data that maliciously false information is included to detect copying. I hope such reports are false, but the tactic has been employed in other fields. I would avoid giving a single penny to poisoners.

Rolf · Post by **Rolf** » Sat Sep 20, 2008 7:29 pm

sje wrote:It seems to me that the process of detecting multiple instances of the same game in a database can be largely automated, as can be other database improvement operations.

Duplicate move texts are easily detected using hashing techniques. Games that differ only in length with identical initial sequences can also be detected without too much cleverness.

More advanced techniques can be used to resolve obvious variants of names of players, events, sites, etc.

Within move texts, obvious typographical errors can be flagged with a blunder detector for later manual inspection.

I have no problem with people wanting to market commercial chess database applications. In fact, I thought of doing one myself some years ago but an attempt by me would have been an obvious conflict with the standards process. But the unannotated game data itself must always be free.

I have heard reports about some who charge for chess data that maliciously false information is included to detect copying. I hope such reports are false, but the tactic has been employed in other fields. I would avoid giving a single penny to poisoners.

LOL

But seriously, Steve, you're a funny guy. The input of so many moves of thousands of games isn't worth a penny to you. Interesting, really. Of course NOT.

And as far as this false information is concerned, you are quite right. But I haven't mentioned it because I expected Alexander to make his own painful experiences.

IMO the clean moves of a game must be protected as well, because it takes so much time to put them into machine-readable form, and the tournament organisers who do the job have also earned some compensation.

Interesting to hear from you, with a lifelong income from taxes I suppose, that you seem to feel authorized to declare that people less educated than professors shouldn't be able to run a business and then have their data protected against thieves.

The same inconsistency appears when veritable university professors with lifelong security for free preach in favor of freeware and open-source programs while all commercial guys are suspected as if they were scoundrels. I have had this bad impression over all the years I have been reading the internet groups. IMO this is hypocrisy to the power of five.

Marek Soszynski · Post by **Marek Soszynski** » Sat Sep 20, 2008 7:59 pm

sje wrote:It seems to me that the process of detecting multiple instances of the same game in a database can be largely automated, as can be other database improvement operations.

Duplicate move texts are easily detected using hashing techniques. Games that differ only in length with identical initial sequences can also be detected without too much cleverness.

Detected, yes.

sje wrote:More advanced techniques can be used to resolve obvious variants of names of players, events, sites, etc.

Resolved, no. Is Smythe an "obvious" variant of Smyth or a different player entirely? Likewise Gunther and Guenther.

sje wrote:Within move texts, obvious typographical errors can be flagged with a blunder detector for later manual inspection.

A blunder detector will detect... blunders (not typographical errors as such). Manual inspection will probably discover that the blunder is a... blunder. What has been achieved by that?

sje · Post by **sje** » Sat Sep 20, 2008 9:51 pm

Marek Soszynski wrote:
sje wrote:It seems to me that the process of detecting multiple instances of the same game in a database can be largely automated, as can be other database improvement operations.

Duplicate move texts are easily detected using hashing techniques. Games that differ only in length with identical initial sequences can also be detected without too much cleverness.
Detected, yes.

sje wrote:More advanced techniques can be used to resolve obvious variants of names of players, events, sites, etc.
Resolved, no. Is Smythe an "obvious" variant of Smyth or a different player entirely? Likewise Gunther and Guenther.

sje wrote:Within move texts, obvious typographical errors can be flagged with a blunder detector for later manual inspection.
A blunder detector will detect... blunders (not typographical errors as such). Manual inspection will probably discover that the blunder is a... blunder. What has been achieved by that?

First, a remarkably effective technique for detecting spelling variants:

http://en.wikipedia.org/wiki/Soundex
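
For illustration, here is a small sketch of the standard American Soundex rules, applied to the name pairs raised earlier in the thread. The function name `soundex` and the handling of non-ASCII letters (they are simply dropped) are assumptions of this sketch:

```python
# Consonant groups of American Soundex; vowels, H, W and Y get no digit.
_CODES = {}
for _digit, _letters in enumerate(("BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"), start=1):
    for _ch in _letters:
        _CODES[_ch] = str(_digit)


def soundex(name: str) -> str:
    """Return the 4-character Soundex code of a name ("" if it has no letters)."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]                      # the first letter is kept as-is
    prev = _CODES.get(letters[0], "")      # but its digit still suppresses repeats
    for ch in letters[1:]:
        if ch in "HW":
            continue                       # H and W do not separate equal codes
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit                       # a vowel resets prev, allowing a repeat
    return (code + "000")[:4]              # pad or truncate to four characters


for a, b in [("Smyth", "Smythe"), ("Gunther", "Guenther")]:
    print(a, soundex(a), b, soundex(b))
```

Equal codes only mark two spellings as candidates for the same player. Deciding whether Smyth and Smythe really are one person still needs other evidence, such as event, date or rating, or a manual check.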

Second, a blunder detector will be most effective on games where it's needed the most: those games played by high-level players. Obviously, it's not going to work all that well for the under-1200 Elo group. I'd guess that most typographical move text errors leave a piece hanging or produce some other improbable move that results in a condition lasting several moves.
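
The "condition lasting several moves" heuristic could be sketched roughly as follows. This assumes some engine (not shown) has already produced one evaluation per ply, in centipawns from White's point of view; the function name `flag_suspect_plies` and the thresholds `drop` and `persist` are hypothetical choices for illustration:

```python
def flag_suspect_plies(evals: list[int], drop: int = 300, persist: int = 3) -> list[int]:
    """Return plies after which the mover's evaluation collapses and stays collapsed.

    evals[0] is the evaluation of the starting position and evals[k] the
    evaluation after ply k, all in centipawns from White's point of view.
    Odd plies are White's moves, even plies are Black's.
    """
    suspects = []
    for ply in range(1, len(evals)):
        window = evals[ply:ply + persist]
        if len(window) < persist:
            break                                  # too close to the end to judge
        sign = 1 if ply % 2 == 1 else -1           # view the score from the mover's side
        before = sign * evals[ply - 1]
        # Flag only if the loss is large and is still there `persist` plies later.
        if all(before - sign * e >= drop for e in window):
            suspects.append(ply)
    return suspects


# White's second move (ply 3) drops about 4.6 pawns and the loss never recovers.
print(flag_suspect_plies([20, 25, 15, -450, -470, -460, -480, -475]))
```

A flagged ply is only a candidate for manual inspection: it may be a transcription error or a genuine over-the-board blunder.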

Rolf · Post by **Rolf** » Sat Sep 20, 2008 10:25 pm

Illegal positions are a much bigger problem in databases. The probability of this is higher in this OM because it's mostly about lower-rated games, since the higher-rated ones are sufficiently covered by CB. That is another hint that 2 million games MORE is unrealistic.

James Constance · Post by **James Constance** » Sun Sep 21, 2008 5:37 am

sje wrote:I have heard reports about some who charge for chess data that maliciously false information is included to detect copying.

Is there some sort of copyright on games in databases? I mean is it unacceptable legally (not morally) to copy 3 million games from Megabase for instance and sell them in your own database?

sje · Post by **sje** » Sun Sep 21, 2008 6:06 am

James Constance wrote:
sje wrote:I have heard reports about some who charge for chess data that maliciously false information is included to detect copying.
Is there some sort of copyright on games in databases? I mean is it unacceptable legally (not morally) to copy 3 million games from Megabase for instance and sell them in your own database?

How much do you think Megabase paid for the games? Zero, most likely.

In some jurisdictions, there is a concept of compilation copyright. But as I recall from various rulings, a requirement for a compilation copyright is that there has been a substantial value added in the compilation process. Just concatenating existing data isn't enough.

Dirt · Post by **Dirt** » Sun Sep 21, 2008 6:25 am

sje wrote:How much do you think Megabase paid for the games? Zero, most likely.

In some jurisdictions, there is a concept of compilation copyright. But as I recall from various rulings, a requirement for a compilation copyright is that there has been a substantial value added in the compilation process. Just concatenating existing data isn't enough.

If they have done more than just aggregating games, such as using a standard spelling for prominent players or removing uninteresting short games, I think they may have a basis for copyright. It could be an issue requiring a court decision before you can be confident of the answer.

sje · Post by **sje** » Sun Sep 21, 2008 7:15 am

Dirt wrote:
sje wrote:In some jurisdictions, there is a concept of compilation copyright. But as I recall from various rulings, a requirement for a compilation copyright is that there has been a substantial value added in the compilation process. Just concatenating existing data isn't enough.
If they have done more than just aggregating games, such as using a standard spelling for prominent players or removing uninteresting short games, I think they may have a basis for copyright. It could be an issue requiring a court decision before you can be confident of the answer.

I'd say that it takes a lot more than just fiddling with spelling to provide "substantial value". If the so-called added value can be produced by a not-too-complex computer program, then I doubt that any court would find the creative addition needed to establish a compilation copyright. I don't recall the details, but I think that this has already been handled in the case of printed telephone directories.

On the other hand, I'd think a court would look rather unkindly on a commercial provider that deliberately published false data without prior consumer notification.

Rolf · Post by **Rolf** » Sun Sep 21, 2008 12:43 pm

sje wrote:On the other hand, I'd think a court would look rather unkindly on a commercial provider that deliberately published false data without prior consumer notification.

It's always funny when laymen simply guess what judges, or any other group of experts, might do.

To correct a possibly false assumption about something I contributed to:

Nowhere was it officially said that, for example, CB had intentionally put false data into its databases. That is exactly the distinction a judge would have to see. So, who wants to prove that anything was done intentionally?? <g>
