An Annotated Dataset For Near-Duplicate Detection In Personal Photo Collections

Managing photo collections involves a variety of image quality assessment tasks, e.g., the selection of the “best” photos. Detecting near-duplicates is a prerequisite for automating these tasks. We created the California-ND dataset to assist researchers in testing algorithms for the detection of near-duplicate images.

Contrary to other existing datasets in this domain, California-ND contains 701 photos taken directly from a real user’s personal photo collection. As a result, it includes many challenging non-identical near-duplicate cases without the use of artificial image transformations. The original image sequence was maintained as much as possible.

More importantly, in order to deal with the inevitable subjectivity and ambiguity that near-duplicate cases exhibit, the dataset is annotated by 10 different subjects, including the photographer himself. These annotations can be combined into a non-binary ground truth, representing the probability that a pair of images is considered a near-duplicate.
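As a rough illustration of how binary per-subject annotations could be combined into such a non-binary ground truth, the sketch below takes the fraction of subjects who marked a pair as a near-duplicate. The in-memory representation (one set of image-index pairs per subject), the function name, and the example data are all assumptions made for illustration; they do not reflect the file format of the released annotations, and the paper may define the combination differently.

```python
from collections import Counter


def near_duplicate_probabilities(subject_annotations):
    """Estimate, per image pair, the fraction of subjects who marked it.

    subject_annotations: list with one entry per subject; each entry is a
    collection of (i, j) image-index pairs that the subject considered
    near-duplicates. This layout is hypothetical, not the dataset's format.
    """
    if not subject_annotations:
        raise ValueError("at least one subject is required")

    votes = Counter()
    for pairs in subject_annotations:
        # Treat (i, j) and (j, i) as the same pair, and count each pair
        # at most once per subject.
        normalized = {tuple(sorted(pair)) for pair in pairs}
        votes.update(normalized)

    num_subjects = len(subject_annotations)
    # Pairs that no subject marked are omitted; their probability is 0.
    return {pair: count / num_subjects for pair, count in votes.items()}


# Made-up annotations from three hypothetical subjects.
example = [
    {(0, 1), (2, 3)},
    {(0, 1)},
    {(1, 0), (1, 2)},
]
probs = near_duplicate_probabilities(example)
```

In this toy example all three subjects agree on the pair (0, 1), so it receives a probability of 1.0, while the pairs (2, 3) and (1, 2) were each marked by a single subject and receive 1/3. With the 10 subjects of California-ND, the same scheme would yield values in steps of 0.1.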

Sample photos

The dataset is released under a Creative Commons license and can be downloaded here: (80 MB). The ZIP file is encrypted; please email us for the password.

More details about the dataset can be found in the following paper:

A. Jinda-Apiraksa, V. Vonikakis, S. Winkler.
California-ND: An annotated dataset for near-duplicate detection in personal photo collections.
Proc. 5th International Workshop on Quality of Multimedia Experience (QoMEX), Klagenfurt, Austria, July 3-5, 2013.

Please cite the above paper if you use the California-ND dataset.