A Dataset With Over 100,000 Face Images of 530 People

Large face datasets are important for advancing face recognition research, but they are tedious to build because cleaning the huge amount of raw data takes a lot of work. To facilitate this task, we developed an approach to building face datasets that detects faces in images returned by Internet searches for public figures and then automatically discards those that do not belong to the queried person.

The FaceScrub dataset was created using this approach, followed by manually checking and cleaning the results. It comprises a total of 106,863 face images* of 530 male and female celebrities, with about 200 images per person. As such, it is one of the largest public face databases. The dataset was also used as part of the MegaFace face recognition challenges.

The images were retrieved from the Internet and were taken in real-world situations (uncontrolled conditions). Name and gender annotations for the faces are included.

            Male      Female    Total
# people    265       265       530
# images    55,306    51,557    106,863
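
As a quick consistency check, the figures quoted on this page can be cross-checked with a few lines of Python. This is only an illustrative sketch: the variable names are made up here, and the numbers are copied from the statistics and the footnote on this page.

```python
# Figures quoted on this page (variable names are illustrative only).
people_per_gender = (265, 265)
images_per_gender = (55_306, 51_557)
total_people = 530
total_images = 106_863
original_release_images = 107_818  # image count before the March 2016 cleanup

# The per-gender counts should add up to the stated totals.
assert sum(people_per_gender) == total_people
assert sum(images_per_gender) == total_images

# Average number of images per person ("about 200" in the text).
images_per_person = total_images / total_people

# Number of face images dropped between the original and current release.
removed_in_cleanup = original_release_images - total_images

print(f"average images per person: {images_per_person:.1f}")
print(f"images removed in cleanup: {removed_in_cleanup}")
```

The average works out to roughly 201.6 images per person, and the cleanup removed 955 face images.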

Sample images

The dataset is released under a Creative Commons license. Note that we can only provide the URLs to the images (plus annotations), as we do not own the content (more details in the readme file).
Please fill out this form to receive the access instructions.
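
Because the release consists of image URLs plus annotations rather than image files, a typical first step is to read the listing and group the URLs by person. The sketch below is a minimal illustration under stated assumptions: the tab-separated layout, the column names (name, gender, url), the sample rows, and the function name group_urls_by_person are all hypothetical. The actual file format is described in the readme file and may differ.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in an ASSUMED tab-separated layout (name, gender, url).
# The real listing may use different columns; consult the readme file.
SAMPLE = (
    "name\tgender\turl\n"
    "Person A\tmale\thttp://example.com/a1.jpg\n"
    "Person A\tmale\thttp://example.com/a2.jpg\n"
    "Person B\tfemale\thttp://example.com/b1.jpg\n"
)


def group_urls_by_person(text):
    """Return {name: [url, ...]} from a tab-separated annotation listing."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    urls = defaultdict(list)
    for row in reader:
        urls[row["name"]].append(row["url"])
    return dict(urls)


urls_by_person = group_urls_by_person(SAMPLE)
for name, urls in urls_by_person.items():
    print(name, len(urls))
```

Since the images are hosted by third parties, some URLs may no longer resolve, so any download step built on top of such a listing should tolerate failures.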

More details about the automated cleaning process can be found in the following paper:

H.-W. Ng, S. Winkler.
A data-driven approach to cleaning large face datasets.
Proc. IEEE International Conference on Image Processing (ICIP), Paris, France, Oct. 27-30, 2014.

Please cite the above paper if you use the FaceScrub dataset.

* The original release of FaceScrub contained slightly more face images (107,818) but included a number of duplicates as well as a few mislabeled and morphed faces. These have been removed from the database with effect from March 2016.