This dataset is used in our MM'15 paper "Deep Multimodal Speaker Naming".
Project page: http://herohuyongtao.github.io/publications/speaker-naming/.

Comparison with existing datasets

Dataset summary

This is a multimodal dataset containing face images and the corresponding speaking audio clips, extracted from over 4 hours of video across 11 episodes of two popular TV series, namely "Friends" and "The Big Bang Theory". The 11 selected episodes are S01E03 (Season 01, Episode 03), S04E04, S05E05, S07E07, and S10E15 from "Friends", and S01E01, S01E02, S01E03, S01E04, S01E05, and S01E06 from "The Big Bang Theory". We believe this dataset can foster more sophisticated and robust data-driven methods for problems involving multimodal information, such as speaker naming, as well as other applications.

In total, the dataset contains 303,529 face images for the 11 leading actors/actresses (6 for "Friends", i.e. Rachel, Monica, Phoebe, Joey, Chandler and Ross, and 5 for "The Big Bang Theory", i.e. Sheldon, Leonard, Howard, Raj and Penny). There are over 27K face images per subject on average. All speaking audio clips are merged into one clip per character per episode. All images are in JPG format and all audio is in WAV format.

Face samples in the dataset

The data is organized by episode: each top-level folder is named after its episode and contains two folders, "face-images" and "speaking-audio", holding the face images and the speaking audio for each character, respectively.
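
As a rough illustration of how this layout might be traversed, here is a minimal Python sketch. It relies only on what is stated above (one folder per episode, each with "face-images" and "speaking-audio" inside). How files are arranged per character inside those two folders is not specified here, so the sketch simply searches them recursively. The function name `index_dataset` and the returned dictionary shape are hypothetical, chosen for illustration only.

```python
from pathlib import Path


def index_dataset(root):
    """Map each episode folder name to the image and audio files found in it.

    Assumption: `root` is the folder holding one sub-folder per episode.
    The per-character arrangement inside "face-images" and "speaking-audio"
    is not assumed; both folders are searched recursively.
    """
    index = {}
    for episode_dir in sorted(Path(root).iterdir()):
        if not episode_dir.is_dir():
            continue  # skip stray files next to the episode folders

        faces, audio = [], []
        face_dir = episode_dir / "face-images"
        audio_dir = episode_dir / "speaking-audio"

        if face_dir.is_dir():
            # Match JPG files regardless of extension case.
            faces = sorted(p for p in face_dir.rglob("*")
                           if p.is_file() and p.suffix.lower() in {".jpg", ".jpeg"})
        if audio_dir.is_dir():
            # Match WAV files regardless of extension case.
            audio = sorted(p for p in audio_dir.rglob("*")
                           if p.is_file() and p.suffix.lower() == ".wav")

        index[episode_dir.name] = {"faces": faces, "audio": audio}
    return index
```

Calling `index_dataset` on the unpacked dataset folder and summing the lengths of the "faces" lists gives a quick check that the copy is complete, since the total should match the image count stated above.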


Download

The total size of the dataset is 1.41 GB. To request a copy, please send an email to Yongtao Hu (herohuyongtao@gmail.com) with the subject line "Request for Multimodel Face+Audio Dataset".


Terms of use

The dataset is provided for research purposes only. Any commercial use is prohibited. When using the dataset in your research work, please cite the following paper:

"Deep Multimodal Speaker Naming."
Yongtao Hu, Jimmy SJ. Ren, Jingwen Dai, Chang Yuan, Li Xu, and Wenping Wang.
ACMMM 2015.

@inproceedings{hu2015deep,
  title={{Deep Multimodal Speaker Naming}},
  author={Hu, Yongtao and Ren, Jimmy SJ. and Dai, Jingwen and Yuan, Chang and Xu, Li and Wang, Wenping},
  booktitle={Proceedings of the 23rd Annual ACM International Conference on Multimedia},
  pages={1107--1110},
  year={2015},
  organization={ACM}
}


Contact

If you have any questions or suggestions about this dataset, please email Yongtao Hu (herohuyongtao@gmail.com).