Google Scholar   ·   DBLP

2017

Close to the Action: Eye-Tracking Evaluation of Speaker-Following Subtitles
Kuno Kurzhals, Emine Cetinkaya, Yongtao Hu, Wenping Wang, and Daniel Weiskopf
The 35th ACM Conference on Human Factors in Computing Systems (CHI 2017)
"eye-tracking evaluation of speaker-following subtitles"
  • project page
  • paper
  • abstract
    The incorporation of subtitles in multimedia content plays an important role in communicating spoken content. For example, subtitles in the respective language are often preferred to expensive audio translation of foreign movies. The traditional representation of subtitles displays text centered at the bottom of the screen. This layout can lead to large distances between text and relevant image content, causing eye strain and even missed visual content. As a recent alternative, the technique of speaker-following subtitles places subtitle text in speech bubbles close to the current speaker. We conducted a controlled eye-tracking laboratory study (n = 40) to compare the regular approach (center-bottom subtitles) with content-sensitive, speaker-following subtitles. We compared different dialog-heavy video clips with the two layouts. Our results show that speaker-following subtitles lead to higher fixation counts on relevant image regions and reduce saccade length, which is an important factor for eye strain.
  • bibtex
    @inproceedings{kurzhals2017close,
      title={{Close to the Action: Eye-Tracking Evaluation of Speaker-Following Subtitles}},
      author={Kurzhals, Kuno and Cetinkaya, Emine and Hu, Yongtao and Wang, Wenping and Weiskopf, Daniel},
      booktitle={Proceedings of the 35th ACM Conference on Human Factors in Computing Systems},
      pages={6559--6568},
      year={2017},
      organization={ACM}
    }
  • DOI
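  • code sketch (illustrative)
    A toy illustration of the main eye-tracking metric mentioned in the abstract: mean saccade length, computed here as the Euclidean distance between consecutive fixation centers. The fixation data format is an assumption for illustration, not the study's recording pipeline.
      import numpy as np

      def mean_saccade_length(fixations):
          """fixations: (n, 2) array of (x, y) fixation centers in pixels."""
          pts = np.asarray(fixations, dtype=float)
          if len(pts) < 2:
              return 0.0
          # Each saccade is approximated by the jump between consecutive fixations.
          jumps = np.linalg.norm(np.diff(pts, axis=0), axis=1)
          return float(jumps.mean())

      # Bottom-centered subtitles force long jumps between the action and the text,
      # which shows up as a larger mean saccade length.
      print(mean_saccade_length([(960, 400), (960, 1020), (900, 380), (950, 1015)]))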

2016

Look, Listen and Learn - A Multimodal LSTM for Speaker Identification
Jimmy SJ. Ren, Yongtao Hu, Yu-Wing Tai, Chuan Wang, Li Xu, Wenxiu Sun, and Qiong Yan
The 30th AAAI Conference on Artificial Intelligence (AAAI 2016)
"speaker identification through multimodal weight sharing LSTM"
  • project page
  • paper
  • abstract
    Speaker identification refers to the task of localizing the face of a person who has the same identity as the ongoing voice in a video. This task requires not only collective perception over both visual and auditory signals, but also robustness to severe quality degradations and unconstrained content variations. In this paper, we describe a novel multimodal Long Short-Term Memory (LSTM) architecture which seamlessly unifies both visual and auditory modalities from the beginning of each sequence input. The key idea is to extend the conventional LSTM by not only sharing weights across time steps, but also sharing weights across modalities. We show that modeling the temporal dependency across face and voice can significantly improve the robustness to content quality degradations and variations. We also found that our multimodal LSTM is robust to distractors, namely the non-speaking identities. We applied our multimodal LSTM to The Big Bang Theory dataset and showed that our system outperforms the state-of-the-art systems in speaker identification with a lower false alarm rate and higher recognition accuracy.
  • source code & dataset
  • more applications
  • bibtex
    @inproceedings{ren2016look,
      title={{Look, Listen and Learn - A Multimodal LSTM for Speaker Identification}},
      author={Ren, Jimmy SJ. and Hu, Yongtao and Tai, Yu-Wing and Wang, Chuan and Xu, Li and Sun, Wenxiu and Yan, Qiong},
      booktitle={Proceedings of the 30th AAAI Conference on Artificial Intelligence},
      pages={3581--3587},
      year={2016}
    }
  • DOI
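  • code sketch (illustrative)
    A loose PyTorch sketch of the key idea in the abstract: a single set of LSTM weights is shared across modalities as well as across time steps. The feature dimensions, projection layers and late fusion step are assumptions for illustration, not the authors' architecture.
      import torch
      import torch.nn as nn

      class SharedWeightMultimodalLSTM(nn.Module):
          def __init__(self, face_dim=512, audio_dim=128, hidden=256, n_ids=6):
              super().__init__()
              # Per-modality projections into a common input space.
              self.face_proj = nn.Linear(face_dim, hidden)
              self.audio_proj = nn.Linear(audio_dim, hidden)
              # One LSTM cell whose weights are reused for both modalities.
              self.cell = nn.LSTMCell(hidden, hidden)
              self.classifier = nn.Linear(2 * hidden, n_ids)

          def run_stream(self, x):
              # x: (batch, time, hidden); run the shared cell over one modality.
              h = x.new_zeros(x.size(0), self.cell.hidden_size)
              c = torch.zeros_like(h)
              for t in range(x.size(1)):
                  h, c = self.cell(x[:, t], (h, c))
              return h

          def forward(self, face_seq, audio_seq):
              h_face = self.run_stream(self.face_proj(face_seq))
              h_audio = self.run_stream(self.audio_proj(audio_seq))
              return self.classifier(torch.cat([h_face, h_audio], dim=-1))

      # scores = SharedWeightMultimodalLSTM()(face_feats, audio_feats)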

2015

Deep Multimodal Speaker Naming
Yongtao Hu, Jimmy SJ. Ren, Jingwen Dai, Chang Yuan, Li Xu, and Wenping Wang
The 23rd Annual ACM International Conference on Multimedia (MM 2015)
"realtime speaker naming through CNN based deeply learned face-audio fusion"
  • project page
  • paper
  • poster
  • abstract
    Automatic speaker naming is the problem of localizing as well as identifying each speaking character in a TV/movie/live show video. This problem is challenging mainly due to its multimodal nature, namely that the face cue alone is insufficient to achieve good performance. Previous multimodal approaches to this problem usually process the data of different modalities individually and merge them using handcrafted heuristics. Such approaches work well for simple scenes, but fail to achieve high performance for speakers with large appearance variations. In this paper, we propose a novel convolutional neural network (CNN) based learning framework to automatically learn the fusion function of both face and audio cues. We show that without using face tracking, facial landmark localization or subtitle/transcript, our system with robust multimodal feature extraction is able to achieve state-of-the-art speaker naming performance evaluated on two diverse TV series. The dataset and implementation of our algorithm are publicly available online.
  • dataset
  • source code
  • bibtex
    @inproceedings{hu2015deep,
      title={{Deep Multimodal Speaker Naming}},
      author={Hu, Yongtao and Ren, Jimmy SJ. and Dai, Jingwen and Yuan, Chang and Xu, Li and Wang, Wenping},
      booktitle={Proceedings of the 23rd Annual ACM International Conference on Multimedia},
      pages={1107--1110},
      year={2015},
      organization={ACM}
    }
  • DOI
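  • code sketch (illustrative)
    A minimal PyTorch sketch of the fusion idea in the abstract: a small CNN embeds the face crop, an MLP embeds the audio features, and a joint classifier learns the fusion end-to-end instead of relying on handcrafted heuristics. Input shapes and layer sizes are assumptions for illustration, not the paper's network.
      import torch
      import torch.nn as nn

      class FaceAudioFusionNet(nn.Module):
          def __init__(self, n_speakers=8, audio_dim=375):
              super().__init__()
              self.face_cnn = nn.Sequential(          # face crop assumed to be (3, 64, 64)
                  nn.Conv2d(3, 32, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
                  nn.Conv2d(32, 64, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
                  nn.Flatten(), nn.Linear(64 * 16 * 16, 256), nn.ReLU(),
              )
              self.audio_mlp = nn.Sequential(         # flattened audio feature window
                  nn.Linear(audio_dim, 256), nn.ReLU(),
              )
              self.fusion = nn.Sequential(            # learned fusion + speaker classifier
                  nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, n_speakers),
              )

          def forward(self, face, audio):
              joint = torch.cat([self.face_cnn(face), self.audio_mlp(audio)], dim=1)
              return self.fusion(joint)

      # scores = FaceAudioFusionNet()(face_batch, audio_batch); argmax gives the speaker.
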
Content-Aware Video2Comics with Manga-Style Layout
Guangmei Jing, Yongtao Hu, Yanwen Guo, Yizhou Yu, and Wenping Wang
IEEE Transactions on Multimedia (TMM 2015)
"auto-convert conversational videos into comics with manga-style layout"
  • project page
  • paper
  • abstract
    We introduce in this paper a new approach that conveniently converts conversational videos into comics with manga-style layout. With our approach, the manga-style layout of a comic page is achieved in a content-driven manner, and the main components, including panels and word balloons, that constitute a visually pleasing comic page are intelligently organized. Our approach extracts key frames on speakers by using a speaker detection technique such that word balloons can be placed near the corresponding speakers. We qualitatively measure the information contained in a comic page. With the initial layout automatically determined, the final comic page is obtained by maximizing such a measure and optimizing the parameters relating to the optimal display of comics. An efficient Markov chain Monte Carlo sampling algorithm is designed for the optimization. Our user study demonstrates that users much prefer our manga-style comics to purely Western style comics. Extensive experiments and comparisons against previous work also verify the effectiveness of our approach.
  • bibtex
    @article{jing2015content,
      title={{Content-Aware Video2Comics with Manga-Style Layout}},
      author={Jing, Guangmei and Hu, Yongtao and Guo, Yanwen and Yu, Yizhou and Wang, Wenping},
      journal={IEEE Transactions on Multimedia (TMM)},
      volume={17},
      number={12},
      pages={2122--2133},
      year={2015},
      publisher={IEEE}
    }
  • DOI
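  • code sketch (illustrative)
    A toy Metropolis-style sampler showing the optimization strategy described in the abstract: propose small perturbations to the layout parameters and keep those that improve a page score, occasionally accepting worse ones to escape local optima. The score function and parameterization are placeholders, not the paper's information measure.
      import math
      import random

      def optimize_layout(init_params, score, iters=5000, step=0.05, temp=0.1):
          params = list(init_params)
          cur = score(params)
          best, best_score = list(params), cur
          for _ in range(iters):
              cand = [p + random.gauss(0.0, step) for p in params]   # local proposal
              cand_score = score(cand)
              # Always accept improvements; accept worse layouts with Boltzmann probability.
              if cand_score >= cur or random.random() < math.exp((cand_score - cur) / temp):
                  params, cur = cand, cand_score
                  if cur > best_score:
                      best, best_score = list(params), cur
          return best, best_score

      # Dummy score: prefer relative panel heights close to 0.3 of the page.
      best, s = optimize_layout([0.5, 0.5, 0.5], lambda p: -sum((x - 0.3) ** 2 for x in p))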

2014

Speaker-Following Video Subtitles
Yongtao Hu, Jan Kautz, Yizhou Yu, and Wenping Wang
ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM 2014)
"auto-generate speaker-following subtitles to enhance video viewing experience"
  • project page
  • arXiv (with color)
  • paper
  • abstract
    We propose a new method for improving the presentation of subtitles in video (e.g. TV and movies). With conventional subtitles, the viewer has to constantly look away from the main viewing area to read the subtitles at the bottom of the screen, which disrupts the viewing experience and causes unnecessary eyestrain. Our method places on-screen subtitles next to the respective speakers to allow the viewer to follow the visual content while simultaneously reading the subtitles. We use novel identification algorithms to detect the speakers based on audio and visual information. Then the placement of the subtitles is determined using global optimization. A comprehensive usability study indicated that our subtitle placement method outperformed both conventional fixed-position subtitling and another previous dynamic subtitling method in terms of enhancing the overall viewing experience and reducing eyestrain.
  • bibtex
    @article{hu2014speaker,
      title={{Speaker-Following Video Subtitles}},
      author={Hu, Yongtao and Kautz, Jan and Yu, Yizhou and Wang, Wenping},
      journal={ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM)},
      volume={11},
      number={2},
      pages={32:1--32:17},
      year={2014},
      publisher={ACM}
    }
  • DOI
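  • code sketch (illustrative)
    A greedy toy version of the placement step described in the abstract: each subtitle is assigned an on-screen slot that stays close to the detected speaker while penalizing large jumps from the previous subtitle's position. This is a simplified stand-in for the paper's global optimization; the cost terms and weights are assumptions.
      import math

      def place_subtitles(speaker_positions, candidates, w_dist=1.0, w_jump=0.5):
          """speaker_positions: (x, y) per subtitle; candidates: candidate (x, y) slots."""
          placements, prev = [], None
          for sx, sy in speaker_positions:
              def cost(c):
                  d_speaker = math.hypot(c[0] - sx, c[1] - sy)
                  d_jump = math.hypot(c[0] - prev[0], c[1] - prev[1]) if prev else 0.0
                  return w_dist * d_speaker + w_jump * d_jump
              best = min(candidates, key=cost)
              placements.append(best)
              prev = best
          return placements

      # Two speakers on a 1920x1080 frame, three candidate slots around the frame.
      print(place_subtitles([(300, 400), (1500, 420)], [(320, 300), (960, 1000), (1480, 320)]))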

Ph.D. Dissertation

Multimodal Speaker Localization and Identification for Video Processing
Yongtao Hu
The University of Hong Kong, December 2014
"a more-or-less summary of the works published at TOMM'14, TMM'15 and MM'15"
  • paper
  • abstract
    With the rapid growth of multimedia data, especially videos, the ability to understand them better and more efficiently is becoming increasingly important. In videos, speakers, which are normally what our eyes focus on, play a key role in understanding the content. With detailed information about the speakers, such as their positions and identities, many high-level video processing/analysis tasks, such as semantic indexing, retrieval and summarization, can be greatly facilitated. Recently, some multimedia content providers, such as Amazon/IMDb and Google Play, have been able to provide additional cast and character information for movies and TV series during playback, which can be achieved via a combination of face tracking, automatic identification and crowdsourcing. The main topics of this thesis include speaker localization, speaker identification, speech recognition, etc.
    
    This thesis first investigates the problem of speaker localization. A new algorithm for effectively detecting and localizing speakers based on multimodal visual and audio information is presented. We introduce four new features for speaker detection and localization, including lip motion, center contribution, length consistency and audio-visual synchrony, and combine them in a cascade model. Experiments on several movies and TV series indicate that, all together, they improve the speaker detection and localization accuracy by 7.5%--20.5%. Based on the locations of speakers, an efficient optimization algorithm for determining appropriate locations to place subtitles is proposed. This further enables us to develop an automatic end-to-end system for subtitle placement for TV series and movies.
    The second part of this thesis studies the speaker identification problem in videos. We propose a novel convolutional neural network (CNN) based learning framework to automatically learn the fusion function of both face and audio cues. A systematic multimodal dataset with face and audio samples collected from real-life videos is created. The high variation of the samples in the dataset, including pose, illumination, facial expression, accessory, occlusion, image quality, scene and aging, closely approximates realistic scenarios and allows us to fully explore the potential of our method in practical applications. Extensive experiments on our new multimodal dataset show that our method achieves state-of-the-art performance (over 90%) in the speaker naming task without using face/person tracking, facial landmark localization or subtitle/transcript, thus making it suitable for real-life applications.
    The speaker-oriented techniques presented in this thesis have many applications in video processing. Through extensive experimental results on multiple real-life videos, including TV series, movies and online video clips, we demonstrate how our multimodal speaker localization and speaker identification algorithms extend to video processing tasks. In particular, three main categories of applications are introduced: (1) combining our speaker-following video subtitles and speaker naming work to enhance the video viewing experience, where a comprehensive usability study with 219 users verifies that our subtitle placement method outperformed both conventional fixed-position subtitling and another previous dynamic subtitling method in terms of enhancing the overall viewing experience and reducing eyestrain; (2) automatically converting a video sequence into comics based on our speaker localization algorithms; and (3) extending our speaker naming work to handle real-life video summarization tasks.
  • bibtex
    @phdthesis{hu2014thesis,
      title={{Multimodal Speaker Localization and Identification for Video Processing}},
      author={Hu, Yongtao},
      year={2014},
      month={12},
      school={The University of Hong Kong}
    }
  • DOI
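  • code sketch (illustrative)
    A schematic cascade in the spirit of the localization chapter summarized above: candidate face tracks are filtered stage by stage, one stage per cue (lip motion, center contribution, length consistency, audio-visual synchrony). The scoring functions and thresholds are placeholders; only the cascade structure follows the text.
      def cascade_speaker_detection(candidates, stages):
          """candidates: list of face tracks; stages: list of (score_fn, threshold) pairs."""
          survivors = list(candidates)
          for score_fn, threshold in stages:
              survivors = [c for c in survivors if score_fn(c) >= threshold]
              if not survivors:
                  return None                    # no speaker detected for this segment
          # Among the remaining candidates, the strongest final-stage score wins.
          final_score = stages[-1][0]
          return max(survivors, key=final_score)

      # stages = [(lip_motion_score, 0.4), (center_contribution_score, 0.3),
      #           (length_consistency_score, 0.5), (av_sync_score, 0.6)]
      # speaker = cascade_speaker_detection(face_tracks, stages)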