You Said That?: Synthesising Talking Faces from Audio

Springer Science and Business Media LLC - Volume 127 - Pages 1767-1779 - 2019
Amir Jamaludin1, Joon Son Chung1, Andrew Zisserman1
1Department of Engineering Science, University of Oxford, Oxford, UK

Abstract

We describe a method for generating a video of a talking face. The method takes still images of the target face and an audio speech segment as inputs, and generates a video of the target face lip-synched with the audio. The method runs in real time and is applicable to faces and audio not seen at training time. To achieve this, we develop an encoder–decoder convolutional neural network (CNN) model that uses a joint embedding of the face and audio to generate synthesised talking-face video frames. The model is trained on unlabelled videos using cross-modal self-supervision. We also propose methods to re-dub videos by visually blending the generated face into the source video frame using a multi-stream CNN model.
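
To make the described architecture concrete, below is a minimal PyTorch sketch of such an encoder–decoder generator: an identity encoder for the still face image and an audio encoder for a short speech window produce a joint embedding, which a transposed-convolution decoder maps to one synthesised face frame. All layer sizes, module names, and the MFCC input shape are illustrative assumptions, not the exact architecture reported in the paper.

    # Minimal sketch of an encoder-decoder talking-face generator (PyTorch).
    # Layer sizes, names, and the 12x35 MFCC window are illustrative
    # assumptions, not the authors' exact architecture.
    import torch
    import torch.nn as nn

    class TalkingFaceGenerator(nn.Module):
        def __init__(self, embed_dim=256):
            super().__init__()
            # Identity encoder: still image of the target face -> identity embedding.
            self.face_enc = nn.Sequential(
                nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(256, embed_dim),
            )
            # Audio encoder: short MFCC window -> speech-content embedding.
            self.audio_enc = nn.Sequential(
                nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
                nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(128, embed_dim),
            )
            # Decoder: concatenated joint embedding -> one 64x64 face frame.
            self.decoder = nn.Sequential(
                nn.Linear(2 * embed_dim, 256 * 4 * 4), nn.ReLU(),
                nn.Unflatten(1, (256, 4, 4)),
                nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Tanh(),
            )

        def forward(self, face_img, audio_mfcc):
            # Joint embedding: identity features concatenated with audio features.
            z = torch.cat([self.face_enc(face_img), self.audio_enc(audio_mfcc)], dim=1)
            return self.decoder(z)

    gen = TalkingFaceGenerator()
    frame = gen(torch.randn(1, 3, 64, 64), torch.randn(1, 1, 12, 35))
    print(frame.shape)  # torch.Size([1, 3, 64, 64])

A full video would be produced by sliding the audio window one frame at a time; the paper's re-dubbing step, which blends each generated face back into the source frame with a multi-stream CNN, is omitted from this sketch.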

Keywords

#talking faces #video synthesis #audio #convolutional neural networks #self-supervision #video re-dubbing
