Paper, Tags: #humor-detection, #datasets
Humor is produced in a multimodal manner through text, visual and acoustic cues. Multimodal research is a growing trend in NLP that models natural language as it occurs in face-to-face communication.
This paper presents UR-FUNNY, a diverse multimodal dataset and the first multimodal language dataset for humor detection. UR-FUNNY focuses on punchline detection.
Humor styles include gradually building up to a punchline using text, audio, video or any combination of them; giving the story a sudden twist with an unexpected punchline; or creating a discrepancy between modalities. Modeling humor is difficult because of:
- Idiosyncrasy: humorous people are often also the most creative ones, which adds complexity to how humor is expressed across modalities.
- Contextual dependencies: humor often develops over time, with a gradual build-up in the story.
Understanding the unique dependencies across modalities and their impact on humor requires knowledge of multimodal language. In this paper, apart from text, we include gestures such as smiles and vocal properties such as loudness.
A suitable dataset should be diverse in speakers and topics. TED talks, for example, cover diverse topics and speakers, and their transcripts mark up both the speaker's speech and audience reactions such as laughter.
We collect 1866 videos plus transcriptions from TED, covering 1741 speakers and 417 topics. The laughter markup is used to identify 8257 humorous punchlines in the transcripts, and 8257 negative samples are drawn from random intervals. The statistics for the dataset can be found in paper section 3.2.
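The extraction step can be pictured with a small sketch. This is not the authors' actual pipeline: it assumes a hypothetical transcript format in which sentences appear in order and a laughter marker such as "(Laughter)" follows the sentence that triggered it.

```python
import random

LAUGH_MARKER = "(Laughter)"  # hypothetical marker format, assumed for illustration

def extract_samples(sentences, context_size=5):
    """Split a transcript into positive (punchline) and negative samples.

    A sentence immediately followed by a laughter marker is treated as a
    punchline; the preceding `context_size` sentences form its context.
    Negative samples are drawn at random from sentences not followed by laughter.
    """
    positives, candidates = [], []
    for i, sent in enumerate(sentences[:-1]):
        context = sentences[max(0, i - context_size):i]
        if sentences[i + 1].strip() == LAUGH_MARKER:
            positives.append({"punchline": sent, "context": context, "label": 1})
        elif sent.strip() != LAUGH_MARKER:
            candidates.append({"punchline": sent, "context": context, "label": 0})
    # Balance the dataset: draw as many negatives as there are positives.
    negatives = random.sample(candidates, min(len(positives), len(candidates)))
    return positives + negatives
```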
For each modality, we extract the following features:
- Language: GloVe embeddings are used as pre-trained word vectors for the text features; text and audio are aligned at the phoneme level, and acoustic and visual cues are aligned at the word level by interpolation (see the alignment sketch after this list).
- Acoustic: the COVAREP software is used to extract 81 acoustic features.
- Visual: the OpenFace facial behavior analysis tool is used to extract facial features.
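The word-level alignment of the frame-based acoustic and visual features can be sketched as follows. This is a simplified illustration, not the authors' code: it assumes frame-level features come with timestamps and that each word's start/end times are known from forced alignment; features are then averaged (a simple form of interpolation) over each word's span.

```python
import numpy as np

def align_to_words(frame_times, frame_feats, word_spans):
    """Average frame-level features over each word's time span.

    frame_times: (T,) array of frame timestamps in seconds.
    frame_feats: (T, D) array of per-frame features (e.g., COVAREP or OpenFace).
    word_spans:  list of (start, end) times per word from forced alignment.
    Returns a (num_words, D) array of word-level features.
    """
    aligned = []
    for start, end in word_spans:
        mask = (frame_times >= start) & (frame_times < end)
        if mask.any():
            aligned.append(frame_feats[mask].mean(axis=0))
        else:
            # No frame falls inside a very short word: fall back to the nearest frame.
            nearest = np.argmin(np.abs(frame_times - (start + end) / 2))
            aligned.append(frame_feats[nearest])
    return np.stack(aligned)
```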
We study the UR-FUNNY dataset through the lens of a contextualized extension of the Memory Fusion Network (MFN), a state-of-the-art model for multimodal language. More in paper section 4.
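The idea behind the contextualized baseline can be sketched in a much simplified form: per-modality sequence encoders process the context and the punchline separately, and their summaries are fused for a binary humor/no-humor decision. The PyTorch sketch below is a toy stand-in, not the paper's C-MFN (it omits the memory fusion and attention components of the real model); the GRU encoders and the visual feature dimension are illustrative assumptions, while the text (300-d GloVe) and acoustic (81-d COVAREP) dimensions follow the notes above.

```python
import torch
import torch.nn as nn

class SimplifiedContextualFusion(nn.Module):
    """Toy stand-in for C-MFN: encode context and punchline per modality, then fuse."""

    def __init__(self, dims=None, hidden=64):
        super().__init__()
        # Feature dimensions per modality; the visual size is an assumption.
        dims = dims or {"text": 300, "acoustic": 81, "visual": 75}
        # One GRU per modality, shared between the context and punchline sequences.
        self.encoders = nn.ModuleDict(
            {m: nn.GRU(d, hidden, batch_first=True) for m, d in dims.items()}
        )
        # Concatenate context and punchline summaries of all three modalities.
        self.classifier = nn.Sequential(
            nn.Linear(2 * 3 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 2)
        )

    def forward(self, context, punchline):
        # context / punchline: dicts mapping modality name -> (batch, seq_len, dim)
        summaries = []
        for m, enc in self.encoders.items():
            _, h_ctx = enc(context[m])    # final hidden state of the context sequence
            _, h_pun = enc(punchline[m])  # final hidden state of the punchline sequence
            summaries += [h_ctx[-1], h_pun[-1]]
        return self.classifier(torch.cat(summaries, dim=-1))
```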
Our goal is to establish a performance baseline for the UR-FUNNY dataset: to understand the roles of context and punchline, and of the individual modalities. More in paper section 5.
Models that use all modalities outperform models that use only one or two. Text proves to be the most important modality. Human performance on UR-FUNNY is 82.5%, while the best model achieves 64.4%, a reasonable baseline with a clear gap still remaining.