EDGE
ABSTRACT
Sarcasm explanation in dialogue is a new yet challenging task that aims to generate a natural language explanation for a given sarcastic dialogue comprising utterance, video, and audio. Although the existing pioneering work has achieved promising performance, it overlooks the sentiment conveyed in the dialogue, which is a vital clue for sarcasm explanation. However, incorporating sentiment into the dialogue context for sarcasm explanation generation is non-trivial due to three main challenges: (1) utterance-specific sentiment inference; (2) consistency-guided vision-audio sentiment inference; and (3) modeling the relations among the utterance, the utterance-specific sentiment, and the vision-audio sentiment. To tackle these challenges, we propose a novel multi-sourcE sentiment-enhanceD Graph-based multimodal sarcasm Explanation framework, named EDGE, which consists of utterance-specific sentiment inference, consistency-guided vision-audio sentiment inference, multi-source sentiment-enhanced graph encoding, and sarcasm explanation generation. In particular, EDGE first infers the utterance-specific sentiment with our proposed heuristic text sentiment revision strategy. Meanwhile, EDGE extracts the vision-audio sentiment with JCA instead of directly inputting visual and acoustic features. Thereafter, EDGE introduces a multi-source sentiment-enhanced graph to comprehensively model the sarcastic semantic relations among the utterance, the utterance-specific sentiment, and the vision-audio sentiment, thereby facilitating sarcasm explanation generation. Extensive experiments on the publicly released WITS dataset verify the superiority of our model over cutting-edge methods.