MMD: Towards Building Large Scale Multimodal Domain-Aware Conversation Systems | MMD dataset

MMD: Towards Building Large Scale Multimodal Domain-Aware Conversation Systems

Abstract
While multimodal conversation agents are gaining importance in several domains such as retail, travel etc., deep learning research in this area has been limited primarily due to the lack of availability of large-scale, open chatlogs. To overcome this bottleneck, in this paper we introduce the task of multimodal, domain-aware conversations, and propose the MMD benchmark dataset. This dataset was gathered by working in close coordination with large number of domain experts in the retail domain. These experts suggested various conversations flows and dialog states which are typically seen in multimodal conversations in the fashion domain. Keeping these flows and states in mind, we created a dataset consisting of over 150K conversation sessions between shoppers and sales agents, with the help of in-house annotators using a semi-automated manually intense iterative process. With this dataset, we propose 5 new sub-tasks for multimodal conversations along with their evaluation methodology. We also propose two multimodal neural models in the encode-attend-decode paradigm and demonstrate their performance on two of the sub-tasks, namely text response generation and best image response selection. These experiments serve to establish baseline performance and open new research directions for each of these sub-tasks. Further, for each of the sub-tasks, we present a 'per-state evaluation' of 9 most significant dialog states, which would enable more focused research into understanding the challenges and complexities involved in each of these states.


CODE

Github Repository for the code repo link



PAPER

Please download the paper here paper link



AAAI 2018 SLIDES

Please download the slides here slides link



BIBTEX

@article{1704.00200,
Author = {Amrita Saha and Mitesh M. Khapra and Karthik Sankaranarayanan},
Title = {Towards Building Large Scale Multimodal Domain-Aware Conversation Systems},
Year = {2017},
Eprint = {arXiv:1704.00200},
}



DATASET

Please Click here to download the dataset.



EXAMPLE

Example Dialog Session



Multi-Fold Challenges of this Dataset

Type of Complexity Example Dialog State Example Utterance
Long-Term Dialog Context At the beginning of the dialog the user mentions his budget or size preference and after a few utterances, asks the agent to show something under his budget or size I like the 4th image. Show me something like it but in style as in this image within my budget.
Quantitative Inferencing (Counting) User points to the nth item displayed and asks a question about it Show me more images of the 3rd product in some different directions
Quantitative Inferencing (Sorting/Filtering) User wants sorting/filtering of a list based on a numerical field, e.g. price or product rating Show me something like it but in style as in this image within my budget.
Logical Inference User likes one fashion attribute of the nth image displayed but does not like another attribute of the same I am keen on seeing something similar to the 1st image but in a different sole material
Visual Inference System adds a visual description of the product alongside the images Viscata shoes are lightweight and made of natural jute, premium leather, suedes and woven cloth
Collective inference over multiple Images User?s question can have multiple aspects, drawn from multiple images displayed in the current or past context List more in the upper material of the 5th image and style as the 3rd and the 5th
Multimodal Inference User gives partial information in form of images and text in the context See the first espadrille. I wish to see more like it but in a silver colored type
Inference using domain knowledge and context Sometimes inferences for the user?s questions go beyond the dialog context to understanding the domain Will the 5th result go well with a large sized messenger bag?
Coreference or Ellipses Resolution Temporal continuity between successive questions from the user may cause some of them to be incomplete or to refer to items or aspects mentioned previously Show me the 3rd product in some different directions ... What about the product in the 5th image?