ChatEval
ChatEval is a scientific framework for evaluating open-domain chatbots. Researchers can submit their trained models to effortlessly receive comparisons with baselines and prior work. Since all evaluation code is open source, we ensure evaluation is performed in a standardized and transparent way. Additionally, open-source baseline models and an ever-growing collection of public evaluation sets are available for public use.
Upload your system
FAQ
How much does ChatEval cost?
ChatEval is currently free for academic researchers. It is actively developed by the NLP Group of the University of Pennsylvania.
Is there an online demo video?
You can find a video tutorial for ChatEval here.
How was ChatEval built?
The ChatEval webapp is built using Django and React (front-end), and uses the Magnitude word embedding format for evaluation. Our source code is available on GitHub.
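As a rough illustration of the Magnitude format mentioned above, the snippet below shows how word vectors can be loaded and queried with the pymagnitude library; the embedding file name is a placeholder, not necessarily the file ChatEval itself uses.

# Minimal sketch: querying word embeddings stored in the Magnitude format.
# The .magnitude file name below is a placeholder, not ChatEval's actual file.
from pymagnitude import Magnitude

vectors = Magnitude("glove.840B.300d.magnitude")   # any pre-converted embedding file
print(vectors.dim)                                 # embedding dimensionality
print(vectors.query("chatbot")[:5])                # first few components of one word vector
print(vectors.similarity("chat", "talk"))          # cosine similarity between two words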
What should I cite?
@InProceedings{N19-4011,
  author    = "Sedoc, Jo{\~a}o and Ippolito, Daphne and Kirubarajan, Arun and Thirani, Jai and Ungar, Lyle and Callison-Burch, Chris",
  title     = "ChatEval: A Tool for Chatbot Evaluation",
  booktitle = "Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations)",
  year      = "2019",
  publisher = "Association for Computational Linguistics",
  pages     = "60--65",
  location  = "Minneapolis, Minnesota",
  url       = "http://aclweb.org/anthology/N19-4011"
}
About Evaluation
To evaluate a system, model responses are generated for an evaluation dataset of prompts and then uploaded to ChatEval. The responses are scored with a series of automatic evaluation metrics and compared against selected baselines and ground-truth responses (e.g., from humans).
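As a rough sketch of that workflow, the snippet below generates responses for a prompt file and writes them out for upload. The respond() function and the file layout (one prompt per line in, one response per line out) are assumptions for illustration; check ChatEval's actual upload format before submitting.

# Hedged sketch: produce a response file for an evaluation dataset of prompts.
# Assumes prompts.txt has one prompt per line; writes one response per line.
# ChatEval's required upload format may differ from this layout.

def respond(prompt: str) -> str:
    # Placeholder for your trained chatbot's inference call.
    return "I don't know."

with open("prompts.txt", encoding="utf-8") as f:
    prompts = [line.strip() for line in f if line.strip()]

with open("responses.txt", "w", encoding="utf-8") as out:
    for prompt in prompts:
        # Keep each response on a single line so it aligns with its prompt.
        out.write(respond(prompt).replace("\n", " ") + "\n")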
Evaluation Datasets
ChatEval offers evaluation datasets consisting of prompts that uploaded chatbots are to respond to. Evaluation datasets are available to download for free and have corresponding baseline models.
Neural Conversational Model
Supported ChatEval Dataset
In (Vinyals and Le 2015), human evaluation is conducted on a set of 200 hand-picked prompts. We make this dataset available for comparison.
Download Dataset
Dialogue Breakdown Detection Challenge
Supported ChatEval Dataset
The DBDC dataset consists of a series of text-based conversations between a human and a chatbot where the human was aware they were chatting with a computer (Higashinaka et al. 2016).
Download Dataset
Open Subtitles
Supported ChatEval Dataset
This evaluation dataset contains a random subset of 200 prompts from the English OpenSubtitles 2009 dataset (Tiedemann 2009).
Download Dataset
Random Twitter
Supported ChatEval Dataset
The random Twitter test set is a random subset of 200 prompts from the ParlAI Twitter-derived test set.
Download Dataset
Cornell Movie Dialogue Corpus
Supported ChatEval Dataset
The Cornell Movie Dialogue Corpus (Danescu-Niculescu-Mizil and Lee, 2011) contains accurate speaker annotations. We use 1000 prompts selected by Baheti et al., 2018.
Download Dataset
Dialogue Breakdown Detection Challenge multi-turn dataset
Supported ChatEval Dataset
This dataset is for the Next Utterance Recovery task, which is a shared task in the 2020 WOCHAT+DBDC. This dataset is derived from the Third Dialogue Breakdown Detection Challenge. Here we’ve taken the most difficult turns in the dataset and are using them to evaluate next utterance generation.
Download Dataset
ESL three turn snippets
Supported ChatEval Dataset
This is an English-as-a-second-language conversational learning dataset from ESL, Inc. (https://eslfast.com). The ChatEval team selected three-turn snippets from 200 conversations.
Download Dataset
Persona Chat - USR Evaluations
Supported ChatEval Dataset
This evaluation dataset consists of 300 model responses, where five different models each respond to 60 prompts. The prompts are sourced from PersonaChat (Zhang et al. 2018). Each model response was annotated by three humans on six different scoring categories (Mehri & Eskenazi 2020).
Download Dataset
Topical Chat - USR Evaluations
Supported ChatEval Dataset
This evaluation dataset consists of 360 model responses, where six different models each respond to 60 prompts. The prompts are sourced from TopicalChat (Gopalakrishnan et al. 2019). Each model response was annotated by three humans on six different scoring categories (Mehri & Eskenazi 2020).
Download Dataset
Daily Dialog - Gupta
Supported ChatEval Dataset
This evaluation dataset provides model responses and multiple references for the DailyDialog dataset (Li et al., 2017), open-sourced by Gupta et al., 2019.
Download Dataset
Daily Dialog - Zhao
Supported ChatEval Dataset
This evaluation dataset provides model responses and multiple references for the DailyDialog dataset (Li et al., 2017), open-sourced by Zhao et al., 2020.
Download Dataset
Daily Dialog - Huang (GRADE)
Supported ChatEval Dataset
This evaluation dataset provides model responses and human annotations for the DailyDialog dataset (Li et al., 2017), open-sourced by Huang et al., 2020.
Download Dataset
ConvAI2 - Huang (GRADE)
Supported ChatEval Dataset
This evaluation dataset provides model responses and human annotations for the ConvAI2 dataset (Dinan et al., 2019), provided by Huang et al., 2020.
Download Dataset
Empathetic - Huang (GRADE)
Supported ChatEval Dataset
This evaluation dataset provides model responses and human annotations for the EmpatheticDialogues dataset (Rashkin et al., 2019), provided by Huang et al., 2020.
Download Dataset
DSTC6
Supported ChatEval Dataset
This evaluation dataset provides model responses and human annotations to the DSTC6 dataset, provided by Hori et al.
Download Dataset
DSTC7
Supported ChatEval Dataset
This evaluation dataset consists of conversational data from Reddit, as well as contextual "facts" taken from the websites that started the (Reddit) conversation. The dataset is provided by Galley et al. (2019).
Download Dataset
Persona Chat - Zhao
Supported ChatEval Dataset
This evaluation dataset provides model responses and multiple references for the PersonaChat dataset (Zhang et al., 2018); the model responses and annotations were open-sourced by Zhao et al., 2020.
Download Dataset
ChatEval Baselines
ChatEval offers "ground-truth" baselines against which uploaded models can be compared. Baseline models range from human responders to established chatbot models. Baselines are handpicked and uploaded by the ChatEval team.
Cornell Movie DC Baseline
Actual responses from the Cornell Movie Dialogue Corpus to the prompts
View Model
Automated Evaluation Systems
The ChatEval Platform handles certain automated evaluations of chatbot responses. These metrics are documented here. Systems can be ranked according to a specific metric and viewed as a leaderboard.
Distinct 1
Metric
The number of unique unigrams in the model's responses divided by the total number of generated tokens.
View Source
Distinct 2
Metric
The number of unique bigrams in the model's responses divided by the total number of generated tokens.
View Source
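The distinct-n metrics above can be reproduced in a few lines of Python. The sketch below follows the definitions given here (unique n-grams across all responses divided by the total generated token count); whitespace tokenization is an assumption, and ChatEval's exact implementation may differ.

# Hedged sketch of distinct-1 / distinct-2 as described above.
# Whitespace tokenization is an assumption; ChatEval may tokenize differently.
from typing import List

def distinct_n(responses: List[str], n: int) -> float:
    ngrams = set()
    total_tokens = 0
    for response in responses:
        tokens = response.split()
        total_tokens += len(tokens)
        ngrams.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    # Unique n-grams divided by the total number of generated tokens.
    return len(ngrams) / total_tokens if total_tokens else 0.0

responses = ["i am fine thanks", "i am not sure", "fine thanks"]
print(distinct_n(responses, 1))  # distinct-1
print(distinct_n(responses, 2))  # distinct-2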
Embedding Greedy Match Score
Metric
Greedy matching between word embeddings of target utterance and model utterance (Rus et al., 2012).
View Source
Embedding Extrema Score
Metric
Cosine similarity between the vector extrema of the model response's word embeddings and those of the target utterance (Forgues et al., 2014).
View Source
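A hedged numpy sketch of the two embedding-based scores above is shown below. The embed() lookup is a toy stand-in for real word vectors (ChatEval serves its embeddings via Magnitude), and the averaging and normalization details may differ from ChatEval's implementation.

# Hedged sketch of greedy matching (Rus et al., 2012) and vector extrema
# (Forgues et al., 2014). `embed` maps a token to a vector; here it is a toy
# random lookup standing in for real embeddings such as those served by Magnitude.
import numpy as np

rng = np.random.default_rng(0)
_toy_vocab = {}

def embed(token: str) -> np.ndarray:
    return _toy_vocab.setdefault(token, rng.normal(size=300))

def _cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def greedy_match(reference: str, response: str) -> float:
    ref = [embed(t) for t in reference.split()]
    hyp = [embed(t) for t in response.split()]
    # For each token, take its best cosine match on the other side; average both directions.
    def one_way(src, tgt):
        return np.mean([max(_cosine(s, t) for t in tgt) for s in src])
    return float((one_way(ref, hyp) + one_way(hyp, ref)) / 2)

def extrema_score(reference: str, response: str) -> float:
    def extrema(tokens):
        m = np.stack([embed(t) for t in tokens])
        # Per dimension, keep the value with the largest magnitude.
        return np.where(m.max(0) > np.abs(m.min(0)), m.max(0), m.min(0))
    return _cosine(extrema(reference.split()), extrema(response.split()))

print(greedy_match("how are you", "i am fine"))
print(extrema_score("how are you", "i am fine"))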
BLEU Score
Metric
The Bilingual Evaluation Understudy (BLEU) score is a metric for evaluating a generated sentence against one or more reference sentences (Papineni et al., 2002).
View Source
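For reference, sentence-level BLEU can be computed with NLTK as in the sketch below; the tokenization and smoothing choices are illustrative and may not match ChatEval's own BLEU setup.

# Hedged sketch: sentence-level BLEU with NLTK (Papineni et al., 2002).
# Tokenization and smoothing here are illustrative, not ChatEval's exact setup.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "i am doing well thank you".split()
hypothesis = "i am doing fine thanks".split()

score = sentence_bleu(
    [reference],                      # list of reference token lists
    hypothesis,                       # hypothesis token list
    smoothing_function=SmoothingFunction().method1,
)
print(round(score, 4))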
References
Higashinaka, Ryuichiro, Kotaro Funakoshi, Yuka Kobayashi, and Michimasa Inaba. "The dialogue breakdown detection challenge: Task description, datasets, and evaluation metrics." In LREC. 2016.
Liu, Chia-Wei, Ryan Lowe, Iulian Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau. "How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation." In EMNLP, pp. 2122–2132. Association for Computational Linguistics, 2016.
Forgues, Gabriel, Joelle Pineau, Jean-Marie Larchevêque, and Réal Tremblay. "Bootstrapping dialog systems with word embeddings." In NIPS, modern machine learning and natural language processing workshop, vol. 2. 2014.
Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. "BLEU: a method for automatic evaluation of machine translation." In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311-318. Association for Computational Linguistics, 2002.
Rus, Vasile, and Mihai Lintean. "A comparison of greedy and optimal assessment of natural language student input using word-to-word similarity metrics." In Proceedings of the Seventh Workshop on Building Educational Applications Using NLP, pp. 157-162. Association for Computational Linguistics, 2012.
Tiedemann, Jörg. "News from OPUS - A collection of multilingual parallel corpora with tools and interfaces." In Recent advances in natural language processing, vol. 5, pp. 237-248. 2009.
Vinyals, Oriol, and Quoc Le. "A neural conversational model." arXiv preprint arXiv:1506.05869 (2015).