updated 01.21.20
Dataset Added Language Description Instances Format Task Created Creator Download
arXiv Bulk Data 01.15.20 English A collection of research papers on arXiv. n/a Tar Text Corpora 2011 n/a
A Conversational Question Answering Challenge (CoQA) 01.15.20 English Dataset for measuring the ability of machines to understand a text passage and answer a series of interconnected questions that appear in a conversation. 127,000+ JSON Question Answering, Reading Comprehension 2019 Reddy et al.
ABC Australia News Corpus 01.15.20 English Entire news corpus of ABC Australia from 2003 to 2017. 1,103,664 CSV Clustering, Events, Sentiment Analysis 2017 Kulkarni
AI2 Reasoning Challenge (ARC) 01.15.20 English Dataset contains 7,787 genuine grade-school level, multiple-choice science questions. 7,787 JSON, CSV Question Answering, Reading Comprehension 2018 Clark et al.
Amazon Fine Food Reviews 01.15.20 English Dataset consists of reviews of fine foods from Amazon. 568,454 CSV Classification, Sentiment Analysis 2013 McAuley et al.
Amazon Reviews 01.15.20 English US product reviews from Amazon. 233.1M JSON Classification, Sentiment Analysis 2018 McAuley et al.
ASU Twitter Dataset 01.15.20 English Twitter network data, not actual tweets. Shows connections between a large number of users. 11,316,811 users, 85,331,846 connections CSV Clustering, Graph Analysis 2009 Zafarani et al.
Automated Essay Scoring 01.15.20 English Dataset contains student-written essays with scores. n/a TSV, xlsx Scoring Classification 2017 The Hewlett Foundation
Automatic Keyphrase Extraction 01.15.20 English Multiple datasets for automatic keyphrase extraction. n/a Multiple Information Retrieval 1999-2008 Several
bAbI 20 Tasks 01.15.20 English, Hindi Dataset contains a set of contexts, with multiple question-answer pairs available based on the contexts. 2,000 Text Question Answering, Reading Comprehension 2015 Weston et al.
bAbI 6 Tasks Dialogue 01.15.20 English Dataset contains 6 tasks for testing end-to-end dialog systems in the restaurant domain. 3,000 Text Dialogue 2017 Bordes et al.
BlogFeedback Dataset 01.15.20 English Dataset to predict the number of comments a post will receive based on features of that post. 60,021 Text Regression 2014 Buza
Blogger Authorship Corpus 01.15.20 English Blog post entries from 19,320 people. 681,288 Text Classification, Sentiment Analysis 2006 Schler et al.
BoolQ 01.15.20 English Question answering dataset for yes/no questions. 15,942 JSON Binary Question Answering 2019 Clark et al.
Buzz in Social Media Dataset 01.15.20 English Data from Twitter and Tom's Hardware. This dataset focuses on specific buzz topics being discussed on those sites. 140,000 Text Classification 2013 Kawala et al.
Car Evaluation Dataset 01.15.20 English Car properties and their overall acceptability. 1,728 Text Classification 1997 Bohanec
Children’s Book Test (CBT) 01.15.20 English Dataset contains ‘questions’ from chapters in the book by enumerating 21 consecutive sentences. In each question, the first 20 sentences form the context, and a word is removed from the 21st sentence, which becomes the query. Models must identify the answer word among a selection of 10 candidate answers appearing in the context sentences and the query. ~688,000 Text Question Answering, Reading Comprehension 2016 Hill et al.
Choice of Plausible Alternatives 01.15.20 English Dataset used for open-domain commonsense causal reasoning. 1,000 Text Commonsense Reasoning 2011 Roemmele et al.
Clinical Case Reports for Machine Reading Comprehension (CliCR) 01.15.20 English Dataset was built from clinical case reports, requiring the reader to answer the query with a medical problem/test/treatment entity. 100,000 JSON Question Answering, Reading Comprehension 2018 Šuster et al.
ClueWeb Corpora 01.15.20 English Annotated web pages from the ClueWeb09 and ClueWeb12 corpora. 340,451,982 Text Classification 2013 Gabrilovich et al.
CNN / Daily Mail Dataset 01.15.20 English Cloze-style reading comprehension dataset created from CNN and Daily Mail news articles. 1M+ Question Question Answering, Reading Comprehension 2015 Hermann et al.
CommitmentBank 01.15.20 English Dataset contains naturally occurring discourses whose final sentence contains a clause-embedding predicate under an entailment-canceling operator (question, modal, negation, antecedent of conditional). 1,200 CSV Entailment, Inference 2019 de Marneffe et al.
CommonCrawl 01.15.20 Multi-Lingual Dataset contains data from 25 billion web pages. 25B WET Text Corpora 2013-2019 Common Crawl Foundation
Cornell Movie-Dialogs Corpus 01.15.20 English This corpus contains a large metadata-rich collection of fictional conversations extracted from raw movie scripts. It includes 220,579 conversational exchanges between 10,292 pairs of movie characters, involving 9,035 characters from 617 movies, for a total of 304,713 utterances. 304,713 Text Dialogue 2011 Danescu et al.
Corporate Messaging Corpus 01.15.20 English Dataset contains statements classified as information, dialog (replies to users, etc.), or action (messages that ask for votes or ask users to click on links, etc.). 3,118 CSV Classification 2015 Crowdflower
Dataset for the Machine Comprehension of Text 01.15.20 English Stories and associated questions for testing comprehension of text. 660 Text Question Answering, Reading Comprehension 2013 Richardson et al.
DBpedia 01.15.20 Multi-Lingual The English version of the DBpedia knowledge base currently describes 6.6M entities, of which 4.9M have abstracts, 1.9M have geo coordinates, and 1.7M have depictions. In total, 5.5M resources are classified in a consistent ontology. 6.6M Multiple Knowledge Base 2016 DBpedia
DEXTER Dataset 01.15.20 English Task given is to determine, from features given, which articles are about corporate acquisitions. 2,600 Text Classification 2008 Reuters
DSL Corpus Collection (DSLCC) 01.15.20 Multi-Lingual Dataset contains short excerpts of journalistic texts in similar languages and dialects. 294,000 Text Discriminating between similar languages 2017 Tang et al.
DuReader 01.15.20 Mandarin DuReader version 2.0 contains more than 300K question, 1.4M evidence documents and 660K human generated answers. 1,431,429 JSON Question Answering, Reading Comprehension 2018 He et al.
Enron Email Dataset 01.15.20 English Emails from employees at Enron organized into folders. ~500,000 Text Text Corpora 2004 (2015) Klimt et al.
European Parliament Proceedings (Europarl) 01.15.20 Multi-Lingual The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It includes versions in 21 European languages. 10M+ XML Text Corpora, Machine Translation 2002 Koehn et al.
Examiner Pseudo-News Corpus 01.15.20 English Clickbait, spam, crowd-sourced headlines from 2010 to 2015. 3,089,781 CSV Clustering, Events, Sentiment Analysis 2017 Kulkarni
Google Books N-grams 01.15.20 Multi-Lingual N-grams from a very large corpus of books. 2.2 TB of text Text Classification, Clustering 2011 Google
Gutenberg Book Corpus 01.15.20 Multi-Lingual Dataset contains 60,000 eBooks. 60,000 Text Text Corpora 1996-2019 Gutenberg
Hansards Canadian Parliament 01.15.20 English Dataset contains pairs of aligned text chunks (sentences or smaller fragments) from the official records (Hansards) of the 36th Canadian Parliament. 1.3M Text Text Corpora 2001 Natural Language Group - USC
Harvard Library 01.15.20 English Dataset contains books, journals, electronic resources, manuscripts, archival materials, scores, audio, video and other materials. 12.7M MODS, Dublin Core Text Corpora n/a Harvard
Hate Speech Identification Dataset 01.15.20 English Dataset contains lexicons and notebooks with content that is racist, sexist, homophobic, and offensive in general. n/a CSV Classification 2017 Davidson et al.
Historical Newspapers Daily Word Time Series Dataset 01.15.20 English Dataset contains daily contents of newspapers published in the US and UK from 1836 to 1922. 25,000 n/a Text Corpora 2017 Dzogang et al.
Home Depot Product Search Relevance 01.15.20 English Dataset contains a number of products and real customer search terms from Home Depot's website. n/a CSV Classification 2015 Home Depot
HotpotQA 01.15.20 English Dataset featuring natural, multi-hop questions, with strong supervision for supporting facts to enable more explainable question answering systems. 1.25M JSON Question Answering, Reading Comprehension 2018 Yang et al.
Human-in-the-loop Dialogue Simulator (HITL) 01.15.20 English Dataset provides a framework for evaluating a bot’s ability to learn to improve its performance in an online setting using feedback from its dialog partner. The dataset contains questions based on the bAbI and WikiMovies datasets, with the addition of feedback from the dialog partner. n/a Text Question Answering, Reading Comprehension 2016 Li et al.
IWSLT 15 English-Vietnamese 01.15.20 Multi-Lingual Sentence pairs for translation. 133,000 Text Machine Translation 2015 Stanford
Jeopardy Questions Answers 01.15.20 English Dataset contains Jeopardy questions, answers and other data. 216,930 JSON Question Answering, Reading Comprehension 2014 Anonymous
Legal Case Reports 01.15.20 English Federal Court of Australia cases from 2006 to 2009. 4,000 Text Classification 2012 Galgani et al.
Ling-Spam Dataset 01.15.20 English Corpus contains both legitimate and spam emails. n/a Text Classification 2000 Androutsopoulos et al.
Microsoft Machine Reading Comprehension Dataset (MS MARCO) 01.15.20 English Dataset focused on machine reading comprehension, question answering, passage ranking, keyphrase extraction, and conversational search studies. 1,010,916 JSON Question Answering, Reading Comprehension 2016 Bajaj et al.
Microsoft Research Paraphrase Corpus 01.15.20 English Dataset contains pairs of sentences which have been extracted from news sources on the web, along with human annotations indicating whether each pair captures a paraphrase/semantic equivalence relationship. 5,800 Text Paraphrasing 2005 Dolan et al.
Microsoft Research Social Media Conversation Corpus 01.15.20 English A-B-A triples extracted from Twitter. 4,232 Text Graph Analysis 2016 Sordoni et al.
MovieLens 01.15.20 English Dataset contains 22,000,000 ratings and 580,000 tags applied to 33,000 movies by 240,000 users. ~22M Text Clustering, Classification, Regression 2016 GroupLens Research
MovieTweetings 01.15.20 English Movie rating dataset based on public and well-structured tweets. 822,784 Text Classification, Regression 2018 Dooms
Multimodal Comprehension of Cooking Recipes (RecipeQA) 01.15.20 English Dataset for multimodal comprehension of cooking recipes. It consists of over 36K question-answer pairs automatically generated from approximately 20K unique recipes with step-by-step instructions and images. 20,000 JSON Question Answering, Reading Comprehension 2018 Yagcioglu et al.
MultiNLI Matched/Mismatched 01.15.20 English Dataset contains sentence pairs annotated with textual entailment information. 433,000 JSON, Text Entailment 2017 Williams et al.
Multi-Sentence Reading Comprehension 01.15.20 English Dataset of short paragraphs and multi-sentence questions that can be answered from the content of the paragraph. ~10,000 JSON Question Answering, Reading Comprehension 2018 Khashabi et al.
NarrativeQA 01.15.20 English Dataset contains the list of documents with Wikipedia summaries, links to full stories, and questions and answers. 1,572 CSV Question Answering, Reading Comprehension 2017 Kočiský et al.
News Headlines Dataset for Sarcasm Detection 01.15.20 English High-quality dataset of sarcastic and non-sarcastic news headlines. 26,709 JSON Clustering, Events, Language Detection 2018 Misra
News Headlines Of India 01.15.20 English Dataset contains an archive of notable events in India during 2001-2018, recorded by the Times of India. 2,969,922 CSV Text Corpora 2017 Kaggle
NewsQA 01.15.20 English Crowdworkers supply questions and answers based on a set of over 10,000 news articles from CNN. 12,744 JSON, CSV Question Answering, Reading Comprehension 2017 Trischler et al.
NPS Chat Corpus 01.15.20 English Posts from age-specific online chat rooms. ~500,000 XML Dialogue 2007 Forsyth et al.
NUS SMS Corpus 01.15.20 Mandarin, English SMS messages collected between 2 users, with timing analysis. 67,093 XML Dialogue 2013 Kan et al.
NYSK Dataset 01.15.20 English English news articles about the case relating to allegations of sexual assault against the former IMF director Dominique Strauss-Kahn. 10,421 XML Sentiment Analysis, Topic Extraction 2013 Dermouche et al.
One Week of Global News Feeds 01.15.20 Multi-Lingual Dataset contains most of the new news content published online over one week in 2017 and 2018. 3.3M CSV Text Corpora 2018 Kulkarni et al.
OpinRank Review Dataset 01.15.20 English Reviews of cars and hotels from Edmunds and TripAdvisor. Edmunds: 42,230, TripAdvisor: 259,000 Text Information Retrieval, Entity Ranking, Entity Retrieval 2011 Ganesan et al.
Paraphrase and Semantic Similarity in Twitter (PIT) 01.15.20 English Dataset focuses on whether tweets have (almost) same meaning/information or not. 18,762 Text Classification 2015 Xu et al.
Personae Corpus 01.15.20 Dutch Collected for experiments in Authorship Attribution and Personality Prediction. Consists of 145 Dutch-language essays. 145 Text Classification, Regression 2008 Luyckx et al.
Plaintext Jokes 01.15.20 English Dataset contains 208,000 jokes scraped from three sources. 208,000 JSON Text Corpora 2016 Pungas et al.
Quasar-S & T 01.15.20 English The Quasar-S dataset consists of 37,000 cloze-style queries constructed from definitions of software entity tags on the popular website Stack Overflow. The Quasar-T dataset consists of 43,000 open-domain trivia questions and their answers obtained from various internet sources. 80,000 JSON Question Answering, Reading Comprehension 2017 Dhingra et al.
Question Answering in Context (QuAC) 01.15.20 English Dataset for modeling, understanding, and participating in information seeking dialog. 14,000 JSON Question Answering, Reading Comprehension 2018 Choi et al.
Question NLI 01.15.20 English Dataset converts SQuAD dataset into sentence pair classification by forming a pair between each question and each sentence in the corresponding context. 110,000 JSON Inference 2018 Rajpurkar et al.
Quora Question Pairs 01.15.20 English The task is to determine whether a pair of questions are semantically equivalent. 400,000 TSV Semantic Similarity 2017 Quora
ReAding Comprehension Dataset From Examinations (RACE) 01.15.20 English Dataset was collected from the English exams evaluating the students' ability in understanding and reasoning. 28,000 JSON Question Answering, Reading Comprehension 2017 Lai et al.
Reading Comprehension over Multiple Sentences (MultiRC) 01.15.20 English Dataset of short paragraphs and multi-sentence questions that can be answered from the content of the paragraph. ~10,000 JSON Question Answering, Reading Comprehension 2018 Khashabi et al.
Reading Comprehension with Commonsense Reasoning Dataset (ReCoRD) 01.15.20 English Reading comprehension dataset which requires commonsense reasoning. Contains 120,000+ queries from 70,000+ news articles. 70,000+ JSON Question Answering, Reading Comprehension 2018 Zhang et al.
Reading Comprehension with Multiple Hops (Qangaroo) 01.15.20 English Reading comprehension datasets focusing on multi-hop (alias multi-step) inference. There are 2 datasets: WikiHop (based on Wikipedia) and MedHop (based on PubMed research papers). ~53,000 JSON Question Answering, Reading Comprehension 2018 Welbl et al.
Recognizing Textual Entailment (RTE) 01.15.20 English Datasets are combined and converted to two-class classification: entailment and not_entailment. n/a JSON Entailment 2006-2009 Dagan et al., Bar-Haim et al., Giampiccolo et al., and Bentivogli et al.
Reddit All Comments Corpus 01.15.20 English All Reddit comments (as of 2017). 3,329,219,008 JSON Text Corpora 2017 Reddit
Relationship and Entity Extraction Evaluation Dataset (RE3D) 01.15.20 English Entity and Relation marked data from various news and government sources. n/a JSON Classification, Entity and Relation Recognition 2017 Dstl
Reuters News Wire Headline 01.15.20 English Dataset contains 11 years of timestamped events published on the news-wire. 16,121,310 TSV Clustering, Events, Language Detection 2018 Kulkarni
Saudi Newspapers Corpus 01.15.20 Arabic Dataset contains 31,030 Arabic newspaper articles. 31,030 JSON Text Corpora 2015 Alhagri
SearchQA 01.15.20 English Dataset from Jeopardy archives which consists of more than 140k question-answer pairs, with each pair having 49.6 snippets on average. 140,000 JSON Question Answering, Reading Comprehension 2017 Dunn et al.
Semantic Textual Similarity Benchmark 01.15.20 English The task is to predict textual similarity between sentence pairs. 8,628 CSV Semantic Similarity 2017 Cer et al.
Sentiment Labeled Sentences Dataset 01.15.20 English Dataset contains 3,000 sentiment-labeled sentences. 3,000 Text Classification, Sentiment Analysis 2015 Kotzias
Sentiment140 01.15.20 English Tweet data from 2009 including original text, time stamp, user and sentiment. 1,578,627 CSV Sentiment Analysis 2009 Go et al.
Shaping Answers with Rules through Conversation (ShARC) 01.15.20 English ShARC is a Conversational Question Answering dataset focussing on question answering from texts containing rules. 32,000 JSON Question Answering, Reading Comprehension 2018 Saeidi et al.
Short Answer Scoring 01.15.20 English Student-written short-answer responses. n/a TSV Scoring Classification 2012 The Hewlett Foundation
Skytrax User Reviews Dataset 01.15.20 English User reviews of airlines, airports, seats, and lounges from Skytrax. 41,396 CSV Classification, Sentiment Analysis 2015 Nguyen
SMS Spam Collection Dataset 01.15.20 English Dataset contains SMS spam messages. 5,574 Text Classification 2011 Almeida et al.
SNAP Social Circles: Twitter Database 01.15.20 English Large Twitter network data. Nodes: 81,306, Edges: 1,768,149 Text Clustering, Graph Analysis 2012 McAuley et al.
Spambase Dataset 01.15.20 English Dataset contains spam emails. 4,601 Text Classification 1999 Hopkins et al.
SQuAD v2.0 01.15.20 English Paragraphs w/ questions and answers. 150,000 JSON Question Answering, Reading Comprehension 2018 Rajpurkar et al.
Stack Overflow BigQuery Dataset 01.15.20 English BigQuery dataset includes an archive of Stack Overflow content, including posts, votes, tags, and badges. n/a n/a Text Corpora 2018 Stack Overflow
Stanford Natural Language Inference (SNLI) Corpus 01.15.20 English Image captions matched with newly constructed sentences to form entailment, contradiction, or neutral pairs. 570,000 Text Inference, Entailment 2015 Bowman et al.
The Corpus of Linguistic Acceptability 01.15.20 English Dataset used to classify sentences as grammatical or not grammatical. 10,657 TSV Grammatical Acceptability 2018 Warstadt et al.
The Dialog-based Language Learning Dataset 01.15.20 English Dataset was designed to measure how well models can perform at learning as a student given a teacher’s textual responses to the student’s answer. n/a Text Question Answering, Reading Comprehension 2016 Weston
The Irish Times IRS 01.15.20 English Dataset contains 23 years of events from Ireland. 1,425,460 CSV Clustering, Events, Language Detection 2018 Kulkarni
The Movie Dialog Dataset 01.15.20 English Dataset measures how well models can perform at goal and non-goal orientated dialogue centered around the topic of movies (question answering, recommendation and discussion). ~3.5M Text Question Answering, Reading Comprehension 2016 Dodge et al.
The Penn Treebank Project 01.15.20 English Naturally occurring text annotated for linguistic structure. ~1M words Text POS 1995 Marcus et al.
The SimpleQuestions Dataset 01.15.20 English Dataset for question answering with human generated questions paired with a corresponding fact, formatted as (subject, relationship, object), that provides the answer but also a complete explanation. 108,442 Text Question Answering, Reading Comprehension 2015 Bordes et al.
The Stanford Sentiment Treebank (SST) 01.15.20 English Sentence sentiment classification of movie reviews. 69,000 PTB Sentiment Analysis 2013 Socher et al.
The Story Cloze Test | ROCStories 01.15.20 English Dataset for story understanding that provides systems with four-sentence stories and two possible endings. The systems must then choose the correct ending to the story. 100,000+ JSON Question Answering, Reading Comprehension 2017 Mostafazadeh et al.
The WikiMovies Dataset 01.15.20 English Dataset contains only the QA part of the Movie Dialog dataset, but using three different settings of knowledge: using a traditional knowledge base (KB), using Wikipedia as the source of knowledge, or using IE (information extraction) over Wikipedia. ~100,000 Text Question Answering, Reading Comprehension 2016 Miller et al.
The Winograd Schema Challenge 01.15.20 English Dataset to determine the correct referent of the pronoun from among the provided choices. 150 XML Coreference Resolution 2012 Levesque et al.
Topical-Chat 01.15.20 English A knowledge-grounded human-human conversation dataset where the underlying knowledge spans 8 broad topics and conversation partners don’t have explicitly defined roles. 10,784 JSON Dialogue 2019 Gopalakrishnan et al.
T-REx 01.15.20 English Dataset contains Wikipedia abstracts aligned with Wikidata entities. 11M aligned triples JSON and NIF Relation Extraction 2018 Elsahar et al.
Twenty Newsgroups Dataset 01.15.20 English Messages from 20 different newsgroups. 20,000 Text Classification, Clustering 1999 Mitchell et al.
Twitter Dataset for Arabic Sentiment Analysis 01.15.20 Arabic Dataset contains Arabic tweets. 2,000 Text Classification 2014 Abdulla
Twitter US Airline Sentiment 01.15.20 English Contributors were asked to classify positive, negative, and neutral tweets, followed by categorizing negative reasons. 14,500 CSV Classification, Sentiment Analysis 2016 Figure Eight
Twitter100k 01.15.20 English Pairs of images and tweets. 100,000 Text and Images Multi-Modal Learning 2017 Hu et al.
Ubuntu Dialogue Corpus 01.15.20 English Dialogues extracted from Ubuntu chat stream on IRC. 930,000 CSV Text Corpora 2015 Lowe et al.
Urban Dictionary Dataset 01.15.20 English Corpus of words, votes and definitions. 2,606,522 CSV Reading Comprehension 2016-05 Anonymous
UseNet Corpus 01.15.20 English UseNet forum postings. 7B Text Dialogue 2011 Shaoul et al.
Web of Science Dataset 01.15.20 English Hierarchical Datasets for Text Classification. 46,985 Text Classification 2017 Kowsari et al.
Wikipedia 2012 01.15.20 English Dataset contains the full text of Wikipedia, encompassing 1.9 billion words from more than 4 million articles. 4M Text Text Corpora 2012 Google
WikiQA Corpus 01.15.20 English Dataset contains Bing query logs as the question source. Each question is linked to a Wikipedia page that potentially has the answer. 3,047 TSV Question Answering, Reading Comprehension 2015 Yang et al.
Winogender Schemas 01.15.20 English Dataset with pairs of sentences that differ only by the gender of one pronoun in the sentence, designed to test for the presence of gender bias in automated coreference resolution systems. 720 TSV Coreference Resolution 2018 Rudinger et al.
WMT 14 English-German 01.15.20 Multi-Lingual Sentence pairs for translation. 4.5M Text Machine Translation 2015 Stanford
WMT 15 English-Czech 01.15.20 Multi-Lingual Sentence pairs for translation. 15.8M Text Machine Translation 2016 Stanford
WMT 19 Multiple Datasets 01.15.20 Multi-Lingual Multiple text corpora in multiple languages. n/a Text Text Corpora, Machine Translation 2019 ACL Workshop
Words in Context 01.15.20 English Dataset for evaluating contextualized word representations. 2,400 Text Word Sense Disambiguation 2019 Pilehvar et al.
Worldwide News - Aggregate of 20K Feeds 01.15.20 Multi-Lingual One week snapshot of all online headlines in 20+ languages. 1,398,431 CSV Clustering, Events, Machine Translation 2017 Kulkarni
Yahoo! Music User Ratings of Musical Artists 01.15.20 English Over 10M ratings of artists by Yahoo users. May be used to validate recommender systems or collaborative filtering algorithms. ~10M Text Clustering, PCA 2004 Yahoo!
YouTube Comedy Slam Preference Dataset 01.15.20 English User vote data for pairs of videos shown on YouTube. Users voted on funnier videos. 1,138,562 Text Classification 2012 Google
Visual QA (VQA) 01.15.20 English Dataset containing open-ended questions about images. These questions require an understanding of vision, language and commonsense to answer. 265,016 images JSON Visual Question Answering 2015 Antol et al.
TrecQA 01.15.20 English Dataset is commonly used for evaluating answer selection in question answering. n/a XML Question Answering, Reading Comprehension 2007 Wang et al.
TriviaQA 01.15.20 English Dataset containing over 650K question-answer-evidence triples. It includes 95K QA pairs authored by trivia enthusiasts and independently gathered evidence documents, 6 per question on average. 650,000+ JSON Question Answering, Reading Comprehension 2017 Joshi et al.
SemEvalCQA 01.15.20 Arabic, English Dataset for community question answering. n/a XML Question Answering, Reading Comprehension 2016 Nakov et al.
QuaRTz Dataset 01.15.20 English Dataset contains 3,864 questions about open domain qualitative relationships. Each question is paired with one of 405 different background sentences (sometimes short paragraphs). 3,864 JSON Question Answering, Reading Comprehension 2019 Tafjord et al.
QuaRel Dataset 01.15.20 English Dataset contains 2,771 story questions about qualitative relationships. 2,771 JSON Question Answering, Reading Comprehension 2018 Tafjord et al.
OpenBookQA 01.15.20 English Dataset modeled after open book exams for assessing human understanding of a subject. It consists of 5,957 multiple-choice elementary-level science questions (4,957 train, 500 dev, 500 test), which probe the understanding of a small "book" of 1,326 core science facts and the application of these facts to novel situations. 5,957 JSON Question Answering, Reading Comprehension 2018 Mihaylov et al.
ProPara Dataset 01.15.20 English Dataset is used for comprehension of simple paragraphs describing processes, e.g., photosynthesis. The comprehension task relies on predicting, tracking, and answering questions about how entities change during the process. 488 Google Sheets Question Answering, Reading Comprehension 2018 Mishra et al.
Open Research Corpus 01.15.20 English Dataset contains over 39 million published research papers in computer science, neuroscience, and biomedicine. 39M JSON Text Corpora 2018 Ammar et al.
SciTail Dataset 01.15.20 English Entailment dataset created from multiple-choice science exams and web sentences. Each question and the correct answer choice are converted into an assertive statement to form the hypothesis. 27,026 SNLI, TSV, DGEM Entailment 2018 Khot et al.
AI2 Science Questions v2.1 01.15.20 English Dataset consists of questions used in student assessments in the United States across elementary and middle school grade levels. Each question is 4-way multiple choice format and may or may not include a diagram element. 5,060 JSON, CSV Question Answering, Reading Comprehension 2017 Allen Institute
SciQ Dataset 01.15.20 English Dataset contains 13,679 crowdsourced science exam questions about Physics, Chemistry and Biology, among others. The questions are in multiple-choice format with 4 answer options each. 13,679 JSON Question Answering, Reading Comprehension 2017 Welbl et al.
TupleInf Open IE Dataset 01.15.20 English Dataset contains Open IE tuples extracted from 263K sentences that were used by the solver in "Answering Complex Questions Using Open Information Extraction" (referred to as Tuple KB, T). 263,000 Text Knowledge Base 2017 Allen Institute
Textbook Question Answering 01.15.20 English The M3C task builds on the popular Visual Question Answering (VQA) and Machine Comprehension (MC) paradigms by framing question answering as a machine comprehension task, where the context needed to answer questions is provided and composed of both text and images. 26,620 JSON, PNG Question Answering, Reading Comprehension, Visual 2017 Kembhavi et al.
Aristo Tuple KB 01.15.20 English Dataset contains a collection of high-precision, domain-targeted (subject,relation,object) tuples extracted from text using a high-precision extraction pipeline, and guided by domain vocabulary constraints. 282,594 TSV Knowledge Base 2017 Dalvi et al.
AI2 Science Questions Mercury 01.15.20 English Dataset consists of questions used in student assessments across elementary and middle school grade levels. Includes questions with diagrams and without. 6,940 JSON, JPG Reading Comprehension 2017 Allen Institute
Explanations for Science Questions 01.15.20 English Data contains: gold explanation sentences supporting 363 science questions, relation annotation for a subset of those explanations, and a graphical annotation tool with annotation guidelines. 1,363 CSV Question Answering, Reading Comprehension 2016 Jansen et al.
AQuA 01.15.20 English Dataset containing algebraic word problems with rationales for their answers. 100,000 JSON Question Answering, Reading Comprehension 2017 Ling et al.
Who Did What Dataset 01.15.20 English Dataset contains over 200,000 fill-in-the-gap (cloze) multiple choice reading comprehension problems constructed from the LDC English Gigaword newswire corpus. 200,000 XML Question Answering, Reading Comprehension 2016 Onishi et al.
Coached Conversational Preference Elicitation 01.15.20 English Dataset consisting of 502 English dialogs with 12,000 annotated utterances between a user and an assistant discussing movie preferences in natural language. 12,000 JSON Dialogue 2019 Radlinski et al.
Dialogue NLI 01.15.20 English The Dialogue NLI dataset consists of sentence pairs labeled as entailment (E), neutral (N), or contradiction (C). ~350,000 JSON Dialogue, Entailment 2019 Welleck et al.
Meta-Learning Wizard-of-Oz (MetaLWOz) 01.15.20 English Dataset designed to help develop models capable of predicting user responses in unseen domains. It was created by crowdsourcing 37,884 goal-oriented dialogs, covering 227 tasks in 47 domains. 37,884 Text Dialogue 2019 Microsoft
Situations With Adversarial Generations (SWAG) 01.15.20 English Dataset consists of 113k multiple choice questions about grounded situations. Each question is a video caption from LSMDC or ActivityNet Captions, with four answer choices about what might happen next in the scene. 113,000 CSV Question Answering, Reading Comprehension 2018 Zellers et al.
A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs (DROP) 01.15.20 English Dataset is used to resolve references in a question, perhaps to multiple input positions, and perform discrete operations over them (such as addition, counting, or sorting). 96,000 JSON Question Answering, Reading Comprehension 2019 Dua et al.
Cosmos QA 01.15.20 English Dataset containing thousands of problems that require commonsense-based reading comprehension, formulated as multiple-choice questions. 35,000 CSV Question Answering, Reading Comprehension, Commonsense 2019 Huang et al.
COmmonsense Dataset Adversarially-authored by Humans (CODAH) 01.15.20 English Commonsense QA in the sentence completion style of SWAG. As opposed to other automatically generated NLI datasets, CODAH is adversarially constructed by humans who can view feedback from a pre-trained model and use this information to design challenging commonsense questions. 2,776 TSV Question Answering, Reading Comprehension, Commonsense 2019 Chen et al.
Deal or No Deal? End-to-End Learning for Negotiation Dialogues 01.15.20 English This dataset consists of 5,808 dialogues, based on 2,236 unique scenarios dealing with negotiations and complex communication. 5,808 Text Dialogue 2017 Lewis et al.
CommonsenseQA 01.15.20 English Multiple-choice question answering dataset that requires different types of commonsense knowledge to predict the correct answers. It contains 12,102 questions with one correct answer and four distractor answers. 12,102 JSON Question Answering, Reading Comprehension, Commonsense 2018 Talmor et al.
ComplexWebQuestions 01.15.20 English Dataset contains a large set of complex questions in natural language, and can be used in multiple ways. 34,689 JSON Question Answering, Reading Comprehension 2018 Talmor et al.
DVQA 01.15.20 English Dataset containing data visualizations and natural language questions. 3,487,194 JSON, PNG Question Answering, Visual, Commonsense 2018 Kafle et al.
Social-IQ Dataset 01.15.20 English Dataset containing videos and natural language questions for visual reasoning. 7,500 n/a Question Answering, Visual, Commonsense 2019 Zadeh et al.
TextVQA 01.15.20 English TextVQA requires models to read and reason about text in images to answer questions about them. Specifically, models need to incorporate a new modality of text present in the images and reason over it to answer TextVQA questions. 36,602 JSON, PNG Question Answering, Visual, Commonsense 2019 Singh et al.
Natural Questions (NQ) 01.15.20 English Dataset contains questions from real users, and it requires QA systems to read and comprehend an entire Wikipedia article that may or may not contain the answer to the question. 320,000+ HTML Question Answering, Reading Comprehension 2019 Kwiatkowski et al.
ActivityNet-QA 01.15.20 English Dataset contains 58,000 human-annotated QA pairs on 5,800 videos derived from the popular ActivityNet dataset. The dataset provides a benchmark for testing the performance of VideoQA models on long-term spatio-temporal reasoning. 58,000 JSON Question Answering, Visual, Commonsense 2019 Yu et al.
GQA 01.15.20 English Question answering on image scene graphs. 22M JSON, H5 Question Answering, Visual, Commonsense 2019 Hudson et al.
Visual Commonsense Reasoning (VCR) 01.15.20 English Dataset contains 290K multiple-choice questions on 110K images. 290,000 JSON, JPG Question Answering, Visual, Commonsense 2018 Zellers et al.
The Benchmark of Linguistic Minimal Pairs (BLiMP) 01.15.20 English BLiMP is a challenge set for evaluating what language models (LMs) know about major grammatical phenomena in English. 67 sub-datasets each with 1,000 minimal pairs JSON Language Modeling 2019 Warstadt et al.
An Open Information Extraction Corpus (OPIEC) 01.15.20 English OPIEC is an Open Information Extraction (OIE) corpus, constructed from the entire English Wikipedia containing more than 341M triples. 341M AVRO Knowledge Base 2019 Nanni et al.
Total-Text-Dataset 01.15.20 English Dataset used to classify curved text in pictures. ~1,500 JPG Scene Text Detection 2019 Ch'ng et al.
Schema-Guided Dialogue State Tracking (DSTC 8) 01.15.20 English Dataset contains 18K dialogues between a virtual assistant and a user. ~18,000 JSON Dialogue State Tracking 2019 Rastogi et al.
TabFact 01.15.20 English Dataset contains 16k Wikipedia tables as evidence for 118k human annotated statements to study fact verification with semi-structured evidence. 16,000 JSON Natural Language Inference 2020 Chen et al.
Taskmaster-1 01.15.20 English Dataset contains 13,215 task-based dialogs, including 5,507 spoken and 7,708 written dialogs created with two distinct procedures. Each conversation falls into one of six domains: ordering pizza, creating auto repair appointments, setting up ride service, ordering movie tickets, ordering coffee drinks and making restaurant reservations. 13,215 JSON Dialogue 2019 Byrne et al.
OpenWebTextCorpus 01.15.20 English Dataset contains text from millions of webpages linked from Reddit URLs, totalling 38 GB of text data. 8,013,769 n/a Text Corpora 2019 Radford et al.
NLP Chinese Corpus 01.15.20 Chinese Large text corpora in Chinese. 10M+ JSON Text Corpora 2019 Xu et al.
Ten Thousand German News Articles Dataset (10kGNAD) 01.15.20 German Dataset consists of 10,273 German-language news articles from an Austrian online newspaper, categorized into nine topics. 10,273 CSV Text Corpora 2019 Timo Block
Tencent AI Lab Embedding Corpus 01.15.20 Chinese Dataset provides 200-dimension vector representations, a.k.a. embeddings, for over 8 million Chinese words and phrases. 8M Text Embeddings 2018 Song et al.
CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) 01.15.20 English Dataset contains more than 23,500 sentence utterance videos from more than 1,000 online YouTube speakers. The dataset is gender balanced. All the sentence utterances are randomly chosen from various topics and monologue videos. 23,500 n/a Sentiment Analysis, Emotion Recognition, Visual 2018 MultiComp Lab
Cornell Newsroom 01.15.20 English Dataset contains 1.3 million articles and summaries written by authors and editors in the newsrooms of 38 major publications. The summaries are obtained from search and social metadata between 1998 and 2017. 1.3M JSON Text Corpora, Summarization 2018 Grusky et al.
WikiHow 01.15.20 English Dataset contains article and summary pairs extracted and constructed from an online knowledge base written by different human authors. 230,000+ Text Text Corpora, Summarization 2018 Koupaee et al.
Multi-Domain Wizard-of-Oz Dataset (MultiWoz) 01.15.20 English Dataset of human-human written conversations spanning over multiple domains and topics. The dataset was collected based on the Wizard of Oz experiment on Amazon MTurk. 10,438 JSON Dialogue 2018 Budzianowski et al.
Spider 1.0 01.15.20 English Dataset consists of 10,181 questions and 5,693 unique complex SQL queries on 200 databases with multiple tables covering 138 different domains. 10,181 JSON, SQL Semantic Parsing, SQL-to-Text 2018 Yu et al.
Semantic Parsing in Context (SParC) 01.15.20 English Dataset consists of 4,298 coherent question sequences (12k+ unique individual questions annotated with SQL queries). It is the context-dependent, multi-turn version of the Spider task. 4,298 JSON, SQL Semantic Parsing, SQL-to-Text 2019 Yu et al.
Conversational Text-to-SQL Systems (CoSQL) 01.15.20 English Dataset consists of 30k+ turns plus 10k+ annotated SQL queries, obtained from a Wizard-of-Oz collection of 3k dialogues querying 200 complex databases spanning 138 domains. It is the dialogue version of the Spider and SParC tasks. 3,000 JSON, SQL Dialogue, SQL-to-Text 2019 Yu et al.
Conceptual Captions 01.15.20 English Dataset contains ~3.3M images annotated with captions to be used for the task of automatically producing a natural-language description for an image. 3,318,333 TSV Automatic Image Captioning 2018 Sharma et al.
Yelp Open Dataset 01.15.20 English Dataset containing millions of reviews on Yelp. In addition it contains business data including location data, attributes, and categories. 6,685,900 JSON Classification, Sentiment Analysis 2015 Yelp
Stanford Natural Language Inference (SNLI) Corpus 01.15.20 English Dataset containing human-written sentence pairs manually labeled for the labels entailment, contradiction, and neutral. 570,000 JSON Natural Language Inference, Entailment 2015 Bowman et al.
Common Voice 01.15.20 Multi-Lingual Dataset containing audio in 29 languages and 2,454 recorded hours. n/a MP3 Speech Recognition 2019 Mozilla
Voices Obscured in Complex Environmental Settings (VOiCES) 01.15.20 English Dataset contains a total of 15 hours (3,903 audio files) in male and female read speech. n/a Wav Speech Recognition 2018 Various
VoxCeleb 01.15.20 Multi-Lingual An audio-visual dataset consisting of short clips of human speech, extracted from interview videos uploaded to YouTube. n/a MD5, URL Speech Recognition, Visual 2017 Nagrani et al.
AudioSet 01.15.20 Multi-Lingual Dataset consists of an expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos. n/a CSV, TFR Speech Recognition, Visual 2017 Google
LibriSpeech ASR 01.15.20 English Large-scale (1000 hours) corpus of read English speech. n/a FLAC Speech Recognition 2015 OpenSLR
Dataset for Fill-in-the-Blank Humor 01.15.20 English Dataset contains 50 fill-in-the-blank stories similar in style to Mad Libs. The blanks in these stories include the original word and the hint type (e.g. animal, food, noun, adverb). 50 JSON Text Generation 2017 Hossain et al.
MSParS 01.15.20 English Dataset for the open domain semantic parsing task. 81,826 Satori Semantic Parsing 2019 Microsoft
Microsoft Speech Corpus 01.15.20 Indian Dataset contains conversational and phrasal speech training and test data for Telugu, Tamil and Gujarati languages. n/a Wav Speech Recognition 2019 Microsoft
Microsoft Information-Seeking Conversation (MISC) dataset 01.15.20 English Dataset contains recordings of information-seeking conversations between human “seekers” and “intermediaries”. It includes audio and video signals; transcripts of conversation; affectual and physiological signals; recordings of search and other computer use; and post-task surveys on emotion, success, and effort. n/a various Speech Recognition, Dialogue, Visual 2018 Microsoft
Microsoft Speech Language Translation Corpus (MSLT) 01.15.20 Multi-Lingual Dataset contains conversational, bilingual speech test and tuning data for English, Chinese, and Japanese. It includes audio data, transcripts, and translations; and allows end-to-end testing of spoken language translation systems on real-world data. n/a Wav Speech Recognition, Machine Translation 2017 Microsoft
WebQuestions Semantic Parses Dataset 01.15.20 English Dataset contains full semantic parses in SPARQL queries for 4,737 questions, and “partial” annotations for the remaining 1,073 questions for which a valid parse could not be formulated or where the question itself is bad or needs a descriptive answer. 5,810 JSON Semantic Parsing 2016 Yih et al.
MutualFriends 01.15.20 English Task where two agents must discover which friend of theirs is mutual based on the friend's attributes. n/a JSON Dialogue 2017 He et al.
Personalized Dialog 01.15.20 English Dataset of dialogs from movie scripts. 12,000 Text Dialogue 2017 Joshi et al.
The Conversational Intelligence Challenge 2 (ConvAI2) 01.15.20 English A chit-chat dataset based on PersonaChat dataset. 3,127 JSON Dialogue 2018 NeurIPS
IIT Bombay English-Hindi Corpus 01.21.20 English, Hindi Dataset contains parallel corpus for English-Hindi as well as monolingual Hindi corpus collected from a variety of existing sources. 1.49M n/a Machine Translation 2018 Kunchukuttan et al.
DiscoFuse 01.21.20 English Dataset contains examples for training sentence fusion models. Sentence fusion is the task of joining several independent sentences into a single coherent text. The data has been collected from Wikipedia and from Sports articles. ~60M TSV Sentence Fusion 2019 Geva et al.
Relation Extraction Corpus 01.21.20 English A human-judged dataset of two relations involving public figures on Wikipedia: about 10,000 examples of "place of birth" and 40,000 examples of "attended or graduated from an institution." 10,000 JSON Relation Extraction 2013 Google
Argentinian Spanish [es-ar] Speech Multi-Speaker Dataset 01.21.20 Spanish (Argentinian) Speech dataset containing about 5,900 transcribed high-quality audio recordings of Argentinian Spanish [es-ar] sentences recorded by volunteers. ~5,900 Wav Speech Recognition 2018 Google
Switchboard Dialogue Act Corpus (SwDA) 01.21.20 English A subset of the Switchboard-1 corpus consisting of 1,155 conversations and 42 tags. 1,155 UTT Dialogue Act Classification 1997 Bates et al.
Social Media Mining for Health (SMM4H) 01.21.20 English Dataset contains medication-related text classification and concept normalization data from Twitter. 25,678 Text Classification 2018 Sarker et al.
Event2Mind 01.21.20 English Dataset contains 25,000 events and free-form descriptions of their intents and reactions. 25,000 CSV Commonsense Inference 2018 Rashkin et al.
OntoNotes 5.0 01.21.20 Multi-Lingual Dataset contains various genres of text (news, conversational telephone speech, weblogs, usenet newsgroups, broadcast, talk shows) in three languages (English, Chinese, and Arabic) with structural information (syntax and predicate argument structure) and shallow semantics (word sense linked to an ontology and coreference). n/a Text, SQL Information Retrieval, Syntactic Parsing 2013 Weischedel et al.
Dataset for Intent Classification and Out-of-Scope Prediction 01.21.20 English Dataset is a benchmark for evaluating intent classification systems for dialog systems / chatbots in the presence of out-of-scope queries. 23,000+ JSON Intent Classification 2019 Larson et al.
Soccer Dialogues 01.21.20 English Dataset contains soccer dialogues over a knowledge graph. 2,890 JSON Knowledge Graphs, Dialogue 2019 SDA Lab, Uni. of Bonn & Volkswagen Research
Dutch Book Reviews 01.21.20 Dutch Dataset contains book reviews along with associated binary sentiment polarity labels. 118,516 Text Classification, Sentiment Analysis 2019 van der Burgh
Portuguese Newswire Corpus 01.21.20 Portuguese (Brazil) Dataset contains newswire articles collected between 1994 and 2016. Requires preprocessing of the HTML pages found in the GitHub repository at the download link. n/a HTML Text Corpora 2016 Boğaziçi University
Portuguese SQuAD v1.1 01.21.20 Portuguese Portuguese translation of the SQuAD dataset. The translation was performed using the Google Cloud API. ~100,000 JSON Question Answering, Reading Comprehension 2019 Carvalho et al.
Paraphrase Adversaries from Word Scrambling (PAWS-X) 01.21.20 Multi-Lingual Dataset contains 23,659 human translated PAWS evaluation pairs and 296,406 machine translated training pairs in six typologically distinct languages: French, Spanish, German, Chinese, Japanese, and Korean. All translated pairs are sourced from examples in PAWS-Wiki. 300,000+ TSV Paraphrasing Identification 2019 Yang et al.
Paraphrase Adversaries from Word Scrambling (PAWS) 01.21.20 English Dataset contains 108,463 human-labeled and 656k noisily labeled pairs that feature the importance of modeling structure, context, and word order information for the problem of paraphrase identification. 750,000+ TSV Paraphrasing Identification 2019 Zhang et al.
Event-focused Emotion Corpora for German and English 01.21.20 English, German German and English emotion corpora for emotion classification, annotated with crowdsourcing in the style of the ISEAR resources. 2,002 TSV Text Corpora, Emotion Classification 2019 Troiano et al.
Classify Emotional Relationships of Fictional Characters 01.21.20 English Dataset contains 19 short stories that are shorter than 1,500 words, and depict at least four different characters. 19 Text Text Corpora, Emotion Classification 2019 Kim et al.
Affective Text 01.21.20 English Classification of emotions in 250 news headlines. Categories: anger, disgust, fear, joy, happiness, sadness, surprise. 250 SGML, Text Emotion Classification 2007 Strapparava et al.
The Emotion in Text 01.21.20 English Dataset of tweets labelled with emotion. Categories: empty, sadness, enthusiasm, neutral, worry, love, fun, hate, happiness, relief, boredom, surprise, anger. 40,000 CSV Emotion Classification 2016 CrowdFlower
DailyDialog 01.21.20 English A manually labelled conversations dataset. Categories: no emotion, anger, disgust, fear, happiness, sadness, surprise. 13,118 Text Emotion Classification 2017 Li et al.
Emotion-Stimulus 01.21.20 English Dataset annotated with both the emotion and the stimulus using FrameNet’s emotions-directed frame. It contains 820 sentences with both cause and emotion and 1,594 sentences marked with their emotion tag. Categories: happiness, sadness, anger, fear, surprise, disgust and shame. 2,414 XML Emotion Classification 2015 Ghazi et al.