| arXiv Bulk Data |
01.15.20 |
English |
A collection of research papers on arXiv. |
n/a |
Tar |
Text Corpora |
2011 |
n/a |
|
| A Conversational Question Answering Challenge (CoQA) |
01.15.20 |
English |
Dataset for measuring the ability of machines to understand a text passage and answer a
series
of interconnected questions that appear in a conversation. |
127,000+ |
JSON |
Question Answering, Reading Comprehension |
2019 |
Reddy et al. |
|
| ABC Australia News Corpus |
01.15.20 |
English |
Entire news corpus of ABC Australia from 2003 to 2017. |
1,103,664 |
CSV |
Clustering, Events, Sentiment Analysis |
2017 |
Kulkarni |
|
| AI2 Reasoning Challenge (ARC) |
01.15.20 |
English |
Dataset contains 7,787 genuine grade-school level, multiple-choice science questions.
|
7,787 |
JSON, CSV |
Question Answering, Reading Comprehension |
2018 |
Clark et al. |
|
| Amazon Fine Food Reviews |
01.15.20 |
English |
Dataset consists of reviews of fine foods from Amazon. |
568,454 |
CSV |
Classification, Sentiment Analysis |
2013 |
McAuley et al. |
|
| Amazon Reviews |
01.15.20 |
English |
US product reviews from Amazon. |
233.1M |
JSON |
Classification, Sentiment Analysis |
2018 |
McAuley et al. |
|
| ASU Twitter Dataset |
01.15.20 |
English |
Twitter network data, not actual tweets. Shows connections between a large number of
users. |
11,316,811 users, 85,331,846 connections |
CSV |
Clustering, Graph Analysis |
2009 |
Zafarani et al. |
|
| Automated Essay Scoring |
01.15.20 |
English |
Dataset contains student-written essays with scores. |
n/a |
TSV, xlsx |
Scoring Classification |
2017 |
The Hewlett Foundation |
|
| Automatic Keyphrase Extraction |
01.15.20 |
English |
Multiple datasets for automatic keyphrase extraction. |
n/a |
Multiple |
Information Retrieval |
1999-2008 |
Several |
|
| bAbI 20 Tasks |
01.15.20 |
English, Hindi |
Dataset contains a set of contexts, with multiple question-answer pairs available based
on the
contexts. |
2,000 |
Text |
Question Answering, Reading Comprehension |
2015 |
Weston et al. |
|
| bAbI 6 Tasks Dialogue |
01.15.20 |
English |
Dataset contains 6 tasks for testing end-to-end dialog systems in the restaurant domain.
|
3,000 |
Text |
Dialogue |
2017 |
Bordes et al. |
|
| BlogFeedback Dataset |
01.15.20 |
English |
Dataset to predict the number of comments a post will receive based on features of that
post.
|
60,021 |
Text |
Regression |
2014 |
Buza |
|
| Blogger Authorship Corpus |
01.15.20 |
English |
Blog post entries of 19,320 people from blogger.com. |
681,288 |
Text |
Classification, Sentiment Analysis |
2006 |
Schler et al. |
|
| BoolQ |
01.15.20 |
English |
Question answering dataset for yes/no questions. |
15,942 |
JSON |
Binary Question Answering |
2019 |
Clark et al. |
|
| Buzz in Social Media Dataset |
01.15.20 |
English |
Data from Twitter and Tom's Hardware. This dataset focuses on specific buzz topics being
discussed
on those sites. |
140,000 |
Text |
Classification |
2013 |
Kawala et al. |
|
| Car Evaluation Dataset |
01.15.20 |
English |
Car properties and their overall acceptability. |
1,728 |
Text |
Classification |
1997 |
Bohanec |
|
| Children’s Book Test (CBT) |
01.15.20 |
English |
Dataset contains ‘questions’ from chapters in the book by enumerating 21 consecutive
sentences.
In each question, the first 20 sentences form the context, and a word is removed from
the
21st sentence, which becomes the query. Models must identify the answer word among a
selection
of 10 candidate answers appearing in the context sentences and the query. |
~688,000 |
Text |
Question Answering, Reading Comprehension |
2016 |
Hill et al. |
|
| Choice of Plausible Alternatives (COPA) |
01.15.20 |
English |
Dataset used for open-domain commonsense causal reasoning. |
1,000 |
XML |
Commonsense Reasoning |
2011 |
Roemmele et al. |
|
| Clinical Case Reports for Machine Reading Comprehension (CliCR) |
01.15.20 |
English |
Dataset was built from clinical case reports, requiring the reader to answer the query
with a
medical problem/test/treatment entity. |
100,000 |
JSON |
Question Answering, Reading Comprehension |
2018 |
Šuster et al. |
|
| ClueWeb Corpora |
01.15.20 |
English |
Annotated web pages from the ClueWeb09 and ClueWeb12 corpora. |
340,451,982 |
Text |
Classification |
2013 |
Gabrilovich et al. |
|
| CNN / Daily Mail Dataset |
01.15.20 |
English |
Cloze-style reading comprehension dataset created from CNN and Daily Mail news articles.
|
1M+ |
Question |
Question Answering, Reading Comprehension |
2015 |
Hermann et al. |
|
| CommitmentBank |
01.15.20 |
English |
Dataset contains naturally occurring discourses whose final sentence contains a
clause-embedding
predicate under an entailment canceling operator (question, modal, negation, antecedent
of
conditional). |
1,200 |
CSV |
Entailment, Inference |
2019 |
de Marneffe et al. |
|
| CommonCrawl |
01.15.20 |
Multi-Lingual |
Dataset contains data from 25 billion web pages. |
25B |
WET |
Text Corpora |
2013-2019 |
Common Crawl Foundation |
|
| Cornell Movie--Dialogs Corpus |
01.15.20 |
English |
This corpus contains a large metadata-rich collection of fictional conversations
extracted from
raw movie scripts. 220,579 conversational exchanges between 10,292 pairs of movie
characters,
involving 9,035 characters from 617 movies, 304,713 utterances in total. |
304,713 |
Text |
Dialogue |
2011 |
Danescu-Niculescu-Mizil et al. |
|
| Corporate Messaging Corpus |
01.15.20 |
English |
Dataset contains statements classified as information, dialog (replies to users, etc.),
or action
(messages that ask for votes or ask users to click on links, etc.). |
3,118 |
CSV |
Classification |
2015 |
Crowdflower |
|
| Dataset for the Machine Comprehension of Text |
01.15.20 |
English |
Stories and associated questions for testing comprehension of text. |
660 |
Text |
Question Answering, Reading Comprehension |
2013 |
Richardson et al. |
|
| DBpedia |
01.15.20 |
Multi-Lingual |
The English version of the DBpedia knowledge base currently describes 6.6M entities of
which
4.9M have abstracts, 1.9M have geo coordinates and 1.7M depictions. In total, 5.5M
resources
are classified in a consistent ontology. |
6.6M |
Multiple |
Knowledge Base |
2016 |
DBpedia |
|
| DEXTER Dataset |
01.15.20 |
English |
Task given is to determine, from features given, which articles are about corporate
acquisitions.
|
2,600 |
Text |
Classification |
2008 |
Reuters |
|
| DSL Corpus Collection (DSLCC) |
01.15.20 |
Multi-Lingual |
Dataset contains short excerpts of journalistic texts in similar languages and dialects.
|
294,000 |
Text |
Discriminating between similar languages |
2017 |
Tang et al. |
|
| DuReader |
01.15.20 |
Mandarin |
DuReader version 2.0 contains more than 300K questions, 1.4M evidence documents and 660K
human
generated answers. |
1,431,429 |
JSON |
Question Answering, Reading Comprehension |
2018 |
He et al. |
|
| Enron Email Dataset |
01.15.20 |
English |
Emails from employees at Enron organized into folders. |
~500,000 |
Text |
Text Corpora |
2004 (2015) |
Klimt et al. |
|
| European Parliament Proceedings (Europarl) |
01.15.20 |
Multi-Lingual |
The Europarl parallel corpus is extracted from the proceedings of the European
Parliament. It
includes versions in 21 European languages. |
10M+ |
XML |
Text Corpora, Machine Translation |
2002 |
Koehn et al. |
|
| Examiner Pseudo-News Corpus |
01.15.20 |
English |
Clickbait, spam, crowd-sourced headlines from 2010 to 2015. |
3,089,781 |
CSV |
Clustering, Events, Sentiment Analysis |
2017 |
Kulkarni |
|
| Google Books N-grams |
01.15.20 |
Multi-Lingual |
N-grams from a very large corpus of books. |
2.2 TB of text |
Text |
Classification, Clustering |
2011 |
Google |
|
| Gutenberg Book Corpus |
01.15.20 |
Multi-Lingual |
Dataset contains 60,000 eBooks. |
60,000 |
Text |
Text Corpora |
1996-2019 |
Gutenberg |
|
| Hansards Canadian Parliament |
01.15.20 |
English |
Dataset contains pairs of aligned text chunks (sentences or smaller fragments) from the
official
records (Hansards) of the 36th Canadian Parliament. |
1.3M |
Text |
Text Corpora |
2001 |
Natural Language Group - USC |
|
| Harvard Library |
01.15.20 |
English |
Dataset contains books, journals, electronic resources, manuscripts, archival materials,
scores,
audio, video and other materials. |
12.7M |
MODS, Dublin Core |
Text Corpora |
n/a |
Harvard |
|
| Hate Speech Identification Dataset |
01.15.20 |
English |
Dataset contains lexicons, notebooks containing content that is racist, sexist,
homophobic, and
offensive in general. |
n/a |
CSV |
Classification |
2017 |
Davidson et al. |
|
| Historical Newspapers Daily Word Time Series Dataset |
01.15.20 |
English |
Dataset contains daily contents of newspapers published in the US and UK from 1836 to
1922. |
25,000 |
n/a |
Text Corpora |
2017 |
Dzogang et al. |
|
| Home Depot Product Search Relevance |
01.15.20 |
English |
Dataset contains a number of products and real customer search terms from Home Depot's
website.
|
n/a |
CSV |
Classification |
2015 |
Home Depot |
|
| HotpotQA |
01.15.20 |
English |
Dataset featuring natural, multi-hop questions, with strong supervision for supporting
facts
to enable more explainable question answering systems. |
1.25M |
JSON |
Question Answering, Reading Comprehension |
2018 |
Yang et al. |
|
| Human-in-the-loop Dialogue Simulator (HITL) |
01.15.20 |
English |
Dataset provides a framework for evaluating a bot’s ability to learn to improve its
performance
in an online setting using feedback from its dialog partner. The dataset contains
questions
based on the bAbI and WikiMovies datasets, with the addition of feedback from the dialog
partner. |
n/a |
Text |
Question Answering, Reading Comprehension |
2016 |
Li et al. |
|
| IWSLT 15 English-Vietnamese |
01.15.20 |
Multi-Lingual |
Sentence pairs for translation. |
133,000 |
Text |
Machine Translation |
2015 |
Stanford |
|
| Jeopardy Questions Answers |
01.15.20 |
English |
Dataset contains Jeopardy questions, answers and other data. |
216,930 |
JSON |
Question Answering, Reading Comprehension |
2014 |
Anonymous |
|
| Legal Case Reports |
01.15.20 |
English |
Federal Court of Australia cases from 2006 to 2009. |
4,000 |
Text |
Classification |
2012 |
Galgani et al. |
|
| Ling-Spam Dataset |
01.15.20 |
English |
Corpus contains both legitimate and spam emails. |
n/a |
Text |
Classification |
2000 |
Androutsopoulos et al. |
|
| Microsoft Machine Reading Comprehension Dataset (MS MARCO) |
01.15.20 |
English |
Dataset focused on machine reading comprehension, question answering, passage
ranking, keyphrase
extraction, and conversational search studies. |
1,010,916 |
JSON |
Question Answering, Reading Comprehension |
2016 |
Bajaj et al. |
|
| Microsoft Research Paraphrase Corpus |
01.15.20 |
English |
Dataset contains pairs of sentences which have been extracted from news sources on the
web, along
with human annotations indicating whether each pair captures a paraphrase/semantic
equivalence
relationship. |
5,800 |
Text |
Paraphrasing |
2005 |
Dolan et al. |
|
| Microsoft Research Social Media Conversation Corpus |
01.15.20 |
English |
A-B-A triples extracted from Twitter. |
4,232 |
Text |
Graph Analysis |
2016 |
Sordoni et al. |
|
| MovieLens |
01.15.20 |
English |
Dataset contains 22,000,000 ratings and 580,000 tags applied to 33,000 movies by 240,000
users.
|
~22M |
Text |
Clustering, Classification, Regression |
2016 |
GroupLens Research |
|
| MovieTweetings |
01.15.20 |
English |
Movie rating dataset based on public and well-structured tweets. |
822,784 |
Text |
Classification, Regression |
2018 |
Dooms |
|
| Multimodal Comprehension of Cooking Recipes (RecipeQA) |
01.15.20 |
English |
Dataset for multimodal comprehension of cooking recipes. It consists of over 36K
question-answer
pairs automatically generated from approximately 20K unique recipes with step-by-step
instructions
and images. |
20,000 |
JSON |
Question Answering, Reading Comprehension |
2018 |
Yagcioglu et al. |
|
| MultiNLI Matched/Mismatched |
01.15.20 |
English |
Dataset contains sentence pairs annotated with textual entailment information. |
433,000 |
JSON, Text |
Entailment |
2017 |
Williams et al. |
|
| Multi-Sentence Reading Comprehension |
01.15.20 |
English |
Dataset of short paragraphs and multi-sentence questions that can be answered from the
content
of the paragraph. |
~10,000 |
JSON |
Question Answering, Reading Comprehension |
2018 |
Khashabi et al. |
|
| NarrativeQA |
01.15.20 |
English |
Dataset contains the list of documents with Wikipedia summaries, links to full stories,
and questions
and answers. |
1,572 |
CSV |
Question Answering, Reading Comprehension |
2017 |
Kočiský et al. |
|
| News Headlines Dataset for Sarcasm Detection |
01.15.20 |
English |
High quality dataset with Sarcastic and Non-sarcastic news headlines. |
26,709 |
JSON |
Clustering, Events, Language Detection |
2018 |
Misra |
|
| News Headlines Of India |
01.15.20 |
English |
Dataset contains archive of notable events in India during 2001-2018, recorded by the
Times
of India. |
2,969,922 |
CSV |
Text Corpora |
2017 |
Kaggle |
|
| NewsQA |
01.15.20 |
English |
Crowdworkers supply questions and answers based on a set of over 10,000 news articles
from CNN.
|
12,744 |
JSON, CSV |
Question Answering, Reading Comprehension |
2017 |
Trischler et al. |
|
| NPS Chat Corpus |
01.15.20 |
English |
Posts from age-specific online chat rooms. |
~500,000 |
XML |
Dialogue |
2007 |
Forsyth et al. |
|
| NUS SMS Corpus |
01.15.20 |
Mandarin, English |
SMS messages collected between 2 users, with timing analysis. |
67,093 |
XML |
Dialogue |
2013 |
Kan et al. |
|
| NYSK Dataset |
01.15.20 |
English |
English news articles about the case relating to allegations of sexual assault against
the former IMF director Dominique
Strauss-Kahn. |
10,421 |
XML |
Sentiment Analysis, Topic Extraction |
2013 |
Dermouche et al. |
|
| One Week of Global News Feeds |
01.15.20 |
Multi-Lingual |
Dataset contains most of the new news content published online over one week in 2017 and
2018.
|
3.3M |
CSV |
Text Corpora |
2018 |
Kulkarni et al. |
|
| OpinRank Review Dataset |
01.15.20 |
English |
Reviews of cars and hotels from Edmunds.com and TripAdvisor. |
Edmunds: 42,230, TripAdvisor: 259,000 |
Text |
Information Retrieval, Entity Ranking, Entity Retrieval |
2011 |
Ganesan et al. |
|
| Paraphrase and Semantic Similarity in Twitter (PIT) |
01.15.20 |
English |
Dataset focuses on whether tweets have (almost) the same meaning/information or not. |
18,762 |
Text |
Classification |
2015 |
Xu et al. |
|
| Personae Corpus |
01.15.20 |
Dutch |
Collected for experiments in Authorship Attribution and Personality Prediction. Consists
of 145
Dutch-language essays. |
145 |
Text |
Classification, Regression |
2008 |
Luyckx et al. |
|
| Plaintext Jokes |
01.15.20 |
English |
208,000 jokes in this database scraped from three sources. |
208,000 |
JSON |
Text Corpora |
2016 |
Pungas et al. |
|
| Quasar-S & T |
01.15.20 |
English |
The Quasar-S dataset consists of 37,000 cloze-style queries constructed from definitions
of software
entity tags on the popular website Stack Overflow. The Quasar-T dataset consists of
43,000
open-domain trivia questions and their answers obtained from various internet sources.
|
80,000 |
JSON |
Question Answering, Reading Comprehension |
2017 |
Dhingra et al. |
|
| Question Answering in Context (QuAC) |
01.15.20 |
English |
Dataset for modeling, understanding, and participating in information seeking dialog.
|
14,000 |
JSON |
Question Answering, Reading Comprehension |
2018 |
Choi et al. |
|
| Question NLI |
01.15.20 |
English |
Dataset converts SQuAD dataset into sentence pair classification by forming a pair
between each
question and each sentence in the corresponding context. |
110,000 |
JSON |
Inference |
2018 |
Rajpurkar et al. |
|
| Quora Question Pairs |
01.15.20 |
English |
The task is to determine whether a pair of questions are semantically equivalent. |
400,000 |
TSV |
Semantic Similarity |
2017 |
Quora |
|
| ReAding Comprehension Dataset From Examinations (RACE) |
01.15.20 |
English |
Dataset was collected from the English exams evaluating the students' ability in
understanding
and reasoning. |
28,000 |
JSON |
Question Answering, Reading Comprehension |
2017 |
Lai et al. |
|
| Reading Comprehension over Multiple Sentences (MultiRC) |
01.15.20 |
English |
Dataset of short paragraphs and multi-sentence questions that can be answered from the
content
of the paragraph. |
~10,000 |
JSON |
Question Answering, Reading Comprehension |
2018 |
Khashabi et al. |
|
| Reading Comprehension with Commonsense Reasoning Dataset (ReCoRD) |
01.15.20 |
English |
Reading comprehension dataset which requires commonsense reasoning. Contains 120,000+
queries
from 70,000+ news articles. |
70,000+ |
JSON |
Question Answering, Reading Comprehension |
2018 |
Zhang et al. |
|
| Reading Comprehension with Multiple Hops (Qangaroo) |
01.15.20 |
English |
Reading Comprehension datasets focussing on multi-hop (alias multi-step) inference.
There are
2 datasets: WikiHop (based on Wikipedia) and MedHop (based on PubMed research papers).
|
~53,000 |
JSON |
Question Answering, Reading Comprehension |
2018 |
Welbl et al. |
|
| Recognizing Textual Entailment (RTE) |
01.15.20 |
English |
Datasets are combined and converted to two-class classification: entailment and
not_entailment.
|
n/a |
JSON |
Entailment |
2006-2009 |
Dagan et al., Bar-Haim et al., Giampiccolo et al., and Bentivogli et al. |
|
| Reddit All Comments Corpus |
01.15.20 |
English |
All Reddit comments (as of 2017). |
3,329,219,008 |
JSON |
Text Corpora |
2017 |
Reddit |
|
| Relationship and Entity Extraction Evaluation Dataset (RE3D) |
01.15.20 |
English |
Entity and Relation marked data from various news and government sources. |
n/a |
JSON |
Classification, Entity and Relation Recognition |
2017 |
Dstl |
|
| Reuters News Wire Headline |
01.15.20 |
English |
Dataset contains 11 years of timestamped events published on the news-wire. |
16,121,310 |
TSV |
Clustering, Events, Language Detection |
2018 |
Kulkarni |
|
| Saudi Newspapers Corpus |
01.15.20 |
Arabic |
Dataset contains 31,030 Arabic newspaper articles. |
31,030 |
JSON |
Text Corpora |
2015 |
Alhagri |
|
| SearchQA |
01.15.20 |
English |
Dataset from Jeopardy archives which consists of more than 140k question-answer pairs
with each
pair having 49.6 snippets on average. |
140,000 |
JSON |
Question Answering, Reading Comprehension |
2017 |
Dunn et al. |
|
| Semantic Textual Similarity Benchmark |
01.15.20 |
English |
The task is to predict textual similarity between sentence pairs. |
8,628 |
CSV |
Semantic Similarity |
2017 |
Cer et al. |
|
| Sentiment Labeled Sentences Dataset |
01.15.20 |
English |
Dataset contains 3000 sentiment labeled sentences. |
3,000 |
Text |
Classification, Sentiment Analysis |
2015 |
Kotzias |
|
| Sentiment140 |
01.15.20 |
English |
Tweet data from 2009 including original text, time stamp, user and sentiment. |
1,578,627 |
CSV |
Sentiment Analysis |
2009 |
Go et al. |
|
| Shaping Answers with Rules through Conversation (ShARC) |
01.15.20 |
English |
ShARC is a Conversational Question Answering dataset focussing on question answering
from texts
containing rules. |
32,000 |
JSON |
Question Answering, Reading Comprehension |
2018 |
Saeidi et al. |
|
| Short Answer Scoring |
01.15.20 |
English |
Student-written short-answer responses. |
n/a |
TSV |
Scoring Classification |
2012 |
The Hewlett Foundation |
|
| Skytrax User Reviews Dataset |
01.15.20 |
English |
User reviews of airlines, airports, seats, and lounges from Skytrax. |
41,396 |
CSV |
Classification, Sentiment Analysis |
2015 |
Nguyen |
|
| SMS Spam Collection Dataset |
01.15.20 |
English |
Dataset contains SMS spam messages. |
5,574 |
Text |
Classification |
2011 |
Almeida et al. |
|
| SNAP Social Circles: Twitter Database |
01.15.20 |
English |
Large Twitter network data. |
Nodes: 81,306, Edges: 1,768,149 |
Text |
Clustering, Graph Analysis |
2012 |
McAuley et al. |
|
| Spambase Dataset |
01.15.20 |
English |
Dataset contains spam emails. |
4,601 |
Text |
Classification |
1999 |
Hopkins et al. |
|
| SQuAD v2.0 |
01.15.20 |
English |
Paragraphs w/ questions and answers. |
150,000 |
JSON |
Question Answering, Reading Comprehension |
2018 |
Rajpurkar et al. |
|
| Stack Overflow BigQuery Dataset |
01.15.20 |
English |
BigQuery dataset includes an archive of Stack Overflow content, including posts, votes,
tags,
and badges. |
n/a |
n/a |
Text Corpora |
2018 |
Stack Overflow |
|
| Stanford Natural Language Inference (SNLI) Corpus |
01.15.20 |
English |
Image captions matched with newly constructed sentences to form entailment,
contradiction, or
neutral pairs. |
570,000 |
Text |
Inference, Entailment |
2015 |
Bowman et al. |
|
| The Corpus of Linguistic Acceptability |
01.15.20 |
English |
Dataset used to classify sentences as grammatical or not grammatical. |
10,657 |
TSV |
Grammatical Acceptability |
2018 |
Warstadt et al. |
|
| The Dialog-based Language Learning Dataset |
01.15.20 |
English |
Dataset was designed to measure how well models can perform at learning as a student
given a
teacher’s textual responses to the student’s answer. |
n/a |
Text |
Question Answering, Reading Comprehension |
2016 |
Weston |
|
| The Irish Times IRS |
01.15.20 |
English |
Dataset contains 23 years of events from Ireland. |
1,425,460 |
CSV |
Clustering, Events, Language Detection |
2018 |
Kulkarni |
|
| The Movie Dialog Dataset |
01.15.20 |
English |
Dataset measures how well models can perform at goal and non-goal orientated dialogue
centered
around the topic of movies (question answering, recommendation and discussion). |
~3.5M |
Text |
Question Answering, Reading Comprehension |
2016 |
Dodge et al. |
|
| The Penn Treebank Project |
01.15.20 |
English |
Naturally occurring text annotated for linguistic structure. |
~1M words |
Text |
POS |
1995 |
Marcus et al. |
|
| The SimpleQuestions Dataset |
01.15.20 |
English |
Dataset for question answering with human generated questions paired with a
corresponding fact,
formatted as (subject, relationship, object), that provides the answer but also a
complete
explanation. |
108,442 |
Text |
Question Answering, Reading Comprehension |
2015 |
Bordes et al. |
|
| The Stanford Sentiment Treebank (SST) |
01.15.20 |
English |
Sentence sentiment classification of movie reviews. |
69,000 |
PTB |
Sentiment Analysis |
2013 |
Socher et al. |
|
| The Story Cloze Test / ROCStories |
01.15.20 |
English |
Dataset for story understanding that provides systems with four-sentence stories and
two possible
endings. The systems must then choose the correct ending to the story. |
100,000+ |
JSON |
Question Answering, Reading Comprehension |
2017 |
Mostafazadeh et al. |
|
| The WikiMovies Dataset |
01.15.20 |
English |
Dataset contains only the QA part of the Movie Dialog dataset, but using three different
settings
of knowledge: using a traditional knowledge base (KB), using Wikipedia as the source of
knowledge,
or using IE (information extraction) over Wikipedia. |
~100,000 |
Text |
Question Answering, Reading Comprehension |
2016 |
Miller et al. |
|
| The Winograd Schema Challenge |
01.15.20 |
English |
Dataset to determine the correct referent of the pronoun from among the provided
choices. |
150 |
XML |
Coreference Resolution |
2012 |
Levesque et al. |
|
| Topical-Chat |
01.15.20 |
English |
A knowledge-grounded human-human conversation dataset where the underlying knowledge
spans 8
broad topics and conversation partners don’t have explicitly defined roles. |
10,784 |
JSON |
Dialogue |
2019 |
Gopalakrishnan et al. |
|
| T-REx |
01.15.20 |
English |
Dataset contains Wikipedia abstracts aligned with Wikidata entities. |
11M aligned triples |
JSON and NIF |
Relation Extraction |
2018 |
Elsahar et al. |
|
| Twenty Newsgroups Dataset |
01.15.20 |
English |
Messages from 20 different newsgroups. |
20,000 |
Text |
Classification, Clustering |
1999 |
Mitchell et al. |
|
| Twitter Dataset for Arabic Sentiment Analysis |
01.15.20 |
Arabic |
Dataset contains Arabic tweets. |
2,000 |
Text |
Classification |
2014 |
Abdulla |
|
| Twitter US Airline Sentiment |
01.15.20 |
English |
Contributors were asked to classify positive, negative, and neutral tweets, followed by
categorizing
negative reasons. |
14,500 |
CSV |
Classification, Sentiment Analysis |
2016 |
Figure Eight |
|
| Twitter100k |
01.15.20 |
English |
Pairs of images and tweets. |
100,000 |
Text and Images |
Multi-Modal Learning |
2017 |
Hu et al. |
|
| Ubuntu Dialogue Corpus |
01.15.20 |
English |
Dialogues extracted from Ubuntu chat stream on IRC. |
930,000 |
CSV |
Text Corpora, Dialogue |
2015 |
Lowe et al. |
|
| Urban Dictionary Dataset |
01.15.20 |
English |
Corpus of words, votes and definitions. |
2,606,522 |
CSV |
Reading Comprehension |
2016-05 |
Anonymous |
|
| UseNet Corpus |
01.15.20 |
English |
UseNet forum postings. |
7B |
Text |
Dialogue |
2011 |
Shaoul et al. |
|
| Web of Science Dataset |
01.15.20 |
English |
Hierarchical Datasets for Text Classification. |
46,985 |
Text |
Classification |
2017 |
Kowsari et al. |
|
| WikiLinks |
01.15.20 |
English |
Dataset contains 40 million mentions over 3 million entities based on hyperlinks from
Wikipedia. |
~10M |
Text |
Text Corpora |
2012 |
Singh et al. |
|
| WikiQA Corpus |
01.15.20 |
English |
Dataset contains Bing query logs as the question source. Each question is linked to a
Wikipedia
page that potentially has the answer. |
3,047 |
TSV |
Question Answering, Reading Comprehension |
2015 |
Yang et al. |
|
| Winogender Schemas |
01.15.20 |
English |
Dataset with pairs of sentences that differ only by the gender of one pronoun in the
sentence,
designed to test for the presence of gender bias in automated coreference resolution
systems.
|
720 |
TSV |
Coreference Resolution |
2018 |
Rudinger et al. |
|
| WMT 14 English-German |
01.15.20 |
Multi-Lingual |
Sentence pairs for translation. |
4.5M |
Text |
Machine Translation |
2015 |
Stanford |
|
| WMT 15 English-Czech |
01.15.20 |
Multi-Lingual |
Sentence pairs for translation. |
15.8M |
Text |
Machine Translation |
2016 |
Stanford |
|
| WMT 19 Multiple Datasets |
01.15.20 |
Multi-Lingual |
Multiple text corpora in multiple languages. |
n/a |
Text |
Text Corpora, Machine Translation |
2019 |
ACL Workshop |
|
| Words in Context |
01.15.20 |
English |
Dataset for evaluating contextualized word representations. |
2,400 |
Text |
Word Sense Disambiguation |
2019 |
Pilehvar et al. |
|
| Worldwide News - Aggregate of 20K Feeds |
01.15.20 |
Multi-Lingual |
One week snapshot of all online headlines in 20+ languages. |
1,398,431 |
CSV |
Clustering, Events, Machine Translation |
2017 |
Kulkarni |
|
| Yahoo! Music User Ratings of Musical Artists |
01.15.20 |
English |
Over 10M ratings of artists by Yahoo users. May be used to validate recommender systems
or collaborative
filtering algorithms. |
~10M |
Text |
Clustering, PCA |
2004 |
Yahoo! |
|
| YouTube Comedy Slam Preference Dataset |
01.15.20 |
English |
User vote data for pairs of videos shown on YouTube. Users voted on funnier videos.
|
1,138,562 |
Text |
Classification |
2012 |
Google |
|
| Visual QA (VQA) |
01.15.20 |
English |
Dataset containing open-ended questions about images. These questions require an
understanding
of vision, language and commonsense to answer. |
265,016 images |
JSON |
Visual Question Answering |
2015 |
Antol et al. |
|
| TrecQA |
01.15.20 |
English |
Dataset is commonly used for evaluating answer selection in question answering. |
n/a |
XML |
Question Answering, Reading Comprehension |
2007 |
Wang et al. |
|
| TriviaQA |
01.15.20 |
English |
Dataset containing over 650K question-answer-evidence triples. It includes 95K QA pairs
authored
by trivia enthusiasts and independently gathered evidence documents, 6 per question on
average.
|
650,000+ |
JSON |
Question Answering, Reading Comprehension |
2017 |
Joshi et al. |
|
| SemEvalCQA |
01.15.20 |
Arabic, English |
Dataset for community question answering. |
n/a |
XML |
Question Answering, Reading Comprehension |
2016 |
Nakov et al. |
|
| QuaRTz Dataset |
01.15.20 |
English |
Dataset contains 3,864 questions about open domain qualitative relationships. Each
question is
paired with one of 405 different background sentences (sometimes short paragraphs).
|
3,864 |
JSON |
Question Answering, Reading Comprehension |
2019 |
Tafjord et al. |
|
| QuaRel Dataset |
01.15.20 |
English |
Dataset contains 2,771 story questions about qualitative relationships. |
2,771 |
JSON |
Question Answering, Reading Comprehension |
2018 |
Tafjord et al. |
|
| OpenBookQA |
01.15.20 |
English |
Dataset modeled after open book exams for assessing human understanding of a subject. It
consists
of 5,957 multiple-choice elementary-level science questions (4,957 train, 500 dev, 500
test),
which probe the understanding of a small "book" of 1,326 core science facts and the
application
of these facts to novel situations. |
5,957 |
JSON |
Question Answering, Reading Comprehension |
2018 |
Mihaylov et al. |
|
| ProPara Dataset |
01.15.20 |
English |
Dataset is used for comprehension of simple paragraphs describing processes, e.g.,
photosynthesis.
The comprehension task relies on predicting, tracking, and answering questions about how
entities change during the process. |
488 |
Google Sheets |
Question Answering, Reading Comprehension |
2018 |
Mishra et al. |
|
| Open Research Corpus |
01.15.20 |
English |
Dataset contains over 39 million published research papers in Computer Science,
Neuroscience,
and Biomedical. |
39M |
JSON |
Text Corpora |
2018 |
Ammar et al. |
|
| SciTail Dataset |
01.15.20 |
English |
Dataset is an entailment dataset created from multiple-choice science exams and web
sentences.
Each question and the correct answer choice are converted into an assertive statement to
form the hypothesis. |
27,026 |
SNLI, TSV, DGEM |
Entailment |
2018 |
Khot et al. |
|
| AI2 Science Questions v2.1 |
01.15.20 |
English |
Dataset consists of questions used in student assessments in the United States across
elementary
and middle school grade levels. Each question is in 4-way multiple-choice format and may or
may not include a diagram element. |
5,060 |
JSON, CSV |
Question Answering, Reading Comprehension |
2017 |
Allen Institute |
|
| SciQ Dataset |
01.15.20 |
English |
Dataset contains 13,679 crowdsourced science exam questions about Physics, Chemistry and
Biology,
among others. The questions are in multiple-choice format with 4 answer options each.
|
13,679 |
JSON |
Question Answering, Reading Comprehension |
2017 |
Welbl et al. |
|
| TupleInf Open IE Dataset |
01.15.20 |
English |
Dataset contains Open IE tuples extracted from 263K sentences that were used by the
solver in
"Answering Complex Questions Using Open Information Extraction" (referred to as Tuple KB,
T).
|
263,000 |
Text |
Knowledge Base |
2017 |
Allen Institute |
|
| Textbook Question Answering |
01.15.20 |
English |
The M3C task builds on the popular Visual Question Answering (VQA) and Machine
Comprehension
(MC) paradigms by framing question answering as a machine comprehension task, where the
context
needed to answer questions is provided and composed of both text and images. |
26,620 |
JSON, PNG |
Question Answering, Reading Comprehension, Visual |
2017 |
Kembhavi et al. |
|
| Aristo Tuple KB |
01.15.20 |
English |
Dataset contains a collection of high-precision, domain-targeted
(subject,relation,object) tuples
extracted from text using a high-precision extraction pipeline, and guided by domain
vocabulary
constraints. |
282,594 |
TSV |
Knowledge Base |
2017 |
Dalvi et al. |
|
| AI2 Science Questions Mercury |
01.15.20 |
English |
Dataset consists of questions used in student assessments across elementary and middle
school
grade levels. Includes questions with diagrams and without. |
6,940 |
JSON, JPG |
Reading Comprehension |
2017 |
Allen Institute |
|
| Explanations for Science Questions |
01.15.20 |
English |
Data contains: gold explanation sentences supporting 363 science questions, relation
annotation
for a subset of those explanations, and a graphical annotation tool with annotation
guidelines.
|
1,363 |
CSV |
Question Answering, Reading Comprehension |
2016 |
Jansen et al. |
|
| AQuA |
01.15.20 |
English |
Dataset containing algebraic word problems with rationales for their answers. |
100,000 |
JSON |
Question Answering, Reading Comprehension |
2017 |
Ling et al. |
|
| Who Did What Dataset |
01.15.20 |
English |
Dataset contains over 200,000 fill-in-the-gap (cloze) multiple choice reading
comprehension problems
constructed from the LDC English Gigaword newswire corpus. |
200,000+ |
XML |
Question Answering, Reading Comprehension |
2016 |
Onishi et al. |
|
| Coached Conversational Preference Elicitation |
01.15.20 |
English |
Dataset consisting of 502 English dialogs with 12,000 annotated utterances between a
user and
an assistant discussing movie preferences in natural language. |
12,000 |
JSON |
Dialogue |
2019 |
Radlinski et al. |
|
| Dialogue NLI |
01.15.20 |
English |
The Dialogue NLI dataset consists of sentence pairs labeled as entailment (E), neutral
(N), or
contradiction (C). |
~350,000 |
JSON |
Dialogue, Entailment |
2019 |
Welleck et al. |
|
| Meta-Learning Wizard-of-Oz (MetaLWOz) |
01.15.20 |
English |
Dataset designed to help develop models capable of predicting user responses in unseen
domains.
It was created by crowdsourcing 37,884 goal-oriented dialogs, covering 227 tasks in 47
domains.
|
37,884 |
Text |
Dialogue |
2019 |
Microsoft |
|
| Situations With Adversarial Generations (SWAG) |
01.15.20 |
English |
Dataset consists of 113k multiple choice questions about grounded situations. Each
question is
a video caption from LSMDC or ActivityNet Captions, with four answer choices about what
might
happen next in the scene. |
113,000 |
CSV |
Question Answering, Reading Comprehension |
2018 |
Zellers et al. |
|
| A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs (DROP)
|
01.15.20 |
English |
Dataset is used to resolve references in a question, perhaps to multiple input
positions, and
perform discrete operations over them (such as addition, counting, or sorting). |
96,000 |
JSON |
Question Answering, Reading Comprehension |
2019 |
Dua et al. |
|
| Cosmos QA |
01.15.20 |
English |
Dataset containing thousands of problems that require commonsense-based reading
comprehension,
formulated as multiple-choice questions. |
35,000 |
CSV |
Question Answering, Reading Comprehension, Commonsense |
2019 |
Huang et al. |
|
| COmmonsense Dataset Adversarially-authored by Humans (CODAH) |
01.15.20 |
English |
Commonsense QA in the sentence completion style of SWAG. As opposed to other
automatically generated
NLI datasets, CODAH is adversarially constructed by humans who can view feedback from a
pre-trained
model and use this information to design challenging commonsense questions. |
2,776 |
TSV |
Question Answering, Reading Comprehension, Commonsense |
2019 |
Chen et al. |
|
| Deal or No Deal? End-to-End Learning for Negotiation Dialogues |
01.15.20 |
English |
This dataset consists of 5,808 dialogues, based on 2,236 unique scenarios dealing with
negotiations
and complex communication. |
5,808 |
Text |
Dialogue |
2017 |
Lewis et al. |
|
| CommonsenseQA |
01.15.20 |
English |
Multiple-choice question answering dataset that requires different
types of
commonsense knowledge to predict the correct answers. It contains 12,102 questions with
one correct answer and four distractor answers. |
12,102 |
JSON |
Question Answering, Reading Comprehension, Commonsense |
2018 |
Talmor et al. |
|
| ComplexWebQuestions |
01.15.20 |
English |
Dataset contains a large set of complex questions in natural language, and can be used
in multiple
ways. |
34,689 |
JSON |
Question Answering, Reading Comprehension |
2018 |
Talmor et al. |
|
| DVQA |
01.15.20 |
English |
Dataset containing data visualizations and natural language questions. |
3,487,194 |
JSON, PNG |
Question Answering, Visual, Commonsense |
2018 |
Kafle et al. |
|
| Social-IQ Dataset |
01.15.20 |
English |
Dataset containing videos and natural language questions for visual reasoning. |
7,500 |
n/a |
Question Answering, Visual, Commonsense |
2019 |
Zadeh et al. |
|
| TextVQA |
01.15.20 |
English |
TextVQA requires models to read and reason about text in images to answer questions
about them.
Specifically, models need to incorporate a new modality of text present in the images
and
reason over it to answer TextVQA questions. |
36,602 |
JSON, PNG |
Question Answering, Visual, Commonsense |
2019 |
Singh et al. |
|
| Natural Questions (NQ) |
01.15.20 |
English |
Dataset contains questions from real users, and it requires QA systems to read and
comprehend
an entire Wikipedia article that may or may not contain the answer to the question.
|
320,000+ |
HTML |
Question Answering, Reading Comprehension |
2019 |
Kwiatkowski et al. |
|
| Activitynet-QA |
01.15.20 |
English |
Dataset contains 58,000 human-annotated QA pairs on 5,800 videos derived from the
popular ActivityNet
dataset. The dataset provides a benchmark for testing the performance of VideoQA models
on
long-term spatio-temporal reasoning. |
58,000 |
JSON |
Question Answering, Visual, Commonsense |
2019 |
Yu et al. |
|
| GQA |
01.15.20 |
English |
Question answering on image scene graphs. |
22M |
JSON, H5 |
Question Answering, Visual, Commonsense |
2019 |
Hudson et al. |
|
| Visual Commonsense Reasoning (VCR) |
01.15.20 |
English |
Dataset contains 290K multiple-choice questions on 110K images. |
290,000 |
JSON, JPG |
Question Answering, Visual, Commonsense |
2018 |
Zellers et al. |
|
| The Benchmark of Linguistic Minimal Pairs (BLiMP) |
01.15.20 |
English |
BLiMP is a challenge set for evaluating what language models (LMs) know about major
grammatical
phenomena in English. |
67 sub-datasets each with 1,000 minimal pairs |
JSON |
Language Modeling |
2019 |
Warstadt et al. |
|
| An Open Information Extraction Corpus (OPIEC) |
01.15.20 |
English |
OPIEC is an Open Information Extraction (OIE) corpus constructed from the entire
English Wikipedia.
It contains more than 341M triples. |
341M |
AVRO |
Knowledge Base, Information Extraction |
2019 |
Gashteovski et al. |
|
| Total-Text-Dataset |
01.15.20 |
English |
Dataset used to classify curved text in pictures. |
~1,500 |
JPG |
Scene Text Detection |
2019 |
Ch'ng et al. |
|
| Schema-Guided Dialogue State Tracking (DSTC 8) |
01.15.20 |
English |
Dataset contains 18K dialogues between a virtual assistant and a user. |
~18,000 |
JSON |
Dialogue State Tracking |
2019 |
Rastogi et al. |
|
| TabFact |
01.15.20 |
English |
Dataset contains 16k Wikipedia tables as evidence for 118k human annotated statements to
study
fact verification with semi-structured evidence. |
16,000 |
JSON |
Natural Language Inference |
2020 |
Chen et al. |
|
| Taskmaster-1 |
01.15.20 |
English |
Dataset contains 13,215 task-based dialogs, including 5,507 spoken and 7,708 written
dialogs
created with two distinct procedures. Each conversation falls into one of six domains:
ordering
pizza, creating auto repair appointments, setting up ride service, ordering movie
tickets,
ordering coffee drinks and making restaurant reservations. |
13,215 |
JSON |
Dialogue |
2019 |
Byrne et al. |
|
| OpenWebTextCorpus |
01.15.20 |
English |
Dataset contains the text of millions of webpages from URLs shared on Reddit, totalling 38 GB of
text data.
|
8,013,769 |
n/a |
Text Corpora |
2019 |
Radford et al. |
|
| NLP Chinese Corpus |
01.15.20 |
Chinese |
Large text corpora in Chinese. |
10M+ |
JSON |
Text Corpora |
2019 |
Xu et al. |
|
| Ten Thousand German News Articles Dataset (10kGNAD) |
01.15.20 |
German |
Dataset consists of 10,273 German-language news articles from an Austrian online
newspaper categorized
into nine topics. |
10,273 |
CSV |
Text Corpora |
2019 |
Timo Block |
|
| Tencent AI Lab Embedding Corpus |
01.15.20 |
Chinese |
Dataset provides 200-dimension vector representations, a.k.a. embeddings, for over 8
million
Chinese words and phrases. |
8M |
Text |
Embeddings |
2018 |
Song et al. |
|
| CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) |
01.15.20 |
English |
Dataset contains more than 23,500 sentence utterance videos from more than 1,000 online
YouTube
speakers. The dataset is gender balanced. All the sentence utterances are randomly
chosen
from various topics and monologue videos. |
23,500 |
n/a |
Sentiment Analysis, Emotion Recognition, Visual |
2018 |
MultiComp Lab |
|
| Cornell Newsroom |
01.15.20 |
English |
Dataset contains 1.3 million articles and summaries written by authors and editors in
the newsrooms
of 38 major publications. The summaries are obtained from search and social metadata
between
1998 and 2017. |
1.3M |
JSON |
Text Corpora, Summarization |
2018 |
Grusky et al. |
|
| WikiHow |
01.15.20 |
English |
Dataset contains article and summary pairs extracted and constructed from an online
knowledge
base written by different human authors. |
230,000+ |
Text |
Text Corpora, Summarization |
2018 |
Koupaee et al. |
|
| Multi-Domain Wizard-of-Oz Dataset (MultiWoz) |
01.15.20 |
English |
Dataset of human-human written conversations spanning multiple domains and topics.
The dataset
was collected based on the Wizard of Oz experiment on Amazon MTurk. |
10,438 |
JSON |
Dialogue |
2018 |
Budzianowski et al. |
|
| Spider 1.0 |
01.15.20 |
English |
Dataset consists of 10,181 questions and 5,693 unique complex SQL queries on 200
databases with
multiple tables covering 138 different domains. |
10,181 |
JSON, SQL |
Semantic Parsing, Text-to-SQL |
2018 |
Yu et al. |
|
| Semantic Parsing in Context (SParC) |
01.15.20 |
English |
Dataset consists of 4,298 coherent question sequences (12k+ unique individual questions
annotated
with SQL queries). It is the context-dependent/multi-turn version of
the Spider
task. |
4,298 |
JSON, SQL |
Semantic Parsing, Text-to-SQL |
2019 |
Yu et al. |
|
| Conversational Text-to-SQL Systems (CoSQL) |
01.15.20 |
English |
Dataset consists of 30k+ turns plus 10k+ annotated SQL queries, obtained from
a Wizard-of-Oz collection
of 3k dialogues querying 200 complex databases spanning 138 domains. It is the dialogue
version
of the Spider and SParC tasks. |
3,000 |
JSON, SQL |
Dialogue, Text-to-SQL |
2019 |
Yu et al. |
|
| Conceptual Captions |
01.15.20 |
English |
Dataset contains ~3.3M images annotated with captions to be used for the task of
automatically
producing a natural-language description for an image. |
3,318,333 |
TSV |
Automatic Image Captioning |
2018 |
Sharma et al. |
|
| Yelp Open Dataset |
01.15.20 |
English |
Dataset containing millions of reviews on Yelp. In addition it contains business data
including
location data, attributes, and categories. |
6,685,900 |
JSON |
Classification, Sentiment Analysis |
2015 |
Yelp |
|
| Stanford Natural Language Inference (SNLI) Corpus |
01.15.20 |
English |
Dataset containing human-written sentence pairs manually labeled as
entailment, contradiction,
or neutral. |
570,000 |
JSON |
Natural Language Inference, Entailment |
2015 |
Bowman et al. |
|
| Common Voice |
01.15.20 |
Multi-Lingual |
Dataset containing audio in 29 languages and 2,454 recorded hours. |
n/a |
MP3 |
Speech Recognition |
2019 |
Mozilla |
|
| Voices Obscured in Complex Environmental Settings (VOiCES) |
01.15.20 |
English |
Dataset contains a total of 15 hours (3,903 audio files) of male and female read speech.
|
n/a |
Wav |
Speech Recognition |
2018 |
Various |
|
| VoxCeleb |
01.15.20 |
Multi-Lingual |
An audio-visual dataset consisting of short clips of human speech, extracted from
interview videos
uploaded to YouTube. |
n/a |
MD5, URL |
Speech Recognition, Visual |
2017 |
Nagrani et al. |
|
| AudioSet |
01.15.20 |
Multi-Lingual |
Dataset consists of an expanding ontology of 632 audio event classes and a collection of
2,084,320
human-labeled 10-second sound clips drawn from YouTube videos. |
n/a |
CSV, TFR |
Speech Recognition, Visual |
2017 |
Google |
|
| LibriSpeech ASR |
01.15.20 |
English |
Large-scale (1000 hours) corpus of read English speech. |
n/a |
FLAC |
Speech Recognition |
2015 |
OpenSLR |
|
| Dataset for Fill-in-the-Blank Humor |
01.15.20 |
English |
Dataset contains 50 fill-in-the-blank stories similar in style to Mad Libs. The blanks
in these
stories include the original word and the hint type (e.g. animal, food, noun, adverb).
|
50 |
JSON |
Text Generation |
2017 |
Hossain et al. |
|
| MSParS |
01.15.20 |
English |
Dataset for the open domain semantic parsing task. |
81,826 |
Satori |
Semantic Parsing |
2019 |
Microsoft |
|
| Microsoft Speech Corpus |
01.15.20 |
Telugu, Tamil, Gujarati |
Dataset contains conversational and phrasal speech training and test data for Telugu,
Tamil and
Gujarati languages. |
n/a |
Wav |
Speech Recognition |
2019 |
Microsoft |
|
| Microsoft Information-Seeking Conversation (MISC) dataset |
01.15.20 |
English |
Dataset contains recordings of information-seeking conversations between human “seekers”
and
“intermediaries”. It includes audio and video signals; transcripts of conversation;
affectual
and physiological signals; recordings of search and other computer use; and post-task
surveys
on emotion, success, and effort. |
n/a |
various |
Speech Recognition, Dialogue, Visual |
2018 |
Microsoft |
|
| Microsoft Speech Language Translation Corpus (MSLT) |
01.15.20 |
Multi-Lingual |
Dataset contains conversational, bilingual speech test and tuning data for English,
Chinese,
and Japanese. It includes audio data, transcripts, and translations; and allows
end-to-end
testing of spoken language translation systems on real-world data. |
n/a |
Wav |
Speech Recognition, Machine Translation |
2017 |
Microsoft |
|
| WebQuestions Semantic Parses Dataset |
01.15.20 |
English |
Dataset contains full semantic parses in SPARQL queries for 4,737 questions, and
“partial” annotations
for the remaining 1,073 questions for which a valid parse could not be formulated or
where
the question itself is bad or needs a descriptive answer. |
5,810 |
JSON |
Semantic Parsing |
2016 |
Yih et al. |
|
| MutualFriends |
01.15.20 |
English |
Task where two agents must discover which friend of theirs is mutual based on the
friend's attributes.
|
n/a |
JSON |
Dialogue |
2017 |
He et al. |
|
| Personalized Dialog |
01.15.20 |
English |
Dataset of dialogs from movie scripts. |
12,000 |
Text |
Dialogue |
2017 |
Joshi et al. |
|
| The Conversational Intelligence Challenge 2 (ConvAI2) |
01.15.20 |
English |
A chit-chat dataset based on the PersonaChat dataset. |
3,127 |
JSON |
Dialogue |
2018 |
NeurIPS |
|
| IIT Bombay English-Hindi Corpus |
01.21.20 |
English, Hindi |
Dataset contains a parallel corpus for English-Hindi as well as a monolingual Hindi corpus
collected
from a variety of existing sources. |
1.49M |
n/a |
Machine Translation |
2018 |
Kunchukuttan et al. |
|
| DiscoFuse |
01.21.20 |
English |
Dataset contains examples for training sentence fusion models. Sentence fusion is the
task of
joining several independent sentences into a single coherent text. The data has been
collected
from Wikipedia and from Sports articles. |
~60M |
TSV |
Sentence Fusion |
2019 |
Geva et al. |
|
| Relation Extraction Corpus |
01.21.20 |
English |
A human-judged dataset of two relations involving public figures on Wikipedia: about
10,000 examples
of "place of birth" and 40,000 examples of "attended or graduated from an institution."
|
10,000 |
JSON |
Relation Extraction |
2013 |
Google |
|
| Argentinian Spanish [es-ar] Speech Multi-Speaker Dataset |
01.21.20 |
Spanish (Argentinian) |
Speech dataset containing about 5,900 transcribed high-quality audio recordings of Argentinian
Spanish
[es-ar] sentences recorded by volunteers. |
~5,900 |
Wav |
Speech Recognition |
2018 |
Google |
|
| Switchboard Dialogue Act Corpus (SwDA) |
01.21.20 |
English |
A subset of the Switchboard-1 corpus consisting of 1,155 conversations and 42 tags |
1,155 |
UTT |
Dialogue Act Classification |
1997 |
Bates et al. |
|
| Social Media Mining for Health (SMM4H) |
01.21.20 |
English |
Dataset contains tweets for medication-related text classification and concept
normalization.
|
25,678 |
Text |
Classification |
2018 |
Sarker et al. |
|
| Event2Mind |
01.21.20 |
English |
Dataset contains 25,000 events and free-form descriptions of their intents and reactions
|
25,000 |
CSV |
Commonsense Inference |
2018 |
Rashkin et al. |
|
| OntoNotes 5.0 |
01.21.20 |
Multi-Lingual |
Dataset contains various genres of text (news, conversational telephone speech, weblogs,
usenet
newsgroups, broadcast, talk shows) in three languages (English, Chinese, and Arabic)
with
structural information (syntax and predicate argument structure) and shallow semantics
(word
sense linked to an ontology and coreference). |
n/a |
Text, SQL |
Information Retrieval, Syntactic Parsing |
2013 |
Weischedel et al. |
|
| Dataset for Intent Classification and Out-of-Scope Prediction |
01.21.20 |
English |
Dataset is a benchmark for evaluating intent classification systems for dialog systems /
chatbots
in the presence of out-of-scope queries. |
23,000+ |
JSON |
Intent Classification |
2019 |
Larson et al. |
|
| Soccer Dialogues |
01.21.20 |
English |
Dataset contains soccer dialogues over a knowledge graph |
2,890 |
JSON |
Knowledge Graphs, Dialogue |
2019 |
SDA Lab, Uni. of Bonn & Volkswagen Research |
|
| Dutch Book Reviews |
01.21.20 |
Dutch |
Dataset contains book reviews along with associated binary sentiment polarity labels.
|
118,516 |
Text |
Classification, Sentiment Analysis |
2019 |
van der Burgh |
|
| Portuguese Newswire Corpus |
01.21.20 |
Portuguese (Brazil) |
Dataset contains newswire articles collected between 1994 and 2016.
Requires preprocessing
of HTML pages, found on GitHub via the download link. |
n/a |
HTML |
Text Corpora |
2016 |
Boğaziçi University |
|
| Portuguese SQuAD v1.1 |
01.21.20 |
Portuguese |
Portuguese translation of the SQuAD dataset. The translation was performed using the
Google Cloud
API. |
~100,000 |
JSON |
Question Answering, Reading Comprehension |
2019 |
Carvalho et al. |
|
| Paraphrase Adversaries from Word Scrambling (PAWS-X) |
01.21.20 |
Multi-Lingual |
Dataset contains 23,659 human translated PAWS evaluation pairs and 296,406 machine
translated
training pairs in six typologically distinct languages: French, Spanish, German,
Chinese,
Japanese, and Korean. All translated pairs are sourced from examples in PAWS-Wiki. |
300,000+ |
TSV |
Paraphrasing Identification |
2019 |
Yang et al. |
|
| Paraphrase Adversaries from Word Scrambling (PAWS) |
01.21.20 |
English |
Dataset contains 108,463 human-labeled and 656k noisily labeled pairs that feature the
importance
of modeling structure, context, and word order information for the problem of paraphrase
identification. |
750,000+ |
TSV |
Paraphrasing Identification |
2019 |
Zhang et al. |
|
| Event-focused Emotion Corpora for German and English |
01.21.20 |
English, German |
German and English emotion corpora for emotion classification, annotated with
crowdsourcing in
the style of the ISEAR resources. |
2,002 |
TSV |
Text Corpora, Emotion Classification |
2019 |
Troiano et al. |
|
| Classify Emotional Relationships of Fictional Characters |
01.21.20 |
English |
Dataset contains 19 short stories that are shorter than 1,500 words, and depict at least
four
different characters. |
19 |
Text |
Text Corpora, Emotion Classification |
2019 |
Kim et al. |
|
| Affective Text |
01.21.20 |
English |
Classification of emotions in 250 news headlines. Categories: anger, disgust, fear, joy,
happiness,
sadness, surprise. |
250 |
SGML, Text |
Emotion Classification |
2007 |
Strapparava et al. |
|
| The Emotion in Text |
01.21.20 |
English |
Dataset of tweets labelled with emotion. Categories: empty, sadness, enthusiasm,
neutral, worry,
love, fun, hate, happiness, relief, boredom, surprise, anger. |
40,000 |
CSV |
Emotion Classification |
2016 |
CrowdFlower |
|
| DailyDialog |
01.21.20 |
English |
A manually labelled conversations dataset. Categories: no emotion, anger, disgust, fear,
happiness,
sadness, surprise. |
13,118 |
Text |
Emotion Classification |
2017 |
Li et al. |
|
| Emotion-Stimulus |
01.21.20 |
English |
Dataset annotated with both the emotion and the stimulus using FrameNet’s
emotions-directed frame.
820 sentences with both cause and emotion and 1,594 sentences marked with their emotion
tag.
Categories: happiness, sadness, anger, fear, surprise, disgust and shame. |
2,414 |
XML |
Emotion Classification |
2015 |
Ghazi et al. |
|
| Open Super-Large Crawled Almanach Corpus (OSCAR) |
01.29.20 |
Multi-Lingual |
Multilingual corpus obtained by language classification and filtering of the Common
Crawl corpus using the goclassy architecture. 166 different languages are available. |
n/a |
Text |
Text Corpora |
2019 |
Suárez et al. |
|
| Dialogue Natural Language Inference (NLI) |
01.29.20 |
English |
Dataset used to improve the consistency of a dialogue model. It consists of sentence
pairs labeled as entailment (E), neutral (N), or contradiction (C). |
340,000+ |
JSON |
Dialogue, Entailment |
2019 |
Welleck et al. |
|
| Fact-based Visual Question Answering (FVQA) |
01.29.20 |
English |
Dataset contains image-question-answer triples. |
5,826 questions; 2,190 images |
JSON |
Question Answering, Visual |
2017 |
Wang et al. |
|
| OpenSubtitles |
01.29.20 |
Multi-Lingual |
Dataset of multi-lingual dialogs from movie scripts. Includes 62 languages. |
n/a |
XML, XCES |
Dialogue |
2016 |
Tiedemann et al. |
|
| A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning (CLEVR
& CoGenT) |
01.29.20 |
English |
Visual question answering dataset containing 100,000 images and 999,968 questions. |
999,968 questions; 100,000 images |
JSON |
Question Answering, Visual |
2016 |
Johnson et al. |
|
| InsuranceQA |
01.29.20 |
English |
Dataset contains questions and answers collected from the website Insurance Library. It
consists of questions from real-world users; the high-quality answers were composed
by professionals with deep domain knowledge. There are 16,889 questions in total. |
16,889 |
n/a |
Question Answering, Reading Comprehension |
2015 |
Feng et al. |
|
| VisDial |
01.29.20 |
English |
Dataset contains images from the COCO training set, and dialogues. Meant for
training models to answer questions about images during a conversation. Contains
1.2M dialog question-answers. |
1.2M |
JSON |
Question Answering, Visual, Dialogue |
2017 |
Das et al. |
|
| Cornell Natural Language for Visual Reasoning (NLVR and NLVR2) |
01.29.20 |
English |
Dataset contains two language grounding datasets containing natural language sentences
grounded in images. The task is to determine whether a sentence is true about a visual
input. |
NLVR2 107,292; NLVR 92,244 |
JSON |
Question Answering, Visual |
2019 |
Suhr et al. |
|
| Simplified Versions of the CommAI Navigation tasks (SCAN) |
01.29.20 |
English |
Dataset used for studying compositional learning and zero-shot generalization. SCAN
consists of a set of commands and their corresponding action sequences. |
20,000+ |
Text |
Compositional Learning |
2018 |
Lake et al. |
|
| Twitter Chat Corpus |
01.29.20 |
English |
Dataset contains Twitter question-answer pairs. |
5M |
Text |
Dialogue |
2017 |
Marsan Ma |
|
| Common Objects in Context (COCO) |
01.29.20 |
English |
COCO is a large-scale object detection, segmentation, and captioning dataset. Dataset
contains 330K images (>200K labeled), 1.5 million object instances, 80 object categories,
91 stuff categories, 5 captions per image. |
330,000 |
JSON, JPG |
Automatic Image Captioning |
2014 |
Lin et al. |
|
| QA-SRL Bank |
01.29.20 |
English |
Dataset contains question-answer pairs for 64,000 sentences. Dataset is used to train
models for semantic role labeling. |
64,000 |
JSON |
Question Answering, Semantic Role Labeling |
2018 |
FitzGerald et al. |
|
| Web Inventory of Transcribed and Translated Talks (WIT3) |
01.29.20 |
Multi-Lingual |
Dataset contains a collection of transcribed and translated talks. The core of the
dataset is from the TED Talks corpus. As of 2016, it covers 109 languages. |
n/a |
XML |
Machine Translation |
2012 |
Cettolo et al. |
|
| QA-ZRE |
01.29.20 |
English |
Dataset contains question-answer pairs with each instance containing a relation, a
question, a sentence, and an answer set. |
30M |
Text |
Question Answering, Relation Extraction |
2017 |
Levy et al. |
|
| EmpatheticDialogues |
01.29.20 |
English |
Dataset of 25k conversations grounded in emotional situations. |
25,000 |
CSV |
Dialogue |
2019 |
Rashkin et al. |
|
| OneCommon |
01.29.20 |
English |
Dataset contains 6,760 dialogues. |
6,760 |
JSON |
Dialogue |
2019 |
Udagawa et al. |
|
| WikiReading |
01.29.20 |
Multi-Lingual |
The task is to predict textual values from the structured knowledge base Wikidata by
reading the text of the corresponding Wikipedia articles. Includes English,
Russian and Turkish. |
18M |
JSON |
Knowledge Base, NLU |
2016 |
Hewlett & Kenter et al. |
|
| EmoBank |
01.29.20 |
English |
Dataset is a large-scale text corpus manually annotated with emotion according to the
psychological Valence-Arousal-Dominance scheme. |
10,000 |
CSV |
Text Corpora |
2017 |
Buechel et al. |
|
| Irony Sarcasm Analysis Corpus |
01.29.20 |
English |
Dataset contains tweets in 4 subgroups: irony, sarcasm, regular and figurative.
Requires using Twitter API in order to obtain tweets. |
33,000 |
TSV |
Classification, Sentiment Analysis |
2016 |
Ling et al. |
|
| Sentiment Corpus of App Reviews with Fine-grained Annotations in German (SCARE) |
01.29.20 |
German |
Dataset consists of fine-grained annotations for mobile application reviews from the
Google Play Store. For each user review the mentioned application aspects, i.e., the
design or the usability, as well as subjective phrases, which evaluate these aspects,
are annotated. In addition, the polarity (positive, negative or neutral) of each
subjective phrase is recorded as well as the relationship of an aspect to the main app
in discussion. Requires emailing source for password to retrieve data. |
800,000 |
CSV |
Sentiment Analysis |
2016 |
Sänger et al. |
|
| ReVerb45k, Base and Ambiguous |
01.29.20 |
English |
3 Datasets. In total, there are 91K triples. |
91,000 |
JSON |
Information Retrieval, Knowledge Base |
2018 |
Vashishth et al. |
|
| Abductive Natural Language Inference (aNLI) |
01.29.20 |
English |
Dataset is a binary-classification task: the goal is to pick the most plausible
explanatory hypothesis given two observations from narrative contexts. It contains 20k
commonsense narrative contexts and 200k explanations. |
20,000 |
JSON |
Classification, Commonsense |
2019 |
Bhagavatula et al. |
|
| Physical IQA |
01.29.20 |
English |
Dataset is a commonsense QA benchmark for naive physics reasoning, focusing on
how we interact with everyday objects in everyday situations. The dataset includes
20,000 QA pairs that are either multiple-choice or true/false questions. |
20,000 |
JSON |
Question Answering, Commonsense |
2019 |
Bisk et al. |
|
| Social IQA |
01.29.20 |
English |
Dataset is a question-answering benchmark for testing social commonsense
intelligence. |
37,000+ |
JSON |
Question Answering, Commonsense |
2019 |
Sap et al. |
|
| WinoGrande |
01.29.20 |
English |
Formulated as a fill-in-a-blank task with binary options, the goal is to choose the
right option for a given sentence which requires commonsense reasoning. |
44,000 |
JSON |
Commonsense Reasoning |
2019 |
Sakaguchi et al. |
|
| HellaSwag |
01.29.20 |
English |
Dataset for studying grounded commonsense inference. It consists of 70k multiple choice
questions about grounded situations: each question comes from one of two domains --
ActivityNet or WikiHow -- with four answer choices about what might happen next in the
scene. |
70,000 |
JSON |
Commonsense Reasoning |
2019 |
Zellers et al. |
|