Hundreds of datasets for machine learning developers (and counting)
With the advent of deep learning and the need for larger and more diverse data, researchers are constantly hunting for up-to-date datasets to help train their ML models. Currently, NLP data is scattered across third-party libraries, Reddit, and the research arms of big tech. While these sources are useful, there is no central hub for NLP data that an ML engineer can easily reach and search. As a result, we’ve created the “Big Bad NLP Database,” the world’s largest data library in natural language processing:
We have been working on this project for several weeks. To date, we have amassed nearly 200 datasets (and counting) covering various domains in NLP. Beyond the benchmark datasets found in GLUE and SuperGLUE, we have included everything from the enormous Common Crawl to the classic Penn Treebank.
We didn’t focus solely on classic NLP tasks either. Alongside standards for classification and question answering, we’ve also covered datasets for text-to-SQL, speech recognition, and multi-modal tasks (text and images).
We want this library to be community-driven, so if you have a new dataset or want to recommend an edit, please use the contact link on the webpage, and we’ll get back to you ASAP.
Lastly, we are always looking for datasets that diversify language coverage (most datasets are in English) to make our library more global. Please don’t hesitate to contact us if there is a non-English dataset we should look into. We hope you find this library useful in your development endeavors.
Ricky Costa | Quantum Stat | www.quantumstat.com