MLCommons and Hugging Face collaborate to launch a substantial speech dataset for AI research

MLCommons, a nonprofit AI safety working group, has partnered with AI development platform Hugging Face to release one of the world’s largest collections of public domain voice recordings for AI research. The dataset, called Unsupervised People’s Speech, contains more than a million hours of audio in at least 89 languages. MLCommons says it created the dataset to support research and development across speech technology.

Supporting Natural Language Processing Research

MLCommons said that supporting natural language processing research for languages other than English helps make communication technologies accessible to more people globally. The organization sees several opportunities for the research community: enhancing speech models for low-resource languages, improving speech recognition across accents and dialects, and exploring new applications in speech synthesis.

Challenges and Risks

While the goal of Unsupervised People’s Speech is commendable, datasets like this carry risks for the researchers who use them. One is biased data: most recordings in the dataset are in American-accented English, so speech recognition and voice synthesis models trained on it could inherit that skew and serve speakers with other accents less well. Another is consent: some recordings may have been included without the knowledge of the people recorded, which raises ethical and privacy concerns.

Ensuring Ethical Practices

MLCommons has committed to updating Unsupervised People’s Speech and maintaining its quality. Even so, developers are advised to proceed with caution given the dataset’s potential flaws. AI research that draws on such recordings should respect the rights and privacy of the people whose voices they contain.