MLCommons and Hugging Face collaborate to launch a substantial speech dataset for AI research

MLCommons, a nonprofit AI safety working group, has partnered with AI development platform Hugging Face to release one of the world’s largest collections of public domain voice recordings for AI research. The dataset, called Unsupervised People’s Speech, contains more than a million hours of audio spanning at least 89 languages. MLCommons says it created the dataset to support research and development in various areas of speech technology.

Supporting Natural Language Processing Research

MLCommons highlighted the importance of supporting broader natural language processing research for languages other than English to make communication technologies accessible to more people globally. The organization sees several opportunities for the research community to enhance low-resource language speech models, improve speech recognition across various accents and dialects, and explore new applications in speech synthesis.

Challenges and Risks

While the goal of Unsupervised People’s Speech is commendable, datasets like this carry risks for the researchers who use them. One is biased data: most recordings in the dataset are in American-accented English, so speech recognition and voice synthesis models trained on it may inherit that skew. Another is consent: some recordings may have been included without the knowledge of the people who made them, which raises ethical and privacy concerns.

Ensuring Ethical Practices

MLCommons has committed to updating and maintaining the quality of Unsupervised People’s Speech. Even so, developers are advised to proceed with caution given the dataset’s potential flaws. AI research built on it should respect the rights and privacy of the people whose voices it contains.
