In a significant development for AI research, MLCommons, a nonprofit AI safety organization, has partnered with AI development platform Hugging Face to launch one of the world’s largest collections of public domain voice recordings for AI research. The dataset, named Unsupervised People’s Speech, includes over a million hours of audio in at least 89 languages, designed to support research in natural language processing (NLP) and speech technology. However, while this new dataset has vast potential, it also carries certain risks that developers and researchers need to consider.
The Power and Potential of Unsupervised People’s Speech
Unsupervised People’s Speech is a massive collection of voice recordings curated from Archive.org, the nonprofit platform best known for its Wayback Machine. The project aims to aid research in a range of speech technologies by providing a large-scale dataset for training and developing AI models. One of the key motivations behind the creation of the dataset is to support the advancement of speech technology, particularly for low-resource languages and diverse dialects.
According to MLCommons, this dataset is expected to enhance several areas of AI research, including:
- Improving speech recognition models across different accents and dialects
- Developing more inclusive speech synthesis applications
- Enhancing models for languages beyond English, thus broadening access to communication technologies
This effort aligns with the organization’s goal to ensure that more people around the world benefit from the advancements in natural language processing and AI. By expanding the range of languages covered in speech technology, Unsupervised People’s Speech can pave the way for more equitable AI applications globally.
The Challenges: Biases and Ethical Concerns
While the Unsupervised People’s Speech dataset offers great promise for advancing AI research, it is not without its challenges. A key concern is the potential for bias in the data: because many of the recordings come from English-speaking contributors, primarily in the United States, the majority of the audio is American-accented English. This imbalance could cause problems when training AI systems, particularly for speech recognition and voice synthesis.
For example, models trained using this dataset might struggle to accurately transcribe speech from non-native English speakers or generate synthetic voices in languages that are underrepresented in the dataset. This could hinder the development of inclusive AI systems that can recognize and synthesize speech from a diverse range of users.
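Before training on a corpus like this, the imbalance described above can be checked directly from the metadata. The sketch below is a minimal illustration, assuming a hypothetical "language" metadata field and toy counts; real dataset metadata may use different field names and tags.

```python
from collections import Counter

def language_shares(samples):
    """Return the fraction of samples carrying each language tag.

    `samples` is a list of metadata dicts; the "language" key is a
    hypothetical field name used here for illustration only.
    """
    counts = Counter(s["language"] for s in samples)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}

# Toy metadata illustrating the skew described above: most clips
# are American-accented English (these counts are invented).
metadata = (
    [{"language": "en-US"}] * 80
    + [{"language": "es"}] * 12
    + [{"language": "sw"}] * 8
)

shares = language_shares(metadata)
print(shares)  # en-US dominates the mix
```

A report like this makes the skew concrete before any model sees the audio, which is cheaper than discovering it later through degraded transcription accuracy on underrepresented groups.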
Moreover, there’s a risk that the recordings in the dataset may have been collected from individuals who were not fully informed that their voices would be used for AI research. While MLCommons asserts that all recordings in Unsupervised People’s Speech are either public domain or available under Creative Commons licenses, there’s always the possibility of oversight or errors in the data collection process.
Legal and Ethical Implications: Opt-Out Challenges
Another important issue with large-scale AI datasets is the challenge of ensuring that creators can opt out if they do not wish for their work to be used. Ed Newton-Rex, the CEO of AI ethics nonprofit Fairly Trained, argues that creators should not bear the burden of opting out of AI datasets, as this often involves complicated, confusing, and incomplete methods. Many creators, especially those using platforms like Squarespace, may not even be aware that their work is being used for AI research and development.
Newton-Rex emphasizes that AI datasets should not use creators’ work without proper opt-in mechanisms and clear consent. The lack of universal, straightforward opt-out processes could mean that individuals and creators are unknowingly contributing to AI models that may compete with their own work or use their creations in unintended ways. As the field of generative AI continues to grow, addressing these concerns will be crucial to ensure that AI development remains ethical and respects the rights of creators.
MLCommons’ Commitment to Improving the Dataset
Recognizing these risks, MLCommons has pledged to continuously update and maintain the Unsupervised People’s Speech dataset. The organization says it is committed to addressing biases, improving the quality of the recordings, and keeping the dataset valuable for researchers worldwide.
Even so, developers and researchers should approach the dataset with caution. Given its potential biases and other limitations, it should be used responsibly and alongside other, more diverse datasets that help produce more inclusive AI systems.
Conclusion: A Game-Changer with Cautionary Considerations
Unsupervised People’s Speech represents a major step forward in the development of AI and speech technology. With its vast collection of audio recordings in multiple languages, the dataset offers an invaluable resource for improving speech recognition and synthesis across diverse languages and dialects. However, as with any large-scale AI dataset, it is important to remain mindful of the ethical challenges it presents, including biases, consent issues, and the potential for unintended consequences.
Researchers and developers who choose to use the Unsupervised People’s Speech dataset should be cautious and aware of its limitations. By carefully filtering the data and combining it with other datasets that represent a broader range of voices and accents, the AI community can work towards creating more inclusive and ethical AI systems that serve the global population.
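One simple way to act on the filtering-and-combining advice above is to cap how many clips any single group (a language or accent) contributes to the training mix. The sketch below is a minimal, hypothetical illustration using toy metadata and an invented "lang" field; it is not MLCommons' methodology.

```python
import random
from collections import defaultdict

def cap_per_group(samples, key, cap, seed=0):
    """Downsample any group that exceeds `cap` entries, so no single
    group (e.g. American-accented English) dominates the mix.

    `key` names the metadata field holding the group label; "lang"
    below is a hypothetical field used only for this example.
    """
    groups = defaultdict(list)
    for s in samples:
        groups[s[key]].append(s)
    rng = random.Random(seed)  # fixed seed for reproducibility
    balanced = []
    for label, items in groups.items():
        if len(items) > cap:
            items = rng.sample(items, cap)  # random subset of the big group
        balanced.extend(items)
    return balanced

# Toy example: 90 American-English clips vs. 10 Yoruba clips.
data = [{"lang": "en-US"}] * 90 + [{"lang": "yo"}] * 10
mix = cap_per_group(data, key="lang", cap=20)
print(len(mix))  # 30: 20 en-US clips plus all 10 yo clips
```

Capping is only one crude rebalancing strategy; in practice it would be combined with sourcing additional data for underrepresented languages rather than merely discarding majority-language audio.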