Machine Learning for Audio Workshop

Discover the harmony of AI and sound.

View the Project on GitHub sadiela/ml-for-audio

For questions, email mlforaudio@googlegroups.com

The Machine Learning for Audio Workshop at NeurIPS 2024 will bring together audio practitioners and machine learning researchers in a venue focused on various problems in audio, including music information retrieval, acoustic event detection, bioacoustics, speech transcription, multimodal modeling, and generative modeling of speech and other sounds. Our team has previously held multiple audio-related workshops at top machine learning venues, including a successful version of this workshop at NeurIPS last year, and both the organizing team and invited speakers represent broad diversity in terms of gender identity, affiliation, seniority, and geography. We also plan to solicit workshop papers on these topics.

Workshop Description

Audio research has recently been enjoying a renaissance of sorts; in the field of audio synthesis alone, many prominent papers have been released in just the last couple of years, with no sign of slowing down. Numerous other key problems within the audio research domain continue to attract widespread attention. Building on this momentum, we ran a Machine Learning for Audio workshop at NeurIPS last year, and its success and the interest it generated have prompted us to organize the workshop again for 2024. We believe a workshop focused on machine learning in the audio domain provides a valuable opportunity to bring together practitioners of audio tools and core machine learning researchers interested in audio, in order to foster collaboration and discussion and to forge new directions within this important area of research. In addition, with the field moving so rapidly, we believe this workshop will provide a dedicated space for the crucial ethical discussions that must take place among researchers around applications of generative machine learning for audio.

The Machine Learning for Audio workshop at NeurIPS 2024 will cover a broad range of tasks involving audio data. These include, but are not limited to: speech modeling, generation of environmental and other ambient sounds, novel generative models, music generation in the form of raw audio, text-to-speech methods, denoising of speech and music, data augmentation, classification of acoustic events, transcription, source separation, and multimodal problems.

We will solicit original workshop papers in these areas, which will be reviewed by the organizers and an additional set of reviewers. We will run a demo session alongside the poster session, where contributors can present live demos of their work when applicable. We believe this session will complement the NeurIPS Creative AI Track nicely: since synthesis is a prominent subfield within audio machine learning research, we will be able to highlight novel generative methods that do not necessarily overlap with a creative application.

Call for Papers

We are calling for extended abstracts of up to 4 pages, excluding references. Accepted submissions will be posted on the workshop website but not published or archived. Several submissions will be chosen for 15-minute contributed talks, and the remaining selected submissions will participate in the poster & demo session. Please make sure submissions adhere to the NeurIPS format. Submit your work via CMT (link to be provided). The review process is double-blind, so please do not include any author information in your submission.

Authors may also submit supplementary materials along with their papers if they wish (e.g., a preview of a potential demo). Reviewers will not be required to read, view, or listen to the supplementary material.

Timeline

Proposed Schedule

We plan for the workshop to be a one-day event. Below is an approximate timetable of the workshop schedule, subject to change.

9:00 - Invited Speakers 1 & 2

10:00 - Contributed Talks 1-3

11:00 - Coffee Break

11:30 - Invited Speakers 3 & 4

12:30 - Lunch

13:30 - Poster & Demo Session

14:30 - Invited Speakers 5 & 6

15:30 - Contributed Talks 4-6

16:30 - Panel Discussion with Invited Speakers

17:00 - Wrap-up and Open Conversation

Invited Speakers

James Betker is a research scientist at OpenAI, where he is one of the audio leads for the GPT-4o model. Previously, he authored TortoiseTTS, a popular open-source text-to-speech system. His interests include generative models for audio and images.

Nick Bryan is a senior research scientist and head of the Music AI research group at Adobe Research. His research interests include audio and music, generative AI, and signal processing. He received his PhD and MA from CCRMA at Stanford University, as well as an MS in Electrical Engineering, also from Stanford.

Daniel P. W. Ellis is a research scientist at Google. His research interests include signal processing and machine learning for the analysis and classification of general audio, speech, and music. Prior to Google, Dan was a professor at Columbia University from 2000 to 2015.

Albert Gu is an assistant professor at Carnegie Mellon University. He is broadly interested in theoretical and empirical aspects of deep learning. His research involves understanding and developing approaches that can be practically useful for modern large-scale machine learning models, such as his current focus on deep sequence models. His work on state-space models, and in particular S4 and its variants, has been hugely influential in the audio community.

Rupal Patel founded and directs the Communication Analysis and Design Laboratory (CadLab) at Northeastern University, an interdisciplinary research group that conducts research along two broad themes: the analysis of spoken communication and the design of novel human-computer communication interfaces. She is also the founder and CEO of VocaliD, which has been creating custom synthetic voices using state-of-the-art machine learning and speech-blending algorithms since 2014. After VocaliD was acquired by Veritone, Rupal began her current role as Veritone’s VP of Voice AI and Accessibility, where she is responsible for setting strategy, leading innovation efforts in voice AI, developing novel monetization models for voice talent and influencers, and expanding the reach and impact of Veritone’s voice solutions for those living with disabilities or inequities.

Katie Zacarian is a conservationist, human rights advocate, and leader in artificial intelligence research as the co-founder and CEO of Earth Species Project (ESP). ESP is a 501(c)(3) nonprofit developing novel machine learning research that can advance our understanding of animal communication, enabling a more empathetic, individualized, and effective approach to protecting the natural world. She has published extensively in the area of bioacoustics.

Organizers