For questions, email mlforaudio@googlegroups.com
The Machine Learning for Audio Workshop at NeurIPS 2023 will bring together audio practitioners and machine learning researchers in a venue focused on various problems in audio, including music information retrieval, acoustic event detection, computational paralinguistics, speech transcription, multimodal modeling, and generative modeling of speech and other sounds.
Workshop Description
Audio research has recently been enjoying a renaissance of sorts; in the field of audio synthesis alone, many prominent papers have been released in just the past calendar year, with no sign of slowing down. Numerous additional key problems within the audio research domain continue to attract widespread attention. We believe that a workshop focused on machine learning in the audio domain would provide a good opportunity to bring together both practitioners of audio tools and core machine learning researchers interested in audio, in order to foster collaboration and discussion as well as forge new directions within this important area of research. In addition, with the field moving so rapidly, we believe this workshop will provide a dedicated space for the crucial ethical discussions that must be facilitated among researchers around applications of generative machine learning for audio.
The Machine Learning for Audio workshop at NeurIPS 2023 will cover a broad range of tasks involving audio data. These include, but are not limited to: methods of speech modeling, environmental sound generation or other forms of ambient sound, novel generative models, music generation in the form of raw audio, text-to-speech methods, denoising of speech and music, data augmentation, classification of acoustic events, transcription, source separation, and multimodal problems.
We plan to solicit original workshop papers in these areas, which will be reviewed by the organizers and an additional set of reviewers. We also plan to run a demo session alongside the poster session, in which contributors can present live demos of their work where applicable. We believe this session will complement the newly announced Creative AI Track very nicely; as synthesis is a prominent subfield within audio machine learning research, we will be able to further highlight novel generative methods that do not necessarily overlap with a creative application.
Call for Papers
We are calling for extended abstracts up to 4 pages excluding references. Accepted submissions will be posted on the workshop website but not published/archived. Several submissions will be chosen for 15-minute contributed talks, and the remaining selected submissions will participate in the poster & demo session. Please make sure submissions adhere to the NeurIPS format. Submit your work via CMT. The review process will be double-blind, so please make sure not to put any author information in your submission.
Authors may also submit supplementary materials along with their papers if they wish (e.g., a preview of a potential demo). Reviewers will not be required to read/view/listen to said supplementary material.
Timeline
- Submission deadline (main paper & all supplementary material): September 29, 2023 AOE
- Accept/Reject notification date: October 27, 2023 AOE (updated)
Available Data for Workshop Submissions
As part of the workshop, the organizers have made several large-scale audio datasets available. These datasets allow researchers to evaluate a wide range of machine learning approaches.
Two of the datasets can be benchmarked against previous competitions. For more detailed information, you can refer to the following publications:
- Baird, A., Tzirakis, P., Brooks, J. A., Gregory, C. B., Schuller, B., Batliner, A., … & Cowen, A. (2022, October). “The ACII 2022 Affective Vocal Bursts Workshop & Competition.” In 2022 10th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW) (pp. 1-5). IEEE.
- Baird, A., Tzirakis, P., Gidel, G., Jiralerspong, M., Muller, E. B., Mathewson, K., … & Cowen, A. (2022). “The ICML 2022 Expressive Vocalizations Workshop and Competition: Recognizing, Generating, and Personalizing Vocal Bursts.” arXiv preprint arXiv:2205.01780.
- Schuller, B. W., Batliner, A., Amiriparian, S., Barnhill, A., Gerczuk, M., Triantafyllopoulos, A., Baird, A., … (2023). “The ACM Multimedia 2023 Computational Paralinguistics Challenge: Emotion Share & Requests.” arXiv preprint arXiv:2304.14882.
A white paper providing more comprehensive details on the available data can be found here. To request access to the datasets, please reach out to competitions@hume.ai.
Schedule
The workshop will take place on December 16th, 2023.
8:40 - Invited Speaker Björn Schuller: Computer Audition Disrupted 2.0: The Foundation Models Era
9:00 - Contributed Talk: Explainable AI for Audio via Virtual Inspection Layers
9:20 - Contributed Talk: Self-Supervised Speech Enhancement using Multi-Modal Data
9:40 - Invited Speaker Dimitra Emmanouilidou: A Multi-view Approach for Audio-based Speech Emotion Recognition
10:10 - Break
10:50 - Invited Speaker Neil Zeghidour: Audio Language Models
11:10 - Contributed Talk: Zero-shot Audio Captioning with Audio-Language Model Guidance and Audio Context Keywords
11:30 - Invited Speaker Rachel Bittner: LLARK: A Multimodal Foundation Model for Music
12:00 - Lunch
13:30 - Poster & Demo Session: Poster session alongside live demos from selected submissions.
15:00 - Break
15:30 - Invited Speaker Ben Hayes: Uninformative Gradients: Optimisation Pathologies in Differentiable Digital Signal Processing
16:00 - Contributed Talk: EDMSound: Spectrogram Based Diffusion Models for Efficient and High-Quality Audio Synthesis
16:20 - Contributed Talk: Towards Generalizable SER: Soft Labeling and Data Augmentation for Modeling Temporal Emotion Shifts in Large-Scale Multilingual Speech
16:40 - Contributed Talk: Audio Personalization through Human-in-the-loop Optimization
17:00 - Invited Speaker Shoko Araki: Multi-channel Speech Enhancement for Moving Sources
Invited Speakers
Shoko Araki is a Senior Research Scientist at NTT Communication Science Laboratories, NTT Corporation, Japan, where she currently leads the Signal Processing Research Group. Since joining NTT in 2000, she has been engaged in research on acoustic signal processing, microphone array signal processing, blind speech separation, meeting diarization, and auditory scene analysis. She was a member of the IEEE SPS Audio and Acoustic Signal Processing Technical Committee (AASP-TC) (2014-2019) and currently serves as its Chair. She was a board member of the Acoustical Society of Japan (ASJ) (2017-2020) and served as vice president of the ASJ (2021-2022). She is an IEEE Fellow.
- Talk abstract: Speech enhancement technology has made remarkable progress in recent years. While many single-channel methods have been proposed, and their performance has improved, multi-channel speech enhancement technology remains important due to its high performance in estimating and retaining sound source spatial information. Many multi-channel processing methods have been proposed so far for cases where the sound source and noise positions are fixed. However, for real-world applications, it is necessary to consider sound source movement and improve robustness to moving sources. In this presentation, I will introduce multi-channel audio enhancement technologies for moving sources. First, I will present an extension of mask-based neural beamforming, which is widely used as an ASR front-end, to moving sound sources. This extension is achieved by integrating model-based array signal processing and data-driven deep learning approaches. Then, I will discuss model-based, unsupervised multi-channel source separation and extraction approaches, e.g., independent component/vector analysis (ICA/IVA). For multi-channel processing, in addition to dealing with moving sources, it is also essential to devise techniques that limit the increase in computational complexity as the number of microphones increases. To address this issue, I will introduce a fast online IVA algorithm for tracking a single moving source that achieves optimal time complexity and operates significantly faster than conventional approaches.
Rachel Bittner is a Research Manager at Spotify in Paris. Before Spotify, she worked at NASA Ames Research Center in the Human Factors division. She received her Ph.D. degree in music technology and digital signal processing from New York University. Before that, she did a Master’s degree in Mathematics at New York University, and a joint Bachelor’s degree in Music Performance and Math at UC Irvine. Her research interests include automatic music transcription, musical source separation, metrics, and dataset creation.
- Talk abstract: Music has a unique and complex structure which is challenging for both expert humans and existing AI systems to understand, and presents unique challenges relative to other forms of audio. We present LLARK, an instruction-tuned multimodal model for music understanding. We detail our process for dataset creation, which involves augmenting the annotations of diverse open-source music datasets and converting them to a unified instruction-tuning format. We propose a multimodal architecture for LLARK, integrating a pretrained generative model for music with a pretrained language model. In evaluations on three types of tasks (music understanding, captioning, and reasoning), we show that our model matches or outperforms existing baselines in zero-shot generalization for music understanding, and that humans show a high degree of agreement with the model’s responses in captioning and reasoning tasks.
Dimitra Emmanouilidou is a senior researcher at Microsoft Research, Redmond, WA, USA. Her interests lie in signal processing using machine learning and AI approaches, with specific applications to audio event detection, audio captioning, speech emotion recognition, and EEG and bio-signal analysis. She serves on the AASP Technical Committee, and has served as a reviewer, area chair, and technical chair for most major conferences and journals in signal processing. Dimitra received her Ph.D. from the Electrical and Computer Engineering Department at Johns Hopkins University. She also holds an M.Sc. in Biomedical Informatics and Technology and a B.Sc. in Computer Science from the University of Crete, Greece.
- Talk abstract: The area of speech emotion recognition (SER) has seen significant advances with the wider availability of pre-trained models and embeddings, and the creation of larger publicly available corpora. In this talk we will touch upon some of the challenges that continue to riddle audio-based SER, such as domain adaptation, data augmentation and output generalization, and further discuss the advantages of a multi-view model approach, one that jointly learns from both categorical and dimensional affect labels.
Ben Hayes is a final-year PhD student in Artificial Intelligence and Music at the Centre for Digital Music, Queen Mary University of London. His research focuses on differentiable digital signal processing for audio synthesis. His work has been accepted to leading conferences in the field, including ISMIR, ICLR, ICASSP, ICA, and the AES Convention, and published in the Journal of the Audio Engineering Society. He has worked as a research intern at Sony Computer Science Laboratories in Paris and on ByteDance’s Speech, Audio and Music Intelligence team in London. He was previously Music Lead at the award-winning generative music startup Jukedeck, and an internationally touring musician signed to R&S Records.
- Talk abstract: Differentiable digital signal processing (DDSP) allows us to constrain the outputs of a neural network to those of a known class of signal processor. This can help us train with limited data, reduce audio artefacts, infer parameters of signal models, and expose human interpretable controls. However, numerous failure modes still exist for certain important families of signal processor. This talk illustrates two such challenges, frequency parameter non-convexity and permutation symmetry, and introduces promising approaches to solving them.
Björn Schuller is a Full Professor & Head of the Chair of Embedded Intelligence for Health Care and Wellbeing at the University of Augsburg, Germany. He is also a Professor of Artificial Intelligence & Head of the Group on Language, Audio & Music at Imperial College London, Chief Scientific Officer (CSO) and Co-Founding CEO at audEERING GmbH, and a Visiting Professor at the School of Computer Science and Technology, Harbin Institute of Technology in Harbin/P.R. China. His research areas of interest include computer audition for health and computational paralinguistics.
- Talk abstract: Computer Audition is changing. Since the advent of Large Audio, Language, and Multimodal Models, or generally Foundation Models, a new age has begun. The emergence of abilities in such large models via zero- or few-shot learning renders it partially unnecessary to collect task-specific data and train a dedicated model. After the last major disruption – learning representations and model architectures directly from data – this can be judged as the second major disruption in a field that was once defined by highly specialized features, approaches, and datasets, now shifting towards being absorbed by the sheer size of models and the data used for their training. In this talk, I will first argue that Computer Audition will be massively influenced by this “plate displacement” in Artificial Intelligence as a whole. I will then move towards “informed tea-leaf reading” on how present and tomorrow’s Computer Audition will change in more detail. This includes prompt optimisation, fine-tuning, or synergistic combination of different foundation models and traditional approaches. Finally, I will turn towards dangers to this new glittery era – among many, the “nightshades” of audio may soon start to poison audio data. A new time has begun – it will empower Computer Audition at a whole new level while challenging us in whole new ways – let’s get ready.
Neil Zeghidour is co-founder and Chief Modeling Officer of the Kyutai non-profit research lab. He was previously at Google DeepMind, where he started and led a team working on generative audio, with contributions including Google’s first text-to-music API, a voice-preserving speech-to-speech translation system, and the first neural audio codec that outperforms general-purpose audio codecs. Before that, Neil spent three years at Facebook AI Research, working on automatic speech recognition and audio understanding. He graduated with a PhD in machine learning from the École Normale Supérieure (Paris), and holds an MSc in machine learning from the École Normale Supérieure (Saclay) and an MSc in quantitative finance from Université Paris Dauphine. In parallel with his research activities, Neil teaches speech processing technologies at the École Normale Supérieure (Saclay).
- Talk abstract: Audio analysis and audio synthesis require modeling long-term, complex phenomena and have historically been tackled in an asymmetric fashion, with specific analysis models that differ from their synthesis counterpart. In this presentation, we will introduce the concept of audio language models, a recent innovation aimed at overcoming these limitations. By discretizing audio signals using a neural audio codec, we can frame both audio generation and audio comprehension as similar autoregressive sequence-to-sequence tasks, capitalizing on the well-established Transformer architecture commonly used in language modeling. This approach unlocks novel capabilities in areas such as textless speech modeling, zero-shot voice conversion, and even text-to-music generation. Furthermore, we will illustrate how the integration of analysis and synthesis within a single model enables the creation of versatile audio models capable of handling a wide range of tasks involving audio as inputs or outputs. We will conclude by highlighting the promising prospects offered by these models and discussing the key challenges that lie ahead in their development.
Organizers
- Sadie Allen is a PhD student studying computer engineering at Boston University. She co-organized the Machine Learning for Audio Synthesis workshop at ICML 2022. Her current research focuses on controllable music generation in both the symbolic and raw audio domains. Her previous work centered around the security and efficiency of distributed systems.
- Alice Baird is a research scientist at Hume AI, NY, USA, where she currently works on modeling expressive human behaviors from audio and other modalities. She earned her Ph.D. at the University of Augsburg in 2022. Her work on emotion understanding from auditory, physiological, and multimodal data has been published extensively in the leading journals and conferences in her field. She has co-organized several machine learning competitions, including the 2022 ICML Expressive Vocalizations Workshop.
- Alan Cowen is an applied mathematician and computational emotion scientist developing new data-driven methods to study human experience and expression. He was previously a researcher at the University of California and visiting scientist at Google, where he helped establish affective computing research efforts. His discoveries have been featured in top journals such as Nature, PNAS, Science Advances, and Nature Human Behavior (i10-index: 16) and covered in press outlets ranging from CNN to Scientific American. His research applies new computational tools to address how emotional behaviors can be evoked, conceptualized, predicted, and annotated, how they influence our social interactions, and how they bring meaning to our everyday lives.
- Sander Dieleman is a research scientist at DeepMind in London, UK, where he has worked on the development of AlphaGo and WaveNet. His research is currently focused on generative modelling of perceptual signals at scale, including audio (speech & music) and visual data. He has previously co-organised four editions of the NeurIPS workshop on machine learning for creativity and design (2017-2020) and three editions of the RecSys workshop on deep learning for recommender systems (DLRS 2016-2018). He also co-organized the Machine Learning for Audio Synthesis workshop at ICML 2022.
- Brian Kulis is an associate professor at Boston University and a former Amazon Scholar who worked on Alexa. His research focuses broadly on machine learning, with a recent emphasis on applications to audio problems such as detection and generation. He has won two best paper awards at ICML and a best paper award at CVPR. He has previously organized two workshops at ICCV (in 2011 and 2013), one workshop at NeurIPS (in 2011), and two workshops at ICML (in 2019 and 2022). He is regularly an area or senior area chair at major AI conferences, was the local arrangements chair for CVPR in 2014, and has organized tutorials at ICML and ECCV.
- Rachel Manzelli is a senior machine learning engineer at Modulate, where she works on both audio generation and classification models to assist video game moderation teams in decreasing toxicity in voice chat. She co-organized the Machine Learning for Audio Synthesis workshop at ICML 2022. She earned her bachelor’s degree in computer engineering from Boston University in 2019. During her undergraduate career, she conducted research in the areas of structured music generation and MIR.
- Shrikanth Narayanan is a University Professor and holder of the Niki and Max Nikias Chair in Engineering at the University of Southern California (USC). Shri is a Fellow of the National Academy of Inventors (NAI), the Acoustical Society of America (ASA), the Institute of Electrical and Electronics Engineers (IEEE), the International Speech Communication Association (ISCA), the Association for Psychological Science (APS), the American Association for the Advancement of Science (AAAS), the American Institute for Medical and Biological Engineering (AIMBE), and the Association for the Advancement of Affective Computing (AAAC). Shri Narayanan is a member of the European Academy of Sciences and Arts and a 2022 Guggenheim Fellow.
Accepted Papers
- EDMSound: Spectrogram Based Diffusion Models for Efficient and High-Quality Audio Synthesis Ge Zhu, Marc-André Carbonneau, Zhiyao Duan (Oral)
- Explainable AI for Audio via Virtual Inspection Layers Johanna Vielhaben, Sebastian Lapuschkin, Grégoire Montavon, Wojciech Samek (Oral)
- Self-Supervised Speech Enhancement using Multi-Modal Data Yu-Lin Wei, Rajalaxmi Rajagopalan, Bashima Islam, Romit Roy Choudhury (Oral)
- Audio Personalization through Human-in-the-loop Optimization Rajalaxmi Rajagopalan, Yu-Lin Wei, Romit Roy Choudhury (Oral)
- Zero-shot Audio Captioning with Audio-Language Model Guidance and Audio Context Keywords Leonard Salewski, Stefan Fauth, A. Sophia Koepke, Zeynep Akata (Oral)
- Towards Generalizable SER: Soft Labeling and Data Augmentation for Modeling Temporal Emotion Shifts in Large-Scale Multilingual Speech Mohamed Osman, Tamer Nadeem, Ghada Khoriba (Oral)
- Audio classification with Dilated Convolution with Learnable Spacings Ismail Khalfaoui Hassani, Timothée Masquelier, Thomas Pellegrini
- Creative Text-to-Audio Generation via Synthesizer Programming Nikhil Singh, Manuel Cherep, Jessica Shand
- Jointly Recognizing Speech and Singing Voices Based on Multi-Task Audio Source Separation Ye Bai, Chenxing Li, Xiaorui Wang, Yuanyuan Zhao, Hao Li
- Leveraging Content-based Features from Multiple Acoustic Models for Singing Voice Conversion Xueyao Zhang, Yicheng Gu, Haopeng Chen, Zihao Fang, Lexiao Zou, Liumeng Xue, Zhizheng Wu
- Diffusion Models as Masked Audio-Video Learners Elvis Nunez, Yanzi Jin, Mohammad Rastegari, Sachin Mehta, Maxwell C Horton
- InstrumentGen: Generating Sample-Based Musical Instruments From Text Shahan Nercessian, Johannes Imort
- Multi-Resolution Audio-Visual Feature Fusion for Temporal Action Localization Edward Fish, Jon Weinbren, Andrew Gilbert
- Composing and Validating Large-Scale Datasets for Training Open Foundation Models for Audio Marianna Nezhurina, Ke Chen, Yusong Wu, Tianyu Zhang, Haohe Liu, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov, Jenia Jitsev
- Unsupervised Musical Object Discovery from Audio Joonsu Gha, Vincent Herrmann, Benjamin F. Grewe, Jürgen Schmidhuber, Anand Gopalakrishnan
- Data is Overrated: Perceptual Metrics Can Lead Learning in the Absence of Training Data Tashi Namgyal, Alexander Hepburn, Raul Santos Rodriguez, Valero Laparra, Jesus Malo
- Improved Sound Quality of Human-inspired DNN-based Audio Applications Chuan Wen, Sarah Verhulst, Guy Torfs
- Synthia’s Melody: A Benchmark Framework for Unsupervised Domain Adaptation in Audio Harry Coppock, Chia-Hsin Lin
- AttentionStitch: How Attention Solves the Speech Editing Problem Antonios Alexos, Pierre Baldi
- MusT3: Unified Multi-Task Model for Fine-Grained Music Understanding Martin Kukla, Minz Won, Yun-Ning Hung, Duc Le
- Benchmarks and deep learning models for localizing rodent vocalizations in social interactions Ralph E Peterson, Aramis Tanelus, Aman Choudhri, Violet Ivan, Aaditya Prasad, David Schneider, Dan Sanes, Alex Williams
- The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation Ilaria Manco, Benno Weck, Seungheon Doh, Minz Won, Yixiao Zhang, Dmitry Bogdanov, Yusong Wu, Ke Chen, Philip Tovstogan, Emmanouil Benetos, Elio Quinton, George Fazekas, Juhan Nam
- ScripTONES: Sentiment-Conditioned Music Generation for Movie Scripts Vishruth Veerendranath, Vibha Masti, Utkarsh Gupta, Hrishit Chaudhuri, Gowri Srinivasa
- Self-Supervised Music Source Separation Using Vector-Quantized Source Category Estimates Marco Pasini, Stefan Lattner, George Fazekas
- Deep Generative Models of Music Expectation Ninon Lizé Masclef, Thomas A Keller
- mir_ref: A Representation Evaluation Framework for Music Information Retrieval Tasks Christos Plachouras, Dmitry Bogdanov, Pablo Alonso-Jiménez