Machine Learning for Audio Workshop

Discover the harmony of AI and sound.

View the Project on GitHub sadiela/ml-for-audio

For questions, email

The Machine Learning for Audio Workshop at NeurIPS 2023 will bring together audio practitioners and machine learning researchers in a venue focused on a variety of problems in audio, including music information retrieval, acoustic event detection, computational paralinguistics, speech transcription, multimodal modeling, and generative modeling of speech and other sounds.

Workshop Description

Audio research has recently been enjoying a renaissance of sorts; in the field of audio synthesis alone, many prominent papers have been released in just the past calendar year, with no sign of slowing down. Numerous additional key problems within the audio research domain continue to attract widespread attention. We believe that a workshop focused on machine learning in the audio domain provides a good opportunity to bring together practitioners of audio tools and core machine learning researchers interested in audio, in order to foster collaboration and discussion and to forge new directions within this important area of research. In addition, with the field moving so rapidly, we believe this workshop will provide a dedicated space for the crucial ethical discussions that must be facilitated among researchers around applications of generative machine learning for audio.

The Machine Learning for Audio workshop at NeurIPS 2023 will cover a broad range of tasks involving audio data. These include, but are not limited to: methods of speech modeling, environmental sound generation or other forms of ambient sound, novel generative models, music generation in the form of raw audio, text-to-speech methods, denoising of speech and music, data augmentation, classification of acoustic events, transcription, source separation, and multimodal problems.

We plan to solicit original workshop papers in these areas, which will be reviewed by the organizers and an additional set of reviewers. We also plan to run a demo session alongside the poster session, where contributors will be able to present live demos of their work where applicable. We believe this session will complement the newly announced Creative AI Track very nicely; as synthesis is a prominent subfield within audio machine learning research, we will be able to further highlight novel generative methods that do not necessarily overlap with a creative application.

Call for Papers

We are calling for extended abstracts of up to 4 pages, excluding references. Accepted submissions will be posted on the workshop website but not published or archived. Several submissions will be chosen for 15-minute contributed talks, and the remaining selected submissions will participate in the poster & demo session. Please make sure submissions adhere to the NeurIPS format. Submit your work via CMT. The review process will be double-blind, so please make sure not to include any author information in your submission.

Authors may also submit supplementary materials along with their papers if they wish (e.g., a preview of a potential demo). Reviewers will not be required to read/view/listen to said supplementary material.


Available Data for Workshop Submissions

As part of the workshop, the organizers have made several large-scale audio datasets available. These datasets allow researchers to evaluate a wide range of machine learning approaches.

Two of the datasets can be benchmarked against previous competitions. For more detailed information, you can refer to the following publications:

  1. Baird, A., Tzirakis, P., Brooks, J. A., Gregory, C. B., Schuller, B., Batliner, A., … & Cowen, A. (2022, October). “The ACII 2022 Affective Vocal Bursts Workshop & Competition.” In 2022 10th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW) (pp. 1-5). IEEE.
  2. Baird, A., Tzirakis, P., Gidel, G., Jiralerspong, M., Muller, E. B., Mathewson, K., … & Cowen, A. (2022). “The ICML 2022 Expressive Vocalizations Workshop and Competition: Recognizing, Generating, and Personalizing Vocal Bursts.” arXiv preprint arXiv:2205.01780.
  3. Schuller, B. W., Batliner, A., Amiriparian, S., Barnhill, A., Gerczuk, M., Triantafyllopoulos, A., Baird, A., … (2023). “The ACM Multimedia 2023 Computational Paralinguistics Challenge: Emotion Share & Requests.” arXiv preprint arXiv:2304.14882.

A white paper providing more comprehensive details on the available data can be found here. To request access to the datasets, please reach out to


Schedule

The workshop will take place on December 16th, 2023.

8:40 - Invited Speaker Björn Schuller: Computer Audition Disrupted 2.0: The Foundation Models Era

9:00 - Contributed Talk: Explainable AI for Audio via Virtual Inspection Layers

9:20 - Contributed Talk: Self-Supervised Speech Enhancement using Multi-Modal Data

9:40 - Invited Speaker Dimitra Emmanouilidou: A Multi-view Approach for Audio-based Speech Emotion Recognition

10:10 - Break

10:50 - Invited Speaker Neil Zeghidour: Audio Language Models

11:10 - Contributed Talk: Zero-shot Audio Captioning with Audio-Language Model Guidance and Audio Context Keywords

11:30 - Invited Speaker Rachel Bittner: Lark: A Multimodal Foundation Model for Music

12:00 - Lunch

13:30 - Poster & Demo Session: Poster session alongside live demos from selected submissions.

15:00 - Break

15:30 - Invited Speaker Ben Hayes: Uninformative Gradients: Optimisation Pathologies in Differentiable Digital Signal Processing

16:00 - Contributed Talk: EDMSound: Spectrogram Based Diffusion Models for Efficient and High-Quality Audio Synthesis

16:20 - Contributed Talk: Towards Generalizable SER: Soft Labeling and Data Augmentation for Modeling Temporal Emotion Shifts in Large-Scale Multilingual Speech

16:40 - Contributed Talk: Audio Personalization through Human-in-the-loop Optimization

17:00 - Invited Speaker Shoko Araki: Multi-channel Speech Enhancement for Moving Sources

Invited Speakers

Shoko Araki is a Senior Research Scientist at NTT Communication Science Laboratories, NTT Corporation, Japan, where she currently leads the Signal Processing Research Group. Since joining NTT in 2000, she has been engaged in research on acoustic signal processing, microphone array signal processing, blind speech separation, meeting diarization, and auditory scene analysis. She was formerly a member of the IEEE SPS Audio and Acoustic Signal Processing Technical Committee (AASP-TC) (2014-2019) and currently serves as its Chair. She was a board member of the Acoustical Society of Japan (ASJ) (2017-2020) and served as vice president of ASJ (2021-2022). She is an IEEE Fellow.

Rachel Bittner is a Research Manager at Spotify in Paris. Before Spotify, she worked at NASA Ames Research Center in the Human Factors division. She received her Ph.D. degree in music technology and digital signal processing from New York University. Before that, she did a Master’s degree in Mathematics at New York University, and a joint Bachelor’s degree in Music Performance and Math at UC Irvine. Her research interests include automatic music transcription, musical source separation, metrics, and dataset creation.

Dimitra Emmanouilidou is a Senior Researcher at Microsoft Research, Redmond, WA, USA. Her interests lie in signal processing using machine learning and AI approaches, with specific applications to audio event detection, audio captioning, speech emotion recognition, and EEG and bio-signal analysis. She serves on the AASP Technical Committee and has served as a reviewer, Area Chair, and Technical Chair for most major conferences and journals in signal processing. Dimitra received her PhD from the Electrical and Computer Engineering Department at Johns Hopkins University. She also holds an M.Sc. in Biomedical Informatics and Technology and a B.Sc. in Computer Science from the University of Crete, Greece.

Ben Hayes is a final-year PhD student in Artificial Intelligence and Music at the Centre for Digital Music, Queen Mary University of London. His research focuses on differentiable digital signal processing for audio synthesis. His work has been accepted to leading conferences in the field, including ISMIR, ICLR, ICASSP, ICA, and the AES Convention, and published in the Journal of the Audio Engineering Society. He has worked as a research intern at Sony Computer Science Laboratories in Paris and on ByteDance’s Speech, Audio and Music Intelligence team in London. He was previously Music Lead at the award-winning generative music startup Jukedeck, and an internationally touring musician signed to R&S Records.

Björn Schuller is a Full Professor & Head of the Chair of Embedded Intelligence for Health Care and Wellbeing at the University of Augsburg, Germany. He is also a Professor of Artificial Intelligence & Head of the Group on Language, Audio & Music at Imperial College London, Chief Scientific Officer (CSO) and Co-Founding CEO at audEERING GmbH, and a Visiting Professor at the School of Computer Science and Technology, Harbin Institute of Technology in Harbin, P.R. China. His research interests include computer audition for health and computational paralinguistics.

Neil Zeghidour is co-founder and Chief Modeling Officer of the Kyutai non-profit research lab. He was previously at Google DeepMind, where he started and led a team working on generative audio, with contributions including Google’s first text-to-music API, a voice-preserving speech-to-speech translation system, and the first neural audio codec to outperform general-purpose audio codecs. Before that, Neil spent three years at Facebook AI Research, working on automatic speech recognition and audio understanding. He graduated with a PhD in machine learning from the École Normale Supérieure (Paris), and holds an MSc in machine learning from the École Normale Supérieure (Saclay) and an MSc in quantitative finance from Université Paris Dauphine. In parallel with his research activities, Neil teaches speech processing technologies at the École Normale Supérieure (Saclay).


Accepted Papers