For questions, email mlforaudio@googlegroups.com
The Machine Learning for Audio Workshop at NeurIPS 2023 will bring together audio practitioners and machine learning researchers in a venue focused on various problems in audio, including music information retrieval, acoustic event detection, computational paralinguistics, speech transcription, multimodal modeling, and generative modeling of speech and other sounds.
Workshop Description
Audio research has recently been enjoying a renaissance of sorts; in the field of audio synthesis alone, many prominent papers have been released in just the past calendar year, with no sign of slowing down. Numerous additional key problems within the audio research domain continue to attract widespread attention. We believe that a workshop focused on machine learning in the audio domain would provide a good opportunity to bring together both practitioners of audio tools and core machine learning researchers interested in audio, in order to foster collaboration and discussion as well as forge new directions within this important area of research. In addition, with the field moving so rapidly, we believe this workshop will provide a dedicated space for the crucial ethical discussions that must be facilitated among researchers around applications of generative machine learning for audio.
The Machine Learning for Audio workshop at NeurIPS 2023 will cover a broad range of tasks involving audio data. These include, but are not limited to: methods of speech modeling, environmental sound generation or other forms of ambient sound, novel generative models, music generation in the form of raw audio, text-to-speech methods, denoising of speech and music, data augmentation, classification of acoustic events, transcription, source separation, and multimodal problems.
We plan to solicit original workshop papers in these areas, which will be reviewed by the organizers and an additional set of reviewers. We also plan to run a demo session alongside the poster session, in which contributors can present live demos of their work where applicable. We believe this session will complement the newly announced Creative AI Track very nicely; as synthesis is a prominent subfield within audio machine learning research, we will be able to further highlight novel generative methods that do not necessarily overlap with a creative application.
Call for Papers
We are calling for extended abstracts up to 4 pages excluding references. Accepted submissions will be posted on the workshop website but not published/archived. Several submissions will be chosen for 15-minute contributed talks, and the remaining selected submissions will participate in the poster & demo session. Please make sure submissions adhere to the NeurIPS format. Submit your work via CMT. The review process will be double-blind, so please make sure not to put any author information in your submission.
Authors may also submit supplementary materials along with their papers if they wish (e.g., a preview of a potential demo). Reviewers will not be required to read/view/listen to said supplementary material.
Timeline
- Submission deadline (main paper & all supplementary material): September 29, 2023 AOE
- Accept/Reject notification date: October 27, 2023 AOE (updated)
Available Data for Workshop Submissions
As part of the workshop, the organizers have made several large-scale audio datasets available. These datasets allow researchers to evaluate a wide range of machine learning approaches.
Two of the datasets can be benchmarked against previous competitions. For more detailed information, you can refer to the following publications:
- Baird, A., Tzirakis, P., Brooks, J. A., Gregory, C. B., Schuller, B., Batliner, A., … & Cowen, A. (2022, October). “The ACII 2022 Affective Vocal Bursts Workshop & Competition.” In 2022 10th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW) (pp. 1-5). IEEE.
- Baird, A., Tzirakis, P., Gidel, G., Jiralerspong, M., Muller, E. B., Mathewson, K., … & Cowen, A. (2022). “The ICML 2022 Expressive Vocalizations Workshop and Competition: Recognizing, Generating, and Personalizing Vocal Bursts.” arXiv preprint arXiv:2205.01780.
- Schuller, B. W., Batliner, A., Amiriparian, S., Barnhill, A., Gerczuk, M., Triantafyllopoulos, A., Baird, A., … (2023). “The ACM Multimedia 2023 Computational Paralinguistics Challenge: Emotion Share & Requests.” arXiv preprint arXiv:2304.14882.
A white paper providing more comprehensive details on the available data can be found here. To request access to the datasets, please reach out to competitions@hume.ai.
Schedule
The workshop will take place on December 16th, 2023.
8:40 - Invited Speaker Björn Schuller: Computer Audition Disrupted 2.0: The Foundation Models Era
9:00 - Contributed Talk: Explainable AI for Audio via Virtual Inspection Layers
9:20 - Contributed Talk: Self-Supervised Speech Enhancement using Multi-Modal Data
9:40 - Invited Speaker Dimitra Emmanouilidou: A Multi-view Approach for Audio-based Speech Emotion Recognition
10:10 - Break
10:50 - Invited Speaker Neil Zeghidour: Audio Language Models
11:10 - Contributed Talk: Zero-shot Audio Captioning with Audio-Language Model Guidance and Audio Context Keywords
11:30 - Invited Speaker Rachel Bittner: LLARK: A Multimodal Foundation Model for Music
12:00 - Lunch
13:30 - Poster & Demo Session: Poster session alongside live demos from selected submissions.
15:00 - Break
15:30 - Invited Speaker Ben Hayes: Uninformative Gradients: Optimisation Pathologies in Differentiable Digital Signal Processing
16:00 - Contributed Talk: EDMSound: Spectrogram Based Diffusion Models for Efficient and High-Quality Audio Synthesis
16:20 - Contributed Talk: Towards Generalizable SER: Soft Labeling and Data Augmentation for Modeling Temporal Emotion Shifts in Large-Scale Multilingual Speech
16:40 - Contributed Talk: Audio Personalization through Human-in-the-loop Optimization
17:00 - Invited Speaker Shoko Araki: Multi-channel Speech Enhancement for Moving Sources
Invited Speakers
Shoko Araki is a Senior Research Scientist at NTT Communication Science Laboratories, NTT Corporation, Japan, where she currently leads the Signal Processing Research Group. Since joining NTT in 2000, she has been engaged in research on acoustic signal processing, microphone array signal processing, blind speech separation, meeting diarization, and auditory scene analysis. She was a member of the IEEE SPS Audio and Acoustic Signal Processing Technical Committee (AASP-TC) (2014-2019) and currently serves as its Chair. She was a board member of the Acoustical Society of Japan (ASJ) (2017-2020) and served as vice president of the ASJ (2021-2022). She is an IEEE Fellow.
- Talk abstract: Speech enhancement technology has made remarkable progress in recent years. While many single-channel methods have been proposed, and their performance has improved, multi-channel speech enhancement technology remains important due to its high performance in estimating and retaining sound source spatial information. Many multi-channel processing methods have been proposed so far for cases where the sound source and noise positions are fixed. However, for real-world applications, it is necessary to consider sound source movement and improve robustness to moving sources. In this presentation, I will introduce multi-channel audio enhancement technologies for moving sources. First, I will present an extension of mask-based neural beamforming, which is widely used as an ASR front-end, to moving sound sources. This extension is achieved by integrating model-based array signal processing and data-driven deep learning approaches. Then, I will discuss model-based, unsupervised multi-channel source separation and extraction approaches, e.g., independent component/vector analysis (ICA/IVA). For multi-channel processing, in addition to dealing with moving sources, it is also essential to devise techniques that limit the increase in computational complexity as the number of microphones increases. To address this issue, I will introduce a fast online IVA algorithm for tracking a single moving source that achieves optimal time complexity and operates significantly faster than conventional approaches.
Rachel Bittner is a Research Manager at Spotify in Paris. Before Spotify, she worked at NASA Ames Research Center in the Human Factors division. She received her Ph.D. degree in music technology and digital signal processing from New York University. Before that, she did a Master’s degree in Mathematics at New York University, and a joint Bachelor’s degree in Music Performance and Math at UC Irvine. Her research interests include automatic music transcription, musical source separation, metrics, and dataset creation.
- Talk abstract: Music has a unique and complex structure which is challenging for both expert humans and existing AI systems to understand, and presents unique challenges relative to other forms of audio. We present LLARK, an instruction-tuned multimodal model for music understanding. We detail our process for dataset creation, which involves augmenting the annotations of diverse open-source music datasets and converting them to a unified instruction-tuning format. We propose a multimodal architecture for LLARK, integrating a pretrained generative model for music with a pretrained language model. In evaluations on three types of tasks (music understanding, captioning, and reasoning), we show that our model matches or outperforms existing baselines in zero-shot generalization for music understanding, and that humans show a high degree of agreement with the model’s responses in captioning and reasoning tasks.
Dimitra Emmanouilidou is a senior researcher at Microsoft Research, Redmond, WA, USA. Her interests lie in signal processing using machine learning and AI approaches, with specific applications to audio event detection, audio captioning, speech emotion recognition, and EEG and bio-signal analysis. She serves on the AASP Technical Committee, and has served as a reviewer, area chair, and technical chair for most major conferences and journals in signal processing. Dimitra received her Ph.D. from the Electrical and Computer Engineering Department at Johns Hopkins University. She also holds an M.Sc. in Biomedical Informatics and Technology and a B.Sc. in Computer Science from the University of Crete, Greece.
- Talk abstract: The area of speech emotion recognition (SER) has seen significant advances with the wider availability of pre-trained models and embeddings, and the creation of larger publicly available corpora. In this talk we will touch upon some of the challenges that continue to riddle audio-based SER, such as domain adaptation, data augmentation and output generalization, and further discuss the advantages of a multi-view model approach, one that jointly learns from both categorical and dimensional affect labels.
Ben Hayes is a final-year PhD student in Artificial Intelligence and Music at the Centre for Digital Music, Queen Mary University of London. His research focuses on differentiable digital signal processing for audio synthesis. His work has been accepted to leading conferences in the field, including ISMIR, ICLR, ICASSP, ICA, and the AES Convention, and published in the Journal of the Audio Engineering Society. He has worked as a research intern at Sony Computer Science Laboratories in Paris and on ByteDance’s Speech, Audio and Music Intelligence team in London. He was previously Music Lead at the award-winning generative music startup Jukedeck, and an internationally touring musician signed to R&S Records.
- Talk abstract: Differentiable digital signal processing (DDSP) allows us to constrain the outputs of a neural network to those of a known class of signal processor. This can help us train with limited data, reduce audio artefacts, infer parameters of signal models, and expose human interpretable controls. However, numerous failure modes still exist for certain important families of signal processor. This talk illustrates two such challenges, frequency parameter non-convexity and permutation symmetry, and introduces promising approaches to solving them.
Björn Schuller is a Full Professor & Head of the Chair of Embedded Intelligence for Health Care and Wellbeing at the University of Augsburg, Germany. He is also a Professor of Artificial Intelligence & Head of the Group on Language, Audio & Music at Imperial College London, Chief Scientific Officer (CSO) and Co-Founding CEO at audEERING GmbH, and a Visiting Professor at the School of Computer Science and Technology, Harbin Institute of Technology in Harbin/P.R. China. His research areas of interest include computer audition for health and computational paralinguistics.
- Talk abstract: Computer Audition is changing. Since the advent of Large Audio, Language, and Multimodal Models, or generally Foundation Models, a new age has begun. The emergence of abilities in such large models via zero- or few-shot learning renders it partially unnecessary to collect task-specific data and train a dedicated model. After the last major disruption – learning representations and model architectures directly from data – this can be judged as the second major disruption in a field that was once defined by highly specialized features, approaches, and datasets, now shifting towards being absorbed by the sheer size of models and the data used for their training. In this talk, I will first argue that Computer Audition will be massively influenced by this “plate displacement” in Artificial Intelligence as a whole. I will then move towards “informed tea-leaf reading” on how present and tomorrow’s Computer Audition will change in more detail. This includes prompt optimisation, fine-tuning, or synergistic combination of different foundation models and traditional approaches. Finally, I will turn towards dangers to this new glittery era – among many, the “nightshades” of audio may soon start to poison audio data. A new time has begun – it will empower Computer Audition at a whole new level while challenging us in whole new ways – let’s get ready.
Neil Zeghidour is co-founder and Chief Modeling Officer of the Kyutai non-profit research lab. He was previously at Google DeepMind, where he started and led a team working on generative audio, with contributions including Google’s first text-to-music API, a voice-preserving speech-to-speech translation system, and the first neural audio codec that outperforms general-purpose audio codecs. Before that, Neil spent three years at Facebook AI Research, working on automatic speech recognition and audio understanding. He graduated with a PhD in machine learning from the École Normale Supérieure (Paris), and holds an MSc in machine learning from the École Normale Supérieure (Saclay) and an MSc in quantitative finance from Université Paris Dauphine. In parallel with his research activities, Neil teaches speech processing technologies at the École Normale Supérieure (Saclay).
- Talk abstract: Audio analysis and audio synthesis require modeling long-term, complex phenomena and have historically been tackled in an asymmetric fashion, with specific analysis models that differ from their synthesis counterpart. In this presentation, we will introduce the concept of audio language models, a recent innovation aimed at overcoming these limitations. By discretizing audio signals using a neural audio codec, we can frame both audio generation and audio comprehension as similar autoregressive sequence-to-sequence tasks, capitalizing on the well-established Transformer architecture commonly used in language modeling. This approach unlocks novel capabilities in areas such as textless speech modeling, zero-shot voice conversion, and even text-to-music generation. Furthermore, we will illustrate how the integration of analysis and synthesis within a single model enables the creation of versatile audio models capable of handling a wide range of tasks involving audio as inputs or outputs. We will conclude by highlighting the promising prospects offered by these models and discussing the key challenges that lie ahead in their development.
Organizers
- Sadie Allen is a PhD student studying computer engineering at Boston University. She co-organized the Machine Learning for Audio Synthesis workshop at ICML 2022. Her current research focuses on controllable music generation in both the symbolic and raw audio domains. Her previous work centered around the security and efficiency of distributed systems.
- Alice Baird is a research scientist at Hume AI, NY, USA, where she currently works on modeling expressive human behaviors from audio and other modalities. She earned her Ph.D. at the University of Augsburg in 2022. Her work on emotion understanding from auditory, physiological, and multimodal data has been published extensively in the leading journals and conferences in her field. She has co-organized several machine learning competitions, including the 2022 ICML Expressive Vocalizations Workshop.
- Alan Cowen is an applied mathematician and computational emotion scientist developing new data-driven methods to study human experience and expression. He was previously a researcher at the University of California and visiting scientist at Google, where he helped establish affective computing research efforts. His discoveries have been featured in top journals such as Nature, PNAS, Science Advances, and Nature Human Behavior (i10-index: 16) and covered in press outlets ranging from CNN to Scientific American. His research applies new computational tools to address how emotional behaviors can be evoked, conceptualized, predicted, and annotated, how they influence our social interactions, and how they bring meaning to our everyday lives.
- Sander Dieleman is a research scientist at DeepMind in London, UK, where he has worked on the development of AlphaGo and WaveNet. His research is currently focused on generative modelling of perceptual signals at scale, including audio (speech & music) and visual data. He has previously co-organised four editions of the NeurIPS workshop on machine learning for creativity and design (2017-2020) and three editions of the RecSys workshop on deep learning for recommender systems (DLRS 2016-2018). He also co-organized the Machine Learning for Audio Synthesis workshop at ICML 2022.
- Brian Kulis is an associate professor at Boston University and a former Amazon Scholar who worked on Alexa. His research focuses broadly on machine learning, with a recent emphasis on applications to audio problems such as detection and generation. He has won two best paper awards at ICML and a best paper award at CVPR. He has previously organized two workshops at ICCV (in 2011 and 2013), one workshop at NeurIPS (in 2011), and two workshops at ICML (in 2019 and 2022). He is regularly an area or senior area chair at major AI conferences, was the local arrangements chair for CVPR in 2014, and has organized tutorials at ICML and ECCV.
- Rachel Manzelli is a senior machine learning engineer at Modulate, where she works on both audio generation and classification models to assist video game moderation teams in decreasing toxicity in voice chat. She co-organized the Machine Learning for Audio Synthesis workshop at ICML 2022. She earned her bachelor’s degree in computer engineering from Boston University in 2019. During her undergraduate career, she conducted research in the areas of structured music generation and MIR.
- Shrikanth Narayanan is a University Professor and holder of the Niki and Max Nikias Chair in Engineering at the University of Southern California (USC). Shri is a Fellow of the National Academy of Inventors (NAI), the Acoustical Society of America (ASA), the Institute of Electrical and Electronics Engineers (IEEE), the International Speech Communication Association (ISCA), the Association for Psychological Science (APS), the American Association for the Advancement of Science (AAAS), the American Institute for Medical and Biological Engineering (AIMBE), and the Association for the Advancement of Affective Computing (AAAC). Shri Narayanan is a member of the European Academy of Sciences and Arts and a 2022 Guggenheim Fellow.
Accepted Papers
- EDMSound: Spectrogram Based Diffusion Models for Efficient and High-Quality Audio Synthesis Ge Zhu, Marc-André Carbonneau, Zhiyao Duan (Oral)
- Explainable AI for Audio via Virtual Inspection Layers Johanna Vielhaben, Sebastian Lapuschkin, Grégoire Montavon, Wojciech Samek (Oral)
- Self-Supervised Speech Enhancement using Multi-Modal Data Yu-Lin Wei, Rajalaxmi Rajagopalan, Bashima Islam, Romit Roy Choudhury (Oral)
- Audio Personalization through Human-in-the-loop Optimization Rajalaxmi Rajagopalan, Yu-Lin Wei, Romit Roy Choudhury (Oral)
- Zero-shot Audio Captioning with Audio-Language Model Guidance and Audio Context Keywords Leonard Salewski, Stefan Fauth, A. Sophia Koepke, Zeynep Akata (Oral)
- Towards Generalizable SER: Soft Labeling and Data Augmentation for Modeling Temporal Emotion Shifts in Large-Scale Multilingual Speech Mohamed Osman, Tamer Nadeem, Ghada Khoriba (Oral)
- Audio classification with Dilated Convolution with Learnable Spacings Ismail Khalfaoui Hassani, Timothée Masquelier, Thomas Pellegrini
- Creative Text-to-Audio Generation via Synthesizer Programming Nikhil Singh, Manuel Cherep, Jessica Shand
- Jointly Recognizing Speech and Singing Voices Based on Multi-Task Audio Source Separation Ye Bai, Chenxing Li, Xiaorui Wang, Yuanyuan Zhao, Hao Li
- Leveraging Content-based Features from Multiple Acoustic Models for Singing Voice Conversion Xueyao Zhang, Yicheng Gu, Haopeng Chen, Zihao Fang, Lexiao Zou, Liumeng Xue, Zhizheng Wu
- Diffusion Models as Masked Audio-Video Learners Elvis Nunez, Yanzi Jin, Mohammad Rastegari, Sachin Mehta, Maxwell C Horton
- InstrumentGen: Generating Sample-Based Musical Instruments From Text Shahan Nercessian, Johannes Imort
- Multi-Resolution Audio-Visual Feature Fusion for Temporal Action Localization Edward Fish, Jon Weinbren, Andrew Gilbert
- Composing and Validating Large-Scale Datasets for Training Open Foundation Models for Audio Marianna Nezhurina, Ke Chen, Yusong Wu, Tianyu Zhang, Haohe Liu, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov, Jenia Jitsev
- Unsupervised Musical Object Discovery from Audio Joonsu Gha, Vincent Herrmann, Benjamin F. Grewe, Jürgen Schmidhuber, Anand Gopalakrishnan
- Data is Overrated: Perceptual Metrics Can Lead Learning in the Absence of Training Data Tashi Namgyal, Alexander Hepburn, Raul Santos Rodriguez, Valero Laparra, Jesus Malo
- Improved Sound Quality of Human-inspired DNN-based Audio Applications Chuan Wen, Sarah Verhulst, Guy Torfs
- Synthia’s Melody: A Benchmark Framework for Unsupervised Domain Adaptation in Audio Harry Coppock, Chia-Hsin Lin
- AttentionStitch: How Attention Solves the Speech Editing Problem Antonios Alexos, Pierre Baldi
- MusT3: Unified Multi-Task Model for Fine-Grained Music Understanding Martin Kukla, Minz Won, Yun-Ning Hung, Duc Le
- Benchmarks and deep learning models for localizing rodent vocalizations in social interactions Ralph E Peterson, Aramis Tanelus, Aman Choudhri, Violet Ivan, Aaditya Prasad, David Schneider, Dan Sanes, Alex Williams
- The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation Ilaria Manco, Benno Weck, Seungheon Doh, Minz Won, Yixiao Zhang, Dmitry Bogdanov, Yusong Wu, Ke Chen, Philip Tovstogan, Emmanouil Benetos, Elio Quinton, George Fazekas, Juhan Nam
- ScripTONES: Sentiment-Conditioned Music Generation for Movie Scripts Vishruth Veerendranath, Vibha Masti, Utkarsh Gupta, Hrishit Chaudhuri, Gowri Srinivasa
- Self-Supervised Music Source Separation Using Vector-Quantized Source Category Estimates Marco Pasini, Stefan Lattner, George Fazekas
- Deep Generative Models of Music Expectation Ninon Lizé Masclef, Thomas A Keller
- mir_ref: A Representation Evaluation Framework for Music Information Retrieval Tasks Christos Plachouras, Dmitry Bogdanov, Pablo Alonso-Jiménez