I am a master’s student at Carnegie Mellon University. I work with Professor Bhiksha Raj on speech processing and audio language models, and with Professor Chris Donahue on text-to-audio models. My research interests lie in speech/audio processing and large language models (LLMs).

Previously, I worked with Dr. Satrajit Ghosh at MIT on the explainability of self-supervised learning (SSL) embeddings, such as WavLM, for speech emotion recognition. I have also worked at EPFL on room acoustics simulation. I completed my undergraduate degree in Electrical Engineering at IIT Delhi, where my concentration was on signal processing and machine learning.

Publications and Preprints

MACE: Leveraging Audio for Evaluating Audio Captioning Systems

Satvik Dixit, Soham Deshmukh, Bhiksha Raj

  • Under review at ICASSP 2025 Speech and Audio Language Models (SALMA) Workshop
  • Paper
  • Code

Vision Language Models Are Few-Shot Audio Spectrogram Classifiers

Satvik Dixit, Laurie Heller, Chris Donahue

  • Accepted at the NeurIPS Audio Imagination Workshop
  • Paper
Improving Speaker Representations Using Contrastive Losses on Multi-scale Features

Satvik Dixit, Massa Baali, Rita Singh, Bhiksha Raj

  • Under review at ICASSP 2025 (Main)
  • Paper
  • Code
Explaining Deep Learning Embeddings for Speech Emotion Recognition by Predicting Interpretable Acoustic Features

Satvik Dixit, Daniel M. Low, Gasser Elbanna, Fabio Catania, Satrajit S. Ghosh

  • Under review at ICASSP 2025 (Main)
  • Paper
  • Code