Applications of digital audio, voice, and image processing

Signal theory is an essential area within ICT whose mathematical foundations date back to the 17th century. The development of electronic computers and their proliferation from the second half of the 20th century onwards gave rise to Digital Signal Processing (DSP), a discipline that has allowed engineers to apply these techniques to a wide range of present-day problems, helping us understand the world better and often yielding solutions that radically transform it.

The scope of DSP ranges from telecommunications systems to methods for financial analysis and forecasting, and from control and decision-making systems to multimedia technologies. It affects all fields of human activity, including the way we interact with machines.

At CITSEM we focus on the study of signals that are the source for some of the main applications of these technologies: audio, voice and images.

Audio processing is the study of sounds recorded by one or more microphones. Its objectives include the analysis of sounds in order to characterize them, coding for transmission or storage on digital media, and sound enhancement. The field also includes the processing of musical signals, with applications such as automatic content classification, digital synthesis and the modeling of instrument acoustics.

Our lines of research in this area are:

  • Detection of sound events.
  • Measurement of parameters for quantification of emotions.
  • Classification of acoustic scenes.
  • Analysis, characterization and reproduction of the choral effect.


MATLAB graphical interface to assess the automatic detection of musical notes. The orange line represents the pitch contour of a singer and the blue line the assigned musical notes.
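
As an illustration of how a pitch contour can be mapped to musical notes, the following minimal Python sketch rounds each fundamental-frequency value to the nearest equal-tempered note. It is not the method used in the MATLAB interface described above: the function names, the A4 = 440 Hz reference and the convention that unvoiced frames are marked with 0 are all assumptions made for this example.

```python
import math

# Note names within one octave, starting at C (MIDI note 0 is C-1).
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def hz_to_midi(f0_hz, a4_hz=440.0):
    """Convert a frequency in Hz to a fractional MIDI note number.

    Assumes 12-tone equal temperament with A4 (MIDI 69) tuned to a4_hz.
    """
    return 69.0 + 12.0 * math.log2(f0_hz / a4_hz)


def nearest_note(f0_hz, a4_hz=440.0):
    """Return (note name, deviation in cents) for the closest note."""
    midi = hz_to_midi(f0_hz, a4_hz)
    nearest = round(midi)
    cents = 100.0 * (midi - nearest)
    name = NOTE_NAMES[nearest % 12] + str(nearest // 12 - 1)
    return name, cents


def assign_notes(contour_hz):
    """Map a pitch contour to note names; frames with f0 <= 0 give None.

    Treating 0 as "unvoiced" is an assumption of this sketch.
    """
    return [nearest_note(f)[0] if f and f > 0 else None for f in contour_hz]
```

A practical system would normally also smooth the contour and merge consecutive frames into notes with an onset and a duration, which this sketch leaves out.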


Voice processing can be seen as a subset of audio processing in which the signal under study is human voice and speech. It also covers many disciplines, depending on the objective pursued: coding, enhancement, speech recognition, synthesis, analysis, speaker recognition or identification, and language detection, among others.

Our lines of research in this area focus on the analysis of voice quality:

  • Extraction of objective parameters of the voice for the detection, classification and quantification of pathologies of the phonatory and other physiological systems.
  • Voice characterization for speech synthesis and speaker modeling.


Spectrogram from the speech analysis of a patient with Alzheimer's disease, showing a pitch rise of nearly two octaves towards the end of a phonetic unit.
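
For reference, a pitch interval expressed in octaves is the base-2 logarithm of the frequency ratio, so a rise of two octaves corresponds to a fourfold increase in fundamental frequency. The short Python sketch below shows the calculation; the function name is an assumption of this example, and no frequency values from the recording are implied.

```python
import math


def interval_in_octaves(f_start_hz, f_end_hz):
    """Return the pitch interval between two frequencies, in octaves.

    A positive result means the pitch rises; multiply by 12 to get semitones.
    """
    if f_start_hz <= 0 or f_end_hz <= 0:
        raise ValueError("frequencies must be positive")
    return math.log2(f_end_hz / f_start_hz)
```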


Image processing focuses on the study of images obtained with optical sensors (cameras, scanners) or generated directly by a computer. Its objectives include image enhancement and restoration, compression, segmentation and description of shapes within the image, and recognition and interpretation of the content using pattern recognition methods.

Our lines of research in this area focus on automatic methods for image segmentation and classification, primarily in medical applications:

  • Detection of the vocal folds in videos recorded using stroboscopy for analysis and assessment of the phonatory system.
  • Segmentation and 3D reconstruction of anatomical structures from CT scans.
  • Segmentation of skin lesions, detection of clinical attributes and classification of possible diseases from dermoscopy images.

Automatic localization of a skin lesion in a dermoscopy image using deep learning.