
Behind the Scenes

How MS2LDA Works

MS2LDA applies probabilistic topic modeling, originally developed for natural language processing (NLP), to tandem mass spectrometry (MS/MS) data. This enables the unsupervised discovery of fragmentation patterns in complex chemical mixtures.

The Basic Concept


Just as topic modeling identifies themes in a collection of texts by detecting patterns of word co-occurrence, MS2LDA identifies recurring sets of co-occurring mass fragments and neutral losses, called Mass2Motifs, across large MS/MS datasets. It uses Latent Dirichlet Allocation (LDA) to infer which motifs are most likely to explain the observed fragmentation patterns.

Step-by-Step Overview

1. Preprocessing 🧹

  • Convert MS/MS spectra into a bag-of-fragments format
  • Extract neutral losses
  • Filter out noise
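
The preprocessing steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not MS2LDA's own preprocessing code: the function name `spectrum_to_document`, the `frag@`/`loss@` word labels, the relative-intensity cutoff, and the rounding to two decimals are all assumptions made for the example.

```python
from collections import Counter


def spectrum_to_document(precursor_mz, peaks, min_rel_intensity=0.01, decimals=2):
    """Turn one MS/MS spectrum into a bag of 'words' (hypothetical helper).

    peaks: list of (mz, intensity) pairs.
    Parameter names and defaults are illustrative assumptions.
    """
    if not peaks:
        return Counter()
    base = max(intensity for _, intensity in peaks)
    if base <= 0:
        return Counter()

    doc = Counter()
    for mz, intensity in peaks:
        rel = intensity / base
        if rel < min_rel_intensity:
            continue  # noise filter: drop peaks far below the base peak
        # Relative intensity (0..1) is scaled to an integer word count.
        weight = max(1, round(rel * 100))
        doc[f"frag@{mz:.{decimals}f}"] += weight
        loss = precursor_mz - mz
        if loss > 0:
            # Neutral loss: mass difference between precursor and fragment.
            doc[f"loss@{loss:.{decimals}f}"] += weight
    return doc


peaks = [(85.03, 1200.0), (127.04, 300.0), (145.05, 5.0)]
doc = spectrum_to_document(163.06, peaks)
print(doc)
```

In this toy spectrum the peak at m/z 145.05 falls below the 1% cutoff and is dropped, while the two remaining peaks each contribute one fragment word and one neutral-loss word.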

2. Model Training 🧠

  • Apply LDA to the processed spectra
  • Learn Mass2Motifs that describe recurring fragmentation patterns
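
To make the training step concrete, here is a toy collapsed Gibbs sampler for LDA written in plain Python. It is a teaching sketch only and is not the inference code MS2LDA uses; the function name `fit_lda`, the hyperparameter defaults, and the example "spectra" are assumptions. Each row of `topic_word` plays the role of a Mass2Motif (a probability distribution over fragment and loss words), and each row of `doc_topic` gives the motif loadings of one spectrum.

```python
import random


def fit_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative, not MS2LDA's code).

    docs: list of token lists, one list of 'words' per spectrum.
    Returns (vocab, topic_word, doc_topic); every row sums to 1.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    n_dk = [[0] * n_topics for _ in docs]            # doc-topic counts
    n_kw = [[0] * n_words for _ in range(n_topics)]  # topic-word counts
    n_k = [0] * n_topics                             # tokens per topic

    # Random initial topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][index[w]] += 1
            n_k[k] += 1
        assignments.append(z_d)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wi = index[w]
                k = assignments[d][i]
                # Remove the token, then resample its topic given all others.
                n_dk[d][k] -= 1
                n_kw[k][wi] -= 1
                n_k[k] -= 1
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][wi] + beta)
                    / (n_k[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][wi] += 1
                n_k[k] += 1

    topic_word = [
        [(n_kw[k][w] + beta) / (n_k[k] + n_words * beta) for w in range(n_words)]
        for k in range(n_topics)
    ]
    doc_topic = [
        [(n_dk[d][k] + alpha) / (len(doc) + n_topics * alpha) for k in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    return vocab, topic_word, doc_topic


# Four made-up spectra built from two distinct fragmentation patterns.
docs = [
    ["frag@85.03"] * 5 + ["loss@18.01"] * 5,
    ["frag@85.03"] * 4 + ["loss@18.01"] * 6,
    ["frag@119.05"] * 5 + ["loss@46.01"] * 5,
    ["frag@119.05"] * 6 + ["loss@46.01"] * 4,
]
vocab, topic_word, doc_topic = fit_lda(docs, n_topics=2)
for k, row in enumerate(topic_word):
    top = sorted(zip(row, vocab), reverse=True)[:2]
    print(f"motif_{k}:", [(w, round(p, 2)) for p, w in top])
```

On real data the number of motifs, the priors `alpha` and `beta`, and the number of iterations all need tuning, and a dedicated LDA library would normally replace this hand-written loop.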

3. Postprocessing & Annotation 🧾

  • Visualize motif loadings across spectra
  • Compare motifs to known entries in MotifDB
  • Annotate Mass2Motifs (M2M) automatically using MAG
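
Comparing a learned motif against reference motifs can be illustrated with cosine similarity over the motifs' feature probabilities. The text above does not say which similarity measure is used for MotifDB matching, so cosine similarity is an assumption here, as are the helper names, the 0.7 threshold, and the small `reference_motifs` dictionary, which is a stand-in and not the MotifDB API.

```python
import math


def cosine_similarity(motif_a, motif_b):
    """Cosine similarity between two motifs given as {feature: probability}."""
    shared = set(motif_a) & set(motif_b)
    dot = sum(motif_a[f] * motif_b[f] for f in shared)
    norm_a = math.sqrt(sum(v * v for v in motif_a.values()))
    norm_b = math.sqrt(sum(v * v for v in motif_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(motif, reference_motifs, threshold=0.7):
    """Return (name, score) of the most similar reference motif, or None."""
    scored = [
        (cosine_similarity(motif, ref), name)
        for name, ref in reference_motifs.items()
    ]
    if not scored:
        return None
    score, name = max(scored)
    return (name, score) if score >= threshold else None


# Hypothetical learned motif and reference entries (made-up values).
learned = {"frag@85.03": 0.6, "loss@18.01": 0.4}
reference_motifs = {
    "ref_A": {"frag@85.03": 0.6, "loss@18.01": 0.4},
    "ref_B": {"frag@119.05": 1.0},
}
match = best_match(learned, reference_motifs)
print(match)
```

A motif that shares no features with a reference entry scores 0, so only `ref_A` passes the threshold in this example.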

4. Analysis & Interpretation

  • Visualize motif loadings across spectra
  • Compare motifs to known entries in MotifDB
  • Annotate Mass2Motifs (M2M) automatically using MAG
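
As a rough idea of what inspecting motif loadings across spectra involves, the sketch below takes a spectrum-by-motif probability matrix, keeps the loadings above a cutoff, and prints a simple text bar chart. The function names, the 0.1 cutoff, and the example matrix are illustrative assumptions; MS2LDA's own visualizations are not reproduced here.

```python
def motif_loadings(doc_topic, spectrum_ids, min_loading=0.1):
    """Map each spectrum id to its (motif_index, loading) pairs, strongest first."""
    table = {}
    for sid, row in zip(spectrum_ids, doc_topic):
        hits = [(k, p) for k, p in enumerate(row) if p >= min_loading]
        table[sid] = sorted(hits, key=lambda kp: kp[1], reverse=True)
    return table


def render(table, width=20):
    """Render the loadings as one text bar per (spectrum, motif) pair."""
    lines = []
    for sid, hits in table.items():
        for k, p in hits:
            bar = "#" * round(p * width)
            lines.append(f"{sid}  motif_{k}  {bar} {p:.2f}")
    return "\n".join(lines)


# Made-up loadings: rows are spectra, columns are motifs.
doc_topic = [[0.95, 0.05], [0.30, 0.70]]
table = motif_loadings(doc_topic, ["spec_1", "spec_2"])
print(render(table))
```

Here `spec_1` is explained almost entirely by one motif, while `spec_2` mixes two, which is the kind of pattern a loading plot is meant to reveal.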

Do you want to learn more?

Check out the following references 📚:

van der Hooft et al. PNAS, 2016 (https://doi.org/10.1073/pnas.1608041113)

Rogers et al. Faraday Discussions, 2019 (https://doi.org/10.1039/C8FD00235E)

Torres Ortega et al. bioRxiv, 2025 (https://doi.org/10.1101/2025.06.19.659491)