A recent review in the Annual Review of Analytical Chemistry examines how machine learning (ML) is advancing mass spectrometry (MS) techniques for small-molecule analysis. MS has long been fundamental to elucidating molecular structures from spectral data, but the complexity of spectra and the vast diversity of small molecules make interpretation difficult. ML aims to ease these limitations by automating data interpretation, predicting molecular properties, and supporting structure elucidation. The review places particular emphasis on spectral prediction, annotation, and molecular identification, and on complementary physicochemical measurements such as collision cross sections (CCS) and retention times (RT).

Background
Mass spectrometry works by ionizing molecules and measuring their mass-to-charge ratios, producing spectra that reveal structural information. Features like isotope patterns and fragmentation trees help deduce chemical formulas and molecular architecture.
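As a rough illustration of the mass-to-charge measurement described above, the sketch below computes the monoisotopic mass of a formula and the m/z of its protonated ion. The function names and the choice of caffeine as an example are assumptions made for illustration and do not come from the review; the isotope masses are standard reference values.

```python
# Monoisotopic masses (Da) of the most abundant isotope of each element.
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}
PROTON_MASS = 1.00727647  # Da

def monoisotopic_mass(formula):
    """Sum isotope masses over an element -> count mapping."""
    return sum(MONOISOTOPIC[element] * count for element, count in formula.items())

def protonated_mz(formula, charge=1):
    """m/z of the [M + zH]z+ ion for a neutral molecule M."""
    neutral = monoisotopic_mass(formula)
    return (neutral + charge * PROTON_MASS) / charge

# Illustrative example (not from the review): caffeine, C8H10N4O2.
caffeine = {"C": 8, "H": 10, "N": 4, "O": 2}
caffeine_mz = protonated_mz(caffeine)
print(f"[M+H]+ m/z = {caffeine_mz:.4f}")
```

Formula-prediction tools work in the opposite direction, searching for formulas whose computed m/z and isotope pattern match the observed peaks within a tolerance.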
In parallel, techniques such as ion mobility spectrometry (IMS) measure how ions drift through a buffer gas under an electric field, yielding collision cross-section (CCS) values that relate to a molecule’s shape and size. These measurements offer spatial or geometric insights, providing a complementary perspective to mass-based analysis.
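The article does not state how drift behavior is converted into a CCS value. For low-field drift-tube measurements, the relation commonly used is the Mason-Schamp equation, shown here as background rather than as the review's own formulation:

```latex
\Omega = \frac{3}{16}\,\frac{z e}{N K}\,\sqrt{\frac{2\pi}{\mu k_B T}}
```

Here $\Omega$ is the collision cross section, $z e$ the ion charge, $N$ the buffer-gas number density, $K$ the measured ion mobility, $\mu$ the reduced mass of the ion and gas molecule, $k_B$ the Boltzmann constant, and $T$ the gas temperature. A slower-drifting ion (smaller $K$) therefore corresponds to a larger cross section.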
Retention time (RT), captured during chromatographic separations, adds a temporal dimension. It reflects how molecules interact with the stationary phase, often correlating with polarity and other physical properties that influence their movement through a medium.
When mass spectrometry is combined with IMS and chromatography, the result is a rich, multidimensional dataset. Machine learning models are increasingly used to interpret these data collectively, enhancing the accuracy of molecular identification. The wide variability in instrument platforms and experimental setups still poses challenges for standardization and automation, although machine learning is becoming better equipped to handle them.
The Current Study
The review explores a range of machine learning approaches developed for spectral interpretation, with a focus on how molecular and spectral data are represented. For physicochemical properties, models are designed to predict values like collision cross section (CCS) and retention time (RT) from molecular structure. These models often use graph neural networks (GNNs), transformers, or other deep learning architectures that capture the spatial and electronic features relevant to these measurements.
To process this information, models rely on molecular representations such as SMILES strings, molecular graphs, or high-dimensional feature vectors derived from spectral data. Tools like SIRIUS and BUDDY incorporate workflows that combine isotope pattern analysis, fragmentation trees, and heuristic scoring to predict chemical formulas and structures. Orthogonal measurements like CCS are also integrated into these workflows to support structural inference.
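To make the idea of a graph representation concrete, the following sketch encodes ethanol (SMILES "CCO", heavy atoms only) as a molecular graph and applies one untrained neighbor-aggregation step of the kind that GNN layers build on. It is a toy under stated assumptions: the one-hot features, the sum aggregation, and all function names are illustrative choices, not the architecture of any model discussed in the review.

```python
# Ethanol, SMILES "CCO": heavy atoms as nodes, bonds as undirected edges.
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]

ELEMENTS = ["C", "N", "O"]  # illustrative feature vocabulary

def adjacency(n_atoms, bonds):
    """Build a neighbor list from undirected bonds."""
    neighbors = {i: [] for i in range(n_atoms)}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)
    return neighbors

def one_hot(element):
    """Initial atom feature: one-hot encoding of the element."""
    return [1.0 if element == e else 0.0 for e in ELEMENTS]

def message_pass(features, neighbors):
    """One aggregation step: each atom adds its neighbors' features to its own."""
    updated = []
    for i, own in enumerate(features):
        combined = list(own)
        for j in neighbors[i]:
            combined = [x + y for x, y in zip(combined, features[j])]
        updated.append(combined)
    return updated

def readout(features):
    """Sum-pool atom features into one fixed-length molecule vector."""
    return [sum(column) for column in zip(*features)]

neighbors = adjacency(len(atoms), bonds)
h0 = [one_hot(a) for a in atoms]
h1 = message_pass(h0, neighbors)
mol_vec = readout(h1)
print(h1, mol_vec)
```

A trained GNN would replace the plain sums with learned transformations and stack several such steps before feeding the pooled vector to a regression head for CCS or RT.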
Transfer learning (TL) is highlighted as a useful strategy for adapting models across different instruments and experimental setups, which helps maintain prediction accuracy under varying conditions. At the same time, machine learning models for RT and CCS prediction are evolving from traditional support vector regression to deep learning methods, which provide better generalization and precision. These improvements are especially important for estimating these properties in complex mixtures.
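The underlying idea of reusing a model trained on one setup and adapting it with a handful of measurements from another can be sketched with a much simpler stand-in. The code below is not transfer learning of a deep model; it fits a linear recalibration from a source model's RT predictions to RTs measured on a second setup. All numbers are made up for illustration, and the function names are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical calibrant compounds: RT (min) predicted by a model trained on
# setup A, and RT actually measured for the same compounds on setup B.
source_predicted = [1.0, 2.0, 4.0, 6.0]
target_measured = [1.7, 3.2, 6.2, 9.2]

slope, intercept = fit_line(source_predicted, target_measured)

def adapt(rt_source):
    """Map a setup-A prediction onto the setup-B retention time scale."""
    return slope * rt_source + intercept

print(f"slope={slope:.3f}, intercept={intercept:.3f}, adapt(3.0)={adapt(3.0):.3f}")
```

Genuine transfer learning would instead fine-tune some or all of a pretrained network's weights on the target-setup data, which can capture nonlinear shifts that a single line cannot.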
Results & Discussion
Applying ML to physicochemical properties has led to notable improvements in small-molecule analysis workflows. Accurate prediction of CCS values helps differentiate isomers and conformers, which directly supports structural elucidation. Similarly, ML-enhanced RT prediction improves chromatographic alignment and compound annotation. Through TL and multimodal approaches that combine spectral and physical property data, the models show promising accuracy even when trained on limited spectral datasets.
The analysis underscores the importance of molecular representations: graph-based models tend to outperform traditional descriptors because they effectively encode the three-dimensional shape and electronic features that influence these measurements. Nevertheless, substantial challenges remain, particularly in generalizing models across different instrument platforms and experimental conditions. Efforts are ongoing to refine ML architectures for better interpretability and to incorporate complementary data types, such as collision cross sections and optical spectra, to improve the robustness of predictions.
The discussion emphasizes that orthogonal measurements like CCS and RT are vital for constraint-based structure elucidation, and that ML accelerates this process by providing rapid, accurate estimates that would otherwise require exhaustive experimental or computational effort.
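Constraint-based filtering of this kind can be sketched as follows: candidate structures are kept only if their theoretical m/z and their predicted CCS both fall within tolerance of the observed values. The candidate names, CCS values, and tolerance defaults are hypothetical and chosen only to show the mechanics.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    mz: float             # theoretical m/z of the candidate ion
    predicted_ccs: float  # CCS in square angstroms, e.g. from an ML model

def ppm_error(observed, theoretical):
    """Signed mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6

def filter_candidates(candidates, obs_mz, obs_ccs, mz_tol_ppm=5.0, ccs_tol_pct=3.0):
    """Keep candidates that satisfy both the m/z and the CCS constraint."""
    kept = []
    for c in candidates:
        mz_ok = abs(ppm_error(obs_mz, c.mz)) <= mz_tol_ppm
        ccs_ok = abs(obs_ccs - c.predicted_ccs) / c.predicted_ccs * 100 <= ccs_tol_pct
        if mz_ok and ccs_ok:
            kept.append(c)
    return kept

# Hypothetical candidates: two isomers with identical m/z and one unrelated ion.
candidates = [
    Candidate("isomer_A", 195.0877, 140.0),
    Candidate("isomer_B", 195.0877, 150.0),
    Candidate("unrelated", 195.1200, 141.0),
]
hits = filter_candidates(candidates, obs_mz=195.0879, obs_ccs=141.5)
print([c.name for c in hits])
```

In this toy case the mass constraint alone cannot separate the two isomers, while the CCS constraint removes the one whose predicted cross section is too far from the measurement.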
Conclusion
The review concludes that machine learning significantly enhances small-molecule identification by incorporating physicochemical properties such as CCS and RT as informative features. These properties link physical principles to spectral data, enabling more accurate molecular characterization.
Looking ahead, there is strong potential for integrated, multimodal ML models that combine physicochemical, spectral, and chemical information to improve small-molecule analysis workflows. Progress in property prediction, supported by advances in deep learning, transfer learning, and multimodal data integration, is expected to reduce reliance on extensive spectral libraries, accelerate identification, and aid the discovery of new molecules.
Integrating physical measurements into machine learning frameworks reflects a broader shift toward physically grounded, data-driven spectral analysis. This approach offers a more complete and precise understanding of molecular structures, especially in complex biological and environmental samples.
Journal Reference
Hong Y., et al. (2025). Machine Learning in Small-Molecule Mass Spectrometry. Annual Review of Analytical Chemistry, 18:193-215. DOI: 10.1146/annurev-anchem-071224-082157, https://www.annualreviews.org/content/journals/10.1146/annurev-anchem-071224-082157