Meta-features as predictors of breast cancer intrinsic subtypes in the METABRIC gene expression dataset — ASN Events

Meta-features as predictors of breast cancer intrinsic subtypes in the METABRIC gene expression dataset (#260)

Heloisa H Milioli (1, 2), Renato Vimieiro (1, 3), Carlos Riveros (1, 3), Regina Berretta (1, 3), Pablo Moscato (1, 3)
  1. Priority Research Centre for Bioinformatics, Biomarker Discovery and Information-Based Medicine, Hunter Medical Research Institute, Newcastle, NSW, Australia
  2. School of Environmental and Life Science, The University of Newcastle, Newcastle, NSW, Australia
  3. School of Electrical Engineering and Computer Science, The University of Newcastle, Newcastle, NSW, Australia

Gene expression microarray data have expanded our understanding of breast cancer and supported its classification into five distinct subtypes: luminal A, luminal B, HER2-enriched, normal-like, and basal-like [1,2]. The investigation of individual transcriptomic signatures remains a valuable tool for determining patient diagnosis and prognosis and for predicting therapy response. Novel methods for tumour stratification, biomarker identification and subtype prediction are therefore urgently needed for future applications in clinical practice [3,4]. In this study, we explore the ability of a newly proposed method to select reliable combinations of probes for discriminating breast cancer subtypes. We expanded the analysis of the original METABRIC breast cancer dataset, with over 2,000 samples, by considering the pair-wise differences between gene expression values (meta-features). We then computed the CM1 scores for each subtype and selected the partial balanced top-10 meta-features from the five groups of patients. The ability of these meta-features to assign subtypes was assessed with a list of classifiers from the Weka software suite [5], using both 10-fold cross-validation and a training-test setting. With the selected meta-features, the classifiers showed high predictive power in labelling samples, with an average Cramér’s V of 0.91 ± 0.044 in the discovery set and 0.92 ± 0.034 in the validation set. Our results also revealed almost perfect agreement (κ≈0.97) between the labels assigned by the majority of classifiers and the refined labels from METABRIC in both the discovery and validation sets. The selected meta-features included classic genes that outline breast cancer subtypes, and the pair-wise analysis markedly improved label prediction by bringing in novel potential biomarkers. Moreover, our approach highlighted the superior performance of an ensemble of classifiers or methods in accurately predicting sample subtypes.
Ultimately, our findings may enhance the molecular understanding of breast cancer gene signatures and support future applications in clinical practice.
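As an illustration of two ideas in the abstract, the following minimal Python sketch builds pair-wise difference meta-features from an expression matrix and computes Cramér’s V and Cohen’s κ from a table of true versus predicted labels. It is not the authors' pipeline: the study used Weka classifiers and CM1-based feature selection, neither of which is reproduced here, and all function names, the NumPy implementation and the toy inputs are assumptions made for this example.

```python
import numpy as np


def pairwise_difference_features(expr, probe_names):
    """Build meta-features as differences between every pair of probes.

    expr: array of shape (n_samples, n_probes) with expression values.
    Returns (meta, names), where meta has one column per probe pair (i < j).
    Hypothetical helper, named for this example only.
    """
    expr = np.asarray(expr, dtype=float)
    idx_i, idx_j = np.triu_indices(expr.shape[1], k=1)
    meta = expr[:, idx_i] - expr[:, idx_j]
    names = [f"{probe_names[i]}-{probe_names[j]}" for i, j in zip(idx_i, idx_j)]
    return meta, names


def contingency_table(true_labels, predicted_labels):
    """Square table of counts: rows are true labels, columns are predictions."""
    classes = sorted(set(true_labels) | set(predicted_labels))
    index = {c: k for k, c in enumerate(classes)}
    table = np.zeros((len(classes), len(classes)), dtype=float)
    for t, p in zip(true_labels, predicted_labels):
        table[index[t], index[p]] += 1
    return table


def cramers_v(table):
    """Cramér's V from a contingency table (no bias correction).

    Assumes at least two non-empty rows and two non-empty columns.
    """
    n = table.sum()
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    mask = expected > 0  # skip cells belonging to empty rows or columns
    chi2 = ((table - expected)[mask] ** 2 / expected[mask]).sum()
    k = min((table.sum(axis=1) > 0).sum(), (table.sum(axis=0) > 0).sum())
    return float(np.sqrt(chi2 / (n * (k - 1))))


def cohens_kappa(table):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = table.sum()
    observed = np.trace(table) / n
    expected = (table.sum(axis=1) * table.sum(axis=0)).sum() / n ** 2
    return float((observed - expected) / (1 - expected))
```

With p probes this produces p(p-1)/2 meta-features, so on a full microarray the feature space grows quadratically; that growth is why a ranking step such as the CM1 score is needed before classification.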