Abstract
BACKGROUND/OBJECTIVES
This study aimed to use plasma metabolites to identify clusters of breast cancer survivors and to compare their dietary characteristics and health-related factors across the clusters using unsupervised machine learning.
SUBJECTS/METHODS
A total of 419 breast cancer survivors were included in this cross-sectional study. We considered 30 plasma metabolites, quantified by high-throughput nuclear magnetic resonance metabolomics. Clusters were obtained based on metabolites using 4 different unsupervised clustering methods: k-means (KM), partitioning around medoids (PAM), self-organizing maps (SOM), and hierarchical agglomerative clustering (HAC). The t-test, χ2 test, and Fisher’s exact test were used to compare sociodemographic, lifestyle, clinical, and dietary characteristics across the clusters. P-values were adjusted through a false discovery rate (FDR).
RESULTS
Two clusters were identified using the 4 methods. Compared with cluster 1, participants in cluster 2 had lower concentrations of apolipoprotein A1 and large high-density lipoprotein (HDL) particles and smaller HDL particle sizes, but higher concentrations of chylomicrons and extremely large very-low-density lipoprotein (VLDL) particles and glycoprotein acetyls, a higher ratio of monounsaturated fatty acids to total fatty acids, and larger VLDL particle sizes. Body mass index was significantly higher in cluster 2 than in cluster 1 (FDR-adjusted P: KM < 0.001; PAM = 0.001; SOM < 0.001; HAC = 0.043).
Breast cancer is the most commonly diagnosed cancer among women worldwide [1]. In Korea, the age-standardized incidence rate of breast cancer has been increasing since 1999, and it was the most diagnosed cancer among women in 2021 [2]. The 5-yr survival rate in Korea also continued to increase, reaching 93.8% in 2017–2021 [2].
An increase in both incidence and survival rates suggests that evaluating and improving modifiable lifestyle factors among cancer survivors may be beneficial. Previous studies have shown that diet, obesity, and physical activity are associated with breast cancer prognosis [3,4,5,6,7]. However, studies on intermediates, including metabolites, have been limited, despite the potential association of modifiable factors with metabolites and breast cancer prognosis. Metabolites, small molecules produced during metabolic processes within cells, are emerging biomarkers of chronic diseases, including cancer. They are downstream products of proteins and genes and interact with and are influenced by lifestyle factors [8]. Metabolomics is the comprehensive study of metabolites in biological samples [9] and has been used in breast cancer research [10,11], including the areas of diagnosis [12,13], recurrence [14], and metastasis [15,16]. Samples of metabolites can be easily obtained through non-invasive liquid biopsies [11].
High-throughput techniques have enabled the simultaneous measurement of numerous metabolites and the generation of a massive quantity of metabolomics data. Multivariate analysis methods have been used to handle these high-dimensional metabolomics data [10]. Analyses can be categorized into supervised and unsupervised methods depending on the presence of supervisory variables, such as endpoints and true labels; an unsupervised method is often used to discover unknown labels. The most commonly used unsupervised method in metabolomics is principal component analysis (PCA), which performs dimension reduction. Unlike PCA, machine-learning-based clustering methods can be used to divide samples into subsets based on the similarities in features [17]. Clustering is well-suited for exploratory analyses, as it identifies hidden patterns and natural groupings in complex datasets without requiring predefined labels or known group structures.
To group breast cancer survivors with similar metabolite profiles, we used unsupervised machine-learning-based clustering methods, including k-means (KM) [18], partitioning around medoids (PAM) [19], self-organizing maps (SOM) [20], and hierarchical agglomerative clustering (HAC) [21]. We then compared metabolic, clinical, and dietary characteristics across the clusters. Identifying metabolite-based clusters allows for the categorization of cancer survivors into subgroups with distinct metabolic patterns. These clusters could be used to tailor post-cancer interventions, such as lifestyle recommendations, to improve metabolic health and reduce recurrence risks. Furthermore, understanding cluster-specific characteristics could guide future research into the mechanisms underlying metabolic changes in cancer survivors, enabling the development of more personalized care guidelines for breast cancer survivors.
Study participants who had been diagnosed with stage I–III primary breast cancer according to the American Joint Committee on Cancer (AJCC) [22] and who underwent breast cancer surgery at least 6 mon before enrollment were recruited from 5 hospitals in Korea between March 2015 and June 2019. Altogether, 535 female breast cancer survivors provided written informed consent at enrollment. Of these, a total of 419 breast cancer survivors were included in the cluster analysis after exclusions (Supplementary Fig. 1). Participants who had not been diagnosed with AJCC stage I–III breast cancer (n = 17), those who had undergone breast cancer surgery less than 6 mon before enrollment (n = 5), who were diagnosed, before enrollment, with other cancers either before or after breast cancer diagnosis (n = 20) or who had a breast cancer recurrence before enrollment (n = 4) were excluded. A further 74 participants for whom information on plasma metabolites was not available (n = 71) or who had a metabolite measurement failure rate of 10% or higher (n = 3) were also excluded, resulting in a total of 419 subjects. Of these, we additionally excluded 54 participants due to incomplete dietary data (n = 51) or implausible energy intakes (below or above 3 SD from the mean value of the log-transformed energy intake) (n = 3). Finally, dietary profiling analysis of the specified clusters was conducted on 365 participants. This study was performed in line with the principles of the Declaration of Helsinki. This study was approved by the Institutional Review Board of each of the 5 hospitals: Soonchunhyang University Hospital (SCHBC2014-12-004-001), Jeonbuk National University Hospital (CUH2014-05-002-005 and CUH2018-02-004-004), Keimyung University Dongsan Medical Center (DSMC2015-03-026), Dankook University Hospital (DKUH 2016-07-001-002), and Chosun University Hospital (CHOSUN 2016-06 and CHOSUN 2018-06). Informed consent was obtained from all individual participants included in the study.
Three-day dietary records and food frequency questionnaires (FFQs) were administered to assess the dietary intake of the participants. Among the 365 participants, 167 completed 3-day dietary records (3DR), while the other 198 participants completed a validated 123-item semi-quantitative FFQ developed for breast cancer survivors [23,24]. Participants using the 3DR were requested to record all food and beverages consumed on 2 non-consecutive weekdays and 1 weekend day. Photographic booklets of common foods were provided to assist them in estimating portion sizes. The amounts of food and nutrient intake from the dietary records were computed using the Computer-Aided Nutritional Analysis Program version 4.0 (The Korean Nutrition Society, Seoul, Korea), with the daily intake of foods and nutrients calculated by averaging the 3DR. The FFQ respondents reported how often they had consumed each food item on average over the past year in 9 frequency categories (never, once a month, 2–3 times/mon, once a week, 2–4 times/week, 5–6 times/week, once a day, twice/day, and 3 times/day). They also indicated their usual portion size under one of 3 categories (small, medium, and large). The daily food and nutrient intakes were calculated by multiplying the daily frequency by the portion size.
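The frequency-by-portion calculation for the FFQ can be sketched as follows. This is a minimal Python illustration and not the software used in the study: the conversion of each frequency category to times per day (for example, treating 2–3 times/mon as 2.5/30), the portion multipliers of 0.5, 1.0, and 1.5, and the function name daily_intake_g are all assumptions introduced here.

```python
# Hypothetical conversion of the 9 FFQ frequency categories to times per day.
# Category midpoints, a 30-day month, and a 7-day week are assumptions.
# Keys use ASCII hyphens in place of the en dashes printed in the text.
FREQ_PER_DAY = {
    "never": 0.0,
    "once a month": 1 / 30,
    "2-3 times/mon": 2.5 / 30,
    "once a week": 1 / 7,
    "2-4 times/week": 3 / 7,
    "5-6 times/week": 5.5 / 7,
    "once a day": 1.0,
    "twice/day": 2.0,
    "3 times/day": 3.0,
}

# Hypothetical multipliers relative to a medium (reference) portion.
PORTION_MULTIPLIER = {"small": 0.5, "medium": 1.0, "large": 1.5}


def daily_intake_g(frequency, portion, medium_portion_g):
    """Daily intake (g/day) = daily frequency x portion size."""
    portion_g = PORTION_MULTIPLIER[portion] * medium_portion_g
    return FREQ_PER_DAY[frequency] * portion_g
```

Under these assumed conversions, a food with a 70 g medium portion eaten once a week in a large portion would contribute about 15 g/day.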
Blood samples were drawn at enrollment in a non-fasting state and stored at −80°C until analysis. Frozen blood samples were shipped on dry ice for metabolomic analysis. We used a high-throughput proton nuclear magnetic resonance (NMR) metabolomics platform (Nightingale Health Plc, Helsinki, Finland), which provides quantification of routine lipids, lipoprotein subclass profiling with lipid concentrations within 14 subclasses, fatty acid composition, and various low-molecular-weight metabolites, including amino acids, ketone bodies, and gluconeogenesis-related metabolites, in molar concentration units [25]. The 14 lipoprotein subclasses were defined by their sizes as follows: extremely large very-low-density lipoprotein (VLDL) with particle diameters from 75 nm upwards and a possible contribution of chylomicrons; 5 VLDL subclasses—very large (average particle diameter of 64.0 nm), large (53.6 nm), medium (44.5 nm), small (36.8 nm), and very small (31.3 nm); intermediate-density lipoprotein (IDL) (28.6 nm); 3 low-density lipoprotein (LDL) subclasses—large (25.5 nm), medium (23.0 nm), and small (18.7 nm); and 4 high-density lipoprotein (HDL) subclasses—very large (14.3 nm), large (12.1 nm), medium (10.9 nm), and small (8.7 nm). Out of the 249 metabolic biomarkers for the plasma samples, 37 clinically validated biomarkers were consistent with the results of other clinical methods and were free of batch effects [26].
The metabolite levels were log-transformed and winsorized to 5 SD. To adjust for the differences in metabolite levels according to the study entry year, we divided the study period into 2 phases (2015–2016 and 2017–2019) and scaled the metabolite levels within each phase. We excluded 70 relative lipoprotein lipid concentration markers and 67 metabolites with missing values or values below the limit of quantification from the cluster analysis. Substantial correlations existed between the remaining 112 metabolites. Supplementary Fig. 2 shows a correlation plot of 57 metabolites, excluding the lipoprotein subclass markers. If the Pearson’s correlation coefficient between 2 log-transformed metabolite values was higher than 0.90, then non-clinically validated metabolites or those with a larger mean absolute correlation with the remaining metabolites were excluded. Finally, we selected the 30 metabolites listed in Supplementary Table 1. Representative coefficients of variation (CVs) across thousands of samples for the NMR-based metabolic measures were assessed in previous studies [27,28]. The CVs of the selected metabolites were all below 10%, except for the concentrations of chylomicrons and extremely large VLDL particles (XXL-VLDL-P; CV = 16.2%).
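The preprocessing steps described above can be sketched in Python for a single metabolite. This is an illustration under stated assumptions rather than the study's code: winsorizing is taken to mean clipping the log-transformed values at the mean ± 5 SD, scaling is taken to mean z-scoring within each phase, and all function names are hypothetical.

```python
import math
from statistics import mean, stdev


def winsorize(values, n_sd=5.0):
    """Clip values to mean +/- n_sd standard deviations (assumed definition)."""
    m, s = mean(values), stdev(values)
    lo, hi = m - n_sd * s, m + n_sd * s
    return [min(max(v, lo), hi) for v in values]


def zscore(values):
    """Scale values to mean 0 and SD 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def preprocess(concentrations, phases, n_sd=5.0):
    """Log-transform, winsorize, then z-score within each study phase."""
    clipped = winsorize([math.log(c) for c in concentrations], n_sd)
    scaled = [0.0] * len(clipped)
    for phase in set(phases):
        idx = [i for i, p in enumerate(phases) if p == phase]
        for i, z in zip(idx, zscore([clipped[i] for i in idx])):
            scaled[i] = z
    return scaled


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def highly_correlated(x, y, threshold=0.90):
    """Flag a metabolite pair whose correlation exceeds the 0.90 cut-off."""
    return pearson(x, y) > threshold
```

For a flagged pair, the choice of which metabolite to drop (the non-clinically validated one, or the one with the larger mean absolute correlation with the rest) would then follow the rule given in the text.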
Structured questionnaires were administered to collect anthropometric, sociodemographic, and lifestyle data and reproductive histories, as well as information on the use of dietary supplements. Self-reported height and weight at enrollment were used to calculate the body mass index (BMI, kg/m2). If any anthropometric information at enrollment was missing, the height and weight measurements at breast cancer diagnosis were taken from the medical records. Information on marital status, smoking status, alcohol consumption, and menopausal status was self-reported. Physical activity data, including the type, time spent, and frequency of exercises, were converted into metabolic equivalent task (MET)-hours per week. The MET value of each activity was determined according to the Compendium of Physical Activities [29]. The total MET-hours per week were calculated by summing the MET-hours per week for each exercise type. Additionally, participants provided information on the type, product name, dose, and frequency of any dietary supplements they had regularly consumed over the past year.
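The MET-hours calculation can be illustrated with a short Python sketch. The function name and the tuple layout are hypothetical, and the MET values in the example are placeholders chosen for the arithmetic, not values taken from the Compendium of Physical Activities.

```python
def met_hours_per_week(activities):
    """Sum MET-hours per week over exercise types.

    `activities` is an iterable of (met_value, hours_per_session,
    sessions_per_week) tuples, one per exercise type.
    """
    return sum(met * hours * sessions for met, hours, sessions in activities)


# Placeholder example: two exercise types with illustrative MET values.
example_week = [
    (3.5, 0.5, 5),  # 3.5 METs, 30 min per session, 5 sessions per week
    (7.0, 1.0, 2),  # 7.0 METs, 60 min per session, 2 sessions per week
]
total = met_hours_per_week(example_week)  # 8.75 + 14.0 = 22.75
```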
We calculated adherence scores according to the American Cancer Society (ACS) guidelines for cancer survivors regarding body weight, physical activity, and diet [30]. The healthy weight management score ranged from 1 to 4 points, assigned to BMI categories of < 18.5 or ≥ 30, 25 to < 30, 23 to < 25, and 18.5 to < 23 kg/m2, respectively. A physical activity score of 1 to 4 points was given based on the quartiles of physical activity levels. For diet, 1 to 4 points were given according to increasing quartiles of fruit/vegetable and whole-grain intake, and in reverse order for quartiles of red and processed meat intake, so that lower meat intake scored higher. The sum of scores for the food groups was divided into categories of 3–5, 6–7, 8–9, and 10–12 points, which were assigned 1 to 4 total points, respectively. The ACS score was calculated as the total sum of the weight management, physical activity, and diet scores, ranging from 3 to 12 points.
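The scoring rules above can be written out as a short Python sketch. It assumes that quartiles are coded 1 (lowest) to 4 (highest) and that the three food groups are fruit/vegetables, whole grains, and red and processed meat; the function names are hypothetical.

```python
def weight_score(bmi):
    """Healthy weight management score (1 to 4 points) from BMI in kg/m2."""
    if 18.5 <= bmi < 23:
        return 4
    if 23 <= bmi < 25:
        return 3
    if 25 <= bmi < 30:
        return 2
    return 1  # BMI < 18.5 or >= 30


def diet_score(fruit_veg_q, whole_grain_q, meat_q):
    """Diet score (1 to 4 points) from intake quartiles coded 1 to 4."""
    # Meat is scored in reverse: the lowest intake quartile earns 4 points.
    total = fruit_veg_q + whole_grain_q + (5 - meat_q)  # ranges from 3 to 12
    if total <= 5:
        return 1
    if total <= 7:
        return 2
    if total <= 9:
        return 3
    return 4


def acs_score(bmi, activity_q, fruit_veg_q, whole_grain_q, meat_q):
    """Total ACS adherence score, ranging from 3 to 12 points."""
    return (weight_score(bmi) + activity_q
            + diet_score(fruit_veg_q, whole_grain_q, meat_q))
```

For example, a participant with a BMI of 22 kg/m2, the highest activity quartile, the highest fruit/vegetable and whole-grain quartiles, and the lowest meat quartile would receive the maximum of 12 points.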
Clinical information, including the height and weight at diagnosis, AJCC stage, histological grade, diagnosis and operation dates, estrogen receptor (ER), progesterone receptor (PR), human epidermal growth factor receptor 2 (HER-2) status, treatment, other cancer diagnoses, recurrence, and metastasis before study enrollment, was collected from the medical records.
We used the KM, PAM, SOM, and HAC clustering methods, which are commonly used methods in omics data analysis. We employed 4 different methods to account for the variability in data characteristics and structure, as the optimal clustering approach can vary depending on these factors. The characteristics, advantages, and limitations of these methods are summarized in Supplementary Table 2. For KM, we used the Hartigan-Wong algorithm, and the initial centers were determined by the mean of each cluster identified in advance using HAC with the Ward2 linkage method. For PAM, we set options for calculating dissimilarities using Euclidean distance. Initial medoids were specified using the "build" algorithm, and the final medoids were optimized through the "swap" phase. For SOM, a hexagonal grid and squared Euclidean distance were chosen, and the algorithm was iterated 100 times, during which the learning rate decreased from 0.05 to 0.01. The Euclidean distance matrix and Ward2 linkage method were used for HAC. The silhouette index was used to determine the optimal number of clusters and to assess the performance of the clustering methods [31]. We visualized the identified clusters in a 2-dimensional plot using t-distributed stochastic neighbor embedding [32]. Cluster agreement between different clustering methods was evaluated as the percentage of participants assigned to the same clusters by both methods.
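To make the silhouette index concrete, the following Python sketch computes it from Euclidean distances for a given cluster assignment. It is an illustration only: the study used R packages for this step, and the function below is a hypothetical stand-in that expects at least 2 clusters and points that are not all identical.

```python
import math


def silhouette_index(points, labels):
    """Mean silhouette width over all observations (Euclidean distance)."""
    n = len(points)
    clusters = sorted(set(labels))
    widths = []
    for i in range(n):
        # Mean distance from point i to the members of each cluster,
        # leaving point i itself out of its own cluster.
        mean_dist = {}
        for c in clusters:
            members = [j for j in range(n) if labels[j] == c and j != i]
            if members:
                total = sum(math.dist(points[i], points[j]) for j in members)
                mean_dist[c] = total / len(members)
        own = labels[i]
        if own not in mean_dist:
            widths.append(0.0)  # singleton cluster: width set to 0
            continue
        a = mean_dist[own]  # cohesion within the assigned cluster
        b = min(d for c, d in mean_dist.items() if c != own)  # nearest other
        widths.append((b - a) / max(a, b))
    return sum(widths) / n
```

Values close to 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest that many observations sit closer to another cluster than to their own.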
The metabolites were ranked according to their importance for cluster formation in 3 ways, and the average ranking of each metabolite was calculated. First, t-statistics for the differences in means across clusters were used; the higher the absolute t-statistic, the greater the metabolite’s contribution to cluster identification. Second, a supervised approach was applied using the identified cluster membership as the label of the participants. The random forest (RF) model was trained using original data and tested using data with the variables permuted one at a time. The misclassification rate of the test results, known as permutation importance (PI), was calculated for each variable [33]; the higher the misclassification rate, the greater the variable’s contribution to cluster identification. Third, the adjusted Rand index (ARI) [34] was calculated between the cluster membership obtained from the original data and that obtained from data that excluded one variable (drop-column); the lower the ARI, the greater the contribution to cluster identification.
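The ARI used in the drop-column step can be sketched in Python as follows. This is a plain implementation of the standard pair-counting formula, given for illustration; the study itself used the flexclust R package, and the function name here is hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same observations."""
    n = len(labels_a)
    # Pairs of observations that fall together in both partitions.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs that fall together within each partition separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions have one cluster
    return (sum_cells - expected) / (max_index - expected)
```

The index equals 1 when the two partitions are identical up to relabeling, so a metabolite whose removal leaves the clusters unchanged yields an ARI of 1, whereas a lower ARI means that dropping the metabolite changed the assignment more.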
The metabolic characteristics of the clusters were compared using boxplots of the scaled metabolic levels (z-scores). The clusters were also compared based on socioeconomic, lifestyle, clinical, and dietary characteristics, as well as ACS scores, using the t-test for continuous variables and χ2 test or Fisher’s exact tests for categorical data. The dietary intakes of the nutrients were adjusted for total energy intake using the residual method [35]. For food intake, the food consumption status (non-consumer vs. consumer), the daily food intake across all participants, and the daily food intake among food consumers only, were compared across the clusters.
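The residual method for energy adjustment can be illustrated with a minimal Python sketch. It assumes a simple linear regression of untransformed nutrient intake on total energy intake, with the residual added back to the intake predicted at the mean energy intake; whether the study log-transformed the intakes first is not stated here, and the function name is hypothetical.

```python
from statistics import mean


def energy_adjust(nutrient, energy):
    """Energy-adjusted nutrient intakes by the residual method."""
    mx, my = mean(energy), mean(nutrient)
    # Ordinary least-squares fit of nutrient intake on energy intake.
    sxy = sum((x - mx) * (y - my) for x, y in zip(energy, nutrient))
    sxx = sum((x - mx) ** 2 for x in energy)
    slope = sxy / sxx
    intercept = my - slope * mx
    residuals = [y - (intercept + slope * x) for x, y in zip(energy, nutrient)]
    # Intake predicted at the mean energy intake equals the mean nutrient intake.
    predicted_at_mean = intercept + slope * mx
    return [r + predicted_at_mean for r in residuals]
```

The adjusted values keep the original mean but are uncorrelated with total energy intake, so cluster differences in a nutrient are not driven simply by how much the participants ate overall.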
All statistical analyses were performed using the R software (version 4.1.0; R Foundation for Statistical Computing, Vienna, Austria). The following R packages were used: clValid for internal validation and optimal clustering results, randomForest for the PI of RF, and flexclust for the ARI calculation. The tests for comparing the characteristics of the clusters were 2-sided, and the significance level was set at 0.05. To account for multiple comparisons, we also checked the adjusted P-values using the false discovery rate (FDR) method.
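For readers unfamiliar with FDR adjustment, the Python sketch below shows the Benjamini-Hochberg step-up procedure. The text does not name the specific FDR procedure, so the choice of Benjamini-Hochberg is an assumption made for illustration, and the function name is hypothetical.

```python
def fdr_bh(p_values):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending P
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A comparison would then be called significant after adjustment when its adjusted P-value is below 0.05, which matches the footnote convention used in the tables.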
The silhouette index values were plotted in Fig. 1 according to the number of clusters from 2 to 5. The optimal number of clusters was 2 for KM, SOM, and HAC, and 3 for PAM. Considering that the silhouette index values were similar in k = 2 (silhouette index = 0.146) and k = 3 (0.151), 2 clusters were identified across all clustering methods. The cluster assignment of the 419 participants is shown in Fig. 2. There were 273, 214, 238, and 237 participants in cluster 1 and 146, 205, 181, and 182 participants in cluster 2, identified by KM, PAM, SOM, and HAC, respectively. A total of 111 participants in cluster 1 and 76 participants in cluster 2 were common across all 4 clustering methods. The cluster assignment results were most similar between KM and SOM; 88.3% (% agreement) of participants were assigned to the same clusters (231 in cluster 1 and 139 in cluster 2). Cluster agreements ranged from 45.8% to 88.3% (Supplementary Table 3).
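The percentage agreement between KM and SOM can be reproduced from the counts reported above. In the Python sketch below, the 2 discordant cell counts (42 and 7) are derived by subtraction from the reported cluster sizes rather than taken from the text, the cluster labels are assumed to be aligned across methods, and the function name is hypothetical.

```python
def percent_agreement(labels_a, labels_b):
    """Percentage of participants given the same cluster by both methods."""
    same = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100.0 * same / len(labels_a)


# Cross-classification of the 419 participants for KM versus SOM.
both_c1 = 231                # cluster 1 in both methods (reported)
both_c2 = 139                # cluster 2 in both methods (reported)
km1_som2 = 273 - both_c1     # 42: KM cluster 1 but SOM cluster 2 (derived)
km2_som1 = 146 - both_c2     # 7: KM cluster 2 but SOM cluster 1 (derived)

km = [1] * both_c1 + [1] * km1_som2 + [2] * km2_som1 + [2] * both_c2
som = [1] * both_c1 + [2] * km1_som2 + [1] * km2_som1 + [2] * both_c2

agreement = percent_agreement(km, som)  # 370 of 419 participants agree
```

The derived cells are consistent with the reported SOM cluster sizes (231 + 7 = 238 and 42 + 139 = 181), and 370/419 rounds to the reported 88.3%.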
KM, k-means; PAM, partitioning around medoids; SOM, self-organizing maps; HAC, hierarchical agglomerative clustering.
t-SNE, t-distributed stochastic neighbor embedding; KM, k-means; C1, cluster 1; C2, cluster 2; PAM, partitioning around medoids; SOM, self-organizing maps; HAC, hierarchical agglomerative clustering.
The top 7 metabolites (top 25% out of 30) that contributed most to cluster formation are listed in Supplementary Table 4. XXL-VLDL-P, MUFA and MUFA% were included in the top 7 for KM, PAM, and SOM; glycoprotein acetyls (GlycA) and total-triglycerides (total-TG) were ranked in the top 7 for KM and PAM; and VLDL size, HDL size, L-HDL-P, and apolipoprotein A1 (ApoA1) were ranked in the top 7 for SOM and HAC.
The z-scores of the metabolites listed in Supplementary Table 4 were compared across the clusters identified by the 4 clustering methods (Fig. 3). The mean levels of the 30 selected metabolites are shown in Table 1. As identified by KM, PAM, and SOM, but not HAC, participants in cluster 2 had higher levels of MUFA, VLDL cholesterol (VLDL-C), phospholipids in very small VLDL (XS-VLDL-PL), and total-TG compared with those in cluster 1. For all the clustering methods, participants in cluster 2 had higher levels of MUFA% (ratio to total fatty acids), saturated fatty acids (SFA)%, XXL-VLDL-P, and GlycA, and larger VLDL size. However, participants in cluster 1 had higher levels of L-HDL-P, ApoA1, omega-6%, linoleic acid (LA)%, and unsaturation, and larger HDL size.
KM, k-means; VLDL size, average diameter for very-low-density lipoprotein particles; VLDL-C, very-low-density lipoprotein cholesterol; LDL size, average diameter for low-density lipoprotein particles; Total-TG, total-triglycerides; Omega-6%, ratio of omega-6 fatty acids to total fatty acids; MUFA, monounsaturated fatty acid; MUFA%, ratio of MUFA to total fatty acids; PAM, partitioning around medoids; XXL-VLDL-P, extremely large very-low-density lipoprotein particles; GlycA, glycoprotein acetyls; XS-VLDL-PL, phospholipids in very small very-low-density lipoprotein; SOM, self-organizing maps; L-HDL-P, large high-density lipoprotein particles; ApoA1, apolipoprotein A1; HDL size, average diameter for high-density lipoprotein particles; HAC, hierarchical agglomerative clustering; Total-PL, total-phospholipids in lipoprotein particles; XL-HDL-L, total lipids in very large high-density lipoprotein; LA, linoleic acid; C1, cluster 1; C2, cluster 2.
KM, k-means; PAM, partitioning around medoids; SOM, self-organizing maps; HAC, hierarchical agglomerative clustering; C1, cluster 1; C2, cluster 2; VLDL-C, very-low-density lipoprotein cholesterol; LDL-C, low-density lipoprotein cholesterol; Total-TG, total-triglycerides; Total-PL, total-phospholipids in lipoprotein particles; VLDL size, average diameter for very-low-density lipoprotein particles; LDL size, average diameter for low-density lipoprotein particles; HDL size, average diameter for high-density lipoprotein particles; ApoA1, apolipoprotein A1; ApoB/ApoA1, ratio of apolipoprotein B to apolipoprotein A1; PUFA, polyunsaturated fatty acid; MUFA, monounsaturated fatty acid; LA, linoleic acid; Omega-6%, ratio of omega-6 fatty acids to total fatty acids; MUFA%, ratio of MUFA to total fatty acids; SFA%, ratio of saturated fatty acids to total fatty acids; LA%, ratio of LA to total fatty acids; Ala, alanine; BCAA, branched chain amino acid; GlycA, glycoprotein acetyls; XXL-VLDL-P, extremely large very-low-density lipoprotein particles; XS-VLDL-PL, phospholipids in very small very-low-density lipoprotein; S-LDL-PL, phospholipids in small low-density lipoprotein; XL-HDL-L, total lipids in very large high-density lipoprotein; XL-HDL-FC, free cholesterol in very large high-density lipoprotein particles; L-HDL-P, large high-density lipoprotein particles; S-HDL-CE, cholesteryl esters in small high-density lipoprotein.
Tables 2 and 3 present the general and clinical characteristics of the participants according to the clusters. For SOM and HAC, the mean age at breast cancer diagnosis was higher in cluster 2 than in cluster 1 (P: SOM = 0.010; HAC = 0.029), and the difference remained significant after the FDR adjustment for SOM (FDR-adjusted P = 0.044). For all the clustering methods, BMI was significantly higher in cluster 2 (FDR-adjusted P: KM < 0.001; PAM = 0.001; SOM < 0.001; HAC = 0.043), and there was a particularly high proportion of participants with a BMI ≥ 25 kg/m2 in cluster 2 (FDR-adjusted P: KM = 0.003; PAM = 0.016; SOM = 0.001; HAC = 0.044). As identified by PAM, the proportion of participants who regularly used dietary supplements in the past year was significantly higher in cluster 1 than in cluster 2 (FDR-adjusted P = 0.023). The KM, SOM, and HAC methods showed similar tendencies, but these were not statistically significant. There were no significant differences between the clusters in terms of age at enrollment, menopausal status, educational levels, marital status, physical activity levels, alcohol consumption status, or smoking status. As identified by PAM, cluster 2 had a higher proportion of survivors with ER positive or PR positive breast cancer than cluster 1 (P = 0.002), but this was not significant after FDR adjustment. There were no differences in the cancer stage, time since surgery, personal history of chronic diseases (hypertension, diabetes mellitus, hyperlipidemia, or cardiovascular disease), HER-2 status, or cancer therapy between the clusters.
Values are presented as mean ± SD or number (%). P-values were obtained from a t-test for continuous variables and a χ2 test for categorical data.
KM, k-means; PAM, partitioning around medoids; SOM, self-organizing maps; HAC, hierarchical agglomerative clustering; C1, cluster 1; C2, cluster 2; BMI, body mass index; MET, metabolic equivalent task; 3DR, 3-day dietary record; FFQ, food frequency questionnaire.
1)False discovery rate adjusted P-value < 0.05.
Values are presented as mean ± SD or number (%). P-values were obtained from a χ2 test.
KM, k-means; PAM, partitioning around medoids; SOM, self-organizing maps; HAC, hierarchical agglomerative clustering; C1, cluster 1; C2, cluster 2; AJCC, American Joint Committee on Cancer; HER-2, human epidermal growth factor receptor 2; ER, estrogen receptor; PR, progesterone receptor; SERM, selective estrogen receptor modulator; AI, aromatase inhibitor.
1)Chronic disease included hypertension, diabetes mellitus, hyperlipidemia, and cardiovascular disease.
The food and nutrient intakes of the participants who completed the dietary assessment (n = 365) are shown in Tables 4 and 5. Daily thiamine intake was higher in cluster 2, as identified by KM and SOM (P = 0.012 for both KM and SOM), but it was not significant after FDR adjustment. As identified by SOM, red meat intake among red meat consumers was higher in cluster 1 than in cluster 2 (P = 0.035), but this difference became non-significant after FDR adjustment. Energy intake and the other energy-adjusted food and nutrient intakes were not significantly different between the clusters.
Values are presented as mean ± SD or number (%). P-values were obtained from a t-test.
KM, k-means; PAM, partitioning around medoids; SOM, self-organizing maps; HAC, hierarchical agglomerative clustering; C1, cluster 1; C2, cluster 2; SFA, saturated fatty acid; MUFA, monounsaturated fatty acid; PUFA, polyunsaturated fatty acid.
Table 6 shows the ACS guideline adherence scores of the participants according to the clusters. The total ACS score and the sub-scores for diet and physical activity did not differ between the clusters. Regarding the healthy weight management score, cluster 1 had a lower proportion of participants who received 2 points for a BMI of 25 to < 30 kg/m2 and a higher proportion of participants who received 4 points for a BMI of 18.5 to < 23 kg/m2 (P: KM = 0.013; PAM = 0.010; SOM = 0.002; HAC = 0.048). The result for SOM was still significant after the FDR adjustment.
Values are presented as mean ± SD or number (%). P-values were obtained from a χ2 test.
KM, k-means; PAM, partitioning around medoids; SOM, self-organizing maps; HAC, hierarchical agglomerative clustering; C1, cluster 1; C2, cluster 2; ACS, American Cancer Society; MET, metabolic equivalent task.
1)False discovery rate adjusted P-value < 0.05.
We applied 4 unsupervised clustering methods—KM, PAM, SOM, and HAC—and identified 2 clusters of breast cancer survivors based on plasma metabolites that included lipids, lipoproteins, fatty acids, and amino acids. When compared with cluster 1, participants in cluster 2 showed higher levels of XXL-VLDL-P, GlycA, MUFA%, and SFA%, and larger VLDL size, but lower levels of L-HDL-P, ApoA1, Omega-6%, and LA%, and smaller HDL size, as identified by all 4 clustering methods. Participants in cluster 1 had a lower BMI and a higher proportion of dietary supplement users compared with cluster 2. Intakes of food and nutrients and other lifestyle factors showed no significant differences between the clusters.
The cluster assignment by HAC and the metabolic characteristics of each cluster were different from those of the other clustering methods. MUFA, VLDL-C, XS-VLDL-PL, and total-TG were higher in cluster 2 as identified by KM, PAM, and SOM, but this pattern was not observed for HAC. This may be due to the way clustering is performed. HAC is the only method among the 4 clustering methods used in the current study that employs a hierarchical approach. Hierarchical algorithms cannot re-separate observations once they are merged in a previous step, even if the data are not optimally clustered. If a data structure is not hierarchical, the performance of the HAC method may not be as desirable compared to partitioning clustering methods such as KM [36].
In the current study, the BMI significantly differed between the clusters based on metabolites. Our results were consistent with the findings of studies that examined the association between BMI and lipid profiles. A cross-sectional study of U.S. women showed significantly higher concentrations of large VLDL particles and larger average VLDL size, but lower concentrations of large HDL particles and smaller average HDL size in a group with a BMI of 30–45 kg/m2 when compared with a group with a BMI of 18.5–25 kg/m2 [37]. In a previous study of healthy Finnish twins, measures of obesity including the BMI, waist circumference, and body fat composition, were positively correlated with the apolipoprotein B to ApoA1 ratio, glycoprotein, and MUFA%, and inversely correlated with omega-6% and polyunsaturated fatty acid% [38]. Fasting insulin, the homeostatic model assessment of insulin resistance, and C-reactive protein showed similar correlations with the aforementioned metabolites.
The proportion of dietary supplement users was higher in cluster 1 compared to cluster 2, but this difference was only significant when using the PAM method. A meta-analysis of randomized clinical trials conducted on overweight or obese participants found that vitamin D and calcium supplementation in lower doses (≤ 400 IU/day and ≤ 600 mg/day, respectively) reduced blood concentrations of total cholesterol and TG [39]. In our study, participants in cluster 1 had lower total cholesterol and TG levels as identified by PAM, which is consistent with previous findings.
Several studies have reported associations between human metabolites and dietary intake [40,41,42,43,44,45]. Previous studies found that the intake of thiamine was inversely associated with low high-density lipoprotein cholesterol (HDL-C; < 50 mg/dL for women and < 40 mg/dL for men) [46] and red meat intake was associated with low HDL-C and high TG [47]. Our study found that thiamine intake was higher in cluster 1 (by KM and SOM, characterized by higher HDL-C and lower total-TG), and red meat intake among consumers was lower in cluster 1 (by SOM). These results were not significant after FDR adjustment and should therefore be interpreted with caution. We did not find any significant associations between dietary factors and the clusters based on plasma metabolites after multiple comparison correction. There are several possible explanations. First, there may have been measurement errors in dietary assessments, as 2 different assessment methods (FFQ and 3DR) were used. Second, several metabolites, including glucose, pyruvate, and lactate, which are related to the glycolysis pathway, were excluded due to their low measurement quality. Third, the sample size may not have been sufficient to obtain significant results. Fourth, subtle between-person differences in metabolites may not have been enough to distinguish clusters based on diet. Lastly, most previous studies investigating the associations between dietary intake and metabolic profiles have been conducted in the general population. In contrast, our study focused on a specific population, that of female breast cancer survivors. Cancer survivors may be influenced by a complex interplay of factors beyond dietary intake, such as hormonal changes and lifestyle modifications following diagnosis. These unique factors may overshadow or modify the associations observed in the general population. As a result, there is a critical need for further studies targeting cancer survivor populations to validate these findings.
To the best of our knowledge, this is the first study to identify clusters of breast cancer survivors in the Korean population using multiple methods and to compare dietary and health-related factors among the clusters. Our study has several limitations. First, environmental and genetic factors that were not investigated in this study may be related to the metabolic characteristics of the clusters. Second, plasma samples used in the study were from individuals in the non-fasting state. However, a previous study showed a high correlation between lipid levels from fasting and non-fasting individuals, wherein the association with cardiovascular disease risk was consistent regardless of fasting status [48]. Third, although we used the 4 most commonly employed clustering methods, other model-based or density-based clustering methods, such as the Gaussian mixture model and density-based spatial clustering of applications with noise (DBSCAN), could also have been considered. Fourth, dietary intake was assessed using both the FFQ and the 3DR methods, which may have introduced variability in the data. Future studies could consider increasing the sample size and utilizing a single dietary assessment method to ensure consistency and improve the reliability of the findings. Finally, we only included metabolites for clustering; various additional factors could also be included, which would require further studies.
In conclusion, the unsupervised clustering methods allowed us to analyze multiple metabolites without any supervisory outcomes and to identify meaningful clusters of breast cancer survivors. BMI differed significantly between the 2 identified clusters. Further prospective studies are needed to comprehensively investigate the associations between metabolites, obesity, dietary factors, and breast cancer prognosis.
ACKNOWLEDGMENTS
We would like to thank the participants and the research team for their contributions.
Notes
Funding: This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2014R1A2A2A01007794, 2019R1F1A1061017 and 2021R1F1A1062476).
Author Contributions:
Conceptualization: Yie GE, Lee JE.
Data curation: Yie GE, Song S, Kim Z, Youn HJ, Cho J, Min JW, Kim YS, Lee JE.
Formal analysis: Yie GE.
Funding acquisition: Lee JE.
Methodology: Yie GE, Lee JE.
Writing - original draft: Yie GE, Lee JE.
Writing - review & editing: Yie GE, Kyeong W, Song S, Kim Z, Youn HJ, Cho J, Min JW, Kim YS, Lee JE.
References
1. Sung H, Ferlay J, Siegel RL, Laversanne M, Soerjomataram I, Jemal A, Bray F. Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin. 2021; 71:209–249. PMID: 33538338.

2. Park EH, Jung KW, Park NJ, Kang MJ, Yun EH, Kim HJ, Kim JE, Kong HJ, Im JS, Seo HG, et al. Cancer statistics in Korea: incidence, mortality, survival, and prevalence in 2021. Cancer Res Treat. 2024; 56:357–371. PMID: 38487832.

3. Kwan ML, Weltzien E, Kushi LH, Castillo A, Slattery ML, Caan BJ. Dietary patterns and breast cancer recurrence and survival among women with early-stage breast cancer. J Clin Oncol. 2009; 27:919–926. PMID: 19114692.

4. Chan DSM, Vieira AR, Aune D, Bandera EV, Greenwood DC, McTiernan A, Navarro Rosenblatt D, Thune I, Vieira R, Norat T. Body mass index and survival in women with breast cancer-systematic literature review and meta-analysis of 82 follow-up studies. Ann Oncol. 2014; 25:1901–1914. PMID: 24769692.

5. Lahart IM, Metsios GS, Nevill AM, Carmichael AR. Physical activity, risk of death and recurrence in breast cancer survivors: a systematic review and meta-analysis of epidemiological studies. Acta Oncol. 2015; 54:635–654. PMID: 25752971.

6. He J, Gu Y, Zhang S. Consumption of vegetables and fruits and breast cancer survival: a systematic review and meta-analysis. Sci Rep. 2017; 7:599. PMID: 28377568.

7. Jayedi A, Emadi A, Khan TA, Abdolshahi A, Shab-Bidar S. Dietary fiber and survival in women with breast cancer: a dose-response meta-analysis of prospective cohort studies. Nutr Cancer. 2021; 73:1570–1580. PMID: 32795218.

8. Gibney MJ, Walsh M, Brennan L, Roche HM, German B, van Ommen B. Metabolomics in human nutrition: opportunities and challenges. Am J Clin Nutr. 2005; 82:497–503. PMID: 16155259.

9. Dunn WB, Ellis DI. Metabolomics: current analytical platforms and methodologies. Trends Analyt Chem. 2005; 24:285–294.

10. Silva C, Perestrelo R, Silva P, Tomás H, Câmara JS. Breast cancer metabolomics: from analytical platforms to multivariate data analysis. a review. Metabolites. 2019; 9:102. PMID: 31121909.

11. McCartney A, Vignoli A, Biganzoli L, Love R, Tenori L, Luchinat C, Di Leo A. Metabolomics in breast cancer: a decade in review. Cancer Treat Rev. 2018; 67:88–96. PMID: 29775779.

12. Lécuyer L, Victor Bala A, Deschasaux M, Bouchemal N, Nawfal Triba M, Vasson MP, Rossary A, Demidem A, Galan P, Hercberg S, et al. NMR metabolomic signatures reveal predictive plasma metabolites associated with long-term risk of developing breast cancer. Int J Epidemiol. 2018; 47:484–494. PMID: 29365091.

13. Yang L, Wang Y, Cai H, Wang S, Shen Y, Ke C. Application of metabolomics in the diagnosis of breast cancer: a systematic review. J Cancer. 2020; 11:2540–2551. PMID: 32201524.

14. Asiago VM, Alvarado LZ, Shanaiah N, Gowda GA, Owusu-Sarfo K, Ballas RA, Raftery D. Early detection of recurrent breast cancer using metabolite profiling. Cancer Res. 2010; 70:8309–8318. PMID: 20959483.

15. Jobard E, Pontoizeau C, Blaise BJ, Bachelot T, Elena-Herrmann B, Trédan O. A serum nuclear magnetic resonance-based metabolomic signature of advanced metastatic human breast cancer. Cancer Lett. 2014; 343:33–41. PMID: 24041867.

16. Tenori L, Oakman C, Morris PG, Gralka E, Turner N, Cappadona S, Fornier M, Hudis C, Norton L, Luchinat C, et al. Serum metabolomic profiles evaluated after surgery may identify patients with oestrogen receptor negative early breast cancer at increased risk of disease recurrence. Results from a retrospective study. Mol Oncol. 2015; 9:128–139. PMID: 25151299.

17. Eicher T, Kinnebrew G, Patt A, Spencer K, Ying K, Ma Q, Machiraju R, Mathé AEA. Metabolomics and multi-omics integration: a survey of computational methods and resources. Metabolites. 2020; 10:202. PMID: 32429287.

18. Hartigan JA, Wong MA. Algorithm AS 136: a k-means clustering algorithm. J R Stat Soc Ser C Appl Stat. 1979; 28:100–108.

19. Schubert E, Rousseeuw PJ. Faster k-medoids clustering: improving the PAM, CLARA, and CLARANS algorithms. In: Similarity Search and Applications, SISAP 2019; 2019 Oct 2-4; Newark, NJ, USA. Cham: Springer;2019.
21. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. New York (NY): Springer New York;2017. p. 520–527.
22. Giuliano AE, Connolly JL, Edge SB, Mittendorf EA, Rugo HS, Solin LJ, Weaver DL, Winchester DJ, Hortobagyi GN. Breast cancer-major changes in the American Joint Committee on Cancer eighth edition cancer staging manual. CA Cancer J Clin. 2017; 67:290–303. PMID: 28294295.

23. Shin WK, Song S, Hwang E, Moon HG, Noh DY, Lee JE. Development of a FFQ for breast cancer survivors in Korea. Br J Nutr. 2016; 116:1781–1786. PMID: 27842613.

24. Moon SE, Shin WK, Song S, Koh D, Ahn JS, Yoo Y, Kang M, Lee JE. Validity and reproducibility of a food frequency questionnaire for breast cancer survivors in Korea. Nutr Res Pract. 2022; 16:789–800. PMID: 36467770.

25. Soininen P, Kangas AJ, Würtz P, Suna T, Ala-Korpela M. Quantitative serum nuclear magnetic resonance metabolomics in cardiovascular epidemiology and genetics. Circ Cardiovasc Genet. 2015; 8:192–206. PMID: 25691689.

26. Nightingale Health Plc. Clinically validated biomarkers [Internet]. Helsinki: Nightingale Health Plc.;2024. cited 2024 December 30. Available from: https://nightingalehealth.com/uploads/documents/Nightingale-Blood-Analysis_List-of-Biomarkers.pdf.
27. Kettunen J, Demirkan A, Würtz P, Draisma HH, Haller T, Rawal R, Vaarhorst A, Kangas AJ, Lyytikäinen LP, Pirinen M, et al. Genome-wide study for circulating metabolites identifies 62 loci and reveals novel systemic effects of LPA. Nat Commun. 2016; 7:11122. PMID: 27005778.

28. Holmes MV, Millwood IY, Kartsonaki C, Hill MR, Bennett DA, Boxall R, Guo Y, Xu X, Bian Z, Hu R, et al. Lipids, lipoproteins, and metabolites and risk of myocardial infarction and stroke. J Am Coll Cardiol. 2018; 71:620–632. PMID: 29420958.

29. Ainsworth BE, Haskell WL, Herrmann SD, Meckes N, Bassett DR Jr, Tudor-Locke C, Greer JL, Vezina J, Whitt-Glover MC, Leon AS. 2011 Compendium of physical activities: a second update of codes and MET values. Med Sci Sports Exerc. 2011; 43:1575–1581. PMID: 21681120.
30. Rock CL, Doyle C, Demark-Wahnefried W, Meyerhardt J, Courneya KS, Schwartz AL, Bandera EV, Hamilton KK, Grant B, McCullough M, et al. Nutrition and physical activity guidelines for cancer survivors. CA Cancer J Clin. 2012; 62:243–274. PMID: 22539238.

31. Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math. 1987; 20:53–65.

32. Van der Maaten L, Hinton G. Visualizing data using t-SNE. J Mach Learn Res. 2008; 9:2579–2605.
33. Breiman L. Random forests. Mach Learn. 2001; 45:5–32.
35. Willett W, Stampfer MJ. Total energy intake: implications for epidemiologic analyses. Am J Epidemiol. 1986; 124:17–27. PMID: 3521261.

36. James G, Witten D, Hastie T, Tibshirani R. An Introduction to Statistical Learning. New York (NY): Springer;2013. p. 523.
37. Magkos F, Mohammed BS, Mittendorfer B. Effect of obesity on the plasma lipoprotein subclass profile in normoglycemic and normolipidemic men and women. Int J Obes. 2008; 32:1655–1664.

38. Bogl LH, Kaye SM, Rämö JT, Kangas AJ, Soininen P, Hakkarainen A, Lundbom J, Lundbom N, Ortega-Alonso A, Rissanen A, et al. Abdominal obesity and circulating metabolites: a twin study approach. Metabolism. 2016; 65:111–121. PMID: 26892522.

39. Kashkooli S, Choghakhori R, Hasanvand A, Abbasnezhad A. Effect of calcium and vitamin D co-supplementation on lipid profile of overweight/obese subjects: a systematic review and meta-analysis of the randomized clinical trials. Obes Med. 2019; 15:100124.
40. O’Sullivan A, Gibney MJ, Brennan L. Dietary intake patterns are reflected in metabolomic profiles: potential role in dietary assessment studies. Am J Clin Nutr. 2011; 93:314–321. PMID: 21177801.

41. Schmidt JA, Rinaldi S, Ferrari P, Carayol M, Achaintre D, Scalbert A, Cross AJ, Gunter MJ, Fensom GK, Appleby PN, et al. Metabolic profiles of male meat eaters, fish eaters, vegetarians, and vegans from the EPIC-Oxford cohort. Am J Clin Nutr. 2015; 102:1518–1526. PMID: 26511225.

42. Gibbons H, Carr E, McNulty BA, Nugent AP, Walton J, Flynn A, Gibney MJ, Brennan L. Metabolomic-based identification of clusters that reflect dietary patterns. Mol Nutr Food Res. 2017; 61:1601050.

43. Lindqvist HM, Rådjursöga M, Malmodin D, Winkvist A, Ellegård L. Serum metabolite profiles of habitual diet: evaluation by 1H-nuclear magnetic resonance analysis. Am J Clin Nutr. 2019; 110:53–62. PMID: 31127814.

44. Navarro SL, Tarkhan A, Shojaie A, Randolph TW, Gu H, Djukovic D, Osterbauer KJ, Hullar MA, Kratz M, Neuhouser ML, et al. Plasma metabolomics profiles suggest beneficial effects of a low-glycemic load dietary pattern on inflammation and energy metabolism. Am J Clin Nutr. 2019; 110:984–992. PMID: 31432072.

45. Walker ME, Song RJ, Xu X, Gerszten RE, Ngo D, Clish CB, Corlin L, Ma J, Xanthakis V, Jacques PF, et al. Proteomic and metabolomic correlates of healthy dietary patterns: the Framingham Heart Study. Nutrients. 2020; 12:1476. PMID: 32438708.

46. Wu Y, Li S, Wang W, Zhang D. Associations of dietary vitamin B1, vitamin B2, niacin, vitamin B6, vitamin B12 and folate equivalent intakes with metabolic syndrome. Int J Food Sci Nutr. 2020; 71:738–749. PMID: 31986943.

47. Azadbakht L, Esmaillzadeh A. Red meat intake is associated with metabolic syndrome and the plasma C-reactive protein concentration in women. J Nutr. 2009; 139:335–339. PMID: 19074209.

48. Tikkanen E, Kanerva N, Aittomaki V, Männistö S, Salomaa VV, Wurtz P. Fasting samples are not required for NMR metabolic profiling studies of cardiovascular disease risk: prospective data for 4,400 individuals profiled few weeks apart. Circulation. 2019; 140:A10212.
SUPPLEMENTARY MATERIALS
Supplementary Table 2
Strengths and limitations of the KM, PAM, SOM, and HAC clustering methods
Supplementary Table 3
Cluster agreements between the KM, PAM, SOM, and HAC clustering methods
Supplementary Table 4
The 7 metabolites that contributed most to cluster formation
Supplementary Fig. 1
Flow diagram of breast cancer survivors included in the study.
Supplementary Fig. 2
A correlation plot of the 57 metabolites, excluding lipoprotein subclasses markers.


