Nutr Res Pract, v.19(2)

Yie, Kyeong, Song, Kim, Youn, Cho, Min, Kim, and Lee: Plasma metabolite based clustering of breast cancer survivors and identification of dietary and health related characteristics: an application of unsupervised machine learning

Abstract

BACKGROUND/OBJECTIVES

This study aimed to use plasma metabolites to identify clusters of breast cancer survivors and to compare their dietary characteristics and health-related factors across the clusters using unsupervised machine learning.

SUBJECTS/METHODS

A total of 419 breast cancer survivors were included in this cross-sectional study. We considered 30 plasma metabolites, quantified by high-throughput nuclear magnetic resonance metabolomics. Clusters were obtained based on metabolites using 4 different unsupervised clustering methods: k-means (KM), partitioning around medoids (PAM), self-organizing maps (SOM), and hierarchical agglomerative clustering (HAC). The t-test, χ² test, and Fisher’s exact test were used to compare sociodemographic, lifestyle, clinical, and dietary characteristics across the clusters. P-values were adjusted for multiple comparisons using the false discovery rate (FDR) method.

RESULTS

Two clusters were identified using the 4 methods. Participants in cluster 2 had lower concentrations of apolipoprotein A1 and large high-density lipoprotein (HDL) particles and smaller HDL particle sizes, but higher concentrations of chylomicrons and extremely large very-low-density-lipoprotein (VLDL) particles and glycoprotein acetyls, a higher ratio of monounsaturated fatty acids to total fatty acids, and larger VLDL particle sizes compared with cluster 1. Body mass index was significantly higher in cluster 2 compared with cluster 1 (FDR-adjusted P_KM < 0.001; P_PAM = 0.001; P_SOM < 0.001; and P_HAC = 0.043).

CONCLUSION

The breast cancer survivors clustered on the basis of plasma metabolites had distinct characteristics. Further prospective studies are needed to investigate the associations between metabolites, obesity, dietary factors, and breast cancer prognosis.

INTRODUCTION

Breast cancer is the most commonly diagnosed cancer among women worldwide [1]. In Korea, the age-standardized incidence rate of breast cancer has been increasing since 1999, and it was the most diagnosed cancer among women in 2021 [2]. The 5-yr survival rate in Korea also continued to increase, reaching 93.8% in 2017–2021 [2].
An increase in both incidence and survival rates suggests that evaluating and improving modifiable lifestyle factors among cancer survivors may be beneficial. Previous studies have shown that diet, obesity, and physical activity are associated with breast cancer prognosis [3,4,5,6,7]. However, studies on intermediates, including metabolites, have been limited, despite the potential association of modifiable factors with metabolites and breast cancer prognosis. Metabolites, small molecules produced during metabolic processes within cells, are emerging biomarkers of chronic diseases, including cancer. They are downstream products of proteins and genes and interact with and are influenced by lifestyle factors [8]. Metabolomics is the comprehensive study of metabolites in biological samples [9] and has been used in breast cancer research [10,11], including the areas of diagnosis [12,13], recurrence [14], and metastasis [15,16]. Samples of metabolites can be easily obtained through non-invasive liquid biopsies [11].
High-throughput techniques have enabled the simultaneous measurement of numerous metabolites and the generation of a massive quantity of metabolomics data. Multivariate analysis methods have been used to handle these high-dimensional metabolomics data [10]. Analyses can be categorized into supervised and unsupervised methods depending on the presence of supervisory variables, such as endpoints and true labels; an unsupervised method is often used to discover unknown labels. The most commonly used unsupervised method in metabolomics is principal component analysis (PCA), which is a dimension-reduction technique. Unlike PCA, machine-learning-based clustering methods can be used to divide samples into subsets based on the similarities in features [17]. Clustering is well-suited for exploratory analyses, as it identifies hidden patterns and natural groupings in complex datasets without requiring predefined labels or known group structures.
To identify similar features and cluster metabolites, we used unsupervised machine learning-based clustering methods, including k-means (KM) [18], partitioning around medoids (PAM) [19], self-organizing maps (SOM) [20] and hierarchical agglomerative clustering (HAC) [21]. We then compared metabolic, clinical, and dietary characteristics across the clusters. Identifying metabolite-based clusters allows for the categorization of cancer survivors into subgroups with distinct metabolic patterns. These clusters can be used to tailor post-cancer interventions, such as lifestyle recommendations, to improve metabolic health, and reduce recurrence risks. Furthermore, understanding cluster-specific characteristics could guide future research into the mechanisms underlying metabolic changes in cancer survivors, enabling the development of more personalized care guidelines for breast cancer survivors.

SUBJECTS AND METHODS

Study population

Study participants who had been diagnosed with stage I–III primary breast cancer according to the American Joint Committee on Cancer (AJCC) [22] and who had undergone breast cancer surgery at least 6 mon before enrollment were recruited from 5 hospitals in Korea between March 2015 and June 2019. Altogether, 535 female breast cancer survivors provided written informed consent at enrollment. Of these, a total of 419 breast cancer survivors were included in the cluster analysis after exclusions (Supplementary Fig. 1). Participants who had not been diagnosed with AJCC stage I–III breast cancer (n = 17), who had undergone breast cancer surgery less than 6 mon before enrollment (n = 5), who had been diagnosed with another cancer before enrollment, either before or after their breast cancer diagnosis (n = 20), or who had a breast cancer recurrence before enrollment (n = 4) were excluded. A further 74 participants for whom information on plasma metabolites was not available (n = 71) or who had a metabolite measurement failure rate of 10% or higher (n = 3) were also excluded, resulting in a total of 419 subjects. For the dietary analyses, we additionally excluded 54 participants due to incomplete dietary data (n = 51) or implausible energy intakes (more than 3 SD below or above the mean of the log-transformed energy intake) (n = 3). Thus, dietary profiling analysis of the identified clusters was conducted on 365 participants. This study was performed in line with the principles of the Declaration of Helsinki and was approved by the Institutional Review Board of each of the 5 hospitals: Soonchunhyang University Hospital (SCHBC2014-12-004-001), Jeonbuk National University Hospital (CUH2014-05-002-005 and CUH2018-02-004-004), Keimyung University Dongsan Medical Center (DSMC2015-03-026), Dankook University Hospital (DKUH 2016-07-001-002), and Chosun University Hospital (CHOSUN 2016-06 and CHOSUN 2018-06). Informed consent was obtained from all individual participants included in the study.
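The energy-intake exclusion criterion can be illustrated with a short sketch. The function name `plausible_energy_mask`, the use of the sample SD, and the example intakes are assumptions made for illustration; the study reports only the 3 SD criterion on log-transformed energy intake.

```python
import math
from statistics import mean, stdev

def plausible_energy_mask(energy_kcal, n_sd=3.0):
    """Flag intakes lying within n_sd SDs of the mean log-transformed energy intake.

    Assumption: the sample SD of the log values is used; the source does not
    say which SD estimator was applied.
    """
    logs = [math.log(e) for e in energy_kcal]
    mu, sd = mean(logs), stdev(logs)
    return [abs(x - mu) <= n_sd * sd for x in logs]

# Hypothetical intakes (kcal/day): 30 ordinary values and 1 implausible value.
intakes = [1500 + 20 * i for i in range(30)] + [60000]
keep = plausible_energy_mask(intakes)
```

Participants whose flag is `False` would be dropped before the dietary comparison.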

Dietary assessments

Three-day dietary records and food frequency questionnaires (FFQs) were administered to assess the dietary intake of the participants. Among the 365 participants, 167 completed 3-day dietary records (3DR), while the other 198 participants completed a validated 123-item semi-quantitative FFQ developed for breast cancer survivors [23,24]. Participants using the 3DR were requested to record all food and beverages consumed on 2 non-consecutive weekdays and 1 weekend day. Photographic booklets of common foods were provided to assist them in estimating portion sizes. The amounts of food and nutrient intake from the dietary records were computed using the Computer-Aided Nutritional Analysis Program version 4.0 (The Korean Nutrition Society, Seoul, Korea), with the daily intake of foods and nutrients calculated by averaging the 3DR. The FFQ respondents reported how often they had consumed each food item on average over the past year in 9 frequency categories (never, once a month, 2–3 times/mon, once a week, 2–4 times/week, 5–6 times/week, once a day, twice/day, and 3 times/day). They also indicated their usual portion size under one of 3 categories (small, medium, and large). The daily food and nutrient intakes were calculated by multiplying the daily frequency by the portion size.
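The FFQ calculation (daily frequency multiplied by portion size) can be sketched as follows. The conversion of each frequency category to a daily value, in particular the use of range midpoints and a 30-day month, and the portion weights in grams are assumptions for illustration; the source does not report these conversion factors.

```python
# Assumed daily-frequency equivalents for the 9 FFQ categories
# (range midpoints and a 30-day month are illustrative choices).
FREQ_PER_DAY = {
    "never": 0.0,
    "once a month": 1 / 30,
    "2-3 times/mon": 2.5 / 30,
    "once a week": 1 / 7,
    "2-4 times/week": 3 / 7,
    "5-6 times/week": 5.5 / 7,
    "once a day": 1.0,
    "twice/day": 2.0,
    "3 times/day": 3.0,
}

def daily_intake_g(frequency, portion, portion_grams):
    """Daily intake of one food item = daily frequency x portion weight (g)."""
    return FREQ_PER_DAY[frequency] * portion_grams[portion]

# Hypothetical portion weights for a single food item.
example_portions = {"small": 100.0, "medium": 150.0, "large": 200.0}
example_intake = daily_intake_g("once a day", "medium", example_portions)
```

Nutrient intakes would follow by multiplying each item's daily amount by its nutrient content and summing over items.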

Analysis of metabolomics

Blood samples were drawn at enrollment in a non-fasting state and stored at −80°C until analysis. Frozen blood samples were shipped on dry ice for metabolomic analysis. We used a high-throughput proton nuclear magnetic resonance (NMR) metabolomics platform (Nightingale Health Plc, Helsinki, Finland), which provides quantification of routine lipids, lipoprotein subclass profiling with lipid concentrations within 14 subclasses, fatty acid composition, and various low-molecular-weight metabolites, including amino acids, ketone bodies, and gluconeogenesis-related metabolites, in molar concentration units [25]. The 14 lipoprotein subclasses were defined by their sizes as follows: extremely large very-low-density lipoprotein (VLDL) with particle diameters from 75 nm upwards and a possible contribution of chylomicrons; 5 VLDL subclasses—very large (average particle diameter of 64.0 nm), large (53.6 nm), medium (44.5 nm), small (36.8 nm), and very small (31.3 nm); intermediate-density lipoprotein (IDL) (28.6 nm); 3 low-density lipoprotein (LDL) subclasses—large (25.5 nm), medium (23.0 nm), and small (18.7 nm); and 4 high-density lipoprotein (HDL) subclasses—very large (14.3 nm), large (12.1 nm), medium (10.9 nm), and small (8.7 nm). Out of the 249 metabolic biomarkers for the plasma samples, 37 clinically validated biomarkers were consistent with the results of other clinical methods and were free of batch effects [26].
The metabolite levels were log-transformed and winsorized to 5 SD. To adjust for the differences in metabolite levels according to the study entry year, we divided the study period into 2 phases (2015–2016 and 2017–2019) and scaled the metabolite levels within each phase. We excluded 70 relative lipoprotein lipid concentration markers and 67 metabolites with missing values or values below the limit of quantification from the cluster analysis. Substantial correlations existed between the remaining 112 metabolites. Supplementary Fig. 2 shows a correlation plot of 57 metabolites, excluding the lipoprotein subclass markers. If the Pearson’s correlation coefficient between 2 log-transformed metabolite values was higher than 0.90, then non-clinically validated metabolites or those with a larger mean absolute correlation with the remaining metabolites were excluded. Finally, we selected the 30 metabolites listed in Supplementary Table 1. Representative coefficients of variation (CVs) across thousands of samples for the NMR-based metabolic measures were assessed in previous studies [27,28]. The CVs of the selected metabolites were all below 10%, except for the concentrations of chylomicrons and extremely large VLDL particles (XXL-VLDL-P; CV = 16.2%).
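The preprocessing of a single metabolite (log transformation, winsorization to 5 SD, and scaling within each enrollment phase) can be sketched as below. The function names are hypothetical, and winsorizing on the pooled sample before scaling within phases is an assumption; the source does not state whether winsorization was applied within or across phases.

```python
import math
from statistics import mean, stdev

def winsorize(values, n_sd=5.0):
    """Clip values to mean +/- n_sd sample SDs."""
    mu, sd = mean(values), stdev(values)
    lo, hi = mu - n_sd * sd, mu + n_sd * sd
    return [min(max(v, lo), hi) for v in values]

def zscore(values):
    """Scale values to mean 0 and sample SD 1."""
    mu, sd = mean(values), stdev(values)
    return [(v - mu) / sd for v in values]

def preprocess(concentrations, phase):
    """Log-transform, winsorize (pooled, assumed), then z-score within each phase."""
    logged = winsorize([math.log(c) for c in concentrations])
    scaled = [0.0] * len(concentrations)
    for p in set(phase):
        idx = [i for i, q in enumerate(phase) if q == p]
        for i, z in zip(idx, zscore([logged[i] for i in idx])):
            scaled[i] = z
    return scaled

# Hypothetical concentrations measured in the 2 enrollment phases.
conc = [1.0, 2.0, 4.0, 10.0, 20.0, 40.0]
phases = ["2015-2016"] * 3 + ["2017-2019"] * 3
scaled = preprocess(conc, phases)
```

Scaling within phases removes level differences between the 2 enrollment periods, so the second phase in this toy example ends up with the same z-scores as the first despite its tenfold higher concentrations.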

Assessment of covariates

Structured questionnaires were administered to collect anthropometric, sociodemographic, and lifestyle data and reproductive histories, as well as information on the use of dietary supplements. Self-reported height and weight at enrollment were used to calculate the body mass index (BMI, kg/m²). If any anthropometric information at enrollment was missing, the height and weight measurements at breast cancer diagnosis were taken from the medical records. Information on marital status, smoking status, alcohol consumption, and menopausal status was self-reported. Physical activity data, including the type, time spent, and frequency of exercises, were converted into metabolic equivalent task (MET)-hours per week. The MET value of each activity was determined according to the Compendium of Physical Activities [29]. The total MET-hours per week were calculated by summing the MET-hours per week for each exercise type. Additionally, participants provided information on the type, product name, dose, and frequency of any dietary supplements they had regularly consumed over the past year.
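The MET-hours calculation can be sketched as a sum over exercise types. The field names and the MET values in the example are placeholders chosen for illustration, not entries taken from the Compendium of Physical Activities.

```python
def met_hours_per_week(activities):
    """Sum MET x hours per session x sessions per week over all exercise types."""
    return sum(
        a["met"] * a["hours_per_session"] * a["sessions_per_week"]
        for a in activities
    )

# Hypothetical participant: the MET values below are illustrative placeholders.
example_activities = [
    {"name": "walking", "met": 3.5, "hours_per_session": 0.5, "sessions_per_week": 4},
    {"name": "yoga", "met": 2.5, "hours_per_session": 1.0, "sessions_per_week": 2},
]
example_total = met_hours_per_week(example_activities)  # 7.0 + 5.0 MET-h/week
```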
We calculated adherence scores according to the American Cancer Society (ACS) guidelines for cancer survivors regarding body weight, physical activity, and diet [30]. The healthy weight management score ranged from 1 to 4 points, assigned to BMI categories of < 18.5 or ≥ 30, 25 to < 30, 23 to < 25, and 18.5 to < 23 kg/m², respectively. A physical activity score of 1 to 4 points was given based on the quartiles of physical activity levels. For diet, 1 to 4 points were given according to increasing quartiles of fruit/vegetable and whole-grain intake and according to decreasing quartiles of red and processed meat intake. The sum of scores for each food group was divided into categories of 3–5, 6–7, 8–9, and 10–12 points, and assigned 1 to 4 total points, respectively. The ACS score was calculated as the total sum of the weight management, physical activity, and diet scores, ranging from 3 to 12 points.
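The scoring rules can be written out as a small sketch. The function names are hypothetical, and the quartile ranks are taken as inputs (1 = lowest quartile, 4 = highest) because the quartile cut-points are not reported in the text; reversing the red and processed meat quartile reflects our reading that lower intake earns more points.

```python
def weight_score(bmi):
    """1-4 points by BMI category (kg/m^2), as described in the text."""
    if bmi < 18.5 or bmi >= 30:
        return 1
    if bmi >= 25:
        return 2
    if bmi >= 23:
        return 3
    return 4  # 18.5 to < 23

def diet_score(fruit_veg_q, whole_grain_q, meat_q):
    """Collapse the 3 food-group scores (sum 3-12) into 1-4 points.

    Each *_q is an intake quartile rank (1 = lowest intake). The meat
    quartile is reversed so that lower intake earns more points (assumed).
    """
    total = fruit_veg_q + whole_grain_q + (5 - meat_q)
    if total <= 5:
        return 1
    if total <= 7:
        return 2
    if total <= 9:
        return 3
    return 4

def acs_score(bmi, activity_q, fruit_veg_q, whole_grain_q, meat_q):
    """Total adherence score (3-12): weight + physical activity + diet."""
    return weight_score(bmi) + activity_q + diet_score(fruit_veg_q, whole_grain_q, meat_q)
```

For example, a participant with a BMI of 22 kg/m² in the top quartile of physical activity, fruit/vegetable intake, and whole-grain intake and the bottom quartile of red and processed meat intake would receive the maximum of 12 points.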
Clinical information, including the height and weight at diagnosis, AJCC stage, histological grade, diagnosis and operation dates, estrogen receptor (ER), progesterone receptor (PR), human epidermal growth factor receptor 2 (HER-2) status, treatment, other cancer diagnoses, recurrence, and metastasis before study enrollment, was collected from the medical records.

Statistical analysis

We used the KM, PAM, SOM, and HAC clustering methods, which are commonly used methods in omics data analysis. We employed 4 different methods to account for the variability in data characteristics and structure, as the optimal clustering approach can vary depending on these factors. The characteristics, advantages, and limitations of these methods are summarized in Supplementary Table 2. For KM, we used the Hartigan-Wong algorithm, and the initial centers were determined by the mean of each cluster identified in advance using HAC with the Ward2 linkage method. For PAM, we set options for calculating dissimilarities using Euclidean distance. Initial medoids were specified using the "build" algorithm, and the final medoids were optimized through the "swap" phase. For SOM, a hexagonal grid and squared Euclidean distance were chosen, and the algorithm was iterated 100 times, during which the learning rate decreased from 0.05 to 0.01. The Euclidean distance matrix and Ward2 linkage method were used for HAC. The silhouette index was used to determine the optimal number of clusters and to assess the performance of the clustering methods [31]. We visualized the identified clusters in a 2-dimensional plot using t-distributed stochastic neighbor embedding [32]. Cluster agreement between different clustering methods was evaluated as the percentage of participants assigned to the same clusters by both methods.
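As an illustration of the silhouette index used to choose the number of clusters, the following sketch computes the mean silhouette width from Euclidean distances. It is a plain-Python rendering of the standard definition, not the R code used in the study; the function name and the toy points are assumptions.

```python
import math

def silhouette_index(points, labels):
    """Mean silhouette width over all points, using Euclidean distance."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("the silhouette index needs at least 2 clusters")
    widths = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:  # singleton cluster: width is defined as 0
            widths.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)

# Toy example: 2 well-separated pairs of points.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_index(toy_points, [1, 1, 2, 2])
bad = silhouette_index(toy_points, [1, 2, 1, 2])
```

Values close to 1 indicate compact, well-separated clusters, whereas values near 0 indicate overlapping clusters; the labelling that splits the 2 pairs correctly scores far higher than the one that mixes them.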
The metabolites were ranked according to their importance for cluster formation in 3 ways, and the average ranking of each metabolite was calculated. First, t-statistics for the differences in means across clusters were used; the higher the absolute t-statistic, the greater the metabolite’s contribution to cluster identification. Second, a supervised approach was applied using the identified cluster membership as the label of the participants. The random forest (RF) model was trained using the original data and tested using data with the variables permuted one at a time. The misclassification rate of the test results, known as permutation importance (PI), was calculated for each variable [33]; the higher the misclassification rate, the greater the variable’s contribution to cluster identification. Third, the adjusted Rand index (ARI) [34] was calculated between the cluster membership obtained from the original data and that obtained from data that excluded one variable (drop-column); the lower the ARI, the greater the contribution to cluster identification.
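The ARI used in the drop-column approach compares 2 partitions of the same participants while correcting for chance agreement. The sketch below implements the standard pair-counting formula in plain Python; the study itself used the flexclust package in R, and the function name here is an assumption.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between 2 partitions of the same items."""
    n = len(labels_a)
    # Pairs of items placed together in both partitions.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together within each partition separately.
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / comb(n, 2)
    maximum = (together_a + together_b) / 2
    if maximum == expected:  # degenerate case, e.g. a single cluster in both
        return 1.0
    return (together_both - expected) / (maximum - expected)
```

Because the index depends only on which participants are grouped together, it is unaffected by how the clusters are numbered: `adjusted_rand_index([1, 1, 2, 2], [2, 2, 1, 1])` equals 1. In the drop-column setting, a metabolite whose removal yields a low ARI against the original clustering is one that the cluster structure depends on.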
The metabolic characteristics of the clusters were compared using boxplots of the scaled metabolic levels (z-scores). The clusters were also compared based on socioeconomic, lifestyle, clinical, and dietary characteristics, as well as ACS scores, using the t-test for continuous variables and the χ² test or Fisher’s exact test for categorical data. The dietary intakes of the nutrients were adjusted for total energy intake using the residual method [35]. For food intake, the food consumption status (non-consumer vs. consumer), the daily food intake across all participants, and the daily food intake among food consumers only, were compared across the clusters.
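The residual method of energy adjustment can be sketched as a simple linear regression of nutrient intake on total energy intake. Adding the mean nutrient intake back to the residuals is a common convention and is assumed here, as is the absence of any prior transformation; the source names only the method.

```python
from statistics import mean

def energy_adjust(nutrient, energy):
    """Residual-method energy adjustment.

    Regress nutrient intake on total energy intake, then return the residuals
    plus the mean nutrient intake, i.e. the intake expected at the mean
    energy intake (adding this constant is an assumed convention).
    """
    mx, my = mean(energy), mean(nutrient)
    sxx = sum((x - mx) ** 2 for x in energy)
    slope = sum((x - mx) * (y - my) for x, y in zip(energy, nutrient)) / sxx
    intercept = my - slope * mx
    return [y - (intercept + slope * x) + my for x, y in zip(energy, nutrient)]

# Hypothetical intakes: nutrient (g/day) exactly proportional to energy (kcal/day).
adjusted = energy_adjust([30.0, 40.0, 50.0], [1500.0, 2000.0, 2500.0])
```

In this toy example every adjusted value equals the mean intake (40 g/day), because all of the variation in the nutrient is explained by energy intake.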
All statistical analyses were performed using the R software (version 4.1.0; R Foundation for Statistical Computing, Vienna, Austria). The following R packages were used: clValid for internal validation and optimal clustering results, randomForest for the PI of RF, and flexclust for the ARI calculation. The tests for comparing the characteristics of the clusters were 2-sided, and the significance level was set at 0.05. To account for multiple comparisons, we also checked the adjusted P-values using the false discovery rate (FDR) method.
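For readers unfamiliar with FDR adjustment, the sketch below shows the Benjamini-Hochberg step-up procedure, which is the most widely used FDR method. That the study used this particular procedure is an assumption, as the text names only "the FDR method"; the function name and example P-values are illustrative.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw P-values from 4 comparisons.
example_adjusted = bh_adjust([0.01, 0.04, 0.03, 0.005])
```

A comparison is then declared significant when its adjusted P-value falls below the chosen level (0.05 in this study).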

RESULTS

Results of the cluster analysis

The silhouette index values were plotted in Fig. 1 according to the number of clusters from 2 to 5. The optimal number of clusters was 2 for KM, SOM, and HAC, and 3 for PAM. Considering that the silhouette index values were similar in k = 2 (silhouette index = 0.146) and k = 3 (0.151), 2 clusters were identified across all clustering methods. The cluster assignment of the 419 participants is shown in Fig. 2. There were 273, 214, 238, and 237 participants in cluster 1 and 146, 205, 181, and 182 participants in cluster 2, identified by KM, PAM, SOM, and HAC, respectively. A total of 111 participants in cluster 1 and 76 participants in cluster 2 were common across all 4 clustering methods. The cluster assignment results were most similar between KM and SOM; 88.3% (% agreement) of participants were assigned to the same clusters (231 in cluster 1 and 139 in cluster 2). Cluster agreements ranged from 45.8% to 88.3% (Supplementary Table 3).
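The percent agreement between 2 clustering methods can be reproduced with a few lines of code. The sketch assumes that cluster labels have already been aligned between methods (so that "cluster 2" denotes the same metabolic profile in both); the function name is hypothetical.

```python
def percent_agreement(labels_a, labels_b):
    """Percentage of participants given the same cluster label by 2 methods."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must have equal length")
    same = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100.0 * same / len(labels_a)

# Reported KM vs. SOM counts: 231 shared in cluster 1 and 139 in cluster 2, of 419.
km_vs_som = 100.0 * (231 + 139) / 419
print(round(km_vs_som, 1))  # 88.3
```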
Fig. 1

Silhouette index plot for the KM, PAM, SOM, and HAC clustering methods.

KM, k-means; PAM, partitioning around medoids; SOM, self-organizing maps; HAC, hierarchical agglomerative clustering.
Fig. 2

t-SNE visualization of cluster assignments by KM, PAM, SOM, and HAC.

t-SNE, t-distributed stochastic neighbor embedding; KM, k-means; C1, cluster 1; C2, cluster 2; PAM, partitioning around medoids; SOM, self-organizing maps; HAC, hierarchical agglomerative clustering.
The top 7 metabolites (top 25% out of 30) that contributed most to cluster formation are listed in Supplementary Table 4. XXL-VLDL-P, MUFA and MUFA% were included in the top 7 for KM, PAM, and SOM; glycoprotein acetyls (GlycA) and total-triglycerides (total-TG) were ranked in the top 7 for KM and PAM; and VLDL size, HDL size, L-HDL-P, and apolipoprotein A1 (ApoA1) were ranked in the top 7 for SOM and HAC.

Comparison of the profiles of the identified clusters

The z-scores of the metabolites listed in Supplementary Table 4 were compared across the clusters identified by the 4 clustering methods (Fig. 3). The mean levels of the 30 selected metabolites are shown in Table 1. As identified by KM, PAM, and SOM, but not HAC, participants in cluster 2 had higher levels of MUFA, VLDL cholesterol (VLDL-C), phospholipids in very small VLDL (XS-VLDL-PL), and total-TG compared with those in cluster 1. For all the clustering methods, participants in cluster 2 had higher levels of MUFA% (ratio to total fatty acids), saturated fatty acids (SFA)%, XXL-VLDL-P, and GlycA, and larger VLDL size. However, participants in cluster 1 had higher levels of L-HDL-P, ApoA1, omega-6%, linoleic acids (LAs)%, and unsaturation, and larger HDL size.
Fig. 3

Boxplots of z-scores of the top 7 metabolites that contributed most to cluster formation.

KM, k-means; VLDL size, average diameter for very-low-density lipoprotein particles; VLDL-C, very-low-density lipoprotein cholesterol; LDL size, average diameter for low-density lipoprotein particles; Total-TG, total-triglycerides; Omega-6%, ratio of omega-6 fatty acids to total fatty acids; MUFA, monounsaturated fatty acid; MUFA%, ratio of MUFA to total fatty acids; PAM, partitioning around medoids; XXL-VLDL-P, extremely large very-low-density lipoprotein particles; GlycA, glycoprotein acetyls; XS-VLDL-PL, phospholipids in very small very-low-density lipoprotein; SOM, self-organizing maps; L-HDL-P, large high-density lipoprotein particles; ApoA1, apolipoprotein A1; HDL size, average diameter for high-density lipoprotein particles; HAC, hierarchical agglomerative clustering; Total-PL, total-phospholipids in lipoprotein particles; XL-HDL-L, total lipids in very large high-density lipoprotein; LA, linoleic acid; C1, cluster 1; C2, cluster 2.
Table 1

Mean levels of raw metabolites according to the clusters

| Metabolites | KM C1 | KM C2 | PAM C1 | PAM C2 | SOM C1 | SOM C2 | HAC C1 | HAC C2 |
|---|---|---|---|---|---|---|---|---|
| Cholesterol (mmol/L) | | | | | | | | |
| VLDL-C | 0.785 | 1.107 | 0.728 | 1.074 | 0.802 | 1.023 | 0.903 | 0.890 |
| Clinical LDL-C | 2.500 | 2.479 | 2.331 | 2.661 | 2.610 | 2.338 | 2.746 | 2.163 |
| Triglycerides (mmol/L) | | | | | | | | |
| Total-TG | 0.911 | 1.475 | 0.818 | 1.410 | 0.912 | 1.364 | 1.156 | 1.045 |
| Phospholipids (mmol/L) | | | | | | | | |
| Total-PL | 2.476 | 2.551 | 2.337 | 2.675 | 2.569 | 2.414 | 2.781 | 2.139 |
| Lipoprotein particle sizes (nm) | | | | | | | | |
| VLDL size | 37.909 | 39.148 | 37.931 | 38.768 | 37.794 | 39.059 | 38.004 | 38.780 |
| LDL size | 23.900 | 23.755 | 23.894 | 23.803 | 23.908 | 23.773 | 23.887 | 23.800 |
| HDL size | 9.639 | 9.514 | 9.611 | 9.579 | 9.684 | 9.479 | 9.721 | 9.432 |
| Other lipids (mmol/L) | | | | | | | | |
| Sphingomyelins | 0.429 | 0.473 | 0.411 | 0.479 | 0.444 | 0.444 | 0.471 | 0.409 |
| Apolipoproteins (g/L) | | | | | | | | |
| ApoA1 | 1.114 | 0.992 | 1.081 | 1.061 | 1.152 | 0.965 | 1.213 | 0.888 |
| ApoB/ApoA1 | 0.826 | 1.084 | 0.801 | 1.036 | 0.820 | 1.042 | 0.839 | 1.016 |
| Fatty acids (mmol/L) | | | | | | | | |
| Unsaturation | 1.392 | 1.261 | 1.388 | 1.303 | 1.404 | 1.270 | 1.398 | 1.279 |
| PUFA | 4.434 | 4.071 | 4.175 | 4.446 | 4.574 | 3.958 | 4.906 | 3.529 |
| MUFA | 2.313 | 3.332 | 2.152 | 3.207 | 2.357 | 3.076 | 2.758 | 2.551 |
| LA | 2.895 | 2.266 | 2.667 | 2.686 | 3.003 | 2.246 | 3.196 | 1.999 |
| Omega-6% | 40.036 | 31.478 | 40.349 | 33.615 | 40.207 | 32.909 | 38.249 | 35.498 |
| MUFA% | 23.028 | 28.421 | 22.945 | 26.955 | 22.798 | 27.680 | 23.716 | 26.459 |
| SFA% | 33.227 | 37.528 | 33.223 | 36.295 | 33.119 | 36.838 | 33.786 | 35.950 |
| LA% | 28.227 | 18.346 | 27.654 | 21.788 | 28.684 | 19.656 | 27.604 | 21.112 |
| Amino acids (mmol/L) | | | | | | | | |
| Ala | 0.471 | 0.515 | 0.450 | 0.524 | 0.480 | 0.494 | 0.502 | 0.466 |
| BCAA | 0.409 | 0.448 | 0.395 | 0.451 | 0.407 | 0.443 | 0.437 | 0.404 |
| Glycolysis-related metabolites (mmol/L) | | | | | | | | |
| Citrate | 0.077 | 0.087 | 0.077 | 0.085 | 0.077 | 0.085 | 0.080 | 0.082 |
| Ketone bodies (mmol/L) | | | | | | | | |
| Acetone | 0.017 | 0.015 | 0.017 | 0.015 | 0.017 | 0.015 | 0.017 | 0.014 |
| Inflammation (mmol/L) | | | | | | | | |
| GlycA | 0.779 | 0.952 | 0.769 | 0.913 | 0.779 | 0.918 | 0.821 | 0.863 |
| Lipoprotein subclasses (mmol/L) | | | | | | | | |
| XXL-VLDL-P (nmol/L) | 1.733 | 4.864 | 1.671 | 4.027 | 1.636 | 4.386 | 2.278 | 3.534 |
| XS-VLDL-PL | 0.121 | 0.155 | 0.111 | 0.156 | 0.125 | 0.144 | 0.140 | 0.124 |
| S-LDL-PL | 0.083 | 0.078 | 0.078 | 0.085 | 0.086 | 0.075 | 0.090 | 0.070 |
| XL-HDL-L | 0.196 | 0.175 | 0.187 | 0.190 | 0.209 | 0.161 | 0.218 | 0.150 |
| XL-HDL-FC | 0.032 | 0.033 | 0.031 | 0.033 | 0.033 | 0.031 | 0.034 | 0.030 |
| L-HDL-P (μmol/L) | 1.135 | 0.767 | 1.058 | 0.954 | 1.260 | 0.674 | 1.353 | 0.556 |
| S-HDL-CE | 0.246 | 0.217 | 0.245 | 0.226 | 0.248 | 0.220 | 0.249 | 0.218 |
KM, K-means; PAM, partitioning around medoids; SOM, self-organizing maps; HAC, hierarchical agglomerative clustering; C1, cluster 1; C2, cluster 2; VLDL-C, very-low-density lipoprotein cholesterol; LDL-C, low-density lipoprotein cholesterol; Total-TG, total-triglycerides; Total-PL, total-phospholipids in lipoprotein particles; VLDL size, average diameter for very-low-density lipoprotein particles; LDL size, average diameter for low-density lipoprotein particles; HDL size, average diameter for high-density lipoprotein particles; ApoA1, apolipoprotein A1; ApoB/ApoA1, ratio of apolipoprotein B to apolipoprotein A1; PUFA, polyunsaturated fatty acid; MUFA, monounsaturated fatty acid; LA, linoleic acid; Omega-6%, ratio of omega-6 fatty acids to total fatty acids; MUFA%, ratio of MUFA to total fatty acids; SFA%, ratio of saturated fatty acids to total fatty acids; LA%, ratio of LA to total fatty acids; Ala, alanine; BCAA, branched chain amino acid; GlycA, glycoprotein acetyls; XXL-VLDL-P, extremely large very-low-density lipoprotein particles; XS-VLDL-PL, phospholipids in very small very-low-density lipoprotein; S-LDL-PL, phospholipids in small low-density lipoprotein; XL-HDL-L, total lipids in very large high-density lipoprotein; XL-HDL-FC, free cholesterol in very large high-density lipoprotein particles; L-HDL-P, large high-density lipoprotein particles; S-HDL-CE, cholesteryl esters in small high-density lipoprotein.
Tables 2 and 3 present the general and clinical characteristics of the participants according to the clusters. For SOM and HAC, the mean age at breast cancer diagnosis was higher in cluster 2 than in cluster 1 (P_SOM = 0.010; P_HAC = 0.029), and the difference remained significant after the FDR adjustment for SOM (FDR-adjusted P_SOM = 0.044). For all the clustering methods, BMI was significantly higher in cluster 2 (FDR-adjusted P_KM < 0.001; P_PAM = 0.001; P_SOM < 0.001; and P_HAC = 0.043), and there was a particularly high proportion of participants with a BMI ≥ 25 kg/m² in cluster 2 (FDR-adjusted P_KM = 0.003; P_PAM = 0.016; P_SOM = 0.001; P_HAC = 0.044). As identified by PAM, the proportion of participants who had regularly used dietary supplements in the past year was significantly higher in cluster 1 than in cluster 2 (FDR-adjusted P = 0.023). The KM, SOM, and HAC methods showed similar tendencies, but these were not statistically significant. There were no significant differences between the clusters in terms of age at enrollment, menopausal status, educational levels, marital status, physical activity levels, alcohol consumption status, or smoking status. As identified by PAM, cluster 2 had a higher proportion of survivors with ER-positive or PR-positive breast cancer than cluster 1 (P_PAM = 0.002), but this was not significant after FDR adjustment. There were no differences in the cancer stage, time since surgery, personal history of chronic diseases (hypertension, diabetes mellitus, hyperlipidemia, or cardiovascular disease), HER-2 status, or cancer therapy between the clusters.
Table 2

General characteristics of breast cancer survivors according to the clusters identified by the KM, PAM, SOM, and HAC clustering methods

| Variables | KM C1 (n = 273) | KM C2 (n = 146) | KM P-value | PAM C1 (n = 214) | PAM C2 (n = 205) | PAM P-value | SOM C1 (n = 238) | SOM C2 (n = 181) | SOM P-value | HAC C1 (n = 237) | HAC C2 (n = 182) | HAC P-value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Age at diagnosis (yrs) | 48.35 ± 8.16 | 49.34 ± 8.15 | 0.237 | 48.75 ± 8.16 | 48.64 ± 8.19 | 0.887 | 47.81 ± 7.65 | 49.87 ± 8.67 | 0.010¹⁾ | 47.93 ± 7.90 | 49.69 ± 8.42 | 0.029 |
| Age at enrollment (yrs) | 51.64 ± 8.12 | 52.51 ± 8.64 | 0.312 | 51.99 ± 8.27 | 51.90 ± 8.36 | 0.909 | 51.16 ± 7.74 | 52.98 ± 8.91 | 0.028 | 51.13 ± 7.89 | 53.01 ± 8.73 | 0.021 |
| BMI (kg/m²) | 22.83 ± 3.05 | 24.39 ± 3.24 | < 0.001¹⁾ | 22.77 ± 3.04 | 24.00 ± 3.25 | < 0.001¹⁾ | 22.73 ± 3.16 | 24.22 ± 3.07 | < 0.001¹⁾ | 23.00 ± 3.24 | 23.85 ± 3.09 | 0.007¹⁾ |
| < 23 | 153 (56.46) | 57 (39.31) | < 0.001¹⁾ | 121 (57.08) | 89 (43.63) | 0.002¹⁾ | 140 (59.32) | 70 (38.89) | < 0.001¹⁾ | 132 (56.17) | 78 (43.09) | 0.009¹⁾ |
| 23 to < 25 | 61 (22.51) | 32 (22.07) | | 49 (23.11) | 44 (21.57) | | 50 (21.19) | 43 (23.89) | | 52 (22.13) | 41 (22.65) | |
| ≥ 25 | 57 (21.03) | 56 (38.62) | | 42 (19.81) | 71 (34.80) | | 46 (19.49) | 67 (37.22) | | 51 (21.70) | 62 (34.25) | |
| Menopausal status | | | 0.068 | | | 0.578 | | | 0.141 | | | 0.153 |
| Premenopausal | 43 (16.93) | 13 (9.56) | | 31 (15.58) | 25 (13.09) | | 37 (16.89) | 19 (11.11) | | 37 (16.82) | 19 (11.18) | |
| Postmenopausal | 211 (83.07) | 123 (90.44) | | 168 (84.42) | 166 (86.91) | | 182 (83.11) | 152 (88.89) | | 183 (83.18) | 151 (88.82) | |
| Education level | | | 0.405 | | | 0.297 | | | 0.546 | | | 0.135 |
| ≤ Middle school | 54 (22.31) | 36 (27.27) | | 42 (22.58) | 48 (25.53) | | 46 (22.12) | 44 (26.51) | | 51 (24.29) | 39 (23.78) | |
| High school | 114 (47.11) | 63 (47.73) | | 84 (45.16) | 93 (49.47) | | 99 (47.60) | 78 (46.99) | | 91 (43.33) | 86 (52.44) | |
| ≥ College | 74 (30.58) | 33 (25.00) | | 60 (32.26) | 47 (25.00) | | 63 (30.29) | 44 (26.51) | | 68 (32.38) | 39 (23.78) | |
| Marital status | | | 0.296 | | | 0.443 | | | 0.441 | | | 0.217 |
| Married/cohabiting | 191 (79.58) | 108 (82.44) | | 146 (79.35) | 153 (81.82) | | 164 (79.61) | 135 (81.82) | | 164 (78.85) | 135 (82.82) | |
| Never married | 14 (5.83) | 3 (2.29) | | 11 (5.98) | 6 (3.21) | | 12 (5.83) | 5 (3.03) | | 13 (6.25) | 4 (2.45) | |
| Others | 35 (14.58) | 20 (15.27) | | 27 (14.67) | 28 (14.97) | | 30 (14.56) | 25 (15.15) | | 31 (14.90) | 24 (14.72) | |
| Physical activity (MET-h/week) | 34.46 ± 40.98 | 30.70 ± 36.21 | 0.378 | 34.29 ± 36.84 | 31.99 ± 41.78 | 0.573 | 33.38 ± 36.37 | 32.83 ± 42.93 | 0.894 | 34.39 ± 38.72 | 31.52 ± 40.22 | 0.484 |
| Alcohol status | | | 0.403 | | | 0.923 | | | 0.333 | | | 0.293 |
| None | 183 (75.93) | 106 (80.30) | | 145 (77.96) | 144 (77.01) | | 156 (75.36) | 133 (80.12) | | 158 (75.24) | 131 (80.37) | |
| Current | 58 (24.07) | 26 (19.70) | | 41 (22.04) | 43 (22.99) | | 51 (24.64) | 33 (19.88) | | 52 (24.76) | 32 (19.63) | |
| Smoking status | | | 0.658 | | | 0.076 | | | 0.466 | | | 0.681 |
| Never | 204 (89.87) | 114 (91.94) | | 163 (93.68) | 155 (87.57) | | 176 (89.34) | 142 (92.21) | | 181 (91.41) | 137 (89.54) | |
| Ever | 23 (10.13) | 10 (8.06) | | 11 (6.32) | 22 (12.43) | | 21 (10.66) | 12 (7.79) | | 17 (8.59) | 16 (10.46) | |
| Dietary supplement use | | | 0.118 | | | 0.003¹⁾ | | | 0.422 | | | 0.701 |
| No | 78 (32.77) | 54 (41.54) | | 52 (28.26) | 80 (43.48) | | 69 (33.82) | 63 (38.41) | | 72 (34.78) | 60 (37.27) | |
| Yes | 160 (67.23) | 76 (58.46) | | 132 (71.74) | 104 (56.52) | | 135 (66.18) | 101 (61.59) | | 135 (65.22) | 101 (62.73) | |
| Dietary assessment | | | 0.635 | | | 0.627 | | | 0.433 | | | 0.697 |
| 3DR | 112 (46.86) | 55 (43.65) | | 87 (47.28) | 80 (44.20) | | 98 (47.80) | 69 (43.12) | | 91 (44.61) | 76 (47.20) | |
| FFQ | 127 (53.14) | 71 (56.35) | | 97 (52.72) | 101 (55.80) | | 107 (52.20) | 91 (56.88) | | 113 (55.39) | 85 (52.80) | |
Values are presented as mean ± SD or number (%). P-values were obtained from a t-test for continuous variables and a χ² test for categorical data.
KM, k-means; PAM, partitioning around medoids; SOM, self-organizing maps; HAC, hierarchical agglomerative clustering; C1, cluster 1; C2, cluster 2; BMI, body mass index; MET, metabolic equivalent task; 3DR, 3-day dietary record; FFQ, food frequency questionnaire.
1)False discovery rate adjusted P-value < 0.05.
Table 3

Clinical characteristics of breast cancer survivors according to the clusters identified by the KM, PAM, SOM, and HAC clustering methods

Variables KM PAM SOM HAC
C1 (n = 273) C2 (n = 146) P-value C1 (n = 214) C2 (n = 205) P-value C1 (n = 238) C2 (n = 181) P-value C1 (n = 237) C2 (n = 182) P-value
AJCC stage 0.323 0.428 0.388 0.387
I 135 (49.45) 62 (42.47) 106 (49.53) 91 (44.39) 116 (48.74) 81 (44.75) 116 (48.95) 81 (44.51)
II 110 (40.29) 64 (43.84) 87 (40.65) 87 (42.44) 99 (41.60) 75 (41.44) 98 (41.35) 76 (41.76)
III 28 (10.26) 20 (13.70) 21 (9.81) 27 (13.17) 23 (9.66) 25 (13.81) 23 (9.70) 25 (13.74)
Time since surgery 0.251 0.985 0.533 0.747
< 2 yrs 121 (44.32) 62 (42.47) 94 (43.93) 89 (43.41) 105 (44.12) 78 (43.09) 105 (44.30) 78 (42.86)
2 to < 4 yrs 72 (26.37) 49 (33.56) 61 (28.50) 60 (29.27) 64 (26.89) 57 (31.49) 65 (27.43) 56 (30.77)
≥ 4 yrs 80 (29.30) 35 (23.97) 59 (27.57) 56 (27.32) 69 (28.99) 46 (25.41) 67 (28.27) 48 (26.37)
History of chronic diseases1) 0.243 0.909 0.163 0.070
No 217 (79.49) 108 (73.97) 165 (77.10) 160 (78.05) 191 (80.25) 134 (74.03) 192 (81.01) 133 (73.08)
Yes 56 (20.51) 38 (26.03) 49 (22.90) 45 (21.95) 47 (19.75) 47 (25.97) 45 (18.99) 49 (26.92)
HER-2 status 0.333 0.736 0.099 0.044
Equivocal 51 (18.68) 28 (19.18) 39 (18.22) 40 (19.51) 43 (18.07) 36 (19.89) 50 (21.10) 29 (15.93)
Negative 146 (53.48) 68 (46.58) 107 (50.00) 107 (52.20) 132 (55.46) 82 (45.30) 127 (53.59) 87 (47.80)
Positive 76 (27.84) 50 (34.25) 68 (31.78) 58 (28.29) 63 (26.47) 63 (34.81) 60 (25.32) 66 (36.26)
Hormone receptor status 0.101 0.002 0.518 0.134
ER−/PR− 63 (23.08) 23 (15.75) 57 (26.64) 29 (14.15) 52 (21.85) 34 (18.78) 42 (17.72) 44 (24.18)
ER+ or PR+ 210 (76.92) 123 (84.25) 157 (73.36) 176 (85.85) 186 (78.15) 147 (81.22) 195 (82.28) 138 (75.82)
Radiation therapy 0.696 0.400 0.304 0.086
No 98 (35.90) 56 (38.36) 74 (34.58) 80 (39.02) 93 (39.08) 61 (33.70) 96 (40.51) 58 (31.87)
Yes 175 (64.10) 90 (61.64) 140 (65.42) 125 (60.98) 145 (60.92) 120 (66.30) 141 (59.49) 124 (68.13)
Chemotherapy 0.063 0.245 0.124 0.175
No 70 (25.64) 25 (17.12) 54 (25.23) 41 (20.00) 61 (25.63) 34 (18.78) 60 (25.32) 35 (19.23)
Yes 203 (74.36) 121 (82.88) 160 (74.77) 164 (80.00) 177 (74.37) 147 (81.22) 177 (74.68) 147 (80.77)
Among ER+ (n = 328)
Current hormone therapy use 0.156 0.884 0.166 0.927
No 40 (19.23) 15 (12.50) 25 (16.13) 30 (17.34) 36 (19.57) 19 (13.19) 33 (17.19) 22 (16.18)
Yes 168 (80.77) 105 (87.50) 130 (83.87) 143 (82.66) 148 (80.43) 125 (86.81) 159 (82.81) 114 (83.82)
Types of hormone therapy 0.551 0.489 0.815 0.625
SERM only 104 (61.90) 64 (60.95) 79 (60.77) 89 (62.24) 91 (61.49) 77 (61.60) 101 (63.52) 67 (58.77)
AI only 51 (30.36) 36 (34.29) 40 (30.77) 47 (32.87) 46 (31.08) 41 (32.80) 47 (29.56) 40 (35.09)
Others 13 (7.74) 5 (4.76) 11 (8.46) 7 (4.90) 11 (7.43) 7 (5.60) 11 (6.92) 7 (6.14)
Types of hormone therapy ever use 0.582 0.375 0.282 0.453
SERM only 117 (57.07) 68 (57.14) 84 (54.90) 101 (59.06) 101 (55.80) 84 (58.74) 112 (58.95) 73 (54.48)
AI only 60 (29.27) 39 (32.77) 46 (30.07) 53 (30.99) 53 (29.28) 46 (32.17) 53 (27.89) 46 (34.33)
Others 28 (13.66) 12 (10.08) 23 (15.03) 17 (9.94) 27 (14.92) 13 (9.09) 25 (13.16) 15 (11.19)
Values are presented as mean ± SD or number (%). P-values were obtained from a χ2 test.
KM, k-means; PAM, partitioning around medoids; SOM, self-organizing maps; HAC, hierarchical agglomerative clustering; C1, cluster 1; C2, cluster 2; AJCC, American Joint Committee on Cancer; HER-2, human epidermal growth factor receptor 2; ER, estrogen receptor; PR, progesterone receptor; SERM, selective estrogen receptor modulator; AI, aromatase inhibitor.
1)Chronic disease included hypertension, diabetes mellitus, hyperlipidemia, and cardiovascular disease.
The food and nutrient intakes of the participants who completed the dietary assessment (n = 365) are shown in Tables 4 and 5. Daily thiamine intake was higher in cluster 2, as identified by KM and SOM (PKM = 0.012; PSOM = 0.012), but this difference was not significant after FDR adjustment. As identified by SOM, red meat intake among red meat consumers was higher in cluster 1 than in cluster 2 (PSOM = 0.035), but this difference also became non-significant after FDR adjustment. Energy intake and the other energy-adjusted food and nutrient intakes did not differ significantly between the clusters.
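The effect of the FDR adjustment on a nominally significant P-value can be illustrated with a minimal sketch. It assumes the Benjamini-Hochberg procedure, which the text does not name, and it treats the 23 KM P-values listed in Table 4 as one family of tests purely for illustration; the family of tests actually used for the adjustment is not stated here. The function name `benjamini_hochberg` is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(p_values)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0  # also caps adjusted values at 1
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Unadjusted KM P-values for the 23 intake variables in Table 4
# (assumed here, for illustration only, to form one family of tests).
km_p = [0.745, 0.816, 0.686, 0.760, 0.382, 0.616, 0.479, 0.890, 0.740,
        0.711, 0.481, 0.582, 0.707, 0.179, 0.293, 0.923, 0.110, 0.480,
        0.411, 0.012, 0.369, 0.080, 0.411]

km_q = benjamini_hochberg(km_p)
thiamin_q = km_q[km_p.index(0.012)]
print(f"thiamin: unadjusted P = 0.012, adjusted P = {thiamin_q:.3f}")
```

Under these assumptions the adjusted value for thiamin is well above 0.05, in line with the statement that the difference did not survive the adjustment.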
Table 4

Daily energy and nutrient intakes of breast cancer survivors according to the clusters identified by the KM, PAM, SOM, and HAC clustering methods (n = 365)

Variables KM PAM SOM HAC
C1 (n = 239) C2 (n = 126) P-value C1 (n = 184) C2 (n = 181) P-value C1 (n = 205) C2 (n = 160) P-value C1 (n = 204) C2 (n = 161) P-value
Energy (kcal/day) 1,779.35 ± 641.89 1,756.45 ± 629.62 0.745 1,759.00 ± 633.69 1,784.09 ± 641.68 0.707 1,800.63 ± 655.85 1,734.05 ± 611.78 0.322 1,794.18 ± 624.36 1,742.64 ± 653.29 0.443
Macronutrients (energy-adjusted)
Carbohydrates (g/day) 289.44 ± 33.79 290.35 ± 38.37 0.816 288.26 ± 34.29 291.27 ± 36.51 0.418 288.94 ± 34.89 290.80 ± 36.10 0.620 290.37 ± 35.59 288.98 ± 35.23 0.709
Protein (g/day) 67.01 ± 11.98 67.56 ± 13.22 0.686 67.81 ± 11.97 66.58 ± 12.84 0.348 66.71 ± 11.04 67.82 ± 13.97 0.411 66.47 ± 12.02 68.12 ± 12.86 0.206
Plant-based 31.02 ± 10.93 30.63 ± 12.41 0.760 31.25 ± 11.17 30.51 ± 11.74 0.536 30.86 ± 10.72 30.91 ± 12.36 0.970 30.74 ± 11.10 31.07 ± 11.91 0.782
Animal-based 36.05 ± 9.96 37.01 ± 9.91 0.382 36.63 ± 9.18 36.13 ± 10.67 0.634 35.90 ± 10.14 37.00 ± 9.66 0.297 35.79 ± 10.45 37.14 ± 9.21 0.198
Fat (g/day) 40.73 ± 11.41 40.07 ± 12.69 0.616 40.94 ± 11.49 40.05 ± 12.23 0.479 40.94 ± 11.95 39.93 ± 11.74 0.422 40.60 ± 12.03 40.37 ± 11.66 0.852
Plant-based 19.74 ± 9.27 18.99 ± 10.22 0.479 19.52 ± 9.30 19.43 ± 9.92 0.931 19.83 ± 9.73 19.02 ± 9.45 0.424 19.77 ± 9.09 19.11 ± 10.23 0.517
Animal-based 21.01 ± 8.46 21.14 ± 8.52 0.890 21.45 ± 8.47 20.65 ± 8.48 0.366 21.12 ± 8.53 20.97 ± 8.42 0.873 20.85 ± 8.61 21.31 ± 8.30 0.611
SFA (g/day) 8.95 ± 4.45 9.16 ± 6.26 0.740 8.90 ± 4.30 9.15 ± 5.88 0.638 8.82 ± 4.51 9.30 ± 5.85 0.391 8.80 ± 4.41 9.32 ± 5.94 0.352
MUFA (g/day) 10.80 ± 5.29 11.11 ± 8.65 0.711 10.67 ± 4.99 11.14 ± 7.97 0.497 10.72 ± 5.05 11.14 ± 8.25 0.577 10.62 ± 4.84 11.27 ± 8.38 0.377
PUFA (g/day) 9.31 ± 3.90 9.67 ± 4.92 0.481 9.22 ± 3.85 9.65 ± 4.66 0.336 9.35 ± 3.98 9.54 ± 4.63 0.680 9.47 ± 4.01 9.38 ± 4.59 0.845
Dietary fiber (g/day) 20.11 ± 9.81 20.72 ± 10.43 0.582 20.03 ± 9.66 20.62 ± 10.39 0.580 20.04 ± 9.63 20.68 ± 10.51 0.549 20.18 ± 10.02 20.50 ± 10.04 0.762
Micronutrients (energy-adjusted)
Calcium (mg/day) 539.60 ± 199.93 548.06 ± 211.46 0.707 541.47 ± 169.02 543.59 ± 234.29 0.921 533.09 ± 201.19 554.60 ± 206.95 0.318 542.16 ± 215.56 542.98 ± 188.35 0.969
Phosphorus (mg/day) 1,120.46 ± 209.50 1,153.22 ± 242.00 0.179 1,134.87 ± 205.41 1,128.62 ± 237.23 0.788 1,116.02 ± 202.31 1,151.95 ± 243.00 0.133 1,127.24 ± 216.69 1,137.51 ± 227.94 0.661
Iron (mg/day) 19.97 ± 33.03 17.67 ± 5.06 0.293 17.44 ± 4.62 20.94 ± 37.88 0.219 20.51 ± 35.59 17.47 ± 5.04 0.228 20.57 ± 35.70 17.41 ± 4.80 0.214
Sodium (mg/day) 2,869.41 ± 1,011.99 2,880.40 ± 1,079.01 0.923 2,862.41 ± 985.53 2,884.18 ± 1,083.97 0.841 2,845.19 ± 976.63 2,909.09 ± 1,105.56 0.559 2,891.51 ± 1,029.03 2,850.01 ± 1,043.34 0.704
Potassium (mg/day) 3,543.66 ± 986.98 3,741.14 ± 1,182.23 0.110 3,578.27 ± 1,033.08 3,645.95 ± 1,090.47 0.543 3,514.86 ± 968.62 3,736.07 ± 1,160.07 0.053 3,612.24 ± 1,059.06 3,611.30 ± 1,066.78 0.993
Vitamin A (μgRE/day) 880.67 ± 692.51 927.85 ± 555.70 0.480 892.06 ± 748.36 901.94 ± 529.28 0.884 853.45 ± 495.90 952.70 ± 800.43 0.170 877.66 ± 493.20 921.41 ± 803.78 0.545
Retinol (μg/day) 144.39 ± 472.11 118.63 ± 74.11 0.411 155.25 ± 536.45 115.41 ± 73.82 0.320 112.71 ± 81.66 164.70 ± 572.90 0.256 119.66 ± 84.33 155.56 ± 571.34 0.430
Thiamin (mg/day) 1.71 ± 0.47 1.84 ± 0.52 0.012 1.73 ± 0.48 1.77 ± 0.50 0.447 1.70 ± 0.47 1.83 ± 0.51 0.012 1.75 ± 0.51 1.76 ± 0.47 0.815
Riboflavin (mg/day) 1.32 ± 0.40 1.36 ± 0.43 0.369 1.33 ± 0.44 1.33 ± 0.38 0.883 1.30 ± 0.33 1.37 ± 0.49 0.103 1.33 ± 0.34 1.34 ± 0.48 0.813
Niacin (mg/day) 16.12 ± 3.56 16.95 ± 4.62 0.080 16.47 ± 4.07 16.35 ± 3.88 0.783 16.05 ± 3.31 16.87 ± 4.66 0.058 16.20 ± 3.39 16.68 ± 4.60 0.262
Vitamin C (mg/day) 183.08 ± 111.96 193.62 ± 124.44 0.411 181.85 ± 113.41 191.66 ± 119.39 0.421 183.48 ± 116.29 190.86 ± 116.68 0.549 191.22 ± 118.50 181.01 ± 113.69 0.406
Values are presented as mean ± SD or number (%). P-values were obtained from a t-test.
KM, k-means; PAM, partitioning around medoids; SOM, self-organizing maps; HAC, hierarchical agglomerative clustering; C1, cluster 1; C2, cluster 2; SFA, saturated fatty acid; MUFA, monounsaturated fatty acid; PUFA, polyunsaturated fatty acid.
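As a rough check on how the t-test P-values in Table 4 relate to the reported means and SDs, the following sketch recomputes the thiamin comparison for KM from the summary statistics alone. It assumes a pooled-variance (Student's) t-test, since the table note says only "t-test", and it uses a normal approximation for the P-value, which is reasonable only because the degrees of freedom are large. Because the inputs are rounded to 2 decimals, the result is expected to be close to, not identical to, the tabulated 0.012. Both function names are hypothetical.

```python
import math


def pooled_t_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Student's two-sample t statistic and degrees of freedom from summaries."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se, df


def two_sided_p_normal(t):
    """Two-sided P-value from the normal approximation (large df only)."""
    return math.erfc(abs(t) / math.sqrt(2))


# Thiamin (mg/day), KM clusters, values as printed in Table 4.
t_stat, df = pooled_t_from_summary(1.71, 0.47, 239, 1.84, 0.52, 126)
p_approx = two_sided_p_normal(t_stat)
print(f"t = {t_stat:.2f}, df = {df}, approximate P = {p_approx:.3f}")
```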
Table 5

Daily food intake of breast cancer survivors according to the clusters identified by the KM, PAM, SOM, and HAC clustering methods (n = 365)

Variables KM PAM SOM HAC
C1 (n = 239) C2 (n = 126) P-value C1 (n = 184) C2 (n = 181) P-value C1 (n = 205) C2 (n = 160) P-value C1 (n = 204) C2 (n = 161) P-value
Fruits and vegetables
Fruits intake (g/day) 341.60 ± 345.98 376.46 ± 529.80 0.505 332.34 ± 346.10 375.28 ± 480.66 0.329 356.97 ± 373.97 349.36 ± 470.11 0.867 367.20 ± 363.84 336.45 ± 479.04 0.500
Vegetables intake (g/day) 301.31 ± 227.79 340.64 ± 232.68 0.120 305.19 ± 218.51 324.75 ± 241.20 0.417 299.90 ± 230.46 334.09 ± 228.55 0.159 303.17 ± 219.62 329.74 ± 242.25 0.273
Raw 100.03 ± 105.43 123.66 ± 172.26 0.161 104.97 ± 114.13 111.46 ± 149.31 0.641 96.10 ± 100.60 123.67 ± 163.84 0.062 96.79 ± 103.14 122.63 ± 161.65 0.079
Salted 65.53 ± 65.53 68.08 ± 66.04 0.724 63.95 ± 64.73 68.91 ± 66.61 0.471 66.94 ± 67.98 65.73 ± 62.68 0.861 67.51 ± 68.66 65.02 ± 61.76 0.720
Fruits and vegetables intake (g/day) 642.92 ± 495.57 717.10 ± 704.26 0.294 637.54 ± 491.37 700.03 ± 651.33 0.302 656.87 ± 524.59 683.45 ± 637.90 0.670 670.37 ± 505.90 666.19 ± 656.44 0.947
Whole grains
Whole grain intake (g/day) 104.81 ± 88.08 115.39 ± 85.54 0.271 104.20 ± 82.96 112.80 ± 91.40 0.348 108.25 ± 89.68 108.74 ± 84.27 0.957 105.55 ± 89.05 112.16 ± 85.01 0.473
Whole grain-eating status 0.907 > 0.999 0.909 0.420
Non-consumer 29 (12.13) 14 (11.11) 22 (11.96) 21 (11.60) 25 (12.20) 18 (11.25) 27 (13.24) 16 (9.94)
Consumer 210 (87.87) 112 (88.89) 162 (88.04) 160 (88.40) 180 (87.80) 142 (88.75) 177 (86.76) 145 (90.06)
Whole grain intake among consumers (g/day) 119.29 ± 84.26 129.81 ± 79.69 0.278 118.36 ± 78.35 127.60 ± 86.93 0.317 123.28 ± 85.46 122.53 ± 79.42 0.935 121.65 ± 84.72 124.54 ± 80.49 0.756
Red and processed meat
Red meat intake (g/day) 57.90 ± 62.77 52.95 ± 59.34 0.466 57.46 ± 64.20 54.90 ± 58.93 0.692 61.23 ± 65.27 49.75 ± 56.02 0.072 55.42 ± 60.96 57.17 ± 62.51 0.787
Red meat-eating status 0.576 0.938 0.759 0.520
Non-consumer 39 (16.32) 17 (13.49) 29 (15.76) 27 (14.92) 33 (16.10) 23 (14.38) 34 (16.67) 22 (13.66)
Consumer 200 (83.68) 109 (86.51) 155 (84.24) 154 (85.08) 172 (83.90) 137 (85.62) 170 (83.33) 139 (86.34)
Red meat intake among consumers (g/day) 69.20 ± 62.66 61.21 ± 59.71 0.277 68.21 ± 64.49 64.53 ± 58.82 0.600 72.97 ± 64.96 58.10 ± 56.39 0.035 66.50 ± 61.01 66.22 ± 62.67 0.968
Processed meat intake (g/day) 1.00 ± 4.26 1.29 ± 6.53 0.651 0.85 ± 4.40 1.35 ± 5.82 0.354 1.07 ± 4.56 1.13 ± 5.84 0.915 1.22 ± 5.28 0.94 ± 5.00 0.604
Processed meat-eating status 0.812 0.318 0.974 0.183
Non-consumer 207 (86.61) 111 (88.10) 164 (89.13) 154 (85.08) 178 (86.83) 140 (87.50) 173 (84.80) 145 (90.06)
Consumer 32 (13.39) 15 (11.90) 20 (10.87) 27 (14.92) 27 (13.17) 20 (12.50) 31 (15.20) 16 (9.94)
Processed meat intake among consumers (g/day) 7.44 ± 9.49 10.82 ± 16.43 0.468 7.80 ± 11.37 9.05 ± 12.73 0.730 8.13 ± 10.17 9.04 ± 14.49 0.800 8.04 ± 11.48 9.45 ± 13.46 0.708
Red and processed meat intake (g/day) 58.90 ± 63.59 54.24 ± 59.89 0.497 58.31 ± 64.73 56.25 ± 59.87 0.753 62.30 ± 66.10 50.88 ± 56.60 0.077 56.64 ± 61.75 58.11 ± 63.16 0.823
Red and processed meat-eating status 0.723 0.935 0.959 0.491
Non-consumer 37 (15.48) 17 (13.49) 28 (15.22) 26 (14.36) 31 (15.12) 23 (14.38) 33 (16.18) 21 (13.04)
Consumer 202 (84.52) 109 (86.51) 156 (84.78) 155 (85.64) 174 (84.88) 137 (85.62) 171 (83.82) 140 (86.96)
Red and processed meat intake among consumers (g/day) 69.69 ± 63.50 62.70 ± 60.14 0.346 68.78 ± 64.98 65.69 ± 59.72 0.663 73.40 ± 65.82 59.42 ± 56.87 0.049 67.57 ± 61.72 66.83 ± 63.29 0.917
Values are presented as mean ± SD or number (%). P-values were obtained from a t-test for continuous variables and a χ2 test for categorical data.
KM, k-means; PAM, partitioning around medoids; SOM, self-organizing maps; HAC, hierarchical agglomerative clustering; C1, cluster 1; C2, cluster 2.
Table 6 shows the ACS guideline adherence scores of the participants according to the clusters. The total ACS score and the sub-scores for diet and physical activity did not differ between the clusters. For the healthy weight maintenance score, cluster 1 had a lower proportion of participants who received 2 points (BMI of 25 to < 30 kg/m2) and a higher proportion who received 4 points (BMI of 18.5 to < 23 kg/m2) than cluster 2 (PKM = 0.013; PPAM = 0.010; PSOM = 0.002; PHAC = 0.048). Only the SOM result remained significant after FDR adjustment.
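The χ2 comparison of the healthy weight maintenance score can be sketched from the SOM counts in Table 6. This is a plain Pearson χ2 test of independence without continuity correction, written with the standard library only; it is an illustration of the test named in the table note, not the software or settings used in the analysis. The function names are hypothetical.

```python
import math


def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


def chi2_sf(stat, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    x = stat / 2
    # Closed-form starting points, then the recurrence
    # Q(a + 1, x) = Q(a, x) + x**a * exp(-x) / gamma(a + 1).
    if df % 2 == 0:
        a, q = 1.0, math.exp(-x)
    else:
        a, q = 0.5, math.erfc(math.sqrt(x))
    while a < df / 2 - 1e-9:
        q += x ** a * math.exp(-x) / math.gamma(a + 1)
        a += 1
    return q


# SOM counts from Table 6 (rows: C1, C2; columns: 1, 2, 3, and 4 points).
som_weight = [[17, 35, 40, 113],
              [11, 50, 39, 59]]
stat, df = chi_square_independence(som_weight)
p_value = chi2_sf(stat, df)
print(f"chi-square = {stat:.2f}, df = {df}, P = {p_value:.4f}")
```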
Table 6

American Cancer Society scores of breast cancer survivors according to the clusters identified by the KM, PAM, SOM, and HAC clustering methods (n = 365)

Variables KM PAM SOM HAC
C1 (n = 239) C2 (n = 126) P-value C1 (n = 184) C2 (n = 181) P-value C1 (n = 205) C2 (n = 160) P-value C1 (n = 204) C2 (n = 161) P-value
ACS scores (points) 8.18 ± 1.98 7.94 ± 1.97 0.273 8.28 ± 2.00 7.91 ± 1.94 0.073 8.20 ± 2.00 7.95 ± 1.94 0.222 8.17 ± 1.94 7.99 ± 2.02 0.393
Diet 0.033 0.206 0.209 0.898
1 point (3–5 points for diet) 48 (20.08) 17 (13.49) 37 (20.11) 28 (15.47) 39 (19.02) 26 (16.25) 38 (18.63) 27 (16.77)
2 points (6–7 points) 75 (31.38) 34 (26.98) 56 (30.43) 53 (29.28) 67 (32.68) 42 (26.25) 61 (29.90) 48 (29.81)
3 points (8–9 points) 72 (30.13) 57 (45.24) 56 (30.43) 73 (40.33) 63 (30.73) 66 (41.25) 69 (33.82) 60 (37.27)
4 points (10–12 points) 44 (18.41) 18 (14.29) 35 (19.02) 27 (14.92) 36 (17.56) 26 (16.25) 36 (17.65) 26 (16.15)
Physical activity 0.360 0.407 0.684 0.907
1 point (< 10.5 MET-h/week) 62 (25.94) 30 (23.81) 42 (22.83) 50 (27.62) 49 (23.90) 43 (26.88) 49 (24.02) 43 (26.71)
2 points (10.6–22.5 MET-h/week) 58 (24.27) 33 (26.19) 47 (25.54) 44 (24.31) 54 (26.34) 37 (23.12) 53 (25.98) 38 (23.60)
3 points (23.0–46.6 MET-h/week) 54 (22.59) 37 (29.37) 43 (23.37) 48 (26.52) 48 (23.41) 43 (26.88) 50 (24.51) 41 (25.47)
4 points (≥ 46.8 MET-h/week) 65 (27.20) 26 (20.63) 52 (28.26) 39 (21.55) 54 (26.34) 37 (23.12) 52 (25.49) 39 (24.22)
Healthy weight maintenance 0.013 0.010 0.002¹⁾ 0.048
1 point (< 18.5 or ≥ 30 kg/m2) 17 (7.11) 11 (8.80) 14 (7.61) 14 (7.78) 17 (8.29) 11 (6.92) 18 (8.87) 10 (6.21)
2 points (25 to < 30 kg/m2) 45 (18.83) 40 (32.00) 30 (16.30) 55 (30.56) 35 (17.07) 50 (31.45) 38 (18.72) 47 (29.19)
3 points (23 to < 25 kg/m2) 51 (21.34) 28 (22.40) 41 (22.28) 38 (21.11) 40 (19.51) 39 (24.53) 41 (20.20) 38 (23.60)
4 points (18.5 to < 23 kg/m2) 126 (52.72) 46 (36.80) 99 (53.80) 73 (40.56) 113 (55.12) 59 (37.11) 106 (52.22) 66 (40.99)
Values are presented as mean ± SD or number (%). P-values were obtained from a χ2 test.
KM, k-means; PAM, partitioning around medoids; SOM, self-organizing maps; HAC, hierarchical agglomerative clustering; C1, cluster 1; C2, cluster 2; ACS, American Cancer Society; MET, metabolic equivalent task.
1)False discovery rate adjusted P-value < 0.05.
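The BMI bands used for the healthy weight maintenance sub-score in Table 6 can be written as a small lookup. The sketch below encodes only the cut-offs printed in the table; the function name `healthy_weight_points` is hypothetical.

```python
def healthy_weight_points(bmi):
    """Map BMI (kg/m^2) to the healthy-weight sub-score bands in Table 6."""
    if bmi < 18.5 or bmi >= 30:
        return 1  # underweight or BMI of 30 and above
    if bmi >= 25:
        return 2  # 25 to < 30
    if bmi >= 23:
        return 3  # 23 to < 25
    return 4      # 18.5 to < 23


for example_bmi in (17.9, 21.0, 24.2, 27.5, 31.0):
    print(example_bmi, healthy_weight_points(example_bmi))
```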

DISCUSSION

We applied 4 unsupervised clustering methods (KM, PAM, SOM, and HAC) and identified 2 clusters of breast cancer survivors based on plasma metabolites that included lipids, lipoproteins, fatty acids, and amino acids. Compared with cluster 1, participants in cluster 2 showed higher levels of XXL-VLDL-P, GlycA, MUFA%, and SFA%, and larger VLDL size, but lower levels of L-HDL-P, ApoA1, Omega-6%, and LA%, and smaller HDL size, as identified by all 4 clustering methods. Participants in cluster 1 had a lower BMI and a higher proportion of dietary supplement users compared with cluster 2. Intakes of food and nutrients and other lifestyle factors showed no significant differences between the clusters.
The cluster assignment by HAC and the metabolic characteristics of each HAC cluster differed from those of the other clustering methods. MUFA, VLDL-C, XS-VLDL-PL, and total-TG were higher in cluster 2 as identified by KM, PAM, and SOM, but this pattern was not observed for HAC. This may be due to the way the clustering is performed. HAC is the only one of the 4 clustering methods used in the current study that employs a hierarchical approach. Hierarchical algorithms cannot re-separate observations once they have been merged in a previous step, even if the resulting clusters are not optimal. If the data structure is not hierarchical, the HAC method may perform worse than partitioning clustering methods such as KM [36].
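One common way to quantify how far one method's cluster assignment departs from another's is the adjusted Rand index described by Hubert and Arabie [34]. The sketch below is a minimal standard-library implementation; the label vectors are hypothetical toy data, not the study assignments, and whether this index is the agreement measure reported in Supplementary Table 3 is an assumption.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same observations."""
    n = len(labels_a)
    # Pairs of observations placed together in both partitions.
    index = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions have one cluster
    return (index - expected) / (max_index - expected)


# Hypothetical toy assignments for 6 participants.
km_labels = [1, 1, 1, 2, 2, 2]
pam_labels = [2, 2, 2, 1, 1, 1]   # same partition, cluster names swapped
hac_labels = [1, 1, 2, 2, 2, 2]   # one participant merged into the other cluster

ari_km_pam = adjusted_rand_index(km_labels, pam_labels)
ari_km_hac = adjusted_rand_index(km_labels, hac_labels)
print(ari_km_pam, round(ari_km_hac, 3))
```

The index ignores how the clusters are numbered, so a value of 1 indicates identical partitions and values near 0 indicate agreement no better than chance.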
In the current study, BMI differed significantly between the metabolite-based clusters. Our results are consistent with the findings of studies that examined the association between BMI and lipid profiles. A cross-sectional study of U.S. women showed significantly higher concentrations of large VLDL particles and larger average VLDL size, but lower concentrations of large HDL particles and smaller average HDL size, in a group with a BMI of 30–45 kg/m2 than in a group with a BMI of 18.5–25 kg/m2 [37]. In a previous study of healthy Finnish twins, measures of obesity, including BMI, waist circumference, and body fat composition, were positively correlated with the apolipoprotein B to ApoA1 ratio, glycoprotein, and MUFA%, and inversely correlated with omega-6% and polyunsaturated fatty acid% [38]. Fasting insulin, the homeostatic model assessment of insulin resistance, and C-reactive protein showed similar correlations with the aforementioned metabolites.
The proportion of dietary supplement users was higher in cluster 1 than in cluster 2, but this difference was significant only with the PAM method. A meta-analysis of randomized clinical trials conducted in overweight or obese participants found that co-supplementation with vitamin D and calcium at lower doses (≤ 400 IU/day of vitamin D and ≤ 600 mg/day of calcium) reduced blood concentrations of total cholesterol and TG [39]. In our study, participants in cluster 1 had lower total cholesterol and TG levels as identified by PAM, which is consistent with previous findings.
Several studies have reported associations between human metabolites and dietary intake [40,41,42,43,44,45]. Previous studies found that thiamine intake was inversely associated with low high-density lipoprotein cholesterol (HDL-C; < 50 mg/dL for women and < 40 mg/dL for men) [46] and that red meat intake was associated with low HDL-C and high TG [47]. In our study, thiamine intake was higher in cluster 2 (by KM and SOM) and red meat intake among consumers was higher in cluster 1 (by SOM), where cluster 1 was characterized by higher HDL-C and lower total-TG. These differences were not significant after FDR adjustment and should therefore be interpreted with caution. We did not find any significant associations between dietary factors and the clusters based on plasma metabolites after multiple comparison correction. There are several possible explanations. First, there may have been measurement errors in the dietary assessments, as 2 different assessment methods (FFQ and 3DR) were used. Second, several metabolites related to the glycolysis pathway, including glucose, pyruvate, and lactate, were excluded because of their low measurement quality. Third, the sample size may not have been sufficient to obtain significant results. Fourth, subtle between-person differences in metabolites may not have been enough to distinguish clusters based on diet. Lastly, most previous studies investigating the associations between dietary intake and metabolic profiles were conducted in the general population. In contrast, our study focused on a specific population, female breast cancer survivors. Cancer survivors may be influenced by a complex interplay of factors beyond dietary intake, such as hormonal changes and lifestyle modifications following diagnosis. These factors may overshadow or modify the associations observed in the general population. As a result, further studies targeting cancer survivor populations are needed to validate these findings.
To the best of our knowledge, this is the first study to identify clusters of breast cancer survivors in the Korean population using multiple methods and to compare dietary and health-related factors among the clusters. Our study has several limitations. First, environmental and genetic factors that were not investigated in this study may be related to the metabolic characteristics of the clusters. Second, the plasma samples used in the study were from individuals in the non-fasting state. However, a previous study showed a high correlation between lipid levels from fasting and non-fasting individuals, and the association with CVD risk was consistent regardless of fasting status [48]. Third, although we used the 4 most commonly employed clustering methods, other model-based or density-based clustering methods, such as the Gaussian mixture model and density-based spatial clustering of applications with noise (DBSCAN), could also have been considered. Fourth, dietary intake was assessed using both the FFQ and the 3DR, which may have introduced variability in the data. Future studies could increase the sample size and use a single dietary assessment method to ensure consistency and improve the reliability of the findings. Finally, we included only metabolites for clustering; whether additional factors should also be included requires further study.
In conclusion, the unsupervised clustering methods allowed us to analyze multiple metabolites without any supervisory outcomes and to identify meaningful clusters of breast cancer survivors. BMI differed significantly between the 2 identified clusters. Further prospective studies are needed to comprehensively investigate the associations between metabolites, obesity, dietary factors, and breast cancer prognosis.

ACKNOWLEDGMENTS

We would like to thank the participants and the research team for their contributions.

Notes

Funding: This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2014R1A2A2A01007794, 2019R1F1A1061017 and 2021R1F1A1062476).

Conflict of Interest: The authors declare no potential conflicts of interest.

Author Contributions:

  • Conceptualization: Yie GE, Lee JE.

  • Data curation: Yie GE, Song S, Kim Z, Youn HJ, Cho J, Min JW, Kim YS, Lee JE.

  • Formal analysis: Yie GE.

  • Funding acquisition: Lee JE.

  • Methodology: Yie GE, Lee JE.

  • Writing - original draft: Yie GE, Lee JE.

  • Writing - review & editing: Yie GE, Kyeong W, Song S, Kim Z, Youn HJ, Cho J, Min JW, Kim YS, Lee JE.

References

1. Sung H, Ferlay J, Siegel RL, Laversanne M, Soerjomataram I, Jemal A, Bray F. Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin. 2021; 71:209–249. PMID: 33538338.
2. Park EH, Jung KW, Park NJ, Kang MJ, Yun EH, Kim HJ, Kim JE, Kong HJ, Im JS, Seo HG, et al. Cancer statistics in Korea: incidence, mortality, survival, and prevalence in 2021. Cancer Res Treat. 2024; 56:357–371. PMID: 38487832.
3. Kwan ML, Weltzien E, Kushi LH, Castillo A, Slattery ML, Caan BJ. Dietary patterns and breast cancer recurrence and survival among women with early-stage breast cancer. J Clin Oncol. 2009; 27:919–926. PMID: 19114692.
4. Chan DSM, Vieira AR, Aune D, Bandera EV, Greenwood DC, McTiernan A, Navarro Rosenblatt D, Thune I, Vieira R, Norat T. Body mass index and survival in women with breast cancer-systematic literature review and meta-analysis of 82 follow-up studies. Ann Oncol. 2014; 25:1901–1914. PMID: 24769692.
5. Lahart IM, Metsios GS, Nevill AM, Carmichael AR. Physical activity, risk of death and recurrence in breast cancer survivors: a systematic review and meta-analysis of epidemiological studies. Acta Oncol. 2015; 54:635–654. PMID: 25752971.
6. He J, Gu Y, Zhang S. Consumption of vegetables and fruits and breast cancer survival: a systematic review and meta-analysis. Sci Rep. 2017; 7:599. PMID: 28377568.
7. Jayedi A, Emadi A, Khan TA, Abdolshahi A, Shab-Bidar S. Dietary fiber and survival in women with breast cancer: a dose-response meta-analysis of prospective cohort studies. Nutr Cancer. 2021; 73:1570–1580. PMID: 32795218.
8. Gibney MJ, Walsh M, Brennan L, Roche HM, German B, van Ommen B. Metabolomics in human nutrition: opportunities and challenges. Am J Clin Nutr. 2005; 82:497–503. PMID: 16155259.
9. Dunn WB, Ellis DI. Metabolomics: current analytical platforms and methodologies. Trends Analyt Chem. 2005; 24:285–294.
10. Silva C, Perestrelo R, Silva P, Tomás H, Câmara JS. Breast cancer metabolomics: from analytical platforms to multivariate data analysis. a review. Metabolites. 2019; 9:102. PMID: 31121909.
11. McCartney A, Vignoli A, Biganzoli L, Love R, Tenori L, Luchinat C, Di Leo A. Metabolomics in breast cancer: a decade in review. Cancer Treat Rev. 2018; 67:88–96. PMID: 29775779.
12. Lécuyer L, Victor Bala A, Deschasaux M, Bouchemal N, Nawfal Triba M, Vasson MP, Rossary A, Demidem A, Galan P, Hercberg S, et al. NMR metabolomic signatures reveal predictive plasma metabolites associated with long-term risk of developing breast cancer. Int J Epidemiol. 2018; 47:484–494. PMID: 29365091.
13. Yang L, Wang Y, Cai H, Wang S, Shen Y, Ke C. Application of metabolomics in the diagnosis of breast cancer: a systematic review. J Cancer. 2020; 11:2540–2551. PMID: 32201524.
14. Asiago VM, Alvarado LZ, Shanaiah N, Gowda GA, Owusu-Sarfo K, Ballas RA, Raftery D. Early detection of recurrent breast cancer using metabolite profiling. Cancer Res. 2010; 70:8309–8318. PMID: 20959483.
15. Jobard E, Pontoizeau C, Blaise BJ, Bachelot T, Elena-Herrmann B, Trédan O. A serum nuclear magnetic resonance-based metabolomic signature of advanced metastatic human breast cancer. Cancer Lett. 2014; 343:33–41. PMID: 24041867.
16. Tenori L, Oakman C, Morris PG, Gralka E, Turner N, Cappadona S, Fornier M, Hudis C, Norton L, Luchinat C, et al. Serum metabolomic profiles evaluated after surgery may identify patients with oestrogen receptor negative early breast cancer at increased risk of disease recurrence. Results from a retrospective study. Mol Oncol. 2015; 9:128–139. PMID: 25151299.
17. Eicher T, Kinnebrew G, Patt A, Spencer K, Ying K, Ma Q, Machiraju R, Mathé AEA. Metabolomics and multi-omics integration: a survey of computational methods and resources. Metabolites. 2020; 10:202. PMID: 32429287.
18. Hartigan JA, Wong MA. Algorithm AS 136: a k-means clustering algorithm. J R Stat Soc Ser C Appl Stat. 1979; 28:100–108.
19. Schubert E, Rousseeuw PJ. Faster k-medoids clustering: improving the PAM, CLARA, and CLARANS algorithms. In: Similarity Search and Applications, SISAP 2019; 2019 Oct 2-4; Newark, NJ, USA. Cham: Springer;2019.
20. Kohonen T. The self-organizing map. Neurocomputing. 1998; 21:1–6.
21. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. New York (NY): Springer New York;2017. p. 520–527.
22. Giuliano AE, Connolly JL, Edge SB, Mittendorf EA, Rugo HS, Solin LJ, Weaver DL, Winchester DJ, Hortobagyi GN. Breast cancer-major changes in the American Joint Committee on Cancer eighth edition cancer staging manual. CA Cancer J Clin. 2017; 67:290–303. PMID: 28294295.
23. Shin WK, Song S, Hwang E, Moon HG, Noh DY, Lee JE. Development of a FFQ for breast cancer survivors in Korea. Br J Nutr. 2016; 116:1781–1786. PMID: 27842613.
24. Moon SE, Shin WK, Song S, Koh D, Ahn JS, Yoo Y, Kang M, Lee JE. Validity and reproducibility of a food frequency questionnaire for breast cancer survivors in Korea. Nutr Res Pract. 2022; 16:789–800. PMID: 36467770.
25. Soininen P, Kangas AJ, Würtz P, Suna T, Ala-Korpela M. Quantitative serum nuclear magnetic resonance metabolomics in cardiovascular epidemiology and genetics. Circ Cardiovasc Genet. 2015; 8:192–206. PMID: 25691689.
26. Nightingale Health Plc. Clinically validated biomarkers [Internet]. Helsinki: Nightingale Health Plc.;2024. cited 2024 December 30. Available from: https://nightingalehealth.com/uploads/documents/Nightingale-Blood-Analysis_List-of-Biomarkers.pdf
27. Kettunen J, Demirkan A, Würtz P, Draisma HH, Haller T, Rawal R, Vaarhorst A, Kangas AJ, Lyytikäinen LP, Pirinen M, et al. Genome-wide study for circulating metabolites identifies 62 loci and reveals novel systemic effects of LPA. Nat Commun. 2016; 7:11122. PMID: 27005778.
28. Holmes MV, Millwood IY, Kartsonaki C, Hill MR, Bennett DA, Boxall R, Guo Y, Xu X, Bian Z, Hu R, et al. Lipids, lipoproteins, and metabolites and risk of myocardial infarction and stroke. J Am Coll Cardiol. 2018; 71:620–632. PMID: 29420958.
29. Ainsworth BE, Haskell WL, Herrmann SD, Meckes N, Bassett DR Jr, Tudor-Locke C, Greer JL, Vezina J, Whitt-Glover MC, Leon AS. 2011 Compendium of physical activities: a second update of codes and MET values. Med Sci Sports Exerc. 2011; 43:1575–1581. PMID: 21681120.
30. Rock CL, Doyle C, Demark-Wahnefried W, Meyerhardt J, Courneya KS, Schwartz AL, Bandera EV, Hamilton KK, Grant B, McCullough M, et al. Nutrition and physical activity guidelines for cancer survivors. CA Cancer J Clin. 2012; 62:243–274. PMID: 22539238.
31. Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math. 1987; 20:53–65.
32. Van der Maaten L, Hinton G. Visualizing data using t-SNE. J Mach Learn Res. 2008; 9:2579–2605.
33. Breiman L. Random forests. Mach Learn. 2001; 45:5–32.
34. Hubert L, Arabie P. Comparing partitions. J Classif. 1985; 2:193–218.
35. Willett W, Stampfer MJ. Total energy intake: implications for epidemiologic analyses. Am J Epidemiol. 1986; 124:17–27. PMID: 3521261.
36. James G, Witten D, Hastie T, Tibshirani R. An Introduction to Statistical Learning. New York (NY): Springer;2013. p. 523.
37. Magkos F, Mohammed BS, Mittendorfer B. Effect of obesity on the plasma lipoprotein subclass profile in normoglycemic and normolipidemic men and women. Int J Obes. 2008; 32:1655–1664.
38. Bogl LH, Kaye SM, Rämö JT, Kangas AJ, Soininen P, Hakkarainen A, Lundbom J, Lundbom N, Ortega-Alonso A, Rissanen A, et al. Abdominal obesity and circulating metabolites: a twin study approach. Metabolism. 2016; 65:111–121. PMID: 26892522.
39. Kashkooli S, Choghakhori R, Hasanvand A, Abbasnezhad A. Effect of calcium and vitamin D co-supplementation on lipid profile of overweight/obese subjects: a systematic review and meta-analysis of the randomized clinical trials. Obes Med. 2019; 15:100124.
40. O’Sullivan A, Gibney MJ, Brennan L. Dietary intake patterns are reflected in metabolomic profiles: potential role in dietary assessment studies. Am J Clin Nutr. 2011; 93:314–321. PMID: 21177801.
41. Schmidt JA, Rinaldi S, Ferrari P, Carayol M, Achaintre D, Scalbert A, Cross AJ, Gunter MJ, Fensom GK, Appleby PN, et al. Metabolic profiles of male meat eaters, fish eaters, vegetarians, and vegans from the EPIC-Oxford cohort. Am J Clin Nutr. 2015; 102:1518–1526. PMID: 26511225.
42. Gibbons H, Carr E, McNulty BA, Nugent AP, Walton J, Flynn A, Gibney MJ, Brennan L. Metabolomic-based identification of clusters that reflect dietary patterns. Mol Nutr Food Res. 2017; 61:1601050.
43. Lindqvist HM, Rådjursöga M, Malmodin D, Winkvist A, Ellegård L. Serum metabolite profiles of habitual diet: evaluation by 1H-nuclear magnetic resonance analysis. Am J Clin Nutr. 2019; 110:53–62. PMID: 31127814.
44. Navarro SL, Tarkhan A, Shojaie A, Randolph TW, Gu H, Djukovic D, Osterbauer KJ, Hullar MA, Kratz M, Neuhouser ML, et al. Plasma metabolomics profiles suggest beneficial effects of a low-glycemic load dietary pattern on inflammation and energy metabolism. Am J Clin Nutr. 2019; 110:984–992. PMID: 31432072.
45. Walker ME, Song RJ, Xu X, Gerszten RE, Ngo D, Clish CB, Corlin L, Ma J, Xanthakis V, Jacques PF, et al. Proteomic and metabolomic correlates of healthy dietary patterns: the Framingham Heart Study. Nutrients. 2020; 12:1476. PMID: 32438708.
46. Wu Y, Li S, Wang W, Zhang D. Associations of dietary vitamin B1, vitamin B2, niacin, vitamin B6, vitamin B12 and folate equivalent intakes with metabolic syndrome. Int J Food Sci Nutr. 2020; 71:738–749. PMID: 31986943.
47. Azadbakht L, Esmaillzadeh A. Red meat intake is associated with metabolic syndrome and the plasma C-reactive protein concentration in women. J Nutr. 2009; 139:335–339. PMID: 19074209.
48. Tikkanen E, Kanerva N, Aittomaki V, Männistö S, Salomaa VV, Wurtz P. Fasting samples are not required for NMR metabolic profiling studies of cardiovascular disease risk: prospective data for 4,400 individuals profiled few weeks apart. Circulation. 2019; 140:A10212.

SUPPLEMENTARY MATERIALS

Supplementary Table 1

List of metabolites included in the cluster analysis
nrp-19-273-s001.xls

Supplementary Table 2

Strengths and limitations of the KM, PAM, SOM, and HAC clustering methods
nrp-19-273-s002.xls

Supplementary Table 3

Cluster agreements between the KM, PAM, SOM, and HAC clustering methods
nrp-19-273-s003.xls

Supplementary Table 4

The 7 metabolites that contributed most to cluster formation
nrp-19-273-s004.xls

Supplementary Fig. 1

Flow diagram of breast cancer survivors included in the study.
nrp-19-273-s005.ppt

Supplementary Fig. 2

A correlation plot of the 57 metabolites, excluding lipoprotein subclasses markers.
nrp-19-273-s006.ppt