Statistical Methods By Sp Gupta.epub

Llanque Mazurek

Aug 19, 2024, 12:07:00 AM

to barquawesol

Dietary pattern analysis is a promising approach to understanding the complex relationship between diet and health. While many statistical methods exist, the literature predominantly focuses on classical methods such as dietary quality scores, principal component analysis, factor analysis, clustering analysis, and reduced rank regression. Several emerging methods have rarely, if ever, been adequately reviewed or discussed.

This paper presents a landscape review of the existing statistical methods used to derive dietary patterns, especially the finite mixture model, treelet transform, data mining, least absolute shrinkage and selection operator and compositional data analysis, in terms of their underlying concepts, advantages and disadvantages, and available software and packages for implementation.

In the past few decades, statistical methods have emerged that make full use of dietary information collected across populations to create dietary patterns [2, 4, 8]. In nutritional epidemiology studies, regardless of the statistical method used for dietary pattern analysis, the goal is to explore the relationship between dietary patterns and health outcomes [2, 3]. From this perspective, evaluating a method depends not only on whether the dietary patterns derived by the method comprehensively reflect the dietary preferences but also on whether these patterns can predict diseases more accurately and promote health.

The majority of published reviews divide the statistical methods for dietary pattern analysis into three categories that are widely used in nutritional epidemiology: investigator-driven, data-driven, and hybrid methods [2, 3, 8,9,10]. In addition, several emerging methods have been applied to dietary pattern analysis but have rarely, if ever, been adequately reviewed. To present these methods more clearly, we classify the emerging methods within the existing categories and add one new category.

This paper provides an updated landscape review of these methods based on the underlying concepts, strengths, limitations, and commonly used software packages, paying particular attention to emerging methods. The remaining content is organized as follows: (1) investigator-driven methods, containing various dietary scores and dietary indexes; (2) data-driven methods, comprising principal component analysis (PCA), factor analysis, traditional cluster analysis (TCA), the finite mixture model (FMM), and treelet transform (TT); (3) hybrid methods, consisting of reduced rank regression (RRR), data mining (DM), and the least absolute shrinkage and selection operator (LASSO); (4) compositional data analysis, including compositional principal component coordinates, balance coordinates, and principal balances. To conclude, we compare and evaluate these methods, identify the remaining methodological issues, and provide suggestions for future research.

Recent studies on the relationship between dietary quality scores and health indicate that scores such as the Healthy Eating Index (HEI), Alternative Healthy Eating Index (AHEI) [15], Alternative Mediterranean Diet [35], and Dietary Approaches to Stop Hypertension (DASH) diet scores [27] are inversely associated with all-cause mortality and with the risk of death from cardiovascular disease and cancer [36,37,38,39,40]. The last three dietary patterns were also recommended as easy and practical dietary plans for the public in the 2015 Dietary Guidelines for Americans [41]. Additionally, plant-based diets are receiving increasing attention because of their benefits to human health and environmental sustainability. Three plant-based diet indexes have been established in recent years: the total Plant-based Diet Index (PDI), Healthy Plant-based Diet Index (hPDI), and Unhealthy Plant-based Diet Index (uPDI) [42, 43]. Unlike other dietary quality scores, these plant-based dietary indexes focus on the quality of plant foods in the diet; all animal foods, including those known to promote health, are negatively scored when calculating the indexes [44, 45]. Research has found that the higher the hPDI score, the lower the risk of coronary heart disease, type 2 diabetes, and all-cause mortality, whereas the uPDI shows the opposite trend [44,45,46,47].

In each scoring system, every component is scored individually and the component scores are summed into a total score, but the range of the total score varies greatly between dietary quality scores. The total score can also be dichotomized, although this is less common [48, 49]. No research has established which scoring system is preferable in specific situations [12]. When applying dietary quality scores, it is important to consider the research purpose and to recognize that there is not necessarily a single diet plan to follow to achieve a healthy dietary pattern [9, 41].
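The summing and dichotomizing steps described above can be sketched in a few lines of Python. This is only an illustration: the component names, the point values, and the choice of the sample median as the cutoff are all assumptions made for the example and do not come from any particular published score.

```python
from statistics import median

def total_score(component_scores):
    """Sum the individually scored components into one total score."""
    return sum(component_scores.values())

def dichotomize(totals, cutoff=None):
    """Return 1 for totals at or above the cutoff, else 0.

    The sample median is used when no cutoff is given (an assumption
    for this sketch; real studies define their own threshold).
    """
    if cutoff is None:
        cutoff = median(totals)
    return [1 if t >= cutoff else 0 for t in totals]

# Hypothetical participants with hypothetical component scores.
participants = [
    {"fruit": 5, "vegetables": 4, "whole_grains": 2, "sodium": 3},
    {"fruit": 1, "vegetables": 2, "whole_grains": 0, "sodium": 1},
    {"fruit": 3, "vegetables": 5, "whole_grains": 4, "sodium": 5},
]

totals = [total_score(p) for p in participants]
high_adherence = dichotomize(totals)
```

Keeping the cutoff as a parameter makes it easy to swap the median split for a guideline-based threshold, which is one reason the dichotomized form is harder to compare across studies.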

The dietary guidelines and recommendations used to construct dietary quality scores are primarily based on scientific evidence from health and disease prevention studies. These scores can be used to describe overall dietary characteristics and repeat or compare results across populations. Many dietary quality scores have significant associations with disease and mortality outcomes. The total score is easy to understand and use, and the summing process is simpler than in other statistical methods for dietary pattern analysis.

The construction of the scores, the definition of dietary diversity, and interpretation of the guidelines are generally determined subjectively by the researchers [2]. Additionally, dietary scores cannot describe overall dietary patterns because they focus only on selected aspects of diet and do not consider the correlation of different dietary components [2, 13]. Finally, since a diet is usually multidimensional, the comprehensive dietary scores do not provide specific information on multiple foods, often leading to an unclear interpretation of intermediate scores. Individuals with a middle-range score likely have different nutritional compositions and dietary patterns [2, 9].

In nutritional epidemiological studies, data-driven methods derive dietary intake patterns from population data through dimensionality reduction techniques. Rather than relying on predefined dietary guidelines, these methods use existing data collected from food frequency questionnaires, 24-h recall questionnaires, or dietary records to obtain dietary patterns [2, 3, 50].

When determining the number of principal components or factors to retain, three selection criteria are typically used: 1) retaining factors with an eigenvalue greater than one, 2) the scree plot, and 3) the percentage of variance explained [8]. The correlation coefficients between a principal component and the food groups are called factor loadings, and they reflect the importance of the food groups. The greater the absolute value of a factor loading, the stronger the correlation between the corresponding food group and the principal component or factor. Therefore, the principal components or factors are named primarily after the food groups retained by the selection criteria applied to the factor loadings. Owing to the similarity between PCA and exploratory factor analysis (EFA) [10], only PCA is shown in Fig. 1.
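As a rough sketch of how the three retention criteria can be read off a set of eigenvalues, the Python function below reports the number of components with an eigenvalue above one, the successive drops that a scree plot would display, and the number of components needed to reach a target share of variance. The eigenvalues and the 75% target are hypothetical values chosen for illustration; in practice the eigenvalues would come from the correlation matrix of the food-group intakes.

```python
def retention_summary(eigenvalues, variance_target=0.75):
    """Summarize common component-retention criteria for PCA/EFA.

    eigenvalues: eigenvalues of the correlation matrix (any order).
    variance_target: assumed threshold for cumulative explained variance.
    """
    eig = sorted(eigenvalues, reverse=True)
    total = sum(eig)

    # Criterion 1: eigenvalue greater than one.
    kaiser = sum(1 for v in eig if v > 1.0)

    # Criterion 2: drops between successive eigenvalues, i.e. what a
    # scree plot shows; the "elbow" is still judged by the analyst.
    drops = [eig[i] - eig[i + 1] for i in range(len(eig) - 1)]

    # Criterion 3: cumulative proportion of variance explained.
    cumulative = []
    running = 0.0
    for v in eig:
        running += v
        cumulative.append(running / total)
    n_for_target = next(
        i + 1 for i, c in enumerate(cumulative) if c >= variance_target
    )

    return {
        "kaiser": kaiser,
        "drops": drops,
        "cumulative": cumulative,
        "n_for_target": n_for_target,
    }

# Hypothetical eigenvalues for six food groups (they sum to 6).
summary = retention_summary([2.4, 1.3, 0.9, 0.7, 0.4, 0.3])
```

With these made-up values the eigenvalue rule keeps two components while the 75% variance target asks for three, which illustrates why the criteria are usually considered together rather than applied mechanically.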

Unlike EFA, confirmatory factor analysis (CFA) is seldom used in nutritional epidemiology [52]. However, CFA can impose statistical tests on the factor structure and factor loadings of food groups and determine the number of factors and the food groups contributing significantly to those factors [2, 8]. In the past, CFA was applied as a second step to verify the goodness of fit and reproducibility of the factor structure of dietary patterns after PCA or EFA in the first step [9, 53, 54]. However, it remains uncertain whether the results are better than those obtained with EFA alone [54]. Therefore, several studies have used CFA as a one-step approach to replace PCA or EFA [52, 55]. The advantage of CFA is that a latent variable model can be specified and tested, and additional a priori knowledge can also be incorporated into the model [55].

There is no single method for identifying the number of clusters or an appropriate clustering algorithm [68, 69]. One approach is to combine several methods: building on factor analysis, hierarchical clustering is used to identify an appropriate k value and reasonable initial cluster centers, which minimizes the influence of subjective judgment on the clustering results [68, 70]. The other approach is the optimal clustering method, in which several different k values are tried and quantitative indicators for these k values are compared to select the optimal value of k [8, 71]. The selection of the clustering algorithm mainly depends on the stability and reproducibility of the clusters, which are often evaluated by the split-half cross-validation method or a classifier [64, 72]. The most appropriate clustering algorithm is the one with the highest reproducibility and stability.
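The second approach, trying several k values and comparing a quantitative indicator, might look like the following minimal Python sketch. The mean silhouette width is used as the indicator and a deterministic farthest-point start replaces the hierarchical-clustering initialization; both are assumptions for this example, as are the toy two-dimensional data points, and none of them is prescribed by the reviewed studies.

```python
import math

def farthest_point_init(points, k):
    """Deterministic start: each new center is the point farthest from the chosen ones."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    return centers

def kmeans(points, k, n_iter=100):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    centers = farthest_point_init(points, k)
    labels = None
    for _ in range(n_iter):
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster empties
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels

def mean_silhouette(points, labels):
    """Average silhouette width; singleton clusters score 0 by convention."""
    scores = []
    for i, p in enumerate(points):
        same = [
            math.dist(p, q)
            for j, q in enumerate(points)
            if labels[j] == labels[i] and j != i
        ]
        if not same:
            scores.append(0.0)
            continue
        a = sum(same) / len(same)
        b = min(
            sum(math.dist(p, q) for q, lab in zip(points, labels) if lab == other)
            / labels.count(other)
            for other in set(labels)
            if other != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Toy data: three well-separated groups of hypothetical intake profiles.
points = [
    (0, 0), (1, 0), (0, 1), (1, 1),
    (10, 0), (11, 0), (10, 1), (11, 1),
    (5, 10), (6, 10), (5, 11), (6, 11),
]

scores = {k: mean_silhouette(points, kmeans(points, k)) for k in (2, 3, 4)}
best_k = max(scores, key=scores.get)
```

On real dietary data the indicator rarely separates candidate k values this cleanly, so the chosen k would still need the stability and reproducibility checks mentioned above.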

Distinct subgroups of individuals can be identified according to their dietary characteristics, and each individual belongs to exactly one dietary pattern group. Thus, the relationship between dietary pattern subgroups and health outcomes or other characteristics can be examined, and the subgroup at nutritional risk can also be identified. The results are also highly intuitive, and a dendrogram can be drawn to show the clustering process and results visually.

There are, however, a few drawbacks. First, each individual is assigned to a cluster with a probability of 1 or 0, without considering the uncertainty of individual classification [73]. Second, the researcher is required to make several subjective decisions, such as the selection of the food groupings, the clustering algorithm used to determine the similarity of individuals, the initial values, and the number of clusters. Although some relatively objective methods for selecting clustering algorithms and the number of clusters exist, the reproducibility of results cannot ensure their validity [64]. Third, there is no convenient method for comparing different clustering criteria [74]. Finally, the use of a control group and the unequal sample sizes of different clusters will limit the power of the statistical analysis [75].