© International Epidemiological Association 2002

Printed in Great Britain

International Journal of Epidemiology 2002;31:864–871

DIABETES

Insulin resistance syndrome revisited: application of self-organizing maps

Veli-Pekka Valkonen(a), Mikko Kolehmainen(b,c), Hanna-Maaria Lakka(a,d) and Jukka T Salonen(a,d,e)

Background

Most common chronic diseases have a multifaceted aetiological background. Because currently used statistical methods have severe limitations in describing complex non-linear processes, the authors evaluated the usefulness of a multivariate method that is able to describe non-linear phenomena, the self-organizing map (SOM).

Methods

The study subjects were the 1650 participants of the Kuopio Ischemic Heart Disease Risk Factor Study (KIHD). The SOM model was constructed using 25 continuous biochemical and physiological variables. The aim of the SOM algorithm, together with Sammon’s mapping, is to group the data into a reduced but representative format and to divide the study population into homogeneous subgroups.

Results

The study population consisted of four groups (clusters) according to the method used. Clusters C1 to C4 contained 637, 445, 275 and 121 men, respectively. Eight neurons (n = 172) were not included in the four main clusters. The mean values of the variables related to insulin resistance syndrome in the cluster identified on the SOM (C4) were 32.1 kg/m2 for body mass index (BMI), 1.01 for waist-to-hip ratio (WHR), 158.7 mmHg and 103.8 mmHg for systolic (SBP) and diastolic blood pressure (DBP), 2.8 mmol/l for triglycerides, 6.2 mmol/l for blood glucose and 22.4 mU/l for serum insulin. There was a statistically significant difference in the mean values of BMI, WHR, SBP, DBP, HDL, triglycerides and blood glucose between the cluster representing the insulin resistance syndrome and the normal cluster.

Conclusions

This study shows that the multidimensional structures of insulin resistance syndrome can be visualized and identified at a qualitative and quantitative level using the SOM algorithm.

Keywords

Self-organizing map, insulin resistance syndrome, statistical methods, visualization

Accepted

13 March 2002

(a) Research Institute of Public Health, (b) Department of Environmental Sciences, (c) AI Virtanen Institute, (d) Department of Public Health and General Practice, University of Kuopio, Kuopio, Finland. (e) Inner Savo Health Centre, Suonenjoki, Finland.

Correspondence: Prof. Jukka T Salonen, Research Institute of Public Health, University of Kuopio, PO Box 1627, FIN-70211 Kuopio, Finland. E-mail: [email protected]

Most common chronic diseases have a complex aetiological background consisting of several factors related to heredity, biology and behaviour. Cardiovascular disease (CVD) is a good example of a disease with a multifaceted background. Cardiovascular risk factors tend to cluster in individuals, and the existence of a cluster of metabolic risk factors, most frequently referred to as the ‘insulin resistance syndrome’, has gained much interest in recent years.1,2 The interrelations between these factors have been investigated in cross-sectional and prospective study settings.3 There are also a few studies in which factor analysis, a statistical method for studies including interrelating variables, has been used to evaluate the risk factor clustering in insulin resistance syndrome.4–12 However, even though factor analysis is a sophisticated multivariate method, it cannot take into account possible non-linearities in the data. Such non-linearities usually show up as clear discontinuities in the data.

The basic idea for this study stems from the limitations of conventional statistical methods in describing complex processes such as those in human biochemistry.13,14 In order to obtain a better view of the phenomenon to be modelled, methods of computational intelligence, such as neural networks, have been proposed and evaluated as an alternative.15 We used the self-organizing map (SOM) algorithm16 as a candidate and as the basis of practical evaluation. The main benefit of the SOM algorithm is its ability to visualize multidimensional data in a two-dimensional format, and to perform data reduction and abstraction by generating prototype vectors from measurement data. The SOM is especially suitable for exploratory data analysis (data mining) of large data sets.17

The aim of this study was to evaluate the usefulness of the SOM algorithm in describing and modelling multivariate epidemiological data. We used data derived from a large prospective population-based study in middle-aged men from eastern Finland, with special emphasis on factors related to insulin resistance syndrome and CVD.

Methods

Subjects

The study subjects were participants in the Kuopio Ischemic Heart Disease Risk Factor Study (KIHD), which is an ongoing population-based study designed to investigate risk factors for CVD, carotid atherosclerosis and related outcomes in men from eastern Finland.18 A total of 2682 randomly selected men (82.9% of those eligible), aged 42, 48, 54 or 60 years, participated in the baseline examinations between March 1984 and December 1989. Examinations consisted of a wide variety of biochemical, physiological, anthropometric, and psychosocial measures. In total, almost 5000 variables were collected for each participant as part of the KIHD project. The present study is based on data from the 1650 men who had complete information on the selected biochemical variables (Table 1).

Laboratory methods

The KIHD examination protocol and collection of samples have been described previously.18–20 The subjects gave blood specimens between 8:00 and 10:00 a.m. on Tuesday, Wednesday, or Thursday. They were instructed to fast and to abstain from smoking for 12 hours and to abstain from drinking alcohol for 3 days. After the subjects had rested in the supine position for 30 minutes, blood was drawn with vacuum tubes (Terumo, Tokyo, Japan). No tourniquet was used. Serum insulin was determined with a radioimmunoassay kit (Novo Nordisk, Bagsvaerd, Denmark). Blood glucose was measured using a glucose dehydrogenase method (Merck, Darmstadt, Germany) after precipitation of proteins by trichloroacetic acid. The measurement of low density lipoprotein (LDL), high density lipoprotein (HDL) and very low density lipoprotein (VLDL) cholesterol concentrations,9 serum total cholesterol and triglycerides has been described previously.

Table 1 Baseline characteristics of the subjects, Kuopio Ischaemic Heart Disease Risk Factor Study, 1984–1989

Characteristic                         Mean     SD      Range
Age (years)(a)                         52.9     5.6     42.0–61.3
Body mass index (kg/m2)                26.9     3.5     18.4–48.6
Waist-to-hip ratio (cm/cm)             0.95     0.06    0.71–1.73
Systolic blood pressure (mmHg)         132.7    16.4    88.7–213.3
Diastolic blood pressure (mmHg)        88.1     10.4    56.7–127.3
Blood glucose (mmol/l)                 4.8      1.2     3.2–18.2
Serum insulin (mU/l)                   11.5     7.1     1.0–77.0
Serum total cholesterol (mmol/l)(a)    5.9      1.05    2.6–10.1
Serum LDL(b) cholesterol (mmol/l)      4.0      1.0     0.8–8.5
Serum HDL(c) cholesterol (mmol/l)      1.3      0.3     0.5–3.1
Serum VLDL(d) cholesterol (mmol/l)     0.6      0.42    0.01–5.4
Serum triglycerides (mmol/l)           3.0      1.8     0.4–10.9
Serum albumin (g/l)                    42.2     3.6     23.0–60.0
Plasma fibrinogen (g/l)                3.0      0.6     1.3–6.7
Serum ferritin (µg/l)                  173.5    156.8   11–2270
Serum creatinine (µmol/l)              90.0     13.8    28–250
Serum γ-gt (U/l)                       30.6     38.0    5–737
MCV(e) (fl)                            92.6     4.5     74.0–110.0
Plasma vitamin A (ng/ml)               615.4    137.0   189.8–1417.8
Plasma ascorbic acid (mg/l)            8.2      4.1     0.3–24.2
Plasma α-tocopherol (µmol/l)           19.84    5.5     6.3–54.9
Serum selenium (µg/l)                  109.6    18.5    44–241
Serum zinc (mg/l)                      0.93     0.12    0.5–1.6
Serum magnesium (mg/l)                 19.6     1.5     9.2–25.5
Serum copper (mg/l)                    1.11     0.18    0.6–2.3
Urinary sodium (mmol/l)                200.6    62.9    29.0–602.1
Urinary potassium (mmol/l)             85.5     23.6    19.0–245.0

(a) Age and serum total cholesterol were not used as training variables. (b) Low-density lipoprotein. (c) High-density lipoprotein. (d) Very low density lipoprotein. (e) Mean cell volume of erythrocyte.

Self-organizing maps

One of the best-known unsupervised neural learning algorithms is the SOM.16 The goal of the SOM algorithm is to find vectors that represent the input data set as prototypes and at the same time realize a continuous mapping from the input space to a lattice. This lattice is usually a two-dimensional map that can be easily visualized. The weight vector (prototype vector) of each neuron has as many components as there are measured variables. The weight vectors of the SOM are first initialized to random values. For each training pattern the winning neuron (Best-Matching Unit, BMU) is first found by comparing the input (measured) vector with the weight vectors of the neurons using the Euclidean distance metric. This is based on calculating the squared differences of each variable and summing them to yield a single value that describes the distance. Then the weights of the winning neuron (BMU) and those of its neighbours in the lattice are moved towards the input vector according to the update rule, which states the direction and amount of the adjustments to be carried out. The learning rate factor of the update rule decreases monotonically towards the end of learning. The basic idea behind the SOM algorithm is that the weights of each neuron come to represent a number of original measurement vectors. The weights can then be used as a basis for further analysis. For a more thorough description of the SOM algorithm and its application to gene expression data see, for example, Törönen et al.21

A variation of the SOM called the tree-structured SOM (TS-SOM) was used in this work.22 The software implementation consists of several SOMs that are organized hierarchically in a pyramid-like fashion in several layers. The number of neurons at each level is four times that of the previous level, which restricts the number of neurons in each map (level) to 4, 16, 64, 256 and so forth. However, each level is an independent SOM and thus comparable to the maps produced by the ‘standard’ SOM algorithm with a similar neuron structure.
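The training loop described above can be sketched in a few lines of Python. This is a minimal illustration of a plain rectangular SOM, not the TS-SOM implementation used in the study; the function names, the Gaussian neighbourhood, the linearly shrinking neighbourhood radius and the parameter defaults are assumptions made for the example only.

```python
import math
import random


def squared_distance(a, b):
    """Sum of squared per-variable differences between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_matching_unit(weights, sample):
    """Index of the neuron whose weight vector is closest to the sample."""
    return min(range(len(weights)),
               key=lambda i: squared_distance(weights[i], sample))


def train_som(data, rows, cols, n_epochs=20, lr_start=0.5, seed=0):
    """Train a plain rectangular SOM; returns one weight vector per neuron."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Random initialization of the prototype (weight) vectors.
    weights = [[rng.random() for _ in range(dim)] for _ in range(rows * cols)]
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    radius_start = max(rows, cols) / 2.0
    total_steps = n_epochs * len(data)
    step = 0
    for _ in range(n_epochs):
        order = list(range(len(data)))
        rng.shuffle(order)
        for idx in order:
            sample = data[idx]
            frac = step / total_steps
            lr = lr_start * (1.0 - frac)  # learning rate decreases monotonically
            # Shrinking neighbourhood radius (an assumption of this sketch).
            radius = max(radius_start * (1.0 - frac), 0.5)
            bmu_row, bmu_col = grid[best_matching_unit(weights, sample)]
            for j, (r, c) in enumerate(grid):
                lattice_d2 = (r - bmu_row) ** 2 + (c - bmu_col) ** 2
                h = math.exp(-lattice_d2 / (2.0 * radius ** 2))
                # Move the BMU (h = 1) and its lattice neighbours (h < 1)
                # towards the input vector.
                for k in range(dim):
                    weights[j][k] += lr * h * (sample[k] - weights[j][k])
            step += 1
    return weights


# Toy data already scaled to [0, 1]: two small groups of measurement vectors.
data = [[0.10, 0.10], [0.15, 0.05], [0.90, 0.95], [0.85, 0.90]]
weights = train_som(data, rows=2, cols=2)
```

After training, each entry of `weights` is a prototype vector that can be inspected or passed on to a further step such as Sammon’s mapping.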

Sammon’s mapping

Sammon’s mapping is an iterative method based on a gradient search.23 The aim of the algorithm is to represent points of a higher-dimensional space in a two-dimensional target space. The algorithm optimizes the locations in the target space so that as much as possible of the original structure (distances) of the measurement vectors in the n-dimensional space is conserved. In particular, it is able to represent the relative distances of vectors in a measurement space, and is thus useful in determining the shape of clusters and the relative distances between them. However, the numerical calculation is more time consuming than the SOM algorithm, which is a problem with big data sets. Therefore, in the implementation used in this study (the Visual Data software), Sammon’s mapping is applied to the result of the SOM algorithm. By using the weight vectors of the SOM as the starting point, the iteration of the Sammon’s mapping algorithm is computed faster than for the original data. Note that this approach has already been studied earlier by one of us in another article.21

The Sammon’s mapping algorithm is initialized by constructing two data structures, the first of which consists of the weight vectors. Each weight vector is handled as a co-ordinate of N-dimensional space, where N is the number of variables. The second data structure is two-dimensional and is thus the target of the Sammon’s mapping. It consists of the same number of elements as the first structure, and there is a one-to-one correspondence between the elements of the structures. The target structure is initialized randomly. After the initialization, the first step [1] is to calculate the relative errors between the source and target structures. This is accomplished by calculating the pair-wise Euclidean distances among all the elements of both structures and then comparing the distances of each element pair between the two structures.

The second step [2] is to adjust the positions of the elements in the target space in order to reduce the mapping error. This adjustment is done using a steepest descent procedure, which is based on calculating the gradient for the direction in which the error decreases most quickly. The approach thus leads to an iterative algorithm in which steps [1] and [2] are repeated until the error is below a preset limit or the change of error between two rounds of the iteration is below a preset limit.
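Steps [1] and [2] can be sketched as follows. This is a minimal illustration under stated assumptions rather than the Visual Data implementation: the error is taken to be Sammon’s usual stress (squared distance mismatches weighted by the inverse source distance), plain gradient descent with a step-halving safeguard stands in for the steepest descent procedure, and all function names and parameter defaults are hypothetical.

```python
import math
import random


def pairwise_distances(points):
    """Euclidean distance between every pair of points (full matrix)."""
    return [[math.dist(p, q) for q in points] for p in points]


def sammon_stress(d_source, d_target):
    """Mapping error: squared mismatches weighted by 1/source distance."""
    n = len(d_source)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    scale = sum(d_source[i][j] for i, j in pairs)
    return sum((d_source[i][j] - d_target[i][j]) ** 2 / d_source[i][j]
               for i, j in pairs) / scale


def sammon_mapping(vectors, n_iter=300, step=1.0, tol=1e-10, seed=0):
    """Map N-dimensional vectors to 2-D; assumes all vectors are distinct."""
    rng = random.Random(seed)
    n = len(vectors)
    d_source = pairwise_distances(vectors)
    scale = sum(d_source[i][j] for i in range(n) for j in range(i + 1, n))
    target = [[rng.random(), rng.random()] for _ in range(n)]  # random start
    stress = sammon_stress(d_source, pairwise_distances(target))
    initial_stress = stress
    for _ in range(n_iter):
        # Step [1]: distances in the target space, compared with the source.
        d_target = pairwise_distances(target)
        # Step [2]: move every point against the gradient of the error.
        candidate = []
        for i in range(n):
            grad = [0.0, 0.0]
            for j in range(n):
                if i == j:
                    continue
                dt = max(d_target[i][j], 1e-12)  # guard against division by zero
                coef = (dt - d_source[i][j]) / (d_source[i][j] * dt)
                grad[0] += coef * (target[i][0] - target[j][0])
                grad[1] += coef * (target[i][1] - target[j][1])
            candidate.append([target[i][k] - step * (2.0 / scale) * grad[k]
                              for k in range(2)])
        new_stress = sammon_stress(d_source, pairwise_distances(candidate))
        if new_stress > stress:
            step *= 0.5  # overshoot: retry with a smaller step
            continue
        improvement = stress - new_stress
        target, stress = candidate, new_stress
        if stress < tol or improvement < tol:
            break  # error, or change of error, below the preset limit
    return target, initial_stress, stress


# Hypothetical prototype vectors: three close together and one far away.
prototypes = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [5.0, 5.0, 5.0]]
layout, start_error, final_error = sammon_mapping(prototypes)
```

Because a step is only accepted when it does not increase the error, `final_error` can never exceed `start_error`; how close it gets to zero depends on the data and on the number of iterations.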

Selection of the variables for the study

The first task before applying the SOM algorithm to measurement data is to select the appropriate variables from the available baseline measurements. Because the KIHD database includes about 5000 variables, it was not possible to input them all and some a priori selection had to be made. In this case, physiological variables were chosen. The result was to include the 25 variables (listed in Table 1) that maximized the number of study cases to be used. Some candidate variables were rejected because they had too many missing values, which would have led to a reduced number of measurement vectors and thus to less significant results (note that this implementation of the SOM algorithm cannot handle missing values). This is because the SOM algorithm can only give meaningful results when there are enough samples to be grouped and then inspected in each neuron and in the clusters to be formed from the neurons. In practice this means that there have to be at least hundreds of study subjects, preferably thousands. Limiting the study population by using the 25 variables did not lead to any selection bias, because the missing values were spread equally across the different variables.
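The trade-off between the number of variables and the number of complete cases can be illustrated with a short sketch. The article does not state a numerical rejection criterion, so the `max_missing_fraction` threshold, the function name and the toy records below are purely hypothetical.

```python
def select_complete_cases(records, candidates, max_missing_fraction=0.25):
    """Drop variables with too many missing values, then keep complete cases.

    `records` is a list of dicts in which a missing measurement is None.
    Returns the retained variable names and the records complete on them.
    """
    n = len(records)
    kept = [name for name in candidates
            if sum(r.get(name) is None for r in records) / n
            <= max_missing_fraction]
    complete = [r for r in records
                if all(r.get(name) is not None for name in kept)]
    return kept, complete


# Toy records with hypothetical variable names and values.
records = [
    {"bmi": 25.1, "insulin": 9.2, "rare_assay": None},
    {"bmi": 27.7, "insulin": 11.5, "rare_assay": 1.0},
    {"bmi": 27.5, "insulin": None, "rare_assay": None},
    {"bmi": 32.1, "insulin": 22.4, "rare_assay": None},
    {"bmi": 31.2, "insulin": 23.4, "rare_assay": 2.0},
]
kept, complete = select_complete_cases(records, ["bmi", "insulin", "rare_assay"])
```

In this toy example the variable with 60% missing values is rejected, which leaves four complete cases instead of the two that would remain if all three variables were required.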

Applying the SOM algorithm

The selected data set was first pre-processed by scaling each variable into the range [0…1]. This was done in order to prevent any single variable from dominating the self-organizing process. The neural network was then trained by applying the SOM algorithm to the pre-processed data. A map size of 8 × 8 (64 neurons) was used in the study. This was based on the general rule that the number of measurement vectors should be more than 5 to 10 times the number of neurons. Note also that the aim of the SOM algorithm is not to cluster the data into a minimal number of clusters but rather to group the data into a reduced but representative format that can be used to form the clusters at a later stage. After the training phase the original measurement data were sorted into the neurons according to the weight vectors, and statistics for each neuron were calculated.
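The pre-processing and the sorting of measurements into neurons can be sketched as below. The function names, the toy measurements and the two hand-picked prototype vectors are assumptions for illustration; in the study the prototypes come from the trained 8 × 8 map. With 1650 measurement vectors and 64 neurons there are roughly 25.8 vectors per neuron, which satisfies the 5 to 10 rule mentioned above.

```python
def min_max_scale(rows):
    """Scale each column (variable) linearly to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [[(x - lo) / (hi - lo) if hi > lo else 0.0
             for x, lo, hi in zip(row, lows, highs)]
            for row in rows]


def nearest_prototype(prototypes, sample):
    """Index of the prototype with the smallest squared Euclidean distance."""
    def sq_dist(p):
        return sum((a - b) ** 2 for a, b in zip(p, sample))
    return min(range(len(prototypes)), key=lambda i: sq_dist(prototypes[i]))


def sort_into_neurons(prototypes, scaled_rows):
    """Map neuron index -> indices of the rows assigned to that neuron."""
    members = {}
    for idx, row in enumerate(scaled_rows):
        members.setdefault(nearest_prototype(prototypes, row), []).append(idx)
    return members


def neuron_means(members, rows):
    """Per-neuron mean of each variable, computed on the unscaled values."""
    n_vars = len(rows[0])
    return {neuron: [sum(rows[i][k] for i in idx) / len(idx)
                     for k in range(n_vars)]
            for neuron, idx in members.items()}


# Toy measurements (two hypothetical variables) and two toy prototypes.
measurements = [[20.0, 4.0], [30.0, 6.0], [40.0, 12.0]]
prototypes = [[0.1, 0.1], [0.9, 0.9]]

scaled = min_max_scale(measurements)
members = sort_into_neurons(prototypes, scaled)
means = neuron_means(members, measurements)

# Rule of thumb from the text: measurement vectors per neuron for an 8 x 8 map.
vectors_per_neuron = 1650 / (8 * 8)
```

Computing the per-neuron statistics on the unscaled values keeps the results in the original measurement units, which is what makes them directly readable as clinical quantities.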

Visualizing the self-organizing map

The SOM algorithm gives a map (two-dimensional lattice) in which similar input patterns (i.e. data on study participants) are located in one neuron or in a group of neurons that are situated near each other. This grouping was highlighted by calculating the statistics for each neuron and by visualizing the distribution of the central variables of insulin resistance syndrome using bar graphs.

Applying the Sammon’s mapping algorithm

The result of the SOM algorithm (prototype vectors) was used as a starting point for generating the Sammon’s mapping. This was achieved by inputting the weight vectors to the algorithm and letting it optimize the locations of the neurons in two dimensions. Thus, the shapes and relative locations of the clusters in the study population could be visualized on a more realistic scale. Additionally, the neuron sizes (i.e. the radii of the circles representing neurons) were adjusted to represent the number of study participants in each neuron. The different subgroups of the study population could then be determined and the corresponding neurons grouped together. Finally, statistics for the groups were calculated.

Limitations of the method

The clusters generated in the study do not represent the only possible grouping, and others, even though similar, could be found by a user or by a clustering algorithm such as the well-known k-means algorithm. However, we wanted to show a way in which the user can participate in the exploration of the data in order to form hypotheses for further investigation by traditional statistical methods. Also, we have not included all the neurons in the clusters because eight of them did not seem to belong clearly to any of the clusters. This was confirmed by inspecting the data and statistics of the neurons one by one.

Another question often raised with neural network algorithms is that of overfitting, which means that the algorithm learns the examples presented to it too well and is not able to generalize to new data. This is usually handled by monitoring the development of the mapping error and the number of iterations used. In this study, these details are handled by the particular implementation of the SOM, namely the TS-SOM. A more troublesome problem in the approach used can be that the resulting SOM and Sammon’s mapping may differ slightly depending on the initialization of the SOM and on changes in the learning data or in the amount of it. Additionally, the non-linear behaviour of the Sammon’s mapping may amplify these differences. However, when the method is used as an explorative tool, these disadvantages are outweighed by the ease of use and the possibility of getting a general understanding of the data handled.

Yet another issue is the pre-processing method used before applying the SOM algorithm. We have used linear scaling, which maps each variable into the range 0...1 and thus manipulates the original measurements as little as possible. Another choice often used is variance scaling, which sets the mean of each variable to zero and the variance to 1, leading to better resistance to outlier values. However, different pre-processing leads to slightly different mappings, and we show here only one choice for the sake of simpler presentation.
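The difference between the two pre-processing choices can be made concrete with a small numerical sketch. The data are entirely hypothetical: one variable without outliers and one variable with a single extreme value. The helper `bulk_weight` (also hypothetical) measures how much of the distance budget the non-outlying values of the second variable retain relative to the clean variable after scaling.

```python
import statistics


def min_max(values):
    """Linear scaling of one variable into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def variance_scale(values):
    """Scaling of one variable to zero mean and unit variance."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def spread(values):
    """Distance between the smallest and largest value."""
    return max(values) - min(values)


# Hypothetical variables: evenly spread values, and the same kind of values
# with one extreme measurement appended as the last element.
clean = [float(v) for v in range(1, 101)]
with_outlier = [float(v) for v in range(1, 100)] + [1000.0]


def bulk_weight(scale):
    """Spread of the non-outlying values relative to the clean variable."""
    bulk = scale(with_outlier)[:-1]  # drop the outlier after scaling
    return spread(bulk) / spread(scale(clean))


weight_linear = bulk_weight(min_max)
weight_variance = bulk_weight(variance_scale)
```

In this toy case the non-outlying values keep a larger share of their spread under variance scaling than under linear scaling, so a single extreme measurement suppresses the variable less in the Euclidean distance. The size of the effect depends on the sample size and on how extreme the outlier is.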

Results

Baseline characteristics of the study subjects are shown in Table 1, including the variables used in the neural data analysis as training variables, and age and serum total cholesterol as descriptive variables. Pearson’s correlations for the variables related to insulin resistance syndrome are presented in Table 2. Body mass index (BMI) had a positive correlation with blood glucose, serum insulin and triglycerides, systolic blood pressure (SBP), diastolic blood pressure (DBP) and waist-to-hip ratio, and a negative correlation with serum HDL cholesterol.

The SOM, which was formed using the selected training variables, is presented in Figure 1 with mean values of six central variables as bar graphs (blood glucose, serum insulin, triglyceride and HDL cholesterol, systolic blood pressure, and waist-to-hip ratio). The grey background colour is used to show the distribution of BMI among the study population. The highest mean value of BMI was found in the white-coloured square, i.e. neuron (one of the neurons labelled C4). The different bars were scaled so that, for example in neuron C4b, the blood glucose (first bar) had its highest mean value (11.44 mmol/l) in this study cohort (n = 1650), while it can be seen to be near the lowest mean value (4.23 mmol/l) in most of the neurons.

The group labelled C4 was quite a heterogeneous subgroup: in one neuron there were men with high blood glucose (1st bar) and moderately high serum triglyceride values (3rd bar), but in another neuron high levels of BMI (white background colour), waist-to-hip ratio (6th bar) and serum insulin (2nd bar) were found. Finally, in one neuron also belonging to the group C4 there were men with high values of SBP (5th bar). All these features of the neurons belonging to the group C4 correspond to characteristic phenomena in the insulin resistance syndrome. The neurons labelled C1 clearly represented healthy men according to the parameters used. In the neurons belonging to the groups C2 and C3 the variables shown in the bars did not give enough information for more specific interpretations. However, there was a difference between the groups C2 and C3 in a few of the training variables: plasma ascorbic acid, serum selenium, urinary potassium and sodium excretion, and serum ferritin were higher in cluster C2 than in C3 (P for difference <0.05 for serum ferritin and <0.001 for the others). Furthermore, the SBP, DBP, serum LDL, serum γ-gt and plasma fibrinogen were higher in cluster C3 than in cluster C2 (P for difference <0.01 for all).

The corresponding Sammon’s mapping, which was obtained from the SOM (Figure 1), is illustrated in Figure 2. The relative distances of the neurons (circles) describe the overall change in the 25 variables modelled among the 1650 men. Thus, the subdivision of the study population can be seen to consist of four groups, marked C1–C4 in Figure 2. Additionally, one neuron (C4b) was marked because of its separate location and specific bar profile. The radius of each neuron describes the number of men in it. The characteristics of the clusters (C1–C4 and C4b) are shown in Table 3, confirming the trends found in Figure 1.
Table 2 Pearson’s correlations for insulin resistance syndrome related variables at baseline (n = 1650), Kuopio Ischemic Heart Disease Risk Factor Study, 1984–1989

Variable         BMI(a)   WHR(b)   SBP(c)   DBP(d)   HDL(e)   Triglyceride   Blood glucose
WHR              0.55
SBP              0.31     0.22
DBP              0.36     0.27     0.74
HDL              –0.22    –0.19    0.03     –0.02
Triglyceride     0.28     0.22     0.13     0.10     –0.31
Blood glucose    0.27     0.20     0.16     0.12     –0.11    0.22
Serum insulin    0.58     0.38     0.22     0.22     –0.22    0.33           0.36

(a) Body mass index. (b) Waist-to-hip circumference ratio. (c) Systolic blood pressure. (d) Diastolic blood pressure. (e) High density lipoprotein cholesterol.

Bold indicates P < 0.001.

Figure 1 Self-organizing map with bar graphs showing the distribution of the central variables: blood glucose (1st bar), serum insulin (2nd bar), triglycerides (3rd bar), high density lipoprotein cholesterol (4th bar), systolic blood pressure (5th bar) and waist-to-hip ratio (6th bar). The neurons are grouped together with labels (C1–C4) according to clustering detected in Sammon’s mapping (Figure 2). Additionally, the distribution of body mass index (BMI) is given as grey tone level (dark denotes low BMI, light high BMI)

Healthy subjects were found in cluster C1 (n = 637), whose members were normotensive and whose other variables were also in the normal range. The cluster C4 (n = 121) could be identified as the hypertensive group, while cluster C4b (n = 19) could be considered to consist of the men probably having the clearest insulin resistance syndrome. The identification of clusters C2 (n = 445) and C3 (n = 275) is not straightforward and needs further clarification using additional variables. Differences between healthy subjects (cluster C1) and hypertensive and compensatory hyperinsulinaemic subjects (cluster C4) were statistically significant (P < 0.001) for mean BMI, waist-to-hip ratio, DBP, serum HDL and triglyceride and blood glucose. The difference was also statistically significant for mean SBP (P < 0.05).

Figure 2 Clustering of the study population revealed by the Sammon’s mapping algorithm. Each circle corresponds to one neuron in the self-organizing map (Figure 1). Relative distances between neurons describe the distance in the original 25-dimensional space of the training variables. The diameter of each neuron corresponds to the number of subjects included in the neuron. The labels correspond to groups identified from the study cohort: C1 includes healthy study subjects, C4 and C4b correspond to characteristic phenomena in the insulin resistance syndrome, and C2 and C3 are intermediate groups

Conclusion

The aim of this study stems from the limitations of univariate methods in describing complex processes like human biochemistry. The most severe restrictions are due to their inability to take into account non-linear dynamics that divide the problem space into several functional regions. Each region has its own local dynamics, and therefore the effective variables are different for each of them. Considering this kind of problem with linear univariate methods gives a very restricted view of the phenomenon.13,14 To overcome these limitations, methods of computational intelligence have been proposed as an alternative.15 The SOM algorithm was selected as a candidate and used as the basis of practical evaluation. The main benefits of the SOM algorithm are its ability to visualize multidimensional data in a two-dimensional format, and to perform data abstraction by yielding prototype vectors from the natural cluster structure of the data.

Previously, Baxt has summarized the use of neural networks in clinical medicine.13 In addition, Tu has reported the advantages and disadvantages of neural networks compared to logistic regression.24 Their results suggest that better results can be obtained using neural networks than by using traditional statistical methods in medical applications, especially when the modelled phenomena are complex and non-linear. However, our results stress the importance of visualizing the data before modelling, and show that important knowledge can be gained initially at the qualitative level. To attain quantitative results, standard statistical methods should also be used. This can be achieved by using SOM analysis to form groupings in the study population. These groups can then be compared with appropriate statistical algorithms. The advantage of this combination of two methods is that the groups are formed more objectively, as they reflect the natural structure of the data. Thus, the SOM-based analysis could be seen as a hypothesis-forming tool, and the hypothesis can then be tested with more traditional statistical algorithms.

In the present study, we evaluated the usefulness of the SOM algorithm in describing and modelling multivariate epidemiological data from a prospective population-based study in middle-aged men, with special emphasis on factors related to the insulin resistance syndrome. Insulin resistance or compensatory hyperinsulinaemia has been associated with a number of CVD risk factors in cross-sectional studies, including low HDL cholesterol, hypertriglyceridaemia, impaired glucose tolerance, hypertension and obesity or central fat distribution. This cluster of risk factors has been most frequently referred to as the insulin resistance syndrome or the metabolic syndrome.1,2 More recently, a number of other metabolic disturbances have been suggested as components of the syndrome, e.g. hyperuricaemia, increased plasminogen activator inhibitor 1, micro-albuminuria and increased levels of small dense LDL particles.
Table 3 The characteristics of the five subgroups detected in the Kuopio Ischaemic Heart Disease Risk Factor Study, 1984–1989. (See Figure 2)

Variable                 C1 (n = 637)     C2 (n = 445)    C3 (n = 275)    C4 (n = 121)     C4b (n = 19)
BMI(a) (kg/m2)           25.1 (2.5)(b)    27.7 (3.0)      27.5 (2.8)      32.1 (4.3)       31.2 (3.9)
WHR(c) (cm/cm)           0.92 (0.0)       0.96 (0.0)      0.98 (0.0)      1.01 (0.0)       1.01 (0.0)
SBP(d) (mmHg)            121.8 (11.0)     138.2 (11.9)    141.3 (13.1)    158.7 (17.4)     148.0 (19.4)
DBP(e) (mmHg)            80.5 (7.1)       92.6 (7.2)      94.1 (7.6)      103.8 (9.9)      92.3 (7.4)
HDL(f) (mg/dl)           52.6 (11.6)      49.1 (11.6)     49.1 (11.6)     46.0 (7.7)       42.6 (7.7)
Triglyceride (mg/dl)     131.1 (63.6)     166.7 (76.4)    162.9 (63.6)    248.2 (152.7)    378.0 (254.5)
Blood glucose (mmol/l)   4.6 (0.8)        4.7 (0.7)       4.8 (0.8)       6.2 (2.9)        11.4 (3.4)
Serum insulin (mU/l)     9.2 (4.1)        11.5 (5.5)      11.3 (4.2)      22.4 (14.5)      23.4 (12.8)

(a) Body mass index. (b) Means and standard deviations (in parentheses). (c) Waist-to-hip circumference ratio. (d) Systolic blood pressure. (e) Diastolic blood pressure. (f) High density lipoprotein cholesterol.

The concept of insulin resistance syndrome has been widely accepted,2 but it has also been challenged,25,26 and despite the associations among these factors, the existence of a distinct syndrome caused by insulin resistance remains controversial. There are studies in which factor analysis, a statistical method for studies including interrelating variables, has been used to evaluate the risk factor clustering in the insulin resistance syndrome.4–12 Factor analysis can be used as a tool to reduce interrelating variables into independent uncorrelated composite variables or factors, and to find possible underlying pathophysiological mechanisms for the syndrome. These studies have revealed more than one composite factor, suggesting that a single-factor model cannot explain the clustering of the components of the syndrome. Several studies have shown that hyperinsulinaemia/insulin resistance, dyslipidaemia and obesity form a close entity.5,7,8,12

In our results of the SOM and Sammon’s mapping analysis, a cluster (C4) with typical features of the insulin resistance syndrome could be identified, even though this cluster was heterogeneous with regard to the selected six variables. Men in the cluster C4 had hyperinsulinaemia, hyperglycaemia, dyslipidaemia, obesity, and hypertension. In particular, indicators of obesity and glucose metabolism as well as dyslipidaemia were elevated in the cluster C4b, a subgroup of C4. However, cluster C4b consists of only one neuron with 19 subjects, and therefore this finding must be interpreted with caution. This is because the division of cases between neuron C4b and its neighbours is to a certain extent arbitrary, which is a limitation of the SOM method. Therefore, clusters C1–C4 and the interpretation of the division of the cohort into these subgroups have a firmer basis and should be regarded as the main result of this study.

Hypertension has been the most controversial component of the insulin resistance syndrome.25,27 In some studies which have evaluated the syndrome using factor analysis, hypertension has not been found to be a characteristic of the central syndrome5 or it has constituted a separate factor.7,8 In a recent study, hypertension was linked to a metabolic entity (hyperinsulinaemia/insulin resistance, dyslipidaemia, and obesity) through shared correlation with hyperinsulinaemia/insulin resistance.12

Interestingly, in our SOM analysis men with the highest blood pressure values were located in a different neuron from men having the highest values of blood glucose, serum insulin and triglycerides, BMI and waist-to-hip ratio. Correspondingly, men in the neuron with the highest SBP within cluster C4 were neither dyslipidaemic nor hyperglycaemic. The present and previous results therefore imply that hypertension may be a part of, or related to, the insulin resistance syndrome, but they also indicate that more than one physiological process mediates the phenomenon of this risk factor clustering.

The evaluation of this SOM-based analysis shows that the complex multidimensional structures of the present population-based epidemiological data can be effectively visualized and identified at a qualitative level. Our results confirm that the SOM method can be used to describe complex multivariate phenomena. The method can bring new insight into problems which are still under discussion, or even raise new questions for further research.

Acknowledgements

This study was supported by a grant from the Yrjö Jahnsson Foundation. We would like to thank the research group on Engineering and Computational Intelligence at the University of Jyväskylä, Finland, for allowing us to use their Neural Data Analysis (NDA) package (http://erin.mit.jyu.fi), Visipoint Oy for use of their Visual Data software (http://www.visipoint.fi) for interactive analysis and Oy Jurilab LTD (http://www.jurilab.com) for other support.

KEY MESSAGES

• A neural network model for the participants of the Kuopio Ischemic Heart Disease Risk Factor Study (KIHD) was constructed using 25 continuous variables.

• This study shows that the multidimensional structures of insulin resistance syndrome can be visualized and identified at qualitative and quantitative level using SOM and Sammon’s mapping algorithms.

• The results also indicate that more than one physiological process mediates this phenomenon.

References

1 Reaven GM. Banting lecture 1988: Role of insulin resistance in human disease. Diabetes 1988;37:1595–607.

2 Haffner SM. Epidemiology of hypertension and insulin resistance syndrome. J Hypertens Suppl 1997;15:S25–30.

3 Salonen JT, Lakka TA, Lakka H-M, Valkonen V-P, Everson SA, Kaplan GA. Hyperinsulinemia is associated with the incidence of hypertension and dyslipidemia in middle-aged men. Diabetes 1998;47:270–75.

4 Edwards KL, Austin MA, Newman B, Mayer E, Krauss RM, Selby JV. Multivariate analysis of the insulin resistance syndrome in women. Arterioscler Thromb 1994;14:1940–45.

5 Meigs JB, D’Agostino RB Sr, Wilson PW, Cupples LA, Nathan DM, Singer DE. Risk variable clustering in the insulin resistance syndrome. The Framingham Offspring Study. Diabetes 1997;46:1594–600.

6 Donahue RP, Bean JA, Donahue RD, Goldberg RB, Prineas RJ. Does insulin resistance unite the separate components of the insulin resistance syndrome? Evidence from the Miami Community Health Study. Arterioscler Thromb Vasc Biol 1997;17:2413–17.

7 Edwards KL, Burchfield CM, Sharp DS et al. Factors of the insulin resistance syndrome in non-diabetic and diabetic elderly Japanese-American men. Am J Epidemiol 1998;147:441–47.

8 Gray RS, Fabsitz RR, Cowan LD, Lee ET, Howard BV, Savage PJ. Risk factor clustering in the insulin resistance syndrome. The Strong Heart Study. Am J Epidemiol 1998;148:869–78.

9 Leyva F, Godsland IF, Worthington M, Walton C, Stevenson JC. Factors of the metabolic syndrome: baseline interrelationships in the first follow-up cohort of the HDDRISC Study (HDDRISC-1). Heart Disease and Diabetes Risk Indicators in a Screened Cohort. Arterioscler Thromb Vasc Biol 1998;18:208–14.

10 Kekäläinen P, Sarlund H, Pyörälä K, Laakso M. Hyperinsulinemia cluster predicts the development of type 2 diabetes independently of family history of diabetes. Diabetes Care 1999;22:86–92.

11 Lempiäinen P, Mykkänen L, Pyörälä K, Laakso M, Kuusisto J. Insulin resistance syndrome predicts coronary heart disease events in elderly nondiabetic men. Circulation 1999;100(2):123–28.

12 Chen W, Srinivasan SR, Elkasabany A, Berenson GS. Cardiovascular risk factors clustering features of insulin resistance syndrome (Syndrome X) in a biracial (Black-White) population of children, adolescents, and young adults: the Bogalusa Heart Study. Am J Epidemiol 1999;150:667–74.

13 Baxt WG. Application of artificial neural networks to clinical medicine. Lancet 1995;346:1135–38.

14 Ioannidis JP, McQueen PG, Goedert JJ, Kaslow RA. Use of neural networks to model complex immunogenetic associations of disease: human leukocyte antigen impact on the progression of human immunodeficiency virus infection. Am J Epidemiol 1998;147:464–71.

15 Cross SS, Harrison RF, Kennedy RL. Introduction to neural networks. Lancet 1995;346:1075–79.

16 Kohonen T. Self-Organizing Maps. 2nd Edn. Berlin, Heidelberg: Springer-Verlag, Germany, 1997.

17 Oja E. Unsupervised neural learning. In: Bulsari AB (ed.). Neural Networks for Chemical Engineers. Amsterdam: Elsevier Science BV, 1995, pp. 21–32.

18 Salonen JT. Is there a continuing need for longitudinal epidemiologic research? The Kuopio Ischaemic Heart Disease Risk Factor Study. Ann Clin Res 1988;20:46–50.

19 Salonen JT, Nyyssönen K, Korpela H, Tuomilehto J, Seppänen R, Salonen R. High stored iron levels are associated with excess risk of myocardial infarction in eastern Finnish men. Circulation 1992;86:803–11.

20 Tuomainen T-P, Nyyssönen K, Salonen R et al. Body iron stores are associated with serum insulin and blood glucose concentrations: population study in 1013 eastern Finnish men. Diabetes Care 1997;20:426–28.

21 Törönen P, Kolehmainen M, Wong G, Castrén E. Analysis of gene expression data using self-organizing maps. Federation of European Biochemical Societies Letters 1999;451:142–46.

22 Koikkalainen P. Progress with the tree-structured self-organizing map, ECAI’94. In: Cohn A (ed.). Proceedings of the 11th European Conference on Artificial Intelligence. Chichester: Wiley & Sons, 1994, pp. 211–15.

23 Sammon Jr JW. A nonlinear mapping for data structure analysis. Institute of Electrical and Electronic Engineers Transactions on Computers 1969;C-18:401–09.

24 Tu JV. Advantages and disadvantages of using artificial neural networks versus logistic regression for predicting medical outcomes. J Clin Epidemiol 1996;49:1225–31.

25 Jarrett RJ. In defence of insulin: a critique of syndrome X. Lancet 1992;340:469–71.

26 Neel JV, Julius S, Weder A, Yamada M, Kardia SL, Haviland MB. Syndrome X: is it for real? Genet Epidemiol 1998;15:19–32.

27 Haffner SM. Insulin and blood pressure: fact or fantasy? J Clin Endocrinol Metab 1993;76:541–43.