Quiz: why does the factor structure of depression scales change over time?

We published a paper in Psychological Assessment a few weeks ago, and I would like to take the time to explain what these results imply. You can find the full text here, the analytic code (R & Mplus) including the output of all models here (scroll down to the paper), and while I am not allowed to share the data we re-analyzed, I wrote some pointers on how to apply for the datasets here.

In contrast to other blog posts, this will be a quiz: in the paper, we find a very consistent pattern of violations of temporal measurement invariance (I will explain in a second what that means) across different datasets, but we don't really have a good idea what causes this pattern of observations.

In contrast to other quizzes, however, there is no prize because … we don’t know what the true answer is as of yet ;).

So what did we do in the paper?

We examined two crucial psychometric assumptions that underlie nearly all contemporary depression research. We find strong evidence that these assumptions generally do not hold, which bears on the validity of depression research as a whole.

What are these psychometric assumptions? In depression research, various symptoms are routinely assessed via rating scales and added up to construct sum-scores. These scores are used as a proxy for depression severity in cross-sectional research, and differences in sum-scores over time are taken to reflect changes in an underlying depression construct. For example, a sum-score of symptoms that supposedly reflects "depression severity" is often correlated with stress, gender, or biomarkers to find out how these variables relate to depression; this is only valid if a sum-score of symptoms is actually a reasonable proxy for depression severity. In longitudinal research, if a sum-score decreases from 20 points to 15 points in a population, we conclude that depression improved somewhat. This is only valid if the 20 points and the 15 points reflect the same thing (if the 20 points reflected intelligence and the 15 points neuroticism, the difference of 5 points over time would be meaningless).
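To make the sum-score problem concrete, here is a minimal sketch (with made-up symptom names and severity ratings, not data from the paper) showing that two patients who share no symptom at all can end up with the identical sum-score:

```python
# Two hypothetical patients, each rated 0-3 on four symptoms.
patient_a = {"sad mood": 3, "insomnia": 0, "fatigue": 0, "concentration": 3}
patient_b = {"sad mood": 0, "insomnia": 3, "fatigue": 3, "concentration": 0}

score_a = sum(patient_a.values())
score_b = sum(patient_b.values())

# Identical sum-scores (6 and 6), yet no symptom overlaps.
print(score_a, score_b)
```

If the sum-score reflected one underlying construct, such radically different profiles mapping onto the same number would be unproblematic; if it does not, the number hides clinically important information.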

To allow for such interpretations, rating scales must (a) measure a single construct, and (b) measure that construct in the same way across time. These requirements are referred to as unidimensionality and measurement invariance. In the study, we investigated these two requirements in 2 large prospective studies (combined n = 3,509) in which overall depression levels decrease, examining 4 common depression rating scales (1 self-report, 3 clinician-report) with different time intervals between assessments (between 6 weeks and 2 years).

A consistent pattern of results emerged. For all instruments, neither unidimensionality nor measurement invariance appeared remotely tenable. At least 3 factors were required to describe each scale (this means that the sum-score does not reflect 1 underlying construct, but at least 3 and sometimes up to 6), and the factor structure changed over time. Typically, the structure became less multifactorial as depression severity decreased, without, however, reaching unidimensionality. The decrease in the sum-scores was accompanied by an increase in the variances of the sum-scores, and by increases in internal consistency.

You can see the results in the graph below. The four sections represent the four rating scales; the lines represent the first (red) and second (green) measurement point of the longitudinal datasets; PA (blue) stands for parallel analysis, which tells us how many factors a scale has at a given timepoint; and the x-axis represents the number of factors to be extracted. Whenever the red or green data line lies above the blue PA line, another factor should be extracted. You can read up on all the ESEM modeling in the paper itself, but the gist is this: to be unidimensional, a scale must have only 1 factor, and as you can see, all scales require the extraction of at least 3 factors. To be measurement invariant, the factor solution has to be stable across time; the graphs make it highly evident that it is not, because otherwise the lines for the 2 measurement points per scale would roughly overlap (do me the favor and click on the image, I can't embed vector graphics here). For a scale to be unidimensional and measurement invariant, the red and green lines should be very similar, lie above the blue line for the first factor, and then drop below the blue line for the second and all subsequent factors.
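Parallel analysis itself is easy to sketch. The paper's analyses were run in R and Mplus; the following is my own minimal Python/NumPy reimplementation of Horn's parallel analysis, not the authors' code: eigenvalues of the observed correlation matrix count as factors only while they exceed the corresponding eigenvalues obtained from random data of the same dimensions.

```python
import numpy as np

def parallel_analysis(data, n_sims=200, quantile=0.95, seed=0):
    """Horn's parallel analysis: count how many leading eigenvalues of
    the observed correlation matrix exceed the chosen quantile of
    eigenvalues from random normal data of the same shape."""
    rng = np.random.default_rng(seed)
    n, p = data.shape
    obs = np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]
    sims = np.empty((n_sims, p))
    for i in range(n_sims):
        noise = rng.standard_normal((n, p))
        sims[i] = np.sort(np.linalg.eigvalsh(np.corrcoef(noise, rowvar=False)))[::-1]
    threshold = np.quantile(sims, quantile, axis=0)
    above = obs > threshold
    # Number of leading eigenvalues above the random-data threshold.
    return int(above.argmin()) if not above.all() else p

# Demo: simulate 8 items driven by 2 independent latent factors.
rng = np.random.default_rng(1)
n = 1000
factors = rng.standard_normal((n, 2))
loadings = np.zeros((2, 8))
loadings[0, :4] = 0.7   # items 1-4 load on factor 1
loadings[1, 4:] = 0.7   # items 5-8 load on factor 2
items = factors @ loadings + rng.standard_normal((n, 8))
print(parallel_analysis(items))  # recovers 2 factors
```

In the figures from the paper, the same logic applies pointwise: the red and green lines are the observed eigenvalues at the two measurement points, and the blue PA line is the random-data threshold.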


These findings challenge the common interpretation of sum-scores and their changes as reflecting 1 underlying construct. In other words, summing up symptoms to a total score, and statistically correlating this total score with other variables such as risk factors or biomarkers, is very questionable if the score itself is not unidimensional and does not reflect 1 underlying construct (depression). Obviously, if you have worked with depression symptoms before, you know that they differ greatly from each other, and the idea that they are interchangeable indicators of 1 condition (depression) is very problematic. But now we have empirical evidence for this, consistent with many other papers that have reported similar results. What is special about this paper is that we examined these properties across a whole range of scales, and tested the robustness of the results along many dimensions (datasets, clinician- vs. patient-rated scales, and timeframes).

But what is the reason for these violations of temporal invariance? In the paper, we discuss a number of possibilities, all of which we exclude as sufficient explanations. Among these are response shift bias, regression towards the mean, selection bias, floor and ceiling effects of items, and the possibility that item responses over time were influenced by medication.

As we say in the paper:

Overall, these possibilities unlikely fully explain the causes of the pronounced and consistent shifts of the factorial space observed in this report, although they may each contribute somewhat. In other words, while we have provided a thorough description of the crime scene, we have no good idea who the main suspect may be.

The violations of common measurement requirements are sufficiently severe to suggest alternative interpretations of depression sum-scores as formative instead of reflective measures. A reflective sum-score is one that indicates an underlying disorder, the same way that a number of measles symptoms tells us that someone has measles: the symptoms inform us about a problem because the problem caused the symptoms. A formative sum-score, on the other hand, is nothing but an index: a sum of problems. These problems are not meant to reflect or indicate an underlying problem. Still, we can learn something from such a sum-score: the more problems people have, the worse they are probably doing in their lives.

» Fried, E. I., van Borkulo, C. D., Epskamp, S., Schoevers, R. A., Tuerlinckx, F., & Borsboom, D. (2016). Measuring Depression over Time … or not? Lack of Unidimensionality and Longitudinal Measurement Invariance in Four Common Rating Scales of Depression. Psychological Assessment. Advance Online Publication. (PDF) (URL)

Paper on comparing networks of two groups of patients with MDD

Our paper on comparing networks of two groups of patients with Major Depressive Disorder was published in JAMA Psychiatry (PDF).

In this paper, we investigated the association between the baseline network structure of depression symptoms and the course of depression. We compared the baseline network structure of persisters (defined as patients with MDD at baseline and depressive symptomatology at 2-year follow-up) and remitters (patients with MDD at baseline without depressive symptomatology at 2-year follow-up). To compare network structures, we used the first statistical test that directly compares the connectivity of two networks (the Network Comparison Test; NCT). While both groups had similar symptomatology at baseline, persisters had a more densely connected network than remitters. Specific symptom associations thus seem to be an important determinant of the persistence of depression.
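The core idea of the NCT is a permutation test: pool the two groups, repeatedly reshuffle group membership, and ask how often the recomputed connectivity difference is at least as large as the observed one. Here is my own simplified Python sketch of that idea (using summed absolute partial correlations as the connectivity measure; the actual NCT is an R package with more refined statistics):

```python
import numpy as np

def global_strength(data):
    """Summed absolute partial correlations: a simple proxy for
    overall network connectivity."""
    prec = np.linalg.inv(np.corrcoef(data, rowvar=False))
    d = np.sqrt(np.diag(prec))
    pcor = -prec / np.outer(d, d)       # partial correlation matrix
    np.fill_diagonal(pcor, 0.0)
    return np.abs(pcor).sum() / 2       # each edge counted once

def nct_pvalue(group_a, group_b, n_perm=500, seed=0):
    """Permutation p-value for the difference in global strength."""
    rng = np.random.default_rng(seed)
    observed = abs(global_strength(group_a) - global_strength(group_b))
    pooled = np.vstack([group_a, group_b])
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        diff = abs(global_strength(pooled[idx[:n_a]]) -
                   global_strength(pooled[idx[n_a:]]))
        hits += diff >= observed
    return (hits + 1) / (n_perm + 1)

# Demo: two groups drawn from the same distribution should not differ.
rng = np.random.default_rng(2)
a = rng.standard_normal((200, 5))
b = rng.standard_normal((200, 5))
p = nct_pvalue(a, b)
```

In the paper's terms, a small p-value would indicate that persisters and remitters differ in connectivity beyond what reshuffling the two groups could produce.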

A Dutch newspaper (NRC Handelsblad, November 21st, 2015) published a piece about this paper (Link).

Paper on network model of attitudes

Our paper on the Causal Attitude Network (CAN) model was published in Psychological Review (PDF).

In the paper, we introduce the CAN model, which conceptualizes attitudes as networks consisting of interacting evaluative reactions, such as beliefs (e.g., judging a presidential candidate as competent and charismatic), feelings (e.g., feeling pride and hope about the candidate), and behaviors (e.g., voting for the candidate). Interactions arise through direct causal connections between the evaluative reactions (e.g., feeling hopeful about the candidate because one judges her as competent and charismatic). The CAN model assumes that causal connections between evaluative reactions serve to heighten the consistency of the attitude, and we argue that the Ising model's axiom of energy expenditure reduction represents a formalized account of consistency pressure. Because individuals strive not only for consistency but also for accuracy, network representations of attitudes have to deal with the tradeoff between consistency and accuracy. This tradeoff is likely to lead to a small-world structure, and we show that attitude networks indeed exhibit such a structure. We also discuss the CAN model's implications for attitude change and stability. Furthermore, we show that the connectivity of attitude networks provides a formalized and parsimonious account of the dynamical differences between strong and weak attitudes.
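The energy principle that the CAN model borrows from the Ising model can be illustrated in a few lines. In this toy example of my own (not from the paper), three positively connected evaluative reactions are most "consistent", i.e. lowest in energy, when they are all aligned:

```python
import numpy as np

def ising_energy(states, weights, thresholds):
    """Ising energy of a configuration of +1/-1 evaluative reactions.
    Lower energy corresponds to a more internally consistent attitude."""
    return -0.5 * states @ weights @ states - thresholds @ states

# Three evaluative reactions (e.g., a belief, a feeling, a behavior)
# that reinforce each other via positive connections.
weights = np.array([[0.0, 1.0, 1.0],
                    [1.0, 0.0, 1.0],
                    [1.0, 1.0, 0.0]])
thresholds = np.zeros(3)

aligned = np.array([1, 1, 1])    # fully consistent attitude
mixed = np.array([1, 1, -1])     # one inconsistent reaction

print(ising_energy(aligned, weights, thresholds))  # -3.0
print(ising_energy(mixed, weights, thresholds))    # 1.0
```

The consistency pressure the model posits is just the tendency of the system to move toward low-energy configurations like the aligned one.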

Dalege, J., Borsboom, D., van Harreveld, F., van den Berg, H., Conner, M., & van der Maas, H. L. J. (2015). Toward a formalized account of attitudes: The Causal Attitude Network (CAN) model. Psychological Review. Advance online publication.

New network study: what are ‘good’ depression symptoms?

Our new paper “What are ‘good’ depression symptoms? Comparing the centrality of DSM and non-DSM symptoms of depression in a network analysis” was published in the Journal of Affective Disorders (PDF).

In the paper we develop a novel theoretical and empirical framework to answer the question of what a "good" symptom is. Traditionally, all depression symptoms are considered somewhat interchangeable indicators of depression, and it is not clear what a good or clinically relevant symptom is. From the perspective of depression as a network of interacting symptoms, however, important symptoms are those with a large number of strong connections to other symptoms in the dynamic system (i.e., symptoms with high centrality).

So we went ahead and estimated the network structure of 28 depression symptoms in about 3,500 depressed patients. We found that the 28 symptoms are intertwined with each other in complicated ways (it is not the case that all symptoms have roughly equally strong ties to each other), and symptoms differed substantially in their centrality values. Interestingly, both depression symptoms listed in the Diagnostic and Statistical Manual of Mental Disorders (DSM) and non-DSM symptoms such as anxiety were among the most central symptoms.
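Strength centrality, one standard way to quantify how connected a symptom is, is simply the sum of the absolute weights of a node's edges. A minimal sketch with a made-up three-symptom network (illustrative weights, not estimates from the paper):

```python
import numpy as np

symptoms = ["sad mood", "fatigue", "anxiety"]

# Symmetric weighted adjacency matrix of a toy symptom network:
# e.g., sad mood and anxiety share an edge of weight 0.4.
adj = np.array([[0.0, 0.3, 0.4],
                [0.3, 0.0, 0.1],
                [0.4, 0.1, 0.0]])

# Strength centrality: sum of absolute edge weights per symptom.
strength = np.abs(adj).sum(axis=1)

most_central = symptoms[int(strength.argmax())]
print(dict(zip(symptoms, strength)))  # sad mood is most central (0.7)
```

The paper's comparison of DSM and non-DSM symptoms boils down to computing such centrality values for all 28 symptoms and contrasting the two groups.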


When we compared the centrality of DSM and non-DSM symptoms, we found that, on average, DSM symptoms are not more central. At least from a network perspective, this raises substantial doubts about the validity of the depression symptoms featured in the DSM. Our findings suggest the value of research focusing on especially central symptoms to increase the accuracy of predicting outcomes such as the course of illness, probability of relapse, and treatment response.

Fried, E., Epskamp, S., Nesse, R. M., Tuerlinckx, F., & Borsboom, D. (2016). What are 'good' depression symptoms? Comparing the centrality of DSM and non-DSM symptoms of depression in a network analysis. Journal of Affective Disorders, 189, 314–320. doi:10.1016/j.jad.2015.09.005

HRQoL Paper published

Recently, our paper "The application of a network approach to health-related quality of life (HRQoL): Introducing a new method for assessing HRQoL in healthy adults and cancer patients" was published in Quality of Life Research.

The goal of this paper was to introduce the network approach to the analysis of Health-Related Quality of Life (HRQoL) data. To show that the network approach can aid in the analysis of these kinds of data, we constructed networks for two samples: Dutch cancer patients (N = 485) and Dutch healthy adults (N = 1742). Both samples completed the 36-item Short Form Health Survey (SF-36), a commonly used instrument across different disease conditions and patient groups [1]. In order to investigate the influence of diagnostic status, we added this binary variable to a third network that was constructed using both samples. The SF-36 consists of 8 sub-scales (domains), so we also constructed "sub-scale" networks to gain more insight into the dynamics of HRQoL at the domain level.

Results showed that the global structure of the SF-36 is dominant in all networks, supporting the validity of the questionnaire's subscales. Furthermore, we found that the network structures of the two samples were similar in their basic structure at the item level, and highly similar at the subscale level, not only in structure but also in the strength of the connections. Lastly, centrality analyses revealed that maintaining a daily routine despite one's physical health predicts HRQoL levels best.

We concluded that the network approach offers an alternative view of Health-Related Quality of Life. We showed that the HRQoL network is, in its basic structure, similar across samples. Moreover, the network approach allows us to identify important characteristics of the structure, which may inform treatment decisions.

Kossakowski, J. J., Epskamp, S., Kieffer, J. M., van Borkulo, C. D., Rhemtulla, M., & Borsboom, D. (in press). The application of a network approach to Health-Related Quality of Life: Introducing a new method for assessing HRQoL in healthy adults and cancer patients. Quality of Life Research. doi:10.1007/s11136-015-1127-z

[1] Ware, J. E, Jr, & Sherbourne, C. D. (1992). The MOS 36-item short-form health survey (SF-36): I. Conceptual framework and item selection. Medical Care, 30, 473–483.