«Edited by ANNE MASON Research Fellow, Centre for Health Economics University of York and ADRIAN TOWSE Director, Ofﬁce of Health Economics Radcliffe ...»
It was recognised that for most people the concept of valuing health would certainly be a novel experience and potentially somewhat challenging. To introduce both the descriptive system and the VAS task, respondents were ﬁrst asked to assess their own health status. It rapidly became clear that this preliminary two-page segment of the valuation questionnaire was capable of generating information of independent signiﬁcance. The ‘simple’ descriptive classiﬁcation was shown to be responsive to differences in age, current health experience, social class, educational attainment, housing tenure and health behaviours; a similar pattern was demonstrated in the self-reported VAS data. This type of variation, which seemed to be consistently in evidence in the various surveys conducted by researchers in the formative days of the EuroQoL group, conﬁrmed the view that EQ-5D (as it subsequently became termed) had legitimate value as a measure of health-related quality of life.
EQ-5D is probably the most widely used of a small set of generic index measures of health-related quality of life that are appropriate for application in cost–utility analysis. This set additionally includes HUI, 15D, AQoL (Assessment of Quality of Life), SF-6D and QWB. In the context of UK decision making, it has achieved particular salience as a result of guidance on technology assessment issued by NICE. As with all measures of this type, EQ-5D conforms to a general model used by instrument developers from Bush onwards, and consists of two discrete but linked components. First, a standard descriptive system is used to classify patients into one of a fixed set of health states. A value or weight is then assigned to this health state from a previously established valuation set that forms the second component in this general model. It is the means whereby these values are elicited that is the central issue dominating research (and practice) in measuring health outcomes for economic evaluation. From the outset the EuroQoL group encouraged local experimentation around this issue by its research membership. The sole proviso was that whatever else was done, the ‘standard’ valuation task would be administered alongside any other variant. The Measurement and Valuation of Health (MVH) project at York had already tested a variety of valuation methods including TTO and SG, and had selected the former as the basis of a national UK population survey conducted in 1993. The survey led to the production of a set of (TTO) utility weights for EQ-5D health states, somewhat awkwardly labelled a ‘tariff’. At the time, the MVH protocol/survey represented the most sophisticated attempt to capture social preferences, and since then it has been adopted (and adapted) in a number of other countries where domestic national value sets have been established.
The production of the MVH value sets (some 32 different sets were published as part of the final report) raised new problems that had not hitherto appeared to be significant (or intractable). The foremost of these remains that of formulating a method to interpolate values for health states not directly recorded as part of the survey protocol. The MVH survey was based on a subset of 45 health states selected from the total descriptive array of 243. The modelling/interpolation effort invested in the MVH data by the York team, and by the EuroQoL group more generally, was frankly enormous and the data continue to be the subject of analysis to this day. A similar enterprise conducted with the benefit of computer-based adaptive testing methods could generate data that mitigated this need. The asymmetry in protocols for valuing health states better or worse than dead, indeed the entire question of the valuation of ‘dead’, became apparent in the aftermath of the survey and remain significant issues. The mere fact that the MVH protocol incorporated a TTO task does not resolve the issue of what constitutes the appropriate method of eliciting such values. As yet there has been no significant comparison of such values with those derived from SG.
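The interpolation problem can be made concrete with a short sketch. The three-level EQ-5D descriptive system has five dimensions, which gives 3^5 = 243 possible profiles, so values for most states have to be predicted rather than observed. The example below assumes the simplest possible main-effects form, in which each dimension contributes its own decrement from full health. The decrements are hypothetical numbers chosen for illustration; they are not the MVH coefficients, and the published models contain further terms.

```python
from itertools import product

# EQ-5D (three-level version): five dimensions, each at level 1, 2 or 3.
DIMENSIONS = ("mobility", "self_care", "usual_activities",
              "pain_discomfort", "anxiety_depression")

# Every possible profile, written as a five-digit string such as "11223".
ALL_STATES = ["".join(map(str, levels))
              for levels in product((1, 2, 3), repeat=len(DIMENSIONS))]

# HYPOTHETICAL decrements for illustration only (NOT the MVH tariff).
# The decrement for each dimension's level is subtracted from 1.0.
HYPOTHETICAL_DECREMENTS = {
    dim: {1: 0.0, 2: 0.1, 3: 0.3} for dim in DIMENSIONS
}


def additive_value(state: str) -> float:
    """Predict a value for any profile from per-dimension decrements."""
    total = sum(HYPOTHETICAL_DECREMENTS[dim][int(level)]
                for dim, level in zip(DIMENSIONS, state))
    return round(1.0 - total, 3)


print(len(ALL_STATES))          # size of the descriptive array
print(additive_value("11111"))  # full health
print(additive_value("33333"))  # worst profile; negative means worse than dead
```

Under these assumed decrements any of the 243 profiles receives a value, whether or not it was among the states put to respondents. The price is the additivity assumption itself, which is exactly what the modelling effort described above had to test.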
Finally, the assumed invariance of values for EQ-5D health states needs to be confronted. This is part of the wider agenda of the generalisability of health state values that ﬁrst brought the EuroQoL group together. Do values obtained from a UK population survey properly represent the social preferences of the citizens of other countries? Can ‘utility’ weights generated by TTO legitimately substitute for SG weights? Do such weights change over time? The research agenda remains substantial.
THE CURRENT STATE OF PLAY
So to what point does this long journey of discovery now bring us? Are we closer to realising the aspiration of those who initially set all this in motion?
Of the enduring nature of the QALY itself there seems little doubt. A recent International Society for Pharmacoeconomics and Outcomes Research (ISPOR) symposium heard from critics and practitioners alike that they envisaged its continued survival. It is the means by which we achieve the ‘Q’ in QALYs that is of most importance, and it is here that there seems to be the greatest variability of interpretation.
The Washington Panel on Cost-Effectiveness in Health and Medicine distinguished between two broad approaches to the assignment of preference weights to health states in computing QALYs: those based on expected utility theory and those derived from psychological or psychophysical scaling methods (Gold et al., 1996). It was noted that ‘the diversity in how preference weights are gathered markedly constrains the ability to credibly compare analyses where the effectiveness measure is presented in QALYs’ (Gold et al., 1996, p. 119). The panel recognised what they politely termed ‘disagreement’ as to the best measurement strategy. The existence of a ‘correct’ method ‘depends in the first instance’, they suggested, ‘on whether there are theoretical reasons for adopting a particular approach’ (Gold et al., 1996, p. 118). They went on, intriguingly, to state that ‘it is not clearly the case that incorporation of risk attitudes into the utilities that represent the “quality” dimension of QALYs is necessary for CEA [cost-effectiveness analysis] studies designed to inform resource allocation decisions’ (Gold et al., 1996, p. 118).
This position differs somewhat from the stance taken by NICE. In favouring cost–utility analysis as a means of providing ‘a comparative context for judging the relative value of health benefits from interventions in different disease areas’, NICE accords equal status to SG utilities and TTO utilities (NICE, 2001). As shown in Table 10.2, the reference case model most recently espoused by NICE is specific in rejecting preferences derived using rating scales. In his definitive review, Torrance (1986) clearly sets out the subtle but important difference between von Neumann–Morgenstern (vNM) utility and values. The former can only be generated via choice-based methods operating under conditions of uncertainty. Everything else (at best) falls into the second category. This distinction is reinforced in at least one leading textbook (see Figure 10.1) (Drummond et al., 2005).
Despite this accumulation of expert opinion to the contrary, the literature is replete with reports in which researchers claim that ‘utilities’ may be generated using one of three methods: SG, TTO or rating scales. It would be naïve in the extreme to expect that all methods would yield convergent results, or that there might be a single transformation that would convert one metric into another.
Weights based on rating scale methods typically avoid explicit reference to uncertainty and exchange, so that in the strictest sense it is hard to see a case for their use as a cardinal measure of utility. However, such methods do entail an element of choice, albeit one that is more subtly embedded. Analytical methods that enable cardinal scales to be derived from the ordinal data generated by rating scales have long been recognised, but these do not extend to the rote application of a power transformation, so often used as the mechanism for smartening up such data.
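What a rote power transformation does, and does not do, can be seen in a few lines. The sketch below assumes the commonly used form u = 1 - (1 - v)^b, where v is a rating-scale value between 0 and 1. The exponent is a hypothetical value chosen so that the arithmetic is easy to follow; it is not an estimate from any study.

```python
def power_transform(v: float, exponent: float) -> float:
    """Map a rating-scale value v in [0, 1] to 1 - (1 - v) ** exponent."""
    if not 0.0 <= v <= 1.0:
        raise ValueError("rating-scale value must lie in [0, 1]")
    return 1.0 - (1.0 - v) ** exponent


ASSUMED_EXPONENT = 2.0  # hypothetical, for illustration only

ratings = [0.0, 0.25, 0.5, 0.75, 1.0]
transformed = [power_transform(v, ASSUMED_EXPONENT) for v in ratings]

for v, u in zip(ratings, transformed):
    print(f"rating {v:.2f} -> transformed {u:.4f}")
```

The rank order of the states is untouched, since the function is monotonic, while the spacing between them is determined entirely by the exponent. Any cardinal meaning in the result is therefore inherited from whoever chose the exponent, not from the respondents' ratings.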
TABLE 10.2 THE NICE REFERENCE CASE
Element of health technology assessment | The reference case
Measure of health benefits | QALYs
Description of health states for calculation of QALYs | Health states described using a standardised and validated generic instrument (#5.5.3)
Method of preference elicitation for health-state valuation | Choice-based methods, for example time trade-off or standard gamble, not rating scale
Source of preference data | Representative sample of the general public
Source: National Institute for Clinical Excellence (2004) Guide to the Methods of Technology Appraisal. London: NICE. Available at: www.nice.org.uk Reproduced and adapted with permission.
The hard fact of the matter is that the two principal methods of utility elicitation yield different estimates. Weights derived using SG are known to differ systematically from corresponding weights derived using TTO (Read et al., 1984). The reluctance of respondents to entertain even the smallest risk of death, or to forgo any portion of life expectancy at all, in order to avoid remaining in an apparently minor dysfunctional health state is well known. In the face of such demonstrable failure of the ‘standard’ techniques, researchers continue to struggle to reconcile the differences in empirical data generated using these methods.
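The way each method turns a respondent's indifference point into a weight may help to show why the two need not agree. The sketch below assumes the conventional scoring for states regarded as better than dead: in TTO the weight is the ratio of years in full health to years in the state, and in SG it is the probability of full health at which the respondent is indifferent between the gamble and the certain state. The respondents' answers are hypothetical.

```python
def tto_value(years_full_health: float, years_in_state: float) -> float:
    """TTO weight: x years in full health judged equal to t years in the state."""
    if not 0.0 <= years_full_health <= years_in_state:
        raise ValueError("expected 0 <= years_full_health <= years_in_state")
    return years_full_health / years_in_state


def sg_value(p_full_health: float) -> float:
    """SG weight: chance of full health (versus death) at indifference."""
    if not 0.0 <= p_full_health <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    return p_full_health


# Hypothetical respondent A: gives up 2 of 10 years, but accepts only a
# 10% risk of death, for the same health state.
print(tto_value(8.0, 10.0), sg_value(0.9))

# Hypothetical respondent B: refuses to trade any time or accept any risk
# to escape a minor state, so both methods return the ceiling value.
print(tto_value(10.0, 10.0), sg_value(1.0))
```

Respondent A shows how a single person can hold two different weights for one state, depending on whether the sacrifice is framed as time or as risk. Respondent B shows the refusal to trade described above: the state is plainly not equivalent to full health, yet neither method can register the difference.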
Were a sustainable case to be made that supported the dominance of SG, then the issue of valuation method might be settled once and for all. However, as noted by Brazier et al. (1999) ‘if there is doubt about the axioms of expected utility theory as they relate to health state valuations, as many commentators suggest, there can be no justiﬁcation for SG as the reference method or “gold standard” for health state valuation’. Furthermore, the practical procedure of implementing SG is itself open to widespread local variation. For example, there are at least three methods for determining indifference points other than the ‘standard’ ping-pong (top-down, bottom-up and iterative division). These different strategies can and do yield different results, so that even the existence of a ‘standard’ form of SG remains problematic.
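The mechanics of the three alternative search procedures can be sketched as follows. Everything here is an assumption made for illustration: the respondent is modelled as a perfectly consistent function that accepts the gamble whenever the chance of full health exceeds a fixed threshold, offers move in steps of ten percentage points for the top-down and bottom-up routes, and the indifference point is reported as the midpoint between the last refused and the last accepted offer. Real protocols differ in all of these details.

```python
from typing import Callable

# A respondent is modelled as a function of the gamble's chance of full
# health (in whole percent) that returns True when the gamble is preferred
# to the certain health state. Assumed: accepts at 100%, refuses at 0%.
Respondent = Callable[[int], bool]


def top_down(prefers_gamble: Respondent, step: int = 10) -> float:
    """Start at 100% and lower the offer until the gamble is refused."""
    accepted = 100
    while accepted - step >= 0 and prefers_gamble(accepted - step):
        accepted -= step
    refused = max(accepted - step, 0)
    return (accepted + refused) / 200  # midpoint, as a probability


def bottom_up(prefers_gamble: Respondent, step: int = 10) -> float:
    """Start at 0% and raise the offer until the gamble is accepted."""
    refused = 0
    while refused + step <= 100 and not prefers_gamble(refused + step):
        refused += step
    accepted = min(refused + step, 100)
    return (accepted + refused) / 200


def iterative_division(prefers_gamble: Respondent, tolerance: int = 1) -> float:
    """Repeatedly halve the gap between a refused and an accepted offer."""
    refused, accepted = 0, 100
    while accepted - refused > tolerance:
        middle = (accepted + refused) // 2
        if prefers_gamble(middle):
            accepted = middle
        else:
            refused = middle
    return (accepted + refused) / 200


# Hypothetical, perfectly consistent respondent: takes the gamble only
# when the chance of full health exceeds 83%.
def consistent_respondent(percent: int) -> bool:
    return percent > 83


for procedure in (top_down, bottom_up, iterative_division):
    print(procedure.__name__, procedure(consistent_respondent))
```

With a respondent who never wavers, the procedures agree up to the resolution of the offers. The divergence reported in practice must therefore come from how real respondents react to the sequence of questions, which this sketch does not attempt to model.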
Of equal importance in seeking to justify exclusive reliance on utility weights for QALY computation is the difﬁculty in establishing that any given set of weights does indeed possess the ‘utility’ attribute. A straw poll conducted among a convenience sample of health economists yielded a consensus that the attribute is conferred by virtue of the method by which the weights were obtained. But in the absence of a standardised protocol for determining ‘utility’ weights, it is hard to subscribe to this interpretation. It is this circularity that further weakens the case for the ‘utility-only’ approach to QALY calculation.
Suppose that we are presented with two sets of weights and told that one was generated by SG/TTO methods and the other is an ordered set of numbers generated by the RAND function in Excel. What test would be used to identify the ‘utility’ set?
In fact, fairly simple attributes are required of the quality-adjustment factor used to compute QALYs. Table 10.3 sets out those attributes for QALYs in NICE appraisals. Some properties are more critical than others. For example, it would be inconceivable to undertake any arithmetic without access to a quality-adjustment factor with an index format. Nor would it be acceptable if such a factor lacked cardinal properties. The first four attributes are strictly non-negotiable, and failure to conform with any of them should be regarded as an irrecoverable defect. There may be more scope for flexibility in respect of the last two attributes. Accepting an alternative reference source could lead to the recognition of (say) patient-based values or those generated in a non-UK setting. In this regard it should be noted that since the MVH A1 value set represents the preferences of a national sample of the UK population, it allows Scottish ‘voters’ to influence decisions made on behalf of the English when applied in NICE appraisals. Not only did the Scottish respondents in the 1993 survey report poorer health status in themselves, they also tended to assign lower values to EQ-5D health states than did their English counterparts. The effects of this health analogue to the West Lothian question have been described elsewhere (Kind, 2005).
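Why an index format and cardinal properties are treated as non-negotiable follows directly from the arithmetic. In the minimal sketch below, QALYs are taken as the sum of duration multiplied by quality-adjustment weight across periods. The two options and their weights are hypothetical, and squaring the weights is simply one convenient example of a rescaling that leaves the ordering of health states unchanged.

```python
from typing import Iterable, List, Tuple

Profile = Iterable[Tuple[float, float]]  # (years, quality-adjustment weight)


def qalys(profile: Profile) -> float:
    """Sum of duration in years multiplied by the weight for each period."""
    return sum(years * weight for years, weight in profile)


def rescale(profile: Profile) -> List[Tuple[float, float]]:
    """Square every weight: the ranking of states is kept, the spacing is not."""
    return [(years, weight ** 2) for years, weight in profile]


# Hypothetical options, for illustration only.
option_a = [(4.0, 0.5)]    # longer survival in a poorer state
option_b = [(2.0, 0.75)]   # shorter survival in a better state

print(qalys(option_a), qalys(option_b))                    # A ahead of B
print(qalys(rescale(option_a)), qalys(rescale(option_b)))  # B ahead of A
```

Both sets of weights rank the two health states identically, yet they reverse the ranking of the options. Ordinal agreement is therefore not enough; the intervals between weights have to carry meaning before they are multiplied by time.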
PUTTING THE ‘Q’ IN QALYS 121
TABLE 10.3 ATTRIBUTES OF QUALITY-ADJUSTMENT FACTORS
Table 10.4 sets out different approaches for distinguishing between preference-elicitation procedures used in QALY calculations.
If utility measurement is an absolute requirement and SG is recognised as being the definitive method of choice, then TTO might be treated as a close second and all other procedures would be grouped together. If the uncertainty requirement were removed, this would make any choice-based method acceptable, arguably with category rating and VAS being relegated to the second tier. The relative strength of preference can at least be inferred from any of the methods listed in Table 10.4, and in this respect there appears to be no way of distinguishing between these alternatives. So, if QALYs can only legitimately be computed using vNM utilities, then SG appears to be the lead method, with TTO acting as a proxy.
If a more relaxed interpretation of social preferences is accepted, then methods that do not strictly yield utilities could be accepted as quality-adjustment weights in QALY calculations.
TABLE 10.4 HIERARCHY OF PREFERENCE-ELICITATION PROCEDURES
It may be noted as an aside that the high scruples now aspired to by NICE did not always constitute an obstacle to the dissemination of economic evaluation. The Rosser–Kind index was accepted even though it was based on the preferences of a small convenience sample using magnitude estimation