
Author's Response

2. Interpretation of the results

2.1. Reliability levels are correct as reported. A majority of commentators accepted the low levels of reliability as valid, and offered a number of suggestions for improving the reliability (and at times even the validity) of peer reviews (Adams, Bornstein, Cohen, Cole, Colman, Cone, Crandall, Delcomyn, Fletcher, Gillmore, Gorman, Greene, Kraemer, Laming, Lock, Mahoney, Nelson, Roediger, Rourke, Salzinger, Tyrer, and Zentall). These views are discussed in later sections of the report.

Cole feels that both editors and granting officials need to admit that since reliability is so poor, much high-quality research is rejected or disapproved, whereas some poor-quality research is accepted or funded. Therefore, editors should gradually increase the number of manuscripts they accept, and granting officials should put funding aside for meritorious but disapproved proposals. The major problem with this otherwise good idea is that the time required to reverse a funding decision may equal or exceed the time required to revise the proposal and resubmit it to the same or a different funding agency.

Zentall, Roediger, and Laming doubt that levels of reliability could ever be improved substantially. Zentall argues that much of the disagreement reflects deep theoretical and methodological (confirmational) biases.

Similarly, Roediger argues that the corpus of psychological literature has demonstrated consistently that human judgments of such complex issues as hiring decisions or making clinical diagnoses are of questionable reliability and validity. Hence, the similarity in results for peer reviews is to be expected. Laming, in a most imaginative commentary, argues by analogy with the results of a number of psychophysical studies across sensory modalities that the constantly shifting frames of reference with which successive stimuli are compared limit the accuracy of human judgments to the extent that about 2/3 of the variability in judgments can be attributed to the variability in frames of reference. Thus, it is the absence of a stable frame of reference that sets limits on the extent of judgmental accuracy. Applying this knowledge to the field of peer reviews of manuscript and grant submissions, Laming concludes that the shared variability between independent reviews would be restricted to an upper limit of about 0.33. He ends his commentary on a rather sombre and pessimistic note that he contrasts to my own more optimistic view of progress in science (in general) and peer review (in particular). Laming's pessimism is based on his examination of journal articles in his field of interest (experimental psychology) that were published between 50 and 100 years ago. He concludes that if more than 90% of this research had never been published, the state of experimental psychology would be no different than it is today.
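For readers who want the arithmetic behind Laming's 0.33 ceiling spelled out, the following is a minimal restatement of the bound as I read it (my gloss, not Laming's own derivation): if roughly two thirds of the variance in each reviewer's judgment is frame-of-reference noise that is uncorrelated across reviewers, then the variance two independent reviews can share is limited to the remaining third.

% A rough bound, assuming two thirds of each reviewer's judgment variance
% is frame-of-reference "noise" that is uncorrelated across reviewers:
%   sigma^2_judgment = sigma^2_signal + sigma^2_frame,
%   with sigma^2_frame / sigma^2_judgment roughly 2/3.
\[
  \rho_{\max}
  = \frac{\sigma^{2}_{\mathrm{signal}}}
         {\sigma^{2}_{\mathrm{signal}} + \sigma^{2}_{\mathrm{frame}}}
  \approx 1 - \tfrac{2}{3} = \tfrac{1}{3} \approx 0.33 .
\]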

In contrast to the pessimism shared by Laming, Roediger, and Zentall, I must state emphatically that the progress in my own field of inquiry, assessing the reliability and validity of standard and state-of-the-art diagnostic instruments in both behavioral science and medicine, has been nothing short of dramatic. Thus, my colleagues and I have developed highly reliable and valid instruments over a wide range of disorders:

1. In behavioral science, for example, adaptive behavior (Sparrow et al. 1984a; 1984b; 1985), alexithymia (Krystal et al. 1986), personality disorders (Cicchetti & Tyrer 1988; Tyrer, Cicchetti et al. 1984; Tyrer, Strauss et al. 1984), anxiety (Tyrer, Owen et al. 1984), affective behaviors of demented patients (Nelson et al. 1989), and dissociative disorders (Steinberg et al. 1990); and

2. In medicine, the Yale Observation Scales for identifying seriously ill febrile children (McCarthy et al. 1982; McCarthy et al. 1990), new methods for classifying cataracts both in vitro (Cicchetti et al. 1982) and in vivo (Cotlier et al. 1982), and the accuracy of the barium enema in diagnosing (a) Hirschsprung Disease (Rosenfield et al. 1984) and (b) acute appendicitis (Garcia et al. 1987).

For each of these diverse areas, we have consistently shown levels of reliability in the GOOD to EXCELLENT range (usually kappa or R_i values of .90 and above), as well as good evidence for validity. When I became actively involved in research more than two decades ago, there was little optimism that the low levels of reliability and accuracy of judgment (especially in the behavioral sciences) would ever become "respectable." Yet, less than a decade ago, the field of psychiatric diagnosis had improved dramatically, as encapsulated in the writings of Grove et al. (1981, p. 408):

For years, achieving adequate diagnostic reliability in psychiatry was considered to be a hopeless undertaking. A number of landmark studies suggested that psychiatrists looking at the same patients frequently disagreed about the appropriate diagnoses. As a consequence, the importance of diagnosis was minimized in both research and clinical work. . . . The reversal of nihilistic attitudes about psychiatric diagnosis has led to a rigorous (and successful) attempt to rework the entire American diagnostic system used by clinicians, DSM-III, which demonstrated in field trials that good agreement could be achieved even in routine practice.

The specific details about how I believe that similar breakthroughs can be made in the field of peer review (namely, improving both its reliability and validity) are expressed in a later section of the report.

2.2. Reliability levels are worse than indicated. Examples are given by Schönemann from the published literature in which false claims about a number of phenomena have been made and perpetuated (e.g., indeterminacy, heritability, the results of mathematical modeling). Because manuscripts with high reliability (editors and reviewers agree they should have been published at the time) have "negative" validity, this can only mean that reliability is lower than one would think, perhaps at random or chance levels.

Although Schönemann's argument has a certain face-validity appeal, I am hard pressed to calculate the actual frequency with which the unfortunate phenomena he reports occur relative to the mammoth corpus of research that has been published. In mathematical terms, we are faced with trying to interpret a ratio with both an unknown numerator (the number of invalid published research findings) and an unknown denominator (the total number of nonredundant published findings). In short, it is not possible for me to draw a cause-and-effect conclusion on these matters given the data presented thus far. Perhaps, given the enormous volume of published research in such diverse outlets, one could never arrive at a valid conclusion.

2.3. Reliability is better than indicated. Several commentators (Hargens, Marsh & Ball) mention that reliability levels may have been underestimated by taking into account only the recommendations of two independent reviewers. Marsh & Ball, for example, note that in addition to the initial two reviews, the editor often has his own review, author revisions, and further reviews of the revised manuscript on which to base a decision, thereby probably increasing the reliability of the process. The additional review, however, whether by the editor or a third reviewer, is often not an independent one and so may be heavily influenced by the results of the initial two reviews. Despite this problem, there is a factor mentioned both by Hargens and by Marsh & Ball that one can test empirically, namely, that the editor's process of weeding out very poor quality manuscripts (rejected without being sent out for review) might reduce the variance and subsequently increase the levels of interreviewer agreement, because these very submissions are the type we have shown to produce the highest levels of consensus. Hargens cites both Gordon (1977) and Zuckerman and Merton (1971) to suggest that the editor's sole "summary-rejection" rates for prestigious journals in both social science and medicine may reach levels as high as 50%. Fortunately, I have been able to analyze further some additional data deriving from reviews for the Journal of Abnormal Psychology (JAP) between 1973 and 1977.

As given in Table 3, and based on 996 submissions, there was an overall R_i (or kappa) value of .24, with 73% agreement on rejection, 51% on acceptance, and 65% overall agreement. In addition to these 996 submissions, the editor received 384 additional manuscripts. He rejected 333 (86.7%) and accepted the remaining 51 (13.3%). If we make the assumption that the rejected manuscripts would also have been rejected by another independent reviewer because of their obvious poor quality or inappropriateness for JAP, the results show that: overall agreement increases from 65% to 74%; agreement on rejection increases from 73% to 82%; agreement on acceptance remains at 51%; and R_i (or kappa) increases from .24 to .34. In conclusion, even if one assumes that the reliability of negative editorial reviews is perfect, it may not have a profound effect on increasing the reliability of the peer review process. Thus, whereas the agreement level on rejection improves, the lack of a corresponding increase in reliability for acceptance keeps the R_i value at relatively low levels.

Table 3. Effect of editorial summary rejection of 333 manuscripts on the overall reliability of peer review of manuscripts submitted to the Journal of Abnormal Psychology (1973-1978)

A. Based on two independent reviews (First Review × Second Review)

B. Adding the 333 editor's rejections to the reject-reject cell (First Review × Second Review)
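As a worked illustration of the Table 3 calculation, here is a minimal Python sketch. The 2 × 2 cell counts it uses are hypothetical, chosen only to be consistent with the summary figures reported in the text (996 submissions; 65% overall agreement; 73% agreement on rejection; 51% on acceptance; kappa of .24); they are not the actual JAP data. The sketch shows how adding the 333 summary rejections to the reject-reject cell moves the statistics to approximately the values reported above.

# Illustrative reconstruction of the Table 3 computation (hypothetical cells).
# a = reject/reject, b = reject/accept, c = accept/reject, d = accept/accept.

def agreement_stats(a, b, c, d):
    n = a + b + c + d
    p_o = (a + d) / n                                     # overall (observed) agreement
    p_e = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2  # chance-expected agreement
    kappa = (p_o - p_e) / (1 - p_e)
    p_reject = 2 * a / (2 * a + b + c)                    # specific agreement on rejection
    p_accept = 2 * d / (2 * d + b + c)                    # specific agreement on acceptance
    return p_o, p_reject, p_accept, kappa

# A. Two independent reviews of 996 submissions (hypothetical cell counts).
a, b, c, d = 470, 173, 173, 180
print("A:", agreement_stats(a, b, c, d))
# -> about 65% overall, 73% on rejection, 51% on acceptance, kappa of about .24

# B. Add the editor's 333 summary rejections to the reject-reject cell.
print("B:", agreement_stats(a + 333, b, c, d))
# -> about 74% overall, 82% on rejection, 51% on acceptance, kappa of about .33

The sketch makes the point in the text concrete: only the reject-reject cell grows, so agreement on rejection and overall agreement rise, while the unchanged acceptance cells keep kappa at a modest level.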

The great majority of commentators viewed the target article as a worthwhile endeavor, although they differed on their specific interpretation of what the results mean; two remaining commentators, however, Kiesler and Bailar, questioned the value of such research. These two commentators share the minority view that the only meaningful goal of peer review is to improve decisions about which submissions should be accepted (or approved) and which should be rejected (or disapproved). As such, the issue of reliability is essentially irrelevant to them. They also express the view that high levels of agreement signal that there is too much redundancy in the peer review process, that it is not working well, and that a balanced review has not been achieved.

Kiesler is convinced at a basic conceptual level that high levels of reliability are incompatible with what he terms "wise" editorial and funding decisions. He states specifically that to expect high levels of reviewer agreement is "naive" because it falsely assumes that reviewers are randomly drawn by editors. I would submit that herein lies the most serious error in Kiesler's reasoning. In fact, if he were to choose reviewers randomly in his own general area of focus (the broad field of psychology), this procedure would almost guarantee levels of reviewer agreement even lower than what has been reported. Given that Kiesler needed a Freudian theorist as well as a sophisticated statistician to obtain a balanced review (using his hypothetical example), the probability that such expertise could be obtained on the basis of purely random selection procedures would indeed approach zero. In fact, any set of reviewers selected at random in any general focus area (behavioral science, medicine, general subfields of physics) would almost perforce be expected to disagree to a greater extent than those chosen specifically for their areas and levels of expertise. Rourke correctly intimates that the validity of the comments of randomly selected reviewers would also be compromised because of insufficient knowledge about the area they would have been asked to evaluate. (A similar view is expressed by Lock.) In short, the balanced selection of reviewers should, if anything, enhance both the reliability and the validity of the resulting reviews.

If we accept Bailar's commentary at face value, then to expect the peer review process to be "reliable," "fair," and "objective" would be considered an "inappropriate" goal. A careful reading of Bailar's comments suggests that as an editor he chose to work around the obvious unreliability, unfairness, and subjectivity of the peer review process for the Journal of the National Cancer Institute (JNCI). As one example, his regular use of reviewers who were clearly biased (i.e., would never recommend publication or would never criticize their colleagues) would prompt other commentators to act quite differently (I agree). Thus Kraemer would remove reviewers who "condemn everything" or have an apparent conflict of interest with the author(s) of the paper under review. Similarly, other commentators would rather remove than live with or "work around" other obvious biases in the peer review system (I again agree). These biases include "confirmatory bias" against "negative" research findings, well-conceived replication studies (Gorman, Lock, Salzinger, Schönemann, Zentall), and innovative research (Armstrong & Hubbard, Lock); the time of day that grants are evaluated; the subjective "rating scale use habits" of grant reviewers; and the hypothesized harsher (more negative) evaluations provided by less experienced grant reviewers (Cohen).

In summary, for Bailar to allow individuals who are clearly biased or who may have a potential conflict of interest to remain as "regular" reviewers stretches my limits of permissible peer review practice to the breaking point. Consistent with the views of peers at large, I am totally opposed to the practice. It is also somewhat curious that Bailar voices concern that ethical issues were not discussed in the target article. His comments follow closely his voicing obvious frustration with not being able to discuss such issues directly in connection with the Peters & Ceci (1982) publication about eight years ago. The fact of the matter is that about 20% of the authors' reply was devoted to the ethical issue. Mahoney addresses the ethical issue more broadly, and I heartily endorse his sanguine remarks.

Another issue that both Kiesler and Bailar seem to have overlooked is that high-quality research (worthy of support) is integrally related to: (a) asking important questions; (b) designing and executing the research in an exemplary manner (utilizing proper controls); (c) using state-of-the-art instrumentation (and/or test materials); (d) writing clearly and succinctly; and (e) presenting a compelling discussion of the results and their implications (or heuristic value) for furthering scientific advancements in the field. Because of the interrelatedness of these five evaluation attributes, my many years of experience reviewing manuscripts and grants over a broad spectrum of disciplines (behavioral science, medicine, biostatistics), as well as my activities on editorial boards and grant review committees, have indicated to me that when the peer review process is working properly (i.e., reviewers are selected for their varying areas of competence and they take their reviews seriously), it is not unusual to find high levels of agreement on at least the final recommendation, if not on a number of manuscript or grant attributes as well.

To clarify the relevance of Bailar's example, there is no a priori reason to believe that the cardiologist, pharmacologist, and statistician should not agree that a given clinical trial evaluating a new hypertensive drug is or is not worth supporting simply because each represents a different area of expertise. They would surely agree more than alternative reviewers selected randomly. The major disagreements I have experienced (or witnessed) among reviewers (whether for manuscripts or grants) have occurred primarily because a proper match was not made between submitters and reviewers. Although the disagreement can be occasioned by a number of factors, not least among them is a lack of sufficient expertise (or even bias) on the part of one or more of the reviewers.

So, in response to both Kiesler and Bailar, I would emphasize that the proper selection of reviewers to evaluate a given submission should, in the long run, increase both the reliability and validity of the peer review process. That a balanced set of reviews (for both manuscript and grant submissions) is a sine qua non is widely accepted by editors, granting officials, reviewers, and authors alike. See, for example, the additional comments on this important issue by Adams, Eckberg, Greene, Hargens, Kraemer, Roediger, and Stricker, as well as the recently published work of Fiske and Fogg (1990).

The next major issue I discuss concerns how a given editor or program director uses the information obtained from peer reviews - quite apart from issues of reliability (or validity) - to make publication or funding decisions.

3. Use of peer reviews to improve