Fisher’s significance test: A gentle introduction

Review Article (Übersichtsarbeit). Article ID: mibe000206. DOI: 10.3205/mibe000206. URN: urn:nbn:de:0183-mibe0002065.

German title: Fishers Signifikanztest: Eine sanfte Einführung

Authors:

Andreas Stang (Prof. Dr. med., MPH), Center of Clinical Epidemiology, Institute of Medical Informatics, Biometry and Epidemiology, University Hospital of Essen, Hufelandstr. 55, 45147 Essen, Germany; School of Public Health, Department of Epidemiology, Boston University, Boston, United States. Phone: +49 201-723-77-289, Fax: +49 201-723-77-333, e-mail: andreas.stang@uk-essen.de

Bernd Kowall (PD Dr. Dr.), Institute of Medical Informatics, Biometry and Epidemiology, University Hospital of Essen, Germany

Publisher: German Medical Science GMS Publishing House, Düsseldorf

Classification: 610. Keywords: statistical models, statistical data interpretation, data analysis. Date: 2020-05-11. Languages: English, German.

This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License.

Journal: GMS Medizinische Informatik, Biometrie und Epidemiologie (GMS Med Inform Biom Epidemiol), ISSN 1860-9171, volume 16, issue 1, article 03.

Der p-Wert wird häufig missverstanden und beispielsweise als Wahrscheinlichkeit für die Richtigkeit der Nullhypothese fehlinterpretiert. Ziel des vorliegenden Beitrags ist es, zunächst die Definition des p-Werts zu erläutern. Die Ermittlung des p-Werts erfordert die Kenntnis einer Wahrscheinlichkeitsfunktion. Wie ein geeignetes statistisches Modell ausgewählt wird und anhand dieses Modells, der Nullhypothese und der empirischen Daten der p-Wert bestimmt wird, wird an der t-Verteilung erklärt. Bei der Interpretation des so erhaltenen p-Werts stehen sich zwei nicht vereinbare statistische Denkschulen gegenüber: Der orthodoxe Neyman-Pearson Hypothesentest, der auf eine Entscheidung zwischen der Nullhypothese und einer komplementären Alternativhypothese hinausläuft, und Fishers Signifikanztest, bei dem keine Alternativhypothese formuliert wird und in der die Evidenz gegen die Nullhypothese umso größer ist, je kleiner der p-Wert ist. Der Beitrag endet mit einigen kritischen Bemerkungen zum Umgang mit p-Werten.

The p-value is often misunderstood and, for example, misinterpreted as a probability for the correctness of the null hypothesis. The aim of this article is to first explain the definition of the p-value. Determining the p-value requires knowledge of a probability function. How an appropriate statistical model is selected and how the p-value is determined using this model, the null hypothesis and the empirical data is explained using the t-distribution. 
When interpreting the p-value obtained in this way, two incompatible statistical schools of thought confront each other: the orthodox Neyman-Pearson hypothesis test, which amounts to a decision between the null hypothesis and a complementary alternative hypothesis, and Fisher’s significance test, in which no alternative hypothesis is formulated and in which the smaller the p-value, the greater the evidence against the null hypothesis. The article ends with some critical remarks about the handling of p-values.

Introduction

The p-value is often misunderstood and, for example, misinterpreted as the probability that the null hypothesis is correct. P-values play an important role in two schools of thought: Fisher’s significance test and Neyman and Pearson’s hypothesis test. While the significance test leads to a quantitative interpretation of the p-value, in which it is interpreted as a continuous measure of evidence against the null hypothesis, the p-value in the null hypothesis test merely serves to reach a decision using predefined rules.

In 2016, the American Statistical Association (ASA) published a statement on the handling of p-values. Among other things it stated: “The widespread use of ‘statistical significance’ (generally interpreted as ‘p≤0.05’) as a license for making a claim of a scientific finding (or implied truth) leads to considerable distortion of the scientific process”. In 2019, Amrhein et al. published an article entitled “Retire statistical significance” in Nature, in which they draw attention to the many pitfalls in the dichotomization of p-values into “significant” (usually p≤0.05) and “non-significant” (usually p>0.05) and generally discourage this dichotomization of p-values, i.e. the categorization into two areas.

A dilemma in the application of the significance or hypothesis test remains the lack of understanding of what these methods can answer at all. 
The aim of this paper is to illustrate essential background information and the steps of the significance test by means of a fictitious study in which two groups are compared with each other. Most biostatistics textbooks do not consistently provide this background information and these steps of the significance test. The article is intended for readers who can only vaguely describe what the procedure does.

Einleitung

Der p-Wert wird oft missverstanden und z.B. als Wahrscheinlichkeit für die Richtigkeit der Nullhypothese missinterpretiert. P-Werte spielen in zwei Denkschulen eine wichtige Rolle: Dem Signifikanztest nach Fisher und dem Hypothesentest nach Neyman und Pearson. Während der Signifikanztest zu einer quantitativen Interpretation des p-Wertes führt, in der er als ein kontinuierliches Maß für die Evidenz gegen die Nullhypothese interpretiert wird, dient der p-Wert im Nullhypothesentest lediglich einer Entscheidung anhand vordefinierter Regeln.

Im Jahr 2016 veröffentlichte die American Statistical Association (ASA) eine Erklärung über die Handhabung von p-Werten. Darin wurde unter anderem erklärt: „Die weit verbreitete Verwendung von ‚statistischer Signifikanz‘ (im Allgemeinen als p≤0,05 interpretiert) als Lizenz für die Behauptung eines wissenschaftlichen Befundes (oder einer impliziten Wahrheit) führt zu einer erheblichen Verzerrung des wissenschaftlichen Prozesses“. Im Jahr 2019 veröffentlichten Amrhein et al. in der Fachzeitschrift Nature einen Artikel mit dem Titel „Retire statistical significance“, in dem sie auf die vielen Fallstricke bei der Dichotomisierung von p-Werten in „signifikant“ (üblicherweise p≤0,05) und „nicht-signifikant“ (üblicherweise p>0,05) aufmerksam machen und generell von dieser Dichotomisierung von p-Werten, d.h. der Einteilung in zwei Bereiche, abraten.

Ein Dilemma bei der Anwendung des Signifikanz- oder Hypothesentests bleibt das mangelnde Verständnis dafür, was diese Methoden überhaupt beantworten können. 
Das Ziel dieser Arbeit ist es, wesentliche Hintergrundinformationen und die Schritte des Signifikanztests anhand einer fiktiven Studie zu veranschaulichen, in der zwei Gruppen miteinander verglichen werden. Die meisten Biostatistik-Lehrbücher liefern diese Hintergrundinformationen und die Schritte des Signifikanztests nicht konsistent. Der Artikel richtet sich an Personen, die nur vage beschreiben können, was das Verfahren bewirkt.

Fundamental statistical concepts − standard deviation, sampling error, and standard error

Basic understanding – random sampling from a target population (population model)

The target population of a scientific question represents the totality of all observation units. If the target population is the resident population of the Federal Republic of Germany, it comprised 82.5 million people in 2016. Quantities of interest in this population could be means and measures of dispersion of characteristics (e.g. mean sleep latency, i.e. the average time in minutes from switching off the light in the bedroom to falling asleep). These characteristics of the target population, which are usually unknown to us, are by statistical convention denoted by Greek letters. For example, the Greek letters µ and σ² are used for the mean and the variance of a variable in the target population.

When conducting empirical studies, it is generally not possible to examine the whole target population. For this reason, only a sample from the target population is examined and information from the sample is used to make statements about the target population. The statistical inference from a sample to a target population is an inductive conclusion and is the subject of inferential statistics.

When random samples are drawn from a target population, the so-called sampling error (sampling variability) occurs. Since only a part of the target population is examined, there is variability from sample to sample. 
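The sample-to-sample variability described here can be made tangible with a small simulation. The following is only a sketch under stated assumptions: the population values (mean 40 min, standard deviation 9 min), the normal distribution and the function name `sample_means` are hypothetical and chosen purely for illustration; they are not taken from any study.

```python
import random
import statistics


def sample_means(pop_mean, pop_sd, n, n_samples, seed=0):
    """Draw n_samples random samples of size n and return their means."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.gauss(pop_mean, pop_sd) for _ in range(n))
        for _ in range(n_samples)
    ]


# Hypothetical target population (assumption): mean 40 min, SD 9 min.
means = sample_means(pop_mean=40.0, pop_sd=9.0, n=30, n_samples=5)
for i, m in enumerate(means, start=1):
    print(f"sample {i}: mean = {m:.1f} min")
```

Each run of the loop corresponds to one study of 30 persons. The sample means differ from one another although the population they are drawn from is the same every time.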
This can easily be illustrated by the toss of a fair coin. One would expect that 50% of all tosses would show heads. This expected value, also called probability, is the prediction of a relative frequency. If the coin were flipped 10 t</PlainText></TextGroup>imes, heads could appear 4 times. Flipping the coin <TextGroup><PlainText>10 t</PlainText></TextGroup>imes again would not necessarily result in heads 4 times, but e.g. 6 times. This variability is an expression of the sampling error. Thus no certain conclusion can be drawn from a sample about a target population. The law of large numbers states that the sampling error becomes smaller and smaller with increasing study size. </Pgraph><SubHeadline>Variability versus uncertainty</SubHeadline><Pgraph>If, for example, one undertakes a study on the basis of a sample of 30 adult women with sleep disorders aged 55–64 living in Germany with the aim of estimating the true mean value µ of the sleep latency of the target population, the sample provides a mean value <ImgLink imgNo="1" imgType="inlineFigure"/> of e.g. <TextGroup><PlainText>38 min</PlainText></TextGroup> and a corresponding empirical variance s<Superscript>2</Superscript>, which is calculated according to the following formula:</Pgraph><Pgraph><ImgLink imgNo="2" imgType="inlineFigure"/> </Pgraph><Pgraph>Assuming that sleep latency is normally distributed, a suitable measure of the variability in the sample, in addition to the variance, is the standard deviation (SD), which is the square root of the variance. The standard deviation s for the sample would be 8.5 min. If this study were repeated with a new random sample of 30 adult women with sleep disorders aged 55–64, resident in Germany, the mean value might be, for example, 33 min and the standard deviation 8.4 min. 
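The formula for the empirical variance is only available as an image and is not visible in this text. Assuming the usual definition with the divisor n−1 (an assumption, since the original formula cannot be checked here), with x_i denoting the individual sleep latencies and x̄ their sample mean, the variance and the standard deviation can be written as:

```latex
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i,
\qquad
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2,
\qquad
s = \sqrt{s^2}
```

For the example value s = 8.5 min, the variance is s² = 72.25 min².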
The standard error of the mean (SE) does not quantify the variability of the measured values within the sample, but rather the uncertainty of the estimate of the mean µ of the target population <TextLink reference="5"></TextLink>. The standard error is calculated according to the following formula:</Pgraph><Pgraph><ImgLink imgNo="3" imgType="inlineFigure"/> </Pgraph><Pgraph>where <Mark2>n</Mark2> is the number of observations. It can be seen that the smaller the variability of the characteristic in the sample and the larger the sample, the smaller the SE becomes.</Pgraph></TextBlock> <TextBlock language="de" linked="yes" name="Statistische Grundbegriffe − Standardabweichung, Stichprobenfehler und Standardfehler"> <MainHeadline>Statistische Grundbegriffe − Standardabweichung, Stichprobenfehler und Standardfehler</MainHeadline><SubHeadline>Grundlegendes Verständnis – Zufallsstichproben aus einer Zielpopulation (Bevölkerungsmodell)</SubHeadline><Pgraph>Die Zielpopulation einer wissenschaftlichen Frage stellt die Gesamtheit aller Beobachtungseinheiten dar. Wenn die Zielpopulation die Wohnbevölkerung der BRD ist, beträgt die Gesamtbevölkerung im Jahr 2016 82,5 Millionen. Interessante Variablen dieser Grundgesamtheit könnten Mittelwerte und Streuungen von Merkmalen sein (z.B. die mittlere Schlaflatenz, d.h. die durchschnittliche Zeit vom Ausschalten des Lichts im Schlafzimmer bis zum Einschlafen in Minuten). Diese Merkmale von Variablen der Zielpopulation, die uns in der Regel unbekannt sind, werden im Sinne einer statistischen Konvention mit griechischen Buchstaben abgekürzt. Beispielsweise werden die griechischen Buchstaben µ und σ² für den Mittelwert und die Varianz einer Variablen der Zielpopulation verwendet. </Pgraph><Pgraph>Bei der Durchführung empirischer Studien ist es im Allgemeinen nicht möglich, die gesamte Zielpopulation zu untersuchen. 
Aus diesem Grund wird nur eine Stichprobe aus der Zielpopulation untersucht und die Informationen aus der Stichprobe werden verwendet, um Aussagen über die Zielpopulation zu treffen. Der statistische Rückschluss einer Stichprobe auf eine Zielpopulation stellt eine induktive Schlussfolgerung dar und wird in der Statistik als Inferenzstatistik bezeichnet.</Pgraph><Pgraph>Wenn aus einer Zielpopulation Zufallsstichproben gezogen werden, tritt der so genannte Stichprobenfehler (Stichprobenvariabilität) auf. Da nur ein Teil der Zielpopulation untersucht wird, gibt es eine Variabilität von Stichprobe zu Stichprobe. Dies kann leicht durch den Wurf einer ungezinkten Münze veranschaulicht werden. Man würde erwarten, dass 50% aller Würfe Kopf zeigen würden. Dieser Erwartungswert, auch Wahrscheinlichkeit genannt, ist die Prognose einer relativen Häufigkeit. Wenn die Münze 10-mal geworfen würde, könnte Kopf 4-mal erscheinen. Würde man die Münze noch einmal 10-mal werfen, so würde nicht unbedingt Kopf 4-mal, sondern z.B. 6-mal auftreten. Diese Variabilität ist Ausdruck des Stichprobenfehlers. Es kann also keine sichere Schlussfolgerung aus einer Stichprobe auf eine Zielpopulation gezogen werden. Das Gesetz der großen Zahlen besagt, dass mit zunehmender Studiengröße der Stichprobenfehler immer kleiner wird.</Pgraph><SubHeadline>Variabilität versus Unsicherheit</SubHeadline><Pgraph>Führt man z.B. eine Studie auf der Basis einer Stichprobe von 30 erwachsenen Frauen mit Schlafstörungen im Alter von 55–64 Jahren, die in Deutschland leben, durch, um den wahren Mittelwert µ der Schlaflatenz der Zielpopulation abzuschätzen, so liefert die Stichprobe einen Mittelwert <ImgLink imgNo="1" imgType="inlineFigure"/> von z.B. 
38 min und eine entsprechende empirische Varianz s<Superscript>2</Superscript>, die nach folgender Formel berechnet wird:</Pgraph><Pgraph><ImgLink imgNo="2" imgType="inlineFigure"/> </Pgraph><Pgraph>Unter der Annahme einer Normalverteilung der Variable Schlaflatenz wäre ein geeignetes statistisches Maß, das die Variabilität in der Stichprobe beschreibt, neben der Varianz die Standardabweichung (SD), die die Quadratwurzel der Varianz ist. Die Standardabweichung s für die Stichprobe würde 8,5 min betragen. Würde diese Studie wiederholt, bei der wiederum eine Zufallsstichprobe von 30 erwachsenen Frauen mit Schlafstörungen im Alter von 55–64 Jahren, die in Deutschland wohnen, gewonnen wird, so würde der Mittelwert z.B. 33 min und die Standardabweichung z.B. 8,4 min betragen. Der Standardfehler des Mittelwertes (SE) ist kein Maß, das die Variabilität der Messwerte innerhalb der Stichprobe quantifiziert, sondern vielmehr die Unsicherheit der Schätzung des Mittelwertes µ der Zielpopulation <TextLink reference="5"></TextLink>. Der Standardfehler wird nach der folgenden Formel berechnet:</Pgraph><Pgraph><ImgLink imgNo="3" imgType="inlineFigure"/> </Pgraph><Pgraph>wobei n die Anzahl der Beobachtungen ist. Es ist zu erkennen, dass der Standardfehler umso kleiner wird, je kleiner die Variabilität des Merkmals in der Stichprobe und je größer die Stichprobe ist.</Pgraph></TextBlock> <TextBlock language="en" linked="yes" name="How does a statistical test work – the t-test as an example"> <MainHeadline>How does a statistical test work – the t-test as an example</MainHeadline><SubHeadline>Two-group comparison</SubHeadline><Pgraph>In an example of two randomly sampled groups, we compare the effect of a new sleeping pill on sleep latency. The verum group includes 32 persons, the placebo group 30 persons (cf. Table 1 <ImgLink imgNo="1" imgType="table"/>). In both groups, sleep latency was determined after 7 days of treatment in the sleep laboratory (polysomnography). 
The null hypothesis is that the two groups do not differ with regard to sleep latency. Several tests have been suggested for such a group comparison. </Pgraph><Pgraph>In Table 2 <ImgLink imgNo="2" imgType="table"/>, we briefly explain the historically important permutation test. The permutation test is rarely used nowadays because the computing effort may be huge. In our example, there are 4.5 × 10<Superscript>17</Superscript> permutations. Therefore, in our case the t-test, which can be regarded as a good approximation of the permutation test and is the most popular test in the biomedical literature, would be preferred.</Pgraph><Pgraph>A comparison of the mean values of the two samples shows that the mean sleep latency in the verum group is 5 min lower than in the placebo group. In both groups, sleep latency varied, as can be seen from the standard deviations. Both sample means are subject to random error arising from sampling.</Pgraph><Pgraph>The question that arises here is whether the difference of 5 min is only an expression of a random error or whether this difference is an expression of an actual effect of the sleeping pill. In the first case, both samples would come from identical populations (µ<Subscript>p</Subscript>=µ<Subscript>v</Subscript>), in the second case, the two samples would come from different populations, i.e., populations with µ<Subscript>p</Subscript>≠µ<Subscript>v</Subscript>. Figure 1 <ImgLink imgNo="1" imgType="figure"/> illustrates the problem: could it be that placebo and verum do not differ with respect to the true sleep latency averages, i.e. come from the same population with e.g. µ=<TextGroup><PlainText>38 min</PlainText></TextGroup>, and the two sample averages (33 min and <TextGroup><PlainText>38 min</PlainText></TextGroup>) are merely an expression of the sampling error, similar to the tosses of a fair coin? 
Or could it be that the new sleeping pill actually has an effect on sleep latency so that the samples come from target populations with different true mean values (µ<Subscript>p</Subscript>≠µ<Subscript>v</Subscript>)?</Pgraph><SubHeadline>Expectation of statistical variability of study results due to random error</SubHeadline><Pgraph>A significance test can provide some, albeit imperfect, information on these central questions. To answer the above questions, the behavior of the mean difference due to the random error must first be determined, assuming that a null hypothesis H<Subscript>0</Subscript> were true. There is an infinite set of null hypotheses. In medicine, the nil hypothesis has prevailed, i.e. the null hypothesis of no association between treatment assignment (placebo or verum) and sleep latency (i.e. µ<Subscript>p</Subscript>=µ<Subscript>v</Subscript>). The Greek letters indicate that this null hypothesis refers to the target population. Under this hypothesis, mean differences that are not equal to zero are an expression of the random error. Just as extreme outcomes are rarely observed when tossing a fair coin (e.g. 10 tosses showing heads 10 times), the difference of the means rarely takes extreme values under the null hypothesis.</Pgraph><Pgraph>But how many permuted arrangements of patients split into two groups exist, and how would the differences of the means in these arrangements behave if the null hypothesis µ<Subscript>p</Subscript>=µ<Subscript>v</Subscript> were true? The difficulty in answering this question lies in the fact that the behavior of the difference of the means under the null hypothesis depends on the variability of the sleep latency within the samples and the size of the samples.</Pgraph><Pgraph>So in order to predict how the differences of the means would behave if the null hypothesis were true, one has to take these two influencing variables into account. 
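The number of possible arrangements, and the behavior of the mean difference across them, can be sketched in a few lines of Python. This is an illustration under explicit assumptions, not the authors' procedure: the individual sleep latencies are simulated, hypothetical values (the study's raw data are not given in the text), only a random subset of 2000 arrangements is evaluated, and the function name `permutation_mean_diffs` is made up for this example.

```python
import math
import random
import statistics

n_placebo, n_verum = 30, 32

# Number of ways to split 62 patients into groups of 30 and 32;
# this is roughly 4.5e17, the order of magnitude quoted in the text.
n_splits = math.comb(n_placebo + n_verum, n_placebo)
print(f"{n_splits:.2e}")


def permutation_mean_diffs(group_a, group_b, n_resamples, rng):
    """Mean differences for randomly re-assigned group labels."""
    pooled = list(group_a) + list(group_b)
    k = len(group_a)
    diffs = []
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # random re-assignment of patients to groups
        diffs.append(statistics.fmean(pooled[:k]) - statistics.fmean(pooled[k:]))
    return diffs


rng = random.Random(1)
# Hypothetical data (assumption): simulated, not the study's measurements.
placebo = [rng.gauss(38.0, 8.5) for _ in range(n_placebo)]
verum = [rng.gauss(33.0, 8.4) for _ in range(n_verum)]

observed = statistics.fmean(placebo) - statistics.fmean(verum)
diffs = permutation_mean_diffs(placebo, verum, 2000, rng)

# Share of random arrangements with a difference at least as extreme.
extreme = sum(1 for d in diffs if abs(d) >= abs(observed))
print(extreme / len(diffs))
```

Under random re-assignment of the group labels the mean differences scatter around zero, and large differences are rare. How wide this scatter is depends on the variability of the values and on the group sizes, which is exactly the difficulty named above.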
Here a kind of normalization is helpful, which will be illustrated by the following example. A difference of means of <TextGroup><PlainText>3 s</PlainText></TextGroup>econds is observed for two groups of marathon runners (2 hours, 3 min, 40 seconds versus 2 hours, 3 min, <TextGroup><PlainText>43 s</PlainText></TextGroup>econds) and for two groups of 400-meter runners (<TextGroup><PlainText>46 s</PlainText></TextGroup>econds versus 49 seconds). Although the difference is 3 seconds in both cases, its meaning differs. For marathon runners, the difference is very small in relation to the average total duration of the run, while it is relatively larger for 400-meter runners. The relation to the average running time is a kind of normalization. The choice of statistical test, which ensures such a standardization, determines which test statistic is used. If, for example, the t-test is selected for independent samples, the corresponding test variable is the t-statistic, for the Chi-square test it is the Chi-square-sta<TextGroup><PlainText>tist</PlainText></TextGroup>ic etc. The choice of the appropriate statistical test in turn depends on criteria, which are briefly explained in Table 3 <ImgLink imgNo="3" imgType="table"/>.</Pgraph><Pgraph>The t-statistic is defined as:</Pgraph><Pgraph><ImgLink imgNo="4" imgType="inlineFigure"/> </Pgraph><Pgraph>The expected difference of means in the t-statistic formula is the value assumed under the null hypothesis H<Subscript>0</Subscript>. In the case of the nil hypothesis µ<Subscript>p</Subscript>=µ<Subscript>v</Subscript> a difference of zero minutes is expected. 
This simplifies the t-statistic:</Pgraph><Pgraph><ImgLink imgNo="5" imgType="inlineFigure"/> </Pgraph><Pgraph>In the case of unequal variances, the standard error of the difference of the means is calculated according to the following formula:</Pgraph><Pgraph><ImgLink imgNo="6" imgType="inlineFigure"/> </Pgraph><Pgraph>with</Pgraph><Pgraph>n<Subscript>1</Subscript>: number of patients in group 1 (placebo)<LineBreak></LineBreak>n<Subscript>2</Subscript>: number of patients in group 2 (verum)<LineBreak></LineBreak><ImgLink imgNo="7" imgType="inlineFigure"/>: variance of sleep latency in group 1<LineBreak></LineBreak><ImgLink imgNo="8" imgType="inlineFigure"/>: variance of sleep latency in group 2</Pgraph><Pgraph>The formula changes if the variances are equal (formula not shown). The standard error of the difference of the means depends on the variances of the variable (sleep latency) and the sizes of the groups being compared. After determining the standard error, the t-statistic for two independent samples with unequal variances is:</Pgraph><Pgraph><ImgLink imgNo="9" imgType="inlineFigure"/> </Pgraph><Pgraph>Independence means that the two patient groups are independent of each other and also that patients within the groups are independent of each other. For example, independence is violated if the outcome of a patient contributed statistically to both patient groups. Similarly, independence would be violated if patients in the same group influenced each other in terms of outcomes of interest. Independence is also violated when a characteristic is collected from a group of patients several times over time (e.g. before and after treatment). The data of the sleep study yield the following t-value:</Pgraph><Pgraph><ImgLink imgNo="10" imgType="inlineFigure"/> </Pgraph><Pgraph>The t-value for the concrete study is therefore +2.33. To interpret this value, one needs the distribution of the t-statistic under the null hypothesis. This distribution is determined by the so-called degrees of freedom (df). 
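The t-value of +2.33 can be retraced from summary statistics. The sketch below assumes that the standard error of the difference is the square root of s1²/n1 + s2²/n2, i.e. the usual unequal-variance form, and that the standard deviations in Table 1 are 8.5 min (placebo) and 8.4 min (verum). Both are assumptions: the formula images and Table 1 are not visible in this text, and the standard deviations are simply the values from the earlier single-sample example, which happen to reproduce the reported t-value. The function name `welch_t` is illustrative.

```python
import math


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """t-statistic for two independent samples with unequal variances.

    Returns the t-value and the standard error of the mean difference,
    with an expected difference of zero under the nil hypothesis.
    """
    se_diff = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    return (mean1 - mean2) / se_diff, se_diff


# Group 1: placebo, group 2: verum. The SDs are assumed values (see above).
t_value, se_diff = welch_t(38.0, 8.5, 30, 33.0, 8.4, 32)
print(round(se_diff, 2), round(t_value, 2))  # 2.15 2.33
```

The observed difference of 5 min is thus about 2.33 standard errors away from the difference of zero expected under the nil hypothesis.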
The number of degrees of freedom is the number of values that can be freely varied without changing the mean values. If, for example, there are three numbers k, l and m and their sum is 100, it is clear that if two of the three numbers are known, the <TextGroup><PlainText>third n</PlainText></TextGroup>umber is automatically given. If k=20 and l=70, m must be 10. With 62 patients in the study, there are n<Subscript>1</Subscript>–1+n<Subscript>2</Subscript>–1=30–1+32–1=60 degrees of freedom. If <TextGroup><PlainText>60 v</PlainText></TextGroup>alues were freely selected, there is no further choice for the last two observations.</Pgraph><Pgraph>With the help of the 60 degrees of freedom, the appropriate distribution can now be displayed under the assumption of the null hypothesis. The formula for the t-distribution is omitted for didactic reasons (it is the ratio of the standard normal variable z and the square root of a chi-square value with n degrees of freedom divided by n). The t-distribution is symmetrical and bell-shaped like the normal distribution (Figure 2 <ImgLink imgNo="2" imgType="figure"/>). </Pgraph><Pgraph>The probability density function (PDF) supplies so-called density values depending on the t-values. In contrast to probabilities, which can only assume values between 0 and 1, densities can also assume values >1. </Pgraph><SubHeadline>Interpretation of the t-value</SubHeadline><Pgraph>A single density value of the PDF has no practical interpretation. The total area under the curve of the PDF is 1 so that (partial) areas under the probability density function have the interpretation of probabilities. In the context of the study, it is now possible to answer the question of how high the probability is that the t-value is ≥+2.33 under the null hypothesis (µ<Subscript>p</Subscript>=µ<Subscript>v</Subscript>), i.e. 
t=0.</Pgraph><Pgraph>The cumulative distribution function (CDF) returns the probability that a t-value is smaller than or equal to a concrete value t<Subscript>k</Subscript>. It is also possible to use the CDF to calculate the probability that t is ≥t<Subscript>k</Subscript> by subtracting the probability for t-values smaller than t<Subscript>k</Subscript> from one. The formula for this function is omitted at this point, but can easily be found on the Internet <TextLink reference="6"></TextLink>. In the case of the sleep study, t<Subscript>k</Subscript>=+2.33. Figure 3 <ImgLink imgNo="3" imgType="figure"/> shows the area under the curve for t≥+2.33 (one-sided view) and the areas under the curve for t≤–2.33 and t≥+2.33 (two-sided view).</Pgraph><Pgraph>The one-sided area is 0.01. This means that the probability that studies under the assumption of the null hypothesis (µ<Subscript>p</Subscript>=µ<Subscript>v</Subscript>) generate a t-value of ≥+2.33 is 1%. On a two-sided basis, the probability that studies assuming the null hypothesis (µ<Subscript>p</Subscript>=µ<Subscript>v</Subscript>) generate a t-value of ≤–2.33 or ≥+2.33 is 2%. The probability of 1% corresponds to the one-sided p-value, while the probability of 2% corresponds to the two-sided p-value.</Pgraph></TextBlock> <TextBlock language="de" linked="yes" name="Wie funktioniert ein statistischer Test – der t-Test als Beispiel"> <MainHeadline>Wie funktioniert ein statistischer Test – der t-Test als Beispiel</MainHeadline><SubHeadline>Zwei-Gruppen-Vergleich</SubHeadline><Pgraph>In einem Beispiel von zwei zufällig ausgewählten Gruppen vergleichen wir die Wirkung eines neuen Schlafmittels auf die Schlaflatenz. Die Verumgruppe umfasst 32 Personen, die Placebogruppe 30 Personen (vgl. Tabelle 1 <ImgLink imgNo="1" imgType="table"/>). In beiden Gruppen wurde die Schlaflatenz nach 7 Tagen Behandlung im Schlaflabor (Polysomnographie) bestimmt. 
Die Nullhypothese ist, dass sich die beiden Gruppen hinsichtlich der Schlaflatenz nicht unterscheiden. Es wurden mehrere Tests für einen solchen Gruppenvergleich vorgeschlagen.</Pgraph><Pgraph>In Tabelle 2 <ImgLink imgNo="2" imgType="table"/> erläutern wir kurz den Permutationstest, der historisch wichtig ist. Der Permutationstest wird heutzutage nur noch selten verwendet, da der Rechenaufwand sehr groß sein kann. In unserem Beispiel gibt es 4,5 mal 10<Superscript>17</Superscript> Permutationen. Daher wäre in unserem Fall der t-Test zu bevorzugen, der als gute Annäherung an den Permutationstest angesehen werden kann und in der biomedizinischen Literatur am beliebtesten ist.</Pgraph><Pgraph>Ein Vergleich der Mittelwerte der beiden Stichproben zeigt, dass die mittlere Schlaflatenz in der Verumgruppe 5 min kleiner ist als in der Placebogruppe. In beiden Gruppen variierte die Schlaflatenz, wie aus den Standardabweichungen ersichtlich ist. Beide Stichproben sind aufgrund von Stichprobenfehlern mit einem Zufallsfehler verbunden.</Pgraph><Pgraph>Die Frage, die sich hier stellt, ist, ob die Differenz von<TextGroup><PlainText> 5 min</PlainText></TextGroup> nur Ausdruck eines zufälligen Fehlers ist oder ob diese Differenz Ausdruck einer tatsächlichen Wirkung des Schlafmittels ist. Im ersten Fall würden beide Stichproben aus identischen Populationen stammen (µ<Subscript>p</Subscript>=µ<Subscript>v</Subscript>), im zweiten Fall würden die beiden Stichproben aus unterschiedlichen Populationen stammen, d.h. aus Populationen mit µ<Subscript>p</Subscript>≠µ<Subscript>v</Subscript>. Abbildung 1 <ImgLink imgNo="1" imgType="figure"/> veranschaulicht das Problem: Könnte es sein, dass sich Placebo und Verum in Bezug auf die wahren Schlaflatenz-Durchschnitte nicht unterscheiden, d.h. aus der gleichen Population mit z.B. 
µ=38 min stammen, und die beiden Stichproben-Durchschnitte (33 min und 38 min) lediglich ein Ausdruck des Stichprobenfehlers sind, ähnlich wie beim Münzwurf einer ungezinkten Münze? Oder könnte es sein, dass das neue Schlafmittel tatsächlich einen Einfluss auf die Schlafla<TextGroup><PlainText>t</PlainText></TextGroup>enz hat, so dass die wahren Mittelwerte aus Zielpopulationen mit unterschiedlichen Mittelwerten stammen (µ<Subscript>p</Subscript>≠µ<Subscript>v</Subscript>)?</Pgraph><SubHeadline>Erwartung der statistischen Variabilität von Studienergebnissen aufgrund eines Zufallsfehlers</SubHeadline><Pgraph>Ein Signifikanztest kann gewisse, wenn auch unvollständige Informationen zu diesen zentralen Fragen liefern. Zur Beantwortung der obigen Fragen muss zunächst das Verhalten der Differenz der Mittelwerte aufgrund des Zufallsfehlers bestimmt werden, wobei angenommen wird, dass eine Nullhypothese H<Subscript>0</Subscript> wahr wäre. Es gibt eine unendliche Menge von Nullhypothesen. In der Medizin hat sich die Nil-Hypothese durchgesetzt, d.h. die Nullhypothese, dass es keinen Zusammenhang zwischen der Behandlungszuweisung (Placebo oder Verum) und der Schlaflatenz gibt (d.h. µ<Subscript>p</Subscript>=µ<Subscript>v</Subscript>). Die griechischen Buchstaben zeigen an, dass sich diese Nullhypothese auf die Zielpopulation bezieht. Unter dieser Hypothese sind Mittelwertunterschiede, die nicht gleich Null sind, ein Ausdruck des Zufallsfehlers. Ähnlich wie extreme Ergebnisse von Experimenten selten beobachtet werden, wenn eine ungezinkte Münze geworfen wird (z.B. 10 Würfe und es erscheint 10-mal Kopf), nimmt die Differenz der Mittelwerte unter der Nullhypothese selten extreme Werte an.</Pgraph><Pgraph>Aber wie viele permutierte Anordnungen von Patienten, die in zwei Gruppen aufgeteilt sind, gibt es und wie würden sich die Unterschiede der Mittel in diesen Arrangements verhalten, wenn die Nullhypothese µ<Subscript>p</Subscript>=µ<Subscript>v</Subscript> wahr wäre? 
Die Schwierigkeit bei der Beantwortung dieser Frage liegt darin, dass das Verhalten der Mittelwertunterschiede unter der Nullhypothese von der Variabilität der Schlaflatenz innerhalb der Stichproben und der Größe der Stichproben abhängt.</Pgraph><Pgraph>Um also vorherzusagen, wie sich die Unterschiede der Mittelwerte verhalten würden, wenn die Nullhypothese wahr wäre, muss man diese beiden Einflussgrößen berücksichtigen. Hier ist eine Art Normalisierung hilfreich, die durch das folgende Beispiel veranschaulicht werden soll. Ein Mittelwertunterschied von 3 Sekunden wird für zwei Gruppen von Marathonläufern (2 Stunden, 3 Minuten, 40 Sekunden gegenüber 2 Stunden, 3 Minuten, <TextGroup><PlainText>43 S</PlainText></TextGroup>ekunden) und für zwei Gruppen von 400-Meter-Läufern (46 Sekunden gegenüber 49 Sekunden) beobachtet. Bei ähnlichen Läufer-Gruppen haben die Unterschiede von 3 Sekunden eine unterschiedliche Bedeutung. Bei Marathonläufern ist der Unterschied im Verhältnis zur durchschnittlichen Gesamtdauer des Laufs sehr gering, während er bei 400-Meter-Läufern relativ groß ist. Das Verhältnis zur durchschnittlichen Laufdauer ist eine Art Normalisierung. Die Wahl des statistischen Tests, der eine solche Normierung gewährleistet, bestimmt, welche Teststatistik gewählt wird. Wenn z.B. der t-Test für unabhängige Stichproben gewählt wird, ist die entsprechende Testvariable die t-Statistik, für den Chi-Quadrat-Test die Chi-Quadrat-Statistik usw. Die Wahl des geeigneten statistischen Tests hängt wiederum von Kriterien ab, die in Tabelle 3 <ImgLink imgNo="3" imgType="table"/> kurz erläutert werden.</Pgraph><Pgraph>Die t-Statistik ist definiert als: </Pgraph><Pgraph><ImgLink imgNo="4" imgType="inlineFigure"/> </Pgraph><Pgraph>Die erwartete Differenz der Mittelwerte in der Formel der t-Statistik ist der unter der Nullhypothese H<Subscript>0</Subscript> angenommene Wert. Im Falle der Nullhypothese µp=µv wird eine Differenz von null Minuten erwartet. 
Dies vereinfacht die t-Statistik:</Pgraph><Pgraph><ImgLink imgNo="5" imgType="inlineFigure"/> </Pgraph><Pgraph>Bei ungleichen Varianzen wird der Standardfehler der Differenz der Mittelwerte nach folgender Formel berechnet:</Pgraph><Pgraph><ImgLink imgNo="6" imgType="inlineFigure"/> </Pgraph><Pgraph>mit</Pgraph><Pgraph>n<Subscript>1</Subscript>: Anzahl von Patienten in Gruppe 1 (Placebo)<LineBreak></LineBreak>n<Subscript>2</Subscript>: Anzahl von Patienten in Gruppe 2 (Verum)<LineBreak></LineBreak><ImgLink imgNo="7" imgType="inlineFigure"/>: Varianz der Schlaflatenz in Gruppe 1<LineBreak></LineBreak><ImgLink imgNo="8" imgType="inlineFigure"/>: Varianz der Schlaflatenz in Gruppe 2</Pgraph><Pgraph>Die Formel ändert sich, wenn die Varianzen gleich sind (Formel nicht dargestellt). Der Standardfehler der Differenz der Mittelwerte hängt von den Varianzen der Variablen (Schlaflatenz) und den Gruppengrößen der zu vergleichenden Gruppen ab. Nach der Bestimmung des Standardfehlers ergibt sich die t-Statistik für zwei unabhängige Stichproben mit ungleichen Varianzen:</Pgraph><Pgraph><ImgLink imgNo="9" imgType="inlineFigure"/> </Pgraph><Pgraph>Unabhängigkeit bedeutet, dass die beiden Patientengruppen voneinander unabhängig sind und auch dass die Patienten innerhalb der Gruppen unabhängig voneinander sind. Die Unabhängigkeit wird beispielsweise verletzt, wenn das Ergebnis eines Patienten statistisch gesehen zu beiden Patientengruppen beitragen würde. Ebenso wird die Unabhängigkeit verletzt, wenn Patienten derselben Gruppe sich gegenseitig in Bezug auf die Ergebnisse von Interesse beeinflussen würden. Die Unabhängigkeit ist auch verletzt, wenn ein Merkmal von einer Gruppe von Patienten im Laufe der Zeit mehrfach erhoben wird (z.B. vor und nach der Behandlung). Die Daten der Schlafstudie haben nun folgenden t-Wert:</Pgraph><Pgraph><ImgLink imgNo="10" imgType="inlineFigure"/> </Pgraph><Pgraph>Der t-Wert für die konkrete Studie beträgt daher +2,33. 
Diese Verteilung kann mit Hilfe der sogenannten Freiheitsgrade (df) bestimmt werden. Die Anzahl der Freiheitsgrade ist die Anzahl der Werte, die ohne Veränderung der Mittelwerte frei variiert werden können. Wenn es z.B. drei Zahlen k, l und m gibt und ihre Summe 100 ist, ist klar, dass, wenn zwei der drei Zahlen bekannt sind, automatisch die dritte Zahl gegeben ist. Wenn k=20 und l=70 ist, muss m 10 sein. Bei 62 Patienten in der Studie hat man n<Subscript>1</Subscript>–1+n<Subscript>2</Subscript>–1=30–1+32–1=60 Freiheitsgrade. Wurden 60 Werte frei gewählt, so hat man für die letzten beiden Beobachtungen keine weitere Wahl.</Pgraph><Pgraph>Mit Hilfe der 60 Freiheitsgrade kann nun die geeignete Verteilung unter der Annahme der Nullhypothese dargestellt werden. Auf die Darstellung der Formel zur Erstellung der t-Verteilung wird aus didaktischen Gründen verzichtet (es ist das Verhältnis der Standard-Normalvariable z und der Quadratwurzel eines Chi-Quadrat-Wertes mit n Freiheitsgraden geteilt durch n). Die t-Verteilung ist symmetrisch und glockenförmig wie die Normalverteilung (Abbildung 2 <ImgLink imgNo="2" imgType="figure"/>).</Pgraph><Pgraph>Die Wahrscheinlichkeitsdichtefunktion (PDF) liefert in Abhängigkeit von den t-Werten sogenannte Dichtewerte. Im Gegensatz zu den Wahrscheinlichkeiten, die nur Werte zwischen 0 und 1 annehmen können, können Dichten auch Werte >1 annehmen.</Pgraph><SubHeadline>Interpretation des t-Wertes</SubHeadline><Pgraph>Ein einziger Dichtewert der PDF hat keine praktische Bedeutung. Die Gesamtfläche unter der Kurve der PDF ist 1, so dass (Teil-)Flächen unter der Wahrscheinlichkeitsdichtefunktion die Interpretation von Wahrscheinlichkeiten haben. Im Rahmen der Studie ist es nun möglich, die Frage zu beantworten, wie hoch die Wahrscheinlichkeit ist, dass der t-Wert einen Wert ≥+2,33 unter der Nullhypothese (µ<Subscript>p</Subscript>=µ<Subscript>v</Subscript>) annimmt, d.h. 
t=0.</Pgraph><Pgraph>Die kumulative Verteilungsfunktion (CDF) liefert die Wahrscheinlichkeit, dass ein t-Wert kleiner oder gleich einem konkreten Wert t<Subscript>k</Subscript> ist. Es ist auch möglich, die CDF zu verwenden, um die Wahrscheinlichkeit zu berechnen, dass t≥t<Subscript>k</Subscript> wird, indem die Wahrscheinlichkeit für t-Werte <t<Subscript>k</Subscript> vom Wert 1 subtrahiert wird. Die Formel für diese Funktion wird an dieser Stelle weggelassen, kann aber im Internet leicht gefunden werden <TextLink reference="6"></TextLink>. Im Fall der Schlafstudie ist t<Subscript>k</Subscript>≥+2,33. Abbildung 3 <ImgLink imgNo="3" imgType="figure"/> zeigt die Fläche unter der Verteilung für t≥+2,33 bei einseitiger Betrachtung und für die Flächen unter der Verteilung für t≤–2,33 und t≥+2,33 bei zweiseitiger Betrachtung.</Pgraph><Pgraph>Der einseitige Bereich hat einen Betrag von 0,01. Das bedeutet, dass die Wahrscheinlichkeit, dass Studien unter der Annahme der Nullhypothese (µ<Subscript>p</Subscript>=µ<Subscript>v</Subscript>) einen t-Wert von ≥+2,33 erzeugen, 1% beträgt. Bei zweiseitiger Betrachtung beträgt die Wahrscheinlichkeit, dass Studien unter der Annahme der Nullhypothese (µ<Subscript>p</Subscript>=µ<Subscript>v</Subscript>) einen t-Wert von ≤–2,33 oder ≥+2,33 erzeugen, 2%. Die Wahrscheinlichkeit von 1% entspricht dem einseitigen p-Wert, während die Wahrscheinlichkeit von 2% dem zweiseitigen p-Wert entspricht.</Pgraph></TextBlock> <TextBlock language="en" linked="yes" name="The p-value – explanation and some caveats"> <MainHeadline>The p-value – explanation and some caveats</MainHeadline><SubHeadline>Interpretation of the p-value</SubHeadline><Pgraph>The p-value thus provides the probability (criterion 1) under a null hypothesis (criterion 2) of finding a result such as the present study result or observing study results that deviate even more from the null hypothesis (criterio<TextGroup><PlainText>n 3</PlainText></TextGroup>). 
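</Pgraph><Pgraph>As a minimal sketch of how the three criteria enter the calculation, the following Python code integrates the density of the t-distribution numerically. It assumes the values reported for the sleep study (t=+2.33, 60 degrees of freedom); the function names t_pdf and upper_tail are illustrative and not part of the original analysis.</Pgraph>

```python
import math


def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log(1 + t * t / df))


def upper_tail(t_obs, df, steps=2000):
    """P(T >= t_obs) under H0, via Simpson's rule on [0, t_obs] (t_obs >= 0)."""
    h = t_obs / steps
    total = t_pdf(0.0, df) + t_pdf(t_obs, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_0_to_t = total * h / 3
    # The distribution is symmetric, so the area above 0 is exactly 0.5.
    return 0.5 - area_0_to_t


# Assumed inputs: observed t-value and degrees of freedom of the sleep study.
p_one_sided = upper_tail(2.33, 60)   # results at least as extreme as observed
p_two_sided = 2 * p_one_sided        # both tails: t at most -2.33 or at least +2.33
print(round(p_one_sided, 2), round(p_two_sided, 2))
```

<Pgraph>With these inputs the tail areas round to 0.01 (one-sided) and 0.02 (two-sided). Both are calculated under the assumption that the null hypothesis is true.</Pgraph><Pgraph>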
All three criteria are necessary for the definition of the p-value. </Pgraph><Pgraph>It is important to note here that the p-value makes a statement about the behavior of a test statistic in the presence of random error, given the null hypothesis. With a one-sided p-value of 0.01, only 1% of studies would generate a t-value of ≥+2.33 if the null hypothesis were true. Thus, the p-value also makes a statement about the outcomes of studies that were not observed (counterfactual element). Furthermore, it must be emphasized that the p-value was calculated under a condition, namely that the null hypothesis H<Subscript>0</Subscript> is true, which is why the p-value is also referred to as a conditional probability. The null hypothesis was merely assumed, regardless of how plausible this hypothesis actually is. </Pgraph><Pgraph>Fisher interpreted the p-value as a continuous measure of evidence against the null hypothesis. He said: “No scientific worker has a fixed level of significance at which from year to year, and in all circumstances he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas” <TextLink reference="7"></TextLink>. This means that, according to Fisher’s school, the classification of a p-value is context-dependent and the application of a fixed threshold of typically 0.05 is not justified. The orthodox rejection of a null hypothesis at a pre-defined threshold of 0.05 comes from the competing school of Neyman and Pearson, who introduced the hypothesis test as a decision-theoretical procedure.</Pgraph><Pgraph>What does a large p-value of e.g. 0.70 mean? Technically speaking, it means that, under the assumption of the null hypothesis, the probability of the observed study outcome or of study outcomes deviating even more from the null hypothesis is 70%. In practice, this means that the significance test provided little evidence against the tested null hypothesis or statistical model. 
However, it does not mean that the null hypothesis is true. The p-value is a function of the strength of effect (e.g. observed mean difference, here 5 min) and the study size (here <TextGroup><PlainText>62 women</PlainText></TextGroup>). A large p-value can occur even when a strong effect is actually present, if the study size was very small. Typical errors in the definition of the p-value are discussed below. </Pgraph><Pgraph>“The p-value is the probability that the null hypothesis is true.” The p-value does not make a statement about the probability that the null hypothesis is true; rather, the p-value was calculated under the assumption that the null hypothesis is true. Incidentally, the reference to even more extreme outcomes of the study (counterfactual element) is missing here. </Pgraph><Pgraph>“The p-value is the probability of type I error.” This statement is incorrect because it mixes principles of the significance test (Fisher) with those of the hypothesis test (Neyman & Pearson). According to the school of Fisher, there is no a priori fixed level of significance (also called type I error). In contrast, according to Neyman & Pearson, the level of significance, called type I error, is fixed before the study starts, whereas the p-value is derived from the statistical model and the study data after the study has been completed. According to Neyman & Pearson, the type I error remains unchanged after the end of the study, and the p-value is compared with the a priori fixed type I error to reach a decision.</Pgraph><Pgraph>The type I error, also called α error, is determined before the beginning of the study according to Neyman and Pearson. At the end of the study, the p-value, which is obtained from the null hypothesis, the statistical model (e.g. t-test) and the study data, is compared with α (most often 0.05). 
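</Pgraph><Pgraph>The difference between an α that is fixed in advance and a p-value that is computed from the data can be sketched with a small simulation in Python. It repeats a study with 30 and 32 patients many times while the null hypothesis is true and counts how often a two-sided test at α=0.05 would reject. The standard deviation of 10 min, the critical value of 2.0 (approximately the two-sided 5% point for 60 degrees of freedom) and all function names are assumptions made for illustration only.</Pgraph>

```python
import math
import random
import statistics


def welch_t(sample_1, sample_2):
    """t-statistic for two independent samples with unequal variances."""
    se = math.sqrt(statistics.variance(sample_1) / len(sample_1)
                   + statistics.variance(sample_2) / len(sample_2))
    return (statistics.mean(sample_1) - statistics.mean(sample_2)) / se


def rejection_rate(n_studies=2000, n1=30, n2=32, mu=38.0, sd=10.0,
                   critical_t=2.0, seed=1):
    """Share of simulated studies with |t| >= critical_t when H0 is true."""
    rng = random.Random(seed)
    rejected = 0
    for _ in range(n_studies):
        # Both groups share the same true mean, so H0 holds by construction.
        placebo = [rng.gauss(mu, sd) for _ in range(n1)]
        verum = [rng.gauss(mu, sd) for _ in range(n2)]
        if abs(welch_t(placebo, verum)) >= critical_t:
            rejected += 1
    return rejected / n_studies


rate = rejection_rate()
print(f"share of rejections under H0: {rate:.3f}")
```

<Pgraph>Under these assumptions the share of rejections should lie close to 0.05. It is a property of the decision procedure over many studies, whereas each simulated study has its own p-value.</Pgraph><Pgraph>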
The statement that “a low p-value excludes chance as an explanation for an observed difference” reveals a gross lack of understanding.</Pgraph><Pgraph>An example of an almost correct-sounding definition of the p-value is: “The p-value is the probability of observing the present study result or even more extreme study results.” In this definition, the central condition (criterion 2) of the p-value is missing: the calculation takes place under the assumption that the null hypothesis is true. The following incorrect definition is also popular: “The p-value is the probability of observing the present study result under the null hypothesis.” Here criterion 3 is missing: the p-value also makes a statement about unobserved study results that deviate even more from the null hypothesis than the present study result. </Pgraph><Pgraph>In the significance test according to Fisher, there is no type I or type II error, no confidence interval, no alternative hypothesis and no concept of statistical power or sample size calculation. 
These concepts originate from Neyman & Pearson and only become relevant when performing hypothesis tests. Hypothesis tests are valid in the decision-theoretical sense only if all of the following steps are adhered to, which is why some authors speak of Neyman-Pearson orthodoxy <TextLink reference="8"></TextLink>:</Pgraph><Pgraph><OrderedList><ListItem level="1" levelPosition="1" numString="1.">Definition of the null and alternative hypothesis before the start of the study.</ListItem><ListItem level="1" levelPosition="2" numString="2.">Determination of type I and type II error before the start of the study.</ListItem><ListItem level="1" levelPosition="3" numString="3.">Determination of the test statistic before the start of the study.</ListItem><ListItem level="1" levelPosition="4" numString="4.">Calculation of the required sample size before the start of the study.</ListItem><ListItem level="1" levelPosition="5" numString="5.">Conduct of the study in compliance with the required sample size.</ListItem><ListItem level="1" levelPosition="6" numString="6.">Calculation of the test statistic and comparison with a critical value of the test statistic or comparison of the p-value with the specified type I error (after the study).</ListItem><ListItem level="1" levelPosition="7" numString="7.">Decision: if p≤α, the null hypothesis is rejected, if p>α, the null hypothesis is not rejected (after the study).</ListItem></OrderedList></Pgraph><Pgraph>If steps 1–7 are not followed, the decision-theoretical procedure of hypothesis testing loses its validity. The decision rule (7<Superscript>th</Superscript> step) must be applied consistently. If, for example, α=0.05 was specified and the study yielded p=0.07, then according to Neyman & Pearson one cannot speak of a “trend towards significance” or the like, but can only state that the null hypothesis was not rejected. Likewise, p-values ≤0.05 are not sub-catego<TextGroup><PlainText>r</PlainText></TextGroup>ized into e.g. 
p≤0.05*, p≤0.01** and p≤0.001*** according to Neyman & Pearson.</Pgraph><SubHeadline>Conditions necessary for the correct interpretation of the p-value</SubHeadline><Pgraph>Many introductory textbooks of biostatistics merely introduce the theory of significance testing. This theory assumes that there are no sources of error other than random error. In the practice of empirical studies, however, this is an unrealistic assumption. Greenland et al. <TextLink reference="9"></TextLink> rightly point out that a low p-value only signals that something may be wrong with the so-called statistical model. The statistical model consists of three components: the chosen test statistic, the chosen null hypothesis and the empirical study data.</Pgraph><Pgraph>In addition to the interpretation that the low p-value represents evidence against the null hypothesis, the following alternative explanations need to be considered, all of which are related to the statistical model and thus influence the p-value:</Pgraph><Pgraph><UnorderedList><ListItem level="1">An unsuitable test statistic was applied.</ListItem><ListItem level="1">Selection bias occurred at study entry or during follow-up of study subjects.</ListItem><ListItem level="1">The comparison between two samples is confounded (mixing of effects).</ListItem><ListItem level="1">There is information bias in the measurement of the variables in the study.</ListItem></UnorderedList></Pgraph><Pgraph>If the p-value is low, we can only conclude that something may be wrong with the statistical model. However, the p-value itself does not show what is wrong with the model. The inexperienced user of the significance test regards a low p-value only as an indication that the null hypothesis might be wrong. 
In addition to the contextual dependence of the meaning of low p-values explained by Fisher, the result of a significance test must always be seen in the light of the complete statistical model.</Pgraph></TextBlock> <TextBlock language="de" linked="yes" name="Der p-Wert – Erläuterung und einige Caveats"> <MainHeadline>Der p-Wert – Erläuterung und einige Caveats</MainHeadline><SubHeadline>Interpretation des p-Wertes</SubHeadline><Pgraph>Der p-Wert gibt somit die Wahrscheinlichkeit (Kriteriu<TextGroup><PlainText>m 1</PlainText></TextGroup>) unter einer Nullhypothese (Kriterium 2) an, ein Ergebnis wie das vorliegende Studienergebnis zu beobachten oder Studienergebnisse zu beobachten, die noch stärker von der Nullhypothese (Kriterium 3) abweichen. Alle drei Kriterien sind notwendige Kriterien für die Definition des p-Wertes.</Pgraph><Pgraph>Wichtig ist hier, dass der p-Wert eine Aussage über das Verhalten einer Teststatistik bei Vorliegen eines zufälligen Fehlers unter der Nullhypothese macht. Bei einem p-Wert von 0,01 würde nur 1% der Studien einen t-Wert von ≥+2,33 erzeugen, wenn die Nullhypothese wahr wäre. Der p-Wert macht also auch eine Aussage über Ergebnisse von Studien, die nicht beobachtet wurden (kontrafaktisches Element). Ferner muss betont werden, dass der p-Wert unter der Bedingung berechnet wurde, dass die Nullhypothese H<Subscript>0</Subscript> wahr wäre, weshalb der p-Wert auch als bedingte Wahrscheinlichkeit bezeichnet wird. Die Nullhypothese wurde lediglich angenommen, unabhängig davon, wie groß der Wahrheitsgehalt dieser Hypothese ist.</Pgraph><Pgraph>Fisher interpretierte den p-Wert als ein kontinuierliches Maß für die Evidenz gegen die Nullhypothese. Er sagte: „Kein Wissenschaftler hat ein festgelegtes Signifikanzniveau, auf dem er von Jahr zu Jahr und unter allen Umständen Hypothesen ablehnt; er macht sich vielmehr zu jedem einzelnen Fall Gedanken im Lichte der Evidenz und seiner Ideen“ <TextLink reference="7"></TextLink>. 
Das bedeutet, dass nach Fishers Schule die Einstufung eines p-Wertes kontextabhängig ist und die Anwendung eines festen Schwellenwertes von typischerweise 0,05 nicht gerechtfertigt ist. Die orthodoxe Ablehnung einer Nullhypothese bei einem vordefinierten Schwellenwert von 0,05 stammt von der konkurrierenden Schule von Neyman und Pearson, die den Hypothesentest als entscheidungstheoretisches Verfahren einführten.</Pgraph><Pgraph>Was bedeutet ein großer p-Wert von z.B. 0,70? Technisch gesehen bedeutet er, dass die Wahrscheinlichkeit 70% beträgt, das beobachtete Studienergebnis oder Studienergebnisse, die noch stärker von der Nullhypothese abweichen, zu beobachten, unter der Annahme die Nullhypothese sei wahr. In der Praxis bedeutet das, dass der Signifikanztest wenig Evidenz gegen die getestete Nullhypothese oder das statistische Modell liefert. Es bedeutet jedoch nicht, dass die Nullhypothese wahr ist. Der p-Wert ist eine Funktion der Stärke des Effekts (z.B. beobachteter Mittelwertunterschied, hier 5 min) und der Studiengröße (hier 62 Frauen). Bei einem großen p-Wert kann tatsächlich ein starker Effekt vorhanden sein, aber die Studiengröße war sehr klein. Typische Fehler bei der Definition von p-Werten werden im Folgenden diskutiert.</Pgraph><Pgraph>„Der p-Wert ist die Wahrscheinlichkeit, dass die Nullhypothese wahr ist.“ Der p-Wert macht keine Aussage über die Wahrscheinlichkeit der Wahrheit der Nullhypothese, jedoch wurde der p-Wert unter der Annahme berechnet, dass die Nullhypothese wahr ist. Übrigens fehlt hier der Hinweis auf noch extremere Ergebnisse der Studie (kontrafaktisches Element).</Pgraph><Pgraph>„Der p-Wert ist die Wahrscheinlichkeit eines Typ-I-Fehlers.“ Diese Aussage ist falsch, weil sie die Prinzipien des Signifikanztests (Fisher) mit denen des Hypothesentests (Neyman & Pearson) vermischt. Nach der Schule von Fisher gibt es kein a priori festgelegtes Signifikanzniveau (auch Typ-I-Fehler genannt). 
Im Gegensatz dazu wird nach Neyman & Pearson das Signifikanzniveau, auch Typ-I-Fehler genannt, vor Beginn der Studie festgelegt, während der p-Wert aus dem statistischen Modell und den Studiendaten nach Durchführung der Studie abgeleitet wird. Nach Neyman & Pearson bleibt der Typ-I-Fehler nach dem Ende der Studie unverändert und der p-Wert wird mit dem a priori festgelegten Typ-I-Fehler verglichen, um eine Entscheidung zu treffen.</Pgraph><Pgraph>Der Typ-I-Fehler, auch α-Fehler genannt, wird nach Neyman und Pearson vor Beginn der Studie bestimmt. Am Ende der Studie wird der p-Wert, der sich aus der Nullhypothese, dem statistischen Modell (z.B. t-Test) und den Studiendaten ergibt, mit dem α (meist 0,05) verglichen. Die Aussage, dass „ein niedriger p-Wert den Zufall als Erklärung für einen beobachteten Unterschied ausschließt“, beweist einen groben Mangel an Verständnis.</Pgraph><Pgraph>Nahezu korrekt klingende Definitionen des p-Wertes sind zum Beispiel: „Der p-Wert ist die Wahrscheinlichkeit, das vorliegende Studienergebnis oder noch extremere Studienergebnisse zu beobachten“. In dieser Definition fehlt die zentrale Bedingung (Kriterium 2) des p-Wertes: Die Berechnung erfolgt unter der Annahme, dass die Nullhypothese zutrifft. Auch die folgende falsche Definition ist beliebt: „Der p-Wert ist die Wahrscheinlichkeit, das vorliegende Studienergebnis unter der Nullhypothese zu beobachten.“ Hier fehlt Kriterium 3: Der p-Wert macht auch eine Aussage über unbeobachtete Studienergebnisse, die noch stärker von der Nullhypothese abweichen als das vorliegende Studienergebnis.</Pgraph><Pgraph>Beim Signifikanztest nach Fisher gibt es keinen so genannten Typ-I-Fehler und Typ-II-Fehler, es gibt kein Konfidenzintervall, keine Alternativhypothese und kein Konzept für statistische Macht (Power) oder Stichprobengrößenberechnungen. 
Diese Phänomene gehen auf Neyman & Pearson zurück und werden erst bei der Durchführung von Hypothesentests relevant, die entscheidungstheoretisch nur dann gültig sind, wenn alle Schritte des Hypothesentestverfahrens eingehalten werden, weshalb die Autoren auch von Neyman-Pearson-Orthodoxie sprechen <TextLink reference="8"></TextLink>:</Pgraph><Pgraph><OrderedList><ListItem level="1" levelPosition="1" numString="1.">Definition der Nullhypothese und Alternativhypothese vor Beginn der Studie</ListItem><ListItem level="1" levelPosition="2" numString="2.">Festlegung des Typ-I-Fehlers und Typ-II-Fehlers vor Beginn der Studie</ListItem><ListItem level="1" levelPosition="3" numString="3.">Festlegung der Teststatistik vor Beginn der Studie</ListItem><ListItem level="1" levelPosition="4" numString="4.">Berechnung der erforderlichen Stichprobengrößen vor Beginn der Studie</ListItem><ListItem level="1" levelPosition="5" numString="5.">Durchführung der Studie unter Einhaltung der erforderlichen Stichprobengrößen</ListItem><ListItem level="1" levelPosition="6" numString="6.">Berechnung der Teststatistik und Vergleich mit dem kritischen Wert der Teststatistik oder Vergleich des p-Wertes mit dem vorab definierten Typ-I-Fehler nach Durchführung der Studie</ListItem><ListItem level="1" levelPosition="7" numString="7.">Entscheidung: Wenn p≤α, wird die Nullhypothese abgelehnt, wenn p>α, wird die Nullhypothese nicht abgelehnt (nach Durchführung der Studie).</ListItem></OrderedList></Pgraph><Pgraph>Wenn die Schritte 1–7 nicht eingehalten werden, verliert das entscheidungstheoretische Verfahren des Hypothe<TextGroup><PlainText>sen</PlainText></TextGroup>testens seine Gültigkeit. Die Entscheidungsregel (<TextGroup><PlainText>7. Schritt</PlainText></TextGroup>) muss konsequent angewendet werden. Wenn z.B. 
α=0,05 angegeben wurde und p=0,07 am Ende der Studie herauskam, dann kann nach Neyman & Pearson nicht gesagt werden, dass es einen „Signifikanztrend“ oder etwas Ähnliches gab, sondern nur, dass die Nullhy<TextGroup><PlainText>p</PlainText></TextGroup>othese nicht abgelehnt wurde. Auch werden p-Werte ≤0,05 nach Neyman & Pearson nicht in z.B. p≤0,05*, p≤0,01** und p≤0,001*** weiter unterteilt.</Pgraph><SubHeadline>Bedingungen, die für die korrekte Interpretation des p-Wertes notwendig sind</SubHeadline><Pgraph>Viele einführende Lehrbücher der Biostatistik führen lediglich die Theorie der Signifikanztests ein. Das bedeutet, dass es außer dem Zufallsfehler keine weiteren Fehlerquellen gibt. In der Praxis der empirischen Studien ist dies jedoch eine unrealistische Annahme. Greenland et al. <TextLink reference="9"></TextLink> weisen zu Recht darauf hin, dass im Falle eines niedrigen p-Wertes nur ein Signal gegeben wird, dass mit dem sogenannten statistischen Modell etwas nicht in Ordnung sein könnte. 
Das statistische Modell besteht aus drei Komponenten: Der gewählten Teststatistik, der gewählten Nullhypothese und den empirischen Studiendaten.</Pgraph><Pgraph>Zusätzlich zu der Hypothese, dass der niedrige p-Wert Evidenz gegen die Nullhypothese darstellt, müssen die folgenden alternativen Erklärungen in Betracht gezogen werden, die alle mit dem statistischen Modell zusammenhängen und somit den p-Wert beeinflussen:</Pgraph><Pgraph><UnorderedList><ListItem level="1">Es wurde eine ungeeignete Teststatistik angewandt.</ListItem><ListItem level="1">Es kam zu einem Selektionsbias in die Studie oder zu einem Selektionsbias bei der Nachbeobachtung der Probanden.</ListItem><ListItem level="1">Der Vergleich zwischen zwei Stichproben ist konfundiert (Vermengung von Effekten).</ListItem><ListItem level="1">Es gibt einen Informationsbias bei der Messung der Variablen in der Studie.</ListItem></UnorderedList></Pgraph><Pgraph>Wenn der p-Wert niedrig ist, können wir nur den Schluss ziehen, dass etwas mit dem statistischen Modell nicht stimmt. Der p-Wert selbst zeigt jedoch nicht, was mit dem Modell nicht stimmt. Der unerfahrene Benutzer des Signifikanztests betrachtet einen niedrigen p-Wert nur als einen Hinweis darauf, dass die Nullhypothese falsch sein könnte. Zusätzlich zu der von Fisher erklärten kontextuellen Abhängigkeit der Bedeutung niedriger p-Werte muss das Ergebnis eines Signifikanztests immer im Licht des vollständigen statistischen Modells gesehen werden.</Pgraph></TextBlock> <TextBlock language="en" linked="yes" name="Summary"> <MainHeadline>Summary</MainHeadline><Pgraph>Fisher’s significance test is a different procedure than the Neyman & Pearson hypothesis test, which is often ignored. 
While the significance test produces a p-value that, according to Fisher, should be interpreted in context as a continuous measure of evidence against the null hypothesis, the hypothesis test uses the p-value as a decision criterion, provided that all necessary steps of the procedure are followed. The definition of the p-value must contain three criteria: it is a probability, it is calculated under the assumption of the null hypothesis, and it includes the counterfactual element of results deviating even more from the null hypothesis. P-values can be small for various reasons, and evidence against the null hypothesis is only one of several competing explanations in empirical studies.</Pgraph></TextBlock> <TextBlock language="de" linked="yes" name="Fazit"> <MainHeadline>Fazit</MainHeadline><Pgraph>Fishers Signifikanztest ist ein anderes Verfahren als der Hypothesentest von Neyman & Pearson, was oft ignoriert wird. Während der Signifikanztest einen p-Wert erzeugt, der nach Fisher kontextabhängig als ein kontinuierliches Maß für die Evidenz gegen die Nullhypothese interpretiert werden sollte, dient der p-Wert als Entscheidungskriterium, wenn die notwendigen Schritte des Hypothesentests befolgt werden. Der Signifikanztest führt zum p-Wert, dessen Definition drei Kriterien enthalten muss: Die Wahrscheinlichkeit, die Verwendung der Nullhypothesen-Annahme und das kontrafaktische Element des p-Wertes. 
P-Werte können aus verschiedenen Gründen klein sein, und die Evidenz gegen die Nullhypothese ist einer von mehreren konkurrierenden Gründen in empirischen Studien.</Pgraph></TextBlock> <TextBlock language="en" linked="yes" name="Notes"> <MainHeadline>Notes</MainHeadline><SubHeadline>Competing interests</SubHeadline><Pgraph>The authors declare that they have no competing interests.</Pgraph></TextBlock> <TextBlock language="de" linked="yes" name="Anmerkungen"> <MainHeadline>Anmerkungen</MainHeadline><SubHeadline>Interessenkonflikte</SubHeadline><Pgraph>Die Autoren erklären, dass sie keine Interessenkonflikte in Zusammenhang mit diesem Artikel haben.</Pgraph></TextBlock> <References linked="yes"> <Reference refNo="1"> <RefAuthor>Gigerenzer G</RefAuthor> <RefAuthor>Swijtink Z</RefAuthor> <RefAuthor>Porter T</RefAuthor> <RefAuthor>Daston L</RefAuthor> <RefAuthor>Beatty J</RefAuthor> <RefAuthor>Krüger L</RefAuthor> <RefTitle></RefTitle> <RefYear>1989</RefYear> <RefBookTitle>The empire of chance. How probability changed science and everyday life</RefBookTitle> <RefPage></RefPage> <RefTotal>Gigerenzer G, Swijtink Z, Porter T, Daston L, Beatty J, Krüger L. The empire of chance. How probability changed science and everyday life. Cambridge: Cambridge University Press; 1989.</RefTotal> </Reference> <Reference refNo="2"> <RefAuthor>Amrhein V</RefAuthor> <RefAuthor>Trafimow D</RefAuthor> <RefAuthor>Greenland S</RefAuthor> <RefTitle>Inferential statistics as descriptive statistics: there is no replication crisis if we don't expect replication</RefTitle> <RefYear>2018</RefYear> <RefJournal>PeerJ Preprints</RefJournal> <RefPage>e26857v4</RefPage> <RefTotal>Amrhein V, Trafimow D, Greenland S. Inferential statistics as descriptive statistics: there is no replication crisis if we don't expect replication. PeerJ Preprints. 2018;6:e26857v4. 
DOI: 10.7287/peerj.preprints.26857v3</RefTotal> <RefLink>https://doi.org/10.7287/peerj.preprints.26857v3</RefLink> </Reference> <Reference refNo="3"> <RefAuthor>Wasserstein RL</RefAuthor> <RefAuthor>Lazar NA</RefAuthor> <RefTitle>The ASA's statement on p-values: context, process, and purpose</RefTitle> <RefYear>2016</RefYear> <RefJournal>Am Stat</RefJournal> <RefPage>129-33</RefPage> <RefTotal>Wasserstein RL, Lazar NA. The ASA's statement on p-values: context, process, and purpose. Am Stat. 2016;70:129-33. DOI: 10.1080/00031305.2016.1154108</RefTotal> <RefLink>https://doi.org/10.1080/00031305.2016.1154108</RefLink> </Reference> <Reference refNo="4"> <RefAuthor>Amrhein V</RefAuthor> <RefAuthor>Greenland S</RefAuthor> <RefAuthor>McShane B</RefAuthor> <RefTitle>Scientists rise up against statistical significance</RefTitle> <RefYear>2019</RefYear> <RefJournal>Nature</RefJournal> <RefPage>305-307</RefPage> <RefTotal>Amrhein V, Greenland S, McShane B. Scientists rise up against statistical significance. Nature. 2019 Mar;567(7748):305-307. DOI: 10.1038/d41586-019-00857-9</RefTotal> <RefLink>https://doi.org/10.1038/d41586-019-00857-9</RefLink> </Reference> <Reference refNo="5"> <RefAuthor>Cox DR</RefAuthor> <RefTitle></RefTitle> <RefYear>2006</RefYear> <RefBookTitle>Principles of statistical inference</RefBookTitle> <RefPage></RefPage> <RefTotal>Cox DR. Principles of statistical inference. Cambridge: Cambridge University Press; 2006. DOI: 10.1017/CBO9780511813559</RefTotal> <RefLink>https://doi.org/10.1017/CBO9780511813559</RefLink> </Reference> <Reference refNo="10"> <RefAuthor>Manly BFJ</RefAuthor> <RefTitle>Randomization</RefTitle> <RefYear>1996</RefYear> <RefBookTitle>Randomization, bootstrap and Monte Carlo methods in biology.</RefBookTitle> <RefPage>3-7</RefPage> <RefTotal>Manly BFJ. Randomization, bootstrap and Monte Carlo methods in biology. London: Chapman & Hall; 1996. Randomization; p. 
3-7.</RefTotal> </Reference> <Reference refNo="11"> <RefAuthor>Feinstein AR</RefAuthor> <RefTitle>Testing stochastic hypotheses</RefTitle> <RefYear>2002</RefYear> <RefBookTitle>Principles of medical statistics.</RefBookTitle> <RefPage>190-1</RefPage> <RefTotal>Feinstein AR. Principles of medical statistics. Boca Raton: Chapman & Hall/CRC; 2002. Testing stochastic hypotheses; p. 190-1.</RefTotal> </Reference> <Reference refNo="6"> <RefAuthor>Anonym</RefAuthor> <RefTitle>Student's t-distribution.</RefTitle> <RefYear></RefYear> <RefBookTitle>Wikipedia</RefBookTitle> <RefPage></RefPage> <RefTotal>Student's t-distribution. In: Wikipedia. [accessed 2019 May 16]. Available from: https://en.wikipedia.org/wiki/Student%27s_t-distribution</RefTotal> <RefLink>https://en.wikipedia.org/wiki/Student%27s_t-distribution</RefLink> </Reference> <Reference refNo="7"> <RefAuthor>Fisher RA</RefAuthor> <RefTitle></RefTitle> <RefYear>1956</RefYear> <RefBookTitle>Statistical methods and scientific inference</RefBookTitle> <RefPage></RefPage> <RefTotal>Fisher RA. Statistical methods and scientific inference. Edinburgh: Oliver & Boyd; 1956.</RefTotal> </Reference> <Reference refNo="8"> <RefAuthor>Oakes MW</RefAuthor> <RefTitle></RefTitle> <RefYear>1986</RefYear> <RefBookTitle>Statistical inference</RefBookTitle> <RefPage></RefPage> <RefTotal>Oakes MW. Statistical inference. Chichester: Wiley; 1986.</RefTotal> </Reference> <Reference refNo="9"> <RefAuthor>Greenland S</RefAuthor> <RefAuthor>Senn SJ</RefAuthor> <RefAuthor>Rothman KJ</RefAuthor> <RefAuthor>Carlin JB</RefAuthor> <RefAuthor>Poole C</RefAuthor> <RefAuthor>Goodman SN</RefAuthor> <RefAuthor>Altman DG</RefAuthor> <RefTitle>Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations</RefTitle> <RefYear>2016</RefYear> <RefJournal>Eur J Epidemiol</RefJournal> <RefPage>337-50</RefPage> <RefTotal>Greenland S, Senn SJ, Rothman KJ, Carlin JB, Poole C, Goodman SN, Altman DG. 
Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. Eur J Epidemiol. 2016 Apr;31(4):337-50. DOI: 10.1007/s10654-016-0149-3</RefTotal> <RefLink>https://doi.org/10.1007/s10654-016-0149-3</RefLink> </Reference> </References> <Media> <Tables> <Table format="png"> <MediaNo>1</MediaNo> <MediaID language="en">1en</MediaID> <MediaID language="de">1de</MediaID> <Caption language="en"><Pgraph><Mark1>Table 1: Results of the study on the new sleep pill to reduce sleep latency</Mark1></Pgraph></Caption> <Caption language="de"><Pgraph><Mark1>Tabelle 1: Ergebnisse der Studie zum Einfluss eines neuen Schlafmedikaments auf die Schlaflatenz</Mark1></Pgraph></Caption> </Table> <Table format="png"> <MediaNo>2</MediaNo> <MediaID language="en">2en</MediaID> <MediaID language="de">2de</MediaID> <Caption language="en"><Pgraph><Mark1>Table 2: Permutation test</Mark1></Pgraph></Caption> <Caption language="de"><Pgraph><Mark1>Tabelle 2: Permutationstest</Mark1></Pgraph></Caption> </Table> <Table format="png"> <MediaNo>3</MediaNo> <MediaID language="en">3en</MediaID> <MediaID language="de">3de</MediaID> <Caption language="en"><Pgraph><Mark1>Table 3: Criteria for test selection</Mark1></Pgraph></Caption> <Caption language="de"><Pgraph><Mark1>Tabelle 3: Kriterien für die Testauswahl</Mark1></Pgraph></Caption> </Table> <NoOfTables>3</NoOfTables> </Tables> <Figures> <Figure format="png" height="495" width="956"> <MediaNo>1</MediaNo> <MediaID language="en">1en</MediaID> <MediaID language="de">1de</MediaID> <Caption language="en"><Pgraph><Mark1>Figure 1: Normal distributions of the sleep latency in the target populations</Mark1></Pgraph></Caption> <Caption language="de"><Pgraph><Mark1>Abbildung 1: Normalverteilungen der Schlaflatenz in der Zielpopulation</Mark1></Pgraph></Caption> </Figure> <Figure format="png" height="454" width="454"> <MediaNo>2</MediaNo> <MediaID language="en">2en</MediaID> <MediaID language="de">2de</MediaID> <Caption 
language="en"><Pgraph><Mark1>Figure 2: t-distribution with 60 degrees of freedom and marked result of the concrete study (t=2.33)</Mark1></Pgraph></Caption> <Caption language="de"><Pgraph><Mark1>Abbildung 2: t-Verteilung mit 60 Freiheitsgraden und markiertes Studienergebnis (t=2,33)</Mark1></Pgraph></Caption> </Figure> <Figure format="png" height="524" width="956"> <MediaNo>3</MediaNo> <MediaID language="en">3en</MediaID> <MediaID language="de">3de</MediaID> <Caption language="en"><Pgraph><Mark1>Figure 3: t-distribution with 60 degrees of freedom with marked areas under the curve for t≥+2.33 and t≤–2.33</Mark1></Pgraph></Caption> <Caption language="de"><Pgraph><Mark1>Abbildung 3: t-Verteilung mit 60 Freiheitsgraden und markierten Flächen unter der Verteilung für t≥+2,33 und t≤–2,33</Mark1></Pgraph></Caption> </Figure> <NoOfPictures>3</NoOfPictures> </Figures> <InlineFigures> <Figure format="png" height="62" width="256"> <MediaNo>2</MediaNo> <MediaID>2</MediaID> <AltText language="en">formula 1</AltText> <AltText language="de">Formel 1</AltText> </Figure> <Figure format="png" height="44" width="73"> <MediaNo>3</MediaNo> <MediaID>3</MediaID> <AltText language="en">formula 2</AltText> <AltText language="de">Formel 2</AltText> </Figure> <Figure format="png" height="41" width="399"> <MediaNo>4</MediaNo> <MediaID language="en">4en</MediaID> <MediaID language="de">4de</MediaID> <AltText language="en">formula 3</AltText> <AltText language="de">Formel 3</AltText> </Figure> <Figure format="png" height="49" width="399"> <MediaNo>5</MediaNo> <MediaID language="en">5en</MediaID> <MediaID language="de">5de</MediaID> <Caption><Pgraph> </Pgraph></Caption> <AltText language="en">formula 4</AltText> <AltText language="de">Formel 4</AltText> </Figure> <Figure format="png" height="64" width="120"> <MediaNo>6</MediaNo> <MediaID>6</MediaID> <AltText language="en">formula 5</AltText> <AltText language="de">Formel 5</AltText> </Figure> <Figure format="png" height="22" width="19"> 
<MediaNo>7</MediaNo> <MediaID>7</MediaID> <AltText language="en">formula 6</AltText> <AltText language="de">Formel 6</AltText> </Figure> <Figure format="png" height="22" width="19"> <MediaNo>8</MediaNo> <MediaID>8</MediaID> <AltText language="en">formula 7</AltText> <AltText language="de">Formel 7</AltText> </Figure> <Figure format="png" height="68" width="118"> <MediaNo>9</MediaNo> <MediaID>9</MediaID> <AltText language="en">formula 8</AltText> <AltText language="de">Formel 8</AltText> </Figure> <Figure format="png" height="68" width="289"> <MediaNo>10</MediaNo> <MediaID language="en">10en</MediaID> <MediaID language="de">10de</MediaID> <AltText language="en">formula 9</AltText> <AltText language="de">Formel 9</AltText> </Figure> <Figure format="png" height="18" width="14"> <MediaNo>1</MediaNo> <MediaID>1</MediaID> <AltText language="en">Formula</AltText> <AltText language="de">Formel</AltText> </Figure> <NoOfPictures>10</NoOfPictures> </InlineFigures> <Attachments> <NoOfAttachments>0</NoOfAttachments> </Attachments> </Media> </OrigData> </GmsArticle>