

EDITORIAL 

Year : 2016  Volume : 7  Issue : 3  Page : 73-80

Three common misuses of P values
Jeehyoung Kim^{1}, Heejung Bang^{2}
^{1} Department of Orthopedic Surgery, Seoul Sacred Heart General Hospital, Seoul, Korea
^{2} Division of Biostatistics, Department of Public Health Sciences, University of California, Davis, USA
Date of Web Publication: 14-Sep-2016
Correspondence Address: Jeehyoung Kim, Department of Orthopedic Surgery, Seoul Sacred Heart General Hospital, Seoul, Korea
Source of Support: None, Conflict of Interest: None
DOI: 10.4103/2155-8213.190481
How to cite this article: Kim J, Bang H. Three common misuses of P values. Dent Hypotheses 2016;7:73-80
One sponsor reported to us that they had been looking at the data as each patient came in and stopped when the P value was < 0.05… Peter A. Lachenbruch ^{[1]}
"Significance" has a specific meaning in science, especially in statistics. The P value as a measure of statistical significance (evidence against a null hypothesis) has long been used in statistical inference and has served as a key player in science and research. Despite its clear mathematical definition and original purpose, and being just one of the many statistical measures/criteria, its role has been overemphasized along with hypothesis testing. Observing and reflecting on this practice, some journals have attempted to ban reporting of P values, and the American Statistical Association (for the first time in its 177-year-old history) released a statement on P values in 2016. In this article, we intend to review the correct definition of the P value as well as its common misuses, in the hope that our article is useful to clinicians and researchers.
What is the P value?
Let us revisit what the P value (probability value) is in English, formula, and graph. We believe that it is wise to learn what it is before what it is not! We also think it is helpful to learn about null (H _{0} ) vs. alternative (H _{a} ) hypotheses, and the two errors associated with these hypotheses. Commonly, we assume H _{0} : the difference is 0 vs. H _{a} : the difference is not 0 (or H _{0} is not true), even though one could take H _{0} to represent a nonzero difference instead. Type I error (denoted by α) is the probability of rejecting H _{0} when H _{0} is true, i.e., false positive rate. Type II error (β) is the probability of not rejecting H _{0} when H _{a} is true (or H _{0} is false), i.e., false negative rate. α is also referred to as "significance level." (Note: Fisher and others before Neyman and Pearson did not use an explicit alternative or talk of error rates. P value can be defined and used without either concept.)
The American Statistical Association (2016) provided this definition of the P value: ^{[2]}
Informally, a P value is the probability under a specified statistical model that a statistical summary of the data would be equal to or more extreme than its observed value.
A more detailed definition is: a P value is the probability of obtaining a test statistic at least as extreme as the statistic observed under the H _{0} (and every other assumption made), where the parenthetical part is important, even though it is often ignored or omitted. For two- and one-sided tests, a mathematical formula and graphical depiction [Figure 1] can be provided as:
P value = P(|test statistic| ≥ what you observed, under H _{0} ).
After setting up the test hypothesis and conducting a test (e.g., computing the P value), we ordinarily compare the P value with a prespecified α, conventionally, 0.05. If the P value is < 0.05, we reject the H _{0} ; if not, we do not reject the H _{0} .
An easy example would be a coin toss. If an unbiased coin is tossed properly many times, we would expect that approximately 50% of the time heads will face up. That is, if we toss 100 times, we will get heads approximately 50 times. However, if we get heads 90% of the time, we would suspect this coin is biased or something strange is occurring. From this specific exercise, we can compute the P value as the probability of heads facing up 90 times (i.e., what we observed) or more under the assumption that the truth is 50%, through a mathematical formula, and mark "x" in [Figure 1].
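The coin-toss P value described above can be computed directly from the binomial distribution. The following is a minimal sketch, not the authors' own calculation; the function name `binomial_upper_tail` is ours, and the two-sided value relies on the symmetry of the fair-coin null.

```python
from math import comb

def binomial_upper_tail(n: int, k: int, p: float = 0.5) -> float:
    """P(X >= k) for X ~ Binomial(n, p), i.e. the one-sided P value."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# 90 or more heads in 100 tosses, assuming the coin is fair (H0: p = 0.5)
p_one_sided = binomial_upper_tail(100, 90)
# Under a fair coin the distribution is symmetric, so "10 or fewer heads"
# is equally extreme and the two-sided P value is twice the one-sided one.
p_two_sided = min(1.0, 2 * p_one_sided)

print(f"one-sided P = {p_one_sided:.2e}")
print(f"two-sided P = {p_two_sided:.2e}")
```

Both values are many orders of magnitude below 0.05, which matches the intuition that 90 heads out of 100 is very hard to reconcile with an unbiased coin.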
Three common misuses
As documented elsewhere, there are many misuses of P value and statistical procedures. ^{[3]} Here, we focus on three common misuses.
Large P value means no difference: Wrong
One property of the P value is that it is a function of the sample size (N) when the H _{0} is not true. Thus, when N is large, the P value is destined to be small; this feature can be a reward, acknowledging how hard it is to collect a large sample, but can cause other problems. Suppose that we observed the identical event rates (as well as risk difference (RD), odds ratio (OR), and risk ratio (RR)) in the 2 × 2 tables in [Figure 2]. When N is doubled, the P value can be meaningfully lower. Thus, statistical significance and acceptance vs. rejection of the H _{0} could be different in these two scenarios.
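The effect of doubling N while holding the event rates fixed can be checked with a Pearson chi-square test on a 2 × 2 table. This is a sketch under stated assumptions: the counts are hypothetical (they are not the counts in Figure 2), no continuity correction is applied, and the helper name `chi2_p_2x2` is ours.

```python
from math import erfc, sqrt

def chi2_p_2x2(a: int, b: int, c: int, d: int) -> float:
    """P value of the Pearson chi-square test (1 df, no continuity
    correction) for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 df, P(chi-square > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2))
    return erfc(sqrt(chi2 / 2))

# Hypothetical data: 30% vs. 20% event rates in both scenarios
p_small = chi2_p_2x2(30, 70, 20, 80)     # 100 per group
p_large = chi2_p_2x2(60, 140, 40, 160)   # 200 per group, same rates

print(f"N = 200: P = {p_small:.3f}")
print(f"N = 400: P = {p_large:.3f}")
```

With these illustrative counts the RD, OR, and RR are identical in both tables, yet only the larger table falls below the 0.05 threshold.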
If the P value is above the prespecified threshold (e.g., 0.05), we normally conclude that the H _{0} is not rejected. However, it does not mean that the H _{0} is true. The safer interpretation is that there is insufficient evidence to reject the H _{0} . Similarly, the "not H _{0}" could suggest that something is wrong with the H _{0} , not necessarily that the H _{a} is right. It could be related to the assumptions of normality and independence, etc., which are often (unstated) parts of the H _{0} . A famous aphorism here is, Absence of evidence is not evidence of absence.
Then, how to demonstrate "equality" better? A more appropriate way is (bio)equivalence hypothesis testing, which is a norm in clinical trials (e.g., generic drug). Design and analysis should go hand in hand, whenever possible.
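One common form of equivalence testing is the two one-sided tests (TOST) procedure. The sketch below assumes a normally distributed estimate with a known standard error and an equivalence margin chosen in advance; the numbers and the function names `two_sided_p` and `tost_p` are illustrative assumptions, not taken from the article.

```python
from statistics import NormalDist

_STD = NormalDist()  # standard normal distribution

def two_sided_p(diff: float, se: float) -> float:
    """Ordinary two-sided P value for H0: true difference = 0."""
    return 2 * (1 - _STD.cdf(abs(diff / se)))

def tost_p(diff: float, se: float, margin: float) -> float:
    """TOST P value for H0: |true difference| >= margin.
    Equivalence is claimed only if BOTH one-sided tests reject,
    so the overall P value is the larger of the two."""
    p_lower = 1 - _STD.cdf((diff + margin) / se)  # H0: difference <= -margin
    p_upper = _STD.cdf((diff - margin) / se)      # H0: difference >= +margin
    return max(p_lower, p_upper)

# Same observed difference (0.5) and margin (3), different precision
for label, se in [("precise study", 1.0), ("imprecise study", 3.0)]:
    print(f"{label}: difference test P = {two_sided_p(0.5, se):.3f}, "
          f"equivalence test P = {tost_p(0.5, se, 3.0):.4f}")
```

In both hypothetical studies the ordinary test gives a large P value, but only the precise study supports equivalence within the margin. A large P value on its own therefore does not separate "no difference" from "not enough data."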
Multiple testing and 0.05
Let us assume that Thomas Edison, the Wright brothers, or you tried the same novel experiment 1000 times and finally succeeded. It is definitely a triumph. However, one success out of 1 attempt and one success out of 50 attempts carry different meanings in terms of probability and statistics. For example, if you made a basket for the first time at your 50 ^{th} basketball throw, it is important to disclose this. If you do not report the 49 failures, intentionally or unintentionally, one can misunderstand your performance. Similarly, if you try 50 different sports or conduct 50 laboratory experiments, a similar logic could apply. It is called the "multiple testing" ("multiple comparison" or "multiplicity") issue in statistics, with direct implications on α and the P value. A simple and general (not perfect) rule-of-thumb solution is the Bonferroni adjustment, which is to use α = 0.05/5 = 0.01 for 5 (independent) tests as a new threshold, or equivalently, inflate/adjust the observed P value by multiplying by 5. One problem with this adjustment is that it can drastically lower the chance of detecting a real difference (the power of the test) if indeed there is one. As readers would be aware, there are a variety of methods available for application in different contexts, each with slightly different properties, but with the same fundamental goal (e.g., Tukey's post-hoc; resampling version for correlated data; O'Brien-Fleming for interim analyses; and empirical Bayes methods).
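The arithmetic behind the Bonferroni adjustment can be sketched as follows. The family-wise error formula assumes independent tests with every H _{0} true, the five P values are made up for illustration, and the function names are ours.

```python
def familywise_error_rate(alpha: float, m: int) -> float:
    """Chance of at least one false positive among m independent tests
    when every null hypothesis is true and each test uses level alpha."""
    return 1 - (1 - alpha) ** m

def bonferroni_adjust(p_values: list[float]) -> list[float]:
    """Multiply each P value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

for m in (1, 5, 50):
    print(f"{m:>2} tests at alpha = 0.05: "
          f"P(at least one false positive) = {familywise_error_rate(0.05, m):.3f}")

# Hypothetical unadjusted P values from 5 tests
raw = [0.004, 0.03, 0.04, 0.20, 0.60]
adjusted = bonferroni_adjust(raw)
for p, q in zip(raw, adjusted):
    verdict = "significant" if q < 0.05 else "not significant"
    print(f"unadjusted {p:.3f} -> adjusted {q:.3f} ({verdict})")
```

Comparing the adjusted P values with 0.05 is equivalent to comparing the unadjusted ones with 0.05/5 = 0.01. Here three results pass the unadjusted threshold but only one survives the adjustment, which illustrates the loss of power mentioned above.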
The multiple testing issue often can take different forms in real life, sometimes in hidden and less clear manners, so that we do not realize we are running into multiple testing issues where a P value adjustment (or at least some consideration thereof) might be warranted. For example, a study with multiple outcomes or multiple treatments/groups is quite clear. On the other hand, interim analyses/looks, subgroup analyses, multiple modeling, different categorizations of a variable (e.g., quartiles vs. quintiles), or searching for an optimal cutpoint (e.g., 3 or 4 cups of coffee) could be more or less subject to multiple testing issues.
The underlying mechanism of multiple testing may be well described as, "No Free Lunch," "Fooled by Randomness," "Forgone conclusion," "File drawer problem," "Leaving no trace," and "If you torture the data enough, nature will always confess." Acceptable solutions are (1) to designate a single primary hypothesis (and outcome/parameter/method), while all others are secondary/sensitivity/confirmatory; (2) to reveal all analyses performed (under a given aim in one publication); (3) to present unadjusted vs. adjusted P values side-by-side; or (4) a P value plot (when a number of P values are computed). [Table 1] and [Figure 3] are examples of (3) and (4). ^{[4],[5],[6]} Here, the underlying mathematical mechanism of the P value plot is that P values are uniformly distributed between 0 and 1 when the H _{0} is true (and all other assumptions are met, regardless of N!).
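The uniform distribution of P values under a true H _{0} can be checked by simulation. This is a minimal sketch, assuming a two-sided z-test with known standard deviation on normally distributed data; the seed, sample size, and number of simulated studies are arbitrary choices of ours.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def z_test_p(sample: list[float], mu0: float = 0.0, sigma: float = 1.0) -> float:
    """Two-sided z-test P value for H0: mean = mu0, with known sigma."""
    z = (fmean(sample) - mu0) / (sigma / sqrt(len(sample)))
    return 2 * (1 - NormalDist().cdf(abs(z)))

rng = random.Random(2016)          # fixed seed so the run is repeatable
n_studies, n_per_study = 2000, 30

# Every simulated study is generated with H0 true (mean 0, SD 1)
p_values = sorted(
    z_test_p([rng.gauss(0.0, 1.0) for _ in range(n_per_study)])
    for _ in range(n_studies)
)

share_below = sum(p < 0.05 for p in p_values) / n_studies
# In a P value plot, sorted P values under H0 should follow a straight line;
# max_gap is the largest distance from that line.
max_gap = max(abs(p - (i + 0.5) / n_studies) for i, p in enumerate(p_values))

print(f"share of P values below 0.05: {share_below:.3f}")
print(f"largest deviation from the uniform line: {max_gap:.3f}")
```

About 5% of the simulated P values fall below 0.05 even though no effect exists, and the sorted P values stay close to the diagonal. That is the pattern a P value plot is expected to show when all null hypotheses are true.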
A frequent and reasonable question from clinicians is, "Data do not change at all with or without multiple testing adjustment. Why should we care? Why don't you like presenting preliminary data for abstract submission?" Here are our answers: Multiple testing is more about "intention" and the future likelihood of "replicability/reproducibility" of the observed finding, rather than truth. Another way to view it is: the data are the same, ordered or not. Suppose you compute 100 P values, then order them from the smallest to the largest (as in the P value plot). The smallest P value is the 1 ^{st} order statistic. It is not a single random P value. Suppose that you rank-ordered the students in a school from the shortest to the tallest. The shortest student is not representative of the students. In the case of the P value, there are ways to adjust for the fact that you are looking at the smallest P value.
Figure 3: P value plots; Left: 200 P values from t-tests for randomly generated data under H_{0}; Right: 264 P values from t-tests for real nutritional data (Young et al. 2009). Permission to reproduce this figure was granted by the publisher, the Royal Society
Table 1: Numbers of Persons with Onset of Fatal or Nonfatal Cancer in the SEAS Trial and in SHARP and IMPROVE-IT
Interestingly, however, even leading statisticians do not fully agree concerning whether to adjust and how to adjust; thus, if you are against adjustment, you are not alone. Moreover, exploratory nature and serendipity in scientific discovery and advancement should never be undervalued. Indeed, some even do not think that the multiple testing problem really exists, asking how one can report thousands, millions, or billions of results? What we really have is a "selective reporting problem." ^{[7]} Most would agree that one out of one vs. one out of 50 can be interpreted differently, and readers/judges have a right to know this along with other details (e.g. method used). If authors/investigators are honest or willing, it is not difficult to do.
Another common practice we see is a disproportionate focus on "false positive," compared to "false negative." Sometimes, the consequences of false negatives can be much greater than those of false positives, and the importance of α vs. β should be carefully considered and context-based, rather than handling/deciding them mechanically. ^{[8]}
Back to the original example, why do we not talk about multiple testing issues relative to Edison and the Wrights? Possibly because their experiments and successes would be replicated at the 1001 ^{st} trial and beyond. Even if we applied multiple testing adjustment for their first 1000 trials (e.g., pilot) and they did not pass P < 0.05, a "new" rigorously designed protocol and experiment, including a priori hypothesis and N/power calculation, would easily pass P < 0.001!
Smaller P value is more significant? Not necessarily
We have discussed the well-known "large N → small P" phenomenon. Below we illustrate that "smaller P value, smaller effect" can happen when Ns are different; see [Figure 4]. Another philosophical question may be: Which more strongly supports the effect, "a large effect size from a small sample" or "a small effect size from a large sample"? The answer can vary and may not be straightforward; yet, we are easily convinced that "sole reliance on P value" can be problematic.
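A numerical illustration of "smaller P value, smaller effect" is sketched here. The two studies are hypothetical (these are not the data in Figure 4), and a two-sample z-test with a known, common standard deviation is assumed for simplicity; the function name is ours.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z_p(mean_diff: float, sd: float, n_per_group: int) -> float:
    """Two-sided P value for a difference in means between two equal-sized
    groups, assuming a known common standard deviation."""
    se = sd * sqrt(2 / n_per_group)
    return 2 * (1 - NormalDist().cdf(abs(mean_diff / se)))

# Hypothetical study A: large effect (0.8 SD), small sample
p_small_study = two_sample_z_p(mean_diff=0.8, sd=1.0, n_per_group=20)
# Hypothetical study B: small effect (0.1 SD), large sample
p_large_study = two_sample_z_p(mean_diff=0.1, sd=1.0, n_per_group=2000)

print(f"study A (effect 0.8 SD, N = 20 per group):   P = {p_small_study:.4f}")
print(f"study B (effect 0.1 SD, N = 2000 per group): P = {p_large_study:.4f}")
```

Study B reports the smaller P value even though its effect is one eighth the size of that in study A, so ranking findings by P value alone would rank them in the opposite order from their effect sizes.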
We assert that the estimate (point and interval) and the P value can be complementary, each with advantages and disadvantages; the former better addresses clinical or practical significance and the latter addresses statistical significance, where clinical significance is a more important goal even though it is not an easy concept or task.
Additional issues   
Notable companion of the P value: confidence interval
There are two ways to view a statistical hypothesis test: one is through a P value (of the test) and the other is through a CI (of a parameter). Many busy clinicians use a simple rule, "If P < 0.05, or the CI does not cover the null value, H _{0} is rejected." in practice. The P value and CI are complementary while attempting to do the same/similar thing, where the P value quantifies how "significant" the association/difference is, while the CI quantifies how "precise" the estimation is and what the plausible values are.
Ironically, however, another dominating measure in statistics, CI, does not have an easy definition. Perhaps, the shortest interpretation and definition of a 95% CI is, (a, b) is the set of all values with P > 0.05 under the data-generation model. A more detailed definition may be, a 95% CI for a parameter (e.g. mean or OR) has the property that for many independent replications of the same experiment, approximately 95% of the CIs contain the true parameter. Here, the parameter is fixed and intervals are random! Thus (unfortunately), we need 1000 experiments under identical conditions in our brain when we try to understand CI properly. The following definition and its variants appear in top medical journals and editorials, often written by (bio)statisticians, "With 95% confidence, the population mean will lie in this interval." This may be justified only in the sense of, "Perfect is the enemy of good."
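The "1000 experiments under identical conditions" can be run on a computer instead of in our brain. The following sketch assumes normally distributed measurements with a known standard deviation, so that a simple z-interval applies; the true mean, SD, sample size, and seed are arbitrary illustrative values.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

TRUE_MEAN, SIGMA = 100.0, 15.0   # the fixed (normally unknown) truth
N, EXPERIMENTS = 25, 1000
z = NormalDist().inv_cdf(0.975)  # about 1.96 for a 95% interval
margin = z * SIGMA / sqrt(N)     # margin of error, identical in every experiment

rng = random.Random(7)           # fixed seed so the run is repeatable
covered = 0
for _ in range(EXPERIMENTS):
    sample = [rng.gauss(TRUE_MEAN, SIGMA) for _ in range(N)]
    estimate = fmean(sample)
    # The interval moves from experiment to experiment; the truth does not.
    if estimate - margin < TRUE_MEAN < estimate + margin:
        covered += 1

coverage = covered / EXPERIMENTS
print(f"{covered} of {EXPERIMENTS} intervals contain the true mean "
      f"(coverage = {coverage:.3f})")
```

The coverage comes out close to 95%, but each individual interval either contains the true mean or does not. The 95% is a property of the procedure, not of any single interval.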
The point estimate plus or minus its "margin of error" is a CI for the parameter of interest, where the margin is determined by the variability of the point estimate, quantified by the standard error, which decreases when N increases. Here, it is critical to know that the standard error (and the margin of error, CI, and P value as well) accounts for "random sampling error" only, not for other errors and numerous biases from other sources, including poorly worded questions, false answers, wrong/misspecified model, and flawed/inadequate design in survey or experiment.
In the current literature (e.g. Big Data or meta-analysis), you may find something like OR = 3.11 [95% CI: 3.10-3.12], an extremely narrow CI. No one would believe the truth is really inside! Imagining hypothetical 1000 experiments and the margin of error in your mind, you would not be surprised by this interval, and you would also naturally understand the potential limitations of CI. With the pros and cons of each method, reporting all 3 (the point estimate, 95% CI, and P value) would be advisable.
Another common practice in the use of CI and P value is that readers often check whether two CIs overlap to judge statistical significance. A rule of thumb is that non-overlapping CIs imply significant difference, but not the reverse; the two CIs may overlap and yet be significantly different as long as each CI does not contain both point estimates. More on this topic can be found in a reference. ^{[9]}
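A small numerical case of overlapping CIs with a significant difference is sketched below. The two estimates and their standard errors are hypothetical, and independent, normally distributed estimates are assumed; the helper name `ci95` is ours.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()
z = STD.inv_cdf(0.975)  # about 1.96

def ci95(estimate: float, se: float) -> tuple[float, float]:
    """95% CI as the estimate plus or minus its margin of error."""
    return (estimate - z * se, estimate + z * se)

# Hypothetical independent estimates from two groups
est_a, se_a = 0.0, 0.3
est_b, se_b = 1.0, 0.3

ci_a, ci_b = ci95(est_a, se_a), ci95(est_b, se_b)
intervals_overlap = ci_a[1] > ci_b[0] and ci_b[1] > ci_a[0]

# The test of the difference uses the SE of the difference,
# sqrt(se_a^2 + se_b^2), which is smaller than se_a + se_b.
se_diff = sqrt(se_a**2 + se_b**2)
p_diff = 2 * (1 - STD.cdf(abs(est_b - est_a) / se_diff))

print(f"CI A = ({ci_a[0]:.2f}, {ci_a[1]:.2f}), CI B = ({ci_b[0]:.2f}, {ci_b[1]:.2f})")
print(f"intervals overlap: {intervals_overlap}, P for the difference = {p_diff:.3f}")
```

The intervals overlap, yet the difference is significant at the 0.05 level. Checking overlap implicitly adds the two margins of error, whereas the standard error of a difference combines them as a square root of summed squares.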
Note : Common technical mistakes in CIs are: (1) we want to attach the probability statement about CI. However, strictly speaking, we should not say, "95% likely or probable," which is like saying, "95% chance of rain yesterday;" and (2) we tend to assume interval is fixed and truth is random or fixed. How to interpret Prob (89 < true blood pressure < 122)=0.95? This probability is 0 or 1. ^{[10]} In the classical, Frequentist approach the randomness comes from the repetition of experiments, whereas in the Bayesian approach, the randomness comes from uncertainty regarding the value of the parameter, which could be more appealing and pertinent. Bayesian interval is often called, "credible interval." For rigorous definitions, properties and fallacies of CIs, see references. ^{[3],[11],[12]}
Reproducibility of the P value
Today, the reproducibility of scientific findings under the Responsible Conduct of Research has become a component of various training programs because irreproducible/nonreplicable findings are unacceptably common (e.g., Random Medical News) in the competitive research arena. Actually, this is old news. ^{[13],[14],[15],[16],[17]} Sadly, but unsurprisingly, statistics (and the P value) is a big player there. In contrast to other statistical estimates, the P value's sample-to-sample variability is not fully appreciated. ^{[18]} Related to reproducibility for future replicate P values, it has been shown that P values exhibit surprisingly large variability in typical data, and some call for lower P value thresholds such as 0.005 or 0.001 (although randomized controlled trial (RCT) and laboratory science communities may be upset!). ^{[19]}
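The size of this sample-to-sample variability can be illustrated with a short calculation. The sketch assumes a two-sided z-test whose statistic is normally distributed with unit variance, and a study designed with exactly 80% power at α = 0.05; it is our own illustration rather than a reproduction of the cited analyses.

```python
from statistics import NormalDist

STD = NormalDist()

def two_sided_p(z_stat: float) -> float:
    """Two-sided P value for an observed z statistic."""
    return 2 * (1 - STD.cdf(abs(z_stat)))

# Expected z statistic that gives (approximately) 80% power at alpha = 0.05
delta = STD.inv_cdf(0.975) + STD.inv_cdf(0.80)

# Across exact replications, the z statistic varies as Normal(delta, 1).
# Its 10th and 90th percentiles bound the middle 80% of replicate results.
z_low = delta + STD.inv_cdf(0.10)
z_high = delta + STD.inv_cdf(0.90)
p_unlucky, p_lucky = two_sided_p(z_low), two_sided_p(z_high)

print(f"P value at the expected z statistic: {two_sided_p(delta):.4f}")
print(f"middle 80% of replicate P values: {p_lucky:.1e} to {p_unlucky:.2f}")
```

Under these assumptions, identical replications of a well-powered study with a real effect produce P values spanning more than three orders of magnitude, so a single observed P value is a noisy guide to what a replication will show.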
In addition, by definition P value depends on effect size (e.g., observed difference and variability). In turn, observed difference and variability also depend on study design, sample selection, measurement, and method, among others. In an extreme scenario, if the two comparison groups do not overlap (e.g., cases vs. controls), we can get an impressively low P value and perfect discrimination (AUC =1). This may indicate flawed design, such that the resulting P values or comparison per se can be misleading or meaningless. If we use a more suitable design and sample, even for the exactly same comparison, the previous small P value would not be reproduced. In some sense, "too good to be true" statistics (e.g., AUC ≈1, P < 0.0001) is a blessing by effectively serving as an alarm to investigators as well as reviewers/editors, e.g. "Do not publish findings yet; more checking is needed." You may want to check out the impressive P values in the famous "vitamin C and terminal cancers" and "vaccines and autism" papers in history. ^{[20],[21]} If you were a reviewer and saw these P values (and AUC ≈1), what would you say even if you don't know underlying science well?
A more common scenario is when testing the same hypothesis in different populations (e.g., low vs. high risk groups; American vs. Asian), the observed P values can be vastly different even with the same N, which is natural due to the different effect sizes expected. Statistical inference is generally based on "hypothetical" experiments (e.g., randomized, independent, sampling bias only) and mathematical formulations; by comparison, real-world settings can be much more complex. Thus, limited reproducibility in P values and varied performance of any model (e.g., prediction) in different settings/contexts are to be anticipated.
Related to reproducibility concerns and countless biases in practice, some people focus on large effect sizes (e.g. OR >2, in addition to or in place of P < 0.05). Yet, we should not ignore small but real effects or rare cases/events, which may be potentially translated to large total (or cumulative) effect or expense at the population or society level. The P value and CI cannot answer the meaningfulness and clinical or public health significance of "losing 100 grams" and "living 3 days longer after cancer screening," which should be judged together with societal and individual perspectives and values (and possibly cost-effectiveness).
A possible solution for the irreproducibility crisis may be: In God we trust; all others must bring data (protocol and SAS output).
Large P with small N; post-hoc power to blame?
Let us imagine a common situation. When we finished data collection and analysis, we got P = 0.2 for the primary hypothesis test. Naturally, we are disappointed after long and hard work and tempted to find reasons, including anyone to blame. We guess low N (e.g., budget, boss's recommendation, wrong assumptions used in N/power calculation, etc.) and low observed power may be good victims for the post-hoc blame game because we already know "larger N → smaller P" when H _{0} is false. There are debates regarding whether or not to compute post-hoc power. Recall power = 1 − β = 1 − Type II error, but post-hoc or observed power is not 1 − β (say, 80%)! It is the same fallacy as for misinterpreting the CI: After the data are in, the CI either does or does not contain the true value (1 or 0, not 95%). In the same way, the Type II error after the analysis is either 1 or 0.
Somewhat depressingly, this is another controversial topic in statistics because widely used biostatistics textbooks and some instructors teach how to calculate post-hoc power. So again, you are not alone. Some recommend: once a study is over, we should focus on precision, rather than power, noting that, for any test, the observed power is a 1:1 function of the P value. ^{[22]} Even if post-hoc power could be useless once one sees the P value and CI, the results of a study can and should be used to design subsequent studies because those results provide information about the crucial parameters used to estimate the N for subsequent studies (such as the size of the effect and the exposure or disease frequency one should expect). The CONSORT 2010 (item 17) also stated, there is little merit in a post-hoc calculation of statistical power using the results of a trial; the power is then appropriately indicated by CIs. (Remark: power should be precision here.)
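The 1:1 relationship between observed power and the P value can be made explicit for a simple case. The sketch assumes a two-sided z-test and defines observed power in the usual way, by plugging the observed effect in as if it were the true effect; the function name `observed_power` is ours.

```python
from statistics import NormalDist

STD = NormalDist()

def observed_power(p_value: float, alpha: float = 0.05) -> float:
    """Post-hoc ("observed") power of a two-sided z-test, computed from
    the P value alone by treating the observed z as the true effect."""
    z_obs = STD.inv_cdf(1 - p_value / 2)   # |z| implied by the P value
    z_crit = STD.inv_cdf(1 - alpha / 2)    # rejection threshold
    return STD.cdf(z_obs - z_crit) + STD.cdf(-z_obs - z_crit)

for p in (0.001, 0.01, 0.05, 0.2, 0.5):
    print(f"P = {p:<5} -> observed power = {observed_power(p):.2f}")
```

Nothing but the P value enters the calculation, so observed power adds no information. Under these assumptions P = 0.05 always corresponds to an observed power of about 50%, and any non-significant result necessarily shows "low power."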
Final decision always binary (i.e., P < vs. > 0.05)?
A binary decision may be needed at courts, in sports, or on a job application; however, should scientific decision making always be two regions? We hope not! There are trinary decisions and hypotheses: accept (P < 0.05); reject (P > 0.2); and get more data (0.05< P < 0.2). Indeed, some understand this as Fisher's original suggestion for using P values. Adoption of trinary decisions instead of the current paradigm, e.g., "The earth is round (P < 0.05)." might be ideal in practice. ^{[23]}
Final thoughts and some recommendations   
There are few numbers/measures/tools as common and as controversial as the P value in science and research (e.g., P hacking, P value chasing, fickle P value, the cult of statistical significance). This one measure is often believed to govern our career (e.g., funding, publication) and the destinies of pharmaceutical companies. We view the main reasons for the P value's popularity and charm as being: an easily computed number, a probability (between 0 and 1) with an interpretation of statistical significance, with high generality and universality, and solid mathematical foundations. Therefore, the P value must be doing what it is supposed to do well; it is more likely that we misuse/abuse it for what it is not expected to do. If a tool has long been used by many, there are always reasons, generally more good than bad (but not necessarily so, as the history of practices like bloodletting reveals).
Its popularity and dominance in medicine is noteworthy in part because simple and fast decision making (e.g. does this treatment work? is it best for this patient?) is needed on a daily basis. Regardless of its limitations, we expect the popularity of the P value to continue. Even in the face of criticism, cynicism, or even banning of P values, abandoning baseball statistics and the car because they are imperfect or throwing the baby out with the bathwater would be counterproductive or unwise. ^{[24],[25]} Dennis Lindley, a leading Bayesian statistician in history, did not believe in significance testing, but he taught it at Cambridge! ^{[26]} Better and feasible guidance on use and interpretation is of more use than outright rejection, which is virtually true for many other statistics, e.g. OR vs. RR, absolute vs. relative, kappa, and so on.
A recent survey of 1576 researchers by Nature picked "Selective reporting" as #1 factor of irreproducible research and "Better understanding of statistics" as #1 solution. ^{[7]} We want to conclude our article with two suggestions for practitioners.
Better design, better data, better P value
Design trumps analysis and experimentation trumps observation in scientific research, and "Garbage in, Garbage out" is so relevant to statistical analyses. ^{[27],[28]} It is nearly impossible to expect valid statistical analysis (including P values) from poor quality design and/or data. Fisher said, "To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of." More energy and emphasis should be given to more basic and fundamental components and steps: study design, measurement, data collection, blinding (if relevant), and protocol, including data analysis plan to help minimize retrospective rationalization and fishing expeditions. All of these are important prerequisites to valid P values.
It is essential to remember that design can address causality, whereas P value can only address numerical correlation/association in a given model/setting. In that sense, the old terms of "effect" and "effect size" (which the authors of this paper also used! And standard error is error, really?) could be the source of many wrongdoings. Also, similar to the Almighty P value, the RCT is extremely beloved in comparative and evaluative research, namely, Trialism. ^{[29],[30]} Yet, it is crucial to understand the best gift that the RCT offers is average causal effect (of "intention," again). Sick populations and sick individuals are not the same thing. ^{[31]} As Stephen Jay Gould said from his own long experience with cancer, perhaps "The median isn't the message." ^{[32]} Yet, if the median is accompanied by good intervals (e.g., 95% CI, interquartile range, min-max), it could carry useful messages, even for others than the Average Man. ^{[33]}
Beyond P value and validation: Toward total evidence
No single measure/method is perfect. Notwithstanding the pros and cons of P value and CI, good scientists would look for both, and possibly more. Commonly utilized measures in biomedicine include effect size (e.g., RD/OR/RR); point and interval estimates; statistical significance (P value); discrimination (AUC); model determination (R ^{2}); correlation (Pearson, Spearman); model quality (AIC/BIC), etc. For instance, it is not rare to encounter "small P and small R ^{2}," "more predictors but lower AUC," or "P < 0.0001 in one study and >0.05 in another;" these look ostensibly contradictory, but if we understand each method's capability, we would not be surprised, even if both studies were well done.
Furthermore, we should always seek nonstatistical evidence, such as a theory, scientific explanations, qualitative evidence, and evidence from basic science (e.g., N = 3) whenever available. Theory serves as a base for thinking and it helps us to understand what is really going on. Data have no scientific (or empirical) meaning without theory. As Charles Darwin said, without speculation there is no good and original observation. If you study the association of the number of refrigerators and the crime rate, the finding may be validated easily in different cities. Also, in-hospital mortality has been shown to be inversely related to the number of cardiovascular risk factors. ^{[34]} But for both the cases, we can look for possible explanations of such an association.
Through this editorial, we hope to help readers better understand and use the P value. At the end of the day, we should wait for total evidence through (sensible) validation to lead us closer to an ultimate answer for a given setting, although it takes time and resources. If we use P values correctly and wisely, we can shorten the time of this journey and save the resources. Maybe now is the time to move over "Publish or Perish" to "Validate or Vanish."
Acknowledgment
We thank Drs. Sander Greenland, Robert Elston, Stan Young, Dmitri Zaykin, and Ross Gibson for reading our draft, providing valuable comments and criticisms, and/or educating the authors. The authors thank Ms. Caron Modeas for editing service.
Financial support and sponsorship
H. Bang was partly supported by the National Institutes of Health through grants UL1 TR000002 and P50 AR063043.
Conflicts of interest
There are no conflicts of interest.
References   
1.  
2.  Wasserstein R, Lazar N. The ASA's statement on P values: Context, process, and purpose. Am Stat 2016.
3.  Greenland S, Senn SJ, Rothman KJ, Carlin JB, Poole C, Goodman SN, et al. Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations. Eur J Epidemiol 2016;31:337-50.
4.  Schweder T, Spjøtvoll E. Plots of P values to evaluate many tests simultaneously. Biometrika 1982;69:493-502.
5.  Young SS, Bang H, Oktay K. Cereal-induced gender selection? Most likely a multiple testing false positive. Proc Biol Sci 2009;276:1211-2.
6.  Peto R, Emberson J, Landray M, Baigent C, Collins R, Clare R, et al. Analyses of cancer data from three Ezetimibe trials. N Eng J Med 2008;359:1357-66.
7.  Baker M. 1,500 scientists lift the lid on reproducibility. Nature 2016;533:452-4.
8.  Neyman J. Frequentist probability and frequentist statistics. Springer; 1977.
9.  Schenker N, Gentleman J. On judging the significance of differences by examining the overlap between confidence intervals. Am Stat 2001;55:182-6.
10.  Elston R, Johnson W. Basic Biostatistics for Geneticists and Epidemiologists: A Practical Approach. Wiley; 2008.
11.  Morey R, Hoekstra R, Rouder J, Lee MD, Wagenmakers EJ. The fallacy of placing confidence in confidence intervals. Psychon Bull Rev 2016;23:103-23.
12.  Greenland S, Poole C. Problems in common interpretations of statistics in scientific articles, expert reports, and testimony. Jurimetrics J 2011;51:113-29.
13.  Shapiro S. Looking to the 21 ^{st} century: Have we learned from our mistakes, or are we doomed to compound them? Pharmacoepidemiol Drug Saf 2004;13:257-65.
14.  Breslow N. Are statistical contributions to medicine undervalued? Biometrics 2003;59:1-8.
15.  Greenwood M. Is statistical method of any value in medical research? Lancet 1924;204:153-8.
16.  Frey B. Publication as prostitution. Public Choice 2003;116:205-23.
17.  Mantel N. How to guarantee significance. Am Stat 1976;30:201-2.
18.  Boos D, Stefanski L. P Value precision and reproducibility. Am Stat 2011;65:213-221.
19.  Johnson V. Revised standards for statistical evidence. PNAS 2013;110:19313-7.
20.  Cameron E, Pauling L. Supplemental ascorbate in the supportive treatment of cancer: Prolongation of survival times in terminal human cancer. PNAS 1976;73:3685-9.
21.  Wakefield A, Murch S, Anthony A, Linnell J, Casson DM, Malik M, et al. Ileal-lymphoid-nodular hyperplasia, non-specific colitis, and pervasive developmental disorder in children. Lancet 1998;351:637-41.
22.  Hoenig J, Heisey D. The abuse of power: The pervasive fallacy of power calculations for data analysis. Am Stat 2001;55:19-24.
23.  Cohen J. The earth is round ( P < 0.05). Am Psychol 1994;49:997-1003.
24.  Woolston C. Psychology journal bans P values. Nature 2015;519:9.
25.  Baker M. Statisticians issue warning over misuse of P values. Nature 2016;531:151.
26.  Zheng G, Li Z, Geller N. A conversation with Robert C. Elston. Stat Sci 2015;30:258-67.
27.  Rubin DB. For objective causal inference, design trumps analysis. Ann App Stat 2008;2:808-40.
28.  Bang H. Introduction to observational studies. In: Faries D, Leon A, Haro J, et al., editors. Analysis of Observational Health Care Data Using SAS. SAS Press Series; 2010.
29.  Peck C. The almighty P value or the significance of significance. Present Concepts Intern Med 1971;4:1021-4.
30.  Rimm A, Bortin M. Trialism: The belief in the Holy Trinity clinician-patient-biostatistician. Biomed Special Issue 1978;28:60-3.
31.  Rose G. Sick individuals and sick populations. Int J Epidemiol 1985;14:32-8.
32.  
33.  Quetelet A. Sur l'homme et le développement de ses facultés, ou Essai de physique sociale. Paris: Bachelier, imprimeur-libraire, quai des Augustins; 1835.
34.  Canto J, Kiefe C, Rogers W, Peterson ED, Frederick PD, French WJ, et al. Number of coronary heart disease risk factors and mortality in patients with first myocardial infarction. JAMA 2011;306:2120-7.