The Questions and Answers
Book on Statistics for Clinical
Researchers
262 questions for deeper statistical
learning through questioning
FELIPE FREGNI
ALMA SANCHEZ
WILSON FANDINO
KARLA LOSS
EDITORS:
Felipe Fregni, MD, PhD, MMSc, MPH, MEd
Professor of Physical Medicine and Rehabilitation, Harvard Medical
School
Professor of Epidemiology, Harvard T.H. Chan School of Public
Health
Director, Principles and Practice of Clinical Research (PPCR)
Program, Harvard T.H. Chan School of Public Health
Director, Spaulding Neuromodulation Center, Spaulding
Rehabilitation Hospital
Boston, USA.
Alma T Sanchez Jimenez, MD
Associate Director of Education
Harvard T.H. Chan School of Public Health - ECPE
Boston, USA.
Wilson Fandino Parra, MD, MSc
Consultant Anaesthetist
Guy’s and St Thomas’ NHS Foundation Trust. London, United
Kingdom.
Senior Teaching Assistant, Principles and Practice of Clinical
Research Program,
Harvard T.H. Chan School of Public Health - ECPE
Boston, USA.
Karla Loureiro Loss, MD
Researcher, Department of Cardiology, Children's Hospital Los
Angeles, Los Angeles, United States
Teaching Assistant II, Principles and Practice of Clinical Research
Program
Harvard T.H. Chan School of Public Health - ECPE
Boston, USA.
CHAPTER AUTHORS:
Antonio Vaz de Macedo, MD, MSc
Head of the Hematology Clinic, Hospital da Polícia Militar,
Belo Horizonte, Brazil
Bone Marrow Transplant Unit Physician, Hospital Luxemburgo
Instituto Mário Penna, Belo Horizonte, Minas Gerais, Brazil
Senior Teaching Assistant
Principles and Practice of Clinical Research Program,
Harvard T.H. Chan School of Public Health - ECPE
Boston, USA.
Arturo Tamayo MD, FAHA, MSc
Assistant Professor of Neurology and Cerebrovascular Diseases
The Max Rady Faculty of Health Sciences. University of Manitoba
Winnipeg, Canada.
Director of Brandon and Winnipeg HSC Stroke Prevention Clinics
Senior Teaching Assistant
Principles and Practice of Clinical Research Program,
Harvard T.H. Chan School of Public Health - ECPE
Boston, USA.
Carlos Cordero García, MD, PhD
Head of Physical Medicine & Rehabilitation Department. Juan
Ramón Jiménez University Hospital. Huelva, Spain.
Senior Teaching Assistant
Principles and Practice of Clinical Research Program,
Harvard T.H. Chan School of Public Health - ECPE
Boston, USA.
Denise S. Schwartz, DVM, MSc, PhD
Professor Doctor at Department of Internal Medicine
School of Veterinary Medicine and Animal Science
University of São Paulo - Brazil
Senior Teaching Assistant
Principles and Practice of Clinical Research Program
Harvard T.H. Chan School of Public Health - ECPE
Boston, USA.
Fathima Minisha, MBBS, MRCOG, MSc
Hamad Medical Corporation, Doha-Qatar
Teaching Assistant II
Principles and Practice of Clinical Research Program
Harvard T.H. Chan School of Public Health - ECPE
Boston, USA.
Francisco Álvarez Marcos, MD, PhD, MSc
Vascular Surgery Department, Hospital Universitario Central de
Asturias (HUCA)
Statistical Advisor, European Journal of Vascular and Endovascular
Surgery (EJVES)
Oviedo, Spain
Senior Teaching Assistant,
Principles and Practice of Clinical Research Program
Harvard T.H. Chan School of Public Health - ECPE
Boston, USA.
Juan Carlos Silva Godinez MD, MSc, MPHc
Hospital de Oncología, Centro Médico Nacional Siglo XXI, Instituto
Mexicano del Seguro Social
Mexico City, Mexico
Senior Teaching Assistant
Principles and Practice of Clinical Research Program
Harvard T.H. Chan School of Public Health - ECPE
Boston, USA.
Kevin Pacheco-Barrios MD, MSc
Researcher, Spaulding Neuromodulation Center,
Spaulding Rehabilitation Hospital, Harvard Medical School.
Boston, USA
Khalid Ahmed, MBBS, MRCSEd, CABS, MSc, FHEA
Hamad Medical Corporation
Department of Acute Care Surgery
Doha, Qatar
Senior Teaching Assistant
Principles and Practice of Clinical Research Program
Harvard T.H. Chan School of Public Health - ECPE
Boston, USA.
Ludmilla Candido Santos, MD
Department of Emergency Medicine, Massachusetts General
Hospital, Boston, MA
Senior Teaching Assistant
Principles and Practice of Clinical Research Program
Harvard T.H. Chan School of Public Health - ECPE
Boston, USA.
Michelle Quarti da Rosa, RN, MSc, MPH, PhD
Public Health Department, Institute of Tropical Pathology and Public
Health, Federal University of Goiás, Brazil
Senior Teaching Assistant
Principles and Practice of Clinical Research Program
Harvard T.H. Chan School of Public Health - ECPE
Boston, USA.
Mohamed Hashim M. Mahmoud, MD, MSc, ABHS-FM
Primary Health Care Corporation - Qatar
Senior Teaching Assistant
Principles and Practice of Clinical Research Program
Harvard T.H. Chan School of Public Health - ECPE
Boston, USA.
Olivier F Clerc, MD
Department of Cardiology, University Hospital Basel,
University of Basel, Switzerland.
Senior Teaching Assistant
Principles and Practice of Clinical Research Program
Harvard T.H. Chan School of Public Health - ECPE
Boston, USA.
Rui Nakamura, MD
The University of Tokyo
Tokyo, Japan
Senior Teaching Assistant
Principles and Practice of Clinical Research Program
Harvard T.H. Chan School of Public Health - ECPE
Boston, USA.
Sabrina Donzelli, MD, MSc
Italian Scientific Spine Institute ISICO
Milano, Italy
Senior Teaching Assistant
Principles and Practice of Clinical Research Program
Harvard T.H. Chan School of Public Health - ECPE
Boston, USA.
Yuri Takehara Chemale, MD
UNSW, Faculty of Medicine and Health, Sydney, Australia
Senior Teaching Assistant
Principles and Practice of Clinical Research Program
Harvard T.H. Chan School of Public Health - ECPE
Boston, USA.
Walid Omer, MD, MSc
Associate Consultant, Audiology & Balance Unit, ENT Section
Hamad Medical Corporation. Doha, Qatar
Senior Teaching Assistant, Principles and Practice of Clinical
Research Program, Executive and Continuing Education, Harvard
T.H. Chan School of Public Health - ECPE, Boston, USA.
Copyright © 2022 Felipe Fregni
All rights reserved.
To my children, Camila and Mateo, I love you to the moon and back.
Alma T Sanchez Jimenez
To Lorena, for always being supportive since the beginning of this
project; to my parents, who celebrate their golden anniversary this
year, for their unconditional support throughout my entire life; to my
family members and all lives lost during the Covid-19 pandemic, who
will always be remembered; and to the PPCR community, for being a
constant source of motivation to keep learning biostatistics.
Wilson Fandino
To my husband Gustavo and daughters, Helena and Mariana. Your
love and everyday smiles supported me during this journey. Thank
you! Karla Loureiro Loss
This chapter is dedicated to CAMILA and ANA, without whom I
would not have delved into the depths and wonders of Clinical
Research. Antonio Vaz de Macedo
To Christy, Keara and Keelan. Thank you for your unlimited patience
and love. Arturo Tamayo
Dedicated to Mónica and our children (María and Álvaro) for being
my daily inspiration. Carlos Cordero
To all who seek to learn and understand a bit more about applied
statistics! Denise S Schwartz
I would like to take this opportunity to thank Professor Fregni, and
the PPCR course for opening up such great opportunities for me in
the world of research and statistics. This work is dedicated to
everybody in my life who put their lives on hold so that I could pursue
my passion. Fathima Minisha.
To my wife Laura and my small teckel Opi, who both suffered some
shortage of outdoor activities in the making of this project. Francisco
Álvarez Marcos
To our students, who made all this possible. Juan Carlos Silva
Godinez
To my mom and dad, for all the love and support, always and all the
way. Kevin Pacheco-Barrios
To my parents, for making me believe that all my dreams can come
true. Ludmilla Candido Santos
To my husband Fabiano: we are unstoppable! Gratitude to my mom
Elaine and my sister Camille, for always supporting me. Michelle
Quarti Machado da Rosa
My beloved father, who taught me to be sincere and optimistic; my
family, who always support and encourage me; my inspirational
mentors and teachers. Mohamed Hashim M. Mahmoud
To my dear parents, my beloved wife, my lovely children, and my
sincere siblings for the continuous support and without whom this
work could not have been accomplished. To the PPCR family (With
special gratitude to Prof Fregni and Alma) for the trust, help, and
support throughout my ongoing research journey. Khalid Ahmed
To Giuliano, Matilde and Lidia. Sabrina Donzelli
For Professor Felipe Fregni, and my students at Harvard Medical
School & Harvard T.H. Chan School of Public Health. Rui Nakamura
To my lovely wife, the light of my life. To my parents, my foundations.
Yuri Takehara Chemale
Praise be to ALLAH “God”. To my mother, To my father, To my wife,
To my children. Walid Omer
TABLE OF CONTENTS
A BRIEF INTRODUCTION - THE FIRST QUESTION OF THIS BOOK: THE WHY,
WHAT, AND HOW OF THIS BOOK
Why was this book written?
What is this book about?
How to use this book?
1. DESCRIPTIVE STATISTICS
Introduction to Chapter 1
Questions 1–20
Conclusion for Chapter 1
References for Chapter 1
2. INFERENTIAL STATISTICS
Introduction to Chapter 2
Questions 21–42
Conclusion for Chapter 2
References for Chapter 2
3. STATISTICAL TESTS I: COMPARING TWO GROUPS WITH STUDENT’S T-
TEST AND NON-PARAMETRIC TESTS
Introduction to Chapter 3
Questions 43–54
Introduction to non-parametric tests
Questions 55–63
Conclusion for Chapter 3
References for Chapter 3
4. STATISTICAL TESTS II: STATISTICAL TESTS FOR CATEGORICAL
FREQUENCY DATA
Introduction to Chapter 4
Questions 64–84
References for Chapter 4
5. STATISTICAL TESTS III - MORE THAN TWO-GROUP COMPARISONS:
ANOVA AND NON-PARAMETRIC TESTS
Introduction to Chapter 5
Questions 85–98
Conclusion for Chapter 5
References for Chapter 5
6. STATISTICAL TESTS IV - STRENGTH OF ASSOCIATION BETWEEN
VARIABLES: USE OF CORRELATION COEFFICIENTS
Introduction to Chapter 6
Questions 99–107
Conclusion for Chapter 6
References for Chapter 6
7. STATISTICAL TESTS V: SURVIVAL ANALYSIS
Introduction to Chapter 7
Questions 108–137
References for Chapter 7
8. SAMPLE SIZE CALCULATION
Introduction to Chapter 8
Questions 138–153
Conclusion for Chapter 8
References for Chapter 8
9. MISSING DATA
Introduction to Chapter 9
Questions 154–167
Conclusion for Chapter 9
References for Chapter 9
10. COVARIATE ADJUSTMENT
Introduction to Chapter 10
Questions 168–171
Conclusion for Chapter 10
References for Chapter 10
11. META-ANALYSIS
Introduction to Chapter 11
Questions 172–208
References for Chapter 11
Recommended Articles for Chapter 11
Recommended Books for Chapter 11
12. SUBGROUP ANALYSIS
Introduction to Chapter 12
Questions 209–215
References for Chapter 12
13. NON-INFERIORITY AND EQUIVALENCE DESIGNS
Introduction to Chapter 13
Questions 216–237
Conclusion for Chapter 13
References for Chapter 13
14. LINEAR REGRESSION
I����������� �� C������ 14
Q������� 238: W��� �� L����� R���������?
Q������� 239: W���� �������� ��������� ��� �� �������� ����� ������
����������?
Q������� 240: W��� �� ��� ����������?
Q������� 241: H�� �� �� ������ ��� ��������� �� �� �������� �� ���
�����?
Q������� 242: W���� ����������� ������ �� �������� �� ������
����������?
Q������� 243: W��� �� ��� ���������� �� ������������ ��� ������
������������?
Q������� 244: W��� �� ��� ���������� �� ����������� ��� ���� �� ���
���������� �������� �� ������ ������������?
Q������� 245: H�� ���� ��������� ��� �� �������� �� � ����������
����� ��� ��� �� ��������� ������ ����?
Q������� 246: H�� �� ��������� ��� ������ ��� ��� �� ���� ���
������� �� ���� �����?
Q������� 247: W��� �� �������� ����������?
C��������� ��� C������ 14
R��������� ��� C������ 14
15. CONFOUNDING IN OBSERVATIONAL STUDIES & PROPENSITY SCORES
I����������� �� C������ 15
C���������� �� C������� R������� � U������������ �����������
Q������� 248: W��� ��� ��� ��������������� �� � ����������?
Q������� 249: H�� �� �������� ��������� �����������?
Q������� 250: W��� �� ��� ��������� �� ���������� ������ ���������
����������� �������� �� ������������� �������?
Q������� 251: W�� �� �� ��������� �� ������� ��� ����������� ��
������������� �������?
Q������� 252: W��� ��� ��� ������� �� ������� ��� ����������� ��
���-���������� �������?
Q������� 253: W��� �� ��� ���������� ������� ����������� ��� ������
������������ �� �����������?
Q������� 254: W��� �� ��� ���������� ������� � ��������� �������
��� � ����������?
Q������� 255: W��� �� ��� ���� �� “���������� ����� (PS)” ��
����������� �������?
Q������� 256: H�� �� PS ����������?
Q������� 257: H�� �� ������ ��� ���������� �� �� �������� �� ���
�����
Q������� 258: W��� ��� ��� ������� �� ��� PS?
Q������� 259: W��� ��� ��� ���������� ��� ������������� �� ����� PS?
Q������� 260: W���� PS ������ ��� �� ���� �� ��� ���� �� ��������
��������� ����?
Q������� 261: W���� ��� �� ��������: ������������ ���������� ������
�� PS?
Q������� 262: W��� �� �� ��� ����, ��� ����� �� ���� �������� ����?
W���� PS ������ �������� � ���� �������� ��������, ���� �� ��� ����
�����?
C���������� ��� C������ 15
R��������� ��� C������ 15
A Brief Introduction - the First
Question of this Book: the Why, What,
and How of this Book
A book that uses questions to enhance learning should also start
with three important questions so the readers can understand the
motivation for this book: Why was this book written? What is this
book about? And how should this book be read?
Why was this book written?
To answer this question, we first need to understand some basic learning principles. If we could summarize them in one sentence, it would be: learning comes from within. We learn what is important to us. We learn what we are curious about. We learn when we connect new knowledge to what we already know. Taken together, you will likely agree with us that questions are fundamental, as they guide our learning. Let us make an analogy here. Suppose you are assembling a jigsaw puzzle. What is the best way: for someone to hand you a random piece, or for you to ask for a specific piece (for instance, a dark blue corner piece for the sky)? The answer is the latter, and this is how the building of our knowledge works. When we are given random "pieces" of knowledge, they do not attach to our existing knowledge, and we quickly forget them. But when we receive the piece we are looking for, our knowledge and understanding increase significantly.
This is the goal of this book: to help you find the pieces you need to build your knowledge of statistics. Here the reader might object that, in that case, readers should send in their own questions. Although we would love to have a fully interactive book (indeed, if you do have additional questions, please send them to us at the email addresses below), this would not be feasible.
Our solution was then to analyze the thousands of questions we receive every year from the roughly 500 students of the Principles and Practice of Clinical Research (PPCR) program (www.ppcr.org). This program was founded by Prof. Fregni in 2008 with the mission of enhancing clinical research methodology globally, and we have trained more than 4,000 scientists around the world. Our goal in this program is to provide a highly interactive platform so that learning takes place in a more natural and, more importantly, more effective manner.
What is this book about?
This book is meant to help you build your knowledge of statistics. Given that our PPCR program welcomes students with diverse backgrounds in statistics, from little or no knowledge to more advanced training, the questions we created span all of these levels.
We hope that we cover most of the questions you may have on
the topics we discuss here. Is this a good book if this is your first
encounter with statistics? Yes, and no. If this is your first attempt to learn statistics, we recommend using this book alongside a statistics course or another textbook (we also suggest our other clinical research textbook, which devotes about half of its content to statistics: Critical Thinking in Clinical Research: Applied Theory and Practice Using Case Studies).
Also, another important characteristic of this book is that it has been written by people similar to its audience: alumni of our PPCR program who are now senior teaching assistants (meaning they have been part of our teaching staff for at least four years), in addition to being carefully reviewed by the faculty editors of this book.
We chose to answer the questions in a simple way, accessible regardless of your level of expertise. We aimed to offer guidance through simplicity, refined ideas, examples, and graphics. In that sense, we hope you benefit from our explanations and that this book engages you in a learning journey.
How to use this book?
In this book, we answer real questions that we have received over roughly 15 years of the PPCR program. We carefully selected and included the most common and relevant questions from the statistical modules. Furthermore, each chapter contains a short introduction to and review of the topic. Among the topics, we cover an introduction to statistics, the basics of clinical research, relevant statistical concepts, parametric and non-parametric tests, sample size calculation, meta-analysis and subgroup analysis, linear and logistic regression, missing data and covariate adjustment, survival analysis, and lastly, observational studies and confounding.
This book will provide you with the skills you need to understand
statistics, apply them to your own research and daily practice, and
design a proper study. We ask our readers to send us any feedback,
and in case you don't see the question you're looking for, you can
always send it to us.
Boston, Massachusetts
Felipe Fregni (felipe.fregni@ppcr.org)
Alma Sanchez (alma.sanchez@ppcr.org)
Wilson Fandino (wilson.fandino@ppcr.org)
Karla Loss (karla.loss@ppcr.org)
Chapter 1
1. Descriptive Statistics
Wilson Fandino
Introduction to Chapter 1
It has been a long year. You have been working really hard, but
finally July has arrived, and you have decided to take a break to
spend quality time with your best friend in Las Vegas (Nevada). The
weather could not be better. The first thing that comes to mind is a
walk along the Strip, and maybe going shopping. It is a beautiful day,
but the temperature is extremely high. In fact, while having your
breakfast, you hear news reports saying that the average
temperature for the day will be 100 °F, and the probability of rain has
been estimated to be only around 5%.
You have forgotten to apply sunscreen lotion before leaving the
hotel, and given the weather conditions, it has become a priority to
get some. At the first pharmacy spotted, you choose an expensive
sunscreen based on the sun protection factor (SPF) and reduction in
the risk of sunburns: it offers 20% more protection, and diminishes
the chance of sunburns by 50%, when compared to competing
brands. However, the temperature is not as high as it was meant to
be. In the late afternoon, the weather changes dramatically and
heavy rain pours over Las Vegas. At that moment, you are
determined to visit one of the local casinos, and try your luck. After
placing your bet at a game of American Roulette, in a stroke of good
fortune, you win 170 dollars on the very first attempt. Astonished by
your incredible luck, you cannot resist the temptation of spinning the
roulette wheel again. On this occasion, luck is not on your side, and
after losing 180 dollars, your friend wisely convinces you to make
your way back to the hotel, certainly disappointed by the fact that the
day did not go as planned.
It should come as no surprise that your first day in Las Vegas
did not go as expected. After all, you did not have sufficient
information to make the right decisions. For example, if you had
known that the expected average temperature for the day had a
range with a certain level of confidence, you would have been
prepared for cooler (or warmer) temperatures within that range. If
you were a statistician, you would have been aware that the
probability of rain was just an estimate, and there is always some uncertainty around the estimated 5%. If you were a dermatologist, you
would have known that a sunscreen lotion with SPF of 60 does not
provide further skin protection against ultraviolet radiation, when
compared to a similar product with SPF 50 (the same kind that you
had left at the hotel); consequently, a 20% increase from 50 to 60 is
not clinically relevant in the majority of situations. Likewise, if you
had known that halving the risk of sunburns could mean, for instance, an absolute reduction from as little as 0.2% to 0.1%, you would have
thought twice before opting for the most expensive lotion. Lastly, if
you were familiar with the concept of “regression to the mean,” you would probably have left the casino with 170 dollars in your
pocket, knowing that having such a great stroke of luck was unlikely
to be repeated.
Numbers are everywhere in our lives, and if you are sufficiently
observant, they can tell you a story. By checking your bank
statements, they can depict your financial behavior. When going for
a jog, your smart watch can summarize the covered distance, the
amount of energy spent, and your heart rate. If you are not a big fan
of jogging, your smartphone can also track the number of hours spent sleeping, or the amount of time spent on social media.
However, you would not like to see an inventory of your daily
expenses over the last three years in your last bank statement, or a
long list of your hourly heart rate recorded for the last three months,
would you? Most likely, you would be more focused on the trend of
these figures, in the form of central tendency and dispersion
measures (see below); ideally, you would like to know whether those
figures are excessive, or perhaps insufficient, when compared with
other people. To this end, graphs, diagrams, and tables are a convenient way to summarize data. This information may help you
improve your finances or modify your habits to keep your body in
good shape.
Stories can be told with numbers, and this is the ultimate aim of descriptive statistics.
The purpose of this chapter is to provide you with a primer on the main components of descriptive statistics, and the main tools to
summarize information in the context of clinical research. First, we
explore some general characteristics of variables and the
methodology used to describe the data distribution of any variable,
including graphical (histograms, box plots, frequency polygons, bar
charts, pie charts, and scatter plots) and numerical approaches
(central tendency and dispersion measures). Further, we examine
the most important strategies to assess the normality and
homogeneity of variances of outcome variables, followed by a brief
discussion on data transformation and its implications in clinical
research. Finally, we conclude this chapter by proposing a
methodology to choose the appropriate statistical test based on the
characteristics of the variables involved.
General Concepts
Question 1: What exactly do we mean by
descriptive statistics?
In essence, statistics embraces two fundamental disciplines: descriptive statistics and inferential statistics. The core concepts of inferential statistics are discussed in Chapter 2. Descriptive statistics, in turn, are concerned with a range of strategies used to summarize and efficiently present a set of data (Vetter, 2017a).
One easy way to understand this concept is as follows: descriptive
statistics provide tools to describe and summarize a particular study
dataset (for instance, the mean blood pressure in the control group),
whereas inferential statistics provide you with the tools to infer
whether the results of your dataset can be applicable to the general
population (e.g., the confidence interval of blood pressure that can
be found in the general population based on your study data). More
specifically, in the framework of clinical research, descriptive
statistics usually encompass the main characteristics of the
exposure or treatment group (depending on whether the study is
observational or experimental, respectively) and the control group,
when applicable (Barkan, 2015). The summarized information of the study sample is almost invariably included in the first table of any scientific paper, which is often simply labeled “Table 1”. This table is crucial for interpreting the results, as it allows the reader to compare
the baseline characteristics of the groups and variables of interest
(Govani, 2012). The methodological tools used to describe and
compile scientific information are discussed in detail in the following
sections.
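To make this distinction concrete, here is a minimal Python sketch using only the standard library. The blood-pressure values are invented for illustration, and the confidence interval uses a simple normal approximation (1.96 standard errors), one of several possible choices:

```python
import statistics
import math

# Hypothetical systolic blood pressures (mm Hg) from a control group
bp = [118, 122, 125, 119, 130, 127, 121, 124, 128, 126]

# Descriptive statistics: they summarize THIS dataset only
mean_bp = statistics.mean(bp)
sd_bp = statistics.stdev(bp)

# Inferential statistics: a 95% confidence interval for the population
# mean, using the normal approximation (1.96 standard errors)
se = sd_bp / math.sqrt(len(bp))
ci = (mean_bp - 1.96 * se, mean_bp + 1.96 * se)

print(f"Sample mean: {mean_bp:.1f} mm Hg, SD: {sd_bp:.1f}")
print(f"95% CI for the population mean: {ci[0]:.1f} to {ci[1]:.1f}")
```

The first two numbers describe the sample at hand; the interval is a statement about the population from which the sample was drawn.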
Question 2: How do we classify variables inclinical research?
Data in clinical research can be divided into three main categories:
numerical, categorical (also known as nominal), and ordinal (Barkan,
2015). The different types of variables and relevant examples are
listed in Table 1.1. As opposed to categorical variables, numerical variables take values on a scale with equal intervals. This means that, for instance, a value of 6 is equidistant from 3 and 9.
Furthermore, a numerical variable can be a continuous interval or
continuous ratio variable. The difference between the interval and
ratio variables is given by the meaningfulness of the zero value.
Accordingly, the central venous pressure (CVP) can be deemed as
an interval continuous variable, because a value of zero cm H2O is
measured with respect to the atmospheric pressure; therefore, the
zero value does not represent a lack of pressure, but a leveling with
the atmospheric pressure. It follows that a CVP of 16 is not
necessarily double that of a CVP of 8. Another key characteristic of
interval variables is that they can take a negative value, given the
arbitrariness of zero. This would not be the case with the continuous
ratio variable weight. In the latter case, a patient weighing 110 kg is
literally double another patient weighing 55 kg. Therefore, it is
possible to compute the ratio of these two values. Discrete numerical variables, on the other hand, can only take non-negative integer values (0, 1, 2, 3, ...). Lastly, as opposed to categorical variables, ordinal variables entail a natural order (Table 1.1).
Table 1.1. Classification of variables in clinical research. CVP: Central Venous Pressure.

Variable                     Description                  Examples
Numerical
  Continuous
    Interval                 Zero arbitrary               Temperature, CVP
    Ratio                    Zero meaningful              Blood pressure, weight
  Discrete                   Only takes non-negative      Number of children, number
                             integer values               of asthmatic attacks
Categorical
  Dichotomous                2 categories                 Gender, mortality outcome
  Non-dichotomous            >2 categories                Marital status, ethnicity
Ordinal                      Natural order                Severity of pain, stage of cancer
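The interval-versus-ratio distinction can be checked with a small computation. Because the zero of an interval scale is arbitrary, ratios are not preserved when the scale changes; for a ratio variable such as weight, they are. A sketch in Python (the numbers are arbitrary):

```python
# Interval variable: temperature. 100 degrees F is NOT "twice as hot" as
# 50 degrees F, because the zero point is arbitrary: converting to
# Celsius changes the ratio.
def f_to_c(f):
    return (f - 32) * 5 / 9

ratio_f = 100 / 50                           # 2.0 on the Fahrenheit scale
ratio_c = f_to_c(100) / f_to_c(50)           # ~3.78 on the Celsius scale
print(ratio_f, round(ratio_c, 2))            # the ratio is not invariant

# Ratio variable: weight. Zero means "no weight", so ratios are
# meaningful and survive any unit conversion.
ratio_kg = 110 / 55                          # 2.0 in kilograms
ratio_lb = (110 * 2.20462) / (55 * 2.20462)  # still 2.0 in pounds
print(ratio_kg, ratio_lb)
```

This is exactly why a CVP of 16 cannot be called "double" a CVP of 8, while a 110 kg patient really does weigh double a 55 kg patient.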
It is common to find in the literature different synonyms for
dependent and independent variables, which may result in confusion
among readers. Table 1.2 lists some of the most frequently used
alternative names for these variables.
Table 1.2. Common alternative names for dependent and
independent variables encountered in literature.
Dependent variable Independent variable
Outcome variable
Response variable
Predicted variable
Measured variable
Predictor variable
Explanatory variable
Covariate
Regressor
Lastly, in the context of data distribution, variables can also be
classified into two fundamental groups: normal (or Gaussian)
distribution and non-normal distribution. The fundamental differences
between these two distributions are discussed in detail in this
chapter.
Question 3: Why is the Visual Analog Scale
(VAS), often used to score the intensity of pain,
considered a continuous variable?
The Visual Analog Scale (VAS) is, in fact, used recurrently in clinical
research to quantify the intensity of pain experienced before or after
a given intervention. It is usually rated on a scale from 0 to 10, where
0 is the absence of pain and 10 is the worst pain ever experienced.
From this perspective, it becomes evident that such scales should be
considered ordinal rather than continuous, given the subjective
nature of measuring pain levels. However, the VAS score is usually
deemed a continuous ratio-scale measure by most researchers. This
is because the validation of the VAS score as a continuous variable dates back to 1983, and this use has become well established in experimental research (Price, 1983). Thus, VAS scores represent
one of the few exceptions where an ordinal scale can be treated as a
continuous variable, and consequently, parametric tests can be
applied to samples with normal distribution (see below). However, in
most situations, ordinal variables are not suitable for parametric
tests.
Question 4: What is a random variable?
A random variable is a set of numerical outcomes that can be
obtained after conducting an experiment. The random nature of this
outcome entails a probability distribution associated with this
variable. Hence, there is always uncertainty regarding this value.
This is not the case, for example, with values assigned in a deterministic manner. Imagine that you are making a pizza by
following the instructions in a recipe book. According to the recipe, in
order to make good pizza dough, you need to mix 50 mg of leaven,
50 mg of poolish, 125 mg of strong white flour, 4 mg of salt, and 70
ml of water. After topping the dough with your favorite ingredients,
the pizza needs to be baked for 20 minutes at 900 °F. All these
numerical values are carefully measured on a scale or selected in
the oven; therefore, they cannot be random variables.
From the perspective of statistical modeling, random variables
can be continuous or discrete, depending on the type of values
accepted by the variable. Continuous random variables can take uncountably many values (see also Question 2). For example, if the distance between points A and B were being measured on a scale ranging from zero to infinity, this continuous variable could take any real value equal to or greater than zero. In turn, if the distance estimated was, say, 2.54
cm, there should be always a level of uncertainty (note that the
distance was estimated, not measured) expressed within a range of
values for a given confidence level. A detailed discussion of the
implications of this uncertainty in the inference of results from a
study sample to the population of interest, can be found in Chapter
2.
By contrast, discrete random variables can only accept
countable (integer) numbers (i.e., 0, 1, 2, 3, 4, etc.). For instance, the
number of migraine episodes following an experimental treatment
within a given period of time can theoretically range from zero to
infinity, but this variable cannot accept non-integer numbers (e.g.,
2.54), as this is not biologically possible.
For the purpose of statistical analysis, the different levels of categorical and ordinal variables are usually coded as numbers, thereby making comparisons between groups possible. As an illustrative example, consider blood pressure as a variable with four finite levels. Any participant can be classified into
one of the following four categories: normal blood pressure (0), mild
hypertension (1), moderate hypertension (2), and severe
hypertension (3). Importantly, the distinction between categorical and
ordinal levels is crucial for choosing the appropriate statistical test,
as will be discussed in this chapter.
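As a rough illustration of the two flavors of random variable, the sketch below simulates a continuous measurement and a discrete count with Python's standard random module. The distributions and their parameters are invented for the example:

```python
import random

random.seed(42)  # reproducible draws

# Continuous random variable: can take any real value in its range
# (e.g., a distance estimate with Gaussian noise around 2.54 cm)
distance = random.gauss(mu=2.54, sigma=0.05)

# Discrete random variable: only non-negative integers make sense
# (e.g., number of migraine episodes over 30 days, each day carrying
# an invented 10% chance of an episode)
episodes = sum(1 for _ in range(30) if random.random() < 0.1)

print(f"distance estimate: {distance:.4f} cm")  # a non-integer value
print(f"migraine episodes: {episodes}")         # always an integer >= 0
```

Each run (with a different seed) yields different values, which is precisely the uncertainty that distinguishes a random variable from the deterministically measured quantities in the pizza recipe.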
Methodology of Descriptive Statistics
Question 5: How can data be summarized in the
scientific literature?
In descriptive statistics, a dataset can be summarized using numbers
and efficiently represented with the help of specialized graphs. The
numerical description of the data is usually presented in tables. In
the case of continuous variables, the central tendency and
dispersion measures accurately summarize the data distribution (see
below). Conversely, categorical data are usually presented in terms
of sample proportions (number of events of interest divided by the
total number of observations), whereas discrete and ordinal
variables are often summarized with percentages (i.e., proportion x
100).
Depending on the nature of the variable, several graphs can be
used to summarize the data. Figure 1.1 depicts some of the most
frequently used graphs in clinical research: histograms, box plots,
frequency polygons, bar charts, pie charts, and scatter plots. The
latter is often used to analyze the relationship between two
continuous variables. Histograms and bar charts are different graphs
intended to summarize the different types of data. Histograms are
the preferred method to describe continuous variables, whereas bar
charts are often used to compare different levels of categorical
variables.As illustrated in Figure 1.1, the bars of the histograms are
always presented without any gaps, as they represent the specific
intervals of the continuous variable. By contrast, the bars of the bar
charts are always presented with spaces between the bars, because
each bar represents a different categorical level.
Box plots are graphs specifically designed to characterize data
with non-normal distributions. A description of the components is
shown in Figure 1.1B. Using this methodology, the data distribution
can be interpreted from the same perspective as a histogram. A
comparison of these two graphs is shown in Figure 1.2 (Vetter, 2017a).
The choice of the most appropriate graph fundamentally depends on the type of variable (numerical, categorical, or ordinal) and its distribution (normal or non-normal). The distribution of random variables is discussed below.
Figure 1.1. Graphic representation of data according to the type of variable. A:
Histogram. This is the preferred graph to display continuous variables. It can also
be used to visually evaluate the data distribution (normal or non-normal). In this
example, the normal distribution curve is represented in orange. The x-axis
contains the values of the continuous variable distributed in pre-specified bins or
intervals, and the y-axis shows the density as a measure of the frequency of
values. B: Box plot. This is the natural way to represent data with non-normal
distribution. The height of the box (highlighted in blue) is defined by the 25-75%
interquartile range. The height of the whiskers (represented by vertical lines) is
given by the upper and lower quartiles, and the range (horizontal lines) represents
the minimum and maximum values. C: Frequency polygon. This is an alternative
way to summarize a set of data presented in a histogram, where the bins or
intervals are replaced by lines. D: Bar chart. This is the standard way to
summarize categorical data. The x and y-axes depict the levels of the categorical
variable and their frequency, respectively. E: Pie chart. This is the recommended
graph to summarize categorical data in the form of percentage, where the total
area of the circle corresponds to 100%. In this example, the percentage of data for
each hospital is proportionally represented in the graph. F: Scatter plot. This is the
default graph of correlation and linear regression models, wherein two continuous
variables are compared. In this particular case, the two variables are strongly
correlated.
Figure 1.2. Comparison between a histogram (lower graph) and a box plot (upper
graph) for the presentation of a set of data with normal and non-normal
distribution, respectively. The interpretation of central tendency and dispersion
measures depicted in box plots is analogous to the data distribution of a
histogram. Note that arithmetic mean and median take the same value when the
distribution is symmetrical. The length of the whiskers represents the spread of
data. Q1: First (lower) quartile. Q3: Third (upper) quartile. IQR: Interquartile Range, representing the middle 50% of data (difference between upper and lower quartiles). Min: Minimum value. Max: Maximum value. x̄: Arithmetic mean.
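The structural difference between a histogram and a bar chart, contiguous bins of a continuous variable versus separate categories, can be sketched with the standard library alone. The ages and hospital labels below are invented:

```python
from collections import Counter

ages = [34, 41, 29, 55, 47, 38, 62, 45, 51, 36, 44, 58, 33, 49, 40]

# Histogram logic: the x-axis is cut into contiguous, equal-width bins,
# and each value is assigned to the bin that contains it
bin_width = 10
bins = Counter((age // bin_width) * bin_width for age in ages)
for start in sorted(bins):
    print(f"{start}-{start + bin_width - 1}: {'#' * bins[start]}")

# Bar chart logic: the x-axis holds distinct categories with gaps
# between bars, because the categories have no intrinsic scale
sites = ["Hospital A", "Hospital B", "Hospital A", "Hospital C", "Hospital A"]
for site, count in sorted(Counter(sites).items()):
    print(f"{site}: {'#' * count}")
```

The binning step is what makes histogram bars touch: each bar covers an interval of the continuous axis, whereas each bar-chart bar stands for an unrelated category.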
Question 6: What kind of values can variables
take in the Y-axis of histograms?
The X-axis of any histogram almost invariably represents the range
of values taken by the variable illustrated in the graph, in a given
scale (e.g., mm Hg, kg, or years) and distributed in pre-specified bins
or intervals, which ultimately define the width of the bars (see Figure
1.1A). The scale used for the Y-axis, by contrast, may vary
depending on the convenience of the data presentation and the
personal preference of the researcher. Some of the most common
methods for representing the frequency of the values given in the X-
axis include absolute frequencies (frequencies), relative frequencies
(percentages), proportions (fractions), and frequency densities
(densities). For example, if a value of 5 was obtained 20 times in a given sample of 100 observations, it could be represented on the Y-axis as a frequency (20), a percentage (20%), a proportion (0.2), or a density, whose value also depends on the bin width (Vetter, 2017b).
The concept of frequency density deserves further discussion,
as many readers may not be familiar with this scale, and it is often
encountered in the literature. While fractions are scaled so that the
sum of the bar heights equals 1, density scales are intended to make
the sum of the areas of such bars equal to 1. Thus, when the width
of each bar (which ultimately represents the intervals or bins of the
variable) is equal to 1, the scale used for fractions and densities
would be identical. For consistency, all histograms and bar charts presented in this chapter use density on the Y-axis to represent the observed frequencies.
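The worked example above (a value of 5 obtained 20 times among 100 observations) can be written out in a few lines; here we assume a bin width of 1, in which case proportion and density coincide:

```python
n_total = 100   # observations in the sample
n_value = 20    # how many times the value 5 was obtained
bin_width = 1   # width of the histogram bin containing the value 5

frequency = n_value                # absolute frequency
proportion = n_value / n_total     # fraction: bar HEIGHTS sum to 1
percentage = proportion * 100      # relative frequency
density = proportion / bin_width   # density: bar AREAS sum to 1

print(frequency, percentage, proportion, density)
# With bin_width = 1, proportion and density are identical;
# with bin_width = 2, the density would halve to 0.1.
```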
Question 7: What is the definition of central
tendency measures?
A measure of central tendency is a statistic that describes the central
distribution of data. The most commonly used statistics are the
mean, median, and mode (Manikandan, 2011). At least four types of
mean have been described in clinical research: arithmetic, quadratic,
geometric, and harmonic. The formal definitions of these measures
are presented in Table 1.3.
Table 1.3. Classification of means. The definitions and statistical notations are provided in this table. Example: Given the set of observations 23, 21, 34, 42, 18, 36, 22, 46, 28 and 30, the arithmetic, quadratic, geometric and harmonic means equal 30, 31.3, 28.7 and 27.5, respectively. xi = ith observation, Πxi = product of observations, n = total number of observations.

Mean         Definition                                      Equation
Arithmetic   Sum of observations divided by total            (Σxi)/n
             number of observations
Quadratic    Square root of the arithmetic mean of           √(Σxi²/n)
             squared values
Geometric    nth root of the product of observations         (Πxi)^(1/n)
Harmonic     Total number of observations divided            n/Σ(1/xi)
             by the sum of reciprocals
Aside from the undeniable importance of arithmetic mean in clinical
research, the geometric mean deserves further discussion, although
it is not frequently used. Consider the area of a rectangle of length 8
cm and width 2 cm. It is evident that this area is exactly the same as that of a square with a base and height of 4 cm (16 cm²). In fact, in these two cases, the geometric mean is 4, which is the square root
of 16 (see Table 1.3). Thus, whereas the arithmetic mean of the
values 8 and 2 was 5 cm, the geometric mean was only 4 cm. The
geometric mean is always smaller than the corresponding arithmetic
mean, and its value is usually assigned somewhere between the
median and arithmetic mean. The geometric mean gains relevance
in the context of the logarithmic transformation of data, which will be
discussed later in this chapter (Bland, 1996b).
Although the harmonic mean is not usually encountered in
scientific literature, it is often indicated to accurately estimate the
average rates. For example, assume that a given substance is
eliminated from the body at two different rates: the first half is
eliminated at the rate of 20 mg/h, whereas the second half is
eliminated at the rate of 10 mg/h. The arithmetic mean of these rates
is 15 mg/h. However, according to the formula given in Table 1.3, the
harmonic mean corresponds to 13.33 mg/h. In fact, if 100 mg were
administered, it is evident that the first half would be eliminated in 2.5
hours (at the rate of 20 mg/h), and the second half would be
eliminated in 5 hours (at the rate of 10 mg/h). It follows that 100 mg
would be eliminated in 7.5 hours, with an average rate of 13.33 mg/h
(100/7.5). Thus, the harmonic mean precisely estimates the average
elimination rate. In the context of clinical research, harmonic means appear in the equations of several post-hoc tests used to contrast groups after conducting an analysis of variance (ANOVA) for the comparison of more than two groups (see Chapter 5). Lastly, similar to the case of geometric means,
harmonic means arealways smaller than their arithmetic
counterparts.
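The worked examples above, the four means of the dataset in Table 1.3 and the average elimination rate, can be verified in a few lines of Python (the geometric and harmonic means are available in the standard library from Python 3.8):

```python
import statistics
import math

x = [23, 21, 34, 42, 18, 36, 22, 46, 28, 30]

arithmetic = statistics.mean(x)                         # 30
quadratic = math.sqrt(sum(v ** 2 for v in x) / len(x))  # ~31.3
geometric = statistics.geometric_mean(x)                # ~28.7
harmonic = statistics.harmonic_mean(x)                  # ~27.5
print(round(arithmetic, 1), round(quadratic, 1),
      round(geometric, 1), round(harmonic, 1))

# Average elimination rate: half the dose at 20 mg/h, half at 10 mg/h.
# The harmonic mean, not the arithmetic mean, gives the true average.
rate = statistics.harmonic_mean([20, 10])
print(round(rate, 2))  # 13.33 mg/h
```

Note the ordering the text describes: harmonic < geometric < arithmetic < quadratic for this dataset.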
In the following section, we will discuss the characteristics of
central tendency measures often used in clinical research: mode,
median, and arithmetic mean. These statistics take the same value
when the dataset exactly follows a normal distribution. However, this
scenario is seldom seen in clinical research, because biological
variables can only approximate, not exactly reproduce, the normal
distribution (Altman, 1995). In this regard, the famous British
statistician George Box wrote: “The statistician knows that in nature
there never was a normal distribution, there never was a straight
line, yet with normal and linear assumptions, known to be false, he
can often derive results which match to a useful approximation,
those found in the real world” (Box, 1976). Thus, the central
tendency measures commonly take different values, even when the
dataset approximates a normal distribution (see below).
The fundamental differences between arithmetic means and
medians are presented in Table 1.4. The mode, on the other hand,
can be defined as the most repeated value in a given dataset
(Manikandan, 2011). As the most frequent value is not necessarily
unique, more than one value can represent the mode. Therefore, this
statistic is rarely used in clinical research, except in the case of data
with bimodal distribution (Manikandan, 2011).
Table 1.4. Central tendency and dispersion measures for a
continuous variable (Manikandan, 2011).
Data distribution | Measure | Statistic | Definition | Advantages | Disadvantages
Normal | Central tendency | Arithmetic mean | Sum of observations divided by the total number of observations | Easy to calculate and comprehend | Sensitive to outliers; not appropriate for skewed distributions; cannot be calculated for categorical and ordinal data
Normal | Dispersion | Standard deviation | Square root of the variance | Easy to interpret; in contrast to the variance, uses the same unit of measurement as the mean |
Non-normal | Central tendency | Median | Middle value of a dataset arranged in ascending (or descending) order | Insensitive to outliers, unless they are excessive; can be calculated for continuous and ordinal variables | Calculation takes into account ranks rather than actual values
Non-normal | Dispersion | Interquartile range | Middle 50% of the data (i.e., between the 25th and 75th percentiles), when arranged in quartiles | Insensitive to outliers | Not suitable for mathematical calculations
Consider the following example dataset, which includes the ages
(given in years) of 15 subjects:
14, 16, 22, 25, 28, 28, 31, 32, 37, 41, 44, 47, 47, 92, 96.
The values have been arranged in ascending order to facilitate
interpretation of the different statistics. The arithmetic mean or
average of the variable age is 40 years.
The median is 32 (i.e., the 8th observation), because 50% of
the data lie below and 50% above this value. If the number of
observations were even, the median would be the average of the two
central values (e.g., in a dataset with 16 subjects, the average of the
8th and 9th observations would represent the median) (Vetter,
2017a). Lastly, there are two modes in this dataset (28 and 47),
because each of these values appears twice. Note that the
difference between the arithmetic mean and the median is driven by
the presence of outliers (in this case, two participants aged 92 and
96 were included). If these extreme values were not included, the
arithmetic mean and median would be very similar (31.7 and 31,
respectively). It can be concluded that means are sensitive to
outliers, and that the mode does not provide relevant information in
most instances.
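The central tendency measures for this dataset can be reproduced with Python's standard library (a minimal sketch; `multimode` requires Python 3.8 or later):

```python
from statistics import mean, median, multimode

ages = [14, 16, 22, 25, 28, 28, 31, 32, 37, 41, 44, 47, 47, 92, 96]

print(mean(ages))       # 40
print(median(ages))     # 32 (the 8th of 15 ordered observations)
print(multimode(ages))  # [28, 47], a bimodal dataset

# Removing the two outliers (92 and 96) brings mean and median together
trimmed = ages[:-2]
print(round(mean(trimmed), 1), median(trimmed))  # 31.7 31
```

The `multimode` function returns every value tied for the highest frequency, which handles bimodal data such as this one.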
Question 8: What is the definition of measures of
dispersion?
A measure of dispersion is a statistic that describes the spread of
data (Vetter, 2017a). Variance, standard deviation, standard error,
and coefficient of variation are the most common statistics used in
clinical research for normally distributed data. Conversely, range and
interquartile range (IQR) are used to describe data with non-normal
distribution (Mishra, 2019).
The mathematical expressions for the variance and standard
deviation are given by

s² = Σᵢ (xᵢ − x̄)² / (n − 1), s = √[Σᵢ (xᵢ − x̄)² / (n − 1)],

where xᵢ is the ith value, x̄ is the arithmetic mean, and n is the total
number of observations.
A thorough explanation for the calculation of these statistics,
along with the rationale behind the concept of standard error, can be
found in Chapter 2. Returning to the example of variable age, we
consider the following dataset:
34, 19, 42, 54, 28, 33
As the mean for this example is 35 years, the variance is calculated
as follows:

s² = [(34 − 35)² + (19 − 35)² + (42 − 35)² + (54 − 35)² + (28 − 35)² + (33 − 35)²] / (6 − 1) = 720/5 = 144.

Therefore, the standard deviation would be exactly 12 years (√144).
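This worked calculation can be verified with the standard library's `statistics` module (a quick sketch using the same six ages):

```python
from statistics import mean, variance, stdev

data = [34, 19, 42, 54, 28, 33]

# Sample variance: sum of squared deviations divided by (n - 1)
m = mean(data)                        # 35
ss = sum((x - m) ** 2 for x in data)  # 720
print(ss / (len(data) - 1))           # 144.0

# The statistics module gives the same results directly
print(variance(data))                 # 144
print(stdev(data))                    # 12.0
```

Note that `variance` and `stdev` use the sample (n − 1) denominator shown in the formula above; `pvariance` and `pstdev` are their population (n) counterparts.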
Figure 1.3 provides an illustration of the concept of standard
deviation as a measure of variability, based on different examples of
datasets with normal distribution.
Figure 1.3. Standard deviation of 4 imaginary datasets with mean 0 (dashed black
line) and different standard deviations represented by red, black, green and blue
curves, respectively. The different shape of normal distributions is governed by the
variability of data expressed in terms of standard deviation. In general, standard
deviations estimated from samples with a small number of observations are less
precise than those estimated from larger samples.
In some instances, the coefficient of variation is used as a
measure of dispersion. This is defined as the standard deviation
divided by the arithmetic mean and is expressed as a percentage.
Using the same example, this statistic would equal 34.29% [(12/35)
× 100]. Thus, according to this coefficient, the standard deviation
represents 34.29% of the mean value (Mishra, 2019).
Lastly, range and IQR are the preferred measures of
dispersion, used to describe data following a non-normal distribution.
Continuing with the example given in the preceding section, consider
the following dataset, which has been arranged in ascending order:
14, 16, 22, 25, 28, 28, 31, 32, 37, 41, 44, 47, 47, 92, 96.
The range is given by the difference between the maximum and
minimum values: range = 82 (96 − 14). Conversely, to find the IQR,
the median needs to be defined first. As discussed in the previous
section, the median for this dataset is 32, and this value represents
the second quartile (Q2) of the observations, or the central tendency
measure, with 50% of the data allocated on each side of its value.
Subsequently, the first (Q1) and third (Q3) quartiles are defined by
the median values of the lower and upper halves of the data,
previously delimited by Q2:

14, 16, 22, 25 (Q1), 28, 28, 31, 32 (Q2), 37, 41, 44, 47 (Q3), 47, 92, 96.

From this example, it is clear that the values for Q1, Q2, and Q3 are
25, 32, and 47, respectively. Consequently, the IQR is given by the
subtraction of Q1 from Q3: IQR = 22 (47 − 25). Note the differences in
measures between the arithmetic mean (40) and median (32), and
the standard deviation (24.2) and IQR (22), owing to the presence of
outliers (values 92 and 96).
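The range and IQR above can be reproduced with a short script. Note that quartile conventions vary between software packages (e.g., linear-interpolation methods can give slightly different Q1 and Q3); this sketch follows the chapter's rule of taking the median of each half, with the overall median excluded since n is odd:

```python
from statistics import median

ages = [14, 16, 22, 25, 28, 28, 31, 32, 37, 41, 44, 47, 47, 92, 96]

data_range = max(ages) - min(ages)  # 96 - 14 = 82

# Q2 is the median; Q1 and Q3 are the medians of the lower and
# upper halves (median excluded, since n = 15 is odd)
s = sorted(ages)
n = len(s)
q2 = median(s)
q1 = median(s[: n // 2])        # lower half: first 7 values
q3 = median(s[(n + 1) // 2 :])  # upper half: last 7 values
iqr = q3 - q1

print(data_range, q1, q2, q3, iqr)  # 82 25 32 47 22
```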
Normality and Homogeneity of Variances
Question 9: What is the difference between
variables with normal and non-normal
distribution?
In statistics, normality refers to the compatibility of a set of
observations with a specific pattern of data dispersion, known as a
Gaussian distribution (Altman, 1995). The term Gaussian has been
coined to honor the work of the German mathematician Karl
Friedrich Gauss (1777–1855), who first described the symmetrical
distribution of random errors (Pontes, 2018).
The expectation of normality depends on the number of
observations included in the sample. Thus, small samples may not
necessarily appear symmetrical. However, as the sample size
increases, the resemblance to a bell curve becomes more noticeable. This
phenomenon is related to the fact that the normality assumption, on
which many statistical tests rely, refers to the population of interest,
and not the sample itself. As large samples tend to more accurately
represent the distribution of the population they come from (target
population), their distribution resembles a bell shape when data are
drawn from a normally distributed population (Altman, 1995). The
difference between sample distribution, population distribution, and
distribution of sample means is examined in Chapter 2, in the
context of the central limit theorem, which is a key concept for the
interpretation of inferential statistics.
Figure 1.4 illustrates a typical sample with a normal
distribution, and the theoretical area under the curve estimated for a
given value allocated a number of standard deviations away from the
mean. The estimation of these values is also crucial for the inference
of results from a study sample to the target population (see Chapter
2).
Figure 1.4. Graphic representation of a numerical continuous variable following a
normal distribution, where x̄ = mean and s = standard deviation. For any normally
distributed variable, x̄ approximates the median and mode. The areas under the
curve within ±1s, ±1.96s and ±3s correspond to 68%, 95% and 99.7%, respectively.
The range between −1.96s and 1.96s has important connotations for the definition of
95% confidence intervals and alpha critical values (see Chapter 2).
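The areas quoted in this caption can be verified with the standard library's `statistics.NormalDist` (a quick sketch; requires Python 3.8+):

```python
from statistics import NormalDist

z = NormalDist()  # standard normal distribution: mean 0, sd 1

# Area under the curve within +/-1, +/-1.96 and +/-3 standard deviations
for k in (1, 1.96, 3):
    area = z.cdf(k) - z.cdf(-k)
    print(f"within {k} sd: {area:.4f}")
# within 1 sd: 0.6827
# within 1.96 sd: 0.9500
# within 3 sd: 0.9973
```

The value 1.96 is the familiar multiplier for 95% confidence intervals, which is why it reappears in Chapter 2.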
By contrast, a non-normal distribution is simply defined as any other
pattern (not necessarily skewed) that does not fit into the category of
Gaussian distribution. Some of the most common non-normal
distributions encountered in clinical research include the following:
Log-normal distribution
Student’s t distribution
F-Snedecor distribution
Chi-square distribution
Binomial distribution
Poisson distribution
Although a detailed explanation of these concepts goes beyond the
scope of this discussion, some are explored in depth in the coming
chapters.
Question 10: How to check normality in a given
dataset?
Broadly speaking, there are empirical and formal methods to
evaluate normality in a given dataset. The empirical approach refers
to the use of graphs to inspect the data distribution. The most
common graphs used in clinical research to assess normality are
shown in Figure 1.5 (Wilk, 1968).
Figure 1.5. Histograms, Q-Q plots and P-P plots for the visual assessment of
normal distribution. Top left: Histogram for a numerical variable with n=700,
median 25, interquartile range (IQR) 36, and range of values between 0 and 200,
depicting non-normal distribution. Top middle: Q-Q plot representing the empirical
quantiles from the same data, against the theoretical quantiles. The empirical
values are depicted in orange, and the normal theoretical values are represented
by a straight blue line. Remarkably, the observed values are systematically
different from those obtained from a normal distribution. Also, note the
representation of extreme values (>110) in the y axis. As a result, the empirical
quantiles do not depict a straight line. Top right: P-P plot to compare the
cumulative distribution for the same data. The interpretation of this graph is very
similar to the Q-Q plot. Of note, the x and y axes always use a scale from 0 to 1,
because they represent the cumulative area under the curve for a given value if
the variable followed a normal distribution (see also Figure 1.3). Bottom left:
Histogram representing a numerical variable normally distributed. In this example,
700 observations with values between 60 and 140 were included. The x̄ and s
obtained were 95 and 13, respectively. Bottom middle: Q-Q plot for the same
example. Since the variable was normally distributed, the correlation between
empirical and theoretical normal quantiles was almost flawless, thus resembling a
straight line. Bottom right: P-P plot representing the same data. The interpretation
in this case is comparable to the preceding plot. Although in this figure Q-Q and P-
P plots were used to compare empirical with normal data, they can also be used to
contrast the obtained observations with other distributions (Wilk, 1968).
Conversely, formal tests are used to evaluate the normality of
data. The most popular tests used in clinical research are the
Shapiro–Wilk, Shapiro–Francia, Kolmogorov–Smirnov, skewness,
and kurtosis (Table 1.5). The methodological aspects of these tests
are discussed in the following sections. As most statistical software
packages can perform these tests without difficulty, the technical
details of these statistics will be explained here only for illustrative
purposes. However, a deep understanding of these methodologies is
important for interpreting the test results.
Table 1.5. Commonly used tests for the evaluation of normality for a
given continuous variable.
Test | Comments
Shapiro-Wilk | Very sensitive to departures from normality
Shapiro-Francia | Simplification of the Shapiro-Wilk test; based on the Pearson correlation coefficient; described for sample sizes >50; correlates the ordered sample values with the ordered expected quantiles (approximated); may be a substitute for quantile-quantile (Q-Q) plots
Kolmogorov-Smirnov | Low power to test normality; originally described to test goodness of fit
Skewness/Kurtosis | Skewness of 0 and kurtosis of 3 describe a dataset with a perfect normal distribution
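Assuming the NumPy and SciPy libraries are available (they are not part of the standard library), the tests in Table 1.5 can be run in a few lines; the data below are simulated for illustration, using the parameters of the normally distributed example in Figure 1.5:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.normal(loc=95, scale=13, size=700)  # simulated normal data

w, p_sw = stats.shapiro(sample)  # Shapiro-Wilk
d, p_ks = stats.kstest(sample, "norm", args=(95, 13))  # Kolmogorov-Smirnov

print(f"Shapiro-Wilk: W={w:.4f}, p={p_sw:.3f}")
print(f"Kolmogorov-Smirnov: D={d:.4f}, p={p_ks:.3f}")

# Skewness near 0 is expected for normal data; note that scipy reports
# excess kurtosis (kurtosis minus 3), so values near 0 indicate normality
print(round(stats.skew(sample), 2), round(stats.kurtosis(sample), 2))
```

With the null hypothesis of normality, p-values above 0.05 are consistent with normally distributed data; a caveat is that `scipy.stats.kurtosis` uses the excess-kurtosis scale, not the scale (normal = 3) used in Table 1.5.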
Question 11: How is the Shapiro–Wilk test
calculated?
Figure 1.6 provides a detailed description of this test based on a
working example. Data are organized in descending order and
compared to the ai coefficients given by the authors in their original
work (Shapiro, 1965). Subsequently, the W statistic is calculated and
compared to the level of significance provided by the same authors
(Shapiro, 1965). Given the null hypothesis of normality (see Chapter
2), any P-value < 0.05 should lead to the conclusion that there is
evidence of a non-normal distribution. The original paper published by
Shapiro and Wilk was intended for samples with up to 50 subjects.
However, Royston and Rahman proposed an extension of this test
for samples containing up to 2000 and 5000 subjects, respectively
(Royston 1992, Rahman 1997).
Figure 1.6. Shapiro-Wilk test. Calculation of the W statistic for testing normality
with the Shapiro-Wilk methodology, for the dataset given in column X. After arranging
the data in descending order (Xo), each difference di is obtained from the subtraction
between consecutive larger and smaller values. Subsequently, the differences are
multiplied by the coefficients ai for n subjects, according to the table provided in
Shapiro and Wilk's original work (Shapiro, 1965). The Shapiro-Wilk statistic is
obtained from the equation W = (Σ ai di)² / Σ (xi − x̄)², where xi is the ith value and
x̄ is the sample mean. The result (0.9415) is then compared with the theoretical
P-values for a given sample size n, provided by the same authors (Shapiro, 1965).
Thus, by looking at the W test table, the P-value for 0.9415 and n=20 should be
somewhere between 0.1 and 0.5 (Shapiro, 1965). In this example, the null
hypothesis of normality cannot be rejected, and therefore there is not enough
evidence for a non-normal distribution of the data. Consequently, the test suggests
that the data are normally distributed.
Question 12: How is the Kolmogorov–Smirnov test
calculated?
Figure 1.7 illustrates the methodology in detail. This test involves
three fundamental steps. In the first step, the empirical values of the
samples are calculated. To this end, the frequency of the different
values is recorded, and the cumulative frequency is divided by the
total number of observations. In the second step, theoretical values
are estimated. Thus, values are standardized to obtain a Z score,
and incorporated into an equation to calculate the theoretical value
for a cumulative distribution function, if data had a normal
distribution. In the final step, the differences between the empirical
and theoretical values are obtained, and the highest value is
compared to the critical value for the given sample size provided by
the authors. Given the null hypothesis of normality (see Chapter 2),
any P-value < 0.05 should lead to the conclusion that there is
evidence of a non-normal distribution (Kolmogorov, 1933).
Figure 1.7. Kolmogorov-Smirnov’s method to test normality of data, using the
example given in the table. Sixty subjects were assigned to three different
treatment groups (A, B, C), and the severity of pain was evaluated using a scale
from 0 to 10 (Visual Analogue Scale). With this methodology, each cumulative
count (c) is divided by the total number of subjects (c/N). Then, the individual
values are standardized (Z-score), where xi is the individual value, x̄ is the mean
of the sample (5.8333), and s is the standard deviation of the sample (2.401741). The
Probability Density Function (PDF), denoted as f(x), is calculated based on the Z-
score, and the area under the curve [F(x)] is then computed based on f(x). F(x) is
the Cumulative Distribution Function (CDF), which represents the theoretical
position that the variable would take if the data had a normal distribution. Finally, the
empirical and theoretical values are subtracted [c/N − F(x)], and the highest value
obtained (0.0941) is compared with the critical value provided in the Kolmogorov-
Smirnov table. Any value larger than the critical value will lead to the rejection
of the null hypothesis of normality (Kolmogorov, 1933).
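The three steps in this caption can be reproduced with the standard library alone. A minimal sketch with made-up pain scores (the 60 original observations are not listed in the text, so the values below are hypothetical; strictly speaking, estimating the mean and SD from the sample calls for Lilliefors' adjusted critical values):

```python
from statistics import NormalDist, mean, stdev

# Hypothetical pain scores (0-10), for illustration only
scores = [2, 3, 3, 4, 5, 5, 5, 6, 6, 7, 7, 8, 8, 9, 10]

m, s = mean(scores), stdev(scores)
dist = NormalDist(m, s)
n = len(scores)

# Step 1: empirical cumulative proportions; Step 2: theoretical CDF F(x);
# Step 3: D = largest gap between the empirical and theoretical curves,
# checked just before and just after each ordered observation
xs = sorted(scores)
d = max(
    max((i + 1) / n - dist.cdf(x), dist.cdf(x) - i / n)
    for i, x in enumerate(xs)
)
print(round(d, 4))  # compare with the K-S critical value for this n
```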
Question 13: What is the definition of skewness?
In descriptive statistics, skewness is defined as a measure of data
symmetry. From this perspective, asymmetric data can be right or
left-skewed, as illustrated in Figure 1.8. Accordingly, a set of
perfectly symmetrical observations would have zero skewness. Of
note, most biological variables are not necessarily balanced, and the
value of zero would therefore be a point of reference for measuring
the degree of asymmetry of a given distribution. Thus, the closer the
skewness is to zero, the more likely the data are normally
distributed.
On the other hand, sample size plays an important role in
defining skewness. While the asymmetry is difficult to judge in
samples with <25 observations, large datasets following a normal
distribution usually have a skewness close to zero. The typical
values of the skewness statistic for a given sample size can be
obtained from specialized tables (Doane, 2011).
Figure 1.8. Example of two different datasets with negative (left) skewness, and
positive (right) skewness. As opposed to a dataset with no skewness (see Figure
1.4), the values for the arithmetic mean, median and mode differ. In the case of
negatively skewed data, the mean is smaller than the median, and vice versa.
Question 14: What is the definition of kurtosis?
Kurtosis can be defined as a measure of the extent of outliers in a
given dataset. It must be emphasized that the measure of kurtosis is
different from the dispersion estimated by the standard deviation
(see Figure 1.3). In contrast to this latter statistic, the kurtosis
measures the heaviness of the tails for a given distribution (and
therefore the influence of outliers), as compared to the normal
distribution. Thus, datasets with similar standard deviations can have
different kurtosis values. Data distributions can be leptokurtic,
platykurtic, or mesokurtic, depending on whether the tails are heavy,
light, or similar to a normal distribution, respectively (Figure 1.9). By
definition, a set of data with a normal distribution has a kurtosis of 3.
However, most variables in clinical research only approximate
this value and can still be considered normally distributed.
Figure 1.9. Different types of kurtosis. Leptokurtic (heavy tailed), mesokurtic and
platykurtic (light tailed) data are represented in the left, middle and right panel,
respectively.
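The definitions in Questions 13 and 14 can be computed directly as the standardized third and fourth moments. A minimal sketch using the standard library (the datasets are made up; kurtosis is reported on the scale where a normal distribution equals 3, as in the text, whereas many software packages report excess kurtosis, i.e., this value minus 3):

```python
from statistics import mean, pstdev

def skewness(data):
    """Third standardized moment: 0 for perfectly symmetric data."""
    m, s, n = mean(data), pstdev(data), len(data)
    return sum((x - m) ** 3 for x in data) / (n * s ** 3)

def kurtosis(data):
    """Fourth standardized moment: 3 for a normal distribution."""
    m, s, n = mean(data), pstdev(data), len(data)
    return sum((x - m) ** 4 for x in data) / (n * s ** 4)

symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]
right_skewed = [1, 1, 2, 2, 3, 4, 8, 15, 40]

print(skewness(symmetric))         # 0.0 (perfectly balanced)
print(skewness(right_skewed) > 0)  # True (long right tail)
print(kurtosis(symmetric) < 3)     # True (light tails: platykurtic)
```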
Question 15: How to check homogeneity of
variances in a given dataset?
Some parametric statistical tests, including t-test and analysis of
variance (ANOVA), require the assumption of homogeneity of
variances (homoscedasticity). A detailed discussion of the statistical
robustness of these tests against departures from homoscedasticity
can be found in Chapters 3 and 5, respectively. In a nutshell, the t-test
and ANOVA are deemed robust against violations of the assumption
of homoscedasticity. Some of the customary tests used to assess
the homogeneity of variances are listed in Table 1.6. The choice of
the appropriate test depends on the departures from the normality
assumption and the number of groups to be compared. Although a
thorough examination of these tests is beyond the scope of this
book, the reader is referred to the original work published by these
authors (Bartlett 1937; Levene 1960; Brown 1974).
Table 1.6. Some of the most frequently used tests to check
homogeneity of variances for a given continuous variable. The
trimmed mean used by the Brown-Forsythe methodology truncates a
percentage of extreme values at both ends of the dataset, thus
making the mean less sensitive to outliers.
Test | Comments
F-Test | Variance ratio test; frequently used to test homoscedasticity before conducting a t-test
Bartlett | Very sensitive to departures from normality; includes a correction factor for overestimation
Levene | Robust against departures from normality
Brown-Forsythe | Robust against departures from normality; as opposed to Levene's test, uses medians or trimmed means instead of arithmetic means
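Assuming NumPy and SciPy are available, the tests in Table 1.6 can be run directly on simulated data (the groups and their spreads below are invented for illustration). One detail worth noting: `scipy.stats.levene` defaults to `center="median"`, which is actually the Brown-Forsythe variant, so the classical Levene test needs `center="mean"`:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
group_a = rng.normal(120, 10, size=40)  # e.g., systolic blood pressure
group_b = rng.normal(125, 10, size=40)  # similar spread
group_c = rng.normal(122, 25, size=40)  # much larger spread

# Bartlett's test (sensitive to non-normality)
stat_b, p_b = stats.bartlett(group_a, group_b, group_c)

# Levene's original test (deviations from group means)
stat_l, p_l = stats.levene(group_a, group_b, group_c, center="mean")

# Brown-Forsythe variant (deviations from group medians)
stat_bf, p_bf = stats.levene(group_a, group_b, group_c, center="median")

# Small p-values provide evidence of heteroscedasticity
print(round(p_b, 4), round(p_l, 4), round(p_bf, 4))
```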
Data Transformation
Question 16: What are the alternatives to non-
parametric tests when dealing with data following
a non-normal distribution?
The ability of a non-parametric test to detect a true difference
between two or more samples (i.e., its statistical power) is
considerably diminished when compared to that of its parametric
counterpart, with the notable exception of highly skewed distributions
and/or small samples (Sedgwick, 2015). This concept is explained by the
fact that most non-parametric tests are based on the comparison of
ranks, rather than the actual values, thus losing important
information about the magnitude of such values. Consider the
following two datasets, displayed in ascending order:
2, 5, 9, 18, 21, 23, 24, 27, 29, 30.
3, 4, 56, 79, 114, 189, 342, 945, 1342, 1938.
Although it is evident that these sets of observations represent two
different populations, most non-parametric tests would treat these
datasets equally by assigning a rank from 1 to 10 for each
observation, thereby ignoring the relevant information given by the
actual values (Sedgwick, 2015).
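The loss of information under ranking can be made concrete with a few lines of Python (a sketch using only the standard library; the two datasets are those listed above, which have no ties):

```python
def ranks(data):
    """Rank observations from 1 to n (assumes no ties, as in these examples)."""
    order = sorted(data)
    return [order.index(x) + 1 for x in data]

a = [2, 5, 9, 18, 21, 23, 24, 27, 29, 30]
b = [3, 4, 56, 79, 114, 189, 342, 945, 1342, 1938]

# Both datasets collapse to the same ranks, discarding their magnitudes
print(ranks(a))              # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(ranks(a) == ranks(b))  # True
```

Any rank-based test sees these two very different samples as identical, which is exactly why non-parametric procedures forfeit power.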
To overcome this problem, researchers have provided two
possible alternatives to non-parametric approaches:
1) Use of parametric tests relying on the principle of central
limit theorem
2) Transformation of data with non-normal distribution.
A comprehensive analysis of the concept of central limit theorem is
presented in Chapter 2. In short, samples with a sufficient number of
subjects (usually >30, accounting for the total sample size) can be
compared with parametric tests, irrespective of their distribution
pattern. However, the degree of skewness dictates the number of
subjects needed to overcome the problem of non-normal distribution:
the more asymmetric the distribution, the larger the sample size
needed to rely on the central limit theorem (Kwak, 2017).
Lastly, data transformation seems to be a good alternative
when dealing with a set of observations with a non-normal
distribution. Although it works in selected cases where skewed
datasets can be successfully shifted into symmetric distributions,
there are some drawbacks that deserve further discussion; these
are examined later in this chapter.
Question 17: What are the most common types of
data transformation?
There are at least five methods to convert data into a normal
distribution:
Logarithmic transformation
Square root transformation
Reciprocal transformation
Cube root transformation
Square transformation
Logarithmic transformation is, by far, the most common approach
used in clinical research, followed by square root and reciprocal
transformation (Manikandan, 2010). Nonetheless, these
methodologies have specific indications,usually dictated by the
original distribution of data, and the relationship (if any) between the
mean and the variance of the variable, in the hypothetical case
where random samples were taken from the same population. By
contrast, when the standard deviations of random samples are
independent of their mean, a normal distribution can be reasonably
assumed (Bland, 1996a). Table 1.7 summarizes the theoretical
relationship between variances and means, which influences the
choice of the most suitable transformation.
Table 1.7. Correlation between variance (s²) and mean (x̄) when
random samples are taken from a population following a non-normal
distribution. The square root transformation is suitable for samples
with Poisson distribution (i.e., samples including a count of any event
of interest). In this particular distribution, the variance is equal to its
mean (Bland, 1996a). Logarithmic transformation is advisable for
highly right skewed data (see Figure 1.8), in which case the variance
is usually proportional to the mean square. Lastly, reciprocal
transformation is recommended for datasets with extremely variable
observations (as the serum concentrations of many substances), in
which case the logarithmic transformation might not be sufficient. In
such cases, the variance is proportional to the fourth power of the
mean (Bland 1996a, Manikandan 2010).
Proportionality to s² | Type of transformation | Example
s² ∝ x̄ | Square root | Number of deliveries in a maternity ward
s² ∝ x̄² | Logarithmic | Serum triglycerides
s² ∝ x̄⁴ | Reciprocal | Serum creatinine
In practice, because taking random samples from the same
population may be inconvenient or even unfeasible, researchers
usually need to test the different methods of transformation based on
the type of measurement and the shape of the original distribution
(Bland, 1996c). This practice can be justifiable, considering that data
transformation is not a statistical test itself, and therefore, the
problem of multiple testing is not applicable here. Ultimately, the goal
is to remove the correlation between variances and their means,
thereby making the dataset normally distributed.
To further illustrate these concepts, Table 1.8 uses the
exemplary dataset given in the section on central tendency and
dispersion measures to compare the different types of
transformations.
Table 1.8. Methodologies for data transformation most often used in
clinical research. The top row shows the age of 15 subjects (given in
years), using the same imaginary dataset provided in the section on
central tendency and dispersion measures. Square root, logarithmic
and reciprocal transformations are carried out based on the original
dataset proposed. The logarithmic transformation has been obtained
using the common logarithm with base 10 and the natural logarithm
with base e, only for illustrative purposes. In practice, both
methodologies should yield the same results, although the
logarithmic transformation with base e is more common in clinical
research. After transforming data, the distribution should be
inspected using the strategies discussed in this chapter, in order to
evaluate the appropriateness of parametric tests. The means for the
transformed data correspond to 40 (natural scale), 6.0973 (square root
transformation), 1.5397 (logarithmic transformation with base 10),
3.5453 (logarithmic transformation with base e), and 0.0328
(reciprocal transformation). Importantly, the only interpretable mean
is the one obtained from the raw data (40 years). Hence, the
transformed means need to be converted back to the original scale
by squaring, taking the antilogarithm, or taking the reciprocal value,
respectively. Accordingly, the means in the natural scale are 37.17
(6.0973²), 34.65 (10^1.5397, or e^3.5453) and 30.49 (1/0.0328), for the
square root, logarithmic and reciprocal transformations, respectively. In the
particular case of the logarithmic transformation, the result obtained is
the geometric mean (see also Table 1.3) (Bland, 1996b).
Age (natural scale) | 14 | 16 | 22 | 25 | 28 | 28 | 31 | 32
Square root transformation | 3.74 | 4 | 4.69 | 5 | 5.29 | 5.29 | 5.57 | 5.66
Log transformation (base 10) | 1.15 | 1.2 | 1.34 | 1.4 | 1.45 | 1.45 | 1.49 | 1.51
Log transformation (base e) | 2.64 | 2.77 | 3.09 | 3.22 | 3.33 | 3.33 | 3.43 | 3.47
Reciprocal transformation | 0.07 | 0.06 | 0.05 | 0.04 | 0.04 | 0.04 | 0.03 | 0.03

Age (natural scale) | 37 | 41 | 44 | 47 | 47 | 92 | 96
Square root transformation | 6.08 | 6.4 | 6.63 | 6.86 | 6.86 | 9.59 | 9.8
Log transformation (base 10) | 1.57 | 1.61 | 1.64 | 1.67 | 1.67 | 1.96 | 1.98
Log transformation (base e) | 3.61 | 3.71 | 3.78 | 3.85 | 3.85 | 4.52 | 4.56
Reciprocal transformation | 0.03 | 0.02 | 0.02 | 0.02 | 0.02 | 0.01 | 0.01
In clinical research, common and natural logarithms are often used.
Common logarithms refer to the exponent needed with base 10 to
obtain a certain number (for example, the logarithm with base 10 of
1000 is 3), while natural logarithms are concerned with the exponent
required with base e to get a given number, where e is a
mathematical constant approximately equal to 2.7183 (for instance,
the logarithm with base e of 2 is approximately 0.693). Importantly,
the type of logarithm applied to a given set of data is just a matter of
convenience for the relevant calculations, and the conclusions
obtained should therefore be identical (see Table 1.8). In practice, natural
logarithms are usually preferred among statisticians (Bland, 1996d).
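These relationships are easy to verify numerically (a quick sketch with the standard library; the value 34.65 is the back-transformed geometric mean from Table 1.8):

```python
import math

# Common logarithm: base 10
print(math.log10(1000))       # 3.0

# Natural logarithm: base e (~2.7183)
print(round(math.log(2), 3))  # 0.693
print(round(math.e, 4))       # 2.7183

# The base is a matter of convenience: back-transformation recovers
# the same value on the original scale either way
x = 34.65
assert math.isclose(10 ** math.log10(x), math.e ** math.log(x))
```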
A common misconception regarding logarithmic
transformation is the belief that any type of data can be transformed
to obtain a normal distribution. Unfortunately, data can only be
transformed with this methodology in selected cases, where the
underlying distribution is suitable for such modifications. This
distribution is known as a log-normal distribution, and it is
characterized by highly right-skewed data, similar to the one
depicted in the right panel of Figure 1.8. Hence, a log-normal
distribution is by nature non-normally distributed, but it can be shifted
to a normal distribution by taking the logarithm of the observations.
Question 18: What are the disadvantages of data
transformation?
The main drawback of data transformation is the difficulty in
interpreting reported standard deviations and, consequently,
confidence intervals (see Chapter 2).
In the case of comparison between two means using square
root and reciprocal transformations, confidence intervals obtained
from transformed data cannot be converted to meaningful units of
measure, and therefore, it is not possible to interpret the results
(Bland, 1996b).
Conversely, the logarithm of the mean for a given set of
observations can be transformed back to its original scale, thereby
making this statistic clinically interpretable. On the other hand,
although the standard deviation of logarithmic-transformed data
cannot be transformed back to the original scale, it can be
interpreted in the context of a dimensionless ratio, rather than a
difference between each observation and the mean of the
transformed data. This is because, according to the quotient rule for
logarithms, the difference between the logarithm of two values
equals the logarithm of their ratio (Bland, 1996d). A possible solution
to this problem is to calculate the confidence intervals of the
transformed values, and then take the antilogarithm of the results.
Thus, confidence intervals can be reported on the original scale.
However, this strategy is valid only when dealing with the
transformed data of one sample.
By contrast, when it comes to the transformation of data for
the comparison of two means, the interpretation of confidence
intervals can be more challenging. This is because the values
reported represent the difference between the two means on the
transformed scale. Again, following the quotient rule mentioned
earlier, this difference between the two logarithms corresponds to the
ratio of those values. Consequently, the confidence intervals
obtained for the difference between two means from a set of
transformed data should be interpreted as the limits for the
geometric mean ratio when data are transformed back to the original
scale.
Similar to any ratio scale, a value of 1 represents no
difference between the means. Therefore, a confidence interval of,
say, −1.34 to 0.46 (assuming logarithmic-transformed data with
base e) would correspond to 0.26 to 1.58 after taking the respective
antilogarithms. As this interval includes the value of 1, it can be
concluded that the difference between the two groups is not
significant (Bland, 1996c).
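This back-transformation can be sketched in a few lines. The group values below are hypothetical skewed measurements, and the pooled degrees of freedom are an illustrative simplification; the point is only to show the log-scale confidence interval and its antilogarithm.

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed measurements for two groups (illustrative values only)
group_a = np.array([1.2, 2.5, 3.1, 4.8, 7.9, 12.4, 15.0, 20.3])
group_b = np.array([0.9, 1.8, 2.2, 3.5, 5.1, 8.8, 10.2, 14.7])

log_a, log_b = np.log(group_a), np.log(group_b)

# Difference of means on the log scale = log of the geometric mean ratio
diff = log_a.mean() - log_b.mean()
se = np.sqrt(log_a.var(ddof=1) / len(log_a) + log_b.var(ddof=1) / len(log_b))
t_crit = stats.t.ppf(0.975, df=len(log_a) + len(log_b) - 2)

ci_log = (diff - t_crit * se, diff + t_crit * se)

# Taking antilogarithms gives the limits for the geometric mean ratio
ci_ratio = (np.exp(ci_log[0]), np.exp(ci_log[1]))
gm_ratio = np.exp(diff)  # equals gmean(group_a) / gmean(group_b)
```

If `ci_ratio` includes the value 1, the difference between the groups is not significant on the original scale, exactly as described above.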
Choosing the Right Statistical Test
Question 19: How to choose the appropriate
statistical test, given a set of variables?
The selection of statistical tests to adequately address the research
question of interest fundamentally depends on the following factors:
The type of dependent variable or outcome: Continuous,
categorical.
The type of independent variable(s): Continuous,
categorical, continuous + categorical.
The distribution of the dependent variable: Normal
distribution, non-normal distribution.
The study design: Independent measures, repeated
measures.
Table 1.9 provides a summary of the main statistical tests used in
clinical research. Based on the type of variables included in the
model and the distribution of the variable of interest, different
alternatives influencing the choice of the appropriate statistical test
are included in this table.
Table 1.9. Choice of the appropriate statistical test according to the
study variables included. The dependent variable is usually the
outcome of interest. Dependent and independent variables can be
further classified on the grounds of the nature of the observations
included (continuous, categorical, ordinal variables), and the
dependency of measures (independent measures, repeated
measures). In turn, categorical variables can be dichotomous (paired
or unpaired) or non-dichotomous, depending on the number of levels
included. Ordinal variables, on the other hand, require specific
statistical tests that account for the natural order of the observations
included (e.g., Cochran-Armitage test). For example, the outcome
variable blood pressure can be continuous if the values are reported
in mm Hg. It can also be dichotomized (hypertension yes=1, no=0)
or transformed to an ordinal scale (e.g., mild hypertension, moderate
hypertension, severe hypertension). Lastly, when the dependent
variable is continuous, researchers should consider the distribution
of observations included in the model (normal vs. non-normal
distribution), in order to use parametric or non-parametric
tests. ANOVA: Analysis of Variance.
| Independent variable(s) | Dependent variable (outcome) | Statistical test |
| --- | --- | --- |
| Continuous | Continuous, parametric | Linear regression (Pearson's r) |
| Continuous | Continuous, non-parametric | Linear regression (Spearman's rho) |
| Continuous | Categorical, dichotomous: binomial (0, 1) | Logistic regression |
| Continuous | Categorical, dichotomous: survival (yes, no) | Cox regression |
| Continuous | Categorical, non-dichotomous | Multinomial regression |
| Continuous + categorical | Continuous, parametric | Two-way ANOVA |
| Continuous + categorical | Categorical, dichotomous: binomial (0, 1) | Logistic regression |
| Continuous + categorical | Categorical, dichotomous: survival (yes, no) | Cox regression |
| Continuous + categorical | Categorical, non-dichotomous | Multinomial regression |
| Categorical, dichotomous (paired) | Continuous, repeated measures, parametric | Paired t-test |
| Categorical, dichotomous (paired) | Continuous, repeated measures, non-parametric | Wilcoxon signed-rank test |
| Categorical, non-dichotomous | Continuous, repeated measures, parametric | Repeated-measures ANOVA |
| Categorical, non-dichotomous | Continuous, repeated measures, non-parametric | Friedman test |
| Categorical, dichotomous | Continuous, independent measures, parametric | t-test |
| Categorical, dichotomous | Continuous, independent measures, non-parametric | Mann-Whitney U test |
| Categorical, non-dichotomous | Continuous, independent measures, parametric | One-way ANOVA |
| Categorical, non-dichotomous | Continuous, independent measures, non-parametric | Kruskal-Wallis test |
| Categorical | Categorical, dichotomous (paired), repeated measures | McNemar test |
| Categorical | Categorical, non-dichotomous, repeated measures | Cochran-Mantel-Haenszel test |
| Categorical | Categorical, dichotomous or non-dichotomous, independent measures | Chi-square, Fisher exact test |
| Categorical | Ordinal | Cochran-Armitage test |
From the perspective of data measurement, experimental designs
can have repeated or independent measures, depending on whether
the observations are taken from the same subject or from different
participants. Thus, when values are obtained before and after an
intervention, clinical trials can have a paired (if there is only one
measure after the intervention) or repeated-measures design (i.e.,
there is more than one measure following the intervention). By
contrast, independent measures are never obtained from the same
subject.
To further illustrate this, consider a clinical trial designed to
evaluate the effectiveness of two medications for the treatment of
chronic pain. As the severity of pain experienced before receiving
the treatment is important to measure the improvement (or
worsening) of the symptoms, researchers may be interested in the
evaluation of pain before and after the procedure for every single
patient (paired design). Conversely, if the basal pain is assumed to
be equal among participants, investigators may only be interested in
the assessment of pain after the intervention. For instance, when
comparing the effectiveness of two anesthetic techniques for the
treatment of acute postoperative pain 24 hours after a surgical
procedure, a design with independent measures may be suitable.
However, in the latter case, clinicians are often interested in pain
scores obtained at multiple time points (e.g., at 2, 6, 24, and 48
hours). When these measures are relevant to the research question,
a repeated-measures design is appropriate.
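The mapping from design to test can be sketched with simulated data. The pain scores below are invented, and the group sizes are arbitrary; only the `scipy` test functions themselves are taken as given.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Paired design: the same 20 patients measured before and after treatment
before = rng.normal(7.0, 1.0, size=20)
after = before - rng.normal(1.0, 0.8, size=20)
t_paired, p_paired = stats.ttest_rel(before, after)  # paired t-test

# Independent design: two different groups, each measured once after surgery
group_1 = rng.normal(5.0, 1.5, size=25)
group_2 = rng.normal(6.0, 1.5, size=25)
t_indep, p_indep = stats.ttest_ind(group_1, group_2)  # unpaired t-test
```

The paired test analyzes within-subject differences, which is why it is equivalent to a one-sample t-test on `before - after`; the independent test compares two unrelated groups.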
Occasionally, in the context of observational studies,
researchers may focus on the measurement of variables of two
different subjects with similar characteristics (e.g., age, gender,
ethnicity, literacy skills, etc.). In such cases of matched (paired) data,
observations cannot be considered independent (Barkan, 2015).
Accordingly, a suitable statistical test must be selected for data
analysis based on data independence.
However, data independence of study designs (i.e., paired,
unpaired, and repeated-measures observations) should be
distinguished from the independence of variables in a given
statistical model. From this perspective, variables can also be
dependent or independent, as discussed in this chapter (see Table
1.2). Whereas dependent variables are often the outcome of interest
in any study sample, independent variables are represented by the
observations collected by the researcher, usually between two or
more groups, in order to compare the outcome of interest after a
given intervention or exposure. Going back to our example, if pain is
the variable of interest, it would become the dependent variable.
Independent or explanatory variables may include, for instance,
gender, ethnicity, age, and group allocation (e.g., treatment, placebo)
or exposure (presence or absence of exposure).
Question 20: Is it a good idea to categorize
continuous variables, to facilitate statistical
analysis?
The categorization of continuous data (often in the form of
dichotomous variables) may constitute a valid method for many
researchers, particularly when the cut-offs have been well
established (for instance, hypertension yes = 1 no = 0, or diabetes
yes = 1 no = 0) (Altman, 2006). Moreover, explanatory variables,
such as age or smoking status, are usually grouped into three or
more categories (Bennette, 2012). Notwithstanding, statisticians are
reluctant to categorize continuous variables because, much like the
use of non-parametric tests, it comes at the cost of losing
information (and therefore statistical power, i.e., the ability to detect a
true relationship between two or more variables) (MacCallum, 2002;
Altman, 2006). Thus, categorization is often discouraged from the
perspective of statistical analysis.
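The information loss can be illustrated with a small simulation. The variables, the 0.5 slope, and the median split are all arbitrary choices made for illustration; the attenuation of the observed association is the point.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 5_000

# Two correlated continuous variables (hypothetical exposure and outcome)
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)

r_continuous, _ = stats.pearsonr(x, y)

# Dichotomize the exposure with a median split (a common form of categorization)
x_binary = (x > np.median(x)).astype(float)
r_dichotomized, _ = stats.pearsonr(x_binary, y)

# r_dichotomized is attenuated relative to r_continuous, reflecting the
# loss of information (and statistical power) caused by categorization
```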
Conclusion for Chapter 1
This chapter provides a general framework to understand the core
concepts of descriptive statistics, which is essential for interpreting
the complexity of statistical analysis. In fact, clinical research cannot
be understood without adequately understanding the methodology of
descriptive statistics.
Researchers need to be familiar with the different strategies
used to summarize data efficiently and present numbers in a
comprehensible way. Likewise, clinicians should be provided with
sufficient knowledge to accurately interpret the results. Equally
important is to be wary of the possible methodological flaws of a
given study, before considering valid (or erroneous) conclusions that
could eventually change clinical practice. The generation of new
knowledge is then the result of a concerted effort between
researchers and clinicians to make translational research an efficient
tool that can bring scientific advances to the bedside.
The following chapters will provide the reader with basic
knowledge on biostatistics, which is essential for understanding the
complexity of the scientific method, and will highlight the importance
of evidence-based medicine for the decision-making process in daily
clinical practice.
References for Chapter 1
1. Altman DG, Bland JM. Statistics notes: the normal
distribution. BMJ. 1995;310(6975):298.
2. Altman DG, Royston P. The cost of dichotomising continuous
variables. BMJ. 2006;332(7549):1080.
doi:10.1136/bmj.332.7549.1080
3. Kolmogorov AN. Sulla Determinazione Empirica di Una
Legge di Distribuzione. G Dell'Istituto Ital Degli Attuari.
1933;4:83-91.
4. Barkan H. Statistics in clinical research: Important
considerations. Ann Card Anaesth. 2015;18(1):74-82.
doi:10.4103/0971-9784.148325
5. Bartlett MS. Properties of Sufficiency and Statistical Tests.
Proc R Soc Lond Ser A. 1937;160:268-282.
doi:10.1098/rspa.1937.0109
6. Bennette C, Vickers A. Against quantiles: categorization of
continuous variables in epidemiologic research, and its
discontents. BMC Med Res Methodol. 2012;12(1):21.
doi:10.1186/1471-2288-12-21
7. Bland JM, Altman DG. Statistics Notes: The use of
transformation when comparing two means. BMJ.
1996;312(7039):1153.
doi:10.1136/bmj.312.7039.1153
8. Bland JM, Altman DG. Statistics notes. Logarithms. BMJ.
1996;312(7032):700. doi:10.1136/bmj.312.7032.700
9. Bland JM, Altman DG. Transformations, means, and
confidence intervals. BMJ. 1996;312(7038):1079.
doi:10.1136/bmj.312.7038.1079
10. Bland JM, Altman DG. Transforming data. BMJ.
1996;312(7033):770. doi:10.1136/bmj.312.7033.770
11. Box GEP. Science and Statistics. J Am Stat Assoc.
1976;71(356):791-799.
doi:10.1080/01621459.1976.10480949
12. Box JF. R. A. Fisher and the design of experiments,
1922-1926. Am Stat. 1980;34(1):1-7.
13. Brown MB, Forsythe AB. Robust Tests for the Equality
of Variances. J Am Stat Assoc. 1974;69(346):364-367.
doi:10.2307/2285659
14. Doane D, Seward L. Measuring Skewness: A
Forgotten Statistic? J Stat Educ. 2011;19.
doi:10.1080/10691898.2011.11889611
15. Govani SM, Higgins PDR. How to Read a Clinical
Trial Paper. Gastroenterol Hepatol. 2012;8(4):241-
248.
16. Kwak SG, Kim JH. Central limit theorem: the
cornerstone of modern statistics. Korean J Anesthesiol.
2017;70(2):144-156. doi:10.4097/kjae.2017.70.2.144
17. MacCallum RC, Zhang S, Preacher KJ, Rucker DD.
On the practice of dichotomization of quantitative variables.
Psychol Methods. 2002;7(1):19-40. doi:10.1037/1082-
989x.7.1.19
18. Manikandan S. Measures of central tendency: The
mean. J Pharmacol Pharmacother. 2011;2(2):140-142.
doi:10.4103/0976-500X.81920
19. Mishra P, Pandey CM, Singh U, Gupta A, Sahu C,
Keshri A. Descriptive Statistics and Normality Tests for
Statistical Data. Ann Card Anaesth. 2019;22(1):67-72.
doi:10.4103/aca.ACA_157_18
20. Olkin I. Contributions to Probability and Statistics;
Essays in Honor of Harold Hotelling. Stanford University
Press; 1960.
21. Owen ARG. An appreciation of the life and work of Sir
Ronald Aylmer Fisher FRS, FSS, Sc.D. J R Stat Soc Ser D
(The Statistician). 1962;12(4):313-319.
22. Pontes EAS. A Brief Historical Overview of the
Gaussian Curve: From Abraham De Moivre to Johann Carl
Friedrich Gauss. Int J Eng Sci Invent. 2018;7(6):28-34.
23. Price DD, McGrath PA, Rafii A, Buckingham B. The
validation of visual analogue scales as ratio scale measures
for chronic and experimental pain. Pain. 1983;17(1):45-56.
doi:10.1016/0304-3959(83)90126-4
24. Rahman MM, Govindarajulu Z. A modification of the
test of Shapiro and Wilk for normality. J Appl Stat.
1997;24(2):219-236. doi:10.1080/02664769723828
25. Royston P. Approximating the Shapiro-Wilk W-test for
non-normality. Stat Comput. 1992;2(3):117-119.
doi:10.1007/BF01891203
26. Sedgwick P. A comparison of parametric and non-
parametric statistical tests. BMJ. 2015;350:h2053.
doi:10.1136/bmj.h2053
27. Shapiro SS, Wilk MB. An Analysis of Variance Test for
Normality (Complete Samples). Biometrika. 1965;52(3/4):591-
611. doi:10.2307/2333709
28. Vetter TR. Descriptive Statistics: Reporting the
Answers to the 5 Basic Questions of Who, What, Why, When,
Where, and a Sixth, So What? Anesth Analg.
2017;125(5):1797-1802.
doi:10.1213/ANE.0000000000002471
29. Vetter TR. Fundamentals of Research Data and
Variables: The Devil Is in the Details. Anesth Analg.
2017;125(4):1375-1380.
doi:10.1213/ANE.0000000000002370
30. Wilk MB, Gnanadesikan R. Probability plotting
methods for the analysis of data. Biometrika. 1968;55(1):1-
17.
Chapter 2
2. Inferential statistics
Wilson Fandino & Karla Loss
Introduction to Chapter 2
Suppose that you have recently moved into a new home, and you
are not familiar with the neighborhood. Your first challenge is to
choose a good time for grocery shopping. Only after shopping a few
times at the nearest supermarket (Shopping Mall A) may you collect
enough information as to what would be the most convenient day for
shopping, and what would be the best time to get a parking space.
One day, you decide to try another supermarket (Shopping Mall B),
which is located just blocks away from the earlier one. On this
occasion, your brain may take advantage of the data collected from
the first experience, and based on that information, you tend to avoid
shopping at the busiest times.
However, the information obtained from Shopping Mall A can
only be generalized to Shopping Mall B to the extent that people who
frequent supermarkets in your area are well represented by the
average customers of those shopping malls. This also implies that
the range of products sold in those shopping malls is comparable to
each other. For example, the busiest times of the day may be
different if one store specializes in fast food, and the other focuses
on vegan products. Under these assumptions, the information
collected from the first supermarket can be used to understand the
consumer shopping behaviors of your neighbors, thus having an
impact on your own shopping habits.
From a conceptual perspective, the field of biostatistics
encompasses two fundamental aspects: descriptive statistics and
inferential statistics. The preceding chapter examined the nuances of
descriptive statistics. In our example, it can be interpreted as the
process of summarizing the data collected from Shopping Mall A.
By contrast, inferential statistics focus on the mathematical
strategies intended to generalize findings from a study sample
(ShoppingMall A) to the population of interest (shopping malls in
your area).
On the other hand, the concept of inferential statistics has an
inherent component of uncertainty. While the results obtained in a
given study are based on a study sample, the accuracy at which they
represent the target population depends, in general, on three
elements:
The sample size of the study.
The statistical distribution of the population of interest.
The standard deviation of the population.
It should be noted that, among these core components of statistical
precision, only the sample size is known by the researcher. While the
statistical distribution of the dependent variable (normal distribution
vs. non-normal distribution) is determined based on the central
tendency and dispersion measures of the sample, the statistical
distribution of the population of interest cannot be easily defined
unless the entire population is studied, which is, at best, impractical
and expensive, and at worst unfeasible and unethical. The statistical
approaches to overcome these issues are examined in depth in this
chapter (Motulsky, 2014; Vetter, 2017).
Estimation of parameters
Question 21: Why is the definition of parameters
important in inferential statistics?
The concept of inferential statistics is based on the definition of the
parameters used in statistical tests. Therefore, the definition of
parameters is important both to choose an appropriate test and to
interpret the results. For instance, if the population from which the
sample is drawn is normally distributed, this information (along with
other parameters) can be used to choose a parametric statistical test.
continuous random variable that follows a normal distribution, the
parameters of interest are the population mean and population
standard deviation. The main notations used in biostatistics to
represent the parameters of the population and the sample study for
continuous variables are listed in Table 2.1. The parameters used for
the other random variables are described in detail elsewhere (King,
2019).
Table 2.1. Notations used in biostatistics to represent the
parameters of the population and the statistics of the study sample.

| Parameter | Population parameter | Sample statistic |
| --- | --- | --- |
| Mean | μ | x̄ |
| Standard deviation | σ | s |
| Variance | σ² | s² |
| Size | N | n |
Question 22: What exactly is the central limit
theorem? How can this paradigm be applied to
inferential statistics?
Researchers can rely on the concept of the central limit theorem to
solve the problem of unknown distribution of the population.
According to this concept, in the hypothetical case in which an
infinite number of samples was taken from the population of interest,
the distribution of the sample means would closely resemble a
normal distribution, regardless of the distribution of the sampled
population, provided that the study sample size was relatively large
(usually at least 30 subjects). The concepts of central tendency
and dispersion measures, statistical distribution, and central limit
theorem are discussed in detail in Chapter 1 (King, 2019).
A concept closely related to the central limit theorem is the
sampling distribution. Assuming a finite sample size n, if many
subsets of samples were taken from the original study sample, and
the mean of those subsamples was computed, the distribution of
those means tends to follow a normal distribution. Thus, the
sampling distribution addresses the question of how far the sample
mean would be from the true population mean (King, 2019).
For example, imagine that you are living on an island, and you
are investigating the potential impact of local dietary habits on serum
triglyceride levels in the adult population. It is possible that if you
take a sample of, say, 100 subjects, the variable measured follows a
right-skewed distribution because some people might have
hypertriglyceridemia, but hardly any of them would have very low
triglyceride serum levels (Figure 2.1) (Choi, 2016). Notwithstanding,
if you randomly take many samples of 100 adults from the same
population, compute the mean triglyceride serum levels for each
sample, and plot these values in a histogram as depicted in the right
panel of Figure 2.1, it can be ascertained that the sampling
distribution of mean triglyceride levels is normal, provided the
samples include more than 30 subjects.
Certainly, you do not need to repeatedly sample your
population, because you have included 100 observations, and as a
consequence, it is safe to assume a normal distribution for
triglyceride means across samples, based on the central limit
theorem (right panel in Figure 2.1). Note, however, that this
assumption has been made regardless of the right-skewed nature of
the sample (left panel in Figure 2.1) and the population (provided
that the sample is a good representation of the population).
In conclusion, and continuing with the same example, there
are three distributions to be considered:
The sample distribution of triglyceride serum concentration
The population distribution from which the sample was
taken
The triglyceride means distribution, also known as
sampling distribution
From these three distributions, only the first one is known by the
researcher with certainty. The second can be extrapolated, provided
that the sample was drawn randomly from the population of interest.
The third one, regardless of the shape of the first two distributions,
always approximates to normal, provided that the sample size is
≥ 30. Equally important is to understand that the assumption of the
central limit theorem does not change the skewed nature of
triglyceride serum concentrations; therefore, normality tests are
warranted before applying parametric tests, particularly in samples
with < 30 observations (see also Chapter 1).
Figure 2.1. Central limit theorem. The left panel shows a right-skewed distribution
for triglyceride serum concentrations in mg/dL. If random sets of 30
observations were repeatedly taken from this sample, and the mean was
computed for each subset of values, the distribution of those means would follow a
normal distribution, as depicted in the right panel, irrespective of the distribution of
the original sample. xᵢ = ith observation, x̄ = sample mean, μ = population mean.
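The behavior depicted in Figure 2.1 can be reproduced with a small simulation. The lognormal "population" and its parameters are arbitrary stand-ins for skewed triglyceride values; they are not from the text.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

# Right-skewed stand-in population for triglyceride levels (arbitrary parameters)
population = rng.lognormal(mean=5.0, sigma=0.4, size=100_000)

# Repeatedly draw samples of n = 100 and record each sample mean
sample_means = np.array(
    [rng.choice(population, size=100).mean() for _ in range(2_000)]
)

# The population is clearly right-skewed, but the distribution of the
# sample means is nearly symmetric, as the central limit theorem predicts
pop_skewness = skew(population)      # markedly positive
means_skewness = skew(sample_means)  # close to zero
```

Note that the simulation leaves the skewness of the original observations untouched; only the distribution of the means approaches normality.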
Question 23: Why is it considered safe to apply
the central limit theorem for n ≥ 30?
A threshold of n ≥ 30 was established based on computer
simulations for heavily skewed distributions. However, this threshold
depends on the skewness of the study sample. In our previous
example, if you take random samples of two subjects from the same
population, it is very likely that the means obtained are not normally
distributed, because they would still follow the distribution of the
population they came from. However, if the variable of interest
followed a perfectly symmetrical normal distribution, the sample
means distribution would be normal, even for small sample sizes. Of
note, such distributions are rarely observed in clinical research.
Consequently, a sample size ≥ 30 accounts for worst-case
scenarios of extremely skewed distributions, and therefore, this
threshold has long been considered to be safe for applying the
assumptions of the central limit theorem.
Question 24: How is the standard deviation of the
population estimated?
The application of the concepts of the central limit theorem and
sampling distribution to inferential statistics has enormous
implications in clinical research because, in many instances, the
population distribution can be deduced from these theoretical
assumptions. Nonetheless, the population standard deviation is yet
to be estimated. It has been well established that the unbiased
calculation of the sample standard deviation is a good approximation
of the standard deviation of the population. To have a good grasp of
these concepts, it is important to note the difference between the
calculated sample standard deviation and population standard
deviation:
s = √( Σ(xᵢ − x̄)² / (n − 1) )        σ = √( Σ(xᵢ − μ)² / N )

where Σ(xᵢ − x̄)² represents the sum of the squared deviations
around the sample mean, n is the total number of observations in the
sample, and N is the total number of observations in the population.
Because in most cases the parameter μ is unknown, a good
estimation of the population standard deviation is given by the
equation:

σ̂ = √( Σ(xᵢ − x̄)² / (n − 1) )

Note that in this formula, the statistic x̄ has been used as an
estimation of the unknown parameter μ, and (n − 1) observations of
the sample have been included in the denominator instead of N.
Therefore, the parameter σ̂ in the latter equation is an estimation,
rather than an actual calculation, and it can be concluded that the
calculated sample statistic s is a good estimation of the population
parameter σ. To interpret the notations used here, please refer to
Table 2.1.
Consider a study sample containing 10 subjects with ages
provided in Table 2.2. Of note, the unbiased estimation of the
population variance (and therefore the standard deviation) must
incorporate (n − 1) instead of (n) in the denominator. This difference
reflects the fact that x̄ has been used as a surrogate of μ for the
estimation of the population standard deviation; therefore, the
approximation needs to account for potential bias. Hence, the
degrees of freedom (n − 1) are used by statistical software packages
to estimate the population parameters of continuous variables.
Table 2.2. Estimated variance and standard deviation of a given
population for the variable age in a random sample of 10 subjects.

| Age (years) | Deviation around mean (x̄ = 42.6) | Squared deviation around mean |
| --- | --- | --- |
| 23 | −19.6 | 384.16 |
| 56 | 13.4 | 179.56 |
| 29 | −13.6 | 184.96 |
| 18 | −24.6 | 605.16 |
| 76 | 33.4 | 1115.56 |
| 43 | 0.4 | 0.16 |
| 56 | 13.4 | 179.56 |
| 65 | 22.4 | 501.76 |
| 38 | −4.6 | 21.16 |
| 22 | −20.6 | 424.36 |

Sum of squared deviations = 3596.4. Estimated population variance
(s²) = 3596.4 / 9 ≈ 400; estimated population standard deviation
(s) = √400 ≈ 20.
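The figures in Table 2.2 can be checked directly; NumPy's `ddof` ("delta degrees of freedom") argument selects the denominator.

```python
import numpy as np

# Ages from Table 2.2
ages = np.array([23, 56, 29, 18, 76, 43, 56, 65, 38, 22])

mean_age = ages.mean()               # 42.6 years
ss = ((ages - mean_age) ** 2).sum()  # sum of squared deviations: 3596.4

var_biased = ss / len(ages)          # divide by n
var_unbiased = ss / (len(ages) - 1)  # divide by n - 1: ~400

sd_biased = var_biased ** 0.5        # ~19
sd_unbiased = var_unbiased ** 0.5    # ~20

# NumPy exposes the same choice through the ddof argument
same = np.isclose(np.std(ages, ddof=1), sd_unbiased)
```

The ~19 vs. ~20 gap here is exactly the difference discussed in Question 26 below.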
Question 25: What is the definition of degrees of
freedom? Why are they important in clinical
research?
Although the definition of degrees of freedom varies depending on
the nature of the random variable and the underlying distribution, in
the specific case of the continuous variable provided in Table 2.2, for
a given average age of 42.6 years, the first nine observations are
free to take any value. However, the tenth observation must take one
specific value, so that the sample mean equals exactly 42.6 years.
Therefore, only nine out of ten observations have the freedom to
take any value (Levine, 2014).
Question 26: When estimating the population
standard deviation, why divide by degrees of
freedom (n-1), and not simply by the sample size
(n)?
From the perspective of the definition of standard deviation (Table
2.2), it is noteworthy that dividing the sum of the squared deviations
around the mean by (n − 1) makes the estimated standard deviation
larger than simply dividing it by the sample size (n). For instance, the
estimated population standard deviation of the dataset provided in
Table 2.2 would be 19 instead of 20 when using the sample size
instead of the degrees of freedom.
It must be emphasized that, while dividing by (n − 1) yields a
better approximation of the population standard deviation, this
estimation is still biased. A more accurate calculation involves
complex equations (Holtzman, 1950). However, it is widely accepted
by most researchers that for n > 10, the difference between the
unbiased estimation and the actual standard deviation of the
population is negligible.
The mathematical arguments to justify the use of degrees of
freedom involve simulations of the estimated population standard
deviation in extreme scenarios. Nonetheless, the intuition behind the
use of (n − 1) degrees of freedom can be used to interpret this
parameter in most practical scenarios. From this perspective, the
inclusion of (n − 1) observations in the computation of standard
deviations outlines the importance of the sample size for the
estimation of any parameter.
Consider the extreme case of a sample size of three subjects
compared to 1,000 subjects. Undoubtedly, the difference between
dividing a number by two and dividing the same number by three is
considerable. In large samples, however, dividing by (n − 1) does not
have a large impact on the parameter estimations, when compared
to the inclusion of n in the denominator. Thus, despite the fact that
the discrepancy observed is minimal in large datasets, it may have
an impact on the estimation of confidence intervals for small sample
sizes.
Question 27: What is the rationale behind the
estimation of confidence intervals? Are they
related to the Standard Error of the Mean?
Given the uncertainty in the estimation of the population mean,
researchers need to provide a range at which the true population
mean is allocated, with a pre-specified level of confidence. This
range is known in biostatistics as the confidence interval, and the
level of confidence is given by (1 − α), where α is the alpha critical
value (see below).
In order to estimate the confidence interval of the mean,
statisticians have developed the concept of standard error of the
mean (SEM), which is defined by the following equation:

SEM = σ / √n

where σ is the population standard deviation, and n is the total
sample size. Since σ is in most instances unknown, the SEM is
usually estimated by replacing this parameter with the sample
standard deviation:

SEM ≈ s / √n
Suppose that you are conducting a telephone survey to investigate
the average time spent on social media by adolescents living in your
area. From 400 subjects sampled, the calculated mean was 2.5 h,
with a standard deviation of 2 h. Thus, an unbiased estimation of the
SEM would be 0.1 (2 / √400). This estimation implies that the
sample mean would be, on average, approximately 0.1 h away
from the true population mean (μ). It is important to note the
key role of sample size in the SEM estimation. For instance, if you
had interviewed only 16 subjects in your sample, the SEM would be
0.5 (2 / √16) instead of 0.1, which would have enormous implications for the
estimation of confidence intervals.
The general definition of confidence intervals, on the other
hand, is based on a hypothetical situation in which the researcher
was able to repeat the same experiment many times. Accordingly, a
95% confidence interval should be interpreted as follows: if the same
experiment was repeated 100 times, it is expected that in 95 out of
the 100 experiments, the interval computed would contain the true
population mean (King, 2019; Du Prel, 2009).
The estimation of confidence intervals of a population mean is
given by:

CI = x̄ ± (t × SEM)

where x̄ is the sample mean, and t is the t-statistic for (n − 1)
degrees of freedom at a pre-specified alpha critical value.
The concept of t values is explored in depth in Chapter 3.
However, for n >120 and alpha critical value of 0.05, the t-statistic is
approximately equal to 1.96. For the same example, the 95%
confidence interval for the average time spent by adolescents on
social media would be 2.5 ± (1.96 × 0.1). Consequently, if you
were able to replicate this survey many times, you can be 95%
confident that the mean obtained would be somewhere between
2.304 and 2.696 h.
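The survey numbers above can be reproduced in a few lines, using the z-approximation t ≈ 1.96 exactly as the text does for n > 120.

```python
import math

# Survey example from the text: 400 adolescents, mean 2.5 h, SD 2 h
n, mean, sd = 400, 2.5, 2.0

sem = sd / math.sqrt(n)        # 2 / 20 = 0.1
t_crit = 1.96                  # t-statistic ~ 1.96 for n > 120, alpha = 0.05

ci_low = mean - t_crit * sem   # 2.304 h
ci_high = mean + t_crit * sem  # 2.696 h

# With only 16 respondents, the SEM grows fivefold
sem_small = sd / math.sqrt(16)  # 0.5
```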
Question 28: What is the relationship between
alpha critical value and Type I error?
One of the key determinants of confidence intervals is the alpha
critical value, which must always be established a priori by the
researcher, based on the desired Type I error. This type of error is
defined as the probability of finding significant results (in other
words, rejecting the null hypothesis) when the null hypothesis is in
fact true (i.e., it should not have been rejected) (Akobeng, 2016).
On the other hand, an alpha critical value of 0.05 implies that
the researcher accepts that the probability of falsely rejecting the null
hypothesis is 5%. A detailed description of the rationale behind the
null and alternative hypotheses can be found in the section of
“formulation of the hypothesis” of this chapter (Banerjee, 2009).
Thus,the concepts of alpha critical value and Type I error are closely
related.
As an illustration, imagine that after conducting a survey, you
are planning to implement an educational intervention to diminish the
time spent on social media among adolescents. If you specify an
alpha critical value of 0.05, the accepted risk of finding spurious
results assuming that the intervention was not successful would be
5%.
Question 29: In which clinical scenarios would it
be appropriate to set an alpha critical value
below 0.05?
In essence, the choice of the alpha critical value is conditioned by
the extent to which the researcher is willing to accept the possibility
of finding spurious results, that is, the probability of having a Type I
error. The acceptance of this type of error is, in turn, largely
dependent on the research question. Thus, when the researcher
needs to be as certain as possible about the results yielded by a
given statistical test, the alpha critical value should be set at a level
lower than 0.05.
Accordingly, on some occasions, and depending on the
desired probability of having false positive results, the alpha critical
value can be set at levels lower than 0.05. This scenario is rarely
seen in clinical research, as the 0.05 threshold has long been
accepted in this field. However, this matter is subject to ongoing
debate (Thiese, 2016; Wasserstein, 2019).
On the other hand, although it is desirable to set an alpha
critical value as low as possible, it comes at the price of a dramatic
increase in the sample size needed. In fact, lowering the alpha
critical value to 0.01 would require prohibitively large sample sizes in
many instances (an increase of 48% in the sample size is needed to
maintain a statistical power of 80%). A thorough description of this
phenomenon can be found in the section on sample size calculation
in this book (Chapter 8).
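The sample-size inflation quoted above can be sketched with the usual normal approximation, in which the required sample size is proportional to (z_{1-alpha/2} + z_{1-beta})^2 for a two-sided test. A minimal sketch, assuming scipy is available; the function name and numbers are illustrative, not from the book:

```python
# Normal-approximation sketch: n is proportional to
# (z_{1-alpha/2} + z_{1-beta})^2, so the relative inflation from
# tightening alpha can be computed directly.
from scipy.stats import norm

def n_inflation(alpha_old, alpha_new, power=0.80):
    """Relative increase in n when moving from alpha_old to alpha_new,
    holding power and effect size fixed (two-sided test)."""
    z_beta = norm.ppf(power)
    n_old = (norm.ppf(1 - alpha_old / 2) + z_beta) ** 2
    n_new = (norm.ppf(1 - alpha_new / 2) + z_beta) ** 2
    return n_new / n_old - 1

increase = n_inflation(0.05, 0.01)  # alpha 0.05 -> 0.01 at 80% power
print(f"Sample size increase: {increase:.0%}")
```

With these inputs, the ratio works out to roughly the 48% increase cited in the text.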
A different scenario frequently encountered in clinical
research is the adjustment of alpha critical values (typically pre-
specified at the level of 0.05) in the context of multiple testing (Qu,
2010). The issue of multiplicity is recurrent in the setting of post-hoc
analyses (which will be discussed in Chapter 5) and interim analyses
(which are beyond the scope of this book). It must be emphasized,
however, that the rationale behind alpha critical levels of 0.01 is not
equivalent to the interpretation of a P-value of 0.01.
Question 30: What is the difference between
alpha critical value and P-value?
As highlighted above, the definition of the P-value is entirely different
from the alpha critical value. In contrast to the alpha critical value,
the P-value cannot be determined a priori. It is calculated with
statistical software packages, in view of the study results and based
on the underlying distribution of the sample. The most widely
accepted definition of the P-value is the probability of obtaining, by
chance, results at least as extreme as those observed in the study,
assuming that the null hypothesis is true (Foster, 2020).
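As a sketch of this definition, the P-value reported by a two-sample t-test can be reproduced by hand from the t-distribution under the null hypothesis. The simulated blood-pressure data below are invented for illustration, and scipy/numpy are assumed to be available:

```python
# Illustrative only: simulated data, equal-variance two-sample t-test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=120, scale=10, size=40)  # e.g., systolic BP, mmHg
group_b = rng.normal(loc=114, scale=10, size=40)

t_stat, p_value = stats.ttest_ind(group_a, group_b)  # two-sided by default

# The same P-value from its definition: the probability, under H0, of a
# t statistic at least as extreme as the one observed.
df = len(group_a) + len(group_b) - 2
p_manual = 2 * stats.t.sf(abs(t_stat), df)
print(p_value, p_manual)
```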
Question 31: What is the relationship between
beta critical value, Type II error and statistical
power? What are the consequences of reducing
the alpha critical value?
Unfortunately, for a given study with a fixed sample size, Type I
errors can only be diminished at the expense of increasing Type II
errors. In line with the aforementioned concepts, Type II errors can
be defined as the probability of missing significant results, which
leads to failure to reject the null hypothesis, when in fact it should
have been rejected (Akobeng, 2016). The Type II error is closely
related to the beta critical value and statistical power. While the beta
critical value is equivalent to the Type II error, the statistical power of
the study is given by (1 − β), where β is the beta critical value
(Banerjee, 2009).
Table 2.3 illustrates the relationship between Type I error, Type II
error, alpha critical value, and statistical power in the framework of
the formulation of a hypothesis.
Table 2.3. Hypothesis testing based on the contrast of the null
hypothesis

                                        Null hypothesis (fact)
                                        False                         True
Statistical      Reject H0              (1 − β): Statistical power    α: Type I error
analysis         Fail to reject H0      β: Type II error              (1 − α)
Question 32: Why are alpha and beta critical
values typically set at 5% and 20%, respectively?
Any reduction in the alpha critical value (also known as the
significance level, which represents the probability of making a Type
I error) comes at the cost of increasing the probability of making a
Type II error, thereby rendering the statistical test underpowered.
The only way to diminish the alpha critical value, while keeping the
counterpart (beta critical value) unchanged, is by increasing the
sample size (and therefore, the statistical power), which is not
always feasible. In other words, studies with smaller alpha critical
values require a larger number of observations to maintain the same
statistical power. Accordingly, the selection of alpha critical values is
governed not only by the desired probability of making Type I errors,
but also by the researcher’s willingness to accept Type II errors
(Akobeng, 2016). The main concepts related to the alpha and beta
critical values are listed in Table 2.4.
Consequently, for a given study with a fixed sample size, a
trade-off must be established to accept a reasonable risk of Type I
and Type II errors. From this perspective, the control for Type I errors
tends to be more important in clinical research because finding
spurious results may have a negative impact on decision-making in
daily clinical practice. Therefore, alpha critical values are typically set
below beta critical values.
The possibility of failing to reject a null hypothesis when in fact
it should have been rejected may, on the other hand, have different
implications. From a clinical point of view, patients might not benefit
from landmark discoveries, and from a public health perspective,
researchers’ efforts may not be reflected in the decisions of
policymakers, thus wasting valuable resources. Therefore, some
authors have advocated the routine setting of statistical power as
high as 90%, even though this strategy requires larger sample sizes
(the sample size needed increases by 34%, provided the
significance level is maintained at 0.05) (Schober, 2018).
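The 34% figure can be checked with a normal-approximation sketch, in which the required sample size is proportional to (z_{1-alpha/2} + z_{1-beta})^2. This assumes scipy is available and is illustrative only:

```python
# Sample-size cost of raising power from 80% to 90% at a fixed
# two-sided significance level of 0.05 (normal approximation).
from scipy.stats import norm

z_alpha = norm.ppf(1 - 0.05 / 2)              # ~1.96
n_power_80 = (z_alpha + norm.ppf(0.80)) ** 2
n_power_90 = (z_alpha + norm.ppf(0.90)) ** 2
extra = n_power_90 / n_power_80 - 1           # relative increase in n
print(f"Extra sample size for 90% power: {extra:.0%}")
```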
Table 2.4. Key concepts related to the definition of alpha and beta
critical values. Of note, the definition of the P-value is unrelated to
these concepts, and therefore is not included here (see text).
Alpha critical value: Probability of rejecting a true null hypothesis
    Type I error: Rejection of a true null hypothesis
    Significance level: Alpha critical value specified by the
    researcher
    False positive rate: Ratio of the number of observations falsely
    labeled as positive over the total number of observations

Beta critical value: Probability of failing to reject a false null
hypothesis
    Type II error: Failure to reject a false null hypothesis
    Statistical power: (1 − β), where β is the beta critical value
    False negative rate: Ratio of the number of observations falsely
    labeled as negative over the total number of observations
Question 33: Is it correct to use P-values along
with confidence intervals, when reporting
statistical results?
The information provided by P-values is different from that contained
in the confidence intervals. As discussed previously, the
interpretation of P-values should be made in the context of
probabilities. Conversely, confidence intervals give a glimpse to the
reader as to how confident they should be regarding the true
estimation of the population parameter. For example, if the
researcher reports a mean difference between two groups of 2.5,
with a 95% confidence interval of -0.5 to 5.5, this should warn the
reader that the difference observed may not be statistically
significant in the population from which the sample came,
because the interval includes the zero value in its range. However,
such a wide interval may well correspond to an underpowered study.
By contrast, if the researcher repeated the same experiment
with a larger sample size, and the confidence interval was reported
to be between 2.0 and 3.0 units for the same parameter, then the
reader should be 95% confident that the mean difference in the
population of interest would be somewhere between 2.0 and 3.0
units. Thus, sample size plays a key role in the magnitude of the
confidence intervals. This phenomenon can be explained by the fact
that confidence intervals are constructed on the basis of standard
errors, and these estimations are inversely correlated with the
square root of the sample size (see above) (Bender, 2007).
A different scenario may also be observed if the reported
mean difference between the two groups was 0.1 units, and the 95%
confidence interval was in the range of -0.1 to 0.3 units. In this
case, a lack of statistical significance would be more reliable
because the reported interval was narrow. However, such a narrow
confidence interval would necessitate prohibitively large sample
sizes. Lastly, if the confidence interval was from 0.1 to 0.3 for a
mean difference of 0.2 units, the interpretation of these significant
results (note that the range of the confidence interval does not
include 0) should account for the clinical relevance of such
difference. In conclusion, P-values and confidence intervals yield
different and complementary information, and considering that they
are equally important for the correct interpretation of the estimation
of parameters, both of them should be included in the study results
(Du Prel, 2009).
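The interplay between sample size and interval width described above can be made concrete with the familiar estimate ± 1.96 × SE construction. The mean difference, standard deviation, and group sizes below are invented, and scipy is assumed to be available:

```python
# Sketch: a 95% CI for a difference of two independent means; the
# standard error (and hence the interval width) shrinks with sqrt(n).
import math
from scipy.stats import norm

def ci95(mean_diff, sd, n_per_group):
    """95% CI for a difference of two independent means (known-SD sketch)."""
    se = sd * math.sqrt(2 / n_per_group)   # SE of the difference
    z = norm.ppf(0.975)                    # ~1.96
    return mean_diff - z * se, mean_diff + z * se

wide = ci95(mean_diff=2.5, sd=6.0, n_per_group=16)     # crosses zero
narrow = ci95(mean_diff=2.5, sd=6.0, n_per_group=600)  # excludes zero
print(wide, narrow)
```

With the small groups the interval crosses zero (inconclusive, possibly underpowered); with the large groups the same observed difference yields an interval that excludes zero.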
Formulation of the hypothesis
Question 34: What are the steps for hypothesis
formulation?
In the authors’ view, the hypothesis-testing process entails four
critical steps:
Defining the hypothesis to be tested: this is known as the
null hypothesis.
Specifying the significance level and establishing a
decision rule: as discussed above, the significance level is
given by the alpha critical value, which needs to be
determined a priori. Based on the chosen significance
level, the researcher rejects or fails to reject the null
hypothesis.
Choosing the appropriate statistical test: this decision is
typically based on the design of the research project, the
type of outcome (numerical, categorical, ordinal) and its
underlying distribution, and the nature of the independent
variable(s).
Inferring results from the study to a population: a
methodologically well-conducted study taken from a
random sample that accurately represents the target
population should yield results that can be generalized to
that same population (Overholser, 2007).
The proposed components of hypothesis testing will now be
explored in the framework of inferential statistics.
Question 35: What is the rationale behind the null
and alternative hypotheses?
It is worth noting that to make inferences from the study sample to
the population of interest, the researcher needs to choose the
appropriate statistical test. However, the interpretation of statistical
tests would become burdensome if the hypotheses tested were
different. Hence, the unification of hypothesis testing would be
convenient for the judgment of test results. In fact, if the hypotheses
were not formulated in the same manner, a P-value < 0.05 reported
for a given test would have opposite interpretations depending on
which hypothesis has been tested (Nickerson, 2000).
Therefore, most statistical tests are based on the premise that
there is no difference between the groups under evaluation. This is
known as the null hypothesis, which is denoted as Ho in the majority
of scientific literature. (Guyatt, 1995). The alternative hypothesis,
usually denoted as H1, represents the opposite statement.
Question 36: What is the difference between
rejecting and failing to reject the null hypothesis?
The interpretation of any statistical test departs from the assumption
that the null hypothesis is true. Subsequently, the alpha critical value
would dictate the statistical significance of the test; if the P-value
reported was smaller than this critical value, the null hypothesis
would be rejected, and therefore, there would be strong evidence in
favor of the alternative hypothesis (Pereira, 2009).
The strength of this evidence is given by the estimated P-
value. As an example, if the P-value reported was 0.03, the reader
should conclude that, under the assumption that the null hypothesis
was true, the probability of obtaining, by chance, results at least as
extreme as those observed would be only 3%.
The opposite scenario is represented by a P-value larger than
the alpha critical value. In this hypothetical situation, there would not
be enough evidence to reject the null hypothesis, and consequently,
it should be concluded that the researcher has failed to reject the null
hypothesis. Intuitively, it is tempting to say that the null hypothesis is
accepted. However, this interpretation is incorrect, because the fact
that the null hypothesis has failed to be rejected does not necessarily
imply that it has been accepted. Indeed, the failure to reject could be
explained by an insufficient sample size (and therefore a lack of
statistical power) or by the effect of chance.
Question 37: Are the concepts of statistical
significance and clinical relevance different?
Why are they important?
From the concepts explored in the preceding sections, it becomes
evident that the statistical significance is mathematically deduced
from the process of hypothesis testing. Nonetheless, the potential
impact that the intervention tested would have on the population is
always subject to clinical judgment (LeFort, 1993).
On the other hand, large clinical trials can be powered to
detect statistically significant differences, even for small effect sizes.
However, the interpretation of these differences must be taken from
a clinical perspective in order to elucidate the potential benefits of
the intervention tested (Ranganathan, 2015). A representative
example is an imaginary clinical trial comparing two medications for
the treatment of hypertension. Assuming that this study was
designed to have sufficient statistical power to detect a difference
between treatments as small as 3 mm Hg, and that the null
hypothesis was rejected (i.e., the mean of the observations obtained
between groups was different, and the reported P-value provided
enough evidence against the null hypothesis), clinicians would be
reluctant to use the experimental treatment, because a decrease of 3
mm Hg is generally considered of little or non-clinical relevance.
Question 38: What is the definition of one-tailed
and two-tailed hypothesis tests?
In addition to the formulation of the null hypothesis, it is imperative to
establish the direction of the hypothesis tested. This step is critical
for defining the alpha critical value, because one- and two-tailed
tests allocate the probability of a Type I error to one or to both tails
of the area under the curve defined by the probability density function
of the null hypothesis. Figure 2.2 depicts the distribution of alpha critical values
around the null hypothesis for one- and two-tailed hypothesis tests.
Consequently, a one-tailed hypothesis test can be defined as
a statistical test that only analyzes one side of the critical area of the
curve representing the distribution of the null hypothesis. Conversely,
a two-tailed hypothesis test is a statistical test that evaluates both
sides of the curve (Figure 2.2) (Knottnerus & Bouter, 2001; Moyé &
Tita, 2002; Motulsky, 2014).
Figure 2.2. Hypothesis contrast for one-tailed (left panel) and two-tailed (right
panel) statistical tests, given a null hypothesis of no difference between two
groups, and an alpha critical value of 0.05. The X-axis represents the standardized
mean difference (SMD) between the two groups, and the Y-axis depicts the
probability density for the underlying distribution. The probability of having results
at least as extreme as those observed by chance (i.e., the P-value), is represented
by the area under the curve highlighted in green and orange for one- and two-
tailed hypotheses, respectively. Thus, for a given SMD of 1.8 (dashed orange
lines), such probability would fall within the green area in the left panel and
outside the orange area in the right panel. In the latter case, statistical software packages
double the P-value obtained, in order to account for the other side of the tail
(dashed black line). Therefore, the null hypothesis would be rejected in a one-
tailed statistical test, and fail to be rejected in a two-tailed statistical test. S =
Standard deviation.
Question 39: How are one-tailed and two-tailed
hypothesis tests influenced by the formulation of
null and alternative hypotheses?
The direction of the hypothesis test also has important implications
for the formulation of the null hypothesis. Thus, a one-sided
hypothesis test assumes that, biologically, the effect opposite to that
posed by the alternative hypothesis is not possible.
For instance, if you were planning to test an experimental medication
to treat patients with systemic lupus erythematosus, it might be
biologically possible to conclude that rather than having the expected
benefit, the proposed treatment may actually have deleterious
effects on patients with this condition. Accordingly, many scenarios in
clinical research require considering both ends of the spectrum of the
alternative hypothesis; therefore, a two-tailed test is indicated in
most cases.
Occasionally, it is appropriate to use one-tailed hypothesis
tests when the researcher ascertains that the extreme possibility of
observing detrimental effects derived from the experimental
treatment can be ruled out. For example, if you are investigating the
adverse effects of a medication on the kidney, and it is biologically
unlikely to expect any improvement in renal function with such
medication, it may be pertinent to consider a one-tailed hypothesis
test (Knottnerus & Bouter, 2001; Moyé & Tita, 2002; Motulsky, 2014).
By contrast, one-tailed hypothesis tests should not be
employed if the possibility of unintended harmful effects of a given
intervention cannot be ruled out. In the Cardiac Arrhythmia
Suppression Trial (CAST investigators, 1989), researchers
hypothesized that encainide and flecainide would reduce mortality in
patients with ventricular arrhythmias after myocardial infarction.
However, the study was prematurely discontinued because, rather
than reducing mortality, the number of deaths reported in the
intervention group was larger than that observed in the placebo
group (CAST investigators, 1989). Hence, one-tailed tests must be
considered with extreme caution, since accounting for the two ends
of the spectrum is deemed safer in the majority of clinical scenarios
(Moyé, 2002).
Question 40: Does statistical power vary when
formulating one- or two-tailed hypothesis tests?
One-tailed hypothesis tests have more statistical power to detect the
same effect size, when compared to their counterparts, because the
beta critical value invariably decreases when holding sample size,
effect size, and standard deviation of the dependent variable
constant (Ringwalt, 2011). Furthermore, the required sample size for
one-tailed hypotheses is reduced by approximately 21% compared
to two-tailed hypotheses when holding other parameters influencing
the sample size constant (Moyé, 2002). The implications of one- and
two-tailed hypothesis tests for the estimation of sample size and
statistical power are thoroughly examined in Chapter 8.
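Both effects can be sketched with the normal approximation; the effect size, group size, and power target below are invented for illustration, and scipy is assumed to be available:

```python
# Sketch: at a fixed n, a one-tailed test places all of alpha in one
# tail, so its critical value is smaller and its power larger.
import math
from scipy.stats import norm

d, n = 0.5, 40                         # standardized effect, n per group
z_effect = d * math.sqrt(n / 2)        # expected test statistic under H1
power_two = norm.cdf(z_effect - norm.ppf(1 - 0.05 / 2))  # alpha split
power_one = norm.cdf(z_effect - norm.ppf(1 - 0.05))      # alpha in one tail

# Required-n ratio at 80% power: the one-tailed design needs roughly
# 21% fewer observations for the same effect size.
z_beta = norm.ppf(0.80)
n_ratio = ((norm.ppf(0.95) + z_beta) / (norm.ppf(0.975) + z_beta)) ** 2
print(power_one, power_two, n_ratio)
```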
Question 41: Why does the P-value change when
setting a one-tailed hypothesis, compared to a
two-tailed hypothesis test?
First, it should be emphasized that the alpha critical value must be
specified in accordance with the research question before embarking
on any research project. Second, the direction of the hypothesis test
(one-tailed vs. two-tailed) is decided based on a reasonable
judgment of the possible results expected. Third, although it is true
that, from a theoretical perspective, the P-value changes according
to the direction of the hypothesis test, it should not condition the
researcher to choose between different P-values reported for a given
test, because the pre-specified hypothesis is not subject to change
during the study (Ringwalt, 2011).
In other words, the P-value is conditioned on the direction of
the hypothesis, because the probability of having results at least as
extreme as those observed by the effect of chance (assuming that
the null hypothesis is true) varies as a function of the distribution of
the alpha critical value under a curve that represents the probability
density function of the null hypothesis (Figure 2.2).
As an illustration, suppose that in order to compare two
means, you formulate a two-tailed hypothesis with an alpha critical
value of 0.05, and the P-value reported was 0.06. Since 0.06 > 0.05,
you would fail to claim a significant difference between the two
means. However, if you had opted for a one-tailed hypothesis, you
would have to choose between two scenarios of alternative
hypotheses, depending on your expectations:
H1: The mean of the treatment group is greater than that of the
control group. In this case, since the treatment group mean was
actually greater than the control group mean, the P-value reported
would have been 0.03 (0.06/2), instead of 0.06; consequently, the
null hypothesis of no difference would have been rejected.
H1: The mean of the control group is greater than that of the
treatment group. In this case, since the treatment group mean was
actually greater than the control group mean, the P-value reported
would have been 0.97 [1 − (0.06/2)], instead of 0.06, and
consequently, the researcher would have failed to reject the null
hypothesis of no difference.
By contrast, two-tailed hypothesis tests do not rely on the prior
expectations of the researcher, as both possibilities are considered.
In fact, the formulation of a two-tailed test implicitly carries the
following hypotheses:
H0: The means of the treatment and control groups are not
significantly different.
H1: The means of the treatment and control groups are significantly
different.
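These relationships between one- and two-tailed P-values can be reproduced with scipy's `alternative` argument (available in scipy 1.6 and later); the data below are simulated for illustration:

```python
# Sketch: how the reported P-value depends on the direction of the
# alternative hypothesis for the same data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
treatment = rng.normal(1.0, 3.0, 50)
control = rng.normal(0.0, 3.0, 50)

p_two = stats.ttest_ind(treatment, control).pvalue
p_greater = stats.ttest_ind(treatment, control, alternative='greater').pvalue
p_less = stats.ttest_ind(treatment, control, alternative='less').pvalue

# The one-tailed P-value in the direction of the observed difference is
# p_two / 2; in the opposite direction it is 1 - p_two / 2.
print(p_two, p_greater, p_less)
```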
Remarkably, in the example of the CAST trial, investigators were
unable to provide a P-value for the increased mortality in the
experimental group. Since the clinical trial was designed to test
exclusively the potential benefits of the antiarrhythmics, the failure to
reject the null hypothesis of no difference in mortality left unresolved
the question of whether there was actually no difference or, on the
contrary (as was the case), a significant difference to the detriment of
the intervention tested, with the aggravating circumstance that
replicating the same study in the frame of a two-tailed hypothesis
test would be unethical (CAST investigators 1989; Moyé 2002).
Question 42: What are other important factors
that may influence P-values?
In general, the computation of P-values (irrespective of whether they
are statistically significant or not) is related to the direction of the
hypothesis test, the statistical power, and the effect size predicted to
be clinically relevant. Of note, all these factors are also intimately
related to the sample size (Chapter 8), and consequently, P-values
are heavily influenced by sample size. (Thiese, 2016)
Thus, studies involving a large number of subjects invariably
have more statistical power to detect significant differences even for
small effect sizes, thereby rendering P-values smaller and more
likely to be statistically significant, depending on the specified alpha
critical value.
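A sketch of this dependence, holding the effect size fixed at 0.3 standard deviations while varying only the group size; the numbers are invented, and scipy is assumed to be available:

```python
# Sketch: the same effect size yields a much smaller P-value when the
# sample size grows (equal-variance, two-sided t-test formula).
import math
from scipy import stats

def p_two_sided(mean_diff, sd, n_per_group):
    """Two-sided P-value for a difference of two independent means."""
    t = mean_diff / (sd * math.sqrt(2 / n_per_group))
    return 2 * stats.t.sf(abs(t), 2 * n_per_group - 2)

p_small = p_two_sided(0.3, 1.0, 20)    # same effect, 20 per group
p_large = p_two_sided(0.3, 1.0, 500)   # same effect, 500 per group
print(p_small, p_large)
```

The identical 0.3-SD effect is non-significant with 20 subjects per group but highly significant with 500, which is why P-values must always be read alongside the effect size and its clinical relevance.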
Conclusions of Chapter 2
As clinicians, we are rarely interested in the results of a particular
study sample. Very often, our attention is focused on whether the
study sample resembles the target population, and whether the
study results can be applied to our patients, thus makingan impact
in our clinical practice. To this end, inferential statistics plays a
pivotal role in estimating population parameters from a given study
sample.
The process of inferring results from a study sample to the
population of interest relies heavily on the premises of the central
limit theorem. Based on a normally distributed population assumed
by the application of this paradigm, researchers can provide reliable
P-values and confidence intervals that describe the level of
uncertainty of such estimated population parameters. These values
provide valuable and complementary information, and should
therefore be interpreted together, in the light of reasonable clinical
judgment.
References for Chapter 2
1. Akobeng AK. Understanding type I and type II errors,
statistical power and sample size. Acta Paediatr Oslo Nor
1992. 2016;105(6):605-609. doi:10.1111/apa.13384
2. Banerjee A, Chitnis UB, Jadhav SL, Bhawalkar JS,
Chaudhury S. Hypothesis testing, type I and type II errors. Ind
Psychiatry J. 2009;18(2):127-131. doi:10.4103/0972-
6748.62274
3. Bender R, Lange S. Was ist ein Konfidenzintervall? DMW -
Dtsch Med Wochenschr. 2007;132(S 01):e17-e18.
doi:10.1055/s-2007-959031
4. Choi SW. Life is lognormal! What to do when your data does
not follow a normal distribution. Anaesthesia.
2016;71(11):1363-1366. doi:10.1111/anae.13666
5. du Prel JB, Hommel G, Röhrig B, Blettner M. Confidence
interval or p-value?: part 4 of a series on evaluation of
scientific publications. Dtsch Arzteblatt Int. 2009;106(19):335-
339. doi:10.3238/arztebl.2009.0335
6. Garett Foster HZ. 7.5: Critical values, p-values, and
significance level. Statistics LibreTexts. Published November
12, 2019. Accessed February 23, 2022.
https://stats.libretexts.org/Bookshelves/Applied_Statistics/Boo
k%3A_An_Introduction_to_Psychological_Statistics_(Foster_
et_al.)/07%3A__Introduction_to_Hypothesis_Testing/7.05%3
A_Critical_values_p-values_and_significance_level
7. Guyatt G, Jaeschke R, Heddle N, Cook D, Shannon H, Walter
S. Basic statistics for clinicians: 1. Hypothesis testing. CMAJ
Can Med Assoc J. 1995;152(1):27-32.
8. Harvey Motulsky. Intuitive Biostatistics A Nonmathematical
Guide to Statistical Thinking. (4th Edition):145-156.
9. Harvey Motulsky. Intuitive Biostatistics A Nonmathematical
Guide to Statistical Thinking. (4th Edition):24-28.
10. Holtzman WH. The unbiased estimate of the
population variance and standard deviation. Am J Psychol.
1950;63(4):615-617.
11. King AP, Eckersley RJ. Chapter 4 - Inferential
Statistics I: Basic Concepts. In: King AP, Eckersley RJ, eds.
Statistics for Biomedical Engineers and Scientists. Academic
Press; 2019:71-90. doi:10.1016/B978-0-08-102939-8.00013-
X
12. Knottnerus JA, Bouter LM. The ethics of sample size:
two-sided testing and one-sided thinking. J Clin Epidemiol.
2001;54(2):109-110. doi:10.1016/s0895-4356(00)00276-6
13. LeFort SM. The statistical versus clinical significance
debate. Image-- J Nurs Scholarsh. 1993;25(1):57-62.
doi:10.1111/j.1547-5069.1993.tb00754.x
14. Levine DM, Stephan DF. Even You Can Learn
Statistics and Analytics: An Easy to Understand Guide to
Statistics and Analytics. FT Press; 2014.
15. Moyé LA, Tita ATN. Defending the rationale for the
two-tailed test in clinical research. Circulation.
2002;105(25):3062-3065.
doi:10.1161/01.cir.0000018283.15527.97
16. Nickerson RS. Null hypothesis significance testing: a
review of an old and continuing controversy. Psychol
Methods. 2000;5(2):241-301. doi:10.1037/1082-989x.5.2.241
17. Overholser BR, Sowinski KM. Biostatistics primer:
part I. Nutr Clin Pract Off Publ Am Soc Parenter Enter Nutr.
2007;22(6):629-635. doi:10.1177/0115426507022006629
18. Pereira SMC, Leslie G. Hypothesis testing. Aust Crit
Care Off J Confed Aust Crit Care Nurses. 2009;22(4):187-
191. doi:10.1016/j.aucc.2009.08.003
19. Qu HQ, Tien M, Polychronakos C. Statistical
significance in genetic association studies. Clin Investig Med
Med Clin Exp. 2010;33(5):E266-270.
doi:10.25011/cim.v33i5.14351
20. Ranganathan P, Pramesh CS, Buyse M. Common
pitfalls in statistical analysis: Clinical versus statistical
significance. Perspect Clin Res. 2015;6(3):169-170.
doi:10.4103/2229-3485.159943
21. Ringwalt C, Paschall MJ, Gorman D, Derzon J,
Kinlaw A. The use of one- versus two-tailed tests to evaluate
prevention programs. Eval Health Prof. 2011;34(2):135-150.
doi:10.1177/0163278710388178
22. Ruskin JN. The cardiac arrhythmia suppression trial
(CAST). N Engl J Med. 1989;321(6):386-388.
doi:10.1056/NEJM198908103210608
23. Schober P, Bossers SM, Schwarte LA. Statistical
Significance Versus Clinical Importance of Observed Effect
Sizes: What Do P Values and Confidence Intervals Really
Represent? Anesth Analg. 2018;126(3):1068-1072.
doi:10.1213/ANE.0000000000002798
24. Thiese MS, Ronna B, Ott U. P value interpretations
and considerations. J Thorac Dis. 2016;8(9):E928-E931.
doi:10.21037/jtd.2016.08.16
25. Vetter TR. Fundamentals of Research Data and
Variables: The Devil Is in the Details. Anesth Analg.
2017;125(4):1375-1380.
doi:10.1213/ANE.0000000000002370
26. Wasserstein RL, Schirm AL, Lazar NA. Moving to a
World Beyond “p < 0.05.” Am Stat. 2019;73(sup1):1-19.
doi:10.1080/00031305.2019.1583913
Chapter 3
3. Statistical tests I: Comparing two groups with
Student’s t-test and non-parametric tests
Wilson Fandino & Karla Loss
Introduction to Chapter 3
Ireland, October 1899. You are 23 years old and have been recently
appointed as a chemist by one of the most prominent breweries
worldwide. Your job is to investigate the role of different methods of
selection, cultivation, and treatment of barley in the brewing process.
As a newly qualified chemist with a background in mathematics, you
must deal with a few beer samples to compare several techniques,
measuring stout acidity as a surrogate of beer quality (Barnard,
1990).
Your first thought might be that comparing the quality of beer
samples would not be problematic, particularly if you rely on the
properties of a Normal distribution for your experiment. However,
since your aim is to implement strategies to improve the quality
control of the brewery, you may be interested in working with small
samples, with the hope that the findings can be generalized to
multiple batches. Subsequently, you may encounter a fundamental
problem: much of the statistical methodology available to you has
been developed for large sample sizes, and little is known about
inferring conclusions from a small number of observations.
In light of contemporary evidence, would you still feel
confident working with mathematical rules mainly developed for
experiments with a large number of observations? Would you simply
incorporate those concepts into the analysis of small experiments?
Would you consider developing more reliable approaches to address
this problem (Student 1908, Barnard 1990, Trkulja & Hrabač 2020)?
William Sealy Gosset (1876-1937), an English chemist and
mathematician, worked most of his life for the famous Irish brewery,
Guinness. Owing to Guinness’ confidentiality policy, all his landmark
contributions to the field of statistics were made under the
pseudonym “Student,” coined by the company director of that time
(Trkulja & Hrabač, 2020).
Two of the most extraordinary minds of the last century
heavily influenced Student’s work in the field of statistics: Karl
Pearson and Sir Ronald Fisher. One of Student’s most important
discoveries was the t-distribution, a family of symmetric distributions
that describe the error in the estimation of population means, when
comparing two groups containing a small number of observations.
Student’s methodology has gained worldwide acceptanceamong
researchers and will be discussed in-depth in this chapter, along with
the non-parametric versions of the test (Student 1908, Barnard
1990).
Comparing means between two groups
Question 43: When comparing means between
two groups, what are the factors influencing the
choice of statistical tests?
When analyzing a set of variables in clinical research, selecting the
right statistical test is crucial for the researcher to draw valid
conclusions from their dataset (internal validity) and apply these
results to the population of interest (external validity). Furthermore,
the choice of statistical test depends on the nature of the variables to
be contrasted and the characteristics of the groups. Thus, when
comparing a continuous dependent variable with a categorical
independent variable, the researcher encounters many options,
depending on four fundamental factors:
The distribution of the continuous variable (Normal vs.
non-normal distribution),
Independence of the continuous variable (paired or
matched vs. unpaired or independent),
The number of groups included in the categorical
independent variable, which is usually the intervention
variable (two vs. three or more groups, e.g., placebo, dose
1, and dose 2), and
The sample size balance with respect to the distribution of
observations between groups.
The concepts of parametric and non-parametric distributions and the
application of statistical tests to compare three or more groups are
examined in depth in Chapters 1 and 5 of this book, respectively.
Question 44: What are the tests indicated for the
comparison of means between two groups?
Clinical research recurrently encounters the specific case of
comparing a random continuous variable (for instance, blood
pressure or cholesterol level) between groups. In the case of the
Guinness brewery, imagine that you were interested in comparing a
continuous outcome (beer acidity) between two samples
representing different beer brew techniques. The outcome would be
the dependent variable, and the dichotomous independent variable
represents the two groups.
Depending on the factors introduced above, the contrast
between the two variables comprises two categories of statistical
tests:
Parametric tests:
Unpaired data: Student’s t-test
Paired data: Student’s paired t-test
Non-parametric tests
Paired data: Wilcoxon’s signed-rank test
Unpaired data: Wilcoxon’s rank-sum test (also known as
the Mann–Whitney U test).
This chapter will detail the indications, assumptions, limitations, and
potential issues researchers face for all possible statistical tests that
apply to all proposed scenarios.
Parametric tests
Question 45: What is a t-distribution? How can it
be interpreted in the context of a Student’s t-test?
Estimating population mean differences from a set of two
independent groups of random observations, with an approximately
symmetric distribution and equal variances, is based on the
assumption that those observations follow a specific distribution.
When dealing with large sample sizes, they tend to follow a Normal
distribution. Therefore, the estimation of parameters relies on a Z-
score for a given alpha critical value. This score is crucial for
estimating the area of rejection of the null hypothesis. Nevertheless,
using Z-scores for parameter estimations in small datasets is
unreliable because the distribution is not exactly normal.
The distribution that more accurately describes the behavior
of continuous variables in small datasets is called a t-distribution. It is
a type of symmetric distribution, closely related to the Normal
distribution (Figure 3.1), described by Student in one of his original
works (Student, 1908). As depicted in Figure 3.1, the exact definition
of the t-distribution depends on the number of observations included
in the total sample size. The larger the sample size, the more likely
the sample is to approximate a Normal distribution. Thus, for n ≥ 30
observations, it is widely accepted that the shape of the Normal
distribution is a good approximation of the relevant t-distribution.
By contrast, the parameter estimation for n < 30 datasets
needs to account for t-values instead of Z-scores. The t-values
depend on the direction of the hypothesis (one- vs. two-tailed tests),
the degrees of freedom (df) (n − 1 for one-sample tests, and n − 2 for
two-sample tests), and the alpha critical value (see Chapter 2).
Figure 3.1. Left: Student’s t-distribution for n=4 (green curve), n=10 (blue curve),
and n=20 (red curve). The shape of the Student’s t-distribution for n<30 is very
similar to the Normal distribution. However, different t-values need to be applied
instead of standardized Z-scores to account for the estimation error of population
mean differences if a Normal distribution was assumed (see Figure 3.2). Right:
Standardized Normal distribution for reference. In sample sizes with n ≥ 30, the t-
distribution approximates the Normal distribution very well, and therefore, the
error obtained for the estimation of parameters is negligible. Accordingly, a Z-score
of 1.96 can be used instead of t-scores in large sample sizes to estimate
confidence intervals (Figure 3.2).
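For readers who wish to verify this convergence computationally, the following minimal sketch (using Python with scipy, which is not part of the original text) retrieves the two-tailed critical values of the t-distribution and shows how they approach the Z-score of 1.96 as the degrees of freedom increase:

```python
# Sketch: two-tailed critical values (alpha = 0.05) of the t-distribution
# for increasing degrees of freedom, compared with the Normal Z-score.
from scipy import stats

for df in (4, 10, 20, 30, 120):
    t_crit = stats.t.ppf(0.975, df)   # upper 2.5% quantile, two-tailed test
    print(f"df = {df:>3}: t = {t_crit:.3f}")  # df = 30 gives t ≈ 2.042
print(f"Normal:   Z = {stats.norm.ppf(0.975):.3f}")  # 1.960
```

Note how the critical value for 30 degrees of freedom (≈2.04) is already close to 1.96, whereas for 4 degrees of freedom it is markedly larger (≈2.78), consistent with the heavier tails shown in Figure 3.1.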
Question 46: What do Student’s t-tables mean?
What are they used for?
In 1908, Student described the behavior of distributions for small
sample sizes and highlighted the importance of adjusting Z-Scores
as a function of the number of observations. Thus, he developed
specific t-values for (n − 1) and (n − 2) observations (i.e., the degrees
of freedom for one- and two-sample tests, respectively), which are the
foundation of Student’s t-tables (Student, 1908).
Figure 3.2 provides the specific t-values for a two-tailed
Student’s t-test with a pre-specified alpha critical value of 0.05 and 4
to 30 degrees of freedom. The corresponding Z-score of 1.96
replaces these t-values, used for large sample sizes to estimate
confidence intervals for the difference of means (see left panel of
Figure 3.2).
Clinical research frequently uses the t-values included in
Figure 3.2. However, this list is not exhaustive. Different values for
one-sided tests, alpha critical values <0.05, and >30 degrees of
freedom are available in other statistical books and on the Internet
(Sachs, 2012). It should be noted that most Student’s t-tables
include degrees of freedom ranging from 1 to 120. As highlighted
above, the Normal distribution is a good approximation for samples
with ≥30 observations, but the exact estimate is given for degrees
of freedom ≥120. These estimates are used in statistical software
packages to build confidence intervals.
It must also be emphasized that with the advent of modern
statistical software packages, there is no need to look at Student’s t-
tables unless the researcher is interested in manually estimating
confidence intervals. However, the authors contend that familiarity
with these tables facilitates the interpretation of outputs yielded by
statistical software packages.
Figure 3.2. Left: Z-score approximation for a two-tailed Student’s t-test with an
alpha critical value of 0.05, when the sample size is ≥30. In this case, the t-
distribution is approximated to Normal. Consequently, the critical area for rejecting
the null hypothesis lies within the areas under the curve corresponding to Z-scores
>1.96 and <−1.96 (highlighted in green). Right: Two-tailed Student’s t-table for 4
to 30 degrees of freedom and an alpha critical value of 0.05. In such small
datasets, the t-values provided in the table replace the Z-score of 1.96 to account
for the different shapes of the t-distribution (see Figure 3.1). The t-values for other
alpha critical levels, larger sample sizes (30 to 120 degrees of freedom), and one-
tailed t-tests are well-established and readily accessible elsewhere (Sachs, 2012).
Question 47: How are confidence intervals
estimated for mean differences?
As an illustration, consider a clinical trial comparing two different
analgesic regimens in 60 patients undergoing lithotripsy for kidney
stones (Choudhary, 2019). The mean pain scores reported
(measured by a visual analog score of 0 to 10) for Treatment A
(n=32) and Treatment B (n=28) were 4.9 ± 1.9 standard deviation (s)
and 2.9 ± 1.7 s, respectively. Consequently, the mean difference in
pain scores between the two groups was 2 points, and the 95%
confidence interval for a two-tailed test was given by:

95% CI = (mean A − mean B) ± t × SED

where t is the t-score for (n − 2) degrees of freedom, and SED is the
standard error of the difference. The latter is defined as

SED = √(sA²/nA + sB²/nB) = √(1.9²/32 + 1.7²/28) = 0.4648

As the t-score for this two-tailed test with an alpha critical value of
0.05 and 58 degrees of freedom is 2.0017, the 95% confidence
interval is estimated as follows:

95% CI = 2 ± 2.0017 × 0.4648 = 1.0697 to 2.9303

Alternatively, when replacing the t-value with a Z-score of 1.96, the
95% confidence interval is approximated as

95% CI = 2 ± 1.96 × 0.4648 = 1.089 to 2.911
Note that the confidence interval estimated with a t-value is
more conservative. Unsurprisingly, the two estimations look very
similar because the t-distribution for 58 degrees of freedom
approximates the Normal distribution (Figure 3.1) (Choudhary,
2019). However, these differences become more relevant when
dealing with <30 observations. Chapter 2 offers a detailed discussion
on the rationale behind confidence intervals.
Although you do not need to memorize the formula for the
confidence interval, it is important to understand the parameters that
influence it: the mean difference, the sample size, and the alpha
critical value, which ultimately defines the t-value.
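The worked example above can be reproduced computationally. The following sketch (assuming Python with scipy, not part of the original text) uses the unpooled standard error of the difference, matching the calculation in the text:

```python
# Sketch: 95% confidence interval for the mean difference in pain scores
# between Treatment A (n=32, mean 4.9, s 1.9) and Treatment B (n=28,
# mean 2.9, s 1.7), using the unpooled standard error as in the text.
import math
from scipy import stats

mean_diff = 4.9 - 2.9                       # 2.0 points on the pain scale
sed = math.sqrt(1.9**2 / 32 + 1.7**2 / 28)  # standard error of the difference
t_crit = stats.t.ppf(0.975, 58)             # ~2.0017 for 58 degrees of freedom
lo, hi = mean_diff - t_crit * sed, mean_diff + t_crit * sed
print(f"95% CI: {lo:.3f} to {hi:.3f}")      # ≈ 1.070 to 2.930
```

The same interval can be approximated by replacing `t_crit` with 1.96, which slightly narrows the limits, as discussed above.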
Question 48: What assumptions need to be
checked before considering a Student’s t-test?
Using a Student's t-test to compare a continuous dependent variable
between two groups requires meeting the following assumptions:
The observations included in the sample are taken
randomly from the population of interest,
All observations are independent of each other (for a two-
sample t-test) or repeated in the same subject (paired t-
test),
The dependent variable follows a Normal distribution within
each group (for a two-sample t-test), or the differences
between groups are normally distributed (paired t-test),
and
The variances of the two samples are approximately equal
(homoscedasticity).
Question 49: Can a Student’s t-test be performed
for only one unpaired sample?
Occasionally, a researcher may be interested in comparing the mean
of one group with a reference value. As a worked example, suppose
that you are a high school chemistry teacher examining a group
(Group A) of 10 students, who obtained the following scores (on a
scale of 0 to 100):
Group A: 56, 89, 65, 77, 91, 64, 82, 54, 72, and 60
The mean score for this dataset is 71, and the standard deviation (s)
is 13.34. The average national score is 74. Is the score obtained in
your group of students significantly worse than the national average
score? To address this question, you would need to perform a one-
sample Student’s t-test. The t-statistic is given by equation A of
Figure 3.3. Thus,

t = (71 − 74) / (13.34/√10) = −0.711

For a two-tailed Student’s t-test with a pre-specified alpha critical
value of 0.05, a t-value of −0.711 yielded a P-value of 0.495.
Consequently, there is insufficient evidence in favor of the alternative
hypothesis of different means. In this example, a negative t-value
only reflects that 71 < 74. Therefore, a t-statistic of +0.711 would
produce the same P-value.
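As an illustration, the one-sample test above can be reproduced with a few lines of Python (scipy assumed; not part of the original text):

```python
# Sketch: one-sample Student's t-test comparing Group A's mean score
# against the national average of 74.
from scipy import stats

scores = [56, 89, 65, 77, 91, 64, 82, 54, 72, 60]  # Group A
t_stat, p = stats.ttest_1samp(scores, popmean=74)
print(f"t = {t_stat:.3f}, P = {p:.3f}")  # t ≈ -0.711, P ≈ 0.495
```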
Figure 3.3. Equations used by statistical software packages for a one-sample
Student’s t-test (A), paired Student’s t-test (B), and the comparison of two
independent samples, x and y (C to F). In the latter, the approach would differ for
equal (C) or unequal (F) variances. When the variances are equal, it is appropriate
to combine the standard deviations (C to E). However, the pooled standard
deviation would differ for equal (D) or unequal (E) sample sizes. When comparing
two independent samples with unequal variances, a Welch test is indicated
(irrespective of the sizes of the groups). S = Standard deviation. Sd = Standard
deviation of the differences. μ = Population mean.
Question 50: What does “independence of
observations” mean in the context of Student’s t-
test assumptions?
The Student’s t-test is used for both paired and unpaired data.
However, the statistical approach is different. Continuing with the
same example, imagine that you want to compare the performance
of your students (Group A) with another group (Group B). The scores
obtained were as follows:
Group A: 56, 89, 65, 77, 91, 64, 82, 54, 72, and 60.
Group B: 58, 90, 70, 81, 93, 67, 84, 60, 74, and 63.
The mean scores obtained for Groups A and B were 71 and 74, and
the standard deviations were 13.34 and 12.49, respectively. Figure
3.4 shows a graphical representation of this experiment (left panel).
Is such a small difference (3) statistically significant for a
dataset containing 20 observations? The t-statistic for two
independent groups (assuming normality, equal variances, and equal
sample sizes) is given by equation D of Figure 3.3. Accordingly,

t = (71 − 74) / (12.92 × √(2/10)) = −0.52

where 12.92 is the pooled standard deviation, √((13.34² + 12.49²)/2).
The P-value for a t-statistic of −0.52 is 0.61, assuming a two-tailed
Student’s t-test with independent samples and an alpha critical value
of 0.05. Thus, there would be insufficient evidence to reject the null
hypothesis of equality of means. Of note, this approach would not be
appropriate if the variances were not approximately equal and/or the
sample size for each group was different (see below).
Note, however, that this P-value is different from that reported
in the one-sample test, even when the mean difference obtained was
the same (−3). This is because the research questions posed are
different.
Alternatively, we can consider a different scenario using the
same dataset. A sample of 10 students underwent an examination
before and after taking a chemistry course.
Before: 56, 89, 65, 77, 91, 64, 82, 54, 72, and 60.
After: 58, 90, 70, 81, 93, 67, 84, 60, 74, and 63.
For the same mean difference (−3), the standard deviation of the
differences (sd) was 1.563. It is worth noting that the standard
deviation for each group computed earlier for a two-sample
Student’s t-test is replaced here by the standard deviation of the
differences obtained from the paired observations.
From this example, it is evident that all students improved
their exam performance. Under the assumption that the differences
between paired observations are normally distributed, the t-statistic
for a Student’s paired t-test is defined by equation B provided in
Figure 3.3:

t = −3 / (1.563/√10) = −6.07

Accordingly, a two-tailed paired t-statistic of −6.07 would be
equivalent to a P-value of 0.0002, and the conclusion would be the
opposite for the same dataset when compared to the last example.
Hence, the paired or unpaired nature of the observations
provides invaluable information on the significance of the test.
Nevertheless, a statistically significant improvement of 3 points on
average in exam performance does not necessarily imply that this
difference is relevant from a practical perspective.
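The contrast between the unpaired and paired analyses described above can be illustrated computationally. This sketch (Python with scipy assumed, not part of the original text) applies both tests to the same dataset:

```python
# Sketch: the same scores analyzed as two independent groups versus as
# paired before/after measurements on the same 10 students.
from scipy import stats

group_a = [56, 89, 65, 77, 91, 64, 82, 54, 72, 60]
group_b = [58, 90, 70, 81, 93, 67, 84, 60, 74, 63]

# Two independent samples (equal variances assumed):
t_ind, p_ind = stats.ttest_ind(group_a, group_b)
# Paired observations (before/after on the same subjects):
t_rel, p_rel = stats.ttest_rel(group_a, group_b)

print(f"unpaired: t = {t_ind:.2f}, P = {p_ind:.2f}")    # ≈ -0.52, P ≈ 0.61
print(f"paired:   t = {t_rel:.2f}, P = {p_rel:.4f}")    # ≈ -6.07, P ≈ 0.0002
```

The identical data yield opposite conclusions depending on whether the pairing is taken into account, exactly as discussed above.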
Figure 3.4. Comparison of mean scores (on a scale of 0 to 100) obtained in a
chemistry exam by different groups of 10 students in two hypothetical situations.
The mean ( ) and standard deviation (s) are indicated for each group. The dashed
lines represent the means obtained for each group. Left: Comparison of two
groups (Group A and Group B) with approximately equal variances (and therefore,
standard deviations). Assuming normality and equality of variances, a Student’s t-
test for independent samples would be indicated. Right: Comparison of two groups
(Group A and Group C) with different variances. Note the different curve shapes
and the change in the density scale. In the latter, the scores obtained in both
groups follow a Normal distribution, but the assumption of the equality of variances
would be violated, warranting a Welch’s test.
Question 51: Is it possible to conduct a Student’s
t-test when variances are unequal?
Imagine that you are now interested in comparing the scores
obtained by your group of high school students (Group A) with a
group of students from a chemistry faculty (Group C). The scores
obtained were as follows:
Group A: 56, 89, 65, 77, 91, 64, 82, 54, 72, and 60.
Group C: 81, 84, 97, 78, 84, 96, 87, 91, 82, and 90.
As described above, the mean and standard deviation for Group A
were 71 and 13.34, respectively. By contrast, the mean for Group C
was 87, with a standard deviation of 6.38 (see right panel of Figure
3.4). It becomes evident that students in Group C performed better,
most likely because they had advanced chemistry training. More
importantly, the standard deviations of the two groups, and therefore
the variances, were heterogeneous because the samples were
drawn from different populations (high school vs. university
students).
It is essential to clarify that the chemistry student examples
provided here are only generalizable to the extent that the population
of interest is representative of chemistry students (see Chapter 2). In
any other case, the conclusions would only apply to Group A
students.
Apart from a visual inspection of the data (Figure 3.4), there
are different ways to formally evaluate the equality of variances
between the groups. Levene’s test and F-test (also known as the
variance-ratio test, discussed in Chapter 5) are customary
approaches. The F statistic is obtained from the ratio of the
variances of the two groups. In our example:

F = 13.34² / 6.38² = 4.37
Although a detailed description of this test is beyond the scope of
this book, the P-value reported for this example is 0.039.
Consequently, the assumption of the equality of variances would not
be satisfied, and conducting a Student’s t-test would not be
appropriate. In this case, a Welch test is preferred (Welch, 1938).
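The variance-ratio test above can be sketched as follows (Python with scipy assumed; the two-sided P-value is taken as twice the upper-tail probability of the F distribution):

```python
# Sketch: variance-ratio (F) test for equality of variances between
# Group A (high school) and Group C (chemistry faculty).
import statistics
from scipy import stats

group_a = [56, 89, 65, 77, 91, 64, 82, 54, 72, 60]
group_c = [81, 84, 97, 78, 84, 96, 87, 91, 82, 90]

# Ratio of sample variances (larger variance in the numerator)
f = statistics.variance(group_a) / statistics.variance(group_c)
# Two-sided P-value from the F distribution with (n-1, n-1) df
p = 2 * stats.f.sf(f, len(group_a) - 1, len(group_c) - 1)
print(f"F = {f:.2f}, P = {p:.3f}")  # F ≈ 4.38, P ≈ 0.039
```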
Question 52: What is the difference between
Student’s t-test and Welch’s test?
When conducting a Student’s t-test for independent samples,
estimating population mean differences requires the variances to be
approximately equal to combine them in equation C of Figure 3.3
(see also the left panel of Figure 3.4). The inequality of variances
between groups can be problematic, because the control of Type I
errors is not reliable (see Chapter 2) (De Gil, 2013). From the
graphic in the right panel of Figure 3.4, it can be deduced that two
samples from different populations are not comparable with a
Student’s t-test, as the standard errors estimated for each group are
discrepant. Hence, it cannot be ascertained whether the estimation
of t-values is consistent for such different groups.
In the last century, Welch and Satterthwaite proposed an
approximate distribution to address this problem (Welch, 1938;
Satterthwaite, 1946). Figure 3.3 shows Welch’s equation for
comparing two means with unequal variances (Lumley, 2002).
Furthermore, to account for the potential error induced by the
disparity of variances, the degrees of freedom were adjusted
according to the following equation:

df = (sA²/nA + sC²/nC)² / [(sA²/nA)²/(nA − 1) + (sC²/nC)²/(nC − 1)]

where s² is the variance obtained for each group (Groups A and C in
this case).
For the proposed example, the adjusted degrees of freedom
are obtained as follows:

df = (17.80 + 4.07)² / (17.80²/9 + 4.07²/9) ≈ 12.91
Note that the resulting degrees of freedom are not necessarily
integers. In this example, the resulting value is quite different
from nA + nC − 2 = 18 because the contrasted variances were
heterogeneous (see right panel of Figure 3.4). Consequently, the
reported P-value would be more conservative for the relevant test.
The equation presented here is known as the Welch-
Satterthwaite equation (Lumley, 2002). However, some statistical
software packages (e.g., Stata®) distinguish between the formulae
proposed by Satterthwaite and Welch. For practical purposes, both
approaches yield similar degrees of freedom. Other packages,
including SPSS Statistics, have incorporated the Welch approach as
the default test for mean comparisons, and some authors have
advocated the use of the Welch test instead of Student’s t-test
(Delacre, 2017).
Lastly, it should be highlighted that Welch’s test is only
applicable for Student’s t-tests with independent samples. Moreover,
there is no homogeneity of variance requirement for Student’s paired
t-test, as in this case, the researcher deals with only one sample with
repeated measurements.
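As a computational illustration (Python with scipy assumed, not part of the original text), Welch’s test and the Welch–Satterthwaite adjusted degrees of freedom for Groups A and C can be obtained as follows:

```python
# Sketch: Welch's test for Groups A and C (unequal variances), plus an
# explicit computation of the Welch-Satterthwaite degrees of freedom.
import statistics
from scipy import stats

group_a = [56, 89, 65, 77, 91, 64, 82, 54, 72, 60]
group_c = [81, 84, 97, 78, 84, 96, 87, 91, 82, 90]

# scipy applies the unequal-variance formula when equal_var=False
t_w, p_w = stats.ttest_ind(group_a, group_c, equal_var=False)

# Welch-Satterthwaite adjusted degrees of freedom, computed explicitly
va, vc = statistics.variance(group_a), statistics.variance(group_c)
na, nc = len(group_a), len(group_c)
df = (va/na + vc/nc)**2 / ((va/na)**2/(na - 1) + (vc/nc)**2/(nc - 1))
print(f"t = {t_w:.2f}, P = {p_w:.4f}, adjusted df ≈ {df:.2f}")  # df ≈ 12.91
```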
Question 53: Is it possible to perform a Student’s
t-test for different group sizes?
Occasionally, researchers may be interested in conducting clinical
trials with a 1:2 randomization ratio (Onuma, 2020). Sample sizes
with a 1:3 ratio are encountered less frequently in the literature (Vold,
2016), and seldom is the allocation set at 1:4 (Kotasek, 2003). The
reasons for considering clinical trials with unequal sample sizes
include ethical issues, cost minimization, and enhancement of the
recruitment process. In any case, the impact of unequal allocation
on statistical power must be carefully considered (see Chapter 8).
The statistical approach for comparing two means from
unequal sample sizes is different because the variance obtained for
each group needs to be weighted by the number of observations
included. The Student’s t-test for unequal sizes is provided in
equation E of Figure 3.3.
Notably, when applying the same variances in equations C to
E of Figure 3.3 to samples with unequal group sizes, the t-statistic
may yield different results. Hence, it is vital to apply the right
equation, following the flowchart shown in Figure 3.3. However, it is
important to clarify that all these formulae are automatically applied
using statistical software packages.
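As a sketch of how software handles unequal group sizes automatically (Python with scipy assumed), scipy’s `ttest_ind_from_stats` applies the appropriately weighted pooled variance when given groups of different sizes; here it is fed the summary statistics from the lithotripsy example of Question 47:

```python
# Sketch: two-sample t-test from summary statistics with unequal group
# sizes (Treatment A: n=32, mean 4.9, s 1.9; Treatment B: n=28, mean 2.9,
# s 1.7). The pooled variance is weighted by the group sizes internally.
from scipy import stats

t_stat, p = stats.ttest_ind_from_stats(4.9, 1.9, 32,
                                       2.9, 1.7, 28,
                                       equal_var=True)
print(f"t = {t_stat:.2f}, P = {p:.5f}")  # t ≈ 4.27
```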
Question 54: Is Student’s t-test statistically
robust against departures from normality?
As discussed above, it is possible to conduct a Student’s t-test for
small samples (n < 30), provided that the required assumptions are
satisfied. In such small datasets, the assumption of normality is
imperative. By contrast, when dealing with samples involving a large
number of observations, the normality assumption is relative. The
capacity of a given test to control Type I errors under certain
violations of the normality assumption is known as statistical
robustness.
The concept of statistical robustness should be interpreted in
the context of the central limit theorem (Lumley, 2002). Chapter 2
discusses the rationale behind this theorem. In essence, the
normality assumption applies to the distribution of the means and not
to the actual distribution of the variable of interest within the two
groups (see Chapter 2). For samples with <30 observations, the
distribution of the means strongly depends on the distribution of the
variable within the sample, and therefore, the Normal distribution of
the dependent variable becomes essential to make accurate
inferences based on the reported t-value. Note that in samples with
<30 observations the central limit theorem does not apply.
A Student’s t-test with two independent, large samples
deserves further consideration. In this case, and relying on central
limit theorem premises, the distribution of means follows an
approximately Normal distribution, irrespective of the dependent
variable’s original distribution, provided the experiment includes
≥30 subjects. Consequently, the normality assumption for the
dependent variable is less important for large datasets.
Nonetheless, the efficiency of the central limit theorem to
normally distribute the means depends on the original data’s
skewness. Thus, the more outliers are observed in the dataset, the
greater the sample size needed to rely on the central limit theorem.
Hence, there is no established boundary to label a given sample as
small or large. However, it has long been accepted that samples
≥30 are considered safe in the majority of clinical scenarios to apply
the central limit theorem (Lumley, 2002).
In conclusion, the Student’s t-test is robust against violations
of normality when dealing with large datasets. In samples with <30
observations, the normality of the dependent variable must be
checked (see Chapter 1). If the variable is normally distributed, a
Student’s t-test is used. If normality cannot be guaranteed, data
transformation should be considered. Alternatively, non-parametric
versions are suggested for paired (Wilcoxon’s sign-rank test) and
independent (Wilcoxon–Mann–Whitney’s test, WMW test) samples.
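The dependence of the central limit theorem on the skewness of the original data, discussed above, can be illustrated with a small simulation (Python with numpy and scipy assumed; the exponential distribution and the simulation sizes are arbitrary choices for illustration only):

```python
# Sketch: means of samples drawn from a heavily skewed (exponential)
# distribution become more symmetric as the sample size n increases,
# illustrating the central limit theorem.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
skews = {}
for n in (5, 30):
    # 2,000 simulated experiments, each averaging n skewed observations
    sample_means = rng.exponential(scale=1.0, size=(2_000, n)).mean(axis=1)
    skews[n] = float(stats.skew(sample_means))
    print(f"n = {n:>2}: skewness of the sample means = {skews[n]:.2f}")
```

The skewness of the distribution of means shrinks markedly between n = 5 and n = 30, consistent with the widely used n ≥ 30 rule of thumb.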
Non-parametric tests
Introduction to non-parametric tests
Chapter 2 discusses the process of inference from a given sample to
the population of interest by defining statistical parameters. In this
context, a parameter is a statistical term referring to a range of
numerical values that describe the estimated distribution of a
variable in the target population. In a Normal distribution, the
parameters of interest are the population mean and population
standard deviation.
The first section of this chapter explored the different
indications of Student’s t-test. It is considered a parametric test
because it relies on the Normal distribution of sample means (see
the central limit theorem in Chapter 2). By contrast, non-parametric
tests do not require the assumption of normality. This section
examines the main features of the Wilcoxon–Mann–Whitney’s
(WMW) test (also known as Wilcoxon rank-sum test) and Wilcoxon’s
signed-rank test.
Question 55: When comparing a numerical
variable between two groups, in which situation
is it appropriate to use a non-parametric test?
As discussed above, the Student’s t-test is appropriate in the
majority of scenarios. Generally, large samples (regardless of their
distribution) and small samples (provided that the distribution is not
skewed) are suitable for this test. However, there are a few
circumstances in which non-parametric tests are appropriate.
1. Ordinal scales
Imagine that you are marketing a novel product, and you conduct a
survey to investigate whether customer satisfaction with the item
purchased is related to gender. To address this question, satisfaction
may be rated using the following scale:
□ Very satisfied □ Satisfied □ Neutral □ Unsatisfied □ Very
unsatisfied
Student’s t-test may not be indicated for ordinal scales for several
reasons. First, Likert scales, as the one provided in this example, are
essentially ordinal variables. Consequently, they cannot be treated
as continuous variables (e.g., being “very satisfied” does not mean
being twice as “satisfied”). Second, it is very likely that the scores
have a skewed distribution, as the subjective perception of
participants tends to orient toward one of the extremes. Third, if
there are <30 observations, the central limit theorem would not
apply. Although the first two arguments are subject to debate (De
Winter & Dodou, 2010), a small sample size would prevent you from
conducting a Student’s t-test. Consequently, a non-parametric test is
appropriate in this case.
2. Presence of outliers in the sample
Extreme values may be problematic for Student’s t-test because
means can be very sensitive to outliers. As non-parametric tests use
medians instead of means, they are more suitable in most cases.
For example, the length of stay (LOS) is a common outcome used in
health economics as a surrogate for quality of care. In an intensive
care unit, some patients require prolonged admissions, consequently
skewing the distribution. Similarly, the costs derived from the care
provided vary as a function of specialty. In both cases, the
researcher should consider non-parametric tests, particularly when
dealing with small samples (Harris, 2008).
3. Detection limit
In the field of biochemistry, viral load is a continuous outcome with
values ranging from “non-detectable” (depending on the threshold of
detection) to large numbers of viral copies per milliliter of blood.
However, only a few patients presenting this condition will have a
large viral load. Thus, studies involving small samples of such
skewed distributions should be analyzed using non-parametric tests.
Question 56: What assumptions need to be
checked before considering a Wilcoxon–Mann–
Whitney’s (WMW) test and Wilcoxon’s signed-
rank test?
The assumptions for these non-parametric tests are as follows
(Motulsky 2018, Neuhäuser 2011):
The sample was drawn at random from a large population,
and
Each observation was independent of each other (for the
WMW test), or the selection of pairs was independent of
each other (for Wilcoxon’s signed-rank test).
Question 57: If non-parametric tests do not
require the normality assumption, why not use
them instead of Student’s t-test?
Generally, Student’s t-test has more statistical power than non-
parametric tests for large samples with Normal distributions,
meaning it is less likely to miss a true difference. Conversely, non-
parametric tests have more statistical power for small samples with
highly skewed distributions and similar variances (Fagerland &
Sandvik, 2009). The reason is that rank-based tests compare the
ranks of the observations rather than their actual values, losing
important information regarding the magnitude of the differences
when the data are well behaved.
Consider the following hypothetical experiment: suppose you
have randomized six diabetic patients to administer their insulin in
the abdomen or thigh to evaluate the pain experienced after
injection. Imagine that the visual analog scale (VAS from 0 to 10)
obtained was 0, 1, and 2 for the first group and 8, 9, and 10 for the
second group. Your first impression might be that there is an
important difference between the two groups. However, the WMW
test assigns ranks 1, 2, and 3 for the first group and 4, 5, and 6 for
the second group, ignoring the difference between values. Thus, the
P-value obtained for this example would be 0.1, and you would
conclude that there was not enough information to reject the null
hypothesis of no difference.
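This hypothetical experiment can be reproduced with an exact Wilcoxon–Mann–Whitney test (Python with scipy assumed; `method='exact'` requests the exact P-value appropriate for such a small sample):

```python
# Sketch: exact WMW test on the insulin-injection pain example. Although
# the two groups do not overlap at all, the P-value cannot fall below 0.1
# with only three observations per group.
from scipy import stats

abdomen = [0, 1, 2]   # VAS pain scores, abdominal injections
thigh = [8, 9, 10]    # VAS pain scores, thigh injections
u, p = stats.mannwhitneyu(abdomen, thigh,
                          alternative='two-sided', method='exact')
print(f"U = {u}, P = {p:.1f}")  # U = 0.0, P = 0.1
```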
Another important aspect favoring parametric tests over non-
parametric tests (when this is indicated) is the interpretation of
results. If you were comparing two drugs for systemic hypertension,
and the mean difference obtained was 12 mm Hg, interpreting
Student’s t-test would be straightforward. However, the WMW test
output would be more difficult to elucidate because apart from the
differences in medians, this test also evaluates the shape and
spread of the distributions (see below).
Question 58: What is the difference between
Wilcoxon’s rank-sum test and the Wilcoxon–
Mann–Whitney’s (WMW) test?
In his original work, Wilcoxon described the methodology to compare
two groups with skewed distributions, both for independent
(Wilcoxon’s rank-sum test) and paired (Wilcoxon’s signed-rank test)
data. The rationale behind estimating the P-values was based on the
number of possible combinations of results for a given dataset
(Wilcoxon, 1945). In 1947, Mann and Whitney proposed U-statistics
to compare the sum of ranks between groups (Mann & Whitney,
1947). In the same year, Wilcoxon published tables to compare the
sum of ranks (Wilcoxon, 1947). Although these tables are different
(Wilcoxon’s tables contain the critical values for the sum of ranks,
and Mann–Whitney’s tables provide the critical values for the U-
statistic), the methodology is essentially the same. For clarity in this
section, we will refer to the Wilcoxon rank-sum test as the WMW test
to avoid confusion with the paired version of Wilcoxon (Wilcoxon’s
signed-rank test).
Question 59: What is the difference between
Wilcoxon’s signed-rank test and the Wilcoxon–
Mann–Whitney’s (WMW) test?
When comparing a numerical variable between two groups with a
non-normal distribution, the observations can be independent or
paired. For independent samples, the Wilcoxon–Mann–Whitney’s
(WMW) test is generally appropriate. Conversely, for paired
observations, Wilcoxon’s signed-rank test is indicated (Schober,
2020). Thus, the non-parametric counterpart of two-sample and
paired Student’s t-tests are the WMW test and Wilcoxon’s signed-
rank test, respectively.
Question 60: Is the Wilcoxon–Mann–Whitney’s
(WMW) test a comparison of medians?
In the first section of this chapter,we explored the main features of
Student’s t-test for comparing means between two groups. One can
analogically interpret the Wilcoxon–Mann–Whitney’s (WMW) test as
a non-parametric version of Student’s t-test for independent samples
with <30 observations and asymmetric distribution. However, it is
incorrect to deduce that the WMW test is merely a comparison of
medians. This statement is only true in the unlikely case that the
distributions of the two groups are identical (Hart 2001, Fagerland
2009, Divine 2018).
Consider the hypothetical situations in Figure 3.5. The dataset
provided in Group A of this figure is compared with that of Groups B
to D. Group B simulated an identical distribution as Group A, so the
only difference between the two groups was the location of the
median value. Aside from the difference between medians, the
distribution tail of Group C shifted to the other extreme. Finally, for
Group D, a different distribution was created for comparison with
Group A. Despite obtaining important median differences when
comparing Groups A to B, A to C, and A to D, only the comparison
between Groups A and B yielded a significant P-value. This difference can be
explained by the fact that the distributions were identical, and only in
this case can the WMW test be regarded as a comparison of
medians (Fagerland, 2009). Thus, while Student’s t-test is robust
against departures from normality, the WMW test is particularly
vulnerable to samples with different variances, thereby decreasing
its statistical power. Consequently, this test should ideally be used
for groups with similar variances.
However, having two groups with identical distributions in
spread and shape is rare in medical research. To overcome this
problem, some authors have advocated using the F-test (explained
above) to evaluate differences in variances. Thus, for variance ratios
>1.5, a researcher could consider using Welch’s test instead of the
Wilcoxon–Mann–Whitney’s (WMW) test (Fagerland, 2009).
Unfortunately, this practice is not widespread among researchers
because it would preclude using the Wilcoxon–Mann–Whitney’s
(WMW) test in many situations (most samples have considerably
distinct variances). Instead, providing relevant information regarding
the nature of the sample distributions and the P-value is strongly
recommended (Hart 2001, Divine 2018).
Figure 3.5. Wilcoxon–Mann–Whitney (WMW) U-test for the comparison of the
variable weight (given in kg) between two groups with skewed distributions in
different hypothetical scenarios. The dataset for each group is provided on the
right side of the respective histogram. Likewise, the mean and median for each
dataset are provided on the left side of the histograms and represented by red and
blue dashed lines, respectively. The solid black curves depict a theoretical Normal
distribution for reference. Note that the observations of Groups A and B follow the
same distribution. However, the median difference is 27 (86 − 59). The U-test for
this comparison yields a P-value of 0.03. Contrastingly, the dataset of Group C
mirrors the distribution of Group A. The median difference is 34, and the P-value
for the U-test is 0.42, despite having a larger median difference than the first
example. Lastly, the distribution of Group D is different when compared to Group
A. This case reported a P-value of 0.2 for a median difference of 18.5 (the smallest
difference of these three examples). From these examples, it follows that the level
of significance of the WMW test does not only depend on the median differences
but also on the distribution shape (Hart 2001, Fagerland 2009).
Question 61: How is the U-statistic estimated for
the Wilcoxon–Mann–Whitney’s (WMW) test?
Figure 3.6 provides a detailed explanation of the Wilcoxon–Mann–
Whitney’s (WMW) test. Note that although the actual data
distribution used in this example is skewed (see example A of Figure
3.5), the U-statistic follows a Normal distribution, as depicted in the
graph of Figure 3.6. Under the null hypothesis of no difference, the
mean of this distribution is always equal to n1 × n2 / 2, where n1 and
n2 are the two group sizes (Figure 3.6).
On the other hand, the critical area for rejection of the null
hypothesis is given by the values provided in the Mann–Whitney
table (Mann & Whitney, 1947). The U-statistic obtained for each
group is contrasted with the critical values in the U-table (for the
corresponding group sizes and the desired alpha critical level)
(Sachs, 2012), and the estimated P-value is reported based on the
area under the curve representing values at least as extreme as the
observed value. Remarkably, most statistical software packages
automatically estimate the U-statistic, and the Mann–Whitney table is
used here only for illustration purposes.
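As a sketch of this recipe (using illustrative weights for groups of 6 and 4 subjects, not the Figure 3.6 dataset), the U-statistic can be computed from the rank sums and checked against scipy.stats.mannwhitneyu:

```python
from scipy import stats

# Illustrative weights (kg) for two groups; not the Figure 3.6 data.
males = [52, 55, 60, 71, 80, 95]
females = [48, 55, 58, 66]

# Rank all observations together; ties receive the average rank.
combined = males + females
ranks = stats.rankdata(combined)

# U for one group = its rank sum minus n(n + 1)/2.
n_m, n_f = len(males), len(females)
r_m = ranks[:n_m].sum()                 # rank sum of the first group
u_m = r_m - n_m * (n_m + 1) / 2
u_f = n_m * n_f - u_m                   # the two U values sum to n1 * n2

# scipy reports the same U (for the first sample) and a P-value.
res = stats.mannwhitneyu(males, females, alternative="two-sided")
print(u_m, u_f, res.statistic, res.pvalue)
```

The smaller U is then compared with the critical value from the Mann–Whitney table (or the normal approximation), exactly as described above.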
Figure 3.6 Wilcoxon–Mann–Whitney (WMW) test for the comparison of the
variable weight (given in kg) between genders for the dataset provided in example
A of Figure 3.5. The observations are arranged in ascending order. The rank
corresponds to the number of observations unless the sample has ties (two or
more subjects taking the same value), in which case the average rank is given for
each observation. For example, as subjects 4 and 5 weighed 55 kg, the rank
assigned would be the average of 4 and 5 (4.5). The rank-sum is obtained for the
whole group and each gender. Based on the equation provided in the figure, the
resulting U-statistic for males (M) and females (F) was 15 and 9, respectively.
These values are compared with the Mann–Whitney table (Mann & Whitney,
1947). Accordingly, the critical U-value for a group of 4 and 6 subjects with an
alpha level of 0.05 is 2. As depicted in the graphic, under the null hypothesis (H0)
of no difference, the U-statistic follows a Normal distribution with
mean n1 × n2 / 2. It entails that the U-value in this example ranges from 0 to
24 (n1 × n2 = 4 × 6 = 24), and the
expected value equals 12, assuming no differences between the groups. As the
critical U-value is 2, any U-value between 0 and 2 or 22 and 24 would yield a P-
value <0.05 (areas highlighted in green). In this case, the U-values of 9 and 15
(dashed red and blue lines, respectively) are far away from the critical area, and
therefore, the null hypothesis of no difference cannot be rejected (P-value = 0.58)
(Mann & Whitney, 1947).
Question 62: What is the methodology used for
Wilcoxon’s signed-rank test?
The methodology used by Wilcoxon for paired samples is similar to
the analog test for two independent samples (Schober, 2020). Figure
3.7 provides an explanation of the test based on a clinical example.
Remarkably, the ranks used in this test are for the differences
(positive and negative) and not for the absolute values. Thus, the
signed-rank Wilcoxon’s test considers the magnitude and direction of
the difference (Whitley & Ball, 2002). It is also important to
emphasize that there is no comparison group as this is essentially a
one-sample test, and the values obtained for each observation are
compared before and after a given intervention.
However, in clinical research it is usual to have a comparison
group. Suppose you are interested in conducting a clinical trial to
evaluate the analgesic effect of an experimental treatment against
the standard of care. The visual analogue scale (VAS) score needs
to be recorded before and after the treatment to account for the
basal level of pain to quantify the pain in each group. As this
experiment involves two groups, it is adequate to compute the VAS
score differences for each group, and conduct a two-sample test
(WMW test or Student’s t-test for independent samples, as
appropriate).
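Wilcoxon's signed-rank test itself is available as scipy.stats.wilcoxon; the before/after counts below are illustrative, not the Figure 3.7 dataset:

```python
from scipy import stats

# Hypothetical monthly migraine counts for 8 patients (illustrative).
before = [8, 6, 7, 9, 5, 10, 6, 8]
after = [5, 6, 4, 4, 6, 5, 3, 6]

# The test ranks the paired differences by absolute size and sums the
# ranks of the positive and negative differences; the zero difference
# (patient 2) is dropped by the default zero_method.
res = stats.wilcoxon(before, after)
print(res.statistic, res.pvalue)
```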
Figure 3.7. Wilcoxon’s signed-rank test for a group of 8 patients diagnosed with
refractory migraines. The number of migraine attacks was recorded one month
before and after receiving an experimental prophylactic medication. The top-left
graph depicts the distribution of the differences obtained. Accordingly, the mean
and median for this dataset are 2.75 (dashed black line) and 3.5 (dashed green
line), respectively. The solidblack curve represents the theoretical Normal
distribution for reference. The methodology is similar to that described in Figure
3.6 for the WMW test, except the ranks are given according to the differences in
migraine attacks reported before and after the intervention, and the rank sums are
obtained for positive and negative differences. Like the U-statistic for the WMW
test, the t-statistic for Wilcoxon’s signed-rank test follows a Normal
distribution (bottom-right graph), with a mean of n(n + 1)/4 = 8 × 9/4 = 18
under the null hypothesis of no difference.
obtained for positive (dashed blue line) and negative (dashed red line) signs was
29 and 7, respectively. As the critical area for n = 8 corresponds to t-statistic = 3
(areas highlighted in green), there is insufficient evidence to reject the null
hypothesis of no difference (P-value = 0.14).
Question 63: Can both of these non-parametric
tests be used for groups of different sizes?
Given the paired nature of Wilcoxon’s signed-rank test, it is not
possible to have samples of different sizes, as the researcher
fundamentally deals with only one sample. On the other hand, as
illustrated in the Figure 3.6 example, the Wilcoxon–Mann–Whitney’s
(WMW) test works for both equal and unequal sample sizes.
However, note that samples with unequal sizes are more likely to
have different shapes and discrepant variances, thereby reducing
the statistical power of the Wilcoxon–Mann–Whitney’s (WMW) test
(Skovlund & Fenstad, 2001).
Consider a clinical trial involving 18 subjects. Placing 7 in one
group and 11 in the other would decrease the statistical power
compared to a clinical trial involving 9 observations per group.
However, increasing the sample size within each group (e.g., 30
patients distributed evenly in the study sample) will undoubtedly
improve the statistical power compared to the first two scenarios.
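This claim can be checked by Monte Carlo simulation; the sketch below assumes skewed (exponential) outcomes and an arbitrary location shift of 0.75, both illustrative choices. The difference between the 7/11 and 9/9 splits is small at this sample size, while the larger balanced trial is clearly more powerful:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def wmw_power(n1, n2, shift=0.75, sims=1000, alpha=0.05):
    """Monte Carlo power of the two-sided WMW test for a location
    shift between two skewed (exponential) samples."""
    rejections = 0
    for _ in range(sims):
        a = rng.exponential(1.0, n1)
        b = rng.exponential(1.0, n2) + shift
        p = stats.mannwhitneyu(a, b, alternative="two-sided").pvalue
        rejections += p < alpha
    return rejections / sims

# 18 subjects split 7/11 vs 9/9, and a balanced trial of 30 subjects.
print(wmw_power(7, 11), wmw_power(9, 9), wmw_power(15, 15))
```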
Conclusion for Chapter 3
Comparing numerical variables between two groups has important
connotations for the correct interpretation of statistical outputs. To
this end, a thorough understanding of the indications, advantages,
and disadvantages of parametric and non-parametric tests is of
utmost importance. In the authors’ view, it is also desirable to get
familiar with the methodology used by statistical software packages
for estimating the parameters of interest.
Clinicians should also consider the importance of sample size
when selecting a statistical test, the implications of big datasets for
statistical power, and the validity of the central limit theorem. More
specifically, Student’s t-test is deemed appropriate for most large
study samples, given its statistical robustness. Non-parametric tests,
on the other hand, should primarily be reserved for small samples
with highly skewed distributions.
References for Chapter 3
4. Choudhary A, Basu S, Sharma R, Gupta R, Das RK, Dey RK.
A novel triple oral regime provides effective analgesia during
extracorporeal shockwave lithotripsy for renal stones. Urol
Ann. 2019;11(1):66-71. doi:10.4103/UA.UA_15_18
5. Delacre M, Lakens D, Leys C. Why Psychologists Should by
Default Use Welch’s t-test Instead of Student’s t-test. Int Rev
Soc Psychol. 2017;30(1):92. doi:10.5334/irsp.82
6. Divine GW, Norton HJ, Barón AE, Juarez-Colunga E. The
Wilcoxon–Mann–Whitney Procedure Fails as a Test of
Medians. Am Stat. 2018;72(3):278-286.
doi:10.1080/00031305.2017.1305291
7. Fagerland MW, Sandvik L. The Wilcoxon-Mann-Whitney test
under scrutiny. Stat Med. 2009;28(10):1487-1497.
doi:10.1002/sim.3561
8. Harris JE, Boushey C, Bruemmer B, Archer SL. Publishing
nutrition research: a review of nonparametric methods, part 3.
J Am Diet Assoc. 2008;108(9):1488-1496.
doi:10.1016/j.jada.2008.06.426
9. Hart A. Mann-Whitney test is not just a test of medians:
differences in spread can be important. BMJ.
2001;323(7309):391-393.
10. Motulsky H. Intuitive Biostatistics: A
Nonmathematical Guide to Statistical Thinking. 4th
ed.:431-441.
11. Kotasek D, Steger G, Faught W, et al. Darbepoetin
alfa administered every 3 weeks alleviates anaemia in
patients with solid tumours receiving chemotherapy; results of
a double-blind, placebo-controlled, randomised study. Eur J
Cancer Oxf Engl 1990. 2003;39(14):2026-2034.
doi:10.1016/s0959-8049(03)00456-8
12. Lumley T, Diehr P, Emerson S, Chen L. The
importance of the normality assumption in large public health
data sets. Annu Rev Public Health. 2002;23:151-169.
doi:10.1146/annurev.publhealth.23.100901.140546
13. Mann HB, Whitney DR. On a Test of Whether one of
Two Random Variables is Stochastically Larger than the
Other. Ann Math Stat. 1947;18(1):50-60.
14. Neuhäuser M. Wilcoxon–Mann–Whitney Test. In:
Lovric M, ed. International Encyclopedia of Statistical
Science. Springer; 2011:1656-1658. doi:10.1007/978-3-642-
04898-2_615
15. Nguyen D, Kim ES, Gil PR de, et al. Parametric Tests
for Two Population Means under Normal and Non-Normal
Distributions. J Mod Appl Stat Methods. 2016;15(1).
doi:10.22237/jmasm/1462075680
16. Onuma Y, Honda Y, Asano T, et al. Randomized
Comparison Between Everolimus-Eluting Bioresorbable
Scaffold and Metallic Stent: Multimodality Imaging Through 3
Years. JACC Cardiovasc Interv. 2020;13(1):116-127.
doi:10.1016/j.jcin.2019.09.047
17. Pearson ES, Gosset WS, Plackett RL, Barnard GA.
Student: A Statistical Biography of William Sealy Gosset.
Clarendon Press; Oxford University Press; 1990.
18. Sachs L. Applied Statistics: A Handbook of
Techniques. Springer Science & Business Media; 2012.
19. Satterthwaite FE. An Approximate Distribution of
Estimates of Variance Components. Biom Bull. 1946;2(6):110-
114. doi:10.2307/3002019
20. Schober P, Vetter TR. Nonparametric Statistical
Methods in Medical Research. Anesth Analg.
2020;131(6):1862-1863.
doi:10.1213/ANE.0000000000005101
21. Skovlund E, Fenstad GU. Should we always choose a
nonparametric test when comparing two apparently
nonnormal distributions? J Clin Epidemiol. 2001;54(1):86-92.
doi:10.1016/s0895-4356(00)00264-x
22. Student. The Probable Error of a Mean. Biometrika.
1908;6(1):1-25. doi:10.2307/2331554
23. Trkulja V, Hrabač P. The role of t test in beer brewing.
Croat Med J. 2020;61(1):69-72. doi:10.3325/cmj.2020.61.69
24. Vold S, Ahmed IIK, Craven ER, et al. Two-Year
COMPASS Trial Results: Supraciliary Microstenting with
Phacoemulsification in Patients with Open-Angle Glaucoma
and Cataracts. Ophthalmology. 2016;123(10):2103-2112.
doi:10.1016/j.ophtha.2016.06.032
25. Welch BL. The significance of the difference between
two means when the population variances are unequal.
Biometrika. 1938;29(3-4):350-362. doi:10.1093/biomet/29.3-
4.350
26. Whitley E, Ball J. Statistics review 6: Nonparametric
methods. Crit Care Lond Engl. 2002;6(6):509-513.
doi:10.1186/cc1820
27. Wilcoxon F. Individual Comparisons by Ranking
Methods. Biom Bull. 1945;1(6):80-83. doi:10.2307/3001968
28. Wilcoxon F. Probability Tables for Individual
Comparisons by Ranking Methods. Biometrics.
1947;3(3):119-122. doi:10.2307/3001946
29. Winter JD, Dodou D. Five-Point Likert Items: t test
versus Mann-Whitney-Wilcoxon. Published online 2010.
doi:10.7275/BJ1P-TS64
Chapter 4
4. Statistical tests II: Statistical tests
for categorical frequency data
Michelle Rosa and Kevin Pacheco-Barrios
Introduction to Chapter 4
It is 1940, and you are a well-known painter in London. It is late
at night, and you are in your studio, using colors to nurture your
inspiration as you live through a painful period of history (World War II).
The room is full of nontraditional paintings, textured brushwork, and
intense colors, as befits an artist of the abstract expressionist
movement.
After a long night of work, you receive a letter from Sir Ronald
Aylmer Fisher, who visits you the next day. Curiously, you do not
know who Sir Fisher is, but like any other somewhat self-centered
artist, you do not worry and leave for the first coffee of the day.
When Sir Fisher arrives, he begins to look at paintings and
sculptures. You realize that you do not have too much in common,
as you find out that Sir Fisher is a British statistician, geneticist, and
academic. Despite his poor dress sense (as you judged), he
wanted to commission a painting for his wife. He did not give much
information and left the studio saying, “I hope you can impress me.”
You feel challenged and have no idea how to start, so you
decide to visit University College London, where Sir Fisher was
head of the Department of Eugenics. The librarian was kind enough
to explain his phenomenal work, highlighting the book "The Design
of Experiments." The librarian also described the book's second
chapter, where Fisher explains the lady-tasting tea experiment. This
experiment is known for introducing the idea of the null hypothesis.
A lady claimed to identify in her tea which was poured first: the milk
or the tea. Fisher designed an experiment to prove whether her
claim was true or not. The null hypothesis was that the lady could not
distinguish the difference between teas; with random guessing, she
would have a 50% chance of success each time. He proposed the
experiment using eight cups of tea, four of the eight with milk first
and the rest with tea first. The lady knows there are exactly four cups
of each, so she is essentially choosing four cups at random out of
eight. Fisher calculated the lady's chances of properly sorting the
eight cups; however, he didn't disclose the final resolution in the
chapter. "Do you think that eight cups are enough to conclude that
the lady can truly tell the difference correctly?" asked the librarian.
Unfortunately, her passion didn't spark your curiosity for
experimental design, as you were lost in the middle of the
experiment's explanation. At the end of your conversation, you find
out as well that his deficient eyesight caused his rejection by the
British Army for World War I, and then you start to have some ideas.
(Box,1980).
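The librarian's question can be answered with a one-line calculation: under the null hypothesis, every way of choosing 4 cups out of 8 is equally likely, so guessing all eight cups correctly has probability 1/C(8, 4):

```python
from math import comb

n_ways = comb(8, 4)          # 70 equally likely ways to pick the 4 "milk-first" cups
p_all_correct = 1 / n_ways   # chance of a perfect result by guessing alone
print(n_ways, p_all_correct)
```

A perfect result would therefore occur by chance only about 1.4% of the time, which is why Fisher regarded eight cups as enough to make the lady's claim convincing at the conventional 5% level.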
After a week of your artistic journey, you realize that you have
used different painting techniques, combining colors and textures;
then, "the final impressive result" is almost done. A meeting with Sir
Fisher was scheduled the following day.
Sir Fisher arrived, and you sensed a hint of disinterest in his
appraisal. It took a while until he delivered his final, reductionist
comment: “I was expecting something more ‘black and white.’ In
physics, black and white are not colors; they are, in fact,
darkness (the absence of light) and clarity (the presence of light).”
Created by Sir Ronald Fisher, Fisher's exact test is a
statistical significance test used to analyze categorical data, which in
clinical research are appropriately arranged in a contingency table.
Categorical variables classify subjects more simply than continuous
variables do, so researchers can easily perform a statistical analysis
(with Fisher's and other tests) and interpret the results after
categorizing a continuous variable. However, some possible drawbacks
must be assessed when reducing the data in this way.
Analysis of categorical data
In this book, you have already seen the best way of representing
your outcome, which generally speaking, can be a numerical
variable or a categorical variable (when you can define some
category or group) (for more read Chapter 1). This chapter also
discusses binary variables, a special case of categorical variables
that can only assume two values (yes/no or presence/absence).
Question 64: What should we consider when
categorizing continuous data?
When you have a continuous variable, you have many values that
can be used (and interpreted) meaningfully. Therefore, you should
consider the importance of keeping such detailed information in your
analysis. When you categorize, can you imagine all values from your
cutoff point being considered equal? For example, on a test (where
grade usually is 0-100), everyone below 70 could be considered a
“failure.” If you are a surgeon reviewing the patient's chart to decide
the risk of surgery, knowing if the patient falls within the normal
range of blood pressure values might be sufficient instead of
knowing the exact blood pressure measurement. It is generally
easier to summarize data using categories; however, you will reduce
your clinical information when categorizing a continuous variable. For
your research, this may bring some drawbacks to the statistical
analysis. As you will end up with much less information, it will be
harder to detect effects, reducing the power of your statistical
test (for power, see Chapter 8). It is also essential to consider the
number of events in the data since the total sample size could affect
the precision of the estimates and, most importantly, the number of
observed events (e.g., death or disease presence).
Question 65: How can you decide the cut-off
values to be used besides previous clinical
knowledge?
A logical or clinical basis is needed to classify the data into
categorical data. Even if you use previous knowledge, you might
need biological or medical significance behind your estimates, as
you want to transform your continuous data into meaningful
categorical data. Some researchers from the social sciences will use
the median split method, which is one way of doing so arbitrarily
(DeCoster et al., 2011; Iacobucci et al., 2015). Essentially, you will
find the median of your continuous variable and classify it into two
categories: "low" (to every value below the median) and "high" (to
every value above the median). Median splits are not advisable
because they can introduce random errors and frequently reduce the
power. Furthermore, you might increase the chance of Type I and II
errors, as split data can give significant results even when the
continuous data do not. “If researchers pick the method
that yields significance, then Type I errors will increase even as
splitting, overall, reduces power” (McClelland et al., 2015). Another
approach is to use the distribution shape of the data as guidance:
plot a histogram of the variable and look for bimodal behavior (two
approximately normal distributions in the histogram), in which case it
is adequate to place the cutoff between the two
distributions (DeCoster et al., 2011). There is no perfect method to
categorize a continuous variable; therefore, it is important to clearly
report your criteria in your method section to ensure reproducibility.
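For illustration only (the text above cautions against median splits), the procedure amounts to a couple of lines; the blood-pressure values here are simulated:

```python
import numpy as np

rng = np.random.default_rng(1)
sbp = rng.normal(130, 15, 10)        # simulated systolic blood pressures

# Median split: values below the median become "low", the rest "high".
cutoff = np.median(sbp)
category = np.where(sbp < cutoff, "low", "high")
print(round(float(cutoff), 1), list(category))
```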
Question 66: Why do we need larger samples for
dichotomized data?
Usually, in a randomized clinical trial, you want to detect the
differences between the two groups (for effect size or minimal
clinically relevant difference, see Chapter 8). Calculating the sample
size means estimating the number of people needed in a trial. With
a binary outcome, instead of the number of people, you estimate the
number of events (or the proportion of patients in each group), so
the effect size becomes a difference in event rates.
Because you are using much less information, you need some trade-
offs to increase the power of your study, which can be done by
increasing the sample size. This is especially true when the
frequency of the event is low in the target population. Thus, for
dichotomized data, it is crucial to conduct a comprehensive literature
review to identify the best values (event frequencies) in previous
studies to calculate the sample size.
Question 67: How do we identify binary
outcomes in research papers?
A good starting point is identifying the research question using the
PICOT's acronyms (PICOT = population, intervention, control,
outcome, and time) when lookingat a paper (Agoritsas et al., 2015).
When looking for the outcome, you search for what happened during
an investigation to measure the health condition. As stated earlier, a
binary outcome assumes only two values; in other words, you have
only two possible results (for example, success/failure or
dead/alive). This is useful when calculating proportions or
percentages.
Chi-Square test
Question 68: What are R-C tables? How to use
them? Could you give an example in which such
tables are necessary?
Let us begin by defining the R-C tables. Frequency tables, also
called contingency tables, describe the frequencies of a categorical
variable or a combination of multicategory variables in a dataset. R
stands for “row,” and C stands for “column.” Contingency tables will
describe the results whenever two or more groups are compared,
and the outcome is a categorical variable (response/nonresponse).
See table 4.1 for an example.
Table 4.1. Contingency table example

A)           Response   Nonresponse
   TTR A     a          b
   TTR B     c          d

B)           Response   Nonresponse
   TTR A     a          b
   TTR B     c          d
   TTR C     e          f

TTR = treatment
These tables are called contingency tables because the count in
each cell is the number in that category of that variable contingent on
also lying in a particular category of the other variables. The number
of observations simultaneously falling into the categories of two
variables is counted to test the relationship between two variables,
such as whether they are independent or associated (Riffenburgh,
2005).
Question 69: How is the Chi-square test
calculated? Can you show an example?
The Chi-square statistic can be calculated with the following formula:
χ² = ∑ (O − E)² / E, where:
O = observed values
E = expected values
∑ = sum over all cells of the (O − E)²/E values
In the contingency tables above, apply the formula to each cell in
your contingency table and sum these values. The result is
your chi-square statistic, a single value that shows how much
difference exists between the observed values and the values
expected if there were no relationship in the population (Frost, 2020).
Usually, you will not calculate this by hand, but if you do, you need to
take your chi-square statistic and compare it to a critical value from a
chi-square table (see the example below).
Generally, a large chi-square test statistic shows that the
observed values (sample data) do not fit the expected values
(population data). We can conclude that there is a relationship
between the tested variables. A small chi-square test statistic
indicates good agreement between the observed and expected
values (Frost, 2020); thus, there is no evidence of a relationship
between the variables being tested.
Example for Chi-square calculation
In a hypothetical study, a psychiatrist assessed the distribution of
patients taking depression medications (A, B, or C) by gender. The
psychiatrist found the following:
Gender    Medication A    Medication B    Medication C    Total
Male      35              32              51              118
Female    70              17              48              135
Total     105             49              99              253
State the null hypothesis:
H0: there is no association between patient gender and depression
medication use
HA: there is an association between patient gender and depression
medication use
State the alpha level: 0.05
Calculate the degrees of freedom: df = (rows − 1) × (columns − 1) = (2 − 1) × (3 − 1) = 2
Use the Chi-square table to find the critical chi-square value
Figure 4.1. Chi-square table adapted from http://www.ttable.org/chi-square-
table.html
Critical chi-square value (df = 2, α = 0.05) = 5.99
Therefore, we can reject the null hypothesis if the calculated chi-
square value exceeds the critical value (5.99). How do we calculate?
χ² = ∑ (O − E)² / E, where:
O = observed values
E = expected values (what we expect to see)
∑ = sum over all cells of the (O − E)²/E values
Gender    Medication A     Medication B     Medication C     Total
Male      35 (49) (4.0)    32 (23) (3.5)    51 (46) (0.54)   118
Female    70 (56) (3.5)    17 (26) (3.2)    48 (53) (0.47)   135
Total     105              49               99               253

Values shown as: observed count (expected count) (cell chi-square value)
Expected frequencies = (row total x column total/grand total)
E = (118 x 105)/253 = 49
E = (135 x 105)/253 = 56 (and so on)
Then, you move forward with the equation for each cell:
(35 − 49)²/49 = 4.0; (32 − 23)²/23 = 3.5 (and so on)
Summing all the cell values gives the chi-square statistic:
4.0 + 3.5 + 0.54 + 3.5 + 3.2 + 0.47 = 15.21
15.21 > 5.99 – therefore, we can reject the null hypothesis and state
that there is a significant association between gender and type of
depression medication.
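The hand calculation can be verified with scipy.stats.chi2_contingency; because scipy does not round the expected counts to whole numbers, its statistic (≈15.3) differs slightly from the rounded 15.21 above, and both comfortably exceed the critical value of 5.99:

```python
from scipy import stats

# Observed counts from the gender-by-medication example above.
observed = [[35, 32, 51],
            [70, 17, 48]]

chi2, p, dof, expected = stats.chi2_contingency(observed)
print(round(chi2, 2), dof, round(p, 4))   # statistic, df, P-value
```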
Question 70: What is the difference between the
chi-square goodness-of-fit test and chi-square
test for independence? Is the chi-square test
appropriate for examining whether the two
variables are related or independent?
There are three types of chi-square methods: goodness of fit, the
chi-square test for independence, and the chi-square test of
homogeneity. All three calculate a test statistic using the
same formula and are used to understand the relationship between
observed and theoretical (“expected”) data.
The main difference lies in data collection and hypothesis testing
(please see the following questions).
Chi-square goodness of fit test
The chi-square goodness-of-fit test considers one categorical
variable and one population. The data need to come from a randomly
collected sample of the population, and the test determines whether the
distribution of a set of categorical values follows an expected
distribution (Peacock & Peacock, 2011).
H0: the set of categorical values follow the expected distribution
HA: the set of categorical values do not follow the expected
distribution
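A sketch with made-up counts: scipy.stats.chisquare compares observed category counts with expected counts derived from a hypothesized distribution (the blood-group proportions here are illustrative assumptions):

```python
from scipy import stats

# 120 patients classified by blood group (illustrative counts).
observed = [50, 35, 25, 10]

# Expected counts under a hypothesized population distribution.
expected = [120 * p for p in (0.45, 0.30, 0.20, 0.05)]   # 54, 36, 24, 6

res = stats.chisquare(observed, f_exp=expected)
print(res.statistic, res.pvalue)
```

A non-significant result means the observed counts are compatible with the hypothesized distribution.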
Chi-square for independence
The Chi-square test can be applied to contingency tables with
various dimensions and may be used to compare two variables and
check if they are related. Essentially, we want to know if one categorical
variable depends on the other categorical variable, similar to a
correlation between continuous variables. The differences between the
observed and expected frequencies are calculated as a test of
association. You can interpret “expected” as what you expect if the
null hypothesis is true (McHugh, 2013).
Stating the hypothesis for Chi-square could be this:
H0: there is no association between variables (variables are
independent)
HA: there is an association between variables (variables are
dependent)
The exercise we did before is an example of Chi-square for
independence; in that case, we rejected the independence (no
association) of patient gender and depression medication use.
Chi-square of homogeneity
The chi-square homogeneity test is used to understand if a single,
categorical variable has the same distribution in two or more
populations (or subgroups within a population). Its results are similar
to those of the chi-square test for independence; the difference is
easier to understand when comparing the null hypothesis statements:
H0: the distribution of the categorical variable is the same for the
populations.
HA: the distribution of the categorical variable isn’t the same for the
populations.
Independence and homogeneity tests are concerned with the cross-
classification of data and use the same test statistic. However,
they differ from one another. The test for independence checks
whether one attribute is independent of the other (association) and
involves a single sample from the population. In contrast, a
homogeneity test checks whether different samples come from the
same population and involves two or more independent samples
(Peacock & Peacock, 2011).
Question 71: If the chi-square test is non-
parametric, why does the central limit theorem
need to be used?
The chi-square test can be used in different scenarios and can be
parametric or non-parametric. As discussed before, the goodness of
fit, chi-square test for independence, and chi-square test of
homogeneity are the most commonly used for categorical variable
analysis. These three tests are non-parametric because no
parameters (mean, standard deviation) are required to compute the
test, and no assumptions about the underlying distribution are made.
We only need to calculate the expected values and
compare them with the observed values in our study. However, to
obtain a reliable estimate, the expected value in
each cell should be large enough (the rule of thumb is at least five
events) (Peacock & Peacock, 2011), which is usually achieved with a
larger sample size. For this reason, several authors suggest applying
the central limit theorem (for its definition, review Chapter 1) as
guidance because the chi-square distribution is derived from the
standard normal distribution. If the sample is large, the probability of
observing enough events is also higher. In general, though, relying on the
theorem alone is not the best approach; it is more adequate to assess
the expected counts in the contingency table directly.
Question 72: What are common pitfalls in using
the Chi-square test for categorical data analysis?
What are the limitations of the Chi-Square test?
The chi-square test requires no assumptions regarding the shape of
the population distribution from which the sample was drawn.
However, similar to all inferential methods, it assumes that the data
are from a random sample.
Regarding the sample size, the chi-square test is sensitive to
sample size and small expected frequencies (within the cells in the
table) (Riffenburgh, 2005).
Even if a relationship is found, the strength of the relationship
cannot be measured using the chi-square test.
It is important to note that the chi-square statistic can only be
used with counts (frequencies); it cannot be used with
percentages, proportions, means, or similar statistical values.
Question 73: What sample size should be best for
Chi-square?
You need to consider that the main goal of statistical
hypothesis testing is to collect evidence for rejecting the null
hypothesis of "no difference." If the sample is too small,
the power of the test may not be sufficient to detect a difference that
exists, which results in a type II error. A common recommendation
for sample size is a minimum of 30 observations, and no more than
20% of cells should have an expected frequency of less than five
(McHugh, 2013).
Question 74: What is Yates’s correction in the
chi-square test?
The chi-squared test is only an approximation. We learned about
contingency tables. The chi-square distribution is continuous,
whereas the 2 × 2 table is dichotomous; therefore, an adjustment
must be made. Yates's correction is used to avoid overestimating
statistical significance with small sample sizes. If any expected cell
count is below 10, it is recommended to apply the Yates correction.
Because the correction is intended to improve the chi-square
approximation, it adds little when the sample size is large (Pagano &
Gauvreau, 2018). In practice, if your expected cell counts are below
10 or 5, it is better to perform Fisher's exact test instead of using
Yates's correction.
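The effect of the correction is easy to see side by side. A sketch with scipy, using a hypothetical small table; `correction=True` applies Yates's continuity correction:

```python
from scipy.stats import chi2_contingency

# Hypothetical small 2 x 2 table with modest expected counts
table = [[9, 3],
         [4, 10]]

chi2_plain, p_plain, _, _ = chi2_contingency(table, correction=False)
chi2_yates, p_yates, _, _ = chi2_contingency(table, correction=True)  # Yates's correction

# The correction shrinks the statistic, giving a more conservative p-value
print(f"without correction: chi2 = {chi2_plain:.2f}, p = {p_plain:.4f}")
print(f"with Yates:         chi2 = {chi2_yates:.2f}, p = {p_yates:.4f}")
```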
Question 75: What is the Chi-square trend test?
As the name states, this test checks whether an association between
variables follows a trend. The null hypothesis is that there is no
trend across the groups. One binary variable and one ordered
categorical variable are used to check whether the association
between the variables follows a trend using the chi-square test for
trend (Koletsi & Pandis, 2016; Swinscow, 1997). You would not use
the chi-square test of independence here because you want to take
the ordering of the categories into account; for example, if you have
a binary outcome (disease yes/no) and an ordered categorical
variable (poisoning dose: low, medium, or high). The odds ratio then
shows how likely a person is to become ill at each level of exposure
to the poison.
Question 76: When should we use McNemar's
test instead of the chi-squared test?
McNemar's test is used to compare two paired samples when the
data are nominal and dichotomous. You need to take the pairing
into account in the analysis. It can be used with matched
case-control studies as well as with data collected "before and
after" an intervention (Peacock & Peacock, 2011).
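The McNemar statistic uses only the discordant pairs (pairs that changed between the two measurements). A minimal sketch of the chi-square version, with hypothetical counts:

```python
from scipy.stats import chi2

# Hypothetical paired before/after data: b and c are the discordant pairs
# b = event only after, c = event only before (concordant pairs do not contribute)
b, c = 15, 5

# McNemar's chi-square statistic with 1 degree of freedom
stat = (b - c) ** 2 / (b + c)
p_value = chi2.sf(stat, df=1)
print(f"McNemar chi2 = {stat:.2f}, p = {p_value:.4f}")
```

For very small discordant counts, an exact binomial version of the test is usually preferred over this chi-square approximation.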
Fisher Exact Test
Question 77: When should we use Fisher's exact
test? What are the assumptions for using the
Fisher test?
If the expected values for the cells are very small, the results of the
chi-squared test may be invalid. In this case (when an expected cell
count is < 5), Fisher's exact test is preferred. The calculation by
hand may take a long time, because every possible table with the
same margins must be enumerated, which requires greater
computational power (Riffenburgh, 2005). Note that the test treats
the row and column totals as fixed, not random. Another assumption
is that each observation can be classified into only one cell. Fisher's
exact test should not be used to analyze paired data (du Prel et al.,
2010).
Stating the hypothesis for Fisher's exact test could be this:
H0: there is no association between variables (variables are
independent).
HA: there is an association between variables (variables are
dependent).
We can use the lady-tasting-tea experiment as a working example
(Box, 1980; Mehta & Patel, 2011). As at the beginning of this
chapter, we will check whether a lady can identify whether the milk
or the tea was poured into her cup first. Tea was added first in four
cups, and milk was added first in the remaining four. She labeled
four cups with each order. Given the resulting 2 × 2 table, can you
conclude that she could correctly identify which was poured first?
Null hypothesis: The variables are independent (the order in which
milk or tea is poured into a cup and the lady’s guess).
Alternative hypothesis: The variables are dependent (the order in
which milk or tea is poured into a cup and the lady’s guess).
Two p-values can be observed in the STATA output for the
lady-tasting-tea example. Considering our hypothesis, we fail to
reject the null hypothesis that there is no association between the
order in which milk or tea was poured into a cup and the lady's
guess.
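The same analysis can be sketched in Python. The STATA table is not reproduced in the text, so the counts below assume, hypothetically, that she identified three of the four tea-first cups correctly:

```python
from scipy.stats import fisher_exact

# Lady-tasting-tea table (hypothetical guesses):
#                     guess tea-first   guess milk-first
# tea poured first           3                 1
# milk poured first          1                 3
table = [[3, 1],
         [1, 3]]

odds_ratio, p_two_sided = fisher_exact(table, alternative="two-sided")
print(f"OR = {odds_ratio:.1f}, two-sided p = {p_two_sided:.3f}")
```

With these counts the two-sided p-value is well above 0.05, so we would fail to reject the null hypothesis of independence, consistent with the conclusion in the text.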
Question 78: How do we interpret a one-tailed
Fisher exact test result regarding the fisher test?
The one-tailed question relates to hypothesis testing in general
rather than to the Fisher test itself: it means the alternative
hypothesis has a direction. One-tailed and two-tailed tests are two
ways of framing a statistical test. Applying our clinical example to
the Fisher test, consider the following research question: Does
treatment reduce insomnia? Here you are interested in a reduction
of insomnia, and if you assume that the treatment cannot increase
insomnia (an assumption that may not be acceptable), a one-tailed
test can be used. The hypotheses can be stated as follows.
H0: Odds of insomnia are equal between treatment and placebo.
HA:
Two-tailed: odds of insomnia differ between treatment and placebo.
Left tailed: odds of insomnia are less for patients on treatment
Right tailed: odds of insomnia are greater for patients on treatment
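In scipy, the direction of the alternative is selected with the `alternative` parameter of `fisher_exact`. A sketch with hypothetical insomnia counts:

```python
from scipy.stats import fisher_exact

# Hypothetical counts (rows: treatment, placebo; cols: insomnia yes, no)
table = [[4, 16],
         [12, 8]]

_, p_two = fisher_exact(table, alternative="two-sided")
_, p_less = fisher_exact(table, alternative="less")      # odds lower on treatment
_, p_greater = fisher_exact(table, alternative="greater")

print(f"two-sided p = {p_two:.4f}")
print(f"left-tailed p = {p_less:.4f}, right-tailed p = {p_greater:.4f}")
```

Because the hypothetical data point toward fewer insomnia events on treatment, the left-tailed p-value is the smallest; a one-tailed test should only be pre-specified when the opposite direction can genuinely be ruled out.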
Question 79: Should we calculate both Chi-
square and Fisher if the sample meets the
assumptions and compare the results?
It is wrong to choose the test that gives you the preferred result. In
the literature, you will find many resources explaining why you
should plan your statistical analysis beforehand, including the
choice of test (du Prel et al., 2010; Kim, 2017; Simpson, 2015). The
first step relates to the research question and hypothesis statement;
the statistical test is then guided by these pieces of information. You
also need to consider whether sampling or allocation is random (an
assumption of both tests). Regarding assumptions, keep in mind
that Fisher's exact test formally requires the marginal totals to be
fixed in advance, a condition that is uncommon in medical research
(Ludbrook, 2013).
Question 80: Apart from the sample size, are
there any significant criteria to choose between
the Fisher's exact test and the chi-square test?
Another important piece of information to consider is the precision of
the p-value. We have learned that the chi-squared test relies on an
approximation, while Fisher's exact test is exact; that is, it gives you
the exact p-value. The p-value of Fisher's exact test is accurate for
all sample sizes, whereas chi-square test results for the same
hypotheses can be inaccurate when the number of events in the
cells is small (Kim, 2017).
                                Outcome (event)
                             Response    Nonresponse
Exposure      Treatment         a             b
(predictor)   Control           c             d
Effect measures
Question 81: Could you explain the odds ratio
definition? How do you interpret odds ratio with
confidence interval?
Considering our discussion, recall that the chi-square and Fisher
tests assess whether an association exists. The odds ratio (OR)
measures the strength of that association, so it is another way of
describing the relationship between two variables. The odds ratio
can be seen as another application of the 2 × 2 table to check the
association between two variables (Bewick et al., 2004), or defined
as a measure of the association between an exposure and an
outcome (Szumilas, 2010). Odds are not a probability or a risk. Let
us consider a 2 × 2 table and the corresponding calculation.
Odds ratio = (a × d) / (b × c)
Let us adapt our example from the beginning of this section. In a
hypothetical study, a psychiatrist wants to assess the occurrence of
insomnia in patients taking depression medication A compared with
a placebo (control group). We aim to determine whether insomnia is
similar between patients taking the medication and those who are
not (control group). How can these results be interpreted? Is there
an association between treatment and outcome? Is it significant
(Szumilas, 2010; Whitley & Ball, 2002)?
Stating the hypothesis:
H0: there is no association between treatment and insomnia.
HA: there is an association between treatment and insomnia.
We use the chi-square test because we have more than five
expected events in each cell of the 2 × 2 table. From STATA:
We can reject the null hypothesis and say that there is an
association between treatment and insomnia occurrence; however,
we cannot determine whether the significance favors treatment A or
the placebo.
To present the direction of our association (strength), we can
calculate the odds ratio. In our example, the odds ratio can be
defined as the odds of the outcome in treated patients (treatment A)
divided by the odds of the outcome in untreated patients (control
group).
              Insomnia
              No    Yes
Control       62    61
Treatment A   86    34
OR = (34 × 62) / (86 × 61) ≈ 0.40
The OR we calculated was 0.40, meaning the odds of insomnia in
the treatment A group was 60% less than that in the control group
(the value is smaller than 1, meaning it is "protecting you” from the
occurrence of the event).
OR = 1 odds of exposure are the same for treatment and control
(no effect).
OR > 1 odds of exposure among the treatment group is greater
than among the control group (positive association between
exposure and disease).
OR < 1 odds of exposure among treatment group is less than
among control group (negative association between exposure and
disease, protective effect).
We can also consider the confidence interval (CI), which indicates
the range within which the true odds ratio of the population from
which the patients were drawn is likely to lie. The CI is calculated as
follows (Pagano & Gauvreau, 2018):
95% CI = exp (ln (OR) ± 1.96 * SE {ln (OR)})
Calculate:
the natural log of the OR: ln(OR) = −0.92
the standard error of ln(OR): SE = √(1/34 + 1/86 + 1/61 + 1/62) ≈ 0.27
Calculate the lower and upper CI limits:
Lower 95% CI = exp [(-0.92) – 1.96 * 0.27] = exp [-1.44] = 0.23
Upper 95% CI = exp [(-0.92) + 1.96 * 0.27] = exp [-0.38] = 0.68
In this study, we obtained an OR= 0.40, indicating that the odds of
insomnia in the treatment A group was 0.40 times the odds of
experiencing insomnia in the control group; or the odds of
experiencing insomnia was reduced by 60% among the treatment A
group compared to the control group. With 95% confidence, the true
odds ratio was in the range of 0.23 – 0.68. Looking at the first
interpretation, we should consider that for OR, the “no difference
between groups” (no effect) is equal to 1 (not zero as usual). This
means that when a confidence interval includes a value of 1, the
odds of the measured outcome are the same for both groups
(intervention and control), even without a result from a significance
test.
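The hand calculation above can be reproduced step by step. A short sketch using the chapter's insomnia counts and the same log-based (Woolf) confidence interval:

```python
import math

# Counts from the insomnia example in the text
a, b = 34, 86   # treatment A: insomnia yes / no
c, d = 61, 62   # control:     insomnia yes / no

odds_ratio = (a * d) / (b * c)
log_or = math.log(odds_ratio)
se = math.sqrt(1/a + 1/b + 1/c + 1/d)   # standard error of ln(OR)

lower = math.exp(log_or - 1.96 * se)
upper = math.exp(log_or + 1.96 * se)
print(f"OR = {odds_ratio:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

Working at full precision gives an interval of about 0.24 to 0.68; the text's 0.23 lower bound reflects rounding ln(OR) and SE to two decimals before exponentiating.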
Question 82: What are the criteria to use the
Cochran-Mantel-Haenszel?
We need to use the Cochran-Mantel-Haenszel test (CMH) when we
want to control for a third variable, that is, to estimate the association
between exposure and outcome after adjusting for a third variable.
This stratification controlled for confounding (Tripepi et al., 2010).
Considering our hypothetical study, we could stratify by sex, using
the 2 × 2 table divided by sex (using STATA):
Stating the hypothesis:
H0: the odds ratios in the individual strata are equal to one (no
association between treatment and insomnia within either stratum).
HA: at least one odds ratio is different from one.
The output showed stratum-specific ORs for males (OR = 0.31) and
females (OR = 0.46); a chi-square test for homogeneity can be used
to check whether these differ significantly from each other. The
sex-adjusted (Mantel-Haenszel) OR was 0.40, and none of the CIs
included the value of 1, meaning that the results were statistically
significant. Thus, we can reject the null hypothesis and state that the
association between the predictor (treatment/control) and the
outcome variable (insomnia/no insomnia) persists after adjusting for
the conditioning variable (sex).
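The Mantel-Haenszel pooled OR can be computed directly from the stratum tables. A minimal sketch, assuming a hypothetical sex split of the chapter's counts (the two strata below add up to the marginal 2 × 2 table, but the true sex-specific counts are not reproduced here):

```python
# Mantel-Haenszel pooled OR: OR_MH = sum(a_i*d_i/n_i) / sum(b_i*c_i/n_i)
# Each stratum is (a, b, c, d) = (treated-yes, treated-no, control-yes, control-no);
# these stratum counts are hypothetical.
strata = [
    (15, 45, 30, 30),   # e.g., males
    (19, 41, 31, 32),   # e.g., females
]

num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
or_mh = num / den

stratum_ors = [(a * d) / (b * c) for a, b, c, d in strata]
print(f"stratum ORs: {[round(x, 2) for x in stratum_ors]}, pooled OR = {or_mh:.2f}")
```

Because the pooled estimate is a weighted average of the stratum ORs, it always lies between the smallest and largest stratum-specific OR.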
Question 83: What's the difference between
relative risk and odds ratio? When to use Odds
Ratio, Relative Risk or Risk Difference?
In clinical research, we are familiar with three study designs used to
investigate the effects of medical interventions: randomized clinical
trials, observational cohort studies, and case-control studies (Knol et
al., 2012).
Relative risk (RR), odds ratio (OR), and risk difference (RD)
are measures of association commonly used for the analysis of
dichotomous data. RR and OR are measures of the relative effect,
whereas RD measures the absolute effect.
We can check the definitions (Higgins et al., 2021;
Ranganathan et al., 2015; Tenny & Hoffman, 2021a, 2021b) and the
use of the 2 × 2 table.
                               Event/outcome
                              Yes        No
Exposure/        Yes           a          b
intervention     No            c          d
RR is the ratio of the risk of an event in the exposed group to the
risk of the event in the non-exposed group: RR = [a/(a + b)] /
[c/(c + d)]. It can also be used for cohort studies.
In clinical research, it is common to assess the occurrence of an
event by comparing groups. Statistical tests will detect the excess
risk between both groups; the relative risk allows the quantification of
the magnitude of such excess and measures the strength of the
association between exposure and disease. You will see that the RR
is usually observed in cohort studies.
OR: the ratio of the odds of an event in the exposed group to the
odds of the event in the non-exposed group: OR = (a/b) / (c/d) =
(a × d) / (b × c).
It can be used in RCTs, cohort, and case-control studies. In a cohort
study, the OR is determined from exposure to disease. In contrast,
the OR is determined from disease to exposure in a case-control
study.
RD: the difference between the risk of an event in the experimental
group and the risk in the comparison group: RD = a/(a + b) −
c/(c + d). The frequency of the events is compared in terms of their
absolute difference.
The risk difference can be estimated for any study, even when there
are no events in one of the groups. Its interpretation is
straightforward, as it describes the difference in the observed risk of
events between the groups.
Sometimes, you might see OR and RR used interchangeably;
it is important not to misinterpret both measures. When the outcome
is rare, the OR approximates RR (Davies et al., 1998; Knol et al.,
2012). However, when an event is common, the OR can
overestimate the RR; thus, OR should be avoided.
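The three measures can be computed from the same 2 × 2 table, which also makes the OR-versus-RR discrepancy for common events concrete. A sketch using the chapter's insomnia counts:

```python
# Effect measures from a 2 x 2 table, using the insomnia counts from the text
a, b = 34, 86   # treatment: event yes / no
c, d = 61, 62   # control:   event yes / no

risk_treated = a / (a + b)
risk_control = c / (c + d)

rr = risk_treated / risk_control          # relative risk (relative effect)
rd = risk_treated - risk_control          # risk difference (absolute effect)
odds_ratio = (a * d) / (b * c)            # odds ratio

print(f"RR = {rr:.2f}, RD = {rd:.2f}, OR = {odds_ratio:.2f}")
```

Here the event is common (risk of roughly 28% vs 50%), so the OR of 0.40 is farther from 1 than the RR of 0.57: exactly the overstatement the paragraph above warns about.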
Question 84: Why doesn't p-value in Fisher's
exact test necessarily agree with the confidence
interval for a RR, OR, or Risk Difference?
To interpret the results better, one should look at both estimates:
p-values and confidence intervals. While the p-value gives the
probability of observing a difference between groups at least as
large as the one seen if chance alone were operating, confidence
intervals give a range of values within which the true population
value is likely to lie, which helps convey the uncertainty around your
point estimate (Whitley & Ball, 2002).
Fisher's test computes an exact p-value. However, the confidence
intervals for the odds ratio and relative risk are calculated using only
an approximation, relying on logarithmic transformations to estimate
their variability. Therefore, the confidence interval may not always
agree with the p-value.
References for Chapter 4
1. Agoritsas, T., Vandvik, P. O., Neumann, I., Rochwerg, B.,
Jaeschke, R., Hayward, R., Guyatt, G., & McKibbon, K. A. (2015).
Finding Current Best Evidence. In G. Guyatt, D. Rennie, M. O.
Meade, & D. J. Cook (Eds.), Users’ Guides to the Medical
Literature: A Manual for Evidence-Based Clinical Practice, 3rd ed
(Vol. 1–Book, Section). McGraw-Hill Education.
jamaevidence.mhmedical.com/content.aspx?aid=1183875650
2. Bewick, V., Cheek, L., & Ball, J. (2004). Statistics review 8:
Qualitative data – tests of association. Critical Care, 8(1), 46–53.
https://doi.org/10.1186/cc2428
3. Box, J. F. (1980). R. A. Fisher and the Design of Experiments,
1922-1926. The American Statistician, 34(1), 1–7.
https://doi.org/10.2307/2682986
4. Davies, H. T. O., Crombie, I. K., & Tavakoli, M. (1998). When can
odds ratios mislead? BMJ : British Medical Journal, 316(7136),
989–991.
5. DeCoster, J., Gallucci, M., & Iselin, A.-M. R. (2011). Best
Practices for Using Median Splits, Artificial Categorization, and
their Continuous Alternatives. Journal of Experimental
Psychopathology, 2(2), 197–209.
https://doi.org/10.5127/jep.008310
6. du Prel, J.-B., Röhrig, B., Hommel, G., & Blettner, M. (2010).
Choosing Statistical Tests. Deutsches Ärzteblatt International,
107(19), 343–348. https://doi.org/10.3238/arztebl.2010.0343
7. Frost, J. (2020). Hypothesis Testing: An Intuitive Guide for Making
Data Driven Decisions (1st ed.). Statistics By Jim Publishing.
8. Higgins, J. P., Li, T., & Deeks, J. J. (2021). Chapter 6: Choosing
effect measures and computing estimates of effect. In Cochrane
Handbook for Systematic Reviews of Interventions version 6.2.
Higgins JPT, Thomas J, Chandler J, Cumpston M, Li T, Page MJ,
Welch VA (editors).
https://training.cochrane.org/handbook/current/chapter-06
9. Iacobucci, D., Posavac, S. S., Kardes, F. R., Schneider, M. J., &
Popovich, D. L. (2015). Toward a more nuanced understanding of
the statistical properties of a median split. Journal of Consumer
Psychology, 25(4), 652–665.
https://doi.org/10.1016/j.jcps.2014.12.002
10. Kim, H.-Y. (2017). Statistical notes for clinical researchers:
Chi-squared test and Fisher’s exact test. Restorative Dentistry &
Endodontics, 42(2), 152–155.
https://doi.org/10.5395/rde.2017.42.2.152
11. Knol, M. J., Algra, A., & Groenwold, R. H. H. (2012). How to
Deal with Measures of Association: A Short Guide for the
Clinician. Cerebrovascular Diseases, 33(2), 98–103.
https://doi.org/10.1159/000334180
12. Koletsi, D., & Pandis, N. (2016). The chi-square test for trend.
American Journal of Orthodontics and Dentofacial Orthopedics,
150(6), 1066–1067. https://doi.org/10.1016/j.ajodo.2016.10.001
13. Ludbrook, J. (2013). Analysing 2 × 2 contingency tables:
Which test is best? Clinical and Experimental Pharmacology &
Physiology, 40(3), 177–180. https://doi.org/10.1111/1440-
1681.12052
14. McClelland, G. H., Lynch, J. G., Irwin, J. R., Spiller, S. A., &
Fitzsimons, G. J. (2015). Median splits, Type II errors, and false–
positive consumer psychology: Don’t fight the power. Journal of
Consumer Psychology, 25(4), 679–689.
https://doi.org/10.1016/j.jcps.2015.05.006
15. McHugh, M. L. (2013). The chi-square test of independence.
Biochemia Medica, 23(2), 143–149.
https://doi.org/10.11613/bm.2013.018
16. Pagano, M., & Gauvreau, K. (2018). Principles of Biostatistics.
Chapman and Hall/CRC. https://www.routledge.com/Principles-of-
Biostatistics/Pagano-Gauvreau/p/book/9781138593145
17. Peacock, J., & Peacock, P. (2011). Oxford Handbook of
Medical Statistics (1st ed.). Oxford University Press, USA.
18. Mehta, C. R., & Patel, N. R. (2011). IBM SPSS Exact Tests.
https://www.sussex.ac.uk/its/pdfs/SPSS_Exact_Tests_19.pdf
19. Ranganathan, P., Aggarwal, R., & Pramesh, C. S. (2015).
Common pitfalls in statistical analysis: Odds versus risk.
Perspectives in Clinical Research, 6(4), 222–224.
https://doi.org/10.4103/2229-3485.167092
20. Riffenburgh, R. (2005). Statistics in Medicine (2nd ed.).
Academic Press.
21. Simpson, S. H. (2015). Creating a Data Analysis Plan: What
to Consider When Choosing Statistics for a Study. The Canadian
Journal of Hospital Pharmacy, 68(4), 311–317.
22. Swinscow, T. (1997). Statistics at Square One | The BMJ. The
BMJ | The BMJ: Leading General Medical Journal. Research.
Education. Comment. https://www.bmj.com/about-bmj/resources-
readers/publications/statistics-square-one
23. Szumilas, M. (2010). Explaining Odds Ratios. Journal of the
Canadian Academy of Child and Adolescent Psychiatry, 19(3),
227–229.
24. Tenny, S., & Hoffman, M. R. (2021a). Odds Ratio. In
StatPearls. StatPearls Publishing.
http://www.ncbi.nlm.nih.gov/books/NBK431098/
25. Tenny, S., & Hoffman, M. R. (2021b). Relative Risk. In
StatPearls. StatPearls Publishing.
http://www.ncbi.nlm.nih.gov/books/NBK430824/
26. Tripepi, G., Jager, K. J., Dekker, F. W., & Zoccali, C. (2010).
Stratification for Confounding – Part 1: The Mantel-Haenszel
Formula. Nephron Clinical Practice, 116(4), c317–c321.
https://doi.org/10.1159/000319590
27. Whitley, E., & Ball, J. (2002). Statistics review 3: Hypothesis
testing and P values. Critical Care, 6(3), 222–225.
Chapter 5
5. Statistical tests III - more than two-
group comparisons: ANOVA and non-
parametric tests
Arturo Tamayo & Denise S Schwartz
Introduction to Chapter 5
Suppose you are a sleep researcher and you have designed a study
to compare three treatments to increase the sleep duration in
patients with short-term insomnia. Patients will be randomly
assigned to three different groups: Group A (chamomile passion fruit
extract capsule), group B (melatonin capsule), and group C (placebo
capsule). Now, you have to decide on the statistical plan to assess
whether the groups differ. In a meeting with your new doctoral
students and postdoctoral fellow, you ask them what statistical test
they suggest. Harry Clay, a new doctoral student, wants to make a
good impression and suggests using the Student's t-test. Kathryna
Brownie, also a doctoral student, jumps in and points out that they
have not two groups but three, and that she learned in her statistics
class that a t-test would not be appropriate in this case.
One frequent challenge researchers face is comparing more than
two groups. These cannot be compared with multiple t-tests
because of the increased probability of a Type I error, that is, finding
a difference by chance alone (a false-positive result). In other words,
when multiple tests are run repeatedly in search of an "effect," you
make multiple pairwise comparisons. If you perform three pairwise
comparisons, you are testing three hypotheses, and if for each
inference you use an α level of 0.05 (i.e., 95% confidence), the
errors compound: instead of 95% confidence, you are left with
85.7%. This is sometimes called the cumulative Type I error, which
is approximately the alpha value multiplied by the number of
pairwise comparisons performed.
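The compounding above follows directly from the formula for the familywise error rate, 1 − (1 − α)^m for m independent comparisons; the 85.7% figure in the text is 100 × 0.95³:

```python
# Cumulative (familywise) Type I error for m independent comparisons at level alpha
alpha = 0.05

for m in (1, 3, 10):
    fwer = 1 - (1 - alpha) ** m
    print(f"{m} comparisons: familywise error = {fwer:.3f} "
          f"(confidence {100 * (1 - fwer):.1f}%)")
```

With ten comparisons the chance of at least one false positive already exceeds 40%, which is why an omnibus test such as ANOVA is preferred over repeated t-tests.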
This challenge created the need for a statistical test that could
detect differences between multiple group means while accounting
for multiple comparisons. Analysis of variance (ANOVA) can
be seen as an extension of the t-test, making it possible to compare
multiple groups (usually three or more groups), without inflating the
type I error (Sawyer, 2005).
ANOVA is considered an "omnibus" statistical test, that is, an
elaborate test or a test that comprises several items at once.
This test is designed for one continuous dependent (response or
outcome) variable measured across multiple groups, with one or
more independent categorical variables.
In this test, two concepts are identified: factors and levels.
Factors refer to the experimental variables (independent or
explanatory), which are categorical variables. The levels refer to the
groups assigned to each factor in this test.
Depending on the number of factors, ANOVA is described as:
- One-way ANOVA = 1 factor
- Two-way ANOVA = 2 factors
- Three-way ANOVA = 3 factors, and so on.
Therefore, we perform a one-way ANOVA (for one factor) or a
factorial ANOVA (for two or more factors). Examples of factors
include treatment group, gender, stage, or severity of the disease.
However, one important comment here is that the investigator
should first decide what the main hypothesis is and then choose the
appropriate statistical test. In other words, if an investigator has two
doses of drug A to test against placebo, there are two possible
pathways: (a) use an ANOVA with the three groups (drug A dose 1,
drug A dose 2, and placebo), or (b) consider where the higher
likelihood of an effect lies and run a study with drug A dose 1 vs.
placebo only (using a Student's t-test instead). The investigator
should remember that increasing the number of groups also
reduces statistical power.
Question 85: How does the ANOVA test detect
the difference between groups?
ANOVA tests the null hypothesis that the population means (levels)
are all equal. The alternative hypothesis states that at least one
mean is different (see Figure 5.1).
This can be expressed as:
H0: μ1 = μ2 = μ3 = … = μκ
Ha: at least one of the μi is different
H0 = null hypothesis
κ = number of groups
μ = group mean
Ha = alternative hypothesis
Figure 5.1: Graphic representation of null and alternative hypotheses for the
analysis of variance (ANOVA).
Although the null hypothesis states that all means are equal, this
does not mean that all sample means must be "exactly" the same;
rather, the values must be close enough (not spread out) that they
can be considered to come from the same population (Foltz, 2013).
While the t-test compares the distance between the two distributions
represented by their means, ANOVA assesses variability by using
the ratio of between-group variance to within-group variance (Kim,
2014).
However, this is more complex as we compare more than two
levels (i.e., more than two groups). When the one-way ANOVA
results are significant, the null hypothesis is rejected, indicating that
at least one pair differs (Pallant, 2010). Notwithstanding, this test will
not discriminate which specific group differs from the others;
therefore, a post hoc test is required (discussed later).
As explained in Chapter 1, variance can be defined as dispersion in
relation to the estimated mean, that is, a measure of how far each
point of the dataset lies from the center. For a sample, variance is
expressed as the sum of the squared deviations from the mean
divided by (n − 1). When referring to the population variance, you
will find this expressed in some statistical books as "σ²", as it
represents the square of the standard deviation (Pagano, 2010).
To illustrate the above:
σ² = Σ(X − μ)² / N
σ² = population variance
Σ = sum of…
X = each value
μ = population mean
N = number of values in the population
In the initial analysis, we are working with a sample; therefore, this
will be the "sample variance (s²)" rather than σ².
Using "n" in the denominator would give a biased estimate that
consistently underestimates the variability (a variance lower than
the real variance of the population). Dividing by n − 1 instead makes
the variance slightly larger, which mathematically compensates for
this bias. Therefore, we use "n − 1":
s² = Σ(X − x̄)² / (n − 1)
s² = sample variance; x̄ = sample mean; n = number of values in the
sample. See Question 86 for a practical example.
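The n versus n − 1 distinction is built into the Python standard library, which makes it easy to see the difference on real numbers; here we reuse the group A sleep hours from the example later in this chapter:

```python
import statistics

# Sleep hours for group A from the chapter's sleep study
data = [6, 5, 4, 4, 6, 4, 6, 7, 5]

biased = statistics.pvariance(data)   # divides by n (population formula)
unbiased = statistics.variance(data)  # divides by n - 1 (sample formula)

print(f"divide by n:     {biased:.3f}")
print(f"divide by n - 1: {unbiased:.3f}")  # slightly larger, correcting the bias
```

With only nine observations the two estimates differ noticeably; as n grows, the correction becomes negligible.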
Question 86: Based on Kathryna's observation
regarding the inability to use a t-test (see the
introduction), what statistical test will you use to
compare three groups in the sleep study?
See Figure 5.2 to understand which statistical test can compare
three groups.
Figure 5.2. Example of a parallel design with 1 factor and 3 levels.
In this case, a one-way ANOVA would be the option. There is a
single factor, or one independent variable (treatment), with three
categories or levels (in our case, the three treatment groups). The
outcome (dependent variable) is sleep duration measured in hours,
which is a continuous variable (Figure 5.3).
Figure 5.3: Example of one-way ANOVA components based on the parallel design
with 1 independent variable (1 factor, with 3 levels).
In ANOVA, significance is expressed through the F-test (or
F-statistic). In other words, it uses the variances within the groups
(the group that received chamomile + passion fruit extract, the one
that received melatonin, and the group that received placebo) and
compares them with the overall variance in the dependent variable.
Therefore, we obtain the variance within groups and the
variance between groups.
If the variance within groups (A), (B), and (C) is smaller than the
variance between the groups (A-B-C), the test will yield a high
F-value, which strongly supports that the result is real and not due
to chance.
Table 5.1: Example of format of data disposition and notation for
one-way analysis of variance (ANOVA).
Independent variable: Treatment (1 Factor, 3 Levels)
Group A (level 1) Group B (level 2) Group C (level 3)
Chamomile+passion
fruit extract capsule
Melatonin capsule Placebo capsule
x11 x21 x31
x12 x22 x32
x13 x23 x33
x14 x24 x34
x15 x25 x35
x16 x26 x36
x17 x27 x37
x18 x28 x38
x19 x29 x39
Treatment means:          x̄1                x̄2                x̄3
Treatment sum of squares: Σs(x1s − x̄1)²     Σs(x2s − x̄2)²     Σs(x3s − x̄3)²
s: subject; t: treatment; x̄: grand mean. Total sum of squares = ΣtΣs(xts − x̄)²
In your sleep study, the data would be:
Independent variable: Treatment (1 Factor, 3 Levels)
Group A (level 1) Group B (level 2) Group C (level 3)
Chamomile+passion
fruit extract capsule
Melatonin capsule Placebo capsule
6 8 2
5 7 3
4 7 4
4 6 3
6 9 5
4 6 2
6 7 3
7 7 4
5 8 5
Treatment means:          5.2               7.2               3.4
Treatment sum of squares: Σs(x1s − x̄1)² ≈ 9.6   Σs(x2s − x̄2)² ≈ 7.6   Σs(x3s − x̄3)² ≈ 10.2
s: subject; t: treatment; x̄: grand mean ≈ 5.3. Total sum of squares = ΣtΣs(xts − x̄)² ≈ 91.6
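The sleep-study data above can be analyzed with a one-way ANOVA in a few lines; a sketch with scipy's `f_oneway`:

```python
from scipy.stats import f_oneway

# Sleep duration (hours) for the three groups from the table above
group_a = [6, 5, 4, 4, 6, 4, 6, 7, 5]   # chamomile + passion fruit extract
group_b = [8, 7, 7, 6, 9, 6, 7, 7, 8]   # melatonin
group_c = [2, 3, 4, 3, 5, 2, 3, 4, 5]   # placebo

f_stat, p_value = f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_value:.6f}")
```

The resulting F of about 28.2 matches the STATA output discussed later in this chapter; because the test is omnibus, a significant result still requires a post hoc test to say which groups differ.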
Question 87: Are there any prerequisites for
ANOVA?
Before you start, look at your data. You should verify the critical
assumptions for ANOVA use (Sawyer, 2009), which are:
1. Continuous data. Ratio or interval measures, which can be
summarized or represented by means and standard
deviations (or variances), that is, obtained from a normally
distributed population.
2. Normally distributed data within each group or level. The
Shapiro-Wilk test can be used (Shapiro, 1965) if the groups
are relatively small (<50); consider D'Agostino's modification
for larger samples (D'Agostino, 1971). However, never forget
to display a histogram and determine how well it resembles a
"bell-shaped curve," as explained in Chapter 1.
3. Homogeneity of variance within each group: In other words,
the dispersion is similar at all levels. This equality can be
verified using Bartlett's test (NIST/SEMATECH 2013).
Statistical software can display Bartlett's test when performing
an ANOVA. Bartlett's null hypothesis is based on the
homogeneity of variances, that is,
H0: σ1² = σ2² = … = σk²
Ha: σi² ≠ σj² for at least one pair (i, j).
Therefore, you will expect a Bartlett's test p-value >0.05, or
not significantly different, that is, the test does not reject the
null hypothesis of homogeneity, which is what you want.
The Levene test (Levene, 1960) is an alternative to Bartlett's
test and is preferred when the samples deviate from normality,
as it is less sensitive to such deviations (NIST/SEMATECH,
2013).
4. Independent observations. This assumption applies to all
parametric tests: the value of each observation for each
subject is independent and not influenced by any other
observation. In repeated-measures ANOVA (as we will
discuss later), we can analyze dependent observations, that
is, repeated measures (longitudinal studies). There, a related
condition called "sphericity" applies: the variances of the
differences between all possible pairs of within-subject
conditions (levels of the independent variable) must be equal.
If sphericity does not hold, the F-test is distorted, which is
certainly a potential problem (Hinton, 2004). Mauchly's test of
sphericity is commonly used for this purpose; similar to the
homogeneity and normality tests, its null hypothesis states
that all variances of the differences are equal. If this test is
significant, the assumption of sphericity has been violated,
and the F-score calculation should not be used unless
corrective methods are applied. These methods are outside
the scope of this book.
Question 88: How can we understand the displayed results of
ANOVA? (sum of squares, degrees of freedom, mean squares,
and F-statistics test)
The format commonly displayed by statistical programs for a
one-way ANOVA (Riffenburgh, 2012; Jones, 2021) is presented in
Table 5.2.
Table 5.2 Display and notation for one-way ANOVA in statistical
programs.

Source    Sum of squares (SS)   Degrees of freedom (df)   Mean squares (MS)     F-test
Between   SSB                   κ − 1                     MSB = SSB / (κ − 1)   F = MSB / MSW
Within    SSW                   n − κ                     MSW = SSW / (n − κ)
Total     SSB + SSW             n − 1

Mean square error (MSE) = SS[residual] / df; root mean square error = √MSE.
SSB, sum of squares between; SSW, sum of squares within; MSB, mean square
between; MSW, mean square within; κ, number of groups; n, number of
observations.
In our case, the researcher noticed that in each group, sleep
duration varied from one day to another. How can you tell whether
the sleep hours changed significantly during the study period?
The answer is to calculate the sum of squares of the data.
The sum of squares (SS) is a measure of variation, or deviation,
from the mean.
However, simply summing each point's deviation from the mean
will not work, because the deviations are scattered throughout and
some of them are negative, so they cancel out. The way to solve
this problem is to square the deviations before adding them, which
removes the effect of the sign (−).
In other words, our researcher computed (x − x̄)² for each day and
summed them all into the total: Σ(x − x̄)².
We now have the SS. However, the value we are most interested
in is the mean square (MS), which represents the variance of the
level. As shown in the table above, the MS requires not only the
sum of squares for the level but also its df in a one-way ANOVA.
Question 89: Why do we need to display the
variation between the groups and within the
groups in ANOVA and what does that tell us?
This is a critical point: the ratio of the two gives you the F-test result (please refer to Table 5.2 for the ANOVA display format). The between-group variance expresses the variation among the samples as the sum of squares between (SSB). If the samples are similar, or close to similar, we expect a small SSB. The within-group variance is the variation due to the differences within each level and is displayed as SSW.
If the within-group variance (MSW) is much smaller than the between-group variance (MSB), you will obtain a high F-test result, which suggests that the difference is real (not due to chance).
You can think intuitively about this: suppose you have three
groups of treatment, as discussed above, Group A (chamomile +
passion fruit extract capsule), group B (melatonin capsule), and
group C (placebo capsule). Now, if I tell you the variance between
groups A, B, and C is much higher than within each group, what
does it tell you? It tells you that the difference between groups is
likely not a random variation but an effect of the treatment.
Table 5.3: STATA output for one-way ANOVA (sleep study).
The Stata output shows that the within-groups variance (27.3) is smaller than the between-groups variance (64.29), resulting in a high F-test (28.23), which is statistically significant (Prob > F = 0.000). Bartlett's test was not significant (p = 0.91), implying homogeneity of variances.
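Output of this kind can be reproduced with `scipy.stats.f_oneway` and `scipy.stats.bartlett`; the data below are simulated under assumed group means, not the study's results:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Simulated hours of sleep -- the group means are assumptions for illustration
group_a = rng.normal(7.0, 0.8, 30)   # chamomile + passion fruit
group_b = rng.normal(7.8, 0.8, 30)   # melatonin
group_c = rng.normal(6.2, 0.8, 30)   # placebo

# One-way ANOVA: F statistic and its p-value
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
# Bartlett's test for homogeneity of variances across the groups
bart_stat, bart_p = stats.bartlett(group_a, group_b, group_c)

print(f"F = {f_stat:.2f}, Prob > F = {p_value:.4f}")
print(f"Bartlett p = {bart_p:.2f}")
```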
Question 90: What is the F-Test?
The F-test is named after the statistician Sir Ronald Fisher, who described it. The F-test, or F-statistic, uses the F-distribution, which is a continuous probability distribution. It is a ratio of intergroup to intragroup variability (Foltz, 2013). Because it is a ratio of two scaled variances, it follows the Fisher (F) distribution, which arises from the ratio of two chi-square variables divided by their degrees of freedom. Look at the display features of the ANOVA table and figure 5.4. Finally, the result of the test supports or rejects H0 (Fig. 5.5). It is important to remember that the F-statistic is a ratio of two variances and therefore has two degrees of freedom:
df1 and df2 (df1/df2)
df1= κ-1.
df2= N-κ.
κ= number of comparison groups.
N=total number of observations.
Figure 5.4. Components of the F-statistics calculation.
Figure 5.5. Schematic representation of the F-statistics decision.
The F distribution does not take negative values. It has a right-skewed distribution (Fig. 5.6). Thus, the rejection region is always in the upper tail (right of the distribution). The F value (α=0.05) in the graph is the critical value to which the calculated value is compared.
In general, statistics books have an appendix containing
critical values for the F test when α=0.05. It is indexed by df1 and df2
(Sullivan, 2018).
Figure 5.6. F-distribution and representation of α=0.05.
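Instead of looking up an appendix table, the critical value can be obtained from the F distribution's quantile function; a sketch with hypothetical degrees of freedom:

```python
from scipy import stats

k, n = 3, 90          # hypothetical: 3 comparison groups, 90 total observations
df1, df2 = k - 1, n - k

# Critical value: the point leaving 5% of the area in the upper (right) tail
f_crit = stats.f.ppf(0.95, df1, df2)
print(f"F critical (alpha=0.05, df1={df1}, df2={df2}) = {f_crit:.2f}")

# Decision rule: reject H0 if the calculated F exceeds f_crit
```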
In your sleep study, you found a high (significant) F-test value and
you know that there are differences (at least in one pair) of the study
groups A, B, and/or C. However, ANOVA cannot provide this
information. What kind of tests would you perform to figure this out
(see question 91)?
Question 91: What is a post hoc test and why is it
important? In which circumstances is Bonferroni
better than Tukey’s test?
One important aspect of clinical trials is designing the statistical approach a priori. The ANOVA test aims to reject or support H0, which states that the means of groups A, B, and C are all equal. Once H0 is rejected, the test indicates that at least one pair of means is not equal. Therefore, multiple comparison procedures are needed to determine which groups are unequal.
Do Groups A and C differ or do groups A and B differ? or do
groups B and C differ?
Multiple other comparisons can be performed, such as time (days 3 and 7 among groups) and subgroup analyses. However, these comparisons do not, by themselves, control for the increased risk of Type I error (rejecting the null hypothesis when it is actually true) (Fregni, 2018).
Post hoc tests are statistical strategies to reduce the rate of
false-positive results when analyzing a multiple comparison analysis.
However, there is a cost, which is the reduction in statistical power.
Bonferroni correction is the most conservative test that avoids Type I
errors by modifying the level of significance α.
In our sleeping case, you perform three pairwise
comparisons; therefore, it will be α/n= 0.05/3. You will need to set the
new alpha level to 0.017.
The problem then becomes overcorrection of Type I error (false positives), at the cost of a higher risk of Type II error (false negatives). Tukey's test (honestly significant difference test), on the other hand, compares every group mean with every other group mean, making it very effective for pairwise comparisons.
Therefore, there is no single answer to this question; all
multiple comparison methods have advantages and disadvantages
(Midway, 2020).
Table 5.4 Post Hoc test (Bonferroni) from one-way ANOVA STATA
output (sleep study).
Table 5.4 shows the post-hoc test results. Considering a P-value of 0.01 (adjusted), all pairwise comparisons are different.
However, other post-hoc tests are available. We describe the other commonly used tests in Table 5.5 below, along with the STATA output using these other tests (Table 5.6).
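The Bonferroni procedure can be sketched as pairwise t-tests judged against the adjusted alpha (simulated data; the group means are assumptions for illustration):

```python
from itertools import combinations
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
data = {
    "A": rng.normal(7.0, 0.8, 30),   # hypothetical sleep hours per group
    "B": rng.normal(7.8, 0.8, 30),
    "C": rng.normal(6.2, 0.8, 30),
}

pairs = list(combinations(data, 2))   # A-B, A-C, B-C
alpha_adj = 0.05 / len(pairs)         # Bonferroni: 0.05 / 3 ≈ 0.017

for g1, g2 in pairs:
    t, p = stats.ttest_ind(data[g1], data[g2])
    verdict = "significant" if p < alpha_adj else "not significant"
    print(f"{g1} vs {g2}: p = {p:.4f} ({verdict} at alpha = {alpha_adj:.3f})")
```

Statistical packages often report Bonferroni-adjusted p-values (each raw p multiplied by the number of comparisons) instead, which is equivalent to lowering alpha.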
Table 5.5. Post-hoc tests
Post hoc test Characteristics
Scheffé A more versatile multiple comparison test; it can be used with groups of different sizes and allows a large number of comparisons. However, the increased number of comparisons generates wider CIs, making it a less powerful method for detecting differences between groups.
Tukey Describes each multiple comparison as the difference of means with its 95% CI between groups. If the interval includes 0, there is no statistical difference between the groups; if the interval does not include 0, there is a difference.
Dunnett Compares the mean of each group with a single control group. As a consequence, it generates fewer group comparisons, so the CIs generated by Dunnett's test are narrower and the test has more power to detect a difference.
Holm Starts from a set of p-values and determines which of these should be considered significant based on their rank. This is a powerful test, but no CIs are reported. Holm-Sidak is a modified version that is more powerful than the Holm procedure.
(Adapted from Motulsky, 4th Edition; Staffa, 2020)
Table 5.6. Post-hoc tests comparison in STATA output (sleep study).
In this case, the results are very similar among the different tests.
Question 92: What is the difference between one-way and two-way ANOVA? What are their advantages?
Going back to our original case: imagine the same researcher investigating hours of sleep in group A (chamomile and passion fruit extract capsule), group B (melatonin), and group C (placebo). Now, you would also like to determine whether gender contributes to the effect and whether there is any interaction. In other words, the effects of two simultaneous factors are considered (figure 5.7).
In our first example, there was only one factor (treatment) with
three levels (A, B, and C) on the dependent variable (outcome: sleep
duration); therefore, one-way ANOVA was used to compare the
group means (refer to figures 5.2 and 5.3).
In this case, two-way ANOVA explores the effect of two
factors (treatment and gender) on the dependent variable (duration
of sleep), and the interaction of factors, that is, whether there is a
change in the effect of one factor (ex. treatment) at different levels of
the other factor (gender) (figure 5.7 and 5.8).
Figure 5.7. Example of a parallel group design with two factors (treatment and
gender).
Figure 5.8 Example of two-way ANOVA components based on the parallel group
design with two independent variables (treatment and gender) and one continuous
dependent variable (duration of sleep).
Table 5.7: Example of data disposition for two-way ANOVA.

                             Factor 1: Treatment (3 levels)
                             A                      B                      C
Factor 2: Gender   Female    Female + Treatment A   Female + Treatment B   Female + Treatment C
(2 levels)         Male      Male + Treatment A     Male + Treatment B     Male + Treatment C
In the one-way ANOVA, one independent variable with three levels
was analyzed. In the two-way ANOVA, there were two independent
categorical variables (treatment and gender) that might affect the
dependent variable (hours of sleep).
When using one- or two-way ANOVA, the assumptions mentioned above should be fulfilled.
One possible advantage of this method is that we are not only
testing the “main effect” for each independent variable but also the
interactions between factors.
Therefore, in this second scenario there are 3 null
hypotheses:
1- Treatment has no effect on sleep duration, controlling for gender.
2- Gender has no effect on sleep duration, controlling for treatment.
3- The effect of treatment on sleep duration is the same regardless
of gender.
The first and second null hypotheses refer to two main effects, and
the third hypothesis refers to the interaction between treatment and
gender (the two main effects). If there is an interaction, the effect of
treatment is different for different genders.
In addition, like the one-way ANOVA, it removes random variability (variation due to chance), as mentioned previously. The two-way method also shares a disadvantage with the one-way ANOVA: a test of means cannot tell you which group is the different one. For this, you need to turn to post-hoc tests (Pallant, 2010).
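For a balanced design, the sums of squares behind these three hypotheses can be computed by hand. A sketch with hypothetical cell data (three replicates per treatment-by-gender cell):

```python
import numpy as np

# cells[i][j]: replicates for treatment i (A, B, C) x gender j (female, male)
# All values are hypothetical sleep hours, for illustration only.
cells = np.array([
    [[7.2, 6.9, 7.1], [6.8, 7.0, 6.7]],   # treatment A: female, male
    [[8.0, 7.8, 8.1], [7.6, 7.9, 7.7]],   # treatment B
    [[6.1, 5.9, 6.2], [6.0, 6.3, 5.8]],   # treatment C
])
a, b, r = cells.shape                      # 3 treatments, 2 genders, 3 replicates
grand = cells.mean()

row_means = cells.mean(axis=(1, 2))        # treatment (factor 1) means
col_means = cells.mean(axis=(0, 2))        # gender (factor 2) means
cell_means = cells.mean(axis=2)

ss_treat = b * r * ((row_means - grand) ** 2).sum()
ss_gender = a * r * ((col_means - grand) ** 2).sum()
# Interaction: what remains in the cell means after both main effects
ss_inter = r * ((cell_means
                 - row_means[:, None]
                 - col_means[None, :]
                 + grand) ** 2).sum()
ss_error = ((cells - cell_means[:, :, None]) ** 2).sum()

ms_error = ss_error / (a * b * (r - 1))
f_treat = (ss_treat / (a - 1)) / ms_error
f_gender = (ss_gender / (b - 1)) / ms_error
f_inter = (ss_inter / ((a - 1) * (b - 1))) / ms_error
print(f"F treatment = {f_treat:.1f}, F gender = {f_gender:.1f}, "
      f"F interaction = {f_inter:.2f}")
```

Each F is tested against its own pair of degrees of freedom, one per null hypothesis above.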
As stated previously, other factors may be considered, and
therefore, a three-way ANOVA or higher could be considered.
However, this interpretation is complicated. To consider other
independent variables, other possible statistical tests should be
considered, such as multivariate data analysis, which are beyond the
scope of this book.
Question 93: What is the repeated-measures
ANOVA? Is there a one-way or two-way repeated
measures ANOVA?
When the outcome is measured repeatedly over time (longitudinal
study) and compared among the treatment groups, specific statistical
strategies should be used. Schober and Vetter (2018) call attention
to the possibility of biased estimates, and consequently invalid
estimated confidence intervals (and P values) if the within-subject
inter-measurement correlation is not considered.
Considering the sleep disorder study, this test would be applied, for example, if the question was to compare drug A (chamomile and fruit extracts), drug B (melatonin), and drug C (placebo) over 3 weeks (weeks 1, 2, and 3), instead of only once. In this case, there are two factors (factor 1: treatment, with three levels; factor 2: time, with three levels) (figures 5.9 and 5.10).
In this case, although the groups are independent, the measurements within the groups are dependent, as each individual has sleep duration measured every week for three weeks (multiple time points). Hence, there will be within-participant measurement variability, which is smaller than the variability between subjects. This increases the precision of the estimated effect and must be accounted for in the analysis. For this reason, the correct choice would be a two-way repeated measures ANOVA, with two independent variables (treatment, and time as the repeated-measures factor) and one continuous dependent variable (duration of sleep in hours).
Figure 5.9. Example of a parallel group design with two factors (treatment and
time-repeated measurements).
Figure 5.10. Example of two-way ANOVA components based on the parallel group
design with two independent variables (treatment and time) and one continuous
dependent variable (duration of sleep).
Table 5.8. Example of data disposition for two-way repeated measures ANOVA.

                             Factor 1: Treatment groups (3 levels)
                             A                      B                      C
Factor 2: time     Week 1    Week 1 + Treatment A   Week 1 + Treatment B   Week 1 + Treatment C
(3 levels)         Week 2    Week 2 + Treatment A   Week 2 + Treatment B   Week 2 + Treatment C
                   Week 3    Week 3 + Treatment A   Week 3 + Treatment B   Week 3 + Treatment C
A one-way repeated measures ANOVA would be used if you wanted to assess one of the groups (for example, the chamomile-passiflora group) over time, without comparing it to another group. In
this case, the question would be whether there is a change in sleep
duration over three weeks, using chamomile-passiflora capsules.
There is only one factor or independent variable (time points: week
1, week 2, and week 3) with dependent measurements (multiple time
point measures), and one continuous dependent variable(sleep
duration in hours) (see below figures 5.11 and 5.12; Table 5.9).
Figure 5.11 Example of one group design treated with chamomile-passiflora
capsules, and the outcome (sleep duration) measured in multiple time points (one
factor: one group measured in three time points – repeated measurements) and
one continuous dependent variable (sleep duration – hours).
Figure 5.12 Example of one-way repeated measures ANOVA components based
on the parallel group design with 1 independent variable (time points: repeated
measures over time) and 1 continuous dependent variable (duration of sleep in
hours).
Table 5.9. Example of format of data disposition and notation for
one-way repeated measures ANOVA.
Independent variable: time points (1 factor, 3 levels)
One factor:
Chamomile +
passiflorine capsule
Week 1 (level 1) Week 2 (level 2) Week 3 (level 3)
X11 X21 X31
X12 X22 X32
X13 X23 X33
X14 X24 X34
X15 X25 X35
X16 X26 X36
X17 X27 X37
X18 X28 X38
X19 X29 X39
After this discussion, we hope it is clear that when there are repeated measures, you need to use the appropriate ANOVA model and indicate in your statistical software which factor is the repeated-measures factor, so that the program provides accurate estimates and p-values.
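For the single-group scenario above, the one-way repeated measures ANOVA partitions total variability into time, subjects, and error, removing between-subject variability from the error term. A hand-computed sketch with hypothetical data:

```python
import numpy as np

# Rows = subjects, columns = weeks 1..3 (hypothetical sleep hours)
data = np.array([
    [6.5, 7.0, 7.4],
    [6.8, 7.2, 7.6],
    [6.2, 6.9, 7.1],
    [6.9, 7.3, 7.8],
    [6.4, 6.8, 7.2],
])
n_subj, k = data.shape
grand = data.mean()

ss_time = n_subj * ((data.mean(axis=0) - grand) ** 2).sum()
# Between-subject variability, removed from the error term
ss_subj = k * ((data.mean(axis=1) - grand) ** 2).sum()
ss_total = ((data - grand) ** 2).sum()
ss_error = ss_total - ss_time - ss_subj

df_time, df_error = k - 1, (k - 1) * (n_subj - 1)
f_time = (ss_time / df_time) / (ss_error / df_error)
print(f"F(time) = {f_time:.1f} with df = ({df_time}, {df_error})")
```

Because subject-to-subject differences go into ss_subj rather than ss_error, the denominator shrinks and precision increases, which is exactly the advantage described above (assuming sphericity holds).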
Question 94: Is it possible to use ANOVA in a
factorial design or should we use MANOVA?
It is possible to use ANOVA in factorial design trials; this is called "factorial ANOVA." This test involves two or more independent variables (factors) that split the sample into four or more groups (levels).
In factorial ANOVA, one has two or more independent
variables and one dependent variable or outcome, which will allow
us to understand the effect of each independent factor and the
combined effect (which would be the interaction term) on the
outcome. This is a very special aspect of factorial ANOVA; however,
it adds complexity to the interpretation of the results. Going back to
the sleep study, if the research team wants to evaluate the effect of
chamomile and fruit extracts (group A) versus melatonin (Group B)
versus chamomile and fruit extracts + melatonin (Group C) versus
placebo (Group D), the correct statistical approach would be a
factorial ANOVA.
Multivariate analysis of variance (MANOVA) is perhaps the most complex design, as it combines not only two or more independent variables but also two or more dependent variables. It is not frequently used, because the results are difficult to interpret unless the dependent variables are very similar.
Table 5.10. Combinations of dependent and independent variables and choice of the most adequate ANOVA type for data analysis.

                                     Independent variable (one)   Independent variables (two or more)
Dependent variable (one)             One-way ANOVA                Two-way ANOVA / Factorial ANOVA
Dependent variables (two or more)                                 MANOVA
One important assumption in MANOVA is that both dependent
variables need to be related to each other. MANOVA is an extremely
complex test. Using the sleep study as an example, the research
team could be interested in analyzing whether gender and the
treatment group would influence the mean duration of sleep and the
Pittsburgh Sleep Quality Index in the same analysis. In this scenario,
MANOVA is the correct statistical option for testing the hypothesis.
Because of multiple comparisons, inflating Type I error is a
risk; therefore, a Bonferroni test (for adjustment of alpha) is
recommended. Some statisticians recommend increasing the
sample size to improve the power of the test (Pallant, 2010).
In summary, a MANOVA is used when similar outcomes are measured; for instance, motor function assessed with 4 different outcome measures. However, again, this is not usual, and the best option may be to choose a main outcome and consider the others (among the similar dependent variables) as secondary outcomes.
Question 95: What is ANCOVA and when can it
be used?
Analysis of covariance (ANCOVA) is a statistical tool that merges analysis of variance (ANOVA) with covariate adjustment (linear regression) in the same model, with the goal of increasing statistical power and producing comparable groups. In this case, you have two independent variables, one categorical and one continuous, and the outcome. The continuous independent variable is the covariate. The test accounts for the effect of the covariate, together with the treatment, on the outcome (Figure 5.13) (Owen, 1998). Interestingly, when Sir Ronald Fisher developed ANCOVA, it was intended for randomly assigned, prospective experiments, and the objective was to enhance the precision of the statistical analysis. The assumptions for
ANCOVA are similar to the ANOVA test, but an important one is that
the covariate is completely unrelated to the categorical independent
variable, in general, the treatment group. For randomized clinical
trials, this is generally the case because the treatment is randomly
assigned. However, for retrospective and non-experimental studies,
where the treatment allocation or exposure is not at random, the use
of ANCOVA could be biased because the covariate could be related
to the treatment. (Van Breukelen, 2006)
In the sleep study, you sought to determine the effect of the three levels of the study intervention (independent variable) on the outcome, while accounting for the effect of the covariate (daily cups of coffee). If this were a retrospective study, the treatment used by the patients could influence the number of cups of coffee, and the analysis would be biased.
Figure 5.13 Schematic representation of analysis of covariance (ANCOVA).
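ANCOVA's treatment test can be sketched as a comparison of two least-squares models, one with and one without the treatment terms, keeping the covariate (daily cups of coffee) in both. The data and effect sizes below are simulated assumptions, not the book's results:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30                                            # per group, hypothetical
coffee = rng.integers(0, 5, 3 * n).astype(float)  # covariate: daily cups
group = np.repeat([0, 1, 2], n)                   # groups A, B, C
effect = np.array([0.8, 1.4, 0.0])[group]         # assumed treatment effects
sleep = 7.0 + effect - 0.3 * coffee + rng.normal(0, 0.6, 3 * n)

def rss(X, y):
    """Residual sum of squares from an ordinary least-squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return ((y - X @ beta) ** 2).sum()

ones = np.ones_like(sleep)
dummies = np.column_stack([group == 1, group == 2]).astype(float)

# Full model: intercept + 2 treatment dummies + covariate
rss_full = rss(np.column_stack([ones, dummies, coffee]), sleep)
# Reduced model: intercept + covariate only
rss_reduced = rss(np.column_stack([ones, coffee]), sleep)

df_num, df_den = 2, 3 * n - 4      # 2 treatment dummies; 4 params in full model
f_treat = ((rss_reduced - rss_full) / df_num) / (rss_full / df_den)
print(f"F for treatment, adjusted for coffee = {f_treat:.1f}")
```

Adjusting for the covariate reduces the residual variance, which is the source of ANCOVA's gain in power when the covariate is unrelated to treatment assignment.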
Question 96: What happens if my data does not
fulfill all the assumptions required for ANOVA?
If the assumptions for a one-way ANOVA are not met, the option is to perform the Kruskal-Wallis H-test, also known as the Kruskal-Wallis test, Kruskal-Wallis test by ranks, or one-way ANOVA by ranks.
The Kruskal-Wallis test is a nonparametric test (that is, it can be used with non-symmetrically distributed, rank-based data). It follows a similar principle, assessing differences in a continuous or ordinal dependent variable across a categorical independent variable with more than two independent groups.
Assumptions:
1. The dependent variable should be ordinal or continuous (measured on an interval or ratio scale). In our sleep study, for example, sleep can be rated as much worse, worse, the same, better, or much better.
2. The independent variable consists of two or more categorical independent groups.
3. Independence of observations: each group is independent; that is, a participant in group A cannot also be in group B or C.
4. The variability of the scores in each group should be similar.
Remember, as with other rank-based non-parametric tests, we compare the mean ranks of the dependent variable.
In the sleep study, groups were as follows: group A (chamomile +
passion fruit extract capsule), group B (melatonin capsule), and
group C (placebo capsule).
The information that your statistical program will display must
include:
a) Chi-square value
b) df
c) Significance level (often shown as Asymp. Sig.), usually judged against 0.05.
If this is significant, then at least two groups' ranks are not the same;
therefore, you can look at the mean rank for the three groups and
see which had the highest overall ranking that corresponds to the
highest score on your continuous variable, which will identify the
highest, intermediate, and lowest scores. However, you still do not know which groups are statistically significantly different from one another (Riffenburgh, 2012; Pallant, 2010).
In your sleep study, you included 60 patients (20 per group): drug A (chamomile and fruit extracts), drug B (melatonin), and drug C (placebo).
Note: These results are just figurative for teaching purposes and do
not reflect the results of a real data set.
Table 5.11. Example of data used as an ordinal variable.

Sleep quality (ordinal)    N     Mean Rank
Drug A                     20    35.20
Drug B                     20    31.83
Drug C                     20    19.35
Total                      60
The results displayed include the H-test statistic for the grouping variable (drug treatment groups). For example:

Quality of sleep
Chi-square: 8.360
df: 2
Asymp. Sig.: .02
Therefore, you do have a statistically significant result, so you know
there is a difference in at least one group, but still do not know which
group it is.
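An analysis of this kind can be run with `scipy.stats.kruskal`; the ordinal scores below are illustrative, not the study data:

```python
from scipy import stats

# Hypothetical ordinal sleep-quality scores (1 = much worse ... 5 = much better)
drug_a = [4, 5, 4, 3, 5, 4, 4, 3, 5, 4]
drug_b = [4, 3, 4, 4, 3, 5, 3, 4, 4, 3]
drug_c = [2, 3, 2, 1, 3, 2, 2, 3, 1, 2]

# Kruskal-Wallis H-test: compares mean ranks across the three groups
h_stat, p_value = stats.kruskal(drug_a, drug_b, drug_c)
print(f"H = {h_stat:.2f}, df = 2, p = {p_value:.4f}")
```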
Question 97: Is there any post-hoc analysis in the
Kruskal-Wallis test?
The Mann-Whitney U test between pairs of groups is an option; however, you risk inflating the Type I error because of the multiple comparisons, so you will need to apply a Bonferroni adjustment to the alpha value, as mentioned above. You can then compare group A with B, group A with C, and group B with C (Pallant, 2010).
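That follow-up can be sketched with `scipy.stats.mannwhitneyu` and a Bonferroni-adjusted alpha (hypothetical scores):

```python
from itertools import combinations
from scipy import stats

scores = {
    "A": [4, 5, 4, 3, 5, 4, 4, 3, 5, 4],   # hypothetical ordinal data
    "B": [4, 3, 4, 4, 3, 5, 3, 4, 4, 3],
    "C": [2, 3, 2, 1, 3, 2, 2, 3, 1, 2],
}
pairs = list(combinations(scores, 2))      # A-B, A-C, B-C
alpha_adj = 0.05 / len(pairs)              # Bonferroni: 0.05 / 3

for g1, g2 in pairs:
    u, p = stats.mannwhitneyu(scores[g1], scores[g2], alternative="two-sided")
    verdict = "significant" if p < alpha_adj else "not significant"
    print(f"{g1} vs {g2}: U = {u}, p = {p:.4f} ({verdict})")
```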
Question 98: Is there a nonparametric version of
a two-way ANOVA?
Yes. The Friedman test, also called the Friedman two-way ANOVA by ranks, is a non-parametric test analogous to repeated measures ANOVA. The principle of this test is essentially the same as the two-way ANOVA, that is, two factors with two or more groups (levels), but based on ranks. One factor is treatment (groups), and the other factor is time (the repeated measure). It is used for variables measured on an ordinal scale, or for continuous data (interval or ratio) that do not meet the assumptions of normal distribution and homogeneity of variances (Currell, 2020; Sheldon et al., 1996).
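With `scipy.stats.friedmanchisquare`, each time point is passed as one sample for the same subjects; the measurements below are hypothetical:

```python
from scipy import stats

# The same 6 subjects measured at weeks 1, 2, and 3 (hypothetical sleep hours)
week1 = [6.5, 6.8, 6.2, 6.9, 6.4, 6.6]
week2 = [7.0, 7.2, 6.9, 7.3, 6.8, 7.1]
week3 = [7.4, 7.6, 7.1, 7.8, 7.2, 7.5]

# Friedman test: ranks the three measurements within each subject
chi2, p_value = stats.friedmanchisquare(week1, week2, week3)
print(f"Friedman chi-square = {chi2:.2f}, p = {p_value:.4f}")
```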
Conclusions for Chapter 5
ANOVA is a powerful and useful statistical tool for clinical research.
As explained in this chapter, ANOVA associated with post-hoc tests
allows multiple group comparisons, without increasing the type I
error. One-way ANOVA is the simplest statistical strategy for
evaluating more than two groups. More sophisticated strategies can
be applied to this research question. However, when a more
complex statistical plan is added, interpretation of the test results
may become challenging.
In the end, the analysis relies completely on your research question, study design, and a priori statistical plan. Multiple statistical tests have been created to improve post-hoc analysis. The fact of the matter is that there is no single answer as to which one is better; rather, each has its own characteristics, which should be taken into consideration before choosing among these tests.
Modern statistics have created powerful tools for exploring
causal effects among multiple variables through methods of
modeling (regression). We will discuss regression models in our
book, but ANOVA continues to be an important test for broad
questions when multiple scenarios need to be evaluated
simultaneously.
If the data do not fit a parametric test, non-parametric tests can always be used to investigate whether there is a difference between the groups. For one-way and repeated measures ANOVA, the Kruskal-Wallis and Friedman tests are their respective non-parametric alternatives.
References for Chapter 5
1. Bland, M. (2015) Summarizing Data. In an introduction to
medical statistics. Oxford Ed. Fourth Ed.
2. Currell, G. & Sowman, A. Kruskal-Wallis & Friedman Tests. In
Essential Mathematics and Statistics for Science. Chapter 12.
Willey-Blackwell Ed. 2020.
3. D’Agnostino, R.B. (1971) An omnibus test for normality of
moderate and large size samples. Biometrika; 58:341-348.
4. Foltz, B. (2013, April 27). Statistics 101: ANOVA, A Visual
Introduction [Video]. YouTube.
https://www.youtube.com/watch?
v=0Vj2V2qRU10&feature=youtu.be
5. Fregni F, Illigens BM. (2018) Other issues in statistics. In
Critical thinking in clinical research. Oxford Ed.
6. Hinton, P.R.; Brownlow, C.; McMurray, I. (2004). SPSS
Explained. Routledge.
7. Jones, J. (2021) One-way ANOVA. In Concepts of statistics.
Richland Community College. Concepts of Statistics. Lecture
13. www.people.richland.edu
8. Kim HY. Analysis of variance (ANOVA) comparing means of
more than two groups. Restor Dent Endod. 2014
Feb;39(1):74-77. https://doi.org/10.5395/rde.2014.39.1.74
9. Levene, H. (1960) Robust test for the equality of variance test
for normality. In Olkin I, ed. Contributions to Probability and
Statistics: Essays in Honor of Harold Hotelling. Palo Alto:
Stanford University Press
10. Midway, S.; Robertson, M.; Flinn, S.; Kaller, M.
Comparing multiple comparisons: practical guidance for
choosing the best multiple comparison test. PeerJ. 2020.
11. NIST/SEMATECH e-Handbook of Statistical Methods,
from http://www.itl.nist.gov/div898/handbook/.
https://doi.org/10.18434/M32189, 2013. Access date: June
26, 2021.
12. Pagano M, and Gauvreau K. One-way analysis of
variance. In Principles of Biostatistics 2nd Ed. Duxbury
Thomson Learning eds. 2000.
13. Pallant Julie. One-way analysis of variance. In SPSS
Survival Manual. A step by step guide to data analysis using
SPSS. McGraw-Hill. 2010.
14. Riffenburgh RH. Tests on means of continuous data.
In statistics in Medicine. Academic Press. 3rd Ed. 2012.
15. Schober, P., & Vetter, T. R. (2018). Repeated Measures Designs and Analysis of Longitudinal Data: If at First You Do Not Succeed, Try, Try Again. Anesthesia & Analgesia, 127(2), 569-575. https://doi.org/10.1213/ane.0000000000003511
16. Shapiro SS, Wilk MB. An analysis of variance test for
normality (complete samples). Biometrika 1965;52:591-611.
17. Sheldon, M. R., Fillyaw, M. J., & Thompson, W. D.
(1996). The use and interpretation of the Friedman test in the
analysis of ordinal-scale data in repeated measures designs.
Physiotherapy research international : the journal for
researchers and clinicians in physical therapy, 1(4), 221–228.
https://doi.org/10.1002/pri.66
18. Staffa, S. J., & Zurakowski, D. (2020). Strategies in
adjusting for multiple comparisons: A primer for pediatric
surgeons. Journal of Pediatric Surgery, 55(9), 1699–1705.
https://doi.org/10.1016/j.jpedsurg.2020.01.003
19. Steven F. Sawyer (2009) Analysis of Variance: The
Fundamental Concepts, Journal of Manual & Manipulative
Therapy, 17:2, 27E-38E, DOI: 10.1179/jmt.2009.17.2.27E
20. Sullivan, L. Hypothesis testing. Retrieved June, 2nd
2021, from https://sphweb.bumc.bu.edu/otlt/mph-
modules/bs/bs704_hypothesistest-means-proportions/.
Chapter 6
6. Statistical tests IV - Strength of
association between variables: Use of
correlation coefficients
Arturo Tamayo & Denise S Schwartz
Introduction to Chapter 6
A very important statistical method for summarizing the strength of the relationship between two continuous variables is correlation (Taylor, 1990).
In other words, when analyzing continuous data, we want to determine whether changes in variable A accompany changes in variable B (see figure 6.1). These relationships can be between two independent variables, or between an independent and a dependent variable.
If variable B increases proportionately as variable A
increases, then one can say that there is a positive correlation
between A and B, and your correlation should be expressed with the
“+” sign.
If variable B decreases as variable A increases, then your
correlation is negative “- sign”.
Finally, if we find no relationship between two variables, that is, neither changes as the other increases or decreases, they can be considered uncorrelated (Hinton, 2014).
Figure 6.1. Examples of types of correlation. There are three possible results of a
correlational study: a positive correlation, a negative correlation, and no
correlation. Variable A is on the vertical axis, also called y-axis, and variable B is
on the horizontal axis, also called x-axis.
The statistical display of this relationship strength needs to be
measured quantitatively, and this is achieved through correlation
coefficients.
The Pearson product-moment correlation coefficient (or r coefficient) takes values from -1 to +1. Please note that the direction sign precedes the value and tells you whether the correlation is positive or negative. The closer the value is to 1 (or -1), the stronger the correlation. Conversely, the closer r is to 0 (zero), the weaker the correlation, implying little to no relationship.
Our sleep researcher is now interested in determining
whether there is a correlation between the number of cups of coffee
during the day and hours of sleep. How can you display your results
graphically?
Question 99: Is the use of scatter plots the only
way to graphically display correlation?
Yes, and it is highly recommended to display the correlation through a scatter-plot figure if the result supports your hypothesis. A scatter plot, or "scatter diagram," is a representation of the observations as dots: each dot marks the intersection of a pair of observations of the two variables (Portney, 2009).
As shown in Figure 6.1, the relationship expressed by these dots follows a linear correlation.
Question 100: If my correlation is non-linear,
what does that mean?
This means that the observations of the two variables are not linearly proportional and instead follow a curvilinear form (the rate of change is not constant). The implication of this finding is very important, as Pearson's correlation coefficient is a measure of the strength of linear relationships and would not detect non-linear patterns (Sullivan, 2018).
In other words, rp in curvilinear correlations can come out as "zero," even though there could be some relationship. This further supports the importance of displaying your correlation graphically and analyzing the relationship patterns you found.
Figure 6.2. Example of a non-linear (curvilinear) relationship.
Question 101: Suppose that there is a negative correlation between cups of coffee during the day and daily hours of sleep. The analysis results showed that rp = -0.75. What kind of association strength is this? Is there a rule to determine, in numbers, whether a correlation is strong or weak?
No, there is not. That is, there are no fixed numbers above which you can say a correlation is strong or weak; the interpretation may change with sample size, measurement error, and the types of variables being studied. However, a "practical" rule of thumb, applied to the magnitude of r, is as follows (Portney, 2009):
0.00 to .25: Weak or no relationship
0.25 to .50: Fair relationship
0.50 to .75: Moderate to good relationship
>0.75: Good to excellent relationship
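As a convenience, this rule of thumb could be coded as a small helper; the cutoffs are Portney's, and applying them to the absolute value of r (since the sign only indicates direction) is our assumption for the sketch:

```python
def correlation_strength(r: float) -> str:
    """Classify |r| using the Portney (2009) rule of thumb from the text."""
    magnitude = abs(r)          # strength depends on magnitude, not direction
    if magnitude > 0.75:
        return "good to excellent relationship"
    if magnitude > 0.50:
        return "moderate to good relationship"
    if magnitude > 0.25:
        return "fair relationship"
    return "weak or no relationship"

print(correlation_strength(-0.8))   # -> good to excellent relationship
```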
However, suppose we would like to explore multiple variables that might be intercorrelated in the study. How can this be done?
Question 102: What can I do if I want to examine
the correlation of more than two variables at the
same time?
This is mostly done through a "correlation matrix," which is essentially a table that shows the coefficients between variables (usually more than 2). You need the r value for each pair of variables; displaying them creates a correlation matrix that resembles a triangular table.
This summarizes your data (usually a large dataset), so you can start your exploratory analysis this way. Many statistical packages can produce this exploratory table (STATA, SPSS, CORRPLOT) (Portney, 2009).
Variable 1 2 3 4 5 6 7
1 - (1x2) rp (1x3) rp (1x4) rp (1x5) rp (1x6) rp (1x7) rp
2 - (2x3) rp (2x4) rp (2x5) rp (2x6) rp (2x7) rp
3 - (3x4) rp (3x5) rp (3x6) rp (3x7) rp
4 - (4x5) rp (4x6) rp (4x7) rp
5 - (5x6) rp (5x7) rp
6 - (6x7) rp
7 -
Figure 6.3. Correlation matrix display - r= Pearson coefficient.
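A matrix like Figure 6.3 can be produced with `numpy.corrcoef`; the four variables below are simulated for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# 50 observations of 4 hypothetical variables
coffee = rng.normal(3, 1, 50)
sleep = 8 - 0.5 * coffee + rng.normal(0, 0.5, 50)   # built to relate to coffee
stress = rng.normal(5, 2, 50)                        # unrelated noise
exercise = rng.normal(2, 1, 50)                      # unrelated noise

data = np.column_stack([coffee, sleep, stress, exercise])
r_matrix = np.corrcoef(data, rowvar=False)           # 4 x 4 Pearson matrix

print(np.round(r_matrix, 2))    # diagonal is 1 (each variable with itself)
```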
Question 103: Why do I sometimes see
correlation coefficients with a p value?
As explained in this book, inferential statistics rely on the capability
of a statistical test to reject the null hypothesis. In this specific case,
the null hypothesis is that there is no relationship between the two
variables.
H0: r = 0 (where r is the Pearson correlation)
Our alternative hypotheses can be expressed as
H1: r > 0,
H1: r < 0.
Or simply it can be
H1: r ≠ 0
To report a p-value, you can use a t-test or an F-test; however, the p-
value is not as important as knowing the coefficient value and
interpreting it appropriately.
Of note, most statistics books include appendix tables with
critical values of “r.” You can calculate “r” manually and compare it
with the critical value in the table for a one- or two-tailed test of
significance at your chosen significance level (usually α = 0.05) and
the corresponding degrees of freedom (df) (Portney, 2009).
Question 104: Why does “r” have a distribution
and how can I interpret if it is a one or two tailed
prediction?
This can be confusing. As mentioned, the Pearson correlation
explores the relationship between two variables in a linear pattern.
The variables are continuous; therefore, this is a parametric statistic.
When there is no relationship, the expected correlation will be
close to 0. However, it is possible to obtain, by chance, r values that
deviate from zero; the probability of such deviations decreases
markedly as the value approaches +1 or -1.
If we take H0 to be r = 0, we consider a symmetric distribution
with a mean of 0, and the shape of this distribution changes with the
number of observations (n).
However, “n” matters through the degrees of freedom (df). For
“r,” the df is N - 2 (and not N - 1). This can be explained as follows:
our “r” correlation is the slope of the best-fit regression line for the
z scores. We need at least two points to draw a specific straight line;
once those two observations have been used, they cannot be used
again (in other tests, only one df is spent on the sample).
A one-tailed prediction states whether the correlation is positive
or negative. In contrast, a two-tailed test simply predicts a significant
correlation in either direction (Hinton, 2014).
Suppose you found the following correlation between cups of
mint tea at supper time and hours of sleep in 30 patients. How do
you interpret it?
r(28) = -0.52, p = 0.005
In this case:
“r” is the Pearson correlation.
df is 28 (as we use N - 2).
-0.52 indicates a significant, negative, moderate correlation based on
the p-value.
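The numbers in this example can be reproduced by hand. A minimal sketch, assuming the usual t statistic for a correlation, t = r√(n − 2)/√(1 − r²), and the tabulated two-tailed critical value for df = 28 at α = 0.05 (about 2.048):

```python
import math

def t_from_r(r, n):
    """t statistic for testing H0: r = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

r, n = -0.52, 30
t = t_from_r(r, n)
print(f"t({n - 2}) = {t:.2f}")         # t(28) = -3.22
print("significant:", abs(t) > 2.048)  # True: |t| exceeds the critical value
```

This is the same comparison you would make by looking up the critical value of "r" in an appendix table.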
Question 105: What is the coefficient of
determination?
The coefficient of determination is the square of the correlation
coefficient between the predictor and outcome variables. It is
represented as r2 and refers to the percentage of variation in the
outcome explained by the predictor in the model. In other words, the
coefficient of determination shows how well the data are explained
by the model. A number close to 1 represents an almost perfect fit,
and a number close to zero implies that the model explains your
data poorly. Therefore, if the correlation coefficient is 0.65, the
coefficient of determination is:
r2 = (0.65)2 = 0.42 = 42% (Hinton, 2014).
The sleep study was successful. You now decide to investigate
the correlation between race and quality of sleep. However, you
realize that both variables are categorical (race is a nominal
variable, and quality of sleep measured on a sleep scale is ordinal),
rather than continuous. How can a correlation test be performed with
categorical variables?
Question 106: When do we use Spearman
correlation?
The Spearman rank correlation coefficient (rs) is the nonparametric
analog of rp; it can therefore be used with ordinal data and is not
affected by outliers. One important assumption for the Spearman
test is a monotonic relationship between variables A and B, meaning
that they consistently increase or decrease in relation to each other
(Motulsky, 4th edition). It is expressed in terms of ranks:
Low ranks of variable A will correlate with low ranks of variable B.
High ranks of variable A will correlate with high ranks of variable B.
If there is no relationship, ranks will not correlate between variables A and B.
rs = 1 - (6 Σd²) / (n(n² - 1))
rs = Spearman correlation
Σd² = the sum of the squared differences between ranks
n = number of observations
The critical values are, as with rp, tabulated in the appendix of most
statistics books. Unlike rp, however, rs uses n rather than df to
locate critical values (Portney, 2009).
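As an illustration, the rank formula above can be applied directly. This is a minimal sketch with invented ordinal data; note that the Σd² formula is exact only when there are no ties (with ties, computing Pearson's r on the ranks is the usual fallback):

```python
def ranks(values):
    """Rank values from 1..n, averaging the ranks of tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over any run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank of positions i..j
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman_rs(x, y):
    """Spearman correlation: rs = 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical ordinal data: sleep-quality score vs. fatigue score.
print(spearman_rs([1, 2, 3, 4, 5], [5, 4, 3, 2, 1]))  # -1.0 (perfectly opposite ranks)
```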
Question 107: I have difficulties with the terms:
causation, correlation, and regression. Are they
all the same?
All of these are important concepts that, even though they can
coexist, do not mean the same thing.
Correlation is used to assess how a change in one variable is
associated with a change in a second variable. It therefore implies a
relationship, and either variable can be assigned as x or y. One
variable increases or decreases as the other increases or
decreases, but they may do so because another factor is acting on
both. Consequently, correlation does not naturally imply causation.
Regression, on the other hand, predicts one variable based on
another, explaining part of the change in the outcome variable based
on the covariate. Again, causation is not necessarily a conclusion of
regression analysis. Causation means that one variable directly
causes the other to change; this cannot be concluded merely from
assessing association. Both correlation and regression are statistical
tools used for exploratory analysis and identification of possible risk
factors (Rodgers and Nicewander, 1988).
Conclusion for Chapter 6
Correlation tests are used to determine the statistical association
between two variables. As explained in this chapter, Pearson
correlation is used for two continuous variables, and Spearman
correlation for ordinal or non-normally distributed variables.
The graphical representation of the data is very important for
these tests and must be included as part of the statistical plan when
analyzing correlation.
The strength and direction of the association are more
important than the p-value in correlation analysis. The relationship
between parameters should therefore be used to describe
association, not causation.
References of Chapter 6
1. Hinton, P. R. (2014). Linear correlation and regression. In:
Statistics Explained (3rd ed.). Routledge.
2. Portney, L. G., & Watkins, M. P. (2009). Correlation. In:
Foundations of Clinical Research: Applications to Practice (3rd
ed., pp. 523–538). Pearson Prentice Hall.
3. Rodgers, J., & Nicewander, A. (1988). Thirteen ways to look
at the correlation coefficient. The American Statistician, 42(1),
59–66. doi:10.1080/00031305.1988.10475524
4. Sullivan, L. M. (2018). Hypothesis testing procedures. In:
Essentials of Biostatistics in Public Health (Chapter 7). APHA Press.
5. Taylor, R. (1990). Interpretation of the correlation coefficient:
A basic review. JDMS, 1, 35–39.
Chapter 7
7. Statistical Tests V: Survival
Analysis
Rui Nakamura & Antonio Macedo
Introduction to Chapter 7
Suppose that you want to conduct a trial to test the efficacy of
crizotinib, an oral tyrosine kinase inhibitor targeting anaplastic
lymphoma kinase (ALK) phosphorylation, in ALK-positive metastatic
non-small cell lung cancer patients, as compared with standard
chemotherapy.
In this trial, you face a challenge: you need to obtain an
accurate and reproducible measurement. You could use the “overall
survival” (OS) as an outcome; however, this will require a large
sample size, a long follow-up time, and higher costs. Consequently,
you select “progression-free survival” (PFS) as a surrogate endpoint
for predicting OS and decide to evaluate the effect of crizotinib on
non-small cell lung cancer patients to determine if there has been
any progression of the disease.
PFS is selected as the primary outcome of your trial, which
considers the time period between the treatment and worsening of
the disease (National Cancer Institute, 2021). It should be noted that
PFS is generally used in oncology trials as a key endpoint (Shaw,
2013).
Therefore, how do we analyze PFS in a clinical trial? What
happens if some patients do not develop the event, or if we lose
contact with them? In this case, you recall the survival analysis
discussion from your lectures, and how survival analysis offers a
unique approach to overcoming some of these challenges.
Survival analysis, then, is a statistical method for analyzing
“time-to-event” data. This type of analysis is also known as lifetime
data analysis, time-to-event analysis, or event history analysis,
depending on the circumstances and the field in which it is
performed. The “event” in time to event refers to a transition from
one state to another, situated in time. The event does not always
have a negative implication: it includes not only death but also
pregnancy, injury, leaving the hospital, and so on. The event in the
trial described above is disease progression.
The most important feature of time-to-event data is that not all
patients will have experienced the event by the end of the trial’s
follow-up period. Special methods are needed to analyze time-to-
event data, as patients usually do not begin their treatment or enter
the study simultaneously. Therefore, we cannot use a simple t-test to
analyze survival data. For instance, if we take “days of survival” as
our primary outcome measure, a t-test would not be appropriate
even though it might appear to be a continuous variable at first; this
is because we are not assessing time itself (a continuous variable)
but the proportion of participants who are still alive at each point in
time. In this sense, the event underlying “days of survival” is binary:
at each time point, the patient either has or has not survived.
Table 7.1. Main methods used in survival analysis
Method                                      Purpose
Kaplan–Meier estimation                     Describe survival curves
Log-rank test                               Compare survival curves
Cox proportional hazards regression model   Assess whether explanatory
                                            factors are associated with
                                            survival
Survival analysis is usually conducted in specific steps. First,
the Kaplan–Meier estimator, the most commonly used method, is
applied to illustrate survival curves. Second, we compare survival
curves using the log-rank test. This test assesses the null hypothesis
that there is no difference between survival curves (i.e., that the
probability of an event occurring at any time point is the same for
each group being compared) (Bewick et al., 2004). Third, if we want
to assess whether explanatory factors are associated with survival,
Cox proportional hazards regression can be used to adjust for
potential confounders. Applying this method to your trial, we could
evaluate the relationship between the outcome variable PFS and
age and sex, studying the effects of these variables on survival.
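The first of these steps can be made concrete. Below is a minimal pure-Python sketch of the Kaplan–Meier (product-limit) estimator with invented follow-up data; real analyses would use dedicated survival software:

```python
# Each subject is (time, event): event = 1 means the event occurred
# (e.g., progression); event = 0 means the observation was right-censored.
subjects = [(6, 1), (7, 0), (10, 1), (15, 1), (19, 0), (25, 1)]

def kaplan_meier(subjects):
    """Product-limit estimate: S(t) = prod over event times t_i <= t of
    (1 - d_i / n_i), with d_i events at t_i and n_i subjects still at risk."""
    s = 1.0
    curve = []
    for time in sorted({t for t, e in subjects if e == 1}):
        n_i = sum(1 for t, e in subjects if t >= time)             # at risk
        d_i = sum(1 for t, e in subjects if t == time and e == 1)  # events
        s *= 1 - d_i / n_i
        curve.append((time, s))
    return curve

for t, s in kaplan_meier(subjects):
    print(f"t = {t:>2}: S(t) = {s:.3f}")
```

Note how the censored subjects (times 7 and 19) leave the risk set without forcing the curve down, which is exactly how their data are "used up to the time they are censored."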
This chapter will discuss the concepts of survival analysis, the
interpretation of Kaplan–Meier plots, log-rank tests, and Cox
regression.
Survival Analysis – General Concepts
Question 108: How does the time-to-event
analysis differ from other types of outcome
analyses?
Most of the time, it is not possible to follow all participants until the
occurrence of the event of interest. In addition, patients generally do
not begin treatment or enter the study simultaneously. Therefore,
methods for analyzing time-to-event data differ from other outcome
analyses. One of the most important differences is that estimating
the survival time, or the time to an event of interest, is based on the
probability of a participant remaining event-free over time, which
must be estimated from data that include censored observations.
Censored Data
Question 109: What is censoring?
As discussed previously, censoring means that the survival time to
an event for a given subject cannot be accurately determined
because of dropout, loss to follow-up, unavailable information,
consent withdrawal, or because the study ends before the
occurrence of the event of interest (Rich, 2010). In short,
observations in a Kaplan–Meier analysis that end without the
occurrence of the actual outcome are censored.
As a rule, the true outcome risk of censored subjects is
assumed to be the same as that of those who remain in the cohort
(Putter, 2007). However, this assumption does not hold when
subjects are censored because of a competing event (e.g., death
competing with disease relapse). Competing events prevent the true
outcome from ever occurring (Walraven, 2016; Wolkewitz, 2014);
that is, subjects who experience a competing event have an
outcome risk of zero from then on. If they are instead censored at
the time of the competing event, their outcome risk is assumed to
equal that of the subjects remaining in the cohort, which in turn
leads to overestimated (true) outcome risk estimates (Wolkewitz,
2014).
Question 110: What are the mechanisms of
censoring?
In survival analysis, it is also crucial to identify the
mechanisms of censoring. In this context, let us define important
concepts:
Informative censoring: When participants are lost to follow‑up
for reasons related to the study (e.g., a patient withdraws
consent due to intolerable side effects of the intervention
arm).
Noninformative censoring: When participants drop out of the
study for reasons unrelated to the study (e.g., if a patient
moves from town because of a business proposal and can no
longer attend the follow-up visits).
For an unbiased analysis of survival curves, censoring due to loss to
follow-up should be minimal and truly “noninformative.”
Simply stated, censored observations can arise in three
primary ways (Zaman, 2011; Dwivedi, 2016):
1. The event of interest does not occur during the study period.
2. Loss to follow-up.
3. Death unrelated to the disease in question.
If we take a hypothetical complete follow-up dataset (which is
virtually unfeasible to obtain in practice), we may derive censored
datasets by one of three basic mechanisms of censoring, as follows:
- Time censoring, in which every case has a shorter follow-up time
than the hypothetical complete follow-up time.
- Interim censoring, in which, for every case in the hypothetical
complete follow-up dataset, a random recruitment time is generated,
and the interval between the recruitment time and the interim time is
shorter than the theoretical complete follow-up time. This is
conceptually comparable to the censoring associated with an interim
analysis in a clinical trial.
- Case censoring, in which the follow-up time is shortened by a
random amount within a predefined follow-up period. This
mechanism corresponds to clinical trial cases that are either lost to
follow-up or withdraw consent.
Question 111: What are the types of censoring
(left vs. right vs. interval censoring)?
We can also address censoring according to its three main types, as
shown in Figure 7.1 (Machin, 2007; Indrayan, 2013; Lee, 2016):
Right censoring occurs when the observation of a patient ends
before the event of interest occurs; hence, the actual time to event, if
it were to occur, is longer than the observation time, but the time
period cannot be determined. This is the most common type of
censoring (e.g., when non-small cell lung cancer progression occurs
after the trial ends). Right censoring includes Type I, Type II, and
random censoring schemes, as follows:
Type I censoring is the type in which the observational
period is fixed. At the end of the study, any subject who has
not yet experienced the event is censored. In a study with
no accidental losses, all censored observations equal the
length of the study. This is also known as “time censoring.”
Type II censoring is the type of censoring in which the total
number of events is fixed in advance. In this case, a study
ends after a predefined number of events take place.
Left censoring occurs when a subject has experienced an event
before the start of the observation period. That is, the actual time to
event is shorter than the interval between the origin and the start of
the observation period, but its extent is unknown. For this type of
censoring, we do not know the exact point in time an individual has
entered the state of interest (e.g., progression of non-small cell lung
cancer occurring before the patient entered the trial because of a
failure to recognize this at the time of their enrollment).
Interval censoring occurs when the event is known to have occurred
within a time interval (left, right), but the exact time within that
interval cannot be identified. The true time to event is thus only
known to lie within a given interval between observations. For
example, suppose cardiac-specific troponin is measured twice a
month to assess myocardial injury in a participant, and the second
serum measurement is elevated. In this case, although we know that
myocardial injury occurred between the first and second time points,
we do not know the exact time point at which the event occurred
(Munkhjargal, 2019).
Figure 7.1 Types of censoring. Cases B and C are right-censored. Case D is left
censored. Case E is interval censored.
Table 7.2 describes further details regarding the types of
censoring commonly seen in survival analyses and provides some
simple examples.
Table 7.2 Main types of censoring in survival analysis (Dwivedi, 2016; Machin,
2007; Indrayan, 2013; Lee, 2016; Cleves, 2016)

Right censoring*
Definition: Occurs when a subject participates in the study for a time
and, thereafter, is no longer observed. This is the most common type
of censoring in survival analysis.
Examples: A failure event that has not yet occurred for some
subjects by the end of a prespecified study period; premature
withdrawal of consent (e.g., due to intolerance to, or lack of efficacy
of, a new drug); loss to follow-up (for whatever reason, e.g., the
subject moved from town).

Interval censoring
Definition: Occurs when, rather than knowing the exact time of
failure or that failure occurred past a certain point, all we know is
that the event occurred between two predefined time points (within a
short interval or a long one).
Example: Subjects seen at fixed time points throughout their follow-
up period (e.g., patients with asthma visiting the clinic once per
month for several years); here, an event will have occurred between
two clinic visits.

Left censoring
Definition: May be seen as a case of interval censoring in which the
left limit equals zero (i.e., the failure event occurs before the subject
enters the study). This is a less common type of censoring.
Example: In a study of time to progression from in situ carcinoma to
overt disease, an individual who already has overt disease when
first interviewed is left-censored, because the progression from in
situ carcinoma to clinical disease (the failure event) happened
before the start of the follow-up period.

Gap censoring
Definition: Occurs when a subject is lost to follow-up for a while but
then reports back to the study, causing a gap in the follow-up data
(this may become a statistical issue if, say, a subject died and could
never have reported back).
Example: A study participant misses a couple of follow-up visits for a
reason related to neither the baseline disease nor the outcome, and
then reports back at the following visit.

* Includes Type I, Type II, and random censoring schemes.
As discussed before, the number at risk drops over time in a survival
curve for two main reasons: a subject dies (or “fails” a certain
endpoint), or their data are censored. In either situation, we cannot
predict how long they would have been followed had they not died or
been censored.
Question 112: How much censored data are
acceptable?
Overall, the Kaplan–Meier method performs effectively with right-
censored and left-truncated observations (Prinja, 2010). However,
having too many censored observations can render the survival
analysis unreliable. As mentioned before, censored observations
cannot simply be removed from the analysis, because this would
generate highly biased results (Lucijanic, 2012). Similarly, an
extremely high censoring rate will lower the accuracy and
effectiveness of the survival model, thus increasing the risk of bias.
Survival analyses in the literature seldom report their actual
censoring rates (ref), even though censoring of up to 84% of
observations has been reported (Zhu, 2017). There is no well-
defined upper threshold for the proportion of censoring allowed in a
survival analysis.
Suppose that over a study period of 12 months, the probability of the
occurrence of an event is very low (e.g., 1%); in this case, we will
have a high rate of censored observations even if the survival
estimate is determined appropriately. Further, this implies that the
variance of the log hazard ratio is large, given the low number of
events in each group.
The main issue with censoring is that it needs to occur for
reasons unrelated to the impending risk of an event; that is, it must
be assumed to be uninformative. Moreover, a highly unbalanced
censoring rate among groups (e.g., due to loss to follow-up),
regardless of the total censoring rate, may also bias the results of a
trial. When right censoring (e.g., due to loss to follow-up or consent
withdrawal) is high, it may indicate flaws in the design and quality of
the observed data. If censoring is related to the outcome of interest
(e.g., the patient is too ill to attend the follow-up visit), the results will
be biased, particularly if censoring before the completion of follow-up
is high. Some authors advise using a full Bayesian model to capture
the uncertainty in the proportional hazards inherent to such cases
(Clark, 2003). Occasionally, it may be helpful to run a Cox model to
check for predictors of time until censoring, after adjusting for the
date of subject entry into the study. This can also help identify
subject or clinical-site characteristics associated with inadequate
follow-up.
In addition, when the total sample size or the number of actual
events (uncensored observations) is small, we cannot estimate the
hazard rate with precision or perform reliable comparative analyses.
In such situations, comparing two treatments on a relative scale
(e.g., the hazard ratio) when too few subjects have had an event will
be unreliable, although the confidence interval for the difference in
survival probability will still work well. Switching to another method to
handle a high censoring rate (e.g., using logistic regression instead
of Cox regression) will not account for variable follow-up times and
will also lose the time-to-event information.
Another point to consider is that if the proportion of censoring
is asymmetric between groups, caution should be taken when
interpreting the results. For example, when comparing two types of
conditioning regimens in patients with acute myeloid leukemia, as
mentioned previously, a myeloablative regimen will lead to more
events (e.g., deaths) during the first quarter of the follow-up period
(due to high early treatment-related toxicity), but those who survive
will have longer survival at late follow-up times. In the reduced-
intensity conditioning group, most patients may exhibit a higher
probability of survival at the beginning but more events at the end,
which highlights the disproportionality of events over time across the
two groups.
Question 113: Is the Kaplan–Meier method the
only way to deal with censoring? Is there
another model to deal with it?
Censored data can be analyzed using several methods, which can
be divided into four categories: substitution methods, log-probit
regression methods, maximum likelihood estimation methods, and
non-parametric methods, which include the Kaplan–Meier method.
However, the Kaplan–Meier method is generally considered the
preferred approach for analyzing censored data in clinical research.
It incorporates censored patients by using their data up to the time at
which they are censored (Hewett, 2007).
Question 114: What are the differences between
missing data (e.g., due to loss to follow-up or
dropouts), truncated data, and censored data?
Are there relationships between them?
Missing data are “values that are not available and that would be
meaningful for analysis if they were observed” (Ware, 2012).
Censored data can be explained as missing data because
they are also produced by incomplete observation (e.g., due to loss
to follow-up, dropouts, or study withdrawal).
Truncation is often confused with censoring because it also
leads to incomplete observations over time. It differs, though, in that
truncation results from a systematic process inherent to a given
study design. For example, in a clinical trial, truncation means
omitting all data outside a predefined boundary, thus reflecting
complete ignorance of the event of interest and the covariates over
part of the distribution (Cleves, 2016). Truncation therefore reflects a
sampling constraint in which a failure-time variable is observable
only if it falls within a certain region or boundary (Balakrishnan, 1995).
Truncation can be divided into two types depending on which
limit of the considered interval is known, as follows (Dwivedi, 2016):
Right truncation: selective inclusion of patients in whom an
event has already occurred (e.g., selection of patients from a
death registry). In this case, the upper limit of the observation
interval is fixed.
Left truncation (or late/delayed study entry): Selective
inclusion of patients in whom the event has not occurred yet;
those who have already experienced the event before the
time point of study entry are not identified and may often not
even be known to exist; such subjects are thereby selectively
excluded. In this case, the lower limit of the observation
interval is fixed.
Question 115: How to minimize censored data? Is
there a way? Can we avoid it? Can we ignore
censored data in the survival analysis?
As mentioned before, to mitigate bias in a survival analysis,
censoring due to loss to follow-up should be minimal and truly
noninformative. Several methods have been proposed to address
the issue of censored data. These include imputation techniques for
missing data, sensitivity analyses to mimic best- and worst-case
scenarios, and the use of the dropout event itself as a study
endpoint (Dwivedi, 2016). Right censoring, the most common type,
tends to be better handled with parametric models (Cleves, 2016).
Informative censoring bias may be associated with
overestimating treatment effects (Walraven, 2016). Nonetheless, the
extent to which this leads to overinflated results is difficult to assess.
Censoring may change the conclusions of clinical trials and may
challenge the validity of results from interim analyses in clinical trials
(Prasad, 2015; Bagust, 2018).
A critical assumption in survival analysis is that censoring is
noninformative. Even so, this may not hold true: as discussed,
informative censoring occurs when data related to prognosis are
censored.
If we ignore the censored cases, this may seriously bias the
results since we might be missing out on the potential heterogeneity
within our study population. This may lead to overestimation or
underestimation of the Kaplan–Meier survival estimates, given that
the censored values would not be randomly interspersed among the
remaining survival times. Another crucial drawback of such an
approach is that a complete case analysis may be underpowered
because of the smaller number of events. As in other types of
studies, survival analysis tends toderive informative results when
the number of observed or unobserved events is increased.
Therefore, censoring must be accounted for in the analysis to allow
valid inferences.
As mentioned before, a unique feature of survival data is that
not all patients typically experience the event (e.g., death) by the
end of the observation period, so the survival times for some
patients are unknown. Survival data are therefore usually censored
or incomplete. Moreover, survival times are usually skewed, limiting
the usefulness of analysis methods that assume a normal data
distribution.
With few exceptions, the censoring mechanisms in most
observational studies are unknown. Hence, it is necessary to make
assumptions about censoring when common statistical methods are
used to analyze censored data (Leung,1997).
As pointed out before, censoring occurs in three basic forms:
left (or time) censoring, interval (or interim) censoring, and right (or
case) censoring. Independent of the biases inherent to the design
and conduct of prospective clinical trials and to the analysis of
retrospective studies, bias may also result from patient censoring
(i.e., incomplete observation). This bias is inherent to the Kaplan–
Meier method itself and thus lies beyond the most common biases
associated with study design and execution; it may therefore be a
source of bias even in a well-designed and well-conducted trial.
For instance, if we take PFS as our outcome measure, all
three types of censoring (and thus bias) may apply. In contrast, OS
is based on a well-defined time point and is therefore not subject to
interval censoring; it will, however, be prone to right censoring (a
surrogate of loss to follow-up or incomplete follow-up, where we do
not know the true event time point) and can thus be a source of bias
(Barrajón, 2020). Given that the survival benefit in an oncology study
is often somewhat marginal, any potential bias may be unwelcomely
impactful by undermining real clinical benefits (Henegan, 2017).
Survival curves
Question 116: Why is it not possible to obtain the
survival time in all cases?
This is because it is impossible to follow all cases or participants up
to the event of interest. Some participants withdraw, or do not
experience the event of interest by the end of the observation
period. For example, in a cancer trial with the event defined as “time
to first relapse” (including death from lung cancer), some patients
may be alive and relapse-free at the end of the study.
Question 117: If the survival curves of two drugs
cross, what methods can we use to interpret and
address them?
Crossing of the survival curves of two drug arms sometimes occurs
in clinical trials, especially when a drug provides a short-term benefit
but no long-term advantage. This implies that one drug provides
better survival in a certain time period but worse survival in another,
which undermines the overall homogeneity of the survival
distributions.
The log-rank test is commonly used to analyze survival
curves, including those that cross each other. However, interpreting
the differences between the crossing survival curves may be
misleading. Several statistical methods, including adaptive Neyman’s
smooth tests, the modified Kolmogorov-Smirnov test, and the
weighted Kaplan–Meier tests can be used as alternative strategies to
address this issue (Li, 2015).
Covariate adjusted analysis
Question 118: When is the log-rank test
comparing Kaplan–Meier curves more
appropriate than Cox’s proportional hazards
regression?
It is important to clarify that the Kaplan–Meier method provides a
way to estimate the survival curve, while the log-rank test is used for
the statistical comparison of two (or more) groups. In Cox
proportional hazards regression, covariates can be added to the
analysis. Accordingly, you perform a log-rank test when you do not
need to adjust for confounders or covariates, and you use a Cox
proportional hazards model when you need to account for factors
that could be associated with the time-to-event outcome.
Cox proportional hazards regression enables us to adjust for
covariates, account for the variables of interest, and assess the
effects of multiple covariates in survival analysis.
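To make the univariate comparison concrete, the sketch below (illustrative only; the function name and data are not from the text) computes the two-group log-rank chi-square statistic from the observed-minus-expected events at each distinct event time:

```python
def logrank_statistic(times1, events1, times2, events2):
    """Two-group log-rank chi-square statistic.

    times*: follow-up times; events*: 1 = event observed, 0 = censored.
    At each distinct event time, the events observed in group 1 are
    compared with the number expected under the null hypothesis that
    both groups share the same survival distribution.
    """
    data = [(t, e, 1) for t, e in zip(times1, events1)] + \
           [(t, e, 2) for t, e in zip(times2, events2)]
    event_times = sorted({t for t, e, _ in data if e == 1})
    o_minus_e, variance = 0.0, 0.0
    for t in event_times:
        n1 = sum(1 for ti, _, g in data if ti >= t and g == 1)  # at risk, group 1
        n2 = sum(1 for ti, _, g in data if ti >= t and g == 2)  # at risk, group 2
        n = n1 + n2
        d = sum(1 for ti, e, _ in data if ti == t and e == 1)   # total events at t
        d1 = sum(1 for ti, e, g in data if ti == t and e == 1 and g == 1)
        o_minus_e += d1 - d * n1 / n                  # observed minus expected
        if n > 1:
            variance += d * n1 * n2 * (n - d) / (n ** 2 * (n - 1))
    return o_minus_e ** 2 / variance if variance > 0 else 0.0
```

The statistic is referred to a chi-square distribution with one degree of freedom; with identical survival experience in both groups it equals zero, and it grows as the curves separate.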
Question 119: How can we interpret the
coefficients in the Cox proportional hazards
regression analysis?
Coefficients (i.e., β terms) in Cox proportional hazards regression
analysis can be interpreted as the logarithm of the hazard ratio. That
is, the antilog of an estimated regression coefficient exp(β) produces
the hazard ratio (HR).
If a predictor does not affect the event of interest, the HR
equals 1. An HR greater than 1 indicates that the predictor is
positively associated with the event of interest, thus decreasing
survival. An HR of less than 1 indicates that the predictor has a
protective effect on the event of interest, thus increasing survival.
For example, suppose you want to evaluate the association
between weight and time to myocardial infarction (MI), and you
conduct a prospective cohort study with an eight-year follow-up. The
analysis yields a hazard ratio of 1.016 (95% CI 1.009–1.031).
Because weight is a continuous variable, we interpret this as "a one-
unit increase in weight is associated with a 1.6% increase in the
hazard of MI."
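As a quick numerical check of this interpretation (a minimal sketch; the coefficient is back-calculated from the hazard ratio in the example above):

```python
import math

# Cox regression coefficient (log hazard ratio), back-calculated here
# from the HR of 1.016 reported in the weight/MI example.
beta = math.log(1.016)

hr = math.exp(beta)                # hazard ratio for a 1-unit increase
pct_change = (hr - 1) * 100       # percent change in the hazard, ~1.6%
hr_10_units = math.exp(10 * beta)  # HR for a 10-unit increase, ~1.172
```

Note that the hazard scales multiplicatively: a 10-unit increase multiplies the hazard by exp(10β), not by ten times 1.6%.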
Use of survival analysis in clinical research
Question 120: Can we use survival analysis in
both prospective and retrospective studies? Can
survival analysis be used in randomized
controlled trials? What are the crossover trials?
What types of studies can we perform a survival
analysis?
Overall, survival analysis can be used in all the study designs
mentioned above, provided that the outcome of interest is a time-
to-event endpoint. For studies on time to disease, a prospective
cohort study is generally the method of choice. When the time of
origin is known for all subjects, a prevalent cohort study, in which
subjects are recruited at variable times after the onset of the disease,
is also a good option.
Although prospective studies usually have fewer potential
sources of bias and confounding than retrospective studies, the latter
are also commonly used (Curtis, 2003). In either case, censoring is an
issue.
As for randomized controlled trials, parallel design and
crossover trials are amenable to survival analysis (Rossouw, 2002).
However, crossover designs have some particularities. These
trials comprise an efficient design that is often used to assess
the effect of treatments that act relatively quickly and whose benefits
(and side effects) tend to disappear upon discontinuation. Each
patient serves as their own control, enabling within-individual
comparison of treatment and placebo responses. However,
estimating overall survival (OS) in clinical trials involving crossover is
challenging because performing a conventional intention-to-treat (ITT)
analysis may confound the treatment effect.
Crossover trials are generally not the most appropriate choice
when dealing with binary endpoints, such as death (Nason, 2010).
Nevertheless, they may be quite well suited to such endpoints, and
may even be more efficient than the parallel-group design, when there
is heterogeneity in individuals' risks (in this case, crossover trials can
increase power using a log-rank test) (Buyze, 2013). An important
issue when considering this trial design in survival analysis is right-
censored data. Deriving a time point for treatment crossover that
optimizes efficiency may be helpful. Some authors propose a
combination of these types of study design by using a "two-period
design,” wherein first-period "survivors" are rerandomized for the
second period. This helps maintain the efficiency of crossover trials
while ensuring that the second-period groups are still comparable by
randomization (Nason, 2010).
Question 121: Can we extrapolate survival
analysis to fields other than clinical research?
Although survival analysis can be extrapolated to non-clinical
research, several issues tend to be unique to clinical trials
(Nakamura, 2018). Even so, several non-medical fields apply
survival analysis in specific situations, as exemplified below:
a) In economics: duration analysis or duration modeling. For
instance, analyzing ‘‘time to job loss.’’
b) In engineering: Reliability analysis or reliability theory. For
instance, analyzing ‘‘time to loss of safety of a machine.’’
c) In sociology: Event history analysis. For instance, analyzing
‘‘time to adopt a policy.’’
Question 122: Is the sample size important in
survival analysis? Is the minimum sample size
required?
As a general rule, an adequate sample size in clinical research is
important to yield statistically significant results. However, obtaining
a large sample size is not always feasible for logistical and
ethical reasons. Estimating an appropriate (required) sample size is
based mainly on the research question, purpose of the study,
confidence level, and precision.
As mentioned before, one of the distinctive points of survival
analysis is that it takes into account data from participants who
developed the event and from those who were censored. As a rule, it
is possible to compare survival medians between different groups,
provided that the event occurs in at least 50% of the subjects being
studied (Lira, 2020). Extra care should be taken when interpreting
the survival curve in relation to the sample size. Survival probabilities
must be estimated, rather than simply counted, from the moment a
censored observation occurs. Estimates in the later observation
periods across the survival curve tend to become somewhat
inaccurate owing to the reduction in the number of participants; that
is, accuracy tends to decrease as fewer patients remain at risk for a
given event. Another important point to consider
when performing the more widely used survival analysis methods
(namely, the Kaplan–Meier method, the log-rank test, and the Cox
method) is the proportionality assumption, in that hazards must be
proportional at different time intervals (Lira, 2020; Botelho, 2009).
Sample size in survival analysis depends on the follow-up
time, number of censored data points, number of compared subjects,
total frequency of events, and differences in events between
subsets. Overall, models tend to perform worse (with a larger
Type II error) when there are fewer than 10 events per analysis
subset and fewer than 10 participants per group (Miot 2017).
Some researchers advise a pre-test with a
shortened follow-up time to ensure an appropriate sample (Miot,
2011). Specific formulas have been developed to calculate the
number of events needed as a function of the HR, with a significance
level (Type I error) of 5% and a beta value (Type II error) of 20%
(Williamson, 2009; Schoenfeld, 1983).
The power of a 2-sample log-rank test depends on the total
number of events, which typically needs to number in the hundreds. The number
of subjects required to estimate the entire single survival curve using
the Kaplan–Meier estimator usually depends on the timing of
censored observations to reach a reliable and reasonably precise
estimate.
Sample size calculations for survival trials include an
adjustment to account for the expected rate of dropouts (or non-
adherence). Otherwise, the study may end up being greatly
underpowered.
The power of a method to analyze survival time data depends
on the number of events rather than the total sample size per se
(Bradburn, 2003). Therefore, sample size estimation is done in a
two-step process, as follows (Hosmer, 2008):
First, the number of events needed to detect a minimum
clinically important effect size (e.g., a pre-specified hazard ratio
[HR]), with a predefined power (e.g., 80%) and alpha level (e.g.,
5%), is calculated. Different approaches for estimating the number
of events have been proposed depending on the planned data
analysis method, including the Schoenfeld method for log-rank tests
or proportional hazards models (Schoenfeld, 1983).
Second, to calculate the total sample size, the proportion of
patients expected to experience the event needs to be estimated
(Hosmer, 2008). Notably, for multivariable models such as the Cox
proportional hazards model, it has been suggested that at least 10
events need to be observed per covariate to be included in the
model (Peduzzi, 1995).
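The two steps can be sketched numerically. The sketch below assumes equal allocation and the commonly cited Schoenfeld approximation; the z-values are standard normal quantiles for two-sided alpha = 5% and power = 80%, and the 60% event probability is a hypothetical input:

```python
import math

def schoenfeld_events(hr, p1=0.5, z_alpha=1.959964, z_beta=0.841621):
    """Step 1: required number of events (Schoenfeld, 1983) to detect
    a hazard ratio `hr`; p1 is the proportion allocated to group 1
    (0.5 for equal allocation)."""
    p2 = 1 - p1
    return math.ceil((z_alpha + z_beta) ** 2 / (p1 * p2 * math.log(hr) ** 2))

def total_sample_size(events, p_event):
    """Step 2: inflate the required events by the overall probability
    that a participant experiences the event during follow-up."""
    return math.ceil(events / p_event)

events = schoenfeld_events(hr=0.7)           # about 247 events for HR = 0.7
n = total_sample_size(events, p_event=0.6)   # if ~60% are expected to have the event
```

With these defaults, detecting a true HR of 0.7 requires about 247 events; if roughly 60% of participants are expected to experience the event, that translates into a total sample of about 412 subjects.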
There are several freely available tools on the Internet that
enable us to calculate the required sample size to compare survival
rates between two independent groups.
Overall, the most relevant statistical components to define for
the data to be collected in survival analysis are size, allocation,
and duration. These components enable us to detect an effect given
its likely magnitude (Cleves, 2016), provided they are carefully
thought through and defined at the outset of the study we intend to
perform.
Let us examine the following example: given a study whose
main goal was to estimate the efficacy of crizotinib in patients with
metastatic non-small cell lung cancer, suppose we had a dataset
comprising 60 individuals with the disease in total. Let us assume
that 40 patients were allocated to the crizotinib arm and the
remaining patients received the standard therapy, and both groups
were followed until the end of the study (e.g., for 24 months). Why
60 individuals and not, say, 30 or 90? Why were the groups
imbalanced? Lastly, why was the follow-up period 24 months and
not, for instance, 12 or 36 months? Let us further assume that the
expected reduction in tumor burden was estimated at 30%, based on
previous pilot studies and the medical literature. In this case, a 30%
reduction in tumor burden would correspond to an HR of 0.7. If we
were to plan this study, what sample size would we need for an 80%
probability (power) of detecting an HR < 1.0 at the 5% significance
level, given a true HR of 0.7? We have to account for and decide
upon these questions before moving on with the study, prior to data
collection.
In survival analysis, three main components related to the
presumed magnitude of an effect must be considered in the planning
phase: sample size (number of subjects), duration of the study (e.g.,
measured in months or years), and allocation of the subjects
between groups. As a rule, more data leads to more precise results.
In survival studies, the focus is on failure events; failure rates that
are too close to zero will not render meaningful or even comparable
results (Cleves, 2016). For sample-size estimation, we will need to
envision a certain number of failures and the duration of follow-up
required to observe that number. If, instead, our study has a fixed
follow-up period, then we will need to define the number of subjects
to be followed so that a given number of them experience the event
before the study ends. In both circumstances, the required length of
follow-up and number of subjects will depend on the probability of
failure (which is presumably not known). That is why we might often
need unbalanced groups to achieve the same number of events over
the same time period (assuming that the experimental arm has a
lower event rate or a longer time to event).
In summary, when using the log-rank test to compare two
survivor curves, we need to specify the expected HR (0.7 in the
example above), the power (say, 80%), the balance between
treatment and control arms, and the expected censoring rate among
controls (since the rate of actual events tends to be known
beforehand for this group, for a given time period, which, in turn, will
define the follow-up period). The latter then leads us to an
anticipated censoring rate in the experimental arm based on the
hypothesized HR and the (unequal) allocation of subjects (this will
allow us to estimate the number of subjects needed to observe a
certain number of expected events). Note that the estimation of the
power to be used is somewhat arbitrary and based on prior
knowledge and clinical judgment.
In survival analysis, inference and power depend directly on
the number of observed events (failures); power and sample size
are directly related, since more subjects usually mean more events.
To estimate the required number of subjects from the required
number of events, we have to estimate the probability of failure over
a given time period of the study. In survival analysis, a variable
proportion of subjects will not experience the event during the
predetermined observation period. Therefore, we need to account for
censoring by estimating the probability of failure during the study to
ensure an appropriate sample size (Cleves, 2016).
Question 123: What if an unexpected event (e.g.,
a natural catastrophe) occurs and increases the
death rate in survival analysis in a prospective
study? How do we address this?
Assuming that a catastrophe occurs and, as a result, a variable
number of subjects drop out, this does not necessarily ruin the entire
study. We can still estimate the survival probability because we will
have complete (and censored) information and data on the survival
of subjects until the time point of the catastrophic event. For
example, we can estimate the median survival time until that time
point (depending on the number of remaining subjects), and other
survival probabilities. Hence, censoring observations due to an
inadvertent event does not necessarily preclude us from performing
a reliable analysis. In such an event, we have to account for a
differential denominator. The steps in the Kaplan–Meier plot will
become wider and wider, owing to the loss of subjects at risk for the
event of interest. Considering that, in this case, the censored values
at the event time point would not be randomly interspersed among
the remaining survival times but would rather constitute a sudden
“drop” in the survival curve and would, thereafter, remain
unobserved, we might potentially underestimate the survival
probabilities beyond that time point.
Question 124: How should the follow-up time be
chosen?
The follow-up time is measured from time zero, or time of origin,
that is, from the start of the study (i.e., at enrollment) or the point at
which the participant is considered at risk for a given event (the
outcome of interest). Of note, although the generic term "survival
time" is used in survival (time-to-event) analyses, it can equally be
applied to the time 'survived' from, say, complete remission to
relapse or progression, or from diagnosis to death. Participants
are then followed up until the event occurs, the study ends (at a
certain calendar date set by the researcher), or the participant is
lost to follow-up, whichever comes first.
In the Kaplan–Meier method, the follow-up time is divided into
intervals, with limits corresponding to the follow-up time between
events, which can be censored or not (Lira, 2020). This enables us
to estimate the probability (or likelihood) that participants at the
beginning of each time interval will experience the event by the end
of each interval. In a clinical trial, time zero is usually considered the
time point of randomization.
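This interval-by-interval estimation can be sketched as follows (a minimal sketch; the six-participant dataset is hypothetical):

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimates.

    times: follow-up time for each participant.
    events: 1 if the event occurred at that time, 0 if censored.
    Returns (event_time, survival_probability) pairs: at each distinct
    event time, the running product is multiplied by (1 - d/n), where
    d = events at that time and n = number still at risk.
    """
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    surv, curve = 1.0, []
    for t in event_times:
        n = sum(1 for ti in times if ti >= t)  # at risk just before t
        d = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        surv *= 1 - d / n
        curve.append((t, surv))
    return curve

# Six participants: the censored observations (0) shrink the risk set
# without counting as events.
curve = kaplan_meier([6, 7, 10, 15, 19, 25], [1, 0, 1, 1, 0, 1])
```

Here survival drops only at the observed event times (6, 10, 15, 25), while the censored times (7, 19) reduce the number at risk for subsequent intervals.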
The total extent of follow-up will depend on the baseline
condition and the actual risk of having an event over time. For
instance, in a prospective cohort study evaluating patients with very
low-risk myelodysplasia and their time to progression to acute
leukemia, this may take years (quite commonly more than 10 years,
and, even so, only in a subset of patients). Therefore, in such a
situation, aside from the potential ethical issues involved, it may be
better to recruit patients with higher-risk disease, or to rethink the
endpoint of interest (a possible alternative would be to set time-to-
first supportive treatment as a more feasible alternative). When
studying cardiovascular disease, in turn, it might not be appropriate
to include patients who are younger than 55 years of age to assess
the time to incident stroke, given that the risk for such an event in
this younger subgroup of patients would be somewhat low. This may
change, though, if we have other relevant risk factors added to “age,”
such as uncontrolled arterial hypertension, hyperlipidemia, stress,
obesity, etc., thus modulating our target population. Hence, age and
comorbidities, among other factors, should be considered when
defining the inclusion criteria and the follow-up time in survival
analysis (such covariates should also be adjusted accordingly using
a regression model).
Nonetheless, survival data are rarely normally distributed; they
tend to be skewed, with a higher proportion of early events and
relatively few late events. These features of the data are the main
reason why special methods are necessary in survival analysis
(Clark, 2003).
Another important point to consider is that the entry time of
each study participant will vary among the sample since subjects are
often recruited into cohort studies and clinical trials for months or
years. It is, therefore, important to record the entry time of each
participant so that the follow-up time is accurately measured. As
discussed previously, for subjects who do not suffer an event or are
lost to follow-up for various reasons, the follow-up time will be
shorter than the time to the actual event. Hence, the true time
to an event is unknown and is thus censored. Bear in mind that few,
or many (depending on the specific condition and related event),
participants may never suffer the event. If it were not for censoring,
that is, if the event of interest occurred in all individuals from a study
sample, many methods of analysis would be applicable (Clark,
2003).
Bear in mind that, depending on the nature of the intervention,
that is, whether it is a surgical or clinical treatment, the pattern of the
survival curve tends to be different in that post-surgical outcomes
tend to have a higher short-term risk (of dying, for instance). In
contrast, post-clinical ones tend to have a later risk for events. For
instance, hematopoietic stem cell transplantation has two types
conditioning regimens (which have an anti-tumor effect and enable
engraftment of the stem cell source): myeloablative and non-
myeloablative. In the former, treatment-related mortality (a time-to-
event outcome) is higher during early follow-up, as opposed to the
latter, wherein a higher mortality rate is expected during long-term
follow-up (usually after 100 days post-transplant) due to relapse of
disease. Therefore, follow-up time might be modulated based on the
timing of the expected events for a given type of intervention.
Question 125: When calculating median survival
time, how much follow-up time is needed?
As a variable, time is typically non-normally distributed in survival
analysis and is therefore summarized and expressed as the median
survival time; that is, time to event is analyzed based on median
values up to the point of interest (Clark, 2003).
The most common summary statistic in the survival analysis
is the median survival time. The median is often preferred over the
mean survival time because it is less biased. If censoring is present,
the mean tends to be underestimated. Similarly, since the distribution
of survival time is typically non-normally distributed and highly
skewed, the mean will generally not be an appropriate
representation of the overall data. However, a median may not be
the best estimator when 50% of a sample has not yet experienced
the event of interest. This may result in the overestimation of survival
times since it relies on the assumption that the slope of the survival
curve over time follows a predictable function (Meier-Kriesche, 2004;
Miot, 2017). In such cases, especially when analyzing early follow-up
times, the restricted mean, and the area under the Kaplan–Meier
curves may be used (this is beyond the scope of this review).
In simple terms, the median survival time can be calculated by
dividing the total number of participants in a study sample by two
and rounding downward; the first-ordered survival time greater than
this number is the median survival time. We can also state it as the
time point at which 50% of the sample has survived (without a known
event). Interestingly, if survival exceeds 50% at the longest follow-up
time point, median survival cannot be calculated and will thus be left
undefined. This also occurs if the survival curve is horizontal at 50%
survival, that is, at 50% on the left (Y) axis. However, this does not
alter the interpretation and p-value computed for the log-rank
comparison of survival curves, as it compares entire curves and not
the actual median survival times. Therefore, the p-value derived from
the log-rank test is still valid even if one or both median survival
times are undefined.
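With a Kaplan–Meier curve in hand, the median can be read off as the first time point at which estimated survival falls to 50% or below; if the curve never reaches 50%, the median is undefined. A minimal sketch (the curve values are hypothetical):

```python
def median_survival(curve):
    """curve: list of (time, survival_probability) pairs in time order.
    Returns the first time at which survival is <= 0.5, or None when
    more than half the sample remains event-free (median undefined)."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None

# Survival first falls to 50% or below at t = 15, so the median is 15.
assert median_survival([(6, 0.83), (10, 0.62), (15, 0.42)]) == 15
# If survival never reaches 50%, the median cannot be calculated.
assert median_survival([(6, 0.9), (10, 0.8), (15, 0.7)]) is None
```

The second case mirrors the point above: an undefined median does not invalidate a log-rank comparison, since that test compares the entire curves.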
Of note, survival analysis often deals with studies in which
each subject is followed for different lengths of time. It is also
important to distinguish between median survival time and median
follow-up time, that is, the time period during which the research
subjects were followed. As mentioned previously, unequal follow-up
between different groups, such as treatment arms, may bias the
analysis. A simple count of participants lost to follow-up may indicate
data incompleteness, but it does not inform us about the actual time
lost (Clark, 2002; Clark, 2003). One should be cautious when there
are too few events in one of the arms in a survival curve with a high
frequency of censoring, since this may mean that the loss to follow-
up may be related to the exposure or outcome of interest (Miot,
2017; Carvalho, 2011; Ma, 2020).
Question 126: Give examples of descriptive
statistics and inferential statistics for survival
analysis
In survival analysis, the outcome can be viewed as the time elapsed
from an initiating event, such as the onset of a disease, until death.
The probability of dying before time t, denoted F(t), is given by the
cumulative distribution function, also called the "cumulative
incidence function." This is a different concept, though, from the
"relative frequency of survival times less than t," given the issue
of censored data. Survival analysis therefore involves inferences
based on incomplete data: it serves the purpose of making
inferences about an underlying, theoretically "completely
observed" population in the presence of censored observations
(Cleves, 2016; Clark, 2003). For this to be feasible and valid,
censoring must be independent in that an individual censored at a
time "t" should be representative of those still at risk at the same
time point. Moreover, censoring should not be biased toward
individuals with a systematically high or low risk of dying.
The Kaplan–Meier method is essentially a descriptive,
nonparametric method that depicts the survival curve of a set of
patients by estimating the survival function directly from the data
(Zhang, 2016; Kaplan and Meier, 1958). Parametric alternatives, by
contrast, impose a functional form on survival: in the exponential
model the hazard is constant over time, while in the Weibull model
the hazard function increases or decreases over time.
The log-rank test, in turn, allows for the inference of
comparative survival patterns between two or more groups in a
univariate fashion, that is, without accounting for any potentially
relevant covariates. This is overcome by the Cox proportional
hazards model, which incorporates intervening variables, as
mentioned previously in this chapter. These methods allow for
inferences (and thus extrapolation) of the results to a broader target
population.
An important point to consider when using these methods is
that survival estimates can be unreliable toward the end of a study
when few subjects are at risk of having an event. Lastly, the Kaplan–
Meier estimator also requires categorical predictors.
Question 127: Can we use historical controls in
the survival analysis?
The use of historical controls in the survival analysis for treatment
evaluation is subject to great controversy. Comparing the probability
of mortality (or survival) of a study group with the mortality expected
under "conventional" historical rates is often challenging
(Keiding, 1987). When the historical rates are derived from a specific
statistical analysis, they usually allow us to fit a specific statistical
model comprising the current and historical data, which could then
be compared, considering their covariates. However, the complete
historical data may not always be available as robust statistical
estimators to conduct such a comparison (namely, of an observed
versus “historically expected” probability of events) (Cochran, 1969;
Cochran, 1973; Gastwirth, 1994). This is particularly true in the case
of defining an expected survival curve and when the potential
censoring time for each individual in the historical cohort is known.
One may decide to use historical-control data to fully or
partially replace concurrent controls. This may be justified, for
instance, by ethical concerns about recruiting patients for control
arms in life-threatening diseases, or by the need to develop effective
treatments for rare or unmet pediatric indications. By reducing the
required sample size of a trial, using historical controls may make
enrollment in rare disease trials more feasible (Ghadessi, 2020).
Concerns about bias might be partially overcome in a survival
analysis study using a historical control if a large treatment effect is
expected (Ghadessi, 2020) or if a trial targeting a life-threatening
disease with no treatment options is proposed. In any case, caution
should be taken when interpreting data from historical cohorts, and
their use should be robustly justified and carefully planned. For instance,
using the existing “Real World Data” sources as the potential
historical (or concurrent) comparative data may lead to inherent
biases, limiting the value for drawing causal inferences in a time-to-
event analysis. In contrast, using data from a completed clinical trial
for intervention (or comparator) arms that share the same
mechanism of action, or the same drug or device from earlier phases
of development, may potentially serve as a highly reliable source of
comparison, given that such data were generated under a controlled
environment. Similarly, placebo controls from previous trials can be a
reliable source of control data for a given indication (Ghadessi,
2020). As an example, an open-label study by Kishnani et al. (2018)
compared invasive ventilator-free survival (as well as survival rate) in
children with Pompe disease (a rare disease with an estimated
incidence of 1/40,000 births) treated with a drug called alglucosidase
alfa against a historical control based on natural history data
collected over a period of 20 years; this study helps to illustrate the
potential usefulness and feasibility of historical-control survival
designs (Kishnani, 2018).
Lastly, whenever an external control group is identified
retrospectively, one should consider that this can lead to selection
bias or a systematic difference among groups (in a wide range of
factors), which could affect the outcome analysis (Ghadessi, 2020).
Most Common definitions in survival analysis
Table 7.3. Summary of the most common definitions in survival
analysis
Survival Term: Definition
Censoring: Incomplete observation of survival times.
Truncation: Incomplete observation of survival times due to a systematic selection process inherent to the study design.
Lead time: Time between the early diagnosis of a specific disease via screening and the time at which the diagnosis would have been made if it were not for screening.
Origin: Starting point of a survival time estimate (e.g., diagnosis of a disease, time point of a specific baseline event, such as a surgical procedure).
Failure rate: Same as hazard rate.
Failure time: Time until an event occurs (interval between origin and event); the same as time-to-event or survival time.
Hazard function: The instantaneous rate of occurrence of an event over time.
Hazard rate: Rate of occurrence of an event over time.
Hazard ratio: Ratio of two hazards (e.g., for two treatment groups); a measure of effect size.
Kaplan–Meier curve: Graphical representation of the survival function, with probabilities estimated using the Kaplan–Meier method.
Kaplan–Meier method: Nonparametric method to estimate survival probabilities over time.
Parametric method: Statistical method that assumes a specific distribution of the outcome variable, including the relationship between covariates and survival.
Semi-parametric method: Statistical method that does not assume a specific distribution of the outcome variable (the survival or hazard function) but does assume a specific relationship between covariates and the outcome.
Non-parametric method: Statistical method that does not require any underlying assumptions regarding the distribution of the outcome variable (within the survival or hazard function) or the functional form of the covariates.
Survival data: Data that describe the time until a predefined event (e.g., death, progression of disease) occurs.
Survival time: Time elapsed between origin (e.g., diagnosis of a disease) and event (e.g., death); the same as failure time or time-to-event.
Survivor function: Probability over time that the event of interest has not yet occurred.
Log-rank test: Non-parametric test that compares two or more survival functions (groups).
Proportional hazards: Assumption used for modelling the hazard function, in which the effect of a covariate is assumed to be the same at all time points; such models can be semiparametric (e.g., the Cox model) or parametric.
Cox regression: A semi-parametric technique commonly used for survival analysis that allows simultaneous assessment of the association between multiple covariates and time-to-event outcomes (also termed the Cox proportional hazards model).
Adapted from: Anesthesia & Analgesia 2018;127(3):792-8 (Schober, 2018).
Question 128: Are there any methods for
assessing the quality of reporting survival
analyses in clinical and observational trials?
The literature has scarcely dealt with the quality of reporting of survival
analyses. Table 7.4 depicts the quality assessment criteria proposed by
Zhu et al. (2017). Table 7.5 offers a simple guide for reviewers (Zhu,
2017).
Table 7.4 Main Factors to Consider for Review of the Reporting of
Survival Analyses Quality
Reporting of the proportional hazard assumption for the
Cox model.
Selection of independent variables for the Cox model.
Interactions between independent variables for the Cox
model.
Collinearity of independent variables for the Cox model.
Whether multivariate analysis was performed.
The number of types of survival analysis.
Reporting of well-defined survival endpoints (e.g., overall
survival).
Reporting of median survival time, survival rate of different
time-periods, and range of survival time.
Reporting of the statistical significance level considered,
along with textual/graphical representation of the results.
Reporting of follow-up information (i.e., end of follow-up
time, follow-up method, median of follow-up time, follow-up
rate, and dealing with missing data).
Reporting of the sample size and the method for calculating
it.
Clear reporting of censored events.
Reporting of the criteria of inclusion and exclusion for the
study sample.
Clear display of survival curves.
Reporting of the statistical package (software) used.
Authorship roles (including statisticians and epidemiologists
among authors).
Adapted from: Medicine (2017) 96(50):e9204 (Zhu, 2017).
Table 7.5. Short guide for reviewers: What to address when
reviewing survival analysis studies?
The object being addressed in the survival analysis: Was the purpose of the survival analysis clearly stated? Was censoring adequately described? Were considerations about competing risks well described?
The method used to estimate and compare survival functions (e.g., the Kaplan–Meier method, the log-rank test, and the Cox model): Is the method appropriate considering the nature of censoring and the need to adjust for covariates? Were proportionality assumptions assessed if a Cox model was used? What steps were followed if the proportionality assumption was not met? How were competing risks accounted for?
The analytical and graphical reporting of the survival data: Was the measure used to describe the effect of the intervention on the outcome appropriate (e.g., hazard ratio, relative risk)? Was the reporting relevant to the objective of the study?
Adapted from: CHEST 2020; 158(1S):S39-S48 (Dey, 2020).
Question 129: Is survival analysis a non-
parametric test? Are there any parametric
alternatives? Given censoring, how does the
outcome differ depending on the type of analysis
used (survival analysis vs. regular parametric
statistical tests)?
There are three types of methods to analyze survival data:
Non-parametric methods (e.g., the Kaplan-Meier estimator and the log-rank test): these impose no assumptions on the distribution of survival times (i.e., no specific shape of the survival function or hazard function). In addition, they assume no specific relationship between covariates and the survival time.
Semiparametric methods (the Cox proportional hazards model): these also make no assumptions regarding the distribution of survival times, but assume a specific relationship between covariates and the hazard function, and hence the survival time.
Parametric methods: these assume a distribution of the survival times and a functional form of the covariates (Schober, 2018).
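To make the parametric case concrete, the following minimal Python sketch fits the simplest parametric model, the exponential, to made-up (hypothetical) follow-up data: under a constant hazard, the maximum-likelihood estimate of the rate is the number of observed events divided by the total time at risk, and the survival function is S(t) = exp(-lambda * t).

```python
import math

# Hypothetical follow-up times in months and event indicators
# (1 = event observed, 0 = right-censored).
times = [5.0, 8.0, 12.0, 12.0, 3.0, 20.0, 7.0, 12.0]
events = [1, 1, 0, 1, 1, 0, 1, 0]

# Under an exponential model, the MLE of the hazard rate is
# (number of events) / (total time at risk), censoring included
# in the denominator.
rate = sum(events) / sum(times)

def survival(t, lam):
    """Exponential survival function S(t) = exp(-lambda * t)."""
    return math.exp(-lam * t)

print(round(rate, 4))                # estimated constant hazard per month
print(round(survival(12, rate), 3))  # estimated probability of surviving 12 months
```

Note how the distributional assumption (a constant hazard) lets us evaluate S(t) at any time point, including beyond the observed data, which is exactly the extrapolation that parametric models allow and non-parametric methods do not.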
The research question of interest should guide the choice of the
method. Sometimes, we can use more than one approach for the
same analysis. Table 7.6 depicts the most commonly used methods
and their corresponding pros and cons.
Table 7.6. Most Common Types of Survival Analysis Models (Clark,
2003; Lira, 2020)

Kaplan-Meier method
Pros: provides an estimate of the survival probability at each point in time; has very few assumptions; confidence intervals for the Kaplan-Meier curve are generally valid under very few, easily met assumptions; quartiles and medians (along with their confidence limits) can easily be extracted from the KM curve.
Cons: purely descriptive method; does not include adjustment for covariates; requires categorical predictors.

Log-rank test
Pros: provides a direct comparison of the Kaplan-Meier curves for two or more groups; can be considered a one-way ANOVA for survival analysis.
Cons: does not work well in more complex settings (e.g., with continuous covariates and interactions).

Cox proportional hazards (PH) model
Pros: as a semi-parametric method, it is flexible and allows adjustment for important covariates and confounders.
Cons: relies on the underlying proportional hazards assumption, which may limit its applicability in certain settings.

Parametric models (accelerated failure time models)
Pros: allow for more robust conclusions about survival patterns over time and for possible (and careful) extrapolation of the results beyond the range of the observed data.
Cons: rely on a strong assumption regarding the survival distribution of the data (e.g., exponential or Weibull distributions); the chosen distribution affects the reliability of the results and the validity of any extrapolation.

Frailty (cluster) models
Pros: account for correlation within a group by adding a frailty term (a random effect), which may be useful for analyzing multiple events in the same individual.
Cons: more complex method (possible issues when trying to fit and interpret the random effects involved).

Competing risk models
Pros: allow for partitioning of the events within the survival model into discrete competing events; enable assessment of the effect of different factors on the risk of an event and of the concomitance of competing causes.
Cons: simple relationships between explanatory variables and cause-specific hazards do not lead to simple relationships between explanatory variables and cumulative incidences (there is no direct link, as there is for the Cox model regarding all-cause mortality). This may be achieved with the Fine and Gray model.

Discrete time model (logistic regression)
Pros: easy to interpret (binary outcome).
Cons: adds potential complexity to the analysis, since time is measured as a discrete variable (rather than a continuous one, as in the other most commonly used models); the Cox PH model is inappropriate due to tied survival data; needs special data management to account for censoring.
Question 130: What is the difference between
survival function and hazard function?
The survival (survivor) function, denoted by S(t), indicates the
probability over time that the event of interest has not yet occurred.
The Kaplan-Meier estimator can be used to estimate the survival function.
The survival function S(t) is calculated as follows:

S(t) = (Number of individuals surviving longer than t) / (Total number of individuals studied)

The survival probability at any particular time is calculated as follows:

S(t) = (Number of individuals living at the start - Number of individuals who died) / (Number of individuals living at the start)

where t is the survival time, i.e., the time to the event of interest
(such as relapse or death).
The survival curve is constructed by calculating the conditional
probability of surviving each successive time interval and multiplying
it by the probabilities of surviving all earlier intervals (the
product-limit method).
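The product-limit calculation can be sketched in a few lines of Python. The event times below are hypothetical; censored subjects reduce the number at risk without contributing an event, which is how the method accommodates incomplete follow-up.

```python
# Hypothetical follow-up times (months) with censoring indicators
# (1 = event, 0 = censored).
data = [(2, 1), (3, 0), (4, 1), (4, 1), (6, 0), (8, 1), (9, 0), (10, 1)]

def kaplan_meier(data):
    """Return [(time, S(t))] using the product-limit formula:
    S(t) = product over event times t_i <= t of (1 - d_i / n_i),
    where d_i = events at t_i and n_i = number still at risk."""
    data = sorted(data)
    s = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        # events (d) and all subjects leaving (events + censored) at time t
        d = sum(1 for tt, e in data if tt == t and e == 1)
        leaving = sum(1 for tt, _ in data if tt == t)
        at_risk = sum(1 for tt, _ in data if tt >= t)
        if d > 0:  # the curve only drops at event times, not at censorings
            s *= (1 - d / at_risk)
            curve.append((t, s))
        i += leaving
    return curve

print(kaplan_meier(data))
```

With these data, the curve drops at t = 2, 4, 8, and 10 (the event times) and stays flat at the censoring times 3, 6, and 9, whose subjects are simply removed from the risk set.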
The hazard function, also known as the hazard rate and denoted by h(t),
indicates the instantaneous rate of the event of interest over time. In
other words, the hazard function shows the instantaneous potential
for having the event of interest at time t.
Cox proportional hazards regression is based on the hazard function.
Of note, the hazard function is sometimes called the conditional failure
rate. It is essential to underscore that the hazard function is not the
same as the hazard ratio.
The hazard function differs from the survival function in that the
former focuses on the event of interest occurring, whereas the
survival function focuses on the event of interest not occurring.
Question 131: What are the assumptions in
survival analysis, and how do we test these
assumptions?
Kaplan–Meier survival plots are one of the most commonly used
methods for presenting survival data by graphically depicting survival
changes over time. One of their main underlying assumptions is that
censoring is independent of survival (i.e., noninformative).
As a non-parametric estimator, no assumptions are made
about the distribution of the survival times per se. The number of
observed events at each time point and the remaining number of
patients at risk are used to calculate an overall probability of
surviving to a given time point. This overall probability decreases as
more events are observed.
Some critical assumptions of both the Kaplan-Meier estimator
and the log-rank test (which compares the observed and expected
numbers of events across two or more survival curves) are shown in
Table 7.7 (Clark, 2003).
Table 7.7. Key assumptions for using the Kaplan-Meier survival
estimator (Goel, 2010; Clark, 2003; Etikan, 2018)

Kaplan-Meier estimator:
- Time is most likely not normally distributed.
- A variable proportion of the study population will have censored observations.
- The survival probability is the same for censored and uncensored subjects.
- The likelihood of occurrence of an event is the same regardless of the timing of enrolment (i.e., whether study participants are enrolled early or late during the course of the study).
- The probability of censoring is the same across the groups that are being compared.
- The event is assumed to occur at a defined point in time.

Log-rank test:
- The survival times are ordinal or continuous.
- The risk of an event in one group relative to the other does not change with time (the so-called proportional hazards assumption). Thus, if linoleic acid reduces the risk of death in patients with colorectal cancer, this risk reduction does not change with time.
Cox model assumptions: aside from the assumptions that apply to all
survival analyses, such as noninformative censoring (described above),
the main underlying assumptions of the Cox model include (Dessai, 2019):
1) A linear relationship between the covariates and the log-hazard
2) The proportional hazards assumption
Question 132: What is the difference between RR,
OR, and HR?
Relative risk (RR), or risk ratio, is the risk of an exposed individual
developing a disease relative to the risk of an unexposed individual
acquiring the same disease.
RR = (Risk for disease among exposed participants) / (Risk for disease among unexposed participants)

It can be derived from a 2x2 table using the following formula:

RR = (a/(a+b)) / (c/(c+d))
                Disease
                +        -        Total
Exposure   +    a        b        a+b
           -    c        d        c+d
Total           a+c      b+d      N

Figure 7.2. General 2x2 contingency table. Cell a represents the number of
individuals who have the disease and were exposed to the factor. Cell d
represents the number of individuals who do not have the disease and were not
exposed to the factor. N, the sum of a, b, c, and d, represents the total sample
size.
Interpretation of the RR values:
RR = 1: the exposure is not associated with the risk of developing the outcome.
RR > 1: the exposure is positively associated with developing the outcome.
RR < 1: the exposure is associated with a decreased probability of developing the outcome, i.e., a protective effect.
RR can be determined from cohort studies. However, it cannot be
determined from case-control or cross-sectional studies, because the
cumulative incidence of an event (i.e., the number of new patients
with a disease during a specified period of time divided by the total
number of participants at risk) cannot be calculated from these study
designs.
In such situations, the RR is approximated by the odds ratio (OR),
which can also be derived from a 2x2 table using the following
formula:

OR = (a/c) / (b/d) = ad/bc

The interpretation of the OR is broadly the same as that of the RR.
The OR is defined as the odds of the outcome among exposed
participants divided by the odds of the outcome among unexposed
participants.
In addition, the OR and RR tend to be quite similar to one another
when a disease has a low probability (risk) of occurring, which
results in small numbers for a and c in the formula above
(Djannatian, 2018).
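A quick Python sketch, with hypothetical 2x2 counts, illustrates both formulas and the rare-disease behavior just described:

```python
def risk_ratio(a, b, c, d):
    """RR = (a/(a+b)) / (c/(c+d)) from a standard 2x2 table."""
    return (a / (a + b)) / (c / (c + d))

def odds_ratio(a, b, c, d):
    """OR = (a/c) / (b/d) = (a*d) / (b*c)."""
    return (a * d) / (b * c)

# Hypothetical common outcome: the RR and OR diverge.
print(risk_ratio(40, 60, 20, 80))   # RR = 0.40 / 0.20 = 2.0
print(odds_ratio(40, 60, 20, 80))   # OR ~ 2.67

# Hypothetical rare outcome: the OR closely approximates the RR.
print(risk_ratio(4, 996, 2, 998))   # RR = 2.0
print(odds_ratio(4, 996, 2, 998))   # OR ~ 2.004
```

The second pair of calls shows the rare-disease assumption in action: when a and c are small relative to b and d, a+b is close to b and c+d is close to d, so the two measures nearly coincide.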
The hazard ratio (HR) is the ratio of two hazard rates, where a hazard
rate is the rate of occurrence of the event of interest during a
specified period. An example would be the ratio of the hazard rate
of progression of non-small cell lung cancer in the crizotinib arm to
that in the standard chemotherapy group (as alluded to in the
introduction section of this chapter). The HR can be obtained from
the coefficients of the hazard (e.g., Cox regression) model.
Survival Analysis: Practical Aspects
Question 133: How to avoid bias in survival
analysis? How does censoring introduce bias in
a study? What kind of bias do we have when we
ignore the censored cases? What is attrition or
detection bias? What is lead-time bias in survival
analysis?
Bias is a common concern in survival analysis. Frequently, bias
results from inappropriate sampling strategies or time-dependent
intermediate events. Although this may be partly prevented by good
planning, standardized data collection, and proper statistical
approaches, some degree of bias may be inevitable if the temporal
dynamics of an event in a time-to-event analysis are not adequately
modeled. It is important to underscore that some of the biases seen
in survival analysis are also common to other study designs, but there
are also a few biases specific to survival analysis, as will be
discussed below.
An intrinsic feature of a time-to-event analysis is that the exact
onset of an event (say, a given disease or medical condition) is
somewhat imprecise or unknown, given that the timepoints at which
subjects are seen or assessed for an event are invariably
intermittent.
Even though Kaplan-Meier survival analyses represent the
most objective measure of treatment efficacy in oncology and other
fields, they are still subject to potential bias (Kaplan and Meier, 1958).
Before we define bias in survival analysis, we must first recall the
concept of competing risk and censoring.
Survival bias refers to all types of bias that occur in a time-to-
event analysis if time is incorrectly modeled (Wolkewitz, 2010). "Class
imbalance" can be a significant issue in survival analyses when the
number of individuals diagnosed with a disease is far outnumbered
by those who remain undiagnosed within a specified time frame.
An example would be the incidence rate of the progression of
the “chronic-phase” in chronic myeloid leukemia patients to the “blast
phase” of the disease on a tyrosine-kinase inhibitor therapy. The
overall response rate to this targeted therapy is such that the
progression of the disease is typically sporadic throughout a study.
This imbalance may result in biased (inaccurate) risk predictions for
subjects who are more likely to progress to acute leukemia over
time.
Another important situation that could lead to invalid results in
survival analysis is the existence of a competing risk. A competing
risk is an event that precludes the primary event of interest. For
example, when evaluating mortality due to lung cancer, in real life,
patients can die due to a cardiovascular event, stroke, an accident,
or any other cause that is not the main event under study. All these
causes of death “compete” with each other to deliver the event, and
the occurrence of one prevents the occurrence of the others.
Competing risks are quantified in terms of the probability of each of
these events happening.
Ignoring competing risks can lead to biased estimates of cumulative
incidence (using the Kaplan–Meier complement) and predicted risk
(using Cox regression) (Abdel-Qadir, 2018). In other words, the
Kaplan–Meier method is not designed to accommodate multiple
competing causes of the same event.
In this case, a competing risk analysis is needed, in which the
researcher correctly estimates the probability of an event in the
presence of competing events.
Similar to the Kaplan–Meier estimator, the cause-specific approach
estimates the probability separately for each type of event: in
competing risk analysis, the competing events are treated as censored
data, in addition to those censored due to loss to follow-up or
withdrawal. This method of estimating event probability is called the
cause-specific hazard function.
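As a rough illustration of the cumulative incidence calculation under competing risks, the following Python sketch (with made-up data; the cause codes are hypothetical) accumulates, at each event time, the all-cause survival just before that time multiplied by the cause-specific event proportion:

```python
# Hypothetical competing-risks data: (time, cause),
# where cause 0 = censored, 1 = event of interest, 2 = competing event.
data = [(1, 1), (2, 2), (3, 0), (4, 1), (5, 2), (6, 1), (7, 0), (8, 1)]

def cumulative_incidence(data, cause):
    """Cumulative incidence function (CIF) for `cause`:
    at each time t with events of that cause, add
    S(t-) * d_cause / n, where S(t-) is the all-cause
    Kaplan-Meier survival just before t and n is the risk set size."""
    data = sorted(data)
    s_prev = 1.0  # all-cause survival just before the current time
    cif = 0.0
    curve = []
    for t in sorted({tt for tt, _ in data}):
        at_risk = sum(1 for tt, _ in data if tt >= t)
        d_all = sum(1 for tt, c in data if tt == t and c != 0)
        d_cause = sum(1 for tt, c in data if tt == t and c == cause)
        if d_cause > 0:
            cif += s_prev * d_cause / at_risk
            curve.append((t, cif))
        if d_all > 0:  # events of ANY cause deplete the all-cause survival
            s_prev *= (1 - d_all / at_risk)
    return curve

print(cumulative_incidence(data, cause=1))
```

Because events of the competing cause deplete the all-cause survival term, the resulting CIF is lower than the naive 1 minus Kaplan-Meier estimate that treats competing events as ordinary censorings, which is exactly the overestimation discussed above.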
The types of bias commonly seen in survival analysis are
summarized in Table 7.8; some are typical of time-to-event analysis.
Time-dependent bias is common and frequently affects key
factors and the study's conclusions. In survival analysis,
time-dependent factors that are “immeasurable at baseline” cannot be
recorded at baseline and change value after patient observation starts.
Time-dependent bias can occur if such variables are not analyzed
appropriately (Walraven, 2004).
Although selection bias can be interpreted differently, a
standard definition is a bias that arises when the study population
has not been randomly selected from the target population for a
given study. Hence, Individuals who are recruited do not represent
the target population, which undermines the generalizability of the
findings reported. This issue will mainly depend upon the recruitment
strategy and eligibility criteria defined for an interventional or
observational trial. Such cases are biased by design.
Selection bias may also result from differing rates of study
participation depending on the subjects' cultural background, age, or
socioeconomic status, for instance. The other definition for selection
bias is in observational studies when subjects are selected to the
intervention according to clinical characteristics.
Although RCTs (randomized controlled trials) have the
particular attribute of randomization, which is basically aimed at
achieving a comparable distribution of both known and unknown
factors in two (or more) groups that are being compared, this is
rarely possible in observational studies, in which confounding, if not
appropriately adjusted and accounted for, may lead to erroneous or
spurious results. Observational studies are thus mostly evaluated
with regression models.
Broadly speaking, in survival analysis, potential confounders,
or covariates (aside from the risk factor(s) under the study), are
accounted for as explanatory risk factors using some type of
regression model, most commonly the Cox proportional hazards
model, as described above. Thereby, the effect of a potential
confounder can be checked by comparing the results from different
models, with and without incorporating the confounder. In short, the
effects of individual factors are adjusted for the others.
Of note, "survivor treatment selection bias” is a specific type of
time-dependent bias that occurs in survival analyses, whereby
patients who live longer are often more likely to receive treatment
than patients who die early. In this context, ineffective treatment may
inadvertently appear to prolong survival (Sy, 2009). In addition, time-
dependent bias is a common source of bias when interpreting
observational data and could have an important effect on the study
results (van Walraven, 2004).
Attrition bias refers to the systematic error caused by uneven
loss of subjects in an RCT for different reasons. Participants
might leave the trial due to adverse effects, unsatisfactory treatment, or
death. As mentioned before, in survival analysis, since censoring is a
critical feature of such an analysis, this would be equivalent to the
potential bias derived mainly from right-censoring, as discussed
above.
Information bias (e.g., selective recall or inconsistent data
collection), detection or measurement bias, confounding, and
other factors may contribute to biased estimates of
survival. Notably, when a hard endpoint is used in survival
analysis, such as death (as a binary outcome), this will eliminate the
risk of bias derived from inaccurate reporting, measurement,
information, or detection. Likewise, blinding is not an issue when
hard outcomes are used. This might not be the case when softer
outcome measurements are used (such as progression-free survival)
since these might be subject to biased estimates.
Question 134: How to avoid or overcome bias in
survival analysis?
Besides the methods explained above related to some specific types
of bias, more complex strategies involving specific techniques
implemented in statistical software have been reported in the
medical and biostatistical literature to identify possible approaches to
avoid or overcome the risk of bias when analyzing survival outcomes
in a population. It is important to consider the different types of bias
during the study design phase so that they can be correctly addressed.
Survival analyses (and epidemiologic studies as a whole) are
generally focused on estimating the distribution of the intervals from
the onset of a disease or given condition to the event of interest (say,
death) and comparing the distributions of these survival times
between two or more well-defined groups. When survival data are
obtained from a prevalent cohort study, the included cases will have
already experienced a given baseline condition (or "initiating event").
These prevalent cases are then followed for a predefined period of
time, at the end of which the subjects will either have failed or have
been censored. In such cases, when estimating the survival
distribution, from the onset, the survival times of the cases in a
prevalent cohort studyare left truncated (Asgharian, 2010).
In a prospective cohort or clinical trial, it is possible to follow all
subjects over time and apply standard survival analysis techniques.
Quite often, though, the onset of the disease is identified in subjects
drawn from a cross-sectional (prevalent cohort) study at some fixed
point in time. Notably, the survivor function corresponding to these
data is said to be "length biased" and different from the survivor
function derived from incident cases. (Asgharian, 2002) To further
illustrate this, when assessed through a cross-sectional study, those
who have survived to that time are recruited into the study, whereas
those who have not will not be included in this initial recruitment
phase (and will not even be identified). The recruited subjects are
then followed until a second-time point, corresponding to the
predefined end of the study. Some of these subjects will have
developed the event by the end of follow-up, whereas others will
have censored failure times for several reasons (e.g., survival until
the end of the study). For every subject included, an onset date (of
the disease or condition) is recorded. Therefore, the data on each
subject includes the dates of onset and failure/censoring for the
recruited subjects. The intervals from the onset of disease (or other
predefined condition) to failure/censoring are considered to be
"length biased". In other words, the time intervals that are actually
observed tend to be longer than those arising from the true
underlying failure and censoring distributions.
Lead-time bias, in turn, may occur when the early diagnosis of
a disease falsely makes it seem like subjects are surviving longer.
This is commonly seen in the context of cancer screening. Take, for
instance, a female patient who dies of metastatic breast cancer at a
given point in time. Let us say her cancer was diagnosed two years
ago. In this case, it would seem that she had survived with the
disease for two years. Instead, suppose that her breast cancer had
been noted six years earlier. This would result in her actual survival
time being eight years with the disease, even though no real change
in survival actually took place.
Table 7.8. Main types of bias in Kaplan-Meier survival analysis

Attrition bias: bias typically caused by unequal loss of participants between study arms.
Competing risk bias: when competing risks are ignored, estimates of cumulative incidence (e.g., using the Kaplan–Meier complement) and predicted risk (e.g., using Cox regression) tend to be overestimated.
Censoring bias: bias resulting from various types of censoring, particularly if the censoring rate is high (typically denotes right censoring).
Time-dependent bias: occurs when unknown baseline time-dependent factors are not accounted for accordingly (Walraven, 2004).
Length (time) bias: systematic error resulting from detection of a disease with a long latency or pre-clinical period (e.g., prostate cancer).
Lead time bias: observed survival time is increased even if there was no real effect of screening on absolute survival time.
Survivor treatment selection bias: type of time-dependent bias in which patients who live longer are often more likely to receive treatment than patients who die early.
Survival bias: all types of bias which occur in a time-to-event analysis if time is incorrectly modelled.
Note: Length bias and time-dependent bias have opposite effects in
terms of the direction of the estimation bias.
Almost half of the survival analyses available in the medical
literature are prone to “competing risk bias,” which leads to an
overestimation of event risk, or to informative censoring bias
(Walraven, 2015). Generally speaking, right-censoring may lead to
biased results by inflating the estimated survival, as measured by
the hazard ratio, when compared with complete follow-up datasets
(McNamee, 2017; Prasad, 2015).
Question 135: Can survival analysis be done with
all types of variables? Can we use all types of
variables (e.g., continuous outcomes) using the
hazard function? How to interpret it? Can we use
it with discrete variables? Is the outcome always
binary in survival analysis?
Survival analysis typically considers an event as binary, by recording
the first time at which a subject has (or does not have) the event of
interest (or at which the outcome is otherwise censored). As a rule,
this disregards any additional information potentially contained within
continuous variables, which may have both power and precision
implications. Sometimes, time-to-event outcomes involve continuous
components crossing a threshold.
At first glance, it may appear that a research question about the
length of a time-to-event interval involves an essentially continuous
outcome variable that could be addressed by linear regression or
related techniques, such as the t-test or analysis of variance.
However, a key feature of survival time, as compared to other
continuous variables, is that, as mentioned before, the event of
interest (e.g., progression of a disease or death) will usually have
occurred in only a subset of patients by the time a
study ends. Therefore, full survival times are unknown for those who
survive until the end of the study period or are lost to follow-up
before the end of the observation period. This way, survival analysis
typically deals with a combination of a binary outcome (i.e., if an
event has occurred or not) and a continuous outcome (i.e., when it
occurred), which are thus referred to as time-to-event or
survival/failure time data. Regardless, for the purpose of this type of
analysis, the outcome will be treated as a categorical, binary
outcome (e.g., death or not).
Most commonly, survival analysis uses models that assume
that time is measured continuously. However, time can also be
analyzed within discrete intervals during which the event of interest
can occur; in such cases, compared to other survival analysis methods,
the outcome variable is treated as a discrete variable.
Cox models are not suitable for analyzing time as a discrete
variable, since this implies too many ties (e.g., too many subjects
sharing the same number of time-intervals in a study measured in
semesters) (Rodriguez, 2007). In such cases, logistic regression
may be used, provided the data set up is able to incorporate
censoring. For example, suppose we were studying time until
mastering a certain surgical ability within a group of trained
surgeons, as compared to that of non-trained surgeons, but we only
knew the quarter of the year in which an individual acquired the
related technique (e.g., the first quarter of the year). In this case, we
could analyze “time to mastering the surgical technique” using
discrete-time survival analysis. Discrete-time survival analyses tend
to be useful when we have to account for covariates that vary over
time for a significant number of subjects (e.g., certain socioeconomic
factors, such as family income). In such cases, discrete-time hazard
(survival) models, such as some parametric models and extensions of
the Cox model, are used to partition the data into discrete intervals
defined by the varying covariates.
If, on the other hand, we are analyzing tied survival times (as in
quarter-year intervals: 0, 3, 6, and 12 months), we may consider
treating the event outcome as a discrete
variable. Nonetheless, survival time is commonly analyzed in, say,
"person-years", in which case continuous-time methods like Cox
proportional hazards are used (Rodriguez, 2007). As a rule, the
choice of the survival model (and thus the outcome variable) should
be guided by the underlying phenomenon (i.e., whether it is
intrinsically discrete or continuous). Most commonly, there will be
tied survival times in most survival analyses. In such cases, even
though the data is collected in a somewhat discrete manner, the
outcome per se will be continuous (especially if the discretedata can
be subdivided into several parts, say ten or more). Of note, when
using continuous-time models, we may have interval censoring,
whereby the exact failure time is unknown (instead, only an interval
in which the failure occurred is known). This may be further dealt
with by fitting parametric regression models with interval censoring
using maximum likelihood estimators, but this is beyond the scope of
this chapter (Rodriguez, 2007).
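The data setup for a discrete-time (logistic regression) survival analysis can be sketched as a person-period expansion, which is the "special data management" mentioned above. The subjects below are hypothetical, with time measured in quarters, echoing the surgical-training example:

```python
# Hypothetical subjects: (quarters observed, event indicator, group).
# event = 1 means the event occurred in the last observed quarter;
# event = 0 means the subject was censored after the last quarter.
subjects = [
    (3, 1, "trained"),
    (4, 0, "trained"),
    (2, 1, "untrained"),
    (4, 1, "untrained"),
]

def person_period(subjects):
    """Expand each subject into one record per discrete time interval,
    with a binary outcome per interval. Censored subjects contribute
    only 0-outcome records, so censoring is handled naturally."""
    rows = []
    for periods, event, group in subjects:
        for q in range(1, periods + 1):
            outcome = 1 if (event == 1 and q == periods) else 0
            rows.append({"quarter": q, "group": group, "event": outcome})
    return rows

rows = person_period(subjects)
print(len(rows))                      # 3 + 4 + 2 + 4 = 13 person-quarters
print(sum(r["event"] for r in rows))  # 3 events
```

Each expanded row can then be fed to an ordinary logistic regression with indicators for the quarter (the discrete-time baseline hazard) and the group as covariates; time-varying covariates simply take their value in each row.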
Question 136: How to choose between using the
event or the time cut-off point for the follow-up?
Survival plots generally require some choice of a cut-off for display
and interpretability. However, there are no standard methods or tools
to help choose between using the event or the time cut-off point in
survival analysis. Thus, determining both the “optimal” number and
exact cut-off points for survival data (and dichotomized outcomes)
depends on the nature of the study itself. For example, if the
probability of an event over time is high, it may be wise to pick an
earlier time point as the cut-off for the analysis. When picking a
cut-off for displaying survival plots, the median is often used, given
its robustness amid skewed data; therefore, it should be the preferred
option unless a clinical consideration dictates a specific cut-off for
a planned analysis. Lastly, setting a cut-off point earlier along a
survival curve might help reduce the chance of a high censoring rate
if this is an expected issue.
Question 137: What are the advantages and
disadvantages of survival analysis compared to
other methods, such as linear
regression/ANOVA?
Survival analysis deals with a number of statistical models, which
may depend upon slightly different data and study design situations.
Choosing the most appropriate model for which type of data can be
quite challenging. Regardless of the method we choose, we will be
testing how predictor variables predict an outcome variable that
measures the time until a given event. These methods are all based
on important key concepts in any time-to-event analysis, such as
censoring, survival functions, hazard function, and cumulative
hazards. Survival analysis, with its unique focus on censoring,
comprises various methods, ranging from simple approaches, with
very few assumptions, to more complex ones, depending on the
data, with a larger set of assumptions. Importantly, when performing
a survival analysis, we must account for the specific model
assumptions and bear in mind that our results might be misleading if
these assumptions are not met (George, 2014).
Among all approaches comprised within different survival
analysis methods, the Kaplan-Meier method estimates the survival
curve (descriptive statistics), the log-rank test compares two survival
curves without accounting for covariates, the hazard ratio is a
measure of association (similarly to the relative risk), and the Cox
model analyzes the survival function by incorporating relevant
predictors (covariates) (Botelho, 2009; Lira, 2020).
The main reason why we cannot reliably compare mean
time-to-event between groups using a t-test or a linear model
(such as ANOVA) is that these methods assume a normal
distribution of the time-related outcome (which is typically
non-normal and often highly skewed) and, most importantly, do not
account for censoring.
In addition, it would not be appropriate to compare the
proportion of events in two or more groups by simply using risk/odds
ratios as our measure of effect, nor could we use logistic regression
since these methods ignore time (to an event). If we need to fit a
regression model to our survival data, we will have to make a
distributional assumption about our data.
Table 7.9 compares some of the main features related to the
most commonly used survival analysis methods (e.g., in Oncology),
along with some of those features related to perhaps broader
statistical methods.
Table 7.9. Comparison between “classic” survival analysis and
“classic” statistical methods in Clinical Research

Feature                                                   | Classic Survival Analysis          | Classic Statistics
Dependent variable                                        | Time-to-event outcome              | Occurrence
Measure of association                                    | Hazard ratio                       | Relative risk, odds ratio
Graphical presentation of results                         | Survival table, Kaplan-Meier curve | Table
Test for comparing groups (univariate analysis)           | Log-rank                           | Chi-square and Fisher test
Test for adjusting for covariates (multivariate analysis) | Cox proportional hazards model     | Logistic regression

Legend: ANOVA: Analysis of Variance.
(Adapted from: Arquivos Brasileiros de Oftalmologia, 83(2), V-VII.
Epub March 09, 2020 (Lira, 2020))
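The log-rank comparison listed above can also be sketched in a few lines of plain Python: at each event time, we compare the events observed in one group with the number expected under the null hypothesis of identical survival, then sum across event times. The function below is our own illustrative implementation (not library code), applied to hypothetical data:

```python
def logrank_chi2(times1, events1, times2, events2):
    """Log-rank chi-square statistic (1 df) comparing two survival curves.
    times: follow-up times; events: 1 = event observed, 0 = censored."""
    g1 = list(zip(times1, events1))
    g2 = list(zip(times2, events2))
    event_times = sorted({t for t, e in g1 + g2 if e == 1})
    o_minus_e = 0.0   # summed (observed - expected) events in group 1
    var = 0.0         # summed hypergeometric variance
    for t in event_times:
        n1 = sum(1 for tt, _ in g1 if tt >= t)   # group 1 still at risk
        n2 = sum(1 for tt, _ in g2 if tt >= t)   # group 2 still at risk
        n = n1 + n2
        d1 = sum(1 for tt, e in g1 if tt == t and e == 1)
        d2 = sum(1 for tt, e in g2 if tt == t and e == 1)
        d = d1 + d2
        o_minus_e += d1 - d * n1 / n             # expected events in group 1
        if n > 1:
            var += d * (n1 / n) * (n2 / n) * (n - d) / (n - 1)
    return o_minus_e ** 2 / var

# Hypothetical groups: all events in group 1 occur earlier
print(logrank_chi2([1, 2], [1, 1], [3, 4], [1, 1]))
```

The resulting statistic is referred to a chi-square distribution with one degree of freedom, just as the table contrasts it with the classic chi-square test, but here censored subjects would be handled correctly through the shrinking risk sets.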
See Table 7.6 “Most Common Types of Survival Analysis Models”
from question 129, which describes the most common survival
analysis methods drawn from the medical literature.
References for Chapter 7
1. Abdel-Qadir H, Fang J, Lee DS, et al. Importance of
Considering Competing Risks in Time-to-Event Analyses:
Application to Stroke Risk in a Retrospective Cohort Study of
Elderly Patients With Atrial Fibrillation. Circ Cardiovasc Qual
Outcomes. 2018;11(7):e004580.
doi:10.1161/CIRCOUTCOMES.118.004580.
2. Asgharian M, M’Lan CE, Wolfson DB. Length-biased
sampling with right censoring: an unconditional approach. J
Am Stat Assoc. 2002;97(457):201–209. doi:10.2307/3085775.
3. Asgharian M. Survival analysis and length-biased
sampling. Croatian Operational Research Review.
2010;1(1). https://hrcak.srce.hr/index.php?
show=clanak&id_clanak_jezik=137536. Accessed April 29,
2021.
4. Bagust A, Beale SJ. Exploring the effects of early censoring
and analysis of clinical trial survival data on effectiveness and
cost-effectiveness estimation through a case study in
advanced breast cancer. Med Decis Making. 2018;38(7):789-
796. doi:10.1177/0272989X18790966.
5. Balakrishnan N, Basu AP. The Exponential Distribution
Theory, Methods and Applications. Netherland: Gordon and
Breach Publishers; 1995.
6. Barrajón E, Barrajón L. Effect of right censoring bias on
survival analysis. arXiv website.
https://arxiv.org/abs/2012.08649v1. Accessed: April 29, 2021.
7. Bewick V, Cheek L, Ball J. Statistics review 12: survival
analysis. Crit Care. 2004;8(5):389-394. doi: 10.1186/cc2955.
8. Botelho F, Silva C, Cruz F. Epidemiologia explicada-análise
de sobrevivência. Acta Urol. 2009;26(4):33-8 [Portuguese].
9. Bradburn MJ, Clark TG, Love SB, Altman DG. Survival
analysis part III: multivariate data analysis—choosing a model
and assessing its adequacy and fit. Br J Cancer.
2003;89(4):605–611. doi:10.1038/sj.bjc.6601120.
10. Buyze J, Goetghebeur E. Crossover studies with
survival outcomes. Stat Methods Med Res. 2013;22(6):612-
629. doi:10.1177/0962280211402258.
11. Carvalho MS, Andreozzi VL, Codeço CT, Campos DP,
Barbosa MTS, Shimakura SE. Análise de sobrevivência:
teoria e aplicações em saúde. Rio de Janeiro: Editora
FIOCRUZ; 2011.
12. Clark T, Bradburn M, Love S, et al. Survival Analysis
Part I: Basic concepts and first analyses. Br J Cancer.
2003;89:232-238. doi:10.1038/sj.bjc.6601118.
13. Clark TG, Altman DG, De Stavola BL. Quantifying the
completeness of follow-up. Lancet. 2002;359(9314):1309-
1310. doi:10.1016/s0140-6736(02)08272-7.
14. Cleves M, Gould WW, Marchenko YV. An Introduction
to Survival Analysis Using Stata, Revised 3rd Edition. Stata
Press; 2016.
15. Cochran WG, Rubin DB. Controlling Bias in
Observational Studies: A Review. Sankya. 1973;35(4): 417–
446. doi:10.1017/CBO9780511810725.005.
16. Cochran WG. The Use of Covariance in
Observational Studies. Journal of the Royal Statistical
Society. 1969;18(3): 270-275. doi:10.2307/2346587.
17. Curtis JP, Wang Y, Portnay EL, Masoudi FA,
Havranek EP, Krumholz HM. Aspirin, ibuprofen, and mortality
after myocardial infarction: retrospective cohort study. BMJ.
2003;327(7427):1322-1323. doi:10.1136/bmj.327.7427.1322.
18. Dessai S, Patil V. Testing and interpreting
assumptions of Cox regression analysis. Cancer Res Stat
Treat. 2019;2(1):108-11. doi:10.4103/CRST.CRST_40_19.
19. Dey T, Mukherjee A, Chakraborty S. A Practical
Overview and Reporting Strategies for Statistical Analysis of
Survival Studies. Chest. 2020;158(1S):S39-S48.
doi:10.1016/j.chest.2020.03.015.
20. Djannatian M, Valim C. Chapter 16 - Observational
Studies. In: Fregni F, Illigens BMW, eds. Critical Thinking in
Clinical Research. Oxford University Press; 2018:324-361.
22. Dwivedi N, Sachdeva S. Survival analysis: A brief
note. J Curr Res Sci Med. 2016;2:73-79. doi:10.4103/2455-
3069.198374.
23. Etikan I, Bukirova K, Yuvali M. Choosing statistical
tests for survival analysis. Biom Biostat Int J. 2018;7(5):477-
481. doi:10.15406/bbij.2018.07.00249.
24. Gastwirth JL, Greenhouse SW. Biostatistical concepts
and methods in the legal setting. Stat Med.
1994;14(15):1641–1653. doi:10.1002/sim.4780141505.
25. George B, Seals S, Aban I. Survival analysis and
regression models. J Nucl Cardiol. 2014;21(4):686-694.
doi:10.1007/s12350-014-9908-2.
26. Ghadessi M, Tang R, Zhou J, et al. A roadmap to
using historical controls in clinical trials - by Drug Information
Association Adaptive Design Scientific Working Group (DIA-
ADSWG). Orphanet J Rare Dis. 2020;15(1):69.
doi:10.1186/s13023-020-1332-x.
27. Goel MK, Khanna P, Kishore J. Understanding
survival analysis: Kaplan-Meier estimate. Int J Ayurveda
Res. 2010;1(4):274-278. doi:10.4103/0974-7788.76794.
28. Heneghan C, Goldacre B, Mahtani KR. Why clinical
trial outcomes fail to translate into benefits for patients. Trials.
2017;18(1):122. doi:10.1186/s13063-017-1870-2.
29. Hewett P, Ganser GH. A Comparison of Several
Methods for Analyzing Censored Data. Ann Occup Hyg.
2007;51(7):611-632. doi:10.1093/annhyg/mem045.
31. Hosmer DW, Lemeshow S, May S. Assessment of
model adequacy. In: Applied Survival Analysis. 2nd ed.
Hoboken, NJ: John Wiley & Sons; 2008:169–206.
32.
http://interstat.statjournals.net/YEAR/2011/articles/1105005.p
df. Published 2011. Accessed April 26, 2021.
34. Indrayan A. Medical Biostatistics. 3rd ed. New York:
Chapman and Hall/ CRC Press; 2013.
35. Kaplan EL, Meier P. Nonparametric Estimation from
Incomplete Observations. Journal of the American Statistical
Association. 1958;53(282):457- 481. doi:10.2307/2281868.
36. Keiding N. The Method of Expected Number of
Deaths, 1786-1886-1986, Correspondent Paper. International
Statistical Review. 1987;55(1):1-20. doi:10.2307/1403267.
37. Kishnani P. Lessons Learned from Pompe Disease
[Internet]. https://events-
support.com/Documents/SummaryNHS.pdf. Accessed April
29, 2021.
38. Kraus D. Adaptive Neyman’s smooth tests of
homogeneity of two samples of survival data. J Stat Plan
Inference. 2009;139(10):3559-69.
doi:10.1016/j.jspi.2009.04.009.
39. Lee ET, Wang JW. Statistical Methods for Survival
Data Analysis. 4th ed. New York: Wiley-InterScience; 2013.
40. Leung KM, Elashoff RM, Afifi AA. Censoring issues in
survival analysis. Annual Review of Public Health.
1997;18(1):83-104. doi:10.1146/annurev.publhealth.18.1.83.
41. Li H, Han D, Hou Y, Chen H, Chen Z. Statistical
Inference Methods for Two Crossing Survival Curves: A
Comparison of Methods. PLoS ONE. 2015;10(1):e0116774.
doi:10.1371/journal.pone.0116774.
42. Lira RPC, Antunes-Foschini R, Rocha EM. Survival
analysis (Kaplan-Meier curves): a method to predict the
future. Arquivos Brasileiros de Oftalmologia. 2020;83(2):V-VII.
doi:10.5935/0004-2749.20200036.
43. Lucijanic M, Petrovecki M. Analysis of censored data.
Biochemia Medica. 2012;22(2),151-155.
doi:10.11613/bm.2012.018.
44. Ma J, Sun H, Chen SM, et al. Clinical features and
survival analysis of patients with CD20 positive adult B-
lineage acute lymphoblastic leukemia. Journal of
Experimental Hematology. 2010;18(2):477-481.
45. Machin D, Campbell MJ, Walters SJ. Medical
Statistics. 4th ed. New York: John Wiley and Sons; 2007.
46. McNamee R. How serious is bias in effect estimation
in randomised trials with survival data given risk
heterogeneity and informative censoring? Stat Med.
2017;36(21):3315–3333. doi: 10.1002/sim.7343.
47. Meier-Kriesche HU, Schold JD, Kaplan B. Long-term
renal allograft survival: have we made significant progress or
is it time to rethink our analytic and therapeutic strategies?
Am J Transplant. 2004;4(8):1289-1295. doi:10.1111/j.1600-
6143.2004.00515.x
48. Miot HA. Sample size in clinical and experimental
trials. J Vasc Bras. 2011;10:275-278. doi:10.1590/S1677-
54492011000400001.
49. Miot HA. Survival analysis in clinical and experimental
studies. Jornal Vascular Brasileiro 2017;16(4):267-269.
doi:10.1590/1677-5449.001604.
50. Munkhjargal BM, Aabdien M, Alcantara MC, et al.
Dabigatran for the Prevention of Myocardial Injury after
Orthopedic Surgery (DMINS-PRE). A Study Protocol For A
Randomized, Active-Controlled, Phase II Trial. PPCR Journal.
2019;5(2):44-50. doi:10.21801/ppcrj.2019.52.2.
51. Nakamura R, Khawaja F, Saavedra LC, Fregni F,
Freedman SD. Chapter 1 - Basics of Clinical Research:
Introduction to Clinical Research. In: Fregni F, Illigens BMW,
eds. Critical Thinking in Clinical Research. Oxford University
Press; 2018:3-25.
52. Nason M, Follmann D. Design and analysis of
crossover trials for absorbing binary endpoints. Biometrics.
2010;66(3):958-965. doi:10.1111/j.1541-0420.2009.01358.x.
53. Peduzzi P, Concato J, Feinstein AR, Holford TR.
Importance of events per independent variable in proportional
hazards regression analysis. II. Accuracy and precision of
regression estimates. J Clin Epidemiol. 1995;48(12):1503–
1510. doi:10.1016/0895-4356(95)00048-8.
54. Prasad V, Bilal U. The role of censoring on
progression free survival: oncologist discretion advised. Eur J
Cancer. 2015;51(16):2269-2271.
doi:10.1016/j.ejca.2015.07.005.
55. Prinja S, Gupta N, Verma R. Censoring in clinical
trials: review of survival analysis techniques. Indian J
Community Med. 2010;35(2):217-221. doi:10.4103/0970-
0218.66859.
56. Progression-free survival. National Cancer Institute
website. https://www.cancer.gov/publications/dictionaries/cancer-
terms/def/progression-free-survival. Accessed April 23, 2021.
57. Putter H, Fiocco M, Geskus RB. Tutorial in
biostatistics: competing risks and multi-state models. Stat
Med 2007;26(11):2389-2430. doi:10.1002/sim.2712.
59. Rich JT, Neely JG, Paniello RC, Paniello CCJ,
Nussenbaum B, Wang EW. A practical guide to
understanding Kaplan–Meier curves. Otolaryngol Head Neck
Surg. 2010;143(3):331-336.
doi:10.1016/j.otohns.2010.05.007.
60. Rodríguez G. Chapter 7: Survival Models. In: Lecture
Notes on Generalized Linear Models. Princeton website.
https://data.princeton.edu/wws509/notes/c7.pdf. Revised
September, 2010. Accessed April 29, 2021.
61. Rossouw JE, Anderson GL, Prentice RL, et al. Risks
and Benefits of Estrogen Plus Progestin in Healthy
Postmenopausal Women: Principal Results From the
Women's Health Initiative