About the Author(s)

Candice MacMillan
Department of Physiotherapy, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa

Benita Olivier
Department of Physiotherapy, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa

Natalie Benjamin-Damons
Department of Physiotherapy, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa


MacMillan, C., Olivier, B. & Benjamin-Damons, N., 2021, ‘The interrater and intrarater reliability of the flexibility and strength tests included in the Sport Science Lab® screening protocol amongst professional rugby players’, South African Journal of Physiotherapy 77(1), a1504. https://doi.org/10.4102/sajp.v77i1.1504

Original Research

The interrater and intrarater reliability of the flexibility and strength tests included in the Sport Science Lab® screening protocol amongst professional rugby players

Candice MacMillan, Benita Olivier, Natalie Benjamin-Damons

Received: 28 May 2020; Accepted: 30 Sept. 2020; Published: 22 Apr. 2021

Copyright: © 2021. The Author(s). Licensee: AOSIS.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Background: Considering the injury incidence rate (IR) associated with elite-level rugby, measures to reduce players’ injury risk are important. Establishing scientifically sound, pre-season musculoskeletal screening protocols forms part of injury prevention strategies.

Objective: To determine the interrater and intrarater reliability of the flexibility and strength tests included in the Sport Science Lab® (SSL) screening protocol.

Methods: We determined the interrater and intrarater reliability of 11 flexibility and nine strength tests. Twenty-four injury-free, elite, adult (> 18 years), male rugby players were screened by two raters on two occasions. To establish intrarater and interrater reliability, Gwet’s AC1, AC2 and intraclass correlation coefficients (ICC) were used for the analysis of binary, ordinal and continuous variables, respectively. The confidence level was set at 95%.

Results: Flexibility tests which require lineal measurement had at least substantial interrater (ICC = 0.70–0.96) and intrarater reliability (ICC = 0.89–0.97). Most of the flexibility tests with binary outcomes attained almost perfect interrater and intrarater reliability (Gwet’s AC1 = 0.8–0.97). All strength tests attained at least substantial interrater (Gwet’s AC2 = 0.73–0.96) and intrarater (Gwet’s AC2 = 0.67–0.97) reliability.

Conclusion: The level of interrater and intrarater reliability of most of the flexibility and strength tests investigated supports their use to quantify various aspects of neuromusculoskeletal qualities and possible intrinsic risk factors amongst elite rugby players.

Clinical implications: Establishing the reliability of tests is one step towards supporting their inclusion in official screening protocols. The results of our study verify the reliability of the simple, clinically friendly strength and flexibility tests included and therefore support their use as preparticipation screening tools for rugby players. Further investigation of their association with athletes’ injury risk and performance is warranted.

Keywords: rugby; injury risk factors; screening; reliability; manual strength testing.


Despite the high risk of injury related to the collisional nature of rugby union (henceforth rugby) (Schwellnus et al. 2014), the sport remains one of the most popular professional team sports worldwide (Brooks 2005). Rugby injury incidence rates (IR) have been considered high compared to sports such as soccer and basketball (Yeomans et al. 2018) but similar to other high-impact collisional sports such as Australian Rules football (Orchard & Seward 2002) and American National Football League (NFL) (Kerr et al. 2016).

Epidemiological studies conducted in South Africa (Schwellnus et al. 2014) and England (Brooks 2005) as well as a meta-analysis conducted by Williams et al. (2013) report similar findings regarding the incidence and nature of injury amongst professional male rugby players. All three studies concluded that the majority of injuries occur during matches (IR 81.0–91.0 injuries/1000 player hours) and that injuries are related to a tackling incident. These studies also concurred that the injury rates for forwards and back-line players are similar and that the lower (48.1% – 58.9%) and upper limbs (15.6% – 25.6%) are the most common sites of injury.

The development of screening protocols has been advocated by various injury prevention paradigms (Brooks 2005; Schwellnus et al. 2014; Van Mechelen, Hlobil & Kemper 1992). International rugby unions (Gray & Naylor 2012; Quarrie 2001) have developed pre-season musculoskeletal (MSK) screening protocols in an attempt to identify players at risk of sustaining in-season injuries. The protocol developed by the South African Rugby Union (SARU), which the developers claim to be similar to that of New Zealand and Australia, includes a series of physical screening tests related to, amongst others, strength, flexibility and joint range of motion (ROM) (Gray & Naylor 2012). However, few studies on the association between the tests included in these protocols and injury incidence amongst elite-level rugby players have been published. Also, the developers’ rationale for inclusion of the tests was largely based on the tests’ reliability and normative values amongst athletes other than elite-level rugby players. Quarrie (2001) investigated various MSK performance measures amongst rugby players, of which only one was found to have a univariate relationship to injury. The developers of the Sport Science Lab® (SSL) screening protocol therefore identified a need for evidence regarding existing MSK screening tools’ reliability and association with in-season injury. Hence, the aim of our study was to develop a screening protocol, investigate (amongst other qualities) its reliability, publish the results and, if necessary, amend the tool to improve its psychometric properties.

When designing a screening protocol, the challenge lies in finding a delicate balance between scientific accuracy (reliability and validity) and practicality (ease and duration of execution; a small amount of inexpensive equipment and space, as well as the examination skill required) (Castro-Piñero et al. 2009). Reliability refers to the reproducibility of measurements within a given participant over time (intrarater reliability) and by various raters (interrater reliability) (Hayen, Dennis & Finch 2007). The ability of researchers to make inferences regarding certain outcome variables such as intrinsic risk factors is largely dependent on repeated measurement accuracy, and the reliability of screening protocols is therefore pivotal (Dennis et al. 2008).

Xue (2016) suggested that better observer training, improved scale design and introducing items better at capturing heterogeneity improve the reliability of a screening tool. The developers considered both the proposed strategies to improve reliability (Xue 2016) and practicality thereof when designing the SSL screening protocol. The complete SSL screening protocol consists of 11 flexibility, seven strength, six plyometric and one rugby-specific fitness tests. As the plyometric and cardiorespiratory fitness tests are objective in nature (i.e. the raters do not have to measure, eyeball or base a rating on subjective measurement as is the case for the strength and flexibility tests), we did not include the plyometric and fitness tests in the reliability part of our study. The strength and flexibility tests included, equipment required and standard instructions are described in Online Appendix 1, Table 1-A1, whilst a detailed description of the purpose and rationale for the inclusion, modification of and proposed minimal standards for the flexibility and strength tests included in the protocol is summarised in Online Appendix 1, Table 2-A1.

Rationale for inclusion of flexibility and strength tests, and manner of execution

Limitations in muscle flexibility and related joint mobility have been identified as injury risk factors amongst rugby players (O’Connor 2004; Yeomans et al. 2018). Considering the suggestions summarised by Xue (2016) regarding improvement of test reliability, flexibility tests were simplified to only include tape measured (lineal) outcomes or joint ROM, considered relative to stationary objects with either 0° horizontal or vertical planes such as a plinth.

Rationale for inclusion of strength tests and manner of execution

The game of rugby requires players to tolerate and generate forces to propel their own and additional external weight loads. It is thus fair to regard muscular strength and power as important performance predictors (Posthumus & Durandt 2009) as well as intrinsic risk factors associated with injury prevention (Gamble 2004). Whilst strength does not have a set definition or unit of measure, it is an attribute of force and power (Bohannon 2019). Manual muscle tests (MMT) have been used as a way to gauge muscle output (Bohannon 2019). The developers of the SSL screening protocol regarded MMT as the most practical option as they are inexpensive, quick and easily performed. The manner of execution and proposed rating scale are however new and have not been investigated. Some might argue that hand-held dynamometers (HHD) might be equally practical and provide more objective output measures. However, the cost of HHD may be prohibitive to some and the main limitation of MMT, that is, subjectivity of tester strength and related external resistance applied, is not overcome (Bohannon 2019). Further limitations of HHD and existing MMT strength rating scales are summarised in Online Appendix 1, Table 2-A1.

Our study is the first of two (the second investigates the association between the tests included in the protocol and in-season injury) conducted to establish a clinically useful, evidence-based, pre-season screening protocol that could be used by both medical and strength and conditioning professionals. In a team setting this would allow for a holistic picture of athletes’ pre-season intrinsic injury risks as well as to establish baseline fitness parameters. The aim of our study was thus to investigate the interrater and intrarater reliability of the SSL screening protocol.


This was a reliability study with a test–retest design. Guidelines for reporting reliability and agreement studies were followed (Kottner et al. 2011).

Information regarding our study was sent to 14 official national rugby unions requesting that they send a list of potential participants who volunteered. Participants included elite (i.e. part of an official SARU team) male rugby players between the ages of 19 and 36 years who were injury free at the start of the competitive rugby season. Players who were not on the active team roster at the start of the competitive rugby season were not eligible for inclusion. For convenience, the sample was selected based on the teams’/participants’ geographical proximity to the facility of an established sport rehabilitation and performance centre.

The sample size was calculated based on published guidelines regarding sample size requirements for two-rater reliability studies with nominal (Bujang & Baharum 2017; Sim & Wright 2005) or ordinal (Bujang & Baharum 2017) variables, which assume at least 50% positive ratings and a power of 80%. The authors of these studies suggest a sample size of between 25 (Sim & Wright 2005) and 29 (Bujang & Baharum 2017) participants. To account for dropout, 27 volunteers were included. Other similar reliability studies included 15 (O’Connor 2014) and 40 (Armstrong 2016) participants, respectively.


Our study commenced 3 weeks prior to the start of the competitive rugby season to allow for a standardised volume of training to have been completed. Intrarater and interrater reliability were assessed concurrently. The screening tests were conducted by a qualified physiotherapist (Rater 1; first author) and an athletic trainer (Rater 2). Both raters had more than 5 years of clinical experience and were experienced in the use of the SSL screening protocol in daily practice. Two research assistants recorded the participants’ ratings/measurements. Raters were not allowed to communicate with each other during the rating of any of the screening tests and were blinded to the participants’ injury history and each other’s findings.

After performing a 10-min warm-up of their choice, participants were requested to perform all strength and flexibility tests as described in Appendices 1A and 1B. For time efficiency and minimal inconvenience to participants, all tests required to be done on the floor were done first (in no particular order), followed by the tests in standing and then tests performed on the plinth. Each test was performed three times and the best attempt was recorded.

Considering the logistics, practicality and training schedules of the participating teams, a week was dedicated to data collection. To minimise any physiological effects and allow symptoms that may have been provoked by the tests to subside, screening of participants occurred on two consecutive days, in the same environment, before training sessions. Ten participants were screened on two consecutive days and one day thereafter, and the remaining participants were screened on the next two consecutive days. During the screening sessions, each participant was screened once by Rater 1 and an hour later by Rater 2. To minimise potential recollection bias, the ordering of participants scheduled for a screening on a particular day was randomised for each rater in both rating sessions. This randomisation, coupled with raters being blinded to ratings made during session 1, aimed to further reduce possible recollection bias.
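The per-rater, per-session randomisation of participant order described above can be sketched as follows. This is only an illustration under stated assumptions: the article does not report how the random order was generated, and the function name `screening_orders`, the participant identifiers and the seed value are hypothetical.

```python
import random


def screening_orders(participants, raters=("Rater 1", "Rater 2"),
                     sessions=(1, 2), seed=None):
    """Return an independently shuffled participant order for each
    (session, rater) pair. Hypothetical helper, not the authors' procedure."""
    rng = random.Random(seed)  # seeded generator so the schedule is reproducible
    orders = {}
    for session in sessions:
        for rater in raters:
            order = list(participants)  # copy so the input list is not modified
            rng.shuffle(order)          # in-place random permutation
            orders[(session, rater)] = order
    return orders


# Hypothetical identifiers for ten participants screened on the same day.
participants = [f"P{i:02d}" for i in range(1, 11)]
orders = screening_orders(participants, seed=2021)
for key in sorted(orders):
    print(key, orders[key])
```

Drawing a fresh permutation for every rater and session means that neither rater sees participants in a predictable sequence, which is the property the randomisation is meant to provide.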

Data analysis

Statistical analyses were performed using Stata/IC 15.1 (StataCorp, TX, USA). Continuous variables were summarised by mean and standard deviation, whilst binary and ordinal variables were summarised by count and frequency.

Interrater reliability was determined by comparing the per-session ratings (for both sessions) of Rater 1 with those of Rater 2. Intrarater reliability was analysed by comparing each rater’s day 1 ratings with that rater’s day 2 ratings. To determine both interrater and intrarater reliability, Gwet’s AC1 (Gwet 2016) was used for tests with binary (yes or no) outcomes, Gwet’s AC2 (Gwet 2016) for ordinal variables and ICC(3,2) (two-way mixed effects, consistency, multiple raters/measurements) (Mandrekar 2011) for tests with continuous outcome measures. The respective reliability coefficients with their 95% confidence intervals (CIs) were reported. Standard error of measurement (SEM) values were also calculated. Intraclass correlation coefficient (ICC) values were interpreted according to the Landis and Koch scale (Landis & Koch 1977). Gwet’s agreement coefficients have been shown to be more stable than Cohen’s kappa (k) and other coefficients, and more resistant to the kappa paradox (high percentage agreement but low k-value) (Gwet 2014, 2016; Wongpakaran et al. 2013). Interpretation of results was done according to the benchmarking procedure suggested by Gwet (2014), that is, each agreement coefficient was benchmarked using the cumulative probability (in our case 95%) of the coefficient falling into one of the following categories: < 0.00 = Poor; 0.01–0.20 = Slight; 0.21–0.40 = Fair; 0.41–0.60 = Moderate; 0.61–0.80 = Substantial; 0.81–1.00 = Almost perfect. This method allows for direct and more precise comparisons of the different agreement coefficients and their representation on the Landis and Koch scale.
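For readers unfamiliar with Gwet’s AC1, the coefficient for two raters and a binary outcome can be sketched as below, together with a plain lookup of the Landis and Koch categories listed above. This is a minimal illustration, not the Stata routine used in our analysis: the ratings are invented, the function names are hypothetical, and the lookup labels a point estimate directly rather than applying the cumulative-probability benchmarking procedure.

```python
def gwet_ac1(rater_a, rater_b):
    """Gwet's AC1 for two raters and binary ratings coded 0 (no) or 1 (yes)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: share of participants given the same rating.
    p_obs = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Mean proportion of 'yes' ratings across the two raters.
    prevalence = (sum(rater_a) + sum(rater_b)) / (2 * n)
    # Chance agreement for two categories under Gwet's model (at most 0.5).
    p_chance = 2 * prevalence * (1 - prevalence)
    return (p_obs - p_chance) / (1 - p_chance)


def landis_koch_label(coefficient):
    """Map a coefficient to the Landis and Koch category (simple lookup)."""
    if coefficient < 0.0:
        return "Poor"
    for upper, label in [(0.20, "Slight"), (0.40, "Fair"),
                         (0.60, "Moderate"), (0.80, "Substantial")]:
        if coefficient <= upper:
            return label
    return "Almost perfect"


# Invented ratings for ten participants (1 = met the minimum standard).
rater_1 = [1, 1, 1, 1, 0, 0, 1, 1, 1, 1]
rater_2 = [1, 1, 1, 1, 0, 1, 1, 1, 1, 1]
ac1 = gwet_ac1(rater_1, rater_2)
print(round(ac1, 3), landis_koch_label(ac1))  # 0.866 Almost perfect
```

In this example the raters agree on nine of ten participants (observed agreement 0.90) and the chance agreement is 0.255, giving AC1 = 0.645/0.745. Because the chance term shrinks as ratings concentrate in one category, AC1 stays high where Cohen’s kappa would drop.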

Ethical considerations

Ethical approval was obtained from the University of the Witwatersrand Human Research Ethics Committee (Medical) (M180452). Written permission was obtained from the rugby union and the coaches of the respective teams and informed consent was obtained from players who volunteered to participate in our study.


Three (11.11%) of the participants (n = 27) did not attend the second screening session because of logistical problems or conflict with other obligations. Data for 24 participants were therefore analysed. The average age of the players was 19.96 (± 1.78) years, weight was 95.33 (± 13.50) kg and height was 186.50 (± 8.98) cm.

Descriptive statistics

The descriptive statistics for flexibility tests with continuous outcomes are summarised in Table 1; flexibility tests with binary outcomes are summarised in Table 2 and all strength tests are summarised in Table 3. Considering the mean and minimum standards of the respective flexibility tests, both raters agreed on both days that most of the players did not achieve the minimum standards for the majority of tests. In contrast, the raters agreed that on both days the majority of players achieved the set minimum standards (score of 4 or 5) for most of the strength tests.

TABLE 1: Descriptive statistics for all strength and flexibility tests with continuous outcomes (n = 24).
TABLE 2: Descriptive statistics for flexibility tests with binary outcomes (n = 24).
TABLE 3: Descriptive statistics for strength tests rated on a scale of 1–5† (n = 24).

Inter- and intrarater reliability

The inter- and intrarater agreement coefficients, CI and standard error (SE) for flexibility tests with continuous outcomes are summarised in Table 4; flexibility tests with binary outcomes in Table 5 and strength tests in Table 6.

TABLE 4: Inter- and intrarater reliability for flexibility tests with continuous outcomes.
TABLE 5: Inter- and intrarater reliability for flexibility tests with binary outcomes.
TABLE 6a: Intrarater and interrater reliability for all strength tests.
TABLE 6b: Intrarater and interrater reliability for all strength tests.

Flexibility tests

With the exception of the Toe Touch (TT) test, all other flexibility tests with continuous outcomes had almost perfect intrarater (ICC = 0.91–0.98) and interrater (ICC = 0.89–0.99) agreement. The TT test had substantial interrater agreement for both sessions and almost perfect intrarater agreement.
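The ICC form used for these continuous outcomes (two-way mixed effects, consistency, average of k measurements) can be written as (MSR − MSE)/MSR, where MSR is the between-participant mean square and MSE the residual mean square from a two-way ANOVA. The sketch below illustrates that textbook formula on invented data; it is not the Stata output reported in Table 4, and the function name `icc_3k` is hypothetical.

```python
def icc_3k(scores):
    """ICC(3,k): two-way mixed effects, consistency, average measures.

    scores: list of rows, one per participant, each holding k measurements
    (one per rater or per session).
    """
    n = len(scores)
    k = len(scores[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need at least 2 participants and 2 measurements each")
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]
    # Two-way ANOVA sums of squares (participants as rows, raters as columns).
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    if ms_rows == 0:
        raise ValueError("no between-participant variance; ICC is undefined")
    return (ms_rows - ms_error) / ms_rows


# Invented tape-measure readings (cm): Rater 2 reads 2 cm higher throughout.
offset_only = [[10, 12], [14, 16], [9, 11], [20, 22]]
print(icc_3k(offset_only))  # 1.0, a constant offset does not reduce consistency

# Invented readings in which the raters disagree on the ranking within pairs.
disagreeing = [[1, 2], [2, 1], [3, 4], [4, 3]]
print(icc_3k(disagreeing))  # 0.75
```

The first example shows why the consistency form was a deliberate choice: a systematic difference between raters is absorbed by the rater (column) term and does not lower the coefficient, whereas an absolute-agreement ICC would penalise it.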

Except for the Modified Thomas test (MTT) and hip ER tests, all binary flexibility tests had at least substantial inter- and intrarater reliability (Gwet’s AC1 = 0.65–1.00; SE < 0.12). Interrater reliability for all aspects of the Thomas test (i.e. psoas, rectus femoris and ITB) on both sides was at most moderate, with Gwet’s AC1, respectively, ranging from 0.22 to 0.58, 0.16 to 0.22, and 0.03 to 0.38. Intrarater reliability for the Thomas tests ranged from slight to substantial (Gwet’s AC1 = 0.25–0.76), with larger CIs compared to other binary tests. Notably, the intrarater reliability for Rater 1 was consistently higher than that of Rater 2.

Strength tests

All strength tests had at least substantial interrater (Gwet’s AC2 = 0.73–0.96) and intrarater (Gwet’s AC2 = 0.67–0.96) agreement with small SE (< 0.15). The abdominal and oblique strength tests had almost perfect intrarater (ICC = 0.90–0.96) and interrater agreement (ICC = 0.77–0.92) with small SE (SE = 2.61–6.19) compared to the test means as summarised in Table 1.


Because of the collisional nature of rugby, injuries seem an inevitable part of the game. However, clinicians should continuously seek strategies to minimise the incidence and severity of injuries. For medical and conditioning staff involved in elite-level sports, such strategies involve the development of practical and scientifically sound pre-season MSK screening protocols to identify possible intrinsic risk factors to injury.

Like Ashworth et al. (2018), who investigated the reliability of an original upper body strength test, our study only included elite adult male rugby players. The anthropometrics and demographics (age) of the players in our study were similar to those of the players in Ashworth et al. (2018). Haitz et al. (2014) investigated the inter- and intrarater reliability of a battery of screening tests amongst collegiate athletes (i.e. all participating at the same level) of various sports and reported high levels of interrater (k = 0.83–1.00) and intrarater (k = 0.71–0.95) reliability. A degree of homogeneity in the level of participation and sporting activity might therefore have a significant impact on the outcomes of studies investigating the reliability of neuromusculoskeletal screening tests. One of the reasons is that elite athletes’ ability to recover after performing multiple physical fitness tests exceeds that of athletes participating at lower levels of competition. The variability in test results arising from the possible physiological effects of repeated physical fitness testing (more specifically strength and flexibility) by multiple raters on multiple days may therefore be more limited, and the results more reliable.

Considering the mean of the toe-touch test (TT test) (–0.75 cm – 2.04 cm), the standard deviation (SD) (4.73–5.32) was large. The TT test is the only test for which outcomes can be either positive or negative (i.e. greater or less than zero). The large SD can therefore be explained by the cumulative, mathematical effect of including both positive and negative values in the calculation of the mean and in turn the SD. The calculation of SEM takes SD and ICC into account. Considering the high SD (in addition to the lower interrater reliability ICC: [0.70 {0.40–0.86}]) values, it is not surprising that the interrater SEM (8.24–8.55) for the TT test was also high. The SD for the combined right shoulder mobility test was also large, considering the mean (mean = 4.63 cm – 5.17 cm; SD = 4.73 cm – 5.32 cm). This could be attributed to the number of zero measurements included in the data set.
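The way SEM depends on both SD and ICC can be illustrated with one commonly used definition, SEM = SD × √(1 − ICC). This is an assumption made for illustration only: the exact computation behind the SEM values in our tables is not restated here, the function name is hypothetical and the inputs below are invented round numbers rather than study data.

```python
import math


def standard_error_of_measurement(sd, icc):
    """SEM under the common definition SD * sqrt(1 - ICC) (assumed here)."""
    if sd < 0 or not 0.0 <= icc <= 1.0:
        raise ValueError("sd must be non-negative and icc within [0, 1]")
    return sd * math.sqrt(1.0 - icc)


# Invented inputs: the same spread of scores with lower and higher reliability.
print(standard_error_of_measurement(5.0, 0.75))  # 2.5
print(standard_error_of_measurement(5.0, 0.70) >
      standard_error_of_measurement(5.0, 0.95))  # True
```

Under this definition SEM grows with the spread of scores and shrinks as the ICC approaches 1, which is the direction of the relationship described above for the TT test.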

All lineal flexibility tests had at least substantial interrater reliability (0.70–0.98) and almost perfect intrarater reliability (0.89–0.98) and, except for the TT test, had small corresponding CIs as well. This can be attributed to the objective, simple precision with which outcomes can be measured using a tape measure. Although interrater reliability of the TT test did not achieve the acceptable benchmark set by the authors (i.e. almost perfect), the intrarater reliability did. Interrater reliability for this test can be improved by a more thorough description of the test, specifically ensuring that raters identify all possible compensatory mechanisms used to achieve better test scores, for example, slightly bending the knees.

Although the TT, combined shoulder flexion and extension, and v-sit tests had at least substantial inter- and intrarater reliability, their respective SEMs were larger than other lineal flexibility tests. At first glance, it may seem that these values are indicative of a lesser degree of agreement. However, this can be attributed to the larger range of possible scores (i.e. greater distribution range) associated with the respective tests. For example, the maximum range for ankle DF might be limited from 0 cm to 20 cm where combined shoulder flexion and extension has an outcome range of 0 cm to > 60 cm. For larger range outcomes the variability (i.e. SD) may be more extensive, resulting in larger SEM values.

Most flexibility tests with binary outcomes attained almost perfect intrarater and interrater reliability (Gwet’s AC1 > 0.8). The MTT and hip ER tests yielded lower intrarater and interrater reliability values (Gwet’s AC1 < 0.73). The difference in the reliability achieved for these tests can be attributed to the complexity of the tests. Whilst the other tests require the observation of single joint movement or have obvious rating criteria (e.g. dorsal aspect of the foot and ankle has to be flat against the floor), the MTT and hip ER tests challenge the flexibility and range of multiple joints and structures simultaneously, thereby making the rating criteria more complicated. Numerous studies have found that observational neuromusculoskeletal tests that require assessment of more than one component have poor intrarater reliability (Monnier et al. 2012; Moreland et al. 1997; Whatman, Hume & Hing 2015). To improve reliability, one can consider simplifying the tests by executing, for example, the MTT three times and only assessing one aspect per repetition. Another consideration is to measure the outcomes of the test more objectively, using a goniometer. However, Peeler and Anderson (2007) reported poor interrater and intrarater reliability regardless of whether an observational dichotomous (fail/pass) scale or goniometer was used for measurement of the various aspects of the Thomas test. The hip ER test might be improved by the objective measurement of the linear distance of the forehead to plinth surface using a tape measure. If the participant is unable to place the lateral aspect of the test leg knee flat on the plinth, the distance from the lateral epicondyle to plinth surface can also be measured as a baseline for tracking progress.

Several MMTs and rating scales have been documented (Avers & Brown 2019; Cuthbert & Goodheart 2007). However, some have fundamental shortcomings when applied to an athletic population. The main limitations related to their relevance in a rugby population, as explained in the introduction and Online Appendix 1, Table 2-A1, are the non-functional player position during testing and the type of muscle actions (concentric only) tested. The manual strength testing regime proposed by the developers of the SSL screening protocol attempts to address some of the shortcomings of existing manual strength testing regimes.

Considering the physicality of MMT, the subjectivity of tester resistance and tester strength have been identified as factors limiting the reliability thereof, particularly amongst higher scores (Bohannon 2019). In our study, the anthropometrics and demographics of the raters differed vastly (Rater 1 – Female, 34 years, height = 168 cm, weight = 60.00 kg; Rater 2 – Male, 28 years, height = 188 cm, weight = 100 kg), yet the interrater reliability of all manual strength tests, with the exception of the left hip adduction which was substantial, was almost perfect (Gwet’s AC2 = 0.81–0.96; SE [0.03–0.12]). The level of reliability and agreement therefore did not seem to be affected by the raters’ physical characteristics or resistance-related subjectivity. In fact, perhaps contrary to what one would expect, considering the modes in Table 3, Rater 1 rated most players’ strength lower than Rater 2 on two occasions (day 1 – right glut/ham; day 2 – glut/ham; day 2 – left hip IR) and the same as Rater 2 for 21 (out of 28) occasions.

The modes further indicate that, with the exception of the right glut/hamstring and left hip IR tests, both raters on both days agreed that the majority of the participants met or did not meet the proposed minimum standards. It therefore appears that both raters had similar clinical decision-making skills, reiterating the importance of well-described testing procedures and adequate training in the use of the tools. Specifically, the testers’ understanding of the position and hand placement that allows for optimal biomechanical advantage when the external force is applied is crucial.

Reliability studies investigating MMT amongst elite, healthy athletes are rare. Manual muscle tests (MMTs), such as the ‘break-test’ (Avers & Brown 2019), have good reliability for assessing individuals with neuromusculoskeletal dysfunction (Cuthbert & Goodheart 2007). In our study, the authors proposed the use of a novel MMT strength test battery and rating scale as a screening tool, as opposed to a diagnostic tool, for asymptomatic, seemingly healthy individuals. Manual muscle tests evaluate the ability of the nervous system to adapt to either meet or counter the changing pressure exerted by the examiner (Cuthbert & Goodheart 2007). The developers of the SSL strength testing regime therefore assume that an optimally functioning, well-trained nervous system will immediately alter motor unit recruitment in an attempt to meet the demands of the test (external pressure/force applied), whilst a sub-optimal or dysfunctional nervous system, or the structurally damaged muscle fibres that it innervates, will fail to do so.

Cuthbert and Goodheart (2007) reported that studies investigating the level of agreement for MMT amongst symptomatic or asymptomatic, non-sporting participants attained high levels of interrater (82.00% – 97.00%) and intrarater (96.00% – 98.00%) agreement. Similarly, we found substantial agreement between raters (Gwet’s AC2 = 0.73–0.96) and sessions (Gwet’s AC2 = 0.67–0.96) for MMT executed and rated according to the SSL guidelines. The strength tests, based on the number of repetitions completed, that is, the double leg lower and oblique twist tests, also yielded high interrater (ICC = 0.89–0.93 and 0.90–0.99, respectively) and intrarater (ICC = 0.90–0.96 and 0.92–0.96) reliability.

Limitations and strengths of the study

The reliability measures were based on the fixed raters (not randomly selected) who participated in our study, and the results may be limited to this specific group of raters. Only elite adult male rugby players were investigated and the results are therefore not generalisable to other sports, or youth players and/or players playing at a different level. Although a power analysis was done, the sample size was small. Further research is required with larger cohorts. Ideally, if the team’s schedules allowed, a longer wash-out period would have been introduced to further reduce recollection bias. The strength of our study is that a homogeneous population, following the same training schedule, was evaluated. Therefore, the variability of individualised scores because of physiological changes arising from testing or training (and other possible confounding variables such as training load between sessions) was limited.

Clinical and research implications

Reliability of screening protocols is essential as it is of fundamental importance to the quality of players’ healthcare and performance, so that the professionals can replicate and agree on their findings and conclusions. Furthermore, reliable tools should reflect the qualities of the group of participants being screened and not the raters involved in the screening. Raters involved in our study had experience in the use of the SSL screening protocol, emphasising the importance of raters being trained in the use of standardised protocols. Future studies should focus on establishing the reliability of this screening protocol amongst novice raters with less experience, across a range of different sporting professionals as well as amongst athletes participating at different levels and in other sports. As the reliability of most of the tests included in the SSL protocol has been established, the association of these tests with injury risk could be investigated to establish players’ injury risk profiles at the start of the season and in turn develop targeted injury prevention strategies.

Knotter and Steiner (2011) emphasised that the difference in ratings is not solely a statistical decision, but also a clinical one. In clinical practice, the interpretation should consider the purpose and consequences of the test results in order to establish the acceptable margin of error for clinical decision-making. Here, like in other studies (Knotter & Steiner 2011), unless there were statistically sound reasons for lower reliability coefficient values, we considered values of at least 0.80 (i.e. ‘almost perfect agreement’) as clinically acceptable. Lower values might however still be useful for research purposes and group comparisons (Kottner et al. 2011).


Conclusion

Most of the flexibility and strength tests included in the SSL screening protocol demonstrated at least substantial intrarater and interrater reliability. Establishing the reliability of this protocol is a step towards supporting its use as a clinical tool to quantify various neuromusculoskeletal qualities and to identify possible intrinsic risk factors amongst elite, adult, male rugby players. Additionally, the test results reported here can provide baseline scores or measurements for comparison with athletes at a similar or different level. The developers of the SSL screening protocol should continue their efforts to improve the reliability of the tests that assess hip flexor and external rotation ROM, or include alternative tests.


Acknowledgements

The authors thank all the rugby players who participated in this study, with special thanks to the athletic trainer and research assistants for their contributions.

Competing interests

The authors declare that they have no financial or personal relationships that may have inappropriately influenced them in writing this article.

Authors’ contributions

C.M., B.O. and N.B-D. contributed to the design and implementation of the research, to the analysis of the results and to the writing of the manuscript.

Funding information

Funding was provided by the South African Society of Physiotherapy and by a research grant from the Faculty of Health Sciences, University of the Witwatersrand.

Data availability

The authors confirm that the data supporting the findings of this study are available within the article. Further information may be obtained from the corresponding author, C.M., upon reasonable request.


Disclaimer

The views and opinions expressed in this article are those of the authors and do not necessarily reflect the official policy or position of any affiliated agency of the authors.


References

Armstrong, R., 2016, ‘Functional movement screening as a predictor of injury in male and female university rugby union players’, Physiotherapy 102(Suppl 1), 178–179. https://doi.org/10.1016/j.physio.2016.10.213

Ashworth, B., Hogben, P., Singh, N., Tulloch, L. & Cohen, D.D., 2018, ‘The athletic shoulder (ASH) test: Reliability of a novel upper body isometric strength test in elite rugby players’, BMJ Open Sport & Exercise Medicine 4(1), 365. https://doi.org/10.1136/bmjsem-2018-000365

Avers, D. & Brown, M., 2019, Daniels and Worthingham’s muscle testing: Techniques of manual examination and performance testing, Elsevier, St. Louis, MO.

Bohannon, R.W., 2019, ‘Considerations and practical options for measuring muscle strength: A narrative review’, BioMed Research International 2019, article 8194537, 1–10. https://doi.org/10.1155/2019/8194537

Brooks, J.H.M., 2005, ‘Epidemiology of injuries in English professional rugby union: Part 1 match injuries’, British Journal of Sports Medicine 39(10), 757–766. https://doi.org/10.1136/bjsm.2005.018135

Bujang, M.A. & Baharum, N., 2017, ‘Guidelines of the minimum sample size requirements for Kappa agreement test’, Epidemiology, Biostatistics and Public Health 14, e12267-1–e12267-2.

Castro-Piñero, J., Chillón, P., Ortega, F.B., Montesinos, J.L., Sjöström, M. & Ruiz, J.R., 2009, ‘Criterion-related validity of sit-and-reach and modified sit-and-reach test for estimating hamstring flexibility in children and adolescents aged 6–17 years’, International Journal of Sports Medicine 30(9), 658–662. https://doi.org/10.1055/s-0029-1224175

Cuthbert, S.C. & Goodheart, G.J., 2007, ‘On the reliability and validity of manual muscle testing: A literature review’, Chiropractic & Osteopathy 15(1), 4. https://doi.org/10.1186/1746-1340-15-4

Dennis, R.J., Finch, C.F., Elliott, B.C. & Farhart, P.J., 2008, ‘The reliability of musculoskeletal screening tests used in cricket’, Physical Therapy in Sport 9(1), 25–33. https://doi.org/10.1016/j.ptsp.2007.09.004

Gamble, P., 2004, ‘Physical preparation for elite-level rugby union football’, Strength and Conditioning Journal 26(4), 10–23. https://doi.org/10.1519/00126548-200408000-00001

Gray, J. & Naylor, R., 2012, BokSmart musculoskeletal assessment form, BokSmart, Cape Town.

Gwet, K.L., 2014, ‘Handbook of inter-rater reliability: The definitive guide to measuring the extent of agreement among raters’, in A handbook for researchers, practitioners, teachers & students, 4th edn., pp. 112–117, Advanced Analytics, LLC, Gaithersburg, MD.

Gwet, K.L., 2016, ‘Testing the difference of correlated agreement coefficients for statistical significance’, Educational and Psychological Measurement 76(4), 609–637. https://doi.org/10.1177/0013164415596420

Haitz, K., Shultz, R., Hodgins, M. & Matheson, G.O., 2014, ‘Test-retest and interrater reliability of the functional lower extremity evaluation’, Journal of Orthopaedic Sports Physical Therapy 44(12), 947–954. https://doi.org/10.2519/jospt.2014.4809

Hayen, A., Dennis, R.J. & Finch, C.F., 2007, ‘Determining the intra- and inter-observer reliability of screening tools used in sports injury research’, Journal of Science and Medicine in Sport 10(4), 201–210. https://doi.org/10.1016/j.jsams.2006.09.002

Kerr, Z.Y., Simon, J.E., Grooms, D.R., Roos, K.G., Cohen, R.P. & Dompier, T.P., 2016, ‘Epidemiology of football injuries in the National Collegiate Athletic Association, 2004–2005 to 2008–2009’, Orthopaedic Journal of Sports Medicine 4(9), 2325967116664500. https://doi.org/10.1177/2325967116664500

Knotter, J. & Steiner, D., 2011, ‘The difference between reliability and agreement’, Journal of Clinical Epidemiology 64(6), 701–702. https://doi.org/10.1016/j.jclinepi.2010.12.001

Kottner, J., Audigé, L., Brorson, S., Donner, A., Gajewski, B.J., Hróbjartsson, A. et al., 2011, ‘Guidelines for reporting reliability and agreement studies (GRRAS) were proposed’, Journal of Clinical Epidemiology 64(1), 96–106. https://doi.org/10.1016/j.jclinepi.2010.03.002

Landis, J.R. & Koch, G.G., 1977, ‘The measurement of observer agreement for categorical data’, Biometrics 33(1), 159. https://doi.org/10.2307/2529310

Mandrekar, J.N., 2011, ‘Measures of interrater agreement’, Journal of Thoracic Oncology 6(1), 6–7. https://doi.org/10.1097/JTO.0b013e318200f983

Monnier, A., Heuer, J., Norman, K. & Äng, B.O., 2012, ‘Inter- and intra-observer reliability of clinical movement-control tests for marines’, BMC Musculoskeletal Disorders 13(1), 263. https://doi.org/10.1186/1471-2474-13-263

Moreland, J., Finch, E., Stratford, P., Balsor, B. & Gill, C., 1997, ‘Interrater reliability of six tests of trunk muscle function and endurance’, Journal of Orthopaedic & Sports Physical Therapy 26(4), 200–208. https://doi.org/10.2519/jospt.1997.26.4.200

O’Connor, D.M., 2004, ‘Groin injuries in professional rugby league players: A prospective study’, Journal of Sports Sciences 22(7), 629–636. https://doi.org/10.1080/02640410310001655804

O’Connor, S., 2014, ‘The design of a reliable musculoskeletal pre-participation screening and the establishment of normative data, epidemiology of injury and risk factors for injury in adolescent and collegiate Gaelic footballers and hurlers’, PhD thesis, Dublin City University, Dublin.

Orchard, J. & Seward, H., 2002, ‘Epidemiology of injuries in the Australian football league, seasons 1997–2000’, British Journal of Sports Medicine 36(1), 39–44. https://doi.org/10.1136/bjsm.36.1.39

Peeler, J. & Anderson, J.E., 2007, ‘Reliability of the Thomas test for assessing range of motion about the hip’, Physical Therapy in Sport 8(1), 14–21. https://doi.org/10.1016/j.ptsp.2006.09.023

Posthumus, M. & Durandt, J., 2009, Physical conditioning for rugby, BokSmart, Cape Town.

Quarrie, K.L., 2001, ‘The New Zealand rugby injury and performance project, VI: A prospective cohort study of risk factors for injury in rugby union football’, British Journal of Sports Medicine 35(3), 157–166. https://doi.org/10.1136/bjsm.35.3.157

Schwellnus, M.P., Thomson, A., Derman, W., Jordaan, E., Readhead, C., Collins, R. et al., 2014, ‘More than 50% of players sustained a time-loss injury (>1 day of lost training or playing time) during the 2012 super rugby union tournament: A prospective cohort study of 17 340 player-hours’, British Journal of Sports Medicine 48(17), 1306–1315. https://doi.org/10.1136/bjsports-2014-093745

Sim, J. & Wright, C.C., 2005, ‘The kappa statistic in reliability studies: Use, interpretation, and sample size requirements’, Physical Therapy 85(3), 257–268. https://doi.org/10.1093/ptj/85.3.257

Van Mechelen, W., Hlobil, H. & Kemper, H.C., 1992, ‘Incidence, severity, aetiology and prevention of sports injuries: A review of concepts’, Sports Medicine 14(2), 82–99. https://doi.org/10.2165/00007256-199214020-00002

Whatman, C., Hume, P. & Hing, W., 2015, ‘The reliability and validity of visual rating of dynamic alignment during lower extremity functional screening tests: A review of the literature’, Physical Therapy Reviews 20(3), 210–224. https://doi.org/10.1179/1743288X15Y.0000000006

Williams, S., Trewartha, G., Kemp, S. & Stokes, K., 2013, ‘A meta-analysis of injuries in senior men’s professional Rugby Union’, Sports Medicine 43(10), 1043–1055. https://doi.org/10.1007/s40279-013-0078-1

Wongpakaran, N., Wongpakaran, T., Wedding, D. & Gwet, K.L., 2013, ‘A comparison of Cohen’s Kappa and Gwet’s AC1 when calculating inter-rater reliability coefficients: A study conducted with personality disorder samples’, BMC Medical Research Methodology 13(1), 61. https://doi.org/10.1186/1471-2288-13-61

Xue, Q.-L., 2016, Measurement reliability, The Harvard Clinical & Translational Science Center, Boston, Massachusetts.

Yeomans, C., Kenny, I.C., Cahalan, R., Warrington, G.D., Harrison, A.J., Hayes, K. et al., 2018, ‘The incidence of injury in amateur male rugby union: A systematic review and meta-analysis’, Sports Medicine 48(4), 837–848. https://doi.org/10.1007/s40279-017-0838-4

