The interrater and intrarater reliability of the flexibility and strength tests included in the Sport Science Lab® screening protocol amongst professional rugby players

Background Considering the injury incidence rate (IR) associated with elite-level rugby, measures to reduce players’ injury risk are important. Establishing scientifically sound, pre-season musculoskeletal screening protocols forms part of injury prevention strategies. Objective To determine the interrater and intrarater reliability of the flexibility and strength tests included in the Sport Science Lab® (SSL) screening protocol. Methods We determined the interrater and intrarater reliability of 11 flexibility and nine strength tests. Twenty-four injury-free, elite, adult (> 18 years), male rugby players were screened by two raters on two occasions. To establish intrarater and interrater reliability, Gwet’s AC1, AC2 and intraclass correlation coefficients (ICC) were used for the analysis of binary, ordinal and continuous variables, respectively. Statistical significance was set at the 95% level. Results Flexibility tests that require lineal measurement had at least substantial interrater (ICC = 0.70–0.96) and intrarater reliability (ICC = 0.89–0.97). Most of the flexibility tests with binary outcomes attained almost perfect interrater and intrarater reliability (Gwet’s AC1 = 0.80–0.97). All strength tests attained at least substantial interrater (Gwet’s AC2 = 0.73–0.96) and intrarater (Gwet’s AC2 = 0.67–0.97) reliability. Conclusion The level of interrater and intrarater reliability of most of the flexibility and strength tests investigated supports their use to quantify various aspects of neuromusculoskeletal qualities and possible intrinsic risk factors amongst elite rugby players. Clinical implications Establishing the reliability of tests is one step towards supporting their inclusion in official screening protocols. The results of our study verify the reliability of the simple, clinically friendly strength and flexibility tests included and therefore support their use as preparticipation screening tools for rugby players.
Further investigation as to the association thereof to athletes’ injury risk and performance is warranted.


Introduction
Despite the high risk of injury related to the collisional nature of rugby union (henceforth rugby) (Schwellnus et al. 2014), the sport remains one of the most popular professional team sports worldwide (Brooks 2005). Rugby injury incidence rates (IR) are considered high compared to sports such as soccer and basketball (Yeomans et al. 2018), but similar to other high-impact collisional sports such as Australian Rules football (Orchard & Seward 2002) and the American National Football League (NFL) (Kerr et al. 2016).
Epidemiological studies conducted in South Africa (Schwellnus et al. 2014) and England (Brooks 2005), as well as a meta-analysis conducted by Williams et al. (2013), report similar findings regarding the incidence and nature of injury amongst professional male rugby players. All three studies concluded that the majority of injuries occur during matches (injury IR: 81.0–91.0 injuries/1000 player-hours) and that most injuries are related to tackling incidents. These studies also concurred that the injury rates for forwards and back-line players are similar and that the lower (48.1%–58.9%) and upper limbs (15.6%–25.6%) are the most common sites of injury.
Various rugby unions have consequently developed pre-season musculoskeletal (MSK) screening protocols in an attempt to identify players at risk of sustaining in-season injuries. The protocol developed by the South African Rugby Union (SARU), which the developers claim to be similar to those of New Zealand and Australia, includes a series of physical screening tests related to, amongst others, strength, flexibility and joint range of motion (ROM) (Gray & Naylor 2012). Limited research regarding the association between the tests included in these protocols and injury incidence amongst elite-level rugby players has, however, been published. Also, the developers' rationale for the inclusion of the tests was largely based on the tests' reliability and normative values amongst athletes other than elite-level rugby players. Quarrie (2001) investigated various MSK performance measures amongst rugby players, of which only one was found to have a univariate relationship with injury. The developers of the Sport Science Lab® (SSL) screening protocol therefore identified a need for evidence regarding existing MSK screening tools' reliability and association with in-season injury. Hence, the aim of our study was to develop a screening protocol, investigate (amongst other qualities) its reliability, publish the findings and, if necessary, amend the tool to improve its psychometric properties.
When designing a screening protocol, the challenge lies in finding a delicate balance between scientific accuracy (reliability and validity) and practicality (ease and duration of execution; a small amount of inexpensive equipment and space; and the examination skill required) (Castro-Piñero et al. 2009). Reliability refers to the reproducibility of measurements within a given participant over time (intrarater reliability) and by various raters (interrater reliability) (Hayen, Dennis & Finch 2007). The ability of researchers to make inferences regarding certain outcome variables, such as intrinsic risk factors, is largely dependent on repeated measurement accuracy, and the reliability of screening protocols is therefore pivotal (Dennis et al. 2008). Xue (2016) suggested that better observer training, improved scale design and the introduction of items better at capturing heterogeneity improve the reliability of a screening tool. The developers considered both these proposed strategies to improve reliability (Xue 2016) and the practicality of the protocol when designing the SSL screening protocol. The complete SSL screening protocol consists of 11 flexibility, seven strength, six plyometric and one rugby-specific fitness test. As the plyometric and cardiorespiratory fitness tests are objective in nature (i.e. the raters do not have to measure, eyeball or base a rating on subjective measurement, as is the case for the strength and flexibility tests), we did not include the plyometric and fitness tests in the reliability part of our study. The strength and flexibility tests included, the equipment required and the standard instructions are described in Online Appendix 1, Table 1-A1, whilst a detailed description of the purpose of, rationale for the inclusion and modification of, and the proposed minimal standards for the flexibility and strength tests included in the protocol is summarised in Online Appendix 1.

Rationale for inclusion of flexibility tests, and manner of execution
Limitations in muscle flexibility and related joint mobility have been identified as injury risk factors amongst rugby players (O'Connor 2004; Yeomans et al. 2018). Considering the suggestions summarised by Xue (2016) regarding the improvement of test reliability, the flexibility tests were simplified to include only tape-measured (lineal) outcomes or joint ROM, assessed relative to stationary objects with either 0° horizontal or vertical planes, such as a plinth.

Rationale for inclusion of strength tests and manner of execution
The game of rugby requires players to tolerate and generate forces to propel their own and additional external weight loads. It is thus fair to regard muscular strength and power as important performance predictors (Posthumus & Durandt 2009) as well as intrinsic risk factors relevant to injury prevention (Gamble 2004). Whilst strength does not have a set definition or unit of measure, it is an attribute of force and power (Bohannon 2019). Manual muscle tests (MMT) have been used as a way to gauge muscle output (Bohannon 2019). The developers of the SSL screening protocol regarded MMT as the most practical option as they are inexpensive, quick and easily performed. The manner of execution and the proposed rating scale are however new and have not been investigated. Some might argue that hand-held dynamometers (HHD) are equally practical and provide more objective output measures. However, the cost of HHD may be prohibitive to some, and the main limitation of MMT, that is, the subjectivity of tester strength and of the related external resistance applied, is not overcome (Bohannon 2019). Further limitations of HHD and existing MMT strength rating scales are summarised in Online Appendix 1, Table 2-A1. Our study is the first of two (the second investigates the association between the tests included in the protocol and in-season injury) conducted to establish a clinically useful, evidence-based, pre-season screening protocol that could be used by both medical and strength and conditioning professionals. In a team setting, this would allow for a holistic picture of athletes' pre-season intrinsic injury risks as well as the establishment of baseline fitness parameters. The aim of our study was thus to investigate the interrater and intrarater reliability of the SSL screening protocol.

Methodology
This was a reliability study with a test-retest design. Guidelines for reporting reliability and agreement studies were followed (Kottner et al. 2011).
Information regarding our study was sent to 14 official national rugby unions requesting that they send a list of potential participants who volunteered. Participants included elite (i.e. part of an official SARU team) male rugby players between the ages of 19 and 36 years who were injury free at the start of the competitive rugby season. Players who were not on the active team roster at the start of the competitive rugby season were not eligible for inclusion. For convenience, the sample was selected based on the teams'/participants' geographical proximity to the facility of an established sport rehabilitation and performance centre.
The sample size was calculated based on published guidelines regarding sample size requirements for two-rater reliability studies with nominal (Bujang & Baharum 2017; Sim & Wright 2005) or ordinal (Bujang & Baharum 2017) variables, which assume at least 50% positive ratings and a power of 80%. The authors of these studies suggest a sample size of between 25 (Sim & Wright 2005) and 29 (Bujang & Baharum 2017) participants. To account for dropout, 27 volunteers were included. Other similar reliability studies included 15 (O'Connor 2014) and 40 (Armstrong 2016) participants, respectively.

Procedure
Our study commenced 3 weeks prior to the start of the competitive rugby season to allow for a standardised volume of training to have been completed. Intrarater and interrater reliability was assessed concurrently. The screening tests were conducted by a qualified physiotherapist (Rater 1; first author) and an athletic trainer (Rater 2). Both raters had more than 5 years of clinical experience and were experienced in the use of the SSL screening protocol in daily practice. Two research assistants recorded the participants' ratings/measurements. Raters were not allowed to communicate with each other during the rating of any of the screening tests and were blinded to the participants' injury history and each other's findings.
After performing a 10-min warm-up of their choice, participants were requested to perform all strength and flexibility tests as described in Appendices 1A and 1B. For time efficiency and minimal inconvenience to participants, all tests performed on the floor were done first (in no particular order), followed by the tests performed in standing and then the tests performed on the plinth. Each test was performed three times and the best attempt was recorded.
Considering the logistics, practicality and training schedules of the participating teams, one week was dedicated to data collection. To minimise any physiological effects and to allow symptoms that may have been provoked by the tests to subside, participants were screened on two consecutive days, in the same environment, before training sessions. Ten participants were screened on two consecutive days and, one day thereafter, the remaining participants were screened on the next two consecutive days. During the screening sessions, each participant was screened once by Rater 1 and an hour later by Rater 2. To minimise potential recollection bias, the order of the participants scheduled for screening on a particular day was randomised for each rater in both rating sessions; this randomisation, coupled with raters being blinded to the ratings made during session 1, further reduced possible recollection bias.

Data analysis
Statistical analyses were performed using Stata/IC 15.1 (StataCorp, TX, USA). Continuous variables were summarised by means and standard deviations, whilst binary and ordinal variables were summarised by counts and frequencies.
Interrater reliability was determined by comparing the per-session ratings (for both sessions) of Rater 1 with those of Rater 2. Intrarater reliability was analysed by comparing each rater's day 1 ratings with their day 2 ratings. To determine both interrater and intrarater reliability, Gwet's AC1 (Gwet 2016) was used for tests with binary (yes or no) outcomes, Gwet's AC2 (Gwet 2016) for ordinal variables and ICC(3,2) (two-way mixed effects, consistency, multiple raters/measurements) (Mandrekar 2011) for tests with continuous outcome measures. The respective reliability coefficients with their 95% confidence intervals (CIs) were reported. Standard error of measurement (SEM) values were also calculated. Intraclass correlation coefficient (ICC) values were interpreted according to the Landis and Koch scale (Landis & Koch 1977). Gwet's agreement coefficients have been shown to be more stable than Cohen's kappa (k) and other coefficients, and resistant to the kappa paradox (high percentage agreement but a low k-value) (Gwet 2014, 2016; Wongpakaran et al. 2013). Interpretation of results was done according to the benchmarking procedure suggested by Gwet (2014), that is, the absolute agreement coefficients were benchmarked at a cumulative probability (in our case 95%) for any reliability coefficient to fall into one of the following categories: < 0.00 = poor; 0.00–0.20 = slight; 0.21–0.40 = fair; 0.41–0.60 = moderate; 0.61–0.80 = substantial; 0.81–1.00 = almost perfect. This method allows for direct and more precise comparisons of the different agreement coefficients and their representation on the Landis and Koch scale.
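To make the agreement statistics concrete, the two-rater Gwet's AC1 for binary outcomes and the benchmark categories used above can be sketched as follows. This is a minimal illustrative sketch in Python with hypothetical ratings, not the authors' Stata analysis; the function and variable names are our own.

```python
def gwet_ac1(rater1, rater2):
    """Gwet's AC1 agreement coefficient for two raters and binary ratings."""
    n = len(rater1)
    # Observed agreement: proportion of items both raters scored identically.
    pa = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Chance agreement from the mean marginal probability of a positive rating.
    pi = (sum(rater1) + sum(rater2)) / (2 * n)
    pe = 2 * pi * (1 - pi)
    return (pa - pe) / (1 - pe)

def benchmark(coef):
    """Landis and Koch categories as applied in the text."""
    if coef < 0.00:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
                         (0.80, "substantial"), (1.00, "almost perfect")]:
        if coef <= upper:
            return label

# Hypothetical pass (1) / fail (0) ratings for ten players.
r1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]  # Rater 1
r2 = [1, 1, 0, 1, 1, 1, 1, 0, 1, 1]  # Rater 2
ac1 = gwet_ac1(r1, r2)
print(round(ac1, 2), benchmark(ac1))  # → 0.84 almost perfect
```

Note how the chance-agreement term depends only on the overall prevalence of positive ratings, which is what makes AC1 resistant to the kappa paradox when one category dominates.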

Ethical considerations
Ethical approval was obtained from the University of the Witwatersrand Human Research Ethics Committee (Medical) (M180452). Written permission was obtained from the rugby union and the coaches of the respective teams and informed consent was obtained from players who volunteered to participate in our study.

Results
Three (11.11%) of the participants (n = 27) did not attend the second screening session because of logistical problems or conflict with other obligations. Data for 24 participants were therefore analysed. The average age of the players was 19.96 (± 1.78) years, weight was 95.33 (± 13.50) kg and height was 186.50 (± 8.98) cm.

Descriptive statistics
The descriptive statistics for the flexibility tests with continuous outcomes are summarised in Table 1; the flexibility tests with binary outcomes are summarised in Table 2 and all strength tests are summarised in Table 3. Considering the means and minimum standards of the respective flexibility tests, both raters agreed on both days that most of the players did not achieve the minimum standards for the majority of the tests. In contrast, the raters agreed that on both days the majority of players achieved the set minimum standards (a score of 4 or 5) for most of the strength tests.

Inter- and intrarater reliability
The inter- and intrarater agreement coefficients, CIs and standard errors (SE) for the flexibility tests with continuous outcomes are summarised in Table 4; the flexibility tests with binary outcomes in Table 5 and the strength tests in Table 6.

Flexibility tests
With the exception of the

Discussion
Because of the collisional nature of rugby, injuries seem an inevitable part of the game. However, clinicians should continuously seek strategies to minimise the incidence and severity of injuries. For medical and conditioning staff involved in elite-level sports, such strategies involve the development of practical and scientifically sound pre-season MSK screening protocols to identify possible intrinsic risk factors to injury.
Like Ashworth et al. (2018), who investigated the reliability of an original upper body strength test, our study only included elite adult male rugby players. The anthropometrics and demographics (age) of the players in our study were similar to those of Ashworth et al. (2018). Haitz et al. (2014) investigated the inter- and intrarater reliability of a battery of screening tests amongst collegiate athletes (i.e. all participating at the same level) of various sports and reported high levels of interrater (k = 0.83–1.00) and intrarater (k = 0.71–0.95) reliability. A degree of homogeneity in the level of participation and sporting activity might therefore have a significant impact on the outcomes of studies investigating the reliability of neuromusculoskeletal screening tests. One of the reasons is that elite athletes' ability to recover after performing multiple physical fitness tests exceeds that of athletes participating at lower levels of competition. The variability in test results because of the possible physiological effects of repeated physical fitness testing (more specifically strength and flexibility testing) by multiple raters on multiple days may therefore be more limited, and the results more reliable.
Considering the mean of the toe-touch test (TT test) (−0.75 cm to 2.04 cm), the standard deviation (SD) (4.73–5.32) was large. The TT test is the only test of which the outcome distribution spans zero (i.e. outcomes can be either greater or less than zero). The large SD can therefore be explained by the cumulative, mathematical effect of including both positive and negative values in the calculation of the mean and, in turn, the SD. The calculation of the SEM takes both the SD and the ICC into account. Considering the high SD (in addition to the lower interrater reliability; ICC = 0.70 [95% CI: 0.40–0.86]), it is not surprising that the interrater SEM (8.24–8.55) for the TT test was also high. The SD for the combined right shoulder mobility test was also large relative to the mean (mean = 4.63–5.17 cm; SD = 4.73–5.32 cm). This could be attributed to the number of zero measurements included in the data set.
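The relationship described here, in which the SEM grows with the SD and shrinks as reliability improves, is commonly computed as SEM = SD × √(1 − ICC). The following minimal sketch uses hypothetical values, not the study's actual data:

```python
import math

def sem(sd, icc):
    """Standard error of measurement: SEM = SD * sqrt(1 - ICC)."""
    return sd * math.sqrt(1 - icc)

# Hypothetical values: with the same SD, a lower ICC yields a larger SEM,
# which is why a wide SD combined with a modest ICC inflates the SEM.
print(round(sem(5.0, 0.70), 2))  # modest reliability
print(round(sem(5.0, 0.95), 2))  # near-perfect reliability
```

With SD = 5.0, the SEM roughly halves twice as the ICC rises from 0.70 to 0.95, illustrating why tests with near-perfect ICCs report much tighter measurement error.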
All lineal flexibility tests had at least substantial interrater reliability (0.70–0.98) and almost perfect intrarater reliability (0.89–0.98) and, except for the TT test, narrow corresponding CIs as well. This can be attributed to the objective simplicity and precision with which outcomes can be measured using a tape measure. Although the interrater reliability of the TT test did not achieve the acceptable benchmark set by the authors (i.e. almost perfect), the intrarater reliability did achieve the acceptable standard.
Interrater reliability for this test can be improved by a more thorough description of the test, specifically ensuring that raters identify all possible compensatory mechanisms related to achieving better test scores, for example, by slightly bending the knees.
Although the TT, combined shoulder flexion and extension, and v-sit tests had at least substantial inter- and intrarater reliability, their respective SEMs were larger than those of the other lineal flexibility tests. At first glance, these values may seem indicative of a lesser degree of agreement. However, this can be attributed to the larger range of possible scores (i.e. greater distribution range) associated with these tests. For example, the maximum range for ankle dorsiflexion (DF) might be limited to 0 cm to 20 cm, whereas combined shoulder flexion and extension has an outcome range of 0 cm to > 60 cm. For larger outcome ranges the variability (i.e. SD) may be greater, resulting in larger SEM values.
The modified Thomas test (MTT) and hip external rotation (ER) tests yielded lower intrarater and interrater reliability values (Gwet's AC < 0.73). The difference in the reliability achieved for these tests can be attributed to their complexity. Whilst some tests require the observation of a single joint movement or have obvious rating criteria (e.g. the dorsal aspect of the foot and ankle has to be flat against the floor), the MTT and hip ER tests challenge the flexibility and range of multiple joints and structures simultaneously, thereby making the rating criteria more complicated. Numerous studies have found that observational neuromusculoskeletal tests requiring the assessment of more than one component have poor intrarater reliability (Monnier et al. 2012; Moreland et al. 1997; Whatman, Hume & Hing 2015). To improve reliability, one could consider simplifying the tests by, for example, executing the MTT three times and only assessing one aspect per repetition. Another consideration is to measure the outcomes of the tests more objectively, using a goniometer. However, Peeler and Anderson (2007) reported poor interrater and intrarater reliability regardless of whether an observational dichotomous (fail/pass) scale or a goniometer was used for the measurement of the various aspects of the Thomas test. The hip ER test might be improved by objectively measuring the linear distance from the forehead to the plinth surface using a tape measure. If the participant is unable to place the lateral aspect of the knee of the test leg flat on the plinth, the distance from the lateral epicondyle to the plinth surface can also be measured as a baseline for tracking progress.
Several MMTs and rating scales have been documented (Avers & Brown 2019; Cuthbert & Goodheart 2007). However, some have fundamental shortcomings when applied to an athletic population. The main limitations related to their relevance in a rugby population, as explained in the introduction and in Online Appendix 1, Table 2-A1, relate to the non-functional player position during testing and the type of muscle actions (concentric only) tested. The manual strength testing regime proposed by the developers of the SSL screening protocol attempts to address some of these shortcomings of existing manual strength testing regimes.
Considering the physicality of MMT, the subjectivity of tester resistance and tester strength has been identified as a factor limiting their reliability, particularly amongst higher scores (Bohannon 2019). In our study, the anthropometrics and demographics of the raters differed vastly (Rater 1: female, 34 years, height = 168 cm, weight = 60.00 kg; Rater 2: male, 28 years, height = 188 cm, weight = 100 kg), yet the interrater reliability of all the manual strength tests, with the exception of the left hip adduction test (which was substantial), was almost perfect. The level of reliability and agreement therefore did not seem to be affected by the raters' physical characteristics or resistance-related subjectivity.
In fact, perhaps contrary to what one would expect considering the modes in Table 3, Rater 1 rated most players' strength lower than Rater 2 on three occasions (day 1: right glut/ham; day 2: glut/ham; day 2: left hip IR) and the same as Rater 2 on 21 (out of 28) occasions.
The modes further indicate that, with the exception of the right glut/hamstring and left hip IR tests, both raters agreed on both days that the majority of the participants either met or did not meet the proposed minimum standards. It therefore appears that both raters had similar clinical decision-making skills, reiterating the importance of well-described testing procedures and adequate training in the use of the tools. Specifically, the testers' understanding of the position and hand placement that allow for optimal biomechanical advantage when the external force is applied is crucial.
Reliability studies investigating MMT amongst elite, healthy athletes are rare. Manual muscle tests (MMTs), such as the 'break-test' (Avers & Brown 2019), have good reliability for assessing individuals with neuromusculoskeletal dysfunction (Cuthbert & Goodheart 2007). In our study, the authors proposed the use of a novel MMT strength test battery and rating scale as a screening, as opposed to a diagnostic, tool for asymptomatic, seemingly healthy individuals. Manual muscle tests evaluate the ability of the nervous system to adapt to either meet or counter the changing pressure exerted by the examiner (Cuthbert & Goodheart 2007). The developers of the SSL strength testing regime therefore assume that an optimally functioning, well-trained nervous system will immediately alter motor unit recruitment in an attempt to meet the demands of the test (the external pressure/force applied), whilst a sub-optimal or dysfunctional nervous system, or the structurally damaged muscle fibres that it innervates, will fail to do so. Cuthbert and Goodheart (2007)

Limitations and strengths of the study
The reliability measures were based on the fixed raters (not randomly selected) who participated in our study, and the results may be limited to this specific group of raters. Only elite adult male rugby players were investigated and the results are therefore not generalisable to other sports, or youth players and/or players playing at a different level. Although a power analysis was done, the sample size was small. Further research is required with larger cohorts. Ideally, if the team's schedules allowed, a longer wash-out period would have been introduced to further reduce recollection bias. The strength of our study is that a homogeneous population, following the same training schedule, was evaluated. Therefore, the variability of individualised scores because of physiological changes arising from testing or training (and other possible confounding variables such as training load between sessions) was limited.

Clinical and research implications
Reliability of screening protocols is essential, as it is of fundamental importance to the quality of players' healthcare and performance that professionals can replicate and agree on their findings and conclusions. Furthermore, reliable tools should reflect the qualities of the group of participants being screened and not those of the raters involved in the screening. The raters involved in our study had experience in the use of the SSL screening protocol, emphasising the importance of raters being trained in the use of standardised protocols. Future studies should focus on establishing the reliability of this screening protocol amongst novice raters with less experience, across a range of different sporting professionals, as well as amongst athletes participating at different levels and in other sports. As the reliability of most of the tests included in the SSL protocol has been established, the association of these tests with injury risk could be investigated to establish players' injury risk profiles at the start of the season and, in turn, to develop targeted injury prevention strategies. Kottner and Streiner (2011) emphasised that the interpretation of differences in ratings is not solely a statistical decision, but also a clinical one. In clinical practice, the interpretation should consider the purpose and consequences of the test results in order to establish the acceptable margin of error for clinical decision-making. Here, as in other studies (Kottner & Streiner 2011), unless there were statistically sound reasons for accepting lower reliability coefficient values, we considered values of at least 0.80 (i.e. 'almost perfect' agreement) as clinically acceptable. Lower values might however still be useful for research purposes and group comparisons (Kottner et al. 2011).

Conclusion
Most of the flexibility and strength tests included in the SSL screening protocol demonstrated at least substantial intrarater and interrater reliability. Establishing the reliability of this protocol is one step towards supporting its use as a clinical tool to quantify various aspects of neuromusculoskeletal qualities and to identify possible intrinsic risk factors amongst adult, elite male rugby players. Additionally, the test results reported here can provide baseline scores or measurements for comparison with similar or different level athletes. Continued efforts should be made by the developers of the SSL screening protocol to improve the reliability of, or to include alternative tests for, the assessment of hip flexor flexibility and hip external rotation ROM.