INTER-EXAMINER RELIABILITY WHEN USING THE OBJECTIVE STRUCTURED PRACTICAL EXAMINATION (OSPE) MARK SHEET FOR PHYSIOTHERAPY PRACTICAL EXAMINATIONS

Correspondence Author: Benita Olivier, Physiotherapy Department, Wits Medical School, 7 York Road, Park Town 2193, Johannesburg, South Africa. Email: Benita.Olivier@wits.ac.za

ABSTRACT
Background: The Objective Structured Practical Examination (OSPE) format is used during practical examinations as part of the physiotherapy undergraduate curriculum at the University of the Witwatersrand. Various factors influence inter-examiner reliability, and investigating it when the OSPE is used can lead to improvement of the examination process. The aim of this study was to establish inter-examiner reliability when using the OSPE mark sheet.
Methods: Twelve examiners participated in this study. Thirty-three second-year physiotherapy students were examined at six stations, by two examiners at each station. Spearman's correlation test was used to establish inter-examiner reliability.
Results: The general inter-examiner reliability of the OSPE mark sheet was high. There was a high correlation between examiners who had the same level of experience (r=0.79 to r=0.93; p<0.001). The background knowledge section of the OSPE mark sheet showed the greatest inter-examiner reliability (r=0.75 to r=0.91; p<0.001).
Discussion: In general, a high inter-examiner reliability was found. Examiners with the same level of experience generally seemed to have better inter-examiner reliability when using the OSPE mark sheet. Furthermore, a well-described, operationalised list of micro-skills also improved inter-examiner reliability.
Conclusion: The OSPE mark sheet aids inter-examiner reliability. The use of this method of examination should be encouraged.


INTRODUCTION
The primary aim of the undergraduate training of physiotherapy students is to equip them with sufficient assessment and treatment skills so that they may safely deliver effective care to their patients. These practical skills can be assessed in a variety of ways, one of which is performance analysis through the use of the Objective Structured Practical Examination (OSPE) (Scott et al 2001). The OSPE can be regarded as a simulation where the performance criteria represent characteristics relevant to the authentic performance under assessment (Scott et al 2001). While assessment drives learning from a student perspective, the chosen assessment methods should be valid and reliable from both the examiners' and the students' perspectives (Abraham et al 2009). The development of improved assessment methods, as well as the implementation of these assessment tools, is therefore essential.
The OSPE consists of a series of standardised assessment stations made up of tasks based on clinical situations (Larsen and Jeppe-Jensen 2008). The OSPE mark sheet consists of an operationalised list of competencies, also called micro-skills, each weighted at different levels according to importance and difficulty (Scott et al 2001). These competencies, or specific predetermined criteria, are agreed upon beforehand by examiners (Larsen and Jeppe-Jensen 2008). The OSPE mark sheet allows for the benefits of formative assessment because the competencies, criteria and weighting are clearly defined.
Objectivity in assessment of performance remains a challenge as a result of a variety of factors, including the consistency of those making judgements (Scott et al 2001). Three independent variables are present during the traditional practical examination: the student, the examiner and the patient. The examiner and the patient are potential sources of variability which may influence the assessment of students (Harden et al 1975). The patient component is controlled by using the students' peers as models during practical examinations; this ensures that all students perform their techniques on a young, healthy model. The OSPE method of assessment attempts to control the variability of the examiner by providing very specific criteria for assigning marks (Harden et al 1975). The OSPE is an excellent solution because non-standardised practical tasks present major problems in achieving objective levels of assessment, since the criteria for performance may vary (Scott et al 2001).
Inter-examiner agreement was explored by Scott et al (2001), who found good levels of agreement between three pairs of assessors (>90%; κ scores 0.46-0.64) during recording of a dental impression. They stated that although agreement was high, disagreement between assessors is unavoidable as clinical performance is difficult to standardise. This problem can be minimised by careful wording of each criterion. The use of OSPE mark sheets to assess practical skill per
During the examination session, there were six stations of five minutes each. One technique was examined at each station. Two examiners were based at each station and were given clear procedural instructions (Appendix 1). Examiners were not allowed to discuss results. Two different skills tests were available at each station. Each student performed one of the two skills. All students were assessed at all six stations. On completion of the examination process the OSPE mark sheets were collected and the data analysed.

Data Analysis
Spearman's correlation test was used to establish two sets of correlations: between the marks of the two examiners using the same OSPE mark sheet at the same station examining the same skill, and between the marks allocated by the two examiners per subsection of the OSPE mark sheet for the same skill. Spearman's correlation test was preferred to the Intraclass Correlation Coefficient (ICC) because the questions at each station were different, so it would not be appropriate to aggregate the marks to compare means as is done when using the ICC. Spearman's correlation allowed for comparison of the actual examiner marks per student. Statistical significance was set at p < 0.001. An r-value higher than 0.7 was regarded as a strong correlation and an r-value between 0.5 and 0.7 as a moderate correlation. Analysis of the data was performed using Statistica version 8 (StatSoft Inc, Tulsa, USA).
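The calculation described above can be illustrated with a short Python sketch. It is not the Statistica procedure used in the study: the marks are invented for illustration, the function names are hypothetical, and the label "below moderate" is an assumption, since the text only names the strong and moderate bands. The sketch ranks each examiner's marks (tied marks share the mean rank), takes the Pearson correlation of the ranks, and applies the 0.7 and 0.5 cut-offs. It does not compute a p-value.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_r(x, y):
    """Spearman's r: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def strength(r):
    """Apply the cut-offs stated in the text; the last label is an assumption."""
    if r > 0.7:
        return "strong"
    if r >= 0.5:
        return "moderate"
    return "below moderate"


# Hypothetical marks out of 50 given by two examiners to the same six students.
examiner_a = [32, 41, 28, 45, 37, 30]
examiner_b = [30, 43, 27, 44, 39, 33]

r = spearman_r(examiner_a, examiner_b)
print(round(r, 3), strength(r))
```

Because the coefficient is computed on ranks, it reflects whether the two examiners order the students in the same way, not whether they award the same absolute marks.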

RESULTS
Twelve examiners examined 33 second-year physiotherapy students who agreed to participate in this study. The data from two students at one of the stations were excluded from the analysis as they were only evaluated by one examiner (the second examiner joined the exam process later). Of these twelve examiners, five had less than three years' academic experience (examiners 4, 5, 9, 11, 12), two had between three and five years (examiners 3, 6) and five had more than five years' academic experience (examiners 1, 2, 7, 8, 10).
The aim of this study was therefore to establish inter-examiner reliability when using the OSPE mark sheet.

METHODS
This was a quantitative, descriptive correlational study. The sample population consisted of 12 examiners who were examining 33 second-year physiotherapy students. The study was open to all lecturers (examiners) at the University of the Witwatersrand who were participating in practical tests. A demographic questionnaire was used to capture details of the examiners such as their number of years in academia and area of specialisation.

Procedure
The OSPE mark sheets were developed with input from academic staff and undergraduate students (Appendix 2). Ethical approval for the study was obtained from the Human Research Ethics Committee of the University of the Witwatersrand. Consent was obtained from second-year students and examiners to perform this study before the examination session commenced. One day prior to the examination session all examiners were given the opportunity to familiarise themselves with the contents of the OSPE mark sheets. The full pack of appropriate OSPE sheets and their corresponding memoranda were made available.
There was a high correlation between examiners who had the same level of experience and, other than for examiners 11 and 12, the experienced examiners had higher correlations (Table 3). The correlations between two examiners when examining the same skill were mostly high. For the examiners whose correlations were low, the results were not statistically significant (Table 4). The correlation between examiners when evaluating the students in the 'general' section was low. The highest correlation noted within this area of evaluation was r=0.49 (p=0.074) (Table 5). The correlation between examiners when they evaluated students' background knowledge was generally high (Table 5).

DISCUSSION
The aim of the study was to establish overall inter-examiner reliability as well as inter-examiner reliability within the subsections of an OSPE format for testing practical skills in physiotherapy training at our institution. This discussion centers on inter-examiner reliability of the overall marks allocated at each station (Table 3), for each skill (Table 4) as well as for each subsection (general, technique and background knowledge) of the same skill (Table 5). Based on the literature (Chen et al 2013; Chenot et al 2007), it was expected that examiners with more academic experience would show higher correlations between marks, and this was the general trend except for E11 and E12, who were outliers in this study. The fact that E11 and E12 had a high inter-examiner correlation could stem from the type of skill question they examined. Their questions were on strapping, and the expectations on strapping are clear-cut, with little room for varying interpretation of what is correct and what is not. It is possible that the OSPE mark sheet was clear as to what to look for when marking the students for this section. Where examiners had the same level of experience, irrespective of the number of years, higher inter-examiner reliability scores were obtained. This may be because the OSPE mark sheet contains well-defined micro-skills and for that reason does not require a high level of knowledge or understanding from examiners (Chenot et al 2007). Although examiners did receive all information in writing a day before the practical test, more intensive training of examiners may have improved the inter-examiner reliability of examiners with different levels of experience. This finding is important as it is not always possible to have only more experienced examiners due to human resource constraints. Better induction of examiners with all levels of experience may thus be the solution to minimise the effect of level of experience on the allocation of marks when using the OSPE mark sheet (Chenot et al 2007).
One examiner at each of stations 1, 2 and 4 examined in an area that they themselves had taught and for which they had developed the OSPE mark sheet. This did not seem to influence results, as a high inter-examiner reliability was achieved at these stations despite one examiner supposedly being more familiar with the content of the skill and the mark sheet. Scott et al (2001) suggested that inter-examiner reliability may be improved if the person who develops the OSPE mark sheet is the same person who marks at that station. However, in our study, for example at station 2, where examiners had different levels of experience as well as different areas of expertise, correlation between overall marks was still high. This may emphasise the value of the OSPE mark sheet as a reliable and objective measure to use in spite of differences in experience and expertise (Patricio et al 2013).
The construction of a specific OSPE mark sheet may also influence reliability. At station 4 there was a strong correlation between E7 and E8 (Table 3); however, when correlations were calculated separately for each skill (Table 4), a strong correlation was found for skill 1 but not for skill 2. In Table 5 it is clear that a discrepancy exists in the technique section of the mark sheet. In the case of skill 2, the examiners indicated that five minutes was not enough for them to properly allocate marks on the mark sheet. Nickbakt et al (2013) and Gupta et al (2010) found that not having enough time decreases reliability. The skill 2 mark sheet, however, did not contain much detail; micro-skills were thus not adequately described, which may have increased the room for subjective interpretation and in that way decreased reliability (Scott et al 2001). Also, at station 4, one of the examiners taught this specific skill and developed the mark sheet. It is possible that the lack of detail on the mark sheet, as well as the time constraints, did not influence this examiner's mark allocation because of familiarity with the content of the mark sheet, and subsequently a greater discrepancy in marks occurred. In all other cases where the general overall mark of the examiner showed a high correlation (Table 3), the separate correlations for each skill were also similar (Table 2). The quality of the process (time per station, induction of examiners) as well as the quality of the mark sheet (description of micro-skills) seems to be important in improving inter-examiner reliability.
When reviewing the results surrounding the different sections of the OSPE mark sheet (general, technique and background knowledge) (Table 5), it is evident that the correlation between the examiners ranged from low to high depending on the specific section. This may be due to various factors. The 'general' section of the OSPE evaluation form (Appendix 2) revealed the lowest correlation (the highest being r=0.48 for this section). The overall low correlation may be a result of different examiners having different ideas on what, for example, 'professionalism' embodies, and may also be due to students being unclear as to what is expected of them in terms of professionalism. In addition, 'preparation of equipment' may not require mark allocation in some of the skills as there may be no equipment needed for that skill. For example, manual muscle testing does not require specific equipment, and thus no marks should be allocated for preparation of equipment for this skill. Examiners may have awarded marks differently here, either simply giving the marks or abstaining from mark allocation because the expectation of the student was possibly unrealistic. However, that being acknowledged, mark allocation is also geared towards 'preparation of the area', and this certainly should be a standard procedure followed by all students regardless of the skill. It is, however, possible that there may be examiner bias based on personal expectations, which may differ. Examiner objectivity in assessment of various aspects of student performance is a difficult task (Scott et al 2001). It is therefore important that the criteria given on the mark sheet are very specific and detailed. The subsection assessing 'interaction' with the model may also be examined based on the individual examiner's subjective interpretation of what is expected from the student. The low overall mark allocation for this section (5 marks out of a total of 50 marks) seems justified since this section does appear to depend on an examiner's subjective opinion of the content being asked.
The 'technique' section displayed a generally moderate to high correlation between examiners. There appears to be agreement between students and examiners on how techniques are applied, as these are taught to the students in a standardised and objective manner. This finding seems to agree with Scott et al (2001), who found that clearly set-out criteria for each skill reduce examiner subjectivity in assessment. Under the 'technique' section in the OSPE, there are specific micro-skills, each with its specific mark allocation, describing the exact way in which the students are to carry out the specific part of the technique. This allows for a greater degree of objectivity in assessing the students and provides an explanation for the greater correlation between the examiners when assessing this section. The mark allocation for this section of the OSPE is 40 marks out of a total of 50 marks. A high mark allocation for this section appears appropriate, as techniques are a core part of physiotherapy students' training and the section allows for a high degree of objectivity in application and assessment.
The highest measure of correlation between examiners for each section of the OSPE was found within the 'background knowledge' section. This section accounts for the students' theoretical knowledge underpinning the questions provided; a straightforward theoretical question was asked by the examiner which required a direct answer by the student. On the OSPE mark sheet, the answer was provided for the examiners, thus minimising any degree of subjectivity in the assessment of this section. The marks allocated for this section were 5 out of 50 marks. This level of marks appears valid as the OSPE's focus is on practical skills rather than theoretical skills, hence the higher weighting of marks for the practical components of the examination.
Future studies should be carried out to establish the role of years of experience and area of expertise in inter-examiner reliability. Inter-examiner reliability can be tested for individual micro-skills and in that way each OSPE mark sheet can be improved.

LIMITATIONS OF THE STUDY
While we preferred to use Spearman's correlations, one could also use the ICC to get an overall view of inter-examiner reliability. The results are also limited to the mark sheets that were used in this study. To get a more accurate view of inter-examiner reliability one could also use one mark sheet for all the stations. This was, however, not possible given that we chose to use an actual practical test session to check inter-examiner reliability when using the OSPE mark sheet.
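For readers who wish to try the ICC alternative mentioned above, the following Python sketch computes one common form, the two-way random-effects, absolute-agreement, single-rater coefficient often written ICC(2,1). Which ICC form would suit this design is not specified in the study, so the choice here is an assumption, as are the function name and the invented marks.

```python
def icc_2_1(scores):
    """ICC(2,1) for a table with one row per student and one column per examiner.

    Assumed form: two-way random effects, absolute agreement, single rater.
    """
    n = len(scores)       # number of students
    k = len(scores[0])    # number of examiners
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Two-way ANOVA sums of squares.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between students
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between examiners
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_err = ss_total - ss_rows - ss_cols                    # residual

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Hypothetical marks: the second examiner is always two marks more generous.
marks = [[30, 32], [40, 42], [35, 37]]
print(round(icc_2_1(marks), 3))
```

In this invented example the two examiners rank the students identically, so a rank correlation would be perfect, whereas the absolute-agreement ICC falls below 1 because of the constant two-mark difference. This is one reason the two statistics can give different pictures of inter-examiner reliability.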

CONCLUSION
Despite different levels of experience and different areas of expertise, the inter-examiner reliability when using the OSPE mark sheet was generally good to high. The induction of examiners, the time allocated per station and the amount of detail in which the micro-skills are described may have influenced the noted differences in inter-examiner reliability. Although objectivity in practical examinations remains a challenge, the good to high inter-examiner reliability when using the OSPE mark sheet makes it an appropriate choice. Physiotherapy educators should be encouraged to take the time to draw up very specific and detailed criteria for the examination of the multitude of practical skills which are assessed at a preclinical level.

REFERENCES
Lee MS, Chen WJ, Lee ST 2013 Assessment in orthopedic training - an analysis of rating consistency by using an objective structured examination video. Journal of Surgical Education 70: 189-192
Chenot JF, Simmenroth-Nayda A, Koch A, Fischer T, Scherer M, Emmert B, Stanske B, Kochen MM, Himmel W 2007 Can student tutors act as examiners in an objective structured clinical examination? Medical Education 41
an OSCE with an element of self- and peer-assessment. European Journal of Dental Education 12: 2-7
Macluskey M, Hanson C, Kershaw A, Wight AJ, Ogden GR 2004 Development of a structured clinical operative test (SCOT) in the assessment of practical ability in the oral surgery undergraduate curriculum. British Dental Journal 196
new gold standard for evaluating postgraduate clinical performance. Annals of Surgery 222: 735-742
(Table 1).

Table 2: The practical skills that the students were assessed on at each station
Each student was assessed on either Skill 1 or Skill 2

Table 4: Correlations for two examiners when examining the same skill at the same station
Two different skills were asked at each station; each student performed either skill 1 or skill 2 at each station; overall marks for each skill were correlated.