Criterion Referenced Assessment: Establishing Content Validity of Complex Skills Related to Specific Tasks

Career and Technical Education (CTE) is a nationwide program that emphasizes training at the primary, secondary, and post secondary educational stages for the career and workforce needs of today's and tomorrow's society. Mandated indicators of success have been set in place, and secondary schools are expected to improve students' skill levels in preparation for their next stage of education or employment. This study examines ways to measure proficiency in Automotive Service Technology (AST) skill ability domain levels, which consist of knowledge, concepts, and skills. The second part of the study examines the reliability and validity of an assessment method that is aligned with the AST foundational skills and ability levels needed by students or future employees and is intended as a means to evaluate their readiness for their next educational stage.

CRITERION REFERENCED ASSESSMENT: ESTABLISHING CONTENT VALIDITY OF COMPLEX SKILLS RELATED TO SPECIFIC TASKS
The United States Department of Education's Strategic Plan for 2007 to 2012 outlines focused initiatives for educational reform (2007). The third goal of the Strategic Plan centers on students' successful transition between secondary education, post secondary education, and the workforce. Career and Technical Education (CTE) is a nationwide program that emphasizes training at the primary, secondary, and post secondary educational stages for the workforce needs of today's and tomorrow's society. Indicators of success have been mandated, and schools are expected to improve students' skill levels to prepare them for the next stage. The context for the assessment validation and evaluation process was the CTE area of Automotive Service Technology (AST). State mandates for funding frequently require an AST program to meet the National Institute of Automotive Service Excellence (ASE) (2005) program certification standards.
ASE is most noted for Automotive Technician (AT) competency certification, but also certifies AST training programs through the National Automotive Technician Education Foundation (NATEF) (2005). NATEF sets quality standards for AST programs that include current industry task listings, required tool and equipment lists, and a general description of skills that are assumed to be taught and learned. Unfortunately, two apparently prerequisite skill areas are not clearly defined or assessed by NATEF or ASE: Basic Vehicle Interval Maintenance Skills and Basic Vehicle Repair Skills, which together form the Automotive Service Technology Foundational Skills (ASTFS) set. Current reviews of certification or standardized assessment literature do not reveal a single assessment that can measure ASTFS skills for secondary or post secondary education.


REVIEW OF THE LITERATURE
The AST repair industry sector is expected to increase by 30.7% between the years 2004 and 2014 (US Department of Labor, 2007). CTE training for AST students exists at secondary and post secondary levels. Specific training for a manufacturer's line of vehicles begins shortly after employment if the automotive technician (AT) is evaluated as being ready. Thus, it is important that CTE properly serve the educational needs of prospective AT employees. It is also important that AST employers can evaluate a prospective AT's technical proficiency levels quickly and accurately, either prior to hiring or to evaluate a person's readiness for job level responsibilities, activities, and further training.
There are basically two assessment design routes available when assessing a person's ASTFS levels. The first route is a performance testing strategy of the ASTFS (Gronlund, 1998). This type of test typically relies on direct observation and judgement rather than traditional paper and pencil forced choice strategies (Worthen, White, Fan, & Sudweeks, 1999). Performance testing includes a simulation of the actual tasks, during which the testing candidate is observed and rated by an evaluator (Gronlund, 1998). An advantage is that this process can yield highly valid results. Disadvantages include high expense, time consuming processes, individual administration requirements, and the need for evaluator training and inter-rater reliability estimation techniques (Worthen, et al., 1999; Gronlund, 1998). Inter-rater variability in particular is difficult to solve and often results in low reliability and subjective results. Performance testing is commonly attempted in most AST training programs, is the preferred method of CTE teachers, and is also a requirement of NATEF AST program certification (NATEF, 2005).
A second type of assessment strategy would include a norm-referenced test (NRT) or criterion-referenced test (CRT) to serve as an objective, psychometric measure of aptitude or proficiency (Worthen, et al., 1999). Psychometric testing can be defined as the science of measuring psychological aspects of a person such as knowledge, skills, abilities, or personality. Psychometric testing traditionally utilizes paper and pencil or computer adapted assessment techniques (Labor Law Center, 2005). An aptitude test is generally a broadly constructed measure of the cumulative effects of past learning (on a test taker) in order to predict future learning potential (Worthen, et al., 1999; Hopkins, 1998). Proficiency tests tend to be much more specific measures of a person's ability to perform specific tasks at a specific criterion level, which would consistently parallel those levels typically found in authentic performance contexts.
Reliability estimates and empirical validity indicators require the application of various statistical procedures based on the type of assessment being evaluated (Joint committee, 1999). Reliability estimates for a norm-referenced, standardized achievement test do not work as well for CRT proficiency assessments due to the restriction of item homogeneity responses in the target population (Crocker & Algina, 1986). Although Cronbach's Coefficient Alpha is a versatile and widely used method for estimating the internal consistency of a typical scale or NRT (Worthen, et al., 1999), it simply is not as meaningful or useful in the context of CRT proficiency assessment (Crocker & Algina, 1986). A more meaningful reliability estimate is the test-retest correlation coefficient between two sets of test takers' scores on an assessment that is administered twice, about two weeks apart. A better way to estimate reliability for a standardized proficiency assessment is to report the probability of a mastery decision using the same cut score or criterion on a parallel test. Subkoviak's (1976) Coefficient of Agreement (CGA) estimate for a mastery decision reports the probability that test takers would be assigned the same mastery decision on a test parallel to the first, based on results from a single test administration.
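The single-administration agreement estimate described above can be illustrated with a short sketch. It follows the general logic usually attributed to Subkoviak (1976): each observed proportion correct is regressed toward the group mean using KR-20, a binomial model gives the chance of reaching the cut score, and the chance of the same decision on two parallel tests is averaged over examinees. The function names, the toy response matrix, and the clamping of KR-20 to the 0 to 1 range are assumptions made for illustration; this is not the procedure or software used in the study.

```python
from math import comb

def kr20(responses):
    """KR-20 internal consistency for rows of 0/1 item scores.
    Assumes at least two items and non-zero total score variance."""
    n_people = len(responses)
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n_people
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_people
    # Sum of item variances p * (1 - p), p = proportion correct per item.
    pq = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in responses) / n_people
        pq += p * (1 - p)
    return (n_items / (n_items - 1)) * (1 - pq / var_total)

def prob_at_or_above(cut, n_items, p_true):
    """Binomial probability of a raw score at or above the cut score."""
    return sum(comb(n_items, k) * p_true ** k * (1 - p_true) ** (n_items - k)
               for k in range(cut, n_items + 1))

def agreement_coefficient(responses, cut):
    """Estimated proportion of examinees given the same mastery decision
    on two parallel tests, from a single administration."""
    n_people = len(responses)
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    mean_prop = sum(totals) / (n_people * n_items)
    # Assumption: clamp the reliability so the regressed estimate stays a proportion.
    rel = min(1.0, max(0.0, kr20(responses)))
    agree = 0.0
    for t in totals:
        # Regress the observed proportion correct toward the group mean.
        p_hat = rel * (t / n_items) + (1 - rel) * mean_prop
        p_master = prob_at_or_above(cut, n_items, p_hat)
        # Same decision twice: mastery both times or non-mastery both times.
        agree += p_master ** 2 + (1 - p_master) ** 2
    return agree / n_people

# Hypothetical 6 examinees x 5 items, cut score of 3 items correct.
toy = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0],
]
print(round(agreement_coefficient(toy, cut=3), 3))
```

Because each examinee contributes a value between .5 and 1, the coefficient cannot fall below .5, which is worth keeping in mind when judging whether a reported value is high.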
Validity can be defined as the degree to which concurrent evidence supports assessment scores (Joint committee, 1999). In the case of the ASTFS Proficiency Assessment (ASTFSP), the resulting scores are ideally the predictor measure and the ASTFS is the actual criterion level of the test taker. However, as the ASTFS construct is still newly delineated (MacQuarrie, Applegate, & Lacefield, 2008), concomitant measures of this construct were not available to serve as validating correlates. Therefore, indirect correlates and contrasted group procedures were used to validate the ASTFSP Assessment.
In many cases the goal is to determine a person's readiness level for the next educational stage. In these situations the test taker's score should be compared to a criterion level, not to a general population alone. Therefore, it was decided that the purpose of the ASTFSP Assessment is to provide a current or prospective AT employee with their proficiency level of the ASTFS as compared to industry criterion levels. The best way to ensure that ASTFSP Assessment scores are meaningful is to associate them with industry ASTFS criterion cut-score levels, such as those for certification purposes.
Bob Clark, a technical specialist in the Special Testing Programs for ASE, was the resource sought for expert guidance on standard cut score procedures (B. Clark, personal communication, September 2, 2005). Clark indicated that ASE cut scores were set for each scale area assessment using a "modified Angoff procedure," which he claimed is common for high stakes tests (2005). Clark then asserted that items were carried forward for future tests using "pre-equating" procedures based on Item Response Theory (IRT) techniques. In addition to reporting the single cut-score result of pass / fail for test takers, ASE also reported the number of correct responses for each section of the test.

SKILL DEFINITION AND REPRESENTATION
The ASTFS skills and tasks were defined and delineated in a previous paper titled "Criterion Referenced Assessment: Delineating Curricular Related Performance Skills Necessary for the Development of a Table of Test Specifications" (MacQuarrie, Applegate, & Lacefield, 2008). A summary of the findings from that previous paper is reported here to maintain continuity of the material. The ASTFS are listed in two scales: the Basic Vehicle Repair Skills (BVRS) and the Basic Vehicle Interval Maintenance Skills (BVIMS).
General categorical listings of tasks often are ambiguous in the sense that they fail to further delineate lower levels of prerequisite skills. However, they do have utility for illustrating skill and task hierarchies. Refer to Tables 1 and 2 for a listing of the Table of Test Specifications for each of the two scales of the ASTFS. Within sub-categories, the general units, tasks, and objectives used for further defining the skills are delineated.
Procedures were followed in accordance with credentialing assessment standards and published best practices to create the ASTFSP Assessment. The ASTFSP Assessment is a criterion referenced mechanical aptitude assessment of multiple choice design. Several iterations of preliminary and pilot studies using both qualitative and quantitative processes provided information to improve item quality concerning reliability and validity.

PURPOSE OF THE STUDY
This second phase of the study examines and evaluates the reliability and validity of an assessment that is aligned with the ASTFS needed by students or future employees and that is intended to evaluate their readiness for their next educational stage. This paper describes assessment design processes and not research processes and will therefore be presented in a manner that fits typical assessment procedures. In order to achieve the purpose of this study, objectives that align with typical assessment design and construction processes are used.
1. The assessment design and construction processes utilized an assessment purpose and methods to ensure both the content and ability domains were proportionately aligned with a highly recognized content or skill area.
2. The assessment's item writing processes referenced and reflected the content and ability domains and were then improved through preliminary item try-outs and pilot studies.
3. Assessment and item reliability processes included empirical evidence of reliability estimations from item analyses procedures through both internal consistency and external comparison estimations.
4. Assessment and item validation processes included empirical evidence of convergent and/or discriminant validity.
The first objective was completed and described in detail in a previous paper, as explained in the preceding section of this paper. The processes of this phase included internal and external reliability estimations, content validation, and convergent and discriminant validation procedures, including contrasted group methods, concurrent correlations, and IRT analytical procedures. Initial standardization procedures were implemented using an objective means of deriving the passing cut scores.

METHODS
Specific items were written to fulfill the Table of Test Specifications (ToTS), drawing on the research resources used to complete the ASTFS list. Multiple techniques were employed to estimate the psychometric properties of the ASTFSP Assessment. A preliminary study of the ASTFSP Assessment included procedures for empirical validation focused on improving the ASTFSP Assessment and gathering evidence for construct validity (MacQuarrie, 2005). This study expanded the participant groups to include CTE high school students and added further statistical analyses. In addition, IRT procedures were employed to assess the utility of the ASTFSP Assessment items and scores for the intended goal of measuring ASTFS ability levels.

INSTRUMENTATION:
The ASTFSP Assessment used a four choice multiple choice format for items forming two scales. The first scale included items related to the BVIMS area, such as lubrication replacement procedures, coolant selection, and tire pressure checking and correcting procedures. The second scale included items related to the BVRS area, such as hand and power tool safety, selection, and procedures; fastener selection and uses; and oxygen-acetylene torch safety procedures.

PROCEDURES:
The ASTFSP Assessment items were specifically written to fit the ASTFS ToTS plan proportions of content and ability domain as displayed in Tables 1 & 2. The item writing process was performed by the primary author of this paper using research references from the previous delineation process as well as industry related case studies, in the manner described in typical item writing texts (Worthen, White, Fan, & Sudweeks, 1999; Gronlund, 1998; Hopkins, 1998).
The empirical construct validation process required a two part approach because of the lack of concomitant measures for the ASTFS construct. The two approaches for gathering convergent and discriminant evidence were contrasted groups and correlation of the ASTFSP Assessment scores with criterion measures. This study used two primary groups of participants with opposing levels of AST skills for the administration and gathering of the ASTFSP Assessment data. The first group was the AST experienced group and was composed of three subgroups: 1) AST experts working in industry, 2) AST teacher members of the Automotive Youth Educational Systems (AYES) (2005), and 3) a group of AST high school students age 16 to 20 years old near the end of a year's training. The second group consisted of non-AST experienced participants in two subgroups: 1) non-AST high school students age 16 to 20 years old, and 2) non-AST teachers. A survey question within the ASTFSP Assessment identified a sixth potential group, drawn from the first two subgroups within the two primary groups, who self-identified as AST hobbyists and who would likely vary widely in AST skill level.
Reliability estimates for internal consistency, test-retest correlation, and Subkoviak's (1976) CGA estimate for mastery tests are presented below. Descriptive statistics for the test takers are reported by group and subgroup. Discriminant validation included MANOVA procedures to test the hypothesis of mean differences on the dependent variables, BVIMS and BVRS, among the groups: AST experts, AST teachers, non-AST students, non-AST teachers, AST students, and AST hobbyists. A canonical discriminant function analysis (DFA) was conducted to determine whether ASTFSP Assessment scores could be used to differentiate between the six AST groups, but is not reported here to reduce redundancy.
The convergent validation approach included correlating the ASTFSP Assessment scores with criterion measures of AT developmental indicators. Developmental indicators were reported by an AST test taker's supervisor on a second performance rating scale and survey completed while the participating technician was completing the ASTFSP Assessment. Correlations are reported between the ASTFSP Assessment scores and developmental indicators such as ASE certifications, State of Michigan Certifications, and work duty responsibilities.
The expert experiential group's scores were then used to establish cut-scores in an objective manner. The method used is similar to a contrasted groups approach separating the experts' scores from those of other groups. The current study also allowed IRT procedures to be used to further evaluate the ASTFSP Assessment items and to measure AST trait levels of ability. The first two expert subgroups' ASTFSP Assessment data were used to calculate the item difficulty for each item. Item difficulties were used in a manner similar to the way the Angoff procedure uses experts. Traditional Angoff procedures typically use selected experts to directly estimate, for each item, the probability that a hypothetical minimally competent person would answer it correctly (Standard, 2008). The Angoff procedure would be offset for this assessment to ensure objectivity by way of contrasted groups. The offset would be completed by summing the selected experts' actual item difficulties for the two expert sub-sample groups, thereby deriving a set of appropriate cut scores based on actual experience.
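The summing step described above can be sketched as follows, assuming item difficulty means the classical proportion of experts answering an item correctly. The function names and the small 0/1 response matrix are hypothetical and only illustrate the arithmetic; they are not the study's data or scripts.

```python
def item_difficulties(responses):
    """Proportion of examinees answering each item correctly (classical p-value)."""
    n_people = len(responses)
    n_items = len(responses[0])
    return [sum(row[j] for row in responses) / n_people for j in range(n_items)]

def cut_score_from_experts(expert_responses):
    """Sum the experts' observed item difficulties to obtain a raw cut score."""
    return sum(item_difficulties(expert_responses))

# Hypothetical expert data: 3 experts x 4 items, 1 = correct, 0 = incorrect.
experts = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
]
print(item_difficulties(experts))
print(cut_score_from_experts(experts))
```

Summing the proportions correct across items is arithmetically the same as taking the expert group's mean raw score, so a cut score derived this way sits at the experts' average performance rather than at a judged minimum.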

RESULTS
The first and previous part of this study set the stage for construct validity of the ASTFSP Assessment with the ToTS. The second objective was: The assessment's item writing processes referenced and reflected the content and ability domains and were then improved through preliminary item try-outs and pilot studies.
To complete the second objective, the ASTFSP Assessment items were specifically written to fit the ASTFS ToTS plan proportions of content and ability domain as displayed in Tables 1 & 2. The item writing process was performed by the primary author of this paper using research references from the previous delineation process as well as industry related case studies gathered from industry personnel. Preliminary item try-outs and pilot studies were used to improve the items in multiple ways. First, qualitative feedback was gathered for the items as experts completed the assessment along with an instrument rubric. Second, iterative improvements used simple Item Analysis procedures for monitoring response proportions, item completion, and proportions correct. Each administration resulted in an iteration of assessment improvement and provided a deeper and more objective result due to the increased number of participants.
The third objective was initiated during the previous section with simple Item Analyses. Because of the assessment design, the third objective, involving reliability, is performed on the data gathered from the validity study and is as follows:
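The three quantities monitored in the simple item analysis (response proportions, item completion, and proportion correct) can be tallied with a few lines of code. The sketch below assumes four-choice items labelled A to D, with None marking an omitted response; the function name, data layout, and sample answers are hypothetical.

```python
from collections import Counter

def item_analysis(answers, key, options="ABCD"):
    """Per-item completion rate, proportion correct, and option proportions.
    answers: one list per examinee holding a chosen option or None (omitted).
    key: the correct option for each item."""
    n_people = len(answers)
    report = []
    for j, correct in enumerate(key):
        # Count only the responses that were actually given for this item.
        counts = Counter(row[j] for row in answers if row[j] is not None)
        answered = sum(counts.values())
        report.append({
            "item": j + 1,
            "completion": answered / n_people,
            "proportion_correct": counts[correct] / n_people,
            "option_proportions": {o: counts[o] / n_people for o in options},
        })
    return report

# Hypothetical pilot data: 4 examinees x 2 items.
pilot = [["A", "B"], ["A", None], ["C", "B"], ["A", "D"]]
for row in item_analysis(pilot, key=["A", "B"]):
    print(row)
```

A distractor that nobody selects, or an item that many examinees skip, shows up directly in these proportions and flags the item for revision in the next iteration.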

Assessment and item reliability processes included empirical evidence of reliability estimations from item analyses procedures through both internal consistency and external comparison estimations.
To complete the third and fourth objectives related to reliability and validity, a two part approach was used due to the lack of concomitant measures of the ASTFS construct. The first approach used contrasted groups and the second approach correlated ASTFSP Assessment scores to developmental indicators. Refer to the descriptive statistics in Table 3, which depict score and scale score means, number of participants in a group, and standard deviations that will need to be statistically tested for differences. Refer to Figure 1 for the number of ASE certifications possessed by the AST Industry Expert Group.
External reliability compares assessment scores across time or with another parallel form of the test. A study of stability reliability estimates for the ASTFSP Assessment within a 20 day period indicated excellent results (n = 24), ρ = .908, with a confidence interval of .796 to .959 (Barnette, 2005). Reliability estimates for the three cut scores set on a test-retest mastery decision (n = 24) were .96, .88, and 1.00; that is, 96%, 88%, and 100% of participants were assigned the same mastery decision on both administrations, which is very good. Subkoviak's (1976) CGA estimate for a mastery decision for the ASTFSP Assessment (n = 354) was good at .741, which is interpreted to mean that an individual would have a 74% lower bound probability of being assigned to the same mastery (or non-mastery) state on a second testing that was parallel to the first test. Refer to Figure 2 for a graphical representation of the Coefficient of Agreement for the ASTFSP Assessment. Stability (test-retest) reliability, when available, is a better indicator of reliability than internal reliability or even CGA estimation since the ASTFSP Assessment is a proficiency assessment (Crocker & Algina, 1986).
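One common way to put a confidence interval around a test-retest correlation is the Fisher z transformation. The sketch below applies it to the reported values (a correlation of .908 with n = 24) and returns bounds very close to the reported interval. Whether the original interval was computed this way is an assumption; the function name is hypothetical.

```python
from math import atanh, tanh, sqrt
from statistics import NormalDist

def correlation_ci(r, n, level=0.95):
    """Confidence interval for a Pearson correlation via Fisher's z."""
    z = atanh(r)                      # transform r to the z scale
    se = 1 / sqrt(n - 3)              # standard error on the z scale
    crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    # Build the interval on the z scale, then transform back to r.
    return tanh(z - crit * se), tanh(z + crit * se)

low, high = correlation_ci(0.908, 24)
print(round(low, 3), round(high, 3))
```

The interval is asymmetric around .908 because the back-transformation compresses values near 1, which is the expected behaviour for high correlations in small samples.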

CONTRASTED GROUPS VALIDATION RESULTS
The fourth objective summarizes the validity study design and uses the data gathered to make conclusions about the ASTFSP Assessment's meaning and not for purposes of the groups' differences, as in research. The fourth objective is as follows: Assessment and item validation processes included empirical evidence of convergent and/or discriminant validity.
No other direct assessment parallel to the ASTFSP Assessment measures the ASTFS. Therefore, statistical discriminant differences were sought between natural groups based on ASTFS experience. MANOVA procedures were performed to test for differences between the dependent variable scale scores, BVIMS and BVRS, of the ASTFSP Assessment for the six groups: AST experts, AST teachers, non-AST high school students, non-AST teachers, AST high school students, and AST hobbyists.
Box's test indicated that the MANOVA assumption of equality of covariance matrices was violated, p < .05, and thus required Dunnett's C correction for the post hoc analyses. Results of the MANOVA for mean differences on the BVIMS and BVRS scales for the six groups were statistically significant, Wilks' Lambda = .529, F(10, 694) = 26.041, p < .001, and partial η² = .273, indicating that 27% of the variance was accounted for in the model. Univariate analyses revealed both statistical and practical effects for each dependent variable, followed by Dunnett's C post hoc pairwise means comparisons among groups. Refer to Tables 4 and 5, respectively, for ANOVA results and the post hoc analysis. Refer to Figures 3 and 4 for a graphical representation of the estimated means for each group and scale of the ASTFSP Assessment.
In summary, the MANOVA results indicate there are statistically significant differences between the two primary experiential groups and sub-groups for both scales of the ASTFSP Assessment: those with ASTFS experience and those without. Therefore, the ASTFSP Assessment could allow detection of discriminant differences, thus indicating a measure of validity for the assessment.
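The reported multivariate statistics can be cross-checked from Wilks' Lambda alone. With two dependent variables, the F transformation of Wilks' Lambda is exact, and the multivariate partial eta squared is one minus the square root of Lambda. The sketch below assumes the MANOVA used the same 354 participants reported for the CGA estimate, which is consistent with the reported degrees of freedom; the function name is hypothetical.

```python
from math import sqrt

def wilks_two_dv(wilks_lambda, n_total, n_groups):
    """Exact F and partial eta squared from Wilks' Lambda when there are
    exactly two dependent variables (here BVIMS and BVRS)."""
    root = sqrt(wilks_lambda)
    df1 = 2 * (n_groups - 1)
    df2 = 2 * (n_total - n_groups - 1)
    f_value = ((1 - root) / root) * (df2 / df1)
    # With two dependent variables and more than two groups, s = 2.
    partial_eta_sq = 1 - root
    return f_value, df1, df2, partial_eta_sq

# Assumption: N = 354 participants in 6 groups.
f_value, df1, df2, eta_sq = wilks_two_dv(0.529, n_total=354, n_groups=6)
print(round(f_value, 2), df1, df2, round(eta_sq, 3))
```

Using the rounded Lambda of .529, this gives degrees of freedom of 10 and 694, a partial eta squared of about .273, and an F close to the reported 26.041; the small difference in F is what rounding Lambda to three decimals would produce.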

SCALE SETTING OF CUT SCORES
Finally, two cut-scores were derived using an objective approach based on contrasted groups to separate the subjects into three grouping categories: "Non-Expert", "Minimal Knowledge", and "Minimally Competent" level groups. The cut score results were then evaluated using DFA classification procedures between the cut score groups. Refer to Table 8 for the results of the cut score DFA classification. The "Minimally Competent" level spanned a score percentage range from approximately 65% to 79%. There were a limited number of participant scores near 80%; therefore, the "Competent" cut score level is reserved for scoring future ASTs who have been purposefully and effectively trained in the ASTFS.
The first part of this paper has presented validity and performance results for various expert levels in the field of AST, teachers, and ATs. The second part of this report will present additional results from a high school CTE ASTFSP Assessment. The larger sample of CTE high school students allowed IRT procedures to be used to further evaluate the validity of the ASTFSP Assessment items and ASTFS ability levels.

LATENT TRAIT DIMENSIONALITY
Content validity is important, but more important is the validity of the latent trait: the ability domain. To analyze the validity of the ability variances of the ASTFSP Assessment items, IRT procedures were performed. IRT procedures allow the plotting of item characteristic curves (ICCs) along the latent trait continuum to gain insight into how each item functions across the performance levels being measured by the assessment. BILOG MG was used to estimate a one parameter IRT model on the ASTFSP Assessment data. In Figures 5 and 6, the total test information for a specific scale score is read from the left vertical axis, and the standard error for a specific scale score is read from the right vertical axis. In summary, the IRT statistics indicate that the ASTFSP Assessment is measuring a varied ability level for each of the two scales. Further IRT analysis with two and three parameter models for all items of the ASTFSP Assessment would be beneficial in future administrations.
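The quantities plotted in item characteristic curves and in information and standard error graphs can be illustrated with a minimal one parameter (Rasch-type) sketch. The item difficulties below are hypothetical, no scaling constant is applied, and the code does not reproduce BILOG MG's estimation; it only shows how the probability of a correct response, the scale information, and the standard error relate to ability.

```python
from math import exp, sqrt

def rasch_probability(theta, b):
    """Probability of a correct response at ability theta for an item of
    difficulty b under a one parameter logistic model."""
    return 1 / (1 + exp(-(theta - b)))

def scale_information(theta, difficulties):
    """Total information at theta: the sum of P * (1 - P) over items."""
    probs = [rasch_probability(theta, b) for b in difficulties]
    return sum(p * (1 - p) for p in probs)

def standard_error(theta, difficulties):
    """Standard error of the ability estimate: 1 / sqrt(information)."""
    return 1 / sqrt(scale_information(theta, difficulties))

# Hypothetical item difficulties for a short scale.
difficulties = [-1.5, -0.5, 0.0, 0.5, 1.5]
for theta in (-2.0, 0.0, 2.0):
    print(theta,
          round(scale_information(theta, difficulties), 3),
          round(standard_error(theta, difficulties), 3))
```

Information peaks where item difficulties cluster, so the standard error is smallest for examinees whose ability is near those difficulties and grows toward the extremes of the scale.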

DISCUSSION, IMPLICATIONS, AND OPPORTUNITIES
The ASTFSP Assessment is useful as a criterion referenced mechanical aptitude assessment for making group or individual level decisions. Several important points have emerged while viewing the ASTFSP Assessment results. The first point is that an AST technician should possess a "Minimally Competent" or higher level of the ASTFS, as it would benefit their customers, their employers, and themselves. Additionally, an educational organization should consider including the direct instruction and assessment of the ASTFS due to the positive correlation between an AST technician's ASTFSP Assessment score and their success in obtaining AST developmental indicators such as higher work duties and certifications.
The second point focuses on the average ASTFSP Assessment scores obtained by AST industry experts of 58% and the average AST teacher score of 64%. These score levels are lower than expected for practicing AST technicians and teachers and are attributed to both a current and past lack of professional development specific to the ASTFS. There is a potential for growth among AST industry experts, AST teachers, and CTE high school level AST students.
The ASTFSP Assessment could be administered to multiple groups for various reasons. In closing, the discovery and development of both the ASTFS and the ASTFSP Assessment can assist schools, AST programs, and employers in evaluating the development level of ATs or AST students. AST students would likely benefit most from learning the ASTFS if they were to learn them prior to the NATEF task lists, as they are underlying skills. It would seem ideal for high school level students to possess a higher level of ASTFS proficiency to enable them to learn and perform more effectively at the system level of AST duties and tasks. Additionally, all transportation technicians would most likely need most if not all of the ASTFS, and these skills may be transferable to other transportation and industrial areas. Further evaluation of the potential of the ASTFSP Assessment is currently in the planning stages, including a predictive validity study involving ATs, AST students, and other technicians. Interested organizations or parties are invited to inquire, volunteer assistance, or support.

Figure 1. Number of ASE Certifications Possessed by AST Expert Group Members

Figure 2.

Figure 3. Estimated Marginal Means for the BVIMS Scale for the Six Groups

Figure 4. Estimated Marginal Means for the BVRS Scale for the Six Groups

Figure 6. BVRS Scale Information and Standard Error

Figure 7. Item Characteristic Curves for the ASTFSP Assessment by Scales

Table 2. Table of Test Specifications for the ASTFS Basic Vehicle Interval Maintenance Skills
Table of Test Specifications for the ASTFS
Refer to scale. b Dependent on specific certification. c Dependent on repair facility option selection.

Table 3. Descriptive Statistics for the Five Groups' Scores

Table 4. ANOVA Results for Each Predictor Variable

Table 5. Results for the Post Hoc Analyses
Note: Dunnett's C post hoc analyses correction

Table 7. Developmental Correlations for the ASTFSP Assessment

Table 8. DFA Classification Results for the Cut Score Groups

Refer to Figures 5 and 6 for a graphic depiction of the Total Information and Standard Error for each scale of the ASTFSP Assessment. The high level of Standard Error depicted in the graph is attributed to the AST high school group, as can be verified by the Standard Deviations reported in Table 3. Item level parameters for the BVIMS scale indicate that the items are discriminating and

Matrix Plot of Item Characteristic Curves (panels: BVIMS Items, BVRS Items)

Secondary level CTE students can be administered the ASTFSP Assessment to indicate readiness to enter the career field or post secondary level CTE. Post secondary level CTE students can be administered the ASTFSP Assessment to indicate readiness to enter the career field. ASTFSP Assessment scores from practicing or prospective ATs can indicate a need for professional development in the BVIMS or BVRS scale areas.