ITEM ANALYSIS OF AN ENGLISH SUMMATIVE TEST

Although assessing students accounts for almost half of teachers’ activities, teachers are not well-prepared with assessment literacy training. Hence, they are unable to produce good tests to measure students’ level of knowledge and skills. This study is aimed at analyzing the item difficulty and item discrimination of a multiple-choice test made by an English teacher at a junior high school in Kupang. The instruments of this descriptive quantitative research were 50 test items and students’ answer sheets. The results of this study indicated that the English summative test had a poor item difficulty index and a low item discrimination index. For the level of difficulty, it was found that 27 items (54%) were easy, 22 items (44%) were moderate, and 1 item (2%) was difficult. For item discrimination power, it was revealed that 5 items (10%) were excellent, 8 items (16%) were good, 8 items (16%) were satisfactory, 23 items (46%) were poor, and 6 items (12%) were negative. In addition, 12 items were acceptable, 20 items were unacceptable and 18 items needed revision. In conclusion, this English summative test did not fulfill the criteria of a good test and could not measure students’ true ability.


INTRODUCTION
Teachers' responsibilities include not only preparing and teaching lessons but also assessing students and evaluating the course. One of the ways to measure students' ability is by conducting a test (McNamara, 2000; Hughes, 1989). It is important for teachers to write good test items because such items can measure students' true ability, reflect the success of the lesson and distinguish low performers from high performers. Testing is also a part of assessment and evaluation because a test is a device that reflects the assessment process and the effectiveness of the lesson and the teaching process (Fulcher & Davidson, 2007). Therefore, evaluating lessons is as important as developing them.
Teachers these days are not well-prepared with assessment training. They focus on developing teaching methods but neglect to equip themselves with assessment literacy, which may lead to poorly constructed test items. Almost half of teachers' activities involve assessing their students, but teachers are not well-prepared to make a good assessment (Plake & Impara, 1996). They make tests to measure students' ability, but the tests do not fulfill the criteria of a good test such as validity, reliability and practicality (Hughes, 1989). In addition, they often construct multiple-choice tests with a poor level of difficulty and low item discrimination. This means that they are unable to make valid and reliable tests. A well-constructed classroom test will provide students with an opportunity to show their ability to recognize and to produce correct forms of the language.
The multiple-choice test is one type of test that is frequently used by English teachers to measure their students' English knowledge and skills. Multiple-choice tests are receptive or selective kinds of tests (Toksoz & Ertunc, 2017). This test requires the test takers to select one of the options given for each test item. This type of test has a stem, options and distractors. The stem contains the statement or the question, and the options are the alternatives to be selected. The right option must be selected by eliminating the wrong answers or distractors (Brown, 2004; Hughes, 1989).

L. Suek
PEJLaC, Volume 1, Issue 1, June 2021 (9-18)
Teachers prefer multiple-choice items because of their practicality. Such a test does not consume much time to prepare and it is easy to administer. It is also more reliable than other types of tests because it is objective and scored consistently (Brown, 2004). However, this kind of test may have low discrimination power because it does not clearly discriminate high performers from low performers if the items are not well-designed. In this case, students might just guess the answers without reflecting their knowledge or skills. The quality of a multiple-choice test depends on the test items (Brown, 2004; Hughes, 1989; Kehoe, 1995). If teachers are unable to design the multiple-choice items well, the students might just guess the answer, and guessing might affect the students' scores (Brown, 2004; McNamara, 2000). One of the biggest challenges in designing multiple-choice items is writing successful items that fulfill the criteria of validity, item difficulty and item discrimination.
In order to be successful, multiple-choice items should have good item difficulty and high item discrimination (Haladyna, 2004; Henning, 1987). Item difficulty determines which items are difficult and which items are easy (Brown, 2004). It also determines whether the questions are trivial, difficult or impossible (Bodner, 1980). Item discrimination, in turn, differentiates higher performers from lower performers (Brown, 2004). An item with high discrimination is one that good students can get right while poor students get it wrong (Toksoz & Ertunc, 2017).
The increasing trend of using multiple-choice items for testing students' ability and the expectation that teachers should be able to construct good tests encouraged the researcher to analyze the multiple-choice test items made by an English teacher. This study is aimed at finding out the item difficulty index and item discrimination index of a multiple-choice test made by an English teacher at SMP Negeri 1 Kupang. This research will shed light on the tests constructed by teachers, the quality of the tests, the measurement of students' ability and the implications for further test construction.

RESEARCH METHODOLOGY
Documentation and descriptive methods with a quantitative approach were used in this study. First, the multiple-choice test that was constructed by an English teacher for the ninth grade of SMP Negeri 1 Kupang Tengah was analyzed. A quantitative approach was used to measure the difficulty level and discrimination power of the test. The subjects of this study were the English teacher and the ninth-grade students of SMP Negeri 1 Kupang Tengah in the academic year 2019/2020. This school was selected because it has held an A accreditation for several years, which means that it is a good school and its teachers are expected to be able to construct good tests. The instruments used in this study were 50 multiple-choice test questions that were constructed by the English teacher and 36 students' answer sheets that were corrected or scored by the teacher.
In order to collect data, a series of procedures was followed. First, the researcher went to the school and met the English teacher to get the test items and the students' answer sheets and scores. With her permission, the English summative test was analyzed by examining the students' answer sheets and scores, computing the difficulty levels and the discrimination power of all items, identifying good and poor test items, discussing the findings and drawing conclusions. Arifin (2012: 251) said that there are several steps to analyze test items. First, all of the students' answer sheets are scored. Then, the scores are ordered from the highest to the lowest. Next, the top 27% of the students are grouped as high performers and the bottom 27% as low performers, while the medium performers are set aside. Finally, the students' answers are analyzed. An item analysis reveals three things: how difficult each item is, whether or not the question discriminates between high and low performers, and which distractors are working as they should.
In constructing a test, teachers should consider the difficulty level of the items. To claim that an item is easy or difficult, an analysis of students' answer sheets has to be done. A test item is too easy when more than 90% of the students answer it correctly. An item is too difficult when less than 30% of the students answer it correctly. According to Arifin (2012: 266), the steps to find out the level of difficulty begin by ordering students' answer sheets from the highest scores (high group) to the lowest scores (low group). Then, 27% of the answer sheets are taken for the high group and 27% for the low group, while the remaining 46% are set aside. Finally, all the data are presented in a table showing the answers of each student in both the high and low groups. A correct answer is scored 1, and a wrong answer is scored 0.
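The ranking and grouping steps can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function name `split_high_low`, the student labels and the scores are hypothetical and are not the study's data, and rounding 27% of the test takers to the nearest whole student is an assumed convention.

```python
def split_high_low(scores, fraction=0.27):
    """Rank students by total score and return (high_group, low_group)."""
    # Rank from highest to lowest total score; ties keep their input order.
    ranked = sorted(scores, key=scores.get, reverse=True)
    # Group size: 27% of the test takers, rounded to a whole student
    # (rounding rule is an assumption of this sketch).
    k = max(1, round(fraction * len(ranked)))
    # The students between the two groups (about 46%) are set aside.
    return ranked[:k], ranked[-k:]


# Hypothetical total scores for 36 students (not the study's data).
scores = {f"S{i:02d}": 50 - i for i in range(1, 37)}
high, low = split_high_low(scores)
print(len(high), len(low))  # 10 10
```

With 36 test takers, 27% is 9.72 students, so each group holds 10 students once the value is rounded.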
The level of difficulty is calculated as follows (Arifin, 2012: 266):
$$DL = \frac{WL + WH}{nL + nH} \times 100\%$$
DL = difficulty level
WL = total students who answered incorrectly from the lower group
WH = total students who answered incorrectly from the higher group
nL = total students in the lower group
nH = total students in the higher group
The criteria for interpreting the difficulty level given by Arifin (2012: 270), ranging from easy to difficult, are as follows: if DL is 27% or less, the test item is easy; if DL is between 28% and 72%, the test item is moderate; if DL is 73% or more, the test item is difficult.
Discrimination power distinguishes high performers from low performers. According to Arifin (2012: 274), the procedure for analyzing discrimination power includes tabulating students' answer sheets, counting the number of students who got the item wrong in the low group (WL), counting the number of students who got the item wrong in the high group (WH), subtracting WH from WL, and calculating the discrimination power of each question.
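As a worked illustration of the difficulty level, the sketch below assumes that DL is the percentage of students in the two groups combined who answered an item incorrectly, DL = (WL + WH) / (nL + nH) × 100%, and that the category boundaries are 27% or less for easy, 28% to 72% for moderate, and 73% or more for difficult. The function names and the example counts are hypothetical.

```python
def difficulty_level(wl, wh, n_low, n_high):
    """Percentage of students in both groups who answered incorrectly."""
    return 100 * (wl + wh) / (n_low + n_high)


def classify_difficulty(dl):
    """Map a difficulty level (in percent) to easy / moderate / difficult."""
    if dl <= 27:
        return "easy"
    if dl < 73:
        return "moderate"
    return "difficult"


# Hypothetical item: 4 wrong answers in the low group and 1 in the
# high group, with 10 students in each group.
dl = difficulty_level(wl=4, wh=1, n_low=10, n_high=10)
print(dl, classify_difficulty(dl))  # 25.0 easy
```

With two groups of 10 students, DL can only take multiples of 5%, so no value falls between the stated category boundaries.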
The discrimination power is calculated as follows (Arifin, 2012: 266; Hoha, 2001: 147):
$$Dp = \frac{WL - WH}{n}$$
Dp = discrimination power
WL = total students who answered incorrectly from the lower group
WH = total students who answered incorrectly from the higher group
n = total students in each group (27% of the test takers)
The criteria for interpreting the discrimination power given by Arifin (2012: 270), ranging from poor to excellent, are as follows: an index between 0.00 and 0.20 is poor, an index between 0.20 and 0.40 is satisfactory, an index between 0.40 and 0.70 is good, and an index between 0.70 and 1.00 is excellent, while a negative index indicates an item that is not good.
Arikunto (2008: 206-207) stated that it is important to analyze the questions that have been made to find which items are excellent and which items are poor. According to Sudijono (2009: 376-378), well-constructed items should be used in the test, while poor items should be revised or changed. In this study, the English summative test was analyzed to find out the level of difficulty and discrimination power. The test was made by the teacher at the end of the semester to assess the materials that had been taught. It was also used to prepare the students for the national examination. A summative test, according to Djamarah (2005: 253), is an assessment that is carried out at the end of a teaching program or of a certain number of learning units.

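The discrimination power calculation described in the methodology can be sketched in Python as follows. The sketch assumes that the index is the difference between the number of wrong answers in the low and high groups divided by the size of one group, Dp = (WL − WH) / n. Because the stated category ranges share their boundary values (for example 0.20), the sketch also assumes that a boundary value belongs to the lower category. The function names and example counts are hypothetical.

```python
def discrimination_power(wl, wh, n_group):
    """(wrong in low group - wrong in high group) / size of one group."""
    return (wl - wh) / n_group


def classify_discrimination(dp):
    """Map a discrimination index to a category.

    Assumption: a shared boundary value (0.20, 0.40, 0.70) is assigned
    to the lower category.
    """
    if dp < 0:
        return "negative"
    if dp <= 0.20:
        return "poor"
    if dp <= 0.40:
        return "satisfactory"
    if dp <= 0.70:
        return "good"
    return "excellent"


# Hypothetical items, with 10 students in each group.
print(classify_discrimination(discrimination_power(9, 1, 10)))  # excellent
print(classify_discrimination(discrimination_power(3, 3, 10)))  # poor
print(classify_discrimination(discrimination_power(2, 5, 10)))  # negative
```

An index of 0.00 means both groups made the same number of errors, so the item does not separate the groups at all; a negative index means the high group made more errors than the low group.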
The findings are presented based on the steps in analyzing the test. The first step in doing the item analysis is to score all of the students' answer sheets. Students' scores are presented in Table 1. Table 1 shows the scores of all 36 students who took the test. The student with the highest score answered 49 out of 50 questions correctly, while the student with the lowest score answered 29 out of 50 questions correctly. In other words, the student with the lowest score answered 21 questions incorrectly and the student with the highest score answered 1 question incorrectly.
The second step is deciding the 27% high group and the 27% low group. According to Arifin (2012: 266), the steps to find the difficulty level of the items begin by ordering students' answer sheets from the highest score (high group) to the lowest score (low group). Then, 27% of the answer sheets are taken for the high group and 27% for the low group, while the remaining 46% are set aside. The top 10 students were grouped as high performers (27% x 36 students) and the 10 students with the lowest scores were grouped as low performers (27% x 36 students).
Table 2. High group and Low group
Table 2 shows the groups of high and low performers. Out of 36 students who took the test, 10 students were in the high group and 10 students were in the low group.
The third step was the recapitulation of difficulty level and discrimination power. After deciding the high and low performers, the recapitulation of difficulty level and discrimination power was done by counting the students who answered each question incorrectly. The results of the analysis are presented in the following table.
Table 3. Recapitulation of the difficulty level and discrimination power
Table 3 indicates the recapitulation of the difficulty level and discrimination power. In terms of difficulty level, it was revealed that 1 question was difficult, 22 questions were moderate and 27 questions were easy. In terms of discrimination power, 5 questions were excellent, 8 questions were good, 8 questions were satisfactory, 23 questions were poor, and 6 were negative or not good. Questions number 26, 27, 31, 37, 41 and 50 got a negative discrimination index, which means that these questions were difficult for the high group but easy for the low group. In other words, the low group answered the questions correctly, while the high group did not. Questions number 1, 3, 6, 12, 19, 22, 23, 30, 35 and 39 got a discrimination index of 0.00, which means that both groups answered the questions equally well, so the questions were not acceptable because they could not discriminate high performers from low performers. Meanwhile, questions number 8, 11, 18, 28 and 44 were difficult for the low group but easy for the high group. Thus, 16 questions were poorly constructed because they got either a negative (-) index or an index of 0.00. Therefore, those questions were unacceptable or have to be revised.
The fourth step was the interpretation of difficulty level and discrimination power. In terms of difficulty level, the questions were categorized as easy, moderate or difficult items. For discrimination power, the questions were categorized as excellent, good, satisfactory, poor or not good (negative) items. The table below presents the summary of these categories.
Table 4. Interpretation of difficulty level and discrimination power
Table 4 shows the interpretation of difficulty level and discrimination power. The table indicates that in terms of difficulty level, there were 27 easy questions (54%), 22 moderate questions (44%), and 1 difficult question (2%). In terms of discrimination power, 5 questions were excellent (10%), 8 questions were good (16%), 8 questions were satisfactory (16%), 23 questions were poor (46%), and 6 questions were negative or not good (12%).
Figure 1 shows the percentage of difficulty levels of the test items. According to Arifin (2009: 270), a well-constructed test should be neither too easy nor too difficult. To obtain good learning achievement, the difficulty levels of the questions should be distributed in one of the following proportions:
1. 25% of the items are difficult, 50% of the items are moderate, 25% of the items are easy.
2. 20% of the items are difficult, 60% of the items are moderate, 20% of the items are easy.
3. 15% of the items are difficult, 70% of the items are moderate, 15% of the items are easy.
Figure 1 revealed that the English summative test was not well-constructed because 27 questions (54%) were easy, 22 questions (44%) were moderate and 1 question (2%) was difficult. Therefore, the English summative test did not meet the criteria of a good test.
Figure 2 shows the discrimination power of the test items. The discrimination power was calculated and interpreted (Arifin, 2012: 270) as follows: the index of negative or not good items was below zero, the index of poor items was between 0.00 and 0.20, the index of satisfactory items was between 0.20 and 0.40, the index of good items was between 0.40 and 0.70, and the index of excellent items was between 0.70 and 1.00. Figure 2 shows that 5 items (10%) were excellent because the discrimination index was between 0.70 and 1.00, 8 items (16%) were good because the index was between 0.40 and 0.70, 8 items (16%) were satisfactory because the index was between 0.20 and 0.40, 23 items (46%) were poor because the index was between 0.00 and 0.20, and 6 items (12%) were negative or not good because the index was below zero.
The last step was analyzing which items were acceptable or well-constructed and which items were unacceptable or ill-constructed. This final step is important for the recommendations because it will help teachers to revise the test. According to Arifin (2012), there are criteria to decide whether the test items are acceptable, unacceptable or in need of revision. An item is acceptable when the difficulty index is moderate and the discrimination index ranges from satisfactory to excellent. An item is unacceptable when the difficulty index is easy or difficult and the discrimination index is poor or negative. An item has to be revised when the two indexes disagree, that is, when only one of them falls in the lowest category.
Table 5. The analysis of acceptable, unacceptable or need-revision items
Table 5 indicates the analysis of acceptable, unacceptable or need-revision items. It was revealed that 12 items were acceptable, 20 items were unacceptable and 18 items needed revision. The acceptable items can be kept in the test, while unacceptable items should be dropped and new items should be constructed to replace them. The items that need revision should be rechecked for further improvement. The revision could be done by rechecking the answer keys, stems, distractors and teaching materials.
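The decision rule in this last step can be expressed as a small Python sketch. It assumes that the difficulty and discrimination categories have already been assigned to each item, and it reads "needs revision" as every combination that is neither clearly acceptable nor clearly unacceptable; that reading, the function name and the category labels are assumptions of this sketch rather than part of the cited criteria.

```python
def item_decision(difficulty, discrimination):
    """Return 'acceptable', 'unacceptable' or 'needs revision' for one item."""
    good_discrimination = {"satisfactory", "good", "excellent"}
    weak_discrimination = {"poor", "negative"}

    # Moderate difficulty and at least satisfactory discrimination.
    if difficulty == "moderate" and discrimination in good_discrimination:
        return "acceptable"
    # Easy or difficult item that also fails to discriminate.
    if difficulty in {"easy", "difficult"} and discrimination in weak_discrimination:
        return "unacceptable"
    # Mixed cases: only one of the two indexes is weak (assumed reading).
    return "needs revision"


print(item_decision("moderate", "good"))   # acceptable
print(item_decision("easy", "negative"))   # unacceptable
print(item_decision("easy", "excellent"))  # needs revision
print(item_decision("moderate", "poor"))   # needs revision
```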

CONCLUSION AND SUGGESTION
The multiple-choice test is one of the most objective types of test. In addition, it is less time-consuming and easy to administer (Higgins & Tatham, 2003). Item analysis is essential in developing a test because it shows which items should be included, improved or eliminated (Gajjar, Kumar, & Rana, 2014). Based on the data analysis of this study, it was found that 54% of the test items (27) were easy, 44% of the test items (22) were moderate, and 2% of the test items (1) were difficult. This means that most of the test items were easy or moderate, so that both high and low performers could answer them correctly. For the discrimination power, it was revealed that 5 items (10%) were excellent, 8 items (16%) were good, 8 items (16%) were satisfactory, 23 items (46%) were poor, and 6 items (12%) were negative. Therefore, 12 items were acceptable, 20 items were unacceptable and 18 items needed revision. To sum up, this English summative test constructed by a junior high school teacher did not meet the criteria of a good test because it had a poor item difficulty index and a low discrimination index.
As a further implication, the analysis of test items is necessary for evaluating the education system (Arhin, 2017). Besides eliminating misleading test items, it is also used to evaluate the quality of the educational system (Malau-Aduli & Zimitat, 2011). If tests are poorly constructed, this reflects the low assessment literacy of the teacher. This study provides valuable insight for further item modification and test development. Quality control is important to make sure that teachers produce good tests. They can seek advice from experts to validate their tests.
Based on the findings, some recommendations are made for the future development of test items. First, test items should be constructed by considering the criteria of a good test. Second, teachers need to ask experts to validate their tests before administering them to the students. Third, schools and the government need to provide assessment literacy training for teachers so that they have the knowledge to make valid, reliable and practical test items.