Analysis of Grand Prix Marks

In this article we look at the mathematical properties of CoP as determined by analysis of the marks from the Grand Prix.  The purpose of this analysis was to determine how the judges used the marks and how CoP performed, according to rigorous mathematical standards.  A more detailed description of this analysis, including the mathematical details and tables of examples, is provided elsewhere.  This article covers only the conclusions that follow from the calculations.

Seven characteristics of CoP were examined in this study.

Program Components

To examine the way the judges used the five program component marks during the Grand Prix we calculated the statistical spread (the standard deviation) in the marks for each program component, using all the marks from the Grand Prix.  Under normal circumstances, about 68% of the marks should lie within the statistical spread of the average, and about 95% should lie within twice the statistical spread.
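
As an illustration of the calculation, the following is a minimal sketch in Python (not the actual analysis code), using a hypothetical set of marks for one program component from a ten-judge panel:

    from statistics import mean, stdev

    def spread_summary(marks):
        """Mean, spread (standard deviation), and the fraction of marks
        within one and two spreads of the mean."""
        m, s = mean(marks), stdev(marks)
        within_1 = sum(abs(x - m) <= s for x in marks) / len(marks)
        within_2 = sum(abs(x - m) <= 2 * s for x in marks) / len(marks)
        return m, s, within_1, within_2

    # Hypothetical marks for one program component from ten judges:
    marks = [5.75, 6.00, 5.50, 6.25, 5.75, 6.00, 5.25, 6.50, 5.75, 6.00]
    m, s, w1, w2 = spread_summary(marks)
    print(f"mean {m:.2f}, spread {s:.2f} ({100 * s / m:.0f}% of the mean), "
          f"{w1:.0%} within one spread, {w2:.0%} within two")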

For the Grand Prix competitions we find that the spread of the judges' marks calculated for each program component was typically in the range of 12 to 16% of the average value, with some values as low as 5% and others near 20%.  This agrees with studies of human perception that consistently show humans can rate observed events (assign them a numeric value) generally no better than about 15%, and sometimes no better than 25%.  The marks assigned by the judges in the Grand Prix confirm this, and we conclude the judges are doing about the best that can be expected in assigning marks to a single program component.

The program components are intended to capture in the scoring five completely different and independent aspects of a skating performance.  The spread in one judge's marks for the five components should be no better than about 15% for a skater who has roughly equal skill in each component, since that is the limit to the consistency of human judging.  For skaters with different skill in the five components the spread in the marks from one judge for the five components should be worse than 15%, with any excess over 15% due to the difference in skill from one component to the next.

What is found, however, is that the spread is typically only about 5% -- about three times better agreement than the spread from one judge to the next.  Some judges did move their marks around for some skaters, but most did not to any significant extent.  The tight agreement in the five component scores for each skater says that throughout the Grand Prix the judges were unable to use the components as five independent aspects of skating, and were unable to assign independent ratings for the five components.  This held true throughout the entire Grand Prix;  i.e., they did not get significantly better at it as time went on.
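
The comparison being made here can be sketched as follows, again with hypothetical marks (rows are judges, columns are the five program components):

    from statistics import mean, stdev

    # Hypothetical marks: each row is one judge's five component marks.
    panel = [
        [5.75, 5.75, 5.50, 5.75, 5.75],
        [6.00, 6.00, 6.00, 5.75, 6.00],
        [5.25, 5.50, 5.25, 5.25, 5.50],
        [6.25, 6.00, 6.25, 6.25, 6.00],
    ]

    # Within-judge spread: how much one judge moves the five components.
    for j, row in enumerate(panel, start=1):
        print(f"judge {j}: spread across components = "
              f"{100 * stdev(row) / mean(row):.1f}% of the mean")

    # Across-judge spread: how much the panel disagrees on one component.
    first = [row[0] for row in panel]
    print(f"panel spread on component 1 = {100 * stdev(first) / mean(first):.1f}%")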

So far as the use of the program components is concerned, the worst offenders are program components 3 through 5, which score presentation.  These three components typically differed from one another by less than 0.25 points throughout the Grand Prix.  The judges were a little better at distinguishing Transitions from Skating Skills, but the marks show they were completely incapable of distinguishing the presentation components from each other.

There are three possible reasons for the failure to score the five program components independently: (1) the guidelines for marking the program components were not clear enough; (2) the judges were not adequately trained, and the seven competitions of the Grand Prix were inadequate to get the judges "calibrated"; (3) it is humanly impossible to train judges to use the program components as intended (and required) by the system.

In any event, regardless of the actual cause, the numerical evidence shows that the marking of the program components under CoP in its current form was a complete failure throughout the entire Grand Prix.

The CoP development team has acknowledged this failure, and has assumed the problem in marking the program components lay in the guidelines.  The guidelines were extensively rewritten after the Grand Prix, but there is no evidence that this is the actual source of the problem, and no evidence that the new guidelines will work better than the first set.  In the absence of a second round of testing to confirm that the new guidelines work better, it is just a wild guess and wishful thinking.

Elements

Each program element is given a grade of execution (GoE) by each judge, with a designation of -3 through +3, for a total of seven grades of execution.  For every element judged in the Grand Prix the statistical spread in the grade of execution was calculated.  Throughout the Grand Prix the statistical spread for the GoE was very small -- much less than one step in grade of execution.  In other words, the judges agreed fairly well on the GoE to be assigned to the various elements.

Since there are seven possible grades of execution from worst to best, each grade of execution spans about 14% of the total range from best to worst.  Thus, the statistical spread for the GoEs in the Grand Prix is consistent with the 15% rule, which says the judges are doing about the best that can be expected in assigning the GoEs, at least in terms of consistency.  Further, it is clear from inspection of the marks that the judges had no difficulty assigning independent GoEs to each element.

The most mathematically correct approach to using GoEs would be to say that a GoE of 0 corresponds to the typical/average execution of an element.  If one then knew the typical spread in quality of execution determined over several seasons for all competitors, one would specify that a GoE of +1 should cover elements executed better than the typical spread, +2 better than twice the typical spread, and so on for all GoEs.

Under normal circumstances, GoEs from -1 through +1 would then encompass about 68% of all assessments, -2 and +2 would include about 27% of all assessments, and -3 and +3 would include the remaining 5% of all assessments.  Of course, in particularly high or low quality events this distribution would not be followed exactly, but for many events in many competitions one would expect this would be a reasonable distribution.  This is not the distribution one finds in the Grand Prix, however.
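
The expected fractions quoted above follow from the normal ("bell curve") distribution.  A short Python check, assuming each GoE band is one typical spread (one standard deviation) wide:

    from math import erf, sqrt

    def phi(z):
        """Standard normal cumulative distribution function."""
        return 0.5 * (1.0 + erf(z / sqrt(2.0)))

    print(f"GoE -1..+1 (within 1 spread):  {phi(1) - phi(-1):.1%}")      # ~68.3%
    print(f"GoE -2, +2 (1 to 2 spreads):   {2 * (phi(2) - phi(1)):.1%}")  # ~27.2%
    print(f"GoE -3, +3 (beyond 2 spreads): {2 * (1 - phi(2)):.1%}")       # ~4.6%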

Typically, GoEs from -1 through +1 were assigned 85% to 90% of the time, while GoEs of +/- 2 were assigned 5% of the time each, or less.  GoEs of -3 were not uncommon, for obvious reasons; but GoEs of +3 were extremely rare, awarded typically less than one-third of 1% of the time.  In many events, never.

It is difficult to accept that in seven competitions that include the best skaters in the world, elements (some with the minimum difficulty level, such as simple spins and double jumps) were executed with the best possible quality less than 0.3% of the time (less than one attempt in 300).

As with the program components, there are three possible sources for this problem, as described above.  Regardless of the actual cause, it is clear from the numerical evidence that the marking of the GoEs under CoP in its current form was a significant failure throughout the Grand Prix.

No substantial changes to the judging of the GoEs have been made since the end of the Grand Prix, nor so far as we know, are any planned. Thus, this failure to use the GoEs correctly can be expected to carry over into the future.

It is a well-established rule in skating that a well executed double should earn more points than a poorly executed triple (and likewise for singles versus doubles, and triples versus quads).  In CoP, a +3 double Axel, for example, can earn more points than a -3 triple Axel, so this principle has been retained in CoP.  In practice, however, the judges so rarely award a GoE of +3 that a well executed double will almost never be scored higher than a poorly executed triple.
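
For illustration, with assumed numbers (the base values and one-point GoE step below are hypothetical, chosen only to show the arithmetic, and are not the official Scale of Values):

    # Hypothetical values for illustration only -- not the official table.
    BASE_DOUBLE_AXEL, BASE_TRIPLE_AXEL = 3.3, 7.5
    GOE_STEP = 1.0   # assumed points per grade of execution

    plus3_double = BASE_DOUBLE_AXEL + 3 * GOE_STEP    # 6.3 points
    minus3_triple = BASE_TRIPLE_AXEL - 3 * GOE_STEP   # 4.5 points
    print(plus3_double > minus3_triple)   # True: the +3 double outscores the -3 triple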

Another consequence of primarily awarding GoEs of -1 through +1 is that the total value for all the grades of execution has little impact on the scores.  In principle, skaters could gain or lose tens of points in grade of execution compared to the average execution.  In practice, the total grade of execution for all elements is typically only plus or minus 2-3 points for each skater. Only occasionally is it more than plus or minus 5 points.

All of the complexity of judging the GoEs, all of the agonizing over what constitutes a given GoE, all of the difficult effort of training the judges to mark over 1500 GoEs consistently (all GoEs for all possible elements), and all of the cost and complexity of incorporating GoE into the scoring process comes down to a swing of plus or minus 2 points that at best affects the result by about one place, and only part of the time.

Of all the events, the GoEs are found to have the least impact on the results in dance.  In dance the GoEs have all the characteristics discussed above, plus one other not present in singles and pairs.

In singles and pairs, different skaters frequently end up rated differently in each of the skating skills.  For example, different skaters may end up rated as the best jumper, the best spinner, or the best in presentation; or different teams may end up rated as best in lifts and jumps, best in spins, or best in presentation.

In dance it is uncommon for teams to be rated differently in each skating skill.  For example, in Compulsory Dance the order of places for the first pattern, second pattern, timing, presentation, or GoE alone all track more or less in lock-step.  Consequently, it doesn't matter which of these skating skills is used to determine the order of finish, since they all give essentially the same order.  This pattern also holds for the Original Dance and Free Dance, with at best a little variety seen in the ratings for lifts in the Free Dance.

The long-running joke in ice dancing that couples place in more or less the same order for every dance has carried over into CoP.  Although there was a little more movement from dance to dance in the Grand Prix under CoP than has typically been the case, there wasn't that much more, and within any one dance all the couples were rated in essentially the same order for each skating skill within the dance.  This lock-step conformity in the rating of each skating skill is as implausible in dance as the uniformity of the program component marks is in singles and pairs.  It says that throughout the Grand Prix the dance judges were incapable of judging each skating skill within a dance independently from the others.

Uncertainty in the Results

Results in CoP are calculated to one one-hundredth (1/100) of a point, and thus CoP claims it can distinguish between two programs whose values differ by one-half of one one-hundredth of one percent (1/100 of a point on a typical total score near 200 points).

This is equivalent to claiming a panel of judges can look at two people, one three hours after the first, and determine that one is taller than the other by the thickness of a piece of newspaper, without benefit of a ruler.  It is equivalent to claiming a panel can watch two people race, one three hours after the other, and determine that one ran the race one millisecond faster than the other, without benefit of a clock.  It is equivalent to claiming a panel can watch two skaters jump, one three hours after the other, and determine that one jumps higher than the other by one-thousandth of an inch, or takes off faster by one-hundredth of an inch per second, without benefit of a ruler or a clock.  This strains credibility.

The most basic premise of CoP is that the marks are awarded according to an absolute, consistent standard.  The calculated scores are then supposed to be the absolute truth for what each performance deserves according to the absolute standard.  However, because there is a large spread in the marks and only a small number of judges (five used to compute the scores), the calculated scores are only an estimate of the absolute score each program deserves.  It is like timing a race using a stopwatch with a loose second-hand that flops around randomly. You sort of know the time, but not exactly -- only to give or take a few seconds.  Similarly, you only sort of know the true value of the programs in CoP, only to give or take some number of points.

When the uncertainty in the scores is calculated one finds the scores typically have a believability of plus or minus 3/4 of a point.  This means the CoP "stopwatch" can only tell time to give or take 3/4 of a "second", but claims it can determine a winner by one-hundredth of a "second".

Based on the typical spread in the actual marks, it is impossible to say with absolute certainty which of two skaters had the better performance if their point totals differ by less than 1.5 points.  Based on the mathematical characteristics of the actual marks, skaters whose point totals differ by less than 1.5 points should be considered tied.
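
In code, the tie test described here looks something like the following sketch (not the actual analysis code; the per-judge totals are hypothetical, and the uncertainty of each score is estimated as the spread of the counting judges' totals divided by the square root of the number of judges):

    from math import sqrt
    from statistics import mean, stdev

    def score_and_uncertainty(judge_totals):
        """Panel score and its statistical uncertainty (standard error)."""
        return mean(judge_totals), stdev(judge_totals) / sqrt(len(judge_totals))

    skater_a = [102.1, 99.8, 101.5, 100.2, 100.9]   # five counting judges
    skater_b = [101.0, 100.4, 99.6, 101.8, 100.1]

    sa, ua = score_and_uncertainty(skater_a)
    sb, ub = score_and_uncertainty(skater_b)
    print(f"A = {sa:.2f} +/- {ua:.2f}, B = {sb:.2f} +/- {ub:.2f}")
    # Mirror the threshold used above: a difference smaller than the sum
    # of the two uncertainties should be treated as a statistical tie.
    print("statistical tie" if abs(sa - sb) < ua + ub else "meaningful difference")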

During the Grand Prix, more than one-third of the places were determined by a point difference of less than 1.5 points, some places were determined by a point difference of 1/100 of a point, and several medals were determined by point differences less than 1/10 of a point.  In terms of the statistical accuracy needed to believably determine places, CoP was a major failure during the Grand Prix.

In Championship events with 20-30 skaters, one can expect the fraction of places to be determined by statistically meaningless point differences to increase, since the average point difference between places in larger events will decrease by a factor of 2 or more compared to the Grand Prix.   Without changes that improve the statistical accuracy of the system, CoP will continue to be a failure in the area of believably separating performances of nearly equal value.

Well Balanced Program - Distribution of Points

In the past we have described how jumps could be expected to make up more than 40% of the points for the men's free skating and presentation only about 30% -- and how spins and sequences would have little value in CoP.

Using the actual scores during the Grand Prix, the amount by which each class of element contributes to the actual scores was determined.  In singles, for example, we divided the scoring into five classes of skill: jumps, spins, sequences, basic skating and transitions (program components 1 and 2), and presentation (program components 3 through 5).  The actual marks during the Grand Prix confirm previous comments about the distribution of points, and confirm that scores are dominated by the marks for jumps.  Spins and sequences together made up only 12-15% of the scores during the Grand Prix, while jumps and the program components made up 85-88%.

To examine the impact of spins and sequences on the results, the skaters' scores were recalculated omitting the points from spins and sequences.  It is found that the impact of spins and sequences on the results is limited to at most one place, for less than one-third of the skaters.
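
Both calculations can be sketched as follows (hypothetical scores for two skaters; the class breakdown follows the five classes of skill defined above):

    CLASSES = ("jumps", "spins", "sequences", "skating", "presentation")

    skaters = {   # points earned in each class of skill (hypothetical)
        "A": {"jumps": 55, "spins": 10, "sequences": 6, "skating": 30, "presentation": 45},
        "B": {"jumps": 48, "spins": 13, "sequences": 8, "skating": 32, "presentation": 47},
    }

    # Share of the total score contributed by each class of skill.
    for name, pts in skaters.items():
        total = sum(pts.values())
        shares = ", ".join(f"{c}: {100 * pts[c] / total:.0f}%" for c in CLASSES)
        print(f"{name}: total = {total}  ({shares})")

    def rank(score):
        return sorted(skaters, key=score, reverse=True)

    full = rank(lambda s: sum(skaters[s].values()))
    no_spin_seq = rank(lambda s: sum(v for c, v in skaters[s].items()
                                     if c not in ("spins", "sequences")))
    print("with all elements:   ", full)          # B first in this example
    print("omitting spins/seqs: ", no_spin_seq)   # order flips to A first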

Since the completion of the Grand Prix, one additional jump has been added to the allowed elements.  This change will further shift the distribution of points in favor of jumps, adding a few percentage points to their contribution and decreasing the others.  For Free skating, CoP was a jumping contest during the Grand Prix and will be more so in the future.  With one additional jump, spins and sequences can be expected to have even less of an impact on the results.

In the Short Programs, the balance of the elements was not as bad as for the Free Skating, since the numbers of elements of each type are more nearly equal in the Short Programs.  Spins and sequences together made up 17-25% of the scores, while jumps and the program components made up 75-83%.  Despite the slightly greater weight given spins and sequences in the Short Programs, the impact of spins and sequences on the actual placements was only slightly greater than for the Free Skating.

Random Selection of Marks

One contentious feature of CoP is the random selection of the marks.  There is no evidence that random selection of marks has any effect in deterring attempts at misconduct.  It is clear from the marks in the Grand Prix, however, that it has a significant negative impact on the scoring.

The most obvious example of how random selection of the marks can skew the results occurs when a panel is nearly evenly divided in rating two skaters.  If half the panel gives Skater A higher marks than Skater B, and seven judges are randomly selected, it is obvious that either Skater A or Skater B could be placed first, depending on which judges are selected.  Less obvious is that even if as many as seven or eight judges give Skater B higher marks than Skater A, random selection of marks can still result in Skater A receiving the higher score.  An example of this can be found in the detailed calculations linked above.

Bear in mind, it is not just a matter of how many judges score one skater higher than another; it is also a matter of the amount by which they do so.  For example, eight of ten judges could score Skater B higher than Skater A.  After random selection of the judges, marks from five of the seven remaining judges could be higher for Skater B than Skater A.  After the single trimmed mean, three of the five remaining marks could be higher for Skater B than for Skater A, and yet Skater A can still end up with the higher score if the two judges marking A higher than B do so by a greater point difference than the three marking B higher than A.
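
This last step can be made concrete with a small worked example, using hypothetical values for the five marks that remain after random selection and the trimmed mean:

    # Three of five judges favor Skater B, but only by a little;
    # two favor Skater A by a lot, so A ends up with the higher score.
    a = [10.0, 10.0, 7.0, 7.0, 7.0]
    b = [ 8.0,  8.0, 7.5, 7.5, 7.5]
    print(sum(y > x for x, y in zip(a, b)), "of 5 judges favor B")   # 3
    print(f"A scores {sum(a) / 5:.2f}, B scores {sum(b) / 5:.2f}")   # 8.20 vs 7.70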

To see how often random selection of marks affected the results in the Grand Prix, the order of finish was calculated using a single trimmed mean applied to the marks from all the judges.  It is found that random selection of marks typically skews the results for 1/6 to 1/3 of the places, and in some events it skews the results for as many as 50% of the places.  This can be expected to occur even more frequently in Championship events, where the typical difference in points between places can be expected to be about one-third what it was in the Grand Prix.  (The smaller the point difference between places, the more sensitive the results are to random selection.)
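
The effect can also be illustrated by simulation (a sketch, not the method used in the analysis itself, with hypothetical marks for two closely scored skaters):

    import random
    from statistics import mean

    def trimmed_mean(marks):
        s = sorted(marks)
        return mean(s[1:-1])   # single trimmed mean: drop one high, one low

    # Hypothetical totals from ten judges for two closely matched skaters.
    a = [101.2, 99.5, 100.8, 102.0, 98.9, 101.5, 100.1, 99.8, 101.0, 100.4]
    b = [100.9, 100.2, 101.1, 99.7, 101.8, 100.6, 99.9, 100.8, 100.3, 101.2]

    full_order = trimmed_mean(a) > trimmed_mean(b)
    flips, trials = 0, 10000
    for _ in range(trials):
        panel = random.sample(range(10), 7)   # random selection of 7 judges
        sel_order = (trimmed_mean([a[i] for i in panel])
                     > trimmed_mean([b[i] for i in panel]))
        flips += (sel_order != full_order)
    print(f"order flipped in {100 * flips / trials:.1f}% of random panels")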

Random selection of marks also affects the combination of points from two segments of an event, which further increases the impact random selection has on the results.

Suppose, for example, Skater A scores five points more than Skater B in the Short Program.  To win the event in this example, Skater B needs to beat Skater A by more than five points in the Free Skating.

In examining the marks from the Grand Prix one finds that the point difference between two skaters can vary drastically depending on which judges are randomly selected.  One can have events where all ten judges score Skater B higher than Skater A, but depending on which judges are selected the margin of victory can vary from one or two points to ten or more points.  For example, using all the judges Skater B might have a six point victory in the Free Skating, enough to win the event, but due to random selection of marks might actually end up with only a four point margin of victory and place second.

Random selection has a frequent and pervasive negative impact on the results in CoP.  It skews the results in individual event segments for a significant fraction of the places, and skews the event point totals on top of that.  With random selection of marks CoP is currently a computerized roulette wheel.

Come From Behind Victory

One of the characteristics of CoP that has been well received is the fact that it allows a come-from-behind victory without help.  The prime example of this in the Grand Prix is the ladies event at Cup of China, where Elena Liashenko placed seventh in the Short Program and then won the Free Skating to win the event; however, this is only part of the story for that event.

CoP does allow a small improvement over Total Factored Place (TFP) in combining event segments, but not nearly as much as expected, and there is no free lunch.  Some skaters gain by directly summing the points, but others lose.

When event results are calculated for the ladies event at Cup of China it is found that under TFP Liashenko would have moved up from 7th to 2nd place, so using point totals bought her only one extra place.  On the other hand, Jennifer Robinson lost out big-time, ending up three places lower using point totals compared to TFP.

Another negative of summing points is that a skater can place second in the Short Program, win the Free Skating, and still lose the event.  During the Grand Prix this occurred for 1/4 of the skaters who won the Free Skating after placing second in the Short Program.

Further, Liashenko benefited not only from the use of point totals, but also from random selection of marks and harsh double penalties directed at Yoshie Onda.

For the 11 sets of marks in the Protocol, the point difference between Liashenko and Onda in the Free Skating varied from a few points in favor of Onda to 44 points in favor of Liashenko.  Liashenko lucked out on the random selection of marks, which chose the judges that gave her the largest margins of victory.  If all judges are used, Liashenko's margin of victory is only a few tenths of a point.

As for Onda, in the Free Skating she attempted eight jump elements when only seven are allowed.  Her last three jump elements were triple toe loop, triple toe loop, and double Axel.  Since a triple cannot be repeated outside a combination, the second triple toe loop did not count.  But neither did the double Axel -- the voided triple still counted against the seven allowed jump elements, making the Axel an excess element -- so she received points for only six jump elements.  Had she done the Axel before the second triple toe loop it would have counted, and without random selection of marks she would have won.

Consequently, Liashenko's come from behind victory resulted not only from the way event segments are combined, but also from random selection of marks and the way penalties are assessed.  Note, for example, that there is a combination of judges that gives Onda victory in both the Free Skating and the overall event, had that combination been randomly selected.

Whether summing points from event segments is a good thing or not is to some extent more a philosophical question than mathematical one.  Cynthia Phaneuf, who moved up from 8th place in the Short Program to finish 2nd by winning the Free Skating at Four Continents under TFP, would probably have liked summing of points in her competition, but other skaters in the Grand Prix would probably have preferred their overall results calculated using TFP.

Consistency of Marks

Another basic premise of CoP is that by combining the marks from every judge for the different aspects of skating using the same weights, greater consistency is obtained, producing superior results.  It is generally assumed that the large range in marks and ordinals sometimes seen in the 6.0 system is due to the judges giving different values to the various aspects of skating and combining them in different ways.  A fundamental goal of CoP is to eliminate this inconsistency.

At events marked using the 6.0 system following the conclusion of the Grand Prix, there were occasional cases where skaters received a wide range of scores and ordinals.  It was often remarked how terrible this was, how this wouldn't happen under CoP, and how the sooner all competitions switched to CoP the better.

In looking at the marks for the ladies Free Skating at Cup of China it was noticed, however, that there was a great deal of spread in the judges' marks.  To study the consistency of the judges' marks, the total points and places determined by each judge for each skater were calculated for all event segments during the Grand Prix.  In doing this it was found that the consistency of the judges' marks was extremely poor for many, many events.
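
The per-judge calculation amounts to scoring the event with each judge's marks alone and comparing the resulting placements; a sketch with hypothetical totals for three judges and three skaters:

    # rows: judges; columns: each judge's total score for skaters A, B, C
    judge_totals = {
        "J1": {"A": 104, "B": 98, "C": 95},
        "J2": {"A": 96, "B": 103, "C": 99},
        "J3": {"A": 100, "B": 97, "C": 104},
    }

    places = {s: [] for s in ("A", "B", "C")}
    for judge, totals in judge_totals.items():
        order = sorted(totals, key=totals.get, reverse=True)
        for place, skater in enumerate(order, start=1):
            places[skater].append(place)

    # Each skater's range of places shows how consistent the panel is.
    for skater, p in places.items():
        print(f"skater {skater}: places {sorted(p)} (range {min(p)}-{max(p)})")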

For the ladies event at Cup of China, the consistency of the marks and placements was about as dreadful as one sees in competition.  If this lack of agreement had occurred under the 6.0 system, people would be screaming bloody murder!  Liashenko had places of 1 through 5 and marks that correspond to the high 5's to mid 4's.  Onda had marks equivalent to the mid 5's through the low 3's.  Robinson had placements of 1 through 9 and marks equivalent to the high 4's to the mid 3's.  Corwin had placements of 2 through 9 and marks equivalent to the high 4's to the low 3's.

These calculations also show that despite a major effort to train the judges, some judges marked systematically higher than average and other judges marked systematically lower than average by a significant amount.  Getting a large group of humans to score in a consistent way is extremely hard to accomplish (which is why the concept of ordinals was introduced in the first place), and has yet to be accomplished under CoP.

In terms of bringing consistency to the marks and the places, CoP in its current form is a complete failure.  The result for this event is not unique.  There were many events during the Grand Prix that were just as bad.  For CoP to work, the judges must all mark in a consistent way.  At this point, the marks say they do not.  Whether judges can ever be trained to do this remains to be seen.  Further training of the judges and further testing will be required to prove that they can -- for to quote Mark Twain, "supposing is good, but knowing is better."

This analysis was extended to see if there was any better consistency among the different aspects of skating.  For the five classes of skill (jumps, spins, sequences, basic skating and transitions, and presentation) the number of points in each class of skill and the corresponding places were calculated.  The result of this calculation is that the scores and places for the individual skills show just as much variation as do the total points.  The marks and places are all over the map.  CoP is currently no better at determining who is the best jumper or spinner, etc., than it is at determining who is the best overall skater.

For example, in the ladies Free Skating at Cup of China, the consistency in the presentation marks  (sum of program components 3 through 5) was about as dreadful as dreadful can be.  Five ladies got at least one first place score for presentation and seven got at least one second place score!  Nine of eleven skaters had a range of places that spanned 5 to 8 places!

The result for this event and this skating skill was not unique.  In many events the five skating skills  all show equally poor consistency.  This amount of disagreement would be considered completely unacceptable in a senior level event judged under the 6.0 system; and, ironically, when calculated in terms of places, the inconsistency in the marks is significantly worse under CoP than what is usually found under the ordinal system.  In terms of consistency of judgment, the data show that CoP does not perform nearly as well as the 6.0 system.

This raises the question, if the judges are capable of significantly better agreement under the 6.0 system, why are the same judges incapable of getting at least as good consistency under CoP?  Is it a fundamental defect in CoP?  Is it a result of inadequate training?  The only way to find out for sure is further study and testing.

This degree of inconsistency also raises the question of how a meaningful and valid system of accountability can be implemented under CoP when the judges' marks show so much variation.  For example, according to some criteria being considered for accountability, one might conclude that more than half the marks in the ladies Free Skating at Cup of China are anomalies!

Among the different events, consistency among the judges is worst in singles, and only slightly better in pairs.  As is often the case, dance follows a different pattern, with the scoring of the judges fairly consistent, though still not as much as might be expected given the assumption underlying the construction of CoP.

Summary

To the extent it places the better performances more or less at the top and the worst performances more or less at the bottom, CoP gives the appearance of producing plausible results.  However, when subjected to objective mathematical testing it is found there is little validity to the specific placements produced by the system for the majority of the skaters.

These calculations indicate that CoP has yet to achieve its desired goal of providing a rigorous, bias-free, absolute standard for evaluating skating.  The data show that improvements are needed in several areas.  It is also clear that the judges are not marking the grades of execution and program components with the accuracy and consistency needed.  Whether these problems can be solved through revision of CoP and/or further training of the judges is unknown, and demands further study and testing.  The statistical methods used here provide an objective means of determining whether such efforts achieve the improvement required, and which the skaters deserve.


Copyright 2004 by George S. Rossano