The results of the present study indicated "fair to good" agreement for both grading systems, the 2004/2016 WHO system and Cheng’s system, both overall and in subgroups for all grades, i.e., low grade and high grade in the 2004/2016 grading system and grades II, III, and IV in Cheng’s grading system. This comparison has not been performed previously; therefore, we cannot compare our results with similar studies and can only compare them with the few studies reporting the reproducibility of the 2004/2016 grading system compared with previous WHO systems. In those studies, the agreement among pathologists for the 2004/2016 system ranged from 39 - 74%, with kappa values of κ = 0.14 - 0.58 (11), κ = 0.30 - 0.52 (12), and κ = 0.35 (15). We obtained higher Kappa values in the present study than those reported in the studies mentioned above (11, 12, 15). The inclusion of a specific group of patients may be one of the reasons for this difference. Furthermore, the difference may be related to the specification of grade III in Cheng et al.’s system (8).
It should be noted that tumor grade is an important source of inter-observer variability: the overall inter-observer agreement of 87% (κ = 0.70) reported by Mangrud et al. for the WHO 2004/2016 grading system in patients with TaT1 carcinoma decreased to 66% in high-grade tumors and increased to 100% in low-grade tumors (16). The Kappa values reported in that study for the WHO 2004/2016 grading system were close to those obtained in the present study, indicating "fair to good" inter-observer agreement. A review of 20 studies indicated a wide range of inter-observer agreement for the 2004/2016 classification (43 - 100%, κ = 0.17 - 0.70) (13). This wide range reflects differences in the disease subgroups analyzed and the limitations of the studies (moderate to high risk of bias) (13).
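Because the studies above report both percent agreement and kappa, it may help to see how the two differ. The following is a minimal sketch, assuming Cohen's kappa for two raters; the cited studies may have used other variants, such as weighted or multi-rater kappa. The function name `cohen_kappa` and the grades assigned by the two pathologists are hypothetical and are not data from this or any cited study.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Return (observed agreement, Cohen's kappa) for two raters.

    Both inputs are equal-length sequences of categorical labels,
    one label per case, in the same case order.
    """
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("Both raters must grade the same non-empty set of cases")

    n = len(rater_a)
    # Observed agreement: fraction of cases given the same grade.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: for each grade, the product of the two raters'
    # marginal proportions, summed over grades.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[g] * count_b[g] for g in count_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical grade for every case.
        return observed, 1.0
    return observed, (observed - expected) / (1.0 - expected)


# Hypothetical grades for ten specimens (illustration only).
pathologist_a = ["low"] * 4 + ["high"] * 6
pathologist_b = ["low", "low", "low", "high", "high",
                 "high", "high", "high", "high", "low"]

agreement, kappa = cohen_kappa(pathologist_a, pathologist_b)
print(f"agreement = {agreement:.2f}, kappa = {kappa:.2f}")
# prints: agreement = 0.80, kappa = 0.58
```

In this hypothetical example, 80% raw agreement corresponds to κ of about 0.58, because roughly half of the agreement would be expected by chance given how often each grade was assigned. This is why percent agreement and kappa are reported side by side.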
In addition to tumor grading, we investigated the pathologists’ agreement in reporting stromal invasion, muscular invasion, and tumor heterogeneity; the results showed good agreement for all parameters except heterogeneity. Few studies have evaluated pathologists’ agreement for stromal invasion, and none for tumor heterogeneity or muscular invasion, although it is well known that these parameters are essential for evaluating the depth of invasion and have a significant effect on disease progression and, thus, patients’ prognosis (21-23). Tosoni et al. reported that a single pathologist’s report of stromal invasion should not be relied on for treatment choice and suggested evaluation by a second pathologist (21). In the study suggesting a new staging system for pT1 papillary bladder cancer (superficially invasive papillary urothelial cell carcinoma), stromal invasion was used to define microinvasion, and the results showed 81% agreement between two pathologists (24). In another study on bladder cancer, full agreement among eight genitourinary pathologists was observed in only 44% of cases, with greater discordance between aggressive and conservative pathologists (25). Others have also identified lamina propria invasion as one of the parameters with poor agreement between experienced pathologists (26), whereas we found good agreement between pathologists in this regard. Tumor heterogeneity, which refers to the divergent differentiation of this type of cancer, has been identified as another cause of observer variability, which can also result in uncertainty about treatment choice (11, 27, 28). Therefore, further studies reporting pathologists’ agreement on these parameters are required to allow comparison with the results of the present study.
The characteristics of the pathologists may also affect the results of our study. In our study, pairwise comparisons showed no difference in Kappa values, with fair to good agreement between the pathologists in both grading systems, except in one comparison. Some have suggested that the level of experience is an important factor in poor inter-observer agreement for grading urothelial carcinoma on urine cytology, reporting greater accuracy in reports from senior pathologists with more than seven years of experience or those with specialized training (29). Others reported no difference in Kappa values between pathologists with more than ten years and those with less than ten years of experience in low-grade urothelial carcinoma of instrumented urinary tract cytology specimens (30). Possibly, experience with the specific grading system is more important than the pathologists’ overall years of experience (31). We cannot comment on this based on our study because we did not have any pathologists with a low level of experience. We also evaluated whether the medical center where the pathologists work influences their agreement, but the results showed no such effect, and all Kappa values fell within one category (fair to moderate). Further studies are required to determine whether the pathologists’ characteristics can influence the accuracy of the results reported for grading UCB.
Regarding the patients’ characteristics, most patients in our study were men, which is consistent with previous reports. This sex difference is attributed to greater exposure of men to risk factors, such as cigarette smoking and occupational hazards, as well as to the role of sex hormones (32), with different male-to-female ratios reported (33). Additionally, the mean age of our study population was close to that reported in the USA (34) and Canada (35). It has been suggested that patients younger than 40 years have smaller tumors, lower-grade cancers, and better overall survival (36, 37), which may explain both the absence of grade I pathology (in the new grading system) among our samples and the poor prognosis observed.
The main limitation of this study was the retrospective nature of data collection, which resulted in a large amount of missing data on some variables, such as clinical outcomes (death and recurrence); however, these variables were not the main objective of our study and thus not critical. Another consequence of retrospective evaluation was that we could not differentiate between diagnoses based on biopsy versus surgical specimens. There were also several variables not available in the medical records and not evaluated, such as underlying diseases, whose confounding effects may influence the results of our study. Additionally, we did not collect any data about the treatment strategies employed; therefore, the results of survival and recurrence could not be interpreted completely. For the same reason, we did not comment on the impact of these grading systems on the therapeutic strategies selected by the oncologists. Finally, we selected patients from one medical center; therefore, the results cannot be generalized to all patients with this condition, and further research is required to generalize the results to a wider population.
5.1. Conclusions
The fair to good agreement among pathologists in both grading systems for UCB, namely the WHO 2004/2016 and Cheng’s grading systems, and the close Kappa values demonstrated the reproducibility of these two grading systems and argue against differences between the grading systems as the cause of discrepancies among pathologists. Therefore, the choice of grading system for the pathological report of a UCB specimen should not be based on the reproducibility of the grading system. Other factors may differ significantly between the two grading systems, and determining them requires further studies.