Data and challenge

The Rotterdam EyePACS AIROGS dataset (in full, i.e. including both the training and test sets) contains 113,893 color fundus images from 60,357 subjects, acquired at approximately 500 different sites and covering a heterogeneous range of ethnicities. All images were labeled by human experts as referable glaucoma, no referable glaucoma, or ungradable. The training set can be downloaded from Zenodo. Please note that we have to accept requests on Zenodo manually, so it may take some time before you are granted access. In the future, the dataset will also be available on the AWS Registry of Open Data.

To encourage participants to develop technologies with inherent robustness mechanisms, the training set is an in-the-lab set in which only gradable images are included and ungradable images are excluded. The test set, however, includes all image types acquired during screening, simulating a real-world scenario.

The test set is "closed", meaning the test data cannot be downloaded. To be assessed on this test set, participants first produce an algorithm wrapped in a Docker container (see example here). The algorithm is then executed by the evaluation platform on Grand Challenge using an NVIDIA T4 GPU (16 GB) with 8 CPUs (32 GB), resulting in a set of predictions that are subsequently evaluated against the closed test data. This dataset is used for the final phase, to which one submission will be allowed. The creation of multiple accounts by the same team is prohibited.

Before the final phase, there will be two preliminary test phases. Preliminary phase 1 uses a smaller number of images to verify that the algorithms wrapped as Docker containers work as expected; a larger number of submissions will be allowed for this phase. Preliminary phase 2 contains 10% of the final test data, and three submissions will be allowed.

The training set contains approximately 101,000 gradable images. The test set contains about 11,000 images, both gradable and ungradable.

For each input image during evaluation, the desired outputs are:

- O1: a likelihood score for referable glaucoma;
- O2: a binary decision on the presence of referable glaucoma;
- O3: a binary decision on whether the image is ungradable (true if ungradable, false if gradable);
- O4: a non-thresholded scalar value that is positively correlated with the likelihood of ungradability (e.g. the entropy of a probability vector produced by a machine learning model, or the variance of an ensemble).

Output O2 will not be used in the evaluation pipeline for the challenge leaderboard, but only for any additional evaluation, still to be determined, in the challenge paper.
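As a concrete illustration, the four outputs could be assembled from a model's class probabilities as sketched below. This is a hypothetical example, not part of the challenge specification: the function name, the 0.5 decision threshold for O2, the entropy-based ungradability score, and the entropy threshold for O3 are all assumptions (the actual submission format is defined by the Docker container template).

```python
import numpy as np

def glaucoma_outputs(p_referable, class_probs):
    """Assemble the four challenge outputs from model probabilities.

    p_referable: probability of referable glaucoma from some model.
    class_probs: a probability vector whose entropy serves as the
        ungradability score (one possible choice; any scalar positively
        correlated with ungradability is allowed for O4).
    """
    # Shannon entropy as an example uncertainty-based ungradability score.
    entropy = float(-np.sum(class_probs * np.log(class_probs + 1e-12)))
    return {
        "O1": float(p_referable),        # likelihood of referable glaucoma
        "O2": bool(p_referable >= 0.5),  # binary decision (threshold assumed)
        "O3": bool(entropy >= 1.0),      # True if ungradable (threshold assumed)
        "O4": entropy,                   # non-thresholded ungradability score
    }
```

A confident prediction (e.g. `glaucoma_outputs(0.9, np.array([0.9, 0.1]))`) yields a low entropy and is therefore flagged as gradable under these assumed thresholds.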

The use of additional fundus image data is prohibited; this includes weights pretrained on fundus image data and fundus data used in a pre-processing step, such as optic disc segmentation. The training data will be made available under the CC BY-NC-ND licence.

We aim to write a summary paper about the challenge. The best performing teams will be allowed to choose a maximum of three persons to be co-authors of this summary paper.

Labeling procedure

A dedicated, optimized web-based annotation system was built to facilitate the labeling of all images. Labeling was performed by a pool of carefully selected graders. From a larger group of general ophthalmologists, glaucoma specialists, residents in ophthalmology and optometrists who had been trained in optic disc assessment for detecting glaucoma, 89 indicated that they wanted to become a grader for labeling the dataset. They were invited to take an exam, and 32 of them passed with a minimum specificity of 92% and a minimum sensitivity of 85% for detecting glaucoma on fundus photographs.

For each image, graders needed to state whether the eye should be referred (‘referable glaucoma’), should not be referred (‘no referable glaucoma’) or that the image could not be graded (‘ungradable’). To ensure a high-quality set of labels, each image was initially graded twice. If the graders agreed, the agreed-on label was the final label. If the graders disagreed, the image was graded by a third grader (one of two experienced glaucoma specialists); that label was the final label.

During the grading process, the performance of all graders was monitored; those who showed a sensitivity below 80% and/or a specificity below 95% were removed from the grader pool. All the images they had graded were redistributed among the remaining graders. In the end, the pool consisted of 20 graders, excluding the two experienced glaucoma specialists who only graded the images in case of disagreement.


Evaluation

The evaluation will be based on two aspects: screening performance and robustness.

The screening performance will be evaluated using the partial area under the receiver operating characteristic curve over the 90-100% specificity range for referable glaucoma (α) and the sensitivity at 95% specificity (β). These specificity ranges are used because a high specificity is generally desired in screening settings. To calculate α and β, we compare output O1 to the referable glaucoma reference provided by the human experts.
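Both α and β can be read off an ROC curve. The sketch below uses scikit-learn's `roc_curve` and computes the raw (unnormalized) partial AUC over the 0-10% false positive rate range; note that partial-AUC normalization conventions vary (e.g. `roc_auc_score(..., max_fpr=0.1)` returns a standardized variant), so this is an illustration under stated assumptions, not the official evaluation code.

```python
import numpy as np
from sklearn.metrics import roc_curve

def partial_auc(y_true, y_score, min_specificity=0.9):
    """Raw partial AUC over the 90-100% specificity range (FPR 0 to 0.1)."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    max_fpr = 1.0 - min_specificity
    stop = np.searchsorted(fpr, max_fpr, side="right")
    # Close the integration region at exactly max_fpr by interpolation.
    x = np.concatenate([fpr[:stop], [max_fpr]])
    y = np.concatenate([tpr[:stop], [np.interp(max_fpr, fpr, tpr)]])
    # Trapezoidal integration (avoids np.trapz, removed in NumPy 2.0).
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0))

def sensitivity_at_specificity(y_true, y_score, specificity=0.95):
    """Sensitivity (TPR) at a fixed specificity, via ROC interpolation."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    return float(np.interp(1.0 - specificity, fpr, tpr))
```

For a perfectly separating score, the raw partial AUC over FPR 0-0.1 equals 0.1 and the sensitivity at 95% specificity equals 1.0.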

The agreement between the reference and the gradability decisions provided by the challenge participants, O3, is calculated using Cohen's kappa score (γ). Furthermore, the area under the receiver operating characteristic curve (δ) will be determined using the human reference for ungradability as the true labels and the ungradability scalar values provided by the participants, O4, as the target scores.
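Both γ and δ correspond to standard scikit-learn metrics. The toy labels below are invented purely for illustration (1 = ungradable, 0 = gradable); the variable names are assumptions, not part of the challenge interface.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, roc_auc_score

# Invented toy data: 1 = ungradable, 0 = gradable.
ref_ungradable = np.array([0, 0, 1, 1, 0, 1])         # human reference labels
o3_decisions = np.array([0, 0, 1, 0, 0, 1])           # participant decisions (O3)
o4_scores = np.array([0.1, 0.2, 0.9, 0.4, 0.3, 0.8])  # participant scalar scores (O4)

# Gamma: agreement on the binary gradability decision.
gamma = cohen_kappa_score(ref_ungradable, o3_decisions)

# Delta: ranking quality of the non-thresholded ungradability scores.
delta = roc_auc_score(ref_ungradable, o4_scores)
```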

The final participant ranking will be based on a combination of the aforementioned metrics in the following manner. Firstly, all participants will be ranked on the individual metrics α, β, γ and δ, resulting in rankings Rα, Rβ, Rγ and Rδ, respectively. The final score will be calculated as follows:

Sfinal = (Rα + Rβ + Rγ + Rδ)/4.

The final ranking will subsequently be based on Sfinal, where a lower value for Sfinal will result in a higher ranking.
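The rank-then-average scheme above can be sketched as follows. Higher values are assumed to be better for all four metrics, and the tie handling here (ties broken by input order) is an assumption, since the description does not specify a tie-breaking rule.

```python
import numpy as np

def final_scores(metric_table):
    """Compute S_final = (R_alpha + R_beta + R_gamma + R_delta) / 4 per team.

    metric_table: array of shape (n_teams, 4) holding alpha, beta, gamma,
    delta per team; higher metric values are better. Returns one S_final
    value per team, where a lower S_final means a higher final ranking.
    """
    metric_table = np.asarray(metric_table, dtype=float)
    ranks = np.empty_like(metric_table)
    for j in range(metric_table.shape[1]):
        order = np.argsort(-metric_table[:, j])          # best metric first
        ranks[order, j] = np.arange(1, len(order) + 1)   # rank 1 = best
    return ranks.mean(axis=1)
```

For example, a team that is best on three metrics and second on one out of two teams receives S_final = (1 + 2 + 1 + 1) / 4 = 1.25.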

The evaluation code is available here.

A tutorial for how to make a submission for the challenge is located here.