BMIRDS Datasets

MHIST: A Minimalist Histopathology Image Analysis Dataset

This dataset comprises 3,152 fixed-size images (224 by 224 pixels) of hematoxylin and eosin (H&E)-stained, formalin-fixed paraffin-embedded (FFPE) colorectal polyps from the Department of Pathology and Laboratory Medicine at Dartmouth-Hitchcock Medical Center (DHMC). The dataset is de-identified and released with permission from the Dartmouth-Hitchcock Health (D-HH) Institutional Review Board (IRB). Each image is labeled with a colorectal polyp type according to the opinions of seven pathologists at the Department of Pathology and Laboratory Medicine at DHMC: Drs. Arief Suriawinata, Bing Ren, Xiaoying Liu, Mikhail Lisovsky, Louis Vaickus, Charles Brown, and Michael Baker. This dataset and its associated annotations aim to foster collaboration with the research community and to facilitate the development and evaluation of new methods for accurate histology image analysis in digital pathology. For more information about this dataset, please refer to “A Petri Dish for Histopathology Image Analysis”.

MHIST Binary Classification Task

Classes in our dataset indicate the predominant histological pattern of each image and are as follows:

  • Hyperplastic Polyp (HP)
  • Sessile Serrated Adenoma (SSA)

This classification task focuses on the clinically important binary distinction between HPs and SSAs, a challenging problem with considerable inter-pathologist variability. HPs are typically benign, while SSAs are precancerous lesions that can turn into cancer if left untreated and therefore require earlier follow-up examinations. Histologically, HPs have a superficial serrated architecture and elongated crypts, whereas SSAs are characterized by broad-based crypts, often with complex structure and heavy serration.

Dataset Description

The dataset includes:

  • annotations.csv
  • images.zip (333 MB)
  • MD5SUMs.txt

All 3,152 images are in the images.zip file.

Annotations are included in annotations.csv. For each image, this file lists the file name, the majority-vote label, and the degree of annotator agreement, expressed as the number of annotators (out of seven) who marked the image as SSA. For example, 6 indicates 6/7 agreement with a ground truth of SSA, and 2 indicates 5/7 agreement with a ground truth of HP.
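The vote-count convention described above can be sketched in Python. This is a minimal illustration and not part of the dataset distribution: the function name `majority_label` and the constant `NUM_ANNOTATORS` are hypothetical, and the sketch assumes that all seven annotators voted on every image and that the label follows a simple majority.

```python
# Hypothetical helper illustrating the agreement convention; not shipped with MHIST.
NUM_ANNOTATORS = 7  # seven pathologists labeled each image


def majority_label(ssa_votes: int, num_annotators: int = NUM_ANNOTATORS):
    """Return (label, agreeing_votes) for a given number of SSA votes.

    Assumes every annotator voted either SSA or HP, so the HP vote count
    is the remainder. With an odd number of annotators there are no ties.
    """
    if not 0 <= ssa_votes <= num_annotators:
        raise ValueError(f"ssa_votes must be between 0 and {num_annotators}")
    hp_votes = num_annotators - ssa_votes
    if ssa_votes > hp_votes:
        return "SSA", ssa_votes
    return "HP", hp_votes


# The two examples from the text:
print(majority_label(6))  # ('SSA', 6) -> 6/7 agreement on SSA
print(majority_label(2))  # ('HP', 5)  -> 5/7 agreement on HP
```

A helper like this can be useful for filtering images by agreement level, for example keeping only images on which at least six of the seven annotators agree.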

MD5SUMs.txt contains MD5 checksums that can be used to verify that the dataset files were downloaded correctly.
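One way to perform this check is sketched below using only the Python standard library. The sketch assumes that MD5SUMs.txt follows the usual `md5sum` layout, with one `<checksum>  <file name>` pair per line; that layout is an assumption, not something confirmed by the dataset documentation. The function names `md5_of_file`, `parse_md5sums`, and `verify_downloads` are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5sums(text):
    """Parse md5sum-style text into {file name: checksum}.

    Assumed line layout: "<32 hex chars>  <file name>". A leading "*"
    on the file name (md5sum's binary-mode marker) is stripped.
    """
    expected = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        checksum, name = line.split(maxsplit=1)
        expected[name.lstrip("*")] = checksum.lower()
    return expected


def verify_downloads(md5sums_path, data_dir):
    """Return {file name: True/False} for each file listed in the checksum file.

    A file that is missing from data_dir is reported as False.
    """
    expected = parse_md5sums(Path(md5sums_path).read_text())
    results = {}
    for name, checksum in expected.items():
        target = Path(data_dir) / name
        results[name] = target.is_file() and md5_of_file(target) == checksum
    return results
```

For example, if all three files were saved to the same directory, `verify_downloads("MD5SUMs.txt", ".")` would report whether each listed file matches its checksum. Any file reported as `False` should be downloaded again.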

Code Repository

DeepSlide, our open-source framework for histology image analysis in PyTorch, is available to develop deep learning models for histology image classification.

Accessing Dataset

Before downloading our dataset, please read the Dataset Research Use Agreement.

Please fill out the form below to receive the links to download the dataset by email.

Citation

If you use this dataset, please cite the corresponding paper:

Jerry Wei, Arief Suriawinata, Bing Ren, Xiaoying Liu, Mikhail Lisovsky, Louis Vaickus, Charles Brown, Michael Baker, Naofumi Tomita, Lorenzo Torresani, Jason Wei, Saeed Hassanpour, “A Petri Dish for Histopathology Image Analysis”, International Conference on Artificial Intelligence in Medicine (AIME), 12721:11-24, 2021.

FAQ

“I haven’t received any email after submitting the form.”

Please wait a few hours, then submit the form again.

By default, the download links expire after 4 hours. If your links have expired, please submit the form again to receive new links, and download the data before they expire.




For inquiries, please contact us at BMIRDS.

If you are interested in histology image analysis, please check out other datasets from our group.