While fine-tuning Swin UNETR, training loss is not decreasing and training crashes after 10 epochs #304

Mgithus · 2023-09-10T03:18:09Z

Describe the bug
I am trying to reproduce the Swin UNETR results. I am fine-tuning with the code and the model.pt file provided at:
https://github.com/Project-MONAI/research-contributions/tree/main/SwinUNETR/BRATS21

That model was trained on BraTS 2021 data. I am using BraTS 2023 data, provided on request from Synapse:
https://www.synapse.org/#!Synapse:syn27046444/wiki/616992

I am using an A100 GPU provided by Colab Pro and running the following command for fine-tuning on 1 GPU:

!python '/content/drive/MyDrive/Mgithus/SWIN/SwinUNETR/BRATS21/main.py' --json_list='/content/drive/MyDrive/data/whole_2023/ASNR-MICCAI-BraTS2023-GLI-Challenge-TrainingData.json' --data_dir='/content/drive/MyDrive/data/whole_2023/train_ds/ASNR-MICCAI-BraTS2023-GLI-Challenge-TrainingData' --val_every=10 --noamp --pretrained_model_name='Swin UNETR'
--pretrained_dir='/content/drive/MyDrive/fold1_f48_ep300_4gpu_dice0_9059/fold1_f48_ep300_4gpu_dice0_9059/model.pt' --fold=1 --roi_x=128 --roi_y=128 --roi_z=128 --in_channels=4 --spatial_dims=3 --use_checkpoint --feature_size=48 --max_epochs=80 --batch_size=3 --workers=12

In the paper "Swin UNETR: Swin Transformers for Semantic Segmentation of Brain Tumors in MRI Images":
https://arxiv.org/abs/2201.01266
the authors report an average Dice score of 0.913. But when I train on BraTS 2023, the loss shows no meaningful reduction, and training also gives an error after the first validation, as follows:

After the 5th epoch, the loss increased from 0.9465 to 0.97 instead of decreasing.

Am I using the pretrained model incorrectly? Which hyperparameter values could help increase the Dice score? ET is 0 through the end of the third epoch.
Also, although the input data has 1251 sample folders and I am using 4-fold cross-validation, the model runs 936 iterations instead of 939 at a batch size of 1 on a T4 GPU. Is it related to the runtime type, or to the CUDA out-of-memory problem?
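
One way to check the iteration count is to tally the datalist entries per fold and derive the expected number of training iterations. The sketch below is only illustrative. It assumes the JSON passed via --json_list has a top-level "training" list whose entries carry a "fold" key, that the fold given by --fold is held out for validation, and that the training DataLoader does not drop the last partial batch. The helper names load_entries and fold_summary are hypothetical and not part of the repository.

```python
import json
import math
from collections import Counter


def load_entries(json_path, key="training"):
    """Read the datalist entries (assumed layout: {"training": [{"fold": ...}, ...]})."""
    with open(json_path) as f:
        return json.load(f)[key]


def fold_summary(entries, val_fold, batch_size=1):
    """Count cases per fold and the iterations per epoch when `val_fold` is held out."""
    per_fold = Counter(e.get("fold") for e in entries)
    n_val = per_fold.get(val_fold, 0)
    n_train = len(entries) - n_val
    return {
        "per_fold": dict(per_fold),
        "train": n_train,
        "val": n_val,
        # Assumes drop_last=False, so a final partial batch still counts.
        "iters_per_epoch": math.ceil(n_train / batch_size),
    }


if __name__ == "__main__":
    # Synthetic stand-in for a 1251-case datalist split round-robin into 4 folds.
    demo = [{"fold": i % 4} for i in range(1251)]
    print(fold_summary(demo, val_fold=1, batch_size=1))
```

If the "train" count printed for the real datalist differs from the number of sample folders minus the validation fold, some cases are probably missing from the JSON or assigned to an unexpected fold.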

FengheTan9 · 2023-09-21T07:49:23Z

same issue

Luffy03 · 2023-10-10T13:58:50Z

Hi, have you figured it out?

Mgithus · 2023-10-11T02:39:08Z

CUDA out of memory... The training crash was solved by reducing the number of workers and the batch size to 1. But the problem of the loss increasing after the 5th or 6th epoch is still there.

Luffy03 · 2023-10-11T02:45:31Z

> CUDA out of memory... The training crash was solved by reducing the number of workers and the batch size to 1. But the problem of the loss increasing after the 5th or 6th epoch is still there.

Thanks for sharing! Could you please share your MONAI and PyTorch versions? I ran into the same problem and found a proposed solution here (Project-MONAI/model-zoo#180), but it does not work for me.

Mgithus · 2023-10-11T03:02:49Z

My pleasure. I will try it. Thanks.

Mgithus · 2023-10-11T03:14:07Z

I used Google Colab Pro Plus, which automatically installed the latest versions of MONAI and PyTorch without my specifying a version. However, when I tried to run this model in a virtual environment in VS Code using the latest versions (at the time, MONAI 1.2 and PyTorch 2.0.1), it did not work.

Luffy03 · 2023-10-17T17:27:57Z

> I used Google Colab Pro Plus, which automatically installed the latest versions of MONAI and PyTorch without my specifying a version. However, when I tried to run this model in a virtual environment in VS Code using the latest versions (at the time, MONAI 1.2 and PyTorch 2.0.1), it did not work.

I am still struggling to get it working.

NkwamPhilip · 2024-08-07T17:33:17Z

@Mgithus it shows ET as 0 because the segmentation labels are different. The BraTS label transform "ConvertMultiChannel..." expects labels 0, 1, 2, 4, following the segmentation labels of the earlier datasets (..., 2021). The labelling changed to 0, 1, 2, 3 in 2023.
You have to fix that in one of two ways: either modify the label handling in transforms."ConvertMultiChannel..." with custom code and the 2021 segmentation labels, then retrain the model, or use nibabel to convert label 3 to 4 in the 2023 and 2024 data.
It's a label mismatch error.
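
The second option can be sketched roughly as follows. This is a minimal illustration, not code from the repository: the function names are hypothetical, and the TC/WT/ET channel definition is an assumption based on the BraTS convention described above (TC = labels 1 and 4, WT = labels 1, 2 and 4, ET = label 4). nibabel is imported lazily so the array helpers can be tried without it.

```python
import numpy as np


def remap_brats2023_labels(seg):
    """Return a copy of `seg` in which label 3 (2023 convention) becomes label 4."""
    seg = np.asarray(seg)
    out = seg.copy()
    out[seg == 3] = 4
    return out


def to_tc_wt_et(seg, et_label=4):
    """Stack boolean TC, WT and ET channels, treating `et_label` as enhancing tumor."""
    seg = np.asarray(seg)
    tc = (seg == 1) | (seg == et_label)
    wt = tc | (seg == 2)
    et = seg == et_label
    return np.stack([tc, wt, et], axis=0)


def remap_nifti_file(src_path, dst_path):
    """Rewrite label 3 to 4 in one segmentation file (requires nibabel)."""
    import nibabel as nib  # lazy import: only needed for file I/O

    img = nib.load(src_path)
    seg = np.asanyarray(img.dataobj)
    fixed = remap_brats2023_labels(seg)
    nib.save(nib.Nifti1Image(fixed, img.affine, img.header), dst_path)


if __name__ == "__main__":
    seg_2023 = np.array([0, 1, 2, 3])
    # Without remapping, a transform that looks for label 4 finds no ET voxels.
    print(to_tc_wt_et(seg_2023).sum(axis=1))
    print(to_tc_wt_et(remap_brats2023_labels(seg_2023)).sum(axis=1))
```

The first option corresponds to calling to_tc_wt_et(seg, et_label=3) inside a custom transform instead of rewriting the files. Either way, it is worth checking np.unique on a few label volumes first to confirm which convention the data actually uses.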

Luffy03 · 2024-10-14T11:28:33Z

Hi, we reproduce the results at https://github.com/Luffy03/Large-Scale-Medical. You can find our implementation at https://github.com/Luffy03/Large-Scale-Medical/tree/main/Downstream/monai/BRATS21.
