Loss staying high while trying to tune network #166
Hi there, what is the size of your dataset? How many epochs did you train for? Have you experimented with the hyperparameters? What is the goal of your dataset—what are you trying to detect?

You might want to try DEIM. The repository is similar, and according to their paper, it should converge much faster than RT-DETR and D-FINE.

Have you tried converting the D-FINE (or RT-DETR) model to ONNX and TensorRT? If so, what has your experience been with the model's inference speed? Did you encounter any errors during the conversion?

One more question: have you encountered this issue on Windows or Linux with your custom dataset: `[rank0]: NotImplementedError: Caught NotImplementedError in DataLoader worker process 0`? I successfully trained D-FINE and RT-DETR a month ago, but for some reason I can't get past this error now. It's really frustrating.
My dataset is small (180 images), and I am trying to detect visible blobs in my images. My images are grayscale, but I kept them in RGB format (3 channels). I can look into DEIM, but I don't understand why this isn't working. I trained for 64 epochs and did not change any of the hyperparameters. Is 64 simply too few epochs? I did convert the model to ONNX; it makes fast but very inaccurate inferences, which aligns with the high loss. There were no errors during the conversion. I am training on Windows and did initially run into a `NotImplementedError`. I think I solved it by downgrading to an older torch version (2.3.1). What steps did you use to train on your custom dataset? I feel like I'm missing something obvious.
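For what it's worth, the Windows `NotImplementedError` in DataLoader workers can often be sidestepped without downgrading torch by disabling worker processes. A minimal config sketch, assuming your D-FINE config exposes the usual dataloader keys (the exact section names may differ between repo versions, so treat this as illustrative):

```yaml
# Hypothetical fragment of custom_detection.yml: force single-process
# data loading, which avoids worker spawn issues on Windows.
train_dataloader:
  num_workers: 0
val_dataloader:
  num_workers: 0
```

Setting `num_workers: 0` makes data loading slower, but everything runs in the main process, which tends to be more robust on Windows.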
If you had a perfect architecture for your problem, you would only need a few datapoints; I would add more images to the dataset: https://en.m.wikipedia.org/wiki/Neural_scaling_law

I haven't tried these models on smaller datasets, but transformers tend to perform worse than convolutional networks in low-data scenarios. Transformers allow each part of the image to influence the representation of every other part, while convolutional networks focus on local regions, with downsampling enabling deeper layers to capture broader context. The YOLO Darknet repository has shown that fine-tuning requires only a few images (around 20 is sufficient), even for relatively complex scenes. Transformers, however, are far more complex, and adjusting their weights to an optimal point requires a larger dataset, especially for understanding complex scenes. This is an interesting paper on the question: https://openreview.net/forum?id=SCN8UaetXx

Ultimately, the simplest route to a good transformer model is a large amount of data.
I'm trying to fine-tune the obj365+coco checkpoint on my custom dataset. I have 1 class. This is the category entry from my COCO annotations: `"categories": [{"supercategory": "none", "id": 0, "name": "0"}]`. I have set `num_classes: 2` in custom_detection.yml and have also made sure that `remap_mscoco_category: False` is set there. I'm training from PowerShell with this command:

```
$env:CUDA_VISIBLE_DEVICES="0"; python train.py -c configs/dfine/custom/objects365/dfine_hgnetv2_s_obj2custom.yml --use-amp --seed=0 -t dfine_s_obj2coco.pth
```

My loss stays high (20-30) during training, and all of my evals come back as 0. I know the dataset is trainable; I've successfully trained a YOLO network on it. What am I doing wrong here?
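Since zero evals with a custom single-class dataset often come down to a mismatch between the annotation category ids and the configured `num_classes`, a quick sanity check on the annotation file can rule that out. Below is a sketch of such a check; `check_categories` is a hypothetical helper written for this post, not part of the D-FINE repo, and the contiguity/`num_classes` conditions it tests are general COCO-style conventions rather than something confirmed for this specific config:

```python
# Hypothetical sanity check for a COCO annotation dict before training:
# category ids should be contiguous, and the configured num_classes
# should be strictly greater than the largest category id.
def check_categories(coco, num_classes):
    ids = sorted(c["id"] for c in coco["categories"])
    return {
        "max_id": ids[-1],
        "contiguous": ids == list(range(ids[0], ids[0] + len(ids))),
        "fits_num_classes": ids[-1] < num_classes,
    }

# The single-category entry from the annotations above, with num_classes: 2.
ann = {"categories": [{"supercategory": "none", "id": 0, "name": "0"}]}
print(check_categories(ann, num_classes=2))
# → {'max_id': 0, 'contiguous': True, 'fits_num_classes': True}
```

If any of these checks fail on your full annotation file (for example, ids starting at 1 while the config assumes 0-based labels), that would be a plausible cause of a loss that never drops and evals stuck at 0.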