Replies: 1 comment
@georgedahl so you're in the loop
Reading through the playbook, I had some general thoughts I wanted to share - I'm interested to hear which of these you considered but decided not to use in the end, and why! I work on AutoML/RL, so that's where a lot of my questions are coming from. We're usually more about automation and efficiency, but I just want to say that I'm not judging what you're doing - if it works, it works, after all.
Measuring Hyperparameter Impact and Importance
This comes up a lot, and it's a really important point - understanding what matters for final performance is fundamental to finding a well-performing configuration. You don't really say whether you measure importance or impact in any way beyond incumbent improvement, though. Have you looked into hyperparameter importance analysis tools (fANOVA etc.)? Or AutoML analysis tools in general? As far as I know, you can use those even with random search (at least DeepCAVE and XAutoML should work post hoc without additional runs, I think?), and especially in an iterative process like the one you suggest, quantifying things like hyperparameter importance more explicitly could be helpful.
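Just to sketch what I mean - this isn't fANOVA proper (no marginal variance decomposition), only the surrogate-importance idea behind it, using scikit-learn's impurity importances on made-up trial data; the hyperparameter names and numbers are purely illustrative:

```python
# Not fANOVA, just the underlying surrogate idea: fit a random forest on
# (configuration -> validation metric) from random-search trials you already
# have and look at which hyperparameters the surrogate relies on.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

results = [
    # (learning_rate, warmup_steps, weight_decay, val_error) -- hypothetical trials
    (3e-4, 1000, 0.01, 0.231),
    (1e-3,  500, 0.00, 0.274),
    (1e-4, 2000, 0.10, 0.252),
    # ... the rest of your random-search trials
]

X = np.array([[np.log10(lr), warmup, wd] for lr, warmup, wd, _ in results])
y = np.array([err for *_, err in results])

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
for name, importance in zip(
    ["log10(learning_rate)", "warmup_steps", "weight_decay"],
    forest.feature_importances_,
):
    print(f"{name:>22}: {importance:.3f}")
```

Tools like DeepCAVE wrap this kind of analysis (and more principled variants) so you don't have to roll it yourself, but the point is that it reuses the trials you already paid for.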
Exploration vs Exploitation in the Nuisance Hyperparameters
I personally would probably not do as much human exploration as you do (given that most AutoML tools handle AutoML-sized search spaces fairly well), but I also wouldn't say you're wrong to do it. What seems strange to me, however, is that you appear to use the same strategy for exploring the scientific hyperparameters and exploiting the nuisance ones.
Random searching across the scientific hyperparameter(s) makes sense to me, since you likely want even coverage of the whole space - in that case it doesn't matter much that random search is not sample-efficient at finding good configurations, because that's not what you're looking for anyway. It's a different story with the nuisance hyperparameters, though. I don't know how many of these you usually include; if it's only two or three, I can imagine the additional budget you'd need with random search staying somewhat manageable, but if there are more dimensions than that, I can actually imagine randomly sampling the scientific dimensions and tuning the nuisance dimensions with something more elaborate being the better cost/performance tradeoff.
Did you ever test whether it makes a difference for you? Or do you usually just have a low-dimensional nuisance search space?
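To sketch what I mean by that split (Optuna's default TPE sampler is only a stand-in for whatever model-based optimizer you prefer, and `train_and_eval` is a hypothetical training function with a dummy score so the sketch runs):

```python
# Even coverage of the scientific axis, model-based exploitation of the nuisance
# dimensions, tuned separately for every scientific setting.
import math
import optuna

def train_and_eval(depth, learning_rate, weight_decay, dropout):
    # Dummy score so the sketch runs end to end; replace with a real training run.
    return -abs(math.log10(learning_rate) + 3) - 10 * weight_decay - abs(dropout - 0.1)

scientific_values = [2, 4, 8, 16]  # e.g. model depth, the thing actually being studied
best_per_value = {}

for depth in scientific_values:
    def objective(trial, depth=depth):
        # Nuisance hyperparameters, re-tuned for each scientific setting.
        lr = trial.suggest_float("learning_rate", 1e-5, 1e-1, log=True)
        wd = trial.suggest_float("weight_decay", 1e-6, 1e-1, log=True)
        dropout = trial.suggest_float("dropout", 0.0, 0.5)
        return train_and_eval(depth, lr, wd, dropout)

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=30)
    best_per_value[depth] = study.best_value  # compare scientific settings at their tuned nuisance values

print(best_per_value)
```

The comparison between scientific settings then happens at each setting's (approximately) best nuisance values, which is what you want anyway, just with fewer trials spent on obviously bad nuisance regions.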
Partial Runs
Something you don't mention at all (well, maybe implicitly when selecting the tuning budget) is using partial runs for configuration evaluation. A multi-fidelity method like HyperBand would be really efficient at covering the search space and eliminating uninteresting regions fairly quickly. You probably want to verify that final performance actually correlates reasonably well with anytime performance at the smallest budget, but I'd say this should work well enough for most ML hyperparameters, save maybe architecture size. Using multiple fidelities, even if just three or four, can easily increase your search-space coverage while keeping the overall budget small. Did you actively decide against using them, or did they just not come up in your process?
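For illustration, here's the single successive-halving bracket that HyperBand builds on; `sample_config` and `train_for` are hypothetical stand-ins (the scoring is a dummy so the sketch runs), and the fidelity is training steps:

```python
# One successive-halving bracket -- the building block HyperBand runs at several
# aggressiveness levels. Start many configs at a small step budget, repeatedly
# keep the top 1/eta and give the survivors eta times more steps.
import math
import random

def sample_config():
    return {"learning_rate": 10 ** random.uniform(-5, -1),
            "weight_decay": 10 ** random.uniform(-6, -1)}

def train_for(config, steps):
    # Dummy anytime score so the sketch runs; replace with a real (partial) training run.
    return -abs(math.log10(config["learning_rate"]) + 3) + 0.1 * math.log(steps) + random.gauss(0, 0.05)

def successive_halving(n_configs=27, min_steps=1_000, eta=3):
    configs = [sample_config() for _ in range(n_configs)]
    steps = min_steps
    while len(configs) > 1:
        scored = sorted(((train_for(c, steps), c) for c in configs),
                        key=lambda sc: sc[0], reverse=True)   # higher score is better
        configs = [c for _, c in scored[: max(1, len(configs) // eta)]]
        steps *= eta                                          # survivors get a bigger budget
    return configs[0]

print(successive_halving())
```

With 27 configs, a minimum of 1k steps, and eta=3, that's 27 configs evaluated for roughly the cost of a handful of full-length runs, which is where the extra coverage comes from.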
Validating Hyperparameters
I know it's standard practice to report tuned runs in research papers, and in production it makes no sense to deploy anything else, but this is maybe a word of caution for researchers in particular: it's definitely possible to overtune on seeds, and in that case the configuration you painstakingly found might perform very badly once those seeds change. I mostly do RL, where this is probably much worse, but I've seen configurations go from performing pretty well across 5 tuning seeds to literally 10x worse on 5 test seeds.
This can always happen, but if you want to report results across seeds somewhere and/or want other people to get good results when re-running your method, you probably want to at least periodically look at multiple runs and seeds and make sure the performance actually transfers. As I said, this is a research problem, not a production one, and a specific one at that, but if you invest a lot of compute into tuning your method, doing some validation runs is a good idea in my opinion.
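A minimal version of the check I have in mind, with a hypothetical `train_and_eval(config, seed)` standing in for a real training run:

```python
# Re-run the incumbent configuration on held-out seeds and compare against the
# seeds used during tuning before trusting (or reporting) the number.
import statistics

def validate_across_seeds(config, train_and_eval,
                          tuning_seeds=(0, 1, 2, 3, 4),
                          test_seeds=(100, 101, 102, 103, 104)):
    tune_scores = [train_and_eval(config, s) for s in tuning_seeds]
    test_scores = [train_and_eval(config, s) for s in test_seeds]
    print(f"tuning seeds: {statistics.mean(tune_scores):.3f} ± {statistics.stdev(tune_scores):.3f}")
    print(f"test seeds:   {statistics.mean(test_scores):.3f} ± {statistics.stdev(test_scores):.3f}")
    # A large gap here suggests the configuration was overtuned to the tuning seeds.
    return test_scores
```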
As I mentioned already, I personally would probably not explore as much as you do in the first place, mostly because your process sounds out of scope for my usual compute budget. I think you might be overestimating how much work is necessary to reduce the search space for modern optimizers, though, especially in multi-fidelity settings. Maybe I'm wrong about this - I don't think I've ever tried running e.g. SMAC or DEHB with a run budget of 100 - but I've even had good results with standard HyperBand on an equivalent of 6 runs. I'd expect you end up learning more about the algorithm with a more involved iterative process, though, and if you plan on continuing to work on the model, that's probably worth what you lose in efficiency.
Now I'm really curious what the difference between iterated random search + exploitation phase and a high budget AutoML tool run with post-hoc analysis would be - if you ever end up trying that, please let me know! 😄