Snooping on HF is the best because sometimes you just discover that someone (in this case, Earth Species Project) is about to drop terabytes of sick data (high-quality animal sounds)...
Just dropped two bigger physics datasets (both on photonics)!
NUMBA 1: SIB-CL
Datasets for Surrogate- and Invariance-Boosted Contrastive Learning (SIB-CL), covering two scientific problems:
- PhC2D: 2D photonic crystal density-of-states (DOS) and bandstructure data.
- TISE: 3D time-independent Schrödinger equation eigenvalue and eigenvector solutions.
NUMBA 2: 2D Photonic Topology
Symmetry-driven analysis of 2D photonic crystals: 10k random unit cells across 11 symmetries, 2 polarizations, 5 contrasts. Includes time-reversal breaking cases for 4 symmetries at high contrast.
Always surprised that so few people actually read the FineTasks blog, on ✨how to select training evals with the highest signal✨
If you're serious about training models without wasting compute on shitty runs, you absolutely should read it!!
A high-signal eval actually tells you precisely, during training, how well & what your model is learning, allowing you to discard the bad runs/bad samplings/...!
The blog covers prompt choice, metrics, and datasets in depth, across languages/capabilities, and my fave section is "which properties should evals have"👌 (so you know how to select the best evals for your use case)