LeMaterial

Team

non-profit

https://www.lematerial.org

Activity Feed

AI & ML interests

AI4Science

Recent Activity

cgeorgiaw updated a Space 12 days ago

LeMaterial/LeMat-GenBench

cgeorgiaw published a Space 12 days ago

LeMaterial/LeMat-GenBench


cgeorgiaw

updated a Space 12 days ago

Lemat Bench

😻

Browse and submit material generation model results

cgeorgiaw

published a Space 12 days ago

Lemat Bench

😻

Browse and submit material generation model results

cgeorgiaw

posted an update 18 days ago

Post

1153

🚀🚀🚀Huge biotech data drop today🚀🚀🚀

The largest drug-target dataset ever created was just released on Hugging Face—and it's still growing...

EvE Bio is further updating the dataset every 8 weeks. Drug development dream.

Read the blog: https://huggingface.co/blog/hugging-science/eve-bio-mapping-the-pharmone-drug-interaction
Play with the data: eve-bio/drug-target-activity

thomwolf

authored a paper about 2 months ago

Robot Learning: A Tutorial

Paper • 2510.12403 • Published Oct 14 • 114

lvwerra

authored a paper about 2 months ago

BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution

Paper • 2510.08697 • Published Oct 9 • 35

Ramlaoui

updated a dataset about 2 months ago

LeMaterial/LeMat-Bulk-MLIP-Hull

Viewer • Updated Oct 8 • 1.05M • 1.32k

cgeorgiaw

posted an update 3 months ago

Post

5948

🚀🚀🚀 The largest ever dataset of co-folded 3D protein-ligand structures just dropped on HF!!

Meet SAIR (Structurally Augmented IC₅₀ Repository): 5M+ AI-generated complexes with experimentally measured drug potency data from SandboxAQ. 🚀🚀🚀

Check it out and explore here: SandboxAQ/SAIR

3 replies

cgeorgiaw

posted an update 4 months ago

Post

631

Just dropped the most influential materials science data of the year so far! Check it out :)))

cgeorgiaw/WyFormer-Symmetric-Crystals

1 reply

IAMJB

authored a paper 5 months ago

SMMILE: An Expert-Driven Benchmark for Multimodal Medical In-Context Learning

Paper • 2506.21355 • Published Jun 26 • 10

thomwolf

authored a paper 5 months ago

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Paper • 2506.20920 • Published Jun 26 • 75

lvwerra

authored a paper 5 months ago

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Paper • 2506.20920 • Published Jun 26 • 75

cgeorgiaw

posted an update 6 months ago

Post

2785

Huge new bio datasets just dropped!!!

Check them out @

ginkgo-datapoints
Read the blog for more info: https://huggingface.co/blog/cgeorgiaw/gdp

1 reply

cgeorgiaw

posted an update 6 months ago

Post

1608

Snooping on HF is the best because sometimes you just discover that someone (in this case, Earth Species Project) is about to drop terabytes of sick (high quality animal sounds) data...

EarthSpeciesProject/NatureLM-audio-training

thomwolf

authored a paper 6 months ago

SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

Paper • 2506.01844 • Published Jun 2 • 143

cgeorgiaw

posted an update 6 months ago

Post

520

Just dropped two bigger physics datasets (both on photonics)!

NUMBA 1: SIB-CL
This is a collection of Surrogate- and Invariance-Boosted Contrastive Learning (SIB-CL) datasets for two scientific problems:
- PhC2D: 2D photonic crystal density-of-states (DOS) and bandstructure data.
- TISE: 3D time-independent Schrödinger equation eigenvalue and eigenvector solutions.

NUMBA 2: 2D Photonic Topology
Symmetry-driven analysis of 2D photonic crystals: 10k random unit cells across 11 symmetries, 2 polarizations, 5 contrasts. Includes time-reversal breaking cases for 4 symmetries at high contrast.

Check them out: cgeorgiaw/sib-cl & cgeorgiaw/2d-photonic-topology

cgeorgiaw

authored 4 papers 7 months ago

Toward Robust Real-World Audio Deepfake Detection: Closing the Explainability Gap

Paper • 2410.07436 • Published Oct 9, 2024 • 1

Fact-Checking with Contextual Narratives: Leveraging Retrieval-Augmented LLMs for Social Media Analysis

Paper • 2504.10166 • Published Apr 14 • 2

PSyDUCK: Training-Free Steganography for Latent Diffusion

Paper • 2501.19172 • Published Jan 31 • 1

LLM-Consensus: Multi-Agent Debate for Visual Misinformation Detection

Paper • 2410.20140 • Published Oct 26, 2024 • 1

clefourrier

posted an update 7 months ago

Post

1998

Always surprised that so few people actually read the FineTasks blog, on
✨how to select training evals with the highest signal✨

If you're serious about training models without wasting compute on shitty runs, you absolutely should read it!!

A high-signal eval actually tells you precisely, during training, how well & what your model is learning, allowing you to discard the bad runs/bad samplings/...!

The blog covers prompt choice, metrics, and datasets in depth, across languages/capabilities, and my fave section is "which properties should evals have"👌
(so you know how to select the best evals for your use case)

Blog: HuggingFaceFW/blogpost-fine-tasks

2 replies


Team members 24
