datalyes (org)

non-profit

datalyes
posted an update 3 months ago
# PatenTEB: A Comprehensive Benchmark for Patent Text Embeddings 🎯

We're very excited to announce the (partial) release of **PatenTEB**, the first comprehensive benchmark designed specifically for evaluating text embedding models on patent tasks!

## 🚀 What's Released

### 📦 15 Benchmark Datasets (NEW to MTEB)
All tasks are **completely new** and not previously available in MTEB or other benchmarks:

- **3 Classification tasks**: Patent citation timing, NLI directionality, IPC3 technology classification
- **2 Clustering tasks**: IPC-based and inventor-based patent grouping
- **8 Retrieval tasks**: 3 symmetric (IN/MIXED/OUT domain) + 5 asymmetric (fragment-to-document matching)
- **2 Paraphrase tasks**: Problem and solution semantic similarity detection

🔗 **All datasets**: [huggingface.co/datalyes](https://huggingface.co/datalyes)
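For readers new to embedding benchmarks, here is a minimal sketch of how a retrieval task can score a model: embed the query and the candidate documents, rank the documents by cosine similarity, and check how many relevant ones land in the top k. The vectors, the document ids, and the choice of recall@k are illustrative assumptions. The post does not state which metrics PatenTEB reports, and a real run would use embeddings produced by a model, not hand-written numbers.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_documents(query_vec, doc_vecs):
    """Return document ids sorted by decreasing similarity to the query."""
    scores = {doc_id: cosine(query_vec, vec) for doc_id, vec in doc_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents found in the top-k results."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

# Toy 3-dimensional "embeddings" (hypothetical values, for illustration only).
docs = {
    "patent_a": [0.9, 0.1, 0.0],
    "patent_b": [0.0, 1.0, 0.0],
    "patent_c": [0.7, 0.3, 0.1],
}
query = [1.0, 0.0, 0.0]
relevant = {"patent_a", "patent_c"}

ranking = rank_documents(query, docs)
print(ranking)
print(recall_at_k(ranking, relevant, k=1), recall_at_k(ranking, relevant, k=2))
```

In the asymmetric setting described above, the query would be a short fragment and the candidates full documents; the ranking and scoring steps stay the same.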

### 🤖 12 Trained Models
The **patembed model family** (67M-344M parameters):
- 6 core models (large, base, base_small, small, mini, nano)
- 3 long-context variants (1024, 2048, 4096 tokens)
- 3 ablation models (no prompts, retrieval-only, no classification)

🔗 **All models**: [huggingface.co/datalyes](https://huggingface.co/datalyes)

## 📖 Resources

- **Paper**: [arXiv:2510.22264](https://arxiv.org/abs/2510.22264)
- **Datasets**: [huggingface.co/datalyes](https://huggingface.co/datalyes) (15 tasks)
- **Models**: [huggingface.co/datalyes](https://huggingface.co/datalyes)
- **GitHub**: [github.com/iliass-y/patenteb](https://github.com/iliass-y/patenteb)
- **License**: CC BY-NC-SA 4.0 (non-commercial research use)

## 🙏 Acknowledgments

Big thanks to:
- **Lens.org** for providing access to raw patent data at a reasonable cost for us little labs
- **MTEB community** for the excellent benchmark framework and the inspiration
- **Sentence Transformers** team for the powerful embedding library

#patent #nlp #embeddings #benchmark #retrieval #classification #mteb #sentence-transformers