
ga_gold_dataset

This dataset is the GA (Gradient Ascent) Gold Dataset, also known as WMDP-Bio-Remove-Dataset-Augmented-Flattened.

Dataset Description

The dataset contains 97,800 records of augmented WMDP-Bio content. It is meant for training language models with safety mechanisms via gradient ascent/descent techniques.

Structure

Each record contains:

type: The format type of the text (one of: original, lecture, exam, article)
text: The content text
__index_level_0__: Index from the original dataset
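
As a minimal sketch of the record layout above, the following checks that a record carries the three listed fields and one of the four type values. The helper name `validate_record` and the sample record are hypothetical and exist only for illustration; they are not part of the dataset or of any library.

```python
# Field names and type values are taken from the Structure section.
ALLOWED_TYPES = {"original", "lecture", "exam", "article"}
REQUIRED_FIELDS = {"type", "text", "__index_level_0__"}


def validate_record(record: dict) -> bool:
    """Return True if the record matches the layout described in this card."""
    # All three fields must be present as keys.
    if not REQUIRED_FIELDS.issubset(record):
        return False
    # The type must be one of the four formats and the text must be a string.
    return record["type"] in ALLOWED_TYPES and isinstance(record["text"], str)


# Hypothetical sample record with placeholder content.
sample = {"type": "lecture", "text": "placeholder text", "__index_level_0__": 0}
print(validate_record(sample))
```

A check like this can be run over a loaded split before training to catch rows with an unexpected type value or a missing field.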

Source

Created from the Hugging Face dataset Unlearning/WMDP-Bio-Remove-Dataset-Augmented, which contains 24,453 original documents on biological and medical topics. Each document was augmented into 4 different formats:

original: The original research paper or article text
lecture: Content reformatted as educational lecture material
exam: Content transformed into multiple choice exam questions
article: Content rewritten as informative articles
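
The "Flattened" in the dataset name suggests that each source document, together with its augmented formats, was expanded into one row per format. A minimal sketch of that reshaping follows. It assumes a hypothetical source layout in which each document is a mapping from format name to text. The actual column names of the source dataset are not stated here, and `flatten` is an illustrative helper, not the script used to build this dataset.

```python
FORMATS = ("original", "lecture", "exam", "article")


def flatten(documents):
    """Expand each document into one row per available format.

    `documents` is assumed (hypothetically) to be a list of dicts that map
    a format name to its text.
    """
    rows = []
    for index, doc in enumerate(documents):
        for fmt in FORMATS:
            if fmt in doc:  # skip formats that are missing for this document
                rows.append(
                    {"type": fmt, "text": doc[fmt], "__index_level_0__": index}
                )
    return rows


# Two hypothetical documents with placeholder text.
docs = [
    {"original": "o0", "lecture": "l0", "exam": "e0", "article": "a0"},
    {"original": "o1", "lecture": "l1", "exam": "e1", "article": "a1"},
]
rows = flatten(docs)
print(len(rows))  # 8: two documents times four formats
```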

Usage

This dataset is intended for research on safety filtering and unlearning techniques in language models, particularly for preventing the generation of potentially dangerous biological and medical information.
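
To illustrate what gradient ascent means in an unlearning setting, here is a toy sketch on a one-parameter logistic model. It is an assumption-laden illustration, not the training procedure used with this dataset: the model, the learning rate, and the function names `loss`, `grad`, and `gradient_ascent` are all hypothetical. The idea is that ordinary training steps against the gradient to lower the loss, while gradient ascent steps along it to raise the loss on examples the model should forget.

```python
import math


def loss(w: float, x: float) -> float:
    """Negative log-likelihood of a 'forget' example under p = sigmoid(w * x)."""
    return math.log1p(math.exp(-w * x))


def grad(w: float, x: float) -> float:
    """Derivative of the loss with respect to w."""
    p = 1.0 / (1.0 + math.exp(-w * x))
    return -(1.0 - p) * x


def gradient_ascent(w: float, x: float, lr: float = 0.1, steps: int = 20) -> float:
    """Raise the loss on the forget example by stepping along the gradient."""
    for _ in range(steps):
        w += lr * grad(w, x)  # '+' is ascent; gradient descent would use '-'
    return w


w_before = 1.0
x_forget = 2.0  # toy stand-in for a feature of a forget-set example
w_after = gradient_ascent(w_before, x_forget)
print(loss(w_before, x_forget) < loss(w_after, x_forget))
```

In practice, unlearning methods usually pair an ascent term on the forget data with a regular descent term on data the model should retain, so that general capability is not destroyed along with the targeted knowledge.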

Statistics

Total records: 97,800
Average text length: ~9,224 characters
Token count: ~220 million tokens (using standard tokenization)
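
The figures above imply roughly four characters per token. A quick sanity check, using only the numbers stated in this section (the tokenizer is not specified, so the ratio is approximate):

```python
# Values copied from the Statistics section.
total_records = 97_800
avg_chars = 9_224
approx_tokens = 220_000_000

total_chars = total_records * avg_chars
chars_per_token = total_chars / approx_tokens

print(f"{total_chars:,} characters, about {chars_per_token:.1f} characters per token")
# prints: 902,107,200 characters, about 4.1 characters per token
```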
