ga_gold_dataset
This dataset is the GA (Gradient Ascent) Gold Dataset, also known as WMDP-Bio-Remove-Dataset-Augmented-Flattened.
Dataset Description
The dataset contains 97,800 records of augmented WMDP-Bio content. It is intended for training language models with safety mechanisms using gradient ascent/descent techniques.
Structure
Each record contains:
- type: The format type of the text (one of: original, lecture, exam, article)
- text: The content text
- __index_level_0__: Index from the original dataset
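The record layout above can be checked with a small validator. This is a minimal sketch that works on plain dictionaries; the function name `validate_record` and the sample records are hypothetical placeholders, not part of the dataset or of any library.

```python
from collections import Counter

# Field names and allowed type values as listed in the Structure section.
ALLOWED_TYPES = {"original", "lecture", "exam", "article"}
REQUIRED_FIELDS = {"type", "text", "__index_level_0__"}


def validate_record(record: dict) -> bool:
    """Return True if the record has the three fields and a known type."""
    return (
        REQUIRED_FIELDS.issubset(record)
        and record["type"] in ALLOWED_TYPES
        and isinstance(record["text"], str)
    )


# Hypothetical placeholder records, for illustration only.
sample_records = [
    {"type": "original", "text": "placeholder text", "__index_level_0__": 0},
    {"type": "lecture", "text": "placeholder text", "__index_level_0__": 0},
    {"type": "summary", "text": "placeholder text", "__index_level_0__": 1},  # unknown type
]

# Count valid records per format type.
type_counts = Counter(r["type"] for r in sample_records if validate_record(r))
print(type_counts)
```

The same check should apply to rows loaded with a dataset library, provided each row is exposed as a dictionary with these three keys.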
Source
Created from the Hugging Face dataset Unlearning/WMDP-Bio-Remove-Dataset-Augmented, which contains 24,453 original documents about biology and medical topics. Each document is represented in 4 formats (the original text plus 3 augmented rewrites):
- original: The original research paper or article text
- lecture: Content reformatted as educational lecture material
- exam: Content transformed into multiple choice exam questions
- article: Content rewritten as informative articles
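The flattening step implied by the four formats can be sketched as follows. This assumes, without confirmation from the source dataset, that each source document is a row with one column per format named `original`, `lecture`, `exam`, and `article`, and that `__index_level_0__` is the position of the source document. The function name `flatten` is hypothetical.

```python
# Format names as listed above; using them as source column names is an assumption.
FORMATS = ("original", "lecture", "exam", "article")


def flatten(documents):
    """Turn one row per document into one record per (document, format) pair."""
    records = []
    for index, doc in enumerate(documents):
        for fmt in FORMATS:
            records.append(
                {"type": fmt, "text": doc[fmt], "__index_level_0__": index}
            )
    return records


# Two hypothetical source rows with placeholder text.
source_rows = [
    {"original": "doc 0 original", "lecture": "doc 0 lecture",
     "exam": "doc 0 exam", "article": "doc 0 article"},
    {"original": "doc 1 original", "lecture": "doc 1 lecture",
     "exam": "doc 1 exam", "article": "doc 1 article"},
]

flat = flatten(source_rows)
print(len(flat))  # 2 documents x 4 formats = 8 records
```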
Usage
This dataset is intended for research on safety filtering and unlearning techniques in language models, particularly for preventing the generation of potentially dangerous biological and medical information.
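As a rough illustration of the gradient ascent idea in the dataset name, the toy sketch below raises the loss on a single "forget" example by stepping along the gradient instead of against it. It uses a one-parameter logistic model with made-up numbers; it is not the training procedure used with this dataset, and all names are hypothetical.

```python
import math


def loss(w, x, y):
    """Logistic loss for one example with label y in {-1, +1}."""
    return math.log(1.0 + math.exp(-y * w * x))


def grad(w, x, y):
    """Derivative of the logistic loss with respect to w."""
    return -y * x / (1.0 + math.exp(y * w * x))


def gradient_ascent(w, x, y, lr=0.1, steps=20):
    """Move w in the direction that increases the loss (ascent, not descent)."""
    for _ in range(steps):
        w += lr * grad(w, x, y)
    return w


# Toy "forget" example: the model initially fits it well.
x_forget, y_forget = 1.0, 1.0
w_before = 1.0
w_after = gradient_ascent(w_before, x_forget, y_forget)

loss_before = loss(w_before, x_forget, y_forget)
loss_after = loss(w_after, x_forget, y_forget)
print(loss_before, loss_after)  # the loss on the forget example goes up
```

Gradient descent would use `w -= lr * grad(...)` instead; flipping the sign is the only difference between the two updates in this sketch.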
Statistics
- Total records: 97,800
- Average text length: ~9,224 characters
- Token count: ~220 million tokens (using standard tokenization)
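The figures above can be cross-checked with simple arithmetic. The sketch below uses only numbers stated in this card; since the average length and token count are approximate, the derived values are approximate too.

```python
# Figures as stated in this card.
TOTAL_RECORDS = 97_800
AVG_CHARS_PER_RECORD = 9_224      # approximate
APPROX_TOKENS = 220_000_000       # approximate
SOURCE_DOCUMENTS = 24_453
FORMATS_PER_DOCUMENT = 4

# Approximate corpus size in characters and implied characters per token.
total_chars = TOTAL_RECORDS * AVG_CHARS_PER_RECORD
chars_per_token = total_chars / APPROX_TOKENS

# Record count implied by the source description.
implied_records = SOURCE_DOCUMENTS * FORMATS_PER_DOCUMENT
gap = implied_records - TOTAL_RECORDS

print(f"total characters: {total_chars:,}")          # 902,107,200
print(f"characters per token: {chars_per_token:.2f}")  # about 4.10
print(f"implied records: {implied_records:,}, gap: {gap}")  # 97,812, gap 12
```

Roughly 4.1 characters per token is in the range commonly seen for English text with subword tokenizers, so the token estimate is plausible. The stated total is 12 records lower than 24,453 × 4; the card does not say why.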