ga_gold_dataset

This dataset is the GA (Gradient Ascent) Gold Dataset, also known as WMDP-Bio-Remove-Dataset-Augmented-Flattened.

Dataset Description

The dataset contains 97,800 records of augmented WMDP-Bio content. It is used to train language models with safety mechanisms via gradient ascent and gradient descent techniques.

Structure

Each record contains:

  • type: The format type of the text (one of: original, lecture, exam, article)
  • text: The content text
  • __index_level_0__: Row index carried over from the original dataset
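
As a minimal sketch of how this record layout could be checked, the snippet below validates the three documented fields and counts records per format type. The helper names (`validate_record`, `count_by_type`) and the sample records are hypothetical placeholders written for illustration; they are not part of the dataset or of any library.

```python
from collections import Counter

# Format types and field names as listed in this card.
ALLOWED_TYPES = {"original", "lecture", "exam", "article"}
REQUIRED_FIELDS = ("type", "text", "__index_level_0__")


def validate_record(record):
    """Return True if the record has the three documented fields and a known type."""
    has_fields = all(field in record for field in REQUIRED_FIELDS)
    return (
        has_fields
        and record["type"] in ALLOWED_TYPES
        and isinstance(record["text"], str)
    )


def count_by_type(records):
    """Count valid records per format type, ignoring invalid ones."""
    return Counter(r["type"] for r in records if validate_record(r))


# Placeholder records for illustration only; not taken from the dataset.
sample_records = [
    {"type": "original", "text": "placeholder text A", "__index_level_0__": 0},
    {"type": "lecture", "text": "placeholder text B", "__index_level_0__": 0},
    {"type": "exam", "text": "placeholder text C", "__index_level_0__": 1},
    {"type": "summary", "text": "unknown type, rejected", "__index_level_0__": 2},
]

counts = count_by_type(sample_records)
print(dict(counts))
```

The same two helpers should work unchanged on rows loaded from the real dataset, provided each row is exposed as a dictionary with these three keys.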

Source

Created from the Hugging Face dataset Unlearning/WMDP-Bio-Remove-Dataset-Augmented, which contains 24,453 original documents on biology and medical topics. Each document appears in 4 formats, the original text plus three augmented rewrites:

  • original: The original research paper or article text
  • lecture: Content reformatted as educational lecture material
  • exam: Content transformed into multiple choice exam questions
  • article: Content rewritten as informative articles
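
The "Flattened" part of the dataset name suggests one record per (document, format) pair. The sketch below shows one way such a flattening could work. It assumes, purely for illustration, that each source row is a dictionary with one key per format; this card does not describe the source layout, so the row structure and the `flatten` helper are hypothetical.

```python
FORMATS = ("original", "lecture", "exam", "article")


def flatten(source_rows):
    """Turn one row per document into one record per (document, format) pair."""
    records = []
    for index, row in enumerate(source_rows):
        for fmt in FORMATS:
            text = row.get(fmt)
            if text:  # skip formats that are missing or empty
                records.append(
                    {"type": fmt, "text": text, "__index_level_0__": index}
                )
    return records


# Placeholder source rows (assumed layout, not real data).
source_rows = [
    {"original": "doc 0", "lecture": "doc 0 as lecture",
     "exam": "doc 0 as exam", "article": "doc 0 as article"},
    {"original": "doc 1", "lecture": "doc 1 as lecture",
     "exam": "", "article": "doc 1 as article"},
]

flat = flatten(source_rows)
print(len(flat))
```

In this sketch every record keeps the index of its source document, so the four formats of one document can be regrouped later.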

Usage

This dataset is intended for research on safety filtering and unlearning techniques in language models, particularly for preventing the generation of potentially dangerous biological and medical information.

Statistics

  • Total records: 97,800
  • Average text length: ~9,224 characters
  • Token count: ~220 million tokens (using standard tokenization)
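
As a quick sanity check, the figures above can be cross-checked with simple arithmetic. The snippet uses only numbers stated in this card. The approximate characters-per-token ratio is derived from them, not a reported statistic.

```python
# Figures as stated in this card.
TOTAL_RECORDS = 97_800
AVG_CHARS = 9_224           # approximate average text length
TOTAL_TOKENS = 220_000_000  # approximate token count
SOURCE_DOCS = 24_453
FORMATS_PER_DOC = 4

total_chars = TOTAL_RECORDS * AVG_CHARS
chars_per_token = total_chars / TOTAL_TOKENS
expected_records = SOURCE_DOCS * FORMATS_PER_DOC

print(total_chars)                       # 902107200
print(round(chars_per_token, 2))         # 4.1
print(expected_records - TOTAL_RECORDS)  # 12
```

Roughly 4.1 characters per token is plausible for English text. Four formats for each of the 24,453 source documents would give 97,812 records, 12 more than the stated total; the card does not explain the difference.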