AI Text Detector - HC3 Dataset

This is a DistilBERT model fine-tuned to classify text as human-written or AI-generated. It was trained on the HC3 dataset, hosted on Hugging Face.

Model Details

Base Model: distilbert-base-uncased
Task: Binary text classification (Human vs AI-generated)
Dataset: HC3 (Human ChatGPT Comparison Corpus)
Training Framework: PyTorch + Transformers

Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("VSAsteroid/ai-text-detector-hc3")
model = AutoModelForSequenceClassification.from_pretrained("VSAsteroid/ai-text-detector-hc3")

# Example prediction
text = "Your text here"
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=256)

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    
# Get prediction
predicted_class = torch.argmax(predictions, dim=-1).item()
confidence = torch.max(predictions).item()

label = "AI-Generated" if predicted_class == 1 else "Human-Written"
print(f"Prediction: {label} (Confidence: {confidence:.3f})")
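
The example above truncates inputs to 256 tokens, so anything past that point is never seen by the model. One possible workaround, which is not part of the original card, is to split a long token sequence into overlapping windows and score each window separately. The sketch below shows only the splitting step in plain Python. The function name `split_into_windows`, the default window of 254 content tokens (leaving room for the two special tokens DistilBERT adds), and the stride of 127 are all illustrative assumptions.

```python
def split_into_windows(token_ids, window=254, stride=127):
    """Split a list of token ids into overlapping windows.

    Hypothetical helper: `window` and `stride` are assumed values, chosen so
    that each window plus [CLS] and [SEP] fits the 256-token limit.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    # Short inputs fit in a single window and need no splitting.
    if len(token_ids) <= window:
        return [list(token_ids)]
    windows = []
    start = 0
    while True:
        windows.append(list(token_ids[start:start + window]))
        # Stop once the current window reaches the end of the sequence.
        if start + window >= len(token_ids):
            break
        start += stride
    return windows


# Stand-in for real token ids, e.g. from
# tokenizer(text, add_special_tokens=False)["input_ids"].
example_ids = list(range(600))
for i, chunk in enumerate(split_into_windows(example_ids)):
    print(i, chunk[0], chunk[-1], len(chunk))
```

How the per-window predictions should be combined (for example, averaging the AI-generated probability) is left open here; the model was not evaluated that way in this card.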

Labels

0: Human-Written
1: AI-Generated
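
As a minimal sketch of how the two labels relate to the model's raw outputs, the snippet below turns a pair of logits into a label and an AI-generated probability using only the standard library. The names `ID2LABEL` and `interpret_logits` and the `threshold` parameter are illustrative assumptions, not part of the model's API. With the default threshold of 0.5 the result matches the argmax rule used in the usage example.

```python
import math

# Label mapping as documented in this card.
ID2LABEL = {0: "Human-Written", 1: "AI-Generated"}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def interpret_logits(logits, threshold=0.5):
    """Return (label, p_ai) for a [human_logit, ai_logit] pair.

    `threshold` is a hypothetical knob: raising it makes the
    "AI-Generated" call more conservative.
    """
    p_ai = softmax(logits)[1]
    label = ID2LABEL[1] if p_ai > threshold else ID2LABEL[0]
    return label, p_ai


label, p_ai = interpret_logits([2.0, -1.0])
print(f"{label} (P(AI) = {p_ai:.3f})")
```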

Training Details

Epochs: 2-3
Batch Size: 8-16
Learning Rate: 2e-5
Max Sequence Length: 256
Optimizer: AdamW with linear scheduling
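
To illustrate what "linear scheduling" means for the learning rate listed above, here is a small sketch of a linear warmup and decay schedule in plain Python. The card does not state the number of warmup steps or the total number of training steps, so `warmup_steps` and `total_steps` are assumptions, and `linear_schedule_lr` is a hypothetical helper rather than the training code actually used.

```python
def linear_schedule_lr(step, total_steps, base_lr=2e-5, warmup_steps=0):
    """Learning rate at `step`: linear warmup, then linear decay to zero.

    base_lr defaults to the 2e-5 listed in this card; warmup_steps and
    total_steps are assumed values, not taken from the card.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup phase.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Print the schedule at a few points of a hypothetical 1000-step run.
for step in (0, 250, 500, 750, 1000):
    print(step, linear_schedule_lr(step, total_steps=1000))
```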

Performance

The model distinguishes human-written from AI-generated text well on content similar to the HC3 dataset. No quantitative metrics are reported in this card.

Limitations

The model is trained only on the HC3 dataset and may not generalize well to other domains or writing styles.
Performance may vary with the AI model that generated the text; the AI answers in HC3 come from ChatGPT.
Short texts may be harder to classify accurately.

Citation

If you use this model, please cite the HC3 dataset:

@misc{guo2023close,
    title={How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection},
    author={Biyang Guo and Xin Zhang and Ziyuan Wang and Minqi Jiang and Jinran Nie and Yuxuan Ding and Jianwei Yue and Yupeng Wu},
    year={2023},
    eprint={2301.07597},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Model Size: 67M params (Safetensors, tensor type F32)
