---
license: agpl-3.0
language:
- en
base_model:
- emilyalsentzer/Bio_ClinicalBERT
pipeline_tag: token-classification
library_name: transformers
tags:
- named-entity-recognition
- ner
- medical
- disease-extraction
- healthcare
- bert
- clinical-bert
- fine-tuned
- pytorch
- bio-tagging
datasets:
- custom
widget:
- text: "Patient has a history of hypertension and type 2 diabetes."
  example_title: "Medical History Example"

---

# BERT for Medical Named Entity Recognition (Disease Extraction)

## Model Description

This model is a fine-tuned version of [emilyalsentzer/Bio_ClinicalBERT](https://huggingface.co/emilyalsentzer/Bio_ClinicalBERT) for Named Entity Recognition (NER), trained to extract disease names from medical text. It uses the BIO tagging schema to mark where each disease mention begins and ends in a clinical narrative.

## Model Details

- **Base Model**: emilyalsentzer/Bio_ClinicalBERT
- **Task**: Token Classification (Named Entity Recognition)
- **Domain**: Medical/Healthcare
- **Target Entities**: Diseases
- **Tagging Schema**: BIO (Beginning-Inside-Outside)
- **Labels**: 
  - `O`: Outside (not a disease entity)
  - `B-DISEASE`: Beginning of a disease entity
  - `I-DISEASE`: Inside/continuation of a disease entity
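
To make the schema concrete, the sketch below tags the widget sentence by hand and groups the tags back into entity strings. The tags are written manually for illustration and are not model output, and `bio_to_entities` is a hypothetical helper name, not something shipped with this model.

```python
def bio_to_entities(words, tags):
    """Group words into entity strings according to their BIO tags."""
    entities, current = [], []
    for word, tag in zip(words, tags):
        if tag == "B-DISEASE":
            # A new entity starts; close any entity that is still open
            if current:
                entities.append(" ".join(current))
            current = [word]
        elif tag == "I-DISEASE" and current:
            # Continue the open entity
            current.append(word)
        else:
            # "O" (or a stray I- tag with no open entity) closes the entity
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities


# Hand-written tags for illustration only (not model predictions)
words = ["Patient", "has", "a", "history", "of", "hypertension",
         "and", "type", "2", "diabetes", "."]
tags = ["O", "O", "O", "O", "O", "B-DISEASE",
        "O", "B-DISEASE", "I-DISEASE", "I-DISEASE", "O"]

print(bio_to_entities(words, tags))
# ['hypertension', 'type 2 diabetes']
```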

## Training Details

- **Training Epochs**: 50
- **Batch Size**: 16
- **Learning Rate**: 2e-5
- **Optimizer**: AdamW
- **Scheduler**: Linear schedule with warmup
- **Max Sequence Length**: 128 tokens
- **Train/Validation Split**: 80/20
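
As a rough illustration of the scheduler entry, the sketch below computes the learning rate that a linear warmup followed by linear decay yields at a given step, which is the shape produced by `transformers.get_linear_schedule_with_warmup`. The peak rate, epoch count, and batch size come from the list above. The training-set size and the number of warmup steps are not stated in this card, so the values used here are placeholder assumptions.

```python
import math


def linear_warmup_lr(step, peak_lr, warmup_steps, total_steps):
    """Learning rate at `step`: linear warmup to peak_lr, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Values taken from this card
peak_lr, epochs, batch_size = 2e-5, 50, 16

# ASSUMPTIONS (not stated in this card): dataset size and warmup length
num_train_examples = 800
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * epochs
warmup_steps = total_steps // 10

for step in (0, warmup_steps // 2, warmup_steps, total_steps // 2, total_steps):
    print(step, linear_warmup_lr(step, peak_lr, warmup_steps, total_steps))
```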

## Performance Metrics

The model achieved the following performance on the validation set:
- **Accuracy**: [Will be filled with actual values]
- **Precision**: [Will be filled with actual values] 
- **Recall**: [Will be filled with actual values]
- **F1 Score**: [Will be filled with actual values]
- **AUC**: [Will be filled with actual values]
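
This card does not say whether these metrics are computed per token or per entity. For readers who want to evaluate the model on their own labelled data, the sketch below shows one possible token-level definition, treating `B-DISEASE` and `I-DISEASE` together as the positive class. The function name and the toy tags are illustrative assumptions, not the evaluation code used for this model.

```python
def token_level_scores(true_tags, pred_tags):
    """Precision, recall and F1 over tokens, with any non-"O" tag counted as positive."""
    tp = fp = fn = 0
    for true, pred in zip(true_tags, pred_tags):
        true_pos, pred_pos = true != "O", pred != "O"
        if true_pos and pred_pos:
            tp += 1
        elif pred_pos:
            fp += 1
        elif true_pos:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy tags for illustration only
true_tags = ["O", "B-DISEASE", "I-DISEASE", "O", "B-DISEASE"]
pred_tags = ["O", "B-DISEASE", "O", "O", "B-DISEASE"]
precision, recall, f1 = token_level_scores(true_tags, pred_tags)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```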

## Usage

### Quick Start

```python
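from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch
import nltk

# word_tokenize needs NLTK's Punkt data: newer NLTK releases look for
# "punkt_tab", older ones for "punkt"
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)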

# Load model and tokenizer
model_name = "keanteng/bert-sentiment-wqd7007"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Example text
text = "Patient has a history of hypertension and type 2 diabetes."

# Tokenize
tokens = nltk.word_tokenize(text)
# max_length matches the 128-token limit used during training
inputs = tokenizer(tokens, is_split_into_words=True, return_tensors="pt",
                   padding=True, truncation=True, max_length=128)

# Predict
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=2)

# Map predictions to labels
id2label = {0: 'O', 1: 'B-DISEASE', 2: 'I-DISEASE'}
predicted_labels = [id2label[pred.item()] for pred in predictions[0]]

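# Extract diseases, reading one label per word (taken from its first sub-token)
# so that a word split into several sub-tokens is not added more than once
diseases = []
current_disease = []
word_ids = inputs.word_ids()
previous_word_idx = None

for word_idx, label in zip(word_ids, predicted_labels):
    # Skip special tokens and the trailing sub-tokens of a word already handled
    if word_idx is None or word_idx == previous_word_idx:
        continue
    previous_word_idx = word_idx
    if label == 'B-DISEASE':
        # A new disease starts; close the one that was open, if any
        if current_disease:
            diseases.append(' '.join(current_disease))
        current_disease = [tokens[word_idx]]
    elif label == 'I-DISEASE' and current_disease:
        current_disease.append(tokens[word_idx])
    elif current_disease:
        diseases.append(' '.join(current_disease))
        current_disease = []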

if current_disease:
    diseases.append(' '.join(current_disease))

print(f"Extracted diseases: {diseases}")
```

### Using the Prediction Function

```python
import nltk
import torch

def predict_diseases(text, model, tokenizer):
    """Extract disease mentions from `text` and return them as a list of strings."""
    # Split the text into words
    tokens = nltk.word_tokenize(text)
    
    # Prepare BERT input
    inputs = tokenizer(tokens, is_split_into_words=True, return_tensors="pt", 
                      padding=True, truncation=True, max_length=128)
    
    # Get predictions
    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.argmax(outputs.logits, dim=2).squeeze(0).numpy()
    
    # Map predictions to labels
    id2tag = {0: 'O', 1: 'B-DISEASE', 2: 'I-DISEASE'}
    
    # Extract diseases
    diseases = []
    current_disease = []
    word_ids = inputs.word_ids()
    
    previous_word_idx = None
    for i, word_idx in enumerate(word_ids):
        # Skip special tokens and the trailing sub-tokens of a word already
        # handled, so a word split into sub-tokens is not added twice
        if word_idx is None or word_idx == previous_word_idx:
            continue
        previous_word_idx = word_idx
        prediction = id2tag[int(predictions[i])]
        if prediction == 'B-DISEASE':
            # A new disease starts; close the one that was open, if any
            if current_disease:
                diseases.append(' '.join(current_disease))
            current_disease = [tokens[word_idx]]
        elif prediction == 'I-DISEASE' and current_disease:
            current_disease.append(tokens[word_idx])
        elif current_disease:
            diseases.append(' '.join(current_disease))
            current_disease = []
    
    if current_disease:
        diseases.append(' '.join(current_disease))
    
    return diseases

# Example usage
text = "Patient diagnosed with hypertension, diabetes mellitus, and chronic kidney disease."
diseases = predict_diseases(text, model, tokenizer)
print(f"Extracted diseases: {diseases}")
```
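
Because inputs are truncated at `max_length=128` sub-tokens, diseases mentioned late in a long note would be missed. One possible workaround, sketched below, is to split the note into overlapping word windows and run the extractor on each window. `extract_from_long_text` and its `window` and `overlap` defaults are illustrative assumptions (one word can expand to several sub-tokens, so the window is kept well below 128). `extractor` is any callable that maps a string to a list of mentions, for example `lambda t: predict_diseases(t, model, tokenizer)`.

```python
def extract_from_long_text(text, extractor, window=60, overlap=10):
    """Run `extractor` over overlapping word windows and merge the results.

    Mentions are de-duplicated, so a disease named several times is
    returned once, in order of first appearance.
    """
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    words = text.split()
    step = window - overlap
    found = []
    for start in range(0, max(len(words), 1), step):
        chunk = " ".join(words[start:start + window])
        for mention in extractor(chunk):
            if mention not in found:
                found.append(mention)
        # Stop once the current window has reached the end of the text
        if start + window >= len(words):
            break
    return found
```

The overlap reduces, but does not eliminate, the chance that a multi-word disease name is cut in half at a window boundary.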

## Training Data

The model was trained on a custom dataset of medical patient records containing:
- Medical history narratives
- Manually extracted disease entities
- BIO-tagged training examples
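
The card does not describe how the manually extracted disease entities were converted into BIO tags. One plausible approach, shown here purely as an assumption, is to match each disease phrase against the tokenized narrative and tag the matching words. `tag_with_bio` is a hypothetical helper and the sentence is a made-up example, not a record from the training set.

```python
def tag_with_bio(words, diseases):
    """Assign BIO tags by matching tokenized disease phrases, case-insensitively.

    `words` is a list of tokens; `diseases` is a list of token lists.
    """
    tags = ["O"] * len(words)
    lowered = [w.lower() for w in words]
    # Match longer phrases first so they take priority over shorter overlaps
    for phrase in sorted(diseases, key=len, reverse=True):
        n = len(phrase)
        if n == 0:
            continue
        target = [w.lower() for w in phrase]
        for start in range(len(words) - n + 1):
            span_is_free = all(t == "O" for t in tags[start:start + n])
            if span_is_free and lowered[start:start + n] == target:
                tags[start] = "B-DISEASE"
                for k in range(start + 1, start + n):
                    tags[k] = "I-DISEASE"
    return tags


# Made-up example, not taken from the training data
words = ["History", "of", "hypertension", "and", "type", "2", "diabetes", "."]
diseases = [["hypertension"], ["type", "2", "diabetes"]]
print(list(zip(words, tag_with_bio(words, diseases))))
```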

## Limitations

- The model tags disease mentions only; it does not label other clinical entity types
- Performance may vary on medical texts from domains or institutions that differ from the training data
- It may miss rare or newly named diseases that were not seen during training
- It is limited to English-language medical text

## Ethical Considerations

- This model is intended for research and educational purposes
- Should not be used as a substitute for professional medical diagnosis
- Patient privacy and data protection must be ensured when using this model
- Results should be validated by medical professionals