Memory requirements

Hi, I am trying to fine-tune meta-llama/Llama-3.2-1B-Instruct. I loaded the model in 4-bit precision with the Transformers library and applied LoRA using the PEFT library and TRL. The issue comes when the training step starts: I keep running out of memory, and I don’t know why. These are my training arguments:

training_args = SFTConfig(
    output_dir='/content/results',
    num_train_epochs=5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=2,
    learning_rate=2e-4,
    bf16=True,
    logging_steps=50,
    eval_strategy='steps',
    eval_steps=500,
    save_strategy="steps",
    save_steps=500,
    warmup_steps=100,
    weight_decay=0.01,
    logging_dir="/content/logs",
    packing=True,
    report_to="none"
)

trainer = SFTTrainer(
    model=model,
    train_dataset=templated_dataset['train'],
    eval_dataset=templated_dataset['test'],
    args=training_args,
    tokenizer=tokenizer,
)

The sequence length is 2048, and there are 1,179,648 trainable (LoRA) parameters. I calculated that I would need around 3.57 GB, but I am running out of memory even with the 15 GB I have. I don’t know if there is something wrong with my training argument configuration. Can you help me, please? Thanks in advance.
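One likely gap in a per-parameter estimate like 3.57 GB is that it only counts weights and optimizer state, while the large transient tensors (layer activations at sequence length 2048, and the fp16/fp32 logits over Llama 3’s ~128k vocabulary) are ignored. A rough back-of-envelope sketch, where every figure is an assumption (approximate Llama-3.2-1B sizes, a crude activation multiplier) rather than a measurement:

```python
# Back-of-envelope VRAM estimate for QLoRA fine-tuning of a ~1B model.
# All constants below are rough assumptions, not measured values.

def estimate_vram_gb(
    n_params=1.24e9,          # Llama-3.2-1B total parameters (approx.)
    n_trainable=1_179_648,    # LoRA parameters from the post above
    seq_len=2048,
    batch_size=1,
    hidden_size=2048,         # Llama-3.2-1B hidden size (approx.)
    n_layers=16,              # Llama-3.2-1B layer count (approx.)
    vocab_size=128_256,       # Llama 3 vocabulary size
):
    base = n_params * 0.5 / 1e9      # 4-bit base weights (~0.5 bytes/param)
    lora = n_trainable * 2 / 1e9     # bf16 LoRA weights
    grads = n_trainable * 2 / 1e9    # bf16 gradients (trainable params only)
    adam = n_trainable * 8 / 1e9     # fp32 Adam moments (2 x 4 bytes)
    # Activations: crude per-layer multiplier; the real number depends on the
    # attention implementation and whether gradient checkpointing is enabled.
    acts = batch_size * seq_len * hidden_size * n_layers * 2 * 12 / 1e9
    # Logits kept in bf16 plus an fp32 copy for the cross-entropy loss.
    logits = batch_size * seq_len * vocab_size * (2 + 4) / 1e9
    return base + lora + grads + adams_total(adam) + acts + logits

def adams_total(adam):
    return adam  # kept separate only for readability

print(f"{estimate_vram_gb():.1f} GB")  # ~3.8 GB under these assumptions
```

Even this rough total already exceeds the LoRA-only estimate, and real peaks are higher still because of allocator fragmentation, attention workspaces, and eval batches. With `packing=True` every batch is a full 2048-token sequence, so setting `gradient_checkpointing=True` in `SFTConfig` (a standard TRL/Transformers option) is the usual first step to cut the activation term.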

Running LLaMA 3.2 locally requires adequate computational resources.

Below are the recommended specifications:
Hardware:

GPU: NVIDIA GPU with CUDA support (16GB VRAM or higher recommended).
RAM: At least 32GB (64GB for larger models).
Storage: Minimum 50GB of free disk space for the model and dependencies.

Software:

Operating System: Linux (preferred), macOS, or Windows.
Python: Version 3.8 or higher.
CUDA Toolkit: Required for GPU acceleration (11.6 or newer).
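The version requirements above can be checked programmatically before attempting a run. A minimal stdlib sketch (the thresholds simply mirror the list above; the function names are my own, not from any library):

```python
import sys

def python_version_ok(version_info=sys.version_info, minimum=(3, 8)):
    """True if the running interpreter meets the minimum Python version."""
    return tuple(version_info[:2]) >= minimum

def cuda_version_ok(version: str, minimum=(11, 6)):
    """True if a CUDA version string like '12.1' meets the minimum."""
    major, minor = (int(p) for p in version.split(".")[:2])
    return (major, minor) >= minimum

print(python_version_ok(), cuda_version_ok("12.1"))
# If PyTorch is installed, torch.cuda.is_available() and
# torch.cuda.get_device_properties(0).total_memory report GPU support and VRAM.
```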

@Pixies did you ever figure out why the RAM usage was so high?

I’m training on a CPU in 32-bit precision, and memory consumption averages between 5 GB and 8 GB of RAM. I did notice that usage shot up to 16 GB at the start before settling lower.
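That startup spike (model loading plus optimizer initialization) can be pinpointed by logging peak resident memory at a few checkpoints. A minimal stdlib sketch, assuming a Unix system (`resource` is not available on Windows):

```python
import resource
import sys

def peak_rss_gb():
    """Peak resident set size of this process, in GB.

    On Linux ru_maxrss is reported in KiB; on macOS it is in bytes.
    """
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    scale = 1 if sys.platform == "darwin" else 1024  # bytes vs KiB
    return peak * scale / 1e9

# Call this before/after model loading and periodically during training
# (e.g. from a TrainerCallback) to see exactly where the spike occurs.
print(f"peak RSS so far: {peak_rss_gb():.2f} GB")
```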