T5 Biological Sequence + English Mixed Model

A T5-small model was trained on a mixture of DNA, protein sequences, and English text data, primarily for downstream fine-tuning tasks such as sequence function prediction.

Tokenizer Training

T5 uses the Unigram tokenizer. The input data consists of DNA sequences, protein sequences, and English text.
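
To make the Unigram idea concrete, here is a minimal sketch of how a unigram language model picks a segmentation: each candidate piece has a log-probability, and the tokenizer keeps the split with the highest total score (Viterbi search). The vocabulary, its probabilities, and the function name `unigram_segment` are invented for illustration. This is not the logic of t5_token_gene_eng.py, and a real tokenizer also learns the vocabulary and handles whitespace markers.

```python
import math

# Toy vocabulary with made-up probabilities (assumption, for illustration only).
TOY_LOG_PROBS = {
    "A": math.log(0.10), "C": math.log(0.10),
    "G": math.log(0.10), "T": math.log(0.10),
    "ATG": math.log(0.20), "GC": math.log(0.15),
    "TAA": math.log(0.15), "CG": math.log(0.10),
}


def unigram_segment(text, log_probs, unk_log_prob=-20.0):
    """Return the highest-scoring segmentation of `text` under a unigram model."""
    n = len(text)
    max_len = max(len(piece) for piece in log_probs)
    # best[i] = (score of the best segmentation of text[:i], start of its last piece)
    best = [(-math.inf, -1)] * (n + 1)
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in log_probs:
                score = log_probs[piece]
            elif len(piece) == 1:
                score = unk_log_prob  # fallback for an unknown single character
            else:
                continue
            candidate = best[start][0] + score
            if candidate > best[end][0]:
                best[end] = (candidate, start)
    # Walk the back-pointers from the end of the string to recover the pieces.
    pieces = []
    end = n
    while end > 0:
        start = best[end][1]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]


print(unigram_segment("ATGGCTAA", TOY_LOG_PROBS))
```

With these toy scores, the multi-character pieces win over single nucleotides because one likely piece costs less than several individual characters.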

The tokenizer training script is t5_token_gene_eng.py.

Tokenizer training requires more than 128 GB of memory and can be time-consuming.
To skip this step, use the pre-trained tokenizer directly:

trained_t5_gene_eng_tokenizer

Pre-training the T5 Model

A T5-small model was trained from scratch on a mixed dataset of DNA sequences, protein sequences, and English text. The steps are as follows:

  1. Obtain the T5 configuration by running get_t5_config.ipynb.
  2. Prepare the mixed training data by running combine_data.ipynb.
  3. Launch the pre-training script ./run_pt.sh.
    Training takes approximately 5 hours using 8x NVIDIA 4090 GPUs.
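
As a rough illustration of step 2, the sketch below merges several line-based corpora into one shuffled list, optionally capping how many lines each source contributes so that no modality dominates. The function name `mix_corpora`, the `max_per_source` cap, and the sample lines are assumptions made for this example; combine_data.ipynb may balance or format the data differently.

```python
import random


def mix_corpora(corpora, max_per_source=None, seed=42):
    """Merge several line-based corpora into one shuffled list of lines."""
    rng = random.Random(seed)  # fixed seed keeps the mixture reproducible
    mixed = []
    for name, lines in corpora.items():
        # Strip surrounding whitespace and drop blank lines.
        cleaned = [line.strip() for line in lines if line.strip()]
        # Optionally subsample so every source contributes at most the cap.
        if max_per_source is not None and len(cleaned) > max_per_source:
            cleaned = rng.sample(cleaned, max_per_source)
        mixed.extend(cleaned)
    rng.shuffle(mixed)
    return mixed


# Toy inputs (hypothetical); real corpora would be read from files.
corpora = {
    "dna": ["ATGGCTAA", "GGCATTACG", ""],
    "protein": ["MKTAYIAKQR", "MVLSPADKTN"],
    "english": ["The gene encodes a kinase.", "Proteins fold into structures."],
}
mixed = mix_corpora(corpora, max_per_source=2)
for line in mixed:
    print(line)
```

Shuffling across sources matters because a model that sees all DNA first and all English last tends to drift toward the most recent modality.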

Fine-tuning the T5 Model

  1. Protein Function Prediction: t5_gene_eng_abstract_ft_protein_fun.ipynb
  2. Amazon Review Summarization (for reference): t5_gene_eng_abstract_ft_review.ipynb
  3. CNN Article Summarization (for reference): t5_gene_eng_abstract_ft_cnn.ipynb
  4. DNA-Protein Coding Prediction (experimental, poor performance, for reference only): t5_gene_eng_abstract_ft_dna_protein.ipynb
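
T5 treats every task as text-to-text, so each fine-tuning task above reduces to pairs of input and target strings. The sketch below shows one possible way to build such pairs. The task prefix, the field names, the length limit, and the sample record are hypothetical and are not taken from the notebooks.

```python
def to_text2text(sequence, target, task_prefix="predict protein function: ",
                 max_input_chars=512):
    """Format one example as an input/target string pair for a text-to-text model."""
    # Truncate long sequences so the prefix plus sequence stays bounded.
    trimmed = sequence.strip()[:max_input_chars]
    return {
        "input_text": task_prefix + trimmed,
        "target_text": target.strip(),
    }


# Toy record (made up for illustration).
example = to_text2text("MKTAYIAKQR", "ATP-binding kinase activity ")
print(example["input_text"])
print(example["target_text"])
```

The same helper could serve the summarization tasks by changing the prefix and passing an article as the input and its summary as the target.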

Additional Experiments

Directory: multi_trans_lab
This directory contains experimental tasks that explore cross-modal and cross-lingual transfer, such as English-to-Spanish summarization and English-to-DNA sequence generation. They are research-oriented and provided for academic reference only.

  • NC_000001.11_chapter_1.fna.p1: Partial human genome sequence data.
  • get_dna_summary.ipynb: Generates summaries for genomic DNA sequences (can use different fine-tuned models; see fine-tuning section above).
  • get_gene_summary.ipynb: Generates summaries for coding DNA regions (model can be swapped).
  • dna_abstract_search_bench.ipynb: Indirectly evaluates summary quality via search-based methods. Results are currently poor; ongoing research.
  • abstract_trans_en_es.ipynb: Baseline test for transferring English summarization capability to Spanish.
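
One way to read "search-based" evaluation: if a generated summary is good, it should retrieve its own reference description from a pool of candidates. The sketch below scores this with word-overlap (Jaccard) similarity and top-1 retrieval accuracy. The metric, the function names, and the sample texts are assumptions for illustration; dna_abstract_search_bench.ipynb may use a different retrieval method, such as embeddings.

```python
def jaccard(a, b):
    """Word-level Jaccard similarity between two strings."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def top1_retrieval_accuracy(summaries, references):
    """Fraction of summaries whose most similar reference is their own pair."""
    if len(summaries) != len(references):
        raise ValueError("summaries and references must be paired one-to-one")
    if not summaries:
        return 0.0
    hits = 0
    for i, summary in enumerate(summaries):
        scores = [jaccard(summary, ref) for ref in references]
        # Index of the highest-scoring reference (first one wins on ties).
        best = max(range(len(references)), key=scores.__getitem__)
        hits += int(best == i)
    return hits / len(summaries)


# Toy data (made up): the third summary shares no words with its reference.
references = [
    "kinase that phosphorylates serine residues",
    "membrane transporter for glucose uptake",
    "dna binding transcription factor",
]
summaries = [
    "serine kinase",
    "glucose transporter in membrane",
    "ribosomal protein",
]
accuracy = top1_retrieval_accuracy(summaries, references)
print(f"top-1 retrieval accuracy: {accuracy:.2f}")
```

A low score under a metric like this signals that the summaries do not carry enough information to identify their source, which is consistent with the poor results reported above.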