CS5260_demo / README.md
martyn-wong's picture
update readme with spaces config
9a156be

A newer version of the Gradio SDK is available: 6.1.0

Upgrade
metadata
title: Academic Paper Summarizer & Concept-Map Explorer
emoji: πŸ“‹
colorFrom: purple
colorTo: indigo
sdk: gradio
sdk_version: 5.28.0
app_file: app.py
pinned: false

Academic Paper Summarizer & Concept-Map Explorer

A lightweight Gradio dashboard to help AI/ML researchers quickly find, summarize, and visualize the conceptual landscape of academic papers.

  • Search ArXiv by keyword
  • Per-paper summary (2 – 3 sentences) via spaCy extractive summarization
  • Cross-paper summary (5 – 6 sentences) driven by Qwen/Qwen2.5-Coder-32B-Instruct
  • Global concept map (all papers) and πŸ“ per-paper concept maps via KeyBERT + Sentence-Transformer embeddings + PyVis
  • Export to PDF for saving summaries in a neatly formatted document

Repository Layout

  • util.py: contains core functions to summarize, extract and build concept map
  • app.py: contains Gradio UI functions
  • config
    • .env: holds API_KEY to access DeepInfra OpenAI
  • requirements.txt
  • README.md

Installation

  1. Clone the repo and enter its folder

       git clone https://github.com/lim-mingen/cs5260.git
       cd cs5260
    
  2. Create a virtual environment and install

       pip install -r requirements.txt
    
  3. Add your DeepInfra API key in config/.env

  4. Run the app

       python app.py
    
  5. Open the URL printed in your terminal to start exploring

Features & Methodology

1. Data Collection

  • Source: arXiv via the arxiv Python library
  • (Disabled) Semantic Scholar & CrossRef wrappers included, but commented out since many entries lack abstracts

2. Per-Paper Summarization

  • Model: spaCy en_core_web_sm
  • How:
    1. Tokenize & filter stop-words/punctuation
    2. Score sentences by term-frequency
    3. Select top 2–3 sentences

3. Keyphrase Extraction & Concept Maps

  • Keyphrases: extracted with KeyBERT over Specter embeddings
  • Deduplication:
    • Substring-based filtering
    • Agglomerative clustering on normalized embeddings (cosine threshold = 0.1)
  • Graphs (PyVis):
    • Nodes: top 10 keyphrases per paper
    • Edges: connect if cosine similarity β‰₯ 0.85
    • Layout: force-directed repulsion (nodeDistance, springLength, damping)

4. Cross-Paper Summary

  • Model: Qwen/Qwen2.5-Coder-32B-Instruct via DeepInfra’s OpenAI-compatible endpoint
  • Prompt: "These are the abstracts of {len(abstracts)} papers. Produce a cross-paper summary that summarizes all the key points across each paper. Keep it to 5-6 sentences."

5. Graphs (PyVis):

  • Nodes: top 10 keyphrases per paper
  • Edges: connect if cosine similarity β‰₯ 0.85
  • Layout: force-directed repulsion (nodeDistance, springLength, damping)

5. Progress Bar

  • Purpose: Provides real-time updates on the status of long-running tasks (e.g., generating summaries and concept maps).

  • How:

    • Implemented using Gradio's yield functionality in the process_all function.
    • Displays messages like "Generating cross-paper summary..." and "Processing paper X of Y..." in a gr.Textbox.

    6. Export to PDF

  • Purpose: Allows users to save the cross-paper summary in a neatly formatted PDF document.

  • How:

    • Extracts <p> blocks from the HTML output using BeautifulSoup.
    • Formats the summary with headers and spacing using the FPDF library.
    • Saves the PDF as summary.pdf and provides a download link in the Gradio interface.

πŸ”¬ Experiments & Outcomes

  1. Semantic Scholar & CrossRef
    β€’ Added fetch_semantic_scholar and fetch_crossref with semanticscholar/habanero clients
    β€’ Outcome: most results lacked abstracts or relevance β†’ disabled

  2. Full-Text PDF Extraction
    β€’ Downloaded PDFs + PyPDF2 β†’ NER/summarization on full text
    β€’ Outcome: noisy extractions from captions, tables, references β†’ reverted to abstracts only

  3. Domain-Specific NER
    β€’ Tried SciSpaCy (biomedical) and SciERC transformers
    β€’ Outcome: labels too niche or model download failures β†’ reverted to spaCy general NER

  4. Keyphrase Approaches
    β€’ RAKE, TextRank, KeyBERT with Specter embeddings
    β€’ Outcome: heavy verb/digit filtering & clustering needed β†’ settled on current pipeline for balance

  5. Cross-Paper Summarizers
    β€’ Pegasus-XSum (single sentence) β†’ too terse
    β€’ BART-CNN hierarchical summarization β†’ 3–5 sentences but lacked coherence
    β€’ Solution: LLM prompt via Qwen/Qwen2.5-Coder-32B-Instruct produced the best narrative

  6. Concept-Map Connectivity
    β€’ Sentence co-occurrence β†’ isolated per-paper clusters
    β€’ Embedding-similarity edges β†’ hair-ball or slow performance
    β€’ Final: per-paper maps by embedding similarity (threshold 0.85) + one global map by co-occurrence