---
title: Academic Paper Summarizer & Concept-Map Explorer
emoji: π
colorFrom: purple
colorTo: indigo
sdk: gradio
sdk_version: 5.28.0
app_file: app.py
pinned: false
---
# Academic Paper Summarizer & Concept-Map Explorer

A lightweight Gradio dashboard that helps AI/ML researchers quickly find, summarize, and visualize the conceptual landscape of academic papers.
- Search arXiv by keyword
- Per-paper summaries (2–3 sentences) via spaCy extractive summarization
- Cross-paper summary (5–6 sentences) driven by Qwen/Qwen2.5-Coder-32B-Instruct
- Global concept map (all papers) and per-paper concept maps via KeyBERT + Sentence-Transformer embeddings + PyVis
- Export to PDF for saving summaries in a neatly formatted document
## Repository Layout

- `util.py`: core functions for summarization, keyphrase extraction, and concept-map construction
- `app.py`: Gradio UI functions
- `config/`
  - `.env`: holds the `API_KEY` used to access DeepInfra's OpenAI-compatible endpoint
- `requirements.txt`
- `README.md`
## Installation

1. Clone the repo and enter its folder:

   ```bash
   git clone https://github.com/lim-mingen/cs5260.git
   cd cs5260
   ```

2. Create a virtual environment and install the dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Add your DeepInfra API key to `config/.env`.

4. Run the app:

   ```bash
   python app.py
   ```

5. Open the URL printed in your terminal to start exploring.
## Features & Methodology

### 1. Data Collection

- Source: arXiv via the `arxiv` Python library
- (Disabled) Semantic Scholar & CrossRef wrappers are included but commented out, since many of their entries lack abstracts
### 2. Per-Paper Summarization

- Model: spaCy `en_core_web_sm`
- How:
  - Tokenize & filter stop-words/punctuation
  - Score sentences by term frequency
  - Select the top 2–3 sentences
### 3. Keyphrase Extraction & Concept Maps

- Keyphrases: extracted with KeyBERT over Specter embeddings
- Deduplication:
  - Substring-based filtering
  - Agglomerative clustering on normalized embeddings (cosine threshold = 0.1)
- Graphs (PyVis):
  - Nodes: top 10 keyphrases per paper
  - Edges: connect if cosine similarity ≥ 0.85
  - Layout: force-directed repulsion (`nodeDistance`, `springLength`, `damping`)
### 4. Cross-Paper Summary

- Model: Qwen/Qwen2.5-Coder-32B-Instruct via DeepInfra's OpenAI-compatible endpoint
- Prompt: "These are the abstracts of {len(abstracts)} papers. Produce a cross-paper summary that summarizes all the key points across each paper. Keep it to 5-6 sentences."
### 5. Progress Bar

- Purpose: provides real-time updates on the status of long-running tasks (e.g., generating summaries and concept maps)
- How:
  - Implemented using Gradio's `yield` functionality in the `process_all` function
  - Displays messages like "Generating cross-paper summary..." and "Processing paper X of Y..." in a `gr.Textbox`
### 6. Export to PDF

- Purpose: allows users to save the cross-paper summary as a neatly formatted PDF document
- How:
  - Extracts `<p>` blocks from the HTML output using `BeautifulSoup`
  - Formats the summary with headers and spacing using the `FPDF` library
  - Saves the PDF as `summary.pdf` and provides a download link in the Gradio interface
## Experiments & Outcomes

### Semantic Scholar & CrossRef
- Added `fetch_semantic_scholar` and `fetch_crossref` with the `semanticscholar` / `habanero` clients
- Outcome: most results lacked abstracts or relevance → disabled

### Full-Text PDF Extraction
- Downloaded PDFs + `PyPDF2` → NER/summarization on full text
- Outcome: noisy extractions from captions, tables, references → reverted to abstracts only

### Domain-Specific NER
- Tried SciSpaCy (biomedical) and SciERC transformers
- Outcome: labels too niche or model download failures → reverted to spaCy general NER

### Keyphrase Approaches
- RAKE, TextRank, KeyBERT with Specter embeddings
- Outcome: heavy verb/digit filtering & clustering needed → settled on the current pipeline for balance

### Cross-Paper Summarizers
- Pegasus-XSum (single sentence) → too terse
- BART-CNN hierarchical summarization → 3–5 sentences but lacked coherence
- Solution: LLM prompt via Qwen/Qwen2.5-Coder-32B-Instruct produced the best narrative

### Concept-Map Connectivity
- Sentence co-occurrence → isolated per-paper clusters
- Embedding-similarity edges → hair-ball or slow performance
- Final: per-paper maps by embedding similarity (threshold 0.85) + one global map by co-occurrence