Spaces:
Sleeping
Sleeping
| title: Academic Paper Summarizer & Concept-Map Explorer | |
| emoji: π | |
| colorFrom: purple | |
| colorTo: indigo | |
| sdk: gradio | |
| sdk_version: "5.28.0" | |
| app_file: app.py | |
| pinned: false | |
| # Academic Paper Summarizer & Concept-Map Explorer | |
| A lightweight Gradio dashboard to help AI/ML researchers quickly find, summarize, and visualize the conceptual landscape of academic papers. | |
| - **Search** ArXiv by keyword | |
| - **Per-paper summary** (2 β 3 sentences) via spaCy extractive summarization | |
| - **Cross-paper summary** (5 β 6 sentences) driven by **Qwen/Qwen2.5-Coder-32B-Instruct** | |
| - **Global concept map** (all papers) and π **per-paper concept maps** via KeyBERT + Sentence-Transformer embeddings + PyVis | |
| - **Export to PDF** for saving summaries in a neatly formatted document | |
| --- | |
| ## Repository Layout | |
| - util.py: contains core functions to summarize, extract and build concept map | |
| - app.py: contains Gradio UI functions | |
| - config | |
| - .env: holds API_KEY to access DeepInfra OpenAI | |
| - requirements.txt | |
| - README.md | |
| --- | |
| ## Installation | |
| 1. **Clone** the repo and enter its folder | |
| ```bash | |
| git clone https://github.com/lim-mingen/cs5260.git | |
| cd cs5260 | |
| 2. Create a virtual environment and install | |
| ```bash | |
| pip install -r requirements.txt | |
| 3. Add your DeepInfra API key in config/.env | |
| 4. Run the app | |
| ```bash | |
| python app.py | |
| 5. Open the URL printed in your terminal to start exploring | |
| ## Features & Methodology | |
| ### 1. Data Collection | |
| - **Source**: arXiv via the `arxiv` Python library | |
| - _(Disabled)_ Semantic Scholar & CrossRef wrappers included, but commented out since many entries lack abstracts | |
| ### 2. Per-Paper Summarization | |
| - **Model**: spaCy `en_core_web_sm` | |
| - **How**: | |
| 1. Tokenize & filter stop-words/punctuation | |
| 2. Score sentences by term-frequency | |
| 3. Select top 2β3 sentences | |
| ### 3. Keyphrase Extraction & Concept Maps | |
| - **Keyphrases**: extracted with KeyBERT over **Specter** embeddings | |
| - **Deduplication**: | |
| - Substring-based filtering | |
| - Agglomerative clustering on normalized embeddings (cosine threshold = 0.1) | |
| - **Graphs (PyVis)**: | |
| - **Nodes**: top 10 keyphrases per paper | |
| - **Edges**: connect if cosine similarity β₯ 0.85 | |
| - **Layout**: force-directed repulsion (`nodeDistance`, `springLength`, `damping`) | |
| ### 4. Cross-Paper Summary | |
| - **Model**: **Qwen/Qwen2.5-Coder-32B-Instruct** via DeepInfraβs OpenAI-compatible endpoint | |
| - **Prompt**: "These are the abstracts of {len(abstracts)} papers. Produce a cross-paper summary that summarizes all the key points across each paper. Keep it to 5-6 sentences." | |
| ### 5. Graphs (PyVis): | |
| - **Nodes**: top 10 keyphrases per paper | |
| - **Edges**: connect if cosine similarity β₯ 0.85 | |
| - **Layout**: force-directed repulsion (nodeDistance, springLength, damping) | |
| ### 5. Progress Bar | |
| - **Purpose**: Provides real-time updates on the status of long-running tasks (e.g., generating summaries and concept maps). | |
| - **How**: | |
| - Implemented using Gradio's `yield` functionality in the `process_all` function. | |
| - Displays messages like "Generating cross-paper summary..." and "Processing paper X of Y..." in a `gr.Textbox`. | |
| ### 6. Export to PDF | |
| - **Purpose**: Allows users to save the cross-paper summary in a neatly formatted PDF document. | |
| - **How**: | |
| - Extracts `<p>` blocks from the HTML output using `BeautifulSoup`. | |
| - Formats the summary with headers and spacing using the `FPDF` library. | |
| - Saves the PDF as `summary.pdf` and provides a download link in the Gradio interface. | |
| ## π¬ Experiments & Outcomes | |
| 1. **Semantic Scholar & CrossRef** | |
| β’ Added `fetch_semantic_scholar` and `fetch_crossref` with `semanticscholar`/`habanero` clients | |
| β’ **Outcome**: most results lacked abstracts or relevance β **disabled** | |
| 2. **Full-Text PDF Extraction** | |
| β’ Downloaded PDFs + `PyPDF2` β NER/summarization on full text | |
| β’ **Outcome**: noisy extractions from captions, tables, references β reverted to abstracts only | |
| 3. **Domain-Specific NER** | |
| β’ Tried SciSpaCy (biomedical) and SciERC transformers | |
| β’ **Outcome**: labels too niche or model download failures β reverted to spaCy general NER | |
| 4. **Keyphrase Approaches** | |
| β’ RAKE, TextRank, KeyBERT with Specter embeddings | |
| β’ **Outcome**: heavy verb/digit filtering & clustering needed β settled on current pipeline for balance | |
| 5. **Cross-Paper Summarizers** | |
| β’ Pegasus-XSum (single sentence) β too terse | |
| β’ BART-CNN hierarchical summarization β 3β5 sentences but lacked coherence | |
| β’ **Solution**: LLM prompt via Qwen/Qwen2.5-Coder-32B-Instruct produced the best narrative | |
| 6. **Concept-Map Connectivity** | |
| β’ Sentence co-occurrence β isolated per-paper clusters | |
| β’ Embedding-similarity edges β hair-ball or slow performance | |
| β’ **Final**: per-paper maps by embedding similarity (threshold 0.85) + one global map by co-occurrence | |