Spaces:
Sleeping
Sleeping
Commit
Β·
73e2343
1
Parent(s):
179ff9c
updated main.py to app.py
Browse files- README.md +98 -74
- main.py β app.py +0 -0
README.md
CHANGED
|
@@ -1,91 +1,115 @@
|
|
| 1 |
# Academic Paper Summarizer & Concept-Map Explorer
|
| 2 |
|
| 3 |
-
A Gradio
|
| 4 |
|
| 5 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 6 |
|
| 7 |
-
|
| 8 |
-
- Search for academic papers using keywords via the ArXiv API.
|
| 9 |
-
- _(Optional)_ Semantic Scholar and CrossRef integrations are included but currently disabled.
|
| 10 |
-
|
| 11 |
-
2. **Per-Paper Summarization**
|
| 12 |
-
- Extractive summarization using spaCy to generate concise summaries (2β3 sentences per paper).
|
| 13 |
-
|
| 14 |
-
3. **Concept Maps**
|
| 15 |
-
- Generate concept maps for individual papers and a global concept map for all papers.
|
| 16 |
-
- Keyphrases are extracted using KeyBERT and visualized with PyVis.
|
| 17 |
-
|
| 18 |
-
4. **Cross-Paper Summary**
|
| 19 |
-
- Summarize key points across all selected papers using a large language model (LLM) via DeepInfra's OpenAI-compatible endpoint.
|
| 20 |
|
| 21 |
-
|
| 22 |
-
- Save the cross-paper summary and concept maps in a neatly formatted PDF.
|
| 23 |
|
| 24 |
-
|
| 25 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 26 |
|
| 27 |
---
|
| 28 |
|
| 29 |
## Installation
|
| 30 |
|
| 31 |
-
1. **Clone the
|
| 32 |
```bash
|
| 33 |
-
|
| 34 |
-
|
| 35 |
-
|
| 36 |
-
|
| 37 |
-
2. **Install dependencies**
|
| 38 |
```bash
|
| 39 |
-
|
| 40 |
-
```
|
| 41 |
|
| 42 |
-
3.
|
| 43 |
-
- Add your DeepInfra API key to `config/.env`.
|
| 44 |
|
| 45 |
-
4.
|
| 46 |
```bash
|
| 47 |
-
|
| 48 |
-
|
| 49 |
-
|
| 50 |
-
|
| 51 |
-
|
| 52 |
-
|
| 53 |
-
|
| 54 |
-
|
| 55 |
-
|
| 56 |
-
|
| 57 |
-
###
|
| 58 |
-
-
|
| 59 |
-
-
|
| 60 |
-
|
| 61 |
-
|
| 62 |
-
|
| 63 |
-
|
| 64 |
-
### 3. Concept Maps
|
| 65 |
-
- Keyphrases
|
| 66 |
-
-
|
| 67 |
-
|
| 68 |
-
|
| 69 |
-
-
|
| 70 |
-
|
| 71 |
-
|
| 72 |
-
-
|
| 73 |
-
|
| 74 |
-
|
| 75 |
-
|
| 76 |
-
|
| 77 |
-
|
| 78 |
-
|
| 79 |
-
-
|
| 80 |
-
-
|
| 81 |
-
-
|
| 82 |
-
|
| 83 |
-
|
| 84 |
-
|
| 85 |
-
|
| 86 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 87 |
|
| 88 |
-
- Enable Semantic Scholar and CrossRef integrations.
|
| 89 |
-
- Add support for full-text PDF extraction and summarization.
|
| 90 |
-
- Improve concept map connectivity and layout algorithms.
|
| 91 |
-
- Explore additional summarization models for better coherence.
|
|
|
|
| 1 |
# Academic Paper Summarizer & Concept-Map Explorer
|
| 2 |
|
| 3 |
+
A lightweight Gradio dashboard to help AI/ML researchers quickly find, summarize, and visualize the conceptual landscape of academic papers.
|
| 4 |
|
| 5 |
+
- **Search** ArXiv by keyword
|
| 6 |
+
- **Per-paper summary** (2 β 3 sentences) via spaCy extractive summarization
|
| 7 |
+
- **Cross-paper summary** (5 β 6 sentences) driven by **Qwen/Qwen2.5-Coder-32B-Instruct**
|
| 8 |
+
- **Global concept map** (all papers) and π **per-paper concept maps** via KeyBERT + Sentence-Transformer embeddings + PyVis
|
| 9 |
+
- **Export to PDF** for saving summaries in a neatly formatted document
|
| 10 |
|
| 11 |
+
---
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 12 |
|
| 13 |
+
## Repository Layout
|
|
|
|
| 14 |
|
| 15 |
+
- util.py: contains core functions to summarize, extract and build concept map
|
| 16 |
+
- app.py: contains Gradio UI functions
|
| 17 |
+
- config
|
| 18 |
+
- .env: holds API_KEY to access DeepInfra OpenAI
|
| 19 |
+
- requirements.txt
|
| 20 |
+
- README.md
|
| 21 |
|
| 22 |
---
|
| 23 |
|
| 24 |
## Installation
|
| 25 |
|
| 26 |
+
1. **Clone** the repo and enter its folder
|
| 27 |
```bash
|
| 28 |
+
git clone https://github.com/lim-mingen/cs5260.git
|
| 29 |
+
cd cs5260
|
| 30 |
+
|
| 31 |
+
2. Create a virtual environment and install
|
|
|
|
| 32 |
```bash
|
| 33 |
+
pip install -r requirements.txt
|
|
|
|
| 34 |
|
| 35 |
+
3. Add your DeepInfra API key in config/.env
|
|
|
|
| 36 |
|
| 37 |
+
4. Run the app
|
| 38 |
```bash
|
| 39 |
+
python app.py
|
| 40 |
+
|
| 41 |
+
5. Open the URL printed in your terminal to start exploring
|
| 42 |
+
|
| 43 |
+
## Features & Methodology
|
| 44 |
+
|
| 45 |
+
### 1. Data Collection
|
| 46 |
+
- **Source**: arXiv via the `arxiv` Python library
|
| 47 |
+
- _(Disabled)_ Semantic Scholar & CrossRef wrappers included, but commented out since many entries lack abstracts
|
| 48 |
+
|
| 49 |
+
### 2. Per-Paper Summarization
|
| 50 |
+
- **Model**: spaCy `en_core_web_sm`
|
| 51 |
+
- **How**:
|
| 52 |
+
1. Tokenize & filter stop-words/punctuation
|
| 53 |
+
2. Score sentences by term-frequency
|
| 54 |
+
3. Select top 2β3 sentences
|
| 55 |
+
|
| 56 |
+
### 3. Keyphrase Extraction & Concept Maps
|
| 57 |
+
- **Keyphrases**: extracted with KeyBERT over **Specter** embeddings
|
| 58 |
+
- **Deduplication**:
|
| 59 |
+
- Substring-based filtering
|
| 60 |
+
- Agglomerative clustering on normalized embeddings (cosine threshold = 0.1)
|
| 61 |
+
- **Graphs (PyVis)**:
|
| 62 |
+
- **Nodes**: top 10 keyphrases per paper
|
| 63 |
+
- **Edges**: connect if cosine similarity β₯ 0.85
|
| 64 |
+
- **Layout**: force-directed repulsion (`nodeDistance`, `springLength`, `damping`)
|
| 65 |
+
|
| 66 |
+
### 4. Cross-Paper Summary
|
| 67 |
+
- **Model**: **Qwen/Qwen2.5-Coder-32B-Instruct** via DeepInfraβs OpenAI-compatible endpoint
|
| 68 |
+
- **Prompt**: "These are the abstracts of {len(abstracts)} papers. Produce a cross-paper summary that summarizes all the key points across each paper. Keep it to 5-6 sentences."
|
| 69 |
+
|
| 70 |
+
### 5. Graphs (PyVis):
|
| 71 |
+
- **Nodes**: top 10 keyphrases per paper
|
| 72 |
+
- **Edges**: connect if cosine similarity β₯ 0.85
|
| 73 |
+
- **Layout**: force-directed repulsion (nodeDistance, springLength, damping)
|
| 74 |
+
|
| 75 |
+
### 5. Progress Bar
|
| 76 |
+
- **Purpose**: Provides real-time updates on the status of long-running tasks (e.g., generating summaries and concept maps).
|
| 77 |
+
- **How**:
|
| 78 |
+
- Implemented using Gradio's `yield` functionality in the `process_all` function.
|
| 79 |
+
- Displays messages like "Generating cross-paper summary..." and "Processing paper X of Y..." in a `gr.Textbox`.
|
| 80 |
+
|
| 81 |
+
### 6. Export to PDF
|
| 82 |
+
- **Purpose**: Allows users to save the cross-paper summary in a neatly formatted PDF document.
|
| 83 |
+
- **How**:
|
| 84 |
+
- Extracts `<p>` blocks from the HTML output using `BeautifulSoup`.
|
| 85 |
+
- Formats the summary with headers and spacing using the `FPDF` library.
|
| 86 |
+
- Saves the PDF as `summary.pdf` and provides a download link in the Gradio interface.
|
| 87 |
+
|
| 88 |
+
## π¬ Experiments & Outcomes
|
| 89 |
+
|
| 90 |
+
1. **Semantic Scholar & CrossRef**
|
| 91 |
+
β’ Added `fetch_semantic_scholar` and `fetch_crossref` with `semanticscholar`/`habanero` clients
|
| 92 |
+
β’ **Outcome**: most results lacked abstracts or relevance β **disabled**
|
| 93 |
+
|
| 94 |
+
2. **Full-Text PDF Extraction**
|
| 95 |
+
β’ Downloaded PDFs + `PyPDF2` β NER/summarization on full text
|
| 96 |
+
β’ **Outcome**: noisy extractions from captions, tables, references β reverted to abstracts only
|
| 97 |
+
|
| 98 |
+
3. **Domain-Specific NER**
|
| 99 |
+
β’ Tried SciSpaCy (biomedical) and SciERC transformers
|
| 100 |
+
β’ **Outcome**: labels too niche or model download failures β reverted to spaCy general NER
|
| 101 |
+
|
| 102 |
+
4. **Keyphrase Approaches**
|
| 103 |
+
β’ RAKE, TextRank, KeyBERT with Specter embeddings
|
| 104 |
+
β’ **Outcome**: heavy verb/digit filtering & clustering needed β settled on current pipeline for balance
|
| 105 |
+
|
| 106 |
+
5. **Cross-Paper Summarizers**
|
| 107 |
+
β’ Pegasus-XSum (single sentence) β too terse
|
| 108 |
+
β’ BART-CNN hierarchical summarization β 3β5 sentences but lacked coherence
|
| 109 |
+
β’ **Solution**: LLM prompt via Qwen/Qwen2.5-Coder-32B-Instruct produced the best narrative
|
| 110 |
+
|
| 111 |
+
6. **Concept-Map Connectivity**
|
| 112 |
+
β’ Sentence co-occurrence β isolated per-paper clusters
|
| 113 |
+
β’ Embedding-similarity edges β hair-ball or slow performance
|
| 114 |
+
β’ **Final**: per-paper maps by embedding similarity (threshold 0.85) + one global map by co-occurrence
|
| 115 |
|
|
|
|
|
|
|
|
|
|
|
|
main.py β app.py
RENAMED
|
File without changes
|