File size: 4,955 Bytes
9a156be
 
 
 
 
 
 
 
 
 
 
 
ab47503
 
73e2343
ab47503
73e2343
 
 
 
 
ab47503
73e2343
ab47503
73e2343
179ff9c
73e2343
 
 
 
 
 
ab47503
de06717
 
ab47503
 
73e2343
ab47503
73e2343
 
 
 
ab47503
73e2343
ab47503
73e2343
ab47503
73e2343
ab47503
73e2343
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
ab47503
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
---
title: Academic Paper Summarizer & Concept-Map Explorer
emoji: πŸ“‹
colorFrom: purple
colorTo: indigo
sdk: gradio
sdk_version: "5.28.0"
app_file: app.py
pinned: false
---


# Academic Paper Summarizer & Concept-Map Explorer

A lightweight Gradio dashboard to help AI/ML researchers quickly find, summarize, and visualize the conceptual landscape of academic papers.  

- **Search** ArXiv by keyword  
- **Per-paper summary** (2 – 3 sentences) via spaCy extractive summarization  
- **Cross-paper summary** (5 – 6 sentences) driven by **Qwen/Qwen2.5-Coder-32B-Instruct**  
- **Global concept map** (all papers) and πŸ“ **per-paper concept maps** via KeyBERT + Sentence-Transformer embeddings + PyVis  
- **Export to PDF** for saving summaries in a neatly formatted document  

---

## Repository Layout

- util.py: contains core functions to summarize, extract and build concept map
- app.py: contains Gradio UI functions
- config
   - .env: holds API_KEY to access DeepInfra OpenAI
- requirements.txt
- README.md

---

## Installation

1. **Clone** the repo and enter its folder  
   ```bash
      git clone https://github.com/lim-mingen/cs5260.git
      cd cs5260
   
2. Create a virtual environment and install
   ```bash
      pip install -r requirements.txt

3. Add your DeepInfra API key in config/.env

4. Run the app
   ```bash
      python app.py

5. Open the URL printed in your terminal to start exploring

## Features & Methodology

### 1. Data Collection  
- **Source**: arXiv via the `arxiv` Python library  
- _(Disabled)_ Semantic Scholar & CrossRef wrappers included, but commented out since many entries lack abstracts  

### 2. Per-Paper Summarization  
- **Model**: spaCy `en_core_web_sm`  
- **How**:  
  1. Tokenize & filter stop-words/punctuation  
  2. Score sentences by term-frequency  
  3. Select top 2–3 sentences  

### 3. Keyphrase Extraction & Concept Maps  
- **Keyphrases**: extracted with KeyBERT over **Specter** embeddings  
- **Deduplication**:  
  - Substring-based filtering  
  - Agglomerative clustering on normalized embeddings (cosine threshold = 0.1)  
- **Graphs (PyVis)**:  
  - **Nodes**: top 10 keyphrases per paper  
  - **Edges**: connect if cosine similarity β‰₯ 0.85  
  - **Layout**: force-directed repulsion (`nodeDistance`, `springLength`, `damping`)

### 4. Cross-Paper Summary  
- **Model**: **Qwen/Qwen2.5-Coder-32B-Instruct** via DeepInfra’s OpenAI-compatible endpoint  
- **Prompt**: "These are the abstracts of {len(abstracts)} papers. Produce a cross-paper summary that summarizes all the key points across each paper. Keep it to 5-6 sentences."

### 5. Graphs (PyVis):
- **Nodes**: top 10 keyphrases per paper
- **Edges**: connect if cosine similarity β‰₯ 0.85
- **Layout**: force-directed repulsion (nodeDistance, springLength, damping)

### 5. Progress Bar  
- **Purpose**: Provides real-time updates on the status of long-running tasks (e.g., generating summaries and concept maps).  
- **How**:  
  - Implemented using Gradio's `yield` functionality in the `process_all` function.  
  - Displays messages like "Generating cross-paper summary..." and "Processing paper X of Y..." in a `gr.Textbox`.  

  ### 6. Export to PDF  
- **Purpose**: Allows users to save the cross-paper summary in a neatly formatted PDF document.  
- **How**:  
  - Extracts `<p>` blocks from the HTML output using `BeautifulSoup`.  
  - Formats the summary with headers and spacing using the `FPDF` library.  
  - Saves the PDF as `summary.pdf` and provides a download link in the Gradio interface.  

## πŸ”¬ Experiments & Outcomes

1. **Semantic Scholar & CrossRef**  
   β€’ Added `fetch_semantic_scholar` and `fetch_crossref` with `semanticscholar`/`habanero` clients  
   β€’ **Outcome**: most results lacked abstracts or relevance β†’ **disabled**

2. **Full-Text PDF Extraction**  
   β€’ Downloaded PDFs + `PyPDF2` β†’ NER/summarization on full text  
   β€’ **Outcome**: noisy extractions from captions, tables, references β†’ reverted to abstracts only

3. **Domain-Specific NER**  
   β€’ Tried SciSpaCy (biomedical) and SciERC transformers  
   β€’ **Outcome**: labels too niche or model download failures β†’ reverted to spaCy general NER

4. **Keyphrase Approaches**  
   β€’ RAKE, TextRank, KeyBERT with Specter embeddings  
   β€’ **Outcome**: heavy verb/digit filtering & clustering needed β†’ settled on current pipeline for balance

5. **Cross-Paper Summarizers**  
   β€’ Pegasus-XSum (single sentence) β†’ too terse  
   β€’ BART-CNN hierarchical summarization β†’ 3–5 sentences but lacked coherence  
   β€’ **Solution**: LLM prompt via Qwen/Qwen2.5-Coder-32B-Instruct produced the best narrative

6. **Concept-Map Connectivity**  
   β€’ Sentence co-occurrence β†’ isolated per-paper clusters  
   β€’ Embedding-similarity edges β†’ hair-ball or slow performance  
   β€’ **Final**: per-paper maps by embedding similarity (threshold 0.85) + one global map by co-occurrence