martyn-wong commited on
Commit
73e2343
Β·
1 Parent(s): 179ff9c

updated main.py to app.py

Browse files
Files changed (2) hide show
  1. README.md +98 -74
  2. main.py β†’ app.py +0 -0
README.md CHANGED
@@ -1,91 +1,115 @@
1
  # Academic Paper Summarizer & Concept-Map Explorer
2
 
3
- A Gradio-based dashboard designed to assist AI/ML researchers in efficiently searching, summarizing, and visualizing academic papers.
4
 
5
- ## Features
 
 
 
 
6
 
7
- 1. **Search Papers**
8
- - Search for academic papers using keywords via the ArXiv API.
9
- - _(Optional)_ Semantic Scholar and CrossRef integrations are included but currently disabled.
10
-
11
- 2. **Per-Paper Summarization**
12
- - Extractive summarization using spaCy to generate concise summaries (2–3 sentences per paper).
13
-
14
- 3. **Concept Maps**
15
- - Generate concept maps for individual papers and a global concept map for all papers.
16
- - Keyphrases are extracted using KeyBERT and visualized with PyVis.
17
-
18
- 4. **Cross-Paper Summary**
19
- - Summarize key points across all selected papers using a large language model (LLM) via DeepInfra's OpenAI-compatible endpoint.
20
 
21
- 5. **Export to PDF**
22
- - Save the cross-paper summary and concept maps in a neatly formatted PDF.
23
 
24
- 6. **Progress Updates**
25
- - Real-time progress updates for long-running tasks like generating summaries and concept maps.
 
 
 
 
26
 
27
  ---
28
 
29
  ## Installation
30
 
31
- 1. **Clone the repository**
32
  ```bash
33
- git clone https://github.com/lim-mingen/cs5260.git
34
- cd cs5260
35
- ```
36
-
37
- 2. **Install dependencies**
38
  ```bash
39
- pip install -r requirements.txt
40
- ```
41
 
42
- 3. **Set up API keys**
43
- - Add your DeepInfra API key to `config/.env`.
44
 
45
- 4. **Run the application**
46
  ```bash
47
- python main.py
48
- ```
49
-
50
- 5. **Access the app**
51
- - Open the URL printed in your terminal to start using the app.
52
-
53
- ---
54
-
55
- ## How It Works
56
-
57
- ### 1. Search Papers
58
- - Enter a keyword and the number of papers to fetch.
59
- - Results are retrieved from ArXiv and displayed in a table.
60
-
61
- ### 2. Summarization
62
- - Each paper's abstract is summarized using spaCy's extractive summarization.
63
-
64
- ### 3. Concept Maps
65
- - Keyphrases are extracted using KeyBERT and visualized as nodes.
66
- - Edges are created based on cosine similarity between embeddings.
67
-
68
- ### 4. Cross-Paper Summary
69
- - Abstracts from all papers are summarized into a cohesive narrative using an LLM.
70
-
71
- ### 5. Export to PDF
72
- - The summary and concept maps are exported to a PDF using the FPDF library.
73
-
74
- ---
75
-
76
- ## Repository Structure
77
-
78
- - `main.py`: Contains the Gradio app logic and UI components.
79
- - `util.py`: Core functions for fetching papers, summarization, and concept map generation.
80
- - `config/.env`: Stores API keys (e.g., DeepInfra OpenAI key).
81
- - `requirements.txt`: Lists all dependencies.
82
- - `README.md`: Documentation for the project.
83
-
84
- ---
85
-
86
- ## Future Improvements
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
87
 
88
- - Enable Semantic Scholar and CrossRef integrations.
89
- - Add support for full-text PDF extraction and summarization.
90
- - Improve concept map connectivity and layout algorithms.
91
- - Explore additional summarization models for better coherence.
 
1
  # Academic Paper Summarizer & Concept-Map Explorer
2
 
3
+ A lightweight Gradio dashboard to help AI/ML researchers quickly find, summarize, and visualize the conceptual landscape of academic papers.
4
 
5
+ - **Search** ArXiv by keyword
6
+ - **Per-paper summary** (2 – 3 sentences) via spaCy extractive summarization
7
+ - **Cross-paper summary** (5 – 6 sentences) driven by **Qwen/Qwen2.5-Coder-32B-Instruct**
8
+ - **Global concept map** (all papers) and πŸ“ **per-paper concept maps** via KeyBERT + Sentence-Transformer embeddings + PyVis
9
+ - **Export to PDF** for saving summaries in a neatly formatted document
10
 
11
+ ---
 
 
 
 
 
 
 
 
 
 
 
 
12
 
13
+ ## Repository Layout
 
14
 
15
+ - util.py: contains core functions to summarize, extract and build concept map
16
+ - app.py: contains Gradio UI functions
17
+ - config
18
+ - .env: holds API_KEY to access DeepInfra OpenAI
19
+ - requirements.txt
20
+ - README.md
21
 
22
  ---
23
 
24
  ## Installation
25
 
26
+ 1. **Clone** the repo and enter its folder
27
  ```bash
28
+ git clone https://github.com/lim-mingen/cs5260.git
29
+ cd cs5260
30
+
31
+ 2. Create a virtual environment and install
 
32
  ```bash
33
+ pip install -r requirements.txt
 
34
 
35
+ 3. Add your DeepInfra API key in config/.env
 
36
 
37
+ 4. Run the app
38
  ```bash
39
+ python app.py
40
+
41
+ 5. Open the URL printed in your terminal to start exploring
42
+
43
+ ## Features & Methodology
44
+
45
+ ### 1. Data Collection
46
+ - **Source**: arXiv via the `arxiv` Python library
47
+ - _(Disabled)_ Semantic Scholar & CrossRef wrappers included, but commented out since many entries lack abstracts
48
+
49
+ ### 2. Per-Paper Summarization
50
+ - **Model**: spaCy `en_core_web_sm`
51
+ - **How**:
52
+ 1. Tokenize & filter stop-words/punctuation
53
+ 2. Score sentences by term-frequency
54
+ 3. Select top 2–3 sentences
55
+
56
+ ### 3. Keyphrase Extraction & Concept Maps
57
+ - **Keyphrases**: extracted with KeyBERT over **Specter** embeddings
58
+ - **Deduplication**:
59
+ - Substring-based filtering
60
+ - Agglomerative clustering on normalized embeddings (cosine threshold = 0.1)
61
+ - **Graphs (PyVis)**:
62
+ - **Nodes**: top 10 keyphrases per paper
63
+ - **Edges**: connect if cosine similarity β‰₯ 0.85
64
+ - **Layout**: force-directed repulsion (`nodeDistance`, `springLength`, `damping`)
65
+
66
+ ### 4. Cross-Paper Summary
67
+ - **Model**: **Qwen/Qwen2.5-Coder-32B-Instruct** via DeepInfra’s OpenAI-compatible endpoint
68
+ - **Prompt**: "These are the abstracts of {len(abstracts)} papers. Produce a cross-paper summary that summarizes all the key points across each paper. Keep it to 5-6 sentences."
69
+
70
+ ### 5. Graphs (PyVis):
71
+ - **Nodes**: top 10 keyphrases per paper
72
+ - **Edges**: connect if cosine similarity β‰₯ 0.85
73
+ - **Layout**: force-directed repulsion (nodeDistance, springLength, damping)
74
+
75
+ ### 5. Progress Bar
76
+ - **Purpose**: Provides real-time updates on the status of long-running tasks (e.g., generating summaries and concept maps).
77
+ - **How**:
78
+ - Implemented using Gradio's `yield` functionality in the `process_all` function.
79
+ - Displays messages like "Generating cross-paper summary..." and "Processing paper X of Y..." in a `gr.Textbox`.
80
+
81
+ ### 6. Export to PDF
82
+ - **Purpose**: Allows users to save the cross-paper summary in a neatly formatted PDF document.
83
+ - **How**:
84
+ - Extracts `<p>` blocks from the HTML output using `BeautifulSoup`.
85
+ - Formats the summary with headers and spacing using the `FPDF` library.
86
+ - Saves the PDF as `summary.pdf` and provides a download link in the Gradio interface.
87
+
88
+ ## πŸ”¬ Experiments & Outcomes
89
+
90
+ 1. **Semantic Scholar & CrossRef**
91
+ β€’ Added `fetch_semantic_scholar` and `fetch_crossref` with `semanticscholar`/`habanero` clients
92
+ β€’ **Outcome**: most results lacked abstracts or relevance β†’ **disabled**
93
+
94
+ 2. **Full-Text PDF Extraction**
95
+ β€’ Downloaded PDFs + `PyPDF2` β†’ NER/summarization on full text
96
+ β€’ **Outcome**: noisy extractions from captions, tables, references β†’ reverted to abstracts only
97
+
98
+ 3. **Domain-Specific NER**
99
+ β€’ Tried SciSpaCy (biomedical) and SciERC transformers
100
+ β€’ **Outcome**: labels too niche or model download failures β†’ reverted to spaCy general NER
101
+
102
+ 4. **Keyphrase Approaches**
103
+ β€’ RAKE, TextRank, KeyBERT with Specter embeddings
104
+ β€’ **Outcome**: heavy verb/digit filtering & clustering needed β†’ settled on current pipeline for balance
105
+
106
+ 5. **Cross-Paper Summarizers**
107
+ β€’ Pegasus-XSum (single sentence) β†’ too terse
108
+ β€’ BART-CNN hierarchical summarization β†’ 3–5 sentences but lacked coherence
109
+ β€’ **Solution**: LLM prompt via Qwen/Qwen2.5-Coder-32B-Instruct produced the best narrative
110
+
111
+ 6. **Concept-Map Connectivity**
112
+ β€’ Sentence co-occurrence β†’ isolated per-paper clusters
113
+ β€’ Embedding-similarity edges β†’ hair-ball or slow performance
114
+ β€’ **Final**: per-paper maps by embedding similarity (threshold 0.85) + one global map by co-occurrence
115
 
 
 
 
 
main.py β†’ app.py RENAMED
File without changes