Spaces:

MartynW
/

CS5260_demo

Sleeping

App Files Files Community

martyn-wong commited on May 1, 2025

Commit

73e2343

1 Parent(s): 179ff9c

updated main.py to app.py

Browse files

Files changed (2) hide show

README.md +98 -74
main.py → app.py +0 -0

README.md CHANGED Viewed

@@ -1,91 +1,115 @@
 # Academic Paper Summarizer & Concept-Map Explorer
-A Gradio-based dashboard designed to assist AI/ML researchers in efficiently searching, summarizing, and visualizing academic papers.
-## Features
-1. **Search Papers**
-   - Search for academic papers using keywords via the ArXiv API.
-   - _(Optional)_ Semantic Scholar and CrossRef integrations are included but currently disabled.
-2. **Per-Paper Summarization**
-   - Extractive summarization using spaCy to generate concise summaries (2–3 sentences per paper).
-3. **Concept Maps**
-   - Generate concept maps for individual papers and a global concept map for all papers.
-   - Keyphrases are extracted using KeyBERT and visualized with PyVis.
-4. **Cross-Paper Summary**
-   - Summarize key points across all selected papers using a large language model (LLM) via DeepInfra's OpenAI-compatible endpoint.
-5. **Export to PDF**
-   - Save the cross-paper summary and concept maps in a neatly formatted PDF.
-6. **Progress Updates**
-   - Real-time progress updates for long-running tasks like generating summaries and concept maps.
 ---
 ## Installation
-1. **Clone the repository**
    ```bash
-   git clone https://github.com/lim-mingen/cs5260.git
-   cd cs5260
-   ```
-2. **Install dependencies**
    ```bash
-   pip install -r requirements.txt
-   ```
-3. **Set up API keys**
-   - Add your DeepInfra API key to `config/.env`.
-4. **Run the application**
    ```bash
-   python main.py
-   ```
-5. **Access the app**
-   - Open the URL printed in your terminal to start using the app.
----
-## How It Works
-### 1. Search Papers
-- Enter a keyword and the number of papers to fetch.
-- Results are retrieved from ArXiv and displayed in a table.
-### 2. Summarization
-- Each paper's abstract is summarized using spaCy's extractive summarization.
-### 3. Concept Maps
-- Keyphrases are extracted using KeyBERT and visualized as nodes.
-- Edges are created based on cosine similarity between embeddings.
-### 4. Cross-Paper Summary
-- Abstracts from all papers are summarized into a cohesive narrative using an LLM.
-### 5. Export to PDF
-- The summary and concept maps are exported to a PDF using the FPDF library.
----
-## Repository Structure
-- `main.py`: Contains the Gradio app logic and UI components.
-- `util.py`: Core functions for fetching papers, summarization, and concept map generation.
-- `config/.env`: Stores API keys (e.g., DeepInfra OpenAI key).
-- `requirements.txt`: Lists all dependencies.
-- `README.md`: Documentation for the project.
----
-## Future Improvements
-- Enable Semantic Scholar and CrossRef integrations.
-- Add support for full-text PDF extraction and summarization.
-- Improve concept map connectivity and layout algorithms.
-- Explore additional summarization models for better coherence.

 # Academic Paper Summarizer & Concept-Map Explorer
+A lightweight Gradio dashboard to help AI/ML researchers quickly find, summarize, and visualize the conceptual landscape of academic papers.
+- **Search** ArXiv by keyword
+- **Per-paper summary** (2 – 3 sentences) via spaCy extractive summarization
+- **Cross-paper summary** (5 – 6 sentences) driven by **Qwen/Qwen2.5-Coder-32B-Instruct**
+- **Global concept map** (all papers) and 📝 **per-paper concept maps** via KeyBERT + Sentence-Transformer embeddings + PyVis
+- **Export to PDF** for saving summaries in a neatly formatted document
+---
+## Repository Layout
+- util.py: contains core functions to summarize, extract and build concept map
+- app.py: contains Gradio UI functions
+- config
+   - .env: holds API_KEY to access DeepInfra OpenAI
+- requirements.txt
+- README.md
 ---
 ## Installation
+1. **Clone** the repo and enter its folder
    ```bash
+      git clone https://github.com/lim-mingen/cs5260.git
+      cd cs5260
+2. Create a virtual environment and install
    ```bash
+      pip install -r requirements.txt
+3. Add your DeepInfra API key in config/.env
+4. Run the app
    ```bash
+      python app.py
+5. Open the URL printed in your terminal to start exploring
+## Features & Methodology
+### 1. Data Collection
+- **Source**: arXiv via the `arxiv` Python library
+- _(Disabled)_ Semantic Scholar & CrossRef wrappers included, but commented out since many entries lack abstracts
+### 2. Per-Paper Summarization
+- **Model**: spaCy `en_core_web_sm`
+- **How**:
+  1. Tokenize & filter stop-words/punctuation
+  2. Score sentences by term-frequency
+  3. Select top 2–3 sentences
+### 3. Keyphrase Extraction & Concept Maps
+- **Keyphrases**: extracted with KeyBERT over **Specter** embeddings
+- **Deduplication**:
+  - Substring-based filtering
+  - Agglomerative clustering on normalized embeddings (cosine threshold = 0.1)
+- **Graphs (PyVis)**:
+  - **Nodes**: top 10 keyphrases per paper
+  - **Edges**: connect if cosine similarity ≥ 0.85
+  - **Layout**: force-directed repulsion (`nodeDistance`, `springLength`, `damping`)
+### 4. Cross-Paper Summary
+- **Model**: **Qwen/Qwen2.5-Coder-32B-Instruct** via DeepInfra’s OpenAI-compatible endpoint
+- **Prompt**: "These are the abstracts of {len(abstracts)} papers. Produce a cross-paper summary that summarizes all the key points across each paper. Keep it to 5-6 sentences."
+### 5. Graphs (PyVis):
+- **Nodes**: top 10 keyphrases per paper
+- **Edges**: connect if cosine similarity ≥ 0.85
+- **Layout**: force-directed repulsion (nodeDistance, springLength, damping)
+### 5. Progress Bar
+- **Purpose**: Provides real-time updates on the status of long-running tasks (e.g., generating summaries and concept maps).
+- **How**:
+  - Implemented using Gradio's `yield` functionality in the `process_all` function.
+  - Displays messages like "Generating cross-paper summary..." and "Processing paper X of Y..." in a `gr.Textbox`.
+  ### 6. Export to PDF
+- **Purpose**: Allows users to save the cross-paper summary in a neatly formatted PDF document.
+- **How**:
+  - Extracts `<p>` blocks from the HTML output using `BeautifulSoup`.
+  - Formats the summary with headers and spacing using the `FPDF` library.
+  - Saves the PDF as `summary.pdf` and provides a download link in the Gradio interface.
+## 🔬 Experiments & Outcomes
+1. **Semantic Scholar & CrossRef**
+   • Added `fetch_semantic_scholar` and `fetch_crossref` with `semanticscholar`/`habanero` clients
+   • **Outcome**: most results lacked abstracts or relevance → **disabled**
+2. **Full-Text PDF Extraction**
+   • Downloaded PDFs + `PyPDF2` → NER/summarization on full text
+   • **Outcome**: noisy extractions from captions, tables, references → reverted to abstracts only
+3. **Domain-Specific NER**
+   • Tried SciSpaCy (biomedical) and SciERC transformers
+   • **Outcome**: labels too niche or model download failures → reverted to spaCy general NER
+4. **Keyphrase Approaches**
+   • RAKE, TextRank, KeyBERT with Specter embeddings
+   • **Outcome**: heavy verb/digit filtering & clustering needed → settled on current pipeline for balance
+5. **Cross-Paper Summarizers**
+   • Pegasus-XSum (single sentence) → too terse
+   • BART-CNN hierarchical summarization → 3–5 sentences but lacked coherence
+   • **Solution**: LLM prompt via Qwen/Qwen2.5-Coder-32B-Instruct produced the best narrative
+6. **Concept-Map Connectivity**
+   • Sentence co-occurrence → isolated per-paper clusters
+   • Embedding-similarity edges → hair-ball or slow performance
+   • **Final**: per-paper maps by embedding similarity (threshold 0.85) + one global map by co-occurrence

main.py → app.py RENAMED Viewed

File without changes