TL;DR:
A fun mini project to build a simple CLI utility to verify citations in BibTeX files by checking if they are valid or potentially hallucinated. Install from source with pip and run verify-citations --bibtex-file path/to/references.bib. The tool checks if the paper can be found online, verifies URLs, and matches metadata (title and authors) against online databases like arXiv, Semantic Scholar, DBLP, Google Scholar, and DuckDuckGo.
Q: Does this tool solve the underlying problem of hallucinated citations?
A: No, this tool does not solve the larger issue. As noted by GPTZero, hallucinated citations are unfortunately a growing problem that is unlikely to go away. At the same time, the use of LLMs for scientific writing is on the rise, and they are undeniably useful. This tool is designed to help good-faith users identify potentially invalid or hallucinated citations in their BibTeX files by running automated checks against online databases. The project started as a bit of light vibe-coding fun, but after finding that it was helpful to people, I decided to polish it up and release it publicly.
Q: How should I use this tool?
A: This tool is intended as a first-pass filter to identify potentially invalid citations in BibTeX files. It flags citations for you to then verify manually. Friends have mentioned it is most helpful when citing papers from preprint servers, where the title or author list might change, or when the BibTeX entry generated by Google Scholar or Semantic Scholar is incorrect or incomplete.
Q: Can this tool verify all citations?
A: No, this tool cannot verify all citations. There are many, many edge cases that I found really hard to cover comprehensively. However, it does cover a large number of common cases and should prove useful for many users.
Q: Does this tool check if the correct conference/journal version is cited?
A: No. The tool currently focuses on checking that the paper can be found online (which is mostly how people find papers, right? :)) and matching the metadata (title and authors) against the specified BibTeX entry. It does not verify whether the correct version (e.g., conference vs. journal) is cited.
This tool performs automated checks to verify citations:
- Online Findability: Checks whether the paper can be found online across multiple sources:
  - arXiv: Direct lookup when an arXiv ID is available, plus API-based title search
  - ACL Anthology: Natural language processing and computational linguistics papers
  - Semantic Scholar: Academic paper database with comprehensive API
  - DBLP: Computer science bibliography
  - Google Scholar: Broad scholarly articles search
  - DuckDuckGo: General web search as fallback
  - Rate limit resilience: Automatic retry with exponential backoff when APIs return 429 errors, with fallback to alternative sources
- URL Validation: Verifies that provided links are correct and accessible
  - Handles HTTP status codes with appropriate error reporting
  - 429 Rate Limit handling: Automatically retries with exponential backoff (1s → 2s → 4s, up to 3 retries) before moving to alternative sources
  - 403 Forbidden handling: Recognizes when servers block automated access and flags for manual verification
  - Distinguishes between critical errors (404) and warnings (403, 429, connection issues)
  - Supports automatic conversion of arXiv IDs to full URLs
- Metadata Verification: Checks if both the title AND author list match what's found online
  - Compares paper titles with difflib sequence similarity (70% threshold for metadata verification, 50% for initial findability), with a word-overlap fallback for DuckDuckGo results
  - Validates author lists by extracting and comparing author last names (50% match threshold)
  - Handles name format differences: Recognizes "Last, First" and "First Last" as the same author
  - Fuzzy matching: Tolerates small misspellings (up to 2 character differences) in author names
  - Special character handling: Correctly processes LaTeX special characters in names
  - "et al" / "and others" support: Validates that all explicitly listed authors appear in the online source
  - Works with arXiv, Semantic Scholar, and DBLP sources
- Version Information: Surfaces version-related fields from BibTeX entries
  - arXiv preprints (including version suffixes)
  - Journal publications
  - Conference proceedings
  - DOI information when available
- Color-Coded Output: Clear visual feedback
  - 🟢 Green (✓): Successfully verified
  - 🔴 Red (✗): Critical errors (URL invalid, paper not found, metadata mismatch)
  - 🟡 Yellow (⚠): Warnings (403 errors, metadata couldn't be verified)
  - 🔵 Cyan (ℹ): Informational messages (version info, verbose logs)
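
The metadata checks listed above (difflib title similarity, last-name extraction, and tolerance for small misspellings) can be illustrated with a minimal sketch. This is not the tool's actual code: the function names are hypothetical, "up to 2 character differences" is assumed here to mean Levenshtein edit distance, and the normalization rules are a guess at something reasonable. Only the 70% and 50% thresholds come from the description above.

```python
import difflib
import re


def title_similarity(a: str, b: str) -> float:
    """Return a 0..1 difflib ratio between two normalized titles."""
    def norm(s: str) -> str:
        s = re.sub(r"[{}]", "", s)                 # drop BibTeX case-protecting braces
        s = re.sub(r"[^a-z0-9]+", " ", s.lower())  # punctuation becomes whitespace
        return " ".join(s.split())
    return difflib.SequenceMatcher(None, norm(a), norm(b)).ratio()


def split_bibtex_authors(field: str) -> list:
    """Split a BibTeX author field on 'and', dropping the 'others' placeholder."""
    parts = [p.strip() for p in re.split(r"\s+and\s+", field.strip())]
    return [p for p in parts if p and p.lower() not in {"others", "et al", "et al."}]


def last_name(author: str) -> str:
    """Extract a lowercase last name from 'Last, First' or 'First Last'."""
    author = author.strip()
    if "," in author:
        return author.split(",")[0].strip().lower()
    return author.split()[-1].lower()


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (assumed meaning of 'character differences')."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def authors_match(bib_authors, online_authors, max_typos=2, threshold=0.5) -> bool:
    """True if enough listed BibTeX last names appear among the online last names."""
    bib = [last_name(a) for a in bib_authors if a.strip()]
    online = [last_name(a) for a in online_authors if a.strip()]
    if not bib:
        return False
    hits = sum(any(edit_distance(b, o) <= max_typos for o in online) for b in bib)
    return hits / len(bib) >= threshold


if __name__ == "__main__":
    sim = title_similarity("{A}ttention is All you Need", "Attention Is All You Need")
    print(f"title similarity: {sim:.2f} (verified if >= 0.70)")
    listed = split_bibtex_authors("Vaswani, Ashish and Shazeer, Noam and others")
    print(authors_match(listed, ["Ashish Vaswani", "Noam Shazeer", "Niki Parmar"]))
```

One caveat with a fixed edit-distance budget is that very short last names can match each other by accident, which is one more reason to treat the result as a flag for manual review rather than a verdict.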
# Clone the repository
git clone https://github.com/vishakhpk/verify_citations.git
cd verify_citations
# Install dependencies
pip install -r requirements.txt
# Install the package
pip install -e .

verify-citations --bibtex-file path/to/references.bib
verify-citations # Uses default references.bib

verify-citations --bibtex-file references.bib --verbose # Show detailed output
verify-citations --bibtex-file references.bib --summary-only # Show only summary
verify-citations --bibtex-file references.bib --timeout 20 # Set request timeout
verify-citations --bibtex-file references.bib --max-retries 5 # Set max retries for 429 errors
verify-citations --timeout 20 --max-retries 5 # Uses default references.bib with custom options

When citations are verified successfully:
=== Citation Verification Tool ===
Processing: examples/sample.bib
Found 3 citation(s) to verify
[1/3] Verifying:
[vaswani2017attention] Attention is All you Need (Vaswani et al., 2017)
Status: ✓ VERIFIED
✓ Paper found online via search
✓ URL is valid and accessible
✓ Metadata (title and authors) verified
ℹ Version: arXiv:1706.03762
When there are metadata mismatches (shown in yellow/red), detailed information is provided:
[2/3] Verifying:
[wrong2023] Wrong Paper Entry (Smith, John, 2023)
Status: ⚠ ISSUES FOUND
✓ Paper found online via search
⚠ Title matches but author list mismatch detected
BibTeX authors: Smith, John and Doe, Jane
Online authors: Johnson, Alice and Brown, Bob
Source: https://arxiv.org/abs/1706.03762
When there are critical errors (shown in red):
[3/3] Verifying:
[fake2023paper] This is a Fake Paper (Nobody et al., 2023)
Status: ✗ ISSUES FOUND
✗ Could not find paper online
✗ URL returns 404 (not found)
Summary:
============================================================
SUMMARY
Total citations: 3
Verified: 1
Issues found: 2
Incomplete: 0
Citations with issues:
- wrong2023: Wrong Paper Entry
- fake2023paper: This is a Fake Paper
Citations where you should manually check the links due to a 403 error:
- example2023: Example Paper Title
Citations where author list could not be verified:
Count: 1
- another2023: Another Example Paper
See examples/sample.bib for an example BibTeX file with various citation types.
The tool:
- Parses BibTeX files to extract citation metadata
- Performs web searches across multiple sources to verify paper existence:
  - arXiv: Direct ID lookup and API-based title search for preprints
  - ACL Anthology: NLP/computational linguistics papers via website search
  - Semantic Scholar API: Academic papers with structured metadata
  - DBLP: Computer science bibliography search
  - Google Scholar: Broad scholarly article search
  - DuckDuckGo: General web search fallback
- Validates URLs by making HTTP requests
  - Handles HEAD requests with GET fallback
  - Automatic retry on rate limits: Retries with exponential backoff (1s, 2s, 4s) when encountering 429 errors
  - Distinguishes critical errors (404, invalid format) from warnings (403, 429, timeouts)
- Extracts and compares both title AND author metadata from online sources
  - Uses difflib sequence similarity for title matching (50% for findability, 70% for metadata verification), with a word-overlap fallback for DuckDuckGo results
  - Compares author last names with fuzzy matching to detect mismatches
  - Handles "et al" / "and others" by validating listed authors
- Surfaces version-related fields from the BibTeX entry (preprints vs. published metadata)
- Provides a clear report of verification results with color-coded output
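
The URL validation step above (HEAD with GET fallback, exponential backoff on 429, and the split between critical errors and warnings) might look roughly like the following sketch. It is illustrative only and not taken from the tool's source: `check_url`, `fetch_status`, and `arxiv_id_to_url` are hypothetical names, the choice of 405/501 as the trigger for the GET fallback is an assumption, and so is treating every status of 400 or above other than 403 and 429 as critical. The status fetcher is passed in as a parameter so the retry logic can be exercised without network access.

```python
import time
from typing import Callable, Optional, Tuple


def arxiv_id_to_url(arxiv_id: str) -> str:
    """Turn 'arXiv:1706.03762' or '1706.03762' into an abstract-page URL."""
    ident = arxiv_id.strip()
    if ident.lower().startswith("arxiv:"):
        ident = ident[len("arxiv:"):]
    return f"https://arxiv.org/abs/{ident}"


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """HEAD first, GET if the server rejects HEAD (assumed fallback condition)."""
    import requests  # third-party; imported lazily so the rest runs without it
    resp = requests.head(url, timeout=timeout, allow_redirects=True)
    if resp.status_code in (405, 501):
        resp = requests.get(url, timeout=timeout, allow_redirects=True, stream=True)
    return resp.status_code


def check_url(
    url: str,
    fetch: Callable[[str], int],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[str, Optional[int]]:
    """Return (severity, status) where severity is 'ok', 'warning', or 'error'."""
    status = None
    for attempt in range(max_retries + 1):
        try:
            status = fetch(url)
        except OSError:
            # Connection problems and timeouts are warnings, not proof of a bad link.
            return ("warning", None)
        if status != 429:
            break
        if attempt < max_retries:
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s with the defaults
    if status < 400:
        return ("ok", status)
    if status in (403, 429):
        return ("warning", status)  # blocked or still rate limited: check manually
    return ("error", status)        # e.g. 404


if __name__ == "__main__":
    # Simulated server: rate limited twice, then fine. No real sleeping.
    codes = iter([429, 429, 200])
    print(check_url(arxiv_id_to_url("arXiv:1706.03762"),
                    fetch=lambda u: next(codes), sleep=lambda s: None))
```

For a real check, pass `fetch=fetch_status`; keeping the fetcher and the sleep function injectable is a design choice made here for testability, not something the tool is known to do.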
- Python 3.8+
- Internet connection for verification
- bibtexparser - BibTeX file parsing
- requests - HTTP requests for URL validation
- beautifulsoup4 - HTML parsing for metadata extraction
- lxml - XML/HTML parsing library (used by BeautifulSoup)
- click - CLI interface
- colorama - Colored terminal output
Contributions are welcome! Please reach out, open issues (ideally with sample .bib files that fail), or raise pull requests on the GitHub repository.
MIT License