Automated workflow for building, scoring, and screening a corpus of research papers from the NIME (New Interfaces for Musical Expression) conference (2001–2025), focusing specifically on keyboard interfaces.
The repository includes the pre-extracted text corpus in Keyboard_Interface_Texts/. You can run the analysis immediately without the original PDF files.
- Setup Environment:
pip install -r requirements.txt
- Run Analysis:
# Generate the screening report (KWIC)
python kwic_screening.py
# After manual labeling in 'kwic_context_screening.csv':
python merge_screening_with_metadata.py
- Crawler/: Contains the scraping pipeline for NIME papers spanning 2001–2025.
- Keyboard_Interface_Texts/: The processed text corpus (ready for analysis).
- Metadata_Filtered_Results/: Final output storage for screened CSVs.
- *.py: Core pipeline scripts for standardization, filtering, and extraction.
- Note: The Renamed_PDFs/ and NIME Papers/ directories are excluded (~17 GB) to comply with GitHub limits.
To ensure maximum accuracy and coverage, the pipeline cross-references multiple data sources to calibrate the corpus:
- Metadata Analysis: export.csv is generated via NIME Proceedings Analyzer to extract structural metadata.
- NIME Official Bibliography: nime_papers.csv is sourced from the NIME Bibliography Archive.
- Crawled Data & Archives:
  - The Crawler/ folder contains scripts used to scrape the NIME portal for papers from 2001–2024, plus a dedicated script for 2025.
  - Historical data is also supplemented by the official NIME ZIP Archives.
- Validation: This multi-source comparison ensures that renamed PDFs align perfectly with official bibliography entries.
If you need to rebuild the corpus or add new conference years, follow these steps:
- Acquisition: Use the Crawler/ tools to fetch metadata and PDFs for 2001–2025, or download ZIP containers from the official NIME Archives.
- Standardization: rename_pdfs_by_nime_id.py
  Aligns raw PDFs with official metadata and resolves inconsistent naming schemes.
- Filtering: filter_renamed_pdfs_combined.py
  Categorizes papers and performs pre-screening by stripping bibliographies to avoid false positives.
- Extraction: extract_keyboard_pdfs_to_txt.py
  Converts PDFs to TXT (specifically fixing the 2013 word-spacing bug).
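The bibliography-stripping idea in the Filtering step can be sketched as follows. This is a minimal illustration and not the code in filter_renamed_pdfs_combined.py: the heading pattern, the keyword list, and the function names `strip_bibliography` and `mentions_keyboard` are all assumptions made for the example.

```python
import re

# Assumed heading pattern: a line holding only "References" or "Bibliography",
# optionally numbered (e.g. "7. REFERENCES"). Real papers may need more variants.
_REF_HEADING = re.compile(
    r"^[ \t]*(?:\d+\.?[ \t]*)?(?:references|bibliography)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def strip_bibliography(text: str) -> str:
    """Return the text before the last references heading, or the text unchanged."""
    matches = list(_REF_HEADING.finditer(text))
    if not matches:
        return text
    return text[: matches[-1].start()].rstrip()


def mentions_keyboard(text: str, keywords=("piano", "keyboard", "organ")) -> bool:
    """Check for whole-word keyword hits in the body only (hypothetical keyword list)."""
    body = strip_bibliography(text).lower()
    return any(re.search(rf"\b{re.escape(k)}\b", body) for k in keywords)


paper = (
    "We built a glove controller.\n\n"
    "5. REFERENCES\n"
    "[1] A. Author. The augmented piano.\n"
)
print(mentions_keyboard(paper))  # the only "piano" hit sits in the reference list
```

Cutting at the last matching heading rather than the first is a deliberate choice here: it avoids truncating a paper whose body happens to contain a line reading "References" earlier on.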
The pipeline applies a heuristic scoring model to prioritize relevant research within the text corpus.
Mathematical Foundation (IDF Weights):
The script calculates the Inverse Document Frequency (IDF) of each keyword across the corpus, so that terms appearing in most papers are down-weighted and rare instrument names stand out.
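As a rough illustration, the textbook definition is idf(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing term t. The sketch below assumes that plain form; the script's exact variant (for example, any smoothing) is not shown here, and the function name `idf_weights` and the toy documents are hypothetical.

```python
import math
import re


def idf_weights(documents, keywords):
    """Textbook IDF per keyword: log(N / df), using whole-word, case-insensitive matching."""
    n_docs = len(documents)
    weights = {}
    for kw in keywords:
        pattern = re.compile(rf"\b{re.escape(kw)}\b", re.IGNORECASE)
        df = sum(1 for doc in documents if pattern.search(doc))
        # Assumption: a keyword absent from the corpus gets weight 0.0
        # instead of raising a division-by-zero error.
        weights[kw] = math.log(n_docs / df) if df else 0.0
    return weights


# Toy corpus, invented for the example.
docs = [
    "The keyboard sends MIDI.",
    "A clavichord keyboard.",
    "Keyboard and MIDI sensors.",
    "Keyboard layout study.",
]
weights = idf_weights(docs, ["keyboard", "clavichord", "theremin"])
print(weights)
```

In this toy corpus "keyboard" appears in every document and therefore receives a weight of zero, while "clavichord" appears once and receives log(4).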
Heuristic Scoring Model:
- Hits: Logarithmic frequency bonus to avoid rewarding length over relevance.
- Instrument Boost: Fixed bonuses for definitive keyboard terms (Piano, Organ, Accordion) to override low IDF scores.
- Musical Context: Reward points for co-occurring terms like MIDI, sensor, or velocity.
- Typing Noise Penalty: Significant penalty for office/computing context like QWERTY or text entry.
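One way the four components could combine is sketched below. Every constant, term list, and name (`score_paper`, `INSTRUMENT_BOOST`, and so on) is a placeholder chosen for illustration; the actual weights and formula used by the pipeline are not stated in this README.

```python
import math
import re

# Placeholder values, not the pipeline's real settings.
INSTRUMENT_BOOST = {"piano": 5.0, "organ": 5.0, "accordion": 5.0}
CONTEXT_TERMS = ("midi", "sensor", "velocity")
NOISE_TERMS = ("qwerty", "text entry")
CONTEXT_REWARD = 1.0
NOISE_PENALTY = 10.0


def count(term, text):
    """Number of whole-word, case-insensitive occurrences of term in text."""
    return len(re.findall(rf"\b{re.escape(term)}\b", text, flags=re.IGNORECASE))


def score_paper(text, idf):
    """Combine IDF-weighted hits, instrument boosts, context rewards, and noise penalties."""
    score = 0.0
    for kw, weight in idf.items():
        hits = count(kw, text)
        if hits:
            # Logarithmic bonus: 100 hits count far less than 100x one hit.
            score += weight * (1.0 + math.log(hits))
            # Fixed boost so definitive instrument terms survive a low IDF.
            score += INSTRUMENT_BOOST.get(kw, 0.0)
    score += CONTEXT_REWARD * sum(1 for t in CONTEXT_TERMS if count(t, text))
    score -= NOISE_PENALTY * sum(1 for t in NOISE_TERMS if count(t, text))
    return score


idf = {"piano": 0.5, "keyboard": 0.1}  # made-up weights
music = "An augmented piano keyboard with a velocity sensor and MIDI output."
typing = "A QWERTY keyboard for faster text entry."
print(score_paper(music, idf), score_paper(typing, idf))
```

With these placeholder values the musical example scores positive and the typing example scores negative, which is the separation the penalty is meant to produce.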
The final stage involves human validation of the high-priority papers identified by the pipeline.
- Manual Decision: Review snippets in kwic_context_screening.csv and mark relevant papers in the KEEP(1)_or_EXCLUDE(0) column.
- Metadata Export: Use merge_screening_with_metadata.py to unify your final selection with BibTeX entries and full metadata for your literature review.
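The merge step amounts to filtering the labeled rows and joining them with metadata. The sketch below shows that idea with the standard library only; it is not the logic of merge_screening_with_metadata.py. The join key `paper_id`, the other column names, and the sample rows are invented for the example; only the KEEP(1)_or_EXCLUDE(0) column name comes from the screening CSV.

```python
import csv
import io

DECISION_COLUMN = "KEEP(1)_or_EXCLUDE(0)"  # column name from the screening CSV
ID_COLUMN = "paper_id"                     # hypothetical join key


def merge_kept(screening_rows, metadata_rows):
    """Return screening rows marked 1, each extended with its metadata fields."""
    metadata_by_id = {row[ID_COLUMN]: row for row in metadata_rows}
    merged = []
    for row in screening_rows:
        if row.get(DECISION_COLUMN, "").strip() != "1":
            continue  # excluded or not yet labeled
        meta = metadata_by_id.get(row[ID_COLUMN], {})
        merged.append({**meta, **row})
    return merged


# Invented sample data standing in for the two CSV files.
screening_csv = io.StringIO(
    "paper_id,snippet,KEEP(1)_or_EXCLUDE(0)\n"
    "paper_A,...augmented piano...,1\n"
    "paper_B,...QWERTY keyboard...,0\n"
)
metadata_csv = io.StringIO(
    "paper_id,title,year\n"
    "paper_A,Example Title A,2001\n"
    "paper_B,Example Title B,2001\n"
)
kept = merge_kept(csv.DictReader(screening_csv), csv.DictReader(metadata_csv))
print(kept)
```

Rows with an empty decision cell are skipped rather than kept, so unlabeled papers cannot slip into the final selection by accident.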