A Text Mining project by Lucas de Wolff (s3672980) and Ruben Ahrens (s3677532)
Report: https://rubenahrens.com/docs/biobert.pdf
This project focuses on Named Entity Recognition (NER) in medical text, specifically using the CSIRO Adverse Drug Event Corpus (CADEC). We fine-tuned transformer-based models (BERT and BioBERT) to identify medical entities in patient-reported adverse drug event narratives.
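Fine-tuning BERT-style models for NER usually requires mapping word-level labels onto sub-word tokens. The sketch below shows one common way to do that. It is an illustration rather than the notebooks' actual code: `align_labels`, the label ids, and the example sub-word split are all hypothetical. The `word_ids` list mirrors what a HuggingFace fast tokenizer returns from `BatchEncoding.word_ids()`.

```python
IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss by default


def align_labels(word_ids, word_labels, label_all_subwords=False):
    """Map word-level label ids onto sub-word tokens.

    word_ids: one entry per token; None for special tokens, otherwise the
              index of the word the token came from.
    word_labels: one label id per word.
    """
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None:
            # Special tokens such as [CLS], [SEP] or padding carry no label.
            aligned.append(IGNORE_INDEX)
        elif wid != previous:
            # First sub-token of a word keeps the word's label.
            aligned.append(word_labels[wid])
        else:
            # Later sub-tokens are masked unless explicitly requested.
            aligned.append(word_labels[wid] if label_all_subwords else IGNORE_INDEX)
        previous = wid
    return aligned


# Hypothetical split of "felt drowsiness": [CLS] felt drows ##iness [SEP]
word_ids = [None, 0, 1, 1, None]
word_labels = [0, 1]  # assumed ids: 0 = "O", 1 = "B-ADR"
aligned = align_labels(word_ids, word_labels)
print(aligned)
```

Masking continuation sub-tokens means each word contributes one prediction to the loss. Passing `label_all_subwords=True` trains on every sub-token instead.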
- /cadec/: The CSIRO Adverse Drug Event Corpus dataset
  - /meddra/: MedDRA annotations
  - /original/: Original text data
  - /processed/: Processed dataset
  - /sct/: SNOMED CT annotations
  - /text/: Raw text files
- /Code/: Project source code
  - /NER/: Named entity recognition implementation
    - NER_bert.ipynb: Implementation of BERT for NER
    - NER_biobert.ipynb: Implementation of BioBERT for NER
  - /Entity Linking/: Code for entity linking tasks
  - datastats.py: Script for dataset statistics
  - test.py: Testing script
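To turn the corpus into NER training data, character-level entity spans have to be converted into per-token tags. The sketch below assumes the annotations are in brat standoff format (tab-separated `T` lines with a label and character offsets) and uses plain whitespace tokenization. The helper names and the example sentence are hypothetical, and the project's own preprocessing may differ.

```python
import re


def parse_brat_entities(ann_text):
    """Return (label, start, end) character spans from brat text-bound annotations."""
    spans = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip annotator notes ("#...") and other annotation types
        _id, info, _text = line.split("\t", 2)
        label, offsets = info.split(" ", 1)
        # Discontinuous entities list several "start end" fragments joined by ";".
        # Simplification: each fragment is treated as its own span.
        for fragment in offsets.split(";"):
            start, end = map(int, fragment.split())
            spans.append((label, start, end))
    return spans


def bio_tags(text, spans):
    """Whitespace-tokenize text and assign a BIO tag to every token."""
    tokens, tags = [], []
    for match in re.finditer(r"\S+", text):
        tok_start, tok_end = match.span()
        tag = "O"
        for label, start, end in spans:
            if tok_start < end and tok_end > start:  # token overlaps the span
                tag = ("B-" if tok_start <= start else "I-") + label
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags


# Made-up example in the style of a patient-reported narrative.
text = "I felt a bit drowsy after taking Lipitor."
ann = "T1\tADR 9 19\tbit drowsy\nT2\tDrug 33 40\tLipitor\n#1\tAnnotatorNotes T1\tdrowsiness"
tokens, tags = bio_tags(text, parse_brat_entities(ann))
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Whitespace tokenization leaves punctuation attached to words (`Lipitor.`), which is why the overlap test is used instead of exact boundary matching.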
- Python with HuggingFace Transformers
- BERT and BioBERT models
- Jupyter Notebooks
- scikit-learn for evaluation metrics
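As an illustration of how scikit-learn can score NER output, the sketch below computes micro-averaged token-level precision, recall, and F1 while excluding the `O` class. The function name, tag set, and toy sequences are assumptions for the example. It does not reproduce the evaluation in the notebooks or the results in the report.

```python
from sklearn.metrics import precision_recall_fscore_support


def token_level_scores(gold, pred, ignore_label="O"):
    """Micro-averaged token-level precision, recall and F1 over entity labels."""
    # Flatten the per-sentence tag sequences into two parallel lists.
    y_true = [tag for sentence in gold for tag in sentence]
    y_pred = [tag for sentence in pred for tag in sentence]
    # Score only entity labels so the dominant "O" class cannot inflate the result.
    labels = sorted({tag for tag in y_true + y_pred if tag != ignore_label})
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=labels, average="micro", zero_division=0
    )
    return {"precision": float(precision), "recall": float(recall), "f1": float(f1)}


# Toy gold and predicted tag sequences (two sentences).
gold = [["B-ADR", "I-ADR", "O", "B-Drug"], ["O", "O", "B-ADR"]]
pred = [["B-ADR", "O", "O", "B-Drug"], ["O", "B-ADR", "B-ADR"]]
scores = token_level_scores(gold, pred)
print(scores)
```

Token-level scores count each tag independently. Entity-level evaluation, which requires the whole span to match, is stricter and needs span extraction on top of this.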
To run the notebooks:
- Ensure all dependencies are installed (HuggingFace Transformers, scikit-learn, Jupyter)
- Run the Jupyter notebooks in the /Code/NER/ directory
If you use the CADEC dataset, please cite the original corpus:
- Karimi et al. (2015) CADEC: A corpus of adverse drug event annotations
- Data: https://data.csiro.au/collection/csiro:10948
- Lucas de Wolff (s3672980)
- Ruben Ahrens (s3677532)
January 2024