Code and Data for Paper: "Unveiling the Power of Language Models in Chemical Research Question Answering"
This repository contains the code, data, and instructions for reproducing the experiments presented in the paper. It provides scripts for data collection, preprocessing, model training, and evaluation.
- code/: Scripts for data collection, data preprocessing, model training, and evaluation.
- data/: Includes:
  - test.json: Test dataset.
  - doi_list.txt: The DOI list.
git clone <repo-link>
cd <repo-name>
pip install -r code/model_code/requirements.txt
The data collection scripts are located in code/data_collection_code and support scraping data from five different websites. Make sure you have valid API keys for the respective data sources before running the scripts.
- Elsevier (elsevier/):
  - 1cursor.py: Initial data fetching.
  - 2extract.py: Data extraction.
- Lens (lens/):
  - Scripts for different categories:
    - 1lens_cursor_bio.py
    - 1lens_cursor_cata.py
    - 1lens_cursor_elec.py
    - 1lens_cursor_enginner.py
- S2ORC (s2orc/):
  - 1abstracts.py: Fetch abstracts.
  - 1s2orc.py: Main data fetching.
  - 2extract_paper.py: Extract papers from fetched data.
- Scopus (scopus/):
  - 1cursor.py: Initial data fetching.
  - 2extract.py: Data extraction.
- Springer (springer/):
  - 1cursor.py: Initial data fetching.
  - 2extract.py: Data extraction.
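The numeric prefixes on the script names (1cursor.py, 2extract.py) appear to indicate the order in which the scripts run within each source folder. The sketch below sorts script names by that prefix. It rests on an assumption inferred from the naming, not on anything the repository states, and the helper names run_order and scripts_for_source are hypothetical. The order among scripts sharing a prefix (for example the two "1" scripts in s2orc/) is not known, so the sketch falls back to alphabetical order there.

```python
import re
from pathlib import Path


def run_order(script_names):
    """Sort script names by their leading numeric prefix, then by name.

    Assumption: the leading number (the "1" in "1cursor.py") marks the stage
    at which a script should run. This is inferred from the file names only.
    """
    def key(name):
        match = re.match(r"(\d+)", name)
        # Names without a numeric prefix are placed last.
        stage = int(match.group(1)) if match else float("inf")
        return (stage, name)

    return sorted(script_names, key=key)


def scripts_for_source(source_dir):
    """List the Python scripts of one source folder in the presumed run order."""
    names = [path.name for path in Path(source_dir).glob("*.py")]
    return run_order(names)
```

For example, calling scripts_for_source("code/data_collection_code/elsevier") from the repository root would list the fetching script before the extraction script, provided the folder layout matches the description above.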
After collecting the raw data, use the scripts in code/model_code/preprocess/set1_ver1 to format the data for model training and evaluation.
- For Test Data
  - Run:
    python code/model_code/preprocess/set1_ver1/generate_json_test.py
- For Labeled Training Data
  - Run:
    python code/model_code/preprocess/set1_ver1/generate_json_train_label.py
- For Unlabeled Training Data
  - Run:
    python code/model_code/preprocess/set1_ver1/generate_json_train_unlabel.py
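The three preprocessing commands above can also be run in one pass. The following is a minimal sketch that builds the commands and runs them one after another, stopping at the first failure. It assumes the scripts take no command-line arguments and are run from the repository root; whether they depend on one another is not stated, so the order used here is simply the order listed above. The names build_commands and run_all are hypothetical helpers, not part of the repository.

```python
import subprocess
import sys
from pathlib import Path

# Paths and script names are taken from the preprocessing steps listed above.
PREPROCESS_DIR = Path("code/model_code/preprocess/set1_ver1")
SCRIPTS = [
    "generate_json_test.py",
    "generate_json_train_label.py",
    "generate_json_train_unlabel.py",
]


def build_commands(python=sys.executable, base_dir=PREPROCESS_DIR):
    """Return one [interpreter, script_path] command per preprocessing script."""
    return [[python, str(Path(base_dir) / name)] for name in SCRIPTS]


def run_all(commands, runner=subprocess.run):
    """Run each command in turn; check=True raises on the first non-zero exit."""
    for command in commands:
        runner(command, check=True)
```

Calling run_all(build_commands()) from the repository root runs all three steps. The runner parameter exists only so the loop can be exercised without launching real processes.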
Use the provided train.sh script to initiate model training with the preprocessed data.
bash code/model_code/train.sh
- The training progress will be logged, and the best-performing model will be saved automatically as a checkpoint.
Evaluate the trained model using the evaluate.sh script. The evaluation will generate performance metrics on the test data.
bash code/model_code/evaluate.sh
You can also download the pre-trained model checkpoint from this link for direct evaluation without training.
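Before evaluating, it can help to check what the test data looks like. The schema of data/test.json is not documented here, so the sketch below makes no claim about its fields: it only reports the top-level type, the number of entries, and how often each field name occurs. It assumes the file is a single JSON document (not JSON Lines), and summarize_json is a hypothetical helper name.

```python
import json
from collections import Counter


def summarize_json(text):
    """Summarise a JSON document: top-level type, entry count, field frequencies."""
    data = json.loads(text)
    if isinstance(data, list):
        records = data
    elif isinstance(data, dict):
        records = list(data.values())
    else:
        records = []

    # Count how many records carry each field name (only dict records have fields).
    fields = Counter()
    for record in records:
        if isinstance(record, dict):
            fields.update(record.keys())

    return {
        "type": type(data).__name__,
        "count": len(data) if isinstance(data, (list, dict)) else 1,
        "fields": dict(fields),
    }
```

To apply it to the test set, read data/test.json as UTF-8 text and pass the contents to summarize_json.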
| File/Folder | Description |
|---|---|
| code/data_collection_code | Scripts to fetch and extract data from five websites. |
| code/model_code | Contains scripts for model training, evaluation, and data preprocessing. |
| data/test.json | Test dataset used during evaluation. |
| data/doi_list.txt | The DOI list of collected files. |
| train.sh | Script to initiate model training. |
| evaluate.sh | Script to evaluate the model using the test data. |
| requirements.txt | List of Python dependencies required for running the project. |
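The DOI list can be loaded for cross-checking against collected files. This is a minimal sketch that assumes data/doi_list.txt holds one DOI per line, which the repository description does not confirm. It drops blank lines and duplicates, comparing DOIs case-insensitively because DOI names are case-insensitive. The name read_doi_list is a hypothetical helper.

```python
def read_doi_list(lines):
    """Return unique, non-empty DOIs in first-seen order.

    Assumption: each line holds exactly one DOI.
    """
    seen = set()
    dois = []
    for line in lines:
        doi = line.strip()
        if not doi:
            continue  # skip blank lines
        key = doi.lower()  # DOI names are case-insensitive
        if key not in seen:
            seen.add(key)
            dois.append(doi)
    return dois
```

An open file object can be passed directly, since iterating over it yields lines: open data/doi_list.txt with UTF-8 encoding and call read_doi_list on the handle.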
- Ensure that you have API access and valid API keys for data collection from external sources.
- If there are missing dependencies, verify that you installed all the packages listed in requirements.txt.
- For issues during training or evaluation, check the log files generated by the train.sh and evaluate.sh scripts.
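The dependency check in the notes above can be partly automated. The sketch below reads requirement lines, extracts the package names, and reports which ones are not installed in the current environment. It is a rough check under stated assumptions: it handles plain lines such as name==version, skips comments and pip options, does not verify version constraints, and does not handle URL-based requirements. The contents of the repository's requirements.txt are not shown here, and the helper names are hypothetical.

```python
import re
from importlib import metadata


def requirement_names(lines):
    """Extract package names from simple requirement lines."""
    names = []
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("-"):
            continue  # skip blank lines and pip options such as "-r other.txt"
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(match.group(0))
    return names


def missing_packages(names):
    """Return the names for which no installed distribution is found."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing
```

Passing the lines of code/model_code/requirements.txt to requirement_names and the result to missing_packages gives a list of packages to install; an empty list means every named package was found.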