Code and Data for Paper: "Unveiling the Power of Language Models in Chemical Research Question Answering"

📌 Project Overview

This repository contains the code, data, and instructions for reproducing the experiments presented in the paper. It provides scripts for data collection, preprocessing, model training, and evaluation.

📁 Folder Structure:

code/: Scripts for data collection, data preprocessing, model training, and evaluation.
data/: Includes:
- test.json: Test dataset.
- doi_list.txt: DOI list of the collected files.
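
As a quick sanity check on the two data files, a minimal loading sketch follows. It assumes that doi_list.txt holds one DOI per line and that test.json is a single JSON document; neither format is stated here, so adjust the helpers if the files differ. The function names are illustrative and not part of the repository.

```python
import json
from pathlib import Path


def load_doi_list(path):
    """Return the non-empty, stripped lines of a DOI list file.

    Assumption: the file stores one DOI per line.
    """
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def load_test_set(path):
    """Parse the test file as one JSON document, without assuming its schema."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

From the repository root, `load_doi_list("data/doi_list.txt")` and `load_test_set("data/test.json")` would then return the parsed contents for inspection.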

🚀 Getting Started

1. Clone the Repository

git clone <repo-link>
cd <repo-name>

2. Install Dependencies

pip install -r code/model_code/requirements.txt

📚 Data Collection

The data collection scripts are located in code/data_collection_code and support scraping data from five sources: Elsevier, Lens, S2ORC, Scopus, and Springer. Make sure you have a valid API key for each source you plan to use before running its scripts.
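
One way to catch a missing key before a long scraping run is a small pre-flight check like the sketch below. It assumes the keys are supplied through environment variables, and the variable names are hypothetical; check each script to see how it actually expects its key.

```python
import os


def missing_api_keys(required, environ=None):
    """Return the names in `required` that are unset or blank in the environment."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name, "").strip()]


# Hypothetical variable names, checked against an explicit dict for illustration.
demo_env = {"ELSEVIER_API_KEY": "abc123", "LENS_API_KEY": ""}
print(missing_api_keys(["ELSEVIER_API_KEY", "LENS_API_KEY", "SPRINGER_API_KEY"], demo_env))
# prints ['LENS_API_KEY', 'SPRINGER_API_KEY']
```

Calling `missing_api_keys([...])` without the second argument checks the real process environment.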

📁 Data Sources and Scripts:

Elsevier (elsevier/):
- 1cursor.py: Initial data fetching.
- 2extract.py: Data extraction.
Lens (lens/):
- Scripts for different categories:
  - 1lens_cursor_bio.py
  - 1lens_cursor_cata.py
  - 1lens_cursor_elec.py
  - 1lens_cursor_enginner.py
S2ORC (s2orc/):
- 1abstracts.py: Fetch abstracts.
- 1s2orc.py: Main data fetching.
- 2extract_paper.py: Extract papers from fetched data.
Scopus (scopus/):
- 1cursor.py: Initial data fetching.
- 2extract.py: Data extraction.
Springer (springer/):
- 1cursor.py: Initial data fetching.
- 2extract.py: Data extraction.
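
The script names (1cursor.py, then 2extract.py) suggest a fetch-then-extract flow built on cursor pagination, although that is not confirmed here. The sketch below shows the generic pattern only and is not the repository's code: `fake_fetch_page` is a stand-in for a real API call, and the record fields are invented for illustration.

```python
from typing import Callable, Iterator, Optional, Tuple

# A page fetcher takes a cursor (None for the first page) and returns
# (records, next_cursor); next_cursor is None when no pages remain.
PageFetcher = Callable[[Optional[str]], Tuple[list, Optional[str]]]


def iterate_records(fetch_page: PageFetcher, max_pages: int = 1000) -> Iterator[dict]:
    """Yield records page by page until the source stops returning a cursor."""
    cursor = None
    for _ in range(max_pages):  # hard cap guards against an endless cursor loop
        records, cursor = fetch_page(cursor)
        yield from records
        if cursor is None:
            break


def fake_fetch_page(cursor):
    """Stand-in for a real API call: serves three fixed pages from memory."""
    pages = {
        None: ([{"doi": "10.0000/a"}], "p2"),
        "p2": ([{"doi": "10.0000/b"}], "p3"),
        "p3": ([{"doi": "10.0000/c"}], None),
    }
    return pages[cursor]


print([record["doi"] for record in iterate_records(fake_fetch_page)])
# prints ['10.0000/a', '10.0000/b', '10.0000/c']
```

Keeping the fetcher as a parameter lets the same loop be exercised offline with a fake before pointing it at a rate-limited API.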

🔄 Data Preprocessing

After collecting the raw data, use the scripts in code/model_code/preprocess/set1_ver1 to format the data for model training and evaluation.

📌 Steps:

For Test Data

Run:

python code/model_code/preprocess/set1_ver1/generate_json_test.py

For Labeled Training Data

Run:

python code/model_code/preprocess/set1_ver1/generate_json_train_label.py

For Unlabeled Training Data

Run:

python code/model_code/preprocess/set1_ver1/generate_json_train_unlabel.py
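
To run the three preprocessing steps in one go, a small driver along these lines could be used. It assumes the scripts take no command-line arguments (as in the commands above) and that it is launched from the repository root; the helper names are illustrative.

```python
import subprocess
import sys
from pathlib import Path

PREPROCESS_DIR = Path("code/model_code/preprocess/set1_ver1")
SCRIPTS = [
    "generate_json_test.py",
    "generate_json_train_label.py",
    "generate_json_train_unlabel.py",
]


def build_commands(python=sys.executable, base_dir=PREPROCESS_DIR):
    """Return one argv list per preprocessing script, in the order listed above."""
    return [[python, str(Path(base_dir) / name)] for name in SCRIPTS]


def run_all(commands):
    """Run each command in turn; raise CalledProcessError on the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

Calling `run_all(build_commands())` runs the test, labeled, and unlabeled steps in sequence and stops as soon as one of them exits with a non-zero status.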

⚙️ Model Training

Use the provided train.sh script to initiate model training with the preprocessed data.

🚀 Run Training:

bash code/model_code/train.sh

The training progress will be logged, and the best-performing model will be saved automatically as a checkpoint.

📊 Model Evaluation

Evaluate the trained model using the evaluate.sh script. The evaluation will generate performance metrics on the test data.

🧪 Run Evaluation:

bash code/model_code/evaluate.sh

You can also download the pre-trained model checkpoint from this link for direct evaluation without training.

📌 File Descriptions

| File/Folder | Description |
| --- | --- |
| `code/data_collection_code` | Scripts to fetch and extract data from the five sources. |
| `code/model_code` | Scripts for data preprocessing, model training, and evaluation. |
| `data/test.json` | Test dataset used during evaluation. |
| `data/doi_list.txt` | DOI list of the collected files. |
| `code/model_code/train.sh` | Script to start model training. |
| `code/model_code/evaluate.sh` | Script to evaluate the model on the test data. |
| `code/model_code/requirements.txt` | Python dependencies required to run the project. |

🔧 Troubleshooting

- Data collection fails: confirm that you have API access and a valid API key for each external source you query.
- Missing dependencies: verify that every package listed in code/model_code/requirements.txt is installed.
- Training or evaluation fails: check the log output produced by the train.sh and evaluate.sh scripts.
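
For the dependency check, a sketch like the following reports which packages from a requirements file are not installed. It is a simplification: it reads only plain `name` or `name==version` style lines, skips pip options such as `-r`, does not compare versions, and would misread URL-style requirements. The function names are illustrative.

```python
import re
from importlib import metadata


def requirement_names(lines):
    """Extract distribution names from requirements-style lines."""
    names = []
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("-"):
            continue  # skip blanks, comment-only lines, and pip options
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(match.group(0))
    return names


def missing_packages(names):
    """Return the names for which no installed distribution is found."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing
```

Passing the lines of code/model_code/requirements.txt through `requirement_names` and then `missing_packages` gives the list of packages still to install.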
