iriscxy/chemmatch
Code and Data for Paper: "Unveiling the Power of Language Models in Chemical Research Question Answering"

📌 Project Overview

This repository contains the code, data, and instructions for reproducing the experiments presented in the paper. It provides scripts for data collection, preprocessing, model training, and evaluation.

📁 Folder Structure:

  • code/: Scripts for data collection, data preprocessing, model training, and evaluation.
  • data/: Includes:
    • test.json: Test dataset.
    • doi_list.txt: The DOI list of the collected papers.

🚀 Getting Started

1. Clone the Repository

git clone <repo-link>
cd <repo-name>

2. Install Dependencies

pip install -r code/model_code/requirements.txt

📚 Data Collection

The data collection scripts are located in code/data_collection_code and support scraping data from five different websites. Make sure you have valid API keys for the respective data sources before running the scripts.

📁 Data Sources and Scripts:

  1. Elsevier (elsevier/):

    • 1cursor.py: Initial data fetching.
    • 2extract.py: Data extraction.
  2. Lens (lens/):

    • Scripts for different categories:
      • 1lens_cursor_bio.py
      • 1lens_cursor_cata.py
      • 1lens_cursor_elec.py
      • 1lens_cursor_enginner.py
  3. S2ORC (s2orc/):

    • 1abstracts.py: Fetch abstracts.
    • 1s2orc.py: Main data fetching.
    • 2extract_paper.py: Extract papers from fetched data.
  4. Scopus (scopus/):

    • 1cursor.py: Initial data fetching.
    • 2extract.py: Data extraction.
  5. Springer (springer/):

    • 1cursor.py: Initial data fetching.
    • 2extract.py: Data extraction.
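All five sources require credentials before their scripts will run. As a quick pre-flight check, here is a minimal sketch that verifies the keys are available; the environment-variable names below are assumptions for illustration only, since each script may read its key differently (check the script source):

```python
import os

# Hypothetical environment-variable names -- the actual scripts may read
# their keys another way (e.g. hard-coded strings or config files).
REQUIRED_KEYS = [
    "ELSEVIER_API_KEY",
    "LENS_API_KEY",
    "S2_API_KEY",
    "SCOPUS_API_KEY",
    "SPRINGER_API_KEY",
]

def missing_keys(env=None):
    """Return the names of any required API keys that are unset or empty."""
    env = os.environ if env is None else env
    return [k for k in REQUIRED_KEYS if not env.get(k)]

if __name__ == "__main__":
    absent = missing_keys()
    if absent:
        print("Missing API keys:", ", ".join(absent))
    else:
        print("All API keys present.")
```

Running a check like this before kicking off a long scraping job avoids partial failures midway through a source.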

🔄 Data Preprocessing

After collecting the raw data, use the scripts in code/model_code/preprocess/set1_ver1 to format the data for model training and evaluation.

📌 Steps:

  1. For Test Data

    • Run:
      python code/model_code/preprocess/set1_ver1/generate_json_test.py
  2. For Labeled Training Data

    • Run:
      python code/model_code/preprocess/set1_ver1/generate_json_train_label.py
  3. For Unlabeled Training Data

    • Run:
      python code/model_code/preprocess/set1_ver1/generate_json_train_unlabel.py
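Each generate_json_* script writes a JSON file. A quick way to sanity-check that an output file parses and is non-empty is sketched below; this assumes the outputs are top-level JSON lists of records, which may not match the repo's actual schema, and the demo writes its own throwaway file rather than touching the real outputs:

```python
import json
import os
import tempfile

def count_records(path):
    """Load a JSON file and return how many top-level records it holds."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    # Assumption: the file is a JSON list of records; adapt this check if
    # the generate_json_* scripts emit keyed dicts instead.
    return len(data)

# Self-contained demo with a temporary file standing in for a real output:
with tempfile.TemporaryDirectory() as d:
    sample = os.path.join(d, "sample.json")
    with open(sample, "w", encoding="utf-8") as f:
        json.dump([{"question": "...", "answer": "..."}] * 3, f)
    print(count_records(sample))  # 3
```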

⚙️ Model Training

Use the provided train.sh script to initiate model training with the preprocessed data.

🚀 Run Training:

bash code/model_code/train.sh

  • Training progress will be logged, and the best-performing model will be saved automatically as a checkpoint.
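The "save the best model" behaviour follows a standard pattern: track the best validation score seen so far and write a checkpoint only when it improves. A minimal sketch of that pattern (the scores here are made up; the repo's training script implements its own version of this logic):

```python
def best_checkpoint_epoch(val_scores):
    """Return the index of the epoch with the highest validation score."""
    best_epoch, best_score = 0, float("-inf")
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            best_epoch, best_score = epoch, score
            # In a real training loop, this is where the checkpoint
            # would be written to disk, overwriting the previous best.
    return best_epoch

print(best_checkpoint_epoch([0.41, 0.57, 0.53, 0.62]))  # 3
```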

📊 Model Evaluation

Evaluate the trained model using the evaluate.sh script. The evaluation will generate performance metrics on the test data.

🧪 Run Evaluation:

bash code/model_code/evaluate.sh

You can also download the pre-trained model checkpoint from this link for direct evaluation without training.


📌 File Descriptions

  • code/data_collection_code: Scripts to fetch and extract data from five websites.
  • code/model_code: Scripts for model training, evaluation, and data preprocessing.
  • data/test.json: Test dataset used during evaluation.
  • data/doi_list.txt: The DOI list of the collected files.
  • train.sh: Script to initiate model training.
  • evaluate.sh: Script to evaluate the model using the test data.
  • requirements.txt: List of Python dependencies required to run the project.

🔧 Troubleshooting

  • Ensure that you have API access and valid API keys for data collection from external sources.
  • If dependencies are missing, verify that all packages listed in code/model_code/requirements.txt are installed.
  • For issues during training or evaluation, check the log files generated by the train.sh and evaluate.sh scripts.
