This project automates the process of retrieving scientific paper data from Semantic Scholar, enriching it with additional information from CrossRef, and storing it in a MySQL database.
Please remember to respect copyright law in your country. Do not use this software to process abstracts without the necessary permission from rights holders!
- MySQL Server (local or remote, e.g., via XAMPP)
- Python libraries: `pymysql`, `pytest`, `requests`, `keyring`, `python-dotenv`, `tabulate`, `colorama`
Install the required packages:

```shell
pip install pymysql requests keyring python-dotenv tabulate colorama pytest
```

- Clone the repository:

```shell
git clone <repository_url>
cd <project_directory>
```

- Create a `.env` file in the project root with the following structure:

```
API_KEY="your Semantic Scholar API key"
SemSch_URL="https://api.semanticscholar.org/graph/v1/paper/search"
DB_NAME="your_database_name"
DB_password="your_database_password"
HOST="your_database_host"
USER="your_database_user"
```
- Set up the MySQL database (locally or remotely using XAMPP or another tool).
- Retrieve Data – Downloads paper data from Semantic Scholar based on custom query parameters.
- Enrich Data – Enriches the data with metadata from CrossRef (if available).
- Prepare Data – Converts JSON data into a format compatible with MySQL.
- Load Data to MySQL – Connects to the database, creates necessary tables, and inserts the enriched data.
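The "Prepare Data" step can be sketched as a small flattening function. This is a hypothetical illustration, not the project's actual `make_data_compatible.py`; the field names follow the `fields` parameter used in `main.py`:

```python
import json

def flatten_record(paper: dict) -> dict:
    """Flatten one Semantic Scholar paper record into a flat,
    MySQL-friendly row (hypothetical sketch, not the project's code)."""
    return {
        "title": paper.get("title"),
        # Join the author list into a single comma-separated string column.
        "authors": ", ".join(a.get("name", "") for a in paper.get("authors", [])),
        "year": paper.get("year"),
        "venue": paper.get("venue"),
        "abstract": paper.get("abstract"),
        # Keep nested external IDs as a JSON string so they fit a TEXT column.
        "externalIds": json.dumps(paper.get("externalIds", {})),
        "url": paper.get("url"),
    }

sample = {
    "title": "On Beauty",
    "authors": [{"name": "A. Author"}, {"name": "B. Author"}],
    "year": 2023,
    "venue": "J. Aesthetics",
    "abstract": "...",
    "externalIds": {"DOI": "10.1000/xyz"},
    "url": "https://example.org/paper",
}
row = flatten_record(sample)
print(row["authors"])  # A. Author, B. Author
```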
You can customize the query parameters for Semantic Scholar in the main.py file (e.g., change query, fieldsOfStudy, etc.).
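For reference, these query parameters end up serialized into the search endpoint's URL query string. The sketch below builds the final request URL with the standard library only (no request is sent); the API key is assumed to travel separately in the `x-api-key` request header:

```python
from urllib.parse import urlencode

SemSch_URL = "https://api.semanticscholar.org/graph/v1/paper/search"

query_params = {
    'query': "aesthetics",
    'isOpenAccess': True,
    'year': 2023,
    'limit': 20,
    'fieldsOfStudy': 'Philosophy',
    'fields': 'title,authors,abstract,externalIds,year,venue,url',
}

# urlencode serializes each key/value pair and percent-encodes
# reserved characters (e.g., the commas in 'fields').
request_url = f"{SemSch_URL}?{urlencode(query_params)}"
print(request_url)
```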
Optional functions (enabled in code):
- View table contents and records in MySQL.
- Update existing database records with new data.
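Updating existing records boils down to issuing parameterized `UPDATE` statements. The helper below is a hypothetical sketch of that idea; the project's actual update logic lives in `load_data_to_MySQL.py` and may differ:

```python
def build_update(table: str, record: dict, key_column: str):
    """Build a parameterized UPDATE statement for pymysql-style execution.
    Hypothetical helper, shown for illustration only."""
    columns = [c for c in record if c != key_column]
    assignments = ", ".join(f"`{c}` = %s" for c in columns)
    sql = f"UPDATE `{table}` SET {assignments} WHERE `{key_column}` = %s"
    # Parameters are passed separately so pymysql escapes them safely.
    params = [record[c] for c in columns] + [record[key_column]]
    return sql, params

sql, params = build_update(
    "aesthetics_test_final",
    {"title": "On Beauty", "year": 2024, "url": "https://example.org"},
    key_column="url",
)
print(sql)
# UPDATE `aesthetics_test_final` SET `title` = %s, `year` = %s WHERE `url` = %s
```

Keeping values out of the SQL string and in the parameter list avoids SQL injection and quoting bugs.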
- `source/` – Project modules:
  - `get_data.py` – Retrieves data from Semantic Scholar.
  - `enrich_data.py` – Enriches data with CrossRef metadata.
  - `make_data_compatible.py` – Prepares JSON data for MySQL insertion.
  - `load_data_to_MySQL.py` – Manages MySQL connections, table creation, and data insertion.
- `files/` – Contains intermediate JSON files: raw data, enriched data, compatible data.
- `.env` – Environment file with API keys and database credentials.
Example `main.py`:

```python
from dotenv import load_dotenv
import os

from source import get_data as gd
from source import enrich_data as ed
from source import make_data_compatible as mdc
from source import load_data_to_MySQL as ldm

# Load API key and database credentials from the .env file in the project root
load_dotenv()

SemSch_Api = os.getenv('API_KEY')
SemSch_URL = os.getenv('SemSch_URL')
DB_name = os.getenv('DB_NAME')
DB_password = os.getenv('DB_password')
DB_host = os.getenv('HOST')
DB_user = os.getenv('USER')

# Query parameters for the Semantic Scholar paper search endpoint
query_params = {
    'query': "aesthetics",
    'isOpenAccess': True,
    'year': 2023,
    'limit': 20,
    'fieldsOfStudy': 'Philosophy',
    'fields': 'title,authors,abstract,externalIds,year,venue,url'
}

# 1. Retrieve data from Semantic Scholar
gd.get_data(SemSch_Api, SemSch_URL, query_params)

# 2. Enrich the data with CrossRef metadata
ed.enrich_data(print_result=False)

# 3. Convert the JSON data into a MySQL-compatible format
mdc.process_data(print_process=False, testing=False)

# 4. Connect to MySQL, create the table, and insert the data
connection = ldm.connection(
    db=DB_name,
    password=DB_password,
    localhost=False,
    host=DB_host,
    user=DB_user
)
ldm.create_database(
    con=connection,
    compatible_json_path='files/compatible_enriched_data.json',
    table_name="aesthetics_test_final",
    print_process=False
)

# Optional: inspect the first rows of the new table
ldm.print_table_contents(connection, 'aesthetics_test_final', number=5)
connection.close()
```
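The table-creation step can be pictured as deriving a `CREATE TABLE` statement from the keys of one flattened row. This is a hypothetical sketch of the idea; the actual schema produced by `load_data_to_MySQL.py` may differ:

```python
def create_table_sql(table: str, row: dict) -> str:
    """Derive a simple CREATE TABLE statement from one flattened row.
    Hypothetical sketch; integers map to INT, everything else to TEXT."""
    type_map = {int: "INT", float: "DOUBLE"}
    cols = ", ".join(
        f"`{name}` {type_map.get(type(value), 'TEXT')}"
        for name, value in row.items()
    )
    return f"CREATE TABLE IF NOT EXISTS `{table}` ({cols})"

row = {"title": "On Beauty", "year": 2023, "abstract": "..."}
print(create_table_sql("aesthetics_test_final", row))
# CREATE TABLE IF NOT EXISTS `aesthetics_test_final` (`title` TEXT, `year` INT, `abstract` TEXT)
```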
This project is open-source. Feel free to contribute or adapt it for your use case!