Merged
Changes from 1 commit
Commits
204 commits
230ee54
Retraining script for spacy transformer
brabbit61 Jun 21, 2023
9bc357a
Merge branch 'main' into data-review-tool-testing
shaunhutch Jun 21, 2023
63a5594
Merge branch 'data-review-tool-testing' of github.com:NeotomaDB/MetaE…
shaunhutch Jun 21, 2023
98dc260
Add Colab instructions
brabbit61 Jun 21, 2023
1cce4e4
Training README for spacy
brabbit61 Jun 21, 2023
0da4364
Merge branch 'dev' into 14-fine-tune-spacy-ner-model
brabbit61 Jun 21, 2023
cfbc813
Remove redundant config files
brabbit61 Jun 21, 2023
6caf9d6
Merge branch 'dev' into 14-fine-tune-spacy-ner-model
brabbit61 Jun 22, 2023
ac6d147
Update labelstudio readme
brabbit61 Jun 22, 2023
76eddec
bugfix: Update scope of docopt object
brabbit61 Jun 22, 2023
bd6ab77
enhancement: Modularize data splitting code
brabbit61 Jun 22, 2023
e8001df
Update Spacy Readme
brabbit61 Jun 22, 2023
d6cb366
Merge pull request #52 from NeotomaDB/14-fine-tune-spacy-ner-model
tieandrews Jun 22, 2023
e595f86
Fixed paths and correct update issue
shaunhutch Jun 22, 2023
2ed7c55
change from global variable to dcc.Store
shaunhutch Jun 22, 2023
7efc700
Fixed path to image
shaunhutch Jun 22, 2023
cd7b5e7
docs: rename file
tieandrews Jun 23, 2023
2bc4ee6
feat: added logging functionality
tieandrews Jun 23, 2023
bae1079
feat: refacotr after file rename
tieandrews Jun 23, 2023
1847fd3
Adding logger and sentence prior and after
shaunhutch Jun 23, 2023
3c292b1
Added relative paths and debugger
shaunhutch Jun 23, 2023
f6639f2
Added tests for data review tool
shaunhutch Jun 23, 2023
feb30e1
Update formatting
shaunhutch Jun 23, 2023
9e92948
embeddings added
kellywujy Jun 23, 2023
c8005a7
Merge branch 'data-review-tool-testing' of github.com:NeotomaDB/MetaE…
shaunhutch Jun 23, 2023
7b2d367
Updated logger
shaunhutch Jun 23, 2023
b3bcf03
added LOG_LEVEL
shaunhutch Jun 23, 2023
23bb5a8
Update tests
shaunhutch Jun 23, 2023
df9830d
updated requirements
shaunhutch Jun 23, 2023
23d7abb
Added requirements
shaunhutch Jun 23, 2023
746725d
Added selenium to the list of dependencies
shaunhutch Jun 23, 2023
0dc2153
Add labeling instructions
brabbit61 Jun 23, 2023
438ea1c
feat: env vraible to set log level
tieandrews Jun 23, 2023
ec5fc2e
docs: adde docstrings
tieandrews Jun 23, 2023
f8da29b
bug: added truncation and padding to tokenizer
tieandrews Jun 23, 2023
33f5434
docs: clean up description and intro
tieandrews Jun 23, 2023
c9ab71f
docs: update file locations and preprocessing
tieandrews Jun 23, 2023
6a7006d
docs: renaming of file
tieandrews Jun 23, 2023
aa85116
tests: updated test suite
tieandrews Jun 23, 2023
0d37b00
clean: deleted old json file
tieandrews Jun 23, 2023
b44ced9
Merge branch '22-fine-tune-allenai-specter2-model-for-ner' of github.…
tieandrews Jun 23, 2023
0db870d
docs: updated transformers version
tieandrews Jun 23, 2023
c500fa5
Add input args to labelling preprocessing script
brabbit61 Jun 24, 2023
5a6ff33
Add input args to labelling preprocessing script
brabbit61 Jun 24, 2023
5e17e66
Images for LabelStudio README
brabbit61 Jun 24, 2023
3eb603a
Add logging statements
brabbit61 Jun 24, 2023
673ff9f
Update command line args and assert statements
brabbit61 Jun 24, 2023
9d9c7fa
Update assert statements
brabbit61 Jun 24, 2023
4efbe2a
Preprocessing README with script descriptions
brabbit61 Jun 24, 2023
47e0fb8
Update input data format
brabbit61 Jun 26, 2023
f038692
raw train data load done
kellywujy Jun 26, 2023
ef60ea3
Removed unused imports and added additional reqs
shaunhutch Jun 26, 2023
8cb4ed5
Removed requirements
shaunhutch Jun 26, 2023
363829b
bug: fixed early stop parameter
tieandrews Jun 26, 2023
96ba07e
Added en_core_web_lg
shaunhutch Jun 26, 2023
7b8af42
Fixed tests
shaunhutch Jun 26, 2023
75ca9ee
Add pyarrow
brabbit61 Jun 26, 2023
b42a03e
Updated version numbers
shaunhutch Jun 26, 2023
d03beb7
Added codecov on PR
shaunhutch Jun 26, 2023
d047210
bug: add interface html and fix realtive paths
tieandrews Jun 26, 2023
bb922ab
Merge pull request #71 from NeotomaDB/57-labelstudio-doc-and-scripts
tieandrews Jun 26, 2023
b9e2415
Merge pull request #64 from NeotomaDB/data-review-tool-testing
tieandrews Jun 26, 2023
36b6d03
Finished edits up to 2.2
shaunhutch Jun 26, 2023
418a5d6
bugfix: remove logger initialization
brabbit61 Jun 26, 2023
db9f702
Merge pull request #73 from NeotomaDB/14-fine-tune-spacy-ner-model
brabbit61 Jun 26, 2023
83f62d0
Updated as per Florencia's comments
shaunhutch Jun 26, 2023
5c6bd04
Merge branch '79-general-simons-report-fixes' into 68-make-mds-report…
shaunhutch Jun 26, 2023
7526bd2
remove redundant print of directory
kellywujy Jun 26, 2023
9679e20
Update CODE_OF_CONDUCT.md with contact information
shaunhutch Jun 26, 2023
6b9f842
Fixed gridlines
shaunhutch Jun 26, 2023
a697bc5
Created environment.yml
shaunhutch Jun 27, 2023
b50e6fa
feat: added refernce
tieandrews Jun 27, 2023
2b5f019
feat: fixed section 2.2
tieandrews Jun 27, 2023
4575c0c
feat: added robert v3 train metrics
tieandrews Jun 27, 2023
130ad59
docs: remove diagram legend
tieandrews Jun 27, 2023
8285991
Merge branch '79-general-simons-report-fixes' of github.com:NeotomaDB…
tieandrews Jun 27, 2023
7017adf
Adding Final MDS Report
shaunhutch Jun 27, 2023
28de8f1
Merge remote-tracking branch 'origin/dev' into 22
tieandrews Jun 27, 2023
ae74625
typo fixed
kellywujy Jun 27, 2023
75f2b80
delete redundant comments
kellywujy Jun 27, 2023
5d4b7d4
bug: removed duplicated functions
tieandrews Jun 27, 2023
5b10199
test: removed unused test
tieandrews Jun 27, 2023
44e46d7
bug: correctly using GPU support
tieandrews Jun 27, 2023
8a7d6d3
bug: correctly use cuda gpu
tieandrews Jun 27, 2023
08def2d
bugfix: add field if not present
brabbit61 Jun 27, 2023
5884d12
Remove redundant notebooks
brabbit61 Jun 27, 2023
45f5e8a
Merge pull request #65 from NeotomaDB/22-fine-tune-allenai-specter2-m…
brabbit61 Jun 27, 2023
8042e60
Loading models hosted on huggingface
brabbit61 Jun 27, 2023
466bcff
Update S-LSTM ref
brabbit61 Jun 27, 2023
9203584
Declare env variables in docker compose
brabbit61 Jun 27, 2023
46a3054
Merge branch '79-general-simons-report-fixes' into 68-make-mds-report…
shaunhutch Jun 27, 2023
aa435c6
Updated report
shaunhutch Jun 27, 2023
98d62af
Updated report
shaunhutch Jun 27, 2023
02425bf
Update env variables
brabbit61 Jun 27, 2023
a18a06c
bugfix: change value default to -1
brabbit61 Jun 27, 2023
44fb392
Remove args and add docs
brabbit61 Jun 27, 2023
75a4533
Update ORCID badge
brabbit61 Jun 27, 2023
504c0e7
data processing done
kellywujy Jun 27, 2023
7e80546
minor edit
kellywujy Jun 27, 2023
2eaaa20
retrain implemented. To do: model eval
kellywujy Jun 27, 2023
a30d878
retrain pipeline done
kellywujy Jun 27, 2023
80a74b8
add main
kellywujy Jun 27, 2023
b580584
json format fix
kellywujy Jun 27, 2023
9d3668d
Fixed File path
shaunhutch Jun 27, 2023
0e9fb87
Removed unused data_review_tool data folders
shaunhutch Jun 27, 2023
fd14d0d
Adding sample files for data review tool
shaunhutch Jun 27, 2023
afafe34
Merge branch 'data-review-tool-jenit' of github.com:NeotomaDB/MetaExt…
shaunhutch Jun 27, 2023
bfbb3a2
Merge branch 'dev' into data-review-tool-jenit
shaunhutch Jun 27, 2023
7241d08
variables to change in README
brabbit61 Jun 27, 2023
64d8d6c
bug: download HF model during build
tieandrews Jun 27, 2023
07f7caf
bug: fix invalid access issues
tieandrews Jun 27, 2023
cfaadb5
Updated Sample
shaunhutch Jun 27, 2023
5fb25b0
Fixed Errors in merge
shaunhutch Jun 27, 2023
cbe81bc
Updated logger
shaunhutch Jun 27, 2023
2ffc78f
docs: updated calls and examples
tieandrews Jun 27, 2023
9c62460
bug: fixed HF model load and paths
tieandrews Jun 27, 2023
f1c5fa7
Update codecov with secret
shaunhutch Jun 27, 2023
3888d2b
Merge branch '22-fine-tune-allenai-specter2-model-for-ner' of github.…
tieandrews Jun 27, 2023
f8437ee
Fixed test issues
shaunhutch Jun 27, 2023
7df99bf
Merge pull request #82 from NeotomaDB/22-fine-tune-allenai-specter2-m…
tieandrews Jun 27, 2023
96a45e4
Added Codecov token
shaunhutch Jun 27, 2023
eb1193d
Merge branch 'dev' into data-review-tool-jenit
shaunhutch Jun 27, 2023
5bdcadb
Merge pull request #81 from NeotomaDB/data-review-tool-jenit
shaunhutch Jun 27, 2023
4617825
docs: readme update with orcid and clean up
tieandrews Jun 27, 2023
2d2ed2e
Merge branch 'dev' into 74-finish-main-repo-readme
tieandrews Jun 27, 2023
c6ef186
Updated references
shaunhutch Jun 27, 2023
c75d91b
Changed folder name
shaunhutch Jun 27, 2023
59a9cba
Merge branch 'dev' into 72-make-conda-environmentyml-file
shaunhutch Jun 27, 2023
97e145e
Cleaned up unused notebooks
shaunhutch Jun 27, 2023
065ce8b
Updated Negative Search Sample code
shaunhutch Jun 27, 2023
496a7ef
Updated Data Review Tool Sections
shaunhutch Jun 27, 2023
01b0810
Add output folder for mount
kellywujy Jun 28, 2023
4c476e7
update README to reflect volume mount
kellywujy Jun 28, 2023
cb2317e
delete sample docker compose,README has it already
kellywujy Jun 28, 2023
be062bf
docs: move labelstudio img into main assets folder
tieandrews Jun 28, 2023
465267c
article retrain dockerized + documentation
kellywujy Jun 28, 2023
2330dd8
docs: labelstudio imgs in assets
tieandrews Jun 28, 2023
4fd7b59
feat: setup prediction folder
tieandrews Jun 28, 2023
6d253c9
docs: setup evaluation folder
tieandrews Jun 28, 2023
21b711a
docs: setup preprocessing folder
tieandrews Jun 28, 2023
b7713a0
docs: setup training/huggingface folder
tieandrews Jun 28, 2023
df97660
docs: setup training/spacy folder
tieandrews Jun 28, 2023
2863b50
docs: updated pipeline code
tieandrews Jun 28, 2023
1c5d66d
tests: update path references
tieandrews Jun 28, 2023
3ca14ca
tests: path fix
tieandrews Jun 28, 2023
b07f482
docs: updated path references in notebooks
tieandrews Jun 28, 2023
a24a567
docs: updated link to training readme HF
tieandrews Jun 28, 2023
6e5d0ac
retrain docker small fix
kellywujy Jun 28, 2023
8470735
docs: added article relvance link
tieandrews Jun 28, 2023
c5713a1
Merge branch '74-finish-main-repo-readme' of github.com:NeotomaDB/Met…
tieandrews Jun 28, 2023
073cbf4
Merge pull request #85 from NeotomaDB/update_notebook
kellywujy Jun 28, 2023
936d799
Updated environment yml file
shaunhutch Jun 28, 2023
cc2c6e9
removed old file
shaunhutch Jun 28, 2023
880cecd
Minor corrections
brabbit61 Jun 28, 2023
4e17ead
Fix typo
brabbit61 Jun 28, 2023
7fcd574
Minor updates to the report
brabbit61 Jun 28, 2023
2f23128
Merge pull request #83 from NeotomaDB/74-finish-main-repo-readme
shaunhutch Jun 28, 2023
2314816
Merge branch 'dev' into 79-general-simons-report-fixes
shaunhutch Jun 28, 2023
89829cc
Updated tense issues
shaunhutch Jun 28, 2023
c56fee6
Tense updates
shaunhutch Jun 28, 2023
44aa8da
Minor changes
shaunhutch Jun 28, 2023
9100c5e
Minor corrections
shaunhutch Jun 28, 2023
b15624b
Latest changes
shaunhutch Jun 28, 2023
810bb2e
Updated .yml
shaunhutch Jun 28, 2023
b3048e8
Update packages
shaunhutch Jun 28, 2023
2a1cbd2
report: compressed entity-extraction
tieandrews Jun 28, 2023
a4f0852
article relevance prediction script test files add
kellywujy Jun 28, 2023
a01048f
edit file path
kellywujy Jun 28, 2023
503ed90
temp path setup to pass on both local and git
kellywujy Jun 28, 2023
14858bb
report: finals cuts
tieandrews Jun 28, 2023
2e89896
tolerance setup for embedding values
kellywujy Jun 28, 2023
2482c55
Final Updates
shaunhutch Jun 28, 2023
c3aa52c
merge: bring commits from br 76 into reorg
tieandrews Jun 28, 2023
882f523
Merge branch 'dev' into 68-make-mds-report-and-cut-down-report
shaunhutch Jun 28, 2023
ad0b119
Merge branch '79-general-simons-report-fixes' into 68-make-mds-report…
shaunhutch Jun 28, 2023
1d8399a
merge: cherrypick br 78 changes
tieandrews Jun 28, 2023
015dc67
bug: move test file
tieandrews Jun 28, 2023
5d0c413
merge: missed on ecommit cherry pick
tieandrews Jun 28, 2023
c8bf887
Merge branch 'dev' into 86-re-organize-entity-extraction-code
tieandrews Jun 28, 2023
d44a050
Add training metric for spacy
brabbit61 Jun 28, 2023
932a647
Remove config.py file (no transfer learning)
brabbit61 Jun 28, 2023
dc45bba
Merge pull request #88 from NeotomaDB/86-re-organize-entity-extractio…
brabbit61 Jun 28, 2023
61c4c08
Merge pull request #89 from NeotomaDB/article-relevance-retrain
shaunhutch Jun 28, 2023
c50e251
Merge pull request #80 from NeotomaDB/72-make-conda-environmentyml-file
tieandrews Jun 28, 2023
cd544b1
docs: updated model results for report
tieandrews Jun 28, 2023
31028f8
feat: added notebook for detailed ner results
tieandrews Jun 28, 2023
8194129
final edits on mds report
tieandrews Jun 28, 2023
ef48c39
Merge pull request #91 from NeotomaDB/68-make-mds-report-and-cut-down…
tieandrews Jun 28, 2023
707b205
Update tests and remove model bin file
brabbit61 Jun 28, 2023
1e538ce
bugfix: remove redundant test statement
brabbit61 Jun 28, 2023
69ab735
Update keyname
brabbit61 Jun 28, 2023
a60f343
main readme and retrain readme updated
kellywujy Jun 28, 2023
9faf36f
Merge pull request #94 from NeotomaDB/update-main-readme
kellywujy Jun 28, 2023
1339a95
Update README.md
kellywujy Jun 28, 2023
3909ca0
docs: final report updates and refinements
tieandrews Jun 28, 2023
fc5e74e
Merge pull request #95 from NeotomaDB/79-general-simons-report-fixes
tieandrews Jun 28, 2023
96b696e
Add Label Data Downloading Steps
tieandrews Jun 28, 2023
f61eb88
bug: hyperlink fix
tieandrews Jun 28, 2023
cb676de
Dockerfile edits
shaunhutch Jun 28, 2023
f2abc5e
Updated environments
shaunhutch Jun 28, 2023
12cc88b
Moved Notebooks
shaunhutch Jun 28, 2023
af84c3a
Merge branch 'dev' into final-edits
shaunhutch Jun 28, 2023
a37da8b
Merge branch 'main' into dev
tieandrews Jun 28, 2023
a420bc8
Merge pull request #96 from NeotomaDB/final-edits
tieandrews Jun 28, 2023
bug: download HF model during build
tieandrews committed Jun 27, 2023
commit 64d8d6ce975b22fdff66ea99e59d5574f697cd96
24 changes: 20 additions & 4 deletions docker/entity-extraction-pipeline/Dockerfile
@@ -11,19 +11,35 @@ COPY docker/entity-extraction-pipeline/requirements.txt .
 RUN pip install --no-cache-dir -r requirements.txt
 RUN python -m nltk.downloader stopwords
 RUN pip install https://huggingface.co/finding-fossils/metaextractor-spacy/resolve/main/en_metaextractor_spacy-any-py3-none-any.whl
+# install git-lfs to be able to clone model weights from huggingface
+RUN apt-get update && apt-get install -y git-lfs
+# download the HF model into /app/models/ner/metaextractor
+RUN mkdir -p ./models/ner/ \
+&& cd ./models/ner/ \
+&& git lfs install \
+&& git clone https://huggingface.co/finding-fossils/metaextractor

 # Copy the entire repository folder into the container
 COPY src ./src

+# Set default env variables for when running the container
+ENV USE_NER_MODEL_TYPE=huggingface
+ENV MAX_ARTICLES=-1
+ENV MAX_SENTENCES=-1
+
 # non-root user control inspired from here: https://stackoverflow.com/questions/66349101/docker-non-root-user-does-not-have-writing-permissions-when-using-volumes
 # Create a non-root user that owns the input/outputs directory by default
 RUN useradd -r extraction-user # no specific user ID
-RUN mkdir ./inputs && chown extraction-user ./inputs
-RUN mkdir ./outputs && chown extraction-user ./outputs
+RUN mkdir /inputs && chown extraction-user /inputs
+RUN mkdir /outputs && chown extraction-user /outputs
 # Mount the "inputs" and "outputs" folders as volumes
-VOLUME ["./inputs", "./outputs"]
+VOLUME ["/inputs", "/outputs"]

 # Set the entry point and command to run the script
 USER extraction-user
 RUN ls -alp /app
-ENTRYPOINT python src/pipeline/entity_extraction_pipeline.py --article_text_path ./inputs/ --output_path ./outputs/
+ENTRYPOINT python src/pipeline/entity_extraction_pipeline.py \
+--article_text_path /inputs/ \
+--output_path /outputs/ \
+--max_articles ${MAX_ARTICLES} \
+--max_sentences ${MAX_SENTENCES}
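
Note on the new ENTRYPOINT: it forwards MAX_ARTICLES and MAX_SENTENCES to the pipeline script, and both default to -1 in the image. The diff does not show how entity_extraction_pipeline.py interprets these values. The sketch below is only an illustration, assuming that -1 means "no limit" and that any other value must be a positive integer. It uses argparse, and the helper names build_parser, parse_limit and apply_limit are hypothetical, not taken from the repository.

```python
import argparse
import os
from typing import List, Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags passed by the ENTRYPOINT."""
    parser = argparse.ArgumentParser(description="Entity extraction (sketch)")
    parser.add_argument("--article_text_path", default="/inputs/")
    parser.add_argument("--output_path", default="/outputs/")
    # Fall back to the container's environment, then to -1 (assumed: no limit).
    parser.add_argument("--max_articles",
                        default=os.environ.get("MAX_ARTICLES", "-1"))
    parser.add_argument("--max_sentences",
                        default=os.environ.get("MAX_SENTENCES", "-1"))
    return parser


def parse_limit(raw: str) -> Optional[int]:
    """Map the assumed -1 sentinel to None; reject other non-positive values."""
    value = int(raw)
    if value == -1:
        return None
    if value < 1:
        raise ValueError(f"limit must be -1 or a positive integer, got {value}")
    return value


def apply_limit(items: Sequence[str], limit: Optional[int]) -> List[str]:
    """Return all items when limit is None, otherwise the first `limit` items."""
    return list(items) if limit is None else list(items)[:limit]


if __name__ == "__main__":
    args = build_parser().parse_args()
    max_articles = parse_limit(args.max_articles)
    print(apply_limit(["a.txt", "b.txt", "c.txt"], max_articles))
```

Under these assumptions, running the container with `-e MAX_ARTICLES=2` would cap processing at two articles, while the image defaults would process everything found in /inputs.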