<!--
Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

Original code adopted on May 23, 2023 from: https://github.com/huggingface/transformers/tree/main/examples/pytorch/token-classification

# Improvements/Adjustments

The following improvements were added to the tools linked above:

1. Models and metrics can be logged to an Azure ML workspace MLflow instance. This requires the `azureml-mlflow` package to be installed and the environment variables `AZURE_TENANT_ID`, `AZURE_CLIENT_ID`, `AZURE_CLIENT_SECRET`, and `AZURE_MLFLOW_TRACKING_URI` to be set via the `.env` file in the root of the repo.
2. Automated text preprocessing of Label Studio outputs was added in the `labelstudio_preprocessing.py` file, which is wired up as a bash script target.

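As a minimal sketch, the `.env` file at the repo root could look like the following. All values are placeholders; the exact tracking-URI shape is an assumption, so copy the real MLflow tracking URI from your Azure ML workspace rather than constructing it by hand:

```bash
# .env — placeholder values only; substitute credentials from your own tenant
AZURE_TENANT_ID=00000000-0000-0000-0000-000000000000
AZURE_CLIENT_ID=00000000-0000-0000-0000-000000000000
AZURE_CLIENT_SECRET=<service-principal-secret>
# Copy the MLflow tracking URI shown for your workspace in Azure ML Studio
AZURE_MLFLOW_TRACKING_URI=azureml://<region>.api.azureml.ms/mlflow/v1.0/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.MachineLearningServices/workspaces/<workspace>
```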
# Token classification

## PyTorch version

Fine-tuning the library models for token classification tasks such as Named Entity Recognition (NER), Part-of-Speech
tagging (POS), or phrase extraction (CHUNKS). The main script `run_ner.py` leverages the 🤗 Datasets library and the Trainer API. You can easily
customize it to your needs if you need extra processing on your datasets.

It will either run on a dataset hosted on our [hub](https://huggingface.co/datasets) or with your own text files for
training and validation; you might just need to add some tweaks in the data preprocessing.

The following example fine-tunes BERT on CoNLL-2003:

```bash
python run_ner.py \
  --model_name_or_path bert-base-uncased \
  --dataset_name conll2003 \
  --output_dir /tmp/test-ner \
  --do_train \
  --do_eval
```

Alternatively, you can just run the bash script `run.sh`.

To run on your own training and validation files, use the following command:

```bash
python run_ner.py \
  --model_name_or_path bert-base-uncased \
  --train_file path_to_train_file \
  --validation_file path_to_validation_file \
  --output_dir /tmp/test-ner \
  --do_train \
  --do_eval
```
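Custom files can be CSV or JSON. As a hedged sketch, a JSON-lines training file could be generated like this; the `tokens`/`ner_tags` column names are an assumption mirroring CoNLL-style datasets, and should be adjusted to whatever columns your data actually uses:

```python
import json

# Hypothetical example rows. The "tokens"/"ner_tags" column names are an
# assumption mirroring CoNLL-style datasets; adapt them to your own schema.
rows = [
    {"tokens": ["Alice", "works", "at", "Contoso"], "ner_tags": ["B-PER", "O", "O", "B-ORG"]},
    {"tokens": ["She", "lives", "in", "Berlin"],    "ner_tags": ["O", "O", "O", "B-LOC"]},
]

with open("train.json", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")  # one JSON object per line
```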

**Note:** This script only works with models that have a fast tokenizer (backed by the 🤗 Tokenizers library), as it
uses special features of those tokenizers. You can check whether your favorite model has a fast tokenizer in
[this table](https://huggingface.co/transformers/index.html#supported-frameworks); if it doesn't, you can still use the old version
of the script.

> If your model's classification head dimensions do not match the number of labels in the dataset, you can specify `--ignore_mismatched_sizes` to adapt it.

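The fast-tokenizer feature in question is the mapping from subword tokens back to the original words (exposed as `word_ids()` on the encoding), which the script uses to align word-level labels with subword tokens. A minimal sketch of that alignment logic, using a hand-written `word_ids` list instead of a real tokenizer (the values below are illustrative, not output from an actual model):

```python
# Sketch of word-to-subword label alignment as done in NER fine-tuning scripts.
# word_ids mimics what a fast tokenizer's BatchEncoding.word_ids() returns:
# one entry per subword token, None for special tokens like [CLS]/[SEP].
word_ids = [None, 0, 0, 1, 2, 3, None]   # word 0 was split into two subwords
word_labels = [3, 0, 0, 7]               # one label id per original word

aligned = []
previous = None
for wid in word_ids:
    if wid is None:
        aligned.append(-100)             # -100 is ignored by PyTorch's cross-entropy loss
    elif wid != previous:
        aligned.append(word_labels[wid]) # label only the first subword of each word
    else:
        aligned.append(-100)             # mask the remaining subwords
    previous = wid

print(aligned)  # [-100, 3, -100, 0, 0, 7, -100]
```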
## Old version of the script

You can find the old version of the PyTorch script [here](https://github.com/huggingface/transformers/blob/main/examples/legacy/token-classification/run_ner.py).

## PyTorch version, no Trainer

Based on the script [run_ner_no_trainer.py](https://github.com/huggingface/transformers/blob/main/examples/pytorch/token-classification/run_ner_no_trainer.py).

Like `run_ner.py`, this script allows you to fine-tune any of the models on the [hub](https://huggingface.co/models) on a
token classification task (NER, POS, or CHUNKS) or on your own data in a CSV or JSON file. The main difference is that this
script exposes the bare training loop, to allow you to quickly experiment and add any customization you would like.

It offers fewer options than the script with `Trainer` (for instance, you can easily change the options for the optimizer
or the dataloaders directly in the script), but it still runs in a distributed setup, on TPU, and supports mixed precision by
means of the [🤗 `Accelerate`](https://github.com/huggingface/accelerate) library. You can use the script normally