- 📍 Overview
- 🚀 Getting Started
- 🔎 In-depth explanation
- 🧪 Running Tests
- 📖 Resources generated with CURATE
- 🤝 Contributions
- 📜 License
- ☎️ Contact
- 🥇 Acknowledgements
- 📝 Citation
The Corpus Cleaner consists of several modules corresponding to different steps in our text processing pipeline. Each module is a Python script that can be run from the command line. The modules are organized into folders based on their functionality. The root directory contains some utilities relating to the input, internal representation and output of documents, which are used by the rest of the modules.
A corpus is a big set of documents, usually from the same source. It does not have an internal representation as it is usually too big to fit in memory. A corpus is usually split into several parts in order to be processed in parallel. The whole corpus is represented by a metadata file, which contains information about the parts, like their paths, their ids, etc. It also contains information about the destination of the output of each step (or input of the next step) of the pipeline. This is useful because it allows us to run the pipeline in several steps, without having to worry about the paths of each intermediate file (only the metadata file is needed). The metadata file is a json file. See the test one for an example. The corpus metadata file is generated using the metadata tool. All modules in this pipeline assume that you have a metadata file for the corpus you want to process.
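For orientation, a module typically just loads this metadata file and follows the paths recorded in it. The snippet below is only a minimal sketch of that; the comment paraphrases what the file contains, so check the test metadata file for the actual schema:
# Minimal sketch: load the corpus metadata file and inspect it.
# See the test metadata file (test_data/data/02-metadata/test_mx_20240325.json)
# for the actual keys; the comment below only paraphrases its contents.
import json

with open("test_data/data/02-metadata/test_mx_20240325.json") as f:
    metadata = json.load(f)

# Records the parts of the corpus (paths, ids, ...) and the input/output
# locations of each pipeline step.
print(json.dumps(metadata, indent=2))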
Right now, the following steps are implemented:
- Deduplication: Removes exact duplicate documents from a corpus.
- Splitting into paragraphs and sentences, filtering and scoring, all of which are done by the Preprocess and score module.
- Classification. Right now we are using an adult content model and keywords specific to that area, but the same module can be used for other types of classification.
All .sh files, command-line snippets and python scripts should be run from the root directory of the project. This is because the project uses relative paths to access files and folders. If you run them from another directory, they might not work properly.
You will have to do the following steps to install the project and run any module (they all rely on the same input and output format utilities):
- Clone the CURATE repository:
git clone https://github.com/langtech-bsc/CURATE.git
- Change to the project directory:
cd CURATE
- Create a virtual environment and install the dependencies:
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
(this step might be very different depending on your system, especially if you are using the clusters; please check the appropriate documentation)
- Download the nltk sentence tokenizer model:
bash download_punkt.sh
If you don't have access to the internet on your machine, you can download it manually from here, unzip it, and put it in the nltk_data/tokenizers folder under the root directory of the project.
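As an optional sanity check (this snippet is not part of the project's scripts; it just assumes the project-local nltk_data folder described above), you can verify that Python finds the tokenizer:
# Optional sanity check for the punkt model in the project-local nltk_data folder.
import nltk

nltk.data.path.insert(0, "nltk_data")   # look in the project root first
nltk.data.find("tokenizers/punkt")      # raises LookupError if the model is missing
print("punkt tokenizer found")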
The best way to check that everything is working properly is to run some of the tests in the following subsection.
All modules in this pipeline can be run from the command line. They all have the same structure:
python [my_script].py --input_format_read [input_format_name] --output_format [output_format_name] \
--metadata_path [path_of_your_metadata_file_here] # i/o arguments here
# document checks here
# specific my_script.py arguments here
As an example, we will generate metadata for, deduplicate, score, filter by score and sample a corpus (see the readmes of the corresponding modules for more information):
We have already done this step for you; the metadata file is in test_data. The commands to reproduce it would be:
pip install -r corpus_metadata_tool/requirements.txt
python corpus_metadata_tool/src/metadatatool.py --prefix test_data/ --part_size 0
Where --part_size 0 indicates that we do not want to merge the jsonls into a single part.
Note that you should select option 1 to register a new raw collection, and then select the index of the folder where the data is stored (in our case, 0 because we only have one folder in test_data/data/01-raw-collections).
Because you would usually need to use Greasy to run the deduplication module, we will be running the naive deduplication script instead. The command to run it is:
python deduplicate/naive_deduplicator.py --metadata_path test_data/data/02-metadata/test_mx_20240325.json --input_format_read old_crawling_v2 --output_format raw_tsv
Two new files should have appeared at test_data/data/03-text-repository/data/test_mx_20240325/ named part_0.dedup
and part_1.dedup. They are tsv files with each document's index, text and url.
The input format is old_crawling_v2 because this is how our provided data is formatted.
The output format should always be raw_tsv for deduplication,
as it is the only one that takes the dedup_path from the metadata (or you can create a new output format that does the same).
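If you want to take a quick look at one of those files from Python, something like the following works (the column order is assumed from the description above, not taken from the raw_tsv code):
# Illustrative peek at a deduplicated part; the (index, text, url) column order
# is assumed from the description above.
import csv

path = "test_data/data/03-text-repository/data/test_mx_20240325/part_0.dedup"
with open(path, newline="", encoding="utf-8") as f:
    for index, text, url in csv.reader(f, delimiter="\t"):
        print(index, url, text[:80])
        break  # only show the first document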
The next step is to preprocess and score the documents (which in turn splits them into sentences and paragraphs), and then filter by score. You can read more about it in the preprocess and score module.
First, we need to run the following to set up the environment for this module:
pip install -r preprocess_and_score/requirements.txt
bash preprocess_and_score/download_language_identifier.sh
Because we wish to process both parts at the same time (and we have already deduplicated the data), we will run the following command:
python preprocess_and_score/clean_whole_metadata.py --metadata_path test_data/data/02-metadata/test_mx_20240325.json --clean_setup not_testing.sh --input_format_split old_crawling_v2 --venv_script venv/bin/activate
The not_testing.sh script is a bash script that sets up the environment for the preprocess_and_score module. It takes everything from the dedup_path in the metadata file and puts it in the clean_path after processing.
To see how other setups would work, check the io_args section.
You can see that under
test_data/data/03-text-repository/data/test_mx_20240325/03.2-processing
there are folders for each language (including mx, which means unidentified language) with the processed documents.
Each is in the clean_jsonl format, which is a verbose format that contains all the internal structure of the documents
(you would change this by editing the not_testing.sh script).
There is also a .stats document
for each clean file, which contains information about the number of words and documents in different score ranges for each language.
This will be used in the next step (see the sample module).
There are many ways to do this part. See the sample module for more information. We will just sample by score threshold (that is, we will set a score threshold and take all documents with a score above it). We will split the data into training, validation and test sets (by default, the first 80% of the data is training, the next 10% is validation and the last 10% is test):
python sample/sampler_top_n_threshold.py --clean_jsons all --sample_output_path test_data/test_sample --input_format_read clean_jsonl --languages ca es --threshold 0.7 --max_test_valid_size_PERCENT 1 --max_test_valid_size_MB 100000 --clean_path test_data/data/03-text-repository/03.2-processing/clean/
This will create a folder called test_sample in the test_data folder with the sampled data.
In order for you to add input or output formats, or even new functionality, you should understand the following concepts:
A Document is the internal representation of a contiguous piece of text. It consists of a list of Paragraphs, which are lists of Sentences. However, each of these elements has some additional attributes:
- Document: A document has a corpus identifier, a part id (relative to the corpus) and a document id (relative to the part).
It also has other attributes which are usually lazily computed, like the number of sentences, the number of paragraphs, the number of words, languages, scores, topic, etc., which are computed when needed by other modules and then cached. In the case of the topic bias, additional attributes such as pred_warmth and pred_competence will be added.
- Paragraph: A paragraph has a paragraph id (relative to the document) and a list of sentences. It also has other attributes which are usually lazily computed, similar to the document.
- Sentence: A sentence has a sentence id (relative to the paragraph) and a text. It also has other attributes which are usually lazily computed, similar to the document, as well as its word spans, which are computed when needed by other modules, and then cached.
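A minimal sketch of that hierarchy is shown below. It is illustrative only: the real classes live in the root-level document utilities and carry more (lazily computed and cached) attributes than shown here.
# Illustrative sketch of the Document/Paragraph/Sentence hierarchy described above.
# The real classes have more attributes, most of them lazily computed and cached.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Sentence:
    sentence_id: int                                     # relative to the paragraph
    text: str
    word_spans: Optional[List[Tuple[int, int]]] = None   # computed and cached on demand

@dataclass
class Paragraph:
    paragraph_id: int                                     # relative to the document
    sentences: List[Sentence] = field(default_factory=list)

@dataclass
class Document:
    corpus: str                                           # corpus identifier
    part_id: int                                          # relative to the corpus
    doc_id: int                                           # relative to the part
    paragraphs: List[Paragraph] = field(default_factory=list)
    score: Optional[float] = None                         # example of a lazily computed attribute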
An input format is a way to turn a file (or any other structure, for example, a database)
into a generator of Documents. It is a child class of the InputFormat
class, which specifies two main methods:
# yields all documents with their properties and text (decoding the file if necessary in the process).
def read(self) -> Iterator[Document]:
...
# splits a document into paragraphs and sentences.
def split(self, doc: Document) -> Document:
...
In the implementation of these methods the input format can use the self.args namespace,
which contains all the arguments passed to the caller script (in particular, the io_args
explained below and any others you may want to add). An explanation of how to pass the arguments to the input (and output)
formats is given below in the Full Example section.
Only use the split method when needed, as it is usually very costly. Additionally, the generate_full_docs()
method can be used to generate already split documents, which is useful for some modules; and the
read_and_check() method does the same as read but only yields the documents that pass all the
specified document checks.
# either
for unsplit_doc in input_format_read.read_and_check(): # or input_format_read.read() if you don't want to run checks
    split_doc = input_format_split.split(unsplit_doc)
    (...) # your custom logic here
# or
for split_doc in input_format_read.generate_full_docs():
    (...) # your custom logic here
An Output format is a way to turn a generator of Documents into a file (or any other structure, for example, one or more entries in a database). It is defined as a function, for example,
def output_format_foo(docs: Iterator[Document], args) -> None:
    ...
Note that, given an input format class in_class, an output format out_ and the appropriate passed arguments args, you could translate files between the two formats by simply calling:
in_ = in_class(args)
out_(in_.read(), args)
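To make the function-based interface concrete, here is a hypothetical output format. It is not one of the formats in output_formats.py, and both args.my_output_path and the document attributes it reads follow the illustrative sketch above rather than the real classes:
# Hypothetical output format (illustrative only): write one tsv line per document.
def output_format_id_and_text_tsv(docs, args):
    # args.my_output_path is an invented argument for this sketch; a real format
    # would typically resolve its destination from the corpus metadata file.
    with open(args.my_output_path, "w", encoding="utf-8") as f:
        for doc in docs:
            text = " ".join(s.text for p in doc.paragraphs for s in p.sentences)
            f.write(f"{doc.part_id}\t{doc.doc_id}\t{text}\n")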
The document_checks.py file contains a set of checks that can be performed on a document. They are functions that take a document as input and return a boolean.
They are used to eliminate documents that do not pass the checks, for example,
if they are part of an evaluation dataset.
In order to use this feature, the read_and_check() method must be used in the corresponding
module instead of read(), for example, we do so in the preprocess_and_score module.
In order to use any set of these functions, their name must be passed to the
corresponding python script followed by all additional arguments of the function in order,
using the wildcard '*' to pass the default value for the corresponding argument. For example:
python [my_script].py --input_format_read [input_format_name] --output_format [output_format_name] --check_1 check1_arg --check_2 check2_arg_1 '*' check2_arg_3
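For reference, a check itself is just a plain function. The one below is hypothetical: neither its name, nor the min_words default, nor the n_words attribute name is taken from document_checks.py.
# Hypothetical document check: keep only documents above a minimum word count.
# Name, default and the n_words attribute are illustrative; see document_checks.py
# for the real checks.
def check_min_words(doc, min_words=20) -> bool:
    return doc.n_words >= min_words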
The way that input and output formats work is highly configurable, and thus, depending on the input or output format that you use, you may need very different command-line arguments to configure them. For convenience, we have defined them all in the io_args file.
The way you catch the arguments is by calling the add_io_arguments_to_parser(parser: ArgumentParser)
before calling parser.parse_args().
Depending on the input or output format that you use, you may need very different command-line arguments to configure them. The standard way of working in our group is to use file-based input and output formats, and to use the metadata file to keep track of the paths of the files. Therefore, the arguments are usually the following:
- --input_format_read: The input format to use to read from the file. This will decode it and provide the separate documents, but not necessarily split them into smaller structures, unless these are already provided by the file format. See input_formats.py for a list of available formats. Note that if you want to read from the deduplication files, you should use the raw_tsv input format, which will load from the dedup_path provided in the metadata file. Similarly, if you want to read from already clean documents in the clean_path (or write to them, for example, in the case of the preprocess_and_score module), you should use one of the corresponding input formats (currently tsv or clean_jsonl; the latter is more verbose and contains all the internal structure of the documents).
- --input_format_split: The input format to use to split the documents into paragraphs and sentences. See input_formats.py for a list of available formats. Note that, for some tasks, like deduplication, you don't need to split the documents, so you may not have to pass this argument. In that case, the documents may be stored in an intermediate format (usually raw_tsv), to then be split by the original InputFormat's split() method. Therefore input_format_read and input_format_split may not be the same.
- --output_format: The output format to use. See output_formats.py for a list of available formats. The same precautions about which file you are actually reading from apply as in input_format_read.
- --metadata_path: The path to the metadata file of the corpus you want to process.
- --part: The part of the corpus you want to process. In the future, we want to add support for processing several parts at the same time by not passing this argument, but right now it is required.
- --override_output: Whether to overwrite the output files if they already exist. This is False by default, so if you want to overwrite them you have to pass this argument.
- --testing, which means all input and output data will be reparented to the testing folder. It is False by default; if it is passed, it will be set to True.
- --testing_input, which means that the input will be taken from the testing folder. It is False by default; if it is passed, it will be set to True.
- --testing_output, which means that the output will be stored in the testing folder. It is False by default.
- --testing_folder, which should be the path to the folder where the input or output of the script will be stored in testing mode. It is test_data by default.
However, we encourage you to look at the input and output format you are using
in each step and see which attributes of the args namespace they make use of.
Similarly, if you are implementing a new input or output format,
you should look at the attributes of the args namespace that you need to use,
and add them to the add_io_arguments_to_parser(parser: ArgumentParser) function.
This way, the user will be able to configure your input or output format from the command line.
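As a hedged sketch (the --my_new_format_path argument and the surrounding structure of io_args are invented for illustration), registering such an argument could look like this:
# Hedged sketch: extending add_io_arguments_to_parser in io_args with an argument
# needed by a hypothetical new input format. The argument name is invented.
from argparse import ArgumentParser

def add_io_arguments_to_parser(parser: ArgumentParser):
    ...  # the existing I/O arguments described above
    parser.add_argument("--my_new_format_path", type=str, default=None,
                        help="Path used by the hypothetical my_new_format input format.")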
The responsibility for checking that the arguments are correct is on you,
so make sure to check that the arguments are correct before using them
(for example, check that the user actually passed the arguments you need, and that they are valid,
like a valid path, etc.)
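For instance, a new format could run its argument validation through a small helper before touching any files. This is only a sketch; the helper name and the checks shown are examples, not part of the codebase:
# Hedged sketch of the kind of validation a new input or output format should do
# on the arguments it relies on (helper name and checks are illustrative).
import os

def validate_my_format_args(args):
    if not getattr(args, "metadata_path", None):
        raise ValueError("this format requires --metadata_path")
    if not os.path.isfile(args.metadata_path):
        raise FileNotFoundError(f"Metadata file not found: {args.metadata_path}")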
This is how all the pieces above can fit together into an end-to-end pipeline:
from argparse import ArgumentParser

# This is needed to import the modules from the root directory, if your script is in a subdirectory
import sys
import os
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))

from io_args import add_io_arguments_to_parser
import input_formats
import output_formats

def my_main():
    parser = ArgumentParser()
    add_io_arguments_to_parser(parser)
    (...) # Add any other arguments you need here
    args = parser.parse_args()

    input_format_name = args.input_format_read
    output_format_name = args.output_format
    my_input_format = getattr(input_formats, input_format_name)(args)
    my_output_format = getattr(output_formats, output_format_name)

    def my_iterator():
        for doc in my_input_format.read():
            (...) # your custom logic here to modify the document (e.g., score it)
            yield doc

    my_output_format(my_iterator(), args)

if __name__ == '__main__':
    my_main()
There is a script that you can use to test a given input or output format. For now it only works for file-based formats. You should pass at least the following I/O arguments:
- --input_format_read
- --output_format
- --metadata_path
- --corpus_metadata_id
- --testing_folder
Note that, as with any other script, you need to pass in testing_input and/or testing_output if you want to use the testing folder
to read from or write to, respectively. Usually you will have to pass in testing_output to take data from your 01-raw-collections
and write it to your testing_folder (which should be test_data by default). If you want to chain together several tests,
you can pass the --testing argument, which will set both testing_input and testing_output to true.
Additionally, you can pass the --input_format argument, which will override both --input_format_read and --input_format_split (setting both to the same value as --input_format).
Optionally, you can pass the --part argument, but by default the part will be 0.
For example, we can do a 'mock deduplication' test by running the following command:
python tests/test_general.py --metadata_path test_data/data/02-metadata/test_mx_20240325.json --input_format old_crawling_v2 --output_format raw_tsv --testing_folder test_data/test1 --testing_output
And then, check that the raw_tsv input format, old_crawling_v2 split input format and clean_jsonl output format work properly by running:
python tests/test_general.py --metadata_path test_data/data/02-metadata/test_mx_20240325.json --input_format_read raw_tsv --input_format_split old_crawling_v2 --output_format clean_jsonl --testing_folder test_data/test1 --testing_output --testing_input
This would be an example of what is needed to execute the preprocessing and scoring module on the deduplicated corpus.
Both executions will take the first 100 documents, print them on screen, and then write them to a file in the specified output format. You can check the results of deduplication and scoring in the specified test_data/test1 folder.
Contributions are what make the open-source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
- Fork the Project
- Create your Feature Branch (git checkout -b feature/AmazingFeature)
- Commit your Changes (git commit -m 'Add some AmazingFeature')
- Push to the Branch (git push origin feature/AmazingFeature)
- Open a Pull Request
This project is distributed under the Apache-2.0 license. See the LICENSE file for more information.
Language Technologies Unit ([email protected]) at the Barcelona Supercomputing Center (BSC).
This work has been promoted and financed by the Generalitat de Catalunya through the Aina project.
This work is the result of the project with reference 2022/TL22/00215337, funded by the Ministerio de Asuntos Económicos y Transformación Digital and by the Plan de Recuperación, Transformación y Resiliencia, funded by the EU - NextGenerationEU.
@inproceedings{palomar-giner-etal-2024-curate,
author = "Palomar-Giner, Jorge and
Saiz, Javier and
Espuña, Ferran and
Mina, Mario and
Da Dalt, Severino and
Llop, Joan and
Ostendorff, Malte and
Ortiz Suarez, Pedro and
Rehm, Georg and
Gonzalez-Aguirre, Aitor and
Villegas, Marta",
title = "A CURATEd CATalog: Rethinking the Extraction of Pretraining Corpora for Mid-Resourced Languages",
booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation",
year = "2024",
publisher = "European Language Resource Association and the International Comittee on Computational Linguistics",
}
