Our main contribution is a scalable fact-checking system that provides two main features:
- Question answering
- Fact checking
Our system combines multiple NLP components, spanning NLI, QA, and IR:
- Retriever: retrieves the set of data most relevant to the content the user requests
- Reader: searches for and extracts an answer to the user's question from the relevant data returned by the Retriever
- Inferrer: classifies each piece of data (evidence) in the set of most relevant data from the Retriever, given the user's claim
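The three components above chain together as retrieve-then-read (for QA) and retrieve-then-infer (for fact checking). A minimal sketch, with toy word-overlap stand-ins for the real models (all function names here are illustrative, not the project's actual API):

```python
# Illustrative sketch of the Retriever -> Reader / Inferrer pipeline.
# All names are hypothetical; the real components wrap ElasticSearch,
# a HuggingFace QA model, and an NLI model respectively.

def retrieve(query, corpus, k=2):
    """Toy Retriever: rank documents by word overlap with the query."""
    def overlap(doc):
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(corpus, key=overlap, reverse=True)[:k]

def read(question, evidence):
    """Toy Reader: pick the evidence sentence most likely to answer."""
    return max(evidence, key=lambda e: len(set(question.lower().split()) & set(e.lower().split())))

def infer(claim, evidence):
    """Toy Inferrer: label each piece of evidence for the claim."""
    labels = []
    for e in evidence:
        if claim.lower() in e.lower():
            labels.append((e, "SUPPORTS"))
        else:
            labels.append((e, "NEUTRAL"))
    return labels

corpus = [
    "Hanoi is the capital of Vietnam",
    "FAISS performs similarity search over dense vectors",
]
evidence = retrieve("capital of Vietnam", corpus)
print(read("What is the capital of Vietnam?", evidence))
print(infer("Hanoi is the capital of Vietnam", evidence))
```

In the actual system the overlap scoring is replaced by ElasticSearch full-text search plus FAISS semantic search, and the Reader/Inferrer are transformer models.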
Instructions to reproduce the experiment step by step on the test set of the MLQA dataset
Python: 3.7.5
```
pip install -r requirements.txt
```
Also, install the missing packages for the FAISS library:
```
sudo apt-get install libopenblas-dev
sudo apt-get install libomp-dev
```
Clone:
```
git clone https://github.com/icesonata/docker-es-cococ-tokenizer.git
```
Deploy:
```
docker-compose up -d
```
*Note: this may require sudo privileges.
Create an index with title and content fields via the API:
(for cURL, skip to the Alternative below the payload)
Send a PUT request to localhost:9200/vi_mlqa_test with the payload below.
```
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 1,
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "vi_tokenizer",
            "char_filter": [ "html_strip" ],
            "filter": [ "icu_folding" ]
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_analyzer"
      },
      "content": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
```
*Alternatively, using cURL:
```
curl -XPUT "http://localhost:9200/vi_mlqa_test" -H 'Content-Type: application/json' -d'{"settings": {"index":{"number_of_shards":1,"number_of_replicas":1,"analysis":{"analyzer":{"my_analyzer":{"tokenizer":"vi_tokenizer","char_filter":["html_strip"],"filter":["icu_folding"]}}}}},"mappings":{"properties":{"title":{"type":"text","analyzer":"my_analyzer"},"content":{"type":"text","analyzer":"my_analyzer"}}}}'
```
Misc
Check the existing indices:
```
curl -XGET localhost:9200/_cat/indices
```
Count the number of documents in an index:
```
curl -XGET localhost:9200/vi_mlqa_test/_count
```
Set up MySQL with password=root and host port=15432:
```
docker run --name db_index -e MYSQL_ROOT_PASSWORD=root -p 15432:3306 -d mysql:latest
```
Get into the MySQL container:
```
docker exec -it db_index /bin/bash
```
*Note: this step may require sudo privileges.
Get into the MySQL server in the container:
```
mysql -uroot -p
```
*Note: enter password=root when the server requires authentication.
Create a database named corpus:
```
CREATE DATABASE corpus;
USE corpus;
```
Create a new user:
```
CREATE USER 'longnguyen'@'%' IDENTIFIED BY 'longnguyen';
```
Grant the user sufficient privileges to access the corpus database from SQLAlchemy:
```
GRANT ALL PRIVILEGES ON corpus.* TO 'longnguyen'@'%';
```
Create the document table:
```
CREATE TABLE mlqa_test_articles(id int not null auto_increment, title text, content longtext, publish_date varchar(50), primary key(id)) character set utf8mb4 collate utf8mb4_general_ci;
```
Create the sentence table:
```
CREATE TABLE mlqa_test_sent_articles(id int not null auto_increment, sentence text, doc_id int, primary key(id), foreign key(doc_id) references mlqa_test_articles(id) on delete cascade) character set utf8mb4 collate utf8mb4_unicode_ci;
```
Misc
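As a quick check that the credentials above line up, the SQLAlchemy connection URL for this setup can be sketched with the standard library only (the `mysql+pymysql` driver name is an assumption; use whichever MySQL driver requirements.txt installs):

```python
# Sketch: build the SQLAlchemy connection URL for the MySQL setup above.
# The "mysql+pymysql" driver name is an assumption; substitute the
# driver your environment actually provides.
USER = "longnguyen"
PASSWORD = "longnguyen"
HOST = "localhost"
PORT = 15432          # host port mapped to the container's 3306
DATABASE = "corpus"

def connection_url(user, password, host, port, database):
    """Return a SQLAlchemy-style database URL."""
    return f"mysql+pymysql://{user}:{password}@{host}:{port}/{database}"

url = connection_url(USER, PASSWORD, HOST, PORT, DATABASE)
print(url)  # pass this to sqlalchemy.create_engine(url)
```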
Check the tables:
```
SHOW TABLES;
DESCRIBE mlqa_test_articles;
DESCRIBE mlqa_test_sent_articles;
```
Count the number of entries in the tables:
```
SELECT COUNT(*) FROM mlqa_test_articles;
SELECT COUNT(*) FROM mlqa_test_sent_articles;
```
Move to the dataset/ directory:
```
cd dataset/
```
ElasticSearch
```
python import_es.py
```
MySQL
```
python import_db.py
```
For deployment, make sure ElasticSearch and MySQL are running.
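A quick way to verify both services are reachable before deploying is to probe the two ports from the steps above (9200 for ElasticSearch, 15432 for MySQL). A stdlib-only sketch:

```python
# Sanity check that ElasticSearch (9200) and MySQL (15432) are reachable
# before starting the backend. Pure stdlib; adjust hosts/ports as needed.
import socket

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for name, port in [("ElasticSearch", 9200), ("MySQL", 15432)]:
    status = "up" if port_open("localhost", port) else "DOWN"
    print(f"{name} on localhost:{port}: {status}")
```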
This step requires 3 separate shells:
- Backend
- Encoder
- Frontend
Move to the backend/ directory and deploy the server on 0.0.0.0 with port 8888 by running the command below:
```
python manage.py runserver 0.0.0.0:8888
```
Move to the encoder/ directory and change `ROOT_DIR` to the absolute path of the project directory, e.g.,
```
ROOT_DIR = "/home/username/FactCheck-QA/"
```
Then, run the command below:
```
python encoder_server.py
```
*Note: change the serving address of the Encoder via the serve() function in encoder/encoder_server.py.
NodeJS: >= 16.0
*Note: use nvm to switch to a newer NodeJS version.
Move to the frontend/ directory and run the command below once to install the dependencies:
```
npm install
```
Then, every time the frontend needs to be deployed, just run the command below:
```
npm run dev
```
*Note: the frontend interacts with the backend via the API. You can change the backend address in frontend/src/@core/utils/api/api.js.
Look into dataset/[Research]_Sentence_processing_for_SquAD_format_dataset.ipynb, or reuse the available resources of the MLQA dataset we provide.
API format and relevant documents of the backend can be found in backend/docs.
The system serves three services through its API; you can request them via the endpoints below, passing a field named data as form data:
- localhost:8888/api/search/relevance/: information retrieval; retrieves relevant data given a piece of information
- localhost:8888/api/search/answering/: question answering; answers a question given by the user
- localhost:8888/api/search/inference/: fact-checking; returns a list of evidence supporting/refuting the claim given by the user
*Note: the URL strictly requires the trailing slash / at the end. Also, replace the address if the backend runs on a different port or address.
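A minimal client for these endpoints can be written with the standard library alone; the sketch below builds the URLs (with the mandatory trailing slash) and posts the `data` form field. It assumes the backend from the deployment step is running on localhost:8888; nothing is sent until `query()` is called.

```python
# Sketch: call the three search endpoints with a `data` form field.
# Assumes the backend is deployed on localhost:8888; the request is
# only sent when query() is invoked.
import urllib.parse
import urllib.request

BASE = "http://localhost:8888/api/search/"

def endpoint(service):
    """Build an endpoint URL; the trailing slash is mandatory."""
    return f"{BASE}{service}/"

def query(service, text):
    """POST the `data` form field and return the raw response body."""
    body = urllib.parse.urlencode({"data": text}).encode()
    with urllib.request.urlopen(endpoint(service), data=body) as resp:
        return resp.read().decode()

# e.g. query("answering", "What is the capital of Vietnam?")
print(endpoint("relevance"))
```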
There are some notes for this project:
- There is an indexing mismatch between the indices mentioned in the comments, for example:
  - MySQL: 1-indexed
  - ElasticSearch: 0-indexed
  - FAISS: 0-indexed
- How to change the language models downloaded from HuggingFace: there are two files to be concerned with:
  - `backend/apps/search/components/config.py`:
    - `READER_MODEL`: question-answering model
    - `INFERRER_MODEL`: natural language inference model, either a text-classification model following the NLI scheme or a zero-shot classification model. Note that different models have different output styles, hence `backend/apps/search/components/inferrer.py` must be configured to comply with the output format.
  - `encoder/encoder_server.py`:
    - `EMBEDDING_MODEL`: embedding model released by SBERT
    - `FEATURE_SIZE`: dimension of the output the embedding model produces
- We use `IndexFlatL2` combined with `IndexIDMap` of FAISS for semantic search
- Remember to alter the address and port of different services in the
config.pyfile. Also,Kindicates number of documents retrieved in the full-text search conducted by ElasticSearch, whileLindicating number of sentences to retrieved fromKdocuments by semantic search. In other words, in information retrieval step,K->Ldocuments are retrieved. - Reader offers two modes:
concatandensemble, which can be set in theconfig.py:concat: concatenates L retrieved data into a context and, with the question, put it to the language modelensemble: each data in L retrieved data will be treated as a context and the final step is filter out an answer with highest confidence score provided by the language model. Note: this mode usually produces less accurate results.
- By default, the project runs only on CPU. Therefore, considering switching
deviceto 0 orgpu, etc. for better productivity with GPU if available.
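The `IndexFlatL2` + `IndexIDMap` combination mentioned in the notes above can be illustrated without FAISS itself: exact (brute-force) L2 search plus a mapping from internal row positions to caller-supplied document/sentence IDs. A pure-Python sketch for small data (FAISS does the same thing efficiently at scale):

```python
# Pure-Python illustration of FAISS's IndexFlatL2 wrapped in IndexIDMap:
# exact L2 search that returns caller-supplied IDs instead of row positions.

def l2_squared(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

class FlatL2WithIDs:
    def __init__(self):
        self.vectors = []
        self.ids = []

    def add_with_ids(self, vectors, ids):
        """Store vectors alongside the external IDs to report for them."""
        self.vectors.extend(vectors)
        self.ids.extend(ids)

    def search(self, query, k):
        """Return the k (distance, id) pairs nearest to query."""
        scored = sorted(
            (l2_squared(query, v), i) for v, i in zip(self.vectors, self.ids)
        )
        return scored[:k]

index = FlatL2WithIDs()
# IDs 101-103 stand in for e.g. MySQL sentence-table primary keys.
index.add_with_ids([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]], [101, 102, 103])
print(index.search([0.9, 1.1], k=2))
```

Mapping FAISS results back to external IDs this way is what lets the semantic-search step hand sentence IDs straight to the MySQL tables.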
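The Reader's `concat` and `ensemble` modes described above can be sketched with a stub QA model that returns (answer, confidence) pairs. The stub and both function names are hypothetical; the real Reader wraps a HuggingFace question-answering model:

```python
# Sketch of the Reader's two modes over L retrieved contexts.
# qa_model is a stub: the real one is a transformer QA model.

def qa_model(question, context):
    """Stub QA: 'answer' is the longest word shared with the question."""
    shared = set(question.lower().split()) & set(context.lower().split())
    if not shared:
        return ("", 0.0)
    answer = max(shared, key=len)
    return (answer, len(shared) / len(question.split()))

def read_concat(question, contexts):
    """concat mode: join the L contexts into one and ask once."""
    return qa_model(question, " ".join(contexts))

def read_ensemble(question, contexts):
    """ensemble mode: ask per context, keep the highest-confidence answer."""
    return max((qa_model(question, c) for c in contexts), key=lambda r: r[1])

contexts = ["hanoi is located in vietnam", "faiss is a library"]
print(read_concat("where is hanoi located", contexts))
print(read_ensemble("where is hanoi located", contexts))
```

`concat` makes a single model call over a long context, while `ensemble` makes L calls and relies on the confidence scores being comparable across contexts, which is one reason it tends to be less accurate.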
Authors:


