We introduce a metaphorical user simulator (MetaSim) for
end-to-end TDS evaluation (see metasim/).
We also introduce a Tester-based evaluation
framework to generate variants (see system/).
We additionally share the annotation website at web/.
The code is built on PyTorch and Hugging Face Transformers.
🌟 [New 2022.11] We introduce Simtester, an open-source toolkit for evaluating user simulators of TOD.
The code for MetaSim is available at `metasim/`:
- Metaphor retriever code: `metasim/metaphor.py` and `metasim/retrieval.py`
- Metaphor ranker training code: `metasim/train_rerank.py`
- Policy module training code: `metasim/train_policy.py`
- Training code for the other modules (NLU, NLG, etc.): `metasim/train_mwoz.py`

`metasim/preference.py` shows how we generate the user preferences, and `metasim/test_server.py` shows how the simulator interacts with systems.
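The metaphor retriever's core idea, reasoning about an unseen item by analogy to the most similar known items, can be sketched as follows. This is a minimal illustration with hypothetical feature vectors, not the released retriever:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve_metaphors(new_item, known_items, k=2):
    """Return the k known items most similar to an unseen item.

    `new_item` is a feature vector; `known_items` maps item name -> vector.
    The simulator can then reason about the unseen item by referring to
    these retrieved neighbours as prior knowledge.
    """
    scored = sorted(known_items.items(),
                    key=lambda kv: cosine(new_item, kv[1]),
                    reverse=True)
    return [name for name, _ in scored[:k]]

# Toy example: an unseen restaurant scored against known restaurants.
known = {
    "cheap_italian": [1.0, 0.0, 0.2],
    "expensive_french": [0.0, 1.0, 0.9],
    "cheap_thai": [0.9, 0.1, 0.3],
}
print(retrieve_metaphors([1.0, 0.1, 0.2], known, k=2))
# → ['cheap_italian', 'cheap_thai']
```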
Usage:

```shell
python metasim/train_mwoz.py
```
The code is still being cleaned up, and we will follow up with the driver code to make it easier to use.
The code for Tester is available at `system/` (the systems are built on SOLOIST):
- The training script: `python system/train.py`
- The testing script: `python system/decode.py`
- The evaluation metrics are available at `system/eval_metric.py`

The code is still being cleaned up, and we will follow up with an interface that can be called via `import tester` in Python.
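As a rough sketch of what such a `tester` interface might look like: the class and method names below are our assumptions (the interface has not been released yet), but the workflow, letting a simulator converse with system variants of different capability and checking whether the resulting scores rank the variants correctly, follows the Tester framework described above:

```python
class Tester:
    """Hypothetical sketch of a `tester` interface; names are assumptions."""

    def __init__(self, variants):
        # Each variant is a dialogue system with a different capability level.
        self.variants = variants

    def run(self, simulator, n_dialogues=10):
        """Let the simulator converse with every variant and collect scores."""
        return {name: [simulator.interact(system) for _ in range(n_dialogues)]
                for name, system in self.variants.items()}

    def ranks_consistent(self, scores, expected_order):
        """A good simulator should rank stronger variants above weaker ones."""
        means = {name: sum(s) / len(s) for name, s in scores.items()}
        observed = sorted(means, key=means.get, reverse=True)
        return observed == list(expected_order)

# Stand-ins for a real TDS variant and user simulator.
class DummySystem:
    def __init__(self, quality):
        self.quality = quality

class DummySimulator:
    def interact(self, system):
        # A real simulator would hold a dialogue and score task success.
        return system.quality

tester = Tester({"strong": DummySystem(0.9), "weak": DummySystem(0.3)})
scores = tester.run(DummySimulator(), n_dialogues=3)
print(tester.ranks_consistent(scores, ["strong", "weak"]))
# → True
```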
We implement our models on MultiWOZ, ReDial, and JDDC.
- We use the original entity annotations in MultiWOZ 2.1.
- We identify the entities in ReDial using spaCy and link them to DBpedia (the linker is provided by CR-Walker).
- We identify the entities in JDDC using LTP and link them to e-commerce knowledge provided by Aliyun.
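A minimal sketch of the entity-linking step: a toy in-memory knowledge base stands in for DBpedia / the Aliyun knowledge, and simple string normalization stands in for the spaCy/LTP mention detection and the CR-Walker / Aliyun linkers:

```python
import re

# Toy knowledge base standing in for DBpedia / the e-commerce KB.
KB = {
    "the matrix": "dbpedia:The_Matrix",
    "titanic": "dbpedia:Titanic_(1997_film)",
}

def normalize(mention):
    """Lowercase and strip punctuation so surface variants still match."""
    return re.sub(r"[^\w\s]", "", mention).strip().lower()

def link_entities(mentions):
    """Map detected mentions to KB identifiers; unlinked mentions -> None."""
    return {m: KB.get(normalize(m)) for m in mentions}

print(link_entities(["The Matrix!", "Titanic", "Unknown Film"]))
# → {'The Matrix!': 'dbpedia:The_Matrix',
#    'Titanic': 'dbpedia:Titanic_(1997_film)',
#    'Unknown Film': None}
```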
The code for our annotation website is available at `web/`:

```shell
python web/server.py
```
Task-oriented dialogue systems (TDSs) are assessed mainly by offline or human evaluation, which is either limited to single turns or very time-intensive. Alternatively, user simulators that mimic user behavior allow us to enumerate user goals and generate human-like conversations for simulated evaluation. However, employing existing user simulators to evaluate TDSs is challenging, as they are primarily designed to optimize dialogue policies for TDSs and have limited evaluation capability; moreover, the evaluation of user simulators themselves remains an open challenge. This work proposes a metaphorical user simulator for end-to-end TDS evaluation, together with a Tester-based evaluation framework that generates dialogue systems with different capabilities, called variants. Our user simulator constructs a metaphorical user model that assists the simulator in reasoning by referring to prior knowledge when it encounters new items. We assess the simulators by examining the simulated interactions between simulators and variants. Our experiments are conducted on three TDS datasets. The metaphorical user simulator demonstrates better consistency with manual evaluation on all three datasets, the Tester framework demonstrates efficiency, and our approach demonstrates better generalization and scalability.
Metaphorical User Simulators for Evaluating Task-oriented Dialogue Systems
