MT4NLP - Metamorphic Relations Catalog For Testing NLP Systems
This catalog contains a collection of 191 metamorphic relations (MRs) for testing natural language processing (NLP) systems.
Metamorphic testing is a technique that identifies potential issues in software by checking whether a system's outputs change in the expected way when its inputs are modified in a known way.
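As a minimal sketch of this idea, the snippet below checks one such expectation on a toy keyword-based sentiment classifier. The classifier `toy_sentiment`, the helper names, and the appended sentence are all hypothetical stand-ins chosen for illustration; they are not taken from this catalog or from LLMorph. The expectation tested is that appending a neutral sentence should not change the predicted label.

```python
# Toy stand-in for a system under test (hypothetical; a real test would call an NLP model).
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "awful"}


def toy_sentiment(text: str) -> str:
    """Label text by counting positive and negative keywords."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def append_neutral_sentence(text: str) -> str:
    """Modify the input in a way that should not change its sentiment."""
    return text + " The package arrived on Tuesday."


def label_is_preserved(system, source_input: str) -> bool:
    """Return True if the system gives the same label before and after the change."""
    follow_up_input = append_neutral_sentence(source_input)
    return system(source_input) == system(follow_up_input)


if __name__ == "__main__":
    print(label_is_preserved(toy_sentiment, "I love this phone, the camera is great."))
```

The check needs no ground-truth label: it only compares two outputs of the same system, which is why the technique remains usable when the correct answer for an input is unknown.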
Note: These metamorphic relations were used in the paper for testing Large Language Models (LLMs), but they are general and applicable to any NLP system.
Contributors: Steven Cho, Stefano Ruberto, Valerio Terragni
For the paper:
Metamorphic Testing of Large Language Models for Natural Language Processing
Accepted at the 41st IEEE International Conference on Software Maintenance and Evolution (ICSME 2025, research track)
Some MRs in this catalog are implemented in our tool, LLMorph: Metamorphic Testing of Large Language Models, which is available on GitHub:
https://github.com/steven-b-cho/llmorph
GitHub repository for this catalog:
https://github.com/mt4nlp/mt4nlp.github.io
How to contribute new MRs: Fork the repository, add your new MRs, and submit a pull request (PR) with your changes. Make sure your PR clearly describes the new metamorphic relations and their associated tasks.
In this catalog, each Metamorphic Relation (MR) consists of:
- IR (Input Relation): Describes how the inputs relate to one another.
- OR (Output Relation): Describes the expected relation among the corresponding outputs.
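One way to picture this structure in code is sketched below. It makes two assumptions: the IR is modelled as a function that derives a second input from a first one, and the OR as a predicate over the two resulting outputs. The class `MetamorphicRelation`, the relation `swap_texts`, and the toy `jaccard_similarity` system are hypothetical and for illustration only; the relation shown is not quoted from the catalog.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

TextPair = Tuple[str, str]


@dataclass(frozen=True)
class MetamorphicRelation:
    """Hypothetical container pairing an input relation with an output relation."""
    name: str
    # IR: derives the second input from the first one.
    input_relation: Callable[[TextPair], TextPair]
    # OR: predicate over (output for first input, output for second input).
    output_relation: Callable[[float, float], bool]

    def holds_for(self, system: Callable[[TextPair], float], source: TextPair) -> bool:
        """Run the system on both inputs and check the output relation."""
        follow_up = self.input_relation(source)
        return self.output_relation(system(source), system(follow_up))


def jaccard_similarity(pair: TextPair) -> float:
    """Toy text-similarity system: word-set overlap between the two texts."""
    a, b = (set(text.lower().split()) for text in pair)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Illustrative relation: swapping the two texts should not change the similarity score.
swap_texts = MetamorphicRelation(
    name="swap-texts (illustrative, not from the catalog)",
    input_relation=lambda pair: (pair[1], pair[0]),
    output_relation=lambda out_first, out_second: abs(out_first - out_second) < 1e-9,
)

if __name__ == "__main__":
    source = ("the cat sat on the mat", "a cat sat on a mat")
    print(swap_texts.holds_for(jaccard_similarity, source))
```

Keeping the IR and OR as separate callables means the same relation object can be run against any system with a matching input and output type, which fits the catalog's point that the relations are not tied to one NLP system.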
Task acronyms
- CM: Content moderation — Detecting inappropriate or harmful content in a text (11 MRs)
- CR: Coreference resolution — Determining which words in a text refer to the same entity (1 MR)
- DS: Dialogue system — Conversing with humans (7 MRs)
- DSp: Dialogue system (persona-based) — Conversing with humans while maintaining a consistent personality (5 MRs)
- FN: Fake news detection — Detecting whether a text has false or misleading information (9 MRs)
- IR: Information retrieval — Finding relevant information from a text (10 MRs)
- LSR: Lexical semantic relations — Analysing relationships between words (1 MR)
- NER: Named entity recognition — Identifying and classifying words into categories (17 MRs)
- NLI: Natural language inference — Determining whether a hypothesis follows from a premise (10 MRs)
- PD: Plagiarism detection — Detecting instances of copied or unoriginal text (1 MR)
- PDq: Plagiarism detection (query-based) — Detecting instances of copied or unoriginal text using a specific query (7 MRs)
- PST: Part-of-speech tagging — Classifying words into grammatical categories (4 MRs)
- QA: Question answering — Answering questions in natural language, using no context (25 MRs)
- QAb: Question answering (boolean) — Answering true or false questions, using no context (7 MRs)
- QAc: Question answering (incl. context) — Answering questions in natural language, using context (14 MRs)
- QAm: Question answering (multi-choice) — Answering multi-choice questions, using no context (8 MRs)
- RE: Relation extraction — Identifying relationships between two entities in a text (8 MRs)
- SA: Sentiment analysis — Determining how positive or negative a text is (32 MRs)
- SD: Stance detection — Determining the agreement between two texts (1 MR)
- SM: Summarisation — Condensing text while retaining key information (17 MRs)
- TC: Text classification — Determining whether a text falls into predefined categories (17 MRs)
- TD: Toxicity detection — Identifying whether a text is offensive or harmful (10 MRs)
- TR: Translation — Converting text from one language to another (24 MRs)
- TS: Text similarity — Determining the similarity between two texts (8 MRs)