
MT4NLP - Metamorphic Relations Catalog For Testing NLP Systems

This catalog contains a collection of 191 metamorphic relations (MRs) for testing natural language processing (NLP) systems.

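Metamorphic testing is a technique that helps to identify potential issues in software by checking whether a system's outputs change in the expected way when its inputs are modified. Because each check compares the outputs of related inputs with one another, it does not need a known correct output for every input.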
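As a rough illustration of the idea, the sketch below checks one invariance-style relation: appending a sentence with no sentiment should leave the predicted sentiment unchanged. Everything in it is an assumption made for illustration and is not taken from the catalog or from LLMorph: `toy_sentiment` is a hypothetical stand-in for the system under test, and `append_neutral_sentence` and `check_invariance` are hypothetical helper names.

```python
from typing import Callable, List

# Hypothetical stand-in for the NLP system under test: a tiny lexicon-based
# sentiment scorer. A real test would call a model or an LLM here instead.
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "terrible", "awful", "hate"}


def toy_sentiment(text: str) -> str:
    """Label a text as positive, negative, or neutral by counting lexicon hits."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def append_neutral_sentence(text: str) -> str:
    """Input transformation: add a sentence that carries no sentiment."""
    return text + " The package arrived on Tuesday."


def check_invariance(
    system: Callable[[str], str],
    transform: Callable[[str], str],
    source_inputs: List[str],
) -> List[str]:
    """Return the source inputs whose output changes after the transformation."""
    violations = []
    for source in source_inputs:
        follow_up = transform(source)
        # The relation holds when source and follow-up outputs are equal.
        if system(source) != system(follow_up):
            violations.append(source)
    return violations


inputs = [
    "I love this phone, the camera is great.",
    "The battery is terrible.",
    "It is a phone.",
]
violations = check_invariance(toy_sentiment, append_neutral_sentence, inputs)
print(violations)
```

Note that the check never needs a labelled "correct" sentiment for any input; it only compares the output for each source input with the output for its transformed follow-up. An input returned by `check_invariance` signals a violated relation and is worth inspecting.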

Note: These metamorphic relations were used in the paper for testing Large Language Models (LLMs), but they are general and applicable to any NLP system.

Contributors: Steven Cho, Stefano Ruberto, Valerio Terragni

For the paper: Metamorphic Testing of Large Language Models for Natural Language Processing
Accepted at the 41st IEEE International Conference on Software Maintenance and Evolution (ICSME 2025 — research track)

Some MRs in this list are implemented in our tool, LLMorph (Metamorphic Testing of Large Language Models), which is available on GitHub: https://github.com/steven-b-cho/llmorph

GitHub repository for this catalog: https://github.com/mt4nlp/mt4nlp.github.io

How to contribute new MRs: Fork the repository, add your new MRs, and submit a pull request (PR) with your changes. Make sure your PR clearly describes the new metamorphic relations and the tasks they apply to.

In this catalog, each Metamorphic Relation (MR) consists of:

Task acronyms
  • CM: Content moderation — Detecting inappropriate or harmful content in a text (11 MRs)
  • CR: Coreference resolution — Determining which words in a text refer to the same entity (1 MR)
  • DS: Dialogue system — Conversing with humans (7 MRs)
  • DSp: Dialogue system (persona-based) — Conversing with humans while maintaining a consistent personality (5 MRs)
  • FN: Fake news detection — Detecting whether a text has false or misleading information (9 MRs)
  • IR: Information retrieval — Finding relevant information from a text (10 MRs)
  • LSR: Lexical semantic relations — Analysing relationships between words (1 MR)
  • NER: Named entity recognition — Identifying and classifying words into categories (17 MRs)
  • NLI: Natural language inference — Determining whether a hypothesis follows from a premise (10 MRs)
  • PD: Plagiarism detection — Detecting instances of copied or unoriginal text (1 MR)
  • PDq: Plagiarism detection (query-based) — Detecting instances of copied or unoriginal text using a specific query (7 MRs)
  • PST: Part-of-speech tagging — Classifying words into grammatical categories (4 MRs)
  • QA: Question answering — Answering questions in natural language, using no context (25 MRs)
  • QAb: Question answering (boolean) — Answering true or false questions, using no context (7 MRs)
  • QAc: Question answering (incl. context) — Answering questions in natural language, using context (14 MRs)
  • QAm: Question answering (multi-choice) — Answering multi-choice questions, using no context (8 MRs)
  • RE: Relation extraction — Identifying relationships between two entities in a text (8 MRs)
  • SA: Sentiment analysis — Determining how positive or negative a text is (32 MRs)
  • SD: Stance detection — Determining the agreement between two texts (1 MR)
  • SM: Summarisation — Condensing text while retaining key information (17 MRs)
  • TC: Text classification — Determining whether a text falls into predefined categories (17 MRs)
  • TD: Toxicity detection — Identifying whether a text is offensive or harmful (10 MRs)
  • TR: Translation — Converting text from one language to another (24 MRs)
  • TS: Text similarity — Determining the similarity between two texts (8 MRs)