HalluMat: Hallucination Detection in Scientific LLMs
    B. P. Vangala, S. Mahmud, P. Neupane, J. Selvaraj, J. Cheng
    
      - Primary Innovation: Developed HalluMatData, a benchmark dataset designed to evaluate hallucination detection methods in domain-specific large language models applied to materials science research (a hypothetical record sketch follows this list)
 
      - Technical Framework: Implemented HalluMatDetector, a multi-stage detection pipeline that combines intrinsic verification, multi-source knowledge retrieval, contradiction graph analysis, and metric-based assessment (a minimal pipeline sketch appears after this list)
 
      - Quantitative Results: Reduced hallucination rates by 30% relative to baseline LLM outputs, with particular effectiveness on high-entropy queries across multiple materials science subdomains
 
      - Methodological Contribution: Introduced the Paraphrased Hallucination Consistency Score (PHCS), a metric quantifying inconsistencies in LLM responses across semantically equivalent queries and offering deeper insight into model reliability (an illustrative computation follows this list)
 
      - Knowledge Integration: Combined knowledge graph-based contradiction detection with fine-grained factual verification to establish a more reliable and interpretable framework for AI-assisted scientific discovery (a toy contradiction-graph example follows this list)
 
      - Domain Impact: Addresses critical challenges in research integrity by providing tools to detect and mitigate factually incorrect or misleading information generation in scientific contexts
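
The summary does not specify HalluMatData's schema, so the record below is a purely hypothetical sketch of what one benchmark entry for this task might contain; every field name is an assumption, not the published format.

```python
from dataclasses import dataclass


# Hypothetical HalluMatData-style record; all fields are illustrative
# assumptions, not the paper's actual schema.
@dataclass
class BenchmarkEntry:
    query: str              # the materials science question
    paraphrases: list       # semantically equivalent rewordings of the query
    reference_answer: str   # ground truth used for factual verification
    subdomain: str          # e.g. "photocatalysis", "battery materials"
    high_entropy: bool = False   # marks queries prone to inconsistent answers


entry = BenchmarkEntry(
    query="What is the band gap of anatase TiO2?",
    paraphrases=["Report the band gap of TiO2 in its anatase phase.",
                 "How wide is anatase TiO2's band gap?"],
    reference_answer="Approximately 3.2 eV.",
    subdomain="photocatalysis",
)
```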
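Likewise, HalluMatDetector's implementation is not reproduced here. The sketch below only illustrates the four-stage shape named above, with crude lexical-overlap heuristics standing in for the paper's actual verification, retrieval, and graph components; the thresholds and stage weights are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Report:
    claim: str
    self_consistent: bool = True                  # stage 1: intrinsic verification
    evidence: list = field(default_factory=list)  # stage 2: multi-source retrieval
    contradicted: bool = False                    # stage 3: contradiction graph
    score: float = 0.0                            # stage 4: metric-based assessment


def token_overlap(a: str, b: str) -> float:
    """Jaccard token overlap -- a crude proxy for entailment/agreement."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, len(ta | tb))


def detect(claim: str, resamples: list, corpus: list, contradictions: set) -> Report:
    r = Report(claim)
    # Stage 1: does the model agree with itself across resampled generations?
    r.self_consistent = all(token_overlap(claim, s) > 0.5 for s in resamples)
    # Stage 2: retrieve corroborating passages from trusted sources.
    r.evidence = [p for p in corpus if token_overlap(claim, p) > 0.3]
    # Stage 3: is the claim party to a known contradiction pair?
    r.contradicted = any(claim in pair for pair in contradictions)
    # Stage 4: fold the signals into one score (weights arbitrary here;
    # 0 = grounded, 1 = likely hallucinated).
    r.score = (0.4 * (not r.self_consistent)
               + 0.4 * (not r.evidence)
               + 0.2 * r.contradicted)
    return r
```

The point the sketch tries to capture is that each stage contributes an independent signal, so a claim is scored as grounded only if it survives self-consistency checks, evidence retrieval, and the contradiction graph.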
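The PHCS formula itself is not given in this summary. One plausible instantiation, assumed here, scores the answers to paraphrases of a single query by their mean pairwise agreement: values near 1 indicate consistent answers, values near 0 flag the paraphrase sensitivity the metric is meant to expose. The Jaccard overlap is a stand-in for a real semantic-similarity or entailment model.

```python
from itertools import combinations


def pairwise_agreement(a: str, b: str) -> float:
    """Jaccard token overlap; a placeholder for a learned similarity model."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, len(ta | tb))


def phcs(answers: list) -> float:
    """Assumed PHCS form: mean pairwise agreement among answers to
    semantically equivalent (paraphrased) queries."""
    pairs = list(combinations(answers, 2))
    return sum(pairwise_agreement(a, b) for a, b in pairs) / max(1, len(pairs))


# A model that answers paraphrases consistently scores 1.0; answers that
# drift across rewordings of the same question score much lower.
print(phcs(["anatase TiO2 band gap is about 3.2 eV"] * 3))             # 1.0
print(phcs(["band gap 3.2 eV", "band gap 1.1 eV", "it is metallic"]))  # 0.2
```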
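Finally, a toy version of the contradiction-graph idea, built with networkx (an assumption; the paper's graph tooling is not named). Nodes are atomic claims, an edge asserts that two claims cannot both hold, and a generated claim is flagged when it shares a contradiction edge with a trusted knowledge-base fact.

```python
import networkx as nx

g = nx.Graph()
# A trusted knowledge-base fact vs. a model-generated claim; provenance
# is recorded as a node attribute.
g.add_node("anatase TiO2 band gap ~3.2 eV", trusted=True)
g.add_node("anatase TiO2 band gap ~1.1 eV", trusted=False)
# The edge encodes mutual exclusivity: both values cannot be correct.
g.add_edge("anatase TiO2 band gap ~3.2 eV", "anatase TiO2 band gap ~1.1 eV")


def is_contradicted(claim: str, graph: nx.Graph) -> bool:
    """Flag a claim that contradicts any trusted fact in the graph."""
    return any(graph.nodes[n].get("trusted", False)
               for n in graph.neighbors(claim))


print(is_contradicted("anatase TiO2 band gap ~1.1 eV", g))  # True: rejected
```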
 
      Abstract
      Artificial Intelligence (AI), particularly Large Language Models (LLMs), is transforming scientific discovery, enabling rapid knowledge generation and hypothesis formulation. However, a critical challenge is hallucination, where LLMs generate factually incorrect or misleading information, compromising research integrity. To address this, we introduce HalluMatData, a benchmark dataset for evaluating hallucination detection methods, factual consistency, and response robustness in AI-generated materials science content. Alongside it, we propose HalluMatDetector, a multi-stage hallucination detection framework integrating intrinsic verification, multi-source retrieval, contradiction graph analysis, and metric-based assessment to detect and mitigate LLM hallucinations. Our findings reveal that hallucination levels vary significantly across materials science subdomains, with high-entropy queries exhibiting greater factual inconsistencies. By utilizing HalluMatDetector’s verification pipeline, we reduce hallucination rates by 30% compared to standard LLM outputs. Furthermore, we introduce the Paraphrased Hallucination Consistency Score (PHCS) to quantify inconsistencies in LLM responses across semantically equivalent queries, offering deeper insights into model reliability. Combining knowledge graph-based contradiction detection and fine-grained factual verification, our dataset and framework establish a more reliable, interpretable, and scientifically rigorous approach for AI-driven discoveries.