MA: Differential Testing of YAML Parsers

Abstract:

YAML parsers exhibit inconsistent behavior across programming languages, creating data integrity and security issues. We extended an existing parser testing framework with automated test generation and evaluation for differential testing of YAML parsers. 

We evaluated 16 YAML parsers across 10 programming languages using 1,000 automatically generated test cases. We compared three evaluation approaches: ensemble-based normalization using majority voting among canonical parsers, sentence-transformer (SBERT) embeddings, and large language model (LLM) reasoning.
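As a minimal sketch of the majority-voting idea (not the thesis framework; parser names and outputs below are illustrative), the canonical outputs of several parsers can be tallied, with any parser that deviates from the most common form flagged as a differential:

```python
from collections import Counter

def majority_reference(normalized_outputs):
    """Return the most common normalized form among the ensemble's outputs."""
    reference, _votes = Counter(normalized_outputs.values()).most_common(1)[0]
    return reference

def find_differentials(normalized_outputs):
    """Flag parsers whose normalized output deviates from the majority form."""
    reference = majority_reference(normalized_outputs)
    return {name: out for name, out in normalized_outputs.items() if out != reference}

# Hypothetical normalized outputs for the plain scalar `yes`:
outputs = {
    "parser_a": "true",    # resolved as a YAML 1.1 boolean
    "parser_b": "true",
    "parser_c": '"yes"',   # kept as a YAML 1.2 string
}
print(find_differentials(outputs))  # -> {'parser_c': '"yes"'}
```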

The ensemble-based approach detected semantic differentials in 28.9% to 41.1% of successfully normalized test cases. Manual analysis identified nine categories of parser inconsistencies: four in YAML 1.1 implementations and five in YAML 1.2 implementations. The yaml1.1-canonical ensemble achieved a 91.6% normalization success rate.
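One well-known divergence of the kind counted above (an illustrative example, not necessarily one of the thesis's nine categories): YAML 1.1 resolves the plain scalars `yes`/`no`/`on`/`off` to booleans, while YAML 1.2 treats them as strings. A quick check with PyYAML, which follows YAML 1.1 scalar resolution:

```python
import yaml  # PyYAML implements YAML 1.1 scalar resolution

doc = "enabled: on\nmode: no\n"
print(yaml.safe_load(doc))
# PyYAML (YAML 1.1): {'enabled': True, 'mode': False}
# A YAML 1.2 parser would instead yield {'enabled': 'on', 'mode': 'no'}
```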

Machine learning approaches proved unsuitable for differential detection. LLM-based comparison produced incorrect classifications for 13.3% of test cases. SBERT-based comparison failed to distinguish type-based semantic differences, assigning similarity scores above 0.99 to outputs representing different data types. Fine-tuning SBERT models achieved a maximum Pearson correlation of 0.811, which is insufficient for reliable classification.
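A minimal sketch of the kind of embedding-based comparison that fails, assuming the sentence-transformers library and the all-MiniLM-L6-v2 model (the thesis may have used a different model); the two serialized outputs differ only in data type, a distinction that embedding similarity does not reliably capture:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model; illustrative only

# Two parser outputs that are textually near-identical but semantically different:
# an integer port versus a string port.
out_a = '{"port": 8080}'
out_b = '{"port": "8080"}'

emb = model.encode([out_a, out_b], convert_to_tensor=True)
print(float(util.cos_sim(emb[0], emb[1])))  # near 1.0 despite the type difference
```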

The ensemble-based methodology provides accurate differential detection for YAML parsers when a canonical YAML output mode is available.