EntSun News

PQ PDF Research Introduces "Semantic Nondeterminism" Through Analysis of 24,824 Real PDFs

New research argues that identical document bytes can yield different machine-readable realities, challenging assumptions used by AI, search, compliance, and digital forensics systems.

O'FALLON, Mo. - EntSun -- PQ PDF Tools has published a new research program examining what it describes as "Semantic Nondeterminism," the phenomenon in which identical document bytes produce multiple valid semantic interpretations across different consumers, even though the file itself has not changed.

The research, available at https://pqpdf.com/research.php, synthesizes findings from multiple studies involving parser disagreement, form-field representation conflicts, OCR-layer divergence, accessibility-tree inconsistencies, and AI document-ingestion behavior. The work is based on analysis of 24,824 real-world PDF documents across three independently measured corpora.

According to the research, PDF was designed to guarantee visual fidelity, so that a page appears consistently across devices and printers, but it was never designed to guarantee semantic determinism: the property that every system extracting information from the file derives the same meaning.

The implications have become increasingly relevant as machine systems consume documents at scale. Search engines, retrieval-augmented generation (RAG) systems, large language models, compliance platforms, e-discovery workflows, and digital-forensics tools often rely on machine-readable representations of documents rather than the rendered page viewed by humans.

Among the findings reported:
  • Analysis of 16,971 PDFs from the publicly released DOJ Epstein document corpus found human-versus-machine "reality drift" in 18.6% of documents.
  • Differential testing of six production PDF parsers identified disagreement in approximately one-third of a curated corpus of malicious and edge-case PDFs.
  • Analysis of IRS tax forms found structural differences between rendered content and extracted text in 43 of 44 forms examined.
  • Research into PDF form architectures documented cases where visible field appearances and stored field values can diverge while remaining covered by a valid digital signature.

The research argues that these mechanisms are often treated as isolated issues but may instead represent evidence of a broader property affecting document interpretation.

"Modern AI systems do not read pages; they read structure," the research states. "The question is no longer whether a file renders correctly. The question is whether every consumer extracts the same meaning from the same bytes."

The publication introduces Semantic Nondeterminism as a proposed framework for studying cross-consumer semantic agreement and document interpretation. Rather than focusing solely on malware detection or format compliance, the research examines how different software systems may derive different semantic realities from the same document.
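As a rough illustration of what checking cross-consumer semantic agreement could look like, the sketch below groups consumers by the text each one extracted from the same file. It is not the method used in the PQ PDF research; the function names and the parser labels are hypothetical, and the extracted strings are assumed to come from whatever PDF consumers are being compared.

```python
import hashlib
import unicodedata
from collections import defaultdict


def normalize(text: str) -> str:
    # Fold Unicode compatibility forms and collapse whitespace so that
    # purely cosmetic differences are not counted as disagreement.
    return " ".join(unicodedata.normalize("NFKC", text).split())


def fingerprint(text: str) -> str:
    # Hash the normalized text to get a compact interpretation identifier.
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def group_by_interpretation(extractions: dict[str, str]) -> list[list[str]]:
    # Map each distinct interpretation to the consumers that produced it.
    groups = defaultdict(list)
    for consumer, text in extractions.items():
        groups[fingerprint(text)].append(consumer)
    # Largest group first; ties broken alphabetically for stable output.
    return sorted((sorted(g) for g in groups.values()),
                  key=lambda g: (-len(g), g))


def consumers_agree(extractions: dict[str, str]) -> bool:
    # True when every consumer derived the same normalized text.
    return len(group_by_interpretation(extractions)) <= 1


if __name__ == "__main__":
    # Hypothetical outputs from three consumers reading the same bytes.
    sample = {
        "parser_a": "Total due: $100",
        "parser_b": "Total  due:\n$100",
        "parser_c": "Total due: $900",
    }
    print(group_by_interpretation(sample))
    print(consumers_agree(sample))
```

Exact-match comparison after normalization is a deliberately simple choice here. A real study would likely need a more tolerant measure, since reading order, hyphenation, and hidden layers can all differ between consumers without changing the meaning.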

The complete research program, methodology summaries, supporting studies, and corpus findings are available through the PQ PDF Tools research portal.

Research Portal: https://pqpdf.com/research.php

About PQ PDF Tools

PQ PDF Tools develops privacy-focused PDF analysis and document-forensics technologies. The platform provides PDF utilities, forensic analysis capabilities, and document-integrity research with a zero-retention processing model.

Contact
PQ PDF
***@pqcrypta.com


Source: PQ PDF
