Our paper titled “TGIF2: Extended Text-Guided Inpainting Forgery Dataset & Benchmark” was accepted at the Journal on Information Security as part of the collection Advances in Information Forensics and Security. This work extends our previous TGIF paper (IEEE WIFS 2024), and was done in collaboration with Symeon (Akis) Papadopoulos, Dimitris Karageorgiou & Paschalis Giakoumoglou from the Media Analysis, Verification and Retrieval Group (MeVer) of the Centre for Research and Technology Hellas (CERTH), Information Technologies Institute (ITI), in Thessaloniki, Greece. 🇬🇷
TGIF2 builds on TGIF with a larger and more challenging setup. The dataset now contains over 271k manipulated images, includes recent models such as FLUX.1, and introduces random (non-semantic) masks to study potential biases. We also extend the benchmark with new experiments, including fine-tuning to provide localization of manipulations in fully regenerated images and evaluating robustness against AI-based super-resolution.
TGIF2 includes inpainted images created with SD2, SDXL, Adobe Firefly (PS), and three FLUX.1 models. Additionally, three types of masks are used: segmentation, bounding-box (bbox), and random.
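To make the three mask types concrete, here is a minimal, illustrative sketch of how a bbox mask can be derived from a segmentation mask, and how a random (non-semantic) mask can be placed independently of image content. This is not the actual TGIF2 generation pipeline; function names, mask representation (2D lists of 0/1), and parameters are assumptions for illustration only.

```python
import random

def bbox_mask(seg_mask):
    """Derive a bounding-box mask from a binary segmentation mask.

    seg_mask: 2D list of 0/1 values marking the segmented object.
    Returns a same-sized 2D list with the object's tight bounding box filled.
    """
    rows = [r for r, row in enumerate(seg_mask) if any(row)]
    cols = [c for c in range(len(seg_mask[0])) if any(row[c] for row in seg_mask)]
    r0, r1, c0, c1 = min(rows), max(rows), min(cols), max(cols)
    return [[1 if r0 <= r <= r1 and c0 <= c <= c1 else 0
             for c in range(len(seg_mask[0]))]
            for r in range(len(seg_mask))]

def random_mask(height, width, mask_h, mask_w, rng=random):
    """Place a mask_h x mask_w rectangle at a random location,
    independent of any object in the image (a non-semantic mask)."""
    r0 = rng.randrange(height - mask_h + 1)
    c0 = rng.randrange(width - mask_w + 1)
    return [[1 if r0 <= r < r0 + mask_h and c0 <= c < c0 + mask_w else 0
             for c in range(width)]
            for r in range(height)]
```

The key difference is that segmentation and bbox masks are tied to an object, while random masks decouple the edited region from image semantics, which is what makes them useful for probing bias.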
Image Forgery Localization (IFL)
We observe the following challenges with state-of-the-art IFL methods:
- IFL performance drops on newer generative models such as FLUX.1
- As already observed in TGIF, IFL methods fail to localize manipulations in fully regenerated (FR) images
Fine-tuning for localization in fully regenerated (FR) images
To address the failure on FR images, we fine-tune IFL models (namely TruFor and MMFusion) on TGIF2. We demonstrate that these fine-tuned models are able to localize manipulations in fully regenerated images!
However, we also expose a key limitation. When fine-tuning only on semantic manipulation masks (i.e., object-based edits), the models perform well on similar semantically edited data, but their performance drops significantly on random masks (i.e., non-object-based edits). In other words, they learn to rely on semantics rather than actual manipulation traces. For example, in the image below, a model trained on semantic masks correctly detects the edited tennis racket. But when the edit is applied to a random background region, the model still highlights the tennis racket instead of the actual manipulated area. This demonstrates that the model has learned a shortcut: it focuses on salient objects, not on the manipulation itself.
Fine-tuning on FR images with only semantic masks reveals a bias: it fails when evaluated on random masks.
Including random masks during training mitigates this issue and improves performance on both random and semantic edits. This highlights how important training data design is for forensic robustness.
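The training-data mixing described above can be sketched as a simple sampling scheme: each training example is drawn from the random-mask pool with some probability, and from the semantic-mask pool otherwise. The function below is an illustrative sketch only; the actual fine-tuning setup, pool sizes, and mixing ratio used in the paper are not restated here, and `p_random` is a hypothetical knob.

```python
import random

def mixed_batch(semantic_pool, random_mask_pool, k, p_random=0.5, rng=random):
    """Draw a batch of k training samples.

    Each sample comes from random_mask_pool with probability p_random,
    otherwise from semantic_pool, so the model sees both object-based and
    non-object-based edits and cannot rely on semantics alone.
    """
    return [rng.choice(random_mask_pool) if rng.random() < p_random
            else rng.choice(semantic_pool)
            for _ in range(k)]
```

Exposing the model to both pools during fine-tuning is what prevents the "salient object" shortcut: the manipulated region can no longer be predicted from image content alone.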
Synthetic Image Detection (SID)
We observe that SID performance decreases for newer models such as FLUX.1, similar to what we observed for IFL. This reinforces the need for more generalizable detection methods and/or training datasets that keep up with new generative models.
Generative super-resolution
We apply generative super-resolution (Real-ESRGAN) and compare it to standard resizing. The result is clear: traditional resizing has a limited effect, yet generative super-resolution causes a large drop in performance for IFL methods and some SID methods. In other words, common AI-based image enhancement tools can remove forensic traces, making forensic detection significantly harder.
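As a point of reference, the "standard resizing" baseline is just a resolution round-trip: downscale, then upscale back. The sketch below implements this with a minimal nearest-neighbor resize on 2D lists; it is an illustrative stand-in, not the paper's evaluation code, and a generative attack would replace the upscale step with a model such as Real-ESRGAN (not invoked here).

```python
def resize_nearest(img, out_h, out_w):
    """Nearest-neighbor resize of a 2D list of pixel values."""
    in_h, in_w = len(img), len(img[0])
    return [[img[r * in_h // out_h][c * in_w // out_w]
             for c in range(out_w)]
            for r in range(in_h * 0 + out_h)]

def rescale_attack(img, scale=2):
    """Standard-resizing round-trip: downscale by `scale`, then upscale back.

    Swapping the second resize_nearest call for a generative super-resolution
    model is the AI-based enhancement attack studied in the benchmark.
    """
    h, w = len(img), len(img[0])
    small = resize_nearest(img, h // scale, w // scale)
    return resize_nearest(small, h, w)
```

The forensic intuition: interpolation-based resizing perturbs low-level traces only mildly, whereas a generative upscaler resynthesizes pixel statistics, which is why it erases the cues that IFL and SID methods depend on.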
A notable contrast
An interesting pattern emerges when comparing different types of forensic cues. On the one hand, methods relying more on semantic reasoning tend to be more robust to AI-based image enhancement attacks, but more prone to bias (e.g., focusing on objects). On the other hand, methods relying on low-level traces tend to be more faithful to actual manipulations, but easily disrupted by AI-based enhancement. This highlights a fundamental trade-off: both types of cues are important, but neither is sufficient on its own.
Conclusion
TGIF2 shows that detecting modern AI-based image manipulations remains highly challenging:
- New generative models reduce IFL and SID detection performance
- Fine-tuning helps to localize manipulations in fully regenerated AI-based edits; but the fine-tuning process can introduce biases which should be avoided
- AI-based enhancement can remove forensic traces
Overall, our results emphasize the need for forensic methods that generalize across generative models, avoid semantic shortcuts, and remain robust to AI-based post-processing. We hope TGIF2 provides a useful benchmark to support future research in this direction.
The dataset and code for both TGIF and TGIF2 are available on GitHub.
Paper: TGIF2: Extended Text-Guided Inpainting Forgery Dataset & Benchmark
The paper was partly financed by the COM-PRESS project, which received subsidies from the Flemish government’s Department of Culture, Youth & Media (Departement Cultuur Jeugd & Media).
