Please use this identifier to cite or link to this item: http://dx.doi.org/10.25673/119099
Title: Improving text collations by local text resegmentation
Author(s): Dähne, Janis
Ritter, Jörg
Molitor, Paul
Issue Date: 2025
Type: Article
Language: English
Abstract: In almost all current approaches, the collation of large texts is applied to a fixed given segmentation of the two text witnesses to be compared and consists of two consecutive steps. First, the segments of the two texts are aligned, and then the aligned segments are compared in detail. For larger manuscripts or books consisting of many pages, the segments are usually the paragraphs of the texts. When comparing two texts, where the second text is a revised version of the first, poor local alignments can arise. This occurs in places where a paragraph has been split into two smaller paragraphs to insert a new paragraph in between, or where several consecutive sentences have been moved from one paragraph to the previous or next paragraph. Most paragraph collation tools cannot handle these scenarios properly because they align each paragraph with at most one paragraph of the other text. In this paper, we discuss this problem in detail and present a heuristic for resegmenting the two texts to be compared in order to achieve a better collation.
URI: https://opendata.uni-halle.de//handle/1981185920/121055
http://dx.doi.org/10.25673/119099
Open Access: Open access publication
License: (CC BY 4.0) Creative Commons Attribution 4.0
Journal Title: Digital scholarship in the humanities
Publisher: Oxford University Press
Publisher Place: Oxford
Volume: 40
Issue: 2
Original Publication: 10.1093/llc/fqaf033
Page Start: 477
Page End: 186
Appears in Collections: Open Access Publikationen der MLU

Files in This Item:
File: fqaf033.pdf
Size: 1.32 MB
Format: Adobe PDF