Benchmarking of Document Image Analysis Tasks for Palm Leaf Manuscripts from Southeast Asia

Kesiman, Made Windu Antara; Valy, Dona; Burie, Jean-Christophe; Paulus, Erick; Suryani, Mira; Hadi, Setiawan; Verleysen, Michel; Chhun, Sophea; Ogier, Jean-Marc

DIAL.pr - BOREAL

Kesiman, Made Windu Antara [Laboratoire Informatique Image Interaction (L3i), Université de La Rochelle, France]
Valy, Dona [UCL]
Burie, Jean-Christophe [Laboratoire Informatique Image Interaction (L3i), Université de La Rochelle, France]
Paulus, Erick [Department of Computer Science, Universitas Padjadjaran, Bandung, Indonesia]
Suryani, Mira [Department of Computer Science, Universitas Padjadjaran, Bandung, Indonesia]
Verleysen, Michel [UCL]
Ogier, Jean-Marc [Laboratoire Informatique Image Interaction (L3i), Université de La Rochelle, France]
et al.

This paper presents a comprehensive evaluation of the principal tasks in document image analysis (DIA) on a new and challenging collection of palm leaf manuscripts from Southeast Asia. The tasks range from binarization, text line segmentation, and isolated character/glyph recognition to word recognition and transliteration. The experiments are performed on a complete dataset of Southeast Asian palm leaf manuscripts, presented in this work, that covers three scripts: Khmer script from Cambodia, and Balinese and Sundanese scripts from Indonesia. For binarization, many methods are evaluated, including the latest ones from binarization competitions. For text line segmentation, the seam carving method is compared with a recent text line segmentation method for palm leaf manuscripts. For isolated character/glyph recognition, results are reported for a handcrafted feature extraction method, a neural network with unsupervised feature learning, and a Convolutional Neural Network (CNN) based method. Finally, a Recurrent Neural Network with Long Short-Term Memory (RNN-LSTM) is used for the word recognition and transliteration tasks. The results of all experiments provide the latest findings and a quantitative benchmark for palm leaf manuscript analysis for researchers in the DIA community.
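To make the first task in the abstract concrete, the following is a minimal sketch of binarization using Otsu's global threshold, a classical baseline. It is an illustration only, not the paper's pipeline: the abstract does not specify which methods were benchmarked, and the function names `otsu_threshold` and `binarize`, the 8-bit grayscale input, and the convention that ink is darker than the leaf are all assumptions made here.

```python
from typing import List, Sequence


def otsu_threshold(pixels: Sequence[int], levels: int = 256) -> int:
    """Return the gray level t that maximizes the between-class variance.

    Pixels with value <= t form the dark class, the rest the bright class.
    Assumes integer gray levels in the range [0, levels).
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the dark class so far
    sum0 = 0  # sum of gray levels in the dark class so far
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # dark class still empty
        w1 = total - w0
        if w1 == 0:
            break  # bright class empty, no further split possible
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance, up to the constant factor 1 / total**2.
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t


def binarize(image: List[List[int]]) -> List[List[int]]:
    """Map a grayscale image (rows of ints) to 1 = ink, 0 = background.

    Assumption: ink is darker than the palm leaf surface.
    """
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    return [[1 if p <= t else 0 for p in row] for row in image]


if __name__ == "__main__":
    tiny = [[10, 200], [205, 11]]
    print(binarize(tiny))
```

A single global threshold like this tends to struggle on degraded documents with uneven background, which is one reason benchmarks of this kind usually compare it against locally adaptive methods.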

metadata

Document type	Journal article – Research article
Access type	Open access
Publication date	2018
Language	English
Journal information	"Journal of Imaging" - Vol. 4, no. 43, p. 27 (2018)
Peer reviewed	yes
ISSN	2313-433X
Publication status	Published
Affiliation	UCL - SST/ICTM/ELEN - Pôle en ingénierie électrique
Keywords	Document image analysis; Binarization; Character recognition; Text line segmentation; Word recognition; Transliteration; Palm leaf manuscript; Dataset; Benchmark; Experimental test
Links	DOI: https://doi.org/10.3390/jimaging4020043 ; Handle: http://hdl.handle.net/2078.1/211073

Bibliographic reference	Kesiman, Made Windu Antara ; Valy, Dona ; Burie, Jean-Christophe ; Paulus, Erick ; Suryani, Mira ; et al. Benchmarking of Document Image Analysis Tasks for Palm Leaf Manuscripts from Southeast Asia. In: Journal of Imaging, Vol. 4, no. 43, p. 27 (2018)
Permanent URL	http://hdl.handle.net/2078.1/211073
