International Journal of Reliable Information and Assurance
Volume 4 No. 2, 2016, pp 7-12
A Text-learning Based Method of Detecting Personal Information in Image Files
Personal and corporate damage caused by the leakage of files containing personal information is increasing, and the leaked information is itself being exploited through illegal distribution. This study addresses the issue by developing software that detects personal information embedded in image files through text-based training. The software uses the Tesseract Optical Character Recognition (OCR) engine to convert the characters contained in images to text, and then applies the Levenshtein distance algorithm to check the extracted text for similarity to personal information. In addition, the likelihood that an image file contains personal information is expressed as a percentage. Text training improves both the text recognition capability and the accuracy of personal information recognition. With this software, existing personal information detection functions can be extended to cover text files, image files, and other file types. This makes it possible to block and prevent leaks of personal information contained in image files stored on personal and corporate PCs.
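As an illustration of the similarity step described above, the following is a minimal sketch in Python. It is not the authors' implementation: the function names, the keyword list, and the normalization of edit distance into a percentage are all assumptions made for this example, and the OCR step is left out, so the input string simply stands in for text that an OCR engine such as Tesseract might return.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insert, delete, substitute; cost 1 each)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity_percent(a: str, b: str) -> float:
    """Assumed scoring: 100% for identical strings, 0% when every character differs."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return (1 - levenshtein(a, b) / longest) * 100


def score_text(ocr_text: str, keywords: Sequence[str]) -> float:
    """Best similarity between any whitespace-separated token and any keyword.

    `keywords` is a hypothetical list of personal-information terms.
    """
    tokens = ocr_text.lower().split()
    if not tokens or not keywords:
        return 0.0
    return max(similarity_percent(t, k.lower()) for t in tokens for k in keywords)


if __name__ == "__main__":
    # "passp0rt" mimics an OCR misread of "passport" (one substituted character).
    print(score_text("Name J0hn passp0rt No", ["passport"]))
```

Scoring against edit distance rather than exact matches lets a token with a single OCR misread still score highly against its keyword, which is the motivation for pairing OCR output with a distance-based comparison.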