An Intelligent Three-Level Digital Watermarking Method for Document Protection

Digital text is the most frequent form of data interchange and often carries sensitive information belonging to organizations such as audit firms, banks, and educational institutes. The integrity and originality of this information must be preserved, both to secure the data and to help identify the ownership of text documents. This paper presents a novel, invisible digital watermarking approach for the secure exchange of text documents over the internet. Over the last decade, digital watermarking has been used to detect forgery and tampering in digital text documents and has successfully supported copyright protection and authentication. Many state-of-the-art watermarking techniques achieve high imperceptibility, robustness, or hidden capacity, but fail to maintain a balance among these three conflicting parameters. To resolve this, we propose an intelligent Three-Level Digital Watermarking (3LDW) system for the copyright protection of text documents. The 3LDW system embeds the watermark in Microsoft Word objects, document open spaces, and text feature coding without affecting the content of the original document. Experimental results show that the proposed 3LDW system strongly resists formatting attacks and efficiently preserves imperceptibility. Additionally, embedding capacity analysis demonstrates a prominent improvement of the proposed system over other similar approaches.


INTRODUCTION
Information security has become a significant problem in the digital world in recent years because a vast amount of data is exchanged via global networks [1]. The exchanged data is a mixture of text, image, audio, and video. Malicious attacks, threats, violations and illegal use of information are the major challenges for information security research [2]. The information security model consists of confidentiality, integrity, and availability, as shown in Fig. 1. Confidentiality means that information is not disclosed to unauthorized individuals. Integrity means that the state of data remains the same when it is transferred from one system to another.

Fig. 1: Information Security Model
Availability means that the data or information is accessible to authorized persons. With the growth of the internet, communication has made many aspects of daily life easier. Nowadays, hackers use malware to access information and violate copyrights [3]. Booksellers and publishers spend millions of dollars on copyright authentication. Various illegal actions violate the copyright protection of digital content, such as illegal copying, tampering, and forgery [4].
Digital watermarking protects the intellectual property of digital content and has played a critical role in the protection of copyright [5,6]. In the past, steganography and cryptography were also used to solve the problem, but digital watermarking provides the best solution [7]. The general life cycle of digital watermarking is presented in Fig. 2: a watermarking system is divided into three parts, namely watermark embedding, attacks, and watermark detection.
In the embedding phase, secret data is embedded into the original signal to produce a watermarked signal, which is usually transmitted to other persons or stored. Any modification of the document by a third party is called an attack. The detection phase, also called verification, is used to authenticate the contents [8][9][10].
It is crucial to maintain data integrity while ensuring the confidentiality and availability of information [11]. Sensitive text documents are part of every company or organization, such as banks, audit firms, and educational institutes. A reliable method is required to authenticate text documents [12,13]. The main contributions of this research are the following:
• The proposed three-level digital watermarking (3LDW) technique is imperceptible, secure and robust against formatting attacks. It is applicable to text document ownership verification and copyright protection.
• The proposed technique is not restricted to a single language. As Microsoft Word supports 91 languages, the proposed technique can be applied in all 91 languages.
The rest of the paper is structured as follows. Section II reviews the related work. Section III describes the proposed methodology and presents the watermark embedding and extraction processes. Experimental results and discussion are presented in Section IV. Section V presents the conclusion and future work.

RELATED WORK
Digital text watermarking emerged in 1994 and has grown steadily since then. Digital watermarking is an active area of research and is categorized by host medium into text, image, audio, and video [14]. The techniques used for text watermarking are classified into image-based, structural-based and hybrid techniques [15].

Image-Based Approach
In this approach, the contents of the watermark information are treated as images or logos [16]. This approach is considered safe against formatting attacks, but it has limited applicability because it is not robust against re-typing attacks [17]. Rizzo et al. [18] suggest a password-based method that embeds the watermark in short text and preserves its appearance and content without converting the text to an image. Liu et al. [19] propose a graphical digital text watermarking method and algorithmic framework that consists of 8 levels: line, pixel, character, paragraph, page pixel, row, etymon, and three other aspects. Each level is divided into three typical features, such as similarity, structural and self-[...]. The authors of [20] propose a technique that measures the print-and-scan transformation, correlates the image before printing, and then embeds the watermark information.

Structural-Based Approach
In this approach, the structure of the text is modified for embedding the watermark, for example by adding extra white spaces, lines, or spaces. The style of writing, the location of letters and words, or double letters are also used for watermarking [21,22]. The drawback of this approach is that when Optical Character Recognition (OCR) is applied, the spaces between characters and words and the writing style are removed, which also destroys the watermark information. Aman et al. [23] propose an open-space, format-based method that embeds the secret message in a Microsoft Word document. The white spaces of the document are targeted to embed the watermark. Liu et al. [24] propose an algorithm based on Chinese text sentence features. The text is segmented into sentences, and the semantic code of the words is used to calculate sentence entropy. Sentence entropy, length, relevance, and a weight function are used to find the weight of each sentence.
The key is used for encryption and is registered with a reliable third party. Zhu et al. [25] propose an algorithm that connects the syllable parts of the Chinese phonetic alphabet. Their algorithm has high resistance and strong robustness against tampering attacks.

Semantic-Based Approach
This approach uses the semantics of words for embedding the watermark, so that the meaning of the text remains the same. Morphological alterations and words in the set are used for data hiding without disturbing the original meaning of the text [26]. A syntactic and semantic approach is proposed by Mir et al. [27], in which the watermark information is first encrypted and then embedded into white spaces using binary controlled characters. White spaces are used to embed the watermark throughout the text content. The approach is appropriate for web pages and offers security to protect watermarks.

Syntactic-Based Approach
In this approach, a syntactic tree is constructed first, and then syntactic conversion is used for watermark embedding. In the syntactic structure, the text consists of sentences and words that can be nouns, verbs, adverbs, adjectives, prepositions, articles, etc. This technique is considered robust but is not applicable to all kinds of text, such as poetry, legal documents and transcripts [28]. Ren et al. [29] propose techniques based on HiCod, HiOpt, HiPhs, and HiMax for text steganography in online short text. All of these techniques are evaluated in terms of security and performance, with respect to hiding ease and hiding rate.

Hybrid Approach
The hybrid approach is developed with the combination of different approaches of text watermarking to correct the weaknesses of each approach. This approach can be applied to extensive text documents and its robustness is also improved [30].
Saeed et al. [31] proposed a hybrid technique based on zero watermarking. The original structure of the document is not changed while embedding the watermark. Two steps are involved: watermark embedding and extraction. Another hybrid approach based on zero watermarking is proposed in [32]. The integrity and originality of text documents are verified with the physical alteration. The proposed algorithm is robust against undetected content changes and is able to confirm proof of originality in tamper detection.
Every public or private organization or company transfers sensitive text documents, such as legal documents, classified reports, declarations, and soft degrees. However, most of the existing text watermarking schemes are either imperceptible, robust, or succeed in obtaining high concealing capacity, but they do not achieve a balance between these conflicting parameters. From this perspective, we propose an intelligent Three-Level Digital Watermarking (3LDW) system which provides copyright protection to text documents. Three-level embedding is applied, which includes Microsoft Word objects, document open spaces, and text feature coding, without affecting the content of the original document.

METHODOLOGY
In this section, we describe the main characteristics of the proposed scheme. The proposed novel 3LDW digital watermarking technique is shown in Fig. 3. It uses the properties of a Microsoft Word document for watermarking without affecting the content and without modifying the Word application. In the proposed technique, the watermark is embedded into the text document. Three different properties, namely Microsoft Word objects, document open spaces, and text features, are taken into consideration. The purpose of three-level embedding is to make the system more secure and efficient: if a formatting attack disturbs the watermark at one level, it can be recovered from the other properties. Table 1 outlines the notations that are used throughout the paper. The proposed scheme is imperceptible, robust and offers a large embedding capacity. The layout of the document is not affected by manipulating these properties, and no standard Word application command can disturb the watermark information. The secret message is encrypted with the help of a private key. Encryption is applied to preserve the watermark and make it difficult for an attacker to recover. Since watermarking is the focus of this research, any encryption algorithm may be applied for encrypting the secret message. The Advanced Encryption Standard (AES) is applied here.
After encryption, the secret message is converted into binary and then divided into n groups. If the resulting watermark W(n) is attacked, only 1/n of the watermark is destroyed, where Wi is a group of watermark information, as shown in equation (1).
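The grouping step can be illustrated with a minimal Python sketch. It assumes that the encrypted message is available as a byte string and that the groups are contiguous and of nearly equal length; the paper does not specify the exact grouping rule, and the function names `to_bits` and `split_into_groups` are hypothetical.

```python
def to_bits(data: bytes) -> str:
    """Convert bytes into a string of '0'/'1' characters, 8 bits per byte."""
    return "".join(f"{byte:08b}" for byte in data)


def split_into_groups(bits: str, n: int) -> list[str]:
    """Split a bit string into n contiguous groups (assumed grouping rule).

    The last groups may be shorter, or empty, when len(bits) is not a
    multiple of n.
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    size = -(-len(bits) // n)  # ceiling division
    return [bits[i * size:(i + 1) * size] for i in range(n)]


# Illustrative input only; in the scheme this would be the AES ciphertext.
bits = to_bits(b"AB")
groups = split_into_groups(bits, 4)
print(groups)
```

Under this assumption, damaging one group leaves the other n - 1 groups intact, which is the property the text describes.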

Three Level Watermark Embedding
After generating the watermark information, the original Microsoft Word document is given as input to the system, and three-level watermark embedding is applied. A Microsoft Word document has many properties that can be manipulated for watermarking without affecting the original document.

First Level Embedding (Word Object)
A Microsoft Word document comprises many Word objects that allow users to interact with and manipulate it. These objects are appropriate for two reasons. First, a large amount of watermark information can be stored without affecting imperceptibility. Second, the watermark information is not affected by any common command of Microsoft Word. These objects are used in documents to preserve macro settings between macro sessions and are stored as part of the document. Algorithm I describes the first level of watermark embedding, where the watermark information is embedded into different objects of the text document with the objective of attaining robustness and security. No common instruction of the Microsoft Word application can disturb the watermark, because the watermark is stored in the preserved macro settings.

Third Level Embedding (Text feature coding)
In third-level embedding, features of the text are used for watermarking. First, the secret message is encrypted and converted into numbers, as shown in Fig. 4. The original text document is given as input, and then spaces, commas, full stops, semicolons, colons, question marks, inverted commas, special characters, and symbols are removed from the plain text. Algorithm III provides the complete procedure for watermark embedding. For example, given a decimal number array such as [ 67027603…...], the first number is 6; after incrementing it by 1, it locates the character at index 7. An increment of 1 is added to every number to handle the value zero. The next number taken from the array is 7, which becomes 8 after the increment; it is added to the previous index (7) and locates the character at the 15th index. The next number, 0, becomes 1 after the increment and denotes the character at the 16th index. The format of each located character is then changed. After all the numbers have been processed, the Microsoft Word document is converted to PDF and shared.
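The index arithmetic in the example above can be sketched in Python as a running sum of the incremented digits. This is only an illustration of the position calculation, not of the character formatting step; the function name `embedding_positions` is hypothetical, and the positions are taken to be counted as in the text (the first digit 6 maps to index 7).

```python
from itertools import accumulate


def embedding_positions(digits):
    """Return the character positions selected by a sequence of digits.

    Each digit is incremented by 1 (so that a digit of 0 still advances)
    and added to the previous position.
    """
    return list(accumulate(d + 1 for d in digits))


# First three digits of the example array in the text.
print(embedding_positions([6, 7, 0]))  # [7, 15, 16]
```

Because every step advances by at least one character, two digits never select the same position.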

Three Level Watermark Extraction
Watermark extraction, or verification, is the reverse process of embedding, as shown in Fig. 5. The watermarked document is given as input, and then three-level extraction is applied. After extraction, the watermark information is regenerated: the numbers are converted into binary, and the binary is then converted into characters. The same key is applied to decrypt the message using the AES algorithm. After decryption, the secret message is compared with the original watermark to authenticate the originality of the document. If the extracted information matches the original watermark, the document is considered original; otherwise, it is considered a tampered or changed document.
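For the text feature level, reversing the embedding amounts to turning the positions of the specially formatted characters back into digits. The sketch below assumes that these positions have already been detected and that they were produced by a running sum of (digit + 1), as in the embedding example; the function names are hypothetical and decryption is left out.

```python
def digits_from_positions(positions):
    """Recover the digit sequence from the marked character positions.

    Inverse of a running sum of (digit + 1): each digit is the gap to the
    previous position minus 1.
    """
    digits = []
    previous = 0
    for position in positions:
        digits.append(position - previous - 1)
        previous = position
    return digits


def is_original(extracted_digits, original_digits):
    """Treat the document as original only when the watermarks match."""
    return list(extracted_digits) == list(original_digits)


recovered = digits_from_positions([7, 15, 16])
print(recovered)                          # [6, 7, 0]
print(is_original(recovered, [6, 7, 0]))  # True
```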

EXPERIMENTAL RESULTS AND DISCUSSION
In this section, we analyse the experimental results of our proposed scheme. The performance of the proposed scheme is measured in terms of capacity, imperceptibility, and robustness. Imperceptibility test is performed in the first sub-section, whereas the second sub-section analyses the robustness and the third sub-section analyses the rate of embedding capacity.

Imperceptibility Test
Imperceptibility is a primary and fundamental requirement of watermarking. The watermark must not be detectable or visible to the human eye; it can be detected only through special processing or by an authorized agency. A statistical analysis is performed, which is more powerful than visual analysis for testing imperceptibility. The similarity score of two strings is computed using the Jaro-Winkler distance [33] according to equation (2). The score is normalized between 0 and 1, where 0 is equivalent to no similarity and 1 represents an exact match. The imperceptibility measure between the two strings is illustrated in Fig. 6, where the similarity for 20 different message strings remains stable for the proposed technique, whereas for the existing techniques [34,35] it varies from 0.83 to 0.97. In the proposed technique, the average similarity is 0.99, which demonstrates that the proposed technique is 99% imperceptible, higher than the previous techniques.
The proposed technique performs excellently in terms of imperceptibility because the watermark is stored in special properties and does not affect the whole document. After embedding the watermark information, the original and the watermarked documents look the same, as shown in Fig. 7. The overall layout of the document is also not affected.
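For reference, a minimal Python sketch of the standard Jaro-Winkler similarity is given below. It assumes the commonly used prefix scaling factor p = 0.1 and a maximum prefix length of 4; the parameters used in the experiments are not stated in the text, so the values here are assumptions.

```python
def jaro(s1: str, s2: str) -> float:
    """Standard Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    # Count characters that match within the sliding window.
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for strings sharing a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * p * (1 - base)


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```

A score of 1.0 is returned only for identical strings, which matches the interpretation of the scale given above.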

Robustness Test
Robustness means that the embedded data remains safe after different formatting attacks are applied. To measure the robustness of the proposed technique, different formatting attacks are applied, as can be seen in Fig. 8. Insertion, deletion, re-ordering and formatting attacks such as font color, font size, italic, underline and text highlighting are applied to the watermarked document.
As shown in Fig. 8, different formatting attacks are applied: the font size of the first word "Abstract" is increased by 2 points (to 16), and a red font color and highlighter are also applied. The first line of the watermarked document is deleted and replaced with dots (….). The formatting of the second line is also changed, and the two words "information security" are inserted. The formatting of the third, fourth and fifth lines is changed to font family Arial, font size 10 points, and font colour red, and the last three words are struck through. The sixth line is converted to italic with a double underline, and only the contents of line seven are made bold. In the last line, the text after "result on Imperceptibility" is cut and pasted at the beginning of line eight. The font family and the font size of the eighth line are changed, and strikethrough is applied. After applying these formatting attacks, the watermark information is still extracted from the Word objects. After applying a 90% deletion attack, 100% of the watermark information is extracted, as shown in Fig. 9.
We apply three-level embedding, so if the data is deleted in formatting attacks it can be recovered from word objects or text feature coding. The circos graph in Fig. 10

Capacity Analysis
In text watermarking, hidden capacity is a major parameter for measuring the strength of a proposed algorithm [36]. The capacity indicates the maximum number of watermark bits that can be embedded in the host document.
A novel system is required that maximizes hidden capacity without affecting other conflicting parameters, such as robustness and imperceptibility.
A technique is considered good if it has a high embedding capacity and does not affect the visibility of the watermark. The capacity of the proposed system can be measured by equation (3).
The proposed technique can incorporate 917 characters. The capacity analysis of the proposed technique is presented in Table 2, where ten different document sizes with watermarks of different lengths are examined. The sizes of the original and the watermarked documents are compared, and the change in the size of the document is also measured. After embedding the 917 characters, the size of the watermarked document changes by 0.08%, which is acceptable.
As mentioned above, the Word objects of documents are chosen for two reasons: first, they offer a large embedding capacity, and second, they do not affect the contents of the original documents. Table 3 presents the capacity comparison, where the proposed technique has a higher embedding capacity than [35,37,38,39]. The sizes of the original and watermarked documents are compared in Fig. 11, which shows that after embedding the watermark information, there is no significant change in the size of the watermarked document. When the secret message is 84 bytes, the watermarked document size is 13,432, and after increasing the amount of watermark information to 917 bytes, the watermarked document size is 13,508.
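As a small worked calculation on the two watermarked sizes quoted above, the Python sketch below computes the relative size change between them. It assumes both sizes are expressed in the same unit, which the text does not state, and the function name `size_change_percent` is hypothetical. Note that this compares two watermarked documents with each other; the 0.08% figure reported earlier refers to a comparison with the original document, whose size is not given in the text.

```python
def size_change_percent(size_before: float, size_after: float) -> float:
    """Relative change in document size, in percent of the first size."""
    if size_before <= 0:
        raise ValueError("size_before must be positive")
    return (size_after - size_before) / size_before * 100


# Watermarked sizes for the 84-byte and 917-byte secret messages.
change = size_change_percent(13_432, 13_508)
print(f"{change:.2f}%")  # 0.57%
```

Growing the watermark roughly elevenfold therefore adds 76 size units, well under 1% of the document size.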
The proposed scheme performs very well in terms of robustness and imperceptibility and also offers a large embedding capacity. The proposed technique is robust and more secure because special properties are used for watermarking. It can be applied to document authentication and copyright protection, protecting the document against unauthorized access and illegal use.

CONCLUSION
In this paper, a novel 3LDW system has been proposed to achieve robustness, imperceptibility and high embedding capacity. The experimental results demonstrate that the performance of the proposed technique is improved. A three-level watermark embedding is applied, which uses Word objects, open spaces and text feature coding to conceal the watermark information. The proposed scheme is robust because the watermark is embedded at three levels. If a formatting attack or Optical Character Recognition is applied, the watermark information can be retrieved from the other properties. Furthermore, the watermark information is still retrieved from the Word objects after a 90% deletion attack. In future research, the area of textual watermarks will be investigated further. We will analyse other possible attacks to enhance robustness in text documents. Moreover, Portable Document Format (PDF) documents will also be investigated.