A Secure Digital Text Watermarking Algorithm for Portable Document Format (PDF)

With the rapid development of digital technologies, illegal copies of digital documents can be generated easily. The Portable Document Format (PDF) is the most common and widely used text document format on the internet, and the copyright protection of these documents is a challenging task. Advanced techniques have been proposed in the past but have not delivered the expected results: they are either robust, or imperceptible, or offer high capacity, but they do not maintain a balance between all three parameters. Digital watermarking has been used over the past decade to detect forgery and tampering and to maintain copyright and authentication. This study proposes a novel approach for the Portable Document Format based on document page objects. Special objects of the Portable Document Format are used for watermarking without affecting the content of the original document. The proposed technique preserves imperceptibility and resists formatting attacks. The watermark information is extracted with high probability; the proposed technique is robust, secure, and imperceptible, and it embeds 0.22 KB of watermark information in the host document.


INTRODUCTION
Text-based information is disseminated daily on the Internet, for example in the form of academic documents, electronic books, and emails. The Portable Document Format (PDF) is the prevalent carrier of this information, which is why a huge number of documents are shared on the Internet in PDF format. Due to advancements in digital technology, content can be easily redistributed, copied, and stored. On the other hand, unauthorized copying and illegal distribution of digital content make it difficult for data owners to enforce copyright and cost booksellers and publishers millions of dollars [1]. A massive contribution of mobile [...] The protection of digital text documents has been seriously ignored in the past, even though text is the most dominant part of the Internet, newspapers, articles, e-books, legal documents, and magazines [5]. For digital content, and text documents specifically, copyright protection is an urgent need and cannot be neglected. In the past, steganography, cryptography, and information hiding techniques were used to address copyright issues [6]. Nowadays, digital watermarking offers a better solution for copyright protection by embedding secret information, called a watermark, in the digital content [7,8]. The watermark is used for ownership verification when the digital content is used illegally. The major applications of digital watermarking are presented in Fig. 1. The most common use of digital watermarking is copyright protection: it is used to authenticate digital content or to support ownership verification.
PDF is an accessible file format developed by Adobe Systems Inc. [9][10][11]. Unauthorized persons can easily modify digital documents with the help of advanced technologies. The Portable Document Format includes some security mechanisms, but these can be broken with advanced technologies, and the modified documents can then be illegally distributed through the Internet.
This fact suggests the need to develop an effective authentication system [12,13]. In this paper, we have proposed a secure digital watermark algorithm for PDF documents. Our technique is based on PDF file page objects. The proposed system is secure, robust, incorporates large embedding capacity, and ensures the visual imperceptibility of the document.
Our main contributions in this research are as follows:
• We develop a secure digital watermarking system for the copyright protection of PDF documents, which prevents illegal distribution, reproduction, and manipulation.
• The proposed watermarking technique does not disturb the digital content and is called zero-watermark technology, because the watermark information is embedded in PDF file objects.
• The proposed technique can incorporate a large embedding capacity.
The rest of the paper is organized as follows. Section 2 presents related work, and Section 3 describes the PDF file structure. Section 4 presents the proposed model. The experimental results and analysis are discussed in Section 5, and the conclusion and future work are presented in Section 6.

RELATED WORK
Digital watermarking is an active area of research and can be categorized by carrier into text, image, audio, and video watermarking. This research considers PDF text documents. Several watermarking methods have been proposed, based on syntax, semantics, formatting, and so on. Zhong et al. [14] proposed a system that changes the preset distance of each word in PDF files to embed data in the right margin. Two concepts, neighbor difference and environment equal, are proposed to reveal the statistical properties of spaces, as shown in Fig. 2. In [15,16], word and paragraph spaces are utilized for watermarking. The main drawback of these techniques is that if the spaces between the text are removed, the embedded bits are destroyed. Lingjun et al. [ Photographer software is used to generate a font file of the same characters. Simin et al. [21] proposed a novel algorithm based on PDF document page objects, in which the watermark information is embedded in PDF page structure objects.
Hakak et al. [22] presented a complete framework for the automatic authentication and distribution of digital Quran and Hadith verses. The verification process is divided into two phases, security and verification. Wen et al. [23] proposed an algorithm for hiding information in XML documents, in which a functional dependency of the XML file is used as the zero-watermark function. The method performs well against alteration attacks, compression attacks, reorganization attacks, and selection attacks. Xiao et al. [24] suggested a novel font-code-based method for embedding information in text, which embeds a watermark by perturbing the glyphs of text characters while retaining the text content. A glyph recognition method is also presented to recover the information embedded in the encrypted document. Feng et al. [ proposed an extraction process that authenticates the received stego-cover file so that only the desired file is accepted for extraction; otherwise, the fake file is discarded by the recipient.

PDF FILE STRUCTURE
PDF is a widely used file format created by Adobe Systems in 1993. The basic PDF file structure consists of a Header, Body, Cross-Reference Table, and Trailer, as shown in Fig. 3. The header is the first line of the PDF file and includes the version number. The body holds all the data that can be shown in a PDF viewer and supports eight types of objects: Null, String, Number (integer or real), Boolean, Array, Name, Stream, and Dictionary. The cross-reference table is a core element of a PDF document: it contains a reference to every object in the file, given as a byte offset from the beginning of the file [30-32]. It begins with the keyword "xref", and each entry line that follows is exactly 20 bytes long, as shown in Fig. 4. After the keyword "xref", the first number is the number of the first object in the list (the list starts at object 0), and the next number is the count of objects in the cross-reference table. The Trailer is used to find the cross-reference table. Each object is linked with that table, which acts as a dictionary.
An example of PDF object syntax is given in Fig. 5: each object is assigned a unique number and is delimited by the keywords "obj" and "endobj". The script and the information for displaying text, figures, and images appear between "stream" and "endstream". "BT" signifies Begin Text and "ET" denotes End Text. "Tf", "Td", and "Tj" are some of the operators used to represent text: "Tf" sets the font and font size, "Td" specifies the offset of the current line, and "Tj" shows a string of characters, including the spaces between them. The PDF document structure is shown in Fig. 6; its tree structure allows PDF consumer applications to open a document using only limited memory [34]. The document catalog references the page tree, article threads, named destinations, interactive form, and outline hierarchy.
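To make the layout of the cross-reference table more concrete, the following minimal sketch parses one classic "xref" section into a lookup of object numbers. It assumes an uncompressed, classic table (not a cross-reference stream); the function name parse_xref_section and the sample bytes are hypothetical and are not taken from the documents used in this study.

```python
def parse_xref_section(data: bytes) -> dict[int, tuple[int, int, str]]:
    """Parse one classic cross-reference section that starts at 'xref'.

    Returns {object_number: (byte_offset, generation, flag)}, where flag
    is 'n' for an object in use and 'f' for a free entry.
    """
    lines = data.splitlines()
    if not lines or lines[0].strip() != b"xref":
        raise ValueError("data does not start with the 'xref' keyword")
    entries = {}
    i = 1
    while i < len(lines):
        parts = lines[i].split()
        # A subsection header holds two integers: first object number, count.
        if len(parts) != 2 or not all(p.isdigit() for p in parts):
            break  # reached 'trailer' or the end of the table
        first, count = int(parts[0]), int(parts[1])
        for k in range(count):
            offset, gen, flag = lines[i + 1 + k].split()
            entries[first + k] = (int(offset), int(gen), flag.decode("ascii"))
        i += 1 + count
    return entries


# Hypothetical table with three entries; each entry line is 20 bytes long.
SAMPLE_XREF = (
    b"xref\n"
    b"0 3\n"
    b"0000000000 65535 f \n"
    b"0000000017 00000 n \n"
    b"0000000081 00000 n \n"
    b"trailer\n"
)

if __name__ == "__main__":
    for number, entry in parse_xref_section(SAMPLE_XREF).items():
        print(number, entry)
```

A real reader would first locate the table through the "startxref" offset near the end of the file; that step is omitted here for brevity.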

SPREAD TRANSFORM DITHER MODULATION (STDM)
STDM is applied to embed the secret information in the PDF document. The bits of the secret message sm ∈ {0, 1} are embedded in the PDF document "T". Depending on the bit of "sm" being embedded, one of two different dither quantizers is applied. To embed the "sm" bit 0, Q0 is used, as shown in equation (1),
where ∆ denotes the size of the quantization step. Each bit of "sm" is embedded into "T" without any distortion, which is the most significant advantage of STDM. P is the projection vector of the host signal.
Fig. 6: The Structure of PDF Document
The quantized signal is specified as shown in (2),
where we can re-write equation (2) as:
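As an illustration of the general idea, the sketch below implements the common textbook form of STDM: the host vector is projected onto a unit direction, the projection is quantized with a bit-dependent dither, and the host is shifted along the direction by the difference. The dither values 0 and ∆/2, the function names, and the sample host values are assumptions made for this example and may differ from the formulation used in this paper.

```python
import math


def _unit(direction):
    """Scale the projection vector to unit length."""
    norm = math.sqrt(sum(c * c for c in direction))
    return [c / norm for c in direction]


def _project(vector, unit_direction):
    """Scalar projection of the vector onto the unit direction."""
    return sum(v * p for v, p in zip(vector, unit_direction))


def _dither_quantize(value, bit, delta):
    """Quantize with step delta; bit 1 uses a lattice shifted by delta / 2."""
    dither = 0.0 if bit == 0 else delta / 2.0
    return delta * round((value - dither) / delta) + dither


def stdm_embed(host, direction, bit, delta):
    """Return the host vector shifted so its projection encodes the bit."""
    p = _unit(direction)
    proj = _project(host, p)
    shift = _dither_quantize(proj, bit, delta) - proj  # magnitude <= delta / 2
    return [h + shift * c for h, c in zip(host, p)]


def stdm_detect(received, direction, delta):
    """Decide which of the two dithered lattices is closer to the projection."""
    proj = _project(received, _unit(direction))
    errors = [abs(proj - _dither_quantize(proj, b, delta)) for b in (0, 1)]
    return 0 if errors[0] <= errors[1] else 1


if __name__ == "__main__":
    host = [12.0, 7.5, 3.2, 9.1]        # hypothetical host values
    direction = [1.0, 1.0, 1.0, 1.0]    # hypothetical projection vector
    for bit in (0, 1):
        marked = stdm_embed(host, direction, bit, delta=4.0)
        print(bit, stdm_detect(marked, direction, delta=4.0))
```

In this form the shift applied along the projection vector never exceeds ∆/2 in magnitude, so a larger ∆ trades a larger change in the host values for a wider decision margin at detection.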

ADVANCED ENCRYPTION STANDARD (AES) ALGORITHM FOR ENCRYPTION AND DECRYPTION
Encryption and decryption are not our primary task, so any encryption algorithm can be used. We use AES with a 256-bit symmetric key for encryption and decryption. Its main purpose is to protect important information, in other words, to hide information from unauthorized persons. Encryption is applied to secure the secret message: if a high-level attack is applied to the watermarked document and someone acquires the embedded message, they still cannot read the actual message or the ownership details.

ZERO WATERMARKING
A technique is called zero watermarking if it does not change the original content during watermarking. The proposed technique uses the PDF page objects listed in Table 1, whose attribute values in the content stream can be modified for watermarking. The text state operators may appear outside text objects, and the values they set are retained across text objects in a single content stream. Each page of a document is represented by one or more content streams, which include the page objects. A PDF document has many objects of different types. The attributes of the content stream object that take numeric values are shown in Table 1. Every object attribute has an operator keyword, such as Tc, Tw, Tz, TL, Tf, Tr, or Ts. The attribute values in the content stream are modified for watermarking, and this does not affect the document contents.
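As a rough illustration of where these attribute values live, the sketch below scans a decompressed content stream for the numeric operand that precedes each text state operator. The sample stream and the function name read_text_state are hypothetical, and the whitespace tokenizer is a simplification: it would misread an operator keyword that happens to appear inside a string literal.

```python
# Operators whose last operand is a number (Tf is preceded by font name and size).
NUMERIC_OPERATORS = {"Tc", "Tw", "Tz", "TL", "Tf", "Tr", "Ts"}


def read_text_state(content_stream: str) -> list[tuple[str, float]]:
    """Return (operator, numeric operand) pairs in the order they appear."""
    tokens = content_stream.split()
    found = []
    for i, token in enumerate(tokens):
        if token in NUMERIC_OPERATORS and i > 0:
            try:
                found.append((token, float(tokens[i - 1])))
            except ValueError:
                continue  # the preceding token is not a number; skip it
    return found


# Hypothetical decompressed content stream of a single page.
SAMPLE_STREAM = """
BT
/F1 13 Tf
0 Tc
0 Tw
100 Tz
15 TL
0 Tr
0 Ts
72 720 Td
(Hello World) Tj
ET
"""

if __name__ == "__main__":
    for operator, value in read_text_state(SAMPLE_STREAM):
        print(operator, value)
```

The same scan can serve as the starting point for extraction, since the values read back from these operators are the numbers that carry the watermark.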

PROPOSED MODEL
A zero-text watermarking algorithm based on PDF page objects is proposed for PDF documents, as shown in Fig. 7. In the proposed technique, the objects and properties of the document are used to store the secret information. The technique is robust when formatting attacks are applied to the digital content: the watermark information is not deleted or changed, because it is stored in the document's objects or properties.

Watermark Embedding
A secret message is embedded into the page objects of the PDF document as watermark information. The copyright and authenticity of the PDF document are proved through the watermark. Algorithm 1 describes the complete watermark embedding process.
The secret message is first encrypted with the AES encryption algorithm using the secret key, which enhances the security of the hidden data, i.e., the watermark. After encryption, the encrypted data is converted to a binary string, and the binary string is then converted to numbers. Most PDF document page objects are of integer type, so numbers can easily be embedded in those objects. The secret information, now in numeric form, is divided into three equal groups. If attacks are applied to the PDF document, the watermark information can, in the worst case, still be recovered from the other groups. A document translator is used to obtain the content stream of the PDF file. Finally, the groups are embedded in suitable page objects of the original PDF file, which yields the watermarked PDF document. The complete watermark embedding process is given in Algorithm 1, where a secret message "sm" is embedded in PDF document objects without any distortion.

Table 1: Attribute values of the content stream object
Operator | Attribute | Description
TL | leading | TL sets the text leading and specifies a number.
Tf | font size | Tf indicates the font size, which is a number.
Tr | render | Tr specifies the text rendering mode, which shall be an integer.
Ts | rise | Ts sets the text rise, which is a number.
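The conversion steps of the embedding process (encrypted data to binary string, binary string to numbers, numbers to three equal groups) can be sketched as follows. This is only an illustration under stated assumptions: the ciphertext is taken as given, since AES is not part of the Python standard library; each number is assumed to encode 8 bits; the groups are made equal by zero padding; and all function names are hypothetical.

```python
def bytes_to_bitstring(data: bytes) -> str:
    """Convert encrypted bytes to a binary string of '0' and '1' characters."""
    return "".join(f"{byte:08b}" for byte in data)


def bitstring_to_numbers(bits: str, width: int = 8) -> list[int]:
    """Cut the binary string into chunks of `width` bits and read each as an integer."""
    return [int(bits[i:i + width], 2) for i in range(0, len(bits), width)]


def split_into_groups(numbers: list[int], n_groups: int = 3, pad: int = 0) -> list[list[int]]:
    """Divide the numbers into n_groups groups of equal size, padding the tail."""
    size = -(-len(numbers) // n_groups)  # ceiling division
    padded = numbers + [pad] * (size * n_groups - len(numbers))
    return [padded[i * size:(i + 1) * size] for i in range(n_groups)]


def groups_to_bytes(groups: list[list[int]], length: int) -> bytes:
    """Inverse of the steps above; `length` removes the padding again."""
    flat = [number for group in groups for number in group]
    return bytes(flat[:length])


if __name__ == "__main__":
    ciphertext = bytes(range(32))  # stand-in for AES output, not real ciphertext
    numbers = bitstring_to_numbers(bytes_to_bitstring(ciphertext))
    groups = split_into_groups(numbers)
    print([len(group) for group in groups])
    print(groups_to_bytes(groups, len(ciphertext)) == ciphertext)
```

Writing each group into the operand of a suitable page object is specific to the document translator in use and is therefore left out of the sketch.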

Watermark Verification
The extracted watermark is compared with the original watermark to authenticate the document. Watermark extraction and verification proceed as follows. The document translator is used to read the PDF document contents in binary form, and the page objects that contain the watermark are identified. The extracted watermark information, which is in the form of numbers, is converted into binary and then into characters. The AES decryption algorithm is applied to decrypt the extracted message, which is the watermark or secret message. After decryption, the retrieved message is compared with the original message, which proves the authenticity of the document. The complete watermark extraction process is given in Algorithm 2. The verification of the extracted message EM can be performed using (4).
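One simple way to carry out the final comparison is to measure the percentage of bits on which the extracted message EM and the original message agree. The sketch below assumes this bit-match ratio as the similarity measure; it is not necessarily the measure defined in (4), and the function names and the threshold parameter are hypothetical.

```python
def bit_accuracy(original: bytes, extracted: bytes) -> float:
    """Percentage of matching bits; bytes missing from either side count as errors."""
    total_bits = 8 * max(len(original), len(extracted))
    if total_bits == 0:
        return 100.0
    # XOR exposes the differing bits of each byte pair.
    matching = sum(8 - bin(a ^ b).count("1") for a, b in zip(original, extracted))
    return 100.0 * matching / total_bits


def verify(original: bytes, extracted: bytes, threshold: float = 100.0) -> bool:
    """Accept the document when the accuracy reaches the chosen threshold."""
    return bit_accuracy(original, extracted) >= threshold


if __name__ == "__main__":
    message = b"sample ownership message"  # hypothetical decrypted watermark
    print(bit_accuracy(message, message))
    print(verify(message, message))
```

A threshold below 100 would tolerate a few damaged bits, at the cost of a weaker authentication decision.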

EXPERIMENTAL RESULTS
In the experiments, the PDF file is created using Microsoft Word 2016 with the Times New Roman font family and a font size of 13 pt. The "pdftk" toolkit is used to compress and decompress the PDF document before the watermark is embedded. The experiments are carried out on a Core i3-3110 machine running the Windows operating system. The watermark embedded in the document is "The document copyrights belong to Umair Khadim (umair_khadim@live.com)".
Digital watermarking is subject to three key constraints: robustness, imperceptibility, and capacity (payload) [35]. The relationship between these parameters is displayed in Fig. 8.

Robustness
A number of PDF editors are available that can edit PDF files online and offline. Various attacks are applied to the PDF document to examine whether the proposed technique is robust. We added comments and marks to the watermarked PDF document, as shown in Fig. 9, and then tried to extract the watermark information from it. The experiments show that, after these attacks are applied to the watermarked document, the accuracy of the extracted watermark is 99.9%.
The formatting attacks did not affect the watermark information, because PDF document page objects are used for embedding the watermark. Interactive forms are a particular type of object used in PDF documents to collect user information. They allow users to write, edit, modify, or delete information at specific locations in a PDF file. We tested interactive forms, and editing them did not damage or affect the watermark information in the PDF documents. The experimental results therefore show that the proposed algorithm is robust against formatting attacks.

Capacity
The hiding capacity of the proposed system is measured using equation (5), where "C" indicates capacity, NBits(SM) is the secret message size in bits, and WD(KB) is the watermarked document size in KB. Compared with existing techniques, the proposed system improves the embedding capacity [36,37].
$$C = \frac{\mathrm{NBits}(SM)}{\mathrm{WD}(KB)} \times 100 \tag{5}$$
The length of the watermark information is 0.22 KB. The capacity results of the proposed technique are compared with [21], where the authors claim that they can embed 0.10 KB of watermark information in a PDF document, as shown in Fig. 10.
After the watermark information is embedded in the original document, a slight change in the watermarked document is measured. A 3D representation of the capacity analysis and the change in the document is given in Fig. 11.
The embedding capacity for the secret message (watermark) is measured in KB; 0.22 KB of data is embedded in the host document. The measured change between the original and the watermarked document is 0.13%, as shown in Fig. 12.
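The capacity measure can be computed directly from the definitions given for equation (5), i.e., the secret message size in bits divided by the watermarked document size in KB, multiplied by 100. The sketch below assumes exactly that ratio; the function name and the sizes in the example are hypothetical and are not measurements from this study.

```python
def embedding_capacity(secret_message: bytes, watermarked_doc_kb: float) -> float:
    """Capacity C = NBits(SM) / WD(KB) * 100, following the definitions of (5)."""
    if watermarked_doc_kb <= 0:
        raise ValueError("watermarked document size must be positive")
    n_bits = 8 * len(secret_message)  # NBits(SM): message size in bits
    return n_bits / watermarked_doc_kb * 100


if __name__ == "__main__":
    payload = bytes(225)       # hypothetical payload of roughly 0.22 KB
    document_size_kb = 120.0   # hypothetical watermarked document size
    print(embedding_capacity(payload, document_size_kb))
```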

Imperceptibility
Imperceptibility means that the watermark information cannot be perceived by the reader, i.e., the watermark should not affect the original text. Only an authorized party can detect the watermark, and only through special processing. The watermark is embedded imperceptibly into the PDF file objects without affecting the contents of the original document. The original and watermarked documents are presented in Fig. 13, where the watermark information is embedded in PDF file objects and does not modify the contents of the watermarked document. In this work, four components of the PDF file structure are used for watermarking: the header, body, cross-reference table, and trailer. The proposed scheme uses STDM for watermark embedding in PDF file objects. The experimental results illustrate that the proposed scheme is robust and imperceptible and improves the payload compared to previous techniques.

CONCLUSION
In this study, we have proposed a digital text watermarking algorithm for the Portable Document Format that is based on PDF page objects. The special properties of the Portable Document Format, which include page objects, are exploited for embedding watermark information. The experimental results show that the watermark information is successfully extracted from the document after formatting attacks are applied. The proposed algorithm does not affect the original contents of the document. The proposed technique reports excellent results for robustness and imperceptibility and is superior to other similar methods in terms of imperceptibility, robustness, and capacity. In future work, we will improve the embedding capacity of watermark information and design a secure watermarking system for printed documents.