Anonymization Framework for Securing Protected Health Information in a Complex Dataset of Medical Narratives

  • Saman Hina, Department of Computer Science and Information Technology, NED University of Engineering and Technology, Karachi, Pakistan.
  • Raheela Asif, Department of Software Engineering, NED University of Engineering and Technology, Karachi, Pakistan.
  • Syed Abbas Ali, Department of Computer and Information Systems Engineering, NED University of Engineering and Technology, Karachi, Pakistan.

Abstract

In the medical domain, it is imperative that information protection does not overlook any individual. The research community in this domain encourages the use of real-time datasets for research purposes. These datasets contain structured and unstructured (natural language free text) information that can be useful to researchers in various disciplines, including computational linguistics. However, they cannot be distributed without anonymization of Protected Health Information (PHI), because disclosing PHI that can identify an individual (such as name, age, or address) is unethical. Therefore, we present a rule-based Natural Language Processing (NLP) anonymization system, developed on a challenging corpus containing medical narratives and ICD-10 codes (medical codes). This anonymization module can be used to pre-process a corpus containing identifiable information. The corpus used in this research contains 2534 PHIs in 1984 medical records in total. 15% of the labelled corpus was used to improve the guidelines for identifying and classifying PHI groups, and 85% was held out for evaluation. Our anonymization system follows a two-step process: (1) identification and cataloging of PHIs into four PHI categories ('Patients Name', 'Doctors Name', 'Other Name [Names other than patients and doctors]', 'Place Name'), and (2) anonymization of PHIs by replacing identified PHIs with their respective PHI categories. Our method uses basic language processing, dictionaries, rules and heuristics to identify, classify and anonymize PHIs with PHI categories. We use standard metrics for evaluation. Measured against a human-annotated gold standard, our system achieves an F-measure of 100%, an increase of 39% over the baseline results, which supports the reliability of the anonymized data for research use.
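
The two-step process described in the abstract (identify and categorize PHIs, then replace each PHI with its category label) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the dictionary `PLACE_NAMES`, the title-based patterns in `TITLE_RULES`, and the function names `identify_phi`, `anonymize` and `f_measure` are all hypothetical and chosen only for illustration. The sketch covers three of the four categories and omits 'Other Name'. The `f_measure` helper shows the standard harmonic mean of precision and recall over exact span matches, which is an assumed matching criterion.

```python
import re

# Hypothetical dictionary of place names (assumption, not from the paper).
PLACE_NAMES = {"Karachi", "Lahore"}

# Hypothetical title-based rules: (pattern, PHI category).
# Group 1 captures the name that follows the title.
NAME = r"([A-Z][a-z]+(?:\s+[A-Z][a-z]+)?)"
TITLE_RULES = [
    (re.compile(r"\b(?:Dr|Doctor)\.?\s+" + NAME), "Doctors Name"),
    (re.compile(r"\b(?:Mr|Mrs|Ms)\.?\s+" + NAME), "Patients Name"),
]


def identify_phi(text):
    """Step 1: return non-overlapping (start, end, category) spans."""
    spans = []
    for pattern, category in TITLE_RULES:
        for m in pattern.finditer(text):
            spans.append((m.start(1), m.end(1), category))
    for place in PLACE_NAMES:
        for m in re.finditer(r"\b" + re.escape(place) + r"\b", text):
            spans.append((m.start(), m.end(), "Place Name"))
    spans.sort()
    kept, last_end = [], 0
    for start, end, category in spans:
        if start >= last_end:  # drop spans overlapping an earlier one
            kept.append((start, end, category))
            last_end = end
    return kept


def anonymize(text):
    """Step 2: replace each identified PHI with its category label."""
    out, cursor = [], 0
    for start, end, category in identify_phi(text):
        out.append(text[cursor:start])
        out.append("[" + category + "]")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


def f_measure(predicted, gold):
    """Harmonic mean of precision and recall over exactly matching items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(anonymize("Dr. Ahmed Khan examined Mr. Ali in Karachi."))
```

In this sketch the category label stands in for the original text, so the anonymized narrative keeps its sentence structure while the identifying strings are removed.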

Published
Jul 1, 2020
How to Cite
HINA, Saman; ASIF, Raheela; ALI, Syed Abbas. Anonymization Framework for Securing Protected Health Information in a Complex Dataset of Medical Narratives. Mehran University Research Journal of Engineering and Technology, [S.l.], v. 39, n. 3, p. 612-624, July 2020. ISSN 2413-7219. Available at: <https://publications.muet.edu.pk/index.php/muetrj/article/view/1704>. Date accessed: 05 July 2020. doi: http://dx.doi.org/10.22581/muet1982.2003.16.
This is an Open Access article published by Mehran University of Engineering and Technology, Jamshoro under the CC BY 4.0 International License.