In most hospitals today, clinical records are stored merely for archiving purposes. However, natural language processing tools could extract new contextual information from these texts, yielding a medical corpus. Building such a corpus is the long-term aim of the research reported in this paper.
The research material consists of medical charts from the Department of Ophthalmology at Semmelweis University. The documents were structured and normalized during preprocessing, constituting the raw material for the current study. The texts were then broken down into units of variable size: sentences, words and punctuation marks. This process is known as tokenization.
Existing tokenizers developed for Hungarian (such as Magyarlánc or Huntoken) were not designed for medical terminology.
The aim was to create a rule-based tokenizer that can segment preprocessed medical texts and handle abbreviations effectively. The latter is of particular importance due to the high frequency of sublanguage abbreviations in medical texts. Furthermore, the documents show great inconsistency in orthography and segmentation: frequently used expressions exhibit both individual and intertextual variability. It is crucial that the software identify and manage these morphological differences in a uniform fashion.
Planning began by defining rules for sentence and word boundaries. For sentence boundary disambiguation, abbreviations had to be taken into account. After tokenization, each abbreviation was tagged with a unique identifier defining its meaning.
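The two steps above can be sketched together: a sentence ends only on a standalone period, so a period belonging to a known abbreviation never triggers a split, and each abbreviation token is paired with its identifier. The lexicon entries and identifiers below are illustrative assumptions, not the paper's actual abbreviation inventory:

```python
# Hypothetical abbreviation lexicon: surface form -> unique identifier.
# Entries are invented examples; the real lexicon is not shown in the paper.
ABBREVIATIONS = {
    "o.": "ABBR_OCULUS",      # oculus (eye)
    "dg.": "ABBR_DIAGNOSIS",  # diagnosis
    "th.": "ABBR_THERAPY",    # therapy
}

def split_and_tag(tokens):
    """Group tokens into sentences and tag known abbreviations.

    A token ends the sentence only if it is a bare period ".".
    Abbreviation tokens (e.g. "o.") carry their own period, so they
    never cause a sentence break; instead they are tagged with their
    identifier from the lexicon.
    """
    sentences, current = [], []
    for tok in tokens:
        tag = ABBREVIATIONS.get(tok.lower())  # None for non-abbreviations
        current.append((tok, tag))
        if tok == ".":
            sentences.append(current)
            current = []
    if current:  # flush a trailing sentence with no final period
        sentences.append(current)
    return sentences

# Invented chart-style input: two sentences, one abbreviation each.
sents = split_and_tag(["Th.", "javasolt", ".", "Kontroll", "o.", "dextri", "."])
# -> two sentences; "Th." tagged ABBR_THERAPY, "o." tagged ABBR_OCULUS
```

The design point is that sentence boundary disambiguation and abbreviation identification are interdependent: the lexicon lookup both suppresses a spurious boundary and supplies the meaning tag in a single pass.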
The resulting software detects sentence boundaries and abbreviations with 90% precision.