Adding Missing Words to Regular Expressions


Regular expressions are textual patterns used in data-intensive applications to extract data of specific interest. However, even hand-crafted regular expressions may fail to match all the intended words. In this paper, we propose a novel way to learn a regular expression starting from an original one and a set of missing (non-matched) words. Our method finds an approximate match between each missing word and the regular expression, then adds disjunctions that cover the unmatched parts. Our goal is to improve the recall of the initial regular expression without deteriorating its precision. We demonstrate the effectiveness and generality of our technique through experiments on various datasets.
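
As a rough illustration of the idea of aligning a missing word with a regular expression and adding a disjunction for the unmatched part, consider the toy sketch below. It is not the algorithm proposed in the paper. It assumes the original expression is supplied as a list of sub-patterns that are concatenated, that no sub-pattern contains a top-level alternation, and that a simple greedy left-then-right alignment is good enough. The function name add_missing_word and the phone-number example are hypothetical.

```python
import re


def add_missing_word(parts, word):
    """Toy repair: return a regex that also matches `word`.

    `parts` is a list of sub-patterns whose concatenation is the original
    regex (an assumption of this sketch, not something the paper specifies).
    """
    original = "".join(parts)
    if re.fullmatch(original, word):
        return original  # nothing is missing

    # Greedily align sub-patterns with the word from the left.
    i, pos = 0, 0
    while i < len(parts):
        m = re.compile(parts[i]).match(word, pos)
        if m is None:
            break
        pos, i = m.end(), i + 1

    # Greedily align the remaining sub-patterns from the right.
    j, end = len(parts), len(word)
    while j > i:
        m = re.search("(?:%s)\\Z" % parts[j - 1], word[pos:end])
        if m is None:
            break
        end, j = pos + m.start(), j - 1

    # Whatever is left unaligned is covered by a disjunction: the original
    # middle sub-patterns OR the literal fragment of the missing word.
    middle = "".join(parts[i:j])
    gap = re.escape(word[pos:end])
    return "".join(parts[:i]) + "(?:" + middle + "|" + gap + ")" + "".join(parts[j:])


# Hypothetical example: a phone-number pattern that misses a dot separator.
repaired = add_missing_word([r"\d{3}", "-", r"\d{4}"], "555.1234")
print(repaired)
```

In this sketch the added alternative is the literal unmatched fragment, so the repaired expression accepts the missing word and everything the original accepted, while a string such as "555 1234" is still rejected. That is one simple way to raise recall without loosening the pattern more than necessary.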

In Pacific-Asia Conference on Knowledge Discovery and Data Mining.