November 12, 2020

RegEx Greedy Trap

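I have stumbled upon a regular expression trap, the so-called greedy (or ungreedy) trap. In short, the ungreedy option in a regex does not guarantee the shortest possible match: the engine still takes the leftmost position where a match can start, and ungreedy only controls how far each quantifier expands from there.
Here I describe the problem as I ran into it and the workaround I ended up using.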

Problem

In AutoHotkey, I wanted to extract the author name (Dalon, Thierry) from the following HTML code:

<span>[10/6 8:23 PM] </span><span>Dalon, Thierry</span><div>Let's Connect presentation available</div><div data-tid="messageBodyContainer">

So I used the following regex:

sPat = U)<span>(.*)</span><div>(.*)</div><div data-tid="messageBodyContainer">
If (RegExMatch(sHtmlThread,sPat,sMatch)) {
    sAuthor := sMatch1
}

I was careful to use the ungreedy option U), expecting it to extract the shortest string between span tags.


AHK code to reproduce:

sHtml = <span>[10/6 8:23 PM] </span><span>Dalon, Thierry</span><div>Let's Connect presentation available</div><div data-tid="messageBodyContainer">
sPat = U)<span>(.*)</span><div>(.*)</div><div data-tid="messageBodyContainer">
If (RegExMatch(sHtml,sPat,sMatch)) {
    sAuthor := sMatch1
    MsgBox %sMatch1%
}

Strangely, the output I got was [10/6 8:23 PM] </span><span>Dalon, Thierry instead of the expected Dalon, Thierry.

You can test it here: https://regex101.com/r/HFcaCa/1 
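
One way to see what is going on is to look at where the match starts. The following is only a minimal sketch in AutoHotkey v1 syntax; the shortened pattern and the variable name iPos are my own, chosen for illustration. RegExMatch returns the position of the leftmost match, and here that is the very first <span>:

```autohotkey
; Sketch (AutoHotkey v1): U) makes the quantifiers lazy, but the match
; still begins at the first position where a match is possible.
sHtml = <span>[10/6 8:23 PM] </span><span>Dalon, Thierry</span><div>Let's Connect presentation available</div><div data-tid="messageBodyContainer">
; Shortened pattern for illustration: only the part up to the first <div>.
sPat = U)<span>(.*)</span><div>
; RegExMatch returns the start position of the leftmost match (0 if none).
iPos := RegExMatch(sHtml, sPat, sMatch)
; Starting from the first <span>, the lazy (.*) has to grow across
; "</span><span>" until it reaches a "</span>" that is followed by "<div>".
MsgBox Match starts at position %iPos%`nCaptured: %sMatch1%
```

The lazy quantifier stops at the first point where the rest of the pattern fits, but it never moves the start of the match to the right to make the capture shorter.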

I searched Stack Overflow and found this thread explaining the issue (and reassuring me that I am not crazy).

Workaround

In my case, since I don't expect any > in the text I am searching for (it should be a name), I could easily work around this trap with the following match pattern (the change is ([^>]*) in place of the first (.*)):

sPat = U)<span>([^>]*)</span><div>(.*)</div><div data-tid="messageBodyContainer">

The [^>]* matches any run of characters other than >, so the first capture group can no longer run across the </span><span> boundary.
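
For completeness, here is the reproduction script with the workaround pattern dropped in. It is a sketch in the same AutoHotkey v1 style as the code above and assumes the HTML has exactly the shape of the sample line:

```autohotkey
; Same sample HTML as in the reproduction code.
sHtml = <span>[10/6 8:23 PM] </span><span>Dalon, Thierry</span><div>Let's Connect presentation available</div><div data-tid="messageBodyContainer">
; [^>]* cannot cross a tag boundary, so a match attempt from the first
; <span> fails and the engine retries from the second <span>.
sPat = U)<span>([^>]*)</span><div>(.*)</div><div data-tid="messageBodyContainer">
If (RegExMatch(sHtml, sPat, sMatch)) {
    sAuthor := sMatch1  ; first capture group: the text inside the last <span>
    MsgBox %sAuthor%
}
```

With this pattern, the message box shows only the name, because the timestamp span is not directly followed by a <div>.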
