I stumbled into a regular expression trap, the so-called greedy (or ungreedy) trap. In short, the ungreedy option in a regex does not guarantee the shortest possible match.
Below I describe the situation in which I ran into it and the workaround I ended up using.
The workaround relies on [^>]*, which matches any run of characters other than >.
Problem
In AutoHotkey, I wanted to extract the author name from the following HTML code:
<span>[10/6 8:23 PM] </span><span>Dalon, Thierry</span><div>Let's Connect presentation available</div><div data-tid="messageBodyContainer">
So I used the following regex,
sPat = U)<span>(.*)</span><div>(.*)</div><div data-tid="messageBodyContainer">
If (RegExMatch(sHtmlThread,sPat,sMatch)) {
sAuthor := sMatch1
}
taking care to use the ungreedy option U) so that the shortest string between the span tags would be extracted.
AHK Code to reproduce:
sHtml = <span>[10/6 8:23 PM] </span><span>Dalon, Thierry</span><div>Let's Connect presentation available</div><div data-tid="messageBodyContainer">
sPat = U)<span>(.*)</span><div>(.*)</div><div data-tid="messageBodyContainer">
If (RegExMatch(sHtml,sPat,sMatch)) {
sAuthor := sMatch1
MsgBox %sMatch1%
}
Strangely, the output I got was "[10/6 8:23 PM] </span><span>Dalon, Thierry" instead of the expected "Dalon, Thierry".
You can test it here: https://regex101.com/r/HFcaCa/1
I searched on Stack Overflow (and found this thread).
Finally I came across this great article explaining the regex greedy trap (and reassuring me that I am not crazy).
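As far as I understand it, the trap comes from how a backtracking engine such as PCRE (which AutoHotkey uses) searches: it tries start positions from left to right and returns the first one from which the whole pattern can succeed. The ungreedy option only controls how far each quantifier extends from that start position; it does not make the engine prefer a later, shorter match. A minimal sketch of this, using a made-up haystack (sHay, nPos and m are names chosen only for this illustration):

```autohotkey
; AutoHotkey v1 syntax. Hypothetical haystack: two <b> elements followed by <i>.
sHay = <b>x</b><b>y</b><i>

; RegExMatch returns the position of the leftmost match.
; The first <b> is tried first, and the lazy (.*) is allowed to grow
; across the inner tags until </b><i> finally matches.
nPos := RegExMatch(sHay, "U)<b>(.*)</b><i>", m)

; Shows start position 1 and the capture x</b><b>y, not y.
MsgBox Start position: %nPos%`nCaptured: %m1%
```

The capture is the shortest one available from the first start position, which is not the same as the shortest one available anywhere in the string.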
Workaround
In my case the captured text should be a name, so I don't expect it to contain any angle bracket. That let me work around the trap easily by excluding > from the first capture group with the following match pattern:
sPat = U)<span>([^>]*)</span><div>(.*)</div><div data-tid="messageBodyContainer">
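For completeness, here is the reproduction script again with the workaround pattern applied, as a sketch in AutoHotkey v1 syntax. Since [^>]* cannot cross the > that closes a tag, the attempt starting at the first <span> fails and the engine moves on to the second one.

```autohotkey
; Same sample HTML as in the reproduction above.
sHtml = <span>[10/6 8:23 PM] </span><span>Dalon, Thierry</span><div>Let's Connect presentation available</div><div data-tid="messageBodyContainer">

; First group may not contain ">", so it cannot span several tags.
sPat = U)<span>([^>]*)</span><div>(.*)</div><div data-tid="messageBodyContainer">

If (RegExMatch(sHtml, sPat, sMatch)) {
    sAuthor := sMatch1   ; Dalon, Thierry
    sText := sMatch2     ; Let's Connect presentation available
    MsgBox Author: %sAuthor%`nMessage: %sText%
}
```

This only holds as long as the name itself never contains a > character; for arbitrary HTML a real parser would be the safer choice.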