WebParser RegExp multiple lines

Post by **DundaDaThunda** » July 20th, 2011, 4:35 am

Kind of a continuation of a previous unsolved problem I had, but here's the question:

If I'm setting up a RegExp for WebParser to get some values from a non-RSS URL, how do I account for HTML blocks that contain new lines and tabs? I know the (?siU) is supposed to cover it, but it doesn't seem to for me when I'm trying to parse some regular (non-RSS) pages. Is there anything special I have to do here?

I'm very new to this, so forgive the naïveté.

Post by **jsmorley** » July 20th, 2011, 5:45 am

It would help if you gave a specific example.

You are right that the "s" in (?siU) tells the regular expression to treat the entire text as one line, and consider the "dot" character as meaning "any character, including newline". This means that generally speaking you never need to worry much about newline characters, as they are just treated as any other character when you use .* to skip over a part of the HTML.
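
As a rough illustration of what those modifiers do (the title tags here are just a stand-in for whatever surrounds your value, not something taken from a particular page):

```ini
; (?siU) switches on three PCRE options for the whole expression:
;   s = the dot also matches newline characters
;   i = matching is case-insensitive
;   U = quantifiers such as .* are ungreedy (match as little as possible)
; So (.*) below can run across line breaks and stops at the first </title>.
RegExp="(?siU)<title>(.*)</title>"
```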

It becomes an issue when there is a real need to search FOR a newline character (or tab) to zero in on what you want to capture. An example might be:

<p><current value>
blue
</current value></p>

Which has that newline after "value>" that you need to deal with if you want to capture the word "blue".

RegExp="(?siU)value>(.*)</current"

Won't work, as you are capturing the newline along with the word blue, and WebParser is going to give you fits about that. You want to just capture the word blue, but since it has nothing between it and the preceding newline, you are going to have to be pretty specific.

I would use:

RegExp="(?siU)value>(?:\r\n|\n)(.*)(?:\r\n|\n)</current"

Which says "search for value>, then EITHER the Windows newline sequence of CR/LF (\r\n) OR the Unix newline of LF (\n)". The alternatives go in a (?:...) group separated by |, with no spaces. Square brackets would not do the job here, since something like [\r\n | \n] is a character class that matches just one single character. That way you are safe no matter which format the text is in. I also specified the newline after the word blue, although you really don't have to. WebParser seems to ignore trailing newlines in a capture. It can give you trouble with leading ones.
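
Dropped into a full measure, that might look something like the sketch below. The measure name and Url are placeholders I made up for illustration, and the newline alternation is written as a (?:...) group:

```ini
; Hypothetical measure: the name and Url are placeholders, not a real page.
[MeasureColor]
Measure=Plugin
Plugin=Plugins\WebParser.dll
Url=http://example.com/colors.html
; Find "value>", then one CRLF or LF newline, then capture everything
; up to the next newline that sits directly before "</current".
RegExp="(?siU)value>(?:\r\n|\n)(.*)(?:\r\n|\n)</current"
StringIndex=1
UpdateRate=600
```

With the three-line HTML snippet above, StringIndex=1 should come back as just the word blue, whether the page uses CR/LF or plain LF line endings.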

An uglier situation is where you have something like:

red

blue

green

and you want to capture the "blue". Our example above is not going to work as it is, as it matches exactly one newline and then moves on. Adding a quantifier alone does not help either: because of the "U" or "ungreedy" modifier in our (?siU), it will succeed (and thus stop looking) as soon as it gets one newline. That still leaves the remaining newline characters between red and blue ending up in our capture.

RegExp="(?siU)red(?:\r\n|\n)(.*)green"

Won't work for the above reason.

I found that

(?siU)red(?-U)[\s]{1,}(.*)(?U)green

Works ok, as you are using the (?-U) to temporarily turn off the "U" modifier that makes the expression "ungreedy", and leaving the "max" out of the {min,max} quantifier, so it will just keep on matching \s characters (\s stands for "whitespace character", which includes \r, \n, tab and space) until it stops finding them (when it hits "blue"). Then we turn "U" back on after the capture, so the rest of your expression works normally. Be aware that the (.*) capture itself is greedy here too, since it sits between (?-U) and (?U), so it runs up to the last "green" in the text rather than the first.

(?siU)red(?-U)[\s]+(.*)(?U)green

Would also work the same as the {1,} above, as the "+" simply means do the preceding pattern element "one or more" times.
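
Here is a sketch of how that could sit in a measure. It is a small variation on the expressions above, not the same thing: (?U) is switched back on before the capture so it stops at the first "green", and an ungreedy \s* soaks up the blank lines after "blue". The measure name and Url are placeholders.

```ini
; Hypothetical measure: the name and Url are placeholders, not a real page.
[MeasureMiddle]
Measure=Plugin
Plugin=Plugins\WebParser.dll
Url=http://example.com/colors.html
; (?-U) makes \s+ greedy, so it swallows every newline after "red".
; (?U) then restores ungreedy matching, so (.*) grows only as far as it
; must, and \s* takes the whitespace between the capture and "green".
RegExp="(?siU)red(?-U)\s+(?U)(.*)\s*green"
StringIndex=1
UpdateRate=600
```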

In this simple example, using

(?si)red[\s]{1,}(.*)green

Would also work fine, as we just don't ever use the "U" modifier at all. Assuming your RegExp is going to be more complicated than just one search / capture though, it is a good practice to have the "U" active most of the time in a WebParser environment.

By the way, there may be an even cooler way of doing this using the "^" (beginning of line) and "$" (end of line) anchors, which match at each line break when the "m" (multiline) modifier is on, but I have not dug into them much, so I'll go with what I know.

http://www.regular-expressions.info/

Post by **DundaDaThunda** » July 20th, 2011, 2:01 pm

Thanks, but it doesn't quite solve the problem. I'm trying to parse this site: http://www.weather.com/weather/hourbyhour/30022

In the source, the first thing I'm after is the first hour time, located on line 2758 (at least currently) preceded by <div class="hbhTDTime"><div>, which is definitely the first time that string appears. However, when I try to get to it with my RegExp, it gives me a matching error even if I just try to find only the time, like this:

RegExp="(?siU)hbhTDTime"><div>(.*)</div>"

I wrote a similar script that works just fine on an RSS formatted site, so the only difference I can think of is that this page has a lot of blank lines and spaces to get through, and I don't know how to get around that, since what I thought would work doesn't seem to.

Post by **JamesAC** » July 20th, 2011, 2:06 pm

I believe the problem is with the " in the middle of your RegExp. You could try:

RegExp="(?siU)hbhTDTime.><div>(.*)</div>"

As the "." matches any single character, including the quote.

Post by **Kaelri** » July 20th, 2011, 2:14 pm

Strange. The RegExp you posted works for me as-is:

[MeasureWeb]
Measure=Plugin
Plugin=Plugins\WebParser.dll
Url=http://www.weather.com/weather/hourbyhour/30022
RegExp="(?siU)hbhTDTime"><div>(.*)</div>"
StringIndex=1
UpdateRate=864000

[Meter]
Meter=STRING
SolidColor=0,0,0,128
MeasureName=MeasureWeb
StringStyle=BOLD
StringAlign=CENTER
FontColor=255,255,255
FontSize=15
X=50
W=100
H=30

Post by **DundaDaThunda** » July 20th, 2011, 3:34 pm

Alright, I seem to have gotten it to work. A few different things were working together to break it and confuse me. One was some weird behavior from Notepad when I tried to copy/paste (it seemed to break the RegExp into more than one line). Another was that I was trying to get 6 hours' worth of data, and after 5 hours, the weather.com source strangely stops using &deg for the degree symbol and starts using the actual symbol °.

As a temporary workaround, I just used (..) to get the two digit F temperature, but as I'm in the South, I need to be able to just stop at the ° so I can get triple digit temperatures too. What do I use to account for that symbol?

Post by **DundaDaThunda** » July 20th, 2011, 4:18 pm

Never mind. I found the \D and realized I could use that to work around the °. Now it works fine, and I know what the weather will be in 6 hours.
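
For anyone landing here later, a sketch of the \D idea might look like this. "tempMarker" is a made-up stand-in for whatever text precedes the temperature in the real page source, which isn't shown in this thread, so swap in the actual marker:

```ini
; Hypothetical sketch: "tempMarker" is a placeholder, not taken from the
; weather.com source. Replace it with the real text before the temperature.
[MeasureTemp]
Measure=Plugin
Plugin=Plugins\WebParser.dll
Url=http://www.weather.com/weather/hourbyhour/30022
; [^>]*> skips to the end of the opening tag. (\d+) captures the digits,
; and \D requires a non-digit right after them, so the capture ends the
; same way at "&deg" or at a literal degree symbol, for 2 or 3 digits.
RegExp="(?siU)tempMarker[^>]*>(\d+)\D"
StringIndex=1
UpdateRate=864000
```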
