Kind of a continuation of a previous unsolved problem I had, but here's the question:
If I'm setting up a RegExp for WebParser to get some values from a non-RSS URL, how do I account for HTML blocks that contain newlines and tabs? I know the (?siU) is supposed to cover it, but it doesn't seem to for me when I'm trying to parse some regular (non-RSS) pages. Is there anything special I have to do here?
I'm very new to this, so forgive the naïveté.
WebParser RegExp multiple lines
-
- Posts: 7
- Joined: July 16th, 2011, 5:48 pm
-
- Developer
- Posts: 22632
- Joined: April 19th, 2009, 11:02 pm
- Location: Fort Hunt, Virginia, USA
Re: WebParser RegExp multiple lines
It would help if you gave a specific example.
You are right that the "s" in (?siU) tells the regular expression to treat the entire text as one line, and consider the "dot" character as meaning "any character, including newline". This means that generally speaking you never need to worry much about newline characters, as they are just treated as any other character when you use .* to skip over a part of the HTML.
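As a rough way to see the effect of the "s" option outside of Rainmeter, here is a minimal Python sketch. It rests on assumptions: Python's re module is not the PCRE engine WebParser uses, it has no "U" option (so a lazy .*? stands in for it), and the HTML fragment is made up for illustration.

```python
import re

# Hypothetical HTML fragment with a newline and tabs inside the block.
html = "<div>\n\t<span>first</span>\n\t<b>42</b>\n</div>"

# Without "s", the dot stops at a newline, so skipping from <div> to <b> fails.
without_s = re.search(r"(?i)<div>.*<b>(.*?)</b>", html)

# With "s", the dot also matches newlines and tabs, so the skip succeeds.
with_s = re.search(r"(?si)<div>.*?<b>(.*?)</b>", html)

print(without_s)        # no match object
print(with_s.group(1))  # the captured value
```

The only difference between the two patterns that matters here is the "s" flag, which is what lets .* run across line breaks.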
When it can become an issue is when there is a real need to search FOR a newline character (or tab) to zero in on what you want to capture. An example might be:
<p><current value>
blue
</current value></p>
Which has that newline after "value>" that you need to deal with if you want to capture the word "blue".
RegExp="(?siU)value>(.*)</current"
Won't work, as you are capturing the newline along with the word blue, and WebParser is going to give you fits about that. You want to just capture the word blue, but since it has nothing between it and the preceding newline, you are going to have to be pretty specific.
I would use:
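RegExp="(?siU)value>(?:\r\n|\n)(.*)(?:\r\n|\n)</current"
Which says "search for value>, then EITHER the Windows newline sequence of CR/LF (\r\n) OR the Unix newline of LF (\n)". Note that the alternatives have to go in a (?:...) group: square brackets like [\r\n | \n] define a character class, which matches only one single character (a \r, a \n, a space or a |) and so would leave the \n of a Windows CR/LF in the capture. With the group you are safe no matter which format the text is in. I also specified the newline after the word blue, although you really don't have to. WebParser seems to ignore trailing newlines in a capture. It can give you trouble with leading ones.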
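To check the idea of consuming the newline before the capture, something like the following Python sketch may help. It is only an approximation of what WebParser does: Python's re has no "U" option, so the lazy (.*?) stands in for it, and the newline alternative is written as the group (?:\r\n|\n). The sample strings are the example above in both newline formats.

```python
import re

# The example text with Unix (LF) and Windows (CR/LF) newlines.
samples = {
    "unix": "<p><current value>\nblue\n</current value></p>",
    "windows": "<p><current value>\r\nblue\r\n</current value></p>",
}

# Naive pattern: the newlines around "blue" end up inside the capture.
naive = re.compile(r"(?si)value>(.*?)</current")

# Specific pattern: consume one newline (either format) on each side.
specific = re.compile(r"(?si)value>(?:\r\n|\n)(.*?)(?:\r\n|\n)</current")

results = {}
for name, text in samples.items():
    results[name] = (
        naive.search(text).group(1),
        specific.search(text).group(1),
    )
    print(name, repr(results[name][0]), repr(results[name][1]))
```

Printing with repr() makes the stray \r and \n characters visible, which is handy when a capture looks right on screen but still misbehaves.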
An uglier situation is where you have something like this, with several blank lines between "red" and "blue":
red
blue
green
and you want to capture the "blue". Matching a single newline, as in the example above, is not going to work as it is: the pattern is satisfied (and thus stops looking) as soon as it has matched one newline, and the capture starts right after it. That still leaves the remaining newlines between red and blue in our capture.
RegExp="(?siU)red(?:\r\n|\n)(.*)green"
Won't work for the above reason.
I found that
(?siU)red(?-U)[\s]{1,}(.*)(?U)green
Works OK, as you are using the (?-U) to temporarily turn off the "U" modifier that makes the expression "ungreedy", and leaving the "max" out of the "min,max" quantifier, so it will keep on matching \s characters (\s stands for "whitespace character", which covers \r, \n, tabs and spaces) until it stops finding them (when it hits "blue"). Then we turn "U" back on after the capture, so the rest of your expression works normally.
(?siU)red(?-U)[\s]+(.*)(?U)green
Would also work the same as the {1,} above, as the "+" simply means do the preceding pattern element "one or more" times.
In this simple example, using
(?si)red[\s]{1,}(.*)green
Would also work fine, as we just don't ever use the "U" modifier at all. Assuming your RegExp is going to be more complicated than just one search / capture though, it is a good practice to have the "U" active most of the time in a WebParser environment.
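The difference between matching one newline and matching a whole run of whitespace can be sketched in Python as below. Again this is an assumption-laden stand-in rather than WebParser itself: Python's re has no "U" or (?-U) switches, so the first pattern uses a lazy (.*?) and the second is the plain greedy (?si) form. The sample text, with three blank lines after "red", is hypothetical.

```python
import re

# Hypothetical text: several blank lines between "red" and "blue".
text = "red\n\n\n\nblue\ngreen"

# Only one newline is consumed, so the rest of them land in the capture.
one_newline = re.search(r"(?si)red(?:\r\n|\n)(.*?)green", text).group(1)

# Greedy \s+ consumes the whole run of whitespace before the capture starts.
all_whitespace = re.search(r"(?si)red\s+(.*)green", text).group(1)

print(repr(one_newline))
print(repr(all_whitespace))
```

The second capture still carries the newline that sits between "blue" and "green"; as noted earlier, WebParser seems to ignore trailing newlines, so only the leading ones needed handling.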
By the way, there may be an even cooler way of doing this using the "^" (beginning of line) and "$" (end of line) anchors, which only match at line boundaries when the "m" (multiline) option is set, but I have not dug into them much, so I'll go with what I know.
http://www.regular-expressions.info/
-
- Posts: 7
- Joined: July 16th, 2011, 5:48 pm
Re: WebParser RegExp multiple lines
Thanks, but it doesn't quite solve the problem. I'm trying to parse this site: http://www.weather.com/weather/hourbyhour/30022
In the source, the first thing I'm after is the first hour time, located on line 2758 (at least currently) preceded by <div class="hbhTDTime"><div>, which is definitely the first time that string appears. However, when I try to get to it with my RegExp, it gives me a matching error even if I just try to find only the time, like this:
RegExp="(?siU)hbhTDTime"><div>(.*)</div>"
I wrote a similar script that works just fine on an RSS formatted site, so the only difference I can think of is that this page has a lot of blank lines and spaces to get through, and I don't know how to get around that, since what I thought would work doesn't seem to.
-
- Developer
- Posts: 318
- Joined: July 14th, 2009, 5:57 pm
Re: WebParser RegExp multiple lines
I believe the problem is with the " in the middle of your RegExp. You could try:
RegExp="(?siU)hbhTDTime.><div>(.*)</div>"
As the "." matches any character.
-
- Developer
- Posts: 1721
- Joined: July 25th, 2009, 4:47 am
Re: WebParser RegExp multiple lines
Strange. The RegExp you posted works for me as-is:
[MeasureWeb]
Measure=Plugin
Plugin=Plugins\WebParser.dll
Url=http://www.weather.com/weather/hourbyhour/30022
RegExp="(?siU)hbhTDTime"><div>(.*)</div>"
StringIndex=1
UpdateRate=864000
[Meter]
Meter=STRING
SolidColor=0,0,0,128
MeasureName=MeasureWeb
StringStyle=BOLD
StringAlign=CENTER
FontColor=255,255,255
FontSize=15
X=50
W=100
H=30
-
- Posts: 7
- Joined: July 16th, 2011, 5:48 pm
Re: WebParser RegExp multiple lines
Alright, I seem to have gotten it to work. A few different things were working together to break it and confuse me. One was some weird behavior from Notepad when I tried to copy/paste (it seemed to break the RegExp into more than one line). Another was that I was trying to get 6 hours' worth of data, and after 5 hours, the weather.com source strangely stops using the HTML entity for the degree symbol and starts using the actual symbol °.
As a temporary workaround, I just used (..) to get the two digit F temperature, but as I'm in the South, I need to be able to just stop at the ° so I can get triple digit temperatures too. What do I use to account for that symbol?
-
- Posts: 7
- Joined: July 16th, 2011, 5:48 pm
Re: WebParser RegExp multiple lines
Never mind. I found the \D and realized I could use that to work around the °. Now it works fine, and I know what the weather will be in 6 hours.
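For anyone landing here later, one plausible reading of the \D workaround is sketched below in Python. The markup and the exact pattern are assumptions (the real weather.com source and the final RegExp were not posted): capture a run of digits and let \D match whatever non-digit character follows, whether that is the start of an HTML entity or the literal degree sign.

```python
import re

# Hypothetical snippets: one with an HTML entity, one with the literal symbol.
snippets = ["<td>98&deg;F</td>", "<td>101°F</td>"]

# (\d+) grabs two or three digits alike; \D matches the first non-digit after them.
pattern = re.compile(r"<td>(\d+)\D")

temps = [pattern.search(s).group(1) for s in snippets]
print(temps)
```

Because the capture is defined by digits rather than by a fixed width like (..), triple digit temperatures come through without any change to the pattern.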