tvguide.com.ini, using regex with index_showsplit, and various thoughts

greggdurishan
Offline
Joined: 9 months
Last seen: 9 months

i got to playing recently with an old tv card i hadn't touched in 10 years and as a part of that casual project i came across your program. i was really pleased with how well thought out it is, and the documentation is just stellar: everything you need without a whole bunch of fluff making it tedious. props to Jan and Francis, and also Blackbear199 for the inis he wrote, including this's original, and for all the forum support he's given people, which i most definitely made use of. 

so i saw that there's all this well-made documentation for a well-made program, but when i went into the inis to look for something to mimic, i found that most of the stuff was encrypted, and the most thorough-looking one for me to ape wasn't working anymore. i thought that was a shame enough that i decided to make a little sub-project of updating the tvguide.com.ini as best as i could to the best-practices extolled by the documentation and the documented config files. i checked tvguide over for all that could populate the available elements, optimized the regexes for large files, and commented everything for easy reading. i hit a wall partway through, but i've pretty much accomplished my goals, and it gets basic data fine now. check my comments in the header about keepindexpage if using it.

the one gap in the documentation that really stymied me was figuring out what wg+ was doing with the showsplit regex. now that i've figured it out, i made a little explanation / demo of how wg+ parses regex matches in the showsplit, to hopefully save anyone else some time and stress. see: https://regex101.com/r/itVcnE/

so after working with it a while, i have a couple constructive observations:

it would be nice if the B number format could be opened up alongside the F and D formats for conversion. That would allow dec->bool and then using substring to find a 1 at a set position instead of having to AND on decimal values for a bitwise lookup value. it'd be more readable and easier to work with.
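
To make the request above concrete, here is a small Python sketch (not WebGrab+Plus syntax) comparing the two ways of testing a bit: an AND on the decimal value versus converting to a binary string and reading one character. The function names, the 8-bit width, and the sample value are assumptions made up for illustration.

```python
def has_flag_and(value: int, bit: int) -> bool:
    """Test one bit with a bitwise AND on the decimal value."""
    return (value & (1 << bit)) != 0


def has_flag_substring(value: int, bit: int, width: int = 8) -> bool:
    """Test the same bit by converting to a fixed-width binary string
    and reading the character at a set position."""
    bits = format(value, f"0{width}b")   # e.g. 5 -> "00000101"
    return bits[width - 1 - bit] == "1"  # bit 0 is the rightmost character


flags = 5  # hypothetical lookup value with bits 0 and 2 set
print([has_flag_and(flags, b) for b in range(4)])
print([has_flag_substring(flags, b) for b in range(4)])
```

Both functions give the same answer for every value that fits in the chosen width; the substring form just trades arithmetic for a positional lookup.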

it would be nice if keepindexpage could be cached more granularly, ie, 1 url per channelgroup (zip etc) when that's how the source provides it, instead of either by channel or by whole site. right now i choose between using the "keepindexpage" option and dividing zipcode-feeds among separate config.xmls so that the massive indexpage is only req'd once, or leaving "keepindexpage" off and re-requesting it for every channel to enable mixed zipcodes in the same config.
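
The idea of caching one index page per channel group can be sketched in Python as a cache keyed by the index url. This is only an illustration of the requested behaviour, not how WebGrab+Plus works internally; `make_group_cache`, `fake_fetch`, and the channel list are hypothetical.

```python
from typing import Callable, Dict, List


def make_group_cache(fetch: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap a fetch function so each distinct index url is requested once."""
    cache: Dict[str, str] = {}

    def cached_fetch(url: str) -> str:
        if url not in cache:
            cache[url] = fetch(url)  # only the first request per url goes out
        return cache[url]

    return cached_fetch


requests_made: List[str] = []


def fake_fetch(url: str) -> str:
    """Stand-in for a real download; records every outgoing request."""
    requests_made.append(url)
    return f"index page for {url}"


get_index = make_group_cache(fake_fetch)

# hypothetical channels from two zipcodes mixed in one config
channels = [("ch1", "zip=10001"), ("ch2", "zip=10001"), ("ch3", "zip=90210")]
for _channel, group_url in channels:
    get_index(group_url)

print(requests_made)
```

With three channels across two zipcodes, only two index requests are recorded.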

a json parser would sometimes be nicer than regexes. it'd be convenient to be doing node.last/child/etc instead of building regexes that deal with inner nodes that share the same name as outer nodes and such. same with an html dom object.
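
For comparison, this is roughly what node-style access looks like with a JSON parser, shown in Python. The response text is invented for the example; it nests an inner "items" list inside an outer "items" list, which is the case that is awkward to handle with a regex.

```python
import json

# hypothetical response: an inner "items" list nested inside an outer one
raw = (
    '{"items": [{"channel": {"sourceId": 1}, '
    '"items": [{"title": "News"}, {"title": "Weather"}]}]}'
)

data = json.loads(raw)
outer = data["items"]       # the outer list of channel nodes
inner = outer[0]["items"]   # the inner list with the same key name
print(inner[-1]["title"])   # last child of the inner node
```

The parser keeps track of nesting, so the shared name is never ambiguous.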

an extra parameter to add a random delay between 0-x above and beyond the *-delay config options would be neat.
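
A minimal sketch of what such a setting might compute, in Python. The name `delay_with_jitter` and the 6 and 3 second values are assumptions; nothing here is an existing WebGrab+Plus option.

```python
import random


def delay_with_jitter(base_delay: float, max_extra: float, rng: random.Random) -> float:
    """Return the configured delay plus a random extra between 0 and max_extra."""
    return base_delay + rng.uniform(0, max_extra)


rng = random.Random(0)  # seeded only so repeated runs give the same sequence
delays = [delay_with_jitter(6.0, 3.0, rng) for _ in range(5)]
print([round(d, 2) for d in delays])
```

Every delay lands between the base value and the base value plus the extra, so the requests stop arriving at a fixed rhythm.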

is there a way to preemptively cleanup a whole response for the urlshow->showdetails & urlsubdetail->showsubdetails scopes like showsplit.modify{cleanup(style=jsondecode)} does for the indexshowdetails scope?

the regex engine doesn't like valid ways of dealing with escaped quotes (\") inside two quotes like "((?:[^"\\]|\\.)*)" or "((?:\\.|[^"])+)"
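
For reference, this is what those two patterns are meant to do, checked here with Python's `re` module on a made-up sample string. It says nothing about why the WebGrab+Plus engine rejects them; it only shows the intended behaviour.

```python
import re

# the two patterns from the post, written as Python raw strings
pattern_a = re.compile(r'"((?:[^"\\]|\\.)*)"')
pattern_b = re.compile(r'"((?:\\.|[^"])+)"')

# hypothetical sample containing escaped quotes inside a quoted value
sample = r'{"title": "The \"Late\" Show", "id": "42"}'

matches_a = pattern_a.findall(sample)
matches_b = pattern_b.findall(sample)
print(matches_a)
print(matches_b)
```

In both cases the second match is the whole title including its escaped quotes, rather than stopping at the first `\"`.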

index_urlshow{url|file://.. doesn't seem to work whereas url_index{url|file://.. does, i haven't gotten as far as detail_urlsubdetail

in the showdetails scope, none of these seem to work when element 1 and 2 have the same content:
 detail_element2.modify {clear([='detail_element1'])}
 detail_element2.modify {clear(['detail_element1' = 'detail_element2'])}
 detail_element2.modify {clear(['detail_element1' ~ 'detail_element2'])}
 detail_element2.modify {clear(['detail_element2' ~ 'detail_element1'])}
detail_element2.modify {set|###'detail_element1'###'detail_temp1'###} 
  yields <element2>###detail_element1###detail_temp1###</element2>
whereas these things work in the indexshowdetails scope with their respective index_ counterparts.

if the above 2 are just a thing for non-donators, maybe variable substitutions could be opened up for use at least within the options ()s, so that app behavior doesn't vary in the showdetails scope between donated/non versions?

i was going to finish out the showsubdetails scope if i could, but i'm held up by the file:// options not working for me in some scopes, and being unable to test anything with variable substitutions in the showdetails scope, so that's as far as i can get. it's fixed enough for decent index listings, and it can still serve well enough as a decent template for others i think. enjoy!

Blackbear199
Offline
Has donated long time ago
Joined: 10 years
Last seen: 4 months

for tvguide.com the index page data size is small.

index page(grid with all channels for selected provider).

default duration is 120 minutes. you can increase this to 240 minutes; anything higher returns an error.

1440 / 240 = 6 index page requests per day.

index page(single channel).

1 index page request for 14 days.
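
The request counts above can be worked out with a few lines of Python. `grid_requests` is a made-up helper, and the figures count index requests for one pass over the grid page.

```python
import math

MINUTES_PER_DAY = 1440


def grid_requests(days: int, duration_minutes: int) -> int:
    """Index requests needed to cover `days` of grid data in fixed windows."""
    return math.ceil(days * MINUTES_PER_DAY / duration_minutes)


print(grid_requests(1, 240))   # 6, as stated above
print(grid_requests(1, 120))   # 12 with the default duration
print(grid_requests(14, 240))  # 84 for two weeks of grid pages
```

Against that, the single channel page covers the same 14 days in one request.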

yes a json parser would be nice.

an alternative for the showsplit that i personally prefer is to grab the entire block of data and then split it, rather than trying to split it into shows in one step.

for the single channel page data..

scope.range {(splitindex)|end}

index_showsplit.scrub {multi|"programSchedules":[||]|]}

index_showsplit.modify {replace|\},\{|\}\|\{}

index_showsplit.modify {cleanup(style=unidecode)}

index_showsplit.modify {cleanup(style=jsondecode)}

*index_showsplit.modify {(debug)}

end_scope

this is also very effective for the grid data page with all channels.

1. grab entire block for all channels.

index_showsplit.scrub {multi|"items":[||]},"links"|]},"links"}

2. split this into channel blocks.

index_showsplit.modify {replace|\},\{"channel"|\}\|{"channel"}

3. select the channel block for the epg we want.

   webgrab has a select function we can use (4.6.4.8 Select). set the 'config_site_id' value to an element.

global_temp_1.modify {set|'config_site_id'}

index_showsplit.modify {select|"\"sourceId\":'global_temp_1'," ~}

4. get just the channel data from the selected block(optional,makes showsplit debug cleaner).

index_showsplit.modify {substring(type=regex)|"\"programSchedules\":\[(.*?)\]\}"}

5. split into individual shows.

index_showsplit.modify {replace|\},\{|\}\|\{}

seems like a lot of steps, especially for this site as it's pretty simple data, but it is very useful for sites with complicated data.
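
The five steps above can be mirrored in Python to see what each one leaves behind. This is a sketch of the same grab-then-split idea using plain string operations, not a translation of how WebGrab+Plus runs the ini; the sample page, `split_shows`, and the sourceId values are invented for the example.

```python
import re
from typing import List


def split_shows(page: str, site_id: str) -> List[str]:
    # 1. grab the entire block for all channels
    start = page.index('"items":[') + len('"items":[')
    end = page.index(']},"links"', start)
    block = page[start:end]

    # 2. split the block into channel blocks
    channel_blocks = block.replace('},{"channel"', '}|{"channel"').split("|")

    # 3. select the channel block whose sourceId matches
    needle = f'"sourceId":{site_id},'
    selected = next((b for b in channel_blocks if needle in b), None)
    if selected is None:
        return []

    # 4. keep just the schedule data from the selected block
    match = re.search(r'"programSchedules":\[(.*?)\]\}', selected)
    if match is None:
        return []

    # 5. split into individual shows
    return match.group(1).replace("},{", "}|{").split("|")


# hypothetical grid page with two channels
page = (
    '{"data":{"items":['
    '{"channel":{"sourceId":111,"name":"AAA"},"programSchedules":['
    '{"title":"Show A1","startTime":1},{"title":"Show A2","startTime":2}]},'
    '{"channel":{"sourceId":222,"name":"BBB"},"programSchedules":['
    '{"title":"Show B1","startTime":1},{"title":"Show B2","startTime":2}]}'
    ']},"links":{}}'
)

shows = split_shows(page, "222")
for show in shows:
    print(show)
```

Because the channel is selected before the show split, the final split only ever runs over one channel's data. The sketch assumes no title contains the `|` separator.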

there is no jsondecode for details/subdetails pages.

it must be performed on each element scrubbed.

it works for the showsplit because it's an element.

this is incorrect..

 detail_element2.modify {clear([='detail_element1'])}
 detail_element2.modify {clear(['detail_element1' = 'detail_element2'])}
 detail_element2.modify {clear(['detail_element1' ~ 'detail_element2'])}
 detail_element2.modify {clear(['detail_element2' ~ 'detail_element1'])}

arguments in [xxxx] are boolean expressions. in the manual (available on the downloads page) read 4.6.2.3 Boolean Expressions.

you're missing an operator.

correct for these examples would be..

 detail_element2.modify {clear('detail_element1')}
 detail_element2.modify {clear('detail_element1' = 'detail_element2')} *means same as above.
 detail_element2.modify {clear('detail_element1' ~ 'detail_element2')}

 detail_element2.modify {clear(~ 'detail_element1')}
 detail_element2.modify {clear('detail_element2' ~ 'detail_element1')} * means same as above.

a boolean expression example.. \| is or operator.

detail_element2.modify {clear(['detail_element1' = 'detail_element2'] ['detail_element1' ~ 'detail_element2'] \|)}
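
The operand, operand, operator ordering in that example can be illustrated with a tiny postfix evaluator in Python. This is only a model of the ordering shown in this thread, not WebGrab+Plus's actual parser, and it assumes `~` means "contains"; `eval_postfix` and the sample values are hypothetical.

```python
from typing import Callable, Dict, List, Union

Token = Union[bool, str]

OPERATORS: Dict[str, Callable[[bool, bool], bool]] = {
    "&": lambda a, b: a and b,
    "|": lambda a, b: a or b,
}


def eval_postfix(tokens: List[Token]) -> bool:
    """Evaluate already-computed comparisons combined in postfix order."""
    stack: List[bool] = []
    for tok in tokens:
        if isinstance(tok, bool):
            stack.append(tok)          # a bracketed comparison result
        else:
            right = stack.pop()        # operators consume the last two results
            left = stack.pop()
            stack.append(OPERATORS[tok](left, right))
    if len(stack) != 1:
        raise ValueError("unbalanced expression")
    return stack[0]


element1, element2 = "Pilot, part 1", "Pilot"  # hypothetical element values

# mirrors: ['detail_element1' = 'detail_element2'] ['detail_element1' ~ 'detail_element2'] \|
result = eval_postfix([element1 == element2, element2 in element1, "|"])
print(result)
```

Here the equality test is false and the assumed "contains" test is true, so the OR yields true.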

greggdurishan
Offline
Joined: 9 months
Last seen: 9 months

yup, i found only 120, 180,  & 20160 to work, added a comment to that effect.

the difference to the user is how many days of *subpage  (*edit: i forgot that was a keyword, i meant showdetail / showsubdetail) requests they want to make, this will no longer hammer tvguide's servers for just index data like it used to, if one follows the new notes in the top.

> grab the entire block of data then split it rather than trying to split it into shows in one step.

ah, i think i get what you're laying down. don't think i would have figured out the syntax to parse and reform my own matches, but i'm glad that's laid out for future readers now. 

>your missing a operator.

am i?  we wrote the same thing, only i enclosed each comparator in a [] pair and included the implicit = on the pre-match operator for readability. i got the a-b-operator order syntax, the docs made that clear. I saw how one chains them, such as where i used 

{ set( ['detail_temp_9' not = ""]  ['detail_subtitle' = ""] & ['detail_temp_9' not = 'detail_title'] & ) | 'detail_temp_9' }

i'm nearly certain i'm just hitting the quirks of being undonated in the showdetails scope, the syntax works fine in the indexshowdetails scope.

i had a couple dawning realizations about what's on purpose, and went back to make a couple quick fixes to the file so that it behaves right for subbed users--i had commented stuff out and made it work better for me unsubbed, but i flipped what's commented out =3 the subtitle is better just as just the episodeTitle if one can't compare it against the title, as unregd users can't. i'll let the undonated users do the comment-editing instead of the other way around.

cheers!

Blackbear199
Offline
Has donated long time ago
Joined: 10 years
Last seen: 4 months

yes i forgot to mention you are limited as a registered user.

check your webgrab license log file.

there is a chart that explains what you get.

20160 only works for a single channel data.

note:

there is a webgrab bug,there is no delay when subpage is used.

for the grid page it would definitely mean a ban.

this was recently fixed in V5.4.0

the delay between subpages is the same as the index-delay setting.


greggdurishan
Offline
Joined: 9 months
Last seen: 9 months

>20160 only works for a single channel data.

nope. the old regex couldn't deal, it wouldn't filter by channel, and for that it needed the channel# placed into the regex ala id='##variable##'. all the channel data for the week's in that one index url req, try it out, it works. the output log will still show an i for the attempt, but fiddler will only catch one outgoing req for an index page. (again, if one follows the new instructions up top re: uncommenting the 2nd "site {" line & its keepindexpage option). a 2week index-only update for a zipcode is now a 1-req operation.

but even if the old regex WAS adjusted to filter by channel to account for whatever change happened since it last worked, it would suffer catastrophic backtracking when trying to get the last channel in a 2week list. the new regex is much slower for items at the start, but it always finishes in about the same amount of time for any given match in the file, rather than getting worse the further & further it has to search through all that data. I'm not going to bother to try to measure/calc it, but it's probably something like O(n) vs O(n^2).

that index with 2 weeks of a zipcode's channel's data at once is HUGE, so only req'ing it once will go a long way to free up headroom for successive small show&detail information reqs before hitting that data limit too quickly and getting 403'd, vs how it used to req the same huge info over and over each channel.

the link is built and ready to be used for the subdetails json page, it wouldn't be much more than poking around and copy-paste coding what i already got to finish 'er off. you're welcome to, or if i get tossed a sub i'll get it done =)

Blackbear199
Offline
Has donated long time ago
Joined: 10 years
Last seen: 4 months

i ran some tests using 20160 duration on the grid page(all channels).

first and last channel in the data.

no keepindexpage

[        ] Job finished at 19/08/2025 15:42:39 done in 38s

keepindexpage

[        ] Job finished at 19/08/2025 15:44:24 done in 28s

but it's a waste, as all 14 days of indexpage shows will never be used: the user would get blocked trying to get all the detail page data.

the method used to get the indexpage doesnt really matter imho.


greggdurishan
Offline
Joined: 9 months
Last seen: 9 months

not sure what you're testing against. with the keepindexpage & a 6+ second delay on the various options, i can slowly eke my way through all the showdetail reqs for the 2 weeks on 20 channels. i'm guessing the 403s trigger when asking for too much data in a given time-window, and i don't see why adding a subdetails req would change much if kept at that pace, since the hard part, the big index req, is long past by the time 20 channels worth of detail reqs are processed with those delays. an index-only update can be set to 0 delays on all since it's just 1 req and then process it all.

maybe one could futz around with a huge index delay and then less delay on the showdetail reqs, optimizing for large channel-sets and trying to balance the speed you want at the end v the huge data-cost you incur at the start.

Blackbear199
Offline
Has donated long time ago
Joined: 10 years
Last seen: 4 months

with a high enough delay, yes it can be done but there's a limit.

say 20 shows per day per channel.

20 x 6 = 120 secs per channel x 20 channels = approx 40 min.

add enough channels and this will get long enough that it extends to the next day.

if you run an update once/day the previous grab won't be finished before the next starts.
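
The estimate above generalizes to a one-line calculation, sketched here in Python. `grab_minutes` is a made-up helper; the 20 shows per day and 6 second delay are the figures from this post, and the 1000-channel case is the iptv scenario described in this thread.

```python
def grab_minutes(shows_per_day: int, delay_seconds: float, channels: int, days: int = 1) -> float:
    """Minutes spent waiting on detail-page delays for one grab."""
    return shows_per_day * delay_seconds * channels * days / 60


print(grab_minutes(20, 6, 20))           # 40.0 minutes, one day of 20 channels
print(grab_minutes(20, 6, 20, days=14))  # 560.0 minutes for two weeks
print(grab_minutes(20, 6, 1000))         # 2000.0 minutes, more than the 1440 in a day
```

The last line is the case where a daily update cannot finish before the next one is due.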

it's not the personal user that's causing the issues.

it's users with 1000s of channels in their list (iptv guys, etc).

i can think of around a dozen big sites that are pretty much useless from people abusing grabbing.

tvtv.us,zap2it.com,directv.com,mydish.com(dishnetwork) and now tvguide.com are some of the big usa ones that all took steps to block grabbing.


greggdurishan
Offline
Joined: 9 months
Last seen: 9 months

well reducing the data this tries to pull by a multiple of how many channels there are is a start, anyway, one does what one can.

greggdurishan
Offline
Joined: 9 months
Last seen: 9 months

oh, btw, do the intermediary channel.xmls one has to build using the C options have any purpose outside of this app? i'm guessing this ini was written when tvguide had more countries. i bet i could look at eliminating any practical need to treat canada differently, but idk if something external depends on having a countries.xml to function or something.

edit: iduno if zipping and uploading messed up the tabbing, but i tried to set it straight again. the site doesn't seem to like when i delete and reupload a zip of the same name, it doesn't reflect any changes.

Blackbear199
Offline
Has donated long time ago
Joined: 10 years
Last seen: 4 months

correct, it's leftover from when more countries existed.

canada still follows the same steps as when it had many countries,step 1 could be removed and step 2 hard coded for canada.

i never bothered to change it hoping maybe it was a site issue when they first made the change but turned out not to be.


greggdurishan
Offline
Joined: 9 months
Last seen: 9 months

aight, mebbe i'd do that if i ever come back to finish the subdetails page. 

 

speaking of delays set me off to double-check: i'm not getting any of the behavior i expect out of index-delay whether keepindexpage or not, localfile or not. i presume the unit is seconds just because of the contextual values in the documentation. a 30 for index-delay never delayed 30sec in any circumstance. what would be ideal for this would be if index-delay in keepindexpage mode occurred only once after the first (successful) index-get, and before moving on to doing shows, then NOT on successive calls to the cached index page.

if the zips i looked at are average... which i suppose they're not, being just broadcast, those big cable cos have buttloads of channels don't they... yet also less need for epg since it's fine in-ui for them... anyway, the index page was ~400k, and detail pages ~600ish on average so 1 index req = 666.66.... (repeating, of course--lol unplanned) detail-pages.

1. measure how low a show-delay one can get away with after a localfile-index load (to avoid reqing any index).

2. use start-stop & b-xferred and you've measured the b/s you can get away with.

3. figure the size of the index page, and index-bytes * (seconds/byte successful rate aka flip the b/s fraction) = the length of the index pause needed to stay at about that same b/s on average. (if index-delay was made to work in keepindexpage mode like i suggested) then you'd be at roughly the best rate one could go.
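
Those three steps amount to the following arithmetic, sketched in Python. The measurement (100 detail pages at 6 seconds each) is hypothetical, the function names are made up, and the sizes are read as roughly 400,000 bytes for the index and 600 bytes per detail page, which is what the 666.66 ratio implies.

```python
def sustainable_rate(bytes_transferred: int, elapsed_seconds: float) -> float:
    """Step 2: bytes per second that went through without a 403."""
    return bytes_transferred / elapsed_seconds


def index_pause_seconds(index_bytes: int, rate_bytes_per_second: float) -> float:
    """Step 3: pause after the index request that keeps the average at that rate."""
    return index_bytes / rate_bytes_per_second


# hypothetical measurement: 100 detail pages of ~600 bytes, 6 s apart
rate = sustainable_rate(100 * 600, 100 * 6.0)
pause = index_pause_seconds(400_000, rate)

print(rate)            # bytes per second
print(pause)           # seconds to wait after the index request
print(400_000 / 600)   # detail pages "worth" of one index request
```

Under these assumed numbers the pause after the index request would be a little over an hour, which shows how heavily one index page weighs against the detail requests.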

Blackbear199
Offline
Has donated long time ago
Joined: 10 years
Last seen: 4 months

the index page is saved at html.source.htm in the webgrab config directory.

no keepindexpage   => overwritten for every channel

keepindexpage        => not overwritten

all scrubs for index_elements (showsplit, start, stop, etc) are scrubbed from this data.

when webgrab moves to a details page it's overwritten again with the data from each show, and all details_elements are scrubbed from this data.

same thing for subdetails page and elements.

this is why, for example, you cannot scrub an index page element in scope=showdetails or scope=showsubdetails

if you did try the scrub would fail as the index page data has been overwritten by details or subdetails page data.


greggdurishan
Offline
Joined: 9 months
Last seen: 9 months

yeah, i saw that file, but it was overwriting itself every couple lines the way i was doing things, so it wasn't useful. what WAS useful was the debug lines in the log, which i was surprised to see since the license.txt suggested i wasn't gettin none. i pretty much got to put fiddler away and work strictly from that (& cheesing index_description since i could run detail info through that to the xml--don't close this hole or i never coulda tested the nodes i can't output properly! ^_-) once i noticed url and regex debugs worked for nondonators. i just threw debug on for everything i could and control-f'd around the log & resultant xml, snooping for things i expected. it's pretty great. nothing was hard except the main regex, which took forever since i wasn't sure what it was looking for. like i said at the start, the bangin' documentation and forum discussion here's everything i needed except that regex. =)

i also noticed you can't specifically set index-only in the config's mode, you have to actually remove all details & subd code for it to run in index-only without being demo-licensed? it'd be nice to just set it, my ota epg is like 2 days & missing 1/3 the channels, and even barebones 2wk listings gets me realistic series recording.

Blackbear199
Offline
Has donated long time ago
Joined: 10 years
Last seen: 4 months

you only need to disable the index_urlshow(or index_urlsubdetail/urlsubdetail) line(one that adds the url).

you dont need to remove the scrubs for the code.

if the url for the details/subdetails is not present all their respective scrubs are ignored.

jan(creator) has already been asked for a way to disable details/subdetails pages by adding a setting to site {xxx} line.

when it will happen i have no idea.


greggdurishan
Offline
Joined: 9 months
Last seen: 9 months

i reread what you said about html.source.htm and i think i get more of what you're saying now. that file's no receipt given after the fact, it's live data, huh? i saw the log lines about writing it over and over so many times, and when i poked in it once the run was done, seeing only the item's last showinfo and no way to pause the app for earlier ones, i didn't look at it much more.

yeah, i figured the option to clean subpage results wasn't there or it woulda been in the docs. i still think it'd be nice though, maybe it could be a param on the url before it's req'd. wouldn't be much code to do, and if ambitious in a later ver, it could also cause a node or htmldom object to become usable in the receiving scope in addition to the cleaned-text.

ah, yes, the index_urlshow line. frankly i find it easier to see and hilight everything below the index scope, cut, save, run, and undo, than to pick that line outta the others lol.

Thinking more on the 403 issue, what with index-delay seemingly not working right now, i think the best way to use this ini right now is to manually webget the 2week index req to a file, chill out for as long as you think tvguide's window is for metering you (my guess is it resets in about 5-30 mins if you haven't 403'd yourself, a 403 can last from an hour to several), and then run the detail updates using wg+ against that saved index file at whatever pace you find you can get away with. keeping that big index get time-far from your tiny detail-gets is probably the ticket.

greggdurishan
Offline
Joined: 9 months
Last seen: 9 months

argh, i had a huge oversight. the &channelSourceIds param on their index page hadn't worked for me originally, and i'd missed the popup channel-specific listing in their UI. forget all that baloney about reducing data pulled, i just had it wrong. i've added it back like the original, so none of that stuff about keepindexpage applies any longer. it's back to how it was. my bad. oh well, it's still a little more fixed-up than it was.



Brought to you by Jan van Straaten

Program Development - Jan van Straaten ------- Web design - Francis De Paemeleere
Supported by: servercare.nl