Problem with index subpages tvspielfilm.de.ini

me@meisele.de
Donator

I recently switched to WebGrab++ because I moved my home server to Linux/Ubuntu. Before that I was using EPGBuddy with tvspielfilm.de. I wanted to keep using tvspielfilm.de but found that it is broken in WebGrab++. The current tvspielfilm.de website splits each day into subpages with a maximum of 30 shows per subpage. However, the current .ini file (Rev. 7) has 'page=1' hard-coded.

So I started to modify the .ini file and tried to add the subpage parameter. Initially it looked like it worked, but then I noticed that the number of shows retrieved is far too high and that incremental updates completely mess up the guide data, probably because there are duplicate shows in guide.xml.

In the attached zip there are the files from 2 runs for 1 channel for today:

Run 1: 'page=1' is hard-coded in url_index. The page is downloaded once into html_source.htm (as it should be, I assume). guide.xml contains exactly the 30 shows from page 1. Running an incremental update works as expected: no changes are made and there are still 30 shows in guide.xml.

Run 2: 'page=|subpage' enabled. For today there are 2 pages with 40 shows in total. In html_source.htm each page is downloaded twice: page 1, then page 1 again, then page 2, then page 2 again, etc. In the resulting guide.xml the 30 shows from page 1 are OK, but the 10 shows from page 2 are listed twice, so there are 50 shows instead of 40. Running an incremental update afterwards messes up the guide.
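
One way to confirm the duplicates described above is to count programme entries per channel and start time in guide.xml. The following is a minimal Python sketch, assuming the usual XMLTV layout (`<programme>` elements with `channel` and `start` attributes); the sample data is made up for illustration and is not taken from the attached files.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def find_duplicate_programmes(xmltv_text):
    """Return {(channel, start): count} for entries that occur more than once."""
    root = ET.fromstring(xmltv_text)
    counts = Counter(
        (p.get("channel"), p.get("start")) for p in root.iter("programme")
    )
    return {key: n for key, n in counts.items() if n > 1}


# Made-up sample: the second show appears twice, as in the broken run.
SAMPLE = """<tv>
  <programme start="20240101200000 +0100" stop="20240101201500 +0100" channel="ARD"><title>Show A</title></programme>
  <programme start="20240101201500 +0100" stop="20240101214500 +0100" channel="ARD"><title>Show B</title></programme>
  <programme start="20240101201500 +0100" stop="20240101214500 +0100" channel="ARD"><title>Show B</title></programme>
</tv>"""

duplicates = find_duplicate_programmes(SAMPLE)
print(duplicates)
```

For a real file, the same function could be called with the contents of guide.xml read from disk.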

I am confused. What is wrong with the subpage parameter? Any help or hint would be appreciated.

Michael

Blackbear199
WG++ Team member, Donator

The short answer is that subpage with a stop string is broken.
Look at your url_index debug: you will see the subpage count does not stop at page 2. It should stop there, as page 3 has "Fehlerseite" in it.
By default WebGrab++ goes up to a maximum of 8 pages if it doesn't find the stop string, and that's what you're getting.
I sent it to Jan to check out.
It's more of a PITA than a real issue, as it just takes time to download the extra pages of useless info.
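
The behaviour described above can be modelled in a few lines of Python. This is only a sketch of the logic as explained in this post (stop at the first page containing the stop string, otherwise give up after 8 pages); it is not WebGrab++'s actual code, and the `fetch` callables are hypothetical stand-ins for the page download.

```python
def pages_to_keep(fetch, stop_string, max_pages=8):
    """Request pages 1..max_pages and stop at the first page that
    contains stop_string. Returns the page numbers that were kept."""
    kept = []
    for page in range(1, max_pages + 1):
        body = fetch(page)
        if stop_string in body:
            break
        kept.append(page)
    return kept


def site_with_stop_string(page):
    # Hypothetical site: pages 1-2 have data, later pages show the error text.
    return "shows" if page <= 2 else "Fehlerseite (404)"


def site_without_stop_string(page):
    # Hypothetical site: the error text never reaches the grabber.
    return "shows" if page <= 2 else ""


print(pages_to_keep(site_with_stop_string, "Fehlerseite"))
print(pages_to_keep(site_without_stop_string, "Fehlerseite"))
```

In the second case the loop never sees the stop string, so all 8 page numbers are tried, which matches the 8 URLs seen in the log.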

The other issue you describe is a sure sign of a bad showsplit.
Something you should always remember:
As you know, the | has a special meaning in WebGrab++; it's used as a separator in multi-value elements like the showsplit, actors, etc.
If any | exist in the page data, WebGrab++ converts them to !??!.
This is fine if you have a single showsplit.scrub line and no modify lines after it.
If you do have modify lines, it's an issue, as the modify line will convert the !??! back to a | again.
For example, it will split a show in the showsplit into 2 when it shouldn't be split.
When WebGrab++ processes the show it will error for one part, as it won't have a start time.
Add index_showsplit.modify {(debug)} after the showsplit.scrub line and you will see what I mean.

The easiest thing to do is deal with these; even if you think they are not there, it doesn't hurt to have the line.

scope.range {(splitindex)|end}
index_showsplit.scrub {multi|<tr class="hover">||</tr>|</tr>}
index_showsplit.modify {replace(type=regex)|"\!\?\?\!"|-}
index_showsplit.modify {cleanup(style=jsondecode)}
*index_showsplit.modify {(debug)}
end_scope

I replace any !??! with a dash.
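
To illustrate why the placeholder matters, here is a small Python model of the effect described above. It is a sketch of the idea only, not WebGrab++ internals; the function names and the sample show text are made up.

```python
PLACEHOLDER = "!??!"


def scrub_multi(raw_blocks):
    """Join scrubbed blocks into one '|'-separated multi-value string,
    masking any literal '|' found in the page data first."""
    return "|".join(block.replace("|", PLACEHOLDER) for block in raw_blocks)


def split_multi(value):
    """Split a multi-value string back into its elements."""
    return value.split("|")


# Two shows; the first one has a literal '|' in its text.
blocks = ["20:15 Show A | Episode 1", "21:45 Show B"]
masked = scrub_multi(blocks)

# If a later step turns the placeholder back into '|', an extra element appears.
restored = masked.replace(PLACEHOLDER, "|")
# Replacing the placeholder with a dash keeps the element count intact.
dashed = masked.replace(PLACEHOLDER, "-")

print(len(split_multi(restored)))
print(len(split_multi(dashed)))
```

The restored string splits into three parts, one of which has no start time, while the dashed string still splits into the original two shows.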

me@meisele.de

Hi, thank you for the quick reply. Sorry, I am completely new to WebGrab++, therefore I tried to change as little as possible in the existing tvspielfilm.de.ini.

I did some more checks based on your proposals. I had already noticed that the log file indicates 8 URLs have been built, but thought this could be normal. I checked again: when I manually open the URL, e.g. for page 3, in the Edge browser, a 404 response is returned and the page contains the string 'Fehlerseite' in several places. However, in the resulting html_source.htm (attached) I cannot find 'Fehlerseite', only twelve 404 errors at the end. This means there were also 2 attempts per page for pages 3 - 8. I was hoping subpage would automatically recognize a 404 response and stop, even without a stop string.

I checked the showsplit with your proposed settings, and also checked the related block/element starts in the html_source.htm file. To me the showsplit looks OK. When using hard-coded page=1, only one block is found, which contains 30 elements --> OK. If I use subpage (as for the attached html_source.htm) there are four blocks (there should be only 2): 2 for page 1 and 2 for page 2. I think this must be resolved first, but I have no idea how. I think the issue is that both existing pages (1 and 2) are downloaded twice. Even if the stop string does not work, it is still strange that these 2 pages get downloaded twice.

Would it be possible to define multiple url_index urls with hardcoded page numbers? Currently I have no other idea if subpage is not working as expected.

Any additional hints or things I could check?

Blackbear199

Interesting find. I think I see what's happening.
If we fix the page=x number in url_index to a non-valid page (one we know doesn't exist), like page 3 for ARD or ZDF,
WebGrab++ returns "no index page received".
Based on that, I think when the page is invalid they use a JavaScript redirect to the 404 error page, which is what we see in the browser.
WebGrab++ cannot process JavaScript, so the page is never returned (it never sees the subpage stop string), and that's why the subpage count goes to 8 every time (the default stop value).

Here is my version; I don't see any of the errors you're getting.
To test it, put the ini in your WebGrab++ config folder.
You don't have to replace or remove the one in the Germany folder.
Also use the included channel list.

Updated.
post_back is used to get the page numbers.
No subpage stop string needed.
No revision number change.

Blackbear199

Figured out a solution.
I used postback to get the first page and read the number of pages of EPG data from it.
No subpage stop string needed.
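
The idea of reading the page count from the first page can be sketched in Python as follows. This is not the logic of the attached ini; the pagination markup and the `page=` link pattern are hypothetical, since the site's real navigation pattern is not shown in this thread.

```python
import re

# Hypothetical pattern: assumes page links look like "...?page=2".
PAGE_LINK = re.compile(r"[?&]page=(\d+)")


def page_count(first_page_html):
    """Return the highest page number linked from the first page,
    or 1 when no pagination links are found."""
    numbers = [int(n) for n in PAGE_LINK.findall(first_page_html)]
    return max(numbers, default=1)


# Made-up sample markup for a day with two pages and a day with one page.
two_pages = '<a href="/tv?page=1">1</a> <a href="/tv?page=2">2</a>'
single_page = "<p>no pagination here</p>"

print(page_count(two_pages), page_count(single_page))
```

Because the count is computed from each first page and falls back to 1, the number of URLs to build is known up front and no stop string is involved.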

Blackbear199

I was bored and decided to try to prove my theory.
I sent the request through FlareSolverr (google "flaresolverr git").
It can mimic a real browser request and can run JavaScript.
I debugged the response data and guess what showed up:
<p class=\"film-title\">Fehlerseite (404)</p>
FlareSolverr returns the data escaped, hence the \".

This definitely proves the redirect is JavaScript generated.
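
The \" in the output above is what a quote looks like inside a JSON string. A minimal Python sketch, assuming the HTML arrives as a string value inside a JSON body; the payload below is hand-written for illustration and its key names are assumptions, not a captured response.

```python
import json

# Hand-written stand-in for a JSON response body (key names are illustrative).
raw = r'{"solution": {"response": "<p class=\"film-title\">Fehlerseite (404)</p>"}}'

# Decoding the JSON removes the escaping and yields plain HTML.
html = json.loads(raw)["solution"]["response"]
print(html)
```

After decoding, the stop string "Fehlerseite" can be searched for in the HTML as usual.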

me@meisele.de

Thanks a lot, I just confirmed the POST method works with multiple channels and days, including incremental updates. HTML pages are grabbed only once.
There is only one issue left that I don't know how to fix. When channel B has no subpages but the previous channel A had subpages, the list of subpages from channel A is still applied, resulting in 404 errors when grabbing channel B, since the URLs for page 2 etc. are built as well. I assume this happens because the page does not contain the navigation pattern (

  • Blackbear199

    It's a bug.
    I can verify global_temp_9 is empty for a channel with a single page,
    but as you said, the subpage still repeats the page count from the previous channel.
    Although you get the error, the EPG is still grabbed, so ignore it for now.
    Jan will have to see where the problem is.

    me@meisele.de

    OK, thank you for checking. The errors obviously do no harm, they are just 404s. I have now tried to grab all channels (I have 29 configured) for 3 days. After importing into NextPVR I noticed that several channels have major gaps; in one case the first 2 shows of today are OK, then all following shows are shifted by one day. So far I have not looked into the details of what happens.

    I therefore tried to combine the first part of your ini file (URL building and showsplit) with the config for scrubbing index and details from revision 7 of the ini. This seems to work so far; currently a full update for 10 days is running. I will then try an incremental update as well.

    Once I know the new ini file works (at least for me) should I post it here or in the siteini requests? I think someone more knowledgeable than me should check it.

    Thanks again for your quick responses. Highly appreciated.

    Blackbear199

    Can you supply a few channel names that had issues?

    Blackbear199

    I see the problem.
    It's errors in the schedule.
    Let's look at 3sat.

    05:20-06:00 program
    06:00-06:05 program
    06:01-06:06 program
    etc.

    See what happened?
    The last show's start time is before the previous show's end time.
    When WebGrab++ processed the last show, it did this:
    the 06:01 start time is before the previous show's 06:05 end time, so it added a day to the date to make the schedule consistent, since it checks for this (a show's start time must be equal to or after the previous show's end time).

    The fix:
    disable the index_stop.scrub line (add a * to the beginning of the line).
    WebGrab++ will then use the next show's start time as the current show's end time.

    The above will become

    05:20-06:00 program
    06:00-06:01 program <=== stop time is changed here.
    06:01-06:06 program

    and the extra day won't be added, as the schedule now makes sense to WebGrab++.
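
The effect of the fix can be reproduced with a small Python model of the check described above. This is a sketch of the behaviour as explained in this post, not WebGrab++'s actual implementation; the function names and the date 2024-01-01 are arbitrary choices for illustration.

```python
from datetime import date, datetime, timedelta

ONE_DAY = timedelta(days=1)


def to_dt(day, hhmm):
    """Combine a date with an 'HH:MM' string."""
    return datetime.combine(day, datetime.strptime(hhmm, "%H:%M").time())


def assign_dates(day, rows, use_site_stop=True):
    """rows: (start, stop) 'HH:MM' pairs. A show that starts before the
    reference time of the previous show is assumed to be on the next day.
    With use_site_stop=False the previous show ends when the next starts."""
    shift = timedelta(0)
    shows = []
    for start_s, stop_s in rows:
        start = to_dt(day, start_s) + shift
        if shows:
            prev_start, prev_stop = shows[-1]
            limit = prev_stop if use_site_stop else prev_start
            if start < limit:
                shift += ONE_DAY
                start += ONE_DAY
            if not use_site_stop:
                shows[-1] = (prev_start, start)
        stop = to_dt(day, stop_s) + shift
        if stop < start:
            stop += ONE_DAY
        shows.append((start, stop))  # the last show keeps its listed stop
    return shows


rows = [("05:20", "06:00"), ("06:00", "06:05"), ("06:01", "06:06")]
day = date(2024, 1, 1)
with_stops = assign_dates(day, rows, use_site_stop=True)
without_stops = assign_dates(day, rows, use_site_stop=False)

print(with_stops[2][0])     # third show pushed to the next day
print(without_stops[2][0])  # third show stays on the same day
```

With the site's stop times the third show lands on the following day; with inferred stop times it stays on the same day and the second show ends at 06:01.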

    This also explains the different errors you had in your log with the original ini.
    It uses a different start/stop time.
    There is the HH:mm time shown above for each show, and a unix time in the data.
    WebGrab++ can use either.
    Unix time translates to a full date and time like dd-MM-yyyy HH:mm:ss, so WebGrab++ cannot get confused about the date the way it can with the HH:mm times above. It still detects that the schedule doesn't make sense, though, and that's why you got the errors you saw: the show is actually skipped (omitted from the data).
    The same fix applies when using this data: disable the index_stop.scrub line.

    Blackbear199

    It doesn't matter which times you use, but the timezone= setting on the site {xxx} line must be set correctly.

    index_start.scrub {single(separator=" - " include=first)|<td class="col-2">|<strong>|</strong>|</td>}
    This is the HH:mm time format and is in German local time.
    timezone=Europe/Berlin

    index_start.scrub {single|data-rel-start="||"|"}
    This is the unix time. Unix time is always in UTC.
    timezone=UTC
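
A short Python sketch of why the timezone setting has to match the time source. The timestamp is an arbitrary example value, not one taken from the site, and a fixed +01:00 offset stands in for German winter time to keep the example dependency-free; real code would more likely use `zoneinfo.ZoneInfo("Europe/Berlin")`.

```python
from datetime import datetime, timedelta, timezone

# Fixed winter offset for Germany, for illustration only.
CET = timezone(timedelta(hours=1), "CET")

unix_start = 1704136500  # arbitrary example: 2024-01-01 19:15:00 UTC

# Correct: unix time is interpreted as UTC, then shown in local time.
utc_dt = datetime.fromtimestamp(unix_start, tz=timezone.utc)
local_dt = utc_dt.astimezone(CET)

# Wrong: the same clock reading labelled as local time is one hour off.
mislabelled = utc_dt.replace(tzinfo=CET)

print(utc_dt.isoformat())
print(local_dt.isoformat())
print(utc_dt - mislabelled)
```

The show starts at 20:15 German time; treating the UTC clock reading as local time would move it an hour earlier.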

    me@meisele.de

    Thank you. Rev 7 of the ini file already had timezone UTC, which is correct when using the 'data-rel-start' tags. Not sure why this was changed somewhere between Rev 4 and 7. To me it looks like these tags are more reliable than the displayed times. At least I have now had several incremental updates with no visible issues (gaps or far too long shows) in the EPG.

    For the time being I am happy. Thanks again for the excellent support.

    Should the new .ini file be made available in the German siteini pack? The current Rev. 7 is not working.

    Blackbear199

    I usually wait with sites like this, as you can see they can be troublesome.
    Glad all looks good for you. Disabling the index_stop scrub has already been done
    in the ini pack.
    But thanks again for confirming it's OK.

    Blackbear199
    me@meisele.de wrote:

    Thank you. Rev 7 of the ini file already had timezone UTC, which is correct when using the 'data-rel-start' tags. Not sure why this was changed somewhere between Rev 4 and 7. To me it looks like these tags are more reliable than the displayed times. At least I have now had several incremental updates with no visible issues (gaps or far too long shows) in the EPG.
    For the time being I am happy. Thanks again for the excellent support.

    Should the new .ini file be made available in the German siteini pack? The current Rev. 7 is not working.

    It's up to the ini creator which to use.
    There is German local time and there is UTC;
    pick one.

    Blackbear199

    Jan has fixed this for using the subpage stop string.
    You need V5.3.0.2.
    https://github.com/SilentButeo2/webgrabplus-siteinipack/blob/master/eval...
    He is looking into why the files above (which use post_back) get the "no index page received" error.

    me@meisele.de

    I now had time to test the new ini file with the new WG++ version. Grabbing of the index pages now works as expected, subpages are being grabbed and no more 404 errors. So this issue is resolved. However, there are still the issues with shows missing due to inconsistent start/end time. So I will combine this new ini with the current one which grabs the start/end time from HTML tags. These seem to be more consistent.
    Thank you again for your support.

    Blackbear199

    Disable the index_stop.scrub line as I explained above.

    Both the UTC time and the German local time have the error.
    Disabling index_stop.scrub is the only way to prevent the show from being skipped (omitted from guide.xml).
