TVB Pearl at mytvsuper.com

jksmurf

https://www.mytvsuper.com/en/epg/P#today

i.e. today, tomorrow, third, four, five, six, seven.

Hello, I was hoping someone could help with this site, please.

WebGrab does have a TVB Pearl (84) for HK, but it is now so full of Chinese characters and brackets that I cannot work with it to extract information.

My TVSuper is much simpler, and all in English.

Thanks a lot!

jksmurf

Hi

Well, I had a go, and boy is it difficult, much more so than TVxB (no longer in operation). See ini and xml attached.

I thought I had managed to get an ini close, but no joy. I had to make a manual channel list using Excel.

I would really appreciate some pointers.

[  Info  ] Group (0) :
[  Info  ] update requested for - 1 - out of - 1 - channels for 14 day(s)
[  Debug ]
[  Info  ] (   1/1   ) MYTVSUPER.COM -- chan. (xmltv_id=Pearl) -- mode Incremental
[Error   ] no shows in indexpage!
[Error   ] Cannot find any shows on the Index Page !
[  Info  ]
[  Info  ]    Summary for update of       Pearl
[  Info  ]      no changes, no update necessary !
[  Info  ]      unchanged shows inspected 0
[  Info  ]      total after update        0
 

jksmurf

Hi again,

Wow, thank you for your patience! I digested this all and have revised the ini but with little success.

Blackbear199 wrote:

the maxdays=7.1 applies here as you have 7 days epg on a single page.

I have amended thus:

site {url=mytvsuper.com|timezone=UTC+08:00|maxdays=7.1|cultureinfo=zh|charset=utf-8|titlematchfactor=90}
url_index{url|http://mytvsuper.com/en/epg/|channel|}
urldate.format {list|#today|#tomorrow|#third|#four|#five|#six|#seven}

Blackbear199 wrote:

I will tell you that you have the correct bs(block start) and be(block end)

OK; I get (and like) the bookend analogy. Thank you. I tried 3 runs with 3 different pages bracketed by the bookends. The error remains, no page downloaded?

Run 1

index_showsplit.scrub {multi|<table class="b epg-detail">"|<tbody>|</tbody>|</table>}
index_start.scrub {single|<td|</td>}
index_title.scrub {single|<td|</td>} 

Run 2

index_showsplit.scrub {multi|<table class="b epg-detail">"|<tr>|</tr>|</table>}
index_start.scrub {single|<td|</td>}
index_title.scrub {single|<td|</td>} 

Run 3

index_showsplit.scrub {multi|<table class="b epg-detail">"|<td>|</td>|</table>}
index_start.scrub {single|<td|</td>}
index_title.scrub {single|<td|</td>} 

 

jksmurf

Hi Blackbear,

Well I tried a few things as you suggested above, but I am still very green to this and don't really know what I am doing.

Based on the log it seems to recognise the page with times and shows but does not produce any output EPG. 

What can I try next?

Cheers

k.

xchemical

I looked at your code, fixed the small errors, and tested it, and it's working:

index_showsplit.scrub {multi|<table class="b epg-detail">|<tr>|</tr>|</table>}
index_start.scrub {single|<td>||</td>}
index_title.scrub {multi|<td>||</td>}
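
Compared with Run 2 above, the visible changes are the removal of the stray `"` after `epg-detail">` in the block start, the element lines written as `<td>||</td>` instead of `<td|</td>`, and the title scrub switched from single to multi. The Python sketch below is not WebGrab code. It is a rough emulation which assumes that a multi scrub cuts the page between a block start and a block end and then collects every piece between an element start and an element end. The `scrub_multi` helper, the sample HTML, and the programme names are made up for illustration.

```python
def scrub_multi(page, block_start, elem_start, elem_end, block_end):
    """Hypothetical emulation of a separator-string 'multi' scrub."""
    begin = page.find(block_start)
    if begin == -1:
        return []  # block start never matched, so there is nothing to split
    begin += len(block_start)
    end = page.find(block_end, begin)
    if end == -1:
        return []
    block = page[begin:end]

    items = []
    pos = 0
    while True:
        start = block.find(elem_start, pos)
        if start == -1:
            break
        start += len(elem_start)
        stop = block.find(elem_end, start)
        if stop == -1:
            break
        items.append(block[start:stop])
        pos = stop + len(elem_end)
    return items


# Invented sample page, loosely shaped like the table named in the ini.
SAMPLE_PAGE = (
    '<table class="b epg-detail">'
    '<tr><td>06:00</td><td>News</td></tr>'
    '<tr><td>06:30</td><td>Weather</td></tr>'
    '</table>'
)

# Block start with the stray quote, as in Runs 1 to 3.
broken = scrub_multi(SAMPLE_PAGE, '<table class="b epg-detail">"',
                     '<tr>', '</tr>', '</table>')
# Block start without it, as in the corrected line.
fixed = scrub_multi(SAMPLE_PAGE, '<table class="b epg-detail">',
                    '<tr>', '</tr>', '</table>')

print(len(broken), "shows with the stray quote")
print(len(fixed), "shows without it")
```

Under that assumption, a block start that never occurs in the page yields no shows at all, which would be consistent with the "no shows in indexpage" error in the earlier log.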

 

 

jksmurf
xchemical wrote:

I looked at your code, fixed the small errors, and tested it, and it's working:

Thank you very much, I will try this tonight and let you know how I get on.

Thanks once again, very kind of you :-).

k.

jksmurf
Offline
jksmurf's picture
Donator
Joined: 11 years
Last seen: 1 year

Hi there,

Well I tried it, but it seems to fail at recognising Date?

In my very first note above I wrote the syntax seemed to be:

https://www.mytvsuper.com/en/epg/P#today

followed by #tomorrow, third, four, five, six, seven. 

I am not sure how they reconcile today with a date, but if it knows today's date, it should know tomorrow's date etc. etc.?  

Thanks!

k.
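
On the #today / #tomorrow question: the part of a URL after `#` is a fragment, which a browser keeps to itself and does not send to the server. That would fit the remark quoted earlier that all 7 days sit on a single page, with the fragment only choosing which day's table the browser shows. A small Python illustration using the URL from the first post (the `day_urls` list exists only for this demonstration):

```python
from urllib.parse import urlsplit

days = ["today", "tomorrow", "third", "four", "five", "six", "seven"]
day_urls = [f"https://www.mytvsuper.com/en/epg/P#{d}" for d in days]

# What a client actually requests is the URL with the fragment removed.
requested = {urlsplit(u)._replace(fragment="").geturl() for u in day_urls}

print(requested)
```

All seven links collapse to one request URL, so a date list built from the fragments would not change what gets downloaded.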

jksmurf
Blackbear199 wrote:

so for this site(as i said everything is on a single page) it would be maxdays=7.1

Hi Blackbear, thank you. I will have another go, but I did look at the documentation, and you will see that my previous .ini already had the term "maxdays=7.1"; I had it in this line, as per the documentation examples. Does this not fulfil that requirement?

site {url=mytvsuper.com|timezone=UTC+08:00|maxdays=7.1|cultureinfo=zh|charset=utf-8|titlematchfactor=90}

k.

jksmurf

Hi Blackbear199, 

That does indeed work, thank you very much. I tried to compare the two ini files but it might as well be Martian to me. I used to be able to configure TVxB on html-only pages reasonably well and that setup made some sense, but WebGrabPlus is over my head, sorry.

  • I see "nopageoverlaps" has been appended to "site {url" but do not know why.
  • I see "site {ratingsystem=CN|episodesystem=onscreen}" has been added but do not know why.
  • I see url_index{url|http://mytvsuper.com/en/epg/|channel|} was changed to url_index{url|https://www.mytvsuper.com/|channel|}, and I believe this is done in concert with the change to the XML, in which you add site_id="en/epg/P" there rather than in the ini, but I do not know why it should be so.
  • I see you have added "url_index.headers {customheader=Accept-Encoding=gzip,deflate}" but I do not know why.
  • I see the date "urldate.format {daycounter|0}" has been added, rather than my attempt at using a list format "{list|#today|#tomorrow|#third|#four|#five|#six|#seven}", which I saw in the URL when I hovered the mouse over it.
  • I see the xml set up has been amended as well as you noted.
  • I can guess that the ? here are wildcards and from what you told me above I can guess it is looking for strings or markers which define the start and end of various sections and sub-sections in the page, which are the Channel, the show logo image (a one off item), the start and end times and perhaps show name itself. There is no way I could figure this out in any reasonable period of time.
    • "global_temp_1.modify {substring(type=regex)|'config_site_id' "^(.*?)\/*epg"}"
    • "index_urlchannellogo {url||<h1>|<img src="|?ts|</div>}"
    • "index_showsplit.scrub {regex||<table class="b epg-detail">(?:.*?)(?:(<tr\s*>.+?</tr>)(?:.*?))*</table>||}"
  • I understood your earlier version "index_showsplit.scrub {multi|<table class="b epg-detail">|<tr|</tr >|</table>}" (with a gap after /tr to capture both), but its transition to "index_showsplit.scrub {regex||<table class="b epg-detail">(?:.*?)(?:(<tr\s*>.+?</tr>)(?:.*?))*</table>||}" just has me shaking my head.

You do not need to explain any of the above, I just wanted to show I did take the time to look at it, as you were kind enough to take the time to fix it.

I am sorry I have left you frustrated. I think there is a small group of people who understand how to set this whole thing up and who are willing to spend hours and hours on it, but I am well over 50 years old now and unlikely to be able to progress further on it.

Thanks once again for your help.

k.
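
For readers equally puzzled by the regex form of index_showsplit: read loosely, it says "find the epg-detail table, then capture each `<tr ...>...</tr>` row inside it". The Python sketch below does the same in two steps. It is an illustration only, not what WebGrab runs: Python's `re` module keeps only the last match of a repeated group, so the single-pattern form does not translate directly, and the sample HTML is invented.

```python
import re

# Invented sample page; the second row has a space before '>' on purpose.
SAMPLE_PAGE = (
    '<div><table class="b epg-detail"><tbody>'
    '<tr><td>06:00</td><td>News</td></tr>'
    '<tr ><td>06:30</td><td>Weather</td></tr>'
    '</tbody></table></div>'
)

# Step 1: isolate the table (the outer part of the regex from the ini).
table = re.search(r'<table class="b epg-detail">(.*?)</table>',
                  SAMPLE_PAGE, re.S)

# Step 2: collect every row (the repeated capture group in the ini regex).
# '\s*' lets both '<tr>' and '<tr >' match.
rows = re.findall(r'<tr\s*>.+?</tr>', table.group(1), re.S) if table else []

for row in rows:
    print(row)
```

Each printed row is one show block, which is what the separator-string version with `<tr` and `</tr >` was also aiming for.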

jksmurf
Blackbear199 wrote:
  1. There are no overlapping shows for this site; adding this speeds up WebGrab, as it is one less thing it has to do. In reality, for this site it will make next to no difference, and everything is on the index page, which already makes it super fast. I added it out of habit.
  2. Again, not needed, but added out of habit. With the latest WebGrab (2.1.5), if you don't have episodesystem= set in the ini, a message is written to the log file saying so and that the default of "onscreen" is being used. Adding the setting avoids this message (less clutter).
  3. If you look at my channel list you will find this same channel as both site_id="en/epg/P" and site_id="epg/P".

 

Thank you for clarifying all of 1~3. There is little chance I would have got that far alone.
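
To make point 3 concrete: the global_temp_1.modify line quoted earlier applies the regex `^(.*?)\/*epg` to the site_id. Evaluated in Python (an assumption, since WebGrab's regex engine may differ in details, and how the ini uses the result afterwards is not shown in this thread), the first capture group is `en` for the English entry and empty for the other one. The `language_prefix` helper is hypothetical.

```python
import re

# Same pattern as in the ini; the backslash before '/' is not needed in Python.
PREFIX = re.compile(r'^(.*?)/*epg')


def language_prefix(site_id):
    """Return whatever precedes 'epg' in a site_id, minus any slash."""
    match = PREFIX.match(site_id)
    return match.group(1) if match else None


for site_id in ("en/epg/P", "epg/P"):
    print(repr(site_id), "->", repr(language_prefix(site_id)))
```

So one ini can tell the two channel entries apart purely from the site_id in the channel list.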

Blackbear199 wrote:

  • I did the channels xml creation and ini this way so that you can get the data in English or Chinese. I don't have to explain which site_id="xx" is for which, do I?

No, and thank you for clarifying.

Blackbear199 wrote:

  • This is to speed up the downloading of the index pages. It tells websites that WebGrab will accept compressed data if the site can send it.
  • Again added out of habit. For this site the urldate is not used, so this can be set to any valid urldate format.
  • This is added because, as I said above, the ini is written to grab the data in either Chinese or English. It automatically changes the language attribute to whichever the user chooses, based on the <channel line. So if the <channel line is the Chinese one, then the site_id="xx" will not have "en" in it, and the cultureinfo= setting language attribute (cn) will be used.
  • Nothing special here; it gets the channel logo.
  • There are 2 ways to scrub information from the page data: the separator string method, which is the format you're familiar with, and regex, which is what I used above. There is a section in the manual on this; read it, and read it again, as it's a lot to take in. There are examples of the separator string method and its equivalent in regex for scrubbing both single and multi value elements. If you have never done regex, it is a bugger to get a grasp on, and all I can say is start simple and build from there (besides doing tons of reading online). Once you get used to it you will love it, as there are going to be times when you find the separator string method will not work nicely for what you're trying to do, whereas with regex it may be easier (once you understand it). I didn't have to use regex here; the only difference is what data is captured for the individual shows.

I have much to learn, but to be perfectly honest I am very unlikely to be able to devote the time to doing so, to the extent that seems required.

Thank you all the same, once again,

k.
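
As a footnote to the Accept-Encoding point: the header only advertises that the client can handle compressed responses; whether the server actually compresses is up to the server. The Python sketch below shows why compression helps for repetitive HTML such as an EPG table. The sample markup is invented, and the saving on the real site was not measured in this thread.

```python
import gzip

# Invented, highly repetitive markup standing in for a 7-day EPG page.
row = '<tr><td>06:00</td><td>Some programme title</td></tr>\n'
page = ('<table class="b epg-detail">' + row * 500 + '</table>').encode('utf-8')

compressed = gzip.compress(page)        # what a gzip-capable server could send
restored = gzip.decompress(compressed)  # what the client reconstructs

print(len(page), "bytes uncompressed")
print(len(compressed), "bytes gzip-compressed")
print("round trip intact:", restored == page)
```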

