YES - 4W Israel

izi
Offline
Joined: 1 month
Last seen: 2 days

Hi,

I am testing the siteini for yes.co.il and found that part of the show title is missing.

Looking at the website EPG, I see that the title has two parts separated by a dash, like this:

<span class="text">21:00 - אהרוני וגידי - בטברנה ביוון</span>

But the grabber builds the title from the right-hand part only, like this: "אהרוני וגידי"

I think the dash is confusing the scrub.

How can I fix this?

izi
Offline
Joined: 1 month
Last seen: 2 days

If I change the scrub to multi, I get both parts, but with a vertical bar between them.

Blackbear199
Offline
Joined: 2 years
Last seen: 2 days

index_title.scrub {regex||<span class="text">\d{2}:\d{2}\s-\s([^<]*)</span>||}

You cannot use the separator-string method here. Using your example:

<span class="text">21:00 - אהרוני וגידי - בטברנה ביוון</span>

index_title.scrub {single(separator=" - " exclude=first)|<span class="text">||</span>|</span>}

exclude=first excludes the time.

Each " - " gets replaced with a | (as you noticed, WebGrab+Plus uses these internally for multi-value elements), and then the exclude is applied, which removes the time part, leaving:

בטברנה ביוון|אהרוני וגידי

which you don't want.

The regex approach looks for the time and the first " - " after it, then grabs everything else up to the closing </span>:

אהרוני וגידי - בטברנה ביוון
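
To see the difference outside WebGrab+Plus, here is a small Python sketch. It is not WebGrab+Plus code. It applies the same regular expression with Python's `re` module and imitates the separator method with a plain string split, so the variable names and the split/join logic are assumptions for illustration only.

```python
import re

# The EPG line from the example above.
html_line = '<span class="text">21:00 - אהרוני וגידי - בטברנה ביוון</span>'

# Regex approach: skip the "HH:MM - " prefix, then capture everything
# up to the closing </span>, including any later " - " in the title.
pattern = re.compile(r'<span class="text">\d{2}:\d{2}\s-\s([^<]*)</span>')
match = pattern.search(html_line)
regex_title = match.group(1) if match else None

# Imitation of the separator approach (an assumption about the effect,
# not the real implementation): every " - " becomes a split point,
# the first part (the time) is dropped, and the rest is joined with "|".
inner = html_line[len('<span class="text">'):-len('</span>')]
parts = inner.split(" - ")
separator_title = "|".join(parts[1:])

print(regex_title)      # full title, dash kept
print(separator_title)  # two parts joined by a vertical bar
```

The regex keeps the title in one piece because it only treats the first " - " (the one right after the time) as special.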

izi
Offline
Joined: 1 month
Last seen: 2 days

Your first option works fine.

What does the cleanup do? I see that it removes extra spaces, but it also removes "!". Is that correct? What else does it do?

Where can I see the full syntax of the scrubs?

Blackbear199
Offline
Joined: 2 years
Last seen: 2 days

On the downloads page there is a manual (get the 2.1 version).

4.6.4.6 Cleanup
This can be useful to tidy-up the result of a scrubbed element. It:
• tries to remove remaining html tags. (see also the argument tags= further down this section)
• replaces newline \n and tabs \t characters by a space.
• removes carriage returns.
• replaces multiple spaces by single spaces
• removes leading and trailing spaces
• removes illegal xml characters.
• restores Unicode character sequences like \\u00e6 to the actual chars
• restores special html characters above char 127 , like &auml; to the actual char ä by default
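
For a feel of what those steps do to a string, here is a rough Python approximation. It is only a sketch, not how WebGrab+Plus implements cleanup: the function name `cleanup`, the regular expressions, and the order of the steps are all assumptions. Unlike the manual's description, `html.unescape` decodes every HTML entity, not only those above char 127.

```python
import html
import re

# Control characters that are not allowed in XML 1.0 text.
_ILLEGAL_XML = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def cleanup(text: str) -> str:
    """Hypothetical approximation of the cleanup steps listed above."""
    text = re.sub(r"<[^>]+>", "", text)   # try to remove remaining HTML tags
    text = re.sub(r"[\n\t]", " ", text)   # newlines and tabs become spaces
    text = text.replace("\r", "")         # remove carriage returns
    text = _ILLEGAL_XML.sub("", text)     # remove illegal XML characters
    # Restore literal sequences such as \u00e6 to the actual character.
    text = re.sub(
        r"\\u([0-9a-fA-F]{4})",
        lambda m: chr(int(m.group(1), 16)),
        text,
    )
    text = html.unescape(text)            # &auml; becomes the actual char
    text = re.sub(r" {2,}", " ", text)    # collapse multiple spaces
    return text.strip()                   # trim leading and trailing spaces


print(cleanup("  <b>Caf\\u00e9</b>\n &auml;  x\r "))
```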

izi
Offline
Joined: 1 month
Last seen: 2 days

Thanks. Will learn it

