Syntax for removal of text from Show Title in index_title.modify?

Wed, 2019-02-27 10:35

#2

jksmurf

Offline

Joined: 11 years

Last seen: 2 years

Hmm, thank you kvanc,

I tried both of those but unfortunately did not get the intended result;

I also tried

title.modify {remove|[MAN]} (as opposed to index_ ....) and
title.modify {remove(type=regex)|"(\[MAN\])"}

I also tried Man (vs MAN).

None of which work. Any other hints please? Log and inis attached.

Attachments:

WebGrab.log_.zip

Wed, 2019-02-27 14:29

#3

jksmurf

Offline

Joined: 11 years

Last seen: 2 years

kvanc wrote:

U shud add ur guide.xml to make us see what is problem.

OK, is this OK?

kvanc wrote:

Also u should remove title(not index) element.

OK ta.

Attachments:

NowBBWebGrabOut.zip

Thu, 2019-02-28 03:35

#4

jksmurf

Offline

Joined: 11 years

Last seen: 2 years

Goodness me, sorry kvanc I missed that it had actually worked as I saw so many [Eng] and [Can] remaining!

I'm not a Mandarin or Cantonese speaker, so I've decided to strip out [Eng] as default and leave [Man] [Can] and [Man/Can].
Thanks for your help, appreciated.

Looking at the resultant XML, I also see I need to strip out these from the show titles otherwise the Metadata in my PVR doesn't work very well.

Green Challenge:
Weekend Blockbuster:
Pearl Heritage:
Late Late Show:
And the Oscar Goes to:
Signature Monday:
Market Overview and CGTN - NPC Opening:

In TVxB (a very old scraper) this was a left clip based on a comma-delimited list (e.g. titlelclip=Blockbuster:,Heritage:,Show;...etc). However, there are SOME I want to leave (e.g. Toon Disney: and Food:), i.e. I do not wish to strip them all out. As there are more to strip than to leave, I am guessing a regex for

"Remove ALL TEXT before and including the colon, except when it is THIS or THIS" would be more elegant.

Looking around a bit at recommended regex settings, I have come up with this for removing ALL instances (but it doesn't work anyway):

index_title.modify {remove(type=regex)|"^.+(\:)(?=[^:]+:[^:]+$)"}

It's not doing anything, and the log is not showing that it is failing either.
For the meantime I have just added a list like this:

index_title.modify {remove|[Eng]}
index_title.modify {remove|[Eng/Can]}
index_title.modify {remove|[PG]}
index_title.modify {remove|(PEARL)}
index_title.modify {remove|Weekend Blockbuster:}
index_title.modify {remove|Green Challenge:}
index_title.modify {remove|Weekend Blockbuster:}
index_title.modify {remove|Pearl Heritage:}
index_title.modify {remove|Late Late Show:}
index_title.modify {remove|And the Oscar Goes to:}
index_title.modify {remove|Signature Monday:}
index_title.modify {remove|Market Overview and CGTN - NPC Opening:}
*index_title.modify {remove(type=regex)|"^.+(\:)(?=[^:]+:[^:]+$)"}
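
A likely reason the regex on the last line appears to do nothing: its lookahead (?=[^:]+:[^:]+$) demands a second colon after the one being matched, so a title with a single colon never matches. Here is a minimal sketch using Python's re module as a stand-in tester. It assumes WG's regex engine treats this pattern the same way, and the sample titles are made up rather than taken from the attached guide.

```python
import re

# The attempted pattern, as a plain Python raw string (no WG quoting).
ATTEMPT = re.compile(r"^.+(\:)(?=[^:]+:[^:]+$)")


def remove_match(title: str) -> str:
    """Delete whatever ATTEMPT matches, mimicking a regex 'remove'."""
    return ATTEMPT.sub("", title, count=1)


# Made-up sample titles, not taken from the attached guide.
one_colon = remove_match("Weekend Blockbuster: Avatar")
two_colons = remove_match("Pearl Heritage: Food: Noodles")

print(repr(one_colon))   # unchanged: the lookahead finds no second colon
print(repr(two_colons))  # only the part up to the first colon is removed
```

So the pattern only fires on titles that contain at least two colons, which would explain why nothing changes and nothing is logged as an error.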

I should probably go spend more time learning regex using https://regexr.com/ or https://www.regexpal.com/ or https://www.regexplanet.com/ or http://www.nregex.com/ but most of these sites assume you know it and want to test it. I will need to learn some more first I guess but for now I have an ugly solution.

Cheers

k.

Attachments:

NowBBWebGrab.zip

Thu, 2019-02-28 03:50

#5

Blackbear199

Online

Joined: 9 years

Last seen: 3 min

index_title.modify {remove(type=regex)|"^[^-:]*[-:]\s*"}

^ start from beginning
[^-:]* zero or more of any character that's not a - or a :
[-:] an actual - or a :
\s* zero or more spaces
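
To see what this pattern removes, here is a small sketch in Python's re module, used only as a stand-in tester (WG's own engine may differ in details). The sample titles are invented for illustration.

```python
import re

# Same pattern as in the post, without the WG quoting.
PREFIX = re.compile(r"^[^-:]*[-:]\s*")


def strip_prefix(title: str) -> str:
    """Remove everything up to and including the first '-' or ':'."""
    return PREFIX.sub("", title)


# Invented sample titles.
samples = [
    "Green Challenge: Episode 4",
    "Late Late Show - Tonight",
    "Plain Title",
    "9-1-1",
]

for title in samples:
    print(f"{title!r} -> {strip_prefix(title)!r}")
```

Note the last sample: because a hyphen counts as a separator too, a title that contains a hyphen but no prefix also gets clipped.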

index_title.modify {remove(type=regex)|"\[(?:Eng(?:\/Can)?\|PG\|PEARL)\]"}

\[ an actual [
(?: start non capture group
Eng(?:\/Can)? Eng or Eng/Can
PG self explanatory
PEARL self explanatory
) end non capture group
\] an actual ]

edit:
PEARL is inside ( ) not [ ]
index_title.modify {remove(type=regex)|"\[(?:Eng(?:\/Can)?\|PG)\]\|\(PEARL\)"}
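
A quick way to check the edited pattern outside WG is sketched below in Python. It assumes the PEARL alternative is meant to match a literal (PEARL) in round brackets, and it writes the alternation with plain | because the backslash-pipe form is WG-specific. The whitespace clean-up and the sample titles are additions for the demo only.

```python
import re

# Plain-regex form of the edited pattern: bracketed tags or a literal (PEARL).
TAGS = re.compile(r"\[(?:Eng(?:/Can)?|PG)\]|\(PEARL\)")


def strip_tags(title: str) -> str:
    """Remove the listed tags, then collapse leftover whitespace (demo only)."""
    without_tags = TAGS.sub("", title)
    return " ".join(without_tags.split())


# Invented sample titles.
print(strip_tags("Evening News [Eng/Can] [PG]"))
print(strip_tags("Movie Night (PEARL) [Eng]"))
print(strip_tags("Drama Hour [Man]"))  # [Man] is not in the pattern, so it stays
```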

Fri, 2019-03-01 03:30

#6

jksmurf

Offline

Joined: 11 years

Last seen: 2 years

Blackbear199 wrote:

index_title.modify {remove(type=regex)|"^[^-:]*[-:]\s*"}
^ start from beginning
[^-:]* zero or more of any character thats not a - or a :
[-:] a actual - or a :
\s* zero or more spaces

Thank you! This works well and is easier to understand. Basically: start at the beginning and run along (for removal) any text that is not a "-" or a ":"; when you get to a "-" or ":", strip out everything before and including it, and stop, except take the space(s) after it as well.

I removed the "-" as 9-1-1 ended up missing the "9" and News at Seven-Thirty became "News at Seven".

I do have a request (as above), and that is exceptions: say I wished to remove all text before a colon that is NOT "Toon Disney:", i.e. KEEP the ones called "Toon Disney:". Is there an exception expression in the regex syntax?

RESULT index_title.modify {remove(type=regex)|"^[^:]*[:]\s*"}
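
One possible way to express the "except when it is THIS or THIS" request is a negative lookahead, (?!...), placed right after the ^ anchor. This is not confirmed anywhere in this thread; it is a sketch in Python's re module, assuming WG's engine supports lookaheads the same way. The sample titles are invented.

```python
import re

# Colon-only prefix removal, skipped when the title starts with a kept prefix.
# (?!Toon Disney:|Food:) fails the match up front for those two prefixes.
PREFIX_EXCEPT = re.compile(r"^(?!Toon Disney:|Food:)[^:]*:\s*")


def strip_prefix(title: str) -> str:
    """Remove text up to the first colon unless the prefix is on the keep list."""
    return PREFIX_EXCEPT.sub("", title)


# Invented sample titles.
for title in [
    "Weekend Blockbuster: Avatar",
    "Toon Disney: Mickey Mouse",
    "Food: Street Eats",
    "9-1-1",
]:
    print(f"{title!r} -> {strip_prefix(title)!r}")
```

In WG syntax this would presumably be written with the pipe escaped, i.e. "^(?!Toon Disney:\|Food:)[^:]*:\s*", but that form is untested here.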

Blackbear199 wrote:

index_title.modify {remove(type=regex)|"\[(?:Eng(?:\/Can)?\|PG\|PEARL)\]"}
\[ a actual[
(?: start non capture group
(?:Eng(?:\/Can)? Eng or Eng/Can
PG self explanatory
PEARL self explanatory
) end non capture group
\] a actual ]
edit:
PEARL is inside ( ) not [ ]
index_title.modify {remove(type=regex)|"\[(?:Eng(?:\/Can)?\|PG)\]\|\(PEARL\)"}

This worked well too but is much harder to work out; https://www.regular-expressions.info/refcapture.html
The "non capture" group is applied in this instance to simply recognise text, which {remove then operates on?

So the first "[" is matched by "\[", where the "\" says the "[" is the actual character you wish to match and not part of a regex command; let's call this A. Then "(?:ENG" => ENG, say B, so A+B = [ENG, the first part of an as yet uncompleted expression.

Is the "(?:" in "(?:\/CAN)" also the start of a non-capture group? So => /CAN, where again the "\" = actual (applied to the "/")? So, say /CAN = C. I am not sure what makes it AND/OR, but I assume the absence of anything between "(?:ENG" and "(?:\/CAN" does this by default? So it can be [ENG (A+B) OR [ENG/CAN (A+B+C)? https://www.regular-expressions.info/optional.html

The pipe | then separates the next expression from the capture group, right? So PG (after the pipe) becomes D. So A+D => "[PG", because B and C are a separate capture group? Then the next character is an actual "]" (denoted by "\")? Say this is E.

AND: \( = actual open bracket "("; followed by PEARL; then \) = actual ")". Result: (PEARL) = F

So
A+B+E = [ENG]
A+B+C+E = [ENG/CAN]
A+D+E = [PG]
F = (PEARL)

I am a wee bit confused with the and/or operators TBH

If I was doing it I would probably just write this for simplicity:

"\[Eng\]|\[Eng\/Can\]|\[PG\]\|\(PEARL\)"

This would work I guess?

Thank you once again; as you see I have tried, but regex really is "out there" for me ... doable but needs patience; not as logical as some syntax :-)

k.

Fri, 2019-03-01 11:59

#7

Blackbear199

Online

Joined: 9 years

Last seen: 3 min

u r correct(kinda).
u forgot to escape the first 2 vertical pipes(read below).
i think u meant to but missed it.

"\[Eng\]|\[Eng\/Can\]|\[PG\]\|\(PEARL\)"
........^............^.......^..........

when using a regex tester u need to remember that the wg regex engine is slightly diff.
on regex101 they use /xxxxx/ where / are the delimiters; in wg the " is used as the start and stop delimiter when using remove/replace/substring, or when using regex as a scrub it's the ||xxxx||
we also need to escape the vertical pipe \| where in a regex tester u dont.
this has nothing to do with regex but with wg, as the vertical pipe is used to separate elements that are multi value internally, so we must escape it to tell wg that we want a real | and not have it used to separate elements.

so if u wanted to try this same regex on regex101 it would be

/\[Eng\]|\[Eng\/Can\]|\[PG\]|\(PEARL\)/

notice the | are not escaped
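
The conversion described above (WG writes alternation as backslash-pipe, a normal tester wants a bare pipe) can be sketched in Python as below. wg_to_plain is a hypothetical helper written for this demo, not part of WG. The PEARL alternative is written with escaped parentheses, which is an assumption about the intended pattern. The naive replace would also rewrite a backslash-pipe that was meant as a literal pipe, so it is only good for patterns like these.

```python
import re


def wg_to_plain(wg_pattern: str) -> str:
    # Hypothetical helper: turn WG's backslash-pipe into a plain regex pipe.
    return wg_pattern.replace("\\|", "|")


# The flat and the nested variants discussed in the thread, in WG form.
WG_FLAT = r"\[Eng\]\|\[Eng\/Can\]\|\[PG\]\|\(PEARL\)"
WG_NESTED = r"\[(?:Eng(?:\/Can)?\|PG)\]\|\(PEARL\)"

flat = re.compile(wg_to_plain(WG_FLAT))
nested = re.compile(wg_to_plain(WG_NESTED))

# Check that both variants accept and reject the same tags.
tags = ["[Eng]", "[Eng/Can]", "[PG]", "(PEARL)", "[Man]", "[Can]"]
agreement = {
    tag: (bool(flat.fullmatch(tag)), bool(nested.fullmatch(tag))) for tag in tags
}

for tag, (in_flat, in_nested) in agreement.items():
    print(f"{tag:10} flat={in_flat} nested={in_nested}")
```

Under these assumptions the simpler flat version and the nested non capture version match exactly the same tags, so either style should do.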

now no capture groups..
are just that, a group that u dont want captured.
so say we have this..

index_title.modify {remove(type=regex)|"\[(?:Eng(?:\/Can)?\|PG)\]\|\(PEARL\)"}

what we have here is actually 2 non capture groups,one inside the other.

the outer one..

(?:Eng(?:\/Can)?\|PG)

and the inner

(?:\/Can)?

if u notice one has a ? at the end,what is this?

we call this a quantifier.

common quantifiers are..

* zero or more
+ 1 or more
? zero or one

it may look confusing but once u get it in ur head these are very powerful depending on what ur trying to accomplish.

lets go back to the no capture again.

(?:\/Can)?

so the ? at the end means zero or 1 instance of the preceding character or in our case a non capture group.
in other words the text /Can can be there or not be there.

this is how we can search for Eng or Eng/Can with a single regex expression.
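
A tiny check of the optional group, using Python's re module as a stand-in tester (the pattern is the same, minus the escaped slash, which a plain tester does not need):

```python
import re

# "Eng", optionally followed by "/Can"; (?:...) groups without capturing.
LANG = re.compile(r"Eng(?:/Can)?")


def is_lang_tag(text: str) -> bool:
    """True only if the whole text is 'Eng' or 'Eng/Can'."""
    return LANG.fullmatch(text) is not None


for candidate in ["Eng", "Eng/Can", "Eng/", "Eng/Can/Can"]:
    print(f"{candidate!r}: {is_lang_tag(candidate)}")

# A non capture group does not add to the pattern's group count.
print("capture groups:", LANG.groups)
```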

now i will explain a few others..
say we have this..

"id":"12345","name":"abcd",

do u see the diff between these 2 regex..

temp_1.scrub {regex||"id":"(.*?)",||}

and

temp_1.scrub {regex||"id":"(.+?)",||}

the answer is not much(in this case),both will give you the exact same result(12345).

.*?

. any character
* zero or more of the preceding character
? after a quantifier, makes it non greedy (match as little as possible)

.+?

. any character
+ one or more of the preceding character
? after a quantifier, makes it non greedy (match as little as possible)

see the diff the * and + mean?

now lets change the data to this..

"id":"","name":"abcd",

temp_1.scrub {regex||"id":"(.*?)",||}

result...nothing(empty).

temp_1.scrub {regex||"id":"(.+?)",||}

result...","name":"abcd

now ur going wtf...why?

look at the regex .+?

. any character
+ ONE or more of the preceding
? after a quantifier, makes it non greedy (match as little as possible)

so when the regex ran it looked for..

"id":"

and started the capture, but since we used + it says i have to have 1 or more of the preceding . (any character)

so .+? used the first " and then regex said i need to go on to the next ", as a stop string.

the first regex .*? says i dont have to have anything between my start and stop string(it can be empty).
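
The same experiment can be reproduced in an ordinary regex tester. Below is a sketch in Python's re module, assuming it behaves like WG's engine for these two patterns; first_capture is just a helper name made up for the demo.

```python
import re

FULL = '"id":"12345","name":"abcd",'
EMPTY = '"id":"","name":"abcd",'

STAR = r'"id":"(.*?)",'  # lazy, may capture nothing
PLUS = r'"id":"(.+?)",'  # lazy, must capture at least one character


def first_capture(pattern: str, text: str):
    """Return capture group 1 of the first match, or None if nothing matches."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


print(repr(first_capture(STAR, FULL)))
print(repr(first_capture(PLUS, FULL)))
print(repr(first_capture(STAR, EMPTY)))
print(repr(first_capture(PLUS, EMPTY)))  # swallows the closing quote and runs on
```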

starting to see the diff between a capture and a no capture?

temp_1.scrub {regex||"id":"(.*?)",||} ====> whatever between "id":" and ",

temp_1.scrub {regex||"id":"(?:.*?)",||} ==> everything(including the start and stop string)
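
In a general-purpose regex library the same distinction shows up as group 0 (the whole match) versus group 1 (the captured part). The sketch below uses Python's re module and assumes, as described above, that a WG scrub returns the whole match when the pattern has no capture group.

```python
import re

DATA = '"id":"12345","name":"abcd",'

with_capture = re.search(r'"id":"(.*?)",', DATA)
without_capture = re.search(r'"id":"(?:.*?)",', DATA)

# Both patterns match the same stretch of text...
print(with_capture.group(0))
print(without_capture.group(0))

# ...but only the first one has a captured piece to hand back.
print(with_capture.group(1))
print(without_capture.groups())  # empty tuple: nothing was captured
```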

Fri, 2019-03-01 12:50

#8

Blackbear199

Online

Joined: 9 years

Last seen: 3 min

also ur right here..

RESULT index_title.modify {remove(type=regex)|"^[^:]*[:]\s*"}

i noticed this after i posted but said... i wonder if u will see whats wrong.lol

since ur only checking for : u could also make it a bit simpler

index_title.modify {remove(type=regex)|"^[^:]*:\s*"}

Sat, 2019-03-02 04:52

#9

jksmurf

Offline

Joined: 11 years

Last seen: 2 years

Hi Blackbear, just a wee note to acknowledge your incredible response while I go away and do some homework on it. I have read it about 5 times and my head is reeling but it is starting to make sense the more I read. Will come back and comment to try and give your comment the justice it deserves :-). I am finding https://www.regular-expressions.info/tutorial.html useful.

Thank you again,

k.

Sat, 2019-03-02 05:56

#10

jksmurf

Offline

Joined: 11 years

Last seen: 2 years

For the first part wrt

my simplified attempt was:

"\[Eng\]|\[Eng\/Can\]|\[PG\]\|\(PEARL\)"

you say in WG this should read

"\[Eng\]\|\[Eng\/Can\]\|\[PG\]\|\(PEARL\)"

as \| escapes the pipe | so it does not get read as part of the regex string to find.
No problem, understood as backslash pipe in here https://www.regular-expressions.info/refbasic.html (Alternation).

Funny that \[ says you WANT to match (an actual) [ but \| does not say you want to match a pipe, but escape it :-).

I will get onto the others later.

Sat, 2019-03-02 11:46

#11

Blackbear199

Online

Joined: 9 years

Last seen: 3 min

thats what escaping does,it says this character has no special meaning.

the | is a separator in regex.

we need to escape it when used in webgrab regex expressions because it's also used internally by webgrab to separate multi value elements. so we're telling webgrab we want a real |, which regex then uses as its separator.

example a multi value category is stored like this internally in webgrab

Category1|Category2|Category3
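
To picture why the escape is needed, here is a toy model in Python. It is not WebGrab+Plus's actual parser, just an assumed illustration of "split on every pipe that is not escaped"; split_multi_value is a made-up name.

```python
import re

# Toy rule: a pipe splits values unless a backslash comes right before it.
UNESCAPED_PIPE = re.compile(r"(?<!\\)\|")


def split_multi_value(value: str) -> list[str]:
    """Split a pipe-separated value, leaving backslash-escaped pipes alone."""
    return UNESCAPED_PIPE.split(value)


print(split_multi_value("Category1|Category2|Category3"))

# An unescaped pipe inside a regex would be cut in two by such a splitter...
print(split_multi_value(r"\[Eng\]|\[PG\]"))

# ...while the escaped form survives as a single piece.
print(split_multi_value(r"\[Eng\]\|\[PG\]"))
```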
