And to the RegEx experts... Yeah, I know I'm massively oversimplifying things here and using incorrect terminology. I'm trying to keep things as simple as possible for this example.
As far as I'm concerned, you're doing fine. You just simplified a few things a little too much, which can have unwanted side effects. That's the problem with RegEx: it's really sensitive, and it's not easy to simplify the explanations without taking risks.
For example, a quantifier followed by "?" (for example "*?") means "and please don't be greedy". Therefore, the difference between
[ab]*a{2}
(any run of consecutive "a" or "b" characters, of any length, ended by two consecutive "a") and
[ab]*?a{2}
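(which means the same thing, but in a non-greedy way) looks anecdotal, yet it changes the result. Both patterns find a match in "abaaabbbbaa", because the engine backtracks until the whole pattern fits, but they don't catch the same part of it: the first one catches the whole string, the second one only "abaa".
For the first one, it will be:
- "abaaabbbbaa" matches "[ab]*"
- I don't find the two "a" that should follow.
- What happens if I give back one character, then another one?
- "abaaabbbb" matches "[ab]*"
- Now the two trailing "a" follow, so I catch "abaaabbbbaa".
While for the second one it will be:
- "" (nothing) matches "[ab]*?"
- I don't find the two "a" that should follow.
- What happens if I take one more character, then another one?
- "ab" matches "[ab]*?"
- Now two "a" follow, so I stop here and catch "abaa".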
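To see what each pattern really catches in that string, here is a minimal sketch using Python's re module (Python is my assumption here, other engines may differ on details). It only prints the captured text for the greedy and the non-greedy version.

```python
import re

text = "abaaabbbbaa"

# Greedy: [ab]* takes everything, then gives characters back until a{2} fits.
greedy = re.match(r"[ab]*a{2}", text)

# Non-greedy: [ab]*? takes nothing, then grows until a{2} fits.
lazy = re.match(r"[ab]*?a{2}", text)

print(greedy.group())  # abaaabbbbaa
print(lazy.group())    # abaa
```

Both find something; the difference is where the match stops.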
At the same time, it implies that
[ab]*
isn't the same thing as
[ab]*?a{2}
, since the first one would indeed catch the whole of "abaaabbbbaa", but would also catch the whole of things like "abaaabbbbab" or "abaaabbbbba".
That's what makes RegEx difficult to use correctly, even for experts: you don't just need to turn the string you're searching for into a pattern, you also have to build this pattern in such a way that it will only match what you are searching for.
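A small check of that difference, as a sketch in Python (my choice of language): fullmatch only succeeds when the pattern covers the whole string, which is the meaning of "catch" used here.

```python
import re

samples = ["abaaabbbbaa", "abaaabbbbab", "abaaabbbbba"]

loose = re.compile(r"[ab]*")
strict = re.compile(r"[ab]*?a{2}")

# For each sample: (caught by the loose pattern, caught by the strict pattern)
results = {
    s: (bool(loose.fullmatch(s)), bool(strict.fullmatch(s)))
    for s in samples
}

for sample, (by_loose, by_strict) in results.items():
    print(sample, by_loose, by_strict)
```

Only the first sample ends with two consecutive "a", so only that one passes the stricter pattern.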
But for a basic search, it's not that difficult, because you apply it to words, and "mom" will always be "mom". Just be sure that you'll not catch words like "moment" (or any word ending with "mom", I can't think of an example) by using
\b
before and after the pattern (
\bmom\b
) to tell that what you search for is a full word, and not a part of another word.
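As a quick illustration in Python (the sample sentence is made up for the example), compare where the bare word and the \b version are found:

```python
import re

line = "Wait a moment, mom."  # hypothetical dialog text

# Start positions of every hit, without and with word boundaries.
plain = [m.start() for m in re.finditer(r"mom", line)]
bounded = [m.start() for m in re.finditer(r"\bmom\b", line)]

print(plain)    # [7, 15] -> also hits the start of "moment"
print(bounded)  # [15]    -> only the full word "mom"
```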
\s
is "a space" and \S
is "not a space".
Here, you simplified a little too much, which can have unwanted side effects. It's not "space" but "whitespace character", a whitespace character being any character that leaves a blank when printed. Therefore it should match (it depends on the language) the space, the tab, and the line-ending characters (carriage return and line feed, which tell the editor to go to the next line).
Also, and it's absolutely not your fault, the inline code makes it difficult to distinguish the case. The first one is lower case ("s"), and the second one upper case ("S").
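A short sketch of the two classes with Python's re module (the exact set of whitespace characters depends on the engine, this only shows the common ones):

```python
import re

sample = "a b\tc\r\nd"  # space, tab, carriage return, line feed

whitespace = re.findall(r"\s", sample)      # lower case s: whitespace
non_whitespace = re.findall(r"\S", sample)  # upper case S: everything else

print(whitespace)      # [' ', '\t', '\r', '\n']
print(non_whitespace)  # ['a', 'b', 'c', 'd']
```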
There are also qualifiers like *
, which is "any number of the previous character"...
More precisely, it's "any number of the previous pattern, with 0 being a valid number".
Note the difference: it's not specifically a character that can come more than once, but the "pattern" right before it.
Keeping my
[ab]*?a{2}
pattern, it will still catch something like "abaa", but also something like "aa" (zero times a letter that is "a" or "b", followed by the letter "a" two consecutive times).
If one wants "at least one occurrence of the pattern, that can be repeated as often as needed", it's
+
that has to be used.
Therefore, while
[ab]*?a{2}
will catch "aa",
[ab]+?a{2}
will not, because there is neither an "a" nor a "b" before the two consecutive "a".
If you have difficulty working with RegEx, one thing that generally helps is to describe the pattern in words, in a very precise way. Therefore, the pattern
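The same comparison as a small Python sketch, testing whether each pattern catches the whole of "aa" and "abaa":

```python
import re

star = re.compile(r"[ab]*?a{2}")  # zero or more "a"/"b" before the "aa"
plus = re.compile(r"[ab]+?a{2}")  # at least one "a"/"b" before the "aa"

star_hits = [t for t in ("aa", "abaa") if star.fullmatch(t)]
plus_hits = [t for t in ("aa", "abaa") if plus.fullmatch(t)]

print(star_hits)  # ['aa', 'abaa']
print(plus_hits)  # ['abaa']
```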
[ab]*?a{2}
means:
- Either the character "a" or the character "b", excluding all other possible characters
- That can be omitted or be present more than once
- Taking as few characters as possible, so that it doesn't eat what the following pattern needs
- Followed by the character "a"
- That has to be present exactly two times.
When expressed this way, you'll be more likely to find the possible error in the logic behind the pattern you're using.
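If your engine supports it, the description in words can live inside the pattern itself. This is a sketch assuming Python, whose re.VERBOSE flag ignores whitespace in the pattern and allows # comments; other tools may not offer this.

```python
import re

described = re.compile(
    r"""
    [ab] *?   # 'a' or 'b', omitted or repeated, as few times as possible
    a {2}     # followed by the character 'a', exactly two times
    """,
    re.VERBOSE,
)

print(bool(described.fullmatch("abaa")))  # True
print(bool(described.fullmatch("aa")))    # True
print(bool(described.fullmatch("ab")))    # False
```

It behaves like the compact version; only the writing changes.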
So I would use a regex enabled search to search for (\bc\b)(.*)(\bmom\b)
.
- An independent pattern that is
  - Anything except a letter
  - Followed by the sayer variable name
  - Followed by anything except a letter
- Followed by another independent pattern that is
  - Any character
  - That can be omitted or present more than once
Beeeeeeep ! Logical error detected.
It will catch "object.c", "object.c.whatever", "object.c( parameters)" and a few things like that.
The pattern should start in a more precise way:
- Starts with
- Only whitespace, repeated any number of times or omitted [This is to catch those who don't always indent]
Therefore, it should be
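A quick way to see those unwanted catches, sketched in Python. The script lines are invented for the test; only the pattern comes from the discussion.

```python
import re

pattern = re.compile(r"(\bc\b)(.*)(\bmom\b)")

# Hypothetical script lines: one dialog line and two that are not.
lines = [
    'c "Hi mom"',
    'object.c = "mom"',
    'object.c.whatever("mom")',
]

hits = [line for line in lines if pattern.search(line)]
print(len(hits))  # 3 -> every line is caught, not only the dialog line
```

The "." counts as "anything except a letter", so \bc\b is satisfied in the middle of "object.c".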
^\s*c\b(.*)
Yet it will still catch things like " c.whatever" and " c( parameters)".
[Side note: the second one is a valid catch for a dialog line, but doesn't correspond to the present case, where the dialog lines are expected to be
sayer "line"
]
Therefore, you need to be more explicit about what is expected in place of
(.*)
:
- Separated by optional whitespace that can be repeated without limit
- Then something that starts with a single or a double quote
- And finally any kind of characters you want
So:
^\s*c\b\s*['"].*
It's possible to be even more precise (having the ending quote match the starting one, and taking into account the possible parameters added to the "say" statement), but that's overkill and far from being an "apprentice included" course.
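To compare the two anchored patterns, here is a minimal Python sketch. The sample lines are invented, and "c" stands for the sayer variable as in the rest of the example.

```python
import re

loose = re.compile(r"^\s*c\b(.*)")
strict = re.compile(r"""^\s*c\b\s*['"].*""")

# Hypothetical script lines: the first three are dialog lines.
lines = [
    'c "Hi mom"',
    "    c 'Hi mom'",
    'c"Hi mom"',
    "    c.whatever",
    "    c( parameters)",
    'object.c = "mom"',
]

loose_hits = [line for line in lines if loose.match(line)]
strict_hits = [line for line in lines if strict.match(line)]

print(len(loose_hits))   # 5 -> only "object.c" is rejected
print(len(strict_hits))  # 3 -> only the dialog lines remain
```

To apply it to a whole file at once rather than line by line, Python would need the re.MULTILINE flag so that ^ matches at every line start.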
Edit: Correcting typos that messed with the presentation.