Problem with regex again


0

Let's say that I have these two URLs:

site.com/hello-world/test.html
site.com/hello-world/test/test.html

For the first URL I have this regex:

^.*/([a-z0-9,-]+)/([a-z0-9,-]+).html$

But the second URL also matches this regex. How do I tell the regex that the first URL should be valid and the second should not?

  • There are infinite solutions, what is the regex logic you want to build?

    Eran Betzalel, September 8, 2009, 12:59
  • Yup - trivial solution would be the regex ^site\.com/hello-world/test\.html$. It matches the first but not the second URL.

    MSalters, September 8, 2009, 13:39
  • And to be pedantic, without a scheme such as http:// or https:, it is not a URL.

    MSalters, September 8, 2009, 13:40
  • To answer this question, you need to tell us why the first URL is valid and the second is not.

    Adam Bellaire, September 8, 2009, 12:58

6 Answers

3

Of course the second string also matches your regex:

sub-expression        result
-----------------------------------------------------------------------
^.*                   matches:   "site.com/hello-world/test/test.html"
/                     backtrack: "site.com/hello-world/test/"
([a-z0-9,-]+)         matches:   "site.com/hello-world/test/test" 
/                     backtrack: "site.com/hello-world/test/"
([a-z0-9,-]+).html$   matches:   "site.com/hello-world/test/test.html"
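
The trace above can be checked with a short Python sketch. It uses the pattern from the question unchanged; the helper name `captured` is made up for illustration.

```python
import re

# Pattern from the question, unchanged (greedy .* and an unescaped dot).
ORIGINAL = re.compile(r"^.*/([a-z0-9,-]+)/([a-z0-9,-]+).html$")

def captured(url):
    """Return the two captured groups, or None if the URL does not match."""
    m = ORIGINAL.match(url)
    return m.groups() if m else None

# .* backs off to "site.com", so the groups are the folder and the page.
print(captured("site.com/hello-world/test.html"))       # ('hello-world', 'test')
# .* backs off only to "site.com/hello-world", so the URL still matches.
print(captured("site.com/hello-world/test/test.html"))  # ('test', 'test')
```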

better:

sub-expression        result
-----------------------------------------------------------------------
^[^/]+                matches:   "site.com"
/                     matches:   "site.com/"
([a-z0-9,-]+)         matches:   "site.com/hello-world" 
/                     matches:   "site.com/hello-world/"
([a-z0-9,-]+)\.html$  fails (which is the expected result)

So you should use:

^[^/]+/([a-z0-9,-]+)/([a-z0-9,-]+)\.html$
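
A minimal Python sketch of this pattern, assuming the two URLs from the question as test input; the helper name `folder_and_page` is hypothetical.

```python
import re

# [^/]+ cannot cross a slash, so the domain part is limited to one segment.
ANCHORED = re.compile(r"^[^/]+/([a-z0-9,-]+)/([a-z0-9,-]+)\.html$")

def folder_and_page(url):
    """Return (folder, page) for a matching URL, otherwise None."""
    m = ANCHORED.match(url)
    return m.groups() if m else None

print(folder_and_page("site.com/hello-world/test.html"))       # ('hello-world', 'test')
print(folder_and_page("site.com/hello-world/test/test.html"))  # None
```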
  • That is what you seem to want - the second string should not match, in regex terms that is “the regex fails for this string”.

    Tomalak, September 8, 2009, 13:11
  • ^[^/]./([a-z0-9,-]+)/([a-z0-9,-]+).html$ fails?

    September 8, 2009, 13:09
0

I think the problem is the use of the greedy match-all .* at the beginning of your expression.

Cheat a little:

^.*(com|org|edu|net|gov)/([a-z0-9,-]+)/([a-z0-9,-]+).html$
1

For the first URL the .* part of the pattern matches "site.com", but for the second URL it matches "site.com/hello-world".

If you don't want to allow more than one folder, you can disallow slash characters in the part of the pattern that matches the domain name:

^[^/]*/([a-z0-9,-]+)/([a-z0-9,-]+)\.html$

(Note that I put a backslash before the period before the html extension. A period matches any character, while \. matches only a period.)
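
To illustrate the difference the backslash makes, here is a small Python check. The sample path `testxhtml` is invented for the demonstration and does not come from the question.

```python
import re

# Same pattern twice: once with a bare dot, once with an escaped dot.
loose = re.compile(r"^[^/]*/([a-z0-9,-]+)/([a-z0-9,-]+).html$")
strict = re.compile(r"^[^/]*/([a-z0-9,-]+)/([a-z0-9,-]+)\.html$")

sample = "site.com/hello-world/testxhtml"  # hypothetical path without a real extension

# The bare dot accepts the "x" in place of a period.
print(bool(loose.match(sample)))   # True
# The escaped dot requires a literal period before "html".
print(bool(strict.match(sample)))  # False
```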

Edit:
If you want to allow both URLs and use "hello-world/test" as folder for the second one, allow slashes in the folder part:

^[^/]*/([a-z0-9,/-]+)/([a-z0-9,-]+)\.html$

If you want to use "hello-world" as folder and "test/test" as page, allow slashes in the file name part:

^[^/]*/([a-z0-9,-]+)/([a-z0-9,/-]+)\.html$
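
A rough Python sketch of how the two alternatives split the second URL. The constant and helper names are made up. The slash-allowing class is written as `[a-z0-9,/-]`, with the hyphen last so that it is a literal hyphen; `,-/` would be read as a range from `,` to `/`, which also admits a period.

```python
import re

# Alternative 1: slashes allowed in the folder part.
FOLDER_WITH_SLASHES = re.compile(r"^[^/]*/([a-z0-9,/-]+)/([a-z0-9,-]+)\.html$")
# Alternative 2: slashes allowed in the page part.
PAGE_WITH_SLASHES = re.compile(r"^[^/]*/([a-z0-9,-]+)/([a-z0-9,/-]+)\.html$")

def split_url(pattern, url):
    """Return (folder, page) for a matching URL, otherwise None."""
    m = pattern.match(url)
    return m.groups() if m else None

url = "site.com/hello-world/test/test.html"
print(split_url(FOLDER_WITH_SLASHES, url))  # ('hello-world/test', 'test')
print(split_url(PAGE_WITH_SLASHES, url))    # ('hello-world', 'test/test')
```

Both patterns still give `('hello-world', 'test')` for site.com/hello-world/test.html, so the two pages stay distinguishable by their captured groups.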
  • @Frozzare: You specifically asked that the second URL should not be valid… I added some alternatives in the answer.

    Guffa, September 8, 2009, 13:11
  • I want to allow both site.com/hello-world/test.html and site.com/hello-world/test/test.html,

    but they are two different pages.

    September 8, 2009, 13:06
  • @Frozzare: I don’t understand what you want, you seem to contradict yourself over and over… I have given you alternatives both for matching only the first URL and for matching both URLs, something should match your requirements…

    Guffa, September 8, 2009, 18:03
0

Not a solution, just a suggestion: there are many excellent tools that let you experiment with regular expressions and actually help you write them.
I particularly like Expresso; apparently The Regulator is also very good.

0

In the second case, .* is matching more than you would expect.

Perhaps replace it with the non-greedy quantifier:

^.*?/([a-z0-9,-]+)/([a-z0-9,-]+).html$
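
It is worth testing this before relying on it. In the quick Python check below, which uses the pattern exactly as written above, the lazy quantifier changes the order in which prefixes are tried but does not seem to exclude the second URL: when the shortest prefix fails, the engine extends it until the rest of the pattern fits.

```python
import re

# Lazy variant from this answer, unchanged.
LAZY = re.compile(r"^.*?/([a-z0-9,-]+)/([a-z0-9,-]+).html$")

first = LAZY.match("site.com/hello-world/test.html")
second = LAZY.match("site.com/hello-world/test/test.html")

print(first.groups())   # ('hello-world', 'test')
# .*? grows from "site.com" to "site.com/hello-world" so the match can succeed.
print(second.groups())  # ('test', 'test')
```

Restricting the prefix to non-slash characters, as in `^[^/]+/`, is what actually rules out the extra path segment.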
0

.* matches "site.com/hello-world" in the second case. You have to be more specific for the domain part.