Friday, October 27, 2006

Find Links Using RegEx

I am writing a link crawler and found this regular expression useful. It came from another site, but I modified it a little.

string matchlinks = @"<a[^>]*?HREF\s*=\s*[""']?([^'"" >]+)[ '""]?[^>]*?>";

This will match any <a href="..."> tag, even if it has other attributes in it or uses single quotes. I suggest stripping all line breaks (\r\n, \r, and \n) before trying to match. For the match options I combined RegexOptions.Singleline, RegexOptions.IgnoreCase, and RegexOptions.IgnorePatternWhitespace with a bitwise OR. Enjoy!
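
Here is a rough sketch of how the pieces above might fit together in a small console program. The class name LinkFinder, the helper FindLinks, and the sample HTML are made up for illustration. The pattern assumes the expression starts with `<a[^>`, and it uses a greedy capture group (`+` rather than `+?`), because a lazy group followed by an optional quote would capture only the first character of the URL.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkFinder
{
    // Pattern from the post. Assumptions: the leading "<a[^>" and the
    // greedy "+" on the capture group (see the note above).
    private const string MatchLinks =
        @"<a[^>]*?HREF\s*=\s*[""']?([^'"" >]+)[ '""]?[^>]*?>";

    // Hypothetical helper: returns the href value of every <a> tag found.
    public static List<string> FindLinks(string html)
    {
        // Strip all line breaks before matching, as suggested in the post.
        string flat = html.Replace("\r\n", "").Replace("\r", "").Replace("\n", "");

        // The three match options from the post, combined with bitwise OR.
        RegexOptions options = RegexOptions.Singleline
                             | RegexOptions.IgnoreCase
                             | RegexOptions.IgnorePatternWhitespace;

        var links = new List<string>();
        foreach (Match m in Regex.Matches(flat, MatchLinks, options))
        {
            // Group 1 is the URL between the optional quotes.
            links.Add(m.Groups[1].Value);
        }
        return links;
    }

    public static void Main()
    {
        // Made-up sample input: one tag split across lines, one in upper case.
        string html = "<p><a class='x'\r\n href=\"http://example.com/a\">A</a> "
                    + "<A HREF='/b.html' target=_blank>B</A></p>";

        foreach (string link in FindLinks(html))
        {
            Console.WriteLine(link);
        }
    }
}
```

Because the line breaks are removed rather than replaced with a space, an attribute that starts on a new line with no leading whitespace would run into the previous one. Replacing each break with a single space instead may be safer for a crawler.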
