[ragel-users] Breaking out of a scanner

Adrian Thurston adrian.thurston at esentire.com
Thu Feb 25 07:19:05 PST 2010


So this scanner will backtrack quite a bit. Every word that turns out 
not to be an email address will be rescanned starting from each of its 
characters. To eliminate that, you can add a pattern just before the 
'any' pattern that consumes email_chars+. As long as that pattern 
doesn't contain '@', it will just take the place of the default for 
input that looks almost like an email address but isn't quite.
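
An untested sketch of what that might look like, reusing the names from the 
grammar quoted below. It assumes email_chars, domain_part, RESET() and the 
email_* actions are defined earlier in the machine exactly as in the original; 
the tag and comment patterns are left out for brevity and would stay as they are:

```ragel
main := |*
    # Email pattern, unchanged from the original scanner.
    ((email_chars+) >email_start ('.' email_chars+)* '@' @email_confirmed
     (domain_part '.')+ domain_part) $email_max => email_end;

    # ... tag and comment patterns as before ...

    # New: consume a whole run of email characters as a single token.
    # When the email pattern fails, the scanner falls back to this match
    # and resumes after the run instead of one character later.
    email_chars+ => { RESET(); };

    # Default: anything else is consumed one character at a time.
    any => { RESET(); };
*|;
```

Because a scanner prefers the longest match, the new pattern never wins over a 
complete email address (which is always longer, since it includes the '@'), and 
it always wins over 'any' for runs of two or more email characters.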

-Adrian

Matthieu Tourne wrote:
> So I'm writing a parser that recognizes email addresses in an HTML 
> document, in order to obfuscate them.
> 
> This is a slightly simplified version of my current grammar:
> 
> main := |*
> ((email_chars+) >email_start ('.' email_chars+ )* '@' @email_confirmed 
> (domain_part '.')+  domain_part) $email_max => email_end;
> 
> # turn off email scanning until the end of the tag
> '<' html_tag => { RESET(); fgoto tag; };
> 
> # turn off email scanning until the end of the comment
> '<!--'  => { RESET(); fgoto comment; };
> 
> any => { RESET(); };
> 
> *|;
> 
> email_chars = [a-zA-Z0-9#&+~_\-];
> domain_part = [a-zA-Z0-9] ([a-zA-Z0-9\-]* [a-zA-Z0-9])?;
> 
> RESET(); is a macro to reset some internal tracking variables (set by 
> actions such as email_start, email_confirmed, etc...).
> html_tag is the list of all possible html tags.
> 
> When I transform this into the pure state machine I described earlier 
> (all expressions unioned and wrapped in a Kleene star),
> some emails don't match anymore, and I get parse errors.
> The scanner works currently, but I think that if I could remove the need for 
> backtracking, performance could really improve.
> 
> Thanks,
> 
> Matthieu.
> 
> 
> On Tue, Feb 23, 2010 at 6:24 AM, Adrian Thurston 
> <adrian.thurston at esentire.com> wrote:
> 
> 
>     Matthieu Tourne wrote:
> 
>         I've tried that without much success, I have a union with all my
>         scanner patterns, wrapped in ()**.
>         I have also replaced all the => { do_stuff(); } in the scanner
>         with @{ do_stuff(); } for each pattern.
> 
> 
>     If you want, post the specifics and we might be able to nail down
>     the problem.
> 
> 
>         So, I'm back to using a scanner construction, but resetting the
>         backtracking in between buffers, and it seems to work fine.
>         Are there any concerns with doing something like that?
> 
> 
>     No.
> 
>     -Adrian
> 
> 
>     _______________________________________________
>     ragel-users mailing list
>     ragel-users at complang.org
>     http://www.complang.org/mailman/listinfo/ragel-users
> 
> 
> 
> 
> -- 
> Matthieu Tourne



