Extending sregex to match UTF8

rvsw · 2014-07-28T13:58:41+00:00

rvsw

Hello agentzh

One of the TODO items for the sregex library at https://github.com/openresty/sregex is to add support for UTF-8. Is there a timeline for this feature? If not, what is the proposed strategy for implementing it? For example, do we use libraries like libiconv to convert UTF-8 into ASCII (for the English language), or will there be native code that supports other character sets as well?

Thanks for any answers

agentzh

Hello!

On Sun, Jul 27, 2014 at 10:58 PM, rvsw wrote:
> One of the TODO items for the sregex library at
> https://github.com/openresty/sregex is to add support for UTF-8. Is there a
> timeline for this feature?

I wonder which particular UTF-8 regex features you are interested in?

For literal UTF-8 byte sequence match, it already works, for example,

    replace_filter '你好' 'hello';
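
As a rough illustration of why a byte-oriented engine can already handle such literals, the sketch below uses Python's `re` module on `bytes` as a stand-in for sregex (it is not sregex itself, and `body` is made-up sample data). The pattern is simply the UTF-8 encoding of the literal, matched octet by octet:

```python
import re

# Made-up response body, already encoded as UTF-8 octets.
body = "<p>你好, world</p>".encode("utf-8")

# The pattern is just the UTF-8 byte sequence of the literal; the matcher
# never needs to know that these six octets form two characters.
pattern = re.escape("你好".encode("utf-8"))

result = re.sub(pattern, b"hello", body)
print(result.decode("utf-8"))
```

The matcher here never decodes anything; it only compares octets, which is the same reason the `replace_filter` line above works.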

> If not, what is the proposed strategy to
> implement this e.g. do we use libraries like libiconv to convert UTF-8 into
> ASCII (for the English language). Or will there be native code which will
> support other character sets as well.

For full UTF-8 support, like making "." match a UTF-8 character instead
of a single octet, or those Unicode groups like "\p{Han}", you need to
change the sregex engine directly. Third-party libraries like libiconv
will not really be helpful here.
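
To make the octet-versus-character distinction concrete, here is a small sketch that again uses Python's `re` only as a stand-in (it says nothing about how sregex is implemented). A bytes pattern behaves like a byte-oriented engine, while a str pattern behaves like a character-aware one:

```python
import re

text = "你好"
octets = text.encode("utf-8")

# Byte-oriented matching: "." consumes exactly one octet, so it takes
# six matches to cover these two 3-byte characters.
byte_matches = re.findall(rb".", octets, flags=re.DOTALL)

# Character-oriented matching: "." consumes one whole character.
char_matches = re.findall(r".", text, flags=re.DOTALL)

print(len(byte_matches), len(char_matches))
```

Each byte-level match on its own is only a fragment of a character, which is why "." and character classes need engine-level support rather than a conversion library.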

If you just want to convert the response body data stream from one
char encoding to another, then you can just use the ngx_iconv module:

    https://github.com/calio/iconv-nginx-module
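
For readers unfamiliar with what converting a body stream involves, the following is a minimal sketch in Python, not a description of how ngx_iconv works internally. The function name `transcode_chunks` and the sample data are hypothetical. The point it shows is that chunk boundaries can split a character, so the converter has to keep state between chunks:

```python
import codecs

def transcode_chunks(chunks, src="utf-16-le", dst="utf-8"):
    """Yield chunks re-encoded from src to dst, tolerating split characters."""
    decoder = codecs.getincrementaldecoder(src)()
    for chunk in chunks:
        # Incomplete trailing octets are buffered inside the decoder.
        text = decoder.decode(chunk)
        if text:
            yield text.encode(dst)
    # Flush the decoder; this raises if the stream ended mid-character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail.encode(dst)

data = "你好 hello".encode("utf-16-le")
# Split at an odd offset so one 2-byte code unit straddles two chunks.
chunks = [data[:3], data[3:]]
out = b"".join(transcode_chunks(chunks))
print(out.decode("utf-8"))
```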

Regards,
-agentzh

rvsw

Hello agentzh

Thank you for your prompt response. I didn't quite realize that UTF-8 regex features were already supported; at least the page at https://github.com/openresty/sregex didn't seem to imply so. I do see that UTF-16 is not supported, but I guess I can make do with the iconv module.

I didn't find a way to detect whether the file is UTF-16 encoded (I don't want to depend on the HTTP headers). I can probably write some Lua code to do that, but then I will have to determine dynamically, through Lua, whether iconv needs to be invoked. Please advise if there is a way to achieve that.

As always, thanks for your help :-)


agentzh

Hello!

On Tue, Jul 29, 2014 at 11:59 AM, rvsw wrote:
> Thank you for your prompt response. I didn't quite realize that UTF-8 regex
> features were already supported. At least the
> https://github.com/openresty/sregex didn't seem to imply so.

No, strictly speaking we cannot say sregex already supports UTF-8. It
does not try to make sense of any multi-byte character encodings at
all. UTF-8 literal patterns work because it is safe to just treat them
as simple octet streams due to the fact that ASCII is a (strict)
subset of UTF-8. Other charsets might not work as expected, like GBK
and UTF-16 (GB2312 is unambiguous and should work though).
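
One way to see the difference between these encodings at the octet level is the sketch below. It is only an illustration in Python with made-up helper names (`bytewise_contains`, `gbk_char_with_ascii_trail`), not anything taken from sregex. It checks that UTF-8 multi-byte characters never contain ASCII-range octets, and shows how plain octet matching can miss or falsely hit in UTF-16 and GBK:

```python
def bytewise_contains(haystack: bytes, needle: bytes) -> bool:
    """Octet-level substring search that ignores character boundaries."""
    return needle in haystack

# UTF-8: every octet of a multi-byte character is >= 0x80, so an ASCII
# needle can never match inside one.
utf8 = "你好".encode("utf-8")
utf8_all_high = all(b >= 0x80 for b in utf8)

# UTF-16: ASCII text is interleaved with NUL octets, so a plain ASCII
# needle misses text that is really there...
utf16_miss = not bytewise_contains("hello".encode("utf-16-le"), b"hello")

# ...and the two octets of a single character can look like two ASCII letters.
utf16_false_hit = bytewise_contains(chr(0x4142).encode("utf-16-be"), b"AB")

def gbk_char_with_ascii_trail():
    """Find a CJK character whose second GBK octet is in the ASCII range."""
    for cp in range(0x4E00, 0x9FA6):
        try:
            enc = chr(cp).encode("gbk")
        except UnicodeEncodeError:
            continue
        if len(enc) == 2 and enc[1] < 0x80:
            return chr(cp), enc
    return None

ch, enc = gbk_char_with_ascii_trail()
ascii_needle = enc[1:2]
# The ASCII octet is found byte-wise even though the text does not contain it.
gbk_false_hit = (bytewise_contains(enc, ascii_needle)
                 and ascii_needle.decode("ascii") not in ch)

print(utf8_all_high, utf16_miss, utf16_false_hit, gbk_false_hit)
```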

> I do see
> UTF-16 not being supported but I guess I can make do with the iconv module.

Yes.

> I didn't find a way to detect if the file is UTF-16 encoded (I don't want to
> depend on the HTTP headers). I can probably write some Lua code to achieve
> that but then I will have to dynamically determine whether iconv needs to be
> invoked or not through Lua.

Yes, technically speaking you can use body_filter_by_lua to do the
charset discovery. I'm not sure if ngx_iconv provides a user interface
to toggle its functionality. If not, patches welcome (as always) :)
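
The discovery logic itself could look roughly like the sketch below. In this thread's setting it would be written in Lua inside body_filter_by_lua; Python is used here only to show the logic, and the function name `sniff_utf16`, the 64-octet sample size, and the 80% threshold are all assumptions made for illustration:

```python
import codecs

def sniff_utf16(first_chunk: bytes):
    """Guess whether a body starts as UTF-16; return a codec name or None."""
    # A byte order mark is the most reliable signal. (Caveat: a UTF-32-LE
    # BOM also begins with these two octets; that case is not handled here.)
    if first_chunk.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if first_chunk.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"

    # Heuristic fallback, assuming mostly ASCII content: such text has a
    # NUL in every other octet when encoded as UTF-16.
    sample = first_chunk[:64]
    if len(sample) >= 4:
        even_nul = sample[0::2].count(0)
        odd_nul = sample[1::2].count(0)
        half = len(sample) // 2
        if odd_nul >= 0.8 * half and even_nul == 0:
            return "utf-16-le"
        if even_nul >= 0.8 * half and odd_nul == 0:
            return "utf-16-be"
    return None

print(sniff_utf16(codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")))
print(sniff_utf16("hello world".encode("utf-8")))
```

The NUL heuristic only helps for mostly-ASCII bodies; UTF-16 text without a BOM in other scripts would need a different signal.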

Regards,
-agentzh