c++ - UTF-8 to UTF-32 on iterators using the STL

September 15, 2015

I have a char iterator (a std::istreambuf_iterator<char> wrapped in a couple of adaptors) yielding UTF-8 bytes. I want to read a single UTF-32 character (a char32_t) from it. Can I do so using the STL? How?

There's std::codecvt_utf8<char32_t>, but that apparently only works on char*, not on arbitrary iterators.

Here's a simplified version of my code:

#include <iostream>
#include <sstream>
#include <iterator>

// In the real code some Boost adaptors etc. are involved,
// but the important point is: we're dealing with a char iterator.
typedef std::istreambuf_iterator< char > iterator;

char32_t read_code_point( iterator& it, const iterator& end )
{
    // How do I do this conversion?
    // codecvt_utf8<char32_t>::in() only works on char*
    return U'\0';
}

int main()
{
    // The actual code uses a std::istream that works on strings, files etc.,
    // but that's irrelevant for the question.
    // Note: u8 literals are char-based only before C++20.
    std::stringstream stream( u8"\u00ff" );
    iterator it( stream );
    iterator end;
    char32_t c = read_code_point( it, end );
    std::cout << std::boolalpha << ( c == U'\u00ff' ) << std::endl;
    return 0;
}

I am aware that Boost.Regex has an iterator for this, but I'd like to avoid Boost libraries that are not header-only, and this feels like something the STL should be capable of.

I don't think you can do this directly with codecvt_utf8 or any other standard library component. To use codecvt_utf8 you'd need to copy bytes from the iterator stream into a buffer and convert the buffer.

Something like this should work:

#include <codecvt>
#include <cwchar>
#include <iterator>
#include <locale>
#include <stdexcept>

// Uses the iterator typedef from the question.
char32_t read_code_point( iterator& it, const iterator& end )
{
  char32_t result;
  char32_t* resend = &result + 1;
  char32_t* resnext = &result;
  char buf[7];  // room for a 3-byte UTF-8 BOM and a 4-byte UTF-8 character
  char* bufpos = buf;
  const char* const bufend = std::end(buf);
  std::codecvt_utf8<char32_t> cvt;
  while (bufpos != bufend && it != end)
  {
    *bufpos++ = *it++;
    std::mbstate_t st{};
    const char* be = bufpos;  // end of the bytes read so far
    const char* bn = buf;     // set by in() to the first unconverted byte
    auto conv = cvt.in(st, buf, be, bn, &result, resend, resnext);
    if (conv == std::codecvt_base::error)
      throw std::runtime_error("Invalid UTF-8 sequence");
    if (conv == std::codecvt_base::ok && bn == be)
      return result;
    // otherwise read another byte and try again
  }
  if (it == end)
    throw std::runtime_error("Incomplete UTF-8 sequence");
  throw std::runtime_error("No character read from first seven bytes");
}

This appears to do more work than necessary, re-scanning the whole UTF-8 sequence in [buf, bufpos) on every iteration (and making a virtual function call to codecvt_utf8::do_in). In theory a codecvt_utf8::in implementation could read an incomplete multibyte sequence and store the state information in the mbstate_t argument, so that the next call would resume from where the last one left off, consuming only new bytes and not re-processing the incomplete multibyte sequence that had already been seen.

However, implementations are not required to use the mbstate_t argument to store state between calls, and in practice at least one implementation of codecvt_utf8::in (the one I wrote for GCC) doesn't use it at all. From my experiments it seems that the libc++ implementation doesn't use it either. This means that they stop converting before an incomplete multibyte sequence and leave the from_next pointer (the bn argument here) pointing to the beginning of that incomplete sequence, so that the next call should start from that position and (hopefully) provide enough additional bytes to complete the sequence and allow a complete Unicode character to be read and converted to char32_t. Because you are only trying to read a single codepoint, this means that it does no conversion at all, because stopping before an incomplete multibyte sequence means stopping at the first byte.
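
To make that behaviour easier to check on a given standard library, here is a minimal sketch. The names ProbeResult and probe_in are hypothetical helpers invented for this illustration; the only library call is codecvt_utf8<char32_t>::in. It reports the conversion status, how many input bytes were consumed and how many code points were written.

```cpp
#include <codecvt>
#include <cstddef>
#include <cwchar>
#include <locale>

// Hypothetical helper type, used only for this illustration.
struct ProbeResult {
    std::codecvt_base::result status;  // ok, partial, error or noconv
    std::size_t bytes_consumed;        // distance from first to from_next
    std::size_t code_points_written;   // distance from out to to_next
};

// Hypothetical helper: runs one in() call on [first, last) with a fresh
// mbstate_t and reports what the facet did with the input.
ProbeResult probe_in(const char* first, const char* last)
{
    std::codecvt_utf8<char32_t> cvt;
    std::mbstate_t st{};
    char32_t out[4];
    const char* from_next = first;
    char32_t* to_next = out;
    const std::codecvt_base::result status =
        cvt.in(st, first, last, from_next, out, out + 4, to_next);
    return { status,
             static_cast<std::size_t>(from_next - first),
             static_cast<std::size_t>(to_next - out) };
}
```

Calling probe_in on only the first byte of "\xC3\xBF" should report that no code point was written. On an implementation that ignores mbstate_t, bytes_consumed is expected to be 0 as well, which is the situation described above.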

It's possible that some implementations do use the mbstate_t argument, so you could modify the function above to handle that case as well, but to be portable it would still need to cope with implementations which ignore the mbstate_t. Supporting both types of implementation would complicate the function considerably, so I kept it simple and wrote the form that should work with all implementations, even if they do actually use the mbstate_t. Because you are only going to be reading up to 7 bytes at a time (in the worst case; the average case may be only one or two bytes, depending on the input text), the cost of re-scanning the first few bytes every time shouldn't be huge.

To get better performance from codecvt_utf8 you should avoid converting one codepoint at a time, because it's designed for converting arrays of characters, not individual ones. Since you always need to copy to a char buffer anyway, you could copy larger chunks from the input iterator sequence and convert whole chunks. This would reduce the likelihood of seeing incomplete multibyte sequences, since only the last 1-3 bytes at the end of a chunk would need to be re-processed if the chunk ends in an incomplete sequence; everything earlier in the chunk would already have been converted.
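
A rough sketch of that chunked approach follows. The function name decode_in_chunks and the chunk size of 256 are assumptions made for this example, and it decodes the whole range into a std::u32string rather than returning one codepoint at a time. Unconverted tail bytes are moved to the front of the buffer and retried with the next chunk.

```cpp
#include <codecvt>
#include <cstddef>
#include <cstring>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of the standard library): decodes a whole
// UTF-8 iterator range into UTF-32, converting one chunk at a time.
template <typename InputIt>
std::u32string decode_in_chunks(InputIt it, InputIt end)
{
    constexpr std::size_t chunk_size = 256;  // arbitrary choice for the sketch
    std::codecvt_utf8<char32_t> cvt;
    std::mbstate_t st{};       // kept across calls in case the facet uses it
    std::u32string out;
    char in[chunk_size];
    char32_t tmp[chunk_size];  // every code point needs at least one input byte
    std::size_t pending = 0;   // unconverted tail bytes carried into the next chunk

    while (it != end || pending != 0) {
        // Top up the buffer behind any bytes left over from the last chunk.
        std::size_t n = pending;
        while (n < chunk_size && it != end) {
            in[n++] = *it;
            ++it;
        }

        const char* from_next = in;
        char32_t* to_next = tmp;
        const std::codecvt_base::result res =
            cvt.in(st, in, in + n, from_next, tmp, tmp + chunk_size, to_next);
        if (res == std::codecvt_base::error)
            throw std::runtime_error("Invalid UTF-8 sequence");
        out.append(tmp, to_next);

        pending = static_cast<std::size_t>((in + n) - from_next);
        if (pending == n)  // no progress: the input ended mid-sequence
            throw std::runtime_error("Incomplete UTF-8 sequence");
        std::memmove(in, from_next, pending);
    }
    if (!std::mbsinit(&st))  // a stateful facet still holds a partial sequence
        throw std::runtime_error("Incomplete UTF-8 sequence");
    return out;
}
```

With the iterator typedef from the question this could be called as decode_in_chunks(iterator(stream), iterator()). Only the few bytes of a sequence that straddles a chunk boundary are scanned twice.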

To get better performance for reading single codepoints you should probably avoid codecvt_utf8 entirely and either roll your own (if you only need UTF-8 to UTF-32BE then it's not hard) or use a third-party library such as ICU.
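
For the roll-your-own route, a minimal sketch might look like the following. The function name decode_utf8_code_point is hypothetical, and the checks it performs (continuation bytes, overlong forms, surrogates, the U+10FFFF limit) are my assumption of what a reasonably strict decoder needs rather than anything the standard library provides. It works on any input iterator over char and returns the code point as a native char32_t value.

```cpp
#include <iterator>
#include <stdexcept>

// Hypothetical helper: reads one UTF-8 encoded code point from [it, end)
// and advances it past the bytes consumed.
template <typename InputIt>
char32_t decode_utf8_code_point(InputIt& it, const InputIt& end)
{
    if (it == end)
        throw std::runtime_error("No input");
    const unsigned char lead = static_cast<unsigned char>(*it);
    ++it;

    if (lead < 0x80)  // single-byte (ASCII) sequence
        return lead;

    int trailing;     // number of continuation bytes expected
    char32_t cp;      // payload bits collected so far
    char32_t min;     // smallest value that needs this sequence length
    if ((lead & 0xE0) == 0xC0)      { trailing = 1; cp = lead & 0x1F; min = 0x80; }
    else if ((lead & 0xF0) == 0xE0) { trailing = 2; cp = lead & 0x0F; min = 0x800; }
    else if ((lead & 0xF8) == 0xF0) { trailing = 3; cp = lead & 0x07; min = 0x10000; }
    else
        throw std::runtime_error("Invalid UTF-8 lead byte");

    for (; trailing > 0; --trailing) {
        if (it == end)
            throw std::runtime_error("Incomplete UTF-8 sequence");
        const unsigned char c = static_cast<unsigned char>(*it);
        if ((c & 0xC0) != 0x80)  // continuation bytes look like 10xxxxxx
            throw std::runtime_error("Invalid UTF-8 continuation byte");
        ++it;
        cp = (cp << 6) | (c & 0x3F);
    }

    // Reject overlong encodings, UTF-16 surrogates and out-of-range values.
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::runtime_error("Invalid UTF-8 sequence");
    return cp;
}
```

In the question's program, read_code_point could simply return decode_utf8_code_point(it, end). Each byte is examined once and there is no virtual call or intermediate buffer.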
