C++ - UTF-8 to UTF-32 on iterators using the STL
I have a char iterator (a std::istreambuf_iterator<char> wrapped in a couple of adaptors) yielding UTF-8 bytes. I want to read a single UTF-32 character (a char32_t) from it. Can I do so using the STL? How?
There's std::codecvt_utf8<char32_t>, but that apparently only works on char*, not on arbitrary iterators.
Here's a simplified version of my code:
    #include <iostream>
    #include <sstream>
    #include <iterator>

    // In the real code some Boost adaptors etc. are involved.
    // The important point is: we're dealing with a char iterator.
    typedef std::istreambuf_iterator< char > iterator;

    char32_t read_code_point( iterator& it, const iterator& end )
    {
        // How do I do this conversion?
        // codecvt_utf8<char32_t>::in() only works on char*
        return U'\0';
    }

    int main()
    {
        // The actual code uses a std::istream that works on strings, files etc.,
        // but that's irrelevant to the question.
        // (The u8 literal is a char array before C++20.)
        std::stringstream stream( u8"\u00ff" );
        iterator it( stream );
        iterator end;
        char32_t c = read_code_point( it, end );
        std::cout << std::boolalpha << ( c == U'\u00ff' ) << std::endl;
        return 0;
    }

I am aware that Boost.Regex has an iterator for this, but I'd like to avoid Boost libraries that are not header-only, and this feels like something the STL should be capable of.
I don't think you can do this directly with codecvt_utf8 or any other standard library components. To use codecvt_utf8 you'd need to copy bytes from the iterator stream into a buffer, and convert the buffer.
Something like this should work:
    #include <codecvt>
    #include <cwchar>
    #include <iterator>
    #include <stdexcept>

    char32_t read_code_point( iterator& it, const iterator& end )
    {
        char32_t result;
        char32_t* resend = &result + 1;
        char32_t* resnext = &result;
        char buf[7]; // room for a 3-byte UTF-8 BOM and a 4-byte UTF-8 character
        char* bufpos = buf;
        const char* const bufend = std::end(buf);
        std::codecvt_utf8<char32_t> cvt;
        while (bufpos != bufend && it != end)
        {
            *bufpos++ = *it++;
            std::mbstate_t st{}; // fresh state: the whole buffer is re-scanned each time
            const char* be = bufpos;
            const char* bn = buf;
            auto conv = cvt.in(st, buf, be, bn, &result, resend, resnext);
            if (conv == std::codecvt_base::error)
                throw std::runtime_error("invalid UTF-8 sequence");
            if (conv == std::codecvt_base::ok && bn == be)
                return result;
            // otherwise read another byte and try again
        }
        if (it == end)
            throw std::runtime_error("incomplete UTF-8 sequence");
        throw std::runtime_error("no character read from first 7 bytes");
    }

This appears to do more work than necessary, re-scanning the whole UTF-8 sequence in [buf, bufpos) on every iteration (and making a virtual function call to codecvt_utf8::do_in). In theory a codecvt_utf8::in implementation could read an incomplete multibyte sequence and store the state information in the mbstate_t argument, so that the next call would resume from where the last one left off, consuming only new bytes and not re-processing the incomplete multibyte sequence that was already seen.
However, implementations are not required to use the mbstate_t argument to store state between calls, and in practice at least one implementation of codecvt_utf8::in (the one I wrote for GCC) doesn't use it at all. From my experiments it seems that the libc++ implementation doesn't use it either. This means that they stop converting before an incomplete multibyte sequence, and leave the from_next pointer (the bn argument here) pointing to the beginning of that incomplete sequence, so that the next call should start from that position and (hopefully) provide enough additional bytes to complete the sequence and allow a complete Unicode character to be read and converted to char32_t. Because you are trying to read only a single codepoint, this means they do no conversion at all, because stopping before an incomplete multibyte sequence means stopping at the first byte.
It's possible that some implementations do use the mbstate_t argument, so you could modify the function above to handle that case as well, but to be portable it would still need to cope with implementations that ignore the mbstate_t. Supporting both types of implementation would complicate the function considerably, so I kept it simple and wrote the form that should work with all implementations, even if they do use the mbstate_t. Because you are only going to be reading up to 7 bytes at a time (in the worst case ... the average case may be only one or two bytes, depending on the input text) the cost of re-scanning the first few bytes every time shouldn't be huge.
To get better performance from codecvt_utf8 you should avoid converting one codepoint at a time, because it's designed for converting arrays of characters, not individual ones. Since you always need to copy to a char buffer anyway, you could copy larger chunks from the input iterator sequence and convert whole chunks. This would reduce the likelihood of seeing incomplete multibyte sequences: only the last 1-3 bytes at the end of a chunk would need to be re-processed if the chunk ends in an incomplete sequence, and everything earlier in the chunk would already have been converted.
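The chunked approach could look roughly like the sketch below. The function name utf8_to_utf32_chunked and its chunk_size parameter are made up for illustration, and the code assumes an implementation that leaves from_next at the start of an incomplete trailing sequence (as described above for GCC and libc++). Note that <codecvt> is deprecated since C++17, so some compilers may warn about it.

```cpp
#include <codecvt>
#include <cstddef>
#include <cstring>
#include <cwchar>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: converts a whole UTF-8 iterator range to UTF-32,
// feeding codecvt_utf8 one chunk of bytes at a time.
template <typename InputIt>
std::u32string utf8_to_utf32_chunked(InputIt it, InputIt end,
                                     std::size_t chunk_size = 4096)
{
    if (chunk_size < 4)
        throw std::invalid_argument("chunk must hold a 4-byte sequence");

    std::codecvt_utf8<char32_t> cvt;
    std::mbstate_t st{};                   // kept across calls in case it is used
    std::vector<char> in(chunk_size);
    std::vector<char32_t> out(chunk_size); // at most one code point per input byte
    std::u32string result;
    std::size_t pending = 0;               // unconverted bytes carried over

    while (it != end || pending != 0)
    {
        // Top up the buffer after any carried-over bytes.
        std::size_t n = pending;
        while (n < in.size() && it != end)
        {
            in[n++] = *it;
            ++it;
        }

        const char* from_next = in.data();
        char32_t* to_next = out.data();
        auto conv = cvt.in(st, in.data(), in.data() + n, from_next,
                           out.data(), out.data() + out.size(), to_next);
        if (conv == std::codecvt_base::error)
            throw std::runtime_error("invalid UTF-8 sequence");

        result.append(out.data(), to_next);

        // No bytes consumed: the input ends in a truncated sequence.
        if (from_next == in.data())
            throw std::runtime_error("incomplete UTF-8 sequence");

        // Move the incomplete tail (at most 3 bytes) to the front.
        pending = static_cast<std::size_t>(in.data() + n - from_next);
        std::memmove(in.data(), from_next, pending);
    }
    return result;
}
```

Called as utf8_to_utf32_chunked(it, end) with the istreambuf_iterator pair from the question, this converts the whole stream; only the few bytes at a chunk boundary are ever scanned twice.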
To get better performance for reading single codepoints you should probably avoid codecvt_utf8 entirely and either roll your own (if you only need UTF-8 to UTF-32BE it's not hard) or use a third-party library such as ICU.