GiovanniDicanio/NextUnicodeCodePoint

Next Unicode Code Point

C++ code to find the next Unicode code point in UTF-8 and UTF-16 encoded strings.

Associated blog post: https://giodicanio.com/2025/11/03/finding-the-next-unicode-code-point-in-strings-utf-8-vs-utf-16/

The NextCodePoint.hpp header declares two public functions: NextCodePointUtf8 and NextCodePointUtf16. As their names suggest, they find the next code point in UTF-8 and UTF-16 encoded strings, respectively. Think of them as the Unicode counterpart of incrementing a character position (index++, or pch++ with pointers) when iterating through the characters of a pure ASCII string:

std::string name = "Connie";

for (size_t index = 0; index < name.size(); index++) {
    std::cout << name[index];
}

The two functions are declared as follows:

// Returns the next Unicode code point and number of bytes consumed.
// Throws std::out_of_range if index is out of bounds or string ends prematurely.
// Throws std::invalid_argument on invalid UTF-8 sequence.
[[nodiscard]] std::pair<char32_t, size_t> NextCodePointUtf8(const std::string& str, size_t index);


// Returns the next Unicode code point and the number of UTF-16 code units consumed.
// Throws std::out_of_range if index is out of bounds or string ends prematurely.
// Throws std::invalid_argument on invalid UTF-16 sequence.
[[nodiscard]] std::pair<char32_t, size_t> NextCodePointUtf16(const std::wstring& input, size_t index);
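To make the contract in the comments above more concrete, here is a minimal sketch of how a UTF-8 decoder with the same signature and the same exception behavior could be written. It is not the code from this repository: the name SketchNextCodePointUtf8 and the exact validation rules (rejecting overlong forms, surrogates, and values above U+10FFFF) are assumptions made for illustration.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical sketch, NOT the repository's implementation.
// Decodes the UTF-8 sequence starting at str[index] and returns
// {code point, number of bytes consumed}.
[[nodiscard]] inline std::pair<char32_t, std::size_t>
SketchNextCodePointUtf8(const std::string& str, std::size_t index) {
    if (index >= str.size()) {
        throw std::out_of_range("index is out of bounds");
    }

    const auto lead = static_cast<unsigned char>(str[index]);

    // The lead byte determines the sequence length and carries the top bits.
    std::size_t length = 0;
    char32_t codePoint = 0;
    if (lead < 0x80) {                      // 0xxxxxxx: plain ASCII
        return {static_cast<char32_t>(lead), std::size_t{1}};
    } else if ((lead & 0xE0) == 0xC0) {     // 110xxxxx: 2-byte sequence
        length = 2;
        codePoint = static_cast<char32_t>(lead & 0x1F);
    } else if ((lead & 0xF0) == 0xE0) {     // 1110xxxx: 3-byte sequence
        length = 3;
        codePoint = static_cast<char32_t>(lead & 0x0F);
    } else if ((lead & 0xF8) == 0xF0) {     // 11110xxx: 4-byte sequence
        length = 4;
        codePoint = static_cast<char32_t>(lead & 0x07);
    } else {
        throw std::invalid_argument("invalid UTF-8 lead byte");
    }

    if (str.size() - index < length) {
        throw std::out_of_range("string ends in the middle of a UTF-8 sequence");
    }

    // Each continuation byte must look like 10xxxxxx and adds 6 payload bits.
    for (std::size_t i = 1; i < length; ++i) {
        const auto byte = static_cast<unsigned char>(str[index + i]);
        if ((byte & 0xC0) != 0x80) {
            throw std::invalid_argument("invalid UTF-8 continuation byte");
        }
        codePoint = static_cast<char32_t>((codePoint << 6) | (byte & 0x3F));
    }

    // Smallest code point that legitimately needs a sequence of each length.
    static constexpr char32_t kMinForLength[] = {0, 0, 0x80, 0x800, 0x10000};
    const bool overlong = codePoint < kMinForLength[length];
    const bool surrogate = codePoint >= 0xD800 && codePoint <= 0xDFFF;
    if (overlong || surrogate || codePoint > 0x10FFFF) {
        throw std::invalid_argument("invalid UTF-8 sequence");
    }

    return {codePoint, length};
}
```

With this sketch, decoding the string "A\xC3\xA9\xF0\x9D\x84\x9E!" from index 0 and advancing by the returned byte count each time visits U+0041, U+00E9, U+1D11E and U+0021, consuming 1, 2, 4 and 1 bytes respectively.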

These functions can be used like this:

std::wstring text = L"A\xD834\xDD1E!"; // A + U+1D11E (𝄞) + !

size_t index = 0;
while (index < text.size()) {
    auto [codepoint, units] = NextCodePointUtf16(text, index);
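    // Cast to an integer type: since C++20, streaming a char32_t to std::cout is a deleted overload.
    std::cout << "Codepoint: U+" << std::hex << static_cast<unsigned long>(codepoint)
              << " (" << std::dec << units << " units)\n";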
    index += units;
}
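In the loop above, the middle code point U+1D11E is stored as the surrogate pair 0xD834 0xDD1E, so that iteration consumes two UTF-16 code units. The arithmetic that turns such a pair back into one code point can be sketched as follows. The helper name CombineSurrogates is hypothetical and is not part of NextCodePoint.hpp; it only illustrates the standard UTF-16 formula.

```cpp
#include <stdexcept>

// Hypothetical helper, not part of NextCodePoint.hpp.
// Combines a UTF-16 high (lead) surrogate and low (trail) surrogate
// into the supplementary-plane code point they encode.
[[nodiscard]] inline char32_t CombineSurrogates(char16_t high, char16_t low) {
    if (high < 0xD800 || high > 0xDBFF) {
        throw std::invalid_argument("not a high surrogate");
    }
    if (low < 0xDC00 || low > 0xDFFF) {
        throw std::invalid_argument("not a low surrogate");
    }

    // Each surrogate carries 10 payload bits; the result is offset by 0x10000.
    const char32_t highBits = static_cast<char32_t>(high - 0xD800);
    const char32_t lowBits  = static_cast<char32_t>(low - 0xDC00);
    return static_cast<char32_t>(0x10000 + (highBits << 10) + lowBits);
}
```

For the pair in the example, 0xD834 contributes 0x34 << 10 = 0xD000 and 0xDD1E contributes 0x11E, so the result is 0x10000 + 0xD000 + 0x11E = 0x1D11E.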
