C++ string literals may be simple arrays of characters, but the Standard Library provides a lot of support on top of that. From a string class to regular expressions, we have a full set of tools to deal with strings in a wide variety of ways.

Charconv

C++17 introduces <charconv> with a pair of functions for converting primitive types like double to characters and reading them back from characters. These functions don’t allocate memory, throw exceptions, handle localization, or even add NUL terminators. They’re intended to be used in serialization such as to JSON or when sending strings over a network socket:

#include <charconv>
 
// Buffer to print the value to
char buf[100];
char* end = buf + sizeof(buf);
 
// Print 3.14 to the buffer in scientific notation
std::to_chars_result tcr{
    std::to_chars(buf, end, 3.14, std::chars_format::scientific) };
 
// The returned pointer points one past the last printed character.
// Write a NUL terminator there since to_chars doesn't add one.
*tcr.ptr = '\0';
 
DebugLog(buf); // 3.14e+00
DebugLog("Success?", tcr.ec == std::errc()); // true
DebugLog("End pointer index", tcr.ptr - buf); // 8
 
// Read 3.14e+00 back, stopping at the end of the printed characters
double val;
std::from_chars_result fcr{ std::from_chars(buf, tcr.ptr, val) };
 
DebugLog(val); // 3.14
DebugLog("Success?", fcr.ec == std::errc()); // true
DebugLog("End pointer index", fcr.ptr - buf); // 8

The TextReader and TextWriter classes in C# are probably the closest analog as they can write to existing streams rather than operating on individual string objects.
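
Since these functions report failure through the ec member instead of throwing, the caller has to check it. The following is a minimal sketch of that check rather than a definitive implementation. ParseInt and FormatInt are hypothetical helper names, and integers are used because some standard libraries added the floating-point overloads later than the integer ones:

```cpp
#include <cassert>
#include <charconv>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>
#include <system_error>

// Hypothetical helper: parse all of `text` as a base-10 int
std::optional<int> ParseInt(std::string_view text)
{
    int value = 0;
    const char* first = text.data();
    const char* last = first + text.size();
    std::from_chars_result result{ std::from_chars(first, last, value) };

    // ec is std::errc::invalid_argument when no number was found and
    // std::errc::result_out_of_range when the number doesn't fit in an int
    if (result.ec != std::errc{})
    {
        return std::nullopt;
    }

    // Success only means a prefix was parsed, so reject trailing characters
    if (result.ptr != last)
    {
        return std::nullopt;
    }
    return value;
}

// Hypothetical helper: print `value` using a deliberately tiny buffer
std::optional<std::string> FormatInt(int value)
{
    char buf[4];
    std::to_chars_result result{ std::to_chars(buf, buf + sizeof(buf), value) };

    // ec is std::errc::value_too_large when the buffer is too small
    if (result.ec != std::errc{})
    {
        return std::nullopt;
    }

    // Copy only the characters that were written. There's no NUL terminator.
    return std::string(buf, static_cast<std::size_t>(result.ptr - buf));
}
```

With these helpers, ParseInt("123") yields 123 while ParseInt("12abc") and ParseInt("abc") yield an empty optional. FormatInt(1234) fits in the four-character buffer but FormatInt(12345) does not.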

String

Next up is <string> which primarily defines the std::basic_string class template. This is similar to the built-in String/string type in C#. One key difference is that it is mutable, meaning that the string’s characters can change. It is also not a managed reference, as C++ doesn’t have those, and must be wrapped in something like a std::shared_ptr for a similar effect.
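
To make the contrast with C# concrete, here is a small sketch that assumes nothing beyond <string> and <memory>. The function names are hypothetical and exist only for illustration. Copying a std::string copies its characters, so the two objects are independent, while sharing one string between several owners has to be requested explicitly:

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical demo: a copied string has its own characters
bool CopyIsIndependent()
{
    std::string original{ "hello" };
    std::string copy{ original }; // Copies the characters

    copy[0] = 'j'; // Mutates only the copy

    return original == "hello" && copy == "jello";
}

// Hypothetical demo: std::shared_ptr gives reference-like sharing
bool SharedPointersAlias()
{
    std::shared_ptr<std::string> a{ std::make_shared<std::string>("hello") };
    std::shared_ptr<std::string> b{ a }; // Copies the pointer, not the string

    (*b)[0] = 'j'; // Visible through both pointers

    return *a == "jello" && a.use_count() == 2;
}
```

Both functions return true.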

A template parameter of std::basic_string specifies the type of characters in the string. The <string> header provides many aliases for common character types so it’s rare to use std::basic_string directly:

Alias            Template                       Meaning
std::string      std::basic_string<char>        C string
std::wstring     std::basic_string<wchar_t>     Wide character string
std::u8string    std::basic_string<char8_t>     UTF-8 string
std::u16string   std::basic_string<char16_t>    UTF-16 string
std::u32string   std::basic_string<char32_t>    UTF-32 string

Each alias also has a counterpart in the std::pmr namespace that uses a polymorphic allocator, so the source of the string's memory can be chosen at runtime:

Alias                 Template                            Meaning
std::pmr::string      std::pmr::basic_string<char>        C string
std::pmr::wstring     std::pmr::basic_string<wchar_t>     Wide character string
std::pmr::u8string    std::pmr::basic_string<char8_t>     UTF-8 string
std::pmr::u16string   std::pmr::basic_string<char16_t>    UTF-16 string
std::pmr::u32string   std::pmr::basic_string<char32_t>    UTF-32 string
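
As a rough sketch of what the pmr aliases enable, the following backs a string with a stack buffer instead of the heap. It assumes a standard library that ships <memory_resource>. PmrStringLivesInBuffer is a hypothetical name, and the 256-byte buffer size is an arbitrary choice for illustration:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory_resource>
#include <string>

// Hypothetical demo: back a std::pmr::string with a stack buffer
bool PmrStringLivesInBuffer()
{
    // Stack memory for the string to use instead of the heap
    std::array<std::byte, 256> buffer{};

    // Hands out pieces of the buffer. The null upstream resource makes it
    // throw std::bad_alloc rather than fall back to the heap when it runs out.
    std::pmr::monotonic_buffer_resource resource{
        buffer.data(), buffer.size(), std::pmr::null_memory_resource() };

    // Long enough that typical small-string optimizations can't hold it inline
    std::pmr::string s(
        "a string that is too long for typical inline storage", &resource);

    // Check that the characters were placed inside the buffer
    const std::byte* chars = reinterpret_cast<const std::byte*>(s.data());
    const std::byte* begin = buffer.data();
    const std::byte* end = begin + buffer.size();
    std::less<const std::byte*> before{};
    return !before(chars, begin) && before(chars, end);
}
```

The string's member functions work exactly as they do for std::string. Only the allocator differs, so the buffer and the resource must outlive the string.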

Whichever we choose, the class “owns” the memory that the string is stored in. That means it allocates memory when needed and deallocates it in the destructor. It also provides a bunch of member functions to perform common operations on the string. Here’s a sampling of that functionality:

#include <string>
 
void Foo()
{
    // Allocate memory for the string
    std::string s{ "hello world" };
 
    // Read and write individual characters
    s[0] = 'H';
    s[6] = 'W';
    DebugLog(s); // Hello World
 
    // Get a NUL-terminated const pointer to the first character (a C string)
    const char* cs = s.c_str();
    DebugLog(cs); // Hello World
 
    // Get a non-const pointer to the first character
    char* d = s.data();
    DebugLog(d); // Hello World
 
    // Check if the string is empty
    DebugLog(s.empty()); // false
 
    // Get the number of characters in the string
    DebugLog(s.size()); // 11
    DebugLog(s.length()); // 11
 
    // Check how much capacity is there to hold characters
    DebugLog(s.capacity()); // Maybe 15
 
    // Allocate enough memory to hold a certain number of characters
    // Note: cannot be used to shrink the string
    s.reserve(128);
    DebugLog(s.capacity()); // At least 128
 
    // Request reducing allocated memory to just enough to hold the string
    s.shrink_to_fit();
    DebugLog(s.capacity()); // Maybe 15
 
    // Add a character to the end
    s.push_back('!');
    DebugLog(s); // Hello World!
 
    // Check if the string starts with another string
    DebugLog(s.starts_with("Hello")); // true
 
    // Replace 1 character starting at index 5 with a comma and a space
    s.replace(5, 1, ", ");
    DebugLog(s); // Hello, World!
 
    // Get a string of 5 characters starting at index 7
    std::string ss{ s.substr(7, 5) };
    DebugLog(ss); // World
 
    // Find an index of a string in the string
    std::string::size_type i = s.find("llo");
    DebugLog(i); // 2
 
    // Copy the string to another string
    std::string s2{ "other" };
    DebugLog(s2); // other
    s2 = s;
    DebugLog(s2); // Hello, World!
 
    // Compare strings' characters with overloaded operators
    DebugLog(s == s2); // true
 
    // Empty the string
    s.clear();
    DebugLog(s); // 
} // Destructor deallocates the string's memory

There are also some functions outside of the class that operate on std::basic_string objects:

#include <string>
 
// Parse a float out of a string
// Throws an exception upon failure
std::string s{ "3.14" };
float f = std::stof(s);
DebugLog(f); // 3.14
 
// Convert a double to a string
std::string s2{ std::to_string(3.14) };
DebugLog(s2); // 3.140000
 
// Check if a string is empty
DebugLog(std::empty(s)); // false
 
// Get a non-const pointer to the first character
char* d = std::data(s);
DebugLog(d); // 3.14
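
Because std::stof throws on failure, callers that expect bad input usually wrap it. Here is one possible sketch. TryParseFloat is a hypothetical helper, loosely modeled on the float.TryParse pattern from C#, and it additionally rejects trailing characters, which std::stof alone would ignore:

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical helper: parse all of `text` as a float without throwing
std::optional<float> TryParseFloat(const std::string& text)
{
    try
    {
        // The second parameter receives the number of characters parsed
        std::size_t consumed = 0;
        float value = std::stof(text, &consumed);

        // std::stof accepts a valid prefix, so check for leftovers
        if (consumed != text.size())
        {
            return std::nullopt;
        }
        return value;
    }
    catch (const std::invalid_argument&)
    {
        // No number could be parsed at all
        return std::nullopt;
    }
    catch (const std::out_of_range&)
    {
        // The number doesn't fit in a float
        return std::nullopt;
    }
}
```

TryParseFloat("3.5") yields 3.5 while "abc", "3.5x", and "1e999" all yield an empty optional.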

Lastly, there is a user-defined literal in the std::literals::string_literals namespace to create strings. The s suffix is overloaded to create a string based on the type of characters it’s applied to:

#include <string>
 
using namespace std::literals::string_literals;
 
// Plain string literals create a std::string
std::string s{ "hello"s };
 
// char8_t string literals create a UTF-8 string
std::u8string s8{ u8"hello"s };
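
One practical difference the suffix makes shows up with embedded NUL characters. The sketch below uses hypothetical function names to contrast the two. Without the suffix the literal decays to a pointer and its length is found by searching for the first NUL. With it, the literal's full length is passed along:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>

using namespace std::literals::string_literals;

// Hypothetical demo: build a std::string from a plain literal
std::size_t SizeWithoutSuffix()
{
    // Without the suffix, auto deduces a pointer to the characters
    auto text = "ab\0cd";
    static_assert(std::is_same_v<decltype(text), const char*>);

    // The constructor stops at the first NUL character
    return std::string(text).size();
}

// Hypothetical demo: build a std::string with the s suffix
std::size_t SizeWithSuffix()
{
    // With the suffix, auto deduces a std::string
    auto text = "ab\0cd"s;
    static_assert(std::is_same_v<decltype(text), std::string>);

    // The embedded NUL and the characters after it are kept
    return text.size();
}
```

SizeWithoutSuffix returns 2 and SizeWithSuffix returns 5.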

Locale and Codecvt

Next up is <locale> to help with localization. The std::locale class identifies a locale like CultureInfo does in C#. Its member functions and other functions in <locale> allow us to perform operations within the context of that locale:

#include <string>
#include <locale>
 
// Construct a locale for a specific locale name
std::locale loc{ "en_US.UTF-8" };
 
// Lexicographically compare strings with the overloaded () operator
std::string a{ "apple" };
std::string b{ "banana" };
DebugLog(loc(a, b)); // true
 
// Check if a character is in a category for this locale
DebugLog(std::isspace(' ', loc)); // true
DebugLog(std::islower('a', loc)); // true
DebugLog(std::isdigit('1', loc)); // true
 
// Convert between uppercase and lowercase in this locale
DebugLog(std::toupper('a', loc)); // A
DebugLog(std::tolower('Z', loc)); // z
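
Which locale names are available depends on the operating system, and constructing a std::locale from a name the system doesn't recognize throws std::runtime_error. A defensive sketch might fall back to the always-available "C" locale. MakeLocaleOrClassic is a hypothetical helper name:

```cpp
#include <cassert>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: construct a named locale, falling back to "C"
std::locale MakeLocaleOrClassic(const std::string& name)
{
    try
    {
        return std::locale{ name };
    }
    catch (const std::runtime_error&)
    {
        // The name isn't recognized on this system
        return std::locale::classic();
    }
}
```

The returned locale can be passed to functions like std::toupper and std::isdigit whether or not the requested name was recognized.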

Later in the series we’ll look at I/O and see how we can use std::locale to localize kinds of values like time and money.

In the meantime, let’s look at wstring_convert and wbuffer_convert which work with <codecvt> to provide conversion facilities between different string encodings like UTF-8 and UTF-16. These class templates and the <codecvt> header were deprecated in C++17 and there will presumably be a replacement at some point in the future. For now, we can use them like this example that converts “😎👍” between UTF-8 and UTF-16:

#include <string>
#include <locale>
#include <codecvt>
#include <cstdint>
 
void Foo()
{
    // Emojis as UTF-8 and UTF-16
    std::string u8 = "\xf0\x9f\x98\x8e\xf0\x9f\x91\x8d";
    std::u16string u16 = u"\xd83d\xde0e\xd83d\xdc4d";
 
    // Make a converter from UTF-8 to UTF-16
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> u8u16{};
 
    // Use it to convert from UTF-8 to UTF-16
    std::u16string toU16 = u8u16.from_bytes(u8);
    DebugLog("Success?", u16 == toU16); // true
    DebugLog("UTF-16 size", toU16.size()); // 4
    for (uint32_t c : toU16)
    {
        DebugLog(c);
        // Outputs:
        // 55357
        // 56846
        // 55357
        // 56397
    }
 
    // Make a converter from UTF-16 to UTF-8
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> u16u8 {};
 
    // Use it to convert UTF-16 to UTF-8
    std::string toU8 = u16u8.to_bytes(u16);
    DebugLog("Success?", u8 == toU8); // true
    DebugLog("UTF-8 size", toU8.size()); // 8
    // Iterate as unsigned char so bytes of 0x80 and above aren't sign-extended
    // on platforms where char is signed
    for (unsigned char c : toU8)
    {
        DebugLog(static_cast<uint32_t>(c));
        // Outputs:
        // 240
        // 159
        // 152
        // 142
        // 240
        // 159
        // 145
        // 141
    }
}
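
The four UTF-16 values above are two surrogate pairs, each encoding one code point outside the Basic Multilingual Plane. As a worked example of the arithmetic, here is a sketch of how a pair combines into a code point. CombineSurrogates is a hypothetical helper and, for brevity, it assumes its inputs really are a valid high and low surrogate:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: combine a UTF-16 surrogate pair into a code point
// Assumes high is in [0xD800, 0xDBFF] and low is in [0xDC00, 0xDFFF]
constexpr char32_t CombineSurrogates(char16_t high, char16_t low)
{
    // Each surrogate carries 10 bits of the code point's offset from 0x10000
    const std::uint32_t highBits = static_cast<std::uint32_t>(high) - 0xD800u;
    const std::uint32_t lowBits = static_cast<std::uint32_t>(low) - 0xDC00u;
    return static_cast<char32_t>(0x10000u + (highBits << 10) + lowBits);
}

// 0xD83D 0xDE0E is U+1F60E (😎)
static_assert(CombineSurrogates(0xD83D, 0xDE0E) == 0x1F60E);

// 0xD83D 0xDC4D is U+1F44D (👍)
static_assert(CombineSurrogates(0xD83D, 0xDC4D) == 0x1F44D);
```

For 😎, the high surrogate contributes 0x3D shifted left by 10 bits (0xF400) and the low surrogate contributes 0x20E, so the result is 0x10000 + 0xF400 + 0x20E = 0x1F60E.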

Format

C++20 adds the <format> header to make formatting data as strings easier and safer than existing methods like sprintf in the C Standard Library. The std::format function is rather similar to string interpolation in C#: $"Score: {score}".

#include <string>
#include <format>
#include <locale>
 
// Format a string
int score = 123;
std::string str{ std::format("Score: {}", score) };
DebugLog(str); // Score: 123
 
// Format a string for a specific locale
// Only replacement fields with the L option, like {:L}, use the locale
std::locale loc{ "en_US.UTF-8" };
str = std::format(loc, "Score: {}", score);
DebugLog(str); // Score: 123

We can specialize the std::formatter class template to enable formatting our own types:

#include <format>
 
struct Vector2
{
    float X;
    float Y;
};
 
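namespace std
{
    // Partial specialization of std::formatter for Vector2
    template<class TChar>
    struct formatter<Vector2, TChar>
    {
        // Parse the format specification. None is supported, so just
        // return the iterator to where parsing stopped.
        template<typename TContext>
        constexpr auto parse(TContext& pc)
        {
            return pc.begin();
        }
 
        // Write the formatted value to the context's output iterator
        template<typename TContext>
        auto format(const Vector2& v, TContext& fc) const
        {
            return std::format_to(fc.out(), "({}, {})", v.X, v.Y);
        }
    };
}
 
Vector2 v{ 1, 2 };
std::string s{ std::format("Vector: {}", v) };
DebugLog(s); // Vector: (1, 2)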

String View

C++17 introduces std::basic_string_view as a class template that provides a read-only “view” into another string. It’s an adapter for string literals and other arrays of characters as well as string classes like std::basic_string. Unlike std::basic_string, it doesn’t “own” the memory that holds the characters. That means it doesn’t allocate it or deallocate it but instead acts like a pointer to existing memory and a size_t to keep track of the length. As with other pointers, it’s important to not use the std::basic_string_view after the string it points to is deallocated.
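
The lifetime rule is easiest to see with a function that returns a view into its argument. The following is only a sketch. FirstWord and FirstWordPointsIntoOwner are hypothetical names:

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical helper: get a view of the text before the first space
// The returned view points into whatever `text` points into
std::string_view FirstWord(std::string_view text)
{
    // find returns npos if there's no space and substr then keeps everything
    return text.substr(0, text.find(' '));
}

// Hypothetical demo: the view is valid while its owner is alive
bool FirstWordPointsIntoOwner()
{
    std::string owner{ "hello world" };
    std::string_view word{ FirstWord(owner) };

    // No characters were copied. The view shares the owner's memory.
    return word == "hello" && word.data() == owner.data();
}

// By contrast, this would leave a dangling view because the temporary
// std::string is destroyed at the end of the statement:
//
//   std::string_view word{ FirstWord(std::string{ "hello world" }) };
```

FirstWordPointsIntoOwner returns true. Using the view from the commented-out line would be undefined behavior.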

Aliases are provided in parallel with std::basic_string:

Alias Template Meaning
std::string_view std::basic_string_view<char> View of C string
std::wstring_view std::basic_string_view<wchar_t> View of wide character string
std::u8string_view std::basic_string_view<char8_t> View of UTF-8 string
std::u16string_view std::basic_string_view<char16_t> View of UTF-16 string
std::u32string_view std::basic_string_view<char32_t> View of UTF-32 string

Here’s how to use them:

#include <string>
#include <string_view>
 
// A simple array of characters
const char cs[] = "C String";
 
// A view into the array of characters
std::string_view svcs{ cs };
 
// A std::basic_string
std::string bs{ "std::string" };
 
// A view into the std::basic_string
std::string_view svbs{ bs };
 
// Query the string's size
DebugLog(svcs.empty()); // false
DebugLog(svcs.size()); // 8
DebugLog(svcs.length()); // 8
 
// Read characters
DebugLog(svcs[2]); // S
// Note: operator[] isn't bounds-checked, so svcs[100] is undefined behavior
// svcs.at(100) is the bounds-checked version and throws std::out_of_range
DebugLog(svcs.front()); // C
DebugLog(svcs.back()); // g
DebugLog(svcs.data()); // C String
 
// Copy part of the string
char buf[4] = { '\0' };
svcs.copy(buf, 3, 2);
DebugLog(buf); // Str
 
// Get a view of a sub-string. Does not copy characters.
std::string_view sub{ svcs.substr(5, 3) };
DebugLog(sub); // ing
 
// Compare string views' characters
DebugLog(svcs.compare(svbs)); // Negative value: "C String" sorts first
DebugLog(svcs == svbs); // false
 
// C++20: check if the string starts or ends with a sub-string
DebugLog(svcs.starts_with("C Str")); // true
DebugLog(svcs.ends_with("ING")); // false
 
// Find a sub-string's index
DebugLog(svcs.find("Str")); // 2
 
// Reduce the view by moving the view's pointer forward
// Does not modify the string
svcs.remove_prefix(2);
DebugLog(svcs); // String
 
// Reduce the view by reducing the view's size
// Does not modify the string
svcs.remove_suffix(3);
DebugLog(svcs); // Str

Again paralleling std::basic_string, there are also some functions outside of the std::basic_string_view class that operate on std::basic_string_view objects:

#include <string>
#include <string_view>
 
const char cs[] = "C String";
std::string_view svcs{ cs };
 
// Check if a string view is empty
DebugLog(std::empty(svcs)); // false
 
// Get a pointer to the first character
const char* d = std::data(svcs);
DebugLog(d); // C String

There is also a user-defined literal in the std::literals::string_view_literals namespace to create string views with the sv suffix. It’s an inline namespace of std::literals, so we can avoid a little typing:

#include <string>
#include <string_view>
 
using namespace std::literals;
 
const char cs[] = "C String";
std::string_view svcs{ cs };
 
// Plain string literals create a std::string_view
std::string_view s{ "hello"sv };
 
// char8_t string literals create a UTF-8 string view
std::u8string_view s8{ u8"hello"sv };

Like std::basic_string, using std::basic_string_view is vastly more convenient than using a C-style array of characters. Since both std::basic_string and arrays of characters are implicitly and cheaply converted to std::basic_string_view, we can use this type to gain that convenience while supporting different kinds of strings.
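
As a sketch of that convenience, a single function taking std::string_view accepts string literals, character arrays, std::string objects, and sub-views without overloads or copies of the characters. CountChar is a hypothetical helper:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: count how many times `c` appears in `text`
// Views are small, so they're normally passed by value
std::size_t CountChar(std::string_view text, char c)
{
    return static_cast<std::size_t>(std::count(text.begin(), text.end(), c));
}

// Hypothetical demo: call the same function with different kinds of strings
bool CountCharAcceptsManySources()
{
    const char array[] = "C String";
    std::string str{ "hello world" };
    std::string_view view{ str };

    return CountChar("banana", 'a') == 3       // String literal
        && CountChar(array, 'S') == 1          // Array of characters
        && CountChar(str, 'o') == 2            // std::string
        && CountChar(view.substr(0, 5), 'o') == 1; // View of "hello"
}
```

CountCharAcceptsManySources returns true.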

The closest C# equivalent to this is ReadOnlySpan<char> as it provides a “view” into the characters of a String. We’ll see C++’s generalized std::span equivalent to this later in the series.

Regex

Finally for today we have regular expressions in the <regex> header. The std::basic_regex class template supports several types of syntax via std::regex::awk, std::regex::grep, std::regex::ECMAScript, and so forth:

#include <string>
#include <regex>
 
// A regular expression for YYYY-MM-DD dates with ECMAScript grammar
// Each part of the date is captured in a group
std::regex re{
    "(\\d{4})-(\\d{2})-(\\d{2})",
    std::regex_constants::ECMAScript };
 
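// Search a string for a match and get the results of the match
// Note: std::regex_match would return false here because it only succeeds
// when the entire string matches
std::cmatch results{};
DebugLog(std::regex_search("before 2021-03-15 after", results, re)); // true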
DebugLog(results.size()); // 4
DebugLog(results[0]); // 2021-03-15 (sub-string that matched)
DebugLog(results[1]); // 2021 (first group)
DebugLog(results[2]); // 03 (second group)
DebugLog(results[3]); // 15 (third group)
 
// Replace the part of a string that matches
std::basic_string s{
    std::regex_replace(
        std::string{ "before 2021-03-15 after" }, re, "YYYY-MM-DD") };
DebugLog(s); // before YYYY-MM-DD after

A wide variety of overloads are available to support various types of strings, sub-strings, character types, case sensitivity, and so forth. In particular, std::cmatch in the above example is an alias to the std::match_results class template for C-style strings. Other aliases for wide character strings and std::basic_string are available.
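
To illustrate a few of those overloads, here is a sketch using the std::string versions. IsDate and FindYears are hypothetical helpers. It also uses raw string literals so the backslashes in the pattern don't need doubling, and std::sregex_iterator to visit every match rather than just the first:

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: check if the entire string is a YYYY-MM-DD date
bool IsDate(const std::string& text)
{
    static const std::regex re{ R"(\d{4}-\d{2}-\d{2})" };

    // regex_match only succeeds if the whole string matches
    return std::regex_match(text, re);
}

// Hypothetical helper: get the year of every YYYY-MM-DD date in the text
std::vector<int> FindYears(const std::string& text)
{
    static const std::regex re{ R"((\d{4})-(\d{2})-(\d{2}))" };

    std::vector<int> years;
    const std::sregex_iterator end{}; // Default-constructed means "no more"
    for (std::sregex_iterator it{ text.begin(), text.end(), re }; it != end; ++it)
    {
        // std::smatch is the std::string counterpart of std::cmatch
        const std::smatch& match = *it;

        // Group 1 is the year
        years.push_back(std::stoi(match[1].str()));
    }
    return years;
}
```

IsDate("2021-03-15") is true but IsDate("before 2021-03-15 after") is false. FindYears("from 2021-03-15 to 2022-01-01") returns 2021 and 2022.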

The C# equivalents are classes like Regex and Match in the System.Text.RegularExpressions namespace.

Conclusion

The C++ Standard Library layers quite a lot of functionality on top of a very humble basis. Simple characters and arrays of characters are extended all the way up to regular expressions, string classes, and string views. In between we have functionality for quick and convenient serialization, parsing, and localization.

As is usual for the Standard Library, all of this is done via the specialization of templates. We choose the most suitable version at compile time rather than relying on runtime strategies like virtual functions. We can specialize any of these templates to support new types of strings or to format our own app’s types and reap all the same benefits that standardized types like std::basic_string do.