C++ For C# Developers: Part 44 – Strings Library
C++ string literals may be simple arrays of characters, but the Standard Library provides a lot of support on top of that. From a string
class to regular expressions, we have a full set of tools to deal with strings in a wide variety of ways.
Charconv
C++17 introduces <charconv> with a pair of functions, std::to_chars and std::from_chars, for converting primitive types like double to characters and reading them back from characters. These functions don’t allocate memory, throw exceptions, handle localization, or even add NUL terminators. They’re intended to be used in serialization such as to JSON or when sending strings over a network socket:
#include <charconv>

// Buffer to print the value to
char buf[100];
char* end = buf + sizeof(buf);

// Print 3.14 to the buffer in scientific notation
std::to_chars_result tcr{
    std::to_chars(buf, end, 3.14, std::chars_format::scientific) };

// Add a NUL terminator to the returned pointer to the character after the
// last printed character
*tcr.ptr = '\0';

DebugLog(buf); // 3.14e+00
DebugLog("Success?", tcr.ec == std::errc()); // true
DebugLog("End pointer index", tcr.ptr - buf); // 8

// Read 3.14e+00 from the buffer
double val;
std::from_chars_result fcr{ std::from_chars(buf, end, val) };
DebugLog(val); // 3.14
DebugLog("Success?", fcr.ec == std::errc()); // true
DebugLog("End pointer index", fcr.ptr - buf); // 8
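Because std::from_chars reports failure through its return value rather than an exception, it can be wrapped into something resembling C#'s int.TryParse. The following is a minimal sketch. TryParseInt is a hypothetical helper name, not part of the Standard Library, and it assumes that leftover unparsed characters should count as a failure:

```cpp
#include <charconv>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical helper: parse an entire string as a base-10 int
// Returns std::nullopt if parsing fails or any characters are left over
std::optional<int> TryParseInt(std::string_view text)
{
    int value = 0;
    const char* first = text.data();
    const char* last = text.data() + text.size();
    std::from_chars_result result{ std::from_chars(first, last, value) };

    // ec equals a value-initialized std::errc on success
    if (result.ec != std::errc())
    {
        return std::nullopt;
    }

    // ptr points one past the last character that was consumed
    if (result.ptr != last)
    {
        return std::nullopt;
    }
    return value;
}
```

Note that std::from_chars does not skip leading whitespace or accept a leading plus sign, so inputs like " 12" fail with this helper.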
The TextReader and TextWriter classes in C# are probably the closest analog as they can write to existing streams rather than operating on individual string objects.
String
Next up is <string> which primarily defines the std::basic_string class template. This is similar to the built-in String/string type in C#. One key difference is that it is mutable, meaning that the string’s characters can change. It is also not a managed reference, as C++ doesn’t have those, and must be wrapped in something like a std::shared_ptr for a similar effect.
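To make the contrast with C# concrete, here is a small sketch of what value semantics and std::shared_ptr sharing look like in practice. Both function names are made up for illustration, and the first one assumes a non-empty argument:

```cpp
#include <memory>
#include <string>

// Copying a std::string copies its characters, so the copy is independent
// Assumes original is not empty
std::string CopyThenModify(const std::string& original)
{
    std::string copy{ original }; // Deep copy of the characters
    copy[0] = 'J';                // Only changes the copy
    return copy;
}

// Sharing one string between two owners requires an explicit wrapper
bool SharedPointersSeeSameString()
{
    std::shared_ptr<std::string> a{ std::make_shared<std::string>("hello") };
    std::shared_ptr<std::string> b{ a }; // Copies the pointer, not the string
    (*b)[0] = 'j';                       // Visible through both a and b
    return *a == "jello";
}
```

Assigning one C# string variable to another copies a reference, while assigning one std::string to another copies the characters. The std::shared_ptr version is the closer match to the C# behavior, with the added twist that the shared string is mutable.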
A template parameter of std::basic_string specifies the type of characters in the string. The <string> header provides many aliases for common character types so it’s rare to use std::basic_string directly:
Alias | Template | Meaning |
---|---|---|
std::string | std::basic_string<char> | C string |
std::wstring | std::basic_string<wchar_t> | Wide character string |
std::u8string | std::basic_string<char8_t> | UTF-8 string |
std::u16string | std::basic_string<char16_t> | UTF-16 string |
std::u32string | std::basic_string<char32_t> | UTF-32 string |
There’s also a pmr version to change how memory is allocated:
Alias | Template | Meaning |
---|---|---|
std::pmr::string | std::pmr::basic_string<char> | C string |
std::pmr::wstring | std::pmr::basic_string<wchar_t> | Wide character string |
std::pmr::u8string | std::pmr::basic_string<char8_t> | UTF-8 string |
std::pmr::u16string | std::pmr::basic_string<char16_t> | UTF-16 string |
std::pmr::u32string | std::pmr::basic_string<char32_t> | UTF-32 string |
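As a rough illustration of what changing the allocation strategy can look like, the sketch below builds a std::pmr::string whose characters are placed in a stack array via std::pmr::monotonic_buffer_resource from <memory_resource>. BuildGreeting and the 256-byte buffer size are arbitrary choices for this example:

```cpp
#include <cstddef>
#include <memory_resource>
#include <string>

// Build a string whose characters live in a local stack buffer
// Returns a regular std::string copy so the result can outlive the buffer
std::string BuildGreeting(const char* name)
{
    // Memory for the pmr string comes from this array first
    // If it runs out, the resource falls back to the default allocator
    std::byte buffer[256];
    std::pmr::monotonic_buffer_resource resource{ buffer, sizeof(buffer) };

    std::pmr::string greeting{ "Hello, ", &resource };
    greeting += name;
    greeting += "!";

    // Copy the characters out before the buffer goes away
    return std::string{ greeting.data(), greeting.size() };
}
```

The std::pmr::string must not outlive the memory resource it was given, which is why the function returns an ordinary std::string copy.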
Whichever we choose, the class “owns” the memory that the string is stored in. That means it allocates memory when needed and deallocates it in the destructor. It also provides a bunch of member functions to perform common operations on the string. Here’s a sampling of that functionality:
#include <string>

void Foo()
{
    // Allocate memory for the string
    std::string s{ "hello world" };

    // Read and write individual characters
    s[0] = 'H';
    s[6] = 'W';
    DebugLog(s); // Hello World

    // Get a NUL-terminated const pointer to the first character (a C string)
    const char* cs = s.c_str();
    DebugLog(cs); // Hello World

    // Get a non-const pointer to the first character
    char* d = s.data();
    DebugLog(d); // Hello World

    // Check if the string is empty
    DebugLog(s.empty()); // false

    // Get the number of characters in the string
    DebugLog(s.size()); // 11
    DebugLog(s.length()); // 11

    // Check how much capacity is there to hold characters
    DebugLog(s.capacity()); // Maybe 15

    // Allocate enough memory to hold a certain number of characters
    // Note: cannot be used to shrink the string
    s.reserve(128);
    DebugLog(s.capacity()); // At least 128

    // Request reducing allocated memory to just enough to hold the string
    s.shrink_to_fit();
    DebugLog(s.capacity()); // Maybe 15

    // Add a character to the end
    s.push_back('!');
    DebugLog(s); // Hello World!

    // Check if the string starts with another string
    DebugLog(s.starts_with("Hello")); // true

    // Replace 1 character starting at index 5 with a comma and a space
    s.replace(5, 1, ", ");
    DebugLog(s); // Hello, World!

    // Get a string of 5 characters starting at index 7
    std::string ss{ s.substr(7, 5) };
    DebugLog(ss); // World

    // Find an index of a string in the string
    std::string::size_type i = s.find("llo");
    DebugLog(i); // 2

    // Copy the string to another string
    std::string s2{ "other" };
    DebugLog(s2); // other
    s2 = s;
    DebugLog(s2); // Hello, World!

    // Compare strings' characters with overloaded operators
    DebugLog(s == s2); // true

    // Empty the string
    s.clear();
    DebugLog(s); //
} // Destructor deallocates the string's memory
There are also some functions outside of the class that operate on std::basic_string objects:
#include <string>

// Parse a float out of a string
// Throws an exception upon failure
std::string s{ "3.14" };
float f = std::stof(s);
DebugLog(f); // 3.14

// Convert a double to a string
std::string s2{ std::to_string(3.14) };
DebugLog(s2); // 3.140000

// Check if a string is empty
DebugLog(std::empty(s)); // false

// Get a non-const pointer to the first character
char* d = std::data(s);
DebugLog(d); // 3.14
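Since std::stof reports failure by throwing, callers typically need a try/catch. Here is one possible sketch of a wrapper that converts those exceptions into an empty std::optional. TryParseFloat is a hypothetical name, and treating trailing characters as an error is a design assumption of this example rather than something std::stof does on its own:

```cpp
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical helper: parse a float without letting exceptions escape
std::optional<float> TryParseFloat(const std::string& text)
{
    try
    {
        // The second argument receives the number of characters parsed
        std::size_t consumed = 0;
        float value = std::stof(text, &consumed);

        // std::stof accepts "3.5abc" and stops at 'a', so reject leftovers
        if (consumed != text.size())
        {
            return std::nullopt;
        }
        return value;
    }
    catch (const std::invalid_argument&)
    {
        // No conversion could be performed, such as for "abc"
        return std::nullopt;
    }
    catch (const std::out_of_range&)
    {
        // The value doesn't fit in a float, such as for "1e999"
        return std::nullopt;
    }
}
```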
Lastly, there is a user-defined literal in the std::literals::string_literals namespace to create strings. The s suffix is overloaded to create a string based on the type of characters it’s applied to:
#include <string>
using namespace std::literals::string_literals;

// Plain string literals create a std::string
std::string s{ "hello"s };

// char8_t string literals create a UTF-8 string
std::u8string s8{ u8"hello"s };
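One practical difference between the s suffix and constructing from a plain literal shows up with embedded NUL characters. The following sketch, with function names invented for the example, compares the two:

```cpp
#include <cstddef>
#include <string>

using namespace std::literals::string_literals;

// Constructing from a pointer stops at the first NUL character
std::size_t SizeFromPointer()
{
    std::string s{ "ab\0cd" };
    return s.size();
}

// The s suffix receives the literal's full length, so the NUL is kept
std::size_t SizeFromLiteral()
{
    std::string s{ "ab\0cd"s };
    return s.size();
}
```

The first function returns 2 because the const char* constructor has to search for the terminator, while the second returns 5 because the literal operator is passed the length by the compiler.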
Locale and Codecvt
Next up is <locale> to help with localization. The std::locale class identifies a locale like CultureInfo does in C#. Its member functions and other functions in <locale> allow us to perform operations within the context of that locale:
#include <string>
#include <locale>

// Construct a locale for a specific locale name
std::locale loc{ "en_US.UTF-8" };

// Lexicographically compare strings with the overloaded () operator
std::string a{ "apple" };
std::string b{ "banana" };
DebugLog(loc(a, b)); // true

// Check if a character is in a category for this locale
DebugLog(std::isspace(' ', loc)); // true
DebugLog(std::islower('a', loc)); // true
DebugLog(std::isdigit('1', loc)); // true

// Convert between uppercase and lowercase in this locale
DebugLog(std::toupper('a', loc)); // A
DebugLog(std::tolower('Z', loc)); // z
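Locale names like "en_US.UTF-8" are platform-specific, and the std::locale constructor throws std::runtime_error when the name isn't recognized on the current system. As a hedged sketch, the hypothetical helpers below fall back to the always-available classic "C" locale and apply the per-character std::toupper to a whole string:

```cpp
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: try to construct a named locale, falling back to the
// classic "C" locale if the name is not recognized on this system
std::locale MakeLocaleOrClassic(const char* name)
{
    try
    {
        return std::locale{ name };
    }
    catch (const std::runtime_error&)
    {
        return std::locale::classic();
    }
}

// Hypothetical helper: uppercase every char using the given locale
// Works one char at a time, so it is not suitable for multi-byte UTF-8 text
std::string ToUpper(std::string text, const std::locale& loc)
{
    for (char& c : text)
    {
        c = std::toupper(c, loc);
    }
    return text;
}
```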
Later in the series we’ll look at I/O and see how we can use std::locale to localize categories of values like time and money.
In the meantime, let’s look at wstring_convert and wbuffer_convert which work with <codecvt> to provide conversion facilities between different string formats like UTF-8 and UTF-16. These class templates and the <codecvt> header were deprecated in C++17 and there will presumably be a replacement at some point in the future. For now, we can use them like this example that converts “😎👍” between UTF-8 and UTF-16:
#include <cstdint>
#include <string>
#include <locale>
#include <codecvt>

void Foo()
{
    // Emojis as UTF-8 and UTF-16
    std::string u8 = "\xf0\x9f\x98\x8e\xf0\x9f\x91\x8d";
    std::u16string u16 = u"\xd83d\xde0e\xd83d\xdc4d";

    // Make a converter from UTF-8 to UTF-16
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> u8u16{};

    // Use it to convert from UTF-8 to UTF-16
    std::u16string toU16 = u8u16.from_bytes(u8);
    DebugLog("Success?", u16 == toU16); // true
    DebugLog("UTF-16 size", toU16.size()); // 4
    for (uint32_t c : toU16)
    {
        DebugLog(c);
        // Outputs:
        //   55357
        //   56846
        //   55357
        //   56397
    }

    // Make a converter from UTF-16 to UTF-8
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> u16u8{};

    // Use it to convert UTF-16 to UTF-8
    std::string toU8 = u16u8.to_bytes(u16);
    DebugLog("Success?", u8 == toU8); // true
    DebugLog("UTF-8 size", toU8.size()); // 8
    for (uint32_t c : toU8)
    {
        DebugLog(c);
        // Outputs (where char is signed, bytes of 0x80 and above sign-extend):
        //   4294967280
        //   4294967199
        //   4294967192
        //   4294967182
        //   4294967280
        //   4294967199
        //   4294967185
        //   4294967181
    }
}
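The UTF-16 values printed above are surrogate pairs, and the arithmetic behind them is simple enough to write by hand. The sketch below is not a replacement for a full conversion library. EncodeUtf16 is a hypothetical helper that assumes its argument is a valid Unicode code point:

```cpp
#include <string>

// Encode one Unicode code point as UTF-16
// Assumes cp is valid: at most 0x10FFFF and not itself a surrogate
std::u16string EncodeUtf16(char32_t cp)
{
    std::u16string out;
    if (cp < 0x10000)
    {
        // Code points in the Basic Multilingual Plane fit in one code unit
        out.push_back(static_cast<char16_t>(cp));
    }
    else
    {
        // Subtract 0x10000 to get a 20-bit value, then split it in half
        char32_t v = cp - 0x10000;
        // High surrogate: 0xD800 plus the top 10 bits
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
        // Low surrogate: 0xDC00 plus the bottom 10 bits
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
    }
    return out;
}
```

For 😎, which is U+1F60E, this produces 0xD83D and 0xDE0E. Those are the 55357 and 56846 seen in the output above.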
Format
C++20 adds the <format> header to make formatting data as strings easier and safer than existing methods like sprintf in the C Standard Library. The std::format function is rather similar to string interpolation in C#: $"Score: {score}".
#include <string>
#include <format>
#include <locale>

// Format a string
int score = 123;
std::string str{ std::format("Score: {}", score) };
DebugLog(str); // Score: 123

// Format a string for a specific locale
std::locale loc{ "en_US.UTF-8" };
str = std::format(loc, "Score: {}", score);
DebugLog(str); // Score: 123
We can specialize the std::formatter class template to enable formatting our own types:
#include <format>
#include <string>

struct Vector2
{
    float X;
    float Y;
};

// Specialize std::formatter for Vector2 with char-based format strings
template<>
struct std::formatter<Vector2>
{
    // No format specifiers are supported, so parsing stops immediately
    // The returned iterator must point at the closing '}' of the field
    constexpr auto parse(std::format_parse_context& pc)
    {
        return pc.begin();
    }

    template<typename TContext>
    auto format(const Vector2& v, TContext& fc) const
    {
        return std::format_to(fc.out(), "({}, {})", v.X, v.Y);
    }
};

Vector2 v{ 1, 2 };
std::string s{ std::format("Vector: {}", v) };
DebugLog(s); // Vector: (1, 2)
String View
C++17 introduces std::basic_string_view as a class template that provides a read-only “view” into another string. It’s an adapter for string literals and other arrays of characters as well as string classes like std::basic_string. Unlike std::basic_string, it doesn’t “own” the memory that holds the characters. That means it doesn’t allocate it or deallocate it but instead acts like a pointer to existing memory and a size_t to keep track of the length. As with other pointers, it’s important to not use the std::basic_string_view after the string it points to is deallocated.
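The lifetime rule is easiest to see with a function that returns a view. In this sketch, FirstWord and FirstWordCopy are hypothetical helpers. The first returns a view into the caller's characters, while the second returns an owning copy that is safe to keep around:

```cpp
#include <string>
#include <string_view>

// Returns a view of the characters before the first space
// The result points into the same memory as the argument
std::string_view FirstWord(std::string_view text)
{
    // find returns npos when there is no space, so substr keeps everything
    return text.substr(0, text.find(' '));
}

// Returns an owning copy, which stays valid regardless of the argument
std::string FirstWordCopy(std::string_view text)
{
    return std::string{ FirstWord(text) };
}

// Unsafe pattern, shown only as a comment:
//   std::string_view bad = FirstWord(std::string{ "temporary string" });
// The temporary std::string is destroyed at the end of that statement,
// leaving bad pointing at deallocated memory
```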
Aliases are provided in parallel with std::basic_string:
Alias | Template | Meaning |
---|---|---|
std::string_view | std::basic_string_view<char> | View of C string |
std::wstring_view | std::basic_string_view<wchar_t> | View of wide character string |
std::u8string_view | std::basic_string_view<char8_t> | View of UTF-8 string |
std::u16string_view | std::basic_string_view<char16_t> | View of UTF-16 string |
std::u32string_view | std::basic_string_view<char32_t> | View of UTF-32 string |
Here’s how to use them:
#include <string>
#include <string_view>

// A simple array of characters
const char cs[] = "C String";

// A view into the array of characters
std::string_view svcs{ cs };

// A std::basic_string
std::string bs{ "std::string" };

// A view into the std::basic_string
std::string_view svbs{ bs };

// Query the string's size
DebugLog(svcs.empty()); // false
DebugLog(svcs.size()); // 8
DebugLog(svcs.length()); // 8

// Read characters
DebugLog(svcs[2]); // S
// Note: operator[] does no bounds checking, so svcs[100] is undefined
// behavior. The at() member function checks bounds instead:
// svcs.at(100) throws a std::out_of_range exception
DebugLog(svcs.front()); // C
DebugLog(svcs.back()); // g
DebugLog(svcs.data()); // C String

// Copy part of the string
char buf[4] = { '\0' };
svcs.copy(buf, 3, 2);
DebugLog(buf); // Str

// Get a view of a sub-string. Does not copy characters.
std::string_view sub{ svcs.substr(5, 3) };
DebugLog(sub); // ing

// Compare string views' characters
DebugLog(svcs.compare(svbs)); // Negative value, since 'C' sorts before 's'
DebugLog(svcs == svbs); // false

// C++20: check if the string starts or ends with a sub-string
DebugLog(svcs.starts_with("C Str")); // true
DebugLog(svcs.ends_with("ING")); // false

// Find a sub-string's index
DebugLog(svcs.find("Str")); // 2

// Reduce the view by moving the view's pointer forward
// Does not modify the string
svcs.remove_prefix(2);
DebugLog(svcs); // String

// Reduce the view by reducing the view's size
// Does not modify the string
svcs.remove_suffix(3);
DebugLog(svcs); // Str
Again paralleling std::basic_string, there are also some functions outside of the std::basic_string_view class that operate on std::basic_string_view objects:
#include <string>
#include <string_view>

const char cs[] = "C String";
std::string_view svcs{ cs };

// Check if a string view is empty
DebugLog(std::empty(svcs)); // false

// Get a pointer to the first character
const char* d = std::data(svcs);
DebugLog(d); // C String
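Printing the pointer from data() works here only because the view covers a whole NUL-terminated array. A view has no terminator of its own, so once it has been shortened, C-style functions keep reading past its end. The sketch below uses invented function names and assumes the view points into a NUL-terminated array to show the difference:

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <string_view>

// Precondition: view points into a NUL-terminated array
// strlen ignores the view's size and reads until it finds the NUL
std::size_t LengthSeenByCFunction(std::string_view view)
{
    return std::strlen(view.data());
}

// Copying into a std::string gives exactly the viewed characters plus a NUL
std::size_t LengthAfterCopy(std::string_view view)
{
    std::string owned{ view };
    return std::strlen(owned.c_str());
}
```

For a view of only the first three characters of "C String", the first function still returns 8 while the second returns 3. When a C string is required, copying the view into a std::string and calling c_str() is the safe route.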
There is also a user-defined literal in the std::literals::string_view_literals namespace to create string views with the sv suffix. It’s an inline namespace of std::literals, so we can avoid a little typing:
#include <string>
#include <string_view>
using namespace std::literals;

const char cs[] = "C String";
std::string_view svcs{ cs };

// Plain string literals create a std::string_view
std::string_view s{ "hello"sv };

// char8_t string literals create a UTF-8 string view
std::u8string_view s8{ u8"hello"sv };
Like std::basic_string, using std::basic_string_view is vastly more convenient than using a C-style array of characters. Since both std::basic_string and arrays of characters are implicitly and cheaply converted to std::basic_string_view, we can use this type to gain that convenience while supporting different kinds of strings.
The closest C# equivalent to this is ReadOnlySpan<char> as it provides a “view” into the characters of a String. We’ll see C++’s generalized std::span equivalent to this later in the series.
Regex
Finally for today we have regular expressions in the <regex> header. The std::basic_regex class template supports several types of syntax via std::regex::awk, std::regex::grep, std::regex::ECMAScript, and so forth:
#include <string>
#include <regex>

// A regular expression for YYYY-MM-DD dates with ECMAScript grammar
// Each part of the date is captured in a group
std::regex re{
    "(\\d{4})-(\\d{2})-(\\d{2})",
    std::regex_constants::ECMAScript };

// Search a string for a match and get the results of the match
// Note: std::regex_match would return false here because it requires the
// entire string to match the regular expression
std::cmatch results{};
DebugLog(std::regex_search("before 2021-03-15 after", results, re)); // true
DebugLog(results.size()); // 4
DebugLog(results[0]); // 2021-03-15 (sub-string that matched)
DebugLog(results[1]); // 2021 (first group)
DebugLog(results[2]); // 03 (second group)
DebugLog(results[3]); // 15 (third group)

// Replace the part of a string that matches
std::basic_string s{
    std::regex_replace(
        std::string{ "before 2021-03-15 after" },
        re,
        "YYYY-MM-DD") };
DebugLog(s); // before YYYY-MM-DD after
A wide variety of overloads are available to support various types of strings, sub-strings, character types, case sensitivity, and so forth. In particular, std::cmatch in the above example is an alias to the std::match_results class template for C-style strings. Other aliases for wide character strings and std::basic_string are available.
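One of those other aliases is std::smatch for std::string, and there is a matching std::sregex_iterator for walking over every match in a string. The sketch below uses two hypothetical helpers: FindDates collects all matches, and IsDate uses std::regex_match, which succeeds only when the whole string matches:

```cpp
#include <regex>
#include <string>
#include <vector>

// Collect every YYYY-MM-DD sub-string found in the text
std::vector<std::string> FindDates(const std::string& text)
{
    static const std::regex re{ "(\\d{4})-(\\d{2})-(\\d{2})" };
    std::vector<std::string> dates;

    // A default-constructed iterator marks the end of the matches
    std::sregex_iterator it{ text.begin(), text.end(), re };
    std::sregex_iterator end{};
    for (; it != end; ++it)
    {
        // *it is a std::smatch and str() is the whole matched sub-string
        dates.push_back(it->str());
    }
    return dates;
}

// Check whether the entire text is exactly one YYYY-MM-DD date
bool IsDate(const std::string& text)
{
    static const std::regex re{ "(\\d{4})-(\\d{2})-(\\d{2})" };
    return std::regex_match(text, re);
}
```

This mirrors the split in C# between Regex.Matches for finding every occurrence and an anchored Regex.IsMatch for validating a whole string.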
The C# equivalents of this are classes like Regex and Match in the System.Text.RegularExpressions namespace.
Conclusion
The C++ Standard Library layers quite a lot of functionality on top of a very humble basis. Simple characters and arrays of characters are extended all the way up to regular expressions, string classes, and string views. In between we have functionality for quick and convenient serialization, parsing, and localization.
As is usual for the Standard Library, all of this is done via the specialization of templates. We choose the most suitable version at compile time rather than relying on runtime strategies like virtual functions. We can specialize any of these templates to support new types of strings or to format our own app’s types and reap all the same benefits that standardized types like std::basic_string do.