FUD Strings
Table of Contents
Published on 28 October 2024.
Last modified 1 November 2024.
All snippets on this page are licensed under the Apache License, Version 2.0 and copyrighted by myself, Dominick Allen. Code may be reformatted and truncated from its original form; don't consider what's written on this page as the actual version, but for illustrative purposes.
If you haven't read my introduction to libfud (the library), I recommend you start there.
1. The String API and Design
This class takes much inspiration from
std::string
, but with the addition of
user controlled allocation, and only specialized for UTF8, represented as
unsigned characters.
-
Strings are a contiguous sequence of
utf8
values.utf8
is an alias forchar8_t
which has an underlying representation ofunsigned char
. -
Strings take a user-supplied allocator, defaulting to the
globalFudAllocator
if none is provided. - No streaming operations are provided on strings.
Strings and all other container types in libfud take a user provided allocator.
1.1.
The use of
char8_t
has a deep impact on the String interface.
First of all, why use
char8_t?
Tom Honermann, the author of the proposals for
char8_t
weighs in:
The motivation for a distinct type with these properties is:
- To provide a distinct type for UTF-8 character data vs character data with an encoding that is either locale dependent or that requires separate specification.
- To enable overloading for ordinary string literals vs UTF-8 string literals (since they may have different encodings).
- To ensure an unsigned type for UTF-8 data (whether char is signed or unsigned is implementation defined).
- To enable better performance via a non-aliasing type; optimizers can better optimize types that do not alias other types.
1.2. Building from the Basics
What makes a string useful? What is its purpose? Strings store messages and
identifiers, and are a unit of transaction for input and output with the
operating system, from identifying files to the basis of formatted printing
messages on the console or GUI. For a String to be useful in a system dominated
by the
C conception of strings
, it must be readily usable for those
interfaces. What's more, in C and C++, so-called
string literals
are given
special privilege in source files during compilation, so that is a reasonable
starting point for creating a string from C strings, that can be passed to
interfaces taking C strings.
constexpr size_t SsoBufSize = 23; class String { private: static constexpr uint8_t maxSmallStringLength = SsoBufSize; static constexpr uint8_t smallStringLengthMask = 0xFF; static constexpr auto isLargeMask = static_cast<uintptr_t>(0x01); static constexpr auto allocatorMask = ~isLargeMask; uintptr_t m_allocator{reinterpret_cast<uintptr_t>(&globalFudAllocator)}; using BufType = Array<utf8, SsoBufSize>; union { struct { size_t capacity; size_t length; utf8* data; } large; struct { uint8_t length = 0; BufType buffer{}; } small{}; } m_repr{}; [[nodiscard]] bool isLarge() const { return (m_allocator & isLargeMask) != 0; } public: using StringResult = Result<String, FudStatus>; static StringResult makeFromCString(const char* cString); static StringResult makeFromCString(const char* cString, Allocator* allocator); [[nodiscard]] constexpr const utf8* data() const { return isLarge() ? m_repr.large.data : m_repr.small.buffer.data(); } [[nodiscard]] inline const char* c_str() const { return reinterpret_cast<const char*>(data()); } [[nodiscard]] inline StringView asView() const { return StringView(*this); } [[nodiscard]] Option<utf8> front(); [[nodiscard]] Option<utf8> back(); };
Let's quickly go over the details here: String has a member
m_allocator
. The
m_allocator
member stores the value of a pointer to an
Allocator
object, and
information to determine which member of the union member
m_repr
is
valid.
m_repr
is union over a
large
struct with dynamic capacity, length,
and a pointer to
utf8
data, and a
small
struct with length and a statically
sized buffer of size
SsoBufSize.
The
isLarge
method discriminates the union,
checking the least significant bit of
m_allocator
.
Allocator
pointers must
be aligned to at least 8 bits; indeed, the class declaration for
Allocator
forces maximum alignment. There are two static methods to create a String - both
take an argument for a C string, one also takes an argument for an
allocator. There are public methods to get the interior data as a constant
pointer, either as
const utf8*
, or
const char*
.
1.3. Small String Optimization
The first thing to discuss of is the use of
Small String Optimization
technique,
shortened as SSO. SSO allows sufficiently short strings to be directly stored in
the String object itself, which greatly reduces the number of trips out to the
heap. However, this optimization carries two penalties: more complex code for
implementation and maintenance, and more instructions to determine capacity,
length, and the actual location of the data being managed. The upshot is that
the first penalty is justified by increased performance, and the second penalty
doesn't matter that much - the CPU can compute much faster than it takes to
access memory after a cache miss
. The SSO techniques used by
libstdc++
,
libc++
, and
MSVC
outlined by
Raymond Chen
are interesting; mine is closest
to
libc++,
but instead of storing the condition of being large within the
union, libfud stores it in the least significant bit of its allocator. As a
note, I could not replicate the
libc++
without invoking undefined behavior -
they get a free pass as an implementation of the standard library. Rules for
thee, I suppose.
// private static constexpr size_t maxStringLength = (static_cast<size_t>(1) << 63) - 1; static constexpr uint8_t maxSmallStringLength = SsoBufSize; static constexpr uint8_t smallStringLengthMask = 0xFF; static constexpr auto isLargeMask = static_cast<uintptr_t>(0x01); static constexpr auto allocatorMask = ~isLargeMask; [[nodiscard]] static bool allocatorValid(Allocator* allocator) { return (reinterpret_cast<uintptr_t>(allocator) & isLargeMask) == 0; } Allocator* allocator() const { return reinterpret_cast<Allocator*>(m_allocator & allocatorMask); } [[nodiscard]] bool isLarge() const { return (m_allocator & isLargeMask) != 0; } void setLarge() { m_allocator |= isLargeMask; } void setSmall() { m_allocator &= allocatorMask; }
When creating a
String
, the
allocator
pointer must be checked with
allocatorValid
to ensure that it is correctly aligned. It is considered
invalid if its last bit is not clear. Getting the allocator pointer with
allocator()
is just a matter of masking out the least significant bit before
accessing it.
setLarge
and
setSmall
do the appropriate bitwise or-ing and
masking to set and clear the large flag, which is checked with
isLarge.
1.4. String Properties
It's useful to query the length, capacity, and emptiness of a string, and it's pretty important when implementing while using the SSO.
// public [[nodiscard]] size_t length() const { if (isLarge()) { return m_repr.large.length; } return m_repr.small.length; } [[nodiscard]] bool empty() const { return length() == 0; } [[nodiscard]] size_t size() const { return length() + 1; } [[nodiscard]] size_t capacity() const { if (isLarge()) { return m_repr.large.capacity - 1U; } return SsoBufSize - 1U; } /** \brief Returns the remaining capacity for characters excluding the null * terminating byte. */ [[nodiscard]] size_t remainingCapacity() const { if (length() > capacity()) { return 0; } return capacity() - length(); } [[nodiscard]] bool utf8Valid() const;
These are all mostly straightforward, aside from the impacts of using
SSO. However, there's an interesting one sitting here,
utf8Valid
. I'll touch
on this again later.
1.5. Creating a String from a C String
Let's take a small diversion from the API to see how the sausage is made, taking
a C string and turning it into a
fud::String.
StringResult String::makeFromCString(const char* cString, Allocator* allocator) { if (allocator == nullptr) { return StringResult::error(FudStatus::NullPointer); } if (!String::allocatorValid(allocator)) { return StringResult::error(FudStatus::ArgumentInvalid); } auto lenResult = cStringLength(cString); if (lenResult < 0 || lenResult >= std::numeric_limits<ssize_t>::max()) { return StringResult::error(FudStatus::ArgumentInvalid); } auto length = static_cast<size_t>(lenResult); if (length > maxStringLength) { return StringResult::error(FudStatus::ArgumentInvalid); } String output{}; output.m_allocator = reinterpret_cast<uintptr_t>(allocator); utf8* data{nullptr}; size_t capacity = length + 1; bool isLarge = capacity > SsoBufSize; if (isLarge) { auto status = output.makeLarge(outputCapacity, view.length(), data); if (status != FudStatus::Success) { return StringResult::error(status); } } else { output.makeSmall(outputCapacity, view.length(), data); } fudAssert(data != nullptr); auto copyStatus = copyMem(data, capacity, cString, length); fudAssert(copyStatus == FudStatus::Success); auto terminateStatus = output.nullTerminate(); fudAssert(terminateStatus == FudStatus::Success); return StringResult::okay(std::move(output)); }
First, the allocator is checked for being null, then that it's valid (low bit clear). The next step can be pretty expensive, but is irreducible in cost: getting the length of a c string. Let's take a quick peek at its implementation:
// fud_c_string.hpp constexpr ssize_t MAX_C_STRING_LENGTH = std::numeric_limits<ssize_t>::max() - 1; constexpr ssize_t cStringLength(const char* str, size_t maxLength) { if (str == nullptr || maxLength > MAX_C_STRING_LENGTH) { return -1; } ssize_t size = 0; while (str[size] != 0 && static_cast<size_t>(size) < maxLength) { size++; } if (str[size] != 0 && static_cast<size_t>(size) == maxLength) { return static_cast<ssize_t>(maxLength) + 1; } return size; } constexpr ssize_t cStringLength(const char* str) { return cStringLength(str, MAX_C_STRING_LENGTH); }
It doesn't look that special… but it's still better than the standard
strlen
for one simple reason: it's not undefined behavior if
str
is a null
pointer. If
str
is a null pointer, the function returns -1. This limits the
maximum size of a string to
std::numeric_limits<ssize_t>::max()
,
unfortunately. I feel bad for all the people whose strings are larger than 9223
petabytes.
So, after validating the length of the string and casting it to
size_t,
it's
time to use that length information to determine whether or not to store the
string contents within the string itself, or in memory that can be provided
by
allocator.
String output{}; output.m_allocator = reinterpret_cast<uintptr_t>(allocator); utf8* data{nullptr}; size_t capacity = length + 1; bool isLarge = capacity > SsoBufSize; if (isLarge) { auto status = output.makeLarge(outputCapacity, view.length(), data); if (status != FudStatus::Success) { return StringResult::error(status); } } else { output.makeSmall(outputCapacity, view.length(), data); }
The check to determine if the string is large is pretty simple - is the capacity of the internal buffer sufficient to hold the contents of the string plus a null terminating byte? If not, then it's time to put the allocator to work:
FudStatus String::makeLarge(size_t cap, size_t len, utf8*& outputData) { m_repr.large.capacity = cap; m_repr.large.length = len; auto dataResult = allocator()->allocate(cap, 1); if (dataResult.isError()) { return dataResult.getError(); } m_repr.large.data = reinterpret_cast<utf8*>(dataResult.getOkay()); outputData = m_repr.large.data; setLarge(); return FudStatus::Success; }
Pretty straightforward stuff. A gentle reminder, the
allocator()
method is
used to get the true value of the pointer to the allocator since the member
variable performs double duty as the indication of the string being long or
short. A pointer to memory for the string is requested with
allocate(cap, 1)
,
which is a request for enough bytes to satisfy the capacity needed for the
string, including the null terminator, aligned to 1 byte. The result is checked
for validity, and assigned to the data pointer. The representation of the string
is set to large, then the function returns, having set
outputData
to the data
pointer obtained. For completeness sake, let's also look at
makeSmall
:
void String::makeSmall(size_t& cap, size_t len, utf8*& outputData) { fudAssert(len < SsoBufSize); cap = SsoBufSize; static_assert(SsoBufSize < std::numeric_limits<int8_t>::max()); m_repr.small.length = len & smallStringLengthMask; outputData = m_repr.small.buffer.data(); setSmall(); }
In sharp contrast to the fallibility of
makeLarge,
makeSmall
is a void
function. It also takes the liberty of setting the
cap
argument to the size of
the small string buffer, and settingout
outputData
to the address of the
buffer's internal storage, before setting the representation of the string to be
small.
In either path, having obtained the valid pointer to the string's data, the
input
str
is copied to the string's buffer, then null terminated. Finally, the
string is returned from the function in a result.
Now, back to the API again while you enjoy the sausage.
2. Copies, Moves, and Destruction
What would a discussion about the design of a C++ object be without invoking the rule of 5? Let's start with the default constructor.
/** \brief Default constructs a small string of zero length using the global * fud allocator. */ String() noexcept = default;
Surprise! The default constructor is defaulted! Bet you weren't expecting that, were you? It creates a zero length string using the default global allocator - which is guaranteed to be nonnull. Anything else becomes infallible, and makes it impossible to write a sound constructor.
String(const String& rhs) = delete;
Next, the copy constructor is deleted. When copying becomes involved and allocation is fallible, a String can not be copy constructed without signalling error, so the copy constructor can not be part of a safe system.
String(String&& rhs) noexcept;
In sharp contrast, however, the move constructor is not deleted. Because it
takes ownership of the contents of
rhs,
no fallible operations happen. This
also means that it takes the
m_allocator
that
rhs
was created with. Let's
take a quick peek at the implementation to note an important detail:
String::String(String&& rhs) noexcept : m_allocator{rhs.m_allocator}, m_repr{std::move(rhs.m_repr)} { rhs.setSmall(); rhs.m_repr.small.length = 0; }
The contents of
rhs
are set such that it behaves as if it's been initialized
with zero data and not allocated anything, regardless of if it has or not. This
is done so that the destructor may behave correctly.
String::~String() { cleanup(); } void String::cleanup() { const auto* allocPtr = allocator(); if (isLarge() && m_repr.large.data != nullptr && allocPtr != nullptr) { allocator()->deallocate( reinterpret_cast<std::byte*>(m_repr.large.data), m_repr.large.capacity); static_cast<void>(deallocStatus); m_repr.large.data = nullptr; } }
The destructor is factored out into a
cleanup()
method to allow it to be
reused in functions with desructive semantics, such as the move assignment
operator.
String& operator=(const String& rhs) = delete; String& operator=(String&& rhs) noexcept; String& String::operator=(String&& rhs) noexcept { if (&rhs == this) { return *this; } cleanup(); m_allocator = rhs.m_allocator; m_repr = std::move(rhs.m_repr); if (isLarge()) { rhs.m_repr.large.data = nullptr; } return *this; }
The copy assignment operator is deleted because of fallible semantics. As you
can see, in contrast, that move assignment is not fallible, but merely
destructive. If you want copying, you have to explicitly use
String::from
with
either another
String
or a
StringView.
static StringResult from( const String& rhs, Option<Allocator*> allocatorOption = NullOpt); static StringResult from( StringView view, Allocator* allocator = &globalFudAllocator); FudStatus copy(const String& rhs);
Making a String from another String is fallible, and takes an optional
allocator. If the option is unspecified (NullOpt), then the allocator from
rhs
is used. Making a String from a StringView is functionally the same as making
one from a C string where the length has already been computed. Copying into an
existing string from another string does not change the string's allocator, but
its contents. As it involves allocation, it is fallible.
3. String Mutation
What's a string in C or C++ without mutation?
3.1. Letter By Letter
Simple manipulation is pretty straightforward.
FudStatus pushBack(char letter); FudStatus pushBack(utf8 letter); FudStatus pushBack(const Utf8& letter); Option<utf8> pop();
There are three ways to
pushBack
- with a
char
, a
utf8
, and
Utf8.
Utf8
is distinct from
utf8
in that it may represent an
ascii
character, or a 2,
3, or 4 byte unicode code point. All of these simply append the letter to the
back of the string.
pop()
returns the last utf8 value, which may invalidate
the validity of the last unicode code point.