FUD Strings

Table of Contents

Published on 28 October 2024.

Last modified 1 November 2024.

All snippets on this page are licensed under the Apache License, Version 2.0 and copyrighted by myself, Dominick Allen. Code may be reformatted and truncated from its original form; don't consider what's written on this page as the actual version, but for illustrative purposes.

If you haven't read my introduction to libfud (the library), I recommend you start there.

1. The String API and Design

This class takes much inspiration from std::string , but with the addition of user controlled allocation, and only specialized for UTF8, represented as unsigned characters.

  1. Strings are a contiguous sequence of utf8 values. utf8 is an alias for char8_t which has an underlying representation of unsigned char .
  2. Strings take a user-supplied allocator, defaulting to the globalFudAllocator if none is provided.
  3. No streaming operations are provided on strings.

Strings and all other container types in libfud take a user provided allocator.

1.1. The use of char8_t has a deep impact on the String interface.

First of all, why use char8_t? Tom Honermann, the author of the proposals for char8_t weighs in:

The motivation for a distinct type with these properties is:

  1. To provide a distinct type for UTF-8 character data vs character data with an encoding that is either locale dependent or that requires separate specification.
  2. To enable overloading for ordinary string literals vs UTF-8 string literals (since they may have different encodings).
  3. To ensure an unsigned type for UTF-8 data (whether char is signed or unsigned is implementation defined).
  4. To enable better performance via a non-aliasing type; optimizers can better optimize types that do not alias other types.

1.2. Building from the Basics

What makes a string useful? What is its purpose? Strings store messages and identifiers, and are a unit of transaction for input and output with the operating system, from identifying files to the basis of formatted printing messages on the console or GUI. For a String to be useful in a system dominated by the C conception of strings , it must be readily usable for those interfaces. What's more, in C and C++, so-called string literals are given special privilege in source files during compilation, so that is a reasonable starting point for creating a string from C strings, that can be passed to interfaces taking C strings.

constexpr size_t SsoBufSize = 23;

class String {
private:
    static constexpr uint8_t maxSmallStringLength = SsoBufSize;
    static constexpr uint8_t smallStringLengthMask = 0xFF;
    static constexpr auto isLargeMask = static_cast<uintptr_t>(0x01);
    static constexpr auto allocatorMask = ~isLargeMask;

    uintptr_t m_allocator{reinterpret_cast<uintptr_t>(&globalFudAllocator)};

    using BufType = Array<utf8, SsoBufSize>;
    union {
        struct {
            size_t capacity;
            size_t length;
            utf8* data;
        } large;
        struct {
            uint8_t length = 0;
            BufType buffer{};
        } small{};
    } m_repr{};

    [[nodiscard]] bool isLarge() const {
        return (m_allocator & isLargeMask) != 0;
    }

public:
    using StringResult = Result<String, FudStatus>;
    static StringResult makeFromCString(const char* cString);
    static StringResult makeFromCString(const char* cString,
                                        Allocator* allocator);

    [[nodiscard]] constexpr const utf8* data() const {
        return isLarge() ? m_repr.large.data : m_repr.small.buffer.data();
    }

    [[nodiscard]] inline const char* c_str() const {
        return reinterpret_cast<const char*>(data());
    }

    [[nodiscard]] inline StringView asView() const
    {
        return StringView(*this);
    }

    [[nodiscard]] Option<utf8> front();
    [[nodiscard]] Option<utf8> back();
};

Let's quickly go over the details here: String has a member m_allocator . The m_allocator member stores the value of a pointer to an Allocator object, and information to determine which member of the union member m_repr is valid. m_repr is union over a large struct with dynamic capacity, length, and a pointer to utf8 data, and a small struct with length and a statically sized buffer of size SsoBufSize. The isLarge method discriminates the union, checking the least significant bit of m_allocator . Allocator pointers must be aligned to at least 8 bits; indeed, the class declaration for Allocator forces maximum alignment. There are two static methods to create a String - both take an argument for a C string, one also takes an argument for an allocator. There are public methods to get the interior data as a constant pointer, either as const utf8* , or const char* .

1.3. Small String Optimization

The first thing to discuss of is the use of Small String Optimization technique, shortened as SSO. SSO allows sufficiently short strings to be directly stored in the String object itself, which greatly reduces the number of trips out to the heap. However, this optimization carries two penalties: more complex code for implementation and maintenance, and more instructions to determine capacity, length, and the actual location of the data being managed. The upshot is that the first penalty is justified by increased performance, and the second penalty doesn't matter that much - the CPU can compute much faster than it takes to access memory after a cache miss . The SSO techniques used by libstdc++ , libc++ , and MSVC outlined by Raymond Chen are interesting; mine is closest to libc++, but instead of storing the condition of being large within the union, libfud stores it in the least significant bit of its allocator. As a note, I could not replicate the libc++ without invoking undefined behavior - they get a free pass as an implementation of the standard library. Rules for thee, I suppose.

// private
static constexpr size_t maxStringLength = (static_cast<size_t>(1) << 63) - 1;
static constexpr uint8_t maxSmallStringLength = SsoBufSize;
static constexpr uint8_t smallStringLengthMask = 0xFF;
static constexpr auto isLargeMask = static_cast<uintptr_t>(0x01);
static constexpr auto allocatorMask = ~isLargeMask;

[[nodiscard]] static bool allocatorValid(Allocator* allocator)
{
    return (reinterpret_cast<uintptr_t>(allocator) & isLargeMask) == 0;
}

Allocator* allocator() const
{
    return reinterpret_cast<Allocator*>(m_allocator & allocatorMask);
}

[[nodiscard]] bool isLarge() const
{
    return (m_allocator & isLargeMask) != 0;
}

void setLarge()
{
    m_allocator |= isLargeMask;
}

void setSmall()
{
    m_allocator &= allocatorMask;
}

When creating a String , the allocator pointer must be checked with allocatorValid to ensure that it is correctly aligned. It is considered invalid if its last bit is not clear. Getting the allocator pointer with allocator() is just a matter of masking out the least significant bit before accessing it. setLarge and setSmall do the appropriate bitwise or-ing and masking to set and clear the large flag, which is checked with isLarge.

1.4. String Properties

It's useful to query the length, capacity, and emptiness of a string, and it's pretty important when implementing while using the SSO.

// public
[[nodiscard]] size_t length() const {
    if (isLarge()) {
        return m_repr.large.length;
    }
    return m_repr.small.length;
}

[[nodiscard]] bool empty() const {
    return length() == 0;
}

[[nodiscard]] size_t size() const {
    return length() + 1;
}

[[nodiscard]] size_t capacity() const {
    if (isLarge()) {
        return m_repr.large.capacity - 1U;
    }
    return SsoBufSize - 1U;
}

/** \brief Returns the remaining capacity for characters excluding the null
 * terminating byte. */
[[nodiscard]] size_t remainingCapacity() const
{
    if (length() > capacity()) {
        return 0;
    }

    return capacity() - length();
}

[[nodiscard]] bool utf8Valid() const;

These are all mostly straightforward, aside from the impacts of using SSO. However, there's an interesting one sitting here, utf8Valid . I'll touch on this again later.

1.5. Creating a String from a C String

Let's take a small diversion from the API to see how the sausage is made, taking a C string and turning it into a fud::String.

StringResult String::makeFromCString(const char* cString, Allocator* allocator)
{
    if (allocator == nullptr) {
        return StringResult::error(FudStatus::NullPointer);
    }

    if (!String::allocatorValid(allocator)) {
        return StringResult::error(FudStatus::ArgumentInvalid);
    }

    auto lenResult = cStringLength(cString);
    if (lenResult < 0 || lenResult >= std::numeric_limits<ssize_t>::max()) {
        return StringResult::error(FudStatus::ArgumentInvalid);
    }

    auto length = static_cast<size_t>(lenResult);
    if (length > maxStringLength) {
        return StringResult::error(FudStatus::ArgumentInvalid);
    }

    String output{};
    output.m_allocator = reinterpret_cast<uintptr_t>(allocator);

    utf8* data{nullptr};
    size_t capacity = length + 1;
    bool isLarge = capacity > SsoBufSize;
    if (isLarge) {
        auto status = output.makeLarge(outputCapacity, view.length(), data);
        if (status != FudStatus::Success) {
            return StringResult::error(status);
        }
    } else {
        output.makeSmall(outputCapacity, view.length(), data);
    }

    fudAssert(data != nullptr);

    auto copyStatus = copyMem(data, capacity, cString, length);
    fudAssert(copyStatus == FudStatus::Success);

    auto terminateStatus = output.nullTerminate();
    fudAssert(terminateStatus == FudStatus::Success);

    return StringResult::okay(std::move(output));
}

First, the allocator is checked for being null, then that it's valid (low bit clear). The next step can be pretty expensive, but is irreducible in cost: getting the length of a c string. Let's take a quick peek at its implementation:

// fud_c_string.hpp
constexpr ssize_t MAX_C_STRING_LENGTH = std::numeric_limits<ssize_t>::max() - 1;

constexpr ssize_t cStringLength(const char* str, size_t maxLength)
{
    if (str == nullptr || maxLength > MAX_C_STRING_LENGTH) {
        return -1;
    }

    ssize_t size = 0;

    while (str[size] != 0 && static_cast<size_t>(size) < maxLength) {
        size++;
    }

    if (str[size] != 0 && static_cast<size_t>(size) == maxLength) {
        return static_cast<ssize_t>(maxLength) + 1;
    }

    return size;
}

constexpr ssize_t cStringLength(const char* str)
{
    return cStringLength(str, MAX_C_STRING_LENGTH);
}

It doesn't look that special… but it's still better than the standard strlen for one simple reason: it's not undefined behavior if str is a null pointer. If str is a null pointer, the function returns -1. This limits the maximum size of a string to std::numeric_limits<ssize_t>::max() , unfortunately. I feel bad for all the people whose strings are larger than 9223 petabytes.

So, after validating the length of the string and casting it to size_t, it's time to use that length information to determine whether or not to store the string contents within the string itself, or in memory that can be provided by allocator.

String output{};
output.m_allocator = reinterpret_cast<uintptr_t>(allocator);

utf8* data{nullptr};
size_t capacity = length + 1;
bool isLarge = capacity > SsoBufSize;
if (isLarge) {
    auto status = output.makeLarge(outputCapacity, view.length(), data);
    if (status != FudStatus::Success) {
        return StringResult::error(status);
    }
} else {
    output.makeSmall(outputCapacity, view.length(), data);
}

The check to determine if the string is large is pretty simple - is the capacity of the internal buffer sufficient to hold the contents of the string plus a null terminating byte? If not, then it's time to put the allocator to work:

FudStatus String::makeLarge(size_t cap, size_t len, utf8*& outputData)
{
    m_repr.large.capacity = cap;
    m_repr.large.length = len;
    auto dataResult = allocator()->allocate(cap, 1);
    if (dataResult.isError()) {
        return dataResult.getError();
    }
    m_repr.large.data = reinterpret_cast<utf8*>(dataResult.getOkay());
    outputData = m_repr.large.data;
    setLarge();
    return FudStatus::Success;
}

Pretty straightforward stuff. A gentle reminder, the allocator() method is used to get the true value of the pointer to the allocator since the member variable performs double duty as the indication of the string being long or short. A pointer to memory for the string is requested with allocate(cap, 1) , which is a request for enough bytes to satisfy the capacity needed for the string, including the null terminator, aligned to 1 byte. The result is checked for validity, and assigned to the data pointer. The representation of the string is set to large, then the function returns, having set outputData to the data pointer obtained. For completeness sake, let's also look at makeSmall :

void String::makeSmall(size_t& cap, size_t len, utf8*& outputData)
{
    fudAssert(len < SsoBufSize);
    cap = SsoBufSize;
    static_assert(SsoBufSize < std::numeric_limits<int8_t>::max());
    m_repr.small.length = len & smallStringLengthMask;
    outputData = m_repr.small.buffer.data();
    setSmall();
}

In sharp contrast to the fallibility of makeLarge, makeSmall is a void function. It also takes the liberty of setting the cap argument to the size of the small string buffer, and settingout outputData to the address of the buffer's internal storage, before setting the representation of the string to be small.

In either path, having obtained the valid pointer to the string's data, the input str is copied to the string's buffer, then null terminated. Finally, the string is returned from the function in a result.

Now, back to the API again while you enjoy the sausage.

2. Copies, Moves, and Destruction

What would a discussion about the design of a C++ object be without invoking the rule of 5? Let's start with the default constructor.

/** \brief Default constructs a small string of zero length using the global
 * fud allocator. */
String() noexcept = default;

Surprise! The default constructor is defaulted! Bet you weren't expecting that, were you? It creates a zero length string using the default global allocator - which is guaranteed to be nonnull. Anything else becomes infallible, and makes it impossible to write a sound constructor.

String(const String& rhs) = delete;

Next, the copy constructor is deleted. When copying becomes involved and allocation is fallible, a String can not be copy constructed without signalling error, so the copy constructor can not be part of a safe system.

String(String&& rhs) noexcept;

In sharp contrast, however, the move constructor is not deleted. Because it takes ownership of the contents of rhs, no fallible operations happen. This also means that it takes the m_allocator that rhs was created with. Let's take a quick peek at the implementation to note an important detail:

String::String(String&& rhs) noexcept :
    m_allocator{rhs.m_allocator},
    m_repr{std::move(rhs.m_repr)}
{
    rhs.setSmall();
    rhs.m_repr.small.length = 0;
}

The contents of rhs are set such that it behaves as if it's been initialized with zero data and not allocated anything, regardless of if it has or not. This is done so that the destructor may behave correctly.

String::~String()
{
    cleanup();
}

void String::cleanup()
{
    const auto* allocPtr = allocator();
    if (isLarge() && m_repr.large.data != nullptr && allocPtr != nullptr) {
        allocator()->deallocate(
            reinterpret_cast<std::byte*>(m_repr.large.data),
            m_repr.large.capacity);
        static_cast<void>(deallocStatus);
        m_repr.large.data = nullptr;
    }
}

The destructor is factored out into a cleanup() method to allow it to be reused in functions with desructive semantics, such as the move assignment operator.

String& operator=(const String& rhs) = delete;

String& operator=(String&& rhs) noexcept;

String& String::operator=(String&& rhs) noexcept
{
    if (&rhs == this) {
        return *this;
    }

    cleanup();

    m_allocator = rhs.m_allocator;
    m_repr = std::move(rhs.m_repr);

    if (isLarge()) {
        rhs.m_repr.large.data = nullptr;
    }

    return *this;
}

The copy assignment operator is deleted because of fallible semantics. As you can see, in contrast, that move assignment is not fallible, but merely destructive. If you want copying, you have to explicitly use String::from with either another String or a StringView.

static StringResult from(
    const String& rhs,
    Option<Allocator*> allocatorOption = NullOpt);

static StringResult from(
    StringView view,
    Allocator* allocator = &globalFudAllocator);

FudStatus copy(const String& rhs);

Making a String from another String is fallible, and takes an optional allocator. If the option is unspecified (NullOpt), then the allocator from rhs is used. Making a String from a StringView is functionally the same as making one from a C string where the length has already been computed. Copying into an existing string from another string does not change the string's allocator, but its contents. As it involves allocation, it is fallible.

3. String Mutation

What's a string in C or C++ without mutation?

3.1. Letter By Letter

Simple manipulation is pretty straightforward.

FudStatus pushBack(char letter);

FudStatus pushBack(utf8 letter);

FudStatus pushBack(const Utf8& letter);

Option<utf8> pop();

There are three ways to pushBack - with a char , a utf8 , and Utf8. Utf8 is distinct from utf8 in that it may represent an ascii character, or a 2, 3, or 4 byte unicode code point. All of these simply append the letter to the back of the string. pop() returns the last utf8 value, which may invalidate the validity of the last unicode code point.

3.2. Appending a Sequence