Strings Deep Dive: Unicode, Indexing & Encoding

Swift · Chapter 11 of the Ultimate Swift Series · 30 min read · April 11, 2026 · Intermediate

In This Article

  1. Strings as Collections
  2. Grapheme Clusters: What a "Character" Really Is
  3. String Indexing: Why No Integer Subscripts
  4. Equality and Canonicalization
  5. Substrings: Efficient Slicing
  6. Raw Strings
  7. Character Properties
  8. Encoding: UTF-8 and UTF-16
  9. Exercises
  10. Key Points

In Chapter 4, you learned the basics of String: creating them, concatenating, and interpolating. But strings in Swift are far more sophisticated than in most languages. This chapter covers how characters are really stored, why you can't subscript with integers, how emoji work under the hood, and how encoding determines memory usage.

Strings as Collections

Strings in Swift are collections of Character values. This means you can iterate over them, count them, and use all the collection methods you learned about:

let greeting = "Hello"
for char in greeting {
    print(char) // H, e, l, l, o
}
greeting.count           // 5
greeting.isEmpty         // false
greeting.contains("ell") // true

So far, straightforward. But what exactly is a Character? The answer is more nuanced than you might expect.

Grapheme Clusters: What a "Character" Really Is

A Swift Character is not a single Unicode code point. It's a grapheme cluster — one or more code points that together represent a single visible symbol.

Consider the letter é (e with an acute accent). It can be represented two ways:

// Single code point: é (code point 233)
let cafe1 = "caf\u{00E9}"
// Two code points: e (101) + combining acute accent (769)
let cafe2 = "cafe\u{0301}"
cafe1.count // 4
cafe2.count // 4 (Swift sees the e + accent as ONE character)

Both strings have a count of 4 because Swift treats each grapheme cluster as a single Character, regardless of how many code points make it up.
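To see the difference between code points and characters directly, a short sketch like the following compares each string's count with the count of its unicodeScalars view. The names precomposed and decomposed are illustrative only.

```swift
// Two spellings of "café": precomposed é versus e + combining accent.
let precomposed = "caf\u{00E9}"
let decomposed = "cafe\u{0301}"

// Character counts agree, because each grapheme cluster is one Character.
print(precomposed.count, decomposed.count)

// Scalar counts differ, because the second string carries an extra code point.
print(precomposed.unicodeScalars.count, decomposed.unicodeScalars.count)
```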

This also applies to emoji. Many emoji are actually multiple code points combined:

let thumbsUp = "👍🏽" // 👍🏽 = thumbs up + skin tone modifier
thumbsUp.count      // 1 (it's one grapheme cluster)
let family = "👨‍👩‍👧"   // Family emoji: multiple code points joined
family.count        // 1

Why this matters

Because characters have variable sizes (1 to many code points, each of which may need 1 to 4 bytes), you can't jump to the nth character by simple math. This is why string.count takes O(n) time — Swift must walk through every character to count grapheme clusters. And it's why integer subscripts don't work.
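As a rough illustration of how far apart these sizes can be, the sketch below spells out a family emoji scalar by scalar and measures it three ways. The variable names are made up for this example.

```swift
// Family emoji spelled out scalar by scalar: man, ZWJ, woman, ZWJ, girl.
let familyEmoji = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}"

let characterCount = familyEmoji.count              // grapheme clusters
let scalarCount = familyEmoji.unicodeScalars.count  // code points
let byteCount = familyEmoji.utf8.count              // UTF-8 code units (bytes)

print(characterCount, scalarCount, byteCount)
```

One visible symbol here spans several code points and many bytes, which is why Swift cannot find the nth character with simple arithmetic.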

String Indexing: Why No Integer Subscripts

In most languages, string[3] gives you the 4th character. Swift deliberately doesn't support this because it would be misleading — it looks like O(1) but would actually be O(n).

Instead, Swift uses String.Index, a special opaque index type:

let name = "Swift"
// Get the first character
let first = name[name.startIndex] // "S"
// Get the last character
let lastIdx = name.index(before: name.endIndex)
let last = name[lastIdx] // "t"
// Get the character at offset 2
let thirdIdx = name.index(name.startIndex, offsetBy: 2)
let third = name[thirdIdx] // "i"

endIndex is past the end

endIndex points after the last character, not at it. To get the last character, use index(before: endIndex). Accessing string[string.endIndex] directly crashes with a fatal error.
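One way to avoid that crash is to bound the offset with index(_:offsetBy:limitedBy:). The helper below, character(at:in:), is a hypothetical convenience written for this chapter, not part of the standard library.

```swift
/// Hypothetical helper: returns the character at `offset`, or nil when the
/// offset is negative or lands at or past `endIndex`.
func character(at offset: Int, in text: String) -> Character? {
    guard offset >= 0,
          let index = text.index(text.startIndex, offsetBy: offset, limitedBy: text.endIndex),
          index < text.endIndex
    else { return nil }
    return text[index]
}

let language = "Swift"
print(character(at: 2, in: language) as Any) // the third character
print(character(at: 5, in: language) as Any) // offset 5 is endIndex, so nil
```

The helper still walks the string from the start, so each call is O(n); it only makes the out-of-range case safe.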

Equality and Canonicalization

Because the same visible character can be represented multiple ways (single code point vs. combining characters), Swift normalizes both strings before comparing. This process is called canonicalization.

let cafe1 = "caf\u{00E9}"  // Single é
let cafe2 = "cafe\u{0301}" // e + combining accent
cafe1 == cafe2             // true (Swift canonicalizes before comparing)

Many languages compare the raw code units and would say these are different strings. Swift says they're equal because they are canonically equivalent: they look the same to a human. This is one of Swift's most thoughtful design decisions.
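The sketch below checks the comparison from three angles: string equality, scalar-by-scalar equality, and behavior in a Set. The variable names are illustrative.

```swift
let composed = "caf\u{00E9}"
let combining = "cafe\u{0301}"

// Equal as strings, although the underlying scalars differ.
let sameString = composed == combining
let sameScalars = composed.unicodeScalars.elementsEqual(combining.unicodeScalars)

// Hashing is consistent with ==, so a Set keeps only one of the two.
let unique = Set([composed, combining])

print(sameString, sameScalars, unique.count)
```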

Substrings: Efficient Slicing

You can slice strings using ranges of String.Index:

let fullName = "Matt Galloway"
let spaceIdx = fullName.firstIndex(of: " ")!
// One-sided ranges: Swift infers the missing start or end
let firstName = fullName[..<spaceIdx] // "Matt"
let lastName = fullName[fullName.index(after: spaceIdx)...] // "Galloway"

The result type is Substring, not String. This is a deliberate optimization: a Substring shares storage with its parent string, so slicing does not copy any characters.

When you need an independent String (for long-term storage or passing to APIs), convert explicitly. A Substring keeps its parent's entire storage alive, so holding on to a small slice can retain a much larger string:

let firstNameString = String(firstName) // Now it's an independent String copy
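If a function only reads text, one option is to make it generic over StringProtocol so callers can pass a slice without converting first. The shout function below is a hypothetical example of that pattern.

```swift
/// Hypothetical helper that accepts String and Substring alike.
func shout<S: StringProtocol>(_ text: S) -> String {
    text.uppercased() + "!"
}

let title = "Ada Lovelace"
let space = title.firstIndex(of: " ")! // force-unwrap is safe for this literal
let given = title[..<space]            // Substring sharing storage with `title`

print(shout(given))    // works on the slice directly
print(shout(title))    // and on the full String
print(type(of: given)) // Substring
```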

Raw Strings

Sometimes you need strings with lots of backslashes or quotes — regular expressions, file paths, ASCII art. Wrapping a string in # makes it raw, disabling escape sequences and interpolation:

let raw = #"No escaping here: \n \t \(nope)"#
print(raw) // Prints literally: No escaping here: \n \t \(nope)
// Use \# for interpolation inside raw strings
let name = "Swift"
let raw2 = #"Hello, \#(name)!"# // "Hello, Swift!"

You can use multiple # symbols if your string itself contains #:

let raw3 = ##"She said "# is the number sign""##
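As a small comparison, the sketch below writes the same backslash-heavy text with and without raw-string delimiters. The path is made up for illustration.

```swift
// The same Windows-style path, written with and without raw-string delimiters.
let escapedPath = "C:\\Users\\swift\\notes.txt"
let rawPath = #"C:\Users\swift\notes.txt"#

// Inside a raw string, an escape needs the matching number of # signs.
let twoLines = #"first\#nsecond"#

print(escapedPath == rawPath)
print(twoLines)
```

Both spellings produce identical strings; the raw form just avoids doubling every backslash.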

Character Properties

The Character type has built-in properties for inspecting what kind of character it is:

let x: Character = "x"
x.isASCII      // true
x.isLetter     // true
x.isNumber     // false
x.isUppercase  // false
x.isWhitespace // false
let five: Character = "5"
five.isHexDigit       // true
five.wholeNumberValue // Optional(5)
// Works with non-Latin characters too!
let thai: Character = "\u{0E59}" // Thai digit nine: ๙
thai.wholeNumberValue // Optional(9)

These properties are invaluable when parsing or validating text.
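Here is a minimal sketch of that kind of parsing and validation. Both helpers, digitSum(of:) and isASCIIWord(_:), are hypothetical names introduced for this example.

```swift
/// Hypothetical helper: sums every character that represents a whole number.
func digitSum(of text: String) -> Int {
    text.compactMap { $0.wholeNumberValue }.reduce(0, +)
}

/// Hypothetical helper: true when the text is non-empty and all ASCII letters.
func isASCIIWord(_ text: String) -> Bool {
    !text.isEmpty && text.allSatisfy { $0.isASCII && $0.isLetter }
}

// \u{0E53} is the Thai digit three, so it counts toward the sum.
print(digitSum(of: "Room 42, floor \u{0E53}"))
print(isASCIIWord("Swift"), isASCIIWord("Swift 6"))
```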

Encoding: UTF-8 and UTF-16

At the hardware level, strings are stored as sequences of bytes. The encoding determines how code points map to fixed-size code units: 8 bits each in UTF-8, 16 bits each in UTF-16.

UTF-8: Swift's internal encoding

UTF-8 uses 8-bit code units, and each code point takes 1 to 4 of them (1 to 4 bytes). ASCII characters take a single byte.

You can inspect UTF-8 code units through the utf8 view:

let text = "A½✓" // A, ½, ✓
for byte in text.utf8 {
    print(byte, terminator: " ")
}
// 65 194 189 226 156 147
// A=1 byte, ½=2 bytes, ✓=3 bytes
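To connect bytes back to characters, the sketch below counts the UTF-8 bytes behind each character and then decodes the raw bytes into a string again with String(decoding:as:). The variable names are illustrative.

```swift
let symbols = "A\u{00BD}\u{2713}" // "A½✓"

// Bytes used by each character on its own.
for character in symbols {
    print(character, String(character).utf8.count)
}

// Round trip: copy the bytes out, then decode them back into a String.
let bytes = Array(symbols.utf8)
let decoded = String(decoding: bytes, as: UTF8.self)
print(bytes.count, decoded == symbols)
```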

UTF-16: used by some systems

UTF-16 uses 16-bit code units. Most characters in everyday text fit in one code unit (2 bytes), but code points above U+FFFF, which include most emoji and many rare characters, need two code units (a surrogate pair). You can inspect these through the utf16 view:

for unit in "🙃".utf16 {
    print(String(unit, radix: 16))
}
// d83d de43 (surrogate pair for the upside-down face emoji)
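The surrogate pair can also be derived by hand. The sketch below applies the standard UTF-16 arithmetic to U+1F643 and checks the result against the utf16 view. The names offset, high, and low are illustrative.

```swift
// U+1F643 is the upside-down face; it lies above U+FFFF.
let scalar: Unicode.Scalar = "\u{1F643}"

// Standard UTF-16 arithmetic: subtract 0x10000, then split the 20 bits.
let offset = scalar.value - 0x10000
let high = UInt16(0xD800 + (offset >> 10))  // top 10 bits
let low = UInt16(0xDC00 + (offset & 0x3FF)) // bottom 10 bits

print(String(high, radix: 16), String(low, radix: 16))
print([high, low] == Array(String(Character(scalar)).utf16))
```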
Swift is encoding-agnostic

Swift stores strings as UTF-8 internally for the best balance of memory and performance. But the String API works at the grapheme cluster level, hiding encoding details. You only touch encoding through the utf8, utf16, and unicodeScalars views when you need to. This is one of the reasons Swift handles Unicode more correctly than most languages.
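The unicodeScalars view is the one this chapter has not yet shown on its own. As a sketch, the hypothetical helper scalarLabels(of:) below lists each scalar in a string in the usual U+XXXX notation.

```swift
/// Hypothetical helper: formats each Unicode scalar as "U+XXXX".
func scalarLabels(of text: String) -> [String] {
    text.unicodeScalars.map { scalar in
        let hex = String(scalar.value, radix: 16, uppercase: true)
        // Pad short values with zeros so every label has at least 4 digits.
        let padding = String(repeating: "0", count: max(0, 4 - hex.count))
        return "U+" + padding + hex
    }
}

// One Character (é written as e + combining accent) yields two scalars.
print(scalarLabels(of: "e\u{0301}"))
```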

Exercises

Try These in Your Playground

  1. Create a string with your name. Use index(_:offsetBy:) to extract the 3rd character.
  2. Iterate over your name and print the Unicode scalar values for each character using char.unicodeScalars.
  3. Create "caf\u{00E9}" and "cafe\u{0301}". Verify they're equal with == but have different unicodeScalars.count.
  4. Split a full name string (e.g., "Ada Lovelace") into first and last name using firstIndex(of: " ") and one-sided ranges. Convert both substrings to String.
  5. Write a function characterCount(in text: String) -> [Character: Int] that counts occurrences of each character.
  6. Iterate over the utf8 view of the string "Hello 🌍" and count the bytes. Then do the same with utf16. Compare the results.
  7. Challenge: Write a function that reverses each word in a sentence without using split. For "My dog is cute" return "yM god si etuc".

Key Points

What You Learned

This completes the deep dive into strings. In the next chapter, we begin Section III: Building Your Own Types, starting with Structs — Swift's primary value type for modeling data.
