How to tokenize a string in C++

Tokenization breaks a string into smaller parts, like splitting a sentence into words, so you can work with each part individually.

What is a token in C++?

A token in C++ refers to a substring or component extracted from a larger string, separated by specific delimiters. A delimiter is a character or sequence of characters used to separate tokens in a string, such as a space, comma, or semicolon. Tokens could be words, numbers, or any defined segments based on the delimiter. Have you ever needed to extract individual words from a sentence? That’s tokenization in action! For example, in the string "C++ is fun", the tokens are "C++", "is", and "fun" if the delimiter is a space.
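To make the example concrete, here is a minimal sketch that extracts space-separated tokens with std::istringstream. The helper name splitOnWhitespace is our own choice for illustration, not a standard function.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: collects the whitespace-separated tokens of text.
std::vector<std::string> splitOnWhitespace(const std::string& text) {
    std::istringstream stream(text);
    std::vector<std::string> tokens;
    std::string token;
    // operator>> skips any whitespace before each token it reads.
    while (stream >> token) {
        tokens.push_back(token);
    }
    return tokens;
}
```

Calling splitOnWhitespace("C++ is fun") yields the three tokens "C++", "is", and "fun".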

Why do we need tokenization in C++?

Tokenization is useful for parsing and processing data efficiently. Here are some common use cases:

  1. Processing input: Breaking down user input into meaningful segments.

  2. Data analysis: Parsing CSV or log files to extract data fields.

  3. Text manipulation: Splitting sentences into words for linguistic processing.

  4. Command handling: Extracting arguments and commands from input strings in console applications.

Now that we understand why tokenization is useful, let’s look at how it works in C++.

Steps to tokenize a string in C++

You can tokenize a string with a stringstream object or the built-in strtok() function. Before relying on either, it helps to understand the logic that a hand-written tokenizer follows.

Follow the steps below to tokenize a string:

  • Read the complete string.

  • Select the delimiter, that is, the character at which you want to split the string. In this example, we will split the string at every space.

  • Iterate over all the characters of the string and check when the delimiter is found.

  • If the delimiter is found, push the word collected so far to a vector.

  • Repeat this process until you have traversed the complete string.

  • The last token is not followed by a space, so it needs to be pushed to the vector after the loop.

  • Finally, return the vector of tokens.
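The steps above can be sketched as follows. This is one possible way to write them, not the only one; the function name tokenize and its default space delimiter are assumptions made for this example.

```cpp
#include <string>
#include <vector>

// Hypothetical helper that follows the steps listed above.
std::vector<std::string> tokenize(const std::string& text, char delimiter = ' ') {
    std::vector<std::string> tokens;
    std::string current;  // characters of the token being built

    for (char ch : text) {
        if (ch == delimiter) {
            tokens.push_back(current);  // delimiter found: store the word so far
            current.clear();
        } else {
            current += ch;
        }
    }

    tokens.push_back(current);  // the last token has no delimiter after it
    return tokens;
}
```

Note that this sketch keeps empty tokens: two spaces in a row produce an empty string between them. If you do not want them, check that current is not empty before each push_back.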

Note: strtok modifies the input string by replacing delimiters with null characters, so it is better suited for temporary strings or copies rather than original data.
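A small sketch can make this side effect visible. The helper below (rawBytesAfterStrtok is a name invented for this example) runs std::strtok over a copy of the input and returns the copy's raw bytes, so you can inspect where the null characters were written.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: tokenizes a copy of text and returns the copy's bytes,
// including any '\0' characters that std::strtok wrote into it.
std::string rawBytesAfterStrtok(const std::string& text, const char* delimiters) {
    std::vector<char> buffer(text.begin(), text.end());
    buffer.push_back('\0');  // std::strtok needs a null-terminated buffer

    // Walk through every token; only the side effect on buffer matters here.
    char* token = std::strtok(buffer.data(), delimiters);
    while (token != nullptr) {
        token = std::strtok(nullptr, delimiters);
    }

    // Build the result with an explicit length so embedded '\0' bytes are kept.
    return std::string(buffer.data(), text.size());
}
```

For the input "red,green,blue" with "," as the delimiter, both commas in the copy are replaced by null characters, so printing the buffer as a C-style string would show only "red". The original std::string is untouched because the function works on a copy.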

Now let’s look at the code for clarity:

Coding example

Here’s a simple example demonstrating string tokenization using std::strtok:

    #include <iostream>
    #include <cstring> // For std::strtok
    #include <string>
    #include <vector>

    // Prints each token of str, splitting on any character in delimiter.
    void tokenizeString(const std::string& str, const char* delimiter) {
        // std::strtok writes into its input, so work on a mutable,
        // null-terminated copy that is sized to fit the whole string.
        std::vector<char> tempStr(str.begin(), str.end());
        tempStr.push_back('\0');

        char* token = std::strtok(tempStr.data(), delimiter);
        while (token != nullptr) {
            std::cout << "Token: " << token << std::endl;
            token = std::strtok(nullptr, delimiter); // Get the next token
        }
    }

    int main() {
        std::string input = "C++ programming; tokenization, example; ; ";
        const char* delimiter = ", "; // Delimiters are comma and space

        std::cout << "Original String: " << input << std::endl;
        std::cout << "Tokens:" << std::endl;
        tokenizeString(input, delimiter);

        return 0;
    }

Code explanation

Lines 1–4: Include the necessary libraries: iostream for output, cstring for std::strtok, string for std::string, and vector for the mutable buffer.

Line 7: Define the tokenizeString function, which takes a string and a set of delimiter characters for tokenizing the input string.

Lines 10–11: Copy the content of the input string str into the character buffer tempStr and append a null terminator. Sizing the buffer from the string avoids overflowing a fixed-size array when the input is long.

Line 13: Tokenize tempStr using std::strtok and the provided delimiter. The first token is returned and stored in token.

Lines 14–17: Loop through the tokens while token is not null. Print each token to the console. Inside the loop, get the next token by calling std::strtok again with nullptr and the same delimiter.

Lines 20–22: Start the main function, define the input string, and set the delimiters to comma and space.

Line 26: Call tokenizeString to tokenize the input string and print the tokens.

Modify the above code to handle multiple delimiters, such as a comma (,), semicolon (;), and space ( ). Make sure no empty tokens are printed.
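With std::strtok, widening the delimiter string to ",; " is enough, because std::strtok already skips runs of delimiters. As an alternative that does not modify or copy the input into a char array, here is a sketch built on std::string::find_first_of and find_first_not_of. The helper name splitOnAny and its default delimiter set are assumptions made for this example.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: splits text on any character in delimiters
// and never produces empty tokens.
std::vector<std::string> splitOnAny(const std::string& text,
                                    const std::string& delimiters = ",; ") {
    std::vector<std::string> tokens;

    // Jump to the first character that is not a delimiter.
    std::size_t start = text.find_first_not_of(delimiters);
    while (start != std::string::npos) {
        // The token ends at the next delimiter, or at the end of the string.
        std::size_t end = text.find_first_of(delimiters, start);
        tokens.push_back(text.substr(start, end - start));
        // Skip the whole run of delimiters that follows the token.
        start = text.find_first_not_of(delimiters, end);
    }
    return tokens;
}
```

For the input string used in the example above, this returns "C++", "programming", "tokenization", and "example".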

Common methods for tokenization in C++

Here’s a table summarizing common methods for tokenization in C++:

| Method name | Description | Use cases |
|---|---|---|
| std::strtok | Splits a C-style string into tokens based on specified delimiters. | Suitable for simple tokenization tasks with temporary or copied strings. Not thread-safe. |
| std::getline | Extracts tokens from a stream until a specified delimiter is encountered. | Ideal for reading tokens from input streams like files or std::cin. |
| std::istringstream | Treats a string as a stream, allowing easy token extraction using the >> operator. | Useful for tokenizing strings with well-defined formats, such as space-separated values. |
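As an illustration of the std::getline approach, the sketch below wraps a string in a std::istringstream and reads comma-separated fields from it. The helper name splitFields is hypothetical, and the comma default is just one choice of delimiter.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits one line into fields at each delimiter.
std::vector<std::string> splitFields(const std::string& line, char delimiter = ',') {
    std::istringstream stream(line);
    std::vector<std::string> fields;
    std::string field;
    // std::getline reads up to the next delimiter and discards the delimiter.
    while (std::getline(stream, field, delimiter)) {
        fields.push_back(field);
    }
    return fields;
}
```

Unlike std::strtok, this approach keeps empty fields between consecutive delimiters: splitFields("id,name,,score") returns four fields, the third of which is empty. That is usually what you want for CSV-like data, where an empty field still occupies a column.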

Key takeaways

  1. Tokenization helps break down strings into meaningful segments for easy processing.

  2. In C++, you can use methods like std::strtok, std::istringstream, or std::getline for tokenization.

  3. Always consider edge cases like multiple consecutive delimiters or empty strings while tokenizing.

  4. Tokenization is a fundamental skill useful in data parsing, text analysis, and many real-world applications.


Frequently asked questions



What is a tokenize string function in C++?

A tokenize string function in C++ is a custom function or implementation that splits a string into smaller parts called “tokens” based on specified delimiters (e.g., commas, spaces). It is commonly used to process text data, extract meaningful parts, or handle formatted inputs.


How do you tokenize a string?

To tokenize a string in C++, you can use the following steps:

  • Include the <cstring> library to use std::strtok.
  • Convert the string into a mutable char array.
  • Use std::strtok with a set of delimiters to split the string.
  • Iterate over the tokens returned by std::strtok until it returns nullptr.

What is strtok() in C++?

std::strtok is a function from the C standard library (available in <cstring> in C++) used to split a string into tokens based on specified delimiters.

  • First call: Pass the string to tokenize and delimiters.
  • Subsequent calls: Pass nullptr as the string to continue tokenizing the same string.
  • Returns nullptr when there are no more tokens.

Is a string immutable in C++?

No, strings in C++ are mutable by default. The std::string class allows modification of its content using member functions like append, erase, replace, or direct access via indexing. However, a C-style string is mutable only when it is stored in a non-const character array (char[]). It cannot be modified through a const char* pointer, and string literals must never be modified.
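The short sketch below shows these operations changing a std::string in place. The function name mutateExample and the sample text are made up for illustration.

```cpp
#include <string>

// Hypothetical helper: applies several in-place edits to a std::string.
std::string mutateExample() {
    std::string s = "hello world";
    s[0] = 'H';              // direct access via indexing: "Hello world"
    s.append("!");           // "Hello world!"
    s.replace(6, 5, "C++");  // swap the 5 characters of "world": "Hello C++!"
    s.erase(9);              // drop everything from index 9 onward: "Hello C++"
    return s;
}
```

Each call modifies the same object, so no new string is created until the function returns its result.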

