# text_utils

String utilities for text processing and management in the okcourse package.

This module provides a collection of functions for processing and managing text, including checking for and downloading an NLTK tokenizer, splitting text into chunks, sanitizing filenames, formatting durations, and swapping out words that LLMs tend to overuse.

Examples of usage include:

Checking for and downloading an NLTK tokenizer:

```python
from text_utils import tokenizer_available, download_tokenizer

if not tokenizer_available():
    download_tokenizer()
```

Splitting text into manageable chunks:

```python
from text_utils import split_text_into_chunks

text = "Your long text here..."
chunks = split_text_into_chunks(text, max_chunk_size=1024)
```

Sanitizing a name for use as a filename:

```python
from text_utils import sanitize_filename

safe_name = sanitize_filename("My Unsafe Filename.txt")
```

Formatting a duration in seconds to a human-readable format:

```python
from text_utils import get_duration_string_from_seconds

duration = get_duration_string_from_seconds(3661)  # "1:01:01"
```

Replacing overused LLM words with simpler alternatives:

```python
from text_utils import swap_words, LLM_SMELLS

updated_text = swap_words("In this course we delve into...", LLM_SMELLS)
```

## Functions

| Name | Description |
| --- | --- |
| `download_tokenizer` | Downloads the NLTK 'punkt_tab' tokenizer. |
| `get_duration_string_from_seconds` | Formats a given number of seconds into H:MM:SS or M:SS format. |
| `sanitize_filename` | Returns a filesystem-safe version of the given string. |
| `split_text_into_chunks` | Splits text into chunks of approximately `max_chunk_size` characters, preserving sentence boundaries. |
| `swap_words` | Replaces words in text based on a dictionary of replacements. |
| `tokenizer_available` | Checks if the NLTK 'punkt_tab' tokenizer is available on the system. |

## Attributes

| Name | Type | Description |
| --- | --- | --- |
| `LLM_SMELLS` | `dict[str, str]` | Dictionary mapping words overused by some large language models to their simplified 'everyday' forms. |

### LLM_SMELLS (module attribute)

```python
LLM_SMELLS: dict[str, str] = {
    "delve": "dig",
    "delved": "dug",
    "delves": "digs",
    "delving": "digging",
    "utilize": "use",
    "utilized": "used",
    "utilizing": "using",
    "utilization": "usage",
    "meticulous": "careful",
    "meticulously": "carefully",
    "crucial": "critical",
}
```

Dictionary mapping words overused by some large language models to their simplified 'everyday' forms.

Words in the keys may be replaced by their simplified forms in generated lecture text to help reduce "LLM smell."

This dictionary is appropriate for use as the `replacements` parameter of the `swap_words` function.

### download_tokenizer

```python
download_tokenizer() -> bool
```

Downloads the NLTK 'punkt_tab' tokenizer.

Returns:

| Type | Description |
| --- | --- |
| `bool` | True if the tokenizer was downloaded. |

### get_duration_string_from_seconds

```python
get_duration_string_from_seconds(seconds: float) -> str
```

Formats a given number of seconds into H:MM:SS or M:SS format.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `seconds` | `float` | The number of seconds. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `str` | A string formatted as H:MM:SS if hours > 0, otherwise M:SS. |
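The H:MM:SS / M:SS rule lends itself to `divmod`. The sketch below is a hypothetical reimplementation for illustration (the name `format_duration` is invented here), not the package's actual code:

```python
def format_duration(seconds: float) -> str:
    """Hypothetical sketch of the H:MM:SS / M:SS rule described above."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)  # Whole hours, leftover seconds
    minutes, secs = divmod(remainder, 60)   # Whole minutes, leftover seconds
    if hours > 0:
        return f"{hours}:{minutes:02d}:{secs:02d}"  # H:MM:SS
    return f"{minutes}:{secs:02d}"                  # M:SS

print(format_duration(3661))  # 1:01:01
print(format_duration(61))    # 1:01
```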

### sanitize_filename

```python
sanitize_filename(name: str) -> str
```

Returns a filesystem-safe version of the given string.

- Strips leading and trailing whitespace
- Replaces spaces with underscores
- Removes non-alphanumeric characters except for underscores and hyphens
- Transforms to lowercase

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `name` | `str` | The string to sanitize. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `str` | A sanitized string suitable for filenames. |
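The four steps above can be sketched in a few lines. `sanitize` below is a hypothetical reimplementation of the documented behavior, not the package's actual code:

```python
import re

def sanitize(name: str) -> str:
    # Hypothetical sketch of the documented steps: strip whitespace,
    # swap spaces for underscores, drop everything except alphanumerics,
    # underscores, and hyphens, then lowercase.
    cleaned = name.strip().replace(" ", "_")
    cleaned = re.sub(r"[^A-Za-z0-9_-]", "", cleaned)
    return cleaned.lower()

print(sanitize("My Unsafe Filename.txt"))  # my_unsafe_filenametxt
```

Note that under these rules a file extension's dot is removed along with other punctuation, so callers that need to keep an extension would sanitize the stem and extension separately.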

### split_text_into_chunks

```python
split_text_into_chunks(text: str, max_chunk_size: int = 4096) -> list[str]
```

Splits text into chunks of approximately `max_chunk_size` characters, preserving sentence boundaries.

If a sentence exceeds `max_chunk_size`, a `ValueError` is raised.

Typical use of this function is to split a long piece of text into chunks that are each just under the character limit a TTS model will accept for converting to audio. For example, OpenAI's `tts-1` model has a limit of 4096 characters.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `text` | `str` | The text to split. | *required* |
| `max_chunk_size` | `int` | The maximum number of characters in each chunk. | `4096` |

Returns:

| Type | Description |
| --- | --- |
| `list[str]` | A list of text chunks, each no longer than `max_chunk_size` characters. |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If `max_chunk_size < 1` or if a sentence exceeds `max_chunk_size`. |
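The greedy sentence-packing behavior described above can be sketched as follows. This is a hypothetical reimplementation in which a simple regex split stands in for the NLTK 'punkt_tab' tokenizer the module actually uses:

```python
import re

def split_into_chunks(text: str, max_chunk_size: int = 4096) -> list[str]:
    """Hypothetical sketch: pack whole sentences into size-limited chunks."""
    if max_chunk_size < 1:
        raise ValueError("max_chunk_size must be at least 1")
    # Regex sentence split stands in for NLTK's punkt_tab tokenizer here.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if len(sentence) > max_chunk_size:
            raise ValueError("A sentence exceeds max_chunk_size")
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chunk_size:
            current = candidate      # Sentence fits; keep packing this chunk
        else:
            chunks.append(current)   # Chunk is full; start a new one
            current = sentence
    if current:
        chunks.append(current)
    return chunks

print(split_into_chunks("One. Two. Three.", max_chunk_size=10))
# ['One. Two.', 'Three.']
```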

### swap_words

```python
swap_words(text: str, replacements: dict[str, str]) -> str
```

Replaces words in text based on a dictionary of replacements.

Preserves the case of the original word: uppercase, title case, or lowercase.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `text` | `str` | The text within which to perform word swaps. | *required* |
| `replacements` | `dict[str, str]` | A dictionary whose keys are the words to replace and whose values are their replacements. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `str` | The updated text with words replaced as specified. |
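The case-preserving replacement described above might look like this regex-based sketch. It is a hypothetical implementation (assuming lowercase dictionary keys, as in `LLM_SMELLS`), not the package's actual code:

```python
import re

def swap(text: str, replacements: dict[str, str]) -> str:
    """Hypothetical sketch of whole-word, case-preserving replacement."""
    # One case-insensitive pattern matching any replacement key as a whole word.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(word) for word in replacements) + r")\b",
        re.IGNORECASE,
    )

    def _replace(match: re.Match) -> str:
        original = match.group(0)
        replacement = replacements[original.lower()]
        if original.isupper():     # DELVE -> DIG
            return replacement.upper()
        if original.istitle():     # Delve -> Dig
            return replacement.title()
        return replacement         # delve -> dig

    return pattern.sub(_replace, text)

smells = {"delve": "dig", "utilize": "use"}  # Subset of LLM_SMELLS for illustration
print(swap("In this course we delve into...", smells))
# In this course we dig into...
```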

### tokenizer_available

```python
tokenizer_available() -> bool
```

Checks if the NLTK 'punkt_tab' tokenizer is available on the system.

Returns:

| Type | Description |
| --- | --- |
| `bool` | True if the tokenizer is available. |
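A check like this is commonly implemented with `nltk.data.find`, which raises `LookupError` when a resource is not installed. The sketch below shows that common approach; it is an assumption about how the check works, not the package's verified code:

```python
import nltk

def punkt_tab_available() -> bool:
    """Hypothetical sketch of a punkt_tab availability check."""
    try:
        # nltk.data.find() raises LookupError if the resource is missing.
        nltk.data.find("tokenizers/punkt_tab")
        return True
    except LookupError:
        return False

print(punkt_tab_available())
```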