text_utils

String utilities for text processing and management in the okcourse package.

This module provides functions for processing strings and managing text: checking for and downloading the NLTK tokenizer, splitting text into chunks, sanitizing filenames, formatting durations, and swapping out words that large language models tend to overuse.

Examples of usage include:

Checking and downloading an NLTK tokenizer:

from text_utils import tokenizer_available, download_tokenizer

if not tokenizer_available():
    download_tokenizer()

Splitting text into manageable chunks:

from text_utils import split_text_into_chunks

text = "Your long text here..."
chunks = split_text_into_chunks(text, max_chunk_size=1024)

Sanitizing a name for use as a filename:

from text_utils import sanitize_filename

safe_name = sanitize_filename("My Unsafe Filename.txt")

Formatting a duration in seconds to a human-readable format:

from text_utils import get_duration_string_from_seconds

duration = get_duration_string_from_seconds(3661)  # "1:01:01"

Replacing overused LLM words with simpler alternatives:

from text_utils import swap_words, LLM_SMELLS

updated_text = swap_words("In this course we delve into...", LLM_SMELLS)

Functions:

Name	Description
`download_tokenizer`	Downloads the NLTK 'punkt_tab' tokenizer.
`get_duration_string_from_seconds`	Formats a given number of seconds into H:MM:SS or M:SS format.
`sanitize_filename`	Returns a filesystem-safe version of the given string.
`split_text_into_chunks`	Splits text into chunks of approximately `max_chunk_size` characters, preserving sentence boundaries.
`swap_words`	Replaces words in text based on a dictionary of replacements.
`tokenizer_available`	Checks if the NLTK 'punkt_tab' tokenizer is available on the system.

Attributes:

Name	Type	Description
`LLM_SMELLS`	`dict[str, str]`	Dictionary mapping words overused by some large language models to their simplified 'everyday' forms.

LLM_SMELLS `module-attribute`

LLM_SMELLS: dict[str, str] = {
    "delve": "dig",
    "delved": "dug",
    "delves": "digs",
    "delving": "digging",
    "utilize": "use",
    "utilized": "used",
    "utilizing": "using",
    "utilization": "usage",
    "meticulous": "careful",
    "meticulously": "carefully",
    "crucial": "critical",
}

Dictionary mapping words overused by some large language models to their simplified 'everyday' forms.

Words in the keys may be replaced by their simplified forms in generated lecture text to help reduce "LLM smell."

This dictionary is appropriate for use as the replacements parameter in the swap_words function.

download_tokenizer

download_tokenizer() -> bool

Downloads the NLTK 'punkt_tab' tokenizer.

Returns:

Type	Description
`bool`	True if the tokenizer was downloaded.

get_duration_string_from_seconds

get_duration_string_from_seconds(seconds: float) -> str

Formats a given number of seconds into H:MM:SS or M:SS format.

Parameters:

Name	Type	Description	Default
`seconds`	`float`	The number of seconds.	required

Returns:

Type	Description
`str`	A string formatted as H:MM:SS if hours > 0, otherwise M:SS.
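
The formatting rule above can be sketched with two `divmod` calls. This is a minimal illustration, not the okcourse implementation: the name `format_duration` is hypothetical, and truncating fractional seconds is an assumption the reference does not confirm.

```python
def format_duration(seconds: float) -> str:
    """Hypothetical stand-in: format seconds as H:MM:SS or M:SS."""
    # Assumption: fractional seconds are truncated rather than rounded.
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    if hours > 0:
        # Hours are not zero-padded; minutes and seconds are.
        return f"{hours}:{minutes:02}:{secs:02}"
    return f"{minutes}:{secs:02}"


print(format_duration(3661))  # 1:01:01
print(format_duration(75))    # 1:15
```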

sanitize_filename

sanitize_filename(name: str) -> str

Returns a filesystem-safe version of the given string.

Strips leading and trailing whitespace
Replaces spaces with underscores
Removes non-alphanumeric characters except for underscores and hyphens
Transforms to lowercase

Parameters:

Name	Type	Description	Default
`name`	`str`	The string to sanitize.	required

Returns:

Type	Description
`str`	A sanitized string suitable for filenames.
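
The four steps listed above might look roughly like the following. This is a sketch under assumptions: `sanitize_sketch` is a hypothetical name, and "alphanumeric" is read here as ASCII letters and digits, which the actual function may define differently.

```python
import re


def sanitize_sketch(name: str) -> str:
    """Hypothetical stand-in that applies the four documented steps in order."""
    name = name.strip()                          # 1. strip surrounding whitespace
    name = name.replace(" ", "_")                # 2. spaces become underscores
    name = re.sub(r"[^A-Za-z0-9_-]", "", name)   # 3. drop everything else (ASCII assumption)
    return name.lower()                          # 4. lowercase


print(sanitize_sketch("  My Unsafe Filename.txt "))  # my_unsafe_filenametxt
```

Under this reading the period is removed along with other punctuation, so a file extension passed in as part of the name would not survive; appending the extension after sanitizing avoids that.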

split_text_into_chunks

split_text_into_chunks(
    text: str, max_chunk_size: int = 4096
) -> list[str]

Splits text into chunks of approximately max_chunk_size characters, preserving sentence boundaries.

If a sentence exceeds max_chunk_size, a ValueError is raised.

A typical use is splitting a long piece of text into chunks that each fit within the character limit a text-to-speech (TTS) model accepts as input. For example, OpenAI's tts-1 model has a limit of 4096 characters.

Parameters:

Name	Type	Description	Default
`text`	`str`	The text to split.	required
`max_chunk_size`	`int`	The maximum number of characters in each chunk.	`4096`

Returns:

Type	Description
`list[str]`	A list of text chunks where each chunk contains at most `max_chunk_size` characters.

Raises:

Type	Description
`ValueError`	If `max_chunk_size` < 1 or if a sentence exceeds `max_chunk_size`.
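
The greedy packing described above can be sketched as follows. Two things here are assumptions rather than the module's behavior: `chunk_by_sentence` is a hypothetical name, and a naive regular expression stands in for the NLTK 'punkt_tab' sentence tokenizer so the example has no dependencies.

```python
import re


def chunk_by_sentence(text: str, max_chunk_size: int = 4096) -> list[str]:
    """Hypothetical stand-in: pack whole sentences into size-limited chunks."""
    if max_chunk_size < 1:
        raise ValueError("max_chunk_size must be at least 1")
    # Assumption: split after ., !, or ? followed by whitespace (not NLTK).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if len(sentence) > max_chunk_size:
            raise ValueError("A sentence exceeds max_chunk_size")
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chunk_size:
            current = candidate      # sentence still fits in the open chunk
        else:
            chunks.append(current)   # close the chunk, start a new one
            current = sentence
    if current:
        chunks.append(current)
    return chunks


print(chunk_by_sentence("One two. Three four. Five six.", max_chunk_size=20))
# ['One two. Three four.', 'Five six.']
```

Because sentences are never split, a chunk can be noticeably shorter than the limit when the next sentence would not fit.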

swap_words

swap_words(text: str, replacements: dict[str, str]) -> str

Replaces words in text based on a dictionary of replacements.

Preserves the case of the original word: uppercase, title case, or lowercase.

Parameters:

Name	Type	Description	Default
`text`	`str`	The text within which to perform word swaps.	required
`replacements`	`dict[str, str]`	A dictionary whose keys are the words to replace with their corresponding values.	required

Returns:

Type	Description
`str`	The updated text with words replaced as specified.
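
One way to get case-preserving, whole-word replacement is a single case-insensitive regular expression with a callback. The sketch below is illustrative only: `swap_words_sketch` is a hypothetical name, and matching on word boundaries is an assumption about how "words" are delimited.

```python
import re


def swap_words_sketch(text: str, replacements: dict[str, str]) -> str:
    """Hypothetical stand-in: replace whole words, keeping the original's case."""
    if not replacements:
        return text
    # Assumption: a "word" is delimited by regex word boundaries (\b).
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(word) for word in replacements) + r")\b",
        re.IGNORECASE,
    )
    lookup = {key.lower(): value for key, value in replacements.items()}

    def _swap(match: re.Match) -> str:
        word = match.group(0)
        new = lookup[word.lower()]
        if word.isupper():
            return new.upper()       # DELVE -> DIG
        if word.istitle():
            return new.capitalize()  # Delve -> Dig
        return new.lower()           # delve -> dig

    return pattern.sub(_swap, text)


print(swap_words_sketch("Delve into it. We DELVE and delve.", {"delve": "dig"}))
# Dig into it. We DIG and dig.
```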

tokenizer_available

tokenizer_available() -> bool

Checks if the NLTK 'punkt_tab' tokenizer is available on the system.

Returns:

Type	Description
`bool`	True if the tokenizer is available.
