text_utils
String utilities for text processing and management in the okcourse package.
This module provides a collection of functions for processing strings and managing text, including tokenizer checks and downloads, splitting text into chunks, sanitizing filenames, formatting durations, and swapping words to reduce LLM-specific word inflections.
Examples of usage include:
- Checking and downloading an NLTK tokenizer:
    from text_utils import tokenizer_available, download_tokenizer
    if not tokenizer_available():
        download_tokenizer()
- Splitting text into manageable chunks:
    from text_utils import split_text_into_chunks
    text = "Your long text here..."
    chunks = split_text_into_chunks(text, max_chunk_size=1024)
- Sanitizing a name for use as a filename:
- Formatting a duration in seconds to a human-readable format:
    from text_utils import get_duration_string_from_seconds
    duration = get_duration_string_from_seconds(3661)  # "1:01:01"
- Replacing overused LLM words with simpler alternatives:
    from text_utils import swap_words, LLM_SMELLS
    updated_text = swap_words("In this course we delve into...", LLM_SMELLS)
Functions:
Name | Description |
---|---|
download_tokenizer | Downloads the NLTK 'punkt_tab' tokenizer. |
get_duration_string_from_seconds | Formats a given number of seconds into H:MM:SS or M:SS format. |
sanitize_filename | Returns a filesystem-safe version of the given string. |
split_text_into_chunks | Splits text into chunks of approximately max_chunk_size characters, preserving sentence boundaries. |
swap_words | Replaces words in text based on a dictionary of replacements. |
tokenizer_available | Checks if the NLTK 'punkt_tab' tokenizer is available on the system. |
Attributes:
Name | Type | Description |
---|---|---|
LLM_SMELLS | dict[str, str] | Dictionary mapping words overused by some large language models to their simplified 'everyday' forms. |
LLM_SMELLS
module-attribute
LLM_SMELLS: dict[str, str] = {
"delve": "dig",
"delved": "dug",
"delves": "digs",
"delving": "digging",
"utilize": "use",
"utilized": "used",
"utilizing": "using",
"utilization": "usage",
"meticulous": "careful",
"meticulously": "carefully",
"crucial": "critical",
}
Dictionary mapping words overused by some large language models to their simplified 'everyday' forms.
Words in the keys may be replaced by their simplified forms in generated lecture text to help reduce "LLM smell."
This dictionary is appropriate for use as the replacements parameter in the swap_words function.
download_tokenizer
download_tokenizer() -> bool
Downloads the NLTK 'punkt_tab' tokenizer.
Returns:
Type | Description |
---|---|
bool | True if the tokenizer was downloaded. |
get_duration_string_from_seconds
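The function summary describes this function as formatting a number of seconds into H:MM:SS or M:SS format. A minimal sketch of that formatting, assuming integer seconds and that the hours field is dropped when it is zero; `format_duration_sketch` is a hypothetical name for illustration, not the module's implementation:

```python
def format_duration_sketch(seconds: int) -> str:
    """Format seconds as H:MM:SS, or M:SS when under one hour (illustrative only)."""
    # Split total seconds into minutes and leftover seconds, then minutes into hours.
    minutes, secs = divmod(int(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    # Assumption: durations under one hour omit the hours field.
    return f"{minutes}:{secs:02d}"


print(format_duration_sketch(3661))  # 1:01:01
print(format_duration_sketch(61))    # 1:01
```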
sanitize_filename
Returns a filesystem-safe version of the given string.
- Strips leading and trailing whitespace
- Replaces spaces with underscores
- Removes non-alphanumeric characters except for underscores and hyphens
- Transforms to lowercase
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | The string to sanitize. | required |
Returns:
Type | Description |
---|---|
str | A sanitized string suitable for filenames. |
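The four steps listed for this function can be sketched as follows. This is an illustration under assumptions rather than the module's code: `sanitize_filename_sketch` is a hypothetical name, and "alphanumeric" is taken to mean ASCII letters and digits, which the actual function may treat differently.

```python
import re


def sanitize_filename_sketch(name: str) -> str:
    """Apply the documented sanitizing steps in order (illustrative only)."""
    cleaned = name.strip()                            # strip leading/trailing whitespace
    cleaned = cleaned.replace(" ", "_")               # spaces become underscores
    cleaned = re.sub(r"[^A-Za-z0-9_-]", "", cleaned)  # assumption: ASCII alphanumerics only
    return cleaned.lower()                            # transform to lowercase


print(sanitize_filename_sketch("  My Course: Intro to AI!  "))  # my_course_intro_to_ai
```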
split_text_into_chunks
Splits text into chunks of approximately max_chunk_size characters, preserving sentence boundaries.
If a sentence exceeds max_chunk_size, a ValueError is raised.
Typical use of this function is to split a long piece of text into chunks that are each just under the character length limit a TTS model will accept for converting to audio. For example, OpenAI's tts-1 model has a limit of 4096 characters.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | The text to split. | required |
max_chunk_size | int | The maximum number of characters in each chunk. | 4096 |
Returns:
Type | Description |
---|---|
list[str] | A list of text chunks where each chunk is equal to or less than the max_chunk_size. |
Raises:
Type | Description |
---|---|
ValueError | If a sentence exceeds max_chunk_size. |
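The chunking behavior described above (pack whole sentences into chunks up to max_chunk_size, and raise ValueError for an oversized sentence) might look roughly like the sketch below. It is not the module's implementation: `split_into_chunks_sketch` is a hypothetical name, and a naive regular expression stands in for the NLTK 'punkt_tab' sentence tokenizer so the example runs without any downloads.

```python
import re


def split_into_chunks_sketch(text: str, max_chunk_size: int = 4096) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chunk_size characters."""
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    # The real module relies on an NLTK tokenizer instead of this regex.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if len(sentence) > max_chunk_size:
            raise ValueError(
                f"Sentence of {len(sentence)} characters exceeds max_chunk_size={max_chunk_size}"
            )
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chunk_size:
            current = candidate      # sentence still fits in the current chunk
        else:
            chunks.append(current)   # close the current chunk and start a new one
            current = sentence
    if current:
        chunks.append(current)
    return chunks


print(split_into_chunks_sketch("One two. Three four. Five six.", max_chunk_size=20))
# ['One two. Three four.', 'Five six.']
```

Because sentences are never split, a chunk can be considerably shorter than max_chunk_size when the next sentence would not fit.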
swap_words
Replaces words in text based on a dictionary of replacements.
Preserves the case of the original word: uppercase, title case, or lowercase.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text
|
str
|
The text within which to perform word swaps. |
required |
replacements
|
dict[str, str]
|
A dictionary whose keys are the words to replace with their corresponding values. |
required |
Returns:
Type | Description |
---|---|
str | The updated text with words replaced as specified. |
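One way case-preserving replacement could work is sketched below. This is an assumption-laden illustration, not the module's code: `swap_words_sketch` is a hypothetical name, matching is done on whole words regardless of case, and words in mixed case that is neither all uppercase nor title case receive the replacement unchanged.

```python
import re


def swap_words_sketch(text: str, replacements: dict[str, str]) -> str:
    """Replace whole words, mirroring the case of each matched word (illustrative only)."""
    if not replacements:
        return text
    # Build one case-insensitive pattern that matches any key as a whole word.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(word) for word in replacements) + r")\b",
        re.IGNORECASE,
    )
    lookup = {key.lower(): value for key, value in replacements.items()}

    def _swap(match: re.Match) -> str:
        word = match.group(0)
        new_word = lookup[word.lower()]
        if word.isupper():
            return new_word.upper()   # DELVE -> DIG
        if word.istitle():
            return new_word.title()   # Delve -> Dig
        return new_word               # delve -> dig

    return pattern.sub(_swap, text)


print(swap_words_sketch("In this course we delve into...", {"delve": "dig"}))
# In this course we dig into...
```

The word-boundary anchors keep a key such as "delve" from matching inside "delved", so each inflected form needs its own entry, as in LLM_SMELLS.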