tivars.tokenizer.encoder

module documentation

Context-aware text encoder

Function	`encode`	Encodes a string of tokens represented as text into a byte stream and its minimum supported OS version
Function	`normalize`	Applies NFC normalization to a given string to ensure recognition of certain Unicode characters used as token names

def encode(string: str, *, trie: TITokenTrie = None, mode: str = None, normalize: bool = True) -> tuple[bytes, OsVersion]:

Encodes a string of tokens represented as text into a byte stream and its minimum supported OS version

Tokenization is performed using one of three procedures, dictated by mode:

`max`: Always munch maximally, i.e. consume the most input possible to produce each token
`smart`: Munch maximally or minimally depending on context
`string`: Always munch minimally (equivalent to the string context of `smart`)
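
The difference between the three modes can be illustrated with a minimal, self-contained sketch. It does not use the library: the token table `TOKEN_NAMES` and the helpers `munch` and `tokenize` are hypothetical names invented for this example, and the only context modelled for `smart` is "inside a double-quoted string", which is an assumption based on the mode descriptions above.

```python
# Hypothetical toy token names; the real TI-84+CE token table is far larger.
TOKEN_NAMES = {"sin(", "s", "i", "n", "(", '"'}


def munch(text: str, maximal: bool) -> str:
    """Return the longest (maximal) or shortest (minimal) token name prefixing text."""
    matches = [name for name in TOKEN_NAMES if text.startswith(name)]
    if not matches:
        raise ValueError(f"no token matches {text[:1]!r}")
    return max(matches, key=len) if maximal else min(matches, key=len)


def tokenize(text: str, mode: str = "smart") -> list[str]:
    """Split text into token names using the 'max', 'smart', or 'string' mode."""
    tokens = []
    in_string = False  # assumption: each double quote toggles the string context
    while text:
        if mode == "max":
            maximal = True
        elif mode == "string":
            maximal = False
        elif mode == "smart":
            maximal = not in_string  # minimal inside strings, maximal elsewhere
        else:
            raise ValueError(f"unknown mode {mode!r}")
        token = munch(text, maximal)
        if token == '"':
            in_string = not in_string
        tokens.append(token)
        text = text[len(token):]
    return tokens


print(tokenize('sin("sin(', "max"))     # ['sin(', '"', 'sin(']
print(tokenize('sin("sin(', "smart"))   # ['sin(', '"', 's', 'i', 'n', '(']
print(tokenize('sin("sin(', "string"))  # ['s', 'i', 'n', '(', '"', 's', 'i', 'n', '(']
```

In this sketch, `smart` keeps `sin(` as a single token outside the string but splits it into individual characters inside the string, which is the behavior the mode descriptions suggest. The real encoder emits token bytes rather than token names.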

The smart tokenization mode uses the following contexts, munching maximally otherwise:

For reference, here are the tokenization modes utilized by popular IDEs and other software:

All tokenization modes respect token glyphs for substituting Unicode symbols.

Parameters
string:`str`	The text string to encode
trie:`TITokenTrie`	The `TITokenTrie` object to use for tokenization (defaults to the TI-84+CE trie)
mode:`str`	The tokenization mode to use (defaults to `smart`)
normalize:`bool`	Whether to apply NFC normalization to the input before encoding (defaults to `True`)

Returns
`tuple[bytes, OsVersion]`	A tuple of a stream of token bytes and a minimum `OsVersion`

def normalize(string: str):

Applies NFC normalization to a given string to ensure recognition of certain Unicode characters used as token names

Parameters
string:`str`	The text to normalize

Returns
The NFC-normalized form of the text in `string`
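
To show why NFC normalization matters for token recognition, here is a short sketch using only the standard library's `unicodedata.normalize`. Whether the library's `normalize` is implemented this way is an assumption; the helper name `nfc` is hypothetical.

```python
import unicodedata


def nfc(text: str) -> str:
    """Return text in Unicode Normalization Form C (canonical composition)."""
    return unicodedata.normalize("NFC", text)


decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT (two code points)
composed = nfc(decomposed)

# The two spellings render identically but compare unequal until normalized,
# so a lookup keyed on the composed form would miss the decomposed input.
print(decomposed == "\u00e9")       # False
print(composed == "\u00e9")         # True
print(len(decomposed), len(composed))  # 2 1
```

A token name stored in composed form would only match input typed or pasted in decomposed form after a step like this.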