module documentation
Context-aware text encoder
Function | encode | Encodes a string of tokens represented as text into a byte stream and its minimum supported OS version |
Function | normalize | Applies NFC normalization to a given string to ensure recognition of certain Unicode characters used as token names |
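As a rough illustration of why NFC normalization matters here, the sketch below uses only the standard library's unicodedata module. It is not the module's own normalize implementation; the helper name nfc is hypothetical and chosen for this example only.

```python
import unicodedata


def nfc(text: str) -> str:
    """Hypothetical helper: return the NFC-normalized form of text."""
    return unicodedata.normalize("NFC", text)


# The same visible character can be written two ways:
decomposed = "e\u0301"  # 'e' followed by a combining acute accent (2 code points)
composed = "\u00e9"     # precomposed 'e with acute' (1 code point)

# Without normalization the two spellings compare unequal, so a lookup
# keyed on the composed form would miss the decomposed input.
assert decomposed != composed

# After NFC normalization both spellings collapse to the composed form.
assert nfc(decomposed) == composed
assert nfc(composed) == composed
```

A token lookup keyed on composed names would therefore only match decomposed input after a step like this, which is presumably why normalization is enabled by default.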
def encode(string: str, *, trie: TokenTrie = None, mode: str = None, normalize: bool = True) -> tuple[bytes, OsVersion]:
Encodes a string of tokens represented as text into a byte stream and its minimum supported OS version
- Tokenization is performed using one of three procedures, dictated by mode:
  - max: Always munch maximally, i.e. consume the most input possible to produce a token
  - smart: Munch maximally or minimally depending on context
  - string: Always munch minimally (equivalent to smart string context)
- The smart tokenization mode uses the following contexts, munching maximally otherwise:
  - Strings: munch minimally, except when interpolating using Send(
  - Program names: munch minimally up to 8 tokens
  - List names: munch minimally up to 5 tokens
- For reference, here are the tokenization modes utilized by popular IDEs and other software:
  - SourceCoder: max
  - TokenIDE: max
  - TI Connect CE: smart
  - TI-Planet Project Builder: smart
  - tivars_lib_cpp: smart
All tokenization modes respect token glyphs for substituting Unicode symbols.
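The difference between the three modes can be sketched with a toy tokenizer. Everything below is an assumption made for illustration: the token set TOKENS is a tiny hypothetical stand-in for a real TokenTrie, the functions munch and tokenize are not part of this module, and the smart branch models only the string context (it ignores Send(, program names, and list names, and treats a double quote as the only string delimiter).

```python
# Hypothetical toy token set; a real TokenTrie covers the full token sheet.
TOKENS = {"Disp ", "sin(", '"', "D", "i", "s", "p", " ", "n", "(", "X"}


def munch(text: str, pos: int, maximal: bool) -> str:
    """Return the longest (maximal) or shortest (minimal) token starting at pos."""
    matches = [t for t in TOKENS if text.startswith(t, pos)]
    if not matches:
        raise ValueError(f"no token matches at position {pos}")
    return max(matches, key=len) if maximal else min(matches, key=len)


def tokenize(text: str, mode: str = "smart") -> list[str]:
    """Split text into tokens using a simplified max, smart, or string mode."""
    out = []
    pos = 0
    in_string = False
    while pos < len(text):
        if mode == "max":
            maximal = True
        elif mode == "string":
            maximal = False
        else:
            # smart (simplified): minimal inside strings, maximal elsewhere
            maximal = not in_string
        token = munch(text, pos, maximal)
        if token == '"':
            in_string = not in_string  # a quote opens or closes a string
        out.append(token)
        pos += len(token)
    return out


source = 'Disp "sin(X"'
print(tokenize(source, "max"))     # 'sin(' stays one token, even inside the string
print(tokenize(source, "smart"))   # 'sin(' is split into s, i, n, ( inside the string
print(tokenize(source, "string"))  # everything is split, including 'Disp '
```

In this sketch, max keeps sin( as a single token inside the quoted text, smart breaks it into single characters there while still reading Disp as one token, and string breaks everything into the shortest tokens available.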
Parameters | |
string:str | The text string to encode |
trie:TokenTrie | The TokenTrie object to use for tokenization (defaults to the TI-84+CE trie) |
mode:str | The tokenization mode to use (defaults to smart) |
normalize:bool | Whether to apply NFC normalization to the input before encoding (defaults to True) |
Returns | |
tuple[bytes, OsVersion] | A tuple of a stream of token bytes and a minimum OsVersion |