class documentation

Base class for encoder states

Each state represents some encoding context which affects tokenization.

Method             __init__      Undocumented
Method             munch         Munch the input string and determine the resulting token, encoder state, and remainder of the string
Method             next          Determines the next encoder state given a token
Class Variable     max_length    The maximum number of tokens to emit before leaving this state
Class Variable     mode          Whether to munch maximally (0) or minimally (-1)
Instance Variable  length        Undocumented
def __init__(self, length: int = 0):

Undocumented

def munch(self, string: str, trie: TokenTrie) -> tuple[Token, str, list[EncoderState]]:

Munch the input string and determine the resulting token, encoder state, and remainder of the string

Parameters
string: str    The text string to tokenize
trie: TokenTrie    The TokenTrie object to use for tokenization
Returns
tuple[Token, str, list[EncoderState]]    A tuple of the output Token, the remainder of string, and a list of states to add to the stack
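
The difference between maximal and minimal munching can be illustrated with a small standalone sketch. It does not use the real TokenTrie or Token classes: `munch_sketch` and the plain set `vocabulary` are hypothetical stand-ins, and reading `mode` as an index into the candidate matches sorted longest first (0 for the longest, -1 for the shortest) is an assumption suggested by the documented values, not confirmed behavior.

```python
def munch_sketch(string: str, vocabulary: set[str], mode: int = 0) -> tuple[str, str]:
    """Return (matched_text, remainder) for the prefix of `string` chosen by `mode`.

    Hypothetical helper: a plain set of strings stands in for the TokenTrie.
    """
    # Collect every vocabulary entry that is a prefix of the input, longest first.
    candidates = sorted(
        (word for word in vocabulary if word and string.startswith(word)),
        key=len,
        reverse=True,
    )
    if not candidates:
        raise ValueError(f"no token matches the start of {string!r}")
    # Assumption: mode 0 selects the longest candidate (maximal munch),
    # mode -1 selects the shortest candidate (minimal munch).
    matched = candidates[mode]
    return matched, string[len(matched):]


vocabulary = {"s", "si", "sin", "sin(", "x"}
longest = munch_sketch("sin(x", vocabulary, mode=0)    # consumes "sin("
shortest = munch_sketch("sin(x", vocabulary, mode=-1)  # consumes only "s"
```

The real method additionally returns the list of states to push, which this sketch leaves out.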
def next(self, token: Token) -> list[EncoderState]:

Determines the next encoder state given a token

The current state is popped from the stack, and the states returned by this method are pushed.

If the list of returned states
  • is empty, then the encoder is exiting the current state.
  • has length one, then the encoder's current state is being replaced by a new state.
  • has length two, then the encoder is entering a new state and can later exit back to this one.
Parameters
token: Token    The current token
Returns
list[EncoderState]    A list of encoder states to add to the stack
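
The pop-then-push behavior described above can be sketched with a plain Python list as the stack. `SketchState`, `step`, and `names` are hypothetical names invented for this illustration, tokens are plain strings rather than Token objects, and the ordering of a two-element result (the current state first, the newly entered state last so that it ends up on top) is an assumption.

```python
class SketchState:
    """Hypothetical stand-in for an encoder state; only `name` and `next` matter here."""

    def __init__(self, name, transitions=None):
        self.name = name
        # Maps a token to the list of states that `next` returns for it.
        self.transitions = transitions or {}

    def next(self, token):
        # Default: return a one-element list that keeps this state on the stack.
        return self.transitions.get(token, [self])


def step(stack, token):
    """Pop the current state, then push whatever its `next` returns."""
    current = stack.pop()
    stack.extend(current.next(token))
    return stack


def names(stack):
    return [state.name for state in stack]


base = SketchState("base")
string = SketchState("string", {'"': []})  # a closing quote exits the state
comment = SketchState("comment")
base.transitions = {'"': [base, string], "#": [comment]}

stack = [base]
entered = names(step(stack, '"'))   # length two: enter `string`, `base` stays below
stayed = names(step(stack, "a"))    # length one: `string` is replaced by itself
exited = names(step(stack, '"'))    # empty list: back to `base`
replaced = names(step(stack, "#"))  # length one: `base` is replaced by `comment`
```

Because the current state is always popped first, a state that wants to remain active has to include itself in the returned list.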
max_length =

The maximum number of tokens to emit before leaving this state

Undocumented