pliers.extractors.BertSequenceEncodingExtractor
- class pliers.extractors.BertSequenceEncodingExtractor(pretrained_model='bert-base-uncased', tokenizer='bert-base-uncased', framework='pt', pooling='mean', return_special=None, return_input=False, model_kwargs=None, tokenizer_kwargs=None)
Bases:
BertExtractor
- Extract contextualized sequence encodings using pretrained BERT
(or similar models, e.g. DistilBERT).
- Parameters
pretrained_model (str) – A string specifying which transformer model to use. Can be any of the pretrained BERT or BERT-derived models (ALBERT, DistilBERT, RoBERTa, CamemBERT, etc.) listed at https://huggingface.co/transformers/pretrained_models.html, or a path to a custom model.
tokenizer (str) – Name of the tokenizer to use in the tokenization step. If this differs from the model, out-of-vocabulary tokens may be treated as unknown tokens.
framework (str) – Name of the deep learning framework to use. Must be 'pt' (PyTorch) or 'tf' (TensorFlow). Defaults to 'pt'.
pooling (str) – Name of the numpy function (e.g. 'mean', 'max') used to pool token-level encodings; special tokens are excluded from pooling.
return_special (str) – Defines whether to return the encoding of a special sequence token ('[CLS]' or '[SEP]') instead of pooling the other tokens. Must be '[CLS]', '[SEP]', or 'pooler_output'. The latter returns the last-layer hidden state of the [CLS] token, further processed by a linear layer and tanh activation whose weights were trained on the next-sentence prediction task. Note that some BERT-derived models, such as DistilBERT, were not trained on this task; for these models, setting this argument to 'pooler_output' will raise an error.
return_input (bool) – If True, the extractor returns an additional feature column with the encoded sequence.
model_kwargs (dict) – Named arguments for pretrained model. See: https://huggingface.co/transformers/main_classes/model.html and https://huggingface.co/transformers/model_doc/bert.html
tokenizer_kwargs (dict) – Named arguments for tokenizer. See https://huggingface.co/transformers/main_classes/tokenizer.html
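The pooling behavior can be illustrated with a plain-numpy sketch. This is a simplified illustration of the semantics, not the extractor's internal code; `pool_encodings` is a hypothetical helper showing how a named numpy function (the `pooling` argument) reduces token-level encodings while excluding the special [CLS] and [SEP] tokens.

```python
import numpy as np

def pool_encodings(token_encodings, tokens, pooling='mean'):
    """Hypothetical sketch of the extractor's pooling step: apply a
    named numpy function across token-level encodings, excluding
    the special [CLS] and [SEP] tokens."""
    keep = [i for i, t in enumerate(tokens) if t not in ('[CLS]', '[SEP]')]
    pool_fn = getattr(np, pooling)  # e.g. np.mean, np.max
    return pool_fn(np.asarray(token_encodings)[keep], axis=0)

# Toy 4-token sequence with 3-dimensional "encodings"
tokens = ['[CLS]', 'hello', 'world', '[SEP]']
enc = [[9., 9., 9.], [1., 2., 3.], [3., 4., 5.], [9., 9., 9.]]
print(pool_encodings(enc, tokens, pooling='mean'))  # [2. 3. 4.]
```

Note that the [CLS]/[SEP] rows do not contribute to the result; setting `pooling='max'` on the same input would instead return the element-wise maximum over the two content tokens.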
- __init__(pretrained_model='bert-base-uncased', tokenizer='bert-base-uncased', framework='pt', pooling='mean', return_special=None, return_input=False, model_kwargs=None, tokenizer_kwargs=None)
- transform(stim, *args, **kwargs)
Executes the transformation on the passed stim(s).
- Parameters
stim (str, Stim, or iterable) – One or more stimuli to process. Must be one of:
A string giving the path to a file that can be read in as a Stim (e.g., a .txt file, .jpg image, etc.)
A Stim instance of any type.
An iterable of stims, where each element is either a string or a Stim.
validation (str) –
String specifying how validation errors should be handled. Must be one of:
'strict': Raise an exception on any validation error
'warn': Issue a warning for all validation errors
'loose': Silently ignore all validation errors
args – Optional positional arguments to pass onto the internal _transform call.
kwargs – Optional keyword arguments to pass onto the internal _transform call.
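The three validation modes can be sketched generically. This is an illustration of the semantics described above, not pliers' internal implementation; `handle_validation_error` is a hypothetical helper.

```python
import warnings

def handle_validation_error(msg, validation='strict'):
    """Hypothetical sketch of the three validation modes:
    'strict' raises, 'warn' warns, 'loose' silently ignores."""
    if validation == 'strict':
        raise ValueError(msg)   # stop on any validation error
    elif validation == 'warn':
        warnings.warn(msg)      # surface the error, but keep going
    # 'loose': silently ignore the error
```

Under this sketch, a stimulus that fails validation would abort the whole transform under 'strict', emit a warning and be skipped under 'warn', and be skipped silently under 'loose'.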