pliers.extractors.BertExtractor

class pliers.extractors.BertExtractor(pretrained_model='bert-base-uncased', tokenizer='bert-base-uncased', model_class='AutoModel', framework='pt', return_input=False, model_kwargs=None, tokenizer_kwargs=None)

Bases: ComplexTextExtractor

Returns encodings from the last hidden layer of BERT or similar models (ALBERT, DistilBERT, RoBERTa, CamemBERT). Excludes special tokens. Base class for other BERT extractors.

Parameters

pretrained_model (str) – Specifies which transformer model to use. Can be any pretrained BERT or BERT-derived model (ALBERT, DistilBERT, RoBERTa, CamemBERT, etc.) listed at https://huggingface.co/transformers/pretrained_models.html, or a path to a custom model.
tokenizer (str) – Tokenizer to use in the tokenization step. If it does not match the model, out-of-vocabulary tokens may be treated as unknown tokens.
model_class (str) – Specifies model type. Must be one of ‘AutoModel’ (encoding extractor) or ‘AutoModelWithLMHead’ (language model). These are generic model classes, which use the value of pretrained_model to infer the model-specific transformers class (e.g. BertModel or BertForMaskedLM for BERT, RobertaModel or RobertaForMaskedLM for RoBERTa). Fixed by each subclass.
framework (str) – Name of the deep learning framework to use. Must be ‘pt’ (PyTorch) or ‘tf’ (TensorFlow). Defaults to ‘pt’.
return_input (bool) – If True, the extractor also returns the encoded token and encoded word as features.
model_kwargs (dict) – Named arguments for transformer model. See https://huggingface.co/transformers/main_classes/model.html
tokenizer_kwargs (dict) – Named arguments for tokenizer. See https://huggingface.co/transformers/main_classes/tokenizer.html
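
As a rough illustration of how these arguments fit together, the sketch below checks a set of arguments against the constraints documented above before handing them to the constructor. The helper names (`bert_extractor_kwargs`, `make_extractor`) are hypothetical and not part of pliers. Defaulting the tokenizer to the model name is an assumption made here to keep the two vocabularies aligned. The final constructor call assumes pliers and transformers are installed.

```python
ALLOWED_MODEL_CLASSES = ("AutoModel", "AutoModelWithLMHead")
ALLOWED_FRAMEWORKS = ("pt", "tf")


def bert_extractor_kwargs(pretrained_model="bert-base-uncased", tokenizer=None,
                          model_class="AutoModel", framework="pt",
                          return_input=False, model_kwargs=None,
                          tokenizer_kwargs=None):
    """Hypothetical helper: check arguments against the documented constraints."""
    if model_class not in ALLOWED_MODEL_CLASSES:
        raise ValueError(f"model_class must be one of {ALLOWED_MODEL_CLASSES}")
    if framework not in ALLOWED_FRAMEWORKS:
        raise ValueError(f"framework must be one of {ALLOWED_FRAMEWORKS}")
    return {
        "pretrained_model": pretrained_model,
        # Assumption: reuse the model name so tokenizer and model share a vocabulary.
        "tokenizer": tokenizer if tokenizer is not None else pretrained_model,
        "model_class": model_class,
        "framework": framework,
        "return_input": return_input,
        "model_kwargs": model_kwargs,
        "tokenizer_kwargs": tokenizer_kwargs,
    }


def make_extractor(**overrides):
    """Hypothetical helper: build the extractor (requires pliers and transformers)."""
    # Imported lazily so the argument check above works without pliers installed.
    from pliers.extractors import BertExtractor
    return BertExtractor(**bert_extractor_kwargs(**overrides))
```

For instance, `make_extractor(pretrained_model='distilbert-base-uncased', return_input=True)` would request DistilBERT encodings together with the encoded tokens; the pretrained weights are typically downloaded on first use.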

__init__(pretrained_model='bert-base-uncased', tokenizer='bert-base-uncased', model_class='AutoModel', framework='pt', return_input=False, model_kwargs=None, tokenizer_kwargs=None)

transform(stim, *args, **kwargs)

Executes the transformation on the passed stim(s).

Parameters

stims (str, Stim, list) –
One or more stimuli to process. Must be one of:
- A string giving the path to a file that can be read in as a Stim (e.g., a .txt file, .jpg image, etc.)
- A Stim instance of any type.
- An iterable of stims, where each element is either a string or a Stim.
validation (str) –
String specifying how validation errors should be handled. Must be one of:
- ‘strict’: Raise an exception on any validation error
- ‘warn’: Issue a warning for all validation errors
- ‘loose’: Silently ignore all validation errors
args – Optional positional arguments to pass on to the internal _transform call.
kwargs – Optional keyword arguments to pass on to the internal _transform call.
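
The three validation modes can be pictured with the following sketch. It is not pliers' implementation: the `handle_validation_error` function, the use of `TypeError`, and the use of `warnings.warn` are assumptions made for illustration. Only the mode names and their described behavior come from the parameter list above.

```python
import warnings


def handle_validation_error(message, validation="strict"):
    """Illustrative only: react to a validation failure according to the mode.

    Returns False when the failure is tolerated, so a caller could skip the stim.
    """
    if validation == "strict":
        # Raise an exception on any validation error.
        raise TypeError(message)
    if validation == "warn":
        # Issue a warning and carry on.
        warnings.warn(message)
        return False
    if validation == "loose":
        # Silently ignore the error.
        return False
    raise ValueError("validation must be 'strict', 'warn', or 'loose'")
```

Under this reading, ‘strict’ suits pipelines where an unexpected stim type should stop processing, while ‘warn’ and ‘loose’ let a batch of mixed stims continue past the ones the extractor cannot handle.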