pliers.extractors.BertLMExtractor

class pliers.extractors.BertLMExtractor(pretrained_model='bert-base-uncased', tokenizer='bert-base-uncased', framework='pt', mask='MASK', top_n=None, threshold=None, target=None, return_softmax=False, return_masked_word=False, return_input=False, model_kwargs=None, tokenizer_kwargs=None)[source]

Bases: BertExtractor

Returns masked word predictions from BERT (or BERT-derived, e.g. DistilBERT) models.

Parameters
  • pretrained_model (str) – A string specifying which transformer model to use. Can be any pretrained BERT or BERT-derived model (ALBERT, DistilBERT, RoBERTa, CamemBERT, etc.) listed at https://huggingface.co/transformers/pretrained_models.html, or a path to a custom model.

  • tokenizer (str) – Type of tokenization used in the tokenization step. If different from the model's tokenizer, out-of-vocabulary tokens may be treated as unknown tokens.

  • framework (str) – Name of the deep learning framework to use. Must be ‘pt’ (PyTorch) or ‘tf’ (TensorFlow). Defaults to ‘pt’.

  • mask (int or str) – Words to be masked (string) or indices of words in the sequence to be masked (indexing starts at 0). Can be either a single word/index or a list of words/indices. If str is passed and more than one word in the input matches the string, only the first one is masked.

  • top_n (int) – Specifies how many of the highest-probability tokens are to be returned. Mutually exclusive with target and threshold.

  • target (str or list) – Vocabulary token(s) for which probability is to be returned. Tokens defined in the vocabulary change across tokenizers. Mutually exclusive with top_n and threshold.

  • threshold (float) – If defined, only values above this threshold will be returned. Mutually exclusive with top_n and target.

  • return_softmax (bool) – If True, returns probability scores instead of raw predictions.

  • return_masked_word (bool) – If True, returns the masked word (if defined in the tokenizer vocabulary) and its probability.

  • model_kwargs (dict) – Named arguments for pretrained model. See: https://huggingface.co/transformers/main_classes/model.html and https://huggingface.co/transformers/model_doc/bert.html.

  • tokenizer_kwargs (dict) – Named arguments for tokenizer. See https://huggingface.co/transformers/main_classes/tokenizer.html.

__init__(pretrained_model='bert-base-uncased', tokenizer='bert-base-uncased', framework='pt', mask='MASK', top_n=None, threshold=None, target=None, return_softmax=False, return_masked_word=False, return_input=False, model_kwargs=None, tokenizer_kwargs=None)[source]
transform(stim, *args, **kwargs)

Executes the transformation on the passed stim(s).

Parameters
  • stims (str, Stim, list) –

    One or more stimuli to process. Must be one of:

    • A string giving the path to a file that can be read in as a Stim (e.g., a .txt file, .jpg image, etc.)

    • A Stim instance of any type.

    • An iterable of stims, where each element is either a string or a Stim.

  • validation (str) –

    String specifying how validation errors should be handled. Must be one of:

    • ‘strict’: Raise an exception on any validation error

    • ‘warn’: Issue a warning for all validation errors

    • ‘loose’: Silently ignore all validation errors

  • args – Optional positional arguments to pass onto the internal _transform call.

  • kwargs – Optional keyword arguments to pass onto the internal _transform call.

update_mask(new_mask)[source]

Updates the mask attribute with the value of new_mask.

Parameters
  • new_mask (int or str) – Word to mask (str) or index/position of the word to mask in the input sequence (int). Indexing starts at 0.