Configuration¶
Pliers contains a number of package-wide options that can be configured via the pliers.config module. These include:
Setting | Type | Default | Description |
---|---|---|---|
cache_transformers | bool | True | Whether or not to cache Transformer outputs in memory |
default_converters | dict | see module | See explanation in the Converters section |
drop_bad_extractor_results | bool | True | Whether or not to drop None values returned by .transform() calls |
log_transformations | bool | True | Whether or not to log transformation details in each Stim’s .history attribute |
n_jobs | int | None (CPU cores - 1) | Number of simultaneous jobs to execute (if parallelize is True) |
parallelize | bool | False | Whether or not to use naive parallelization by default |
progress_bar | bool | True | Whether or not to display progress bars when looping over Stims |
use_generators | bool | False | Whether Transformers should return generators rather than lists when iterating over Stims |
allow_large_jobs | bool | True | Whether to allow API jobs that exceed the long_job or large_job limits |
long_job | int | 60 | Maximum duration to allow for API jobs if allow_large_jobs is False |
large_job | int | 100 | Maximum number of transformations allowed if allow_large_jobs is False |
api_key_validation | bool | False | Explicitly validate API keys prior to attempting extraction |
Setting options¶
Package-wide options can be changed either at initialization or at run-time.
At initialization¶
By default, when pliers is first imported, it will look in three places for configuration files that override the package defaults. In order of precedence, these are:
1. A pliers_config.json file in the current (working) directory.
2. A filename set in the PLIERS_CONFIG environment variable.
3. A pliers_config.json file located in the user’s home directory.
In all cases, the file must be a standard .json file containing only valid option names as keys. Default package values will continue to be used for any options not explicitly specified in the file. For example:
{
"parallelize": true,
"n_jobs": 4
}
If the above is placed in a pliers_config.json file in one’s home directory, pliers will execute all iterable transformations in parallel (with 4 jobs).
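The lookup order and merge behavior described above can be sketched as follows. This is an illustration only, not pliers' actual loader: find_config_file, load_options, and the abbreviated DEFAULTS dict are hypothetical names introduced here, and rejecting unknown keys with a ValueError is an assumption about how "only valid option names" might be enforced.

```python
import json
import os
import tempfile

# Abbreviated, illustrative defaults; not the full pliers option set.
DEFAULTS = {"parallelize": False, "n_jobs": None, "progress_bar": True}


def find_config_file(cwd, environ, home):
    """Return the first existing config path in order of precedence, or None."""
    candidates = [
        os.path.join(cwd, "pliers_config.json"),   # 1. working directory
        environ.get("PLIERS_CONFIG"),              # 2. environment variable
        os.path.join(home, "pliers_config.json"),  # 3. home directory
    ]
    for path in candidates:
        if path and os.path.isfile(path):
            return path
    return None


def load_options(path, defaults=DEFAULTS):
    """Merge the options stored in `path` over `defaults`."""
    options = dict(defaults)
    if path is None:
        return options
    with open(path) as f:
        overrides = json.load(f)
    unknown = set(overrides) - set(defaults)
    if unknown:
        # Assumption: unknown option names are treated as an error.
        raise ValueError(f"Unknown option(s): {sorted(unknown)}")
    options.update(overrides)
    return options


# Demo: a config file in a stand-in "home" directory, none in the working directory.
with tempfile.TemporaryDirectory() as home, tempfile.TemporaryDirectory() as cwd:
    with open(os.path.join(home, "pliers_config.json"), "w") as f:
        json.dump({"parallelize": True, "n_jobs": 4}, f)
    found = find_config_file(cwd, {}, home)
    found_name = os.path.basename(found)
    options = load_options(found)
```

Options that the file does not mention (progress_bar in this demo) keep their default values, which mirrors the behavior described for unspecified options.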
At run-time¶
Package options can also be changed dynamically, via the .get_option() and .set_option() (or, for multiple options, .set_options()) accessors:
>>> import pliers as pl
>>> pl.get_option('use_generators')
False
>>> pl.set_option('use_generators', True)
# Or...
>>> pl.set_options(use_generators=True, progress_bar=False)
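Because options set this way stay in effect until they are changed again, it can be convenient to restore the previous value after a block of code. The following is a minimal sketch of that pattern; temporary_option is a hypothetical helper, not part of pliers, and a plain dict stands in for the pliers option store so that the example runs without pliers installed.

```python
from contextlib import contextmanager


@contextmanager
def temporary_option(get_option, set_option, name, value):
    """Set option `name` to `value` inside a with-block, then restore it."""
    previous = get_option(name)
    set_option(name, value)
    try:
        yield
    finally:
        # Restore the old value even if the block raised an exception.
        set_option(name, previous)


# Stand-in option store (an assumption for the demo, not the pliers API).
_store = {"use_generators": False}

with temporary_option(_store.get, _store.__setitem__, "use_generators", True):
    inside = _store["use_generators"]
after = _store["use_generators"]
```

With pliers installed, one would presumably pass pl.get_option and pl.set_option in place of the dict methods.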
Option details¶
cache_transformers (bool)¶
When set to True, the output produced by every .transform() call will be cached in memory (filesystem caching is not currently available). This is the default, and can be very useful in cases where (a) many calls to commercial feature extraction services (e.g., the Google or IBM families of Extractors) are being made, or (b) there are intermediate Stim representations generated by Converter classes that are computationally expensive to produce. Setting cache_transformers to False will result in every .transform() call being recomputed, with no intermediates stored in memory.
Note that caching in pliers (really, memoization) is based on the combination of the Transformer class, its initialization parameters, and the id of the input Stim. If any of these changes, results will be computed anew. So, for example, creating two separate instances of the ClarifaiAPIImageExtractor, each with different model arguments, will result in two separate calls being made to the Clarifai API even if the exact same Stim inputs are passed. (However, different instances of the same ClarifaiAPIImageExtractor initialized using the same arguments will still point to the same entry in the cache.)
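The keying scheme described above can be illustrated with a toy example. ToyExtractor is a hypothetical stand-in rather than a pliers class, and the cache layout is an assumption chosen only to show which calls hit the cache.

```python
class ToyExtractor:
    """Stand-in for a Transformer; counts how often real work is done."""

    calls = 0     # number of uncached computations
    _cache = {}   # shared across all instances

    def __init__(self, model="general"):
        self.model = model

    def transform(self, stim):
        # Key: class name, initialization parameters, identity of the input.
        key = (type(self).__name__, self.model, id(stim))
        if key not in ToyExtractor._cache:
            ToyExtractor.calls += 1
            ToyExtractor._cache[key] = f"{self.model}:{stim['name']}"
        return ToyExtractor._cache[key]


stim = {"name": "image.jpg"}
ToyExtractor(model="general").transform(stim)  # computed
ToyExtractor(model="general").transform(stim)  # same arguments: cache hit
ToyExtractor(model="food").transform(stim)     # different arguments: computed
```

Three calls result in two computations: the second instance shares a cache entry with the first because its class, arguments, and input are identical.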
default_converters (dict)¶
This option specifies what Converter classes to use for implicit conversion between Stim types (i.e., in cases where the code does not explicitly specify every conversion step). The format for this setting is a bit more involved; for details, see Package-wide conversion defaults.
drop_bad_extractor_results (bool)¶
In certain conditions, .transform() calls may return None values. Typically this happens either because of an unexpected internal failure (e.g., a timeout occurs in an API-based Extractor), or because None is the expected behavior for a Transformer given certain inputs. Either way, such values can wreak havoc on downstream transformations, because None is not a valid input to any .transform() call in pliers.
To avoid having entire workflows fail unpredictably as soon as any single Transformer/Stim combination returns a None value, pliers will, by default, drop bad values as it encounters them. While this is usually desired, in cases where failures (or other causes of an invalid return value) are important to identify, we can disable this sanitization process by setting the drop_bad_extractor_results option to False. Note that this will typically result in an Exception being raised the first time a bad value is encountered.
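The two behaviors can be sketched in a few lines. collect_results is a hypothetical helper written for illustration; the exception type and message are assumptions, not what pliers raises.

```python
def collect_results(results, drop_bad=True):
    """Return results as a list, dropping None values or raising on the first."""
    kept = []
    for result in results:
        if result is None:
            if drop_bad:
                continue  # silently skip the bad value
            raise ValueError("Transformer returned None")
        kept.append(result)
    return kept


# Default behavior: the None value is dropped.
cleaned = collect_results([1, None, 3])

# With dropping disabled, the first None raises.
try:
    collect_results([1, None, 3], drop_bad=False)
    raised = False
except ValueError:
    raised = True
```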
log_transformations (bool)¶
By default, pliers logs every transformation applied to a Stim object in the Stim’s .history property. While this is usually desirable, in contexts where hundreds of thousands or even millions of Stim objects are being processed, the aggregate memory footprint of all of these logs may be non-trivial. We can disable transformation logging at any time by setting the log_transformations option to False.
parallelize (bool), n_jobs (int)¶
By default, pliers executes all transformations serially, even in cases where an iterable of Stims is passed in (so that transformation is, in principle, embarrassingly parallel). However, pliers also supports rudimentary parallelization of transformations via the pathos package. If the parallelize option is set to True, any Transformer passed an iterable of Stims as input will apply its transformations to the elements of the iterable in parallel.
The n_jobs option specifies how many workers to launch. The default value of None will be interpreted as num(CPU cores) - 1. Note that n_jobs will be ignored unless parallelization is enabled.
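The rule for choosing the number of workers can be written out as a small function. resolve_n_jobs is a hypothetical helper, not the pliers implementation; in particular, the floor of one worker on single-core machines is an assumption added here.

```python
import multiprocessing


def resolve_n_jobs(n_jobs=None, parallelize=False):
    """Return the number of workers implied by the two options."""
    if not parallelize:
        return 1  # n_jobs is ignored when parallelization is off
    if n_jobs is None:
        # None means "CPU cores - 1"; never go below one worker (assumption).
        return max(1, multiprocessing.cpu_count() - 1)
    return n_jobs


serial = resolve_n_jobs(n_jobs=4, parallelize=False)
explicit = resolve_n_jobs(n_jobs=4, parallelize=True)
automatic = resolve_n_jobs(n_jobs=None, parallelize=True)
```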
progress_bar (bool)¶
By default, pliers shows a progress bar (using tqdm) when transforming iterable inputs (e.g., lists of Stims). To disable this behavior, set progress_bar to False.
use_generators (bool)¶
Internally, pliers uses generators whenever iteration over Stims occurs, in order to (potentially) reduce its memory footprint. However, generators can be confusing to users new to Python. To minimize confusion, pliers therefore converts all generators to lists before returning results to the user (or passing them as inputs to the next Transformer in a Graph). More experienced users who are comfortable with generator expressions and want to take advantage of their potential memory-saving benefits can enable generators by setting use_generators to True. (Note that it is not a foregone conclusion that enabling generators will reduce memory consumption; if caching is enabled and/or the number of intermediate conversions is large, using generators is unlikely to help much.)
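The difference between the two return styles can be seen in a small sketch. transform_all is a hypothetical function that imitates the behavior described above; it is not part of pliers.

```python
def transform_all(stims, transform, use_generators=False):
    """Apply `transform` to each stim, returning a list or a lazy generator."""
    results = (transform(s) for s in stims)  # nothing is computed yet
    return results if use_generators else list(results)


# Default: results are materialized into a list immediately.
as_list = transform_all(["a", "b"], str.upper)

# With generators enabled, each result is computed only when requested.
lazy = transform_all(["a", "b"], str.upper, use_generators=True)
first = next(lazy)
```

A generator can be consumed only once, so code that needs to iterate over the results several times has to convert it to a list first.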
allow_large_jobs (bool)¶
By default, this is True and pliers will allow you to run arbitrarily large jobs. However, this could be unexpectedly costly on paid remote APIs. To prevent unexpectedly large jobs from executing, set this variable to False.
long_job (int)¶
This variable allows you to set the maximum stimulus duration that will be sent to API extractors, if allow_large_jobs is False.
large_job (int)¶
This variable allows you to set the maximum number of transformations (i.e., stimuli) that will be sent to API extractors, if allow_large_jobs is False.
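How the three job-size options interact can be sketched as a simple guard. check_job is a hypothetical function, the error type is an assumption, and durations are compared in whatever unit long_job uses (the unit is not stated here).

```python
def check_job(durations, allow_large_jobs=True, long_job=60, large_job=100):
    """Raise ValueError if a job exceeds the limits and large jobs are disallowed.

    `durations` holds one entry per stimulus (None if it has no duration).
    """
    if allow_large_jobs:
        return  # no limits are enforced
    if len(durations) > large_job:
        raise ValueError(f"{len(durations)} stimuli exceeds large_job={large_job}")
    for duration in durations:
        if duration is not None and duration > long_job:
            raise ValueError(f"duration {duration} exceeds long_job={long_job}")


def is_allowed(durations, **options):
    """Convenience wrapper that reports whether check_job accepts the job."""
    try:
        check_job(durations, **options)
        return True
    except ValueError:
        return False


unrestricted = is_allowed([500] * 1000)                       # limits ignored
too_long = is_allowed([10, 90], allow_large_jobs=False)       # 90 > long_job
too_many = is_allowed([10] * 101, allow_large_jobs=False)     # 101 > large_job
small = is_allowed([10, 20], allow_large_jobs=False)          # within limits
```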
api_key_validation (bool)¶
Explicitly validates API keys prior to attempting remote feature extraction. Setting this to True makes it easier to diagnose whether errors with remote APIs are due to an invalid API key.