Scoring sequences for humanness in AntPack
Here’s how to create a tool to use for scoring sequences::

    from antpack import SequenceScoringTool
    scoring_tool = SequenceScoringTool(offer_classifier_option = False,
                                       normalization = "none")

Note that the scoring tool can work with both heavy and light chains and will by default figure out what type of chain you’re giving it for any sequence you provide.
The scoring tool can normalize scores; this is mostly useful if you
want to combine heavy and light chain scores into a single antibody
score, or if you’re trying to compare scores assigned to specific
regions (frameworks or CDRs). There are two normalization options.
Setting normalization='training_set_adjust' subtracts the median
training set score for heavy and light chains from scores assigned to
those chains. This is useful for combining the heavy and light chain
scores for an antibody if you haven’t masked out any regions. If you
are masking regions, normalization='normalize' is probably most
useful; this divides the score by the number of non-masked residues.
The default is 'none'.
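The arithmetic behind the two normalization options can be sketched as follows. This only illustrates the description above and is not AntPack's internal code: the helper names and every numeric value are made up for the example, and summing the two adjusted chain scores is one plausible way to combine them, not something this page prescribes.

```python
def training_set_adjust(raw_score, training_median):
    """Subtract the training-set median for the chain type (hypothetical helper)."""
    return raw_score - training_median


def length_normalize(raw_score, n_unmasked):
    """Divide a score by the number of non-masked residues (hypothetical helper)."""
    if n_unmasked <= 0:
        raise ValueError("n_unmasked must be a positive integer")
    return raw_score / n_unmasked


# Made-up raw scores and medians, purely for illustration.
heavy_raw, heavy_median = -90.0, -80.0
light_raw, light_median = -60.0, -55.0

heavy_adj = training_set_adjust(heavy_raw, heavy_median)
light_adj = training_set_adjust(light_raw, light_median)

# Assumption: the adjusted chain scores are combined by simple addition.
antibody_score = heavy_adj + light_adj

# A made-up region score over 12 non-masked residues.
region_score = length_normalize(-24.0, 12)
```

After adjustment the heavy and light chain scores are on a comparable footing, which is why this option suits whole-antibody scores; dividing by the residue count serves the same purpose when comparing regions of different length.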
The scoring tool can run in classifier mode if offer_classifier_option
is True. (See the Background section for more on how this works and why
we don’t generally recommend it.) Setting offer_classifier_option
to False makes the tool a little more lightweight, since it doesn’t
need to load some additional model parameters.
Scoring a sequence
To score sequences with various options, use score_seqs.
If you want more detailed information about why a sequence
may have been scored the way it was, use get_diagnostic_info.
If you want a mask for all positions EXCEPT a standard IMGT-defined
framework or CDR region, use get_standard_mask; the resulting
mask can then be passed to score_seqs. This will enable you
to just get the score for a specific region.
See below for details.
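The region-scoring workflow described above can be sketched as a small helper. It uses only the get_standard_mask and score_seqs calls documented on this page; the helper name score_region is hypothetical, and it assumes the caller already knows the chain type of the sequence.

```python
def score_region(tool, seq, chain_type, region):
    """Score a single IMGT-defined region of one chain (hypothetical helper).

    `tool` is expected to be an antpack.SequenceScoringTool. `chain_type`
    is "H" or "L" and should match the chain type of `seq`.
    """
    if chain_type not in ("H", "L"):
        raise ValueError("chain_type must be 'H' or 'L'")

    # The mask lists every position EXCEPT the requested region.
    mask = tool.get_standard_mask(chain_type, region)

    # Pass the mask through the argument that matches the chain type.
    if chain_type == "H":
        scores = tool.score_seqs([seq], custom_heavy_mask=mask)
    else:
        scores = tool.score_seqs([seq], custom_light_mask=mask)
    return scores[0]
```

When comparing regions of different lengths, constructing the tool with normalization='normalize' is likely the more sensible choice, since the score is then divided by the number of non-masked residues.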
- class antpack.SequenceScoringTool(offer_classifier_option=False, normalization='none')
Tool for scoring sequences.
- convert_sequence_to_array(seq)
Converts an input sequence to a type uint8_t array where the integer at each position indicates the amino acid at that position. Can be used in conjunction with the cluster returned by retrieve_cluster or get_closest_clusters to determine which amino acids are contributing most (or least) to the humanness score.
- Parameters:
seq (str) – The sequence of interest.
- Returns:
chain_name (str) – The chain type; one of “H”, “L”.
arr (np.ndarray) – A numpy array of shape (1,M) where M is the sequence length after converting to a fixed length array.
- get_diagnostic_info(seq)
Gets diagnostic information for an input sequence that is useful if troubleshooting.
- Parameters:
seq (str) – The sequence. May be either heavy chain or light.
- Returns:
gapped_imgt_positions (list) – A list of gaps at sites that are sometimes filled in the IMGT numbering system. The presence of these is not a cause for concern but may be useful for diagnostic purposes.
unusual_positions (list) – A list of unexpected insertions at sites where an insertion is very unusual in the training set. Take note of these if any are found – these are not taken into account when scoring the sequence, so if this is not an empty list, the score may be less reliable.
chain_name (str) – one of “H”, “L” for heavy or light. Indicates the chain type to which the sequence was aligned.
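Because a non-empty unusual_positions list means the score may be less reliable, it can be worth screening sequences before trusting their scores. The sketch below assumes get_diagnostic_info returns its three values as a tuple in the order documented above; the helper name is hypothetical.

```python
def find_flagged_sequences(tool, seqs):
    """Return the sequences whose diagnostics report unusual insertions.

    `tool` is expected to be an antpack.SequenceScoringTool. Each entry of
    the result is (sequence, chain_name, unusual_positions).
    """
    flagged = []
    for seq in seqs:
        # Assumed return order: gapped positions, unusual positions, chain type.
        _gapped, unusual, chain_name = tool.get_diagnostic_info(seq)
        if len(unusual) > 0:
            flagged.append((seq, chain_name, unusual))
    return flagged
```

Gapped IMGT positions are deliberately ignored here, since the documentation notes that they are not a cause for concern.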
- get_standard_mask(chain_type: str, region: str = 'framework_1')
Returns a mask for ALL positions EXCEPT a specified IMGT-defined region. You can then use this mask as input to score_seqs to see what the score would be if only that region were included.
- Parameters:
chain_type (str) – One of “H”, “L”.
region (str) – One of ‘framework_1’, ‘framework_2’, ‘framework_3’, ‘framework_4’, ‘cdr_1’, ‘cdr_2’, ‘cdr_3’, ‘cdr_4’. This function will construct a mask that excludes all other regions.
- Returns:
mask (list) – A list of excluded imgt positions that can be passed to one of the scoring functions.
- Raises:
ValueError – A ValueError is raised if unexpected inputs are supplied.
- get_standard_positions(chain_type)
Returns a list of the standard positions used by SAM when scoring a sequence. IMGT numbered positions outside this set are ignored when assigning a score. If you want to use a generic mask for an IMGT-defined region, call get_standard_mask. If you want to create a custom mask for everything except a specific region of interest, use this function to get a list of all positions.
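A custom mask of this kind can be built with a few lines of plain Python. In the sketch below the helper name is hypothetical, and demo_positions is a made-up stand-in for the list that get_standard_positions would return (assumed here to be a list of position strings, as the mask arguments of score_seqs expect).

```python
def mask_all_except(all_positions, keep):
    """Return every position in `all_positions` that is not in `keep`.

    The result lists the positions to ignore, preserving the input order.
    """
    keep = set(keep)
    unknown = keep.difference(all_positions)
    if unknown:
        raise ValueError(f"Positions not in the standard set: {sorted(unknown)}")
    return [pos for pos in all_positions if pos not in keep]


# Stand-in for tool.get_standard_positions("H"); real lists are much longer.
demo_positions = ["1", "2", "3", "4", "5"]

# Keep positions 2 and 3; everything else ends up in the mask.
demo_mask = mask_all_except(demo_positions, ["2", "3"])
```

The resulting list can then be supplied as custom_heavy_mask or custom_light_mask when calling score_seqs.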
- retrieve_cluster(cluster_id, chain_type)
A convenience function to get the per-position probabilities associated with a particular cluster.
- Parameters:
cluster_id (int) – The id number of the cluster to retrieve. Can be generated by calling self.get_closest_clusters or self.score_seqs with mode = “assign”.
chain_type (str) – One of “H”, “L”.
- Returns:
mu_mix (np.ndarray) – An array of shape (1, sequence_length, 21), where 21 is the number of possible AAs. The clusters are sorted in order from most to least likely given the input sequence.
mixweights (float) – The probability of this cluster in the mixture.
aas (list) – A list of amino acids in standard order. The last dimension of mu_mix corresponds to these aas in the order given.
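One way to see which residues contribute most or least to the score is to look up, at each position, the probability the cluster assigns to the observed amino acid. The sketch below is plain NumPy and relies on an assumption that should be checked against aas before use: that the integer codes produced by convert_sequence_to_array index the last axis of mu_mix in the same order as aas. The helper name is hypothetical.

```python
import numpy as np


def per_position_log_prob(arr, mu_mix):
    """Log-probability of the observed amino acid at each position.

    arr:    integer array of shape (1, M), as from convert_sequence_to_array.
    mu_mix: array of shape (1, M, 21), as from retrieve_cluster.
    Returns an array of shape (M,).
    """
    arr = np.asarray(arr)
    mu_mix = np.asarray(mu_mix)
    if arr.ndim != 2 or mu_mix.ndim != 3:
        raise ValueError("Expected arr of shape (1, M) and mu_mix of shape (1, M, 21)")
    if arr.shape[0] != 1 or mu_mix.shape[0] != 1 or arr.shape[1] != mu_mix.shape[1]:
        raise ValueError("arr and mu_mix disagree on the sequence length")

    positions = np.arange(arr.shape[1])
    # Assumption: each integer code is a valid index into the last axis of mu_mix.
    observed = mu_mix[0, positions, arr[0].astype(np.intp)]
    return np.log(observed)
```

Positions with the most negative values are the ones this cluster finds least typical; sorting the output with np.argsort gives a quick ranking.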
- score_seqs(seq_list, mask_cdr3: bool = False, custom_light_mask: list | None = None, custom_heavy_mask: list | None = None, mask_terminal_dels: bool = False, mask_gaps: bool = False, mode: str = 'score')
Scores a list of sequences in batches or assigns them to clusters. Can be used in conjunction with a user-supplied mask (for positions to ignore). Substantially faster than single-sequence scoring, but does not offer the option to retrieve diagnostic info. Can also be used to assign a large number of sequences to clusters.
- Parameters:
seq_list (list) – The list of input sequences. May contain both heavy and light.
mask_cdr3 (bool) – If True, ignore IMGT-defined CDR3 when assigning a score. CDR3 is not distinctive across species so this is often useful. Ignored if mode is ‘assign’, ‘assign_no_weights’.
custom_light_mask (list) – Either None or a list of strings indicating IMGT positions to ignore. Use self.get_standard_positions and/or self.get_standard_mask to construct a mask. This can be useful if you just want to score a specific region, or if there is a large deletion that should be ignored.
custom_heavy_mask (list) – Either None or a list of strings indicating IMGT positions to ignore. Use self.get_standard_positions and/or self.get_standard_mask to construct a mask. This can be useful if you just want to score a specific region, or if there is a large deletion that should be ignored.
mask_terminal_dels (bool) – If True, N and C-terminal deletions are masked when calculating a score or assigning to a cluster. Useful if there are large unusual deletions at either end of the sequence that you would like to ignore when scoring.
mask_gaps (bool) – If True, all non-filled IMGT positions in the sequence are ignored when calculating the score. This is useful when your sequence has unusual deletions and you would like to ignore these.
mode (str) – One of ‘score’, ‘assign’, ‘assign_no_weights’, ‘classifier’. If ‘score’, returns the human generative model score. If ‘assign’, provides the most likely cluster number for each input sequence. If ‘assign_no_weights’, assigns the closest cluster ignoring mixture weights, so that the closest cluster is assigned even if that cluster is a low-probability one. If ‘classifier’, assigns a score using the Bayes’ rule classifier, which also takes into account some info regarding other species. ‘classifier’ is not a good way to score sequences in general because it only works well for sequences of known origin, so it should only be used for testing.
- Returns:
output_scores (np.ndarray) – log( p(x) ) for all input sequences.
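As a usage sketch, the helper below scores a batch with CDR3 masked and ranks the sequences from highest to lowest score. It uses only the score_seqs arguments documented above; the helper name is hypothetical, and it treats a higher log( p(x) ) as more human-like. Raw scores are only directly comparable within one chain type, so this assumes the list holds a single chain type or that the tool was built with a suitable normalization.

```python
import numpy as np


def rank_by_score(tool, seqs, mask_cdr3=True):
    """Score `seqs` in one batch and return (sequence, score) pairs, best first.

    `tool` is expected to be an antpack.SequenceScoringTool.
    """
    scores = np.asarray(
        tool.score_seqs(seqs, mask_cdr3=mask_cdr3, mode="score"),
        dtype=float,
    )
    # Sort by descending score (highest log-probability first).
    order = np.argsort(-scores)
    return [(seqs[i], float(scores[i])) for i in order]
```

Scoring the whole list in a single call, rather than looping over sequences, keeps the batch-speed advantage the method description mentions.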