Scoring sequences for humanness in AntPack
Here’s how to create a tool to use for scoring sequences::

    from antpack import SequenceScoringTool
    scoring_tool = SequenceScoringTool(offer_classifier_option = False,
                                       normalization = "none")
Note that the scoring tool can work with both heavy and light chains and will by default figure out what type of chain you’re giving it for any sequence you provide.
The scoring tool can normalize scores; this is mostly useful if you
want to combine heavy and light chain scores into a single antibody
score, or if you’re trying to compare scores assigned to specific
regions (frameworks or CDRs). There are two normalization options.
Setting normalization='training_set_adjust' subtracts the median
training set score for heavy and light chains from scores assigned to
those chains. This is useful for combining the heavy and light chain
scores for an antibody if you haven’t masked out any regions or used
any masking options. If masking regions, normalization='normalize'
is probably most useful; this divides the score by the number of
non-masked residues. 'none' is the default.
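The arithmetic behind the two normalization options can be sketched in plain Python. This is only an illustration of the description above, not AntPack's implementation: the training-set scores and raw scores are made-up numbers, the helper names are hypothetical, and adding the adjusted heavy and light scores to get one antibody score is an assumption.

```python
from statistics import median

def training_set_adjust(raw_score, training_scores):
    """Subtract the median training-set score for the same chain type."""
    return raw_score - median(training_scores)

def normalize_by_length(raw_score, mask):
    """Divide the score by the number of non-masked (True) positions."""
    n_used = sum(1 for keep in mask if keep)
    if n_used == 0:
        raise ValueError("mask excludes every position")
    return raw_score / n_used

# Hypothetical numbers, for illustration only.
heavy_training = [-95.0, -100.0, -110.0]   # median is -100.0
light_training = [-70.0, -80.0, -85.0]     # median is -80.0

heavy_adj = training_set_adjust(-120.0, heavy_training)
light_adj = training_set_adjust(-90.0, light_training)

# Assumption: the adjusted chain scores are simply added together.
antibody_score = heavy_adj + light_adj

# Three of four positions are unmasked, so the score is divided by 3.
per_residue = normalize_by_length(-30.0, [True, True, False, True])
```

Subtracting the chain-specific median puts heavy and light scores on a comparable footing before they are combined, while dividing by the number of unmasked positions makes regions of different lengths comparable.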
The scoring tool can run in classifier mode if offer_classifier_option
is True. (See the Background section for more on how this works and why
we don’t generally recommend it.) Setting offer_classifier_option
to False makes the tool a little more lightweight, since it doesn’t
need to load some additional model parameters.
Scoring a sequence
To score sequences with various options, use score_seqs.
If you want more detailed information about why a sequence
If you want a mask for all positions EXCEPT a standard IMGT-defined
framework or CDR region, use get_standard_mask; the resulting
mask can then be passed to score_seqs. This will enable you
to just get the score for a specific region.
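To make the idea of a region mask concrete, here is a minimal pure-Python sketch of what such a mask looks like. It does not call AntPack: the region boundaries are the standard IMGT ones, but the position list is a simplified stand-in for whatever get_standard_positions returns, insertion codes are assumed to be a trailing letter (for example "111A"), and region_mask is a hypothetical helper.

```python
# Standard IMGT region boundaries (inclusive), by IMGT position number.
IMGT_REGIONS = {
    "fmwk1": (1, 26), "cdr1": (27, 38), "fmwk2": (39, 55), "cdr2": (56, 65),
    "fmwk3": (66, 104), "cdr3": (105, 117), "fmwk4": (118, 128),
}

def region_mask(positions, region):
    """Return a list of booleans that is True only inside `region`."""
    start, end = IMGT_REGIONS[region]
    mask = []
    for pos in positions:
        # Drop any trailing insertion letter, e.g. "111A" -> 111.
        number = int("".join(ch for ch in pos if ch.isdigit()))
        mask.append(start <= number <= end)
    return mask

# Hypothetical position list: IMGT positions 1..128 with no insertions.
positions = [str(i) for i in range(1, 129)]
cdr2_mask = region_mask(positions, "cdr2")
```

Only positions marked True contribute to the score, so a mask like this restricts scoring to CDR2.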
Finally and most importantly, you can use get_closest_clusters
to find out which cluster(s) are closest to your sequence then
call calc_per_aa_probs to determine the probability of each
AA in your sequence given those clusters. This enables easy identification
of low-probability residues and motifs that could pose developability
issues and is illustrated in more detail in the examples.
See below for details.
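Once you have per-residue log probabilities (such as the logprobs array described for calc_per_aa_probs), picking out low-probability residues is a simple filter. The sketch below works on plain Python lists; the helper name, the example values and the cutoff of log(0.01) are all hypothetical choices, not AntPack defaults.

```python
import math

def flag_low_probability_residues(seq, logprobs, threshold=math.log(0.01)):
    """Return (index, residue, logprob) for residues scoring below `threshold`."""
    flagged = []
    for i, (aa, lp) in enumerate(zip(seq, logprobs)):
        if math.isnan(lp):
            continue  # nan is what you get back when numbering failed
        if lp < threshold:
            flagged.append((i, aa, lp))
    return flagged

# Hypothetical sequence fragment and log probabilities.
seq = "EVQLW"
logprobs = [-0.1, -0.3, -0.2, -6.5, float("nan")]
flagged = flag_low_probability_residues(seq, logprobs)
```

Indices here are 0-based positions in the input sequence, not IMGT numbers.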
- class antpack.SequenceScoringTool(offer_classifier_option=False, normalization='none', max_threads=2)
Tool for scoring sequences.
- calc_per_aa_probs(seq: str, cluster_id: int)
Calculate the log probability of each amino acid in the input sequence given a specified cluster number and identify what that cluster considers the most likely amino acid at each position. To get the cluster number of the cluster closest to your input sequence, call get_closest_clusters.
- Parameters:
seq (str) – The sequence of interest.
cluster_id (int) – The index of the cluster you would like to use to generate these probabilities. Call get_closest_clusters to find those closest to your input sequence.
- Returns:
chain_type (str) – One of “H”, “L”, or “unknown” if there is an error numbering your sequence.
logprobs (np.ndarray) – An array of shape (M) where M is the length of your input sequence. nan is returned if there is an error numbering the sequence.
most_likely_aas (list) – A list of the most likely AA at each position in your input sequence according to the specified cluster.
- convert_sequence_to_array(seq: str)
Converts an input sequence to a type uint8_t array where each integer indicates the amino acid at that position.
- Parameters:
seq (str) – The sequence of interest.
- Returns:
chain_name (str) – The chain type; one of “H”, “L” or “unknown” if there is an error.
arr (np.ndarray) – A numpy array of shape (1,M) where M is the length after converting to a fixed length array. nan is returned if there is an error numbering the sequence.
- get_diagnostic_info(seq: str)
Gets diagnostic information for an input sequence that is useful if troubleshooting.
- Parameters:
seq (str) – The sequence. May be either heavy chain or light.
- Returns:
gapped_imgt_positions (list) – A list of gaps at sites that are sometimes filled in the IMGT numbering system. The presence of these is not a cause for concern but may be useful for diagnostic purposes.
unusual_positions (list) – A list of unexpected insertions at sites where an insertion is very unusual in the training set. Take note of these if any are found – these are not taken into account when scoring the sequence, so if this is not an empty list, the score may be less reliable.
chain_name (str) – one of “H”, “L” for heavy or light. Indicates the chain type to which the sequence was aligned. “Unknown” if there was an error in numbering.
- get_standard_mask(chain_type: str, region: str = 'fmwk1', cdr_labeling_scheme='imgt')
Returns a mask for all regions EXCEPT the one you specify. You can then use this mask as input to score_seqs to see what the score would be if only that region were included.
- Parameters:
chain_type (str) – One of “H”, “L”.
region (str) – One of ‘fmwk1’, ‘fmwk2’, ‘fmwk3’, ‘fmwk4’, ‘cdr1’, ‘cdr2’, ‘cdr3’, ‘fmwk’, ‘cdr’. This function will construct a mask that excludes all other regions.
cdr_labeling_scheme (str) – The numbering scheme used for humanness calculations is IMGT, but for generating a mask, you can use a different scheme to assign CDRs if desired. This value can be one of ‘aho’, ‘imgt’, ‘kabat’, ‘martin’.
- Returns:
mask (list) – A numpy array of the same length as the list returned by “get_standard_positions()”. Only positions marked True will be used when scoring a sequence. You can further modify this mask if needed and can pass it to “score_seqs” to use when scoring sequences.
- Raises:
ValueError – A ValueError is raised if unexpected inputs are supplied.
- get_standard_positions(chain_type: str)
Returns a list of the standard positions used by SAM when scoring a sequence. IMGT numbered positions outside this set are ignored when assigning a score.
- retrieve_cluster(cluster_id: int, chain_type: str)
A convenience function to get the per-position probabilities associated with a particular cluster.
- Parameters:
cluster_id (int) – The id number of the cluster to retrieve. Can be generated by calling self.get_closest_clusters or self.score_seqs with mode = “assign”.
chain_type (str) – One of “H”, “L”.
- Returns:
mu_mix (np.ndarray) – An array of shape (1, sequence_length, 21), where 21 is the number of possible AAs.
mixweights (float) – The probability of this cluster in the mixture.
aas (list) – A list of amino acids in standard order. The last dimension of mu_mix corresponds to these aas in the order given.
- score_seqs(seq_list, custom_light_mask: list | None = None, custom_heavy_mask: list | None = None, mask_terminal_dels: bool = False, mask_gaps: bool = False, mode: str = 'score')
Scores a list of sequences in batches or assigns them to clusters. Can be used in conjunction with a user-supplied mask (for positions to ignore).
- Parameters:
seq_list (list) – The list of input sequences. May contain both heavy and light.
custom_light_mask (list) – Either None or an array generated by self.get_standard_mask indicating positions to ignore. This can be useful if you just want to score a specific region, or if there is a large deletion that should be ignored.
custom_heavy_mask (list) – Either None or an array generated by self.get_standard_mask indicating positions to ignore. This can be useful if you just want to score a specific region, or if there is a large deletion that should be ignored.
mask_terminal_dels (bool) – If True, N and C-terminal deletions are masked when calculating a score or assigning to a cluster. Useful if there are large unusual deletions at either end of the sequence that you would like to ignore when scoring.
mask_gaps (bool) – If True, all non-filled IMGT positions in the sequence are ignored when calculating the score. This is useful when your sequence has unusual deletions and you would like to ignore these.
mode (str) – One of ‘score’, ‘assign’, ‘assign_no_weights’, ‘classifier’. If score, returns the human generative model score. If ‘assign’, provides the most likely cluster number for each input sequence. If ‘assign_no_weights’, assigns the closest cluster ignoring mixture weights, so that the closest cluster is assigned even if that cluster is a low-probability one. If ‘classifier’, assigns a score using the Bayes’ rule classifier, which also takes into account some info regarding other species. ‘classifier’ is not a good way to score sequences in general because it only works well for sequences of known origin, so it should only be used for testing.
- Returns:
output_scores (np.ndarray) – log( p(x) ) for all input sequences.
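The methods above can be combined into small helpers. The sketch below uses only the signatures documented on this page and takes an already constructed SequenceScoringTool as its first argument; score_region and has_unusual_insertions are hypothetical helper names, and the code assumes get_diagnostic_info returns its three values as a tuple in the documented order.

```python
def score_region(tool, seqs, chain_type, region="cdr3"):
    """Score only `region` for sequences that all share `chain_type`."""
    # get_standard_mask raises ValueError for unexpected inputs.
    mask = tool.get_standard_mask(chain_type, region=region)
    if chain_type == "H":
        return tool.score_seqs(seqs, custom_heavy_mask=mask)
    return tool.score_seqs(seqs, custom_light_mask=mask)

def has_unusual_insertions(tool, seq):
    """True if the sequence has insertions that are ignored when scoring."""
    # Assumed return order: gapped positions, unusual positions, chain name.
    _gapped, unusual, _chain = tool.get_diagnostic_info(seq)
    return len(unusual) > 0
```

With a tool created as shown at the top of this page, score_region(scoring_tool, heavy_seqs, "H", region="cdr3") would return one score per sequence in heavy_seqs, where heavy_seqs is your own list of heavy chain sequences. A True result from has_unusual_insertions means the score may be less reliable.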
Humanizing sequences with AntPack
AntPack can also suggest mutations to “humanize” a sequence, i.e. make it more human. You can manually humanize a sequence by 1) scoring it using AntPack, 2) retrieving the closest clusters (see the “Generating new human sequences” page), 3) determining which regions of the sequence are least human, 4) mutating these and 5) rescoring the sequence. This can, however, be a little tedious. Humanization involves a tradeoff between improving the humanness score and preserving the original sequence, and AntPack offers an easy way to automatically choose mutations that lie at a selected location along that tradeoff curve.
Regardless of which approach you take, we suggest you carefully review suggested mutations and re-score the altered sequence to be sure that the changes that have been made are compatible with your objectives.
To automatically humanize a sequence, start by creating a SequenceScoringTool as shown above. (This is a significant change in AntPack v0.4; prior versions had a separate HumanizationTool.) Next, feed it the sequence you’d like to humanize using the following function:
- class antpack.SequenceScoringTool(offer_classifier_option=False, normalization='none', max_threads=2)
Tool for scoring sequences.
- suggest_humanizing_mutations(seq: str, excluded_positions: list = [], s_thresh: float = 1.25)
Takes an input sequence, scores it per position, uses the nclusters closest clusters to determine which modifications would be most likely to have an impact, suggests mutations, and reports both the mutations and the new score. CDRs are excluded, together with user-specified excluded positions.
- Parameters:
seq (str) – The sequence to update.
s_thresh (float) – The maximum percentage by which the score can shift before backmutation stops. Smaller values (closer to 1) will prioritize increasing the score over preserving the original sequence. Larger values will prioritize preserving the original sequence.
excluded_positions (list) – A list of strings (IMGT position numbers) indicating positions which should not be changed. This enables the user to mask key residues, Vernier zones etc if so desired.
cdr_labeling_scheme (str) – The sequence is numbered using the IMGT scheme, but to determine which positions are CDR, you can use ‘aho’, ‘kabat’, ‘imgt’, ‘martin’ or ‘north’ by supplying an appropriate argument here.
- Returns:
initial_score (float) – The score of the sequence pre-modification.
final_scores (float) – The scores of the sequence after each mutation is adopted (in sequential order).
mutations (list) – The suggested mutations in AA_position_newAA format, where position is the IMGT number for the mutation position. These are in sequential order (the same as final_scores).
updated_seq (list) – The updated sequences after each mutation with all gaps removed. (This may be a different length from the input sequence if the suggested mutation is a deletion or an insertion).
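Since the mutations come back as strings in AA_position_newAA format, it can be handy to split them into their parts when reviewing them. This is a minimal sketch that assumes the three fields are separated by underscores exactly as the format name suggests; parse_mutation is a hypothetical helper and the example strings are made up.

```python
def parse_mutation(mutation):
    """Split 'AA_position_newAA' into (original_aa, imgt_position, new_aa)."""
    parts = mutation.split("_")
    if len(parts) != 3:
        raise ValueError(f"unexpected mutation format: {mutation!r}")
    original_aa, position, new_aa = parts
    # The position stays a string because IMGT positions may carry a letter.
    return original_aa, position, new_aa

# Hypothetical mutation strings, for illustration only.
mutations = ["K_43_Q", "T_87_S"]
parsed = [parse_mutation(m) for m in mutations]
```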
The “knob” you can turn here is s_thresh. A value <= 1 means that
AntPack will basically do a straight in silico CDR graft. This is
straightforward but may lose affinity. A value > 1 means that AntPack will
1) score the straight in silico CDR graft of your sequence, which will
generally achieve a very high humanness score (the “optimal score”), then
2) revert as many positions as possible to the original sequence without
causing the score to go over s_thresh * optimal score. Larger values of s_thresh
cause more of the original sequence to be preserved at the expense of
a smaller improvement in humanness.
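The effect of s_thresh can be illustrated with a toy back-mutation loop. This is a sketch of the idea only and not AntPack's algorithm: the score is treated as a positive cost where lower means more human (so that "going over" the limit is a plain comparison), positions are tried left to right, and both backmutate and toy_cost are hypothetical.

```python
def backmutate(original, grafted, cost_fn, s_thresh=1.25):
    """Revert grafted positions to the original while cost <= s_thresh * optimal."""
    optimal = cost_fn(grafted)       # cost of the straight graft
    limit = s_thresh * optimal
    current = list(grafted)
    for i, (orig_aa, graft_aa) in enumerate(zip(original, grafted)):
        if orig_aa == graft_aa:
            continue                 # nothing to revert at this position
        trial = current.copy()
        trial[i] = orig_aa
        if cost_fn(trial) <= limit:
            current = trial          # reversion is affordable, keep it
    return "".join(current)

def toy_cost(seq):
    """Toy cost: a baseline of 10 plus 2 for every residue that is not 'A'."""
    return 10 + 2 * sum(aa != "A" for aa in seq)

original, grafted = "BBBB", "AAAA"
strict = backmutate(original, grafted, toy_cost, s_thresh=1.0)
default = backmutate(original, grafted, toy_cost, s_thresh=1.25)
loose = backmutate(original, grafted, toy_cost, s_thresh=1.5)
```

In this toy setup a threshold of 1.0 keeps the straight graft, 1.25 allows one position to revert and 1.5 allows two, which mirrors the behavior described above: larger values preserve more of the original sequence.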
Kabat-defined CDRs are excluded from humanization. If you want to exclude additional key positions, you can do this by passing a list of IMGT-numbered positions you would like to be excluded from consideration.