Clustering antibody sequences in AntPack
For clustering small datasets (a few dozen to a few thousand sequences), AntPack makes it easy to construct a distance matrix using Hamming distance for any specified subregion of your sequences (for example, the framework, the CDRs, or a specific framework or CDR region) or the full sequence if needed. To do this, use the build_distance_matrix method of the SingleChainAnnotator and PairedChainAnnotator tools used for numbering:
- class antpack.SingleChainAnnotator(chains=['H', 'K', 'L'], scheme='imgt')
- __init__(chains=['H', 'K', 'L'], scheme='imgt')
Class constructor.
- Parameters:
chains (list) – A list of chains. Each must EITHER be one of “H”, “K”, “L” for antibodies or one of “A”, “B”, “D”, “G” for TCRs. If [“H”, “K”, “L”] (default) or [“A”, “B”, “D”, “G”] the annotator will automatically determine the most appropriate chain type for each input sequence. You cannot supply a mixture of TCR and antibody chains (e.g. [“H”, “A”]) – the list you supply must contain either TCR or antibody chains but not both.
scheme (str) – The numbering scheme. Must be one of “imgt”, “martin”, “kabat”, “aho”. If TCR chains are supplied, only “imgt” is accepted.
- Raises:
ValueError – A ValueError is raised if unacceptable inputs are supplied.
- build_distance_matrix
Takes as input a group of numbered sequences that have been converted to an MSA (most easily done by calling build_msa) and uses them to populate a distance matrix that you supply. The distance can be computed over the framework regions, the CDRs, a specific framework or CDR region, or the whole sequence. You can then pass this distance matrix to any clustering function from another library that accepts distance matrices as input (e.g. scipy or scikit-learn). This procedure scales poorly to large datasets; for more than roughly 5,000 sequences, prefer another technique.
If you want to cluster heavy and light chains at the same time, just generate a distance matrix for each and sum the distance matrices.
- Parameters:
distance_matrix (ndarray) – A numpy array of shape (# sequences, # sequences). Must be of type float64. The array will be modified in-place and all entries will be overwritten.
aligned_seqs (list) – A list of sequences that are all the same length and thus constitute an MSA. You can obtain this by calling build_msa.
position_codes (list) – The numbering / position codes for the aligned sequences. You can obtain this by calling build_msa.
chain (str) – A valid chain (e.g. ‘H’, ‘K’, ‘L’, ‘A’). The assigned chain is the third element of the tuple returned by analyze_seq. For this function only, ‘K’ and ‘L’ are equivalent since they both refer to a light chain, so if your chain is light you can supply either for the same result.
region (str) – One of “fmwk1”, “fmwk2”, “fmwk3”, “fmwk4”, “cdr1”, “cdr2”, “cdr3”, “fmwk”, “cdr” or “all”. “fmwk” and “cdr” will use all frameworks and all cdrs respectively, while “all” uses the whole sequence, and the other codes use the indicated framework region or cdr.
scheme (str) – Either “” or a valid scheme. If “” (default), the scheme that is used is the same as the one selected when the annotator was constructed. Using a different scheme can enable you to “cross-assign” CDRs and number with one scheme while assigning CDRs with another. So if you create an annotator with “imgt” as the scheme then you are numbering using “imgt”, but by passing e.g. “kabat” to this function, you can use the kabat CDR definitions instead of the IMGT ones. Valid schemes for this function only are ‘imgt’, ‘aho’, ‘kabat’, ‘martin’, ‘north’. For TCRs only “” and “imgt” are accepted.
Because distance matrices scale quadratically in size and construction time with the number of sequences, this method is recommended only for datasets up to a few thousand sequences. Once you’ve built a distance matrix, you can supply it to a variety of scipy and scikit-learn functions to cluster and visualize your sequences using a variety of methods. See the Clustering examples on the main page of the docs, and the sketch below, to see how to do this.
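As a minimal sketch of this small-dataset workflow, the example below pairs build_distance_matrix with scipy’s hierarchical clustering. The three toy sequences (a VH and two point mutants of it) are stand-ins for your own data, and we assume here that build_msa takes the sequences together with their analyze_seq annotations and returns (position_codes, aligned_seqs); check build_msa’s documentation for the exact signature in your installed version.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

from antpack import SingleChainAnnotator

# Toy data: a heavy chain variable region and two point mutants of it.
vh = ("EVQLVESGGGLVQPGGSLRLSCAASGFNIKDTYIHWVRQAPGKGLEWVARIYPTNGYTRY"
      "ADSVKGRFTISADTSKNTAYLQMNSLRAEDTAVYYCSRWGGDGFYAMDYWGQGTLVTVSS")
seqs = [vh, vh.replace("DTYIH", "DTYIY"), vh.replace("YAMDY", "YAMDV")]

aligner = SingleChainAnnotator(chains=["H"], scheme="imgt")
annotations = [aligner.analyze_seq(s) for s in seqs]

# Align the numbered sequences to a shared set of position codes.
position_codes, aligned_seqs = aligner.build_msa(seqs, annotations)

# The float64 distance matrix is filled in-place; here we measure
# Hamming distance over the CDRs only.
dmat = np.zeros((len(seqs), len(seqs)))
aligner.build_distance_matrix(dmat, aligned_seqs, position_codes,
                              chain="H", region="cdr")

# Any distance-matrix-based clustering works from here, e.g. scipy's
# agglomerative clustering on a condensed distance matrix.
cluster_ids = fcluster(linkage(squareform(dmat), method="average"),
                       t=2, criterion="maxclust")
print(cluster_ids)
```

To cluster paired heavy and light chains together, you would build one matrix per chain as above and sum the two matrices before handing the result to scipy.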
For large datasets, AntPack currently offers one highly scalable option (more are coming soon). The EMCategoricalMixture is designed to use multithreading and to run on datasets too large to load into memory; it can easily cluster datasets ranging from a few hundred to tens of millions of sequences. It is a mixture model that assigns a probability to each amino acid at each position in each cluster. As a probabilistic model, it can therefore calculate the probability that a new sequence could have come from its training data, or the probability of a specific mutation given its training set. As with distance matrix construction, you can cluster using the whole sequence or any selected subregion.
Choosing the number of clusters for the EMCategoricalMixture can be a little tricky. Currently it provides [BIC](https://en.wikipedia.org/wiki/Bayesian_information_criterion) and [AIC](https://en.wikipedia.org/wiki/Akaike_information_criterion), and you can select the number of clusters by finding the number that gives the lowest BIC and AIC. (For an example of how to do this, and of how to use the EMCategoricalMixture more generally, see the Clustering examples on the main page of the docs.) Note, however, that on small datasets AIC and BIC tend to select a number of clusters that is too small (e.g. 1!).

It is also fine to start with more clusters than needed, because the EMCategoricalMixture eliminates empty clusters during fitting, so any surplus clusters are simply killed off. In future versions we will likely add a procedure to auto-select the number of clusters so that no manual selection is required.
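As an illustration of this BIC scan (using synthetic equal-length toy sequences in place of a real MSA), you might fit a small grid of cluster counts and keep whichever minimizes BIC:

```python
import random
from antpack import EMCategoricalMixture

# Synthetic MSA: random point mutants of two equal-length "parent"
# sequences, standing in for a real aligned dataset.
random.seed(0)
parents = ["EVQLVESGGGLVQPGGSLRLSCAAS", "QVQLQQSGAELVRPGASVKLSCKAS"]

def mutate(seq):
    pos = random.randrange(len(seq))
    return seq[:pos] + random.choice("ACDEFGHIKLMNPQRSTVWY") + seq[pos + 1:]

aligned_seqs = [mutate(p) for p in parents for _ in range(100)]

# Fit over a small grid of cluster counts; keep the BIC for each.
bic_by_k = {}
for k in (2, 4, 8):
    model = EMCategoricalMixture(n_components=k,
                                 sequence_length=len(aligned_seqs[0]),
                                 verbose=False)
    model.fit(sequences=aligned_seqs)
    bic_by_k[k] = model.BIC(sequences=aligned_seqs)

best_k = min(bic_by_k, key=bic_by_k.get)
print(bic_by_k, "-> best:", best_k)
```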
EMCategoricalMixture is designed to use multithreading to process multiple batches of data in parallel during fitting; setting max_threads > 1 automatically enables multithreading and is highly recommended if you’re fitting a large dataset. If you have a large dataset (e.g. millions of sequences) in a fasta file or gzipped fasta file and don’t want to load it into memory, EMCategoricalMixture can encode it to a set of temporary files in a location you specify by calling encode_fasta. You supply the list of these temporary files to the fit function and EMCategoricalMixture will take care of the rest; just remember to delete the temporary files once you’re done.
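Here is a sketch of that out-of-memory workflow. The fasta filename, sequence length and cluster count are placeholders; substitute your own MSA (all sequences the same length):

```python
import os
import tempfile
from antpack import EMCategoricalMixture

# Placeholder inputs: a gzipped fasta MSA and its sequence length.
model = EMCategoricalMixture(n_components=100, sequence_length=130,
                             max_threads=8)

# Encode the fasta file to uint8 numpy chunks on disk, then fit from disk.
tmp_dir = tempfile.mkdtemp()
chunk_files = model.encode_fasta("my_big_msa.fasta.gz", tmp_dir)
model.fit(filepaths=chunk_files)
print(model.BIC(filepaths=chunk_files))

# The encoded chunks are only needed while fitting / scoring.
for fpath in chunk_files:
    os.remove(fpath)
```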
- class antpack.EMCategoricalMixture(n_components: int, sequence_length: int = 0, numbering: list | None = None, chain_type: str = 'H', numbering_scheme: str = 'imgt', cdr_scheme: str = 'imgt', region: str = 'all', max_threads: int = 2, verbose: bool = True)
A categorical mixture model that is fitted using the EM algorithm.
- AIC(sequences: list | None = None, filepaths: list | None = None)
Calculates the AIC or Akaike information criterion for a fitted model.
- Parameters:
sequences (list) – A list of sequences that are all the same length (i.e. an MSA). If None, you must supply a list of filepaths where data is stored (next argument).
filepaths (list) – Either None or a list of filepaths, each of which is a 2d numpy array of type uint8 containing encoded sequence data. The easiest way to generate these is to call encode_fasta below, which will take an input fasta file (possibly compressed) and convert it to numpy arrays on disk. This is useful when your dataset is too large to load into memory. If this argument is None, sequences cannot also be None.
- Returns:
aic (float) – The AIC for the input dataset.
- BIC(sequences: list | None = None, filepaths: list | None = None)
Calculates the BIC or Bayesian information criterion for a fitted model.
- Parameters:
sequences (list) – A list of sequences that are all the same length (i.e. an MSA). If None, you must supply a list of filepaths where data is stored (next argument).
filepaths (list) – Either None or a list of filepaths, each of which is a 2d numpy array of type uint8 containing encoded sequence data. The easiest way to generate these is to call encode_fasta below, which will take an input fasta file (possibly compressed) and convert it to numpy arrays on disk. This is useful when your dataset is too large to load into memory. If this argument is None, sequences cannot also be None.
- Returns:
bic (float) – The Bayesian information criterion for the input dataset.
- __init__(n_components: int, sequence_length: int = 0, numbering: list | None = None, chain_type: str = 'H', numbering_scheme: str = 'imgt', cdr_scheme: str = 'imgt', region: str = 'all', max_threads: int = 2, verbose: bool = True)
Constructor.
- Parameters:
n_components (int) – The number of clusters to use.
sequence_length (int) – The length of the input sequences. This can be set to 0 if a list of position codes is supplied as the “numbering” argument; if “numbering” is None, however, this must be an integer greater than 0.
numbering (list) – A list of position codes for a numbering scheme. If your sequences are an MSA then this numbering list would be the list of position codes for the MSA (use SingleChainAnnotator’s build_msa function to see an example). This argument does not have to be supplied. If it is None however you must supply a number for sequence length. The advantage to supplying numbering is that you can use it to cluster just one region of your input sequences (e.g. the CDRs) if desired.
chain_type (str) – The chain this clustering will be for (e.g. one of “H”, “K”, “L”, “A” etc.) This argument is ignored and is not required if numbering is None.
numbering_scheme (str) – A valid numbering scheme; one of ‘imgt’, ‘aho’, ‘martin’, ‘kabat’. If you supply a numbering list and a region (see below) this will be used to determine what region should be extracted; otherwise it is ignored.
cdr_scheme (str) – A valid scheme for assigning CDRs; one of ‘imgt’, ‘aho’, ‘martin’, ‘kabat’, ‘north’. If you supply a numbering list and a region (see below) this will be used to determine what region should be extracted; otherwise it is ignored. This value can be different from “numbering_scheme” to allow you to use e.g. Kabat or North CDR definitions with IMGT or Martin numbering.
region (str) – If “all”, the full input sequences are clustered. Otherwise you can supply “fmwk1”, “fmwk2”, “fmwk3” or “fmwk4” to cluster a specific framework region, “cdr1”, “cdr2” or “cdr3” to cluster a specific cdr, “fmwk” to cluster all framework regions, or “cdr” to cluster all cdrs; the model will then extract just that region during fitting and prediction. If this argument is NOT “all”, you must supply the ‘numbering’, ‘numbering_scheme’, ‘chain_type’ and ‘cdr_scheme’ arguments (see the sketch after this parameter list).
max_threads (int) – The maximum number of threads to use. The tool will use up to this number of threads wherever it makes sense to do so.
verbose (bool) – If True, print loss on every fitting iteration.
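For instance, to cluster on heavy-chain CDR3 only, reusing the position_codes and aligned_seqs produced by build_msa in the distance matrix sketch earlier on this page (substitute your own data), a minimal sketch might look like:

```python
from antpack import EMCategoricalMixture

# Supplying numbering lets the model extract just one region; here we use
# IMGT numbering with Kabat CDR definitions, clustering CDR3 only.
model = EMCategoricalMixture(n_components=2, numbering=position_codes,
                             chain_type="H", numbering_scheme="imgt",
                             cdr_scheme="kabat", region="cdr3")
model.fit(sequences=aligned_seqs)
print(model.predict(aligned_seqs))
```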
- encode_fasta(fasta_filepath, temporary_dir, chunk_size=10000)
This function takes all of the sequences in fasta_filepath and saves them under temporary_dir as 2d numpy arrays of type uint8, with chunk_size sequences per array, then returns the list of filepaths. You can supply this list of filepaths to fit, BIC and AIC, then os.remove the filepaths when you are done fitting. This is a convenient way to fit the model on datasets that are too large to load into memory. The sequences must all be the same length, i.e. should be an MSA.
Expect this to take (#sequences * sequence_length) / 1e9 GB of disk space; for example, 50 million sequences of length 150 will take 7.5 GB.
This method is preferred to loading the sequences from the fasta file on each iteration because it allows us to greatly accelerate fitting (the model will load and process up to max_threads numpy files at a time).
- Parameters:
fasta_filepath (str) – The location of the fasta file. It can be a gzipped file or uncompressed.
temporary_dir (str) – A path to a (probably) temporary directory where the encoded sequences can be saved.
chunk_size (int) – The maximum number of sequences per numpy file. The model will try to load up to max_threads * chunk_size sequences at a time during fitting, so do not set this too high to avoid excess memory consumption. For a reasonable number of threads, 10000 is likely fine.
- Returns:
numpy_filepaths (list) – A list of absolute filepaths where the temporary numpy files with encoded sequences for fitting are stored. You can directly supply this list to fit, bic, aic. Once you are done fitting these files can safely be deleted.
- Raises:
RuntimeError – An error will be raised if invalid sequences (all different lengths, unexpected characters etc) are found in the fasta file.
- fit(sequences: list | None = None, filepaths: list | None = None, max_iter: int = 150, tol: float = 0.001, n_restarts: int = 3, random_state: int = 123, prune_after_fitting: bool = True)
Fits the mixture model to either a list of sequences that are all the same length (i.e. an MSA) OR a list of filepaths for encoded sequence data saved as npy files (see encode_fasta).
- Parameters:
sequences (list) – Either None or a list of sequences that are all the same length (i.e. an MSA). If None, you must supply a list of filepaths where data is stored (next argument).
filepaths (list) – Either None or a list of filepaths, each of which is a 2d numpy array of type uint8 containing encoded sequence data. The easiest way to generate these is to call encode_fasta above, which will take an input fasta file (possibly compressed) and convert it to numpy arrays on disk. This is useful when your dataset is too large to load into memory. If this argument is None, sequences cannot also be None.
max_iter (int) – The maximum number of iterations to run per restart.
tol (float) – If the loss changes by less than this amount, assume convergence and end the restart.
n_restarts (int) – The number of times to re-run fitting with randomly reinitialized weights.
random_state (int) – The random seed.
prune_after_fitting (bool) – If True, empty clusters are removed at the end of fitting. This will change the n_components of the model. Call self.get_model_specs() to get the new n_components value.
- get_model_parameters()
Returns the cluster parameters followed by the mixture weights as two numpy arrays.
- get_model_specs()
Returns n_components, sequence_length and the maximum number of allowed aas (generally 21).
- load_params(mu_mix, mix_weights)
Loads user-supplied parameters (e.g. saved from a previously fitted model) so the model can be used without re-fitting.
- Parameters:
mu_mix (ndarray) – The mixture model parameters.
mix_weights (ndarray) – The mixture weights.
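Together, these methods make it possible to persist a fitted model. A minimal sketch, assuming a model constructed with a plain sequence_length and fitted as in the examples above:

```python
import numpy as np
from antpack import EMCategoricalMixture

# Save the fitted parameters to disk.
mu_mix, mix_weights = model.get_model_parameters()
np.save("mu_mix.npy", mu_mix)
np.save("mix_weights.npy", mix_weights)

# Later, rebuild a model with matching specs and load the parameters.
# get_model_specs reflects any pruning of empty clusters during fitting.
n_components, sequence_length, _ = model.get_model_specs()
restored = EMCategoricalMixture(n_components=n_components,
                                sequence_length=sequence_length)
restored.load_params(np.load("mu_mix.npy"), np.load("mix_weights.npy"))
```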
- predict(sequences, mask=None, mask_terminal_dels=False, mask_gaps=False, use_mixweights=True)
Determine the most probable cluster for each datapoint in a numpy array. Note that you should also check the overall probability of each datapoint. If a datapoint is very different from your training set, it will have very low overall probability, but this function will still assign it to the most likely cluster – whichever that is – by default.
- Parameters:
sequences (list) – A list of sequences that are all the same length (i.e. an MSA).
mask (np.ndarray) – Either None or a numpy array of type bool with one element per sequence position. If not None, indicated positions are masked, i.e. any position marked False is ignored.
mask_gaps (bool) – If True, all non-filled IMGT positions in the sequence are ignored when calculating the score. This is useful when your sequence has unusual deletions and you would like to ignore these.
mask_terminal_dels (bool) – If True, ignore N- and C-terminal deletions when assigning to a cluster.
use_mixweights (bool) – If True, take mixture weights into account; otherwise, find the closest cluster (even if it is a low-probability cluster).
- Returns:
preds (np.ndarray) – An array of shape (# sequences) containing a number from 0 to self.n_components - 1 indicating the predicted cluster for each datapoint.
- Raises:
RuntimeError – Raised if unexpected inputs are supplied.
- predict_proba(sequences, mask=None, mask_terminal_dels=False, mask_gaps=False, use_mixweights=True)
Determine the probability of each datapoint for each cluster in the input array.
- Parameters:
sequences (list) – A list of sequences that are all the same length (i.e. an MSA).
mask (np.ndarray) – Either None or a numpy array of type bool with one element per sequence position. If not None, indicated positions are masked, i.e. any position marked False is ignored.
mask_gaps (bool) – If True, all non-filled IMGT positions in the sequence are ignored when calculating the score. This is useful when your sequence has unusual deletions and you would like to ignore these.
mask_terminal_dels (bool) – If True, ignore N- and C-terminal deletions when assigning to a cluster.
use_mixweights (bool) – If True, take mixture weights into account; otherwise, find the closest cluster (even if it is a low-probability cluster).
- Returns:
proba (np.ndarray) – An array of shape (n_components, # sequences) indicating the probability of each cluster for each datapoint.
- Raises:
RuntimeError – Raised if unexpected inputs are supplied.
- score(sequences, mask=None, mask_terminal_dels=False, mask_gaps=False, normalize_scores=False)
Generate the overall log-likelihood of individual datapoints. This is very useful to determine if a new datapoint is very different from the training set. If the log-likelihood of the new datapoints is much lower than the training set distribution of log-likelihoods, it is fairly unlikely that the new datapoint is a sample from the distribution represented by the model, and you should not try to assign it to a cluster.
- Parameters:
sequences (list) – A list of sequences that are all the same length (i.e. an MSA).
mask (np.ndarray) – Either None or a numpy array of type bool with one element per sequence position. If not None, indicated positions are masked, i.e. any position marked False is ignored.
mask_terminal_dels (bool) – If True, ignore N- and C-terminal deletions when calculating a score.
mask_gaps (bool) – If True, all non-filled IMGT positions in the sequence are ignored when calculating the score. This is useful when your sequence has unusual deletions and you would like to ignore these.
normalize_scores (bool) – If True, normalize the score by dividing by the number of non-masked residues in the input.
- Returns:
loglik (np.ndarray) – A float64 array of shape (# sequences) where each element is the log-likelihood of that datapoint given the model.
- Raises:
RuntimeError – Raised if unexpected inputs are supplied.
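Putting score and predict together, here is a sketch of the recommended pattern: score the training data to get a reference distribution, then only trust cluster assignments for new sequences whose log-likelihood is not far below it. It reuses model and aligned_seqs from the BIC sketch above, with a slice of the training data standing in for genuinely new sequences, and the percentile cutoff is purely illustrative.

```python
import numpy as np

# Reference distribution of log-likelihoods over the training MSA.
train_loglik = model.score(aligned_seqs)
cutoff = np.percentile(train_loglik, 1)  # an illustrative threshold

new_seqs = aligned_seqs[:5]  # stand-in for genuinely new sequences
new_loglik = model.score(new_seqs)
cluster_ids = model.predict(new_seqs)

for seq, loglik, cid in zip(new_seqs, new_loglik, cluster_ids):
    if loglik < cutoff:
        print(f"{seq}: unlike the training set; not assigning a cluster")
    else:
        print(f"{seq}: cluster {cid}")
```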