Clustering antibody sequences in AntPack
===============================================

For clustering small datasets (a few dozen to a few thousand sequences),
AntPack makes it easy to construct a distance matrix using Hamming distance
for any specified subregion of your sequences (for example, the framework,
the CDRs, or a specific framework or CDR region) or for the full sequence
if needed. To do this, use the ``build_distance_matrix`` method of the
``SingleChainAnnotator`` and ``PairedChainAnnotator`` tools used for numbering:

.. autoclass:: antpack.SingleChainAnnotator
   :special-members: __init__
   :members: build_distance_matrix

Because a distance matrix for n sequences scales as n^2 in both size and
construction time, this method is recommended only for datasets of up to a
few thousand sequences. Once you've built a distance matrix, you can supply
it to a variety of Scipy and scikit-learn functions to cluster and visualize
your data. See the Clustering examples on the main page of the docs to see
how to do this.

For large datasets, AntPack currently offers just one *highly* scalable option
(other options coming soon). The ``EMCategoricalMixture`` is designed to use
multithreading and to run on datasets too large to load into memory; it can
easily cluster datasets ranging from a few hundred to tens of millions of
sequences. It is a mixture model which assigns a probability to each amino
acid at each position in each cluster. As such, it is a probabilistic model
that can calculate the probability that a new sequence could have come from
its training data, or the probability of a specific mutation given its
training set. As with distance matrix construction, you can cluster using
the whole sequence or any selected subregion.

Choosing the number of clusters for the EMCategoricalMixture can be a little tricky.
Currently it provides `BIC <https://en.wikipedia.org/wiki/Bayesian_information_criterion>`_
and `AIC <https://en.wikipedia.org/wiki/Akaike_information_criterion>`_,
and you can use these to select the number of clusters by finding the number
that gives the lowest BIC and AIC. (For an example of how to do this, and of
how to use the EMCategoricalMixture more generally, see the Clustering
examples on the main page of the docs.) If you are clustering a small dataset,
however, AIC and BIC will tend to select a number of clusters that is too
small (e.g. 1!). Note also that it's fine to specify more clusters than
needed: the EMCategoricalMixture eliminates empty clusters during fitting, so
unneeded clusters are removed. In future versions we will likely add a
procedure to auto-select the number of clusters so that no manual selection
is required.

``EMCategoricalMixture`` is designed to use multithreading to process multiple
batches of data in parallel during fitting; setting ``max_threads`` to a value
greater than 1 automatically enables multithreading and is highly recommended
if you're fitting a large dataset.

If you have a large dataset (e.g. millions of sequences) in a fasta file or
gzipped fasta file and don't want to load it into memory,
``EMCategoricalMixture`` can encode it to a set of temporary files in a
location you specify by calling ``encode_fasta_file``. You supply the list of
these temporary files to the ``fit`` function and ``EMCategoricalMixture``
will take care of the rest. Just remember to delete the temporary files once
you're done.

.. autoclass:: antpack.EMCategoricalMixture
   :special-members: __init__
   :members: get_model_parameters, load_params, get_model_specs, fit, BIC, AIC, predict, predict_proba, score, encode_fasta
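
To make the idea of a Hamming distance matrix concrete, here is a minimal pure-Python sketch. It is not AntPack's implementation and does not call ``build_distance_matrix``; ``hamming`` and ``hamming_distance_matrix`` are hypothetical helpers, the toy sequences are made up, and the sequences are assumed to be already aligned to equal length (as numbered sequences restricted to one region would be). Treating a gap as an ordinary character is also an assumption of this sketch.

```python
def hamming(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Sequences must be aligned to the same length.")
    return sum(x != y for x, y in zip(a, b))


def hamming_distance_matrix(seqs):
    """Return a symmetric n x n list of lists of pairwise Hamming distances."""
    n = len(seqs)
    dmat = [[0] * n for _ in range(n)]
    # Only the upper triangle is computed; the matrix is symmetric.
    for i in range(n):
        for j in range(i + 1, n):
            d = hamming(seqs[i], seqs[j])
            dmat[i][j] = d
            dmat[j][i] = d
    return dmat


# Toy, made-up aligned fragments ('-' marks a gap position).
toy_seqs = ["ARDYW", "ARDFW", "AKD-W", "TRDYW"]
dmat = hamming_distance_matrix(toy_seqs)
for row in dmat:
    print(row)
```

The nested loop visits every pair of sequences, which is where the n^2 cost in time and memory comes from. A square matrix of this kind is the sort of input that distance-based clustering routines in Scipy and scikit-learn accept.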
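
The selection rule described above (pick the number of clusters with the lowest BIC or AIC) can be sketched using the textbook definitions BIC = k ln(n) - 2 ln(L) and AIC = 2k - 2 ln(L), where k is the number of free parameters, n the number of sequences and L the maximized likelihood. This sketch does not call AntPack: the helper functions are hypothetical, the log-likelihood values are invented purely for illustration, and both the parameter-counting scheme and the alphabet size of 21 (20 amino acids plus a gap) are assumptions that may differ from what ``EMCategoricalMixture`` does internally.

```python
import math


def n_free_params(n_clusters, seq_len, n_symbols=21):
    """Free parameters of a categorical mixture (assumed counting scheme):
    each cluster has seq_len distributions with n_symbols - 1 free
    probabilities each, plus n_clusters - 1 free mixture weights."""
    return n_clusters * seq_len * (n_symbols - 1) + (n_clusters - 1)


def bic(log_likelihood, n_params, n_datapoints):
    """BIC = k * ln(n) - 2 * ln(L)."""
    return n_params * math.log(n_datapoints) - 2.0 * log_likelihood


def aic(log_likelihood, n_params):
    """AIC = 2 * k - 2 * ln(L)."""
    return 2.0 * n_params - 2.0 * log_likelihood


def select_n_clusters(scores):
    """Return the cluster count with the lowest criterion value."""
    return min(scores, key=scores.get)


# Hypothetical total log-likelihoods for 200 sequences of length 120,
# invented purely to exercise the functions above.
toy_loglik = {1: -21000.0, 2: -19500.0, 5: -17000.0, 10: -16000.0}
n_seqs, seq_len = 200, 120

bic_scores = {k: bic(ll, n_free_params(k, seq_len), n_seqs)
              for k, ll in toy_loglik.items()}
aic_scores = {k: aic(ll, n_free_params(k, seq_len))
              for k, ll in toy_loglik.items()}

best_bic = select_n_clusters(bic_scores)
best_aic = select_n_clusters(aic_scores)
print("Lowest BIC at n_clusters =", best_bic)
print("Lowest AIC at n_clusters =", best_aic)
```

With these toy numbers both criteria pick a single cluster, even though the log-likelihood keeps improving as clusters are added. Each extra cluster adds thousands of parameters, so with only 200 sequences the penalty term dominates, which is consistent with the caveat about small datasets.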