Database search and clustering tutorial
=========================================

AntPack can quickly construct a local database for antibodies and TCRs
(in just a few minutes for 10 million sequences). The resulting database
can be searched efficiently for any sequence that has CDRs similar to a
query, where similarity is defined using percent difference (and,
optionally, a maximum per-residue BLOSUM distance). The search can be
further restricted to capture only sequences that have the same v-gene
family or the same v-gene assignment, and that have the same j-gene
family as well.

Databases can be constructed from a variety of inputs. You can build one
using:

* a list of fasta files where the sequences have not been numbered yet.
  In this case AntPack will have to number them, which may slow database
  construction down somewhat;
* a list of csv files where specific columns contain certain information
  (e.g. the heavy and light chain sequences, the v-gene and j-gene
  assignments, previously assigned numbering generated by AntPack or
  some other tool, etc.);
* a list of csv files where the CDRs have already been extracted and
  specific columns contain CDR sequences and V/J gene assignments.

It is common to "clonotype" antibody sequences, i.e. group them into
clusters where all sequences in a cluster share the same v-gene and the
same j-gene and have a percent difference below a chosen threshold
(typically 20% or 25%) to some other sequence in the cluster. When
performed using a simple search, this type of clustering has
:math:`O(N^2)` scaling, so it is not optimal for very large datasets.
Remarkably, however, AntPack's search algorithm is fast enough that you
can clonotype datasets as large as 40-50 million sequences in a few
hours, and 100 million by the next day. For larger datasets, we strongly
recommend using a more scalable algorithm to cluster the database
(AntPack will provide additional options for this soon!).

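
To make the quadratic scaling mentioned above concrete, here is a minimal
sketch of naive single-linkage clonotyping in plain Python. It is for
illustration only and is not AntPack's search algorithm. The function
names (``percent_difference``, ``naive_clonotype``), the
``(vgene, jgene, cdr)`` tuple layout, the example sequences and the
requirement that compared CDRs have equal length are all assumptions made
for this example.

.. code-block:: python

    from collections import defaultdict


    def percent_difference(a, b):
        """Fraction of mismatched positions between two equal-length strings."""
        if not a:
            return 0.0
        return sum(x != y for x, y in zip(a, b)) / len(a)


    def naive_clonotype(records, cutoff=0.2):
        """Single-linkage clustering of hypothetical (vgene, jgene, cdr) tuples.

        Returns one cluster label per record. Every pair inside a group is
        compared, which is where the O(N^2) cost comes from.
        """
        parent = list(range(len(records)))

        def find(i):
            # Union-find lookup with path halving.
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i

        # Assumption: only records with the same v-gene, same j-gene and
        # the same CDR length are candidates for the same cluster.
        groups = defaultdict(list)
        for idx, (vgene, jgene, cdr) in enumerate(records):
            groups[(vgene, jgene, len(cdr))].append(idx)

        for members in groups.values():
            for pos, i in enumerate(members):
                for j in members[pos + 1:]:
                    if percent_difference(records[i][2], records[j][2]) < cutoff:
                        parent[find(j)] = find(i)

        return [find(i) for i in range(len(records))]


    # Made-up example records, not real data.
    records = [
        ("IGHV1-69", "IGHJ4", "ARDYYGSSYFDY"),
        ("IGHV1-69", "IGHJ4", "ARDYYGSSYFDV"),  # 1 of 12 positions differs
        ("IGHV1-69", "IGHJ4", "ARGGWELLRFDY"),  # too different to link
        ("IGHV3-23", "IGHJ4", "ARDYYGSSYFDY"),  # different v-gene
    ]
    labels = naive_clonotype(records, cutoff=0.2)

With these inputs the first two records share a label, while the third
and fourth each end up in their own cluster. The pairwise loop is what
becomes prohibitively slow as the number of sequences grows.
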
Search and clustering will generally be much faster on solid state
drives (SSDs) than on hard drives, so we recommend constructing the
database on an SSD where possible. Typical disk space usage is about
6 GB per 10 million single chains, and possibly more if there is a large
amount of associated metadata that you want to store for each sequence.

To search or cluster an existing database, use a ``LocalDBSearchTool``,
e.g.::

    from antpack import LocalDBSearchTool
    ldb = LocalDBSearchTool(db_filepath)

where ``db_filepath`` is the path to the database you would like to use.

For detailed descriptions of everything you can do with the
``LocalDBSearchTool`` and of the functions you can use to build a
database (depending on the input file type), see below.

.. autoclass:: antpack.LocalDBSearchTool
   :special-members: __init__
   :members: search, get_sequence, get_vgene_jgene, get_database_metadata, get_num_seqs, get_max_seqid, basic_clustering, retrieve_preprocessed_search_data, search_from_preprocessed_data

.. autofunction:: antpack.build_database_from_fasta

.. autofunction:: antpack.build_database_from_full_chain_csv

.. autofunction:: antpack.build_database_from_cdr_only_csv