Database search and clustering tutorial
=========================================

AntPack can quickly construct a local database
of antibodies and TCRs (a few minutes for
10 million sequences). The resulting database
can then be searched efficiently for any sequence
whose CDRs are similar to those of a query, where
similarity is defined using percent difference
(and, optionally, a maximum per-residue BLOSUM
distance). The search can be further restricted
to sequences that have the same v-gene family
or the same v-gene assignment as the query, and
the same j-gene family as well.
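
As a rough illustration of the percent-difference criterion, the sketch
below compares two CDR strings position by position. It assumes that the
CDRs have equal length and that percent difference is the fraction of
mismatched positions. The function name and the example sequences are
hypothetical, and AntPack's own search may treat gaps and length
differences differently::

  def percent_difference(cdr_a: str, cdr_b: str) -> float:
      """Fraction of positions at which two equal-length CDRs differ."""
      if len(cdr_a) != len(cdr_b):
          # Assumption: this sketch does not handle length differences.
          raise ValueError("this sketch only compares CDRs of equal length")
      if not cdr_a:
          return 0.0
      mismatches = sum(a != b for a, b in zip(cdr_a, cdr_b))
      return mismatches / len(cdr_a)

  # Hypothetical CDR3 sequences that differ at 2 of 12 positions.
  query = "ARDYYGSSYFDY"
  candidate = "ARDYYGSGYFDV"
  is_hit = percent_difference(query, candidate) <= 0.25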

Databases can be constructed from a variety of
inputs. You can build one using:

* a list of fasta files where the sequences have not been
  numbered yet. In this case AntPack will have to
  number them, which may slow database construction
  down somewhat;

* a list of csv files where specific columns contain certain
  information (e.g. the heavy and light chain sequences,
  the v-gene and j-gene assignments, previously assigned
  numbering generated by AntPack or some other tool etc.);

* a list of csv files where the CDRs have already been extracted
  and specific columns contain CDR sequences and V/J gene
  assignments. 

It is common to "clonotype" antibody sequences,
i.e. group them into clusters where all sequences
in a cluster share the same v-gene and the same j-gene,
and each has a percent difference below some cutoff
(typically 20% or 25%) to at least one other sequence
in the cluster. When performed using a simple search,
this type of clustering has :math:`O(N^2)` scaling, so
it is not optimal for very large datasets. AntPack's
search algorithm is nevertheless fast enough that you can
clonotype datasets as large as 40-50 million sequences
in a few hours, and 100 million by the next day. For
larger datasets, we strongly recommend using a more
scalable algorithm to cluster the database (AntPack
will provide additional options for this soon).
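
To make the quadratic scaling concrete, here is a minimal sketch of
clonotyping by brute-force pairwise comparison with a union-find
structure. It is not AntPack's algorithm. It assumes each record is a
hypothetical ``(vgene, jgene, cdr3)`` tuple, that only CDR3s of equal
length are compared, and that percent difference is the fraction of
mismatched positions::

  from collections import defaultdict
  from itertools import combinations

  def mismatch_fraction(a: str, b: str) -> float:
      """Fraction of mismatched positions (equal-length, non-empty strings)."""
      return sum(x != y for x, y in zip(a, b)) / len(a)

  def naive_clonotype(records, cutoff=0.2):
      """Return one cluster label per (vgene, jgene, cdr3) record."""
      parent = list(range(len(records)))

      def find(i):
          # Follow parent links to the cluster root, compressing the path.
          while parent[i] != i:
              parent[i] = parent[parent[i]]
              i = parent[i]
          return i

      # Only records with the same v-gene, j-gene and CDR3 length
      # (the length restriction is an assumption) can share a cluster.
      groups = defaultdict(list)
      for idx, (vgene, jgene, cdr3) in enumerate(records):
          groups[(vgene, jgene, len(cdr3))].append(idx)

      # Every pair within a group is compared: this is the O(N^2) step.
      for members in groups.values():
          for first, second in combinations(members, 2):
              if mismatch_fraction(records[first][2], records[second][2]) < cutoff:
                  parent[find(first)] = find(second)

      return [find(i) for i in range(len(records))]

  # Hypothetical records: the first two differ at 1 of 12 positions.
  records = [
      ("IGHV1-2", "IGHJ4", "ARDYYGSSYFDY"),
      ("IGHV1-2", "IGHJ4", "ARDYYGSGYFDY"),
      ("IGHV3-23", "IGHJ4", "ARDYYGSSYFDY"),
  ]
  labels = naive_clonotype(records)

Because a sequence joins a cluster as soon as it is close to any one
member, this is single-linkage clustering, which matches the definition
of a clonotype given above.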

Search and clustering will generally be much faster
on solid state drives (SSDs) than on hard drives, so we recommend
constructing the database on an SSD where possible. Typical disk space
usage is about 6 GB per 10 million single chains, and
possibly more if there is a large amount of associated
metadata that you want to store for each sequence.
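
For planning purposes, the figure above can be turned into a
back-of-the-envelope estimate. The helper below is a hypothetical
convenience function, not part of AntPack. It assumes disk usage scales
linearly with the number of chains and it ignores metadata::

  def estimate_disk_gb(num_chains: int, gb_per_10_million: float = 6.0) -> float:
      """Approximate database size in GB, excluding per-sequence metadata."""
      return num_chains / 10_000_000 * gb_per_10_million

  # Rough size of a database holding 40 million single chains.
  approx_gb = estimate_disk_gb(40_000_000)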

To search or cluster an existing database, use a
``LocalDBSearchTool``, e.g.::

  from antpack import LocalDBSearchTool
  ldb = LocalDBSearchTool(db_filepath)

where ``db_filepath`` is the path to the database
you would like to use.

For detailed descriptions of everything you can do
with the ``LocalDBSearchTool`` and of the functions you
can use to build a database (depending on the input file
type), see below.

.. autoclass:: antpack.LocalDBSearchTool
   :special-members: __init__
   :members: search, get_sequence, get_vgene_jgene, get_database_metadata, get_num_seqs, get_max_seqid, basic_clustering, retrieve_preprocessed_search_data, search_from_preprocessed_data


.. autofunction:: antpack.build_database_from_fasta

.. autofunction:: antpack.build_database_from_full_chain_csv

.. autofunction:: antpack.build_database_from_cdr_only_csv