Database search and clustering tutorial

AntPack can quickly (just a few minutes for 10 million sequences) construct a local database for antibodies and TCRs. The resulting database can be searched quickly using a very efficient algorithm for any sequence that has CDRs similar to a query, where similarity is defined using percent difference (and also optionally a maximum per-residue BLOSUM distance). The search can be further restricted to capture only sequences that have the same v-gene family or same v-gene assigment and have the same j-gene family as well.

Databases can be constructed from a variety of inputs. You can build one using:

a list of fasta files where the sequences have not been numbered yet. In this case AntPack will have to number them, which may slow database construction down somewhat;
a list of csv files where specific columns contain certain information (e.g. the heavy and light chain sequences, the v-gene and j-gene assignments, previously assigned numbering generated by AntPack or some other tool etc.);
a list of csv files where the CDRs have already been extracted and specific columns contain CDR sequences and V/J gene assignments.

It is common to “clonotype” antibody sequences, i.e. group them into clusters where all sequences in a cluster share the same v-gene, same j-gene and have percent difference less than either 25% or 20% to some other sequence in the cluster. This type of clustering when performed using a simple search has :math:O(N^2) . scaling so it is not optimal for very large datasets. Remarkably, however, AntPack’s search algorithm is fast enough you can clonotype datasets as large as 40-50 million sequences in a few hours, and 100 million by the next day. For larger datasets, we strongly recommend using a more scalable algorithm to cluster the database (AntPack will provide additional options for this soon!)

Search and clustering will generally be much faster on solid state drives (SSD) than hard drives so we recommend constructing on SSD where possible. Typical disk space usage is about 6 GB per 10 million single chains and possibly more if there is a large amount of associated metadata that you want to store for each sequence.

To search or cluster an existing database, use a LocalDBSearchTool, e.g.::

from antpack import LocalDBSearchTool
ldb = LocalDBSearchTool(db_filepath)

where `db_filepath` is the path to the database you would like to use.

For detailed descriptions of all the things you can do with the `LocalDBSearchTool` and the functions you can use to build a database (depending on the input file type), see below.

class antpack.LocalDBSearchTool(database_path: str)

Contains tools for searching a local database and clustering its contents.

__init__(database_path: str)

Class constructor.

Parameters:: database_path (str) – The filepath for the database. All other necessary info about the database will be retrieved from the database.
Raises:: RuntimeError – A runtime error is raised if the database is not found, cannot be opened or does not contain expected tables and metadata.

basic_clustering(chain_type: str = 'heavy', mode: str = '123', cdr_cutoff: float = 0.2, blosum_cutoff: float = -1, filter_by_jgene: bool = True, verbose: bool = True)

Clusters the linked database using single linkage, by searching every sequence once against the database. Only sequences that have the same cdr3 length and the same vgenes, jgenes and species can be assigned to a cluster. This is an O(N^2) procedure but astonishingly, depending on hardware, can be fast enough to cluster 10 million sequences in 15 minutes or so! This may be suitable for clustering as many as 40 - 80 million sequences overnight. For still larger datasets, we recommend using an algorithm with sub- quadratic scaling instead.

Parameters:

chain_type (str) – One of “heavy”, “light”. For TCRs, TCRB is heavy.
mode (str) – One of “123” or “3”, indicating whether to use cdr3 only or cdrs 1, 2 and 3 when determining whether a hit meets neighborhood criteria. Generally use 123 UNLESS you are working with sequences where you have data for cdr3 only (e.g. TCRs).
cdr_cutoff (float) – A value in the range from 0 to 0.25 (inclusive). Determines the percent identity for two sequences to be in the same cluster.
blosum_cutoff (float) – Either -1 or a value >=0. If -1, this argument is ignored. If >=0, sequences must have a BLOSUM distance < this cutoff at all positions to be in the same cluster.
filter_by_jgene (bool) – If True, members of a cluster are required to have the same jgene. If False, jgenes are ignored.
verbose (bool) – If True, print periodic updates while clustering.

Returns:

cluster_assignments (list) – A list of ints of the same length as the number of sequences in the database, including all chain types. Each sequence has a cluster label which is 0 if the sequence could not be assigned (e.g. no heavy chain present when clustering heavy chains), 1 if the sequence is not assigned to any clusters because it has no neighbors within the cdr_cutoff, or a number >= 2 which indicates the cluster to which the sequence belongs.

get_database_metadata()

Returns metadata about the database (numbering scheme, receptor type, cdr definition type, and memo entered when creating the db if any).

Returns:

numbering_scheme (str) – The numbering scheme (e.g. aho, imgt etc.) used for this database.
receptor_type (str) – The receptor type (mab or tcr).
sequence_type (str) – Either ‘single’, ‘paired’ or ‘unknown’. If ‘unknown’, it is assumed that each sequence added may have either one chain or two and they are analyzed accordingly.
cdr_scheme (str) – The cdr definitions used for this database (e.g. aho, imgt etc.)
user_memo (str) – A memo supplied by the user when creating the database with some info about what it is for; may be “” if no memo was supplied.

get_max_seqid()

Gets the largest seqid in the database. This will be the same as the number of stored sequences IF no database rows have been deleted. get_max_seq_id is MUCH faster than get_num_seqs because it does not have to count rows, so if you are in a hurry get_max_seqid is a better way to figure out (roughly) how many sequences are in the database.

Returns:: seq_id (int) – The largest sequence id in the database.

get_num_seqs(chain_type='all')

Gets either the total number of sequences OR the total number of entries in the heavy chain table or the light chain table, as requested.

Parameters:: chain_type (str) – One of “all”, “heavy”, “light”. Determines whether to count all chains or just heavy or light.
Returns:: count (int) – The number of sequences in the table matching the specifications.

get_sequence(seq_id: int)

Retrieves the sequence and metadata associated with a given sequence id. Sequence ids can be obtained from any of the search functions.

Parameters:

seq_id (int) – A sequence id associated with one of the sequences in the database.

Returns:

sequence (str) – Either a blank string if the id that was supplied does not match any database sequences, or the sequence from the database.
metadata (str) – Either a blank string if the id that was supplied does not match any database sequences, or the metadata associated with the sequence from the database.

get_vgene_jgene(seq_id: int, chain_type: str)

Retrieves the vgene and jgene associated with a given sequence id and chain type. Sequence ids can be obtained from any of the search functions.

Parameters:

seq_id (int) – A sequence id associated with one of the sequences in the database.
chain_type (str) – One of ‘heavy’, ‘light’. If using TCRs, B is heavy.

Returns:

vgene (str) – Either a blank string if the id that was supplied does not match any database sequences or there is no vgene for this sequence, or the vgene from the database.
metadata (str) – Either a blank string if the id that was supplied does not match any database sequences or there is no jgene for this sequence, or the jgene from the database.

retrieve_preprocessed_search_data(seq_id: int, chain_type: str = 'heavy', filter_by_vgene: bool = True, filter_by_jgene: bool = True, use_vgene_family_only: bool = False)

Retrieves preprocessed search data that can be supplied directly to `search_from_preprocessed_data`. This is useful if you are retrieving sequences from one database to search against another database created with the same numbering and cdr scheme.

Parameters:

seq_id (int) – A sequence id associated with one of the sequences in the database.
chain_type (str) – One of ‘heavy’, ‘light’.
filter_by_vgene (bool) – If True, the returned results are set up so that the resulting search will find only sequences with the same vgene and vgene family. If False, the returned results are configured to retrieve sequences regardless of vgene.
filter_by_jgene (bool) – If True, the returned results are set up so that the resulting search will find only sequences with the same jgene and jgene family. If False, the returned results are configured to retrieve sequences regardless of jgene.
use_vgene_family_only (bool) – If True, the returned results are set up so that the resulting search will filter for sequences with the same vgene family (but possibly a different vgene). Ignored if filter_by_vgene is False.

Returns:

preprocessed_inputs (tuple) – A tuple of query_seq (the three concatenated cdrs), cdr_lengths (the length of all three cdrs), chain code, vgene_info and jgene_info. This can be supplied directly to `search_from_preprocessed_data`.

search(seq: str, annotation: tuple, mode: str = '3', cdr_cutoff: float = 0.25, blosum_cutoff: float = -1, max_cdr_length_shift: int = 2, use_family_only=True, symmetric_search=False, vgene: str = '', species: str = '', jgene: str = '')

Searches the database and returns a list of nearest neighbors that meet the input criteria. The search can be conducted using the full sequence or a sub-region of it (e.g. cdr3, the cdrs etc).

Parameters:

seq (str) – The input amino acid sequence. May contain ‘X’, must not contain ‘*’.
annotation (tuple) – A tuple of (numbering, percent id, chain type, error message) that is returned by any of the analyze_seq functions for annotators in AntPack.
mode (str) – One of “3” or “123”, indicating whether to look for sequences that are similar based on cdr3 or based on all three cdrs.
cdr_cutoff (float) – The maximum number of differences allowed from the query seq as a percentage of the query seq cdr length(s) (excluding gaps). This is assessed separately for cdr3 and cdrs 1/2 if using mode “123”. In other words, if searching all three cdrs with cutoff 0.25, the maximum number of differences allowed in cdr3 is 25% of that cdr length rounded down the nearest integer (floor operation), and the maximum number of differences in cdrs 1 & 2 combined is 25% of their combined lengths rounded down to the nearest integer (floor operation). cdr_cutoff is allowed to range from 0 to 0.3 (if you want to use a larger value you will need to do a slower full-table scan instead of using this function).
blosum_cutoff (float) – Either -1 or a value >= 0. If < 0, this argument is ignored. If >= 0, any match with a max BLOSUM mismatch score above this value is discarded. Specifying a value >= 0 makes the search more restrictive by excluding results that are within CDR cutoff but have a highly improbable mutation (as determined by BLOSUM scoring).
max_cdr_length_shift (int) – The maximum +/- amount by which the length of cdr3 for a match can differ from the query. If 0, the results are required to have cdr3 be the same length as the query.
use_family_only (bool) – If True, when filtering by vgene, hits are required to have the same vgene family, but can have a different vgene within that family. If False, hits are required to have the same vgene AND family. This argument is ignored if vgene or species is “”.
symmetric_search (bool) – If True, the cdr cutoff is applied both forward and reverse – in other words, the Hamming distance from hit to query must be less than either cdr_cutoff * query_cdr_length OR cdr_cutoff * hit_cdr_length. This ensures that distance matrices constructed using search are symmetric. If not planning to build a distance matrix, this is unnecessary.
vgene (str) – One of “” or a valid vgene. If a valid vgene, hits are required to match either the vgene family or the specific vgene (depending on the use_vjgene_family_only argument). Ignored if no species is supplied.
species (str) – One of “” or a valid species (human, mouse, alpaca, rabbit). If “”, hits are required to belong to the same assigned species.
jgene (str) – One of “” or a valid jgene. If a valid jgene, hits are required to match either the jgene family or the specific jgene (depending on the use_vjgene_family_only argument). Ignored if no species or vgene is supplied.

Returns:

hits (list) – A list of tuples containing (sequence_id, distance, num_non_canonical_positions). The sequence ids can be used to retrieve the sequences and their metadata using the get_sequence call. Non-canonical positions are rare unusual insertions that are not included in the distance calculation; this value will normally be zero. If it is not, it indicates the hit contains unusual non-canonical positions.
error_message (str) – An error message which is “” if all went well and is informative otherwise.

search_from_preprocessed_data(prepped_inputs: tuple, mode: str = '3', cdr_cutoff: float = 0.25, blosum_cutoff: float = -1, max_cdr_length_shift: int = 2, symmetric_search=False)

Searches the database using preprocessed query info generated by `retrieve_preprocessed_search_data` and returns a list of nearest neighbors that meet the input criteria. This function is a good alternative to `search` if you are retrieving specific sequences from one database and searching them against another, in which case it is much faster than numbering / prepping the retrieved sequences yourself.

Parameters:

prepped_inputs (tuple) – A tuple of information returned by `retrieve_preprocessed_search_data`.
mode (str) – One of “3” or “123”, indicating whether to look for sequences that are similar based on cdr3 or based on all three cdrs.
cdr_cutoff (float) – The maximum number of differences allowed from the query seq as a percentage of the query seq cdr length(s) (excluding gaps). This is assessed separately for cdr3 and cdrs 1/2 if using mode “123”. In other words, if searching all three cdrs with cutoff 0.25, the maximum number of differences allowed in cdr3 is 25% of that cdr length rounded down the nearest integer (floor operation), and the maximum number of differences in cdrs 1 & 2 combined is 25% of their combined lengths rounded down to the nearest integer (floor operation). cdr_cutoff is allowed to range from 0 to 0.3 (if you want to use a larger value you will need to do a slower full-table scan instead of using this function).
blosum_cutoff (float) – Either -1 or a value >= 0. If < 0, this argument is ignored. If >= 0, any match with a max BLOSUM mismatch score above this value is discarded. Specifying a value >= 0 makes the search more restrictive by excluding results that are within CDR cutoff but have a highly improbable mutation (as determined by BLOSUM scoring).
max_cdr_length_shift (int) – The maximum +/- amount by which the length of cdr3 for a match can differ from the query. If 0, the results are required to have cdr3 be the same length as the query.
use_family_only (bool) – If True, when filtering by vgene, hits are required to have the same vgene family, but can have a different vgene within that family. If False, hits are required to have the same vgene AND family. This argument is ignored if vgene or species is “”.
symmetric_search (bool) – If True, the cdr cutoff is applied both forward and reverse – in other words, the Hamming distance from hit to query must be less than either cdr_cutoff * query_cdr_length OR cdr_cutoff * hit_cdr_length. This ensures that distance matrices constructed using search are symmetric. If not planning to build a distance matrix, this is unnecessary.

Returns:

hits (list) – A list of tuples containing (sequence_id, distance, num_non_canonical_positions). The sequence ids can be used to retrieve the sequences and their metadata using the get_sequence call. Non-canonical positions are rare unusual insertions that are not included in the distance calculation; this value will normally be zero. If it is not, it indicates the hit contains unusual non-canonical positions.
error_message (str) – An error message which is “” if all went well and is informative otherwise.

antpack.build_database_from_fasta(fasta_filepaths: list, database_filepath: str, numbering_scheme: str = 'imgt', cdr_definition_scheme: str = 'imgt', sequence_type: str = 'single', receptor_type: str = 'mab', pid_threshold: float = 0.7, user_memo: str = '', reject_file: str = None, verbose: bool = True)

Builds a database from a list of fasta files which may or may not be gzipped. The database is constructed so it can be searched quickly and the sequence descriptions for each sequence in the fasta file are saved as metadata.

Parameters:

fasta_filepath (str) – A list of fasta filepath locations, which may or may not be gzipped.
database_filepath (str) – The desired location and filename for the database.
numbering_scheme (str) – One of ‘imgt’, ‘kabat’, ‘martin’ or ‘aho’. If receptor_type is ‘tcr’, ‘imgt’ is the only allowed option.
cdr_definition_scheme (str) – One of ‘imgt’, ‘kabat’, ‘martin’, ‘aho’ or ‘north’. If receptor_type is ‘tcr’, ‘imgt’ only is allowed.
sequence_type (str) – One of ‘single’, ‘paired’, ‘unknown’. If ‘paired’ each sequence is assumed to be paired. If ‘unknown’ it is assumed each sequence MAY be paired and should be analyzed as paired just in case.
receptor_type (str) – One of ‘mab’, ‘tcr’.
pid_threshold (float) – A value between 0 and 1 for percent identity threshold. If sequence_type is ‘single’ or ‘unknown’, sequences not meeting this threshold are rejected. If sequence_type is ‘paired’ the sequences are still retained as long as one of the chains meets this threshold.
user_memo (str) – A string describing the purpose of the database / anything important you want your future self or other users to know about the contents. Will be saved as part of the database metadata.
reject_file (str) – Either None or a filepath. If None, any sequences that are rejected are silently ignored. If a filepath, rejected sequences are written to that filepath which is saved as a fasta file.
verbose (bool) – If True, print regular updates while running.

Raises:

RuntimeError – A RuntimeError is raised if invalid arguments are supplied.

antpack.build_database_from_full_chain_csv(csv_filepaths: list, database_filepath: str, column_selections: dict, header_rows: int = 1, numbering_scheme: str = 'imgt', cdr_definition_scheme: str = 'imgt', delimiter: str = ',', receptor_type: str = 'mab', pid_threshold: float = 0.7, user_memo: str = '', reject_file: str = None, default_species: str = 'human', verbose: bool = True)

Builds a database from a list of csv files which may or may not be gzipped. The database is constructed so it can be searched quickly. The csv files should already contain heavy and light chains split up into appropriate columns. The csv files can optionally contain additional columns with numbering or Vgenes – if present these will save time for database construction since AntPack will not need to perform numbering or vgene identification. It is ok for some rows to contain both a heavy and light chain and others to contain only one or the other with “” for the missing chain. You can also load files that contain a heavy or light chain only.

Use this option if you have the full chains. If you only have pre- extracted cdrs, use build_database_from_cdr_only_csv.

Note that the beta chain for TCRs is assumed to be “heavy” and alpha is light.

Parameters:

csv_filepaths (list) – A list of csv filepaths, which may or may not be gzipped.
database_filepath (str) – The desired location and filename for the database.
column_selections (dict) –
A dictionary which should contain at least some of the following keys:
- "heavy_chain": The number (from 0) of the column in each csv file containing the heavy chains. Do not include this if there are no heavy chains. If heavy chains are supplied, heavy vgenes and jgenes must be supplied also.
- "light_chain": The number (from 0) of the column in each csv file containing the light chains. Do not include this if there are no light chains. If light chains are supplied, light vgenes and jgenes must be supplied also.
- "heavy_numbering": The number (from 0) of the column (if any) containing numbering for the heavy chains. The numbering should be delimited using the ‘_’ character. If you number your sequences using AntPack you can obtain this using ‘_’.join(antpack_numbering) for each sequence. If this is not supplied each heavy chain will be numbered by this function.
- "light_numbering": The number (from 0) of the column (if any) containing numbering for the light chains. The numbering should be delimited using the ‘_’ character. If you number your sequences using AntPack you can obtain this using ‘_’.join(antpack_numbering) for each sequence. If this is not supplied each light chain will be numbered by this function.
- "heavy_vgene": The number (from 0) of the column (if any) containing vgene assignments for the heavy chains.
- "light_vgene": The number (from 0) of the column (if any) containing vgene assignments for the light chains.
- "heavy_jgene": The number (from 0) of the column (if any) containing jgene assignments for the heavy chains.
- "light_jgene": The number (from 0) of the column (if any) containing jgene assignments for the light chains.
- "species": The number (from 0) of the column (if any) containing species assignments. If this is not supplied, it will be assumed that all sequences come from the default species (see below).
- "metadata": The number (from 0) of the column (if any) containing metadata for each heavy / light chain or chain pair.
header_rows (int) – The number of header rows. Header rows are skipped.
numbering_scheme (str) – One of ‘imgt’, ‘kabat’, ‘martin’ or ‘aho’. If receptor_type is ‘tcr’, ‘imgt’ is the only allowed option.
cdr_definition_scheme (str) – One of ‘imgt’, ‘kabat’, ‘martin’, ‘aho’ or ‘north’. If receptor_type is ‘tcr’, ‘imgt’ only is allowed.
delimiter (str) – The delimiter for the csv files.
receptor_type (str) – One of ‘mab’, ‘tcr’.
pid_threshold (float) – A value between 0 and 1 for percent identity threshold. If you have not supplied pre-written numbering, sequences are numbered during construction, and any not meeting this threshold are rejected.
user_memo (str) – A string describing the purpose of the database / anything important you want your future self or other users to know about the contents. Will be saved as part of the database metadata.
reject_file (str) – Either None or a filepath. If None, any sequences that are rejected are silently ignored. If a filepath, rejected sequences are written to that filepath which is saved as a csv file.
default_species (str) – ‘human’, ‘alpaca’, ‘rabbit’ or ‘mouse’. If you do not supply a species column, AntPack will use this species as a default.
verbose (bool) – If True, print regular updates while running.

Raises:

RuntimeError – A RuntimeError is raised if invalid arguments are supplied.

antpack.build_database_from_cdr_only_csv(csv_filepaths: list, database_filepath: str, column_selections: dict, delimiter=',', receptor_type='mab', header_rows: int = 1, user_memo: str = '', reject_file: str = None, default_species: str = 'human', verbose: bool = True)

TCR data and some antibody data is often stored with cdr3 and/or other cdrs pre-extracted. This function builds a database specifically using this type of input format where you have already extracted the cdrs. If using this function, you MUST use IMGT numbering and cdr definitions. Note that the beta chain for TCRs is assumed to be “heavy” and alpha is light.

Parameters:

csv_filepaths (list) – A list of csv filepaths, which may or may not be gzipped.
database_filepath (str) – The desired location and filename for the database.
column_selections (dict) –
A dictionary which should contain at least some of the following keys. Do not include any key for which information is not present. Note that if you supply some light or heavy chain cdr columns, you MUST supply the cdr3 and vgene for that chain type.
- "light_cdr3": The number (from 0) of the column in each csv file containing light cdr3 (alpha for TCRs).
- "heavy_cdr3": The number (from 0) of the column in each csv file containing heavy cdr3 (beta for TCRs).
- "light_cdr2": The number (from 0) of the column in each csv file containing light cdr2 (alpha for TCRs).
- "heavy_cdr2": The number (from 0) of the column in each csv file containing heavy cdr2 (beta for TCRs).
- "light_cdr1": The number (from 0) of the column in each csv file containing light cdr1 (alpha for TCRs).
- "heavy_cdr1": The number (from 0) of the column in each csv file containing heavy cdr1 (beta for TCRs).
- "light_vgene": The number (from 0) of the column containing vgene assignments for the light chains.
- "heavy_vgene": The number (from 0) of the column containing vgene assignments for the heavy chains.
- "light_jgene": The number (from 0) of the column (if any) containing jgene assignments for the light chains. If this is supplied, a light_vgene column must be supplied as well.
- "heavy_jgene": The number (from 0) of the column containing jgene assignments for the heavy chains.
- "species": The number (from 0) of the column containing species assignments. If this is not supplied, it will be assumed that all sequences come from the default species (see below).
- "metadata": The number (from 0) of the column (if any) containing metadata for each sequence.
delimiter (str) – The delimiter for the csv files.
receptor_type (str) – One of ‘mab’, ‘tcr’.
header_rows (int) – The number of header rows. Header rows are skipped.
user_memo (str) – A string describing the purpose of the database / anything important you want your future self or other users to know about the contents. Will be saved as part of the database metadata.
reject_file (str) – Either None or a filepath. If None, any sequences that are rejected are silently ignored. If a filepath, rejected sequences are written to that filepath which is saved as a csv file.
default_species (str) – ‘human’, ‘alpaca’, ‘rabbit’ or ‘mouse’. If you do not supply a vgene column, AntPack will use this species as a default.
verbose (bool) – If True, print regular updates while running.

Raises:

RuntimeError – A RuntimeError is raised if invalid arguments are supplied.