A whitepaper by Dr Catarina Carrao
Antibodies (Abs) have been engineered not only to bind a given target potently and specifically, but also to have favorable drug properties, including in vivo stability, manufacturability, low immunogenicity, solubility, and low polyspecificity1.
Monoclonal antibodies (mAbs), for example, have become ubiquitous therapeutics used to treat cancer, inflammation, and infectious diseases, with worldwide sales in 2021 of over 111 billion USD2.
Furthermore, novel bispecific antibody constructs can bind to two different targets at the same time and have shown advantages in specific disease treatments related to efficacy, specificity, and size, with a 2022 market size of >20 billion USD3.
It is estimated that the human antibody repertoire contains around 10^13 unique sequences4, and this diversity is a result of how the proteins are encoded in the genome5.
Antibodies are composed of two types of protein chain - heavy (h) and light (l) chains; and, each of these is encoded by multiple gene segments that are spliced together using a process called V(D)J recombination6.
As such, the sequence for the light-chain variable region (Fv) is made up of two segments: the variable segment (V) and the joining segment (J); while, the heavy chain is encoded from variable, joining, and diversity (D) segments5.
There are many genes for each of the V, D, and J segments, which can be combined in many different ways to produce a wide range of antibody sequences. Further diversity is introduced through the insertion or deletion of nucleotides at the segment junctions7 and through somatic hypermutation, a process that raises the rate at which random mutations occur5,8.
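The contribution of segment choice alone can be put into numbers with a short sketch. The segment counts below are rough, assumed values chosen for illustration and are not taken from this whitepaper; only the multiplication logic matters.

```python
# Illustrative segment counts only: rough, assumed numbers of functional
# human gene segments, not values taken from the whitepaper.
SEGMENTS = {
    "heavy": {"V": 40, "D": 23, "J": 6},
    "kappa": {"V": 40, "J": 5},
    "lambda": {"V": 30, "J": 4},
}


def segment_combinations(chain):
    """Number of ways to pick one segment of each type for one chain."""
    total = 1
    for count in chain.values():
        total *= count
    return total


heavy = segment_combinations(SEGMENTS["heavy"])
# A light chain is either kappa or lambda, so the two options add up.
light = segment_combinations(SEGMENTS["kappa"]) + segment_combinations(SEGMENTS["lambda"])
# Any heavy chain can in principle pair with any light chain.
paired = heavy * light

print(f"heavy-chain V(D)J combinations: {heavy:,}")
print(f"light-chain VJ combinations:    {light:,}")
print(f"heavy x light pairings:         {paired:,}")
```

With these assumed counts, segment choice gives on the order of 10^6 pairings, which is why junctional insertions and deletions and somatic hypermutation are needed to account for a much larger repertoire.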
The majority of the variation in the sequences occurs in the complementarity-determining regions, or CDRs; and, by creating a large, diverse repertoire of antibody sequences, an individual is able to react to almost any antigen it may encounter.
Additionally, the ability of an antibody to bind to its target antigen is determined by its 3D structure. As such, knowledge of an antibody's structure allows a deeper understanding of its physicochemical properties, which are fundamental to its therapeutic value5.
The structural diversity that allows binding to many different targets occurs mainly in the CDRs; and, these correspond to loops in the 3D structure, which are responsible for most of the antigen-binding interactions9.
For five of the six CDRs (H1, H2, and L1–L3), structural diversity is limited - only a few different shapes have been observed (canonical structures)5.
However, the H3 loop is much more variable in sequence than the other CDRs; and, consequently, is also more structurally diverse, and it is thought to contribute the most to antigen-binding properties10.
Once we are exposed to an antigen, any antibody displayed on our B-cells (as a B-cell receptor) that is able to bind to it can do so, while T-cells recognize antigen through their own T-cell receptors. Humoral immunity depends on the B-cells, while cell-mediated immunity depends on the T-cells, and this binding step is where antibody selection begins.
As such, having a large repertoire of antibodies present in our bodies will increase the chance that at least one has the ability to bind to the antigen, even weakly; and, therefore, initiate an appropriate immune response5. B-cells can then enter clonal expansion, and start cycles of proliferation producing binding antibodies with simultaneous somatic hypermutations, in order to produce antibodies with higher affinity8.
At this point, the antibody repertoire is enriched with antibodies that bind to the target antigen5. This variation and specificity is the basis for antibody therapeutics, not only against previously known diseases but also against new pathogens that we might encounter throughout our lifetime.
Efforts to determine the antibody repertoire using high-throughput DNA sequencing technologies have been advancing at an extremely rapid pace and are transforming our understanding of immune responses11. This technique, first described in 2009 by Glanville and colleagues12, has since exponentially increased the volume of data available.
Whereas previously only a handful of sequences could be obtained at a time, technological advances now mean that large snapshots of the antibody or receptor sequence repertoire can be captured using next-generation sequencing approaches.
Since the H3 loop largely determines binding properties, many studies have focused on sequencing only this region; however, repertoires containing full-length sequences are increasingly being produced for the heavy chain11, the light chain13, or both5,14.
Furthermore, the most recent studies have led to a couple of repertoires that also include native pairing information, which details which heavy-chain sequences belong with which light-chain sequences5.
Novel single-cell RNA sequencing methods allow coupling individual T-cell clones to their phenotype and function using their gene expression profiles15; but, unfortunately, the actual antigen specificity (i.e., the set of antigens that can be potentially recognized by a given T-cell receptor, TCR) remains a mystery for most of the T-cells observed by high-throughput profiling.
Adaptive immune receptor repertoire sequencing (AIRR-Seq), a technology for profiling immune repertoires, is an efficient way to study the structure and dynamics of the adaptive immune system.
This technology makes it possible to characterize the structure of both naive and antigen-experienced TCR repertoires, tumour infiltrating T-cells, and TCRs related to autoimmunity, leading to numerous downstream applications in both basic and applied immunological research16.
Recent developments in the bioinformatic analysis of AIRR-Seq data aim to provide a means of annotating TCR repertoires with predicted antigen specificities. For example, the McPAS-TCR database17 lists pathogen- and disease-associated TCRs, and the VDJdb database18 features a large set of TCRs with experimentally verified epitope specificities and their MHC restrictions.
Existing computational methods for TCR repertoire annotation allow both matching against a database of known antigen specificities and clustering of TCR sequences for de novo motif detection16.
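The database-matching side of this can be sketched in a few lines. The sequences, epitope labels, and the one-mismatch threshold below are hypothetical choices for illustration; they do not come from McPAS-TCR or VDJdb, and real tools apply more careful matching rules.

```python
def hamming(a, b):
    """Hamming distance for equal-length strings, or None if lengths differ."""
    if len(a) != len(b):
        return None
    return sum(x != y for x, y in zip(a, b))


# Hypothetical toy database: CDR3 amino-acid sequence -> epitope label.
TOY_DB = {
    "CASSLGQAYEQYF": "epitope_A",
    "CASSPDRGNTIYF": "epitope_B",
}


def annotate(cdr3s, db, max_mismatches=1):
    """Map each query CDR3 to labels of database entries within max_mismatches."""
    hits = {}
    for query in cdr3s:
        labels = set()
        for ref, label in db.items():
            distance = hamming(query, ref)
            if distance is not None and distance <= max_mismatches:
                labels.add(label)
        hits[query] = sorted(labels)
    return hits


# Toy repertoire: an exact match, a one-mismatch variant, and an unrelated CDR3.
repertoire = ["CASSLGQAYEQYF", "CASSLGQGYEQYF", "CASSIRSSYEQYF"]
annotations = annotate(repertoire, TOY_DB)
for cdr3, labels in annotations.items():
    print(cdr3, labels or "no match")
```

Allowing one mismatch is a design choice that trades recall against false positives; with a repertoire as diverse as the TCR repertoire, a looser threshold quickly produces spurious annotations.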
Annotation of a large number of TCR repertoires from healthy donors19 demonstrates both high variance of frequencies of epitope-specific T-cells and the imprint of past and ongoing pathogen encounters.
Thus, de novo discovery of T-cells associated with antigens of interest or certain diseases appears to be a hard problem, complicated by (1) the biases in the structure of the naive (unperturbed) TCR repertoire20, (2) the presence of existing clonal expansions specific to unrelated pathogens, and (3) the high number of false positives that result from the extremely high diversity of the TCR repertoire16.
A solution could be to develop a general framework that can be used to infer sets of T-cells specific to antigens of interest using AIRR-Seq data and TCR neighbourhood enrichment algorithms (e.g., ALICE and TCRNET), as Pogorelyy et al.16 demonstrated.
By applying HLA restriction rules and matching against a database of TCRs with known antigen specificity, they detected motifs of epitope-specific responses in individual repertoires.
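The counting idea behind neighbourhood enrichment can be illustrated with a toy example. This is a simplified ratio heuristic under stated assumptions (one-mismatch neighbours, a pseudocount, made-up sequences), not the statistical procedure implemented in ALICE or TCRNET.

```python
def is_neighbour(a, b):
    """True if two CDR3s have equal length and differ at exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def neighbour_counts(queries, pool):
    """For each unique query, count distinct pool sequences one mismatch away."""
    unique_pool = set(pool)
    return {q: sum(is_neighbour(q, s) for s in unique_pool) for q in set(queries)}


def enrichment(sample, control, pseudocount=1.0):
    """Neighbour density in the sample divided by neighbour density in the control."""
    in_sample = neighbour_counts(sample, sample)
    in_control = neighbour_counts(sample, control)
    n_sample, n_control = len(set(sample)), len(set(control))
    return {
        q: ((in_sample[q] + pseudocount) / n_sample)
        / ((in_control[q] + pseudocount) / n_control)
        for q in in_sample
    }


# Hypothetical toy repertoires: the sample holds a small cluster of similar CDR3s.
sample = ["CASSLGQAYEQYF", "CASSLGQGYEQYF", "CASSLGQSYEQYF", "CASSPDRGNTIYF"]
control = ["CASSPDRGNTIYF", "CASSPDRGNTLYF", "CASRRTGELFF", "CASSQETQYF"]

scores = enrichment(sample, control)
for cdr3, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{cdr3}  {score:.2f}")
```

Sequences with many similar neighbours in the sample but few in the control score above 1, which is the signal such methods look for; the published tools replace this raw ratio with a proper statistical test.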
Furthermore, Carter et al21 analysed paired TCR sequences from nearly 100,000 unique CD4+ and CD8+ T cells captured using two different high-throughput, single-cell sequencing approaches, and determined the amount of useful information about TCR repertoire function encoded within αβ pairings.
Their results showed little overlap in the healthy CD4+ and CD8+ repertoires; and, using tools from information theory and machine learning, they showed that while α and β chains are only weakly associated with lineage, αβ pairings appear to synergistically drive TCR-MHC interactions.
As such, they were able to demonstrate that approximately a third of the T cells possess α and β chains that each recognize different known antigens, suggesting that αβ pairing is critical for the accurate inference of repertoire functionality21.
This study further demonstrates the utility of using new single-cell sequencing approaches, in addition to conventional high-throughput bulk-sequencing, so that a more accurate picture of TCR repertoire function can be captured.
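The information-theoretic point, that a pairing can carry information that neither chain carries alone, can be shown with mutual information on a deliberately artificial example. The labels and data below are hypothetical and constructed to make the effect extreme; they are not the data or the exact analysis of Carter et al.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information I(X;Y) in bits, estimated from paired observations."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


# Artificial toy data: lineage depends only on the combination of chains.
alpha = ["a1", "a1", "a2", "a2"]
beta = ["b1", "b2", "b1", "b2"]
lineage = ["CD4", "CD8", "CD8", "CD4"]
pairs = list(zip(alpha, beta))

mi_alpha = mutual_information(alpha, lineage)
mi_beta = mutual_information(beta, lineage)
mi_pair = mutual_information(pairs, lineage)

print(f"I(alpha; lineage)      = {mi_alpha:.2f} bits")
print(f"I(beta; lineage)       = {mi_beta:.2f} bits")
print(f"I(alpha+beta; lineage) = {mi_pair:.2f} bits")
```

Here each chain alone says nothing about the label while the pair determines it fully. Real repertoires are far less clear-cut, and estimates from finite samples are biased, so this shows only the concept of synergy.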
In principle, humans can produce an antibody response to any non-self-antigen molecule in the appropriate context; and, this flexibility is achieved by the presence of a large repertoire of naive antibodies, the diversity of which is expanded by somatic hypermutation following antigen exposure22.
Chen et al.23 used several machine learning (ML) approaches (e.g., random forests [RF]) to predict an antibody's developability from sequence and structural parameters (e.g., physicochemical properties), representing sequences using either multiple sequence alignment or embeddings.
Going one step further, Amimeur et al.24 leveraged a generative adversarial network (GAN) trained on publicly available, full-length heavy- and light-chain data (not just the CDRH3 region, including paired-chains encoding), to learn the rules of human antibody formation.
Subsequently, the authors used transfer learning to bias the GAN to generate molecules with developability properties of interest. They validated their method by successfully expressing GAN-generated antibodies via phage display and by testing their “nativeness” via homology modelling and biophysical properties25.
A future challenge in the developability engineering of antibodies is the joint optimization of several developability parameters in conjunction with affinity and epitope (e.g., structural features) design25.
Antibody evolution is also used in vitro for the design of antibodies with improved properties. As such, to better understand the basic concepts of antibody evolution, researchers analysed the mutational paths, both in terms of amino acid substitution and insertions and deletions, taken by antibodies.
For example, Kirik and colleagues26 focused on the evolution of the heavy chain variable domain of sets of antibodies, each with an origin in 1 of 11 different germline genes representing six human heavy chain germline gene subgroups.
They then investigated the isolated genes from cells of human bone marrow, a major site of antibody production, and characterized them by next-generation sequencing and an in-house bioinformatics pipeline.
Apart from substitutions within the complementarity determining regions, multiple framework residues including those in protein cores were targets of extensive diversification.
Diversity in antibodies, both in terms of substitutions and of insertions and deletions, is focused on different positions in the sequence in a manner unique to each germline gene. Their findings create a framework for understanding patterns of evolution of antibodies from defined germline genes26.
A thorough understanding of the ways through which antibodies derive from different germlines and evolve as a consequence of somatic mutation processes is fundamental to aid a proper mutational analysis of clones that populate immune responses.
All in all, the amount of information encoded by all of the rearranged antibody and T cell receptor genes in one person - the “genome” of the adaptive immune system - exceeds the size of the human genome by more than four orders of magnitude22.
Furthermore, because much of the B lymphocyte population is localized in organs or tissues that cannot be comprehensively sampled from living subjects, human repertoire studies have mostly focused on circulating B cells.
As such, the largest repertoire sequencing study of B cell receptors was performed by Briney et al.22, which resulted in a set of >300 million heavy-chain sequences.
This dataset now allows genetic study of the baseline human antibody repertoire at an unprecedented depth and granularity. This reveals largely unique repertoires for each individual studied, a subpopulation of universally shared antibody clonotypes, and an exceptional overall diversity of the antibody repertoire.
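How private and shared clonotypes are counted across individuals can be sketched as follows. The records are hypothetical, and defining a clonotype as the combination of V gene, J gene, and CDR-H3 amino-acid sequence is an assumption made for this sketch.

```python
from collections import Counter

# Hypothetical toy records, one per sequenced read:
# (V gene, J gene, CDR-H3 amino-acid sequence). Not real donor data.
DONORS = {
    "donor_1": [
        ("IGHV3-23", "IGHJ4", "CARDYW"),
        ("IGHV1-69", "IGHJ6", "CARGGYYGMDVW"),
        ("IGHV3-23", "IGHJ4", "CARDYW"),  # repeated read, same clonotype
    ],
    "donor_2": [
        ("IGHV3-23", "IGHJ4", "CARDYW"),
        ("IGHV4-34", "IGHJ5", "CARWFDPW"),
    ],
    "donor_3": [
        ("IGHV3-23", "IGHJ4", "CARDYW"),
        ("IGHV3-23", "IGHJ4", "CAKDYW"),
    ],
}


def clonotype_sharing(donors):
    """Count, for each clonotype, how many donors carry it at least once."""
    donor_counts = Counter()
    for records in donors.values():
        donor_counts.update(set(records))  # collapse repeated reads per donor
    return donor_counts


counts = clonotype_sharing(DONORS)
private = [c for c, n in counts.items() if n == 1]
shared_by_all = [c for c, n in counts.items() if n == len(DONORS)]

print(f"unique clonotypes overall: {len(counts)}")
print(f"private to one donor:      {len(private)}")
print(f"shared by all donors:      {len(shared_by_all)}")
```

Collapsing reads to a set per donor before counting keeps clonal expansion within one individual from being mistaken for sharing between individuals.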
Many algorithms and pipelines have now been developed to pre-process the generated data for analysis, performing tasks such as translation from nucleotides to amino acids, error estimation and correction, and sequence numbering27. iReceptor28, VDJServer29, and ImmuneDB30 are currently available platforms meant to create standardized, publicly available repositories of sequencing data, creating opportunities for large-scale data mining.
Also, the Observed Antibody Space (OAS) database collates full-length variable region sequences and now contains >1 billion sequences from different studies31.
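Of the pre-processing tasks listed above, translation from nucleotides to amino acids is the simplest to illustrate. The sketch below uses the standard genetic code; the function name, the `frame` parameter, and the use of "X" for unrecognized codons are choices made for this example and do not describe any particular pipeline.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nucleotides, frame=0):
    """Translate a DNA or RNA string into amino acids from the given frame.

    Stop codons become '*', codons with unrecognized bases become 'X',
    and trailing bases that do not fill a codon are ignored.
    """
    seq = nucleotides.upper().replace("U", "T")[frame:]
    usable = len(seq) - len(seq) % 3
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3)
    )


print(translate("TGTGCCAGCAGCTTG"))
print(translate("ATGNNNTAA"))
print(translate("GTGTGCC", frame=1))
```

Real pipelines must also determine the correct reading frame and handle sequencing errors, which is why translation is usually combined with the error correction and numbering steps mentioned above.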
The immune system has evolved to encode an amazing diversity of antibodies that jointly make up the antibody repertoire, providing a potent collection of recognition components that can identify almost any biological macromolecule.
The germline conservation of antibody genes carries the imprint of ancient human adaptations to pathogens we were exposed to during evolution, and the immunological insight represented by the repertoire, encoded by antigen-experienced mature B cells within an individual, allows adaptation to future pathogen encounters.
We can now begin to decipher both of these sections of humoral immunity, and how each is shaped by the other.
Technologies that can improve sequence precision and data analysis are essential to support antibody drug discovery and vaccine development, as we have seen in recent years. Equally needed are standards for data and data analysis, and databases built on them that facilitate the collection and sharing of all the data created.