An Entity Evolving into a Community: Defining the Common Ancestor and Evolutionary Trajectory of Chronic Lymphocytic Leukemia Stereotyped Subset #4

INTRODUCTION
Patients with chronic lymphocytic leukemia (CLL) assigned to stereotyped subset #4 are characterized clinically by an early age at diagnosis and an indolent disease course and molecularly by B-cell receptor immunoglobulins (BcR IGs) that exhibit a series of distinctive immunogenetic features (1,2). More specifically, they are IgG-switched (a rarity in CLL since the great majority of CLL clones, >90% of all cases, express IgM/IgD) and are composed of heavy chains encoded by the IGHV4-34 gene and light chains encoded by the IGKV2-30 gene (3–5). The antigen-binding sites of subset #4 are equally interesting, being composed of a variable heavy complementarity determining region 3 (VH CDR3) that is long and enriched in positively charged residues (reminiscent of pathogenic anti-DNA antibodies) (3,4). Anti-DNA is the most common specificity in autoreactivity, with DNA binding often acquired through surface-active basic amino acids; predominantly arginine (R) but also, to a lesser extent, lysine (K) (6–8). This point is worthy of note since the VH CDR3 of subset #4 is defined by a (R/K)RYY motif which is deemed to not only be "CLL-biased" but also exclusive to subset #4 as it has never been found outside this context (3,4). In addition, both the VH and variable kappa (VK) domains of subset #4 demonstrate a high impact of somatic hypermutation (SHM) and are remarkable for carrying shared ("stereotyped") SHM, that is, identical changes at the same codon position of the variable domain (3,9).
Subset #4 is also outstanding due to intense intraclonal diversification (ID) within its IG genes in the context of ongoing SHM, alluding to an active, ongoing interaction with antigen(s) (10,11). Indeed, by conducting a large-scale longitudinal study of subset #4 we previously established: (i) the existence in most cases of distinct "clusters" of subcloned sequences; (ii) a hierarchical pattern of subclonal evolution, thus revealing which SHMs were negatively or positively selected over time; and (iii) subclonal drift, that is, temporal changes in the relative size of different clusters of sequences (12).
Nevertheless, this study only investigated clonal evolution at an individual case level and hence could not shed light on the clonal ancestry of subset #4 as a whole, which is relevant since the remarkable biological and clinical similarities of subset #4 cases strongly support derivation from a common ancestor. In an attempt to trace the ontogeny of subset #4, we here sought to revisit ID in subset #4 and reconstruct their evolutionary history by determining the structure of a community of related clones profiled at different time points for both IG heavy and light chains.

Patient Group
Peripheral blood samples were collected at multiple time points from eight CLL patients meeting the International Workshop on Chronic Lymphocytic Leukaemia (iwCLL) criteria; these eight patients, on the basis of both their IG gene sequence features and our previously established criteria, were assigned to subset #4 (1,3,4,13). Patients' demographics and clinical and molecular data are summarized in Supplemental Table 1. Cases were analyzed over a six-year period (range 7 to 72 months, median 20 months) and no patient received treatment during sampling (Supplemental Table 1). The diagnostic sample was available, and designated time point 1, for 6 of the 8 patients analyzed. No diagnostic samples were available for the remaining two patients (P0103 and P2451); therefore, the initial samples (time point 1) analyzed for these patients were obtained 81 and 63 months post diagnosis, respectively. Written informed consent was obtained in accordance with the Declaration of Helsinki and the study was approved by the local ethics review committee.

Visualization of Clonal Evolution in Subcloned IG Gene Sequences
Sequence data were processed using the Damerau-Levenshtein edit distance algorithm (14–16). The Damerau-Levenshtein distance, as defined in this paper, is a function of two arguments; in this study these two arguments are amino acid (or nucleotide) sequences. The defined distance is used to compute the difference between two IG chains. It is a distance metric in the sense that, given the amino acid (or nucleotide) sequences s1, s2, s3, the following conditions apply:
• Nonnegativity: d(s1, s2) ≥ 0;
• Nondegeneracy: d(s1, s2) = 0 if and only if s1 = s2;
• Symmetry: d(s1, s2) = d(s2, s1);
• Triangle inequality: d(s1, s3) ≤ d(s1, s2) + d(s2, s3).
The analytical form of the Damerau-Levenshtein distance between two chains, a and b, having lengths M and N, respectively, is defined in terms of lev, a two-dimensional matrix with M rows and N columns whose (i,j) entry is $\mathrm{lev}_{a,b}(i,j)$, where i = 0, 1, 2, …, M−1 and j = 0, 1, 2, …, N−1. An element of this matrix is equal to 1 when a match is found between the i-th letter of chain a and the j-th letter of chain b; otherwise the element is equal to 0.
Application of this algorithm to the entire IG variable domain was used to illustrate the diversity/similarity between subcloned sequences obtained from different time points across all patients. Consequently, each clonal sequence was compared with the entire data set, i.e., all clonal sequences obtained irrespective of time point or case, and this strategy provided a more robust evolutionary model for CLL subset #4 than inferring clonal relations at an individual case level. Modeling the genesis of subset #4 in this manner facilitated the deconvolution of sequence changes that occurred during the life of the clone.
The distance matrix (17,18) resulting from the comparison process was used to interconnect each clone to the remainder in a minimum spanning tree, which was subsequently visualized using purpose-built tools (19–21). Clones were positioned within this tree according to their individual distances, thus forming clusters which illustrated clonal relatedness beyond the individual patient level.
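The pipeline described above (pairwise edit distances, a distance matrix, then a minimum spanning tree) can be sketched as follows. This is a minimal illustration under stated assumptions, not the purpose-built tool used in the study: it implements the unrestricted Damerau-Levenshtein distance from the standard textbook dynamic-programming formulation, builds the tree with Prim's algorithm, and uses toy sequences (the derived VK CDR3 root reported later in the text plus three hypothetical variants). All function names are illustrative.

```python
def damerau_levenshtein(a, b):
    """Unrestricted Damerau-Levenshtein distance (textbook formulation)."""
    last_row = {}                      # last row (1-based) where each symbol of a was seen
    big = len(a) + len(b)              # sentinel larger than any real distance
    d = [[0] * (len(b) + 2) for _ in range(len(a) + 2)]
    d[0][0] = big
    for i in range(len(a) + 1):
        d[i + 1][0] = big
        d[i + 1][1] = i
    for j in range(len(b) + 1):
        d[0][j + 1] = big
        d[1][j + 1] = j
    for i in range(1, len(a) + 1):
        last_col = 0                   # last column in this row where a match occurred
        for j in range(1, len(b) + 1):
            k = last_row.get(b[j - 1], 0)
            l = last_col
            cost = 0 if a[i - 1] == b[j - 1] else 1
            if cost == 0:
                last_col = j
            d[i + 1][j + 1] = min(
                d[i][j] + cost,                          # match or substitution
                d[i + 1][j] + 1,                         # insertion
                d[i][j + 1] + 1,                         # deletion
                d[k][l] + (i - k - 1) + 1 + (j - l - 1)  # transposition
            )
        last_row[a[i - 1]] = i
    return d[len(a) + 1][len(b) + 1]


def distance_matrix(seqs):
    """Symmetric matrix of pairwise distances."""
    return [[damerau_levenshtein(s, t) for t in seqs] for s in seqs]


def minimum_spanning_tree(dist):
    """Prim's algorithm on a dense distance matrix; returns (parent, child, weight) edges."""
    n = len(dist)
    in_tree = [False] * n
    best = [float("inf")] * n
    parent = [-1] * n
    if n:
        best[0] = 0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, dist[parent[u]][u]))
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
                parent[v] = u
    return edges


# Toy input: the derived VK CDR3 root and three hypothetical variants.
TOY_SEQS = ["MQGTHWPPYT", "MQGTHWPYT", "MQGTHWPPFT", "MQATHWPPYT"]
TOY_TREE = minimum_spanning_tree(distance_matrix(TOY_SEQS))
```

In this toy case every variant lies one edit from the root, so the tree is a star centred on the root sequence. A dense Prim's implementation is quadratic in the number of sequences, which is adequate for data sets of a few hundred sequences.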
To explore the functional similarities of observed sequence changes, we followed the ImMunoGeneTics information system (IMGT) classification of the 20 common amino acids for the properties of hydropathy and chemical characteristics (http://www.imgt.org/IMGTeducation/Aide-memoire/_UK/aminoacids/) and performed the following comparisons: (i) amino acid sequence distance including only replacement mutations; (ii) amino acid sequence distance when considering amino acids with similar physicochemical properties as single equivalent entities; (iii) amino acid sequence distance when considering amino acids within the same hydropathy group as single equivalent entities; and (iv) nucleotide sequence distance.
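Comparisons (ii) and (iii) amount to mapping each residue to its class label before measuring distance, so that substitutions within a class cost nothing. The sketch below illustrates the idea. The class memberships are written from memory of the IMGT physicochemical (11 classes) and hydropathy (3 classes) groupings and should be checked against the IMGT page cited above; the position-wise mismatch count is a simplification that assumes equal-length sequences, and the variant sequences are hypothetical.

```python
# Assumed IMGT-style groupings; verify against the IMGT Aide-memoire before use.
PHYSICOCHEMICAL = {
    "aliphatic": "AILV", "phenylalanine": "F", "sulfur": "CM", "glycine": "G",
    "hydroxyl": "ST", "tryptophan": "W", "tyrosine": "Y", "proline": "P",
    "acidic": "DE", "amide": "NQ", "basic": "RHK",
}
HYDROPATHY = {
    "hydrophobic": "ACILMFWV", "neutral": "GHPSTY", "hydrophilic": "RNDQEK",
}


def reduce_sequence(seq, classes):
    """Replace every residue by the name of the class it belongs to."""
    lookup = {aa: name for name, members in classes.items() for aa in members}
    return tuple(lookup[aa] for aa in seq)


def mismatches(a, b):
    """Position-wise differences between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def class_aware_mismatches(a, b, classes):
    """Differences that remain after collapsing residues into classes."""
    return mismatches(reduce_sequence(a, classes), reduce_sequence(b, classes))


ROOT = "MQGTHWPPYT"          # derived VK CDR3 root from the text
VARIANT_TS = "MQGTHWPPYS"    # hypothetical T->S change (same class in both schemes)
VARIANT_ML = "LQGTHWPPYT"    # hypothetical M->L change (same hydropathy class only)
```

Under these groupings the T-to-S change disappears in both reduced alphabets, whereas the M-to-L change is still counted at the physicochemical level and vanishes only at the hydropathy level.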
Focusing on both the VH and VK CDR3, hierarchical visualization was performed and by determining which nucleotide or amino acid had the highest probability of appearing at a certain position, a hypothetical VH and VK CDR3 sequence from which all subset #4 CDR3 sequences derive could be constructed. More specifically, a hierarchical tree structure comprised of nodes and branches was assembled. Within this structure, the root node corresponded to the derived (proposed) ancestral sequence, and the branches were determined based on the calculated optimal string distance of each node. The string distance of a node indicated its position from the root node and also its proximity to the other nodes.
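One simple way to realize "the residue with the highest probability at each position" is a per-position majority vote over aligned sequences. The sketch below is an illustration under assumptions rather than the authors' procedure: it requires sequences of equal length (so 9- and 10-amino acid VK CDR3s would need to be aligned or handled separately), weights each sequence by an observed count, and uses hypothetical toy counts.

```python
from collections import Counter


def consensus(seq_counts):
    """Most frequent residue at each position, weighted by sequence counts."""
    lengths = {len(s) for s in seq_counts}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    length = lengths.pop()
    columns = [Counter() for _ in range(length)]
    for seq, n in seq_counts.items():
        for pos, residue in enumerate(seq):
            columns[pos][residue] += n
    # most_common(1) returns the top residue; ties fall to the first one seen
    return "".join(col.most_common(1)[0][0] for col in columns)


# Hypothetical toy counts: sequence -> number of subcloned copies observed.
TOY_COUNTS = {
    "MQGTHWPPYT": 5,
    "MQGTHWPPFT": 2,
    "MQATHWPPYT": 1,
}
TOY_ROOT = consensus(TOY_COUNTS)
```

Because ties are resolved by insertion order here, a production version would need an explicit tie-breaking rule.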

Composite Clusters of Subset #4 IG Sequences: Convergent Patterns of Subclonal Evolution
Clustering at the amino acid level. Analysis of the IGHV-IGHD-IGHJ amino acid sequences produced six distinct clusters (Figure 1A). Four of these clusters were each composed of subclonal sequences obtained from a single patient (P0907, P1422, P3020 and P1939), with each individualized cluster exhibiting a distinctive dispersion of clones, thus reflecting the varying extent of ID among subset #4 cases (Supplemental Figure 1). The remaining two clusters largely consisted of sequences from two patients each (composite clusters). The first such cluster contained sequences from patients P3916 and P2920 that grouped closely together. The second multimember cluster primarily contained sequences from two patients, P0103 and P2451; however, seven subcloned sequences from patient P1422, stemming from two different time points, also clustered within this group, while the majority of sequences from patient P1422 clustered separately and at some distance away (Figures 1A,B). Thus, subcloned sequences from individual subset #4 cases clustered close together and behaved as if they were clonally related, that is, as if they stemmed from a community of clones with common ancestry. The observed branching may be indicative of special selective pressures occurring in parallel in distinct subclones.
Similar analysis of the IGKV-IGKJ amino acid sequences produced five clusters (Figure 1C). Two clusters were located within a very close distance, forming a more central core from which a further two clusters emanated. More specifically, the first cluster was formed by two patients (P2920 and P0907), while the second closely neighboring cluster contained the subcloned sequences from P0103 and P2451. Subcloned sequences from P3916 bridged these two clusters (Figure 1D). Clonal sequences from P1422 formed one of the two more distant clusters, while the other cluster was composed of clonal sequences from P1939. Patient P3020, previously found to carry limited ID despite bearing the highest SHM load (within both the IG heavy and light chain), was distanced from all other clusters. As with the cluster analysis of IG heavy chain sequences, we noted that individual IG kappa sequences occasionally were separated from their respective clusters and, instead, attached to clusters generated by other patients and located some distance away. Hence, the pattern of clustering evidenced from the kappa light chain sequences is analogous to that of their partner heavy chains, thereby reinforcing the idea that subset #4 essentially constitutes a community of related clones that follow closely similar ontogenetic and evolutionary pathways.
Clustering based on shared amino acid properties. Further comparisons were performed at the amino acid level by permitting a degree of ambiguity through the use of amino acid equivalences, that is, following the IMGT grouping of amino acids into classes based on distinct physicochemical or hydropathic properties. Excluding amino acids with the same physicochemical properties from the sequence distance-defining algorithm, that is, considering such amino acids as equal and, hence, not resulting in an overall change, resulted in a slight alteration to cluster formation. At the IG heavy chain level, five distinct clusters were now observed as opposed to the six clusters initially observed when physicochemical properties were included; this change primarily resulted from the clustering of P3020 with P0907 (previously occurring as two distinct clusters) (Figure 2A). For subset #4 kappa light chains, the effect on cluster formation was very minor (Figure 2B).
When amino acids within the same hydropathy groups were considered as equal, it was noted that mutations introduced by SHM did not lead to striking changes in hydropathy for any patient. Nevertheless, one difference observed when clustering in this manner concerned the subcloned IG heavy chain sequences of P3020 which, although remaining as a distinct cluster, now emerged from the cluster produced by patient P0907; prior to taking physicochemical properties or hydropathy into consideration, this cluster hailed from the central core cluster (Supplemental Figures 2, 3). The observed clustering based on shared amino acid properties indicates strong functional constraints for preservation of critical physicochemical properties; the limited range of permissible amino acids potentially reflects selection events governed by structural constraints for optimal antigen recognition.
Figure 1A illustrates cluster formation following analysis of the IGHV-IGHD-IGHJ amino acid sequences (n = 511). Six distinct clusters were observed: a central core was created by clonal sequences from two patients, P0103 and P2451, and from this core radiated a further five clusters. Figure 1B provides a more detailed view of the composition of this central core. The central core is framed by dotted lines and each cluster is then dissected further. The seven sequences from P1422 segregated from the parent cluster were observed initially at diagnosis as a minor subclone, were represented by only a single subcloned sequence at the second time point (1/33 subcloned sequences; 3%) and were undetectable at the third time point. Figure 1C details cluster formation following analysis of the IGKV-IGKJ amino acid sequences (n = 397). Figure 1D provides a more complete view of the major cluster resulting from analysis of the IGKV-IGKJ subclonal sequences. The major cluster is surrounded by dotted lines, and a comprehensive breakdown of each cluster is provided.
As observed with the IG heavy chain sequences, we noted that individual IG kappa sequences occasionally were separated from their respective clusters, and instead attached to distant clusters. This was particularly noted for three clonal sequences, one from P1939 and two from P2451, which carried 9-amino acid VK CDR3s while their remaining clonal sequences all carried a 10-amino acid VK CDR3; the longer VK CDR3 is created by an additional proline at codon 115 and an equal proportion of subset #4 cases in this study carried either a 9-amino acid VK CDR3 or a 10-amino acid VK CDR3. During cluster formation, patients with identical sequences become hidden by the last patient to be analyzed and found to harbor the exact same sequence. Thus, while it may initially appear that P0907 is absent from the clustering analysis in Figure 1C, it is merely obscured by another patient. This is illustrated in the reverse image of the P1939 (yellow)/P2920 (blue) cluster provided in Figure 1D, with the subclonal sequences of P0907 indicated by the red circle. Each circle represents subcloned sequences from one of the eight subset #4 patients included in the study. Identical sequences overlap and are thus represented by a single circle. Circles are color coded to match the patient tag and different shades of the same color indicate subclonal sequences from the same patient but from a different time point. The number within each circle indicates how many sequences carried that specific rearrangement. In Figure 1C, subcloned sequences with a 9-amino acid VK CDR3 lie above the dashed gray line while subclonal sequences from patients with a 10-amino acid VK CDR3 lie below the line. The number of circles appearing for each case is related to the level of intraclonal diversification observed. The asterisk beside the number 42 in Figure 1B indicates that this circle represents sequences from more than one patient.
Clustering at the nucleotide level. Finally, clustering based on changes within IG heavy chain nucleotide sequences produced an individual and distinct cluster for six patients, while patients P0103 and P2451 remained clustered together (Figure 3A). Within the IG kappa chains, although the clusters generated shared similarities to cluster formation at the amino acid level, the two central cores were completely distanced from each other and, instead, a major cluster was formed by four patients (P0103, P2451, P3916 and P1939) (Figure 3B). Since these four patients all carry a 10-amino acid VK CDR3, the enhanced segregation of clusters observed at the nucleotide level is likely attributable to the additional three nucleotides that these sequences carry compared with cases carrying a 9-amino acid VK CDR3, thus accounting for three additional sequence changes as opposed to one at the amino acid level.
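The three-versus-one effect can be checked with a small worked example. The sketch below uses a plain Levenshtein distance (no transpositions, a simplification that does not change the result for a pure insertion) and compares the derived 10-amino acid VK CDR3 root with a hypothetical 9-amino acid counterpart lacking one proline codon. Which of the two proline codons is the additional one is an assumption made purely for illustration.

```python
def levenshtein(a, b):
    """Plain edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # match or substitution
        prev = cur
    return prev[-1]


# Derived 10-amino acid VK CDR3 root from the text.
AA_10 = "MQGTHWPPYT"
NT_10 = "ATGCAAGGCACACACTGGCCCCCGTACACT"

# Hypothetical 9-amino acid counterpart: one proline codon (CCG) removed.
AA_9 = "MQGTHWPYT"
NT_9 = "ATGCAAGGCACACACTGGCCCTACACT"

AA_DISTANCE = levenshtein(AA_10, AA_9)   # one residue inserted
NT_DISTANCE = levenshtein(NT_10, NT_9)   # one codon, i.e. three nucleotides, inserted
```

A single codon insertion therefore weighs three times as much in the nucleotide-level distance matrix, which is consistent with the stronger separation of the 9- and 10-amino acid cases at that level.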
Taken collectively, this detailed computational reconstruction of CLL subset #4 clonal evolution based on merged IG sequence data for all eight cases (at either an amino acid or nucleotide level) reveals a convergent and unified tumorigenic evolutionary process. Thus, this framework is indicative of a "consensus path" of evolution for subset #4 cases with the branched evolutionary growth perhaps reflecting selective pressures honing their BcR affinities.

Tracing the Origins of CLL Subset #4: Molecular Phylogeny of CDR3 Sequences
Both the VH and VK CDR3s were visualized hierarchically with the aim of constructing a CDR3 sequence at both the nucleotide and amino acid level, which then could be considered as the root from which all subset #4 CDR3 sequences derive. Comparison of each CDR3 sequence to the derived root sequence, using the same algorithmic process applied throughout the entire variable domain, enabled us to identify the mutational path followed by each individual patient.
With regard to the VH CDR3, the derived root nucleotide and amino acid sequences were GCG AGA GGC TAC GCG GAT ACA GCT GTG GTT AGG AGG TAC TAC TAT TAC GGT ATG GAC GTC and ARGYADTAVVRRYYYYGMDV, respectively. These sequences would have been created through the association of the IGHV4-34 and IGHJ6 genes with the IGHD5-18 gene in reading frame 1. Within these sequence strings, GGC TAC GCG (translation: GYA) and AGG AGG (translation: RR) cannot be assigned to the germline sequence of any IGHD and/or IGHJ gene, and thus would correspond to nontemplated regions (N1 and N2, respectively). Regarding the VK CDR3, the derived root nucleotide and amino acid sequences were ATG CAA GGC ACA CAC TGG CCC CCG TAC ACT and MQGTHWPPYT, respectively, and would have been created by the association of the IGKV2-30 and IGKJ2 genes.
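The correspondence between the reported nucleotide and amino acid root sequences can be verified by translating the codons with the standard genetic code. The following sketch does this and also checks for the (R/K)RYY motif described in the Introduction. The codon table is built with the usual compact TCAG ordering, and all names are illustrative.

```python
import re
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(nt):
    """Translate an in-frame nucleotide string; spaces between codons are ignored."""
    nt = nt.replace(" ", "").upper()
    if len(nt) % 3:
        raise ValueError("sequence length is not a multiple of three")
    return "".join(CODON_TABLE[nt[i:i + 3]] for i in range(0, len(nt), 3))


# Derived root sequences as reported in the text.
VH_ROOT_NT = ("GCG AGA GGC TAC GCG GAT ACA GCT GTG GTT "
              "AGG AGG TAC TAC TAT TAC GGT ATG GAC GTC")
VK_ROOT_NT = "ATG CAA GGC ACA CAC TGG CCC CCG TAC ACT"

VH_ROOT_AA = translate(VH_ROOT_NT)
VK_ROOT_AA = translate(VK_ROOT_NT)

# (R/K)RYY motif that defines the subset #4 VH CDR3.
SUBSET4_MOTIF = re.compile(r"[RK]RYY")
MOTIF_HIT = SUBSET4_MOTIF.search(VH_ROOT_AA)
```

Both translations reproduce the amino acid roots given above, and the motif search locates RRYY within the VH CDR3 root, spanning the N2 region (RR) and the start of the tyrosine run.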
Figure 2. Cluster formation when considering amino acids within the same physicochemical groups as equals. Figure 2A illustrates clustering of the IG heavy chains (n = 511) while Figure 2B concerns clustering of the IG kappa light chains (n = 397). When considering amino acids within the same physicochemical groups (as defined by IMGT) as equals, a new cluster was formed at the heavy chain level between P3020 and P0907 (previously represented by two distinct clusters) while the effect on IG kappa light chains was minor and predominantly related to the separation of P2920 and P0907 from the central cluster. Circles are color coded to match the patient tag and different shades of the same color indicate subclonal sequences from the same patient but from a different time point. The number of circles appearing for each case is related to the level of ID observed.
Comparison of the complete VH CDR3 amino acid sequence data set to the root revealed that no patient's sequence exactly matched the root. That said, clonal sequences exhibiting the least differences were from patients P1422, P1939, P2451 and P3916 while P0907, P2920 and P3020 were those located furthest away (Figure 4A). Within the VK CDR3 data set, 41% (162/397) of all kappa light chain sequences were identical to the derived root. These sequences were from patients P0103 (n = 75), P2451 (n = 60) and P3916 (n = 27), while their few remaining sequences together with the subclonal sequences obtained from all other patients, contained only one or two differences, thus explaining the limited branching observed from the root (Figure 4B). Overall, by adopting this strategy we could for the first time propose the preimmune VH and VK CDR3 which forms the subset #4 BcR IG.
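The comparison against the derived root can be thought of as sorting sequences into levels by their number of differences from the root, with level 0 holding exact matches. The sketch below illustrates this with hypothetical toy counts and equal-length sequences (an assumption; length variants would need a true edit distance), and separately reproduces the arithmetic behind the reported 41% from the per-patient counts given in the text.

```python
from collections import Counter, defaultdict


def differences(seq, root):
    """Number of positions at which seq departs from the root (equal lengths assumed)."""
    if len(seq) != len(root):
        raise ValueError("sequence and root must have the same length")
    return sum(a != b for a, b in zip(seq, root))


def group_by_level(seq_counts, root):
    """Map each difference count (tree level) to the sequences found at that level."""
    levels = defaultdict(Counter)
    for seq, n in seq_counts.items():
        levels[differences(seq, root)][seq] += n
    return dict(levels)


ROOT = "MQGTHWPPYT"   # derived VK CDR3 root from the text

# Hypothetical toy data: sequence -> number of subcloned copies.
TOY_COUNTS = {"MQGTHWPPYT": 5, "MQGTHWPPFT": 2, "MQATHWPPYT": 1, "MQATHWPPFT": 1}
TOY_LEVELS = group_by_level(TOY_COUNTS, ROOT)

# Counts reported in the text: sequences identical to the root, per patient.
REPORTED_IDENTICAL = {"P0103": 75, "P2451": 60, "P3916": 27}
REPORTED_TOTAL = 397
REPORTED_PERCENT = round(100 * sum(REPORTED_IDENTICAL.values()) / REPORTED_TOTAL)
```

Sequences that share a level sit in the same row of the hierarchical view, which is why a data set with at most two differences from the root shows only limited branching.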

DISCUSSION
CLL subset #4 lies at the intersection between autoimmunity and malignancy. The expression of IGHV4-34 endows B cells with the capacity to recognize the N-acetyllactosamine (NAL) antigenic epitope present in both self and exogenous antigens via a germline-encoded motif located within the heavy variable framework region 1 of the IGHV4-34 gene (22,23). This motif remains intact in all CLL subset #4 IG heavy chain sequences despite a heavy SHM load and intense ID (3,4,10). Notably, recombinant monoclonal antibodies from CLL subset #4 patients have been found to bind viable B cells, recognizing the NAL epitope present on B-cell CD45 (24,25). Additional features encoded in the subset #4 IG BcR sequence that hint at autoreactivity include: (i) the predicted high electropositivity of their long arginine-rich VH CDR3s, reminiscent of pathogenic anti-DNA antibodies; and (ii) the presence of recurrent SHMs typified by the frequent introduction of acidic residues, similar to edited anti-DNA antibodies (3,9).
The route to malignancy for CLL subset #4 clones may thus be a multifactorial phenomenon, beginning with autoreactive precursors that undergo positive selection by DNA, nucleosomes and/or surface structures of apoptotic cells (26,27). Thereafter, modifications introduced by SHM may curtail this autoreactivity, thus rendering these clones anergic (28–30), though still capable of reactivation either through their BcRs and/or other immune receptors, namely toll-like receptors (TLRs) (31–35). While this scenario bodes well for our understanding of the evolutionary pathway followed by subset #4 clones, despite much ingenuity and effort, our knowledge about the specific eliciting antigen(s) for subset #4 remains limited. Along these lines, it is relevant to mention that recombinant monoclonal antibodies derived from subset #4 patients lacked detectable reactivity with DNA; however, upon removal of SHMs (reversion to germline configuration), these antibodies regained the ability to strongly bind DNA (24). Nevertheless, owing to difficulties in defining the unmutated progenitor rearrangement, mainly due to the extensive SHM present within subset #4 clones, the contribution made by the somatically generated CDR3s to auto-antibody specificity (24,25,36–39) may have been underestimated, thus obscuring the actual antibody-antigen interactions (40–42).
Figure 3. Composite clusters of subset #4 IG sequences at the nucleotide level. Figure 3A illustrates cluster formation following analysis of the IGHV-IGHD-IGHJ nucleotide sequences (n = 511). Seven distinct clusters were observed; six clusters represented a single patient each, while P0103 and P2451 remained clustered together, thus accounting for the seventh cluster. Figure 3B highlights cluster formation following analysis of the IGKV-IGKJ nucleotide sequences (n = 397) and highlights the distancing of the two central cores and instead the formation of a major cluster containing the subcloned sequences of four patients (P0103, P2451, P3916 and P1939). Circles are color coded to match the patient tag and different shades of the same color indicate subclonal sequences from the same patient but from a different time point. The number of circles appearing for each case is related to the level of intraclonal diversification observed.
Figure 4. Molecular phylogeny of the VH and VK CDR3 sequences of subset #4. Figure 4A illustrates how hierarchical visualization of the VH CDR3 amino acid sequence from all patients facilitated the construction of a VH CDR3 sequence that can now be considered as the root from which all subset #4 VH CDR3 sequences are derived. Since no patient's VH CDR3 sequence exactly matched the derived root, they are visually placed as branches. Sequences that have an equal number of amino acid changes from the root are placed at the same level (within the same row), since they branch from the root in a similar manner. Figure 4B illustrates the above phenomena for the VK CDR3 amino acid sequences. The limited branching evidenced is indicative of sequence relatedness, with only one or two differences between any patient sequence and the derived root. Identical sequences overlap and are thus represented by a single circle. The asterisk indicates that this circle represents sequences from more than one case. The number within each circle indicates how many sequences carried that specific rearrangement.
In an attempt to clarify and enhance our understanding of the ontogeny of CLL subset #4 B cells, we sought not only to reconstruct the evolutionary history of subset #4 clones viewed as a single antibody lineage, that is, the sequence of changes introduced into the lineage during the development of the clone, but also to identify the common ancestral sequence from which all subset #4 cases are derived, a task hitherto unattainable due to the heavy SHM load within the antigen-binding sites. One means to obtain insight into the trajectory of subset #4 clones would be through characterization of their genetic sequence, with the greatest insight obtained from longitudinal sampling. Consequently, for this purpose, we drew on a community of related clones profiled at different time points, for both heavy and light chains, derived from 8 subset #4 cases (12). The Damerau-Levenshtein distance algorithm, in the form of a purpose-built computational tool, was applied and enabled us to infer both the unmutated ancestral rearrangement and the maturation intermediates, and hence gain further insight into the interplay between mutational constraints and selection on antigen-binding affinity. Through this approach the focused evolution of subset #4, the evolution of single entities into a community of related clones, was clearly evidenced with most patient clusters found lying very close to each other due to a high degree of sequence relatedness. The branching observed within such clusters could perhaps reflect specific selective pressures that occurred in parallel in distinct subclones, as a means to fine-tune their BcR affinities. Importantly, exploring the evolutionary trajectory of subset #4 enabled us to suggest for the first time the common ancestral sequence from which all subset #4 cases likely descend. By determining the most probable sequence of mutations, the mutationally preferred pathway, the unmutated common ancestor (including the predicted VH CDR3) could be inferred, which could now serve as a template for antigen reactivity studies (which should better predict antigen specificities compared with previous studies). Defining the antigens bound by the CLL cells should aid in unraveling the path to malignancy in subset #4. We thus reason that knowledge of the subset #4 ancestral rearrangement could provide a blueprint for the resolution of crystal structures, which would not only further define structural characteristics of the #4 antibody, but also provide detailed molecular insights into the nature of contact sites between the antibody and antigen.

CONCLUSION
The tale of CLL subset #4 is truly intriguing; bestowed with autoreactive properties at birth, they fortuitously escape immunological tolerance and exist in an anergic state in the periphery, only to reemerge as immunocompetent cells (potentially due to dual engagement of the BcRs and TLRs). That said, the story is far from complete and unresolved issues relate to where and under what influence SHM (and also switching to the IgG isotype) occurs, and whether specific modalities of BcR/TLR collaboration and/or regulation may eventually impact on the biological behavior of the clones. Nevertheless, results from this study unveil new leads in the ontogeny of CLL subset #4 clones and bring fresh insights, which may directly impact the design of studies concerning the antigenic specificity of the clonotypic BcR IGs. Although it is difficult to predict how revelations in biological understanding may translate into improved immunological interventions, it seems reasonable to think that once a detailed understanding of the B-cell ontogeny of CLL subset #4 is achieved, doors for therapeutic strategies may open, for example, the design of peptides that would inhibit or alter the consequences of antigen-antibody interactions.