# Genetic associations of protein-coding variants in human disease

### Samples and participants

UKB is a UK population study of approximately 500,000 participants aged 40–69 years at recruitment (ref. 2). Participant data (with informed consent) include genomic data, electronic health record linkage, blood, urine and infection biomarkers, physical and anthropometric measurements, imaging data and various other intermediate phenotypes that are constantly being updated. Further details are available at https://biobank.ndph.ox.ac.uk/showcase/. Analyses in this study were conducted under UK Biobank Approved Project number 26041. Ethics protocols are provided by the UK Biobank Ethics Advisory Committee (https://www.ukbiobank.ac.uk/learn-more-about-uk-biobank/about-us/ethics).

FG is a public-private partnership project combining electronic health record and registry data from six regional and three Finnish biobanks. Participant data (with informed consent) include genomics and health records linked to disease endpoints. Further details are available at https://www.finngen.fi/. More details on FG and ethics protocols are provided in Supplementary Information. We used data from FG participants with completed genetic measurements (R5 data release) and imputation (R6 data release). FinnGen participants provided informed consent for biobank research. Recruitment protocols followed the biobank protocols approved by Fimea, the National Supervisory Authority for Welfare and Health. The Coordinating Ethics Committee of the Hospital District of Helsinki and Uusimaa (HUS) approved the FinnGen study protocol Nr HUS/990/2017. The FinnGen study is approved by the Finnish Institute for Health and Welfare.

### Disease phenotypes

FG phenotypes were automatically mapped to those used in the Pan UKBB (https://pan.ukbb.broadinstitute.org/) project. Pan UKBB phenotypes are a combination of Phecodes (ref. 37) and ICD10 codes. Phecodes were translated to ICD10 (https://phewascatalog.org/phecodes_icd10, v.2.1) and mapping was based on ICD-10 definitions for FG endpoints obtained from cause of death, hospital discharge and cancer registries. For disease definition consistency, we reproduced the same Phecode maps using the same ICD-10 definitions in UKB. Specifically, we expertly curated 15 neurological phenotypes using ICD10 codes. We retained phenotypes where the similarity score (Jaccard index: |ICD10_FG ∩ ICD10_UKB| / |ICD10_FG ∪ ICD10_UKB|) was >0.7 and additionally excluded spontaneous deliveries and abortions.
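The Jaccard filter described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the example ICD10 code sets are hypothetical, and only the 0.7 cut-off comes from the text.

```python
def jaccard_index(codes_fg, codes_ukb):
    """Jaccard index between two collections of ICD10 codes."""
    fg, ukb = set(codes_fg), set(codes_ukb)
    union = fg | ukb
    if not union:
        return 0.0  # assumption: two empty definitions count as no overlap
    return len(fg & ukb) / len(union)


def retain_phenotype(codes_fg, codes_ukb, threshold=0.7):
    """Keep a phenotype only if the two cohort definitions are similar enough."""
    return jaccard_index(codes_fg, codes_ukb) > threshold


# Hypothetical definitions of the same endpoint in the two cohorts.
fg_definition = {"I26", "I260", "I269"}
ukb_definition = {"I26", "I260", "I269", "I27"}
similarity = jaccard_index(fg_definition, ukb_definition)  # 3 shared / 4 total
```

With these example sets the similarity is 3/4, which passes the >0.7 filter; dropping one shared code from either side would bring it to 2/4 and the phenotype would be excluded.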

Phecodes and ICD10-coded phenotypes were first mapped to unified disease names and disease groups using mappings from the Phecode, PheWAS and icd R packages, followed by manual curation of unmapped traits and disease groups and of mismatched and duplicate entries. Disease endpoints were mapped to Experimental Factor Ontology (EFO) terms using mappings from EMBL-EBI and Open Targets based on exact disease entry matches, followed by manual curation of unmapped traits.

Disease trait clusters were determined by first calculating the phenotypic similarity via the cosine similarity, then determining clusters via hierarchical clustering on the distance matrix (1 − similarity) using the Ward algorithm and cutting the hierarchical tree, after inspection, at height 0.8 to provide the most semantically meaningful clusters.
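As a rough sketch of this procedure, the snippet below computes cosine distances and applies Ward linkage through the Lance-Williams update, stopping once the next merge would exceed the cut height. It is an illustration under assumptions: the text does not say what the phenotype vectors contain, so `profiles` is a hypothetical mapping from disease name to a numeric vector, and the merge-height convention (square root of the updated squared distance) is the one commonly used by statistical software rather than something stated in the source.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ward_clusters(profiles, cut_height=0.8):
    """Agglomerative Ward clustering on (1 - cosine similarity), cut at cut_height."""
    names = list(profiles)
    clusters = {i: [name] for i, name in enumerate(names)}
    # Squared pairwise distances, keyed by (smaller id, larger id).
    d2 = {}
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            dist = 1.0 - cosine_similarity(profiles[names[i]], profiles[names[j]])
            d2[(i, j)] = dist * dist
    next_id = len(names)
    while len(clusters) > 1:
        (i, j), best = min(d2.items(), key=lambda item: item[1])
        if math.sqrt(best) > cut_height:
            break  # Ward heights are monotone, so this equals cutting the tree
        n_i, n_j = len(clusters[i]), len(clusters[j])
        updated = {}
        for k in clusters:
            if k in (i, j):
                continue
            n_k = len(clusters[k])
            d_ik = d2[tuple(sorted((i, k)))]
            d_jk = d2[tuple(sorted((j, k)))]
            # Lance-Williams update for Ward linkage on squared distances.
            updated[(k, next_id)] = (
                (n_i + n_k) * d_ik + (n_j + n_k) * d_jk - n_k * best
            ) / (n_i + n_j + n_k)
        d2 = {pair: v for pair, v in d2.items() if i not in pair and j not in pair}
        d2.update(updated)
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        next_id += 1
    return sorted(sorted(members) for members in clusters.values())
```

In practice a library routine for hierarchical clustering would normally be used; the explicit loop is shown only to make the distance definition and the cut rule visible.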

### Genetic data processing

#### UKB genetic QC

UKB genotyping and imputation were performed as described previously (ref. 2). Whole-exome sequencing data for UKB participants were generated at the Regeneron Genetics Center (RGC) as part of a collaboration between AbbVie, Alnylam Pharmaceuticals, AstraZeneca, Biogen, Bristol-Myers Squibb, Pfizer, Regeneron and Takeda with the UK Biobank. Whole-exome sequencing data were processed using the RGC SBP pipeline as described (refs. 3,38). RGC generated a QC-passing ‘Goldilocks’ set of genetic variants from a total of 454,803 sequenced UK Biobank participants for analysis. Additional quality control (QC) steps were performed prior to association analyses as detailed below.

#### FG genetic QC

Samples were genotyped with Illumina and Affymetrix arrays (Thermo Fisher Scientific). Genotype calls were made with the GenCall and zCall algorithms for Illumina data and the AxiomGT1 algorithm for Affymetrix data. Sample, genotyping and imputation procedures and QC are detailed in Supplementary Information.

#### Coding variant selection

GnomAD v.2.0 variant annotations were used for FinnGen variants (ref. 39). The following gnomAD annotation categories are included: pLOF, low-confidence loss-of-function (LC), in-frame insertion–deletion, missense, start lost, stop lost, stop gained. Variants were filtered to imputation INFO score > 0.6. Additional variant annotations were performed using Variant Effect Predictor (VEP) (ref. 40) with SIFT and PolyPhen scores averaged across the canonical annotations.
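The selection rule above amounts to a category whitelist plus an INFO cut-off. A minimal sketch follows; the category strings and the record layout are hypothetical labels chosen for illustration and are not the identifiers used in the gnomAD or FinnGen annotation files.

```python
# Hypothetical labels for the annotation categories listed in the text.
CODING_CATEGORIES = {
    "pLOF",
    "LC",
    "inframe_indel",
    "missense",
    "start_lost",
    "stop_lost",
    "stop_gained",
}

MIN_INFO = 0.6  # imputation INFO score cut-off from the text (strictly greater)


def is_selected(category, info_score):
    """Return True for a coding variant that passes the INFO filter."""
    return category in CODING_CATEGORIES and info_score > MIN_INFO


def select_variants(variants):
    """variants: iterable of dicts with hypothetical keys 'id', 'category', 'info'."""
    return [v for v in variants if is_selected(v["category"], v["info"])]
```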

### Disease endpoint association analyses

For optimized meta-analyses with FG, analyses in UKB were performed in the subset of exome-sequenced UKB participants with white European ancestry for consistency with FG (n = 392,814). We used REGENIE v1.0.6.7 for association analyses via a two-step procedure as detailed in ref. 41. Briefly, the first step fits a whole-genome regression model for individual trait predictions based on genetic data using the leave one chromosome out (LOCO) scheme. We used a set of high-quality genotyped variants: MAF > 5%, MAC > 100, genotyping rate >99%, Hardy–Weinberg equilibrium (HWE) test P > 10⁻¹⁵, <5% missingness and linkage-disequilibrium pruning (1,000 variant windows, 100 sliding windows and r² < 0.8). Traits where the step 1 regression failed to converge due to case imbalances were excluded from subsequent analyses. The LOCO phenotypic predictions were used as offsets in step 2, which performs variant association analyses using the approximate Firth regression detailed in ref. 41 when the P value from the standard logistic regression score test is below 0.01. Standard errors were computed from the effect size estimate and the likelihood ratio test P value. To avoid issues related to severe case imbalance and extremely rare variants, we restricted association tests to phenotypes with >100 cases and to variants with MAC ≥ 5 in total samples and MAC ≥ 3 in cases and controls. The number of variants used for analyses varies between diseases because the MAC cut-off interacts with disease prevalence. The association models in both steps also included the following covariates: age, age², sex, age × sex, age² × sex and the first 10 genetic principal components (PCs).
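The text states that standard errors were derived from the effect size and the likelihood ratio test P value but does not give the formula. One common convention, assumed here, treats the test statistic as chi-square with one degree of freedom, so that the implied |z| is the normal quantile for P/2 and SE = |β| / |z|. The function name is hypothetical.

```python
from statistics import NormalDist


def se_from_beta_and_p(beta, p_value):
    """Back-compute a standard error from an effect size and a two-sided P value.

    Assumes a 1-d.f. chi-square (equivalently, two-sided normal) test statistic.
    """
    if not 0.0 < p_value < 1.0:
        raise ValueError("p_value must lie strictly between 0 and 1")
    # Use the lower tail so very small P values do not round 1 - p/2 up to 1.
    z = -NormalDist().inv_cdf(p_value / 2.0)
    return abs(beta) / z
```

For example, an effect of 0.5 with P = 0.05 implies |z| of about 1.96 and therefore a standard error of about 0.255.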

Association analyses in FG were performed using the mixed-model logistic regression method SAIGE v0.39 (ref. 42). Age, sex, 10 PCs and genotyping batches were used as covariates. For null model computation for each endpoint, each genotyping batch was included as a covariate for an endpoint if there were at least 10 cases and 10 controls in that batch, to avoid convergence issues. One genotyping batch needs to be excluded from the covariates so that they are not saturated. We excluded Thermo Fisher batch 16 as it was not enriched for any particular endpoints. For calculating the genetic relationship matrix (GRM), only variants imputed with an INFO score >0.95 in all batches were used. Variants with >3% missing genotypes were excluded, as were variants with MAF < 1%. The remaining variants were linkage-disequilibrium pruned with a 1-Mb window and an r² threshold of 0.1. This resulted in a set of 59,037 well-imputed, non-rare variants for GRM calculation. SAIGE options for null computation were: “LOCO=false, numMarkers=30, traceCVcutoff=0.0025, ratioCVcutoff=0.001”. Association tests were performed for phenotypes with case counts >100 and for variants with a minimum allele count of 3 and imputation INFO >0.6.

We additionally performed sex-specific associations for a subset of gender-specific diseases (60 female diseases in 50 disease clusters, 14 male diseases in 13 disease clusters) in both FG and UKB using the same approach without inclusion of sex-related covariates (Supplementary Table 2).

We performed fixed-effect inverse-variance meta-analysis combining summary effect sizes and standard errors for overlapping variants with matched alleles across FG and UKB using METAL (ref. 43).
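For reference, the standard fixed-effect inverse-variance estimator weights each cohort's effect by 1/SE². The sketch below implements that textbook formula for a single variant; it is not METAL itself, and the function name and the example inputs are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def ivw_meta(estimates):
    """Fixed-effect inverse-variance meta-analysis.

    estimates: list of (beta, se) pairs for the same variant and effect allele.
    Returns (beta, se, z, two-sided P).
    """
    weights = [1.0 / se ** 2 for _, se in estimates]
    total_weight = sum(weights)
    beta = sum(w * b for w, (b, _) in zip(weights, estimates)) / total_weight
    se = sqrt(1.0 / total_weight)
    z = beta / se
    p_value = 2.0 * NormalDist().cdf(-abs(z))
    return beta, se, z, p_value


# Hypothetical per-cohort summary statistics: (beta, se).
combined = ivw_meta([(0.2, 0.1), (0.4, 0.2)])
```

With these inputs the weights are 100 and 25, giving a combined effect of 0.24 with a standard error of about 0.089. The more precise cohort dominates the estimate, which is the intended behaviour of the scheme.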

### Definition and refinement of significant regions

To define significance, we used a combination of (1) a multiple-testing-corrected threshold of P < 2 × 10⁻⁹ (that is, 0.05/(approximately 26.8 × 10⁶), where 26.8 × 10⁶ is the sum of the mean number of variants tested per disease cluster), to account for the fact that some traits are highly correlated disease subtypes, (2) concordant direction of effect between UKB and FG associations, and (3) P < 0.05 in both UKB and FG.
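The three criteria can be written as a single predicate. This is a minimal sketch with hypothetical argument names; only the thresholds and the approximate test count come from the text.

```python
APPROX_TESTS = 26.8e6        # sum of mean variants tested per disease cluster
META_THRESHOLD = 2e-9        # rounded value of 0.05 / APPROX_TESTS
COHORT_THRESHOLD = 0.05


def is_significant(meta_p, beta_ukb, p_ukb, beta_fg, p_fg):
    """Apply the three significance criteria to one variant-disease association."""
    passes_meta = meta_p < META_THRESHOLD
    concordant = beta_ukb * beta_fg > 0  # same sign in both cohorts
    nominal_in_both = p_ukb < COHORT_THRESHOLD and p_fg < COHORT_THRESHOLD
    return passes_meta and concordant and nominal_in_both


# The unrounded Bonferroni threshold, for comparison with 2e-9.
unrounded_threshold = 0.05 / APPROX_TESTS
```

The unrounded value is about 1.87 × 10⁻⁹, which the text rounds to 2 × 10⁻⁹.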

We defined independent trait associations via linkage-disequilibrium-based (r² = 0.1) clumping ±500 kb around the lead variants using PLINK (ref. 44), excluding the HLA region (chr6:25.5–34.0 Mb), which is treated as one region due to complex and extensive linkage-disequilibrium patterns. We then merged overlapping independent regions (±500 kb) and further restricted each independent variant (r² = 0.1) to the most significant sentinel variant for each unique gene. For overlapping genetic regions that are associated with multiple disease endpoints (pleiotropy), to be conservative in reporting the number of associations we merged the overlapping (independent) regions to form a single distinct region (indexed by the region ID column in Supplementary Table 3).
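The region-merging step can be illustrated with a simple interval merge over ±500 kb windows. This sketch covers only the merging of overlapping windows: it does not perform linkage-disequilibrium clumping, does not apply the special handling of the HLA region, and uses hypothetical lead-variant positions.

```python
def merge_regions(lead_variants, flank=500_000):
    """Merge overlapping windows of +/- flank bp around lead variants.

    lead_variants: iterable of (chromosome, position) pairs.
    Returns a list of (chromosome, start, end) tuples, one per distinct region.
    """
    windows = sorted(
        (chrom, max(0, pos - flank), pos + flank) for chrom, pos in lead_variants
    )
    merged = []
    for chrom, start, end in windows:
        last = merged[-1] if merged else None
        if last and last[0] == chrom and start <= last[2]:
            last[2] = max(last[2], end)  # extend the current region
        else:
            merged.append([chrom, start, end])
    return [tuple(region) for region in merged]


# Hypothetical lead variants across two chromosomes.
regions = merge_regions(
    [("1", 1_000_000), ("1", 1_800_000), ("1", 5_000_000), ("2", 1_000_000)]
)
```

Here the first two variants on chromosome 1 lie within 1 Mb of each other, so their windows overlap and collapse into one region, leaving three distinct regions in total.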

### Cross-reference with known associations

We cross-referenced the sentinel variants and their proxies (r² > 0.2) for significant associations (P < 5 × 10⁻⁸) of mapped EFO terms and their descendants in GWAS Catalog (ref. 11) and PhenoScanner (ref. 12). To be more conservative with reporting of novel associations, we also considered whether the most-severe associated gene in our analyses was reported in GWAS Catalog and PhenoScanner. In addition, we queried our sentinel variants in ClinVar (ref. 13) to define known associations with rarer genetic diseases and further manually curated novel associations (where the association is both a novel variant association and a novel gene association) for previous genome-wide significant (P < 5 × 10⁻⁸) associations.

To assess clinical actionability of associated genes, we cross-referenced the associated genes with the latest ACMG v3. (75 unique genes linked to 82 conditions, linked to cancer (n = 28), cardiovascular (n = 34), metabolic (n = 3), or miscellaneous conditions (n = 8)). This list was supplemented by 20 ‘ACMG watchlist genes’ (ref. 14) for which evidence for inclusion in the ACMG 3.0 list was considered too preliminary based on technical, penetrance or clinical management concerns.

### Biomarker associations of lead variants

For the lead sentinel variants, we performed association analyses using the two-step REGENIE approach described above with 117 biomarkers including anthropometric traits, physical measurements, clinical haematology measurements, and blood and urine biomarkers available in UKB (detailed in Supplementary Table 8). Additional biochemistry subgroupings were based on UKB biochemistry subcategories: https://www.ukbiobank.ac.uk/media/oiudpjqa/bcm023_ukb_biomarker_panel_website_v1-0-aug-2015-edit-2018.pdf

### Drug target mapping and enrichment

We mapped the annotated gene for each sentinel variant to drugs using the Therapeutic Target Database (TTD) (ref. 21). We retained only drugs that have been approved or are in clinical trial phases. For enrichment analysis of approved drugs with genetic associations, we used Fisher’s exact test on the proportion of significant genes targeted by an approved drug against a background of all approved drugs in TTD (ref. 21) (n = 595) and 20,437 protein-coding genes from Ensembl annotations (ref. 45).
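An enrichment test of this kind reduces to a 2 × 2 table: genes split by whether they are significant and by whether they are targeted by an approved drug. The sketch below implements a two-sided Fisher's exact test from the hypergeometric distribution. The function names are hypothetical, and the way the table is laid out (with the 595 and 20,437 totals as margins) is an assumption about how the background was used rather than something spelled out in the text.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Sums the probabilities of all tables with the same margins that are
    no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denominator = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denominator

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    total = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)  # tolerance for float ties
    )
    return min(1.0, total)


def odds_ratio(a, b, c, d):
    """Sample odds ratio; undefined (division by zero) if b or c is zero."""
    return (a * d) / (b * c)
```

In use, `a` would be the number of significant genes that are approved drug targets, `b` the significant genes that are not, `c` the non-significant targets and `d` the remaining protein-coding genes, so that `a + c` is the number of targeted genes and `a + b + c + d` is the gene background.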

### Mendelian randomization analyses

#### F5 and F10 effects on pulmonary embolism

The missense variants rs4525 and rs61753266 in the F5 and F10 genes were taken as genetic instruments for Mendelian randomization analyses. To assess the potential that each factor level is causally associated with pulmonary embolism, we used two-sample Mendelian randomization using summary statistics, with the effect of the variants on their respective factor levels obtained from previous large-scale protein quantitative trait loci (pQTL) studies (refs. 46,47). Let \(\beta_{XY}\) denote the estimated causal effect of a factor level on pulmonary embolism risk and \(\beta_{X}\), \(\beta_{Y}\) be the genetic association with a factor level (FV, FX or FXa) and pulmonary embolism risk, respectively. Then, the Mendelian randomization ratio-estimate of \(\beta_{XY}\) is given by:

$$\beta_{XY}=\frac{\beta_{Y}}{\beta_{X}}$$

where the corresponding standard error \(\mathrm{se}(\beta_{XY})\), computed to leading order, is:

$$\mathrm{se}(\beta_{XY})=\frac{\mathrm{se}(\beta_{Y})}{\left|\beta_{X}\right|}$$
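The two formulas above translate directly into code. The sketch below is illustrative only: the function name is hypothetical and the numbers in the example are made up, not taken from the pQTL studies or from this analysis.

```python
def mr_ratio_estimate(beta_x, beta_y, se_y):
    """Single-instrument Mendelian randomization ratio estimate.

    beta_x: variant effect on the exposure (factor level)
    beta_y: variant effect on the outcome (pulmonary embolism risk)
    se_y:   standard error of beta_y
    Returns (causal estimate, leading-order standard error).
    """
    if beta_x == 0:
        raise ValueError("beta_x must be non-zero for the ratio to be defined")
    beta_xy = beta_y / beta_x
    se_xy = se_y / abs(beta_x)  # ignores uncertainty in beta_x
    return beta_xy, se_xy


# Made-up inputs for illustration.
estimate, standard_error = mr_ratio_estimate(beta_x=0.5, beta_y=0.1, se_y=0.02)
```

The leading-order standard error treats the variant-exposure effect as known exactly, which is reasonable when that effect is estimated much more precisely than the variant-outcome effect.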

#### Clustered Mendelian randomization

To assess evidence of multiple distinct causal mechanisms by which AF may influence pulse rate (PR), we used MR-Clust (ref. 31). Briefly, MR-Clust is a purpose-built clustering algorithm for use in univariate Mendelian randomization analyses. It extends the typical Mendelian randomization assumption that a risk factor can influence an outcome via a single causal mechanism (ref. 48) to a framework that allows multiple mechanisms to be detected. When a risk factor affects an outcome via multiple mechanisms, the set of two-stage ratio-estimates can be divided into clusters, such that variants within each cluster have similar ratio-estimates. As shown in ref. 31, two or more variants are members of the same cluster if and only if they affect the outcome via the same distinct causal pathway. Moreover, the estimated causal effect from a cluster is proportional to the total causal effect of the mechanism on the outcome. We included variants within clusters where the probability of inclusion was >0.7. We used the MR-Clust algorithm allowing for singleton or outlier variants to be identified as their own ‘clusters’ to reflect the large but biologically plausible effect sizes seen with rare and low-frequency variants.

### Bioinformatic analyses for METTL11B

We searched for proteins containing the [Ala/Pro/Ser]-Pro-Lys motif using the ‘peptide search’ function on UniProt (ref. 49), filtering for reviewed Swiss-Prot proteins and proteins listed in the Human Protein Atlas (HPA) (ref. 50) (n = 7,656). We obtained genes with elevated expression in cardiomyocytes (n = 880) from HPA based on the criteria: ‘cell_type_category_rna: cardiomyocytes; cell type enriched, group enriched, cell type enhanced’ as defined by HPA at https://www.proteinatlas.org/humanproteome/celltype/Muscle+cells#cardiomyocytes (accessed 20 March 2021), with filtering for those with valid UniProt IDs (Swiss-Prot, n = 863). The enrichment test was performed using Fisher’s exact test. Additionally, we performed enrichment analyses using any [Ala/Pro/Ser]-Pro-Lys motif located within the N-terminal half of the protein (n = 4,786).
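On a local set of protein sequences, the motif search could be approximated with a regular expression over one-letter amino acid codes. This is a sketch, not the UniProt peptide search: the function names and example sequences are hypothetical, and the rule that the whole tripeptide must end within the first half of the sequence is one possible reading of "located within the N-terminal half".

```python
import re

# [Ala/Pro/Ser]-Pro-Lys in one-letter codes; the lookahead finds overlapping hits.
MOTIF = re.compile(r"(?=([APS]PK))")


def motif_positions(sequence):
    """0-based start positions of every [A/P/S]-P-K occurrence."""
    return [match.start() for match in MOTIF.finditer(sequence.upper())]


def has_motif(sequence):
    return bool(motif_positions(sequence))


def has_motif_in_n_terminal_half(sequence):
    """True if at least one motif lies entirely within the first half."""
    half = len(sequence) / 2
    return any(start + 3 <= half for start in motif_positions(sequence))
```

For a made-up sequence such as `MAPKRRRRRR` the motif starts at position 1 and sits in the N-terminal half, whereas in `MRRRRRRSPK` it is present only at the C-terminal end.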

### Additional methods

Additional methods on further FinnGen QC; theoretical description and simulation of the effect of MAF enrichment on inverse-variance weighted (IVW) meta-analysis Z-scores; and functional characterization of PITX2c(Pro41Ser) are provided in the Supplementary Information.

### Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.