Software for the analysis of family data – GERA – Genetic Epidemiology Research Alliance

CI lead: Associate Professor Shuai Li, Centre of Epidemiology and Biostatistics, Melbourne School of Population and Global Health

Team members: Dr James Dowty, CEB, MSPGH

Awarded: $40,000 in GERA’s Grant Round 1 (2025)

About the project: R is the programming language of choice for many statisticians, because almost all statistical procedures have been added to R as software add-ons, called packages. We have previously developed an R package for the analysis of family data, called clipp. This R package is similar to the classic program Mendel v3.2, but it is easier to use (being written in a modern language) and is significantly faster (by utilising parallel computing). However, clipp currently provides only the core routines required for analyses, so the proposed project will substantially extend the package.

In particular, we will add the following capabilities to clipp:

1. Functions to visualise family data

Data visualisation is a key step in any statistical analysis, including the analysis of family data. We have written a prototype function to draw a pedigree picture and display user-specified data on each person. This function is based on a novel recursive trick, and it rapidly finds an aesthetically appealing layout, even for very large and complex families (see Figure 1, below). We will optimise this function for aesthetics and robustness, and we will provide default settings to easily display important data such as affected statuses, ages and genotypes.

2. Functions to analyse X-linked loci and multiple linked genetic loci

Currently, clipp includes helper functions to analyse unlinked, autosomal genetic loci. We will streamline these analyses with further helper functions, and we will add new helper functions to analyse multiple linked loci (genetic loci close together on the same chromosome) and loci on the X chromosome.

3. Functions to perform complex segregation analyses

Complex segregation analyses are the state-of-the-art method underpinning many analyses of family data. For example, our recent JAMA paper used clipp to estimate the lifetime risks of stomach cancer for carriers of pathogenic variants in the gene CDH1 (PMID: 38873722). In collaboration with the Crick Institute in London, we are also using clipp in a pioneering long-read sequencing study to screen novel structural variants for their co-segregation with breast cancer in large, multiple-case families (in progress). These analyses are possible because we have written complex and delicate code that supplements clipp’s basic functions. In the current project, we will convert this prototype code into robust helper functions that allow researchers to perform these important tasks easily. Namely, we will create functions to estimate the lifetime risks of disease associated with genetic variants and to perform hypothesis testing. We will also create helper functions that can easily be incorporated into the bioinformatic pipelines for sequencing studies. In our collaboration with the Crick Institute, our co-segregation analyses are a critical step in the bioinformatic pipeline, but there are few publicly available bioinformatic tools for this step. The proposed project will allow clipp to fill this very important niche.
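To give a flavour of the calculation that underlies a segregation analysis, the following is a minimal Python sketch of a pedigree likelihood for a single nuclear family. It is not clipp's interface and not the code used in the studies above: the function names are hypothetical, and the model is a deliberately simplified assumption (one autosomal biallelic locus, Hardy-Weinberg genotype frequencies for the parents, Mendelian transmission, and a fixed probability of being affected for each genotype, with no age or ascertainment adjustment).

```python
from itertools import product


def hardy_weinberg(q):
    """Population frequencies of genotypes carrying 0, 1 or 2 variant alleles."""
    p = 1.0 - q
    return [p * p, 2.0 * p * q, q * q]


def transmission(child, mother, father):
    """P(child genotype | parental genotypes) under Mendelian inheritance.

    Genotypes are coded as the number of variant alleles (0, 1 or 2).
    """
    pm = mother / 2.0  # chance the mother passes on a variant allele
    pf = father / 2.0  # chance the father passes on a variant allele
    probs = [(1 - pm) * (1 - pf), pm * (1 - pf) + (1 - pm) * pf, pm * pf]
    return probs[child]


def nuclear_family_likelihood(q, penetrance, mother, father, children):
    """Likelihood of the observed affected statuses in one nuclear family.

    q          : variant allele frequency (assumed known)
    penetrance : penetrance[g] = P(affected | g variant alleles)
    mother, father : True if affected
    children   : list of affected statuses, one per child
    """
    def pheno(affected, g):
        return penetrance[g] if affected else 1.0 - penetrance[g]

    freq = hardy_weinberg(q)
    total = 0.0
    # Sum over the unobserved parental genotypes. Given the parents, the
    # children are independent, so each child's genotype is summed out
    # separately rather than enumerating all joint genotype combinations.
    for gm, gf in product(range(3), repeat=2):
        term = freq[gm] * pheno(mother, gm) * freq[gf] * pheno(father, gf)
        for affected in children:
            term *= sum(
                transmission(gc, gm, gf) * pheno(affected, gc)
                for gc in range(3)
            )
        total += term
    return total
```

In this simplified setting, estimating disease risks corresponds to choosing the penetrance values that maximise the product of such likelihoods over all families, and hypothesis testing corresponds to comparing maximised likelihoods between competing models. Real analyses involve larger pedigrees, age-dependent risks and adjustment for how families were ascertained, which is where dedicated software becomes necessary.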

Impact: Our basic clipp package has been downloaded 20,000 times, making it an important and highly influential research output. The significant enhancements proposed here will facilitate clipp’s use in thousands of studies of human health and disease. They will allow clipp to fill a critical niche in the bioinformatic pipeline of sequencing studies, by providing robust and easy-to-use functions to screen novel genetic variants for co-segregation (association) with disease. They will also allow many researchers to use clipp to estimate the risks of disease for carriers of genetic variants. Currently in clipp, both of these capabilities are only possible for experts who can write their own custom code. The enhancements to clipp proposed here will therefore help many researchers discover disease-associated genetic variants and estimate their corresponding risks. This could have a profound impact on human health and health policy. Further, the proposed enhancements will be badged with GERA’s imprimatur, providing significant exposure for GERA to the world’s genetic epidemiologists.
