Exploring big wheat data for the development of functional markers related to heat and drought tolerance

Description of the topic

Improving heat and drought tolerance in wheat is paramount to ensuring food security for the growing world population under climate change. Several genes with a probable role in abiotic stress tolerance have been cloned in wheat and functionally validated in transgenic Arabidopsis, tobacco and/or wheat lines under controlled environments. However, breeders have not been able to integrate them into their breeding pipelines because functional markers are lacking. This project therefore proposes to identify functional variants in these candidate genes (CGs) by exploring the newest waves of big data, including whole-genome, pan-genome and exome sequences. These cross-cutting resources will allow high-value functional variants to be found with greater accuracy than before, enabling a new way of allele mining. Of all relevant functional variants in the genes, deleterious variants (DVs) contribute significantly to phenotypic variation and hence will be the most important targets of the current study. The project will combine two approaches to predict DVs from both coding and regulatory regions of ten selected CGs: GERP (Genomic Evolutionary Rate Profiling), which scores evolutionary constraint from gene sequences of multiple species, and the widely used SIFT (Sorting Intolerant From Tolerant) algorithm, which predicts whether an amino acid substitution affects protein function. The shortlisted DVs will be converted to KASP (Kompetitive Allele Specific PCR) assays for genotyping two germplasm panels of advanced lines at CIMMYT. A wealth of phenotypic data (grain yield, yield components, physiological traits), generated under drought and heat stress environments, already exists at CIMMYT for these panels. The project will use these data to test the association of KASP markers with traits using an association mapping (AM) approach.
The favorable alleles identified for the CGs in the AM analysis will be used to generate a heatmap of all advanced breeding lines. The heatmap will allow breeders to select lines carrying combinations of favorable alleles for individual traits or trait combinations, which should enhance breeding efficiency and help identify the best parental lines for future crossing schemes.


The research plan is divided into three work packages, detailed below.

WP1: Retrieval of sequences of the selected CGs and their orthologs (One month)

The complete genomic and coding sequences of the selected CGs will be downloaded in FASTA format from the NCBI database using the accession numbers reported in the publications cited in Table 1. The retrieved sequences will be used to determine the complete gene structure (i.e., the number and physical positions of introns and exons) of each CG. For each candidate gene, the three homoeologous sequences will be retrieved with the BLASTN algorithm from available genome archives (e.g., Ensembl Plants) and aligned to investigate homoeologous variation. Based on the sequence variation found across the pan-genome, exome and WGS data of CIMMYT lines, the most diverse set of 30 lines will be selected, and the three homoeologous copies of each candidate gene will be re-sequenced in these lines using genome-specific primers. The resulting homoeologous sequences will be aligned to obtain a consensus sequence for each candidate gene. These consensus sequences will then be used as BLASTN queries to retrieve orthologous sequences from ~20 plant species, including model species (rice, Brachypodium).
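The alignment step described above can be illustrated with a minimal sketch, assuming the homoeologous copies have already been aligned to equal length by an external aligner. The function names (parse_fasta, variable_sites, consensus), the majority-rule consensus with 'N' for ties, and the toy sequences are all illustrative assumptions, not part of the project's pipeline or real gene data.

```python
from collections import Counter


def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {record_id: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records


def variable_sites(aligned):
    """Return 1-based alignment columns at which the sequences differ."""
    seqs = list(aligned.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to equal length")
    return [i + 1 for i in range(length) if len({s[i] for s in seqs}) > 1]


def consensus(aligned):
    """Majority-rule consensus; tied or gap-majority columns become 'N'."""
    out = []
    for column in zip(*aligned.values()):
        counts = Counter(column).most_common()
        top, n = counts[0]
        tied = len(counts) > 1 and counts[1][1] == n
        out.append("N" if tied or top == "-" else top)
    return "".join(out)


# Toy aligned homoeologs (hypothetical sequences, A/B/D sub-genome copies)
toy = """>geneX_A
ATGGCTAGCT-A
>geneX_B
ATGGCAAGCTTA
>geneX_D
ATGGCTAGCTTA
"""
aligned = parse_fasta(toy)
sites = variable_sites(aligned)   # columns with homoeologous variation
cons = consensus(aligned)         # one consensus string per gene
print(sites, cons)
```

Columns reported by variable_sites are the positions that a genome-specific primer would need to exploit, which is why the sketch reports them separately from the consensus.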

WP2: Identification of DVs by assigning SIFT and GERP scores (Two months)

All orthologous and wheat sequences retrieved in WP1 will be aligned, and a separate multiple sequence alignment (MSA) file will be created for each CG. For calculation of SIFT scores, SNPs will be extracted from the MSA file with the snp-sites package in a Linux environment. The resulting VCF file will be used as input to the Variant Effect Predictor (VEP) on the Ensembl Plants server to determine the functional consequences of variants using SIFT scores. For assigning GERP scores, coding wheat sequences and their orthologs will be translated to protein sequences, and an MSA built from the protein sequences will be subjected to a modified GERP++ algorithm (Davydov et al. 2010). Positive GERP scores represent a substitution deficit (i.e., fewer substitutions than expected at a neutral site) and thus indicate that a site is under evolutionary constraint. Negative scores, on the other hand, represent a substitution surplus and indicate that a site is probably evolving neutrally. A combination of SIFT (<0.05) and GERP (>2) scores will be used to identify a higher-confidence set of DVs.
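The combined threshold rule can be sketched as follows. This is a minimal illustration under stated assumptions: the Variant record, its field names and the toy values are hypothetical, and treating a variant with a missing score (e.g. a non-coding variant without a SIFT prediction) as failing the combined filter is one possible choice, not something the plan specifies.

```python
from dataclasses import dataclass
from typing import Optional

SIFT_MAX = 0.05  # SIFT score must be below this (substitution not tolerated)
GERP_MIN = 2.0   # GERP score must be above this (site under constraint)


@dataclass(frozen=True)
class Variant:
    """Hypothetical record for one scored variant in a candidate gene."""
    gene: str
    position: int
    ref: str
    alt: str
    sift: Optional[float]  # None when no SIFT score is available
    gerp: Optional[float]  # None when no GERP score is available


def is_high_confidence_dv(v: Variant) -> bool:
    """Apply both thresholds; a missing score fails the filter (assumption)."""
    if v.sift is None or v.gerp is None:
        return False
    return v.sift < SIFT_MAX and v.gerp > GERP_MIN


# Toy variants with made-up scores, for illustration only
toy_variants = [
    Variant("geneX", 118, "G", "A", sift=0.01, gerp=3.4),   # passes both
    Variant("geneX", 240, "C", "T", sift=0.30, gerp=4.1),   # tolerated by SIFT
    Variant("geneY", 57, "T", "C", sift=0.02, gerp=-0.8),   # unconstrained site
    Variant("geneY", 402, "A", "G", sift=None, gerp=2.9),   # no SIFT score
    Variant("geneZ", 12, "G", "C", sift=0.05, gerp=2.0),    # on thresholds: fails
]
shortlist = [v for v in toy_variants if is_high_confidence_dv(v)]
for v in shortlist:
    print(v.gene, v.position, f"{v.ref}>{v.alt}", v.sift, v.gerp)
```

Both comparisons are strict, matching the "<0.05" and ">2" notation above, so a variant sitting exactly on a threshold is excluded.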

WP3: Experimental validation of DVs and construction of heat map (Three months)

For validation, all putative deleterious SNPs will be converted to functional markers (FMs) in the form of KASP assays for high-throughput genotyping of two independent, existing panels at CIMMYT. The first panel is an elite yield trial (EYT) composed of 829 spring wheat lines that formed the entries of multi-environment trials during the 2015–2016 growing season. The second is the well-known WAMI (Wheat Association Mapping Initiative) panel, which includes 294 historical cultivars and synthetic-derived lines (Lopes et al. 2012). Both panels have been used extensively in previous association mapping studies (Sehgal et al. 2019b; Lopes et al. 2015; Sukumaran et al. 2018), and a wealth of phenotypic data is available for them under drought and heat stress environments. Association of KASP markers with traits will be investigated in these panels using a mixed linear model with principal components as fixed covariates and kinship as a random effect, as described in Sehgal et al. (2019b). The favorable alleles identified for the CGs in the association analysis will be plotted for individual lines, and a heatmap of all advanced lines will be generated.
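The data behind such a heatmap can be sketched as a lines-by-markers matrix of favorable-allele counts. This is a minimal illustration, assuming KASP calls are stored as two-letter genotypes (e.g. "AA", "AG") and that the favorable allele of each marker is already known from the association analysis; the marker names, line names and function names are hypothetical.

```python
def favorable_matrix(genotypes, favorable):
    """Score each line x marker cell as the number of favorable allele copies.

    genotypes: {line: {marker: two-letter call, or None if the call failed}}
    favorable: {marker: favorable allele}
    Returns (ordered marker list, {line: [0, 1, 2 or None per marker]}).
    """
    markers = sorted(favorable)
    matrix = {}
    for line, calls in genotypes.items():
        row = []
        for marker in markers:
            call = calls.get(marker)
            row.append(None if call is None else call.count(favorable[marker]))
        matrix[line] = row
    return markers, matrix


def rank_lines(matrix):
    """Rank lines by total favorable copies; missing calls contribute zero."""
    totals = {
        line: sum(v for v in row if v is not None)
        for line, row in matrix.items()
    }
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


# Toy genotype calls for three lines at three markers (made-up data)
genotypes = {
    "LINE1": {"M1": "AA", "M2": "CC", "M3": "TT"},
    "LINE2": {"M1": "GG", "M2": "CC", "M3": None},
    "LINE3": {"M1": "AG", "M2": "TT", "M3": "TT"},
}
favorable = {"M1": "A", "M2": "C", "M3": "T"}

markers, matrix = favorable_matrix(genotypes, favorable)
ranking = rank_lines(matrix)
for line, total in ranking:
    print(line, matrix[line], total)
```

The matrix rows can be passed to any plotting tool to draw the heatmap, and the ranking gives a first-pass list of candidate parents carrying the most favorable alleles.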



By exploring the latest whole-genome, pan-genome and exome sequence data, the project proposes an allele mining approach and seeks to 1) identify DVs within the selected candidate genes and generate functional markers (FMs; KASP assays), 2) validate the FMs in different germplasm panels of CIMMYT, and 3) exploit the FMs to assist pre-breeding and breeding activities for improving abiotic stress tolerance in wheat.

Work expectations

The project’s ultimate output will be KASP markers for SNPs (within the selected candidate genes) that show associations with different traits under drought and heat stress environments. Hence, the student is expected to work as part of the wheat molecular biology team to design these markers and validate them on breeders’ germplasm. At the end of the project, the student is expected to draft the results as a manuscript for submission to a peer-reviewed journal.



In the first phase of the project, the student will conduct extensive bioinformatics work to retrieve the sequences of the candidate genes and their homoeologs and orthologs. Once retrieved, the sequences will be aligned and submitted to the relevant servers and tools to assign SIFT and GERP scores. The second phase will involve laboratory work and data analysis. Ideally, the student should have good bioinformatics skills and knowledge of different sequence databases.


Required skills


Knowledge of how different sequence databases function

Primer designing (not essential)

Setting up PCR reactions