
Pedigree plays an important role in animal selective breeding programs. On the one hand, pedigree information improves the accuracy of estimated breeding values. On the other hand, it can be used to control inbreeding and avoid inbreeding depression of traits. Reliable and accurate pedigree records are therefore essential for a selective breeding program. However, a pedigree is usually stored as three columns (individual, sire, and dam), which makes it difficult to see an individual's ancestors and offspring at a glance. Visualizing the pedigree is therefore very useful.

On the Windows platform, Professor Yang Da’s team at the University of Minnesota developed the software pedigraph, which can display pedigrees that include many individuals. It is very powerful, but it needs to be configured through a parameter file. Professor Brian Kinghorn at the University of New England developed the software pedigree viewer, which can trim and prune the pedigree and display individuals’ pedigrees in a window. However, if the number of individuals is very large, the individuals overlap each other, so the pedigree display needs further optimization.

In the R environment, packages such as pedigree, nadiv, and optiSel all provide functions for preparing pedigrees, and packages such as kinship2 and synbreed can draw a pedigree tree. However, the drawn pedigree tree overlaps heavily when the number of individuals is large.

Therefore, we developed the visPedigree package, which builds on data.table for its strong data cleaning capability and on igraph for its excellent drawing of social networks, to further enhance the tidying and visualization of pedigrees. With this package, we can trace and prune the ancestors and descendants of any individual over a chosen number of generations. The package also automatically optimizes the layout of the pedigree tree and can quickly display a pedigree that includes a large number of individuals (more than 10,000 individuals per generation) by compacting the full-sib individuals and outlining the pedigree. The main contents are as follows:

## 1 Installation of the visPedigree package

The visPedigree package has not been released on CRAN, but it can be installed from GitHub (https://github.com) using the devtools package.

In this blog, all R scripts are run in RStudio. If the devtools package is not found in the library, please install it first, then load it.

# Try to load devtools; install it first if it is missing.
suppressPackageStartupMessages(is_installed <- require(devtools))
if (!is_installed) {
  install.packages("devtools")
  suppressPackageStartupMessages(require(devtools))
}

If the visPedigree package is not found in the library, please install it from GitHub, then load it. The package depends on the data.table and igraph packages. If these two packages are not installed, they will be installed together with it.

# Try to load visPedigree; install it from GitHub if it is missing.
suppressPackageStartupMessages(is_installed <- require(visPedigree))
if (!is_installed) {
  install_github("luansheng/visPedigree")
  suppressPackageStartupMessages(require(visPedigree))
}

## 2 Pedigree format specification

The first three columns of the pedigree data must be the individual, sire, and dam IDs, in that order. The three columns can be named as you like, but their order must not be changed. Individual IDs should not be coded as “”, “ ”, “0”, asterisk, or “NA”; otherwise these individuals will be deleted from the pedigree. Missing parents should be denoted by “NA”, “0”, or asterisk. A space and “” will also be recoded as missing parents, but this is not recommended. Additional columns, such as sex and generation, can be included in the pedigree file.
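As a minimal sketch of these coding rules (not the package's own implementation), the following base R snippet drops records whose individual ID is a missing code and recodes missing parents to NA. The toy pedigree and the name `missing_codes` are hypothetical.

```r
# Hypothetical toy pedigree; the IDs are made up for illustration.
ped <- data.frame(
  ID   = c("A", "B", "C", "0", "D"),
  Sire = c("0", "*", "A", "A", ""),
  Dam  = c("NA", " ", "B", "B", "B"),
  stringsAsFactors = FALSE
)

# Codes treated as "missing" in the format specification above.
missing_codes <- c("", " ", "0", "*", "NA")

# Drop records whose individual ID is itself a missing code.
ped <- ped[!(ped$ID %in% missing_codes | is.na(ped$ID)), ]

# Recode missing parents to a real NA.
for (col in c("Sire", "Dam")) {
  ped[[col]][ped[[col]] %in% missing_codes] <- NA
}

print(ped)
```

Recoding to a real NA first keeps later steps, such as founder detection, down to a single `is.na()` test.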

The fread function in the data.table package is used to read the pedigree information from a file. This function is very powerful and can automatically recognize various delimiters in text.

ped_2 <- data.table::fread(file = "datasets/ped2.csv",
                           sep = ",",
                           stringsAsFactors = FALSE)
head(ped_2)
##          ID Sire Dam  Sex Cand
## 1: X0YY0500    0   0 Male    0
## 2: X0YY0600    0   0 Male    0
## 3: X0YY0700    0   0 Male    0
## 4: X0YY1200    0   0 Male    0
## 5: X0YX0300    0   0 Male    0
## 6: X0YX0400    0   0 Male    0

## 3 Checking and tidying pedigree

### 3.1 Introduction

The pedigree can be checked and tidied through the tidyped() function.

This function takes a pedigree, checks for duplicated and bisexual individuals, detects pedigree loops, adds missing founders, sorts the pedigree, and traces the pedigree of the candidates.
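To make these checks concrete, here is a small base R sketch of three of them: duplicated records, bisexual individuals, and missing founders. It only illustrates the ideas and is not how tidyped() is implemented; the toy pedigree and all variable names are hypothetical.

```r
# Hypothetical toy pedigree with one duplicated record ("C"),
# one individual used as both sire and dam ("B"),
# and two parents without their own record ("A" and "B").
ped <- data.frame(
  ID   = c("C", "D", "E", "C"),
  Sire = c("A", "A", "B", "A"),
  Dam  = c("B", "B", "C", "B"),
  stringsAsFactors = FALSE
)

sires <- unique(ped$Sire[!is.na(ped$Sire)])
dams  <- unique(ped$Dam[!is.na(ped$Dam)])

# IDs that appear in more than one record.
dup_ids <- unique(ped$ID[duplicated(ped$ID)])

# Individuals that appear in both the sire and the dam column.
bisexual_ids <- intersect(sires, dams)

# Parents that have no record of their own and would need to be added.
missing_founders <- setdiff(union(sires, dams), ped$ID)

print(list(duplicated = dup_ids,
           bisexual = bisexual_ids,
           missing_founders = missing_founders))
```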

If the parameter cand contains individuals’ IDs, then only these individuals and their ancestors or descendants will be kept in the pedigree.

The tracing direction and the number of generations to trace can be set through the trace and tracegen parameters.

The virtual generation of each individual will be inferred and assigned when the parameter addgen is TRUE.
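One simple way to infer a virtual generation is sketched below in base R: founders get generation 1, and every other individual gets one more than the highest generation among its known parents. This rule is an assumption made for illustration, and the package may use a different algorithm. The toy pedigree is hypothetical and assumes that every parent has its own record.

```r
# Hypothetical toy pedigree; every parent also has its own record.
ped <- data.frame(
  Ind  = c("A", "B", "C", "D", "E"),
  Sire = c(NA, NA, "A", "A", "C"),
  Dam  = c(NA, NA, "B", NA, "D"),
  stringsAsFactors = FALSE
)

# Generation of each individual, looked up by ID; NA means "not yet known".
gen <- setNames(rep(NA_integer_, nrow(ped)), ped$Ind)

repeat {
  changed <- FALSE
  for (i in seq_len(nrow(ped))) {
    id <- ped$Ind[i]
    if (!is.na(gen[id])) next
    parents <- c(ped$Sire[i], ped$Dam[i])
    parents <- parents[!is.na(parents)]
    if (length(parents) == 0) {
      # Founder: no known parents.
      gen[id] <- 1L
      changed <- TRUE
    } else if (!anyNA(gen[parents])) {
      # All known parents already have a generation.
      gen[id] <- max(gen[parents]) + 1L
      changed <- TRUE
    }
  }
  # Stop when a full pass assigns nothing new.
  if (!changed) break
}

print(gen)
```

If the pedigree contains a loop, the individuals in the loop never receive a generation and keep the value NA, so the same pass doubles as a crude loop check.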

A numeric pedigree will be generated when the parameter addnum is TRUE.

The sex of every individual will be inferred if the pedigree contains no sex information. If the pedigree includes a Sex column, individuals’ sexes need to be recoded as “male”, “female”, or NA (unknown sex). Missing sexes will be inferred from the pedigree structure and added where possible.

The visPedigree package comes with multiple datasets. You can list them with the following command.

data(package="visPedigree")

The following code shows the simple_ped dataset. It includes four columns: the first three are individual, sire, and dam, and the last one is sex. Missing parents are written as “NA”, “0”, or asterisk. Moreover, the founders are not included in the pedigree, and some parents are listed after their offspring.

head(simple_ped)
##        ID   Sire    Dam    Sex
## 1: J4Y326 J3Y620 J3Y771   male
## 2: J1H419 J0Z938 J0Z167 female
## 3: J2F588     NA J1Z417 female
## 4: J1J576 J0Z938 J0Z843   male
## 5: J1C802 J0Z333 J0C355   male
## 6: J2Z411 J1X971 J1J134 female
tail(simple_ped)
##        ID   Sire    Dam    Sex
## 1: J1E852 J0Z848 J0Z624 female
## 2: J1H604 J0C583 J0Z380 female
## 3: J5X804 J4Y326 J4E185 female
## 4: J1I438 J0Z990 J0Z808   male
## 5: J2C808 J1I975 J1F266   male
## 6: J1K462 J0C317 J0C450 female
# The number of individuals in the pedigree dataset
nrow(simple_ped)
## [1] 31
# Individual records with missing parents
simple_ped[Sire %in% c("0","*","NA",NA) | Dam %in% c("0","*","NA",NA)]
##        ID   Sire    Dam    Sex
## 1: J2F588     NA J1Z417 female
## 2: J1J858 J0Z060      * female
## 3: J3X697 J2Z903      0 female

Small test: try setting the female J0Z167 as the sire of J2F588. The tidyped() function will detect this bisexual individual.

x <- data.table::copy(simple_ped)
x[ID == "J2F588",Sire:="J0Z167"]
y <- tidyped(x)
## Warning in checkped(ped, addgen): In the sire or dam column, Blank, Zero,
## asterisk, or character NA is recognized as a missing parent and is replaced
## with the missing value NA.
## Warning in checkped(ped, addgen): The following individuals are
## simultaneously bisexual.
## Warning in checkped(ped, addgen): J0Z167

Moreover, the tidyped function will also sort the simple_ped pedigree, replace missing parents with NA, place the parents before their offspring, and add records for the missing founders.

tidy_simple_ped <- tidyped(simple_ped)
## Warning in checkped(ped, addgen): In the sire or dam column, Blank, Zero,
## asterisk, or character NA is recognized as a missing parent and is replaced
## with the missing value NA.
head(tidy_simple_ped)
##       Ind Sire  Dam    Sex Gen IndNum SireNum DamNum
## 1: J0C032   female   1      1       0      0
## 2: J0C185   female   1      2       0      0
## 3: J0C231   female   1      3       0      0
## 4: J0C317     male   1      4       0      0
## 5: J0C450   female   1      5       0      0
## 6: J0C561     male   1      6       0      0
tail(tidy_simple_ped)
##       Ind   Sire    Dam    Sex Gen IndNum SireNum DamNum
## 1: J1C802 J0Z333 J0C355   male   5     54      47     46
## 2: J4E185 J3L886 J3X697 female   5     55      48     49
## 3: J4Y326 J3Y620 J3Y771   male   5     56      50     51
## 4: J1C929 J0Z511 J0Z444   male   6     57      53     52
## 5: J2Y434 J1C802 J1H419 female   6     58      54     28
## 6: J5X804 J4Y326 J4E185 female   6     59      56     55
nrow(tidy_simple_ped)
## [1] 59

In the tidied tidy_simple_ped, the founders’ records, including sex, have been added, and the parents are sorted before their offspring. The number of individuals increases from 31 to 59. The individual, sire, and dam columns are renamed to Ind, Sire, and Dam. Missing parents are uniformly replaced with NA, and the tidyped() function issues a corresponding warning. The new columns Gen, IndNum, SireNum, and DamNum are added by default. These columns will not be generated when the parameters addgen and addnum are set to FALSE.

If the simple_ped dataset does not include the Sex column, it will be added in the tidy_simple_ped dataset.

tidy_simple_ped_no_gen_num <- tidyped(simple_ped,addgen = FALSE,addnum = FALSE)
## Warning in checkped(ped, addgen): In the sire or dam column, Blank, Zero,
## asterisk, or character NA is recognized as a missing parent and is replaced
## with the missing value NA.
head(tidy_simple_ped_no_gen_num)
##       Ind Sire  Dam  Sex
## 1: J0Z333   male
## 2: J0Z511   male
## 3: J0Z664   male
## 4: J0Z848   male
## 5: J0Z475   male
## 6: J0Z938   male

After tidying the pedigree, you can use the fwrite function of the data.table package to write it to a file for genetic evaluation software such as ASReml.

Missing parents should be replaced with 0 when saving the pedigree file.

saved_ped <- data.table::copy(tidy_simple_ped)
saved_ped[is.na(Sire),Sire:="0"]
saved_ped[is.na(Dam),Dam:="0"]
data.table::fwrite(x=saved_ped,file = "tidysimpleped.csv",sep=",",quote = FALSE)
head(saved_ped)

### 3.2 Tracing the pedigree of a specific individual

You should set the cand parameter to trace the pedigree of specific individuals. A new column, Cand, will be added to the returned dataset; TRUE marks the candidates. If this parameter is not NULL, only the candidates and their ancestors or descendants (depending on the trace parameter) will be kept in the pedigree.

tidy_simple_ped_J5X804_ancestors <- tidyped(ped=tidy_simple_ped_no_gen_num,cand="J5X804")
tail(tidy_simple_ped_J5X804_ancestors)
##       Ind   Sire    Dam    Sex  Cand Gen IndNum SireNum DamNum
## 1: J3X697 J2Z903    female FALSE   4     45      43      0
## 2: J3Y620 J2C161 J2Z411   male FALSE   4     46      37     42
## 3: J3Y771 J2G465 J2X544 female FALSE   4     47      40     41
## 4: J4E185 J3L886 J3X697 female FALSE   5     48      44     45
## 5: J4Y326 J3Y620 J3Y771   male FALSE   5     49      46     47
## 6: J5X804 J4Y326 J4E185 female  TRUE   6     50      49     48

By default, tidyped() traces the candidates’ pedigree up to their ancestors. If you only want to trace back a specific number of generations, set the tracegen parameter. This parameter can only be used when the trace parameter is not NULL. All generations of the candidates will be traced when tracegen is NULL.

tidy_simple_ped_J5X804_ancestors_2 <- tidyped(ped=tidy_simple_ped_no_gen_num,cand="J5X804",tracegen = 2)
print(tidy_simple_ped_J5X804_ancestors_2)
##       Ind   Sire    Dam    Sex  Cand Gen IndNum SireNum DamNum
## 1: J3L886         male FALSE   1      1       0      0
## 2: J3X697       female FALSE   1      2       0      0
## 3: J3Y620         male FALSE   1      3       0      0
## 4: J3Y771       female FALSE   1      4       0      0
## 5: J4E185 J3L886 J3X697 female FALSE   2      5       1      2
## 6: J4Y326 J3Y620 J3Y771   male FALSE   2      6       3      4
## 7: J5X804 J4Y326 J4E185 female  TRUE   3      7       6      5

The code above traces the pedigree of J5X804 back two generations.

If you want to trace the descendants of an individual, set the trace parameter to “down”.
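The idea behind tracing a limited number of generations can be sketched in base R as a breadth-first walk from the candidate to its parents. The helper `trace_up()` and the toy pedigree are hypothetical and only illustrate the technique; they are not the package's code. As in the output above, parents beyond the traced depth are set to NA.

```r
# Hypothetical toy pedigree: A x B -> C, C x D -> E, E -> F.
ped <- data.frame(
  Ind  = c("A", "B", "C", "D", "E", "F"),
  Sire = c(NA, NA, "A", NA, "C", "E"),
  Dam  = c(NA, NA, "B", NA, "D", NA),
  stringsAsFactors = FALSE
)

# Walk from the candidates to their ancestors, one generation per pass.
# tracegen = NULL means "trace all generations".
trace_up <- function(ped, cand, tracegen = NULL) {
  keep <- cand
  current <- cand
  depth <- 0
  while (length(current) > 0 && (is.null(tracegen) || depth < tracegen)) {
    rows <- ped$Ind %in% current
    parents <- unique(c(ped$Sire[rows], ped$Dam[rows]))
    parents <- parents[!is.na(parents)]
    # Only expand individuals that have not been visited yet.
    current <- setdiff(parents, keep)
    keep <- c(keep, current)
    depth <- depth + 1
  }
  out <- ped[ped$Ind %in% keep, ]
  # Parents outside the traced pedigree become missing.
  out$Sire[!out$Sire %in% keep] <- NA
  out$Dam[!out$Dam %in% keep] <- NA
  out
}

two_gen <- trace_up(ped, "F", tracegen = 2)
print(two_gen)
```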

There are three options for the trace parameter:

• “up”: trace the candidates’ pedigree to ancestors;
• “down”: trace the candidates’ pedigree to descendants;
• “all”: trace the candidates’ pedigree to ancestors and descendants simultaneously.

tidy_simple_ped_J0Z990_offspring <- tidyped(ped=tidy_simple_ped_no_gen_num,cand="J0Z990",trace="down")
print(tidy_simple_ped_J0Z990_offspring)
##       Ind   Sire    Dam    Sex  Cand Gen IndNum SireNum DamNum
## 1: J0Z990         male  TRUE   1      1       0      0
## 2: J1I438 J0Z990      male FALSE   2      2       1      0
## 3: J2G465 J1I438      male FALSE   3      3       2      0
## 4: J3Y771 J2G465    female FALSE   4      4       3      0
## 5: J4Y326    J3Y771   male FALSE   5      5       0      4
## 6: J5X804 J4Y326    female FALSE   6      6       5      0

Tracing down to the descendants of J0Z990, a total of 5 descendants can be found.
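Tracing in the other direction follows the same pattern: repeatedly collect every individual whose sire or dam is already in the set. The base R sketch below uses a hypothetical helper `trace_down()` and a hypothetical toy pedigree, and is meant only to illustrate the idea.

```r
# Hypothetical toy pedigree: A x B -> C, C x D -> E, E -> F.
ped <- data.frame(
  Ind  = c("A", "B", "C", "D", "E", "F"),
  Sire = c(NA, NA, "A", NA, "C", "E"),
  Dam  = c(NA, NA, "B", NA, "D", NA),
  stringsAsFactors = FALSE
)

# Collect the candidates and all of their descendants.
trace_down <- function(ped, cand) {
  keep <- cand
  current <- cand
  while (length(current) > 0) {
    children <- ped$Ind[ped$Sire %in% current | ped$Dam %in% current]
    # Only expand individuals that have not been visited yet.
    current <- setdiff(children, keep)
    keep <- c(keep, current)
  }
  keep
}

family_of_A <- trace_down(ped, "A")
descendants_of_A <- setdiff(family_of_A, "A")
print(descendants_of_A)
```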

### 3.3 Creating an integer pedigree

Some programs require an integer pedigree for genetic evaluation. Individuals need to be numbered consecutively when calculating the additive genetic relationship matrix.

By default, the tidyped function adds three columns (IndNum, SireNum, and DamNum) to the returned dataset. If you don’t need them, you can set addnum=FALSE to turn this off.
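For intuition, an integer pedigree can be derived from a sorted pedigree with `match()`, as in the base R sketch below. The toy pedigree is hypothetical, and coding missing parents as 0 follows the tidyped() output shown earlier; this is an illustration, not the package's code.

```r
# Hypothetical toy pedigree, already sorted so parents precede offspring.
ped <- data.frame(
  Ind  = c("A", "B", "C", "D"),
  Sire = c(NA, NA, "A", "C"),
  Dam  = c(NA, NA, "B", NA),
  stringsAsFactors = FALSE
)

# Number individuals consecutively in pedigree order.
ped$IndNum <- seq_len(nrow(ped))

# Look up each parent's number; missing parents are coded as 0.
ped$SireNum <- match(ped$Sire, ped$Ind, nomatch = 0L)
ped$DamNum  <- match(ped$Dam, ped$Ind, nomatch = 0L)

print(ped)
```

Because the pedigree is sorted, every parent number is smaller than the number of its offspring, which many relationship-matrix algorithms rely on.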

## 4 Drawing the pedigree

The visped() function takes a pedigree tidied by the tidyped() function and outputs a hierarchical graph of all individuals in the pedigree. The graph can be shown on the default graphics device and saved in a pdf file. The graph in the pdf file is a vector drawing, so it is legible and free of overlap. This is especially useful when a generation contains many individuals or the individual labels are long. The function can draw the graph of a very large pedigree (> 10,000 individuals per generation) by compacting the full-sib individuals. This is very effective for aquatic animals, whose nucleus breeding populations usually include many full-sib families per generation. If the pedigree graph is wider than the maximum width of the pdf file (200 inches), an outline of the pedigree without individual labels is still drawn. The outline helps breeders quickly review how the nucleus breeding population was constructed and see whether new blood was introduced.

Important hint: It is strongly recommended to set the cand parameter when tidying a pedigree. After the pedigree is pruned to the specified candidates, the generation of each individual is inferred more accurately, and the layout of the individuals in the pedigree tree is more reasonable.

A small pedigree is drawn in the following figure. A legible vector figure is saved in a pdf file.
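The effect of compacting full sibs can be imitated in base R by counting individuals per sire, dam, and sex. In the sketch below, only offspring that are not parents themselves are collapsed; that rule, the toy pedigree, and the variable names are assumptions made for illustration and may differ from what visped() does internally.

```r
# Hypothetical toy pedigree: one full-sib family of four (O1 to O4),
# where O1 is also the sire of P1.
ped <- data.frame(
  Ind  = c("S1", "D1", "O1", "O2", "O3", "O4", "P1"),
  Sire = c(NA, NA, "S1", "S1", "S1", "S1", "O1"),
  Dam  = c(NA, NA, "D1", "D1", "D1", "D1", NA),
  Sex  = c("male", "female", "male", "male", "female", "male", "female"),
  stringsAsFactors = FALSE
)

# Candidates for compaction: both parents known and never used as a parent.
both_known <- !is.na(ped$Sire) & !is.na(ped$Dam)
is_parent  <- ped$Ind %in% c(ped$Sire, ped$Dam)
leaves <- ped[both_known & !is_parent, ]

# One row per family and sex, with the number of collapsed individuals.
leaves$N <- 1L
family_counts <- aggregate(N ~ Sire + Dam + Sex, data = leaves, FUN = sum)

print(family_counts)
```

Each row of `family_counts` corresponds to what a single family node labelled with a count would represent, which is why the number of nodes drops sharply for populations with large full-sib families.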

tidy_small_ped <-
  tidyped(ped = small_ped,
          cand = c("Y", "Z1", "Z2"))
visped(tidy_small_ped, compact = TRUE, file = "doc/smallped.pdf")
## The vector drawing of the pedigree is saved in the C:/Users/luan_/OneDrive/hugo/luansheng/content/post/doc/smallped.pdf file
## The cex for individual label is 0.7.
## Please decrease or increase the value of the parameter cex if the label's width is longer or shorter than that of the circle or square in the graph.

In the above graph, two shapes and three colors are used. A circle represents an individual, and a square represents a family. Dark sky blue means male, dark golden rod means female, and dark olive green means unknown sex. For example, a dark sky blue circle is a male individual, and a dark golden rod square represents all female individuals in a full-sib family when compact = TRUE. Ancestors are drawn at the top of the pedigree graph and descendants at the bottom. Parents and offspring are connected through a dummy node. The lines from the offspring to the dummy nodes are dark grey, and the lines from the dummy nodes to the sire and dam have the same colors as the parents.

### 4.1 A simple pedigree graph

The graph of the tidied simple_ped pedigree is drawn and displayed on the default graphics device of R or RStudio. The addgen and addnum parameters need to be set to TRUE when tidying the pedigree with the tidyped function.

visped(tidy_simple_ped)
## The cex for individual label is 0.7.
## Please decrease or increase the value of the parameter cex if the label's width is longer or shorter than that of the circle or square in the graph.
## It is recommended that the pedigree graph is saved in the pdf file using the parameter file
## The graph in the pdf file is a vector drawing: shapes, labels and lines are legible; shapes and labels isn't overlapped.

Usually, the figure displayed in the Plots panel of RStudio has poor definition. If the number of individuals is large, the individual IDs will overlap each other because of the restricted size of the pedigree graph. This problem is resolved by saving the pedigree graph as a vector graphic in a pdf file. Setting showgraph = FALSE stops the visped() function from drawing the pedigree graph on the default graphics device.

suppressMessages(visped(tidy_simple_ped, showgraph = FALSE, file="doc/simpleped.pdf"))

After opening the simpleped.pdf file, you’ll see a high-definition pedigree graph.

### 4.2 A reduced pedigree graph

Warning messages will be shown when you try to draw the pedigree graph of the deep_ped dataset.

visped(tidyped(deep_ped))
Too many individuals (>=3362) in one generation!!! Two choices:
1. Removing full-sib individuals using the parameter compact = TRUE; or,
2. Visualizing all nodes without labels using the parameter outline = TRUE.
Rerun visped() function!

The function indicates that there are too many individuals in one generation to draw a pedigree graph. It recommends using the compact or outline parameter to simplify the pedigree.

First, let’s try the compact parameter and output it in the deepped1.pdf file. The figure on the default graphic device has serious overlapping problems due to the large number of individuals and the limited plot size.

visped(tidyped(deep_ped),compact = TRUE, showgraph=TRUE, file="doc/deepped1.pdf")
## Warning in checkped(ped, addgen): In the sire or dam column, Blank, Zero,
## asterisk, or character NA is recognized as a missing parent and is replaced
## with the missing value NA.
## Warning in checkped(ped, addgen): Blank and NA are recoded as a missing sex
## in the Sex column of the pedigree.
## The vector drawing of the pedigree is saved in the C:/Users/luan_/OneDrive/hugo/luansheng/content/post/doc/deepped1.pdf file
## The cex for individual label is 0.525.
## Please decrease or increase the value of the parameter cex if the label's width is longer or shorter than that of the circle or square in the graph.

Let’s open the deepped1.pdf file and view the high-definition vector graph of the pedigree. Most of the shapes at the bottom are squares, and the number inside each square is the total number of male or female individuals in that family. The individual labels are smaller than the squares and circles, so they do not fit the shapes well. The labels can be enlarged by increasing the cex parameter, which controls the size of the individual labels (IDs) in the graph. The larger cex is, the larger the labels are, and vice versa. cex generally ranges from 0 to 1 but can be greater than 1, and it is typically adjusted in steps of 0.1. The visped function outputs messages that include the cex value used to draw the pedigree graph.

visped(tidyped(deep_ped),compact = TRUE, cex=0.83, showgraph = FALSE, file="doc/deepped2.pdf")
## Warning in checkped(ped, addgen): In the sire or dam column, Blank, Zero,
## asterisk, or character NA is recognized as a missing parent and is replaced
## with the missing value NA.
## Warning in checkped(ped, addgen): Blank and NA are recoded as a missing sex
## in the Sex column of the pedigree.
## The vector drawing of the pedigree is saved in the C:/Users/luan_/OneDrive/hugo/luansheng/content/post/doc/deepped2.pdf file
## The cex for individual label is 0.83.
## Please decrease or increase the value of the parameter cex if the label's width is longer or shorter than that of the circle or square in the graph.

Let’s open the deepped2.pdf file to view the high-definition vector graph of the pedigree. The individual labels fit the shapes better than in deepped1.pdf. If the fit is still not right, you can continue to adjust cex.

### 4.3 An outlined pedigree graph

An outlined pedigree graph is drawn by setting outline=TRUE. Individual labels are not shown in the graph. This is very effective for a large pedigree that includes many individuals.

In this graph, you can directly observe that external individuals were introduced in some generations. The graph is saved in the doc/deepped3.pdf file.

suppressMessages(visped(tidyped(deep_ped),compact = TRUE, outline=TRUE, showgraph = TRUE, file="doc/deepped3.pdf"))
## Warning in checkped(ped, addgen): In the sire or dam column, Blank, Zero,
## asterisk, or character NA is recognized as a missing parent and is replaced
## with the missing value NA.
## Warning in checkped(ped, addgen): Blank and NA are recoded as a missing sex
## in the Sex column of the pedigree.

Let’s try to draw another pedigree with a big family size. The graph is saved in the doc/bigfullsibped.pdf file.

cand_2007_G8_labels <- big_family_size_ped[(Year == 2007) & (substr(Ind,1,2) == "G8"),Ind]
cand_2007_G8_tidy_ped <- tidyped(big_family_size_ped,cand=cand_2007_G8_labels)
## Warning in checkped(ped, addgen): In the sire or dam column, Blank, Zero,
## asterisk, or character NA is recognized as a missing parent and is replaced
## with the missing value NA.
# Use suppressMessages to disable output prompts.
suppressMessages(visped(cand_2007_G8_tidy_ped,compact = TRUE, outline=TRUE, showgraph = TRUE, file="doc/bigfullsibped.pdf"))

### 4.4 How to use this package in a selective breeding program

#### 4.4.1 An analysis of founders for an individual

Selective breeding is actually a process of enriching the desirable minor genes dispersed among multiple founders through successive matings over multiple generations. The theory behind it is the well-known minor polygene hypothesis.

We select the individual “K110550H” in the deep_ped dataset to visualize its pedigree. The graph is saved in the doc/K110550Hped.pdf file.

suppressWarnings(K110550H_ped <- tidyped(deep_ped,cand="K110550H"))
suppressMessages(visped(K110550H_ped,showgraph = TRUE,file="doc/K110550Hped.pdf"))

As you can see from the figure above, the K110550H individual has 71 founders (individuals without parents). This means that the individual has accumulated favorable genes from many founders, so the breeding objective trait can be improved with a large genetic gain.
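Counting founders does not require reading them off the figure. Once a pedigree has been pruned to one candidate and its ancestors, the founders are simply the records with both parents missing, as the base R sketch below shows. The toy pedigree and variable names are hypothetical; with the real data, the same filter would be applied to the Sire and Dam columns of the tidied pedigree.

```r
# Hypothetical toy pedigree, already pruned to candidate "F" and its ancestors.
ped <- data.frame(
  Ind  = c("A", "B", "C", "D", "E", "F"),
  Sire = c(NA, NA, "A", NA, "C", "E"),
  Dam  = c(NA, NA, "B", NA, "D", NA),
  stringsAsFactors = FALSE
)

# Founders are individuals with both parents unknown.
founders <- ped$Ind[is.na(ped$Sire) & is.na(ped$Dam)]
n_founders <- length(founders)

print(founders)
print(n_founders)
```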

#### 4.4.2 The contribution of different families in a selective breeding program

When the optimum contribution theory is used to optimize the mating design, families do not contribute the same number of individuals: a family with a high integrated selection index contributes more. By visualizing the pedigree, we can directly see the contribution of the different families.

The code below shows the composition of the parents of the 106 families born in the nucleus breeding population in 2007. Only two generations of ancestors, the parents and grandparents, are drawn in the graph by setting tracegen = 2.

# Trace the 2007 G8 candidates back two generations (parents and grandparents).
suppressWarnings(
  cand_2007_G8_tidy_ped_ancestor_2 <-
    tidyped(
      big_family_size_ped,
      cand = cand_2007_G8_labels,
      trace = "up",
      tracegen = 2
    )
)
# Collect the sires and dams of the candidates.
sire_label <-
  unique(cand_2007_G8_tidy_ped_ancestor_2[Ind %in% cand_2007_G8_labels, Sire])
dam_label <-
  unique(cand_2007_G8_tidy_ped_ancestor_2[Ind %in% cand_2007_G8_labels, Dam])
sire_dam_label <- unique(c(sire_label, dam_label))
sire_dam_label <- sire_dam_label[!is.na(sire_dam_label)]
# Group the parents into full-sib families by their own sire and dam.
sire_dam_ped <-
  cand_2007_G8_tidy_ped_ancestor_2[Ind %in% sire_dam_label]
sire_dam_ped <- sire_dam_ped[, FamilyID := paste(Sire, Dam, sep = "")]
family_size <- sire_dam_ped[, .N, by = c("FamilyID")]
fullsib_family_label <- unique(sire_dam_ped$FamilyID)
# Draw the compact outline of the three generations.
suppressMessages(
  visped(
    cand_2007_G8_tidy_ped_ancestor_2,
    compact = TRUE,
    outline = TRUE,
    showgraph = TRUE
  )
)

In the figure above, the 106 families are shown at the bottom, the parents in the middle, and the grandparents at the top. The parents consist of 80 sires and 106 dams and come from 54 full-sib families in the grandparent generation. Because the optimum contribution theory was used, about 25 parents come from just two full-sib families, accounting for 13.44% of the total number of parents.
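The percentage quoted above follows directly from the counts in the text, as this short base R check shows. The variable names are hypothetical; the numbers are the ones reported in this section.

```r
# Counts reported in the text above.
n_sires <- 80
n_dams <- 106
n_parents <- n_sires + n_dams

# Parents that come from the two largest full-sib families.
n_from_two_families <- 25

# Share of all parents, in percent, rounded to two decimals.
share <- round(100 * n_from_two_families / n_parents, 2)

print(n_parents)
print(share)
```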

ps: This blog is posted to R-Bloggers.com.