Tales of Genes and Proteins

BY KLAUS D. LINSE on APRIL 26, 2013

Implications of the Human Genome Project

The Human Genome Project revealed that the human genome contains approximately 26,000 to 38,000 genes that code for proteins, significantly fewer than had been anticipated. Protein-coding sequences account for only a very small fraction of the whole genome: about 1.5% of the estimated 3.2 billion base pairs (bp) code for proteins. The rest was originally called “junk DNA” and is now associated with non-coding RNA molecules, regulatory DNA sequences, introns, and sequences that have not yet been assigned a function.
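
The proportions quoted above are easier to grasp as absolute numbers. The short Python sketch below only redoes the arithmetic with the two figures from the paragraph; the function and variable names are made up for illustration.

```python
# Figures quoted in the paragraph above.
GENOME_SIZE_BP = 3_200_000_000   # estimated size of the human genome in base pairs
CODING_FRACTION = 0.015          # about 1.5% of the genome codes for proteins


def coding_base_pairs(genome_size_bp: int, coding_fraction: float) -> int:
    """Return the approximate number of protein-coding base pairs."""
    return round(genome_size_bp * coding_fraction)


coding_bp = coding_base_pairs(GENOME_SIZE_BP, CODING_FRACTION)
noncoding_bp = GENOME_SIZE_BP - coding_bp

print(f"Protein-coding: about {coding_bp / 1e6:.0f} million bp")
print(f"Non-coding:     about {noncoding_bp / 1e6:.0f} million bp")
```

In other words, roughly 48 million base pairs code for proteins, while more than 3.1 billion do not.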

A gamete is a cell that fuses with another cell during fertilization or conception in sexually reproducing organisms. In these organisms a sperm or egg carries a single (haploid) set of chromosomes that includes one copy of each gene. Ploidy refers to the number of sets of chromosomes in the nucleus of a cell. In normal human body cells, chromosomes exist in pairs. DNA variations, or polymorphisms, that tend to be inherited together are called a haplotype. Haplotype can also refer to a combination of alleles or to a set of single nucleotide polymorphisms (SNPs) found on the same chromosome. The haploid number (n) is the number of chromosomes in a gamete. Two gametes fuse to form a diploid zygote with twice this number (2n), which therefore carries, for example, two copies of each autosomal chromosome. The haploid human genome found in egg and sperm cells consists of approximately three billion DNA base pairs, while the diploid genome found in somatic cells has twice the DNA content.
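
The relationship between haploid gametes, the diploid zygote, and haplotypes can be illustrated with a toy Python sketch. The SNP names and alleles below are invented, and the human haploid chromosome number of 23 is added here as an assumption because the text does not state it.

```python
HAPLOID_CHROMOSOMES = 23            # n in humans (assumption; not stated in the text)
HAPLOID_GENOME_BP = 3_000_000_000   # approximate figure quoted above

# Hypothetical haplotypes: alleles at three made-up SNPs on the same chromosome.
egg_haplotype = {"snp_a": "A", "snp_b": "G", "snp_c": "T"}
sperm_haplotype = {"snp_a": "A", "snp_b": "C", "snp_c": "T"}


def fuse(egg, sperm):
    """Pair the allele from each gamete to give the diploid genotype per SNP."""
    return {snp: (egg[snp], sperm[snp]) for snp in egg}


def heterozygous_sites(genotype):
    """Return the SNPs at which the two inherited alleles differ."""
    return [snp for snp, (a, b) in genotype.items() if a != b]


zygote = fuse(egg_haplotype, sperm_haplotype)

print("Chromosomes in zygote (2n):", 2 * HAPLOID_CHROMOSOMES)
print("Base pairs in diploid genome:", 2 * HAPLOID_GENOME_BP)
print("Heterozygous SNPs:", heterozygous_sites(zygote))
```

Each gamete contributes one haplotype, so the zygote ends up with two alleles at every SNP, and only the positions where the two haplotypes disagree are heterozygous.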

A glossary of genomic terms can be found at the website of the National Human Genome Research Institute: http://www.genome.gov/.

The new technologies of “next-generation deep-sequencing” are presently generating huge amounts of raw sequence data from which protein sequences are derived. This has caused a new type of problem: it is very difficult to accurately annotate the emerging protein-sequence space. A complete proteome is now understood as the collection of all valid proteins from a sequenced genome. The sequencing of whole genomes, together with partial sequencing efforts, has resulted in the archiving of more than 20 million protein sequences in the UniProtKB database (release 2012_1, 25 January 2012). The compiled sequence collection, now called the “known protein space”, contains millions of viral sequences, thousands of microbial genomes, and sequences from thousands of multicellular organisms. The functional information for the vast majority of these protein sequences is based on sequence-similarity approaches. Only approximately 3.5% of all sequences found in UniProtKB have experimental support.

Three-dimensional protein structures provide the most reliable information on biochemical function. Presently, the Protein Data Bank (PDB) holds more than 80,000 solved protein structures, more than 1,250 DNA structures, approximately 1,000 RNA structures, and more than 4,000 structures of protein-nucleic acid complexes. The solved protein structures are indirectly associated with a large fraction of the protein space through semi-automated classifications and are organized in an inventory of approximately 1,500 basic protein domain folds. Sequence-based approaches for the assignment of protein families rely primarily on the notion of domains as the building blocks of proteins. Generally, a multiple sequence alignment is translated into a statistical model of the family; the Pfam database documents how this works in more detail. As of release 26.0, Pfam contains more than 13,000 manually curated protein families and is available via servers in the UK, the USA, and Sweden.
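
To give a feel for how a multiple sequence alignment can be turned into a statistical model, here is a minimal Python sketch. It builds a simple per-column residue frequency profile with pseudocounts and scores sequences against it. This is a deliberate simplification and not Pfam's actual method, which uses profile hidden Markov models; the toy alignment, function names, and uniform background are all assumptions made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Return per-column residue probabilities for an equal-length alignment.

    Gap characters are ignored; pseudocounts keep unseen residues above zero.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append(
            {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return profile


def log_odds_score(profile, sequence):
    """Sum of log2(profile probability / uniform background) over all positions."""
    background = 1.0 / len(AMINO_ACIDS)
    return sum(
        math.log2(profile[i][aa] / background) for i, aa in enumerate(sequence)
    )


# Hypothetical five-column alignment of a made-up protein family.
toy_alignment = ["MKVLA", "MKILA", "MRVLG", "MKV-A"]
profile = build_profile(toy_alignment)

family_like = log_odds_score(profile, "MKVLA")
unrelated = log_odds_score(profile, "WWWWW")
print(f"family-like score: {family_like:.2f}")
print(f"unrelated score:   {unrelated:.2f}")
```

A sequence that resembles the aligned family scores higher than an unrelated one, which is the basic idea behind assigning new sequences to families with alignment-derived models.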

Approximately 30% of the UniProtKB sequences, about 6.7 million, are marked as putative or hypothetical; for these, current methods have mostly failed to assign a function with high confidence. The reason is that most sequence- and structure-based assignments rely on local information, such as structural fold, sequence domain, and functional signature, and extending such annotations to the level of the full-length protein is prone to errors. Large-scale sequencing of environmental samples will only increase the scope of this problem.

Is ProtoNet the solution to the protein sequence annotation problem?

In 2013, Rappoport et al. used computational methods to tackle this problem and published a report on ProtoNet 6.1, a family-tree classification resource that provides an unsupervised analysis of protein sequences.

The resource generates protein families by

(i) Precalculating sequence-similarity values for all possible pairwise relationships using BLAST values,

(ii) Applying an unsupervised bottom-up clustering algorithm that organizes large sets of proteins in a hierarchical tree yielding high-quality protein families, and

(iii) Pruning the ProtoNet tree to retain only the most informative clusters.

This process generates a tree-like skeleton of the entire known protein space. In addition, rigorous annotation-based quality tests assign a statistically based quality measure for each stable cluster. Nearly 19 million full-length protein sequences are included in ProtoNet 6.1.
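
The three steps listed above can be sketched on a toy scale in Python. This is only an illustration of bottom-up clustering followed by pruning, not ProtoNet's implementation: the similarity scores are invented stand-ins for BLAST-derived values, average linkage is one possible merging rule, and the score threshold used for pruning is a simple placeholder for the resource's own criteria.

```python
from itertools import combinations

# Step (i): hypothetical precalculated pairwise similarities (higher = more similar).
# Invented values standing in for BLAST-derived scores; missing pairs count as 0.
SIMILARITY = {
    frozenset(("P1", "P2")): 0.90,
    frozenset(("P1", "P3")): 0.80,
    frozenset(("P2", "P3")): 0.85,
    frozenset(("P4", "P5")): 0.70,
}


def similarity(a, b):
    return SIMILARITY.get(frozenset((a, b)), 0.0)


def average_linkage(c1, c2):
    """Mean pairwise similarity between the members of two clusters."""
    scores = [similarity(a, b) for a in c1 for b in c2]
    return sum(scores) / len(scores)


def build_tree(proteins):
    """Step (ii): merge the two most similar clusters until one root remains.

    Returns the list of merged clusters with the score at which each was formed.
    """
    clusters = [frozenset([p]) for p in proteins]
    merges = []
    while len(clusters) > 1:
        c1, c2 = max(
            combinations(clusters, 2), key=lambda pair: average_linkage(*pair)
        )
        score = average_linkage(c1, c2)
        clusters = [c for c in clusters if c not in (c1, c2)]
        merged = c1 | c2
        clusters.append(merged)
        merges.append((merged, score))
    return merges


def prune(merges, min_score):
    """Step (iii): keep only clusters formed at or above a similarity threshold."""
    return [cluster for cluster, score in merges if score >= min_score]


merges = build_tree(["P1", "P2", "P3", "P4", "P5"])
families = prune(merges, min_score=0.5)
for family in families:
    print(sorted(family))
```

With these toy scores, the root cluster that joins all five proteins is formed at a similarity of zero and is pruned away, leaving nested families for P1 to P3 and a separate family for P4 and P5.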


Rappoport N, Linial N, Linial M. ProtoNet: charting the expanding universe of protein sequences. Nature Biotechnology. 2013;31:290–292. doi:10.1038/nbt.2553

Bernaschi M, Castiglione F, Ferranti A, Gavrila C, Tinti M, Cesareni G. ProtNet: a tool for stochastic simulations of protein interaction networks dynamics. BMC Bioinformatics. 2007 Mar 8;8 Suppl 1:S4. PMID: 17430571.