Originally developed for the 1000 Genomes Project, the VCF specification has become the de facto standard output for variant calling software, thanks to its concise format and the growing volume of sequencing data generated by Next Generation Sequencing (NGS) methods.

File Format

Main Sections

As described in the specification for the Variant Call Format (VCF), there are 3 main sections to each file:

Meta Information Lines - Multiple lines, each prefixed by a double pound symbol (##).
Header Line - A single line prefixed by one pound symbol (#).
Data Lines - The remainder of the file, with 1 position per line.

The Meta section describes the format and content of that specific VCF file. This can include information about the sequencing performed, the variant calling software, or the reference genome used for determining variants. The first few rows from the VCF specification demonstrate this type of information, for example:

##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta

The Meta section also declares and describes the fields provided at both the site level (INFO) and the sample level (FORMAT) in the Data Lines. The VCF specification document includes examples of each type.

This design allows for great flexibility in the data represented by any given VCF file, letting each variant calling pipeline capture the most accurate and appropriate data and metadata. However, this flexibility comes at a cost, because downstream processing software may need to account for differences in output formats. At GenomOncology, where we integrate with a variety of DNA sequencers and variant callers, we have invested in making our VCF processing software highly configurable so that it can quickly adapt to new VCF formats that we may encounter.

Each VCF file has a single header line with 8 mandatory fields, separated by tabs, that name the columns of each data line:

#CHROM POS ID REF ALT QUAL FILTER INFO

If there is genotype data, then a FORMAT column is declared and followed by unique sample names. All of these column names must be separated by tabs as well.

Each data line represents a position in the genome. The data corresponds to the columns specified in the header and must be separated by tabs and ended with a newline.

Below are the columns and their expected values. In all cases, MISSING values should be represented by a dot ('.').

CHROM - Chromosome.
POS - Position. Sorted numerically in ascending order by chromosome.
ID - Unique identifiers separated by semicolons.
REF - Reference base(s). Insertions can be represented by a dot.
ALT - Comma-separated alternate base(s) (ACGT).
QUAL - Quality score on a log (Phred) scale. 100 means a 1 in 10^10 chance of error.
FILTER - Semicolon-separated list of the filters that failed, PASS if the site passed all filters, or MISSING if no filters were applied.
INFO - Site-level (non-sample) information in semicolon-separated name-value format.
FORMAT - Sample-level field name declarations separated by colons.
Sample columns - Sample-level field data separated by colons, corresponding to the FORMAT field declarations.

Example

Specification File

The specification includes an example VCF file.

Data Explanation

Below are some notes to help understand the first 5 columns of that example file.

All of the variants occur on chromosome 20 of the NCBI36 (hg18) reference assembly.
Three of the variants have IDs, including 2 dbSNP records (rs6054257, rs6040355).
The first two positions (14370, 17330) are simple single-base-pair substitutions.
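To make the column layout described above more concrete, here is a minimal Python sketch that splits one header line and one data line into their fields. It is only an illustration under stated assumptions, not how any particular VCF library or GenomOncology's software works: the function names (`split_or_empty`, `parse_info`, `parse_data_line`, `qual_to_error_probability`) are hypothetical, the sample header and data line are modeled on the example in the VCF specification, and all values are left as strings except POS and QUAL.

```python
# Hypothetical helper names; a sketch of the column rules, not a full VCF parser.
MANDATORY = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]

# Header and data line modeled on the example in the VCF specification.
HEADER = "#" + "\t".join(MANDATORY + ["FORMAT", "NA00001", "NA00002", "NA00003"])
LINE = "\t".join([
    "20", "14370", "rs6054257", "G", "A", "29", "PASS",
    "NS=3;DP=14;AF=0.5;DB;H2", "GT:GQ:DP:HQ",
    "0|0:48:1:51,51", "1|0:48:8:51,51", "1/1:43:5:.,.",
])


def split_or_empty(value, sep):
    """Split a field on sep, treating the MISSING marker '.' as no values."""
    return [] if value == "." else value.split(sep)


def parse_info(value):
    """Parse 'NS=3;DP=14;DB' into a dict; flags without '=' map to True."""
    info = {}
    for item in split_or_empty(value, ";"):
        key, eq, val = item.partition("=")
        info[key] = val if eq else True
    return info


def parse_data_line(header_line, data_line):
    """Pair tab-separated header names with the values of one data line."""
    columns = header_line.rstrip("\n").lstrip("#").split("\t")
    values = data_line.rstrip("\n").split("\t")
    row = dict(zip(columns, values))
    record = {
        "CHROM": row["CHROM"],
        "POS": int(row["POS"]),
        "ID": split_or_empty(row["ID"], ";"),          # semicolon-separated
        "REF": row["REF"],
        "ALT": split_or_empty(row["ALT"], ","),        # comma-separated
        "QUAL": None if row["QUAL"] == "." else float(row["QUAL"]),
        "FILTER": split_or_empty(row["FILTER"], ";"),  # semicolon-separated
        "INFO": parse_info(row["INFO"]),
        "samples": {},
    }
    if "FORMAT" in row:
        keys = row["FORMAT"].split(":")                # colon-separated names
        # Sample columns follow the 8 mandatory columns and FORMAT.
        for name in columns[len(MANDATORY) + 1:]:
            record["samples"][name] = dict(zip(keys, row[name].split(":")))
    return record


def qual_to_error_probability(qual):
    """Convert a Phred-scaled QUAL score to an error probability."""
    return 10 ** (-qual / 10)


record = parse_data_line(HEADER, LINE)
print(record["ID"], record["ALT"], record["samples"]["NA00003"]["GT"])
```

Keeping the MISSING check in one small helper means every list-valued column handles '.' the same way, which is convenient when different variant callers populate different columns.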