前言

怀揣着对基因检测的浓厚兴趣,在华大基因专家和生医学院妹子们的帮助下,我成功成为了基因测序实践课堂上最靓的仔!

这里整理了一些生物信息学和基因测序相关的理论,并附上课程中给出的图片和命令行代码,权当为自己留下一份学习记录与成就感!

生物信息学

什么是生物信息学?

生物信息学(Bioinformatics) 是一门交叉学科,结合生物学、计算机科学和信息技术,目的是管理、分析和理解生物数据,尤其是分子生物学数据(如DNA、RNA和蛋白质序列数据)。它通过算法、软件工具和数据库,帮助科学家从大量生物数据中提取有用信息,推动基因组学、转录组学、蛋白质组学等领域的发展。


生物信息学的主要用处

  1. 基因组序列分析:组装和注释基因组序列,发现基因及其功能。
  2. 基因表达分析:解析RNA测序数据,研究基因表达调控。
  3. 进化研究:通过比较不同物种的序列,构建系统发育树。
  4. 蛋白质结构预测:推测蛋白质的三维结构及其功能。
  5. 变异检测:识别基因组中的突变,关联疾病。
  6. 个性化医疗:辅助制定基于基因特征的治疗方案。
  7. 药物设计:利用结构生物信息学辅助靶点发现和药物筛选。

基因测序

三代基因测序技术介绍

第一代基因测序技术

  • 代表技术:Sanger测序(链终止法)
  • 特点
    • 读长较长(约800-1000 bp)
    • 准确度高,错误率低
    • 通量低,测序成本高
  • 应用:基因克隆测序、小规模测序项目

第二代基因测序技术(Next-Generation Sequencing, NGS)

  • 代表技术:Illumina、454测序、SOLiD等
  • 特点
    • 高通量,测序速度快
    • 读长较短(一般100-300 bp)
    • 成本大幅降低
  • 应用:全基因组测序、转录组测序、变异检测等大规模研究

第三代基因测序技术

  • 代表技术:PacBio SMRT测序、Oxford Nanopore测序
  • 特点
    • 读长超长(数千至百万碱基)
    • 直接测序,不需PCR扩增
    • 可检测DNA修饰(如甲基化)
    • 错误率略高,但随着技术进步有所改进
  • 应用:复杂基因组结构解析、转录变异检测、表观遗传学等

DNA数据分析流程和常用软件工具

DNA数据分析流程(典型流程)

  1. 数据质量控制(QC)

    • 使用工具(如FastQC)评估原始测序数据(FASTQ格式)质量
    • 去除低质量序列、接头污染和重复序列
  2. 序列比对(Mapping/Alignment)

    • 将测序读段比对到参考基因组(使用BWA、Bowtie等软件)
    • 生成比对文件(SAM/BAM格式)
  3. 后期处理

    • 排序、去重复(Deduplication)
    • 质量重校正(Base Quality Score Recalibration)
  4. 变异检测

    • 识别单核苷酸多态性(SNPs)、插入缺失(Indels)
    • 使用工具如GATK、FreeBayes、Samtools
  5. 注释变异

    • 利用数据库(如dbSNP、ClinVar)标注变异的生物学意义和潜在疾病关联
  6. 功能分析

    • 预测基因功能影响、蛋白质结构改变
    • 下游分析如GO富集、通路分析
  7. 结果可视化与报告

    • 制作图表、报告,方便理解和发表

1. 数据质量控制(Quality Control, QC)

目的:

  • 评估测序数据的质量,剔除低质量序列和污染序列,确保后续分析的准确性。

常用软件及功能:

  • FastQC:生成数据质量报告,包含序列质量分布、GC含量、接头污染分析等。
  • MultiQC:整合多个FastQC报告,便于批量样本质量总结。
  • Trimmomatic / Cutadapt / fastp:去除接头序列、低质量碱基和过短序列。

典型流程:

  • 读取FASTQ文件并检测质量(Phred质量分数)
  • 过滤低质量读段(常用阈值:Q20或Q30)
  • 去除接头序列,避免接头污染带来假阳性结果
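上面提到的Phred质量分数,其实就是把FASTQ质量行的ASCII字符按Phred+33规则解码得到的整数。下面是一个最小的Python示意(函数名为示例,并非任何工具的实际API):

```python
def phred_scores(qual_str, offset=33):
    """将FASTQ质量行字符解码为Phred质量分数(Phred+33编码)。"""
    return [ord(c) - offset for c in qual_str]

def pass_quality(qual_str, threshold=20):
    """判断一条read的平均质量是否达到阈值(如Q20)。"""
    scores = phred_scores(qual_str)
    return sum(scores) / len(scores) >= threshold

# 'I' 对应 Q40,'5' 对应 Q20,'#' 对应 Q2
print(phred_scores("I5#"))     # [40, 20, 2]
print(pass_quality("IIIII"))   # True:全部为Q40
print(pass_quality("#####"))   # False:全部为Q2
```

实际的QC工具(fastp等)还会结合滑窗、接头匹配等更复杂的策略,这里只演示质量分数的解码原理。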

2. 序列比对(Mapping/Alignment)

目的:

  • 将测序的短序列准确映射到参考基因组上,是变异分析的基础。

常用软件:

  • BWA (Burrows-Wheeler Aligner):用于短序列比对,速度快,适用于全基因组重测序。
  • Bowtie2:高性能短序列比对工具,适合基因组和转录组数据。
  • Minimap2:专门针对长读长测序(第三代测序)设计,适合PacBio和Nanopore数据。
  • STAR(转录组数据):RNA-seq比对工具,能处理剪接事件。

典型流程:

  • 构建参考基因组索引
  • 运行比对程序生成SAM文件(Sequence Alignment/Map)
  • 转换SAM到压缩的BAM格式,便于后续处理
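SAM文件中每条比对记录是以制表符分隔的一行,前11个为必需字段(QNAME、FLAG、RNAME、POS、MAPQ等)。下面用几行Python手工解析一条SAM记录,示例数据为虚构:

```python
# SAM规范定义的11个必需字段
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ",
              "CIGAR", "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

def parse_sam_line(line):
    cols = line.rstrip("\n").split("\t")
    rec = dict(zip(SAM_FIELDS, cols[:11]))
    # 按规范将数值字段转为整数
    for k in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        rec[k] = int(rec[k])
    return rec

line = "read1\t0\tchr1\t100\t60\t50M\t*\t0\t0\t" + "A" * 50 + "\t" + "I" * 50
rec = parse_sam_line(line)
print(rec["RNAME"], rec["POS"], rec["MAPQ"])  # chr1 100 60
# FLAG按位解释:0x4 表示未比对,0x100 表示secondary比对
print(bool(rec["FLAG"] & 0x4))  # False:该read已成功比对
```

实际处理SAM/BAM时应使用samtools或pysam,这里只是帮助理解记录结构。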

3. 后期处理

目的:

  • 提升比对数据质量,减少误差。

关键操作及软件:

  • Samtools:排序(sort)、索引(index)BAM文件。
  • Picard:去除PCR重复(MarkDuplicates),减少扩增偏好造成的假信号。
  • GATK BaseRecalibrator:基于已知变异信息调整测序错误的质量评分,提高准确度。
  • Qualimap:评估比对数据的整体质量和覆盖度。

4. 变异检测(Variant Calling)

目的:

  • 识别基因组中的SNP(单核苷酸变异)和Indel(插入/缺失)。

常用工具:

  • GATK HaplotypeCaller:主流变异检测工具,准确且功能强大。
  • FreeBayes:适合多样本联合检测。
  • Samtools/bcftools:轻量级SNP/Indel识别。
  • DeepVariant:基于深度学习的新兴工具,变异识别准确率高。

流程概要:

  • 输入BAM文件
  • 运行变异检测程序生成VCF文件(Variant Call Format)
  • 对变异结果进行初步过滤(质控阈值)
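VCF的数据行同样以制表符分隔,前8列固定为CHROM、POS、ID、REF、ALT、QUAL、FILTER、INFO。下面是一个解析数据行并按QUAL做初步过滤的Python小示意(示例数据为虚构):

```python
def parse_vcf_line(line):
    """解析VCF数据行的前8个固定列。"""
    chrom, pos, vid, ref, alt, qual, flt, info = line.rstrip("\n").split("\t")[:8]
    return {"CHROM": chrom, "POS": int(pos), "ID": vid,
            "REF": ref, "ALT": alt.split(","), "QUAL": float(qual),
            "FILTER": flt, "INFO": info}

def passes_qual(rec, min_qual=30.0):
    """按QUAL列做最简单的初步过滤。"""
    return rec["QUAL"] >= min_qual

rec = parse_vcf_line("chr1\t12345\trs111\tA\tG\t99.0\tPASS\tDP=30")
print(rec["REF"], rec["ALT"], passes_qual(rec))  # A ['G'] True
```

生产环境中应使用bcftools或GATK的过滤模块,它们还会综合QD、FS等INFO字段指标。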

5. 变异注释(Variant Annotation)

目的:

  • 解释变异的生物学意义,预测其对基因功能的影响。

参考数据库:

  • dbSNP, ClinVar, 1000 Genomes, ExAC, gnomAD

常用注释工具:

  • ANNOVAR:功能丰富,支持众多数据库及预测工具接口。
  • SnpEff:快速,能注释变异所在基因及其影响(同义、错义等)。
  • VEP (Variant Effect Predictor):由Ensembl发布,支持定制注释。
  • Funcotator(GATK):集成多数据库的功能注释工具。

6. 功能分析

目的:

  • 通过对筛选后的变异基因集合进行生物学功能富集,帮助理解变异影响。

常用分析内容:

  • 基因本体论(GO)富集分析
  • 通路分析(KEGG,Reactome等)
  • 蛋白结构预测及影响分析
  • 共表达网络分析

工具推荐:

  • DAVID:在线功能注释和富集分析。
  • Metascape:一站式综合分析平台。
  • ClusterProfiler(R包):强大的富集分析工具。
  • STRING数据库:蛋白质-蛋白质相互作用网络分析。

7. 结果可视化与报告

目的:

  • 制作易于解释和展示的图表与报告,让科研人员直观理解结果。

常用软件和工具:

  • IGV (Integrative Genomics Viewer):可视化基因组比对和变异情况。
  • R语言和Python(Matplotlib、Seaborn等可视化库):自定义绘制统计图表。
  • JBrowse:Web端交互式基因组浏览器。
  • MultiQC:质量报告汇总。
  • cBioPortal:癌症组学数据综合展示。

SNP分析

什么是SNP分析?

  • SNP (Single Nucleotide Polymorphism),即单核苷酸多态性,是基因组中某个特定位点上不同个体之间单碱基的变异。
  • SNP分析就是识别和研究这些单碱基变异,从而揭示个体间遗传差异及其生物学意义。
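就两条已比对的等长序列而言,SNP位点就是逐碱基比较时不一致的位置。下面用几行Python演示这一概念(纯示意,实际SNP分析还需考虑测序错误和基因型判定):

```python
def find_snp_positions(seq_a, seq_b):
    """返回两条等长(已比对)序列间所有单碱基差异:(位置, 碱基A, 碱基B)。"""
    assert len(seq_a) == len(seq_b), "序列需等长(已比对)"
    return [(i, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]

ref = "ACGTACGT"
alt = "ACGTACAT"  # 第7个碱基(0起始下标6)G -> A
print(find_snp_positions(ref, alt))  # [(6, 'G', 'A')]
```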

SNP分析的主要用处

  1. 遗传变异识别:发现个体或群体的基因组变异,了解遗传多样性。
  2. 疾病关联研究:寻找与疾病风险、药物反应相关的SNP,推动个性化医疗。
  3. 进化和种群研究:探究物种进化历史和群体结构。
  4. 作物和畜牧育种:利用SNP标记实现分子辅助选择,提高品种改良效率。
  5. 法医鉴定和亲缘关系分析:通过遗传标记进行身份识别。

SNP分析的大致流程

  1. 样本测序与数据获取

    • 通常通过高通量测序(WGS、Targeted sequencing等)获得DNA序列数据。
  2. 数据质量控制

    • 使用FastQC、fastp等工具过滤低质量序列。
  3. 序列比对

    • 将测序读段比对到参考基因组(BWA、Bowtie等工具)。
  4. 变异检测

    • 利用变异检测软件(如GATK HaplotypeCaller、FreeBayes)识别SNP和Indel。
  5. 变异过滤

    • 筛除低质量或假阳性变异,确保分析准确。
  6. 变异注释

    • 利用数据库(dbSNP、ClinVar等)注释SNP的功能和可能影响。
  7. 后续分析

    • 群体遗传分析、关联分析(GWAS)、功能富集等。

SNP分析详细流程与软件推荐


1. 数据质量控制

  • 软件:fastp
  • 目的:去除接头序列,滤除低质量reads
fastp -i sample_R1.fastq.gz -I sample_R2.fastq.gz \
      -o sample_R1.clean.fastq.gz -O sample_R2.clean.fastq.gz \
      --detect_adapter_for_pe --qualified_quality_phred 20 \
      --length_required 50 -w 8 --html sample_fastp_report.html

2. 比对到参考基因组

  • 软件:BWA mem
  • 目的:将clean reads比对到参考基因组(如hg38)
bwa mem -t 16 -M reference.fa sample_R1.clean.fastq.gz sample_R2.clean.fastq.gz > sample.sam

3. 转换排序与去重复

  • 软件:samtools + picard
# 转换sam到bam并排序
samtools view -@ 8 -bS sample.sam | samtools sort -@ 8 -o sample.sorted.bam

# 标记并去除重复reads
picard MarkDuplicates I=sample.sorted.bam O=sample.dedup.bam M=sample.dup_metrics.txt REMOVE_DUPLICATES=true

samtools index sample.dedup.bam

4. 基线质量校正(可选但强烈建议)

  • 软件:GATK BaseRecalibrator
gatk BaseRecalibrator -R reference.fa -I sample.dedup.bam \
  --known-sites dbsnp.vcf --known-sites Mills_and_1000G_gold_standard.indels.vcf \
  -O recal_data.table

gatk ApplyBQSR -R reference.fa -I sample.dedup.bam \
  --bqsr-recal-file recal_data.table -O sample.recal.bam

5. SNP和Indel变异检测

  • 软件:GATK HaplotypeCaller
gatk HaplotypeCaller -R reference.fa -I sample.recal.bam -O sample.raw.vcf.gz -ERC GVCF

(如果单样本,可以用 -ERC GVCF 生成GVCF文件,后续可联合调用)


6. 联合基因分型(多样本时)

gatk GenomicsDBImport --genomicsdb-workspace-path my_database \
  -V sample1.g.vcf.gz -V sample2.g.vcf.gz -V sample3.g.vcf.gz --intervals chr1

gatk GenotypeGVCFs -R reference.fa -V gendb://my_database -O cohort.raw.vcf.gz

7. 变异过滤

  • 软件:GATK VariantFiltration

硬过滤示例:

gatk VariantFiltration -R reference.fa -V cohort.raw.vcf.gz -O cohort.filtered.vcf.gz \
  --filter-name "QD_filter" --filter-expression "QD < 2.0" \
  --filter-name "FS_filter" --filter-expression "FS > 60.0" \
  --filter-name "MQ_filter" --filter-expression "MQ < 40.0" \
  --filter-name "SOR_filter" --filter-expression "SOR > 3.0"

8. SNP注释

  • 软件:ANNOVAR, SnpEff, 或 VEP

举例ANNOVAR注释:

table_annovar.pl cohort.filtered.vcf.gz humandb/ -buildver hg38 \
  -out cohort.annotated -remove -protocol refGene,cytoBand,exac03,avsnp150,clinvar_20210501 \
  -operation g,r,f,f,f -nastring . -vcfinput

总结简易Shell脚本模板(单样本)

#!/bin/bash

## 1. QC
fastp -i ${1}_R1.fastq.gz -I ${1}_R2.fastq.gz -o ${1}_R1.clean.fastq.gz -O ${1}_R2.clean.fastq.gz --detect_adapter_for_pe --qualified_quality_phred 20 --length_required 50 -w 8

## 2. Alignment
bwa mem -t 16 -M reference.fa ${1}_R1.clean.fastq.gz ${1}_R2.clean.fastq.gz > ${1}.sam

## 3. Convert & sort & mark duplicates
samtools view -@ 8 -bS ${1}.sam | samtools sort -@ 8 -o ${1}.sorted.bam
picard MarkDuplicates I=${1}.sorted.bam O=${1}.dedup.bam M=${1}.dup_metrics.txt REMOVE_DUPLICATES=true
samtools index ${1}.dedup.bam

WGS全基因组测序分析示例流程

下面是一个针对WGS(Whole Genome Sequencing)全基因组测序的典型分析流程示例,包括常用工具及示范命令,以及参数推荐。流程基于Linux环境和常用的开源工具,可供实际使用时参考和调整。


1. 质量控制(QC)

# 运行FastQC生成质量报告
fastqc sample_R1.fastq.gz sample_R2.fastq.gz -o ./fastqc_results/

# 结合所有样本报告,方便查看
multiqc ./fastqc_results/ -o ./multiqc_report/

# 使用fastp进行接头去除和质量过滤
fastp -i sample_R1.fastq.gz -I sample_R2.fastq.gz \
      -o sample_R1.trimmed.fastq.gz -O sample_R2.trimmed.fastq.gz \
      --detect_adapter_for_pe \
      --qualified_quality_phred 20 \
      --length_required 50 \
      --thread 8 \
      --html fastp_report.html

参数说明:

  • --qualified_quality_phred 20:保留Q20及以上的碱基
  • --length_required 50:保留序列长度至少50bp

2. 参考基因组索引

基因组采用人类参考基因组hg38(或根据样本选择)

# 以BWA为例构建索引
bwa index hg38.fa

3. 序列比对

# 使用BWA mem进行比对(短读长)
bwa mem -t 16 -M hg38.fa sample_R1.trimmed.fastq.gz sample_R2.trimmed.fastq.gz > sample.sam

参数说明:

  • -t 16:使用16线程
  • -M:将拆分比对产生的次要read标记为secondary,兼容Picard的重复标记流程

4. SAM转BAM及排序

# 转换为BAM文件并排序
samtools view -@ 8 -bS sample.sam | samtools sort -@ 8 -o sample.sorted.bam
samtools index sample.sorted.bam

5. 去除PCR重复

picard MarkDuplicates I=sample.sorted.bam O=sample.dedup.bam M=sample.dup_metrics.txt REMOVE_DUPLICATES=true
samtools index sample.dedup.bam

6. 基线质量校正(BQSR)

需有已知变异位点数据库(如dbSNP)

# 生成重校正模型
gatk BaseRecalibrator \
    -I sample.dedup.bam \
    -R hg38.fa \
    --known-sites dbsnp.vcf \
    --known-sites Mills_and_1000G_gold_standard.indels.vcf \
    -O sample.recal_data.table

# 应用BQSR
gatk ApplyBQSR \
    -R hg38.fa \
    -I sample.dedup.bam \
    --bqsr-recal-file sample.recal_data.table \
    -O sample.recalibrated.bam

7. 变异检测

gatk HaplotypeCaller \
    -R hg38.fa \
    -I sample.recalibrated.bam \
    -O sample.raw.vcf.gz \
    -ERC GVCF

通常多个样本会用GVCF模式联合基因分型,这里单样本示范。


8. 变异过滤

单样本硬过滤示例

gatk VariantFiltration \
    -R hg38.fa \
    -V sample.raw.vcf.gz \
    -O sample.filtered.vcf.gz \
    --filter-name "QD_filter" --filter-expression "QD < 2.0" \
    --filter-name "FS_filter" --filter-expression "FS > 60.0" \
    --filter-name "MQ_filter" --filter-expression "MQ < 40.0" \
    --filter-name "SOR_filter" --filter-expression "SOR > 3.0"

常见过滤条件参考GATK Best Practices。


9. 变异注释

# 以ANNOVAR为例
table_annovar.pl sample.filtered.vcf.gz humandb/ -buildver hg38 \
    -out sample.annotated -remove -protocol refGene,cytoBand,exac03,avsnp150,clinvar_20210501 \
    -operation g,r,f,f,f -nastring . -vcfinput

10. 结果可视化

  • 使用 IGV 载入 sample.recalibrated.bam 和 sample.annotated.vcf 查看比对与变异
  • 使用R或Python绘制变异统计图、SNP分布等

总结

| 流程步骤 | 推荐软件 | 核心参数/说明 |
| --- | --- | --- |
| 质量控制 | FastQC, fastp, MultiQC | Phred≥20,去接头和短序列 |
| 比对 | BWA mem | 多线程,-M参数兼容Picard |
| 排序转BAM | samtools sort/index | -@ 8或更高线程数 |
| 去重 | Picard MarkDuplicates | REMOVE_DUPLICATES=true |
| 质量重校正 | GATK BaseRecalibrator/ApplyBQSR | 需已知变异位点数据库 |
| 变异检测 | GATK HaplotypeCaller | GVCF模式,便于联合分析 |
| 变异过滤 | GATK VariantFiltration | 根据QD, FS, MQ, SOR等指标过滤 |
| 注释 | ANNOVAR, SnpEff, VEP | 结合dbSNP, ClinVar等数据库 |
| 可视化 | IGV, R, Python | 助力结果理解 |

群体联合变异检测(Joint Variant Calling)流程

群体联合变异检测的优势是多样本同时分析,提高低频变异的检测灵敏度和准确度,减少假阳性。一般推荐使用GATK的GVCF工作流。

1. 单样本GVCF生成

对每个样本单独运行HaplotypeCaller,输出GVCF文件。

gatk HaplotypeCaller \
  -R reference.fa \
  -I sample1.recalibrated.bam \
  -O sample1.g.vcf.gz \
  -ERC GVCF

对所有样本重复此操作,生成如sample2.g.vcf.gz等文件。


2. 联合基因分型

使用GenomicsDBImport将所有GVCF导入数据库,统一处理。

gatk GenomicsDBImport \
  -V sample1.g.vcf.gz -V sample2.g.vcf.gz -V sample3.g.vcf.gz \
  --genomicsdb-workspace-path my_database \
  --intervals chr20

然后针对数据库执行联合调用:

gatk GenotypeGVCFs \
  -R reference.fa \
  -V gendb://my_database \
  -O cohort.raw.vcf.gz

3. 变异过滤

对联合VCF文件进行硬过滤或基于机器学习的变异质量评分(VQSR)。

3.1 硬过滤示例:

gatk VariantFiltration \
  -R reference.fa \
  -V cohort.raw.vcf.gz \
  -O cohort.filtered.vcf.gz \
  --filter-name "QD_filter" --filter-expression "QD < 2.0" \
  --filter-name "FS_filter" --filter-expression "FS > 60.0" \
  --filter-name "MQ_filter" --filter-expression "MQ < 40.0"

3.2 VQSR(推荐大样本量):

  • 使用VariantRecalibrator训练模型
  • 使用ApplyVQSR过滤变异

下游分析简介

1. 变异注释

注释变异对基因或调控区的影响及与疾病关联。

  • 工具:ANNOVAR, SnpEff, VEP
  • 注释内容包括基因结构、基因名、频率(gnomAD、ExAC)、临床意义(ClinVar)、功能预测(SIFT, PolyPhen)

2. 群体遗传学分析

  • 等位基因频率计算:揭示群体内变异频率分布
  • 个体基因型分布:用于罕见病或群体分层研究
  • 连锁不平衡(LD)分析
  • 群体结构分析(如PCA, ADMIXTURE)
  • 常用软件:PLINK,VCFtools
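以其中的等位基因频率计算为例:对二倍体样本,某位点的替代等位基因频率等于"1"等位基因的个数除以非缺失等位基因总数。下面是一个Python小示意(基因型沿用VCF的0/1编码,缺失记为"./."):

```python
def alt_allele_frequency(genotypes):
    """由二倍体基因型列表计算替代等位基因(ALT)频率,缺失基因型不计入。"""
    alleles = []
    for gt in genotypes:
        if gt in (".", "./.", ".|."):
            continue
        alleles.extend(gt.replace("|", "/").split("/"))
    return alleles.count("1") / len(alleles)

gts = ["0/0", "0/1", "1/1", "0/1", "./."]
print(alt_allele_frequency(gts))  # 4个"1" / 8个等位基因 = 0.5
```

PLINK、VCFtools对全基因组规模的数据做的就是这类统计(外加LD、群体分层等更复杂的计算)。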

3. 关联分析(GWAS)

  • 群体样本的表型与基因型进行关联分析
  • 识别与疾病或性状相关的遗传标记位点
  • 软件:PLINK,BOLT-LMM,GEMMA

4. 拷贝数变异(CNV)分析

  • 通过测序深度检测大段的DNA片段增减变异
  • 软件示例:CNVnator,XHMM

5. 结构变异(SV)检测

  • 检测较大染色体重排、缺失、插入等
  • 软件:Manta,Delly,LUMPY

6. 功能富集和通路分析

  • 针对有显著变异的基因集合,进行GO、KEGG通路富集,提高生物学理解
  • 软件/平台:DAVID, Metascape, ClusterProfiler(R包)

7. 可视化

  • PCA图:群体结构和样本间遗传关系
  • Manhattan图:GWAS结果展示显著位点
  • 变异频率热图

常用工具:R (ggplot2, PCAtools), Python (matplotlib, seaborn), LocusZoom


总结与建议

| 阶段 | 核心软件 | 关键点 |
| --- | --- | --- |
| GVCF生成 | GATK HaplotypeCaller | 每样本单独生成GVCF |
| 联合基因分型 | GATK GenomicsDBImport + GenotypeGVCFs | 整合样本、提高准确度 |
| 变异过滤 | GATK VariantFiltration / VQSR | 样本量大时建议用VQSR |
| 注释分析 | ANNOVAR, VEP | 集成多数据库信息 |
| 群体遗传学 | PLINK, VCFtools | 群体结构、频率分析 |
| 关联分析 | PLINK, BOLT-LMM | 群体表型-基因型关联 |
| CNV & SV分析 | CNVnator, Manta | 结构变异检测 |
| 功能富集 | DAVID, ClusterProfiler | 生物学意义理解 |
| 可视化 | IGV, R, Python | 结果直观展示 |

本人实践过程简要记录

DNA检测流程

变异检测

SNP分析

湿实验流程

DNA数据分析流程

1 测序基因质量评估实践

使用SOAPnuke生成数据质量评估表,并按指定标准对低质量数据进行过滤,向01_clean中生成clean数据:

/Pipeline/FIS.Traits_v1.0/tools/SOAPnuke filter -1 ./raw_data/test.fq.gz -l 10 -q 0.5 -n 0.01 -T 1 -o ./01_clean/ -C test_clean.fq.gz

可在01_clean路径下使用less命令逐一查看测序数据质量评估表,我们简单展示过滤前后数据总体情况如下:

less 01_clean/Basic_Statistics_of_Sequencing_Quality.txt

使用FASTQC工具对数据质量情况进行绘图,通过图形化展示了解数据质量:

/Pipeline/FIS.Traits_v1.0/tools/anaconda/bin/fastqc ./01_clean/test_clean.fq.gz

2 测序数据序列比对

建立索引:

/Pipeline/FIS.Traits_v1.0/tools/bwa-mem2 index ./00_ref/MGI358.SNP.fa

  1. 序列比对生成SAM
/Pipeline/FIS.Traits_v1.0/tools/bwa-mem2 mem -M -Y -t 1 ./00_ref/MGI358.SNP.fa ./01_clean/test_clean.fq.gz > ./02_align/test_clean.sam

  2. 排序并转换为BAM,节省存储空间
samtools sort ./02_align/test_clean.sam -o ./02_align/test_clean.sort.bam
  3. 为排序后的BAM文件建立索引,并过滤出唯一比对
samtools index ./02_align/test_clean.sort.bam

samtools view -F 256 -hb ./02_align/test_clean.sort.bam > ./02_align/test_clean.sort.uniq.bam
  4. 查看生成的BAM文件
samtools view -S ./02_align/test_clean.sort.bam|less -S
  5. 统计覆盖深度
/Pipeline/FIS.Traits_v1.0/tools/bamdst -p ./00_ref/target.358.SE50.subSNP.bed -o ./02_align  ./02_align/test_clean.sort.uniq.bam
  6. 查看覆盖深度
less ./02_align/depth.tsv.gz

3 测序数据变异检测分析

建立索引

samtools index ./02_align/test_clean.sort.uniq.bam

运行FreeBayes(基于贝叶斯模型),完成SNP Calling

/Pipeline/FIS.Traits_v1.0/tools/freebayes -m 30 -q 20 -f ./00_ref/MGI358.SNP.fa -@  ./00_ref/alleles_all.vcf.gz -t ./00_ref/target.358.SE50.subSNP.bed  --report-all-haplotype-alleles ./02_align/test_clean.sort.uniq.bam > ./03_SNPCalling/test_clean.SNP.vcf

查看突变

less -S ./03_SNPCalling/test_clean.SNP.vcf

准备VCF列表文件

echo "test ./03_SNPCalling/test_clean.SNP.vcf 1" > ./03_SNPCalling/vcf.list

将突变转换为基因型格式

perl 00_ref/vcf2geno_free.pl -in ./03_SNPCalling/vcf.list -dir ./03_SNPCalling/ -std ./00_ref/alleles_all.vcf.gz -summary
less ./03_SNPCalling/test/test.genotype

4 生成个人报告

根据barcode号和fq文件的绝对路径创建ID.list文件。

ID.list中第一列为barcode,第二列为path,中间用"\t"(制表符)分隔

运行报告代码

/Pipeline/FIS.Traits_v1.0/tools/python3 /Pipeline/FIS.Traits_v1.0/FIS.trait.py -i ID.list -d outdir -nohup

附录

使用Python分析FastQ文件

随着基因测序技术的飞速发展,FastQ文件已成为生物信息学中常见的文件格式。FastQ文件包含了测序数据的基本信息,如碱基序列、质量分数等。掌握Python快速解读FastQ文件的能力,对于基因测序数据分析至关重要。本文将详细介绍如何使用Python进行FastQ文件的读取、解析和初步分析。

FastQ文件简介

FastQ文件是一种文本格式,用于存储高通量测序数据。文件中每条read记录占四行:

  1. 序列标识行:以“@”开头,后跟read ID和描述信息。
  2. 碱基序列行:由A、C、G、T(或N)组成的碱基序列。
  3. 分隔行:以“+”开头,其后可重复序列标识,也可留空。
  4. 质量分数行:与碱基序列等长,每个字符编码对应碱基的Phred质量分数。
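按照上面的格式约定,不借助BioPython也可以每4行一组地手工解析FastQ。下面是一个最小Python示意(示例数据为虚构):

```python
def parse_fastq_records(lines):
    """按每4行一条记录解析FastQ文本行,返回记录字典列表。"""
    records = []
    for i in range(0, len(lines), 4):
        header, seq, sep, qual = lines[i:i + 4]
        # 校验记录结构:标识行以"@"开头,分隔行以"+"开头
        assert header.startswith("@") and sep.startswith("+")
        records.append({"id": header[1:], "seq": seq, "qual": qual})
    return records

demo = ["@read1", "ACGT", "+", "IIII",
        "@read2", "GGCC", "+", "II5I"]
recs = parse_fastq_records(demo)
print(len(recs), recs[0]["seq"])  # 2 ACGT
```

正文中基于BioPython的FastqGeneralIterator做的正是等价的事情,但对大文件更高效、也更健壮。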

Python运行环境配置和相应代码

在进行FastQ文件解析之前,请确保已安装以下库:

  • BioPython:用于生物信息学分析。
  • pandas:用于数据处理和分析。
  • numpy:用于数值计算。

你可以使用以下命令安装这些库:

pip install biopython pandas numpy

针对未解压的fq.gz文件,使用下面的Python代码即可进行分析:

import gzip  
from Bio.Seq import Seq  
from Bio.SeqIO.QualityIO import FastqGeneralIterator  
from Bio.SeqRecord import SeqRecord  
import numpy as np  

file_path = 'xxxx.fq.gz'  

def parse_fastq(filename):  
    records = []  
    with gzip.open(filename, 'rt') as f:  
        for title, seq, qual_str in FastqGeneralIterator(f):  
            seq_obj = Seq(seq)  
            record = SeqRecord(seq_obj, id=title, description='')  
            records.append(record)  
    return records  

def analyze_fastq(records):  
    seq_lengths = [len(record.seq) for record in records]  
    gc_contents = [(record.seq.count('G') + record.seq.count('C')) / len(record.seq) * 100 for record in records]  
    print('Average sequence length:', np.mean(seq_lengths))  
    print('Average GC content (%):', np.mean(gc_contents))  

fastq_records = parse_fastq(file_path)  
analyze_fastq(fastq_records)  

输出结果如下所示:

Average sequence length: 50.0
Average GC content (%): 46.72927286232832

估计SNP位点数量

平均序列长度为50的短序列,如果你想估计“SNP位点数量”,需要明确几个条件:

  1. SNP(单核苷酸多态性)密度:不同物种、不同基因组区域SNP密度差别很大。比如人类基因组平均大约每1000个碱基有1个SNP(大约0.1%),但有时更高或更低。

  2. 测序的覆盖度和样本数:你测了多少条序列(reads)?每个位点有没有被多次覆盖?能否准确判定多态性?


大致估计方法:

  • 假设SNP密度 1 SNP / 1000 bp(0.1%),平均序列是50 bp。
  • 每条序列大约期望出现的SNP个数:

$50 \times \frac{1}{1000} = 0.05$

也就是说,平均每条序列约含0.05个SNP,即约5%的序列会包含一个SNP位点。


如果你有N条序列,那么期望的SNP数量大约是

$\text{SNP数量} \approx N \times 0.05$


举个例子:

  • 你有1000条50 bp的序列,则总碱基数 = 1000 × 50 = 50,000 bp
  • 粗略估计SNP数 = 50,000 / 1000 = 50个SNP位点
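上面的估算可以写成一个简单的函数:期望SNP数 = 序列条数 × 序列长度 × SNP密度。下面是对应的Python示意(SNP密度仅为假设值):

```python
def expected_snp_count(num_reads, read_len, snp_density=1 / 1000):
    """按假设的SNP密度估算期望SNP位点数:总碱基数 × 密度。"""
    return num_reads * read_len * snp_density

# 1000条50bp的序列,密度取1 SNP / 1000 bp
print(expected_snp_count(1000, 50))  # 50.0
```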

需要注意:

  • 这个估算完全是基于假设的SNP密度;
  • 实际SNP数受物种基因组多样性、测序区域、测序质量和样本群体影响;
  • 如果是高度多样的区域或有群体变异,SNP数会更高。

使用Python分析FASTQ基因测序数据并进一步构建进化树

用Python分析FASTQ基因测序数据并进一步构建进化树,通常包含以下几个步骤:

1. 读取和预处理FASTQ数据

FASTQ是高通量测序数据的主要格式,包含序列和质量信息。

  • 利用BioPython库的SeqIO模块读取FASTQ文件:
from Bio import SeqIO

fastq_file = "your_data.fastq"
sequences = []
for record in SeqIO.parse(fastq_file, "fastq"):
    sequences.append(record)
  • 你可以根据质量值过滤低质量序列:
filtered_seqs = [rec for rec in sequences if min(rec.letter_annotations["phred_quality"]) > 20]

2. 序列比对和多序列比对 (Multiple Sequence Alignment, MSA)

进化树构建需要多序列比对。

  • 如果是单条基因/序列,可以先用BLAST检测相似序列,或自己已有多条相关序列。
  • 用外部工具做MSA,比如MAFFT, Clustal Omega,也可以用BioPython调用。

示例:调用MAFFT(需先安装软件)

import subprocess

input_fasta = "sequences.fasta"
output_aln = "aligned.fasta"
with open(output_aln, "w") as out:
    subprocess.run(["mafft", "--auto", input_fasta], stdout=out, check=True)

3. 构建进化树

  • 使用BioPython的Phylo模块,或者专门的工具如IQ-TREE, RAxML, FastTree。
  • 先生成距离矩阵或基于模型构建树。

示例:使用Bio.Phylo解析和可视化已有的Newick格式进化树:

from Bio import Phylo

tree = Phylo.read("tree.nwk", "newick")
Phylo.draw(tree)

如果想用BioPython构建简单的NJ树:

from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor

alignment = AlignIO.read("aligned.fasta", "fasta")
calculator = DistanceCalculator('identity')
dm = calculator.get_distance(alignment)

constructor = DistanceTreeConstructor()
tree = constructor.nj(dm)
Phylo.draw(tree)

4. 完整流程框架建议

  1. 读取FASTQ,提取高质量序列;
  2. 序列拼接或筛选代表序列
  3. 多序列比对(调用外部工具如MAFFT);
  4. 构建进化树(如NJ法、最大似然法);
  5. 树的可视化和后续分析

额外提示:

  • 如数据量大,最好利用Linux环境,调用外部高效工具。
  • 进化树构建依据不同模型,选择对应工具和参数。
  • 可以结合scikit-bio、DendroPy等库做更复杂的系统发育分析。

vcf2phylip.py完整代码

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

"""
The script converts a collection of SNPs in VCF format into a PHYLIP, FASTA,
NEXUS, or binary NEXUS file for phylogenetic analysis. The code is optimized
to process VCF files with sizes >1GB. For small VCF files the algorithm slows
down as the number of taxa increases (but is still fast).

Any ploidy is allowed, but binary NEXUS is produced only for diploid VCFs.
"""

__author__      = "Edgardo M. Ortiz"
__credits__     = "Juan D. Palacio-Mejía"
__version__     = "2.8"
__email__       = "e.ortiz.v@gmail.com"
__date__        = "2021-08-10"

import argparse
import gzip
import random
import sys
from pathlib import Path

# Dictionary of IUPAC ambiguities for nucleotides
# '*' is a deletion in GATK, deletions are ignored in consensus, lowercase consensus is used when an
# 'N' or '*' is part of the genotype. Capitalization is used by some software but ignored by Geneious
# for example
AMBIG = {
    "A"    :"A", "C"    :"C", "G"    :"G", "N"    :"N", "T"    :"T",
    "*A"   :"a", "*C"   :"c", "*G"   :"g", "*N"   :"n", "*T"   :"t",
    "AC"   :"M", "AG"   :"R", "AN"   :"a", "AT"   :"W", "CG"   :"S",
    "CN"   :"c", "CT"   :"Y", "GN"   :"g", "GT"   :"K", "NT"   :"t",
    "*AC"  :"m", "*AG"  :"r", "*AN"  :"a", "*AT"  :"w", "*CG"  :"s",
    "*CN"  :"c", "*CT"  :"y", "*GN"  :"g", "*GT"  :"k", "*NT"  :"t",
    "ACG"  :"V", "ACN"  :"m", "ACT"  :"H", "AGN"  :"r", "AGT"  :"D",
    "ANT"  :"w", "CGN"  :"s", "CGT"  :"B", "CNT"  :"y", "GNT"  :"k",
    "*ACG" :"v", "*ACN" :"m", "*ACT" :"h", "*AGN" :"r", "*AGT" :"d",
    "*ANT" :"w", "*CGN" :"s", "*CGT" :"b", "*CNT" :"y", "*GNT" :"k",
    "ACGN" :"v", "ACGT" :"N", "ACNT" :"h", "AGNT" :"d", "CGNT" :"b",
    "*ACGN":"v", "*ACGT":"N", "*ACNT":"h", "*AGNT":"d", "*CGNT":"b",
    "*"    :"-", "*ACGNT":"N",
}

# Dictionary for translating biallelic SNPs into SNAPP, only for diploid VCF
# 0 is homozygous reference
# 1 is heterozygous
# 2 is homozygous alternative
GEN_BIN = {
    "./.":"?",
    ".|.":"?",
    "0/0":"0",
    "0|0":"0",
    "0/1":"1",
    "0|1":"1",
    "1/0":"1",
    "1|0":"1",
    "1/1":"2",
    "1|1":"2",
}


def extract_sample_names(vcf_file):
    """
    Extract sample names from VCF file
    """
    if vcf_file.lower().endswith(".gz"):
        opener = gzip.open
    else:
        opener = open
    sample_names = []
    with opener(vcf_file, "rt") as vcf:
        for line in vcf:
            line = line.strip("\n")
            if line.startswith("#CHROM"):
                record = line.split("\t")
                sample_names = [record[i].replace("./", "") for i in range(9, len(record))]
                break
    return sample_names


def is_anomalous(record, num_samples):
    """
    Determine if the number of samples in current record corresponds to number of samples described
    in the line '#CHROM'
    """
    return bool(len(record) != num_samples + 9)


def is_snp(record):
    """
    Determine if current VCF record is a SNP (single nucleotide polymorphism) as opposed to MNP
    (multinucleotide polymorphism)
    """
    # <NON_REF> must be replaced by the REF in the ALT field for GVCFs from GATK
    alt = record[4].replace("<NON_REF>", record[3])
    return bool(len(record[3]) == 1 and len(alt) - alt.count(",") == alt.count(",") + 1)


def num_genotypes(record, num_samples):
    """
    Get number of genotypes in VCF record, total number of samples - missing genotypes
    """
    missing = 0
    for i in range(9, num_samples + 9):
        if record[i].startswith("."):
            missing += 1
    return num_samples - missing


def get_matrix_column(record, num_samples, resolve_IUPAC):
    """
    Transform a VCF record into a phylogenetic matrix column with nucleotides instead of numbers
    """
    nt_dict = {str(0): record[3].replace("-","*").upper(), ".": "N"}
    # <NON_REF> must be replaced by the REF in the ALT field for GVCFs from GATK
    alt = record[4].replace("-", "*").replace("<NON_REF>", nt_dict["0"])
    alt = alt.split(",")
    for n in range(len(alt)):
        nt_dict[str(n+1)] = alt[n]
    column = ""
    for i in range(9, num_samples + 9):
        geno_num = record[i].split(":")[0].replace("/", "").replace("|", "")
        try:
            geno_nuc = "".join(sorted(set([nt_dict[j] for j in geno_num])))
        except KeyError:
            return "malformed"
        if resolve_IUPAC is False:
            column += AMBIG[geno_nuc]
        else:
            column += AMBIG[nt_dict[random.choice(geno_num)]]
    return column


def get_matrix_column_bin(record, num_samples):
    """
    Return an alignment column in NEXUS binary from a VCF record, if genotype is not diploid with at
    most two alleles it will return '?' as state
    """
    column = ""
    for i in range(9, num_samples + 9):
        genotype = record[i].split(":")[0]
        if genotype in GEN_BIN:
            column += GEN_BIN[genotype]
        else:
            column += "?"
    return column


def main():
    parser = argparse.ArgumentParser(description=__doc__,
                                     formatter_class=argparse.RawDescriptionHelpFormatter)
    parser.add_argument("-i", "--input",
        action = "store",
        dest = "filename",
        required = True,
        help = "Name of the input VCF file, can be gzipped")
    parser.add_argument("--output-folder",
        action = "store",
        dest = "folder",
        default = "./",
        help = "Output folder name, it will be created if it does not exist (same folder as input by "
               "default)")
    parser.add_argument("--output-prefix",
        action = "store",
        dest = "prefix",
        help = "Prefix for output filenames (same as the input VCF filename without the extension by "
               "default)")
    parser.add_argument("-m", "--min-samples-locus",
        action = "store",
        dest = "min_samples_locus",
        type = int,
        default = 4,
        help = "Minimum of samples required to be present at a locus (default=4)")
    parser.add_argument("-o", "--outgroup",
        action = "store",
        dest = "outgroup",
        default = "",
        help = "Name of the outgroup in the matrix. Sequence will be written as first taxon in the "
               "alignment.")
    parser.add_argument("-p", "--phylip-disable",
        action = "store_true",
        dest = "phylipdisable",
        help = "A PHYLIP matrix is written by default unless you enable this flag")
    parser.add_argument("-f", "--fasta",
        action = "store_true",
        dest = "fasta",
        help = "Write a FASTA matrix (disabled by default)")
    parser.add_argument("-n", "--nexus",
        action = "store_true",
        dest = "nexus",
        help = "Write a NEXUS matrix (disabled by default)")
    parser.add_argument("-b", "--nexus-binary",
        action = "store_true",
        dest = "nexusbin",
        help = "Write a binary NEXUS matrix for analysis of biallelic SNPs in SNAPP, only diploid "
               "genotypes will be processed (disabled by default)")
    parser.add_argument("-r", "--resolve-IUPAC",
        action = "store_true",
        dest = "resolve_IUPAC",
        help = "Randomly resolve heterozygous genotypes to avoid IUPAC ambiguities in the matrices "
               "(disabled by default)")
    parser.add_argument("-w", "--write-used-sites",
        action = "store_true",
        dest = "write_used",
        help = "Save the list of coordinates that passed the filters and were used in the alignments "
               "(disabled by default)")
    parser.add_argument("-v", "--version",
        action = "version",
        version = "%(prog)s {version}".format(version=__version__))
    args = parser.parse_args()

    outgroup = args.outgroup.split(",")[0].split(";")[0]

    # Get samples names and number of samples in VCF
    if Path(args.filename).exists():
        sample_names = extract_sample_names(args.filename)
    else:
        print("\nInput VCF file not found, please verify the provided path")
        sys.exit()
    num_samples = len(sample_names)
    if num_samples == 0:
        print("\nSample names not found in VCF, your file may be corrupt or missing the header.\n")
        sys.exit()
    print("\nConverting file '{}':\n".format(args.filename))
    print("Number of samples in VCF: {:d}".format(num_samples))

    # If the 'min_samples_locus' is larger than the actual number of samples in VCF readjust it
    args.min_samples_locus = min(num_samples, args.min_samples_locus)

    # Output filename will be the same as input file, indicating the minimum of samples specified
    if not args.prefix:
        parts = Path(args.filename).name.split(".")
        args.prefix = []
        for p in parts:
            if p.lower() == "vcf":
                break
            else:
                args.prefix.append(p)
        args.prefix = ".".join(args.prefix)
    args.prefix += ".min" + str(args.min_samples_locus)

    # Check if outfolder exists, create it if it doesn't
    if not Path(args.folder).exists():
        Path(args.folder).mkdir(parents=True)

    outfile = str(Path(args.folder, args.prefix))

    # We need to create an intermediate file to hold the sequence data vertically and then transpose
    # it to create the matrices
    if args.fasta or args.nexus or not args.phylipdisable:
        temporal = open(outfile+".tmp", "w")

    # If binary NEXUS is selected also create a separate temporal
    if args.nexusbin:
        temporalbin = open(outfile+".bin.tmp", "w")


    ##########################
    # PROCESS GENOTYPES IN VCF

    if args.write_used:
        used_sites = open(outfile+".used_sites.tsv", "w")
        used_sites.write("#CHROM\tPOS\tNUM_SAMPLES\n")

    if args.filename.lower().endswith(".gz"):
        opener = gzip.open
    else:
        opener = open

    with opener(args.filename, "rt") as vcf:
        # Initialize line counter
        snp_num = 0
        snp_accepted = 0
        snp_shallow = 0
        mnp_num = 0
        snp_biallelic = 0

        while 1:
            # Load large chunks of file into memory
            vcf_chunk = vcf.readlines(50000)
            if not vcf_chunk:
                break

            for line in vcf_chunk:
                line = line.strip()

                if line and not line.startswith("#"): # skip empty and commented lines
                    # Split line into columns
                    record = line.split("\t")
                    # Keep track of number of genotypes processed
                    snp_num += 1
                    # Print progress every 500000 lines
                    if snp_num % 500000 == 0:
                        print("{:d} genotypes processed.".format(snp_num))
                    if is_anomalous(record, num_samples):
                        print("Skipping malformed line:\n{}".format(line))
                        continue
                    else:
                        # Check if the SNP has the minimum number of samples required
                        num_samples_locus = num_genotypes(record, num_samples)
                        if  num_samples_locus < args.min_samples_locus:
                            # Keep track of loci rejected due to exceeded missing data
                            snp_shallow += 1
                            continue
                        else:
                            # Check that neither REF nor ALT contain MNPs
                            if is_snp(record):
                                # Add to running sum of accepted SNPs
                                snp_accepted += 1
                                # If nucleotide matrices are requested
                                if args.fasta or args.nexus or not args.phylipdisable:
                                    # Uncomment for debugging
                                    # print(record)
                                    # Transform VCF record into an alignment column
                                    site_tmp = get_matrix_column(record, num_samples,
                                                                 args.resolve_IUPAC)
                                    # Uncomment for debugging
                                    # print(site_tmp)
                                    # Write entire row of single nucleotide genotypes to temp file
                                    if site_tmp == "malformed":
                                        print("Skipping malformed line:\n{}".format(line))
                                        continue
                                    else:
                                        temporal.write(site_tmp+"\n")
                                        if args.write_used:
                                            used_sites.write(record[0] + "\t"
                                                             + record[1] + "\t"
                                                             + str(num_samples_locus) + "\n")
                                # Write binary NEXUS for SNAPP if requested
                                if args.nexusbin:
                                    # Check that the SNP only has two alleles
                                    if len(record[4]) == 1:
                                        # Add to running sum of biallelic SNPs
                                        snp_biallelic += 1
                                        # Translate genotype into 0 for homozygous REF, 1 for
                                        # heterozygous, and 2 for homozygous ALT
                                        binsite_tmp = get_matrix_column_bin(record, num_samples)
                                        # Write entire row to temporary file
                                        temporalbin.write(binsite_tmp+"\n")
                            else:
                                # Keep track of loci rejected due to multinucleotide genotypes
                                mnp_num += 1

        # Print useful information about filtering of SNPs
        print("Total of genotypes processed: {:d}".format(snp_num))
        print("Genotypes excluded because they exceeded the amount "
              "of missing data allowed: {:d}".format(snp_shallow))
        print("Genotypes that passed missing data filter but were "
              "excluded for being MNPs: {:d}".format(mnp_num))
        print("SNPs that passed the filters: {:d}".format(snp_accepted))
        if args.nexusbin:
            print("Biallelic SNPs selected for binary NEXUS: {:d}".format(snp_biallelic))

    if args.write_used:
        print("Used sites saved to: '" + outfile + ".used_sites.tsv'")
        used_sites.close()
    print()

    if args.fasta or args.nexus or not args.phylipdisable:
        temporal.close()
    if args.nexusbin:
        temporalbin.close()


    #######################
    # WRITE OUTPUT MATRICES

    if not args.phylipdisable:
        output_phy = open(outfile+".phy", "w")
        output_phy.write("{:d} {:d}\n".format(len(sample_names), snp_accepted))

    if args.fasta:
        output_fas = open(outfile+".fasta", "w")

    if args.nexus:
        output_nex = open(outfile+".nexus", "w")
        output_nex.write("#NEXUS\n\nBEGIN DATA;\n\tDIMENSIONS NTAX={:d} NCHAR={:d};\n\tFORMAT "
                         "DATATYPE=DNA MISSING=N GAP=- ;\nMATRIX\n".format(len(sample_names),
                                                                                      snp_accepted))

    if args.nexusbin:
        output_nexbin = open(outfile+".bin.nexus", "w")
        output_nexbin.write("#NEXUS\n\nBEGIN DATA;\n\tDIMENSIONS NTAX={:d} NCHAR={:d};\n\tFORMAT "
                            "DATATYPE=SNP MISSING=? GAP=- ;\nMATRIX\n".format(len(sample_names),
                                                                                     snp_biallelic))

    # Get length of longest sequence name (used to pad PHYLIP/NEXUS rows)
    len_longest_name = max((len(name) for name in sample_names), default=0)

    # Write outgroup as first sequence in alignment if the name is specified
    idx_outgroup = None
    if outgroup in sample_names:
        idx_outgroup = sample_names.index(outgroup)

        if args.fasta or args.nexus or not args.phylipdisable:
            with open(outfile+".tmp") as tmp_seq:
                seqout = ""

                # This is where the transposing happens
                for line in tmp_seq:
                    seqout += line[idx_outgroup]

                # Write FASTA line
                if args.fasta:
                    output_fas.write(">"+sample_names[idx_outgroup]+"\n"+seqout+"\n")

                # Pad sequences names and write PHYLIP or NEXUS lines
                padding = (len_longest_name + 3 - len(sample_names[idx_outgroup])) * " "
                if not args.phylipdisable:
                    output_phy.write(sample_names[idx_outgroup]+padding+seqout+"\n")
                if args.nexus:
                    output_nex.write(sample_names[idx_outgroup]+padding+seqout+"\n")

                # Print current progress
                print("Outgroup, '{}', added to the matrix(ces).".format(outgroup))

        if args.nexusbin:
            with open(outfile+".bin.tmp") as bin_tmp_seq:
                seqout = ""

                # This is where the transposing happens
                for line in bin_tmp_seq:
                    seqout += line[idx_outgroup]

                # Write line of binary SNPs to NEXUS
                padding = (len_longest_name + 3 - len(sample_names[idx_outgroup])) * " "
                output_nexbin.write(sample_names[idx_outgroup]+padding+seqout+"\n")

                # Print current progress
                print("Outgroup, '{}', added to the binary matrix.".format(outgroup))

    # Write sequences of the ingroup
    for s in range(len(sample_names)):
        if s != idx_outgroup:
            if args.fasta or args.nexus or not args.phylipdisable:
                with open(outfile+".tmp") as tmp_seq:
                    seqout = ""

                    # This is where the transposing happens
                    for line in tmp_seq:
                        seqout += line[s]

                    # Write FASTA line
                    if args.fasta:
                        output_fas.write(">"+sample_names[s]+"\n"+seqout+"\n")

                    # Pad sequences names and write PHYLIP or NEXUS lines
                    padding = (len_longest_name + 3 - len(sample_names[s])) * " "
                    if not args.phylipdisable:
                        output_phy.write(sample_names[s]+padding+seqout+"\n")
                    if args.nexus:
                        output_nex.write(sample_names[s]+padding+seqout+"\n")

                    # Print current progress
                    print("Sample {:d} of {:d}, '{}', added to the nucleotide matrix(ces).".format(
                                                           s+1, len(sample_names), sample_names[s]))

            if args.nexusbin:
                with open(outfile+".bin.tmp") as bin_tmp_seq:
                    seqout = ""

                    # This is where the transposing happens
                    for line in bin_tmp_seq:
                        seqout += line[s]

                    # Write line of binary SNPs to NEXUS
                    padding = (len_longest_name + 3 - len(sample_names[s])) * " "
                    output_nexbin.write(sample_names[s]+padding+seqout+"\n")

                    # Print current progress
                    print("Sample {:d} of {:d}, '{}', added to the binary matrix.".format(
                                                           s+1, len(sample_names), sample_names[s]))

    print()
    if not args.phylipdisable:
        print("PHYLIP matrix saved to: " + outfile+".phy")
        output_phy.close()
    if args.fasta:
        print("FASTA matrix saved to: " + outfile+".fasta")
        output_fas.close()
    if args.nexus:
        output_nex.write(";\nEND;\n")
        print("NEXUS matrix saved to: " + outfile+".nexus")
        output_nex.close()
    if args.nexusbin:
        output_nexbin.write(";\nEND;\n")
        print("BINARY NEXUS matrix saved to: " + outfile+".bin.nexus")
        output_nexbin.close()

    if args.fasta or args.nexus or not args.phylipdisable:
        Path(outfile+".tmp").unlink()
    if args.nexusbin:
        Path(outfile+".bin.tmp").unlink()

    print("\nDone!\n")

if __name__ == "__main__":
    main()