June & July 2017 Tool Shed contributions

[Galaxy ToolShed](http://toolshed.g2.bx.psu.edu/)

Tools contributed to the Galaxy Project ToolShed in June and July 2017.

New Tools

  • From eschen42:

    • w4mclstrpeakpics: Visualize sample-cluster peaks. Produce a figure to assess the similarities and differences among peaks in a cluster of samples using XCMS-preprocessed data files as input.
  • From testtool:

    • get_gr_set: Reading Illumina methylation array data from GEO. This tool downloads data from GEO and returns a GenomicRatioSet object for further analysis.
    • find_dmp: Finds differentially methylated positions (DMPs) with respect to a phenotype covariate.
    • find_snp: SNPs inside the probe body or at the nucleotide extension can have important consequences for the downstream analysis; this tool offers the possibility to remove such probes.
  • From artbio:

    • yac_clipper: Clips 3' adapters for small RNA sequencing reads. Supports fasta and fastq output.
    • small_read_size_histograms: Generates size histograms from small read alignments. Generates read size histograms from tabular, sam or bam alignments.
    • mircounts: Generates miRNA count lists from read alignments to mirBase. Generates pre-miRNA and mature miRNA count tables from read alignments to pre-miRNA sequences and a gff file, both downloaded from mirBase. Also produces read coverage plots of pre-miRNAs.
    • small_rna_maps: Generates small read maps from alignment BAM files. Generates read count maps from alignment BAM files, using pysam and r-lattice. In addition to the read counts (bars), median sizes, mean sizes and coverages of reads mapping at any given position are plotted in graphs. Takes sorted BAM files as inputs and produces PDF outputs.
    • lumpy_sv: Find structural variations. This tool takes as an input a sorted bam alignment of paired-end sequencing reads. It extracts discordant paired-end alignments and split-read alignments, and generates a vcf file containing structural variation calls.
  • From mmonot:

    • phageterm: Determine Bacteriophage Termini and Packaging Mode using randomly fragmented NGS data. PhageTerm software is a tool to determine phage termini and packaging mode from high throughput sequences that rely on the random fragmentation of DNA (e.g. Illumina TruSeq but NOT Nextera). Phage sequencing reads from a fastq file are aligned to the phage reference genome in order to calculate two types of coverage values (whole genome coverage and the starting position coverage). The starting position coverage is used to perform a detailed termini analysis. If the user provides the host sequence, reads that do not match the phage genome are tested on the host using the same mapping function.
  • From nathandunn:

  • From in_silico:

  • From caleb-easterly:

    • validate_fasta_database: Runs the Compomics database identification tool on any FASTA database, and separates valid and invalid entries based on a series of checks.
  • From davidvanzessen:

  • From fabio:

    • iristcga: 20170531. IRIS-TCGA: automatically filter, extract, and integrate different genomic experiments from The Cancer Genome Atlas. IRIS-TCGA is a web service to automatically query, filter, extract and integrate different genomic experiments from The Cancer Genome Atlas. The service is available at http://bioinf.iasi.cnr.it/iristcga/.
  • From greg:

    • plant_tribes_gene_family_integrator: Integrates de novo assembly sequences with scaffold gene family sequences. One of the PlantTribes collection of automated modular analysis pipelines that utilize objective classifications of complete protein sequences from sequenced plant genomes to perform comparative evolutionary studies. This tool integrates classified post processed de novo transcriptome assembly sequences with the scaffold gene family sequences.
    • plant_tribes_gene_family_classifier: Classifies gene sequences into precomputed orthologous gene family clusters using either blastp (faster), HMMScan (slower but more sensitive to remote homologs) or both (more exhaustive). One of the PlantTribes collection of automated modular analysis pipelines that utilize objective classifications of complete protein sequences from sequenced plant genomes to perform comparative evolutionary studies. This tool classifies gene sequences into precomputed orthologous gene family clusters using either blastp (faster), HMMScan (slower but more sensitive to remote homologs) or both (more exhaustive).
    • plant_tribes_gene_family_phylogeny_builder: Creates multiple sequence alignments and inferred maximum likelihood phylogenies for orthogroups. One of the PlantTribes collection of automated modular analysis pipelines that utilize objective classifications of complete protein sequences from sequenced plant genomes to perform comparative evolutionary studies. It performs phylogenomic analyses by creating multiple sequence alignments and inferred maximum likelihood phylogenies for orthogroups produced by the GeneFamilyAligner tool.
    • plant_tribes_assembly_post_processor: Postprocesses de novo assembly transcripts into putative coding sequences and their corresponding amino acid translations, locally assembling targeted gene families. One of the PlantTribes collection of automated modular analysis pipelines that utilize objective classifications of complete protein sequences from sequenced plant genomes to perform comparative evolutionary studies. It postprocesses de novo assembly transcripts into putative coding sequences and their corresponding amino acid translations, locally assembling targeted gene families.
    • plant_tribes_kaks_analysis: Performs orthologous or paralogous Ks analyses of coding sequences and amino acid sequences. One of the PlantTribes collection of automated modular analysis pipelines that utilize objective classifications of complete protein sequences from sequenced plant genomes to perform comparative evolutionary studies. This tool performs orthologous or paralogous Ks analyses of coding sequences and amino acid sequences.
    • plant_tribes_gene_family_aligner: Aligns gene family sequences. One of the PlantTribes collection of automated modular analysis pipelines that utilize objective classifications of complete protein sequences from sequenced plant genomes to perform comparative evolutionary studies. This tool aligns gene family sequences.
    • plant_tribes_ks_distribution: Plots the distribution of synonymous substitution (Ks) rates and fits significant component(s). One of the PlantTribes collection of automated modular analysis pipelines that utilize objective classifications of complete protein sequences from sequenced plant genomes to perform comparative evolutionary studies. This tool plots the distribution of synonymous substitution (Ks) rates and fits significant component(s).
  • From rnateam:

    • data_manager_sortmerna_database_downloader: SortMeRNA: a sequence analysis tool for filtering, mapping and clustering NGS reads. SortMeRNA is software designed to rapidly filter ribosomal RNA fragments from metatranscriptomic data produced by next-generation sequencers.
    • viennarna_kinwalker: Wrapper for ViennaRNA application kinwalker. RNA secondary structure prediction through energy minimization is the most used function in the package. There are three kinds of dynamic programming algorithms for structure prediction provided: the minimum free energy algorithm of (Zuker & Stiegler 1981) which yields a single optimal structure, the partition function algorithm of (McCaskill 1990) which calculates base pair probabilities in the thermodynamic ensemble, and the suboptimal folding algorithm of (Wuchty et al. 1999) which generates all suboptimal structures within a given energy range of the optimal energy. For secondary structure comparison, the package contains several measures of distance (dissimilarities) using either string alignment or tree-editing (Shapiro & Zhang 1990). Finally, the package provides an algorithm to design sequences with a predefined structure (inverse folding).
  • From yating-l:

    • hg_gc_percent_340: Calculates GC percentage using a sliding window. This tool uses a sliding window to calculate the GC percentages of a twoBit file.
    • twobit_mask_340: Apply repeat masking to a twoBit file. This tool applies a mask to a twoBit file based on repeat analysis results produced by programs such as RepeatMasker and Tandem Repeats Finder (TRF).
    • fa_to_twobit_340: Convert FASTA sequences to twoBit format. This tool converts nucleotide sequences in FASTA format into the twoBit format. The twoBit format allows random access of DNA sequences and can also contain repeat masking information.
    • twobit_to_fa_340: Convert a twoBit file to FASTA format. This tool converts either all or a part of a twoBit file into FASTA format.
    • twobit_to_cytoband_340: Creates a cytoband file from a twoBit file. This tool creates a Cytoband file based on a twoBit file to enable quicker navigation of individual scaffolds in the UCSC Genome Browser.
    • twobit_info_340: Obtain sequence information from a twoBit file. This tool reports the length of each scaffold and the gap locations stored in a twoBit file.
  • From earlhaminst:

    • apoc: Large-scale structural comparison of protein pockets.
    • smart_domains: SMART domains. Search domains in protein sequences using SMART.
    • plotheatmap: This tool can be used to plot a heatmap of gene expression data. The genes are chosen based on p-value, FDR, log FC and log CPM from edgeR output.
  • From iuc:

    • limma_voom: Perform RNA-Seq differential expression analysis using the limma voom pipeline. Applies the limma voom pipeline to a table of tab-separated count data to generate an HTML report for differential expression analysis. The report includes mean-variance trend, MDS and smear plots as well as a summarised table of statistics on each gene.
    • jq: JQ is a lightweight and flexible command-line JSON processor. jq is like sed for JSON data - you can use it to slice and filter and map and transform structured data with the same ease that sed, awk, grep and friends let you play with text.
    • filter_tabular: Filter Tabular. Provides tools for manipulation of tabular files. A variety of line filters may be applied to each line of a tabular file as it is read. These filters can be used to select lines, select and reorder columns, and normalize tabular files for use in a relational database. The regular expression functions re.match, re.search, and re.sub are defined in SQLite database connections, so that they can be used in SQL queries. The Query Tabular tool loads tabular files into a new or existing SQLite DB to perform a SQL query producing a tabular output. It includes the line filtering options. The SQLite DB may optionally be saved as an additional output. The SQLite to Tabular tool performs a query on an existing SQLite DB. The Filter Tabular tool only uses the filtering options to produce a filtered tabular output. See http://www.sqlite.org/index.html and https://docs.python.org/release/2.7/library/re.html.
    • unicycler: Unicycler is a hybrid assembly pipeline for bacterial genomes. Unicycler takes a good set of Illumina reads from a bacterial isolate (required) and long reads from the same isolate (optional). If the input is sufficient, it will produce a completed assembly of circularised sequences.
    • query_tabular: Query Tabular. Provides tools for manipulation of tabular files. A variety of line filters may be applied to each line of a tabular file as it is read. These filters can be used to select lines, select and reorder columns, and normalize tabular files for use in a relational database. The regular expression functions re.match, re.search, and re.sub are defined in SQLite database connections, so that they can be used in SQL queries.
    • roary: Roary, the pangenome pipeline. Creates lists of core and accessory genes in collections of annotations from Prokka.
    • qiime_make_otu_table: Wrapper for the qiime tool suite: Make OTU table. "QIIME: Quantitative Insights Into Microbial Ecology. QIIME is an open-source bioinformatics pipeline for performing microbiome analysis from raw DNA sequencing data".
    • snp_sites: Finds SNP sites from a multi-FASTA alignment file. SNP-sites can rapidly extract SNPs from a multi-FASTA alignment using modest resources and can output results in multiple formats for downstream analysis. SNPs can be extracted from an 8.3 GB alignment file (1,842 taxa, 22,618 sites) in 267 seconds using 59 MB of RAM and 1 CPU core, making it feasible to run on modest computers. SNP-sites is implemented in C and is available under the open source license GNU GPL version 3.
    • sqlite_to_tabular: SQLite to tabular. Provides tools for manipulation of tabular files. A variety of line filters may be applied to each line of a tabular file as it is read. These filters can be used to select lines, select and reorder columns, and normalize tabular files for use in a relational database. The regular expression functions re.match, re.search, and re.sub are defined in SQLite database connections, so that they can be used in SQL queries.
    • gubbins: Gubbins - bacterial recombination detection. Creates a gff of recombination events in closely related bacterial strains from a whole genome alignment.
    • qiime_collapse_samples: Wrapper for the qiime tool suite: Collapse samples. "QIIME: Quantitative Insights Into Microbial Ecology. QIIME is an open-source bioinformatics pipeline for performing microbiome analysis from raw DNA sequencing data".
  • From sarahinraauzeville:

    • dna_inserts_reconstruction: This tool is the first step of the fosmid assembly and annotation pipeline. It assembles raw reads, looks for and removes the cloning vector, and extracts the longest and most covered contigs. It has been built to handle two types of raw reads as inputs: single (454, Ion Torrent reads, ...) or paired (Illumina, ...) reads. This tool is not able to process PacBio or Oxford Nanopore reads. Raw read fastq file organization and naming: The raw read files are organized in directories, one per sample. In the directories, each fastq file has to be gzipped. All the sample directories must be zipped in a single file. The input files must be named "MiSeq.zip" (Paired.zip) for paired files, and "Proton.zip" (Single.zip) for single read files. Even if you have only one fastq file, this gzipped fastq file should also be in a zipped directory. When you upload your input files, choose the "no_unzip.zip" format. The assembly is performed by SPAdes. Assembly and post-processing steps: 1) Read sub-selection to 100x before assembly. 2) SPAdes assembly (more information at http://bioinf.spbau.ru/en/spades). Vector detection: the vector is searched for and masked in the assembled contigs using cross-match (http://www.phrap.org/phredphrapconsed.html). Contig filtering: the result file only includes contigs containing the vector and having an average depth of XXX. Warning: Input fastq file names for paired data should contain "R1.f*" (R1.fq.. or R1.fastq...) and "R2.f*" to be identified as paired data by this tool. Outputs: The outputs are presented in a table containing one line per assembled sample. Each line contains the name of the sample, the number of resulting contigs, the length of the longest contig, the total length of the contigs and a link to the contig fasta file. Underneath the table there is a link to the complete result file containing all the resulting fasta files organized in directories named after the samples. Galaxy tool version: V1.0.
  • From egtortuero:

    • virannot: De novo (viral) genome annotator. VirAnnot is a script written in Python 2.7 that annotates genomes automatically (using a de novo algorithm) and predicts the function of their proteins using BLAST and, optionally, HMMER. The program was originally designed for viral contigs but it could also be used for bacterial and archaeal sequences.
  • From engineson:

    • multiqc: multiqc tool. MultiQC searches a given directory for analysis logs and compiles an HTML report. It's a general use tool, perfect for summarising the output from numerous bioinformatics tools.
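
Several entries above describe their method only in a phrase. As one illustration, the sliding-window GC percentage mentioned for hg_gc_percent_340 can be sketched in a few lines of Python. This is a minimal sketch, not the tool's implementation: it works on a plain sequence string rather than a twoBit file, and the function name `gc_percent_windows` and its `window` and `step` parameters are hypothetical names chosen for this example.

```python
def gc_percent_windows(sequence, window=5, step=5):
    """Yield (start, end, gc_percent) for each window over the sequence.

    Hypothetical helper for illustration only; the real tool reads
    twoBit files, which this sketch does not attempt to parse.
    """
    seq = sequence.upper()
    for start in range(0, len(seq), step):
        # The last window may be shorter than `window`.
        chunk = seq[start:start + window]
        gc = sum(1 for base in chunk if base in "GC")
        yield start, start + len(chunk), 100.0 * gc / len(chunk)


if __name__ == "__main__":
    for start, end, pct in gc_percent_windows("GGCCAATTGC", window=5, step=5):
        print(start, end, pct)
```

Setting `step` smaller than `window` gives overlapping windows; setting them equal, as in the usage lines, gives adjacent non-overlapping ones.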