NAME
    Bio::SeqAlignment::Examples::TailingPolyester - extending the Polyester
    RNA sequencing simulator by including polyA tails

VERSION
    version 0.01

SYNOPSIS
    A collection of examples that demonstrate how to extend the Polyester
    RNA sequencing simulator by including polyA tails in the reference RNA
    used to generate the simulated RNA sequencing data. The module also
    shows the present-day relevance of Perl for constructing bioinformatic
    applications related to sequence mapping.

DESCRIPTION
    This distribution provides examples of the use of Perl, BioPerl and the
    Perl Data Language to extend the Polyester RNA sequencing simulator, by
    giving it the ability to include polyA tails in the reference RNA used
    to generate the simulated RNA sequencing data. It also shows how to use
    these sequences for RNA sequence mapping.

    The main module created for the example is found under the namespace
    Bio::SeqAlignment::Applications::Sequencing::Simulators::RNASeq::Polyester
    and it is a command line tool that wraps the Polyester simulator, which
    is itself an R-based Bioconductor package. Our extension gives
    Polyester the capability to add a tail to the RNA sequences it
    simulates. To do so, we also created a pure R command line tool for
    Polyester and put it under the control of Perl.

    This example requires a few other modules that may be of some general
    use. Some of these modules are provided under the
    Bio::SeqAlignment::Examples::TailingPolyester namespace. Other modules
    were given their own namespace under Bio::SeqAlignment. These modules
    fall into three separate categories:

    A. Modules related to the simulation of random values from truncated
    distributions. These are functional and will eventually move to their
    own namespace once I figure out which one it will be! Until then, one
    can load them by importing the relevant module under
    Bio::SeqAlignment::Examples::TailingPolyester:

    1.  SimulatePDLGSL : a module that uses the GNU Scientific Library
        (GSL) and the Perl Data Language (PDL) to simulate random numbers
        from truncated versions of the distributions provided by the GSL,
        using two role plugins: one for simulating random numbers from the
        uniform distribution, and one for computing the CDF (cumulative
        distribution function) of the truncated distribution and its
        inverse.

    2.  SimulateMathGSL : a module that uses the GNU Scientific Library
        (GSL) in base Perl to simulate random numbers from truncated
        versions of the distributions in GSL, using two role plugins: one
        for simulating random numbers from the uniform distribution, and
        one for computing the CDF (cumulative distribution function) of
        the truncated distribution and its inverse.

    3.  SimulateTruncatedRNGPDL : a role plugin that implements the inverse
        CDF method for drawing random numbers from a possibly truncated
        version of a distribution using the Perl Data Language (PDL).

    4.  SimulateTruncatedRNG : a role plugin that implements the inverse
        CDF method for drawing random numbers from a possibly truncated
        version of a distribution in base Perl.

    5.  PDLRNG : a role plugin that draws random numbers from the uniform
        distribution using the Xoshiro256+ algorithm in the Perl Data
        Language (PDL).

    6.  GSLRNG : a role plugin that draws random numbers from the uniform
        distribution using the uniform (flat) distribution in the PDL::GSL
        module of PDL.

    7.  PERLRNGPDL : a role plugin that draws random numbers from the
        uniform distribution using the builtin rand() function in Perl and
        returns an ndarray with these values.

    8.  PERLRNG : a role plugin that draws random numbers from the uniform
        distribution using the builtin rand() function in Perl and returns
        a reference to an array of those values.

    B. Modules related to generic tasks such as reading and processing
    collections of BioX::Seq objects, tailing of sequences, documenting
    sequence modifications, etc., as well as polyA processing and removal
    of such tails from sequencing data.
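    As a rough illustration of the tailing and documentation tasks listed
    above, the following sketch appends a polyA tail to plain sequence
    records and keeps an in-memory log of each modification. It is a
    minimal sketch under stated assumptions, not the distribution's code:
    the helper add_polyA_sketch, the hash-based records (used here instead
    of BioX::Seq objects to keep the example free of dependencies), and the
    log field names are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT the distribution's add_polyA function.
# Takes an array ref of { id => ..., seq => ... } hashes and a tail length,
# returns the tailed records plus a log of what was changed.
sub add_polyA_sketch {
    my ( $seqs, $tail_length ) = @_;
    my %modifications;    # in-memory log keyed by sequence id
    my @tailed;
    for my $rec (@$seqs) {
        my $tail = 'A' x $tail_length;
        push @tailed, { id => $rec->{id}, seq => $rec->{seq} . $tail };
        $modifications{ $rec->{id} } = {
            modification    => 'polyA_tail',            # assumed field names
            tail_length     => $tail_length,
            original_length => length $rec->{seq},
        };
    }
    return ( \@tailed, \%modifications );
}

my @reference = (
    { id => 'tx1', seq => 'ACGTTGCA' },
    { id => 'tx2', seq => 'GGCATC' },
);
my ( $tailed, $log ) = add_polyA_sketch( \@reference, 5 );
print "$_->{id}\t$_->{seq}\n" for @$tailed;
```

    A fixed tail length keeps the sketch deterministic. Drawing the length
    from a truncated distribution, as the category A modules allow, would
    replace the constant with a sampled value.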
    BioX::Seq is a lightweight framework for representing biological
    sequences such as those that come from sequencing instruments. It is a
    simple object that holds the sequence data, the quality data, and the
    name of the sequence. It is used as a lightweight alternative to the
    BioPerl Bio::Seq object. It can handle both FASTA and FASTQ files,
    including their compressed versions. The modules that fall under this
    category are:

    1.  Bio::SeqAlignment::Components::Conversions::BioXFASTX : This module
        handles the conversion of lists of BioX::Seq objects to FASTX files
        on disk (where X is either A or Q, indicating FASTA or FASTQ). The
        module is used as an example of input/output plugins for the
        Bio::SeqAlignment::Components::TrimTail module.

    2.  Bio::SeqAlignment::Components::Sundry::IOHelpers : a collection of
        modules that read, write and split FASTX (either FASTA or FASTQ)
        files. It provides convenience functions to read/write such files
        using the lightweight module BioX::Seq::Stream.

    3.  Bio::SeqAlignment::Components::Sundry::Tailing : This module
        provides functions to add various tails to the 3' end of biological
        sequences. Such modifications are useful for, e.g., simulating
        polyA tails in RNAseq or adding UMI (Unique Molecular Identifier)
        tags to sequences. The function add_polyA is used by the
        Bio::SeqAlignment::Applications::Sequencing::Simulators::RNASeq::Polyester
        module to add polyA tails in the extension of Polyester presented
        in the talk.

    4.  Bio::SeqAlignment::Components::Sundry::DocumentSequenceModifications
        : This module is used to store modifications to sequences that are
        carried out by components of the simulator (or the modules that
        process sequences for mapping). During the execution of the Perl
        code, we use hash structures to store such modifications (a type of
        in-memory log) and then write them out in YAML, JSON or MessagePack
        formats. These files may be loaded at a subsequent point and used
        to analyze the results of whatever sequence modification was
        carried out in the source data.

    A single application script is provided in the bin directory of the
    distribution. This script is called polyester.pl and is used to attach
    the polyA tails to the reference sequences before calling the
    Polyester R script. In addition, this distribution contains example
    scripts for the use of these modules and comparator scripts for high
    performance random number generation against R and Python. PDL just
    shines in this area.

    All modules and application scripts were used for the talk given at
    the Science Track of the Perl & Raku Conference 2024.

    https://tprc.us/tprc-2024-las/

    https://blogs.perl.org/users/oodler_577/2024/01/perl-raku-conference-2024-to-host-a-science-track.html

  scripts
    This is a directory that holds various scripts in Perl and R that are
    used to generate and analyze performance data for various aspects
    covered in this talk. The generated data are found in the subfolder
    data, while the results of these analyses are stored as image files
    under 'scripts'. The following files are found under this location:

    cutadapt_polyA_algo_timing.pl
        This script benchmarks various potential approaches to trimming the
        polyA tail from sequences, including various native Perl
        implementations of the cutadapt algorithm, as well as PDL and C
        implementations of the same algorithm. It also includes an
        implementation of a changepoint method in C.

    cutadapt_polyA_algo_timing.py
        A Python script implementing the cutadapt algorithm for trimming
        polyA tails from sequences, together with a modified version
        developed for benchmarking. This script is used to compare the
        performance of various implementations of the cutadapt algorithm in
        Perl, Python, and C.
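        For readers unfamiliar with the algorithm being benchmarked, here is
        a minimal native Perl sketch of a cutadapt-style polyA scan. It is
        not the code found in the scripts above. The scoring scheme (+1 for
        an A, -2 for any other base, at most 20% non-A bases in the tail,
        and tails shorter than 3 bases ignored) reflects our understanding
        of cutadapt's documented behaviour and should be treated as an
        assumption, as should the function name polyA_trim_index_sketch.

```perl
use strict;
use warnings;

# Sketch of a cutadapt-style polyA scan. Assumed parameters: +1 per 'A',
# -2 per non-A base, error rate capped at 20%, minimum tail length of 3.
# Returns the index where the polyA tail starts, or the sequence length
# when no acceptable tail is found.
sub polyA_trim_index_sketch {
    my ($seq) = @_;
    my $n = length $seq;
    my ( $score,      $errors )     = ( 0, 0 );
    my ( $best_score, $best_index ) = ( 0, $n );

    # Walk from the 3' end towards the 5' end, tracking the best score.
    for ( my $i = $n - 1 ; $i >= 0 ; $i-- ) {
        if ( substr( $seq, $i, 1 ) eq 'A' ) {
            $score++;
        }
        else {
            $score -= 2;
            $errors++;
        }
        if ( $score > $best_score && $errors <= 0.2 * ( $n - $i ) ) {
            ( $best_score, $best_index ) = ( $score, $i );
        }
    }

    # Ignore tails shorter than 3 bases.
    $best_index = $n if $best_index > $n - 3;
    return $best_index;
}

my $read    = 'ACGTTGCAGTC' . ( 'A' x 10 );
my $start   = polyA_trim_index_sketch($read);
my $trimmed = substr( $read, 0, $start );
print "$trimmed\n";
```

        The single backwards pass over the string is what makes the method
        a natural candidate for the PDL and C variants: the loop body is
        just a comparison and two counter updates per base.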
    testRNG_performance.pl
        This script tests different combinations of random number
        generators and implementations of the inverse CDF method for
        sampling from truncated distributions. Its main output is a
        comma-separated file of timing data.

    testsimsGSL.R
        This script is used to test the performance of the GSL RNGs against
        the inverse CDF method implemented procedurally in R. It outputs a
        single PNG file with violin plots (a combination of box plots and
        kernel density estimates) of the timing data for the different
        implementations of the inverse CDF method in either R or Perl.

    vioplot_Perl_R_lognormal.png
        Performance comparison of Perl and R for the generation of
        truncated lognormal variates. It is produced by testsimsGSL.R.

    testPerl.csv
        This is a CSV file that contains the timing data for the Perl RNGs
        and the inverse CDF method implemented in PDL. It is produced by
        testRNG_performance.pl.

    perl_timing.txt
        This is a text file that contains the timing data for the various
        implementations of cutadapt in native Perl, PDL and PDL/C. It is
        produced by the script cutadapt_polyA_algo_timing.pl.

    python_timing.txt
        This is a text file that contains the timing data for the various
        implementations of cutadapt in native Python. It is produced by the
        script cutadapt_polyA_algo_timing.py.

SEE ALSO
    *   Bio::SeqAlignment

        A collection of tools and libraries for aligning biological
        sequences from within Perl.

    *   cutadapt

        This module provides an interface to the cutadapt tool for
        identifying and trimming adapters and primers from sequencing data.

    *   PDL

        The Perl Data Language (PDL) gives standard Perl the ability to
        compactly store and speedily manipulate the large N-dimensional
        data arrays which are the bread and butter of scientific computing.
        PDL turns Perl into a free, array-oriented, numerical language that
        can be a very solid alternative to switching to Python or R for
        numerical computations during complex data analysis tasks and
        pipelines.
    *   polyester

        Polyester is an R package designed to simulate RNA sequencing
        experiments with differential transcript expression. Given a set of
        annotated transcripts, Polyester will simulate the steps of an
        RNA-seq experiment (fragmentation, reverse-complementing, and
        sequencing) and produce files containing simulated RNA-seq reads.
        Simulated reads can be analyzed using your choice of downstream
        analysis tools. Polyester has a built-in wrapper function to
        simulate a case/control experiment with differential transcript
        expression and biological replicates. Users are able to set the
        levels of differential expression at transcripts of their choosing.
        This means they know which transcripts are differentially expressed
        in the simulated dataset, so the accuracy of statistical methods
        for differential expression detection can be analyzed.

AUTHOR
    Christos Argyropoulos

COPYRIGHT AND LICENSE
    This software is copyright (c) 2024 by Christos Argyropoulos.

    This is free software; you can redistribute it and/or modify it under
    the same terms as the Perl 5 programming language system itself.