Skip to content
Snippets Groups Projects

Overview

To investigate 3'UTR isoforms in neuronal compartments we conducted series of 3'End sequencing experiments in different samples to assess the diversity, localization, stability and translational efficiency of 3'UTR isoforms in neuronal compartments. The library preparation is done using the MACE method. Step by step analyses are described in the following documentation.

Experimental Data

Hippocampus CA1 region microdissection leads to separation of somata (cell soma enriched tissue) and neuropil (cell processes axons and dendrites enriched tissue) layers. Such anatomical architecture allows us to address mRNA populations residing in either neuronal soma or neuronal processes.

Furthermore hippocampal tissue consists of variety of cell types: neurons, glia, epithelial. To classify the two major classes neuronal and glial, we conducted 3'End sequencing between neuron-enriched and glia-enriched hippocampus cell-culture. Defining the enrichment properties of detected genes allowed us to add cell-type enrichment label for each category.

To address 3'UTR isoform specific mRNA stability we conducted time-points experiment applying transcription inhibitors TI. By using nonlinear least squares curve fitting we predicted the exponental decay of each 3'UTR isoform and drew conclusions upon half-life distribution between cellular compartments.

Neuronal activity enahnces the plasticity and redistribution of mRNAs. To investigate how such changes are driven by the choice and processing of isoform specific 3'UTRs, we conducted an experiment applying BiCu treatment for 4h on hippocampal slices and did microdissection. Going further to explain contribution of transcriptional regulation, we combined previous experiment with set of samples under transcription inhibition TI. Overall we sequenced 2 biological replica per condition ending up in 28 indexed runs (12 in hippocampal cell-culture and 16 in hippocampal tissue).

Sample list:

  • hippocampus CA1 microdissection in somata vs. neuropil layers

    • under basal condition, targeting mRNAs localization (2 samples x 2 replica)
      • BiCu treatment, targeting mRNA localization upon enhanced neuronal activity (2 samples x 2 replica)
      • transcription inhibitor (TI), address transport upon blocked transcription, 3'UTR isoforms redistribution (2 samples x 2 replica)
      • BiCu & + TI, address 3'UTR choice after enhanced neuronal activity based on redistribution and stability (2 samples x 2 replica)
  • hippocampal culture (20DIV neurons with depleated Glial cells) + transcription inhibitor, time points (0, 2, 4, 9, 16h) (5 samples x 2 replica)

  • glia-enriched culture (20DIV glia culture with depleated Neuronal cells) (1 sample x 2 replica)

Data repository can be found in NCBI SRA.

Identification and annotation of poly(A) supported sites (PASS) after 3'End sequencing

update on PASS annotation May 2017

Create a reference RefSeq gene annotation

Download current annotation from UCSC Table Browser. refSeq_rn5_Raw_May2017.bed :: tab-delimited 12 columns bed file with RefSeq gene annotation refSeq_rn5_IDMap_May2017.txt :: 2-columns tab-delimited file mapping RefSeq transcript ID to official gene symbol.

intersecting both files based on RefSeq transcript ID::

awk -F"\t" 'FNR==NR{LIST[$1]=$2;next;}{OFS="\t";$4=$4";"LIST[$4];print $0}' refSeq_rn5_IDMap_May2017.txt refSeq_rn5_Raw_May2017.bed |sort -k1,1 -k2,2n > refSeq_rn5_Annotated_May2017.bed

refSeq_rn5_Annotated_May2017.bed :: tab-delimited 12 columns bed file with RefSeq annotation, where Name field consists of transcript ID and offficial gene symbol separated by ";".

Associate PASS clusters with reference genes

Consolidated set of experimental data

Classification of cell-enriched data

Structural properties of neuronal 3'UTRs

Localisation of neuronal 3'UTRs between cellular compartments

Stability of neuronal 3'UTRs

3'UTRs redistribution upon altered neuronal activity

We further compare mRNA redistribution upon enhanced neuronal activity (applying BiCu treatment) or address effects of transport and stability (applying transcription inhibitors TI). Total RNA is isolated from either hippocampal tissue or cultured neurons. Each sample is sequenced in 2 biological replica, ending in a set of 13 samples and 26 indexed runs. Sample list:

  • hippocampus CA1 microdissection in somata vs. neuropil layers

    • under basal condition, targeting mRNAs localization (2 samples x 2 replica)
      • BiCu treatment, targeting mRNA localization upon enhanced neuronal activity (2 samples x 2 replica)
      • transcription inhibitor (TI), address transport upon blocked transcription, 3'UTR isoforms redistribution (2 samples x 2 replica)
      • BiCu & + TI, address 3'UTR choice after enhanced neuronal activity based on redistribution and stability (2 samples x 2 replica)
  • hippocampal culture (20DIV neurons with depleated Glial cells) + transcription inhibitor, time points (0, 2, 4, 9, 16h) (5 samples x 2 replica)

Analysis Steps

Analysis include known and custom made tools. Here a thorough overview of each step is being given where one can follow what is the novel implemented scripts and what requirements or dependancies they require.

  • determine sequenced read quality by fastqSeqStats
  • alignment by STAR
  • detecting of Poly(A) supported sites PASSFinder
  • clustering of Poly(A) supported sites
  • expression of Poly(A) supported sites

Reads Quality

To check proper sequenced read quality, we used the in-house implemented tool fastqSeqStats. It will generate summary statistics of base frequency and phred score per sequence length. The plotReadQuality.m file is used to generate the plot. figure Reads Quality One can recognize the increased A frequency toward sequence length, consistant with the likelihood of capturing mRNA polyA tail in sequenced reads. Overall Phred score stays stable and in accepted range.

Alignment

Read alignment is done using STAR. The following parameters were utalized with the program:

STAR --runMode alignReads --runThreadN 6 --genomeDir $genomePath --readFilesIn $queryFile --readFilesCommand zcat --outSAMattributes All --outStd Log --outSAMtype BAM SortedByCoordinate --outSAMstrandField intronMotif --outFilterIntronMotifs RemoveNoncanonical --alignSoftClipAtReferenceEnds No --outFilterScoreMinOverLread 0.25 --outFilterMatchNminOverLread 0.25;

To facilitate alignment of all samples one can use the batch processing script AlignSingleEndGz.sh.

Find 3'UTR isoforms

Detection

In order to detect clusters of poly(A) sites we implemented the PASSFinder. It relies on HTSLib to read an alignment file in BAM format. The coverage on each base is piled up, where a read is being classified as either poly(A) containing or not based on a poly(A) recognition algorithm designed by Kent for UCSC utilities. Each base is cross-referenced with records for genome internal poly (A/T) regions to evaluate potential internal priming events. To create a poly(A/T) mask a string of consecutive 10 mono bases is aligned to the genome with BOWTIE and the resuls is converted to BED record. The mask file is indexed and randomly accessed with Tabix.

time bowtie /path_to_genome/bowtie_index/genome PolyTailMask.fa -f -v 2 --all --sam --threads 6 |samtools view -buS - | samtools sort - PolyTailMask

bedtools bamtobed -i PolyTailMask.bam -split | bedtools merge -i - -s -c 4 -o count | sort -k1,1 -k2,2n | awk -F"\t" 'OFS="\t"{print $1,$2,$3,sprintf("poly%06d",NR),$5,$4}' | bgzip > species_PolyRegions.bed.gz

tabix species_PolyRegions.bed.gz

One can either merge BAM alignment files with samtools merge or use the provided batchPASSFinder.sh script to run it on multiple queries. The resulted bed coverage files are sorted by position, compressed using BGZIP and indexed using Tabix. Explore the statistics of accumulated reads with the statsPASSFinder.sh. You will get an overview of number of uniquely aligned reads, reads classified with poly(A) tail, reads classified with poly(A) tail and hitting a mask region and reads on the mitochondrial chromosome. figure Reads Summary The sequenced reads are UMI-tagged and filterd for unique molecules only. The sequencing yield varies between 5 and 16 million reads, where after alignment the uniquely mapped reads are in the range of >80% (notice the color coded ratio from sequenced reads). Reads that contain +(A) tails represent around 10% and they will be used as seeds for identification of poly(A) supported sites. The number of reads mapped next to or over genomic repeats is also 10%. We do not allow cluster formation around such regions. It is interesting to observe the mitochondrial contribution. In neuronal cell culture samples it varies between 18-23%, where in the somata layer it is incresed to 19-32% and spikes in the neuropil layer with 31-43%.

Clustering

3'base coverage holds the infomation of potential 3'UTR sites. In order to identify the positions of polyadenylation we implemented a base clustering techniques that can merge positions in user defined window. During the grouping process information about best expressed base and best expressed seed (3'base of reads containing poly(A) tail) is maintained:

  1. chrom
  2. chrom_start
  3. chrom_end
  4. group_id
  5. span
  6. strand
  7. masked_bases
  8. bases_counts
  9. reads_sum_coverage
  10. reads_max_coverage
  11. reads_best_base
  12. tails_counts
  13. tails_sum_coverage
  14. tails_max_coverage
  15. tails_best_base

To run the procedure run the PASSCluster.pl script. To run a batch process routine evoke the batchPASSCluster.sh. We used a clustering windows of 25 nucleotides as suggested in Müller et. al.. To estimate the statiscs for the clustering process one can use statsPASSCluster.pl and visualize the results with plotGroupStats.m. For generating the frequency of clusters span we used the following one liner:

gzcat ./clusters/passData_win25_11Nov2016.bed.gz |awk -F"\t" 'OFS="\t"{if($1!="chrM"){COUNTS[$5]++;READS[$5]+=$9}}END{for(key in COUNTS) print key,COUNTS[key],READS[key]}' | sort -k1,1n > spanPerCluster_11Nov2016.txt

figure Cluster Summary In subplot A we compare the total number of detected clusters per sample vs. clusters that contain a poly(A) tail read as a seed and such that lack poly(A) tails. Previously we observed that the poly(A) tails serving for seeds are only ~10% of all the sequenced reads. What is interesting here is that clusters without any poly(A) reads (-seeds) contribute a lot to the toal number of detected clusters but they accumulate less then ~10 of the reads data. Cluster containing poly(A) reads (+seads) are on average 40.000 per sample but they will accumulate >90% of the reads data in the sample. This result gives us confidence to consider only cluster containing poly(A) tail and refereing the best position of that tail as the poly(A) supported site. In subplot B we wanted to investigate what is the optimal size of the cluster. First observation is that cluster with length ot 1 base (log10(1) = 0, black line) are around 62% of the clustered sites, but their actual contribution is less then 1% (dark pink line at base 1). As an optimal size we investigate the window between 25 and 500 bases which will accumulate 20% of the clusters (black line) and 60% of the reads (dark pink line). As one should notice the dark pink line will end around 65%, but here we omitted the contribution of the mitochondrial chromosome. There we have only 13 clusters, corresponding to the mitochondrial genes, but they accumulate more than 30% of the reads data (as shown before in the figure for reads summary).

Annotation

The annotation is being done in two steps. First we will prepare a gene reference map and then we will assign the closest gene to each identified cluster.

Prepare reference annotation file

We will use the UCSC Table Browser tool to download the current RefSeq annotation for Rattus norvegicus rn5. We need to download two files, the RefSeq annotation in BED12 file format and the gene symbol annotation per mRNA transcript from the refLink table. For help with that step one can consider reading the Table Browser User Guide provided by UCSC. To merge both files we use the following AWK command:

awk -F"\t" 'FNR==NR{LIST[$1]=$2;next;}{name=sprintf("%s;%s",$4,LIST[$4]);$4=name;OFS="\t";print $0}' refSeq_rn5_11Nov2016.txt refSeq_rn5_11Nov2016.bed |sort -k1,1 -k2,2n > refSeq_rn5_RefAnn_11Nov2016.bed

We end up with BED12 reference annotation file, where the name field contains transcript and gene information separated by ';'. In order to extract genomic features, we implemented BED12Split.pl. The script will split BED12 file into BED6 file and it will append feature label to the name of each record, where available labels are 5'UTR, CDS, intron, 3'UTR. We will use the resulted BED6 files in the next step.

Find closest gene to cluster

To associate PASS clusters with genes we implemented the PASSAnnotate.pl routine. The script required a BED features map created in the previous step and PASS clusters file from the clustering procedure. It uses [bedtools closest] (http://bedtools.readthedocs.io/en/latest/content/tools/closest.html) as internal engine for assosiating neighbouring genes and sites. The routine will calculate UTR predicted length and relative contribution of peak per gene. The following fields are added to the information of the clusters:

  1. gene symbol
  2. gene feature
  3. upstream start position
  4. upstream span
  5. filter (true if pass site contributes >= 1% to total gene expression)

Example command:

perl PASSAnnotate.pl passData_ClustersRaw.bed.gz refSeq_rn5_Features.bed 20000 |sort -k1,1 -k2,2n |gzip > passData_ClustersAnnotated.bed.gz

Expression

To count expression one needs to create coverage files with

bash batchPASSFinder.sh /storage/schu/ProjectGroups/RNA/Data/RNASequencing/GATC/tpend/project/glial/bams/ /storage/scic/Data/External/Bioinformatics/Rat/Genomes/rn5_Mar2012/poly/rn5_PolyRegions.bed.gz /storage/schu/ProjectGroups/RNA/Data/RNASequencing/GATC/tpend/project/glial/coverage/

Second step is to crete a tabix index for the coverage files with

tabix --sequence 1 --begin 2 --end 2 --zero-based $file_out

Running the perl script act as a batch merger

perl PASSCountExpression.pl <annotation> <path to coverage>