Difference between revisions of "Bioinformatics"
(→Programs to Install) |
(→High Throughput Sequencing workflow) |
||
Line 309: | Line 309: | ||
=High Throughput Sequencing workflow= | =High Throughput Sequencing workflow= | ||
+ | Viewing list - | ||
+ | * [https://www.youtube.com/watch?v=p4vKJJlNTKA Generation of genomic data (Hubbard Center for Genomic Studies)] gives background on DNA/RNA sequencing. | ||
+ | * [https://www.youtube.com/watch?v=tlf6wYJrwKY&list=PLblh5JKOoLUJo2Q6xK4tZElbIvAACEykp StatQuest High Throughput Sequencing] - watch the videos in this playlist as needed. Start with the first. | ||
+ | Basic steps in workflow of going from RNA seq files to produce counts files. | ||
+ | 0. Get input files. We use SRA, and take the first 1000 reads from SRR4011874 using fastq-dump from the SRA toolkit. | ||
+ | 1. Trim adapters. We use cutadapt. | ||
+ | 2. Check for quality (could do that before trimming adapters as well). We use FastQC. | ||
+ | 3. Align to reference genome. We use hisat2. | ||
+ | 4. Convert files as needed by whatever comes next. For us, we convert SAM to BAM, sort BAM, create an index for BAM. We use samtools. | ||
+ | 5. Take alignment results and count how many reads per gene. We use htseq-count. | ||
+ | 6. Do the above for each sample in the dataset, and combine the individual counts files into a single matrix file (i.e., csv). | ||
+ | |||
+ | A shell script that has all of these steps for our test dataset is [https://cs.indstate.edu/~jkinne/bioinformatics/rna_seq_counts/rnaseq_pipeline.sh rnaseq_pipeline.sh] | ||
=Genome Assembly= | =Genome Assembly= |
Revision as of 16:08, 10 July 2024
Background
For each video that is listed, vocab terms are given that are either explained within the video or that the viewer is assumed to already know.
Biology
Definitions are from NCBI or Wikipedia.
A video giving an overview and explaining some of the key points is here - zoom recording explaining Bio background.
Cell
Things to understand...
- Cell diagram - be able to identify different parts of the cell in a diagram and what each part does (roughly). See diagrams on Wikipedia.
Vocab...
- prokaryote - Single-celled microorganism whose cells lack a well-defined, membrane-enclosed nucleus. Comprises two of the major domains of living organisms—the Bacteria and the Archaea.
- eukaryote - Organism composed of one or more cells with a distinct nucleus and cytoplasm. Includes all forms of life except viruses and prokaryotes (bacteria and archaea).
- DNA (deoxyribonucleic acid) - Polynucleotide formed from covalently linked deoxyribonucleotide units. It serves as the store of hereditary information within a cell and the carrier of this information from generation to generation.
- nucleus - Prominent membrane-bounded organelle in a eukaryotic cell, containing DNA organized into chromosomes.
- cytoplasm - Contents of a cell that are contained within its plasma membrane but, in the case of eukaryotic cells, outside the nucleus.
- cell membrane (plasma membrane) - Membrane that surrounds a living cell (all types of cells).
- cell wall - Mechanically strong extracellular matrix deposited by a cell outside its plasma membrane. It is prominent in most plants, bacteria, algae, and fungi. Not present in most animal cells.
- vacuole - Very large fluid-filled vesicle found in most plant and fungal cells, typically occupying more than a third of the cell volume.
- chloroplast - Organelle in green algae and plants that contains chlorophyll and carries out photosynthesis. It is a specialized form of plastid.
- organelle - Membrane-enclosed compartment in a eukaryotic cell that has a distinct structure, macromolecular composition, and function. Examples are nucleus, mitochondrion, chloroplast, Golgi apparatus.
- lipid - Organic molecule that is insoluble in water but tends to dissolve in nonpolar organic solvents. A special class, the phospholipids, forms the structural basis of biological membranes.
- protein - The major macromolecular constituent of cells. A linear polymer of amino acids linked together by peptide bonds in a specific sequence.
- cytoskeleton - System of protein filaments in the cytoplasm of a eukaryotic cell that gives the cell shape and the capacity for directed movement. Its most abundant components are actin filaments, microtubules, and intermediate filaments.
- RNA (ribonucleic acid) - Polymer formed from covalently linked ribonucleotide monomers (which are represented by the letters A, U, C, G).
- ribosome - Particle composed of ribosomal RNAs and ribosomal proteins that associates with messenger RNA and catalyzes the synthesis of protein.
- endoplasmic reticulum (ER) - Labyrinthine membrane-bounded compartment in the cytoplasm of eukaryotic cells, where lipids are synthesized and membrane-bound proteins and secretory proteins are made.
- rough ER - Endoplasmic reticulum with ribosomes on its cytosolic surface. Involved in the synthesis of secreted and membrane-bound proteins.
- smooth ER - Region of the endoplasmic reticulum not associated with ribosomes. It is involved in lipid synthesis.
- vesicle - Small, membrane-bounded, spherical organelle in the cytoplasm of a eukaryotic cell.
- Golgi apparatus (Golgi complex) - Membrane-bounded organelle in eukaryotic cells in which proteins and lipids transferred from the endoplasmic reticulum are modified and sorted. It is the site of synthesis of many cell wall polysaccharides in plants and extracellular matrix glycosaminoglycans in animal cells.
- mitochondria (singular: mitochondrion) - Membrane-bounded organelle, about the size of a bacterium, that carries out oxidative phosphorylation and produces most of the ATP in eukaryotic cells. The "powerhouse of the cell".
- symbiosis - Intimate association between two organisms of different species from which both derive a long-term selective advantage.
- surface area to volume ratio - The physics of a system is different at different SA to Vol ratios (e.g., for a flying insect, flapping its wings through air is more like what swimming through water is for a human). The reason is that the mass of an object is proportional to its volume (which scales with the cube of its length) while the interaction with the environment is through an object's surface area (which scales with the square of its length). The larger an object, the smaller its surface area to volume ratio will be.
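The scaling argument in the last vocab item can be checked with a few lines of Python. This is a minimal sketch that assumes a cube-shaped object; the shape and the function name are illustrative choices, not something from the definitions above.

```python
def surface_area_to_volume(side_length):
    """Return surface area divided by volume for a cube with the given side length."""
    surface_area = 6 * side_length ** 2  # six square faces (a squared measurement)
    volume = side_length ** 3            # a cubed measurement
    return surface_area / volume


# The ratio shrinks as the object gets larger.
for side in (1, 10, 100):
    print(side, surface_area_to_volume(side))
```

For a cube the ratio works out to 6 divided by the side length, so making the object 10 times larger makes the ratio 10 times smaller.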
Genetics
- allele - One of a set of alternative forms of a gene (the DNA letters of the gene). In a diploid cell each gene will have two, each occupying the same position (locus) on homologous chromosomes.
- dominant - In genetics, refers to the member of a pair of alleles that is expressed in the phenotype of the organism while the other allele is not, even though both alleles are present. Opposite of recessive.
- recessive - In genetics, refers to the member of a pair of alleles that fails to be expressed in the phenotype of the organism when the dominant allele is present. Also refers to the phenotype of an individual that has only the recessive allele.
- gene - Region of DNA that controls a discrete hereditary characteristic, usually corresponding to a single protein or RNA. This definition includes the entire functional unit, encompassing coding DNA sequences, noncoding regulatory DNA sequences, and introns.
- epigenetics - The study of heritable traits, or a stable change of cell function, that happen without changes to the DNA sequence.
- genome - The totality of genetic information belonging to a cell or an organism; in particular, the DNA that carries this information.
- model organism - A species, such as Drosophila melanogaster (fruit fly) or Escherichia coli (E. coli), that has been studied intensively over a long period and thus serves as a "model" of the biology of a particular type of organism. Other such prominent organisms include: Mus musculus (house mouse), Saccharomyces cerevisiae (baker's yeast), Arabidopsis thaliana (thale cress, a plant).
- methylation - Addition of a methyl group to DNA. Extensive methylation of the cytosine base in CG sequences is used in vertebrates to keep genes in an inactive state.
- methyl group - A hydrophobic chemical group (-CH3) derived from methane (CH4).
- genotype - Genetic constitution (that is, what the letters are in the DNA) of an individual cell or organism, as opposed to the observed characteristics of the organism.
- phenotype - The observable character of a cell or an organism.
DNA Structure
Some facts...
- Size of human genome: about 3 billion nucleotides.
- DNA replication error rate in humans: around 1 in 10 billion nucleotides (after proofreading and fixing mistakes).
Some things to understand...
- DNA structure - the basic structure of double helix, sugar phosphate backbone, complementary base pairs.
- DNA replication - roughly how it works - helicase unzips portion of double helix, DNA polymerase attaches new nucleotides to each side.
Vocab...
- nucleotide - Nucleoside with one or more phosphate groups joined in ester linkages to the sugar moiety. DNA and RNA are polymers of nucleotides.
- sugar - Small carbohydrates with a monomer unit of general formula (CH2O)n. Examples are the monosaccharides glucose, fructose and mannose, and the disaccharide sucrose (composed of a molecule of glucose and one of fructose linked together).
- Carbohydrate - General term for sugars and related compounds containing carbon, hydrogen, and oxygen, usually with the empirical formula (CH2O)n.
- base - A substance that can accept a proton in solution. The purines and pyrimidines in DNA and RNA are organic nitrogenous bases and are often referred to simply as bases.
- base pair - Two nucleotides in an RNA or DNA molecule that are held together by hydrogen bonds—for example, G pairs with C, and A with T or U (remember - AT, GC).
- double helix - The three-dimensional structure of DNA, in which two DNA chains held together by hydrogen bonding between the bases are wound into a helix.
- antiparallel strands - Describes the relative orientation of the two strands in a DNA double helix; the polarity of one strand is oriented in the opposite direction to that of the other.
- adenine, guanine, cytosine, thymine - nucleotide bases of DNA.
- adenine, guanine, cytosine, uracil - nucleotide bases of RNA.
- hydrogen bond - An electrostatic force of attraction between a hydrogen (H) atom which is covalently bonded to a more electronegative "donor" atom or group (Dn), and another electronegative atom bearing a lone pair of electrons—the hydrogen bond acceptor (Ac). Such an interacting system is generally denoted Dn−H···Ac, where the solid line denotes a polar covalent bond, and the dotted or dashed line indicates the hydrogen bond.
- chromosome - Structure composed of a very long DNA molecule and associated proteins that carries part (or all) of the hereditary information of an organism. Especially evident in plant and animal cells undergoing mitosis or meiosis, where each chromosome becomes condensed into a compact rodlike structure visible under the light microscope.
- enzyme - Protein that catalyzes a specific chemical reaction (normally a biological reaction).
- DNA polymerase - Enzyme that synthesizes DNA by joining nucleotides together using a DNA template as a guide.
- DNA helicase - Enzyme that is involved in opening the DNA helix into its single strands for DNA replication.
- mutation - Heritable change in the nucleotide sequence of a chromosome.
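The base pairing rules (A with T, G with C) and the antiparallel orientation of the two strands can be illustrated with a short Python sketch. The function names are illustrative, and the sketch assumes the input contains only the four DNA letters in upper case.

```python
# Watson-Crick pairing for DNA: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand):
    """Return the base that pairs with each base, in the same left-to-right order."""
    return "".join(COMPLEMENT[base] for base in strand)


def reverse_complement(strand):
    """Return the opposite strand read in its own 5' to 3' direction.

    Because the strands are antiparallel, the partner strand is the
    complement read backwards.
    """
    return complement(strand)[::-1]


print(complement("ATGC"))          # TACG
print(reverse_complement("ATGC"))  # GCAT
```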
DNA Transcription
Process to understand...
- Process of gene in DNA being made into a protein (DNA -> pre-mRNA -> mRNA -> protein). Should be able to draw a picture for each part and explain what happens.
Vocab...
- mRNA (messenger RNA) - RNA molecule that specifies the amino acid sequence of a protein. Produced by RNA splicing (in eukaryotes) from a larger RNA molecule made by RNA polymerase as a complementary copy of DNA. It is translated into protein in a process catalyzed by ribosomes.
- transcription (DNA transcription) - Copying of one strand of DNA into a complementary RNA sequence by the enzyme RNA polymerase.
- RNA polymerase - Enzyme that catalyzes the synthesis of an RNA molecule on a DNA template from nucleoside triphosphate precursors.
- promoter - Nucleotide sequence in DNA to which RNA polymerase binds to begin transcription.
- pre-mRNA - RNA that was directly copied from a strand of DNA within the cell nucleus, and has not yet been spliced.
- poly-A tail - Multiple adenosine monophosphates; in other words, it is a stretch of RNA that has only adenine bases (the letter A). In eukaryotes, polyadenylation is part of the process that produces mature mRNA for translation. In many bacteria, the poly(A) tail promotes degradation of the mRNA. It, therefore, forms part of the larger process of gene expression.
- RNA splicing - Process in which intron sequences are excised from RNA transcripts in the nucleus during formation of messenger and other RNAs.
- intron - Noncoding region of a eukaryotic gene that is transcribed into an RNA molecule but is then excised by RNA splicing during production of the messenger RNA or other functional structural RNA. Remember - "in between".
- exon - Segment of a eukaryotic gene that consists of a sequence of nucleotides that will be represented in messenger RNA or the final transfer RNA or ribosomal RNA. In protein-coding genes, exons encode amino acids in the protein. An exon is usually adjacent to a noncoding DNA segment called an intron.
- ribosome - Particle composed of ribosomal RNAs and ribosomal proteins that associates with messenger RNA and catalyzes the synthesis of protein.
- alternative splicing - The production of different proteins from the same RNA transcript by splicing it in different ways.
- central dogma (of molecular biology) - DNA makes RNA, and RNA makes protein. This is generally true.
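The DNA -> pre-mRNA -> mRNA part of the process can be sketched in Python. This is a simplified illustration, not a model of the real machinery: it starts from the coding strand (whose letters match the RNA apart from T/U) instead of building the complement of the template strand, and the toy gene and its exon positions are made up for the example.

```python
def transcribe(coding_strand):
    """Return the pre-mRNA for a gene given its coding strand.

    The pre-mRNA has the same letters as the coding strand, with U in place of T.
    """
    return coding_strand.replace("T", "U")


def splice(pre_mrna, exons):
    """Join the exon regions and drop everything in between (the introns).

    exons is a list of (start, end) positions, 0-based, with end excluded.
    """
    return "".join(pre_mrna[start:end] for start, end in exons)


# Hypothetical toy gene: exon 1, one intron, exon 2.
gene = "ATGGCC" + "GTAAGTTTTCAG" + "GATTAA"
exons = [(0, 6), (18, 24)]

pre_mrna = transcribe(gene)
mrna = splice(pre_mrna, exons)
print(pre_mrna)  # AUGGCCGUAAGUUUUCAGGAUUAA
print(mrna)      # AUGGCCGAUUAA
```

Choosing a different set of exon positions for the same pre-mRNA gives a different mRNA, which is the idea behind alternative splicing.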
RNA Translation
Process to understand...
- Process of mRNA being translated into a protein (the last step of a gene in DNA being made into a protein).
Vocab...
- peptide - Short chains of amino acids linked by peptide bonds.
- protein - The major macromolecular constituent of cells. A linear polymer of amino acids linked together by peptide bonds in a specific sequence.
- translation (RNA translation) - Process by which the sequence of nucleotides in a messenger RNA molecule directs the incorporation of amino acids into protein. It occurs on a ribosome.
- amino acids - Organic molecule containing both an amino group and a carboxyl group. Those that serve as the building blocks of proteins are alpha amino acids, having both the amino and carboxyl groups linked to the same carbon atom.
- codon - Sequence of three nucleotides in a DNA or messenger RNA molecule that represents the instruction for incorporation of a specific amino acid into a growing polypeptide chain.
- stop codon - A codon that signals the termination of the translation process of the current protein.
- ribosome - Particle composed of ribosomal RNAs and ribosomal proteins that associates with messenger RNA and catalyzes the synthesis of protein.
- ribosomal RNA (rRNA) - Any one of a number of specific RNA molecules that form part of the structure of a ribosome and participate in the synthesis of proteins. Often distinguished by their sedimentation coefficient, such as 28S rRNA or 5S rRNA.
- start codon - The first codon of a messenger RNA (mRNA) transcript translated by a ribosome. Normally codes for methionine in eukaryotes (letter M).
- methionine - Normally the first amino acid in a peptide sequence in eukaryotes.
- transfer RNA (tRNA) - Set of small RNA molecules used in protein synthesis as an interface (adaptor) between messenger RNA and amino acids. Each type of tRNA molecule is covalently linked to a particular amino acid.
- anticodon - Sequence of three nucleotides in a transfer RNA molecule that is complementary to a three-nucleotide codon in a messenger RNA molecule.
- polypeptide - A larger linear polymer composed of multiple amino acids. Term is interchangeable with "protein".
- protein folding - The physical process by which a protein, after synthesis by a ribosome as a linear chain of amino acids, changes from an unstable random coil into a more ordered three-dimensional structure.
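Translation can be sketched as reading an mRNA three letters at a time and looking each codon up in a table. The sketch below builds the standard genetic code from a compact 64-letter string (one-letter amino acid codes, with * marking stop codons). It is a simplification: it starts at the first AUG it finds and ignores everything else the ribosome and tRNAs actually do. The function name is illustrative.

```python
from itertools import product

# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... GGG.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna):
    """Translate an mRNA into one-letter amino acid codes.

    Starts at the first AUG (start codon) and stops at the first stop codon
    or when fewer than three letters remain.
    """
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("AUGGCCGAUUAA"))  # MAD
```

The output always begins with M (methionine) because translation starts at the AUG start codon.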
Gene Expression
Some facts...
- How many genes are there? About 20,000 in humans.
- How long or large are genes? Can be as small as a few hundred DNA bases, or larger than 2 million bases. The average size of a protein-coding gene in humans is around 62 kilobases (kb), and the median length is about 24 kb. Note that of the 62 kb of an average gene, about 60 kb is introns (regions that are spliced out before being translated to protein), so the amount that is actually translated to protein averages about 2 kb.
Vocab...
- gene regulation - Includes a wide range of mechanisms that are used by cells to increase or decrease the production of specific gene products (i.e., proteins).
- gene - Region of DNA that controls a discrete hereditary characteristic, usually corresponding to a single protein or RNA. This definition includes the entire functional unit, encompassing coding DNA sequences, noncoding regulatory DNA sequences, and introns.
- differential gene expression - The process where different genes are activated in a cell, giving that cell a specific purpose that defines its function.
- transcriptional regulation - The means by which a cell regulates the conversion of DNA to RNA, thereby orchestrating gene activity.
- non-coding DNA - Components of an organism's DNA that do not encode protein sequences. Some is transcribed into functional non-coding RNA molecules.
- transcription factor - Term loosely applied to any protein required to initiate or regulate transcription in eukaryotes. Includes both gene regulatory proteins and the general transcription factors.
- micro RNA (miRNA) - Small, single-stranded, non-coding RNA molecules containing 21 to 23 nucleotides. Found in plants, animals and some viruses, these are involved in RNA silencing and post-transcriptional regulation of gene expression.
- small interfering RNA (siRNA, aka short interfering RNA, or silencing RNA) - A class of double-stranded, non-coding RNA molecules, typically 20–24 base pairs in length, similar to miRNA, and operating within the RNA interference pathway.
- epigenetic mechanisms - Heritable traits, or a stable change of cell function, that happen without changes to the DNA sequence. One example is DNA methylation.
- histone - One of a group of small abundant proteins, rich in arginine and lysine, four of which form the nucleosome on the DNA in eukaryotic chromosomes.
- DNA methylation - Addition of a methyl group to DNA. Extensive methylation of the cytosine base in CG sequences is used in vertebrates to keep genes in an inactive state.
- post-transcriptional regulation - The control of gene expression at the RNA level. It occurs once the RNA polymerase has been attached to the gene's promoter and is synthesizing the nucleotide sequence.
Genetic Mutations
Vocab...
- mutation - Heritable change in the nucleotide sequence of a chromosome.
- DNA replication error - In the process of copying DNA, an error is made in copying.
- mutagen - A physical or chemical agent that permanently changes genetic material, usually DNA, in an organism and thus increases the frequency of mutations above the natural background level.
- somatic mutation - Mutation in any cell of a plant or animal other than a germ cell (reproductive cell) or germ-cell precursor.
- substitution - A type of mutation that replaces one nucleotide or amino acid with another.
- missense mutation - When a single nucleotide base in a DNA sequence is swapped for another one, resulting in a different codon and, therefore, a different amino acid.
- nonsense mutation - A genetic alteration that causes the premature termination of a protein.
- silent mutation - Base substitutions that result in no change of the amino acid or amino acid functionality when the altered messenger RNA (mRNA) is translated.
- frameshift mutation - A genetic mutation caused by insertions or deletions of a number of nucleotides in a DNA sequence that is not divisible by three.
- gene therapy - A medical technology that aims to produce a therapeutic effect through the manipulation of gene expression or through altering the biological properties of living cells.
- CRISPR/Cas9 - Edits genes by precisely cutting DNA and then harnessing natural DNA repair processes to modify the gene in the desired manner. The system has two components: the Cas9 enzyme and a guide RNA.
- pharmacogenomics - The study of the role of the genome in drug response.
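The difference between silent, missense, and nonsense substitutions comes down to what the changed codon codes for, and a frameshift comes down to whether the inserted or deleted length is divisible by three. The following Python sketch shows that logic using the standard genetic code written with DNA letters. The function names are illustrative, and "silent" here only covers the case where the amino acid is unchanged.

```python
from itertools import product

# Standard genetic code with DNA letters, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
DNA_BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(DNA_BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(original_codon, mutated_codon):
    """Classify a substitution by its effect on the encoded amino acid."""
    before = CODON_TABLE[original_codon]
    after = CODON_TABLE[mutated_codon]
    if before == after:
        return "silent"    # same amino acid (or still a stop codon)
    if after == "*":
        return "nonsense"  # new stop codon, so the protein ends early
    if before == "*":
        return "other"     # a stop codon was lost; not one of the terms defined above
    return "missense"      # a different amino acid


def is_frameshift(indel_length):
    """An insertion or deletion shifts the reading frame unless its length is a multiple of 3."""
    return indel_length % 3 != 0


print(classify_substitution("GAA", "GAG"))  # silent
print(classify_substitution("GAG", "GTG"))  # missense
print(classify_substitution("TGG", "TGA"))  # nonsense
```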
Review...
Note that there are /a lot/ of terms so far, and many of the definitions are kind of complicated sounding (since they are from a textbook or Wikipedia). The following are the most important terms and concepts you need to know.
- Parts of the cell - be able to identify most of them in a picture or diagram.
- DNA and RNA are made of nucleotides ("bases"). DNA is double-stranded, RNA is single-stranded. DNA letters are A, T, C, G. RNA letters are A, U, C, G. A pairs with T or U, and C pairs with G. DNA is stable for long periods of time. RNA is not stable for long periods of time, so the RNA in a cell reflects what is happening right now in the cell.
- Gene - region of DNA that is transcribed to mRNA and then normally to protein.
- Genes are a small part of the human genome. The parts of the genome that are not genes are also NOT "junk DNA". Regions that are not genes often have an impact on gene expression.
- Gene expression - a gene that is actively being transcribed. At any given time in a cell, only some of the genes are active.
- Basic steps going from DNA to protein: pre-mRNA, mRNA, protein. Be able to draw a picture of the basic steps.
- All cells in an organism have the same DNA (except for mutations that have happened during DNA replication). Cells in general do /NOT/ all have the same genes being expressed. Cells in the same tissue will tend to have mostly the same genes being expressed.
- DNA in chromosomes is normally stored in a condensed form (twisted, balled up); this is called chromatin. Only the genes that are accessible are able to be transcribed.
- In eukaryotes, a gene contains introns and exons. Introns are regions that are part of the gene but are spliced out before the gene is translated to protein. Exons are the regions that /are/ translated to protein. In prokaryotes this is different (most genes do not contain introns).
- Human genome is about 3 billion base pairs of DNA. There are around 20,000 genes that code for proteins, with each gene being a median size of about 24,000 letters of DNA, and an average size of about 62,000 letters of DNA. Of the 62,000 letters of DNA, on average about 60,000 are spliced out before translating to protein (these are the introns), and on average about 2,000 are translated to protein (the exons). So about 1/3 of the human genome is contained within a gene region, but only 1-2% of the genome are letters that get translated to proteins.
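The fractions in the last point can be reproduced as a back-of-the-envelope calculation from the rounded numbers given above. This is only a rough sketch: it ignores genes that overlap one another and uses rounded inputs, so the gene-region result comes out somewhat above the "about 1/3" figure while staying in the same range.

```python
GENOME_SIZE = 3_000_000_000  # base pairs in the human genome (approximate)
NUM_GENES = 20_000           # protein-coding genes (approximate)
MEAN_GENE_LENGTH = 62_000    # average letters per gene, introns included
MEAN_CODING_LENGTH = 2_000   # average letters per gene that are translated

gene_fraction = NUM_GENES * MEAN_GENE_LENGTH / GENOME_SIZE
coding_fraction = NUM_GENES * MEAN_CODING_LENGTH / GENOME_SIZE

print(f"within a gene region: {gene_fraction:.0%}")      # 41%
print(f"translated to protein: {coding_fraction:.1%}")   # 1.3%
```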
Sequencing
RNA Sequencing
Video to watch - high throughput RNA seq (StatQuest)
Processes to understand...
- Steps going from a biological experiment all the way through to having RNA seq counts data ready to analyze.
- main steps for RNA-seq - prepare library (isolate RNA, break RNA into fragments, convert into double stranded DNA, add sequencing adapters, PCR amplify, quality control), sequence library, analyze results
- FastQ file - basic structure, typical properties, quality scores.
- Typical size of RNA sequencing - millions of reads for one sequencing run, 50-1000 bp per read.
- Difficulties in sequence alignment - a given read might not align exactly to the reference genome (due to mutations or differences between individuals), and a sequence might align in multiple locations.
- Best practices in sequencing - be as consistent as possible with the samples through the entire process. This includes saving up the sequence libraries until all experiments are complete so the samples can be sequenced in one sequencing run if possible (due to "batch effects" where each sequencing run will have slightly different biases or tendencies for mistakes).
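To make the FastQ structure and quality scores concrete: each read takes four lines (an identifier line starting with @, the sequence, a separator line starting with +, and one quality character per base). The Python sketch below reads that layout and converts quality characters to scores. It assumes the common Phred+33 encoding, where the score is the character's ASCII code minus 33 and a score Q means an error probability of 10^(-Q/10). The function names and the example read are made up for illustration.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, quality_string) for each four-line record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, quality  # drop the leading '@'


def phred_scores(quality, offset=33):
    """Convert quality characters to numeric scores (Phred+33 assumed)."""
    return [ord(character) - offset for character in quality]


def error_probability(score):
    """Probability that a base call with this score is wrong."""
    return 10 ** (-score / 10)


example = io.StringIO("@read1\nGATTACA\n+\nIIIII#5\n")
for name, sequence, quality in read_fastq(example):
    print(name, sequence, phred_scores(quality))
```

In the example, the character I corresponds to a score of 40 (1 error in 10,000) and # corresponds to a score of 2, which would mark a very unreliable base.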
Vocab...
- RNA-seq - RNA sequencing, which is determining the sequences for mRNA that is currently in a sample.
- gene expression - the amount that a gene is actively being transcribed (from DNA to mRNA).
- mutated cell/sample - a cell or sample that has some mutation (usually induced by the researchers in order to see what difference this will make in the organism).
- wild type - the typical, unmodified form of an organism. Experiments using model organisms often have one "control group" of wild type individuals to see what difference the "treatment group" has from the control group.
- chromosome - Structure composed of a very long DNA molecule and associated proteins that carries part (or all) of the hereditary information of an organism. Especially evident in plant and animal cells undergoing mitosis or meiosis, where each chromosome becomes condensed into a compact rodlike structure visible under the light microscope.
- gene - Region of DNA that controls a discrete hereditary characteristic, usually corresponding to a single protein or RNA. This definition includes the entire functional unit, encompassing coding DNA sequences, noncoding regulatory DNA sequences, and introns.
- mRNA transcript - RNA product of DNA transcription (the RNA that was produced by copying a gene from DNA).
- high throughput sequencing - the comprehensive term used to describe technologies that sequence DNA and RNA in a rapid and cost-effective manner.
- sequencing library - a biological sample that is composed of the RNA or DNA that is ready to be sequenced. If sequencing is performed at a later date or offsite, the sequencing library will be made and stored (typically in a -80C freezer) until needed.
- flow cell - an optical cell through which a sample is passed for detection, to be measured or counted by electrometric or optical means.
- fluorescent probes - molecules that absorb light of a specific wavelength and emit light of a different, typically longer, wavelength (a process known as fluorescence), and are used to study biological samples (i.e., will show up in an image or to a device to indicate the presence of whatever the researcher is measuring).
- quality score - a value indicating the confidence the sequencer had that a given nucleotide is correct.
- fastq file - a file format for storing sequence data (RNA, DNA, or peptide) that contains a series of sequences together with their identifiers and quality scores.
- garbage reads - low quality or shorter than expected reads that should be removed before further analysis. If there are too many of these reads, then there may be some issue with the sequencing or sample preparation.
- sequence alignment - performing a string matching to determine where in a genome a given sequence matches, possibly allowing some differences in the sequence (i.e., if there is no exact match).
- read counts per gene - for each gene, count is the number of individual RNA short reads that aligned to the DNA sequence of that gene.
- bulk RNA sequencing - term used to indicate sequencing of a sample of tissue that will generally have at least millions of cells, including some of different cell types.
- single cell RNA-seq - term used to indicate sequencing that attempts to sequence the RNA or DNA in individual cells.
- normalization - a transformation done to read counts so that the counts for different samples can be compared even though the samples may have different total numbers of reads.
- PCA (principal component analysis) - a process that performs a mathematical transformation of a matrix that allows graphing the rows or columns of the matrix in two dimensions, often used to see whether samples cluster together.
- CPM (counts per million) - in a read counts file, indicates a scaling has been done to divide each sample's counts by a scaling factor (the sample's total number of reads, in millions) to account for some samples having a higher total number of reads than others.
- logCPM (log counts per million) - logarithm of the CPM.
- logFC (log fold change) - difference in the logarithm of two values (e.g., difference in logarithm between read counts for a gene between two samples).
- PCR (polymerase chain reaction) - Technique for amplifying specific regions of DNA by the use of sequence-specific primers and multiple cycles of DNA synthesis, each cycle being followed by a brief heat treatment to separate complementary strands. This can also be used to amplify an entire sample of DNA or RNA (by attaching primers to all segments in a sample) and is a standard part of the procedure for sequencing RNA or DNA.
- reference genome - a consensus genome created for a particular organism that is meant to represent a "normal" genome. Note that individual organisms (e.g., two different people) do not have identical genomes, so a reference genome is necessarily an approximation of what the "normal" genome is.
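The CPM, logCPM, and logFC terms can be illustrated with a small worked example in Python. This is a sketch under stated assumptions: the counts are made up, the gene and function names are illustrative, and adding 1 before taking the base-2 logarithm is one simple way to avoid log(0), not necessarily what a given analysis package does.

```python
import math


def counts_per_million(counts):
    """Scale one sample's read counts so they sum to one million."""
    total = sum(counts.values())
    return {gene: count * 1_000_000 / total for gene, count in counts.items()}


def log_cpm(cpm, pseudocount=1.0):
    """Base-2 logarithm of CPM; the pseudocount (an assumption) avoids log(0)."""
    return {gene: math.log2(value + pseudocount) for gene, value in cpm.items()}


def log_fold_change(log_a, log_b):
    """Difference in log values between sample b and sample a, per gene."""
    return {gene: log_b[gene] - log_a[gene] for gene in log_a}


# Made-up counts: the treated sample was sequenced four times as deeply.
control = {"geneA": 100, "geneB": 300, "geneC": 600}
treated = {"geneA": 800, "geneB": 600, "geneC": 2600}

log_control = log_cpm(counts_per_million(control))
log_treated = log_cpm(counts_per_million(treated))
for gene, value in log_fold_change(log_control, log_treated).items():
    print(gene, round(value, 2))
```

The raw count for geneB doubles (300 to 600), but after scaling by total reads its share of the sample has halved, so its logFC is about -1. This is why counts are normalized before samples are compared.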
ChIP Sequencing
Video to watch - ChIP Seq (StatQuest)
Single Cell Sequencing
Video to watch - single cell RNA seq (StatQuest)
ATAC Sequencing
Video to watch - ATAC seq (Active Motif)
Other
- CITE seq
- Flow cytometry
- Western blot
- Northern blot
- Gel electrophoresis
- transcription
- reverse transcription
- cDNA
- polyA tail
- lyse
- cDNA library
- 96 well plate
- aliquot
- mass cytometry
- mass spec
Review
After putting all of that together, I will summarize here what is most important to remember and understand.
Programs to Install
See this video for installing a bunch of these programs on Windows 10 in the Windows Subsystem for Linux (WSL).
On Windows - Enable the Windows Subsystem for Linux (WSL): https://learn.microsoft.com/en-us/windows/wsl/install and then install Ubuntu (from the MS app store, free). This is so we can run programs that are only available on Linux/Mac. After you have Ubuntu installed, start it up, and run the following command to get an updated list of packages: sudo apt-get update. You can then install packages like this: sudo apt-get install emacs. If you download programs and want them to be in your terminal path, edit your .bashrc (or .tcshrc or .zshrc, or whatever rc file your shell uses) and add a line that sets PATH to include the directory where you installed the new files. If you have an installation file to download, you can do this: wget https://some_link. You can then extract it and put it where you want.
Compression - if your OS does not unzip certain zip files (e.g., .gz and .tar, which are not natively supported in some Windows versions), install 7-Zip. MacOS and Linux natively support most compression formats that we will need.
R - first install R, and then install RStudio Desktop (free) (the IDE I normally use for working in R).
MS Office - ISU faculty/staff/students can install MS Office 365 for free. Start by logging into https://portal.office.com with your ISU credentials, and click on the button "Install and more". You can download and install for Windows and Mac OS (not available for Linux). You can also use MS Office programs in the browser on any OS. You should install Excel on your computer so you can look at .csv and .tsv files.
Gitlab at ISU - Login to https://git.indstate.edu so you can be added onto projects there.
SRA Toolkit - needed for downloading sequence files from NCBI. Note for Mac - you will likely need to approve each program within the toolkit the first time you run it (as with many programs not signed by Apple, macOS will not let you run it at first; go to Settings / Security, click the appropriate button to allow the program to run, and then try running it again). Note - you will want to update the PATH in your shell rc file (.zshrc, .tcshrc, etc.) so that it is set automatically when you start a terminal.
Short reads sequence aligner - hisat2. This is only available for Linux and Mac, so if using Windows then it will be run under WSL.
Adapter trimming - cutadapt - for removing adapters from RNA/DNA sequence files.
Quality check - FastQC - for checking quality of FastQ files. Note that you need to have a Java runtime environment installed for this to work.
SAM/BAM files - samtools - for dealing with SAM and BAM sequence files.
Reads counting - htseq-count - for taking an alignment of a reads file and producing a counts matrix.
Downloading metadata - Entrez Direct - can be used for pulling information from Entrez databases.
Data Files
This section contains links to sample data files. See the following video walking through some of these data files - video looking at GSE85331 gene expression and RNA-seq files.
Gene expression (from RNA-seq) - GSE85331 gene expression file, which comes from GSE85331 at NCBI Gene Expression Omnibus (GEO).
FastQ - C15_0_1 sample from GSE85331. This is a link to the page that hosts the reads data. A FastQ file can be downloaded using the SRA Toolkit.
Genome FastA, Genome GFF - hg19 version of the human genome (which was used in the GSE85331 study as the reference genome), available from NCBI or from UCSC (which is what GSE85331 said they used).
Practice
Practice with some of the data files...
Practice with GSE85331 gene expression file in Excel
Start with the GSE85331 gene expression file. You will be adding columns and rows to the file, so save it as an xlsx file so that formulas and such are saved properly. Before doing anything else, create a new sheet called "log2" and fill it with the expression values transformed by the formula log2(value+1). Then add columns at the end for the max, min, average, median, max-min, median of day0, median of CM, and day0-CM, each based on that row's expression values. Also add rows at the end for the max, min, average, median, and max-min of each column. This should give you enough to answer the following questions. For some of these you will sort based on one of the columns or rows. See the following video walking you through this process - video walkthrough.
- How many genes are in this file? How many samples?
- Which sample has the highest expression for gene TNNT2? What is that expression value? For this question and all the rest, you should be using the log2 expression values created as instructed above.
- What is the highest expression value in the entire file - which gene, which sample, and what is the expression value?
- Which gene has the highest median expression value? What is that value?
- Which sample has the highest median expression value, and what is that value? Which sample has the lowest median expression value, and what is that value?
- List the top 10 genes in terms of the median difference between the CM (day 30) values and the day 0 values. Give the genes and the difference. Note - you will take the median of 8 columns (the CM ones) and subtract the median of 8 other columns (the day 0 ones).
- For the following genes, describe their expression profile (at what time point are they highest, where lowest, etc.): POU5F1, T, GATA4, TBX2, TBX5, TNNT2.
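The spreadsheet calculations can be cross-checked in code. The following is a minimal Python sketch, not part of the course materials, that applies log2(value+1) and computes per-gene summaries, including the CM minus day 0 median difference. The gene names come from the questions above, but the expression values are made up, and selecting columns by the substrings `_day0_` and `_CM_` is an assumption based on sample names such as H1_day0_0 and H1_CM_0.

```python
import math
from statistics import median

# Hypothetical toy data: gene -> {sample: raw expression value}.
# The numbers are invented; only the naming follows the GSE85331 file.
raw = {
    "TNNT2": {"H1_day0_0": 0.0, "H1_day0_1": 1.0, "H1_CM_0": 255.0, "H1_CM_1": 511.0},
    "POU5F1": {"H1_day0_0": 127.0, "H1_day0_1": 63.0, "H1_CM_0": 1.0, "H1_CM_1": 0.0},
}


def log2_plus_one(value):
    """Same transform as the spreadsheet formula log2(value+1)."""
    return math.log2(value + 1)


def summarize(expression):
    """Return per-gene summary statistics computed on log2(value+1) values."""
    summary = {}
    for gene, samples in expression.items():
        logged = {name: log2_plus_one(v) for name, v in samples.items()}
        day0 = [v for name, v in logged.items() if "_day0_" in name]
        cm = [v for name, v in logged.items() if "_CM_" in name]
        values = list(logged.values())
        summary[gene] = {
            "max": max(values),
            "min": min(values),
            "median": median(values),
            "median_day0": median(day0),
            "median_CM": median(cm),
            "CM_minus_day0": median(cm) - median(day0),
        }
    return summary


summary = summarize(raw)

# Rank genes by the CM minus day 0 difference, largest first.
ranked = sorted(summary, key=lambda g: summary[g]["CM_minus_day0"], reverse=True)
for gene in ranked:
    print(gene, summary[gene]["CM_minus_day0"])
```

With the real file, the `raw` dictionary would be filled from the tsv instead, and the ranking would be cut to the top 10 genes.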
R Programming
First get R installed (see the programs to install section above). Next...
- watch this video introducing R.
- Look through Jeff's intro to R slides.
- Look through R Getting Started.
- Pick another R tutorial or resource to start looking through (see the suggestions in R Getting Started, or ask the internet).
- Take R files from class, download them, try them out yourself. Files... gse85331_a.R and video developing that file. gse85331_b.R and video developing that file. GSE244362_lab3.R and the video explaining what you are supposed to do in that lab.
- Take the R programming quiz until you can get 100%.
- Get started on the first R lab.
- Start keeping your own personal cheat sheet for how to do different things in R.
- R Markdown - read this introduction, see a cheat sheet like this one, and google search for any other formatting things you would want, or different behaviors for the code sections. Use this Rmd file as a starting point, and make the changes described in this video.
Statistics
Viewing lists - StatQuest Statistics Fundamentals, StatQuest Statistics and Machine Learning in R, StatQuest High Throughput Sequencing
High Throughput Sequencing workflow
Viewing list -
- Generation of genomic data (Hubbard Center for Genomic Studies) gives background on DNA/RNA sequencing.
- StatQuest High Throughput Sequencing - watch the videos in this playlist as needed. Start with the first.
Basic steps in workflow of going from RNA seq files to produce counts files.
0. Get input files. We use SRA, and take the first 1000 reads from SRR4011874 using fastq-dump from the SRA toolkit.
1. Trim adapters. We use cutadapt.
2. Check for quality (could do that before trimming adapters as well). We use FastQC.
3. Align to reference genome. We use hisat2.
4. Convert files as needed by whatever comes next. For us, we convert SAM to BAM, sort BAM, create an index for BAM. We use samtools.
5. Take alignment results and count how many reads per gene. We use htseq-count.
6. Do the above for each sample in the dataset, and combine the individual counts files into a single matrix file (i.e., csv).
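To make the order of the tools concrete, here is a hedged Python sketch that only builds and prints the command lines for steps 0 through 5; it does not run anything. It is not the course's script. The index prefix, annotation file name, and adapter sequence are hypothetical placeholders, and the sketch assumes single-end reads written to a single `<run>.fastq` file.

```python
import shlex


def build_pipeline_commands(run_id, hisat2_index, annotation_gtf, adapter, max_reads=1000):
    """Return the command lines for steps 0-5 as lists of arguments."""
    fastq = f"{run_id}.fastq"
    trimmed = f"{run_id}.trimmed.fastq"
    sam = f"{run_id}.sam"
    bam = f"{run_id}.bam"
    sorted_bam = f"{run_id}.sorted.bam"
    return [
        # 0. Download only the first max_reads spots of the run from SRA.
        ["fastq-dump", "-X", str(max_reads), run_id],
        # 1. Trim a 3' adapter from every read.
        ["cutadapt", "-a", adapter, "-o", trimmed, fastq],
        # 2. Produce a quality report for the trimmed reads.
        ["fastqc", trimmed],
        # 3. Align the (assumed single-end) reads to the reference index.
        ["hisat2", "-x", hisat2_index, "-U", trimmed, "-S", sam],
        # 4. Convert SAM to BAM, sort by position, and index.
        ["samtools", "view", "-b", "-o", bam, sam],
        ["samtools", "sort", "-o", sorted_bam, bam],
        ["samtools", "index", sorted_bam],
        # 5. Count reads per gene; htseq-count prints the counts to stdout.
        ["htseq-count", "-f", "bam", "-r", "pos", sorted_bam, annotation_gtf],
    ]


commands = build_pipeline_commands(
    run_id="SRR4011874",
    hisat2_index="hg19_index",        # hypothetical index prefix
    annotation_gtf="hg19_genes.gtf",  # hypothetical annotation file
    adapter="AGATCGGAAGAGC",          # placeholder; use the adapter for your library prep
)
for argv in commands:
    print(shlex.join(argv))
```

To actually execute the steps, each list could be passed to `subprocess.run(argv, check=True)`, with the output of htseq-count redirected to a counts file. Whether htseq-count's default strandedness setting is right depends on the library, so check that before trusting the counts.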
A shell script that has all of these steps for our test dataset is rnaseq_pipeline.sh
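Combining the per-sample counts files into one matrix (step 6) can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the method used in the course script: it assumes the two-column, tab-separated output of htseq-count, skips the summary rows whose names start with `__` (such as `__no_feature`), and uses made-up sample names and counts.

```python
import csv
import io


def read_counts(handle):
    """Parse htseq-count output: one 'gene<TAB>count' line per gene."""
    counts = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        gene, count = line.split("\t")
        if gene.startswith("__"):
            # Summary rows such as __no_feature are not genes.
            continue
        counts[gene] = int(count)
    return counts


def combine_counts(per_sample):
    """Build rows of a gene x sample matrix; a gene missing from a sample gets 0."""
    samples = list(per_sample)
    genes = sorted(set().union(*(c.keys() for c in per_sample.values())))
    rows = [["gene"] + samples]
    for gene in genes:
        rows.append([gene] + [per_sample[s].get(gene, 0) for s in samples])
    return rows


# Hypothetical counts for two samples; real input would be opened from files.
sample_a = io.StringIO("GATA4\t3\nTNNT2\t10\n__no_feature\t5\n")
sample_b = io.StringIO("GATA4\t0\nTNNT2\t7\n__no_feature\t2\n")
per_sample = {"sample_a": read_counts(sample_a), "sample_b": read_counts(sample_b)}

rows = combine_counts(per_sample)
out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(rows)
print(out.getvalue())
```

Writing to a real csv file instead of `io.StringIO` only requires replacing `out` with a file opened using `open(path, "w", newline="")`.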
Genome Assembly
Viewing list
- Generation of genomic data (Hubbard Center for Genomic Studies) gives background on DNA/RNA sequencing.
- Sequencing and assembling a genome (Hubbard Center for Genomic Studies) provides a nice overview of the steps and challenges.
- And then we'll have a video(s) working through doing all the steps of genome assembly.
OLD
Things here will be moved into other sections as we need them.
Reading
Potentially good things to read / tutorials, etc. ...
- R: R Programming - Getting Started - programs to install, reading, etc.
- Other courses like this one - Introduction to Computational & Quantitative Biology - Columbia Dept Microbiology & Immunology, Foundations of Bioinformatics - UC San Diego CS (UCSD), Computational Biology - UT Dallas Dept. Biology, Biomedical Data Science - Harvard
In particular, your assigned reading includes...
- From the R Programming Getting Started, start looking through each of the items linked in Reading
- UCSD lecture 17 - Transcriptomics and the analysis of RNA-Seq data
- Up through Figure 1 in Genome-Wide Temporal Profiling of Transcriptome and Open-Chromatin of Early Cardiomyocyte Differentiation Derived From hiPSCs and hESCs
- Columbia - check each of the lectures to see what is basically there, and refer back to it when we get to those parts. These lecture slides are very much at a level that is good for what we are doing.
- SVM slides in Unit 6 from UT Dallas https://personal.utdallas.edu/~prr105020/biol6385/2018/lecture.html
- Dummy Variables in SVM / KNN
- Machine Learning with caret in R
- Decision trees in R (datacamp), Random forests (towards data science)
- Jeff's notes on terms, etc.
Gene Expression
Start by watching the video introduction (16min, watch it at 2x or 1.5x).
We start by getting into this GSE85331 dataset, described in this publication (and see supplementary information for how they processed/analyzed the data).
On your own computer, download the dataset and extract (uncompress) the file (on MacOS or Linux just double click it, on Windows use 7-Zip or something similar).
Spreadsheet
After extracting you can open the file in Excel, Sheets, or LibreOffice. Note that it is a tsv file. If you double click, your OS may not know what program to use to open it. So start your spreadsheet program and then open the file. Some things are not too painful to do in your spreadsheet program. For example, you should verify that the following are all correct...
- Genes with highest H1_day0_0 values: SNORD97, SNHG25, EEF1A1, RPL38, RPS27.
- Genes with highest H1_CM_0 values: H19, MYL7, RPL31, SNORD9, RPS27.
- Number of genes (#rows - 1): 26257
- Median value for H1_day0_0: 0.539942
- Median value for H1_CM_0: 1.246015
- Average value for H1_day0_0: 15.86772859
- Average value for H1_CM_0: 16.4574767
It seems that this dataset might be normalized so that the average values for each column (sample) are similar.
And that is about all we want to do in the spreadsheet right now. You can save it as an xlsx or import into Google Sheets in case we want to do anything else manually with it.
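The same spreadsheet checks (number of genes, median, average, and top genes for a sample) can be reproduced in a few lines of code. The following Python sketch is only an illustration and is not one of the course files. The tiny inline table and its values are made up, and the name of the gene column (`gene`) is an assumption; check the header of the real tsv file before using it.

```python
import csv
import io
from statistics import mean, median


def column_stats(handle, sample):
    """Number of genes, median, and mean for one sample column of a tsv file."""
    reader = csv.DictReader(handle, delimiter="\t")
    values = [float(row[sample]) for row in reader]
    return {"genes": len(values), "median": median(values), "mean": mean(values)}


def top_genes(handle, sample, gene_column, n=5):
    """Names of the n genes with the highest value in the given sample column."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = sorted(reader, key=lambda row: float(row[sample]), reverse=True)
    return [row[gene_column] for row in rows[:n]]


# Hypothetical stand-in for the real file; the values are invented.
toy = (
    "gene\tH1_day0_0\tH1_CM_0\n"
    "EEF1A1\t900.5\t800.0\n"
    "MYL7\t0.5\t950.0\n"
    "TNNT2\t1.0\t700.0\n"
)

stats = column_stats(io.StringIO(toy), "H1_day0_0")
top = top_genes(io.StringIO(toy), "H1_CM_0", gene_column="gene", n=2)
print(stats)
print(top)
```

For the real dataset, replace `io.StringIO(toy)` with the extracted file opened using `open(path, newline="")`.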
R and R Studio
Start by watching the video about gse85331_first_look.R (18min).
First look - Let's see what we can do with the same file in R and R Studio. First install R and R Studio on your computer (see the links above). Then take a first look at the data and confirm the values we got from Excel. You can download the R file here - gse85331_first_look.R and run it to confirm this. See also this video showing the file and explaining it.
Differential expression - From the supplementary information from the publication, differentially expressed genes were found as follows - "Statistical analysis was performed for each cell line individually by pairwise comparisons across time-points and day 0 (control)." So, let's see if we can duplicate that. You can download the R file here - gse85331_diff_exp.R and run it to see one way to do this. See also this video showing the file and explaining it.
Simulated data - This takes the previous analysis and substitutes simulated data, where we know what the answer should be for each gene. You can download the R file here - gse85331_diff_exp_simulated.R and run it yourself. See also this video showing the file and explaining it.
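The idea of testing an analysis on simulated data with a known answer can be illustrated in a few lines. The sketch below is in Python rather than R, and the statistic (Welch's t) is chosen here only for illustration; it is not necessarily what the R files above compute. The gene names, group sizes, means, and the random seed are all made up.

```python
import math
import random
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    se = math.sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_b) - mean(group_a)) / se


rng = random.Random(0)  # fixed seed so the simulation is repeatable


def simulate(center, n=8, sd=1.0):
    """Draw n simulated expression values around a chosen true mean."""
    return [rng.gauss(center, sd) for _ in range(n)]


# Each gene has a (day 0, later time point) pair of simulated groups.
# "null_gene" has no true change; "shifted_gene" has a true shift of +5.
simulated = {
    "null_gene": (simulate(10.0), simulate(10.0)),
    "shifted_gene": (simulate(10.0), simulate(15.0)),
}

t_stats = {gene: welch_t(day0, later) for gene, (day0, later) in simulated.items()}
for gene, t in t_stats.items():
    print(gene, round(t, 2))
```

Because the true answer is known for each simulated gene, a large statistic for `shifted_gene` and a small one for `null_gene` is what a working analysis should show. Turning the statistic into a p-value needs a t distribution, which is left to R or a statistics library.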
Next up - Using R packages specifically made for differential gene expression (DESeq and edgeR) and comparing them with plain old ANOVA.
ShapeFiles
An R file showing how to get started with shape files - shapefiles.R
The data files for those are in the Teams for CS 618 / Papers, References / Files / Shapefiles
Note that a shapefile often has a number of associated files with it (same basename, different file extension). You likely need all of those for things to work properly.
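A quick way to check that the associated files are all present is to list everything that shares the shapefile's base name. The following Python sketch is an illustration only and is separate from shapefiles.R. It assumes the usual convention that the .shp, .shx, and .dbf files are the required parts of a shapefile, and the file names in the demonstration (`counties`, `other`) are hypothetical.

```python
import tempfile
from pathlib import Path


def companion_files(shp_path):
    """List every file in the same directory that shares the shapefile's base name."""
    shp = Path(shp_path)
    return sorted(p for p in shp.parent.iterdir() if p.is_file() and p.stem == shp.stem)


def missing_required(shp_path, required=(".shp", ".shx", ".dbf")):
    """Return the required extensions that have no matching file."""
    present = {p.suffix.lower() for p in companion_files(shp_path)}
    return [ext for ext in required if ext not in present]


# Demonstration in a temporary folder with empty placeholder files.
with tempfile.TemporaryDirectory() as folder:
    for name in ("counties.shp", "counties.dbf", "counties.prj", "other.shp"):
        (Path(folder) / name).touch()
    shp = Path(folder) / "counties.shp"
    found = [p.name for p in companion_files(shp)]
    missing = missing_required(shp)

print(found)
print(missing)
```

In the demonstration the .shx file was deliberately left out, so it is the one reported as missing.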
Miscellaneous
Things that will be put somewhere else eventually, but for now will be here.
Setup for using gitlab.indstate.edu from the CS server, Linux, or Mac
- Go to https://gitlab.indstate.edu and log in with your ISU credentials.
- Run the following.
 mkdir ~/.ssh # unless it exists already
 cd ~/.ssh
 ssh-keygen -t ecdsa -f filename_for_your_ssh_key # pick whatever you want for the filename, press enter when prompted for passphrase
- Edit your ~/.ssh/config file (creating it if it doesn't exist already), and include the following.
 Host gitlab.indstate.edu
   Hostname gitlab.indstate.edu
   # use whatever filename you picked for your ssh key
   IdentityFile ~/.ssh/filename_for_your_ssh_key
- Take the public key file (filename_for_your_ssh_key.pub, or whatever you used for the filename with .pub on the end) and add it to your profile on gitlab.indstate.edu. Log in to gitlab.indstate.edu, click the User icon in the top right / Edit Profile / SSH Keys. Copy/paste the contents of the public key (.pub file) into the text field, leave “expires at” empty, set a title, and click Add key. Now when you use git commands on the system that you set up like this, it should properly authenticate to gitlab.indstate.edu.