Bioinformatics


Revision as of 12:11, 23 May 2024

Background

For each video listed, vocabulary terms are given that are either explained in the video or that the viewer is assumed to already know.

Biology

Definitions are from NCBI or Wikipedia.

Cell

Video

    • prokaryote - Single-celled microorganism whose cells lack a well-defined, membrane-enclosed nucleus. The procaryotes comprise two of the major domains of living organisms—the Bacteria and the Archaea.
    • eukaryote - Organism composed of one or more cells with a distinct nucleus and cytoplasm. Includes all forms of life except viruses and procaryotes (bacteria and archaea).
    • DNA (deoxyribonucleic acid) - Polynucleotide formed from covalently linked deoxyribonucleotide units. It serves as the store of hereditary information within a cell and the carrier of this information from generation to generation.
    • nucleus - Prominent membrane-bounded organelle in a eucaryotic cell, containing DNA organized into chromosomes.
    • cytoplasm - Contents of a cell that are contained within its plasma membrane but, in the case of eucaryotic cells, outside the nucleus.
    • cell membrane (plasma membrane) - Membrane that surrounds a living cell (all types of cells).
    • cell wall - Mechanically strong extracellular matrix deposited by a cell outside its plasma membrane. It is prominent in most plants, bacteria, algae, and fungi. Not present in most animal cells.
    • vacuole - Very large fluid-filled vesicle found in most plant and fungal cells, typically occupying more than a third of the cell volume.
    • chloroplast - Organelle in green algae and plants that contains chlorophyll and carries out photosynthesis. It is a specialized form of plastid.
    • organelle - Membrane-enclosed compartment in a eucaryotic cell that has a distinct structure, macromolecular composition, and function. Examples are nucleus, mitochondrion, chloroplast, Golgi apparatus.
    • lipid - Organic molecule that is insoluble in water but tends to dissolve in nonpolar organic solvents. A special class, the phospholipids, forms the structural basis of biological membranes.
    • protein - The major macromolecular constituent of cells. A linear polymer of amino acids linked together by peptide bonds in a specific sequence.
    • cytoskeleton - System of protein filaments in the cytoplasm of a eucaryotic cell that gives the cell shape and the capacity for directed movement. Its most abundant components are actin filaments, microtubules, and intermediate filaments.
    • RNA (ribonucleic acid) - Polymer formed from covalently linked ribonucleotide monomers.
    • ribosome - Particle composed of ribosomal RNAs and ribosomal proteins that associates with messenger RNA and catalyzes the synthesis of protein.
    • endoplasmic reticulum (ER) - Labyrinthine membrane-bounded compartment in the cytoplasm of eucaryotic cells, where lipids are synthesized and membrane-bound proteins and secretory proteins are made.
    • rough ER - Endoplasmic reticulum with ribosomes on its cytosolic surface. Involved in the synthesis of secreted and membrane-bound proteins.
    • smooth ER - Region of the endoplasmic reticulum not associated with ribosomes. It is involved in lipid synthesis.
    • vesicle - Small, membrane-bounded, spherical organelle in the cytoplasm of a eucaryotic cell.
    • Golgi apparatus (Golgi complex) - Membrane-bounded organelle in eucaryotic cells in which proteins and lipids transferred from the endoplasmic reticulum are modified and sorted. It is the site of synthesis of many cell wall polysaccharides in plants and extracellular matrix glycosaminoglycans in animal cells.
    • mitochondrion (plural mitochondria) - Membrane-bounded organelle, about the size of a bacterium, that carries out oxidative phosphorylation and produces most of the ATP in eucaryotic cells.
    • symbiosis - Intimate association between two organisms of different species from which both derive a long-term selective advantage.
    • surface area to volume ratio - The physics of a system differs at different surface area (SA) to volume (Vol) ratios (e.g., for a flying insect, flapping its wings in air is more like what moving through water would be for a human). The reason is that the mass of an object is proportional to its volume (which scales as the cube of its length), while its interaction with the environment is through its surface area (which scales as the square of its length). The larger an object of a given shape, the smaller its surface area to volume ratio will be.
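To make the last point concrete, here is a minimal Python sketch, assuming the object is a cube (a simplification chosen only because the formulas are easy; the function name is made up for illustration). A cube with side length s has surface area 6s² and volume s³, so the ratio is 6/s and shrinks as the cube grows.

```python
def cube_sa_to_vol(side: float) -> float:
    """Surface-area-to-volume ratio of a cube with the given side length."""
    surface_area = 6 * side ** 2  # six square faces
    volume = side ** 3
    return surface_area / volume  # algebraically equal to 6 / side


# The ratio falls by a factor of 10 each time the side grows by a factor of 10.
for side in (1, 10, 100):
    print(side, cube_sa_to_vol(side))
```

The same trend holds for other shapes as long as the shape is kept fixed while the size changes.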

Genetics

Crash Course Biology Video

    • allele - One of a set of alternative forms of a gene. In a diploid cell each gene will have two alleles, each occupying the same position (locus) on homologous chromosomes.
    • dominant - In genetics, refers to the member of a pair of alleles that is expressed in the phenotype of the organism while the other allele is not, even though both alleles are present. Opposite of recessive.
    • recessive - In genetics, refers to the member of a pair of alleles that fails to be expressed in the phenotype of the organism when the dominant allele is present. Also refers to the phenotype of an individual that has only the recessive allele.
    • gene - Region of DNA that controls a discrete hereditary characteristic, usually corresponding to a single protein or RNA. This definition includes the entire functional unit, encompassing coding DNA sequences, noncoding regulatory DNA sequences, and introns.
    • epigenetics - The study of heritable traits, or a stable change of cell function, that happen without changes to the DNA sequence.
    • genome - The totality of genetic information belonging to a cell or an organism; in particular, the DNA that carries this information.
    • model organism - A species, such as Drosophila melanogaster (fruit fly) or Escherichia coli (E. coli), that has been studied intensively over a long period and thus serves as a “model” of the biology of a particular type of organism. Other such prominent organisms include: Mus musculus (house mouse), Saccharomyces cerevisiae (baker's yeast), Arabidopsis thaliana (thale cress).
    • methylation - Addition of a methyl group to DNA. Extensive methylation of the cytosine base in CG sequences is used in vertebrates to keep genes in an inactive state.
    • methyl group - Containing methyl (-CH3), a hydrophobic chemical group derived from methane (CH4).
    • genotype - Genetic constitution of an individual cell or organism.
    • phenotype - The observable character of a cell or an organism.

DNA Structure

Crash Course Biology Video

DNA Transcription

Crash Course Biology Video

RNA Translation

Crash Course Biology Video

Gene Expression

Crash Course Biology Video

Genetic Mutations

Crash Course Biology Video

Sequencing

CITE-seq, flow cytometry, Western blot, Northern blot, gel electrophoresis, transcription, reverse transcription, cDNA, polyA tail, lyse, cDNA library, 96-well plate, aliquot, mass cytometry, mass spec

RNA-seq, ChIP-seq, ATAC-seq, sc-seq

Additional Programs to Install

  • Compression - for those using Windows, install 7-Zip. macOS and Linux natively support most compression formats that we will need.
  • R - first install R and then install RStudio Desktop (free).
  • Teams - ISU faculty/staff/students can install MS Teams for free along with other parts of Office 365. Start by logging into https://portal.office.com with your ISU credentials and look for Teams (you may have to click on "All Apps" or something similar). Once you get to Teams, look for a link to download the desktop application (available for Windows and macOS, not for Linux). You can also use Teams in the browser.
  • GitLab - log in to https://gitlab.indstate.edu so you can be added onto projects there.

Reading

Potentially good things to read / tutorials, etc. ...

Foundations of Bioinformatics - UC San Diego CS (UCSC), Computational Biology - UT Dallas Dept. Biology, Biomedical Data Science - Harvard

In particular, your assigned reading includes...

Gene Expression

Start by watching the video introduction (16min, watch it at 2x or 1.5x).

We start by getting into this GSE85331 dataset, described in this publication (and see supplementary information for how they processed/analyzed the data).

On your own computer, download the dataset and extract (uncompress) the file (on macOS or Linux just double-click it; on Windows use 7-Zip or something similar).

Spreadsheet

After extracting you can open the file in Excel, Sheets, or LibreOffice. Note that it is a tsv file. If you double click, your OS may not know what program to use to open it. So start your spreadsheet program and then open the file. Some things are not too painful to do in your spreadsheet program. For example, you should verify that the following are all correct...

  • Genes with highest H1_day0_0 values: SNORD97, SNHG25, EEF1A1, RPL38, RPS27.
  • Genes with highest H1_CM_0 values: H19, MYL7, RPL31, SNORD9, RPS27.
  • Number of genes (#rows - 1): 26257
  • Median value for H1_day0_0: 0.539942
  • Median value for H1_CM_0: 1.246015
  • Average value for H1_day0_0: 15.86772859
  • Average value for H1_CM_0: 16.4574767

It seems that this dataset might be normalized so that the average values for each column (sample) are similar.
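The same checks can also be scripted. Below is a minimal Python sketch (the course itself does this in R, see the next section), assuming the extracted file is tab-separated with a header row and the gene names in the first column. The small inline table, its gene names, and its values are made up for illustration and are not taken from GSE85331; the function name is hypothetical as well.

```python
import csv
import io
import statistics

# Made-up stand-in for the extracted TSV file (not real GSE85331 values).
SAMPLE_TSV = (
    "gene\tH1_day0_0\tH1_CM_0\n"
    "GENE_A\t0.5\t12.0\n"
    "GENE_B\t30.0\t1.0\n"
    "GENE_C\t2.0\t3.0\n"
)


def summarize(tsv_file, column, top_n=2):
    """Return (top genes, number of genes, median, mean) for one sample column."""
    rows = list(csv.DictReader(tsv_file, delimiter="\t"))
    gene_col = list(rows[0].keys())[0]  # assumption: first column holds gene names
    values = [float(row[column]) for row in rows]
    ranked = sorted(rows, key=lambda row: float(row[column]), reverse=True)
    top = [row[gene_col] for row in ranked[:top_n]]
    return top, len(rows), statistics.median(values), statistics.mean(values)


top, n_genes, median, mean = summarize(io.StringIO(SAMPLE_TSV), "H1_day0_0")
print(top, n_genes, median, round(mean, 4))
```

To run it on the real file, pass an opened file instead of the inline table, for example `open(path, newline="")` where `path` is wherever you extracted the dataset, and use `top_n=5` to compare against the lists above.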

And that is about all we want to do in the spreadsheet right now. You can save it as an xlsx or import into Google Sheets in case we want to do anything else manually with it.

R and RStudio

Start by watching the video about gse85331_first_look.R (18min).

First look: Let's see what we can do with the same file in R and RStudio. First install R and RStudio on your computer (see the links above). Then take a first look at the data and confirm the values we got from the spreadsheet. You can download the R file here - gse85331_first_look.R and run it to confirm this. See also this video showing the file and explaining it.

Differential expression: From the supplementary information from the publication, differentially expressed genes were found as follows - "Statistical analysis was performed for each cell line individually by pairwise comparisons across time-points and day 0 (control)." So, let's see if we can duplicate that. You can download the R file here - gse85331_diff_exp.R and run it to see one way to do this. See also this video showing the file and explaining it.

Simulated data: Here we take the previous analysis and put in simulated data where we know what the answer should be for each gene. You can download the R file here - gse85331_diff_exp_simulated.R and run it yourself. See also this video showing the file and explaining it.
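The idea of testing an analysis on simulated data can be sketched in a few lines. The following Python example is an illustration only and is not necessarily what the R files above do: it assumes normally distributed expression values, uses Welch's t statistic as a simple stand-in for the pairwise comparison against day 0, and all function names, parameter values, and the random seed are made up for the example.

```python
import random
import statistics


def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    std_err = (var_a / len(a) + var_b / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / std_err


def simulate(n_genes=200, n_changed=10, n_reps=8, shift=8.0, seed=0):
    """Simulate control and treated replicates; only the first n_changed genes are shifted."""
    rng = random.Random(seed)
    data = []
    for gene in range(n_genes):
        delta = shift if gene < n_changed else 0.0
        control = [rng.gauss(10.0, 1.0) for _ in range(n_reps)]
        treated = [rng.gauss(10.0 + delta, 1.0) for _ in range(n_reps)]
        data.append((gene, control, treated))
    return data


data = simulate()
# Rank genes by |t| (treated vs. control); genes 0..9 are the ones changed on purpose.
ranked = sorted(data, key=lambda d: abs(welch_t(d[2], d[1])), reverse=True)
top10 = {gene for gene, _, _ in ranked[:10]}
recovered = len(top10 & set(range(10)))
print(f"{recovered} of 10 truly changed genes are in the top 10 by |t|")
```

Because we know which genes were changed, we can count how many the analysis recovers. Lowering `shift` or `n_reps` shows how quickly that becomes harder.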

Next up - Using R packages specifically made for differential gene expression (DESeq and edgeR) and comparing with using plain old ANOVA.

ShapeFiles

An R file showing how to get started with shape files - shapefiles.R

The data files for those are in the Teams for CS 618 / Papers, References / Files / Shapefiles

Note that a shapefile often has a number of associated files with it (same basename, different file extension). You likely need all of those for things to work properly.
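A quick way to see whether the companion files are all there is to check for them by extension. Below is a minimal Python sketch, separate from shapefiles.R: it relies on the fact that the shapefile format requires .shp, .shx, and .dbf files with the same basename, while files such as .prj and .cpg are optional. The function name and the `counties` basename are made up for illustration.

```python
from pathlib import Path
import tempfile

REQUIRED = (".shp", ".shx", ".dbf")  # mandatory parts of a shapefile
OPTIONAL = (".prj", ".cpg")          # common optional companions


def check_shapefile(shp_path):
    """Report which same-basename companion files exist next to shp_path."""
    shp_path = Path(shp_path)
    found = {ext: shp_path.with_suffix(ext).exists() for ext in REQUIRED + OPTIONAL}
    missing_required = [ext for ext in REQUIRED if not found[ext]]
    return found, missing_required


# Demo in a temporary directory with a deliberately incomplete shapefile.
with tempfile.TemporaryDirectory() as tmp:
    for ext in (".shp", ".dbf"):  # leave out .shx on purpose
        (Path(tmp) / f"counties{ext}").touch()
    found, missing = check_shapefile(Path(tmp) / "counties.shp")
    print("missing required files:", missing)
```

The check is case-sensitive, so a file named with an upper-case extension would be reported as missing on systems with case-sensitive file names.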

Miscellaneous

Things that will be put somewhere else eventually, but for now will be here.

Setup for using gitlab.indstate.edu from the CS server, Linux, or Mac

  mkdir -p ~/.ssh # -p avoids an error if the directory exists already
  cd ~/.ssh
  ssh-keygen -t ecdsa -f filename_for_your_ssh_key # pick whatever you want for the filename, press enter when prompted for passphrase
  
  • Edit your ~/.ssh/config file (creating it if it doesn't exist already), and include the following.
  Host gitlab.indstate.edu
   Hostname gitlab.indstate.edu
   # use whatever you chose for the key filename
   IdentityFile ~/.ssh/filename_for_your_ssh_key
  
  • Take the public key file (filename_for_your_ssh_key.pub, or whatever you used for the filename with .pub on the end) and add it to your profile on gitlab.indstate.edu. Log in to gitlab.indstate.edu, click the User icon in top right / Edit Profile / SSH Keys. Copy/paste the contents of the public key (.pub file) into the text field, leave “expires at” empty, set a title to something, click Add key.

Now when you use git commands on the system that you set up like this, it should properly authenticate to gitlab.indstate.edu.