Difference between revisions of "Bioinformatics"

* R - first install [https://cloud.r-project.org/ R] and then install [https://www.rstudio.com/products/rstudio/ RStudio Desktop (free)].
* Teams - ISU faculty/staff/students can install MS Teams for free along with other parts of Office 365.  Start by logging into https://portal.office.com with your ISU credentials, click around looking for Teams (may have to click on "All Apps" or something like that), and once you get to Teams look for a link to download the desktop application (for Windows and Mac OS, not available for Linux).  You can also use Teams in the browser.
* Log in to https://gitlab.indstate.edu so you can be added onto projects there.
  
 
=Reading=
Potentially good things to read / tutorials, etc. ...
* R: [[R Programming - Getting Started]] - programs to install, reading, etc.
* Other courses like this one - [https://microbiology.columbia.edu/icqb Introduction to Computational & Quantitative Biology - Columbia Dept Microbiology & Immunology], [https://bioboot.github.io/bggn213_f17/lectures/ Foundations of Bioinformatics - UC San Diego CS (UCSD)], [https://personal.utdallas.edu/~prr105020/biol6385/2018/lecture.html Computational Biology - UT Dallas Dept. Biology], [https://genomicsclass.github.io/book/ Biomedical Data Science - Harvard]

In particular, your assigned reading includes...
* From the R Programming Getting Started page, start looking through each of the items linked in [https://cs.indstate.edu/wiki/index.php/R_Programming_-_Getting_Started#Reading Reading]
* [https://bioboot.github.io/bggn213_f17/lectures/#17 UCSD lecture 17 - Transcriptomics and the analysis of RNA-Seq data]
* Up through Figure 1 in [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5576565/ Genome-Wide Temporal Profiling of Transcriptome and Open-Chromatin of Early Cardiomyocyte Differentiation Derived From hiPSCs and hESCs]
* [https://microbiology.columbia.edu/icqb Columbia] - skim each of the lectures to see what is there, and refer back to them when we get to those parts.  These lecture slides are at a good level for what we are doing.
* SVM slides in Unit 6 from UT Dallas https://personal.utdallas.edu/~prr105020/biol6385/2018/lecture.html
* [https://stats.libretexts.org/Bookshelves/Computing_and_Modeling/RTG%3A_Classification_Methods/4%3A_Numerical_Experiments_and_Real_Data_Analysis/Preprocessing_of_categorical_predictors_in_SVM%2C_KNN_and_KDC_(contributed_by_Xi_Cheng) Dummy Variables in SVM / KNN]
* [http://topepo.github.io/caret Machine Learning with caret in R]
* [https://www.datacamp.com/community/tutorials/decision-trees-R Decision trees in R (datacamp)], [https://towardsdatascience.com/understanding-random-forest-58381e0602d2#:~:text=The%20random%20forest%20is%20a,that%20of%20any%20individual%20tree. Random forests (towards data science)]
* [https://docs.google.com/document/d/1Fe-w4GNSq7-2nPfZ2QF3byD36hGctZppQY-XIZWKaXs/edit# Jeff's notes on terms, etc.]
  
 
=Gene Expression=
Start by watching the [https://indstate-edu.zoom.us/rec/share/uhl2WGCOd5FmJPKagZBCB5GEIlufLVGKgTtq5W8r40eRF5mPw8O5az5-Z1NZEgWV.Bey1Q2Kfb2wHh8NL video introduction] (16min, watch it at 2x or 1.5x).

We start by getting into this [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE85331 GSE85331 dataset], described in [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5576565/ this publication] (and see [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5576565/bin/NIHMS889282-supplement-Online_Data_Supplement.pdf supplementary information] for how they processed/analyzed the data).

On your own computer, download [https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE85331&format=file&file=GSE85331%5Fall%2Egene%2EFPKM%2Eoutput%2Ereplicates%2Etxt%2Egz the dataset] and extract (uncompress) the file (on MacOS or Linux just double click it, on Windows use 7-Zip or something similar).
  
 
==R and R Studio==
Start by watching the [https://indstate-edu.zoom.us/rec/share/jatjli-YpV3d4tM6HPtgRPMq59dGmyTdeeTafTV41aERUd6V0uMT2jw3F3zj68Y6.sjpykKNTn4KkK2N0 video about gse85331_first_look.R] (18min).

'''First Look''' Let's see what we can do with the same file in R and R Studio.  First you should install R and R Studio on your computer (see links above).  Let's take a first look at the data and confirm the values we got from Excel. You can download the R file here - [https://cs.indstate.edu/~jkinne/cs618-s2022/code/FILES/gse85331_first_look.R gse85331_first_look.R] and run it to confirm this.  See also [https://indstate-edu.zoom.us/rec/share/jatjli-YpV3d4tM6HPtgRPMq59dGmyTdeeTafTV41aERUd6V0uMT2jw3F3zj68Y6.sjpykKNTn4KkK2N0 this video] showing the file and explaining it.

'''Differential expression''' According to the publication's supplementary information, differentially expressed genes were found as follows - "Statistical analysis was performed for each cell line individually by pairwise comparisons across time-points and day 0 (control)."  So, let's see if we can duplicate that.  You can download the R file here - [https://cs.indstate.edu/~jkinne/cs618-s2022/code/FILES/gse85331_diff_exp.R gse85331_diff_exp.R] and run it to see one way to do this.  See also [https://indstate-edu.zoom.us/rec/share/pH4GRsBcimQreLtjsMsuRD9gblf6twUKKRb7yWraEVRMdWzOUIIH5dQXrmLI9aB6.lMvQHPD0Zoe1usGL this video] showing the file and explaining it.

'''Simulated data''' We take the previous analysis and feed it simulated data, where we know what the answer should be for each gene.  You can download the R file here - [https://cs.indstate.edu/~jkinne/cs618-s2022/code/FILES/gse85331_diff_exp_simulated.R gse85331_diff_exp_simulated.R] and run it yourself.  See also [https://indstate-edu.zoom.us/rec/share/lXucpVqYQWLBE4tBaerc-_eY9qvbwgh0aaTwbLh1m1k6Zfe-ybYSLKtivz7Wt8IT.mx8EVD0DorxGdhOK this video] showing the file and explaining it.

'''Next up''' - Using R packages made specifically for differential gene expression (DESeq and edgeR) and comparing the results with plain old ANOVA.

=ShapeFiles=
An R file showing how to get started with shape files - [https://cs.indstate.edu/~jkinne/cs618-s2022/code/FILES/shapefiles.R shapefiles.R]

The data files for those are in the Teams for CS 618 / Papers, References / Files / Shapefiles

Note that a shapefile often has a number of associated files with it (same basename, different file extension).  You likely need all of those for things to work properly.

=Miscellaneous=
Things that will be put somewhere else eventually, but for now will be here.

==Setup for using gitlab.indstate.edu from the CS server, Linux, or Mac==
* Go to https://gitlab.indstate.edu and log in with your ISU credentials.
* Run the following.
<pre>
mkdir -p ~/.ssh # -p means no error if it exists already
cd ~/.ssh
ssh-keygen -t ecdsa -f filename_for_your_ssh_key # pick whatever you want for the filename, press enter when prompted for passphrase
</pre>
* Edit your <code>~/.ssh/config</code> file (creating it if it doesn't exist already), and include the following.
<pre>
Host gitlab.indstate.edu
 Hostname gitlab.indstate.edu
 # use whatever filename you picked for your key
 IdentityFile ~/.ssh/filename_for_your_ssh_key
</pre>
* Take the public key file (filename_for_your_ssh_key.pub, or whatever you used for the filename with .pub on the end) and add it to your profile on gitlab.indstate.edu.  Log in to gitlab.indstate.edu, click the User icon in top right / Edit Profile / SSH Keys.  Copy/paste the contents of the public key (.pub file) into the text field, leave “expires at” empty, set a title to something, click Add key.

Now when you use git commands on the system that you set up like this, it should properly authenticate to gitlab.indstate.edu.

Latest revision as of 15:51, 18 March 2022

Additional Programs to Install

  • Compression - for those using Windows, install 7-Zip. MacOS and Linux natively support most compression formats that we will need.
  • R - first install R and then install RStudio Desktop (free).
  • Teams - ISU faculty/staff/students can install MS Teams for free along with other parts of Office 365. Start by logging into https://portal.office.com with your ISU credentials, click around looking for Teams (may have to click on "All Apps" or something like that), and once you get to Teams look for a link to download the desktop application (for Windows and Mac OS, not available for Linux). You can also use Teams in the browser.
  • Login to https://gitlab.indstate.edu so you can be added onto projects there.

Reading

Potentially good things to read / tutorials, etc. ...

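  • R: R Programming - Getting Started - programs to install, reading, etc.
  • Other courses like this one - Introduction to Computational & Quantitative Biology - Columbia Dept Microbiology & Immunology, Foundations of Bioinformatics - UC San Diego CS (UCSD), Computational Biology - UT Dallas Dept. Biology, Biomedical Data Science - Harvard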

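In particular, your assigned reading includes...

  • From the R Programming Getting Started page, start looking through each of the items linked under Reading
  • UCSD lecture 17 - Transcriptomics and the analysis of RNA-Seq data
  • Up through Figure 1 in Genome-Wide Temporal Profiling of Transcriptome and Open-Chromatin of Early Cardiomyocyte Differentiation Derived From hiPSCs and hESCs
  • Columbia - skim each of the lectures to see what is there, and refer back to them when we get to those parts. These lecture slides are at a good level for what we are doing.
  • SVM slides in Unit 6 from UT Dallas https://personal.utdallas.edu/~prr105020/biol6385/2018/lecture.html
  • Dummy Variables in SVM / KNN
  • Machine Learning with caret in R
  • Decision trees in R (datacamp), Random forests (towards data science)
  • Jeff's notes on terms, etc.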

Gene Expression

Start by watching the video introduction (16min, watch it at 2x or 1.5x).

We start by getting into this GSE85331 dataset, described in this publication (and see supplementary information for how they processed/analyzed the data).

On your own computer, download the dataset and extract (uncompress) the file (on MacOS or Linux just double click it, on Windows use 7-Zip or something similar).

Spreadsheet

After extracting you can open the file in Excel, Sheets, or LibreOffice. Note that it is a tsv file. If you double click, your OS may not know what program to use to open it. So start your spreadsheet program and then open the file. Some things are not too painful to do in your spreadsheet program. For example, you should verify that the following are all correct...

  • Genes with highest H1_day0_0 values: SNORD97, SNHG25, EEF1A1, RPL38, RPS27.
  • Genes with highest H1_CM_0 values: H19, MYL7, RPL31, SNORD9, RPS27.
  • Number of genes (#rows - 1): 26257
  • Median value for H1_day0_0: 0.539942
  • Median value for H1_CM_0: 1.246015
  • Average value for H1_day0_0: 15.86772859
  • Average value for H1_CM_0: 16.4574767

It seems that this dataset might be normalized so that the average values for each column (sample) are similar.
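
The spreadsheet checks above can also be cross-checked from the command line. The following is a minimal sketch, not the course's method. It assumes the extracted file is tab-separated, that its first row is a header of sample names such as H1_day0_0, that its first column holds the gene name, and that it uses Unix line endings; none of this is confirmed on this page. The function names are made up for illustration.

```shell
#!/bin/sh
# Summarize one sample column of a tab-separated expression table.
# Assumptions (not confirmed by the page): row 1 is a header of sample names,
# column 1 holds the gene name, every other cell is numeric, no CRLF endings.

# col_index FILE COLUMN_NAME -> prints the 1-based index of the named column
col_index() {
    head -n 1 "$1" | tr '\t' '\n' | grep -n -x -F -- "$2" | cut -d: -f1
}

# col_summary FILE COLUMN_NAME -> prints the number of data rows and the mean
col_summary() {
    idx=$(col_index "$1" "$2")
    awk -F '\t' -v c="$idx" 'NR > 1 { n++; sum += $c }
        END { if (n > 0) printf "genes=%d mean=%.6f\n", n, sum / n }' "$1"
}

# col_median FILE COLUMN_NAME -> prints the median of the column
col_median() {
    idx=$(col_index "$1" "$2")
    awk -F '\t' -v c="$idx" 'NR > 1 { print $c }' "$1" | sort -g |
        awk '{ a[NR] = $1 }
             END { if (NR % 2) print a[(NR + 1) / 2];
                   else print (a[NR / 2] + a[NR / 2 + 1]) / 2 }'
}

# col_top FILE COLUMN_NAME N -> prints the N gene names with the highest values
col_top() {
    idx=$(col_index "$1" "$2")
    awk -F '\t' -v c="$idx" 'NR > 1 { print $c "\t" $1 }' "$1" |
        sort -t "$(printf '\t')" -k1,1gr | head -n "$3" | cut -f2
}
```

After loading these functions into your shell, something like `col_summary GSE85331_all.gene.FPKM.output.replicates.txt H1_day0_0` and `col_top GSE85331_all.gene.FPKM.output.replicates.txt H1_day0_0 5` should reproduce the values listed above if the assumptions hold. The file name here is taken from the download link with the .gz removed.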

And that is about all we want to do in the spreadsheet right now. You can save it as an xlsx or import into Google Sheets in case we want to do anything else manually with it.

R and R Studio

Start by watching the video about gse85331_first_look.R (18min).

First Look: Let's see what we can do with the same file in R and R Studio. First you should install R and R Studio on your computer (see links above). Let's take a first look at the data and confirm the values we got from Excel. You can download the R file here - gse85331_first_look.R and run it to confirm this. See also this video showing the file and explaining it.

Differential expression: According to the publication's supplementary information, differentially expressed genes were found as follows - "Statistical analysis was performed for each cell line individually by pairwise comparisons across time-points and day 0 (control)." So, let's see if we can duplicate that. You can download the R file here - gse85331_diff_exp.R and run it to see one way to do this. See also this video showing the file and explaining it.

Simulated data: We take the previous analysis and feed it simulated data, where we know what the answer should be for each gene. You can download the R file here - gse85331_diff_exp_simulated.R and run it yourself. See also this video showing the file and explaining it.

Next up: Using R packages made specifically for differential gene expression (DESeq and edgeR) and comparing the results with plain old ANOVA.

ShapeFiles

An R file showing how to get started with shape files - shapefiles.R

The data files for those are in the Teams for CS 618 / Papers, References / Files / Shapefiles

Note that a shapefile often has a number of associated files with it (same basename, different file extension). You likely need all of those for things to work properly.
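
As a quick way to see whether the associated files are all there, here is a small sketch of a check. The function name `check_shapefile` is made up for illustration. It relies on the shapefile format requiring .shp, .shx, and .dbf files with the same basename, and treats .prj (projection information) as optional; the specific files in the Teams folder may include others.

```shell
#!/bin/sh
# check_shapefile BASENAME -> reports missing companion files.
# .shp, .shx and .dbf are the mandatory parts of a shapefile; .prj is
# optional but usually wanted for the coordinate reference system.
# Returns 0 when all mandatory files exist, 1 otherwise.
check_shapefile() {
    missing=0
    for ext in shp shx dbf; do
        if [ ! -f "$1.$ext" ]; then
            echo "missing required file: $1.$ext"
            missing=1
        fi
    done
    [ -f "$1.prj" ] || echo "note: no $1.prj (projection info) found"
    return "$missing"
}
```

Call it with the path minus the extension, for example `check_shapefile data/counties` (a hypothetical path).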

Miscellaneous

Things that will be put somewhere else eventually, but for now will be here.

Setup for using gitlab.indstate.edu from the CS server, Linux, or Mac

  • Go to https://gitlab.indstate.edu and log in with your ISU credentials.
  • Run the following.
  mkdir -p ~/.ssh # -p means no error if it exists already
  cd ~/.ssh
  ssh-keygen -t ecdsa -f filename_for_your_ssh_key # pick whatever you want for the filename, press enter when prompted for passphrase
  
  • Edit your ~/.ssh/config file (creating it if it doesn't exist already), and include the following.
  Host gitlab.indstate.edu
   Hostname gitlab.indstate.edu
   # use whatever filename you picked for your key
   IdentityFile ~/.ssh/filename_for_your_ssh_key
  
  • Take the public key file (filename_for_your_ssh_key.pub, or whatever you used for the filename with .pub on the end) and add it to your profile on gitlab.indstate.edu. Log in to gitlab.indstate.edu, click the User icon in top right / Edit Profile / SSH Keys. Copy/paste the contents of the public key (.pub file) into the text field, leave “expires at” empty, set a title to something, click Add key.

Now when you use git commands on the system that you set up like this, it should properly authenticate to gitlab.indstate.edu.
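
If authentication still fails, one thing worth ruling out is file permissions, since ssh refuses private keys and config files that other users can read or modify. The following is a sketch of a helper that tightens them; the function name `tighten_ssh_perms` is made up for illustration, and whether you need it depends on how the files were created.

```shell
#!/bin/sh
# tighten_ssh_perms DIR KEYNAME
# Restrict permissions on the ssh directory, the private key, and the
# config file (if there is one) so that only the owner can access them.
tighten_ssh_perms() {
    chmod 700 "$1" || return 1
    chmod 600 "$1/$2" || return 1
    # config may not exist yet; only adjust it when present
    if [ -f "$1/config" ]; then
        chmod 600 "$1/config" || return 1
    fi
}
```

For the setup above this would be `tighten_ssh_perms ~/.ssh filename_for_your_ssh_key`. Afterwards, `ssh -T git@gitlab.indstate.edu` is a way to test the connection without running git, assuming the server uses GitLab's default `git` SSH user.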