PhD Preliminary Exam
Preliminary Exam for non-CS PhD students, CS part Computer Science / Programming EES and Biology PhD students at Indiana State University sometimes include a CS faculty member as part of their dissertation committee and include a CS portion of their preliminary exam. The prelim exam is a “breadth” exam, and includes a written part and oral part that is a followup to the written exam. This document details what is expected for the CS part of a non-CS PhD student’s prelim exam. The particular details can vary somewhat for each student. You should discuss with your CS committee member. Exam format, things to keep in mind
The CS written prelim exam normally includes a “take home” and a written portion. For all three portions of the exam (take home, written, oral) the PhD student can use any resource (computer, internet, etc.) and can keep their responses. The responses should be typed. Any source that is used (including code exams) should be cited with a link or full citation. CS body of knowledge The following gives a summary of the basic CS knowledge that PhD students should master for their CS prelim exam. These are the most important and basic parts of CS that may be useful in your research.
Basic programming In whatever language you use (python, matlab, or something else). Some classic programming problems - that don’t have anything to do with your project - you have a program to solve the problem and understand how it works Testing if a number is prime (brute force that takes sqrt(N) time to test if # N is prime, and simple Fermat test [be able to do an example on the board]). Sorting a list of unsorted numbers (an O(n2) method and an O(n log(n)) method) Binary search on a list of sorted numbers (takes O(log n) time) Running time - you can automatically compute the running time of different parts of your program Directories and files - you can take a directory and have a program iterate over all files in the directory (of a certain type, maybe). Example: directory has files named first_last_data.txt, and you open each and print the information. Functions and modules - be able to split your program into functions that do specific parts, to keep each section of code understandable. Main goal - one screen of code is a function. Specific goal - take some part of your programs we looked at where you have the same thing copy/pasted with different names, take that and make into loop. User input - be able to read text from the user and either - Use the input as numbers Use the input as a filename Running time analysis For your project, take some data on the running time for different input lengths, and make a guess at running time for larger. For your project, based on the overall structure, how should the running time grow (quadratic, linear, etc.)? It may be that you say - “8 * (# of buildings) * (running time of arcpy.built-inFcn)” If you know the arcpy function that is most important, you can also look up the running time of that Specific to EES students - for both raster and vector data, basic predictions of running time by size. Example: raster image is 1000 x 1000 and code takes roughly width*height*log(width), then how much time for certain # of images, certain geographic region, etc., or what is the maximum image size that can be handled in a certain amount of CPU time. Multiple CPUs - be able to say how long your processing would take if spilt across multiple machines, and say how machines would be needed to have it finish in a certain amount of time. Specific to bioinformatics student - you might run program/algorithm for different # of reads, different # of fastq files. Maybe # reads / file = 10k, 1m, 10m; # fastq files = 1, 10, 100. Plot running time, memory use, disk space use versus parameters. Another parameter is length of reads. Memory analysis
Similar to running time but for memory. What is the overall memory usage of your program? How does the memory usage grow when you include larger data sets? Running programs and seeing how much memory they need/use.
Disk space analysis Similar to running time but for disk. What is the disk usage of your program - does it create temporary files, if so, how large? How much disk space would you need to scale up your study area to certain levels? Basic data structures and algorithms - optional, depending on the student For these data structures - binary search tree, hash table, unsorted array, sorted array - know the running time of the operations - insert, lookup, delete. Also, draw a picture. BST - height of the tree (O(log n) if tree is close to balanced). Note - map in c++. BST are good for doing ranges (all items between X and Y). Hash table - constant if parameters chosen right, normally. Note - dictionary in python, unordered_map in c++, Unsorted array - linear time for lookup, insert/delete are constant. Note - list in python, vector in c++, Sorted array - O(log n) for lookup, linear time insert/delete Know these algorithms (can describe in one sentence, what is the reason for the running time) - binary search, at least one n2 sorting method (bubble, insertion, selection), at least one faster sorting method (merge, quick, heap, using BST), one algorithm from your research area (could sequence aligning algorithm, or something else, could be from BIO 587) Reading for this - https://www.tutorialspoint.com/data_structures_algorithms/ and/or any introductory material about this - Introduction to Algorithms book, wikipedia, site:.edu insertion sort lecture, etc. Take home portion Your project - we look at it and agree on what you should do before the written portion. Classic programs or data structures & algorithms - look up or develop solutions to the classic basic programming problems mentioned above. You should have code in your programming language that you can run. You should understand and can explain how the code works. I can ask you to make small changes, and you are able to do so. It is fine to lookup code for this, but include a citation to where you got it. Written portion Your project - at least one followup question asking you to modify your program in some way or take data on something about your project Classic program - some small modification of one of the classic programming problems we reviewed for the take home portion Resource usage - a few questions asking you to analyze scenarios in terms of CPU time, memory usage, and disk usage Data structures & algorithms (optional) - some questions about which DS to use for different things, and the cost in running time. Oral portion Your project - follow up, explain (possibly with hints on how I would do it), possibly explain some parts to other committee members Resource usage - follow up, explain, possibly explain parts to other committee members
Past exam questions Take Home and Written Exam 1 - Take Home - your project For a student who had a project using arcgis in python to do time-series analysis.
This is for the python code we have been working on over the past few weeks. Upload your code, documentation, and images to google drive and you can send the link to your full committee and the department office to turn it in.
1a) In ArcGIS, do the following with one of your datasets looking at time points year 1, year 5, year 10. Produce an image that has impervious for each of the three time points as a different color. This allows us to see how the impervious surface region has grown/shrunk over the 10 year period.
1b) Do the same as 1a) but now in python using ArcPy. You should rely on ArcGIS as little as possible. Work with a cropped section of your data so that the images involved are reasonably small (< 20MB).
1c) Measure how long the processing takes in 1a) and 1b). Based on this measured time and the geometry of the cropped image (# pixels, surface area encompassed), make a prediction for how long the processing would take on your full dataset and on the continental United State.
1a processing time - 1b processing time - ArcGIS processing time for whole dataset (estimate) - Python/ArcPy processing time for whole dataset (estimate) - ArcGIS processing time for continental US (estimate) - Python/ArcPy processing time for whole dataset (estimate) -
1d) Measure the amount of memory used for 1a) and 1b). For both, it is sufficient to open Task Manager and pay attention to how much memory it says your program (ArcGIS or python) is using. 2 - Take Home - classic programs and data structures For a student who had been working through Python tutorials and solving exercises in an online Python text for the month or two leading up to the exam.
2a) Follow through the suggested Python tutorials (How to Think Like a Computer Scientist, Python 3 version, and the tutorial from python.org). Keep track of your progress. Save your solutions to the exercises you solve. Make all of this available in a google drive folder.
2b) For the Fermat primality test program you have worked on, turn the test into a function that takes the number n to be tested as a parameter. The function should return True or False.
Part of your primality test currently includes the calculation pow(a, n-1) % n Note that pow(a, n-1, n) should perform the same calculation but is likely much faster (and uses less memory). Determine a value of n so that pow(a, n-1) % n takes around 1 minute. For this value of n, measure the amount of time the calculation takes, and also measure the amount of time pow(a, n-1, n) takes. How much faster is the latter? For the time measurement, use python code (e.g., timeit). 3 - written part - your project 3a) Make the changes to your code that we discussed based on your work for the take home exam. Put all of your python code into a single python file (you currently have it in two different files). Have your python code produce a tiff image. Use the following colors - impervious at all time points (black), pervious at all time points (white), pervious at year 0 and impervious at year 5 (red), pervious at years 0 and 5 and impervious at year 10 (orange).
3b) The cropped image you have selected is for an urban region that is mostly impervious. Select a second cropped area that is at most roughly half impervious (either fully in the countryside/forest/etc., or on the boundary of such a region). In your python code, include a line near the top that will select either the mostly-urban or mostly-not-urban cropped area to process. You should be able to process either cropped area by commenting/uncommenting these two lines.
Place your final code and tiff images in the google drive folder that contains your answers for this exam.
3c) Make additions to your python code to produce two additional images. One should be just for year 1, and should have impervious as black and pervious as white. The other should be just for year 10, and should use the same color scheme. If we look at the two images side by side we can see how things changed in 10 years. 4 - written part - classic programming and data structures 4a) Make the changes to your Fermat primality test that we discussed. Your function should not use a for loop and should not construct a list. a should be selected at random between 2 and n-1.
4b) Use the faster version of your primality test (that is, comment out the line that uses pow(a,n-1)%n and use instead the line that uses pow(a,n-1,n)) to do the following. Pick increasing multiples of 10, and use your prime test to determine how many pseudoprime numbers there are up to that multiple of 10. Note that this requires a for loop from 2 up to the multiple of 10. Find the largest multiple of 10 where your program will still finish within 1 minute. Record in a comment the multiple of 10 and the estimate for the number of primes up to that value.
You can compare your result against the actual values at https://primes.utm.edu/howmany.html Your estimated result will be slightly larger than the correct value.
1 - Take Home - your project For a student who had a project using arcgis in python to do analysis of some data.
We have looked at the running time for your project. You have already identified which parts are taking the most time. Two further questions to study -
1a) Do an analysis of the size of intermediate files and memory as the project runs. Make up a table of the sizes at each step, and explain whether this makes sense. You can use import os and os.path.getsize("filename").
1b) Try to install psutil, don't spend more than 1-2 hours. If that works, also keep track of memory using import psutil and psutil.virtual_memory()
1c) For the parts that take the longest, can you find out why? Search to see what people online, and also ask Steve Aldrich.
2 - Take home - classic introductory programming exercises For a student who had a project using arcgis in python to do analysis of some data.
Pick sorting, binary search, or brute force prime test. Look up an algorithm and python code. Make sure the python code works on your computer. Come and explain the code, algorithm, running time. See notes above for more details on this. 1a Code - take home exam Create a programming file that does the following: This was a student whose research project used satellite images of the US for regular time intervals Ask the user for a date. Choose the image set from your dataset which has the closest date to the date given. Determine the maximum, minimum, and average temperature in the given image set. Output the date of the images and the max, min, and average values. Create a smaller image (roughly 1 MB) that is an image of the United States and has highlighted all locations that have a temperature that is within 1 degree C of the maximum (red) or minimum (blue). If possible, make the creation of this image automatic. 1b Processing time - take home exam This was a student whose research project used satellite images of the US for regular time intervals Measure the amount of processing time that is taken for the processing to determine max/min/average temperatures. Perform the processing time measurement for a full scale image and for an image that is both ½ and ⅛ of the width and height of the original. Use this information to answer the following questions. Give an estimate for how long it would take to determine the max, min, and average temperature for all images in your 13 year dataset. Give a formula for the running time as a function of the number of images N, the width of each image W, and the height of each image H. Suppose you had a dataset with the same number of images and resolution but for the entire earth. What would be the number of pixels of each image? What would be the total size in GB, TB, or PB of the dataset (assuming for a 13 year period and 46 images per year)? How long would it take to process all images? What would be the cost in HDD to store these images? For the data processing you have previously performed on your images, what task required the most processing time? In terms of N, W, and H, what is a reasonable formula for the running time of that task? 2a Maximum Temperature - written exam This was a student whose research project used satellite images of the US for regular time intervals Modify your program from part 1a so that a single image is created that contains information about the maximum temperature for as many images as possible. The single image should have areas highlighted that were within 1 degree C of the maximum for any of the images in the dataset. Locations that were within 1 degree C of the maximum for multiple images should be colored darker. You should include as many images as possible so that the total running time is under 2 minutes. 2b Running Time - written exam Consider the third question from part 1b (processing images for the entire Earth). Suppose you now have access to images with resolution that is 10 times finer (in both width and height), and you have images for 10,000 days. How long would it take to perform the min, max, and average calculations for all images? If you split the work up between X different computers that are similar to the one you have used for processing, how long would it take? If you wanted to obtain the maximum temperature over all images, and you are using X computers to divide up the work, how much communication (in the total number of values that need to be sent between all of the computers) is needed to determine the final result? Why? Given this information, and assuming a computer like yours costs $2,000 per year to maintain, what is a reasonable cost for 100 CPU hours of computation?
Oral Exam This was a student whose research project used satellite images of the US for regular time intervals Show us some images related to what you did for the exam. And explain. What about having max and min on same image? What about an image that shows which pixels are an increasing trend, and which are decreasing? Using the slope of a fit of 13 years for each pixel on some particular day? What about the earth? What about the “within 1 degree” issue? What does it look like if you expand that to 2, 5, or 10 degrees? Is the same on a single date? Does this make sense, or is it sensor error? Talk about the type of modeling you have done for your research? What is the difference between linear and non-linear models? Compare/contrast in terms of running time and accuracy. Talk about under-sampling and over-sampling. Do you think it is more important to have data that is finer-grained with respect to space or with respect to time - for the problems you work on? Urban versus rural - what about of contrasting with scenario with high agricultural use? Experiment design - urban planning Twitter - verify temperature Other uses of GIS/remote sensing techniques - different types of data.
For another student … Explain dependance of running time on # features, spatial extent, spatial resolution. You said the running time does not depend on spatial extent/resolution at all because it is vector data. Is that really true - is it really just handling each building separately? Can do better by breaking computation up into blocks and then combining those? For automatic classification (urban, suburban, agricultural, etc.) - what types of areas are difficult for the tools to get right? How to measure effect of energy efficiency improvements and attempts to reduce light pollution on night time light versus economic activity? What is the optimal size of cities to maximize efficiency and sustainability? Which has a higher impact on greenhouse gas emissions and urban heat effect - roofs being either solar panels, vegetation, or high reflectant surface?
For another student … 1) Your Project - 1a - how to determine a formula for the running time of some operation in R or
python, such as a sort routine, or doing a lookup in a data frame.
1b - tradeoffs between R, C, Python - what are they good for
2) Classic Program - bubble sort - 2a - basic explanation of how bubble sort works, what is the running time roughly, how
can you determine what the largest list is you could use and have it finish in one day
2b - best case and worst case for bubble sort
3) Resource usage - your R program - 3a - explain how to determine a rough formula for the time it takes your program to
finish, as a function of the number of vertices and edges in the graph, and how to estimate the largest graph you can handle
4) Data structures and algorithms - 4a - example of data structure to use to store data where lookup time is linear,
and an example where lookup time is less than linear. Moral of the story?