CS 457/557 - Database - F2012, Assignments

Assignments

[Note: this webpage last modified Sunday, 09-Dec-2012 14:29:26 EST]

Homework assignments will be posted to this website. Each homework assignment will list the date the assignment is due. Late assignments will not receive any credit; I will grade them just so you know how you did.

Here is a rough idea of my plans for some of the assignments.

Given the task of storing BLANK data and making BLANK queries on it, choose a data structure for storing the data, choose algorithms for inserting/deleting/querying, and estimate the efficiency (time and memory).
Design of a database to do BLANK.
SQL queries to do BLANK.
Compare/contrast performance/usability/etc. of various DBMS's and front-ends. One DBMS/front-end per student.
Problems from text.

Some potential data we might use in our examples/assignments/projects.

University example from textbook.
Web search, for mathcs.indstate.edu or indstate.edu.
Math/CS department: faculty/staff information, courses, majors, minors, students, activities, ...
Scientific data - GIS, astronomical, Mars, SETI, genes/DNA, proteins, ...
Old classics - library (patrons, books, librarians, etc.), company (products, customers, purchases, managers, employees, etc.).
Tracking statistics - processes running on CS, web requests on CS, web traffic across ISU, ... And then doing something useful with those statistics.
Chat room/social networking - users, chat history, etc.
Picture gallery
Data from "real" places (hw2 question 1): Sloan Digital Sky Survey, Life Under Your Feet, Lake Tahoe Census (and other) data, racing lap times, impact database, NBA player stats, SQL tryit, chemical information, foodborne illnesses, severe weather history

My best guess at the "projects" we'll be doing.

First half of semester - something involving SQL. Leading candidate is a search engine for indstate.edu. I would probably give you the data in a text file, and you would design the database and SQL queries to search. Some portion of the grade will be based on how efficient your solution is.
Second half of semester - something dealing with the lower-level details of databases. Leading candidate now is making some modifications to add functionality to either Minibase or SimpleDB.
Other ideas: here.

Homework Assignments

The homework assignments are posted with the in-class code. hw1.txt contains the first homework assignment, hw2.php is the second homework assignment, etc. I will announce in class and/or by email when a new assignment has been posted. Note that if you are logged into the CS server or one of the x** machines, you can copy the in-class code from the directory ~jkinne/public_html/cs457-f2012/code/.

Software Setup

All assignments must compile and run on the CS server and x** machines in room A-015. Each assignment template file has instructions for how the program will be run/evaluated.

Logging into CS: If you want to keep your files on the CS server and login remotely to do your programming, you need an SSH client. A free client for Windows is Putty. A remote sftp client is WinSCP. If you have Linux, Unix, or Mac OS X then ssh and sftp access are already included - just open a terminal and type "ssh username@cs.indstate.edu" or "sftp username@cs.indstate.edu".

Emacs: If you are programming by logging into the CS server, you have some choice about what text editor to use. Those installed on the CS server include: vi, pico, nano, emacs. I use emacs because it does auto-indenting and other things. If you decide to use emacs, you'll need to learn the shortcuts for things. The ones I use most are the following (search online for others).

Save file: Ctrl-x Ctrl-s
Close emacs: Ctrl-x Ctrl-c
Undo: Ctrl-Shift-_
Auto-indent current line: Tab
Re-fill (re-wrap) current paragraph or comment: Alt-q
Search forward in the file: Ctrl-s
Repeat last search: Ctrl-s Ctrl-s
Search reverse in the file: Ctrl-r
Delete character just right of cursor: Ctrl-d
Go to the end of the line: Ctrl-e
Kill from cursor to end of line: Ctrl-k
Yank/paste what was just killed: Ctrl-y
Create new file or open existing one: Ctrl-x Ctrl-f
If more than one file is open, switch between them: Ctrl-x b
Close file but don't close emacs: Ctrl-x k
Switch to "split screen" mode: Ctrl-x 2
Switch back to "normal screen" mode: Ctrl-x 1
Search/replace: Esc x replace-string
Open shell/terminal: Esc x shell
Spell-check (not used much on programming files, but anyway): Esc x ispell

Windows: You can also run programs on Windows if you will work mostly from your personal computer. But when you turn the program in, it must compile and run on the CS and x** systems.

Project

Due date - Dec 14, with possible 1-2 day extension.
Rubric/what's required.
- Database - put stuff in .sql file for creating and importing data.
- User interface - php/html/javascript.
- Documentation - text/word/pdf - describe the project, how you made decisions about the tables, etc., and talk about efficiency, anything you needed to learn that may be of interest to the rest of the class.
- Required features - user interface to query and update, insert.
Create project directory in cs457/handin/, put everything in there.
Grading: total 50 points. 10 points for documentation, 10 for user interface (webpage), 30 points for correctness. Points are deducted if required features are missing, e.g., -5 for not having update or insert.
Come and explain/show it to me sometime next week. At least -5 if you don't.

Final Exam

Exam format: ditto other exams.

Exam topics:

first and second exam topics
query optimization - converting queries into equivalent ones, review equivalence rules
basics of concurrency/serialization - definition of conflict serializable, be able to say whether two schedules are or not.
concurrency control with locking - shared/exclusive locks, two-phase locking (serializable, but could have deadlock).
deadlock detection - wait-for graph, look for cycles.
timestamp-ordering protocol (no locking, just check for problems as reads/writes happen and roll back if needed): conflict serializable, no deadlock, but possible starvation.
note on "per row" versus "per table" - insert and the "phantom effect"... one solution is to not allow updates during read transactions (lock on updating table). another is to do index locking.
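
Both the conflict-serializability test and deadlock detection come down to looking for a cycle in a directed graph. The following is a minimal Python sketch of that idea, not the textbook's exact algorithm. The function names and the schedule encoding (a list of (transaction, operation, item) tuples) are assumptions made for illustration.

```python
from collections import defaultdict

def has_cycle(graph):
    """Return True if a directed graph (dict: node -> set of successors) has a cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited, on the current DFS path, finished
    color = defaultdict(int)

    def visit(node):
        color[node] = GRAY
        for nxt in graph.get(node, ()):
            if color[nxt] == GRAY:        # back edge: we returned to the current path
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in list(graph))

def precedence_graph(schedule):
    """Build the precedence graph of a schedule.

    schedule is a list of (txn, op, item) with op 'R' or 'W' (assumed encoding).
    There is an edge Ti -> Tj when an operation of Ti conflicts with a later
    operation of Tj: same item, different transactions, at least one write.
    """
    graph = defaultdict(set)
    for i, (ti, op_i, item_i) in enumerate(schedule):
        for tj, op_j, item_j in schedule[i + 1:]:
            if ti != tj and item_i == item_j and "W" in (op_i, op_j):
                graph[ti].add(tj)
    return graph

def is_conflict_serializable(schedule):
    """A schedule is conflict serializable exactly when its precedence graph is acyclic."""
    return not has_cycle(precedence_graph(schedule))

# Hypothetical schedules, not taken from the textbook.
interleaved = [("T1", "R", "A"), ("T2", "W", "A"), ("T1", "W", "A")]
serial = [("T1", "R", "A"), ("T1", "W", "A"), ("T2", "R", "A"), ("T2", "W", "A")]

# Hypothetical wait-for graph: T1 waits for T2, T2 for T3, T3 for T1 (a deadlock).
wait_for = {"T1": {"T2"}, "T2": {"T3"}, "T3": {"T1"}}
```

Here is_conflict_serializable(interleaved) is False (edges T1 -> T2 and T2 -> T1 form a cycle), is_conflict_serializable(serial) is True, and has_cycle(wait_for) is True.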

Types of questions.

Give SQL query
What does SQL query do
Calculate
Run the algorithm
Practice exercises to pay attention to in the newer chapters: 12.2, 12.3, 12.4, 12.5, 13.2, 13.4, 13.5, 13.11, 14.4, 14.6, 14.11, 15.2, 15.11, 15.16. Remember that the only way to get anything out of looking at these is to try them on your own, really think about them and come up with an answer, then later check the correct answers on the book's website. Note that for chapter 13 problems, I'll define any of the symbols/greek letters used (e.g., sigma is select).

Second Exam

Exam format: same as first.

Exam topics: Most of what we've done since the first exam. In particular...

File formats for storing data in files. Know these terms and how they are used: file header, free list, slotted-page structure, heap file organization.
Index files - primary versus secondary index, why indices are used, how they relate to the actual data file, dense versus sparse, composite key.
Binary search tree - how operations are performed, cost of each operation.
B+ tree - how all operations are performed, cost of each operation.
Hash index file - how all operations are performed, cost of each operation.
Extendable hashing - how all operations are performed, cost of each operation.
Query processing. For each type of query talked about in the chapter, understand where the formula comes from (that means understanding how the query is being processed, and what the different measures of cost are/mean).
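
As a reminder of how the binary search tree operations work, here is a minimal Python sketch. It follows the convention used in the sample exam below (keys >= a node's key go to the right). The deletion rule, replacing a node that has two children with its in-order successor, is an assumption; other conventions (e.g., in-order predecessor) give a different but equally valid tree. The keys in the example are made up. Every operation walks one root-to-leaf path, so its cost is O(height): O(log n) if the tree happens to be balanced, O(n) in the worst case.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    """Insert key and return the (possibly new) root. Keys >= node.key go right."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root

def delete(root, key):
    """Delete one node holding key and return the new root.

    Assumption: a node with two children is replaced by its in-order
    successor (the leftmost node of its right subtree).
    """
    if root is None:
        return None
    if key < root.key:
        root.left = delete(root.left, key)
    elif key > root.key:
        root.right = delete(root.right, key)
    else:
        if root.left is None:
            return root.right
        if root.right is None:
            return root.left
        succ = root.right
        while succ.left is not None:
            succ = succ.left
        root.key = succ.key
        root.right = delete(root.right, succ.key)
    return root

def inorder(root):
    """Return the keys in sorted order."""
    if root is None:
        return []
    return inorder(root.left) + [root.key] + inorder(root.right)

def height(root):
    """Number of nodes on the longest root-to-leaf path (0 for an empty tree)."""
    if root is None:
        return 0
    return 1 + max(height(root.left), height(root.right))

# Hypothetical keys, including one duplicate.
root = None
for k in [8, 3, 10, 3, 9]:
    root = insert(root, k)
```

After these inserts the second 3 is the right child of the first 3, and inorder(root) gives [3, 3, 8, 9, 10]. Calling delete(root, 8) moves 9 (the in-order successor) into the root.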

Types of questions.

T/F, 1 point each, 3 questions.
Calculate, 2 points each, 2-3 questions.
Run the algorithm, 2 points each, 3 questions.
Query processing plan, 4 points, 1 question.

Sample exam.

T/F. In the worst-case, a lookup into an extendable hash table takes O(1) time.
T/F. Assume there is a B+ index for an attribute in a table that is stored in a file that uses slotted pages. The pointers from the B+ tree leaves into the data file point directly to the part of the page where the record is stored.
T/F. Because having indices can make queries faster, it is best for performance to make sure the database keeps indices on all attributes in all tables.
Calculate. Suppose your hard drive has the following properties: 4KB blocks, 4 ms/seek, 0.001 ms transfer/block. What is the time to do the following:
- Lookup in B+ tree with depth 5 (depth includes root but does not include leaves).
- Insert in B+ tree with depth 5.
- Block nested-loop join with two tables that each have 1 million rows, stored 512 records/block.
Run the algorithm. BST. Draw what a BST (with items >= an item's key going to the right) would look like after:
(i) insert 5, insert 10, insert 7, insert 0, insert 4, insert -1, insert 10, insert 15, insert 9, insert 6,
(ii) delete 10,
(iii) delete 5
Run the algorithm. B+. Consider the B+ tree [figure]. The internal nodes must have at least 2 pointers. The leaves must have at least 3. Draw what the B+ tree looks like after:
(i) insert 20,
(ii) insert 26,
(iii) delete 35.
Run the algorithm - Indexed Nested-Loop Join for the query "SELECT * from student NATURAL JOIN takes". The student table is
```
ID	name	dept_name	tot_cred
00128	Zhang	Comp. Sci.	102
12345	Shankar	Comp. Sci.	32
19991	Brandt	History	80
23121	Chavez	Finance	110
44553	Peltier	Physics	56
45678	Levy	Physics	46
```
The takes table is
```
ID	course_id	sec_id	semester	year	grade
00128	CS-101	1	Fall	2009	A
00128	CS-347	1	Fall	2009	A-
12345	CS-101	1	Fall	2009	C
12345	CS-190	2	Spring	2009	A
98765	CS-315	1	Spring	2010	B
98988	BIO-101	1	Summer	2009	A
```
Assume the data for each table is stored in a file in the order shown, with 3 records per block.
To answer the question, first give what the result of the query will be.
Next, how many disk reads are executed in evaluating the join, and what are they?
Query processing plan. Suppose the following are queries that will regularly be run on the university database:
SELECT * FROM student WHERE name = "__something__"
SELECT * FROM instructor NATURAL JOIN teaches WHERE dept_name = "__something__"

Decide on a storage format for the tables involved and any indices you will keep on any of the attributes. Describe what you have decided.
Decide on how the queries would be evaluated, and describe the basic idea.
Give a formula for how long each query would take in terms of the following variables: ts = time for seek, tb = time to read one block, assume 512 records/block for any files used, S = # records in student table, I = # records in instructor table, T = # records in teaches table, assume 20 instructors in each department, assume students' names are unique.
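
To see where a cost formula in terms of ts and tb comes from, it can help to count seeks and block transfers by simulating the algorithm. The Python sketch below does this for a block nested-loop join in the worst case, assuming memory holds only one block of each relation. Under that assumption the counts come out to 2 * b_outer seeks and b_outer * b_inner + b_outer block transfers. The function names and the example numbers are hypothetical, and the model ignores everything except seeks and transfers.

```python
import math

def blocks_needed(n_records, records_per_block):
    """Number of blocks needed to store n_records."""
    return math.ceil(n_records / records_per_block)

def block_nested_loop_cost(b_outer, b_inner):
    """Count (seeks, transfers) for a block nested-loop join.

    Assumption: memory holds one block of each relation, so the whole
    inner relation is rescanned once per outer block.
    """
    seeks = 0
    transfers = 0
    for _ in range(b_outer):
        seeks += 1             # seek to the next outer block
        transfers += 1         # read that outer block
        seeks += 1             # seek back to the start of the inner relation
        transfers += b_inner   # scan every inner block
    return seeks, transfers

def cost_ms(seeks, transfers, ts, tb):
    """Total time: ts per seek plus tb per block transferred."""
    return seeks * ts + transfers * tb

# Hypothetical sizes: outer relation of 3 blocks, inner relation of 4 blocks.
seeks, transfers = block_nested_loop_cost(3, 4)
```

With these sizes the simulation gives 6 seeks and 15 block transfers, matching 2 * 3 and 3 * 4 + 3. The same counting approach works for the other algorithms in the chapter: decide which blocks are read and in what order, then count.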

First Exam

Exam format: on paper, in class, no computer/calculators. You can have one sheet of paper with anything you want on it (front and back, hand written or printed off the computer).

Exam topics: Most of what we've done so far. In particular...

Basics of relational databases - relations, attribute, columns, rows, operations, keys.
SQL (be able to use these, and be able to say what a given query would do): SELECT, FROM, WHERE, AS, ORDER BY, UNION, INTERSECT, EXCEPT, AVG, MIN, MAX, SUM, COUNT, GROUP BY, HAVING, EXISTS, UNIQUE, SOME, ALL, AND/NOT/OR/</<=/>/>=/=, INSERT, CREATE TABLE, UPDATE, NATURAL JOIN, OUTER JOIN, INNER JOIN, LEFT JOIN, RIGHT JOIN, PRIMARY KEY, REFERENCES, NULL, DEFAULT, ENUM, GRANT, REVOKE, CREATE VIEW, CREATE ROLE, data types (INT, FLOAT, NUM, DATE, TIME, DATETIME, TEXT)
Be able to give pseudocode for implementing a query with loops and say how fast the naive implementation would be given some information about how the data is stored.
Databases - university database from book, nobel and population databases from homeworks.
E-R diagrams - be able to make one, and be able to say what one means.
See review terms, etc. at the back of the chapters in the book.
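
As an example of implementing a query with loops, here is a minimal Python sketch of a naive natural join. Each table is assumed to be stored as a plain list of rows (dicts), with no index. The function name and the sample rows are hypothetical. With n rows in one table and m in the other, the two nested loops compare every pair of rows, so the running time is O(n * m).

```python
def nested_loop_natural_join(left, right):
    """Naive natural join of two tables given as lists of dicts.

    Rows are matched on every column name the two tables share.
    Assumption: all rows of a table have the same columns.
    """
    if not left or not right:
        return []
    shared = [col for col in left[0] if col in right[0]]
    result = []
    for l_row in left:                 # n iterations
        for r_row in right:            # m iterations for each outer row
            if all(l_row[col] == r_row[col] for col in shared):
                result.append({**l_row, **r_row})
    return result

# Hypothetical rows, not the textbook's data.
student = [
    {"ID": "1", "name": "Ann"},
    {"ID": "2", "name": "Bob"},
]
takes = [
    {"ID": "1", "course_id": "CS-101"},
    {"ID": "1", "course_id": "CS-347"},
    {"ID": "3", "course_id": "BIO-101"},
]
joined = nested_loop_natural_join(student, takes)
```

Here joined contains two rows, both for Ann. Bob takes no courses and ID 3 has no matching student, so neither appears in the result.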

Exam, types of questions: For T/F, you need to give a sentence or two explaining why your answer is correct. For "Give SQL query", you are asked to give an SQL query (or a few if more than one is needed) to accomplish a task. For "What does SQL query do", you are given an SQL query and need to describe the result set. For "Analyze", you are asked a question requiring a bit more thought.

T/F, 1 point each, 3 questions.
Give SQL query, 2 points each, 2 questions.
What does SQL query do, 2 points each, 2 questions.
Analyze, 3 points each, 2 questions.

Sample Exam.

T/F. There can be only one KEY in a database table/relation.
T/F. It is not a good idea to create too many database views because storing the extra data would take up valuable disk space.
T/F. It is possible that an inner join and an outer join could return the same result set for a given table.
Give SQL query. Consider the following insurance database.
```
 person(driver_id PK, name, address)
 car(license PK, model, year)
 accident(report_number PK, date, location)
 owns(driver_id PK, license FK)
 participated(report_number FK, license FK, driver_id PK, damage_amount)
```
Give SQL queries to find (1) the total number of people who owned cars that were involved in accidents in each year from 1980 to 2012, (2) the average number of such people per year in the years in that range.
Give SQL query. Give an SQL query to remove all accidents with a damage_amount less than $50. Give an SQL query to find the names of all people involved in accidents with a damage_amount about $10,000.

What does SQL query do. Sample data would be given, and you'd need to explain what it does and give the resulting table.

SELECT model, year, SUM(damage_amount) FROM car, participated
WHERE year > 2000
GROUP BY model, year
HAVING SUM(damage_amount) > 1000
ORDER BY damage_amount DESC

What does SQL query do.

SET @d = '1950-01-01';
DELETE FROM participated WHERE 
report_number IN 
(SELECT report_number FROM accident WHERE date < @d);
DELETE FROM accident WHERE date < @d;

Analyze. For the query in problem 6, give pseudocode for how you would implement this query (e.g., with for loops). Give a big-O estimate of the running time of your pseudocode, letting n = total # rows in all tables. Assume each table is stored in a binary search tree with the key value being the primary key (or first listed foreign key, if there is no primary key for that table).

Analyze. Give an E-R diagram for the following banking database.

branch(branch_name PK, branch_city, assets)
customer(customer_name PK, customer_street, customer_city)
loan(loan_number PK, branch_name, amount)
borrower(customer_name FK, loan_number FK)
account(account_number PK, branch_name, balance)
depositor(customer_name FK, account_number FK)

Note: you should also be able to take an E-R diagram and produce SQL statements to create a schema based off the diagram.
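
For practice going in the schema direction, here is one possible set of CREATE TABLE statements for the banking database above, wrapped in a short Python script that uses the built-in sqlite3 module so it can be run anywhere. This is a sketch under several assumptions that the shorthand schema does not state: the column types, treating branch_name in loan and account as a foreign key into branch, and using both columns together as the primary key of borrower and depositor. The function name make_bank_db is hypothetical.

```python
import sqlite3

# Column types and the extra key constraints are assumptions (see above).
SCHEMA = """
CREATE TABLE branch (
    branch_name TEXT PRIMARY KEY,
    branch_city TEXT,
    assets      REAL
);
CREATE TABLE customer (
    customer_name   TEXT PRIMARY KEY,
    customer_street TEXT,
    customer_city   TEXT
);
CREATE TABLE loan (
    loan_number INTEGER PRIMARY KEY,
    branch_name TEXT REFERENCES branch(branch_name),
    amount      REAL
);
CREATE TABLE borrower (
    customer_name TEXT    REFERENCES customer(customer_name),
    loan_number   INTEGER REFERENCES loan(loan_number),
    PRIMARY KEY (customer_name, loan_number)
);
CREATE TABLE account (
    account_number INTEGER PRIMARY KEY,
    branch_name    TEXT REFERENCES branch(branch_name),
    balance        REAL
);
CREATE TABLE depositor (
    customer_name  TEXT    REFERENCES customer(customer_name),
    account_number INTEGER REFERENCES account(account_number),
    PRIMARY KEY (customer_name, account_number)
);
"""

def make_bank_db():
    """Create an in-memory SQLite database with the banking schema."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is turned on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn

conn = make_bank_db()
conn.execute("INSERT INTO branch VALUES ('Downtown', 'Terre Haute', 1000000.0)")
conn.execute("INSERT INTO customer VALUES ('Smith', 'Main St', 'Terre Haute')")
conn.execute("INSERT INTO loan VALUES (17, 'Downtown', 5000.0)")
conn.execute("INSERT INTO borrower VALUES ('Smith', 17)")
```

With foreign keys enabled, inserting a borrower row that names a customer or loan that does not exist raises sqlite3.IntegrityError. The sample rows are made up. Other DBMSs (e.g., MySQL) use somewhat different type names, so treat the types here as placeholders.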