CPSC471/F05 Assignment 1

Due by 5PM on Thursday October 6.

Marking Scheme: Part I (50 marks); Part II (50 marks). Total marks = 100.

Your solutions will be judged on correctness, clarity of exposition and ingenuity.

Part I (50 marks)

To be completed and submitted individually.

For this exercise you will have to model an application using the E/R approach and produce a relational schema for this application. Your relational model should be described using the SQL DDL. Choose one application domain from the list below to work on, and following these guidelines:

This is the list of applications you can choose from:

Part II (50 marks)

To be completed individually or in pairs. If you work with another student, make sure you submit only one report and that both names and student numbers appear in the cover.

The goal of this exercise is to illustrate the need for powerful and easy to use data management tools, even when dealing with toy-sized datasets. You will have to solve several problems using a small dataset about movies as your only source of information.

This dataset is represented in a plain text file. All data about each movie are given in consecutive lines; blank lines in the file are used to separate movies. Each non-blank line in the file contains a head, in captital letters, that indicates what data is described in the line, followed by a colon, and the data. If the data consists of multiple items (e.g., the cast of the movie), commas are used for separating them.

Here is an example of how a movie is described in this file:

TITLE:Hanky Panky
YEAR:1982
COUNTRY:USA
VIEWER RATING:4.8 (98 Votes)
RUNTIMES:110min (in USA),
CERTIFICATIONS:PG (in USA), 15 (in Sweden),
GENRES:Comedy,
KEYWORDS:
DIRECTOR:Sidney Poitier,
CAST:Gene Wilder, Gilda Radner, Kathleen Quinlan, Richard Widmark, Robert Prosky, Josef Sommer, Johnny Sekka, Jay O. Sanders, Sam Gray, Larry Bryggman, Pat Corley, Johnny Brown, Bill Beutel, Nat Habib, James Tolkan,

Important notes:

What to hand in:

To obtain marks in each problem below you are required to provide an explanation of how you obtained the solution. If you decide to write programs/scripts for solving the problems, it suffices to include them (together with brief instructions of how the TAs can reproduce your work on a standard linux or windows teaching machine available in the CPSC department) as the explanation. In this case, you must submit all your programs electronically; you may submit your report for this part of the assignment electronically as a single file (in pdf, ps, or plain text format) if you wish.

If you decide to solve all problems manually, your explanation has to be better than "I did by hand". The point here is that you must demonstrate how hard it was to obtain the solutions. One way of doing that could be to provide the start/end date/time for solving each problem together with all notes you made while working on it.

Problem 1 (2 marks). List the directors of all Canadian movies.

Problem 2 (3 marks). List the titles of all movies whose Swedish versions are longer than their British versions.

Problem 3 (3 marks). List the titles of all comedies that were released after Matilda.

Problem 4 (3 marks). Give the name of the director of the oldest movie.

Problem 5 (5 marks). Find all pairs of movies that have the same director, and list their titles.

Problem 6 (5 marks). List all actors who starred in movies with Dan Aykroyd.

Problem 7 (5 marks). List the titles of all movies that are rated "18" in Germany or "R" in the USA, but are not horror movies.

Problem 8 (7 marks). Enumerate the 5 most common genres among those movies that have been rated more than 100 times by viewers. Show how many movies are in each such genre.

Problem 9 (7 marks). Rank the genres according to the average rating of all movies in each genre. A movie with more than one genre contributes equally to the overall score of each genre.

Problem 10 (10 marks). In this problem you will compare the movie rating standards in different countries. More specifically, our goal is to find what are the odds that a movie receives a rating r1 in country c1 given that it received a rating r2 in country c2. That is, we are interested in the conditional probability P(r1:c1 | r2:c2) (Recall that P(A|B) = P(A and B)/P(B).

For the purposes of this problem, define the probability of a movie being rated r1 in c1 as the fraction of all movies in the dataset that have that rating. Similarly, the probability of a movie being rated r1 in c1 and r2 in c2 is the ratio of movies that have both ratings in the dataset.

For example:

(5 marks) Compute the following probabilities:

(5 marks) A concerned parent living in the UK needs help in translating foreign ratings into her familiar UK ratings. Your job is to find, for every non-UK rating, the UK rating(s) that is(are) most likely to be assigned to an arbitrary foreign movie with that rating.