CMSC 10200 Lab — Spring 2006
Lab Assignment 8
Lab is due Sunday at 11:59pm. There is Extra Credit.

A Baseball Database Server: Part II, Batting Statistics

In this lab we turn-up the heat on our Baseball Database and find some real statistics. I have written an HTML page that provides a form which allows a client to design their own query of the major batting statistics: ball2.html. Your task in this lab is to write a JSP which queries the baseball database in MySQL and returns an HTML table of the results of the query. This is a very complicated process which must take client queries on up to 22 fields with numerous constraints for aggregate statistics or individual rows, and create a single SQL query for your ResultTable class.

Your lab will be judged on several features

  1. Throrough testing of the java classes you write, and the JSP response page for the client.
  2. The style of your java code, and thoroughness of your comments. With a program this complex, it is hard to get every detail correct, so it is very important that your java class is simple and clear. In particular, you should not try to write one big constructor!!
  3. Simplicity of your JSP request page. This means minimal java code, nothing more than calling constructors and methods.
You will not be evaluated on the HTML client database webpage. You are encouraged to keep the bare-bones client page I have written for you. Focus on getting the JSP response correct. You may change any of the parameter names I have given, although I have tried to make these simple and clear.

MySQL Server

Everyone will access the same database on MySQL. This database is on a MySQL server running on a local machine danryan, not dbserver, and you will access it as an anonymous user, cs102, with access limited only to making SELECT queries. The commandline is

        mysql cs102 -h danryan -u cs102 -P 34463 -p
and the password is cs102 (easy to remember!!).

SQL SELECT Command

You will need a more robust SQL SELECT command for this weeks lab. This is still not as rich as SQL allows, but it is sufficient for satisfying every client query by a single SQL SELECT query.

  SELECT columns FROM tables
  WHERE where_conditions
  GROUP BY groups
  HAVING have_conditions
  ORDER BY column [ASC | DESC]
  LIMIT offset, count
 
The where_conditions will be extended beyond LIKE operator (for pattern matching) to include comparison operators
	>, >=, <, <=, =, !=
which will be used in conditions having the form
	column compare value
where value can be a string, integer or decimal value. The have_conditions are exactly like the where_conditions, but they may additionally include aggregate operations where column names must occur in where_conditions. The SELECT block, located after the SELECT keyword may contain a list of column names as well as aggregate operations and aliases. For this lab, the only aggregate function you need is SUM. You will need alias specifications, which take the form
	column-or-aggregate AS alias
in the column-list. You MUST use aliases for aggregate functions, especially if there are constraints on these values (in which case the constraints must be placed under the HAVING block) or you are using an aggregate to order the results, in the ORDER block. These are the only places aggregates may occur. In particular, they may not occur in the WHERE block or GROUP block. (These are constraints placed by SQL.)

Here are some things to keep in mind

  1. Aggregates can only appear in the SELECT block, the HAVING block and the ORDER block.
  2. You must use an alias for an aggregate whenever you need to refer to the aggregate in the HAVING or ORDER block.
  3. You may use an alias which is identical to a column name. In this case, whenever the alias appears in the HAVING or ORDER block it refers to the alias, and everywhere else, it refers to the column. For example
    	SELECT first,last,SUM(HR) AS HR, SUM(AB)/SUM(HR) AS ab-per-hr
    	FROM Master,Batting
    	WHERE Master.playerID = Batting.playerID
    	GROUP BY Master.playerID
    	HAVING HR > 500
    	ORDER BY HR ASC
    	LIMIT 0,25;
          
    The HR in the SELECT block aggregate SUM(HR) and SUM(AB)/SUM(HR) refers to the column HR of the Batting table; all other occurrences refer to the alias created to refer to SUM(HR).
  4. All aggregates must appear in the column_list of the SELECT block. The non-aggregate columns of a table, such as the playerID or year, need not appear in the SELECT block to constrain the query.
  5. You will not need more than one column or aggregate in the ORDER field, although SQL permits multiple values.
  6. The SQL default for ordering is ASC, although the queries generated for your ResultTable will always specify this value.
  7. SQL allows one integer in the LIMIT block, although the queries generated for your ResultTable will always produce the offset as well as the count.

Java Query Code

You will need your html package code and ResultTable class code. These will not require modification, if they are properly functioning.

Download the new sql package, and the database package and unarchive it by

        tar xfz sql.tgz
        tar xfz database.tgz
These will produce directories called sql and database.The first contains the new java classes to implement the more robust SELECT SQL command for this week's lab. Your ResultTable does not depend upon the actual details of SelectQuery, so it will work without modification.

I have implemented the SelectQuery as a single class with numerous methods for adding various constraints. One very important modification is that you do not have to add all your columns at the time of construction of a SelectQuery. It will allow you to add columns after its construction, and will store all columns in the order you add them. It does keeps multiple instances of the same column, so it is up to you to add a column once to the column list in SelectQuery. Read the Javadoc documentation for more on the variety of methods provided for you.

Baseball Database

For this lab, you will only need two tables, Master and Batting. (The extra credit adds a selected use of the Fielding table.) The primary key for the Master table is playerID, and this provides the key link which identifies a player with all other player data tables. You will find a description of these tables in MySQL using

     SHOW COLUMNS FROM table;
and a description of the Lahman baseball database (version 5-3). The Batting table is organized with each row corresponding to a single player (determined by the playerID), in a single year and a single team (determined by the teamID, which accesses the Teams table, which you will not need for this lab.) The row contains the major statistics for that player.

You will need to compute some statistics yourself. There is a seasonal and aggregate form. I have provided the seasonal version here, and leave it to you to determine the correct notation for the aggregate form. One (apparent) quirk of MySQL is that it computes using DECIMAL, which provides only two digits. I have not discovered how to produce a third digit appropriate for baseball statistics. All columns refer to columns (unless specifically noted) in the Batting table:

  1. Singles: (H-2B-3B-HR), which I will abbreviate as 1B.
  2. Batting Average: H/AB
  3. On Base Percentage: (H+BB+HBP)/(AB+BB+HBP)
  4. Slugging Percentage: (HR*4 + 3B*3 + 2B*2 + 1B)/AB
  5. age: year - Master.birth

Your query must report the first and last name, in that order, for any row of data. The user optionally requests among twenty other major statistics, for seasonal or aggregate totals. The third reported value is the value on which the order is based, if it is not the default last name. (This is the case, even if the user does not specifically request this statistic reported. The order of all other statistics does not matter for this lab. The SelectQuery class will produce the column list for the SELECT SQL query in the order you add the columns (starting with those you pass in the constructor.) It does not ensure a column is reported only once. It only adds a column through the addColumn() method, so it is up to you to ensure columns are not multiply reported.

The database has been optimized to report queries in under 5 seconds. If you do not recieve a response in this time, there is a good chance there is a problem with your query.

Whenever appropriate, I have use correct field names as values in the form for ball2.html, your baseball database server webpage. For example, the teamID of the Chicago Cubs is "CHN" and for the Chicago White Sox is "CHA". I have also tried to use column names where appropriate, for parameter names, such as with teamID, to identify the users input for the team constraint. I have also tried to ensure that reasonable values would be passed your way in most fields. For the exceptions, such as the year constraint, you may ignore any non-integer input values for given for a year; any stupid values--those not between 1876 and 2005 are due to client stupidity, so they must suffer the consequences!!

I have use "all" at all places where the client does not wish to add any constraint. For statistics, this means that the client does wish your query to report the statistic, but no constraint is placed on the value. I have used "none" to signify that the user does not wish the query to report this value. Remember, that if a column is used to order the data, it must be reported after the first and last names, even if the user does not specifically request it.

Extra Credit

For extra credit you will implement the Position constraint. The table Fielding contains a row for each playerID, each year, each teamID (identification of team played for--an index to the Teams table) played for that year, each pos (fielding position, one of "P","C","1B","2B","3B","SS" or "OF") played on that team, and the fielding statistics. You will need to access the position. For simplicity, you may assume a player played a position for a team in a given year if they played that position for the team in at least one game. For example, Eddie Mathews played 79 games at first base for Houston in 1967 and 24 games at third base for them, so he could be counted as a first baseman or third baseman that year. He also played first base (13 games) and third base (21 games) with Detroit in 1967. So, for the year 1967 Mathews has 4 rows in the Fielding table. He has two rows in the Batting table, one for each team he played for in 1967. There is a field, stint which distinguishes the rows by the team played for, in the order of appearance. Eddie Mathews played for Houston first, so his stint value is 1, and then played for Detroit that same year, so his stint value is 2. This field is common to both Batting and Fielding. (The value of stint does not depend upon the multiple positions played in the Fielding table, so Mathews stint value is 1 for both rows in which he played for Houston in 1967, and is 2 for both rows in which he played for Detroit in 1967.)

Tracking down the players position and correctly computing batting totals introduces special problems, since you must now search on a third table, besides Master and Batting. Here are some things to be aware

  1. Batting and Fielding have several rows in common, so you will have new ambiguities. One example, is with year, but there are several others. You will need to be careful in qualifying these fields.
  2. You need to be careful about simply adding the Fielding table to a query. The following query is intended to produce the homerun leaders
    	SELECT first,last,SUM(HR) AS hr FROM Master,Batting,Fielding
    	WHERE Master.playerID=Batting.playerID
    	GROUP BY Master.playerID ORDER BY hr DESC LIMIT 10;
           
    Without Fielding table the query takes a couple of seconds and returns the correct values. Even though the Fielding table is never directly accessed, MySQL uses the table in producing its query results leading to a very long query with gives badly incorrect values.
  3. Finding the most homeruns by a third baseman in a career has some subtlties created by the explanation of Eddie Mathews year in 1967. Mathews played third base every year of his career (and was one of the great third basemen), so his career homeruns should equal is homeruns in years he played thirdbase. But it is rather subtle to figure-out how write a SQL query which verifies this.

Handin

Your lab is due Sunday at 11:59pm. See handin.


Kenneth Harris