1. The Project Proposal (3 points) is due at Canvas during week #9 (one submission per team). Also see the following page of this pdf file for possible datasets for your project. Please make sure that all teammates sign up for the same group at Canvas (through “People” and “7406 Project Group”); you need to sign up at Canvas even if you are working alone. Please do not create your own group or change the group names, so that it is easier for the TAs and instructor to manage group grading.

The purpose of the proposal is to get you started, and also to allow the TAs and other students to provide feedback on your project. It shall be 1 to 2 pages. You will need to provide the following information:
(a) Your name(s)
(b) Project description
(c) How and where you obtained the data. For the data set, you can just direct us to a website where we can find it.
(d) Scientific research questions you may want to address, and the corresponding data mining & statistical learning methods

Peer review comments: please provide comments on the proposal (e.g., on the problem formulation, whether the project sounds interesting; on the dataset, whether it can help answer the questions; on the proposed methods; etc.).

Note that all teams will receive full credit (3 points) on the project proposal as long as the team provides all of this information.

2. The final presentation file (10 points) of your team project is due at Canvas during week #13 (either a pptx or pdf version is fine). One submission per team.
(a) There are no specific guidelines for the presentations, and common sense applies, e.g., write all team members’ names somewhere on the first slide, and highlight your problem, data set, main ideas/methods, and conclusions.
(b) There are no official guidelines on how many slides your group may include. As a rough estimate, you or your team should prepare slides so that each team member can present for about 3 minutes with a standard deviation of 0.5 minutes, e.g., a team of 2 students will prepare slides for a presentation of about 6 ± 1 minutes, a team of 3 students for a presentation of about 9 ± 1.5 minutes, etc.
(c) (Optional, not required, no credit): The team is encouraged to submit a recorded video/voice oral presentation (i.e., each student member gives a 3-minute oral presentation). We understand that it might be difficult to produce such a video/voice presentation, and thus this is optional.

Peer review comments: please write constructive comments on the presentation slides, e.g., whether the presentation is easy to understand, whether it is interesting, whether the methodology or main conclusions are reasonable, etc.

The TAs will assign a grade based on their own reading:
• 10 points (=100%, A+) if the presentation file is clear, the project sounds interesting, the conclusion sounds reasonable, etc.
• 9 points (=90%, A) if there are some minor concerns about the presentation file
• 8 points (=80%, B) if there are some major concerns about the presentation file
• 7 points (=70%, C) if the presentation file contains some critical technical errors or is poorly presented
• 6 points (=60%, D) if the presentation file is not understandable or is sloppy
• 0 points if no submission.

3. The final written report (12 points) of your team project is due at Canvas during week #14 (one submission per team). Either a Word or pdf file is fine. See page #6 of the pdf file for some suggestions on the writeup of your report.
(a) In your writeups, we expect clear explanations of the models chosen, hypotheses tested, and findings, analogous to what you would produce for a consulting project.
(b) Mandatory subsection in the final written report: the lessons you learned (you can use any name for this subsection). For the purpose of this class, at the end of the conclusion section of your final report, please add a subsection on the lessons you learned from this project or this course. You can also write any suggestions to the instructor. The instructor/TAs will read this subsection so that we can improve our teaching in the future.

Peer review comments: please feel free to provide comments on whether the team selected and adhered to a logical and readable format for the report; on the appropriate use of whatever data mining technique the team uses; on the appropriateness of the conclusions of the report; and on the readability and understandability of the report where technical material is needed.

The TAs will assign one of the following grades based on their own independent reading:
• 12 points (=100%, A+) if you think this is an outstanding or excellent project, e.g., one that deserves possible publication
• 10.8 points (=90%, A) if you have some minor concerns about the project or report (e.g., on either presentation or technical aspects)
• 9.6 points (=80%, B) if you have some major concerns about the project or report
• 8.4 points (=70%, C) if you think the project contains some critical technical errors or the report is poorly presented
• 7.2 points (=60%, D) if you think the project or report is not understandable
• 0 points if no submission
• The instructor/TA reserves the right to deduct 2 points if the team omits the mandatory subsection on the lessons learned.

As always, if you or your team has a concern about peer grading, please let the instructor/TA know as soon as possible on Piazza: we will double check before making a final decision, although please understand that ultimately the grade on the final written project will be subjective.

4. Peer evaluation form: if there are two or more students on a team, each teammate should also independently submit the completed peer evaluation form at Canvas (with the same due date as the final written report). If you conduct the project by yourself without a team, this is optional and not required. The peer evaluation is meant to discourage free riding, and allows the instructor to adjust an individual student’s score based on the teammates’ peer evaluations if needed. In general, the ideal is for all team members to receive the same grade on the final project. However, individual deductions from the team’s final project grade will be assessed for failing to contribute a fair and significant share to the team’s project, as determined by the teammates’ peer evaluations and the instructor.

Possible Topics of Your Project

The objective of a class project is to help you gain experience with research, and to relate what you learn to real life problems, which may require you to learn new techniques (or develop new methods by yourself). You are expected to present the project findings during the class and submit a summary report at the end of the semester. Below are the two types of possible projects (you only need to choose one of them).

1. Solving a real life data mining problem. A typical report includes problem formulation, data analysis, proposed solutions, and interpretation of results. The data set can be from your own research or the public domain; see the information below. As an example, you can choose to participate in a data mining competition such as the Knowledge Discovery and Data Mining (KDD) Cup; see <http://www.kdd.org/kdd-cup> for past KDD Cups, or <http://www.kdd.org/kdd2017/> for the KDD Cup 2017.
Another example is the “2017 Data Challenge” sponsored by the Government Statistics Section of the American Statistical Association (ASA), which analyzes the Consumer Expenditure Survey (CE) data on the Bureau of Labor Statistics website; see <http://magazine.amstat.org/blog/2017/01/01/data-challenge-on-tap-for-jsm2017> for the announcement and <https://www.bls.gov/cex/pumd.htm> for the datasets.

2. Numerical study of data mining methods using well-known data sets in the literature. Note that when dealing with well-known data sets, your approach needs to be substantially different from the literature, i.e., you should do more than repeat the analysis there. Some examples are
• Compare the performance of competing data mining techniques;
• Ask different questions or investigate new ideas for data mining methods;
• Identify optimal parameters of specific data mining techniques.

Note that the crucial aspect of your project is to analyze some data sets and justify your conclusions, not to use some specific statistical models or methods we discussed in class.

Datasets: You can collect the data by yourself, or use a data set from your own research or the public domain. One way to find online datasets is to use a search engine such as Google. The following are some examples of online datasets (you can use Google or another search engine to find more):
1. http://kdd.ics.uci.edu/ or http://archive.ics.uci.edu/ml/
One example is the KDD Cup 1999 data at http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
More KDD Cup data can be found at http://www.kdd.org/kdd-cup
2. http://www.quandl.com/ (financial and economic time-series datasets)
3. Data sets from some government websites such as <http://www.cdc.gov/surveillancepractice/data.html> or <http://www.ngdc.noaa.gov/stp/satellite/goes/dataaccess.html>.
4. http://lib.stat.cmu.edu/DASL/
5. http://www.kdnuggets.com/datasets/index.html (links to more data repositories)
6. http://www.dmoz.org/Computers/Artificial Intelligence/Machine Learning/Datasets/

To inspire your projects, here are some concrete examples:
• analyze data sets from competitions; see <http://www.kaggle.com/competitions>
• find the traffic or crash patterns near Georgia Tech or your apartment/home by using data from <http://www.dot.ga.gov/DS/Data>
• predict allergy season by using Atlanta pollen count data from <http://www.atlantaallergy.com/PollenCount.aspx>
• derive the relationship between sleep and selected health risk behaviors; see the paper <http://www.cdc.gov/nchs/data/hestat/sleep04-06/sleep04-06.pdf>

To further motivate your projects and encourage you to write up a solid project report, try to imagine that you want to publish your project report as a paper. There are two possible kinds of data mining or statistical learning papers (you only need to choose one).
• Application Papers: apply standard methods to analyze some datasets, thereby answering some important questions in real-world applications such as bioinformatics, economics, finance, banking, healthcare, online advertisements, manufacturing, music, natural disasters, social networks, (bio)surveillance, warehousing, logistics, etc.
• Methodology Papers: develop new methodologies and demonstrate their advantages over the standard methods when analyzing some data sets, say, in the context of temporal data mining, spatial data mining, spatio-temporal data mining, streaming data mining, web or graph mining, etc.

ISyE 7406: Data Mining & Statistical Learning
Yajun Mei (ymei@isye.gatech.edu)

The final written report shall not be longer than 25 pages, and the main body of the report is generally 5 to 12 pages. Only very relevant plots and tables shall be included in the body of the report; the rest should go in the Appendix.

When writing up your summary report, it is useful to ask yourself the following questions: What is the work? Why is it important?
What background is needed? How will the work be presented?

Here is a suggested format for your summary report.
1. Title Page: project title, author(s) (your name, the last three digits of your student ID, and email address), the submission date, and course name/number.
2. Abstract: an informative summary of the whole report (100-300 words).
3. Introduction: includes the problem description and motivation, data mining challenge(s), problem solving strategies, what was learned from the application, and an outline of the report.
4. Problem Statement or Data Sources: cite the data sources, and provide a simple presentation of the data to help readers understand the problem or challenge(s).
5. Proposed Methodology: explain (and justify) your proposed data mining strategies.
6. Analysis and Results: present the key findings from executing the proposed data mining methods. For the benefit of readability, detailed results should be placed in the Appendix. References to the computer software used to implement your proposed data mining methods (even if it is a web page) should be given.
7. Conclusions: draw conclusions from your data mining practice. Unfinished or possible future work could be included (with proper explanation or justification).
∗A Mandatory Subsection of “Lessons we have learned”: at the end of the conclusion section, please add a subsection on the lessons you or your team learned from this project or this course. Please feel free to write any comments/suggestions/remarks, or share your experiences of data mining.
8. Appendix: this section only includes the documents needed to support the presentation in the report. Feel free to divide it into several subsections if necessary. Do NOT dump all computer output here unorganized.
9. Bibliography and Credits.

Parts 3-6 constitute the main body of the paper for your primary audience. Usually, as with the fictional boss in this example, your audience is intelligent but unschooled in data mining or statistics, so these parts should have as little technical material as you can possibly get away with. It is appropriate, and even recommended, to refer the reader to the appendix in part 8 if you need to provide a more technical explanation for something. Part 8 is for your secondary audience (me) and should follow the “story” of parts 4-6 closely enough that it is easy for me to see what technical material backs up the results and discussion.

It is not necessary to number these parts 1-9 or to name them as above. Please feel free to merge some parts or provide more informative section names if it seems natural to do so.

A good online resource for writing reports is http://www.ccp.rpi.edu/. This site has links to writing centers at universities around the country, many of which in turn have pages that describe how to put together different types of reports.