searching and analysis

The provided sample files are extracted from a larger set of benchmark data used for text analysis. For your reference,
here is the link to the benchmark data: http://www.daviddlewis.com/resources/testcollections/rcv1/. We will call this set
of files your database for the project; it will form the source for testing your program. Examine the contents of the
files carefully to get familiar with them.
The purpose of this project is to develop a software tool that can take as input a keyword to search on, and produce a list
of documents that contain the keyword. You should test your program to support searching based on:
- newsitem ID
- topic
- country
- any word
You should also provide an option to search for the keyword in the title, in the body of the news item, or in both.
In each case, the tool needs to produce the list of document names and their locations on your system. The list should be
sorted according to a user choice, such as by date or by country.
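The title/body/both option could be sketched as follows. This is only an illustration: the names `NewsItem`, `Scope`, and `matches` are hypothetical, the match is a case-sensitive substring test, and "both" is read here as "in the title or in the body".

```cpp
#include <string>

// Where the user wants the keyword to be looked for.
enum class Scope { Title, Body, Both };

// Hypothetical in-memory form of one parsed news item.
struct NewsItem {
    std::string title;
    std::string body;
};

// Returns true if `keyword` occurs in the part(s) of `item` selected by `scope`.
bool matches(const NewsItem& item, const std::string& keyword, Scope scope) {
    const bool in_title = item.title.find(keyword) != std::string::npos;
    const bool in_body = item.body.find(keyword) != std::string::npos;
    switch (scope) {
        case Scope::Title: return in_title;
        case Scope::Body:  return in_body;
        case Scope::Both:  return in_title || in_body;
    }
    return false;
}
```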
I. Project Expectations
In this project, you are requested to design and implement the system to meet the specified requirements above.
You will need to:
• Show the choice of the data structures used in your design, and justify the choice of the data structures. You can
discuss the performance/memory impact of using different structures on the program to support your choice.
• Analyze your program, and:
o Use the system clock to time the different parts of your program, along with the overall program
performance.
o Use profiling tools (e.g. VTune) to identify potential bottlenecks in your program.
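One way to time a part of the program with the system clock is the standard `<chrono>` header. The helper below is a minimal sketch; the name `time_ms` is an assumption, not something the project prescribes.

```cpp
#include <chrono>
#include <utility>

// Runs `work` once and returns the elapsed wall-clock time in milliseconds.
// steady_clock is used because it is monotonic (it never jumps backwards).
template <typename Work>
double time_ms(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<Work>(work)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Wrapping the parsing, indexing, and searching phases in separate `time_ms` calls gives per-part timings, and wrapping all of them together gives the overall figure.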
Your groups should be formed of three members, and the names in each group should have already been
provided to your instructor.
Due Date
Your group will be expected to submit a report and give a presentation by November 30, 2013.
Intermediate milestones will be requested throughout the whole semester.
Suggestions for Implementation:
This project is typical of many software problems, where you will find yourself going through these steps:
A. Data Reading / Collection:
First you have to develop an XML parser to read and interpret the relevant text in each file. You will need to create
appropriate data structures to store the data in memory for processing. You will justify your choice
in the report and during the project presentation.
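As a starting point for the parser, the sketch below pulls an element's text and an attribute's value out of a file that has already been read into a string. It assumes the news items use elements such as `<title>` and attributes such as `itemid` on the root tag; check the sample files to confirm. The function names are hypothetical, and the code does not handle nested elements of the same name, tags carrying attributes, or XML entities.

```cpp
#include <string>

// Returns the text between the first <tag> and </tag> pair in `xml`,
// or an empty string if no such pair is found.
std::string element_text(const std::string& xml, const std::string& tag) {
    const std::string open = "<" + tag + ">";
    const std::string close = "</" + tag + ">";
    const std::size_t begin = xml.find(open);
    if (begin == std::string::npos) return "";
    const std::size_t content = begin + open.size();
    const std::size_t end = xml.find(close, content);
    if (end == std::string::npos) return "";
    return xml.substr(content, end - content);
}

// Returns the value of the first attribute written as ` name="value"`,
// or an empty string if it is not present. The leading space keeps a
// search for "id" from matching inside "itemid".
std::string attribute_value(const std::string& xml, const std::string& name) {
    const std::string key = " " + name + "=\"";
    const std::size_t begin = xml.find(key);
    if (begin == std::string::npos) return "";
    const std::size_t value = begin + key.size();
    const std::size_t end = xml.find('"', value);
    if (end == std::string::npos) return "";
    return xml.substr(value, end - value);
}
```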
B. Data Analysis:
• You should analyze/process the text data, so that you can extract and process the relevant information.
• Here, you will need to implement the needed data structures to:
i. Build an index for searching files and store the index locally
ii. Store your extracted data during processing
iii. Store the identified list of documents.
iv. Store other needed data for CPU processing.
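The index in step i could be an inverted index that maps each word to the set of documents containing it. The sketch below uses `std::map`, which is typically implemented as a balanced binary search tree, so lookups take O(log n) comparisons. The names `Index`, `index_document`, and `lookup` are hypothetical, and words are simply lowercased runs of letters and digits.

```cpp
#include <cctype>
#include <map>
#include <set>
#include <string>

// word -> IDs of the documents that contain it
using Index = std::map<std::string, std::set<int>>;

// Splits `text` into lowercase alphanumeric words and records each under `doc_id`.
void index_document(Index& index, int doc_id, const std::string& text) {
    std::string word;
    for (std::size_t i = 0; i <= text.size(); ++i) {
        // A virtual trailing space flushes the last word.
        const unsigned char c = i < text.size() ? text[i] : ' ';
        if (std::isalnum(c)) {
            word += static_cast<char>(std::tolower(c));
        } else if (!word.empty()) {
            index[word].insert(doc_id);
            word.clear();
        }
    }
}

// Returns the IDs of documents containing `word` (expected in lowercase).
std::set<int> lookup(const Index& index, const std::string& word) {
    const auto it = index.find(word);
    return it == index.end() ? std::set<int>() : it->second;
}
```

A hash table (`std::unordered_map`) would give average O(1) lookups instead, at the cost of losing sorted iteration over the words; comparing the two is one way to justify the data structure choice.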
C. Data Storage:
• You store the XML files physically on your disk.
• You should also store the results of the analysis part. You should store the results of past searches for quick
retrieval if they are requested again.
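Storing past searches could look like the sketch below: a map from keyword to the list of matching file paths, written out as one tab-separated line per keyword. The line format and the names `SearchCache`, `save_cache`, and `load_cache` are assumptions made for illustration; keywords and paths containing tabs or newlines are not supported.

```cpp
#include <istream>
#include <map>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// keyword -> paths of the documents found for it
using SearchCache = std::map<std::string, std::vector<std::string>>;

// Writes one line per cached search: keyword<TAB>path1<TAB>path2...
void save_cache(const SearchCache& cache, std::ostream& out) {
    for (const auto& entry : cache) {
        out << entry.first;
        for (const auto& path : entry.second) out << '\t' << path;
        out << '\n';
    }
}

// Reads back the format written by save_cache.
SearchCache load_cache(std::istream& in) {
    SearchCache cache;
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty()) continue;
        std::istringstream fields(line);
        std::string keyword, path;
        std::getline(fields, keyword, '\t');
        std::vector<std::string>& paths = cache[keyword];
        while (std::getline(fields, path, '\t')) paths.push_back(path);
    }
    return cache;
}
```

Passing a `std::ofstream` or `std::ifstream` to these functions stores the cache on disk between runs.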
D. Data Display:
• You should have a user interface that can take in the request of the user, and then displays the names and
locations of the files that contain the requested keyword.
• You should be able to display the list of files based on some sorting criteria (e.g. by date, by country, by topic,
by name).
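Sorting by a user-chosen criterion can be sketched with `std::stable_sort` and a comparator selected at run time. The `Document` fields and the `SortKey` enum are hypothetical; dates are assumed to be stored as YYYY-MM-DD strings so that string order equals date order.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical summary record kept for each matching file.
struct Document {
    std::string name;
    std::string date;     // assumed YYYY-MM-DD, so string order is date order
    std::string country;
};

enum class SortKey { Name, Date, Country };

// Sorts `docs` in ascending order of the chosen key. stable_sort keeps the
// previous relative order of documents that compare equal on that key.
void sort_documents(std::vector<Document>& docs, SortKey key) {
    std::stable_sort(docs.begin(), docs.end(),
        [key](const Document& a, const Document& b) -> bool {
            switch (key) {
                case SortKey::Date:    return a.date < b.date;
                case SortKey::Country: return a.country < b.country;
                default:               return a.name < b.name;
            }
        });
}
```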
• Create a menu that has many features including:
1- Display items filtered by different criteria.
2- Display and store some statistics on the data found. Examples: number of documents with the keyword,
% of topics per country, … You can suggest more statistics that make sense. Be creative here.
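As one possible statistic, the sketch below computes the percentage of matching documents that belong to each country. It takes one country code per matching document; the function name `country_percentages` is hypothetical, and a per-topic breakdown would follow the same pattern.

```cpp
#include <map>
#include <string>
#include <vector>

// Given one country code per matching document, returns
// country -> percentage of the matching documents from that country.
std::map<std::string, double> country_percentages(
        const std::vector<std::string>& countries) {
    std::map<std::string, double> share;
    if (countries.empty()) return share;
    for (const auto& country : countries) share[country] += 1.0;  // counts
    for (auto& entry : share) {
        entry.second = 100.0 * entry.second / countries.size();   // counts -> %
    }
    return share;
}
```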
More advanced features – Bonus Grades:
These include one or more of the following options:
- The system can work for both English and Arabic. You will need to get Arabic documents, and test the program
with Arabic.
- A text categorization program that can read the body of the document and automatically categorize the document
into one of the topics covered. Note that this is an example of text mining.
- A web-access program, where the program can search the web for the keyword, provide the locations of the
sites, extract the data from the sites, and store it locally on your PC.
- Ability to scale to a large number of documents, where the program is tested with all of the RCV1 benchmark.
- A C++ Windows Forms application with a GUI.
III. What you need to submit:
You will need to submit by November 30, 2013: hard copies of the report, and an upload on Moodle that
includes: 1. presentation, 2. report, 3. source code, 4. additional data used in the experiments, and 5. files
produced in the experiments.
The presentation and report should have clear descriptions of the:
1. Algorithms used. You should also include justification for your choices.
2. Data structures used. You should also include justification for your choices.
3. Performance analysis and comparison requested above. That includes asymptotic analysis, timing of
different parts of the program, along with the whole program.
4. Your conclusion and analysis of the proposed solution.
5. Presentation will also include a demo of the running program. IMPORTANT!!!
You will be expected to deliver the following:
- A presentation/demo (10-15 mins) on the highlights of the project
- A 2-3 page report. The format of the report should follow the guidelines for writing technical papers. It should
include the following sections: Title, abstract, introduction, related work, Requirements, Design, Implementation,
Test Plan, experiments and results, conclusion. (Note that the references should be in IEEE format.)
Your electronic submissions should include:
1. Source code and data used in testing the program.
2. A readme file that explains the structure of the source code.
3. The report and the presentation should be provided in their original editable format (example: .doc or .txt for
the report) on Moodle.
I'm going to post each part as I finish it. I'm asking for your guidance and help.
I decided to use binary search trees.
The first part that I did is the parser.
I did it using a linked list, but I'll convert it to a BST for sure.