AWS Tobacco Settlement
Virginia Tech AWS Tobacco Settlements Capstone Team
Rahul Ray, College of Engineering, Computer Science
WHAT HAPPENS HERE?
In Fall 2019, the CS5604 class built a functioning search engine (an information retrieval system) on the Computer Science Container Cluster. Our team is working to make that system run on Amazon Web Services (AWS). The search engine indexes a collection of 14 million documents relating to the settlement between US states and the 7 large tobacco companies. We will change the indexing of the approximately 8,000 deposition documents from page-level to line-level. Before that, we will tag all deposition-related documents, so that the actual transcripts and each of their associated files (e.g., exhibits) are classified.
WHAT WAS THE PROCESS?
We had to convert page indexing to line indexing for the approximately 8,000 deposition documents. This involved setting up access to the virtual machine (VM), enabling line-wise indexing, and running the test cases needed to ensure that the script works as intended. We then pushed these deposition documents to Elasticsearch to build the index of records. After that, we finished page indexing for the remaining documents, again by setting up access to the VM and retrieving all of the metadata. Finally, we documented all work in a final report and final presentation.
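To illustrate what moving from page indexing to line indexing could look like, here is a minimal Python sketch. It is not the team's actual script. The function name `line_actions`, the index name `depositions-lines`, the ID scheme, the field names, and the sample transcript text are all hypothetical, chosen only for illustration. The sketch turns each page of a transcript into one record per non-empty line, shaped like the action dictionaries that the Elasticsearch Python client's bulk helper accepts.

```python
from typing import Dict, Iterator, List


def line_actions(
    doc_id: str, pages: List[str], index: str = "depositions-lines"
) -> Iterator[Dict]:
    """Yield one index action per non-empty line of a deposition transcript.

    `pages` holds the plain text of each page, in order. All field names and
    the index name are assumptions, not the project's real schema.
    """
    for page_no, page_text in enumerate(pages, start=1):
        for line_no, line in enumerate(page_text.splitlines(), start=1):
            text = line.strip()
            if not text:
                continue  # skip blank lines; numbering still follows the page
            yield {
                "_index": index,
                "_id": f"{doc_id}-p{page_no}-l{line_no}",
                "_source": {
                    "doc_id": doc_id,
                    "page": page_no,
                    "line": line_no,
                    "text": text,
                },
            }


# Made-up two-page transcript used only to exercise the function.
sample_pages = [
    "Q. Please state your name.\nA. Jane Doe.",
    "Q. Where do you work?\n\nA. At the plant.",
]
actions = list(line_actions("dep0001", sample_pages))
```

Keeping the page and line numbers in each record lets a search hit point to an exact location in a transcript. Under these assumptions, the resulting actions could be sent to a cluster with `elasticsearch.helpers.bulk(client, actions)`.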