Semi-structured data plays an important role in artificial intelligence (AI), natural language processing (NLP), and other branches of computer science. There are many opportunities to build software on semi-structured data sources such as Wikipedia, the largest collaboratively edited knowledge base. This thesis presents the full lifecycle of a new research tool that supports computer science research by processing Wikipedia articles and publishing the results to researchers.
Before designing this software, I investigated similar solutions to learn from their experience. Such research helps avoid common mistakes and leads to a sound architecture. None of the available solutions proved flexible; each can be used only for a specific task. However, all of them share three major, well-defined components: gathering, processing, and storing the data. I designed and implemented these three components in my thesis according to the specified requirements.
The architecture of the system is based on the OSGi module system and service platform, which provides flexibility and maintainability. The OSGi framework extends the capabilities of the Java language with a component-driven development method.
During the design phase, the separate OSGi components were created one by one, with attention to high performance. The speed of the application comes from its multi-threaded approach, its asynchronous implementation, and the use of high-performance technologies.
The implementation chapter of the thesis presents the technologies that were tested, covering the advantages and disadvantages of each. In addition to the technical elements, design patterns also play a central role in my thesis.
The results of the implementation are demonstrated in the measurement chapter, which leads to opportunities for further development. The ideas for further development are supported by measurements. They explain how the application can be modified to become useful software in the field of semi-structured data processing systems.