Development of a search engine for indexing online advertisements

Supervisor:
Dr. Szikora Béla
Department of Electronics Technology

The aim of my web-based system is to collect as many advertisements on a specific topic from the web as possible. The system also provides search and sorted listings of the collected advertisements. In its present form it indexes used-car advertisements.

I developed the system in the Visual Studio 2010 development environment, using ASP.NET with code written in C#. It consists of two main parts: the web portal and the search engine that refreshes the database. The complete database is hosted on an MS SQL Server 2008 R2 instance.

The web portal provides two search interfaces: a simple one and a more detailed one. Among other things, the user can filter by vehicle make, type, price, performance and other parameters.

The search runs over the entire database once the filter parameters have been set on the search interface. Vehicles matching the search parameters are displayed in a list. Clicking a button opens the selected vehicle's page on the source website.

The database of the web portal is updated regularly if automation is activated, so the displayed search results contain up-to-date information. The update is performed by a separate web application (another ASP.NET solution). Part of the database is duplicated for load-balancing and security reasons. During the update, the refreshing web application reads the existing ads from the copy but writes changes to both the copy and the original full database. In case of data loss, the web portal's advertisement database can be restored thanks to this redundancy.
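The read-from-copy, write-to-both update can be illustrated with a minimal sketch. It is written in Python with in-memory SQLite purely for brevity, whereas the actual system uses C# and MS SQL Server. The table layout and the names `open_db`, `refresh` and `restore` are assumptions made for this illustration and do not come from the thesis.

```python
import sqlite3

# Hypothetical, simplified schema: one row per advertisement, keyed by its URL.
SCHEMA = "CREATE TABLE IF NOT EXISTS ads (url TEXT PRIMARY KEY, title TEXT, price INTEGER)"
UPSERT = "INSERT OR REPLACE INTO ads (url, title, price) VALUES (?, ?, ?)"


def open_db() -> sqlite3.Connection:
    """Open an in-memory database that stands in for one of the two servers."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn


def refresh(copy_db, full_db, scraped_ads) -> int:
    """Store scraped ads in both databases and return how many were new."""
    # Existing ads are read from the copy only, which spares the full database.
    known = {row[0] for row in copy_db.execute("SELECT url FROM ads")}
    new_count = 0
    for ad in scraped_ads:
        if ad["url"] not in known:
            new_count += 1
        # Every change is written to both databases so that they stay in sync.
        for db in (copy_db, full_db):
            db.execute(UPSERT, (ad["url"], ad["title"], ad["price"]))
    copy_db.commit()
    full_db.commit()
    return new_count


def restore(source_db, target_db) -> int:
    """Rebuild the ads table of target_db from source_db after data loss."""
    rows = source_db.execute("SELECT url, title, price FROM ads").fetchall()
    target_db.execute("DELETE FROM ads")
    target_db.executemany(UPSERT, rows)
    target_db.commit()
    return len(rows)
```

Because both databases receive every write, either one can serve as the source for `restore` in this sketch; which direction the real system uses is not specified here.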

I have implemented a universal database-update algorithm that can easily be extended with new source pages. When configuring the refreshing web application, only the most essential information about a new source page has to be set for it to be indexed properly. The algorithm is robust because it locates data with regular expressions rather than relying on the layout of the source webpage.
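The idea of per-source configuration combined with regular-expression extraction can be sketched as follows. This is an illustration in Python, not the C# implementation of the thesis. The `SourceConfig` class, the `extract_ad` function, the field names and the patterns are all hypothetical and were chosen only to show why a pattern anchored on a label such as "Price:" can survive a change in page layout.

```python
import re
from dataclasses import dataclass


@dataclass
class SourceConfig:
    """Hypothetical settings for one source page: one regex per field."""
    name: str
    patterns: dict  # field name -> regex with exactly one capturing group


def extract_ad(html: str, config: SourceConfig) -> dict:
    """Apply every configured pattern to the page; missing fields become None."""
    ad = {}
    for field, pattern in config.patterns.items():
        match = re.search(pattern, html, flags=re.IGNORECASE)
        ad[field] = match.group(1).strip() if match else None
    return ad


# Assumed example source: the patterns skip any tags that follow the label,
# so they do not depend on whether the value sits in a table cell or a span.
EXAMPLE_SOURCE = SourceConfig(
    name="example-cars",
    patterns={
        "make": r"Make:\s*(?:<[^>]+>\s*)*([^<]+)",
        "price": r"Price:\s*(?:<[^>]+>\s*)*([\d\s.,]+)",
        "power_kw": r"(\d+)\s*kW",
    },
)

# Two invented layouts of the same advertisement.
TABLE_LAYOUT = (
    "<table><tr><td>Make:</td><td>Opel</td></tr>"
    "<tr><td>Price:</td><td>1 250 000</td></tr></table><p>55 kW</p>"
)
INLINE_LAYOUT = "<div><b>Make:</b> Opel <span>Price: 1 250 000 Ft</span> (55 kW)</div>"

if __name__ == "__main__":
    print(extract_ad(TABLE_LAYOUT, EXAMPLE_SOURCE))
    print(extract_ad(INLINE_LAYOUT, EXAMPLE_SOURCE))
```

In this sketch, adding a new source page means writing one more `SourceConfig` entry; the extraction function itself stays unchanged.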
