Industry sector classification of companies by context

Gáspár Csaba
Department of Telecommunications and Media Informatics

This research project focuses on using the online press to infer industry sector information from the context of the articles in which companies appear.

Industry sector knowledge plays an important role in defining business strategy. It comes in handy when a CEO wants to know whether another firm operates in the same primary industry sector, making it a competitor, or in a different one, making it a potential strategic partner.

By combining the Web Crawling extension with the Text Processing and Text Mining features of RapidMiner, I implemented a classification model that predicts the industrial content of a given article. In a second phase, company names are retrieved from the articles and classified by the context and network of the articles in which they appear. Since firms may appear in articles with different contexts, after collecting the names of Hungarian companies the top three industry labels are assigned by choosing the highest confidence values from the calculated predictions.

As a final step, descriptive analyses and a short evaluation of the model are presented.

Over the course of the research project I have improved the performance of the complete process, increased the accuracy of the classification model, and extended the number of analyzable industry sectors to nine: healthcare, info-communication (ICT), energy, industry, telecommunication, commerce, finance, agriculture, and general economy.

The main goal of this research project is an up-to-date database that uses the latest innovative technologies and methods to retrieve business- and economy-related information from the Internet, thereby offering an automatic solution for improving sector and industry knowledge.

The main tasks fall into five groups:

Crawl the World Wide Web using the Web Mining tool in RapidMiner, optimize the search criteria for different topics and web pages, and save economy-related articles in 9 topics from the 5 main news portals in Hungary (Világgazdaság, Portfólió, Népszabadság Online, Origo, and HVG), then store the articles' content for training the model. The crawling parameters differ for each website, as the sites are built differently, and so each has to be treated differently to obtain the best hits. In this way the number of collected articles exceeded 23,000, which is considered an enormous data set for text mining purposes.
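The crawling itself runs inside RapidMiner, but the per-portal filtering idea can be sketched in plain Python. The URL patterns and topic keywords below are illustrative assumptions, not the actual crawl rules used in the project:

```python
# Illustrative sketch of per-portal crawl filtering: each portal gets its own
# article-URL pattern and topic keywords, since the sites are built differently.
# All rules below are hypothetical placeholders.
import re

PORTAL_RULES = {
    "vg.hu": {"article_path": re.compile(r"/gazdasag/"), "keywords": {"energia", "bank"}},
    "hvg.hu": {"article_path": re.compile(r"/gazdasag"), "keywords": {"energia", "bank"}},
}

def should_save(portal: str, url: str, title: str) -> bool:
    """Decide whether a crawled page looks like an economy article worth storing."""
    rules = PORTAL_RULES.get(portal)
    if rules is None:
        return False
    if not rules["article_path"].search(url):
        return False
    # Require at least one topic keyword in the (lowercased) title.
    return bool(rules["keywords"] & set(title.lower().split()))
```

In the real process these criteria correspond to the crawling rules of RapidMiner's Crawl Web operator, tuned separately per portal.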

The next part of the task is processing the documents: extracting their content, tokenizing, transforming cases, and filtering tokens (by stopwords, length, and content). As a result, a wordlist of close to 69,000 tokens is stored; these tokens play the role of attributes in the subsequent processes. To reduce this number, attribute weighting by Support Vector Machine is introduced into the process, which weights the attributes by their importance. For process runtime optimization, the top 5% of the tokens are selected, keeping about 3,500 of the most important attributes, which best describe the data set of 24,000 articles.

Choose the right model: it is an important decision to choose from the wide range of models available in RapidMiner. The most commonly used methods for polynomial classification are Naïve Bayes, Neural Net, and k-Nearest Neighbor. To select the one with the best performance on the data set, ROC curves are computed for each separate label. As a result, eight different graphs are obtained, each containing the ROC curves of Naïve Bayes, k-NN, and Neural Net, showing the true positive rate over the false positive rate. From the graphs it can easily be seen that in most cases the k-Nearest Neighbor method gave the best results on a smaller example set.
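The winning method can be sketched as a toy k-NN classifier over term-weight vectors; this is a simplified stand-in for RapidMiner's k-NN operator running on the 3,500-attribute vectors, with made-up example vectors and labels:

```python
# Toy k-nearest-neighbor classifier: majority vote among the k training
# examples closest (Euclidean distance) to the query vector.
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (vector, label) pairs; returns the majority label of the k nearest."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

In the real process the vectors are TF-IDF-style token weights and the labels are the nine industry sectors.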

Collect Hungarian company names appearing in articles that match different search criteria, such as company-form endings (Kft., Rt., Zrt., etc.), ownership-related keywords with a possessive affix (tulajdonosa 'its owner', vezetője 'its manager'), and others (versenytársa 'its competitor', partnere 'its partner', etc.), and store them together with the article names, which play the role of IDs.
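The ending-based criterion can be sketched as a regular expression that catches capitalized name sequences followed by a Hungarian company-form suffix. The pattern and company names below are illustrative assumptions, not the exact rules of the project:

```python
# Hedged sketch of company-name extraction: capitalized words followed by a
# Hungarian company-form ending (Kft., Rt., Zrt., Nyrt., Bt.).
import re

COMPANY_RE = re.compile(
    r"\b([A-ZÁÉÍÓÖŐÚÜŰ][\w&.-]*(?:\s+[A-ZÁÉÍÓÖŐÚÜŰ][\w&.-]*)*\s+(?:Kft|Rt|Zrt|Nyrt|Bt)\.)"
)

def extract_companies(article_id: str, text: str) -> list[tuple[str, str]]:
    """Return (article_id, company_name) pairs; the article name serves as the ID."""
    return [(article_id, name) for name in COMPANY_RE.findall(text)]
```

The keyword-based criteria (tulajdonosa, vezetője, versenytársa, partnere) would be handled by analogous patterns anchored on those words.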

Extract the three labels that are most significant for the given company. Joining the table containing the articles and their predicted labels with the table containing the articles and the extracted company names results in a database of article-company-confidence records. This database contains nine labels with different predicted confidences for each company and article. For further analysis, the top three labels are used.
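This aggregation can be sketched as follows, assuming (as the abstract suggests) that the highest confidence seen for each label across a company's articles is the one that counts; the company and label names are hypothetical:

```python
# Collapse article-company-confidence records into the top three industry
# labels per company, keeping the highest confidence seen for each label.
from collections import defaultdict

def top_three_labels(records):
    """records: iterable of (company, label, confidence) triples."""
    best = defaultdict(dict)  # company -> {label: best confidence}
    for company, label, conf in records:
        if conf > best[company].get(label, 0.0):
            best[company][label] = conf
    return {
        company: sorted(labels, key=labels.get, reverse=True)[:3]
        for company, labels in best.items()
    }
```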

Improve the results by drawing two networks of articles and companies, where two articles are neighbors if they contain the same company, and two companies are neighbors if they appear in the same article.
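The two co-occurrence networks can be built from the same (article, company) pairs produced by the extraction step; a minimal sketch with hypothetical IDs:

```python
# Build the two networks: articles are neighbors when they share a company,
# companies are neighbors when they share an article.
from collections import defaultdict
from itertools import combinations

def build_networks(pairs):
    """pairs: iterable of (article, company); returns (article_edges, company_edges)."""
    articles_of = defaultdict(set)   # company -> articles mentioning it
    companies_of = defaultdict(set)  # article -> companies it mentions
    for article, company in pairs:
        articles_of[company].add(article)
        companies_of[article].add(company)

    article_edges = set()
    for arts in articles_of.values():
        article_edges.update(frozenset(e) for e in combinations(sorted(arts), 2))
    company_edges = set()
    for comps in companies_of.values():
        company_edges.update(frozenset(e) for e in combinations(sorted(comps), 2))
    return article_edges, company_edges
```

Edges are stored as unordered `frozenset` pairs, since neighborhood in both networks is symmetric.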

