Stock market trading is an engine of the modern economy, which raises the question of whether this engine is efficient and, if not, how its imperfections can be exploited. In this thesis, motivated by the findings of behavioral science, we set out to gain insight into the effect of public opinion on stock prices.
On the one hand, this involves finding the sources that offer the most valuable basis for our data mining efforts. We chose the articles of the 40 most prominent financial news sites and the stream of short messages from the Twitter Real-Time API, and built our own data acquisition and storage subsystems for both.
On the other hand, it involves comparing two groups of sentiment analysis methods on the collected data: lexicon-based approaches and a machine learning approach (Naive Bayes).
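To make the contrast between the two method groups concrete, the following is a minimal sketch and not the implementation used in the thesis. The word lists, training sentences, and function names are all hypothetical, and the Naive Bayes variant shown (a multinomial model with add-one smoothing) is an assumption about the setup.

```python
import math
from collections import Counter

# Hypothetical toy lexicon; real sentiment lexicons hold thousands of entries.
POSITIVE = {"gain", "surge", "beat", "strong"}
NEGATIVE = {"loss", "drop", "miss", "weak"}


def lexicon_label(text):
    """Lexicon approach: compare counts of positive and negative words."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "pos"
    if score < 0:
        return "neg"
    return "neutral"


def train_naive_bayes(samples):
    """Collect per-label word counts from (text, label) pairs."""
    word_counts = {}          # label -> Counter of word occurrences
    label_counts = Counter()  # label -> number of training documents
    vocab = set()
    for text, label in samples:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    return word_counts, label_counts, vocab


def nb_label(text, model):
    """Pick the label with the highest log posterior (add-one smoothing)."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_logp = None, -math.inf
    for label in label_counts:
        logp = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for w in text.lower().split():
            if w not in vocab:
                continue  # words never seen in training carry no evidence
            logp += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label


# Hypothetical labeled examples standing in for a real training corpus.
TRAIN = [
    ("shares surge after strong earnings", "pos"),
    ("profit beat lifts stock", "pos"),
    ("shares drop on weak outlook", "neg"),
    ("earnings miss triggers loss", "neg"),
]
model = train_naive_bayes(TRAIN)
```

The lexicon scorer needs no training data but depends entirely on its word list, whereas the Naive Bayes classifier learns word weights from labeled examples, so its quality depends on the training corpus.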
For news articles, we observed that one of the lexicons outperformed the baseline S&P 500 buy-and-hold strategy, while the other lexicons and Naive Bayes failed to do so.
For the brief messages of Twitter the opposite held: lexicon-based methods tended to classify an unreasonably large number of days as positive and therefore gained no edge over the index, whereas the Naive Bayes classifier had an appreciable advantage.
It is important to note that trading simulations in a bull market such as the current one cannot show the whole picture of the robustness of the methods above, because identifying relatively few negative days can be enough to produce these returns.
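This caveat can be illustrated with a minimal simulation sketch. The daily returns and signals are made-up numbers, and the trading rule (hold the index on days classified as positive, stay in cash otherwise) is an assumed simplification, not necessarily the strategy simulated in the thesis.

```python
def cumulative_return(daily_returns, signals=None):
    """Compound daily returns; with signals, invest only on days marked True."""
    value = 1.0
    for i, r in enumerate(daily_returns):
        invested = True if signals is None else signals[i]
        if invested:
            value *= 1.0 + r
    return value - 1.0


# Hypothetical bull-market week: four up days and one down day.
returns = [0.01, 0.02, -0.03, 0.01, 0.02]

# Baseline: stay invested in the index every day.
buy_and_hold = cumulative_return(returns)

# A classifier that labels every day positive simply replicates the index.
always_positive = cumulative_return(returns, [True] * 5)

# A classifier that flags only the single down day as negative.
skips_down_day = cumulative_return(returns, [True, True, False, True, True])
```

In this toy series, labeling every day as positive yields exactly the buy-and-hold return, while avoiding one down day is already enough to beat the index, which is why results from a rising market say little about robustness.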
We conclude that, consistent with previous work on the subject, machine learning on news articles may not be ideal for predicting the stock market, while the concise, information-dense nature of Twitter messages offers a promising direction for future work in this area.