StyloMetrix - classification based on interpretable stylometrics

Grammar does more than play a role in language: it often holds the meaning of the text itself!


Challenge

Many methods already exist for detecting specific kinds of content, including harmful content, but most operate at the semantic layer of the text. In addition, large models such as transformers are still not fully understood. Drawing on expert knowledge in linguistics and years of experience in narratology and close reading, we have developed StyloMetrix, a model that produces an interpretable statistical vector representation of the grammatical layer of a text.

Besides delivering surprisingly high classification results based on grammar alone, this approach offers several other advantages: it yields an interpretable representation of a text's grammatical structure and distinctive features without requiring access to the original content (an advantage when dealing with harmful content), and it supports model explanations.

Project leader
Inez Okulska, PhD

What we did

StyloMetrix calculates linguistic statistics of documents in Polish, English, and Ukrainian. Regardless of sample length, it produces normalized, interpretable vector representations of entire documents. StyloMetrix vectors can be used as input for machine learning models or as a source of information for corpus analysis. The interface lets users select which metric groups to include in the vector, so the feature set can be tailored to specific tasks or corpora.
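The idea of a length-normalized vector built from selectable metric groups can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the group names, the metric names, and the toy token rules are hypothetical and are not StyloMetrix's actual metrics or API. The sketch only shows why dividing counts by document length makes short and long samples comparable.

```python
import re

# Hypothetical metric groups: each maps a metric name to a token-level rule.
# These toy rules are illustrative only; they are NOT StyloMetrix's metrics.
PRONOUNS = {"i", "you", "he", "she", "it", "we", "they"}
MODALS = {"can", "could", "may", "might", "must", "shall", "should", "will", "would"}

METRIC_GROUPS = {
    "lexical": {
        "pronoun_ratio": lambda tok: tok in PRONOUNS,
        "modal_ratio": lambda tok: tok in MODALS,
    },
    "morphological": {
        "ing_form_ratio": lambda tok: len(tok) > 4 and tok.endswith("ing"),
        "ly_adverb_ratio": lambda tok: len(tok) > 3 and tok.endswith("ly"),
    },
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def stylometric_vector(text, groups=("lexical", "morphological")):
    """Return {metric name: share of tokens matching the rule} for the chosen groups."""
    tokens = tokenize(text)
    total = len(tokens) or 1  # avoid division by zero on empty input
    vector = {}
    for group in groups:
        for name, matches in METRIC_GROUPS[group].items():
            hits = sum(1 for tok in tokens if matches(tok))
            vector[name] = hits / total  # normalize by document length
    return vector


short_text = "She should go."
long_text = " ".join([short_text] * 10)

# The same style at ten times the length yields the same vector.
same_vector = stylometric_vector(short_text) == stylometric_vector(long_text)

# Restricting the groups narrows the feature set.
lexical_only = stylometric_vector(short_text, groups=("lexical",))
```

Because every value is a proportion rather than a raw count, the resulting dictionary can be passed to a classifier or compared across a corpus without correcting for document length.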

The model’s interpretability covers two things: the visualization of specific features in the vector, showing which parts of the text are represented and how, and the interpretability of the model’s decisions. XAI libraries such as dalex or PyArtemis can show Shapley values or interactions of the most important features, which translate directly into understandable grammatical patterns. This not only enables classification but also provides new knowledge about the stylistic elements of the text. Additionally, the StyloMetrix vector can be used as a tool for stylistic fine-tuning of BERT-like models. Preliminary experiments have shown that this hybrid approach improves the effectiveness of the transformer or speeds up its learning process.
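To make the role of Shapley values concrete, here is a minimal sketch that computes them exactly by enumerating feature coalitions. It does not use the dalex or PyArtemis APIs; the scoring function, its weights, and the feature names are hypothetical and chosen only so the arithmetic is easy to follow. Features absent from a coalition are replaced by a baseline value, which is one common convention among several.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values by enumerating all feature coalitions.

    Features outside a coalition take their baseline value.
    Cost grows as 2**n, so this only suits a handful of features.
    """
    names = list(instance)
    n = len(names)

    def value(coalition):
        mixed = {f: (instance[f] if f in coalition else baseline[f]) for f in names}
        return predict(mixed)

    phi = {}
    for feature in names:
        others = [f for f in names if f != feature]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_feature = value(set(subset) | {feature})
                without_feature = value(set(subset))
                total += weight * (with_feature - without_feature)
        phi[feature] = total
    return phi


def toy_score(x):
    """Hypothetical model over toy grammatical features (weights are made up)."""
    return (
        2.0 * x["modal_ratio"]
        + 1.0 * x["pronoun_ratio"]
        + 4.0 * x["modal_ratio"] * x["pronoun_ratio"]  # interaction term
    )


instance = {"modal_ratio": 0.5, "pronoun_ratio": 0.25, "ing_form_ratio": 0.1}
baseline = {"modal_ratio": 0.0, "pronoun_ratio": 0.0, "ing_form_ratio": 0.0}

phi = shapley_values(toy_score, instance, baseline)
```

In this toy setup the contributions sum to the difference between the score for the instance and the score for the baseline, the interaction term is split evenly between the two features involved, and the unused feature receives zero. Since each feature names a grammatical pattern, a contribution can be read directly as "this pattern pushed the decision by this much".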

StyloMetrix can be used to analyze stylistic patterns in texts or documents of various forms, topics (which makes it easier to generalize, a crucial property for new, previously unknown topics), styles, or tones. It is currently available for Polish, English, and Ukrainian, with Russian and German versions in preparation.

GitHub repository – StyloMetrix