StyloMetrix - classification based on interpretable stylometrics

Dr Anna Kołos

Many methods already exist for detecting specific kinds of content, including harmful content, but most operate on the semantic layer of the text. In addition, large models such as transformers remain largely unexplained. Drawing on expert knowledge in linguistics and years of experience in narratology and close reading, we developed StyloMetrix, a model that produces an interpretable statistical vector representation of a text, covering among other things its grammatical layer.
Besides surprisingly high classification results based solely on grammar, this approach has several other advantages: it gives an interpretable representation of a text's grammatical structure and distinctive features without requiring access to the original content (a clear advantage when working with harmful content), and it makes the model's decisions explainable.

StyloMetrix calculates linguistic statistics for documents in Polish, English, and Ukrainian. It produces normalized, interpretable vector representations of entire documents, regardless of their length. StyloMetrix vectors can serve as input to machine learning models or as a source of information for corpus analysis. The interface lets users select which groups of metrics go into the vector, so the feature set can be tailored to a specific task or corpus.
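To make the idea of a length-normalized, interpretable vector concrete, here is a minimal sketch in plain Python. It does not use the StyloMetrix library or its metrics: the word lists, the feature names, and the function `toy_style_vector` are hypothetical and chosen only for illustration. Each feature is a share of all tokens, so a short and a long sample written in the same style map to the same vector.

```python
import re

# Hypothetical word lists for illustration only; these are not
# StyloMetrix's actual metrics or metric groups.
PRONOUNS = {"i", "you", "he", "she", "it", "we", "they"}
MODALS = {"can", "could", "may", "might", "must", "shall", "should", "will", "would"}
NEGATIONS = {"not", "no", "never"}


def toy_style_vector(text):
    """Return named style features, each normalized by the token count."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    if total == 0:
        # An empty document gets an all-zero vector.
        return {"pronoun_ratio": 0.0, "modal_ratio": 0.0, "negation_ratio": 0.0}
    return {
        "pronoun_ratio": sum(t in PRONOUNS for t in tokens) / total,
        "modal_ratio": sum(t in MODALS for t in tokens) / total,
        "negation_ratio": sum(t in NEGATIONS for t in tokens) / total,
    }


short_text = "She would not go."
long_text = " ".join([short_text] * 50)  # same style, 50 times longer

short_vec = toy_style_vector(short_text)
long_vec = toy_style_vector(long_text)
```

Because every value is a named ratio, the resulting dictionary can be read directly ("a quarter of the tokens are modal verbs") and can also be passed to a classifier as an ordinary feature vector.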
The model's interpretability covers two things: visualizing specific features of the vector, showing which parts of the text are represented and how, and explaining the decisions of a model trained on it. XAI libraries such as dalex or PyArtemis show Shapley values or interactions among the most important features, which here translate directly into understandable grammatical patterns. In this way, in addition to a classification, we gain new knowledge about the classified text, namely which stylistic elements characterize it. The StyloMetrix vector can also be used for stylistic fine-tuning of BERT-like models. Preliminary experiments have shown that this hybrid approach improves the transformer's effectiveness or speeds up its training.
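The following sketch illustrates what a Shapley value means when the input features are themselves readable. It is not the dalex or PyArtemis API: `shapley_values` is a hand-written exact computation that enumerates all feature coalitions, and the linear scorer, its weights, and the feature names are hypothetical. Features outside a coalition are set to a baseline value, which is one common convention and an assumption here.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values; features outside a coalition take baseline values."""
    names = list(instance)
    n = len(names)

    def value(coalition):
        # Build an input where only coalition members keep their real values.
        x = {k: (instance[k] if k in coalition else baseline[k]) for k in names}
        return predict(x)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            # Standard Shapley weight for a coalition of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = set(subset)
                total += weight * (value(coalition | {name}) - value(coalition))
        phi[name] = total
    return phi


# Hypothetical linear scorer over made-up grammatical features.
WEIGHTS = {"pronoun_ratio": 2.0, "modal_ratio": -1.0, "negation_ratio": 4.0}


def score(x):
    return sum(WEIGHTS[k] * x[k] for k in WEIGHTS)


instance = {"pronoun_ratio": 0.25, "modal_ratio": 0.25, "negation_ratio": 0.25}
baseline = {"pronoun_ratio": 0.0, "modal_ratio": 0.0, "negation_ratio": 0.0}

phi = shapley_values(score, instance, baseline)
```

Each attribution is attached to a named grammatical feature, so the explanation reads as a statement about style, for example that the share of negations pushed the score up the most. Exact enumeration grows exponentially with the number of features, so for full-size vectors a dedicated XAI library with approximation methods is the practical choice.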
StyloMetrix can be used to analyze stylistic patterns in texts or documents that differ in form, topic, style, or tone (it generalizes more easily across topics, which matters for new, previously unseen ones). It is currently available for Polish, English, and Ukrainian; Russian and German versions are in preparation.
GitHub repository: <a href="https://github.com/ZILiAT-NASK/StyloMetrix">StyloMetrix</a>

What have we done?

Anna Kołos's primary field of research is literary studies in the context of broadly understood cultural history, in particular issues related to imagology and (post)colonial discourse in writing. She is currently extending her research into the digital humanities and works on stylometric analysis of large text corpora, both applied and strictly literary. Her interests include early literature and culture, modern literature and writing, colonial discourse, and imagology.
In 2014, she completed a research internship at the prestigious Warburg Institute in London as part of the NCN "Etiuda" doctoral fellowship. She took part in the NCN "Sonata" grant devoted to Polish and Serbian images of China (18th century to 1939). In 2018, she received a START scholarship for young scientists from the Foundation for Polish Science.