STYLOMETRIC ANALYSIS OF THE CORRESPONDENCE OF ZSIGMOND MÓRICZ

5. Conclusions and future work

5.1. Conclusions

Our main conclusion is that Eder’s simple delta appears better suited than classic delta to stylistic text classification problems: applied to bigrams, it gave more accurate results in cluster analysis. We also experimented with cosine distance and Manhattan distance, but both performed far worse than either classic delta or Eder’s simple delta.
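To make the difference between the two measures concrete, the following is a minimal sketch in plain Python rather than the Stylo workflow used in the study. It assumes that classic delta is the mean absolute difference of z-scored feature frequencies, and that Eder’s simple delta corresponds to the "Eder's Simple" distance documented for the Stylo package, i.e. Manhattan distance on square-root-transformed frequencies. The function names and the toy frequency table are hypothetical and are not taken from the Móricz corpus.

```python
import math
from statistics import mean, stdev


def manhattan(a, b):
    """Sum of absolute coordinate differences between two vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def z_scores(freqs):
    """Column-wise z-scores of a texts-by-features frequency table."""
    columns = list(zip(*freqs))
    means = [mean(col) for col in columns]
    sds = [stdev(col) for col in columns]  # sample standard deviation
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in freqs]


def classic_delta(freqs, i, j):
    """Classic delta (assumed form): mean absolute z-score difference."""
    z = z_scores(freqs)
    return manhattan(z[i], z[j]) / len(z[i])


def eder_simple(freqs, i, j):
    """Eder's simple distance (assumed form): Manhattan on sqrt frequencies."""
    a = [math.sqrt(v) for v in freqs[i]]
    b = [math.sqrt(v) for v in freqs[j]]
    return manhattan(a, b)


# Hypothetical relative frequencies (in percent) of three features in three texts.
toy_freqs = [
    [4.0, 1.0, 0.25],  # text A
    [1.0, 4.0, 0.25],  # text B
    [4.0, 1.0, 1.0],   # text C
]

for i, j in [(0, 1), (0, 2), (1, 2)]:
    print(i, j, round(classic_delta(toy_freqs, i, j), 3),
          round(eder_simple(toy_freqs, i, j), 3))
```

The square-root transformation compresses the most frequent features without referring to the rest of the corpus, whereas z-scoring makes every distance depend on the means and standard deviations of the whole table. This is one plausible reason why the two measures can rank the same pairs of texts differently.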

Another important conclusion is that the way the distance measurements are visualised is a key factor in evaluating the results. The differences between the dendrograms plotted from the two deltas did not become visible until PCA was performed. It is also clear, however, that the branches of a dendrogram are separated on different principles from those by which texts are placed in the PCA coordinate system, so the two methods lead to slightly different results.

For most classification problems, a combination of visualisation methods will likely be appropriate.

5.2. Future work

In most classification studies, the size of the available corpus is crucial. A larger corpus would very likely yield even better results, although the results supporting our hypothesis are already apparent in the corpus we have so far. One good way of expanding the corpus would be to include other works by Móricz (e.g., his novels).

It would also be worthwhile to analyse other textual and stylometric features, e.g., syntactic n-grams, although this requires a syntactic analysis of the corpus. The letters could likewise be examined with machine learning algorithms (e.g., automatic text classification methods), but this too requires a larger corpus. These topics, however, should be the subject of another paper.

WEB SOURCES

W1 = https://voyant-tools.org/?corpus=7ca70f930f575a021120b1cfcbfa3cdc&view=Summary (2021. 08. 11.)

REFERENCES

Abbasi, Ahmed – Chen, Hsinchun 2006. Visualizing authorship for identification. In Sharad Mehrotra – Daniel D. Zeng – Hsinchun Chen – Bhavani Thuraisingham – Fei-Yue Wang (eds.): Intelligence and Security Informatics: Lecture Notes in Computer Science (3975). Berlin – Heidelberg: Springer. 60–71. https://doi.org/10.1007/11760146_6

Cséve Anna – Kalcsó Gyula – Mihály Eszter

Abbasi, Ahmed – Chen, Hsinchun 2007. A framework for stylometric similarity detection in online settings. In: AMCIS 2007 Proceedings. 127. http://aisel.aisnet.org/amcis2007/127 (2021. 07. 14.)

Argamon, Shlomo 2007. Interpreting Burrows’s Delta: geometric and probabilistic foundations. Literary and Linguistic Computing 23: 131–47. https://doi.org/10.1093/llc/fqn003

Burrows, John 2002. “Delta”: A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing 17: 267–87. https://doi.org/10.1093/llc/17.3.267

Eder, Maciej – Piasecki, Maciej – Walkowiak, Tomasz 2017. An open stylometric system based on multilevel text analysis. Cognitive Studies | Études cognitives 17. https://ispan.waw.pl/journals/index.php/cs-ec/article/view/cs.1430 (2021. 07. 14.) https://doi.org/10.11649/cs.1430

Eder, Maciej – Rybicki, Jan – Kestemont, Mike 2016. Stylometry with R: a package for computational text analysis. The R Journal 8: 107–21. https://doi.org/10.32614/RJ-2016-007

Fazli, Can – Patton, Jon M. 2004. Change of writing style with time. Computers and the Humanities 38: 61–82. https://doi.org/10.1023/B:CHUM.0000009225.28847.77

Gómez-Adorno, Helena Montserrat – Ríos-Toledo, Germán – Posadas-Durán, Juan-Pablo – Sidorov, Grigori – Sierra, Gerardo 2018. Stylometry-based approach for detecting writing style changes in literary texts. Computación y Sistemas 22: 47–53. https://doi.org/10.13053/cys-22-1-2882

Grieve, Jack 2007. Quantitative authorship attribution: an evaluation of techniques. Literary and Linguistic Computing 22: 251–270. https://doi.org/10.1093/llc/fqm020

Kjell, Bradley – Woods, W. Addison – Frieder, Ophir 1994. Discrimination of authorship using visualization. Information Processing & Management 30: 141–150. https://doi.org/10.1016/0306-4573(94)90029-9

Kómár, Éva – Cséve, Anna – Fellegi, Zsófia 2018. Móricz Zsigmond levelezésének (1892–1913) digitális kritikai kiadása. [A digital critical edition of Zsigmond Móricz’s correspondence, 1892–1913] Digitális Bölcsészet 1: 159–74. https://doi.org/10.31400/dh-hun.2018.1.227

Lancashire, Ian – Hirst, Graeme 2009. Vocabulary changes in Agatha Christie’s mysteries as an indication of dementia: a case study. 19th Annual Rotman Research Institute Conference, Cognitive Aging: Research and Practice. https://www.cs.toronto.edu/pub/gh/Lancashire+Hirst-2009-poster.pdf (2021. 07. 14.)

Moisl, Hermann 2015. Cluster analysis for corpus linguistics. N.p.: De Gruyter Mouton. https://doi.org/10.1515/9783110363814

Pennebaker, James W. – Stone, Lori D. 2003. Words of wisdom: language use over the life span. Journal of Personality and Social Psychology 85: 291–301. https://doi.org/10.1037/0022-3514.85.2.291

Stanikūnas, Daumantas – Mandravickaite, Justina – Krilavicius, Tomas 2017. Comparison of distance and similarity measures for stylometric analysis of Lithuanian texts. In: CEUR Workshop proceedings [electronic resource]: ICYRIME 2017: proceedings of the symposium for young researchers in informatics, mathematics and engineering. Kaunas, Lithuania, April 28, 2017. Aachen: CEUR-WS. 1–7. http://ceur-ws.org/Vol-1852/p01.pdf (2021. 07. 14.)

Stylometric analysis of the correspondence of Zsigmond Móricz

This article reports on a study in which we used computational stylometric methods to examine the textual and stylometric features of the letters that Zsigmond Móricz wrote to his wife and to others between 1902 and 1913. The experiment is the first stylometric undertaking of the Digital Humanities Centre of the Petőfi Irodalmi Múzeum (Petőfi Literary Museum). The corpus is based on the digital scholarly edition of the letters in the museum’s Móricz special collection and contains 478 letters (220,268 words). We used an R package, Stylo, together with distance measures (classic delta and Eder’s simple delta) to analyse these features. The results were visualised in two ways: by cluster analysis (on a dendrogram) and by principal component analysis. The classification of the letters was successful, although only the combined use of the two visualisation methods led to a result. We were able to show that there are stylometrically measurable differences between the letters Móricz wrote to Janka and those he wrote to others.

In: Az Eszterházy Károly Egyetem tudományos közleményei (Új sorozat 47. köt.) = Acta Universitatis de Carolo Eszterhazy Nominatae. Sectio Linguistica Hungarica.