7 Evaluation of the Sentence Boundary Detection

Although the shared task is dedicated to Sentence Boundary Detection, we focused on designing our top-down pipeline for the PDF itself. As a result, we did not have enough time to fine-tune the output. The main difficulty we encountered was aligning our internal representation with the expected FinSBD representation, since the two are very different.

A complex ad-hoc module had to be implemented to try to map our structure to the expected character-based structure.

Our algorithm splits the text blocks extracted by our top-down pipeline using a single regular expression based on the presence of an end-of-sentence punctuation mark followed by a space separator or a line separator. In the following tables we show the detailed results of an improved version of our system in which the beginnings and ends of paragraphs are correctly detected. We kept only the subtask1 result of our original submission to ease comparison. We removed the results on lists and numbered items since our system does not produce these units yet.
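The splitting rule described above can be illustrated with a minimal Python sketch. The exact regular expression used in our system is not reproduced here: the pattern below, the set of punctuation marks (`.`, `!`, `?`) and the function name `split_block` are assumptions made for illustration only.

```python
import re

# Assumed pattern: an end-of-sentence punctuation mark followed by one or
# more whitespace characters (space separator or line separator).
# The lookbehind keeps the punctuation mark attached to the sentence.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_block(block: str) -> list[str]:
    """Split one extracted text block into candidate sentences."""
    return [part for part in SENTENCE_END.split(block.strip()) if part]


if __name__ == "__main__":
    block = "The fund invests in bonds. Fees are 1.5% per year.\nSee the prospectus!"
    for sentence in split_block(block):
        print(sentence)
```

Because the rule requires a separator after the punctuation mark, the decimal point in "1.5%" does not trigger a split. Conversely, such a rule would also split after an abbreviation that ends with a period and is followed by a space.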

Table 1 and Table 2 show the results obtained on the train set, in English and in French respectively. We focused on the sentences and the items for the system we submitted.

Our system obtains much better results in terms of precision than in terms of recall: it seems to miss many sentences.

Document      sent                    item
              f1      prec.   recall  f1      prec.   recall
Invesco-Fu    37.8    44.2    33.0    0.0     0.0     0.0
EdR-Privat    43.7    34.8    58.9    11.1    78.6    6.0
CANDRIAM-G    63.8    83.7    51.5    78.5    73.9    83.6
Dexia-Equi    65.9    80.5    55.8    46.1    67.3    35.0
Credit-Sui    78.5    89.9    69.8    48.0    69.1    36.7
Macro         57.9    66.6    53.8    36.7    57.8    32.3

Table 1: Results on the English train set, sorted by F-measure on sentences. F-measure on subtask1: 32.6 (vs. 23.6 for our official submission).
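The Macro row of Table 1 is consistent with an unweighted mean of the per-document scores. The following Python sketch checks this on the sentence-level columns, under the assumption that the averages were computed this way; the function name `macro_average` is introduced here for illustration only.

```python
from statistics import mean

# Sentence-level scores per document, copied from Table 1 (English train set).
sent_f1 = [37.8, 43.7, 63.8, 65.9, 78.5]
sent_precision = [44.2, 34.8, 83.7, 80.5, 89.9]
sent_recall = [33.0, 58.9, 51.5, 55.8, 69.8]


def macro_average(scores):
    """Unweighted mean over documents, rounded to one decimal place."""
    return round(mean(scores), 1)


if __name__ == "__main__":
    print(macro_average(sent_f1))         # 57.9
    print(macro_average(sent_precision))  # 66.6
    print(macro_average(sent_recall))     # 53.8
```

Note that the macro F-measure is the mean of the per-document F-measures, not the harmonic mean of the macro precision and macro recall.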

Our results on the test set are shown in Table 3 and Table 4.

The results in English are higher on the test set than on the train set, but the dataset is too small to draw any conclusion from that. The fact that the same pattern is observed in French may show that our rule-based system does not suffer too much from overfitting.

8 Conclusion

Our team participated in the FinSBD-2 Shared Task dedicated to Sentence Boundary Detection in Financial Prospectuses. It was our first participation in this shared task. Our motivation was to improve our model-driven approach to multilingual document analysis.

The work we have achieved is very promising. We had the opportunity to handle the full workflow and to define, control and implement each NLP component.

Concerning the FinSBD shared task, we lacked time to finalize the creation of list objects, unordered list objects and sentences. We chose to control the whole workflow, which was a bit too ambitious given the time constraints, since aligning our internal representations to the offsets of the ground truth proved difficult.

In the near future, we intend to enhance the implementation of our page layout model so that it complies with the page layout model described in [Giguet, 2008]. We would also like to implement the document model we introduced in the INEX Book Structure Extraction Competition in order to divide a document into main parts and chapters [Giguet et al., 2009]. This strategy, applied at document scope, could have led to more accurate decisions at lower levels of the hierarchy (i.e., a divide-and-conquer strategy).

Document      sent                    item
              f1      prec.   recall  f1      prec.   recall
LCL-OBLIGA    28.8    36.3    23.8    2.9     3.4     2.6
LCL-DOUBLE    33.4    36.8    30.7    3.7     5.3     2.9
LCL-INVEST    34.6    43.6    28.7    1.1     2.6     0.7
AMUNDI-VIE    34.9    44.3    28.8    2.5     6.5     1.6
FUNDQUEST-    38.1    51.9    30.1    43.0    44.4    41.7
BNP-PARIBA    44.8    70.8    32.8    45.0    39.1    52.9
QUILVEST-C    51.0    62.1    43.3    34.6    51.9    25.9
GROUPAMA-O    53.1    60.5    47.3    39.7    40.0    39.5
AVIVA-INTE    53.3    66.1    44.6    32.0    29.1    35.6
CREDIT-MUT    53.7    83.5    39.6    33.8    26.1    47.9
GUTENBERG-    54.2    58.4    50.5    34.5    37.0    32.3
Fondo-BNP-    57.2    59.2    55.3    66.7    73.7    60.9
CM-CIC-EUR    57.3    56.8    57.8    44.8    41.4    48.8
FCPI-IDINV    59.8    78.4    48.3    88.9    88.9    88.9
GASPAL-CON    61.5    73.4    52.9    35.8    70.7    24.0
Le-PALÉ-FR    62.0    76.8    51.9    60.1    48.8    78.2
NORDEN-SMA    62.1    74.0    53.4    49.7    40.4    64.7
ORCHIDEE-I    62.3    64.8    60.1    54.5    70.6    44.4
SÉLECT-OBL    65.6    90.9    51.3    32.9    28.6    38.7
Sécuri-Tau    68.1    84.3    57.1    28.6    19.0    57.1
QUADRIGE-M    69.2    85.6    58.1    82.8    89.1    77.4
FCPI-Innov    72.8    84.6    63.9    50.2    52.6    48.1
INNOVEN-EU    77.7    89.7    68.5    8.5     11.8    6.7
Macro         54.6    66.6    46.9    38.1    40.0    40.1

Table 2: Results on the French train set, sorted by F-measure on sentences. F-measure on subtask1: 31.9 (vs. 33.5 for our original submission).

Document      sent                    item
              f1      prec.   recall  f1      prec.   recall
Arabesque-    71.8    88.4    60.5    55.3    88.7    40.1
MAGALLANES    76.9    92.1    66.0    23.8    88.9    13.7
Macro         74.3    90.2    63.2    39.5    88.8    26.9

Table 3: Results on the English test set, sorted by F-measure on sentences. F-measure on subtask1: 37.9 (vs. 31.7 for our original submission).

Document      sent                    item
              f1      prec.   recall  f1      prec.   recall
CM-CIC-OBL    32.3    27.7    38.8    40.0    36.2    44.7
LCL-MULTI-    35.3    34.5    36.2    0.0     0.0     0.0
HEXASTEP-H    40.6    71.5    28.3    11.1    33.3    6.7
AMUNDI-IND    50.0    51.5    48.6    0.0     0.0     0.0
LAZARD-ACT    57.1    69.8    48.2    39.1    29.3    58.6
FIP-IXO-DE    58.7    71.1    50.0    50.0    100.0   33.3
BNP-Pariba    59.0    54.1    64.9    31.6    24.5    44.4
GREEN-BOND    59.5    67.1    53.6    27.6    52.2    18.8
KLE-EONIA-    60.4    70.3    53.0    63.5    61.4    65.9
ECUREUIL-P    67.8    73.0    63.4    55.7    53.1    58.6
Macro         52.1    59.1    48.5    31.9    39.0    33.1

Table 4: Results on the French test set, sorted by F-measure on sentences. F-measure on subtask1: 27.98 (vs. 26.2 for our original submission).

References

[Ait Azzi et al., 2019] Abderrahim Ait Azzi, Houda Bouamor, and Sira Ferra. The FinSBD-2019 shared task: sentence boundary detection in PDF noisy text in the financial domain. In Chung-Chi Chen, Hen-Hsen Huang, Hiroya Takamura, and Hsin-Hsi Chen, editors, Proceedings of the First Workshop on Financial Technology and Natural Language Processing, pages 74–80, Macao, China, August 2019.

[Antonacopoulos et al., 2009] Apostolos Antonacopoulos, David Bridson, Christos Papadopoulos, and Stefan Pletschacher. A realistic dataset for performance evaluation of document layout analysis. In Proceedings of the International Conference on Document Analysis and Recognition, ICDAR, pages 296–300, January 2009.

[Bachenko et al., 1995] Joan Bachenko, Eileen Fitzpatrick, and Jeffrey Daugherty. A rule-based phrase parser for real-time text-to-speech synthesis. Natural Language Engineering, 1(2):191–212, 1995.

[Bernhard et al., 2017] Delphine Bernhard, Amalia Todirascu, Fanny Martin, Pascale Erhart, Lucie Steible, Dominique Huck, and Christophe Rey. Problèmes de tokénisation pour deux langues régionales de France, l’alsacien et le picard. In DiLiTAL 2017, Actes de l’atelier “Diversité Linguistique et TAL”, pages 14–23, Orléans, France, June 2017.

[Corro, 2020] Caio Corro. Span-based discontinuous constituency parsing: a family of exact chart-based algorithms with time complexities from O(n⁶) down to O(n³), 2020.

[Dale et al., 2000] Robert Dale, H. L. Somers, and Hermann Moisl. Handbook of Natural Language Processing. Marcel Dekker, Inc., USA, 2000.

[Déjean and Meunier, 2006] Hervé Déjean and Jean-Luc Meunier. A system for converting PDF documents into structured XML format. In Horst Bunke and A. Lawrence Spitz, editors, Document Analysis Systems VII, pages 129–140, Berlin, Heidelberg, 2006. Springer Berlin Heidelberg.

[Déjean and Meunier, 2010] Hervé Déjean and Jean-Luc Meunier. Reflections on the INEX structure extraction competition. In Proceedings of the 9th IAPR International Workshop on Document Analysis Systems, DAS ’10, pages 301–308, New York, NY, USA, 2010. Association for Computing Machinery.

[Déjean, 2007] Hervé Déjean. pdf2xml open source software, 2007. Last access on July 31, 2019.

[Déjean, 2010] Hervé Déjean. Numbered sequence detection in documents. In Laurence Likforman-Sulem and Gady Agam, editors, Document Recognition and Retrieval XVII, volume 7534, pages 41–52. International Society for Optics and Photonics, SPIE, 2010.

[Gabay et al., 2019] Simon Gabay, Marine Riguet, and Loïc Barrault. A Workflow For On The Fly Normalisation Of 17th c. French. In DH2019, Utrecht, Netherlands, July 2019. ADHO.

[Giguet and Lejeune, 2019] Emmanuel Giguet and Gaël Lejeune. Daniel@FinTOC-2019 shared task: TOC extraction and title detection. In Proceedings of the Second Financial Narrative Processing Workshop (FNP 2019), pages 63–68, Turku, Finland, September 2019. Linköping University Electronic Press.

[Giguet and Lucas, 2010a] Emmanuel Giguet and Nadine Lucas. The book structure extraction competition with the Resurgence software at Caen University. In Shlomo Geva, Jaap Kamps, and Andrew Trotman, editors, Focused Retrieval and Evaluation, pages 170–178, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[Giguet and Lucas, 2010b] Emmanuel Giguet and Nadine Lucas. The book structure extraction competition with the Resurgence software for part and chapter detection at Caen University. In Shlomo Geva, Jaap Kamps, Ralf Schenkel, and Andrew Trotman, editors, Comparative Evaluation of Focused Retrieval - 9th International Workshop of the Initiative for the Evaluation of XML Retrieval, INEX 2010, Vugh, The Netherlands, December 13-15, 2010, Revised Selected Papers, volume 6932 of Lecture Notes in Computer Science, pages 128–139. Springer, 2010.

[Giguet et al., 2009] Emmanuel Giguet, Alexandre Baudrillart, and Nadine Lucas. Resurgence for the book structure extraction competition. In Shlomo Geva, Jaap Kamps, and Andrew Trotman, editors, INEX 2009 Workshop Pre-Proceedings, pages 136–142, 2009.

[Giguet, 1995] Emmanuel Giguet. Multilingual sentence categorization according to language. In Proceedings of the European Chapter of the Association for Computational Linguistics (EACL) SIGDAT Workshop “From text to tags: Issues in Multilingual Language Analysis”, pages 73–76, March 1995.

[Giguet, 2008] Emmanuel Giguet. Rapport scientifique du projet Résurgence. Technical report, Université de Caen Basse-Normandie, November 2008.

[Giguet, 2011] Emmanuel Giguet. De l’analyse syntaxique automatique à l’analyse automatique du discours dans les collections multilingues de documents numériques composites. Mémoire d’habilitation à diriger des recherches, Université de Caen Basse-Normandie, September 2011.

[Grefenstette and Tapanainen, 1994] Gregory Grefenstette and Pasi Tapanainen. What is a word, what is a sentence? Problems of tokenization. In Proceedings of the 3rd Conference on Computational Lexicography and Text Research, pages 79–87, 1994.

[Kiss and Strunk, 2006] Tibor Kiss and Jan Strunk. Unsupervised multilingual sentence boundary detection. Computational Linguistics, 32(4):485–525, 2006.

[Le and Mikolov, 2014] Quoc V. Le and Tomas Mikolov. Distributed representations of sentences and documents. CoRR, abs/1405.4053, 2014.

[Luc, 2001] Christophe Luc. Une typologie des énumérations basée sur les structures rhétoriques et architecturales du texte. In TALN2001, Université de Tours, 05/07/2001-07/07/2001, pages 263–272, July 2001.

[Martin et al., 2020] Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah, and Benoît Sagot. CamemBERT: a tasty French language model. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020.

[Maurel et al., 2006] Fabrice Maurel, Mustapha Mojahid, Nadine Vigouroux, and Jacques Virbel. Documents numériques et transmodalité. Transposition automatique à l’oral des structures visuelles de texte. Document numérique, 9, September 2006.

[Maurel, 2004] Fabrice Maurel. Transmodalité et multimodalité écrit/oral : modélisation, traitement automatique et évaluation de stratégies de présentation des structures “visuo-architecturale” des textes. PhD thesis, Université de Toulouse 3, 2004.

[Pascual and Virbel, 1996] Elsa Pascual and Jacques Virbel. Semantic and Layout Properties of Text Punctuation. In International Workshop on Punctuation in Computational Linguistics, 34th Annual Meeting of the Association for Computational Linguistics, Santa Cruz, USA, June 1996. Univ. of California.

[Rémi Juge, 2019] Sira Ferradans Rémi Juge, Najah-Imane Bentabet. The FinTOC-2019 shared task: Financial document structure extraction. In The Second Workshop on Financial Narrative Processing of NoDalida 2019, 2019.

[Smith, 2020] Noah A. Smith. Contextual word representations: Putting words into computers. Commun. ACM, 63(6):66–74, May 2020.

[Sorin, 2015] Laurent Sorin. Contributions of textual architectures to the non-visual accessibility of digital documents. Theses, Université Toulouse le Mirail - Toulouse II, December 2015.

[Virbel, 1999] Jacques Virbel. Structures textuelles. Planches. Fascicule I : Énumérations. Rapport de recherche, IRIT, Université Paul Sabatier, Toulouse, February 1999.

In: The Second Workshop on Financial Technology and Natural Language Processing, in conjunction with IJCAI-PRICAI 2020, Proceedings of the Workshop (pages 80-83)