Scott Martens
2025
jina-embeddings-v4: Universal Embeddings for Multimodal Multilingual Retrieval
Michael Günther | Saba Sturua | Mohammad Kalim Akram | Isabelle Mohr | Andrei Ungureanu | Bo Wang | Sedigheh Eslami | Scott Martens | Maximilian Werk | Nan Wang | Han Xiao
Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025)
We introduce jina-embeddings-v4, a 3.8 billion parameter embedding model that unifies text and image representations, with a novel architecture supporting both single-vector and multi-vector embeddings. It achieves high performance on both single-modal and cross-modal retrieval tasks, and is particularly strong in processing visually rich content such as tables, charts, diagrams, and mixed-media formats that incorporate both image and textual information. We also introduce JVDR, a novel benchmark for visually rich document retrieval that includes more diverse materials and query types than previous efforts. We use JVDR to show that jina-embeddings-v4 greatly improves on state-of-the-art performance for these kinds of tasks.
2014
Thomas Aquinas in the TüNDRA: Integrating the Index Thomisticus Treebank into CLARIN-D
Scott Martens | Marco Passarotti
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
This paper describes the integration of the Index Thomisticus Treebank (IT-TB) into the web-based treebank search and visualization application TueNDRA (Tuebingen aNnotated Data Retrieval & Analysis). TueNDRA was originally designed to provide access via the Internet to constituency treebanks and to tools for searching and visualizing them, as well as tabulating statistics about their contents. TueNDRA has now been extended to also provide full support for dependency treebanks with non-projective dependencies, in order to integrate the IT-TB and future treebanks with similar properties. These treebanks are queried using an adapted form of the TIGERSearch query language, which can search both hierarchical and sequential information in treebanks in a single query. Because TueNDRA is a web application, making the IT-TB accessible through it makes the treebank and the tools for using it available to a large community without having to distribute software and show users how to install it.
2012
Large aligned treebanks for syntax-based machine translation
Gideon Kotzé | Vincent Vandeghinste | Scott Martens | Jörg Tiedemann
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
We present a collection of parallel treebanks that have been automatically aligned on both the terminal and the nonterminal constituent level for use in syntax-based machine translation. We describe how they were constructed and applied to a syntax- and example-based machine translation system called Parse and Corpus-Based Machine Translation (PaCo-MT). For the language pair Dutch to English, we present evaluation scores of both the nonterminal constituent alignments and the MT system itself, and in the latter case, compare them with those of Moses, a current state-of-the-art statistical MT system, when trained on the same data.
2010
Bottom-up Transfer in Example-based Machine Translation
Vincent Vandeghinste | Scott Martens
Proceedings of the 14th Annual Conference of the European Association for Machine Translation
Varro: An Algorithm and Toolkit for Regular Structure Discovery in Treebanks
Scott Martens
Coling 2010: Posters
An Efficient, Generic Approach to Extracting Multi-Word Expressions from Dependency Trees
Scott Martens | Vincent Vandeghinste
Proceedings of the 2010 Workshop on Multiword Expressions: from Theory to Applications