Machine learning: spaCy 3.1 forwards predictions in the pipeline

The Berlin company Explosion AI has released version 3.1 of the natural language processing (NLP) Python library spaCy. One of the new features is the option to pass a component's predictions on to other components as annotations during training. A new component also labels arbitrary, potentially overlapping text passages.

The open source Python library spaCy is used for natural language processing (NLP), as is the Natural Language Toolkit (NLTK). While the latter mainly plays a role in academia, spaCy is aimed at production use. The Berlin company Explosion AI advertises it as "industrial-strength NLP in Python". Not least because of its German roots, German is one of the supported languages.

Much as the NumPy and pandas libraries offer ready-made methods for matrix operations, data science and numerical computation, spaCy offers ready-made functions for typical computational-linguistics tasks such as tokenization and lemmatization. The former segments a text into units such as words, sentences or paragraphs; the latter reduces inflected word forms to their base forms, the lemmas.
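A minimal sketch of both tasks, assuming spaCy and the small English model en_core_web_sm are installed (python -m spacy download en_core_web_sm); the model name and example sentence are only illustrative:

import spacy

# Load a small pre-trained English pipeline (assumed to be installed).
nlp = spacy.load("en_core_web_sm")
doc = nlp("The cats were sitting on the mats.")

# Tokenization: the Doc object behaves like a sequence of tokens.
print([token.text for token in doc])
# e.g. ['The', 'cats', 'were', 'sitting', 'on', 'the', 'mats', '.']

# Lemmatization: each token carries its base form (lemma).
print([token.lemma_ for token in doc])
# e.g. ['the', 'cat', 'be', 'sit', 'on', 'the', 'mat', '.']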

spaCy is implemented in Cython and offers numerous extensions, such as sense2vec, an extended form of word2vec, or Holmes for extracting information from German or English texts on the basis of predicate logic. Version 3.0 of the library introduced transformer-based pipelines.

The training process for components is usually isolated: the individual components have no insight into the predictions of the components ahead of them in the pipeline. The current release allows components to write annotations during training that other components can access. The new configuration setting training.annotating_components defines which components write annotations.

In this way, for example, the information on grammatical structure from the dependency parser can be used for tagging with the Tok2Vec layer, as an example from the spaCy documentation shows.
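That example is not reproduced verbatim here; the following shortened Python sketch merely illustrates the relevant configuration keys. It assumes a pipeline of parser and tagger, lists the parser as an annotating component, and lets the tagger's Tok2Vec embedding additionally use the dependency label (DEP) predicted by the parser. Loading the excerpt via thinc's Config only demonstrates the structure; a real training config contains many more sections:

from thinc.api import Config

# Illustrative excerpt, not a complete training config: the parser writes
# its predictions during training, and the tagger's embedding layer uses
# the NORM attribute plus the dependency label DEP set by the parser.
config_excerpt = Config().from_str("""
[nlp]
lang = "en"
pipeline = ["parser","tagger"]

[components.tagger.model.tok2vec.embed]
attrs = ["NORM","DEP"]

[training]
annotating_components = ["parser"]
""")

print(config_excerpt["training"]["annotating_components"])  # ['parser']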

Annotations may come from both regular and frozen components (frozen_components); the latter are not updated during training. For annotating components that are not frozen, the procedure incurs an overhead, since they require a double pass during training: the first pass updates the model, which then serves as the basis for the predictions in the second pass.

spaCy 3.1 introduces the new SpanCategorizer component for labeling arbitrary text passages, which may overlap or be nested. The component, currently marked as experimental, is intended to cover cases in which Named Entity Recognition (NER) reaches its limits: NER categorizes the individual entities of a text, but these must be clearly separable.
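A minimal sketch of the data structure the component works with, assuming the factory name "spancat" and its default spans key "sc"; training is omitted here, and the spans are set by hand purely to show how overlapping and nested labels are stored, so the labels and indices are illustrative:

import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
# Register the experimental span categorizer; without training it makes
# no predictions yet, so the pipeline is not run here.
nlp.add_pipe("spancat")

doc = nlp.make_doc("The Berlin company Explosion AI has released spaCy 3.1.")
# Unlike doc.ents, the span groups in doc.spans may overlap or be nested -
# this is the structure a trained spancat component predicts into.
doc.spans["sc"] = [
    Span(doc, 1, 5, label="ORG_DESC"),  # "Berlin company Explosion AI"
    Span(doc, 3, 5, label="ORG"),       # "Explosion AI", nested inside it
]
for span in doc.spans["sc"]:
    print(span.text, span.label_)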

In parallel with the new component, Explosion AI has published a pre-release version of the annotation tool Prodigy, which among other things offers a new UI for annotating nested and overlapping passages. The annotations defined there can be used as training data for the SpanCategorizer.

Prodigy enables the labeling of overlapping text passages.

(Image: ExplosionAI)

Further innovations in spaCy 3.1, such as the additional pipeline packages for Catalan and Danish and the direct integration with the Hugging Face Hub, are covered in the Explosion AI blog.

(rme)
