The PatentSemTech 2019 workshop aims to establish a long-term collaboration and a two-way communication channel between the IP industry and academia in relevant fields such as natural language processing (NLP), text and data mining (TDM) and semantic technologies (ST). The goal is to explore and transfer new knowledge, methods and technologies for the benefit of industrial applications, and to support research in applied sciences for the IP and neighbouring domains. We had a brief conversation with the workshop's keynote speaker, Anthony Trippe of Patinformatics, LLC, as well as the organizers, researchers Linda Andersson (TU Wien) and Hidir Aras (FIZ Karlsruhe). In this interview, Linda, Anthony and Hidir talk about achievements and challenges from both perspectives, academia and industry, as well as the contents of the workshop itself.
The PatentSemTech 2019 workshop addresses patent text mining and semantic technologies. What are the main benefits of participating in an event that specifically addresses these areas?
Linda Andersson & Hidir Aras: The workshop aims to be a bridge between industry and academia in the field of intellectual property. In the last twenty years, semantic technologies such as ontologies, knowledge graphs and linked data on the one side, and emerging AI methods like deep learning on the other, have helped to resolve problems related to text understanding, text analysis, speech and image recognition and many others. For the IP domain, we believe that just applying these technologies to patent data will not suffice. Researchers should first gain insight into the specific challenges that real-world industrial applications pose when dealing with individual text mining problems. For example, paraphrasing, the so-called patentese, is a big challenge that distinguishes patents from common texts and makes it more difficult to understand the descriptions of a patent. The different types of language used in patents add to the difficulty. Hence, from the research perspective as well, new approaches for modelling and analysing semi-structured and unstructured data are needed.
Anthony Trippe: Patent text mining and semantic technologies represent a unique challenge in the machine learning field because patent text is different from conventional text: its primary objective is to provide legal protection, not to maximize comprehension by the audience. Since the author of a patent is their own lexicographer, they are free to use any language they like to describe a concept instead of the clearest description, free from ambiguity and potential misinterpretation. Because of this and other nuances associated with patent data, it is critical for machine learning practitioners to work closely with patent information professionals to maximize their understanding of this unique data source. At the same time, machine learning and semantic technologies are becoming standard in the patent information field and can provide value over conventional patent information retrieval techniques. It is critical that patent information professionals seize opportunities to work with the text mining and semantic communities so they can apply these approaches to their own work.
What are the biggest challenges ahead of us in this field?
Linda Andersson & Hidir Aras: Even though innovation is an essential part of future technologies and of society as a whole, the effort to support IP experts and researchers with enhanced solutions for searching and exploring scientific-technical information such as patents has been rather limited. The limitation is not due to a lack of effort, but rather to the fact that advanced AI solutions require in-depth domain knowledge. Making smart AI text mining solutions for patents requires human expertise, and this type of know-how is hard to come by when developing research prototypes. Despite these challenges, it is important to boost semi-automatic training data generation for learning problems in text and data mining with scientific-technical information, as many expert-based solutions are rule-based. Here, for all applied methods, recall is sacrificed for the sake of precision.
Within the framework of PatentSemTech, the first step would be to establish a set of benchmarking data and standardized datasets, helping researchers to compare their results with existing (mostly commercial) rule-based solutions, which need a lot of maintenance effort and hard-coding. Although some benchmarking data have been collected in the past, e.g. in the course of the CLEF-IP workshops, benchmarking data are still missing for many text mining tasks such as text segmentation, key term extraction and similarity calculation.
Anthony Trippe: There are many new systems coming onto the market that claim to enhance patent information retrieval and subsequent categorization using machine learning methods, but it is difficult to evaluate these systems and potentially even more difficult to customize them to the different needs of the patent information community. There is also a significant need for patent information professionals to become educated on how these systems work and how the user can influence them based on domain knowledge. There needs to be not only transparency from the development and research communities, but also increased knowledge in the patent information community about how these systems work and how they can be further developed.
What have been the most innovative approaches to these challenges so far?
Linda Andersson & Hidir Aras: From a text mining perspective, the innovative approaches have focused on adapting state-of-the-art text mining technologies to better suit the patent domain, exploring meta-data in combination with full-text retrieval, and deploying emerging technologies such as word embeddings for patent retrieval. A few research studies have also focused on how to reduce the need for expert human annotations by using different bootstrapping technologies and pre-semi-automatic labelling.
Anthony Trippe: Patents are multi-dimensional documents with many different attributes associated with them. Some of these are embedded in various unstructured text fields while others are captured in structured fields. The most innovative approaches will use as many attributes of the document as possible to reinforce the methods used and so increase their precision. Eventually, the best systems will also incorporate analyses of the images and tables associated with these documents.
What can we expect to discuss and learn about at PatentSemTech 2019?
Linda Andersson & Hidir Aras: Our aim is to bring together different research communities within the fields of text mining, natural language processing and semantic technologies, in order to establish a two-way communication channel between academic research and the IP industry. In doing so, we intend to provide interested researchers with a platform for finding out about specific challenges for applied science in the IP domain as well as neighbouring domains, e.g. the life sciences, in order to evaluate new and emerging methods and technologies.
The workshop will be more than a one-day event per year. Our intention is to make it into an active community with webinars on relevant topics, and with training and assessment activities that promote patent data mining and establish more benchmark data addressing different patent use cases.
Anthony Trippe: The opportunity to work on issues collectively cannot be overstated. A community approach to working on this area of research will allow the entire field to move forward more quickly and to provide a better result for all interested parties. Everyone participating can expect to leave the workshop having made a significant contribution to improving patent text mining, semantic technologies, and machine learning systems. Patent information professionals can also expect to leave the workshop with a greater understanding of how these systems work, how to evaluate them for use in their own work, and how to use them practically to become more precise and efficient in their jobs.
About Linda, Hidir and Anthony:
Linda Andersson has conducted text mining research in close connection with the IP industry for the last 15 years, working on different aspects of text mining. In 2009, Ms Andersson completed her master's thesis, “A Vector Space Analysis of Swedish Patent Claims, Does Decompounding Help?”, which was based on a collaboration with the Swedish Patent and Registration Office. For her PhD thesis, “The Essence of Patent Text Mining,” Linda continued working closely with the text mining industry. Part of Ms Andersson’s work and research is developing real-world patent text mining applications using natural language processing techniques. In her PhD research, Ms Andersson established a generic method for natural language annotation design for domain-specific text mining solutions for medical, legal and technical text. In 2018 she launched the product idea ‘Artificial Researcher in Science’, which received the Commercial Viability Award from the Austrian Angel Investors Association. Ms Andersson is the founder of the start-up Artificial Researcher-IT GmbH.
Dr. Hidir Aras is a research assistant and project manager for text and data mining at FIZ Karlsruhe. His applied research interests include big data analytics, text and data mining, and semantic analysis of patent information. Hidir Aras joined FIZ Karlsruhe in 2012 and was previously a research associate and PhD student at the University of Bremen, where he received his PhD on "Semantic Interaction in Web-based Retrieval Systems". Before that, after completing his studies in business informatics at the University of Mannheim, he worked for several years at the European Media Laboratory GmbH in Heidelberg on various research projects related to geographical information systems, intelligent mobile assistance and the Semantic Web.
Anthony Trippe is the Managing Director of Patinformatics LLC. Patinformatics is an advisory firm specializing in patent analytics and landscaping to support decision making for technology-based businesses. In addition to operating Patinformatics, Mr Trippe is an Adjunct Professor of IP Management and Markets at Illinois Institute of Technology, teaching a course on patent analysis and landscapes for strategic decision making. Mr Trippe has written or contributed to IP-related articles that have appeared in the Wall Street Journal, Forbes, The Washington Post and more than a dozen additional sources.