Learning-based Methods with Human-in-the-loop for Entity Resolution

Tutorial at CIKM 2019

Jane

ABSTRACT

This tutorial is intended for researchers and practitioners working in the data integration area and, in particular, entity resolution (ER), which is the sub-area focused on linking entities across heterogeneous datasets. We outline the ideal requirements of modern ER systems: (1) capture domain knowledge via (minimal) human interaction, (2) provide as much automation as possible via machine learning techniques, and (3) achieve high explainability. We describe recent research trends towards bringing such ideal ER systems closer to reality. We first overview human-in-the-loop methods that are based on techniques such as crowdsourcing and active learning. We then dive into recent trends that involve deep learning techniques such as representation learning to automate feature engineering, and combinations of transfer learning and active learning to reduce the amount of required user labels. We also discuss how explainable AI is related to ER, and overview some recent advances towards explainable ER.


Schedule and Outline

Time: 13:30 to 17:00, Sunday, November 3, 2019

Location: Room 301A

  • 1. Introduction (20 mins)
  • An overview of the entity resolution problem and the scope of this tutorial.

  • 2. Explainability (35 mins)
  • Overview techniques and frameworks for explainable entity resolution.

  • 3. Crowdsourcing (35 mins)
  • Discuss different crowdsourcing-based approaches for entity resolution.

  • Break (15:00 - 15:30)

  • 4. Active Learning (40 mins)
  • Discuss different active learning-based approaches for entity resolution.

  • 5. Deep Learning (40 mins)
  • Discuss recent advances in deep learning-based entity resolution.

  • 6. Concluding Remarks (10 mins)
  • Conclude the tutorial and discuss open questions and future research directions.


Who we are
Jane

Sairam Gurajada

IBM Research - Almaden

Lucian Popa

IBM Research - Almaden


Sairam Gurajada is a researcher at IBM Research - Almaden. His current work focuses on developing both non-interpretable and interpretable learning models for the well-known, yet hard, entity resolution problem in databases. Prior to this, he worked on efficient and scalable techniques for distributed querying of large labeled graphs and on online index maintenance for IR systems. His work has been published at conferences including SIGMOD, CIKM, IJCAI, and ACL.

Lucian Popa is a Principal Research Staff Member and manager at IBM Research - Almaden. He is known for his work on data exchange and schema mapping, for which he received two Test-of-Time Awards (ICDT 2013 and PODS 2014), and for his work on declarative foundations and tools for entity resolution, for which he received a Best Paper Award at ICDT 2015. He has contributed to several IBM products in the area of information integration and entity resolution, and he is an ACM Distinguished Member.

Kun Qian is a researcher at IBM Research - Almaden, working on human-in-the-loop machine learning for entity understanding. He is particularly interested in building intelligent human-in-the-loop systems, with intuitive user interfaces, for entity understanding and knowledge creation. His work on entity resolution and related topics has been published in CIKM, PODS, COLING, AAAI, and ACL. He has also developed several active-learning-based systems for entity name normalization and entity resolution that were included in the demo tracks of ICDE 2018, VLDB 2019, and AAAI 2020.

Prithviraj Sen is a researcher at IBM Research - Almaden. His core interests lie in developing algorithms for learning explainable models. He has worked in diverse areas, including learning ER rules using active learning and learning explainable rules using neural networks, and has published in ICML, UAI, KDD, VLDB, ICDE, ACL, and CIKM, among others. He is a contributor to Apache SystemML, an open source engine for large-scale machine learning (the subject of a KDD tutorial), and has contributed to multiple IBM products.