Prashant D., Chief Technology Office, JPMorgan Chase
October 10, 2019
A key barrier to adopting machine learning is not a lack of data but a lack of labeled data. Labeling data is expensive, and the difficulty of sharing and managing large datasets for model development makes it a struggle to get machine learning projects off the ground.
That’s where our “learn more from less data” approach comes into action. At JPMorgan Chase, we are focused on reducing the amount of data needed to build models. We do this by building gold training datasets, which helps reduce labeling cost and increase the agility of model development.
Labeled data is a group of samples that have been tagged with one or more labels. Once a labeled dataset is available, a machine learning model can be trained on it so that, when new, unlabeled data is presented, the model can predict a likely label for it. A gold training dataset is a small, labeled dataset with high predictive power.
Active learning is a form of semi-supervised learning, which works well when you have a lot of data but face the expense of getting that data labeled. By identifying the samples that are most informative, teams can label only the data points that improve the quality of the model.
Using machine learning (ML) models, active learning can help identify difficult data points and ask a human annotator to focus on labeling them.
To explain passive learning and active learning, let’s use the analogy of teacher and student. In the passive learning approach, a student learns by listening to the teacher's lecture. In active learning, the teacher describes concepts, students ask questions, and the teacher spends more time explaining the concepts that are difficult for a student to understand. Student and teacher interact and collaborate in the learning process.
In ML model development using active learning, annotator and modeler interact and collaborate. An annotator provides a small labeled dataset. The modeling team builds a model and generates input on what to label next. Within a few iterations, teams can build refined requirements, a labeled gold training set, an active learner and a working machine learning model.
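The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the team's implementation: the one-dimensional data, the nearest-centroid model and the `oracle` function (a stand-in for the human annotator) are all hypothetical.

```python
import math

def train_centroids(labeled):
    """Fit a nearest-centroid model: one mean value per class."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict_proba(centroids, x):
    """Convert distances to each centroid into class probabilities."""
    scores = {y: math.exp(-abs(x - c)) for y, c in centroids.items()}
    total = sum(scores.values())
    return {y: s / total for y, s in scores.items()}

def most_uncertain(centroids, pool):
    """Return the pool item whose top predicted probability is lowest."""
    return min(pool, key=lambda x: max(predict_proba(centroids, x).values()))

def oracle(x):
    """Hypothetical stand-in for the human annotator."""
    return "high" if x >= 5.0 else "low"

def active_learning_loop(seed, pool, rounds):
    """Train, query the most uncertain sample, label it, and repeat."""
    labeled, pool = list(seed), list(pool)
    for _ in range(rounds):
        if not pool:
            break
        model = train_centroids(labeled)
        query = most_uncertain(model, pool)
        pool.remove(query)
        labeled.append((query, oracle(query)))
    return train_centroids(labeled), labeled

seed = [(1.0, "low"), (9.0, "high")]           # small labeled starting set
pool = [0.5, 2.0, 4.0, 4.8, 5.5, 6.0, 8.0, 9.5]  # unlabeled data
model, labeled = active_learning_loop(seed, pool, rounds=3)
print(labeled[2:])  # the three samples sent to the annotator
```

In this toy run the queries cluster around the class boundary (4.8, then 6.0, then 5.5), which is the behavior the teacher and student analogy describes: labeling effort goes where the model is least sure.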
To identify difficult data points, we use a combination of methods, including:
Classification uncertainty sampling: When querying for labels, the strategy selects the sample with the highest uncertainty, that is, the sample whose most likely class has the lowest predicted probability. These are the data points the model knows least about. Labeling these data points makes the ML model more knowledgeable.
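As a rough sketch of this strategy, uncertainty can be scored as one minus the probability of the most likely class. The function name and the probability rows below are hypothetical; in practice the rows would come from a trained classifier.

```python
def least_confident(probabilities):
    """Return the index of the sample with the highest uncertainty.

    Uncertainty is 1 minus the probability of the most likely class.
    """
    uncertainty = [1.0 - max(row) for row in probabilities]
    return max(range(len(probabilities)), key=lambda i: uncertainty[i])

# Hypothetical class probabilities for three unlabeled samples.
probs = [
    [0.90, 0.05, 0.05],  # confident
    [0.40, 0.35, 0.25],  # top class has only 0.40
    [0.50, 0.49, 0.01],
]
query_index = least_confident(probs)
print(query_index)  # 1
```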
Margin uncertainty: When querying for labels, the strategy selects the sample with the smallest margin, where the margin is the difference between the predicted probabilities of the two most likely classes. These are data points the model knows about but isn’t confident enough to classify well. Labeling these examples increases model accuracy.
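A minimal sketch of margin-based selection, assuming the margin is the gap between the two highest class probabilities. The function names and probability rows are hypothetical.

```python
def margin(row):
    """Gap between the two most likely classes for one sample."""
    top_two = sorted(row, reverse=True)[:2]
    return top_two[0] - top_two[1]

def smallest_margin(probabilities):
    """Return the index of the sample with the smallest margin."""
    return min(range(len(probabilities)), key=lambda i: margin(probabilities[i]))

# Hypothetical class probabilities for three unlabeled samples.
probs = [
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.50, 0.49, 0.01],  # two classes are nearly tied
]
query_index = smallest_margin(probs)
print(query_index)  # 2
```

Here the third row is selected even though its top probability (0.50) is higher than the second row's (0.40), because its two leading classes are almost indistinguishable.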
Entropy sampling: Entropy is a measure of uncertainty. It is proportional to the average number of guesses one has to make to find the true class. In this approach, we pick the samples with the highest entropy.
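For illustration, the Shannon entropy of a predicted class distribution can be computed and used to rank samples as follows. The function names and probability rows are hypothetical.

```python
import math

def entropy(row):
    """Shannon entropy (in bits) of one predicted class distribution."""
    return -sum(p * math.log2(p) for p in row if p > 0)

def highest_entropy(probabilities):
    """Return the index of the sample with the highest entropy."""
    return max(range(len(probabilities)), key=lambda i: entropy(probabilities[i]))

# Hypothetical class probabilities for three unlabeled samples.
probs = [
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],  # probability spread across all classes
    [0.50, 0.49, 0.01],
]
query_index = highest_entropy(probs)
print(query_index)  # 1
```

Unlike the margin, entropy takes every class into account, so a distribution spread over many classes scores higher than one that is split between only two.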
Disagreement-based sampling: With this method, we pick the samples on which different algorithms disagree. For example, suppose the model classifies into five classes (A, B, C, D and E) and we use five different classifiers, e.g.
1. Bag of words
5. HAN (Hierarchical Attention Networks)
The annotator can then label the examples on which the classifiers disagree.
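One simple way to quantify disagreement is the share of classifiers that do not vote for the majority label. The sketch below assumes this measure; the article does not state which one is used, and the document IDs and votes are hypothetical.

```python
from collections import Counter

def disagreement(votes):
    """Share of classifiers that do not vote for the majority label."""
    top_count = Counter(votes).most_common(1)[0][1]
    return 1.0 - top_count / len(votes)

# Hypothetical predictions from five classifiers over classes A to E.
committee_votes = {
    "doc-1": ["A", "A", "A", "A", "A"],  # full agreement
    "doc-2": ["A", "B", "A", "C", "A"],
    "doc-3": ["B", "C", "D", "E", "A"],  # every classifier differs
}

# Rank documents so the most contested ones reach the annotator first.
ranked = sorted(
    committee_votes,
    key=lambda doc: disagreement(committee_votes[doc]),
    reverse=True,
)
print(ranked)  # ['doc-3', 'doc-2', 'doc-1']
```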
Information density: In this approach, we focus on the denser regions of the data and select a few points in each dense region. Labeling these data points helps the model classify the large number of data points around them.
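A rough sketch of this idea follows, assuming density is measured as the average similarity of a point to the rest of the pool and that selected points must be a minimum distance apart so they come from different regions. The similarity function, the `min_gap` parameter and the sample points are assumptions made for illustration.

```python
import math

def density(i, pool):
    """Average similarity of pool[i] to every other point in the pool."""
    sims = [math.exp(-math.dist(pool[i], p)) for j, p in enumerate(pool) if j != i]
    return sum(sims) / len(sims)

def pick_dense_representatives(pool, k, min_gap):
    """Greedily pick up to k high-density points at least min_gap apart."""
    order = sorted(range(len(pool)), key=lambda i: density(i, pool), reverse=True)
    chosen = []
    for i in order:
        if all(math.dist(pool[i], c) >= min_gap for c in chosen):
            chosen.append(pool[i])
        if len(chosen) == k:
            break
    return chosen

# Hypothetical 2-D pool: two dense clusters and one isolated outlier.
pool = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),  # cluster near the origin
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),              # cluster near (5, 5)
    (10.0, 0.0),                                     # outlier
]
chosen = pick_dense_representatives(pool, k=2, min_gap=1.0)
print(chosen)
```

With these points, one representative is taken from each cluster and the outlier is skipped, since a label for it would tell the model little about the rest of the data.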
Business value: In this method, we focus on labeling the data points that have higher business value than the others.
Traditionally, data scientists have worked with annotators to label a portion of their data and hoped for the best when training their model. If the model wasn’t sufficiently predictive, more data would be labeled, and they would try again until its performance reached an acceptable level. While this approach still makes sense for some problems, for those that have vast amounts of data or unstructured data, we find that active learning is a better solution.
Active learning combines the power of machine learning with human annotators to select the next best data points to label. This intelligent selection leads to the creation of high-performance models in less time and at lower cost.
The Artificial Intelligence & Machine Learning group is focused on increasing the volume and velocity of AI applications across the firm by helping develop common platforms, reusable services and solutions.
Copyright 2020 JPMorgan Chase & Co. All rights reserved.