Graft uses a number of job types and processing resources to support your business case. Jobs are most commonly initiated from the Entity Dashboard. This document provides a short description of each job type's function.
This job connects to the user's data sources and saves the data into Graft for downstream processing. Note: In some cases (e.g., an S3 bucket of images), only URIs to the data objects are saved, not the raw data. In other cases (e.g., rows in a database), the raw data is saved.
The enrichment is defined by specifying labels for the data and a way to join the labels with the raw data/embeddings. This job ingests the data for the labels from a user-defined data source (similar to data ingestion, but for the labels).
Note: This job appears only for enrichments created by the user (not for pretrained enrichments).
An enrichment is defined by specifying targets (e.g. labels for classification, numerical values for regression) for the data and a way to join with the raw data/embeddings. This job ingests targets from a user-defined data source (similar to data ingestion, but for the targets).
Note: This job appears only for custom enrichments created by the user (not for pretrained enrichments).
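The join between ingested targets and the raw data can be pictured as a key-based lookup. The following is a minimal sketch only: the record shapes, the `id` join key, and the `join_targets` helper are hypothetical and do not reflect Graft's actual storage or API.

```python
# Hypothetical records; the field names ("id", "text", "label") are illustrative only.
entity_rows = [
    {"id": 1, "text": "great product"},
    {"id": 2, "text": "arrived broken"},
    {"id": 3, "text": "ok for the price"},
]
target_rows = [
    {"id": 1, "label": "positive"},
    {"id": 2, "label": "negative"},
]


def join_targets(rows, targets, key="id", target_field="label"):
    """Inner-join targets onto rows by key; rows without a target are dropped."""
    target_by_key = {t[key]: t[target_field] for t in targets}
    return [
        {**row, target_field: target_by_key[row[key]]}
        for row in rows
        if row[key] in target_by_key
    ]


training_examples = join_targets(entity_rows, target_rows)
```

In this sketch, row 3 has no matching target, so it would simply not contribute a training example.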
This job computes the vector representation of the raw data selected by the user to be embedded (during the entity creation step). This job appears only if the entity has an embedding defined and it depends on the data having been ingested in the previous step.
This job executes multiple steps of training the enrichment model on the provided data and labels. This fine-tuning job uses the embeddings computed in the prior step, so it depends on the embedding job. Note: This job only applies to new enrichments.
This job uses the enrichment model trained by the finetune job or a Graft pretrained enrichment model (depending on the type of the enrichment defined) to predict/infer labels for each data point.
Note: This job depends on embeddings having been computed. If the enrichment is not a Graft pretrained enrichment, then it also depends on the ingest-labels and finetune jobs (because the model used for the prediction has to be trained first).
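The dependencies described above can be sketched as a small graph and ordered with Python's standard-library `graphlib`. This is an illustration under assumptions: the job names `ingest-data`, `embed`, and `predict` are hypothetical stand-ins, the edge from `ingest-labels` to `finetune` is assumed from the fact that training uses labels, and the `job_order` helper is not part of Graft.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def job_order(custom_enrichment: bool) -> list:
    """Return one valid execution order for the jobs of an entity."""
    # Each key maps a job to the set of jobs it depends on.
    deps = {
        "ingest-data": set(),
        "embed": {"ingest-data"},
        "predict": {"embed"},
    }
    if custom_enrichment:
        # Custom enrichments must be trained before they can predict.
        deps["ingest-labels"] = set()
        deps["finetune"] = {"embed", "ingest-labels"}  # label edge is an assumption
        deps["predict"] |= {"ingest-labels", "finetune"}
    return list(TopologicalSorter(deps).static_order())


pretrained_order = job_order(custom_enrichment=False)
custom_order = job_order(custom_enrichment=True)
```

For a pretrained enrichment the graph is a simple chain; for a custom enrichment, `finetune` always lands after `embed` and before `predict`.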
Job Description: Optional: Generate clusters from embeddings
The cluster job generates a cluster label for every entity instance based on a user-provided cluster count. Cluster labels start at 0. For example, if the user provides a cluster count of 4, Graft attempts to cluster the embedding data, determines the closest match for every entity instance (row), and adds a field called GRAFT_CLUSTER_LABEL with one of the values cluster_0, cluster_1, cluster_2, or cluster_3.
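To make the labeling scheme concrete, here is a minimal k-Means sketch (Lloyd's algorithm) in plain Python. It is not Graft's implementation: the `kmeans_labels` helper, the toy 2-D embeddings, and the iteration and seed parameters are assumptions chosen for illustration. Only the GRAFT_CLUSTER_LABEL field name and the `cluster_<n>` value format come from the description above.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_labels(points, k, iterations=20, seed=0):
    """Run Lloyd's algorithm and return one 'cluster_<n>' label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return [f"cluster_{a}" for a in assignments]


# Toy 2-D "embeddings"; real embeddings have many more dimensions.
embeddings = [
    (0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9),
    (0.1, 5.0), (0.0, 5.2), (5.1, 0.0), (4.9, 0.2),
]
labels = kmeans_labels(embeddings, k=4)
rows = [{"GRAFT_CLUSTER_LABEL": label} for label in labels]
```

Note that k-Means depends on its starting centroids, so a different seed can produce a different assignment of points to cluster numbers.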
Clustering is an alternative way to predict the labels of things without having to have actual labels.
For a “classify into one of N classes” use-case, one might nominally try to cluster with k = N to see if the data clusters cleanly in such a fashion. If it does cluster well, specifically if each of the N classes falls into a different cluster, then your classifier should work extremely well. The converse does not hold, however, as there are many ways a supervised classifier can work well while k-Means clustering does not:
- k-Means makes assumptions about the distribution of the data, namely that all clusters are Gaussian distributed with the same variance (intra-cluster spread)
- There might be more or fewer than N “clusters” in the data, independent of the N classes
- Classes might fall into the same nominal cluster (because they’re closer to one another in the embedding space than they are to the other classes)
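One way to check how well clusters line up with known classes is cluster purity: the fraction of points that belong to the majority class of their cluster. The sketch below is illustrative only; the `cluster_purity` helper and the sample labels are hypothetical, and Graft is not stated to report this metric.

```python
from collections import Counter, defaultdict


def cluster_purity(class_labels, cluster_labels):
    """Fraction of points whose class is the majority class of their cluster."""
    classes_by_cluster = defaultdict(Counter)
    for cls, cluster in zip(class_labels, cluster_labels):
        classes_by_cluster[cluster][cls] += 1
    majority_total = sum(
        counts.most_common(1)[0][1] for counts in classes_by_cluster.values()
    )
    return majority_total / len(class_labels)


classes = ["cat", "cat", "dog", "dog", "fox", "fox"]

# Each class falls into its own cluster.
clean = ["cluster_0", "cluster_0", "cluster_1", "cluster_1", "cluster_2", "cluster_2"]
# "dog" and "fox" share a cluster, as in the last bullet above.
merged = ["cluster_0", "cluster_0", "cluster_1", "cluster_1", "cluster_1", "cluster_1"]

clean_purity = cluster_purity(classes, clean)    # 1.0
merged_purity = cluster_purity(classes, merged)  # 4 of 6 points match a majority class
```

A low purity does not by itself mean a supervised classifier will fail, for the reasons listed above.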
Job Description: Optional: Generate projections from embeddings (Enable visualizations)
The generate projections job flattens the high-dimensional embeddings for each entity instance into 3 dimensions, allowing the data to be displayed graphically and providing a simplified representation of the entity instance.
Projections are helpful for visualizing results, for example with classifiers, to find outliers or errors.
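The source does not say which projection method Graft uses. As one common option, the sketch below reduces vectors to 3 dimensions with principal component analysis (PCA), computed by power iteration in plain Python. The `pca_project` helper, its parameters, and the synthetic 5-dimensional input are all assumptions for illustration.

```python
import math
import random


def pca_project(vectors, n_components=3, iterations=200, seed=0):
    """Project vectors onto their top principal components (power iteration)."""
    n, d = len(vectors), len(vectors[0])
    means = [sum(v[j] for v in vectors) / n for j in range(d)]
    centered = [[v[j] - means[j] for j in range(d)] for v in vectors]
    # d x d covariance matrix of the centered data.
    cov = [
        [sum(r[i] * r[j] for r in centered) / n for j in range(d)]
        for i in range(d)
    ]
    rng = random.Random(seed)
    components = []
    for _ in range(n_components):
        w = [rng.gauss(0, 1) for _ in range(d)]
        for _ in range(iterations):
            w = [sum(cov[i][j] * w[j] for j in range(d)) for i in range(d)]
            # Remove the directions already found so the next one is orthogonal.
            for c in components:
                dot = sum(a * b for a, b in zip(w, c))
                w = [a - dot * b for a, b in zip(w, c)]
            norm = math.sqrt(sum(a * a for a in w))
            if norm < 1e-12:  # no variance left in any remaining direction
                w = [0.0] * d
                break
            w = [a / norm for a in w]
        components.append(w)
    # Coordinates of each centered vector along each component.
    return [[sum(a * b for a, b in zip(r, c)) for c in components] for r in centered]


# Synthetic 5-dimensional "embeddings" with decreasing spread per dimension.
embeddings = [[math.sin(i * (j + 1)) * (5 - j) for j in range(5)] for i in range(20)]
projected = pca_project(embeddings)  # 20 points, 3 coordinates each
```

Each projected point can then be plotted in a 3-D scatter view, where isolated points are candidates for outliers or labeling errors.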
This job further trains a trunk model with the user's data. A trunk model has already learned a set of useful features from the dataset it was previously trained on. This job can improve the model's performance on the user's tasks while requiring less data and fewer computational resources than training a new model from scratch.