DataHubGc
DataHubGcSource is responsible for performing garbage collection tasks on DataHub.
This source performs the following tasks:
- Cleans up expired tokens.
- Truncates Elasticsearch indices based on configuration.
- Cleans up data processes and soft-deleted entities if configured.
CLI based Ingestion
Install the Plugin
The datahub-gc source works out of the box with acryl-datahub.
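As a rough illustration of what a recipe for this source could look like, the sketch below builds one as a Python dictionary and prints it as JSON. The source type datahub-gc and the option names come from this page; the datahub-rest sink, the server URL http://localhost:8080, and the helper name build_gc_recipe are assumptions chosen for the example, not values prescribed here.

```python
import json


def build_gc_recipe(server: str = "http://localhost:8080", dry_run: bool = True) -> dict:
    """Return an illustrative recipe dict for the datahub-gc source."""
    return {
        "source": {
            "type": "datahub-gc",
            "config": {
                # Start with a dry run; per the options table it only applies to
                # dataprocess cleanup and soft deleted entities cleanup.
                "dry_run": dry_run,
                "cleanup_expired_tokens": True,
                "truncate_indices": True,
                "dataprocess_cleanup": {"retention_days": 10, "keep_last_n": 5},
                "soft_deleted_entities_cleanup": {"retention_days": 10},
            },
        },
        # Assumption: a datahub-rest sink pointing at a locally running DataHub.
        "sink": {"type": "datahub-rest", "config": {"server": server}},
    }


if __name__ == "__main__":
    print(json.dumps(build_gc_recipe(), indent=2))
```

Because JSON is also valid YAML, the printed output should be usable as a recipe file, for example saved as recipe.yml and run with datahub ingest -c recipe.yml.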
Config Details
- Options
- Schema
Note that a . is used to denote nested fields in the YAML recipe.
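To make the dot notation concrete, here is a minimal sketch that expands dotted field names into the nested structure a recipe would actually contain. The function name expand_dotted is hypothetical and is not part of DataHub; it only illustrates how a field such as dataprocess_cleanup.retention_days maps to a retention_days key nested under dataprocess_cleanup.

```python
from typing import Any, Dict


def expand_dotted(flat: Dict[str, Any]) -> Dict[str, Any]:
    """Turn {"a.b": 1} into {"a": {"b": 1}}."""
    nested: Dict[str, Any] = {}
    for dotted_key, value in flat.items():
        # Everything before the last dot names a parent mapping.
        *parents, leaf = dotted_key.split(".")
        node = nested
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


if __name__ == "__main__":
    print(
        expand_dotted(
            {
                "dry_run": True,
                "dataprocess_cleanup.retention_days": 10,
                "dataprocess_cleanup.keep_last_n": 5,
            }
        )
    )
```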
Field | Description |
---|---|
cleanup_expired_tokens boolean | Whether to clean up expired tokens or not Default: True |
dry_run boolean | Whether to perform a dry run or not. This is only supported for dataprocess cleanup and soft deleted entities cleanup. Default: False |
truncate_index_older_than_days integer | Indices older than this number of days will be truncated Default: 30 |
truncate_indices boolean | Whether to truncate the Elasticsearch indices that can be safely truncated Default: True |
truncation_sleep_between_seconds integer | Sleep between truncation monitoring. Default: 30 |
truncation_watch_until integer | Wait for truncation of indices until this number of documents are left Default: 10000 |
dataprocess_cleanup DataProcessCleanupConfig | Configuration for data process cleanup |
dataprocess_cleanup.batch_size integer | The number of entities to get in a batch from GraphQL Default: 500 |
dataprocess_cleanup.delay number | Delay between each batch Default: 0.25 |
dataprocess_cleanup.delete_empty_data_flows boolean | Whether to delete Data Flows without runs Default: True |
dataprocess_cleanup.delete_empty_data_jobs boolean | Whether to delete Data Jobs without runs Default: True |
dataprocess_cleanup.hard_delete_entities boolean | Whether to hard delete entities Default: False |
dataprocess_cleanup.keep_last_n integer | Number of latest aspects to keep Default: 5 |
dataprocess_cleanup.max_workers integer | The number of workers to use for deletion Default: 10 |
dataprocess_cleanup.retention_days integer | Number of days to retain metadata in DataHub Default: 10 |
dataprocess_cleanup.aspects_to_clean array | List of aspect names to clean up Default: ['DataprocessInstance'] |
dataprocess_cleanup.aspects_to_clean.string string | |
soft_deleted_entities_cleanup SoftDeletedEntitiesCleanupConfig | Configuration for soft deleted entities cleanup |
soft_deleted_entities_cleanup.batch_size integer | The number of entities to get in a batch from GraphQL Default: 500 |
soft_deleted_entities_cleanup.delay number | Delay between each batch Default: 0.25 |
soft_deleted_entities_cleanup.max_workers integer | The number of workers to use for deletion Default: 10 |
soft_deleted_entities_cleanup.platform string | Platform to cleanup |
soft_deleted_entities_cleanup.query string | Query to filter entities |
soft_deleted_entities_cleanup.retention_days integer | Number of days to retain metadata in DataHub Default: 10 |
soft_deleted_entities_cleanup.env string | Environment to cleanup |
soft_deleted_entities_cleanup.entity_types array | List of entity types to cleanup |
soft_deleted_entities_cleanup.entity_types.string string | |
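The cleanup options batch_size, max_workers, delay, dry_run, and retention_days can be easier to reason about with a small model of how they might interact. The sketch below is not the source's implementation: the helper names retention_cutoff, batched, and cleanup are hypothetical, delete_fn stands in for whatever call removes an entity, and it assumes that entities older than the retention cutoff are the ones eligible for cleanup.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timedelta, timezone
from typing import Callable, Iterable, Iterator, List, Optional


def retention_cutoff(retention_days: int, now: Optional[datetime] = None) -> datetime:
    """Timestamp before which metadata is assumed to be eligible for cleanup."""
    now = now or datetime.now(timezone.utc)
    return now - timedelta(days=retention_days)


def batched(urns: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield lists of at most batch_size URNs."""
    batch: List[str] = []
    for urn in urns:
        batch.append(urn)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def cleanup(
    urns: Iterable[str],
    delete_fn: Callable[[str], None],
    batch_size: int = 500,
    max_workers: int = 10,
    delay: float = 0.25,
    dry_run: bool = False,
) -> int:
    """Process URNs batch by batch; return how many were considered."""
    considered = 0
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for batch in batched(urns, batch_size):
            if not dry_run:
                # Run deletions for this batch in parallel and wait for them.
                list(pool.map(delete_fn, batch))
            considered += len(batch)
            time.sleep(delay)  # pause between batches
    return considered
```

In this model a dry run walks the same batches but never calls delete_fn, which is one way to preview the amount of work before enabling real deletion.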
The JSONSchema for this configuration is inlined below.
{
"title": "DataHubGcSourceConfig",
"type": "object",
"properties": {
"dry_run": {
"title": "Dry Run",
"description": "Whether to perform a dry run or not. This is only supported for dataprocess cleanup and soft deleted entities cleanup.",
"default": false,
"type": "boolean"
},
"cleanup_expired_tokens": {
"title": "Cleanup Expired Tokens",
"description": "Whether to clean up expired tokens or not",
"default": true,
"type": "boolean"
},
"truncate_indices": {
"title": "Truncate Indices",
"description": "Whether to truncate elasticsearch indices or not which can be safely truncated",
"default": true,
"type": "boolean"
},
"truncate_index_older_than_days": {
"title": "Truncate Index Older Than Days",
"description": "Indices older than this number of days will be truncated",
"default": 30,
"type": "integer"
},
"truncation_watch_until": {
"title": "Truncation Watch Until",
"description": "Wait for truncation of indices until this number of documents are left",
"default": 10000,
"type": "integer"
},
"truncation_sleep_between_seconds": {
"title": "Truncation Sleep Between Seconds",
"description": "Sleep between truncation monitoring.",
"default": 30,
"type": "integer"
},
"dataprocess_cleanup": {
"title": "Dataprocess Cleanup",
"description": "Configuration for data process cleanup",
"allOf": [
{
"$ref": "#/definitions/DataProcessCleanupConfig"
}
]
},
"soft_deleted_entities_cleanup": {
"title": "Soft Deleted Entities Cleanup",
"description": "Configuration for soft deleted entities cleanup",
"allOf": [
{
"$ref": "#/definitions/SoftDeletedEntitiesCleanupConfig"
}
]
}
},
"additionalProperties": false,
"definitions": {
"DataProcessCleanupConfig": {
"title": "DataProcessCleanupConfig",
"type": "object",
"properties": {
"retention_days": {
"title": "Retention Days",
"description": "Number of days to retain metadata in DataHub",
"default": 10,
"type": "integer"
},
"aspects_to_clean": {
"title": "Aspects To Clean",
"description": "List of aspect names to clean up",
"default": [
"DataprocessInstance"
],
"type": "array",
"items": {
"type": "string"
}
},
"keep_last_n": {
"title": "Keep Last N",
"description": "Number of latest aspects to keep",
"default": 5,
"type": "integer"
},
"delete_empty_data_jobs": {
"title": "Delete Empty Data Jobs",
"description": "Whether to delete Data Jobs without runs",
"default": true,
"type": "boolean"
},
"delete_empty_data_flows": {
"title": "Delete Empty Data Flows",
"description": "Whether to delete Data Flows without runs",
"default": true,
"type": "boolean"
},
"hard_delete_entities": {
"title": "Hard Delete Entities",
"description": "Whether to hard delete entities",
"default": false,
"type": "boolean"
},
"batch_size": {
"title": "Batch Size",
"description": "The number of entities to get in a batch from GraphQL",
"default": 500,
"type": "integer"
},
"max_workers": {
"title": "Max Workers",
"description": "The number of workers to use for deletion",
"default": 10,
"type": "integer"
},
"delay": {
"title": "Delay",
"description": "Delay between each batch",
"default": 0.25,
"type": "number"
}
},
"additionalProperties": false
},
"SoftDeletedEntitiesCleanupConfig": {
"title": "SoftDeletedEntitiesCleanupConfig",
"type": "object",
"properties": {
"retention_days": {
"title": "Retention Days",
"description": "Number of days to retain metadata in DataHub",
"default": 10,
"type": "integer"
},
"batch_size": {
"title": "Batch Size",
"description": "The number of entities to get in a batch from GraphQL",
"default": 500,
"type": "integer"
},
"delay": {
"title": "Delay",
"description": "Delay between each batch",
"default": 0.25,
"type": "number"
},
"max_workers": {
"title": "Max Workers",
"description": "The number of workers to use for deletion",
"default": 10,
"type": "integer"
},
"entity_types": {
"title": "Entity Types",
"description": "List of entity types to cleanup",
"type": "array",
"items": {
"type": "string"
}
},
"platform": {
"title": "Platform",
"description": "Platform to cleanup",
"type": "string"
},
"env": {
"title": "Env",
"description": "Environment to cleanup",
"type": "string"
},
"query": {
"title": "Query",
"description": "Query to filter entities",
"type": "string"
}
},
"additionalProperties": false
}
}
}
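Two properties of the schema above matter in practice: every top-level scalar option has a default, and "additionalProperties": false means unknown keys are rejected. The following sketch mimics both behaviours with the standard library only. It is an illustration, not DataHub's validation code; the names TOP_LEVEL_DEFAULTS, NESTED_KEYS, and resolve_top_level are hypothetical, and the nested cleanup sections are passed through without being checked.

```python
from typing import Any, Dict

# Defaults copied from the top-level properties of the schema above.
TOP_LEVEL_DEFAULTS: Dict[str, Any] = {
    "dry_run": False,
    "cleanup_expired_tokens": True,
    "truncate_indices": True,
    "truncate_index_older_than_days": 30,
    "truncation_watch_until": 10000,
    "truncation_sleep_between_seconds": 30,
}

# Nested sections defined by the schema; not validated in this sketch.
NESTED_KEYS = {"dataprocess_cleanup", "soft_deleted_entities_cleanup"}


def resolve_top_level(user_config: Dict[str, Any]) -> Dict[str, Any]:
    """Reject unknown keys, then overlay user values on the defaults."""
    allowed = set(TOP_LEVEL_DEFAULTS) | NESTED_KEYS
    unknown = sorted(set(user_config) - allowed)
    if unknown:
        # Mirrors "additionalProperties": false.
        raise ValueError(f"Unknown config keys: {unknown}")
    return {**TOP_LEVEL_DEFAULTS, **user_config}


if __name__ == "__main__":
    print(resolve_top_level({"dry_run": True, "truncate_index_older_than_days": 60}))
```

A misspelled option such as dryrun raises an error here instead of being silently ignored, which is the practical consequence of the schema disallowing additional properties.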
Code Coordinates
- Class Name:
datahub.ingestion.source.gc.datahub_gc.DataHubGcSource
- Browse on GitHub
Questions
If you've got any questions on configuring ingestion for DataHubGc, feel free to ping us on our Slack.