需求:
在实际的项目实施过程中,经常有统计不同机构间的订阅、数据同步情况的需求。
统计信息位置:
业务上的目录、资源以及订阅关系存储于DataCatalog支撑库中;
每次任务执行的状态、具体输入及同步具体数据条数等信息位于DataExchange支撑库中。
实际场景示例:
1、统计资源名称、订阅机构、发布机构、数据源名称、交换时间、数据量、状态(是否交换成功)
DataCatalog支撑库SQL:
SELECT DISTINCT (dist.title) AS 资源名称, sub.` TASK_O_ID ` AS 任务唯一标识 FROM dcat_distribution dist, dcat_distribution_subscription sub WHERE sub.task_o_id = "aa43efab9b8f0ef26141b022eb37da25" AND dist. IDENTIFIER = sub.` DIST_IDENTIFIER `
DataExchange支撑库SQL:
SELECT targetStation. NAME AS 订阅机构, sourceStation. NAME AS 发布机构, sourceDs.` NAME ` AS 数据源名称, cdl.INPUT_COUNT AS 交换数据量, cdl.` END_TIME ` AS 交换时间, cdl.` STATE ` AS 状态, cdl.` TASK_OID ` AS 任务唯一标识 FROM ude_component_detail_log cdl, ude_datasource sourceDs, ude_datasource targetDs, ude_station sourceStation, ude_station targetStation WHERE cdl.TGT_DS_ID IS NOT NULL AND cdl.SRC_DS_ID = sourceDs. ID AND cdl.TGT_DS_ID = targetDs. ID AND sourceDs.STATION_ID = sourceStation. ID AND targetDs.STATION_ID = targetStation. ID
低版本DataExchange的ude_component_detail_log表没有taskoid字段,可以使用如下支撑库SQL:
SELECT targetStation. NAME AS 订阅机构, sourceStation. NAME AS 发布机构, sourceDs.`NAME` AS 数据源名称, cdl.INPUT_COUNT AS 交换数据量, cdl.`END_TIME` AS 交换时间, cdl.`STATE` AS 状态, task.`O_ID` AS 任务唯一标识 FROM ude_component_detail_log cdl, ude_datasource sourceDs, ude_datasource targetDs, ude_station sourceStation, ude_station targetStation, ude_task task, ude_task_transmodel ref WHERE cdl.TGT_DS_ID IS NOT NULL AND cdl.SRC_DS_ID = sourceDs.id AND cdl.TGT_DS_ID = targetDs.id AND sourceDs.STATION_ID = sourceStation.id AND targetDs.STATION_ID = targetStation.id AND cdl.`TRANSMODEL_ID` = ref.`TRANSMODEL_ID` AND ref.`TASK_ID` = task.`ID`
两个支撑库通过“任务唯一标识”关联,项目组也可以使用DataExchange创建数据同步任务,将相关表数据同步到一个数据库中查询统计。
2、目录供需情况查询:
DataCatalog支撑库SQL:
SELECT dist.title AS 资源名称, dist.description AS 资源摘要, orggx.NAME AS 共享部门, orgxq.NAME AS 需求部门, ds.`IDENTIFIER` AS 点击资源查看资源详情页的目录关键字, dist.identifier AS 点击资源查看资源详情页的资源关键字, ds.version_no AS versionNo, ds.id AS dataSetId, ds.is_current AS isCurrent FROM dcat_distribution dist, dcat_organization orggx, dcat_organization orgxq, dcat_dataset ds, dcat_distribution_subscription distSub WHERE dist.DATASET_ID = ds.ID AND ds.ORG_ID = orggx.ID AND distSub.`DIST_IDENTIFIER` = dist.`IDENTIFIER` AND orggx.`ID` = distSub.`PUBLISHER_ORG` AND orgxq.`ID` = distSub.`SUBSCRIBER_ORG` AND ds.`IS_CURRENT`='Y' AND orggx.id = "共享部门机构id" AND ds.require_department LIKE '%需求部门机构id%'
两个查询条件,需要先把机构表(dcat_organization)的所有信息查出来,以id/name做成下拉列表,查询的时候以id查询,分别对应sql的最后两行的条件
进入资源详情页的链接:
http://10.4.46.60:8080/datacatalog/dataset/home/26479b3a584f413a8f4c905daf3eb508/distributions/5f3fc7fafe930014300800ca?dataSetId=24fa56027d033aa1976506ea3cb1c8f4&versionNo=5&isCurrent=Y
3、目录变更情况查询:
DataCatalog支撑库SQL示例:
SELECT ORG_NAME 部门名称, DATE_FORMAT(MODIFIED_TIME, '%Y-%m-%d') 操作日期, count(DISTINCT(IDENTIFIER)) 当日操作总计, MODIFIED_BY 操作人员 FROM dcat_dataset GROUP BY DAY (MODIFIED_TIME) order by MODIFIED_TIME desc
新增、编辑、删除具体数值可以基于created_time和MODIFIED_TIME计算,目录有多版本,需要基于identifier字段去重;
供参考的具体计算方式:
SELECT t1.ORG_NAME, t1.MODIFIED_BY, DATE_FORMAT( t1.MODIFIED_TIME, '%Y-%m-%d' ) AS modifiedTime, t1.TITLE, t1.IDENTIFIER AS dsIdentifier, t1.VERSION_NO, t1.id, t1.IS_CURRENT, dist.IDENTIFIER AS distIdentifier FROM dcat_dataset t1 LEFT JOIN dcat_distribution dist ON t1.id = dist.DATASET_ID WHERE t1.IS_DELETED = 'N' and DATE_FORMAT(t1.MODIFIED_TIME, '%Y-%m-%d') > DATE_FORMAT(t1.CREATED_TIME, '%Y-%m-%d') GROUP BY t1.org_name, t1.MODIFIED_BY, DATE_FORMAT( t1.MODIFIED_TIME, '%Y-%m-%d' );
数据目录关键业务表ER图:
以上。