[October 2024] Leading a task team in IQuOD (International Quality-controlled Ocean Database)
Since 2022, I have served as leader of the ‘duplicate checking’ task team within IQuOD. The team's goal is to identify and remove duplicate profiles from the World Ocean Database (WOD) and IQuOD databases.
Because there are numerous global and regional ocean databases, duplicate data remains a persistent problem in data management, data processing and database merging. It makes it harder to use oceanographic data effectively and accurately when deriving robust statistics and reliable data products. The team aims to provide efficient algorithms that identify duplicates and flag them as such. Over the past three years, under my leadership, the team has:
- Proposed a set of criteria defining what constitutes duplicate data;
- Developed an open-source, semi-automatic system (named DC_OCEAN) to detect duplicate data and erroneous metadata, and evaluated it thoroughly;
- Deployed this system on the WOD.
I also took this opportunity to supervise an M.S. student (Xinyi Song) as part of her academic training.
In October 2024, the first round of tasks was completed. A peer-reviewed journal paper (Song and Tan et al., 2024) has been published in Frontiers in Marine Science, with me as co-first author. (The first author is the M.S. student under my supervision.)
The paper is available at https://doi.org/10.3389/fmars.2024.1403175.
The duplicate checking algorithm (DC_OCEAN; https://github.com/IQuOD/duplicated_checking_IQuOD) is available as an open-source Python package under the Apache-2.0 license (https://pypi.org/project/DC-OCEAN/).
Citation: X. Song†, Z. Tan†, R. Locarnini, S. Simoncelli, R. Cowley, S. Kizu, T. Boyer, F. Reseghetti, G. Castelao, V. Gouretski, L. Cheng, 2024: An open-source algorithm for identification of duplicates in ocean database. Frontiers in Marine Science. doi:10.3389/fmars.2024.1403175
Below are some photos of the IQuOD steering team meeting (July 2023, Potsdam, Germany)