Image credit: Tsinghua University, n.d.

Digital Humanities at Tsinghua University

Name

 Digital Humanities, Tsinghua University

Year of Foundation

2020 (website), 2015 (DH research group)

Short Description

The Tsinghua University Digital Humanities project is led by an interdisciplinary team centered at Tsinghua University, with substantial support from the university itself. The core team comprises faculty and students from the School of Humanities, the Department of Computer Science and Technology, and the Centre for Statistics, joined by scholars from institutions such as the Chinese Academy of Social Sciences, Macao Polytechnic University, and Zhejiang University. Funded by a major project of the National Social Science Fund, the team has established a digital humanities journal, a comprehensive portal website, and the ‘Xuancong Digital Humanities Intelligence Platform’, and has introduced an experimental course for undergraduates at Rixin College.

The team’s digital humanities portal, DHLIB, is a comprehensive portal supported by Tsinghua University’s ‘Double-High’ plan for liberal arts development, with participation from Zhonghua Book Company, CNKI, the National Studies Network, and Chinese Online. It is the first portal of its kind in the Chinese-speaking world, aiming to provide a platform for ‘academic exchange, open access, interdisciplinary interaction, and collaborative sharing’.

Key members of the Tsinghua Digital Humanities team include Liu Shi, Dean of the School of Humanities; Sun Maosong, Executive Vice Dean of the Institute for Artificial Intelligence; Li Feiyue, Deputy Director of the Chinese Department; and Deng Ke, former Deputy Director of the Centre for Statistics. The team also co-founded the academic journal ‘Digital Humanities’ with Zhonghua Book Company.

On November 12-13, 2022, the ‘Voice, Network, Future: The Third Tsinghua Digital Humanities International Forum’, organized by Tsinghua University’s School of Humanities and the editorial office of ‘Digital Humanities’ and co-organized by the China Institute at Bucknell University, was held at Tsinghua University in a hybrid online and in-person format. More than sixty scholars from over forty universities and research institutions, including twenty universities from overseas, presented their latest research findings.

Key Academics

Prof Liu Shi is the Dean of the School of Humanities at Tsinghua University, specializing in literature and cultural studies.

Prof Sun Maosong is the Executive Vice Dean of the Institute for Artificial Intelligence, focusing on artificial intelligence, large language models, and their applications in the social sciences, humanities, and the arts.

Prof Li Feiyue is the Deputy Director of the Department of Chinese, with expertise in Chinese literature and digital humanities.

Dr Deng Ke is the former Deputy Director of the Centre for Statistics, specializing in statistical methods and their application in digital humanities, such as content analysis.

Key Projects

The Centre for Digital Humanities at Tsinghua University provides a range of digital tools and resources designed to support digital humanities research and projects. It aims to facilitate academic collaboration, data sharing, and the development of digital resources.

  1. Platforms
    • Ming-Qing Routes and Literature (MQRL) is a digital platform developed by Professor Jian Jinsong using his ‘on-site investigation’ approach to historical places. It comprehensively digitizes ancient Chinese routes, offering detailed and accessible data on historical itineraries, and supports innovative research into Ming and Qing literature by exploring how travel routes shaped literary works and historical narratives.
    • Intelligent Ancient Books Platform draws on the concept of knowledge graphs and integrates technologies such as big data analytics, location-based queries, clustering queries, spatial analysis, data association, network analysis, machine indexing, and crowdsourcing. The platform aims to transform Chinese classical literature and research outcomes into a graph-based, intelligent format, creating a comprehensive big data platform for ancient books that combines browsing, querying, research, and appreciation, and integrates aesthetic reading, knowledge learning, and immersive experiences into one unified system (a minimal sketch of the knowledge-graph idea appears after this list).
  2.  Tools
    • Chinese Classical Poetry Semantic Search – AI Jiuge: Tsinghua University’s Natural Language Processing and Social Humanities Computing Laboratory has released a semantic search tool for classical Chinese poetry, abbreviated as ‘Jiuge Semantic Search’. The tool uses a retrieval algorithm that combines a BERT-based deep neural network model with an improved longest-common-subsequence (LCS) matching method tailored to the characteristics of classical Chinese poetry, capturing the complex semantics of classical verse and producing more accurate, detailed, and richer search results. In addition, the team implemented an approximate nearest-neighbor search mechanism with Annoy, which uses a tree data structure to maximize the speed of similarity calculations between dense vectors, and an inverted-index-based optimization of the LCS computation to speed up string matching (see the retrieval sketch after this list).
    • THULAC (THU Lexical Analyzer for Chinese): a Chinese lexical analysis toolkit developed by the Natural Language Processing and Social Humanities Computing Laboratory at Tsinghua University, offering Chinese word segmentation and part-of-speech tagging. THULAC is trained on the world’s largest manually segmented and part-of-speech tagged Chinese corpus, containing approximately 58 million characters, which gives it robust tagging capabilities. On the standard Chinese Treebank (CTB5) dataset, THULAC achieves an F1 score of 97.3% for word segmentation and 92.9% for part-of-speech tagging, comparable to the best-performing methods on this dataset. Combined segmentation and tagging runs at about 300 KB/s, roughly 150,000 characters per second; segmentation alone reaches about 1.3 MB/s (a usage sketch follows this list).
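
A minimal, hypothetical sketch of the knowledge-graph idea behind the Intelligent Ancient Books Platform: facts about works, authors, and motifs are stored as typed nodes and labelled relations, so that questions become graph traversals rather than full-text searches. The entities, relation names, and the use of networkx below are illustrative assumptions, not details of the Tsinghua platform.

    import networkx as nx

    # Hypothetical miniature knowledge graph: typed nodes for a work, an
    # author, a dynasty, and a poetic image, with labelled relations.
    G = nx.MultiDiGraph()
    G.add_node("静夜思", kind="work")
    G.add_node("李白", kind="person")
    G.add_node("唐", kind="dynasty")
    G.add_node("明月", kind="motif")

    G.add_edge("李白", "静夜思", relation="authored")
    G.add_edge("李白", "唐", relation="lived_during")
    G.add_edge("静夜思", "明月", relation="mentions")

    def related(graph, entity, relation):
        """Return all targets linked to `entity` by edges labelled `relation`."""
        return [v for _, v, data in graph.out_edges(entity, data=True)
                if data.get("relation") == relation]

    # Graph queries replace full-text search: works by 李白, and his dynasty.
    print(related(G, "李白", "authored"))      # ['静夜思']
    print(related(G, "李白", "lived_during"))  # ['唐']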
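
The two-stage retrieval described for Jiuge Semantic Search (semantic matching over dense vectors, then character-level longest-common-subsequence re-ranking) can be sketched as follows. This is a minimal illustration rather than the Jiuge implementation: the toy bag-of-characters embed function stands in for the BERT encoder, the inverted-index LCS optimization is omitted, and only the Annoy tree-based approximate nearest-neighbor index plus the LCS re-ranking step are shown.

    import numpy as np
    from annoy import AnnoyIndex

    DIM = 64

    def embed(text):
        """Toy bag-of-characters vector; the real system uses a BERT encoder."""
        v = np.zeros(DIM)
        for ch in text:
            v[ord(ch) % DIM] += 1.0
        norm = np.linalg.norm(v)
        return (v / norm) if norm else v

    def lcs_len(a, b):
        """Length of the longest common subsequence of two strings."""
        dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i, ca in enumerate(a, 1):
            for j, cb in enumerate(b, 1):
                dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
        return dp[len(a)][len(b)]

    corpus = ["床前明月光", "举头望明月", "月落乌啼霜满天", "春眠不觉晓"]

    # Stage 1 index: Annoy builds a forest of trees over the dense vectors,
    # giving fast approximate nearest-neighbor search.
    index = AnnoyIndex(DIM, "angular")
    for i, line in enumerate(corpus):
        index.add_item(i, embed(line).tolist())
    index.build(10)  # number of trees: a speed/accuracy trade-off

    def search(query, k=3):
        candidates = index.get_nns_by_vector(embed(query).tolist(), k)
        # Stage 2: re-rank the semantic candidates by character-level LCS overlap.
        return sorted((corpus[i] for i in candidates),
                      key=lambda line: lcs_len(query, line),
                      reverse=True)

    print(search("明月照我床"))

With a real encoder in place of the toy embedding, the first stage captures paraphrase-level similarity while the LCS stage rewards shared characters and phrasing, mirroring the combination described above.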
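
THULAC also ships as a Python package (installable with pip install thulac); the snippet below is a minimal usage sketch that follows the package's documented interface, assuming the bundled default models are available.

    import thulac

    # Default mode: word segmentation plus part-of-speech tagging.
    tagger = thulac.thulac()
    print(tagger.cut("清华大学的数字人文研究方兴未艾", text=True))
    # text=True returns a single string of word_tag pairs, e.g. "清华大学_ni 的_u ..."

    # seg_only mode drops the tags in exchange for higher throughput.
    segmenter = thulac.thulac(seg_only=True)
    print(segmenter.cut("清华大学的数字人文研究方兴未艾", text=True))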

Teaching

The ‘Tsinghua University International Workshop on Digital Humanities and Literary Studies’, hosted by the Department of Chinese Language and Literature at Tsinghua University in collaboration with the University of Chicago’s Text Lab, took place in June 2017.

Several prominent scholars attended the workshop and gave keynote presentations, including Hoyt Long, Su Zhen, and Zhu Yuancheng from the University of Chicago, as well as Dai Ande from Columbia University, among others. Professor Liu Shi, then Vice Dean of the School of Humanities, delivered the opening remarks, and Professor Wang Zhongchen, Head of the Department of Chinese Language and Literature, provided a summary and outlook at the close of the event. The workshop attracted over a hundred participants from both within and outside the university.

The attending scholars engaged in extensive and in-depth discussions on three primary research methods in Digital Humanities: statistical methods, close reading of texts, and historicism. They also provided an overview of domestic and international humanities databases and the current state of Digital Humanities research in China.