265. [Database Ops] HDFS: how a 10 TB disk got filled to the brim

I recently ran into a nasty problem. On a 6-node distributed database cluster, each node has a 10 TB disk, and one of them filled up. After manually digging through the HDFS local directories layer by layer looking for large files, I finally found the culprit: a single dncp-block-verification.log.curr file occupying 5.6 TB. Question marks popped up one after another, and honestly I was indignant: how can this thing grow that large? Larger than my actual data files?
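To save the layer-by-layer manual digging, a small script can do the walking for you. Below is a minimal sketch in Python; the data directory `/data/dfs/dn` and the 100 GB threshold are assumptions for illustration, so point it at whatever `dfs.datanode.data.dir` is actually set to on your nodes.

```python
import os

# Assumed DataNode data directory; check dfs.datanode.data.dir in hdfs-site.xml for the real path.
DATA_DIR = "/data/dfs/dn"
# Only report files larger than this threshold (100 GB here, purely an example value).
THRESHOLD = 100 * 1024 ** 3

def find_large_files(root, threshold):
    """Walk the local filesystem under `root` and yield (size, path) for oversized files."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # a file may disappear or be unreadable while we scan
            if size >= threshold:
                yield size, path

if __name__ == "__main__":
    # Print the biggest offenders first, sizes converted to TB.
    for size, path in sorted(find_large_files(DATA_DIR, THRESHOLD), reverse=True):
        print(f"{size / 1024 ** 4:.2f} TB  {path}")
```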


It was only the second day of the holiday and the customer was already pushing: "Have you settled on a solution yet?" So I hurried to reproduce the issue in a local VM. The fix itself was already clear, just delete those two files, but this is a production environment, and I didn't dare delete anything blindly; better to play it safe.

Looking back, this is actually a bug in older HDFS versions that has been fixed in newer releases. The workaround is simply to stop the DataNode and delete the two oversized log files.

For reference, here is the canonical fix:

One solution, although slightly drastic, is to disable the block scanner entirely by setting the key `dfs.datanode.scan.period.hours` to `0` (the default is `504` hours) in the HDFS DataNode configuration. The negative effect of this is that your DataNodes may not auto-detect corrupted block files (and would need to wait for a future block-reading client to detect them instead); this isn't a big deal if your average replication factor is around 3, but you can consider the change a short-term one until you upgrade to a release that fixes the issue.
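If you take that route, the property goes into the DataNode's `hdfs-site.xml`. A sketch of what the entry would look like (the exact config file location depends on your distribution):

```xml
<!-- hdfs-site.xml on each DataNode: setting the scan period to 0 disables the block scanner. -->
<!-- The default is 504 hours, i.e. three weeks. -->
<property>
  <name>dfs.datanode.scan.period.hours</name>
  <value>0</value>
</property>
```

The DataNodes typically need a restart to pick up the change.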

Note that this problem will not happen if you upgrade to the latest CDH 5.4.x or higher release versions, which include the [HDFS-7430](https://issues.apache.org/jira/browse/HDFS-7430) rewrite and its associated bug fixes. Those changes did away with the use of such a local file, thereby removing the problem.