265. [Database Ops] HDFS: how a 10 TB disk got filled to the brim

I recently ran into a nasty problem. On a 6-node distributed database cluster, each node has a 10 TB disk, and one of them filled up. After manually digging through the HDFS local directories layer by layer looking for large files, I finally found the culprit: a single dncp-block-verification.log.curr file occupying 5.6 TB. Question marks popped up one after another, and honestly I was indignant: how can this thing grow that large? Larger than my actual data files?
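To save the layer-by-layer manual digging, a small script can do the walking for you. Below is a minimal sketch in Python; the data directory `/data/dfs/dn` and the 100 GB threshold are assumptions for illustration, so point it at whatever `dfs.datanode.data.dir` is actually set to on your nodes.

```python
import os

# Assumed DataNode data directory; check dfs.datanode.data.dir in hdfs-site.xml for the real path.
DATA_DIR = "/data/dfs/dn"
# Only report files larger than this threshold (100 GB here, purely an example value).
THRESHOLD = 100 * 1024 ** 3

def find_large_files(root, threshold):
    """Walk the local filesystem under `root` and yield (size, path) for oversized files."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # a file may disappear or be unreadable while we scan
            if size >= threshold:
                yield size, path

if __name__ == "__main__":
    # Print the biggest offenders first, sizes converted to TB.
    for size, path in sorted(find_large_files(DATA_DIR, THRESHOLD), reverse=True):
        print(f"{size / 1024 ** 4:.2f} TB  {path}")
```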


It was only the second day of the holiday and the customer was already pushing: "Have you settled on a solution yet?" So I hurried to reproduce the issue in a local VM. The fix itself was already clear, just delete those two files, but this is a production environment, and I didn't dare delete anything blindly; better to play it safe.

Looking back, this is actually a bug in older HDFS versions that has been fixed in newer releases. The workaround is simply to stop the DataNode and delete the two oversized log files.

For reference, here is the canonical fix:

One solution, although slightly drastic, is to disable the block scanner entirely by setting the key `dfs.datanode.scan.period.hours` to `0` (the default is `504` hours) in the HDFS DataNode configuration. The negative effect of this is that your DataNodes may not auto-detect corrupted block files (and would need to wait for a future block-reading client to detect them instead); this isn't a big deal if your average replication factor is around 3, but you can consider the change a short-term one until you upgrade to a release that fixes the issue.
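If you take that route, the property goes into the DataNode's `hdfs-site.xml`. A sketch of what the entry would look like (the exact config file location depends on your distribution):

```xml
<!-- hdfs-site.xml on each DataNode: setting the scan period to 0 disables the block scanner. -->
<!-- The default is 504 hours, i.e. three weeks. -->
<property>
  <name>dfs.datanode.scan.period.hours</name>
  <value>0</value>
</property>
```

The DataNodes typically need a restart to pick up the change.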

Note that this problem will not happen if you upgrade to the latest CDH 5.4.x or higher release versions, which include the [HDFS-7430](https://issues.apache.org/jira/browse/HDFS-7430) rewrite and its associated bug fixes. Those changes did away with the use of such a local file, thereby removing the problem.