As usual, I start with the initial checks, beginning with the S.M.A.R.T. values.
smartctl 5.43 2016-09-28 r4347 [x86_64-linux-2.6.32-754.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   082   063   044    Pre-fail  Always       -       168507570
  3 Spin_Up_Time            0x0003   097   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       127
  5 Reallocated_Sector_Ct   0x0033   099   099   036    Pre-fail  Always       -       39
  7 Seek_Error_Rate         0x000f   061   060   030    Pre-fail  Always       -       4296392929
  9 Power_On_Hours          0x0032   088   088   000    Old_age   Always       -       10830
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       1
 12 Power_Cycle_Count       0x0032   099   037   020    Old_age   Always       -       1413
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   068   046   045    Old_age   Always       -       32 (Min/Max 25/32)
194 Temperature_Celsius     0x0022   032   054   000    Old_age   Always       -       32 (0 24 0 0 0)
195 Hardware_ECC_Recovered  0x001a   048   004   000    Old_age   Always       -       168507570
197 Current_Pending_Sector  0x0012   002   002   000    Old_age   Always       -       2008
198 Offline_Uncorrectable   0x0010   002   002   000    Old_age   Offline      -       2008
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

Power-on time was 10830 hours (about 451 days), but Power_Cycle_Count is on the high side at 1413 (the drives I obtained previously were around 100), so I suspect this drive was powered on only when it was actually needed. In addition, Current_Pending_Sector is high at 2008, so left as is, the drive would run into I/O errors sooner or later.
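The arithmetic behind that reading can be checked with a short script. This is only a sketch: the three rows are copied from the table above, and the helper name `parse_raw_values` is made up for illustration. It assumes the usual ten-column layout of the smartctl attribute table.

```python
import re

# Three rows copied from the attribute table above.
SMART_ROWS = """\
  9 Power_On_Hours          0x0032   088   088   000    Old_age   Always       -       10830
 12 Power_Cycle_Count       0x0032   099   037   020    Old_age   Always       -       1413
197 Current_Pending_Sector  0x0012   002   002   000    Old_age   Always       -       2008
"""


def parse_raw_values(text):
    """Map ATTRIBUTE_NAME to the leading integer of its RAW_VALUE column."""
    values = {}
    for line in text.splitlines():
        # Ten columns: ID, name, flag, value, worst, thresh, type,
        # updated, when_failed, raw_value (which may contain spaces).
        fields = line.split(None, 9)
        if len(fields) == 10 and fields[0].isdigit():
            match = re.match(r"\d+", fields[9])
            if match:
                values[fields[1]] = int(match.group())
    return values


raw = parse_raw_values(SMART_ROWS)
days = raw["Power_On_Hours"] / 24
hours_per_cycle = raw["Power_On_Hours"] / raw["Power_Cycle_Count"]

print(f"power-on time : {days:.2f} days")
print(f"per power cycle: {hours_per_cycle:.2f} hours on average")
print(f"pending sectors: {raw['Current_Pending_Sector']}")
```

An average of under 8 hours per power cycle fits the guess that the drive was switched on only for individual working sessions, although the counters alone cannot prove that.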
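Of the six used HDDs I have obtained so far, this one is in the worst condition, but because it was sold as junk I got it very cheaply (the lowest price of the six).
In my experience, an HDD in this state can often be refreshed with SecureErase or with this procedure, and I expect it to remain quite usable as a member of a ZFS raid pool.
So this time I did the refresh using this procedure.
The results are as follows.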
smartctl 5.43 2016-09-28 r4347 [x86_64-linux-2.6.32-754.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   060   060   044    Pre-fail  Always       -       205654349
  3 Spin_Up_Time            0x0003   098   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       128
  5 Reallocated_Sector_Ct   0x0033   099   099   036    Pre-fail  Always       -       39
  7 Seek_Error_Rate         0x000f   066   060   030    Pre-fail  Always       -       4299157096
  9 Power_On_Hours          0x0032   088   088   000    Old_age   Always       -       11030
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       1
 12 Power_Cycle_Count       0x0032   099   037   020    Old_age   Always       -       1414
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   024   024   000    Old_age   Always       -       76
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   066   046   045    Old_age   Always       -       34 (Min/Max 31/34)
194 Temperature_Celsius     0x0022   034   054   000    Old_age   Always       -       34 (0 24 0 0 0)
195 Hardware_ECC_Recovered  0x001a   052   004   000    Old_age   Always       -       205654349
197 Current_Pending_Sector  0x0012   100   002   000    Old_age   Always       -       3
198 Offline_Uncorrectable   0x0010   100   002   000    Old_age   Offline      -       3
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

Current_Pending_Sector did not reach zero, but it dropped to 3.
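To see at a glance what changed between the two snapshots, the raw values can be diffed. The sketch below hand-copies a few raw values from the two tables in this post. The function name `raw_value_changes` is hypothetical and not part of any tool.

```python
# Raw values copied by hand from the "before" and "after" tables in this post.
BEFORE = {
    "Reallocated_Sector_Ct": 39,
    "Power_On_Hours": 10830,
    "Reported_Uncorrect": 0,
    "Current_Pending_Sector": 2008,
    "Offline_Uncorrectable": 2008,
}
AFTER = {
    "Reallocated_Sector_Ct": 39,
    "Power_On_Hours": 11030,
    "Reported_Uncorrect": 76,
    "Current_Pending_Sector": 3,
    "Offline_Uncorrectable": 3,
}


def raw_value_changes(before, after):
    """Return {attribute: after - before} for attributes whose raw value changed."""
    return {
        name: after[name] - before[name]
        for name in before
        if name in after and after[name] != before[name]
    }


changes = raw_value_changes(BEFORE, AFTER)
for name, delta in changes.items():
    print(f"{name:24s} {delta:+d}")
```

With these numbers, the pending and offline-uncorrectable counts each fall by 2005 while Reallocated_Sector_Ct stays at 39. Reported_Uncorrect rises by 76, and the drive accumulated 200 more power-on hours between the two snapshots.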
Using it as a standalone disk would be risky, but experience tells me it should still be usable in a ZFS raid pool, so I actually added it to one.
[root@hoge ~]# zpool status tankQ
  pool: tankQ
 state: ONLINE
  scan: resilvered 104K in 0h0m with 0 errors on Thu Oct 18 17:36:46 2018
config:

        NAME        STATE     READ WRITE CKSUM
        tankQ       ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            tankQf  ONLINE       0     0     0
            tankQk  ONLINE       0     0     0
            tankQe  ONLINE       0     0     0
            tankQc  ONLINE       0     0     0

errors: No known data errors

From ZFS's point of view, the pool is now error-free. zpool scrub no longer reports any errors either. Note that in tankQ each disk is encrypted with LUKS before being handed to ZFS.
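When a check like this is repeated after every scrub, it can help to pull the per-device counters out of the `zpool status` text. The following is a rough sketch that works on the output pasted above. The function name `device_errors` is hypothetical, and it assumes plain integer counters (larger counts that zpool abbreviates, such as `1.2K`, would be skipped by this simple parser).

```python
ZPOOL_STATUS = """\
  pool: tankQ
 state: ONLINE
  scan: resilvered 104K in 0h0m with 0 errors on Thu Oct 18 17:36:46 2018
config:

        NAME        STATE     READ WRITE CKSUM
        tankQ       ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            tankQf  ONLINE       0     0     0
            tankQk  ONLINE       0     0     0
            tankQe  ONLINE       0     0     0
            tankQc  ONLINE       0     0     0

errors: No known data errors
"""


def device_errors(status_text):
    """Return {name: (state, read, write, cksum)} for rows of the config table."""
    devices = {}
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_config = True  # rows after this header belong to the table
            continue
        if not in_config:
            continue
        if fields and fields[0] == "errors:":
            break  # end of the config table
        # Only rows with three plain-integer counters are recognised.
        if len(fields) == 5 and all(f.isdigit() for f in fields[2:]):
            read, write, cksum = (int(f) for f in fields[2:])
            devices[fields[0]] = (fields[1], read, write, cksum)
    return devices


devices = device_errors(ZPOOL_STATUS)
unhealthy = {
    name: row for name, row in devices.items()
    if row[0] != "ONLINE" or any(row[1:])
}
print("devices:", ", ".join(devices))
print("unhealthy:", unhealthy or "none")
```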
Below is the rest of the initial check data.
[root@hoge ~]# hdparm -i /dev/sdk

/dev/sdk:

 Model=ST31000340NS, FwRev=SN06, SerialNo=9xxxxxxH
 Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs RotSpdTol>.5% }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=4
 BuffType=unknown, BuffSize=unknown, MaxMultSect=16, MultSect=16
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=1953525168
 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes:  pio0 pio1 pio2 pio3 pio4
 DMA modes:  mdma0 mdma1 mdma2
 UDMA modes: udma0 udma1 udma2 udma3 udma4 udma5 *udma6
 AdvancedPM=no WriteCache=enabled
 Drive conforms to: unknown:  ATA/ATAPI-4,5,6,7

 * signifies the current active mode
[root@hoge ~]# hdparm -I /dev/sdk

/dev/sdk:

ATA device, with non-removable media
        Model Number:       ST31000340NS
        Serial Number:      9xxxxxxH
        Firmware Revision:  SN06
        Transport:          Serial
Standards:
        Used: unknown (minor revision code 0x0029)
        Supported: 8 7 6 5
        Likely used: 8
Configuration:
        Logical         max     current
        cylinders       16383   16383
        heads           16      16
        sectors/track   63      63
        --
        CHS current addressable sectors:   16514064
        LBA    user addressable sectors:  268435455
        LBA48  user addressable sectors: 1953525168
        Logical/Physical Sector size:           512 bytes
        device size with M = 1024*1024:      953869 MBytes
        device size with M = 1000*1000:     1000204 MBytes (1000 GB)
        cache/buffer size  = unknown
        Nominal Media Rotation Rate: 7200
Capabilities:
        LBA, IORDY(can be disabled)
        Queue depth: 32
        Standby timer values: spec'd by Standard, no device specific minimum
        R/W multiple sector transfer: Max = 16  Current = ?
        Recommended acoustic management value: 254, current value: 0
        DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
             Cycle time: min=120ns recommended=120ns
        PIO: pio0 pio1 pio2 pio3 pio4
             Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
        Enabled Supported:
           *    SMART feature set
                Security Mode feature set
           *    Power Management feature set
           *    Write cache
           *    Look-ahead
           *    Host Protected Area feature set
           *    WRITE_BUFFER command
           *    READ_BUFFER command
           *    DOWNLOAD_MICROCODE
                SET_MAX security extension
           *    48-bit Address feature set
           *    Device Configuration Overlay feature set
           *    Mandatory FLUSH_CACHE
           *    FLUSH_CACHE_EXT
           *    SMART error logging
           *    SMART self-test
           *    General Purpose Logging feature set
           *    64-bit World wide name
                Write-Read-Verify feature set
           *    WRITE_UNCORRECTABLE_EXT command
           *    {READ,WRITE}_DMA_EXT_GPL commands
           *    Segmented DOWNLOAD_MICROCODE
           *    Gen1 signaling speed (1.5Gb/s)
           *    Gen2 signaling speed (3.0Gb/s)
           *    Native Command Queueing (NCQ)
           *    Phy event counters
           *    Software settings preservation
           *    SMART Command Transport (SCT) feature set
           *    SCT Write Same (AC2)
           *    SCT Error Recovery Control (AC3)
           *    SCT Features Control (AC4)
           *    SCT Data Tables (AC5)
                unknown 206[12] (vendor specific)
Security:
        Master password revision code = 65534
                supported
        not     enabled
        not     locked
        not     frozen
        not     expired: security count
                supported: enhanced erase
        192min for SECURITY ERASE UNIT. 192min for ENHANCED SECURITY ERASE UNIT.
Logical Unit WWN Device Identifier: 5000c500yyyyyyy9
        NAA             : 5
        IEEE OUI        : 000c50
        Unique ID       : 0yyyyyyy9
Checksum: correct
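The size figures in the hdparm output follow directly from the sector count. As a quick sanity check (plain arithmetic, with variable names chosen only for this sketch), the numbers can be reproduced like this:

```python
# Values taken from the hdparm -I output above.
LBA28_SECTORS = 268435455      # "LBA user addressable sectors"
LBA48_SECTORS = 1953525168     # "LBA48 user addressable sectors"
SECTOR_BYTES = 512             # "Logical/Physical Sector size"

capacity_bytes = LBA48_SECTORS * SECTOR_BYTES
size_m1024 = capacity_bytes // (1024 * 1024)   # "M = 1024*1024" line
size_m1000 = capacity_bytes // (1000 * 1000)   # "M = 1000*1000" line

print(f"capacity      : {capacity_bytes:,} bytes")
print(f"M = 1024*1024 : {size_m1024} MBytes")
print(f"M = 1000*1000 : {size_m1000} MBytes")
# The 28-bit figure is simply the largest count a 28-bit field can hold.
print(f"LBA28 limit   : {LBA28_SECTORS == 2**28 - 1}")
```

The byte count matches the "User Capacity: 1,000,204,886,016 bytes" line that smartctl reports for the same drive.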
smartctl 5.43 2016-09-28 r4347 [x86_64-linux-2.6.32-754.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda ES.2
Device Model:     ST31000340NS
Serial Number:    9xxxxxxH
LU WWN Device Id: 5 000c50 0yyyyyyy9
Firmware Version: SN06
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Thu Oct 18 17:49:17 2018 JST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  625) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 225) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x103d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   060   060   044    Pre-fail  Always       -       205654349
  3 Spin_Up_Time            0x0003   098   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       128
  5 Reallocated_Sector_Ct   0x0033   099   099   036    Pre-fail  Always       -       39
  7 Seek_Error_Rate         0x000f   066   060   030    Pre-fail  Always       -       4299157087
  9 Power_On_Hours          0x0032   088   088   000    Old_age   Always       -       11030
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       1
 12 Power_Cycle_Count       0x0032   099   037   020    Old_age   Always       -       1414
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   024   024   000    Old_age   Always       -       76
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   066   046   045    Old_age   Always       -       34 (Min/Max 31/34)
194 Temperature_Celsius     0x0022   034   054   000    Old_age   Always       -       34 (0 24 0 0 0)
195 Hardware_ECC_Recovered  0x001a   052   004   000    Old_age   Always       -       205654349
197 Current_Pending_Sector  0x0012   100   002   000    Old_age   Always       -       3
198 Offline_Uncorrectable   0x0010   100   002   000    Old_age   Offline      -       3
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 119 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 119 occurred at disk power-on lifetime: 11004 hours (458 days + 12 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 fd 25 6c 00  Error: UNC at LBA = 0x006c25fd = 7087613

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 e0 b0 26 6c 40 00   7d+05:46:35.674  READ FPDMA QUEUED
  60 00 e0 d0 25 6c 40 00   7d+05:46:35.669  READ FPDMA QUEUED
  60 00 f0 d8 24 6c 40 00   7d+05:46:35.669  READ FPDMA QUEUED
  60 00 28 78 25 6c 40 00   7d+05:46:35.664  READ FPDMA QUEUED
  60 00 30 a8 24 6c 40 00   7d+05:46:35.663  READ FPDMA QUEUED

Error 118 occurred at disk power-on lifetime: 10830 hours (451 days + 6 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 71 04 9d 00 32 40  Device Fault; Error: ABRT 4 sectors at LBA = 0x0032009d = 3276957

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 d0 e1 7d 40 00      00:15:51.654  READ DMA EXT
  25 00 08 d0 e1 7d 40 00      00:15:51.654  READ DMA EXT
  25 00 08 d0 e1 7d 40 00      00:15:51.654  READ DMA EXT
  25 00 08 d0 e1 7d 40 00      00:15:51.654  READ DMA EXT
  25 00 08 d0 e1 7d 40 00      00:15:51.653  READ DMA EXT

Error 117 occurred at disk power-on lifetime: 10830 hours (451 days + 6 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 71 04 9d 00 32 40  Device Fault; Error: ABRT 4 sectors at LBA = 0x0032009d = 3276957

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 d0 e1 7d 40 00      00:15:51.653  READ DMA EXT
  25 00 08 c8 c5 2d 40 00      00:15:51.209  READ DMA EXT
  25 00 08 c8 c5 2d 40 00      00:15:51.209  READ DMA EXT
  25 00 08 c8 c5 2d 40 00      00:15:51.209  READ DMA EXT
  25 00 08 c8 c5 2d 40 00      00:15:51.208  READ DMA EXT

Error 116 occurred at disk power-on lifetime: 10830 hours (451 days + 6 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 71 04 9d 00 32 40  Device Fault; Error: ABRT 4 sectors at LBA = 0x0032009d = 3276957

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 c8 c5 2d 40 00      00:15:51.209  READ DMA EXT
  25 00 08 c8 c5 2d 40 00      00:15:51.209  READ DMA EXT
  25 00 08 c8 c5 2d 40 00      00:15:51.209  READ DMA EXT
  25 00 08 c8 c5 2d 40 00      00:15:51.208  READ DMA EXT
  25 00 08 c8 c5 2d 40 00      00:15:51.208  READ DMA EXT

Error 115 occurred at disk power-on lifetime: 10830 hours (451 days + 6 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 71 04 9d 00 32 40  Device Fault; Error: ABRT 4 sectors at LBA = 0x0032009d = 3276957

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 c8 c5 2d 40 00      00:15:51.208  READ DMA EXT
  25 00 08 b8 c6 2d 40 00      00:15:51.081  READ DMA EXT
  25 00 08 b8 c6 2d 40 00      00:15:51.081  READ DMA EXT
  25 00 08 b8 c6 2d 40 00      00:15:51.080  READ DMA EXT
  25 00 08 b8 c6 2d 40 00      00:15:51.080  READ DMA EXT

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     11007         -
# 2  Short offline       Completed without error       00%     11004         -
# 3  Short offline       Completed without error       00%     10953         -
# 4  Selective offline   Completed without error       00%     10837         -
# 5  Selective offline   Completed: read failure       90%     10837         1887270886
# 6  Selective offline   Completed: read failure       90%     10837         1887261750
# 7  Selective offline   Completed: read failure       90%     10836         1887217021
# 8  Selective offline   Completed: read failure       90%     10833         63511735
# 9  Selective offline   Completed: read failure       90%     10833         63502125
#10  Selective offline   Completed: read failure       90%     10833         63490659
#11  Selective offline   Completed: read failure       90%     10833         12121842
#12  Selective offline   Completed: read failure       90%     10833         12110355
#13  Selective offline   Completed: read failure       90%     10833         12098051
#14  Selective offline   Completed: read failure       90%     10833         12089280
#15  Selective offline   Completed: read failure       90%     10833         12078170
#16  Selective offline   Completed: read failure       90%     10833         12068537
#17  Selective offline   Completed: read failure       90%     10833         12059282
#18  Selective offline   Completed: read failure       90%     10833         11972284
#19  Selective offline   Completed: read failure       90%     10833         11957107
#20  Selective offline   Completed: read failure       90%     10833         11947496
#21  Selective offline   Completed: read failure       90%     10833         10545773
17 of 17 failed self-tests are outdated by newer successful extended offline self-test # 1

SMART Selective self-test log data structure revision number 1
 SPAN     MIN_LBA     MAX_LBA  CURRENT_TEST_STATUS
    1  1887270886  1953525167  Not_testing
    2           0           0  Not_testing
    3           0           0  Not_testing
    4           0           0  Not_testing
    5           0           0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

HDD behavior varies quite a bit from model to model, but if you try a refresh after reading this article, I think the Self-test log will be a useful reference. Note that some HDD models apparently cannot display a Self-test log at all (perhaps the feature is not implemented?).
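When reading the error log, it helps to know how the LBA on each "Error:" line relates to the register dump in front of it. The sketch below assumes the classic 28-bit layout: SN, CL and CH carry the low 24 bits and the low nibble of DH carries bits 24 to 27. That reading is consistent with the values smartctl printed above, but the helper name `lba28_from_registers` is made up for this example.

```python
def lba28_from_registers(sn, cl, ch, dh):
    """Combine ATA taskfile registers into a 28-bit LBA.

    Assumed layout: SN = bits 0-7, CL = bits 8-15, CH = bits 16-23,
    low nibble of DH = bits 24-27.
    """
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# Register values (SN CL CH DH) copied from the error log above.
error_119 = lba28_from_registers(0xFD, 0x25, 0x6C, 0x00)
error_118 = lba28_from_registers(0x9D, 0x00, 0x32, 0x40)

print(f"Error 119: LBA = 0x{error_119:08x} = {error_119}")
print(f"Error 118: LBA = 0x{error_118:08x} = {error_118}")
```

Both results agree with the log: 7087613 for the UNC error and 3276957 for the ABRT errors.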
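One final warning: an HDD in this state is usable only with something like ZFS or Btrfs, which implements end-to-end data checksums, and only in a raid configuration. Standalone use is of course out of the question, and I think it is risky even behind hardware RAID, so please be very careful. Using such a drive for experiments to learn how HDDs and the OS behave is fine, though...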
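Why checksums and redundancy are both needed can be shown with a toy model. This is not how ZFS is implemented: SHA-256 and a single XOR parity block are stand-ins chosen for the sketch, and all function names are hypothetical. The point is only that the checksum detects a bad block, and the redundancy supplies a replacement that can itself be verified.

```python
import hashlib
from functools import reduce


def checksum(block):
    """Checksum stored separately from the data block (SHA-256 as a stand-in)."""
    return hashlib.sha256(block).digest()


def xor_blocks(blocks):
    """Bytewise XOR of equally sized blocks (single-parity stand-in)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def read_with_repair(index, data_blocks, parity, checksums):
    """Return a verified block, rebuilding it from parity if it is corrupt."""
    block = data_blocks[index]
    if checksum(block) == checksums[index]:
        return block
    # Checksum mismatch: rebuild from the other data blocks plus parity.
    others = [b for i, b in enumerate(data_blocks) if i != index]
    rebuilt = xor_blocks(others + [parity])
    if checksum(rebuilt) != checksums[index]:
        raise IOError("block could not be repaired")
    return rebuilt


# Three "disks" with one block each, plus one parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)
checksums = [checksum(b) for b in data]

# Simulate a disk silently returning wrong data for block 1.
data[1] = b"BxBB"
repaired = read_with_repair(1, data, parity, checksums)
print(repaired)
```

Without the checksum, the corrupted block would be returned as if it were valid. Without the parity block, the corruption would be detected but could not be repaired.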
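I myself do not use tankQ as a primary data area. I use it as secondary storage for backups and the like, where losing it in the worst case is acceptable. As a humble OS engineer, my main goal is to gain hands-on experience with how Linux (CentOS6) + ZFS and HDDs behave, especially the recovery behavior when sector errors occur.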