Sunday, October 21, 2018

Initial check of a used HDD, no. 6 (October 2018)

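I have bought five used HDDs so far, but the third one (purchased in June 2016) has reached its limit (its Reallocated_Sector_Ct fell below THRESH), so I bought a sixth as a replacement. Once again I stubbornly went with a Seagate Barracuda ES.2 1TB, since I figure sticking to the same model lets me build up more experience.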

Here is the usual initial check, starting with the S.M.A.R.T. values.
smartctl 5.43 2016-09-28 r4347 [x86_64-linux-2.6.32-754.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   082   063   044    Pre-fail  Always       -       168507570
  3 Spin_Up_Time            0x0003   097   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       127
  5 Reallocated_Sector_Ct   0x0033   099   099   036    Pre-fail  Always       -       39
  7 Seek_Error_Rate         0x000f   061   060   030    Pre-fail  Always       -       4296392929
  9 Power_On_Hours          0x0032   088   088   000    Old_age   Always       -       10830
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       1
 12 Power_Cycle_Count       0x0032   099   037   020    Old_age   Always       -       1413
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   068   046   045    Old_age   Always       -       32 (Min/Max 25/32)
194 Temperature_Celsius     0x0022   032   054   000    Old_age   Always       -       32 (0 24 0 0 0)
195 Hardware_ECC_Recovered  0x001a   048   004   000    Old_age   Always       -       168507570
197 Current_Pending_Sector  0x0012   002   002   000    Old_age   Always       -       2008
198 Offline_Uncorrectable   0x0010   002   002   000    Old_age   Offline      -       2008
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
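Power-on time was 10830 hours (about 451 days), but Power_Cycle_Count is on the high side at 1413 (the drives I obtained previously were around 100), so I suspect the drive was powered on only when it was needed. In addition, Current_Pending_Sector is at a high 2008, so as it stands I would expect to run into I/O errors sooner or later.
Of the six used HDDs I have obtained so far, this one is in the worst condition, but it was sold as junk, so I got it very cheaply (the lowest price of the six).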
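The reading above (days of operation, how often the drive was power-cycled, how many sectors are pending) can be reproduced with a small script. This is only a sketch: the three rows are copied from the attribute table above, while the regular expression and the `parse_attributes` helper are my own assumptions about the column layout, not part of smartmontools.

```python
import re

# Three rows copied verbatim from the smartctl attribute table above.
SAMPLE = """\
  9 Power_On_Hours          0x0032   088   088   000    Old_age   Always       -       10830
 12 Power_Cycle_Count       0x0032   099   037   020    Old_age   Always       -       1413
197 Current_Pending_Sector  0x0012   002   002   000    Old_age   Always       -       2008
"""

# Assumed column layout: ID, name, flag, VALUE, WORST, THRESH, type, updated,
# when_failed, then the leading integer of RAW_VALUE.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+\S+\s+\S+\s+\S+\s+(\d+)"
)

def parse_attributes(text):
    """Return {attribute_name: {value, worst, thresh, raw}} for each table row."""
    attrs = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            _id, name, value, worst, thresh, raw = m.groups()
            attrs[name] = {
                "value": int(value),
                "worst": int(worst),
                "thresh": int(thresh),
                "raw": int(raw),
            }
    return attrs

attrs = parse_attributes(SAMPLE)
hours = attrs["Power_On_Hours"]["raw"]
cycles = attrs["Power_Cycle_Count"]["raw"]
pending = attrs["Current_Pending_Sector"]["raw"]

print(hours // 24)               # 451 whole days of power-on time
print(round(hours / cycles, 1))  # 7.7 hours per power cycle on average
print(pending > 0)               # True: sectors are waiting to be remapped or rewritten
```

An average of under 8 hours per power cycle is consistent with the guess that the drive was switched on only when needed.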

In my experience, an HDD in this state can often be refreshed with SecureErase or with this procedure, and I reckon it is still perfectly usable as a member of a ZFS raid.

So this time I did the refresh using this procedure.
The result is as follows.
smartctl 5.43 2016-09-28 r4347 [x86_64-linux-2.6.32-754.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   060   060   044    Pre-fail  Always       -       205654349
  3 Spin_Up_Time            0x0003   098   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       128
  5 Reallocated_Sector_Ct   0x0033   099   099   036    Pre-fail  Always       -       39
  7 Seek_Error_Rate         0x000f   066   060   030    Pre-fail  Always       -       4299157096
  9 Power_On_Hours          0x0032   088   088   000    Old_age   Always       -       11030
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       1
 12 Power_Cycle_Count       0x0032   099   037   020    Old_age   Always       -       1414
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   024   024   000    Old_age   Always       -       76
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   066   046   045    Old_age   Always       -       34 (Min/Max 31/34)
194 Temperature_Celsius     0x0022   034   054   000    Old_age   Always       -       34 (0 24 0 0 0)
195 Hardware_ECC_Recovered  0x001a   052   004   000    Old_age   Always       -       205654349
197 Current_Pending_Sector  0x0012   100   002   000    Old_age   Always       -       3
198 Offline_Uncorrectable   0x0010   100   002   000    Old_age   Offline      -       3
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
Current_Pending_Sector did not drop to zero, but it fell from 2008 to 3.
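To see at a glance what the refresh changed, the raw values of the two tables can be diffed. The numbers below are copied from the before and after outputs above; the `diff` helper is a hypothetical name used only for this sketch.

```python
# Raw values copied from the "before" and "after" attribute tables above.
before = {
    "Reallocated_Sector_Ct": 39,
    "Power_On_Hours": 10830,
    "Reported_Uncorrect": 0,
    "Current_Pending_Sector": 2008,
    "Offline_Uncorrectable": 2008,
}
after = {
    "Reallocated_Sector_Ct": 39,
    "Power_On_Hours": 11030,
    "Reported_Uncorrect": 76,
    "Current_Pending_Sector": 3,
    "Offline_Uncorrectable": 3,
}

def diff(before, after):
    """Return {name: delta} for every attribute whose raw value changed."""
    return {
        name: after[name] - before[name]
        for name in before
        if after[name] != before[name]
    }

changes = diff(before, after)
for name, delta in changes.items():
    print(f"{name}: {before[name]} -> {after[name]} ({delta:+d})")
```

Two details stand out in the data: Reallocated_Sector_Ct stayed at 39 while the pending count dropped by 2005, and 200 power-on hours passed between the two snapshots.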
Using this drive on its own would be risky, but in my experience it should still be usable in a ZFS raid, so I actually added it to the pool.
[root@hoge ~]# zpool status tankQ
  pool: tankQ
 state: ONLINE
  scan: resilvered 104K in 0h0m with 0 errors on Thu Oct 18 17:36:46 2018
config:

        NAME        STATE     READ WRITE CKSUM
        tankQ       ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            tankQf  ONLINE       0     0     0
            tankQk  ONLINE       0     0     0
            tankQe  ONLINE       0     0     0
            tankQc  ONLINE       0     0     0

errors: No known data errors
The pool is now error-free as far as ZFS is concerned, and zpool scrub no longer reports errors either. Note that in tankQ each disk is encrypted with LUKS before being used.
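A check like "everything ONLINE, all counters zero" is easy to automate when a pool contains questionable drives. The following is a minimal sketch that works on the text of the config table shown above; `unhealthy` is a hypothetical helper, and it assumes each device row has exactly five whitespace-separated fields.

```python
# Config table copied from the zpool status output above.
STATUS = """\
        NAME        STATE     READ WRITE CKSUM
        tankQ       ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            tankQf  ONLINE       0     0     0
            tankQk  ONLINE       0     0     0
            tankQe  ONLINE       0     0     0
            tankQc  ONLINE       0     0     0
"""

def unhealthy(text):
    """Return names of rows that are not ONLINE or have a non-zero error counter."""
    bad = []
    for line in text.splitlines():
        fields = line.split()
        # Skip the header and anything that is not a NAME/STATE/READ/WRITE/CKSUM row.
        if len(fields) != 5 or fields[0] == "NAME":
            continue
        name, state, *counters = fields
        # Counters are compared as strings so that any value other than "0" is flagged.
        if state != "ONLINE" or any(c != "0" for c in counters):
            bad.append(name)
    return bad

print(unhealthy(STATUS))  # [] when the pool is clean
```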

Below is the rest of the initial-check data.
[root@hoge ~]# hdparm -i /dev/sdk

/dev/sdk:

 Model=ST31000340NS, FwRev=SN06, SerialNo=9xxxxxxH
 Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs RotSpdTol>.5% }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=4
 BuffType=unknown, BuffSize=unknown, MaxMultSect=16, MultSect=16
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=1953525168
 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes:  pio0 pio1 pio2 pio3 pio4 
 DMA modes:  mdma0 mdma1 mdma2 
 UDMA modes: udma0 udma1 udma2 udma3 udma4 udma5 *udma6 
 AdvancedPM=no WriteCache=enabled
 Drive conforms to: unknown:  ATA/ATAPI-4,5,6,7

 * signifies the current active mode

[root@hoge ~]# hdparm -I /dev/sdk

/dev/sdk:

ATA device, with non-removable media
        Model Number:       ST31000340NS                            
        Serial Number:      9xxxxxxH
        Firmware Revision:  SN06    
        Transport:          Serial
Standards:
        Used: unknown (minor revision code 0x0029) 
        Supported: 8 7 6 5 
        Likely used: 8
Configuration:
        Logical         max     current
        cylinders       16383   16383
        heads           16      16
        sectors/track   63      63
        --
        CHS current addressable sectors:   16514064
        LBA    user addressable sectors:  268435455
        LBA48  user addressable sectors: 1953525168
        Logical/Physical Sector size:           512 bytes
        device size with M = 1024*1024:      953869 MBytes
        device size with M = 1000*1000:     1000204 MBytes (1000 GB)
        cache/buffer size  = unknown
        Nominal Media Rotation Rate: 7200
Capabilities:
        LBA, IORDY(can be disabled)
        Queue depth: 32
        Standby timer values: spec'd by Standard, no device specific minimum
        R/W multiple sector transfer: Max = 16  Current = ?
        Recommended acoustic management value: 254, current value: 0
        DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 
             Cycle time: min=120ns recommended=120ns
        PIO: pio0 pio1 pio2 pio3 pio4 
             Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
        Enabled Supported:
           *    SMART feature set
                Security Mode feature set
           *    Power Management feature set
           *    Write cache
           *    Look-ahead
           *    Host Protected Area feature set
           *    WRITE_BUFFER command
           *    READ_BUFFER command
           *    DOWNLOAD_MICROCODE
                SET_MAX security extension
           *    48-bit Address feature set
           *    Device Configuration Overlay feature set
           *    Mandatory FLUSH_CACHE
           *    FLUSH_CACHE_EXT
           *    SMART error logging
           *    SMART self-test
           *    General Purpose Logging feature set
           *    64-bit World wide name
                Write-Read-Verify feature set
           *    WRITE_UNCORRECTABLE_EXT command
           *    {READ,WRITE}_DMA_EXT_GPL commands
           *    Segmented DOWNLOAD_MICROCODE
           *    Gen1 signaling speed (1.5Gb/s)
           *    Gen2 signaling speed (3.0Gb/s)
           *    Native Command Queueing (NCQ)
           *    Phy event counters
           *    Software settings preservation
           *    SMART Command Transport (SCT) feature set
           *    SCT Write Same (AC2)
           *    SCT Error Recovery Control (AC3)
           *    SCT Features Control (AC4)
           *    SCT Data Tables (AC5)
                unknown 206[12] (vendor specific)
Security: 
        Master password revision code = 65534
                supported
        not     enabled
        not     locked
        not     frozen
        not     expired: security count
                supported: enhanced erase
        192min for SECURITY ERASE UNIT. 192min for ENHANCED SECURITY ERASE UNIT. 
Logical Unit WWN Device Identifier: 5000c500yyyyyyy9
        NAA             : 5
        IEEE OUI        : 000c50
        Unique ID       : 0yyyyyyy9
Checksum: correct
smartctl 5.43 2016-09-28 r4347 [x86_64-linux-2.6.32-754.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda ES.2
Device Model:     ST31000340NS
Serial Number:    9xxxxxxH
LU WWN Device Id: 5 000c50 0yyyyyyy9
Firmware Version: SN06
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Thu Oct 18 17:49:17 2018 JST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (  625) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 225) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x103d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   060   060   044    Pre-fail  Always       -       205654349
  3 Spin_Up_Time            0x0003   098   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       128
  5 Reallocated_Sector_Ct   0x0033   099   099   036    Pre-fail  Always       -       39
  7 Seek_Error_Rate         0x000f   066   060   030    Pre-fail  Always       -       4299157087
  9 Power_On_Hours          0x0032   088   088   000    Old_age   Always       -       11030
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       1
 12 Power_Cycle_Count       0x0032   099   037   020    Old_age   Always       -       1414
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   024   024   000    Old_age   Always       -       76
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   066   046   045    Old_age   Always       -       34 (Min/Max 31/34)
194 Temperature_Celsius     0x0022   034   054   000    Old_age   Always       -       34 (0 24 0 0 0)
195 Hardware_ECC_Recovered  0x001a   052   004   000    Old_age   Always       -       205654349
197 Current_Pending_Sector  0x0012   100   002   000    Old_age   Always       -       3
198 Offline_Uncorrectable   0x0010   100   002   000    Old_age   Offline      -       3
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 119 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 119 occurred at disk power-on lifetime: 11004 hours (458 days + 12 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 fd 25 6c 00  Error: UNC at LBA = 0x006c25fd = 7087613

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 e0 b0 26 6c 40 00   7d+05:46:35.674  READ FPDMA QUEUED
  60 00 e0 d0 25 6c 40 00   7d+05:46:35.669  READ FPDMA QUEUED
  60 00 f0 d8 24 6c 40 00   7d+05:46:35.669  READ FPDMA QUEUED
  60 00 28 78 25 6c 40 00   7d+05:46:35.664  READ FPDMA QUEUED
  60 00 30 a8 24 6c 40 00   7d+05:46:35.663  READ FPDMA QUEUED

Error 118 occurred at disk power-on lifetime: 10830 hours (451 days + 6 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 71 04 9d 00 32 40  Device Fault; Error: ABRT 4 sectors at LBA = 0x0032009d = 3276957

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 d0 e1 7d 40 00      00:15:51.654  READ DMA EXT
  25 00 08 d0 e1 7d 40 00      00:15:51.654  READ DMA EXT
  25 00 08 d0 e1 7d 40 00      00:15:51.654  READ DMA EXT
  25 00 08 d0 e1 7d 40 00      00:15:51.654  READ DMA EXT
  25 00 08 d0 e1 7d 40 00      00:15:51.653  READ DMA EXT

Error 117 occurred at disk power-on lifetime: 10830 hours (451 days + 6 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 71 04 9d 00 32 40  Device Fault; Error: ABRT 4 sectors at LBA = 0x0032009d = 3276957

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 d0 e1 7d 40 00      00:15:51.653  READ DMA EXT
  25 00 08 c8 c5 2d 40 00      00:15:51.209  READ DMA EXT
  25 00 08 c8 c5 2d 40 00      00:15:51.209  READ DMA EXT
  25 00 08 c8 c5 2d 40 00      00:15:51.209  READ DMA EXT
  25 00 08 c8 c5 2d 40 00      00:15:51.208  READ DMA EXT

Error 116 occurred at disk power-on lifetime: 10830 hours (451 days + 6 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 71 04 9d 00 32 40  Device Fault; Error: ABRT 4 sectors at LBA = 0x0032009d = 3276957

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 c8 c5 2d 40 00      00:15:51.209  READ DMA EXT
  25 00 08 c8 c5 2d 40 00      00:15:51.209  READ DMA EXT
  25 00 08 c8 c5 2d 40 00      00:15:51.209  READ DMA EXT
  25 00 08 c8 c5 2d 40 00      00:15:51.208  READ DMA EXT
  25 00 08 c8 c5 2d 40 00      00:15:51.208  READ DMA EXT

Error 115 occurred at disk power-on lifetime: 10830 hours (451 days + 6 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 71 04 9d 00 32 40  Device Fault; Error: ABRT 4 sectors at LBA = 0x0032009d = 3276957

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 c8 c5 2d 40 00      00:15:51.208  READ DMA EXT
  25 00 08 b8 c6 2d 40 00      00:15:51.081  READ DMA EXT
  25 00 08 b8 c6 2d 40 00      00:15:51.081  READ DMA EXT
  25 00 08 b8 c6 2d 40 00      00:15:51.080  READ DMA EXT
  25 00 08 b8 c6 2d 40 00      00:15:51.080  READ DMA EXT

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     11007         -
# 2  Short offline       Completed without error       00%     11004         -
# 3  Short offline       Completed without error       00%     10953         -
# 4  Selective offline   Completed without error       00%     10837         -
# 5  Selective offline   Completed: read failure       90%     10837         1887270886
# 6  Selective offline   Completed: read failure       90%     10837         1887261750
# 7  Selective offline   Completed: read failure       90%     10836         1887217021
# 8  Selective offline   Completed: read failure       90%     10833         63511735
# 9  Selective offline   Completed: read failure       90%     10833         63502125
#10  Selective offline   Completed: read failure       90%     10833         63490659
#11  Selective offline   Completed: read failure       90%     10833         12121842
#12  Selective offline   Completed: read failure       90%     10833         12110355
#13  Selective offline   Completed: read failure       90%     10833         12098051
#14  Selective offline   Completed: read failure       90%     10833         12089280
#15  Selective offline   Completed: read failure       90%     10833         12078170
#16  Selective offline   Completed: read failure       90%     10833         12068537
#17  Selective offline   Completed: read failure       90%     10833         12059282
#18  Selective offline   Completed: read failure       90%     10833         11972284
#19  Selective offline   Completed: read failure       90%     10833         11957107
#20  Selective offline   Completed: read failure       90%     10833         11947496
#21  Selective offline   Completed: read failure       90%     10833         10545773
17 of 17 failed self-tests are outdated by newer successful extended offline self-test # 1

SMART Selective self-test log data structure revision number 1
 SPAN     MIN_LBA     MAX_LBA  CURRENT_TEST_STATUS
    1  1887270886  1953525167  Not_testing
    2           0           0  Not_testing
    3           0           0  Not_testing
    4           0           0  Not_testing
    5           0           0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
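HDD behavior varies considerably from model to model, but if you try a refresh after reading this post, I think the self-test log will be a useful reference. Note that some HDD models apparently cannot display a self-test log (perhaps the feature is not implemented?).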
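As one way of reading these logs, the sketch below pulls the LBA_of_first_error column out of a few self-test rows and rebuilds the LBA of an error-log entry from its registers. The rows and register bytes are copied from the output above; `failing_lbas` and `lba_from_registers` are hypothetical helpers, and the register layout (CH, CL, SN as the high, middle and low bytes) is inferred from the values smartctl printed, not taken from a specification.

```python
# A few rows copied from the SMART self-test log above.
LOG = """\
# 1  Extended offline    Completed without error       00%     11007         -
# 4  Selective offline   Completed without error       00%     10837         -
# 5  Selective offline   Completed: read failure       90%     10837         1887270886
# 8  Selective offline   Completed: read failure       90%     10833         63511735
#21  Selective offline   Completed: read failure       90%     10833         10545773
"""

def failing_lbas(text):
    """Return the sorted LBA_of_first_error values; rows ending in '-' had no error."""
    lbas = []
    for line in text.splitlines():
        if not line.startswith("#"):
            continue
        last = line.split()[-1]
        if last.isdigit():
            lbas.append(int(last))
    return sorted(lbas)

def lba_from_registers(sn, cl, ch):
    """Combine the SN, CL and CH register bytes into the low 24 bits of an LBA."""
    return (ch << 16) | (cl << 8) | sn

print(failing_lbas(LOG))                     # [10545773, 63511735, 1887270886]
print(lba_from_registers(0xFD, 0x25, 0x6C))  # 7087613, matching Error 119
print(lba_from_registers(0x9D, 0x00, 0x32))  # 3276957, matching Errors 115 to 118
```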

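Finally, a warning: an HDD in this condition is usable only with something like ZFS or Btrfs, where end-to-end data checksums are implemented, and only in a raid configuration. Standalone use is of course out, and I think it is dangerous even behind hardware RAID, so please be careful. It is fine, though, for experiments to learn how HDDs and the OS behave...
I myself do not use tankQ as a primary data area; I use it as a secondary area for backups and the like (where losing it would be tolerable in the worst case). As something of an OS engineer, my main aim is to gain hands-on experience with Linux (CentOS6) + ZFS and with HDD behavior (especially the recovery behavior when sector errors occur).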
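Why checksums plus redundancy matter can be shown with a toy example. This is an illustration only: it uses a two-way mirror and SHA-256 from Python's standard library, which is not how ZFS lays out raidz or computes its checksums, and `read_with_repair` is a hypothetical name.

```python
import hashlib

def checksum(block: bytes) -> str:
    """Checksum stored separately from the data, so a bad block cannot vouch for itself."""
    return hashlib.sha256(block).hexdigest()

def read_with_repair(copies, expected):
    """Return a copy that matches the checksum and overwrite any copy that does not."""
    good = next((c for c in copies if checksum(c) == expected), None)
    if good is None:
        raise IOError("all copies are corrupt")
    for i, c in enumerate(copies):
        if checksum(c) != expected:
            copies[i] = good  # self-heal the damaged copy
    return good

original = b"backup data"
expected = checksum(original)

# The first copy has been silently corrupted; the drive reported no error.
copies = [b"backup dat\x00", original]

data = read_with_repair(copies, expected)
print(data == original)       # True
print(copies[0] == original)  # True: the bad copy was rewritten
```

Without the stored checksum, a reader has no way to tell which of two differing copies is the correct one.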
