Problem with a SMART value on a server: Load_Cycle_Count

Marsh Posté on 09-03-2006 at 12:16:09

Hi everyone,
I have a server running Debian. For the past few hours it has been grinding badly: when I open Thunderbird at the office, it takes ages to fetch the mail. It gets there in the end, but only after several crashes and long minutes of waiting.
I checked whether some process was running and eating the CPU: no. Right now only the mail server part is active, nothing bandwidth-hungry like BitTorrent and so on.
To see what was going on, I looked at the state of the hard drive with SMART, and here is what I got:

smartctl -A /dev/hda
smartctl version 5.32 Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000d   100   099   050    Pre-fail  Offline  -           8589934770
  2 Throughput_Performance  0x0005   100   075   050    Pre-fail  Offline  -           4030
  3 Spin_Up_Time            0x0007   100   100   050    Pre-fail  Always   -           0
  4 Start_Stop_Count        0x0032   081   081   050    Old_age   Always   -           19642
  5 Reallocated_Sector_Ct   0x0033   098   098   010    Pre-fail  Always   -           48
  7 Seek_Error_Rate         0x000f   100   100   050    Pre-fail  Always   -           627
  8 Seek_Time_Performance   0x0005   091   087   050    Pre-fail  Offline  -           1237
  9 Power_On_Minutes        0x0032   079   079   060    Old_age   Always   -           10950h+50m
 10 Spin_Retry_Count        0x0013   100   100   050    Pre-fail  Always   -           0
 12 Power_Cycle_Count       0x0032   095   095   050    Old_age   Always   -           5944
191 G-Sense_Error_Rate      0x000a   100   100   050    Old_age   Always   -           1
192 Power-Off_Retract_Count 0x0032   099   099   050    Old_age   Always   -           646
193 Load_Cycle_Count        0x0032   001   001   050    Old_age   Always   FAILING_NOW 887766/887119
194 Temperature_Celsius     0x0022   094   042   000    Old_age   Always   -           43 (Lifetime Min/Max 69/14)
195 Hardware_ECC_Recovered  0x001a   100   060   050    Old_age   Always   -           507
196 Reallocated_Event_Count 0x0032   096   096   001    Old_age   Always   -           48
197 Current_Pending_Sector  0x0032   100   099   001    Old_age   Always   -           0
198 Offline_Uncorrectable   0x0010   100   100   001    Old_age   Offline  -           1
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always   -           0
223 Load_Retry_Count        0x0012   100   100   050    Old_age   Always   -           0
230 Head_Amplitude          0x0032   085   085   060    Old_age   Always   -           453981
250 Read_Error_Retry_Rate   0x000a   100   030   050    Old_age   Always   In_the_past 182

So the Load_Cycle_Count attribute is marked "FAILING_NOW": its normalized value (001) has dropped below its threshold (050). There is clearly a problem with the hard drive, but I don't know whether this attribute is really critical (it is only classed as Old_age).
Does anyone have an idea what could make the server crawl this badly?
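
For anyone who wants to scan such a table without reading every row, here is a minimal sketch. It assumes the usual reading of the columns (an attribute is failing when VALUE is at or below a non-zero THRESH, and has failed in the past when only WORST is); the function names are just illustrative, and the sample rows are copied from the output above.

```python
# A few rows copied from the smartctl -A output in this post.
SAMPLE = """\
5 Reallocated_Sector_Ct 0x0033 098 098 010 Pre-fail Always - 48
193 Load_Cycle_Count 0x0032 001 001 050 Old_age Always FAILING_NOW 887766/887119
197 Current_Pending_Sector 0x0032 100 099 001 Old_age Always - 0
250 Read_Error_Retry_Rate 0x000a 100 030 050 Old_age Always In_the_past 182
"""


def classify(line):
    """Return (attribute name, status) for one whitespace-separated row."""
    fields = line.split()
    name = fields[1]
    value, worst, thresh = (int(x) for x in fields[3:6])
    if thresh > 0 and value <= thresh:
        status = "failing now"
    elif thresh > 0 and worst <= thresh:
        status = "failed in the past"
    else:
        status = "ok"
    return name, status


def scan(text):
    """Classify every non-empty row of the given table text."""
    return dict(classify(line) for line in text.splitlines() if line.strip())


for name, status in scan(SAMPLE).items():
    print(f"{name:24} {status}")
```

On these rows the sketch agrees with the WHEN_FAILED column: only Load_Cycle_Count is currently failing, and Read_Error_Retry_Rate crossed its threshold at some earlier point.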

Message edited by shaddy on 09-03-2006 at 12:38:07

Marsh Posté on 09-03-2006 at 12:28:11

OK, apparently this attribute is, in a way, an indicator of the drive's lifetime as estimated by the manufacturer:
"I think that Load_Cycle_Count is the number of times that the disk heads move from a "parked" position to a different position over the spinning disk"
but I'm not too sure what "parked position" means.

Marsh Posté on 09-03-2006 at 12:37:24

OK Goon, I think we're going to shut the server down and swap the drive right away:
http://web.glandium.org/blog/?p=54
http://paul.luon.net/journal/hacking/BrokenHDDs.html (read this one above all to see just how bad this error is)
http://www.hitachigst.com/hdd/libr [...] d/load.htm (to understand the Load Cycle Count)

So apparently the drive is at the end of its life and may fail. According to the links above, there is even a known problem between journaling filesystems and some laptop hard drives: with ext3, the heads would load and unload constantly, whereas the normal rate is about 5 per hour. So naturally the drive blows past its limit quickly. Then again, ours is fairly old, so it's not that surprising.
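
For a rough sense of scale, the figures from the first smartctl output can be turned into an hourly rate. This is only a sketch: it assumes the first number of the Load_Cycle_Count raw value (887766) is the actual cycle count and reads the Power_On_Minutes raw value as 10950 h 50 min. The 5-per-hour baseline is the figure quoted in this post, not a verified specification.

```python
def load_cycle_rate(load_cycles, hours, minutes=0):
    """Average number of head load cycles per powered-on hour."""
    total_hours = hours + minutes / 60
    return load_cycles / total_hours


# Raw values taken from the smartctl -A output earlier in the thread.
cycles_per_hour = load_cycle_rate(887766, 10950, 50)
seconds_between_cycles = 3600 / cycles_per_hour
ratio_to_baseline = cycles_per_hour / 5  # 5 per hour: figure quoted in the post

print(f"{cycles_per_hour:.1f} cycles/hour")
print(f"one cycle every {seconds_between_cycles:.0f} s on average")
print(f"{ratio_to_baseline:.1f}x the quoted baseline")
```

Under those assumptions the drive has averaged roughly 81 load cycles per hour, that is one about every 44 seconds and around 16 times the quoted rate.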

Marsh Posté on 09-03-2006 at 12:40:14

Apparently you can stop, or at least slow down, the Load_Cycle_Count increase with
hdparm -B254
That sets APM to its least aggressive level (255 would disable it outright), but that's not great either.

Marsh Posté on 09-03-2006 at 12:43:39

Who cares about APM, the server never moves.

Marsh Posté on 09-03-2006 at 14:56:28

I ran a quick SMART self-test and then
smartctl -l error /dev/hda

Here are the errors logged after the test. Apparently they occur when the test issues the commands
READ VERIFY SECTOR(S) and READ DMA
That means there are bad sectors, right?

smartctl version 5.32 Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Error Log Version: 1
ATA Error Count: 67 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 67 occurred at disk power-on lifetime: 7563 hours (315 days + 3 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 01 03 5c 63 e0 Error: UNC at LBA = 0x00635c03 = 6511619

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
40 00 01 03 5c 63 e0 00 00:05:59.370 READ VERIFY SECTOR(S)
c8 00 01 00 00 00 e0 00 00:05:59.370 READ DMA
40 00 02 05 5c 63 e0 00 00:05:59.350 READ VERIFY SECTOR(S)
c8 00 01 00 00 00 e0 00 00:05:59.350 READ DMA
40 00 02 03 5c 63 e0 00 00:05:57.230 READ VERIFY SECTOR(S)

Error 66 occurred at disk power-on lifetime: 7563 hours (315 days + 3 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 02 03 5c 63 e0 Error: UNC at LBA = 0x00635c03 = 6511619

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
40 00 02 03 5c 63 e0 00 00:05:57.230 READ VERIFY SECTOR(S)
c8 00 01 3b 8b 38 e1 00 00:05:57.230 READ DMA
40 00 04 03 5c 63 e0 00 00:05:55.130 READ VERIFY SECTOR(S)
c8 00 01 00 00 00 e0 00 00:05:55.130 READ DMA
40 00 04 ff 5b 63 e0 00 00:05:55.130 READ VERIFY SECTOR(S)

Error 65 occurred at disk power-on lifetime: 7563 hours (315 days + 3 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 04 03 5c 63 e0 Error: UNC at LBA = 0x00635c03 = 6511619

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
40 00 04 03 5c 63 e0 00 00:05:55.130 READ VERIFY SECTOR(S)
c8 00 01 00 00 00 e0 00 00:05:55.130 READ DMA
40 00 04 ff 5b 63 e0 00 00:05:55.130 READ VERIFY SECTOR(S)
c8 00 01 00 00 00 e0 00 00:05:55.130 READ DMA
40 00 08 07 5c 63 e0 00 00:05:55.110 READ VERIFY SECTOR(S)

Error 64 occurred at disk power-on lifetime: 7563 hours (315 days + 3 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 04 03 5c 63 e0 Error: UNC at LBA = 0x00635c03 = 6511619

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
40 00 08 ff 5b 63 e0 00 00:05:52.970 READ VERIFY SECTOR(S)
40 00 10 0f 5c 63 e0 00 00:05:52.950 READ VERIFY SECTOR(S)
c8 00 01 3b 8b 38 e1 00 00:05:52.950 READ DMA
40 00 10 ff 5b 63 e0 00 00:05:50.820 READ VERIFY SECTOR(S)
c8 00 01 00 00 00 e0 00 00:05:50.820 READ DMA

Error 63 occurred at disk power-on lifetime: 7563 hours (315 days + 3 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 0c 03 5c 63 e0 Error: UNC at LBA = 0x00635c03 = 6511619

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
40 00 10 ff 5b 63 e0 00 00:05:50.820 READ VERIFY SECTOR(S)
c8 00 01 00 00 00 e0 00 00:05:50.820 READ DMA
40 00 20 1f 5c 63 e0 00 00:05:50.790 READ VERIFY SECTOR(S)
c8 00 01 00 00 00 e0 00 00:05:50.790 READ DMA
40 00 20 ff 5b 63 e0 00 00:05:48.650 READ VERIFY SECTOR(S)
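
The LBA shown on each "Error: UNC" line can be cross-checked against the register dump; all five logged errors point at the same sector. A minimal sketch, assuming 28-bit LBA addressing (the low nibble of DH holds bits 24 to 27, followed by CH, CL and SN); the function name is only illustrative.

```python
def lba28(sn, cl, ch, dh):
    """Combine ATA task-file registers into a 28-bit LBA."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# Registers reported after Error 67: SN=03 CL=5c CH=63 DH=e0
failing = lba28(0x03, 0x5C, 0x63, 0xE0)
print(f"0x{failing:08x} = {failing}")  # 0x00635c03 = 6511619

# An earlier READ VERIFY in the log: SC=10 SN=ff CL=5b CH=63 DH=e0
start = lba28(0xFF, 0x5B, 0x63, 0xE0)
count = 0x10
print(start <= failing < start + count)  # True: that range covers the bad LBA
```

The READ VERIFY commands in the log use shrinking sector counts (0x20, 0x10, 0x08, ...) over ranges that all contain LBA 6511619, which is consistent with the drive narrowing down on one unreadable sector.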

Marsh Posté on 09-03-2006 at 15:02:58

We're going to die, we're going to die

Marsh Posté on 09-03-2006 at 15:09:50

Goon wrote:

We're going to die, we're going to die

Nah, we're not.

Marsh Posté on 09-03-2006 at 15:10:49

Hey, look how super fast the hard drive is!!!

hdparm -tT /dev/hda

/dev/hda:
Timing cached reads: 548 MB in 2.00 seconds = 273.44 MB/sec
Timing buffered disk reads: 8 MB in 3.78 seconds = 2.12 MB/sec
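
To put the two figures side by side: the cached-reads line mostly reflects memory and cache speed, while the buffered-disk-reads line is the one that actually exercises the drive. A small sketch that recomputes both rates from the MB and seconds fields above; the regular expression and names are illustrative and only assume the line format shown here.

```python
import re

# The two timing lines from the hdparm -tT run above.
HDPARM_OUTPUT = """\
 Timing cached reads: 548 MB in 2.00 seconds = 273.44 MB/sec
 Timing buffered disk reads: 8 MB in 3.78 seconds = 2.12 MB/sec
"""

LINE = re.compile(
    r"Timing (?P<kind>[\w ]+?):\s+(?P<mb>\d+) MB in\s+(?P<sec>[\d.]+) seconds"
)


def rates(text):
    """Map each timing line to MB/s recomputed from its MB and seconds fields."""
    return {m["kind"]: int(m["mb"]) / float(m["sec"]) for m in LINE.finditer(text)}


result = rates(HDPARM_OUTPUT)
for kind, mb_per_sec in result.items():
    print(f"{kind}: {mb_per_sec:.2f} MB/s")
```

The recomputed disk rate matches the reported 2.12 MB/sec; the cached figure comes out slightly higher than reported because the elapsed time is printed rounded to 2.00 seconds.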

Message edited by shaddy on 09-03-2006 at 15:12:10

Marsh Posté on 09-03-2006 at 15:23:23

2.12 :ouch:
