{"id":108,"date":"2012-07-21T01:04:30","date_gmt":"2012-07-21T06:04:30","guid":{"rendered":"http:\/\/www.paulreed.ca\/?p=108"},"modified":"2012-07-21T01:38:41","modified_gmt":"2012-07-21T06:38:41","slug":"setting-up-an-email-alert-for-a-linux-software-raid-failure","status":"publish","type":"post","link":"https:\/\/paulreed.ca\/?p=108","title":{"rendered":"Setting up an email alert for a Linux software RAID failure"},"content":{"rendered":"<p>Recently I had a drive fail in a software RAID1 array on a CentOS 5.x server. I decided to write a simple cronjob\/bash script that would nag me if it detected that things weren&#8217;t running correctly. This should work on almost any other Linux distro as well.<\/p>\n<p>I used vi to edit these files; you&#8217;ll have to look elsewhere if you need help using vi.<\/p>\n<p>Here is the entry for root&#8217;s crontab (added while logged in as root using &#8220;crontab -e&#8221;):<\/p>\n<pre>*\/15 * * * * sh \/root\/raidstat.sh > \/dev\/null 2>&1<\/pre>\n<p>And here are the contents of \/root\/raidstat.sh:<\/p>\n<pre>#!\/bin\/sh\r\n# Count the arrays that \/proc\/mdstat reports as fully healthy ([UU])\r\nTEST=`cat \/proc\/mdstat | grep -o \"\\[UU\\]\" | wc -w`\r\nif [ \"$TEST\" = \"3\" ]; then\r\n    # RAID OK\r\n    MDSTAT=`cat \/proc\/mdstat`\r\nelse\r\n    # RAID NOT OK - send out email\r\n    MDSTAT=`cat \/proc\/mdstat`\r\n    echo \"$MDSTAT\" | mail -s \"*WARNING* - RAID FAILURE DETECTED ON &lt;yourhostname here&gt;\" &lt;your email address here&gt;\r\nfi<\/pre>\n<p>Put simply, this script counts the instances of &#8220;[UU]&#8221; in \/proc\/mdstat and expects to find 3 (one for each of my 3 RAID1 software arrays). If it doesn&#8217;t find 3, there is a problem with one or more of my RAID arrays. The output of &#8220;cat \/proc\/mdstat&#8221; is then emailed to me, so I can determine what is wrong before I even log into the server. 
<\/p>\n<p>The cron job will repeat this email alert every 15 minutes until the issue is resolved.<\/p>\n<p>Here is an example of my output for &#8220;cat \/proc\/mdstat&#8221;, showing the 3 arrays (1 of them in the middle of recovery). Since only 2 instances of &#8220;[UU]&#8221; are present, I will get emails until the array is rebuilt and 3 instances are found.<\/p>\n<pre>Personalities : [raid1]\r\nmd0 : active raid1 xvdb1[1] xvda1[0]\r\n      104320 blocks [2\/2] [UU]\r\n\r\nmd1 : active raid1 xvdb2[1] xvda2[0]\r\n      2096384 blocks [2\/2] [UU]\r\n\r\nmd2 : active raid1 xvdb5[2] xvda5[0]\r\n      484086528 blocks [2\/1] [U_]\r\n      [=====>...............]  recovery = 26.7% (129286388\/484086528) finish=90.2min speed=65553K\/sec\r\n\r\nunused devices: &lt;none&gt;<\/pre>\n<p>Software RAID5 arrays will have more U&#8217;s in the status, so you&#8217;ll have to adjust the script accordingly. If you have a mix of RAID5 and RAID1, I suggest using 2 copies of the script, one for each RAID level, each searching only for the specific number of [UU] or [UUUUU] instances.<\/p>\n<p>There will always be 1 U for each drive present and functioning in the array. An underscore indicates that a drive is missing: [U_] means the second device is missing, and [_U] means the first device is missing.<\/p>\n<p>In my case, I was able to restart my server and the 2nd drive came back up, so I was able to re-add the 2nd device to the array for it to rebuild. 
A similar process would be followed after replacing a disk completely (you&#8217;ll have to search elsewhere for a full replacement scenario).<\/p>\n<p>I was able to bring my missing device back with the following command, which started the rebuild process using the existing disk:<\/p>\n<pre>mdadm --manage \/dev\/md2 --add \/dev\/xvdb5\r\n<\/pre>\n<p>Hope this helps someone else out there.<\/p>\n<p>Paul<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Recently I had a drive fail in a software RAID1 array on a CentOS 5.x server. I decided to write a simple cronjob\/bash script that would nag me if it detected that things weren&#8217;t running correctly. This should work on almost any other Linux distro as well. I used vi to edit these files, [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[25],"tags":[31,30,47,29,26,28,27],"_links":{"self":[{"href":"https:\/\/paulreed.ca\/index.php?rest_route=\/wp\/v2\/posts\/108"}],"collection":[{"href":"https:\/\/paulreed.ca\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/paulreed.ca\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/paulreed.ca\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/paulreed.ca\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=108"}],"version-history":[{"count":39,"href":"https:\/\/paulreed.ca\/index.php?rest_route=\/wp\/v2\/posts\/108\/revisions"}],"predecessor-version":[{"id":148,"href":"https:\/\/paulreed.ca\/index.php?rest_route=\/wp\/v2\/posts\/108\/revisions\/148"}],"wp:attachment":[{"href":"https:\/\/paulreed.ca\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=108"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/paulreed.ca\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=108"},{"taxonomy":"post_tag","embeddable
":true,"href":"https:\/\/paulreed.ca\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=108"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}