I thought my Alpha 4000 was 'getting better' after the problems listed below: it would run for a day or two before I shut it down normally.
But today, after giving the BOOT and CONTINUE commands, it halted with
"
halted CPU 0
halt code 5
HALT INSTRUCTION EXECUTED
PC = ffffffff838272f0
"
To get to the point: does anyone know if the PC value can be used to identify the problem?
------------------------------------------------------------------------------------------------------------------------------------------------------
I've reloaded the AlphaBIOS and SRM firmware, as I thought I might have corrupted it, and run
>>> test cpu0
>>> test PCI0
>>> test mem0 and mem1
I didn't get any errors, but the tests timed out, so they weren't comprehensive.
Sorry if this is a bit garbled, but I made a few notes when I tried to install a SCSI card in my Alpha 4000 a few weeks ago.
The problems started after physically installing a KZPSA SCSI card and connecting it to the 2nd shelf in the Alpha 4000.
It had been working nicely with the 5-member RAID 5 disk setup in the lower shelf, so it was perhaps a bit silly to start modifying the hardware setup. Whilst 'playing' with the SCSI card I attempted to upgrade the firmware, using LFU to install the KZPSAA12 firmware.
Initially I had a failure to power up and various errors, which I didn't note down well, but I decided that there was a problem with the 512 MBytes of memory.
The first message was "Power up tests have detected a problem with your system! Reset the system and observe the OCP display and/or type 'show config' and 'show power'".
From the SRM console I ran "test mem0" and got a deluge of messages, including
"CPU detected uncorrectable ECC error" and
"CPU 00 unexpected machine check through vector 0000670"
I replaced the 2 x B3030-EA 256 MByte memory cards (asynchronous) with 4 x B3020-CH 64 MByte (synchronous) cards. Afterwards the machine booted and was running normally (AFAIK). Although the machine now has less memory (256 MBytes rather than 512), the synchronous RAM is described online as faster.
Later (after the RAM change) I booted normally and the system seemed OK, but after being distracted for a while I came back to my PC and got no response from the SSH terminal sessions to the Alpha.
I went to the Alpha Console and got:
halt code 6
pcb 6a3d0
!!! console entry context is not valid - Reset the system !!!
Brk 0 at 0006917c
0006917 ! BPT
I tried to power up and boot several times and got more messages, like:
UNEX MCHK on CPU 0
EXC_ADR4239
and Test CPU0
"your system has halted due to an irrecoverable error, record the error
halt code and contact your Digital Service Rep."
Type INFO 5 and INFO 8
also
OpenVMS Alpha V8.3 - BUGCHECK
Bugcheck code = 000000215 : MachineCheck while in Kernel Mode
and another halt code 7 event
Anyway after several boots, the system would run for increasing durations before hanging.
Edited by somersdave on June 18 2012 23:32
I wonder if someone could have a look at the crash dump (my first ever) from the Alpha 4000 and say if they can see any obvious clue to the 'HALT' problem, which now happens consistently on booting, unlike the variety of errors I was getting earlier.
System Registers:
Page Table Base Register (PTBR) 00000000.000000F9
Processor Base Register (PRBR) FFFFFFFF.81C14000
Crashdump Summary Information:
------------------------------
Privileged Context Block Base (PCBB ) 00000000.00002180
System Control Block Base (SCBB ) 00000000.00000A47
Software Interrupt Summary Register (SISR) 00000000.00000000
Address Space Number (ASN) 00000000.0000007D
AST Summary / AST Enable (ASTSR_ASTEN) 00000000.00000000
Floating-Point Enable (FEN) 00000000.00000000
Interrupt Priority Level (IPL) 00000000.0000001F
Machine Check Error Summary (MCES) 00000000.00000008
Virtual Page Table Base Register (VPTB ) FFFFFEFC.00000000
Failing Instruction:
FFFFFFFF.838272F0: HALT
Instruction Stream (last 20 instructions):
FFFFFFFF.838272A0: HALT
FFFFFFFF.838272A4: HALT
FFFFFFFF.838272A8: HALT
FFFFFFFF.838272AC: HALT
FFFFFFFF.838272B0: HALT
FFFFFFFF.838272B4: HALT
FFFFFFFF.838272B8: HALT
FFFFFFFF.838272BC: HALT
FFFFFFFF.838272C0: HALT
FFFFFFFF.838272C4: HALT
FFFFFFFF.838272C8: HALT
FFFFFFFF.838272CC: HALT
FFFFFFFF.838272D0: HALT
FFFFFFFF.838272D4: HALT
FFFFFFFF.838272D8: HALT
FFFFFFFF.838272DC: HALT
FFFFFFFF.838272E0: HALT
FFFFFFFF.838272E4: HALT
FFFFFFFF.838272E8: HALT
FFFFFFFF.838272EC: HALT
FFFFFFFF.838272F0: HALT
FFFFFFFF.838272F4: HALT
FFFFFFFF.838272F8: HALT
FFFFFFFF.838272FC: HALT
FFFFFFFF.83827300: HALT
SDA> exit
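One observation on the dump above: on Alpha, HALT is CALL_PAL function 0, and its 32-bit encoding is all zeros, so a long run of HALTs in the instruction stream is what zero-filled memory looks like when disassembled. That suggests (without proving) that the PC ended up pointing into a zeroed region rather than at a deliberately coded HALT. A minimal Python sketch of the decoding, where the function name `decode` is made up for illustration and only CALL_PAL is recognised:

```python
def decode(word: int) -> str:
    """Decode just enough of a 32-bit Alpha instruction word to spot CALL_PAL HALT."""
    opcode = (word >> 26) & 0x3F      # bits 31:26 hold the major opcode
    function = word & 0x03FFFFFF      # bits 25:0 hold the PALcode function
    if opcode == 0x00:                # major opcode 0 is CALL_PAL
        return "HALT" if function == 0 else f"CALL_PAL {function:#x}"
    return f"opcode {opcode:#04x} (not decoded here)"


# Four zero bytes, as found in a zero-filled page, decode as HALT.
zero_word = int.from_bytes(bytes(4), "little")
print(decode(zero_word))
```

So the PC value on its own mostly says where execution ended up, not how it got there.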
OK, the worldwide effort to find the answer to my problem (the Alpha 4000 one, anyway) can stand down.
I overcame my fear of losing the 'stuff' I'd installed and did a V8.3 reinstall from CD, using the PRESERVE option, and all looks well (for now, anyhow). I'm not sure why I had the power-up problems.
If a reinstall fixed the problem, it means that something that it replaced was either corrupted or was a driver that did not support your hardware.
Since you do not know what the root cause was, it may be time to make sure that you have a good way to replace and restore the system disk, as what you are seeing may be a symptom of a disk starting to go bad.
The problems started when I tried installing the SCSI adapter in a 'cavalier' sort of way, but I can't remember the exact details of what went awry. I seem to remember powering the Alpha off while it was still 'doing something', and at about the same time a local substation problem cut the power several times whilst the Alpha was booted.
I made a disk saveset and FTPed it to my PC disk, and have only tried doing some programming exercises with CXX and Fortran since then.
After reinstalling the firmware and V8.3, the Alpha can 'see' the KZPSA controller and the RAID disk hasn't got any error count, so I guess that's a good sign. Doing the reinstall might prove a useful learning exercise if I want to upgrade to V8.4, and I might try using the KZPSA to connect to the 2nd disk shelf for backups.
$ sh dev pk/fu
Device PKA0:, device type NCR 53C810 SCSI, is online, error logging is enabled.
Error count 0 Operations completed 23
Owner process "" Owner UIC [SYSTEM]
Owner process ID 00000000 Dev Prot S:RWPL,O:RWPL,G,W
Reference count 0 Default buffer size 0
Device PKB0:, device type KZPSA/SCSI (SIMport), is online, error logging is
enabled.
Error count 0 Operations completed 24
Owner process "" Owner UIC [SYSTEM]
Owner process ID 00000000 Dev Prot S:RWPL,O:RWPL,G,W
Reference count 0 Default buffer size 0
$ sh dev dra0/fu
Disk ALPHA1$DRA0:, device type 5 Member RAID 5, is online, mounted, file-
oriented device, shareable, available to cluster, error logging is enabled.
Error count 0 Operations completed 29071
Owner process "" Owner UIC [SYSTEM]
Owner process ID 00000000 Dev Prot S:RWPL,O:RWPL,G:R,W
Reference count 471 Default buffer size 512
Total blocks 33513472 Sectors per track 64
Total cylinders 32728 Tracks per cylinder 16
Logical Volume Size 33513472 Expansion Size Limit 2147475456
Volume label "ALPHASYS" Relative volume number 0
Cluster size 16 Transaction count 463
Free blocks 15217744 Maximum files allowed 16711679
Extend quantity 5 Mount count 1
Mount status System Cache name "_ALPHA1$DRA0:XQPCACHE"
Extent cache size 64 Maximum blocks in extent cache 1521774
File ID cache size 64 Blocks in extent cache 40384
Quota cache size 0 Maximum buffers in FCP cache 1034
Volume owner UIC [1,1] Vol Prot S:RWCD,O:RWCD,G:RWCD,W:RWCD
Volume Status: ODS-5, subject to mount verification, protected subsystems
enabled, file high-water marking, write-through caching enabled, hard
links enabled.
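As a quick sanity check on the SHOW DEVICE figures above, the block count matches the reported geometry, and the sizes follow from 512-byte blocks. A small Python sketch of the arithmetic (the variable names are mine; the numbers are copied from the listing):

```python
BLOCK_BYTES = 512            # OpenVMS disk block size

total_blocks = 33513472      # "Total blocks"
free_blocks = 15217744       # "Free blocks"
cylinders, tracks, sectors = 32728, 16, 64

# The reported geometry should multiply out to the total block count.
geometry_blocks = cylinders * tracks * sectors

total_gib = total_blocks * BLOCK_BYTES / 2**30
free_fraction = free_blocks / total_blocks

print(f"geometry matches: {geometry_blocks == total_blocks}")
print(f"volume size: {total_gib:.2f} GiB, free: {free_fraction:.1%}")
```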
Some PCI options like RAID disk controllers may not be completely powered off unless the power cord is removed.
Typically though, a RAID controller will still not corrupt the disks in this situation unless its backup battery is bad. And if a RAID controller suspects corruption, it will take the entire RAID array offline until you go into the controller and force it to be set to good.
I had this happen in a few cases where the disk drives were powered off before the RAID controller was, even though the operating system was shut down. After marking the disks online, I never had any real corruption on a production VMS system.
As the Alpha now seems to be operating properly and VMS can 'see' the KZPSA, I'm wondering about trying to connect the BA350 (I think) shelf using an 'H885-AA RevA02 Trilink Connector' via the cable, which has 2 'AMP' (wide SCSI?) plugs. I'm using the Trilink as I'm lacking a spare personality (connector, that is; it seems a strange name in the logical world of computing). The Trilink has two front SCSI data sockets, one of which has a blanking plug.
With my recent Alpha problems (of my own creation), I'm just a little concerned that I could do some more damage. Does anyone have any thoughts on the risks that might exist? I just slotted various 4 and 9 GB 'bricks' into the BA350 enclosure.
Trying it out with a disk backup:
$
$ backup/image/verify alpha1$dra0: dkb500:[000000]alpha4000_08JUL2012.bck/save
%BACKUP-E-OPENIN, error opening ALPHA1$DRA0:[SYS0.SYSEXE]NET$PROXY.DAT;1 as input
-SYSTEM-W-ACCONFLICT, file access conflict
%BACKUP-I-NOBACKUP, ALPHA1$DRA0:[SYS0.SYSEXE]PAGEFILE.SYS;2 data not copied, file marked NOBACKUP
%BACKUP-I-NOBACKUP, ALPHA1$DRA0:[SYS0.SYSEXE]PAGEFILE.SYS;1 data not copied, file marked NOBACKUP
%BACKUP-I-NOBACKUP, ALPHA1$DRA0:[SYS0.SYSEXE]SWAPFILE.SYS;3 data not copied, file marked NOBACKUP
%BACKUP-I-NOBACKUP, ALPHA1$DRA0:[SYS0.SYSEXE]SWAPFILE.SYS;2 data not copied, file marked NOBACKUP
%BACKUP-I-NOBACKUP, ALPHA1$DRA0:[SYS0.SYSEXE]SWAPFILE.SYS;1 data not copied, file marked NOBACKUP
%BACKUP-I-NOBACKUP, ALPHA1$DRA0:[SYS0.SYSEXE]SYS$ERRLOG.DMP;2 data not copied, file marked NOBACKUP
%BACKUP-I-NOBACKUP, ALPHA1$DRA0:[SYS0.SYSEXE]SYS$ERRLOG.DMP;1 data not copied, file marked NOBACKUP
%BACKUP-I-NOBACKUP, ALPHA1$DRA0:[SYS0.SYSEXE]SYSDUMP.DMP;2 data not copied, file marked NOBACKUP
%BACKUP-I-NOBACKUP, ALPHA1$DRA0:[SYS0.SYSEXE]SYSDUMP.DMP;1 data not copied, file marked NOBACKUP
%BACKUP-E-OPENIN, error opening ALPHA1$DRA0:[SYS0.SYSMGR]ACCOUNTNG.DAT;1 as input
-SYSTEM-W-ACCONFLICT, file access conflict
%BACKUP-E-OPENIN, error opening ALPHA1$DRA0:[VMS$COMMON.SYSEXE]VMS$OBJECTS.DAT;1 as input
-SYSTEM-W-ACCONFLICT, file access conflict
%BACKUP-E-OPENIN, error opening ALPHA1$DRA0:[VMS$COMMON.SYSMGR]SECURITY.AUDIT$JOURNAL;1 as input
-SYSTEM-W-ACCONFLICT, file access conflict
%BACKUP-E-OPENIN, error opening ALPHA1$DRA0:[VMS$COMMON.SYSMGR]VMS$AUDIT_SERVER.DAT;1 as input
-SYSTEM-W-ACCONFLICT, file access conflict
%BACKUP-E-OPENIN, error opening ALPHA1$DRA0:[]ACCOUNTNG.DAT;1 as input
-SYSTEM-W-ACCONFLICT, file access conflict
%BACKUP-E-OPENIN, error opening ALPHA1$DRA0:[]VMS$AUDIT_SERVER.DAT;1 as input
-SYSTEM-W-ACCONFLICT, file access conflict
%BACKUP-E-OPENIN, error opening ALPHA1$DRA0:[]SECURITY.AUDIT$JOURNAL;1 as input
-SYSTEM-W-ACCONFLICT, file access conflict
%BACKUP-E-OPENIN, error opening ALPHA1$DRA0:[]VMS$OBJECTS.DAT;1 as input
-SYSTEM-W-ACCONFLICT, file access conflict
%BACKUP-E-OPENIN, error opening ALPHA1$DRA0:[]NET$PROXY.DAT;1 as input
-SYSTEM-W-ACCONFLICT, file access conflict
%BACKUP-I-STARTVERIFY, starting verification pass at 8-JUL-2012 13:56:05.53
%BACKUP-E-VERIFYERR, verification error for block 2008 of ALPHA1$DRA0:[SYS0.SYSERR]ERRLOG.SYS;1
%BACKUP-E-VERIFYERR, verification error for block 2 of ALPHA1$DRA0:[SYS0]SYSEXE.DIR;1
%BACKUP-E-OPENIN, error opening ALPHA1$DRA0:[SYS0.SYSEXE]DNS$CACHE.0000003442;1 as input
-SYSTEM-W-NOSUCHFILE, no such file
%BACKUP-E-OPENIN, error opening ALPHA1$DRA0:[SYS0.SYSEXE]DNS$CACHE.VERSION;3443 as input
-SYSTEM-W-NOSUCHFILE, no such file
(I haven't listed the .LOG files.)
I'm not sure whether these errors would invalidate the BACKUP if I needed it to recover the system. The BACKUP /IGNORE qualifier looks like it will solve the problem of locked files, etc.
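To see at a glance which kinds of message a BACKUP log contains, the messages can be tallied by facility, severity and identifier. A rough Python sketch, assuming the log has been saved as text on a PC with one message per line (terminal-wrapped lines would need joining first); `tally` and the sample text are illustrative and not part of any VMS tool:

```python
import re
from collections import Counter

# VMS messages look like %FACILITY-S-IDENT, text (or -FACILITY-S-IDENT for
# a linked secondary message); the severity S is one of S, I, W, E, F.
MESSAGE = re.compile(r"^[%-]([A-Z0-9$_]+)-([SIWEF])-([A-Z0-9$_]+),")


def tally(log_text: str) -> Counter:
    """Count messages by (facility, severity, ident); other lines are skipped."""
    counts = Counter()
    for line in log_text.splitlines():
        match = MESSAGE.match(line.strip())
        if match:
            counts[match.groups()] += 1
    return counts


# A few lines in the style of the BACKUP output in this post.
sample = """\
%BACKUP-E-OPENIN, error opening ALPHA1$DRA0:[SYS0.SYSMGR]ACCOUNTNG.DAT;1 as input
-SYSTEM-W-ACCONFLICT, file access conflict
%BACKUP-I-NOBACKUP, ALPHA1$DRA0:[SYS0.SYSEXE]SYSDUMP.DMP;1 data not copied, file marked NOBACKUP
%BACKUP-E-VERIFYERR, verification error for block 2 of ALPHA1$DRA0:[SYS0]SYSEXE.DIR;1
"""

for (facility, severity, ident), n in sorted(tally(sample).items()):
    print(f"{facility}-{severity}-{ident}: {n}")
```

Splitting the counts this way keeps the informational NOBACKUP entries and the access conflicts apart from the VERIFYERR lines, which are probably the ones worth re-checking.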
I haven't checked, but I don't think 'hot swapping' of (non-RAID?) disks is allowed. Were it possible, it might prove to be a useful facility.
Edited by somersdave on July 08 2012 10:06
The hot swap capability of a drive is a hardware feature of the enclosure. The controller or OS must also be expecting the event or possibility of the event.
Generally, RAID controllers need to be told through a management utility before any disk that they can see is removed or replaced. If one is changed without that, the RAID controller will mark it as failed and ignore it. Usually you can force an override of this.
One common issue is the disk shelf losing power while the controller does not, even if the operating system is shut down. In that case, forcing an override will usually result in the disk data being recovered.