Disaster recovery

17 views
Skip to first unread message

Bill Farrell

unread,
Jul 11, 2012, 10:10:47 PM7/11/12
to intersy...@googlegroups.com
We're all bound to need it some time. This week we needed it. The
system hung up and there was no graceful way to shut everything down.
That left some 285 transactions partially updated.

In Pick we could do a SEL.RESTORE and pull back selected items. In
Cache', what we were left with was the transaction log. Not a very easy
thing to work with. We looked at making a JRNFILT routine, but that
doesn't let us dump sets/kills to another place where we can examine
what the impact might be.

Is it possible to use JRNDUMP in such a way as to make a text file with
the sets/kills so we can pick and choose (with a bit of programing)
which to play back into the database?

Alternatively, is it possible to parse the transaction file
programmatically?

One of my co-workers remembered something about playing the wij file
back into the system at hospitals where he'd worked before to capture a
stateful moment in history, but he wasn't clear on the details. Would
there have been an avenue here?

I'd love to see a thread on recovery. We all hope we'll never need to
know but it would be fantastic to have tips and tricks at hand.

Bill

Tony Gravagno

unread,
Jul 12, 2012, 1:50:55 PM7/12/12
to InterSy...@googlegroups.com
> From: Bill Farrell
> In Pick we could do a SEL.RESTORE and pull back selected items. In
> Cache', what we were left with was the transaction log. Not a very
> easy thing to work with.

(Caution, this rambling could be a waste of your time, maybe not if
you aren't familiar with logging in general)

I have no Cache'-specific insight into the issues you're having on
this, Bill, but can only comment on the above. Sel-restore is only
good for items saved in an account-save the night prior to an event.
All of the other Pick platforms support transaction logging too, and
they are no easier or more difficult to use in the current situation.
A logger like this records transactions in the order they're written
to disk. If the application is written to use transaction bracketing,
which most of the other Pick platforms support, then entire business
transactions, not just data file transactions, will be restored, or
not, to ensure referential integrity. I don't know if or how that's
done manually with any MV system; like you and most others it's just
not an area that I've explored but it would be interesting to
understand the capabilities of the various systems.

A transaction logger writes transactions sequentially to a queue which
is usually saved off to removable media or a different system. Ideally
the queue only exists for a few seconds but some "un-de-queued" logs
can accumulate into terabytes if gone unchecked. The point here is
that you're asking about a utility to do something on an entity that
shouldn't exist. I know that in D3, the outbound queue can be parsed
with BASIC but I don't know about other platforms. However, and here's
the point, we wouldn't parse the queue to retrieve specific
transactions and not others. Such a thing would have to be done
surgically, and because getting that wrong can cause as much damage as
the problem that prompted the effort, I doubt the DBMS providers would
spend too much time making this easy. Consider that the queue is
sequentially written in time and that your system processes data
sequentially in time as well. Your system makes decisions based on
available data: This record doesn't exist so it will write that
record, or the counter is set to this, so it will be incremented to
that. Reaching into the queue and pulling out transactions can mess
with that logic. Further, remember that unless the application is
transaction bracketed, which very few are, pulling a single disk
update out of context is very problematic, and even doing this
programmatically you would need to manually find and retrieve all data
operations associated with a specific bracketed transaction in order
to recover that event. A good utility would allow for that of course,
but does anyone have such a beast for the tiny audience that might
ever use it?

All in all, I know you just want to restore the integrity of the data
as of a moment in time. That's tough because as above at any moment
any number of transactions were in flux. Logging does exactly that, it
allows for restoring to a specific moment in time but ironically in
all of these systems, when you finally do restore to that point, where
everything was still in flux, even with logging you're left with a
system that isn't referentially in-tact, but you do have all of your
data! The solution, again, is transaction bracketing at the business
transaction level, and I know Very few applications that have ever had
this implemented, though one could argue that any "real" application
should. YMMV

As I said above, the rambling doesn't go anywhere but may serve as a
base for comments about what Cache' actually does support in this
context.

Good luck,
T

Michael Cohen

unread,
Jul 12, 2012, 3:39:08 PM7/12/12
to intersy...@googlegroups.com
Hi Bill,

I am sorry, but your question is not clear to me (not an old-time MV user, so not familiar with SEL.RESTORE).

If you are talking about Disaster Recovery, what is the scenario?

For example, assume a system is 24/7 with a weekly hot backup moved offsite, and nightly journals copied offsite. If the main sight has a disaster, the offsite can restore the backup, apply the journals, and rollback any open transactions. The system would then be in a clean state, but missing today's data. For example, if an app began a transaction, subtracted $1000 from my checking account, and was about to add $1000 to my savings account when the last journal was saved, then this process leaves my accounts in a good state.

Looking at this completely differently, assume you discover that record 1234 in file AR now contains a wrong value. The Management Portal makes it easy to examine the journals and determine which processes changed the data and when.

If you can describe the problem, I will attempt to offer a solution.


Bill Farrell

unread,
Jul 16, 2012, 11:30:29 AM7/16/12
to intersy...@googlegroups.com
Thanks Tony,

I apologize for the delay in responding. Last week was, mm, interesting.

Thanks for giving some good background, Tony. Michael, that's a
"traditional" MV command to scan a backup medium for a particular file and
perhaps also a particular list of items to retrieve from it. One of the
nice things about SEL.RESTORE is you could restore items to a file and
shuffle through them either with a rolling edit or run some program or
English/Access/Recall across it for doing some forensics. You might need
one record, a few, none, or all of them depending on the disaster that
befell you.

As you might can tell, what we hit is one of those uglies where some records
are updated, some are partially updated, and one file doesn't match another.
As it turns out, our vendor laid out the datasets quite strangely. It would
be very straightforward, I'd think, if it were possible to write classes for
the most critical file. It most certainly is not. The thing I'm looking at
takes the cake for "not only did the vendor fail to understand
post-relational databases, they failed to understand databases at all".

What it looks like happened is we hit a disk full condition on a namespace
that encompasses several databases. One or more databases couldn't be
updated; the way one global that got the worst damage is laid out, only
pieces got updated. We had to look at everything and try to repair by dint
of force.

I took apart the stock JRNDUMP routine. It's pretty straightforward. It
just needed a few throw-pillows and some nice curtains to make it home :) I
cloned off a custom copy and added a couple of tags to do a scrape for
sets/kills containing a given substring then write a version of those event
records into a flat file. The modified version gave us something to work
with. On a "normal" layout, it would have been entirely possible to just
replay the journal, or at worst take our "scraped" dump, modify it and play
it back.

But such is why we make the big bucks :)

B
--
You received this message because you are subscribed to the Google Groups
"InterSystems: MV Community" group.
To post to this group, send email to Cac...@googlegroups.com To unsubscribe
from this group, send email to CacheMV-u...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/CacheMV?hl=en

Bill Farrell

unread,
Jul 20, 2012, 12:08:18 PM7/20/12
to InterSy...@googlegroups.com, intersy...@googlegroups.com
Okay, here's the windup. I cloned off JRNDUMP into another .INT routine.  I then added an "X" (eXport) option to the "E"xamine submenu when then calls the new code below.  Yeah, I know it's not MV but if you're in the fix we were in, not much MV code will do you a lot of good.  It is possible to convert this into a standalone MV routine and deploy it in MGR %SYS.  I don't see why you couldn't.  I just haven't had time yet.  But since I opened a thread I figured I'd best come up with an end to it :)

"D"o your modified JRNDUMP routine, select the journal you want to work with and "E"xamine it just the way the stock JRNDUMP works.  Have your export option call the new code below.

When you choose your export (mine is eXport), you get a device prompt since the new tag calls %IS.  (There is a note that gets sent to the screen reminding you to choose “NWS” options if you output to a flat file.)  Next you’re prompted for strings to search for.  Target strings will be searched for both in the global and in the data.  If either contains at least one of the target strings and the record has either a Set or Kill operator, it will be sent to the output file (or screen).  No other records will be output (no marks, no checkpoints, etc).

If you enter no strings at the “Next string” prompt (just an empty return) or if you hit “//” at any point in this prompt, the routine returns to the previous set of options.

Whether output is to the screen or to a file, the output is the same when a target is found:

address <tab> process_id <tab> timestamp <tab> S/K <tab> global <tab> new_value

where:

                address:               the location in the log file where the record was found
                process_id:         the process id that caused the set or kill
                timestamp:         the $H-format date-time when the set or kill event occurred
                S/K:                        (S)et or (K)ill
                global:                   global location that was affected
                new_value:        if the operation was a Set, the new value that was inserted

Tab characters are reasonably unlikely in data (but the delimiter is easy to change) and easy to piece with.  If tab is problematic we can always change it to something else like, say, @vm ($c(253)) which ought to be pretty uncommon on non-MV Cache.  MV users will probably have to be a bit more creative.

Since operations could be long, if you’re outputting results to a file, there’s a busy-box that outputs a dot to the screen every 1 second and every thirty seconds gives you a progress stamp.  Hopefully that’s enough to keep a VPN alive.  That would look something like:

Searching...
..............................

07/18/2012:  Currently at 27272868, next address is 27272952, found 16 records.
...............................
07/18/2012:  Currently at 56886240, next address is 56886328, found 16 records.
...............................

07/18/2012:  Currently at 87775424, next address is 87775508, found 16 records.
.............
      16 record(s) found.

<Enter> to continue


The contents of the file look like:

 

133384 574642506     62655,79602   K      VAL(0,2,0,"P12014834",62474,0)   
133440 574642506     62655,79602   S      ERR(0,"P12014834")   62655,79602^JOB^ARVL0^P12014834^62474^^
133524 574642506     62655,79602   S      ERR(0,"P12014834")   62655,79602^PRICE^ARPRICE^P12014834^62474^^
133612 574642506     62655,79602   S      ERR(0,"P12014834")   62655,79602^LID^ARLID^P12014834^62474^^
133696 574642506     62655,79602   S      ERR(0,"P12014834")   62655,79602^PRICE^ARPRICE^P12014834^62474^^
133784 574642506     62655,79602   S      ERR(0,"P12014834")   62655,79602^PRICE^ARPRICE^P12014834^62474^^
133872 574642506     62655,79602   S      VAL(0,2,0,"P12014834",62474,0)    1
133936 574642506     62655,79602   S      VAL(0,2,0,"P12014834",62474,0,1)  Ordered code (9999999) has been inactivated.
134040 574642506     62655,79602   S      ERR(0,"P12014834")   62655,79602^JOB^ARVL0^P12014834^62474^^
134124 574642506     62655,79602   S      ERR(0,"P12014834")   62655,79602^SPLIT^ARVSB^P12014834^62474^^
134212 574642506     62655,79602   S      ERR(0,"P12014834")   62655,79602^ENDCT^ARVL0^P12014834^62474^^
134300 574642506     62655,79602   S      ERR(0,"P12014834")   62655,79602^^AROEPC^P12014834^62474^^
134384 574642506     62655,79602   S      ERR(0,"P12014834")   62655,79602^JOB^ARVL0^P12014834^62474^^
134468 574642506     62655,79602   S      ERR(0,"P12014834")   62655,79602^VAL^NCXJBP^P12014834^62474^^
134556 574642506     62655,79602   S      ERR(0,"P12014834")   62655,79602^ENDCT^ARVL0^P12014834^62474^^
134644 574642506     62655,79602   K      ERR(0,"P12014834")

I'll warn right now:  I'm an MV guy.  My COS isn't the prettiest, but the routine does seem to work under fire.
Best,
Bill


    ;This tag only dumps forward from the current file cursor position.  The "seek backward" capability
    ;from the original "find" tag has been removed since it makes no sense to dump transactions in reverse
    ;order.  Multiple search strings may be entered.  Enter an empty return to stop entering strings and
    ;proceed or enter // to exit this tag.
exportrec(addr)    ;find a substring somewhere in the transaction log then export it to the output device.
   
    New (addr)
    s row=22, blank=$j("",80)
    s GCount=0
    s DotCount = 0
    s htab = $Char(9)
   
    ; Capture the name of the terminal in case output goes to a file.  This will allow the routine
    ; to switch easily to give progress reports.
    s Term = $IO
   
    w /cup(row+1,1),blank,/cup(row+1,1)
    w /cup(row,1),blank,/cup(row,1)
    d message("Note: If you plan to write the results to a file, use options ""NWS"".",3,row)
    w /cup(row+1,1),blank,/cup(row+1,1)
    w /cup(row,1),blank,/cup(row,1)
   
    d ^%IS ; select a device
    s ODevice=IO
    w /cup(row,1),blank,/cup(row,1)
    s StringList = "" ; list of strings to look for
    u 0:(:"-S")
   
    ; clear the last two CRT rows
    w /cup(row,1),blank,/cup(row,1)
    w /cup(row+1,1),blank,/cup(row+1,1)
   
    s StillEntering = 1
    While StillEntering {
   
            w /cup(row,1),blank,/cup(row,1)
            r "Next string: ",str
            s:str="//" str="" ; look for the "abandon" signal
            q:str=""
           
            ; Add to the list of strings to look for.
            i $ll(StringList)=0 {
                    s StringList = $lb(str)
            } else {
                    s StringList = StringList _ $lb(str)
            }
   
    }
   
    q:$ll(StringList)=0  ; empty string list
    i str="//" Quit  ; found the "abandon" signal
   
    w /cup(row+1,1),blank,/cup(row+1,1)
    w /cup(row,1),blank,/cup(row,1)
    s rdy=$$YN("Ready","Y",0)
    q:rdy="N"
   
    w #,"Searching...",!
   
    s a = addr, found = 0
    s a =$ZUtil(78,17,a)
   
    s Tick = $Piece( $H, ",", 2 )
    s LastTick = Tick
   
    While a'<0 { ; from first record up to EOF
   
           s rec=$$thisrec(a,.glo,.val)
          
           s op = $ZStrip($Piece(rec,"^",3),"*C") ; the operator is surrounded by weirdness
   
           i "KkSs"[op { ; only looking for (S)ets or (K)ills
          
           f Idx = 1:1:$ll(StringList) {
          
            s str = $lg( StringList, Idx, "zzznotfoundzzz" )
           
            i ($Get(glo)[str)!($Get(val)[str) { ; if the search string is located either in the node location or data value
       
                s timestamp = $zu(78,24,a)
                s PID = $zu(78,10,a)
               
                ; The record looks like: addr \t PID \t timestamp \t operation (s/k) \t global \t value
                s ORec = a _ htab _ PID _ htab _ timestamp _ htab _ op _ htab _ glo _ htab _ val
               
                i (ODevice'=Term) Use ODevice
                w ORec,!
                i (ODevice'=Term) Use 0
               
                s GCount=GCount+1 ; keep track of progress
                   
                } ; if we got the operator we were looking for
      
            }
          
           }
   
              
           i (ODevice'=Term) {
          
            s Tick = $Piece( $H, ",", 2 )
           
            i Tick = 1 Set LastTick = 0 ; we crossed midnight
           
            i Tick > LastTick {
       
                s DotCount = DotCount + 1
               
                u 0
               
                i (DotCount>30),(ODevice'=Term) {
                    s RightNow=$ZTIME($PIECE($HOROLOG,",",2),1)
                    w !,RightNow,":  Currently at ", Lasta _ ", next address is " _ a, ", found ", GCount, " records.",!
                    s DotCount = 0
                }
               
                w "."
               
                s LastTick = Tick
              
               }
   
           }
   
           s Lasta = a ; remember the last address
           s a = $ZUtil(78,17,a)
          
           q:Lasta'<a
          
    }
   
    i (ODevice'=Term) Close ODevice u 0
    w !!,$Justify(GCount,8)_" record(s) found.",!
    r !,"<Enter> to continue",dummy:300
    w #
    q
   
Reply all
Reply to author
Forward
0 new messages