[Gross] grossd problems after solaris 10 patching

Wed Aug 6 00:12:02 EEST 2008

We managed to get it to sync reliably if we move both zones onto the
same server (patched OS).  We can get it to work with one unpatched
server and one patched server, but only under a specific scenario (see
below).

Given server A is not patched, and server B is patched.

Scenario #1:

  Delete state files on both and recreate with -C.

  Start up server A; it runs normally.

  Start up server B.

  Server B has one thread that is effectively using an entire CPU.

  Server A's NLWP grows steadily until around 4000 (4096) and then
  crashes with the "Bad file number" error (trying to allocate the next
  thread.)

Scenario #2:

  Delete state files on both and recreate with -C.

  Start up server B; it runs normally.

  Start up server A.

  Server A has one thread that is effectively using an entire CPU.

  Server B's NLWP grows steadily until around 450 and then server A
  finishes syncing and starts.

Jesse Thompson wrote:
> If I shut down the peer (with the older Solaris patch revs) and start up
> gross on the patched server, it processes client requests just fine.
> Then if I start up the peer, the same problems occurs (in reverse).
> 
> We didn't see the problem in our test environment; most likely because
> we patched both of those servers at the same time.  I presume that I
> could upgrade the other production server and this problem would go
> away... but it's hard to tell for sure.
> 
> 
> 
> I changed the client port so that requests wouldn't come in, and I
> started gross on both servers with -dDD
> 
> 
> _Server 1 (started first):_
> Tue Aug  5 10:24:33 2008 #3: bloommgr starting...
> Tue Aug  5 10:24:33 2008 #4: Connect to peer failed, host 1.2.3.4, port 1112
> Tue Aug  5 10:24:33 2008 #4: Waiting sync connection on host 2.3.4.5
> port 1112
> Tue Aug  5 10:24:33 2008 #1: Filters in sync. Starting...
> Tue Aug  5 10:24:33 2008 #1: initializing dns checker thread pool 'dnsbl'
> Tue Aug  5 10:24:33 2008 #1: doubling the space for message queues from
> 2 to 4
> Tue Aug  5 10:24:33 2008 #1: initializing Sophos blocker thread pool
> Tue Aug  5 10:24:33 2008 #5: threadpool 'dnsbl' thread #1 starting
> watchdog enabled
> Tue Aug  5 10:24:33 2008 #1: starting sjsms policy server
> Tue Aug  5 10:24:33 2008 #6: threadpool 'blocker' thread #1 starting
> watchdog enabled
> Tue Aug  5 10:24:33 2008 #7: initializing sjsms worker thread pool
> Tue Aug  5 10:24:33 2008 #7: doubling the space for message queues from
> 4 to 8
> Tue Aug  5 10:24:33 2008 #9: threadpool 'sjsms' thread #1 starting
> Tue Aug  5 10:24:34 2008 #6: threadpool 'blocker' idling
> Tue Aug  5 10:24:34 2008 #5: threadpool 'dnsbl' idling
> Tue Aug  5 10:24:34 2008 #9: threadpool 'sjsms' idling
> etc...
> 
> 
> _Started Server 2:_
> Tue Aug  5 10:24:41 2008 #3: bloommgr starting...
> Tue Aug  5 10:24:41 2008 #4: Peer fd 6
> Tue Aug  5 10:24:41 2008 #3: received rotate command
> Tue Aug  5 10:24:41 2008 #4: Waiting sync connection on host
> 144.92.197.229 port 1112
> Tue Aug  5 10:24:41 2008 #6: rotate thread starting
> Tue Aug  5 10:24:41 2008 #6: rotation not needed
> 
> 
> _Server 1 shows:_
> Tue Aug  5 10:24:41 2008 #4: Got sync connection
> Tue Aug  5 10:24:41 2008 #4: Examining peer config
> Tue Aug  5 10:24:41 2008 #4: freeze queue 0
> Tue Aug  5 10:24:42 2008 #5: threadpool 'dnsbl' idling
> Tue Aug  5 10:24:42 2008 #9: threadpool 'sjsms' idling
> Tue Aug  5 10:24:42 2008 #6: threadpool 'blocker' idling
> etc...
> Tue Aug  5 10:24:43 2008 #4: Sent buffer: 0
> Tue Aug  5 10:24:44 2008 #5: threadpool 'dnsbl' idling
> Tue Aug  5 10:24:44 2008 #9: threadpool 'sjsms' idling
> Tue Aug  5 10:24:44 2008 #6: threadpool 'blocker' idling
> etc...
> Tue Aug  5 10:24:47 2008 #4: Sent buffer: 1
> Tue Aug  5 10:24:48 2008 #6: threadpool 'blocker' idling
> Tue Aug  5 10:24:48 2008 #9: threadpool 'sjsms' idling
> Tue Aug  5 10:24:48 2008 #5: threadpool 'dnsbl' idling
> etc...
> Tue Aug  5 10:24:55 2008 #4: Sent buffer: 2
> Tue Aug  5 10:24:56 2008 #9: threadpool 'sjsms' idling
> Tue Aug  5 10:24:56 2008 #5: threadpool 'dnsbl' idling
> Tue Aug  5 10:24:56 2008 #6: threadpool 'blocker' idling
> etc...
> Tue Aug  5 10:25:05 2008 #4: Sent buffer: 3
> Tue Aug  5 10:25:06 2008 #6: threadpool 'blocker' idling
> Tue Aug  5 10:25:06 2008 #5: threadpool 'dnsbl' idling
> Tue Aug  5 10:25:06 2008 #9: threadpool 'sjsms' idling
> etc...
> 
> It's hard to tell from the logs I pasted, since I snipped out most of
> the "idling" logs, but the time it takes for each buffer to be sent
> takes longer each time
> 1s, 4s, 8s, 10s
> 
> Buffer #4 is never sent in the 20 minutes I was watching.  I also did
> not see the "Bad file number" error occur.
> 
> Jesse
> 
> 
> 
> 
> Jesse Thompson wrote:
>> After patching Solaris 10 on one server to the latest patch sets.  We
>> did not change gross on either host.
>>
>> Start grossd (-dDD) on the server that was patched.
>>
>> _STDOUT shows:_
>> Tue Aug  5 09:37:42 2008 #1: config: ...
>> ...
>> Tue Aug  5 09:37:42 2008 #1: Listening host address 1.2.3.4
>> Tue Aug  5 09:37:42 2008 #1: Sync listen address 1.2.3.4
>> Tue Aug  5 09:37:42 2008 #1: Peer 2.3.4.5 configured. Replicating.
>> Tue Aug  5 09:37:42 2008 #1: Sync peer address 2.3.4.5
>> Tue Aug  5 09:37:42 2008 #1: updatestyle: GREY
>> Tue Aug  5 09:37:42 2008 #1: adding dnsbl: ...
>> ...
>> Tue Aug  5 09:37:42 2008 #1: Blocker host address 127.0.0.1
>> Tue Aug  5 09:37:42 2008 #1: doubling the space for message queues from
>> 1 to 2
>> Tue Aug  5 09:37:42 2008 #2: delay queue manager thread starting
>> Tue Aug  5 09:37:42 2008 #2: waiting for messages
>> Tue Aug  5 09:37:42 2008 #3: Found the correct state file magic string.
>> Tue Aug  5 09:37:42 2008 #3: fixing bloom ring queue memory pointers,
>> offset=0
>> Tue Aug  5 09:37:42 2008 #3: bloommgr starting...
>> Tue Aug  5 09:37:42 2008 #4: Peer fd 6
>> Tue Aug  5 09:37:42 2008 #3: received rotate command
>> Tue Aug  5 09:37:42 2008 #4: Waiting sync connection on host
>> 144.92.197.229 port 1112
>> Tue Aug  5 09:37:42 2008 #6: rotate thread starting
>> Tue Aug  5 09:37:42 2008 #6: rotation not needed
>>
>> (no activity for 10 minutes)
>>
>>
>> _Truss shows minor activity_
>> /5:     read(6, "\0\0\0\0\0\0\088", 8)                  = 8
>> /5:     read(6, "\0\0\0\0\0\0 Q !\0\0\0\0".., 136)      = 136
>> /5:     clock_gettime(4, 0x00058814)                    = 0
>> /5:     lwp_unpark(3)                                   = 0
>> /3:     lwp_park(0x00000000, 0)                         = 0
>> /3:     time()                                          = 1217947218
>>
>> (similar activity is repeated in truss)
>>
>> (snoop indicates that the peer is seeing the activity)
>>
>>
>> _The peer's log shows:_
>> Aug  5 09:37:42 grossd: [ID 702911 local3.info] #4: Got sync connection
>> Aug  5 09:37:42 grossd: [ID 702911 local3.info] #4: Examining peer config
>> Aug  5 09:37:42 grossd: [ID 702911 local3.error] #4: freeze queue 0
>>
>> (continues to process client requests normally)
>>
>>
>> _After the 10 minutes, the first gross server shows:_
>> Tue Aug  5 09:47:50 2008 #4: Got sync connection
>> Tue Aug  5 09:47:50 2008 #4: Examining peer config
>>
>> (then nothing)
>>
>> (truss shows: lwp_park(0x00000000, 0)         (sleeping...) )
>>
>>
>> _And the peer's log shows:_
>> Aug  5 09:47:50 grossd: [ID 702911 local3.error] #1299: Grossd shutdown
>> with exit code 2: pthread_create Bad file number
>> Aug  5 09:47:50 grossd: [ID 702911 local3.info] #3: bloommgr starting...
>> Aug  5 09:47:50 grossd: [ID 702911 local3.info] #3: received rotate command
>> Aug  5 09:47:50 grossd: [ID 702911 local3.info] #4: Waiting sync
>> connection on host 2.3.4.5 port 1112
>> Aug  5 09:47:50 grossd: [ID 702911 local3.info] #6: rotate thread starting
>>
>> (no activity, including client requests, for the next X minutes)
>>
>> (truss shows: lwp_park(0x00000000, 0)         (sleeping...) )
>>
>>
>> _So, I kill (-9 is required) gross on the first server, and the peer's
>> log shows:_
>> Aug  5 09:54:16 grossd: [ID 702911 local3.info] #5: connection closed by
>> client
>>
>> (still no activity in log or truss)
>>
>>
>> I kill (-9 is required) gross on the peer, start is back up, and it is
>> functioning fine as long as I don't start gross on the server that was
>> just patched.
>>
>>
>> I also tried deleting and recreating the state file on the upgraded
>> server... same problem.
>>
>>
>> _Also, I see from this morning, log entries like this,  but I can't seem
>> to trigger it now on demand:_
>> Aug  5 07:59:57 grossd: [ID 702911 local3.error] #4: Send filters: Bad
>> file number
>> Aug  5 07:59:57 last message repeated 291 times
>> Aug  5 07:59:57 grossd: [ID 702911 local3.error] #4: Send filters: Bad
>> file number
>> Aug  5 07:59:57 last message repeated 402 times
>> Aug  5 07:59:57 grossd: [ID 702911 local3.error] #4: Send filters: Bad
>> file number
>> Aug  5 07:59:57 last message repeated 21 times
>> ...
>>
>>
>> Any ideas?
>>
>> Jesse
>>
>>
>> ------------------------------------------------------------------------
>>
>> _______________________________________________
>> Gross mailing list
>> Gross at lists.utu.fi
>> https://lists.utu.fi/mailman/listinfo/gross
> 
> 
> ------------------------------------------------------------------------
> 
> _______________________________________________
> Gross mailing list
> Gross at lists.utu.fi
> https://lists.utu.fi/mailman/listinfo/gross

-- 
  Jesse Thompson
  Email/IM: jesse.thompson at doit.wisc.edu
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/x-pkcs7-signature
Size: 3353 bytes
Desc: S/MIME Cryptographic Signature
URL: <http://lists.utu.fi/pipermail/gross/attachments/20080805/faf83044/attachment.bin>