[Gross] grossd problems after solaris 10 patching

Jesse Thompson jesse.thompson at doit.wisc.edu
Tue Aug 5 18:46:49 EEST 2008


If I shut down the peer (with the older Solaris patch revs) and start up
gross on the patched server, it processes client requests just fine.
Then if I start up the peer, the same problems occurs (in reverse).

We didn't see the problem in our test environment; most likely because
we patched both of those servers at the same time.  I presume that I
could upgrade the other production server and this problem would go
away... but it's hard to tell for sure.



I changed the client port so that requests wouldn't come in, and I
started gross on both servers with -dDD


_Server 1 (started first):_
Tue Aug  5 10:24:33 2008 #3: bloommgr starting...
Tue Aug  5 10:24:33 2008 #4: Connect to peer failed, host 1.2.3.4, port 1112
Tue Aug  5 10:24:33 2008 #4: Waiting sync connection on host 2.3.4.5
port 1112
Tue Aug  5 10:24:33 2008 #1: Filters in sync. Starting...
Tue Aug  5 10:24:33 2008 #1: initializing dns checker thread pool 'dnsbl'
Tue Aug  5 10:24:33 2008 #1: doubling the space for message queues from
2 to 4
Tue Aug  5 10:24:33 2008 #1: initializing Sophos blocker thread pool
Tue Aug  5 10:24:33 2008 #5: threadpool 'dnsbl' thread #1 starting
watchdog enabled
Tue Aug  5 10:24:33 2008 #1: starting sjsms policy server
Tue Aug  5 10:24:33 2008 #6: threadpool 'blocker' thread #1 starting
watchdog enabled
Tue Aug  5 10:24:33 2008 #7: initializing sjsms worker thread pool
Tue Aug  5 10:24:33 2008 #7: doubling the space for message queues from
4 to 8
Tue Aug  5 10:24:33 2008 #9: threadpool 'sjsms' thread #1 starting
Tue Aug  5 10:24:34 2008 #6: threadpool 'blocker' idling
Tue Aug  5 10:24:34 2008 #5: threadpool 'dnsbl' idling
Tue Aug  5 10:24:34 2008 #9: threadpool 'sjsms' idling
etc...


_Started Server 2:_
Tue Aug  5 10:24:41 2008 #3: bloommgr starting...
Tue Aug  5 10:24:41 2008 #4: Peer fd 6
Tue Aug  5 10:24:41 2008 #3: received rotate command
Tue Aug  5 10:24:41 2008 #4: Waiting sync connection on host
144.92.197.229 port 1112
Tue Aug  5 10:24:41 2008 #6: rotate thread starting
Tue Aug  5 10:24:41 2008 #6: rotation not needed


_Server 1 shows:_
Tue Aug  5 10:24:41 2008 #4: Got sync connection
Tue Aug  5 10:24:41 2008 #4: Examining peer config
Tue Aug  5 10:24:41 2008 #4: freeze queue 0
Tue Aug  5 10:24:42 2008 #5: threadpool 'dnsbl' idling
Tue Aug  5 10:24:42 2008 #9: threadpool 'sjsms' idling
Tue Aug  5 10:24:42 2008 #6: threadpool 'blocker' idling
etc...
Tue Aug  5 10:24:43 2008 #4: Sent buffer: 0
Tue Aug  5 10:24:44 2008 #5: threadpool 'dnsbl' idling
Tue Aug  5 10:24:44 2008 #9: threadpool 'sjsms' idling
Tue Aug  5 10:24:44 2008 #6: threadpool 'blocker' idling
etc...
Tue Aug  5 10:24:47 2008 #4: Sent buffer: 1
Tue Aug  5 10:24:48 2008 #6: threadpool 'blocker' idling
Tue Aug  5 10:24:48 2008 #9: threadpool 'sjsms' idling
Tue Aug  5 10:24:48 2008 #5: threadpool 'dnsbl' idling
etc...
Tue Aug  5 10:24:55 2008 #4: Sent buffer: 2
Tue Aug  5 10:24:56 2008 #9: threadpool 'sjsms' idling
Tue Aug  5 10:24:56 2008 #5: threadpool 'dnsbl' idling
Tue Aug  5 10:24:56 2008 #6: threadpool 'blocker' idling
etc...
Tue Aug  5 10:25:05 2008 #4: Sent buffer: 3
Tue Aug  5 10:25:06 2008 #6: threadpool 'blocker' idling
Tue Aug  5 10:25:06 2008 #5: threadpool 'dnsbl' idling
Tue Aug  5 10:25:06 2008 #9: threadpool 'sjsms' idling
etc...

It's hard to tell from the logs I pasted, since I snipped out most of
the "idling" logs, but the time it takes for each buffer to be sent
takes longer each time
1s, 4s, 8s, 10s

Buffer #4 is never sent in the 20 minutes I was watching.  I also did
not see the "Bad file number" error occur.

Jesse




Jesse Thompson wrote:
> After patching Solaris 10 on one server to the latest patch sets.  We
> did not change gross on either host.
> 
> Start grossd (-dDD) on the server that was patched.
> 
> _STDOUT shows:_
> Tue Aug  5 09:37:42 2008 #1: config: ...
> ...
> Tue Aug  5 09:37:42 2008 #1: Listening host address 1.2.3.4
> Tue Aug  5 09:37:42 2008 #1: Sync listen address 1.2.3.4
> Tue Aug  5 09:37:42 2008 #1: Peer 2.3.4.5 configured. Replicating.
> Tue Aug  5 09:37:42 2008 #1: Sync peer address 2.3.4.5
> Tue Aug  5 09:37:42 2008 #1: updatestyle: GREY
> Tue Aug  5 09:37:42 2008 #1: adding dnsbl: ...
> ...
> Tue Aug  5 09:37:42 2008 #1: Blocker host address 127.0.0.1
> Tue Aug  5 09:37:42 2008 #1: doubling the space for message queues from
> 1 to 2
> Tue Aug  5 09:37:42 2008 #2: delay queue manager thread starting
> Tue Aug  5 09:37:42 2008 #2: waiting for messages
> Tue Aug  5 09:37:42 2008 #3: Found the correct state file magic string.
> Tue Aug  5 09:37:42 2008 #3: fixing bloom ring queue memory pointers,
> offset=0
> Tue Aug  5 09:37:42 2008 #3: bloommgr starting...
> Tue Aug  5 09:37:42 2008 #4: Peer fd 6
> Tue Aug  5 09:37:42 2008 #3: received rotate command
> Tue Aug  5 09:37:42 2008 #4: Waiting sync connection on host
> 144.92.197.229 port 1112
> Tue Aug  5 09:37:42 2008 #6: rotate thread starting
> Tue Aug  5 09:37:42 2008 #6: rotation not needed
> 
> (no activity for 10 minutes)
> 
> 
> _Truss shows minor activity_
> /5:     read(6, "\0\0\0\0\0\0\088", 8)                  = 8
> /5:     read(6, "\0\0\0\0\0\0 Q !\0\0\0\0".., 136)      = 136
> /5:     clock_gettime(4, 0x00058814)                    = 0
> /5:     lwp_unpark(3)                                   = 0
> /3:     lwp_park(0x00000000, 0)                         = 0
> /3:     time()                                          = 1217947218
> 
> (similar activity is repeated in truss)
> 
> (snoop indicates that the peer is seeing the activity)
> 
> 
> _The peer's log shows:_
> Aug  5 09:37:42 grossd: [ID 702911 local3.info] #4: Got sync connection
> Aug  5 09:37:42 grossd: [ID 702911 local3.info] #4: Examining peer config
> Aug  5 09:37:42 grossd: [ID 702911 local3.error] #4: freeze queue 0
> 
> (continues to process client requests normally)
> 
> 
> _After the 10 minutes, the first gross server shows:_
> Tue Aug  5 09:47:50 2008 #4: Got sync connection
> Tue Aug  5 09:47:50 2008 #4: Examining peer config
> 
> (then nothing)
> 
> (truss shows: lwp_park(0x00000000, 0)         (sleeping...) )
> 
> 
> _And the peer's log shows:_
> Aug  5 09:47:50 grossd: [ID 702911 local3.error] #1299: Grossd shutdown
> with exit code 2: pthread_create Bad file number
> Aug  5 09:47:50 grossd: [ID 702911 local3.info] #3: bloommgr starting...
> Aug  5 09:47:50 grossd: [ID 702911 local3.info] #3: received rotate command
> Aug  5 09:47:50 grossd: [ID 702911 local3.info] #4: Waiting sync
> connection on host 2.3.4.5 port 1112
> Aug  5 09:47:50 grossd: [ID 702911 local3.info] #6: rotate thread starting
> 
> (no activity, including client requests, for the next X minutes)
> 
> (truss shows: lwp_park(0x00000000, 0)         (sleeping...) )
> 
> 
> _So, I kill (-9 is required) gross on the first server, and the peer's
> log shows:_
> Aug  5 09:54:16 grossd: [ID 702911 local3.info] #5: connection closed by
> client
> 
> (still no activity in log or truss)
> 
> 
> I kill (-9 is required) gross on the peer, start is back up, and it is
> functioning fine as long as I don't start gross on the server that was
> just patched.
> 
> 
> I also tried deleting and recreating the state file on the upgraded
> server... same problem.
> 
> 
> _Also, I see from this morning, log entries like this,  but I can't seem
> to trigger it now on demand:_
> Aug  5 07:59:57 grossd: [ID 702911 local3.error] #4: Send filters: Bad
> file number
> Aug  5 07:59:57 last message repeated 291 times
> Aug  5 07:59:57 grossd: [ID 702911 local3.error] #4: Send filters: Bad
> file number
> Aug  5 07:59:57 last message repeated 402 times
> Aug  5 07:59:57 grossd: [ID 702911 local3.error] #4: Send filters: Bad
> file number
> Aug  5 07:59:57 last message repeated 21 times
> ...
> 
> 
> Any ideas?
> 
> Jesse
> 
> 
> ------------------------------------------------------------------------
> 
> _______________________________________________
> Gross mailing list
> Gross at lists.utu.fi
> https://lists.utu.fi/mailman/listinfo/gross

-- 
  Jesse Thompson
  Email/IM: jesse.thompson at doit.wisc.edu
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/x-pkcs7-signature
Size: 3353 bytes
Desc: S/MIME Cryptographic Signature
URL: <http://lists.utu.fi/pipermail/gross/attachments/20080805/37a45d95/attachment.bin>


More information about the Gross mailing list