[Gross] grossd problems after solaris 10 patching

Jesse Thompson jesse.thompson at doit.wisc.edu
Tue Aug 5 18:11:23 EEST 2008


We patched Solaris 10 on one server to the latest patch sets, and grossd
has had problems since. We did not change gross on either host.

Start grossd (-dDD) on the server that was patched.

_STDOUT shows:_
Tue Aug  5 09:37:42 2008 #1: config: ...
...
Tue Aug  5 09:37:42 2008 #1: Listening host address 1.2.3.4
Tue Aug  5 09:37:42 2008 #1: Sync listen address 1.2.3.4
Tue Aug  5 09:37:42 2008 #1: Peer 2.3.4.5 configured. Replicating.
Tue Aug  5 09:37:42 2008 #1: Sync peer address 2.3.4.5
Tue Aug  5 09:37:42 2008 #1: updatestyle: GREY
Tue Aug  5 09:37:42 2008 #1: adding dnsbl: ...
...
Tue Aug  5 09:37:42 2008 #1: Blocker host address 127.0.0.1
Tue Aug  5 09:37:42 2008 #1: doubling the space for message queues from 1 to 2
Tue Aug  5 09:37:42 2008 #2: delay queue manager thread starting
Tue Aug  5 09:37:42 2008 #2: waiting for messages
Tue Aug  5 09:37:42 2008 #3: Found the correct state file magic string.
Tue Aug  5 09:37:42 2008 #3: fixing bloom ring queue memory pointers, offset=0
Tue Aug  5 09:37:42 2008 #3: bloommgr starting...
Tue Aug  5 09:37:42 2008 #4: Peer fd 6
Tue Aug  5 09:37:42 2008 #3: received rotate command
Tue Aug  5 09:37:42 2008 #4: Waiting sync connection on host 144.92.197.229 port 1112
Tue Aug  5 09:37:42 2008 #6: rotate thread starting
Tue Aug  5 09:37:42 2008 #6: rotation not needed

(no activity for 10 minutes)


_Truss shows minor activity_
/5:     read(6, "\0\0\0\0\0\0\088", 8)                  = 8
/5:     read(6, "\0\0\0\0\0\0 Q !\0\0\0\0".., 136)      = 136
/5:     clock_gettime(4, 0x00058814)                    = 0
/5:     lwp_unpark(3)                                   = 0
/3:     lwp_park(0x00000000, 0)                         = 0
/3:     time()                                          = 1217947218

(similar activity is repeated in truss)

(snoop indicates that the peer is seeing the activity)


_The peer's log shows:_
Aug  5 09:37:42 grossd: [ID 702911 local3.info] #4: Got sync connection
Aug  5 09:37:42 grossd: [ID 702911 local3.info] #4: Examining peer config
Aug  5 09:37:42 grossd: [ID 702911 local3.error] #4: freeze queue 0

(continues to process client requests normally)


_After the 10 minutes, the first gross server shows:_
Tue Aug  5 09:47:50 2008 #4: Got sync connection
Tue Aug  5 09:47:50 2008 #4: Examining peer config

(then nothing)

(truss shows: lwp_park(0x00000000, 0)         (sleeping...) )


_And the peer's log shows:_
Aug  5 09:47:50 grossd: [ID 702911 local3.error] #1299: Grossd shutdown with exit code 2: pthread_create Bad file number
Aug  5 09:47:50 grossd: [ID 702911 local3.info] #3: bloommgr starting...
Aug  5 09:47:50 grossd: [ID 702911 local3.info] #3: received rotate command
Aug  5 09:47:50 grossd: [ID 702911 local3.info] #4: Waiting sync connection on host 2.3.4.5 port 1112
Aug  5 09:47:50 grossd: [ID 702911 local3.info] #6: rotate thread starting

(no activity, including client requests, for the next X minutes)

(truss shows: lwp_park(0x00000000, 0)         (sleeping...) )


_So, I kill (-9 is required) gross on the first server, and the peer's
log shows:_
Aug  5 09:54:16 grossd: [ID 702911 local3.info] #5: connection closed by client

(still no activity in log or truss)


I kill (-9 is required) gross on the peer, start it back up, and it
functions fine as long as I don't start gross on the server that was
just patched.


I also tried deleting and recreating the state file on the upgraded
server... same problem.


_Also, I see log entries like this from this morning, but I can't seem
to trigger them on demand now:_
Aug  5 07:59:57 grossd: [ID 702911 local3.error] #4: Send filters: Bad file number
Aug  5 07:59:57 last message repeated 291 times
Aug  5 07:59:57 grossd: [ID 702911 local3.error] #4: Send filters: Bad file number
Aug  5 07:59:57 last message repeated 402 times
Aug  5 07:59:57 grossd: [ID 702911 local3.error] #4: Send filters: Bad file number
Aug  5 07:59:57 last message repeated 21 times
...


Any ideas?

Jesse