Vagabond Bits

IO SYSTEM VERIFICATION ERROR in Ntfs.sys

Hi Everyone,

I got a new DFS cluster on Windows 2008 going, and when adding disks to the cluster, the active node bugchecked. So I enabled driver verification, added another disk, and got the following error: IO SYSTEM VERIFICATION ERROR in Ntfs.sys (WDM DRIVER ERROR 224)

see screenshot below.

Analyzing the memory dump didn't get me very far. The first port of call is to update all drivers and firmware on the system, including the HBA firmware. Once the drivers and firmware are updated on this HP BL470c blade, I will attempt to add more LUNs.

ID no: c0070005

If you get the error below when trying to move transaction logs in Exchange 2003 (Access is denied, ID no: c0070005, Exchange System Manager), go to http://support.microsoft.com/default.aspx?scid=kb;en-us;Q323915. You may need to restart the node for the registry permissions to take effect.

I just spent the whole of Sunday bringing back an Exchange cluster. I wonder why disasters always happen on a Sunday.

Disabling Replication on Domain Controllers

Hi All,

When doing a schema upgrade, I first disable outbound replication on the Windows 2003/2008 Active Directory domain controller. Once the upgrade has finished successfully and I am sure all is OK on the DC, I enable replication again.

Here is the command to disable Outbound Replication for Domain Controllers.

repadmin /options localhost +disable_outbound_repl

To enable replication again:

repadmin /options localhost -disable_outbound_repl

Here is the command to disable Inbound Replication for Domain Controllers.

repadmin /options localhost +disable_inbound_repl

To enable replication again:

repadmin /options localhost -disable_inbound_repl
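
The disable/enable pairs above lend themselves to a small wrapper, so that the "+" and "-" forms cannot get mixed up during an upgrade window. This is only a sketch: the helper names (repadmin_options_cmd, set_replication) and the dry_run switch are made up for illustration, and the real command only does anything when run on a machine that has repadmin installed.

```python
import subprocess

# repadmin option name for each replication direction.
REPL_OPTIONS = {
    "outbound": "disable_outbound_repl",
    "inbound": "disable_inbound_repl",
}


def repadmin_options_cmd(dc, direction, disable):
    """Build the repadmin command line: '+' sets the disable option, '-' clears it."""
    sign = "+" if disable else "-"
    return ["repadmin", "/options", dc, sign + REPL_OPTIONS[direction]]


def set_replication(dc, direction, disable, dry_run=True):
    """Print the command (dry run) or execute it on the local machine."""
    cmd = repadmin_options_cmd(dc, direction, disable)
    if dry_run:
        print(" ".join(cmd))
    else:
        subprocess.run(cmd, check=True)  # raises if repadmin reports a failure
    return cmd


if __name__ == "__main__":
    # Typical schema-upgrade sequence, shown as a dry run.
    set_replication("localhost", "outbound", disable=True)
    # ... run the schema upgrade and verify the DC here ...
    set_replication("localhost", "outbound", disable=False)
```

Keeping dry_run as the default means the script only prints what it would do until you deliberately switch it off.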

Free Microsoft Press E-Books

Hi All

Microsoft is giving away two free E-Books:

Windows Server 2008 Terminal Services Resource Kit

All the tools and technical information to help you set up, deploy and manage Terminal Services farms using Windows Server 2008.

and the second one is The Practical Guide to Defect Prevention

This guide helps developers build a process of best practices that helps to avoid defects during the development process rather than trying to fix them after the fact.

Free iSCSI Target Software

Leading Storage Virtualization Now Free With Unlimited Connections and Large 2TB Size

Yes, you read that correctly. StarWind Software is giving away its iSCSI Target software for free. Here are the features:

• Large 2 TB capacity

• Support for Server Clusters with VMware ESX and ESXi and Microsoft Hyper-V
• Unlimited number of supported concurrent iSCSI connections
• Compression, Encryption and CHAP authentication
• iSCSI RAM disk for network performance tuning
• iSCSI CD/DVD/Blu-Ray/HD-DVD emulation

This is really great news for all techies, as it lets us play with solutions that require SAN-type storage, such as ESX and clusters.

Grab it from here.

Stop 0xC4 and 0xD5

Hi All,

We seem to have become a magnet for bugchecks (bluescreens), and my group is getting hit with them from all angles. We had been having a bunch of blue screens (stop 0xC4) on a particular server, and all our testing and debugging got us nowhere.

So we ran Verifier to check all drivers on this server. As soon as we finished configuring verifier.exe and gave the server a reboot, it blue screened with a STOP 0x000000D5, but this time the culprit came out. It was cdprobe.sys, running on this Windows 2003 server as part of Symantec's Centennial Discovery Client Agent. We disabled it and the server has been running fine since then.

So how do you use Verifier.exe?

Here are the steps to enable Verifier.

Start > Run, type Verifier and hit enter.

Select the radio button for Create custom settings (for code developers), click Next

On this screen click Select individual settings from a full list, click Next

Tick the first three check boxes (Special Pool, Pool Tracking, Force IRQL checking), click Next

Check the box for Select driver names from a list and click Next

In this screen, sort by Provider and then select all the Non-Microsoft drivers. Click Finish. It will ask you to reboot.

If your server has a bad driver it will bluescreen and display the name of the problematic driver (see screenshot below).
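
The wizard steps above can also be expressed through verifier.exe's command-line switches (/flags and /driver). The sketch below assumes the documented flag bit values (0x1 for Special Pool, 0x2 for Force IRQL checking, 0x8 for Pool Tracking) and only builds the command line. The function name verifier_cmd is hypothetical, and the driver list is whatever non-Microsoft drivers you picked in the last step.

```python
# Bit values for the three settings ticked in the wizard (assumed from the
# verifier.exe documentation, not taken from the post itself).
SPECIAL_POOL = 0x1
FORCE_IRQL_CHECKING = 0x2
POOL_TRACKING = 0x8


def verifier_cmd(drivers, flags=SPECIAL_POOL | FORCE_IRQL_CHECKING | POOL_TRACKING):
    """Build a 'verifier /flags N /driver a.sys b.sys' command line."""
    if not drivers:
        raise ValueError("at least one driver name is required")
    return ["verifier", "/flags", str(flags), "/driver", *drivers]


if __name__ == "__main__":
    # cdprobe.sys is the driver from this post; a reboot is still needed afterwards.
    print(" ".join(verifier_cmd(["cdprobe.sys"])))
```

As with the GUI, the settings only take effect after a reboot, so treat the output as something to review before running it on a production box.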

DFS Replication with Windows 2003 R2, Event ID 6012

This morning I got a new DFS build going, and after finishing all the config work on it I tried to start the folder replication, but got this rude error.

Domain: The Active Directory schema on domain controller dc1 cannot be read. This error might be caused by schema that has not been extended, or was extended improperly. A class schema object cannot be found.

So went into event viewer to check the error in DFS replication and found this:

It was followed by this:

Event ID: 6012

Description: The DFS Replication service detected an incompatible Active Directory schema version while trying to read configuration objects from server DC1. The service disconnected from this server and will try again in the next polling cycle.

Additional Information:

Expected Version: 31

Incompatible Server Version: 30

I looked up Event ID 6012 and found this article by Microsoft. Going through the article cleared everything up.

Our domain controllers were running Windows 2003 while our file servers had moved up to Windows 2003 R2. So the solution was simple: back up your domain controller, get CD 2 of Windows 2003 R2 and run adprep.exe /forestprep from it. We did that and voila, DFS replication was working happily.

Event Type: Information

Event Source: DFSREvent

Event ID: 1206

Description: The DFS Replication service successfully contacted domain controller \\DC1 to access configuration information.
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
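
The version numbers in Event ID 6012 are easier to read once they are mapped back to the Windows release that introduces each schema level. Here is a minimal sketch: versions 30 and 31 are the ones from the event above, while 13 and 44 are added as general background, and the function name explain_mismatch is hypothetical.

```python
# Active Directory schema objectVersion values and the release that ships them.
SCHEMA_VERSIONS = {
    13: "Windows 2000",
    30: "Windows Server 2003",
    31: "Windows Server 2003 R2",
    44: "Windows Server 2008",
}


def explain_mismatch(expected, found):
    """Turn the two numbers from Event ID 6012 into a readable diagnosis."""
    want = SCHEMA_VERSIONS.get(expected, "unknown release")
    have = SCHEMA_VERSIONS.get(found, "unknown release")
    if found >= expected:
        return f"Schema {found} ({have}) already satisfies {expected} ({want})."
    return (
        f"Schema is at {found} ({have}) but DFS Replication expects "
        f"{expected} ({want}); extend the schema with adprep /forestprep."
    )


if __name__ == "__main__":
    print(explain_mismatch(expected=31, found=30))
```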

Richcopy

I'm back from my holiday and straight into a couple of big migration projects. While running these projects I needed to move 20TB of data around.

Now I have generally used Robocopy for pretty much all my copy/move/sync needs, but to move that much data I needed another utility that could do it faster. In comes Microsoft with RichCopy, which was an internal tool for many years until they made it public.

The advantage of RichCopy is that it's multithreaded. Really multithreaded. I ran 256 directory copy threads, with each directory copy getting 256 file copy operations, effectively running 65,536 copy operations simultaneously, and RichCopy ran with it just fine. Check out the screenshots below.

You can grab your copy from here.

Bug 5077897

Hey guys, remember this post: Oracle Server Non paged pool memory error? Quick recap:

Oracle 10g Release 2 (10.2) server running out of Non Paged Pool Memory with Event ID: 2019?

The Oracle guys logged a Service Request with Oracle, and after reviewing the research I mentioned before, Oracle clearly identified the issue as "Bug 5077897", "Windows: Server side handle leak".

Here is the Oracle Blurb about Bug 5077897:

Bug 5077897 - Windows: Server side handle leak

version affected 10.2.0.2

Description

This problem is introduced on Windows platforms in 10.2.0.2. The Oracle server exhibits a thread Handle leak which is observed for every TCP connect/disconnect to a 10.2 database.

So we will be upgrading our database very soon.

Also, on that note, I am on vacation till mid March, so although I will be checking my mail on a weekly basis, I won't be doing any blog posts. Take care and have fun. See you all in a month's time.

Permissions with VMware ESX Server 3.5

So you just installed a new VMware ESX server. You tried to SSH to it and login as root. What happened?

It didn’t work, did it?

The firewall allows it, right? (yes) You can login to the physical server console with the same username & password, right? (yes) But it still doesn’t work, does it?

Let’s find out how to fix it….

To allow the root user to login to a VMware ESX Server over the network using SSH, do the following:

1. Go to the service console on the physical server & login

2. vi /etc/ssh/sshd_config

3. Change the line that says PermitRootLogin from “no” to “yes”

4. Run service sshd restart

And your problem is solved…

OR from console run this:

mv /etc/ssh/sshd_config /etc/ssh/sshd_config.orig

sed 's/PermitRootLogin no/PermitRootLogin yes/g' /etc/ssh/sshd_config.orig > /etc/ssh/sshd_config

service sshd restart
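
For anyone who would rather script the change than hand-edit the file, the same substitution can be sketched in Python. This is an illustration under stated assumptions: the function name permit_root_login is made up, it works on the config text rather than the file itself, and unlike the sed one-liner it deliberately skips commented-out lines.

```python
import re

# Matches an active (uncommented) "PermitRootLogin no" line, keeping its indentation.
ROOT_LOGIN_RE = re.compile(r"^([ \t]*PermitRootLogin[ \t]+)no\b", re.MULTILINE)


def permit_root_login(config_text):
    """Return the sshd_config text with 'PermitRootLogin no' switched to 'yes'."""
    return ROOT_LOGIN_RE.sub(r"\1yes", config_text)


if __name__ == "__main__":
    sample = "Port 22\n#PermitRootLogin no\nPermitRootLogin no\n"
    print(permit_root_login(sample))
```

To apply it for real you would read /etc/ssh/sshd_config, keep a backup copy as in the commands above, write the result back and restart sshd.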

However, having said the above, it is not good security practice to allow direct root-level login over the network, even if it's using SSH. I prefer to add a regular user, SSH to the server using that account, and then su - to get to root.

Another recommendation is to use one non-root group for VM admins and add operator/admin users there. To create that group, enter the following command:

groupadd -g 7777 vmadmins

To create an account for the new admins, enter the following commands:

useradd -c "ESX server operator" ESXOps

Create a single userid, which will be able to operate all of the VMs.

useradd -g 7777 johndoe

Create a userid, and make group ID 7777 (vmadmins) its primary group.

useradd -g 7777 -c "Joe Blog" joeblog2

Create a userid with a full-name comment, and make group ID 7777 (vmadmins) its primary group.

The curious case of the missing DLL

One fine day the SAP system on our development server did not come back after an overnight offline backup. The guys tried everything, but it wouldn't start. Trying to start it from the command line told us that mfc71u.dll was not present in C:\WINDOWS\system32. We restored the file from last night's backup and the SAP system started perfectly.

Now, on a regular PC nobody generally goes into C:\WINDOWS\system32 and deletes files on a whim, and this was a heavily restricted SAP server with only the SAP and OPS teams having access. System logs did not indicate any unwanted activity. The only recent activity was the installation of the Symantec EndPoint Protection Manager client, but that was a week earlier. Also, this deployment had previously been tested on five test servers without any issue.

The server's application event log indicated that Symantec EPM had installed successfully. Still, I got the boys to check the log created by Symantec EPM. This is what we found in the installation log.

Info 1603.The file C:\WINDOWS\system32\mfc71u.dll is being held in use by the following process: Name: sapstartsrv, ID: 10572, Window Title: (not determined yet). Close that application and retry.

MSI (s) (58:10) [15:31:12:215]: Note: 1: 2727 2:

...

MSI (c) (C8:88) [15:31:12:215]: No window with title could be found for FilesInUse

MSI (s) (58:10) [15:31:12:215]: Doing action: uExtBeginUninstallImmediate.6500F9C2_37EA_4F25_A4DE_6211026D9C01

Action ended 15:31:12: InstallValidate. Return value 1.

MSI (s) (58:28) [15:31:12:231]: Invoking remote custom action. DLL: C:\WINDOWS\Installer\MSI324.tmp, Entrypoint: _BeginUninstallImmediate@4

...

MSI (s) (58:10) [15:31:46:590]: Executing op: SetTargetFolder(Folder=C:\WINDOWS\system32\)

MSI (s) (58:10) [15:31:46:590]: Executing op: FileRemove(,FileName=mfc71u.dll,,ComponentId={3AC4AA25-A28A-4F09-826A-30CA0A620F35})

So it looked like the Symantec EPM client install had removed the file as part of the installation. Surprisingly, we did not notice this behaviour on any other PC or server.

Fair to say sometimes you find the cause of a problem in the least expected places.
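
Hunting through a verbose MSI log by eye is tedious, so the FileRemove search from this post can be sketched as a small script. It assumes the log lines look like the excerpt above, with a SetTargetFolder op preceding the FileRemove ops it applies to. The function name removed_files is hypothetical, and real MSI logs are often UTF-16 encoded, so you may need to open the file with that encoding.

```python
import re

# Patterns for the two MSI operations seen in the installation log.
FOLDER_RE = re.compile(r"SetTargetFolder\(Folder=(.*)\)\s*$")
REMOVE_RE = re.compile(r"FileRemove\(.*?FileName=([^,)]+)")


def removed_files(log_lines):
    """Return the full path of every file a FileRemove op targeted."""
    folder = ""
    removed = []
    for line in log_lines:
        match = FOLDER_RE.search(line)
        if match:
            folder = match.group(1)  # remember the current target folder
            continue
        match = REMOVE_RE.search(line)
        if match:
            removed.append(folder + match.group(1))
    return removed


if __name__ == "__main__":
    sample = [
        r"MSI (s) (58:10) [15:31:46:590]: Executing op: SetTargetFolder(Folder=C:\WINDOWS\system32\)",
        r"MSI (s) (58:10) [15:31:46:590]: Executing op: FileRemove(,FileName=mfc71u.dll,,ComponentId={3AC4AA25-A28A-4F09-826A-30CA0A620F35})",
    ]
    for path in removed_files(sample):
        print(path)
```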

Non Paged Pool Memory followup

Remember this post? Here is a quick recap:

Event ID: 2019
The server was unable to allocate from the system nonpaged pool because the pool was empty.

So what does it really mean ?

This error comes from the Server service reporting that, when it was trying to satisfy a request, it could not find enough free memory in the respective pool. Error 2020 indicates Paged Pool and 2019 indicates NonPaged Pool. This doesn’t mean that the Server service (srv.sys) is broken or is the root cause of the problem; more often it is simply the first component to see the resource problem and report it to the Event Log.

I installed Poolmon, which told me that Thre is the largest consumer of memory.

So what's "Thre" ?

Thre - nt!ps - Thread objects

Note, the nt before the ! means that this is NT or the kernel’s tag for Thread objects. So there has to be a process that is leaking memory.

I got WinDbg running on this machine ASAP and entered "!process 0 0" in the command window. Here is the output for all processes with a handle count of more than 1000.

PROCESS 89b9ad88  SessionId: 0  Cid: 0afc    Peb: 7ffd7000  ParentCid: 01c0
    DirBase: dfff07e0  ObjectTable: e628b498  HandleCount: 95969.
    Image: oracle.exe

PROCESS 89b2a690  SessionId: 0  Cid: 0c4c    Peb: 7ffdf000  ParentCid: 01c0
    DirBase: dfff0860  ObjectTable: e63b2358  HandleCount: 2244.
    Image: pinetmgr.exe

PROCESS 8a386698  SessionId: 0  Cid: 0f04    Peb: 7ffd4000  ParentCid: 01c0
    DirBase: dfff0a60  ObjectTable: e17e7408  HandleCount: 2167.
    Image: pimsgss.exe
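
Picking the heavy hitters out of a full "!process 0 0" listing by hand gets old quickly, so here is a rough sketch of the filtering step. It assumes the debugger output has been saved as text in the three-line format shown above. The function name busy_processes and the default threshold of 1000 are illustrative choices.

```python
import re

PROCESS_RE = re.compile(r"^PROCESS\s+(\S+)")
HANDLES_RE = re.compile(r"HandleCount:\s*(\d+)")
IMAGE_RE = re.compile(r"^\s*Image:\s*(\S+)")


def busy_processes(dump_text, threshold=1000):
    """Return (image, address, handle_count) tuples above the threshold, largest first."""
    results = []
    current = {}
    for line in dump_text.splitlines():
        match = PROCESS_RE.match(line)
        if match:
            current = {"address": match.group(1)}  # start of a new process record
            continue
        match = HANDLES_RE.search(line)
        if match:
            current["handles"] = int(match.group(1))
            continue
        match = IMAGE_RE.match(line)
        if match and "handles" in current:
            if current["handles"] > threshold:
                results.append((match.group(1), current.get("address"), current["handles"]))
            current = {}
    return sorted(results, key=lambda item: item[2], reverse=True)


if __name__ == "__main__":
    sample = """PROCESS 89b9ad88  SessionId: 0  Cid: 0afc    Peb: 7ffd7000  ParentCid: 01c0
    DirBase: dfff07e0  ObjectTable: e628b498  HandleCount: 95969.
    Image: oracle.exe
"""
    for image, address, handles in busy_processes(sample):
        print(image, address, handles)
```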

A handle count of more than 95000 definitely set off alarms. I dug a bit deeper into the Oracle process with

!process 89b9ad88 4

The command brought back a whole bunch of threads.

!process 89b9ad88 4

PROCESS 89b9ad88 SessionId: 0 Cid: 0afc    Peb: 7ffd7000 ParentCid: 01c0
   DirBase: dfff07e0 ObjectTable: e628b498 HandleCount: 114448.
   Image: oracle.exe

   THREAD 89b97998 Cid 0afc.0b00 Teb: 7ffdf000 Win32Thread: e6355328 WAIT
   THREAD 89b77b78 Cid 0afc.0b44 Teb: 7ffdd000 Win32Thread: 00000000 WAIT
   THREAD 89b64458 Cid 0afc.0b50 Teb: 7ffdc000 Win32Thread: 00000000 WAIT
   THREAD 89a01020 Cid 0afc.1204 Teb: 7ffdb000 Win32Thread: e660b768 WAIT
   THREAD 89a007d0 Cid 0afc.1208 Teb: 7ffd9000 Win32Thread: 00000000 WAIT
   THREAD 89a003b8 Cid 0afc.120c Teb: 7ffd8000 Win32Thread: 00000000 WAIT
   THREAD 899e5db0 Cid 0afc.1214 Teb: 7ffd6000 Win32Thread: 00000000 WAIT
   THREAD 899fadb0 Cid 0afc.121c Teb: 7ffd5000 Win32Thread: 00000000 WAIT
   THREAD 899e6db0 Cid 0afc.1220 Teb: 7ffd4000 Win32Thread: 00000000 WAIT

I opened two random threads with the !thread command and this is what it came up with:

THREAD 897ebaf0 Cid 0afc.1b58 Teb: 00000000 Win32Thread: 00000000 TERMINATED
Not impersonating
DeviceMap                 e1000908
Owning Process            89b9ad88       Image:         oracle.exe
Attached Process          N/A            Image:         N/A
Wait Start TickCount      17893          Ticks: 16701178 (3:00:29:15.906)
Context Switch Count      24
UserTime                  00:00:00.000
KernelTime                00:00:00.000
Win32 Start Address 0x0040162c
Start Address 0x77e617ec
Stack Init 0 Current b949fba0 Base b94a0000 Limit b949d000 Call 0
Priority 10 BasePriority 8 PriorityDecrement 0

THREAD 8969e020 Cid 0afc.08c0 Teb: 00000000 Win32Thread: 00000000 TERMINATED
Not impersonating
DeviceMap                 e1000908
Owning Process            89b9ad88       Image:         oracle.exe
Attached Process          N/A            Image:         N/A
Wait Start TickCount      45267          Ticks: 16678772 (3:00:23:25.812)
Context Switch Count      27
UserTime                  00:00:00.000
KernelTime                00:00:00.015
Win32 Start Address 0x0040162c
Start Address 0x77e617ec
Stack Init 0 Current b9c3fba0 Base b9c40000 Limit b9c3d000 Call 0
Priority 10 BasePriority 8 PriorityDecrement 0
ChildEBP RetAddr Args to Child

I could tell that the threads had been terminated and that they belonged to oracle.exe, but somehow they had not been cleared from memory.

I opened Task Manager, added the Handle Count column from the View menu, and saw this:

The handle count was growing at a fair speed.

I have contacted the Oracle boys to check out the issue, but I am pretty sure one of the Oracle apps on that box, or Oracle itself, is the cause of the memory leak.

I’ll post back when the Oracle team have come back with their investigation.
