Hi Everyone,

I got a new DFS cluster on Windows 2008 going, and while I was adding disks to the cluster the active node bugchecked. So I enabled driver verification, added another disk and got the following error: IO SYSTEM VERIFICATION ERROR in Ntfs.sys (WDM DRIVER ERROR 224).
See the screenshot below.




Analyzing the memory dump didn't get me very far. The first port of call is to update all drivers and firmware on the system, including the HBA firmware. Once the drivers and firmware are updated on this HP BL470c blade, I will attempt to add more LUNs.



If you get the error below when trying to move transaction logs in Exchange 2003 (Access is denied, ID no: c0070005, Exchange System Manager),
go to http://support.microsoft.com/default.aspx?scid=kb;en-us;Q323915. However, you may need to restart the node for the registry permissions to take effect.

Just spent the whole of Sunday bringing back an Exchange cluster. I wonder why disasters happen on Sundays...

Hi All,


When doing a schema upgrade, I disable outbound replication on the Windows 2003/2008 Active Directory domain controller where the upgrade runs. Once it has finished successfully and I am sure all is OK on that DC, I enable replication again.

Here is the command to disable Outbound Replication for Domain Controllers.
repadmin /options localhost +disable_outbound_repl

To enable replication again:
repadmin /options localhost -disable_outbound_repl

Here is the command to disable Inbound Replication for Domain Controllers.
repadmin /options localhost +disable_inbound_repl

To enable replication again:
repadmin /options localhost -disable_inbound_repl
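
The four commands above differ only in the direction and in the leading + or -, so they can be generated from one helper. This is a minimal sketch, not a tested automation script: the function name and parameters are made up for illustration, and it only builds the argument list (using the repadmin option names DISABLE_OUTBOUND_REPL and DISABLE_INBOUND_REPL) without running anything.

```python
def repadmin_options_command(dc, direction, disable):
    """Build the repadmin argument list that toggles replication on a DC.

    dc        -- domain controller name, e.g. "localhost"
    direction -- "inbound" or "outbound"
    disable   -- True adds the disable flag (+), False removes it (-)
    """
    if direction not in ("inbound", "outbound"):
        raise ValueError("direction must be 'inbound' or 'outbound'")
    sign = "+" if disable else "-"
    return ["repadmin", "/options", dc, f"{sign}disable_{direction}_repl"]


if __name__ == "__main__":
    # Print the commands that would bracket a schema upgrade.
    before = repadmin_options_command("localhost", "outbound", True)
    after = repadmin_options_command("localhost", "outbound", False)
    print(" ".join(before))
    print("... run the schema upgrade and verify the DC ...")
    print(" ".join(after))
```

The list form can be handed to subprocess.run on the DC itself if you want to script the whole upgrade window.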

Hi All

Microsoft is giving away two free E-Books:

All the tools and technical information to help you set up, deploy and manage Terminal Services farms using Windows Server 2008


This guide helps developers build a process of best practices that helps to avoid defects during the development process rather than trying to fix them after the fact.




Leading Storage Virtualization Now Free With Unlimited Connections and Large 2TB Size


Yes, you read that correctly. StarWind Software is giving away its iSCSI Target software for free. Here are the features:

• Large 2 TB capacity
• Support for Server Clusters with VMware ESX and ESXi and Microsoft Hyper-V
• Unlimited number of supported concurrent iSCSI connections
• Compression, Encryption and CHAP authentication
• iSCSI RAM disk for network performance tuning
• iSCSI CD/DVD/Blu-Ray/HD-DVD emulation

This is really great news for all techies, as it allows us to play with various solutions that require SAN-type storage, such as ESX, clusters, etc.

Grab it from here.

Hi All,

We seem to have become a magnet for bugchecks (blue screens), and my group is getting hit with them from all angles. We had been having a bunch of blue screens (STOP 0xC4) on a particular server, and all of our testing and debugging got us nowhere.

So we ran Verifier to check all drivers on this server. As soon as we finished configuring verifier.exe and rebooted the server, it blue screened with a STOP 0x000000D5, but this time the culprit came out. It was cdprobe.sys, running on this Windows 2003 server as part of Symantec's Centennial Discovery Client Agent. We disabled it and the server has been running fine since then.

So how do you use Verifier.exe?

Here are the steps to enable Verifier.

Start > Run, type Verifier and hit enter.

Select the radio button for Create custom settings (for code developers), click Next

On this screen, click Select individual settings from a full list, click Next

Tick the first three check boxes of Special Pool, Pool Tracking, Force IRQL checking, click Next

Check the box for Select driver names from a list and click Next

In this screen, sort by Provider and then select all the Non-Microsoft drivers. Click Finish. It will ask you to reboot.

If your server has a bad driver it will bluescreen and display the name of the problematic driver (see screenshot below).
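
The same settings can also be applied from the command line with verifier /flags and /driver. The sketch below only builds that command line; it assumes the commonly documented flag bit values (Special Pool = 0x1, Force IRQL checking = 0x2, Pool Tracking = 0x8), and the function and dictionary names are made up for illustration. Check verifier /? on your build before relying on it.

```python
# Assumed Driver Verifier flag bits (commonly documented values).
VERIFIER_FLAGS = {
    "special_pool": 0x1,
    "force_irql_checking": 0x2,
    "pool_tracking": 0x8,
}


def verifier_command(options, drivers):
    """Return a 'verifier /flags N /driver ...' command line as a string."""
    if not drivers:
        raise ValueError("at least one driver name is required")
    flags = 0
    for name in options:
        flags |= VERIFIER_FLAGS[name]  # KeyError on an unknown option name
    return f"verifier /flags {flags} /driver {' '.join(drivers)}"


if __name__ == "__main__":
    # The three check boxes from the GUI walkthrough, applied to one driver.
    print(verifier_command(
        ["special_pool", "pool_tracking", "force_irql_checking"],
        ["cdprobe.sys"],
    ))
```

As with the GUI, the settings only take effect after a reboot.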


This morning I got a new DFS build going and, after finishing all the config work on it, tried to get the folder replication to start but got this rude error.

Domain: The Active Directory schema on domain controller dc1 cannot be read. This error might be caused by schema that has not been extended, or was extended improperly. A class schema object cannot be found.



So went into event viewer to check the error in DFS replication and found this:


It was followed by this:

Event ID: 6012

Description: The DFS Replication service detected an incompatible Active Directory schema version while trying to read configuration objects from server DC1. The service disconnected from this server and will try again in the next polling cycle.

Additional Information:

Expected Version: 31

Incompatible Server Version: 30





I looked up Event ID 6012 and found this article by Microsoft. Going through the article cleared everything up.


Our domain controllers were running Windows 2003 while our file servers had moved up to Windows 2003 R2. So the solution was simple: back up your domain controller, get CD 2 of Windows 2003 R2 and run adprep.exe /forestprep from it. We did that and voila, DFS replication was working happily.
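
The version numbers in event 6012 are Active Directory schema versions. As a rough sketch, assuming the widely published mapping of schema version to Windows release (the table and function names here are illustrative, not part of any Microsoft tool), the event can be decoded like this:

```python
# Assumed mapping of AD schema version to the Windows release that ships it.
SCHEMA_VERSIONS = {
    13: "Windows 2000 Server",
    30: "Windows Server 2003",
    31: "Windows Server 2003 R2",
    44: "Windows Server 2008",
}


def explain_schema_mismatch(expected, found):
    """Describe an 'Expected Version / Incompatible Server Version' pair."""
    expected_name = SCHEMA_VERSIONS.get(expected, f"unknown ({expected})")
    found_name = SCHEMA_VERSIONS.get(found, f"unknown ({found})")
    if found >= expected:
        return f"Schema is {found_name}; no extension needed."
    return (f"Schema is {found_name} but the service expects "
            f"{expected_name}; run adprep /forestprep from the newer media.")


if __name__ == "__main__":
    # Values taken from the event shown above.
    print(explain_schema_mismatch(expected=31, found=30))
```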


Event Type: Information

Event Source: DFSREvent

Event ID: 1206

Description: The DFS Replication service successfully contacted domain controller \\DC1 to access configuration information.
For more information, see Help and Support Center at
http://go.microsoft.com/fwlink/events.asp.


Back from my holiday and straight into a couple of big migration projects. While running these projects I needed to move 20TB of data around.

Now, I have generally used Robocopy for pretty much all my copy/move/sync needs, but to move that much data I needed a utility that could do it faster. In comes Microsoft with RichCopy, which was an internal tool for many years until they made it public.





The advantage of RichCopy is that it's multithreaded. Really multithreaded. I ran 256 directory copy threads, with each directory copy getting 256 file copy operations, effectively running 65536 copy operations simultaneously, and RichCopy ran with it just fine. Check out the screenshots below.


You can grab your copy from here.



Hey guys, remember this post, Oracle Server Non paged pool memory error? Quick recap:

Oracle 10g Release 2 (10.2) server running out of Non Paged Pool Memory with Event ID: 2019?
The Oracle guys had logged a Service Request with Oracle and, after reviewing the research I mentioned before, Oracle clearly identified the issue as "Bug 5077897" "Windows: Server side handle leak"

Here is the Oracle Blurb about Bug 5077897:

Bug 5077897 - Windows: Server side handle leak

version affected 10.2.0.2 

Description
This problem is introduced on Windows platforms
in 10.2.0.2. The Oracle server exhibits a thread Handle 
leak which is observed for every TCP connect/disconnect 
to a 10.2 database.

So we will be upgrading our database very soon.

Also, on that note, I am on vacation till mid March, so although I will be checking my mail on a weekly basis, I won't be doing any blog posts. Take care and have fun. See you all in a month's time.

So you just installed a new VMware ESX server. You tried to SSH to it and login as root. What happened?
It didn’t work, did it?
The firewall allows it, right? (yes) You can login to the physical server console with the same username & password, right? (yes) But it still doesn’t work, does it?
Let’s find out how to fix it….
To allow the root user to login to a VMware ESX Server over the network using SSH, do the following:
1. Go to the service console on the physical server & login 
2. vi /etc/ssh/sshd_config 
3. Change the line that says PermitRootLogin from “no” to “yes” 
4. Run service sshd restart
And your problem is solved…

OR from console run this:
 
cp -p /etc/ssh/sshd_config /etc/ssh/sshd_config.orig
sed 's/PermitRootLogin no/PermitRootLogin yes/g' /etc/ssh/sshd_config.orig > /etc/ssh/sshd_config
service sshd restart
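
If you would rather see exactly what that substitution does before touching the live file, here is a small sketch that applies the same change to the config text in memory. The function name is made up for illustration, and unlike the sed one-liner it only touches an active (uncommented) PermitRootLogin line.

```python
import re

# Matches an active "PermitRootLogin no" directive on its own line.
_DIRECTIVE = re.compile(r"^([ \t]*PermitRootLogin[ \t]+)no[ \t]*$", re.MULTILINE)


def enable_root_login(config_text):
    """Return sshd_config text with 'PermitRootLogin no' switched to 'yes'."""
    return _DIRECTIVE.sub(r"\g<1>yes", config_text)


if __name__ == "__main__":
    sample = "Port 22\nPermitRootLogin no\nX11Forwarding yes\n"
    print(enable_root_login(sample))
```

Comparing the returned text with the original shows the single line that will change.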

However, having said the above, it is not good security practice to allow direct root-level login over the network, even if it's using SSH. I prefer to add a regular user, SSH to the server using that account and then su - to get to root.

Another recommendation is to use one non-root group for VM admins and add operator/admin users there. To create that group, enter the following command: 
groupadd -g 7777 vmadmins 

To create an account for the new admins, enter the following commands: 

useradd -c "ESX server operator" ESXOps 
Create a single userid, which will be able to operate all of the VMs. 

useradd -g 7777 johndoe 
Create a userid, and make groupid 7777 (vmadmins) as its primary group. 

useradd -g 7777 -c "Joe Blog" joeblog2 
Create a userid, and make groupid 7777 (vmadmins) as its primary group. 

One fine day the SAP system on our development server did not come back after an overnight offline backup. The guys tried a lot but it wouldn't start. Trying to start it from the command line told us that mfc71u.dll was not present in C:\WINDOWS\system32. We restored the file from last night's backup and the SAP system started perfectly.


Now, on a regular PC nobody generally goes into C:\WINDOWS\system32 and deletes files on a whim, and this was a heavily restricted SAP server with only the SAP and OPS teams having access. The system logs did not indicate any unwanted activity. The only activity was the installation of the Symantec EndPoint Protection Manager client, but that was a week earlier. Also, this deployment had previously been tested on five test servers without any issue.

The server's application event log indicated that Symantec EPM installed successfully. Still I got the boys to check the log created by Symantec EPM. This is what we found in the installation log. 

Info 1603.The file C:\WINDOWS\system32\mfc71u.dll is being held in use by the following process: Name: sapstartsrv, ID: 10572, Window Title: (not determined yet).  Close that application and retry.
MSI (s) (58:10) [15:31:12:215]: Note: 1: 2727 2:  
...
MSI (c) (C8:88) [15:31:12:215]: No window with title could be found for FilesInUse
MSI (s) (58:10) [15:31:12:215]: Doing action: uExtBeginUninstallImmediate.6500F9C2_37EA_4F25_A4DE_6211026D9C01
Action ended 15:31:12: InstallValidate. Return value 1.
MSI (s) (58:28) [15:31:12:231]: Invoking remote custom action. DLL: C:\WINDOWS\Installer\MSI324.tmp, Entrypoint: _BeginUninstallImmediate@4
... 

MSI (s) (58:10) [15:31:46:590]: Executing op: SetTargetFolder(Folder=C:\WINDOWS\system32\)
MSI (s) (58:10) [15:31:46:590]: Executing op: FileRemove(,FileName=mfc71u.dll,,ComponentId={3AC4AA25-A28A-4F09-826A-30CA0A620F35})

So it looked like Symantec EPM client install had removed the file post installation. Surprisingly we did not notice this behaviour on any other PC/Server. 
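
For long installer logs, pulling out every file removal is quicker than reading them by eye. This is a minimal sketch that assumes the log lines look like the two "Executing op" lines quoted above; the function name is made up for illustration and it has only been reasoned through against that format.

```python
import re

# Patterns based on the two "Executing op" lines shown in the log excerpt.
TARGET_FOLDER = re.compile(r"Executing op: SetTargetFolder\(Folder=(.*)\)\s*$")
FILE_REMOVE = re.compile(r"Executing op: FileRemove\(.*?FileName=([^,)]+)")


def removed_files(log_lines):
    """Return full paths of files the installer removed, in log order."""
    folder = ""
    removed = []
    for line in log_lines:
        match = TARGET_FOLDER.search(line)
        if match:
            folder = match.group(1)  # remember the most recent target folder
            continue
        match = FILE_REMOVE.search(line)
        if match:
            removed.append(folder + match.group(1))
    return removed


if __name__ == "__main__":
    excerpt = [
        r"MSI (s) (58:10) [15:31:46:590]: Executing op: SetTargetFolder(Folder=C:\WINDOWS\system32\)",
        r"MSI (s) (58:10) [15:31:46:590]: Executing op: FileRemove(,FileName=mfc71u.dll,,ComponentId={3AC4AA25-A28A-4F09-826A-30CA0A620F35})",
    ]
    for path in removed_files(excerpt):
        print(path)
```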

Fair to say sometimes you find the cause of a problem in the least expected places.

Remember this post? Here is a quick recap:

Event ID: 2019
The server was unable to allocate from the system nonpaged pool because the pool was empty.

So what does it really mean?

This error comes from the Server service reporting that, when it was trying to satisfy a request, it was not able to find enough free memory in the respective type of pool. Error 2020 indicates Paged Pool and 2019 NonPaged Pool. This doesn't mean that the Server service (srv.sys) is broken or is the root cause of the problem; rather, it is often the first component to see the resource problem and report it to the Event Log.

I installed Poolmon, which told me that the Thre tag is the largest consumer of memory.

Poolmon 

So what's "Thre" ?

Thre - nt!ps - Thread objects

Note, the nt before the ! means that this is NT or the kernel’s tag for Thread objects. So there has to be a process that is leaking memory.

I got WinDbg running on this machine ASAP and entered "!process 0 0" at the command prompt. Here is the output for all processes with a handle count of more than 1000.

PROCESS 89b9ad88  SessionId: 0  Cid: 0afc    Peb: 7ffd7000  ParentCid: 01c0
    DirBase: dfff07e0  ObjectTable: e628b498  HandleCount: 95969.
    Image: oracle.exe
PROCESS 89b2a690  SessionId: 0  Cid: 0c4c    Peb: 7ffdf000  ParentCid: 01c0
    DirBase: dfff0860  ObjectTable: e63b2358  HandleCount: 2244.
    Image: pinetmgr.exe
PROCESS 8a386698  SessionId: 0  Cid: 0f04    Peb: 7ffd4000  ParentCid: 01c0
    DirBase: dfff0a60  ObjectTable: e17e7408  HandleCount: 2167.
    Image: pimsgss.exe
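
Picking those entries out of the full !process 0 0 listing by hand is tedious on a busy server. Here is a minimal sketch that filters saved debugger output by handle count; it assumes each process block has a HandleCount line followed by an Image line, as in the excerpt above, and the function name is made up for illustration.

```python
import re

HANDLE_COUNT = re.compile(r"HandleCount:\s*(\d+)")
IMAGE = re.compile(r"Image:\s*(\S+)")


def busy_processes(debugger_output, threshold=1000):
    """Return (image, handle_count) pairs above threshold, largest first."""
    results = []
    handle_count = None
    for line in debugger_output.splitlines():
        match = HANDLE_COUNT.search(line)
        if match:
            handle_count = int(match.group(1))
            continue
        match = IMAGE.search(line)
        if match and handle_count is not None:
            if handle_count > threshold:
                results.append((match.group(1), handle_count))
            handle_count = None  # wait for the next process block
    return sorted(results, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    sample = """\
PROCESS 89b9ad88  SessionId: 0  Cid: 0afc    Peb: 7ffd7000  ParentCid: 01c0
    DirBase: dfff07e0  ObjectTable: e628b498  HandleCount: 95969.
    Image: oracle.exe
PROCESS 89b2a690  SessionId: 0  Cid: 0c4c    Peb: 7ffdf000  ParentCid: 01c0
    DirBase: dfff0860  ObjectTable: e63b2358  HandleCount: 2244.
    Image: pinetmgr.exe
"""
    for image, count in busy_processes(sample):
        print(f"{image}: {count}")
```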

A handle count of more than 95000 definitely set off alarms. I dug a bit deeper into the Oracle process with

!process 89b9ad88 4

The command returned a whole bunch of threads.

!process 89b9ad88 4

PROCESS 89b9ad88  SessionId: 0  Cid: 0afc    Peb: 7ffd7000  ParentCid: 01c0
    DirBase: dfff07e0  ObjectTable: e628b498  HandleCount: 114448.
    Image: oracle.exe

        THREAD 89b97998  Cid 0afc.0b00  Teb: 7ffdf000 Win32Thread: e6355328 WAIT
        THREAD 89b77b78  Cid 0afc.0b44  Teb: 7ffdd000 Win32Thread: 00000000 WAIT
        THREAD 89b64458  Cid 0afc.0b50  Teb: 7ffdc000 Win32Thread: 00000000 WAIT
        THREAD 89a01020  Cid 0afc.1204  Teb: 7ffdb000 Win32Thread: e660b768 WAIT
        THREAD 89a007d0  Cid 0afc.1208  Teb: 7ffd9000 Win32Thread: 00000000 WAIT
        THREAD 89a003b8  Cid 0afc.120c  Teb: 7ffd8000 Win32Thread: 00000000 WAIT
        THREAD 899e5db0  Cid 0afc.1214  Teb: 7ffd6000 Win32Thread: 00000000 WAIT
        THREAD 899fadb0  Cid 0afc.121c  Teb: 7ffd5000 Win32Thread: 00000000 WAIT
        THREAD 899e6db0  Cid 0afc.1220  Teb: 7ffd4000 Win32Thread: 00000000 WAIT

I opened two random threads with the !thread command, and this is what it came up with:

THREAD 897ebaf0  Cid 0afc.1b58  Teb: 00000000 Win32Thread: 00000000 TERMINATED
Not impersonating
DeviceMap                 e1000908
Owning Process            89b9ad88       Image:         oracle.exe
Attached Process          N/A            Image:         N/A
Wait Start TickCount      17893          Ticks: 16701178 (3:00:29:15.906)
Context Switch Count      24            
UserTime                  00:00:00.000
KernelTime                00:00:00.000
Win32 Start Address 0x0040162c
Start Address 0x77e617ec
Stack Init 0 Current b949fba0 Base b94a0000 Limit b949d000 Call 0
Priority 10 BasePriority 8 PriorityDecrement 0

THREAD 8969e020  Cid 0afc.08c0  Teb: 00000000 Win32Thread: 00000000 TERMINATED
Not impersonating
DeviceMap                 e1000908
Owning Process            89b9ad88       Image:         oracle.exe
Attached Process          N/A            Image:         N/A
Wait Start TickCount      45267          Ticks: 16678772 (3:00:23:25.812)
Context Switch Count      27            
UserTime                  00:00:00.000
KernelTime                00:00:00.015
Win32 Start Address 0x0040162c
Start Address 0x77e617ec
Stack Init 0 Current b9c3fba0 Base b9c40000 Limit b9c3d000 Call 0
Priority 10 BasePriority 8 PriorityDecrement 0
ChildEBP RetAddr  Args to Child

I could tell that the threads had been terminated and that they belonged to oracle.exe, but somehow they had not been cleared from memory.
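
The Ticks values show how long those terminated threads have been lying around. As a quick worked check, assuming a 15.625 ms clock tick (64 ticks per second, an assumption that happens to reproduce the debugger's own figures), the conversion to days:hours:minutes:seconds looks like this. The function name is made up for illustration.

```python
def ticks_to_duration(ticks):
    """Convert a tick count to d:hh:mm:ss.mmm, assuming 15.625 ms per tick."""
    total_ms = ticks * 125 // 8  # 15.625 ms == 125/8 ms, truncated like WinDbg
    days, rem = divmod(total_ms, 86_400_000)
    hours, rem = divmod(rem, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{days}:{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"


if __name__ == "__main__":
    # Tick counts taken from the two !thread outputs above.
    for ticks in (16701178, 16678772):
        print(ticks, "->", ticks_to_duration(ticks))
```

Both threads had been sitting in the TERMINATED state for a little over three days.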

I opened Task Manager, added the Handle Count column via View > Select Columns, and saw this:

TaskManager-3

The Handle count was growing at a fair bit of speed.

TaskManager-4

I have contacted the Oracle boys to check out the issue, but I am pretty sure one of the Oracle apps on that box, or Oracle itself, is the cause of the memory leak.

I’ll post back when the Oracle team have come back with their investigation.