
ScotGrid

GridPP is a collaboration of particle physicists and computer scientists from the UK and CERN. They are building a distributed computing Grid across the UK for particle physicists. At the moment there is a working particle physics Grid across 17 UK institutions. When the Large Hadron Collider, the world's largest particle accelerator, opens at CERN in 2007 the Grid will be used to process the accompanying data deluge. The UK Grid will contribute the equivalent of 10,000 PCs to this worldwide effort.

ScotGrid is a regional Tier 2 centre, originally comprising Durham, Edinburgh and Glasgow and aimed at supporting the UK particle physics community. It is now expanding to include other Scottish universities and other disciplines.

The aim is to engage with a diverse range of academic groups in an effort to provide a cutting-edge, grid-based computing environment for researchers in Scotland. ScotGrid is in contact with research communities as diverse as: Arts and Humanities; Rural Policy Support; Atomic and Molecular Diagnostic Processes in Plasmas; Thermodynamics and Fluid Engineering; Computational Chemistry; Nuclear and Particle Physics.

UK_Tier-2

GridPP

A Super(Symmetric) Grid

For the last few months researchers at Queen Mary, University of London, working with colleagues from Durham University and Brown University, have been using the GridPP infrastructure to investigate Supersymmetric theories. This is a continuation of work that had already been moved from a single machine to a local batch system but needed the extra power the grid could offer. This piece is copyright GridPP; if you wish to reproduce it, please credit GridPP and contact Neasan ONeill to say you are using it.

Care to share? Grid computing on a general-purpose cluster

Edinburgh Compute and Data Facility (ECDF) is a large local computing resource for hundreds of researchers at Edinburgh University engaged in pursuits from across the academic spectrum, such as analysing brain scans to understand mental illness and exploring the dynamics of complex chemical systems. The diverse user base brings a broad range of requirements that need to work happily together, and GridPP are one of the more challenging users.

UK Leads Europe Across The Grid

The Enabling Grids for E-SciencE (EGEE) project closed on 30 April 2010. The project brought together a computing infrastructure, software tools and services to support more than 10,000 scientific researchers across more than 170 research communities. During the two year term GridPP played a key role in EGEE's success, being the biggest national contributor of computing resources.

Refreshing The Real Time Monitor

The Real Time Monitor, developed by GridPP and Imperial College, has undergone another overhaul. It is now available in different versions, supports a greater number of platforms, is more stable and has a new website.

ScotGrid on Fire!

Don't panic - it's only a test...

Hmm. We had a malicious user's DN on the Glasgow system this morning. I'm sure that other UKI sites may be affected too. Be careful with your cleanup processes, as we missed something the first time round. Grr.

"Oh no! Not again..."

After being all enthusiastic that the gSOAP errors had been nailed, we failed two SE tests in the last 24 hours. Exactly the same issue as before.

As this error message is so vague it looks like lcg-rollout is our only hope.

I note in passing that Glasgow has one of the most reliable SEs in the UK for ATLAS (2.1% job loss, only beaten by Oxford who have 0.8%; UK average in Q3 was 8% loss) so this is particularly galling.

Shouldn't we be making the results as seen by our real customers rather more important than a once an hour stab in the dark from ops?

Death to gSOAP...

Even after the successful upgrade of DPM we started to get plagued again by SAM test failures with the generic failure message:

httpg://svr018.gla.scotgrid.ac.uk:8443/srm/managerv1:
CGSI-gSOAP: Error reading token data header: Connection closed

This time they came principally from the SE test, instead of from the CE-rm test.

For a while I wondered if there was a DNS problem, but this seemed unlikely for two reasons:

Durham use the .scotgrid.ac.uk domain, but they don't see errors.
We see the connection in the srmv1 logs, so the host can be resolved.

Then I started to wonder if there was a CRL problem as we occasionally get CRL warnings from SAM WN tests. We have an optimised CRL download system at Glasgow - the CE downloads CRLs as normal, then the remaining nodes mirror the CRLs from the CE. This means we make 1 outbound connection every 6 hours, instead of 150, which seems eminently sensible on a large cluster. However, the default cron jobs on the nodes process CRLs every 6 hours, which means that CRLs could be up to 12 hours old, in the worst case, on client nodes.

On this suspicion I changed the CE configuration to download CRLs every hour, and the clients to download these from the CE every 4 hours.
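The staleness arithmetic behind this change can be sketched in a few lines of Python. This is only a simple model, not anything taken from the Glasgow configuration: the function names are made up for illustration, and the figure of 150 fetching hosts is inferred from the "1 outbound connection instead of 150" comparison above.

```python
def worst_case_crl_age(ce_interval_h, client_interval_h):
    """Upper bound, in hours, on the age of a CRL held by a client node.

    Simple model (an assumption): a client can mirror the CE's copy just
    before the CE refreshes it, so the copy is already ce_interval_h old,
    and the client then keeps it until its own next mirror run,
    client_interval_h later.
    """
    return ce_interval_h + client_interval_h


def outbound_fetches_per_day(fetch_interval_h, fetching_hosts=1):
    """Number of CRL download runs per day that leave the site."""
    return fetching_hosts * 24 // fetch_interval_h


print(worst_case_crl_age(6, 6))   # original schedule -> 12
print(worst_case_crl_age(1, 4))   # revised schedule  -> 5
print(outbound_fetches_per_day(6, fetching_hosts=150))  # every node fetching -> 600
print(outbound_fetches_per_day(6))  # CE only, 6-hourly -> 4
print(outbound_fetches_per_day(1))  # CE only, hourly   -> 24
```

Under this model the revised schedule cuts the worst-case CRL age from 12 hours to 5, while the site still makes far fewer outbound fetches than it would if every node downloaded CRLs itself.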

I made this change on Friday and, so far, we haven't seen the error again.

My eternal complaint with X509/openssl is why the error is reported as "CGSI-gSOAP: Error reading token data header: Connection closed" and not "CGSI-gSOAP: Error reading token data header: Connection closed [CRL for DN BLAH out of date]".

Is that so very hard to do?

ScotGrid Edinburgh progress

Finally we are green for the latest Atlas releases...

We've made a lot of progress this past week with ECDF. It all started on Friday 10th Oct when we were trying to solve some Atlas installation problems in a somewhat ad hoc fashion.
We then incorrectly tagged/published having a valid production release. This then caused serious problems with the Atlas jobs, which resulted in us being taken out of the UK production and missing out on a lot of CPU demand. This past week we've been working hard to solve the problem and here are a few things we found:

1) First of all there were a few access problems to the servers for a few of us. So it was hard to see what was actually going on with the mounted atlas software area. Some of this has now been resolved.

2) The installer was taking ages and then timing out (proxy expiry, and also SGE killing it off eventually). strace on the nodes linked this to very slow performance when doing many chmod writes to the file system. We solved this with a two-fold approach:
- Alessandro modified the installer script to be more selective about which files need chmod-ing, but the system was still very slow.
- The NFS export was then changed to allow asynchronous writes, which considerably sped up the tiny writes to the underlying LUN. There is now a worry about possible data corruption, which should be borne in mind if the server goes down and/or we have Edinburgh-specific segv/problems with a release. Orlando may want to post more information about the NFS changes later.
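We don't reproduce Alessandro's actual change to the installer here. As a rough illustration of what "more selective" chmod-ing could look like, the Python sketch below only issues a chmod when a file's permission bits differ from the wanted ones, so unchanged files cost a read-only lstat rather than a write over NFS. The function name and the choice to skip symlinks are assumptions made for the example.

```python
import os
import stat


def selective_chmod(root, wanted_mode):
    """Walk root and chmod only the regular files whose permission bits differ.

    Returns the number of chmod calls actually made.
    """
    changed = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if not stat.S_ISREG(st.st_mode):
                continue  # skip symlinks and special files
            if stat.S_IMODE(st.st_mode) != wanted_mode:
                os.chmod(path, wanted_mode)  # the only write to the file system
                changed += 1
    return changed
```

On a software area where most files already have the right mode, a second pass with the same wanted_mode makes no chmod calls at all.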

3) The remover and installer used ~3 GB and 4.5 GB of vmem respectively, and the 6 GB vmem limit had only been applied to prodatlas jobs. The 3 GB vmem default started causing serious problems for sgmatlas. This has now been changed to 6 GB.

We're also planning to add the "qsub -m a -M" SGE options on the CE to allow the middleware team to better monitor the occurrence of vmem aborts. We also might add a flag to help better parse the SGE accounting logs for APEL. Note: the APEL monitoring problem has been fixed. However, that's for another post (Sam?)...
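For reference, in SGE's qsub "-m a" requests mail when a job is aborted and "-M" gives the address to send it to. A minimal Python sketch of how a submission wrapper might prepend these options is shown below; the helper name, the address and the job script name are placeholders, not the real CE configuration.

```python
def with_abort_mail(qsub_args, notify_address):
    """Return a copy of qsub_args with SGE mail-on-abort options prepended."""
    return ["-m", "a", "-M", notify_address] + list(qsub_args)


# Hypothetical address and job script, for illustration only.
cmd = ["qsub"] + with_abort_mail(["job.sh"], "grid-admin@example.org")
print(" ".join(cmd))  # qsub -m a -M grid-admin@example.org job.sh
```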

Well done to Orlando, Alessandro, Graeme and Sam for helping us get to the bottom of this!

Well Done Guys!

Well, I was waiting for Mike and Andrew to blog this, but they haven't. They very successfully upgraded Glasgow's DPM to the native 64-bit version on Monday last week (when we had upgraded to SL4 only the 32-bit version was available). This was a significant step forwards but required the head node and all of the disk servers to have their OS rebuilt without losing data, and the database restored onto the head node.

It went very well and we were up and running again within 6 hours - no data lost!

We are also seeing an improvement in the SAM test results, with the spurious 'gSOAP' errors which were plaguing us now seemingly having gone (fingers crossed!).

It's terrible that the LHC is not running right now, but it does mean that interventions like this can be done.

Great work guys!