The task for DB mindbox in May 2016 was clear, but far from easy: organise a challenge to find solutions for digitising the infrastructure plans of DB Netz. The aim was to discover new ways of solving problems and to recruit external experts to develop a program that can extract information from various plan documents and structure it in such a way that it can be imported into the SAP system at Deutsche Bahn.
A prototype of the finished solution with a host of functions has been ready since early February. The complete processing chain is modelled, from the image file through to import into the plan management system. Automated functions take care of processes such as converting the image format, recognising the text blocks and, if necessary, rotating the document. The user interface offers a variety of filtering and structuring functions for post-processing.
Project: Digitising headers
A real treasure lies at DB Netz: several hundred thousand plans document where and in what form telecommunications (TC) systems are installed in the Deutsche Bahn railway network. These plans have already been digitised in order to be transferred to a central database. This means that employees could soon have access to infrastructure plans directly at their workplace or on a tablet while on the move – without having to inconveniently search for them. However, before the plans can even be found, the metadata on each plan, together with the data indicating what the plans actually depict, has to be extracted from the text blocks.
Text blocks – known as headers – provide a kind of legend. Individual blocks contain all the relevant information about the respective plan. They indicate, for example, the track section or the person who created the plan – often in abbreviated form. Some of the plans are almost 100 years old and many have handwritten notes scrawled over them – which means the text blocks are neither consistent nor standardised.
“There are mandatory fields and optional fields”, explains Javier Glaubitz from DB Netz, who is managing the digitisation project. “It’s an enormous challenge for a system to recognise automatically where which fields are located – and the significance of their contents.”
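To make the distinction between mandatory and optional fields concrete, here is a minimal Python sketch of how a recognised header could be checked for completeness. It is an illustration only, not the method used in the project: the field names (`track_section`, `created_by`, `plan_date`, `scale`), their split into mandatory and optional, and the helper `check_header` are all assumptions, since the actual DB Netz header schema is not described here.

```python
from dataclasses import dataclass, field

# Hypothetical field names, chosen for illustration only; the real DB Netz
# header schema is not described in this article.
MANDATORY_FIELDS = {"track_section", "created_by"}
OPTIONAL_FIELDS = {"plan_date", "scale"}


@dataclass
class HeaderCheck:
    missing: set = field(default_factory=set)  # mandatory fields not found
    unknown: set = field(default_factory=set)  # keys outside the assumed schema

    @property
    def ok(self) -> bool:
        # A header passes only if nothing mandatory is missing
        # and no unexpected field was recognised.
        return not self.missing and not self.unknown


def check_header(recognised: dict) -> HeaderCheck:
    """Compare recognised header fields against the assumed schema."""
    # Treat empty or whitespace-only values as "not recognised".
    present = {key for key, value in recognised.items() if value and value.strip()}
    return HeaderCheck(
        missing=MANDATORY_FIELDS - present,
        unknown=present - MANDATORY_FIELDS - OPTIONAL_FIELDS,
    )


# Example: the creator field was found but is blank, so it counts as missing.
result = check_header({"track_section": "3600", "created_by": " ", "scale": "1:1000"})
print(result.ok, sorted(result.missing))
```

A check like this only tells a reviewer which headers need attention. Working out where each field sits on an old, handwritten plan is the hard part that Glaubitz describes.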
Competition to find the best solution
The challenges set by DB mindbox are generally intended to find fast, innovative or unconventional solutions to problems that DB is facing. The ten participating teams, including DB Systel, had the task of developing an algorithm capable of extracting the text blocks of TC plans and automatically assigning the correct keywords. To help them, sample solutions were prepared specifically for the challenge. Thanks to this extensive preparation, the participants didn’t have to worry about the details of the test methodology and could concentrate instead on the core problem: content recognition.
The competition was won by a team from the German Research Centre for Artificial Intelligence (DFKI), who opted for an open-source solution with which they achieved the best results when reading the headers. The solution devised by DB Systel came third. “We can be extremely satisfied,” says Dr Feldner. The gap between DB Systel and the winning team, who won due to clever pre-processing, was extremely small, and in terms of the data recognition rates there was hardly any difference. “We were able to prove that we can keep pace with our competitors in this sector.”
DB Systel’s solution provided for mass processing of the headers with a user-friendly interface, allowing recognised texts to be spotted and processed very quickly. What paid off here is the fact that the Group has been using an input platform for over ten years. Its functions are being used and combined on a modular basis for the different stages of the digitisation process: the platform records documents from different sources, recognises them automatically, evaluates them and then outputs them as required in different formats. Quality controls between the individual stages ensure the highest possible level of reliability. With the proof of concept in the Digital Header Challenge, the DB Systel team was able to experiment with the recognition rates for the headers.
Successful use of a standard solution and development of a prototype
“We were aware of the requirements beyond the challenge,” said Dr Feldner. His team used commercial software – Kofax from Lexmark – which is already being used for the DB input platform. This solution permitted the extraction and evaluation of form-based documents, as in the case of the headers. In addition, the DB Systel team used machine learning. Above all, however, DB Systel had an answer to issues that went beyond the challenge. The solution had a prompting system that was suitable for end users. Dr Wolter is satisfied too: “Our hopes have been fulfilled: with the challenge, we’ve found a functional, innovative approach and a great winning team. It was decided that the winning team from DFKI would join forces with DB Systel to implement the project.”
Whereas DFKI advised primarily on pre-processing, a Group-wide, interdisciplinary team developed a prototype to digitise text blocks. In particular, the combination of the two different methods ensured a successful collaboration. The next step is an internal DB field test, in which end users will test to what extent the prototype is suitable for everyday use.
Digitale Bestandspläne 4.0 (Digital Infrastructure Plans 4.0)
The next concrete steps for the system are already clear: the project will be adopted into the “Digitale Bestandspläne 4.0” (Digital Infrastructure Plans 4.0) project, where it will be further developed for system types used at DB Netz. The focus will then be on automatic quality assurance of headers that have already been digitised. For plans in quite a different format, a flexible solution is required. Other business units have already been knocking at the door in order to have documents automatically digitised. “If this really happens, that would be about ten million pages per year”, says Dr Feldner. What’s perfectly clear is that the man has a plan.