A high-speed railway network dataset from train operation records and weather data | Scientific Data – Nature.com

To obtain the high-speed railway network dataset, we first collect the train operation records, mileage information and the geographical locations of the railway stations. The historical weather-related data are collected based on these geographical locations, and the dates of major holidays from October 8, 2019 to January 27, 2020 are obtained. Second, we calculate the arrival and departure delay of each train and count the number of delayed trains per hour in each direction at each station. Third, we compute the mileage between adjacent stations. Fourth, we compile statistics on train operation conditions at China's top ten junctions. Fifth, according to the geographical locations and time stamps, we add the train directions, station types, weather, holidays and other complex factors to the operation data of high-speed trains and the delay data of railway stations. Finally, we check and validate our dataset.

Figure 1 shows the flowchart of the methodology used to obtain the high-speed railway network dataset from train operation records and weather data. The steps involved are described in detail below.

Flow chart of methodology. The figure shows the flowchart of methodology to obtain the high-speed railway network dataset from train operation records and weather data.

The source data for the high-speed railway network dataset consist of the high-speed train operation data, the high-speed train mileage data, the locations of railway stations, the junction stations, the weather-related data and the major holidays.

High-speed train operation records consist of the historical schedule and actual operation information. We use web scraping with Python (ref. 28) to obtain 2,751,713 operation records of 3,399 trains from the China Railway Ticket System (https://www.12306.cn), covering October 8, 2019 to January 27, 2020, 16 weeks in total. The operation records of one train consist of stopping stations, scheduled departure and arrival times, actual departure and arrival times, etc. Fig. 2 shows the China high-speed railway network, including the 727 stations and the actual operation lines of the 3,399 trains.

China high-speed railway network. The figure shows the actual operation network of China Railway High Speed, which includes the 727 railway stations and 3,399 high-speed trains in the dataset. (a) shows location of stations. (b) shows the railway lines.

According to the train operation records, we use the web scraping method to obtain the operating mileage of 3,399 trains from http://www.huochepiao.com. We obtain the data updated to 2020 because the railway routes are constantly adjusted. The attributes contained in the data include train number, station order, station name and the mileage between one station and the departure station. We supplement the missing mileage data by manual search.

We get 727 stations after deleting the duplicates from the operating lines of the 3,399 high-speed trains. The names of these stations are unique. Then, we get their geographic locations, which include the province, city and district. We supplement the missing location information by manual search.

In the railway network, the place where several trunk lines connect is generally called a railway junction; it is composed of several stations, inter-station connecting lines, inbound lines and signals. In the dataset, we consider ten representative junctions in China; their stations are shown in Table 1.

It is reported that the operation of high-speed trains is affected by climate conditions such as strong wind, low temperature and torrential rain. We therefore consider weather, wind power and temperature as external influencing factors to make the dataset more valuable for research. We crawl 16 weeks of data from a website (http://www.tianqihoubao.com) that records historical weather-related data, matching on the districts in which the stations are located. The data contain a total of 81,242 weather-related samples from 727 districts.

We use the Scrapy-Redis multi-task asynchronous framework to crawl the above data and store them in a MongoDB database. To improve the efficiency of I/O operations, we use mongoexport to export the data to a CSV file.

It is well known that passenger flow is also an important factor influencing train operation. When multiple trains are late, dispatchers often need to decide the train departure order based on the capacity of a station and its real number of passengers. However, we cannot accurately obtain the real number of passengers at a station due to the high mobility of passengers. Fortunately, the number of passengers tends to be higher than usual during holidays, especially major holidays such as Spring Festival and National Day. Therefore, we take major holidays as one of the external influencing factors.

From October 8, 2019 to January 27, 2020, the major holidays considered are Halloween on October 31, 2019, Thanksgiving Day on November 28, 2019, National Memorial Day on December 13, 2019, Christmas on December 25, 2019, New Year's Day on January 1, 2020, Laba Festival on January 2, 2020, Chinese New Year's Eve on January 24, 2020, and Spring Festival on January 25, 2020.

In this step, we correct the collected high-speed train operation records. Some information in the records is missing or wrong, which would affect the computation of train delay times and delay counts. Therefore, it is crucial to correct the records before identifying and computing delayed trains.

To prevent the loss of observations that may be valuable, we fill in each missing value with the corresponding value from a nearby date. This is because the running status of a given train shows a certain trend, which generally remains consistent within the same period.
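The filling rule described above can be illustrated with a minimal sketch. It assumes that the records of one train at one station are held as dictionaries with a `date` key, and that "close on the date" means the smallest absolute difference in days; the function name `fill_from_nearest_date` and the field names are hypothetical and are not taken from the dataset.

```python
from datetime import date


def fill_from_nearest_date(records, field):
    """Fill missing values of `field` from the record with the closest date.

    `records` holds the rows of ONE train at ONE station (assumption);
    each row is a dict with a `date` key and the field to be filled.
    """
    known = [r for r in records if r[field] is not None]
    filled = []
    for r in records:
        if r[field] is None and known:
            # Pick the observed row whose date is closest to the gap.
            nearest = min(known, key=lambda k: abs((k["date"] - r["date"]).days))
            r = {**r, field: nearest[field]}
        filled.append(r)
    return filled


rows = [
    {"date": date(2019, 10, 8), "actual_arrival": "08:05"},
    {"date": date(2019, 10, 9), "actual_arrival": None},
    {"date": date(2019, 10, 20), "actual_arrival": "08:30"},
]
filled_rows = fill_from_nearest_date(rows, "actual_arrival")
```

When two observed dates are equally close, this sketch takes the first one in input order; the source does not state how such ties are resolved.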

In the process of data collection, we find that the actual departure time is earlier than the actual arrival time in some of the operation records, which is impossible in real train operation. We regard these records as abnormal data. In most cases, a train runs according to the schedule, and its stop time at each station is also planned. Therefore, we replace the abnormal actual departure time with the sum of the actual arrival time and the scheduled stop time.
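A minimal sketch of this repair rule is given below. It assumes that times are full `datetime` values (so that stops spanning midnight compare correctly) and that the scheduled stop time is available in minutes; the function name `repair_departure` is hypothetical.

```python
from datetime import datetime, timedelta


def repair_departure(actual_arrival, actual_departure, scheduled_stop_minutes):
    """Return a plausible actual departure time for one stop.

    If the recorded departure precedes the recorded arrival, the record is
    treated as abnormal and replaced by arrival + scheduled stop time.
    """
    if actual_departure < actual_arrival:
        return actual_arrival + timedelta(minutes=scheduled_stop_minutes)
    return actual_departure


arrival = datetime(2019, 10, 8, 9, 30)
bad_departure = datetime(2019, 10, 8, 9, 12)   # earlier than arrival: abnormal
good_departure = datetime(2019, 10, 8, 9, 33)  # consistent record

repaired = repair_departure(arrival, bad_departure, scheduled_stop_minutes=2)
unchanged = repair_departure(arrival, good_departure, scheduled_stop_minutes=2)
```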

In this step, we compute the delay time of one train on its operation line and count the number of delayed trains per hour in different directions at one station. The delay of one high-speed train includes departure delay and arrival delay. So we mainly construct these two attributes in our dataset.

The originally collected high-speed train operation records contain the train running dates, the names of the stations the trains pass through, station order, scheduled departure and arrival times, actual departure and arrival times, stop time, etc.

For one station S, the schedule defines that one train should arrive at time \(t_{A}^{S}\) and leave at time \(t_{D}^{S}\) after stopping at station S for a period of time. In most cases, the schedule is accurate, which means that most trains will depart and arrive on time. However, due to uncontrollable reasons such as extreme weather and large passenger flow, trains may not depart or arrive on time. The actual arrival and departure times are defined as \(\widehat{t}_{A}^{S}\) and \(\widehat{t}_{D}^{S}\). Then \(\widehat{t}_{A}^{S}-t_{A}^{S}\) is defined as the arrival deviation (arriving not on time), and \(\widehat{t}_{D}^{S}-t_{D}^{S}\) is defined as the departure deviation (departing not on time). When \(\widehat{t}_{A}^{S}-t_{A}^{S} > 0\), the train arrives late at S; when \(\widehat{t}_{D}^{S}-t_{D}^{S} > 0\), the train departs late from S. When \(\widehat{t}_{A}^{S}-t_{A}^{S} < 0\), the train arrives at S ahead of time; when \(\widehat{t}_{D}^{S}-t_{D}^{S} < 0\), the train departs from S ahead of time.

According to the above definition, we add the attributes departure delay and arrival delay to the high-speed train operation data. We compute the deviation from the schedule for each arrival and departure. When these two values are greater than 0, they represent the length of the train's delay. When these two values are smaller than 0, they represent how early the train departs or arrives. It is worth noting that a train has no arrival delay at its departure station, so the value of arrival delay there is always 0, and no departure delay at its terminal station, so the value of departure delay there is always 0. We store the final processing results in a CSV file.
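The delay attributes defined above can be sketched as follows. This is an illustration only: the field names (`sched_arr`, `act_arr`, `sched_dep`, `act_dep`), the choice of minutes as the unit, and the function names are assumptions rather than the dataset's actual schema.

```python
from datetime import datetime


def signed_delay_minutes(scheduled, actual):
    """Actual minus scheduled time, in minutes (positive = late, negative = early)."""
    return (actual - scheduled).total_seconds() / 60.0


def add_delays(stops):
    """Add arrival/departure delay to the ordered stops of one train run.

    The first stop (departure station) gets arrival delay 0 and the last
    stop (terminal station) gets departure delay 0, as stated in the text.
    """
    last = len(stops) - 1
    out = []
    for i, s in enumerate(stops):
        arr = 0.0 if i == 0 else signed_delay_minutes(s["sched_arr"], s["act_arr"])
        dep = 0.0 if i == last else signed_delay_minutes(s["sched_dep"], s["act_dep"])
        out.append({**s, "arrival_delay": arr, "departure_delay": dep})
    return out


t = lambda h, m: datetime(2019, 10, 8, h, m)
run = [
    {"station": "A", "sched_arr": None, "act_arr": None,
     "sched_dep": t(8, 0), "act_dep": t(8, 3)},
    {"station": "B", "sched_arr": t(9, 0), "act_arr": t(9, 5),
     "sched_dep": t(9, 2), "act_dep": t(9, 6)},
    {"station": "C", "sched_arr": t(10, 0), "act_arr": t(9, 58),
     "sched_dep": None, "act_dep": None},
]
with_delays = add_delays(run)
```

In this example the train leaves A three minutes late, is five minutes late into B, and reaches C two minutes early, which appears as a negative arrival delay.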

The departure time of a train depends on the scheduling strategy of the station when a delay occurs. Analyzing the number of historical train delays at a station and mining the underlying patterns can help railway dispatching. It is also an effective way to evaluate the dispatching capacity of a station. In short, statistics on the number of arrival-delayed and departure-delayed trains at a station are very valuable.

The operation line of one train is directional: it is either up or down. According to China Railway, up means that the train is heading for Beijing or running from a branch line to a trunk line (the train number is even), and down means that the train is heading away from Beijing or running from a trunk line to a branch line (the train number is odd). From [00:00, 01:00) on October 8, 2019 to [23:00, 24:00) on January 27, 2020, we take one hour as the time step to compute the number of departure delays and arrival delays at the 727 stations. Suppose that the train number of a train passing through station S is T, that the number of trains with \(T=2\times n\) (even) is U, and that the number of trains with \(T=2\times n-1\) (odd) is W. Then the number of arrival-delayed trains in the upward direction is \(\sum\limits_{i=1}^{U}\left(\widehat{t}_{A}^{S}-t_{A}^{S} > 0\right)\) and in the downward direction is \(\sum\limits_{i=1}^{W}\left(\widehat{t}_{A}^{S}-t_{A}^{S} > 0\right)\); the number of departure-delayed trains in the upward direction is \(\sum\limits_{i=1}^{U}\left(\widehat{t}_{D}^{S}-t_{D}^{S} > 0\right)\) and in the downward direction is \(\sum\limits_{i=1}^{W}\left(\widehat{t}_{D}^{S}-t_{D}^{S} > 0\right)\). Here each bracketed condition contributes 1 when it holds for the i-th train and 0 otherwise. We store the delay number data of the railway stations in a CSV file.
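The hourly counting described above can be sketched as follows. The sketch makes two assumptions that the text does not confirm: the direction is derived from the parity of the digits in the train number (for example `G102`), and each delayed event is assigned to the hour of its actual arrival or departure time. The field and function names are hypothetical.

```python
import re
from collections import Counter
from datetime import datetime


def direction(train_number):
    """Even numeric part means 'up', odd means 'down' (parity rule from the text)."""
    digits = re.sub(r"\D", "", train_number)  # drop the letter prefix, e.g. 'G'
    return "up" if int(digits) % 2 == 0 else "down"


def hourly_delay_counts(rows):
    """Count delayed arrivals/departures per (station, hour, direction, kind)."""
    counts = Counter()
    for r in rows:
        d = direction(r["train_number"])
        for kind, delay_key, time_key in (
            ("arrival", "arrival_delay", "act_arr"),
            ("departure", "departure_delay", "act_dep"),
        ):
            if r[delay_key] > 0:  # only strictly late trains are counted
                hour = r[time_key].replace(minute=0, second=0, microsecond=0)
                counts[(r["station"], hour, d, kind)] += 1
    return counts


rows = [
    {"station": "S", "train_number": "G102", "arrival_delay": 5, "departure_delay": 0,
     "act_arr": datetime(2019, 10, 8, 9, 15), "act_dep": datetime(2019, 10, 8, 9, 17)},
    {"station": "S", "train_number": "G104", "arrival_delay": 2, "departure_delay": 3,
     "act_arr": datetime(2019, 10, 8, 9, 40), "act_dep": datetime(2019, 10, 8, 10, 1)},
    {"station": "S", "train_number": "G101", "arrival_delay": -1, "departure_delay": 4,
     "act_arr": datetime(2019, 10, 8, 9, 50), "act_dep": datetime(2019, 10, 8, 9, 58)},
]
counts = hourly_delay_counts(rows)
```

Each key of `counts` corresponds to one hourly interval such as [09:00, 10:00) at one station, split by direction and by delay type.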

In the high-speed railway network dataset, adjacent stations refer to neighboring stations on the train diagram, which need not be geographically close to each other (they may be separated by multiple small stations). Since the lines in different directions between two adjacent stations may differ, resulting in different distances between them, we add a direction attribute to the mileage data of adjacent stations (the high-speed railway network is a directed network). That is, we calculate the mileage between adjacent stations in the upward and downward directions. From the high-speed train mileage data, we can get the distance \(M_{S_{i}}\) between a station \(S_{i}\) and the departure station, and the distance between adjacent stations is then \(M_{S_{i}}-M_{S_{i-1}}\).
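The difference of cumulative mileages can be sketched as below. The input is assumed to be the ordered stop list of one train with the distance from its departure station in kilometres; the station names, the distances and the function name `adjacent_mileage` are illustrative only.

```python
def adjacent_mileage(cumulative, direction):
    """Turn cumulative mileage along one train's route into directed edges.

    `cumulative` is a list of (station, km_from_departure_station) in station
    order; each edge carries the direction because the network is directed.
    """
    edges = []
    for (prev_name, prev_km), (name, km) in zip(cumulative, cumulative[1:]):
        edges.append((prev_name, name, direction, km - prev_km))
    return edges


route = [("A", 0), ("B", 120), ("C", 305)]
edges = adjacent_mileage(route, "down")
```

Here the A to B edge is 120 km and the B to C edge is 305 - 120 = 185 km, both tagged with the downward direction.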

In this step, we compute the total number of upward and downward trains, and of upward and downward arrival-delayed and departure-delayed trains, passing through each junction station from October 8, 2019 to January 27, 2020. These figures can be easily computed by matching Table 1 against the junction station names in the high-speed train operation data.
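A minimal sketch of this aggregation is shown below. The station-to-junction mapping stands in for Table 1, whose contents are not reproduced here, so the names `junction_X`, `station_1` and `station_2` are placeholders. The sketch also assumes that each row already carries a `direction` value and the two delay attributes, and it counts one train per stop record.

```python
from collections import Counter


def junction_statistics(rows, junction_of):
    """Count trains and delayed trains per (junction, direction, metric)."""
    stats = Counter()
    for r in rows:
        junction = junction_of.get(r["station"])
        if junction is None:
            continue  # not a junction station
        key = (junction, r["direction"])
        stats[key + ("trains",)] += 1
        if r["arrival_delay"] > 0:
            stats[key + ("arrival_delayed",)] += 1
        if r["departure_delay"] > 0:
            stats[key + ("departure_delayed",)] += 1
    return stats


# Placeholder mapping; the real one would be built from Table 1.
junction_of = {"station_1": "junction_X", "station_2": "junction_X"}
rows = [
    {"station": "station_1", "direction": "up", "arrival_delay": 4, "departure_delay": 0},
    {"station": "station_2", "direction": "up", "arrival_delay": 0, "departure_delay": 2},
    {"station": "station_9", "direction": "up", "arrival_delay": 7, "departure_delay": 7},
]
stats = junction_statistics(rows, junction_of)
```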

In this step, we need to add the train direction, station type, weather related data and major holidays to the processed train operation data and delay number data of railway stations.

The direction of a train is either upward or downward. By judging whether the train number is odd or even, we get the operation direction and combine it with the train operation data. Station types include junction stations and non-junction stations. By matching the station names in Table 1 against the delay number data of railway stations, we can easily judge whether a station is a junction station and combine this with the station delay data.

Weather, wind power and temperature information of 727 stations in 16 weeks are contained in the weather related data. By matching the dates and station names, we obtain the train operation data and delay data of stations with weather related factors.
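The matching on dates and station names can be sketched as a keyed lookup. The field names, the sample values and the choice to leave unmatched rows with `None` are assumptions made for illustration; the source does not describe how unmatched rows are handled.

```python
from datetime import date


def attach_weather(rows, weather_by_key):
    """Join weather attributes onto rows using (station, date) as the key."""
    out = []
    for r in rows:
        w = weather_by_key.get((r["station"], r["date"]), {})
        out.append({
            **r,
            "weather": w.get("weather"),
            "wind": w.get("wind"),
            "temperature": w.get("temperature"),
        })
    return out


# Illustrative values only, not taken from the dataset.
weather_by_key = {
    ("S", date(2019, 10, 8)): {"weather": "rain", "wind": "level 3", "temperature": 14},
}
rows = [
    {"station": "S", "date": date(2019, 10, 8), "arrival_delay": 5},
    {"station": "S", "date": date(2019, 10, 9), "arrival_delay": 0},
]
joined = attach_weather(rows, weather_by_key)
```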

The major holidays are on October 31, 2019, November 28, 2019, December 13, 2019, December 25, 2019, January 1, 2020, January 2, 2020, January 24, 2020 and January 25, 2020. We add the attribute holiday to both the train operation data and the delay data of stations. The value of holiday is True or False. By matching dates, we judge whether each date in the train operation data and the delay data of stations is one of the above 8 dates.
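The holiday attribute can be sketched as a set-membership test on the eight dates listed above. The constant and function names are hypothetical, and rows are assumed to carry a `date` field.

```python
from datetime import date

# The eight major holiday dates listed in the text.
MAJOR_HOLIDAYS = frozenset({
    date(2019, 10, 31), date(2019, 11, 28), date(2019, 12, 13), date(2019, 12, 25),
    date(2020, 1, 1), date(2020, 1, 2), date(2020, 1, 24), date(2020, 1, 25),
})


def add_holiday_flag(rows):
    """Add a boolean `holiday` attribute by matching each row's date."""
    return [{**r, "holiday": r["date"] in MAJOR_HOLIDAYS} for r in rows]


flagged = add_holiday_flag([
    {"station": "S", "date": date(2020, 1, 25)},
    {"station": "S", "date": date(2019, 10, 8)},
])
```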

Through the above data processing methods, we obtain the final high-speed railway network dataset.

We perform validation steps for the high-speed railway network dataset from train operation records and weather data. Please see the Technical Validation section for more details.
