C#: Parsing Horses.


My goal with this post is to parse out all the horses which came in 1st to 3rd place from each race that took place at Arlington Park this season.

The website http://www.arlingtonpark.com/racing-handicapping/equibase/charts provides a calendar which links to the result file for each past race date. Clicking on a particular date takes you to a PDF which contains various statistics.


So let's download all of these programmatically. To achieve this, I wrote a small program that steps back one day at a time through the last 200 days, putting each date into the URI and requesting that day's PDF.

Then, I save them into a folder on my c: drive called \HORSEDATA\. I also wait 1 to 3 seconds before each request, vary the web client's user-agent header, and delete the downloaded file if it is too small (under 4000 bytes, which means no race happened on that date).

// Requires: using System; using System.IO; using System.Net; using System.Threading;
DateTime raceDate = DateTime.Now;
Random random = new Random();
for (int i = 0; i < 200; i++)
{
    // Pause 1 to 3 seconds so we don't hammer the server.
    Thread.Sleep(random.Next(1000, 3000));

    // Step back one day first, so the date printed is the date actually requested.
    raceDate = raceDate.AddDays(-1);
    Console.WriteLine("Downloading : " + raceDate.ToString("d"));

    string path = @"C:\HORSEDATA\" + raceDate.ToString("MM-dd-yy") + ".pdf";
    try
    {
        using (WebClient client = new WebClient())
        {
            // Vary the user-agent string slightly on every request.
            client.Headers.Add("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.370" + i + ";)");
            client.DownloadFile("http://www.equibase.com/premium/eqbPDFChartPlus.cfm?RACE=A&BorP=P&TID=AP&CTRY=USA&DT="
                + raceDate.ToString("d") + "&DAY=D&STYLE=EQB", path);
        }
    }
    catch (Exception)
    {
        Console.WriteLine("No Race on this Date.");
    }

    // A failed download leaves no file, so check Exists before reading Length.
    FileInfo f = new FileInfo(path);
    if (f.Exists)
    {
        Console.WriteLine(f.Length.ToString());
        // A very small file is not a real chart (no race that day), so discard it.
        if (f.Length < 4000)
            f.Delete();
    }
}
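
One fragile spot in the loop above is Today.ToString("d"), which depends on the culture of the machine running it. The sketch below pins the date format explicitly. The class and method names (ChartRequest, BuildUrl, BuildFileName) are made up for illustration, and it assumes the server wants the M/d/yyyy form that "d" produces on an en-US machine, which is a guess rather than something the site documents.

```csharp
using System;
using System.Globalization;

public static class ChartRequest
{
    // Base address copied from the downloader; everything else here is illustrative.
    private const string BaseUrl =
        "http://www.equibase.com/premium/eqbPDFChartPlus.cfm?RACE=A&BorP=P&TID=AP&CTRY=USA&DT=";

    // Assumption: the server expects M/d/yyyy, which is what ToString("d") yields on en-US.
    public static string BuildUrl(DateTime raceDate)
    {
        string dt = raceDate.ToString("M/d/yyyy", CultureInfo.InvariantCulture);
        return BaseUrl + dt + "&DAY=D&STYLE=EQB";
    }

    // File name in the same MM-dd-yy pattern the downloader uses.
    public static string BuildFileName(DateTime raceDate)
    {
        return raceDate.ToString("MM-dd-yy", CultureInfo.InvariantCulture) + ".pdf";
    }
}
```

For an arbitrary date such as new DateTime(2014, 9, 5), BuildUrl ends in DT=9/5/2014&DAY=D&STYLE=EQB and BuildFileName returns 09-05-14.pdf, regardless of the regional settings on the machine.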

Now your \HORSEDATA\ folder should contain the proper PDFs.

Next, read through all the race date PDFs. For each PDF, I want to get the names of the horses which came in 1st to 3rd place, and count the number of times each distinct horse at least showed (finished in the top three).


I use the iTextSharp library to read the PDFs and count up my horses.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

Console.BufferHeight = 6000;
// Horse name -> number of top-three finishes. The key set doubles as the list of distinct horses.
Dictionary<string, int> dictHorseShowCount = new Dictionary<string, int>();
// Read from the same folder the downloader wrote to.
string[] filePaths = Directory.GetFiles(@"C:\HORSEDATA\");
foreach (string file in filePaths)
{
    Console.WriteLine("Reading : " + file);
    PdfReader reader = new PdfReader(file);
    StringWriter output = new StringWriter();
    // Pages are numbered from 1 in iTextSharp.
    for (int i = 1; i <= reader.NumberOfPages; i++)
        output.WriteLine(PdfTextExtractor.GetTextFromPage(reader, i, new SimpleTextExtractionStrategy()));
    reader.Close();

    // Each race's payoff section sits between "Total WPS Pool" and the "Copyright" footer.
    Regex regPattern = new Regex(@"Total WPS Pool(.*?)Copyright", RegexOptions.Singleline);
    foreach (Match match in regPattern.Matches(output.ToString()))
    {
        // The lines between "Win Place Show" and "Wager Type" list the top three finishers.
        Match match2 = Regex.Match(match.Value, @"Win Place Show(.*?)Wager Type", RegexOptions.Singleline);
        if (!match2.Success)
            continue;

        string[] lineparts = match2.Value.Split('\n');
        // Skip the first and last lines, which hold the two marker phrases.
        for (int i = 1; i < lineparts.Length - 1; i++)
        {
            // Strip numbers and punctuation, keeping only letters, spaces and hyphens.
            string horseOnly = Regex.Replace(lineparts[i], "[^a-zA-Z -]", "").Trim();
            if (horseOnly.Length == 0)
                continue; // nothing left on this line, so there is no name to count

            // TryGetValue leaves count at 0 for a horse we have not seen yet.
            int count;
            dictHorseShowCount.TryGetValue(horseOnly, out count);
            dictHorseShowCount[horseOnly] = count + 1;
        }
    }
}

// Enumerate the ordered query directly; a Dictionary does not guarantee enumeration order.
foreach (var entry in dictHorseShowCount.OrderBy(e => e.Value))
    Console.WriteLine(entry.Key + "   " + entry.Value);
Console.WriteLine("");
Console.WriteLine("Distinct Horse Showed : " + dictHorseShowCount.Count);
Console.WriteLine("");
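
To see what the two regular expressions are doing without needing a PDF, here is a rough sketch that runs the same name-cleaning step on a plain string. The ShowParser class and ExtractNames method are names I made up for this illustration, and the sample text in the note below is invented to resemble the shape of the extracted chart text; it is not copied from a real Equibase chart.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ShowParser
{
    // Returns the cleaned horse names found between "Win Place Show" and "Wager Type".
    public static List<string> ExtractNames(string raceText)
    {
        var names = new List<string>();
        Match block = Regex.Match(raceText, @"Win Place Show(.*?)Wager Type", RegexOptions.Singleline);
        if (!block.Success)
            return names;

        string[] lines = block.Value.Split('\n');
        // The first and last lines contain the marker phrases themselves, so skip them.
        for (int i = 1; i < lines.Length - 1; i++)
        {
            // Drop everything except letters, spaces and hyphens, then trim the edges.
            string name = Regex.Replace(lines[i], "[^a-zA-Z -]", "").Trim();
            if (name.Length > 0)
                names.Add(name);
        }
        return names;
    }
}
```

Given the invented input "Pgm Horse Win Place Show\n3 Fast Filly 8.40 4.20 3.00\n5 Slow Poke 5.60 3.80\n1 Mid-Pack 2.60\nWager Type Winning Numbers Payoff", ExtractNames returns Fast Filly, Slow Poke and Mid-Pack. Note that any stray letters on a line (a program number like 1A, for instance) would survive the cleaning and end up glued to the name, so real charts may need a tighter pattern.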

Output
outputHORSES

One thought on “C#: Parsing Horses.”

  1. Josh

    wow… I wish I could code properly.
    It would take me hundreds of lines of code to parse that info out of a pdf.
    I wrote a database that extracted race data to a sql database, and got over one million unique results… but it ran to soooo many pages of code… and then they’d change the page a bit and the app would break… ended up giving up… but still keeping an interest out there.
    Thanks for your great examples… I’ll have to have a look into it if I ever get the time.
    Well written and concise.
    Thank you.
