
All times are UTC - 5 hours




 Page 1 of 1 [ 4 posts ] 
Author Message
 Post subject: Basic File/Text Parsing.
Posted: Sat Dec 21, 2013 1:57 pm 

Joined: Sat Aug 16, 2008 7:58 am
Posts: 447
I am having trouble parsing information here. The scenario is that I have already opened a text file and read all of its contents into a stringstream, which the TextFileReader class returns as a string. My parse function takes a const std::string as its parameter. When I run through the debugger, the full contents of my text file are in the string. Now I am trying to parse this string instead of reading the file line by line. I only have so many keywords or tags to look for. Some keywords must appear at least once, others can show up multiple times, and some are optional. I will provide some of my code to show where I am at with this.

My txt file

: This is a comment.

HEADER[AudioBookReader v1.0]
TITLE[Gulliver's Travel]
AUTHOR[Jonathan Swift]
YEAR[1726]
CHAPTER_COUNT[39]

TABLE_OF_CONTENTS

INTRO[intro] 

SECTION[Part I. A Voyage To Lilliput]
CHAPTER[gt_1_01]
CHAPTER[gt_1_02]
CHAPTER[gt_1_03]
CHAPTER[gt_1_04]
CHAPTER[gt_1_05]
CHAPTER[gt_1_06]
CHAPTER[gt_1_07]
CHAPTER[gt_1_08]

SECTION[Part II. A Voyage To Brobdingnag]
CHAPTER[gt_2_01]
CHAPTER[gt_2_02]
CHAPTER[gt_2_03]
CHAPTER[gt_2_04]
CHAPTER[gt_2_05]
CHAPTER[gt_2_06]
CHAPTER[gt_2_07]
CHAPTER[gt_2_08]

SECTION[Part III. A Voyage To Laputa, Balnibarbi, Luggnagg, Glubbdubdrib, And Japan]
CHAPTER[gt_3_01]
CHAPTER[gt_3_02]
CHAPTER[gt_3_03]
CHAPTER[gt_3_04]
CHAPTER[gt_3_05]
CHAPTER[gt_3_06]
CHAPTER[gt_3_07]
CHAPTER[gt_3_08]
CHAPTER[gt_3_09]
CHAPTER[gt_3_10]
CHAPTER[gt_3_11]

SECTION[Part IV. A Voyage To The Country Of The Houyhnhnms]
CHAPTER[gt_4_01]
CHAPTER[gt_4_02]
CHAPTER[gt_4_03]
CHAPTER[gt_4_04]
CHAPTER[gt_4_05]
CHAPTER[gt_4_06]
CHAPTER[gt_4_07]
CHAPTER[gt_4_08]
CHAPTER[gt_4_09]
CHAPTER[gt_4_10]
CHAPTER[gt_4_11]
CHAPTER[gt_4_12]

OUTRO[NONE]         : This is here as an example if there is none it can be
               : omitted or use the tag NONE inside the parameter braces.

END               : This tag represents the end of the file anything after
               will not get parsed, as you can see I did not use a comment.


Here is my enumeration and my function to create the keywords or tokens.
This enumeration is a public member of the same class.
enum FileTagDesc {
      NONE = 0,
      HEADER,
      TITLE,
      AUTHOR,
      YEAR,
      CHAPTER_COUNT,   
      TABLE_OF_CONTENTS,
      NAME,
      INTRO,
      SECTION,
      CHAPTER,
      OUTRO,
      END        // MUST BE LAST
   }; // FileTagDesc


In my createTags method, the second element of each pair is the token text. For tokens that take a parameter, I have already included the opening "[",
so the next character starts the parameter, which ends when the delimiter "]" is found.

// ----------------------------------------------------------------------------
// createTags()
void AudioTextBook::createTags() {
   using namespace std;
   _mTags.clear();

   _mTags.insert( make_pair( NONE,            string( "NONE" ) ) );
   _mTags.insert( make_pair( HEADER,         string( "HEADER[" ) ) );
   _mTags.insert( make_pair( TITLE,         string( "TITLE[" ) ) );
   _mTags.insert( make_pair( AUTHOR,         string( "AUTHOR[" ) ) );
   _mTags.insert( make_pair( YEAR,            string( "YEAR[" ) ) );
   _mTags.insert( make_pair( CHAPTER_COUNT,   string( "CHAPTER_COUNT[" ) ) );
   _mTags.insert( make_pair( TABLE_OF_CONTENTS, string( "TABLE_OF_CONTENTS" ) ) );
   _mTags.insert( make_pair( NAME,            string( "NAME[" ) ) );
   _mTags.insert( make_pair( INTRO,         string( "INTRO[" ) ) );
   _mTags.insert( make_pair( SECTION,         string( "SECTION[" ) ) );
   _mTags.insert( make_pair( CHAPTER,         string( "CHAPTER[" ) ) );
   _mTags.insert( make_pair( OUTRO,         string( "OUTRO[" ) ) );
   _mTags.insert( make_pair( END,            string( "END" ) ) );
} // createTags
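
Because each tag string already ends with "[", a line can be matched against a tag as a plain prefix and the parameter read up to the closing "]". The following is only a minimal sketch of that idea; extractParam is a hypothetical free function, not part of the AudioTextBook class shown here.

```cpp
#include <string>

// Hypothetical helper (not from the original class): if `line` starts with
// `tagPrefix` (for example "TITLE["), copy the text up to the closing ']'
// into `param` and return true. Returns false on a mismatch or missing ']'.
bool extractParam( const std::string& line, const std::string& tagPrefix, std::string& param ) {
   // compare() returns 0 only when the leading characters equal tagPrefix
   if ( line.compare( 0, tagPrefix.size(), tagPrefix ) != 0 ) {
      return false;
   }
   const std::string::size_type start = tagPrefix.size();
   const std::string::size_type close = line.find( ']', start );
   if ( close == std::string::npos ) {
      return false; // no closing bracket, treat the line as malformed
   }
   param = line.substr( start, close - start );
   return true;
}
```

Keeping the "[" in the tag text also stops "CHAPTER[" from matching a "CHAPTER_COUNT[39]" line, since the eighth character differs.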


This is my parse function, where I am struggling.
// ----------------------------------------------------------------------------
// parseFile()
bool AudioTextBook::parseFile( const std::string strFileContents ) {
   if ( strFileContents.empty() ) {
      throw ExceptionHandler( __FUNCTION__ + std::string( " failed, file contents are empty " ) );
   }

   // Temporary strings to work with since our text file being passed in is a const string
   std::string strContents( strFileContents );
   
   // First we need to compare text in our string to look for keywords and fill in our Info Structure
   std::string comment( ":" );
   std::string delim( "]" );
   std::string substr;
   std::string param;
   unsigned index;
   unsigned length = strContents.length();
   
   // Save The Tokens And Sizes Of Each Token
   std::vector<std::string>  vTokens;
   std::vector<std::size_t>  vSizes;   
   for ( index = 0; index <= END; index++ ) {
      vTokens.push_back( _mTags.find( index )->second );
      vSizes.push_back( _mTags.find(index)->second.length() );
   }




   return true;
} // parseFile


This function only returns true or false on success or failure. It will extract the data out of the parameters and save it into the structures and containers within this class. Some of the basic rules that I have set for this file are as follows.

A colon and anything past it is a comment.

Words in ALL CAPS are the tokens. Everything inside the [ ] is the parameter to be extracted. The only three tokens that do not have data associated with them are TABLE_OF_CONTENTS, END, and NONE. If NONE is used, this token can be omitted from the text file, but if the user chooses to have it there as a reference that is allowed. In that case we can treat the line as if it were a commented line and skip to the next.

The first five tokens must appear. The first token and its parameter have to match exactly as written. TABLE_OF_CONTENTS is only meant to show the reader of the text file where the actual book's contents start. It is not meant to change anything within the program or its data. These tokens are optional: INTRO, SECTION, OUTRO, NAME, and NONE.

The parameters of INTRO, CHAPTER, and OUTRO are filenames to be extracted. This example book has different parts, so I named this tag SECTION. The SECTION parameter will always have a name associated with it, so the NAME tag is not needed there. The CHAPTER parameter is the filename itself, and if a specific book has a title for each chapter then the NAME token is used like this: CHAPTER[filename], then on the next line NAME[this chapter's title here].

The filenames in this file do not have extensions. The reason is that, especially for the chapters, there is a text file and an audio file for each chapter that must have the same name; only the extension is different. This file is a lookup table to load the necessary chapter files when needed. The INTRO and OUTRO may only have a text file, but can optionally have an audio file as well. They can also have a NAME[] associated with them, used the same way as with the chapters.

The END tag marks the end of the file, and parsing terminates there. One more feature that matters: while parsing this file, I need to keep count of how many INTRO, OUTRO, SECTION, and CHAPTER tags there are. The number of CHAPTER tags read must match the parameter pulled from the CHAPTER_COUNT token. This verifies that the file has all the proper chapter filenames.
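
The comment rule can be applied to each line before any tag matching. Here is a rough sketch, assuming that a ':' never appears inside a parameter (a title containing a colon would be cut short); stripComment and trim are hypothetical helper names.

```cpp
#include <string>

// Hypothetical helper: drop everything from the first ':' to the end of the
// line. When no ':' is present, find() returns npos and substr() keeps the
// whole line.
std::string stripComment( const std::string& line ) {
   return line.substr( 0, line.find( ':' ) );
}

// Hypothetical helper: remove leading and trailing spaces, tabs and
// carriage-return / line-feed characters.
std::string trim( const std::string& text ) {
   const char* const whitespace = " \t\r\n";
   const std::string::size_type first = text.find_first_not_of( whitespace );
   if ( first == std::string::npos ) {
      return std::string(); // the line is blank
   }
   const std::string::size_type last = text.find_last_not_of( whitespace );
   return text.substr( first, last - first + 1 );
}
```

For example, trim( stripComment( line ) ) turns the OUTRO[NONE] line with its trailing comment into just "OUTRO[NONE]", and a comment-only line into an empty string that can be skipped.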

I am having trouble breaking this string down. If I weren't using the TextFileReader class to load this file into my class, I might have taken the read-one-line-at-a-time approach, since every token and parameter is on its own line. Any help or suggestions would be great.


 
 Post subject: Re: Basic File/Text Parsing.
Posted: Sun Dec 22, 2013 6:58 pm 
Site Admin

Joined: Sun Feb 11, 2007 8:59 am
Posts: 1094
Location: Ontario Canada
If I were you, I'd modify the class so that you can read one line at a time, rather than the whole thing at once. I actually do that in the Shader Engine.
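
For reference, reading one line at a time is also possible when the whole file is already held in a string, by wrapping the string in a std::istringstream. This is only a sketch, not the Shader Engine code; splitLines is a hypothetical name.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Sketch: split text that is already in memory into lines.
// Blank lines are kept as empty strings.
std::vector<std::string> splitLines( const std::string& contents ) {
   std::vector<std::string> lines;
   std::istringstream stream( contents );
   std::string line;
   while ( std::getline( stream, line ) ) {
      // Drop a trailing '\r' left over from Windows line endings
      if ( !line.empty() && line[line.size() - 1] == '\r' ) {
         line.erase( line.size() - 1 );
      }
      lines.push_back( line );
   }
   return lines;
}
```

The same getline loop could process each line directly instead of storing it, which avoids holding a second copy of the file.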

BTW, I'm still waiting for someone to post a solution to the last Shader Engine VMK so that I can continue posting new VMK tutorials in that series!


 
 Post subject: Re: Basic File/Text Parsing.
Posted: Sun Dec 22, 2013 7:20 pm 

Joined: Sat Aug 16, 2008 7:58 am
Posts: 447
I modified my class since my last post. I changed the structure of my text file a little bit. It now looks like this instead.

: This is a comment.

HEADER = AudioBookReader v1.0
TITLE  = Gulliver's Travel
AUTHOR = Jonathan Swift
YEAR   = 1726
CHAPTER_COUNT = 39

TABLE_OF_CONTENTS

INTRO  = intro 

SECTION = Part I. A Voyage To Lilliput
CHAPTER = gt_1_01
CHAPTER = gt_1_02
CHAPTER = gt_1_03
CHAPTER = gt_1_04
CHAPTER = gt_1_05
CHAPTER = gt_1_06
CHAPTER = gt_1_07
CHAPTER = gt_1_08

SECTION = Part II. A Voyage To Brobdingnag
CHAPTER = gt_2_01
CHAPTER = gt_2_02
CHAPTER = gt_2_03
CHAPTER = gt_2_04
CHAPTER = gt_2_05
CHAPTER = gt_2_06
CHAPTER = gt_2_07
CHAPTER = gt_2_08

SECTION = Part III. A Voyage To Laputa, Balnibarbi, Luggnagg, Glubbdubdrib, And Japan
CHAPTER = gt_3_01
CHAPTER = gt_3_02
CHAPTER = gt_3_03
CHAPTER = gt_3_04
CHAPTER = gt_3_05
CHAPTER = gt_3_06
CHAPTER = gt_3_07
CHAPTER = gt_3_08
CHAPTER = gt_3_09
CHAPTER = gt_3_10
CHAPTER = gt_3_11

SECTION = Part IV. A Voyage To The Country Of The Houyhnhnms
CHAPTER = gt_4_01
CHAPTER = gt_4_02
CHAPTER = gt_4_03
CHAPTER = gt_4_04
CHAPTER = gt_4_05
CHAPTER = gt_4_06
CHAPTER = gt_4_07
CHAPTER = gt_4_08
CHAPTER = gt_4_09
CHAPTER = gt_4_10
CHAPTER = gt_4_11
CHAPTER = gt_4_12

OUTRO = NONE        : This is here as an example if there is none it can be
                    : omitted or use the tag NONE inside the parameter braces.
                           
END                 : This tag represents the end of the file anything after
                    will not get parsed, as you can see I did not use a comment.


In my class I created this new function to tokenize this string.
// ----------------------------------------------------------------------------
// tokenize()
void AudioTextBook::tokenize( const std::string& str, std::vector<std::string>& tokens, const std::string startDelim ) {
   using namespace std;

   string::size_type lastPos = str.find_first_not_of( startDelim, 0 );
   string::size_type pos     = str.find_first_of( startDelim, lastPos );

   while ( string::npos != pos || string::npos != lastPos ) {
      // Found A Token, Add It To The Vector
      tokens.push_back( str.substr( lastPos, pos - lastPos ) );
      // Skip Delimiters. Note the "not_of"
      lastPos = str.find_first_not_of( startDelim, pos );
      // Find next "non-delimiter"
      pos = str.find_first_of( startDelim, lastPos );
   }

} // tokenize


I am at this point in my parse function now.
// ----------------------------------------------------------------------------
// parseFile()
bool AudioTextBook::parseFile( const std::string strFileContents ) {
   using namespace std;

   if ( strFileContents.empty() ) {
      throw ExceptionHandler( __FUNCTION__ + std::string( " failed, file contents are empty " ) );
   }

   vector<string> lineTokens;
   vector<string> keyWords;
   vector<string> params;

   // First Separate This String into tokens to represent each line
   tokenize( strFileContents, lineTokens, std::string("\n") );

   // This time we want to go through a loop for each line in our vector of tokens
   unsigned index;
   for ( index = 0; index < lineTokens.size(); index++ ) {      
      // First Check To See If We Found The END word
      if ( lineTokens.at( index ).substr( 0, 3 ) == string( "END" )  ) {
         // Remove the END line and every line after it. vector::erase shrinks
         // the vector; string::erase() would only empty each string in place.
         lineTokens.erase( lineTokens.begin() + index, lineTokens.end() );
         break;
      }
      // This time tokenize each line into sections by looking for =
      tokenize( lineTokens[index], keyWords, std::string( "=" ) );
   }
   
   // Start Searching Through our Key Word Vectors And Remove Any That Start With A Comment
   for ( index = 0; index < keyWords.size(); /* advanced below */ ) {
      if ( keyWords[index][0] == ':' ) {
         // Discard the whole entry. Do not advance: the next entry slides into this slot
         keyWords.erase( keyWords.begin() + index );
      } else {
         // Search To Find A : On the Line and Discard everything after
         string::size_type pos = keyWords.at( index ).find( ":" );
         if ( pos != string::npos ) {
            // erase( pos ) keeps the characters in [0, pos)
            keyWords.at( index ).erase( pos );
         }
         ++index;
      }
   }
   

   //string strKeep, strDiscard;
   
   return true;
} // parseFile


So far, I call the tokenize() method passing in "\n" as the delimiter. This creates a vector<string> in which each line is its own string, and this works great. I then loop through this vector of lineTokens searching for the keyword "END", because I want to remove it and all lines past it; I am still working on getting that right. Then I call tokenize again, this time passing in "=", which breaks each line down into keywords and stores them in a different vector<string>. When I check this vector through the debugger it looks good: I'll have HEADER in one string, and in the next string I'll have everything after the "=" from that line, and so on. The next thing I am trying to do is remove all comments from the keyWords vector. After that, I want to split keyWords so that it contains only keywords, and save all the other text into another vector<string> called params. Once I have these two temporary vectors the way I want them, I can use them in my class. I am just struggling with basic string searches within the std::string class and with resizing vectors.
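
On the vector resizing side, std::vector::erase removes elements and shrinks the vector, while calling erase() on a string element only empties that string. The following is just a sketch of both removals on a vector of lines; truncateAtEnd, isCommentLine and removeCommentLines are hypothetical names, and the END test matches any line starting with "END", the same as the substr( 0, 3 ) check.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: drop the first line that starts with "END" and every
// line after it.
void truncateAtEnd( std::vector<std::string>& lines ) {
   for ( std::vector<std::string>::size_type i = 0; i < lines.size(); ++i ) {
      if ( lines[i].compare( 0, 3, "END" ) == 0 ) {
         // Range erase removes the elements, so size() really shrinks
         lines.erase( lines.begin() + i, lines.end() );
         return;
      }
   }
}

// True when the first non-blank character of the line is ':'.
bool isCommentLine( const std::string& line ) {
   const std::string::size_type first = line.find_first_not_of( " \t" );
   return first != std::string::npos && line[first] == ':';
}

// Hypothetical helper: remove comment-only lines with the erase-remove idiom.
void removeCommentLines( std::vector<std::string>& lines ) {
   lines.erase( std::remove_if( lines.begin(), lines.end(), isCommentLine ),
                lines.end() );
}
```

Calling truncateAtEnd first means the uncommented text after END never reaches the comment check.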


 
 Post subject: Re: Basic File/Text Parsing.
Posted: Sun Dec 22, 2013 7:57 pm 

Joined: Sat Aug 16, 2008 7:58 am
Posts: 447
I refined it a little more. I removed the option to have comments in the file, and I removed some of the extra processing in the parse function. Now I have all my keywords and parameters successfully in the same vector<string>, so all I have to do is extract these values and store them into my class and its structures.

Here is my new parse function without the worry of having comments.
// ----------------------------------------------------------------------------
// parseFile()
bool AudioTextBook::parseFile( const std::string strFileContents ) {
   using namespace std;

   if ( strFileContents.empty() ) {
      throw ExceptionHandler( __FUNCTION__ + std::string( " failed, file contents are empty " ) );
   }

   vector<string> lineTokens;
   vector<string> keyWords;

   // First Separate This String into tokens to represent each line
   tokenize( strFileContents, lineTokens, std::string("\n") );

   // This time we want to go through a loop for each line in our vector of tokens
   unsigned index;
   for ( index = 0; index < lineTokens.size(); index++ ) {      
      // First Check To See If We Found The END word
      if ( lineTokens.at( index ).substr( 0, 3 ) == string( "END" )  ) {
         // Remove the END line and every line after it. vector::erase shrinks
         // the vector; string::erase() would only empty each string in place.
         lineTokens.erase( lineTokens.begin() + index, lineTokens.end() );
         break;
      }
      // This time tokenize each line into sections by looking for =
      tokenize( lineTokens[index], keyWords, std::string( "=" ) );
   }

   return true;
} // parseFile
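
The remaining extraction step might look something like the sketch below for the "KEY = value" format without comments. BookInfo, trimSpaces and readBook are hypothetical names, since the real class members are not shown in this thread, and only a few of the tags are handled. Splitting each line at its first "=" keeps every keyword paired with its own value, even when lines such as TABLE_OF_CONTENTS carry no "=".

```cpp
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical container; the real AudioTextBook members are not shown.
struct BookInfo {
   std::string title;
   std::string author;
   int declaredChapters;               // value of CHAPTER_COUNT
   std::vector<std::string> chapters;  // chapter filenames without extension
   BookInfo() : declaredChapters( 0 ) {}
};

// Remove leading and trailing spaces, tabs and carriage returns.
static std::string trimSpaces( const std::string& s ) {
   const std::string::size_type first = s.find_first_not_of( " \t\r" );
   if ( first == std::string::npos ) return std::string();
   return s.substr( first, s.find_last_not_of( " \t\r" ) - first + 1 );
}

// Returns true when the number of CHAPTER lines equals CHAPTER_COUNT.
bool readBook( const std::string& contents, BookInfo& info ) {
   std::istringstream stream( contents );
   std::string line;
   while ( std::getline( stream, line ) ) {
      line = trimSpaces( line );
      if ( line == "END" ) break;               // nothing after END is parsed
      const std::string::size_type eq = line.find( '=' );
      if ( eq == std::string::npos ) continue;  // blank line or TABLE_OF_CONTENTS
      const std::string key   = trimSpaces( line.substr( 0, eq ) );
      const std::string value = trimSpaces( line.substr( eq + 1 ) );
      if      ( key == "TITLE" )         info.title = value;
      else if ( key == "AUTHOR" )        info.author = value;
      else if ( key == "CHAPTER_COUNT" ) info.declaredChapters = std::atoi( value.c_str() );
      else if ( key == "CHAPTER" )       info.chapters.push_back( value );
   }
   return info.declaredChapters > 0 &&
          info.chapters.size() == static_cast<std::size_t>( info.declaredChapters );
}
```

INTRO, OUTRO, SECTION and NAME could be handled with further branches and counters in the same loop.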

